# Science modules in Fink: an example

A science module contains the routines and classes needed to process the data and add value to it. Typically, a module takes alerts as input and outputs the same alerts with additional information. Input alerts carry information such as position, flux, and telescope properties. You can find what information is available in an alert [here](https://zwickytransientfacility.github.io/ztf-avro-alert/), or check the current [Fink added values](https://fink-broker.readthedocs.io/en/latest/science/added_values/).

In this example, we explore a simple science module that takes the magnitudes contained in each alert and computes the change in magnitude between the last two measurements.
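
Before turning to Spark, the computation itself can be sketched in plain Python. The helper name `delta_mag_last_two` is hypothetical and is not the code shipped in `processor.py`. Returning `None` when a measurement is missing is also an assumption made for this illustration.

```python
import math
from typing import Optional, Sequence


def delta_mag_last_two(mags: Sequence[Optional[float]]) -> Optional[float]:
    """Return mags[-1] - mags[-2], or None if that difference is undefined.

    Hypothetical helper for illustration only. The list is assumed to be
    ordered in time, with the most recent measurement last.
    """
    if len(mags) < 2:
        return None
    latest, previous = mags[-1], mags[-2]
    for value in (latest, previous):
        # Assumption: a missing or NaN measurement gives no result
        if value is None or math.isnan(value):
            return None
    return latest - previous


print(delta_mag_last_two([19.0, 18.5, 18.25]))  # -0.25
print(delta_mag_last_two([18.0, None]))         # None
```

Since a smaller magnitude means a brighter source, a negative value indicates that the object brightened between the last two measurements.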

In [1]:
# utility from fink-utils
from fink_utils.spark.utils import concat_col

# user-defined function from the current folder
from processor import deltamaglatest

## Loading the data

Fink receives alerts in Avro format, but its internal processing uses Parquet files. The alert data provided here is therefore in Parquet. It contains the original alert data from ZTF and some added values from Fink:

In [2]:
# Load the data into a Spark DataFrame
df = spark.read.format('parquet').load('sample.parquet')



You can check what's in the data:

In [3]:
df.printSchema()

root
 |-- candid: long (nullable = true)
 |-- schemavsn: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- candidate: struct (nullable = true)
 |    |-- jd: double (nullable = true)
 |    |-- fid: integer (nullable = true)
 |    |-- pid: long (nullable = true)
 |    |-- diffmaglim: float (nullable = true)
 |    |-- pdiffimfilename: string (nullable = true)
 |    |-- programpi: string (nullable = true)
 |    |-- programid: integer (nullable = true)
 |    |-- candid: long (nullable = true)
 |    |-- isdiffpos: string (nullable = true)
 |    |-- tblid: long (nullable = true)
 |    |-- nid: integer (nullable = true)
 |    |-- rcid: integer (nullable = true)
 |    |-- field: integer (nullable = true)
 |    |-- xpos: float (nullable = true)
 |    |-- ypos: float (nullable = true)
 |    |-- ra: double (nullable = true)
 |    |-- dec: double (nullable = true)
 |    |-- magpsf: float (nullable = true)
 |    |-- sigmapsf: float (nullab

## Calling the science module

First, you need to concatenate the historical and current measurements for the quantities of interest. Here, we only need `magpsf`, so we add a new column to the DataFrame called `cmagpsf` (for _concatenated_ `magpsf`):

In [4]:
# Required alert columns
what = ['magpsf']

# Prefix used to name the concatenated columns (e.g. `cmagpsf`)
prefix = 'c'
what_prefix = [prefix + i for i in what]

# Concatenate historical + current measurements
for colname in what:
    df = concat_col(df, colname, prefix=prefix)
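
To make the concatenation step concrete, here is a minimal sketch in plain Python of what "historical + current" means for a single alert. It assumes the alert is a dictionary with a `candidate` entry (the current measurement) and a `prv_candidates` list (the history, oldest first), following the ZTF alert layout. The function `concat_history` is hypothetical and only mimics the idea; it is not the `concat_col` implementation.

```python
from typing import Any, Dict, List


def concat_history(alert: Dict[str, Any], colname: str) -> List[Any]:
    """Return the historical values of `colname` followed by the current one.

    Hypothetical illustration: assumes `prv_candidates` is ordered oldest
    first and may be missing or None when the alert has no history.
    """
    history = alert.get("prv_candidates") or []
    past = [entry.get(colname) for entry in history]
    return past + [alert["candidate"][colname]]


# Toy alert with made-up values; None stands for a missing measurement
toy_alert = {
    "candidate": {"magpsf": 18.25},
    "prv_candidates": [{"magpsf": 19.0}, {"magpsf": None}, {"magpsf": 18.5}],
}

print(concat_history(toy_alert, "magpsf"))  # [19.0, None, 18.5, 18.25]
```

With this layout, the last element of the concatenated list is always the measurement that triggered the alert.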

Let's now apply the science module, that is, create a new DataFrame column whose values are the change in magnitude between the last two measurements. All the user logic is contained in the routine `deltamaglatest` defined in `processor.py`. This routine is a user-defined function that encapsulates the necessary operations, and it can call functions from user-defined modules (here `mymodule.py`) or third-party libraries (e.g. `numpy`, `pandas`). Note that the arguments passed to `deltamaglatest` are column names of the DataFrame, and the corresponding columns are materialised as `pd.Series` inside the routine.

In [5]:
df_change = df.withColumn('deltamag', deltamaglatest('cmagpsf'))

# print the result for the first 20 alerts
df_change.select(['objectId', 'cdsxmatch', 'deltamag']).show()


+------------+---------------+------------+
|    objectId|      cdsxmatch|    deltamag|
+------------+---------------+------------+
|ZTF18aceiioy|          RRLyr|        null|
|ZTF19aacasnk|        Unknown|        null|
|ZTF18abvwsjv|          RSCVn|        null|
|ZTF18abvbosx|            EB*|-0.046627045|
|ZTF18aborogc|             V*|        null|
|ZTF19abxxedn|         Galaxy|        null|
|ZTF21acqdcyb|        Unknown| -0.02725029|
|ZTF18abmqkeb|  Candidate_EB*|        null|
|ZTF18acdwxpy|           LPV*|  0.79428864|
|ZTF18aazlscs|           Star|        null|
|ZTF18aaxdlik|           Mira|        null|
|ZTF18aaxyuft|  Candidate_LP*|    3.439746|
|ZTF18aawacbm|Candidate_RRLyr|        null|
|ZTF18aawvauw|  Candidate_Mi*|    2.488881|
|ZTF18aaxauqh|            SB*|        null|
|ZTF18abjrdau|         PulsV*|        null|
|ZTF19aawfxge|            AGN|        null|
|ZTF18aanofea|     PulsV*WVir|        null|
|ZTF18abbtyyr|             C*|        null|
|ZTF18aclyzzb|  Candidate_YSO|  
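
To illustrate how a column is handled once it is materialised as a `pd.Series`, the sketch below operates on a Series in which each element is one alert's list of concatenated magnitudes. The name `delta_mag_latest_series` is hypothetical, and the body is only an assumption about what such a routine could look like, not the content of `processor.py`. In a Spark job, a function of this shape would typically be registered with `pandas_udf` from `pyspark.sql.functions`; that step is left out so the example runs without Spark.

```python
import numpy as np
import pandas as pd


def delta_mag_latest_series(cmagpsf: pd.Series) -> pd.Series:
    """Series in, Series out: one magnitude difference per alert.

    Hypothetical sketch. Each element of `cmagpsf` is assumed to be a
    list of magnitudes with the most recent measurement last.
    """
    def last_two_difference(values) -> float:
        # Converting to a float array turns None entries into NaN
        arr = np.asarray(values, dtype=float)
        if arr.size < 2:
            return np.nan
        # NaN propagates if either of the last two values is missing
        return arr[-1] - arr[-2]

    return cmagpsf.apply(last_two_difference).astype("float64")


# Toy input with made-up values: three alerts
toy = pd.Series([[19.0, 18.5, 18.25], [18.0, None], [17.5]])
print(delta_mag_latest_series(toy))
```

In this sketch, alerts without two usable measurements yield NaN. This may be similar to what produces the `null` entries in the table above, but that is an assumption.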

                                                                                

We can also quickly check some statistics on this new column:

In [6]:
df_change.select('deltamag').describe().show()

+-------+-------------------+
|summary|           deltamag|
+-------+-------------------+
|  count|                176|
|   mean|0.09352213686162775|
| stddev| 0.9564824046920042|
|    min|         -2.8283176|
|    max|           3.439746|
+-------+-------------------+
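
Note that the `count` row only includes non-null values of `deltamag`. As a rough cross-check of what these summary rows mean, the sketch below recomputes the same quantities in plain Python for the first few rows of the table printed earlier, with `None` standing for `null`. It uses the sample standard deviation, which should match Spark's `stddev`, although this is an illustration rather than a reimplementation of `describe()`.

```python
import statistics

# deltamag values from the first rows of the table above; None stands for null
deltamag = [
    None, None, None, -0.046627045, None, None,
    -0.02725029, None, 0.79428864, None, None, 3.439746,
]

# Only non-null values enter the summary
valid = [value for value in deltamag if value is not None]

summary = {
    "count": len(valid),
    "mean": statistics.fmean(valid),
    "stddev": statistics.stdev(valid),  # sample standard deviation
    "min": min(valid),
    "max": max(valid),
}

for name, value in summary.items():
    print(f"{name:>7}: {value}")
```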




Et voilà! Of course, this science module is extremely simple, but the logic remains the same for more complex cases.