# Science modules in Fink: an example

A science module contains the routines and classes needed to process the data and add value to it. Typically, a module receives alerts as input and outputs the same alerts with additional information. Each input alert contains the position, flux, telescope properties, and more. You can find what information is available in an alert [here](https://zwickytransientfacility.github.io/ztf-avro-alert/), or check the current [Fink added values](https://fink-broker.readthedocs.io/en/latest/science/added_values/).

In this example, we explore a simple science module that takes the magnitudes contained in each alert and computes the change in magnitude between the last two measurements.

In [175]:
# utility from fink-science
from fink_science.utilities import concat_col

## Loading the data

Fink receives data as Avro. However, the internal processing makes use of Parquet files. We provide here alert data as Parquet: it contains original alert data from ZTF and some added values from Fink:

In [176]:
# Load the data into a Spark DataFrame
df = spark.read.format('parquet').load('sample.parquet')

You can check what the data contains:

In [177]:
df.printSchema()

root
 |-- candid: long (nullable = true)
 |-- schemavsn: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- objectId: string (nullable = true)
 |-- candidate: struct (nullable = true)
 |    |-- jd: double (nullable = true)
 |    |-- fid: integer (nullable = true)
 |    |-- pid: long (nullable = true)
 |    |-- diffmaglim: float (nullable = true)
 |    |-- pdiffimfilename: string (nullable = true)
 |    |-- programpi: string (nullable = true)
 |    |-- programid: integer (nullable = true)
 |    |-- candid: long (nullable = true)
 |    |-- isdiffpos: string (nullable = true)
 |    |-- tblid: long (nullable = true)
 |    |-- nid: integer (nullable = true)
 |    |-- rcid: integer (nullable = true)
 |    |-- field: integer (nullable = true)
 |    |-- xpos: float (nullable = true)
 |    |-- ypos: float (nullable = true)
 |    |-- ra: double (nullable = true)
 |    |-- dec: double (nullable = true)
 |    |-- magpsf: float (nullable = true)
 |    |-- sigmapsf: float (nullab

## Calling the science module

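First, you need to concatenate historical + current measurements for the quantities of interest. The change in magnitude only needs `magpsf`, but the feature extraction further down also uses `jd`, `sigmapsf` and `fid`, so we concatenate all four. Each concatenated column is added to the DataFrame under the original name prefixed with `c`, e.g. `cmagpsf` (for _concatenated_ `magpsf`):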

In [178]:
# Required alert columns
what = ['magpsf', 'jd', 'sigmapsf', 'fid']

# Use for creating temp name
prefix = 'c'
what_prefix = [prefix + i for i in what]

# Concatenate historical + current measurements
for colname in what:
    df = concat_col(df, colname, prefix=prefix)

what_prefix

['cmagpsf', 'cjd', 'csigmapsf', 'cfid']
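
Conceptually, the concatenation appends the value from the current alert to the array of historical values for the same quantity. The sketch below mimics this for a single alert with plain NumPy. The dictionary layout (a `prv_candidates` history next to the `candidate` block) and the helper name `concat_history` are assumptions made for illustration; this is not the `concat_col` implementation, which operates on Spark columns.

```python
import numpy as np

# Hypothetical stand-in for one alert: historical measurements plus the
# current one. Missing historical values are stored as None.
alert = {
    "prv_candidates": {"magpsf": [18.2, None, 18.5]},
    "candidate": {"magpsf": 18.9},
}


def concat_history(alert, colname):
    """Return historical values followed by the current value, as floats."""
    # With dtype=float, None entries become NaN
    history = np.array(alert["prv_candidates"][colname], dtype=float)
    current = alert["candidate"][colname]
    return np.append(history, current)


cmagpsf_example = concat_history(alert, "magpsf")
print(cmagpsf_example)
```

Missing measurements survive the concatenation as NaN, which is why the arrays need cleaning before any statistics are computed on them.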

In [179]:
import light_curve as lc
import numpy as np
import pandas as pd

In [180]:
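def fix_nans(arr):
    """Replace NaNs in a 1D array, in place, with the previous valid value.

    Leading NaNs are replaced by the first valid value of the array.
    """
    if arr.ndim > 1:
        raise ValueError("Only 1D arrays are supported.")

    # Seed the first element with the first valid value, if needed
    if np.isnan(arr[0]):
        for elem in arr:
            if not np.isnan(elem):
                arr[0] = elem
                break
        else:
            raise ValueError("nans only!")

    # Forward-fill: propagate the last valid value over NaNs
    last_value = arr[0]
    for idx, elem in enumerate(arr):
        if not np.isnan(elem):
            last_value = elem
        else:
            arr[idx] = last_value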
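
The same forward-fill can be written without an explicit Python loop. The following is a sketch of one possible vectorised variant, not part of the notebook; the name `forward_fill` is illustrative, and unlike `fix_nans` it returns a new array instead of modifying its input.

```python
import numpy as np


def forward_fill(arr):
    """Return a copy of a 1D array with NaNs replaced by the previous valid value.

    Leading NaNs are replaced by the first valid value.
    """
    arr = np.asarray(arr, dtype=float)
    if arr.ndim != 1:
        raise ValueError("Only 1D arrays are supported.")

    valid = ~np.isnan(arr)
    if not valid.any():
        raise ValueError("nans only!")

    # For each position, index of the latest valid element seen so far
    idx = np.where(valid, np.arange(arr.size), 0)
    idx = np.maximum.accumulate(idx)
    filled = arr[idx]

    # Positions before the first valid element take that element's value
    first = int(np.argmax(valid))
    filled[:first] = arr[first]
    return filled


filled_example = forward_fill([np.nan, 1.0, np.nan, 3.0, np.nan])
print(filled_example)
```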
        

In [191]:
def extract_row(frame, index=0, dtype='float64'):
    # `frame` is a list of rows: take the first field of row `index`
    return np.array(frame[index][0], dtype=dtype)


def sample_path(name):
    # Path of a sample file under samples/, with the .npy suffix added if missing
    path = f"samples/{name}"
    if not path.endswith('.npy'):
        path += '.npy'
    return path


def npsave(name, *args, **kwargs):
    return np.save(sample_path(name), *args, **kwargs)


def npload(name, *args, **kwargs):
    # np.load accepts a path directly, so no file handle is left open
    return np.load(sample_path(name), *args, **kwargs)


In [192]:
import os


def create_sample(n=0):
    cmagpsf = extract_row(df.select('cmagpsf').take(n+1), n)
    cjd = extract_row(df.select('cjd').take(n+1), n)
    csigmapsf = extract_row(df.select('csigmapsf').take(n+1), n)

    # Create samples/<n>, including samples/ itself if it does not exist yet
    os.makedirs(f'samples/{n}', exist_ok=True)

    npsave(f"{n}/cmagpsf", cmagpsf)
    npsave(f"{n}/cjd", cjd)
    npsave(f"{n}/csigmapsf", csigmapsf)

    return cmagpsf, cjd, csigmapsf


def load_sample(n=0):
    cmagpsf = npload(f"{n}/cmagpsf")
    cjd = npload(f"{n}/cjd")
    csigmapsf = npload(f"{n}/csigmapsf")

    return cmagpsf, cjd, csigmapsf
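
The helpers above store each concatenated array as a `.npy` file and read it back later. Here is a minimal sketch of that round trip, written against a temporary directory so that it does not touch `samples/`; the helper name `npy_path` and the sample values are illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np


def npy_path(root, name):
    """Build root/name, appending the .npy suffix if it is missing."""
    path = Path(root) / name
    if path.suffix != ".npy":
        path = path.with_name(path.name + ".npy")
    return path


with tempfile.TemporaryDirectory() as root:
    target = npy_path(root, "0/cmagpsf")
    # Create the per-sample directory before saving
    target.parent.mkdir(parents=True, exist_ok=True)

    original = np.array([18.2, np.nan, 18.9])
    np.save(target, original)
    restored = np.load(target)

# NaNs are preserved by the round trip, so compare with equal_nan=True
round_trip_ok = np.allclose(original, restored, equal_nan=True)
print(round_trip_ok)
```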

In [196]:
create_sample(0)
create_sample(1)
create_sample(2)
create_sample(3)
create_sample(4)
create_sample(5)

print(os.listdir("samples"))

['test.npy', '0', '1', '4', '3', '2', '5']


In [206]:
def extract(magpsf, jd, sigmapsf):
    # Fill missing magnitudes and uncertainties in place before extracting features
    fix_nans(magpsf)
    fix_nans(sigmapsf)

    extractor = lc.Extractor(
        lc.Amplitude(),
        lc.BeyondNStd(nstd=1),
        lc.LinearFit(),
        lc.Mean(),
        lc.Median(),
        lc.StandardDeviation(),
        lc.Cusum(),
        lc.ExcessVariance(),
        lc.MeanVariance(),
        lc.Kurtosis(),
        lc.MaximumSlope(),
        lc.Skew(),
        lc.WeightedMean(),
        lc.Eta(),
        lc.AndersonDarlingNormal(),
        lc.ReducedChi2(),
        lc.InterPercentileRange(quantile=0.1),
        #lc.MagnitudePercentageRatio(),
        lc.MedianBufferRangePercentage(quantile=0.1),
        lc.PercentDifferenceMagnitudePercentile(quantile=0.1),
        lc.MedianAbsoluteDeviation(),
        lc.PercentAmplitude(),
        lc.EtaE(),
        lc.LinearTrend(),
        lc.StetsonK(),
        lc.WeightedMean(),  # duplicate of the entry above, hence listed twice in the output
        #lc.Bins(),
        #lc.OtsuSplit(),
    )

    # The extractor takes time, magnitude and magnitude error, in this order
    result = extractor(jd, magpsf, sigmapsf)
    print('\n'.join("{} = {:.2f}".format(name, value) for name, value in zip(extractor.names, result)))  # DEBUG
    return result

In [207]:
cmagpsf, cjd, csigmapsf = load_sample(1)

describe = lambda v: print(type(v), len(v), np.mean(v), np.std(v))

assert cmagpsf.shape == cjd.shape == csigmapsf.shape, 'Mismatched shapes'

describe(cmagpsf)
describe(cjd)
describe(csigmapsf)

# separator for result
print("\n--------------\n")

res = extract(cmagpsf, cjd, csigmapsf)

<class 'numpy.ndarray'> 34 nan nan
<class 'numpy.ndarray'> 34 2459526.124098582 7.321780152225849
<class 'numpy.ndarray'> 34 nan nan

--------------

amplitude = 1.52
beyond_1_std = 0.47
linear_fit_slope = -0.07
linear_fit_slope_sigma = 0.00
linear_fit_reduced_chi2 = 98.81
mean = 18.47
median = 18.22
standard_deviation = 0.97
cusum = 0.32
excess_variance = 0.00
mean_variance = 0.05
kurtosis = -1.26
maximum_slope = 123.08
skew = -0.17
weighted_mean = 17.55
eta = 1.32
anderson_darling_normal = 1.25
chi2 = 112.38
inter_percentile_range_10 = 2.52
median_buffer_range_percentage_10 = 0.00
percent_difference_magnitude_percentile_10 = 0.14
median_absolute_deviation = 0.85
percent_amplitude = 1.56
eta_e = 591.79
linear_trend = -0.07
linear_trend_sigma = 0.02
linear_trend_noise = 0.86
stetson_K = 0.84
weighted_mean = 17.55
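
To make a few of these numbers less opaque, the sketch below recomputes some of the simpler features by hand with NumPy. The formulas are the commonly used definitions (half peak-to-peak amplitude, sample standard deviation, inverse-variance weighted mean, fraction of points further than `nstd` standard deviations from the mean) and are assumed, not verified, to match what `light_curve` implements. The function name `basic_features` and the input values are illustrative.

```python
import numpy as np


def basic_features(m, sigma, nstd=1.0):
    """Compute a few simple light-curve features from magnitudes and errors."""
    m = np.asarray(m, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    mean = m.mean()
    std = m.std(ddof=1)  # sample standard deviation (N - 1 in the denominator)
    weights = 1.0 / sigma**2  # inverse-variance weights

    return {
        "amplitude": 0.5 * (m.max() - m.min()),
        "mean": mean,
        "standard_deviation": std,
        "weighted_mean": np.sum(weights * m) / np.sum(weights),
        "beyond_n_std": np.mean(np.abs(m - mean) > nstd * std),
    }


features = basic_features([18.0, 18.5, 19.0, 19.5], [0.1, 0.1, 0.2, 0.2])
for name, value in features.items():
    print("{} = {:.2f}".format(name, value))
```

The weighted mean is pulled towards the points with the smallest uncertainties, which is why it can differ noticeably from the plain mean, as in the output above.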


In [74]:
# user-defined function from the current folder
import importlib
import processor

# Reload before importing the function, so that the latest version of processor.py is used
importlib.reload(processor)
from processor import extract_features_ztf

df_change = df.withColumn('ztf_TEST', extract_features_ztf(*what_prefix))
df_change.select(['objectId', 'ztf_TEST']).show()

[Stage 37:>                                                         (0 + 1) / 1]21/12/15 20:08:28 ERROR Executor: Exception in task 0.0 in stage 37.0 (TID 37)

Py4JJavaError: An error occurred while calling o1882.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 37, localhost, executor driver): org.apache.spark.api.python.PythonException: TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/serializers.py", line 290, in dump_stream
    for series in iterator:
  File "<string>", line 1, in <lambda>
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/worker.py", line 101, in <lambda>
    return lambda *a: (verify_result_length(*a), arrow_return_type)
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/worker.py", line 92, in verify_result_length
    result = f(*a)
  File "/Users/igor/Work/snad/spark-installation/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "/Users/igor/Work/snad/fink-science/fink_science/ztf/processor.py", line 22, in extract_features_ztf
    jd_arr = jd.to_numpy().astype(float)
ValueError: setting an array element with a sequence.

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
	at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:102)
	at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:100)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:127)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:89)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:823)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1925)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1913)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1912)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1912)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:948)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:948)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2146)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2095)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2084)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:759)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:365)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


Let's apply the science module, i.e. add a new column to the DataFrame whose values are the change in magnitude between the last two measurements. All the user logic is contained in the routine `deltamaglatest`, defined in `processor.py`. This routine is a user-defined function that encapsulates the necessary operations, and it can call functions from user-defined modules (here `mymodule.py`) or third-party libraries (e.g. `numpy`, `pandas`). Note that the arguments passed to `deltamaglatest` are column names of the DataFrame, and they are materialised as `pd.Series` inside the routine.
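
The following sketch shows what the body of such a routine could look like, using pandas only. It is an assumption about the logic, not the content of `processor.py`: the name `delta_mag_latest` is illustrative, and in Spark the function would additionally be wrapped as a pandas user-defined function. Each element of the incoming `pd.Series` is itself an array of measurements for one alert, so the function works alert by alert. Casting the whole Series to a float array at once fails, which is what the `ValueError: setting an array element with a sequence` in the traceback above reports.

```python
import numpy as np
import pandas as pd


def delta_mag_latest(cmagpsf: pd.Series) -> pd.Series:
    """Difference between the last two magnitudes of each alert.

    Returns NaN when fewer than two measurements exist, or when one of
    the last two measurements is missing.
    """
    def one_alert(values):
        # None entries become NaN with dtype=float
        arr = np.asarray(values, dtype=float)
        if arr.size < 2:
            return np.nan
        return arr[-1] - arr[-2]

    return cmagpsf.apply(one_alert)


# One array of concatenated magnitudes per alert (illustrative values)
cmagpsf_series = pd.Series([
    [18.0, 18.5],
    [17.0, 17.4, None],
    [19.0],
    [20.0, 19.0, 18.75],
])
deltamag_example = delta_mag_latest(cmagpsf_series)
print(deltamag_example)
```

Alerts whose last two measurements are not both valid end up without a value, matching the `null` entries visible in the output table.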

In [5]:
# user-defined function from processor.py
from processor import deltamaglatest

df_change = df.withColumn('deltamag', deltamaglatest('cmagpsf'))

# print the result for the first 20 alerts
df_change.select(['objectId', 'cdsxmatch', 'deltamag']).show()

+------------+-------------------+--------------------+
|    objectId|          cdsxmatch|            deltamag|
+------------+-------------------+--------------------+
|ZTF18abjrdau|             PulsV*|  0.1650867462158203|
|ZTF18abmmrzp|               Star|                null|
|ZTF19abjfoad|Candidate_LensSyste|                null|
|ZTF18acmwkqr|               RGB*|                null|
|ZTF21acqeepb|            Unknown|                null|
|ZTF17aaanpdf|       PulsV*delSct|  1.3444271087646484|
|ZTF18abadigg|            Cepheid|  0.2772483825683594|
|ZTF19aawfxge|                AGN|   -0.25921630859375|
|ZTF18aaxypzn|                MIR|                null|
|ZTF18abtrvkm|                 SN|                null|
|ZTF18acmwkqr|               RGB*|  0.5792255401611328|
|ZTF18abjcxoj|                SG*|                null|
|ZTF18aaxyyjv|         PulsV*bCep| -0.9435768127441406|
|ZTF18abcvdid|             Pulsar|  -0.055511474609375|
|ZTF17aaabqqd|                 V*|              

We can also quickly check some statistics on this new column:

In [6]:
df_change.select('deltamag').describe().show()

+-------+-------------------+
|summary|           deltamag|
+-------+-------------------+
|  count|                176|
|   mean|0.09352213686162775|
| stddev| 0.9564824046920042|
|    min| -2.828317642211914|
|    max| 3.4397459030151367|
+-------+-------------------+
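
As a rough illustration of how these summary statistics relate to the column, here is the same computation on a small NumPy array, with NaN standing in for Spark's `null`. The values are made up. Spark's `describe()` counts only non-null entries and reports the sample standard deviation, which corresponds to `ddof=1` below.

```python
import numpy as np

# Illustrative deltamag values; NaN plays the role of null
deltamag = np.array([0.5, np.nan, -0.25, np.nan, 1.25])

# Keep only the alerts for which a change in magnitude could be computed
valid = deltamag[~np.isnan(deltamag)]

summary = {
    "count": int(valid.size),
    "mean": valid.mean(),
    "stddev": valid.std(ddof=1),  # sample standard deviation
    "min": valid.min(),
    "max": valid.max(),
}
for name, value in summary.items():
    print(name, value)
```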



Et voilà! Of course, this science module is extremely simple, but the logic remains the same for more complex cases!