# Interpret the output of the `statistical_checks` module

This notebook is not meant as a tutorial on how to run the `statistical_checks` module, as this was already covered in the tutorial about the `pygwb_stats` script [here](stat_checks.html). In this tutorial, we go over the plots generated by the module, and provide a brief description of each of the plots to help the user interpret the results. 

For more information on the module, we refer the user to the module API page [here](api/pygwb.statistical_checks.html). For more general information about the module, we refer the user to the [pygwb paper](https://arxiv.org/pdf/2303.15696.pdf).

We start by importing some packages needed for the execution of the tutorial:

In [4]:
import numpy as np
from pygwb.statistical_checks import StatisticalChecks
from pygwb.statistical_checks import run_statistical_checks_from_file
%matplotlib inline

*Note: make sure to run this notebook within an environment that has all the above packages installed.*

*Disclaimer: The output of* `statistical_checks` *depends on other modules (e.g.,* `delta_sigma_cut`*). This notebook is therefore meant to illustrate what* `statistical_checks` *does and how to use it, rather than to provide definitive numerical values, which may change as other parts of the code are modified. Furthermore, the datasets shown below might not have been run with the most up-to-date pipeline.*

## Initializing the `StatisticalChecks` object

As a concrete example, we consider a stretch of O3 data on which `pygwb_pipe` was run. As mentioned above, we rely on the function `run_statistical_checks_from_file`, which reads the required quantities from files. Note that this function initializes a `StatisticalChecks` object behind the scenes. To save time in this tutorial, only part of the files of the O3 run are read in. The function requires the following input:

- The path to the file containing the output of the delta-sigma cut, as saved by `pygwb_pipe`

- The path to the file where the combined spectra are saved

- The path to the parameter file used for the analysis

- A directory where the plots of statistical checks will be saved

These are defined below:

In [None]:
param_file = "./input/parameters.ini"
combine_file = "./input/point_estimate_sigma_spectra_alpha_0.0_fref_25_1267911343-1269307030.npz"
dsc_file = "./input/delta_sigma_cut_1267911343-1269307030.npz"
plot_dir = './'

Given the above paths, the `run_statistical_checks_from_file` function (more information [here](api/pygwb.statistical_checks.run_statistical_checks_from_file.html#pygwb.statistical_checks.run_statistical_checks_from_file)) reads all necessary quantities from those files and initializes an instance of the `StatisticalChecks` class:

In [None]:
stat_checks_pygwb_O3 = run_statistical_checks_from_file(combine_file, dsc_file, plot_dir, param_file)

The `statistical_checks` module is a collection of plotting methods, which can all be called individually or together through the `generate_all_plots` method. More information can be found [here](api/pygwb.statistical_checks.StatisticalChecks.html#pygwb.statistical_checks.StatisticalChecks.generate_all_plots). We choose to go over all the plots one by one below, and provide a brief description of each of them.

In [None]:
stat_checks_pygwb_O3.plot_running_point_estimate()
stat_checks_pygwb_O3.plot_running_sigma()

The above two plots show the running point estimate and its standard deviation as the analysis run evolves, i.e., as more data is accumulated. This is particularly useful to determine whether the point estimate converges towards a non-zero value, which would hint at the presence of a signal, or remains consistent with zero. In addition, the standard deviation should decrease as $1/\sqrt{\rm time}$, which can easily be verified in the plot. More information about the above quantities can be found in the [pygwb paper](https://arxiv.org/pdf/2303.15696.pdf).
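The $1/\sqrt{\rm time}$ behavior can be illustrated with a minimal sketch, assuming that per-segment estimates are combined with inverse-variance weights. The arrays `omega_seg` and `sigma_seg` are hypothetical stand-ins for per-segment results and are not read from the `StatisticalChecks` object:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-segment results: noise-only point estimates,
# all segments having the same standard deviation.
n_segments = 400
sigma_seg = np.full(n_segments, 2.0e-6)
omega_seg = rng.normal(0.0, sigma_seg)

# Inverse-variance weighted combination of the first k segments, for every k.
weights = 1.0 / sigma_seg**2
running_sigma = 1.0 / np.sqrt(np.cumsum(weights))
running_omega = np.cumsum(omega_seg * weights) / np.cumsum(weights)

# For equal per-segment sigma, the running sigma is sigma / sqrt(k).
expected_sigma = sigma_seg[0] / np.sqrt(np.arange(1, n_segments + 1))
print(np.allclose(running_sigma, expected_sigma))
```

Since the number of segments grows linearly with the analyzed time, a running standard deviation that departs from this trend points to a change in detector sensitivity during the run.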

In [None]:
stat_checks_pygwb_O3.plot_IFFT_point_estimate_integrand()

The plot above shows the inverse Fourier transform of the point estimate integrand, i.e., the point estimate as a function of the time shift between the two data streams. In the case of a detection, the plot should display a peak around zero lag, corresponding to a signal that is present only when the two data streams are not time-shifted with respect to each other and that disappears for non-zero time shifts. Combining this plot with the previous two provides additional confidence in the detection of a signal.
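As a rough illustration of why a signal shows up at zero lag, the sketch below builds a toy frequency-domain integrand (a real, positive "signal" plus complex Gaussian noise) and takes its inverse Fourier transform with `numpy.fft.irfft`. The frequency resolution and amplitudes are assumptions chosen for the example, not values used by `pygwb`:

```python
import numpy as np

rng = np.random.default_rng(1)

n_freqs = 513    # number of one-sided frequency bins (hypothetical)
delta_f = 0.25   # frequency resolution in Hz (hypothetical)

# Toy integrand: a real, positive component that is coherent across
# frequencies, plus complex noise with random phases.
noise = rng.normal(0.0, 1.0, n_freqs) + 1j * rng.normal(0.0, 1.0, n_freqs)
signal = 2.0 * np.ones(n_freqs)
integrand = signal + noise

# Inverse real FFT, shifted so that zero lag sits in the middle of the array.
lag_series = np.fft.fftshift(np.fft.irfft(integrand))
n_lags = lag_series.size
lags = (np.arange(n_lags) - n_lags // 2) / (n_lags * delta_f)  # seconds

peak_lag = lags[np.argmax(lag_series)]
print(f"peak found at a lag of {peak_lag} s")
```

The coherent component adds up only at zero lag, whereas the noise, having random phases, spreads evenly over all lags.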

In [None]:
stat_checks_pygwb_O3.plot_SNR_spectrum()
stat_checks_pygwb_O3.plot_cumulative_SNR_spectrum()
stat_checks_pygwb_O3.plot_real_SNR_spectrum()
stat_checks_pygwb_O3.plot_imag_SNR_spectrum()
stat_checks_pygwb_O3.plot_sigma_spectrum()
stat_checks_pygwb_O3.plot_cumulative_sensitivity()

The above plots show the signal-to-noise ratio (SNR) as a function of frequency, together with the standard deviation. Both the real and imaginary parts of the SNR spectrum are shown: the real part contains information about the signal, whereas the imaginary part contains information about the noise. The standard deviation plot tells us how noisy each frequency bin was and therefore, together with the SNR plot, how much a given frequency bin contributed to the analysis.
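The relation between these quantities can be sketched as follows. The definitions used here (per-bin SNR as the point estimate divided by sigma, and cumulative sensitivity as the normalized cumulative sum of inverse variances) are simplifying assumptions for illustration; the module may normalize these quantities differently. The spectra `point_estimate_f` and `sigma_f` are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic spectra: sigma grows with frequency, point estimate is pure noise.
freqs = np.arange(20.0, 100.0, 0.25)
sigma_f = 1.0e-6 * (freqs / 25.0) ** 2
point_estimate_f = rng.normal(0.0, sigma_f) + 1j * rng.normal(0.0, sigma_f)

# Per-bin SNR, split into real and imaginary parts.
snr_real = point_estimate_f.real / sigma_f
snr_imag = point_estimate_f.imag / sigma_f

# Fraction of the total inverse variance accumulated up to each frequency:
# bins with a small sigma contribute the most.
weights = 1.0 / sigma_f**2
cumulative_sensitivity = np.cumsum(weights) / np.sum(weights)

half_idx = np.searchsorted(cumulative_sensitivity, 0.5)
print(f"half of the sensitivity is reached at {freqs[half_idx]} Hz")
```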

In [None]:
stat_checks_pygwb_O3.plot_omega_sigma_in_time()

The plot above shows the point estimate and its standard deviation per analysis segment as a function of time. Deviations from the mean of the whole analysis run are shown as well. The quantities are shown before and after the delta-sigma cut, which makes the effect of the cut visible. Outliers in the standard deviation plot should be removed by the delta-sigma cut; if they are not, they should be flagged for follow-up. Any other irregularities in these plots are easily picked up and can then be flagged for further investigation.

In [None]:
stat_checks_pygwb_O3.plot_hist_sigma_dsc()
stat_checks_pygwb_O3.plot_scatter_sigma_dsc()
stat_checks_pygwb_O3.plot_scatter_omega_sigma_dsc()
stat_checks_pygwb_O3.plot_hist_omega_pre_post_dsc()

The plots above illustrate the effect of the delta-sigma cut on the quantities of interest (see [here](api/pygwb.delta_sigma_cut.html) for information about the delta-sigma cut in `pygwb`). A sharp cutoff should be visible in the distribution of delta-sigma values when comparing values before and after the cut, located at the threshold specified in the parameter file. Note that any outliers should in principle be removed by the delta-sigma cut, as it removes abnormally loud segments. Any remaining outliers should be flagged for follow-up.
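The idea behind the cut can be sketched in a simplified form, assuming that delta-sigma is the relative difference between a sigma estimated from the segment itself (`sigma_naive`) and one estimated from its neighbouring segments (`sigma_sliding`). These names, the synthetic data and the threshold are illustrative assumptions; the implementation in `pygwb.delta_sigma_cut` may differ in detail, for instance in how the two estimates are normalized:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic per-segment sigmas: the two estimates agree to within a few
# percent, except for three segments contaminated by a loud transient.
n_segments = 200
sigma_sliding = np.full(n_segments, 1.0e-6)
sigma_naive = sigma_sliding * (1.0 + rng.normal(0.0, 0.02, n_segments))
glitch_idx = [17, 90, 151]
sigma_naive[glitch_idx] *= 3.0

threshold = 0.2  # hypothetical value; the real one is set in the parameter file

# Relative mismatch between the two estimates, and the resulting selection.
delta_sigma = np.abs(sigma_naive - sigma_sliding) / sigma_sliding
keep = delta_sigma < threshold

print("segments removed by the cut:", np.flatnonzero(~keep).tolist())
```

After such a selection, a histogram of `delta_sigma[keep]` ends abruptly at the threshold, which is the sharp cutoff described above.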

In [None]:
stat_checks_pygwb_O3.plot_KS_test()

The Kolmogorov-Smirnov (KS) test can be used to verify the consistency of the data with some assumed distribution. For stochastic searches, we assume that the data are Gaussian, and this test provides a check of that assumption. The plot shows the cumulative distribution function of the assumed distribution together with the empirical one of the data. The maximum deviation between the two is displayed as the test statistic, together with the p-value. More information about the KS test can be found [here](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
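The test statistic can be computed by hand in a few lines. The sketch below is a minimal NumPy version for a standard normal reference distribution, applied to synthetic samples; `ks_statistic_normal` is a helper written for this example and is not part of `pygwb`. In practice, `scipy.stats.kstest` returns both the statistic and the p-value:

```python
import math

import numpy as np


def ks_statistic_normal(samples):
    """Maximum distance between the empirical CDF and the standard normal CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])
    # The empirical CDF jumps at each sample, so compare on both sides of the jump.
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)
    d_minus = np.max(cdf - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(4)

# Synthetic residuals: one set is standard normal, the other is three times
# too broad (as if sigma had been underestimated).
gaussian = rng.normal(0.0, 1.0, 2000)
too_broad = rng.normal(0.0, 3.0, 2000)

d_gauss = ks_statistic_normal(gaussian)
d_broad = ks_statistic_normal(too_broad)
print(f"KS statistic: Gaussian {d_gauss:.3f}, too broad {d_broad:.3f}")
```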

In [None]:
stat_checks_pygwb_O3.plot_hist_sigma_squared()

The plot above illustrates the variation in the variance. The histogram should be centered around 1, as this would correspond to values of the variance centered around the mean of the variance. Large fluctuations or outlier bins should be investigated.
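A minimal sketch of the quantity being histogrammed, assuming it is the per-segment variance divided by the mean variance of the run. The array `sigma_seg`, the injected outlier and the flagging tolerance are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic per-segment sigmas scattering by 5% around a common value,
# with one segment made artificially noisy.
sigma_seg = 1.0e-6 * (1.0 + rng.normal(0.0, 0.05, 500))
sigma_seg[42] *= 2.0

# Variance relative to the mean variance: centered around 1 by construction.
sigma_sq_ratio = sigma_seg**2 / np.mean(sigma_seg**2)
counts, bin_edges = np.histogram(sigma_sq_ratio, bins=30)

# Segments far from 1 deserve a closer look (the tolerance is arbitrary here).
outliers = np.flatnonzero(np.abs(sigma_sq_ratio - 1.0) > 0.5)
print("segments to investigate:", outliers.tolist())
```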

In [None]:
stat_checks_pygwb_O3.plot_omega_time_fit()
stat_checks_pygwb_O3.plot_sigma_time_fit()

The two plots above show the values of the point estimate and the standard deviation as a function of time, together with a linear fit. Any trend in these quantities is easily visualized in these plots, and the numerical values of the fit parameters are displayed as well.
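Such a fit can be reproduced with `numpy.polyfit`. The sketch below applies it to a synthetic sigma time series with a slow drift; the time span, drift and scatter are assumptions made for the example, and the module may use a different fitting routine:

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic sigma per segment (in units of 1e-6) drifting upwards over 16 days.
times_days = np.linspace(0.0, 16.0, 300)
true_slope, true_intercept = 0.01, 1.0
sigma_seg = true_intercept + true_slope * times_days + rng.normal(0.0, 0.02, 300)

# Least-squares straight-line fit; polyfit returns the highest order first.
slope, intercept = np.polyfit(times_days, sigma_seg, deg=1)
print(f"slope = {slope:.4f} per day, intercept = {intercept:.4f}")
```

A slope consistent with zero indicates stable sensitivity over the run, while a significant slope suggests a slow change in the detector noise.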

In [None]:
stat_checks_pygwb_O3.plot_gates_in_time()

If gating was performed during the analysis run, the duration of the gates can be represented visually as a function of the analysis segment. Gates lasting longer than a few seconds should be flagged for follow-up. More information on the gating procedure can be found in the [pygwb paper](https://arxiv.org/pdf/2303.15696.pdf) or in the implementation of gating in the `pygwb` module [here](api/pygwb.preprocessing.self_gate_data.html#pygwb.preprocessing.self_gate_data).
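Flagging long gates amounts to a simple selection on their durations. The sketch below assumes the gates are available as an array of (start, end) GPS times; the values and the 5-second limit are made up for illustration and are not taken from the O3 run:

```python
import numpy as np

# Hypothetical gates as (start, end) GPS times in seconds.
gates = np.array(
    [
        [1267911400.0, 1267911401.0],
        [1267950000.0, 1267950012.5],
        [1268000000.0, 1268000002.0],
    ]
)

durations = gates[:, 1] - gates[:, 0]

# "A few seconds" is taken to be 5 s here; adapt to the analysis at hand.
max_duration = 5.0
flagged = gates[durations > max_duration]

for start, end in flagged:
    print(f"gate starting at GPS {start:.0f} lasts {end - start:.1f} s")
```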