# Image cleaning

**Author(s):**
 - Dr. Michele Peresano (CEA-Saclay/IRFU/DAp/LEPCHE), 2020

**Description:**

This notebook contains DL1-image-cleaning plots and benchmark proposals for the _protopipe_ pipeline.  
This was mainly triggered by the step-by-step comparison against _CTA-MARS_, but it can be extended to other pipelines as well.  
**NOTE** Let's try to follow [this](https://www.overleaf.com/16933164ghbhvjtchknf) document by adding those benchmarks or proposing new ones.  
**WARNING** Contrary to the calibration notebook, I am still working on this one, so it's a bit messy and incomplete! 

**Requirements:**

To run this notebook you will need a DL1 file, which can be generated using _protopipe.scripts.write_dl1.py_.  
Reference simtel-file, plots, values and settings can be found [here (please, always refer to the latest version)](https://forge.in2p3.fr/projects/benchmarks-reference-analysis/wiki/Comparisons_between_pipelines) until we have a more automatic and fancy approach (aka [cta-benchmarks](https://github.com/cta-observatory/cta-benchmarks)+[ctaplot](https://github.com/cta-observatory/ctaplot)).  

The data format required to run the notebook is the one currently used by _protopipe_. Later on it will be the same as in _ctapipe_.  
**WARNING:** Mono-telescope images (2 triggers - 1 image or 1 trigger - 1 image) are not currently taken into account by the publicly available development version (the new DL1 script will have them); until then, expect somewhat lower statistics.

**Development and testing:**  

For the moment this notebook is optimized to work only on files produced from LSTCam + NectarCam telescope configurations.  
As with any other part of _protopipe_ and being part of the official repository, this notebook can be further developed by any interested contributor.  
The execution of this notebook is not currently automatic; it must be done locally by the user, preferably _before_ opening a pull request.

**TODO:**  
* settle on the best I/O approach
* add missing plots in section [Total image charge ("Intensity") resolution for selected images](https://forge.in2p3.fr/projects/step-by-step-reference-mars-analysis/wiki#Total-image-charge-Intensity-resolution-for-selected-images)
* finish Direction LUTs and clean up
* even better: make _direction reconstruction_ a separate notebook, because in the new format it will be part of DL2

## Imports

In [None]:
import tables  # PyTables, needed by load_reset_dl1
# import h5py
from pathlib import Path
import numpy as np
import pandas
from scipy.stats import binned_statistic, binned_statistic_2d, cumfreq, percentileofscore
from astropy import units as u
from astropy.table import Table
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
from mpl_toolkits.mplot3d import Axes3D

from ctapipe.instrument import OpticsDescription
from ctapipe.image.hillas import camera_to_shower_coordinates

## Functions

### Load the base data file or reset it if overwritten

This part has multiple I/O approaches because I am still testing which is best.  
It's possible that with the new data format this will be much easier.

In [None]:
def load_reset_dl1(indir = "./", fileName = "dl1_tail_gamma_z20_az180_LaPalma_baseline_run100_withMono.h5", config="test"):
    """(Re)load the file containing DL1(a) data and extract the data per telescope type."""
    # load DL1 images
    data = tables.open_file(f"{indir}/{fileName}")
    data_LST = data.get_node("/feature_events_LSTCam")
    data_MST = data.get_node("/feature_events_NectarCam")
    suffix = config # all generated plots will have this as a suffix in their name
    return data_LST, data_MST, suffix

In [None]:
def load_reset_dl1_astropy(indir = "./", fileName = "dl1_tail_gamma_z20_az180_LaPalma_baseline_run100_withMono.h5", config="test"):
    """(Re)load the file containing DL1(a) data and extract the data per telescope type."""
    # load DL1 images
    data_LST = Table.read(f"{indir}/{fileName}", path="/feature_events_LSTCam", format='hdf5')
    data_MST = Table.read(f"{indir}/{fileName}", path="/feature_events_NectarCam", format='hdf5')
    suffix = config # all generated plots will have this as a suffix in their name
    return data_LST, data_MST, suffix

In [None]:
def load_reset_dl1_pandas(indir = "./", fileName = "dl1_tail_gamma_z20_az180_LaPalma_baseline_run100_withMono.h5", config="test"):
    """(Re)load the file containing DL1(a) data and extract the data per telescope type."""
    # load DL1 images
    data_LST = pandas.read_hdf(f"{indir}/{fileName}", "/feature_events_LSTCam")
    data_MST = pandas.read_hdf(f"{indir}/{fileName}", "/feature_events_NectarCam")
    suffix = config # all generated plots will have this as a suffix in their name
    return data_LST, data_MST, suffix
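
The three loaders above build the file path as `f"{indir}/{fileName}"`, which turns an empty `indir` into a path starting at the filesystem root. Below is a minimal sketch of an alternative based on `pathlib` (already imported in this notebook). `dl1_path` is a hypothetical helper, not part of _protopipe_, and the file name is made up for illustration.

```python
from pathlib import Path, PurePosixPath


def dl1_path(indir, file_name):
    """Hypothetical helper: join a directory and a file name with pathlib."""
    return Path(indir) / file_name


# String formatting with an empty directory yields a path at the root
naive = f"{''}/{'dl1_example.h5'}"
print(naive, PurePosixPath(naive).is_absolute())

# pathlib keeps the result relative to the current working directory
joined = dl1_path("", "dl1_example.h5")
print(joined, joined.is_absolute())
```

The returned `Path` can be passed as `str(dl1_path(indir, fileName))` to the reading functions if they are used with a string file name, as in the loaders above.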

### Convert distances to degrees (approximate result)

In [None]:
def distance_deg(distance, focal_length):
    """Convert an astropy distance (numpy array, in meters) to an astropy angle in degrees."""
    return np.degrees(np.arctan(distance/focal_length))
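
As a quick numerical check, here is a plain-float sketch of the same formula without astropy units. The focal length and the distance are assumed values chosen only for illustration, and `distance_deg_float` is a hypothetical name. The small-angle value `distance / focal_length` is printed next to it for comparison.

```python
import math


def distance_deg_float(distance_m, focal_length_m):
    """Same formula as distance_deg, for plain floats given in meters."""
    return math.degrees(math.atan(distance_m / focal_length_m))


focal_length_m = 28.0  # assumed focal length, for illustration only
distance_m = 0.5       # hypothetical distance on the camera plane

exact = distance_deg_float(distance_m, focal_length_m)
small_angle = math.degrees(distance_m / focal_length_m)
print(f"arctan: {exact:.4f} deg, small-angle: {small_angle:.4f} deg")
```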

### DL1 quantities to plot

In [None]:
def dl1_quantities(data):
    """A dictionary of the quantities available with this format of DL1 in protopipe.
    
    WARNING: for the moment protopipe uses one cleaning algorithm (biggest cluster),
    even though it allows for two;
    this means that all the quantities with the suffix "_reco" are the same as those without suffix.
    """
    
    if not isinstance(data, pandas.DataFrame):
        
        dictionary = {

            "Intensity [#phe]"   : data.col("sum_signal_cam"), # aka SIZE
            "Width [m]"          : data.col("width"),
            "Length [m]"         : data.col("length"),
            "Skewness"           : data.col("skewness"),
            "Kurtosis"           : data.col("kurtosis"),
            "H_max [m]"          : data.col("h_max"),
            "n_pixel"            : data.col("n_pixel"),
            "Ellipticity"        : data.col("ellipticity"),
            "Leakage 1"          : data.col("leak1_reco"),  # see cta-observatory/protopipe#41
            "psi"                : (data.col("psi_reco") * u.deg).to(u.rad),
            "cog_x"              : data.col("cog_x"),
            "cog_y"              : data.col("cog_y"),
            "cog_r"              : data.col("local_distance_reco"),

        }
        
    else:
        
        dictionary = {

            "Intensity [#phe]"   : data["sum_signal_cam"], # aka SIZE
            "Width [m]"          : data["width"],
            "Length [m]"         : data["length"],
            "Skewness"           : data["skewness"],
            "Kurtosis"           : data["kurtosis"],
            "H_max [m]"          : data["h_max"],
            "n_pixel"            : data["n_pixel"],
            "Ellipticity"        : data["ellipticity"],
            "Leakage 1"          : data["leak1_reco"],  # see cta-observatory/protopipe#41
            "psi"                : data["psi_reco"],
            "cog_x"              : data["cog_x"],
            "cog_y"              : data["cog_y"],
            "cog_r"              : data["local_distance_reco"],
        }
    
    return dictionary

### Add statistical information to a plot

In [None]:
def add_stats(x, ax):
    """Add a textbox containing statistical information."""
    mu = x.mean()
    median = np.median(x)
    sigma = x.std()
    textstr = '\n'.join((
        r'$\mu=%.2f$' % (mu, ),
        r'$\mathrm{median}=%.2f$' % (median, ),
        r'$\sigma=%.2f$' % (sigma, )))

    # these are matplotlib.patch.Patch properties
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)

    # place a text box in the upper right, in axes coordinates
    ax.text(0.70, 0.85, 
            textstr, 
            transform=ax.transAxes, 
            fontsize=10,
            horizontalalignment='left',
            verticalalignment='center', 
            bbox=props)
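
Note that `x.std()` depends on the type of `x`: a NumPy array uses the population standard deviation (`ddof=0`) by default, whereas a pandas `Series` uses the sample one (`ddof=1`). The sketch below rebuilds the text of the box with the standard library only, on hypothetical values, to show the difference. `stats_text` is an illustrative helper and not part of the notebook.

```python
import statistics


def stats_text(values, sample=True):
    """Build the three-line statistics label for a list of numbers."""
    mu = statistics.fmean(values)
    median = statistics.median(values)
    # sample=True mimics the pandas default (ddof=1),
    # sample=False mimics the NumPy default (ddof=0)
    sigma = statistics.stdev(values) if sample else statistics.pstdev(values)
    return "\n".join((
        r"$\mu=%.2f$" % mu,
        r"$\mathrm{median}=%.2f$" % median,
        r"$\sigma=%.2f$" % sigma,
    ))


values = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
print(stats_text(values, sample=True))
print(stats_text(values, sample=False))
```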

## Plots

First we check if a _plots_ folder exists already.  
If not, we create it.

In [None]:
Path("./plots_image_cleaning").mkdir(parents=True, exist_ok=True)

In [None]:
# Fill in the path and the name of the DL1 file generated on your system
data_LST, data_MST, config = load_reset_dl1_pandas(indir = "",
                                                   fileName = "",
                                                   config="test")
cameras = ["LSTCam", "NectarCam"]

In [None]:
# Get DL1 quantities as numpy arrays or pandas.DataFrame columns
DL1_LST = dl1_quantities(data_LST)
DL1_MST = dl1_quantities(data_MST)
DL1 = [DL1_LST, DL1_MST]

In [None]:
# Transform the DL1 dictionaries into pandas DataFrames
for camera_index in range(len(cameras)):
    DL1[camera_index] = pandas.DataFrame.from_dict(DL1[camera_index])

### Fraction of events (relative to telescope triggers) that survive a given intensity cut

In [None]:
nbins = 250
xrange = [0,6]
cameras = ["LSTCam", "NectarCam"]
cameras_radii = {"LSTCam" : 1.129 , "NectarCam" : 1.132} # meters

for camera_index in range(len(cameras)):
    
    fig = plt.figure(figsize=(6, 5), tight_layout=False)
    plt.xlabel("log10(intensity #p.e)")
    plt.ylabel("Fraction of telescope triggers with log10(intensity #p.e) > x")

    tot_entries = len(DL1[camera_index]["Intensity [#phe]"])

    # No cuts
    DL1_filtered = DL1[camera_index].loc[:]
    intensity_hist, xbins = np.histogram( np.log10(DL1_filtered["Intensity [#phe]"]), bins=nbins, range=xrange)
    plt.plot(xbins[:-1], intensity_hist[::-1].cumsum()[::-1]/tot_entries, drawstyle="steps-post", label="No cuts")
    
    # Cut in the number of pixels
    DL1_filtered = DL1[camera_index].loc[DL1[camera_index]['n_pixel'] > 3]
    intensity_hist, xbins = np.histogram( np.log10(DL1_filtered["Intensity [#phe]"]), bins=nbins, range=xrange)
    plt.plot(xbins[:-1], intensity_hist[::-1].cumsum()[::-1]/tot_entries, drawstyle="steps-post", label="n_pixel")
    
    # Cut in ellipticity
    DL1_filtered = DL1[camera_index].loc[(DL1[camera_index]['Ellipticity'] > 0.1) & (DL1[camera_index]['Ellipticity'] < 0.6)]
    intensity_hist, xbins = np.histogram( np.log10(DL1_filtered["Intensity [#phe]"]), bins=nbins, range=xrange)
    plt.plot(xbins[:-1], intensity_hist[::-1].cumsum()[::-1]/tot_entries, drawstyle="steps-post", label="ellipticity")
    
    # Cut in containment radius
    DL1_filtered = DL1[camera_index].loc[DL1[camera_index]['cog_r'] < (cameras_radii[cameras[camera_index]]*0.8)]
    intensity_hist, xbins = np.histogram( np.log10(DL1_filtered["Intensity [#phe]"]), bins=nbins, range=xrange)
    plt.plot(xbins[:-1], intensity_hist[::-1].cumsum()[::-1]/tot_entries, drawstyle="steps-post", label="COG containment")

    plt.ylim([0.,1.05])
    plt.minorticks_on()
    plt.grid()
    plt.legend()
    
    # Print info about threshold cuts (as from tilcut notes of TS and JD)
    
    # This is the phe cut that saves 99.7% of the images
    cut = np.quantile(DL1[camera_index]["Intensity [#phe]"], 1-0.997)
    # Percentile rank of 0 phe in the intensity distribution
    rank_of_zero = percentileofscore(DL1[camera_index]["Intensity [#phe]"], 0)
    plt.vlines(np.log10(cut), ymin=1.e-7, ymax=1, color='red')
    
    print(f"{cameras[camera_index]}: cutting at {cut} phe saves 99.7% of the images; the percentile rank of 0 phe is {rank_of_zero:.1f}%")

    fig.savefig(f"./plots_image_cleaning/eventsAboveIntensity_{cameras[camera_index]}_protopipe_{config}.png")
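
The curves in the cell above are obtained by reversing the histogram, taking its cumulative sum and reversing it back, so that each bin holds the number of images in that bin or above. The result is then divided by the total number of telescope triggers. Here is a minimal sketch of that step with the standard library and hypothetical bin counts; `surviving_fraction` is an illustrative name, not a function of the notebook.

```python
from itertools import accumulate


def surviving_fraction(counts, total):
    """Fraction of `total` entries falling in each bin or in any higher bin."""
    # Reverse, accumulate, reverse back: same idea as hist[::-1].cumsum()[::-1]
    at_or_above = list(accumulate(reversed(counts)))[::-1]
    return [n / total for n in at_or_above]


counts = [4, 3, 2, 1]  # hypothetical histogram of log10(intensity)
total = 10             # all telescope triggers, here equal to sum(counts)
fractions = surviving_fraction(counts, total)
print(fractions)
```

When a selection cut removes images, `counts` shrinks while `total` stays fixed, so the corresponding curve starts below 1.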

### Image-parameter distributions

From [here](https://www.overleaf.com/16933164ghbhvjtchknf) : use all telescope events with; this is not a benchmark, but useful for monitoring (best done in energy bins)

In [None]:
nbins = 100
cameras = ["LSTCam", "NectarCam"]

for camera_index in range(len(cameras)):
    
    to_plot = DL1[camera_index]
    
    for key in to_plot.keys():

        fig = plt.figure(figsize=(6, 5), tight_layout=False)
        
        plt.ylabel("Number of events")
        plt.yscale('log')

        if key == "Intensity [#phe]":
            plt.xlabel(f"log10({key})")
            plt.hist(np.log10(to_plot[key]), bins=nbins)
        else:
            plt.xlabel(f"{key}")
            plt.hist(to_plot[key], bins=nbins)

        plt.minorticks_on()
        plt.grid()
        
        add_stats(to_plot[key], plt.gca())

        fig.savefig(f"./plots_image_cleaning/{key.split(' ')[0]}_{cameras[camera_index]}_protopipe_{config}.png")