<h1><center> <code>lstchain</code> DL1b to DL2 stage</center></h1>

## Content
In this notebook we will go through the following topics:
 - (dataset separation into training and testing sets)
 - Merging of DL1 sub_runs
 - Training machine learning models for lstchain
 - `lstchain` DL1 to DL2 stage
 
 
All these stages can be run locally; there is no need for the IT La Palma cluster.

## 0. Training/testing dataset separation


<center> <b>Note</b> that this stage is <i>usually</i> done before the R0 to DL1 stage</center>

This way, the `r0_to_dl1` stage is optimised by running parallel jobs on simtel sub-runs.

The goal of the training/testing dataset separation is to randomly split the files of a dataset into these two sets, so that they can be easily used in later stages.
 - Classical machine learning workflows generally set the train/test ratio to 80/20.
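
As a minimal sketch of what a random split at a given ratio looks like, the helper below shuffles a list of file names with a fixed seed and cuts it in two. The function name `split_train_test` and the file names are hypothetical and only serve as illustration; this is not an lstchain tool.

```python
import random

def split_train_test(files, test_fraction=0.2, seed=0):
    """Randomly split a list of file names into (training, testing) lists."""
    if not 0.0 <= test_fraction <= 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = sorted(files)                 # sort first so the result is reproducible
    random.Random(seed).shuffle(shuffled)    # seeded shuffle, independent of global state
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical file names, for illustration only
files = [f"dl1_gamma_run{i}.h5" for i in range(20)]
training, testing = split_train_test(files, test_fraction=0.2)
print(len(training), "training files,", len(testing), "testing files")
```

Fixing the seed keeps the split reproducible, which is convenient when the same training and testing sets must be rebuilt later.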

For this lecture, we will just make a "rough" 50/50 train/test separation (of DL1 data). \
Note that current MC productions consist of 1000 to 5000 files, depending on the particle.

An exercise at the end of this notebook proposes doing the train/test splitting in a more formal way, using lstchain tools.

In [None]:
# We define all the paths needed to run this notebook
from pathlib import Path

lst_ana_repo_dir = Path('../data').resolve()

mc_data_dir = lst_ana_repo_dir / 'mc'
mc_dl1_data_dir = mc_data_dir / 'DL1'

%cd {mc_dl1_data_dir}

In [None]:
# We check how data is stored
!ls

In [None]:
!ls gamma

In [None]:
!ls gamma | wc -l

In [None]:
# Example with gammas, later we will do similar with the rest of the particles
%cd gamma
!mkdir -p testing training

In [None]:
!ls

In [None]:
%%bash 
mv `ls *.h5 | head -n 10` testing && mv *.h5 training

In [None]:
# We check data was moved correctly 
!ls testing | wc -l && ls training | wc -l

In [None]:
# For the rest of the files
%cd ..
!mkdir -p electron/testing electron/training
!mkdir -p proton/testing proton/training
!mkdir -p gamma-diffuse/testing gamma-diffuse/training

In [None]:
%%bash
cd electron
mv `ls *.h5 | head -n 10` testing && mv *.h5 training

cd ../proton
mv `ls *.h5 | head -n 10` testing && mv *.h5 training

cd ../gamma-diffuse
mv `ls *.h5 | head -n 10` testing && mv *.h5 training
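
The shell cells above can also be written in Python. The sketch below mirrors the `mv` commands: the first `n_test` files (sorted by name) go to `testing`, the rest to `training`. The function name `split_directory` is hypothetical, and Python's `sorted` may order names slightly differently from `ls` depending on the locale. The demonstration runs in a temporary directory with empty dummy files, so no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path

def split_directory(particle_dir, n_test=10):
    """Move the first n_test *.h5 files (sorted by name) to testing/, the rest to training/."""
    particle_dir = Path(particle_dir)
    files = sorted(particle_dir.glob("*.h5"))   # only files directly inside particle_dir
    for name in ("testing", "training"):
        (particle_dir / name).mkdir(exist_ok=True)
    for i, path in enumerate(files):
        target = "testing" if i < n_test else "training"
        shutil.move(str(path), str(particle_dir / target / path.name))
    return min(n_test, len(files)), max(len(files) - n_test, 0)

# Demonstration on empty dummy files inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for i in range(15):
        (Path(tmp) / f"dummy_{i:02d}.h5").touch()
    n_testing, n_training = split_directory(tmp, n_test=10)
    print(n_testing, "testing files,", n_training, "training files")
```

On the real data, one would call `split_directory(mc_dl1_data_dir / particle)` once per particle directory.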

## 1. Merging of lstchain MC DL1 sub_runs

Once all the sub-runs have been converted into the DL1 stage, it is advised to merge all the files into a single one to ease the rest of the analysis.

To do so, we will use the `lstchain_merge_hdf5_files` entry point*.

$^{\ast}$An entry point is a command-line program that is installed together with lstchain.

In [None]:
!lstchain_merge_hdf5_files -h

We move to the directory where the MC DL1 data is stored

In [None]:
%cd {mc_dl1_data_dir}

and start merging the data, separated by particle classification

In [None]:
# Example with gamma
!lstchain_merge_hdf5_files -d gamma/training -o gamma/dl1_gamma_training.h5
!lstchain_merge_hdf5_files -d gamma/testing -o gamma/dl1_gamma_testing.h5

In [None]:
!ls gamma

and finally for the rest of the files

In [None]:
!lstchain_merge_hdf5_files -d gamma-diffuse/training -o gamma-diffuse/dl1_gamma-diffuse_training.h5
!lstchain_merge_hdf5_files -d gamma-diffuse/testing -o gamma-diffuse/dl1_gamma-diffuse_testing.h5

!lstchain_merge_hdf5_files -d electron/training -o electron/dl1_electron_training.h5
!lstchain_merge_hdf5_files -d electron/testing -o electron/dl1_electron_testing.h5

!lstchain_merge_hdf5_files -d proton/training -o proton/dl1_proton_training.h5
!lstchain_merge_hdf5_files -d proton/testing -o proton/dl1_proton_testing.h5

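# No directory change is needed here: "!cd .." would only act on a temporary subshell, not on the notebook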

In [None]:
!ls -l proton
!ls -l electron
!ls -l gamma-diffuse
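
The merge commands above follow a regular pattern, so they can be generated in a loop. The sketch below only uses the `-d` and `-o` options shown in this notebook; the helper names `merge_commands` and `run_all` are hypothetical. By default it performs a dry run that prints the commands without executing them.

```python
import subprocess

PARTICLES = ("gamma", "gamma-diffuse", "electron", "proton")
SETS = ("training", "testing")

def merge_commands(particles=PARTICLES, sets=SETS):
    """Build one lstchain_merge_hdf5_files command per particle and data set."""
    commands = []
    for particle in particles:
        for dataset in sets:
            commands.append([
                "lstchain_merge_hdf5_files",
                "-d", f"{particle}/{dataset}",
                "-o", f"{particle}/dl1_{particle}_{dataset}.h5",
            ])
    return commands

def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)   # raises if a merge fails

run_all(merge_commands())   # dry run: prints the eight commands only
```

Calling `run_all(merge_commands(), dry_run=False)` from the DL1 directory would execute the same merges as the cells above.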

## 2. Training of Random Forest models

`lstchain` uses 'classical' machine learning (ML) algorithms that are applied during the `dl1_to_dl2` stage to perform:
 - gamma/hadron separation,
 - energy reconstruction,
 - direction reconstruction.
 
These ML algorithms, which are trained with DL1 data, are taken from the `scikit-learn` Python library:
 - Random Forest regressor (energy and direction reconstruction)
 - Random Forest classifier (gamma/hadron separation)
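
To make the two kinds of model concrete, here is a minimal `scikit-learn` sketch on synthetic data. The features and targets are random toy values, not DL1 parameters, and the hyperparameters are arbitrary; the real ones come from the lstchain configuration file.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
n_events = 400
features = rng.normal(size=(n_events, 3))   # toy stand-ins for image parameters

# Toy targets: a continuous quantity (regression) and a binary class (classification)
toy_energy = 2.0 * features[:, 0] + rng.normal(scale=0.1, size=n_events)
is_gamma = (features[:, 1] > 0).astype(int)

regressor = RandomForestRegressor(n_estimators=50, random_state=0).fit(features, toy_energy)
classifier = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, is_gamma)

reco_energy = regressor.predict(features)                      # one value per event
gamma_probability = classifier.predict_proba(features)[:, 1]   # probability of class 1
print(reco_energy.shape, gamma_probability.min(), gamma_probability.max())
```

The regressor returns a continuous estimate per event, while the classifier returns a class probability, which is how a gamma/hadron score between 0 and 1 can be obtained.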

The set of parameters used to train the models is defined in a configuration file.

Non-expert users should use the standard parameters (those found in the `lstchain_standard_config.json` file).

Currently, the user can select between two kinds of direction reconstruction training:
 - `disp_ver`
     * A single RF regressor is trained for the coordinates of the disp vector.
 - `disp_norm_sign`   (**Default choice**)
     * An RF regressor is trained for the modulus of the disp vector (`disp_norm`).
     * An RF classifier is trained to classify the sign of the disp vector.
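
The following sketch illustrates how a norm and a sign can be combined into a source position. It assumes that the disp vector points from the image centre of gravity towards the source along the image main axis, whose orientation is the angle `psi`. The function and argument names are hypothetical, and this is a geometric illustration, not the lstchain implementation.

```python
import math

def source_position(cog_x, cog_y, psi, disp_norm, disp_sign):
    """Return (x, y) of the source given the image centre of gravity and the disp quantities.

    Assumption: the disp vector lies along the main axis (angle psi, in radians),
    has length disp_norm and direction given by disp_sign (+1 or -1).
    """
    disp_dx = disp_sign * disp_norm * math.cos(psi)
    disp_dy = disp_sign * disp_norm * math.sin(psi)
    return cog_x + disp_dx, cog_y + disp_dy

# Same image, opposite signs: the two candidate positions are mirrored about the centre of gravity
print(source_position(0.1, 0.2, math.pi / 4, 0.5, +1))
print(source_position(0.1, 0.2, math.pi / 4, 0.5, -1))
```

The regressor fixes the distance along the axis, and the classifier resolves which of the two mirrored candidates is chosen.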

In [None]:
%cd {mc_data_dir}
!cat configs/lstchain_trainpipe_dl1b_dl2_config.json 

### `lstchain` RF training

In [None]:
# Let's train our models from the previously merged DL1 files
!mkdir -p models 

In [None]:
!lstchain_mc_trainpipe -h

In [None]:
!lstchain_mc_trainpipe --fg DL1/gamma-diffuse/dl1_gamma-diffuse_training.h5 \
 --fp DL1/proton/dl1_proton_training.h5 -o models -c configs/lstchain_trainpipe_dl1b_dl2_config.json 

In [None]:
!ls models

### We can also plot the RF feature importances

In [None]:
import matplotlib.pyplot as plt
from lstchain.visualization.plot_dl2 import plot_models_features_importances
%matplotlib inline

In [None]:
pwd

In [None]:
plot_models_features_importances(path_models='models', config_file='configs/lstchain_trainpipe_dl1b_dl2_config.json')
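
For readers unfamiliar with feature importances, the sketch below shows the `feature_importances_` attribute that `scikit-learn` Random Forests expose after training. It is presumably the kind of quantity plotted above, although that is an assumption about the plotting function. The data and feature names are synthetic and hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
names = ["feature_a", "feature_b", "feature_c"]   # hypothetical feature names
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)                     # only the first feature carries information

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Rank the features from most to least important
ranking = sorted(zip(names, model.feature_importances_), key=lambda item: item[1], reverse=True)
for name, importance in ranking:
    print(f"{name:10s} {importance:.3f}")
```

Since the label depends only on the first feature, that feature dominates the ranking; in a real model, the ranking shows which image parameters drive each reconstruction task.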

## 3. `lstchain` DL1 to DL2 stage

In this stage, the trained models are applied to DL1 data.

Energy and direction (source position) are reconstructed, and gamma/hadron separation is applied (each event is assigned a probability of being a gamma).

In [None]:
# We create the path tree
%cd {mc_data_dir}
!mkdir -p DL2
%cd DL2
!mkdir -p proton electron gamma gamma-diffuse
%cd ..

In [None]:
!lstchain_dl1_to_dl2 -h

In [None]:
# example with the gamma file
!lstchain_dl1_to_dl2 -f DL1/gamma/dl1_gamma_testing.h5 -p models/ -o DL2/gamma -c configs/lstchain_trainpipe_dl1b_dl2_config.json 

In [None]:
# And for the rest of the particles
!lstchain_dl1_to_dl2 -f DL1/proton/dl1_proton_testing.h5 -p models/ -o DL2/proton -c configs/lstchain_trainpipe_dl1b_dl2_config.json 
!lstchain_dl1_to_dl2 -f DL1/electron/dl1_electron_testing.h5 -p models/ -o DL2/electron -c configs/lstchain_trainpipe_dl1b_dl2_config.json 
!lstchain_dl1_to_dl2 -f DL1/gamma-diffuse/dl1_gamma-diffuse_testing.h5 -p models/ -o DL2/gamma-diffuse -c configs/lstchain_trainpipe_dl1b_dl2_config.json 

#### Let's check which parameters have been added in the `dl1_to_dl2` stage


In [None]:
import tables
hf_dl1 = tables.open_file('DL1/gamma/dl1_gamma_testing.h5')
hf_dl2 = tables.open_file('DL2/gamma/dl2_gamma_testing.h5')

from lstchain.io.io import dl1_params_lstcam_key, dl2_params_lstcam_key
dl1_parameters = hf_dl1.root[dl1_params_lstcam_key].colnames
dl2_parameters = hf_dl2.root[dl2_params_lstcam_key].colnames

set(dl2_parameters)-set(dl1_parameters)
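
A Python `set` difference does not preserve the column order of the table. If the order matters, a small helper such as the hypothetical `added_columns` below can be used instead. The column lists here are short illustrative examples, not the full content of the files.

```python
def added_columns(before, after):
    """Return the names present in `after` but not in `before`, keeping their order."""
    known = set(before)
    return [name for name in after if name not in known]

# Illustrative column lists only; the real ones come from the HDF5 tables
dl1_cols = ["intensity", "width", "length"]
dl2_cols = dl1_cols + ["reco_energy", "gammaness"]
print(added_columns(dl1_cols, dl2_cols))
```

With the real files, `added_columns(dl1_parameters, dl2_parameters)` would list the new DL2 columns in table order.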

In [None]:
# And the existing DL1 parameters
dl1_parameters

### Tip
To explore any HDF5 file through a GUI, you can use https://vitables.org/ (among many other ways).

<h1><center> Exercise section </center></h1>
<h2> Train/test dataset separation & RF training </h2>

- Restore the state of the DL1 directory
    - Move all the testing and training files to the particle's directory
    - Erase the merged DL1 files & the DL2 files & the models
- Split train/test data into 80/20
    - TIP: You can use scikit-learn to ease this process (e.g. `from sklearn.model_selection import train_test_split`)
- Train the models using the `disp_ver` option by changing this parameter in the configuration file
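
For the last point, one possible approach is to write a modified copy of the JSON configuration instead of editing the original by hand. The helper `write_modified_config` is hypothetical, and the key name used in the commented example (`disp_method`) is an assumption that should be checked against the printed configuration file.

```python
import json
import tempfile
from pathlib import Path

def write_modified_config(src, dst, overrides):
    """Copy a JSON config from src to dst, replacing the top-level keys given in overrides."""
    config = json.loads(Path(src).read_text())
    config.update(overrides)
    Path(dst).write_text(json.dumps(config, indent=4))
    return config

# Demonstration on a dummy config in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "original.json"
    dst = Path(tmp) / "modified.json"
    src.write_text(json.dumps({"option": "old", "other": 1}))
    write_modified_config(src, dst, {"option": "new"})
    print(dst.read_text())

# On the real file (key name is an assumption, check it in the config first):
# write_modified_config("configs/lstchain_trainpipe_dl1b_dl2_config.json",
#                       "configs/my_config.json", {"disp_method": "disp_ver"})
```

Keeping the original file untouched makes it easy to compare the models trained with both settings.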

## RESTORE THE ORIGINAL STATE OF THE DL1 DIRECTORY


Convert the cell below into a `code` cell (Cell --> Cell Type --> Code) and run it.

%cd {mc_data_dir}
!rm -rf DL2/ models/
%cd {mc_dl1_data_dir}
!rm gamma/dl1_gamma_training.h5 gamma/dl1_gamma_testing.h5
!rm proton/dl1_proton_training.h5 proton/dl1_proton_testing.h5
!rm electron/dl1_electron_training.h5 electron/dl1_electron_testing.h5
!rm gamma-diffuse/dl1_gamma-diffuse_training.h5 gamma-diffuse/dl1_gamma-diffuse_testing.h5
%cd gamma
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../proton
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../electron
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../gamma-diffuse
!mv testing/* . && mv training/* . && rm -rf training testing