<h1><center> <code>lstchain</code> DL1b to DL2 stage - Solutions </center></h1>


# Exercise section 
## train/test data set separation & RF training

- Restore the state of the DL1 directory
    - Move all the testing and training files back to each particle's directory
    - Erase the merged DL1 files & the DL2 files & the models
- Split train/test data into 80/20
    - TIP: You can use scikit-learn to ease this process (e.g. `from sklearn.model_selection import train_test_split`)
- Train the models using the `disp_vector` option by changing the `disp_method` parameter in the configuration file

# Solutions

In [None]:
# First of all, we define the absolute paths that we will be using in this notebook

from pathlib import Path

# resolve() already returns an absolute path
lst_ana_repo_dir = Path('../data').resolve()

mc_data_dir = lst_ana_repo_dir / 'mc'
mc_dl1_data_dir = mc_data_dir / 'DL1'

## Restore the original state of the `/data/mc/DL1` dir

In [None]:
# We first move the data back to its original structure, so that we can repeat the train/test splitting and merging.

%cd {mc_data_dir}
!rm -rf DL2/ models/
%cd {mc_dl1_data_dir}
!rm gamma/dl1_gamma_training.h5 gamma/dl1_gamma_testing.h5
!rm proton/dl1_proton_training.h5 proton/dl1_proton_testing.h5
!rm electron/dl1_electron_training.h5 electron/dl1_electron_testing.h5
!rm gamma-diffuse/dl1_gamma-diffuse_training.h5 gamma-diffuse/dl1_gamma-diffuse_testing.h5
%cd gamma
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../proton
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../electron
!mv testing/* . && mv training/* . && rm -rf training testing
%cd ../gamma-diffuse
!mv testing/* . && mv training/* . && rm -rf training testing
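
The shell commands above can also be written in pure Python, which avoids errors when a directory or a merged file is already missing. The following is a minimal sketch and not part of `lstchain`: the helper name `restore_particle_dir` is hypothetical, and it assumes the `training`/`testing` layout and the `dl1_<particle>_<split>.h5` naming used in this notebook.

```python
import shutil
from pathlib import Path


def restore_particle_dir(particle_dir):
    """Undo the train/test split inside one particle directory.

    Hypothetical helper: removes the merged DL1 files and moves every
    file from training/ and testing/ back into particle_dir.
    Returns the number of files moved back.
    """
    particle_dir = Path(particle_dir)
    particle = particle_dir.name
    moved = 0
    for split in ("training", "testing"):
        # Remove the merged file only if it exists
        merged = particle_dir / f"dl1_{particle}_{split}.h5"
        if merged.exists():
            merged.unlink()

        split_dir = particle_dir / split
        if not split_dir.is_dir():
            continue
        for file in list(split_dir.iterdir()):
            shutil.move(str(file), str(particle_dir / file.name))
            moved += 1
        split_dir.rmdir()  # the directory is empty at this point
    return moved
```

It could be called once per particle, for instance `restore_particle_dir(mc_dl1_data_dir / 'gamma')`.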

## Split the train/test dataset into a 80/20 ratio

#### TIP
You can use `from sklearn.model_selection import train_test_split`
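
If scikit-learn is not available, a comparable split can be sketched with the standard library only. This is an alternative technique, not what `train_test_split` does internally, so it will not select the same files. The function name `split_files` and the example file names are hypothetical.

```python
import random


def split_files(files, test_fraction=0.2, seed=42):
    """Shuffle a list of files and split it into (training, testing)."""
    # Sort first so that the result does not depend on the listing order
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names, only for illustration
example_files = [f"dl1_gamma_run{i}.h5" for i in range(10)]
training, testing = split_files(example_files)
print(len(training), len(testing))
```

Fixing the seed keeps the split reproducible between runs, which is the same role `random_state` plays in scikit-learn.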

In [None]:
import shutil
from sklearn.model_selection import train_test_split

In [None]:
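# We create a sorted list with all the files within the /DL1/gamma dir.
# Sorting makes the split reproducible, since iterdir() yields entries in arbitrary order.

files = sorted(file.as_posix() for file in mc_dl1_data_dir.joinpath('gamma').iterdir() if file.is_file())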

In [None]:
files

In [None]:
# We split data into a 80/20 ratio

training, testing = train_test_split(files, test_size=0.2, random_state=42)

In [None]:
# Check that the splitting was done correctly

len(training), len(testing)

In [None]:
# And move the files into the /training and /testing sub-dirs

mc_dl1_data_dir.joinpath('gamma/training').mkdir(exist_ok=True)
mc_dl1_data_dir.joinpath('gamma/testing').mkdir(exist_ok=True)

for file in training:
    shutil.move(file, mc_dl1_data_dir.joinpath('gamma/training'))
for file in testing:
    shutil.move(file, mc_dl1_data_dir.joinpath('gamma/testing'))

In [None]:
# We can do the same for the rest of the particles

for particle in ['gamma-diffuse', 'proton', 'electron']:
    
    files = sorted(file.as_posix() for file in mc_dl1_data_dir.joinpath(particle).iterdir() if file.is_file())
    training, testing = train_test_split(files, test_size=0.2, random_state=42)
    
    print(f'Working with {particle}. Training size: {len(training)}, testing size: {len(testing)}.')
    
    mc_dl1_data_dir.joinpath(particle, 'training').mkdir(exist_ok=True)
    mc_dl1_data_dir.joinpath(particle, 'testing').mkdir(exist_ok=True)
    
    for file in training:
        shutil.move(file, mc_dl1_data_dir.joinpath(particle, 'training'))
    for file in testing:
        shutil.move(file, mc_dl1_data_dir.joinpath(particle, 'testing'))
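
Once the files are moved, the result can be checked directly on disk rather than from the in-memory lists. The sketch below is a hypothetical helper (`split_summary` is not an `lstchain` function) that counts the files in each sub-directory and raises an error if a file name appears in both samples.

```python
from pathlib import Path


def split_summary(particle_dir):
    """Return (n_training, n_testing, test_fraction) for one particle directory."""
    particle_dir = Path(particle_dir)
    training = {f.name for f in (particle_dir / "training").iterdir()}
    testing = {f.name for f in (particle_dir / "testing").iterdir()}

    # A file must never end up in both samples
    overlap = training & testing
    if overlap:
        raise ValueError(f"Files present in both samples: {sorted(overlap)}")

    total = len(training) + len(testing)
    test_fraction = len(testing) / total if total else 0.0
    return len(training), len(testing), test_fraction
```

For an 80/20 split, the returned fraction should be close to 0.2; small deviations are expected when the number of files is not a multiple of five.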

#### We need to merge the DL1 datasets again

In [None]:
for particle in ['gamma', 'gamma-diffuse', 'proton', 'electron']:
    
    source_dir = mc_dl1_data_dir.joinpath(particle, 'training').as_posix()
    output_file = mc_dl1_data_dir.joinpath(particle, f'dl1_{particle}_training.h5').as_posix()
    !lstchain_merge_hdf5_files -d $source_dir -o $output_file
    
    source_dir = mc_dl1_data_dir.joinpath(particle, 'testing').as_posix()
    output_file = mc_dl1_data_dir.joinpath(particle, f'dl1_{particle}_testing.h5').as_posix()
    !lstchain_merge_hdf5_files -d $source_dir -o $output_file

In [None]:
# We check that the merged DL1 files were correctly created

for particle in ['gamma', 'gamma-diffuse', 'proton', 'electron']:
    dl1_particle_dir = mc_dl1_data_dir.joinpath(particle).as_posix()
    print(f' * {particle} directory:')
    !ls $dl1_particle_dir
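
Reading the `ls` output by eye works for four particles, but the same check can be done programmatically. The following sketch assumes the `dl1_<particle>_<split>.h5` naming used in the merging step; `missing_merged_files` is a hypothetical helper name.

```python
from pathlib import Path


def missing_merged_files(dl1_dir, particles):
    """List the merged DL1 files that are absent or empty."""
    dl1_dir = Path(dl1_dir)
    missing = []
    for particle in particles:
        for split in ("training", "testing"):
            merged = dl1_dir / particle / f"dl1_{particle}_{split}.h5"
            # An empty file is treated as missing too
            if not merged.is_file() or merged.stat().st_size == 0:
                missing.append(merged)
    return missing
```

An empty list returned by `missing_merged_files(mc_dl1_data_dir, ['gamma', 'gamma-diffuse', 'proton', 'electron'])` would mean that all eight merged files exist and are not empty. It does not validate their HDF5 content.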

### We create a new configuration changing the RF `disp_method`

Have a look at the first item of the `new_rf_config` object, as well as at the features changed in the `particle_classification_features` list

In [None]:
from traitlets.config import Config

new_rf_config = Config({
    
  "disp_method": "disp_vector",
    
  "random_forest_energy_regressor_args": {
    "max_depth": 50,
    "min_samples_leaf": 2,
    "n_jobs": 4,
    "n_estimators": 150,
    "bootstrap": True,
    "criterion": "squared_error",
    "max_features": "auto",
    "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0,
    "min_samples_split": 2,
    "min_weight_fraction_leaf": 0.0,
    "oob_score": False,
    "random_state": 42,
    "verbose": 0,
    "warm_start": False
  },

  "random_forest_disp_regressor_args": {
    "max_depth": 50,
    "min_samples_leaf": 2,
    "n_jobs": 4,
    "n_estimators": 150,
    "bootstrap": True,
    "criterion": "squared_error",
    "max_features": "auto",
    "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0,
    "min_samples_split": 2,
    "min_weight_fraction_leaf": 0.0,
    "oob_score": False,
    "random_state": 42,
    "verbose": 0,
    "warm_start": False
  },

  "random_forest_disp_classifier_args": {
    "max_depth": 100,
    "min_samples_leaf": 2,
    "n_jobs": 4,
    "n_estimators": 100,
    "criterion": "gini",
    "min_samples_split": 2,
    "min_weight_fraction_leaf": 0.0,
    "max_features": "auto",
    "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0,
    "bootstrap": True,
    "oob_score": False,
    "random_state": 42,
    "verbose": 0.0,
    "warm_start": False,
    "class_weight": None
  },

  "random_forest_particle_classifier_args": {
    "max_depth": 100,
    "min_samples_leaf": 2,
    "n_jobs": 4,
    "n_estimators": 100,
    "criterion": "gini",
    "min_samples_split": 2,
    "min_weight_fraction_leaf": 0.0,
    "max_features": "auto",
    "max_leaf_nodes": None,
    "min_impurity_decrease": 0.0,
    "bootstrap": True,
    "oob_score": False,
    "random_state": 42,
    "verbose": 0.0,
    "warm_start": False,
    "class_weight": None
  },

  "energy_regression_features": [
    "log_intensity",
    "width",
    "length",
    "x",
    "y",
    "wl",
    "skewness",
    "kurtosis",
    "time_gradient",
    "leakage_intensity_width_2"
  ],

  "disp_regression_features": [
    "log_intensity",
    "width",
    "length",
    "wl",
    "skewness",
    "kurtosis",
    "time_gradient",
    "leakage_intensity_width_2"
  ],

  "disp_classification_features": [
    "log_intensity",
    "width",
    "length",
    "wl",
    "skewness",
    "kurtosis",
    "time_gradient",
    "leakage_intensity_width_2"
  ],

  "particle_classification_features": [
    "log_intensity",
    "width",
    "length",
    "x",
    "y",
    "wl",
    "skewness",
    "kurtosis",
    "time_gradient",
    "leakage_intensity_width_2",
    "log_reco_energy",
    #"reco_disp_norm",
    #"reco_disp_sign"
    "reco_disp_dx",
    "reco_disp_dy"
  ],

  "source_dependent": False,
  "allowed_tels": [1]
})
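
The only feature change in the configuration above is that `reco_disp_norm` and `reco_disp_sign` are replaced by `reco_disp_dx` and `reco_disp_dy` in `particle_classification_features`. The sketch below shows that substitution as a small function. The helper `particle_features_for` is hypothetical, and the method name `disp_norm_sign` is an assumption that should be checked against the standard configuration of your `lstchain` version.

```python
DISP_COLUMNS = {
    "disp_vector": ["reco_disp_dx", "reco_disp_dy"],
    # Assumption: name of the norm/sign method, verify it for your lstchain version
    "disp_norm_sign": ["reco_disp_norm", "reco_disp_sign"],
}


def particle_features_for(method, features):
    """Return a copy of features with the disp columns matching the method."""
    all_disp = {column for columns in DISP_COLUMNS.values() for column in columns}
    # Drop any disp column already present, then append the right pair
    kept = [feature for feature in features if feature not in all_disp]
    return kept + DISP_COLUMNS[method]


# Shortened feature list, only for illustration
features = ["log_intensity", "width", "log_reco_energy",
            "reco_disp_norm", "reco_disp_sign"]
print(particle_features_for("disp_vector", features))
```

Keeping the mapping in one place makes it harder to set `disp_method` to one value while leaving the feature names of the other method in the list.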

#### We train again, this time calling `build_models` directly instead of using the `lstchain` entry point

In [None]:
from lstchain.reco.dl1_to_dl2 import build_models

In [None]:
# The models go to data/mc/models, the same directory that the restore step above erases
mc_data_dir.joinpath('models').mkdir(exist_ok=True)
models_dir = mc_data_dir.joinpath('models').as_posix()

dl1_gamma_diffuse_file = mc_dl1_data_dir.joinpath('gamma-diffuse/dl1_gamma-diffuse_training.h5')
dl1_proton_file = mc_dl1_data_dir.joinpath('proton/dl1_proton_training.h5')

build_models(dl1_gamma_diffuse_file,
            dl1_proton_file,
            save_models=True,
            path_models=models_dir,
            custom_config=new_rf_config
           )

In [None]:
!ls $models_dir

### Note

If you are using the school environment, or an environment with `lstchain-v0.8.4`, there is an error in the model file names: `reg_disp_norm.sav` and `cls_disp_sign.sav` should not be present. This bug is fixed in later `lstchain` versions.
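
To see whether your environment is affected, you can look for the two file names mentioned in the note. This is a minimal sketch; `find_stray_models` is a hypothetical helper and only checks for the presence of the files.

```python
from pathlib import Path

# File names taken from the note above
STRAY_MODELS = ("reg_disp_norm.sav", "cls_disp_sign.sav")


def find_stray_models(models_dir):
    """Return the unexpected model files found in models_dir."""
    models_dir = Path(models_dir)
    return [models_dir / name for name in STRAY_MODELS
            if (models_dir / name).is_file()]
```

Calling `find_stray_models(models_dir)` after training returns an empty list when neither file was written.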