# Running the LCD dataset analysis: The basics

Please follow the step-by-step instructions below using Jupyter Notebook. Note that the commands need to be run from your DLKit/ directory.

## CaloDNN Package Layout

CaloDNN is a DLKit-based package used for studying LCD Calorimetry with deep neural networks. The training and analysis for the classification problem are done by running an "experiment" using the CaloDNN package and Jupyter notebooks (it is also possible to run on the command line).  

When an experiment is run, four steps are performed:

1. Load the dataset
2. Load a pre-existing model or build your own model
3. Train the model
4. Run the analysis

The package allows you to turn each of these steps on or off as needed using command-line arguments, which are described below.
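The step toggling can be pictured with a small `argparse` sketch. This is not the CaloDNN source: it reuses only the `--NoModel`, `--NoTrain`, and `--NoAnalysis` flag names that appear in the `--help` output further down, and the helper names `build_parser` and `planned_steps` are hypothetical. It also assumes each flag acts independently, which the real package may not do.

```python
import argparse


def build_parser():
    # Only a subset of the flags listed by --help is reproduced here.
    parser = argparse.ArgumentParser(description="Sketch of step toggling")
    parser.add_argument("--NoTrain", action="store_true", help="Do not run training.")
    parser.add_argument("--NoAnalysis", action="store_true", help="Do not run analysis.")
    parser.add_argument("--NoModel", action="store_true", help="Do not build or load model.")
    return parser


def planned_steps(argv):
    """Return the list of steps that would run for the given arguments."""
    args = build_parser().parse_args(argv)
    steps = ["load dataset"]  # the dataset is always loaded in this sketch
    if not args.NoModel:
        steps.append("build or load model")
    if not args.NoTrain:
        steps.append("train model")
    if not args.NoAnalysis:
        steps.append("run analysis")
    return steps


print(planned_steps(["--NoTrain"]))
```

With `--NoTrain`, the sketch skips only the training step and keeps the other three.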

The package consists of the following primary python files:

*CaloDNN/ClassificationExperiment.py*: This is the “experiment” that drives everything.

*CaloDNN/ClassificationArguments.py*: This is where the command-line arguments mentioned above are defined and parsed, along with some defaults. You can add your own options here if need be.

*CaloDNN/ClassificationScanConfig.py*: This is the configuration file, where the model and experiment parameters are set. This example is set up to allow hyper-parameter scanning. It also contains the list of input files, maps the files and datasets to classes, and controls which variables are used in the neural network.

*CaloDNN/Models.py*: This contains the Keras models, wrapped in a DLKit ModelWrapper class.

*CaloDNN/LCDData.py*: Contains the DLGenerators to read the data.

The basic analysis is driven by `CaloDNN/ClassificationExperiment.py`. You can run it with the `--help` switch to see the command-line options:

(Running in Jupyter Notebooks: click on the box below and hit shift-enter.)

```
%run -im CaloDNN.ClassificationExperiment -- --help
```

```
usage: ClassificationExperiment.py [-h] [-C CONFIG] [-L LOADMODEL]
                                   [--gpu GPUID] [--cpu] [--NoTrain]
                                   [--NoAnalysis] [--NoModel] [--LowMem]
                                   [--Test] [--Recover] [-s HYPERPARAMSET]
                                   [--nocache] [--preload] [-r RUNNINGTIME]
                                   [-p] [--GracefulExit]

optional arguments:
  -h, --help            show this help message and exit
  -C CONFIG, --config CONFIG
                        Use specified configuration file.
  -L LOADMODEL, --LoadModel LOADMODEL
                        Loads a model from specified directory.
  --gpu GPUID           Use specified GPU.
  --cpu                 Use CPU.
  --NoTrain             Do not run training.
  --NoAnalysis          Do not run analysis.
  --NoModel             Do not build or load model.
  --LowMem              Minimize Memory Usage.
  --Test                Run in test mode (reduced ex
```

Each command-line option above is described by the help text next to it. The default configuration file is CaloDNN/ClassificationScanConfig.py, which is explained in the next section.


## Configuration and Hyperparameter Scanning

The DNN architecture is defined by the types, number, and dimensions of its layers. Hyper-parameter scanning refers to the process of searching for an architecture that performs well for a task and can be trained and applied within a reasonable time. Beyond the parameters that define the DNN architecture, other configuration parameters allow setting and testing activation and cost functions, optimization (e.g. minimization) techniques, the learning rate, and other training parameters.

In DLKit, these parameters are set in a configuration file that defines a single Python key/value dictionary. DLKit puts the contents of this dictionary in the global scope, with the keys as the variable names. For the CaloDNN package, an example is `CaloDNN/ClassificationScanConfig.py`. Some of its key configuration parameters are shown below:

```
Config={
    "MaxEvents":int(3.e6),
    "NTestSamples":100000,
    "NClasses":4,

    "Epochs":1000,
    "BatchSize":1024,

...

    # Configure Running time callback
    # Set RunningTime to a value to stop training after N seconds.
    "RunningTime": 2*3600,

    # Load last trained version of this model configuration. (based on Name var below)
    "LoadPreviousModel":True
    }
```
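The idea of exposing dictionary keys as global variables can be illustrated with a minimal sketch. The helper `export_config` is hypothetical and the dictionary is a trimmed-down stand-in; how DLKit actually performs this step is not shown in this tutorial.

```python
# Hypothetical, trimmed-down configuration dictionary for illustration.
Config = {
    "NClasses": 4,
    "Epochs": 1000,
    "BatchSize": 1024,
    "RunningTime": 2 * 3600,
}


def export_config(config, namespace):
    """Copy every key/value pair of `config` into `namespace` as a variable."""
    for key, value in config.items():
        namespace[key] = value


# Passing globals() makes each key available as a module-level name.
export_config(Config, globals())
print(BatchSize, RunningTime)
```

After the call, code elsewhere in the same module can refer to `BatchSize` or `RunningTime` directly instead of indexing the dictionary.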

An important parameter in this configuration file is `RunningTime`, which sets the duration of the training in seconds (two hours in the example above). Using this parameter, you can train a model for a fixed amount of time. You can then rerun the job to continue training; when `LoadPreviousModel` is set to `True`, the last trained version of the model is loaded automatically.
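A time budget of this kind can be sketched in plain Python. The configuration comments mention a running-time callback, but its implementation is not shown here, so the function `train_with_time_budget` and its arguments are assumptions made for illustration. The sketch checks the elapsed time between epochs, so the last epoch may overshoot the budget.

```python
import time


def train_with_time_budget(train_one_epoch, max_epochs, running_time,
                           clock=time.monotonic):
    """Run epochs until `max_epochs` is reached or `running_time` seconds pass."""
    start = clock()
    completed = 0
    for _ in range(max_epochs):
        # Stop before starting a new epoch once the budget is used up.
        if clock() - start >= running_time:
            break
        train_one_epoch()
        completed += 1
    return completed


# A fake clock lets the example run instantly: each "epoch" takes 40 seconds.
fake_time = [0.0]


def fake_clock():
    return fake_time[0]


def fake_epoch():
    fake_time[0] += 40.0


epochs_done = train_with_time_budget(fake_epoch, max_epochs=1000,
                                     running_time=100.0, clock=fake_clock)
print(epochs_done)  # → 3
```

Three epochs run (ending at 120 simulated seconds) before the check at the start of the fourth epoch stops the loop.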

[`CaloDNN/ClassificationScanConfig.py`](https://github.com/UTA-HEP-Computing/CaloDNN/blob/master/ClassificationScanConfig.py) is well commented. We suggest you read through the comments.

The following lines:

```
# Parameters to scan and their scan points.
Params={    "optimizer":["'RMSprop'","'Adam'","'SGD'"],
            "Width":[32,64,128,256,512],
            "Depth":range(1,5),
            "lr":[0.01,0.001],
            "decay":[0.01,0.001],
          }
```

will generate 3 x 5 x 4 x 2 x 2 = 240 different configurations, which we can test.
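The count of 240 can be checked by enumerating the Cartesian product of the scan points. This is a sketch using `itertools.product`. The helper `enumerate_combinations` is hypothetical, and the order in which DLKit numbers the combinations is an assumption.

```python
import itertools

Params = {
    "optimizer": ["'RMSprop'", "'Adam'", "'SGD'"],
    "Width": [32, 64, 128, 256, 512],
    "Depth": list(range(1, 5)),
    "lr": [0.01, 0.001],
    "decay": [0.01, 0.001],
}


def enumerate_combinations(params):
    """Return one dict per point in the Cartesian product of the scan values."""
    keys = list(params)
    value_lists = [params[key] for key in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


combos = enumerate_combinations(Params)
print(len(combos))  # → 240
print(combos[0])
```

The first combination takes the first scan point of every parameter, whatever order the keys are iterated in.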

## Running a Test Classification Problem

We will now run a test classification experiment using the `--Test` flag on the command line. In this mode a reduced number of events and epochs are run. This is a good test of your setup and a good way to walk through the code.

```
%matplotlib inline
import matplotlib.pyplot as plt
%run -im CaloDNN.ClassificationExperiment -- --Test
```

```
Using GPU
Found 8 CPUs and 2 GPUs. Using 4 threads. max_threads = 12
HyperParameter Scan:  240 possible combiniations.
______________________________________
ScanConfiguration
______________________________________
Picked combination:  0
Combo[0]={'Width': 32, 'Depth': 1, 'lr': 0.01, 'optimizer': "'RMSprop'", 'decay': 0.01}
Model Filename:  CaloDNN_32_1_0.01_RMSprop_0.01
______________________________________
Test Mode: Set MaxEvents to 20000 and Epochs to 10
Searching in : /data/LCD/*/*.h5
Found 639 files.
Train Class Index Map: {'Pi0': 0, 'ChPi': 1, 'Gamma': 2, 'Ele': 3}
Caching data on disk for faster processing after first epoch. Hope you have enough disk space.


ImportError: No module named merge
```

You will see the evolution of the analysis as a function of epochs (in this test we run only 20k events for 10 epochs). When an experiment is run, the model is saved to CaloDNN/TrainedModels, so you can reload it in future analyses or in further experiments with more events and epochs. The name of the model reflects the hyper-parameter settings.
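As an illustration of how a name can encode the settings, the sketch below rebuilds the `Model Filename` printed in the log above from the picked combination. The field order and the stripping of quotes are inferred from that single log line, and the function `model_name` is hypothetical, not the package's own code.

```python
def model_name(combo, prefix="CaloDNN"):
    """Join the hyper-parameter values into a name, in the order seen in the log."""
    parts = [
        combo["Width"],
        combo["Depth"],
        combo["lr"],
        combo["optimizer"].strip("'"),  # the scan stores optimizer names with quotes
        combo["decay"],
    ]
    return "_".join([prefix] + [str(part) for part in parts])


combo = {"Width": 32, "Depth": 1, "lr": 0.01, "optimizer": "'RMSprop'", "decay": 0.01}
print(model_name(combo))  # → CaloDNN_32_1_0.01_RMSprop_0.01
```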

At the end you will see a plot reflecting the success of the model at classifying each of the four types of particles (electrons, photons, charged pions, and neutral pions). The 'area' (area under the curve) measures how well this hyper-parameter combination classifies each particle type (the larger the number, the better). In the next part we will test different models to see which gives the best rate of successfully classifying these events.
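To make the 'area' figure concrete, here is a minimal sketch of the area under a ROC curve for one class versus the rest. It uses the pairwise-ranking definition (the probability that a true member of the class scores higher than a non-member), which equals the ROC area. The scores and labels are made up, and the function `roc_auc` is not the package's own plotting code.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example gets a
    higher score than a randomly chosen negative one (ties count as half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for one particle class versus all others.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_auc(scores, labels))
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the area is 8/9. A value of 1.0 means perfect separation and 0.5 corresponds to random guessing.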





