# Calorimetry with DNNs

The data from highly granular calorimeters can be viewed as 3D images, making them ideal for image classification problems. In this lab, we will apply image classification to simulated calorimeter data from the LCD detector concept for the CLIC accelerator. We will use the [CaloDNN](https://github.com/UTA-HEP-Computing/CaloDNN) package to systematically study different neural network architectures, optimizers, loss functions, and other hyperparameters.

The data is composed of four particle types: electrons, neutral pions (pi0s), charged pions, and photons (gamma). The LCD calorimeter is composed of electromagnetic (ECAL) and hadronic (HCAL) sections. The simulation shoots a single particle into the calorimeter and stores a 25 by 25 by 25 cell window of the ECAL and a 5 by 5 by 60 cell window of the HCAL around the particle.
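To get a feel for the input size, the cell counts implied by these dimensions can be worked out directly. The variable names below are illustrative and are not taken from CaloDNN.

```python
import math

# Window sizes quoted in the text (number of cells along each axis).
ecal_shape = (25, 25, 25)
hcal_shape = (5, 5, 60)

ecal_cells = math.prod(ecal_shape)
hcal_cells = math.prod(hcal_shape)

print("ECAL cells per event:", ecal_cells)
print("HCAL cells per event:", hcal_cells)
print("Total cells per event:", ecal_cells + hcal_cells)
```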

## CaloDNN

CaloDNN is a DLKit based package for studying LCD Calorimetry with DNNs.

The package consists of the following files:

   **CaloDNN/ClassificationExperiment.py**: This is the “experiment” that drives everything.

   **CaloDNN/ClassificationArguments.py**: This is the file where the command-line arguments (the ones listed by `--help`) are defined and parsed. You can add your own options here if need be. Some defaults are also defined here.

   **CaloDNN/ClassificationScanConfig.py**: This is the configuration file, where the model and experiment parameters are set. This example is set up to allow hyper-parameter scanning. It also contains the list of input files, maps the files and datasets to classes, and controls which variables are used in the neural network.

   **CaloDNN/Models.py**: This contains the Keras models, wrapped in a DLKit ModelWrapper class.

   **CaloDNN/LCDData.py**: Contains the DLGenerators to read the data.
   
Typically we run these experiments from the shell command prompt (e.g. here getting help):

    python -m CaloDNN.ClassificationExperiment --help
    
But we can also do it in our current Jupyter session as follows:


In [None]:
%run -im CaloDNN.ClassificationExperiment -- --help

The experiment has 4 steps:

    1. Setup Loading Data
    2. Load or Build Model
    3. Train Model
    4. Run Analysis 

You can turn off steps as needed using the flags above. 

You can run a quick test (reduced number of events and epochs) as follows:

In [None]:
%run -im CaloDNN.ClassificationExperiment -- --Test

### Configuration and Hyperparameter Scanning

Our DNN architecture is defined by the types, number, and dimension of layers. Hyper-parameter
scanning refers to the process of searching for an optimal architecture that performs well for a
task and can be trained and applied within reasonable time. Beyond the parameters that define the
DNN architecture, other configuration parameters allow setting and testing activation and cost
functions, optimization (e.g. minimization) techniques, and other training parameters such as the learning rate.

In DLKit, these parameters are set in a configuration file, which defines a single Python key/value
dictionary called **Config**. DLKit puts the contents of this dictionary in the global scope with the
keys as the variable names. As an example, see `CaloDNN/ClassificationScanConfig.py`:

```
Config={
    "MaxEvents":int(3.e6),
    "NTestSamples":100000,
    "NClasses":4,

    "Epochs":1000,
    "BatchSize":1024,

...

    # Configure Running time callback
    # Set RunningTime to a value to stop training after N seconds.
    "RunningTime": 2*3600,

    # Load last trained version of this model configuration. (based on Name var below)
    "LoadPreviousModel":True
    }
```

These parameters are fixed and will be used by the Experiment to build the model. 
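The idea of exposing dictionary keys as variables can be illustrated in a few lines of plain Python. This is a minimal sketch of the mechanism, not DLKit's actual loader: `load_config` is a hypothetical helper, and the namespace is a separate dictionary rather than the real global scope so that the example stays self-contained.

```python
# Hypothetical helper: copy every key/value pair of a Config dictionary
# into a namespace so that the keys can be used as variable names.
def load_config(config, namespace):
    namespace.update(config)
    return namespace


Config = {
    "NClasses": 4,
    "Epochs": 1000,
    "BatchSize": 1024,
}

scope = load_config(Config, {})

# Code executed in that namespace can refer to the keys as plain variables.
exec("summary = 'classes=%d, batch=%d' % (NClasses, BatchSize)", scope)
print(scope["summary"])
```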

An important parameter in this configuration file is `RunningTime`, which sets the maximum duration of the training in seconds. Using this parameter, you can train a model for a fixed amount of time. You can rerun the job to continue training: with the `LoadPreviousModel` parameter set to `True`, the last successfully trained version of the model is loaded automatically.
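A time budget of this kind is typically enforced by checking the elapsed time at the end of each epoch. The sketch below shows the idea in plain Python. The `TimeBudget` class and the stand-in training loop are assumptions made for illustration; they are not the callback that DLKit or Keras actually uses.

```python
import time


class TimeBudget:
    """Hypothetical helper that reports when a time budget is used up."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self.start = clock()

    def expired(self):
        return self.clock() - self.start >= self.seconds


def train(max_epochs, budget, train_one_epoch):
    """Run epochs until max_epochs is reached or the budget expires."""
    completed = 0
    for _ in range(max_epochs):
        train_one_epoch()
        completed += 1
        if budget.expired():
            break  # stop after the epoch that crossed the limit
    return completed


# Stand-in for real training: a fake clock that advances 40 minutes per epoch.
fake_now = [0.0]


def fake_clock():
    return fake_now[0]


def fake_epoch():
    fake_now[0] += 40 * 60


epochs_done = train(1000, TimeBudget(2 * 3600, clock=fake_clock), fake_epoch)
print("epochs completed:", epochs_done)
```

Because the check happens only between epochs, a run can overshoot the budget by up to one epoch.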

[`CaloDNN/ClassificationScanConfig.py`](https://github.com/UTA-HEP-Computing/CaloDNN/blob/master/ClassificationScanConfig.py) is well commented. We suggest you read through the comments.

For hyper-parameter scanning, it would be cumbersome to generate a new configuration file for every
network we would like to try. Instead, **ClassificationScanConfig.py** uses a second dictionary, **Params**, to specify
the parameters that you would like to scan, and the **DLTools.Permutator** class to generate all possible
resulting configurations. For example, the following lines:

```
# Parameters to scan and their scan points.
Params={    "optimizer":["'RMSprop'","'Adam'","'SGD'"],
            "Width":[32,64,128,256,512],
            "Depth":range(1,5),
            "lr":[0.01,0.001],
            "decay":[0.01,0.001],
          }
```

will generate 3 x 5 x 4 x 2 x 2 = 240 different configurations, which we can enumerate. To check, we can
simply run the **ClassificationScanConfig.py** file:

In [None]:
%run -m CaloDNN.ClassificationScanConfig

This should tell you the number of possible configurations. We select a
specific one using the **-s** flag when running the experiment:

In [None]:
%run -im CaloDNN.ClassificationExperiment -- --Test -s 10
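The counting above can be reproduced with the standard library. The sketch below uses `itertools.product` as a stand-in for `DLTools.Permutator`. The real class may order the configurations differently, so index 10 here need not match what `-s 10` selects in CaloDNN.

```python
import itertools

# Scan points copied from the Params dictionary shown earlier.
Params = {
    "optimizer": ["'RMSprop'", "'Adam'", "'SGD'"],
    "Width": [32, 64, 128, 256, 512],
    "Depth": range(1, 5),
    "lr": [0.01, 0.001],
    "decay": [0.01, 0.001],
}

names = list(Params)

# One dictionary per combination of scan points.
configurations = [
    dict(zip(names, values))
    for values in itertools.product(*(Params[name] for name in names))
]

print(len(configurations))  # 3 * 5 * 4 * 2 * 2 = 240
print(configurations[10])
```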

## Performing a Scan

From the above, it should be apparent that you can easily try all possible configurations by running the same command with every possible value of the `-s` parameter. And since every configuration is independent, you can run the experiments in parallel.
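As a rough illustration, the command lines for a full scan differ only in the value passed to `-s`. The sketch below just builds and prints them. `scan_commands` is a hypothetical helper, the count of 240 comes from the `Params` dictionary shown above, and numbering the configurations from 0 is an assumption.

```python
# Hypothetical helper: build one shell command per scan index.
# Assumption: configurations are numbered starting from 0.
def scan_commands(n_configurations):
    return [
        "python -m CaloDNN.ClassificationExperiment -s %d" % index
        for index in range(n_configurations)
    ]


commands = scan_commands(240)
print(len(commands))
print(commands[0])
print(commands[-1])
```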

### PBS/Torque Batch System

On most GPU-equipped clusters, like UTA-DL, a batch system allows you to submit "jobs" into "queues", which then execute each job when appropriate resources become available.

You can get a list of available queues using the `qstat -Q` command:

In [None]:
!qstat -Q

On the UTA-DL cluster, the queues are set up as follows. The `cpu_queue` and `gpu_queue` routing queues send jobs to CPU and GPU resources on each of 5 nodes:

    * thecount: 44-core 10 GPU
    * super: 24-core 4 GPU
    * thingone and thingtwo: 6-core 4 GPU each
    * oscar: 6-core 2 GPU (used for Jupyter sessions)
    
Submitting to the queue system requires you to write a script. For an example, see the script `CaloDNN/ScanJob.sh`.


This script creates a directory to store the `stdout/stderr` output of the job, sets up the environment, and starts the job. To set the `-s` parameter, we use Torque's array job mechanism, which sets the `$PBS_ARRAYID` environment variable to an integer from the range specified during submission.

So for example, to run configurations 10-20, we do (don't run this unless you mean it):

In [None]:
!qsub -q gpu_queue -t 10-20 CaloDNN/ScanJob.sh
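Inside a job, the array index is available as an ordinary environment variable. `ScanJob.sh` is a shell script, but the same lookup can be sketched in Python. The `scan_index` helper and its fallback value are assumptions made for illustration.

```python
import os


# Hypothetical helper: read the Torque array index, falling back to a
# default when the code runs outside of an array job.
def scan_index(environ=os.environ, default=0):
    return int(environ.get("PBS_ARRAYID", default))


# Simulate the environment of the array element with index 12.
print(scan_index({"PBS_ARRAYID": "12"}))

# An environment without the variable yields the default.
print(scan_index({}))
```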

You can monitor your jobs using the `qstat` command:

In [None]:
!qstat

## Analysis

After your jobs start to complete, you can start viewing the performance using:

In [None]:
!python -m DLAnalysis.Scan TrainedModels/

You can explore the performance of all of the models in your scan using the Python notebook [`CaloDNN/AnalyzeScan-OptimizerStudy.ipynb`](https://github.com/UTA-HEP-Computing/CaloDNN/blob/master/AnalyzeScan-OptimizerStudy.ipynb). Simply copy the notebook into your DLKit directory and open the copy in Jupyter:

In [None]:
!cp CaloDNN/AnalyzeScan-OptimizerStudy.ipynb ./AnalyzeScan-MyStudy.ipynb

Similarly, you can use [`CaloDNN/AnalyzePerformance.ipynb`](https://github.com/UTA-HEP-Computing/CaloDNN/blob/master/AnalyzePerformance.ipynb) to study the performance of a specific model in detail.

## The Experiment

The main driver of the experiment, [`CaloDNN/ClassificationExperiment.py`](https://github.com/UTA-HEP-Computing/CaloDNN/blob/master/ClassificationExperiment.py), is well commented. Before you add your own models or modify things, you should carefully read through this file.