# Asimov: Simple Example

This notebook details a simple setup for using the Asimov framework in your own analysis.
A dataset from the Higgs Boson Machine Learning Challenge is used for the demo*.


*Data accessed from the Higgs Boson Machine Learning Challenge (http://opendata.cern.ch/record/328) on 18 September 2018 and converted from .csv to .root for this example.

In [1]:
import os
import sys
from time import strftime

### Import Necessary Modules
Update the Python path to include the directories for Asimov and hepPlotter. Adjust these to match where you checked out the two repositories.

In [2]:
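cwd = os.getcwd()  # assumed to be <...>/asimov/examples
# str.rstrip("examples") strips a set of characters, not a suffix, so take the parent directory instead
hpd = os.path.join(os.path.dirname(cwd), "python/")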
if hpd not in sys.path:
    sys.path.insert(0,hpd)
hpd2 = cwd.replace("asimov/examples","hepPlotter/python/")
if hpd2 not in sys.path:
    sys.path.insert(0,hpd2)

In [3]:
import util
from training import Training
from config import Config
import example_plotlabels as plb

/Users/demarley/Desktop/CERN/CMS/common/asimov/examples /Users/demarley/Desktop/CERN/CMS/common/hepPlotter/python//examples
Welcome to JupyROOT 6.10/02
Using TensorFlow backend.


The Python file `example_plotlabels.py` shows one way to organize the sample and variable labels used on your plots.  
The objects used in this file (`Sample` and `Variable`) are assumed to be available in the Asimov framework.
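
The contents of `example_plotlabels.py` are not reproduced in this notebook. As a rough sketch only, assuming `Sample` and `Variable` are simple containers, such a file might be organized as below. The field names (`name`, `label`, `color`) and the example entries are hypothetical; only the function names `sample_labels()` and `variable_labels()` are taken from how the module is called later in this notebook.

```python
from collections import namedtuple

# Hypothetical stand-ins; the real Sample/Variable objects in Asimov may differ.
Sample = namedtuple("Sample", ["name", "label", "color"])
Variable = namedtuple("Variable", ["name", "label"])


def sample_labels():
    """Return a dict mapping sample name -> Sample (illustrative entries only)."""
    samples = [
        Sample("signal", "Signal", "red"),
        Sample("bckg", "Background", "blue"),
    ]
    return {s.name: s for s in samples}


def variable_labels():
    """Return a dict mapping variable name -> Variable (illustrative entries only)."""
    variables = [
        Variable("DER_mass_MMC", "Estimated Higgs mass [GeV]"),
        Variable("PRI_tau_pt", "Tau pT [GeV]"),
    ]
    return {v.name: v for v in variables}
```

Keeping the labels in one module means the plotting code can look up a display label by sample or variable name instead of hard-coding strings in each plot.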

### Setup Configuration

In [4]:
config   = Config("example_config.txt")  # Set options for asimov
vb       = util.VERBOSE()                # Tool that handles print statements to the console
vb.level = config.verbose_level
vb.initialize()

# Create a new (unique) directory to store the outputs
# I use the current date/time, but this is just a personal preference
date = strftime("%d%b%Y-%H%M")
hep_data_name = config.hep_data.split('/')[-1].split('.')[0]
output = "{0}/{1}".format( config.output_path,hep_data_name)
output += "/training-{0}/".format(date)

vb.INFO("RUN :  Saving output to {0}".format(output))
if not os.path.isdir(output):
    os.makedirs(output)  # same effect as `mkdir -p`, without spawning a shell

 INFO :: RUN :  Saving output to ./example/training-19Sep2018-1253/


### Initialize & Setup Deep Learning class

In [5]:
dnn = Training() # class to do the training

In [6]:
dnn.variable_labels = plb.variable_labels()  # labels for variables
dnn.sample_labels   = plb.sample_labels()    # labels for samples

dnn.hep_data   = config.hep_data
dnn.model_name = config.dnn_data
dnn.msg_svc    = vb
dnn.treename   = config.treename
dnn.useLWTNN   = True
dnn.dnn_name   = "dnn"
dnn.output_dim = config.output_dim
dnn.loss       = config.loss
dnn.init       = config.init
dnn.nNodes     = config.nNodes
dnn.metrics    = config.metrics
dnn.features   = config.features
dnn.epochs     = config.epochs
dnn.optimizer  = config.optimizer
dnn.input_dim  = len(config.features)
dnn.batch_size = config.batch_size
dnn.activations    = config.activation.split(',')
dnn.nHiddenLayers  = config.nHiddenLayers
dnn.earlystopping  = {'monitor':'loss','min_delta':0.0001,'patience':10,'mode':'auto'}
dnn.runDiagnostics = True
dnn.classes = {"bckg":0,"signal":1}
dnn.output_dir = output
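
The keys of the `earlystopping` dictionary above (`monitor`, `min_delta`, `patience`, `mode`) match the argument names of Keras's `EarlyStopping` callback. Whether Asimov forwards them unchanged is an assumption here. The sketch below is not the Keras code; it is a simplified pure-Python illustration of how `min_delta` and `patience` interact when the monitored quantity is a loss. The function name `early_stopping_epoch` is made up for this example.

```python
def early_stopping_epoch(losses, min_delta=0.0001, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if the loss drops below the
    best value seen so far by more than `min_delta`. Training stops once
    `patience` consecutive epochs pass without an improvement.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# A loss that plateaus after the third epoch, with a short patience of 3
stop_at = early_stopping_epoch([0.50, 0.40, 0.35, 0.35, 0.35, 0.35], patience=3)
```

Under this simplified logic, a patience of 10 can never trigger within a 10-epoch run, because the first epoch always counts as an improvement. Lower `patience` or raise `epochs` if you want early stopping to have an effect.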

NB: In the call `dnn.train(ndims=X)`, `ndims` sets the dimensionality of the signal/background feature plots that are produced.  
- `ndims = -1`:    Plot 1D & 2D features for signal and background (can be slow)
- `ndims = 1`:     Plot 1D features only

### Run the training!

In [7]:
dnn.initialize()
dnn.load_data(['target'])  # load HEP data (add 'target' branch to dataframe)
dnn.preprocess_data()      # equal statistics for each class
dnn.train(ndims=1)         # build and train the model!

 INFO :: FOUNDATION : Load HEP data
 INFO :: FOUNDATION : -- pre-training :: features




 INFO :: FOUNDATION : -- pre-training :: correlations
 INFO :: FOUNDATION : -- pre-training :: separations
Train on 293538 samples, validate on 97846 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
 INFO :: DL : Plot the train/test predictions
 INFO :: FOUNDATION : -- post-training :: ROC
 INFO :: FOUNDATION : -- post-training :: History


2018-09-19 12:54:24.030958: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-09-19 12:54:24.030976: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-09-19 12:54:24.030981: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-09-19 12:54:24.030986: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
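
The `preprocess_data()` call in the training cell is commented as giving "equal statistics for each class". How Asimov implements this is not shown in this notebook. One common approach is to downsample every class to the size of the smallest one, sketched below in plain Python. The helper name `balance_classes` and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def balance_classes(labels, seed=0):
    """Return sorted row indices with the same number of rows per class.

    Each class is randomly downsampled (without replacement) to the size
    of the smallest class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    n_keep = min(len(rows) for rows in by_class.values())
    keep = []
    for rows in by_class.values():
        keep.extend(rng.sample(rows, n_keep))
    return sorted(keep)


# Toy target column: four background (0) rows and two signal (1) rows
labels = [0, 0, 0, 0, 1, 1]
idx = balance_classes(labels)
balanced = [labels[i] for i in idx]
```

Downsampling discards events from the larger class. Per-class weights in the loss are an alternative that keeps all events.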


## Plots

Now the training is done!  
Inspect the output plots to assess the performance of the neural network, then modify the configuration file and re-train if necessary.
The model is saved in the output directory (by default in a form usable with the Lightweight NN framework, since `useLWTNN` is set to `True`).

In [8]:
from IPython.display import IFrame

In [9]:
IFrame("{0}acc_epochs.pdf".format(output), width=600, height=300) # doesn't handle '//' in filename

In [10]:
IFrame("{0}loss_epochs.pdf".format(output), width=600, height=300)

In [11]:
IFrame("{0}roc_curve.pdf".format(output), width=600, height=300)

In [12]:
IFrame("{0}hist_DNN_prediction_signal.pdf".format(output), width=600, height=300)

Note: In its current setup, Asimov is built for multi-class classification.
For binary classification you therefore get 2 outputs rather than 1 (the second output is just 1-X, where X is the first). The ROC curve is configured to draw only one curve, but separate plots are produced for the signal prediction and the background prediction. You can ignore the extra ones.
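
The "1-X" relation above is what a two-node softmax output layer would produce. The notebook does not confirm that Asimov's output layer uses softmax, so treat that as an assumption. A minimal sketch with made-up logits:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two output nodes, ordered as in dnn.classes = {"bckg": 0, "signal": 1}
p_bckg, p_signal = softmax([0.3, 1.7])
```

Because the two probabilities sum to one, the background prediction carries no information beyond the signal prediction, which is why a single ROC curve is enough.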