# **Neural Networks**

_Credit:_ this notebook is based on **CMS Data Analysis School** [ML exercise](https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideCMSDataAnalysisSchoolCERN2020MLShortExercise) (CMS restricted) and was developed by [Marcel Rieger](mailto:marcel.rieger@cern.ch).

## **Introduction**

Our case study is the discrimination between jets produced by hadronically decaying top quarks and jets produced by light-flavour quarks or gluons.

If the top quark has a very high transverse momentum, the decay products of the top (one b jet and two quark jets stemming from the decaying W boson), will be merged into one single large jet, which is referred to as a **top jet**. Potentially, this jet can exhibit three distinct, resolvable *sub jets*, whereas a light quark or gluon jet only appears as one single, large jet without any significant substructure.

The different appearance of these jets can be used as a handle to discriminate between them.  Being able to correctly identify top jets, and tell them apart from the overwhelming background of other light-flavored jets, is extremely important for many reasons.

Since the top quark is so heavy, being the only fermion we know of with a mass on the order of the weak scale, several extensions of the Standard Model which attempt to solve the hierarchy problem predict large couplings of new, hitherto unobserved particles to top quarks. Weeding top quark jets out of the ocean of other jets is therefore crucial for many **New Physics** searches!

<center><img src="assets/top_vs_qcd.png" width="60%"/></center>

## **Training data**

The input data consists of jets, originating from either
  - hadronically decaying top quarks (this is our **signal** ✔︎), or
  - dijet QCD events (our **background** ✘),
 
and clustered using the [anti-$k_{T}$ algorithm](https://arxiv.org/abs/0802.1189) with a distance parameter of $R$ = 0.8.

<br />

Data was generated using Pythia & Delphes, configured
  - to collide protons at 14 TeV center-of-mass energy,
  - to generate jets with a $p_{T}$ range of [550, 650] GeV (before hadronization ❗️),
  - **without** mixing in pileup events for simplicity, and
  - with the default ATLAS detector card.

### Input features

Per jet, you are given the four-vectors of up to **200** of its *constituents* (i.e., the particles that form the jet by means of clustering).

   - These up to 800 values define your **input features**.
   - Note that not all jets have that many constituents❗️
   - To spare you the trouble of working with uneven (so-called *jagged*) arrays, these "missing" constituents vectors are filled with zeros.

### Training target

Per jet, you are provided a flag that marks the true origin of the jet:

- `1` for jets from top quark decays
- `0` for light jets from QCD events
 
We want to separate **top-like events** from **QCD background** i.e. to solve a _classification_ problem for jets.

### Diving into the data

Let's check out the data! It is stored in NumPy arrays across several files, with 50k jets per file. This way, prototyping and test runs are way quicker. You are given

- 20 training files (`"train"`)
- 8 validation files (`"valid"`)
- 8 testing files (`"test"`)  

For a lightweight version we're going to use two files: one for training, the other for validation. You can find the full dataset using the following [link](https://drive.google.com/drive/folders/1LzfFSglalmV9jICKj0uI4SWdMOE00-1d?usp=sharing).

A few tools for recurring tasks such as data loading are available in the dedicated `dasml` package. Let's import all packages we need during this exercise, load two training files, and inspect the contents.

In [None]:
# check the tf version
import tensorflow as tf
print(tf.__version__)
import numpy as np
print(np.__version__)

In [None]:
# load the dasml package and other software
import dasml
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.metrics import roc_curve, roc_auc_score
from tqdm.auto import tqdm
import ipywidgets
import livelossplot as llp

In [None]:
# load the content of two "train" files
c_vectors, true_vectors, labels = dasml.data.load("train", start_file=0, stop_file=1)
c_vectors.shape, true_vectors.shape, labels.shape

In [None]:
print(c_vectors[:4])

All arrays have 100k (2 x 50k) *rows* (dimension 0).

- Per jet, we have up to 200 constituents (`c_vectors`) with 4 variables ($E$, $p_x$, $p_y$, $p_z$) each, thus `(200, 4)`.
- Consistently, `true_vectors` only has 4 values per jet.
- The `labels`, however, are single values.

Let's create a few plots to get some insights into our data.

In [None]:
# define some flags to make four-vector element access more verbose
E, PX, PY, PZ = range(4)
print('E={}, PX={}, PY={}, PZ={} '.format(E,PX,PY,PZ))

In [None]:
# define a histogram helper
def plot_hist(arr, names=None, xlabel=None, xlim=None, ylabel="Entries", filename=None, legend_loc="upper center", **kwargs):
    kwargs.setdefault("bins", 20)
    kwargs.setdefault("alpha", 0.7)
   
    # consider multiple arrays and names given as a tuple
    arrs = arr if isinstance(arr, tuple) else (arr,)
    names = names or (len(arrs) * [""])

    # start plot
    fig, ax = plt.subplots()
    for arr, name in zip(arrs, names):
        bin_edges = ax.hist(arr, label=name, **kwargs)[1]
        kwargs["bins"] = bin_edges
    if xlim:
        ax.set_xlim(xlim)
    # legend
    if any(names):
        legend = ax.legend(loc=legend_loc)
        legend.get_frame().set_linewidth(0.0)
    
    # styles and custom adjustments
    ax.tick_params(axis="both", direction="in", top=True, right=True)
    if xlabel:
        ax.set_xlabel(xlabel)
    if ylabel:
        ax.set_ylabel(ylabel)
 
    if filename:
        fig.savefig(filename)
    
    return fig

#### Truth distributions

In [None]:
# distribution of truth labels
fig = plot_hist(labels, xlabel="Label distribution")

In [None]:
# energy distribution of the true top quark particle
# remember, this is only available for top jets (zero otherwise)
is_top = labels == 1
fig = plot_hist(true_vectors[is_top, E], xlabel="True energy / GeV")
fig = plot_hist(true_vectors[np.logical_not(is_top), E], xlabel="True energy / GeV")

In [None]:
# px distribution of the true particle
fig = plot_hist(true_vectors[is_top, PX], xlabel="True $p_x$ / GeV")


In [None]:
# mass distribution of the true particle
hist_kwargs = {'bins':300}
mass = (true_vectors[:, E]**2 - np.sum(true_vectors[:, PX:]**2, axis=1))**0.5 
fig = plot_hist(mass[is_top], xlabel="True mass / GeV",xlim=[160,180],**hist_kwargs)

Yep, this is a top!
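The mass computation relies on the relation $m = \sqrt{E^2 - |\vec{p}|^2}$ (natural units, $c = 1$). As a small cross-check, the sketch below builds toy four-vectors from an assumed mass of 172.5 GeV and recovers it. The helper name `invariant_mass` and all numbers are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np

def invariant_mass(four_vectors):
    """Mass from rows of (E, px, py, pz), natural units (c = 1)."""
    e = four_vectors[:, 0]
    p2 = np.sum(four_vectors[:, 1:] ** 2, axis=1)
    # clip tiny negative values that can arise from floating point rounding
    return np.sqrt(np.clip(e ** 2 - p2, 0.0, None))

m_in = 172.5  # assumed toy mass in GeV
p = np.array([[300.0, -400.0, 100.0],   # a boosted particle
              [0.0, 0.0, 0.0]])         # a particle at rest
e = np.sqrt(m_in ** 2 + np.sum(p ** 2, axis=1))
toy = np.column_stack([e, p])

m_out = invariant_mass(toy)
print(m_out)
```

The clipping guards against `NaN` values for vectors that are numerically massless or, as for the zero-filled truth vectors of light jets, empty.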

#### Input feature distributions

In [None]:
# number of constituents per jet
# remember, missing constituents are filled with zeros, so we take the energy value as a marker
n_c = np.count_nonzero(c_vectors[:, :, E], axis=1)
fig = plot_hist(n_c, xlabel="N constituents per jet")

In [None]:
# energy distribution of all constituents
e_c = c_vectors[:, :, E].flatten()
# store a mask to remove zeros
non_zero = (e_c != 0)
fig = plot_hist(e_c[non_zero], log=True, xlabel="Constituents energy / GeV")

In [None]:
# px distribution of all constituents, zeros removed with the mask defined above
px_c = c_vectors[:, :, PX].flatten()
fig = plot_hist(px_c[non_zero], log=True, xlabel="Constituents $p_x$ / GeV")

In [None]:
# pz distribution of all constituents
pz_c = c_vectors[:, :, PZ].flatten()
fig = plot_hist(pz_c[non_zero], log=True, xlabel="Constituents $p_z$ / GeV")

### Lessons learned

Although you were promised *up to* 200 constituents per jet, only a few jets seem to have more than 100 constituents!

Expect such *findings*, but don't interpret them as bad intention 😉 The work packages of large-scale analyses are often shared among multiple people, working groups and institutes. Staying on top of things is naturally a complex task, so communication and documentation are, as always, key!

Ok, so now that we understood the data, it would not make sense to include all these zeros in a network training. We can safely pick only the first, say, **120 constituents**.
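Before fixing a cut such as 120, it can help to quantify how many jets would lose constituents. The sketch below does this on randomly generated stand-in data; the helper `coverage` and the toy multiplicities are assumptions for illustration only, so the printed fractions say nothing about the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for the real data: 1000 jets with up to 200 constituent slots
n_jets, max_c = 1000, 200
n_real = rng.integers(10, 130, size=n_jets)   # made-up multiplicities
toy = np.zeros((n_jets, max_c, 4), dtype=np.float32)
for i, n in enumerate(n_real):
    toy[i, :n, 0] = 1.0                       # non-zero energy marks a real constituent

def coverage(c_vectors, n_keep):
    """Fraction of jets whose constituents all fit into the first n_keep slots."""
    n_c = np.count_nonzero(c_vectors[:, :, 0], axis=1)
    return float(np.mean(n_c <= n_keep))

for n_keep in (80, 100, 120):
    print(n_keep, coverage(toy, n_keep))
```

Applied to the real `c_vectors`, the same function would show what fraction of jets is fully contained in the first 120 slots.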

## **Keras overview: a minimal workflow**

**Keras** is a deep learning interface written in Python, running on top of the machine learning platform **TensorFlow**. It was developed with a focus on enabling fast experimentation.

Keras operates with **layers** and **models**. To build the simplest model, we need to specify an input layer and the internal layers.



Before creating a full-blown training setup, let's first do a quickshot. This helps us to understand how a model is built, trained, and eventually evaluated. We can also already define a few plot methods to assess the performance.

For this purpose, we use TensorFlow with the Keras high level API in its [functional version](https://keras.io/guides/functional_api).

In [None]:
# Signature of the input layer constructor
# (for illustration only, not meant to be executed as-is)
tf.keras.Input(
    shape=None,
    batch_size=None,
    name=None,
    dtype=None,
    sparse=False,
    tensor=None,
    ragged=False,
    **kwargs
)

In [None]:
# Signature of the densely-connected NN layer constructor
# (for illustration only, not meant to be executed as-is)
tf.keras.layers.Dense(
    units,
    activation=None,
    use_bias=True,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    kernel_regularizer=None,
    bias_regularizer=None,
    activity_regularizer=None,
    kernel_constraint=None,
    bias_constraint=None,
    **kwargs
)

Dense implements the operation: <span style="color:blue">output = activation(dot(input, kernel) + bias) </span> where <span style="color:blue">activation</span> is the element-wise activation function passed as the activation argument, <span style="color:blue">kernel</span> is a weights matrix created by the layer, and <span style="color:blue">bias</span> is a bias vector.
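To make the formula concrete, here is a minimal NumPy sketch of the same operation for a batch of inputs. The function `dense_forward` and the random weights are illustrative assumptions. Keras creates and initializes its own kernel and bias.

```python
import numpy as np

def dense_forward(inputs, kernel, bias, activation=np.tanh):
    """output = activation(dot(input, kernel) + bias) for a batch of inputs."""
    return activation(inputs @ kernel + bias)

rng = np.random.default_rng(1)
batch = rng.normal(size=(3, 480))                  # 3 examples with 480 features each
kernel = rng.normal(scale=0.05, size=(480, 128))   # maps 480 features to 128 units
bias = np.zeros(128)                               # one bias value per unit

out = dense_forward(batch, kernel, bias)
print(out.shape)
```

The kernel shape `(480, 128)` shows where the number of trainable parameters comes from: 480 x 128 weights plus 128 biases for this single layer.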

In [None]:
# define the model generating function
# - 2 hidden layers
# - 128 units each
# - tanh activation
# - 2 output units with softmax activation
# (applies exp() to outputs and normalizes sum of all outputs to 1)
def create_model():
    x = tf.keras.Input(shape=(480,))
    #Then the first internal layer takes input layer as an input
    a1 = tf.keras.layers.Dense(128, use_bias=True, activation="tanh")(x)
    a2 = tf.keras.layers.Dense(128, use_bias=True, activation="tanh")(a1)
    y = tf.keras.layers.Dense(2, use_bias=True, activation="softmax")(a2)
    return tf.keras.Model(inputs=x, outputs=y, name="toptagging_quickshot")

In [None]:
# create the actual model
model = create_model()
model.summary()

In [None]:
# let's see what happens when we call it with zeros
# note: here we create zeros with the shape (1, 480)
# where the leading one marks the *batch size*,
# i.e. the number of examples that are simultaneously fed
# into the network to benefit from clever vectorization
t = model.predict(np.zeros((1, 480)))
print(type(t))
print(t)

This produces a NumPy array with one prediction per input example. The computation is done in batches, and the method is designed for performance on large inputs.

The return value is `[0.5, 0.5]`. This means that, given a vector of input features consisting only of zeros, the network is unsure whether to assign it to the signal class (top jets) or to the background class (light jets). This is expected: with all-zero inputs and zero-initialized biases, both output units receive the same value (zero), and the softmax turns this into equal probabilities. In any case, we haven't trained the network yet. So let's do that!
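The softmax step can be reproduced by hand. The following NumPy sketch (the function name `softmax` is our own, not a Keras call) shows that two identical inputs always yield equal probabilities, while a larger second input shifts the probability towards index 1.

```python
import numpy as np

def softmax(z):
    """Apply exp() and normalize so that the outputs of each row sum to 1."""
    z = z - np.max(z, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

equal = softmax(np.zeros((1, 2)))
shifted = softmax(np.array([[0.0, 2.0]]))
print(equal)
print(shifted)
```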

In [None]:
# first, we define a preprocessing function that (e.g.) takes the
# constituents and returns another representation of them
# in this case, we select only the first 120 constituents and
# flatten the resulting array from (..., 120, 4) to (..., 480)
def preprocess_constituents(constituents):
    return constituents[:, :120].reshape((-1, 480))

In [None]:
# also, for the training we need to convert the label to a "one-hot" representation
# 0. -> [1., 0.]
# 1. -> [0., 1.]
def labels_to_onehot(labels):
    labels = labels.astype(np.int32)
    onehot = np.zeros((labels.shape[0], labels.max() + 1), dtype=np.float32)
    onehot[np.arange(labels.shape[0]), labels] = 1
    return onehot

In [None]:
# load more training, and also validation data
c_vectors_train, _, labels_train = dasml.data.load("train", stop_file=1)
c_vectors_valid, _, labels_valid = dasml.data.load("valid", stop_file=1)

# run the preprocessing
c_vectors_train = preprocess_constituents(c_vectors_train)
c_vectors_valid = preprocess_constituents(c_vectors_valid)

# create one-hot labels
labels_train = labels_to_onehot(labels_train)
labels_valid = labels_to_onehot(labels_valid)

In [None]:
# compile the model
# this means that the internal computational graph structure is built,
# the loss function (the function that provides the feedback by comparing
# expected and predicted result, more on that later), and metrics are
# registered that are shown during the training
model.compile(
    # the model already applies a softmax, so the loss receives probabilities, not logits
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"],
)

In [None]:
# start the training for 5 epochs (running through all data 5 times)
model.fit(
    c_vectors_train,
    labels_train,
    batch_size=200,
    epochs=5,
    callbacks=[llp.PlotLossesKerasTF(outputs=[llp.outputs.MatplotlibPlot(cell_size=(4, 2))])],
)

We ended up with an accuracy of about 70%, which is already quite good for such a small network (and lots of important things we have not even considered yet ...)!

Let's check if the model generalized by evaluating the validation data and manually computing the accuracy.

In [None]:
# evaluate all training and validation data again for further study
predictions_train = model.predict(c_vectors_train)
predictions_valid = model.predict(c_vectors_valid)

In [None]:
# determine the accuracy
def calculate_accuracy(labels, predictions):
    # the labels are one-hot encoded NumPy arrays, and each prediction
    # (also a NumPy array, as returned by model.predict) consists of two numbers whose sum is 1,
    # so we interpret the prediction to be the signal when the second value (index 1) is > 0.5
    # hence, we can use argmax
    predicteds_top = np.argmax(predictions, axis=-1) == 1
    labels_top = labels[:, 1] == 1
    return (predicteds_top == labels_top).mean()

In [None]:
acc_train = calculate_accuracy(labels_train, predictions_train)
acc_valid = calculate_accuracy(labels_valid, predictions_valid)

print(f"train accuracy: {acc_train:.4f}")
print(f"valid accuracy: {acc_valid:.4f}")

This looks fairly similar, so for now, we don't seem to experience overtraining.

We proceed by taking a look at the output distributions of the validation dataset, separated into signal and background components. Since we are dealing with a binary classification, and the sum of the two output values is normalized to one, it is sufficient to inspect just one of the output nodes. Since our goal is to identify signal, we look at the second column with index 1 (note that the same is considered in the accuracy calculation above).

In [None]:
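# select the signal output (index 1) and split it by the true class
# (labels_valid is one-hot encoded, so column 1 marks top jets)
is_top_valid = labels_valid[:, 1] == 1
fig = plot_hist(
    (predictions_valid[~is_top_valid, 1], predictions_valid[is_top_valid, 1]),
    names=("Light jets", "Top jets"),
    xlabel="Output distribution",
)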

Besides the classification accuracy, we can study the *receiver operating characteristic* curve or **ROC** curve. It shows the relation between the true positive rate (jets *correctly* identified as top jets) and the false positive rate (light jets *mistaken* for top jets).

In [None]:
# helper to draw a ROC curve
def plot_roc(labels, predictions, names=None, xlim=(0.01, 1), ylim=(0.01, 1)):   
    # start plot
    fig, ax = plt.subplots()
    # the curve below is drawn as (fpr, tpr), so label the axes accordingly
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.tick_params(axis="both", direction="in", top=True, right=True)
    ax.set_xticks([0.2, 0.4, 0.6, 0.8, 1.0])
    ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
    ax.set_xlim(left=xlim[0], right=xlim[1])
    ax.set_ylim(bottom=ylim[0], top=ylim[1])
    ax.set_aspect(1)
    plots = []

    # treat labels and predictions as tuples
    labels = labels if isinstance(labels, tuple) else (labels,)
    predictions = predictions if isinstance(predictions, tuple) else (predictions,)
    names = names or (len(labels) * [""])
    for l, p, n in zip(labels, predictions, names):
        # linearize
        l = l[:, 1]
        p = p[:, 1]

        # create the ROC curve and get the AUC
        fpr, tpr, _ = roc_curve(l, p)
        auc = roc_auc_score(l, p)
        
        # drop points with a true positive rate below the lower axis limit
        fpr = fpr[tpr > xlim[0]]
        tpr = tpr[tpr > xlim[0]]

        # plot
        plot_name = (n and (n + ", ")) + "AUC {:.3f}".format(auc)
        plots.extend(ax.plot(fpr, tpr, label=plot_name))

    # legend
    legend = ax.legend(plots, [p.get_label() for p in plots], loc="upper right")
    legend.get_frame().set_linewidth(0.0)

    return fig

In [None]:
# do the roc plot
fig = plot_roc(
    (labels_train, labels_valid),
    (predictions_train, predictions_valid),
    names=("train", "valid"),
)
fig.set_size_inches(6,6) 


In [None]:
# helper to draw a ROC curve as background rejection vs. signal efficiency
def plot_log_roc(labels, predictions, names=None, xlim=(0.01, 1), ylim=(1, 1e2)):   
    # start plot
    fig, ax = plt.subplots()
    ax.set(xlabel='Signal efficiency '+ r'$[\varepsilon_{S}]$', ylabel='Background rejection ' + r'$[1/\varepsilon_{B}]$')
    ax.set_yscale("log")
    ax.tick_params(axis="both", direction="in", top=True, right=True)
    ax.set_xticks([0.2, 0.4, 0.6, 0.8, 1.0])
    ax.set_xlim(left=xlim[0], right=xlim[1])
    ax.set_ylim(bottom=ylim[0], top=ylim[1])
    plots = []

    # treat labels and predictions as tuples
    labels = labels if isinstance(labels, tuple) else (labels,)
    predictions = predictions if isinstance(predictions, tuple) else (predictions,)
    names = names or (len(labels) * [""])
    for l, p, n in zip(labels, predictions, names):
        # linearize
        l = l[:, 1]
        p = p[:, 1]

        # create the ROC curve and get the AUC
        fpr, tpr, _ = roc_curve(l, p)
        auc = roc_auc_score(l, p)
        
        # keep only points above the lower x limit and with a non-zero
        # false positive rate to prevent zero division warnings below
        mask = (tpr > xlim[0]) & (fpr > 0)
        fpr = fpr[mask]
        tpr = tpr[mask]

        # plot
        plot_name = (n and (n + ", ")) + "AUC {:.3f}".format(auc)
        plots.extend(ax.plot(tpr, 1. / fpr, label=plot_name))

    # legend
    legend = ax.legend(plots, [p.get_label() for p in plots], loc="upper right")
    legend.get_frame().set_linewidth(0.0)

    return fig

In [None]:
#plot fancy roc curve
fig = plot_log_roc(
    (labels_train, labels_valid),
    (predictions_train, predictions_valid),
    names=("train", "valid"),
)
fig.set_size_inches(6,6) 

The curves above are produced by scanning potential values to cut on the network output and examining the resulting signal classification (true positive) and background mis-classification (false positive) rates.
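The scan itself is simple enough to write by hand. The sketch below uses made-up scores for eight toy jets and a hypothetical helper `scan_cuts`. In the notebook, `roc_curve` from scikit-learn does this job and picks the thresholds automatically.

```python
import numpy as np

def scan_cuts(is_signal, scores, cuts):
    """Return true and false positive rates for each cut on the network output."""
    is_signal = np.asarray(is_signal, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    tpr, fpr = [], []
    for cut in cuts:
        selected = scores > cut                    # jets tagged as top
        tpr.append(np.mean(selected[is_signal]))   # signal efficiency
        fpr.append(np.mean(selected[~is_signal]))  # background mis-tag rate
    return np.array(tpr), np.array(fpr)

# toy example: 4 light jets followed by 4 top jets, scores are made up
is_signal = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9])

tpr, fpr = scan_cuts(is_signal, scores, cuts=[0.25, 0.5, 0.75])
print(tpr)  # [1.   0.75 0.5 ]
print(fpr)  # [0.5  0.25 0.  ]
```

Tightening the cut lowers both rates, and the ROC curve is the set of all such (rate, rate) pairs.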

Naturally, a well performing network has a high true positive rate while keeping the (reciprocal) false positive rate at a reasonably low (high) level. For the choice of the axes above, this would lead to a curve that is bent towards the upper right corner. But be aware that other representations of the ROC curve exist which might look somewhat different (e.g. "1 - false positive rate" on the y-axis). Their message is, however, identical.

A commonly used proxy that condenses the values for all possible cuts into one metric is the area under the curve, or **AUC**. A value of 1 signals a perfectly working network that allows for a cut value leading to 100% signal efficiency and 0% background contamination. In contrast, a value of 0.5 means that the network output provides no separation: the output distributions of signal and background examples overlap to the extent that no cut favors signal examples. A value of 0 has the same logical meaning as 1, but the definition of what is signal and background is flipped. Therefore, the distance from 0.5 is what actually matters here.

A value around 0.75 is already quite decent, but there's still potential. You can try to beat this value in the full training setup below.
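One way to build intuition for the AUC: it equals the probability that a randomly chosen signal example receives a higher score than a randomly chosen background example (ties counted as one half). The sketch below computes it this way on made-up scores and shows that flipping the scores maps the AUC to one minus its value. The helper `auc_pairwise` is illustrative. The notebook itself uses `roc_auc_score`.

```python
import numpy as np

def auc_pairwise(is_signal, scores):
    """AUC as the fraction of (signal, background) pairs that are ranked correctly."""
    is_signal = np.asarray(is_signal, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    sig = scores[is_signal][:, None]    # column vector of signal scores
    bkg = scores[~is_signal][None, :]   # row vector of background scores
    return float(np.mean((sig > bkg) + 0.5 * (sig == bkg)))

# toy example: 4 light jets followed by 4 top jets, scores are made up
is_signal = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.4, 0.7, 0.3, 0.6, 0.8, 0.9])

auc = auc_pairwise(is_signal, scores)
auc_flipped = auc_pairwise(is_signal, 1.0 - scores)
print(auc)          # 13 of 16 pairs are ordered correctly -> 0.8125
print(auc_flipped)  # 0.1875
```

The pairwise comparison needs memory proportional to the number of signal times background examples, so it is only meant for small illustrations like this one.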

### Lessons learned

- Now we know how to build a simple model using TensorFlow and Keras.
- We learned how to one-hot encode labels.
- We performed a quick training using the `fit()` method of Keras models.
- To ensure model generalization, we evaluated validation data with our trained model.
- We calculated accuracies and visualized the output distributions.
- We learned about ROC curves, AUC values and how to plot / compute them.

With these tools at hand, we can jump into the next section and build a custom training loop.

## **Training++**

To get more insights into the actual neural network training process, we will use Keras only to compose the model. For preprocessing, the definition of losses, and the training loop, we will use bare TensorFlow operations and tools.

Also, we reconsider some of the choices we made above and incorporate a few techniques that improve the network training.

Here are a few TensorFlow resources that might help you in the process:

- [Datasets](https://www.tensorflow.org/api_docs/python/tf/data/Dataset)
- [Loading NumPy data](https://www.tensorflow.org/tutorials/load_data/numpy)
- [Keras layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers)
- [Eager execution](https://www.tensorflow.org/guide/eager)
- [Gradient tape and differentiation](https://www.tensorflow.org/guide/autodiff)
- [Graphs and introduction to `tf.function`](https://www.tensorflow.org/guide/intro_to_graphs)
- [Better performance with `tf.function`](https://www.tensorflow.org/guide/function)
- [Training loop from scratch](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch)
- [TensorFlow 2 tutorial held at CERN](https://indico.cern.ch/event/882992/contributions/3721506/attachments/1994721/3327402/TensorFlow_2_Workshop_CERN_2020.pdf)

### Eager execution and graphs

Just like with NumPy, we can interactively work with TensorFlow tensors. Each operation is executed *eagerly* as soon as the interpreter reaches and evaluates that line.

In [None]:
t = tf.range(0., 10.)
print(t)
t = t * 2
print(t)
t = t + 1
print(type(t))

Actually, one might not be interested in intermediate results, so the outcome of `t = t * 2` in line 3 is perhaps not required. Also, imagine the operation `(t * 2) + 1` is executed on a GPU (Graphics Processing Unit). The content of `t` (not that many bytes in this example, but tensors can easily reach a couple of MBs) is transferred to the GPU, together with the instructions to multiply each value by 2 and then add 1. The output of this computation is sent back to the CPU where (e.g.) the Python interpreter can print the numbers.

There is obviously no need to send back the result of `t * 2`. However, this is exactly what would happen in the example above. While this is a nice and intuitive way to prototype a new model, we somehow need a way to tell TensorFlow to compute a set of instructions as a whole, and that we are only interested in the final result. This is where **graphs** enter the equation.

A computational graph describes the symbolic instructions that should be performed on certain input tensors (orange) to produce the result of a complex computation. These instructions are represented by `tf.Operation` objects (green), while the data flowing between them is contained in `tf.Tensor` objects (purple). The graph of the computation above would look like this:

[![](https://mermaid.ink/img/eyJjb2RlIjoiZ3JhcGggTFJcbkFbcmFuZ2UgMCAtIDEwXVxuQltjb25zdGFudCAyXVxuQ1tjb25zdGFudCAxXVxuTXttdWx9XG5Oe2FkZH1cbkRbdCddXG5FW3QnJ11cbkEgJiBCIC0tPiBNXG5NIC0tPiBEXG5EICYgQyAtLT4gTlxuTiAtLT4gRVxuc3R5bGUgQSBmaWxsOiNmOTZcbnN0eWxlIEIgZmlsbDojZjk2XG5zdHlsZSBDIGZpbGw6I2Y5Nlxuc3R5bGUgTSBmaWxsOiNiZGFcbnN0eWxlIE4gZmlsbDojYmRhIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQiLCJ0aGVtZVZhcmlhYmxlcyI6eyJiYWNrZ3JvdW5kIjoid2hpdGUiLCJwcmltYXJ5Q29sb3IiOiIjRUNFQ0ZGIiwic2Vjb25kYXJ5Q29sb3IiOiIjZmZmZmRlIiwidGVydGlhcnlDb2xvciI6ImhzbCg4MCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwicHJpbWFyeUJvcmRlckNvbG9yIjoiaHNsKDI0MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJzZWNvbmRhcnlCb3JkZXJDb2xvciI6ImhzbCg2MCwgNjAlLCA4My41Mjk0MTE3NjQ3JSkiLCJ0ZXJ0aWFyeUJvcmRlckNvbG9yIjoiaHNsKDgwLCA2MCUsIDg2LjI3NDUwOTgwMzklKSIsInByaW1hcnlUZXh0Q29sb3IiOiIjMTMxMzAwIiwic2Vjb25kYXJ5VGV4dENvbG9yIjoiIzAwMDAyMSIsInRlcnRpYXJ5VGV4dENvbG9yIjoicmdiKDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEpIiwibGluZUNvbG9yIjoiIzMzMzMzMyIsInRleHRDb2xvciI6IiMzMzMiLCJtYWluQmtnIjoiI0VDRUNGRiIsInNlY29uZEJrZyI6IiNmZmZmZGUiLCJib3JkZXIxIjoiIzkzNzBEQiIsImJvcmRlcjIiOiIjYWFhYTMzIiwiYXJyb3doZWFkQ29sb3IiOiIjMzMzMzMzIiwiZm9udEZhbWlseSI6IlwidHJlYnVjaGV0IG1zXCIsIHZlcmRhbmEsIGFyaWFsIiwiZm9udFNpemUiOiIxNnB4IiwibGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsIm5vZGVCa2ciOiIjRUNFQ0ZGIiwibm9kZUJvcmRlciI6IiM5MzcwREIiLCJjbHVzdGVyQmtnIjoiI2ZmZmZkZSIsImNsdXN0ZXJCb3JkZXIiOiIjYWFhYTMzIiwiZGVmYXVsdExpbmtDb2xvciI6IiMzMzMzMzMiLCJ0aXRsZUNvbG9yIjoiIzMzMyIsImVkZ2VMYWJlbEJhY2tncm91bmQiOiIjZThlOGU4IiwiYWN0b3JCb3JkZXIiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJhY3RvckJrZyI6IiNFQ0VDRkYiLCJhY3RvclRleHRDb2xvciI6ImJsYWNrIiwiYWN0b3JMaW5lQ29sb3IiOiJncmV5Iiwic2lnbmFsQ29sb3IiOiIjMzMzIiwic2lnbmFsVGV4dENvbG9yIjoiIzMzMyIsImxhYmVsQm94QmtnQ29sb3IiOiIjRUNFQ0ZGIiwibGFiZWxCb3hCb3JkZXJDb2xvciI6ImhzbCgyNTkuNjI2MTY4MjI0MywgNTkuNzc2NTM2MzEyOCUsIDg3LjkwMTk2MDc4NDMlKSIsImxhYmVsVGV4dENvbG9yIjoiYmxhY2siLCJsb29wVGV4dENvbG9yIjoiYmxhY2siLCJub3RlQm9yZGVyQ29sb3IiOiIjYWFhYTMzIiwibm9
0ZUJrZ0NvbG9yIjoiI2ZmZjVhZCIsIm5vdGVUZXh0Q29sb3IiOiJibGFjayIsImFjdGl2YXRpb25Cb3JkZXJDb2xvciI6IiM2NjYiLCJhY3RpdmF0aW9uQmtnQ29sb3IiOiIjZjRmNGY0Iiwic2VxdWVuY2VOdW1iZXJDb2xvciI6IndoaXRlIiwic2VjdGlvbkJrZ0NvbG9yIjoicmdiYSgxMDIsIDEwMiwgMjU1LCAwLjQ5KSIsImFsdFNlY3Rpb25Ca2dDb2xvciI6IndoaXRlIiwic2VjdGlvbkJrZ0NvbG9yMiI6IiNmZmY0MDAiLCJ0YXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwidGFza0JrZ0NvbG9yIjoiIzhhOTBkZCIsInRhc2tUZXh0TGlnaHRDb2xvciI6IndoaXRlIiwidGFza1RleHRDb2xvciI6IndoaXRlIiwidGFza1RleHREYXJrQ29sb3IiOiJibGFjayIsInRhc2tUZXh0T3V0c2lkZUNvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dENsaWNrYWJsZUNvbG9yIjoiIzAwMzE2MyIsImFjdGl2ZVRhc2tCb3JkZXJDb2xvciI6IiM1MzRmYmMiLCJhY3RpdmVUYXNrQmtnQ29sb3IiOiIjYmZjN2ZmIiwiZ3JpZENvbG9yIjoibGlnaHRncmV5IiwiZG9uZVRhc2tCa2dDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQm9yZGVyQ29sb3IiOiJncmV5IiwiY3JpdEJvcmRlckNvbG9yIjoiI2ZmODg4OCIsImNyaXRCa2dDb2xvciI6InJlZCIsInRvZGF5TGluZUNvbG9yIjoicmVkIiwibGFiZWxDb2xvciI6ImJsYWNrIiwiZXJyb3JCa2dDb2xvciI6IiM1NTIyMjIiLCJlcnJvclRleHRDb2xvciI6IiM1NTIyMjIiLCJjbGFzc1RleHQiOiIjMTMxMzAwIiwiZmlsbFR5cGUwIjoiI0VDRUNGRiIsImZpbGxUeXBlMSI6IiNmZmZmZGUiLCJmaWxsVHlwZTIiOiJoc2woMzA0LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTMiOiJoc2woMTI0LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTQiOiJoc2woMTc2LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTUiOiJoc2woLTQsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSIsImZpbGxUeXBlNiI6ImhzbCg4LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTciOiJoc2woMTg4LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkifX0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoiZ3JhcGggTFJcbkFbcmFuZ2UgMCAtIDEwXVxuQltjb25zdGFudCAyXVxuQ1tjb25zdGFudCAxXVxuTXttdWx9XG5Oe2FkZH1cbkRbdCddXG5FW3QnJ11cbkEgJiBCIC0tPiBNXG5NIC0tPiBEXG5EICYgQyAtLT4gTlxuTiAtLT4gRVxuc3R5bGUgQSBmaWxsOiNmOTZcbnN0eWxlIEIgZmlsbDojZjk2XG5zdHlsZSBDIGZpbGw6I2Y5Nlxuc3R5bGUgTSBmaWxsOiNiZGFcbnN0eWxlIE4gZmlsbDojYmRhIiwibWVybWFpZCI6eyJ0aGVtZSI6ImRlZmF1bHQiLCJ0aGVtZVZhcmlhYmxlcyI6eyJiYWNrZ3JvdW5kIjoid2hpdGUiLCJwcmltYXJ5Q29sb3IiOiIjRUNFQ0ZGIiwic2Vjb25kYXJ5Q29sb3IiOiIjZmZmZmRlIiwidGVydGlh
cnlDb2xvciI6ImhzbCg4MCwgMTAwJSwgOTYuMjc0NTA5ODAzOSUpIiwicHJpbWFyeUJvcmRlckNvbG9yIjoiaHNsKDI0MCwgNjAlLCA4Ni4yNzQ1MDk4MDM5JSkiLCJzZWNvbmRhcnlCb3JkZXJDb2xvciI6ImhzbCg2MCwgNjAlLCA4My41Mjk0MTE3NjQ3JSkiLCJ0ZXJ0aWFyeUJvcmRlckNvbG9yIjoiaHNsKDgwLCA2MCUsIDg2LjI3NDUwOTgwMzklKSIsInByaW1hcnlUZXh0Q29sb3IiOiIjMTMxMzAwIiwic2Vjb25kYXJ5VGV4dENvbG9yIjoiIzAwMDAyMSIsInRlcnRpYXJ5VGV4dENvbG9yIjoicmdiKDkuNTAwMDAwMDAwMSwgOS41MDAwMDAwMDAxLCA5LjUwMDAwMDAwMDEpIiwibGluZUNvbG9yIjoiIzMzMzMzMyIsInRleHRDb2xvciI6IiMzMzMiLCJtYWluQmtnIjoiI0VDRUNGRiIsInNlY29uZEJrZyI6IiNmZmZmZGUiLCJib3JkZXIxIjoiIzkzNzBEQiIsImJvcmRlcjIiOiIjYWFhYTMzIiwiYXJyb3doZWFkQ29sb3IiOiIjMzMzMzMzIiwiZm9udEZhbWlseSI6IlwidHJlYnVjaGV0IG1zXCIsIHZlcmRhbmEsIGFyaWFsIiwiZm9udFNpemUiOiIxNnB4IiwibGFiZWxCYWNrZ3JvdW5kIjoiI2U4ZThlOCIsIm5vZGVCa2ciOiIjRUNFQ0ZGIiwibm9kZUJvcmRlciI6IiM5MzcwREIiLCJjbHVzdGVyQmtnIjoiI2ZmZmZkZSIsImNsdXN0ZXJCb3JkZXIiOiIjYWFhYTMzIiwiZGVmYXVsdExpbmtDb2xvciI6IiMzMzMzMzMiLCJ0aXRsZUNvbG9yIjoiIzMzMyIsImVkZ2VMYWJlbEJhY2tncm91bmQiOiIjZThlOGU4IiwiYWN0b3JCb3JkZXIiOiJoc2woMjU5LjYyNjE2ODIyNDMsIDU5Ljc3NjUzNjMxMjglLCA4Ny45MDE5NjA3ODQzJSkiLCJhY3RvckJrZyI6IiNFQ0VDRkYiLCJhY3RvclRleHRDb2xvciI6ImJsYWNrIiwiYWN0b3JMaW5lQ29sb3IiOiJncmV5Iiwic2lnbmFsQ29sb3IiOiIjMzMzIiwic2lnbmFsVGV4dENvbG9yIjoiIzMzMyIsImxhYmVsQm94QmtnQ29sb3IiOiIjRUNFQ0ZGIiwibGFiZWxCb3hCb3JkZXJDb2xvciI6ImhzbCgyNTkuNjI2MTY4MjI0MywgNTkuNzc2NTM2MzEyOCUsIDg3LjkwMTk2MDc4NDMlKSIsImxhYmVsVGV4dENvbG9yIjoiYmxhY2siLCJsb29wVGV4dENvbG9yIjoiYmxhY2siLCJub3RlQm9yZGVyQ29sb3IiOiIjYWFhYTMzIiwibm90ZUJrZ0NvbG9yIjoiI2ZmZjVhZCIsIm5vdGVUZXh0Q29sb3IiOiJibGFjayIsImFjdGl2YXRpb25Cb3JkZXJDb2xvciI6IiM2NjYiLCJhY3RpdmF0aW9uQmtnQ29sb3IiOiIjZjRmNGY0Iiwic2VxdWVuY2VOdW1iZXJDb2xvciI6IndoaXRlIiwic2VjdGlvbkJrZ0NvbG9yIjoicmdiYSgxMDIsIDEwMiwgMjU1LCAwLjQ5KSIsImFsdFNlY3Rpb25Ca2dDb2xvciI6IndoaXRlIiwic2VjdGlvbkJrZ0NvbG9yMiI6IiNmZmY0MDAiLCJ0YXNrQm9yZGVyQ29sb3IiOiIjNTM0ZmJjIiwidGFza0JrZ0NvbG9yIjoiIzhhOTBkZCIsInRhc2tUZXh0TGlnaHRDb2xvciI6IndoaXRlIiwidGFza1RleHRDb2xvciI6IndoaXRlIiwidGFza1RleHREYXJrQ29sb3IiOiJibGFjayIsInRhc2tUZXh0
T3V0c2lkZUNvbG9yIjoiYmxhY2siLCJ0YXNrVGV4dENsaWNrYWJsZUNvbG9yIjoiIzAwMzE2MyIsImFjdGl2ZVRhc2tCb3JkZXJDb2xvciI6IiM1MzRmYmMiLCJhY3RpdmVUYXNrQmtnQ29sb3IiOiIjYmZjN2ZmIiwiZ3JpZENvbG9yIjoibGlnaHRncmV5IiwiZG9uZVRhc2tCa2dDb2xvciI6ImxpZ2h0Z3JleSIsImRvbmVUYXNrQm9yZGVyQ29sb3IiOiJncmV5IiwiY3JpdEJvcmRlckNvbG9yIjoiI2ZmODg4OCIsImNyaXRCa2dDb2xvciI6InJlZCIsInRvZGF5TGluZUNvbG9yIjoicmVkIiwibGFiZWxDb2xvciI6ImJsYWNrIiwiZXJyb3JCa2dDb2xvciI6IiM1NTIyMjIiLCJlcnJvclRleHRDb2xvciI6IiM1NTIyMjIiLCJjbGFzc1RleHQiOiIjMTMxMzAwIiwiZmlsbFR5cGUwIjoiI0VDRUNGRiIsImZpbGxUeXBlMSI6IiNmZmZmZGUiLCJmaWxsVHlwZTIiOiJoc2woMzA0LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTMiOiJoc2woMTI0LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkiLCJmaWxsVHlwZTQiOiJoc2woMTc2LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTUiOiJoc2woLTQsIDEwMCUsIDkzLjUyOTQxMTc2NDclKSIsImZpbGxUeXBlNiI6ImhzbCg4LCAxMDAlLCA5Ni4yNzQ1MDk4MDM5JSkiLCJmaWxsVHlwZTciOiJoc2woMTg4LCAxMDAlLCA5My41Mjk0MTE3NjQ3JSkifX0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)

To declare a computational graph, we can write a function and decorate it with `tf.function`.

In [None]:
@tf.function
def my_func(t):
    print("new graph created", t.dtype, t.shape)
    t = t * 2
    print(t)
    t = t + 1
    print(t)
    return t

t = tf.range(0., 10.)
my_func(t)

As you can see, the returned result is exactly the same, but the intermediate tensors printed inside the function no longer have values attached to them. The first time we called `my_func` (last line of the cell), a concrete graph was created that expects an input tensor with type `float32` and shape `(10,)`. In fact, when we repeat this call with an input tensor of identical type and shape, the Python body of `my_func` is not even executed again. Instead, TensorFlow reuses the previously created graph and executes it.

In [None]:
my_func(tf.range(10., 20.))

No line `new graph created ...` is printed, implying that `my_func` is indeed not called!

However, if we use a tensor with a different type or shape, a new graph is created and stored internally. This powerful feature is called **signature tracing** and you can read more about it [here](https://www.tensorflow.org/guide/function).
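The caching idea behind this can be mimicked in plain Python. The toy decorator below keys a cache on `(dtype, shape)` and only prints a message when it sees a new combination. This is purely an analogy under our own assumptions (the names `cache_by_signature` and `my_toy_func` are made up) and does not reflect how TensorFlow implements tracing internally.

```python
import functools
import numpy as np

def cache_by_signature(func):
    """Toy analogy: do the 'expensive' setup once per (dtype, shape) signature."""
    cache = {}

    @functools.wraps(func)
    def wrapper(arr):
        arr = np.asarray(arr)
        key = (arr.dtype.name, arr.shape)
        if key not in cache:
            print("new signature", key)
            cache[key] = func   # a real tracer would store a compiled graph here
        return cache[key](arr)

    wrapper.signatures = cache  # exposed for inspection
    return wrapper

@cache_by_signature
def my_toy_func(t):
    return t * 2 + 1

my_toy_func(np.arange(0., 10., dtype=np.float32))   # new signature
my_toy_func(np.arange(10., 20., dtype=np.float32))  # same dtype and shape, no message
my_toy_func(np.arange(10., 21., dtype=np.float32))  # new shape
my_toy_func(np.arange(10, 20, dtype=np.int32))      # new dtype
n_signatures = len(my_toy_func.signatures)
print(n_signatures)
```

Three distinct signatures end up in the cache, mirroring the three graphs that the `tf.function` example creates for its three distinct input types and shapes.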

In [None]:
my_func(tf.range(10., 21.))
print("---")
my_func(tf.range(10, 20, dtype=tf.int32))

With these concepts at hand, we can go ahead and start building our data pipeline!

### Data pipeline and preprocessing

Above, we created a method `preprocess_constituents` to select the first 120 constituents per jet and to merge the last two dimensions so that we can feed a NN that expects an input feature vector. This was sufficient as the `model.fit` method knows how to apply batching of input jets and how to repeat the dataset to train for more than one epoch.

Using plain TensorFlow, we use a `tf.data.Dataset` object for this purpose. Let's write a function that returns a dataset for our training.

But now, we include an important aspect of deep learning, namely **feature scaling** (FS)! To introduce FS, we first need to understand the concept of **numerical domains** in the context of NN applications.

Our input data - a selection of four-vectors with values given in GeV - clearly comes from the domain of physics. As we have seen in the plots above, their numerical values range from -500 to 500 for $p_x$ and $p_y$, and up to 2000 for $E$ and $p_z$ values. We can use the abstract term *application domain* to describe these ranges. However, the domain of numbers being passed back and forth through the network is entirely different and can even vary depending on the architecture you pick! A classical feed-forward network, such as the one we created above, and a plethora of techniques that were developed throughout the last decade(s) prefer values to be in a range between, say, -1 and 1, and we can call this the *network domain*. This is just an example and somewhat larger values are certainly fine as well. But still, you get the idea that numerical application and network domains are *entirely different*.

Numerically, the output of our *classification* model is still in the network domain and we simply interpret it as a binary classification decision, so we are safe on this end. Things get a bit more tricky when we perform a *regression* task that should predict the value of a physics quantity👾.
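
To make the two domains concrete, here is a small NumPy sketch of the mapping in both directions. The toy values and the helper names `to_network` and `to_application` are assumptions for illustration and are not taken from the jet dataset. The inverse mapping is the step a regression task would need to turn a network output back into a physics quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "application domain" values in GeV (hypothetical numbers)
pz = rng.normal(loc=0.0, scale=400.0, size=1000)

# statistics measured once, e.g. on the training data
mean, std = pz.mean(), pz.std()

def to_network(values):
    # application domain -> network domain (zero mean, unit standard deviation)
    return (values - mean) / std

def to_application(values):
    # network domain -> application domain (inverse of to_network)
    return values * std + mean

scaled = to_network(pz)
restored = to_application(scaled)

print(f"application domain: min {pz.min():.1f}, max {pz.max():.1f}")
print(f"network domain    : min {scaled.min():.2f}, max {scaled.max():.2f}")
```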

In [None]:
def create_dataset(kind, shuffle=False, repeat=1, batch_size=100, n_constituents=120, seed=None, **kwargs):
    # first, we load the data as before, passing all unresolved keyword arguments
    c_vectors, true_vectors, labels = dasml.data.load(kind, **kwargs)
    
    # next, we measure the mean and standard deviation of the raw input vectors,
    # of course not taking into account missing constituents
    # we will need them for the feature scaling later on
    non_zero = c_vectors[:, :, E].flatten() > 0
    means = tf.constant([
        np.mean(c_vectors[:, :, v].flatten()[non_zero])
        for v in (E, PX, PY, PZ)
    ])
    variances = tf.constant([
        np.var(c_vectors[:, :, v].flatten()[non_zero])
        for v in (E, PX, PY, PZ)
    ])
    stddevs = tf.maximum(variances, 1e-6)**0.5
    
    # then we apply the cut on the first n_constituents per jet
    # this is pretty basic and can happen outside the data pipeline
    c_vectors = c_vectors[:, :n_constituents]
    
    # one-hot encode labels
    labels = labels_to_onehot(labels)
    
    # create a tf dataset
    data = (c_vectors, true_vectors, labels)
    ds = tf.data.Dataset.from_tensor_slices(data)
    
    # in the following, we amend the dataset object using methods
    # that return a new dataset object *without* copying the data
    
    # apply shuffling
    if shuffle:
        ds = ds.shuffle(10 * batch_size, reshuffle_each_iteration=True, seed=seed)
    
    # apply repetition, i.e. start iterating from the beginning when the dataset is exhausted
    ds = ds.repeat(repeat)
    
    # apply batching
    if batch_size < 1:
        batch_size = c_vectors.shape[0]
    ds = ds.batch(batch_size)
    
    # store the original data for later access
    ds._orig_data = data
    
    return ds, means, stddevs

In [None]:
# create a training dataset
dataset_train, means_train, stddevs_train = create_dataset(
    "train", shuffle=True, repeat=-1, batch_size=200, stop_file=1)

# also load all validation data but disable batching for easier handling
dataset_valid, _, _ = create_dataset("valid", batch_size=-1, stop_file=1)

dataset_train
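
The effect of the shuffle, repeat and batch steps is easier to see on a tiny dataset. The following sketch uses hypothetical toy tensors instead of the jet data and applies the same three dataset methods in the same order as `create_dataset`.

```python
import tensorflow as tf

# ten toy "jets" with a single feature each, plus alternating labels (hypothetical data)
features = tf.reshape(tf.range(10, dtype=tf.float32), (10, 1))
labels = tf.range(10) % 2

ds = tf.data.Dataset.from_tensor_slices((features, labels))
ds = ds.shuffle(10, seed=1)  # the buffer covers the full toy dataset
ds = ds.repeat(2)            # two passes over the data; -1 would repeat indefinitely
ds = ds.batch(6)             # the final batch is smaller if the total is not a multiple of 6

# iterate once and record how many examples each batch contains
batch_sizes = [int(x.shape[0]) for x, _ in ds]
print(batch_sizes)
```

Since batching happens after repeating, a batch can contain examples from two consecutive passes over the data.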

### Feature scaling in a custom Keras layer

In [None]:
# define the feature scaling procedure as a custom keras layer
# that has, of course, no weights as it is not trainable
# see https://keras.io/guides/making_new_layers_and_models_via_subclassing for more info

class FeatureScaling(tf.keras.layers.Layer):

    def __init__(self, means, stddevs):
        """
        Constructor. Stores arguments as instance members.
        """
        super(FeatureScaling, self).__init__(trainable=False)

        self.means = means
        self.stddevs = stddevs

    def get_config(self):
        """
        Method that is required for model cloning and saving. It should return a
        mapping of instance member names to the actual members.
        """
        return {"means": self.means, "stddevs": self.stddevs}

    def compute_output_shape(self, input_shape):
        """
        Method that, given an input shape, defines the shape of the output tensor.
        This way, the entire model can be built without actually calling it.
        """
        return (input_shape[0], input_shape[1] * input_shape[2])
    
    def build(self, input_shape):
        """
        Any variables defined by this layer should be created inside this method.
        This helps Keras to defer variable registration to the point where it is
        needed the first time, and in particular not at definition time.
        """
        # nothing to do here as our feature scaling has no trainable parameters

    def call(self, c_vectors):
        """
        Payload of the layer that takes inputs and computes the requested output
        whose shape should match what is defined in compute_output_shape.
        """
        # scale each feature such that it is distributed around 0 with a standard deviation of 1
        # BUT: there are already many zeros in the input features and they have
        #      a distinct meaning (missing constituents); we want to keep this information, so we
        #      shift these values to -3, i.e. 3 standard deviations to the left
        e, px, py, pz = tf.unstack(c_vectors, axis=-1)
        zero_pos = -3. * tf.ones_like(e)
        non_zero = e > 0
        e = tf.where(non_zero, (e - self.means[E]) / self.stddevs[E], zero_pos)
        px = tf.where(non_zero, (px - self.means[PX]) / self.stddevs[PX], zero_pos)
        py = tf.where(non_zero, (py - self.means[PY]) / self.stddevs[PY], zero_pos)
        pz = tf.where(non_zero, (pz - self.means[PZ]) / self.stddevs[PZ], zero_pos)

        # we need to flatten the vectors anyway, so just concatenate the components
        features = tf.concat((e, px, py, pz), axis=-1)
        
        return features
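
The logic of the `call` method can be reproduced in a few lines of NumPy, which is handy for checking shapes and the treatment of missing constituents. This is only a sketch on a hypothetical toy batch. It assumes the component order (E, px, py, pz) along the last axis and, unlike `create_dataset`, measures the statistics on the toy batch itself.

```python
import numpy as np

# toy batch: 2 jets, 3 constituent slots, 4 components (E, px, py, pz)
c_vectors = np.array([
    [[50., 10., -5., 48.], [20., -3., 4., 19.], [0., 0., 0., 0.]],
    [[80., 30., 20., 70.], [0., 0., 0., 0.], [0., 0., 0., 0.]],
])

# a constituent is present if its energy is positive
present = c_vectors[..., 0] > 0                              # shape (2, 3)

# per-component statistics over present constituents only
means = c_vectors[present].mean(axis=0)                      # shape (4,)
stddevs = np.maximum(c_vectors[present].std(axis=0), 1e-3)   # guard against zero width

# standardize present constituents, move missing ones to -3
scaled = np.where(present[..., None], (c_vectors - means) / stddevs, -3.0)

# component-major flattening as in the layer: all E values, then px, py, pz
features = np.concatenate([scaled[..., i] for i in range(4)], axis=-1)
print(features.shape)
```

Note that the flattened vector is ordered by component first and by constituent second, which follows from concatenating the unstacked components.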

### Define the new model

In [None]:
def create_model(input_shape, units=(128, 128, 128), activation="tanh", dropout_rate=0., fs_args=None):
    # track weights for later use
    weights = []
    
    # input layer
    x = tf.keras.Input(input_shape)
    
    # feature scaling
    if not fs_args:
        fs_args = (tf.constant(4 * [0.]), tf.constant(4 * [1.]))
    a = FeatureScaling(*fs_args)(x)

    # add layers programmatically
    for n in units:
        # build the layer
        layer = tf.keras.layers.Dense(n, use_bias=True, activation=activation)
        a = layer(a)

        # store the weight matrix for later use
        weights.append(layer.kernel)

        # add random unit dropout
        if dropout_rate:
            a = tf.keras.layers.Dropout(dropout_rate)(a)

    # add the softmax layer
    y = tf.keras.layers.Dense(2, use_bias=True, activation="softmax")(a)
    
    # build the model
    model = tf.keras.Model(inputs=x, outputs=y, name="toptagging_custom")

    return model, weights

In [None]:
# create the model
model, regularization_weights = create_model((120, 4), fs_args=(means_train, stddevs_train))
model.summary()

### Loss definition

#### Cross entropy

The last ingredient before running the training loop is the definition of the loss function. Since we only use Keras for the model building process above, we are free to use anything we want!

The main component of the loss is - as above - the binary cross entropy (CE) loss, which is a common choice for classification problems that use a softmax activation in the last layer. Although many variations of CE exist (e.g. the group of *focal* losses), we stick with this simple yet powerful formula,
$$
\begin{align}
L_\text{CE}(y, y_t) = -y_t \cdot \log(y)
\end{align}
$$
where $y$ is the NN prediction and $y_t$ is the ground truth.

We could have also used the Keras implementation which, in combination with an *unactivated* output layer, takes a shortcut around applying exponential functions in the output layer and taking logarithms again in the loss. However, we are here to learn so we do this on our own 😉
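
The shortcut mentioned above can be illustrated with NumPy. This is a sketch under stated assumptions: the function names are hypothetical, the toy logits are made up, and the "from logits" variant uses the standard log-sum-exp identity rather than the actual Keras code.

```python
import numpy as np

def softmax(logits):
    # subtract the row maximum for numerical stability
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ce_from_probs(y_true, probs, eps=1e-6):
    # explicit route: softmax output -> clip -> log
    probs = np.clip(probs, eps, 1 - eps)
    return np.mean(-np.sum(y_true * np.log(probs), axis=-1))

def ce_from_logits(y_true, logits):
    # shortcut: log(softmax(z)) = z - logsumexp(z), no exp/log round trip
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return np.mean(-np.sum(y_true * log_probs, axis=-1))

# three toy examples with two classes each (hypothetical values)
logits = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 4.0]])
y_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

loss_a = ce_from_probs(y_true, softmax(logits))
loss_b = ce_from_logits(y_true, logits)
print(loss_a, loss_b)
```

Both routes give the same value here. The second one stays finite even for very confident predictions, where the first one has to rely on clipping.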

#### L2 regularization

The second term in our loss function does not compare predicted and expected values. It only considers the values of all weights used in the model and penalizes the training when these weights become large. To understand why large weight values are discouraged in typical NN applications, you can imagine a simple fit of a 1-D function to a set of examples (see image below).

<center><img src="assets/nn_capacity.png" width="60%"/></center>

In case a network has too few parameters (case 1), its capacity is insufficient to describe the examples with good accuracy (*underfitting*). A network with appropriate capacity (case 2) describes the data in all parts of the phase space $x$. However, guessing the correct capacity that applies equally to all parts of the usually high-dimensional phase space is not always possible.

Therefore, the scenario of *overfitting* becomes relevant (case 3). In some parts of the phase space (here, for low $x$ values), the prediction of the network fluctuates significantly to explain each example. This is realized through large values of the parameters of the underlying fit model. The same observation holds for higher-dimensional fits and hence, also for neural networks.

For this reason, we introduce the $L_2$ regularization loss, which simply sums up the squares (thus $_2$) of the trainable parameters (in our case, the kernel matrices of the hidden layers that `create_model` collects). Before adding this term to the CE loss defined above, we scale it by a factor $\lambda$ (`l2_norm`) to control the overall strength of the $L_2$ regularization.
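
The 1-D fit picture can be reproduced numerically. The sketch below is an illustration with assumed toy data: it fits a degree-9 polynomial to 12 noisy points, once without and once with an $L_2$ penalty (closed-form ridge regression), and compares the sum of squared coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)

# 12 noisy samples of a smooth 1-D function (hypothetical toy data)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# degree-9 polynomial features: plenty of capacity for 12 points
X = np.vander(x, N=10, increasing=True)

# unregularized least squares fit
w_plain, *_ = np.linalg.lstsq(X, y, rcond=None)

# L2-regularized fit: minimize |Xw - y|^2 + lam * |w|^2
lam = 1e-2
w_l2 = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(f"sum of squared parameters without L2: {np.sum(w_plain**2):.2f}")
print(f"sum of squared parameters with L2   : {np.sum(w_l2**2):.2f}")
```

The penalty pulls the fit towards smaller parameters, and the strength of that pull is set by `lam`, which plays the role of `l2_norm`.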

In [None]:
# define the losses
def create_losses(weights, l2_norm=0.001):
    # cross entropy
    @tf.function
    def loss_ce_fn(labels, predictions):
        # clip predictions away from 0 and 1 so that the logarithm stays finite
        predictions = tf.clip_by_value(predictions, 1e-6, 1 - 1e-6)
        loss_ce = tf.reduce_mean(-labels * tf.math.log(predictions))
        return loss_ce

    # l2 loss
    @tf.function
    def loss_l2_fn(labels, predictions):
        # accept labels and predictions although we don't need them
        # but this makes it easier to call all loss functions the same way
        loss_l2 = sum(tf.reduce_sum(w**2) for w in weights)
        
        return l2_norm * loss_l2
        
    # return a dict with all loss function components
    return {"ce": loss_ce_fn, "l2": loss_l2_fn}

In [None]:
loss_fns = create_losses(regularization_weights, l2_norm=0.001)

### Optimizer

During training, we need an optimizer object that handles the propagation of derivatives back through the network and updates all trainable weights. There are many different optimizers out there, but for now, we will stick with the [*Adam*](https://arxiv.org/abs/1412.6980) optimizer.

In [None]:
# define the optimizer with a variable learning rate
def create_optimizer(initial_learning_rate=0.005):
    learning_rate = tf.Variable(initial_learning_rate, dtype=tf.float32, trainable=False)
    optimizer = tf.keras.optimizers.Adam(learning_rate)
    return optimizer, learning_rate

In [None]:
optimizer, learning_rate = create_optimizer()

### Training loop

Now it's time to define the training loop. Here, we use the TensorFlow `GradientTape`, which tracks all executed operations and provides the partial gradients of the loss function with respect to all trainable weights. These gradients are then used to update the weight values as part of the backpropagation algorithm. You can learn more on the `GradientTape` and custom training loops [here](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch).
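
Before looking at the full loop, a stripped-down sketch may help to see what the tape does. It fits a straight line to hypothetical toy data with plain gradient descent, so the variables `w`, `b` and the step size `lr` are assumptions for illustration and no Keras model or optimizer is involved.

```python
import tensorflow as tf

# toy data following y = 2x + 1 (hypothetical)
x = tf.linspace(-1.0, 1.0, 50)
y_true = 2.0 * x + 1.0

# two trainable parameters and a fixed step size
w = tf.Variable(0.0)
b = tf.Variable(0.0)
lr = 0.2

for step in range(200):
    # operations inside the context are recorded on the tape
    with tf.GradientTape() as tape:
        y_pred = w * x + b
        loss = tf.reduce_mean((y_pred - y_true) ** 2)
    # partial derivatives of the loss with respect to each variable
    grad_w, grad_b = tape.gradient(loss, [w, b])
    # plain gradient descent update
    w.assign_sub(lr * grad_w)
    b.assign_sub(lr * grad_b)

print(w.numpy(), b.numpy(), loss.numpy())
```

In the loop below, the manual update is replaced by `optimizer.apply_gradients`, and the loss is the sum of the CE and $L_2$ terms.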

In [None]:
def training_loop(dataset_train, dataset_valid, model, loss_fns, optimizer, learning_rate,
                  max_steps=10000, log_every=1, validate_every=100): 
    # store the best model, identified by the best validation accuracy
    best_model = None

    # metrics to update during training
    metrics = dict(
        step=0, step_val=0,
        acc_train=0., acc_valid=0., acc_valid_best=0.,
        auc_train=0., auc_valid=0., auc_valid_best=0.,
    )
    for name in loss_fns:
        for kind in ["train", "valid"]:
            metrics[f"loss_{name}_{kind}"] = 0.
    
    # progress bar format
    fmt = ["{percentage:3.0f}% {bar} Step: {pfx[0][step]}/{total}, Validations: {pfx[0][step_val]}"]
    for name in loss_fns:
        fmt.append(f"Loss '{name}': {{pfx[0][loss_{name}_train]:.4f}} | {{pfx[0][loss_{name}_valid]:.4f}}")
    fmt.append("Accuracy: {pfx[0][acc_train]:.4f} | {pfx[0][acc_valid]:.4f} | {pfx[0][acc_valid_best]:.4f}")
    fmt.append("ROC AUC: {pfx[0][auc_train]:.4f} | {pfx[0][auc_valid]:.4f} | {pfx[0][auc_valid_best]:.4f}")
    fmt.append("(loss format: 'last train | last valid', metric format: 'last train | last valid | best valid')")
    fmt = " --- ".join(fmt).replace("pfx", "postfix")

    # helper to update metrics
    def update_metrics(bar, kind, step, labels, predictions, losses):
        # calculate accuracy and roc auc
        acc = calculate_accuracy(labels.numpy(), predictions.numpy())
        auc = roc_auc_score(labels[:, 1], predictions[:, 1])
        # update bar data
        metrics["step"] = step + 1
        metrics[f"acc_{kind}"] = acc
        metrics[f"auc_{kind}"] = auc
        for name, loss in losses.items():
            metrics[f"loss_{name}_{kind}"] = loss
        # validation specific
        if kind == "valid":
            metrics["step_val"] += 1
            metrics["acc_valid_best"] = max(metrics["acc_valid_best"], acc)
            metrics["auc_valid_best"] = max(metrics["auc_valid_best"], auc)
            # return True when this was the best validation step
            return acc == metrics["acc_valid_best"]
    
    # start the loop for all batches
    with tqdm(total=max_steps, bar_format=fmt, postfix=[metrics]) as bar:
        for step, (c_vectors, true_vectors, labels) in enumerate(dataset_train):
            if step >= max_steps:
                print(f"{max_steps} steps reached, stopping training")
                break
                
            # do a train step
            with tf.GradientTape() as tape:
                # get predictions
                predictions = model(c_vectors, training=True)
                # compute all losses and combine them into the total loss
                losses = {
                    name: loss_fn(labels, predictions)
                    for name, loss_fn in loss_fns.items()
                }
                loss = tf.add_n(list(losses.values()))
            # get and propagate gradients
            gradients = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

            # logging
            do_log = step % log_every == 0
            if do_log:
                update_metrics(bar, "train", step, labels, predictions, losses)

            # validation
            do_validate = step % validate_every == 0
            if do_validate:
                c_vectors_valid, true_vectors_valid, labels_valid = next(iter(dataset_valid))
    
                predictions_valid = model(c_vectors_valid, training=False)
                losses_valid = {
                    name: loss_fn(labels_valid, predictions_valid)
                    for name, loss_fn in loss_fns.items()
                }
                is_best = update_metrics(bar, "valid", step, labels_valid, predictions_valid, losses_valid)
                
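                # store the best model
                # note: clone_model creates freshly initialized weights, so copy the trained ones
                if is_best:
                    best_model = tf.keras.models.clone_model(model)
                    best_model.set_weights(model.get_weights())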
            
            bar.update()
        else:
            print("dataset exhausted, stopping training")

    print("validation metrics of the best model:")
    print(f"Accuracy: {metrics['acc_valid_best']:.4f}")
    print(f"ROC AUC : {metrics['auc_valid_best']:.4f}")
    
    return best_model, metrics

### Start the training!

In [None]:
best_model, metrics = training_loop(
    dataset_train,
    dataset_valid,
    model,
    loss_fns,
    optimizer,
    learning_rate,
    max_steps=100,
)


In [None]:
labels_train = dataset_train._orig_data[2]
labels_valid = dataset_valid._orig_data[2]

predictions_train = best_model.predict(dataset_train._orig_data[0])
predictions_valid = best_model.predict(dataset_valid._orig_data[0])

In [None]:
acc_train = calculate_accuracy(labels_train, predictions_train)
acc_valid = calculate_accuracy(labels_valid, predictions_valid)

print(f"train accuracy: {acc_train:.4f}")
print(f"valid accuracy: {acc_valid:.4f}")

### Lessons learned

- We learned about eager execution and graphs.
- We know what `tf.function`s are and how they create and cache graphs by means of signature tracing.
- Feature scaling and the separation of numerical domains between network and physics application were motivated.
- We built a data pipeline using TensorFlow datasets.
- We created and used a custom training loop using the GradientTape.

## **How can the results be further improved?**

What are some drawbacks ...

- We only loaded a fraction of the input data.
- The training only ran for 100 steps (`max_steps=100` above), i.e., 100 forward passes and backpropagation calls. Given the amount of data, this is clearly not enough.
- None of the hyper-parameters is tuned yet.

Now, it's up to you to improve the training! Perhaps also try to include further concepts. Good starting points are

- [Learning rate scheduling](https://keras.io/api/optimizers/learning_rate_schedules/exponential_decay)
- [Batch normalization](https://keras.io/api/layers/normalization_layers/batch_normalization)
- [Activations](https://keras.io/api/layers/activations/#selu-function)
- [Focal loss](https://medium.com/visionwizard/understanding-focal-loss-a-quick-read-b914422913e7)
- [...](https://lmgtfy.com/?q=How+to+improve+my+neural+network)
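
As a starting point for the first item, an exponential learning rate decay can be sketched in plain Python. The function name and the default values below are assumptions chosen for illustration, not tuned settings.

```python
def exponential_decay(step, initial_lr=0.005, decay_steps=1000, decay_rate=0.5):
    # learning rate after `step` training steps:
    # multiplied by `decay_rate` once every `decay_steps` steps (continuous version)
    return initial_lr * decay_rate ** (step / decay_steps)

for step in (0, 1000, 2000, 5000):
    print(step, exponential_decay(step))
```

One possible way to use it is to call `learning_rate.assign(exponential_decay(step))` at the start of each training step, assuming the optimizer reads the `tf.Variable` that was passed to it in `create_optimizer`.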


Can you reach an accuracy of 85%? You can use the cells below which wrap all of the above settings and methods in less space.

**Note**: If you experience notebook kernel interruptions or messages like `Allocation of XXXXXXXX exceeds XX% of free system memory` on the terminal, reduce the number of input files again with the `stop_file` parameter as we did above. Reasonable results can already be achieved with a subset of the input data.

In [None]:
# define hyper-parameters
# ACTION REQUIRED
n_constituents = ...
batch_size = ...
l2_norm = ...
initial_learning_rate = ...
units = ...
activation = ...
dropout_rate = ...
n_train_files = ...  # set this to a value that works with your RAM
n_valid_files = ...  # set this to a value that works with your RAM

In [None]:
# load all data
dataset_train, means_train, stddevs_train = create_dataset(
    "train",
    shuffle=True,
    repeat=-1,
    batch_size=batch_size,
    n_constituents=n_constituents,
    stop_file=n_train_files,
)
dataset_valid, _, _ = create_dataset(
    "valid",
    batch_size=-1,
    n_constituents=n_constituents,
    stop_file=n_valid_files,
)

In [None]:
# create the model
model, regularization_weights = create_model(
    (n_constituents, 4),
    units=units,
    activation=activation,
    dropout_rate=dropout_rate,
    fs_args=(means_train, stddevs_train),
)
loss_fns = create_losses(regularization_weights, l2_norm)
optimizer, learning_rate = create_optimizer(initial_learning_rate)
model.summary()

In [None]:
# and train
best_model, metrics = training_loop(
    dataset_train,
    dataset_valid,
    model,
    loss_fns,
    optimizer,
    learning_rate,
    max_steps=5000,
)

You can create the ROC and output plots we defined above to study the training.

In [None]:
labels_train = dataset_train._orig_data[2]
labels_valid = dataset_valid._orig_data[2]

predictions_train = best_model.predict(dataset_train._orig_data[0])
predictions_valid = best_model.predict(dataset_valid._orig_data[0])

In [None]:
plot_roc(
    (labels_train, labels_valid),
    (predictions_train, predictions_valid),
    names=("train", "valid"),
).show()

plot_hist(
    (predictions_valid[labels_valid[:, 1] == 0][:, 1], predictions_valid[labels_valid[:, 1] == 1][:, 1]),
    names=("Light jets", "Top jets"),
    xlabel="Output distribution",
).show()

In [None]:
acc_train = calculate_accuracy(labels_train, predictions_train)
acc_valid = calculate_accuracy(labels_valid, predictions_valid)

print(f"train accuracy: {acc_train:.4f}")
print(f"valid accuracy: {acc_valid:.4f}")