<hr style="height: 1px;">
<i>This notebook was authored by the 8.S50x Course Team, Copyright 2022 MIT All Rights Reserved.</i>
<hr style="height: 1px;">
<br>

<h1>Lesson 14: An Example With LHC Data</h1>


<a name='section_14_0'></a>
<hr style="height: 1px;">


## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.0 Overview</h2>


<h3>Navigation</h3>

<table style="width:100%">
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_14_1">L14.1 Large Hadron Collider Data</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_14_1">L14.1 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_14_2">L14.2 Loading Data and Defining the Network</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_14_2">L14.2 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_14_3">L14.3 Training and Testing the Network</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_14_3">L14.3 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_14_4">L14.4 Adding a Hidden Layer</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_14_4">L14.4 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_14_5">L14.5 Regularization, Batch Normalization, and Dropout</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_14_5">L14.5 Exercises</a></td>
    </tr>
</table>



<h3>Learning Objectives</h3>

This lesson covers topics related to analysis of data from experiments at the Large Hadron Collider (LHC), including how to work with LHC data, how to train and test a network, and how to add a hidden layer to the network.

The topics include the following:

- Large Hadron Collider Data
- Working with LHC Data
- Training and Testing the Network
- Adding a Hidden Layer

<h3>Data</h3>

>description: CMS Crystal Shower Shape Data<br>
>source: https://zenodo.org/record/8035308 <br>
>attribution: Rankin, Dylan (CMS Collaboration), DOI:10.5281/zenodo.8035308 

In [None]:
#>>>RUN: L14.0-runcell00

# NOTE: these files are too large to include in the original repository, so you must download them from here:
# https://www.dropbox.com/s/i1dbakzr3pn9twd/xtalTuple_TTbar_PU0.z?dl=0
#
# Ways to download:
#     1. Copy/paste the link (replace =0 with =1 to download automatically)
#     2. Use the wget commands below (works in Colab, but you may need to install wget if using locally)
#
# Location of files:
#     Move the files to the directory data/L14
#
# Using wget: (works in Colab)
#     Upon downloading, the code below will move them to the appropriate directory

#get the data
!wget -P data/L14 https://www.dropbox.com/s/i1dbakzr3pn9twd/xtalTuple_TTbar_PU0.z?dl=0
!mv data/L14/xtalTuple_TTbar_PU0.z?dl=0 data/L14/xtalTuple_TTbar_PU0.z 

<h3>Importing Libraries</h3>

Before beginning, run the cell below to import the relevant libraries for this notebook.


In [None]:
#>>>RUN: L14.0-runcell01

#If using notebooks locally, run the following within your conda environment (if not done already)
#conda install pandas

import numpy as np               #https://numpy.org/doc/stable/
import matplotlib.pyplot as plt  #https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
import h5py                      #https://docs.h5py.org/en/stable/quick.html#quick
import pandas as pd              #https://pandas.pydata.org/docs/user_guide/index.html
import torch                     #https://pytorch.org/docs/stable/torch.html

<h3>Setting Default Figure Parameters</h3>

The following code cell sets default values for figure parameters.

In [None]:
#>>>RUN: L14.0-runcell02

#set plot resolution
%config InlineBackend.figure_format = 'retina'

#set default figure parameters
plt.rcParams['figure.figsize'] = (9,6)

medium_size = 12
large_size = 15

plt.rc('font', size=medium_size)          # default text sizes
plt.rc('xtick', labelsize=medium_size)    # xtick labels
plt.rc('ytick', labelsize=medium_size)    # ytick labels
plt.rc('legend', fontsize=medium_size)    # legend
plt.rc('axes', titlesize=large_size)      # axes title
plt.rc('axes', labelsize=large_size)      # x and y labels
plt.rc('figure', titlesize=large_size)    # figure title


<a name='section_14_1'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.1 Large Hadron Collider Data</h2>  

| [Top](#section_14_0) | [Previous Section](#section_14_0) | [Exercises](#exercises_14_1) | [Next Section](#section_14_2) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS14/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS14_vid1" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L14/slides_L14_01.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L14.1-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L14/slides_L14_01.html', width=970, height=550)

<h3>Overview</h3>

The electromagnetic calorimeter (ECAL) is a key component of the CMS detector at the LHC. It is designed to measure the energy of photons and electrons produced in particle collisions. When a photon or electron enters the ECAL, it interacts with a lead tungstate crystal and produces a shower of secondary particles, which in turn deposit their energy in the crystal. This energy causes the crystal to scintillate, creating light which is collected by photodetectors and converted into an electrical signal. The intensity of the light is proportional to the total amount of energy deposited in the crystal. These crystals are long enough that both electrons and photons created in LHC collisions lose essentially all of their energy.

As introduced in Lesson 9 and discussed in the video, pileup is an effect where other proton-proton collisions occur almost simultaneously with a primary collision, producing additional signals in the detector. This complicates the analysis of the data since the signals from the additional interactions can interfere with the measurement of the particles produced in the primary collision. This is one form of background in the ECAL data. The exercises for this Section introduce a second source of background, namely particles that can mimic photons, and ask how a neural network could separate them from real photons.

In the next Section, we will continue to investigate how to analyze data and extract useful information in the presence of background.

<a name='exercises_14_1'></a>     

| [Top](#section_14_0) | [Restart Section](#section_14_1) | [Next Section](#section_14_2) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.1.1</span>

The CMS ECAL is intended to identify photons and electrons. However, it is often the case that you can get particles that mimic photons and electrons. In particular, pions can leave large energy deposits in the ECAL. Charged pions will produce a charged track and shower in the calorimeter. These are usually not a problem since they can be identified by the fact that they also deposit energy in the Hadron Calorimeter behind the ECAL.

On the other hand, neutral pions will decay into two photons that are close to each other (collinear). In fact, they are typically so close together that they look like a single photon. The problem that we would like to solve is the separation of neutral pion decays from photons produced directly in the original collision. Let's say we are looking for a process that decays to photons, for example the Higgs decay to two well-separated photons. Selecting the Higgs involves selecting two photons on top of backgrounds from *fake* photons. What could a neural network do to remove fake photons? 

A) Reduce the background by eliminating fake events that are produced from pions. This is done by selecting events that have a large probability of containing TWO real photons.\
B) Do nothing to the background, just help to make suggestions as to what is more likely background.\
C) Generate a weight for each event quantifying the likelihood that it contains real photons. This weight can be used to look for the Higgs.\
D) Reduce the background by eliminating fake events that are produced from pions. This is done by selecting events that have a large probability of containing ONE real photon.



### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.1.2</span>

If our dominant background comes from pions that decay into two nearby photons, what would allow us to discriminate these cases from real photons? Select all that apply:

A) Calorimeter shapes that look like two energy blobs in the cells.\
B) Wider calorimeter shapes.\
C) A single high energy deposit.\
D) Calorimeter shapes that are wider in a specific direction.



<a name='section_14_2'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.2 Loading Data and Defining the Network</h2>  

| [Top](#section_14_0) | [Previous Section](#section_14_1) | [Exercises](#exercises_14_2) | [Next Section](#section_14_3) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS14/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS14_vid2" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>


Now that we have trained a very simple network, let's take a real physics dataset and train a neural network to analyze it. Specifically, we are going to train a neural network that does photon identification at the Large Hadron Collider. We will construct a discriminator that outputs the probability that a reconstructed particle was a real photon rather than a "fake" photon. A fake photon in this case is typically a pion. Pions consist of a quark and an antiquark, and the neutral ones can decay into a pair of photons that are so close together that they effectively merge, thereby looking very similar to a single photon. Photon identification is an important problem, and it is what we used to identify the good photons that led to the Higgs boson discovery. We regularly update our photon identification to try to improve it, and it has used some form of machine learning or deep learning for the last 10 years at the LHC. Suffice it to say that this is an excellent machine learning problem. 
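
To get a feel for why the two photons from a neutral pion merge, here is a rough kinematic sketch. For a symmetric decay of a pion with mass $m$ and energy $E$ into two massless photons, the minimum opening angle satisfies $\sin(\theta/2) = m/E$. The pion mass (about 0.135 GeV) and the approximate angular width of one barrel ECAL crystal (about 0.0174 rad) are values supplied here for illustration and are not taken from the dataset. The function name is hypothetical.

```python
import math

PI0_MASS_GEV = 0.135        # approximate neutral pion mass (assumed value)
CRYSTAL_SIZE_RAD = 0.0174   # approximate angular width of one barrel crystal (assumed value)

def min_opening_angle(energy_gev, mass_gev=PI0_MASS_GEV):
    """Minimum angle (rad) between the two photons from a pi0 -> gamma gamma decay."""
    return 2.0 * math.asin(mass_gev / energy_gev)

for energy in [5.0, 20.0, 50.0]:
    theta = min_opening_angle(energy)
    print(f"E = {energy:5.1f} GeV: theta_min = {theta:.4f} rad "
          f"= {theta / CRYSTAL_SIZE_RAD:.2f} crystal widths")
```

Under these assumptions, the minimum separation falls below one crystal width once the pion energy reaches a few tens of GeV, which is why the two photons can look like a single energy deposit.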

In this particular case, we will only use information from the ECAL and, therefore, will be trying to separate real egamma events (i.e. photons and electrons) from background (mostly pions). In order to do this separation, we will incorporate a large number of different variables for the reconstructed particles from the collision, most of which are based on the moments and energies of the distribution of signals from the electromagnetic calorimeter elements. 

In the full particle identification, data from other detectors is used to discriminate electrons from photons.

In the last Lesson, we analyzed a very simple simulated dataset in which each event had only two variables (the location of the point in 2D space) and was identified as either blue or red. The goal was to find a procedure which would most accurately predict the color of an event given its location.

Now, we will consider a much more complicated case. However, the new dataset again consists of a simulation with the critically important attribute of identifying which cases are real egamma events and which are not. The difference is that the events in this case have many more than just two variables.

One last thing is that we will use `pandas` dataframes to process this. Pandas is a standard Python data-analysis library that is quite popular for machine learning. Let's go ahead and look at what we get!

In [None]:
#>>>RUN: L14.2-runcell01

import h5py
import pandas as pd


treename = 'l1pf_egm_reg'

VARS = ['pt', 'eta', 'phi', 'energy',
  'e2x2', 'e2x5', 'e3x5', 'e5x5', 'e2x2_div_e2x5', 'e2x2_div_e5x5', 'e2x5_div_e5x5',#7
  'hoE', 'bremStrength', 'ecalIso', 'crystalCount',#4
  'lowerSideLobePt','upperSideLobePt',#2
  'phiStripContiguous0', 'phiStripOneHole0', 'phiStripContiguous3p', 'phiStripOneHole3p',#4
  'sihih','sipip','sigetaeta','sigphiphi','sigetaphi',#5
  'e_m2_m2','e_m2_m1','e_m2_p0','e_m2_p1','e_m2_p2',
  'e_m1_m2','e_m1_m1','e_m1_p0','e_m1_p1','e_m1_p2',
  'e_p0_m2','e_p0_m1','e_p0_p0','e_p0_p1','e_p0_p2',
  'e_p1_m2','e_p1_m1','e_p1_p0','e_p1_p1','e_p1_p2',
  'e_p2_m2','e_p2_m1','e_p2_p0','e_p2_p1','e_p2_p2',#^25
  'h_m1_m1','h_m1_p0','h_m1_p1',
  'h_p0_m1','h_p0_p0','h_p0_p1',
  'h_p1_m1','h_p1_p0','h_p1_p1',#^9
  'gen_match']

filename = 'data/L14/xtalTuple_TTbar_PU0.z'

h5file = h5py.File(filename, 'r') # open read-only
params = h5file[treename][()]

df = pd.DataFrame(params,columns=VARS)

TODROP = [
  'e2x2_div_e2x5', 'e2x2_div_e5x5', 'e2x5_div_e5x5',#3
  'e_m2_m2','e_m2_m1','e_m2_p0','e_m2_p1','e_m2_p2',
  'e_m1_m2','e_m1_m1','e_m1_p0','e_m1_p1','e_m1_p2',
  'e_p0_m2','e_p0_m1','e_p0_p0','e_p0_p1','e_p0_p2',
  'e_p1_m2','e_p1_m1','e_p1_p0','e_p1_p1','e_p1_p2',
  'e_p2_m2','e_p2_m1','e_p2_p0','e_p2_p1','e_p2_p2',#^25
  'h_m1_m1','h_m1_p0','h_m1_p1',
  'h_p0_m1','h_p0_p0','h_p0_p1',
  'h_p1_m1','h_p1_p0','h_p1_p1',#^9
]

df = df.drop(TODROP, axis=1) #remove custom variables

#normalize the shower shapes by energy
for ie in ['e2x2', 'e2x5', 'e3x5', 'e5x5']:
    df[ie] /= df['energy']

#add some labels
df['isPU'] = pd.Series(df['gen_match']==0, index=df.index, dtype='i4')
df['isEG'] = pd.Series(df['gen_match']==1, index=df.index, dtype='i4')

#now select the dataset based on their transverse momentum (pt)
MINPT = 0.5
MAXPT = 100.
df = df.loc[(df['pt']>MINPT) & (MAXPT>df['pt']) & (1.3>abs(df['eta']))]
df.fillna(0., inplace=True)

#take a fixed number of events
df0 = df[df['gen_match']==0].head(100000)
df1 = df[df['gen_match']==1].head(10000)

df = pd.concat([df0, df1], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
col_names = list(df.columns)

#Now let's check it all
print(df)
print(sum(df['gen_match']==0))
print(sum(df['gen_match']==1))

The problem at hand is to separate egamma events (electrons or photons) from background (mostly pions). The label `gen_match` records which category a simulated event belongs to: it is 1 if the event is egamma and 0 otherwise. Similarly, the label `isEG` flags egamma events, and the label `isPU` flags background events.
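
As a small illustration of how these labels relate to each other, the sketch below builds `isPU` and `isEG` from a toy `gen_match` column. The five-row dataframe is made up for the example, and `astype('i4')` is used here as a shorthand that should be equivalent to the `pd.Series(..., dtype='i4')` construction in the cell above.

```python
import pandas as pd

# Toy stand-in for the real dataframe: gen_match is 1 for egamma, 0 for background
toy = pd.DataFrame({'gen_match': [0, 1, 0, 0, 1]})

# Boolean comparison cast to 32-bit integers
toy['isPU'] = (toy['gen_match'] == 0).astype('i4')
toy['isEG'] = (toy['gen_match'] == 1).astype('i4')

print(toy)

# Every row carries exactly one of the two labels
print((toy['isPU'] + toy['isEG'] == 1).all())
```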

Now that we have the dataset, let's go ahead and plot everything. We can make the classic color choice that blue is signal and red is background. Our goal in the end will be to use all of the variables that we have selected in the dataset to separate blue from red. This is a large multidimensional problem, and so you can see why it would be hard for us to just simply select cuts on parameters by hand. 

Here, we really need to come up with an automated scheme, and this is what pytorch helps us with.

In [None]:
#>>>RUN: L14.2-runcell02

col_names = list(df.columns)
print(col_names)

fig, axs = plt.subplots(len(col_names),1,figsize=(4,4*len(col_names)))
for ix,ax in enumerate(axs):
    ax.hist(df[col_names[ix]][df['gen_match']==0],bins=np.linspace(np.min(df[col_names[ix]]),np.max(df[col_names[ix]]),20),histtype='step',color='r',density=True)
    ax.hist(df[col_names[ix]][df['gen_match']==1],bins=np.linspace(np.min(df[col_names[ix]]),np.max(df[col_names[ix]]),20),histtype='step',color='b',density=True)
    ax.set_xlabel(col_names[ix])

plt.show()

Note in particular the last three plots. As discussed above and in the video, `isPU` indicates whether a particular event is background and `isEG` indicates whether it is egamma. So, the good events (blue histograms) would have `isPU=0` (i.e. not background) and `isEG=1` (i.e. an egamma event). For this particular simulation, `isEG` and `gen_match` flag the same events.

The other thing to remind ourselves of is that many of these variables can be correlated. What that means is that when we select on one variable, we are potentially also indirectly selecting on another. The network can recognize this and attempt to "decorrelate" variables to get better performance. Sometimes it's too good at this and makes our problem more difficult.
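
To make the point about correlations concrete, here is a minimal sketch using made-up variables rather than the real shower shapes. The variable `b` is deliberately built from `a`, while `c` is independent, so a selection on `a` also shifts `b` but not `c`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: b is built from a plus noise, c is independent of both
a = rng.normal(size=5000)
b = 0.8 * a + 0.2 * rng.normal(size=5000)
c = rng.normal(size=5000)

# Rows are variables, columns are events
corr = np.corrcoef(np.vstack([a, b, c]))
print(np.round(corr, 2))

# Selecting on a also shifts b, but leaves c essentially unchanged
selected = a > 1.0
print(b[selected].mean(), c[selected].mean())
```

For the real dataset, `df.corr()` returns the same kind of matrix for all columns at once.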

<h3>Simple Regression Network</h3>

Let's make a simple logistic regression network on this data to differentiate between background data (labeled as PU) and egamma events (EG).



An important component of training a neural network is preparing the input. It is typical to split the data you have into different sets, with the most common three being: "training", "testing", and "validation". Here we use 30% of the data for testing and the remaining 70% for training and validation. That 70% is split 80%/20%, so training and validation receive 56% and 14% of the full dataset, respectively.

The reason for each is: 

 * Training : This is the dataset that we use to adjust the parameters within the network to most accurately select the events we want.
 * Validation : This is a dataset that we use while training to check that our loss is the same or similar on independent data. It can also be used for tuning the so-called "hyperparameters", variables which control how the training proceeds.
 * Testing : This is the dataset that we use to compute the performance of the algorithm.
 
Let's go ahead and make these three separate datasets. It will be clear how we use them shortly. Note that PyTorch uses `DataLoader` objects, which help handle batching, shuffling, etc.  


In [None]:
#>>>RUN: L14.2-runcell03


dataset = df.values

#inputs: skip the first 4 columns (pt, eta, phi, energy) and the last 3 columns (labels)
X = dataset[:,4:-3]
ninputs = len(list(df.columns))-3-4

#the last column (isEG) will be used for the label
Y = dataset[:,-1:]

test_frac = 0.3
val_frac = 0.2

alldataset = torch.utils.data.TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(Y, dtype=torch.float32))

torch.random.manual_seed(42) # fix a random seed for reproducibility
testdataset, trainvaldataset = torch.utils.data.random_split(
    alldataset, [int(len(Y)*test_frac),
              int(len(Y)*(1-test_frac))])

torch.random.manual_seed(42) # fix a random seed for reproducibility
traindataset, valdataset = torch.utils.data.random_split(
    trainvaldataset, [int(len(Y)*(1.-test_frac)*(1.-val_frac)),
              int(len(Y)*(1.-test_frac)*val_frac)])

testloader = torch.utils.data.DataLoader(testdataset,
                                          num_workers=6,
                                          batch_size=500,
                                          shuffle=False)
trainloader = torch.utils.data.DataLoader(traindataset,
                                          num_workers=6,
                                          batch_size=500,
                                          shuffle=True)
valloader = torch.utils.data.DataLoader(valdataset,
                                        num_workers=6,
                                        batch_size=500,
                                        shuffle=False)
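
The split sizes used above can be checked with plain arithmetic. The sketch below assumes a total of 110000 events (the 100000 background plus 10000 egamma events requested by the `head()` calls, if the file contains that many of each) and uses a hypothetical helper, `split_sizes`. It computes each remainder by subtraction, which is one way to guarantee that the three sizes add up exactly to the total, as `random_split` requires when given integer lengths.

```python
def split_sizes(n_total, test_frac=0.3, val_frac=0.2):
    """Return (n_train, n_val, n_test) integer sizes that sum exactly to n_total."""
    n_test = int(n_total * test_frac)
    n_trainval = n_total - n_test          # remainder, so no event is lost to rounding
    n_val = int(n_trainval * val_frac)
    n_train = n_trainval - n_val
    return n_train, n_val, n_test

n_total = 110000   # assumed total number of events
n_train, n_val, n_test = split_sizes(n_total)
print(n_train, n_val, n_test)
print([round(n / n_total, 2) for n in (n_train, n_val, n_test)])
```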


Now, we need to define our network architecture and the connections. Let's start with the kind of logistic regression network we saw already. This will be a 1-layer network. We will take in the number of inputs and then run our sigmoid function on the outputs, like we did before. 

PyTorch requires that we first define the layers we want to use in `__init__()` (here we build using standard library layers), and then we define the connection in `forward()`. The input `x` is first passed through the fully connected layer, `self.fc1` with input size `ninputs`, and then through the sigmoid activation function, and the output of size 1 is returned. This setup will allow PyTorch to construct the backward pass automatically although, for more complex or specialized networks, it is possible to define the backward pass manually.

In [None]:
#>>>RUN: L14.2-runcell04

class LR_net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(ninputs,1)
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.output(x)
        return x
        
torch.random.manual_seed(42)  # fix a random seed for reproducibility

model_lr = LR_net()
print(model_lr)
print('----------')
print(model_lr.state_dict())
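
The two steps in `forward()` can also be written out by hand. The sketch below is a NumPy stand-in, not the PyTorch implementation: it uses a hypothetical layer with only 3 inputs and arbitrary random weights, and applies a linear map followed by a sigmoid to a batch of 4 made-up events.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
n_inputs_toy = 3                       # hypothetical small input size for illustration

weights = rng.uniform(-0.5, 0.5, size=(1, n_inputs_toy))   # one row per output node
bias = rng.uniform(-0.5, 0.5, size=1)

x = rng.normal(size=(4, n_inputs_toy))     # a batch of 4 events

# Linear layer followed by the sigmoid: the same two steps as forward()
output = sigmoid(x @ weights.T + bias)
print(output.shape)
print(output)
```

Each event produces a single number between 0 and 1, which is what allows the output to be interpreted as a probability.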

When the neural network is created, the code randomly generates initial values for all of the parameters, in roughly the range of +/-0.25 for this case. The `torch.random.manual_seed` line is used to guarantee the same random set is chosen every time, a feature which will be important for reproducibility when answering some of the later exercises.
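
The size of that range follows from the layer width. According to the PyTorch documentation for `torch.nn.Linear`, the default initial weights and bias are drawn uniformly from $\pm 1/\sqrt{n}$, where $n$ is the number of input features. The helper name below is hypothetical and the snippet only evaluates that bound.

```python
import math

def linear_init_bound(in_features):
    """Half-width of the uniform range used for the default initial weights and bias."""
    return 1.0 / math.sqrt(in_features)

print(linear_init_bound(19))   # layer used in this section
print(linear_init_bound(2))    # a two-input layer, for comparison
```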

<a name='exercises_14_2'></a>     

| [Top](#section_14_0) | [Restart Section](#section_14_2) | [Next Section](#section_14_3) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.2.1</span>

Consider the variables that you plotted in `L14.2-runcell02`. Among the following variables, which appear to have high discrimination power? In other words, in which plots is the background (red) distinguishable from the egamma (blue), meaning there is a relatively small overlap between the red and blue histograms? Select all that apply.

A) pt\
B) eta\
C) phi\
D) energy\
E) e2x2\
F) sihih


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.2.2</span>

How are training, validation, and testing data used in machine learning model development?

A) Training data is used to evaluate the model's performance, validation data is used to select the best hyperparameters, and testing data is used to train the model.\
B) Training data is used to train the model, validation data is used to tune the hyperparameters, and testing data is used to evaluate the model's performance.\
C) Training data is used to tune the hyperparameters, validation data is used to evaluate the model's performance, and testing data is used to train the model.\
D) All three datasets are used interchangeably to train, tune, and evaluate the model.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.2.3</span>

In the one-layer neural network that we have defined in this section (`LR_net` from `L14.2-runcell04`), we are using 19 input features. How many weights does this neural network have? Enter your answer as an integer.

Extra: How is this different from the two-weight model we were using in the last Lesson?

<a name='section_14_3'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.3 Training and Testing the Network</h2>  

| [Top](#section_14_0) | [Previous Section](#section_14_2) | [Exercises](#exercises_14_3) | [Next Section](#section_14_4) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS14/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS14_vid3" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Now let's train! We do this using the `Adam` optimizer and binary cross entropy loss (as before). We don't need to write out the formulae; we can just declare a loss function and an optimizer. During this training, we will also print the loss for the validation data. We want it to be similar to the loss found on the training data. In fact, the two should be statistically similar. 

**NOTE:** This could take a little time, depending on your computing resources.

In [None]:
#>>>RUN: L14.3-runcell01

criterion = torch.nn.BCELoss()
optimizer_lr = torch.optim.Adam(model_lr.parameters(), lr=0.003) 

history_lr = {'loss':[], 'val_loss':[]}

for epoch in range(20):

    current_loss = 0.0 #rezero loss
    
    for i, data in enumerate(trainloader):

        inputs, labels = data
        
        # zero the parameter gradients
        optimizer_lr.zero_grad()

        # forward + backward + optimize (training magic)
        # This will use the pytorch autograd feature to adjust the
        ## parameters of our function to minimize the loss
        outputs = model_lr(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer_lr.step()
        
        # add loss statistics
        current_loss += loss.item()
        
        if i == len(trainloader)-1:
            current_val_loss = 0.0
            with torch.no_grad():#disable updating gradient
                for iv, vdata in enumerate(valloader):
                    val_inputs, val_labels = vdata
                    val_loss = criterion(model_lr(val_inputs), val_labels)
                    current_val_loss += val_loss.item()
            print('[%d, %4d] loss: %.4f  val loss: %.4f' % 
                  (epoch + 1, i + 1, current_loss/float(i+1) , current_val_loss/float(len(valloader))))

            history_lr['loss'].append(current_loss/float(i+1))
            history_lr['val_loss'].append(current_val_loss/float(len(valloader)))
            
print('Finished Training')
torch.save(model_lr.state_dict(), 'data/L14/lr_model.pt')
print(model_lr.state_dict())
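
Although we did not need to write out the loss by hand, it can be useful to see what the binary cross entropy is computing. The sketch below is a NumPy version for illustration, evaluated on made-up predictions. It clips the predictions to avoid $\log(0)$; the clipping constant is an arbitrary choice and is not necessarily how `torch.nn.BCELoss` guards against the same problem.

```python
import numpy as np

def binary_cross_entropy(pred, label):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all events."""
    pred = np.clip(pred, 1e-7, 1.0 - 1e-7)   # avoid log(0)
    return float(np.mean(-(label * np.log(pred) + (1.0 - label) * np.log(1.0 - pred))))

labels = np.array([1.0, 0.0, 1.0, 0.0])

confident_right = np.array([0.9, 0.1, 0.8, 0.2])
uninformative = np.array([0.5, 0.5, 0.5, 0.5])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

for name, pred in [('right', confident_right), ('50/50', uninformative), ('wrong', confident_wrong)]:
    print(name, binary_cross_entropy(pred, labels))
```

A network that always outputs 0.5 gives a loss of $\ln 2 \approx 0.69$, so a trained classifier should end up well below that value.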

Ok, how is the training doing? Let's visualize the evolution of the loss by epoch.

In [None]:
#>>>RUN: L14.3-runcell02

plt.semilogy(history_lr['loss'], label='loss')
plt.semilogy(history_lr['val_loss'], label='val_loss')
plt.legend(loc="upper right")
plt.xlabel('epoch')
plt.ylabel('loss (binary crossentropy)')
plt.show()

We start to see that our loss is converging, although it looks like more epochs might give an even better result.  Importantly, we also see that our validation and regular losses are similar. Interestingly, the validation loss could be "better" than the training loss. We address that point below.

Sometimes, overtraining can occur. This is when the sensitivity starts to exceed the statistical precision of the dataset and we start training on random fluctuation features in the data. When overtraining occurs, the network starts to isolate features specific to the training dataset which are not present in the validation data, and the two loss values start to deviate from each other. 


<h3>Tuning the Training</h3>

How do we prevent overtraining? Let's define a "stopping criterion" by using the validation loss. We will stop the training if the validation loss appears to have hit its minimum, but we will let the training run for a few more epochs to allow for the possibility of a local minimum or single-epoch spikes. There are other ways to define an early stopping criterion, but we will use this technique for now. The code cell below just defines everything that is needed, but does not actually run the training. Also note that it uses the final parameter values found at the end of the previous 20 epochs as starting values. If you want to start completely from scratch, rerun code cell `L14.2-runcell04` before running the code cells below. 

In [None]:
#>>>RUN: L14.3-runcell03

def train(model,trainloader,valloader,nepochs=100,lr=0.003,l2reg=0.,patience=5,name=None):

    criterion = torch.nn.BCELoss()
    
    #NOTE: l2 regularization is set to 0 by default,
    #but we will address this in later sections
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=l2reg) 

    history = {'loss':[], 'val_loss':[]}

    min_loss = 999999.
    min_epoch = 0
    min_model = model.state_dict()
    should_stop = False
    
    for epoch in range(nepochs):

        current_loss = 0.0 #rezero loss

        for i, data in enumerate(trainloader):

            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            # This will use the pytorch autograd feature to adjust the
            ## parameters of our function to minimize the loss
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # accumulate loss statistics
            current_loss += loss.item()

            if i == len(trainloader)-1:
                current_val_loss = 0.0
                with torch.no_grad():#disable updating gradient
                    model.eval() #place model in evaluation state
                                ## necessary for some layer types (like dropout)
                    for iv, vdata in enumerate(valloader):
                        val_inputs, val_labels = vdata
                        val_loss = criterion(model(val_inputs), val_labels)
                        current_val_loss += val_loss.item()
                    model.train() #return to training state
                current_loss = current_loss/float(i+1)
                current_val_loss = current_val_loss/float(len(valloader))
                print('[%d, %4d] loss: %.4f  val loss: %.4f' % 
                      (epoch + 1, i + 1, current_loss , current_val_loss))

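                if current_val_loss < min_loss:
                    min_loss = current_val_loss
                    #clone the tensors: state_dict() returns references that keep changing during training
                    min_model = {k: v.clone() for k, v in model.state_dict().items()}
                    min_epoch = epoch
                elif epoch-min_epoch==patience:
                    model.load_state_dict(min_model)
                    should_stop = True
                    break

                history['loss'].append(current_loss)
                history['val_loss'].append(current_val_loss)

        #checked once per epoch, outside the batch loop, so that the epoch loop really stops
        if should_stop:
            break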

    print('Finished Training')
    if name is not None:
        filename_save = 'data/L14/' + name + '.pt'
        torch.save(model.state_dict(), filename_save)
    return history
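
The stopping rule itself is easier to see when it is isolated from the training loop. The sketch below applies the same kind of rule (stop once the best validation loss is `patience` epochs old) to a made-up list of validation losses. The helper name and the loss values are hypothetical.

```python
def early_stop_epoch(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    stop_epoch is None if the patience criterion is never met.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch == patience:
            return best_epoch, epoch
    return best_epoch, None

# Hypothetical validation losses: improvement, then a plateau with small fluctuations
toy_losses = [0.50, 0.40, 0.35, 0.33, 0.34, 0.335, 0.34, 0.36, 0.35, 0.37]
print(early_stop_epoch(toy_losses, patience=5))
print(early_stop_epoch(toy_losses, patience=10))
```

A larger `patience` tolerates longer plateaus before giving up, at the cost of more training time.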

Let's now run for a number of epochs and see if we get to a point where the training stops.

In [None]:
history_lr = train(model_lr,trainloader,valloader,name='lr_model')

Like before, we can also make a plot of the evolution to see if our loss values are behaving the way we expect, and to judge what is going on. 

In [None]:
#>>>RUN: L14.3-runcell04

plt.semilogy(history_lr['loss'], label='loss')
plt.semilogy(history_lr['val_loss'], label='val_loss')
plt.legend(loc="upper right")
plt.xlabel('epoch')
plt.ylabel('loss (binary crossentropy)')
plt.show()

As you can see, the training continued until it appeared that either the validation loss had hit a minimum (and then stopped a few epochs later) or until 100 epochs had been run.

<h3>Applying to Testing Data</h3>

Now let's see what happens when we apply this set of network parameters to our test data. This is essentially the same setup we used for the validation data. However, in this case, we are looking at data which was never considered during the training.

In [None]:
#>>>RUN: L14.3-runcell05

def apply(model, testloader):
    with torch.no_grad():
        model.eval()
        outputs = []
        labels = []
        for data in testloader:
            test_inputs, test_labels = data
            outputs.append(model(test_inputs).numpy())
            labels.append(test_labels.numpy())
        model.train()

        Y_test_predict = outputs
        Y_test = labels

    Y_test_predict = np.concatenate(Y_test_predict)
    Y_test = np.concatenate(Y_test)
    
    return Y_test_predict,Y_test

Y_test_predict_lr, Y_test = apply(model_lr, testloader)

print(Y_test_predict_lr.shape)
print(Y_test.shape)

And now let's plot the distribution of the output of the network. 

In [None]:
#>>>RUN: L14.3-runcell06

plt.hist(Y_test_predict_lr[Y_test==0],histtype='step',color='r',density=True)
plt.hist(Y_test_predict_lr[Y_test==1],histtype='step',color='b',density=True)
plt.xlabel('Logistic Regression Discriminant')
plt.show()

As we wanted, the background (red) is peaked at 0, while the EG (blue) is closer to 1. 

<a name='exercises_14_3'></a>     

| [Top](#section_14_0) | [Restart Section](#section_14_3) | [Next Section](#section_14_4) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.3.1</span>

Which of the following features indicate that training is NOT performing successfully? Select all that apply:

A) The loss for the validation data differs significantly from that for the training data.\
B) The loss as a function of epoch flattens out for both data sets.\
C) The loss as a function of epoch is shifted slightly for the validation data compared to the training data.\
D) The loss as a function of epoch continues to decrease for the training data set, but remains constant for the validation data set.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.3.2</span>

We can approximate the uncertainty on the loss by assuming that the number of events in both datasets follows a Poisson distribution. Using this concept, complete the code below to calculate the statistical disagreement between the training loss and validation loss. Specifically, write a function that returns the difference between the losses in terms of the number of standard deviations.

**Extra:** How significant is the difference after the last epoch? Try plotting this as a function of epoch!

**Note:** The related plot may yield wildly different results, depending on how your network runs. Here we focus on how to write the function, instead of analyzing the output of the plot.

In [None]:
#>>>EXERCISE: L14.3.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def num_stdev(iNTrain,iNVal,loss1_array, loss2_array):
    #convert from list to array
    loss1_array = np.array(loss1_array)
    loss2_array = np.array(loss2_array)
    
    sigma_loss1 = #YOUR CODE HERE (the stdev of the training loss)
    sigma_loss2 = #YOUR CODE HERE (the stdev of the validation loss)
    
    #the combined uncertainty
    sigma_tot = np.sqrt(sigma_loss1**2. + sigma_loss2**2.) 
    
    #the difference in losses
    delta = loss2_array-loss1_array
    
    #calculate the difference in terms of number of standard deviations
    diff = abs(delta/sigma_tot) 
    
    return diff

#plot
#----------------------------------------------------
N_train = len(trainloader)*trainloader.batch_size
N_val   = len(valloader)*valloader.batch_size #the number of rows in the data set
diff_sig = num_stdev(N_train,N_val,history_lr['loss'], history_lr['val_loss'])
print("Significance of last epoch",diff_sig[-1])


plt.plot(np.arange(len(diff_sig)),diff_sig)
plt.xlabel("N-iteration")
plt.ylabel(r"|val-train|/$\sigma$")
plt.show()

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.3.3</span>

A common way to select events based on the final discriminator value from the neural network is to make normalized histograms (i.e. histograms with the same integral), as is done in the previous examples, and then to select only events above the value where the signal and background histograms cross. For the histogram produced by code cell `L14.3-runcell06`, what fractions of egamma events (blue) and background events (red) are above the bin of intersection (you should see this intersection occur at a value of `intersection_bin = 0.20`)?

Report your answer as a list of numbers with precision 1e-2: `[frac EG, frac PU]`


In [None]:
#>>>EXERCISE: L14.3.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

#determine fraction of events above intersection
intersection_bin = 0.20
EG_frac = #YOUR CODE HERE
PU_frac = #YOUR CODE HERE

print("EG:", EG_frac)
print("PU:", PU_frac)

>#### Follow-up 14.3.3a (ungraded)
>
>Define a two-weight network, as we had in the last Lesson, and run the training of your two-weight network on the data. What do the resulting weights look like? Can you make a 1D histogram of the separation? Play with the smearing parameter, how do things change?
>
>**NOTE:** Be sure to label your classes, functions, and outputs differently. We will continue to use the results from above, so do not get your previous results confused with your results from this follow-up exercise!

In [None]:
#>>>EXERCISE: L14.3.3a
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

class LR_net_2(torch.nn.Module):
    #YOUR CODE HERE

        
model_lr_2 = LR_net_2()
print(model_lr_2)
print('----------')
print(model_lr_2.state_dict())


#-----------------
#TRAIN THE NETWORK
#YOUR CODE HERE

<a name='section_14_4'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.4 Adding a Hidden Layer</h2>     

| [Top](#section_14_0) | [Previous Section](#section_14_3) | [Exercises](#exercises_14_4) | [Next Section](#section_14_5) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS14/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS14_vid4" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

What we built above was a small 1-layer network, which is not really representative of the power of deep learning. Instead of just a few weights, let's now increase the number of free parameters by adding more layers. This will allow the network to be much more expressive in the way it discriminates events. The details of exactly what is added to the neural network in the code below are described in the video. Let's start by declaring the updated model.

In [None]:
#>>>RUN: L14.4-runcell01

class MLP2_net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(ninputs,30)
        self.act1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(30,10)
        self.act2 = torch.nn.ReLU()
        self.fc3 = torch.nn.Linear(10,1)
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.act1(x)
        x = self.fc2(x)
        x = self.act2(x)
        x = self.fc3(x)
        x = self.output(x)
        return x
    
torch.random.manual_seed(42)  # fix a random seed for reproducibility

model_mlp_2layer = MLP2_net()
print(model_mlp_2layer)
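
To get a feel for how much the number of free parameters grows, the count can be worked out by hand. The sketch below is a standalone illustration: `count_linear_params` is a hypothetical helper, `n_inputs = 20` is a placeholder (the notebook takes `ninputs` from the data), and the logistic regression is assumed to be a single linear layer with one output.

```python
def count_linear_params(layer_sizes):
    """Number of weights plus biases in a chain of fully connected layers.

    layer_sizes = [n_in, h1, h2, ..., n_out]; each Linear(a, b) layer has
    a*b weights and b biases.
    """
    return sum(a * b + b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))


n_inputs = 20  # hypothetical value; the notebook defines `ninputs` from the data

n_lr = count_linear_params([n_inputs, 1])           # assumed logistic regression
n_mlp = count_linear_params([n_inputs, 30, 10, 1])  # layer sizes of MLP2_net
print("logistic regression parameters:", n_lr)
print("MLP parameters:", n_mlp)
```

With these assumed numbers the hidden layers raise the parameter count from 21 to 951, which is where the extra flexibility comes from.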

Now, we can go ahead and train. Pay attention to the loss value compared to before: you should already start to see signs that the performance is better, because we have made the model much more flexible by adding hidden layers. Let's run the training. 

In [None]:
#>>>RUN: L14.4-runcell02

history_mlp_2layer = train(model_mlp_2layer,trainloader,valloader,name='mlp_2layer_model')
Y_test_predict_mlp_2layer, Y_test = apply(model_mlp_2layer, testloader)

Like we did before, we can again go ahead and scan our training to make sure the loss is close to convergence and we haven't overtrained. 

In [None]:
#>>>RUN: L14.4-runcell03

plt.semilogy(history_mlp_2layer['loss'], label='loss')
plt.semilogy(history_mlp_2layer['val_loss'], label='val_loss')
plt.legend(loc="upper right")
plt.xlabel('epoch')
plt.ylabel('loss (binary crossentropy)')
plt.show()

plt.hist(Y_test_predict_mlp_2layer[Y_test==0],histtype='step',color='r',density=True)
plt.hist(Y_test_predict_mlp_2layer[Y_test==1],histtype='step',color='b',density=True)
plt.xlabel('MLP (2 hidden layers) Discriminant')
plt.show()

We can check the signal and background efficiencies based on their distributions versus discriminant value. Compared to the plot generated by code cell `L14.3-runcell06`, there appears to be a dramatic improvement! The tail of the background histogram is much lower, and the signal histogram is now peaked at 1. So, it looks like we are keeping more signal and getting more background rejection. As a quantitative check, let's repeat what we did in `Ex-14.3.3`. However, since the two histograms now cross at 0.1, let's extract the integrated values above that bin.

In [None]:
#>>>RUN: L14.4-runcell04

sig_scores = Y_test_predict_mlp_2layer[Y_test==1]
bkg_scores = Y_test_predict_mlp_2layer[Y_test==0]
print("Signal:", len(sig_scores[sig_scores > 0.10])/len(sig_scores))
print("Bkg:", len(bkg_scores[bkg_scores > 0.10])/len(bkg_scores))

<h3>Plotting the ROC</h3>

This does look better than the logistic regression. But how would we establish that improvement in more detail? A Receiver Operating Characteristic (ROC) curve is a typical way to compare multiple algorithms. Basically, we are going to make a requirement on the neural network output: if it is below a given value we will call it background, and if it is above that value it's EG. We can scan this cutoff value between 0 and 1 and then plot the fraction of background and signal at each point, thereby quantifying how well a particular cutoff predicts the correct labels. 


More specifically, we can define two axes for the ROC as 

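 * $\epsilon_{s}(c)=\int_{c}^{\infty} {\rm Disc}(x|x_{i}\in{\rm Signal})\, dx$
 * $\epsilon_{b}(c)=\int_{c}^{\infty} {\rm Disc}(x|x_{i}\in{\rm Background})\, dx$

Here $c$ is the cutoff value and ${\rm Disc}$ is the normalized distribution of the network output, so each efficiency is the fraction of events of that type with an output above the cutoff.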
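
Before scanning the full range of cutoffs, it can help to evaluate a single one by hand. The sketch below uses made-up labels and scores (none of these numbers come from the LHC dataset) and a hypothetical helper `efficiencies_at_cut`. It follows the convention described above, where events with an output above the cutoff are called EG.

```python
import numpy as np

# Toy labels (0 = background, 1 = signal) and toy network outputs; illustrative only.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_scores = np.array([0.05, 0.20, 0.40, 0.70, 0.30, 0.60, 0.80, 0.95])


def efficiencies_at_cut(labels, scores, cut):
    """Fraction of signal and of background events with score above `cut`."""
    eps_s = np.mean(scores[labels == 1] > cut)
    eps_b = np.mean(scores[labels == 0] > cut)
    return eps_s, eps_b


eps_s, eps_b = efficiencies_at_cut(toy_labels, toy_scores, cut=0.5)
print("signal efficiency:", eps_s)      # 3 of 4 signal events pass -> 0.75
print("background efficiency:", eps_b)  # 1 of 4 background events pass -> 0.25
```

Each cutoff gives one such pair of numbers, and the ROC curve is the set of these pairs over all cutoffs.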

In [None]:
#>>>RUN: L14.4-runcell05

def compute_ROC(labels, predicts, npts=101):
    cutvals = np.linspace(0.,1.,num=npts)
    tot0 = float(len(labels[labels==0]))
    tot1 = float(len(labels[labels==1]))
    tpr = []
    fpr = []
    for c in cutvals:
        fpr.append(float(len(predicts[(labels==0) & (predicts>c)]))/tot0)
        tpr.append(float(len(predicts[(labels==1) & (predicts>c)]))/tot1)
    
    return np.array(fpr),np.array(tpr)

mlp_2layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_2layer)
lr_rocpts = compute_ROC(Y_test,Y_test_predict_lr)

plt.plot(mlp_2layer_rocpts[0],mlp_2layer_rocpts[1],'g-',label="MLP (2 hidden layers)")
plt.plot(lr_rocpts[0],lr_rocpts[1],'m--',label="Logistic Regression")
plt.title("ROC (Receiver Operating Characteristic) Curve")
plt.xlabel("False Positive Rate (FPR) aka Background Efficiency")
plt.ylabel("True Positive Rate (TPR) aka Signal Efficiency")
plt.legend(loc="lower right")
plt.show()

Reading a ROC can take a second if you haven't seen one before. The horizontal and vertical axes show the fraction of the background and signal, respectively, that are accepted for different values of the discriminant cut. The fixed endpoints at (1,1) and (0,0) (which are identical for any model) represent the trivial cases of accepting or rejecting all events, respectively.

The key is usually to find an algorithm that gets as close as possible to the **top left corner**, since that represents rejecting more background (moving to the left) with higher efficiency for the signal (moving up). As you can see by looking at this top left corner, the model with hidden layers does do better than the earlier logistic regression.

**Note:** The "background efficiency" and "false positive rate" mean the same thing. They are complementary to the "background rejection rate," which is the fraction of background events that are rejected. In other words, a background efficiency of 0.03 corresponds to a background rejection rate of 0.97.

However, picking a specific point on the curve to use in the data analysis (sometimes called the "operating" point) is not always as simple as being as close as possible to the top left corner. For example, if the background appears as a relatively smooth distribution underlying a narrow signal peak, it may be acceptable to include more background in order to take advantage of the corresponding increase in signal efficiency. In other words, one would move the operating point up and to the right on the curve. Alternatively, if it's necessary to maximally reduce the background, and the dataset has more than enough signal events, the operating point would be moved in the opposite direction.
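
As a baseline for this discussion, the point on a ROC curve that is geometrically closest to the top left corner can be located directly. The sketch below is only an illustration: `closest_to_top_left` is a hypothetical helper and the ROC points are made-up numbers, not output from the models above.

```python
import numpy as np


def closest_to_top_left(fpr, tpr):
    """Index of the ROC point nearest to the ideal corner (FPR=0, TPR=1)."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    distance = np.sqrt(fpr**2 + (1.0 - tpr)**2)
    return int(np.argmin(distance))


# Toy ROC points (hypothetical numbers), ordered from loose to tight cuts.
toy_fpr = np.array([1.00, 0.50, 0.20, 0.05, 0.01, 0.00])
toy_tpr = np.array([1.00, 0.99, 0.95, 0.85, 0.50, 0.00])

i = closest_to_top_left(toy_fpr, toy_tpr)
print("closest point: FPR = %.2f, TPR = %.2f" % (toy_fpr[i], toy_tpr[i]))
```

An analysis that needs more signal would move to a looser cut (a smaller index in these toy arrays), while one that needs a purer sample would move to a tighter cut.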

<a name='exercises_14_4'></a>   

| [Top](#section_14_0) | [Restart Section](#section_14_4) | [Next Section](#section_14_5) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.4.1</span>

When we compare the performance of two algorithms, we often like to fix the signal efficiency and look at the change in the background rejection rate. For a fixed signal efficiency of 97%, what is the fractional reduction in the false-positive rate between the logistic and MLP networks (i.e. the difference between the two divided by the value for the logistic)? Report your answer as a number with precision 5e-2.

In [None]:
#>>>EXERCISE: L14.4.1
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.


def frac_reduc_fpr(lr_rocpts, mlp_2layer_rocpts, sig_eff=0.97):
    #false-positive-rate (background efficiency)
    #true-positive-rate (signal efficiency):
    lr_fpr = lr_rocpts[0]
    lr_tpr = lr_rocpts[1]
    mlp_2layer_fpr = mlp_2layer_rocpts[0]
    mlp_2layer_tpr = mlp_2layer_rocpts[1]
    
    #find lr_fpr where lr_tpr is closest to sig_eff
    lr_fpr_val = #YOUR CODE HERE
    
    #find mlp_2layer_fpr where mlp_2layer_tpr is closest to sig_eff
    mlp_2layer_fpr_val = #YOUR CODE HERE
    
    #calculate the fractional reduction in false-positive-rate
    frac_red = #YOUR CODE HERE
    
    return frac_red

#find where the signal efficiency is 97%
#find difference in logistic vs. MLP
print("Fractional Reduction in FPR:",frac_reduc_fpr(lr_rocpts, mlp_2layer_rocpts, 0.97))

<a name='section_14_5'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L14.5 Regularization, Batch Normalization, and Dropout</h2>     

| [Top](#section_14_0) | [Previous Section](#section_14_4) | [Exercises](#exercises_14_5) |


<h3>Regularization and Network Tuning</h3>

*Note: There is no corresponding video for this section.*

The 2-layer MLP we made does indeed do better, especially if we want to reject as much of the background as possible.

Let's try an even larger network with additional layers. Also, as discussed below, we'll add some other techniques to our network in order to further improve the quality of our training. 

In [None]:
#>>>RUN: L14.5-runcell01

class MLP3_net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(ninputs,50)
        self.act1 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(50,30)
        self.act2 = torch.nn.ReLU()
        self.fc3 = torch.nn.Linear(30,10)
        self.act3 = torch.nn.ReLU()
        self.fc4 = torch.nn.Linear(10,1)
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.act1(x)
        x = self.fc2(x)
        x = self.act2(x)
        x = self.fc3(x)
        x = self.act3(x)
        x = self.fc4(x)
        x = self.output(x)
        return x
    
torch.random.manual_seed(42)  # fix a random seed for reproducibility
model_mlp_3layer = MLP3_net()
print(model_mlp_3layer)

<h3>Regularization</h3>

Let's also try to use a form of regularization, in this case L2. If left unchecked, larger networks especially can begin to find and abuse certain subtle features that we perhaps don't want them to. The obvious case is if the feature is a statistical glitch only present in the training set. In such a situation, we would be hurting ourselves by focusing on "fitting" that. L2 regularization adds a "penalty term" to the loss function, which is a function of the magnitude squared of the weight values. 

$$\mathcal{L} = \mathcal{L}_\textrm{BCE} + \lambda\sum |W|^2$$

We can control the relative importance of this term via the $\lambda$ parameter. Because the network is encouraged to keep the weights small, it is less able to magnify the importance of one particular feature/node. In `pytorch`, this is done by adding options to the optimizer. 
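
The formula above can be evaluated directly on a small example. The sketch below uses plain `numpy` with made-up labels, predictions, and weights, and the helper names `bce_loss` and `l2_penalty` are hypothetical. It is not how `train()` applies the penalty; PyTorch optimizers such as `torch.optim.SGD` accept a `weight_decay` argument that plays a similar role, and it is an assumption here that `l2reg` is passed along that way.

```python
import numpy as np


def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))


def l2_penalty(weights, lam):
    """lambda times the sum of squared entries over all weight arrays."""
    return lam * sum(np.sum(w**2) for w in weights)


# Toy labels, predictions, and weights (illustrative values only).
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])
toy_weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([3.0, -1.0])]

lam = 1e-4
total_loss = bce_loss(y_true, y_pred) + l2_penalty(toy_weights, lam)
print("BCE:", bce_loss(y_true, y_pred))
print("L2 penalty:", l2_penalty(toy_weights, lam))
print("total loss:", total_loss)
```

With a small $\lambda$ the penalty is a minor correction to the loss, but it grows quadratically if any single weight becomes large.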

In [None]:
#>>>RUN: L14.5-runcell02

# NOTE: we add the l2 regularization to the optimizer by setting l2reg=0.0001, which modifies the loss.
# The l2 regularization is already implemented in the train() function, with default value l2reg=0.

#you may choose to load the saved model instead, if you have already trained it
#model_mlp_3layer.load_state_dict(torch.load('data/L14/mlp_3layer_model.pt'))

history_mlp_3layer = train(model_mlp_3layer,trainloader,valloader,l2reg=0.0001,name='mlp_3layer_model')
Y_test_predict_mlp_3layer, Y_test = apply(model_mlp_3layer, testloader)

In [None]:
#>>>RUN: L14.5-runcell03

mlp_2layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_2layer)
mlp_3layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer)
lr_rocpts = compute_ROC(Y_test,Y_test_predict_lr)

plt.plot(mlp_2layer_rocpts[0],mlp_2layer_rocpts[1],'g-',label="MLP (2 hidden layers)")
plt.plot(mlp_3layer_rocpts[0],mlp_3layer_rocpts[1],'--',color='orange',label="MLP (3 hidden layers)")
plt.plot(lr_rocpts[0],lr_rocpts[1],'m--',label="Logistic Regression")
plt.title("ROC (Receiver Operating Characteristic) Curve")
plt.xlabel("False Positive Rate (FPR) aka Background Efficiency")
plt.ylabel("True Positive Rate (TPR) aka Signal Efficiency")
plt.legend(loc="lower right")
plt.show()

Hmmm... It's hard to see a dramatic improvement. However, perhaps a plot of the efficiencies of accepting signal and background ($\epsilon_{s}$ and $\epsilon_{b}$, respectively) is not the best way to evaluate the performance. Let's instead use $1/\epsilon_{b}$ for the horizontal axis to really bring out the behavior for low background acceptance.

In this new plot below, large values of $1/\epsilon_{b}$ correspond to small values of background efficiency. Thus, these large values of $1/\epsilon_{b}$ are also related to larger values of the background rejection rate. So, for a fixed value of signal efficiency, we would want a large value of $1/\epsilon_{b}$.

From this alternative way of plotting, you can see if there are very large gains in the background rejection rate, especially at lower signal efficiency. Recall that there can be experimental conditions where maximizing background rejection is the highest priority.
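
A few numbers make the effect of this change of axis concrete. The sketch below uses made-up background efficiencies (not values from the models above) and also shows one way to handle points where no background passes the cut, for which $1/\epsilon_{b}$ is undefined.

```python
import numpy as np

# Toy background efficiencies (hypothetical values), from loose to tight cuts.
eps_b = np.array([0.5, 0.1, 0.01, 0.001, 0.0])

rejection_rate = 1.0 - eps_b

# 1/eps_b is undefined where no background passes; mark those points as inf
# explicitly instead of dividing by zero.
inv_eps_b = np.full_like(eps_b, np.inf)
nonzero = eps_b > 0
inv_eps_b[nonzero] = 1.0 / eps_b[nonzero]

for e, r, inv in zip(eps_b, rejection_rate, inv_eps_b):
    print("eps_b = %.3f   rejection = %.3f   1/eps_b = %.0f" % (e, r, inv))
```

Going from $\epsilon_{b}=0.01$ to $\epsilon_{b}=0.001$ changes the rejection rate only from 0.99 to 0.999, which is nearly invisible on a linear axis, while $1/\epsilon_{b}$ jumps from 100 to 1000.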

In [None]:
#>>>RUN: L14.5-runcell04

mlp_2layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_2layer,101)
mlp_3layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer,101)
lr_rocpts = compute_ROC(Y_test,Y_test_predict_lr,101)

plt.plot(1./mlp_2layer_rocpts[0],mlp_2layer_rocpts[1],'g-',label="MLP (2 hidden layers)")
plt.plot(1./mlp_3layer_rocpts[0],mlp_3layer_rocpts[1],'--',color='orange',label="MLP (3 hidden layers)")
plt.plot(1./lr_rocpts[0],lr_rocpts[1],'m--',label="Logistic Regression")
plt.xlim([-1, 2500])
plt.title("ROC (Receiver Operating Characteristic) Curve")
plt.xlabel("1/(Background Efficiency)")
plt.ylabel("True Positive Rate (TPR) aka Signal Efficiency")
plt.legend(loc="upper right")
plt.show()

Okay, let's look at a fixed signal efficiency of 50%. The model with the highest value of $1/\epsilon_{b}$ will be the best.

This way of plotting shows that the multi-layer models can achieve huge improvements in background rejection if a reduction in signal efficiency is acceptable. However, it appears that we may have saturated the potential improvement already with the 2-layer version.

<h3>Batch Normalization and Dropout</h3>

To finish, let's try two other types of regularizer: *batch normalization* and *dropout*.

Batch normalization works by rescaling each input to a layer, within each training batch, such that its mean and standard deviation are 0 and 1, respectively. This helps make sure that the values passed from each node to the following layer are on a similar scale.
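
The rescaling step itself is short. The sketch below is a simplified `numpy` illustration with made-up data and a hypothetical helper `batch_normalize`; it omits the learnable scale and shift that a full layer such as `torch.nn.BatchNorm1d` also includes.

```python
import numpy as np


def batch_normalize(x, eps=1e-5):
    """Normalize each column (feature) of a batch to mean 0 and variance 1.

    x has shape (batch_size, n_features). The small eps avoids division by
    zero for features that are constant within the batch.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
# Toy batch: two features with very different scales (illustrative values).
batch = np.column_stack([rng.normal(100.0, 20.0, size=256),
                         rng.normal(0.5, 0.1, size=256)])

normalized = batch_normalize(batch)
print("means before:", batch.mean(axis=0), " after:", normalized.mean(axis=0))
print("stds  before:", batch.std(axis=0), " after:", normalized.std(axis=0))
```

After normalization both features have comparable ranges, even though they started out differing by more than two orders of magnitude. In evaluation mode, `torch.nn.BatchNorm1d` uses running averages collected during training rather than the statistics of the current batch.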

Dropout works by randomly removing a given fraction of the nodes in a layer during each training pass. This helps ensure that no one node becomes crucially important to the final result.
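
The random removal can be sketched in a few lines. The example below is a `numpy` illustration with made-up activations and a hypothetical helper `dropout`. It uses the "inverted dropout" convention, in which the surviving values are divided by $1-p$ so that their expected value is unchanged; `torch.nn.Dropout` applies the same rescaling during training.

```python
import numpy as np


def dropout(x, p, rng):
    """Zero each entry of x with probability p and rescale the survivors.

    Dividing by (1 - p) keeps the expected value of each entry unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)


rng = np.random.default_rng(0)
activations = np.ones((1000, 50))  # toy layer output (illustrative)

dropped = dropout(activations, p=0.1, rng=rng)
print("fraction of nodes zeroed:", np.mean(dropped == 0.0))
print("mean activation after dropout:", dropped.mean())
```

Dropout is only active in training mode. The `apply()` function above calls `model.eval()`, which disables it so that the test predictions use every node.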

First, we define both networks.

In [None]:
#>>>RUN: L14.5-runcell05

class MLP3_BN_net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bn0 = torch.nn.BatchNorm1d(ninputs)
        self.fc1 = torch.nn.Linear(ninputs,50)
        self.act1 = torch.nn.ReLU()
        self.bn1 = torch.nn.BatchNorm1d(50)
        self.fc2 = torch.nn.Linear(50,30)
        self.act2 = torch.nn.ReLU()
        self.bn2 = torch.nn.BatchNorm1d(30)
        self.fc3 = torch.nn.Linear(30,10)
        self.act3 = torch.nn.ReLU()
        self.bn3 = torch.nn.BatchNorm1d(10)
        self.fc4 = torch.nn.Linear(10,1)
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.bn0(x)
        x = self.fc1(x)
        x = self.act1(x)
        x = self.bn1(x)
        x = self.fc2(x)
        x = self.act2(x)
        x = self.bn2(x)
        x = self.fc3(x)
        x = self.act3(x)
        x = self.bn3(x)
        x = self.fc4(x)
        x = self.output(x)
        return x
    
class MLP3_Drop_net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(ninputs,50)
        self.act1 = torch.nn.ReLU()
        self.drop1 = torch.nn.Dropout(0.1)
        self.fc2 = torch.nn.Linear(50,30)
        self.act2 = torch.nn.ReLU()
        self.drop2 = torch.nn.Dropout(0.1)
        self.fc3 = torch.nn.Linear(30,10)
        self.act3 = torch.nn.ReLU()
        self.drop3 = torch.nn.Dropout(0.1)
        self.fc4 = torch.nn.Linear(10,1)
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.act1(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.act2(x)
        x = self.drop2(x)
        x = self.fc3(x)
        x = self.act3(x)
        x = self.drop3(x)
        x = self.fc4(x)
        x = self.output(x)
        return x

torch.random.manual_seed(42)  # fix a random seed for reproducibility
model_mlp_3layer_bn = MLP3_BN_net()
print(model_mlp_3layer_bn)

torch.random.manual_seed(42)  # fix a random seed for reproducibility
model_mlp_3layer_drop = MLP3_Drop_net()
print(model_mlp_3layer_drop)

Now, we train the network with batch normalization.

In [None]:
#>>>RUN: L14.5-runcell06

history_mlp_3layer_bn = train(model_mlp_3layer_bn,trainloader,valloader,name='mlp_3layer_bn_model')
Y_test_predict_mlp_3layer_bn, Y_test = apply(model_mlp_3layer_bn, testloader)

Next, we train the dropout network.

In [None]:
#>>>RUN: L14.5-runcell07

history_mlp_3layer_drop = train(model_mlp_3layer_drop,trainloader,valloader,name='mlp_3layer_drop_model')
Y_test_predict_mlp_3layer_drop, Y_test = apply(model_mlp_3layer_drop, testloader)

<h3>Looking at the ROC</h3>

Finally, we look at the ROC and compare the standard 3-layer model with those using batch normalization and dropout regularization.

In [None]:
#>>>RUN: L14.5-runcell08

mlp_3layer_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer,101)
mlp_3layer_bn_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer_bn,101)
mlp_3layer_drop_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer_drop,101)

fig, (ax1, ax2) = plt.subplots(1,2,figsize=(12,4))

ax1.plot(mlp_3layer_rocpts[0],mlp_3layer_rocpts[1],'--',color='orange',label="MLP (3 hidden layers)")
ax1.plot(mlp_3layer_bn_rocpts[0],mlp_3layer_bn_rocpts[1],'--',color='brown',label="MLP (3 hidden layers w/ BN)")
ax1.plot(mlp_3layer_drop_rocpts[0],mlp_3layer_drop_rocpts[1],'--',color='cyan',label="MLP (3 hidden layers w/ Dropout)")
ax1.set_title("ROC (Receiver Operating Characteristic) Curve")
ax1.set_xlabel("Bkg Eff")
ax1.set_ylabel("Sig Eff")
ax1.legend(loc="lower right")

ax2.plot(1./mlp_3layer_rocpts[0],mlp_3layer_rocpts[1],'--',color='orange',label="MLP (3 hidden layers)")
ax2.plot(1./mlp_3layer_bn_rocpts[0],mlp_3layer_bn_rocpts[1],'--',color='brown',label="MLP (3 hidden layers w/ BN)")
ax2.plot(1./mlp_3layer_drop_rocpts[0],mlp_3layer_drop_rocpts[1],'--',color='cyan',label="MLP (3 hidden layers w/ Dropout)")
ax2.set_title("ROC (Receiver Operating Characteristic) Curve")
ax2.set_xlabel("1/Bkg Eff")
ax2.set_xlim([-1, 2500])
ax2.set_ylabel("Sig Eff")
ax2.legend(loc="upper right")

plt.show()

You now have all the tools you need to start developing, training, and testing your own neural networks!

<a name='exercises_14_5'></a>   

| [Top](#section_14_0) | [Restart Section](#section_14_5) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.5.1</span>

Which of the following statements is true about regularization in machine learning?

A) Regularization is a technique used to increase the complexity of a model to improve its performance on new data.\
B) Regularization is a technique used to prevent overfitting by employing a variety of methods, including adding a penalty term to the loss function.\
C) Regularization is a technique used to reduce the size of the training data to improve generalization performance.\
D) Regularization is a technique used to randomly drop out neurons in a neural network during training.\
E) Regularization is a technique that normalizes the mean and variance of the activations of each layer in a neural network.


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.5.2</span>

For a fixed signal efficiency of 97%, what is the fractional reduction in the false-positive rate going from a 3 hidden layer model to ones adding either batch normalization or dropout? Complete the code below to do this calculation, then report your answer as a list of numbers with precision 1e-2: `[reduction with batch norm, reduction with dropout]`

**HINT:** Use the function that you defined previously.

In [None]:
#>>>EXERCISE: L14.5.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def frac_reduc_fpr(array1, array2, sig_eff=0.97):
    #false-positive-rate (background efficiency)
    #true-positive-rate (signal efficiency):
    array1_fpr = array1[0]
    array1_tpr = array1[1]
    array2_fpr = array2[0]
    array2_tpr = array2[1]
    
    #find array1_fpr where array1_tpr is closest to sig_eff
    array1_fpr_val = #YOUR CODE HERE
    
    #find array2_fpr where array2_tpr is closest to sig_eff
    array2_fpr_val = #YOUR CODE HERE
    
    #calculate the fractional reduction in false-positive-rate
    frac_red = #YOUR CODE HERE
    
    return frac_red


#find where the signal efficiency is 97%
print("Fractional Reduction from 3Layer to Batch Norm :",frac_reduc_fpr(mlp_3layer_rocpts, mlp_3layer_bn_rocpts, 0.97))
print("Fractional Reduction from 3Layer to Dropout:",frac_reduc_fpr(mlp_3layer_rocpts, mlp_3layer_drop_rocpts, 0.97))

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Ex-14.5.3</span>

Try including both batch normalization *and* dropout. Do you get even better performance? Complete the code below to define and run this new model, then select the best answer from the following:

A) Yes, it's definitely better to combine both regularization methods.\
B) No, combining methods was worse than models using either one separately.\
C) The combined model was maybe better than one model, but not better than both.       

In [None]:
#>>>EXERCISE: L14.5.3
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

class MLP3_BN_Dropout_net(torch.nn.Module):
    #YOUR CODE HERE

    
torch.random.manual_seed(42)  # fix a random seed for reproducibility
model_mlp_3layer_bn_drop = MLP3_BN_Dropout_net()
print(model_mlp_3layer_bn_drop)

history_mlp_3layer_bn_drop = train(model_mlp_3layer_bn_drop,trainloader,valloader,name='mlp_3layer_bn_drop_model',nepochs=100)
Y_test_predict_mlp_3layer_bn_drop, Y_test = apply(model_mlp_3layer_bn_drop, testloader)
mlp_3layer_bn_drop_rocpts = compute_ROC(Y_test,Y_test_predict_mlp_3layer_bn_drop,501)
    
#find where the signal efficiency is 97%
print("Fractional Reduction from Batch Norm to Combined Model:",frac_reduc_fpr(mlp_3layer_bn_rocpts, mlp_3layer_bn_drop_rocpts, 0.97))
print("Fractional Reduction from Dropout to Combined Model:",frac_reduc_fpr(mlp_3layer_drop_rocpts, mlp_3layer_bn_drop_rocpts, 0.97))