<hr style="height: 1px;">
<i>This notebook was authored by the 8.S50x Course Team, Copyright 2022 MIT All Rights Reserved.</i>
<hr style="height: 1px;">
<br>

<h1>Lesson 15: Deep Learning Regression</h1>


<a name='section_15_0'></a>
<hr style="height: 1px;">


## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.0 Overview</h2>


<h3>Navigation</h3>

<table style="width:100%">
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_15_1">L15.1 Discovering the Higgs with Deep Learning</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_15_1">L15.1 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_15_2">L15.2 Minimizing Loss with a Neural Network</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_15_2">L15.2 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_15_3">L15.3 An Example with PyTorch: Fitting a Parabola</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_15_3">L15.3 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_15_4">L15.4 Another Example: Fitting a Sine Function</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_15_4">L15.4 Exercises</a></td>
    </tr>
    <tr>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#section_15_5">L15.5 Sine Function Continued: Adjusting the Network</a></td>
        <td style="text-align: left; vertical-align: top; font-size: 10pt;"><a href="#exercises_15_5">L15.5 Exercises</a></td>
    </tr>
</table>

<h3>Learning Objectives</h3>

It is often the case that you want to fit a function to data. However, sometimes you want to do this in many dimensions and you can't visualize it all. In this lecture, we are going to look at deep learning regression. This can help you to model these complex scenarios. 

Let's say you have some data $\vec{x}$, which can be an $n$-dimensional set of inputs, and you want to predict an output $\vec{y}$ from $\vec{x}$, where $\vec{y}$ can be an $m$-dimensional set of outputs. What we want then is to create a function

$$
\begin{equation}
 \vec{y} = f(\vec{x})
\end{equation}
$$

Earlier in this course, we did this for a one-dimensional $y$ taking in a set of inputs $\vec{x}$. In this lesson, we aim to generalize this to predict an arbitrary number of outputs from an arbitrary set of inputs. In particular, we will explore the following learning objectives:

 - Fitting an arbitrary 1D dataset with a neural net
 - Deep Learning algorithm design
 - Studying decays of Higgs bosons to Tau leptons
 - Observing the improvements
 - Optimized target
 - The full mass regression
 - NN architecture


<h3>Installing Tools</h3>

Before we do anything, let's make sure we install the tools we need for this.

In [None]:
#>>>RUN: L15.0-runcell00

!pip install torch
!pip install imageio
!pip install awkward
!pip install george
!pip install uproot
!pip install pylorentz

<h3>Importing Libraries</h3>

Before beginning, run the cell below to import the relevant libraries for this notebook. 

In [None]:
#>>>RUN: L15.0-runcell01

import torch                        #https://pytorch.org/docs/stable/torch.html
import torch.nn as nn               #https://pytorch.org/docs/stable/nn.html
from torch.autograd import Variable #https://pytorch.org/docs/stable/autograd.html
import torch.nn.functional as F     #https://pytorch.org/docs/stable/nn.functional.html
import torch.utils.data as Data     #https://pytorch.org/docs/stable/data.html

import matplotlib.pyplot as plt     #https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
%matplotlib inline

import numpy as np                  #https://numpy.org/doc/stable/
import imageio                      #https://imageio.readthedocs.io/en/stable/
import george                       #https://george.readthedocs.io/en/latest/
from george import kernels          #https://george.readthedocs.io/en/latest/user/kernels/

<h3>Setting Default Figure Parameters</h3>

The following code cell sets default values for figure parameters.

In [None]:
#>>>RUN: L15.0-runcell02

#set plot resolution
%config InlineBackend.figure_format = 'retina'

#set default figure parameters
plt.rcParams['figure.figsize'] = (9,6)

medium_size = 12
large_size = 15

plt.rc('font', size=medium_size)          # default text sizes
plt.rc('xtick', labelsize=medium_size)    # xtick labels
plt.rc('ytick', labelsize=medium_size)    # ytick labels
plt.rc('legend', fontsize=medium_size)    # legend
plt.rc('axes', titlesize=large_size)      # axes title
plt.rc('axes', labelsize=large_size)      # x and y labels
plt.rc('figure', titlesize=large_size)    # figure title

<a name='section_15_1'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.1 Discovering the Higgs with Deep Learning</h2>  

| [Top](#section_15_0) | [Previous Section](#section_15_0) | [Exercises](#exercises_15_1) | [Next Section](#section_15_2) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS15/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS15_vid1" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L19/slides_L19_01.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L15.1-slides


from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L19/slides_L19_01.html', width=970, height=550)

<a name='exercises_15_1'></a>     

| [Top](#section_15_0) | [Restart Section](#section_15_1) | [Next Section](#section_15_2) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.1.1</span>

Look again at the Higgs discovery plots. At the Higgs mass, the CMS experiment is about 30% more sensitive than ATLAS, but both experiments found approximately the same excess. This is because:

A) The ATLAS experiment was less sensitive, but had more data.\
B) The ATLAS experiment had more advanced machine learning analysis tools.\
C) This was just a random fluctuation, and not a particularly unlikely one.\
D) The CMS data was noisier.

<a name='section_15_2'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.2 Minimizing Loss with a Neural Network</h2>  

| [Top](#section_15_0) | [Previous Section](#section_15_1) | [Exercises](#exercises_15_2) | [Next Section](#section_15_3) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS15/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS15_vid2" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Slides</h3>

Run the code below to view the slides for this section, which are discussed in the related video. You can also open the slides in a separate window <a href="https://mitx-8s50.github.io/slides/L19/slides_L19_02.html" target="_blank">HERE</a>.

In [None]:
#>>>RUN: L15.2-slides

from IPython.display import IFrame
IFrame(src='https://mitx-8s50.github.io/slides/L19/slides_L19_02.html', width=970, height=550)

<h3>Fitting an arbitrary 1D dataset with a neural net</h3>

What we would like to do is fit a distribution without an initial choice of a function. To envision this distribution, let's create a Gaussian function that is smeared out a little bit. From that, we can try to fit this dataset to get a functional form for this. 

In [None]:
#>>>RUN: L15.2-runcell01

#Let's Try GP on a Gaussian random set of points
def gaussian(mu,sigma,norm,offset):
    """Returns a gaussian function with the given parameters"""
    return lambda x: norm*np.exp(-(1./2.)*((x-mu)/(sigma))**2)+offset

Xin = np.mgrid[0:201] # integer points from 0 to 200 (201 points)
data = gaussian(100., 20., 10., 5)(Xin) + 5*np.random.random(Xin.shape) # Gaussian + smearing
plt.plot(data,"*")
plt.show()

Now, one way to model this distribution is through a Gaussian process. This will try to fit very many Gaussians that will allow us to extract a function. 

In [None]:
#>>>RUN: L15.2-runcell02

#now let's run GP on this guy

kernel = np.var(data) * kernels.ExpSquaredKernel(1.5)
gp = george.GP(kernel)
var=np.ones(len(Xin))
gp.compute(Xin,var)
x_pred = np.linspace(0, 200, 100)
pred, pred_var = gp.predict(data, x_pred, return_var=True)

plt.fill_between(x_pred, pred - np.sqrt(pred_var), pred + np.sqrt(pred_var),color="k", alpha=0.2)
plt.plot(x_pred, pred, "k", lw=1.5, alpha=0.5)
plt.plot(Xin,data,"*")
plt.xlabel("x")
plt.ylabel("y");

There are lots of fluctuations here. Clearly, the function is not ideal, and with a bit of tuning and smoothing, we could probably better approximate the underlying function. However, another way we can do this is with a neural network. 

<a name='exercises_15_2'></a>     

| [Top](#section_15_0) | [Restart Section](#section_15_2) | [Next Section](#section_15_3) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.2.1</span>

In the last lesson, when we applied a neural network to separate two samples, we cared only about minimizing the binary cross entropy. We did not care about knowing where the points are distributed. When we fit a function, we need to guess a form to fit the data and minimize $\chi^2$. Do we need to do that for a neural network?  

A) The neural network needs an architecture that is similar to the functional guess.\
B) If there are sufficient parameters, the neural network can approximate the function, no matter the form.\
C) Specific architectures are needed for specific problems (CNNs for images, RNNs for time series).\
D) The neural network needs to work with our statistical tools like Gaussian processes and f-tests to get the right function.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.2.2</span>

In the preceding code cell, `L15.2-runcell02`, we used the Gaussian Process regression algorithm to fit a curve to some data. Explore using different metrics in the function <a href="https://dfm.io/george/dev/user/kernels/#george.kernels.ExpSquaredKernel" target="_blank">`george.kernels.ExpSquaredKernel`</a> (currently, the default is set to 1.5). The code cell is repeated below, in the notebook. 

What approximate value of the metric should one use to smooth out the extraneous wiggles in the plot?

A) 0.5\
B) 2.5\
C) 5\
D) 20\
E) 100

In [None]:
#>>>EXERCISE: L15.2.2

#now let's run GP on this guy

kernel = np.var(data) * kernels.ExpSquaredKernel(1.5)
gp = george.GP(kernel)
var=np.ones(len(Xin))
gp.compute(Xin,var)
x_pred = np.linspace(0, 200, 100)
pred, pred_var = gp.predict(data, x_pred, return_var=True)

plt.fill_between(x_pred, pred - np.sqrt(pred_var), pred + np.sqrt(pred_var),color="k", alpha=0.2)
plt.plot(x_pred, pred, "k", lw=1.5, alpha=0.5)
plt.plot(Xin,data,"*")
plt.xlabel("x")
plt.ylabel("y");

<a name='section_15_3'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.3 An Example with PyTorch: Fitting a Parabola</h2>  

| [Top](#section_15_0) | [Previous Section](#section_15_2) | [Exercises](#exercises_15_3) | [Next Section](#section_15_4) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS15/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS15_vid3" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

In previous setups for classification, we defined a loss function as:

$$
\begin{equation}
\mathcal{L} = -\left[y \log(f(x)) + (1-y)\log(1-f(x))\right]
\end{equation}
$$

where $f(x)$ is our classifier output, with $1$ denoting a high likelihood of signal and $0$ denoting a high likelihood of background. The goal is to minimize the loss, which pushes $f(x)$ toward $1$ for signal events ($y=1$) and toward $0$ for background events ($y=0$).  
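
As a quick numerical illustration of this classification loss, here is a minimal sketch using NumPy. The function name `binary_cross_entropy`, the `eps` clipping, and the toy predictions are all assumptions made for this example, and the loss is written with the conventional overall minus sign so that a lower value means a better classifier.

```python
import numpy as np

def binary_cross_entropy(y_true, f_pred, eps=1e-12):
    """Mean binary cross entropy, with the overall minus sign so that lower is better."""
    f = np.clip(f_pred, eps, 1.0 - eps)   # avoid taking log(0)
    return -np.mean(y_true * np.log(f) + (1.0 - y_true) * np.log(1.0 - f))

y_true = np.array([1.0, 1.0, 0.0, 0.0])   # 1 = signal, 0 = background
good = np.array([0.9, 0.8, 0.1, 0.2])     # confident and correct (toy values)
bad = np.array([0.1, 0.2, 0.9, 0.8])      # confident and wrong (toy values)

bce_good = binary_cross_entropy(y_true, good)
bce_bad = binary_cross_entropy(y_true, bad)
print(bce_good, bce_bad)
```

The classifier that assigns high scores to signal and low scores to background ends up with the smaller loss.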

When we train a deep learning algorithm, what we are effectively doing is minimizing the loss with respect to the network parameters (the weights $w_{i}$), that is, searching for the point where: 

$$
\begin{equation}
\frac{\partial \mathcal{L}}{\partial w_{i}}\rightarrow 0
\end{equation}
$$
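
To make the condition on the gradient concrete, here is a minimal sketch of gradient descent on a single weight using PyTorch autograd. The toy data (a line with slope 3), the learning rate, and the number of steps are illustrative assumptions, and a simple squared-error loss stands in for the classification loss.

```python
import torch

x = torch.linspace(-1, 1, 50)
y = 3.0 * x                                  # toy data with true slope 3 (illustrative)

w = torch.zeros(1, requires_grad=True)       # single trainable weight
lr = 0.5                                     # learning rate (illustrative choice)

for step in range(100):
    loss = torch.mean((y - w * x) ** 2)      # squared-error loss for the current w
    loss.backward()                          # fills w.grad with dL/dw
    with torch.no_grad():
        w -= lr * w.grad                     # step against the gradient
    w.grad.zero_()                           # reset the accumulated gradient

# Evaluate the gradient once more at the final weight
final_loss = torch.mean((y - w * x) ** 2)
final_loss.backward()
print(w.item(), w.grad.item())
```

After training, the weight sits close to the true slope and the gradient of the loss with respect to it is close to zero.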

Now, the power of deep learning is that we can fit any arbitrary function to the data. 

Let's start with a simplified dataset. Before we go and fit the noisy Gaussian we generated previously, let's just try to fit $y=x^{2}$, again with added noise. I know this is kind of silly, but it will help with your understanding of how deep learning works. 



In [None]:
#>>>RUN: L15.3-runcell01

torch.manual_seed(1)    # reproducible

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)  # x data (tensor), shape=(100, 1)
y = x.pow(2) + 0.2*torch.rand(x.size())                 # noisy y data (tensor), shape=(100, 1)

# Variable is a deprecated wrapper (PyTorch >= 0.4): tensors can be trained on directly, so this call is kept only for compatibility
x, y = Variable(x), Variable(y)

# view data
plt.figure(figsize=(10,4))
plt.scatter(x.data.numpy(), y.data.numpy(), color = "orange")
plt.title('Regression Analysis')
plt.xlabel('Independent variable')
plt.ylabel('Dependent variable')
plt.show()


<h3>Mean Squared Error Loss</h3>

Now, to fit this function, we can imagine defining a new loss that we can minimize. In this case, we will define a loss known as mean squared error (MSE) loss. This loss can be written as 

$$
\begin{equation}
\mathcal{L} = \left(y-f(x)\right)^{2}
\end{equation}
$$

More generally, we can write this as an average over $N$ data points as 

$$
\begin{equation}
\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-f(x_{i})\right)^{2}
\end{equation}
$$
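
As a small check of this formula, the sketch below computes the mean squared error by hand and compares it with `torch.nn.MSELoss`, whose default reduction is the mean. The three toy values are assumptions chosen for illustration.

```python
import torch

y_true = torch.tensor([[1.0], [2.0], [3.0]])
y_pred = torch.tensor([[1.5], [2.0], [2.0]])

mse_manual = torch.mean((y_true - y_pred) ** 2)   # (0.25 + 0 + 1) / 3
mse_torch = torch.nn.MSELoss()(y_pred, y_true)    # default reduction='mean'
print(mse_manual.item(), mse_torch.item())
```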

There are several variations on this loss. Perhaps the most common one is known as mean absolute percentage error (MAPE). This loss is defined as 

$$
\begin{equation}
\mathcal{L} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_{i}-f(x_{i})}{y_{i}}\right|
\end{equation}
$$
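
For comparison with MSE, here is a minimal NumPy sketch of the MAPE loss. The helper name `mape` and the toy numbers are assumptions for this example. Note that the loss is undefined when any true value $y_{i}$ is zero.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any y_true is 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([100.0, 200.0, 50.0])
y_pred = np.array([110.0, 180.0, 50.0])
mape_value = mape(y_true, y_pred)   # relative errors are 10%, 10%, and 0%
print(mape_value)
```

Because each error is divided by the true value, a miss of 10 on a target of 100 counts the same as a miss of 20 on a target of 200.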

<h3>Defining a two-layered network</h3>

Let's fit the noisy parabola data using deep learning regression. To do this, we are going to train a 2-layered dense network to predict this data. 

Our 2-layered model can be written as 

$$
\begin{equation}
 f(x) = W^{T}_{2}\,\mathrm{Act}\left(\vec{W}_{1}x + \vec{b}_{1}\right)+b_{2}
\end{equation}
$$

where $W_{1}$ and $W_{2}$ would generally denote matrices, and the $b$ terms are the bias offsets of the two layers. However, in this particular case, the weights are really vectors since the output dimension is 1 and the input dimension is 1. With these matrices, we can specify the number of hidden parameters, which will be the size of the other dimension. For 10 hidden parameters, each $W_{i}$ will be a 10x1 matrix (a 10-dimensional vector). The symbol $\rm{Act}$ denotes the activation function for this system. 
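
To connect this notation to code, the sketch below writes out the forward pass of a two-layer network with explicit matrices and compares it with the layer-by-layer call. It assumes ReLU as the activation function and 10 hidden parameters. The variable names (`hidden`, `predict`, `out_manual`) are chosen for this example only.

```python
import torch

torch.manual_seed(0)
n_hidden = 10

hidden = torch.nn.Linear(1, n_hidden)    # holds W_1 (10x1) and b_1 (10 values)
predict = torch.nn.Linear(n_hidden, 1)   # holds W_2 (1x10) and b_2 (1 value)

x = torch.linspace(-1, 1, 5).unsqueeze(1)   # 5 sample inputs, shape (5, 1)

# Forward pass through the layers
out_layers = predict(torch.relu(hidden(x)))

# The same computation written out with explicit matrices
W1, b1 = hidden.weight, hidden.bias
W2, b2 = predict.weight, predict.bias
out_manual = torch.relu(x @ W1.T + b1) @ W2.T + b2

print(W1.shape, W2.shape)
print(torch.allclose(out_layers, out_manual, atol=1e-6))
```

PyTorch stores each weight matrix with shape (outputs, inputs), which is why the transposes appear when multiplying a batch of inputs from the left.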

Let's go ahead and build the neural network with 10 hidden parameters, and use a mean squared loss. Additionally, for the minimizer, we will use stochastic gradient descent (SGD). 

In [None]:
#>>>RUN: L15.3-runcell02

torch.manual_seed(1)    # reproducible

# this is one way to define a network
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.predict = torch.nn.Linear(n_hidden, n_output)   # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.predict(x)             # linear output
        return x

net = Net(n_feature=1, n_hidden=10, n_output=1)     # define the network
print(net)  # net architecture
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)#stochastic Gradient Descent
loss_func = torch.nn.MSELoss()  # this is for regression mean squared loss

<h3>Network Training</h3>

Ok, now that we have our neural network, what we are going to do is run 200 epochs of the network training, and for each epoch, we are going to make a plot. Then we will turn these plots into a moving video to see how the network is running. 

**NOTE:** If you run the next code cell multiple times, the network will continue to train from where it previously finished. Therefore, to restart the training process (and see the animated .gif from its initial flat-line state), you must redefine and reinitialize the network by running code cell `L15.3-runcell02` first.

One interesting feature of this code is that the training actually runs quite quickly, but generating the animated gif with `imageio.mimsave` takes a fair amount of time.


In [None]:
#>>>RUN: L15.3-runcell03

def makePlot(x,y,prediction,ax,fig,images,t,loss,ymin,ymax):
    # plot and show learning process
    plt.cla()
    ax.set_title('Regression Analysis', fontsize=35)
    ax.set_xlabel('Independent variable', fontsize=24)
    ax.set_ylabel('Dependent variable', fontsize=24)
    ax.set_ylim(ymin,ymax)
    ax.scatter(x.data.numpy(), y.data.numpy(), color = "orange")
    ax.plot(x.data.numpy(), prediction.data.numpy(), 'g-', lw=3)
    ax.text(0.6, 0.7, 'Epoch = %d' % t, fontdict={'size': 24, 'color':  'red'})
    ax.text(0.6, 0.3, 'Loss = %.4f' % loss.data.numpy(),fontdict={'size': 24, 'color':  'red'}) 
    fig.canvas.draw()       # draw the canvas, cache the renderer
    image = np.frombuffer(fig.canvas.tostring_rgb(), dtype='uint8')
    image  = image.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    images.append(image)

def train(x,y,net,loss_func,opt,nepochs,ymin,ymax):
    images = []
    fig, ax = plt.subplots(figsize=(12,7))
    for epoch in range(nepochs):
        if epoch % 50 == 0: 
            print("epoch:",epoch)
        prediction = net(x)
        loss = loss_func(prediction, y) 
        opt.zero_grad()
        loss.backward() 
        opt.step()      # use the optimizer passed to train(), not the global one
        nplots = int(nepochs/40)
        if epoch % nplots == 0:
            makePlot(x,y,prediction,ax,fig,images,epoch,loss,ymin,ymax)
    return images
    
from IPython.display import Image
images=train(x,y,net,loss_func,optimizer,200,-0.1,1.5)
imageio.mimsave('data/L15/curve_1.gif', images, fps=10)
Image(open('data/L15/curve_1.gif','rb').read())

Try exploring more epochs by running code cell `L15.3-runcell03` multiple times without reinitializing the network. You will notice that the network seems to come up with an output that looks like a set of connected lines. Why do you think that happens?

So, this is actually a great way to play around with models. You quickly see what the neural network is trying to do, and its output sort of approximates what you might get from fitting. You may also notice that this is a lot slower than fitting. Do you understand why? 

The answer involves the fact that the minimization procedure we are doing is not a full fit procedure starting from a fixed functional form. We are computing a gradient, but we are doing a "soft" propagation. This is not the full stepping and fitting that we did with our original fits. The advantage, of course, is that we have many more parameters now to play with. 
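
For comparison, a fit that starts from a fixed functional form can be solved directly. The sketch below is a stand-in under stated assumptions: it regenerates a similar noisy parabola with NumPy (the random generator and seed differ from the torch data used above) and uses `np.polyfit` to do a least-squares fit of a second-order polynomial in a single call, with no epochs.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 100)
y = x**2 + 0.2 * rng.random(x.shape)     # noisy parabola, mirroring the dataset above

coeffs = np.polyfit(x, y, deg=2)         # least-squares fit of a*x^2 + b*x + c
y_fit = np.polyval(coeffs, x)
mse_fit = np.mean((y - y_fit) ** 2)      # same MSE metric used for the network
print(coeffs, mse_fit)
```

The fitted coefficients come out near $a=1$, $b=0$, and $c=0.1$ (the mean of the added noise), but only because we told the fit in advance that the answer is a parabola.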

For a quick look at how things change, let's double the number of hidden parameters. What we want to do is rerun code cell `L15.3-runcell02` to define a neural network with 20 neurons in the hidden layer. Then, we can run code cell `L15.3-runcell03` to see how the output evolves with epoch using this larger hidden layer.
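
To see what doubling the hidden layer does to the size of the model, here is a small sketch that counts trainable parameters. The helpers `count_parameters` and `two_layer` are hypothetical names introduced for this example. `two_layer` uses `torch.nn.Sequential` to mirror the Linear, ReLU, Linear layout of the network defined earlier.

```python
import torch

def count_parameters(model):
    """Number of trainable weights and biases in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def two_layer(n_hidden):
    # Same layout as the two-layer network above: Linear -> ReLU -> Linear
    return torch.nn.Sequential(
        torch.nn.Linear(1, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, 1),
    )

n10 = count_parameters(two_layer(10))   # 10 weights + 10 biases + 10 weights + 1 bias
n20 = count_parameters(two_layer(20))   # 20 weights + 20 biases + 20 weights + 1 bias
print(n10, n20)
```

For one input and one output, the count is $3n+1$ for $n$ hidden neurons, so doubling the hidden layer roughly doubles the number of parameters.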

<a name='exercises_15_3'></a>     

| [Top](#section_15_0) | [Restart Section](#section_15_3) | [Next Section](#section_15_4) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.3.1</span>

What are the dimensions of the input, output, and hidden layers? Enter your answer as a list of integers: `[input, output, hidden]`

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.3.2</span>

Why do we choose to use mean squared error for our loss function? Select all that apply:

A) This is the same as our original fit minimization (least squares).\
B) We did not give uncertainties, so we cannot do chi2.\
C) Mean squared error is the closest we get to absolute value.\
D) Actually, we should not use a NN at all, and we really should do a full systematic approach of adding polynomials and fitting to determine the functional form.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.3.3</span>

You saw that increasing the number of hidden parameters (i.e. the number of neurons in the hidden layer) gave a better by-eye fit. Why not use as many hidden parameters as possible? Wouldn't this make an even better fit to the data? Why would it NOT be ideal to arbitrarily add more hidden parameters? Select all that apply:

A) It can lead to overfitting of the model to the training data.\
B) It can make the model too simple and unable to learn complex features.\
C) It can require more computational resources.\
D) It can decrease the accuracy of the model on the training data.\
E) There is an ideal number of hidden parameters that works universally well for all neural networks.


<a name='section_15_4'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.4 Another Example: Fitting a Sine Function</h2>  

| [Top](#section_15_0) | [Previous Section](#section_15_3) | [Exercises](#exercises_15_4) | [Next Section](#section_15_5) |


*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS15/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS15_vid4" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Ok, now let's look at another function to regress. Let's try to regress 

\begin{equation}
 f(x) = \sin(x)
\end{equation}

In [None]:
#>>>RUN: L15.4-runcell01

torch.manual_seed(1)    # reproducible
x = torch.unsqueeze(torch.linspace(-10, 10, 1000), dim=1)  # x data (tensor), shape=(1000, 1)
y = torch.sin(x) + 0.2*torch.rand(x.size())                 # noisy y data (tensor), shape=(1000, 1)

# Variable is a deprecated wrapper (PyTorch >= 0.4): tensors can be trained on directly, so this call is kept only for compatibility
x, y = Variable(x), Variable(y)
plt.figure(figsize=(10,4))
plt.scatter(x.data.numpy(), y.data.numpy(), color = "orange")
plt.title('Regression Analysis')
plt.xlabel('Independent variable')
plt.ylabel('Dependent variable')
plt.show()

<h3>Challenge question</h3>

What happens if we make a regression for the above dataset? How does this regression change with the number of parameters, say 100 hidden parameters?

In [None]:
#>>>RUN: L15.4-runcell02

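#redefine network
net = Net(n_feature=1, n_hidden=10, n_output=1)     # define the network, try changing to n_hidden=100
#net = Net(n_feature=1, n_hidden=100, n_output=1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)  # rebuild the optimizer so that it updates the new network's parameters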

images=train(x,y,net,loss_func,optimizer,200,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

This fit is doing an astoundingly bad job! You can see that it's not reproducing any features of the data at all, even though we have 100 parameters. Would more epochs help? Try rerunning code cell L15.4-runcell02 a few times to check.

Next, we'll try to understand more about why this attempt fails so miserably.        

<h3>Deep Learning Algorithm Design</h3>

Now, we want to build some intuition about how neural networks work. Let's take the above problem and see if we can really explain its behavior by doing some deep learning R&D. To do that, let's first take an architecture similar to the last one, except with more than 2 layers. 

Let's start with a 3-layer network with 100 hidden parameters in both the first and second layers.  

In [None]:
#>>>RUN: L15.4-runcell03

torch.manual_seed(1)    # reproducible

# another way to define a network
net = torch.nn.Sequential(
        torch.nn.Linear(1, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 1),
    )
print(net[0].weight[0:10])
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
images=train(x,y,net,loss_func,optimizer,200,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

Yet again, we see a total failure to match any aspect of the data! Why did the neural network just try to fit a line in this case? 

The answer is a bit subtle: the network lacks enough "expressiveness" to solve this particular problem. The linear layers that we used correspond to a simple matrix multiplication. That means that each layer just multiplies its input by a constant and adds an offset, $ax+b$. Now, it's true that $a$ can be a matrix. However, all that does is give a vector of linear outputs, and feeding one linear function into another still gives a linear function, so the whole stack collapses to a single straight line.
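
This collapse can be checked directly. The sketch below builds the same stack of three linear layers (with freshly initialized, untrained weights, which is an assumption of this example), multiplies the layers together into one slope `W_total` and one offset `b_total`, and confirms that the network output is exactly that single line.

```python
import torch

torch.manual_seed(0)
stack = torch.nn.Sequential(
    torch.nn.Linear(1, 100),
    torch.nn.Linear(100, 100),
    torch.nn.Linear(100, 1),
)

# Compose the three affine maps y = W x + b into one slope and one offset:
# W_total = W3 W2 W1, with the offsets propagated through the same way.
with torch.no_grad():
    W_total = torch.eye(1)
    b_total = torch.zeros(1)
    for layer in stack:
        W_total = layer.weight @ W_total
        b_total = layer.weight @ b_total + layer.bias

    x = torch.linspace(-10, 10, 7).unsqueeze(1)
    out_stack = stack(x)                  # output of the 3-layer network
    out_line = x @ W_total.T + b_total    # output of the single equivalent line

print(W_total.shape)
print(torch.allclose(out_stack, out_line, atol=1e-4))
```

Even with 100 hidden parameters per layer, the network reduces to one number for the slope and one for the offset, so no amount of training can make it follow a sine wave.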

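Note that this also explains why the previous attempt to fit a noisy parabola resulted in a fit that was a sequence of connected short line segments: each linear layer can only produce straight lines, and the ReLU activation in that network joined several of them together.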

To fix this mismatch between the network architecture and the data we are trying to fit, we will add a different sort of "activation" layer between the linear layers. This will be covered in the next section. 

<a name='exercises_15_4'></a>     

| [Top](#section_15_0) | [Restart Section](#section_15_4) | [Next Section](#section_15_5) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.4.1</span>

Which of the following statements describes why linear activation function layers are not suitable for classification of data which is strongly nonlinear?

A) A linear activation function cannot capture complex nonlinear relationships between the input features and the output classes.\
B) A linear activation function is only effective for classification tasks with linearly separable data.\
C) A linear activation function is prone to underfitting and can lead to poor performance on the training and test sets.\
D) A linear activation function can introduce too much noise into the model and reduce its accuracy.

<a name='section_15_5'></a>
<hr style="height: 1px;">

## <h2 style="border:1px; border-style:solid; padding: 0.25em; color: #FFFFFF; background-color: #90409C">L15.5 Sine Function Continued: Adjusting the Network</h2>  

| [Top](#section_15_0) | [Previous Section](#section_15_4) | [Exercises](#exercises_15_5) |

*The material in this section is discussed in the video **<a href="https://courses.mitxonline.mit.edu/learn/course/course-v1:MITxT+8.S50.2x+1T2025/block-v1:MITxT+8.S50.2x+1T2025+type@sequential+block@seq_LS15/block-v1:MITxT+8.S50.2x+1T2025+type@vertical+block@vert_LS15_vid5" target="_blank">HERE</a>.** You are encouraged to watch that video and use this notebook concurrently.*

<h3>Overview</h3>

Now, let's build on the intuition we started to gain about how neural networks work. Let's take the previous problem of fitting a noisy sine wave and see if we can really explain its behavior by doing some deep learning R&D. To do that, let's first take an architecture similar to the last one, except with one addition. 

Let's start with a 3-layer network with 100 and 50 hidden parameters in the first and second layers, respectively. However, we will now add one ReLU activation function into the network. We will skip right down to the point where we can train it. 

In [None]:
#>>>RUN: L15.5-runcell01

torch.manual_seed(1)    # reproducible

# another way to define a network
net = torch.nn.Sequential(
        torch.nn.Linear(1, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 50),
        torch.nn.Linear(50, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
images=train(x,y,net,loss_func,optimizer,200,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

What you see is that the ReLU activation function introduced a "kink" in the final fit. The result is still pretty awful, but it does seem to be adding a new feature that is trending in the right direction. Let's keep going and add a second ReLU activation function. Also, let's increase the number of epochs to give the network more time to adjust the parameters.

In [None]:
#>>>RUN: L15.5-runcell02

torch.manual_seed(1)    # reproducible

# another way to define a network
net = torch.nn.Sequential(
        torch.nn.Linear(1, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
images=train(x,y,net,loss_func,optimizer,400,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

OK, now the fit is improving dramatically, at least in the central region. Looking closely, you can see that the green line is again a series of connected straight lines. However, the "kinks" introduced by the ReLU functions allow the network the flexibility (what we have also called "expressiveness") to deal with the strong non-linearities in the data.
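
To see where a kink comes from, here is a minimal sketch with hand-set (not trained) weights, chosen purely for illustration. Two hidden ReLU neurons are enough to build $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$, which is two straight lines joined by one kink at $x=0$.

```python
import torch

net_abs = torch.nn.Sequential(
    torch.nn.Linear(1, 2),
    torch.nn.ReLU(),
    torch.nn.Linear(2, 1),
)

# Hand-set weights (illustrative, not trained): output = relu(x) + relu(-x) = |x|
with torch.no_grad():
    net_abs[0].weight.copy_(torch.tensor([[1.0], [-1.0]]))
    net_abs[0].bias.zero_()
    net_abs[2].weight.copy_(torch.tensor([[1.0, 1.0]]))
    net_abs[2].bias.zero_()

    x = torch.linspace(-2, 2, 9).unsqueeze(1)
    out = net_abs(x)

print(out.squeeze())
```

Each additional hidden ReLU neuron can contribute another kink, which is how a trained network assembles a curve out of many short straight segments.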

<h3>Challenge Question</h3>

Given the observed trends, add more layers to your network. How many are needed to describe the whole oscillation? Would a third pair of Linear layer plus ReLU activation function be enough?

In [None]:
#>>>RUN: L15.5-runcell03

torch.manual_seed(1)    # reproducible

net = torch.nn.Sequential(
        torch.nn.Linear(1, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
images=train(x,y,net,loss_func,optimizer,400,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

How about 4?

In [None]:
#>>>RUN: L15.5-runcell04

torch.manual_seed(1)    # reproducible

net = torch.nn.Sequential(
        torch.nn.Linear(1, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 100),
        torch.nn.ReLU(),
        torch.nn.Linear(100, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.MSELoss()
images=train(x,y,net,loss_func,optimizer,400,-1.1,1.1)
imageio.mimsave('data/L15/curve_2.gif', images, fps=10)
Image(open('data/L15/curve_2.gif','rb').read())

So, something in the range of 3-4 ReLU functions seems to do the trick. This makes sense since the range of the sine wave that we tried to fit looks something like 3 (or maybe 3.5) parabolas, but you need the ReLU kinks between them to get it all to line up.

You may have noticed that the network with 2 ReLU functions did a sort of reasonable job of fitting 2 of the sine wave peaks, but with a single ReLU function the output didn't seem to match anything at all. One reason for this difference is that it's almost impossible to fit only one sine wave peak without having the rest of the output wildly different from the data.

Now that we have all of this working, what would be the right architecture to perform a regression on the noisy Gaussian we started out with? 

Let's set up the data using the torch tools.

In [None]:
#>>>RUN: L15.5-runcell05

torch.manual_seed(1)    # reproducible

x=torch.unsqueeze(torch.linspace(0, 200, 201), dim=1) 
y=torch.from_numpy(data.reshape(len(data),1).astype('float32'))

# Variable is a legacy wrapper: since PyTorch 0.4 tensors work with autograd directly, so this call just returns tensors
x, y = Variable(x), Variable(y)

# view data
plt.figure(figsize=(10,4))
plt.scatter(x.data.numpy(), y.data.numpy(), color = "orange")
plt.title('Regression Analysis')
plt.xlabel('Independent variable')
plt.ylabel('Dependent variable')
plt.show()


With a Gaussian shape, depending on how you count, there are 2 or 3 inflection points (i.e., kinks), so a 4-layer network (with 3 ReLU functions in between) with relatively few parameters should be more than adequate. Let's try!

In [None]:
#>>>RUN: L15.5-runcell06

torch.manual_seed(1)    # reproducible

def makePlot(x,y,prediction,ax,fig,images,t,loss,ymin,ymax):
    # plot and show learning process
    plt.cla()
    ax.set_title('Regression Analysis', fontsize=35)
    ax.set_xlabel('Independent variable', fontsize=24)
    ax.set_ylabel('Dependent variable', fontsize=24)
    ax.set_ylim(ymin,ymax)
    ax.scatter(x.data.numpy(), y.data.numpy(), color = "orange")
    ax.plot(x.data.numpy(), prediction.data.numpy(), 'g-', lw=3)
    ax.text(125, 16, 'Epoch = %d' % t, fontdict={'size': 24, 'color':  'red'})
    ax.text(125, 14, 'Loss = %.4f' % loss.data.numpy(),fontdict={'size': 24, 'color':  'red'}) 
    fig.canvas.draw()       # draw the canvas, cache the renderer
    # buffer_rgba() is used because tostring_rgb() was deprecated in Matplotlib 3.8
    image = np.asarray(fig.canvas.buffer_rgba())[..., :3].copy()  # drop alpha, keep RGB
    images.append(image)


net = torch.nn.Sequential(
        torch.nn.Linear(1, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 10),
        torch.nn.ReLU(),
        torch.nn.Linear(10, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
images=train(x,y,net,loss_func,optimizer,300,5,20)
imageio.mimsave('data/L15/curve_gaus.gif', images, fps=10)
Image(open('data/L15/curve_gaus.gif','rb').read())

This result looks surprisingly poor. Maybe we need another ReLU function and more epochs?

In [None]:
#>>>RUN: L15.5-runcell07

torch.manual_seed(1)    # reproducible

n_hidden=10
net = torch.nn.Sequential(
        torch.nn.Linear(1, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
images=train(x,y,net,loss_func,optimizer,1000,5,20)
imageio.mimsave('data/L15/curve_gaus.gif', images, fps=10)
Image(open('data/L15/curve_gaus.gif','rb').read())

Hmm... still not very impressive. How about increasing the number of parameters in the hidden layers? Give it a try.
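
One way to try this without retyping the `Sequential` definition each time is sketched below. The helper names `build_mlp` and `count_parameters` are hypothetical (they do not appear elsewhere in this notebook); the sketch only assumes the same Linear-plus-ReLU pattern used in the cells above.

```python
import torch

def build_mlp(n_hidden, n_relu):
    """Hypothetical helper: n_relu (Linear + ReLU) pairs, then a final Linear to 1 output."""
    layers = []
    n_in = 1
    for _ in range(n_relu):
        layers += [torch.nn.Linear(n_in, n_hidden), torch.nn.ReLU()]
        n_in = n_hidden
    layers.append(torch.nn.Linear(n_in, 1))
    return torch.nn.Sequential(*layers)

def count_parameters(net):
    """Total number of trainable weights and biases."""
    return sum(p.numel() for p in net.parameters() if p.requires_grad)

# Compare how quickly the parameter count grows with the hidden-layer width
for n_hidden in (10, 30, 100):
    candidate = build_mlp(n_hidden, n_relu=4)
    print("n_hidden =", n_hidden, "-> parameters =", count_parameters(candidate))
```

For `n_hidden=10` with four ReLU functions, which matches the network in the previous cell, the count is 361 parameters. A wider network built this way can be passed to `train` together with a fresh `torch.optim.Adam(net.parameters(), lr=0.01)` optimizer, exactly as in the cells above.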

Training on these relatively simple functions illustrates both the benefits and the drawbacks of deep learning. This again follows from the fact that deep learning does not perform a full fit with statistical tests. It sacrifices uncertainty estimates and optimized tuning in exchange for broader generality, which in practice means many more parameters.
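
For contrast, the sketch below shows what a conventional "full fit" returns in addition to a curve. It uses `scipy.optimize.curve_fit` on synthetic data. The Gaussian-plus-constant model, the true parameter values, the noise level, and the variable names `x_np` and `y_np` are all assumptions made for illustration; this is not the lesson's `data` array.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaus_plus_const(x, amp, mu, sigma, const):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2) + const

# Synthetic stand-in for the noisy Gaussian (all values below are assumptions)
rng = np.random.default_rng(1)
x_np = np.linspace(0, 200, 201)
true_params = (5.0, 100.0, 20.0, 10.0)   # amp, mu, sigma, const
noise = 1.0
y_np = gaus_plus_const(x_np, *true_params) + rng.normal(0.0, noise, x_np.size)

# Four-parameter fit; pcov is the covariance matrix of the fitted parameters
popt, pcov = curve_fit(
    gaus_plus_const, x_np, y_np,
    p0=(4.0, 90.0, 15.0, 9.0),
    sigma=np.full(x_np.size, noise), absolute_sigma=True,
)
perr = np.sqrt(np.diag(pcov))            # one-sigma parameter uncertainties

for name, val, err in zip(("amp", "mu", "sigma", "const"), popt, perr):
    print(f"{name:>5s} = {val:8.3f} +/- {err:.3f}")
```

This fit uses four parameters and reports an uncertainty on each one, while the networks above use hundreds of parameters and report none.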

The following exercises will explore some of these issues in more detail. 

<a name='exercises_15_5'></a>     

| [Top](#section_15_0) | [Restart Section](#section_15_5) | [Next Section](#section_15_6) |


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.5.1</span>

Consider the following neural network. How many hidden layers are there in this model? Enter your answer as an integer.

<pre>
net = torch.nn.Sequential(
        torch.nn.Linear(1, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 50),
        torch.nn.ReLU(),
        torch.nn.Linear(50, 1),
    )
</pre>


### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #90409C; color: #90409C;">Exercise 15.5.2</span>

Run the code cell shown below multiple times, **varying the number of hidden layers each time.** Which of the following statements describes the effect of increasing the number of hidden layers? Select all that apply:

A) Increasing the number of hidden layers always leads to better regression performance.\
B) Increasing the number of hidden layers can help the model learn more complex patterns and improve regression performance, but may also increase the risk of overfitting.\
C) Increasing the number of hidden layers increases computational load and should only be done if better performance is needed.\
D) Increasing the number of hidden layers is only beneficial if the number of neurons in each layer is also increased.

**Extra:** Try also varying the number of parameters within the hidden layers.

In [None]:
#>>>EXERCISE: L15.5.2
# Use this cell for drafting your solution (if desired),
# then enter your solution in the interactive problem online to be graded.

def train(x,y,net,loss_func,opt,nepochs,ymin,ymax):
    images = []
    fig, ax = plt.subplots(figsize=(12,7))
    for epoch in range(nepochs):
        if epoch % 200 == 0: 
            print("epoch:",epoch)
        prediction = net(x)
        loss = loss_func(prediction, y) 
        opt.zero_grad()
        loss.backward() 
        opt.step()  # use the optimizer passed in as `opt`, not the global `optimizer`
        # Minimize plots for faster running
        if epoch == nepochs-1:
            makePlot(x,y,prediction,ax,fig,images,epoch,loss,ymin,ymax)
    return images

torch.manual_seed(1)    # reproducible

n_hidden=30

#VARY THE NUMBER OF HIDDEN LAYERS BELOW
net = torch.nn.Sequential(
        torch.nn.Linear(1, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, n_hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hidden, 1),
    )
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
images=train(x,y,net,loss_func,optimizer,1000,5,20)
imageio.mimsave('data/L15/curve_gaus.gif', images, fps=10)
Image(open('data/L15/curve_gaus.gif','rb').read())
