# Neural Net

- Gets its name from neurons in biology
- One neuron receives signals from other neurons, combines them in some way, then sends this combined signal to other neurons

![](nn.drawio.png)

- you can make neural nets any size you want
    - add/subtract **hidden layers** (making the diagram above wider/narrower). The number of layers is what's known as the **depth** of the neural net.
    - add/subtract **nodes** in a hidden layer (making the diagram taller/shorter)
- each hidden layer performs a linear operation, then a non-linear operation called an **activation function**
- the sigmoid function is one of many different types of activation functions you can use

## Math operations in forward pass of neural net

### Input layer and hidden layer 1


$$ \bm{Z} = \bm{\sigma}(\bm{X} \bm{W} + \bm{J}_{N,1} \bm{w}_b) $$
- where:
    
    - $\bm{\sigma}(\bm{A})$ is the sigmoid function applied element-wise to matrix $\bm{A}$:
        $$\bm{\sigma}(\bm{A}) = \left[\begin{array}{cccc}
                \sigma(a_1^{(1)}) & \sigma(a_2^{(1)}) & \cdots & \sigma(a_{I}^{(1)}) \\
                \sigma(a_1^{(2)}) & \sigma(a_2^{(2)}) & \cdots & \sigma(a_{I}^{(2)}) \\
                \vdots            & \vdots            & \ddots & \vdots\\
                \sigma(a_1^{(N)}) & \sigma(a_2^{(N)}) & \cdots & \sigma(a_{I}^{(N)})
                \end{array}\right] $$
    - reminder that the sigmoid function is:
        $$ \sigma(a) = \frac{1}{1 + e^{-a}} $$
    - $N$ is the number of data points
    - $\bm{J}_{a,b}$ is a matrix with size $a \times b$ full of $1$'s
        - e.g. :
        $$\bm{J}_{4,2} = \left[\begin{array}{cc}
            1 & 1\\
            1 & 1\\
            1 & 1\\
            1 & 1
            \end{array}\right]$$
    - $\bm{w}_b$ is the bias term:
        $$ \bm{w}_b = \left[\begin{array}{cc}
            w_{01} & w_{02}
            \end{array}\right] $$
    - $\bm{X}$ and $\bm{W}$: 
        $$ \bm{X} = \left[\begin{array}{ccc}
            x_1^{(1)} & x_2^{(1)} & x_{3}^{(1)} \\
            x_1^{(2)} & x_2^{(2)} & x_{3}^{(2)} \\
            \vdots    & \vdots    & \vdots\\
            x_1^{(N)} & x_2^{(N)} & x_{3}^{(N)}
            \end{array}\right], \quad \quad 
            \bm{W} = \left[\begin{array}{cc}
            w_{11} & w_{12}\\
            w_{21} & w_{22}\\
            w_{31} & w_{32}
            \end{array}\right] $$
- $\bm{Z}$ is a matrix of size $N \times 2$:
    $$ \bm{Z} \in \mathbb{R}^{N \times 2} $$
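
As a rough check of the shapes in the equation above, the sketch below builds the matrices by hand with random values and applies the sigmoid element-wise. The size `N = 4` and the variable names are assumptions chosen for illustration. Multiplying a column of ones with `N` rows by the $1 \times 2$ bias row copies the bias onto every data point.

```python
import torch

torch.manual_seed(0)
N = 4                      # number of data points (illustrative)
X = torch.randn(N, 3)      # N x 3 inputs
W = torch.randn(3, 2)      # 3 x 2 weights
w_b = torch.randn(1, 2)    # 1 x 2 bias row
J = torch.ones(N, 1)       # N x 1 matrix of ones

# linear operation followed by the element-wise sigmoid
Z = torch.sigmoid(X @ W + J @ w_b)
print(Z.shape)
```

In practice the ones matrix is not needed in code, because `X @ W + w_b` broadcasts the bias row across all `N` rows and gives the same result.
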

- the code below makes hidden layer 1 with pytorch

In [None]:
import torch
torch.manual_seed(0)
layer1 = torch.nn.Sequential(
    torch.nn.Linear(in_features=3, out_features=2),
    torch.nn.Sigmoid()
)

- the code below gets the bias terms for hidden layer 1 (i.e. $w_{01}$ and $w_{02}$)

In [None]:
w_bias = layer1[0].bias.data
print(w_bias)

- the code below gets $\bm{W}$
- notice that pytorch stores it as the transpose of $\bm{W}$ above
  - $ \bm{W}^\top = \left[\begin{array}{ccc}
            w_{11} & w_{21} & w_{31}\\
            w_{12} & w_{22} & w_{32}
            \end{array}\right] $

In [None]:
w_weight = layer1[0].weight.data
print(w_weight)
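
To see how the stored transpose relates to the equation, this sketch recomputes the layer output by hand from `weight` and `bias` and compares it with what the layer returns. The input `X` and the names `manual` and `from_layer` are made up for illustration.

```python
import torch

torch.manual_seed(0)
layer1 = torch.nn.Sequential(
    torch.nn.Linear(in_features=3, out_features=2),
    torch.nn.Sigmoid()
)

X = torch.randn(5, 3)                      # 5 made-up data points
W_T = layer1[0].weight.data                # stored as 2 x 3, i.e. the transpose of W
w_b = layer1[0].bias.data                  # the 2 bias terms

manual = torch.sigmoid(X @ W_T.T + w_b)    # sigmoid(XW + bias), transposing back to W
with torch.no_grad():
    from_layer = layer1(X)
print(torch.allclose(manual, from_layer, atol=1e-6))
```
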

### Hidden Layer 2

$$ \bm{S} = \bm{\sigma}(\bm{Z}\bm{V} + \bm{J}_{N,1} \bm{v}_b) $$
$$ \bm{S} \in \mathbb{R}^{N \times 3} $$

- the pytorch code for hidden layer 2 is below

In [None]:
layer2 = torch.nn.Sequential(
    torch.nn.Linear(in_features=2, out_features=3),
    torch.nn.Sigmoid()
)

### Output Layer

$$ \hat{\bm{Y}} = \bm{\sigma}(\bm{S}\bm{T} + \bm{J}_{N,1} \bm{t}_b) $$
$$ \hat{\bm{Y}} \in \mathbb{R}^{N \times 2} $$

- the pytorch code for the output layer is below:

In [None]:
layer_output = torch.nn.Sequential(
    torch.nn.Linear(in_features=3, out_features=2),
    torch.nn.Sigmoid()
)

## Putting it all together in one python class

In [None]:
class ExampleNN(torch.nn.Module): # inherits from torch.nn.Module
    def __init__(self): # defines initialization of this module
        super().__init__() # calls the parent class' (torch.nn.Module) initializer
        self.layer1 = torch.nn.Sequential(
            torch.nn.Linear(in_features=3, out_features=2), # linear operation
            torch.nn.Sigmoid() # activation
        )
        self.layer2 = torch.nn.Sequential(
            torch.nn.Linear(in_features=2, out_features=3),
            torch.nn.Sigmoid()
        )
        self.layer_output = torch.nn.Sequential(
            torch.nn.Linear(in_features=3, out_features=2),
            torch.nn.Sigmoid()
        )
    
    def forward(self, x): # forward pass through neural network
        z = self.layer1(x)
        s = self.layer2(z)
        y = self.layer_output(s)
        return y

- instantiate the neural network

In [None]:
example_nn = ExampleNN()

- make a fake dataset to pass through the neural net
    - dataset has 20 examples, with 3 features for each example

In [None]:
X = torch.ones(20,3)

- do a forward pass with the fake dataset

In [None]:
Y_pred = example_nn(X)

- the output should be a $20 \times 2$ matrix:
  
  $\hat{\bm{Y}} \in \mathbb{R}^{20 \times 2}$

In [None]:
Y_pred.shape
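
One way to connect the class back to the weight matrices and bias terms of each layer is to count its learnable parameters. The sketch below rebuilds the same three layers with a single `torch.nn.Sequential` (a shortcut used here only to keep the example short) and counts weights plus biases.

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(3, 2), torch.nn.Sigmoid(),   # hidden layer 1: 3*2 weights + 2 biases
    torch.nn.Linear(2, 3), torch.nn.Sigmoid(),   # hidden layer 2: 2*3 weights + 3 biases
    torch.nn.Linear(3, 2), torch.nn.Sigmoid(),   # output layer:   3*2 weights + 2 biases
)

n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 8 + 9 + 8 = 25
```
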

## Other activation functions

In [None]:
import plotly.express as px
import plotly.graph_objects as go

### Hyperbolic Tangent

$$ \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{2x} -1}{e^{2x} + 1}$$
- similar shape to sigmoid
- often converges to an optimum faster than sigmoid during training, in part because its output is centered on zero

In [None]:
X = torch.linspace(-10.0, 10, 100)
tanh = torch.nn.Tanh()
Y = tanh(X)

## we could also use the torch.nn.functional version so we don't have to make an object of torch.nn.Tanh class
Y = torch.nn.functional.tanh(X)

px.line(x=X, y=Y, title="Tanh", labels={'x':'input', 'y':'output'})
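
The resemblance to the sigmoid is exact up to rescaling: $\tanh(x) = 2\sigma(2x) - 1$, so tanh is a sigmoid stretched to the range $(-1, 1)$. A quick numerical check of that identity (the input grid is arbitrary):

```python
import torch

x = torch.linspace(-5.0, 5.0, 101)
lhs = torch.tanh(x)
rhs = 2 * torch.sigmoid(2 * x) - 1    # rescaled, shifted sigmoid
print(torch.allclose(lhs, rhs, atol=1e-6))
```
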

### Rectified linear unit (ReLU)

$$ \text{relu}(x) =  \begin{cases} 
      x, & \text{if} \ \ x \geq 0 \\
      0, & \text{if} \ \ x <    0
   \end{cases} $$
- quick to compute
- gradient stays at 1 for positive inputs, so it does not vanish when the input is large
- gradient is exactly 0 for negative inputs, so a node that only sees negative inputs stops learning

In [None]:
X = torch.linspace(-10.0, 10, 100)
Y = torch.nn.functional.relu(X)
px.line(x=X, y=Y, title="ReLU", labels={'x':'input', 'y':'output'})

### Leaky ReLU

$$ \text{LeakyReLU}(x) =  \begin{cases} 
      x, & \text{if} \ \ x \geq 0 \\
      mx, & \text{if} \ \ x <    0
   \end{cases} $$
$$ \text{where} \quad 0 < m < 1 $$

- helps avoid vanishing gradients in ReLU by giving negative inputs a small slope
- still pretty simple to compute

In [None]:
X = torch.linspace(-10.0, 10, 100)
Y = torch.nn.functional.leaky_relu(X, negative_slope=0.1)
px.line(x=X, y=Y, title="Leaky ReLU", labels={'x':'input', 'y':'output'})
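
The difference between the two activations is easiest to see in their gradients. This sketch uses autograd on three made-up inputs to read off the slope of ReLU and Leaky ReLU at each point; the variable names are illustrative.

```python
import torch

x = torch.tensor([-3.0, -0.5, 2.0], requires_grad=True)

# the gradient of sum(relu(x)) w.r.t. x is the slope of ReLU at each input
torch.nn.functional.relu(x).sum().backward()
relu_grad = x.grad.clone()

x.grad.zero_()  # clear the stored gradient before the second backward pass
torch.nn.functional.leaky_relu(x, negative_slope=0.1).sum().backward()
leaky_grad = x.grad.clone()

print(relu_grad)   # slope 0 for the negative inputs, 1 for the positive one
print(leaky_grad)  # slope 0.1 for the negative inputs, 1 for the positive one
```
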

### Parametric ReLU

$$ \text{PReLU}(x) =  \begin{cases} 
      x, & \text{if} \ \ x \geq 0 \\
      mx, & \text{if} \ \ x <    0
   \end{cases} $$
$$ m \text{ is a learnable parameter} $$
- same as Leaky ReLU but the $m$ is a learnable parameter, like the weights in the linear operations of neural net layers
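
A small sketch of what "learnable" means here: `torch.nn.PReLU` stores $m$ as a parameter (initialized to 0.25 by default), and a backward pass produces a gradient for it just like for a weight. The input values are made up.

```python
import torch

prelu = torch.nn.PReLU()            # one learnable slope m, initialized to 0.25 by default
print(list(prelu.parameters()))     # m shows up in the module's parameters

x = torch.tensor([-2.0, 3.0])
y = prelu(x)                        # -2 * 0.25 = -0.5 for the negative input, 3 is unchanged
y.sum().backward()
print(prelu.weight.grad)            # gradient w.r.t. m, so an optimizer can update it
```
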

## Loss functions

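- If you're predicting a continuous output
  - default loss function is mean squared error
- If you're predicting classes (one or more)
  - use binary cross entropy when each class is its own yes/no label; cross entropy is the usual choice when exactly one of several classes is correct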
## Train a Neural Net with Stochastic Gradient Descent

We're going to train a neural net to predict the Y in the data below. This data has one dimension in the input and one dimension in the output.

In [None]:
import pandas as pd
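df = torch.load('nn_dataframe.pt', weights_only=False) # the file holds a pickled pandas DataFrame, which recent pytorch versions refuse to load unless weights_only=False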
df

In [None]:
px.scatter(df, x='x', y='y')

### Make the Neural Net

Make a neural net that has two hidden layers, with three nodes in the first hidden layer and two nodes in the second. Use PReLU activation functions.

Normally, you'd have to determine how many layers, how many nodes, and which activation function to use by **hyperparameter tuning**, but for the first neural net example, to make it simple, I'll give you the right hyperparameters.

In [None]:
class NeuralNet(torch.nn.Module): # inherits from torch.nn.Module
    def __init__(self): # defines initialization of this module
        super().__init__() # calls the parent class' (torch.nn.Module) initializer
        self.layer1 = torch.nn.Sequential(
            torch.nn.Linear(in_features=1, out_features=3),
            torch.nn.PReLU(),
        )
        self.layer2 = torch.nn.Sequential(
            torch.nn.Linear(in_features=3, out_features=2),
            torch.nn.PReLU(),
        )
        self.layer_output = torch.nn.Sequential(
            torch.nn.Linear(in_features=2, out_features=1),
            torch.nn.PReLU(),
        )
    
    def forward(self, x): # forward pass through neural network
        z = self.layer1(x)
        s = self.layer2(z)
        y = self.layer_output(s)
        return y

### Standardize the data

In [None]:
def standardize(data):
    assert isinstance(data, torch.Tensor) # check that data is a pytorch tensor
    column_mu = data.mean(dim=0) # take the mean of each column
    column_sigma = data.std(dim=0) # take the std deviation of each column
    data_standardized = (data-column_mu)/column_sigma # subtract each column's mean, then divide by the column's std deviation
    return data_standardized, column_mu, column_sigma

def unstandardize(data_stand, column_mu, column_sigma):
    assert isinstance(data_stand, torch.Tensor)
    return data_stand * column_sigma + column_mu

In [None]:
X_all, x_mu, x_sigma = standardize(torch.tensor(df['x'].values))
Y_all, y_mu, y_sigma = standardize(torch.tensor(df['y'].values))
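
As a sanity check of the standardization arithmetic, this sketch repeats the same steps inline on a tiny made-up tensor: after standardizing, each column should have mean 0 and standard deviation 1, and unstandardizing should recover the original values.

```python
import torch

data = torch.tensor([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])           # made-up data: 3 rows, 2 columns
mu = data.mean(dim=0)                        # mean of each column
sigma = data.std(dim=0)                      # std deviation of each column

data_stand = (data - mu) / sigma             # standardize
data_back = data_stand * sigma + mu          # unstandardize

print(data_stand.mean(dim=0), data_stand.std(dim=0))
print(torch.allclose(data_back, data))
```
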

### Train, Validation, Test Split

In [None]:
import math

random_order_indices = torch.randperm(len(X_all)) # random order of X_all's indices
train_proportion = 0.7
valid_proportion = 0.15
test_proportion = 0.15
train_valid_index_border = math.floor(len(X_all)*train_proportion)
valid_test_index_border = math.floor(len(X_all)*(train_proportion+valid_proportion))
train_indices = random_order_indices[0:train_valid_index_border] # take first 70% of random_order_indices to be the train indices
valid_indices = random_order_indices[train_valid_index_border:valid_test_index_border]
test_indices = random_order_indices[valid_test_index_border:]

X = X_all[train_indices]
Y = Y_all[train_indices]
X_valid = X_all[valid_indices]
Y_valid = Y_all[valid_indices]
X_test  = X_all[test_indices]
Y_test  = Y_all[test_indices]

### Stochastic Gradient Descent

- Doesn't use all the dataset at once
- Breaks data into batches
- Trains model on one batch at a time
- Useful when your dataset is very large and/or when your neural net requires a lot of memory to do forward or backward passes
- Often helps the model generalize better as well
- After going through all data in the dataset once, shuffle the data and repeat
- Each round through the entire dataset is called an **epoch**
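
The steps above can be sketched without any helper classes. This is a minimal illustration of the batching and shuffling only, with made-up sizes and no model; the names `batches_seen` and `order` are assumptions.

```python
import math
import torch

torch.manual_seed(0)
N, batch_size, n_epochs = 10, 4, 2     # illustrative sizes
X = torch.arange(N).float().unsqueeze(1)

batches_seen = []
for epoch in range(n_epochs):
    order = torch.randperm(N)                    # reshuffle at the start of every epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]    # indices of one batch
        X_batch = X[idx]                         # a model would train on this batch here
        batches_seen.append((epoch, X_batch.shape[0]))

print(batches_seen)
```

Note that the last batch of each epoch is smaller whenever the dataset size is not a multiple of the batch size.
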

### Using pytorch's Dataset and DataLoader for Stochastic Gradient Descent (SGD)

- DataLoader does the batching and shuffling for you
- Data needs to be in a pytorch Dataset class to use DataLoader

In [None]:
from torch.utils.data import Dataset, DataLoader

In [None]:
class NeuralNetData(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.Y = Y
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx].float(), self.Y[idx].float()

- A dataset class needs the three functions you see above
- If you are using SGD to save on memory usage, do not load the entire dataset when creating the dataset class like above. You'll need to grab the data from its location on your computer storage in the `__getitem__` function
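
A minimal sketch of that lazy-loading idea, assuming each example was saved to its own file. The class name `LazyFileData`, the file names, and the `'x'`/`'y'` dictionary keys are all hypothetical; only the file paths are kept in memory, and each example is read from disk when it is requested.

```python
import os
import tempfile
import torch
from torch.utils.data import Dataset

class LazyFileData(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        example = torch.load(self.paths[idx])    # read one example from disk on demand
        return example['x'].float(), example['y'].float()

# write a few made-up examples to a temporary folder, one file each
folder = tempfile.mkdtemp()
paths = []
for i in range(5):
    path = os.path.join(folder, f'example_{i}.pt')
    torch.save({'x': torch.tensor([float(i)]), 'y': torch.tensor([2.0 * i])}, path)
    paths.append(path)

lazy_dataset = LazyFileData(paths)
x2, y2 = lazy_dataset[2]
print(len(lazy_dataset), x2, y2)
```
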

To get the batching to work in pytorch all datasets need to have at least two dimensions.

First dimension is the number of data points.

In [None]:
X = X.unsqueeze(1).float()
Y = Y.unsqueeze(1).float()
X_valid = X_valid.unsqueeze(1).float()
Y_valid = Y_valid.unsqueeze(1).float()
X_test = X_test.unsqueeze(1).float()
Y_test = Y_test.unsqueeze(1).float()

In [None]:
Y.shape

In [None]:
train_dataset = NeuralNetData(X, Y)
train_dataloader = DataLoader(dataset=train_dataset, batch_size=100, shuffle=True)
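
To see what the DataLoader hands back, this sketch iterates over one epoch of a made-up dataset and records the batch sizes. `TensorDataset` is used here as a stand-in for a custom Dataset class, and the 250 data points are an assumption for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X_demo = torch.randn(250, 1)     # 250 made-up data points, 1 feature
Y_demo = torch.randn(250, 1)
loader = DataLoader(TensorDataset(X_demo, Y_demo), batch_size=100, shuffle=True)

batch_sizes = [x_batch.shape[0] for x_batch, y_batch in loader]
print(len(loader), batch_sizes)  # 3 batches; the last one holds the leftover 50 points
```
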

### SGD

- change the step function from previous lessons to include a loop to iterate over batches
- we'll only evaluate with validation data after each epoch

In [None]:
def one_epoch(train_dataloader, X_valid, Y_valid, model, optim, Losses, Losses_valid):
    global epoch
    print('epoch #{}'.format(epoch))
    for batch in train_dataloader:
        optim.zero_grad() # reset the gradients
        model.train() # tells pytorch we're training the model
        X = batch[0]
        Y = batch[1]
        pred_y = model(X) # make a prediction

        loss = torch.nn.functional.mse_loss(pred_y, Y) # calculate mse loss
        loss.backward() # calculate gradients
        optim.step() # take gradient descent step
        Losses.append(loss.item()) # add current training loss to Losses list

    model.eval() # tells pytorch we're not training our module (we're validating now instead of training)
    with torch.no_grad(): # Tells pytorch to not track gradients. This prevents unnecessary calculations and memory usage
        pred_y_valid = model(X_valid)
        loss_valid = torch.nn.functional.mse_loss(pred_y_valid, Y_valid)
    
    Losses_valid.append(loss_valid.item()) # add current validation loss to Losses_valid list
    print('Validation Loss: ' + str(loss_valid.item()))
    epoch += 1

In [None]:
def graph_steps(title):
    Losses = []
    Losses_valid = []
    fig = go.FigureWidget()
    
    trace_train = go.Scatter( x=list(range(len(Losses))), y=Losses, line=dict(color=px.colors.qualitative.Plotly[0]), name='train loss' )
    fig.add_trace(trace_train)
    
    trace_valid = go.Scatter( x=list(range(len(Losses_valid))), y=Losses_valid, line=dict(color=px.colors.qualitative.Plotly[1]), name='valid loss' )
    fig.add_trace(trace_valid)
    
    fig.update_layout( autosize=False, width=750, height=500, title=title)
    fig.update_xaxes( title="steps")
    fig.update_yaxes( title='loss')
    return fig

In [None]:
neural_net = NeuralNet() # initialize neural net
optim = torch.optim.SGD(neural_net.parameters(), lr=0.05)
epoch = 0
step = 0
Losses = []
Losses_valid = []

### Initializing values

- Neural nets usually have many local minima; the larger your neural net, the more minima you'll have
- The values you initialize the weights with have a huge effect on the results you get
- We're going to use Kaiming/He initialization which works well with ReLU activation functions

In [None]:
def init_weights(m):
    if isinstance(m, torch.nn.Linear):
        torch.nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
neural_net.apply(init_weights)
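
For a sense of what this initialization does: with `nonlinearity='relu'` and the default fan-in mode, `kaiming_uniform_` draws weights uniformly from $[-b, b]$ with $b = \sqrt{6 / \text{fan\_in}}$, so layers with more inputs start with smaller weights. A small check on a stand-alone layer (the layer sizes are illustrative):

```python
import math
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(in_features=3, out_features=2)
torch.nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')

fan_in = layer.weight.shape[1]              # number of inputs to the layer
bound = math.sqrt(6.0 / fan_in)             # largest weight magnitude the initializer can draw
print(layer.weight.abs().max().item() <= bound)
```
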

In [None]:
fig = graph_steps('MSE Loss')
train_loss = fig.data[0] # use this variable to add train losses to graph
valid_loss = fig.data[1] # use this variable to add validation losses to graph
fig

In [None]:
one_epoch(train_dataloader, X_valid, Y_valid, neural_net, optim, Losses, Losses_valid)
# if Losses_valid[-1] < best_model_valid_loss:
#     best_model = copy_linear(model)
#     best_model = Losses_valid[-1]
train_loss.x = list(range(len(Losses))) # add the train steps to graph on x-axis
train_loss.y = Losses # add the train losses to graph on y-axis
valid_loss.x = list(range(len(train_dataloader)-1, len(Losses_valid)*len(train_dataloader), len(train_dataloader))) # one validation point at the last training step of each epoch
valid_loss.y = Losses_valid

### Calculate MSE on test data

In [None]:
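neural_net.eval() # tells pytorch we're evaluating, not training
with torch.no_grad(): # no gradients are needed to compute the test loss
    test_pred_y = neural_net(X_test)
    test_mse = torch.nn.functional.mse_loss(test_pred_y, Y_test)
test_mse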
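
Because `Y` was standardized, this MSE is in standardized units. Unstandardizing multiplies by `y_sigma`, so squared errors scale by `y_sigma ** 2`. A sketch with made-up numbers (every value below is an assumption, not a result from the dataset above):

```python
import torch

y_mu, y_sigma = torch.tensor(5.0), torch.tensor(2.0)    # made-up standardization constants
pred_std = torch.tensor([0.1, -0.4, 1.2])               # made-up standardized predictions
true_std = torch.tensor([0.0, -0.5, 1.0])               # made-up standardized targets

mse_std = torch.nn.functional.mse_loss(pred_std, true_std)

# compute the MSE again after converting both back to original units
pred_orig = pred_std * y_sigma + y_mu
true_orig = true_std * y_sigma + y_mu
mse_orig = torch.nn.functional.mse_loss(pred_orig, true_orig)

print(torch.allclose(mse_orig, mse_std * y_sigma ** 2, atol=1e-5))
```
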

### Plot the regression line your neural net produced

The code below will import the actual underlying trend line, plot the data generated from this trend line, and plot the line your neural net predicted.

In [None]:
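df_trend = torch.load('nn_dataframe_trend.pt', weights_only=False) # pickled pandas DataFrame, so weights_only must be False on recent pytorch versions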
trend_X = df_trend['x'].values
trend_Y = df_trend['y'].values

In [None]:
line_X = torch.linspace(X_all.min(), X_all.max(), 1000).unsqueeze(1)
line_Y = neural_net(line_X)
line_X = unstandardize(line_X, x_mu, x_sigma)
line_Y = unstandardize(line_Y, y_mu, y_sigma)
X_all_unstd = unstandardize(X_all, x_mu, x_sigma)
Y_all_unstd = unstandardize(Y_all, y_mu, y_sigma)

In [None]:
fig = go.Figure()
data_trace = go.Scatter(x=X_all_unstd.detach().squeeze(), y=Y_all_unstd.detach().squeeze(), mode='markers', name='data')
line_trace = go.Scatter(x=line_X.squeeze().detach(), y=line_Y.squeeze().detach(), mode='lines', name='nn prediction')
trend_trace = go.Scatter(x=trend_X, y=trend_Y, mode='lines', name='actual trend')
fig.add_traces([data_trace, line_trace, trend_trace])


## Quiz

### 1

Draw a neural net of logistic regression with 4 input features

*Answer is in file `log_reg_4x_nn_diagram.png`*

### 2

Predict handwritten digits

In [None]:
import torchvision

training_data = torchvision.datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

test_data = torchvision.datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

- Each element of `training_data` and `test_data` is a tuple with the image and its label
- The first element of the tuple is the image

In [None]:
image_0 = training_data[0][0]
image_0.shape

- Pytorch's image format puts channels in the first dimension, height in the second, and width in the third
- `px.imshow` needs a two-dimensional array here, so squeeze the channel dimension (the first one, of size 1) away to display the image

In [None]:
px.imshow(image_0.squeeze(), color_continuous_scale='gray')

- The second element of the tuple is the label

In [None]:
training_data[0][1]

Now that you understand the dataset, make a neural net to predict the digit from the images.

You may want to research convolutional layers to do this task. Convolutional layers can take advantage of the grid structure of images.
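
As a starting point for that research, and not a solution to the quiz, this sketch passes one random stand-in image through a single `torch.nn.Conv2d` layer to show how the shapes behave. The number of output channels and the kernel size are arbitrary choices.

```python
import torch

image_batch = torch.rand(1, 1, 28, 28)    # stand-in for one MNIST image: batch, channels, height, width
conv = torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

features = conv(image_batch)              # 8 feature maps; padding=1 keeps height and width at 28
print(features.shape)
```
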