<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_L09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the in-class notebook for MGSC496 Lecture 9.

In [None]:
import sklearn
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Reading Exercise Solution: Single Neuron to Predict Cat
 

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise:  Make a single-layer NN to predict whether an animal is a cat</font>

Let's take some data on different animals and build our own single-layer neural network (one neuron) that predicts whether an animal is a cat based on some features. We haven't talked about how to **train** a neural network yet, so we'll just make our weights random and keep trying until we get something that gives the right prediction.

Here is some data on different animals:

In [None]:
import pandas as pd
import numpy as np

In [None]:
cat_df = pd.DataFrame({'is_mammal':[1,1,1,0],
                       'four_legged':[0,1,1,1],
                       'body_weight_lbs':[70.0,8800.0,9.0,0.5],
                       'body_height_inches':[36.0,324.0,9.0,1.0],
                       'has_thumbs':[1,0,0,0],
                       'animal':['chimp','elephant','cat','lizard'],
                       'is_cat':[0,0,1,0]})
cat_df.head()

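| | is_mammal | four_legged | body_weight_lbs | body_height_inches | has_thumbs | animal | is_cat |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 70.0 | 36.0 | 1 | chimp | 0 |
| 1 | 1 | 1 | 8800.0 | 324.0 | 0 | elephant | 0 |
| 2 | 1 | 1 | 9.0 | 9.0 | 0 | cat | 1 |
| 3 | 0 | 1 | 0.5 | 1.0 | 0 | lizard | 0 |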


We'll use the first five columns as our features $x$. Our goal is to randomly pick five weights $w_i$ ($i=0,...,4$) and one bias $b$ and then calculate $y = \delta(\sum_i{w_i x_i} + b)$. 

If $x$ is a numpy array of all the features, then we can use `np.dot(x,w)` to compute the summation term, $\sum_i{w_i x_i}$. We can then add the bias $b$ and apply a version of the step function $\delta()$:

```python
yhat = delta(np.dot(x,w)+b)
```
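As a hedged illustration of that one-line prediction, the sketch below evaluates a single made-up data point by hand. The feature values, weights, and bias are invented for the example (they are not taken from the animal data), and `delta` is repeated here only so the block runs on its own. It also shows that, with the default threshold of 0.5, thresholding the sigmoid is the same as asking whether the weighted sum plus bias is positive.

```python
import numpy as np

def delta(x, thresh=0.5):
    # Same definition as the notebook's delta: sigmoid followed by a threshold
    sigmoid = 1 / (1 + np.exp(-x))
    return sigmoid > thresh

# Hypothetical standardized features, weights, and bias (illustration only)
x = np.array([1.0, 1.0, -0.5, -0.5, -0.5])
w = np.array([0.5, 0.5, -1.0, -1.0, -1.0])
b = -0.5

# Weighted sum plus bias: 0.5 + 0.5 + 0.5 + 0.5 + 0.5 - 0.5 = 2.0
z = np.dot(x, w) + b
yhat = delta(z)
print(z, yhat)

# sigmoid(z) > 0.5 exactly when z > 0, so the neuron "fires" for positive z
zs = np.array([-3.0, -0.1, 0.1, 3.0])
print(delta(zs), zs > 0)
```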

Before we do this, however, we should think about the weights and the features. As we learned from studying sklearn, ML models often struggle when the numerical features of data are distributed very differently. Our data includes dummy variables (which are either 0 or 1) and some numerical variables (like `body_weight_lbs` and `body_height_inches`) which have very different scales. If we're going to pick some random values for the weights and biases, we'll have to use a function from the numpy module to give us those random numbers.

But how should these weights be distributed? If all of our features had a mean close to 0 and a standard deviation close to 1, then we could draw our weights and bias from a standard normal distribution. Let's do that.
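To make the effect of standardizing concrete, here is a minimal NumPy sketch that applies the same transformation `StandardScaler` performs by default (subtract each column's mean, then divide by its population standard deviation) to the weight and height columns from the table above. It is a by-hand stand-in for illustration, not a replacement for the sklearn class.

```python
import numpy as np

# body_weight_lbs and body_height_inches for chimp, elephant, cat, lizard
raw = np.array([[70.0, 36.0],
                [8800.0, 324.0],
                [9.0, 9.0],
                [0.5, 1.0]])

# Standardize column by column: zero mean, unit (population) standard deviation
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(standardized.round(3))
print("means:", standardized.mean(axis=0).round(6))
print("stds: ", standardized.std(axis=0).round(6))
```

After this step both columns live on a comparable scale, so a single distribution for the random weights is a reasonable choice.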

In a single code block:

1. Use the `StandardScaler` that you learned about from studying sklearn to standardize all of the features.

2. Use `np.random.randn()` to generate 5 weights and 1 bias from the standard normal distribution.

3. Use `yhat = delta(np.dot(x,w)+b)` (I've defined the delta function for you below) to get a prediction (`True` or `False`) for whether each data point is a cat.

4. Rerun your code and watch how different random choices of weights and bias lead to different answers


In [None]:
def delta(x, thresh=0.5):
  sigmoid = 1/(1+np.exp(-x))
  return (sigmoid>thresh)

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Use the StandardScaler from sklearn. Fit this to the first 5 columns of cat_df and then use .transform() to transform the first 5 columns of the data (the features)
#    and define this variable as x
x = cat_df.loc[:,'is_mammal':'has_thumbs']
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)

# 2. Use np.random.randn() to generate a random numpy array of 5 normally distributed weights called w. Also use it to generate a random numpy array of 1 normally
#    distributed bias called b. Print out w and b
w = np.random.randn(5)
b = np.random.randn(1)
print(w)
print(b)

# 3. Use yhat = delta(np.dot(x,w)+b) to get a prediction (for each row of the data) of whether the row is a cat and print out yhat
yhat = delta(np.dot(x,w)+b)
print(yhat)

# Re-run this cell multiple times and look for the result [False False True False] -- this is predicting the cat is a cat and all others are not

[-1.96383719 -0.13417717 -0.71409171 -0.75798395  1.29879528]
[-1.07256832]
[ True False False  True]
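Re-running the cell by hand can take a while, so the search can also be automated. The sketch below is one possible way to do it and is not part of the original exercise: it redraws random weights until the prediction matches `is_cat`. To stay self-contained it standardizes the features with plain NumPy instead of `StandardScaler`, and the seed and the `max_tries` cap are arbitrary choices.

```python
import numpy as np

def delta(x, thresh=0.5):
    # Sigmoid followed by a threshold, as in the notebook
    return 1 / (1 + np.exp(-x)) > thresh

# The five feature columns for chimp, elephant, cat, lizard
features = np.array([[1, 0, 70.0, 36.0, 1],
                     [1, 1, 8800.0, 324.0, 0],
                     [1, 1, 9.0, 9.0, 0],
                     [0, 1, 0.5, 1.0, 0]])
target = np.array([False, False, True, False])  # is_cat

# By-hand standardization (zero mean, unit population std per column)
x = (features - features.mean(axis=0)) / features.std(axis=0)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability
max_tries = 10_000              # arbitrary cap so the loop always ends
found = None
for attempt in range(1, max_tries + 1):
    w = rng.standard_normal(5)
    b = rng.standard_normal(1)
    yhat = delta(np.dot(x, w) + b)
    if np.array_equal(yhat, target):
        found = (attempt, w, b)
        break

if found is not None:
    attempt, w, b = found
    print(f"matched is_cat after {attempt} tries")
    print(w, b)
else:
    print("no match within max_tries")
```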


Now you have found a set of weights and bias that give the right prediction on the training data. Of course, we want a **training procedure** that intelligently updates the weights (rather than just picking them randomly) to minimize some error (or "Loss Function"). We would also want our neural network (here just a single neuron) to predict correctly on unseen data (test data). We will talk about this, but first let's talk about moving from a single neuron to a **neural network**.

<hr/>

# Example: Training a single neuron

Let's repeat the above, but try to use our understanding of how a neural net is trained to update the weights/biases intelligently, instead of just picking some random numbers.

Instead of setting the output to true or false, I'll keep the number after the sigmoid, as this will be better for training:

In [None]:
def sigmoid(x):
  # logistic function: returns a value between 0 and 1 (no thresholding)
  return 1/(1+torch.exp(-x))

Now, we can just make some random weights with pytorch (you'll read about this soon, but for now, just know that pytorch keeps track of and calculates the gradients for us, which is something that numpy doesn't do!):

In [None]:
w = torch.tensor(np.random.randn(5),requires_grad=True) # make a random tensor of 5 weights.
b = torch.tensor(np.random.randn(1),requires_grad=True) # make a random tensor of a single bias.
print(w)
print(b)

tensor([-0.0933, -0.3533, -1.1217,  0.8401, -2.1974], dtype=torch.float64,
       requires_grad=True)
tensor([-0.9246], dtype=torch.float64, requires_grad=True)


In the last reading/lecture we used `np.dot()` to multiply the features by the weights. More generally, to get a weighted sum of the input features, we will use matrix multiplication. The operator for matrix multiplication in numpy and pytorch is `@`. So, to get the weighted sum that we want to pass into our nonlinear function (here the sigmoid $\sigma$ rather than the thresholded $\delta$), where the $i$th element is:

$\hat{y}_i =  \sigma ( w_0X_{i0}+w_1X_{i1}+w_2X_{i2}+w_3X_{i3}+w_4X_{i4}+b )$

we can use `X @ w` like this:

In [None]:
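X = torch.tensor(x) # assumption: X is the standardized feature array x from the exercise above
y = torch.tensor(cat_df['is_cat'].values, dtype=torch.float64) # assumption: y is the is_cat column
yhat = sigmoid(X @ w + b) # use the torch sigmoid defined above so that pytorch can track gradients
yhat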

tensor([0.0288, 0.2973, 0.5803, 0.4784], dtype=torch.float64,
       grad_fn=<MulBackward0>)
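To connect the `@` operator back to the per-row formula, the sketch below uses NumPy with made-up numbers (the matrix, weights, and bias are placeholders, not the notebook's `X`, `w`, and `b`) and checks that `X @ w + b` gives the same values as writing out the weighted sum one row at a time.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
X_demo = rng.standard_normal((4, 5))    # 4 rows (data points), 5 features
w_demo = rng.standard_normal(5)         # one weight per feature
b_demo = rng.standard_normal(1)         # a single bias

# Matrix form: one weighted sum per row, bias broadcast to every row
z_matrix = X_demo @ w_demo + b_demo

# Explicit form: w_0*X_i0 + ... + w_4*X_i4 + b for each row i
z_rows = np.array([
    sum(w_demo[j] * X_demo[i, j] for j in range(5)) + b_demo[0]
    for i in range(4)
])

print(z_matrix)
print(np.allclose(z_matrix, z_rows))
```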

Finally, we can make a cost function which takes all the predicted outcomes $\hat{y}$ and all the actual outcomes $y$ and calculates a single scalar (in this case, the sum of the squared difference between each predicted outcome and each actual outcome):

In [None]:
C = ((yhat - y)**2).sum() # Define the cost as the sum of the squared difference
C

tensor(0.4942, dtype=torch.float64, grad_fn=<SumBackward0>)
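As a sanity check on that number, the cost can be reproduced by hand from the rounded predictions printed earlier, assuming `y` holds the `is_cat` column (0, 0, 1, 0). Because the inputs are rounded to four decimals, the result only matches the printed cost approximately.

```python
import numpy as np

# Rounded predictions copied from the printed yhat tensor
yhat_printed = np.array([0.0288, 0.2973, 0.5803, 0.4784])
# Assumed labels: only the third animal (the cat) is a cat
y_assumed = np.array([0.0, 0.0, 1.0, 0.0])

squared_errors = (yhat_printed - y_assumed) ** 2  # one squared error per animal
C_by_hand = squared_errors.sum()                  # the cost is their sum

print(squared_errors.round(4))
print(round(float(C_by_hand), 4))
```

The cat contributes a large share of the cost here because its predicted value (about 0.58) is still far from the target of 1.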

Now, instead of randomly guessing weights and bias until we get a set that gives us the right prediction, we can use pytorch to get the gradient:

In [None]:
C.backward() # this call will tell pytorch to use backward propagation to produce gradients for w and b 

In [None]:
print(w.grad)
print(b.grad)

tensor([-0.4589,  0.0888,  0.1938,  0.1779, -0.0888], dtype=torch.float64)
tensor([0.1602], dtype=torch.float64)
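For intuition about what `C.backward()` computes, here is a hedged NumPy sketch that works out the same kind of gradient by hand with the chain rule and compares it with a finite-difference estimate. The data and starting weights are random placeholders rather than the notebook's values, and the helper names (`cost`, `analytic_grad`) are made up for this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    # Sum of squared differences between predictions and labels
    yhat = sigmoid(X @ w + b)
    return ((yhat - y) ** 2).sum()

def analytic_grad(X, y, w, b):
    # Chain rule: dC/dz_i = 2*(yhat_i - y_i) * yhat_i*(1 - yhat_i)
    yhat = sigmoid(X @ w + b)
    dz = 2 * (yhat - y) * yhat * (1 - yhat)
    return X.T @ dz, dz.sum()  # dC/dw (5 values), dC/db (scalar)

rng = np.random.default_rng(0)       # arbitrary seed
X = rng.standard_normal((4, 5))      # placeholder features
y = np.array([0.0, 0.0, 1.0, 0.0])   # placeholder labels
w = rng.standard_normal(5)
b = rng.standard_normal()

grad_w, grad_b = analytic_grad(X, y, w, b)

# Central finite differences: nudge one parameter at a time and watch the cost
eps = 1e-6
num_grad_b = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)
num_grad_w = np.zeros(5)
for j in range(5):
    step = np.zeros(5)
    step[j] = eps
    num_grad_w[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)

print(grad_w, num_grad_w)
print(grad_b, num_grad_b)
```

The two estimates should agree to several decimal places, which is the sense in which a gradient tells us how the cost responds to a small nudge in each parameter.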


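The gradient values above tell us how quickly the cost changes as each weight/bias changes, so they give the relative amounts by which we have to nudge each weight/bias (in the direction opposite to the gradient, so that the cost goes down). Typically, we multiply these nudges by a small number, called the learning rate. This is "taking a tiny step in the right direction".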
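A toy one-parameter example may help here. The function $f(w) = (w-3)^2$ below is invented purely for illustration; its gradient is $2(w-3)$, and repeatedly stepping against the gradient with a learning rate of 0.05 moves $w$ toward the minimum at 3.

```python
def f(w):
    # Toy cost with its minimum at w = 3
    return (w - 3.0) ** 2

def grad_f(w):
    # Derivative of the toy cost
    return 2.0 * (w - 3.0)

learning_rate = 0.05
w = 0.0                 # arbitrary starting point
history = [f(w)]        # cost before any update

for step in range(100):
    w -= learning_rate * grad_f(w)  # small step against the gradient
    history.append(f(w))

print(w, history[0], history[-1])
```

Each step shrinks the distance to the minimum by a constant factor, so the cost falls steadily rather than jumping around.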

When we do this weight update in pytorch, we have to tell it that we are "just updating the weights" and that its gradient package should not track these calculations (we'll also talk about this later). We do this with `with torch.no_grad()`:

In [None]:
with torch.no_grad(): # tell pytorch not to use the equations in this code block to setup autograd relationships
  w-=w.grad*5e-2 # move w a bit in the right direction, learning rate is 0.05
  b-=b.grad*5e-2 # move b a bit in the right direction, learning rate is 0.05
  w.grad.zero_() # now set all the gradients to zero so we can feed the entire training data back through again next epoch
  b.grad.zero_()


In [None]:
print(w)
print(b)
yhat = sigmoid(X @ w + b)
yhat

tensor([ 0.2062, -0.4249, -1.2659,  0.7071, -2.1258], dtype=torch.float64,
       requires_grad=True)
tensor([-1.0515], dtype=torch.float64, requires_grad=True)


tensor([0.0305, 0.2841, 0.5858, 0.4579], dtype=torch.float64,
       grad_fn=<MulBackward0>)

In [None]:
# Code to train single neuron "is_cat" predictor

w = torch.tensor(np.random.randn(5),requires_grad=True)
b = torch.tensor(np.random.randn(1),requires_grad=True)

for epoch in range(1,1000):
  yhat = sigmoid(X @ w + b) # forward pass: predicted value for each row
  C = ((yhat - y)**2).sum() # cost: sum of the squared differences
  if epoch%100==0:
    print(f"yhat = {yhat}")
    print(f"cost: {C}")
  C.backward()
  with torch.no_grad():
    w-=w.grad*5e-2
    b-=b.grad*5e-2
    w.grad.zero_()
    b.grad.zero_()



yhat = tensor([0.1364, 0.1257, 0.6720, 0.2269], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.19347783343739194
yhat = tensor([0.1014, 0.1009, 0.8205, 0.1477], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.07450588159983429
yhat = tensor([0.0825, 0.0827, 0.8664, 0.1145], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.044609512773799175
yhat = tensor([0.0707, 0.0710, 0.8898, 0.0961], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.031414565887348395
yhat = tensor([0.0626, 0.0629, 0.9044, 0.0841], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.024082368683748818
yhat = tensor([0.0567, 0.0569, 0.9146, 0.0755], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.019450716072030402
yhat = tensor([0.0521, 0.0523, 0.9222, 0.0691], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.01627412839514157
yhat = tensor([0.0484, 0.0486, 0.9281, 0.0640], dtype=torch.float64,
       grad_fn=<MulBackward0>)
cost: 0.013966

In [None]:
print(w)
print(b)

tensor([ 2.4109,  1.1539, -1.6755, -0.8196, -1.2763], dtype=torch.float64,
       requires_grad=True)
tensor([-1.5820], dtype=torch.float64, requires_grad=True)
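To tie this back to the first exercise, the trained neuron's outputs can be turned into yes/no predictions by thresholding at 0.5, just as `delta` did. The sketch below uses the last predictions printed in the training log above and assumes the rows are in the original order (chimp, elephant, cat, lizard).

```python
import numpy as np

# Last yhat printed during training (rounded values copied from the log)
yhat_final = np.array([0.0484, 0.0486, 0.9281, 0.0640])
animals = ['chimp', 'elephant', 'cat', 'lizard']  # assumed row order

# Same 0.5 threshold that delta used in the reading exercise
is_cat_pred = yhat_final > 0.5

for animal, p, pred in zip(animals, yhat_final, is_cat_pred):
    print(f"{animal}: output {p:.4f}, predicted cat? {pred}")
```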
