<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_R08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Your Info

your_name = '' #@param {type:"string"}
your_email = '' #@param {type:"string"}
today_date = '' #@param {type:"date"}


# How to "read" this notebook

As you go through this notebook (or any notebook for this class), you will encounter new concepts and python code that implements them -- just like you would see in a textbook. Of course, in a textbook, it's easy to read code and an explanation of what it does and think that you understand it.
<br />
<br />

### Learn by doing
But this notebook is different from a textbook because it allows you to not just read the code, but play with it. **You can and should try out changing the code that you see**. In fact, in many places throughout this reading notebook, you will be asked to write your own code to experiment with a concept that was just covered. This is a form of "active reading" and the idea behind it is that we really learn by **doing**. 
<br />
<br />

### Change everything
But don't feel limited to only change code when I prompt you. This notebook is your learning environment and your playground. I encourage you to try changing and running all the code throughout the notebook and even to **add your own notes and new code blocks**. Adding comments to code to explain what you are testing, experimenting with or trying to do is really helpful to understand what you were thinking when you revisit it later. 
<br />
<br />
### Make this notebook your own
Make this notebook your own. Write your questions and thoughts. At the end of every reading notebook, I will ask the same set of questions to try to elicit your questions, reaction and feedback. When we review the reading notebook in class, I encourage you to   



# Code Preface

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Pytorch

![](https://drive.google.com/uc?id=1OFx-0HlKzV2kOVaD2SNkZwOUpHhClMiJ)

## What is PyTorch?

Pytorch is an open source machine learning framework that we can use to build and train artificial neural networks.


## Why PyTorch?

You might have heard about a very popular Neural Network library called Tensorflow. And you might be wondering "Why aren't we learning Tensorflow?".

There are a few answers to this question:
- Declarative vs Imperative:
 - Tensorflow was designed to be **declarative**. In Tensorflow, you set up your neural network architecture as a static graph before a model can run. The graph is full of all of these placeholders that will be replaced with tensors built from data when the model is run.  In this sense, the model is kind of an enclosed box that you define ahead of time, with only a few ways that you can communicate or pass data into this box. 
 - In contrast, PyTorch is **imperative**. You can define, change and pass data through nodes of the graph as you go. This means you have the ability to change things on the fly and peer into what is happening, and even **debug** easily.
- Static vs Dynamic:
 - Because you build static graphs in Tensorflow, it is harder to implement dynamic neural network architectures. Other dynamical things, such as input data that has varying size have to be handled with workarounds (such as padding the data).
 - But PyTorch is naturally dynamic, so it's relatively easy to do these things.
- Pedagogical reasons:
 - The Tensorflow API is a bit cluttered. There are many ways to do things and it isn't always clear what is the best way to do it. While there is a ton of support online (because the community of Tensorflow users is quite large), it has evolved significantly over the years and you may easily find outdated methods and approaches.
 - PyTorch is very easy to learn and is more "Pythonic".


To be clear, there are lots of other differences and Tensorflow has some advantages. Not to mention that Tensorflow has a new Eager execution framework that allows you to do more dynamic things with it. It's also very easy to deploy in production and can be computationally efficient. Ultimately, if you continue to learn and work with neural networks, you will likely have to learn both frameworks. However, I chose PyTorch as a starting framework because I believe it provides a smoother learning curve. And once you know PyTorch and the fundamentals of neural networks, it will be much easier to learn Tensorflow if you choose. Typically, researchers who are working on creating new and interesting neural network architectures prefer to work in PyTorch.


## What does PyTorch do for us?

One of the many things that pytorch provides is the ability to make and work with tensor objects.

A tensor is a number, vector, matrix, or any n-dimensional array.  You might be thinking "Hey, wait a minute, we already have a library for working with numbers, vectors, matrices and n-dimensional arrays -- it's called numpy!".


That's true, but pytorch is different because the tensor objects are built to automatically compute gradients (derivatives) when they are linked to one another through an expression. Computing gradients is an important aspect of the backward propagation step that is used in the training loop to train a neural network.

We will see how all of this works together, but for now let's just explore how to create and work with some very basic tensors.

# Basics of Pytorch tensors

The first thing we need to do is to import the pytorch module and decide whether we want to execute our code on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).

The operations involved in working with tensors and training neural networks can benefit significantly from parallelization. Just as tools such as numpy take advantage of linear algebra libraries to vectorize operations, pytorch can also take advantage of libraries to parallelize computations. Although modern CPUs have gained more cores over the years, they still pale in comparison to GPUs, whose cores can number in the hundreds to thousands. Neural network frameworks such as Pytorch can leverage libraries such as CUDA (a parallel computing framework developed by NVIDIA) to take advantage of all the GPU cores.

Of course, in order to do this, we need to be executing the code on a machine that has a GPU.

Google Colab offers GPUs and TPUs (Tensor Processing Units) as options under the runtime settings: 
`Runtime->Change Runtime Type->Hardware Accelerator-> GPU or TPU`. 

Let's do this now, so that we have access to a GPU for the tensor examples below.

Strictly speaking, we won't need a GPU for training in this notebook, as we'll be working with relatively shallow neural nets that can still be trained in a reasonable amount of time with just a CPU.


When you work with pytorch, you can specify whether to use the CPU or GPU (if one is available). So I will show you code to do this.

NOTE: When first building networks, it's often better to work on the CPU, as debugging is much easier. Converting your tensors and model to run on the GPU is relatively easy and can be done as the last step, when you want to train over many epochs (iterations over the data).
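
One common way to express the CPU-or-GPU choice described above is a single `device` variable that falls back to the CPU when no GPU is present. This is only a minimal sketch; the variable names (`device`, `t`, `result`) are illustrative and are not used elsewhere in this notebook.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device
t = torch.rand(3, 2, device=device)
print(t.device)

# The result of an operation lives on the same device as its inputs
result = (t * 2).sum()
print(result.device)
```

Written this way, the same code runs unchanged on a CPU-only runtime and on a GPU runtime.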


In [None]:
import torch
torch.cuda.is_available() # This will return True if we have set up a GPU on this Colab runtime or False otherwise

Now let's see how to make some simple tensors. It is very similar to how you create multi-dimensional arrays in numpy:

In [None]:
x = torch.tensor([[0.,3.0,-3.4],[-2.4,8.0,5.9]]) # A 2x3 tensor of floats (2 rows, 3 columns)
x

In [None]:
x.shape

In [None]:
x.dtype

In [None]:
torch.zeros(2,2) # a 2x2 tensor with all values set to 0

In [None]:
torch.rand(1) # a 1-D tensor holding a single random value between 0 and 1

You can do all the things with tensors that you can do with numpy, such as the usual arithmetic operations:

In [None]:
x = torch.ones(3, 5)
y = torch.rand(3, 5)
print(x)
print(y)
print(x+y)

In [None]:
y/x

And tensors in torch have their own version of Ufuncs and support broadcasting (just like numpy):

In [None]:
y+1 # will add 1 to every element of y

Tensors can live in the main memory of the machine (if we are going to work with them on the CPU) or they can live in the graphics card's dedicated memory (if we are going to work with them on the GPU):

In [None]:
some_cpu_tensor = torch.rand(5,4)
some_cpu_tensor.device

In [None]:
some_gpu_tensor = torch.rand(5, 4, device="cuda") # pass device="cuda" to create the tensor directly on the GPU (requires a GPU runtime)
some_gpu_tensor

Notice the property `device` indicates `cuda` (i.e., that it lives on the GPU)

We can also move a tensor to the GPU or CPU:

In [None]:
some_cpu_tensor.to("cuda") # This will return a tensor that is identical to the one we created earlier, but that lives in the memory on the GPU

In [None]:
some_gpu_tensor.to("cpu") # This will return a tensor that is identical to the one we created earlier, but that lives in the memory on the CPU

Note that the above calls return copies of the tensor that live in the memory of either the CPU or GPU, but they do not affect the original tensor we created. If we want to work with that copy we have to assign it to a variable. 

In [None]:
some_cpu_tensor

`some_cpu_tensor` still lives in the CPU.

In [None]:
another_gpu_tensor = some_cpu_tensor.to("cuda")
another_gpu_tensor

Pytorch tensors support most of the usual numpy functions, so you should feel right at home working with them. For that reason, I won't go through all the basic things that you can do with tensors. 

For example:

In [None]:
some_gpu_tensor.max()

# Gradients and Autograd

At this point, you might again be wondering why we need pytorch if numpy can do all these things already. But there is one very important thing that pytorch tensors do that is very useful for working with neural networks. They can automatically differentiate -- or in other words, they can calculate gradients (multi-dimensional derivatives) when we combine tensors together with mathematical expressions.  This is exactly what is needed for the "back propagation" step of training a neural net.

When we create a tensor, we can tell pytorch to enable calculation of  gradients for that tensor by specifying the keyword argument `requires_grad=True`.

In [None]:
x = torch.rand(1,requires_grad=True)
x

`requires_grad` is contagious -- any expression that depends on a tensor that requires a gradient will also require a gradient:

In [None]:
y = x**2 # remember ** means "raise to the power"
print(y)
print(y.requires_grad)

Two things to notice:
1. `y` has a function associated with it called `grad_fn`. The name of the function, `PowBackward0`, gives a hint at what it is for: the `Pow` indicates that it comes from a power operation and `Backward` indicates that it is used to propagate the gradient backward. 
2. Because `y` depends on `x` which requires a gradient, it also requires a gradient (contagious).

Whenever we have a **scalar tensor** (a tensor that holds only one value, not a vector or multi-dimensional array) that depends on other tensors, we can tell pytorch to calculate the gradient. We do this by calling `y.backward()`. The term backward here refers to back propagation and it is the fundamental way that we train a neural network. By "train" I mean, adjust the weights and biases of a NN to minimize the loss function.


In [None]:
y.backward() # tell pytorch to calculate the gradients involved in the definition of y

In [None]:
print(f'dy/dx = {x.grad}')
print(2*x)

Notice that pytorch correctly calculated the gradient:  $dy/dx=2x$

How this works is that pytorch takes expressions between tensors and uses them to build a *computational graph* behind the scenes. All operators in this graph are implemented by the **autograd** package. Pytorch can pass data forward through this graph (i.e., start with the input, perform the operations to get the output) and also propagate things backward through this graph by applying the chain rule (remember your calculus?). Autograd is designed in a modular way so that the functions it implements only have to worry about their own role (their own differentiation with respect to inputs and outputs) in the chain rule process.  
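
To make the chain-rule idea concrete, here is a small sketch that composes two operations and compares the gradient from autograd with the derivative worked out by hand. The function $z = \sin(p^2)$ and the variable names are chosen purely for illustration.

```python
import torch

p = torch.tensor(1.5, requires_grad=True)

# Forward pass: each operation adds a node to the computational graph
u = p ** 2          # inner function u = p^2
z = torch.sin(u)    # outer function z = sin(u)

# Backward pass: autograd applies the chain rule, dz/dp = cos(u) * 2p
z.backward()

# The same derivative computed by hand (detached so it is not tracked)
manual = torch.cos(p.detach() ** 2) * 2 * p.detach()
print(p.grad, manual)
print(torch.allclose(p.grad, manual))
```

Each node only needs to know its own local derivative (`cos` for the sine, `2p` for the square); multiplying them along the path from `z` back to `p` gives the full gradient.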

Some important things to note here:
- Automatic differentiation is all handled for us in the background by autograd -- we won't actually need to manually look at `.grad`'s
- By default, the `.grad` values accumulate. This is useful for pytorch to "do its thing" when we tell it to backward propagate the loss. But if we want to handle some portions of this process manually, we have to remember that gradients will accumulate unless we set them to zero. 
 - We can zero gradients of some tensor `x` with a call to `x.grad.zero_()`
 - we'll actually do this today when we manually implement linear regression.
 - More generally, we might want to zero the gradients of all the tensors in a neural network. We'll talk about this later.
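
The accumulation behavior is easy to see in a tiny experiment. The sketch below uses a throwaway tensor `v` (a name chosen only for this illustration) and calls `.backward()` more than once on the same expression.

```python
import torch

v = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(v^2)/dv = 2v = 6
(v ** 2).backward()
first = v.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(v ** 2).backward()
second = v.grad.clone()

# Zero the gradient in place, then a fresh backward pass gives 6 again
v.grad.zero_()
(v ** 2).backward()
third = v.grad.clone()

print(first, second, third)
```

The second value is double the first because the two gradients were summed, which is why a manual training loop zeroes the gradients after every update.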



In [None]:
x.grad.zero_()
print(x.grad)

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise:  Create and play with tensors</font>

1. Create three tensors, $a$, $b$ and $c$, each of which is a 1D random scalar.
2. Compute a new scalar tensor $d = 2a^2 + 3b^3 + 4c$
3. Since $d$ is a scalar we can tell pytorch to calculate the gradients involved in its calculation. Do so. Now print out the gradients of $a$, $b$, and $c$. Do they make sense? Why?



In [None]:
# Try it out


The way the autograd part of pytorch is designed, it only calculates gradient values for "leaf nodes", that is for tensors that do not depend on other tensors. This will be true for all parameters (weights and biases) of a neural network (which are the only things we need the gradients for - to figure  out which direction and how much to nudge them during training). 
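
A short sketch of the leaf-node idea, using illustrative names (`a_leaf`, `mid`, `out`) that do not appear elsewhere in this notebook. It relies on the tensor attribute `is_leaf` to tell the two kinds of tensors apart.

```python
import torch

a_leaf = torch.tensor(2.0, requires_grad=True)  # created directly: a leaf
mid = a_leaf * 3                                # computed from a_leaf: not a leaf
out = mid ** 2

out.backward()

print(a_leaf.is_leaf, mid.is_leaf)
print(a_leaf.grad)   # d(out)/d(a_leaf) = 2 * mid * 3 = 36
```

Only `a_leaf` ends up with a stored gradient; the intermediate tensor `mid` is used during back propagation, but its gradient is not kept by default.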

<hr/>


 ## How does autograd do its thing?
 - At this stage, you don't need to know... and you probably don't want to know.
 - Ok, really want to know? Have a [look at this video](https://youtu.be/MswxJw-8PvE) or [this blog post](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/) or [this one](https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95).
 - Autograd is a really clever implementation of automatic differentiation. So, naturally it is not trivial to understand.


# Moving data between types: Scalars, Numpy Arrays, and Tensors

<font size=5> Getting a scalar or a numpy array from a tensor</font>

There are some circumstances where you might want to get the value of a scalar tensor as a plain Python number rather than as a tensor.  You can do this with the `.item()` method:


In [None]:
st = torch.tensor([5.2])
st.item()

In some circumstances, we might want to get a tensor as a numpy array.  If we want to do this, we must:
-  use the `.detach()` method to get rid of any gradient part of the tensor and keep only the array part
- then we can use the `.numpy()` method to get the detached tensor as a numpy array
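
To see why the `.detach()` step is needed, the sketch below tries the conversion both ways on a throwaway tensor (`t_grad` is a name used only here).

```python
import torch

t_grad = torch.rand(4, requires_grad=True)

# Calling .numpy() directly on a tensor that requires grad raises a RuntimeError
try:
    t_grad.numpy()
    raised = False
except RuntimeError as err:
    raised = True
    print(f"RuntimeError: {err}")

# Detaching first works; the result is an ordinary numpy array
arr = t_grad.detach().numpy()
print(type(arr), arr.shape)
```

For a tensor that lives on the GPU, a `.cpu()` call would also be needed before `.numpy()`, since numpy arrays live in main memory.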

In [None]:
at = torch.rand(10, requires_grad=True)
at.detach().numpy()

<font size=5>Getting a tensor from a numpy array</font>

We can just call `torch.tensor()` on a numpy array to turn it into a tensor:

In [None]:
some_np_array = np.random.rand(10)
torch.tensor(some_np_array)
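
One detail worth checking after this conversion is the dtype: numpy creates 64-bit floats by default and `torch.tensor()` keeps that dtype, whereas tensors created by pytorch itself (such as `torch.rand`) default to 32-bit floats. The sketch below uses illustrative names and assumes pytorch's default dtype has not been changed.

```python
import numpy as np
import torch

arr64 = np.random.rand(3)                        # numpy defaults to float64
t64 = torch.tensor(arr64)                        # dtype is carried over: torch.float64
t32 = torch.tensor(arr64, dtype=torch.float32)   # ask for float32 explicitly

print(t64.dtype, t32.dtype)
print(torch.rand(3).dtype)                       # pytorch's own default is float32
```

This matters later because a matrix multiplication between a float64 tensor and a float32 tensor raises an error, so inputs and weights should share a dtype.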

<font size=5>Getting a tensor from a dataframe</font>

To show how to define tensors when starting with a dataframe, we'll use the example from last reading/lecture --  here's the scaled data from our "is it a cat" prediction problem:

In [None]:
# Run this code to load the scaled cat_df
from sklearn.preprocessing import StandardScaler

cat_df = pd.DataFrame({'is_mammal':[1,1,1,0],
                       'four_legged':[0,1,1,1],
                       'body_weight_lbs':[70.0,8800.0,9.0,0.5],
                       'body_height_inches':[36.0,324.0,9.0,1.0],
                       'has_thumbs':[1,0,0,0],
                       'animal':['chimp','elephant','cat','lizard'],
                       'is_cat':[0,0,1,0]})
scaler = StandardScaler()
scaler.fit(cat_df.loc[:,'is_mammal':'has_thumbs'])
cat_df.loc[:,'is_mammal':'has_thumbs'] = scaler.transform(cat_df.loc[:,'is_mammal':'has_thumbs'])
cat_df.head()

Recall, we wanted to use the columns `is_mammal` to `has_thumbs` as the features $X$ and the column `is_cat` as the outcome $y$. 

Remember that pandas uses numpy on the backend (for now... [this is changing](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i)), so we can get the numpy array from any dataframe or subset of its rows and columns, with the attribute `.values`. 

For example:

In [None]:
cat_df.loc[:,'is_mammal':'has_thumbs'].values # Using .values gives us a numpy array that we can pass to torch.tensor()

We can just call `torch.tensor()` on these numpy arrays to define our features (input) $X$ and outcome (output) $y$. We can accomplish this in a single line of code, like this:

In [None]:
X = torch.tensor(cat_df.loc[:,'is_mammal':'has_thumbs'].values)
X

In [None]:
y = torch.tensor(cat_df.loc[:,'is_cat'].values)
y

# Training a Single Neuron with Pytorch

Below is an example of implementing the NN training loop that we talked about in class.   

In [None]:
# Code to train single neuron "is_cat" predictor


# Load the data and scale it 
from sklearn.preprocessing import StandardScaler

cat_df = pd.DataFrame({'is_mammal':[1,1,1,0],
                       'four_legged':[0,1,1,1],
                       'body_weight_lbs':[70.0,8800.0,9.0,0.5],
                       'body_height_inches':[36.0,324.0,9.0,1.0],
                       'has_thumbs':[1,0,0,0],
                       'animal':['chimp','elephant','cat','lizard'],
                       'is_cat':[0,0,1,0]})
scaler = StandardScaler()
scaler.fit(cat_df.loc[:,'is_mammal':'has_thumbs'])
cat_df.loc[:,'is_mammal':'has_thumbs'] = scaler.transform(cat_df.loc[:,'is_mammal':'has_thumbs'])


# Make the features (inputs), X,  and outcomes (targets), y tensors from the dataframe 
X = torch.tensor(cat_df.loc[:,'is_mammal':'has_thumbs'].values)
y = torch.tensor(cat_df.loc[:,'is_cat'].values)

# Instead of returning the prediction true or false for each y, we'll return the sigmoid.
# We can think of this as a score that reflects how much the NN thinks each row is a cat
# This will be more useful in calculating the cost function
def sigmoid(x):
  sigmoid = 1/(1+torch.exp(-x))
  return sigmoid

# Initialize the weights/bias to random values
w = torch.tensor(np.random.randn(5),requires_grad=True)
b = torch.tensor(np.random.randn(1),requires_grad=True)

for epoch in range(1,1001): # run over the entire training data 1000 times
  yhat = sigmoid(X @ w + b) # calculate the predicted value. @ means matrix multiplication, as we discussed in class (before with numpy, we used np.dot(), but this is more general)
  C = ((yhat - y)**2).sum() # Calculate the cost -- a single number representing how badly our NN did across all the training data
  if epoch%100==0: # Every 100 epochs, we will print out the predicted values and the cost function
    print(f"yhat = {yhat}")
    print(f"cost: {C}")
  C.backward() # Tell pytorch to perform backward propagation, to get the gradients for w and b
  with torch.no_grad(): # This line tells pytorch not to modify the computation graph for gradients for anything in the indented block below 
    w-=w.grad*5e-2 # Update the weights by taking a tiny step in the right direction
    b-=b.grad*5e-2 # Update the bias by taking a tiny step in the right direction
    w.grad.zero_() # set the gradient values in w and b to zero, so we can calculate them fresh on the next epoch
    b.grad.zero_()



Some things to note about the above example:
* When we did this with just numpy, we just repeatedly tried different random values of the weights/bias until we got the right prediction
* With pytorch, we can calculate the gradients, so can use them to adjust the weights a tiny bit in the right direction. We used a constant learning rate of 0.05
* Notice how the cost is dropping with every epoch -- this is good! The NN is "learning"
* The final prediction is pretty good. It thinks the cat row is a cat with a score of $> 0.9$ and scores the others at $< 0.06$
* We used `yhat = sigmoid(X @ w + b)` which is equivalent to $\hat{y}_i =  \sigma ( w_0X_{i0}+w_1X_{i1}+w_2X_{i2}+w_3X_{i3}+w_4X_{i4}+b )$, where $\sigma$ is the sigmoid function. This is a more general form, which would also work if we had more than one outcome value (we will in the next example).
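
As a quick check of the equivalence in the last bullet, the sketch below computes the weighted sum both ways on made-up data. The tensors `X_demo`, `w_demo` and `b_demo` are hypothetical stand-ins with the same shapes as in the cat example's features, weights and bias (with 4 rows and 5 features).

```python
import torch

torch.manual_seed(0)
X_demo = torch.rand(4, 5)   # 4 rows, 5 features (made-up data)
w_demo = torch.rand(5)      # one weight per feature
b_demo = torch.rand(1)      # a single bias

# Vectorized form, as used in the training loop
z_matmul = X_demo @ w_demo + b_demo

# Explicit form: for each row i, sum w_j * X_ij over the features j, then add b
z_loop = torch.stack([
    sum(w_demo[j] * X_demo[i, j] for j in range(5)) + b_demo[0]
    for i in range(4)
])

print(torch.allclose(z_matmul, z_loop))

# The hand-written sigmoid matches pytorch's built-in torch.sigmoid
print(torch.allclose(torch.sigmoid(z_matmul), 1 / (1 + torch.exp(-z_matmul))))
```

The `@` version does the same arithmetic in one call, and it keeps working unchanged when `w` becomes a matrix with one column of weights per outcome.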

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise:  Build a single neuron trained to predict wine quality </font>

Following the same structure as the above example, train a single neuron to predict the quality of red wine based on many different features. Run the code below to read in the dataset, then:

<br />

1. Like in the example above, use the sklearn `StandardScaler` to scale all the feature columns. Rescale the target (outcome), which is a number from 1 to 10 representing wine quality, to be a number between 0 and 1 (this is what we want if we're going to output a sigmoid for $\hat{y}$, because sigmoids can only be between 0 and 1) 

2. Make tensors for features $X$ and target $y$. (make sure their dtypes match that of the weights and biases, which may be torch.float32)

3. Create weight and bias tensors initialized to random numbers. You'll have to figure out what size they should be.

4. You can define the same sigmoid function as above to be your "activation" function.

5. Write code for your training loop. It should do the following:
 * Compute `yhat`
 * Calculate the cost function `C` (you can use the same one as in our "is cat" example )
 * Print out the cost every 1000 loops (use an `if epoch%1000==0` codeblock to do this)
 * Call `C.backward()`
 * Use a `with torch.no_grad()` codeblock to update the weight/bias tensors, then zero the gradients for both
6. Calculate the prediction for the training data using the fully trained model (i.e., get yhat as you did in the training loop, but now with the finalized weights/biases)
7. Run the plotting code provided below to plot the performance on the training data 


In [None]:
wine_df = pd.read_csv('https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/winequality-red.csv')
wine_df.head()

In [None]:
# Try it out

# 1. Use the sklearn StandardScaler to scale all feature columns. Also scale the target column appropriately so it is between 0 and 1.


# 2. Make tensors for features  X  and target  y (make sure their dtypes match that of the weights and biases, which may be torch.float32).


# 3. Create weight and bias tensors initialized to random numbers. You'll have to figure out what size they should be.

# 4. You can define the same sigmoid function as above to be your "activation" function. Define it below


# 5. Write code for your training loop to train over 20,000 epochs. It should do the following:
#    - compute yhat
#    - Calculate the cost function C (you can use the same one as in our "is cat" example )
#    - print out the cost every 1000 loops (use an "if epoch%1000==0" codeblock to do this)
#    - call C.backward()
#    - use a "with torch.no_grad()" codeblock to update the weight/bias tensors with a learning rate of 1e-5, then zero the gradients of both


# 6. Calculate the prediction for the training data using the fully trained model (i.e., get yhat as you did in the training loop, but now with the finalized weights/biases)



In [None]:
# Run this code to plot the predictions vs the actual targets
import matplotlib.pyplot as plt

plt.plot(y,yhat.detach().numpy(),'.',alpha=0.2);
plt.xlabel('actual wine quality');
plt.ylabel('predicted wine quality');


<hr/>

# Linear Regression with Pytorch -- the manual way

Now we will illustrate how pytorch can be used to implement the main training loop in training a neural network: 
- forward computation to get the result and calculate the loss by comparing the result to the "right answer"
- backward propagation to adjust the weights to minimize the loss

We'll do this through a simple example of linear regression.

The data that we'll work with is data on Apple and Orange crop yields from different geographic regions and average data on temperature, rainfall and humidity:

In [None]:
import pandas as pd
crop_file = 'https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/crop_yield.csv'
crop_df = pd.read_csv(crop_file)
crop_df

In [None]:
inputs = torch.tensor(crop_df.loc[:,'Temp_F':'Humidity_pct'].values, dtype=torch.float32) # notice here that we explicitly made the data type a float32
inputs

In [None]:
targets = torch.tensor(crop_df.loc[:,'Apples_ton':'Oranges_ton'].values, dtype=torch.float32) # notice here that we explicitly made the data type a float32
targets

We want to make a linear "neural network" that relates the inputs to the targets. In other words we want to implement the equation:


$$
\hspace{1cm} Y\hspace{1cm}=\hspace{1.cm}X \hspace{2.1cm} \times \hspace{1.cm} W^T \hspace{1.cm}  + \hspace{1cm} b \hspace{1cm}
$$

$$
\left[ \begin{array}{cc}
y_{11} & y_{21} \\
y_{12} & y_{22} \\
\vdots & \vdots \\
y_{1N} & y_{2N}
\end{array} \right]
%
=
%
\left[ \begin{array}{ccc}
73 & 67 & 43 \\
91 & 88 & 64 \\
\vdots & \vdots & \vdots \\
69 & 96 & 70
\end{array} \right]
%
\times
%
\left[ \begin{array}{cc}
w_{11} & w_{21} \\
w_{12} & w_{22} \\
w_{13} & w_{23}
\end{array} \right]
%
+
%
\left[ \begin{array}{cc}
b_{1} & b_{2} \\
b_{1} & b_{2} \\
\vdots & \vdots \\
b_{1} & b_{2} \\
\end{array} \right]
$$

* Notice that in this example, we have two outcomes (targets) we are trying to predict, the yield of apples and the yield of oranges. Each outcome (target) will weight each of the features and have its own bias, and we have three features (temp, rainfall, humidity), so we have 6 weights and 2 biases.
* Also notice that since we are doing linear regression, we don't have a nonlinear activation function at all.

We'll start with some random weights and biases -- this is typically how we "initialize" the parameters of a neural network -- set them to some random values to start:

In [None]:
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)
print(w)
print(b)
print(w.dtype) # the dtype of the weights and biases has to match the dtype of the inputs

We'll define our model according to the above equation 

In [None]:
# Define the model
def model(x):
    return x @ w.t() + b  #note that @ is the matrix multiplication operator in numpy or pytorch

We took advantage of broadcasting in the above (since we only have two biases). We also took the transpose of the weight matrix with `.t()` (so that the dimensions match to permit a matrix multiplication). 
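
The shapes involved can be checked with a small sketch. The tensors below are random stand-ins (`x_demo`, `w_demo`, `b_demo` are illustrative names) with 5 rows of data, 3 features and 2 outputs.

```python
import torch

x_demo = torch.rand(5, 3)   # 5 rows of data, 3 features
w_demo = torch.randn(2, 3)  # 2 outputs, 3 weights each
b_demo = torch.randn(2)     # one bias per output

print(w_demo.t().shape)      # transposed weights: 3 rows, 2 columns
out = x_demo @ w_demo.t()    # (5, 3) @ (3, 2) -> (5, 2)
print(out.shape)

# b_demo has shape (2,); broadcasting adds it to each of the 5 rows
shifted = out + b_demo
print(shifted.shape)
print(torch.allclose(shifted[0], out[0] + b_demo))
```

Without the transpose, `x_demo @ w_demo` would try to multiply a (5, 3) matrix by a (2, 3) matrix, and the inner dimensions would not match.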

Our model is ready to generate predictions -- although, they won't be very good ones yet because we haven't trained it:

In [None]:
# Generate predictions
preds = model(inputs)
print(preds)

In [None]:
# Compare with targets
print(targets)

The next step is to define a loss (cost) function, that describes how (poorly) our predictions match the targets. We'll use the mean squared error, since we are implementing linear regression. Different circumstances would require a potentially different loss function.

In [None]:
# MSE loss
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

In [None]:
# Compute loss
loss = mse(preds, targets)
print(loss)

We can now calculate the gradients of the loss with respect to the weights and biases:

In [None]:
# Compute gradients
loss.backward()

In [None]:
# Gradients for weights
print(w)
print(w.grad)

In [None]:
# Gradients for bias
print(b)
print(b.grad)

We know about derivatives and optimization from calculus:
![](https://drive.google.com/uc?id=1ywzITKPARYGqM8eS_ha7OFbD4g0wF7fg)

If the gradient is positive: 
- increasing the variable will increase the loss 
- decreasing the variable will decrease the loss 

If the gradient is negative:
- increasing the variable will decrease the loss 
- decreasing the variable will increase the loss 
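
These two cases can be checked numerically. The sketch below uses a made-up one-variable loss, $(w-3)^2$, which is not the crop-yield loss; it evaluates the gradient on either side of the minimum and then nudges $w$ upward to see which way the loss moves.

```python
import torch

def toy_loss(w):
    return (w - 3.0) ** 2   # smallest when w = 3

results = []
for start in (5.0, 1.0):
    w_toy = torch.tensor(start, requires_grad=True)
    loss_toy = toy_loss(w_toy)
    loss_toy.backward()                      # gradient is 2 * (w - 3)
    grad = w_toy.grad.item()

    # Nudge w slightly upward and check which way the loss moves
    with torch.no_grad():
        bumped = toy_loss(w_toy + 0.01).item()
    direction = "increases" if bumped > loss_toy.item() else "decreases"
    results.append((grad, direction))
    print(f"w = {start}: gradient = {grad:+.1f}, so increasing w {direction} the loss")
```

In both cases, moving $w$ in the direction opposite to the sign of the gradient lowers the loss.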


For now, we'll zero the gradients for our `w` and `b` tensors, as we'll want to start the training process with no gradients accumulated:


In [None]:
w.grad.zero_()
b.grad.zero_()

We'll use gradient descent as the rule for adjusting our weights and biases:
```
w -= w.grad * 1e-5
b -= b.grad * 1e-5
```
This is in accordance with the intuition we developed above. We chose to multiply the gradient by a small number (which adjusts how big of a step we take). This is called the "learning rate".
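
To get a feel for why the step size matters, here is a toy sketch (separate from the crop-yield model) that minimizes $f(w) = w^2$ with two different learning rates. The helper `run_descent` and its parameters are hypothetical names introduced just for this illustration.

```python
import torch

def run_descent(lr, steps=20):
    """Minimize f(w) = w^2 starting from w = 1 and return the final value of w."""
    w_toy = torch.tensor(1.0, requires_grad=True)
    for _ in range(steps):
        loss_toy = w_toy ** 2
        loss_toy.backward()
        with torch.no_grad():
            w_toy -= lr * w_toy.grad   # gradient descent step
            w_toy.grad.zero_()         # reset for the next iteration
    return w_toy.item()

small = run_descent(lr=0.1)   # each step multiplies w by (1 - 2*0.1) = 0.8
large = run_descent(lr=1.1)   # each step multiplies w by (1 - 2*1.1) = -1.2

print(f"lr=0.1 -> w = {small:.4f}")
print(f"lr=1.1 -> w = {large:.4f}")
```

With the smaller rate, $w$ shrinks toward the minimum at 0; with the larger rate, every step overshoots and $w$ grows in magnitude instead. How small the rate needs to be depends on the scale of the data, which is why unscaled inputs call for a very small learning rate.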

Now we are ready to implement our training loop. We will pass over our data several times to keep adjusting the weights and biases.  Each pass is called an *epoch*:

In [None]:
# Here I repeated all of the code to load, prepare the data, define the loss and model functions, for ease of running everything together

# Load the dataframe
crop_file = 'https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/crop_yield.csv'
crop_df = pd.read_csv(crop_file)

# Make inputs (features) and the targets (outcomes) tensors
inputs = torch.tensor(crop_df.loc[:,'Temp_F':'Humidity_pct'].values, dtype=torch.float32)
targets = torch.tensor(crop_df.loc[:,'Apples_ton':'Oranges_ton'].values, dtype=torch.float32) 

# init weights/biases
w = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# define the loss (cost) function
def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

# define the model
def model(x):
    return x @ w.t() + b  

# Train for 100 epochs
losses = []
num_epochs = 100
for epoch in range(num_epochs):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    losses.append(loss.item()) # Here we extract the loss (which is a scalar) and store it in a list, so we can plot loss vs epochs later
    with torch.no_grad(): # operations in this block are not tracked by autograd -- we already have the values of the gradient from the loss.backward() call
        w -= w.grad * 1e-5 # same learning rate as in the update rule above
        b -= b.grad * 1e-5
        w.grad.zero_() # zero the weight gradient after we adjusted the weight
        b.grad.zero_() # zero the bias gradient after we adjusted the bias

Ok, let's see how we did:

In [None]:
import matplotlib.pyplot as plt
plt.plot(losses);
plt.xlabel('Epoch');
plt.ylabel('Loss');

Notice how the loss decreases with epochs and slowly flattens out. This indicates that further training over the data will not do much good.

In [None]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

In [None]:
# Plot predictions vs targets
#  In order to get from tensors to numpy arrays that we can plot, we'll have to:
#     - call .detach() to return a tensor with just the data, detached from the gradient computation
#     - call .numpy() to return a numpy array, since matplotlib doesn't know how to plot tensors
plt.plot(targets.detach().numpy(),preds.detach().numpy(),'.');
print(targets.detach().numpy().flatten())
print(preds.detach().numpy().flatten())
plt.xlabel('targets');
plt.ylabel('predictions');

Not too shabby. Of course, the real test would be to apply this NN to unseen test data.

Now we've seen how to implement the main training loop in pytorch. But we did almost everything else manually.

Let's see what it would look like if we did the same thing but now using pytorch's built-in features.

# Linear Regression with Pytorch -- the right way

Instead of doing everything ourselves, we are going to take advantage of the many tools and methods that pytorch has built into it. This includes:
- Dataset and DataLoader objects
- Neural network layers
- Optimizers
- Various predefined loss functions

## Dataset and DataLoader

First we need to import some stuff (we don't *have* to import these into our namespace, since they can be accessed from `torch.`, but it makes the calls shorter):

In [None]:
# Import tensor dataset & data loader
from torch.utils.data import TensorDataset, DataLoader

We can use the `TensorDataset` class to handle our data. One of the things this does is it lets us grab a row from the inputs and target as a tuple: 

In [None]:
# Define dataset
train_ds = TensorDataset(inputs, targets)

In [None]:
# Now we can get rows:
train_ds[0:3]

Next, we will use pytorch's `DataLoader`, which can split the data into batches when we train (and even shuffle or resample, if we want). This is useful because in many cases datasets are too big to process all at once. Shuffling data when training is almost always a good idea, and resampling is useful if our data is imbalanced in some way.

We'll define the `batch_size` when we create the DataLoader. This is just the number of rows of the data that the DataLoader hands us at a time. 

Note: If we update parameters before passing over the entire dataset, then we refer to each batch as a "mini-batch". This is what happens in **Stochastic Gradient Descent**. We could also process in batches and only update after passing through the entire dataset. In this case, these are called "batches" and we are using **Gradient Descent**. (One reason to batch when doing Gradient Descent is that the full dataset might not fit into memory all at once.) You may see the terms "batch" and "mini-batch" used interchangeably (and sometimes incorrectly), but these are the technically correct definitions of the terms.
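
To make the effect of `batch_size` concrete, here is a minimal sketch that counts how many batches a `DataLoader` produces per pass over the data. The toy tensors and the names `toy_ds` and `batch_counts` are made up for illustration and are not the notebook's data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 10 rows, 3 input features, 2 targets
toy_inputs = torch.randn(10, 3)
toy_targets = torch.randn(10, 2)
toy_ds = TensorDataset(toy_inputs, toy_targets)

batch_counts = {}
for bs in (1, 4, len(toy_ds)):
    dl = DataLoader(toy_ds, batch_size=bs, shuffle=True)
    batch_counts[bs] = len(dl)  # number of batches in one pass over the data
    print(f"batch_size={bs}: {len(dl)} batch(es) per epoch")
```

With 10 rows, a batch size of 4 gives three batches (the last one holds the 2 leftover rows), and a batch size equal to the dataset length gives exactly one batch per epoch.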

Here we'll set the batch size to be the length of the dataset (so that we only update parameters after processing all of the data).

In [None]:
# Define data loader
batch_size = len(train_ds)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

In [None]:
#We can see how shuffle works by running over this two times
for i in range(0,2):
  for xb,yb in train_dl:
    print(xb,yb)

# Now change the batch size in the prior cell and rerun both to see what's going on
# Just be sure to set it back to len(train_ds) and rerun this before continuing

In [None]:
next(iter(train_dl)) # this will return 5 (batch_size) rows of inputs and targets after shuffling. 
# iter() defines an iterator and next() tells python to get the next iteration from it.

We'll see why being able to change the order and group the data into different sized batches is important when we talk about the optimizer.

## Use a predefined linear layer -- nn.Linear()

Instead of:
- defining weights and biases tensors manually
- defining the linear function that uses weights and biases to relate input to output via a matrix multiplication
- initializing weights and biases to random values

We can just use a layer that already does all that: ``torch.nn.Linear()``

In [None]:
model = torch.nn.Linear(3,2) # 3 input (features), 2 outputs
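
As a quick check of what the layer does, the sketch below compares the output of a `torch.nn.Linear` layer with the matrix multiplication written out by hand. The names `layer`, `x` and `manual` are made up for this illustration, and the input is random toy data.

```python
import torch

torch.manual_seed(0)  # only so the random numbers are repeatable
layer = torch.nn.Linear(3, 2)  # 3 input features, 2 outputs
x = torch.randn(5, 3)          # hypothetical batch of 5 rows

# The layer stores a weight of shape (out_features, in_features) and a bias of shape (out_features,)
print(layer.weight.shape, layer.bias.shape)

# Calling the layer is the same as x @ W^T + b
manual = x @ layer.weight.T + layer.bias
print(torch.allclose(layer(x), manual))
```

Note that the weights and biases are already initialized to random values and already have `requires_grad=True`, so there is nothing else to set up.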

## Optimizer

Instead of manually implementing Gradient Descent to adjust weights and biases according to the gradient and some learning rate, we can use one of Pytorch's built-in optimizers.

Here we'll use the optimizer for <font color=blue>Stochastic Gradient Descent (SGD) </font>, but set it so that we're just doing regular <font color=blue>Gradient Descent</font>. The difference between the two is that in SGD, you adjust your weights and biases after running each batch through the network. It's **stochastic** because the data loader shuffles the data, so the order in which the data is seen varies from run to run. In pytorch, all of these versions (plus some variants) are implemented with `torch.optim.SGD()`. Have a look at the [documentation](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD).


In [None]:
# Define optimizer
opt = torch.optim.SGD(model.parameters(), lr=1e-5) # lr is the learning rate
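
To connect the optimizer back to the manual update we wrote earlier, here is a small sketch showing that one call to `step()` on a plain `torch.optim.SGD` optimizer (no momentum or weight decay) subtracts the learning rate times the gradient. The parameter `p`, the toy loss and the name `demo_opt` are made up for this illustration.

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0, 2.0]))  # hypothetical parameter
demo_opt = torch.optim.SGD([p], lr=0.1)

loss = (p ** 2).sum()  # toy loss whose gradient is 2*p
loss.backward()

before = p.detach().clone()
grad = p.grad.clone()
demo_opt.step()  # the optimizer adjusts p in place

expected = before - 0.1 * grad  # the manual update rule
print(p.detach(), expected)
```

This is the same arithmetic as `w -= w.grad * lr` inside `torch.no_grad()`, just handled for every parameter we passed to the optimizer.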

## Loss function

Instead of using our manually defined `mse()` function, we'll use a predefined one.

In [None]:
loss_fn = torch.nn.functional.mse_loss

So, for example, we could compute the loss:

In [None]:
loss = loss_fn(model(inputs),targets)
print(loss)
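
For intuition about what the predefined function computes, the sketch below checks `mse_loss` against the mean of the squared differences written out by hand. The small tensors are made up for illustration, and the comparison assumes a hand-written `mse()` that averages over all elements.

```python
import torch
import torch.nn.functional as F

# Hypothetical predictions and targets
demo_preds = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
demo_targets = torch.tensor([[0.0, 2.0], [3.0, 6.0]])

builtin = F.mse_loss(demo_preds, demo_targets)
manual = ((demo_preds - demo_targets) ** 2).mean()  # mean of squared differences over all elements

print(builtin, manual)
```

The squared differences here are 1, 0, 0 and 4, so both versions give 1.25.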

## Define a fit function

Finally we will define a function to fit our model by putting all of the above pieces together:

In [None]:
# Define a utility function to train the model, here with regular gradient descent
def fit_gd(num_epochs, model, train_dl, loss_fn, opt):
    losses = []
    model.train() # put the model into training mode (in case it is already in evaluation mode) - we'll talk about this later.
    for epoch in range(num_epochs):
      loss_for_epoch = 0
      opt.zero_grad() # tell the optimizer to zero the gradient so we start fresh on this epoch
      for xb,yb in train_dl: # sample a batch of inputs,targets
        pred = model(xb) # get the predictions
        loss = loss_fn(pred, yb) # calculate the loss for these predictions
        loss_for_epoch += loss.item() # add loss for this batch to the running total of the loss for this epoch
        loss.backward() # propagate the loss backward
      losses.append(loss_for_epoch)
      opt.step() # take one optimizer step: this is where the weights/biases are adjusted (opt already knows the model parameters and learning rate because we passed them when we created it)
    return losses 

There's a lot going on in the fit function above. Try to read it over carefully and make sure you understand it. Some things to notice:
* There is a loop over epochs -- we will pass over the data many times to train our NN
* For each epoch, we pull batches of `xb,yb` from the training data and process each batch one at a time
* Notice where `opt.step()` occurs, because this is where the weights/biases are adjusted
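
One detail that makes the placement of `opt.zero_grad()` and `opt.step()` matter is that `backward()` adds to the existing gradient instead of replacing it. Here is a minimal sketch of that behavior with a made-up one-element tensor `w_demo`:

```python
import torch

w_demo = torch.tensor([1.0], requires_grad=True)  # hypothetical parameter

(3 * w_demo).sum().backward()
first = w_demo.grad.clone()   # gradient after one backward pass

(3 * w_demo).sum().backward()
second = w_demo.grad.clone()  # the second backward pass was added on top of the first

w_demo.grad.zero_()           # reset, which is what opt.zero_grad() does for every parameter
print(first, second, w_demo.grad)
```

So in a fit function, every `loss.backward()` call between one reset of the gradient and the next `opt.step()` contributes to the same update.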


## Fit the model

Now we can train the model by running our fit function:

In [None]:
# Train the model for 100 epochs
losses = fit_gd(100, model, train_dl, loss_fn, opt)

In [None]:
# Get the final predictions (after we are done training)
preds = model(inputs)

# Plot predictions vs targets
plt.plot(targets.detach().numpy(),preds.detach().numpy(),'.');
plt.xlabel('targets');
plt.ylabel('predictions');

One important thing that we DID NOT DO in the two examples above is split the data into training and test sets. We could have accomplished this using:
```
train_ds,test_ds = torch.utils.data.random_split(full_ds,[len_train, len_test])
```

But we didn't do this because this was just for illustration purposes and the dataset we are using is pretty small.
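
For reference, here is a minimal runnable sketch of such a split on made-up toy data. The 8/2 split sizes, the seed and the names `split_train_ds` and `split_test_ds` are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical toy dataset: 10 rows, 3 input features, 2 targets
full_ds = TensorDataset(torch.randn(10, 3), torch.randn(10, 2))

len_train = 8
len_test = len(full_ds) - len_train

# Passing a seeded generator makes the random split repeatable
split_train_ds, split_test_ds = random_split(
    full_ds, [len_train, len_test], generator=torch.Generator().manual_seed(0)
)
print(len(split_train_ds), len(split_test_ds))
```

Each piece can then be wrapped in its own `DataLoader`, so the model trains on one and is evaluated on the other.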

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise:  Fit Function for Stochastic Gradient Descent  </font>

The fit function provided above performs Gradient Descent. However, with just a few very small changes you can adapt this to be a fit function that uses Stochastic Gradient Descent (SGD). Write your code for an SGD fit function and call it `fit_sgd`:  




In [None]:
# Try it out

<hr/>

# Saving and Loading pytorch models

You will want a way to save the models that you have trained and load them again later. Pytorch has some tools to do this, using `torch.save()` and `torch.load()`.

There are a few approaches:
1. Save the entire model with `torch.save(model,path)` and then later load it using `model = torch.load(path)`
 - This pickles the whole model object. Pickle does not store the model class itself, only a reference to the file where the class is defined, so the save file is tied to the exact class and directory structure of the code that was used when saving. You would only be able to load it if that same code is available in the same place (even if you copied the save file to another system). So this way is not recommended. See the [pytorch docs](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for more info.
2. Save only the model's state_dictionary with `torch.save(model.state_dict(),path)`. When you want to load the model, you must then do the following:

  ```python
  model = Model() # make an instance of the model
  model.load_state_dict(torch.load(path))
  ```

3. Save a checkpoint that captures the model's state_dictionary and other state variables, such as the optimizer's state_dictionary, current epoch, and so on. This can be done by passing your own dictionary to `torch.save()` like this:
 
 ```python 
 torch.save({'model_state_dict': model.state_dict(),
            'optimizer_state_dict': opt.state_dict(),
            'epoch': epoch,
            ...
            },
            path)
 ```
 
 Loading the model is then accomplished like this:
 
 ```python
 model = Model() # make an instance of the model
 checkpoint = torch.load(path)
 model.load_state_dict(checkpoint['model_state_dict'])
 ...
 opt.load_state_dict(checkpoint['optimizer_state_dict']) # the key must match the one used when saving
 ...
 epoch = checkpoint['epoch']
 ```

The 3rd way is the most bulletproof approach and the most configurable (you can add anything you want to the dictionary that you save -- including some text description). Unfortunately there is no "standard file extension" that is conventionally used, though it is common to give torch saves the `.pt` or `.pth` file extension.  

The consequence of this is: As long as you are consistent, you won't have issues saving and loading. But you may have trouble loading models that were saved by others, so be aware of this.
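
To see the checkpoint approach work end to end, here is a minimal sketch of a save/load round trip. It uses a small `torch.nn.Linear` layer as a stand-in for your model and writes to a temporary directory; the names `demo_model`, `demo_opt`, `restored_model`, `restored_opt` and the epoch value are made up for illustration.

```python
import os
import tempfile
import torch

# Stand-in model and optimizer (hypothetical; use your own model in practice)
demo_model = torch.nn.Linear(3, 2)
demo_opt = torch.optim.SGD(demo_model.parameters(), lr=1e-5)

# Save a checkpoint dictionary to a temporary file
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save({"model_state_dict": demo_model.state_dict(),
            "optimizer_state_dict": demo_opt.state_dict(),
            "epoch": 100},
           path)

# To load, first create fresh instances, then fill them from the checkpoint
restored_model = torch.nn.Linear(3, 2)
restored_opt = torch.optim.SGD(restored_model.parameters(), lr=1e-5)
checkpoint = torch.load(path)
restored_model.load_state_dict(checkpoint["model_state_dict"])
restored_opt.load_state_dict(checkpoint["optimizer_state_dict"])
epoch = checkpoint["epoch"]

print(torch.equal(demo_model.weight, restored_model.weight), epoch)
```

The freshly created model starts with different random weights, so matching weights after `load_state_dict()` confirm that the saved values were restored.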

**One thing to note: colab won't save your files if your session disconnects, so you should download them after you save them.**

# Feedback
What did you think about this notebook? What questions do you have? Were any parts confusing? Write your thoughts in the text box below.

<font size =2> note: You can double click this text box in colab to edit it.</font>

PUT YOUR THOUGHTS HERE

# Submit
Don't forget to submit your notebook before class! Make sure you have saved your work (**Colab Menu: File-> Save**) and then download a pure python copy (**Colab Menu: File-> Download -> Download .py**) and a python notebook copy (**Colab Menu: File-> Download -> Download .ipynb**). You will upload both of these to the assignment on the canvas page.
