<img src="img/dsci572_header.png" width="600">

# Lecture 4: Introduction to PyTorch & Neural Networks



<br><br><br>

## Lecture Learning Objectives


- Describe the difference between NumPy and PyTorch arrays (`np.ndarray` vs. `torch.Tensor`)
- Explain fundamental concepts of neural networks such as layers, nodes, activation functions, etc.
- Create a simple neural network in PyTorch for regression or classification

<br><br><br>

## Imports


In [1]:
import sys
import numpy as np
import pandas as pd
import torch
from torchsummary import summary
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import make_regression, make_circles, make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from utils.plotting import *

## Introduction


PyTorch is a Python-based tool for scientific computing. In general, it does two main things:

- Provides an n-dimensional array object similar to NumPy's, with the difference that it can also be manipulated on GPUs

- Computes gradients (through automatic differentiation)

<br><br><br>

## PyTorch's Tensor


- In PyTorch a tensor is just like NumPy's `ndarray` that we have become so familiar with.

- A key difference between PyTorch's `torch.Tensor` and Numpy's `np.array` is that `torch.Tensor` was constructed to integrate with GPUs and PyTorch's computational graphs

### `ndarray` vs `tensor`

- Creating and working with tensors is much the same as with Numpy `ndarrays`

- You can create a tensor with `torch.tensor()` in various ways:

In [2]:
a = torch.tensor([1, 2, 3.])

In [3]:
b = torch.tensor([1, 2, 3])

Let's see the datatype of each tensor:

In [4]:
for _ in [a, b]:
    print(f"{_}, dtype: {_.dtype}")

tensor([1., 2., 3.]), dtype: torch.float32
tensor([1, 2, 3]), dtype: torch.int64


<br><br><br>

- PyTorch comes with most of the `Numpy` functions we're already familiar with:

In [5]:
torch.zeros(2, 2)  # zeroes

tensor([[0., 0.],
        [0., 0.]])

In [6]:
torch.ones(2, 2)  # ones

tensor([[1., 1.],
        [1., 1.]])

In [7]:
torch.randn(3, 2)  # random normal

tensor([[-0.2291,  1.7156],
        [ 3.5932,  0.5267],
        [ 0.6965,  0.8538]])

In [8]:
torch.rand(2, 3, 2)  # rand uniform

tensor([[[0.8340, 0.3935],
         [0.8448, 0.8823],
         [0.4909, 0.0952]],

        [[0.6257, 0.0039],
         [0.5839, 0.2183],
         [0.7205, 0.5707]]])

- Just like in NumPy we can look at the shape of a tensor with the `.shape` attribute:

In [9]:
x = torch.rand(2, 3, 2, 2)
x.shape

torch.Size([2, 3, 2, 2])

In [10]:
x.ndim

4

<br><br><br>

### Tensors and Data Types

- Different dtypes have different memory and computational implications (see the end of [Lecture 1](lecture1_floating-point.ipynb))

- In PyTorch we'll be building networks that require thousands, millions, or even billions of floating point calculations
- In such cases, using a smaller dtype like `float32` can significantly speed up computations and reduce memory requirements
- The **default float dtype in PyTorch is `float32`**, as opposed to NumPy's `float64`
- In fact, some operations in PyTorch will raise an error if the dtypes of their inputs don't match, for example when you pass `float64` data to a layer whose weights are `float32`

In [11]:
torch.tensor([1., 2]).dtype

torch.float32

In [12]:
print(np.array([3.14159]).dtype)
print(torch.tensor([3.14159]).dtype)

float64
torch.float32


- But just like in Numpy, you can always specify the particular dtype you want using the `dtype` argument:

In [13]:
print(torch.tensor([3.14159], dtype=torch.float64).dtype)

torch.float64
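
To make the memory point concrete, here is a small sketch that compares the storage used by the same values in `float32` and `float64`, and triggers the kind of dtype error mentioned above. The tensor size of 1,000 and the names `x32`, `x64`, and `layer` are arbitrary choices for illustration:

```python
import torch

# The same 1,000 values stored with two different float dtypes
x32 = torch.ones(1000, dtype=torch.float32)
x64 = torch.ones(1000, dtype=torch.float64)

# element_size() is the number of bytes per element, nelement() the number of elements
bytes32 = x32.element_size() * x32.nelement()
bytes64 = x64.element_size() * x64.nelement()
print(bytes32, bytes64)  # 4000 8000

# A layer's weights are float32 by default, so float64 input does not match
layer = torch.nn.Linear(1, 1)
try:
    layer(torch.ones(1, 1, dtype=torch.float64))
except RuntimeError as err:
    print(f"dtype mismatch: {err}")
```

The usual fix is to convert the data once, up front, with `dtype=torch.float32` rather than converting the model.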


<br><br><br>

### Operations on Tensors

- Tensors operate just like `ndarrays` and have a variety of familiar methods that can be called off them:

In [14]:
a = torch.rand(1, 3)
b = torch.rand(3, 1)

In [15]:
a

tensor([[0.7455, 0.2166, 0.4254]])

In [16]:
b

tensor([[0.9217],
        [0.9894],
        [0.5510]])

In [17]:
a + b  # broadcasting between a 1 x 3 and 3 x 1 tensor

tensor([[1.6672, 1.1383, 1.3471],
        [1.7349, 1.2060, 1.4148],
        [1.2965, 0.7676, 0.9764]])

In [18]:
a * b  # element-wise multiplication

tensor([[0.6871, 0.1996, 0.3921],
        [0.7376, 0.2143, 0.4209],
        [0.4108, 0.1193, 0.2344]])

In [19]:
a.mean()

tensor(0.4625)

In [20]:
a.sum()

tensor(1.3875)
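
Because the cells above use random values, the shapes are easier to follow with fixed numbers. The following sketch (values chosen only for illustration) contrasts broadcasting with matrix multiplication on the same `1 x 3` and `3 x 1` shapes:

```python
import torch

a = torch.tensor([[1., 2., 3.]])         # shape (1, 3)
b = torch.tensor([[10.], [20.], [30.]])  # shape (3, 1)

summed = a + b   # broadcasting: both operands are stretched to shape (3, 3)
product = a @ b  # matrix multiplication: (1, 3) @ (3, 1) -> (1, 1)

print(summed.shape)  # torch.Size([3, 3])
print(product)       # tensor([[140.]])
```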

<br><br><br>

### Indexing

- Once again, same as Numpy

In [21]:
X = torch.rand(5, 2)
print(X)

tensor([[0.9728, 0.1372],
        [0.0764, 0.2492],
        [0.8003, 0.8137],
        [0.6015, 0.9586],
        [0.6350, 0.9977]])


In [22]:
print(X[0, :])
print(X[0])
print(X[:, 0])

tensor([0.9728, 0.1372])
tensor([0.9728, 0.1372])
tensor([0.9728, 0.0764, 0.8003, 0.6015, 0.6350])


<br><br><br>

### Numpy Bridge

- Sometimes we might want to convert a tensor back to a NumPy array

- We can do that using the `.numpy()` method

In [23]:
X = torch.rand(3,3)
print(type(X))
X_numpy = X.numpy()
print(type(X_numpy))

<class 'torch.Tensor'>
<class 'numpy.ndarray'>
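
One detail worth knowing about this bridge: for tensors on the CPU, `.numpy()` and `torch.from_numpy()` share memory with the original object instead of copying it, whereas `torch.tensor()` makes a copy. A minimal sketch, with variable names picked just for this example:

```python
import numpy as np
import torch

t = torch.zeros(3)
n = t.numpy()   # shares memory with t (CPU tensors only)
t[0] = 1.0      # an in-place change to the tensor...
print(n)        # ...shows up in the array: [1. 0. 0.]

arr = np.array([1.0, 2.0])
t2 = torch.from_numpy(arr)  # shares memory with arr; dtype stays float64
t3 = torch.tensor(arr)      # independent copy
arr[0] = 99.0
print(t2[0].item(), t3[0].item())  # 99.0 1.0
```

If you need an independent array, call `.copy()` on the result of `.numpy()`.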


<br><br><br>

### Using GPU with PyTorch

- A GPU is a graphics processing unit (as opposed to a CPU: central processing unit)

- GPUs were originally developed for gaming. They are very fast at performing operations on large amounts of data by performing them in parallel (think about updating the value of all pixels on a screen very quickly as a player moves around in a game)

- More recently, GPUs have been adapted for more general purpose programming

- Neural networks can typically be broken into smaller computations that can be performed in parallel on a GPU

- PyTorch is tightly integrated with [CUDA](https://en.wikipedia.org/wiki/CUDA) (Compute Unified Device Architecture), a software layer developed by Nvidia that facilitates interactions with an Nvidia GPU (if you have one)

- You can check if you have a CUDA GPU:

In [24]:
torch.cuda.is_available()

False

- In May 2022, PyTorch also announced GPU-accelerated PyTorch training on Mac (see [here](https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/)) using [Apple’s Metal Performance Shaders (MPS)](https://developer.apple.com/metal/).

- If you're using a Mac equipped with **Apple silicon** (M1 or M2), you can benefit from its GPU cores to train your PyTorch models:

In [25]:
torch.backends.mps.is_available()

True

- When training on a machine that has a GPU, you need to tell PyTorch you want to use it

- You'll see the following at the top of most PyTorch code:

In [26]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
print(device)

mps


- You can then use the `device` argument when creating tensors to specify whether you wish to use a CPU or GPU

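- Or if you want to move a tensor between the CPU and GPU, you can use the `.to()` method. Note that `.to()` returns a new tensor rather than modifying the original in place, so `X` stays on its original device unless you reassign it: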

In [27]:
X = torch.rand(2, 2, 2, device=device)
X.device

device(type='mps', index=0)

In [28]:
X.to('cpu')

tensor([[[0.8325, 0.3972],
         [0.8938, 0.8426]],

        [[0.0672, 0.4259],
         [0.3225, 0.5180]]])

In [29]:
X.device

device(type='mps', index=0)

In [30]:
# X.to('cuda')  # will give me an error as I don't have a CUDA GPU
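
Putting the device checks together, a device-agnostic setup might look like the sketch below. The fallback order (CUDA, then MPS, then CPU) is my own choice here, and the `getattr` guard is only there because older PyTorch versions may not have `torch.backends.mps`:

```python
import torch

# Pick the best available device: CUDA GPU, then Apple's MPS, then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

X = torch.rand(2, 2)   # created on the CPU by default
X_dev = X.to(device)   # .to() returns a tensor on `device`; X itself is unchanged

print(X.device.type, X_dev.device.type)
```

The same code runs on any machine, which is why you'll usually see the result of `.to()` assigned to a variable rather than called on its own.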

<br><br><br>

### A Quick Peek at Automatic Differentiation

Consider the following Numpy array:

In [31]:
x = np.array(2.5)

Let's do some mathematical operations on it and save the result in another variable:

In [32]:
y = x**2 + x
y

8.75

If I am only given `y`, there is no way to trace back what kind of operations were done on `x` that led to the results stored in `y`.

This is exactly where PyTorch tensors distinguish themselves from Numpy arrays: **PyTorch tensors are able to track computations in which they participate!**

<br><br><br>

But what is the purpose of tracking computations?

<br><br><br>

The answer is: 

If we know exactly **what kind of operations** have been performed on $x$ to give $y$,

then we will be able to **compute the derivative** of $y$ with respect to $x$; that is, $dy/dx$.

<br>

But how?

<br><br><br>

We can say $y$ is constructed this way:

1. Identity: $f_1(a) = a$

1. Raising to power of 2: $ f_2(a) = a^2$

1. Addition of two terms: $ f_3(a, b) = a + b $

Therefore, $y$ can be expressed as the result of sequential applications of the above functions ($f_1$, $f_2$, and $f_3$):

$$
y = \quad f_3(\quad f_2(x),\quad f_1(x)\quad)
$$

<br><br><br>

We can now use the **chain rule** for computing the derivative $dy/dx$:

$$
\begin{align*}
\frac{dy}{dx} &= \frac{dy}{df_3} \frac{df_3}{df_2} \frac{df_2}{dx} + \frac{dy}{df_3} \frac{df_3}{df_1} \frac{df_1}{dx}\\
\\
\frac{dy}{dx} &= 1 \cdot 1 \cdot 2x + 1 \cdot 1 \cdot 1 = 2x + 1
\end{align*}
$$

<br><br><br>

In summary,

- We break down any function of $x$ to the most elementary computations/transformations

- We know the derivatives of those elementary computations/transformations

- We can use the **chain rule** to relate

  - the derivative of the elementary computations/transformations **to** the overall derivative of the function

<br><br><br>

This is the whole idea of **automatic differentiation**.

If we can track what operations have been performed on a variable (could be a scalar, vector, matrix, or tensor), we can compute derivatives of the function at the end w.r.t this variable.
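
To see the bookkeeping spelled out, here is a hand-rolled sketch of the same chain-rule computation in plain Python. It is not how PyTorch is implemented, just the idea: store the local derivative of each elementary function and multiply along each path. The names `u1`, `u2`, and `numeric` are made up for this example:

```python
def f1(a):
    return a          # identity

def f2(a):
    return a ** 2     # raising to the power of 2

def f3(a, b):
    return a + b      # addition of two terms

x = 2.5

# Forward pass: y = f3(f2(x), f1(x))
u1, u2 = f1(x), f2(x)
y = f3(u2, u1)

# Local derivatives of each elementary function
du1_dx = 1.0       # d f1 / dx
du2_dx = 2 * x     # d f2 / dx
dy_du1 = 1.0       # d f3 / d f1
dy_du2 = 1.0       # d f3 / d f2

# Chain rule: sum over the two paths from x to y
dy_dx = dy_du2 * du2_dx + dy_du1 * du1_dx
print(y, dy_dx)  # 8.75 6.0

# Sanity check with a central finite difference
h = 1e-6
numeric = (f3(f2(x + h), f1(x + h)) - f3(f2(x - h), f1(x - h))) / (2 * h)
```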

<br><br><br>

How is this useful in neural networks?

We'll learn in a bit that neural networks are nothing but a series of simple linear operations, combined with intermediate non-linear operations applied on the inputs.

But how can we make use of this **computation tracking** for our variables? I'll show you in the following example.

<br><br><br>

#### Computation tracking in PyTorch

Let's go back to our original scalar `x`, but this time, I define it as a PyTorch tensor so that I can benefit from computation tracking:

In [33]:
x = torch.tensor(2.5)
x

tensor(2.5000)

By default, computations are not tracked and gradients can't be computed w.r.t `x`.

But `torch.tensor()` accepts a parameter called `requires_grad`, which defaults to `False`. If set to `True`, it tells PyTorch to keep track of all computations involving that tensor:

In [34]:
x = torch.tensor(2.5, requires_grad=True)
x

tensor(2.5000, requires_grad=True)

Now if we do some computations using `x`:

In [35]:
y = x**2 + x
y

tensor(8.7500, grad_fn=<AddBackward0>)

Note `grad_fn=<AddBackward0>` appearing in the output of the above cell. It's actually saying that the last operation that PyTorch observed and added to the history was an addition operation.

<br><br><br>

#### Gradient computation

We'll learn in the next lecture that the process of using the chain rule in a backward manner (from the last computation back to its root at the variable) is called **back-propagation**. For now, let's just learn how we make PyTorch compute the gradient for us using back-propagation.

PyTorch provides a `.backward()` method on every tensor that takes part in gradient computation.

Here, the final computation is stored in `y`, so we do chain rule starting from `y` all the way back to `x` using `.backward()`, to compute `dy/dx`:

In [36]:
y.backward()

To confirm our results, here is the true value of the derivative:

$$\frac{dy}{dx} = 2x+1 = 2(2.5) + 1 = 6$$

Let's see if that's what we get using PyTorch:

In [37]:
x.grad

tensor(6.)
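
One behaviour to be aware of: gradients **accumulate** in `.grad` across calls to `.backward()` instead of being overwritten. The short sketch below repeats the computation twice to show this; the names `first` and `accumulated` are just for illustration:

```python
import torch

x = torch.tensor(2.5, requires_grad=True)

y = x**2 + x
y.backward()
first = x.grad.item()        # 6.0, as computed above

y = x**2 + x                 # build the computation again
y.backward()
accumulated = x.grad.item()  # 12.0: the new gradient was added to the old one

x.grad.zero_()               # reset the stored gradient in place
print(first, accumulated, x.grad.item())  # 6.0 12.0 0.0
```

This is why training code resets gradients before or after every update step.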

<br><br><br>

#### How does this help?

Well, what we just saw enables us to **do gradient descent without worrying about the structure of our model**, since we no longer need to compute the derivative ourselves.

<br><br><br>

To get a clearer picture, recall that in GD or SGD, we need **the gradient of the loss function** to adjust the weights.

What is the loss function? Is it anything other than a series of elementary computations performed on the weight parameters (along with some constant data points)?

So if we track the computations that involve the weights all the way to the loss function, we can use back-propagation to compute the gradient of the loss function w.r.t. the weights (just like the example above). Voila!

<br><br><br>

Do you remember this bit from our GD function?

```python
w = np.array(w0)

g = 2 * (x * (w * x - y)).mean()
dw = alpha * g
w -= dw
```

<br><br><br>

If we now define `w` using PyTorch tensors, we can use `.backward()` on our loss function to compute the derivative w.r.t. `w`:

```python
w = torch.tensor(w0, requires_grad=True)

mse = ((x * w - y)**2).mean()
mse.backward()

with torch.no_grad():
    g = w.grad
    dw = alpha * g
    w -= dw

w.grad.zero_()
```
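
The snippet above shows a single update step. As a runnable sketch, here it is wrapped in a loop on a tiny synthetic dataset. The data (a noise-free line with slope 3), the starting value of `w`, the learning rate `alpha`, and the number of iterations are all assumptions made for this example:

```python
import torch

# Toy data: y = 3x exactly, so the optimal weight is 3
x = torch.linspace(-1, 1, 50)
y = 3.0 * x

w = torch.tensor(0.0, requires_grad=True)  # initial guess
alpha = 0.5                                # learning rate

for _ in range(200):
    mse = ((x * w - y) ** 2).mean()  # loss is built from tracked operations on w
    mse.backward()                   # fills w.grad with d(mse)/dw

    with torch.no_grad():            # the update itself should not be tracked
        w -= alpha * w.grad

    w.grad.zero_()                   # clear the gradient for the next iteration

print(round(w.item(), 3))  # 3.0
```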

<br><br><br>

## Neural Network Basics


- Neural networks are simply another class of parametric algorithms

- As we'll see, a neural network is just a sequence of linear and non-linear transformations
- Often you see something like this when learning about/using neural networks:

<img src="img/nn-6.png">

- So what on Earth does that all mean? Well we are going to build up some intuition one step at a time

<br><br><br>

### Simple Linear Regression with a Neural Network

- Let's create a simple regression dataset with 500 observations:

In [38]:
X, y = make_regression(n_samples=500, n_features=1, random_state=0, noise=10.0)
plot_regression(X, y)

- We know how to fit a simple linear regression to this data using sklearn:

In [39]:
sk_model = LinearRegression().fit(X, y)
plot_regression(X, y, sk_model.predict(X))

- Here are the parameters of that fitted line:

In [40]:
print(f"w_0: {sk_model.intercept_:.2f} (bias/intercept)")
print(f"w_1: {sk_model.coef_[0]:.2f}")

w_0: -0.77 (bias/intercept)
w_1: 45.50


- As an equation, that looks like this:

$$\hat{y}=-0.77 + 45.50x$$

- Or in matrix form:

$$\begin{bmatrix} \hat{y_1} \\ \hat{y_2} \\ \vdots \\ \hat{y_n} \end{bmatrix}=\begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} -0.77 \\ 45.50 \end{bmatrix}$$

- Or in graph form I'll represent it like this: 

<img src="img/nn-1.png">

<br><br><br>

### Linear Regression with a Neural Network in PyTorch

- So let's implement the above in PyTorch to start gaining an intuition about neural networks

- Every neural network model you build in PyTorch has to inherit from `torch.nn.Module`

- Remember class inheritance from DSCI 511? Inheritance allows us to inherit commonly needed functionality without having to write code ourselves

- Think about sklearn models: they all inherit common methods like `.fit()`, `.predict()`, `.score()`, etc. When creating a neural network, we define our own architecture but still want common functionality which we inherit from `torch.nn.Module`.

- Let's create a model called `linearRegression` and then I'll talk you through the syntax:

In [41]:
class linearRegression(nn.Module):  # our class inherits from nn.Module and we can call it anything we like
    
    def __init__(self, input_size, output_size):
        super().__init__()  # run nn.Module's initializer so PyTorch can register our layers and parameters
        
        self.linear = nn.Linear(input_size, output_size)  # this is a simple linear layer: wX + b

    def forward(self, x):
        out = self.linear(x)
        return out

Let's step through the above:

```python
class linearRegression(nn.Module):
    
    def __init__(self, input_size, output_size):
        super().__init__() 
```

<br><br><br>

Here we're creating a class called `linearRegression` and inheriting the methods and attributes of `nn.Module`

(hint: try typing `help(linearRegression)` to see all the things we inherited from `nn.Module`).

```python
        self.linear = nn.Linear(input_size, output_size)
```

<br><br><br>

Here we're defining a "Linear" layer, which just means `wX + b`, i.e., the weights of the network, multiplied by the inputs plus the bias.

```python
    def forward(self, x):
        out = self.linear(x)
        return out
```

PyTorch networks created with `nn.Module` must have a `forward()` method. It accepts the input data `x` and passes it through the defined operations. In this case, we are passing `x` into our linear layer and getting an output `out`.
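
To convince yourself that a linear layer really is just `wX + b`, you can redo its computation by hand from the layer's `weight` and `bias` attributes. A minimal sketch, where the sizes (3 inputs, 1 output, 5 rows of data) are arbitrary:

```python
import torch
from torch import nn

layer = nn.Linear(3, 1)  # weight has shape (1, 3), bias has shape (1,)
x = torch.rand(5, 3)     # 5 observations, 3 features

with torch.no_grad():
    from_layer = layer(x)                       # what forward() would return
    manual = x @ layer.weight.T + layer.bias    # the same thing, written out

print(torch.allclose(from_layer, manual))  # True
```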

<br><br><br>

- After defining the model class, we can create an instance of that class:

In [42]:
model = linearRegression(input_size=1, output_size=1)

<img src="img/nn-2.png">

- We can check out our model using `print()`:

In [43]:
print(model)

linearRegression(
  (linear): Linear(in_features=1, out_features=1, bias=True)
)


- Or the more useful `summary()` (which we imported at the top of this notebook with `from torchsummary import summary`):

In [44]:
summary(model, (1,));

Layer (type:depth-idx)                   Output Shape              Param #
├─Linear: 1-1                            [-1, 1]                   2
Total params: 2
Trainable params: 2
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00


- Notice how we have two parameters? We have one for the weight (`w1`) and one for the bias (`w0`)

- These were initialized **randomly** by PyTorch when we created our model. They can be accessed with `model.state_dict()`:

In [45]:
model.state_dict()

OrderedDict([('linear.weight', tensor([[0.6442]])),
             ('linear.bias', tensor([-0.8345]))])

<br><br><br>

- Our `X` and `y` data are currently NumPy arrays but they need to be PyTorch tensors

- Let's convert them:

In [46]:
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

<br><br><br>

- We have a working model right now and could tell it to give us some output with this syntax:

In [47]:
model(X_t[0])

tensor([-0.4349], grad_fn=<AddBackward0>)

That's just a raw prediction, and far from the actual value of:

In [48]:
y_t[0]

tensor(31.0760)

That's because our model is not trained yet.

In the context of what we've learned so far, training means running SGD to find the optimal weights, which we haven't done yet.

- As we learned in the past few lectures, to fit our model we need:

    1. **a loss function** (called "`criterion`" in PyTorch) to tell us how good/bad our predictions are. We'll use mean squared error, `torch.nn.MSELoss()`
    
    2. **an optimization algorithm** to help optimize model parameters. We'll use SGD, `torch.optim.SGD()`

In [49]:
LEARNING_RATE = 0.02
criterion = nn.MSELoss()  # loss function
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)  # optimization algorithm is SGD

- Before we train I'm going to create a **data loader** to help batch my data

- We'll talk more about these next lecture and in lab but they are just generators that yield data to us on request (remember generators from 511?)
- We'll use a `BATCH_SIZE=50` (which should give us 10 batches because we have 500 data points)

In [50]:
BATCH_SIZE = 50
dataset = TensorDataset(X_t, y_t)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
dataloader

<torch.utils.data.dataloader.DataLoader at 0x17f96f0a0>

- We should have 10 batches:

In [51]:
len(dataloader)

10

- We can look at a batch using this syntax:

In [52]:
XX, yy = next(iter(dataloader))

In [53]:
print(f"Shape of feature data (X) in batch: {XX.shape}")
print(f"Shape of response data (y) in batch: {yy.shape}")

Shape of feature data (X) in batch: torch.Size([50, 1])
Shape of response data (y) in batch: torch.Size([50])


<br><br><br>

Let's write a function that runs a typical SGD loop for a given number of epochs, using PyTorch's automatic differentiation:

In [54]:
def trainer(model, criterion, optimizer, dataloader, epochs=5, verbose=True):
    """Simple training wrapper for PyTorch network."""
    
    for epoch in range(epochs):
        losses = 0
        
        for X, y in dataloader:
            
            optimizer.zero_grad()       # Clear gradients w.r.t. parameters
            y_hat = model(X).flatten()  # Forward pass to get output
            loss = criterion(y_hat, y)  # Calculate loss
            loss.backward()             # Getting gradients w.r.t. parameters
            optimizer.step()            # Update parameters
            losses += loss.item()       # Add loss for this batch to running total
            
        if verbose: print(f"epoch: {epoch + 1}, loss: {losses / len(dataloader):.4f}")
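
A note on the `.flatten()` in the forward pass above: the model returns predictions of shape `(batch, 1)`, while the targets in each batch have shape `(batch,)`. Without flattening, the two would silently broadcast against each other. The sketch below reproduces the issue by computing the MSE by hand on four made-up values:

```python
import torch

y_hat = torch.tensor([[1.], [2.], [3.], [4.]])  # model-style output, shape (4, 1)
y = torch.tensor([1., 2., 3., 4.])              # targets, shape (4,)

# (4, 1) and (4,) broadcast to (4, 4): every prediction is compared with every target
wrong = ((y_hat - y) ** 2).mean()

# After flattening, predictions and targets line up one-to-one
right = ((y_hat.flatten() - y) ** 2).mean()

print(wrong.item(), right.item())  # 2.5 0.0
```

The predictions here are perfect, yet the broadcast version reports a non-zero loss, so it is worth checking shapes before computing a loss.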

<br><br><br>

OK, before starting the training, here are the model parameters before training for reference:

In [55]:
model.state_dict()

OrderedDict([('linear.weight', tensor([[0.6442]])),
             ('linear.bias', tensor([-0.8345]))])

<br><br><br>

In [56]:
trainer(model, criterion, optimizer, dataloader, epochs=100, verbose=True)

epoch: 1, loss: 1523.6027
epoch: 2, loss: 725.3294
epoch: 3, loss: 373.1668
epoch: 4, loss: 217.7169
epoch: 5, loss: 148.2417
epoch: 6, loss: 117.5025
epoch: 7, loss: 104.0313
epoch: 8, loss: 98.0026
epoch: 9, loss: 95.2701
epoch: 10, loss: 94.1805
epoch: 11, loss: 93.6136
epoch: 12, loss: 93.3771
epoch: 13, loss: 93.2962
epoch: 14, loss: 93.1986
epoch: 15, loss: 93.2796
epoch: 16, loss: 93.2206
epoch: 17, loss: 93.2298
epoch: 18, loss: 93.2408
epoch: 19, loss: 93.2582
epoch: 20, loss: 93.1888
epoch: 21, loss: 93.2197
epoch: 22, loss: 93.2509
epoch: 23, loss: 93.2816
epoch: 24, loss: 93.2021
epoch: 25, loss: 93.2461
epoch: 26, loss: 93.3452
epoch: 27, loss: 93.2357
epoch: 28, loss: 93.1745
epoch: 29, loss: 93.2405
epoch: 30, loss: 93.2957
epoch: 31, loss: 93.2531
epoch: 32, loss: 93.2019
epoch: 33, loss: 93.1806
epoch: 34, loss: 93.3229
epoch: 35, loss: 93.2418
epoch: 36, loss: 93.2921
epoch: 37, loss: 93.2332
epoch: 38, loss: 93.2329
epoch: 39, loss: 93.3110
epoch: 40, loss: 93.2286
...

<br><br><br>

- Now our model has been trained, our parameters should be different than before:

In [57]:
model.state_dict()

OrderedDict([('linear.weight', tensor([[45.4996]])),
             ('linear.bias', tensor([-0.7183]))])

- Comparing to our sklearn model, we get nearly the same answer:

In [58]:
pd.DataFrame({"w0": [sk_model.intercept_, model.state_dict()['linear.bias'].item()],
              "w1": [sk_model.coef_[0], model.state_dict()['linear.weight'].item()]},
             index=['sklearn', 'pytorch']).round(2)

| | w0 | w1 |
|---|---|---|
| sklearn | -0.77 | 45.5 |
| pytorch | -0.72 | 45.5 |


- We got pretty close

- We could do better by changing the number of epochs or the learning rate
- So here is our simple network once again:

<img src="img/nn-2.png">

- By the way, check out what happens if we run `trainer()` again:

In [59]:
trainer(model, criterion, optimizer, dataloader, epochs=20, verbose=True)

epoch: 1, loss: 93.2452
epoch: 2, loss: 93.2298
epoch: 3, loss: 93.1942
epoch: 4, loss: 93.2969
epoch: 5, loss: 93.2006
epoch: 6, loss: 93.2771
epoch: 7, loss: 93.1800
epoch: 8, loss: 93.3119
epoch: 9, loss: 93.2317
epoch: 10, loss: 93.2376
epoch: 11, loss: 93.2057
epoch: 12, loss: 93.1897
epoch: 13, loss: 93.2429
epoch: 14, loss: 93.2018
epoch: 15, loss: 93.1451
epoch: 16, loss: 93.2524
epoch: 17, loss: 93.2095
epoch: 18, loss: 93.3556
epoch: 19, loss: 93.1832
epoch: 20, loss: 93.2722


- **Our model continues where we left off**

- This may or may not be what you want. We can start from scratch by re-making the `model` and `optimizer`.

In [60]:
pd.DataFrame({"w0": [sk_model.intercept_, model.state_dict()['linear.bias'].item()],
              "w1": [sk_model.coef_[0], model.state_dict()['linear.weight'].item()]},
             index=['sklearn', 'pytorch']).round(2)

| | w0 | w1 |
|---|---|---|
| sklearn | -0.77 | 45.5 |
| pytorch | -0.75 | 45.57 |


<br><br><br>

### Multiple Linear Regression with a Neural Network

- Okay, let's do a multiple linear regression now with 3 features

- So our network will look like this:

<img src="img/nn-3.png">

<br><br><br>

- Let's go ahead and create some data:

In [61]:
# Create dataset
X, y = make_regression(n_samples=500, n_features=3, random_state=0, noise=10.0)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

# Create dataloader
dataset = TensorDataset(X_t, y_t)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

- And let's create the above model:

In [62]:
model = linearRegression(input_size=3, output_size=1)

- We should now have 4 parameters (3 weights and 1 bias)

In [63]:
summary(model, (3,));

Layer (type:depth-idx)                   Output Shape              Param #
├─Linear: 1-1                            [-1, 1]                   4
Total params: 4
Trainable params: 4
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00


- Looks good to me! Let's train the model and then compare it to sklearn's `LinearRegression()`:

In [64]:
model.state_dict()

OrderedDict([('linear.weight', tensor([[-0.0610,  0.1707,  0.3138]])),
             ('linear.bias', tensor([-0.0732]))])

In [65]:
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
trainer(model, criterion, optimizer, dataloader, epochs=1000, verbose=True)

epoch: 1, loss: 1616.9759
epoch: 2, loss: 274.9714
epoch: 3, loss: 121.6398
epoch: 4, loss: 103.9073
epoch: 5, loss: 101.5647
epoch: 6, loss: 101.6030
epoch: 7, loss: 101.3058
epoch: 8, loss: 101.5976
epoch: 9, loss: 101.3601
epoch: 10, loss: 101.2879
epoch: 11, loss: 101.5143
epoch: 12, loss: 101.3614
epoch: 13, loss: 101.8122
epoch: 14, loss: 101.5668
epoch: 15, loss: 101.3496
epoch: 16, loss: 101.2788
epoch: 17, loss: 101.4515
epoch: 18, loss: 101.4835
epoch: 19, loss: 101.5016
epoch: 20, loss: 101.2596
epoch: 21, loss: 101.4976
epoch: 22, loss: 101.6955
epoch: 23, loss: 101.3558
epoch: 24, loss: 101.2661
epoch: 25, loss: 101.8850
epoch: 26, loss: 101.7393
epoch: 27, loss: 101.5943
epoch: 28, loss: 101.8091
epoch: 29, loss: 101.2854
epoch: 30, loss: 101.3061
epoch: 31, loss: 101.6445
epoch: 32, loss: 101.3639
epoch: 33, loss: 101.3464
epoch: 34, loss: 101.5466
epoch: 35, loss: 101.3283
epoch: 36, loss: 101.3586
epoch: 37, loss: 101.6363
epoch: 38, loss: 101.4497
...

In [66]:
sk_model = LinearRegression().fit(X, y)
pd.DataFrame({"w0": [sk_model.intercept_, model.state_dict()['linear.bias'].item()],
              "w1": [sk_model.coef_[0], model.state_dict()['linear.weight'][0, 0].item()],
              "w2": [sk_model.coef_[1], model.state_dict()['linear.weight'][0, 1].item()],
              "w3": [sk_model.coef_[2], model.state_dict()['linear.weight'][0, 2].item()]},
             index=['sklearn', 'pytorch']).round(2)

| | w0 | w1 | w2 | w3 |
|---|---|---|---|---|
| sklearn | 0.43 | 0.62 | 55.99 | 11.14 |
| pytorch | 0.5 | 0.61 | 55.96 | 11.11 |


<br><br><br>

### Non-linear Regression with a Neural Network

- Okay, so we can make a simple network to imitate simple and multiple *linear* regression

- But what happens when we have more complicated datasets like this?

In [67]:
# Create dataset
np.random.seed(2020)

X = np.sort(np.random.randn(500))
y = X ** 2 + 15 * np.sin(X) **3

X_t = torch.tensor(X[:, None], dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)

# Create dataloader
dataset = TensorDataset(X_t, y_t)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

plot_regression(X, y, y_range=[-25, 25], dy=5)

- This is obviously non-linear, and we need to introduce some **non-linearities** into our network

- These non-linearities are what make neural networks so powerful and they are called **"activation functions"**
- We are going to create a new model class that includes a non-linearity, that is, a sigmoid function:

$$S(x)=\frac{1}{1+e^{-x}}$$

In [68]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.linspace(-15, 15, 100)
plot_regression(xs, [0], sigmoid(xs), x_range=[-5, 5], y_range=[0, 1], dy=0.2)

- We'll talk more about activation functions later, but note how the sigmoid function non-linearly maps `x` to a value between 0 and 1

- Okay, so let's create the following network:

<img src="img/nn-5.png">

- All this means is that the value of each node in the hidden layer will be transformed by the "activation function", thus introducing non-linear elements to our model

- There are **two main ways** of creating the above model in PyTorch, and I'll show you both:

In [69]:
class nonlinRegression(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        self.hidden = nn.Linear(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.hidden(x)       # input -> hidden layer
        x = self.sigmoid(x)      # sigmoid activation function in hidden layer
        x = self.output(x)       # hidden -> output layer
        return x

- Note how our `forward()` method now passes `x` through the `nn.Sigmoid()` function after the hidden layer

- The above method is very clear and flexible, but I prefer using `nn.Sequential()` to combine my layers together in the constructor:

In [70]:
class nonlinRegression(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        
        self.main = torch.nn.Sequential(
            nn.Linear(input_size, hidden_size),  # input -> hidden layer
            nn.Sigmoid(),                        # sigmoid activation function in hidden layer
            nn.Linear(hidden_size, output_size)  # hidden -> output layer
        )

    def forward(self, x):
        x = self.main(x)
        return x

- Let's make an instance of our new class with 10 hidden nodes and confirm it has 31 parameters (20 weights + 11 biases):

In [71]:
model = nonlinRegression(1, 10, 1)
summary(model, (1,));

Layer (type:depth-idx)                   Output Shape              Param #
├─Sequential: 1-1                        [-1, 1]                   --
|    └─Linear: 2-1                       [-1, 10]                  20
|    └─Sigmoid: 2-2                      [-1, 10]                  --
|    └─Linear: 2-3                       [-1, 1]                   11
Total params: 31
Trainable params: 31
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
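
The parameter counts in the summary can also be checked directly from the model. Here is a small sketch that rebuilds the same 1-10-1 architecture as a bare `nn.Sequential` (so that it runs on its own) and counts weights and biases separately:

```python
import torch
from torch import nn

# Same architecture as above: 1 input -> 10 hidden nodes (sigmoid) -> 1 output
net = nn.Sequential(nn.Linear(1, 10), nn.Sigmoid(), nn.Linear(10, 1))

# named_parameters() yields names like "0.weight", "0.bias", "2.weight", "2.bias"
n_weights = sum(p.numel() for name, p in net.named_parameters() if name.endswith("weight"))
n_biases = sum(p.numel() for name, p in net.named_parameters() if name.endswith("bias"))
total = sum(p.numel() for p in net.parameters())

print(n_weights, n_biases, total)  # 20 11 31
```

The hidden layer contributes 10 weights and 10 biases, and the output layer contributes 10 weights and 1 bias.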


- Okay, let's train:

In [72]:
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

trainer(model, criterion, optimizer, dataloader, epochs=1000, verbose=True)

epoch: 1, loss: 57.2754
epoch: 2, loss: 54.4414
epoch: 3, loss: 51.3297
epoch: 4, loss: 47.6599
epoch: 5, loss: 43.3059
epoch: 6, loss: 38.4537
epoch: 7, loss: 33.4780
epoch: 8, loss: 28.6447
epoch: 9, loss: 24.3285
epoch: 10, loss: 20.6818
epoch: 11, loss: 17.7626
epoch: 12, loss: 15.5484
epoch: 13, loss: 13.8642
epoch: 14, loss: 12.6696
epoch: 15, loss: 11.7869
epoch: 16, loss: 11.1688
epoch: 17, loss: 10.6970
epoch: 18, loss: 10.3636
epoch: 19, loss: 10.1163
epoch: 20, loss: 9.9151
epoch: 21, loss: 9.7484
epoch: 22, loss: 9.6331
epoch: 23, loss: 9.5288
epoch: 24, loss: 9.4259
epoch: 25, loss: 9.3383
epoch: 26, loss: 9.2651
epoch: 27, loss: 9.2026
epoch: 28, loss: 9.1389
epoch: 29, loss: 9.0903
epoch: 30, loss: 9.0429
epoch: 31, loss: 8.9929
epoch: 32, loss: 8.9488
epoch: 33, loss: 8.9232
epoch: 34, loss: 8.8922
epoch: 35, loss: 8.8536
epoch: 36, loss: 8.8312
epoch: 37, loss: 8.7957
epoch: 38, loss: 8.7731
epoch: 39, loss: 8.7506
epoch: 40, loss: 8.7380
epoch: 41, loss: 8.7233
...

In [73]:
y_p = model(X_t).detach().numpy().squeeze()
plot_regression(X, y, y_p, y_range=[-25, 25], dy=5)

- Take a look at those non-linear predictions

- Our model is not great yet, but we could improve it by adjusting the learning rate, the number of nodes, and the number of epochs

- I want you to see how **each of our hidden nodes is engineering a non-linear feature** to be used for the predictions:

In [74]:
plot_nodes(X, y_p, model)
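
If you want to look at those engineered features as numbers rather than plots, you can pull the hidden activations out of the network yourself. The sketch below uses a small untrained stand-in network (3 hidden nodes, chosen arbitrarily) and shows that the prediction is just a linear combination of the hidden-node outputs:

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(1, 3), nn.Sigmoid(), nn.Linear(3, 1))
x = torch.linspace(-3, 3, 7).unsqueeze(1)  # 7 inputs, shape (7, 1)

with torch.no_grad():
    hidden = net[1](net[0](x))                         # one non-linear feature per hidden node
    manual = hidden @ net[2].weight.T + net[2].bias    # weighted sum of those features
    same = torch.allclose(manual, net(x))

print(hidden.shape, same)  # torch.Size([7, 3]) True
```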

### Deep Learning

You've probably heard the magic term **"deep learning"** and you're about to find out what it means

- **Deep neural network: a neural network with more than 1 hidden layer**

- On the other hand, a neural network with only 1 hidden layer is called a **shallow neural network**.

I like to think of each layer as a "feature engineer": it tries to extract information from the layer before it.

<br><br><br>

- Let's create a "deep" network with 2 hidden layers:

<img src="img/nn-6.png" width=500>

In [75]:
class deepRegression(nn.Module):
    def __init__(self, input_size, hidden_size_1, hidden_size_2, output_size):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(input_size, hidden_size_1),
            nn.Sigmoid(),
            nn.Linear(hidden_size_1, hidden_size_2),
            nn.Sigmoid(),
            nn.Linear(hidden_size_2, output_size)
        )

    def forward(self, x):
        out = self.main(x)
        return out

In [76]:
model = deepRegression(1, 5, 3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3)

In [77]:
trainer(model, criterion, optimizer, dataloader, epochs=10**3, verbose=False)
plot_regression(X, y, model(X_t).detach(), y_range=[-25, 25], dy=5)

The neural network is doing a good job, but it's still struggling to handle data points near the boundaries. We can do better by having more neurons in our network:

In [78]:
model = deepRegression(1, 10, 10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)

In [79]:
trainer(model, criterion, optimizer, dataloader, epochs=10**3, verbose=False)
plot_regression(X, y, model(X_t).detach(), y_range=[-25, 25], dy=5)

## Activation Functions


- Activation functions are what allow us to model complex, non-linear functions

- There are **many** different activation functions:

In [80]:
functions = [torch.nn.Sigmoid, torch.nn.Tanh, torch.nn.Softplus, torch.nn.ReLU, torch.nn.LeakyReLU, torch.nn.SELU]
plot_activations(torch.linspace(-6, 6, 100), functions)

- Activation functions should be non-linear and tend to be monotonic and continuously differentiable (smooth)

- But as you can see with the ReLU function above, that's not always the case: ReLU has a kink at 0, so it is not differentiable there

- I wanted to point this out because it highlights how much of an art deep learning really is.
- Here's a great quote from [Yoshua Bengio](https://en.wikipedia.org/wiki/Yoshua_Bengio) (famous for his work in AI and deep learning) on his group experimenting with ReLU:

>"_...one of the biggest mistakes I made was to think, like everyone else in the 90s, that you needed smooth non-linearities in order for backpropagation to work. I thought that if we had something like rectifying non-linearities, where you have a flat part, it would be really hard to train, because the derivative would be zero in so many places. And when we started experimenting with ReLU, with deep nets around 2010, I was obsessed with the idea that, we should be careful about whether neurons won't saturate too much on the zero part. **But in the end, it turned out that, actually, the ReLU was working a lot better than the sigmoids and tanh, and that was a big surprise**...it turned out to work better, whereas I thought it would be harder to train!_"

- Anyway, TL;DR **ReLU is probably the most popular these days for training deep neural nets**, but you can treat activation functions as hyper-parameters to be optimized
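As a rough illustration of the smoothness discussion, the sketch below compares the derivative of the sigmoid with the derivative of ReLU at a few inputs, using plain Python. The function names are hypothetical helpers, and returning 0 for ReLU's derivative at exactly 0 is a convention assumed here, not something the lecture specifies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Smooth everywhere, but shrinks toward 0 as |z| grows (saturation)
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Not differentiable at exactly 0; using 0 there is an assumed convention
    return 1.0 if z > 0 else 0.0

for z in [-6.0, -1.0, 0.0, 1.0, 6.0]:
    print(f"z={z:+.1f}  sigmoid'={sigmoid_grad(z):.4f}  relu'={relu_grad(z):.1f}")
```

The sigmoid's gradient never exceeds 0.25 and nearly vanishes at the extremes, whereas ReLU passes a gradient of exactly 1 for every positive input.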

<br><br><br>

## Neural Network Classification


### Binary Classification

- This will actually be the easiest part of the lecture

- Up until now, we've been looking at developing networks for regression tasks, but what if we want to do binary classification?

- Well, what did we do in Logistic Regression? We just passed the output of a regression into the Sigmoid Function to get a value between 0 and 1 (a probability of an observation belonging to the positive class). Let's do the same thing here

- Let's create a toy dataset first:

In [81]:
X, y = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=2020)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.float32)
plot_classification_2d(X, y)

In [82]:
LEARNING_RATE = 0.1
BATCH_SIZE = 50

# Create dataloader
dataset = TensorDataset(X_t, y_t)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

- Let's create this network to model that dataset:

<img src="img/nn-7.png">

<br><br><br>

- I'm going to start using `ReLU` as our activation function and `Adam` as our optimizer because these are commonly used in practice.

- We are doing classification now so we'll need to use log loss (binary cross entropy) as our loss function:

$$L(w) = \sum_{(x, y) \in D} \left[ -y \log(\hat{y}) - (1-y)\log(1-\hat{y}) \right]$$

- In PyTorch, binary cross entropy loss criterion is `torch.nn.BCELoss`

- The formula expects a "probability" which is why we add a Sigmoid function to the end of our network.
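Here is a small worked example of the log loss formula in plain Python. The helper name `binary_cross_entropy` and the toy labels and probabilities are made up for illustration. The sketch averages over the batch rather than summing, which matches the default `reduction='mean'` of `torch.nn.BCELoss`.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean log loss over a batch of (label, predicted probability) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / len(y_true)

y_true = [1, 0, 1]                     # hypothetical labels
confident_and_right = [0.9, 0.1, 0.8]  # hypothetical predicted probabilities
confident_and_wrong = [0.1, 0.9, 0.2]

loss_right = binary_cross_entropy(y_true, confident_and_right)
loss_wrong = binary_cross_entropy(y_true, confident_and_wrong)
print(f"right: {loss_right:.4f}, wrong: {loss_wrong:.4f}")
```

Confident but wrong predictions are punished far more heavily than confident and correct ones are rewarded, which is what pushes the predicted probabilities toward the true labels.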

In [83]:
class binaryClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
            nn.Sigmoid()  # <-- this will squash our output to a probability between 0 and 1
        )

    def forward(self, x):
        out = self.main(x)
        return out

- **BUT WAIT!**

- While we can do the above and then train with a `torch.nn.BCELoss` loss function, there's a better way

- We can omit the Sigmoid function and just use `torch.nn.BCEWithLogitsLoss` (which combines a Sigmoid layer and the BCELoss)

- Why would we do this? It's numerically more stable (Did you do the log-sum-exp question in Lab 1? We use it here for stability)

- From the docs:
>"*This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.*"
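To see why combining the two steps helps, the sketch below compares a naive "Sigmoid, then log loss" computation with an algebraically equivalent form that works directly on the logit. This is a standard rearrangement written in plain Python for illustration; it is an assumption about the general idea, not PyTorch's actual implementation, and both function names are hypothetical.

```python
import math

def naive_bce_from_logit(z, y):
    # Sigmoid first, then the log loss formula
    p = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def stable_bce_from_logit(z, y):
    # Same quantity rearranged: max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z, y in [(2.0, 1), (-3.0, 0), (40.0, 0)]:
    try:
        naive = naive_bce_from_logit(z, y)
    except ValueError:
        # The sigmoid rounded to exactly 1.0, so log(1 - p) became log(0)
        naive = float("nan")
    print(f"z={z:+.1f}, y={y}: naive={naive:.6f}, stable={stable_bce_from_logit(z, y):.6f}")
```

For moderate logits the two versions agree, but for a large logit the naive version breaks down while the rearranged one still returns a finite loss.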

- So actually, here's our model (no Sigmoid layer at the end because it's included in the loss function we'll use):

In [84]:
class binaryClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.main = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        out = self.main(x)
        return out

- Let's train the model:

In [85]:
model = binaryClassifier(2, 5, 1)
criterion = torch.nn.BCEWithLogitsLoss() # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # optimization algorithm
trainer(model, criterion, optimizer, dataloader, epochs=20, verbose=True)

epoch: 1, loss: 0.6484
epoch: 2, loss: 0.5936
epoch: 3, loss: 0.5255
epoch: 4, loss: 0.4573
epoch: 5, loss: 0.3785
epoch: 6, loss: 0.3008
epoch: 7, loss: 0.2553
epoch: 8, loss: 0.2282
epoch: 9, loss: 0.1967
epoch: 10, loss: 0.1654
epoch: 11, loss: 0.1366
epoch: 12, loss: 0.1151
epoch: 13, loss: 0.0945
epoch: 14, loss: 0.0856
epoch: 15, loss: 0.0755
epoch: 16, loss: 0.0714
epoch: 17, loss: 0.0670
epoch: 18, loss: 0.0651
epoch: 19, loss: 0.0606
epoch: 20, loss: 0.0568


In [86]:
plot_classification_2d(X, y, model)

- To be clear, our model is just outputting **logits**, which are some number between -∞ and +∞ (we aren't applying Sigmoid), so:

    - To get the probabilities we would need to pass them through a Sigmoid;

    - To get classes, we can apply some threshold to the probabilities (usually 0.5, which corresponds to a logit of 0)
    
- For example, we would expect the point `(0, 0)` to have a high probability of belonging to class 1 (the inner circle) and the point `(-1, -1)` to have a low probability:

In [87]:
prediction = model(torch.tensor([[0, 0], [-1, -1]], dtype=torch.float32)).detach()
print(prediction)

tensor([[  5.1897],
        [-10.8944]])


In [88]:
probability = nn.Sigmoid()(prediction)
print(probability)

tensor([[9.9446e-01],
        [1.8562e-05]])


In [89]:
classes = np.where(probability > 0.5, 1, 0)
print(classes)

[[1]
 [0]]


<br><br><br>

### Multiclass Classification

- For multiclass classification, remember softmax?

$$\sigma(\vec{z})_i=\frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}}$$

- It basically makes sure all the outputs are probabilities between 0 and 1, and that they all sum to 1.

- `torch.nn.CrossEntropyLoss` is a loss that combines a Softmax with cross entropy loss, so the network itself should output raw logits (no Softmax layer at the end).
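The softmax formula and the cross entropy built on top of it can be sketched in a few lines of plain Python. The logits below are hypothetical values for one observation in a 4-class problem, and the helper names are made up. Subtracting the maximum logit before exponentiating is a common stability trick assumed here; it does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical logits, one per class
probs = softmax(logits)
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 4))
print("loss if true class is 0:", round(cross_entropy(logits, 0), 4))
print("loss if true class is 3:", round(cross_entropy(logits, 3), 4))
```

The loss is small when the largest logit belongs to the true class and large when it does not.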

- Let's try a 4-class classification problem using the following network:

<img src="img/nn-8.png">

In [90]:
class multiClassifier(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.main = torch.nn.Sequential(
            torch.nn.Linear(input_size, hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_size, output_size)
        )

    def forward(self, x):
        out = self.main(x)
        return out

In [91]:
X, y = make_blobs(n_samples=200, centers=4, center_box=(-1.2, 1.2), cluster_std=[0.15, 0.15, 0.15, 0.15], random_state=12345)
X_t = torch.tensor(X, dtype=torch.float32)
y_t = torch.tensor(y, dtype=torch.int64)

# Create dataloader
dataset = TensorDataset(X_t, y_t)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

plot_classification_2d(X, y)

- Let's train this model:

In [92]:
model = multiClassifier(2, 5, 4)
criterion = torch.nn.CrossEntropyLoss() # loss function
optimizer = torch.optim.Adam(model.parameters(), lr=0.2)  # optimization algorithm

for epoch in range(10):
    losses = 0
    for X_batch, y_batch in dataloader:
        optimizer.zero_grad()       # Clear gradients w.r.t. parameters
        y_hat = model(X_batch)            # Forward pass to get output
        loss = criterion(y_hat, y_batch)  # Calculate loss
        loss.backward()             # Getting gradients w.r.t. parameters
        optimizer.step()            # Update parameters
        losses += loss.item()       # Add loss for this batch to running total
    print(f"epoch: {epoch + 1}, loss: {losses / len(dataloader):.4f}")

epoch: 1, loss: 1.1238
epoch: 2, loss: 0.5183
epoch: 3, loss: 0.2320
epoch: 4, loss: 0.0505
epoch: 5, loss: 0.0246
epoch: 6, loss: 0.0044
epoch: 7, loss: 0.0026
epoch: 8, loss: 0.0037
epoch: 9, loss: 0.0018
epoch: 10, loss: 0.0007


In [93]:
plot_classification_2d(X, y, model, transform="Softmax")

- To be clear once again, our model is just outputting logits, which are some numbers between -∞ and +∞, so:

    - To get the probabilities we would need to pass them to a Softmax;

    - To get classes, we need to select the class with the largest probability.
    
- For example, we would expect the point (-1, -1) to have a high probability of belonging to class 1, and the point (0,0) to have the highest probability of belonging to class 2.

In [94]:
prediction = model(torch.tensor([[-1, -1], [0, 0]], dtype=torch.float32)).detach()
print(prediction)

tensor([[-12.6362,  20.9416,   2.8311, -39.4635],
        [-10.4034,  -1.0663,   9.6552,  -6.8234]])


- Note how we get 4 predictions per data point (a prediction for each of the 4 classes)

In [95]:
probability = nn.Softmax(dim=1)(prediction)
print(probability)

tensor([[2.6143e-15, 1.0000e+00, 1.3637e-08, 5.8402e-27],
        [1.9439e-09, 2.2066e-05, 9.9998e-01, 6.9732e-08]])


- The probabilities for each data point should now sum to 1:

In [96]:
probability.sum(dim=1, keepdim=True)

tensor([[1.0000],
        [1.0000]])

- We can get the class with maximum probability using `argmax()`:

In [97]:
classes = probability.argmax(dim=1)
print(classes)

tensor([1, 2])


<br><br><br>

## Lecture Exercise: True/False Questions


Answer True/False for the following:

1. Neural networks can be used for both regression and classification. (**True**)

2. For fully connected neural networks, the number of parameters $\geq$ the number of features. (**True**)

3. Neural networks are parametric models. (**True**)

4. Any neural network with 3 hidden layers will have more parameters than any neural network with 2 hidden layers. (**False**)

5. The architecture of a neural network (number of hidden layers and hidden nodes) is a hyperparameter. (**True**)

6. Like linear regression or logistic regression, with neural networks we can interpret each feature's weight value as a measure of the feature's importance. (**False**)
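The answers to questions 2 and 4 can be checked by counting parameters directly. The sketch below uses a hypothetical helper, `count_parameters`, that takes the list of layer sizes of a fully connected network (input, hidden layers, output) and counts weights plus biases. The two architectures are illustrative choices.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network, e.g. [1, 5, 3, 1]."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

deep_narrow = [1, 2, 2, 2, 1]   # 3 hidden layers with 2 nodes each
shallow_wide = [1, 10, 10, 1]   # 2 hidden layers with 10 nodes each

print(count_parameters(deep_narrow))   # 19
print(count_parameters(shallow_wide))  # 141
```

The network with 3 hidden layers has far fewer parameters than the one with 2, so depth alone does not determine the parameter count. The first layer always contributes at least one weight per input feature, which is why the number of parameters can never be smaller than the number of features.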

<br><br><br>

## Lecture Highlights


1. PyTorch is a neural network package that implements tensors with computation history

2. Neural networks, in short:

    - They are composed of an input layer, 1 or more hidden layers, and an output layer, each with 1 or more nodes.

    - The number of nodes in the Input/Output layers is defined by the problem/data. Hidden layers can have an arbitrary number of nodes.

    - Activation functions in the hidden layers help us model non-linear data.

    - Feed-forward neural networks are just a combination of simple linear and non-linear operations.
    
3. Activation functions allow the network to learn non-linear functions