# Gradient Descent and Linear Regression with PyTorch

## Importing Libraries

In [2]:
import numpy as np
import torch

## Introduction to Linear Regression

In this tutorial, we'll discuss one of the foundational algorithms in machine learning: linear regression. We'll create a model that predicts crop yields for apples and oranges (target variables) by looking at the average temperature, rainfall, and humidity (input variables or features) in a region. Here's the training data:

![table](https://i.imgur.com/6Ujttb4.png)

In a linear regression model, each target variable is estimated to be a weighted sum of the input variables, offset by some constant, known as a bias:

yield_apple  = w11 * temp + w12 * rainfall + w13 * humidity + b1

yield_orange = w21 * temp + w22 * rainfall + w23 * humidity + b2

Visually, it means that the yield of apples is a linear or planar function of temperature, rainfall and humidity:

![linear-regression-graph](https://i.imgur.com/4DJ9f8X.png)

The learning part of linear regression is to figure out a set of weights w11, w12,... w23, b1 & b2 using the training data, to make accurate predictions for new data. The learned weights will be used to predict the yields for apples and oranges in a new region using the average temperature, rainfall, and humidity for that region.
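As a quick illustration of the two formulas above, the sketch below evaluates them for the first training row (temp 73, rainfall 67, humidity 43). The weight and bias values are made up for the example; they are not learned values.

```python
import numpy as np

# Hypothetical weights and biases, chosen only for illustration
w = np.array([[0.5, 0.2, 0.1],    # w11, w12, w13 (apples)
              [0.3, 0.4, 0.6]])   # w21, w22, w23 (oranges)
b = np.array([1.0, 2.0])          # b1, b2

# First row of the training data: temp, rainfall, humidity
x = np.array([73.0, 67.0, 43.0])

# The two formulas written out term by term
yield_apple = w[0, 0] * x[0] + w[0, 1] * x[1] + w[0, 2] * x[2] + b[0]
yield_orange = w[1, 0] * x[0] + w[1, 1] * x[1] + w[1, 2] * x[2] + b[1]

# The same computation written as a matrix-vector product
yields = w @ x + b

print(yield_apple, yield_orange)
print(yields)
```

Both forms give the same pair of numbers, which is why the rest of the notebook can work with a single weight matrix and a bias vector.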

We'll train our model by adjusting the weights slightly many times to make better predictions, using an optimization technique called gradient descent. We'll use NumPy and PyTorch, which were imported at the top of the notebook.

## Training Data

In [3]:
# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70]], dtype='float32')
np.shape(inputs)

(5, 3)

In [4]:
# Targets (apples, oranges)
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119]], dtype='float32')
np.shape(targets)

(5, 2)

In [5]:
# Converting inputs and targets to tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs)
print(targets)

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.]])
tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


## Linear Regression Model from Scratch

In [6]:
# Weights and Biases
w = torch.randn(2, 3, requires_grad = True)
b = torch.randn(2, requires_grad = True)
print(w)
print(b)

tensor([[-2.1095,  0.9673,  0.6344],
        [ 1.2768,  1.1176, -1.6715]], requires_grad=True)
tensor([-0.9188, -1.1990], requires_grad=True)


torch.randn creates a tensor with the given shape, with elements picked randomly from a normal distribution with mean 0 and standard deviation 1.

Our model is simply a function that performs a matrix multiplication of the inputs and the weights w (transposed) and adds the bias b (replicated for each observation).

![linear_regression](https://i.imgur.com/WGXLFvA.png)

In [7]:
def model(x):
    return x @ w.t() + b

@ represents matrix multiplication in PyTorch, and the .t method returns the transpose of the tensor.
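To make the shapes concrete, here is a small sketch that checks the matrix form against the weighted sum computed one element at a time. The tensors `x`, `w`, and `b` below are random stand-ins with the same shapes as in the notebook, not the notebook's own variables.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)   # 5 observations, 3 features
w = torch.randn(2, 3)   # 2 targets, 3 features
b = torch.randn(2)      # one bias per target

# Matrix form: (5, 3) @ (3, 2) -> (5, 2); b is broadcast across the 5 rows
preds = x @ w.t() + b

# Same computation, one element at a time
manual = torch.zeros(5, 2)
for i in range(5):          # each observation
    for j in range(2):      # each target
        manual[i, j] = (x[i] * w[j]).sum() + b[j]

print(preds.shape)  # torch.Size([5, 2])
print(torch.allclose(preds, manual, atol=1e-5))
```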

In [8]:
# Generate predictions
preds = model(inputs)
print(preds)

tensor([[ -62.8211,   95.0054],
        [ -67.1555,  106.3539],
        [ -18.0296,  162.6839],
        [-151.0166,  115.2388],
        [  -9.2026,   77.1768]], grad_fn=<AddBackward0>)


In [9]:
# Compare with targets 
print(targets)

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


## Loss Function

In [10]:
# MSE Loss
def mse(t1, t2):
    diff = t1-t2
    return torch.sum(diff*diff) / diff.numel()

# torch.sum(input, *, dtype=None) → Tensor
# Returns the sum of all elements in the input tensor.
# https://pytorch.org/docs/stable/generated/torch.sum.html

# torch.numel(input) → int
# Returns the total number of elements in the input tensor.
# https://pytorch.org/docs/stable/generated/torch.numel.html
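
A worked example on two small tensors may help when checking the function by hand. The tensors `a` and `c` are made up for this sketch.

```python
import torch

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
c = torch.tensor([[1.0, 0.0], [0.0, 4.0]])

# Differences: [[0, 2], [3, 0]] -> squares: [[0, 4], [9, 0]]
# Sum of squares = 13, number of elements = 4, so MSE = 13 / 4
print(mse(a, c))  # tensor(3.2500)
```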

In [11]:
# Computing the Loss
loss = mse(preds, targets)
print(loss)

tensor(10677.5312, grad_fn=<DivBackward0>)


## Computing Gradients

In [12]:
# Computing the Gradients

# With PyTorch, we can automatically compute the gradient or derivative of the 
# loss w.r.t. the weights and biases, because they have requires_grad set to 
# True.

# The gradients are stored in the .grad property of the respective tensors. Note 
# that the derivative of the loss w.r.t. the weights matrix is itself a matrix, 
# with the same dimensions.

# To reduce memory usage, the intermediate results are freed during the 
# .backward() call once they are no longer needed. If you call .backward() a 
# second time on the same loss, those results no longer exist, so the backward 
# pass cannot be performed and PyTorch raises a RuntimeError. To run several 
# backward passes over the same graph, call .backward(retain_graph=True) on 
# all but the last pass.

loss.backward()
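
The behaviour of repeated backward passes can be reproduced on a tiny example. This is a sketch with a made-up tensor `x`, kept separate from the notebook's `w`, `b`, and `loss` so that running it does not disturb them.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (x * x).sum()

# Keep the graph alive so that a second backward pass is possible
loss.backward(retain_graph=True)
print(x.grad)   # tensor([2., 4.])

# The second pass works, and the new gradients are added to the old ones
loss.backward()
grad_after_two = x.grad.clone()
print(grad_after_two)   # tensor([4., 8.])

# The graph was freed by the previous call, so a third pass fails
try:
    loss.backward()
    third_pass_failed = False
except RuntimeError as err:
    third_pass_failed = True
    print("RuntimeError:", err)
```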

In [13]:
# Gradients for the weights
print(w)
print(w.grad)

tensor([[-2.1095,  0.9673,  0.6344],
        [ 1.2768,  1.1176, -1.6715]], requires_grad=True)
tensor([[-11893.4668, -11514.3643,  -7358.9541],
        [  1997.9314,   1094.6770,    621.3517]])


In [14]:
print(b)
print(b.grad)

tensor([-0.9188, -1.1990], requires_grad=True)
tensor([-137.8451,   19.2918])


## Adjusting weights and biases to reduce the loss

In [15]:
# We wrap the update in torch.no_grad() so that PyTorch does not track the
# in-place changes to w and b as part of the computation graph
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
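
To see why the `torch.no_grad()` block is needed, the following sketch tries the same kind of in-place update with and without it. The tensor `w` here is a small stand-in, and the step size 0.1 is arbitrary.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Outside no_grad, autograd refuses in-place changes to a leaf tensor
# that requires gradients
try:
    w -= 0.1
    update_rejected = False
except RuntimeError:
    update_rejected = True
print(update_rejected)

# Inside no_grad, the update is applied and is not recorded in any graph
with torch.no_grad():
    w -= 0.1

print(w)
print(w.requires_grad)  # w still requires gradients for later passes
```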

In [16]:
# Verifying if the loss is lower

preds = model(inputs)
print(preds)
loss = mse(preds, targets)
print(loss)

tensor([[ -43.2585,   92.5461],
        [ -41.4887,  103.1746],
        [  12.0166,  159.1182],
        [-131.2099,  112.5001],
        [  15.2103,   74.3122]], grad_fn=<AddBackward0>)
tensor(7637.3960, grad_fn=<DivBackward0>)


In [17]:
# Before we proceed, we reset the gradients to zero by calling the .zero_() 
# method. We need to do this because PyTorch accumulates gradients, i.e. the 
# next time we call .backward on the loss, the new gradient values will get 
# added to the existing gradient values, which may lead to unexpected results.
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])
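
Gradient accumulation is easy to observe on a single scalar. In this sketch, `x` is a made-up tensor and the function x² is chosen only because its derivative, 2x, is easy to verify.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x * x).backward()      # d(x^2)/dx at x = 3 is 6
first = x.grad.item()

(x * x).backward()      # a fresh forward pass; its gradient is added on top
accumulated = x.grad.item()

x.grad.zero_()          # reset in place
cleared = x.grad.item()

print(first, accumulated, cleared)  # 6.0 12.0 0.0
```

Without the reset, the second update step would use a gradient twice as large as the correct one.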


## Training the Model using Gradient Descent

As seen above, we reduce the loss and improve our model using the gradient descent optimization algorithm. Thus, we can train the model using the following steps:

 - Generate predictions

 - Calculate the loss

 - Compute gradients w.r.t. the weights and biases

 - Adjust the weights by subtracting a small quantity proportional to the gradient

 - Reset the gradients to zero

In [18]:
# Train for 100 epochs 
for i in range(100):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()

In [19]:
# Calculating Losses
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(592.3265, grad_fn=<DivBackward0>)


In [20]:
preds

tensor([[ 51.1445,  77.0389],
        [ 79.6443,  87.1033],
        [134.1974, 152.9909],
        [-12.8761,  75.4734],
        [117.0995,  72.8562]], grad_fn=<AddBackward0>)

In [21]:
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

## Linear Regression using PyTorch Built-Ins

In [22]:
# Let's begin by importing the torch.nn package from PyTorch, which 
# contains utility classes for building neural networks.

import torch.nn as nn

In [23]:
# Input (temp, rainfall, humidity)
inputs = np.array([[73, 67, 43], 
                   [91, 88, 64], 
                   [87, 134, 58], 
                   [102, 43, 37], 
                   [69, 96, 70], 
                   [74, 66, 43], 
                   [91, 87, 65], 
                   [88, 134, 59], 
                   [101, 44, 37], 
                   [68, 96, 71], 
                   [73, 66, 44], 
                   [92, 87, 64], 
                   [87, 135, 57], 
                   [103, 43, 36], 
                   [68, 97, 70]], 
                  dtype='float32')

# Targets (apples, oranges)
targets = np.array([[56, 70], 
                    [81, 101], 
                    [119, 133], 
                    [22, 37], 
                    [103, 119],
                    [57, 69], 
                    [80, 102], 
                    [118, 132], 
                    [21, 38], 
                    [104, 118], 
                    [57, 69], 
                    [82, 100], 
                    [118, 134], 
                    [20, 38], 
                    [102, 120]], 
                   dtype='float32')

inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs.shape)
print(targets.shape)

torch.Size([15, 3])
torch.Size([15, 2])


In [24]:
inputs

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.],
        [ 74.,  66.,  43.],
        [ 91.,  87.,  65.],
        [ 88., 134.,  59.],
        [101.,  44.,  37.],
        [ 68.,  96.,  71.],
        [ 73.,  66.,  44.],
        [ 92.,  87.,  64.],
        [ 87., 135.,  57.],
        [103.,  43.,  36.],
        [ 68.,  97.,  70.]])

## Dataset and Dataloader

In [25]:
# We'll create a TensorDataset, which allows access to rows from 
# inputs and targets as tuples, and provides standard APIs for working with 
# many different types of datasets in PyTorch.

from torch.utils.data import TensorDataset

In [26]:
# Defining the Dataset
train_ds = TensorDataset(inputs, targets)
train_ds[0:3]

(tensor([[ 73.,  67.,  43.],
         [ 91.,  88.,  64.],
         [ 87., 134.,  58.]]),
 tensor([[ 56.,  70.],
         [ 81., 101.],
         [119., 133.]]))

In [27]:
from torch.utils.data import DataLoader

In [28]:
# Defining Dataloader
batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle = True)

In [29]:
for xb, yb in train_dl:
    print(xb)
    print(yb)
    break

tensor([[ 91.,  88.,  64.],
        [101.,  44.,  37.],
        [103.,  43.,  36.],
        [ 87., 135.,  57.],
        [ 68.,  97.,  70.]])
tensor([[ 81., 101.],
        [ 21.,  38.],
        [ 20.,  38.],
        [118., 134.],
        [102., 120.]])


In each iteration, the data loader returns one batch of data with the given batch size. If shuffle is set to True, it shuffles the training data before creating batches. Shuffling helps randomize the input to the optimization algorithm, which often leads to a faster reduction in the loss.
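
A small sketch of this behaviour, using stand-in data in which each row stores its own index so that the shuffled order is visible. The names `xs`, `ys`, `ds`, and `dl` are introduced here only for the illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data: 15 rows, with the row index stored as the single feature
xs = torch.arange(15, dtype=torch.float32).reshape(15, 1)
ys = xs * 2
ds = TensorDataset(xs, ys)
dl = DataLoader(ds, batch_size=5, shuffle=True)

print(len(dl))  # 3

for epoch in range(2):
    seen = []
    for xb, yb in dl:
        seen.extend(int(v) for v in xb.flatten())
    # The order changes between epochs, but every row appears exactly once
    print(epoch, seen)
```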



## nn.Linear

Instead of initializing the weights & biases manually, we can define the model using the nn.Linear class from PyTorch, which does it automatically.

nn.Linear(in_features, out_features)

Applies a linear transformation to the incoming data, i.e. y = xA^T + b, where A is the weight matrix of shape (out_features, in_features) and b is the bias vector of length out_features. The last dimension of the input must have size in_features. If the input is a matrix, each row is treated as one input sample of the batch.
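
As a sanity check, the sketch below compares the output of an `nn.Linear` layer with the matrix expression used in the from-scratch model earlier. The names `layer` and `x` are stand-ins for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)           # 3 input features, 2 outputs
x = torch.randn(4, 3)             # a batch of 4 rows

out = layer(x)
manual = x @ layer.weight.t() + layer.bias

print(layer.weight.shape, layer.bias.shape)  # torch.Size([2, 3]) torch.Size([2])
print(torch.allclose(out, manual, atol=1e-6))
```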

In [30]:
# Define Model
model = nn.Linear(3,2)
print(model.weight)
print(model.bias)
print(model.weight.shape)
print(model.bias.shape)

Parameter containing:
tensor([[ 0.2883, -0.4666, -0.1476],
        [ 0.2662,  0.4421, -0.1961]], requires_grad=True)
Parameter containing:
tensor([0.2754, 0.1876], requires_grad=True)
torch.Size([2, 3])
torch.Size([2])


PyTorch models also have a helpful .parameters method, which returns an iterator over all the weight and bias tensors present in the model. For our linear regression model, we have one weight matrix and one bias vector.

In [31]:
# Parameters
list(model.parameters())

[Parameter containing:
 tensor([[ 0.2883, -0.4666, -0.1476],
         [ 0.2662,  0.4421, -0.1961]], requires_grad=True),
 Parameter containing:
 tensor([0.2754, 0.1876], requires_grad=True)]
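
For a quick overview of what a model holds, the parameters can also be listed by name and counted. This is a minimal sketch on a freshly created layer; `total` is a name introduced here.

```python
import torch.nn as nn

model = nn.Linear(3, 2)

# named_parameters yields (name, tensor) pairs
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

total = sum(p.numel() for p in model.parameters())
print(total)  # 2*3 weights + 2 biases = 8
```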

In [32]:
# Generate predictions
preds = model(inputs)
print(preds.shape)
preds

torch.Size([15, 2])


tensor([[-16.2863,  40.8119],
        [-23.9941,  50.7704],
        [-45.7243,  71.2187],
        [  4.1567,  39.0990],
        [-34.9540,  47.2734],
        [-15.5314,  40.6361],
        [-23.6751,  50.1322],
        [-45.5836,  71.2888],
        [  3.4018,  39.2749],
        [-35.3898,  46.8110],
        [-15.9673,  40.1737],
        [-23.2393,  50.5945],
        [-46.0434,  71.8569],
        [  4.5925,  39.5613],
        [-35.7088,  47.4492]], grad_fn=<AddmmBackward>)

## Loss Function

Instead of defining a loss function manually, we can use the built-in loss function mse_loss.

In [33]:
# Import nn.functional
import torch.nn.functional as F

In [34]:
# Define loss function
loss_fn = F.mse_loss

In [35]:
loss = loss_fn(model(inputs), targets)
print(loss)

tensor(7491.4233, grad_fn=<MseLossBackward>)
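
The built-in function should agree with the hand-written `mse` from earlier in the notebook. The sketch below checks this on random stand-in tensors rather than on the model's actual predictions.

```python
import torch
import torch.nn.functional as F

def mse(t1, t2):
    diff = t1 - t2
    return torch.sum(diff * diff) / diff.numel()

torch.manual_seed(0)
preds = torch.randn(15, 2)
targets = torch.randn(15, 2)

builtin = F.mse_loss(preds, targets)   # default reduction is 'mean'
manual = mse(preds, targets)
print(torch.allclose(builtin, manual))
```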


## Optimizer

Instead of manually manipulating the model's weights & biases using gradients, we can use the optimizer optim.SGD. SGD is short for "stochastic gradient descent". The term stochastic indicates that samples are selected in random batches instead of as a single group.

In [36]:
# Define optimizer
opt = torch.optim.SGD(model.parameters(), lr=1e-5)

Note that model.parameters() is passed as an argument to optim.SGD so that the optimizer knows which matrices should be modified during the update step. Also, we can specify a learning rate that controls the amount by which the parameters are modified.
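
For plain SGD without momentum or weight decay, `opt.step()` applies the same rule as the manual update used earlier: each parameter is reduced by the learning rate times its gradient. The sketch below checks this on random stand-in data. A learning rate of 0.1 is used here only so that the change is large enough to compare reliably; it is not the notebook's setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
lr = 0.1  # larger than the notebook's 1e-5, for illustration only
model = nn.Linear(3, 2)
opt = torch.optim.SGD(model.parameters(), lr=lr)

x = torch.randn(5, 3)
y = torch.randn(5, 2)

loss = F.mse_loss(model(x), y)
loss.backward()

# Snapshot of the parameters, and what a manual update would produce
before = [p.detach().clone() for p in model.parameters()]
expected = [p.detach() - lr * p.grad for p in model.parameters()]

opt.step()        # the optimizer applies the update to each parameter
opt.zero_grad()   # clear gradients before the next backward pass

matches = all(torch.allclose(p.detach(), e)
              for p, e in zip(model.parameters(), expected))
changed = any(not torch.equal(p.detach(), b0)
              for p, b0 in zip(model.parameters(), before))
print(matches, changed)
```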

## Train the Model

We are now ready to train the model. We'll follow the same process to implement gradient descent:

 - Generate predictions

 - Calculate the loss

 - Compute gradients w.r.t. the weights and biases

 - Adjust the weights by subtracting a small quantity proportional to the gradient

 - Reset the gradients to zero

The only change is that we'll work with batches of data instead of processing the entire training data in every iteration. Let's define a utility function fit that trains the model for a given number of epochs.


In [37]:
# Utility function to train the model
def fit(num_epochs, model, loss_fn, opt, train_dl):
    
    # Repeat for given number of epochs
    for epoch in range(num_epochs):
        
        # Train with the batches of data
        for xb, yb in train_dl:
            
            # 1. Generate Predictions
            pred = model(xb)
            
            # 2. Calculate Loss
            loss = loss_fn(pred, yb)
            
            # 3. Computing the Gradients
            loss.backward()
            
            # 4. Update the parameters using gradients
            opt.step()
            
            # 5. Reset the gradients to zero
            opt.zero_grad()
            
            # Printing the progress
            if (epoch+1) % 10 == 0:
                print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

Some things to note above:

 - We use the data loader defined earlier to get batches of data for every iteration.

 - Instead of updating parameters (weights and biases) manually, we use opt.step to perform the update and opt.zero_grad to reset the gradients to zero.

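 - We've also added a log statement to track training progress. Because it sits inside the batch loop, it prints the loss for each batch of data in every 10th epoch (three lines per logged epoch with 15 rows and a batch size of 5). loss.item returns the value stored in the loss tensor as a Python number.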
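
If one log line per epoch is preferred, a variant along the following lines could be used. This is a sketch, not the notebook's function: `fit_with_epoch_log`, its `log_every` argument, and the returned `history` list are hypothetical names introduced here, and the demo uses only the first five rows of the training data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

def fit_with_epoch_log(num_epochs, model, loss_fn, opt, train_dl, log_every=10):
    """Hypothetical variant of fit: logs the mean batch loss once per epoch."""
    history = []
    for epoch in range(num_epochs):
        batch_losses = []
        for xb, yb in train_dl:
            loss = loss_fn(model(xb), yb)   # predictions and loss
            loss.backward()                 # gradients
            opt.step()                      # parameter update
            opt.zero_grad()                 # reset gradients
            batch_losses.append(loss.item())
        epoch_loss = sum(batch_losses) / len(batch_losses)
        history.append(epoch_loss)
        # Logging happens in the epoch loop, so it runs once per epoch
        if (epoch + 1) % log_every == 0:
            print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch + 1, num_epochs, epoch_loss))
    return history

# Small demo on the first five rows of the training data
torch.manual_seed(0)
inputs = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                       [102., 43., 37.], [69., 96., 70.]])
targets = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                        [22., 37.], [103., 119.]])
train_dl = DataLoader(TensorDataset(inputs, targets), batch_size=5, shuffle=True)
model = nn.Linear(3, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-5)
history = fit_with_epoch_log(20, model, F.mse_loss, opt, train_dl)
```

Returning the per-epoch losses also makes it simple to plot the training curve afterwards.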

Let's train the model for 100 epochs.

In [38]:
fit(100, model, loss_fn, opt, train_dl)

Epoch [10/100], Loss: 757.1930
Epoch [10/100], Loss: 593.8428
Epoch [10/100], Loss: 598.0179
Epoch [20/100], Loss: 461.1000
Epoch [20/100], Loss: 413.1428
Epoch [20/100], Loss: 475.2563
Epoch [30/100], Loss: 435.5168
Epoch [30/100], Loss: 317.0444
Epoch [30/100], Loss: 253.0004
Epoch [40/100], Loss: 27.9852
Epoch [40/100], Loss: 379.7800
Epoch [40/100], Loss: 282.8146
Epoch [50/100], Loss: 171.4773
Epoch [50/100], Loss: 114.6867
Epoch [50/100], Loss: 233.7007
Epoch [60/100], Loss: 59.1532
Epoch [60/100], Loss: 198.0815
Epoch [60/100], Loss: 118.8027
Epoch [70/100], Loss: 64.1531
Epoch [70/100], Loss: 103.1029
Epoch [70/100], Loss: 122.5319
Epoch [80/100], Loss: 77.6288
Epoch [80/100], Loss: 82.1988
Epoch [80/100], Loss: 77.7008
Epoch [90/100], Loss: 54.8107
Epoch [90/100], Loss: 42.5772
Epoch [90/100], Loss: 85.4586
Epoch [100/100], Loss: 62.1815
Epoch [100/100], Loss: 59.2382
Epoch [100/100], Loss: 31.9695


Let's generate predictions using our model and verify that they're close to our targets.

In [39]:
# Generate the predictions
preds = model(inputs)
preds

tensor([[ 58.8197,  71.7688],
        [ 80.9237,  96.1433],
        [117.3367, 140.3239],
        [ 31.4714,  45.2254],
        [ 93.7960, 106.5507],
        [ 57.8133,  70.7297],
        [ 80.4798,  95.3810],
        [117.5378, 140.4921],
        [ 32.4779,  46.2644],
        [ 94.3586, 106.8275],
        [ 58.3758,  71.0065],
        [ 79.9173,  95.1042],
        [117.7805, 141.0862],
        [ 30.9089,  44.9486],
        [ 94.8024, 107.5897]], grad_fn=<AddmmBackward>)

In [40]:
# Comparing it with targets
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.],
        [ 57.,  69.],
        [ 80., 102.],
        [118., 132.],
        [ 21.,  38.],
        [104., 118.],
        [ 57.,  69.],
        [ 82., 100.],
        [118., 134.],
        [ 20.,  38.],
        [102., 120.]])

Indeed, the predictions are quite close to our targets. We have trained a reasonably good model to predict crop yields for apples and oranges by looking at the average temperature, rainfall, and humidity in a region. We can use it to make predictions of crop yields for new regions by passing a batch containing a single row of input.

In [41]:
model(torch.tensor([[75, 63, 44.]]))

tensor([[55.5373, 67.9437]], grad_fn=<AddmmBackward>)

The predicted yield of apples is about 55.5 tons per hectare, and that of oranges is about 67.9 tons per hectare.

## Machine Learning vs. Classical Programming

The approach we've taken in this tutorial is very different from programming as you might know it. Usually, we write programs that take some inputs, perform some operations, and return a result.

However, in this notebook, we've defined a "model" that assumes a specific relationship between the inputs and the outputs, expressed using some unknown parameters (weights & biases). We then show the model some known inputs and outputs and train the model to come up with good values for the unknown parameters. Once trained, the model can be used to compute the outputs for new inputs.

This paradigm of programming is known as machine learning, where we use data to figure out the relationship between inputs and outputs. Deep learning is a branch of machine learning that uses matrix operations, non-linear activation functions and gradient descent to build and train models. Andrej Karpathy, then the director of AI at Tesla, has written a great blog post on this topic, titled [Software 2.0](https://medium.com/@karpathy/software-2-0-a64152b37c35).

This picture from the book [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python) by François Chollet captures the difference between classical programming and machine learning:

![image](https://i.imgur.com/oJEQe7k.png)