# PyTorch for Linear Regression

We use linear regression as a small, familiar example so we can understand how PyTorch works.

In [1]:
import torch
import numpy as np
import sys

In [2]:
torch.__version__

'1.12.0'

In [3]:
torch.cuda.is_available()

False

In [4]:
#as you all know, things can speed up if you have an NVIDIA GPU
#CUDA is the framework that NVIDIA develops, which allows us to use the GPU for calculations

#my machine is a Mac without CUDA, so this falls back to the CPU

device = torch.device("cuda:0" if (torch.cuda.is_available()) else "cpu")
device

device(type='cpu')
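
To make the role of `device` concrete, here is a minimal sketch of how it is typically used afterwards: tensors and layers are moved with `.to(device)`. The names `x`, `x_dev` and `layer` are made up for this illustration and are not part of the notebook.

```python
import torch

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors are created on the CPU by default; .to(device) returns a tensor on `device`.
x = torch.ones(2, 3)
x_dev = x.to(device)

# A layer's parameters are moved the same way.
layer = torch.nn.Linear(3, 2).to(device)

print(x_dev.device, layer.weight.device)
```

On a machine without CUDA both prints show `cpu`, so the same code runs unchanged on either kind of machine.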

Plan for today:

1. ETL 
   1. Specifying some input
   2. PyTorch Dataset and DataLoader
2. EDA - we gonna just skip because we are lazy...
3. Feature Engineering / Cleaning - which we don't need to....
4. Modeling 
   1. `nn.Linear` (luckily, you already understand this!  Yay!)
   2. Define loss function (mse for regression, ce for classification)
   3. Define the optimizer function (gradient descent ; adam)
   4. Train the model
5. Inference / Testing

## 1. ETL

### 1.1 Specify some input

Consider this data:

<img src = "../figures/japan.png" width="500">

In a linear regression model, each target variable is estimated to be a weighted sum of the input variables, offset by some constant, known as the bias:

$$\text{yield}_\text{apple}  = w_{11} * \text{temp} + w_{12} * \text{rainfall} + w_{13} * \text{humidity} + b_{1}$$

$$\text{yield}_\text{orange} = w_{21} * \text{temp} + w_{22} * \text{rainfall} + w_{23} * \text{humidity} + b_{2}$$

Visually, it means that the yield of apples is a linear or planar function of temperature, rainfall and humidity:

<img src = "../figures/japan2.png" width="400">

The learning part of linear regression is to figure out a set of weights <code>w11, w12, ..., w23</code> and biases <code>b1, b2</code> using gradient descent.
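
The two equations above are really one matrix-vector product plus a bias vector. Here is a small sketch with hypothetical weights and biases (the numbers in `W` and `b` are invented just to make the arithmetic easy to follow; they are not learned values):

```python
import torch

# One example row of inputs: (temp, rainfall, humidity).
x = torch.tensor([73.0, 67.0, 43.0])

# Hypothetical parameters, for illustration only.
# Row 0 holds (w11, w12, w13) for apples, row 1 holds (w21, w22, w23) for oranges.
W = torch.tensor([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])
b = torch.tensor([1.0, 2.0])  # (b1, b2)

# Both equations at once: yields = W x + b
yields = W @ x + b

# The apple equation written out term by term, for comparison.
apple_manual = W[0, 0] * x[0] + W[0, 1] * x[1] + W[0, 2] * x[2] + b[0]

print(yields, apple_manual)
```

The first entry of `yields` matches `apple_manual`, which is the apple equation evaluated by hand.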

In [7]:
#X (temp, rainfall, hum)
X_train = np.array([[73, 67, 43], [91, 88, 64], [87, 134, 58], 
                   [102, 43, 37], [69, 96, 70], [73, 67, 43], 
                   [91, 88, 64], [87, 134, 58], [102, 43, 37], 
                   [69, 96, 70], [73, 67, 43], [91, 88, 64], 
                   [87, 134, 58], [102, 43, 37], [69, 96, 70]], 
                  dtype='float32')

# Targets (apples, oranges)
Y_train = np.array([[56, 70], [81, 101], [119, 133], 
                    [22, 37], [103, 119], [56, 70], 
                    [81, 101], [119, 133], [22, 37], 
                    [103, 119], [56, 70], [81, 101], 
                    [119, 133], [22, 37], [103, 119]], 
                   dtype='float32')

In [8]:
#please create tensors from these numpy arrays
#torch.from_numpy (shares memory with the array, NOT a copy)  or torch.tensor  (always makes a copy)
inputs  = torch.tensor(X_train)
targets = torch.tensor(Y_train)

#please print the shape of these tensors
#use either .size() or .shape
inputs.shape, targets.shape


(torch.Size([15, 3]), torch.Size([15, 2]))
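
The difference between `torch.from_numpy` and `torch.tensor` is easy to check directly. This is a minimal sketch on a throwaway array `a` (a made-up name, not part of the data above): we change the NumPy array in place and see which tensor follows the change.

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0], dtype="float32")

shared = torch.from_numpy(a)  # shares memory with `a`
copied = torch.tensor(a)      # independent copy of the data

a[0] = 100.0  # modify the NumPy array in place

print(shared)  # the first element follows the change
print(copied)  # the first element keeps its old value
```

So `torch.tensor` is the safer default when you do not want later changes to the NumPy array to leak into the tensor.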

### 1.2 Dataset

We're going to create a `TensorDataset` on top of these tensors, so that each row of inputs and targets can be accessed as a tuple.

Note:  `DataLoader` expects a `Dataset` object, and `TensorDataset` is the simplest way to get one from tensors.

In [9]:
from torch.utils.data import TensorDataset

In [10]:
#put this dataset on top of our inputs and targets
#format: TensorDataset(X, y) where X.shape is (m, n) and y.shape is (m, k)
ds = TensorDataset(inputs, targets)

In [21]:
ds[1] #this is a tuple of two tensors, the x and the corresponding y
#this (x, y) tuple IS THE FORMAT that DataLoader expects!!!

(tensor([91., 88., 64.]), tensor([ 81., 101.]))
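
A short sketch of what else a `TensorDataset` supports, using small stand-in tensors (`X`, `Y` and `toy_ds` are hypothetical names with made-up values, so the notebook's `ds` is left untouched):

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in data: 4 samples with 3 features and 2 targets each.
X = torch.arange(12, dtype=torch.float32).reshape(4, 3)
Y = torch.arange(8, dtype=torch.float32).reshape(4, 2)
toy_ds = TensorDataset(X, Y)

n_samples = len(toy_ds)  # number of rows
x0, y0 = toy_ds[0]       # a single (x, y) pair
xs, ys = toy_ds[0:2]     # slicing gives a tuple of stacked tensors

print(n_samples, x0.shape, y0.shape, xs.shape, ys.shape)
```

Indexing applies the same index to every wrapped tensor, which is why `X` and `Y` must have the same number of rows.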

### 1.3 DataLoader

By default, training in PyTorch works in mini-batches (remember mini-batch gradient descent!).

In simple words, each training step takes one mini-batch and performs gradient descent on it.

Why does PyTorch assume mini-batches? Because it assumes you won't be able to fit, say, ~1M samples into your GPU RAM at once, especially when each sample is itself a large tensor (e.g. of shape (3, 4, 6, 11, 12, 64)).

In [36]:
#a DataLoader is an iterable over mini-batches
#which means you can simply write a for loop over this dataloader
#if you DON'T WANT TO use this DataLoader, it's fine!  But then you have
#to select the mini-batches manually (just like we did in our LR mini-batch class)
from torch.utils.data import DataLoader  #batches come in random order (if you set shuffle=True)

batch_size = 3  #this is any number you like
#too small and your code runs slowly
#too big and you may get an "out of memory" error

dl = DataLoader(ds, batch_size, shuffle=True)


In [37]:
#now, this dl is an iterable that we can loop over....

# for x, y in dl:
#     print(f"X: {x}")  
#     print(f"Y: {y}")
#     break

#each `for` loop over dl creates a fresh iterator that keeps track of where it currently is
#one full pass over dl visits every sample once and then stops; that is one "epoch"
#the number of epochs is how many times we "exhaust" the whole dataset,
#so to train for several epochs we put this loop inside an outer loop over epochs
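
To see how batches and epochs fit together, here is a small sketch that counts the batches in each pass. It uses random stand-in data of the same size as ours (15 samples), and the names `toy_dl` and `batches_per_epoch` are made up for this example.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Random stand-in data with the same shapes as our inputs and targets.
X = torch.randn(15, 3)
Y = torch.randn(15, 2)
toy_dl = DataLoader(TensorDataset(X, Y), batch_size=3, shuffle=True)

num_epochs = 2
batches_per_epoch = []
for epoch in range(num_epochs):
    n_batches = 0
    for xb, yb in toy_dl:  # each `for` starts a new pass over the data
        n_batches += 1
    batches_per_epoch.append(n_batches)

# 15 samples with batch_size=3 give 5 batches per epoch.
print(len(toy_dl), batches_per_epoch)
```

`len(toy_dl)` is the number of batches per epoch, not the number of samples.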

## 2. EDA - skip because we are lazy

## 3. Modeling

### 3.1 Define our neural network

- How many linear layers do we want?

In [38]:
import torch.nn as nn #stands for neural network; a module that contains many possible layers
#define our neural network
#just use one layer....
#we're gonna come back here and add one more layer....
#format: nn.Linear(in_features, out_features)
#format: nn.Linear(temp;rainfall;hum  ,  apples;oranges)
model = nn.Linear(3, 2)

#a linear layer is basically a simple matrix multiplication plus a bias....
#Many other names:  in Keras it is called Dense; elsewhere it is often called a fully connected layer

#Keras is very high-level - less suited to research / development (mainly for education...)
#TensorFlow is developed by Google - it's quite good

#for very huge, complex, high-performance models - TensorFlow is often considered better optimized
    #it is more low-level than PyTorch
#for almost any model that we generally use (even in research) - PyTorch is more convenient
# due to its dynamic computational graph.....

#TensorFlow also supports something called TensorFlow Lite, which is what
#you would use for mobile phones....

In [39]:
model.weight #by default, these weights are drawn uniformly from (-1/sqrt(in_features), 1/sqrt(in_features)), so they are small

Parameter containing:
tensor([[-0.0085, -0.0723,  0.1650],
        [-0.0224, -0.3495,  0.0397]], requires_grad=True)

In [41]:
model.weight.shape  #this one is in the shape (out_features, in_features)

#you can imagine X @ W^T
#after you transpose W, W^T becomes [3, 2]
#which now you can do X @ W^T which is (anything, 3) @ (3, 2)

torch.Size([2, 3])
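
The `X @ W^T` picture can be checked numerically. Here is a minimal sketch on a fresh layer and random inputs (`layer` and `x` are stand-in names, so the notebook's `model` is not touched):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)     # fixed seed so the run is repeatable
layer = nn.Linear(3, 2)  # weight has shape (2, 3), bias has shape (2,)
x = torch.randn(15, 3)

out_layer = layer(x)                          # what nn.Linear computes
out_manual = x @ layer.weight.T + layer.bias  # (15, 3) @ (3, 2) + (2,)

print(out_layer.shape)
print(torch.allclose(out_layer, out_manual, atol=1e-5))
```

The bias of shape `(2,)` is broadcast across all 15 rows, which is why a single bias per output is enough.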

In [42]:
model.bias  #why two biases???  one per output (b1 for apples, b2 for oranges)

Parameter containing:
tensor([-0.2132,  0.2043], requires_grad=True)

In [44]:
list(model.parameters())  #parameters() returns a generator, so we wrap it in list() to see all the parameters

[Parameter containing:
 tensor([[-0.0085, -0.0723,  0.1650],
         [-0.0224, -0.3495,  0.0397]], requires_grad=True),
 Parameter containing:
 tensor([-0.2132,  0.2043], requires_grad=True)]

In [47]:
#p.numel() returns the number of elements in the tensor p
sum(p.numel() for p in model.parameters() if p.requires_grad)

#why 8 here??? - 6 weights and 2 biases.....

8
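
If you want to see where the 8 comes from, `named_parameters()` gives the count per parameter. A small sketch on a fresh layer (`layer`, `counts` and `total` are names made up for this example):

```python
import torch.nn as nn

layer = nn.Linear(3, 2)

# Number of elements per named parameter.
counts = {name: p.numel() for name, p in layer.named_parameters()}
total = sum(counts.values())

print(counts, total)  # {'weight': 6, 'bias': 2} 8
```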

In [48]:
#so how do we use our model
model

Linear(in_features=3, out_features=2, bias=True)

In [51]:
#so you can perform a forward pass, simply using 
#format: model(inputs)

print("Inputs: ", inputs.shape)

output = model(inputs)  #(15, 3) @ (3, 2) + bias = (15, 2)
print(output.shape)  #why is output.shape (15, 2)??  one row per sample, one column per target

Inputs:  torch.Size([15, 3])
torch.Size([15, 2])


### 3.2 Define the loss function

- Should we use MSE or cross entropy? This is a regression problem, so MSE.

In [None]:
#under the nn module, there are many loss functions
J_fn = nn.MSELoss()

#later on, we will see how to use this.....
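
For intuition, here is a minimal sketch of what `nn.MSELoss()` computes with its default settings: the mean of the squared differences over all elements. The tensors `pred` and `target` hold made-up numbers chosen so the result is easy to check by hand.

```python
import torch
import torch.nn as nn

pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.tensor([[1.0, 0.0], [0.0, 4.0]])

loss_fn = nn.MSELoss()
loss = loss_fn(pred, target)

# Same thing by hand: squared errors are 0, 4, 9, 0, so the mean is 13 / 4.
manual = ((pred - target) ** 2).mean()

print(loss.item(), manual.item())  # 3.25 3.25
```

Note that the mean is taken over every element (rows and columns), not just over the rows.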

### 3.3 Define the optimizer
- Gradient Descent

In [None]:
#normally, in sklearn, we simply call fit, and it handles the optimization for us
#in code from scratch, we need to specify how we want to update the parameters
#the optimizer handles how we update the parameters
#   if we use w = w - alpha * gradient ==> gradient descent
#optimizers are provided by the `torch.optim` module
#torch.optim.SGD does NOT pick one sample itself; it uses whatever gradients we computed,
#so together with our DataLoader it is basically mini-batch gradient descent
optim = torch.optim.SGD(model.parameters(), lr=0.0001)
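
To connect `torch.optim.SGD` with the update rule `w = w - alpha * gradient`, here is a sketch of a single step on a fresh layer with random data. The names `layer`, `opt`, `lr` and `expected_w` are made up for this check, and the learning rate of 0.1 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable
layer = nn.Linear(3, 2)
lr = 0.1              # arbitrary learning rate for this check
opt = torch.optim.SGD(layer.parameters(), lr=lr)

x = torch.randn(4, 3)
y = torch.randn(4, 2)

loss = nn.MSELoss()(layer(x), y)
opt.zero_grad()  # clear any old gradients
loss.backward()  # fill layer.weight.grad and layer.bias.grad

# What plain gradient descent would give: w - lr * gradient
expected_w = layer.weight.detach() - lr * layer.weight.grad

opt.step()       # the optimizer applies the update in place

print(torch.allclose(layer.weight.detach(), expected_w, atol=1e-6))
```

With no momentum or weight decay, `step()` is exactly this update applied to every parameter passed to the optimizer.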

### 3.4 Actually train the model

- 1. Predict
- 2. Loss
- 3. Gradient
- 4. Update the weights
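
Putting the four steps together, a training loop could look like the sketch below. It is self-contained, so it rebuilds the data, model, loss and optimizer instead of reusing the cells above. Two things are assumptions of this sketch rather than settings from the notebook: the learning rate of `1e-5` (chosen smaller to keep this toy run stable on unscaled inputs) and the 100 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)  # fixed seed so the run is repeatable

# The five distinct rows of our data, repeated three times (15 rows in total).
X = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                  [102., 43., 37.], [69., 96., 70.]]).repeat(3, 1)
Y = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                  [22., 37.], [103., 119.]]).repeat(3, 1)

dl = DataLoader(TensorDataset(X, Y), batch_size=3, shuffle=True)
model = nn.Linear(3, 2)
J_fn = nn.MSELoss()
# Assumption: lr=1e-5 and 100 epochs are choices made for this sketch only.
optim = torch.optim.SGD(model.parameters(), lr=1e-5)

with torch.no_grad():
    initial_loss = J_fn(model(X), Y).item()

num_epochs = 100
for epoch in range(num_epochs):
    for xb, yb in dl:
        yhat = model(xb)       # 1. predict
        loss = J_fn(yhat, yb)  # 2. loss
        optim.zero_grad()      # clear gradients left over from the previous step
        loss.backward()        # 3. gradient
        optim.step()           # 4. update the weights

with torch.no_grad():
    final_loss = J_fn(model(X), Y).item()

print(f"loss before: {initial_loss:.2f}, after: {final_loss:.2f}")
```

The call to `zero_grad()` matters because PyTorch accumulates gradients across `backward()` calls; without it, each step would use the sum of all previous gradients.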