### 0. ML Background
A. Train-Validate-Test Paradigm
- Traditionally, a dataset is collected for a specific task.
- This dataset is split into training, validation, and test sets, commonly in an 80/10/10 ratio.
- We fit a model's parameters using the training set.
- Then, we compare hyperparameter settings on the validation set and select the best one.
- Finally, the test set is used once to determine the final performance.
- Holding out data this way helps detect and mitigate overfitting to the training data, which would otherwise prevent our model from generalizing well.
- Let's walk through a simple example.
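
As a minimal sketch of the split described above, the snippet below shuffles sample indices and cuts them into three disjoint groups. The helper name `train_val_test_split`, the dataset size of 1000, and the fixed seed are assumptions chosen for illustration.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Return three disjoint index arrays that together cover range(n_samples)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle before splitting
    n_val = int(round(n_samples * val_frac))
    n_test = int(round(n_samples * test_frac))
    n_train = n_samples - n_val - n_test
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx

# Hypothetical dataset of 1000 samples, split 80/10/10.
train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Shuffling first matters when the data is stored in some order (for example, sorted by label); the test indices should then be set aside until the very end.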

B. Fine-Tuning
- We previously treated a neural net as a single black box. Nowadays, this black box is VERY BIG.
- These BIG black boxes tend to perform decently well on a variety of tasks, even without being fit to a specific task.
- Yet these models can perform even better if we add a single additional layer at the end of the neural net, which we "fine-tune" to our task of interest. (Adding a smaller black box after the bigger black box.)
- Because only the small added layer is trained, this needs far fewer computing resources than training the whole model from scratch.
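
One common way to realize this is to freeze the pretrained network and train only a small added layer. The sketch below assumes the added layer sits on the output side, and it uses a tiny randomly initialized `backbone` as a hypothetical stand-in for a real pretrained model. All names, sizes, and the toy batch are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a large pretrained network.
backbone = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())

# Freeze the pretrained weights so they receive no gradient updates.
for param in backbone.parameters():
    param.requires_grad = False

# Small task-specific layer: the only part that is trained.
head = nn.Linear(64, 3)
model = nn.Sequential(backbone, head)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

# Random toy batch standing in for task data.
inputs = torch.randn(32, 20)
labels = torch.randint(0, 3, (32,))

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = head.weight.clone()

for _ in range(5):
    loss = criterion(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the head's parameters are handed to the optimizer, so the frozen part contributes a forward pass but no weight updates.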

C. Few Shot/Zero Shot Settings
- With newer models like GPT-3 and OPT, which contain up to 175B parameters, these black boxes are EVEN BIGGER.
- It turns out that these models perform well on many tasks, even without fitting or fine-tuning a model to that task.
- In the few shot setting, we provide the model with one or a few examples, and ask the model to complete the task.
- In the zero shot setting, we ask the model to complete the task without seeing any training examples specific to the task.
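
To make the difference concrete, here is a small sketch of how the two kinds of prompt might be assembled as plain text. The helper `build_prompt`, the `Input:`/`Output:` layout, and the sentiment task are assumptions for illustration; no particular model API is implied.

```python
def build_prompt(task_description, query, examples=()):
    """Assemble a text prompt; `examples` is a sequence of (input, output) pairs."""
    lines = [task_description, ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

task = "Classify the sentiment of each review as positive or negative."
query = "The plot dragged on forever."

# Zero shot: task description and query only.
zero_shot_prompt = build_prompt(task, query)

# Few shot: the same query, preceded by a couple of solved examples.
few_shot_prompt = build_prompt(
    task,
    query,
    examples=[("I loved every minute.", "positive"),
              ("A waste of two hours.", "negative")],
)
print(few_shot_prompt)
```

In both cases the model's weights stay fixed; the examples only appear in the input text.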

### 1. Learning Checkpoint: Classical ML
A. Linear Regression
- Given: training data $(x_1, t_1), ..., (x_n, t_n)$ with inputs $x_i \in R$ and targets $t_i \in R$.
- Goal: fit an order-$M$ polynomial function that is linear with respect to the weights $w \in R^{M+1}$:
$$y(x, w) = w_0 + w_1 x + ... + w_M x^M$$
- To optimize, we need to find an optimal $w^*$ that minimizes the least squares loss function, where row $i$ of the design matrix $X$ is $(1, x_i, ..., x_i^M)$:
$$E(w) = \frac{1}{2} \sum_{i=1}^n (y(x_i, w) - t_i)^2 = \frac{1}{2} ||Xw-t||_2^2$$
- To minimize, we move in the direction of steepest descent (negative gradient):
$$-\nabla_w E(w) = X^T t - X^T X w$$
- At inference time, we simply use our linear model:
$$\hat{y} = Xw^*$$
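
The update rule above can be sketched in NumPy as follows. The synthetic quadratic data, the step size, and the iteration count are assumptions picked so that plain gradient descent converges; the result is compared against NumPy's least-squares solver as a sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known quadratic plus small noise (illustrative only).
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.normal(size=50)

M = 2
# Design matrix with columns [1, x, x^2, ..., x^M].
X = np.vander(x, M + 1, increasing=True)

# Reference solution from a least-squares solver.
w_closed, *_ = np.linalg.lstsq(X, t, rcond=None)

# Gradient descent: step along the negative gradient X^T t - X^T X w.
w = np.zeros(M + 1)
learning_rate = 0.01
for _ in range(5000):
    negative_gradient = X.T @ t - X.T @ X @ w
    w = w + learning_rate * negative_gradient

print("least squares:   ", w_closed)
print("gradient descent:", w)

# Inference: apply the fitted weights to the design matrix.
y_hat = X @ w
```

Because the loss is convex, setting the gradient to zero also gives the normal equations $X^T X w^* = X^T t$, which is what the solver handles directly.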

In [None]:
# Necessary Imports
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Hyper-parameters
input_size = 1
output_size = 1
num_epochs = 60
learning_rate = 0.001

In [None]:
# Toy dataset
x_train = np.array([[3.3], [4.4], [5.5], [6.71], [6.93], [4.168], 
                    [9.779], [6.182], [7.59], [2.167], [7.042], 
                    [10.791], [5.313], [7.997], [3.1]], dtype=np.float32)

y_train = np.array([[1.7], [2.76], [2.09], [3.19], [1.694], [1.573], 
                    [3.366], [2.596], [2.53], [1.221], [2.827], 
                    [3.465], [1.65], [2.904], [1.3]], dtype=np.float32)

In [None]:
# Linear regression model
model = nn.Linear(input_size, output_size, bias=True)

In [None]:
# Plot the dataset
predicted = model(torch.from_numpy(x_train)).detach().numpy()
plt.plot(x_train, y_train, 'ro', label='Original data')
plt.legend()
plt.show()

In [None]:
# Loss and optimizer
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  

In [None]:
# Train the model
for epoch in range(num_epochs):
    # Convert numpy arrays to torch tensors
    inputs = torch.from_numpy(x_train)
    targets = torch.from_numpy(y_train)

    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch+1) % 5 == 0:
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, loss.item()))

In [None]:
# Plot the graph
predicted = model(torch.from_numpy(x_train)).detach().numpy()
plt.plot(x_train, y_train, 'ro', label='Original data')
plt.plot(x_train, predicted, label='Fitted line')
plt.legend()
plt.show()

B. Logistic Classifier
- Building on our linear model, we now map the real-valued output to a binary classification. Specifically, we apply the sigmoid activation:
$$\theta(s) = \frac{\exp(s)}{1 + \exp(s)} = \frac{1}{1 + \exp(-s)}$$
- We define the classifier as follows, with a default decision threshold of 0.5 (predict the positive class when $h(x) \geq 0.5$):
$$h(x) = \theta(w^T x)$$
- Once again, we want to find the optimal $w^*$, which we do using (stochastic) gradient descent, moving in the direction of steepest descent. For practicality, we let PyTorch handle the math. The code below uses the multiclass generalization of this model (softmax over the 10 MNIST digit classes).
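
As a small worked sketch of the binary classifier defined above, the snippet below applies the sigmoid to $w^T x$ and thresholds at 0.5. The weights and inputs are hypothetical values chosen by hand, with the first column of `X` fixed to 1 so that `w[0]` acts as a bias.

```python
import numpy as np

def sigmoid(s):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-s))

def predict(X, w, threshold=0.5):
    """Return class-1 probabilities and hard 0/1 labels for each row of X."""
    probabilities = sigmoid(X @ w)
    labels = (probabilities >= threshold).astype(int)
    return probabilities, labels

# Hypothetical weights and inputs, for illustration only.
w = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [1.0, 2.0]])

probabilities, labels = predict(X, w)
print(probabilities)
print(labels)
```

Since the sigmoid equals 0.5 exactly at 0, thresholding the probability at 0.5 is the same as checking the sign of $w^T x$, so the decision boundary is still linear in $x$.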


In [None]:
#Necessary Imports
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

In [None]:
# Hyper-parameters 
input_size = 28 * 28    # 784
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001

In [None]:
# MNIST dataset (images and labels)
train_dataset = torchvision.datasets.MNIST(root='../../data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data', 
                                          train=False, 
                                          transform=transforms.ToTensor())

In [None]:
# Data loader (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

In [None]:
# Logistic regression model
model = nn.Linear(input_size, num_classes)

In [None]:
# Loss and optimizer
# nn.CrossEntropyLoss() computes softmax internally
criterion = nn.CrossEntropyLoss()  
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  

In [None]:
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Reshape images to (batch_size, input_size)
        images = images.reshape(-1, input_size)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

In [None]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, input_size)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        # .item() converts the count to a Python int so the accuracy prints as a number
        correct += (predicted == labels).sum().item()

    print('Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

C. Feed Forward Neural Net
- Think of this as a nested logistic classifier.
- If a hidden layer has $n$ nodes, we have $n$ independent logistic classifiers, each with a different set of weights.
- To get the final prediction, we have another logistic classifier, taking the outputs of the hidden layer (for a single hidden layer network).
- Let's draw this out.
- This is powerful because the activation is a nonlinear function (sigmoid in this description; the code below uses ReLU), which lets us expand our modelling space beyond linear transformations.
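
To see why the nonlinearity matters, the sketch below checks numerically that two stacked linear layers with no activation in between collapse into a single linear layer, while inserting a sigmoid breaks that equivalence. The layer sizes and random weights are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Random weights for a 4 -> 8 -> 3 network (sizes chosen arbitrarily).
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)
x = rng.normal(size=(5, 4))  # batch of 5 inputs

# Two linear layers with no activation in between.
stacked_linear = (x @ W1.T + b1) @ W2.T + b2

# The same map written as ONE linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = x @ W.T + b

# With a sigmoid between the layers, the single-layer rewrite no longer matches.
with_sigmoid = sigmoid(x @ W1.T + b1) @ W2.T + b2

print(np.allclose(stacked_linear, collapsed))
print(np.allclose(with_sigmoid, collapsed))
```

Without the activation, extra layers add parameters but no extra expressive power.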

In [None]:
# Necessary Imports
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

In [None]:
# Device configuration
device = torch.device('cpu')

In [None]:
# Hyper-parameters 
input_size = 784
hidden_size = 500
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001

In [None]:
# MNIST dataset 
train_dataset = torchvision.datasets.MNIST(root='../../data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),  
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='../../data', 
                                          train=False, 
                                          transform=transforms.ToTensor())

In [None]:
# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

In [None]:
# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

In [None]:
# Instantiate model
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) 

In [None]:
# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):  
        # Move tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, num_epochs, i+1, total_step, loss.item()))

In [None]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))

### 2. Knowledge Check 
  - Linear Regression
    - What is the task we are trying to accomplish?
    - What is the training objective that aims to solve this task?
    - How do we use the training objective to find an optimal solution (in the convex case)?
  - Logistic Classifier
    - How does the logistic classifier relate to linear regression?
    - How do we achieve a binary output?
  - Feed Forward Neural Net
    - How does the feed forward neural net relate to the logistic classifier?
    - Why do we use nonlinear activation functions in FFNNs?
  - Coding Abstractions 
    - What is a data loader?
    - What are the forward and backward steps?