# INTERMEDIATE PREREQUISITES 

## Deep Learning Introduction

### Deep Learning: What, Why, When, How


#### **What is Deep Learning?**

Deep Learning is a subset of Machine Learning that uses artificial neural networks with multiple layers (deep neural networks) to model and solve complex problems. It's inspired by the structure and function of the human brain.

#### **Why Deep Learning?**

1. **Automatic Feature Extraction**: Deep learning models can automatically learn features from raw data, reducing the need for manual feature engineering.
2. **Scalability**: Performance often improves with more data and larger models.
3. **Versatility**: Applicable to a wide range of problems, from image and speech recognition to natural language processing and game playing.
4. **State-of-the-art Performance**: Achieves top results in many domains, and matches or exceeds human-level performance on some specific benchmarks.

#### **When to Use Deep Learning?**

- **Large Amounts of Data**: Deep learning thrives on big data.
- **Complex Problems**: When traditional ML methods struggle with high-dimensional or highly nonlinear problems.
- **Unstructured Data**: Especially effective for images, audio, and text.
- **Time Series and Sequential Data**: RNNs and LSTMs excel at these tasks.
- **Autonomous Systems**: In robotics, self-driving cars, etc.

#### **How Does Deep Learning Work?**

1. **Data Preparation**: Collect and preprocess large datasets.
2. **Model Architecture Design**: Choose and configure a suitable neural network architecture.
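3. **Training**:
   - Feed data through the network (forward propagation)
   - Calculate the loss
   - Compute gradients of the loss with respect to the weights (backpropagation)
   - Update the weights with an optimizer (e.g., gradient descent)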
4. **Evaluation**: Test the model on unseen data.
5. **Deployment**: Use the trained model for predictions on new data.


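### Advantages of Deep Learning over Traditional Machine Learning

In the comparison below, "ML" refers to traditional (non-deep) machine learning methods.
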
1. **Feature Learning**: 
   - ML: Often requires manual feature engineering
   - DL: Automatically learns relevant features

2. **Performance with Large Data**: 
   - ML: Performance often plateaus as the amount of data increases
   - DL: Often continues to improve with more data and larger models

3. **Handling Unstructured Data**: 
   - ML: Struggles with raw unstructured data
   - DL: Excels at processing raw images, audio, and text

4. **Scalability**: 
   - ML: Often requires redesigning as problem complexity increases
   - DL: Can scale to very complex problems by adding layers/neurons

5. **Transfer Learning**: 
   - ML: Limited transfer learning capabilities
   - DL: Pretrained models can be fine-tuned for new tasks efficiently

6. **Parallel Processing**: 
   - ML: Limited parallelization options
   - DL: Highly parallelizable, leveraging GPUs for faster computation

However, deep learning also has disadvantages, including:
- Requires large amounts of data
- Computationally intensive
- Less interpretable ("black box" nature)
- Prone to overfitting without proper regularization

The choice between ML and DL depends on the specific problem, available data, and computational resources.

## PyTorch Introduction

PyTorch is an open-source machine learning library originally developed by Facebook's AI Research lab (now Meta AI). It provides a flexible and efficient platform for building and training neural networks, making it a popular choice among researchers and developers in the field of deep learning.

### Key Features of PyTorch

1. **Dynamic Computational Graphs**: PyTorch uses a dynamic computation graph, allowing for more flexible model architectures and easier debugging.

2. **Pythonic Interface**: PyTorch provides a natural, intuitive interface that aligns well with Python programming practices.

3. **GPU Acceleration**: Built-in support for CUDA enables seamless utilization of GPU resources for faster computations.

4. **Autograd System**: Automatic differentiation engine that enables automatic computation of gradients, simplifying the implementation of backpropagation.

5. **Rich Ecosystem**: Extensive libraries and tools for various deep learning tasks, including computer vision, natural language processing, and reinforcement learning.
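
As a minimal sketch of the dynamic-graph and autograd features listed above, the example below uses ordinary Python control flow whose length depends on a runtime value. The helper name `double_until_large` and the threshold of 10 are made up for illustration.

```python
import torch

def double_until_large(x, threshold=10.0):
    """Double x until its norm reaches the threshold (hypothetical helper)."""
    y = x
    steps = 0
    # Plain Python loop: the graph is recorded as the operations actually run
    while y.norm() < threshold:
        y = y * 2
        steps += 1
    return y.sum(), steps

x = torch.tensor([1.0, 1.0], requires_grad=True)
out, steps = double_until_large(x)
out.backward()

print(steps)   # number of doublings that were executed
print(x.grad)  # each entry equals 2 ** steps
```

Because the graph is rebuilt on every call, a different input could take a different number of loop iterations and autograd would still differentiate through exactly the operations that ran.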

## PyTorch Functional Overview

PyTorch's functionality can be broadly categorized into several key areas:

### 1. Tensor Operations

At the core of PyTorch are tensors, multi-dimensional arrays similar to NumPy's ndarrays but with the ability to run on GPUs.

In [1]:
import torch

# Creating tensors
x = torch.tensor([1, 2, 3])
y = torch.rand(3, 3)  # 3x3 tensor of uniform random values in [0, 1)
print(y.shape)

# Basic operations
z = x + y  # x (shape [3]) is broadcast across every row of y
print(z)
w = torch.matmul(y, y)
print(w)

torch.Size([3, 3])
tensor([[1.9458, 2.7273, 3.1516],
        [1.6086, 2.0129, 3.8765],
        [1.9567, 2.9389, 3.7240]])
tensor([[1.4821, 0.8395, 0.8906],
        [1.4219, 1.2658, 0.7381],
        [2.1689, 1.3877, 1.4922]])


In [2]:
print(y)

tensor([[0.9458, 0.7273, 0.1516],
        [0.6086, 0.0129, 0.8765],
        [0.9567, 0.9389, 0.7240]])
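
The addition `x + y` above combines a shape-`[3]` tensor with a shape-`[3, 3]` tensor through broadcasting. A small sketch with fixed values (the names `row` and `mat` are illustrative) may make the behaviour easier to follow than random numbers do:

```python
import torch

row = torch.tensor([1.0, 2.0, 3.0])   # shape [3]
mat = torch.zeros(2, 3)               # shape [2, 3]

# row is treated as if it were repeated for every row of mat
summed = mat + row
print(summed.shape)
print(summed)

# Tensors on the CPU convert to and from NumPy arrays
as_numpy = summed.numpy()
back = torch.from_numpy(as_numpy)
print(type(as_numpy), back.shape)
```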


### 2. Autograd (Automatic Differentiation)

PyTorch's autograd system enables automatic computation of gradients, which is crucial for training neural networks.

In [3]:
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean()
out.backward()  # Computes gradients
print(x.grad)  # Displays gradients

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
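
The value 4.5 can be checked by hand. Writing the computation above as a formula over the four entries of `x` and applying the chain rule:

$$
\text{out} = \frac{1}{4}\sum_{i} 3\,(x_i + 2)^2
\quad\Longrightarrow\quad
\frac{\partial\,\text{out}}{\partial x_i} = \frac{1}{4}\cdot 6\,(x_i + 2) = \frac{3}{2}\,(x_i + 2)
$$

With every $x_i = 1$, this gives $\frac{3}{2}\cdot 3 = 4.5$, which matches `x.grad`.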


In [4]:
x

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

### 3. Neural Network Modules

PyTorch provides a high-level API for building neural networks through the `torch.nn` module.

In [5]:
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 2)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNet()
model

SimpleNet(
  (fc1): Linear(in_features=10, out_features=5, bias=True)
  (fc2): Linear(in_features=5, out_features=2, bias=True)
)
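
To see such a module in use, here is a hedged sketch that pushes a random batch through a network with the same layer sizes and counts its learnable parameters. It uses `nn.Sequential` as a stand-in for `SimpleNet` so the snippet is self-contained; the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

# Same layer sizes as SimpleNet, written with nn.Sequential for brevity
net = nn.Sequential(
    nn.Linear(10, 5),
    nn.ReLU(),
    nn.Linear(5, 2),
)

batch = torch.randn(4, 10)   # 4 samples, 10 features each
out = net(batch)
print(out.shape)             # one 2-value output per sample

# Learnable parameters: (10*5 weights + 5 biases) + (5*2 weights + 2 biases)
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```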

### 4. Data Loading and Processing

The `torch.utils.data` module provides tools for efficient data loading and preprocessing.

In [None]:
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
    
data, labels = ["a", "b"], [1, 2]

dataset = CustomDataset(data, labels)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

These components work together to provide a comprehensive framework for developing, training, and deploying deep learning models. PyTorch's design philosophy emphasizes ease of use, flexibility, and performance, making it a powerful tool for both research and production environments.

## PyTorch Dataset and DataLoader

### Dataset

In PyTorch, the `Dataset` class is an abstract class representing a dataset. Custom datasets should inherit from `Dataset` and override the following methods:

- `__len__`: Returns the size of the dataset
- `__getitem__`: Supports integer indexing from 0 to `len(self) - 1`

#### Example:


In [6]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]
    
customdata = CustomDataset(["a", "b"], [1, 2])

### DataLoader



The `DataLoader` class provides an iterator over the dataset and supports:

- Automatic batching
- Shuffling
- Multiprocessing for data loading

Key parameters:
- `batch_size`: Number of samples in each batch
- `shuffle`: Whether to shuffle the data at every epoch
- `num_workers`: Number of subprocesses for data loading

#### Example:


In [10]:
from torch.utils.data import DataLoader

dataset = CustomDataset(["a", "b"], [1, 2])
dataloader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

for batch_data, batch_labels in dataloader:
    print(batch_data)
    print(batch_labels)
    break

('a',)
tensor([1])
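
The batching behaviour is easier to observe with numeric data. The sketch below assumes tensor inputs and uses `TensorDataset` in place of a custom class; the sizes (10 samples, batches of 4) are illustrative choices.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(20, dtype=torch.float32).reshape(10, 2)  # 10 samples, 2 features
labels = torch.arange(10)                                        # one label per sample

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

batch_sizes = []
for batch_features, batch_labels in loader:
    batch_sizes.append(batch_features.shape[0])
    print(batch_features.shape, batch_labels.shape)

print(batch_sizes)  # the last batch is smaller because 10 is not divisible by 4
```

Passing `drop_last=True` to `DataLoader` would discard the incomplete final batch instead.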


## Neural Networks

### What are Neural Networks?

Neural Networks are computational models inspired by the human brain's structure and function. They consist of interconnected nodes (neurons) organized in layers, designed to recognize patterns and solve complex problems.

### Why use Neural Networks?

1. **Pattern Recognition**: Excellent at identifying complex patterns in data.
2. **Adaptability**: Can learn and improve from experience.
3. **Generalization**: Can make accurate predictions on unseen data.
4. **Non-linearity**: Can model complex non-linear relationships.
5. **Parallel Processing**: Can process multiple inputs simultaneously.

### How do Neural Networks work?

1. **Input Layer**: Receives initial data.
2. **Hidden Layers**: Process the data through weighted connections.
3. **Output Layer**: Produces the final result.
4. **Activation Functions**: Introduce non-linearity to the model.
5. **Training**: Adjust weights to minimize the difference between predicted and actual outputs.


## A Simple Perceptron from Scratch with NumPy

A perceptron is the simplest form of a neural network, consisting of a single neuron.

In [1]:
import numpy as np

class Perceptron:
    def __init__(self, input_size, learning_rate=0.1):
        self.weights = np.random.rand(input_size)
        self.bias = np.random.rand()
        self.learning_rate = learning_rate

    def activate(self, x):
        return 1 if x > 0 else 0

    def predict(self, inputs):
        weighted_sum = np.dot(inputs, self.weights) + self.bias  # avoid shadowing the built-in sum()
        return self.activate(weighted_sum)

    def train(self, inputs, label):
        prediction = self.predict(inputs)
        error = label - prediction
        self.weights += error * self.learning_rate * inputs
        self.bias += error * self.learning_rate

# Example usage
p = Perceptron(2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

for _ in range(100):
    for inputs, label in zip(X, y):
        p.train(inputs, label)

# Test
for inputs in X:
    print(f"Input: {inputs}, Prediction: {p.predict(inputs)}")

Input: [0 0], Prediction: 0
Input: [0 1], Prediction: 0
Input: [1 0], Prediction: 0
Input: [1 1], Prediction: 1


This simple perceptron can learn to perform basic logical operations like AND or OR.
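
As a sketch of the OR case, the same update rule can be written without a class. Zero initialisation and a learning rate of 1.0 are illustrative choices that make the run repeatable; they are not taken from the example above.

```python
import numpy as np

def step(z):
    """Step activation: 1 for positive input, otherwise 0."""
    return 1 if z > 0 else 0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or = np.array([0, 1, 1, 1])   # OR truth table

# Illustrative starting point: all-zero parameters
weights = np.zeros(2)
bias = 0.0
lr = 1.0

for _ in range(10):
    for inputs, label in zip(X, y_or):
        error = label - step(np.dot(inputs, weights) + bias)
        weights += lr * error * inputs   # perceptron weight update
        bias += lr * error               # perceptron bias update

predictions = [step(np.dot(inputs, weights) + bias) for inputs in X]
print(predictions)
```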

## NN Architectures

### Artificial Neural Network (ANN)

ANNs are the most basic type of neural network, consisting of fully connected layers.

- **Structure**: Input layer, one or more hidden layers, output layer.
- **Use Cases**: Classification, regression, pattern recognition.
- **Pros**: Versatile; with enough hidden units, can approximate a wide class of continuous functions (the universal approximation property).
- **Cons**: May struggle with spatial or temporal data.

### Convolutional Neural Network (CNN)

CNNs are specialized for processing grid-like data, such as images.

- **Key Components**: Convolutional layers, pooling layers, fully connected layers.
- **Use Cases**: Image classification, object detection, computer vision tasks.
- **Pros**: Efficient for spatial data; parameter sharing reduces the number of weights, which helps limit overfitting.
- **Cons**: May struggle with non-spatial data.

### Recurrent Neural Network (RNN)

RNNs are designed to work with sequential data by maintaining an internal state (memory).

- **Key Feature**: Loops in the network allow information to persist.
- **Variants**: LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit).
- **Use Cases**: Natural language processing, time series prediction, speech recognition.
- **Pros**: Can handle variable-length sequences, maintains temporal information.
- **Cons**: Can be difficult to train due to vanishing/exploding gradients.
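
One practical difference between the three architectures is the input shape each expects. The following sketch shows one representative PyTorch layer per family; all sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# Fully connected layer: each sample is a flat feature vector
fc = nn.Linear(in_features=16, out_features=4)
fc_out = fc(torch.randn(8, 16))                  # (batch, features)

# Convolutional layer: input is (batch, channels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
conv_out = conv(torch.randn(8, 3, 32, 32))

# Recurrent layer: input is (batch, sequence length, features) with batch_first=True
lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
lstm_out, (h_n, c_n) = lstm(torch.randn(8, 5, 10))

print(fc_out.shape, conv_out.shape, lstm_out.shape, h_n.shape)
```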

## Simple ANN Network - Theory


A multilayer ANN, also known as a feedforward neural network, consists of multiple layers of neurons.

1. **Input Layer**: Receives the initial data.
2. **Hidden Layers**: Process the data through weighted connections.
3. **Output Layer**: Produces the final result.
4. **Activation Functions**: Introduce non-linearity (e.g., ReLU, sigmoid, tanh).
5. **Backpropagation**: Algorithm used to train the network by adjusting weights.
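
The first four components can be sketched in a few lines of NumPy. The weights below are hand-picked illustrative values, not trained ones, and the names `W1`, `b1`, `W2`, `b2` are assumptions for this example.

```python
import numpy as np

def relu(z):
    """ReLU activation applied element-wise."""
    return np.maximum(0, z)

x = np.array([1.0, 2.0])                  # input layer: 2 features

W1 = np.array([[1.0, -1.0],
               [0.5,  0.5]])              # hidden layer weights (2 neurons x 2 inputs)
b1 = np.array([0.0, 0.0])

W2 = np.array([[1.0, 2.0]])               # output layer weights (1 neuron x 2 hidden units)
b2 = np.array([0.5])

hidden = relu(W1 @ x + b1)                # weighted sum, then non-linearity
output = W2 @ hidden + b2                 # linear output layer

print(hidden, output)
```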

## PyTorch Simple ANN Network - Code

PyTorch provides a high-level API for creating neural networks through the `torch.nn` module.

### Basic Structure

1. Define a class that inherits from `nn.Module`
2. Define layers in the `__init__` method
3. Implement the `forward` method to define the computation

#### Example:

```python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNN, self).__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

# Create an instance of the network
model = SimpleNN(input_size=10, hidden_size=20, output_size=2)
```

### Common Layer Types

- `nn.Linear`: Fully connected layer
- `nn.Conv2d`: 2D convolutional layer
- `nn.RNN`, `nn.LSTM`, `nn.GRU`: Recurrent layers
- `nn.BatchNorm2d`: Batch normalization
- `nn.Dropout`: Dropout layer

### Activation Functions

- `nn.ReLU`: Rectified Linear Unit
- `nn.Sigmoid`: Sigmoid activation
- `nn.Tanh`: Hyperbolic tangent
- `nn.Softmax`: Softmax activation

### Loss Functions

- `nn.MSELoss`: Mean Squared Error
- `nn.CrossEntropyLoss`: Combines LogSoftmax and NLLLoss
- `nn.BCELoss`: Binary Cross Entropy

### Optimizers

Available in `torch.optim`:

- `optim.SGD`: Stochastic Gradient Descent
- `optim.Adam`: Adaptive Moment Estimation
- `optim.RMSprop`: Root Mean Square Propagation

#### Training Loop Example:

```python
# Assumes `model`, `dataloader`, and `num_epochs` are already defined
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(num_epochs):
    for inputs, labels in dataloader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

This structure allows for flexible and efficient creation of various neural network architectures in PyTorch.

## PyTorch Simple ANN Training

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import numpy as np

# Custom Dataset class
class RandomDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)  # features as float32
        self.y = torch.tensor(y, dtype=torch.long)     # class indices as int64

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Simple feed-forward neural network
class SimpleANN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleANN, self).__init__()
        # Define layers
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        # Define forward pass
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

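# Generate synthetic data. Labels are drawn independently of X, so there is no
# pattern to learn: expect the loss to stay near ln(2), about 0.693, and accuracy near 50%.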
def generate_data(samples=1000, input_size=2, num_classes=2):
    X = np.random.randn(samples, input_size)
    y = np.random.randint(0, num_classes, samples)
    return X, y

# Hyperparameters
input_size = 2  # Number of features in the input data
hidden_size = 16  # Hidden layer size
output_size = 2  # Number of classes (binary classification)
learning_rate = 0.01
epochs = 100
batch_size = 32  # Mini-batch size

# Generate data
X, y = generate_data(samples=1000, input_size=input_size, num_classes=output_size)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create custom datasets
train_dataset = RandomDataset(X_train, y_train)
test_dataset = RandomDataset(X_test, y_test)

# Create DataLoaders for batching
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Initialize model, loss function, and optimizer
model = SimpleANN(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()  # Since it's a classification problem
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(epochs):
    model.train()  # Set model to training mode
    running_loss = 0.0
    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {running_loss/len(train_loader):.4f}')

# Inference function
def inference(model, data_loader):
    model.eval()  # Set model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_X, batch_y in data_loader:
            outputs = model(batch_X)
            _, predicted = torch.max(outputs, 1)  # index of the largest logit is the predicted class
            total += batch_y.size(0)
            correct += (predicted == batch_y).sum().item()

    return correct / total

# Testing the model on test data
test_accuracy = inference(model, test_loader)
print(f'Test Accuracy: {test_accuracy * 100:.2f}%')

Epoch [10/100], Loss: 0.6866
Epoch [20/100], Loss: 0.6865
Epoch [30/100], Loss: 0.6864
Epoch [40/100], Loss: 0.6894
Epoch [50/100], Loss: 0.6859
Epoch [60/100], Loss: 0.6827
Epoch [70/100], Loss: 0.6827
Epoch [80/100], Loss: 0.6846
Epoch [90/100], Loss: 0.6826
Epoch [100/100], Loss: 0.6834
Test Accuracy: 52.50%


## Forward Propagation

- What: The process of computing the output of a neural network given an input.
- Why: To make predictions and compute the loss.
- How: By sequentially applying each layer's operations to the input data.

## Backward Propagation




- What: The process of computing gradients of the loss with respect to the network parameters.
- Why: To determine how to adjust the parameters to minimize the loss.
- How: By applying the chain rule of calculus to propagate gradients backwards through the network.
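
Both passes can be traced by hand for a single linear neuron with a squared-error loss. This is a minimal sketch with made-up numbers; the names `forward` and `loss_fn` are illustrative, and the finite-difference check is only there to confirm the chain-rule result.

```python
def forward(w, b, x):
    """Forward pass of a single linear neuron."""
    return w * x + b

def loss_fn(y_hat, y):
    """Squared error for one sample."""
    return (y_hat - y) ** 2

x, y = 2.0, 5.0      # one training example (illustrative values)
w, b = 3.0, 1.0      # current parameters

# Forward propagation
y_hat = forward(w, b, x)          # 3*2 + 1 = 7
loss = loss_fn(y_hat, y)          # (7 - 5)^2 = 4

# Backward propagation via the chain rule
dloss_dyhat = 2 * (y_hat - y)     # dL/dy_hat
grad_w = dloss_dyhat * x          # dL/dw = dL/dy_hat * dy_hat/dw
grad_b = dloss_dyhat * 1.0        # dL/db = dL/dy_hat * dy_hat/db

# Numerical check with a central finite difference
eps = 1e-6
numeric_grad_w = (loss_fn(forward(w + eps, b, x), y)
                  - loss_fn(forward(w - eps, b, x), y)) / (2 * eps)

print(loss, grad_w, grad_b, numeric_grad_w)
```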

## Activation Functions - What, Why, How, Types

### What are Activation Functions?

Activation functions are mathematical functions that determine the output of each neuron in a neural network. They are applied to the weighted sum of the neuron's inputs (plus its bias).

### Why use Activation Functions?

1. Introduce non-linearity: This allows the network to learn complex patterns.
2. Bound the output: Some activations (e.g., sigmoid and tanh) keep values within a fixed range.
3. Enable backpropagation: Many activation functions are differentiable, allowing gradients to flow backward through the network.

### How do Activation Functions work?

They take the weighted sum of inputs to a neuron and apply a mathematical operation to produce an output. This output then becomes the input for the next layer or the final output of the network.

### Types of Activation Functions


#### **1. Sigmoid (Logistic) Function**

**What**: An S-shaped curve that maps any input value to a value between 0 and 1.

**Why**: 
- Useful for models where we need to predict the probability as an output.
- Historically popular, but less used in hidden layers of modern networks.

**How**: 

```math
f(x) = \frac{1}{1 + e^{-x}}
```

![image](res/sigmoid.webp)

In [2]:
import torch.nn as nn
import torch
activation_input = torch.tensor([0, 1])

In [3]:
sigmoid = nn.Sigmoid()
output = sigmoid(activation_input.float())

In [4]:
output

tensor([0.5000, 0.7311])

#### **2. Hyperbolic Tangent (tanh)**


**What**: Similar to sigmoid, but maps values to the range (-1, 1).

**Why**: 
- Zero-centered, making it easier for the model to learn.
- Often performs better than sigmoid in hidden layers.

**How**:

```math
f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
```

![image](res/tanh.webp)

In [5]:
tanh = nn.Tanh()
output = tanh(activation_input.float())

In [6]:
output

tensor([0.0000, 0.7616])

#### **3. Rectified Linear Unit (ReLU)**

**What**: Returns 0 for negative values, and the input value for positive values.

**Why**: 
- Computationally efficient.
- Helps mitigate the vanishing gradient problem.
- Induces sparsity in the hidden units.

**How**: 
```math
f(x) = \max(0, x)
```

![image](res/relu.webp)

In [7]:
relu = nn.ReLU()
output = relu(activation_input.float())

In [8]:
output

tensor([0., 1.])

#### **4. Leaky ReLU**

**What**: Similar to ReLU, but allows small negative values when the input is less than zero.

**Why**: 
- Attempts to solve the "dying ReLU" problem where neurons can get stuck during training.
- Allows for slight gradient flow for negative inputs.

**How**: 

$$
f(x) =
\begin{cases} 
x, & \text{if } x > 0 \\
\alpha x, & \text{if } x \leq 0
\end{cases}
$$
Where $\alpha$ is a small constant (e.g., $\alpha = 0.01$).


![image](res/leaky_relu.webp)

In [9]:
leaky_relu = nn.LeakyReLU(0.01)  # 0.01 is the default negative slope
output = leaky_relu(activation_input.float())

In [10]:
output

tensor([0., 1.])

In [12]:
relu(torch.tensor(-1))

tensor(0)

In [14]:
leaky_relu(torch.tensor(-1.0))

tensor(-0.0100)

#### **5. Softmax**


**What**: Converts a vector of real numbers into a probability distribution.

**Why**: 
- Commonly used in the output layer of multi-class classification problems.
- Ensures all output values are between 0 and 1 and sum to 1.

**How**: 

$$
f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

In [15]:
softmax = nn.Softmax(dim=0)  # Apply softmax along dimension 0 (the only dimension of this 1-D tensor)
output = softmax(activation_input.float())

In [16]:
output

tensor([0.2689, 0.7311])
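
The formula can also be evaluated directly. The sketch below is a plain-Python version that subtracts the maximum before exponentiating, a common trick to avoid overflow; it is an illustration of the formula, not a description of how `nn.Softmax` is implemented internally.

```python
import math

def softmax(values):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.0, 1.0])
print([round(p, 4) for p in probs])

# Without the shift, math.exp(1001.0) would raise OverflowError
large = softmax([1000.0, 1001.0])
print([round(p, 4) for p in large])
```

Shifting every input by the same constant leaves the result unchanged, which is why both calls give the same probabilities.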

### Derivatives of Activation Functions

![image1](res/activationfunccheatsheet.webp)
![image2](res/activationfunctionderivates.webp)
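
For readers who prefer code to a cheat sheet, the standard closed-form derivatives can be sketched in plain Python. The function names are illustrative, and treating the ReLU derivative at exactly 0 as 0 is a convention assumed here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma'(x) = sigma(x) * (1 - sigma(x))

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return 1.0 if x > 0 else 0.0    # derivative at 0 taken as 0 by convention

def d_leaky_relu(x, alpha=0.01):
    return 1.0 if x > 0 else alpha  # small constant slope for non-positive inputs

for x in (-2.0, 0.0, 2.0):
    print(x, d_sigmoid(x), d_tanh(x), d_relu(x), d_leaky_relu(x))
```

The sigmoid derivative never exceeds 0.25, which is one reason deep stacks of sigmoid layers are prone to vanishing gradients.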

### Choosing Activation Functions




- For hidden layers, ReLU is often a good default choice due to its simplicity and effectiveness.
- For binary classification output, sigmoid is commonly used.
- For multi-class classification output, softmax is typically used.
- For regression problems, the output layer often uses a linear activation (i.e., no activation function).

The choice of activation function can significantly impact the network's performance and training dynamics. Experimentation is often necessary to find the best activation functions for a specific problem.

## Loss Functions - What, Why, How, Types and Code


### What are Loss Functions?
Loss functions measure the difference between the predicted output and the actual target, quantifying the model's performance.

### Why use Loss Functions?
They guide the learning process by providing a scalar value to be minimized during training.

### How do Loss Functions work?
They compute a score based on the model's predictions and the true values, which is then used to update the model's parameters.

### **Loss Functions Types**


#### **1. Mean Squared Error (MSE)**

**What**: A loss function that measures the average squared difference between the predicted and actual values.

**Why**: 
- Suitable for regression problems
- Penalizes larger errors more heavily
- Differentiable, making it suitable for gradient-based optimization

**How**: 
- Calculate the difference between each predicted and actual value
- Square these differences
- Take the mean of these squared differences

In [18]:
import torch
predictions = torch.Tensor([1.0, 0.0, 1.0])
targets = torch.Tensor([0.0, 0.0, 1.0])

In [19]:
import torch.nn as nn

mse_loss = nn.MSELoss()
loss = mse_loss(predictions, targets)

In [20]:
loss

tensor(0.3333)
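
The three steps listed under **How** can be reproduced by hand for the same values, as a quick sanity check on the result above:

```python
predictions = [1.0, 0.0, 1.0]
targets = [0.0, 0.0, 1.0]

# Difference, square, then mean
squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
mse = sum(squared_errors) / len(squared_errors)

print(squared_errors)
print(round(mse, 4))
```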

#### **2. Cross-Entropy Loss**


**What**: A loss function that measures the performance of a classification model whose output is a probability value between 0 and 1.

**Why**: 
- Suitable for multi-class classification problems
- Encourages confident predictions
- Works well with softmax activation in the output layer

**How**: 
- Apply softmax to the model's raw output (logits) to get probabilities; `nn.CrossEntropyLoss` does this internally, so pass it raw logits
- Take the negative log of the predicted probability of the correct class
- Average this across all samples

In [24]:
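predictions[0] = 0  # predictions is now [0., 0., 1.], treated as logits for 3 classes

In [25]:
ce_loss = nn.CrossEntropyLoss()
# A float target with the same shape as the input is interpreted as class probabilities
# (here one-hot for class 2); the more common form is an integer class index.
loss = ce_loss(predictions, targets)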

In [26]:
loss

tensor(0.5514)
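
The same number can be recovered by following the steps under **How** in plain Python. This sketch treats the example as a single sample with three classes whose correct class is index 2, which is how the one-hot target above reads.

```python
import math

logits = [0.0, 0.0, 1.0]   # raw scores for 3 classes
true_class = 2             # index of the correct class

# log-softmax computed via the log-sum-exp of the logits
log_sum_exp = math.log(sum(math.exp(z) for z in logits))
log_probs = [z - log_sum_exp for z in logits]

# Negative log-probability of the correct class
loss = -log_probs[true_class]
print(round(loss, 4))
```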

#### **3. Binary Cross-Entropy**








**What**: A special case of cross-entropy for binary classification problems.

**Why**: 
- Suitable for binary classification problems
- Works well with sigmoid activation in the output layer

**How**: 
- Apply sigmoid to the model's raw output to get a probability (`nn.BCEWithLogitsLoss` does this internally; `nn.BCELoss` expects probabilities)
- Compute the negative log likelihood of the correct class


In [27]:
bce_loss = nn.BCEWithLogitsLoss()  # Combines sigmoid and BCE
loss = bce_loss(predictions, targets)

In [28]:
loss

tensor(0.5665)
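
As a cross-check, the value above can be computed element by element. The sketch applies the sigmoid explicitly and then averages the per-element negative log-likelihoods; the variable names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [0.0, 0.0, 1.0]
targets = [0.0, 0.0, 1.0]

per_sample = []
for z, t in zip(logits, targets):
    p = sigmoid(z)   # predicted probability of the positive class
    per_sample.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))

bce = sum(per_sample) / len(per_sample)
print(round(bce, 4))
```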

## Optimizers - What, Why, How, Types and Code


### What are Optimizers?
Algorithms that adjust the network's parameters to minimize the loss function.

### Why use Optimizers?
They implement different strategies for updating weights, which can lead to faster convergence or better generalization.

### How do Optimizers work?
By computing gradients of the loss with respect to the parameters and updating them accordingly.


### Optimizer Types:


#### **1. Stochastic Gradient Descent (SGD)**


**What**: The most basic gradient-based optimizer; it updates parameters using the gradient computed on the current sample or mini-batch.

**Why**: 
- Simple and memory-efficient
- Can escape shallow local minima due to its stochastic nature

**How**: 
- Compute the gradient of the loss with respect to each parameter
- Update each parameter by subtracting the learning rate multiplied by its gradient

In [29]:
sgd_optimizer = optim.SGD(model.parameters(), lr=0.01)

In [30]:
sgd_optimizer

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    momentum: 0
    nesterov: False
    weight_decay: 0
)
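
To make the update rule concrete, the sketch below takes one `optim.SGD` step on a two-element parameter and compares it with the manual computation `w - lr * grad`. The toy loss `sum(w^2)` and the learning rate of 0.1 are illustrative choices.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()     # gradient is 2 * w
optimizer.zero_grad()
loss.backward()

expected = w.detach() - 0.1 * w.grad   # manual update: w - lr * grad
optimizer.step()

print(w.detach(), expected)
```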

#### **2. Adam (Adaptive Moment Estimation)**

**What**: An adaptive learning rate optimization algorithm that computes individual learning rates for different parameters.

**Why**: 
- Combines the benefits of AdaGrad and RMSprop
- Works well for problems with sparse gradients or noisy data

**How**: 
- Maintains a moving average of the gradient and the squared gradient
- Uses these to compute adaptive learning rates for each parameter

In [None]:
adam_optimizer = optim.Adam(model.parameters(), lr=0.001)
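
The two moving averages can be written out for a single scalar parameter. This is a simplified sketch of the Adam update with commonly used hyperparameter values, not the library's implementation; the function name `adam_step` is made up for illustration.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
w_new, m, v = adam_step(w, grad=4.0, m=m, v=v, t=1)
print(w - w_new)   # close to lr on the first step, whatever the gradient's magnitude
```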

#### **3. RMSprop**


**What**: An optimizer that adapts the learning rate for each parameter based on the recent gradient history.

**Why**: 
- Addresses the diminishing learning rates problem of AdaGrad
- Works well for non-stationary objectives

**How**: 
- Maintains a moving average of squared gradients
- Divides the learning rate by the square root of this average


In [None]:
rmsprop_optimizer = optim.RMSprop(model.parameters(), lr=0.01)
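
The two steps under **How** translate almost directly into code. The following is a simplified scalar sketch (no momentum or centering), with a decay rate of 0.99 assumed for illustration; `rmsprop_step` is a hypothetical helper name.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSprop update for a single scalar parameter."""
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2   # moving average of squared gradients
    w = w - lr * grad / (math.sqrt(sq_avg) + eps)       # scale the step by its square root
    return w, sq_avg

w, sq_avg = 5.0, 0.0
w_new, sq_avg = rmsprop_step(w, grad=4.0, sq_avg=sq_avg)
print(w - w_new, sq_avg)
```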

## Gradient Descent - What, Why, How, Types and Code

### What is Gradient Descent?
An optimization algorithm used to minimize the loss function by iteratively moving in the direction of steepest descent.

### Why use Gradient Descent?
It provides a way to find the optimal parameters that minimize the loss function.

### How does Gradient Descent work?
By computing the gradient of the loss with respect to each parameter and updating the parameters in the opposite direction of the gradient.

### Gradient Descent Types


#### **1. Batch Gradient Descent**

**What**: Computes the gradient using the entire dataset.

**Why**: 
- Provides a more accurate estimate of the gradient
- Guaranteed to converge to the global minimum for convex error surfaces, given a suitably small learning rate

**How**: 
- Compute the gradient of the loss over the entire dataset
- Update parameters once per epoch

#### **2. Stochastic Gradient Descent (SGD)**

**What**: Computes the gradient using a single randomly selected sample.

**Why**: 
- Faster than batch gradient descent for large datasets
- Can escape local minima due to its noisy updates

**How**: 
- Randomly select a single sample
- Compute the gradient based on this sample
- Update parameters

#### **3. Mini-batch Gradient Descent**

**What**: Computes the gradient using a small random subset of the data.

**Why**: 
- Balances the efficiency of SGD with the stability of batch gradient descent
- Allows for vectorized operations, which can be computationally efficient

**How**: 
- Divide the dataset into mini-batches
- Compute the gradient for each mini-batch
- Update parameters after each mini-batch
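
The three variants differ only in how many samples feed each update, so a single function with a `batch_size` argument can cover all of them. The sketch below fits a one-parameter model to noise-free toy data; the `train` helper, the data, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def train(X, y, batch_size, epochs=100, lr=0.1, seed=0):
    """Fit y = w * x with gradient descent; returns (w, updates per epoch)."""
    rng = np.random.default_rng(seed)
    w = 0.0
    n = len(X)
    updates_per_epoch = 0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        updates_per_epoch = 0
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx] - y[idx]
            grad = 2.0 * np.mean(error * X[idx])   # d(MSE)/dw on this batch
            w -= lr * grad
            updates_per_epoch += 1
    return w, updates_per_epoch

X = np.linspace(-1.0, 1.0, 8)
y = 3.0 * X                                        # noise-free data, true w = 3

for name, batch_size in [("batch", 8), ("stochastic", 1), ("mini-batch", 4)]:
    w, updates = train(X, y, batch_size)
    print(f"{name:>10}: w = {w:.4f}, updates per epoch = {updates}")
```

Setting `batch_size` to the dataset size gives batch gradient descent, `1` gives stochastic gradient descent, and anything in between gives mini-batch gradient descent.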
