# Learning Gradient Descent

Playing around with gradient descent. Learning about loss functions. 

The loss function is written L(y, t), where y is the output of the model and t is the target: the actual value being predicted.

The goal of gradient descent is to drive the difference between y and t toward 0. A higher loss is worse and a lower loss is better: a falling loss indicates that the gap between y and t is shrinking.

Each iteration adjusts the parameters used to generate y, with the size of the adjustment scaled by the learning rate, and the new y is then checked against t.

The loss function L(y, t) measures how well the model's predictions y match the actual target values t. Common loss functions include Mean Squared Error (MSE) for regression tasks and Cross-Entropy Loss for classification tasks.
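As a rough illustration of the two losses just mentioned, here is a minimal NumPy sketch. The function names `mse_loss` and `cross_entropy_loss` are made up for this example, and the cross-entropy shown is the binary form (predicted probabilities against 0/1 targets), which is only one of several variants.

```python
import numpy as np

def mse_loss(y, t):
    """Mean squared error: the average of the squared differences."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    return np.mean((y - t) ** 2)

def cross_entropy_loss(p, t, eps=1e-12):
    """Binary cross-entropy for predicted probabilities p and 0/1 targets t."""
    # Clip so that log(0) can never occur.
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    t = np.asarray(t, dtype=float)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

print(mse_loss([2.0, 4.0], [1.0, 6.0]))        # (1 + 4) / 2 = 2.5
print(cross_entropy_loss([0.9, 0.2], [1, 0]))  # about 0.164
```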

Gradient Descent: This optimization algorithm iteratively adjusts the model parameters to minimize the loss function. The goal is to reduce the difference between the predicted output y and the actual target t, thereby minimizing the loss.

Learning Rate: This hyperparameter controls the size of the steps taken during gradient descent. Each iteration updates the model parameters in the direction that reduces the loss the most, with the step size determined by the learning rate.
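Written as an update rule, one gradient descent step looks like the following. The symbol θ (all model parameters collected together) and the step index k are introduced here for illustration; η is the learning rate.

```latex
\theta_{k+1} = \theta_k - \eta \, \nabla_\theta L(y, t) \big|_{\theta = \theta_k}
```

The minus sign is what makes this a descent: the gradient points uphill, so the step goes the other way.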

Convergence: Gradient descent continues until the loss settles at a minimum value. Ideally that value is close to zero, meaning the model's predictions are very close to the actual targets, although noise in the data usually keeps it above zero.

At that minimum the model has reached the best parameters gradient descent could find from its starting point, which may be a local minimum rather than the global one.








Here's a step-by-step process: 
1. Forward Pass: Compute the predicted value y using the current weights and biases.
2. Compute Loss: Calculate the loss L(y, t) using the predicted value y and the actual value t.
3. Backward Pass: Compute the gradients of the loss with respect to the weights and biases, i.e. the vector of partial derivatives of the loss function with respect to each parameter. The gradient indicates the direction and rate of the steepest increase of the loss function.
4. Update Parameters: Adjust the weights and biases using the gradients and the learning rate. The partial derivatives are used to update the model parameters in the opposite direction of the gradient to minimise the loss function. The gradient points in the direction of the steepest ascent, so moving in the opposite direction (i.e., subtracting the gradient) helps in descending the loss function towards a local minimum.
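The four steps above can be sketched for a linear model as follows. This is a minimal sketch, not the only way to arrange it: the function name `gradient_descent_step`, the toy data, and the choice of halved mean squared error as the loss are all assumptions made for the example.

```python
import numpy as np

def gradient_descent_step(inputs, targets, weights, bias, learning_rate):
    n = inputs.shape[0]
    # 1. Forward pass: predictions from the current weights and bias.
    outputs = inputs @ weights + bias
    # 2. Compute loss: halved mean squared error (assumed for this sketch).
    delta = outputs - targets
    loss = np.sum(delta ** 2) / (2 * n)
    # 3. Backward pass: partial derivatives of the loss.
    grad_w = inputs.T @ delta / n
    grad_b = np.sum(delta) / n
    # 4. Update parameters: step against the gradient.
    weights = weights - learning_rate * grad_w
    bias = bias - learning_rate * grad_b
    return weights, bias, loss

# Toy data (hypothetical): targets generated by weights 2 and -3, bias 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 2))
t = x @ np.array([[2.0], [-3.0]]) + 1.0

w, b = np.zeros((2, 1)), 0.0
losses = []
for _ in range(500):
    w, b, loss = gradient_descent_step(x, t, w, b, learning_rate=0.5)
    losses.append(loss)

print(w.ravel(), b)  # close to [2, -3] and 1
```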

In summary, t is the actual value, y is the predicted value, and the loss function L(y,t) measures how far y is from t. The goal of gradient descent is to adjust the weights and biases to make y as close to t as possible.

Gradient descent is an optimization algorithm used to minimize the loss function in machine learning models. The main goal is to find the set of model parameters (weights and biases) that result in the smallest possible loss. There are several variations of gradient descent, including:

Batch Gradient Descent: This version calculates the gradient using the entire training dataset. While it provides a stable and accurate estimate of the gradient, it can be computationally expensive for large datasets.

Stochastic Gradient Descent (SGD): Instead of using the entire dataset, SGD updates the model parameters using the gradient of the loss function for a single training example at a time. This makes the updates noisier, but the noise can help the algorithm escape shallow local minima and settle in a better one.

Mini-Batch Gradient Descent: This is a compromise between batch gradient descent and SGD. It calculates the gradient using a small batch of training examples. This approach is commonly used because it balances the stable gradient estimates of batch gradient descent against the cheap, frequent updates of SGD.
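The three variants differ only in how many examples feed each update, which a single loop can show. In this sketch (the function name `minibatch_epoch` and the toy data are assumptions), setting `batch_size` to the dataset size gives batch gradient descent and setting it to 1 gives SGD.

```python
import numpy as np

def minibatch_epoch(inputs, targets, weights, bias, learning_rate, batch_size, rng):
    """One pass over the data, updating after every batch."""
    n = inputs.shape[0]
    order = rng.permutation(n)  # reshuffle the examples each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        xb, tb = inputs[idx], targets[idx]
        delta = xb @ weights + bias - tb
        # Gradient of the halved mean squared error on this batch only.
        weights = weights - learning_rate * (xb.T @ delta) / len(idx)
        bias = bias - learning_rate * np.sum(delta) / len(idx)
    return weights, bias

# Toy noiseless data (hypothetical): true weights 2 and -3, true bias 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(256, 2))
t = x @ np.array([[2.0], [-3.0]]) + 1.0

w, b = np.zeros((2, 1)), 0.0
for _ in range(200):
    w, b = minibatch_epoch(x, t, w, b, learning_rate=0.1, batch_size=32, rng=rng)

print(w.ravel(), b)  # close to [2, -3] and 1
```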

The learning rate η is a crucial hyperparameter that determines the size of the steps taken during gradient descent. If the learning rate is too high, the algorithm might overshoot the minimum and fail to converge. If the learning rate is too low, the algorithm will converge very slowly and may get stuck in a local minimum.
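A one-parameter toy problem makes the effect of the learning rate easy to see. The sketch below minimises f(w) = w², whose gradient is 2w. The function `descend` and the three rates are chosen purely for illustration.

```python
def descend(learning_rate, steps=50, w=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final w."""
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

too_high = descend(1.1)      # each step overshoots further: |w| grows
too_low = descend(0.001)     # barely moves from the starting point
reasonable = descend(0.1)    # shrinks steadily toward the minimum at 0

print(too_high, too_low, reasonable)
```

Each update multiplies w by (1 - 2η), so the iteration shrinks only when that factor has magnitude below 1, i.e. when 0 < η < 1 for this particular function.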

Convergence
Convergence is achieved when the changes in the loss function become very small, indicating that the algorithm has reached a minimum. Monitoring the loss function over iterations helps in determining when to stop the training process.
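One simple way to turn "the changes in the loss become very small" into a stopping rule is to compare consecutive loss values against a tolerance. The helper `run_until_converged`, its `tol` and `max_iters` parameters, and the toy problem are assumptions for this sketch.

```python
def run_until_converged(step, tol=1e-8, max_iters=10_000):
    """Call step() (which performs one update and returns the loss)
    until the loss changes by less than tol between iterations."""
    previous = float("inf")
    loss = previous
    for i in range(1, max_iters + 1):
        loss = step()
        if abs(previous - loss) < tol:
            return i, loss
        previous = loss
    return max_iters, loss

# Toy problem: minimise f(w) = w**2 with learning rate 0.1.
state = {"w": 1.0}

def step():
    state["w"] -= 0.1 * 2 * state["w"]
    return state["w"] ** 2

iterations, final_loss = run_until_converged(step)
print(iterations, final_loss)
```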

Challenges and Solutions
Local Minima: The loss function may have multiple local minima. Gradient descent can get stuck in a local minimum instead of finding the global minimum. Techniques like adding noise (SGD) or using advanced optimization algorithms (e.g., Adam, RMSprop) can help mitigate this issue.

Vanishing and Exploding Gradients: In deep neural networks, gradients can become very small (vanishing) or very large (exploding). Techniques like gradient clipping, proper weight initialization, and using activation functions like ReLU can help address these problems.
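Gradient clipping, one of the techniques mentioned above, can be sketched in a few lines. This version rescales the gradient when its norm exceeds a threshold; the name `clip_by_norm` and the threshold are illustrative, and clipping each element to a range is another common variant.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)  # keep direction, shrink length
    return grad

print(clip_by_norm(np.array([3.0, 4.0]), 1.0))  # norm 5 -> [0.6, 0.8]
print(clip_by_norm(np.array([0.3, 0.4]), 1.0))  # norm 0.5 -> unchanged
```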

Choosing the Right Learning Rate: Using learning rate schedules (e.g., reducing the learning rate as training progresses) or adaptive learning rates (e.g., AdaGrad, Adam) can help optimize the learning rate dynamically.
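As a small example of a learning rate schedule, here is inverse-time decay, one of several common choices. The function name and the constants are assumptions for illustration.

```python
def inverse_time_decay(initial_rate, decay, step):
    """Learning rate after `step` updates: initial_rate / (1 + decay * step)."""
    return initial_rate / (1.0 + decay * step)

rates = [inverse_time_decay(0.1, 0.01, s) for s in (0, 100, 900)]
print(rates)  # approximately 0.1, 0.05, 0.01
```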

Advanced Optimization Algorithms
In addition to basic gradient descent, several advanced optimization algorithms are commonly used:

- Momentum: This technique helps accelerate gradient descent by adding a fraction of the previous update to the current update.
- AdaGrad: Adjusts the learning rate for each parameter based on the historical gradients, allowing for larger updates for infrequent parameters.
- RMSprop: Similar to AdaGrad but uses a moving average of squared gradients to adjust the learning rate, preventing the learning rate from becoming too small.
- Adam: Combines the benefits of Momentum and RMSprop, using running estimates of both the first moment (mean) and the second moment (uncentered variance) of the gradients to adapt the learning rate.
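To make two of these concrete, here is a minimal sketch of a single Momentum update and a single Adam update. The function names and default hyperparameters are illustrative choices, and real libraries differ in details such as how momentum is parameterised.

```python
import numpy as np

def momentum_step(w, grad, velocity, learning_rate=0.1, beta=0.9):
    """Carry over a fraction (beta) of the previous update."""
    velocity = beta * velocity - learning_rate * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, learning_rate=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# First step on f(w) = w**2 from w = 5, where the gradient is 2 * 5 = 10.
w_mom, vel = momentum_step(5.0, 10.0, velocity=0.0)
w_adam, m, v = adam_step(5.0, 10.0, m=0.0, v=0.0, t=1)
print(w_mom)   # 5 - 0.1 * 10 = 4.0
print(w_adam)  # about 4.9: the first Adam step has size close to learning_rate
```

The first Adam step is roughly the learning rate regardless of the gradient's magnitude, which is one reason Adam is less sensitive to the scale of the inputs than plain gradient descent.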

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

Generate Fake Data

In [2]:
obs = 100000

xs = np.random.uniform(low=-100, high=100, size=(obs, 1))
zs = np.random.uniform(low=-100, high=100, size=(obs, 1))

inputs = np.column_stack((xs,zs))

Fake Targets to Predict

In [3]:
error = np.random.uniform(low=-100, high=100, size=(obs, 1))

targets = 15*xs - 13*zs + 6 + error  # 15 and -13 are the true weights, 6 is the true bias; error is the noise in the dataset.

Make the Variables

In [4]:
init_range = 0.002

weights = np.random.uniform(low=-init_range, high=init_range, size=(2, 1))

bias = np.random.uniform(low=-init_range, high=init_range, size=1)

Learning Rates

In [5]:
learning_rate = 0.02

Train the Model

In [6]:
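for i in range(100):
    # Forward pass: predictions of the linear model.
    outputs = np.dot(inputs, weights) + bias
    delta = outputs - targets

    # Sum of squared errors, printed to monitor training.
    loss = np.sum(delta ** 2)

    print(f"Iteration {i+1}: Loss = {loss}")

    # Average the errors over the observations so the step size
    # does not depend on the dataset size.
    delta_scaled = delta / obs

    # Gradient step: move weights and bias against the gradient.
    weights = weights - learning_rate * np.dot(inputs.T, delta_scaled)
    bias = bias - learning_rate * np.sum(delta_scaled)

print("Final weights:", weights)
print("Final bias:", bias)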

Iteration 1: Loss = 131509451709.95451
Iteration 2: Loss = 564628005536449.1
Iteration 3: Loss = 2.430431297308273e+18
Iteration 4: Loss = 1.0461775042511154e+22
Iteration 5: Loss = 4.503272624156164e+25
Iteration 6: Loss = 1.9384380891007085e+29
Iteration 7: Loss = 8.344041623906968e+32
Iteration 8: Loss = 3.5917145961130486e+36
Iteration 9: Loss = 1.5460659186990108e+40
Iteration 10: Loss = 6.655106690634975e+43
Iteration 11: Loss = 2.8647245168150722e+47
Iteration 12: Loss = 1.233137580973406e+51
Iteration 13: Loss = 5.308123600428068e+54
Iteration 14: Loss = 2.2849218148720512e+58
Iteration 15: Loss = 9.835636810118586e+61
Iteration 16: Loss = 4.23384002825609e+65
Iteration 17: Loss = 1.822498693547886e+69
Iteration 18: Loss = 7.845142304507515e+72
Iteration 19: Loss = 3.37703252642204e+76
Iteration 20: Loss = 1.453685565712267e+80
Iteration 21: Loss = 6.257581694367192e+83
Iteration 22: Loss = 2.693663815877197e+87
Iteration 23: Loss = 1.1595275446765615e+91
Iteration 24: Loss = 4

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  loss = np.sum(delta ** 2)


### Here's an example of the exploding gradient problem
This occurs when the gradients during training become excessively large. It typically happens in deep neural networks during backpropagation, when gradients are propagated backward through many layers and produce very large updates to the weights. The model parameters can then grow so large that they cause numerical instability, with the loss blowing up and eventually overflowing to infinity (`inf`). The run above has only a single linear layer, so the blow-up there comes from the learning rate (0.02) being too large for inputs in the range [-100, 100]: every update overshoots the minimum by more than the last, and the loss grows by a factor of roughly 4,000 per iteration.
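For a linear model with a squared-error loss, there is a known threshold on the learning rate: plain gradient descent is stable only when the rate is below 2 divided by the largest eigenvalue of the loss's Hessian. The sketch below estimates that threshold for data drawn from the same ranges as `xs` and `zs`. It regenerates its own inputs with a seeded generator, so the exact number is an approximation and not a value taken from the notebook's run.

```python
import numpy as np

rng = np.random.default_rng(0)
obs = 100_000
inputs = rng.uniform(-100, 100, size=(obs, 2))  # same ranges as xs and zs

# Hessian of the halved mean squared error (the loss whose gradient the
# training loop uses): X^T X / n, with a column of ones for the bias.
design = np.column_stack((inputs, np.ones(obs)))
hessian = design.T @ design / obs
largest_eigenvalue = np.linalg.eigvalsh(hessian).max()

max_stable_rate = 2 / largest_eigenvalue
print(max_stable_rate)  # roughly 0.0006
```

On this estimate a learning rate of 0.02 is far above the threshold, which matches the diverging loss, while 0.000009 is far below it.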



In [9]:

init_range = 0.002

weights = np.random.uniform(low=-init_range, high=init_range, size=(2, 1))

bias = np.random.uniform(low=-init_range, high=init_range, size=1)

learning_rate = 0.000009

for i in range(100):
    outputs = np.dot(inputs,weights) + bias
    delta = outputs - targets
    
    loss = np.sum(delta ** 2)

    print(f"Iteration {i+1}: Loss = {loss}")

    delta_scaled = delta / obs

    weights = weights - learning_rate * np.dot(inputs.T,delta_scaled)
    bias = bias - learning_rate * np.sum(delta_scaled)

print("Final weights:", weights)
print("Final bias:", bias)

Iteration 1: Loss = 131500265099.56976
Iteration 2: Loss = 123755192183.14716
Iteration 3: Loss = 116467459944.08029
Iteration 4: Loss = 109610062750.02641
Iteration 5: Loss = 103157589632.19342
Iteration 6: Loss = 97086130121.53886
Iteration 7: Loss = 91373185645.28008
Iteration 8: Loss = 85997586155.38225
Iteration 9: Loss = 80939411680.08014
Iteration 10: Loss = 76179918507.7304
Iteration 11: Loss = 71701469729.45924
Iteration 12: Loss = 67487469883.22073
Iteration 13: Loss = 63522303457.08015
Iteration 14: Loss = 59791277023.837845
Iteration 15: Loss = 56280564792.56492
Iteration 16: Loss = 52977157375.28454
Iteration 17: Loss = 49868813578.94631
Iteration 18: Loss = 46944015044.0526
Iteration 19: Loss = 44191923561.843
Iteration 20: Loss = 41602340911.87086
Iteration 21: Loss = 39165671071.14337
Iteration 22: Loss = 36872884654.78662
Iteration 23: Loss = 34715485456.4651
Iteration 24: Loss = 32685478964.566803
Iteration 25: Loss = 30775342737.48603
Iteration 26: Loss = 28977998528

### Here's an example of the vanishing gradient problem
The vanishing gradient problem occurs when the gradients of the loss function with respect to the model parameters become very small during training. This leads to extremely small updates to the model parameters, effectively causing the training to slow down or stop. This is particularly problematic in deep neural networks, where the gradients can diminish as they are propagated back through many layers. In the run above the gradients themselves are not small. The tiny learning rate (0.000009) produces the same symptom of very small updates, so the loss falls by only about 6% per iteration.
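To see how gradients can diminish through many layers, consider a toy chain of sigmoid units. The derivative of the sigmoid is at most 0.25, and by the chain rule the gradient through a stack of them is a product of such factors. The sketch assumes unit weights and a pre-activation of 0 at every layer, which is a simplification of a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Gradient factor after passing back through `depth` sigmoid units.
factors = {depth: sigmoid_derivative(0.0) ** depth for depth in (1, 5, 10, 20)}
for depth, factor in factors.items():
    print(depth, factor)  # 0.25 at depth 1, below 1e-6 by depth 10
```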

In [22]:
init_range = 0.1

weights = np.random.uniform(low=-init_range, high=init_range, size=(2, 1))

bias = np.random.uniform(low=-init_range, high=init_range, size=1)

learning_rate = 0.0065

for i in range(100):
    outputs = np.dot(inputs,weights) + bias
    delta = outputs - targets
    
    loss = np.sum(delta ** 2)

    print(f"Iteration {i+1}: Loss = {loss}")

    delta_scaled = delta / obs

    weights = weights - learning_rate * np.dot(inputs.T,delta_scaled)
    bias = bias - learning_rate * np.sum(delta_scaled)

print("Final weights:", weights)
print("Final bias:", bias)

Iteration 1: Loss = 132123550221.1919
Iteration 2: Loss = 56185205358971.375
Iteration 3: Loss = 2.3953539586338244e+16
Iteration 4: Loss = 1.0212237440193829e+19
Iteration 5: Loss = 4.353845371030912e+21
Iteration 6: Loss = 1.8562051838721366e+24
Iteration 7: Loss = 7.913704759003266e+26
Iteration 8: Loss = 3.373918496840408e+29
Iteration 9: Loss = 1.4384348955496382e+32
Iteration 10: Loss = 6.132629271207012e+34
Iteration 11: Loss = 2.6145929748519184e+37
Iteration 12: Loss = 1.1147111469716714e+40
Iteration 13: Loss = 4.752492854764001e+42
Iteration 14: Loss = 2.026196096266373e+45
Iteration 15: Loss = 8.638579768363169e+47
Iteration 16: Loss = 3.6830201253879802e+50
Iteration 17: Loss = 1.5702424192993284e+53
Iteration 18: Loss = 6.694685609341465e+55
Iteration 19: Loss = 2.85426659590427e+58
Iteration 20: Loss = 1.2169136102003164e+61
Iteration 21: Loss = 5.188308909904365e+63
Iteration 22: Loss = 2.2120389315974415e+66
Iteration 23: Loss = 9.43106130715007e+68
Iteration 24: Loss 

### A kinda-sorta(ish) example of a smoother/stabler gradient.
This means that the gradients are neither too large (exploding) nor too small (vanishing). A stable gradient flow ensures that the model parameters are updated in a balanced manner, allowing for efficient and effective training. Here are some strategies to achieve this: