# Gradient Descent Algorithm

Gradient Descent is an optimization algorithm used to minimize the loss function in machine learning and deep learning. It iteratively adjusts the parameters (such as weights in a model) to reduce the error (or loss) between predicted outputs and actual results. It is one of the most widely used techniques for training models.

## How Gradient Descent Works

1. **Objective (Loss) Function:**
   Gradient Descent is used to minimize a **loss function** (also known as the cost function or objective function). The loss function measures the difference between the predicted and actual outputs. In a regression model, for instance, this could be the Mean Squared Error (MSE).

2. **Gradient Computation:**
   The **gradient** is the vector of partial derivatives of the loss function with respect to the model parameters (e.g., weights). It points in the direction of the steepest increase in the loss, and each component indicates how sensitive the loss is to a small change in the corresponding parameter.

3. **Parameter Update:**
   Once the gradient is calculated, the model's parameters are updated in the opposite direction of the gradient to minimize the loss function. The size of each step is determined by the **learning rate**.

   The parameter update rule is:
   $$
   \theta = \theta - \alpha \cdot \nabla_{\theta} J(\theta)
   $$
   where:
   - $\theta$ is the model parameter (e.g., weight),
   - $\alpha$ is the learning rate,
   - $\nabla_{\theta} J(\theta)$ is the gradient of the loss function $J(\theta)$ with respect to $\theta$.

4. **Repeat:**
   The process of calculating the gradient and updating the parameters is repeated for a predefined number of iterations or until the algorithm converges to a minimum.
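The four steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the loss $J(\theta) = (\theta - 3)^2$, the function names, and the default values of `alpha` and `n_iters` are all assumptions chosen for the example.

```python
def gradient_descent(grad, theta0, alpha=0.1, n_iters=100):
    """Repeat theta <- theta - alpha * grad(theta) for a fixed number of iterations."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - alpha * grad(theta)  # step against the gradient
    return theta


# Hypothetical loss J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(grad_J, theta0=0.0)
print(theta_min)  # very close to 3, the minimizer of J
```

Passing the gradient in as a function keeps the update loop independent of any particular loss.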

## Types of Gradient Descent

There are three main types of Gradient Descent, each varying in the way the gradient is calculated and the parameters are updated:

### 1. **Batch Gradient Descent (BGD):**
   - In **batch gradient descent**, the gradient is calculated using the entire dataset for every update. With a suitable learning rate this gives stable, smooth convergence, but each update can be computationally expensive for large datasets.

### 2. **Stochastic Gradient Descent (SGD):**
   - In **stochastic gradient descent**, the gradient is calculated using a single data point at each iteration. This speeds up the learning process but introduces high variance, causing more oscillations in the loss function curve.

### 3. **Mini-Batch Gradient Descent:**
   - **Mini-batch gradient descent** is a compromise between batch and stochastic gradient descent. It uses a small, random subset (mini-batch) of the data to compute the gradient. This approach balances the stability of batch gradient descent with the speed of stochastic gradient descent.
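The three variants differ only in how many examples feed each gradient estimate, so one training loop with a `batch_size` argument can cover all of them. The sketch below assumes a one-parameter model $h(x) = \theta x$, a half mean squared error loss, and a small noise-free dataset; all names and values are hypothetical.

```python
import random


def batch_gradient(theta, xs, ys):
    """Gradient of the half mean squared error for the model h(x) = theta * x."""
    n = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / n


def train(xs, ys, batch_size, alpha=0.05, epochs=200, seed=0):
    """batch_size == len(xs): batch GD; 1: SGD; anything in between: mini-batch."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    theta = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            theta -= alpha * batch_gradient(theta, bx, by)
    return theta


# Hypothetical noise-free data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

theta_batch = train(xs, ys, batch_size=len(xs))  # one update per epoch
theta_sgd = train(xs, ys, batch_size=1)          # one update per example
theta_mini = train(xs, ys, batch_size=2)         # one update per pair of examples
```

Because this data is noise-free, all three settings approach the same value of `theta`. On noisy data, the smaller batch sizes would typically fluctuate around the minimum rather than settle exactly on it.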

## Key Parameters in Gradient Descent

### 1. **Learning Rate ($\alpha$):**
   - The learning rate controls the size of the steps taken against the direction of the gradient. If the learning rate is too small, convergence will be slow. If it's too large, the algorithm may overshoot the optimal solution, oscillate around it, or even diverge.
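The effect of the learning rate is easy to see on a toy problem. The sketch below assumes the hypothetical loss $J(\theta) = \theta^2$ (gradient $2\theta$, minimum at $\theta = 0$) and three illustrative values of `alpha`; none of these choices come from a particular library.

```python
def run_gradient_descent(alpha, theta0=1.0, n_iters=20):
    """Minimize the hypothetical loss J(theta) = theta^2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(n_iters):
        theta -= alpha * 2.0 * theta
    return theta


too_small = run_gradient_descent(alpha=0.01)  # still far from the minimum at 0
reasonable = run_gradient_descent(alpha=0.5)  # reaches the minimum
too_large = run_gradient_descent(alpha=1.1)   # every step overshoots; |theta| grows

print(too_small, reasonable, too_large)
```

For this particular loss, each update multiplies `theta` by `1 - 2 * alpha`, which is why the iterates shrink slowly for `alpha=0.01` and grow in magnitude for `alpha=1.1`.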

### 2. **Convergence:**
   - The algorithm is considered to have converged when the change in the loss function or in the parameters between iterations becomes very small, indicating that it has reached, or is close to, a minimum. For non-convex loss functions this may be a local minimum rather than the global one.
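One common way to turn this criterion into code is to stop when the size of the parameter update falls below a tolerance. The sketch below is an assumption-laden example: the tolerance `tol`, the cap `max_iters`, and the test loss $J(\theta) = (\theta - 3)^2$ are all hypothetical choices.

```python
def gradient_descent_until_converged(grad, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Stop when the update is smaller than tol, or after max_iters updates."""
    theta = theta0
    for i in range(max_iters):
        step = alpha * grad(theta)
        theta -= step
        if abs(step) < tol:  # parameters have (almost) stopped changing
            return theta, i + 1
    return theta, max_iters


# Hypothetical loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta, n_steps = gradient_descent_until_converged(lambda t: 2.0 * (t - 3.0), theta0=0.0)
print(theta, n_steps)
```

The `max_iters` cap guards against running forever when the tolerance is never reached, for example when the learning rate is too large.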

## Mathematical Intuition Behind Gradient Descent

In the simplest case of **linear regression**, the objective function is the **Mean Squared Error (MSE)** between the predicted and actual values, conventionally scaled by $\frac{1}{2}$ so that the factor of 2 cancels when differentiating:
$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$
where:
- $J(\theta)$ is the cost function,
- $h_\theta(x^{(i)})$ is the model's prediction for input $x^{(i)}$,
- $y^{(i)}$ is the actual output for the $i$-th example,
- $m$ is the total number of training examples.

For a linear hypothesis $h_\theta(x) = \theta^\top x$, the gradient of the cost function $J(\theta)$ with respect to the parameters $\theta$ (weights) is:
$$
\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}
$$

Using this gradient, the parameters are updated as:
$$
\theta = \theta - \alpha \cdot \nabla_\theta J(\theta)
$$
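The cost, gradient, and update formulas above translate directly into code. The sketch below assumes a model with two parameters, $h_\theta(x) = \theta_0 + \theta_1 x$ (so each input is implicitly $(1, x)$), and a tiny hypothetical dataset. The function names, learning rate, and iteration count are illustrative choices.

```python
def predict(theta, x):
    """Linear hypothesis h_theta(x) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def cost(theta, xs, ys):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta, xs, ys):
    """grad J(theta) = 1/m * sum of (prediction error * input), with input (1, x)."""
    m = len(xs)
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    g0 = sum(errors) / m
    g1 = sum(e * x for e, x in zip(errors, xs)) / m
    return [g0, g1]


def fit(xs, ys, alpha=0.1, n_iters=5000):
    """Apply theta <- theta - alpha * grad J(theta) for n_iters iterations."""
    theta = [0.0, 0.0]
    for _ in range(n_iters):
        g = gradient(theta, xs, ys)
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta = fit(xs, ys)
print(theta, cost(theta, xs, ys))  # theta approaches [1, 2] and the cost approaches 0
```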

## Applications of Gradient Descent

Gradient Descent is used extensively in machine learning and deep learning algorithms, including:
- **Linear Regression**: To minimize the cost function and find the optimal weights.
- **Logistic Regression**: To optimize the parameters for binary classification.
- **Neural Networks**: To adjust the weights and biases of the network using backpropagation and gradient descent.
- **Support Vector Machines (SVM)**: To find the optimal hyperplane for classification tasks.

## Conclusion

Gradient Descent is a powerful and versatile optimization algorithm that allows machine learning models to find the optimal parameters by iteratively minimizing a loss function. By understanding and choosing the right type of gradient descent and tuning its parameters, we can train models effectively and efficiently.


___

- In linear regression, we optimize the intercept and slope.
- In logistic regression, we optimize the squiggle (the fitted curve).
- In t-SNE, we optimize the clusters.


___
- In linear regression, for example:
  - We will start by using gradient descent to find the intercept.
  - Then, once we understand how gradient descent works, we will use it to solve for both the intercept and the slope.
  - E.g., for a fixed slope of 0.64, find the optimal value for the intercept.
  - First, pick a random value for the intercept.
  - Calculate the loss.
  - Plot the different intercepts on the x-axis and the corresponding loss on the y-axis.

![](images/grad.png)

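- Select the point with the lowest loss.
- Gradient descent is more efficient: it does only a few calculations far from the optimal solution and increases the number of calculations closer to the optimal value. In other words, it takes large steps when far from the minimum and small steps when close to it.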
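The brute-force approach from the notes (try many intercepts, compute the loss for each, pick the lowest) can be sketched as follows. The three data points, the grid of candidate intercepts, and the choice of the sum of squared residuals as the loss are assumptions made for illustration. Only the fixed slope of 0.64 comes from the notes.

```python
SLOPE = 0.64  # fixed slope from the notes

# Hypothetical data points (x, y), chosen only for illustration.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]


def loss(intercept):
    """Sum of squared residuals for the line y = intercept + SLOPE * x."""
    return sum((y - (intercept + SLOPE * x)) ** 2 for x, y in data)


# Evaluate the loss on an evenly spaced grid of candidate intercepts: 0.0, 0.1, ..., 2.0.
candidates = [i / 10 for i in range(21)]
losses = [loss(b) for b in candidates]

# Select the candidate with the lowest loss.
best_intercept = min(candidates, key=loss)
print(best_intercept, loss(best_intercept))
```

The grid spends the same effort everywhere, and its answer can only be as precise as the grid spacing. That is the inefficiency gradient descent avoids.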

![](images/gr_eff.png)

- We get an equation for the curve (the loss as a function of the intercept).
- Calculating the derivative of this function at any point gives the slope of the curve at that point.
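To make the last two notes concrete, here is a hedged sketch of gradient descent on the intercept alone. It assumes the loss is the sum of squared residuals, uses three hypothetical data points, and picks a learning rate of 0.1 and a starting intercept of 0. Only the fixed slope of 0.64 comes from the notes. The derivative of the loss with respect to the intercept $b$ is $\sum_i -2\,(y_i - (b + 0.64\,x_i))$.

```python
SLOPE = 0.64  # fixed slope from the notes

# Hypothetical data points (x, y), chosen only for illustration.
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]


def d_loss_d_intercept(intercept):
    """Derivative of the sum of squared residuals with respect to the intercept."""
    return sum(-2.0 * (y - (intercept + SLOPE * x)) for x, y in data)


intercept = 0.0  # starting guess (an assumption)
alpha = 0.1      # learning rate (an assumption)
step_sizes = []

for _ in range(50):
    step = alpha * d_loss_d_intercept(intercept)  # step size = learning rate * slope of the curve
    intercept -= step
    step_sizes.append(abs(step))

print(intercept)
print(step_sizes[:3])  # the steps shrink as the slope of the curve flattens
```

The step size is proportional to the derivative, so the updates are large where the curve is steep (far from the minimum) and small where it is flat (near the minimum).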
