## There are certain important concepts to learn!

# 1. Gradient descent:
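- Gradient descent is an iterative optimization algorithm used to minimize a function f(θ) (the loss function in ML). At each step it moves θ a little in the direction of the negative gradient: θ ← θ − η∇f(θ), where η is the learning rate (step size).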
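
To make the idea concrete, here is a minimal sketch of gradient descent in plain Python. The loss `f(theta) = (theta - 3)^2`, the starting point, and the learning rate are assumptions picked only for illustration, not values from these notes.

```python
def gradient_descent(grad, theta0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient: theta <- theta - learning_rate * grad(theta)."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Hypothetical example loss f(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
def grad_f(theta):
    return 2.0 * (theta - 3.0)

theta_min = gradient_descent(grad_f, theta0=0.0)
print(theta_min)  # very close to 3.0, the minimizer of f
```

With a learning rate that is too large the steps can overshoot and diverge, and with one that is too small progress becomes very slow, so in practice this value usually needs some tuning.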

# 2. Types of Gradient Descent
## (a) Batch Gradient Descent

- Uses the entire dataset to calculate the slope (gradient) at each step.

- Each step follows the exact gradient of the training loss, but computing it is very slow for large datasets.
- Analogy:
  - Imagine you’re blindfolded on a hill. Before every step, you ask everyone in the whole city which way is downhill. Accurate advice, but it takes forever!

## (b) Stochastic Gradient Descent (SGD)

- Uses one random data point at a time to calculate the gradient.

- Each step is very cheap, but the gradient estimate is noisy (you may wobble left and right instead of smoothly going down).

- Analogy:
  - Now, instead of asking the whole city, you just ask one random passerby which way is downhill.

  - Sometimes they’re right, sometimes they’re wrong.

  - But if you keep walking, on average you still head downhill and usually end up near the bottom.

## (c) Mini-Batch Gradient Descent

- Uses a small group of data points at each step (like 32, 64, 128 samples).

- A compromise between batch and SGD: not too slow, not too noisy.

- Analogy:
  - Now you ask a small group of 10 hikers around you.

  - Not as slow as asking the whole city.

  - More reliable than asking just one random person.
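
The three variants differ only in how many data points feed each gradient step, so one function with a `batch_size` parameter can sketch all of them. This is a rough illustration with NumPy on made-up linear-regression data. The function name `minibatch_gd`, the synthetic data, and the hyperparameters are all assumptions.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, learning_rate=0.05, n_epochs=200, seed=0):
    """Fit linear-regression weights w by gradient descent on mean squared error.

    batch_size = len(X) -> batch gradient descent
    batch_size = 1      -> stochastic gradient descent
    anything in between -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # MSE gradient on this batch
            w -= learning_rate * grad
    return w

# Synthetic data (assumed for illustration): y = 2*x + 1 plus a little noise.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([x, np.ones_like(x)])  # the column of ones models the intercept
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=200)

w_batch = minibatch_gd(X, y, batch_size=len(X))  # (a) batch
w_sgd = minibatch_gd(X, y, batch_size=1)         # (b) stochastic
w_mini = minibatch_gd(X, y, batch_size=32)       # (c) mini-batch
print(w_batch, w_sgd, w_mini)  # each should land near [2, 1]
```

All three should end up near the same weights here. What changes is the cost and noisiness of each step: the batch run takes one exact step per epoch, while the stochastic run takes 200 cheap, noisy steps per epoch.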

# 3. Global minimum, Local minimum, Local maximum:

![image.png](attachment:image.png)

Here’s the visual analogy of what we just discussed:

- The blue curve is your loss function (like the shape of mountains and valleys).

- The red dot is the global minimum → deepest valley (lowest error).

- The orange dot is a local minimum → a smaller valley where you might get stuck.

- The green dot is a local maximum → a small hill.

- 👉 Gradient descent is like walking downhill on this curve until you (hopefully) reach the red dot. Depending on where you start, you can also end up stuck in the orange valley instead.
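
Here is a small sketch of how the starting point decides which valley gradient descent ends up in. The curve `f(x) = x^4 - 3x^2 + x` is an assumed stand-in with one global minimum, one local minimum, and one local maximum. It is not the exact curve from the image.

```python
def f(x):
    # Assumed example curve: global minimum near x = -1.30,
    # local maximum near x = 0.17, local minimum near x = 1.13.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def descend(x, learning_rate=0.01, n_steps=2000):
    """Plain gradient descent from starting point x."""
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

x_left = descend(-2.0)   # starts on the left slope
x_right = descend(2.0)   # starts on the right slope

print(x_left, f(x_left))    # settles near x = -1.30, the global minimum
print(x_right, f(x_right))  # settles near x = 1.13, a local minimum with higher loss
```

Both runs stop where the slope is zero, but only the left one finds the deepest valley. Noisy updates (as in SGD) or several random restarts are common ways to reduce the chance of getting stuck.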

# 4. Bias and Variance Trade-off:

Imagine you’re playing darts.

- The bullseye = the true function (perfect predictions).
- Your dart throws = predictions made by your ML model.

Now let’s compare two cases:

## 1. High Bias (Underfitting)

- Your darts all land far from the bullseye, but they are close to each other.
- This means your model is too simple and ignores important patterns.
- Example: Using a straight line to fit a highly curved dataset.

👉 Bias = error from incorrect assumptions (model is too simple).

## 2. High Variance (Overfitting)

- Your darts are spread all over the board (some close, some far).
- The model is too complex: it learns not just the patterns but also the noise in the training data.
- Example: A high-degree polynomial that perfectly passes through every training point.

👉 Variance = error from sensitivity to fluctuations in training data.

## 3. Ideal Case (Balanced Bias & Variance)

- Your darts land close to the bullseye and tightly grouped.
- The model generalizes well: not too simple, not too complex.
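
The three cases can be sketched by fitting polynomials of different degree to the same noisy data and comparing training error with test error. This is only an illustration: the sine-shaped "true function", the noise level, and the chosen degrees are assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_function(x):
    # Hypothetical "highly curved" ground truth, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

n_train = 12
x_train = np.linspace(0.0, 1.0, n_train)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=n_train)
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=200)

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

train_mse, test_mse = {}, {}
# degree 1: straight line; degree 3: moderate; degree n_train - 1: hits every training point
for degree in (1, 3, n_train - 1):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse[degree] = mse(model, x_train, y_train)
    test_mse[degree] = mse(model, x_test, y_test)
    print(f"degree {degree:2d}: train MSE = {train_mse[degree]:.4f}, "
          f"test MSE = {test_mse[degree]:.4f}")
```

The straight line should do poorly on both sets (underfitting). The degree-11 polynomial drives the training error to essentially zero but typically does much worse on the test points (overfitting). The moderate degree usually sits in between, with low error on both.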

For squared-error loss, the expected error of a model can be decomposed into 3 parts:

**Expected Error = Bias² + Variance + Irreducible Error**

**Bias² →** Error due to overly simplistic assumptions.
High bias = underfitting.

**Variance →** Error due to too much complexity (model fits noise).
High variance = overfitting.

**Irreducible Error →** Comes from noise in the data itself (you can’t eliminate this).
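
One way to see the decomposition in numbers is to simulate it: draw many training sets, fit the same kind of model to each, and measure how far the average prediction is from the truth (bias²) and how much predictions scatter across training sets (variance). This is a rough sketch. The sine ground truth, the noise level `noise_sd`, the sample sizes, and the helper name `bias_variance` are all assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
noise_sd = 0.2  # assumed noise level; noise_sd**2 is the irreducible error

def true_function(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2.0 * np.pi * x)

x_eval = np.linspace(0.05, 0.95, 50)  # fixed points where predictions are compared

def bias_variance(degree, n_train=30, n_repeats=300):
    """Estimate bias^2 and variance of a polynomial fit by resampling training sets."""
    predictions = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = true_function(x) + rng.normal(scale=noise_sd, size=n_train)
        model = Polynomial.fit(x, y, deg=degree, domain=[0.0, 1.0])
        predictions[r] = model(x_eval)
    mean_prediction = predictions.mean(axis=0)
    bias_sq = float(np.mean((mean_prediction - true_function(x_eval)) ** 2))
    variance = float(np.mean(predictions.var(axis=0)))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 10)}
for degree, (bias_sq, variance) in results.items():
    total = bias_sq + variance + noise_sd ** 2
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"expected error = {total:.4f}")
```

The straight line should show large bias² and small variance, while the degree-10 fit should show the opposite. The `noise_sd ** 2` term stays the same no matter which model is used.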

## ⚖️ The Tradeoff

- If you reduce bias (make the model more complex), variance usually increases.

- If you reduce variance (simplify the model), bias usually increases.

- 👉 The goal is to find the sweet spot: a model complex enough to capture patterns, but simple enough to generalize.