# 1️⃣ What Is Statistical Learning?

Statistical learning is about learning patterns from data.
We generally have two main components:
* **Inputs (features):** usually written as $X$
* **Output (target):** usually written as $Y$

We assume there is a hidden, true relationship between them:

$$
Y = f(X) + \epsilon
$$

### 🔍 Explanation of Symbols
* **$Y$**: The output we care about (e.g., sales, marks, house price).
* **$X$**: One feature or a collection of features (e.g., hours studied, house size, location).
* **$f(X)$**: The **true but unknown function** that links $X$ to $Y$. This is the "signal."
* **$\epsilon$ (epsilon)**: Random **noise**. These are things we can’t measure or control (luck, mood, measurement errors). It is assumed to have mean zero and to be independent of $X$.

**Our job in statistical learning is to use data to estimate $f$.**



---
### 💡 Simple Examples

**1. Study vs Marks**
* $X$ = hours of study
* $Y$ = exam score
$$
\text{Score} = f(\text{Hours}) + \epsilon
$$

**2. House Features vs Price**
* $X$ = (area in sq ft, number of bedrooms, distance to metro)
* $Y$ = house price
$$
\text{Price} = f(\text{Area, Bedrooms, Distance}) + \epsilon
$$

*In practice, we never know the true $f$, but we estimate it using data!*
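
To make the equation concrete, here is a minimal simulation sketch. The function `true_f`, the noise level, and the sample size are made-up assumptions for illustration; in a real problem $f$ is unknown and only the observed $(X, Y)$ pairs are available.

```python
import random

def true_f(hours):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 40.0 + 5.0 * hours

def simulate(n, noise_sd, seed=0):
    """Generate n (x, y) pairs following Y = f(X) + epsilon."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)        # hours of study
        eps = rng.gauss(0.0, noise_sd)    # mean-zero random noise
        data.append((x, true_f(x) + eps))
    return data

data = simulate(n=5, noise_sd=3.0)
for x, y in data:
    print(f"hours={x:.2f}  signal f(x)={true_f(x):.2f}  observed y={y:.2f}")
```

The gap between the `signal` and `observed` columns is exactly the noise term $\epsilon$ for that student.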

# 2️⃣ Why Do We Estimate $f(X)$? (Prediction vs Inference)

We estimate $f(X)$ mainly for two reasons:

## A. Prediction 🎯
We learn an estimate of the true function $f$ and call it $\hat f$ (pronounced "f-hat").
For a new input $X$, our prediction is:

$$
\hat Y = \hat f(X)
$$

* **$\hat f$**: The function we learned (e.g., a fitted Linear Regression model); $\hat f(X)$ is its output for input $X$.
* **$\hat Y$**: The predicted value.

**Example:**
Given a student with 5 hours of study, we plug $X = 5$ into $\hat f$:
$$
\hat Y = \hat f(5) \quad \text{(This is our predicted score)}
$$
*Goal: Minimize the difference between $\hat Y$ and true $Y$.*

## B. Inference 🔍
Here, we don't just want a number; we want to **understand the relationship**.

**Typical questions:**
* Does increasing TV ads increase sales more than radio ads?
* Which features are the most important?
* Is the relationship positive or negative?



---

### 📉 Reducible vs Irreducible Error
Even with a perfect model, predictions are never 100% accurate because of noise ($\epsilon$).
For a fixed estimate $\hat f$ and a fixed input $X$, the expected squared error is:

$$
\mathbb{E}\left[(Y - \hat Y)^2\right] = \underbrace{[f(X) - \hat f(X)]^2}_{\text{Reducible Error}} + \underbrace{\text{Var}(\epsilon)}_{\text{Irreducible Error}}
$$

| Error Type | Meaning | Can we fix it? |
| :--- | :--- | :--- |
| **Reducible** | Error because our model $\hat f$ isn't perfect. | **Yes**, by using better algorithms/data. |
| **Irreducible** | Error caused by natural noise ($\epsilon$) in the world. | **No**, this is pure randomness. |
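
A quick numerical check of this decomposition, as a sketch under stated assumptions: both `f` and `f_hat` below are hypothetical functions picked for illustration, and the noise is assumed to be Gaussian with standard deviation 2.

```python
import random

def f(x):
    # Hypothetical true function.
    return 2.0 * x

def f_hat(x):
    # Hypothetical imperfect estimate of f.
    return 1.5 * x + 1.0

NOISE_SD = 2.0   # assumed standard deviation of epsilon
x0 = 4.0         # the fixed input we evaluate at
rng = random.Random(42)

# Monte Carlo estimate of E[(Y - Y_hat)^2] at X = x0.
n_draws = 200_000
total = 0.0
for _ in range(n_draws):
    y = f(x0) + rng.gauss(0.0, NOISE_SD)
    total += (y - f_hat(x0)) ** 2
expected_sq_error = total / n_draws

reducible = (f(x0) - f_hat(x0)) ** 2   # (8 - 7)^2 = 1
irreducible = NOISE_SD ** 2            # Var(epsilon) = 4

print(f"simulated expected squared error: {expected_sq_error:.3f}")
print(f"reducible + irreducible:          {reducible + irreducible:.3f}")
```

Improving `f_hat` can push the reducible part toward 0, but the simulated error can never drop below `irreducible`.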

# 3Ô∏è‚É£ Parametric Methods (Assume a Formula)

Parametric methods involve a two-step process:
1.  **Assume a shape** for $f(X)$ (usually a line).
2.  **Estimate the parameters** (coefficients) from the data.

The most common is the **Linear Model**:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
$$

### 📝 Explanation of Symbols
* **$\beta_0$**: Intercept (baseline value of $Y$ when all $X$ are 0).
* **$\beta_1 \dots \beta_p$**: Coefficients (the average change in $Y$ for a one-unit increase in that feature, holding the other features fixed).
* **$p$**: Number of features.

We use training data to find the best values for $\beta$. The most common method is **Least Squares**.
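
As a rough sketch of what least squares does in the one-feature case, the closed-form estimates are $\hat\beta_1 = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. The data and the function name below are invented for illustration.

```python
def fit_simple_least_squares(xs, ys):
    """Closed-form least squares fit of y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Made-up training data: hours studied vs exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 55.0, 61.0, 64.0, 68.0]

beta0, beta1 = fit_simple_least_squares(hours, scores)
predicted_score = beta0 + beta1 * 5.0   # prediction for 5 hours of study
print(f"intercept={beta0:.2f}  slope={beta1:.2f}  prediction={predicted_score:.2f}")
```

On this toy data the fitted slope is 4.1 and the intercept is 47.7, so the estimated model predicts a score of about 68.2 for 5 hours of study.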



[Image of linear regression fitted line through data points]


### Example: Predicting Salary 💰
$$
\text{Salary} = \beta_0 + \beta_1 (\text{Education}) + \beta_2 (\text{Experience}) + \epsilon
$$
* **Interpretation of $\beta_1$**: How much salary changes (on average) for every 1 extra year of education, holding experience fixed.

### ✅ Pros & ❌ Cons
* **Pros:** Simple, fast, very easy to interpret (good for inference).
* **Cons:** If the real world is complex (non-linear), this model will perform poorly (high bias).

# 4️⃣ Non-Parametric Methods (Let Data Decide)

Non-parametric methods **do not** assume a fixed formula like a straight line. Instead, they try to follow the data points closely without being too rough or wiggly, allowing the data to determine the shape of the curve.

We still write $Y \approx \hat f(X)$, but $\hat f$ is flexible (e.g., K-Nearest Neighbors, Splines, Decision Trees).

### Example: Study Hours vs Marks (Curved)
* **Reality:** First 2–3 hours $\to$ big improvement. After 9 hours $\to$ no improvement (flattening).
* A linear model forces a straight line (incorrect).
* A non-parametric model bends to fit the curve.
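
A minimal sketch of one non-parametric method, K-Nearest Neighbors regression, which predicts by averaging the targets of the `k` closest training points. The scores below are invented to mimic the "fast gains, then flattening" pattern.

```python
def knn_regress(train, x0, k):
    """Average the y-values of the k training points whose x is closest to x0."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Made-up (hours, score) data: scores rise quickly, then plateau.
train = [(0, 30), (1, 45), (2, 58), (3, 66), (4, 71), (5, 74),
         (6, 76), (7, 77), (8, 78), (9, 78), (10, 78)]

# Gain from one extra hour, early vs late in the curve.
early_gain = knn_regress(train, 2.0, k=3) - knn_regress(train, 1.0, k=3)
late_gain = knn_regress(train, 9.0, k=3) - knn_regress(train, 8.0, k=3)
print(f"gain from hour 1 to 2: {early_gain:.2f}")
print(f"gain from hour 8 to 9: {late_gain:.2f}")
```

No formula for the curve was assumed, yet the predictions bend with the data: the early gain is large and the late gain is close to zero. A straight line would force the same gain everywhere.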



### ✅ Pros & ❌ Cons
* **Pros:** Very flexible, fits complex/weird patterns well.
* **Cons:** Needs lots of data, harder to interpret, and high risk of **Overfitting**.

# 5️⃣ Overfitting (Too Good to be True)

Overfitting happens when a model learns the **noise** in the training data rather than the **signal**.

**Imagine we have 10 students:**
* A very flexible model might draw a wiggly line that touches all 10 points perfectly (0% Training Error).
* But for the 11th student (Test Data), the prediction is way off because the model was too focused on the specific quirks of the first 10 students.

**Key Idea:**
* **Overfitted Model:** Low Training Error, High Test Error.
* **Good Model:** Low Training Error *and* Low Test Error (it generalizes to new data).



[Image of underfitting vs overfitting plots]


*Both parametric and non-parametric methods can overfit, but non-parametric methods are riskier because they are so flexible.*

# 6️⃣ Flexibility vs. Interpretability Trade-off

There is a classic trade-off in machine learning:
* **Flexibility:** How curvy/complex can the model be?
* **Interpretability:** How easy is it to explain *why* the model made a prediction?

| Model Type | Flexibility | Interpretability | Examples |
| :--- | :--- | :--- | :--- |
| **Low Flex** | Low | **High** | Linear Regression, Lasso |
| **Medium Flex** | Medium | Medium | GAMs (Generalized Additive Models) |
| **High Flex** | **High** | Low (Black Box) | Random Forests, Neural Networks, SVMs |



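**Rule of Thumb:**
* Care about **Understanding** (Inference)? $\to$ Use Simpler Models.
* Care about **Accuracy** (Prediction)? $\to$ Flexible Models are often worth trying, but check them on test data: a simpler model sometimes predicts better because it overfits less.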

# 7️⃣ Supervised vs. Unsupervised Learning

### 🅰️ Supervised Learning
We have both **Inputs ($X$)** and **Correct Answers ($Y$)**.
* **Goal:** Learn a function $\hat Y = \hat f(X)$.
* **Examples:** Predicting house prices (Regression), Email Spam detection (Classification).
* *We "supervise" the model by checking its answers against the true $Y$.*

### 🅱️ Unsupervised Learning
We only have **Inputs ($X$)**. We have **NO** labels ($Y$).
* **Goal:** Find hidden structure or groups in the data.
* **Typical Task:** **Clustering**.

**Example:** Customer Segmentation
We have data on: *Number of orders, Average cart value*.
The algorithm might discover 3 groups:
1.  Heavy Spenders
2.  Occasional Buyers
3.  Window Shoppers
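
The notes do not name a specific clustering algorithm; the sketch below uses k-means as one common choice, purely for illustration. The customer numbers and the starting centroids are made up, and the group names are only our interpretation of the clusters, since the algorithm never sees labels.

```python
import math

def kmeans(points, centroids, n_iters=10):
    """Very small k-means: assign each point to its nearest centroid, then re-average."""
    clusters = []
    for _ in range(n_iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        centroids = new_centroids
    return centroids, clusters

# Made-up customers: (number of orders, average cart value).
customers = [(20, 90), (22, 95), (18, 85),
             (5, 40), (6, 45), (4, 38),
             (1, 5), (0, 8), (1, 6)]

# Hypothetical starting guesses for the 3 centroids.
start = [customers[0], customers[3], customers[6]]
centroids, clusters = kmeans(customers, start)
for centroid, cluster in zip(centroids, clusters):
    print(f"centroid={centroid}  members={cluster}")
```

Real data rarely separates this cleanly, and the result can depend on the starting centroids and on how the features are scaled.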



[Image of supervised classification vs unsupervised clustering]


*Note: There is also **Semi-Supervised Learning**, where we have a small amount of labeled data and lots of unlabeled data.*

# 8️⃣ Regression vs. Classification

We categorize supervised learning problems based on the **type of Output ($Y$)**.

### 1. Regression ($Y$ is Numeric)
The output is a continuous number.
$$
Y = f(X) + \epsilon
$$
* **Examples:** Predicting stock price, temperature, blood sugar level.
* **Methods:** Linear Regression, Regression Trees.

### 2. Classification ($Y$ is Categorical)
The output belongs to a specific class or category.
$$
Y \in \{ \text{Class 1}, \text{Class 2}, \dots, \text{Class K} \}
$$
* **Examples:** Spam vs Not Spam ($K=2$), Cat vs Dog vs Horse ($K=3$).
* **Methods:** Logistic Regression, KNN, SVM.



**Crucial Note:** The inputs ($X$) can be the same for both.
* Predict probability of rain (0.85) $\to$ **Regression**.
* Predict *will* it rain? (Yes/No) $\to$ **Classification**.


# #️⃣ 2.2 Assessing Model Accuracy

When we build a model, the BIG question is:

👉 **How do we know if our model is good?**

To answer this, we measure **how close predictions are to real values**.
We care mostly about **how well the model performs on new, unseen data** (not the data used for training!).

---

## ⭐ 2.2.1 Measuring the Quality of Fit

### 🎯 What is “Quality of Fit”?

It’s simply **how well predictions match the actual values**.

### 📌 **Mean Squared Error (MSE)**

When we perform regression (predicting a number, like stock price or height), we need a metric to see how far off our predictions are from reality. The most common metric is the **Mean Squared Error (MSE)**.

#### 📘 Formula

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2
$$

### 🔍 What do the symbols mean?

| Symbol                   | Meaning                       |
| ------------------------ | ----------------------------- |
| $n$                      | Number of data points         |
| $y_i$                    | Actual value for point *i*    |
| $\hat{f}(x_i)$           | Predicted value for point *i* |
| $(y_i - \hat{f}(x_i))^2$ | Squared error for point *i*   |

---

### ⚠️ **Training MSE vs Test MSE**

### 🏋️ **Training MSE**

Computed on the **same data used to train** the model.
It **generally keeps decreasing** as the model becomes more flexible (more complicated), even when the extra flexibility is only fitting noise.

📉 **Lower training MSE ≠ better model**.

### 🚀 **Test MSE**

Computed on **new, unseen data**.
This is what truly matters.

---

### 🧩 Example

Suppose a model predicts house prices.

| House | Actual Price | Predicted Price | Error (Actual − Predicted) | Squared Error |
| ----- | ------------ | --------- | ----- | ------------- |
| 1     | 50           | 60        | -10   | 100           |
| 2     | 80           | 70        | 10    | 100           |
| 3     | 100          | 90        | 10    | 100           |

MSE:

$$
\text{MSE} = \frac{100 + 100 + 100}{3} = 100
$$
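
The same calculation as a small sketch in code; the helper name `mean_squared_error` is our own, and the numbers are the three houses from the table above.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

actual = [50, 80, 100]
predicted = [60, 70, 90]

mse = mean_squared_error(actual, predicted)
print(mse)  # 100.0
```

Squaring means the sign of each error does not matter: the -10 and the two +10 errors all contribute 100.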

---

### 📉 Why Low Training MSE Can Be Misleading

If a model is **too complicated**, it “memorizes” the training data.
This is called **overfitting**.

📌 Overfitted models look perfect on training data but fail badly on new data.

Think of a student who memorizes answers but doesn’t actually understand anything.
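
The memorization effect can be seen numerically. The sketch below fits K-Nearest Neighbors regression at several flexibility levels (smaller `K` means more flexible) on simulated data and reports training and test MSE side by side. The true function, noise level, sample sizes, and seed are all assumptions chosen for illustration.

```python
import math
import random

def true_f(x):
    # Hypothetical curved signal: fast gains early, then a plateau.
    return 30.0 + 48.0 * (1.0 - math.exp(-0.5 * x))

def make_data(n, rng, noise_sd=3.0):
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def knn_predict(train, x0, k):
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(60, rng)
test = make_data(200, rng)

results = {}
for k in [60, 20, 10, 5, 3, 1]:   # least flexible -> most flexible
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"K={k:2d}  train MSE={results[k][0]:7.2f}  test MSE={results[k][1]:7.2f}")
```

With `K=1` every training point is its own nearest neighbor, so the training MSE is exactly 0, yet the test MSE stays well above 0. At the other extreme, `K=60` predicts the overall average for everyone and does badly on both.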

This is why **test MSE** typically shows a **U-shaped curve** as model flexibility increases:

* 📉 MSE decreases → ✔
* 📉 MSE reaches a minimum → best model
* 📈 MSE increases again → ❌ overfitting



---

## ⭐ 2.2.2 The Bias–Variance Trade-Off

Probably the **most important concept in machine learning**.

### 🎯 Key Goal

Choose a model that balances:

* **Bias** (error from being too simple)
* **Variance** (error from being too sensitive to the particular training data)

This creates the famous **trade-off**.

---

Why does the Test MSE form that U-shape? It happens because of two competing forces: **Bias** and **Variance**. The total error in any model can be mathematically broken down into three parts.

### 📌 Formula for Expected Test MSE


$$
\mathbb{E}[(y_0 - \hat{f}(x_0))^2]
= \text{Var}(\hat{f}(x_0))
+ [\text{Bias}(\hat{f}(x_0))]^2
+ \text{Var}(\epsilon)
$$


### 🔍 Variables Explained

| Term                            | Meaning                                                                       |
| ------------------------------- | ----------------------------------------------------------------------------- |
| $y_0$                           | Actual value for a test point                                                 |
| $\hat{f}(x_0)$                  | Predicted value for that test point                                           |
| $\text{Var}(\hat{f}(x_0))$      | How much the model’s prediction changes with different training data          |
| $[\text{Bias}(\hat{f}(x_0))]^2$ | How far the model’s predictions are, on average, from the true pattern        |
| $\text{Var}(\epsilon)$          | Irreducible noise (cannot be eliminated by any model)                         |

---

### 🧠 Intuition

* **High bias → underfitting**
  *Model too simple → misses real patterns.*

* **High variance → overfitting**
  *Model too sensitive → sees patterns in noise.*

Ideal model: **low bias + low variance**.
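
Bias and variance can be estimated by simulation when the true function is known. The sketch below refits a KNN regressor on many fresh training sets and looks at how its prediction at one test point behaves. The true function `sin(x)`, the noise level, the test point, and the sample sizes are all assumptions made for illustration.

```python
import math
import random

NOISE_SD = 0.5   # assumed noise level
X0 = 1.5         # test point where bias and variance are measured

def true_f(x):
    # Hypothetical true function; unknown in real problems.
    return math.sin(x)

def knn_predict(train, x0, k):
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def bias_and_variance(k, n_train=50, n_repeats=500, seed=0):
    """Refit on many fresh training sets and study the predictions at X0."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_repeats):
        train = []
        for _ in range(n_train):
            x = rng.uniform(0.0, 3.0)
            train.append((x, true_f(x) + rng.gauss(0.0, NOISE_SD)))
        preds.append(knn_predict(train, X0, k))
    mean_pred = sum(preds) / n_repeats
    variance = sum((p - mean_pred) ** 2 for p in preds) / n_repeats
    bias_sq = (mean_pred - true_f(X0)) ** 2
    return bias_sq, variance

summary = {}
for k in [1, 10, 50]:
    bias_sq, variance = bias_and_variance(k)
    summary[k] = (bias_sq, variance)
    total = bias_sq + variance + NOISE_SD ** 2
    print(f"K={k:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2 + variance + noise={total:.4f}")
```

In this setup `K=1` has almost no bias but a large variance, while `K=50` (averaging every training point) has a tiny variance but a clear bias. The last column adds the three terms of the formula above.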

---

### 🍕 Real-Life Analogy

Imagine predicting pizza delivery time.

* **High bias (too simple model):**
  "Delivery always takes 30 minutes."
  → Bad for short or long deliveries.

* **High variance (too complex model):**
  "Delivery depends on 20 factors: weather, chef mood, pizza toppings, etc."
  → Overreacting, unstable.

* **Balanced model:**
  "Delivery depends mostly on distance + traffic."
  → Simple but accurate.

---

## ⭐ 2.2.3 The Classification Setting

Now we move from predicting numbers → predicting categories.

Examples:

* Spam or Not Spam
* Cancer or No Cancer
* Loan Default or Not

---

### 📌 Training Error Rate (Classification)

Formula:

$$
\text{Training Error Rate}
= \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i)
$$


### Symbols:

* $\hat{y}_i$: predicted class
* $y_i$: actual class
* $I(\cdot)$: indicator function (1 if the condition is true, 0 otherwise)
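
A minimal sketch of the error-rate formula; the labels and predictions below are made up, and `error_rate` is our own helper name.

```python
def error_rate(actual, predicted):
    """Fraction of points where the predicted class differs from the actual class."""
    mismatches = sum(1 for y, y_hat in zip(actual, predicted) if y != y_hat)
    return mismatches / len(actual)

actual = ["spam", "not spam", "spam", "not spam", "not spam"]
predicted = ["spam", "spam", "spam", "not spam", "spam"]

rate = error_rate(actual, predicted)
print(rate)  # 0.4, since 2 of the 5 predictions are wrong
```

The same function gives the test error rate when it is applied to data the model was not trained on.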

---

## ⭐ The Bayes Classifier

(The perfect but impossible classifier)

It chooses the class with the highest probability:


$$
\hat{y} = \arg\max_j \; \Pr(Y=j \mid X=x_0)
$$


You usually **cannot compute this**, because you don’t know the true conditional probabilities $\Pr(Y=j \mid X)$.

But it is the **gold standard**.

---

### 🧠 Bayes Error Rate (the minimum possible error)

Even the ideal classifier cannot be perfect if classes naturally overlap.

Formula:


$$
1 - \mathbb{E}\left[ \max_j \Pr(Y=j \mid X) \right]
$$
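
In a toy world where the probabilities are known, both the Bayes classifier and the Bayes error rate can be computed directly. Everything in the sketch below (the weather feature, the labels, and all probabilities) is invented for illustration; in real problems these numbers are unknown.

```python
# Hypothetical, fully known distribution of a discrete feature X.
p_x = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# Hypothetical true conditional probabilities Pr(Y = j | X = x).
p_y_given_x = {
    "sunny":  {"play": 0.9, "stay_in": 0.1},
    "cloudy": {"play": 0.6, "stay_in": 0.4},
    "rainy":  {"play": 0.2, "stay_in": 0.8},
}

def bayes_classify(x0):
    """Pick the class with the highest conditional probability at x0."""
    probs = p_y_given_x[x0]
    return max(probs, key=probs.get)

# Bayes error rate: 1 - E[max_j Pr(Y = j | X)], with the expectation taken over X.
bayes_error = 1.0 - sum(p_x[x] * max(p_y_given_x[x].values()) for x in p_x)

for x in p_x:
    print(f"{x}: predict {bayes_classify(x)}")
print(f"Bayes error rate: {bayes_error:.2f}")  # 0.21
```

Even this ideal classifier is wrong 21% of the time here, because the classes overlap: on a cloudy day, 40% of people stay in although "play" is the best guess.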


---

## ⭐ K-Nearest Neighbors (KNN) Classifier

Super simple idea:

* 👉 To classify a point, look at the **K closest training points**
* 👉 Take the **majority vote**

---

### 📌 Conditional Probability Estimate in KNN

$$
\widehat{\Pr}(Y = j \mid X = x_0)
= \frac{1}{K} \sum_{i \in N_0} I(y_i = j)
$$


Where:

* $N_0$: the set of the $K$ training points nearest to $x_0$
* $I(y_i = j)$: 1 if neighbor $i$ belongs to class $j$, 0 otherwise
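
The estimate above can be sketched in a few lines. The training points and colors are made up, Euclidean distance is an assumption (other distances are possible), and the function names are our own.

```python
import math
from collections import Counter

def knn_class_probs(train, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the k nearest neighbors in class j."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x0))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}

def knn_classify(train, x0, k):
    """Majority vote: return the class with the highest estimated probability."""
    probs = knn_class_probs(train, x0, k)
    return max(probs, key=probs.get)

# Made-up training points: ((feature 1, feature 2), class).
train = [((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
         ((4.0, 4.0), "orange"), ((4.5, 5.0), "orange"), ((2.4, 2.0), "orange")]
x0 = (2.0, 2.0)

probs_k3 = knn_class_probs(train, x0, k=3)
print("K=1 prediction:", knn_classify(train, x0, k=1))
print("K=3 prediction:", knn_classify(train, x0, k=3), probs_k3)
```

With `K=1` the single closest point is orange, so the prediction is orange. With `K=3` two of the three nearest neighbors are blue, so the estimate is 2/3 blue and the vote flips.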

---

### 🎯 Effect of Choosing K

* **Small K** → very flexible → low bias, high variance → overfits
* **Large K** → less flexible → high bias, low variance → underfits

The test error again forms a **U-shape**.


---

## ⭐ Extra Intuitive Example (Not from book)

Imagine predicting if a person likes chai.

You survey neighbors:

### Case 1: K = 1

You only ask the **closest neighbor**.
→ Very noisy
→ High variance

### Case 2: K = 100

You ask **100 people**, even far away.
→ Majority vote becomes too generic
→ High bias

Best value of K is somewhere in the middle.

---

# 🎉 Final Summary

| Concept          | Meaning                       | Good / Bad                                 |
| ---------------- | ----------------------------- | ------------------------------------------ |
| Training MSE     | Error on training data        | Tends to be low for complex models (misleading) |
| Test MSE         | Error on new data             | What actually matters                      |
| Bias             | Error from overly simple assumptions | Too simple = underfitting           |
| Variance         | Sensitivity to the particular training data | Too complex = overfitting    |
| Bayes Classifier | Theoretical best              | Not achievable in practice                 |
| KNN              | Simple method using neighbors | K controls flexibility                     |

---
