This document explains three important machine learning concepts,
**Overfitting**, **RFE**, and **K-Fold Cross Validation**, using the
**house price dataset** from **Problem 4**.

------------------------------------------------------------------------

# 1. Overfitting

## What is Overfitting?

Overfitting happens when a machine learning model **memorizes the
training data** instead of learning general patterns.\
The model becomes too tailored to the specific examples it has seen and
performs poorly on new, unseen data.

It's like a student who memorizes answers to practice questions instead
of understanding the underlying concepts.

------------------------------------------------------------------------

## Example Using House Price Data

Imagine your model learns the exact price of each house based on
specific combinations:

-   *Area = 7420, Bathrooms = 2 → Price = 13,300,000*
-   *Area = 8960, Bathrooms = 4 → Price = 12,250,000*

This is memorization, not learning.

When you show it a new house:

    Area = 8200
    Bathrooms = 3
    Stories = 2

The model struggles because it has **not learned a general rule** like:

> "Higher area and more bathrooms tend to increase price."

Instead, it learned only the *exact* examples.
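To make the memorization idea concrete, here is a minimal Python sketch. It stores the two example houses from above in a lookup table, which is an extreme stand-in for an overfitted model. The function name `predict_memorized` is hypothetical and exists only for this illustration.

```python
# A lookup table is an extreme stand-in for a model that only memorizes.
# Keys are (area, bathrooms); values are the prices from the two examples above.
memorized_prices = {
    (7420, 2): 13_300_000,
    (8960, 4): 12_250_000,
}


def predict_memorized(area, bathrooms):
    """Return the stored price, or None if this exact house was never seen."""
    return memorized_prices.get((area, bathrooms))


seen_price = predict_memorized(7420, 2)    # exact match with a training example
unseen_price = predict_memorized(8200, 3)  # new house: nothing to look up

print("Seen house:", seen_price)
print("New house:", unseen_price)
```

The lookup is perfect on houses it has already seen and has nothing to say about the new one, which is the failure mode described above.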

------------------------------------------------------------------------

## Symptoms of Overfitting

-   Very low error on training set
-   High error on test set
-   Complex or unnecessarily large models

Example:

| Model      | Train Error | Test Error           |
|------------|-------------|----------------------|
| Overfitted | Very low    | Very high            |
| Good model | Low         | Close to train error |
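The pattern in the table can be reproduced with a small sketch. The numbers below are made up for illustration (they are not rows from the house price dataset), prices are in thousands, and the helper names are hypothetical. A 1-nearest-neighbour lookup plays the role of the memorizing model, and a straight line fitted by least squares plays the role of the simpler model.

```python
# Made-up toy data (NOT from the house price dataset); prices in thousands.
# Underlying trend: price ~= area, with a few hundred of noise on every row.
train_area = [1000, 2000, 3000, 4000, 5000]
train_price = [1300, 1700, 3300, 3700, 5300]
test_area = [1400, 2400, 3400, 4400]
test_price = [1600, 2200, 3600, 4200]


def nearest_neighbour_predict(area):
    """Memorizing model: return the price of the closest training house."""
    closest = min(range(len(train_area)), key=lambda i: abs(train_area[i] - area))
    return train_price[closest]


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predict, xs, ys):
    """Average absolute difference between predictions and true prices."""
    return sum(abs(predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


slope, intercept = fit_line(train_area, train_price)


def line_predict(area):
    """Simple model: a single straight line through the training data."""
    return slope * area + intercept


# Map each model name to (train error, test error).
errors = {
    "memorizer": (
        mean_abs_error(nearest_neighbour_predict, train_area, train_price),
        mean_abs_error(nearest_neighbour_predict, test_area, test_price),
    ),
    "line": (
        mean_abs_error(line_predict, train_area, train_price),
        mean_abs_error(line_predict, test_area, test_price),
    ),
}

for name, (train_err, test_err) in errors.items():
    print(f"{name:10s} train error = {train_err:6.1f}   test error = {test_err:6.1f}")
```

On this toy data the memorizer reaches a training error of 0 but a test error of 400, while the line has a training error of 288 and a test error of 200. The gap between training and test error is the signal to watch, not the training error alone.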

------------------------------------------------------------------------

## Why Overfitting Happens

Overfitting typically occurs when:

-   The model is too complex (too many coefficients or features)
-   There are too many irrelevant features
-   There is not enough training data
-   No regularization is used

------------------------------------------------------------------------

# 2. RFE: Recursive Feature Elimination

## What is RFE?

RFE is a **feature selection technique** that automatically selects the
most important predictors by **removing the weakest features one at a
time**.

This continues until only the strongest features remain.

------------------------------------------------------------------------

## Example Using House Price Data

Suppose your dataset has the following features:

-   `area`
-   `bedrooms`
-   `bathrooms`
-   `stories`
-   `parking`
-   `mainroad`
-   `guestroom`
-   `basement`
-   `airconditioning`
-   `prefarea`
-   dummy variables from `furnishingstatus`

RFE works like this:

### Step 1: Fit the model using all features

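RFE ranks the features by importance, using the fitted model's own
measure (for a linear model, typically the magnitude of each
coefficient).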

### Step 2: Remove the least important feature

Suppose `guestroom` has the lowest importance, so it is dropped.

### Step 3: Fit again without it

Refit the model and re-rank the remaining features.

### Step 4: Remove the next weakest

Suppose this time it is `prefarea`.

### Step 5: Continue until the desired number of features remain

For example, RFE might conclude that the **top 2 best predictors** are:

-   `area`
-   `bathrooms`

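RFE keeps the features that the fitted model relies on most for
predicting house prices. The ranking depends on the chosen model and on
how the features are scaled, so treat it as a guide rather than proof
that a feature matters.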
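The elimination loop described in the steps above can be sketched in a few lines. This is a hand-rolled illustration under stated assumptions, not the method used in Problem 4: the data is synthetic (generated so that only `area` and `bathrooms` influence the price), features are standardized so that coefficient sizes are comparable, and "importance" is taken to be the absolute value of a least-squares coefficient. The function name `rfe_linear` is hypothetical.

```python
import numpy as np

# Synthetic stand-in data (NOT the real house price dataset): by construction,
# only `area` and `bathrooms` affect the price.
rng = np.random.default_rng(0)
feature_names = ["area", "bedrooms", "bathrooms", "stories", "parking"]
X = rng.normal(size=(200, len(feature_names)))
price = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)


def rfe_linear(X, y, names, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    # Standardize so coefficient magnitudes are comparable across features.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Fit least squares (with an intercept column) on the remaining features.
        design = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Skip the intercept, then find and remove the weakest remaining feature.
        weakest = int(np.argmin(np.abs(coefs[1:])))
        eliminated.append(names[remaining.pop(weakest)])
    return [names[i] for i in remaining], eliminated


selected, eliminated = rfe_linear(X, price, feature_names, n_keep=2)
print("Eliminated, in order:", eliminated)
print("Selected:", selected)
```

For real work, scikit-learn ships this procedure as `sklearn.feature_selection.RFE`, which wraps any estimator that exposes coefficients or feature importances.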

------------------------------------------------------------------------

## Why Use RFE?

-   Removes noise
-   Simplifies the model
-   Avoids unnecessary complexity
-   Improves generalization (helps prevent overfitting)

------------------------------------------------------------------------

# 3. K-Fold Cross Validation

## What is K-Fold CV?

K-Fold Cross Validation tests your model's performance more reliably by
splitting the data into **K parts** (folds) and training/testing the
model **K times**, each time using a different fold for testing.

It gives a more stable and fair estimate of your model's performance
than a single train-test split.

------------------------------------------------------------------------

## Example Using House Price Data

Assume you choose **K = 5**.

The dataset's 545 rows are split into 5 equal folds of 109 rows each:

    Fold 1: rows 1–109
    Fold 2: rows 110–218
    Fold 3: rows 219–327
    Fold 4: rows 328–436
    Fold 5: rows 437–545
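The row ranges above follow from dividing 545 rows into 5 consecutive blocks of 109. Here is a minimal sketch of that arithmetic, assuming the rows are split in order without shuffling (the function name `fold_ranges` is hypothetical):

```python
def fold_ranges(n_rows, k):
    """Split rows 1..n_rows into k consecutive folds; return (first, last) pairs.

    If n_rows is not divisible by k, the first n_rows % k folds get one extra row.
    """
    base, extra = divmod(n_rows, k)
    ranges = []
    start = 1
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


folds = fold_ranges(545, 5)
for number, (first, last) in enumerate(folds, start=1):
    print(f"Fold {number}: rows {first}-{last}")
```

In practice the rows are often shuffled before splitting, so that any ordering in the file (for example, houses sorted by price) does not leak into the folds.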

### The model is trained and tested 5 times:

| Round | Train On       | Test On |
|-------|----------------|---------|
| 1     | Folds 2–5      | Fold 1  |
| 2     | Folds 1,3–5    | Fold 2  |
| 3     | Folds 1–2,4–5  | Fold 3  |
| 4     | Folds 1–3,5    | Fold 4  |
| 5     | Folds 1–4      | Fold 5  |
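The rounds in the table can be written as a loop: hold one fold out, train on the rest, score on the held-out fold, and repeat. The sketch below is illustrative only. It uses synthetic data rather than the house price dataset, a one-feature least-squares line as the model, and hypothetical helper names (`fit_line`, `r2_score`, `k_fold_scores`).

```python
import random
from statistics import mean

# Synthetic stand-in data (NOT the real house price dataset):
# price depends linearly on area, plus random noise.
rng = random.Random(0)
areas = [rng.uniform(0, 10) for _ in range(50)]
prices = [2.0 * a + rng.gauss(0, 1) for a in areas]


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def k_fold_scores(xs, ys, k):
    """Train on k-1 folds, test on the held-out fold, once per fold."""
    n = len(xs)
    fold_size = n // k  # assumes n is divisible by k, as in the example above
    scores, tested_rows = [], []
    for fold in range(k):
        test_idx = list(range(fold * fold_size, (fold + 1) * fold_size))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        predictions = [slope * xs[i] + intercept for i in test_idx]
        scores.append(r2_score([ys[i] for i in test_idx], predictions))
        tested_rows.extend(test_idx)
    return scores, tested_rows


scores, tested_rows = k_fold_scores(areas, prices, k=5)
for number, score in enumerate(scores, start=1):
    print(f"Round {number}: R^2 = {score:.3f}")
print(f"Mean R^2 = {mean(scores):.3f}")
```

With scikit-learn, `KFold` and `cross_val_score` from `sklearn.model_selection` cover the same ground without the hand-written loop.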

### Each round produces a score (e.g., R²):

| Fold | R² Score |
|------|----------|
| 1    | 0.64     |
| 2    | 0.66     |
| 3    | 0.62     |
| 4    | 0.68     |
| 5    | 0.65     |

### The final performance is the average:

\[ \text{Final } R^2 = \frac{0.64 + 0.66 + 0.62 + 0.68 + 0.65}{5} = 0.65 \]

This is **more reliable** than a single train-test split, which might
give 0.55 or 0.75 depending on which rows happen to land in the test
set.

------------------------------------------------------------------------

## Why Use K-Fold CV?

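-   Reduces the effect of a lucky or unlucky split
-   Uses every row for both training and testing (each row is tested
    exactly once)
-   Gives a robust estimate of true model performance
-   Helps detect overfitting (training scores far above the
    cross-validated scores are a warning sign)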

------------------------------------------------------------------------