The goal is to split our dataset into training and test sets in a way that ensures:
- we don't train on the test set (to prevent data snooping bias)
- the test set is representative of the overall data (important for generalization)
- and this split is stable and reproducible over time

### Why do we need a test set?
A test set simulates new, unseen data and helps us to estimate how our model will perform in the real world.
- Training set - used to train the model
- Test set - used to evaluate final performance

## The Naive Way (Not Ideal)

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.read_csv("datasets/housing/housing.csv")

# Purely random 80/20 split; random_state fixes the shuffle so reruns match
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

- This puts 80% of the rows into `train_set` and 20% into `test_set`
- It works, but it has two problems:
    - It's random -> the test set changes on every run unless you fix `random_state`, and even a fixed seed produces a different split once rows are added to the dataset
    - It doesn't guarantee that the test set is __representative__ (e.g., of the income distribution)

## Better Approach: Stable Test Set Using Hashing
The idea is to create a __consistent test split__ using a __hash function__. But why?
- Our dataset grows over time (more rows)
- If we split randomly each time, some previously used test examples may end up in training -- __causing data leakage__.
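
To make the leakage risk concrete, here is a minimal sketch on synthetic data. The `row_id` column and the row counts are assumptions for illustration, not part of the housing dataset. It splits a 1,000-row table, splits again after 200 rows are appended, and counts how many rows from the first test set land in the second training set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows today, the same rows plus 200 new ones later
old_data = pd.DataFrame({"row_id": np.arange(1000)})
new_data = pd.DataFrame({"row_id": np.arange(1200)})

# Same seed both times, but the dataset size changed between the two splits
_, old_test = train_test_split(old_data, test_size=0.2, random_state=42)
new_train, _ = train_test_split(new_data, test_size=0.2, random_state=42)

# Rows that were held out originally but would now be trained on
leaked = set(old_test["row_id"]) & set(new_train["row_id"])
print(f"{len(leaked)} of {len(old_test)} old test rows moved into the new training set")
```

A fixed `random_state` only makes the split repeatable for an unchanged dataset; it does not pin individual rows to one side of the split.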

> The Solution:
> Use a row ID or index and a hash function to deterministically select test set rows.

In [3]:
import hashlib
import numpy as np

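def test_set_check(identifier, test_ratio, hash):
    # Hash the ID and send the row to the test set when the digest's
    # last byte (0-255) falls below 256 * test_ratio
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio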

- `identifier` = unique ID per row
- `hash` = a hash function like `hashlib.md5`
- a row goes into the test set if the last byte of its hash (a value from 0 to 255) is below `256 * test_ratio`, which selects roughly a `test_ratio` share of the rows

In [4]:
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

This approach:
- always puts the same rows into the test set, even when the dataset is updated later
- avoids data leakage, because a row that was once in the test set never moves to the training set
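
The stability claim can be checked with a small sketch on synthetic data. The `row_id` column, the helper names `in_test_set` and `split_by_id`, and the row counts are hypothetical; the helper is a variant of the rule above that hashes the explicit bytes of the 64-bit ID.

```python
import hashlib

import numpy as np
import pandas as pd


def in_test_set(identifier, test_ratio):
    # Hash the 8 bytes of the ID; the last digest byte decides membership
    digest = hashlib.md5(np.int64(identifier).tobytes()).digest()
    return digest[-1] < 256 * test_ratio


def split_by_id(data, test_ratio, id_column):
    mask = data[id_column].apply(lambda id_: in_test_set(id_, test_ratio))
    return data.loc[~mask], data.loc[mask]


# Hypothetical data: the original rows, then the same rows plus 200 new ones
old_data = pd.DataFrame({"row_id": np.arange(1000)})
new_data = pd.DataFrame({"row_id": np.arange(1200)})

_, old_test = split_by_id(old_data, 0.2, "row_id")
new_train, new_test = split_by_id(new_data, 0.2, "row_id")

# Membership depends only on the ID, so old test rows stay in the test set
stayed = set(old_test["row_id"]) <= set(new_test["row_id"])
print("old test rows still held out:", stayed)
print("test fraction after update:", len(new_test) / len(new_data))
```

The test fraction is only approximately `test_ratio`, since it depends on how the hash values of the actual IDs happen to fall.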

## Problem: What if We Don't Have a Row ID?
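Use a combination of stable features to generate one, or use the row index itself (the index only works as an ID if new data is always appended at the end and no row is ever deleted):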

In [5]:
housing_with_id = housing.reset_index()  # adds an `index` column
# now we can use that index as the identifier
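
As a hedged sketch of the "combination of features" option: if the data has stable coordinate columns, they can be packed into a single integer ID. The `longitude` and `latitude` column names, the toy values, and the two-decimal scaling are assumptions for illustration. Rows that share coordinates would get the same ID and therefore always land in the same set.

```python
import pandas as pd

# Toy rows standing in for districts (values are made up)
demo = pd.DataFrame({
    "longitude": [-122.23, -122.22, -118.30],
    "latitude": [37.88, 37.86, 34.26],
})

# Scale to hundredths of a degree, then pack both numbers into one integer
lon = (demo["longitude"] * 100).round().astype("int64")
lat = (demo["latitude"] * 100).round().astype("int64")
demo["id"] = lon * 100_000 + lat

print(demo)
```

An ID built this way does not depend on row order, so it survives reshuffling or deleting rows, unlike the row index.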

## Best Practice: Stratified Sampling
Random sampling can still give us a test set that's not representative, especially for skewed features like income.<br>
So we will use __Stratified Sampling__, with the income category as the __stratification feature__:

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

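# split() yields positional indices; .loc works here because housing still
# has its default 0..n-1 index, so positions and labels coincide
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]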

- This keeps the distribution of `income_cat` nearly identical in the training set, the test set, and the full dataset.
- The test score is therefore a more realistic estimate of real-world performance.
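
One way to verify this is to compare category proportions directly. The sketch below uses a synthetic, right-skewed stand-in for `median_income` (the lognormal parameters and the `demo` frame are assumptions, not the real housing data) and puts the stratified and purely random test sets side by side.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic, right-skewed stand-in for median_income
rng = np.random.default_rng(0)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=10_000)})
demo["income_cat"] = pd.cut(demo["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(demo, demo["income_cat"]))
strat_test = demo.iloc[test_idx]
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)


def proportions(frame):
    # Share of rows in each income category, ordered by category
    return frame["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": proportions(demo),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
comparison["strat_error"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["random_error"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The `strat_error` column should stay very close to zero for every category, while `random_error` varies with the seed.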

## Let's build a solid understanding of Stratified Sampling

> Stratified Sampling is a technique where we split the data into groups (strata) based on an important feature, then sample proportionally from each group.

__Why?__<br>
To ensure the sample (like a test set) has the same distribution of key characteristics as the full dataset -- so it's fair and representative.

__Context:__ <br>
We're building a model to predict median house value. One of the most important input features is:
> `median_income`. But it is not evenly distributed -- there are many low-income districts and fewer high-income ones.

__Problem with Random Splits:__ <br>
If we split randomly, we may end up with too few high-income districts in the test set -- so the test set won't reflect how well the model works across all income levels.


__Solution: Use Stratified Sampling on Income__
1. Convert `median_income` into income categories:

In [7]:
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5]
)

This turns continuous income into 5 categories (each bin excludes its lower edge and includes its upper edge):

| Category | Income Range |
| -------- | ------------ |
| 1        | (0.0, 1.5]   |
| 2        | (1.5, 3.0]   |
| 3        | (3.0, 4.5]   |
| 4        | (4.5, 6.0]   |
| 5        | (6.0, ∞)     |

2. Apply `StratifiedShuffleSplit` to maintain the same ratio of each income group in train and test sets.
3. Now, our training and test sets both reflect the income structure of the original dataset.
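
For step 1, it helps to see how `pd.cut` treats values that sit exactly on a bin edge. By default the bins are closed on the right, so 1.5 falls in category 1 and 3.0 in category 2; a value of exactly 0.0 lies outside the first bin and becomes `NaN` unless `include_lowest=True` is passed. The sample values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up incomes chosen to sit on or near the bin edges
incomes = pd.Series([0.0, 1.5, 3.0, 3.01, 15.0])
cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])

print(pd.DataFrame({"median_income": incomes, "income_cat": cats}))
```

Rows with a `NaN` category would make the stratified split fail or misbehave, so it is worth checking `cats.isna().sum()` before splitting.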

## What to do after this?
- Remove the `income_cat` column -- it was only needed for stratified sampling

In [8]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)


- Now move on to:
    - Data exploration
    - Feature engineering
    - Model building -- with balanced and representative train/test sets

----

In [12]:
# This is random splitting, which can lead to biased or unbalanced test sets,
# especially if important features (like income) are not evenly distributed.

import numpy as np
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
# The result is a categorical version of median income, which later lets us do
# stratified sampling, so that every income group is proportionally represented
# in both the train and test sets.

housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5],
)

| Method                     | Purpose                                                                                 | When to Use                                                                                                 |
| -------------------------- | --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| `train_test_split()`       | Randomly splits data into train/test                                                    | Use for quick tests, or when you’re confident a random sample will be representative (e.g., a large dataset). |
| `StratifiedShuffleSplit()` | Splits data **while preserving the distribution** of an important feature (like income) | ✅ Use this when an important feature is **not evenly distributed** and the test set must reflect it.        |
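
The two rows of the table are not mutually exclusive: `train_test_split()` also accepts a `stratify` argument, which gives a stratified split in a single call. The sketch below uses a synthetic stand-in for the housing data (the `demo` frame and lognormal parameters are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed stand-in for median_income
rng = np.random.default_rng(0)
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5_000)})
demo["income_cat"] = pd.cut(demo["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

# stratify= keeps the income_cat proportions equal across both subsets
strat_train, strat_test = train_test_split(
    demo, test_size=0.2, stratify=demo["income_cat"], random_state=42
)

overall = demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
print("largest proportion gap:", (in_test - overall).abs().max())
```

`StratifiedShuffleSplit` remains the better fit when several stratified splits are needed (`n_splits > 1`), for example inside cross-validation.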
