<div class="alert alert-block alert-success">
<b>NOTEBOOK 3 - Data Splitting</b>
</div>

---
># 1 - IMPORTS

### 1.1 - SETUP PROJECT

In [1]:
# Centralized setup
import sys
from pathlib import Path

# Make sure PROJECT_PATH is on sys.path so the project modules can be imported
PROJECT_ROOT = Path.cwd().resolve().parent
PROJECT_PATH = PROJECT_ROOT / "src" / "project"

if str(PROJECT_PATH) not in sys.path:
    sys.path.insert(0, str(PROJECT_PATH))

# Centralized import
from imports import *

Imports ready: pd, np, sns, plt, joblib, sklearn, etc.
PROJECT_ROOT: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Regression\beyond-grades-ml-project


---
># 2 - DATASET LOAD

In [2]:
### 2.1 - LOADING

In [3]:
dataset_path = "../data/interim/01_dataset_structural_cleanup.xlsx"
try:
    df = utils.load_student_dataset(dataset_path)
    print('Data loaded successfully.')
except Exception as e:
    # Report the problem, then stop: every later cell depends on `df`
    print(f'An error occurred during data loading: {e}')
    raise

Data loaded successfully.


---
># 3 - DATA SPLIT

### 3.1 - SPLITTING WITH STRATIFICATION

This notebook splits the data into two groups: a **training set** and a **test set**.
- The training set is used for EDA and modeling and contains 80% of the data points;
- The test set contains the remaining 20%. It acts as unseen data and is used only to evaluate the final model. Any preprocessing applied to it uses statistics fitted on the training set.

The split below creates four new objects:
- `X_train`: features used for training;
- `X_test`: features used for evaluation;
- `y_train`: target used for training;
- `y_test`: target used for evaluation.

**Method**  
- **Target**: `GPA`  
- **Stratification**: quantile-based bins (up to 5 bins via `max_q=5`) to balance `GPA` across splits.  
- **Function**: `splitting.safe_train_test_split`  
- **Test size**: 0.20  
- **Random seed**: 42  
- **Leakage prevention**: the split is done **before** any preprocessing or other step that could leak information from the test set.  
- **Saved metadata**: I persist the **bin edges**, the **random seed**, the train/test **indices**, and the other key parameters.

**Why stratified by quantiles?**  
`GPA` is continuous, so it cannot be used as a stratification label directly. Quantile bins turn it into discrete labels that play the role class labels play in classification. This keeps the **distribution** of `GPA` comparable between train and test.
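
The idea can be sketched with plain pandas and scikit-learn. This is a minimal illustration on synthetic data, not the code inside `splitting.safe_train_test_split` (which is not shown in this notebook); the `demo` frame, the `feature_a` column, and the `bin_shares` helper are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (column names and values are made up)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "GPA": rng.uniform(0.0, 4.0, size=500),
})
X_demo, y_demo = demo.drop(columns=["GPA"]), demo["GPA"]

# Cut the continuous target into up to 5 quantile bins.
# duplicates="drop" merges bins whose edges coincide (many repeated values).
bin_labels, bin_edges = pd.qcut(
    y_demo, q=5, labels=False, retbins=True, duplicates="drop"
)

# Stratify on the bin labels, not on the raw continuous target
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=bin_labels
)

def bin_shares(values):
    """Share of rows per bin, using the same edges as the stratification."""
    binned = pd.cut(values, bins=bin_edges, labels=False, include_lowest=True)
    return binned.value_counts(normalize=True).sort_index()

train_share = bin_shares(y_tr)
test_share = bin_shares(y_te)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

If the stratification worked, the two columns of the printed table should be close to each other in every bin.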

**Artifacts saved**  
- Train/test **indices** (to re-index any future versions of the dataset).  
- **bin_edges** used for stratification.  
- Split **parameters** (test_size, random_state, max_q).  
- Optional **hash** of the rows, to detect changes in the data that would break reproducibility (the metadata cell in section 3.2 does not write one).
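
One possible way to compute such a row hash is sketched below. It is an assumption about how the optional hash could be built, not something this notebook currently does; `frame_fingerprint` is a hypothetical helper name and the small frames are made-up data.

```python
import hashlib

import pandas as pd

def frame_fingerprint(frame: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest summarising the rows and index of `frame`."""
    # hash_pandas_object gives one uint64 hash per row (index included)
    row_hashes = pd.util.hash_pandas_object(frame, index=True)
    # The digest depends on row order, so reordered data also changes it
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()

original = pd.DataFrame({"feature_a": [1, 2, 3], "GPA": [2.0, 3.1, 1.5]})
edited = original.copy()
edited.loc[1, "GPA"] = 3.2  # simulate a silent change in the source data

fp_original = frame_fingerprint(original)
fp_same = frame_fingerprint(original.copy())
fp_edited = frame_fingerprint(edited)
print(fp_original == fp_same, fp_original == fp_edited)
```

Storing the digest next to the indices would let a later run compare it against a freshly computed one before reusing the saved split.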

**Checks performed**  
- Count per quantile bin in train vs test (using the **same bin edges**).  
- Sanity check: no overlap in indices; sizes match expected proportions.
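
The index sanity check can be written as a small function. The sketch below is a hypothetical helper (`check_split` is not part of the project's `splitting` module as far as this notebook shows), and the tolerance of 0.01 on the test fraction is an arbitrary choice.

```python
import pandas as pd

def check_split(train_index, test_index, n_total, test_size, tol=0.01):
    """Return basic sanity facts about a train/test partition."""
    train_index = pd.Index(train_index)
    test_index = pd.Index(test_index)
    overlap = train_index.intersection(test_index)
    return {
        # No row may appear in both partitions
        "no_overlap": len(overlap) == 0,
        # Together the partitions must account for every row
        "covers_all_rows": len(train_index) + len(test_index) == n_total,
        # The realised test fraction should be close to the requested one
        "test_fraction_ok": abs(len(test_index) / n_total - test_size) <= tol,
    }

good = check_split(range(0, 80), range(80, 100), n_total=100, test_size=0.2)
bad = check_split([0, 1, 2], [2, 3], n_total=5, test_size=0.4)
print(good)
print(bad)
```

With this notebook's variables the call would look like `check_split(X_train.index, X_test.index, len(df), 0.2)`.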

**How to reproduce**  
Use the saved indices and bin edges; do not re-sample. If the row order changes, select rows by the stored index values or by a stable ID column, not by their current position.
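
A minimal sketch of rebuilding the split from saved indices follows. It assumes the JSON layout written in section 3.2 (`index.train`, `index.test`, `target`) and that the index labels of the reloaded dataset are unchanged; the `demo` frame and the temporary file are stand-ins for the real data and for `data/meta/split_meta.json`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in dataset (made-up values)
demo = pd.DataFrame(
    {"feature_a": [10, 20, 30, 40, 50], "GPA": [2.0, 3.1, 1.5, 3.8, 2.7]}
)

# Stand-in for the metadata file, written to a temporary folder
meta_path = Path(tempfile.mkdtemp()) / "split_meta.json"
meta_path.write_text(
    json.dumps({"index": {"train": [0, 1, 3, 4], "test": [2]}, "target": "GPA"}),
    encoding="utf-8",
)

saved = json.loads(meta_path.read_text(encoding="utf-8"))
train_idx = saved["index"]["train"]
test_idx = saved["index"]["test"]
target = saved["target"]

# .loc selects by index label, so a changed row order does not matter
# as long as the labels themselves are stable
shuffled = demo.sample(frac=1.0, random_state=0)
train_df = shuffled.loc[train_idx]
test_df = shuffled.loc[test_idx]

X_train_again = train_df.drop(columns=[target])
y_train_again = train_df[target]
X_test_again = test_df.drop(columns=[target])
y_test_again = test_df[target]
print(list(train_df.index), list(test_df.index))
```

No random sampling happens here, so the result does not depend on the seed or on the library version that produced the original split.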

In [4]:
# Data splitting

# Define features/target
X = df.drop(columns=["GPA"])
y = df["GPA"]

# Split with safe stratification
X_train, X_test, y_train, y_test, meta = splitting.safe_train_test_split(
    X, y, test_size=0.2, random_state=42, max_q=5, verbose=True
)

# Check distributions using the SAME bin edges used for stratification
if meta["bin_edges"] is not None:
    edges = meta["bin_edges"]
    print("\nTraining target distribution (same edges as stratify):")
    print(pd.cut(y_train, bins=edges, labels=False, include_lowest=True).value_counts().sort_index())

    print("\nTest target distribution (same edges as stratify):")
    print(pd.cut(y_test, bins=edges, labels=False, include_lowest=True).value_counts().sort_index())
else:
    print("\n(No stratification used — skipping quantile distribution check)")

# Save (Excel-friendly)
X_train_to_save = splitting.for_excel(X_train)
X_test_to_save  = splitting.for_excel(X_test)

utils.save_dataset(X_train_to_save, "interim/02_X_train_aftersplit.xlsx")
utils.save_dataset(X_test_to_save,  "interim/02_X_test_aftersplit.xlsx")
utils.save_dataset(y_train.to_frame("GPA"), "interim/02_y_train_aftersplit.xlsx")
utils.save_dataset(y_test.to_frame("GPA"),  "interim/02_y_test_aftersplit.xlsx")

print("\nSplit completed and files saved.")

Stratification by quantiles: q=5

Training target distribution (same edges as stratify):
GPA
0    383
1    383
2    382
3    382
4    383
Name: count, dtype: int64

Test target distribution (same edges as stratify):
GPA
0    96
1    95
2    96
3    96
4    96
Name: count, dtype: int64
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Regression\beyond-grades-ml-project\data\interim\02_X_train_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Regression\beyond-grades-ml-project\data\interim\02_X_test_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Regression\beyond-grades-ml-project\data\interim\02_y_train_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloud

### 3.2 - SAVING SPLIT METADATA

The cell below saves the split metadata to a JSON file.

In [None]:
# Save metadata about the split
split_meta = {
    "method": "safe_train_test_split",
    "params": {
        "test_size": 0.20,
        "random_state": 42,
        "max_q": 5,
    },
    "bin_edges": meta.get("bin_edges", None),  # returned by safe_train_test_split; None if no stratification
    "index": {
        "train": X_train.index.tolist(),
        "test":  X_test.index.tolist(),
    },
    "n_rows": {
        "train": int(len(X_train)),
        "test":  int(len(X_test)),
        "total": int(len(X_train) + len(X_test)),
    },
    "target": "GPA",
}

# Ensure all objects are JSON-safe
safe_meta = utils.make_json_safe(split_meta)

out = Path("../data/meta")
out.mkdir(parents=True, exist_ok=True)
with open(out / "split_meta.json", "w", encoding="utf-8") as f:
    json.dump(safe_meta, f, indent=2, ensure_ascii=False)

print("✔ Saved split metadata → data/meta/split_meta.json")

✔ Saved split metadata → data/meta/split_meta.json


Proper Data Splitting Implemented:

>- Applied appropriate splitting strategy for my data type.
>- Maintained (temporal/geographic/hierarchical) integrity.
>- Validated split quality and representativeness.
>- Documented splitting approach for reproducibility.

<div class="alert alert-block alert-info">
<b>Next Notebook - EDA</b>
</div>