<div class="alert alert-block alert-success">
<b>NOTEBOOK 3 - Data Splitting</b>
</div>

---
># 1 - IMPORTS

### 1.1 - SETUP PROJECT

In [20]:
# IMPORTS

# Standard libraries
import sys
import importlib
from pathlib import Path
import json  # for saving split metadata

# Third-party
import pandas as pd
from sklearn.model_selection import train_test_split

# Add "../src/utilities" to sys.path so the custom utils module can be imported
sys.path.append("../src/utilities")

# Import utils (reload to pick up latest edits)
try:
    import utils
    importlib.reload(utils)   # Ensures latest version is loaded
except ImportError as e:
    raise ImportError(f"Could not import utils module: {e}") from e

---
># 2 - DATASET LOAD

### 2.1 - LOADING

In [21]:
# Load dataset
df = pd.read_pickle("../data/interim/02_dataset_structural_cleanup.pkl")

---
># 3 - DATA SPLIT

### 3.1 - SPLITTING WITH STRATIFICATION

In this step, the dataset is divided into two groups: a **training set** and a **test set**.

- The **training set** (80%) is used for Exploratory Data Analysis (EDA) and model development.  
- The **test set** (20%) remains unseen during training. It is transformed using preprocessing parameters fitted on the training data and serves as a final benchmark for model evaluation.  

The following dataframes are created:
- `X_train`: features for training  
- `X_test`: features for testing  
- `y_train`: target values for training  
- `y_test`: target values for testing  

**Method**  
- **Target**: `target` (the label to be predicted)  
- **Stratification**: performed on `y` to maintain proportional class distributions across splits  
- **Function**: `train_test_split` (from `sklearn.model_selection`)  
- **Test size**: 0.20 (20% of the data)  
- **Random seed**: 42 (ensures reproducibility)  
- **Leakage prevention**: splitting is done **before** any preprocessing or feature engineering  
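
To illustrate the leakage-prevention point, the sketch below fits a scaler on the training partition only and then reuses the fitted parameters on the test partition. The toy data, the column names (`age`, `grade`), and the choice of `StandardScaler` are assumptions for illustration; the actual preprocessing steps are defined in later notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical columns, not the project dataset)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "age": rng.integers(18, 40, size=100),
    "grade": rng.normal(12.0, 2.0, size=100),
    "target": ["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20,
})

X_toy = toy.drop(columns=["target"])
y_toy = toy["target"]

# Split first, so no statistic from the test rows reaches the fitting step
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# Fit on the training partition only
scaler = StandardScaler().fit(X_tr)

# Apply the same fitted parameters to both partitions
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), index=X_tr.index, columns=X_tr.columns)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), index=X_te.index, columns=X_te.columns)

# Training means are approximately 0 by construction; test means generally are not
print(X_tr_scaled.mean().round(6).to_string())
print(X_te_scaled.mean().round(6).to_string())
```

Fitting the scaler on the full dataset instead would let test-set means and variances influence the training features, which is the leakage this ordering avoids.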

---
**Checks performed**  
- Verified that **train/test sizes** match the expected proportions  
- Confirmed **class distribution consistency** between train and test sets  
- Ensured **no overlap** of indices between partitions  
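
The three checks listed above could be automated along the following lines. This is a minimal sketch: the helper name `check_split`, its tolerances, and the toy labels are assumptions for illustration and are not part of the project's `utils` module.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def check_split(y_train, y_test, test_size=0.20, tol_pct=1.0):
    """Return basic sanity checks for a stratified train/test split."""
    n_total = len(y_train) + len(y_test)

    # 1) Test share is within one percentage point of the requested proportion
    size_ok = abs(len(y_test) / n_total - test_size) < 0.01

    # 2) Per-class percentages differ by less than tol_pct points
    train_pct = y_train.value_counts(normalize=True) * 100
    test_pct = y_test.value_counts(normalize=True) * 100
    max_gap = train_pct.sub(test_pct, fill_value=0).abs().max()

    # 3) No row index appears in both partitions
    overlap = y_train.index.intersection(y_test.index)

    return {
        "size_ok": bool(size_ok),
        "dist_ok": bool(max_gap < tol_pct),
        "no_overlap": bool(overlap.empty),
    }

# Toy labels (hypothetical, not the project dataset)
y_toy = pd.Series(["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20, name="target")
y_tr, y_te = train_test_split(y_toy, test_size=0.20, stratify=y_toy, random_state=42)

result = check_split(y_tr, y_te)
print(result)
```

In the notebook, the same helper would be called as `check_split(y_train, y_test)` right after the split.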


**Artifacts saved**  
- Train and test **dataframes**, saved as Excel files (for inspection) and as pickle files (for downstream notebooks)  
- Exported splits (each written as both `.xlsx` and `.pkl`):  
  - `03_X_train_aftersplit`  
  - `03_X_test_aftersplit`  
  - `03_y_train_aftersplit`  
  - `03_y_test_aftersplit`  

**Reproducibility**  
The split can be reproduced exactly by rerunning `train_test_split` with the same random seed and stratification on the same input data in the same row order. If the dataset order changes in future versions, the saved splits provide stable references for downstream work.


In [22]:
# DATA SPLITTING

# Define features/target
X = df.drop(columns=["target"])
y = df["target"]

# Stratified 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,       # 20% test set
    stratify=y,           # preserve class proportions
    random_state=42       # reproducible split
)

print(f"Shapes → X_train {X_train.shape} | X_test {X_test.shape}")

# Quick class distribution check
print("\nTrain distribution (%)")
print((y_train.value_counts(normalize=True) * 100).round(2).to_string())

print("\nTest distribution (%)")
print((y_test.value_counts(normalize=True) * 100).round(2).to_string())

# Save to Excel-friendly format.
def for_excel(df_like):
    """Return a copy safe to save to Excel (no index)."""
    if isinstance(df_like, pd.Series):
        return df_like.to_frame(name=df_like.name or "target").reset_index(drop=True)
    return df_like.reset_index(drop=True)

X_train_xl = for_excel(X_train)
X_test_xl  = for_excel(X_test)
y_train_xl = for_excel(y_train.rename("target"))
y_test_xl  = for_excel(y_test.rename("target"))

# Save as Excel files
utils.save_dataset(X_train_xl, "interim/03_X_train_aftersplit.xlsx")
utils.save_dataset(X_test_xl,  "interim/03_X_test_aftersplit.xlsx")
utils.save_dataset(y_train_xl, "interim/03_y_train_aftersplit.xlsx")
utils.save_dataset(y_test_xl,  "interim/03_y_test_aftersplit.xlsx")

# Save X train and test sets as pickle files
X_train.to_pickle("../data/interim/03_X_train_aftersplit.pkl")
X_test.to_pickle("../data/interim/03_X_test_aftersplit.pkl")

# Save y train and test sets as pickle files, converted to DataFrames for consistency when loading later
y_train.to_frame(name="target").to_pickle("../data/interim/03_y_train_aftersplit.pkl")
y_test.to_frame(name="target").to_pickle("../data/interim/03_y_test_aftersplit.pkl")

Shapes → X_train (3477, 30) | X_test (870, 30)

Train distribution (%)
target
Graduate    50.82
Dropout     30.92
Enrolled    18.26

Test distribution (%)
target
Graduate    50.80
Dropout     30.92
Enrolled    18.28
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Classification\Early_Identification_Of_At-Risk_Students\data\interim\03_X_train_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Classification\Early_Identification_Of_At-Risk_Students\data\interim\03_X_test_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desktop\ENRICO\05_LEARNING\University\ToU\Phases\02_Calibration_Phase\Applied_Machine_Learning\Classification\Early_Identification_Of_At-Risk_Students\data\interim\03_y_train_aftersplit.xlsx
File saved at: C:\Users\Vaccari\Desktop\iCloudDrive\Desk

### 3.2 - SAVING SPLIT METADATA

The split metadata (method, parameters, row indices, and row counts) is saved to a JSON file so the split can be audited and reproduced later.

In [23]:
# SAVE SPLIT METADATA

split_meta = {
    "method": "train_test_split",
    "params": {
        "test_size": 0.20,
        "random_state": 42,
        "stratify": True,  # stratification used
    },
    "index": {
        "train": X_train.index.tolist(),
        "test":  X_test.index.tolist(),
    },
    "n_rows": {
        "train": int(len(X_train)),
        "test":  int(len(X_test)),
        "total": int(len(X_train) + len(X_test)),
    },
    "target": "target",  # classification label
}

# Ensure JSON-safe (convert any numpy/Index objects if needed)
try:
    safe_meta = utils.make_json_safe(split_meta)
except AttributeError:
    # Fallback if utils has no make_json_safe: the index lists are already
    # plain Python lists via .tolist(), so the dict is serializable as is
    safe_meta = split_meta

# Save to file 
out = Path("../data/meta")
out.mkdir(parents=True, exist_ok=True)
with open(out / "split_meta.json", "w", encoding="utf-8") as f:
    json.dump(safe_meta, f, indent=2, ensure_ascii=False)

print("Saved split metadata → ../data/meta/split_meta.json")

Saved split metadata → ../data/meta/split_meta.json
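
The saved indices can later be used to rebuild the same partitions without rerunning `train_test_split`. The sketch below shows the idea on a toy frame with a temporary file; the toy values and hand-picked indices are assumptions for illustration, and the approach relies on the index labels of the reloaded dataset being unchanged.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned dataset (hypothetical values)
toy = pd.DataFrame(
    {"feature": [10, 20, 30, 40, 50], "target": ["A", "B", "A", "B", "A"]},
    index=[100, 101, 102, 103, 104],
)

# Metadata with the same "index" and "target" keys as split_meta
# (indices chosen by hand here)
meta = {"index": {"train": [100, 102, 103, 104], "test": [101]}, "target": "target"}

with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "split_meta.json"
    meta_path.write_text(json.dumps(meta), encoding="utf-8")

    # Reload the metadata from disk
    loaded = json.loads(meta_path.read_text(encoding="utf-8"))

# Rebuild the partitions by index label
target_col = loaded["target"]
train_rows = toy.loc[loaded["index"]["train"]]
test_rows = toy.loc[loaded["index"]["test"]]

X_train_rebuilt = train_rows.drop(columns=[target_col])
y_train_rebuilt = train_rows[target_col]
X_test_rebuilt = test_rows.drop(columns=[target_col])
y_test_rebuilt = test_rows[target_col]

print(X_train_rebuilt.index.tolist(), X_test_rebuilt.index.tolist())
```

Selecting by label with `.loc` (rather than by position with `.iloc`) keeps the rebuilt partitions correct even if the rows are reordered, as long as the labels themselves are stable.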


Proper Data Splitting Implemented:

>- Applied a splitting strategy appropriate for the data type (stratified random split for a multiclass target).
>- Maintained class-proportion integrity across partitions through stratification.
>- Validated split quality and representativeness.
>- Documented the splitting approach for reproducibility.

<div class="alert alert-block alert-info">
<b>Next Notebook - EDA</b>
</div>