# 05: End-to-End Pipeline Demo

> **Objective:** Demonstrate the full ML pipeline for overqualification prediction, from raw data to a trained CatBoost model and a submission file, matching the workflow used in the SFU Data Science ML Hackathon.

This notebook:
1. [**Loads and preprocesses**](#step-1-load-and-preprocess) the NGS training data  
2. [**Trains**](#step-2-train) the CatBoost model with validation  
3. [**Generates predictions**](#step-3-predict) on the test set and builds a submission DataFrame  
4. [**Recaps**](#summary) the pipeline and how to run it from the command line

### 🧠 Context

The production pipeline is implemented in **`src/`**: `data`, `preprocess`, `features`, `model`, `train`, `evaluate`, `predict`. This notebook replays the same steps in order so you can see the full flow in one place. For reproducible training and submission generation, run from the project root:

```bash
python3 -m src.train
python3 -m src.predict
```

---
### 🧰 Imports

In [1]:
import sys
from pathlib import Path

import pandas as pd

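# Make the project root importable so that `src` resolves.
# Assumes the notebook's working directory sits directly under the project root.
sys.path.insert(0, str(Path().resolve().parent))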

from src.config import TRAIN_CSV, MODEL_ARTIFACT_DIR, SUBMISSIONS_DIR
from src.data import load_train, load_test, split_X_y, get_train_val_split
from src.preprocess import clean
from src.features import add_features, get_categorical_feature_names
from src.model import build_model
from src.train import run_train_pipeline
from src.predict import run_predict_pipeline
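
The `sys.path.insert` line above only works when the working directory is exactly one level below the project root. A more forgiving alternative is to walk up the directory tree until a folder containing `src/` is found. The sketch below is one way to do that; `find_project_root` is a hypothetical helper, not part of this project, and the demo builds a throwaway directory tree so it can run anywhere.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` and return the first directory containing `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' above {start}")


# Demo on a temporary tree: <root>/src and <root>/notebooks/drafts
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "src").mkdir()
    nested = root / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    print(found == root)
```

In a notebook, this could replace the hard-coded parent lookup with `sys.path.insert(0, str(find_project_root(Path.cwd())))`.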

### Step 1: Load and Preprocess <a id="step-1-load-and-preprocess"></a>

In [2]:
df_train = load_train()
df_train = clean(df_train)
df_train = add_features(df_train)

X, y = split_X_y(df_train, target_col="overqualified")
y = y.astype(int)

print("Training samples:", len(df_train))
print("Features:", list(X.columns))
print("Target distribution:", y.value_counts().to_dict())

Training samples: 7709
Features: ['CERTLEVP', 'PGMCIPAP', 'PGM_P034', 'PGM_P036', 'PGM_280A', 'PGM_280B', 'PGM_280C', 'PGM_280F', 'PGM_P401', 'STULOANS', 'DBTOTGRD', 'SCHOLARP', 'PREVLEVP', 'HLOSGRDP', 'GRADAGEP', 'GENDER2', 'CTZSHIPP', 'VISBMINP', 'DDIS_FL', 'PAR1GRD', 'PAR2GRD', 'BEF_P140', 'BEF_160']
Target distribution: {0: 4745, 1: 2964}
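
The target distribution above gives a useful reference point for the accuracy figures that follow: a model that always predicts the majority class would already be correct on a fixed share of the training rows. A minimal sketch of that arithmetic, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the "Target distribution" output above.
class_counts = {0: 4745, 1: 2964}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"Total samples: {total}")
print(f"Majority class: {majority_class}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline is computed on the full training set. A held-out validation split may have a slightly different class balance, so comparisons against validation accuracy are approximate.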


### Step 2: Train <a id="step-2-train"></a>

Run the full training pipeline: evaluate on a validation split with early stopping, report cross-validation accuracy, retrain on the full training set, and save the model and artifacts. This may take a minute.

In [3]:
metrics = run_train_pipeline(validate=True)
print("\nMetrics:", metrics)

0:	learn: 0.6246149	test: 0.6115435	best: 0.6115435 (0)	total: 2.48ms	remaining: 1.24s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.6640726329
bestIteration = 9

Shrink model to first 10 iterations.
Validation accuracy: 0.6641
CV accuracy: 0.6669 ± 0.0081
0:	learn: 0.6488520	total: 6.84ms	remaining: 3.41s
100:	learn: 0.6982747	total: 729ms	remaining: 2.88s
200:	learn: 0.7198080	total: 1.71s	remaining: 2.54s
300:	learn: 0.7369309	total: 2.55s	remaining: 1.69s
400:	learn: 0.7505513	total: 3.37s	remaining: 832ms
499:	learn: 0.7636529	total: 4.2s	remaining: 0us
Model saved to /Users/florykhan/Documents/Projects/Data & Machine Learning/graduate-underemployment-prediction/models/model.cbm

Metrics: {'cv_mean_accuracy': 0.6669394198703666, 'cv_std_accuracy': 0.008124213340195165, 'val_accuracy': 0.6640726329442282}
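
The log above shows the overfitting detector stopping the validation run after 20 iterations without improvement, with the best validation score at iteration 9. The sketch below illustrates that kind of patience rule in plain Python. It is a simplified illustration, not CatBoost's implementation; `find_early_stop` and the score sequence are hypothetical.

```python
from typing import Sequence, Tuple


def find_early_stop(val_scores: Sequence[float], patience: int = 20) -> Tuple[int, int]:
    """Return (best_iteration, stop_iteration) under a patience rule.

    Training stops once `patience` iterations have passed without beating
    the best validation score seen so far (higher is better).
    """
    if not val_scores:
        raise ValueError("val_scores must not be empty")
    best_iter = 0
    best_score = val_scores[0]
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_iter, best_score = i, score
        elif i - best_iter >= patience:
            return best_iter, i
    # Ran out of iterations before the patience limit was reached.
    return best_iter, len(val_scores) - 1


# Hypothetical scores: improve up to iteration 9, then plateau at a lower value.
scores = [0.61 + 0.005 * i for i in range(10)] + [0.64] * 40
best_iter, stop_iter = find_early_stop(scores, patience=20)
print(best_iter, stop_iter)
```

With the best score at iteration 9, keeping iterations 0 through 9 corresponds to the 10 iterations mentioned in the "Shrink model" line of the log.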


### Step 3: Predict <a id="step-3-predict"></a>

Load the saved model, run prediction on the test set, and write the submission CSV.

In [4]:
out_path = run_predict_pipeline(output_name="submission.csv")
sub = pd.read_csv(out_path)
print("Submission preview:")
print(sub.head(10))
print("\nSubmission shape:", sub.shape)

Submission written to /Users/florykhan/Documents/Projects/Data & Machine Learning/graduate-underemployment-prediction/submissions/submission.csv (2508 rows)
Submission preview:
      id  overqualified
0  10636              0
1  12024              0
2  11353              1
3  10555              0
4   7751              1
5   7003              0
6    807              0
7   7413              0
8   5121              1
9   3817              0

Submission shape: (2508, 2)
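
Before uploading, it can help to sanity-check the submission format. The sketch below assumes the two-column layout shown in the preview (`id`, `overqualified`) and binary labels. `check_submission` is a hypothetical helper, not part of `src/`, and the sample rows are copied from the preview above.

```python
import csv
import io
from typing import List


def check_submission(csv_text: str, expected_rows: int) -> List[str]:
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems: List[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "overqualified"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    bad = [row["overqualified"] for row in rows if row["overqualified"] not in {"0", "1"}]
    if bad:
        problems.append(f"non-binary labels: {sorted(set(bad))}")
    return problems


# First three rows from the preview above.
sample = "id,overqualified\n10636,0\n12024,0\n11353,1\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, one option is to read the CSV at `out_path` as text and pass `expected_rows=2508`, the row count reported above.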


---
## 📝 Summary <a id="summary"></a>

The end-to-end pipeline:

1. **Data:** `data/raw/train.csv`, `data/raw/test.csv`  
2. **Preprocessing:** `clean()` (NGS codes, mixed types) and `add_features()` (categorical strings)  
3. **Training:** CatBoost with early stopping; optional CV and hyperparameter tuning  
4. **Artifacts:** `models/model.cbm`, `models/artifacts.pkl`  
5. **Submission:** `submissions/submission.csv` (id, overqualified)

To reproduce from the terminal:

```bash
pip install -r requirements.txt
python3 -m src.train
python3 -m src.predict
```

For full methodology, results, and leaderboard context, see the [README](../README.md) and [reports/report.md](../reports/report.md).