# 02: Preprocessing & Feature Engineering

> **Objective:** To implement the cleaning and feature-engineering steps used in the overqualification pipeline: handling NGS special codes, normalizing mixed-type columns, and preparing categorical features for CatBoost.

This notebook covers:
1. [**Preprocessing**](#preprocessing) with `clean()`: missing codes and categorical normalization
2. [**Feature engineering**](#feature-engineering) with `add_features()`: categorical encoding for CatBoost
3. [**Before/after comparison**](#before-and-after-comparison): data shape and sample values

### 🧠 Context

The NGS dataset uses the special codes **6, 9, and 99** to mark valid skip, refused, and not stated responses. The pipeline treats these as missing and fills them consistently. Columns such as **GENDER2**, **DDIS_FL**, and **VISBMINP** sometimes contain text labels (e.g. "Female", "With disability") alongside numeric codes; these are normalized to numeric codes before being converted to categorical strings for CatBoost.
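As an illustration of treating the special codes as missing, here is a minimal sketch. The helper name `mask_special_codes` and the toy frame are hypothetical, and the real `clean()` may apply the codes to a different set of columns than the ones passed here.

```python
import pandas as pd

# Assumption: 6, 9 and 99 are special codes in every column passed in;
# the real pipeline may decide this per column.
SPECIAL_CODES = [6, 9, 99]


def mask_special_codes(df, columns):
    """Return a copy of df with special codes in `columns` set to NaN."""
    out = df.copy()
    # isin() flags the special codes; mask() replaces flagged cells with NaN.
    out[columns] = out[columns].mask(out[columns].isin(SPECIAL_CODES))
    return out


# Toy data, not taken from the NGS file.
demo = pd.DataFrame({"PGM_P036": [6.0, 1.0, 9.0], "PGMCIPAP": [4.0, 99.0, 5.0]})
masked = mask_special_codes(demo, ["PGM_P036", "PGMCIPAP"])
print(masked.isna().sum().to_dict())
```

Passing the column list explicitly keeps the sketch from touching columns where 6 or 9 might be an ordinary category.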

---
### 🧰 Imports

In [1]:
import sys
from pathlib import Path

import pandas as pd

sys.path.insert(0, str(Path().resolve().parent))

from src.data import load_train
from src.preprocess import clean
from src.features import add_features, get_categorical_feature_names

### 📥 Load raw data

In [2]:
df_raw = load_train()
print("Shape:", df_raw.shape)
df_raw.head()

Shape: (7709, 25)


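| | id | CERTLEVP | PGMCIPAP | PGM_P034 | PGM_P036 | PGM_280A | PGM_280B | PGM_280C | PGM_280F | PGM_P401 | ... | GRADAGEP | GENDER2 | CTZSHIPP | VISBMINP | DDIS_FL | PAR1GRD | PAR2GRD | BEF_P140 | BEF_160 | overqualified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 187 | 1.0 | 4.0 | 1.0 | 6.0 | 2.0 | 1.0 | 9.0 | 2.0 | NaN | ... | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 3.0 | 9.0 | 3.0 | 4.0 | 0 |
| 1 | 5343 | 2.0 | 5.0 | 1.0 | 6.0 | 2.0 | 6.0 | 2.0 | 9.0 | 2.0 | ... | 1.0 | 2.0 | 1.0 | 2.0 | 1.0 | 3.0 | 6.0 | 3.0 | 6.0 | 0 |
| 2 | 7011 | 2.0 | 99.0 | 1.0 | 6.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | ... | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | NaN | 2.0 | 3.0 | 3.0 | 0 |
| 3 | 1519 | 1.0 | 7.0 | 1.0 | 6.0 | 2.0 | 2.0 | 2.0 | 1.0 | 1.0 | ... | 4.0 | 9.0 | 2.0 | 2.0 | NaN | 6.0 | 3.0 | 1.0 | NaN | 0 |
| 4 | 6770 | 2.0 | 5.0 | 9.0 | 1.0 | 2.0 | 9.0 | 2.0 | 1.0 | 2.0 | ... | 1.0 | NaN | 1.0 | 2.0 | 1.0 | 6.0 | 6.0 | 3.0 | 3.0 | 0 |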


### 🧹 Preprocessing <a id="preprocessing"></a>

`clean()`:
- Replaces NGS codes **6, 9, 99** with `NaN` in numeric/code columns  
- Normalizes **GENDER2** (e.g. "Male" → 1, "Female" → 2)  
- Normalizes **DDIS_FL** ("With disability" / "Without disability")  
- Normalizes **VISBMINP** (e.g. "Yes" / "No")
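The label normalization listed above could look roughly like the following sketch. Only the GENDER2 mapping ("Male" to 1, "Female" to 2) comes from this notebook; the code values for DDIS_FL and VISBMINP, and the helper name `normalize_labels`, are assumptions made for illustration.

```python
import pandas as pd

# GENDER2 codes follow the text above. The DDIS_FL and VISBMINP codes are
# assumed for this sketch and may differ from the real clean().
LABEL_MAPS = {
    "GENDER2": {"male": 1, "female": 2},
    "DDIS_FL": {"with disability": 1, "without disability": 2},
    "VISBMINP": {"yes": 1, "no": 2},
}


def normalize_labels(series, mapping):
    """Map text labels to numeric codes, keeping values that are already numeric."""
    # Numeric codes (including numeric strings) survive; text becomes NaN.
    numeric = pd.to_numeric(series, errors="coerce")
    # Case- and whitespace-insensitive lookup of the text labels.
    labels = series.astype(str).str.strip().str.lower()
    from_text = labels.map(mapping)
    # Prefer the numeric reading and fall back to the mapped label.
    return numeric.fillna(from_text)


# Toy mixed-type column, not taken from the NGS file.
demo_gender = pd.Series(["Female", 1.0, "2", None, " male "])
codes = normalize_labels(demo_gender, LABEL_MAPS["GENDER2"])
print(codes.tolist())
```

Values that are neither numeric nor a known label come out as `NaN`, so they end up handled like any other missing value downstream.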

In [3]:
df_cleaned = clean(df_raw)
print("After clean():")
print("  GENDER2 sample values:", df_cleaned["GENDER2"].dropna().astype(str).unique()[:8])
print("  DDIS_FL sample values:", df_cleaned["DDIS_FL"].dropna().astype(str).unique()[:8])
print("  Null count (should increase where 6/9/99 were replaced):", df_cleaned.isnull().sum().sum())

After clean():
  GENDER2 sample values: ['2.0' '9.0' '1.0' '3.0' '0.0']
  DDIS_FL sample values: ['2.0' '1.0' '3.0' '0.0']
  Null count (should increase where 6/9/99 were replaced): 8706
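To see where the null count changed rather than only the total, the counts can be compared per column. A minimal sketch, using a hypothetical helper `null_delta` and toy frames in place of the real data:

```python
import numpy as np
import pandas as pd


def null_delta(before, after):
    """Per-column change in null count between two frames with the same columns."""
    delta = after.isna().sum() - before.isna().sum()
    # Keep only columns whose null count actually changed.
    return delta[delta != 0].sort_values(ascending=False)


# Toy frames: column A loses one special code (6.0) to NaN, B is unchanged.
before = pd.DataFrame({"A": [6.0, 1.0, np.nan], "B": [1.0, 2.0, 3.0]})
after = pd.DataFrame({"A": [np.nan, 1.0, np.nan], "B": [1.0, 2.0, 3.0]})
print(null_delta(before, after))
```

In this notebook the same helper could be called as `null_delta(df_raw, df_cleaned)`.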


### 🔧 Feature Engineering <a id="feature-engineering"></a>

`add_features()`:
- Converts all survey-code columns to **string** type (CatBoost treats object columns as categorical)  
- Fills remaining NaN in those columns with the string `"missing"` so CatBoost can use them as a category
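The two steps above can be sketched as follows. This is not the actual body of `add_features()`: the helper name `to_category_strings` is hypothetical, and the cast through pandas' nullable `Int64` dtype assumes every code is a whole number.

```python
import numpy as np
import pandas as pd


def to_category_strings(df, columns):
    """Return a copy with `columns` as strings and NaN replaced by "missing"."""
    out = df.copy()
    for col in columns:
        # Nullable Int64 drops the trailing ".0" (1.0 becomes "1").
        # Assumption: codes are whole numbers; other values would raise here.
        codes = out[col].astype("Int64").astype("string")
        # Give missing values an explicit category and store plain Python strings.
        out[col] = codes.fillna("missing").astype(object)
    return out


# Toy data, not taken from the NGS file.
demo = pd.DataFrame({"CERTLEVP": [1.0, np.nan, 3.0], "id": [10, 11, 12]})
engineered = to_category_strings(demo, ["CERTLEVP"])
print(engineered["CERTLEVP"].tolist())
```

Using an explicit `"missing"` string, rather than leaving `NaN` in place, gives the model an ordinary category value to work with.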

In [4]:
df_engineered = add_features(df_cleaned)
cat_cols = get_categorical_feature_names()
print("Categorical feature names (for CatBoost):", cat_cols)
print("\nSample of engineered columns (string type):")
print(df_engineered[cat_cols[:5]].dtypes)
print(
    "\nUnique values in CERTLEVP (after add_features):",
    df_engineered["CERTLEVP"].astype(str).unique()[:10],
)

Categorical feature names (for CatBoost): ['CERTLEVP', 'PGMCIPAP', 'PGM_P034', 'PGM_P036', 'PGM_280A', 'PGM_280B', 'PGM_280C', 'PGM_280F', 'PGM_P401', 'STULOANS', 'DBTOTGRD', 'SCHOLARP', 'PREVLEVP', 'HLOSGRDP', 'GRADAGEP', 'GENDER2', 'CTZSHIPP', 'VISBMINP', 'DDIS_FL', 'PAR1GRD', 'PAR2GRD', 'BEF_P140', 'BEF_160']

Sample of engineered columns (string type):
CERTLEVP    object
PGMCIPAP    object
PGM_P034    object
PGM_P036    object
PGM_280A    object
dtype: object

Unique values in CERTLEVP (after add_features): ['1' '2' '3' '9' '4' '5' 'missing']


### 📊 Before and After Comparison <a id="before-and-after-comparison"></a>

In [5]:
print("Raw shape:", df_raw.shape)
print("After clean + add_features:", df_engineered.shape)
print("\nNo rows/columns dropped; only types and values normalized.")
print("\nPipeline order: load_train() → clean() → add_features() → split_X_y() for model.")

Raw shape: (7709, 25)
After clean + add_features: (7709, 25)

No rows/columns dropped; only types and values normalized.

Pipeline order: load_train() → clean() → add_features() → split_X_y() for model.
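The last step of the pipeline separates predictors from the target. The signature of `split_X_y()` is not shown in this notebook, so the sketch below uses a hypothetical stand-in, `split_features_target`, and assumes that `id` is dropped and `overqualified` is the target.

```python
import pandas as pd

ID_COL = "id"
TARGET_COL = "overqualified"


def split_features_target(df):
    """Hypothetical stand-in for split_X_y(): drop id and target, return (X, y)."""
    X = df.drop(columns=[ID_COL, TARGET_COL])
    y = df[TARGET_COL]
    return X, y


# Toy engineered frame, not taken from the NGS file.
demo = pd.DataFrame(
    {
        "id": [187, 5343],
        "CERTLEVP": ["1", "2"],
        "GENDER2": ["2", "missing"],
        "overqualified": [0, 0],
    }
)
X, y = split_features_target(demo)
print(list(X.columns), y.tolist())
```

Because every remaining column of `X` is a string-typed category, the full column list can be handed to CatBoost as its categorical features, for example through the `cat_features` argument.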


---
## 📝 Summary

Preprocessing and feature engineering produce a single DataFrame that retains `id` and `overqualified` and has all predictor columns as **string-typed categories** suitable for CatBoost. The same sequence is used in `src/train.py` and `src/predict.py`.

**Next step:** `03_catboost_training_tuning.ipynb`, which trains and tunes the CatBoost model.