# Data Preprocessing

This notebook prepares the input datasets (`X_train.csv`, `X_test.csv`, `y_train.csv`) for modeling by cleaning and imputing missing values.  
It ensures both train and test data are fully numeric, aligned, and ready for model development in `xgb.ipynb`.

---

## Purpose

The notebook handles:
- Loading raw CSV data from the `data/` folder  
- Checking missing values and feature consistency  
- Applying **numerical imputation** using methods from `src/preprocess.py`  
- Saving clean, imputed datasets as `X_train_imp.csv` and `X_test_imp.csv`

This is the **first step** in the full modeling pipeline. All downstream steps depend on these imputed files.

---

## Steps Performed

1. **Load Data**
   - Reads `X_train.csv`, `y_train.csv`, and `X_test.csv` from `data/`
   - Checks for shape and column alignment
   - Ensures index consistency between `X_train` and `y_train`

2. **Inspect Missingness**
   - Calculates per-feature missing value ratios
   - Plots a missingness bar chart for quick assessment

3. **Imputation**
   - Fits an imputer on the training data only, via `fit_mice_bayes_imputer` from `src/preprocess.py`
   - Transforms both `X_train` and `X_test`
   - Prevents **data leakage** by never fitting on test data

4. **Post-Imputation Checks**
   - Verifies that there are no NaNs remaining
   - Confirms column order consistency
   - Optionally inspects the distribution of selected features before/after imputation

5. **Save Outputs**
   - Writes:
     - `data/X_train_imp.csv`
     - `data/X_test_imp.csv`
   - These are the official model-ready feature files
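The fit-on-train, transform-both pattern from step 3 can be sketched as follows. The internals of `fit_mice_bayes_imputer` are not shown in this notebook, so this is only a guess based on its name: it assumes scikit-learn's `IterativeImputer` with a `BayesianRidge` estimator. The helper names `fit_imputer` and `apply_imputer` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


def fit_imputer(X_train, max_iter=15, random_state=1):
    """Fit a MICE-style imputer on the training features only (hypothetical helper)."""
    imputer = IterativeImputer(
        estimator=BayesianRidge(), max_iter=max_iter, random_state=random_state
    )
    imputer.fit(X_train)
    return imputer


def apply_imputer(imputer, X):
    """Transform a frame with an already fitted imputer, keeping index and columns."""
    values = imputer.transform(X)
    return pd.DataFrame(values, index=X.index, columns=X.columns)


# Toy data standing in for X_train / X_test (assumption, not the real files).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
test = pd.DataFrame(rng.normal(size=(20, 3)), columns=["a", "b", "c"])
train.iloc[::7, 1] = np.nan
test.iloc[::5, 2] = np.nan

imputer = fit_imputer(train)               # the test frame is never passed to fit
train_imp = apply_imputer(imputer, train)
test_imp = apply_imputer(imputer, test)
```

Because the imputer only ever sees `train` during fitting, the values filled into `test` depend on training statistics alone, which is what keeps the step leakage-free.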

---

## Outputs

```
data/
├── X_train_imp.csv   # Imputed and cleaned training data
└── X_test_imp.csv    # Imputed and cleaned test data
```

These files are later consumed by `xgb.ipynb` for model training and evaluation.
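Since the files are written with the index kept, a downstream reader probably needs `index_col=0` to get the same frame back. The round trip below is a minimal sketch on a toy frame written to a temporary directory; the index name `id` and the feature names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for X_train_imp (hypothetical column and index names).
X = pd.DataFrame(
    {"f1": [0.5, 1.5], "f2": [2.0, 3.0]},
    index=pd.Index([10, 11], name="id"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train_imp.csv"
    X.to_csv(path)                              # index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # first column becomes the index again
    naive = pd.read_csv(path)                   # without index_col, the index turns into a feature column
```

Reading without `index_col=0` would silently add the index as an extra column, which would change the feature set seen by the model.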

---

## 🧩 Notes

- Missingness was roughly uniform across features (about 5% each),  
  so no single variable is excessively incomplete.
- The final datasets preserve the original feature order and statistical characteristics.
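A per-feature missingness ratio like the one quoted above can be computed in one line with pandas. The sketch below uses a toy frame rather than the real data, and the helper name `missing_ratio` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_ratio(X):
    """Return the fraction of missing values per column, largest first."""
    return X.isna().mean().sort_values(ascending=False)


# Toy frame standing in for X_train (assumption).
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, np.nan, 1.0, 2.0],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)
ratios = missing_ratio(toy)
```

Calling `ratios.plot.bar()` on the result would give a bar chart of the kind described in step 2, assuming matplotlib is installed.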

---

## Next Step

After completing preprocessing:
1. Verify that both imputed files exist in `data/`.
2. Proceed to `notebooks/xgb.ipynb` to train and tune the XGBoost model.

---

## Summary

| Step | Action | Output |
|------|---------|---------|
| 1 | Load & inspect raw CSVs | Data overview |
| 2 | Compute missingness ratios | Missingness chart |
| 3 | Impute missing values | Fully numeric train/test |
| 4 | Validate consistency | No NaN values |
| 5 | Save imputed files | `X_train_imp.csv`, `X_test_imp.csv` |

This notebook ensures **clean, consistent, and leakage-free input data** for modeling.
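The validation in step 4 can be approximated with a few plain pandas checks. This is a sketch under assumptions, not the project's `assert_no_overwrite`: the helper name `check_imputed` is hypothetical, and the toy frames use a constant fill in place of a real imputer.

```python
import numpy as np
import pandas as pd


def check_imputed(raw, imp):
    """Raise ValueError if the imputed frame is inconsistent with the raw one."""
    if imp.isna().any().any():
        raise ValueError("NaNs remain after imputation")
    if list(imp.columns) != list(raw.columns):
        raise ValueError("column order changed")
    if not imp.index.equals(raw.index):
        raise ValueError("index changed")
    observed = raw.notna().to_numpy()
    # Cells that were observed in the raw data must be identical after imputation.
    if not np.allclose(raw.to_numpy()[observed], imp.to_numpy()[observed]):
        raise ValueError("observed values were overwritten")
    return True


# Toy frames (assumption): a constant fill stands in for the real imputer.
raw = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
imp = raw.fillna(0.0)

# A deliberately corrupted copy in which an observed value was changed.
bad = imp.copy()
bad.loc[0, "a"] = 99.0

ok = check_imputed(raw, imp)
```

Passing `bad` instead of `imp` raises a `ValueError`, since the observed value in row 0 of column `a` no longer matches the raw data.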


In [1]:
%load_ext autoreload
%autoreload 2

import sys

import numpy as np
import pandas as pd

sys.path.append("../src")  # make the helper modules in src/ importable
from preprocess import (
    fit_mice_bayes_imputer, transform_with_imputer,
    assert_no_overwrite, summarize_differences,
)
from util_io import load_frames


In [2]:
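# Load the raw frames from data/.
X_train_raw, y_train, X_test_raw = load_frames("../data")

# Fit on the training features only, so no test information reaches the imputer.
imputer = fit_mice_bayes_imputer(X_train_raw, max_iter=15, random_state=1)

# Apply the same fitted imputer to both splits.
X_train_imp = transform_with_imputer(imputer, X_train_raw)
X_test_imp = transform_with_imputer(imputer, X_test_raw)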

In [3]:
#summarize_differences(X_train_raw, X_train_imp)
#summarize_differences(X_test_raw, X_test_imp)

assert_no_overwrite(X_train_raw, X_train_imp)
assert_no_overwrite(X_test_raw, X_test_imp)

passed -- no overwrite
passed -- no overwrite


In [4]:
X_train_imp.to_csv("../data/X_train_imp.csv")   # keep index
X_test_imp.to_csv("../data/X_test_imp.csv")