# 02 — Preprocessing Pipeline (Leakage-Safe)

Hotel Booking Demand (Cancellation Prediction)

## Notebook purpose

This notebook defines and validates the preprocessing pipelines used by all supervised models in the project.
The pipelines are designed to:

- avoid data leakage by fitting transformations on the training split only
- impute missing values consistently
- encode categorical variables
- scale numeric variables when required by the model family
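The first goal above (fitting on the training split only) can be illustrated with a minimal standard-library sketch. The values and the `standardize` helper are hypothetical and are not part of `src.preprocessing`; the point is only that the mean and standard deviation come from the training values and are reused unchanged on the test values.

```python
from statistics import mean, pstdev

# Toy numeric feature; the values are illustrative only.
train_values = [10.0, 12.0, 14.0, 16.0]
test_values = [13.0, 40.0]

# Fit: the statistics come from the training split only.
mu = mean(train_values)
sigma = pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply a scaling that was fitted elsewhere; nothing is refitted here."""
    return [(v - mu) / sigma for v in values]


# Transform: both splits reuse the training statistics.
train_scaled = standardize(train_values, mu, sigma)
test_scaled = standardize(test_values, mu, sigma)

print("train mean after scaling:", round(mean(train_scaled), 6))
print("test values after scaling:", [round(v, 3) for v in test_scaled])
```

Recomputing `mu` and `sigma` on the test values (or on train and test together) would let test-set information influence the transformation, which is the leakage the pipelines are designed to avoid.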

## Inputs

- Preferred: `data/processed/hotel_bookings_dedup.csv`
- Fallback: `data/raw/hotel_bookings.csv`

## Outputs (overwritten on each run)

Saved to a fixed artifact structure:

- `artifacts/preprocessing/preprocessor_sparse.joblib`
- `artifacts/preprocessing/preprocessor_dense.joblib`
- `artifacts/preprocessing/preprocess_options_sparse.json`
- `artifacts/preprocessing/preprocess_options_dense.json`
- `artifacts/preprocessing/transform_info_sparse.json`
- `artifacts/preprocessing/transform_info_dense.json`
- `artifacts/preprocessing/feature_names.csv`
- `artifacts/data/train_test_split.json`
- `artifacts/data/label_distribution_train.csv`
- `artifacts/data/label_distribution_test.csv`
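- `artifacts/reports/preprocessing_notes.md`
- `artifacts/reports/run_metadata.json`
- `artifacts/preprocessing/X_train_dense_sample.npy`
- `artifacts/preprocessing/X_train_sparse_sample.npz` (only when the sparse preprocessor returns a SciPy sparse matrix)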


In [1]:
# Repository bootstrap (fixes ModuleNotFoundError: 'src')
# The repository root is resolved quickly using Git when available.
# A bounded parent-directory scan is used as a fallback.

import os
import sys
import subprocess
from pathlib import Path


def _find_repo_root(max_levels: int = 25) -> Path:
    # Fast path: Git repository root (works when the notebook is executed inside the repo)
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"],
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
        p = Path(out)
        if (p / "src").is_dir():
            print("✓ Found repository root via Git:", p)
            return p
    except Exception:
        print("Git lookup not available, falling back to directory scan...")

    # Fallback: bounded parent scan (prevents long scans on unusual paths)
    cwd = Path.cwd()
    candidates = [cwd] + list(cwd.parents)

    # Progress bar for directory scan
    try:
        from tqdm.auto import tqdm

        candidates_iter = tqdm(
            candidates[:max_levels], desc="Scanning parent directories", unit="dir"
        )
    except ImportError:
        candidates_iter = candidates[:max_levels]

    for p in candidates_iter:
        if (p / "src").is_dir():
            print(f"✓ Found repository root: {p}")
            return p

    raise FileNotFoundError(
        "Folder 'src' was not found within the parent directories. "
        "Open the repository root folder in VS Code and rerun the notebook."
    )


root = _find_repo_root(max_levels=25)

os.chdir(root)
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

print("Working directory:", Path.cwd())
print("Python path entry added:", root)

✓ Found repository root via Git: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment
Working directory: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment
Python path entry added: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment


## Imports and artifact folder initialization

The `artifacts/` folder is used as a fixed output location. Files are overwritten on each run to keep the latest outputs available.
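The implementation of `ensure_artifact_dirs` is not shown in this notebook. As a rough sketch of the idea, assuming it creates the sub-folders and returns a dictionary keyed by the names used later (`base`, `data`, `preprocessing`, `reports`), a hypothetical stand-in could look like this:

```python
import tempfile
from pathlib import Path


def ensure_dirs_sketch(base):
    """Hypothetical stand-in for src.io_utils.ensure_artifact_dirs."""
    base = Path(base)
    dirs = {
        "base": base,
        "data": base / "data",
        "preprocessing": base / "preprocessing",
        "reports": base / "reports",
    }
    for path in dirs.values():
        # exist_ok=True makes repeated runs safe: existing folders are reused.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


with tempfile.TemporaryDirectory() as tmp:
    art = ensure_dirs_sketch(Path(tmp) / "artifacts")
    art = ensure_dirs_sketch(Path(tmp) / "artifacts")  # second call is a no-op
    created = sorted(p.name for p in art["base"].iterdir())
    print(created)
```

Because the folders are reused rather than recreated, files written with the same names on a later run replace the earlier ones, which matches the overwrite behaviour described above.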


In [2]:
import platform
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

# Progress bar utility. tqdm is treated as a hard requirement here,
# so a missing install fails fast with an actionable message.
try:
    from tqdm.auto import tqdm
except ImportError as exc:
    raise ImportError(
        "Package 'tqdm' is required. Install with: pip install tqdm"
    ) from exc

from src.config import (
    PROJECT_NAME,
    RANDOM_STATE,
    TARGET_COL,
    DEFAULT_DATA_PATH,
    LEAKAGE_COLS,
    FORCE_CATEGORICAL_COLS,
)
from src.data_loader import load_hotel_bookings, basic_train_ready_checks
from src.preprocessing import build_preprocessor, PreprocessOptions, get_feature_names
from src.io_utils import (
    ensure_artifact_dirs,
    save_json,
    save_text,
    save_dataframe,
    save_model,
    save_run_metadata,
)

ART = ensure_artifact_dirs("artifacts")

meta_path = save_run_metadata(
    {
        "project": PROJECT_NAME,
        "random_state": RANDOM_STATE,
        "target_col": TARGET_COL,
        "notebook": "02_preprocessing_pipeline.ipynb",
        "python_version": sys.version,
        "platform": platform.platform(),
    },
    base_dir="artifacts",
    repo_root=".",
)

print("Metadata file:", meta_path.resolve())
print("Artifacts base:", ART["base"].resolve())


Metadata file: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\reports\run_metadata.json
Artifacts base: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts


## Dataset loading

The processed (deduplicated) dataset is preferred when available so that every team member trains on the same rows.


In [3]:
preferred_processed = Path("data/processed/hotel_bookings_dedup.csv")
fallback_raw = Path(DEFAULT_DATA_PATH)

# Prefer the deduplicated file; fall back to the raw CSV when it is absent.
dataset_path = preferred_processed if preferred_processed.exists() else fallback_raw
if not dataset_path.exists():
    raise FileNotFoundError(
        f"Dataset not found. Place the CSV at {fallback_raw.as_posix()}"
    )

print("Dataset path:", dataset_path.resolve())

df = load_hotel_bookings(dataset_path, drop_duplicates=False, verbose=True)
basic_train_ready_checks(df, target_col=TARGET_COL)

display(df.head())
display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))

Dataset path: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\data\processed\hotel_bookings_dedup.csv
[data_loader] Loaded shape: (87396, 32)
[data_loader] Columns: 32


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | | | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


| | rows | columns |
| --- | --- | --- |
| 0 | 87396 | 32 |


## Train/test split

An 80/20 stratified split preserves the class balance of `is_canceled` in both partitions (about 27.5% cancellations in each).
Split metadata and label distributions are saved and displayed.


In [4]:
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL].astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    random_state=RANDOM_STATE,
    stratify=y,
)

split_info = {
    "n_rows_total": int(len(df)),
    "n_train": int(len(X_train)),
    "n_test": int(len(X_test)),
    "test_size": 0.20,
    "random_state": RANDOM_STATE,
    "stratified": True,
}
save_json(split_info, ART["data"] / "train_test_split.json")
print("Saved:", (ART["data"] / "train_test_split.json").resolve())

train_dist = y_train.value_counts().rename_axis("label").reset_index(name="count")
train_dist["rate"] = train_dist["count"] / train_dist["count"].sum()
save_dataframe(train_dist, ART["data"] / "label_distribution_train.csv", index=False)
print("Saved:", (ART["data"] / "label_distribution_train.csv").resolve())

test_dist = y_test.value_counts().rename_axis("label").reset_index(name="count")
test_dist["rate"] = test_dist["count"] / test_dist["count"].sum()
save_dataframe(test_dist, ART["data"] / "label_distribution_test.csv", index=False)
print("Saved:", (ART["data"] / "label_distribution_test.csv").resolve())

display(pd.DataFrame([split_info]))
display(train_dist)
display(test_dist)

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\train_test_split.json
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\label_distribution_train.csv
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\label_distribution_test.csv


| | n_rows_total | n_train | n_test | test_size | random_state | stratified |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 87396 | 69916 | 17480 | 0.2 | 42 | True |


| | label | count | rate |
| --- | --- | --- | --- |
| 0 | 0 | 50696 | 0.725099 |
| 1 | 1 | 19220 | 0.274901 |


| | label | count | rate |
| --- | --- | --- | --- |
| 0 | 0 | 12675 | 0.725114 |
| 1 | 1 | 4805 | 0.274886 |
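The recorded split sizes and cancellation rates can be re-derived by hand as a quick consistency check. The sketch below uses only the counts printed above; the one assumption is that scikit-learn rounds a fractional test size up to the next whole row, which is consistent with the 17480 test rows reported here.

```python
from math import ceil

n_total = 87396
test_size = 0.20

# 0.20 * 87396 = 17479.2, so rounding up gives the reported 17480 test rows.
n_test = ceil(test_size * n_total)
n_train = n_total - n_test

# Label counts copied from the two distribution tables above.
train_counts = {0: 50696, 1: 19220}
test_counts = {0: 12675, 1: 4805}

train_rate = train_counts[1] / sum(train_counts.values())
test_rate = test_counts[1] / sum(test_counts.values())

print(n_train, n_test)
print(round(train_rate, 6), round(test_rate, 6))
print("rate gap between splits:", abs(train_rate - test_rate))
```

The two cancellation rates differ only in the fifth decimal place, which is what stratification is expected to produce.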


## Preprocessing configuration

Two preprocessors are built and saved:

- Sparse output: efficient for high-dimensional one-hot encoding (recommended for Logistic Regression and tree-based models)
- Dense output: recommended for KNN (dense distance computations)

Both preprocessors:

- drop leakage columns (defined in `src/config.py`)
- treat ID-like fields (e.g., `agent`, `company`) as categorical (defined in `src/config.py`)
- impute numeric features using median and categorical features using most-frequent
- clip numeric outliers to the training-set 1% and 99% quantiles (`lower_clip_q=0.01`, `upper_clip_q=0.99`)
- standardize numeric features (`scale_numeric=True`)
- one-hot encode categorical features (`onehot_min_frequency=0.01`) and ignore unseen categories at inference time
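The source of `build_preprocessor` is not shown here, so the following is only a sketch of how the steps listed above could be assembled with scikit-learn. The `QuantileClipper` class, the column lists, and the toy rows are hypothetical; the leakage column is dropped simply by leaving it out of the column lists, and `sparse_threshold=0` forces a dense result to keep the example easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical helper: clip each column to quantiles learned in fit()."""

    def __init__(self, lower_q=0.01, upper_q=0.99):
        self.lower_q = lower_q
        self.upper_q = upper_q

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_ = np.quantile(X, self.lower_q, axis=0)
        self.upper_ = np.quantile(X, self.upper_q, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)


numeric_cols = ["lead_time", "adr"]
categorical_cols = ["hotel", "agent"]  # agent is ID-like, so it is kept as text

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clip", QuantileClipper(0.01, 0.99)),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(
    [("num", numeric_pipe, numeric_cols), ("cat", categorical_pipe, categorical_cols)],
    remainder="drop",      # unlisted columns (here: reservation_status) are dropped
    sparse_threshold=0,    # always return a dense array in this sketch
)

train = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "agent": ["9", np.nan, "9", "240"],
    "lead_time": [10.0, 200.0, np.nan, 35.0],
    "adr": [80.0, 120.0, 95.0, 60.0],
    "reservation_status": ["Check-Out", "Canceled", "Check-Out", "Check-Out"],
})
test = pd.DataFrame({
    "hotel": ["Boutique Hotel"],   # category never seen during fit
    "agent": ["999"],              # category never seen during fit
    "lead_time": [50.0],
    "adr": [100.0],
    "reservation_status": ["Canceled"],
})

pre.fit(train)               # all statistics come from the training rows
out = pre.transform(test)    # unseen categories become all-zero one-hot blocks
print(out.shape)
```

In this sketch the test row keeps its two scaled numeric values, while every one-hot column is zero because neither category was present during fitting.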


In [5]:
opts_sparse = PreprocessOptions(
    output_sparse=True,
    scale_numeric=True,
    onehot_min_frequency=0.01,
    lower_clip_q=0.01,
    upper_clip_q=0.99,
)

opts_dense = PreprocessOptions(
    output_sparse=False,
    scale_numeric=True,
    onehot_min_frequency=0.01,
    lower_clip_q=0.01,
    upper_clip_q=0.99,
)

save_json(opts_sparse, ART["preprocessing"] / "preprocess_options_sparse.json")
save_json(opts_dense, ART["preprocessing"] / "preprocess_options_dense.json")

print("Saved:", (ART["preprocessing"] / "preprocess_options_sparse.json").resolve())
print("Saved:", (ART["preprocessing"] / "preprocess_options_dense.json").resolve())

display(
    pd.DataFrame(
        {
            "option": [
                "output_sparse",
                "scale_numeric",
                "onehot_min_frequency",
                "lower_clip_q",
                "upper_clip_q",
            ],
            "sparse_value": [
                opts_sparse.output_sparse,
                opts_sparse.scale_numeric,
                opts_sparse.onehot_min_frequency,
                opts_sparse.lower_clip_q,
                opts_sparse.upper_clip_q,
            ],
            "dense_value": [
                opts_dense.output_sparse,
                opts_dense.scale_numeric,
                opts_dense.onehot_min_frequency,
                opts_dense.lower_clip_q,
                opts_dense.upper_clip_q,
            ],
        }
    )
)

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\preprocess_options_sparse.json
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\preprocess_options_dense.json


| | option | sparse_value | dense_value |
| --- | --- | --- | --- |
| 0 | output_sparse | True | False |
| 1 | scale_numeric | True | True |
| 2 | onehot_min_frequency | 0.01 | 0.01 |
| 3 | lower_clip_q | 0.01 | 0.01 |
| 4 | upper_clip_q | 0.99 | 0.99 |


## Fit and transform (sparse)

The sparse preprocessor is fitted on training data only, then applied to both train and test.
Transformed shapes are saved and displayed; the non-zero count and density are added only when the transformed training matrix is a SciPy sparse matrix.


In [6]:
pre_sparse = build_preprocessor(
    drop_cols=LEAKAGE_COLS,
    force_categorical_cols=FORCE_CATEGORICAL_COLS,
    options=opts_sparse,
)

pre_sparse.fit(X_train)

Xtr_sparse = pre_sparse.transform(X_train)
Xte_sparse = pre_sparse.transform(X_test)

sparse_info = {
    "X_train_shape_before": [int(X_train.shape[0]), int(X_train.shape[1])],
    "X_test_shape_before": [int(X_test.shape[0]), int(X_test.shape[1])],
    "X_train_shape_after": [int(Xtr_sparse.shape[0]), int(Xtr_sparse.shape[1])],
    "X_test_shape_after": [int(Xte_sparse.shape[0]), int(Xte_sparse.shape[1])],
    "output_sparse": True,
}

import scipy.sparse as sp

# Sparsity statistics are recorded only when the transformer actually
# returned a SciPy sparse matrix; a dense ndarray has no `nnz` attribute.
if sp.issparse(Xtr_sparse):
    sparse_info["train_nonzeros"] = int(Xtr_sparse.nnz)
    sparse_info["train_density"] = float(
        Xtr_sparse.nnz / (Xtr_sparse.shape[0] * Xtr_sparse.shape[1])
    )

save_json(sparse_info, ART["preprocessing"] / "transform_info_sparse.json")
print("Saved:", (ART["preprocessing"] / "transform_info_sparse.json").resolve())

save_model(pre_sparse, ART["preprocessing"] / "preprocessor_sparse.joblib")
print("Saved:", (ART["preprocessing"] / "preprocessor_sparse.joblib").resolve())

display(pd.DataFrame([sparse_info]))

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\transform_info_sparse.json
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\preprocessor_sparse.joblib


| | X_train_shape_before | X_test_shape_before | X_train_shape_after | X_test_shape_after | output_sparse |
| --- | --- | --- | --- | --- | --- |
| 0 | [69916, 31] | [17480, 31] | [69916, 102] | [17480, 102] | True |
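The recorded table has no `train_nonzeros` or `train_density` column, which is what the cell produces when `Xtr_sparse` is not a SciPy sparse matrix. One possible explanation, assuming `build_preprocessor` wraps a scikit-learn `ColumnTransformer` (its source is not shown, so this is a guess), is the `sparse_threshold` parameter: the combined output is returned as a dense array when its overall density is not below the threshold (default 0.3). The toy data and the `build` helper below are hypothetical.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "lead_time": [10.0, 200.0, 35.0, 7.0],
    "adr": [80.0, 120.0, 95.0, 60.0],
    "meal": ["BB", "HB", "SC", "BB"],
})


def build(sparse_threshold):
    """Two dense numeric columns plus one sparse one-hot block."""
    return ColumnTransformer(
        [
            ("num", StandardScaler(), ["lead_time", "adr"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["meal"]),
        ],
        sparse_threshold=sparse_threshold,
    )


# Density here is (2 numeric + 1 one-hot entry) / 5 columns = 0.6 per row.
default_out = build(0.3).fit_transform(df)   # 0.6 is not below 0.3: dense
forced_out = build(1.0).fit_transform(df)    # 0.6 is below 1.0: sparse

density = forced_out.nnz / (forced_out.shape[0] * forced_out.shape[1])
print(type(default_out).__name__, sparse.issparse(forced_out), density)
```

If this is what happens in the project pipeline, raising the threshold inside `build_preprocessor` would be one way to guarantee a sparse result; whether that is desirable depends on the downstream models.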


## Fit and transform (dense)

The dense preprocessor is fitted on training data only, then applied to both train and test.
The dense output is suitable for KNN and other dense-matrix algorithms.
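KNN compares rows by distance, so the scale of each column decides which neighbours count as close. The standard-library sketch below uses made-up rows of `(lead_time, adults)` to show the nearest neighbour changing once both columns are standardized with training statistics; the helper names are illustrative only.

```python
from math import dist
from statistics import mean, pstdev

# Toy rows: (lead_time in days, adults). Values are illustrative only.
train_rows = [(10.0, 1.0), (12.0, 4.0), (60.0, 2.0)]
query = (14.0, 1.0)


def nearest(rows, point):
    """Index of the row with the smallest Euclidean distance to point."""
    distances = [dist(row, point) for row in rows]
    return distances.index(min(distances))


# Per-column statistics fitted on the training rows only.
columns = list(zip(*train_rows))
mus = [mean(col) for col in columns]
sigmas = [pstdev(col) for col in columns]


def scale(row):
    return tuple((v - m) / s for v, m, s in zip(row, mus, sigmas))


nearest_raw = nearest(train_rows, query)
nearest_scaled = nearest([scale(r) for r in train_rows], scale(query))
print("nearest before scaling:", nearest_raw)
print("nearest after scaling:", nearest_scaled)
```

On the raw values, one day of lead time weighs as much as one additional adult, so the four-adult booking looks closest to the one-adult query. After standardization the one-adult booking with a similar lead time becomes the nearest neighbour.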


In [7]:
pre_dense = build_preprocessor(
    drop_cols=LEAKAGE_COLS,
    force_categorical_cols=FORCE_CATEGORICAL_COLS,
    options=opts_dense,
)

pre_dense.fit(X_train)

Xtr_dense = pre_dense.transform(X_train)
Xte_dense = pre_dense.transform(X_test)

dense_info = {
    "X_train_shape_after": [int(Xtr_dense.shape[0]), int(Xtr_dense.shape[1])],
    "X_test_shape_after": [int(Xte_dense.shape[0]), int(Xte_dense.shape[1])],
    "output_sparse": False,
}

save_json(dense_info, ART["preprocessing"] / "transform_info_dense.json")
print("Saved:", (ART["preprocessing"] / "transform_info_dense.json").resolve())

save_model(pre_dense, ART["preprocessing"] / "preprocessor_dense.joblib")
print("Saved:", (ART["preprocessing"] / "preprocessor_dense.joblib").resolve())

display(pd.DataFrame([dense_info]))

# Display a small numeric preview for sanity checking (first 3 rows, first 10 features)
dense_preview = pd.DataFrame(Xtr_dense[:3, :10])
display(dense_preview)

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\transform_info_dense.json
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\preprocessor_dense.joblib


| | X_train_shape_after | X_test_shape_after | output_sparse |
| --- | --- | --- | --- |
| 0 | [69916, 102] | [17480, 102] | False |


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.682939 | 1.151852 | -0.719226 | 1.488643 | 1.023004 | -0.846323 | 0.250807 | -0.306123 | -0.103003 | -0.201923 |
| 1 | 0.378889 | -0.306232 | 0.157687 | -0.322203 | -1.013102 | 0.211369 | 0.250807 | -0.306123 | -0.103003 | -0.201923 |
| 2 | 3.204666 | 1.151852 | -0.207694 | -0.322203 | -1.013102 | -0.846323 | 0.250807 | -0.306123 | -0.103003 | -0.201923 |


## Feature names (post-transform)

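Feature names are extracted for interpretability and saved to `artifacts/preprocessing/feature_names.csv`. The name count should equal the number of transformed columns (102 in this run). The recorded output below reports 0 names, so `get_feature_names` did not recover names from the fitted pipeline in this run and the saved CSV is empty.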


In [8]:
feature_names = get_feature_names(pre_sparse, input_features=list(X_train.columns))
feature_names_df = pd.DataFrame({"feature_name": feature_names})
save_dataframe(
    feature_names_df, ART["preprocessing"] / "feature_names.csv", index=False
)

print("Saved:", (ART["preprocessing"] / "feature_names.csv").resolve())
print("Total transformed features:", len(feature_names))

display(feature_names_df.head(50))

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\feature_names.csv
Total transformed features: 0


| | feature_name |
| --- | --- |


## Transformed samples (small, for debugging)

Small transformed samples are stored without saving full matrices:

- `artifacts/preprocessing/X_train_dense_sample.npy`
- `artifacts/preprocessing/X_train_sparse_sample.npz` (written only when the sparse preprocessor returns a SciPy sparse matrix; the recorded run below saved the dense sample only)


In [9]:
sample_n = 200
sample_idx = np.arange(min(sample_n, X_train.shape[0]))

dense_sample = Xtr_dense[sample_idx]
np.save(ART["preprocessing"] / "X_train_dense_sample.npy", dense_sample)
print("Saved:", (ART["preprocessing"] / "X_train_dense_sample.npy").resolve())

try:
    import scipy.sparse as sp
    from scipy.sparse import save_npz

    if sp.issparse(Xtr_sparse):
        sparse_sample = Xtr_sparse[sample_idx]
        save_npz(ART["preprocessing"] / "X_train_sparse_sample.npz", sparse_sample)
        print("Saved:", (ART["preprocessing"] / "X_train_sparse_sample.npz").resolve())
except Exception as e:
    print("Sparse sample save skipped:", str(e))

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\preprocessing\X_train_dense_sample.npy
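When debugging, the saved samples can be read back with `np.load` and `scipy.sparse.load_npz`. The round trip below is a self-contained sketch that writes small stand-in arrays to a temporary folder instead of touching `artifacts/`; the array contents are made up.

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy import sparse

dense_sample = np.arange(12, dtype=float).reshape(4, 3)
sparse_sample = sparse.csr_matrix(np.eye(4))

with tempfile.TemporaryDirectory() as tmp:
    dense_path = Path(tmp) / "X_train_dense_sample.npy"
    sparse_path = Path(tmp) / "X_train_sparse_sample.npz"

    np.save(dense_path, dense_sample)
    sparse.save_npz(str(sparse_path), sparse_sample)

    # Read both files back while the temporary folder still exists.
    dense_loaded = np.load(dense_path)
    sparse_loaded = sparse.load_npz(str(sparse_path))

print(dense_loaded.shape, sparse_loaded.shape, sparse_loaded.nnz)
```

Comparing the reloaded shapes with the values in `transform_info_*.json` is a quick way to confirm that a sample belongs to the current preprocessing run.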


## Preprocessing notes (report-ready)

The summary below is saved to `artifacts/reports/preprocessing_notes.md` and can be copied into the final report.


In [10]:
notes = [
    "Preprocessing summary",
    f"- Target column: {TARGET_COL}",
    f"- Leakage columns removed: {', '.join(LEAKAGE_COLS)}",
    f"- ID-like columns treated as categorical: {', '.join(FORCE_CATEGORICAL_COLS)}",
    "- Numeric processing: median imputation, quantile clipping (1%–99%), standard scaling.",
    "- Categorical processing: most-frequent imputation, one-hot encoding with unseen-category handling.",
    "- Two preprocessors saved: sparse (general use) and dense (KNN-friendly).",
    "- Fit performed on the training split only; transformations applied to both train and test after fitting.",
]

out_path = ART["reports"] / "preprocessing_notes.md"
save_text("\n".join(notes), out_path)
print("Saved:", out_path.resolve())

display(pd.DataFrame({"Preprocessing notes": notes}))

Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\reports\preprocessing_notes.md


| | Preprocessing notes |
| --- | --- |
| 0 | Preprocessing summary |
| 1 | - Target column: is_canceled |
| 2 | - Leakage columns removed: reservation_status,... |
| 3 | - ID-like columns treated as categorical: agen... |
| 4 | - Numeric processing: median imputation, quant... |
| 5 | - Categorical processing: most-frequent imputa... |
| 6 | - Two preprocessors saved: sparse (general use... |
| 7 | - Fit performed on the training split only; tr... |


## Next notebooks

Proceed to the model notebooks:

- `03_model_logreg.ipynb`
- `04_model_knn.ipynb`
- `05_model_decision_tree.ipynb`
- `06_model_random_forest.ipynb`
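The model notebooks are expected to reuse the fitted preprocessors rather than refit them. Assuming `save_model` writes a standard joblib file (suggested by the `.joblib` extension, but not confirmed here), reloading could look like the round trip below. A fitted `StandardScaler` stands in for the real preprocessor and a temporary folder stands in for `artifacts/preprocessing/`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [12.0], [14.0], [16.0]])
X_new = np.array([[13.0]])

# Stand-in for the fitted preprocessor saved by this notebook.
fitted = StandardScaler().fit(X_train)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preprocessor_dense.joblib"
    joblib.dump(fitted, path)
    reloaded = joblib.load(path)

# Only transform() is called downstream; calling fit() again would discard
# the training-split statistics stored in the artifact.
out = reloaded.transform(X_new)
print(out)
```

Loading the artifact with the same scikit-learn version that wrote it avoids compatibility warnings when unpickling.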
