# 00 — Setup and Data Loading  
Hotel Booking Demand (Cancellation Prediction)

## Notebook purpose
- Load the dataset from a standard location
- Apply minimal, leakage-safe cleaning (exact duplicate removal)
- Save dataset snapshots and summary tables to `artifacts/` (overwritten on each run)
- Optionally create a shared, deduplicated baseline dataset in `data/processed/`

## Inputs
- Preferred: `data/raw/hotel_bookings.csv`

## Outputs (overwritten on each run)
- `artifacts/data/summary.json`
- `artifacts/data/df_head.csv`
- `artifacts/data/missing_top20.csv`
- `artifacts/data/target_distribution.csv`
- `artifacts/reports/environment.json`
- `artifacts/reports/environment.txt`
- `artifacts/reports/run_metadata.json`
- `artifacts/reports/setup_notes.txt`
- Optional shared baseline: `data/processed/hotel_bookings_dedup.csv`


In [11]:
# Repository bootstrap (fixes ModuleNotFoundError: 'src')
# The repository root is resolved quickly using Git when available.
# A bounded parent-directory scan is used as a fallback.

import os
import sys
import subprocess
from pathlib import Path

def _find_repo_root(max_levels: int = 25) -> Path:
    # Fast path: Git repository root (works when the notebook is executed inside the repo)
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--show-toplevel"],
            stderr=subprocess.DEVNULL,
            text=True,
        ).strip()
        p = Path(out)
        if (p / "src").is_dir():
            return p
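    except Exception:
        # Git is not installed or the notebook runs outside a Git checkout:
        # fall through to the parent-directory scan below.
        pass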

    # Fallback: bounded parent scan (prevents long scans on unusual paths)
    cwd = Path.cwd()
    candidates = [cwd] + list(cwd.parents)
    for p in candidates[:max_levels]:
        if (p / "src").is_dir():
            return p

    raise FileNotFoundError(
        "Folder 'src' was not found within the parent directories. "
        "Open the repository root folder in VS Code and rerun the notebook."
    )

root = _find_repo_root(max_levels=25)

os.chdir(root)
if str(root) not in sys.path:
    sys.path.insert(0, str(root))

print("Working directory:", Path.cwd())
print("Python path entry added:", root)


Working directory: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment
Python path entry added: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment


## Imports and artifact folder initialization

`artifacts/` is the fixed output location for this project. Files are overwritten on each run, so the folder always holds the outputs of the most recent run.
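
`ensure_artifact_dirs` is defined in `src/io_utils.py` and is not shown in this notebook. The block below is a minimal sketch of what such a helper might look like, assuming it only creates `data/` and `reports/` subfolders and returns a dict keyed by `base`, `data`, and `reports` (the keys used in later cells). The name `ensure_artifact_dirs_sketch` is hypothetical, and the real helper may do more.

```python
import tempfile
from pathlib import Path


def ensure_artifact_dirs_sketch(base="artifacts"):
    """Hypothetical stand-in for src.io_utils.ensure_artifact_dirs.

    Creates base/data and base/reports if they are missing and returns
    the three paths by name.
    """
    base_path = Path(base)
    dirs = {
        "base": base_path,
        "data": base_path / "data",
        "reports": base_path / "reports",
    }
    for path in dirs.values():
        # exist_ok=True keeps the call safe to repeat on every run.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demonstrate in a throwaway directory so the real artifacts/ folder is untouched.
with tempfile.TemporaryDirectory() as tmp:
    art = ensure_artifact_dirs_sketch(Path(tmp) / "artifacts")
    out = art["data"] / "example.txt"
    out.write_text("first run")
    out.write_text("second run")  # same path, so the earlier content is replaced
    final_content = out.read_text()
    created = sorted(p.name for p in art["base"].iterdir())
```

Writing to the same fixed path twice leaves only the second result, which is the overwrite behavior described above.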


In [12]:
import platform
import pandas as pd

from src.config import PROJECT_NAME, RANDOM_STATE, TARGET_COL, DEFAULT_DATA_PATH
from src.data_loader import load_hotel_bookings, basic_train_ready_checks, summarize_dataframe
from src.io_utils import ensure_artifact_dirs, save_json, save_text, save_dataframe, save_run_metadata

ART = ensure_artifact_dirs("artifacts")

meta_path = save_run_metadata(
    {
        "project": PROJECT_NAME,
        "random_state": RANDOM_STATE,
        "target_col": TARGET_COL,
        "notebook": "00_setup_data.ipynb",
        "python_version": sys.version,
        "platform": platform.platform(),
    },
    base_dir="artifacts",
    repo_root=".",
)

print("Metadata file:", meta_path.resolve())
print("Artifacts base:", ART["base"].resolve())


Metadata file: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\reports\run_metadata.json
Artifacts base: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts


## Dataset location

Preferred location:
- `data/raw/hotel_bookings.csv`

Fallback locations, checked in this order (useful for quick tests). A file found at a fallback location is copied to the preferred location:
- `hotel_bookings.csv`
- `../hotel_bookings.csv`
- `/mnt/data/hotel_bookings.csv`


In [13]:
from shutil import copy2
from pathlib import Path

preferred = Path(DEFAULT_DATA_PATH)
candidates = [
    preferred,
    Path("hotel_bookings.csv"),
    Path("../hotel_bookings.csv"),
    Path("/mnt/data/hotel_bookings.csv"),
]

dataset_path = None
for p in candidates:
    if p.exists():
        dataset_path = p
        break

if dataset_path is None:
    raise FileNotFoundError(
        "Dataset file not found. Place the CSV at: data/raw/hotel_bookings.csv"
    )

print("Dataset path:", dataset_path.resolve())

# Copy to preferred location for consistent relative paths across notebooks
preferred.parent.mkdir(parents=True, exist_ok=True)
if dataset_path.resolve() != preferred.resolve():
    copy2(dataset_path, preferred)
    dataset_path = preferred
    print("Copied dataset to:", preferred.resolve())


Dataset path: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\data\raw\hotel_bookings.csv


## Dataset loading (leakage-safe)

Operations performed:
- read CSV
- remove exact duplicate rows
- validate target column presence and binary format

The cell displays a preview of the loaded dataframe and its shape. Summary tables are generated and saved in the next section.
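
`load_hotel_bookings` and `basic_train_ready_checks` are defined in `src/data_loader.py` and are not shown here. The three operations listed above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the function name `load_and_check_sketch` is hypothetical, the target column is assumed to be `is_canceled`, and the index reset after deduplication is a choice made for this sketch.

```python
import io

import pandas as pd


def load_and_check_sketch(csv_source, target_col="is_canceled"):
    """Hypothetical sketch: read a CSV, drop exact duplicates, validate the target."""
    df = pd.read_csv(csv_source)

    # Exact-duplicate removal compares whole rows only; it fits no statistics
    # and never looks at the target in isolation.
    n_before = len(df)
    df = df.drop_duplicates().reset_index(drop=True)
    n_dropped = n_before - len(df)

    if target_col not in df.columns:
        raise KeyError(f"Target column '{target_col}' is missing.")

    # Binary format: every non-missing target value must be 0 or 1.
    if not df[target_col].dropna().isin([0, 1]).all():
        raise ValueError(f"Target column '{target_col}' is not binary (0/1).")

    return df, n_dropped


# Toy data with one exact duplicate row (made-up values, for illustration only).
toy_csv = io.StringIO(
    "hotel,lead_time,is_canceled\n"
    "Resort Hotel,10,0\n"
    "Resort Hotel,10,0\n"
    "City Hotel,3,1\n"
)
toy_df, toy_dropped = load_and_check_sketch(toy_csv)
print(f"Dropped duplicates: {toy_dropped} | shape: {toy_df.shape}")
```

Removing exact duplicates before any train/test split also keeps identical rows from landing on both sides of the split.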


In [14]:
df = load_hotel_bookings(dataset_path, drop_duplicates=True, verbose=True)
basic_train_ready_checks(df, target_col=TARGET_COL)

display(df.head())
display(pd.DataFrame({"rows": [df.shape[0]], "columns": [df.shape[1]]}))


[data_loader] Dropped duplicates: 31,994 rows
[data_loader] Loaded shape: (87396, 32)
[data_loader] Columns: 32


| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit |  |  | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit |  |  | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit |  |  | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 |  | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 |  | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |


| | rows | columns |
|---|---|---|
| 0 | 87396 | 32 |


## Saved snapshots and summary tables

The following outputs are saved to `artifacts/data/`. The target distribution and the missing-value table are also displayed inline:
- dataset summary (`summary.json`)
- the first 200 rows as a head snapshot (`df_head.csv`)
- the 20 columns with the most missing values (`missing_top20.csv`)
- target distribution with counts and rates (`target_distribution.csv`)
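
`summarize_dataframe` is defined in `src/data_loader.py`, and the contents of `summary.json` are not shown in this notebook. As a rough illustration only, a summary helper of this kind might collect a few JSON-safe fields. The function name `summarize_sketch` and every field name below are assumptions, not the project's actual schema.

```python
import json

import pandas as pd


def summarize_sketch(df, target_col):
    """Hypothetical sketch of a JSON-serializable dataframe summary."""
    summary = {
        # Cast NumPy scalars to built-in types so json.dumps accepts them.
        "n_rows": int(df.shape[0]),
        "n_cols": int(df.shape[1]),
        "n_missing_cells": int(df.isna().sum().sum()),
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }
    if target_col in df.columns:
        # For a 0/1 target, the mean is the positive-class rate.
        summary["target_rate"] = float(df[target_col].mean())
    return summary


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({"lead_time": [10, 3, None, 7], "is_canceled": [0, 1, 0, 0]})
toy_summary = summarize_sketch(toy, "is_canceled")
toy_json = json.dumps(toy_summary, indent=2)
print(toy_json)
```

The explicit `int(...)` and `float(...)` casts matter because NumPy integer scalars are not accepted by the standard `json` module.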


In [15]:
summary = summarize_dataframe(df, target_col=TARGET_COL)
save_json(summary, ART["data"] / "summary.json")

save_dataframe(df.head(200), ART["data"] / "df_head.csv", index=False)

missing = (
    df.isna().sum()
      .sort_values(ascending=False)
      .reset_index()
)
missing.columns = ["column", "missing_count"]
save_dataframe(missing.head(20), ART["data"] / "missing_top20.csv", index=False)

target_dist = (
    df[TARGET_COL].value_counts()
      .rename_axis("label")
      .reset_index(name="count")
)
target_dist["rate"] = target_dist["count"] / target_dist["count"].sum()
save_dataframe(target_dist, ART["data"] / "target_distribution.csv", index=False)

print("Saved:", (ART["data"] / "summary.json").resolve())
print("Saved:", (ART["data"] / "df_head.csv").resolve())
print("Saved:", (ART["data"] / "missing_top20.csv").resolve())
print("Saved:", (ART["data"] / "target_distribution.csv").resolve())

display(target_dist)
display(missing.head(20))


Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\summary.json
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\df_head.csv
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\missing_top20.csv
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\data\target_distribution.csv


| | label | count | rate |
|---|---|---|---|
| 0 | 0 | 63371 | 0.725102 |
| 1 | 1 | 24025 | 0.274898 |


| | column | missing_count |
|---|---|---|
| 0 | company | 82137 |
| 1 | agent | 12193 |
| 2 | country | 452 |
| 3 | children | 4 |
| 4 | arrival_date_month | 0 |
| 5 | arrival_date_week_number | 0 |
| 6 | hotel | 0 |
| 7 | is_canceled | 0 |
| 8 | stays_in_weekend_nights | 0 |
| 9 | arrival_date_day_of_month | 0 |


## Environment information (reproducibility)

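The Python version and the versions of pandas, NumPy, scikit-learn, and Matplotlib are saved to `artifacts/reports/environment.json` and `artifacts/reports/environment.txt`, then displayed.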
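
As an alternative to reading `__version__` attributes, installed versions can also be looked up through the standard library's `importlib.metadata`, which does not require importing each package. The sketch below is not what this notebook does; the helper name `installed_versions_sketch` is hypothetical. Note that the lookup uses distribution names, so scikit-learn is queried as `scikit-learn`, not `sklearn`.

```python
from importlib import metadata


def installed_versions_sketch(packages):
    """Hypothetical helper: map distribution names to installed versions.

    Packages that are not installed are recorded as None instead of raising.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions_sketch(
    [
        "pandas",
        "numpy",
        "scikit-learn",
        "matplotlib",
        "surely-not-installed-pkg-xyz",  # deliberately missing, to show the None case
    ]
)
for name, version in versions.items():
    print(f"{name}: {version}")
```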


In [16]:
import numpy as np
import sklearn
import matplotlib

env_info = {
    "python": sys.version,
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "sklearn": sklearn.__version__,
    "matplotlib": matplotlib.__version__,
}
save_json(env_info, ART["reports"] / "environment.json")
save_text(
    "\n".join([f"{k}: {v}" for k, v in env_info.items()]),
    ART["reports"] / "environment.txt",
)

print("Saved:", (ART["reports"] / "environment.json").resolve())
display(pd.DataFrame({"package": list(env_info.keys()), "version": list(env_info.values())}))


Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\reports\environment.json


| | package | version |
|---|---|---|
| 0 | python | 3.13.1 (tags/v3.13.1:0671451, Dec 3 2024, 19:... |
| 1 | pandas | 3.0.0 |
| 2 | numpy | 2.4.2 |
| 3 | sklearn | 1.8.0 |
| 4 | matplotlib | 3.10.8 |


## Optional shared baseline dataset

When run, this cell writes the deduplicated dataset to `data/processed/hotel_bookings_dedup.csv`.
Sharing this file lets all team members train on the same rows.
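
A later notebook could guard against a stale or partial copy of the shared file by checking the row count on load. The sketch below assumes a hypothetical helper, `load_baseline_sketch`, and runs on a toy file in a temporary directory. For the run recorded in this notebook, the expected row count would be 87,396.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_baseline_sketch(path, expected_rows=None):
    """Hypothetical helper: load the shared baseline CSV and verify its size."""
    df = pd.read_csv(path)
    if expected_rows is not None and len(df) != expected_rows:
        raise ValueError(
            f"Expected {expected_rows:,} rows in {path}, found {len(df):,}."
        )
    return df


# Toy stand-in for data/processed/hotel_bookings_dedup.csv (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "hotel_bookings_dedup.csv"
    pd.DataFrame({"lead_time": [10, 3], "is_canceled": [0, 1]}).to_csv(
        toy_path, index=False
    )
    baseline = load_baseline_sketch(toy_path, expected_rows=2)

    # A mismatched count raises instead of silently training on different rows.
    try:
        load_baseline_sketch(toy_path, expected_rows=3)
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```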


In [17]:
processed_out = Path("data/processed/hotel_bookings_dedup.csv")
processed_out.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(processed_out, index=False)

save_text(
    f"Deduplicated dataset saved to: {processed_out}\nRows: {df.shape[0]:,} | Cols: {df.shape[1]}",
    ART["reports"] / "setup_notes.txt",
)

print("Saved:", processed_out.resolve())
print("Saved:", (ART["reports"] / "setup_notes.txt").resolve())


Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\data\processed\hotel_bookings_dedup.csv
Saved: D:\SLIIT\Y4S2\IT4060 - Machine Learning\Assignment\repo\Machine-Learning-Assignment\artifacts\reports\setup_notes.txt


## Next notebook

Proceed to `01_eda_dataset_understanding.ipynb` for report-ready exploratory analysis and figures.
