# SLDCE – Notebook 01: Config-Driven Data Loading
### This notebook loads and standardizes any dataset using configuration only.


## Imports & Project Root
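✔ Notebook can import from src/ and read config/ and data/
✔ Project root resolved once from the notebook's working directory (assumed to sit one level below the root), so later cells never repeat ../
✔ pathlib handles path separators on any OS
✔ Sanity check prints immediately if the folder structure is wrong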

In [12]:
# ========== BASIC IMPORTS ==========
import sys
from pathlib import Path
import yaml
import pandas as pd

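# ========== ADD PROJECT ROOT TO PATH ==========
# Assumes the notebook runs from a folder directly under the project root.
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:  # avoid duplicates when the cell is re-run
    sys.path.append(str(PROJECT_ROOT))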

# ========== SANITY CHECK ==========
print("Project root:", PROJECT_ROOT)
print("Config exists:", (PROJECT_ROOT / "config/default.yaml").exists())
print("Raw data exists:", (PROJECT_ROOT / "data/raw").exists())


Project root: C:\Project_Final_Year
Config exists: True
Raw data exists: True
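
The cell above only works when the notebook is started from a folder one level below the project root. A minimal sketch of a more defensive alternative walks up from the working directory until it finds a marker file. `find_project_root` and the choice of `config/default.yaml` as the marker are assumptions made for illustration, not part of the project code.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "config/default.yaml") -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    Hypothetical helper: the default marker is an assumption based on the
    sanity check in the cell above.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} found at or above {start}")
```

With this in place, `PROJECT_ROOT = find_project_root(Path.cwd())` would give the same root regardless of how deeply the notebook is nested.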


## Load Configuration
✔ Loads only from config (no hardcoding)
✔ Makes pipeline dataset-agnostic
✔ Central control for dataset + target
✔ Safe for any future dataset

In [13]:
# ========== LOAD CONFIG ==========
with open(PROJECT_ROOT / "config/default.yaml", "r") as f:
    config = yaml.safe_load(f)

# ========== READ DATASET CONFIG ==========
DATA_PATH = config["dataset"]["path"]
TARGET = config["dataset"]["target_column"]

print("Dataset path:", DATA_PATH)
print("Target column:", TARGET)


Dataset path: data/raw/adult.csv
Target column: income
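
Later cells index into `config` directly, so a missing key would surface as a bare `KeyError` far from its cause. A minimal sketch of an up-front check follows. `validate_config` and `REQUIRED_KEYS` are hypothetical names, and the required keys are simply the four this notebook reads.

```python
# Keys this notebook reads from the config; extend as the pipeline grows.
REQUIRED_KEYS = {
    "dataset": ["path", "target_column"],
    "preprocessing": ["test_size", "random_state"],
}


def validate_config(config) -> None:
    """Raise a ValueError that lists every required key missing from `config`."""
    if not isinstance(config, dict):
        # yaml.safe_load returns None for an empty file.
        raise ValueError("Config is empty or not a mapping")
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        block = config.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    if missing:
        raise ValueError("Missing config keys: " + ", ".join(missing))
```

Calling `validate_config(config)` right after `yaml.safe_load` keeps the failure next to the file that caused it.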


## Load Dataset
✔ Works for CSV / XLS / XLSX
✔ Uses project-root–anchored paths
✔ No hardcoded filenames
✔ Fails early if file is missing (good!)

In [14]:
# ========== LOAD DATASET ==========
from src.data.loader import load_dataset

df = load_dataset(PROJECT_ROOT / DATA_PATH)

# ========== BASIC VIEW ==========
print("Dataset loaded successfully")
print("Shape:", df.shape)

df.head()


Dataset loaded successfully
Shape: (32561, 15)


,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
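
The source of `load_dataset` is not shown in this notebook. A minimal sketch of a loader with the behaviour listed above (dispatch on file extension, early failure on a missing file) might look like the following. The name `load_dataset_sketch` and its exact behaviour are assumptions, not the project's implementation.

```python
from pathlib import Path

import pandas as pd


def load_dataset_sketch(path) -> pd.DataFrame:
    """Load a CSV or Excel file into a DataFrame, failing early on bad input.

    Hypothetical stand-in for src.data.loader.load_dataset.
    """
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Dataset not found: {path}")

    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xls", ".xlsx"}:
        # read_excel needs an Excel engine (for example openpyxl) to be installed.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Raising on an unknown extension, rather than guessing a format, keeps the "fails early" property for files the loader was never meant to read.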


## Dataset Sanity Checks
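✔ Lists all column names
✔ Counts NaN values per column (string placeholders such as ? are not counted, so the zeros below do not rule out missing data)
✔ Verifies that the target column exists and shows its class balance
✔ Prevents silent bugs later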

In [15]:
# ========== BASIC DATASET CHECKS ==========
print("Columns:")
print(df.columns.tolist())

print("\nMissing values per column:")
print(df.isnull().sum())

print("\nTarget distribution:")
print(df[TARGET].value_counts())


Columns:
['age', 'workclass', 'fnlwgt', 'education', 'education.num', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'capital.gain', 'capital.loss', 'hours.per.week', 'native.country', 'income']

Missing values per column:
age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64

Target distribution:
income
<=50K    24720
>50K      7841
Name: count, dtype: int64
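
`isnull()` reports zero missing values, yet the preview rows show `?` in `workclass` and `occupation`. `isnull()` only counts NaN/None, so string placeholders pass unnoticed. A small sketch for counting them, assuming `?` is the only placeholder in use (`count_placeholders` is a hypothetical helper):

```python
import pandas as pd


def count_placeholders(df: pd.DataFrame, placeholder: str = "?") -> pd.Series:
    """Count, per column, the cells whose stripped text equals `placeholder`."""
    # Convert every column to text so numeric columns can be compared too,
    # and strip whitespace so " ?" is counted as well.
    as_text = df.astype(str).apply(lambda col: col.str.strip())
    return (as_text == placeholder).sum()
```

Running `count_placeholders(df)` next to `df.isnull().sum()` shows both kinds of gap. If the placeholders should become real missing values, passing `na_values=["?"]` to `pd.read_csv` is one option, in which case the columns would need explicit handling before preprocessing.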


## Feature Type Detection
✔ Does not assume target type
✔ Works for any dataset
✔ Safe even if target is categorical
✔ No .drop() → no KeyError

In [16]:
# ========== CLEAN COLUMN NAMES ==========
df.columns = df.columns.str.strip()

# ========== IDENTIFY FEATURE TYPES ==========
# "number" covers every numeric dtype (int32, float32, ...), not just int64/float64
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# ========== REMOVE TARGET SAFELY ==========
if TARGET in numeric_cols:
    numeric_cols.remove(TARGET)

if TARGET in categorical_cols:
    categorical_cols.remove(TARGET)

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)


Numeric columns: ['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']
Categorical columns: ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']


## Preprocessing Pipeline
✔ Works for any dataset
✔ Handles unseen categories safely
✔ No data leakage (fit later, not here)
✔ Fully reusable for future datasets

In [18]:
# ========== PREPROCESSING PIPELINE ==========
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numeric features
numeric_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler())
])

# Categorical features
try:
    # sklearn >= 1.2
    categorical_pipeline = Pipeline(steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
    ])
except TypeError:
    # sklearn < 1.2
    categorical_pipeline = Pipeline(steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore", sparse=False))
    ])

# Combine pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, numeric_cols),
        ("cat", categorical_pipeline, categorical_cols)
    ],
    remainder="drop"
)

print("Preprocessing pipeline created successfully")


Preprocessing pipeline created successfully
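
To make the "unseen categories" point concrete, here is a small self-contained illustration on toy data (the column values are invented for the example). With `handle_unknown="ignore"`, a category that was absent at fit time encodes to an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
unseen = pd.DataFrame({"workclass": ["Never-worked"]})  # not present in train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default sparse result is converted with toarray(), so the example does not
# depend on the sparse/sparse_output keyword that differs across sklearn versions.
encoded = encoder.transform(unseen).toarray()
print(encoded)  # one row, one column per category seen during fit, all zeros
```

The trade-off is that an unseen category becomes indistinguishable from "none of the known categories", which is usually acceptable for rare values but worth remembering when a test set drifts.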


## Train / Test Split
✔ Stratified split (important for income imbalance)
✔ Controlled by config (dataset-agnostic)
✔ No preprocessing applied yet (no leakage)

In [19]:
# ========== TRAIN / TEST SPLIT ==========
from src.data.splitter import split_data

X_train, X_test, y_train, y_test = split_data(
    df,
    TARGET,
    test_size=config["preprocessing"]["test_size"],
    random_state=config["preprocessing"]["random_state"]
)

print("Train size:", X_train.shape)
print("Test size:", X_test.shape)


Train size: (26048, 14)
Test size: (6513, 14)
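
The source of `split_data` is not shown here. Assuming it wraps scikit-learn's `train_test_split` and stratifies on the target, a minimal sketch could look like this. The name `split_data_sketch` and the default values are assumptions, not the project's implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df: pd.DataFrame, target: str, test_size: float = 0.2, random_state: int = 42):
    """Split features and target into train/test sets, preserving class ratios.

    Hypothetical stand-in for src.data.splitter.split_data.
    """
    X = df.drop(columns=[target])
    y = df[target]
    # stratify=y keeps the class proportions the same in both splits.
    return train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
```

Stratifying matters here because roughly three quarters of the rows fall in the `<=50K` class; an unstratified split could shift that ratio between train and test.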


## Fit Preprocessor & Save Processed Data
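✔ Preprocessor fitted on the training split only, then applied to the test split
✔ Processed features and labels combined into one DataFrame per split
✔ Output folder data/processed created if it does not exist
✔ Same steps work for any dataset the config points to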

In [20]:
# ========== FIT PREPROCESSOR ==========
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# ========== CONVERT TO DATAFRAME ==========
train_df = pd.DataFrame(X_train_processed)
train_df["label"] = y_train.values

test_df = pd.DataFrame(X_test_processed)
test_df["label"] = y_test.values

# ========== SAVE PROCESSED DATA ==========
processed_path = PROJECT_ROOT / "data/processed"
processed_path.mkdir(parents=True, exist_ok=True)

train_df.to_csv(processed_path / "dataset_processed_train.csv", index=False)
test_df.to_csv(processed_path / "dataset_processed_test.csv", index=False)

print("Processed data saved successfully")
print("Train processed shape:", train_df.shape)
print("Test processed shape:", test_df.shape)


Processed data saved successfully
Train processed shape: (26048, 108)
Test processed shape: (6513, 108)
