# Feature Engineering & Data Engineering
This notebook performs **feature engineering** on the cleaned dataset and encodes the result for modeling.

✅ Input: `data/cleaned_students.csv`
✅ Output: `data/featured_students.csv`

Goals:
- Create meaningful derived features
- Correctly encode ordinal & nominal variables
- Prepare a modeling-ready dataset (no leakage)


## 0) Setup

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

RANDOM_STATE = 42
print("✅ Setup complete")

## 1) Load Cleaned Dataset

In [None]:
DATA_PATH = Path("data/cleaned_students.csv")
OUT_PATH = Path("data/featured_students.csv")

if not DATA_PATH.exists():
    raise FileNotFoundError("❌ Run data_cleaning.ipynb first")

df = pd.read_csv(DATA_PATH)
print("Loaded:", DATA_PATH, "| Shape:", df.shape)
df.head()
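
The later cells read several columns by name, so a missing column only surfaces as a `KeyError` partway through the notebook. A minimal sketch of an early check, assuming the column list below (taken from the feature-creation cell and `TARGET_COL`); `REQUIRED_COLS` and `missing_columns` are hypothetical names, and the toy frame stands in for the real `df`:

```python
import pandas as pd

# Columns read by the feature-creation cell, plus the target (assumed list).
REQUIRED_COLS = [
    "study_hours", "class_attendance", "sleep_hours", "sleep_quality",
    "internet_access", "study_method", "exam_score_class",
]

def missing_columns(frame, required):
    """Return the required column names that are absent from `frame`."""
    return [c for c in required if c not in frame.columns]

# Toy frame standing in for the real `df`; it lacks most required columns.
toy = pd.DataFrame({"study_hours": [2.0], "sleep_hours": [7.5]})
missing = missing_columns(toy, REQUIRED_COLS)
print("Missing columns:", missing)
```

In the notebook itself, the same helper could be called on `df` and followed by a `raise ValueError(...)` when the returned list is not empty.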

## 2) Define Target & Drop Non-Model Columns

In [None]:
TARGET_COL = "exam_score_class"

DROP_COLS = ["student_id"]  # ID columns should not enter modeling
df_model = df.drop(columns=[c for c in DROP_COLS if c in df.columns])

print("Remaining columns:", df_model.columns.tolist())

## 3) Feature Creation (Domain-Driven)
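Each feature below is computed row by row from input columns only and never reads the target, so none of them can leak label information. They are intended to give the model more direct signal than the raw columns.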

In [None]:
df_fe = df_model.copy()

# Study efficiency: combines hours & attendance
df_fe["study_efficiency"] = df_fe["study_hours"] * (df_fe["class_attendance"] / 100)

# Sleep deficit (ideal sleep assumed = 8 hours)
df_fe["sleep_deficit"] = np.maximum(0, 8 - df_fe["sleep_hours"])

# Good sleep flag
df_fe["good_sleep_flag"] = (df_fe["sleep_quality"] == "good").astype(int)

# Internet + study method interaction
df_fe["online_learning_flag"] = (
    (df_fe["internet_access"] == "yes") &
    (df_fe["study_method"].isin(["online videos", "mixed"]))
).astype(int)

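# Attendance bucket
# pd.cut intervals are right-closed, so include_lowest=True is needed
# to keep an attendance of exactly 0 from becoming NaN
df_fe["attendance_bucket"] = pd.cut(
    df_fe["class_attendance"],
    bins=[0, 60, 80, 100],
    labels=["low_attendance", "medium_attendance", "high_attendance"],
    include_lowest=True
)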

df_fe.head()
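
`pd.cut` builds right-closed intervals by default, so with `bins=[0, 60, 80, 100]` the buckets are `(0, 60]`, `(60, 80]`, and `(80, 100]`, and a value of exactly `0` belongs to none of them unless `include_lowest=True` is passed. The sketch below shows the difference on a toy series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

attendance = pd.Series([0, 60, 61, 80, 100])
bins = [0, 60, 80, 100]
labels = ["low_attendance", "medium_attendance", "high_attendance"]

# Default behaviour: the left edge of the first bin is excluded.
default_cut = pd.cut(attendance, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(attendance, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
# [True, False, False, False, False]
print(inclusive_cut.astype(str).tolist())
# ['low_attendance', 'low_attendance', 'medium_attendance', 'medium_attendance', 'high_attendance']
```

Values that sit on a boundary (60, 80) fall into the lower bucket in both cases.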

## 4) Identify Feature Types

In [None]:
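# "number" also catches the 0/1 flags on platforms where int maps to int32
numeric_features = df_fe.select_dtypes(include="number").columns.tolist()
numeric_features = [c for c in numeric_features if c != TARGET_COL]

# Exclude the target here too: if it is stored as strings it would otherwise
# end up in nominal_features, which the ColumnTransformer cannot find in X
categorical_features = df_fe.select_dtypes(include=["object", "category"]).columns.tolist()
categorical_features = [c for c in categorical_features if c != TARGET_COL]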

# Ordinal features with meaningful order
ordinal_features = [
    "sleep_quality",
    "facility_rating",
    "exam_difficulty"
]

# Remove ordinal from nominal list
nominal_features = [c for c in categorical_features if c not in ordinal_features]

print("Numeric:", numeric_features)
print("Ordinal:", ordinal_features)
print("Nominal:", nominal_features)

## 5) Define Encoders

In [None]:
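# One list per ordinal feature, in the same order as ordinal_features.
# Each list runs from lowest to highest level. The level names are assumed
# to match the cleaned data exactly; with the default handle_unknown="error",
# any value not listed here raises a ValueError at fit time.
ordinal_categories = [
    ["poor", "average", "good"],      # sleep_quality
    ["low", "medium", "high"],         # facility_rating
    ["easy", "moderate", "hard"]       # exam_difficulty
]

ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
# handle_unknown="ignore" encodes categories unseen at fit time as all zeros
onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")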
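
To make the effect of an explicit category order concrete, here is a small sketch on a toy `sleep_quality` column. It also shows one possible way to tolerate unexpected levels, using `handle_unknown="use_encoded_value"` with `unknown_value=-1`; that variant is an assumption for illustration, not what this notebook configures, and the `"excellent"` level is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = ["poor", "average", "good"]
toy = pd.DataFrame({"sleep_quality": ["good", "poor", "average"]})

# Explicit categories fix the mapping: poor -> 0, average -> 1, good -> 2.
strict = OrdinalEncoder(categories=[levels])
codes = strict.fit_transform(toy).ravel().tolist()
print(codes)  # [2.0, 0.0, 1.0]

# Tolerant variant: unseen levels are mapped to -1 instead of raising.
tolerant = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
tolerant.fit(toy)
unseen = pd.DataFrame({"sleep_quality": ["excellent"]})  # hypothetical level
unknown_code = tolerant.transform(unseen)[0, 0]
print(unknown_code)  # -1.0
```

Without the explicit list, `OrdinalEncoder` sorts categories alphabetically, which would place `"average"` below `"good"` and `"poor"` and lose the intended order.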

## 6) Apply Encoding

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("ord", ordinal_encoder, ordinal_features),
        ("nom", onehot_encoder, nominal_features)
    ]
)

X = df_fe.drop(columns=[TARGET_COL])
y = df_fe[TARGET_COL]

X_processed = preprocessor.fit_transform(X)

# Feature names
num_names = numeric_features
ord_names = ordinal_features
nom_names = preprocessor.named_transformers_["nom"].get_feature_names_out(nominal_features)

feature_names = num_names + ord_names + list(nom_names)

df_final = pd.DataFrame(X_processed, columns=feature_names)
df_final[TARGET_COL] = y.values

print("Final feature shape:", df_final.shape)
df_final.head()

## 7) Save Engineered Dataset

In [None]:
df_final.to_csv(OUT_PATH, index=False)
print("✅ Feature-engineered dataset saved to:", OUT_PATH)

## 8) Next Step

Next notebook: **model.ipynb**

It will:
- Train multiple classifiers
- Perform cross-validation
- Compare models
- Select best model & export PKL
