# Feature Engineering & Selection — Corrected Notebook with Explanations

This notebook demonstrates **feature engineering and selection** on the Titanic dataset.  
We will cover:  
1. Data cleaning & imputation  
2. Encoding categorical variables  
3. Feature scaling  
4. Train/test split  
5. Filter, Wrapper, and Embedded feature selection methods  
6. Final pipeline with preprocessing + feature selection + model

Each code cell is preceded by a Markdown explanation.

## Cell 1 — Imports and dataset load
We import libraries and load the Titanic dataset from seaborn.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, RFE, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load Titanic dataset
df = sns.load_dataset("titanic")
df.head()

## Cell 2 — Data cleaning & preprocessing
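We inspect missing values, impute `age` with the median and `embarked` with the most frequent value, and drop columns that are redundant or leak the target. The imputers in this cell are fit on every row, which is fine for exploration; for a leakage-free evaluation, imputation belongs inside a pipeline, as in Cell 9.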

In [None]:
print("Missing values before:")
print(df.isnull().sum())

# Imputers
imputer_num = SimpleImputer(strategy="median")
imputer_cat = SimpleImputer(strategy="most_frequent")

df[["age"]] = imputer_num.fit_transform(df[["age"]])
df[["embarked"]] = imputer_cat.fit_transform(df[["embarked"]])

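# Drop columns that are mostly missing (deck), duplicate other columns
# (embark_town, class, who, adult_male), or leak the target (alive)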
cols_to_drop = ["deck", "embark_town", "alive", "class", "who", "adult_male"]
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])

print("\nMissing values after imputation:")
print(df.isnull().sum())
df.head()
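
To make the leakage point concrete, here is a minimal sketch of imputation that learns its statistic from the training rows only. The `toy` frame and every variable name in it are hypothetical stand-ins for the Titanic `age` column, not part of the notebook's data flow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the Titanic "age" column
toy = pd.DataFrame({"age": [22.0, np.nan, 35.0, 54.0, np.nan, 2.0, 27.0, 14.0]})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Fit on the training rows only, then reuse the learned median on the test rows
train_only_imputer = SimpleImputer(strategy="median")
toy_train_filled = train_only_imputer.fit_transform(toy_train)
toy_test_filled = train_only_imputer.transform(toy_test)

print("Median learned from training rows:", train_only_imputer.statistics_[0])
```

The test rows never influence the median, so the held-out score reflects what the model would see on genuinely new data.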

## Cell 3 — Encoding categorical variables
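We encode the categorical features as integer codes. `LabelEncoder` assigns codes in alphabetical order, so the resulting numbers carry an arbitrary ordering; tree-based models tolerate this, but linear models read the codes as magnitudes. `LabelEncoder` is also designed for target labels, and `OrdinalEncoder` or `OneHotEncoder` are the usual choices for feature columns (Cell 9 uses one-hot encoding).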

In [None]:
categorical_cols = df.select_dtypes(include=['object','category']).columns.tolist()
print("Categorical columns:", categorical_cols)

for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))

print("Remaining object columns:", df.select_dtypes(include=['object']).columns.tolist())
df.head()

## Cell 4 — Feature scaling
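We standardize the numeric features `age` and `fare`. We also keep a min-max scaled copy (`df_mm`) in case we want to use the `chi2` score function, which requires non-negative inputs and therefore cannot be applied to standardized data.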

In [None]:
num_cols = ['age','fare']

scaler_std = StandardScaler()
scaler_mm = MinMaxScaler()

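# Standardize in place (zero mean, unit variance)
df[num_cols] = scaler_std.fit_transform(df[num_cols])
# Min-max copy with age and fare rescaled to [0, 1];
# min-max scaling gives the same result whether or not the input was standardized first
df_mm = df.copy()
df_mm[num_cols] = scaler_mm.fit_transform(df_mm[num_cols])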

df[num_cols].describe()
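
The notebook does not actually run `chi2`, so the following is a hedged sketch of why the min-max copy matters. It uses synthetic data (`X_toy`, `y_toy` are hypothetical) and shows that `chi2` rejects standardized input but accepts min-max scaled input. Note that `chi2` is designed for counts and frequencies, so applying it to rescaled continuous features is a heuristic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)
# Column 0 depends on the label; columns 1 and 2 are pure noise
X_toy = np.column_stack([
    y_toy * 2.0 + rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

X_std = StandardScaler().fit_transform(X_toy)  # contains negative values
X_mm = MinMaxScaler().fit_transform(X_toy)     # all values in [0, 1]

# chi2 raises a ValueError when any input value is negative
try:
    chi2(X_std, y_toy)
    std_ok = True
except ValueError as err:
    std_ok = False
    print("chi2 on standardized data failed:", err)

chi2_selector = SelectKBest(score_func=chi2, k=1).fit(X_mm, y_toy)
print("chi2 scores:", chi2_selector.scores_)
print("Selected column mask:", chi2_selector.get_support())
```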

## Cell 5 — Train/test split
We separate features (X) and target (y), then split.

In [None]:
# "survived" is the target; the next line would fail without it, so no guard is needed
X = df.drop(columns=["survived"])
y = df["survived"]

non_numeric = X.select_dtypes(include=['object']).columns.tolist()
print("Non-numeric columns:", non_numeric)

# stratify=y keeps the survival rate the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_train.shape

## Cell 6 — Filter method (Mutual Information)
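We use `SelectKBest(mutual_info_classif)` to score each feature against the target and keep the five highest-scoring ones. A filter method like this scores features one at a time, without training the final model.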

In [None]:
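from functools import partial

# mutual_info_classif adds small random noise to continuous features,
# so we fix random_state to make the scores reproducible
mi_score = partial(mutual_info_classif, random_state=42)
selector = SelectKBest(score_func=mi_score, k=5)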
selector.fit(X_train, y_train)

selected_features = X_train.columns[selector.get_support()]
print("Top 5 features (Mutual Info):", list(selected_features))
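
`get_support()` only says which columns survive; the scores themselves are stored in the fitted selector's `scores_` attribute. The sketch below ranks two synthetic features by mutual information. The data and names (`X_toy`, `informative`, `noise`) are hypothetical and chosen so the expected ordering is obvious.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=300)
X_toy = pd.DataFrame({
    "informative": y_toy + rng.normal(scale=0.5, size=300),  # tracks the label
    "noise": rng.normal(size=300),                            # unrelated to the label
})

# Fix random_state so repeated runs give the same scores
toy_selector = SelectKBest(
    score_func=partial(mutual_info_classif, random_state=42), k=1
).fit(X_toy, y_toy)

# Attach column names to the raw scores and sort from most to least informative
mi_scores = pd.Series(toy_selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(mi_scores)
```

The same two lines that build `mi_scores` should work on the notebook's `selector` and `X_train` to show the full ranking rather than just the top five.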

## Cell 7 — Wrapper method (RFE)
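Recursive Feature Elimination (RFE) wraps a model, here Logistic Regression: it fits the model, drops the feature with the smallest coefficient magnitude, and repeats until five features remain.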

In [None]:
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=5)
rfe.fit(X_train, y_train)

rfe_features = X_train.columns[rfe.get_support()]
print("Top 5 features (RFE):", list(rfe_features))
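
Beyond the selected set, a fitted `RFE` exposes `ranking_`, which records the order of elimination: selected features get rank 1, and higher ranks were dropped earlier. The following sketch uses data from `make_classification`; the column names `f0` to `f5` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 6 features, 3 of which carry signal
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

toy_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
toy_rfe.fit(X_toy, y_toy)

# Rank 1 = kept; the largest rank was eliminated first
rfe_ranking = pd.Series(toy_rfe.ranking_, index=X_toy.columns).sort_values()
print(rfe_ranking)
```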

## Cell 8 — Embedded method (Random Forest)
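We fit a Random Forest and inspect its impurity-based feature importances, which the model computes as a by-product of training. This is what makes the method "embedded": selection information comes from the model fit itself.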

In [None]:
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh", figsize=(8,6))
plt.title("Feature Importances (Random Forest)")
plt.show()

importances.sort_values(ascending=False).head(10)
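
Cell 8 inspects importances but does not select anything. One way to turn them into a selection step is `SelectFromModel`, sketched below on synthetic data. The `"median"` threshold is an assumption made for illustration, not a setting taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, 3 of which carry signal
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Keep features whose importance is at least the median importance
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
embedded_selector.fit(X_toy, y_toy)
X_reduced = embedded_selector.transform(X_toy)

print("Kept feature mask:", embedded_selector.get_support())
print("Shape before/after:", X_toy.shape, X_reduced.shape)
```

Because `SelectFromModel` is a transformer, it can replace the `SelectKBest` step in a pipeline if an embedded selector is preferred.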

## Cell 9 — Full Pipeline (Best Practice)
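We combine preprocessing, feature selection, and the model into a single pipeline. During cross-validation every step is refit on the training folds only, so no statistics from the validation fold (medians, scaling parameters, feature scores) leak into training.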

In [None]:
num_cols = ['age','fare','sibsp','parch']
cat_cols = ['sex','embarked']

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_cols),
    ('cat', cat_pipeline, cat_cols)
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('select', SelectKBest(mutual_info_classif, k=8)),
    ('clf', RandomForestClassifier(random_state=42))
])

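# Start from the raw data: df was already imputed, encoded, and scaled on all rows
# in Cells 2-4, so passing it here would reintroduce the leakage the pipeline avoids
raw = sns.load_dataset("titanic")
scores = cross_val_score(pipe, raw[num_cols + cat_cols], raw['survived'], cv=5, scoring='roc_auc')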
print("CV ROC-AUC mean:", scores.mean(), "std:", scores.std())