<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/mthree-c422-dipti/Exercises/day-7/Data_Cleaning_Feature_Engineering/FE_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Cleaning and Feature Engineering Pipeline in Google Colab
Build a reusable data pipeline that cleans and engineers features on a real-world dataset. This exercise uses the Titanic passenger dataset.

**Objectives**
- Load and inspect the Titanic dataset.

- Identify and handle missing, inconsistent, and duplicate values.

- Create new informative features.

- Modularize your pipeline into functions for cleaning and feature engineering.

- Save the final prepared DataFrame for modeling.

In [1]:
# Step 1: Install and Import Libraries

!pip install pandas numpy -q

import pandas as pd
import numpy as np


In [2]:
# Step 2: Load the Titanic Dataset

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [11]:
df

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S
887,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S
888,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,23.4500,S
889,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C




In [3]:
# Step 3: Data Cleaning
# 1. Detect Missing Values and Data Types

print(df.info())
print(df.isnull().sum())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
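The counts above are easier to compare as a share of rows: Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 (about 20%). A minimal sketch of a helper that reports both, using a small toy frame in place of the Titanic data (the name `missing_summary` is illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame (hypothetical values) standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
report = missing_summary(toy)
print(report)
```

On the real data, `missing_summary(df)` would be called right after loading, before any columns are dropped.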

In [4]:
# 2. Drop Irrelevant Columns

df.drop(columns=["PassengerId","Ticket","Cabin"], inplace=True)


In [8]:
# Impute Missing Values

# Age: fill with median
df["Age"] = df["Age"].fillna(df["Age"].median())
# Embarked: fill with mode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Note: the results are assigned back to the columns on purpose.
# Calling fillna(..., inplace=True) on a column selection such as df["Age"]
# raises a chained-assignment FutureWarning in recent pandas releases.


In [9]:
df.sample(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
211,1,2,"Cameron, Miss. Clear Annie",female,35.0,0,0,21.0,S
835,1,1,"Compton, Miss. Sara Rebecca",female,39.0,1,1,83.1583,C
242,0,2,"Coleridge, Mr. Reginald Charles",male,29.0,0,0,10.5,S
861,0,2,"Giles, Mr. Frederick Edward",male,21.0,1,0,11.5,S
607,1,1,"Daniel, Mr. Robert Williams",male,27.0,0,0,30.5,S
879,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,83.1583,C
797,1,3,"Osman, Mrs. Mara",female,31.0,0,0,8.6833,S
187,1,1,"Romaine, Mr. Charles Hallace (""Mr C Rolmane"")",male,45.0,0,0,26.55,S
795,0,2,"Otter, Mr. Richard",male,39.0,0,0,13.0,S
589,0,3,"Murdlin, Mr. Joseph",male,28.0,0,0,8.05,S


In [12]:
# Handle Inconsistencies & Duplicates

# Keep only rows with non-negative fares, then drop exact duplicate rows.
# Chaining the two steps avoids calling an inplace method on a filtered slice.
df = df[df["Fare"] >= 0].drop_duplicates()
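Before removing rows, it can help to count how many would be affected. The following is a small sketch on a toy frame with hypothetical values, not output from the Titanic data:

```python
import pandas as pd

# Toy frame: one exact duplicate row and one negative fare
toy = pd.DataFrame({
    "Name": ["A", "B", "B", "C"],
    "Fare": [7.25, 8.05, 8.05, -1.0],
})

# Count the problem rows first so the effect of the cleaning step is visible
n_negative = int((toy["Fare"] < 0).sum())
n_duplicates = int(toy.duplicated().sum())
print(f"negative fares: {n_negative}, duplicate rows: {n_duplicates}")

# Apply the same filter-then-deduplicate logic as the cell above
cleaned = toy[toy["Fare"] >= 0].drop_duplicates().reset_index(drop=True)
print(cleaned)
```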


In [13]:
df.sample(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
262,0,1,"Taussig, Mr. Emil",male,52.0,1,1,79.65,S
27,0,1,"Fortune, Mr. Charles Alexander",male,19.0,3,2,263.0,S
276,0,3,"Lindblom, Miss. Augusta Charlotta",female,45.0,0,0,7.75,S
23,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,35.5,S
447,1,1,"Seward, Mr. Frederic Kimber",male,34.0,0,0,26.55,S
858,1,3,"Baclini, Mrs. Solomon (Latifa Qurban)",female,24.0,0,3,19.2583,C
102,0,1,"White, Mr. Richard Frasar",male,21.0,0,1,77.2875,S
20,0,2,"Fynney, Mr. Joseph J",male,35.0,0,0,26.0,S
866,1,2,"Duran y More, Miss. Asuncion",female,27.0,1,0,13.8583,C
687,0,3,"Dakic, Mr. Branko",male,19.0,0,0,10.1708,S


In [16]:
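# Step 4: Feature Engineering
# 1. Title Extraction from Name

# Capture the text between the comma and the first period,
# e.g. "Mr" in "Braund, Mr. Owen Harris"
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
# Group uncommon titles under a single "Rare" label.
# Note: the regex yields "the Countess", which does not match "Countess" in this list,
# so it survives as its own category (Title_the Countess in the encoded output).
df["Title"] = df["Title"].replace(["Lady","Countess","Capt","Col","Don","Dr","Major","Rev","Sir","Jonkheer","Dona"], "Rare")
df.head()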

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,Mr
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,Mrs
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,Miss
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,Mrs
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,Mr
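The encoded output later in this notebook contains columns such as `Title_Mlle`, `Title_Mme`, `Title_Ms`, and `Title_the Countess`, which suggests the rare-title list does not catch every variant. One possible way to tighten this is sketched below. The synonym mapping (Mlle and Ms to Miss, Mme to Mrs) and the choice of four common titles are assumptions made for illustration, and the sample names are strings in the same format as the `Name` column:

```python
import pandas as pd

# Sample strings in the same "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Sagesser, Mlle. Emma",
    "Montvila, Rev. Juozas",
])

# Same regex as the notebook; expand=False returns a Series
title = names.str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()

# Assumed mapping of spelling variants onto the common titles
synonyms = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
title = title.replace(synonyms)

# Anything outside the common set (including "the Countess") becomes "Rare"
common = {"Mr", "Mrs", "Miss", "Master"}
title = title.where(title.isin(common), "Rare")
print(title.tolist())
```

Using an allow-list of common titles instead of a deny-list of rare ones means unexpected variants fall into "Rare" automatically.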




In [17]:
# Family Size & IsAlone Flag

df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = np.where(df["FamilySize"] == 1, 1, 0)


In [18]:
# One-Hot Encoding Categorical Features

df_final = pd.get_dummies(df, columns=["Sex","Embarked","Title"], drop_first=True)


In [19]:
df_final.head()

Unnamed: 0,Survived,Pclass,Name,Age,SibSp,Parch,Fare,FamilySize,IsAlone,Sex_male,Embarked_Q,Embarked_S,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rare,Title_the Countess
0,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,2,0,True,False,True,False,False,False,True,False,False,False,False
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,2,0,False,False,False,False,False,False,False,True,False,False,False
2,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,1,1,False,False,True,True,False,False,False,False,False,False,False
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,2,0,False,False,True,False,False,False,False,True,False,False,False
4,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,1,1,True,False,True,False,False,False,True,False,False,False,False


In [20]:
# Step 5: Modularize the Pipeline into Functions

def clean_data(df):
    df = df.drop(columns=["PassengerId","Ticket","Cabin"], errors="ignore")
    df = df.copy()  # ensure we work on a fresh copy
    # Fill missing Age and Embarked by direct assignment
    median_age = df["Age"].median()
    df["Age"] = df["Age"].fillna(median_age)
    mode_embarked = df["Embarked"].mode()[0]
    df["Embarked"] = df["Embarked"].fillna(mode_embarked)
    df = df[df["Fare"] >= 0].drop_duplicates()
    return df

def engineer_features(df):
    df = df.copy()
    # Extract Title from Name and group uncommon titles as "Rare"
    df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.")
    rare_titles = ["Lady","Countess","Capt","Col","Don","Dr","Major",
                   "Rev","Sir","Jonkheer","Dona"]
    df["Title"] = df["Title"].replace(rare_titles, "Rare")
    # Family size and alone flag
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = np.where(df["FamilySize"] == 1, 1, 0)
    # Binning
    df["FareBin"] = pd.qcut(df["Fare"], 4, labels=False)
    df["AgeBin"]  = pd.cut(df["Age"], bins=[0,12,20,40,60,100], labels=False)
    # One-hot encode the categorical columns (first level dropped as the baseline)
    df = pd.get_dummies(df, columns=["Sex","Embarked","Title"], drop_first=True)
    return df

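# Run the pipeline on a fresh copy of the raw data so the result
# does not depend on the state left behind by the earlier cells
df_raw = pd.read_csv(url)
df_clean = clean_data(df_raw)
df_prepared = engineer_features(df_clean)
df_prepared.head()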


Unnamed: 0,Survived,Pclass,Name,Age,SibSp,Parch,Fare,FamilySize,IsAlone,FareBin,...,Embarked_Q,Embarked_S,Title_Miss,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rare,Title_the Countess
0,0,3,"Braund, Mr. Owen Harris",22.0,1,0,7.25,2,0,0,...,False,True,False,False,False,True,False,False,False,False
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,71.2833,2,0,3,...,False,False,False,False,False,False,True,False,False,False
2,1,3,"Heikkinen, Miss. Laina",26.0,0,0,7.925,1,1,1,...,False,True,True,False,False,False,False,False,False,False
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,53.1,2,0,3,...,False,True,False,False,False,False,True,False,False,False
4,0,3,"Allen, Mr. William Henry",35.0,0,0,8.05,1,1,1,...,False,True,False,False,False,True,False,False,False,False
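Because both functions take a DataFrame and return a new one, they can also be chained with `DataFrame.pipe`. The sketch below uses a three-row toy frame and a shortened feature step (`add_family_features` is a hypothetical stand-in for `engineer_features`) so that it runs on its own:

```python
import pandas as pd

def clean_data(df):
    # Same cleaning logic as the notebook, written compactly
    df = df.drop(columns=["PassengerId", "Ticket", "Cabin"], errors="ignore").copy()
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    return df[df["Fare"] >= 0].drop_duplicates()

def add_family_features(df):
    # Shortened stand-in for engineer_features: family features only
    df = df.copy()
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
    return df

# Toy raw frame with hypothetical values and the columns the functions need
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A", "B", "C"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", None, "S"],
})

# Each step receives the output of the previous one; raw is left untouched
prepared = raw.pipe(clean_data).pipe(add_family_features)
print(prepared)
```

The chained form keeps the order of steps visible in one line and makes it easy to insert or remove a stage.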


In [21]:
# Step 6: Save the Prepared Dataset
df_prepared.to_csv("titanic_prepared.csv", index=False)
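A quick round-trip check can confirm that the saved file reads back with the same shape and columns. This sketch writes a small hypothetical frame to a temporary directory rather than to the notebook's working directory:

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df_prepared (hypothetical values)
prepared = pd.DataFrame({
    "Survived": [0, 1],
    "Fare": [7.25, 71.2833],
    "Sex_male": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "titanic_prepared.csv")
    prepared.to_csv(path, index=False)
    restored = pd.read_csv(path)

# Compare structure only; dtypes can change when data passes through CSV
same_shape = restored.shape == prepared.shape
same_columns = list(restored.columns) == list(prepared.columns)
print(same_shape, same_columns)
```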


## Reflection Questions
- Which cleaning steps had the largest impact on data consistency?
In this notebook, the largest impact on data consistency comes from the following steps:

✅ Handling Missing Values:
Age had 177 missing values out of 891 rows and Embarked had 2.

Filling Age with the median and Embarked with the mode leaves no missing values in the columns that are kept, which improves model compatibility.

✅ Dropping Sparse or Irrelevant Columns:
Cabin had 687 missing values out of 891, so it was dropped rather than imputed, along with the identifier-like columns PassengerId and Ticket.

✅ Removing Duplicates and Invalid Values:
Dropping duplicate rows ensures that each row represents a unique passenger, and filtering on Fare >= 0 guards against invalid fares.

Prevents data leakage or skewed statistics caused by repeated entries.

- How did the new features (Title, FamilySize, IsAlone, bins) improve the dataset’s expressive power?
These engineered features capture non-linear and contextual relationships that raw features might miss:

🧑‍🎓 Title:
Extracted from names (e.g., "Mr", "Mrs", "Miss").

Encodes social status, gender, and sometimes age range (e.g., "Master" = young male).

Helps the model capture patterns like survival probability among women/children.

👪 FamilySize:
Sum of SibSp and Parch, plus 1 for the passenger.

Captures the size of the group a passenger travelled with, which may affect survival.

A passenger with family aboard might behave differently than one traveling alone.

🧍 IsAlone:
Derived from FamilySize; 1 if traveling alone.

Binary variable that explicitly indicates solitude.

Simplifies modeling, especially for decision trees or logistic regression.

📊 Binned Features (FareBin, AgeBin):
Convert continuous variables into ordered buckets (fare quartiles via pd.qcut, and age bands with edges at 0, 12, 20, 40, 60, and 100 via pd.cut).

Improves interpretability and often helps models that struggle with non-linear relationships.
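The difference between the two binning calls can be seen on a small example. This sketch reuses the bin edges from `engineer_features` on a toy frame with hypothetical values: `pd.cut` uses fixed edges, while `pd.qcut` picks edges so that each bin holds roughly the same number of rows.

```python
import pandas as pd

# Toy values chosen to land in different bins
toy = pd.DataFrame({
    "Age": [4.0, 15.0, 28.0, 45.0, 71.0],
    "Fare": [7.25, 8.05, 13.0, 53.1, 263.0],
})

# Fixed-width style bins: (0, 12], (12, 20], (20, 40], (40, 60], (60, 100]
toy["AgeBin"] = pd.cut(toy["Age"], bins=[0, 12, 20, 40, 60, 100], labels=False)

# Quantile bins: edges are derived from the data, here four quartile groups
toy["FareBin"] = pd.qcut(toy["Fare"], 4, labels=False)

print(toy)
```

With `labels=False` both calls return integer bin codes, which keeps the ordering information that string labels would lose after one-hot encoding.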

- How would you adapt this pipeline for a different tabular dataset?
Adapting it involves understanding domain context and following a systematic approach:

🔁 Reusability Steps:
Missing value handling: Detect placeholder values (0, 9999, -1) or NaNs.

Outlier treatment: Use IQR or z-score to handle extreme values.

Drop duplicates: Always applicable.

Create interaction terms: Identify variable pairs that could multiply to create meaningful interactions.

Binning: Useful for continuous columns like income, age, BMI, etc.
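Two of the reusable steps above, placeholder detection and IQR-based outlier flagging, can be sketched as small helpers. The function names, the placeholder values, and the conventional multiplier of 1.5 are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

def replace_placeholders(series: pd.Series, placeholders) -> pd.Series:
    """Turn sentinel values such as -1 or 9999 into NaN."""
    return series.mask(series.isin(list(placeholders)))

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical income column with two sentinels and one extreme value
income = pd.Series([30.0, 32.0, 35.0, 31.0, -1.0, 9999.0, 33.0, 400.0])

# Step 1: sentinels become missing values so they do not distort the quartiles
income = replace_placeholders(income, [-1, 9999])

# Step 2: flag values outside the IQR fences (NaN is ignored by quantile)
low, high = iqr_bounds(income)
outliers = income[(income < low) | (income > high)]
print(outliers)
```

Replacing placeholders before computing the fences matters: a sentinel like 9999 would otherwise count as an ordinary value and skew the quartiles.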

🧠 Domain-specific Engineering:
In a banking dataset, extract Job Type, Account Age, or IsSeniorCitizen.

In a medical dataset, compute RiskScore, AgeGroup, HasComorbidities.

In e-commerce, create features like TimeOnSite, RepeatCustomer, CartAbandonRate.

🧪 Validations & Assertions:
Always add post-cleaning checks: nulls, value ranges, data types, unique values in categorical features.
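A minimal sketch of such checks, assuming a prepared frame with the `Survived`, `Fare`, and `IsAlone` columns from this notebook (the function name `validate_prepared` and the toy data are hypothetical):

```python
import pandas as pd

def validate_prepared(frame: pd.DataFrame) -> bool:
    """Raise AssertionError if the prepared frame violates basic expectations."""
    # No missing values should survive cleaning
    assert not frame.isnull().any().any(), "unexpected missing values"
    # Value ranges
    assert (frame["Fare"] >= 0).all(), "negative fare found"
    assert frame["Survived"].isin([0, 1]).all(), "Survived must be 0 or 1"
    assert frame["IsAlone"].isin([0, 1]).all(), "IsAlone must be 0 or 1"
    # Data types
    assert pd.api.types.is_numeric_dtype(frame["Fare"]), "Fare must be numeric"
    # No exact duplicate rows
    assert not frame.duplicated().any(), "duplicate rows found"
    return True

# Toy frame (hypothetical values) that satisfies every check
good = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.28, 7.92],
    "IsAlone": [0, 0, 1],
})
result = validate_prepared(good)
print(result)
```

Running the checks at the end of the pipeline turns silent data problems into immediate, named failures.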