
## Additional Lab Exercises: Feature Engineering & Validation Pipelines

Below are three self-guided exercises. For each, students should build reusable `clean_data()`, `engineer_features()`, and `validate_data()` functions in Colab. Direct CSV links are provided.
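One way to structure the three functions is as pure transformations that take a DataFrame and return a new one, so they can be chained with `DataFrame.pipe`. The sketch below is only a skeleton under stated assumptions: the toy columns `a` and `b` and the `total` feature are invented for illustration and are not part of any exercise dataset.

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; here only exact duplicate rows are dropped."""
    return df.drop_duplicates().reset_index(drop=True)


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with derived columns ('total' is a made-up example feature)."""
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out


def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Raise AssertionError on a failed check, otherwise pass the frame through."""
    assert not df.isnull().any().any(), "There are missing values!"
    return df


# Toy input (hypothetical columns): the second row duplicates the first.
toy = pd.DataFrame({"a": [1, 1, 2], "b": [10, 10, 20]})
result = toy.pipe(clean_data).pipe(engineer_features).pipe(validate_data)
print(result)
```

Because each function returns a new frame instead of mutating its input, the original `toy` frame is left untouched and any step can be re-run on its own.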

---

### Exercise 1: Diabetes Risk Prediction Pipeline

**Dataset (Pima Indians Diabetes):**  
https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv

**Tasks**  
1. **Clean Data**  
   - Identify zero values in physiological measures (e.g., `Glucose`, `BloodPressure`) and replace with column medians.  
   - Drop duplicate rows.  
2. **Feature Engineering**  
   - Create BMI categories (`Underweight`, `Normal`, `Overweight`, `Obese`) from `BMI`.  
   - Compute `age_bin` by decade.  
   - Generate interaction term `Glucose*Insulin`.  
3. **Validation**  
   - Assert no nulls remain.  
   - Check that all new categorical bins cover expected ranges.  
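The tasks above could be wrapped into the three reusable functions roughly as follows. This is a sketch, not a reference solution: it runs on a small hand-made frame rather than the real CSV, the BMI cut points (18.5, 25, 30) are assumed thresholds, and `age_bin` is computed as the lower bound of the decade, which is only one possible reading of "by decade".

```python
import numpy as np
import pandas as pd

# Columns where a zero is physiologically implausible and is treated as missing.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
BMI_LABELS = ["Underweight", "Normal", "Overweight", "Obese"]


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then replace zeros with the median of the non-zero values."""
    out = df.drop_duplicates().copy()
    for col in ZERO_AS_MISSING:
        values = out[col].astype(float)
        median_val = values[values != 0].median()
        out[col] = values.mask(values == 0, median_val)
    return out


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add BMI_Category, age_bin (decade lower bound) and Glucose_Insulin."""
    out = df.copy()
    # Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
    out["BMI_Category"] = pd.cut(
        out["BMI"], bins=[0, 18.5, 25, 30, np.inf], labels=BMI_LABELS, right=False
    )
    out["age_bin"] = (out["Age"] // 10) * 10
    out["Glucose_Insulin"] = out["Glucose"] * out["Insulin"]
    return out


def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Check for nulls and for values outside the expected bins."""
    assert not df.isnull().any().any(), "There are missing values!"
    assert set(df["BMI_Category"].unique()) <= set(BMI_LABELS), "Unexpected BMI categories"
    assert df["age_bin"].between(0, 120).all(), "age_bin outside expected range"
    return df


# Hand-made rows for illustration; the last row duplicates the first.
toy = pd.DataFrame({
    "Glucose": [148, 0, 85, 148],
    "BloodPressure": [72, 66, 0, 72],
    "SkinThickness": [35, 29, 0, 35],
    "Insulin": [0, 94, 168, 0],
    "BMI": [33.6, 26.6, 0.0, 33.6],
    "Age": [50, 31, 21, 50],
})
result = toy.pipe(clean_data).pipe(engineer_features).pipe(validate_data)
print(result[["BMI", "BMI_Category", "age_bin", "Glucose_Insulin"]])
```

Using `pd.cut` for the BMI categories means a value outside every bin becomes `NaN`, so the null check in `validate_data` also catches binning gaps.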

---

### Exercise 2: Customer Churn Prediction Pipeline

**Dataset (Telco Customer Churn):**  
https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv

**Tasks**  
1. **Clean Data**  
   - Convert `TotalCharges` to numeric, coercing unparseable entries to `NaN`, then impute the missing values (e.g., with the median).  
   - Drop `customerID`.  
2. **Feature Engineering**  
   - Create `tenure_group` (e.g., `0-12`,`13-24`,…) from `tenure`.  
   - Compute `avg_charges_per_month = TotalCharges / tenure`.  
   - Encode `Contract` and `PaymentMethod` with one-hot encoding.  
3. **Validation**  
   - Verify no infinite values in `avg_charges_per_month`.  
   - Confirm all `tenure_group` labels appear at least once.  
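A possible shape for these steps is sketched below on a six-row hand-made frame (one row per tenure group) instead of the real CSV. Two choices here are assumptions rather than part of the task: rows with `tenure == 0` fall back to `MonthlyCharges` for the average, and `include_lowest=True` is used so that a tenure of 0 lands in the `0-12` group.

```python
import numpy as np
import pandas as pd

TENURE_BINS = [0, 12, 24, 36, 48, 60, np.inf]
TENURE_LABELS = ["0-12", "13-24", "25-36", "37-48", "49-60", "61+"]


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop customerID, coerce TotalCharges to numeric, impute NaN with the median."""
    out = df.drop(columns=["customerID"]).copy()
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    out["TotalCharges"] = out["TotalCharges"].fillna(out["TotalCharges"].median())
    return out


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add tenure_group and avg_charges_per_month, then one-hot encode two columns."""
    out = df.copy()
    out["tenure_group"] = pd.cut(
        out["tenure"], bins=TENURE_BINS, labels=TENURE_LABELS, include_lowest=True
    )
    # tenure == 0 becomes NaN so the division never produces inf.
    safe_tenure = out["tenure"].where(out["tenure"] != 0)
    # Assumption: with no completed month, use MonthlyCharges as the average.
    out["avg_charges_per_month"] = (out["TotalCharges"] / safe_tenure).fillna(
        out["MonthlyCharges"]
    )
    return pd.get_dummies(out, columns=["Contract", "PaymentMethod"])


def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Check that the average is finite and every tenure_group label occurs."""
    assert np.isfinite(df["avg_charges_per_month"]).all(), "Non-finite average charges"
    missing = set(TENURE_LABELS) - set(df["tenure_group"].dropna().unique())
    assert not missing, f"Missing tenure_group labels: {missing}"
    return df


# Hand-made rows for illustration; the first has a blank TotalCharges and tenure 0.
toy = pd.DataFrame({
    "customerID": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "tenure": [0, 13, 30, 40, 55, 72],
    "MonthlyCharges": [20.0, 20.0, 30.0, 30.0, 50.0, 100.0],
    "TotalCharges": [" ", "260", "900", "1200.5", "2750", "7200"],
    "Contract": ["Month-to-month", "One year", "One year",
                 "Two year", "Two year", "Two year"],
    "PaymentMethod": ["Mailed check", "Mailed check", "Electronic check",
                      "Electronic check", "Mailed check", "Mailed check"],
})
result = toy.pipe(clean_data).pipe(engineer_features).pipe(validate_data)
print(result[["tenure", "tenure_group", "avg_charges_per_month"]])
```

`np.isfinite` returns `False` for `NaN` as well as for `inf`, so this validation is stricter than a plain `np.isinf` check.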

---

### Exercise 3: House Price Modeling Pipeline

**Dataset (California Housing, used here as a proxy for the Kaggle Ames Housing data):**  
https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv

**Tasks**  
1. **Clean Data**  
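   - Impute missing `total_bedrooms` values with the column median.  
   - Drop rows in rare `ocean_proximity` categories (if any); the column is categorical, so "outliers" here means rare categories.  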
2. **Feature Engineering**  
   - Create `rooms_per_household`, `bedrooms_per_room`, `population_per_household`.  
   - Bin `median_income` into quartiles.  
   - Log-transform `median_house_value`.  
3. **Validation**  
   - Ensure no negative or zero values in ratio features.  
   - Check that log transformation values are finite.  
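As a rough illustration, the steps above might be organised as below. The frame is a four-row hand-made sample rather than the downloaded CSV, and the quartile labels `Q1` to `Q4` are an arbitrary naming choice for this sketch.

```python
import numpy as np
import pandas as pd

RATIO_COLS = ["rooms_per_household", "bedrooms_per_room", "population_per_household"]


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Impute missing total_bedrooms with the column median."""
    out = df.copy()
    out["total_bedrooms"] = out["total_bedrooms"].fillna(out["total_bedrooms"].median())
    return out


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add ratio features, income quartiles and the log-transformed target."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    out["income_quartile"] = pd.qcut(
        out["median_income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
    )
    out["log_median_house_value"] = np.log(out["median_house_value"])
    return out


def validate_data(df: pd.DataFrame) -> pd.DataFrame:
    """Check that ratio features are strictly positive and the log values are finite."""
    assert (df[RATIO_COLS] > 0).all().all(), "Non-positive ratio feature"
    assert np.isfinite(df["log_median_house_value"]).all(), "Non-finite log value"
    return df


# Hand-made rows for illustration; one total_bedrooms value is missing.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
})
result = toy.pipe(clean_data).pipe(engineer_features).pipe(validate_data)
print(result[RATIO_COLS + ["income_quartile", "log_median_house_value"]])
```

Raising on a failed check (instead of printing counts) makes the validation step usable as a gate before model training.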

---

**EXERCISE A**

In [1]:
import pandas as pd
import numpy as np

In [2]:
url="https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv"
df=pd.read_csv(url)
df.head()


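|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |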


In [3]:

# 1. Data Cleaning


# Columns with zero values that shouldn't be zero
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Replace 0s with column median (excluding 0s)
for col in zero_cols:
    median_val = df[df[col] != 0][col].median()
    df[col] = df[col].replace(0, median_val)

# Drop duplicate rows
df = df.drop_duplicates()

In [4]:
df.sample(5)

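|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 328 | 2 | 102 | 86 | 36 | 120 | 45.5 | 0.127 | 23 | 1 |
| 731 | 8 | 120 | 86 | 29 | 125 | 28.4 | 0.259 | 22 | 1 |
| 478 | 8 | 126 | 74 | 38 | 75 | 25.9 | 0.162 | 39 | 0 |
| 623 | 0 | 94 | 70 | 27 | 115 | 43.5 | 0.347 | 21 | 0 |
| 476 | 2 | 105 | 80 | 45 | 191 | 33.7 | 0.711 | 29 | 1 |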


In [5]:

# 2. Feature Engineering


# BMI Categories
def bmi_category(bmi):
    if bmi < 18.5:
        return "Underweight"
    elif 18.5 <= bmi < 25:
        return "Normal"
    elif 25 <= bmi < 30:
        return "Overweight"
    else:
        return "Obese"

df["BMI_Category"] = df["BMI"].apply(bmi_category)

# Age bin by decade
df["Age_Bin"] = pd.cut(df["Age"], bins=[20, 30, 40, 50, 60, 100], labels=["20s", "30s", "40s", "50s", "60+"], right=False)

# Interaction term
df["Glucose_Insulin"] = df["Glucose"] * df["Insulin"]


In [6]:
df.sample(5)

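|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | BMI_Category | Age_Bin | Glucose_Insulin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 737 | 8 | 65 | 72 | 23 | 125 | 32.0 | 0.6 | 42 | 0 | Obese | 40s | 8125 |
| 585 | 1 | 93 | 56 | 11 | 125 | 22.5 | 0.417 | 22 | 0 | Normal | 20s | 11625 |
| 457 | 5 | 86 | 68 | 28 | 71 | 30.2 | 0.364 | 24 | 0 | Obese | 20s | 6106 |
| 292 | 2 | 128 | 78 | 37 | 182 | 43.3 | 1.224 | 31 | 1 | Obese | 30s | 23296 |
| 503 | 7 | 94 | 64 | 25 | 79 | 33.3 | 0.738 | 41 | 0 | Obese | 40s | 7426 |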








In [7]:
# 3. Validation

# Assert no nulls
assert df.isnull().sum().sum() == 0, "There are missing values!"

# Check BMI categories
assert set(df["BMI_Category"].unique()).issubset({"Underweight", "Normal", "Overweight", "Obese"}), "Unexpected BMI categories"

# Check age bins
assert df["Age_Bin"].notnull().all(), "Missing values in Age_Bin. Check binning range."

In [8]:
print("✅ Pipeline completed successfully. Cleaned data sample:")
print(df.head())

✅ Pipeline completed successfully. Cleaned data sample:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35      125  33.6   
1            1       85             66             29      125  26.6   
2            8      183             64             29      125  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome BMI_Category Age_Bin  \
0                     0.627   50        1        Obese     50s   
1                     0.351   31        0   Overweight     30s   
2                     0.672   32        1       Normal     30s   
3                     0.167   21        0   Overweight     20s   
4                     2.288   33        1        Obese     30s   

   Glucose_Insulin  
0            18500  
1            10625  
2            22875  
3             8366  
4            23016  

**EXERCISE B**

In [14]:
import pandas as pd
import numpy as np

# Load dataset
urla = "https://raw.githubusercontent.com/blastchar/telco-customer-churn/master/WA_Fn-UseC_-Telco-Customer-Churn.csv"
dfw = pd.read_csv(urla)  # use the Telco URL defined above, not the earlier `url`

# 1. Data Cleaning


# Convert TotalCharges to numeric, coercing errors
dfw['TotalCharges'] = pd.to_numeric(dfw['TotalCharges'], errors='coerce')

# Impute missing TotalCharges with median
dfw['TotalCharges'] = dfw['TotalCharges'].fillna(dfw['TotalCharges'].median())

# Drop customerID
dfw.drop(columns=['customerID'], inplace=True)


# 2. Feature Engineering


# Create tenure_group (e.g., 0-12,13-24,…)
bins = [0, 12, 24, 36, 48, 60, np.inf]
labels = ['0-12', '13-24', '25-36', '37-48', '49-60', '61+']
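# include_lowest=True keeps tenure == 0 in the '0-12' group instead of turning it into NaN
dfw['tenure_group'] = pd.cut(dfw['tenure'], bins=bins, labels=labels, right=True, include_lowest=True)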

# Compute avg_charges_per_month = TotalCharges / tenure
# Avoid division by zero: tenure == 0 is mapped to NaN, so the ratio is NaN (not inf) for those rows
dfw['avg_charges_per_month'] = dfw['TotalCharges'] / dfw['tenure'].replace(0, np.nan)

# One-hot encode Contract and PaymentMethod
dfw = pd.get_dummies(dfw, columns=['Contract', 'PaymentMethod'])


# 3. Validation


# 1. Check for infinite values in avg_charges_per_month
if np.isinf(dfw['avg_charges_per_month']).any():
    print(" Infinite values found in avg_charges_per_month")
else:
    print(" No infinite values in avg_charges_per_month")

# 2. Confirm all tenure_group labels appear at least once
expected_labels = set(labels)
present_labels = set(dfw['tenure_group'].dropna().unique())
missing_labels = expected_labels - present_labels

if not missing_labels:
    print(f" All tenure_group labels are present: {present_labels}")
else:
    print(f" Missing tenure_group labels: {missing_labels}")


HTTPError: HTTP Error 404: Not Found

**EXERCISE C**

In [15]:
import pandas as pd
import numpy as np

# Load dataset
url = "https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
df = pd.read_csv(url)


# 1. CLEAN DATA


# Impute missing values in 'total_bedrooms' with median
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())

# 'ocean_proximity' is categorical, so "outliers" here means rare categories
# Print value counts first to see which categories are rare
print(df['ocean_proximity'].value_counts())

# Drop the rare 'ISLAND' category; .copy() avoids SettingWithCopyWarning when adding columns later
df = df[df['ocean_proximity'] != 'ISLAND'].copy()


# 2. FEATURE ENGINEERING


# rooms_per_household
df['rooms_per_household'] = df['total_rooms'] / df['households']

# bedrooms_per_room
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

# population_per_household
df['population_per_household'] = df['population'] / df['households']

# Bin median_income into quartiles
df['income_quartile'] = pd.qcut(df['median_income'], 4, labels=False)

# Log-transform median_house_value
df['log_median_house_value'] = np.log(df['median_house_value'])


# 3. VALIDATION CHECKS


# Check for any negative or zero values in ratio features
print("Negative or zero values:")
print((df[['rooms_per_household', 'bedrooms_per_room', 'population_per_household']] <= 0).sum())

# Ensure log transformation has only finite values
print("Finite log values:", np.isfinite(df['log_median_house_value']).sum(), "/", len(df))


# Preview the transformed dataset
print(df.head())


ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
Negative or zero values:
rooms_per_household         0
bedrooms_per_room           0
population_per_household    0
dtype: int64
Finite log values: 20635 / 20635
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  \
0       322.0       126.0         8.3252            452600.0        NEAR BAY   
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY   
2      

