<a href="https://colab.research.google.com/github/appliedcode/mthree-c422/blob/mthree-c422-dipti/Exercises/day-7/Data_Cleaning_Feature_Engineering/Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Additional Lab Exercises: Cleaning & Feature Engineering Pipelines

Below are a few exercises with direct CSV links. For each, build modular `clean_data()` and `engineer_features()` functions in Colab.

---

### Exercise A: Housing Prices Pipeline

**Dataset (Housing Prices Mini):**  
https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv

**Tasks**  
1. Load and inspect for missing values and outliers.  
2. Impute `total_bedrooms` with median.  
3. Create new features:  
   - `rooms_per_household = total_rooms / households`  
   - `bedrooms_per_room  = total_bedrooms / total_rooms`  
   - `population_per_household = population / households`  
4. One-hot encode `ocean_proximity`.  
5. Normalize numeric features (MinMax or StandardScaler).  
6. Wrap cleaning and feature logic in functions.

---

### Exercise B: Credit Card Transactions

**Dataset (UCI Default of Credit Card Clients):**  
https://archive.ics.uci.edu/static/public/422/default_of_credit_card_clients__data_.csv

**Tasks**  
1. Rename columns to lowercase and remove spaces.  
2. Handle missing or invalid billing/payment values.  
3. Engineer features:  
   - `avg_bill_amt = (bill_amt1 + … + bill_amt6) / 6`  
   - `avg_pay_amt  = (pay_amt1 + … + pay_amt6) / 6`  
   - `pay_bill_ratio = avg_pay_amt / avg_bill_amt`  
4. Bin `age` into groups (e.g., decades).  
5. One-hot encode `sex`, `education`, `marriage`.  
6. Modularize into `clean_data()` and `engineer_features()`.

---

**Exercise A**

In [2]:
!pip install pandas numpy -q

import pandas as pd
import numpy as np

In [3]:
url="https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv"
df=pd.read_csv(url)
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [4]:
# Check for missing values
print(df.isnull().sum())

# Basic statistics to detect outliers
print(df.describe())


longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.0000
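
`describe()` only summarises each column, so outliers still have to be read off by eye. One common way to flag candidates is the 1.5×IQR rule. The sketch below is an assumption about how the inspection step could be extended, not part of the original notebook; the helper name `iqr_outlier_counts` is hypothetical and the frame holds toy values, not the full `housing.csv`.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy data (hypothetical values chosen to include one extreme entry per column)
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274, 1627, 39320],
    "households": [126, 1138, 177, 219, 259, 300],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Columns with a high count are worth a closer look before deciding whether to cap, transform, or keep the values.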

In [23]:
# Impute missing 'total_bedrooms' with the median.
# Assign the result back: calling fillna(..., inplace=True) on a column
# selection raises a warning and will stop working in pandas 3.0.
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())


In [24]:
df['rooms_per_household'] = df['total_rooms'] / df['households']
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
df['population_per_household'] = df['population'] / df['households']


In [25]:
df.sample(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
6045,0.90827,-0.740606,1.379433,-0.529787,-0.426435,0.075522,-0.396368,-0.999524,-0.834158,1.336603,0.804917,-0.190534,True,False,False,False
3348,-0.404443,2.503927,-0.13027,-0.852033,-0.838947,-0.977961,-1.005805,-1.151384,-1.445981,0.847116,0.984641,0.972317,True,False,False,False
6134,0.783488,-0.731243,0.425936,-0.430775,-0.192757,0.162061,-0.192351,-0.853085,-0.592375,2.239528,0.447466,-0.842527,True,False,False,False
11145,0.79347,-0.834244,-0.448103,-0.250629,-0.38113,-0.300659,-0.35975,0.470755,-0.219735,0.696676,1.520692,0.835745,False,False,False,False
11636,0.768514,-0.843607,-0.289187,0.180713,-0.011538,0.214161,0.024744,0.397062,0.197102,7.30317,-0.063847,8.654899,False,False,False,False
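
The sample above already contains `ocean_proximity_*` columns, although no standalone cell for Task 4 is listed. A minimal sketch of that step on a toy frame, assuming `pd.get_dummies` with `drop_first=True` as used later in `engineer_features()`; the rows are illustrative and only reuse category labels from the dataset.

```python
import pandas as pd

# Toy rows using category labels that occur in housing.csv
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND"],
    "households": [126, 1138, 177, 219, 259],
})

# drop_first=True drops the first category in sorted order ("<1H OCEAN"),
# which then acts as the implicit baseline
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], drop_first=True)
dummy_cols = [c for c in encoded.columns if c.startswith("ocean_proximity_")]
print(encoded.columns.tolist())
```

This is why rows such as 11145 and 11636 show `False` in all four indicator columns: they belong to the dropped baseline category.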


In [27]:
from sklearn.preprocessing import StandardScaler

# Select numeric features
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Normalize
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
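
Task 5 allows either MinMax or standard scaling, and the cell above standardises every float/int column, including the target `median_house_value`. The following is a hedged alternative in plain pandas, assuming the target should stay in its original units. The helper `min_max_scale` is hypothetical, and the rows are copied from the `df.head()` output shown earlier.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame, exclude=()) -> pd.DataFrame:
    """Rescale numeric columns to [0, 1], leaving columns in `exclude` untouched."""
    out = df.copy()
    cols = [c for c in out.select_dtypes(include="number").columns if c not in exclude]
    col_min = out[cols].min()
    col_range = out[cols].max() - col_min
    # Constant columns have zero range; divide by 1 so they map to 0 instead of NaN
    col_range = col_range.replace(0, 1)
    out[cols] = (out[cols] - col_min) / col_range
    return out

toy = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
})
scaled = min_max_scale(toy, exclude=["median_house_value"])
print(scaled)
```

When a train/test split is involved, the minimum and range would normally be computed on the training portion only and reused for the test portion.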


In [28]:
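from sklearn.preprocessing import StandardScaler

def clean_data(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    # Assign back instead of a chained fillna(inplace=True), which pandas is deprecating
    df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
    return df

def engineer_features(df):
    df = df.copy()
    df['rooms_per_household'] = df['total_rooms'] / df['households']
    df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
    df['population_per_household'] = df['population'] / df['households']
    # drop_first=True drops one category, which becomes the baseline
    df = pd.get_dummies(df, columns=['ocean_proximity'], drop_first=True)
    return df

def normalize_features(df):
    df = df.copy()
    # Boolean one-hot columns are neither float64 nor int64, so they stay unscaled
    numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return df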
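
Once the steps are functions that take and return a DataFrame, they can be chained with `DataFrame.pipe`. The sketch below uses trimmed stand-in versions of `clean_data()` and `engineer_features()` on a three-row toy frame so that it runs on its own; copying inside each step is an assumption made here to keep the raw frame intact.

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # leave the caller's frame untouched
    df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
    return df

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    return df

# Toy frame with one missing total_bedrooms value
raw = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, None, 190.0],
    "households": [126.0, 1138.0, 177.0],
})

# Each pipe call passes the previous result to the next step
processed = raw.pipe(clean_data).pipe(engineer_features)
print(processed)
```

Because every step returns a new frame, `raw` still holds the missing value afterwards, which makes it easy to rerun the pipeline from scratch.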


In [30]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-1.327835,1.052548,0.982143,-0.804819,-0.972476,-0.974429,-0.977033,2.344766,2.129631,0.002359,-0.003443,0.005924,False,False,True,False
1,-1.322844,1.043185,-0.607019,2.04589,1.357143,0.861439,1.669961,2.332238,1.314156,0.033352,-0.013863,-0.03042,False,False,True,False
2,-1.332827,1.038503,1.856182,-0.535746,-0.827024,-0.820777,-0.843637,1.782699,1.258693,-0.012211,0.00297,0.00408,False,False,True,False
3,-1.337818,1.038503,1.856182,-0.624215,-0.719723,-0.766028,-0.733781,0.932968,1.1651,0.00444,-0.0045,0.009442,False,False,True,False
4,-1.337818,1.038503,1.856182,-0.462404,-0.612423,-0.759847,-0.629157,-0.012881,1.1729,-0.004496,-0.001223,0.021804,False,False,True,False


**Exercise B**

In [13]:
import pandas as pd
import numpy as np

In [17]:
# 1. Load dataset
url = "https://raw.githubusercontent.com/GDPlumb/credit-card-default-prediction/main/UCI_Credit_Card.csv"
df = pd.read_csv(url)
df.head()

HTTPError: HTTP Error 404: Not Found

In [None]:
# 2. Data cleaning
def clean_data(df):
    # Rename columns: lowercase, with spaces replaced by underscores
    df.columns = [col.lower().replace(" ", "_") for col in df.columns]

    # Shorten the target column name (no effect if the column is absent)
    df.rename(columns={"default_payment_next_month": "default"}, inplace=True)

    # Handle missing/invalid billing and payment values
    bill_cols = [f"bill_amt{i}" for i in range(1, 7)]
    pay_cols = [f"pay_amt{i}" for i in range(1, 7)]

    for col in bill_cols + pay_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")  # convert non-numeric to NaN
        df[col] = df[col].fillna(0)  # fill NaNs with 0 (assumption)

    return df
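
Filling coerced values with 0 hides how many entries were invalid in the first place. A small sketch of the same renaming and coercion steps on a toy frame, counting the invalid entries before they are filled; the headers and the `"n/a"` entry are hypothetical examples of messy input.

```python
import pandas as pd

# Toy frame with hypothetical messy headers and one non-numeric entry
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "BILL_AMT1": ["3913", "n/a", "29239"],
    "default payment next month": [1, 1, 0],
})

# Lowercase the headers and replace spaces with underscores
toy.columns = [col.strip().lower().replace(" ", "_") for col in toy.columns]

# Non-numeric strings become NaN; count them before imputing
toy["bill_amt1"] = pd.to_numeric(toy["bill_amt1"], errors="coerce")
n_invalid = int(toy["bill_amt1"].isna().sum())
toy["bill_amt1"] = toy["bill_amt1"].fillna(0)

print(toy.columns.tolist(), n_invalid)
```

Logging `n_invalid` per column gives a quick check on whether the fill-with-zero assumption affects a handful of rows or a large share of the data.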

In [None]:
# 3. Feature engineering
def engineer_features(df):
    # Average bill amount
    bill_cols = [f"bill_amt{i}" for i in range(1, 7)]
    df["avg_bill_amt"] = df[bill_cols].mean(axis=1)

    # Average pay amount
    pay_cols = [f"pay_amt{i}" for i in range(1, 7)]
    df["avg_pay_amt"] = df[pay_cols].mean(axis=1)

    # Pay/Bill ratio
    df["pay_bill_ratio"] = df["avg_pay_amt"] / (df["avg_bill_amt"] + 1e-6)  # avoid division by zero

    # Bin age into decades; right=False gives [20, 30), [30, 40), ... so an age of 30 lands in "30s"
    df["age_bin"] = pd.cut(df["age"], bins=[20, 30, 40, 50, 60, 70, 80], labels=["20s", "30s", "40s", "50s", "60s", "70s"], right=False)

    # One-hot encode sex, education, marriage and the new age_bin
    categorical_cols = ["sex", "education", "marriage", "age_bin"]
    df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

    return df
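
`pd.cut` closes intervals on the right by default, so with edges at 20, 30, 40, ... an age of exactly 30 is labelled "20s" unless `right=False` is passed. A short sketch of the difference, using made-up ages:

```python
import pandas as pd

ages = pd.Series([21, 29, 30, 39, 40, 79])
edges = [20, 30, 40, 50, 60, 70, 80]
labels = ["20s", "30s", "40s", "50s", "60s", "70s"]

# Default: intervals are (20, 30], (30, 40], ...
right_closed = pd.cut(ages, bins=edges, labels=labels)
# right=False: intervals are [20, 30), [30, 40), ...
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right=True": right_closed, "right=False": left_closed}))
```

Ages that fall outside the outermost edges become NaN in either mode, so the edge list should cover the observed age range.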

In [None]:
# 4. Run the pipeline
df = clean_data(df)
df = engineer_features(df)

# 5. Check final result
print(df.head())

**One-Hot Encoding**
One-hot encoding converts the categorical columns (`sex`, `education`, `marriage`, `age_bin`) into binary indicator columns so that machine learning models can consume them.
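
In the UCI data these fields are stored as integer codes, which is why they have to be named explicitly in `columns=`: left to its defaults, `pd.get_dummies` only encodes object, string, and category columns. A minimal sketch on toy codes; the values are illustrative, and `dtype=int` is an optional choice made here to get 0/1 instead of True/False.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": [1, 2, 2],
    "marriage": [1, 2, 3],
    "limit_bal": [20000, 120000, 90000],
})

# With no object/category columns, the default call encodes nothing
untouched = pd.get_dummies(toy)

# Naming the columns forces the integer codes to be treated as categories
encoded = pd.get_dummies(toy, columns=["sex", "marriage"], drop_first=True, dtype=int)
print(encoded)
```

With `drop_first=True`, the lowest code of each field (here `sex` 1 and `marriage` 1) becomes the baseline and gets no column of its own.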