# Session 66: Capstone Project Part 1 (Data Cleaning)

**Unit 6: Data Ethics, Privacy, and Future Trends**
**Hour: 66**
**Mode: Practical Project**

---

### 1. Objective

This lab session is dedicated to the **Scrub** phase of our project. We will address the data quality issues we identified in the last session and create new, engineered features that will be useful for our analysis.

### 2. Setup

Reload the dataset. The file is tab-separated, so `read_csv` needs `sep='\t'`.

In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/LeoFernan/Marketing-Campaigns-Analysis/main/marketing_campaign.csv'
df = pd.read_csv(url, sep='\t')

### 3. Cleaning Task 1: Handle Missing Income Values

We have 24 missing values in the `Income` column. Because this is a small number of rows, we will impute them with the median, which is less sensitive to extreme incomes than the mean.

In [None]:
# Calculate median
median_income = df['Income'].median()

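# Impute missing values (assign back; inplace=True on a column selection
# is deprecated in recent pandas and may not modify df)
df['Income'] = df['Income'].fillna(median_income)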

# Verify
print(f"Missing values in Income after cleaning: {df['Income'].isnull().sum()}")
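
Median imputation hides which rows were originally missing. The following is a minimal sketch, on a made-up toy frame rather than the project data, of recording a flag before imputing and of how far the mean and median can diverge when one value is extreme. The column name `Income_Was_Missing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Income column (values are made up)
toy = pd.DataFrame({'Income': [30000.0, 45000.0, np.nan, 52000.0, 600000.0]})

# Record which rows were missing before imputing (hypothetical column name)
toy['Income_Was_Missing'] = toy['Income'].isnull()

# The median is barely affected by the extreme value; the mean is pulled up
median_income = toy['Income'].median()
mean_income = toy['Income'].mean()

# Assign the filled column back to the frame
toy['Income'] = toy['Income'].fillna(median_income)

print(f"median={median_income}, mean={mean_income}")
print(toy)
```

Keeping the flag lets later analysis check whether the imputed rows behave differently from the rest.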

### 4. Feature Engineering Task 1: Create Age and Customer Lifetime Features

The `Year_Birth` and `Dt_Customer` columns are not very useful in their current form. Let's create more intuitive features from them.

#### 4.1. Create 'Age' from 'Year_Birth'

We'll assume the analysis is being done in the year 2024.

In [None]:
df['Age'] = 2024 - df['Year_Birth']
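
A derived age is only as reliable as the birth year behind it. As a hedged sketch on made-up values, the check below flags ages above a cutoff so they can be reviewed before analysis. The cutoff `MAX_PLAUSIBLE_AGE` is a hypothetical choice, not something defined by the dataset.

```python
import pandas as pd

# Toy birth years (made up); 1893 mimics a data-entry error
toy = pd.DataFrame({'Year_Birth': [1970, 1985, 1893, 1996]})

ANALYSIS_YEAR = 2024      # same assumption as in the text
MAX_PLAUSIBLE_AGE = 100   # hypothetical cutoff, adjust as needed

toy['Age'] = ANALYSIS_YEAR - toy['Year_Birth']

# Flag rows whose derived age is implausible so they can be reviewed
implausible = toy[toy['Age'] > MAX_PLAUSIBLE_AGE]
print(implausible)
```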

#### 4.2. Create 'Customer_Lifetime' from 'Dt_Customer'

First, we need to convert `Dt_Customer` to a datetime object. Then we can calculate the number of days since enrollment.

In [None]:
# Convert to datetime (assumes the dates are stored day-first, e.g. DD-MM-YYYY)
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], dayfirst=True)

# Calculate lifetime in days relative to a reference date
reference_date = pd.to_datetime('2024-01-01')
df['Customer_Lifetime_Days'] = (reference_date - df['Dt_Customer']).dt.days

df[['Dt_Customer', 'Customer_Lifetime_Days']].head()
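
Date strings such as `04-09-2012` are ambiguous: they could mean 4 September or 9 April. The sketch below uses made-up strings and assumes a day-first `DD-MM-YYYY` layout, which should be confirmed against the raw file. It shows that an explicit `format` removes the ambiguity before the day difference is computed.

```python
import pandas as pd

# Toy date strings in an assumed DD-MM-YYYY layout (values are made up)
dates = pd.Series(['04-09-2012', '21-08-2013', '08-03-2014'])

# An explicit format avoids any day/month guessing
parsed = pd.to_datetime(dates, format='%d-%m-%Y')

# Days elapsed between each date and the reference date
reference_date = pd.Timestamp('2024-01-01')
lifetime_days = (reference_date - parsed).dt.days

print(parsed.dt.month.tolist())
print(lifetime_days.tolist())
```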

### 5. Feature Engineering Task 2: Simplify Categorical Columns

The `Marital_Status` and `Education` columns have many categories. Let's simplify them.

In [None]:
# Simplify Marital_Status into 'In Relationship' or 'Single'
df['Relationship'] = df['Marital_Status'].replace({
    'Married': 'In Relationship',
    'Together': 'In Relationship',
    'Single': 'Single',
    'Divorced': 'Single',
    'Widow': 'Single',
    'Alone': 'Single',
    'Absurd': 'Single',
    'YOLO': 'Single'
})

# Simplify Education into 'Undergraduate', 'Graduate', 'Postgraduate'
df['Education_Level'] = df['Education'].replace({
    'Basic': 'Undergraduate',
    '2n Cycle': 'Postgraduate', # Assuming this is a Master's degree, so it is grouped with 'Master'
    'Graduation': 'Graduate',
    'Master': 'Postgraduate',
    'PhD': 'Postgraduate'
})
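
`.replace()` leaves any value that is not in the dictionary unchanged, so an unexpected category would pass through silently. The sketch below uses made-up categories and a shortened mapping to show one way of listing unmapped values with `.map()`, which returns `NaN` for missing keys.

```python
import pandas as pd

# Toy categories (made up); 'Unknown' is deliberately absent from the mapping
status = pd.Series(['Married', 'Single', 'Together', 'Unknown'])

mapping = {
    'Married': 'In Relationship',
    'Together': 'In Relationship',
    'Single': 'Single',
}

# .map() yields NaN for keys missing from the dict,
# whereas .replace() would leave 'Unknown' as it is
mapped = status.map(mapping)
unmapped = status[mapped.isnull()].unique().tolist()
print(unmapped)
```

An empty list means every category in the column is covered by the mapping.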

### 6. Feature Engineering Task 3: Combine Features

Let's create some new features that summarize behavior.

In [None]:
# Create a 'Children' feature combining kids and teens
df['Children'] = df['Kidhome'] + df['Teenhome']

# Create a 'Total_Spend' feature
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['Total_Spend'] = df[spend_cols].sum(axis=1)

# Create a 'Total_Purchases' feature
purchase_cols = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
df['Total_Purchases'] = df[purchase_cols].sum(axis=1)
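
Note that `sum(axis=1)` skips missing values by default, so a row with a missing component still gets a total. The sketch below uses made-up numbers to contrast that default with `min_count`, which returns `NaN` when fewer than the required number of values are present.

```python
import numpy as np
import pandas as pd

# Toy spend columns (made up); the second row has a missing value
toy = pd.DataFrame({
    'MntWines': [100.0, 50.0],
    'MntFruits': [10.0, np.nan],
})
cols = ['MntWines', 'MntFruits']

# Default behaviour: NaN is skipped, so the second row still gets a total
total_default = toy[cols].sum(axis=1)

# min_count=len(cols) gives NaN unless every column has a value
total_strict = toy[cols].sum(axis=1, min_count=len(cols))

print(total_default.tolist())
print(total_strict.tolist())
```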

### 7. Final Step: Drop Original and Unused Columns

Now that we've created our new features, we can drop the old ones to keep our DataFrame clean.

In [None]:
cols_to_drop = ['Year_Birth', 'Dt_Customer', 'Marital_Status', 'Education', 
                'Kidhome', 'Teenhome', 'Z_CostContact', 'Z_Revenue'] # Z columns are constants

df_clean = df.drop(columns=cols_to_drop)

print("Cleaned DataFrame columns:")
print(df_clean.columns)
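
Before dropping a column because it is constant, it is worth confirming that claim. The sketch below uses a made-up toy frame and `nunique()` to list the columns that hold a single distinct value.

```python
import pandas as pd

# Toy frame (made up) with two constant columns and one varying column
toy = pd.DataFrame({
    'Z_CostContact': [3, 3, 3],
    'Z_Revenue': [11, 11, 11],
    'Recency': [10, 42, 7],
})

# A column with a single distinct value carries no information
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# Drop only the columns that were confirmed constant
toy_clean = toy.drop(columns=constant_cols)
print(toy_clean.columns.tolist())
```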

### 8. Conclusion

This was an intensive data cleaning and feature engineering session. We have:
1.  Handled missing values.
2.  Converted dates and years into more useful features like `Age` and `Customer_Lifetime_Days`.
3.  Simplified complex categorical variables.
4.  Combined multiple columns to create powerful summary features like `Children`, `Total_Spend`, and `Total_Purchases`.
5.  Cleaned up our final DataFrame by dropping unneeded columns.

Our dataset is now fully prepared for the **Explore** phase.

**Next Session:** We will begin the Exploratory Data Analysis of our cleaned dataset to test our hypotheses.