# Session 67: Capstone Project Part 2 (Exploratory Data Analysis)

**Unit 6: Data Ethics, Privacy, and Future Trends**
**Hour: 67**
**Mode: Practical Project**

---

### 1. Objective

This session marks the beginning of the **Explore** phase of our capstone project. We will use the cleaned and engineered dataset from the previous session to calculate descriptive statistics and perform grouped analysis to start testing our initial hypotheses.

### 2. Setup

For this lab, we need to recreate our `df_clean` DataFrame by running all the steps from the previous session.

In [None]:
import pandas as pd

# --- Start of Cleaning and Feature Engineering Code ---
url = 'https://raw.githubusercontent.com/LeoFernan/Marketing-Campaigns-Analysis/main/marketing_campaign.csv'
df = pd.read_csv(url, sep='\t')

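# Assign the result back rather than calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify df.
df['Income'] = df['Income'].fillna(df['Income'].median())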
df['Age'] = 2024 - df['Year_Birth']
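# Assumes Dt_Customer is stored day-first (DD-MM-YYYY); adjust if your copy of the file differs.
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'], dayfirst=True)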
df['Customer_Lifetime_Days'] = (pd.to_datetime('2024-01-01') - df['Dt_Customer']).dt.days
df['Relationship'] = df['Marital_Status'].replace({'Married': 'In Relationship', 'Together': 'In Relationship', 'Single': 'Single', 'Divorced': 'Single', 'Widow': 'Single', 'Alone': 'Single', 'Absurd': 'Single', 'YOLO': 'Single'})
df['Education_Level'] = df['Education'].replace({'Basic': 'Undergraduate', '2n Cycle': 'Graduate', 'Graduation': 'Graduate', 'Master': 'Postgraduate', 'PhD': 'Postgraduate'})
df['Children'] = df['Kidhome'] + df['Teenhome']
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df['Total_Spend'] = df[spend_cols].sum(axis=1)
purchase_cols = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']
df['Total_Purchases'] = df[purchase_cols].sum(axis=1)
cols_to_drop = ['Year_Birth', 'Dt_Customer', 'Marital_Status', 'Education', 'Kidhome', 'Teenhome', 'Z_CostContact', 'Z_Revenue']
df_clean = df.drop(columns=cols_to_drop)
# --- End of Cleaning and Feature Engineering Code ---

print("Clean DataFrame is ready.")
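
Before testing the hypotheses, it can help to check that the rebuilt DataFrame looks as expected. The sketch below shows one way to do this with `.shape`, `.isna().sum()`, and `.describe()`. It runs on a small stand-in DataFrame called `toy` (a hypothetical name with invented values) so that it works without downloading anything; in the lab you would apply the same three calls to `df_clean`.

```python
import pandas as pd

# Toy stand-in for df_clean; the values are invented for illustration only.
toy = pd.DataFrame({
    "Income": [52000.0, 61000.0, None, 48000.0],
    "Recency": [10, 45, 80, 23],
    "Response": [1, 0, 0, 1],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape

# Missing values per column; after cleaning, Income should report 0.
missing_per_column = toy.isna().sum()

# Count, mean, std, min, quartiles, and max for every numeric column.
summary = toy.describe()

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
```

In the toy data, `Income` deliberately has one missing value, so the check flags it. On `df_clean`, a non-zero count for `Income` would suggest the median fill in the setup cell did not run.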

### 3. Exploratory Analysis: Testing Our Hypotheses

Let's use `.groupby()` and descriptive statistics to investigate the hypotheses we formulated in Session 64.

#### 3.1. Hypothesis 1: "Customers with higher `Income` will be more likely to respond."

Let's group by our target variable `Response` and check the average income for each group.

In [None]:
df_clean.groupby('Response')['Income'].mean()

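**Finding:** **Hypothesis Confirmed.** The average income of customers who responded (`1`) is substantially higher (~$68,345) than that of customers who did not respond (`0`) (~$49,431). Keep in mind that this compares group means only; it is not a formal test of statistical significance.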
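
A mean on its own can be pulled around by a few extreme values, so it is often worth looking at the group size and the median next to it. The sketch below shows how `.agg()` returns several statistics in one call. The `toy` DataFrame and its values are invented for illustration; the same call should work on `df_clean` with the real `Income` column.

```python
import pandas as pd

# Toy data: one very large income among the non-responders (values are invented).
toy = pd.DataFrame({
    "Response": [0, 0, 0, 0, 1, 1, 1],
    "Income": [30000, 40000, 50000, 200000, 60000, 70000, 80000],
})

# Several statistics per group in one call.
income_by_response = toy.groupby("Response")["Income"].agg(["count", "mean", "median"])
print(income_by_response)
```

In this toy example the single large value makes the non-responder mean (80,000) exceed the responder mean (70,000), while the medians (45,000 versus 70,000) point the other way. If the mean and median of a group disagree like this on the real data, the median is usually the safer summary.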

#### 3.2. Hypothesis 2: "Customers who have purchased more recently (`Recency` is low) will be more likely to respond."

Let's group by `Response` and check the average `Recency`.

In [None]:
df_clean.groupby('Response')['Recency'].mean()

**Finding:** **Hypothesis Confirmed.** The average recency for responders is lower (~35 days) than for non-responders (~52 days), suggesting that recently active customers are indeed more likely to accept an offer.
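
A difference in group means does not tell us whether the gap could have arisen by chance. One simple way to check is a permutation test: shuffle the `Response` labels many times and see how often a shuffled gap is at least as large as the observed one. The sketch below is a minimal version of that idea, not a method prescribed by this course. The helper `permutation_p_value` and the `toy` data are hypothetical names and invented values.

```python
import numpy as np
import pandas as pd

def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means between labels 1 and 0."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    observed = values[labels == 1].mean() - values[labels == 0].mean()
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the labels breaks any real link between label and value.
        shuffled = rng.permutation(labels)
        diff = values[shuffled == 1].mean() - values[shuffled == 0].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Toy data (invented): responders have clearly lower Recency than non-responders.
toy = pd.DataFrame({
    "Response": [0] * 6 + [1] * 6,
    "Recency": [60, 55, 70, 48, 66, 52, 20, 35, 28, 40, 31, 25],
})

observed, p_value = permutation_p_value(toy["Recency"], toy["Response"])
print(f"difference in means: {observed:.1f}, p-value: {p_value:.4f}")
```

On the real data you would pass `df_clean['Recency']` and `df_clean['Response']`. A small p-value would support reading the gap as more than noise.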

#### 3.3. Hypothesis 3: "Customers with kids at home (`Children` > 0) will be less likely to respond."

This is a bit more complex. Let's group by the number of children and look at the response rate for each group. Because `Response` is coded as 0 or 1, its *mean* within a group is that group's response rate; multiplying by 100 expresses it as a percentage.

In [None]:
df_clean.groupby('Children')['Response'].mean() * 100

**Finding:** **Hypothesis Confirmed.** The response rate for customers with 0 children (~21%) is roughly double that of customers with 1, 2, or 3 children (around 9-11%).
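
A response rate is only as trustworthy as the number of customers behind it, so it helps to report the group size alongside the percentage. The sketch below uses named aggregation to return both in one table. The `toy` DataFrame and the output column names `response_rate` and `customers` are hypothetical choices for illustration.

```python
import pandas as pd

# Toy data (invented): only one customer has 3 children.
toy = pd.DataFrame({
    "Children": [0, 0, 0, 0, 1, 1, 1, 2, 2, 3],
    "Response": [1, 0, 1, 0, 0, 0, 1, 0, 0, 1],
})

# Named aggregation: each keyword becomes an output column.
rate_by_children = toy.groupby("Children")["Response"].agg(
    response_rate="mean",
    customers="count",
)
rate_by_children["response_rate"] *= 100  # express as a percentage
print(rate_by_children)
```

In the toy data the 3-children group shows a 100% response rate, but it rests on a single customer. The same caution applies to any small group in `df_clean`.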

### 4. Further Exploration: What else can we find?

Let's explore some of our other new features.

**Question:** Do customers who spend more overall respond more often?

In [None]:
df_clean.groupby('Response')['Total_Spend'].mean()

**Finding:** Yes, and the signal is strong: responders spend, on average, more than double what non-responders do.
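
The comparison above looks at average spend per response group. To answer the question in the other direction (does the response rate rise with spend?), one option is to bin customers into spend quartiles with `pd.qcut` and compute the response rate per bin. This is a sketch on invented `toy` data; the column name `Spend_Quartile` and the labels `Q1` to `Q4` are hypothetical.

```python
import pandas as pd

# Toy data (invented), ordered by spend for readability.
toy = pd.DataFrame({
    "Total_Spend": [50, 80, 120, 200, 400, 650, 900, 1200],
    "Response":    [0, 0, 0, 1, 0, 0, 1, 1],
})

# Split customers into four equally sized spend groups.
toy["Spend_Quartile"] = pd.qcut(toy["Total_Spend"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Response rate (in percent) within each quartile.
rate_by_quartile = toy.groupby("Spend_Quartile", observed=True)["Response"].mean() * 100
print(rate_by_quartile)
```

If the real data shows a rate that climbs steadily from `Q1` to `Q4`, that would support treating `Total_Spend` as a useful indicator.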

**Question:** What about our simplified `Relationship` status?

In [None]:
df_clean.groupby('Relationship')['Response'].mean() * 100

**Finding:** This is interesting. The response rate for 'Single' customers is slightly higher than for those 'In Relationship', but the difference is not as dramatic as with income or children.
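
When both variables are categorical, `pd.crosstab` is an alternative to `.groupby()` that shows the raw counts and the row percentages side by side. The sketch below uses invented `toy` data; on the real data you would pass `df_clean['Relationship']` and `df_clean['Response']`.

```python
import pandas as pd

# Toy data (invented): 4 single customers and 5 in a relationship.
toy = pd.DataFrame({
    "Relationship": ["Single"] * 4 + ["In Relationship"] * 5,
    "Response": [1, 0, 0, 1, 0, 0, 0, 1, 0],
})

# Raw counts of each Response value per Relationship group.
counts = pd.crosstab(toy["Relationship"], toy["Response"])

# normalize="index" divides each row by its total, giving within-group shares.
row_pct = pd.crosstab(toy["Relationship"], toy["Response"], normalize="index") * 100

print(counts)
print(row_pct.round(1))
```

The column labelled `1` in `row_pct` is the response rate for each group, and the counts table shows how many customers each rate is based on.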

### 5. Conclusion

Our initial Exploratory Data Analysis has been very successful. Using simple grouped statistics, we have:
1.  Confirmed all three of our initial hypotheses.
2.  Uncovered several other strong indicators of a positive response, especially `Total_Spend`.

We can now define a preliminary profile of a high-propensity customer:
*   They have a high income.
*   They have purchased recently.
*   They have no children.
*   They are high spenders overall.
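
One way to sanity-check this profile is to flag the customers who match all four traits and compare their response rate with everyone else's. The sketch below does this on invented `toy` data. The cut-offs (above-median income and spend, below-median recency) are assumptions chosen for illustration, not thresholds established in this session, and the `Segment` labels are hypothetical.

```python
import pandas as pd

# Toy data (invented) with the four profile features and the target.
toy = pd.DataFrame({
    "Income":      [82000, 75000, 30000, 41000, 90000, 38000, 67000, 52000],
    "Recency":     [5, 12, 80, 60, 20, 90, 15, 70],
    "Children":    [0, 0, 2, 1, 0, 1, 0, 3],
    "Total_Spend": [1500, 1100, 90, 150, 1800, 60, 900, 300],
    "Response":    [1, 1, 0, 0, 0, 0, 1, 0],
})

# Assumed cut-offs: medians split "high" from "low" for each numeric feature.
matches_profile = (
    (toy["Income"] > toy["Income"].median())
    & (toy["Recency"] < toy["Recency"].median())
    & (toy["Children"] == 0)
    & (toy["Total_Spend"] > toy["Total_Spend"].median())
)

# Readable segment labels instead of True/False.
segment = matches_profile.map({True: "Matches profile", False: "Other"}).rename("Segment")

# Response rate and size of each segment.
rate_by_segment = toy.groupby(segment)["Response"].agg(["mean", "count"])
print(rate_by_segment)
```

If the profile is useful, the "Matches profile" row should show a clearly higher mean than "Other" on the real data, with enough customers in it to be meaningful.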

**Next Session:** We will use data visualization to make these numerical findings more intuitive and compelling.