# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use `melt` to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

---

## Step 1

In [1]:
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"
df = pd.read_csv(url)

# Check the first few rows of the dataset
df.head()

|   | Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | ... | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size | Vehicle Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2/18/11 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | NaN |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 1/18/11 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | NaN |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2/10/11 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 1/11/11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 1/17/11 | Medical Leave | F | ... | NaN | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | NaN |


In [2]:
# Filter the DataFrame
filtered_df = df[(df['Total Claim Amount'] > 1000) & (df['Response'] == 'Yes')]

# Display the filtered DataFrame
filtered_df.head()

|   | Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | ... | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size | Vehicle Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | 189 | OK31456 | California | 11009.13049 | Yes | Premium | Bachelor | 1/24/11 | Employed | F | ... | 0.0 | 1 | Corporate Auto | Corporate L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | NaN |
| 236 | 236 | YJ16163 | Oregon | 11009.13049 | Yes | Premium | Bachelor | 1/24/11 | Employed | F | ... | 0.0 | 1 | Special Auto | Special L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | A |
| 419 | 419 | GW43195 | Oregon | 25807.063 | Yes | Extended | College | 2/13/11 | Employed | F | ... | 1.0 | 2 | Personal Auto | Personal L2 | Offer1 | Branch | 1027.2 | Luxury Car | Small | A |
| 442 | 442 | IP94270 | Arizona | 13736.1325 | Yes | Premium | Master | 2/13/11 | Disabled | F | ... | 0.0 | 8 | Personal Auto | Personal L2 | Offer1 | Web | 1261.319869 | SUV | Medsize | A |
| 587 | 587 | FJ28407 | California | 5619.689084 | Yes | Premium | High School or Below | 1/26/11 | Unemployed | M | ... | 0.0 | 1 | Personal Auto | Personal L1 | Offer2 | Web | 1027.000029 | SUV | Medsize | A |
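
The same two-condition filter can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. The sketch below is only an illustration: the `toy` DataFrame is made-up data, not rows from the real dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Customer": ["A1", "B2", "C3", "D4"],
    "Total Claim Amount": [1200.0, 1500.0, 800.0, 1000.0],
    "Response": ["Yes", "No", "Yes", "Yes"],
})

# Backticks let query() refer to column names that contain spaces
high_claim_yes = toy.query("`Total Claim Amount` > 1000 and Response == 'Yes'")

print(high_claim_yes)
```

Only customer `A1` passes both conditions here; `D4` is excluded because the comparison is strictly greater than 1000.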


---

## Step 2

In [8]:
# Filter the original DataFrame for customers who responded "Yes"
yes_response_df = df[df['Response'] == 'Yes']

# Group by 'Policy Type' and 'Gender', then calculate the mean of 'Total Claim Amount'
average_claims = yes_response_df.groupby(['Policy Type', 'Gender'])['Total Claim Amount'].mean().reset_index()

# Display the result
average_claims

|   | Policy Type | Gender | Total Claim Amount |
|---|---|---|---|
| 0 | Corporate Auto | F | 433.738499 |
| 1 | Corporate Auto | M | 408.582459 |
| 2 | Personal Auto | F | 452.965929 |
| 3 | Personal Auto | M | 457.010178 |
| 4 | Special Auto | F | 453.280164 |
| 5 | Special Auto | M | 429.527942 |


### Conclusions

1. **Gender Differences**:
   - For **Corporate Auto** and **Special Auto** policies, **females** have a higher average claim amount than **males** (by about $25 and $24, respectively).
   - For **Personal Auto** policies, the pattern reverses: **males** average about $4 more than **females**.

2. **Policy Type Differences**:
   - Within each gender, **Personal Auto** and **Special Auto** policies have **higher average claim amounts** than **Corporate Auto**.
   - The gender gaps within each policy type are small (roughly $4 to $25, at most about 6% of the average claim). No significance test was run here, so these differences should be read as descriptive only.
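
When comparing group means this close together, it can help to see how many customers sit behind each average. A minimal sketch, using invented toy data and hypothetical variable names, that returns the mean and the group size side by side:

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Policy Type": ["Personal Auto", "Personal Auto", "Personal Auto",
                    "Special Auto", "Special Auto"],
    "Gender": ["F", "F", "M", "F", "M"],
    "Response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "Total Claim Amount": [400.0, 500.0, 460.0, 450.0, 430.0],
})

# Keep only the "Yes" responders, as in the step above
yes_only = toy[toy["Response"] == "Yes"]

# Compute the mean and the number of rows behind each mean
summary = (
    yes_only.groupby(["Policy Type", "Gender"])["Total Claim Amount"]
    .agg(["mean", "count"])
    .reset_index()
)

print(summary)
```

A mean backed by only a handful of rows deserves less weight than one backed by hundreds, which the `count` column makes visible.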

---

## Step 3

In [11]:
# Group by state and count the number of customers
state_counts = df['State'].value_counts().reset_index()

# Rename columns for clarity
state_counts.columns = ['State', 'Customer Count']

# Display the result
state_counts

|   | State | Customer Count |
|---|---|---|
| 0 | California | 3552 |
| 1 | Oregon | 2909 |
| 2 | Arizona | 1937 |
| 3 | Nevada | 993 |
| 4 | Washington | 888 |


In [None]:
# Filter to include only states with more than 500 customers
filtered_states = state_counts[state_counts['Customer Count'] > 500]

# Display the filtered result (all five states have more than 500 customers, so no rows are dropped)
filtered_states

|   | State | Customer Count |
|---|---|---|
| 0 | California | 3552 |
| 1 | Oregon | 2909 |
| 2 | Arizona | 1937 |
| 3 | Nevada | 993 |
| 4 | Washington | 888 |
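
Because every state in the real data clears the 500 threshold, the filter's effect is not visible above. The sketch below uses invented toy data and a lower, hypothetical `threshold` so that one state is actually dropped; it filters the `value_counts()` Series directly instead of building a two-column DataFrame first.

```python
import pandas as pd

# Toy data, invented for illustration: 4, 3 and 1 customers per state
toy = pd.DataFrame({
    "State": ["California"] * 4 + ["Oregon"] * 3 + ["Nevada"] * 1,
})

threshold = 2  # hypothetical stand-in for the lab's 500

# value_counts() returns a Series indexed by state, sorted by count
counts = toy["State"].value_counts()

# Boolean indexing keeps only the states above the threshold
large_states = counts[counts > threshold]

print(large_states)
```

Here Nevada is removed because it has only one customer, while California and Oregon remain.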


---

## Step 4

In [15]:
# Group by education level and gender
clv_stats = df.groupby(['Education', 'Gender'])['Customer Lifetime Value'].agg(['max', 'min', 'median']).reset_index()

# Display the result
clv_stats

|   | Education | Gender | max | min | median |
|---|---|---|---|---|---|
| 0 | Bachelor | F | 73225.95652 | 1904.000852 | 5640.505303 |
| 1 | Bachelor | M | 67907.2705 | 1898.007675 | 5548.031892 |
| 2 | College | F | 61850.18803 | 1898.683686 | 5623.611187 |
| 3 | College | M | 61134.68307 | 1918.1197 | 6005.847375 |
| 4 | Doctor | F | 44856.11397 | 2395.57 | 5332.462694 |
| 5 | Doctor | M | 32677.34284 | 2267.604038 | 5577.669457 |
| 6 | High School or Below | F | 55277.44589 | 2144.921535 | 6039.553187 |
| 7 | High School or Below | M | 83325.38119 | 1940.981221 | 6286.731006 |
| 8 | Master | F | 51016.06704 | 2417.777032 | 5729.855012 |
| 9 | Master | M | 50568.25912 | 2272.30731 | 5579.099207 |


### Conclusions

1. **Gender Differences**:
   - In four of the five education levels, **females** have the higher **maximum CLV**. The gap ranges from under $1,000 (College, Master) to about $12,000 (Doctor). The exception is **High School or Below**, where **males** reach the highest overall maximum ($83,325.38).
   - **Median CLVs** are fairly close between genders. **Males** have the higher median in three of the five levels (**College**, **Doctor**, and **High School or Below**).

2. **Education Differences**:
   - The **highest maximum CLV** comes from customers with **High School or Below** education, not from those with higher degrees.
   - The highest **median CLVs** are in **High School or Below** (both genders) and among **College** males, so median customer lifetime value does not rise with education level.

3. **Insights**:
   - These grouped statistics show no clear tendency for customer lifetime value to increase with education. No correlation was computed here.
   - Customers with **High School or Below** education may represent a high-value segment worth analyzing further.
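
The generic `max`, `min`, and `median` column names can be made more descriptive with pandas named aggregation. The following is a sketch on invented toy data; the output column names (`max_clv`, `min_clv`, `median_clv`, `range_clv`) are hypothetical choices, not names from the lab.

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Education": ["College", "College", "College", "Master", "Master"],
    "Gender": ["F", "F", "F", "M", "M"],
    "Customer Lifetime Value": [2000.0, 5000.0, 9000.0, 3000.0, 7000.0],
})

# Named aggregation: each keyword becomes an output column name
clv_summary = (
    toy.groupby(["Education", "Gender"])["Customer Lifetime Value"]
    .agg(max_clv="max", min_clv="min", median_clv="median")
    .reset_index()
)

# The spread between max and min shows how much a single customer can move the maximum
clv_summary["range_clv"] = clv_summary["max_clv"] - clv_summary["min_clv"]

print(clv_summary)
```

A wide range next to a modest median is a hint that the maximum comes from a few outlying customers, which matters when interpreting the maximum values in the conclusions above.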

---

## Step 5

In [17]:
# Convert the 'Effective To Date' column to datetime format
df['Effective To Date'] = pd.to_datetime(df['Effective To Date'])

# Extract the month name from the date
df['Month'] = df['Effective To Date'].dt.strftime('%B')

# Create the pivot table
policy_pivot = df.pivot_table(index='State', 
                               columns='Month', 
                               values='Policy', 
                               aggfunc='count',
                               fill_value=0)

# Optional: sort months in calendar order, keeping only the months present in the data
# (reindexing on all twelve months would add empty columns for months with no policies)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
present_months = [month for month in month_order if month in policy_pivot.columns]
policy_pivot = policy_pivot.reindex(columns=present_months)

# Display the pivot table
policy_pivot

| State | January | February |
|---|---|---|
| Arizona | 1008 | 929 |
| California | 1918 | 1634 |
| Nevada | 551 | 442 |
| Oregon | 1565 | 1344 |
| Washington | 463 | 425 |
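
An alternative way to get calendar-ordered month columns is to count by month number and convert the numbers to names afterwards, since integers already sort in calendar order. This is a sketch on invented toy data; it assumes the dates follow the month/day/two-digit-year pattern seen in the sample rows.

```python
import calendar

import pandas as pd

# Toy data, invented for illustration; dates mimic the m/d/yy style of the dataset
toy = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Arizona", "Arizona", "Arizona"],
    "Effective To Date": ["2/18/11", "1/18/11", "2/10/11", "1/11/11", "1/17/11"],
})

# Assumption: every date uses the month/day/two-digit-year format
toy["Effective To Date"] = pd.to_datetime(toy["Effective To Date"], format="%m/%d/%y")

# Cross-tabulate states against month numbers (1, 2, ...), which sort in calendar order
by_month = pd.crosstab(toy["State"], toy["Effective To Date"].dt.month)

# Replace the month numbers with month names for display
by_month.columns = [calendar.month_name[m] for m in by_month.columns]

print(by_month)
```

Only months that occur in the data appear as columns, so no empty columns are created.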


---

## Step 6

In [18]:
# 1: Group by state and count total number of policies sold
total_policies_by_state = df.groupby('State')['Policy'].count().sort_values(ascending=False)

# 2: Identify the top 3 states with the most policies sold
top_3_states = total_policies_by_state.head(3).index.tolist()

# 3: Filter the DataFrame to include only those top 3 states
# (a new name is used so the Step 1 result in filtered_df is not overwritten)
top_states_df = df[df['State'].isin(top_3_states)]

# 4: Group by state and month, count policies sold
top_states_monthly_sales = top_states_df.groupby(['State', 'Month'])['Policy'].count().reset_index()

# 5: Pivot the table to show months as columns and states as rows
final_df = top_states_monthly_sales.pivot(index='State', columns='Month', values='Policy').fillna(0)

# Optional: Reorder columns to match calendar order, keeping only the months present in the data
# (reindexing on all twelve months would add empty columns after the fillna above)
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
present_months = [month for month in month_order if month in final_df.columns]
final_df = final_df.reindex(columns=present_months)

# Display final DataFrame
final_df

| State | January | February |
|---|---|---|
| Arizona | 1008 | 929 |
| California | 1918 | 1634 |
| Oregon | 1565 | 1344 |
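
The hint for this task suggests grouping by state and month first and only then picking the top 3 states. A minimal sketch of that order of operations, on invented toy data with hypothetical variable names:

```python
import pandas as pd

# Toy data, invented for illustration: one row per policy sold
rows = (
    [("California", "January")] * 3 + [("California", "February")] * 2
    + [("Oregon", "January")] * 2 + [("Oregon", "February")] * 2
    + [("Arizona", "January")] * 1 + [("Arizona", "February")] * 2
    + [("Nevada", "January")] * 1 + [("Nevada", "February")] * 1
    + [("Washington", "January")] * 1
)
toy = pd.DataFrame(rows, columns=["State", "Month"])

# 1: Count policies per state and month
monthly = toy.groupby(["State", "Month"]).size().reset_index(name="Policies")

# 2: Sum the monthly counts per state and keep the 3 largest totals
top_states = monthly.groupby("State")["Policies"].sum().nlargest(3).index

# 3: Keep the monthly rows for those states, largest counts first
top_monthly = (
    monthly[monthly["State"].isin(top_states)]
    .sort_values("Policies", ascending=False)
)

# 4: Optional wide layout with states as rows and months as columns
top_table = top_monthly.pivot(index="State", columns="Month", values="Policies")

print(top_table)
```

Both orders of operations give the same table; grouping first simply reuses one set of monthly counts for the ranking and for the final result.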


---

## Step 7

In [27]:
# Step 1: Select only the relevant columns
channel_response = df[['Sales Channel', 'Response']]

# Step 2: Group by sales channel and response, count each combination
response_counts = channel_response.groupby(['Sales Channel', 'Response']).size().unstack(fill_value=0)

# Step 3: Calculate the response rate (Yes / total)
response_counts['response_rate'] = (
    response_counts['Yes'] / (response_counts['Yes'] + response_counts['No']) * 100
).round(1)  # percentage, rounded after scaling to avoid floating-point noise

# Display result
response_counts[['response_rate']]

| Sales Channel | response_rate |
|---|---|
| Agent | 19.1 |
| Branch | 11.4 |
| Call Center | 11.0 |
| Web | 11.7 |
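
The task hint mentions `melt`, which the solution above does not use. One possible way to bring it in, sketched on invented toy data with hypothetical column names (`Customers`, `Rate (%)`), is to unpivot the per-channel counts into a long table and compute each response's share within its channel:

```python
import pandas as pd

# Toy data, invented for illustration (not rows from the real dataset)
toy = pd.DataFrame({
    "Sales Channel": ["Agent"] * 4 + ["Web"] * 5,
    "Response": ["Yes", "No", "No", "No", "Yes", "Yes", "No", "No", "No"],
})

# Wide table: one row per channel, one column per response value
counts = pd.crosstab(toy["Sales Channel"], toy["Response"]).reset_index()

# melt() unpivots the 'No' and 'Yes' columns into (Response, Customers) pairs
long_counts = counts.melt(
    id_vars="Sales Channel", var_name="Response", value_name="Customers"
)

# Share of each response within its channel, as a percentage
channel_totals = long_counts.groupby("Sales Channel")["Customers"].transform("sum")
long_counts["Rate (%)"] = (long_counts["Customers"] / channel_totals * 100).round(1)

# Keep only the "Yes" rows to get the response rate per channel
yes_rates = long_counts[long_counts["Response"] == "Yes"]

print(yes_rates)
```

The long format keeps the "No" share alongside the "Yes" share, which can be convenient for plotting both per channel.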
