# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [5]:
import pandas as pd

url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"
df = pd.read_csv(url)

print(df.head())


   Unnamed: 0 Customer       State  Customer Lifetime Value Response  \
0           0  DK49336     Arizona              4809.216960       No   
1           1  KX64629  California              2228.525238       No   
2           2  LZ68649  Washington             14947.917300       No   
3           3  XL78013      Oregon             22332.439460      Yes   
4           4  QA50777      Oregon              9025.067525       No   

   Coverage Education Effective To Date EmploymentStatus Gender  ...  \
0     Basic   College           2/18/11         Employed      M  ...   
1     Basic   College           1/18/11       Unemployed      F  ...   
2     Basic  Bachelor           2/10/11         Employed      M  ...   
3  Extended   College           1/11/11         Employed      M  ...   
4   Premium  Bachelor           1/17/11    Medical Leave      F  ...   

   Number of Open Complaints Number of Policies     Policy Type        Policy  \
0                        0.0                  9  Corp

In [7]:
# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Clean and convert 'customer_lifetime_value' and 'total_claim_amount' to numeric
df['customer_lifetime_value'] = pd.to_numeric(df['customer_lifetime_value'], errors='coerce')
df['total_claim_amount'] = pd.to_numeric(df['total_claim_amount'], errors='coerce')

print(df.head())


   unnamed:_0 customer       state  customer_lifetime_value response  \
0           0  DK49336     Arizona              4809.216960       No   
1           1  KX64629  California              2228.525238       No   
2           2  LZ68649  Washington             14947.917300       No   
3           3  XL78013      Oregon             22332.439460      Yes   
4           4  QA50777      Oregon              9025.067525       No   

   coverage education effective_to_date employmentstatus gender  ...  \
0     Basic   College           2/18/11         Employed      M  ...   
1     Basic   College           1/18/11       Unemployed      F  ...   
2     Basic  Bachelor           2/10/11         Employed      M  ...   
3  Extended   College           1/11/11         Employed      M  ...   
4   Premium  Bachelor           1/17/11    Medical Leave      F  ...   

   number_of_open_complaints number_of_policies     policy_type        policy  \
0                        0.0                  9  Corp
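
The `errors='coerce'` argument used in the cell above turns any value that cannot be parsed as a number into `NaN` instead of raising an error. A minimal sketch of that behavior, using a made-up Series (`raw` and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical raw values: two parseable strings and one that is not a number
raw = pd.Series(["4809.21", "n/a", "2228.52"])

# Unparseable entries become NaN rather than raising a ValueError
clean = pd.to_numeric(raw, errors="coerce")

# Counting the NaNs shows how many values were lost in the conversion
n_missing = clean.isna().sum()
print(clean)
print("values coerced to NaN:", n_missing)
```

Checking the NaN count after a coercion like this is a cheap way to confirm that the conversion did not silently discard data.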

In [9]:
# Filter DataFrame
filter1000_df = df[(df['total_claim_amount'] > 1000) & (df['response'] == 'Yes')]

# Preview the filtered DataFrame
print(filter1000_df.head())
print(filter1000_df.shape)  


     unnamed:_0 customer       state  customer_lifetime_value response  \
189         189  OK31456  California             11009.130490      Yes   
236         236  YJ16163      Oregon             11009.130490      Yes   
419         419  GW43195      Oregon             25807.063000      Yes   
442         442  IP94270     Arizona             13736.132500      Yes   
587         587  FJ28407  California              5619.689084      Yes   

     coverage             education effective_to_date employmentstatus gender  \
189   Premium              Bachelor           1/24/11         Employed      F   
236   Premium              Bachelor           1/24/11         Employed      F   
419  Extended               College           2/13/11         Employed      F   
442   Premium                Master           2/13/11         Disabled      F   
587   Premium  High School or Below           1/26/11       Unemployed      M   

     ...  number_of_open_complaints number_of_policies     policy_ty
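
The same two-condition filter can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. A small sketch on toy data (the `toy` frame and its rows are hypothetical; only the column names match the cleaned dataset):

```python
import pandas as pd

# Hypothetical toy rows using the cleaned column names from this notebook
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "total_claim_amount": [1200.0, 1500.0, 300.0, 1000.0],
    "response": ["Yes", "No", "Yes", "Yes"],
})

# Keep rows where the claim is strictly above 1000 AND the response is "Yes"
high_claim_yes = toy.query("total_claim_amount > 1000 and response == 'Yes'")
print(high_claim_yes)
```

Note that the comparison is strict, so a claim of exactly 1000 is excluded, just as with `df['total_claim_amount'] > 1000`.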

In [11]:
# 2-Filter for customers who responded "Yes"
yes_response_df = df[df['response'] == 'Yes']

# Group by 'policy_type' and 'gender' to find average 'total_claim_amount'
avg_by_policy_gender = yes_response_df.groupby(['policy_type', 'gender'])['total_claim_amount'].mean()

print(avg_by_policy_gender)


policy_type     gender
Corporate Auto  F         433.738499
                M         408.582459
Personal Auto   F         452.965929
                M         457.010178
Special Auto    F         453.280164
                M         429.527942
Name: total_claim_amount, dtype: float64


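This table shows how the average total claim amount differs across policy types and genders for customers who responded "Yes" to the last marketing campaign. The averages fall in a narrow band, from about 409 to about 457.
Men with Personal Auto policies have the highest mean claim (457.01). Among women, the highest mean is for Special Auto (453.28), only slightly above Personal Auto (452.97). Women average roughly 24 to 25 more than men in Corporate Auto and Special Auto, while men average about 4 more in Personal Auto. These are differences in means only, and group sizes are not shown, so they are a starting point for guiding marketing strategy rather than firm evidence.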
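
To make the gender comparison easier to read, the grouped result can be reshaped so each policy type is one row with a column per gender. The sketch below rebuilds the Series from the averages printed above rather than from the full dataset; the column name `f_minus_m` is introduced here for illustration only.

```python
import pandas as pd

# Averages copied from the printed output above
avg_by_policy_gender = pd.Series(
    {
        ("Corporate Auto", "F"): 433.738499,
        ("Corporate Auto", "M"): 408.582459,
        ("Personal Auto", "F"): 452.965929,
        ("Personal Auto", "M"): 457.010178,
        ("Special Auto", "F"): 453.280164,
        ("Special Auto", "M"): 429.527942,
    },
    name="total_claim_amount",
)
avg_by_policy_gender.index.names = ["policy_type", "gender"]

# One row per policy type, one column per gender
wide = avg_by_policy_gender.unstack("gender")

# Positive values mean women average higher claims than men
wide["f_minus_m"] = wide["F"] - wide["M"]
print(wide.round(2))
```

In the notebook itself, calling `.unstack("gender")` directly on `avg_by_policy_gender` should give the same layout.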

In [13]:
# 3- Group by 'state' and count customers in each state

customers_by_state = df.groupby('state')['customer'].count()

# Filter states with more than 500 customers
states_with_more_than_500 = customers_by_state[customers_by_state > 500]

print(states_with_more_than_500)


state
Arizona       1937
California    3552
Nevada         993
Oregon        2909
Washington     888
Name: customer, dtype: int64
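
An alternative way to get the same kind of count is `Series.value_counts`, which counts rows per state and sorts the result in descending order. The sketch below uses a hypothetical toy frame and a threshold of 1 instead of 500 so the filter has a visible effect:

```python
import pandas as pd

# Hypothetical toy data; only the 'state' column matters here
toy = pd.DataFrame(
    {"state": ["Oregon", "Oregon", "Oregon", "Nevada", "Arizona", "Arizona"]}
)

# Rows per state, sorted from most to least frequent
counts = toy["state"].value_counts()

# Keep states above the threshold (1 for the toy data, 500 in the lab)
frequent = counts[counts > 1]
print(frequent)
```

This counts rows, not distinct customers. If customer IDs could repeat in the data, `df.groupby('state')['customer'].nunique()` would count each customer once per state instead.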


In [15]:
# 4-Group by education and gender and calculate max, min, and median customer_lifetime_value
customerlv_by_education_gender = df.groupby(['education', 'gender'])['customer_lifetime_value'].agg(['max', 'min', 'median'])

# Print the result
print(customerlv_by_education_gender)


                                     max          min       median
education            gender                                       
Bachelor             F       73225.95652  1904.000852  5640.505303
                     M       67907.27050  1898.007675  5548.031892
College              F       61850.18803  1898.683686  5623.611187
                     M       61134.68307  1918.119700  6005.847375
Doctor               F       44856.11397  2395.570000  5332.462694
                     M       32677.34284  2267.604038  5577.669457
High School or Below F       55277.44589  2144.921535  6039.553187
                     M       83325.38119  1940.981221  6286.731006
Master               F       51016.06704  2417.777032  5729.855012
                     M       50568.25912  2272.307310  5579.099207


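This table shows how customer lifetime value (CLV) varies across education levels and genders. Medians are close across all groups, from about 5,332 (Doctor, F) to about 6,287 (High School or Below, M), and minimums all fall between roughly 1,900 and 2,420.
Maximums vary far more, from 32,677 (Doctor, M) to 83,325 (High School or Below, M). Each maximum is a single extreme value, so it says more about outliers than about typical CLV. The median differences are small, so they can inform personalized marketing and customer retention efforts but do not by themselves justify treating any education or gender group very differently.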
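
The groups with the highest and lowest values can be picked out programmatically instead of by eye. The sketch below rebuilds the summary table from the printed output above (it does not reload the dataset) and uses `idxmax` / `idxmin`; the variable names are illustrative.

```python
import pandas as pd

# (education, gender) -> (max, min, median), copied from the printed output above
rows = {
    ("Bachelor", "F"): (73225.95652, 1904.000852, 5640.505303),
    ("Bachelor", "M"): (67907.27050, 1898.007675, 5548.031892),
    ("College", "F"): (61850.18803, 1898.683686, 5623.611187),
    ("College", "M"): (61134.68307, 1918.119700, 6005.847375),
    ("Doctor", "F"): (44856.11397, 2395.570000, 5332.462694),
    ("Doctor", "M"): (32677.34284, 2267.604038, 5577.669457),
    ("High School or Below", "F"): (55277.44589, 2144.921535, 6039.553187),
    ("High School or Below", "M"): (83325.38119, 1940.981221, 6286.731006),
    ("Master", "F"): (51016.06704, 2417.777032, 5729.855012),
    ("Master", "M"): (50568.25912, 2272.307310, 5579.099207),
}
index = pd.MultiIndex.from_tuples(list(rows.keys()), names=["education", "gender"])
summary = pd.DataFrame(list(rows.values()), index=index, columns=["max", "min", "median"])

# Each lookup returns an (education, gender) tuple
top_median = summary["median"].idxmax()
low_median = summary["median"].idxmin()
top_max = summary["max"].idxmax()

print("highest median:", top_median)
print("lowest median:", low_median)
print("highest maximum:", top_max)
```

In the notebook, the same calls should work directly on `customerlv_by_education_gender`.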