# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [1]:
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"
df = pd.read_csv(url)

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,


In [2]:
# Column names should be in lower case
df.columns = [col.lower() for col in df.columns]
# White spaces in column names should be replaced by _
df.columns = [col.replace(" ", "_") for col in df.columns]
# Rename 'st' to 'state' (a no-op here: this dataset already has a 'state' column)
df.rename(columns={'st': 'state'}, inplace=True)
print(df.columns)

Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type'],
      dtype='object')
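
The same normalization can also be written with the vectorized string methods on the column index. The following is a minimal sketch on a toy frame; the name `toy` and its three columns are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame (hypothetical) with column names shaped like the ones in the dataset
toy = pd.DataFrame(columns=["Customer", "Customer Lifetime Value", "Policy Type"])

# Lower-case every name, then replace spaces with underscores
toy.columns = toy.columns.str.lower().str.replace(" ", "_", regex=False)

cleaned_names = list(toy.columns)
print(cleaned_names)
```

Chaining `.str` calls avoids rebuilding the list twice and keeps the cleaning rule in one place.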


In [3]:
df["state"].value_counts()

state
California    3552
Oregon        2909
Arizona       1937
Nevada         993
Washington     888
Name: count, dtype: int64

In [4]:
df["education"].unique()

array(['College', 'Bachelor', 'High School or Below', 'Doctor', 'Master'],
      dtype=object)

In [5]:
df["vehicle_class"].unique()

array(['Four-Door Car', 'SUV', 'Two-Door Car', 'Sports Car', 'Luxury Car',
       'Luxury SUV', nan], dtype=object)

In [6]:
df.dtypes

unnamed:_0                         int64
customer                          object
state                             object
customer_lifetime_value          float64
response                          object
coverage                          object
education                         object
effective_to_date                 object
employmentstatus                  object
gender                            object
income                             int64
location_code                     object
marital_status                    object
monthly_premium_auto               int64
months_since_last_claim          float64
months_since_policy_inception      int64
number_of_open_complaints        float64
number_of_policies                 int64
policy_type                       object
policy                            object
renew_offer_type                  object
sales_channel                     object
total_claim_amount               float64
vehicle_class                     object
vehicle_size    

In [7]:
df["number_of_open_complaints"].unique()

array([ 0., nan,  3.,  1.,  2.,  4.,  5.])

In [8]:
#Identify any columns with null or missing values. 
#Identify how many null values each column has. 
#You can use the isnull() function in pandas to find columns with null values.
df.isnull().sum()

unnamed:_0                          0
customer                            0
state                             631
customer_lifetime_value             0
response                          631
coverage                            0
education                           0
effective_to_date                   0
employmentstatus                    0
gender                              0
income                              0
location_code                       0
marital_status                      0
monthly_premium_auto                0
months_since_last_claim           633
months_since_policy_inception       0
number_of_open_complaints         633
number_of_policies                  0
policy_type                         0
policy                              0
renew_offer_type                    0
sales_channel                       0
total_claim_amount                  0
vehicle_class                     622
vehicle_size                      622
vehicle_type                     5482
dtype: int64

In [9]:
# Drop every row that has at least one missing value in any column
df.dropna(inplace=True)
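
Calling `dropna()` with no arguments removes every row that has a missing value in any column, so a sparsely populated column such as `vehicle_type` (5482 missing values in the count above) largely decides how many rows survive. The sketch below, on a small made-up frame, contrasts that with restricting the check to selected columns through `subset`; the toy data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row lacks a state, two rows lack a vehicle type
toy = pd.DataFrame({
    "state": ["Oregon", None, "Nevada", "Arizona"],
    "vehicle_type": ["A", "A", np.nan, np.nan],
    "total_claim_amount": [484.0, 120.5, 897.1, 355.0],
})

# Drop a row if ANY column is missing
rows_all = len(toy.dropna())

# Drop a row only if `state` is missing
rows_subset = len(toy.dropna(subset=["state"]))

print(rows_all, rows_subset)
```

Which variant is appropriate depends on whether the analysis actually needs the sparse column.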

In [10]:
#You can use the fillna() function in pandas to fill null values or dropna() function to drop null values.
df.isnull().sum()

unnamed:_0                       0
customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
vehicle_class                    0
vehicle_size                     0
vehicle_type                     0
dtype: int64

In [11]:
#After formatting data types, as a last step, convert all the numeric variables to integers.
df.dtypes

unnamed:_0                         int64
customer                          object
state                             object
customer_lifetime_value          float64
response                          object
coverage                          object
education                         object
effective_to_date                 object
employmentstatus                  object
gender                            object
income                             int64
location_code                     object
marital_status                    object
monthly_premium_auto               int64
months_since_last_claim          float64
months_since_policy_inception      int64
number_of_open_complaints        float64
number_of_policies                 int64
policy_type                       object
policy                            object
renew_offer_type                  object
sales_channel                     object
total_claim_amount               float64
vehicle_class                     object
vehicle_size    

In [12]:
# Convert the numeric columns to integers (astype(int) drops the decimal part)
df['customer_lifetime_value'] = df['customer_lifetime_value'].astype(int)
# Typo in the column name: this line adds an extra column (the last one in the outputs below)
# instead of converting an existing one
df['ttotal_claim_amountotal_claim_amount'] = df['total_claim_amount'].astype(int)
df['months_since_last_claim'] = df['months_since_last_claim'].astype(int)
df['number_of_open_complaints'] = df['number_of_open_complaints'].astype(int)
df['total_claim_amount'] = df['total_claim_amount'].astype(int)
df['months_since_policy_inception'] = df['months_since_policy_inception'].astype(int)
df

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,ttotal_claim_amountotal_claim_amount
2,2,LZ68649,Washington,14947,No,Basic,Bachelor,2/10/11,Employed,M,...,2,Personal Auto,Personal L3,Offer3,Call Center,480,SUV,Medsize,A,480
3,3,XL78013,Oregon,22332,Yes,Extended,College,1/11/11,Employed,M,...,2,Corporate Auto,Corporate L3,Offer2,Branch,484,Four-Door Car,Medsize,A,484
10,10,HG93801,Arizona,5154,No,Extended,High School or Below,1/2/11,Employed,M,...,1,Corporate Auto,Corporate L3,Offer2,Branch,442,SUV,Large,A,442
13,13,KR82385,California,5454,No,Basic,Master,1/26/11,Employed,M,...,4,Personal Auto,Personal L3,Offer4,Call Center,331,Two-Door Car,Medsize,A,331
16,16,FH51383,California,5326,No,Basic,High School or Below,2/7/11,Employed,F,...,6,Personal Auto,Personal L3,Offer4,Call Center,300,Two-Door Car,Large,A,300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10902,10902,PP30874,California,3579,No,Extended,High School or Below,1/24/11,Employed,F,...,1,Personal Auto,Personal L2,Offer2,Agent,655,Four-Door Car,Medsize,A,655
10903,10903,SU71163,Arizona,2771,No,Basic,College,1/7/11,Employed,M,...,1,Personal Auto,Personal L2,Offer2,Branch,355,Two-Door Car,Medsize,A,355
10904,10904,QI63521,Nevada,19228,No,Basic,High School or Below,2/24/11,Unemployed,M,...,2,Personal Auto,Personal L2,Offer1,Branch,897,Luxury SUV,Medsize,A,897
10906,10906,KX53892,Oregon,5259,No,Basic,College,1/6/11,Employed,F,...,6,Personal Auto,Personal L3,Offer2,Branch,273,Four-Door Car,Medsize,A,273
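
Note that `astype(int)` truncates rather than rounds, which is why 744.924331 would become 744. A minimal sketch of the difference, using three values taken from the `df.head()` output near the top of the notebook (the variable names are illustrative):

```python
import pandas as pd

values = pd.Series([4809.21696, 744.924331, 707.925645])

# astype(int) simply discards the fractional part
truncated = values.astype(int)

# Rounding first gives the nearest integer instead
rounded = values.round().astype(int)

print(truncated.tolist(), rounded.tolist())
```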


In [13]:
df.dtypes

unnamed:_0                               int64
customer                                object
state                                   object
customer_lifetime_value                  int32
response                                object
coverage                                object
education                               object
effective_to_date                       object
employmentstatus                        object
gender                                  object
income                                   int64
location_code                           object
marital_status                          object
monthly_premium_auto                     int64
months_since_last_claim                  int32
months_since_policy_inception            int32
number_of_open_complaints                int32
number_of_policies                       int64
policy_type                             object
policy                                  object
renew_offer_type                        object
sales_channel

In [14]:
#Use the .duplicated() method to identify any duplicate rows in the dataframe
df.duplicated().sum()  # There are no duplicates

0

In [15]:
#after dropping duplicates, reset the index to ensure consistency.
df = df.reset_index(drop=True)

In [16]:
df["customer"].unique()

array(['LZ68649', 'XL78013', 'HG93801', ..., 'QI63521', 'KX53892',
       'WA60547'], dtype=object)

In [17]:
#1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 
#and have a response of "Yes" to the last marketing campaign.
grouped_customers = df.groupby(["customer","total_claim_amount"]).filter(lambda x: (x['total_claim_amount'] > 1000).any() and (x['response'] == 'Yes').any())
grouped_customers

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,ttotal_claim_amountotal_claim_amount
103,236,YJ16163,Oregon,11009,Yes,Premium,Bachelor,1/24/11,Employed,F,...,1,Special Auto,Special L3,Offer2,Agent,1358,Luxury Car,Medsize,A,1358
167,419,GW43195,Oregon,25807,Yes,Extended,College,2/13/11,Employed,F,...,2,Personal Auto,Personal L2,Offer1,Branch,1027,Luxury Car,Small,A,1027
171,442,IP94270,Arizona,13736,Yes,Premium,Master,2/13/11,Disabled,F,...,8,Personal Auto,Personal L2,Offer1,Web,1261,SUV,Medsize,A,1261
231,587,FJ28407,California,5619,Yes,Premium,High School or Below,1/26/11,Unemployed,M,...,1,Personal Auto,Personal L1,Offer2,Web,1027,SUV,Medsize,A,1027
616,1527,TU53781,Oregon,8427,Yes,Extended,Bachelor,2/10/11,Employed,F,...,1,Corporate Auto,Corporate L3,Offer1,Agent,1032,Luxury SUV,Medsize,A,1032
717,1809,QO62792,Oregon,7840,Yes,Extended,College,1/14/11,Employed,M,...,1,Personal Auto,Personal L3,Offer2,Agent,1008,Luxury SUV,Small,A,1008
805,2027,TA66375,Oregon,11009,Yes,Premium,Bachelor,1/24/11,Employed,F,...,1,Corporate Auto,Corporate L2,Offer2,Agent,1358,Luxury Car,Medsize,A,1358
850,2125,JC11405,Oregon,10963,Yes,Premium,High School or Below,2/8/11,Employed,M,...,1,Personal Auto,Personal L3,Offer1,Agent,1324,Luxury SUV,Medsize,A,1324
1130,2865,FH77504,California,11009,Yes,Premium,Bachelor,1/24/11,Employed,F,...,1,Personal Auto,Personal L3,Offer2,Agent,1358,Luxury Car,Medsize,A,1358
1613,3958,KC26486,Arizona,8427,Yes,Extended,Bachelor,2/10/11,Employed,F,...,1,Personal Auto,Personal L2,Offer1,Agent,1032,Luxury SUV,Medsize,A,1032
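
A row-level filter like this one does not need `groupby`. A boolean mask that combines both conditions with `&` is the more common pandas idiom. The sketch below uses a made-up four-row frame; the customer IDs and amounts are assumptions for illustration.

```python
import pandas as pd

# Hypothetical customers
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "response": ["Yes", "No", "Yes", "Yes"],
    "total_claim_amount": [1358, 1200, 484, 1027],
})

# Each condition yields a boolean Series; & keeps rows where both are True.
# The parentheses are required because & binds tighter than > and ==.
mask = (toy["total_claim_amount"] > 1000) & (toy["response"] == "Yes")
high_claim_yes = toy[mask]

print(high_claim_yes)
```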


In [18]:
grouped_customers = grouped_customers.pivot_table(index=["policy_type","gender"],values=(["total_claim_amount"]),aggfunc="mean")
grouped_customers

Unnamed: 0_level_0,Unnamed: 1_level_0,total_claim_amount
policy_type,gender,Unnamed: 2_level_1
Corporate Auto,F,1195.0
Corporate Auto,M,1324.0
Personal Auto,F,1243.769231
Personal Auto,M,1108.5
Special Auto,F,1358.0


In [19]:
#2. Using the original Dataframe

df

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,ttotal_claim_amountotal_claim_amount
0,2,LZ68649,Washington,14947,No,Basic,Bachelor,2/10/11,Employed,M,...,2,Personal Auto,Personal L3,Offer3,Call Center,480,SUV,Medsize,A,480
1,3,XL78013,Oregon,22332,Yes,Extended,College,1/11/11,Employed,M,...,2,Corporate Auto,Corporate L3,Offer2,Branch,484,Four-Door Car,Medsize,A,484
2,10,HG93801,Arizona,5154,No,Extended,High School or Below,1/2/11,Employed,M,...,1,Corporate Auto,Corporate L3,Offer2,Branch,442,SUV,Large,A,442
3,13,KR82385,California,5454,No,Basic,Master,1/26/11,Employed,M,...,4,Personal Auto,Personal L3,Offer4,Call Center,331,Two-Door Car,Medsize,A,331
4,16,FH51383,California,5326,No,Basic,High School or Below,2/7/11,Employed,F,...,6,Personal Auto,Personal L3,Offer4,Call Center,300,Two-Door Car,Large,A,300
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4538,10902,PP30874,California,3579,No,Extended,High School or Below,1/24/11,Employed,F,...,1,Personal Auto,Personal L2,Offer2,Agent,655,Four-Door Car,Medsize,A,655
4539,10903,SU71163,Arizona,2771,No,Basic,College,1/7/11,Employed,M,...,1,Personal Auto,Personal L2,Offer2,Branch,355,Two-Door Car,Medsize,A,355
4540,10904,QI63521,Nevada,19228,No,Basic,High School or Below,2/24/11,Unemployed,M,...,2,Personal Auto,Personal L2,Offer1,Branch,897,Luxury SUV,Medsize,A,897
4541,10906,KX53892,Oregon,5259,No,Basic,College,1/6/11,Employed,F,...,6,Personal Auto,Personal L3,Offer2,Branch,273,Four-Door Car,Medsize,A,273


In [31]:
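#2. Analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign.
#Write your conclusions.
yes_customers = df[df['response'] == 'Yes']
# Note: the aggregation below runs on df (all customers), not on yes_customers,
# so the output that follows is not restricted to "Yes" responders.
# To restrict it, group yes_customers instead of df.
tcl_summary = df.groupby(["policy_type","gender"]).agg({"total_claim_amount":"mean"}).reset_index()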
tcl_summary

Unnamed: 0,policy_type,gender,total_claim_amount
0,Corporate Auto,F,392.87747
1,Corporate Auto,M,475.794521
2,Personal Auto,F,411.483186
3,Personal Auto,M,461.568732
4,Special Auto,F,463.072
5,Special Auto,M,427.416667
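
To restrict an aggregation to "Yes" responders, the filter has to be applied before the `groupby`, and the grouped object must be the filtered frame. A minimal sketch on made-up data follows; the rows and the names `yes_only` and `summary` are assumptions.

```python
import pandas as pd

# Hypothetical data with both "Yes" and "No" responders
toy = pd.DataFrame({
    "response": ["Yes", "Yes", "No", "Yes", "No"],
    "policy_type": ["Personal Auto", "Personal Auto", "Personal Auto",
                    "Corporate Auto", "Corporate Auto"],
    "gender": ["F", "F", "F", "M", "M"],
    "total_claim_amount": [100.0, 300.0, 900.0, 500.0, 50.0],
})

# Step 1: keep only the "Yes" rows
yes_only = toy[toy["response"] == "Yes"]

# Step 2: group the FILTERED frame, not the original one
summary = (
    yes_only
    .groupby(["policy_type", "gender"], as_index=False)["total_claim_amount"]
    .mean()
)

print(summary)
```

In this toy example the "No" rows (900.0 and 50.0) do not influence the means, which is the behavior the task asks for.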


In [21]:
yes_customers

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,ttotal_claim_amountotal_claim_amount
1,3,XL78013,Oregon,22332,Yes,Extended,College,1/11/11,Employed,M,...,2,Corporate Auto,Corporate L3,Offer2,Branch,484,Four-Door Car,Medsize,A,484
7,19,NJ54277,California,3746,Yes,Extended,College,2/26/11,Employed,F,...,1,Personal Auto,Personal L2,Offer2,Call Center,19,Two-Door Car,Large,A,19
31,69,QG27547,Oregon,2867,Yes,Extended,Bachelor,1/3/11,Retired,F,...,1,Personal Auto,Personal L3,Offer2,Call Center,374,Four-Door Car,Medsize,A,374
41,102,VG56765,Arizona,2471,Yes,Basic,High School or Below,1/15/11,Employed,M,...,1,Personal Auto,Personal L2,Offer2,Agent,114,Two-Door Car,Medsize,A,114
45,113,EC28398,Oregon,5096,Yes,Basic,Master,1/28/11,Disabled,F,...,3,Corporate Auto,Corporate L2,Offer1,Agent,312,Four-Door Car,Small,A,312
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4505,10818,XT35473,Arizona,2300,Yes,Basic,Bachelor,1/15/11,Retired,M,...,1,Personal Auto,Personal L3,Offer1,Agent,302,Four-Door Car,Large,A,302
4507,10825,FB17016,Oregon,5470,Yes,Extended,Bachelor,2/17/11,Employed,F,...,1,Personal Auto,Personal L3,Offer2,Agent,702,SUV,Medsize,A,702
4512,10840,ME22430,Nevada,2453,Yes,Basic,Bachelor,2/9/11,Medical Leave,M,...,1,Personal Auto,Personal L2,Offer1,Agent,331,Four-Door Car,Medsize,A,331
4531,10887,BY78730,Oregon,8879,Yes,Basic,High School or Below,2/3/11,Employed,F,...,7,Special Auto,Special L2,Offer1,Agent,528,SUV,Small,A,528


In [22]:
#My conclusions:
#The table shows the mean total claim amount by policy type and gender. Because the groupby was applied to df rather than yes_customers, it covers all customers in the cleaned data, not only those who responded "Yes" to the last marketing campaign.
#For the Corporate Auto policy type, the mean total claim amount is lower for female customers (392.88) than for male customers (475.79).
#For the Personal Auto policy type, the mean total claim amount is lower for female customers (411.48) than for male customers (461.57).
#For the Special Auto policy type, the mean total claim amount is lower for male customers (427.42) than for female customers (463.07).
#The highest group mean is Corporate Auto, male (475.79), and the lowest is Corporate Auto, female (392.88).

In [26]:
#3.  Analyze the total number of customers who have policies in each state, 
#and then filter the results to only include states where there are more than 500 customers.

# Note: this counts customers per state and policy type, not per state alone
state_customers = df.groupby(["state","policy_type"]).agg({"customer":"count"}).reset_index()

# Keep only the groups with more than 500 customers
states_500_customers = state_customers[state_customers['customer'] > 500]

In [28]:
states_500_customers 

Unnamed: 0,state,policy_type,customer
1,Arizona,Personal Auto,658
4,California,Personal Auto,1134
10,Oregon,Personal Auto,964
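
If the count is wanted per state only, as the task statement reads, grouping on `state` alone is enough. The sketch below is a hedged illustration on six made-up customers, with a threshold of 2 standing in for 500. It is not the result for the real dataset.

```python
import pandas as pd

# Hypothetical customers; IDs are invented
toy = pd.DataFrame({
    "state": ["Oregon", "Oregon", "Oregon", "Nevada", "Arizona", "Arizona"],
    "customer": ["c1", "c2", "c3", "c4", "c5", "c6"],
})

# Number of distinct customers per state
per_state = toy.groupby("state")["customer"].nunique()

# Keep only the states above the threshold (2 here, 500 in the lab)
threshold = 2
big_states = per_state[per_state > threshold]

print(big_states)
```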


In [33]:
#4.Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.
df_1 = df.groupby(["education","gender"]).agg({"customer_lifetime_value":["max", "min","median"]}).reset_index()
df_1

Unnamed: 0_level_0,education,gender,customer_lifetime_value,customer_lifetime_value,customer_lifetime_value
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,max,min,median
0,Bachelor,F,58753,1904,5752.0
1,Bachelor,M,67907,2030,5797.0
2,College,F,61850,2004,5642.5
3,College,M,44795,1918,6005.0
4,Doctor,F,44856,2395,5789.0
5,Doctor,M,32677,2267,5843.5
6,High School or Below,F,55277,2150,5978.0
7,High School or Below,M,83325,2132,6081.0
8,Master,F,51016,2417,5714.0
9,Master,M,50568,2357,5512.0
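
Passing a dict of lists to `agg` produces the two-level column header seen above. Named aggregation is one way to get flat, self-describing column names instead. A minimal sketch on made-up rows follows; the output names `clv_max`, `clv_min`, and `clv_median` are illustrative choices.

```python
import pandas as pd

# Hypothetical rows for two education/gender groups
toy = pd.DataFrame({
    "education": ["Master", "Master", "Master", "Doctor", "Doctor"],
    "gender": ["F", "F", "F", "M", "M"],
    "customer_lifetime_value": [2417, 5714, 51016, 2267, 32677],
})

# Named aggregation: new_column=(source_column, function)
clv_stats = toy.groupby(["education", "gender"], as_index=False).agg(
    clv_max=("customer_lifetime_value", "max"),
    clv_min=("customer_lifetime_value", "min"),
    clv_median=("customer_lifetime_value", "median"),
)

print(clv_stats)
```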


In [None]:
#Conclusions:
#This table shows the maximum, minimum, and median customer lifetime value by education level and gender.
#Bachelor: the maximum and minimum values are higher for males than for females, while the medians are very close (5797 vs. 5752).
#College: the maximum and minimum values are higher for females than for males, but the median is higher for males.
#Doctor: the maximum and minimum values are higher for females than for males, but the median is higher for males.
#High School or Below: the maximum and the median are higher for males than for females, while the minimum is higher for females.
#Master: the maximum, minimum, and median are all higher for females than for males.

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
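
The steps in the hint can be sketched as follows. This is one possible approach on made-up rows, not a reference solution. It assumes that "policies sold" means the sum of `number_of_policies` and that the month comes from `effective_to_date` in month/day/two-digit-year form. Counting rows instead would be an equally defensible reading.

```python
import pandas as pd

# Hypothetical rows; dates follow the m/d/yy pattern seen in the dataset
toy = pd.DataFrame({
    "state": ["Oregon", "Oregon", "Nevada", "Arizona", "Arizona", "Washington"],
    "effective_to_date": ["1/11/11", "2/10/11", "1/2/11", "1/7/11", "2/24/11", "2/3/11"],
    "number_of_policies": [2, 3, 1, 4, 2, 2],
})

# Extract the month number from the date string
toy["month"] = pd.to_datetime(toy["effective_to_date"], format="%m/%d/%y").dt.month

# States as rows, months as columns, summed policies as values
by_state_month = toy.pivot_table(
    index="state",
    columns="month",
    values="number_of_policies",
    aggfunc="sum",
    fill_value=0,
)

# Rank states by their total across months and keep the top 3
top3 = by_state_month.sum(axis=1).nlargest(3).index
top3_table = by_state_month.loc[top3]

print(top3_table)
```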

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.
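
As a hedged illustration, the response rate per channel can also be computed without `melt`, by averaging a boolean "responded Yes" flag within each group. The sketch below uses that `groupby` approach rather than the one named in the hint. It assumes that "marketing channel" refers to the `sales_channel` column, and it runs on made-up rows.

```python
import pandas as pd

# Hypothetical responses per channel
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent",
                      "Web", "Web", "Web", "Web",
                      "Branch", "Branch"],
    "response": ["Yes", "No", "No", "Yes",
                 "No", "No", "No", "Yes",
                 "No", "No"],
})

# The mean of a boolean Series is the share of True values,
# i.e. the fraction of customers who answered "Yes" in each channel
response_rate = (
    toy["response"].eq("Yes")
    .groupby(toy["sales_channel"])
    .mean()
    .sort_values(ascending=False)
)

print(response_rate)
```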

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [None]:
# your code goes here