# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

In [172]:
# Your code here
import pandas as pd

# Load the dataset
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"
marketing_df = pd.read_csv(url)

# Display the DataFrame (pandas truncates long output to the first and last rows)
marketing_df

Unnamed: 0.1,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,0,DK49336,Arizona,4809.216960,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.800000,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.917300,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.000000,SUV,Medsize,A
3,3,XL78013,Oregon,22332.439460,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,10905,FE99816,Nevada,15563.369440,No,Premium,Bachelor,1/19/11,Unemployed,F,...,,7,Personal Auto,Personal L1,Offer3,Web,1214.400000,Luxury Car,Medsize,A
10906,10906,KX53892,Oregon,5259.444853,No,Basic,College,1/6/11,Employed,F,...,0.0,6,Personal Auto,Personal L3,Offer2,Branch,273.018929,Four-Door Car,Medsize,A
10907,10907,TL39050,Arizona,23893.304100,No,Extended,Bachelor,2/6/11,Employed,F,...,0.0,2,Corporate Auto,Corporate L3,Offer1,Web,381.306996,Luxury SUV,Medsize,
10908,10908,WA60547,California,11971.977650,No,Premium,College,2/13/11,Employed,F,...,4.0,6,Personal Auto,Personal L1,Offer1,Branch,618.288849,SUV,Medsize,A


In [173]:
# Lowercase all column names, replace spaces with underscores, and drop the "unnamed:_0" column
marketing_df.columns = [col.lower().replace(" ", "_") for col in marketing_df.columns]
marketing_df["customer_lifetime_value"] = marketing_df["customer_lifetime_value"].round(2)
marketing_df["income"] = marketing_df["income"].round(2)
marketing_df["total_claim_amount"] = marketing_df["total_claim_amount"].round(2)
marketing_df = marketing_df.drop("unnamed:_0", axis=1)

marketing_df.head()

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
0,DK49336,Arizona,4809.22,No,Basic,College,2/18/11,Employed,M,48029,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,KX64629,California,2228.53,No,Basic,College,1/18/11,Unemployed,F,0,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.92,Four-Door Car,Medsize,
2,LZ68649,Washington,14947.92,No,Basic,Bachelor,2/10/11,Employed,M,22139,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,XL78013,Oregon,22332.44,Yes,Extended,College,1/11/11,Employed,M,49078,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.01,Four-Door Car,Medsize,A
4,QA50777,Oregon,9025.07,No,Premium,Bachelor,1/17/11,Medical Leave,F,23675,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.93,Four-Door Car,Medsize,
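
As an alternative to the cell above, the column clean-up can also be done with the vectorized `.str` accessor on the column index. The sketch below runs on a small made-up frame (the rows are hypothetical; only the column names mirror the CSV) and drops any column whose cleaned name starts with "unnamed" instead of naming it explicitly.

```python
import pandas as pd

# Toy frame with made-up rows; only the column names mirror the lab's CSV.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Customer Lifetime Value": [4809.21696, 2228.525238],
    "EmploymentStatus": ["Employed", "Unemployed"],
})

# Lowercase and replace spaces in one vectorized pass over the column index.
toy.columns = toy.columns.str.lower().str.replace(" ", "_")

# Keep every column whose name does not start with "unnamed".
toy = toy.loc[:, ~toy.columns.str.startswith("unnamed")]

print(toy.columns.tolist())
```

Filtering by prefix is a design choice: it keeps working if the CSV gains another leftover index column, but it would also remove a real column that happens to start with "unnamed".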


In [174]:
# Check general info
marketing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 25 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   customer                       10910 non-null  object 
 1   state                          10279 non-null  object 
 2   customer_lifetime_value        10910 non-null  float64
 3   response                       10279 non-null  object 
 4   coverage                       10910 non-null  object 
 5   education                      10910 non-null  object 
 6   effective_to_date              10910 non-null  object 
 7   employmentstatus               10910 non-null  object 
 8   gender                         10910 non-null  object 
 9   income                         10910 non-null  int64  
 10  location_code                  10910 non-null  object 
 11  marital_status                 10910 non-null  object 
 12  monthly_premium_auto           10910 non-null 

In [175]:
# Change the data type to datetime
marketing_df["effective_to_date"] = marketing_df["effective_to_date"].astype("datetime64[us]")
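
The conversion above relies on pandas inferring the date layout. A minimal sketch of a more explicit route, assuming the dates follow the month/day/two-digit-year pattern seen in the preview (for example `2/18/11`), is to pass a `format` to `pd.to_datetime`. The sample values, including the deliberately invalid one, are made up for illustration.

```python
import pandas as pd

# Made-up sample in the same layout as effective_to_date, plus one bad value.
dates = pd.Series(["2/18/11", "1/18/11", "not a date"])

# An explicit format avoids day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%m/%d/%y", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```

Counting the `NaT` values afterwards is a quick way to confirm that every date was understood.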

In [176]:
# Check for null values
marketing_df.isnull().sum() / marketing_df.shape[0] * 100

customer                          0.000000
state                             5.783685
customer_lifetime_value           0.000000
response                          5.783685
coverage                          0.000000
education                         0.000000
effective_to_date                 0.000000
employmentstatus                  0.000000
gender                            0.000000
income                            0.000000
location_code                     0.000000
marital_status                    0.000000
monthly_premium_auto              0.000000
months_since_last_claim           5.802016
months_since_policy_inception     0.000000
number_of_open_complaints         5.802016
number_of_policies                0.000000
policy_type                       0.000000
policy                            0.000000
renew_offer_type                  0.000000
sales_channel                     0.000000
total_claim_amount                0.000000
vehicle_class                     5.701192
vehicle_siz

In [177]:
# The column vehicle_type is null in about 50% of the rows; at this point I don't consider it relevant for the analysis, so I drop it.
marketing_df = marketing_df.drop("vehicle_type", axis=1)

# The columns state, response, months_since_last_claim, number_of_open_complaints, vehicle_class and vehicle_size each have less than 6% null values.
# I won't need those rows for the next analysis, so I drop every row that contains a null.
marketing_df = marketing_df.dropna(how="any")
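
The same two-step rule (drop mostly empty columns, then drop rows with any remaining null) can be written so that the columns are chosen from the null share instead of by name. This is only a sketch on a made-up frame; the 0.5 cutoff is an assumption chosen to match the "about 50%" reasoning above.

```python
import pandas as pd

# Made-up frame: vehicle_type is mostly null, state has a single null.
toy = pd.DataFrame({
    "state": ["Arizona", None, "Oregon", "Nevada"],
    "vehicle_type": ["A", None, None, None],
    "income": [48029, 0, 22139, 49078],
})

# Share of nulls per column (between 0 and 1).
null_share = toy.isnull().mean()

# Hypothetical cutoff: drop columns where at least half the values are missing.
threshold = 0.5
sparse_cols = null_share[null_share >= threshold].index

# Drop the sparse columns first, then any row that still has a null.
cleaned = toy.drop(columns=sparse_cols).dropna(how="any")

print(cleaned)
```

Dropping the sparse column before calling `dropna` matters: in the reverse order, every row with a missing `vehicle_type` would be removed as well.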

In [178]:
# Check null values again; every column should now show 0
marketing_df.isnull().sum()

customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
vehicle_class                    0
vehicle_size                     0
dtype: int64

In [179]:
# Check the unique values in each column; everything looks fine
for column_name in marketing_df.columns:
    unique_values = marketing_df[column_name].unique()
    print(f"Unique values in column {column_name}: {unique_values}")

Unique values in column customer: ['DK49336' 'KX64629' 'LZ68649' ... 'KX53892' 'TL39050' 'WA60547']
Unique values in column state: ['Arizona' 'California' 'Washington' 'Oregon' 'Nevada']
Unique values in column customer_lifetime_value: [ 4809.22  2228.53 14947.92 ...  5259.44 23893.3  11971.98]
Unique values in column response: ['No' 'Yes']
Unique values in column coverage: ['Basic' 'Extended' 'Premium']
Unique values in column education: ['College' 'Bachelor' 'Doctor' 'High School or Below' 'Master']
Unique values in column effective_to_date: <DatetimeArray>
['2011-02-18 00:00:00', '2011-01-18 00:00:00', '2011-02-10 00:00:00',
 '2011-01-11 00:00:00', '2011-02-14 00:00:00', '2011-02-24 00:00:00',
 '2011-01-19 00:00:00', '2011-01-04 00:00:00', '2011-01-02 00:00:00',
 '2011-01-31 00:00:00', '2011-01-26 00:00:00', '2011-02-28 00:00:00',
 '2011-01-16 00:00:00', '2011-02-07 00:00:00', '2011-01-17 00:00:00',
 '2011-02-26 00:00:00', '2011-02-23 00:00:00', '2011-01-15 00:00:00',
 '2011-02-15 0

In [180]:
# Check duplicates
marketing_df.duplicated()


0        False
1        False
2        False
3        False
6        False
         ...  
10903    False
10904    False
10906    False
10907    False
10908    False
Length: 9134, dtype: bool
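
The boolean Series above is hard to read at this length, so a count is often more useful. A small sketch on made-up rows (the customer IDs are reused from the preview purely as sample strings):

```python
import pandas as pd

# Made-up rows; the first and third are identical.
toy = pd.DataFrame({
    "customer": ["DK49336", "KX64629", "DK49336"],
    "state": ["Arizona", "California", "Arizona"],
})

# Number of rows that repeat an earlier row in every column.
n_full_duplicates = int(toy.duplicated().sum())

# Number of rows that repeat an earlier customer ID, whatever the other columns hold.
n_repeated_ids = int(toy["customer"].duplicated().sum())

print(n_full_duplicates, n_repeated_ids)
```

Checking the ID column separately can reveal customers who appear more than once with different details, which a full-row check would miss.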

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [181]:
new_marketing_df = marketing_df[(marketing_df["total_claim_amount"] > 1000) & (marketing_df["response"] == "Yes")].reset_index(drop=True)
display(new_marketing_df.head())

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
0,OK31456,California,11009.13,Yes,Premium,Bachelor,2011-01-24,Employed,F,51643,...,43,0.0,1,Corporate Auto,Corporate L3,Offer2,Agent,1358.4,Luxury Car,Medsize
1,YJ16163,Oregon,11009.13,Yes,Premium,Bachelor,2011-01-24,Employed,F,51643,...,43,0.0,1,Special Auto,Special L3,Offer2,Agent,1358.4,Luxury Car,Medsize
2,GW43195,Oregon,25807.06,Yes,Extended,College,2011-02-13,Employed,F,71210,...,89,1.0,2,Personal Auto,Personal L2,Offer1,Branch,1027.2,Luxury Car,Small
3,IP94270,Arizona,13736.13,Yes,Premium,Master,2011-02-13,Disabled,F,16181,...,79,0.0,8,Personal Auto,Personal L2,Offer1,Web,1261.32,SUV,Medsize
4,FJ28407,California,5619.69,Yes,Premium,High School or Below,2011-01-26,Unemployed,M,0,...,5,0.0,1,Personal Auto,Personal L1,Offer2,Web,1027.0,SUV,Medsize


2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.


In [182]:
# Group by policy type and gender, then calculate the mean of total_claim_amount
avg_total_claim_df = marketing_df[marketing_df["response"] == "Yes"].groupby(["policy_type", "gender"])["total_claim_amount"].mean().reset_index()

display (avg_total_claim_df)


| | policy_type | gender | total_claim_amount |
|---|---|---|---|
| 0 | Corporate Auto | F | 431.480067 |
| 1 | Corporate Auto | M | 412.756763 |
| 2 | Personal Auto | F | 454.090104 |
| 3 | Personal Auto | M | 453.603695 |
| 4 | Special Auto | F | 455.648438 |
| 5 | Special Auto | M | 414.799333 |


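Analysis: Among customers who responded "Yes", women have a higher average total_claim_amount than men in every policy type. The gap is visible for Special Auto (455.65 vs. 414.80) and Corporate Auto (431.48 vs. 412.76) but negligible for Personal Auto (454.09 vs. 453.60). This suggests that female responders tend to make somewhat higher-value claims. The table covers claim amounts only and does not show how often each gender responds, so targeting female customers could be worth exploring, but that decision would also need the response rate by gender.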
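
Group means are easier to judge when the group sizes are shown next to them. The sketch below adds a count through named aggregation; the rows are made up, and the output names `mean_claim` and `n_customers` are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up rows using the lab's column names.
toy = pd.DataFrame({
    "response": ["Yes", "Yes", "No", "Yes", "Yes"],
    "policy_type": ["Personal Auto", "Personal Auto", "Personal Auto",
                    "Special Auto", "Special Auto"],
    "gender": ["F", "M", "F", "F", "F"],
    "total_claim_amount": [400.0, 500.0, 900.0, 300.0, 500.0],
})

# Mean claim and number of customers per policy type and gender,
# restricted to customers who responded "Yes".
summary = (
    toy[toy["response"] == "Yes"]
    .groupby(["policy_type", "gender"])["total_claim_amount"]
    .agg(mean_claim="mean", n_customers="count")
    .reset_index()
)

print(summary)
```

A mean based on a handful of customers is much less reliable than one based on hundreds, so the count column helps decide how much weight a gap deserves.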

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [183]:
# Count the number of customers in each state
count_by_state = marketing_df["state"].value_counts()
display(count_by_state)

# Keep only the states with more than 500 customers
states_more_500 = count_by_state[count_by_state > 500]
display(states_more_500)


state
California    3150
Oregon        2601
Arizona       1703
Nevada         882
Washington     798
Name: count, dtype: int64

state
California    3150
Oregon        2601
Arizona       1703
Nevada         882
Washington     798
Name: count, dtype: int64

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [184]:
new_df = marketing_df.groupby(["education", "gender"])["customer_lifetime_value"].agg(["max", "min", "median"]).reset_index()

display(new_df)

| | education | gender | max | min | median |
|---|---|---|---|---|---|
| 0 | Bachelor | F | 73225.96 | 1904.0 | 5678.05 |
| 1 | Bachelor | M | 67907.27 | 1898.01 | 5555.83 |
| 2 | College | F | 61850.19 | 1898.68 | 5621.79 |
| 3 | College | M | 61134.68 | 1918.12 | 5989.77 |
| 4 | Doctor | F | 44856.11 | 2395.57 | 5332.46 |
| 5 | Doctor | M | 32677.34 | 2267.6 | 5620.59 |
| 6 | High School or Below | F | 55277.45 | 2144.92 | 6044.02 |
| 7 | High School or Below | M | 83325.38 | 1940.98 | 6176.7 |
| 8 | Master | F | 51016.07 | 2417.78 | 5801.13 |
| 9 | Master | M | 50568.26 | 2272.31 | 5617.96 |


Conclusion: Surprisingly, customers with a high school education or below have the highest median customer lifetime value for both genders (6044.02 for women and 6176.70 for men), above every university-educated group. The highest single maximum (83325.38) also belongs to men in this group, although among women the highest maximum is in the Bachelor group (73225.96).
This may indicate that these customers are more loyal, buy more frequently, or are more engaged with the products and services offered by the company.
They may also have lower expectations or focus on long-term value, resulting in higher lifetime spending. These are hypotheses that the table alone cannot confirm.
Marketing strategies could be tailored to enhance their experience, perhaps by offering loyalty programs or discounts.
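
To compare genders within each education level, the medians can be placed side by side with `unstack`. This is a sketch on made-up values; `M_minus_F` is a hypothetical helper column added for the comparison.

```python
import pandas as pd

# Made-up rows using the lab's column names.
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master", "Master"],
    "gender": ["F", "F", "M", "F", "M", "M"],
    "customer_lifetime_value": [5000.0, 6000.0, 5600.0, 5800.0, 5400.0, 5600.0],
})

# Median CLV per education level, with one column per gender.
medians = (
    toy.groupby(["education", "gender"])["customer_lifetime_value"]
    .median()
    .unstack("gender")
)

# Positive values mean the male median is higher than the female median.
medians["M_minus_F"] = medians["M"] - medians["F"]

print(medians)
```

The wide layout makes it easy to see at a glance whether the gender gap has the same sign across education levels.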

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [185]:
# Check the date column
marketing_df.head()

Unnamed: 0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
0,DK49336,Arizona,4809.22,No,Basic,College,2011-02-18,Employed,M,48029,...,52,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize
1,KX64629,California,2228.53,No,Basic,College,2011-01-18,Unemployed,F,0,...,26,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.92,Four-Door Car,Medsize
2,LZ68649,Washington,14947.92,No,Basic,Bachelor,2011-02-10,Employed,M,22139,...,31,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize
3,XL78013,Oregon,22332.44,Yes,Extended,College,2011-01-11,Employed,M,49078,...,3,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.01,Four-Door Car,Medsize
6,IW72280,California,5035.04,No,Basic,Doctor,2011-02-14,Employed,F,37405,...,99,3.0,4,Corporate Auto,Corporate L2,Offer2,Branch,287.56,Four-Door Car,Medsize


In [186]:
# Extract the month of the column effective_to_date
marketing_df["month"] = marketing_df["effective_to_date"].dt.strftime("%m")

# Create a pivot table that counts rows per state (rows) and month (columns)
state_and_month = marketing_df.pivot_table(index="state", columns="month", aggfunc="size")

display(state_and_month)

| state / month | 01 | 02 |
|---|---|---|
| Arizona | 899 | 804 |
| California | 1695 | 1455 |
| Nevada | 494 | 388 |
| Oregon | 1396 | 1205 |
| Washington | 414 | 384 |


6.  Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

In [187]:

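# Count policies by state (rows) and month (columns)
state_and_month = marketing_df.pivot_table(index="state", columns="month", aggfunc="size")
# Add a per-state total across all months
state_and_month["total_policies"] = state_and_month.sum(axis=1)

# Sort the states by total, highest first
sorted_state_and_month = state_and_month.sort_values(by="total_policies", ascending=False)

# Keep the top 3 states and drop the helper column
top_3_states = sorted_state_and_month.head(3)
top3_df = top_3_states.drop(columns="total_policies")

display(top3_df)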

| state / month | 01 | 02 |
|---|---|---|
| California | 1695 | 1455 |
| Oregon | 1396 | 1205 |
| Arizona | 899 | 804 |
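
The hint describes a groupby route rather than a pivot table. A minimal sketch of that route on made-up rows, producing a long table with one row per state and month (the `policies` column name is an assumption for this example):

```python
import pandas as pd

# Made-up rows: one row per policy, using the lab's column names.
toy = pd.DataFrame({
    "state": ["California"] * 3 + ["Oregon"] * 2 + ["Arizona"] * 2 + ["Nevada"],
    "month": ["01", "01", "02", "01", "02", "01", "02", "01"],
})

# Step 1: count policies for each state and month.
counts = toy.groupby(["state", "month"]).size().reset_index(name="policies")

# Step 2: find the 3 states with the most policies overall.
top_states = counts.groupby("state")["policies"].sum().nlargest(3).index

# Step 3: keep only the monthly counts for those states.
top3 = counts[counts["state"].isin(top_states)].reset_index(drop=True)

print(top3)
```

The long layout is convenient for plotting or further grouping, while the pivoted layout above is easier to read as a report.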


7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [188]:
# Count "Yes" responses by channel
yes_counts = marketing_df[marketing_df["response"] == "Yes"].groupby("sales_channel").size()

# Count all customers by channel
total_counts = marketing_df.groupby("sales_channel").size()

# Calculate response rates
response_rates = yes_counts / total_counts

# Convert to a DataFrame with the columns sales_channel and response_rate
response_rates_df = response_rates.reset_index(name="response_rate")

display(response_rates_df)


| | sales_channel | response_rate |
|---|---|---|
| 0 | Agent | 0.191544 |
| 1 | Branch | 0.114531 |
| 2 | Call Center | 0.108782 |
| 3 | Web | 0.117736 |
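
The hint mentions `melt`, which the cell above does not use. One possible reading of the hint, sketched on made-up rows, is to build a wide table of response shares per channel with `pd.crosstab` and then unpivot it; the `share` column name is an assumption for this example.

```python
import pandas as pd

# Made-up rows using the lab's column names.
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Web", "Web", "Web", "Branch"],
    "response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Wide table: one row per channel, one column per response, values are row shares.
rates_wide = pd.crosstab(
    toy["sales_channel"], toy["response"], normalize="index"
).reset_index()

# Unpivot to a long table with one row per channel and response value.
rates_long = rates_wide.melt(
    id_vars=["sales_channel"], var_name="response", value_name="share"
)

# Keep only the share of customers who responded "Yes".
yes_rates = rates_long[rates_long["response"] == "Yes"].reset_index(drop=True)

print(yes_rates)
```

Unlike dividing two grouped counts, this route reports a rate of 0 for a channel with no "Yes" responses instead of a missing value.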
