# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
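
The steps in the hint can be sketched roughly as follows. This is a minimal illustration on a tiny made-up sample, not the lab's reference solution. It assumes that each row counts as one policy sold and that `effective_to_date` holds `M/D/YY` strings; the helper name `policies_by_month_top_states` and the sample rows are hypothetical.

```python
import pandas as pd


def policies_by_month_top_states(df: pd.DataFrame, n_states: int = 3) -> pd.DataFrame:
    """Count rows per (state, month) and keep the n_states states with the most rows overall.

    Assumption: one row equals one policy sold.
    """
    data = df.dropna(subset=["state"]).copy()
    data["month"] = pd.to_datetime(data["effective_to_date"], format="%m/%d/%y").dt.month

    # Step 1: group by state and month, then count the rows in each group
    counts = data.groupby(["state", "month"]).size().reset_index(name="policies_sold")

    # Step 2: pick the states with the highest overall totals
    top_states = counts.groupby("state")["policies_sold"].sum().nlargest(n_states).index

    # Step 3: keep only those states and sort by the monthly count, descending
    result = counts[counts["state"].isin(top_states)]
    return result.sort_values("policies_sold", ascending=False).reset_index(drop=True)


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "state": ["California", "California", "California", "Oregon", "Oregon",
              "Arizona", "Arizona", "Nevada", None],
    "effective_to_date": ["1/5/11", "1/20/11", "2/3/11", "1/7/11", "2/9/11",
                          "2/1/11", "2/2/11", "1/9/11", "1/1/11"],
})

top3 = policies_by_month_top_states(sample)
print(top3)

# Same counts with states as rows and months as columns
wide = top3.pivot_table(index="state", columns="month", values="policies_sold",
                        aggfunc="sum", fill_value=0)
print(wide)
```

If "policies sold" is instead taken to mean the sum of the `number_of_policies` column, the `.size()` call could be swapped for a sum over that column.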

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use `melt` to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*
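
One possible reading of this hint is sketched below on made-up rows: compute the share of each response per channel in wide form, then use `melt` to unpivot it into one row per (channel, response) pair. It assumes the channel is stored in a `sales_channel` column and the answer in a `response` column, as in the normalized column names; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent",
                      "Branch", "Branch",
                      "Web", "Web", "Web", "Web"],
    "response": ["Yes", "No", "No", "No",
                 "Yes", "No",
                 "No", "No", "No", "No"],
})

# Share of each response within each channel (every row sums to 1)
rates_wide = pd.crosstab(sample["sales_channel"], sample["response"], normalize="index")

# melt unpivots the wide table: one row per (sales_channel, response) pair
rates_long = rates_wide.reset_index().melt(
    id_vars="sales_channel", var_name="response", value_name="rate"
)

# Keep the "Yes" rows to get the response rate by channel
yes_rates = (
    rates_long[rates_long["response"] == "Yes"]
    .set_index("sales_channel")["rate"]
    .sort_values(ascending=False)
)
print(yes_rates)
```

Rows with a missing `response` are left out of the cross-tabulation, so the rates are computed over customers with a recorded answer only.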

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"

insurance_company_df = pd.read_csv(url)

print("First few rows of the dataset:")
insurance_company_df.head()



First few rows of the dataset:


| | Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | ... | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size | Vehicle Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2/18/11 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 1/18/11 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2/10/11 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 1/11/11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 1/17/11 | Medical Leave | F | ... | | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | |


In [4]:
# Normalize column names first (lowercase, spaces replaced with underscores) so the reports below use consistent names
insurance_company_df.columns = insurance_company_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Check for missing values
print("\nMissing values in each column:")
print(insurance_company_df.isnull().sum())

# Display updated column names
print("\nUpdated column names:")
print(insurance_company_df.columns)

# Check the data types
print("\nData types of the columns:")
print(insurance_company_df.dtypes)



Missing values in each column:
unnamed:_0                          0
customer                            0
state                             631
customer_lifetime_value             0
response                          631
coverage                            0
education                           0
effective_to_date                   0
employmentstatus                    0
gender                              0
income                              0
location_code                       0
marital_status                      0
monthly_premium_auto                0
months_since_last_claim           633
months_since_policy_inception       0
number_of_open_complaints         633
number_of_policies                  0
policy_type                         0
policy                              0
renew_offer_type                    0
sales_channel                       0
total_claim_amount                  0
vehicle_class                     622
vehicle_size                      622
vehicle_type      

In [5]:
# Keep customers with total_claim_amount above 1000 who responded "Yes" to the last campaign
high_claim = insurance_company_df['total_claim_amount'] > 1000
responded_yes = insurance_company_df['response'] == 'Yes'
filtered_df = insurance_company_df[high_claim & responded_yes]

# Display the first few rows of the filtered DataFrame
print("\nFirst few rows of the filtered DataFrame:")
print(filtered_df.head())

# Display basic information about the filtered DataFrame
# (info() prints its summary directly and returns None, so it is not wrapped in print)
print("\nInformation about the filtered DataFrame:")
filtered_df.info()



First few rows of the filtered DataFrame:
     unnamed:_0 customer       state  customer_lifetime_value response  \
189         189  OK31456  California             11009.130490      Yes   
236         236  YJ16163      Oregon             11009.130490      Yes   
419         419  GW43195      Oregon             25807.063000      Yes   
442         442  IP94270     Arizona             13736.132500      Yes   
587         587  FJ28407  California              5619.689084      Yes   

     coverage             education effective_to_date employmentstatus gender  \
189   Premium              Bachelor           1/24/11         Employed      F   
236   Premium              Bachelor           1/24/11         Employed      F   
419  Extended               College           2/13/11         Employed      F   
442   Premium                Master           2/13/11         Disabled      F   
587   Premium  High School or Below           1/26/11       Unemployed      M   

     ...  number_of_open_

In [7]:
# Ensure all columns are in lowercase and spaces are replaced with underscores for consistency
insurance_company_df.columns = insurance_company_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Convert 'effective_to_date' from M/D/YY strings to datetime
insurance_company_df['effective_to_date'] = pd.to_datetime(insurance_company_df['effective_to_date'], format='%m/%d/%y')

# Display the first few rows of the dataset to understand its structure
print("First few rows of the dataset:")
insurance_company_df.head()

First few rows of the dataset:


| | unnamed:_0 | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | ... | number_of_open_complaints | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2011-02-18 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 2011-01-18 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2011-02-10 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 2011-01-11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 2011-01-17 | Medical Leave | F | ... | | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | |


In [8]:
# Keep only the customers who responded "Yes" to the last marketing campaign
responded_yes_df = insurance_company_df[insurance_company_df['response'] == 'Yes']

# Create a pivot table to calculate the average total_claim_amount by policy type and gender
avg_claim_by_policy_gender = responded_yes_df.pivot_table(
    values='total_claim_amount',
    index='policy_type',
    columns='gender',
    aggfunc='mean'
).round(2)

# Display the pivot table
print("\nAverage Total Claim Amount by Policy Type and Gender for customers who responded 'Yes':")
print(avg_claim_by_policy_gender)



Average Total Claim Amount by Policy Type and Gender for customers who responded 'Yes':
gender               F       M
policy_type                   
Corporate Auto  433.74  408.58
Personal Auto   452.97  457.01
Special Auto    453.28  429.53


In [None]:
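# Conclusions: among customers who responded "Yes", the average total claim amount falls in a narrow band,
# from 408.58 (Corporate Auto, M) to 457.01 (Personal Auto, M). Corporate Auto has the lowest average for both genders.
# Women average higher claims than men for Corporate Auto (433.74 vs. 408.58) and Special Auto (453.28 vs. 429.53),
# while the Personal Auto averages are nearly equal (452.97 vs. 457.01).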
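
Before reading much into gaps between group averages, it can help to see how many customers each average rests on. The sketch below reports the group size next to each mean using `groupby` and `agg`. The rows are made up for illustration and are not taken from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "policy_type": ["Personal Auto"] * 4 + ["Corporate Auto"] * 3,
    "gender": ["F", "F", "M", "M", "F", "M", "M"],
    "response": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "total_claim_amount": [400.0, 500.0, 450.0, 900.0, 300.0, 200.0, 400.0],
})

# Keep the "Yes" responders, then report mean and group size side by side
yes_only = sample[sample["response"] == "Yes"]
summary = (
    yes_only.groupby(["policy_type", "gender"])["total_claim_amount"]
    .agg(["mean", "count"])
    .round(2)
)
print(summary)
```

A mean backed by only a handful of customers is far less reliable than one backed by hundreds, so the `count` column gives context for any comparison between groups.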

In [9]:
import pandas as pd

url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"

insurance_company_df = pd.read_csv(url)

# Ensure all columns are in lowercase and spaces are replaced with underscores for consistency
insurance_company_df.columns = insurance_company_df.columns.str.strip().str.lower().str.replace(' ', '_')

# Display the first few rows of the dataset to understand its structure
print("First few rows of the dataset:")
insurance_company_df.head()


First few rows of the dataset:


| | unnamed:_0 | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | ... | number_of_open_complaints | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2/18/11 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 1/18/11 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2/10/11 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 1/11/11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 1/17/11 | Medical Leave | F | ... | | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | |


In [11]:
# Aggregate data to count the number of customers in each state
customers_by_state = insurance_company_df['state'].value_counts().reset_index()
customers_by_state.columns = ['state', 'number_of_customers']

# Display the aggregated data
print("\nNumber of customers by state:")
customers_by_state



Number of customers by state:


| | state | number_of_customers |
|---|---|---|
| 0 | California | 3552 |
| 1 | Oregon | 2909 |
| 2 | Arizona | 1937 |
| 3 | Nevada | 993 |
| 4 | Washington | 888 |


In [12]:
# Filter the results to only include states with more than 500 customers
filtered_states = customers_by_state[customers_by_state['number_of_customers'] > 500]

# Display the filtered results
print("\nStates with more than 500 customers:")
filtered_states



States with more than 500 customers:


| | state | number_of_customers |
|---|---|---|
| 0 | California | 3552 |
| 1 | Oregon | 2909 |
| 2 | Arizona | 1937 |
| 3 | Nevada | 993 |
| 4 | Washington | 888 |
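
All five states pass the 500-customer threshold, so the filter removes nothing here. Note also that `value_counts` skips missing values by default, so the 631 rows with no `state` reported earlier are not counted in any of these totals. A small sketch on made-up data shows the difference; the sample rows and the threshold of 1 are hypothetical stand-ins for the real data and the 500 cutoff.

```python
import pandas as pd

# Hypothetical sample: six rows with a state, two rows with the state missing
sample = pd.DataFrame({
    "state": ["California"] * 3 + ["Oregon"] * 2 + ["Nevada"] + [None] * 2
})

# Default behaviour: rows with a missing state are left out of the counts
counts = sample["state"].value_counts()

# dropna=False adds a NaN entry so the missing rows stay visible
counts_with_missing = sample["state"].value_counts(dropna=False)

# Boolean filtering on the counts, with a small stand-in threshold
threshold = 1
large_states = counts[counts > threshold]

print(counts_with_missing)
print(large_states)
```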


In [13]:
# Group the data by education level and gender
grouped_df = insurance_company_df.groupby(['education', 'gender'])['customer_lifetime_value']

# Calculate the maximum, minimum, and median CLV
agg_df = grouped_df.agg(['max', 'min', 'median']).reset_index()

# Display the aggregated data
print("\nMaximum, Minimum, and Median Customer Lifetime Value by Education Level and Gender:")
agg_df



Maximum, Minimum, and Median Customer Lifetime Value by Education Level and Gender:


| | education | gender | max | min | median |
|---|---|---|---|---|---|
| 0 | Bachelor | F | 73225.95652 | 1904.000852 | 5640.505303 |
| 1 | Bachelor | M | 67907.2705 | 1898.007675 | 5548.031892 |
| 2 | College | F | 61850.18803 | 1898.683686 | 5623.611187 |
| 3 | College | M | 61134.68307 | 1918.1197 | 6005.847375 |
| 4 | Doctor | F | 44856.11397 | 2395.57 | 5332.462694 |
| 5 | Doctor | M | 32677.34284 | 2267.604038 | 5577.669457 |
| 6 | High School or Below | F | 55277.44589 | 2144.921535 | 6039.553187 |
| 7 | High School or Below | M | 83325.38119 | 1940.981221 | 6286.731006 |
| 8 | Master | F | 51016.06704 | 2417.777032 | 5729.855012 |
| 9 | Master | M | 50568.25912 | 2272.30731 | 5579.099207 |


In [None]:
# Conclusions
    # Education level and gender: median Customer Lifetime Value (CLV) varies little across groups, from about 5,332
    # (Doctor, F) to about 6,287 (High School or Below, M), and the gap between genders within an education level
    # stays under 400. The maximums vary far more: the largest is about 83,325 (High School or Below, M) and the
    # smallest about 32,677 (Doctor, M). Minimums are highest for the Doctor and Master groups (roughly 2,270 to 2,420).

    # Marketing strategies: a group with a higher median CLV is a natural target for acquisition campaigns.
    # Here the High School or Below and College (M) groups have the highest medians, but the differences are modest,
    # so education and gender alone are weak criteria for targeting.

    # Customer segmentation: the wide spread between minimum and maximum CLV inside every group suggests that
    # segments built only on education and gender would mix low-value and high-value customers. Combining them with
    # other attributes would allow more personalized communication and services for each segment.
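
To make such comparisons easier to read, the aggregated table can be ranked by median and extended with the spread between maximum and minimum. The sketch below rebuilds the table from the values printed above rather than reloading the dataset; the `clv_range` column name is an illustrative choice.

```python
import pandas as pd

# Values copied from the aggregated output above
agg_df = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "College", "College", "Doctor", "Doctor",
                  "High School or Below", "High School or Below", "Master", "Master"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "max": [73225.95652, 67907.2705, 61850.18803, 61134.68307, 44856.11397,
            32677.34284, 55277.44589, 83325.38119, 51016.06704, 50568.25912],
    "min": [1904.000852, 1898.007675, 1898.683686, 1918.1197, 2395.57,
            2267.604038, 2144.921535, 1940.981221, 2417.777032, 2272.30731],
    "median": [5640.505303, 5548.031892, 5623.611187, 6005.847375, 5332.462694,
               5577.669457, 6039.553187, 6286.731006, 5729.855012, 5579.099207],
})

# Rank groups from highest to lowest median CLV and add the max-min spread
ranked = agg_df.sort_values("median", ascending=False).reset_index(drop=True)
ranked["clv_range"] = ranked["max"] - ranked["min"]

print(ranked[["education", "gender", "median", "clv_range"]].round(2))
```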