# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

In [3]:
#Lab | Data Aggregation and Filtering

# Import pandas library
import pandas as pd

# Read the CSV file from a local copy in the working directory
file_path = 'marketing_customer_analysis.csv'
marketing_customer_analysis = pd.read_csv(file_path)

In [4]:
# DataFrame's overview
marketing_customer_analysis.head()

| | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | income | ... | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2011-02-18 | Employed | M | 48029 | ... | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | A | 2 |
| 1 | KX64629 | California | 2228.525238 | No | Basic | College | 2011-01-18 | Unemployed | F | 0 | ... | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | A | 1 |
| 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2011-02-10 | Employed | M | 22139 | ... | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A | 2 |
| 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 2011-01-11 | Employed | M | 49078 | ... | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A | 1 |
| 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 2011-01-17 | Medical Leave | F | 23675 | ... | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | A | 1 |


1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [6]:
# Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

filtered_df = marketing_customer_analysis[(marketing_customer_analysis['total_claim_amount'] > 1000) & (marketing_customer_analysis['response'] == 'Yes')]
filtered_df.head()

| | customer | state | customer_lifetime_value | response | coverage | education | effective_to_date | employmentstatus | gender | income | ... | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size | vehicle_type | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | OK31456 | California | 11009.13049 | Yes | Premium | Bachelor | 2011-01-24 | Employed | F | 51643 | ... | 1 | Corporate Auto | Corporate L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | A | 1 |
| 236 | YJ16163 | Oregon | 11009.13049 | Yes | Premium | Bachelor | 2011-01-24 | Employed | F | 51643 | ... | 1 | Special Auto | Special L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | A | 1 |
| 419 | GW43195 | Oregon | 25807.063 | Yes | Extended | College | 2011-02-13 | Employed | F | 71210 | ... | 2 | Personal Auto | Personal L2 | Offer1 | Branch | 1027.2 | Luxury Car | Small | A | 2 |
| 442 | IP94270 | Arizona | 13736.1325 | Yes | Premium | Master | 2011-02-13 | Disabled | F | 16181 | ... | 8 | Personal Auto | Personal L2 | Offer1 | Web | 1261.319869 | SUV | Medsize | A | 2 |
| 587 | FJ28407 | California | 5619.689084 | Yes | Premium | High School or Below | 2011-01-26 | Unemployed | M | 0 | ... | 1 | Personal Auto | Personal L1 | Offer2 | Web | 1027.000029 | SUV | Medsize | A | 1 |
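
As a minimal sketch of the same two-condition filter, the example below uses a small made-up DataFrame (the rows are hypothetical, not taken from the dataset). It also shows `DataFrame.query` as an alternative spelling of the boolean mask; both are expected to select the same rows.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "total_claim_amount": [1200.0, 950.0, 1500.0, 1000.0],
    "response": ["Yes", "Yes", "No", "Yes"],
})

# Boolean mask: each condition is wrapped in parentheses and combined with &
mask = (toy["total_claim_amount"] > 1000) & (toy["response"] == "Yes")
by_mask = toy[mask]

# The same selection written as a query string
by_query = toy.query("total_claim_amount > 1000 and response == 'Yes'")

print(by_mask)
```

Note that the comparison is strict, so a claim of exactly 1000 (customer `D4` in the toy data) is excluded.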


2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

In [12]:
# Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign.

yes_customers = marketing_customer_analysis[marketing_customer_analysis['response'] == 'Yes']

average_claim_amount = yes_customers.groupby(['policy_type', 'gender'])['total_claim_amount'].mean()
average_claim_amount.head(6)

policy_type     gender
Corporate Auto  F         433.738499
                M         408.582459
Personal Auto   F         452.965929
                M         457.010178
Special Auto    F         453.280164
                M         429.527942
Name: total_claim_amount, dtype: float64

## Write your conclusions

The result shows the average total_claim_amount by policy type and gender for customers who responded "Yes" to the last marketing campaign. We can draw the following insights:

The highest average belongs to men with a Personal Auto policy (about 457), followed by women with a Special Auto policy (about 453). The lowest average belongs to men with a Corporate Auto policy (about 409).

Women have the higher average in two of the three policy types, Corporate Auto and Special Auto, by about 25 and 24 respectively. Men are slightly higher in Personal Auto, by about 4. Personal Auto is the only policy type in which both genders average above 450.
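
Gender comparisons like these are easier to read when the two genders sit side by side. The sketch below assumes a small made-up DataFrame (hypothetical values) and uses `unstack` to turn the `gender` level into columns; the `F_minus_M` column is a name introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "M", "F"],
    "response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "total_claim_amount": [400.0, 500.0, 300.0, 200.0, 900.0],
})

# Keep only the customers who responded "Yes"
yes_only = toy[toy["response"] == "Yes"]

# Mean claim per (policy_type, gender), with gender moved into columns
mean_claims = (
    yes_only.groupby(["policy_type", "gender"])["total_claim_amount"]
    .mean()
    .unstack("gender")
)

# Gap between the two gender averages for each policy type
mean_claims["F_minus_M"] = mean_claims["F"] - mean_claims["M"]
print(mean_claims)
```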

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [14]:
# Analyze the total number of customers who have policies in each state

customers_by_state = marketing_customer_analysis.groupby('state')['customer'].count()
customers_by_state

state
Arizona       1937
California    4183
Nevada         993
Oregon        2909
Washington     888
Name: customer, dtype: int64

In [15]:
# Filter the results to only include states where there are more than 500 customers.

states_with_more_than_500_customers = customers_by_state[customers_by_state > 500]
states_with_more_than_500_customers

state
Arizona       1937
California    4183
Nevada         993
Oregon        2909
Washington     888
Name: customer, dtype: int64
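
All five states exceed 500, so the filter leaves the result unchanged. One point worth checking is that `count()` counts rows rather than distinct customers. Whether this dataset contains repeated customer IDs is not verified here; the sketch below, on made-up data with a hypothetical threshold, only illustrates how `count()` and `nunique()` can differ.

```python
import pandas as pd

# Toy data (hypothetical): customer "A1" appears twice in Oregon
toy = pd.DataFrame({
    "customer": ["A1", "A1", "B2", "C3", "D4", "E5"],
    "state": ["Oregon", "Oregon", "Oregon", "Nevada", "Nevada", "Arizona"],
})

# Number of rows with a non-null customer ID per state
rows_per_state = toy.groupby("state")["customer"].count()

# Number of distinct customer IDs per state
unique_per_state = toy.groupby("state")["customer"].nunique()

# Hypothetical cutoff for the toy data; the lab itself uses 500
threshold = 1
states_above = unique_per_state[unique_per_state > threshold]

print(rows_per_state, unique_per_state, states_above, sep="\n\n")
```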

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [18]:
# Find the maximum, minimum, and median customer lifetime value by education level and gender. 

stats_customer_lifetime_value = marketing_customer_analysis.groupby(['education', 'gender'])['customer_lifetime_value'].agg(['max', 'min', 'median'])
stats_customer_lifetime_value



| education | gender | max | min | median |
|---|---|---|---|---|
| Bachelor | F | 73225.95652 | 1904.000852 | 5640.505303 |
| Bachelor | M | 67907.2705 | 1898.007675 | 5548.031892 |
| College | F | 61850.18803 | 1898.683686 | 5623.611187 |
| College | M | 61134.68307 | 1918.1197 | 6005.847375 |
| Doctor | F | 44856.11397 | 2395.57 | 5332.462694 |
| Doctor | M | 32677.34284 | 2267.604038 | 5577.669457 |
| High School or Below | F | 55277.44589 | 2144.921535 | 6039.553187 |
| High School or Below | M | 83325.38119 | 1940.981221 | 6286.731006 |
| Master | F | 51016.06704 | 2417.777032 | 5729.855012 |
| Master | M | 50568.25912 | 2272.30731 | 5579.099207 |


## Write your conclusions

The highest maximum lifetime value (about 83,325) belongs to men with a High School or Below education. The lowest minimum (about 1,898) belongs to men with a Bachelor education, and the lowest maximum (about 32,677) belongs to men with a Doctor education. Each maximum and minimum reflects a single customer, so the medians are the better guide to a typical lifetime value.

The highest median lifetime value is for men with a High School or Below education (about 6,287), and High School or Below also has the highest median among women (about 6,040). The lowest median is for women with a Doctor education (about 5,332).

In conclusion, men have the higher median in three of the five education levels (College, Doctor, and High School or Below), while women have the higher median for Bachelor and Master. The medians do not rise with education: the High School or Below group has the highest medians, and the medians for all other groups fall between about 5,330 and 6,010.
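
Statements about which group holds the highest or lowest value can be checked programmatically instead of by eye. The sketch below assumes a small made-up DataFrame (hypothetical values) and uses named aggregation plus `idxmax` and `idxmin`; the column names `clv_max`, `clv_min`, and `clv_median` are introduced here for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master", "Master"],
    "gender": ["F", "F", "M", "F", "M", "M"],
    "customer_lifetime_value": [2000.0, 6000.0, 5000.0, 3000.0, 4000.0, 9000.0],
})

# Named aggregation gives each statistic an explicit column name
stats = toy.groupby(["education", "gender"])["customer_lifetime_value"].agg(
    clv_max="max", clv_min="min", clv_median="median"
)

# idxmax / idxmin return the (education, gender) label of the extreme row
extremes = {
    "highest_max": stats["clv_max"].idxmax(),
    "lowest_min": stats["clv_min"].idxmin(),
    "highest_median": stats["clv_median"].idxmax(),
}
print(stats)
print(extremes)
```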

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [19]:
# Bonus
# 5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

policies_by_state = pd.pivot_table(marketing_customer_analysis, values='number_of_policies', index='state', columns='month', aggfunc='sum', fill_value=0)
policies_by_state

| state | month 1 | month 2 |
|---|---|---|
| Arizona | 3052 | 2864 |
| California | 6666 | 5901 |
| Nevada | 1493 | 1278 |
| Oregon | 4697 | 3969 |
| Washington | 1358 | 1225 |
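
A pivot table of this kind can also be built with `groupby` followed by `unstack`. The sketch below, on made-up data (hypothetical values), is meant to show that the two spellings give the same numbers and that `fill_value=0` fills state and month combinations with no rows.

```python
import pandas as pd

# Toy data (hypothetical): Nevada has no rows for month 2
toy = pd.DataFrame({
    "state": ["Oregon", "Oregon", "Nevada", "Nevada", "Oregon"],
    "month": [1, 2, 1, 1, 1],
    "number_of_policies": [2, 3, 1, 4, 5],
})

# States as rows, months as columns, summing number_of_policies
pivoted = pd.pivot_table(
    toy, values="number_of_policies", index="state",
    columns="month", aggfunc="sum", fill_value=0,
)

# The same table via groupby + unstack
unstacked = (
    toy.groupby(["state", "month"])["number_of_policies"]
    .sum()
    .unstack("month", fill_value=0)
)

print(pivoted)
```

Using `aggfunc="sum"` adds up the `number_of_policies` values; counting rows instead would need `aggfunc="count"`, which answers a different question.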


6.  Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

In [20]:
# 6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.
# Sum the number of policies per state so the states can be ranked
policies_per_state = marketing_customer_analysis.groupby('state')['number_of_policies'].sum()

# Sort the totals in descending order (sort_values returns a new Series, so the result must be kept)
policies_per_state = policies_per_state.sort_values(ascending=False)

# Select the top 3 states with the highest number of policies sold.
top_states = policies_per_state.head(3).index

# Create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.
filtered_policies =  marketing_customer_analysis[ marketing_customer_analysis['state'].isin(top_states)]

policies_top3_df = pd.pivot_table(filtered_policies, values='number_of_policies', index='state', columns='month', aggfunc='sum', fill_value=0)
policies_top3_df


| state | month 1 | month 2 |
|---|---|---|
| Arizona | 3052 | 2864 |
| California | 6666 | 5901 |
| Oregon | 4697 | 3969 |
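
The steps in the hint can also be followed literally: group by state and month, rank the states by their totals, then keep only the top states in long format. This is a sketch on made-up data (hypothetical values), keeping the top 2 states rather than 3 so the toy example stays small.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "state": ["Oregon", "Oregon", "Nevada", "Arizona", "Arizona", "Washington"],
    "month": [1, 2, 1, 1, 2, 2],
    "number_of_policies": [5, 4, 1, 3, 3, 2],
})

# Step 1: total policies for each (state, month) pair
by_state_month = toy.groupby(["state", "month"], as_index=False)["number_of_policies"].sum()

# Step 2: rank states by their overall total and keep the top 2
top_states = (
    by_state_month.groupby("state")["number_of_policies"].sum().nlargest(2).index
)

# Step 3: keep only those states, sorted by policies in descending order
top_df = (
    by_state_month[by_state_month["state"].isin(top_states)]
    .sort_values("number_of_policies", ascending=False)
    .reset_index(drop=True)
)
print(top_df)
```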


7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

In [21]:
# 7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.
# Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

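# Unpivot the two categorical columns so each row holds one (response, category) pair.
# Note: renew_offer_type is an offer type rather than a sales channel; it is kept here alongside sales_channel.
melted_df = pd.melt(marketing_customer_analysis, id_vars=['response'], value_vars=['sales_channel', 'renew_offer_type'], value_name='Marketing Channel')

# Keep only the "Yes" responses
yes_responses = melted_df[melted_df['response'] == 'Yes']

# Count "Yes" responses per category
response_counts = yes_responses.groupby('Marketing Channel').size()

# Divide by the total number of "Yes" responses. This gives each category's share of all "Yes" responses,
# not the fraction of customers within that category who responded "Yes".
response_rate = response_counts / len(marketing_customer_analysis[marketing_customer_analysis['response'] == 'Yes'])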
response_rate

Marketing Channel
Agent          0.506139
Branch         0.222374
Call Center    0.150750
Offer1         0.453615
Offer2         0.524557
Offer3         0.021828
Web            0.120737
dtype: float64
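
The figures above divide each category's "Yes" count by the total number of "Yes" responses, so within the sales channels they sum to 1. A response rate in the usual sense divides by the number of customers in each channel instead. The sketch below shows both quantities on made-up data (hypothetical values); it is an illustration of the distinction, not a recomputation of the results above.

```python
import pandas as pd

# Toy data (hypothetical): 4 Agent customers, 2 Web customers
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web"],
    "response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# True where the customer responded "Yes"
responded = toy["response"].eq("Yes")

# Response rate: fraction of customers in each channel who said "Yes"
rate_by_channel = responded.groupby(toy["sales_channel"]).mean()

# Share of all "Yes" responses that came through each channel
share_of_yes = toy.loc[responded, "sales_channel"].value_counts(normalize=True)

print(rate_by_channel)
print(share_of_yes)
```

In the toy data, Agent and Web each account for half of the "Yes" responses, yet the Web response rate (0.5) is twice the Agent rate (0.25) because fewer customers were reached through Web.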

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9