# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by policy type and gender for customers who responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use `melt` to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*
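
One possible reading of this hint, sketched on a small made-up sample rather than the real dataset: count the responses per channel with `pd.crosstab`, unpivot the counts with `melt`, and then derive the share of "Yes" answers. The column names `Sales Channel` and `Response` come from the dataset; the rows and the names `sample`, `counts`, `long_counts`, and `response_rate` are illustrative assumptions.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the real dataset
sample = pd.DataFrame({
    "Sales Channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web", "Branch", "Branch"],
    "Response": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# Wide table: one row per channel, one column per response value
counts = pd.crosstab(sample["Sales Channel"], sample["Response"]).reset_index()

# Unpivot to long form: one row per (channel, response) pair
long_counts = counts.melt(id_vars="Sales Channel", var_name="Response", value_name="Customers")

# Response rate = customers who answered "Yes" / all customers in that channel
totals = long_counts.groupby("Sales Channel")["Customers"].sum()
yes = long_counts[long_counts["Response"] == "Yes"].set_index("Sales Channel")["Customers"]
response_rate = (yes / totals).round(2)

print(response_rate)
```

On the real data, the same steps would presumably start from `data[["Sales Channel", "Response"]]` instead of `sample`.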

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [1]:
# Import pandas library 
import pandas as pd

In [2]:
# URL for the dataset
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'

# Load the data into a DataFrame
data = pd.read_csv(url)

# Display the first 5 rows of the DataFrame
data.head()

| | Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | ... | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size | Vehicle Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | DK49336 | Arizona | 4809.21696 | No | Basic | College | 2/18/11 | Employed | M | ... | 0.0 | 9 | Corporate Auto | Corporate L3 | Offer3 | Agent | 292.8 | Four-Door Car | Medsize | NaN |
| 1 | 1 | KX64629 | California | 2228.525238 | No | Basic | College | 1/18/11 | Unemployed | F | ... | 0.0 | 1 | Personal Auto | Personal L3 | Offer4 | Call Center | 744.924331 | Four-Door Car | Medsize | NaN |
| 2 | 2 | LZ68649 | Washington | 14947.9173 | No | Basic | Bachelor | 2/10/11 | Employed | M | ... | 0.0 | 2 | Personal Auto | Personal L3 | Offer3 | Call Center | 480.0 | SUV | Medsize | A |
| 3 | 3 | XL78013 | Oregon | 22332.43946 | Yes | Extended | College | 1/11/11 | Employed | M | ... | 0.0 | 2 | Corporate Auto | Corporate L3 | Offer2 | Branch | 484.013411 | Four-Door Car | Medsize | A |
| 4 | 4 | QA50777 | Oregon | 9025.067525 | No | Premium | Bachelor | 1/17/11 | Medical Leave | F | ... | NaN | 7 | Personal Auto | Personal L2 | Offer1 | Branch | 707.925645 | Four-Door Car | Medsize | NaN |


In [5]:
# Display the shape of the DataFrame
print(f"The DataFrame has {data.shape[0]} rows and {data.shape[1]} columns.")

The DataFrame has 10910 rows and 26 columns.


In [6]:
# Display the info of the DataFrame
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     10910 non-null  int64  
 1   Customer                       10910 non-null  object 
 2   State                          10279 non-null  object 
 3   Customer Lifetime Value        10910 non-null  float64
 4   Response                       10279 non-null  object 
 5   Coverage                       10910 non-null  object 
 6   Education                      10910 non-null  object 
 7   Effective To Date              10910 non-null  object 
 8   EmploymentStatus               10910 non-null  object 
 9   Gender                         10910 non-null  object 
 10  Income                         10910 non-null  int64  
 11  Location Code                  10910 non-null  object 
 12  Marital Status                 10910 non-null 

In [8]:
# Display statistical summary of the DataFrame
data.describe().round(2)

| | Unnamed: 0 | Customer Lifetime Value | Income | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Total Claim Amount |
|---|---|---|---|---|---|---|---|---|---|
| count | 10910.0 | 10910.0 | 10910.0 | 10910.0 | 10277.0 | 10910.0 | 10277.0 | 10910.0 | 10910.0 |
| mean | 5454.5 | 8018.24 | 37536.28 | 93.2 | 15.15 | 48.09 | 0.38 | 2.98 | 434.89 |
| std | 3149.59 | 6885.08 | 30359.2 | 34.44 | 10.08 | 27.94 | 0.91 | 2.4 | 292.18 |
| min | 0.0 | 1898.01 | 0.0 | 61.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.1 |
| 25% | 2727.25 | 4014.45 | 0.0 | 68.0 | 6.0 | 24.0 | 0.0 | 1.0 | 271.08 |
| 50% | 5454.5 | 5771.15 | 33813.5 | 83.0 | 14.0 | 48.0 | 0.0 | 2.0 | 382.56 |
| 75% | 8181.75 | 8992.78 | 62250.75 | 109.0 | 23.0 | 71.0 | 0.0 | 4.0 | 547.2 |
| max | 10909.0 | 83325.38 | 99981.0 | 298.0 | 35.0 | 99.0 | 5.0 | 9.0 | 2893.24 |


In [11]:
# Filter the DataFrame to keep only customers with a Total Claim Amount above $1,000 and a Response of "Yes"
filtered_df = data[(data['Total Claim Amount'] > 1000) & (data['Response'] == 'Yes')]

# Display the new DataFrame
filtered_df.head()

| | Unnamed: 0 | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | ... | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size | Vehicle Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189 | 189 | OK31456 | California | 11009.13049 | Yes | Premium | Bachelor | 1/24/11 | Employed | F | ... | 0.0 | 1 | Corporate Auto | Corporate L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | NaN |
| 236 | 236 | YJ16163 | Oregon | 11009.13049 | Yes | Premium | Bachelor | 1/24/11 | Employed | F | ... | 0.0 | 1 | Special Auto | Special L3 | Offer2 | Agent | 1358.4 | Luxury Car | Medsize | A |
| 419 | 419 | GW43195 | Oregon | 25807.063 | Yes | Extended | College | 2/13/11 | Employed | F | ... | 1.0 | 2 | Personal Auto | Personal L2 | Offer1 | Branch | 1027.2 | Luxury Car | Small | A |
| 442 | 442 | IP94270 | Arizona | 13736.1325 | Yes | Premium | Master | 2/13/11 | Disabled | F | ... | 0.0 | 8 | Personal Auto | Personal L2 | Offer1 | Web | 1261.319869 | SUV | Medsize | A |
| 587 | 587 | FJ28407 | California | 5619.689084 | Yes | Premium | High School or Below | 1/26/11 | Unemployed | M | ... | 0.0 | 1 | Personal Auto | Personal L1 | Offer2 | Web | 1027.000029 | SUV | Medsize | A |
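
The same two-condition filter can also be written with `DataFrame.query`, which some readers find easier to scan. The sketch below uses made-up rows (only the column names mirror the dataset) and checks that both forms select the same customers; `sample`, `mask_result`, and `query_result` are illustrative names.

```python
import pandas as pd

# Made-up rows; the column names mirror the real dataset
sample = pd.DataFrame({
    "Customer": ["A1", "B2", "C3", "D4"],
    "Total Claim Amount": [1358.4, 292.8, 1027.2, 1500.0],
    "Response": ["Yes", "Yes", "No", "No"],
})

# Boolean-mask version, same pattern as the cell above
mask_result = sample[(sample["Total Claim Amount"] > 1000) & (sample["Response"] == "Yes")]

# query() version; backticks quote column names that contain spaces
query_result = sample.query("`Total Claim Amount` > 1000 and Response == 'Yes'")

print(query_result)
```

Note that each condition in the mask version needs its own parentheses, because `&` binds more tightly than `>` and `==` in Python.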


In [12]:
# Filter the DataFrame to keep only customers who responded "Yes" to the last marketing campaign
filtered_df = data[data['Response'] == 'Yes']

# Build a pivot table with the average Total Claim Amount by policy type and gender
pivot_avg_claim = filtered_df.pivot_table(values='Total Claim Amount', index='Policy Type', columns='Gender', aggfunc='mean').round(2)

# Display the resulting table
print(pivot_avg_claim)

Gender               F       M
Policy Type                   
Corporate Auto  433.74  408.58
Personal Auto   452.97  457.01
Special Auto    453.28  429.53
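
For comparison, a `groupby` followed by `unstack` should produce the same layout as the pivot table above. This is a minimal sketch on made-up rows; the values and the names `sample`, `responders`, and `avg_claim` are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows; the column names mirror the real dataset
sample = pd.DataFrame({
    "Policy Type": ["Personal Auto", "Personal Auto", "Personal Auto",
                    "Corporate Auto", "Corporate Auto", "Corporate Auto"],
    "Gender": ["F", "F", "M", "F", "M", "M"],
    "Response": ["Yes", "Yes", "Yes", "Yes", "Yes", "No"],
    "Total Claim Amount": [400.0, 500.0, 300.0, 200.0, 600.0, 900.0],
})

# Keep only customers who responded "Yes"
responders = sample[sample["Response"] == "Yes"]

# Mean claim per (policy type, gender), with gender moved to the columns
avg_claim = (
    responders.groupby(["Policy Type", "Gender"])["Total Claim Amount"]
    .mean()
    .unstack("Gender")
    .round(2)
)

print(avg_claim)
```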


In [13]:
# Count the total number of customers with policies in each state
customer_count_by_state = data['State'].value_counts()

# Keep only the states with more than 500 customers
filtered_states = customer_count_by_state[customer_count_by_state > 500]

# Display the results
print(filtered_states)

State
California    3552
Oregon        2909
Arizona       1937
Nevada         993
Washington     888
Name: count, dtype: int64
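
The `data.info()` output earlier reports 10279 non-null `State` values out of 10910 rows, and `value_counts` skips missing values by default, so the counts above cover only the rows where the state is known. The sketch below illustrates that behaviour on made-up rows; `sample`, `counts`, and `counts_with_missing` are illustrative names, and the threshold of 1 is a toy stand-in for 500.

```python
import pandas as pd

# Made-up rows: two known states plus three rows with a missing state
sample = pd.DataFrame({"State": ["California", "California", "Oregon", None, None, None]})

# Default behaviour: rows with a missing State are ignored
counts = sample["State"].value_counts()

# dropna=False adds a NaN entry so the missing rows are counted too
counts_with_missing = sample["State"].value_counts(dropna=False)

# Same threshold filter as above, with a toy threshold of 1
print(counts[counts > 1])
print(counts_with_missing)
```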


In [14]:
# Build a pivot table with the maximum, minimum, and median Customer Lifetime Value by education level and gender
pivot_clv_stats = data.pivot_table(values='Customer Lifetime Value', index=['Education', 'Gender'], aggfunc={'Customer Lifetime Value': ['max', 'min', 'median']}).round(2)

# Display the resulting table
print(pivot_clv_stats)

                                  max   median      min
Education            Gender                            
Bachelor             F       73225.96  5640.51  1904.00
                     M       67907.27  5548.03  1898.01
College              F       61850.19  5623.61  1898.68
                     M       61134.68  6005.85  1918.12
Doctor               F       44856.11  5332.46  2395.57
                     M       32677.34  5577.67  2267.60
High School or Below F       55277.45  6039.55  2144.92
                     M       83325.38  6286.73  1940.98
Master               F       51016.07  5729.86  2417.78
                     M       50568.26  5579.10  2272.31
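
An equivalent formulation, sketched here on made-up rows, groups by education and gender and passes a list of statistics to `agg`. Unlike the pivot table, this keeps the columns in the order given. The values and the names `sample` and `clv_stats` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; the column names mirror the real dataset
sample = pd.DataFrame({
    "Education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master"],
    "Gender": ["F", "F", "F", "M", "M"],
    "Customer Lifetime Value": [2000.0, 5000.0, 9000.0, 3000.0, 7000.0],
})

# One row per (education, gender) group, one column per statistic
clv_stats = (
    sample.groupby(["Education", "Gender"])["Customer Lifetime Value"]
    .agg(["max", "min", "median"])
    .round(2)
)

print(clv_stats)
```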


In [20]:
# Make sure the 'Effective To Date' column has a datetime dtype
data['Effective To Date'] = pd.to_datetime(data['Effective To Date'])

# Extract the month from the 'Effective To Date' column
data['month'] = data['Effective To Date'].dt.month

# Build a pivot table with the number of policies sold by state, policy type, and month
pivot_policies = data.pivot_table(values='Customer', index=['State', 'Policy Type'], columns='month', aggfunc='count', fill_value=0)

# Display the resulting table
print(pivot_policies)

month                         1     2
State      Policy Type               
Arizona    Corporate Auto   214   167
           Personal Auto    747   722
           Special Auto      47    40
California Corporate Auto   461   374
           Personal Auto   1391  1203
           Special Auto      66    57
Nevada     Corporate Auto   120    99
           Personal Auto    405   334
           Special Auto      26     9
Oregon     Corporate Auto   305   287
           Personal Auto   1191   989
           Special Auto      69    68
Washington Corporate Auto    86    89
           Personal Auto    363   319
           Special Auto      14    17
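
The table above breaks each state down by policy type as well. For the layout the task describes, with only states as rows and months as columns, a sketch on made-up rows could look like the following. The explicit `format="%m/%d/%y"` is an assumption based on dates such as `2/18/11` in the data preview, and `sample` and `policies_by_state_month` are illustrative names.

```python
import pandas as pd

# Made-up rows; the column names mirror the real dataset
sample = pd.DataFrame({
    "Customer": ["A1", "B2", "C3", "D4", "E5"],
    "State": ["Oregon", "Oregon", "Oregon", "Nevada", "Nevada"],
    "Effective To Date": ["1/11/11", "1/17/11", "2/10/11", "2/18/11", "2/3/11"],
})

# Assumed format: month/day/two-digit year, as in 2/18/11
sample["Effective To Date"] = pd.to_datetime(sample["Effective To Date"], format="%m/%d/%y")
sample["month"] = sample["Effective To Date"].dt.month

# States as rows, months as columns, customer count in each cell
policies_by_state_month = sample.pivot_table(
    values="Customer", index="State", columns="month", aggfunc="count", fill_value=0
)

print(policies_by_state_month)
```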


In [21]:
# Make sure the 'Effective To Date' column has a datetime dtype
data['Effective To Date'] = pd.to_datetime(data['Effective To Date'])

# Extract the month from the 'Effective To Date' column
data['month'] = data['Effective To Date'].dt.month

# Group the data by state and month, and count the policies sold in each group
grouped = data.groupby(['State', 'month']).size().reset_index(name='Number of Policies Sold')

# Total the policies sold per state and sort the states in descending order
sorted_grouped = grouped.groupby('State')['Number of Policies Sold'].sum().sort_values(ascending=False).reset_index()

# Select the 3 states with the highest number of policies sold
top_3_states = sorted_grouped.head(3)['State']

# Build a new DataFrame with the number of policies sold per month for each of the top 3 states
top_3_df = grouped[grouped['State'].isin(top_3_states)]

# Display the new DataFrame
print(top_3_df)

        State  month  Number of Policies Sold
0     Arizona      1                     1008
1     Arizona      2                      929
2  California      1                     1918
3  California      2                     1634
6      Oregon      1                     1565
7      Oregon      2                     1344
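
A more compact variant of the same idea, sketched on made-up rows: pick the top 3 states with `value_counts().nlargest(3)`, then count by state and month and spread the months across columns. It assumes one row per policy sold, as the cell above does; the data and the names `sample`, `top_states`, and `top_counts` are illustrative.

```python
import pandas as pd

# Made-up rows: one row per policy, with a precomputed month column
sample = pd.DataFrame({
    "State": ["California"] * 4 + ["Oregon"] * 3 + ["Arizona"] * 2 + ["Nevada"],
    "month": [1, 1, 2, 2, 1, 1, 2, 1, 2, 1],
})

# Names of the three states with the most rows
top_states = sample["State"].value_counts().nlargest(3).index

# Count policies per (state, month) for those states, months as columns
top_counts = (
    sample[sample["State"].isin(top_states)]
    .groupby(["State", "month"])
    .size()
    .unstack("month", fill_value=0)
)

print(top_counts)
```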
