# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [56]:
import pandas as pd

url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'
marketing_df = pd.read_csv(url)

In [57]:
marketing_df

,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,0,DK49336,Arizona,4809.216960,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.800000,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.917300,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.000000,SUV,Medsize,A
3,3,XL78013,Oregon,22332.439460,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,10905,FE99816,Nevada,15563.369440,No,Premium,Bachelor,1/19/11,Unemployed,F,...,,7,Personal Auto,Personal L1,Offer3,Web,1214.400000,Luxury Car,Medsize,A
10906,10906,KX53892,Oregon,5259.444853,No,Basic,College,1/6/11,Employed,F,...,0.0,6,Personal Auto,Personal L3,Offer2,Branch,273.018929,Four-Door Car,Medsize,A
10907,10907,TL39050,Arizona,23893.304100,No,Extended,Bachelor,2/6/11,Employed,F,...,0.0,2,Corporate Auto,Corporate L3,Offer1,Web,381.306996,Luxury SUV,Medsize,
10908,10908,WA60547,California,11971.977650,No,Premium,College,2/13/11,Employed,F,...,4.0,6,Personal Auto,Personal L1,Offer1,Branch,618.288849,SUV,Medsize,A


In [58]:
def first_clean(df):
    df.columns = df.columns.str.replace(' ', '_')
    df.columns = df.columns.str.lower()
    return df

In [59]:
marketing_df = first_clean(marketing_df)
print(marketing_df.columns)

Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type'],
      dtype='object')
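
Note that `first_clean` assigns to `df.columns` on the DataFrame it receives, so the caller's original object is renamed as well. The following is a minimal sketch of a variant that works on a copy instead. The name `clean_columns` and the toy data are assumptions made for illustration and are not part of the lab.

```python
import pandas as pd


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with lower-case, underscore-separated column names."""
    renamed = df.copy()
    renamed.columns = renamed.columns.str.strip().str.lower().str.replace(" ", "_")
    return renamed


# Toy frame (hypothetical values) with the same naming style as the raw CSV
toy = pd.DataFrame({"Total Claim Amount": [292.8], "Policy Type": ["Corporate Auto"]})
cleaned = clean_columns(toy)

print(list(cleaned.columns))  # cleaned names
print(list(toy.columns))      # the input keeps its original names
```

Working on a copy keeps the raw DataFrame available for comparison, at the cost of some extra memory.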


In [60]:
marketing_df = marketing_df.drop(columns = ["unnamed:_0"])
marketing_df = marketing_df.dropna(how='all')
marketing_df

,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
0,DK49336,Arizona,4809.216960,No,Basic,College,2/18/11,Employed,M,48029,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.800000,Four-Door Car,Medsize,
1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,0,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,LZ68649,Washington,14947.917300,No,Basic,Bachelor,2/10/11,Employed,M,22139,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.000000,SUV,Medsize,A
3,XL78013,Oregon,22332.439460,Yes,Extended,College,1/11/11,Employed,M,49078,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,23675,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,FE99816,Nevada,15563.369440,No,Premium,Bachelor,1/19/11,Unemployed,F,0,...,,7,Personal Auto,Personal L1,Offer3,Web,1214.400000,Luxury Car,Medsize,A
10906,KX53892,Oregon,5259.444853,No,Basic,College,1/6/11,Employed,F,61146,...,0.0,6,Personal Auto,Personal L3,Offer2,Branch,273.018929,Four-Door Car,Medsize,A
10907,TL39050,Arizona,23893.304100,No,Extended,Bachelor,2/6/11,Employed,F,39837,...,0.0,2,Corporate Auto,Corporate L3,Offer1,Web,381.306996,Luxury SUV,Medsize,
10908,WA60547,California,11971.977650,No,Premium,College,2/13/11,Employed,F,64195,...,4.0,6,Personal Auto,Personal L1,Offer1,Branch,618.288849,SUV,Medsize,A


In [61]:
columns_list = marketing_df.columns.tolist()

df_numerical_variables = marketing_df.describe()

numerical_variables = df_numerical_variables.columns.tolist()

categorical_variables = list(filter(lambda x: x not in numerical_variables, columns_list))

In [62]:
categorical_variables

['customer',
 'state',
 'response',
 'coverage',
 'education',
 'effective_to_date',
 'employmentstatus',
 'gender',
 'location_code',
 'marital_status',
 'policy_type',
 'policy',
 'renew_offer_type',
 'sales_channel',
 'vehicle_class',
 'vehicle_size',
 'vehicle_type']
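
The split above derives the numerical columns from `describe()` and treats everything else as categorical. A possible alternative, sketched here on a small toy frame with made-up rows, is `select_dtypes`, which selects columns by dtype directly. The variable names below are assumptions for illustration.

```python
import pandas as pd

# Toy frame (hypothetical rows) mixing text and numeric columns
toy = pd.DataFrame({
    "state": ["Arizona", "Oregon"],
    "income": [48029, 49078],
    "customer_lifetime_value": [4809.21696, 22332.43946],
})

# Columns with a numeric dtype, in their original order
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (text, dates stored as strings, ...)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)
print(categorical_cols)
```

Both approaches should agree as long as every numeric column is reported by `describe()`; `select_dtypes` simply avoids computing the summary statistics.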

In [63]:
for i in categorical_variables:
    print(i)
    print(marketing_df[i].unique())

customer
['DK49336' 'KX64629' 'LZ68649' ... 'KX53892' 'TL39050' 'WA60547']
state
['Arizona' 'California' 'Washington' 'Oregon' nan 'Nevada']
response
['No' 'Yes' nan]
coverage
['Basic' 'Extended' 'Premium']
education
['College' 'Bachelor' 'High School or Below' 'Doctor' 'Master']
effective_to_date
['2/18/11' '1/18/11' '2/10/11' '1/11/11' '1/17/11' '2/14/11' '2/24/11'
 '1/19/11' '1/4/11' '1/2/11' '2/7/11' '1/31/11' '1/26/11' '2/28/11'
 '1/16/11' '2/26/11' '2/23/11' '1/15/11' '2/2/11' '2/15/11' '1/24/11'
 '2/21/11' '2/22/11' '1/7/11' '1/28/11' '2/8/11' '2/12/11' '2/20/11'
 '1/5/11' '2/19/11' '1/3/11' '2/3/11' '1/22/11' '1/23/11' '2/5/11'
 '2/13/11' '1/25/11' '2/16/11' '2/1/11' '1/27/11' '1/12/11' '1/20/11'
 '2/6/11' '2/11/11' '1/21/11' '1/29/11' '1/9/11' '2/9/11' '2/27/11'
 '1/1/11' '2/17/11' '2/25/11' '1/13/11' '1/6/11' '2/4/11' '1/14/11'
 '1/10/11' '1/8/11' '1/30/11']
employmentstatus
['Employed' 'Unemployed' 'Medical Leave' 'Disabled' 'Retired']
gender
['M' 'F']
location_code
['Suburb

In [64]:
for i in numerical_variables:
    print(i)
    print(marketing_df[i].unique())

customer_lifetime_value
[ 4809.21696   2228.525238 14947.9173   ...  5259.444853 23893.3041
 11971.97765 ]
income
[48029     0 22139 ... 61146 39837 64195]
monthly_premium_auto
[ 61  64 100  97 117  63 154  85 127  62  99  69 116 114  66  73  94 104
 189  74 121 110 111  72 115 159 101  65  82  71 126  68 199  96  67 125
 249 105  92  78  77  79 223 242  70 102 109 107 119 194 113 106 247  80
  86  81  83 122 253 196 132 139  84 130  93 103 112 222 118  88 182 283
  90 128  89 235 190  76  87 133 153 129  98 148 123  91 211 131 108 187
 214 181 173 252  95 124 137 145 188 143 198 138 245 195 186 170 136 161
 157 141 205 271 192 142 140 134 240 185 244 210 184 202 296 213 273 219
 135 169 155 225 266 215 197 256 212 158 180 166 168 183 162 191 179 150
 146 276 165 239 237 193 229 274 207 295 208 172 217 206 201 171 152 156
 174 238 167 151 144 163 287 209 290 220 228 232 178 177 275 176 281 149
 298 255 216 285 226 160 147 254 164 175 297 234 284 204 218 261 231 248
 286 230 268 203]
mo

In [65]:
# 1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [66]:
marketing_filtered = marketing_df[(marketing_df['total_claim_amount'] > 1000) & (marketing_df['response'] == "Yes")]

In [67]:
marketing_filtered

,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
189,OK31456,California,11009.130490,Yes,Premium,Bachelor,1/24/11,Employed,F,51643,...,0.0,1,Corporate Auto,Corporate L3,Offer2,Agent,1358.400000,Luxury Car,Medsize,
236,YJ16163,Oregon,11009.130490,Yes,Premium,Bachelor,1/24/11,Employed,F,51643,...,0.0,1,Special Auto,Special L3,Offer2,Agent,1358.400000,Luxury Car,Medsize,A
419,GW43195,Oregon,25807.063000,Yes,Extended,College,2/13/11,Employed,F,71210,...,1.0,2,Personal Auto,Personal L2,Offer1,Branch,1027.200000,Luxury Car,Small,A
442,IP94270,Arizona,13736.132500,Yes,Premium,Master,2/13/11,Disabled,F,16181,...,0.0,8,Personal Auto,Personal L2,Offer1,Web,1261.319869,SUV,Medsize,A
587,FJ28407,California,5619.689084,Yes,Premium,High School or Below,1/26/11,Unemployed,M,0,...,0.0,1,Personal Auto,Personal L1,Offer2,Web,1027.000029,SUV,Medsize,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10351,FN44127,Oregon,3508.569533,Yes,Extended,College,1/5/11,Medical Leave,M,20978,...,1.0,1,Personal Auto,Personal L2,Offer2,Branch,1176.278800,Four-Door Car,Small,
10373,XZ64172,Oregon,10963.957230,Yes,Premium,High School or Below,2/8/11,Employed,M,55687,...,0.0,1,Corporate Auto,Corporate L2,Offer1,Agent,1324.800000,Luxury SUV,Medsize,
10487,IX60941,Oregon,3508.569533,Yes,Extended,College,1/5/11,Medical Leave,M,20978,...,1.0,1,Personal Auto,Personal L3,Offer2,Branch,1176.278800,Four-Door Car,Small,
10565,QO62792,Oregon,7840.165778,Yes,Extended,College,1/14/11,Employed,M,58414,...,2.0,1,Personal Auto,Personal L3,Offer2,Agent,1008.000000,,,


In [68]:
marketing_filtered["total_claim_amount"].unique()

array([1358.4     , 1027.2     , 1261.319869, 1027.000029, 1300.8     ,
       1032.      , 1008.      , 1324.8     , 1176.2788  , 1294.700423])
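
The filter in task 1 combines two boolean masks with `&`. Each condition needs its own parentheses because `&` binds more tightly than `>` and `==`. The sketch below uses invented rows to show how the mask behaves, including for a missing `response`, which compares as not equal to "Yes" and is therefore dropped.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical): only the first one satisfies both conditions
toy = pd.DataFrame({
    "total_claim_amount": [1358.4, 1500.0, 480.0, 1200.0],
    "response": ["Yes", "No", "Yes", np.nan],
})

high_claim = toy["total_claim_amount"] > 1000   # True, True, False, True
said_yes = toy["response"] == "Yes"             # True, False, True, False (NaN -> False)

filtered = toy[high_claim & said_yes]
print(filtered)
```

Naming the two masks separately is optional, but it makes each condition easy to inspect on its own.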

In [69]:
#2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

In [70]:
df_yes_resp = marketing_df[(marketing_df['response'] == "Yes")]

In [71]:
result_df = df_yes_resp.groupby(['policy_type', 'gender'])['total_claim_amount'].mean().reset_index()

In [72]:
result_df.set_index('policy_type', inplace = True)
result_df

policy_type,gender,total_claim_amount
Corporate Auto,F,433.738499
Corporate Auto,M,408.582459
Personal Auto,F,452.965929
Personal Auto,M,457.010178
Special Auto,F,453.280164
Special Auto,M,429.527942
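
To compare genders within each policy type, it can help to place them side by side. The sketch below, built on invented numbers, unstacks the `gender` level into columns and adds a difference column. The column name `F_minus_M` is an assumption for illustration.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "M", "F"],
    "response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "total_claim_amount": [400.0, 500.0, 300.0, 350.0, 900.0],
})

yes_only = toy[toy["response"] == "Yes"]

# Mean claim per (policy_type, gender), with gender moved into the columns
mean_claims = (
    yes_only.groupby(["policy_type", "gender"])["total_claim_amount"]
    .mean()
    .unstack("gender")
)
mean_claims["F_minus_M"] = mean_claims["F"] - mean_claims["M"]
print(mean_claims)
```

A wide layout like this makes the gap between genders readable at a glance, which is useful when writing the conclusions the task asks for.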


In [73]:
# 3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [74]:
count_customers = marketing_df['state'].value_counts().reset_index()

count_customers.columns = ['state', 'total_customers']  # rename the counts column so its meaning is clearer

count_customers

,state,total_customers
0,California,3552
1,Oregon,2909
2,Arizona,1937
3,Nevada,993
4,Washington,888


In [75]:
# Keep only the states with more than 500 customers

filter_state = count_customers[count_customers['total_customers'] > 500]

filter_state

,state,total_customers
0,California,3552
1,Oregon,2909
2,Arizona,1937
3,Nevada,993
4,Washington,888
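
In the output above every state has more than 500 customers, so the filter leaves the table unchanged. The sketch below shows the same count-then-filter pattern on a toy column where the threshold does remove rows. The threshold of 1 and the name `min_customers` are assumptions chosen for the toy data.

```python
import pandas as pd

# Toy column (hypothetical): one missing state, which value_counts ignores by default
toy = pd.DataFrame({
    "state": ["California"] * 3 + ["Oregon"] * 2 + ["Nevada"] + [None],
})

min_customers = 1  # stand-in for the 500 used in the lab

counts = toy["state"].value_counts()          # customers per state, largest first
large_states = counts[counts > min_customers]  # keep states above the threshold
print(large_states)
```

Because `value_counts` skips missing values by default, rows with no state do not appear in the result.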


In [76]:
#4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [77]:
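# Note: task 4 asks for the median, while this cell computes the mean; replace 'mean' with 'median' in the list to match the task.
statistic_result = marketing_df.groupby(['education', 'gender'])['customer_lifetime_value'].agg(['max', 'min', 'mean']).round(2).reset_index()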

statistic_result

,education,gender,max,min,mean
0,Bachelor,F,73225.96,1904.0,7874.27
1,Bachelor,M,67907.27,1898.01,7703.6
2,College,F,61850.19,1898.68,7748.82
3,College,M,61134.68,1918.12,8052.46
4,Doctor,F,44856.11,2395.57,7328.51
5,Doctor,M,32677.34,2267.6,7415.33
6,High School or Below,F,55277.45,2144.92,8675.22
7,High School or Below,M,83325.38,1940.98,8149.69
8,Master,F,51016.07,2417.78,8157.05
9,Master,M,50568.26,2272.31,8168.83
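
Task 4 asks for the median, whereas the cell above aggregates with `'mean'`. The two can differ considerably when a few customers have a very large lifetime value. The sketch below uses invented values to show both statistics side by side.

```python
import pandas as pd

# Toy data (hypothetical): one very large value in the Bachelor/F group
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master"],
    "gender": ["F", "F", "F", "M", "M"],
    "customer_lifetime_value": [2000.0, 3000.0, 70000.0, 4000.0, 6000.0],
})

stats = (
    toy.groupby(["education", "gender"])["customer_lifetime_value"]
    .agg(["max", "min", "median", "mean"])
    .round(2)
)
print(stats)
```

In the Bachelor/F group the single large value pulls the mean far above the median, which is why the median is often preferred for skewed quantities such as lifetime value.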


In [78]:
#Bonus 5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [79]:
# Make sure 'effective_to_date' is parsed as a date (values look like '2/18/11')
marketing_df['effective_to_date'] = pd.to_datetime(marketing_df['effective_to_date'], format='%m/%d/%y', errors='coerce')

# Create a new 'month' column holding the month of the effective date
marketing_df['month'] = marketing_df['effective_to_date'].dt.month

# Group by state and month, and sum number_of_policies in each group
sales_state_month_files = marketing_df.groupby(['state', 'month'])['number_of_policies'].sum().unstack(fill_value=0)

# Show the results
sales_state_month_files


state / month,1,2
Arizona,3052,2864
California,5673,4929
Nevada,1493,1278
Oregon,4697,3969
Washington,1358,1225
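
The table above sums the `number_of_policies` column. Another possible reading of "policies sold" is one sale per row, in which case the rows would be counted instead. Which reading the marketing team intends is an assumption either way. The sketch below computes both on invented rows.

```python
import pandas as pd

# Toy rows (hypothetical values)
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Oregon", "Oregon", "Oregon"],
    "effective_to_date": ["1/18/11", "2/10/11", "1/11/11", "1/17/11", "2/24/11"],
    "number_of_policies": [9, 1, 2, 7, 3],
})
toy["month"] = pd.to_datetime(toy["effective_to_date"], format="%m/%d/%y").dt.month

# Reading A: add up number_of_policies per state and month
summed = toy.pivot_table(index="state", columns="month",
                         values="number_of_policies", aggfunc="sum", fill_value=0)

# Reading B: count one sale per row per state and month
counted = pd.crosstab(toy["state"], toy["month"])

print(summed)
print(counted)
```

Both tables have states as rows and months as columns, so either can be presented in the layout the task asks for.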


In [80]:
#6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

#*Hint:*
#- To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.
#- Next, you will select the top 3 states with the highest number of policies sold.
#- Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.

In [81]:
# 1. Group the data by state and month, and sum the number of policies sold
sales_state_month = marketing_df.groupby(['state', 'month'])['number_of_policies'].sum().reset_index()

# 2. Total the policies sold per state and sort in descending order
total_sales_state = sales_state_month.groupby('state')['number_of_policies'].sum().sort_values(ascending=False)

# 3. Select the 3 states with the highest number of policies sold
top_3_states = total_sales_state.head(3).index

# 4. Filter the grouped DataFrame to keep only the top 3 states
df_top_3_estados = sales_state_month[sales_state_month['state'].isin(top_3_states)]

# 5. Build a DataFrame with the result: policies sold by month and by state for the top 3 states
sales_top_3_estados = df_top_3_estados.pivot_table(index='state', columns='month', values='number_of_policies', aggfunc='sum', fill_value=0)

# Show the resulting DataFrame
sales_top_3_estados

state / month,1,2
Arizona,3052,2864
California,5673,4929
Oregon,4697,3969
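
The pivoted result above lists the top 3 states alphabetically, because `pivot_table` sorts its index. If the ranking itself should be visible, one option is to select the rows with `nlargest` on the row totals, which keeps them ordered from most to fewest policies. This is a sketch on invented states and counts, not the lab's required approach.

```python
import pandas as pd

# Toy data (hypothetical states and counts)
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "month": [1, 2, 1, 2, 1, 2, 1, 2],
    "number_of_policies": [10, 5, 7, 7, 1, 2, 20, 1],
})

# States as rows, months as columns
by_state_month = (
    toy.groupby(["state", "month"])["number_of_policies"].sum().unstack(fill_value=0)
)

# Rank states by their total across months and keep the three largest
top_states = by_state_month.sum(axis=1).nlargest(3).index
top_table = by_state_month.loc[top_states]
print(top_table)
```

Here the totals are D = 21, A = 15, B = 14 and C = 3, so the table keeps D, A and B in that order.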


In [82]:
#7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.
#Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.


In [83]:
# 1. Keep the customers who responded "Yes" to the marketing campaign
customers_yes = marketing_df[marketing_df['response'] == 'Yes']

# 2. Group by marketing channel and count the "Yes" responses
tasa_respuesta_por_canal = customers_yes.groupby('sales_channel')['response'].count().reset_index()
tasa_respuesta_por_canal.rename(columns={'response': 'numero_respuestas_si'}, inplace=True)

# 3. Count the customers with a recorded response per marketing channel (including "No")
total_customers_por_canal = marketing_df.groupby('sales_channel')['response'].count().reset_index()
total_customers_por_canal.rename(columns={'response': 'total_clientes'}, inplace=True)

# 4. Merge the two DataFrames so the response rate can be computed
tasa_respuesta_por_canal = pd.merge(tasa_respuesta_por_canal, total_customers_por_canal, on='sales_channel')

# 5. Compute the response rate: "Yes" responses divided by total customers
tasa_respuesta_por_canal['tasa_respuesta'] = tasa_respuesta_por_canal['numero_respuestas_si'] / tasa_respuesta_por_canal['total_clientes']

# 6. Use melt to unpivot the result into a long (channel, metric, value) table
tasa_respuesta_melted = tasa_respuesta_por_canal.melt(id_vars=['sales_channel'], value_vars=['tasa_respuesta'],
                                                     var_name='metric', value_name='value')

# Show the resulting DataFrame
tasa_respuesta_melted

,sales_channel,metric,value
0,Agent,tasa_respuesta,0.190746
1,Branch,tasa_respuesta,0.113787
2,Call Center,tasa_respuesta,0.109786
3,Web,tasa_respuesta,0.117141
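
The cell above melts a single column, so `melt` mostly relabels the data. A sketch of a case where `melt` does real unpivoting: build a wide table of response shares per channel with `pd.crosstab`, then melt it into one row per (channel, response) pair. The toy rows and the names `rates_wide`, `rates_long` and `yes_rates` are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Web", "Web",
                      "Branch", "Branch", "Branch", "Agent"],
    "response": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

# Wide table: one row per channel, one column per response, each row summing to 1
rates_wide = pd.crosstab(toy["sales_channel"], toy["response"], normalize="index")

# Long table: one row per (channel, response) pair
rates_long = rates_wide.reset_index().melt(
    id_vars="sales_channel", var_name="response", value_name="rate"
)

# Keep only the "Yes" share, i.e. the response rate per channel
yes_rates = rates_long[rates_long["response"] == "Yes"].set_index("sales_channel")["rate"]
print(yes_rates)
```

On the toy rows the Agent channel has 2 "Yes" out of 3 responses, Web 1 out of 2, and Branch 1 out of 3. Rows with a missing response are left out of the crosstab, which matches the way `count()` ignores them in the cell above.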
