# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [1]:
import pandas as pd

# Helper to clean up the column names
def cleaning_columns_names(df):
    # Lowercase the names and replace spaces with underscores
    new_columns = [column.lower().replace(' ', '_') for column in df.columns]
    df.columns = new_columns
    return df

# Load the dataset
df = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv')

# Clean the column names
df = cleaning_columns_names(df)

# Keep customers with total_claim_amount > 1000 who responded "Yes"
df1 = df[(df['total_claim_amount'] > 1000) & (df['response'] == 'Yes')]
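
The filter above combines two boolean masks with `&`; each comparison needs its own parentheses because `&` binds more tightly than `>` and `==`. As a minimal sketch, assuming a small made-up frame (the `toy` values are illustrative and not taken from the dataset), the same condition can also be written with `DataFrame.query`:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "total_claim_amount": [1200.0, 800.0, 1500.0, 1001.0],
    "response": ["Yes", "Yes", "No", "Yes"],
})

# Boolean-mask form: parentheses are required around each comparison
mask_result = toy[(toy["total_claim_amount"] > 1000) & (toy["response"] == "Yes")]

# Equivalent query-string form
query_result = toy.query("total_claim_amount > 1000 and response == 'Yes'")

# Both forms should select the same rows
print(mask_result.equals(query_result))
```

Which form to use is mostly a matter of taste; the mask form works with any column name, while `query` reads more like the task statement.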

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

In [None]:
df2 = df[df['response'] == 'Yes'].groupby(['policy_type','gender'])['total_claim_amount'].mean()
df2

# CONCLUSIONS:

# Among the customers who responded "Yes" to the campaign:

# Women with "Special Auto" policies have the highest average claim amount, followed by men with the same policy type.

# "Corporate Auto" policies also show high average claim amounts, especially among women.

# "Personal Auto" has the lowest average amounts, regardless of gender.

# This suggests that customers with more specialized or business policies (such as Special Auto and Corporate Auto)
# tend to file larger claims, which could be due to broader coverage or the type of incidents involved.
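
Comparing genders within each policy type is easier when the grouped means are laid out side by side. The following is a minimal sketch, assuming a small made-up frame (the `toy` values and the `F_minus_M` column name are illustrative, not from the dataset), that unstacks the gender level into columns:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "M", "F"],
    "response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "total_claim_amount": [100.0, 200.0, 300.0, 500.0, 900.0],
})

# Same grouping as in the cell above, then move gender from the index to columns
means = (
    toy[toy["response"] == "Yes"]
    .groupby(["policy_type", "gender"])["total_claim_amount"]
    .mean()
    .unstack("gender")
)

# Hypothetical helper column: difference between the female and male averages
means["F_minus_M"] = means["F"] - means["M"]
print(means)
```

Applied to the real data, a table like this makes it straightforward to check each written conclusion against the numbers.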

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [None]:
# Count the number of customers per state
df3 = df.groupby('state')['customer'].count()

# Keep only the states with more than 500 customers
df3 = df3[df3 > 500]

# Sort in descending order and display the result
df3 = df3.sort_values(ascending=False)
df3

# CONCLUSIONS

# Only five states appear in the result, each with a large customer base (more than 500).
# These may be key markets where the insurer has a stronger presence.
# California and Oregon, for example, could be the main targets for future campaigns because of their large existing customer volume.

state
California    3552
Oregon        2909
Arizona       1937
Nevada         993
Washington     888
Name: customer, dtype: int64
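
By default, `groupby` drops rows whose grouping key is missing, so customers without a `state` value are not counted in the result above. A minimal sketch, assuming a small made-up frame and a stand-in threshold of 2 instead of 500, shows an alternative with `value_counts` and a quick check for missing states:

```python
import pandas as pd

# Made-up rows for illustration only; one customer has no state recorded
toy = pd.DataFrame({
    "customer": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "state": ["California", "California", "California", "Oregon", "Oregon", None],
})

threshold = 2  # stand-in for the 500 used in the lab

# value_counts sorts in descending order and skips missing values by default
counts = toy["state"].value_counts()
large_states = counts[counts > threshold]

# Number of rows left out of the counts because the state is missing
missing_states = toy["state"].isna().sum()

print(large_states)
print(missing_states)
```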

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [None]:
df4 = df.groupby(['education','gender'])['customer_lifetime_value'].agg(['max','min','median'])
df4

# CONCLUSIONS

# 1. Men with the lowest education level ("High School or Below") have the highest CLV (both maximum and median)

# 2. Women with a bachelor's degree ("Bachelor") have the highest maximum CLV among women

# 3. Customers with a doctorate have a more moderate CLV

# 4. The median is fairly stable across education levels (between 5,300 and 6,300), except at the extremes

# 5. There are large differences between the minimum and maximum CLV within each group

# Although education level and gender influence CLV, there is no direct, clear correlation.
# However, some groups stand out:

# Men with low education: surprisingly valuable, possibly because of spending habits or loyalty.

# Women with a bachelor's degree: a consistent and profitable segment.

# Customers with a doctorate: more moderate value on average.


| education | gender | max | min | median |
|---|---|---|---|---|
| Bachelor | F | 73225.95652 | 1904.000852 | 5640.505303 |
| Bachelor | M | 67907.2705 | 1898.007675 | 5548.031892 |
| College | F | 61850.18803 | 1898.683686 | 5623.611187 |
| College | M | 61134.68307 | 1918.1197 | 6005.847375 |
| Doctor | F | 44856.11397 | 2395.57 | 5332.462694 |
| Doctor | M | 32677.34284 | 2267.604038 | 5577.669457 |
| High School or Below | F | 55277.44589 | 2144.921535 | 6039.553187 |
| High School or Below | M | 83325.38119 | 1940.981221 | 6286.731006 |
| Master | F | 51016.06704 | 2417.777032 | 5729.855012 |
| Master | M | 50568.25912 | 2272.30731 | 5579.099207 |
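
The gap between minimum and maximum CLV mentioned in the conclusions can be made explicit as its own column. This is a minimal sketch, assuming a small made-up frame; the output column names (`clv_max`, `clv_min`, `clv_median`, `clv_range`) are hypothetical and chosen only for illustration. It uses pandas named aggregation:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Doctor", "Doctor", "Doctor"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "customer_lifetime_value": [2000.0, 5000.0, 70000.0, 2500.0, 5500.0, 30000.0],
})

# Named aggregation: keyword = output column name, value = aggregation function
stats = toy.groupby(["education", "gender"])["customer_lifetime_value"].agg(
    clv_max="max",
    clv_min="min",
    clv_median="median",
)

# Spread between the largest and smallest CLV in each group
stats["clv_range"] = stats["clv_max"] - stats["clv_min"]
print(stats)
```

A large range next to a small median suggests that the maximum is driven by a few high-value customers, which is worth keeping in mind when reading the `max` column.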


## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [23]:
# Extract the month name from the 'effective_to_date' column
df['month'] = pd.to_datetime(df['effective_to_date'], format='%m/%d/%y').dt.month_name()

# Order the months chronologically
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df['month'] = pd.Categorical(df['month'], categories=month_order, ordered=True)

df5 = df.groupby(['state', 'month'], observed=True)['number_of_policies'].sum().unstack()
df5

# CONCLUSIONS

# 1. In every state, fewer policies were sold in February than in January

# 2. California has the highest number of policies sold in both months, followed by Oregon and Arizona.

| state | January | February |
|---|---|---|
| Arizona | 3052 | 2864 |
| California | 5673 | 4929 |
| Nevada | 1493 | 1278 |
| Oregon | 4697 | 3969 |
| Washington | 1358 | 1225 |


6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

In [22]:
# Group by state and month, summing the number of policies
df6 = df.groupby(['state', 'month'], observed=True)['number_of_policies'].sum().reset_index()

# Get the 3 states with the highest total number of policies sold
top_states = df6.groupby('state')['number_of_policies'].sum().sort_values(ascending=False).head(3).index

# Keep only those top states
df6 = df6[df6['state'].isin(top_states)]

# Pivot so that months are columns and states are rows
df6_pivot = df6.pivot(index='state', columns='month', values='number_of_policies').fillna(0)

df6_pivot


| state | January | February |
|---|---|---|
| Arizona | 3052 | 2864 |
| California | 5673 | 4929 |
| Oregon | 4697 | 3969 |
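
The steps from the hint can also be expressed more compactly with `pivot_table` and `nlargest`. The following is a minimal sketch, assuming a small made-up frame with four placeholder states; it is an alternative to the cell above, not the required solution:

```python
import pandas as pd

# Made-up rows for illustration only; states A-D are placeholders
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "month": ["January", "February"] * 4,
    "number_of_policies": [10, 8, 3, 2, 7, 6, 5, 9],
})

# States as rows, months as columns, summing policies in each cell
table = toy.pivot_table(index="state", columns="month",
                        values="number_of_policies", aggfunc="sum")

# pivot_table sorts string column labels alphabetically; restore calendar order
table = table[["January", "February"]]

# Pick the 3 states with the largest row totals, then select those rows
top3 = table.sum(axis=1).nlargest(3).index
top_table = table.loc[top3]
print(top_table)
```

Selecting with `table.loc[top3]` keeps the states ordered by total policies, whereas filtering with `isin` keeps the original (alphabetical) order.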


7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

In [25]:
response_rate = df.groupby('sales_channel')['response'].apply(lambda x: (x == 'Yes').mean())

response_rate

sales_channel
Agent          0.180053
Branch         0.107876
Call Center    0.103223
Web            0.108856
Name: response, dtype: float64
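
The hint suggests `melt`, which the cell above does not use. As a minimal sketch, assuming a small made-up frame (values are illustrative and the `rate` column name is hypothetical), one way to follow the hint is to build a wide table of response shares with `pd.crosstab` and then unpivot it:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web"],
    "response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Wide table: one row per channel, one column per response, values are row shares
wide = pd.crosstab(toy["sales_channel"], toy["response"], normalize="index")

# Unpivot to long format: one row per (channel, response) pair
long_df = wide.reset_index().melt(
    id_vars="sales_channel", var_name="response", value_name="rate"
)

# Keep only the "Yes" rows to get the response rate per channel
yes_rates = long_df[long_df["response"] == "Yes"].set_index("sales_channel")["rate"]
print(yes_rates)
```

The long format is convenient for plotting libraries that expect one observation per row; for a single rate per channel, the `groupby` approach above gives the same numbers with less code.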

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9