# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [25]:
import pandas as pd

# Step 1: Load the dataset
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'
df = pd.read_csv(url)

# Step 2: Inspect the dataset to ensure the necessary columns are present
# Display the first few rows of the dataset to verify column names and data types
print(df.head())

# Step 3: Filter the dataset
# Ensure column names are in the correct case (case-sensitive)
if 'Total Claim Amount' in df.columns and 'Response' in df.columns:
    
    # Step 4: Create a new DataFrame with the specified conditions
    filtered_df = df[(df['Total Claim Amount'] > 1000) & (df['Response'] == 'Yes')]
    
    # Display the new DataFrame
    print(filtered_df)
    
else:
    print("The required columns 'Total Claim Amount' and 'Response' are not present in the dataset.")


   Unnamed: 0 Customer       State  Customer Lifetime Value Response  \
0           0  DK49336     Arizona              4809.216960       No   
1           1  KX64629  California              2228.525238       No   
2           2  LZ68649  Washington             14947.917300       No   
3           3  XL78013      Oregon             22332.439460      Yes   
4           4  QA50777      Oregon              9025.067525       No   

   Coverage Education Effective To Date EmploymentStatus Gender  ...  \
0     Basic   College           2/18/11         Employed      M  ...   
1     Basic   College           1/18/11       Unemployed      F  ...   
2     Basic  Bachelor           2/10/11         Employed      M  ...   
3  Extended   College           1/11/11         Employed      M  ...   
4   Premium  Bachelor           1/17/11    Medical Leave      F  ...   

   Number of Open Complaints Number of Policies     Policy Type        Policy  \
0                        0.0                  9  Corp

2. Using the original DataFrame, analyze the average total_claim_amount by policy type and gender for customers who responded "Yes" to the last marketing campaign. Write your conclusions.

In [26]:
# Step 2: Filter the dataset for customers who responded "Yes"
filtered_df = df[df['Response'] == 'Yes']

# Step 3: Group by 'Policy Type' and 'Gender', and calculate the average 'Total Claim Amount'
average_claims = filtered_df.groupby(['Policy Type', 'Gender'])['Total Claim Amount'].mean().reset_index()

# Step 4: Round the average to 2 decimal places for clarity
average_claims['Total Claim Amount'] = average_claims['Total Claim Amount'].round(2)

# Display the result
print(average_claims)

      Policy Type Gender  Total Claim Amount
0  Corporate Auto      F              433.74
1  Corporate Auto      M              408.58
2   Personal Auto      F              452.97
3   Personal Auto      M              457.01
4    Special Auto      F              453.28
5    Special Auto      M              429.53
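
To make the gender comparison easier to read when writing conclusions, the grouped averages can be reshaped so that each policy type is a row and each gender is a column. The following is a minimal sketch on a small made-up DataFrame (`toy`, with illustrative values only); the names `wide_means` and `F_minus_M` are hypothetical and not part of the lab.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Policy Type': ['Personal Auto', 'Personal Auto', 'Corporate Auto',
                    'Corporate Auto', 'Personal Auto'],
    'Gender': ['F', 'M', 'F', 'M', 'F'],
    'Response': ['Yes', 'Yes', 'Yes', 'Yes', 'No'],
    'Total Claim Amount': [400.0, 500.0, 300.0, 350.0, 900.0],
})

# Keep only "Yes" responders, average per (policy type, gender),
# then move Gender from the row index into the columns
wide_means = (
    toy[toy['Response'] == 'Yes']
    .groupby(['Policy Type', 'Gender'])['Total Claim Amount']
    .mean()
    .unstack('Gender')
    .round(2)
)

# Difference between the female and male averages for each policy type
wide_means['F_minus_M'] = wide_means['F'] - wide_means['M']
print(wide_means)
```

Applied to `df`, the same chain should produce one row per policy type. In the printed result above, the F and M averages differ by about 25 or less within every policy type, which suggests that gender has only a small effect on the average claim among customers who responded "Yes".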


3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [27]:
# Step 2: Count the number of customers by state
customer_counts = df['State'].value_counts().reset_index()
customer_counts.columns = ['State', 'Number_of_Customers']

# Step 3: Filter the results to include only states with more than 500 customers
filtered_states = customer_counts[customer_counts['Number_of_Customers'] > 500]

# Display the result
print(filtered_states)

        State  Number_of_Customers
0  California                 3552
1      Oregon                 2909
2     Arizona                 1937
3      Nevada                  993
4  Washington                  888


4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [29]:
# Step 2: Ensure the necessary columns are present
# Display the first few rows to verify column names
print(df.head())

# Step 3: Group by 'Education' and 'Gender', and calculate max, min, and median 'Customer Lifetime Value'
# Ensure column names are in the correct case (case-sensitive)
if 'Education' in df.columns and 'Gender' in df.columns and 'Customer Lifetime Value' in df.columns:
    
    # Group by 'Education' and 'Gender'
    grouped = df.groupby(['Education', 'Gender'])['Customer Lifetime Value']
    
    # Calculate max, min, and median
    stats = grouped.agg(['max', 'min', 'median']).reset_index()
    
    # Rename columns for clarity
    stats.columns = ['Education', 'Gender', 'Max_CLV', 'Min_CLV', 'Median_CLV']
    
    # Display the results
    print(stats)
    
else:
    print("The required columns 'Education', 'Gender', and 'Customer Lifetime Value' are not present in the dataset.")

   Unnamed: 0 Customer       State  Customer Lifetime Value Response  \
0           0  DK49336     Arizona              4809.216960       No   
1           1  KX64629  California              2228.525238       No   
2           2  LZ68649  Washington             14947.917300       No   
3           3  XL78013      Oregon             22332.439460      Yes   
4           4  QA50777      Oregon              9025.067525       No   

   Coverage Education Effective To Date EmploymentStatus Gender  ...  \
0     Basic   College           2/18/11         Employed      M  ...   
1     Basic   College           1/18/11       Unemployed      F  ...   
2     Basic  Bachelor           2/10/11         Employed      M  ...   
3  Extended   College           1/11/11         Employed      M  ...   
4   Premium  Bachelor           1/17/11    Medical Leave      F  ...   

   Number of Open Complaints Number of Policies     Policy Type        Policy  \
0                        0.0                  9  Corp
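
The same max, min, and median table can also be built with pandas named aggregation, which sets the output column names in the `agg` call instead of overwriting `stats.columns` afterwards. This is a minimal sketch on made-up values; `toy` and `clv_stats` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy data (values are made up, not taken from the real dataset)
toy = pd.DataFrame({
    'Education': ['College', 'College', 'College',
                  'Bachelor', 'Bachelor', 'Bachelor'],
    'Gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'Customer Lifetime Value': [2000.0, 5000.0, 11000.0,
                                3000.0, 4000.0, 9000.0],
})

# Named aggregation: each keyword becomes an output column,
# and its value names the aggregation to apply
clv_stats = (
    toy.groupby(['Education', 'Gender'])['Customer Lifetime Value']
    .agg(Max_CLV='max', Min_CLV='min', Median_CLV='median')
    .reset_index()
)
print(clv_stats)
```

Because each name is tied to its aggregation, reordering or adding statistics cannot silently mislabel a column, which is a risk when assigning a list to `stats.columns`.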

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [32]:
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'
df = pd.read_csv(url)

# Group by 'Policy Type' and 'Sales Channel' and count the number of policies
policy_counts = df.groupby(['Policy Type', 'Sales Channel']).size().reset_index(name='Number_of_Policies')

# Pivot the data so that sales channels are columns and policy types are rows
pivot_table = policy_counts.pivot_table(index='Policy Type', columns='Sales Channel', values='Number_of_Policies', fill_value=0)

# Display the result
print(pivot_table)



Sales Channel    Agent  Branch  Call Center     Web
Policy Type                                        
Corporate Auto   885.0   646.0        467.0   343.0
Personal Auto   3064.0  2247.0       1611.0  1206.0
Special Auto     172.0   129.0         63.0    77.0
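
The table above breaks policies down by policy type and sales channel, whereas the exercise asks for states as rows and months as columns. A minimal sketch of that layout follows, on made-up rows. It assumes that the month can be derived from `Effective To Date` (in the `m/d/yy` format shown in the data preview) and that each row counts as one policy sold; `toy`, `Month`, and `state_month` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows; the date strings mimic the 'Effective To Date' format
toy = pd.DataFrame({
    'State': ['Oregon', 'Oregon', 'Arizona', 'Arizona', 'Arizona', 'Nevada'],
    'Effective To Date': ['1/11/11', '2/10/11', '1/17/11',
                          '1/18/11', '2/18/11', '2/3/11'],
})

# Assumption: the month of 'Effective To Date' is the month the policy was sold
toy['Month'] = pd.to_datetime(toy['Effective To Date'], format='%m/%d/%y').dt.month

# Rows = states, columns = months, cells = number of rows in each combination
state_month = pd.crosstab(toy['State'], toy['Month'])
print(state_month)
```

`pd.crosstab` fills combinations that never occur with 0, so no separate `fill_value` step is needed. If the `Number of Policies` column should be summed instead of counting rows, that would be a different aggregation and is not shown here.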


6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

In [33]:
# Count the number of policies by policy type
total_policies_by_type = df.groupby('Policy Type').size().reset_index(name='Number_of_Policies')

# Find the 3 policy types with the highest number of policies sold
top_policies = total_policies_by_type.sort_values(by='Number_of_Policies', ascending=False).head(3)['Policy Type']

# Filter the data to these 3 policy types
top_policies_counts = df[df['Policy Type'].isin(top_policies)]

# Count the number of policies sold by policy type and sales channel for the top 3 types
top_policies_counts_summary = top_policies_counts.groupby(['Policy Type', 'Sales Channel']).size().reset_index(name='Number_of_Policies')

# Pivot the data so that sales channels are columns and policy types are rows
top_policies_pivot = top_policies_counts_summary.pivot_table(index='Policy Type', columns='Sales Channel', values='Number_of_Policies', fill_value=0)

# Display the result
print(top_policies_pivot)



Sales Channel    Agent  Branch  Call Center     Web
Policy Type                                        
Corporate Auto   885.0   646.0        467.0   343.0
Personal Auto   3064.0  2247.0       1611.0  1206.0
Special Auto     172.0   129.0         63.0    77.0
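
The cell above ranks policy types, not states. The steps in the hint (group by state and month, count, keep the top 3 states) can be sketched as follows on made-up rows. As an assumption, the month is taken from `Effective To Date` and each row counts as one policy sold; `toy`, `Month`, `Policies_Sold`, and the other variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: four states with different numbers of policies
toy = pd.DataFrame({
    'State': ['California'] * 4 + ['Oregon'] * 3 + ['Arizona'] * 2 + ['Nevada'],
    'Effective To Date': ['1/5/11', '1/9/11', '2/1/11', '2/7/11',
                          '1/11/11', '2/10/11', '2/12/11',
                          '1/17/11', '2/18/11',
                          '2/3/11'],
})

# Assumption: the month of 'Effective To Date' is the month the policy was sold
toy['Month'] = pd.to_datetime(toy['Effective To Date'], format='%m/%d/%y').dt.month

# Count rows per (state, month)
by_state_month = toy.groupby(['State', 'Month']).size().reset_index(name='Policies_Sold')

# Rank states by their total across all months and keep the three largest
top_states = by_state_month.groupby('State')['Policies_Sold'].sum().nlargest(3).index

# Monthly counts restricted to those states, largest counts first
top3_by_month = (
    by_state_month[by_state_month['State'].isin(top_states)]
    .sort_values('Policies_Sold', ascending=False)
    .reset_index(drop=True)
)
print(top3_by_month)
```

Ranking on the per-state total, instead of taking the first three rows of the sorted state-month counts, avoids selecting the same state more than once when it has several high-volume months.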


7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [34]:
# Assume the response column is called 'Response' and we want to analyze the response rate
# First, check that the response column exists
if 'Response' in df.columns:
    # Melt the dataframe to unpivot marketing channels
    marketing_channels = ['Sales Channel']  # Adjust to the columns actually available

    melted_df = df.melt(id_vars=['Response'], value_vars=marketing_channels, var_name='Marketing_Channel', value_name='Channel_Value')

    # Keep only the 'Yes' responses
    response_df = melted_df[melted_df['Response'] == 'Yes']

    # Compute the response rate per Marketing_Channel value
    # (only 'Sales Channel' is melted here, so this is the overall rate across all channels)
    response_rate = response_df.groupby('Marketing_Channel').size() / melted_df.groupby('Marketing_Channel').size()

    # Convert to a DataFrame for clarity
    response_rate_df = response_rate.reset_index(name='Response_Rate')

    # Display the result
    print(response_rate_df)
else:
    print("No response column was found in the dataset.")


  Marketing_Channel  Response_Rate
0     Sales Channel       0.134372
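
The result above has a single row because `Marketing_Channel` holds the name of the melted column (`Sales Channel`), not the individual channels, so 0.134372 is the overall response rate. To get one rate per channel, the grouping key would need to be the melted value column. A minimal sketch on made-up rows, where `toy`, `melted`, and `Responded` are hypothetical names:

```python
import pandas as pd

# Hypothetical toy rows (values are made up)
toy = pd.DataFrame({
    'Sales Channel': ['Agent', 'Agent', 'Agent', 'Agent',
                      'Web', 'Web', 'Branch', 'Branch'],
    'Response': ['Yes', 'No', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

# Unpivot so each row is (Response, name of the channel column, channel value)
melted = toy.melt(id_vars=['Response'], value_vars=['Sales Channel'],
                  var_name='Marketing_Channel', value_name='Channel_Value')

# The mean of a boolean column is the share of True values, i.e. the "Yes" rate
response_rate = (
    melted.assign(Responded=melted['Response'].eq('Yes'))
    .groupby('Channel_Value')['Responded']
    .mean()
    .reset_index(name='Response_Rate')
)
print(response_rate)
```

Taking the mean of a boolean flag also keeps channels with no "Yes" responses in the result with a rate of 0, whereas dividing two `size()` results yields `NaN` for such channels.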
