# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

Import dataset and load it into a dataframe

In [24]:
# Import the libraries we will be working with
import pandas as pd
# Load dataset from an online source
url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"
df_insurance_marketing_customer = pd.read_csv(url)

Data cleaning, formatting and structuring

In [25]:
df_insurance_marketing_customer.head()

Unnamed: 0.1,Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,...,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size,Vehicle Type
0,0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,


In [26]:
df_insurance_marketing_customer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Unnamed: 0                     10910 non-null  int64  
 1   Customer                       10910 non-null  object 
 2   State                          10279 non-null  object 
 3   Customer Lifetime Value        10910 non-null  float64
 4   Response                       10279 non-null  object 
 5   Coverage                       10910 non-null  object 
 6   Education                      10910 non-null  object 
 7   Effective To Date              10910 non-null  object 
 8   EmploymentStatus               10910 non-null  object 
 9   Gender                         10910 non-null  object 
 10  Income                         10910 non-null  int64  
 11  Location Code                  10910 non-null  object 
 12  Marital Status                 10910 non-null 

In [27]:
# Convert all column names to lowercase and replace spaces with underscores
df_insurance_marketing_customer.columns = df_insurance_marketing_customer.columns.str.lower().str.replace(" ", "_")
# Rename the column that still does not follow the convention
df_insurance_marketing_customer = df_insurance_marketing_customer.rename(columns={"employmentstatus": "employment_status"})
# Display the new name of each column
df_insurance_marketing_customer.columns

Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employment_status', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type'],
      dtype='object')
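The two-step rename above (lowercase and underscores, then a manual fix for `employmentstatus`) could also be done in one pass. The following is a minimal sketch on a toy frame; `to_snake_case` is a hypothetical helper, not part of pandas or of the original notebook.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'EmploymentStatus' -> 'employment_status'."""
    # Insert an underscore between a lowercase letter/digit and an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Replace runs of whitespace with a single underscore, then lowercase
    return re.sub(r"\s+", "_", name).lower()

# Toy frame with column names shaped like the ones in the dataset
toy = pd.DataFrame(columns=["Customer Lifetime Value", "EmploymentStatus", "State"])
toy.columns = [to_snake_case(c) for c in toy.columns]
print(list(toy.columns))
```

With a helper like this, a camel-case column such as `EmploymentStatus` would not need a separate `rename` call.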

In [28]:
#Delete columns we don't need for this exercise
columns_delete = ['unnamed:_0', 'coverage','employment_status', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints', 'policy',
       'renew_offer_type', 'vehicle_class', 'vehicle_size','vehicle_type']

df_insurance_marketing_customer_removed = df_insurance_marketing_customer.drop(columns_delete, axis=1)

In [29]:
# Number of unique values in each column
print("Number of unique values:")
print(df_insurance_marketing_customer_removed.nunique())
print()

Number of unique values:
customer                   9134
state                         5
customer_lifetime_value    8041
response                      2
education                     5
effective_to_date            59
gender                        2
number_of_policies            9
policy_type                   3
sales_channel                 4
total_claim_amount         5106
dtype: int64



In [30]:
# Check for duplicated values
duplicates = df_insurance_marketing_customer_removed.duplicated()
number_of_duplicates = duplicates.sum()
# Print the number of duplicated rows
print(f"Number of duplicated rows before cleaning: {number_of_duplicates}")

# Remove duplicates and reset index
df_insurance_marketing_customer_cleaned = df_insurance_marketing_customer_removed.drop_duplicates().reset_index(drop=True)
    
# Check for duplicates after cleaning
duplicates_after = df_insurance_marketing_customer_cleaned.duplicated().sum()
print(f"Number of duplicated rows after cleaning: {duplicates_after}")

Number of duplicated rows before cleaning: 1187
Number of duplicated rows after cleaning: 0
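Note that `duplicated()` with no arguments only flags rows that match on every column. The `nunique()` output reports 9,134 unique customers, while 10,910 - 1,187 = 9,723 rows remain, so some customer IDs still appear more than once. A small sketch on made-up data shows the difference between full-row duplicates and duplicates on a single key:

```python
import pandas as pd

# Toy data (invented for illustration): A1 is an exact duplicate,
# B2 repeats the customer ID but with a different claim amount
toy = pd.DataFrame({
    "customer": ["A1", "A1", "B2", "B2"],
    "state": ["Oregon", "Oregon", "Nevada", "Nevada"],
    "total_claim_amount": [100.0, 100.0, 250.0, 300.0],
})

# Rows identical to an earlier row on every column
full_row_dupes = int(toy.duplicated().sum())

# Rows whose customer ID already appeared, regardless of other columns
same_customer_dupes = int(toy.duplicated(subset=["customer"]).sum())

# Dropping full-row duplicates keeps both B2 rows
deduped = toy.drop_duplicates().reset_index(drop=True)
print(full_row_dupes, same_customer_dupes, len(deduped))
```

Whether repeated customer IDs should also be collapsed depends on what a row represents; the notebook treats each remaining row as a separate policy.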


In [31]:
# Count the number of null values in each column
print("Number of null values in each column before handling:")
print(df_insurance_marketing_customer_cleaned.isna().sum())

# Fill null values in every object column with that column's mode (only state and response contain nulls here)
for column in df_insurance_marketing_customer_cleaned.select_dtypes(include=["object"]).columns:
    mode_value = df_insurance_marketing_customer_cleaned[column].mode()[0]
    df_insurance_marketing_customer_cleaned[column] = df_insurance_marketing_customer_cleaned[column].fillna(mode_value)

# Check if there are any remaining null values
remaining_nulls = df_insurance_marketing_customer_cleaned.isnull().sum()
print("\nNumber of null values in each column after handling:")
print(remaining_nulls[remaining_nulls > 0])


Number of null values in each column before handling:
customer                     0
state                      589
customer_lifetime_value      0
response                   589
education                    0
effective_to_date            0
gender                       0
number_of_policies           0
policy_type                  0
sales_channel                0
total_claim_amount           0
dtype: int64

Number of null values in each column after handling:
Series([], dtype: int64)
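Filling `response` with its mode treats every missing answer as the most common one, which can shift any response rate computed later. The sketch below, on invented data, compares the rate after mode filling with the rate over known answers only; the `"Unknown"` label is an assumption for illustration, not something the dataset defines.

```python
import pandas as pd

# Toy responses with two missing values (invented for illustration)
toy = pd.DataFrame({"response": ["No", "Yes", None, "No", None]})

# Option used in the notebook: fill with the most frequent value
mode_value = toy["response"].mode()[0]
filled_mode = toy["response"].fillna(mode_value)

# Alternative: keep missing answers visible under an explicit label
filled_unknown = toy["response"].fillna("Unknown")

# Share of "Yes" after mode filling vs. among known answers only
rate_mode = (filled_mode == "Yes").mean()
rate_known = (toy["response"].dropna() == "Yes").mean()
print(round(rate_mode, 3), round(rate_known, 3))
```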


In [32]:
df_insurance_marketing_customer_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9723 entries, 0 to 9722
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   customer                 9723 non-null   object 
 1   state                    9723 non-null   object 
 2   customer_lifetime_value  9723 non-null   float64
 3   response                 9723 non-null   object 
 4   education                9723 non-null   object 
 5   effective_to_date        9723 non-null   object 
 6   gender                   9723 non-null   object 
 7   number_of_policies       9723 non-null   int64  
 8   policy_type              9723 non-null   object 
 9   sales_channel            9723 non-null   object 
 10  total_claim_amount       9723 non-null   float64
dtypes: float64(2), int64(1), object(8)
memory usage: 835.7+ KB


In [33]:
# Print the unique values of each column
for column in df_insurance_marketing_customer_cleaned.columns:
    unique_values = df_insurance_marketing_customer_cleaned[column].unique()
    print(f"\nUnique values in column {column}:")
    print(unique_values)


Unique values in column customer:
['DK49336' 'KX64629' 'LZ68649' ... 'KX53892' 'TL39050' 'WA60547']

Unique values in column state:
['Arizona' 'California' 'Washington' 'Oregon' 'Nevada']

Unique values in column customer_lifetime_value:
[ 4809.21696   2228.525238 14947.9173   ...  5259.444853 23893.3041
 11971.97765 ]

Unique values in column response:
['No' 'Yes']

Unique values in column education:
['College' 'Bachelor' 'High School or Below' 'Doctor' 'Master']

Unique values in column effective_to_date:
['2/18/11' '1/18/11' '2/10/11' '1/11/11' '1/17/11' '2/14/11' '2/24/11'
 '1/19/11' '1/4/11' '1/2/11' '2/7/11' '1/31/11' '1/26/11' '2/28/11'
 '1/16/11' '2/26/11' '2/23/11' '1/15/11' '2/2/11' '2/15/11' '1/24/11'
 '2/21/11' '2/22/11' '1/7/11' '1/28/11' '2/8/11' '2/12/11' '2/20/11'
 '1/5/11' '2/19/11' '1/3/11' '2/3/11' '1/22/11' '1/23/11' '2/5/11'
 '2/13/11' '1/25/11' '2/16/11' '2/1/11' '1/27/11' '1/12/11' '1/20/11'
 '2/6/11' '2/11/11' '1/21/11' '1/29/11' '1/9/11' '2/9/11' '2/27/11'
 '1

Exercises

In [34]:
#1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than 1000 and a "Yes" response to the last marketing campaign
filtered_by_total_claim_amount_and_response_df = df_insurance_marketing_customer_cleaned[(df_insurance_marketing_customer_cleaned["total_claim_amount"] > 1000) & (df_insurance_marketing_customer_cleaned["response"] == "Yes")]
filtered_by_total_claim_amount_and_response_df.head().round(2)


Unnamed: 0,customer,state,customer_lifetime_value,response,education,effective_to_date,gender,number_of_policies,policy_type,sales_channel,total_claim_amount
189,OK31456,California,11009.13,Yes,Bachelor,1/24/11,F,1,Corporate Auto,Agent,1358.4
236,YJ16163,Oregon,11009.13,Yes,Bachelor,1/24/11,F,1,Special Auto,Agent,1358.4
419,GW43195,Oregon,25807.06,Yes,College,2/13/11,F,2,Personal Auto,Branch,1027.2
442,IP94270,Arizona,13736.13,Yes,Master,2/13/11,F,8,Personal Auto,Web,1261.32
587,FJ28407,California,5619.69,Yes,High School or Below,1/26/11,M,1,Personal Auto,Web,1027.0
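The same two-condition filter can be written either as a boolean mask or with `DataFrame.query`. This is a small sketch on invented rows, showing that both forms select the same customers and that a claim of exactly 1000 is excluded by the strict `>` comparison.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "response": ["Yes", "No", "Yes", "Yes"],
    "total_claim_amount": [1358.4, 1500.0, 480.0, 1000.0],
})

# Boolean mask: each condition needs its own parentheses around it
mask = (toy["total_claim_amount"] > 1000) & (toy["response"] == "Yes")
by_mask = toy[mask]

# Equivalent expression using query()
by_query = toy.query("total_claim_amount > 1000 and response == 'Yes'")

print(by_mask["customer"].tolist())
```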


In [35]:
# Describe numerical columns
numerical_columns_stats = filtered_by_total_claim_amount_and_response_df.describe().round(2)

# Describe categorical columns
categorical_columns_stats = filtered_by_total_claim_amount_and_response_df.describe(include=["object"])

print("Numerical columns stats:")
print(numerical_columns_stats)

print("\nCategorical columns stats:")
print(categorical_columns_stats)

Numerical columns stats:
       customer_lifetime_value  number_of_policies  total_claim_amount
count                    60.00                60.0               60.00
mean                  11625.26                 2.4             1181.05
std                    6312.02                 2.6              137.31
min                    3508.57                 1.0             1008.00
25%                    7840.17                 1.0             1027.20
50%                   10571.84                 1.0             1218.80
75%                   13736.13                 2.0             1300.80
max                   25807.06                 8.0             1358.40

Categorical columns stats:
       customer   state response education effective_to_date gender  \
count        60      60       60        60                60     60   
unique       60       5        1         4                 8      2   
top     OK31456  Oregon      Yes  Bachelor           2/13/11      F   
freq          1      24 

In [36]:
#2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

#2.1 Filter the original DataFrame to keep only the customers who responded "Yes" to the last marketing campaign
filtered_by_response_df = df_insurance_marketing_customer[df_insurance_marketing_customer["response"] == "Yes"]

#2.2 Group by policy type and gender
grouped_df = filtered_by_response_df.groupby(["policy_type", "gender"])

#2.3 Calculate the average of total_claim_amount
average_total_claims_grouped_df = grouped_df["total_claim_amount"].mean().reset_index()

# Rename the column to show that it holds the mean of total_claim_amount
average_total_claims_grouped_df = average_total_claims_grouped_df.rename(columns={"total_claim_amount": "avg_total_claim_amount"})

# Visualize
average_total_claims_grouped_df.round(2)


Unnamed: 0,policy_type,gender,avg_total_claim_amount
0,Corporate Auto,F,433.74
1,Corporate Auto,M,408.58
2,Personal Auto,F,452.97
3,Personal Auto,M,457.01
4,Special Auto,F,453.28
5,Special Auto,M,429.53


In [37]:
#Conclusions
print("Conclusions 2:\nPersonal Auto has the highest average total claim amount overall.\nFemales have a higher average total claim amount than males in Corporate Auto and Special Auto.\nIn Personal Auto, males are slightly higher, but only by about $4.\nDifferences by gender are small; the policy type has a more noticeable impact.")

Conclusions 2:
Personal Auto has the highest average total claim amount overall.
Females have a higher average total claim amount than males in Corporate Auto and Special Auto.
In Personal Auto, males are slightly higher, but only by about $4.
Differences by gender are small; the policy type has a more noticeable impact.
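To compare genders within each policy type, it can help to place them side by side. The sketch below uses invented numbers and `unstack` to turn the gender level into columns; the `F_minus_M` column name is an assumption made for this example.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "M", "F"],
    "response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "total_claim_amount": [400.0, 500.0, 300.0, 200.0, 900.0],
})

# Keep only "Yes" responders, then average per policy type and gender
yes_only = toy[toy["response"] == "Yes"]
wide = (
    yes_only.groupby(["policy_type", "gender"])["total_claim_amount"]
    .mean()
    .unstack("gender")  # one column per gender
)

# Difference between the female and male averages per policy type
wide["F_minus_M"] = wide["F"] - wide["M"]
print(wide)
```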


In [None]:
#3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.
# Count customers by state
grouped_by_state_df = df_insurance_marketing_customer_cleaned.groupby("state").size().reset_index(name="customer_number")
# Visualize
grouped_by_state_df

# Keep only the states with more than 500 customers. Every state in this dataset already exceeds 500, so the filter drops nothing.
filtered_by_state_more_500_df = grouped_by_state_df[grouped_by_state_df["customer_number"] > 500]
# Visualize, sorted from most to fewest customers
filtered_by_state_more_500_df.sort_values(by="customer_number", ascending=False)


Unnamed: 0,state,customer_number
0,Arizona,1703
1,California,3739
2,Nevada,882
3,Oregon,2601
4,Washington,798


In [40]:
#Conclusions
print("Conclusions: All five states have more than 500 customers, so the >500 filter does not remove any state.")

Conclusions: All five states have more than 500 customers, so the >500 filter does not remove any state.
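A count-then-filter step like this one can also be expressed with `value_counts`, which returns the counts already sorted in descending order. This is a sketch on invented data with a small threshold so that the filter visibly removes a state.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({"state": ["California"] * 4 + ["Oregon"] * 3 + ["Nevada"]})

# Count rows per state; the result is sorted from largest to smallest
counts = toy["state"].value_counts()

# Keep only the states above the threshold (2 here, 500 in the exercise)
threshold = 2
large_states = counts[counts > threshold]
print(large_states)
```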


In [41]:
#4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

# Group by Education and Gender 
grouped_by_education_and_gender_df = df_insurance_marketing_customer_cleaned.groupby(["education", "gender"])

# Calculate the max, min, and median of 'customer_lifetime_value' for each group
clv_stats_by_education_gender_df = grouped_by_education_and_gender_df["customer_lifetime_value"].agg(["max", "min", "median"]).reset_index()

# Define the desired order of education levels
desired_order = ["Doctor", "Master", "Bachelor", "College", "High School or Below"]

# Sort the DataFrame by the 'education' column according to the desired order
clv_stats_by_education_gender_df['education'] = pd.Categorical(clv_stats_by_education_gender_df['education'], categories=desired_order, ordered=True)

# Sort by education and gender for better readability
clv_stats_by_education_gender_df = clv_stats_by_education_gender_df.sort_values(by=["education", "gender"])

# Reset the index so it runs from 0 again after sorting
clv_stats_by_education_gender_df = clv_stats_by_education_gender_df.reset_index(drop=True)

# Display and round the DataFrame
clv_stats_by_education_gender_df.round(2)

Unnamed: 0,education,gender,max,min,median
0,Doctor,F,44856.11,2395.57,5332.46
1,Doctor,M,32677.34,2267.6,5581.49
2,Master,F,51016.07,2417.78,5827.41
3,Master,M,50568.26,2272.31,5604.33
4,Bachelor,F,73225.96,1904.0,5674.38
5,Bachelor,M,67907.27,1898.01,5555.56
6,College,F,61850.19,1898.68,5626.57
7,College,M,61134.68,1918.12,6005.85
8,High School or Below,F,55277.45,2144.92,6027.53
9,High School or Below,M,83325.38,1940.98,6203.79


In [42]:
#Conclusions
print("Conclusions:\nThe single highest maximum CLV belongs to males with High School or Below (83,325.38), followed by female Bachelor's degree holders (73,225.96), while Doctorate holders (especially males) have the lowest maximums.\nMales have lower maximum CLVs than females in every education level except High School or Below.\nMedian values are relatively similar across genders and education levels (roughly 5,300 to 6,200), indicating that the typical customer has a similar lifetime value regardless of education or gender.")

Conclusions:
The single highest maximum CLV belongs to males with High School or Below (83,325.38), followed by female Bachelor's degree holders (73,225.96), while Doctorate holders (especially males) have the lowest maximums.
Males have lower maximum CLVs than females in every education level except High School or Below.
Median values are relatively similar across genders and education levels (roughly 5,300 to 6,200), indicating that the typical customer has a similar lifetime value regardless of education or gender.
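When several statistics are computed at once, named aggregation gives the result columns explicit names instead of `max`, `min`, and `median`. The following is a sketch on invented data; the `clv_*` column names are assumptions chosen for this example.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "education": ["Master", "Master", "Master", "College", "College"],
    "gender": ["F", "F", "F", "M", "M"],
    "customer_lifetime_value": [2000.0, 5000.0, 9000.0, 3000.0, 7000.0],
})

# Named aggregation: keyword = output column name, value = aggregation
clv_stats = (
    toy.groupby(["education", "gender"])["customer_lifetime_value"]
    .agg(clv_max="max", clv_min="min", clv_median="median")
    .reset_index()
)
print(clv_stats)
```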


## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

In [43]:
# Create a pivot table with months as columns and states as rows

# The dataset has no explicit sale month, so we assume policies are annual; effective_to_date then falls in the same month as the (theoretical) effective-from date
# Convert 'effective_to_date' to datetime, giving the format explicitly (e.g. 2/18/11)
df_insurance_marketing_customer_cleaned["effective_to_date"] = pd.to_datetime(df_insurance_marketing_customer_cleaned["effective_to_date"], format="%m/%d/%y")

# Extract the full month name and create a new column 'month'
df_insurance_marketing_customer_cleaned["month"] = df_insurance_marketing_customer_cleaned["effective_to_date"].dt.strftime("%B")

# Create a new column for the count of policies (assuming each row represents a policy)
df_insurance_marketing_customer_cleaned["policies_number"] = 1

pivot_df = df_insurance_marketing_customer_cleaned.pivot_table(
    index="state",                # Rows: each state
    columns="month",              # Columns: each month
    values="policies_number",     # Values: number of policies sold
    aggfunc="sum",                # Sum the per-row 1s to count policies
)

# Reorder the columns to place January first
ordered_months = ["January", "February"]
pivot_df = pivot_df[ordered_months]

# Print the pivot table
print(pivot_df)

month       January  February
state                        
Arizona         899       804
California     1990      1749
Nevada          494       388
Oregon         1396      1205
Washington      414       384




6.  Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*

In [44]:
#Step 1: Group the data by month and state, and count the number of policies in each group with size().
df_total_policies_filtered_by_month_and_state=df_insurance_marketing_customer_cleaned.groupby(["month","state"]).size().reset_index(name="count_policies")

#Step 2: Group the data by state and count the total number of policies sold by state.
total_policies_by_state = df_total_policies_filtered_by_month_and_state.groupby("state")["count_policies"].sum().reset_index()

#Step 3: Sort the states by the total number of policies sold in descending order and select the top 3 states.
top_3_states = total_policies_by_state.sort_values(by="count_policies", ascending=False).head(3)["state"]

#Step 4: Filter the original DataFrame to include only the top 3 states.
df_top_3_states = df_insurance_marketing_customer_cleaned[df_insurance_marketing_customer_cleaned["state"].isin(top_3_states)]

#Step 5: Create a new DataFrame that shows the number of policies sold by month for each of the top 3 states.
df_policies_by_month_and_state = df_top_3_states.groupby(["month", "state"]).size().reset_index(name="count_policies")

#Visualize
df_policies_by_month_and_state

Unnamed: 0,month,state,count_policies
0,February,Arizona,804
1,February,California,1749
2,February,Oregon,1205
3,January,Arizona,899
4,January,California,1990
5,January,Oregon,1396
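The top 3 states can also be found in a single step with `value_counts().head(3)`, again assuming each row is one policy. This is a sketch on invented data with four states so that one is visibly excluded.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "state": ["California", "California", "California", "Oregon", "Oregon",
              "Oregon", "Arizona", "Arizona", "Nevada"],
    "month": ["January", "February", "January", "January", "January",
              "February", "February", "January", "January"],
})

# The three states with the most rows (policies)
top_states = toy["state"].value_counts().head(3).index

# Policies per state and month, restricted to those states
result = (
    toy[toy["state"].isin(top_states)]
    .groupby(["state", "month"])
    .size()
    .reset_index(name="count_policies")
)
print(result)
```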


7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [45]:
# Step 1: Filter only "Yes" responses
yes_responses = df_insurance_marketing_customer_cleaned[df_insurance_marketing_customer_cleaned['response'] == 'Yes']

# Step 2: Count the total responses by sales channel and "Yes" response
yes_counts = yes_responses.groupby('sales_channel').size().reset_index(name='yes_count')
total_counts = df_insurance_marketing_customer_cleaned.groupby('sales_channel').size().reset_index(name='total_count')

# Step 3: Merge the total counts and "Yes" counts
response_rate = pd.merge(total_counts, yes_counts, on='sales_channel', how='left')

# Step 4: Calculate the response rate
response_rate['response_rate'] = (response_rate['yes_count'] / response_rate['total_count']) * 100

# Show the result
response_rate[['sales_channel', 'response_rate']].head()

Unnamed: 0,sales_channel,response_rate
0,Agent,18.024357
1,Branch,10.820758
2,Call Center,10.212766
3,Web,10.901468
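Because the mean of a boolean column is the share of `True` values, a response rate like this one can also be computed without building and merging two count tables. The sketch below uses invented data.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web"],
    "response": ["Yes", "No", "No", "No", "Yes", "No"],
})

# True where the customer said "Yes"; the group mean is the share of "Yes"
response_rate = (
    toy["response"].eq("Yes")
    .groupby(toy["sales_channel"])
    .mean()
    .mul(100)
    .rename("response_rate")
)
print(response_rate)
```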


In [None]:
# Encode the response as 1 for "Yes" and 0 otherwise so it can be averaged
df_insurance_marketing_customer_cleaned['response'] = df_insurance_marketing_customer_cleaned['response'].apply(lambda x: 1 if x == 'Yes' else 0)

# Melt the data to convert columns to rows
melted_table = pd.melt(df_insurance_marketing_customer_cleaned, id_vars=['sales_channel'], value_vars=['response'])
melted_table
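Once the response is encoded as 0/1 and melted, the rate per channel is the mean of the `value` column within each channel. The following sketch on invented data shows one way the melted table could be summarized; it is not necessarily the step the lab intends.

```python
import pandas as pd

# Toy data (invented for illustration), response already encoded as 0/1
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web"],
    "response": [1, 0, 0, 0, 1, 0],
})

# Long format: one row per (sales_channel, variable, value)
melted = pd.melt(toy, id_vars=["sales_channel"], value_vars=["response"])

# Mean of the 0/1 values per channel, expressed as a percentage
rate_by_channel = melted.groupby("sales_channel")["value"].mean() * 100
print(rate_by_channel)
```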