# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.
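
Task 4 asks for several statistics broken down by two columns at once. One possible approach, sketched here on a small made-up DataFrame (the values are hypothetical; only the column names follow the lab dataset), is a single `groupby` on both keys followed by `agg` with a list of function names:

```python
import pandas as pd

# Hypothetical toy data; the column names mirror those in the lab dataset.
toy = pd.DataFrame({
    "Education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master", "Master"],
    "Gender": ["F", "F", "M", "F", "M", "M"],
    "Customer Lifetime Value": [2000.0, 4000.0, 3000.0, 5000.0, 1000.0, 7000.0],
})

# One groupby over both keys; agg() computes all three statistics at once.
clv_stats = (
    toy.groupby(["Education", "Gender"])["Customer Lifetime Value"]
    .agg(["max", "min", "median"])
)
print(clv_stats)
```

The result has one row per (education, gender) pair and one column per statistic, which makes the groups easy to compare side by side.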

## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
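
The three steps in the hint could be sketched as follows. This is only an illustration on made-up data: it assumes each row represents one policy sold, and that the month can be derived from the "Effective To Date" column. The ISO date strings and the helper column `month` are hypothetical and may not match the real file's format.

```python
import pandas as pd

# Hypothetical toy data: one row per policy sold. "State" and "Effective To Date"
# mirror column names in the lab dataset; the values are made up.
toy = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Oregon", "California", "California",
              "Arizona", "Arizona", "Nevada", "Washington", "California"],
    "Effective To Date": ["2011-01-05", "2011-01-20", "2011-02-03", "2011-01-11",
                          "2011-02-14", "2011-01-09", "2011-02-21", "2011-01-30",
                          "2011-02-02", "2011-02-25"],
})
toy["month"] = pd.to_datetime(toy["Effective To Date"]).dt.month

# Step 1: count policies per (state, month) group.
counts = toy.groupby(["State", "month"]).size().reset_index(name="policies")

# Step 2: rank states by their total count and keep the top 3.
top_states = (
    counts.groupby("State")["policies"].sum()
    .sort_values(ascending=False)
    .head(3)
    .index
)

# Step 3: keep only those states, with months as columns and states as rows.
top3_by_month = (
    counts[counts["State"].isin(top_states)]
    .pivot(index="State", columns="month", values="policies")
    .fillna(0)
    .astype(int)
)
print(top3_by_month)
```

Ranking on the per-state total, not on the individual (state, month) counts, keeps a state with one unusually busy month from displacing a state with more policies overall.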

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*
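
One way the melt hint might be applied is shown below. It is a sketch on made-up data: the column name "Sales Channel" is an assumption about how the marketing channel is stored, and `pd.crosstab` is just one convenient way to build the wide table that `melt` then unpivots.

```python
import pandas as pd

# Hypothetical toy data; "Sales Channel" is an assumed column name.
toy = pd.DataFrame({
    "Sales Channel": ["Agent", "Agent", "Agent", "Agent", "Web", "Web", "Branch", "Branch"],
    "Response": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# Wide table: one row per channel, one column per response, values are row shares.
wide = pd.crosstab(toy["Sales Channel"], toy["Response"], normalize="index")

# melt() unpivots the response columns into (channel, response, rate) rows.
long = wide.reset_index().melt(
    id_vars="Sales Channel", var_name="Response", value_name="rate"
)

# Keep only the "Yes" rows: the response rate per channel.
response_rate = long[long["Response"] == "Yes"].set_index("Sales Channel")["rate"]
print(response_rate)
```

The long format produced by `melt` is also the shape most plotting libraries expect, should the rates need to be charted.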

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [9]:
# Import data

import pandas as pd

df = pd.read_csv("/Users/jon/Desktop/Ironhack/Unit 2 - Data Wrangling/df_lab_aggregation_filtering.txt")
# Note: df.head without parentheses is a bound method, not a call; df.head() returns the first five rows.
print(df)


       Unnamed: 0 Customer       State  Customer Lifetime Value Response  \
0               0  DK49336     Arizona              4809.216960       No   
1               1  KX64629  California              2228.525238       No   
2               2  LZ68649  Washington             14947.917300       No   
3               3  XL78013      Oregon             22332.439460      Yes   
4               4  QA50777      Oregon              9025.067525       No   
...           ...      ...         ...                      ...      ...   
10905       10905  FE99816      Nevada             15563.369440       No   
10906       10906  KX53892      Oregon              5259.444853       No   
10907       10907  TL39050     Arizona             23893.304100       No   
10908       10908  WA60547  California             11971.977650       No   
10909       10909  IV32877         NaN              6857.519928      NaN   

       Coverage Education Effective To Date EmploymentSta

In [None]:
#1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

df_filtered = df[(df["Total Claim Amount"] > 1000) & (df["Response"] == "Yes")]
print(df_filtered)


       Unnamed: 0 Customer       State  Customer Lifetime Value Response  \
189           189  OK31456  California             11009.130490      Yes   
236           236  YJ16163      Oregon             11009.130490      Yes   
419           419  GW43195      Oregon             25807.063000      Yes   
442           442  IP94270     Arizona             13736.132500      Yes   
587           587  FJ28407  California              5619.689084      Yes   
...           ...      ...         ...                      ...      ...   
10351       10351  FN44127      Oregon              3508.569533      Yes   
10373       10373  XZ64172      Oregon             10963.957230      Yes   
10487       10487  IX60941      Oregon              3508.569533      Yes   
10565       10565  QO62792      Oregon              7840.165778      Yes   
10708       10708  CK39096      Oregon              5619.689084      Yes   

       Coverage             Education Effective To Date EmploymentStatus  \
189     Pre

In [16]:
#2. Using the original Dataframe, analyze the average total_claim_amount by each policy type and gender 
# for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

# Keep only customers who responded "Yes"
df_filtered_2 = df[df["Response"] == "Yes"]

# Mean claim amount by policy type and, separately, by gender (rounded to 2 decimals)
df_filtered_2_mean = df_filtered_2.groupby("Policy Type")["Total Claim Amount"].mean().round(2)
df_filtered_2_gender = df_filtered_2.groupby("Gender")["Total Claim Amount"].mean().round(2)

print(df_filtered_2_mean)
print(df_filtered_2_gender)

# Conclusions:
# Among customers who responded "Yes" to the last marketing campaign:
# 1) The average total claim amount is highest for "Personal Auto" (454.98) and lowest for "Corporate Auto" (421.74).
# 2) Women have a slightly higher average total claim amount than men (448.61 vs. 445.46); the difference is small, so it should be tested for statistical significance.

Policy Type
Corporate Auto    421.74
Personal Auto     454.98
Special Auto      441.94
Name: Total Claim Amount, dtype: float64
Gender
F    448.61
M    445.46
Name: Total Claim Amount, dtype: float64
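
The cell above reports policy type and gender separately. If the combination of the two is wanted, a single groupby on both keys may be enough. The sketch below uses made-up values (only the column names follow the lab dataset) and unstacks gender into columns for easier reading:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the lab dataset.
toy = pd.DataFrame({
    "Response": ["Yes", "Yes", "Yes", "Yes", "No", "Yes"],
    "Policy Type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto", "Personal Auto"],
    "Gender": ["F", "M", "F", "M", "F", "F"],
    "Total Claim Amount": [400.0, 500.0, 300.0, 350.0, 900.0, 600.0],
})

# Keep only customers who responded "Yes".
responders = toy[toy["Response"] == "Yes"]

# Group on both keys at once, then move gender from the index into columns.
mean_claims = (
    responders.groupby(["Policy Type", "Gender"])["Total Claim Amount"]
    .mean()
    .round(2)
    .unstack("Gender")
)
print(mean_claims)
```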


In [38]:
#3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

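# Keep only rows that have a policy recorded
df_ej_3 = df[df["Policy Type"].notna()]

# "count" is the number of rows per state; "nunique" is the number of distinct customer IDs
df_ej3_agg = df_ej_3.groupby("State").agg({"Customer": ["count", "nunique"]})

# Keep states with more than 500 distinct customers
df_filtered_500 = df_ej3_agg[df_ej3_agg[("Customer", "nunique")] > 500]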

print(df_filtered_500)

           Customer        
              count nunique
State                      
Arizona        1937    1703
California     3552    3150
Nevada          993     882
Oregon         2909    2601
Washington      888     798


In [53]:
#4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

# Note: the task asks for the median; this cell computes the mean instead.
max_lifetime_value_gender = df.groupby("Gender")["Customer Lifetime Value"].max()
min_lifetime_value_gender = df.groupby("Gender")["Customer Lifetime Value"].min()
mean_lifetime_value_gender = df.groupby("Gender")["Customer Lifetime Value"].mean()

max_lifetime_value_education = df.groupby("Education")["Customer Lifetime Value"].max()
min_lifetime_value_education = df.groupby("Education")["Customer Lifetime Value"].min()
mean_lifetime_value_education = df.groupby("Education")["Customer Lifetime Value"].mean()

print(f"The max life time value by gender is {max_lifetime_value_gender}")
print(f"The min life time value by gender is {min_lifetime_value_gender}")
print(f"The mean life time value by gender is {mean_lifetime_value_gender}")
print(f"The max life time value by education level is {max_lifetime_value_education}")
print(f"The min life time value by education level is {min_lifetime_value_education}")
print(f"The mean life time value by education level is {mean_lifetime_value_education}")

# Conclusions:

# 1) Females have a slightly higher average lifetime value than males (8071.11 vs. 7963.04); again, this should be tested for statistical significance.
# 2) Customers with a "High School or Below" education level have the highest maximum lifetime value (83325.38).

The max life time value by gender is Gender
F    73225.95652
M    83325.38119
Name: Customer Lifetime Value, dtype: float64
The min life time value by gender is Gender
F    1898.683686
M    1898.007675
Name: Customer Lifetime Value, dtype: float64
The mean life time value by gender is Gender
F    8071.105001
M    7963.039566
Name: Customer Lifetime Value, dtype: float64
The max life time value by education level is Education
Bachelor                73225.95652
College                 61850.18803
Doctor                  44856.11397
High School or Below    83325.38119
Master                  51016.06704
Name: Customer Lifetime Value, dtype: float64
The min life time value by education level is Education
Bachelor                1898.007675
College                 1898.683686
Doctor                  2267.604038
High School or Below    1940.981221
Master                  2272.307310
Name: Customer Lifetime Value, dtype: float64
The mean life time value by education level is Education
Bachel