# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [1]:
import pandas as pd

In [2]:
from data_cleaning import *

In [3]:
url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'
df = load_df(url)

In [4]:
# Standardize the column names
column_names(df)
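
The `data_cleaning` module imported above is not shown in this notebook. The following is a minimal sketch of what `load_df` and `column_names` might look like, assuming they wrap `pd.read_csv` and lower-case the headers while replacing spaces with underscores (which would be consistent with column names such as `unnamed:_0` and `customer_lifetime_value` seen in the outputs). Both function bodies and the sample CSV are assumptions, not the actual module.

```python
import io
import pandas as pd

def load_df(source):
    """Hypothetical helper: read a CSV from a URL, a path, or a file-like object."""
    return pd.read_csv(source)

def column_names(frame):
    """Hypothetical helper: lower-case headers and replace spaces with underscores, in place."""
    frame.columns = frame.columns.str.lower().str.replace(' ', '_', regex=False)
    return frame

# Tiny in-memory CSV standing in for the real file (values are illustrative only)
sample_csv = io.StringIO(
    "Unnamed: 0,Customer,Customer Lifetime Value,EmploymentStatus\n"
    "0,AA00000,100.5,Employed\n"
)
sample = load_df(sample_csv)
column_names(sample)
print(list(sample.columns))
```

Modifying the columns in place matches how the notebook calls `column_names(df)` without reassigning the result.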

In [5]:
df.head()

,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
0,0,DK49336,Arizona,4809.21696,No,Basic,College,2/18/11,Employed,M,...,0.0,9,Corporate Auto,Corporate L3,Offer3,Agent,292.8,Four-Door Car,Medsize,
1,1,KX64629,California,2228.525238,No,Basic,College,1/18/11,Unemployed,F,...,0.0,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,
2,2,LZ68649,Washington,14947.9173,No,Basic,Bachelor,2/10/11,Employed,M,...,0.0,2,Personal Auto,Personal L3,Offer3,Call Center,480.0,SUV,Medsize,A
3,3,XL78013,Oregon,22332.43946,Yes,Extended,College,1/11/11,Employed,M,...,0.0,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,1/17/11,Medical Leave,F,...,,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,


In [6]:
# Print the unique values of each column
def unique_value_col(df, col):
    unique_values = df[col].unique()
    return f"Unique values in column '{col}': {unique_values}\n"

for col in df.columns:
    print(unique_value_col(df, col))

Unique values in column 'unnamed:_0': [    0     1     2 ... 10907 10908 10909]

Unique values in column 'customer': ['DK49336' 'KX64629' 'LZ68649' ... 'KX53892' 'TL39050' 'WA60547']

Unique values in column 'state': ['Arizona' 'California' 'Washington' 'Oregon' nan 'Nevada']

Unique values in column 'customer_lifetime_value': [ 4809.21696   2228.525238 14947.9173   ...  5259.444853 23893.3041
 11971.97765 ]

Unique values in column 'response': ['No' 'Yes' nan]

Unique values in column 'coverage': ['Basic' 'Extended' 'Premium']

Unique values in column 'education': ['College' 'Bachelor' 'High School or Below' 'Doctor' 'Master']

Unique values in column 'effective_to_date': ['2/18/11' '1/18/11' '2/10/11' '1/11/11' '1/17/11' '2/14/11' '2/24/11'
 '1/19/11' '1/4/11' '1/2/11' '2/7/11' '1/31/11' '1/26/11' '2/28/11'
 '1/16/11' '2/26/11' '2/23/11' '1/15/11' '2/2/11' '2/15/11' '1/24/11'
 '2/21/11' '2/22/11' '1/7/11' '1/28/11' '2/8/11' '2/12/11' '2/20/11'
 '1/5/11' '2/19/11' '1/3/11' '2/3/11' '1

In [7]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
10905    False
10906    False
10907    False
10908    False
10909    False
Length: 10910, dtype: bool
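
The boolean Series above is hard to read at a glance. A short sketch of how it could be collapsed into counts, using a made-up toy frame (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: row 2 repeats row 0 exactly; row 3 reuses a customer ID with a different state
toy = pd.DataFrame({
    'customer': ['AA00000', 'BB11111', 'AA00000', 'BB11111'],
    'state': ['Oregon', 'Nevada', 'Oregon', 'Oregon'],
})

# Number of rows that are exact copies of an earlier row
n_duplicate_rows = int(toy.duplicated().sum())

# Number of rows whose customer ID already appeared earlier
n_duplicate_ids = int(toy.duplicated(subset=['customer']).sum())

print(n_duplicate_rows, n_duplicate_ids)
```

On the real data the same two lines would be applied to `df`; checking the `customer` column separately can reveal repeated IDs even when full rows differ.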

In [8]:
df.isna()

,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
10906,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10907,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
10908,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
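
A full True/False table is difficult to scan for 10,910 rows. As a sketch, the missing values can instead be counted per column; the toy frame below is illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['Oregon', None, 'Nevada', None],
    'response': ['Yes', 'No', None, 'No'],
    'total_claim_amount': [292.8, 744.9, 480.0, 484.0],
})

# Count missing values per column, then keep only the columns that have any
missing_counts = toy.isna().sum()
missing_only = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_only)
```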


In [9]:
# Idea for later: add another branch that assigns each column its correct data type
for col in df.columns:
    # Check if there are missing values (NaN) in the column
    if df[col].isna().any():
        # Fill missing values with 'Unknown'; assigning the result back avoids
        # the warning raised by calling fillna(..., inplace=True) on a column.
        # Note: a numeric column filled with a string becomes object dtype.
        df[col] = df[col].fillna('Unknown')
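
One possible answer to the question in the first comment of this cell is to pick the fill value based on each column's dtype. The sketch below is an assumption about how that could look: the helper name `fill_missing_by_dtype` is hypothetical, and using the median for numeric columns is just one reasonable choice, not something the lab prescribes.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_missing_by_dtype(frame):
    """Hypothetical helper: median for numeric columns, 'Unknown' for the rest."""
    out = frame.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            # Keeps the column numeric so later comparisons and means still work
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna('Unknown')
    return out

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'state': ['Arizona', None, 'Oregon'],
    'number_of_open_complaints': [0.0, None, 2.0],
})
filled = fill_missing_by_dtype(toy)
print(filled)
print(filled.dtypes)
```

Keeping numeric columns numeric matters because a column that contains the string `'Unknown'` can no longer be averaged or compared with `>`.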


In [10]:
df.isna().any()

unnamed:_0                       False
customer                         False
state                            False
customer_lifetime_value          False
response                         False
coverage                         False
education                        False
effective_to_date                False
employmentstatus                 False
gender                           False
income                           False
location_code                    False
marital_status                   False
monthly_premium_auto             False
months_since_last_claim          False
months_since_policy_inception    False
number_of_open_complaints        False
number_of_policies               False
policy_type                      False
policy                           False
renew_offer_type                 False
sales_channel                    False
total_claim_amount               False
vehicle_class                    False
vehicle_size                     False
vehicle_type             

In [11]:
# Print the unique values of each column again, after filling missing values
for col in list(df.columns):
    print(unique_value_col(df, col))

Unique values in column 'unnamed:_0': [    0     1     2 ... 10907 10908 10909]

Unique values in column 'customer': ['DK49336' 'KX64629' 'LZ68649' ... 'KX53892' 'TL39050' 'WA60547']

Unique values in column 'state': ['Arizona' 'California' 'Washington' 'Oregon' 'Unknown' 'Nevada']

Unique values in column 'customer_lifetime_value': [ 4809.21696   2228.525238 14947.9173   ...  5259.444853 23893.3041
 11971.97765 ]

Unique values in column 'response': ['No' 'Yes' 'Unknown']

Unique values in column 'coverage': ['Basic' 'Extended' 'Premium']

Unique values in column 'education': ['College' 'Bachelor' 'High School or Below' 'Doctor' 'Master']

Unique values in column 'effective_to_date': ['2/18/11' '1/18/11' '2/10/11' '1/11/11' '1/17/11' '2/14/11' '2/24/11'
 '1/19/11' '1/4/11' '1/2/11' '2/7/11' '1/31/11' '1/26/11' '2/28/11'
 '1/16/11' '2/26/11' '2/23/11' '1/15/11' '2/2/11' '2/15/11' '1/24/11'
 '2/21/11' '2/22/11' '1/7/11' '1/28/11' '2/8/11' '2/12/11' '2/20/11'
 '1/5/11' '2/19/11' '1/3/11'

In [12]:
#1. Create a new DataFrame that only includes customers who have a 
# total_claim_amount greater than $1,000 and have a response of "Yes" 
# to the last marketing campaign.

claimamount_over1000 = df[(df['total_claim_amount'] > 1000) & (df['response'] == 'Yes')]
claimamount_over1000

,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type
189,189,OK31456,California,11009.130490,Yes,Premium,Bachelor,1/24/11,Employed,F,...,0.0,1,Corporate Auto,Corporate L3,Offer2,Agent,1358.400000,Luxury Car,Medsize,Unknown
236,236,YJ16163,Oregon,11009.130490,Yes,Premium,Bachelor,1/24/11,Employed,F,...,0.0,1,Special Auto,Special L3,Offer2,Agent,1358.400000,Luxury Car,Medsize,A
419,419,GW43195,Oregon,25807.063000,Yes,Extended,College,2/13/11,Employed,F,...,1.0,2,Personal Auto,Personal L2,Offer1,Branch,1027.200000,Luxury Car,Small,A
442,442,IP94270,Arizona,13736.132500,Yes,Premium,Master,2/13/11,Disabled,F,...,0.0,8,Personal Auto,Personal L2,Offer1,Web,1261.319869,SUV,Medsize,A
587,587,FJ28407,California,5619.689084,Yes,Premium,High School or Below,1/26/11,Unemployed,M,...,0.0,1,Personal Auto,Personal L1,Offer2,Web,1027.000029,SUV,Medsize,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10351,10351,FN44127,Oregon,3508.569533,Yes,Extended,College,1/5/11,Medical Leave,M,...,1.0,1,Personal Auto,Personal L2,Offer2,Branch,1176.278800,Four-Door Car,Small,Unknown
10373,10373,XZ64172,Oregon,10963.957230,Yes,Premium,High School or Below,2/8/11,Employed,M,...,0.0,1,Corporate Auto,Corporate L2,Offer1,Agent,1324.800000,Luxury SUV,Medsize,Unknown
10487,10487,IX60941,Oregon,3508.569533,Yes,Extended,College,1/5/11,Medical Leave,M,...,1.0,1,Personal Auto,Personal L3,Offer2,Branch,1176.278800,Four-Door Car,Small,Unknown
10565,10565,QO62792,Oregon,7840.165778,Yes,Extended,College,1/14/11,Employed,M,...,2.0,1,Personal Auto,Personal L3,Offer2,Agent,1008.000000,Unknown,Unknown,Unknown


In [13]:
#2. Using the original DataFrame, analyze the average 
# total_claim_amount by each policy type and gender 
# for customers who have responded "Yes" to the last 
# marketing campaign. Write your conclusions.

filtered_df = df[df['response'] == 'Yes']

pivot_table_1 = filtered_df.pivot_table(
    index=['policy_type'],
    columns=['gender'],
    values='total_claim_amount',
    aggfunc=['mean']
)

print(pivot_table_1)

print("Among customers who responded 'Yes', men have a lower average total claim amount than women\nfor Corporate Auto and Special Auto policies, and a slightly higher one for Personal Auto policies.")

                      mean            
gender                   F           M
policy_type                           
Corporate Auto  433.738499  408.582459
Personal Auto   452.965929  457.010178
Special Auto    453.280164  429.527942
Among customers who responded 'Yes', men have a lower average total claim amount than women
for Corporate Auto and Special Auto policies, and a slightly higher one for Personal Auto policies.


In [23]:
#3. Analyze the total number of customers who have policies in each 
# state, and then filter the results to only include states where there 
# are more than 500 customers.

customer_state = df.groupby('state')['customer'].count().reset_index()

# Filter the data to only include states with more than 500 customers
# (note that the 'Unknown' placeholder for missing states is counted as a state here)
states_over_500_customers = customer_state[customer_state['customer'] > 500]
states_over_500_customers

,state,customer
0,Arizona,1937
1,California,3552
2,Nevada,993
3,Oregon,2909
4,Unknown,631
5,Washington,888
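
Task 4 is not worked out in the cells above. A minimal sketch of one way to approach it, using `groupby` with several aggregations, is shown here on a made-up toy frame (the values are illustrative and no conclusions about the real data are implied):

```python
import pandas as pd

# Toy data with the same column names as the cleaned dataset (values are invented)
toy = pd.DataFrame({
    'education': ['College', 'College', 'College', 'Master', 'Master', 'Master'],
    'gender': ['F', 'F', 'M', 'M', 'M', 'F'],
    'customer_lifetime_value': [4000.0, 6000.0, 5000.0, 9000.0, 3000.0, 7000.0],
})

# Maximum, minimum, and median customer lifetime value per education level and gender
clv_stats = (
    toy.groupby(['education', 'gender'])['customer_lifetime_value']
       .agg(['max', 'min', 'median'])
)
print(clv_stats)
```

Applied to the real `df`, the same call would produce one row per education and gender pair, which can then be compared to write the conclusions.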


## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.
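
A possible starting point for this table is sketched below on toy data. It assumes that each row represents one policy sold and that the month comes from `effective_to_date` in `month/day/year` format; summing `number_of_policies` instead would be an equally defensible reading of the task.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'state': ['Oregon', 'Oregon', 'Oregon', 'Nevada', 'Nevada', 'Arizona'],
    'effective_to_date': ['1/5/11', '1/24/11', '2/13/11', '2/8/11', '2/19/11', '1/2/11'],
})

# Parse the date strings and extract the month number
toy['month'] = pd.to_datetime(toy['effective_to_date'], format='%m/%d/%y').dt.month

# States as rows, months as columns, cell = number of rows (policies) in that group
policies_by_state_month = pd.crosstab(toy['state'], toy['month'])
print(policies_by_state_month)
```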

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
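
The three steps in the hint could be sketched as follows. The toy frame and the column name `policies_sold` are made up for illustration, and each row is again assumed to count as one policy sold.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'state': ['Oregon'] * 4 + ['California'] * 3 + ['Arizona'] * 2 + ['Nevada'],
    'effective_to_date': [
        '1/5/11', '1/24/11', '2/13/11', '2/1/11',   # Oregon
        '1/3/11', '2/8/11', '2/9/11',               # California
        '1/2/11', '2/2/11',                         # Arizona
        '1/7/11',                                   # Nevada
    ],
})
toy['month'] = pd.to_datetime(toy['effective_to_date'], format='%m/%d/%y').dt.month

# Step 1: count policies per state and month
counts = toy.groupby(['state', 'month']).size().reset_index(name='policies_sold')

# Step 2: pick the 3 states with the highest overall totals
top_states = counts.groupby('state')['policies_sold'].sum().nlargest(3).index

# Step 3: keep only the monthly counts of those states
top3_by_month = (
    counts[counts['state'].isin(top_states)]
    .sort_values(['state', 'month'])
    .reset_index(drop=True)
)
print(top3_by_month)
```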

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.
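
As a sketch of the `melt` approach mentioned in the hint, the example below computes the share of each response per `sales_channel`, unpivots it into long form, and keeps the "Yes" rows. The toy data and the column name `rate` are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    'sales_channel': ['Agent', 'Agent', 'Agent', 'Agent',
                      'Web', 'Web', 'Web', 'Web',
                      'Branch', 'Branch'],
    'response': ['Yes', 'No', 'Yes', 'No',
                 'No', 'No', 'No', 'Yes',
                 'Yes', 'Yes'],
})

# Share of each response value within every sales channel (rows sum to 1)
shares = pd.crosstab(toy['sales_channel'], toy['response'], normalize='index').reset_index()

# Unpivot from one column per response value to one row per (channel, response)
long_shares = shares.melt(id_vars='sales_channel', var_name='response', value_name='rate')

# Keep only the "Yes" share, which is the response rate per channel
response_rate = (
    long_shares[long_shares['response'] == 'Yes']
    .drop(columns='response')
    .sort_values('rate', ascending=False)
    .reset_index(drop=True)
)
print(response_rate)
```

On the cleaned `df`, rows whose response was filled with `'Unknown'` would count in the denominator, so it may be worth dropping them first if the rate should only reflect known responses.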

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [None]:
# your code goes here