# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

1. Create a new DataFrame that only includes customers who:
   - have a **low total_claim_amount** (e.g., below $1,000),
   - have a response "Yes" to the last marketing campaign.

In [1]:
import pandas as pd

insurance = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv")

insurance.columns = insurance.columns.str.lower().str.replace(" ", "_")

insurance.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 26 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   unnamed:_0                     10910 non-null  int64  
 1   customer                       10910 non-null  object 
 2   state                          10279 non-null  object 
 3   customer_lifetime_value        10910 non-null  float64
 4   response                       10279 non-null  object 
 5   coverage                       10910 non-null  object 
 6   education                      10910 non-null  object 
 7   effective_to_date              10910 non-null  object 
 8   employmentstatus               10910 non-null  object 
 9   gender                         10910 non-null  object 
 10  income                         10910 non-null  int64  
 11  location_code                  10910 non-null  object 
 12  marital_status                 10910 non-null 

In [2]:
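# Flag each condition as a boolean column on `insurance`
insurance["claim_under"] = insurance["total_claim_amount"] < 1000
insurance["campaing_yes"] = insurance["response"] == "Yes"

# Keep only the rows where both conditions hold
new_insurance = insurance[insurance["campaing_yes"] & insurance["claim_under"]]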

display(insurance.columns)

Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type', 'claim_under', 'campaing_yes'],
      dtype='object')
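
Adding `claim_under` and `campaing_yes` to `insurance` changes the original DataFrame, which the later steps reuse. A minimal sketch of the same filter with one combined boolean mask and no helper columns is shown here. The rows in `toy` and the name `low_claim_yes` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows that mimic the two columns used by the filter.
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "total_claim_amount": [250.0, 1200.0, 999.99, 400.0],
    "response": ["Yes", "Yes", "No", None],
})

# Combine both conditions in one mask; each comparison needs its own parentheses.
# A missing response compares as False, so that row is dropped.
mask = (toy["total_claim_amount"] < 1000) & (toy["response"] == "Yes")

# .copy() returns an independent DataFrame, so later edits do not touch `toy`.
low_claim_yes = toy.loc[mask].copy()

print(low_claim_yes)
```

With this approach the source frame keeps its original columns, so "using the original DataFrame" in the next step needs no cleanup.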

2. Using the original DataFrame, analyze:
   - the average `monthly_premium_auto` and/or `customer_lifetime_value` by `policy_type` and `gender` for customers who responded "Yes", and
   - compare these insights to `total_claim_amount` patterns, and discuss which segments appear most profitable or low-risk for the company.

In [3]:
grouped = insurance.groupby(["policy_type", "gender"])

grouped["monthly_premium_auto"].mean().round(2)

policy_type     gender
Corporate Auto  F         91.38
                M         94.76
Personal Auto   F         93.15
                M         93.30
Special Auto    F         93.56
                M         93.20
Name: monthly_premium_auto, dtype: float64

In [4]:
grouped = insurance.groupby(["policy_type", "gender"])

grouped["total_claim_amount"].mean().round(2)

policy_type     gender
Corporate Auto  F         397.80
                M         462.22
Personal Auto   F         413.24
                M         459.92
Special Auto    F         458.14
                M         420.36
Name: total_claim_amount, dtype: float64

The main pattern in these two tables is that, for the Corporate Auto and Personal Auto policy types, male customers pay a slightly higher average monthly premium and also have noticeably higher average total claim amounts than female customers (462.22 vs. 397.80 for Corporate Auto, 459.92 vs. 413.24 for Personal Auto). This suggests that, within those policy types, male customers typically represent a higher risk or have a claim pattern that leads to larger payouts. Special Auto is the exception: there, female customers have both the higher average premium (93.56 vs. 93.20) and the higher average claim amount (458.14 vs. 420.36).
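
The two groupings above cover every customer, whereas the task asks about customers who responded "Yes". One way to apply that restriction first is sketched here, using a small invented frame (`toy`, `responders`, and `summary` are illustrative names, and the numbers are not from the dataset).

```python
import pandas as pd

# Hypothetical rows with the columns used in this step.
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto",
                    "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "F", "F"],
    "response": ["Yes", "Yes", "No", "Yes", "Yes"],
    "monthly_premium_auto": [80, 100, 90, 70, 100],
    "total_claim_amount": [300.0, 500.0, 400.0, 200.0, 500.0],
})

# Keep only the customers who responded "Yes" before grouping.
responders = toy[toy["response"] == "Yes"]

# Average premium and claim amount side by side for each segment.
summary = (
    responders
    .groupby(["policy_type", "gender"])[["monthly_premium_auto", "total_claim_amount"]]
    .mean()
    .round(2)
)

print(summary)
```

Putting both averages in one table makes it easier to compare what each segment pays with what it claims.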

3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [5]:
# Select the column before aggregating so only `customer` is counted
CM_state = insurance.groupby(["policy", "state"])["customer"].nunique()
CM_state


policy        state     
Corporate L1  Arizona         58
              California     148
              Nevada          32
              Oregon          85
              Washington      36
Corporate L2  Arizona        106
              California     225
              Nevada          61
              Oregon         161
              Washington      42
Corporate L3  Arizona        169
              California     366
              Nevada         102
              Oregon         294
              Washington      83
Personal L1   Arizona        255
              California     392
              Nevada         120
              Oregon         361
              Washington     112
Personal L2   Arizona        384
              California     798
              Nevada         189
              Oregon         547
              Washington     204
Personal L3   Arizona        654
              California    1108
              Nevada         343
              Oregon        1030
              Wash

In [6]:
CM_state = insurance.groupby("state")["customer"].nunique()
valid_states = CM_state[CM_state > 500].index.tolist()
cm_st_insurance = insurance[insurance["state"].isin(valid_states)]

display(cm_st_insurance["state"].unique())

array(['Arizona', 'California', 'Washington', 'Oregon', 'Nevada'],
      dtype=object)

In [7]:
display(cm_st_insurance.groupby('state')['customer'].nunique())

state
Arizona       1703
California    3150
Nevada         882
Oregon        2601
Washington     798
Name: customer, dtype: int64
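
Because all five states have more than 500 customers, the filter does not remove any of them here. As a compact alternative, the count-then-filter step can be applied directly to the Series of counts. The sketch below assumes a tiny invented frame and a hypothetical `THRESHOLD` of 1 in place of 500 so the effect is visible.

```python
import pandas as pd

# Hypothetical customers; the real data has thousands per state.
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4", "E5", "F6"],
    "state": ["Oregon", "Oregon", "Oregon", "Nevada", "Nevada", "Arizona"],
})

THRESHOLD = 1  # stand-in for the 500 used in the lab

# Number of distinct customers per state.
customers_per_state = toy.groupby("state")["customer"].nunique()

# Boolean indexing on the Series keeps only the states above the threshold.
large_states = customers_per_state[customers_per_state > THRESHOLD]

print(large_states)
```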

4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [None]:
index = ["education", "gender"]
# index = ["gender"]
edu_gender = insurance.groupby(index)["customer_lifetime_value"].agg(max_="max", min_="min", median_="median")

edu_gender.round(2)
# insurance.gender.info()

# Conclusions:
# Looking at gender alone, male customers have a higher lifetime value than female customers. Broken down by education level, however, the maximum values are dominated by female customers:
# they trail male customers only in the "High School or Below" group. The medians are mixed: male customers are higher for College, Doctor, and High School or Below, while female customers are higher for Bachelor and Master.

education,gender,max_,min_,median_
Bachelor,F,73225.96,1904.0,5640.51
Bachelor,M,67907.27,1898.01,5548.03
College,F,61850.19,1898.68,5623.61
College,M,61134.68,1918.12,6005.85
Doctor,F,44856.11,2395.57,5332.46
Doctor,M,32677.34,2267.6,5577.67
High School or Below,F,55277.45,2144.92,6039.55
High School or Below,M,83325.38,1940.98,6286.73
Master,F,51016.07,2417.78,5729.86
Master,M,50568.26,2272.31,5579.1


## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.

6.  Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
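
The steps in the hint could be sketched as follows. This is only one possible approach and rests on two assumptions that the lab does not state: each row counts as one policy sold, and the month is taken from `effective_to_date`. The rows in `toy` are invented.

```python
import pandas as pd

# Hypothetical rows; dates are zero-padded month/day/two-digit-year strings.
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Arizona", "Arizona",
              "California", "California", "California",
              "Oregon", "Oregon", "Nevada", "Washington"],
    "effective_to_date": ["01/02/11", "01/15/11", "02/03/11", "02/20/11",
                          "01/07/11", "02/08/11", "02/18/11",
                          "01/05/11", "02/09/11", "01/20/11", "02/11/11"],
})

# Assumption: the policy month comes from `effective_to_date`.
toy["month"] = pd.to_datetime(toy["effective_to_date"], format="%m/%d/%y").dt.month

# States as rows, months as columns, cell = number of rows (policies).
by_state_month = pd.crosstab(toy["state"], toy["month"])

# value_counts sorts in descending order, so head(3) gives the top 3 states.
top3 = toy["state"].value_counts().head(3).index

# Restrict the state-by-month table to those states.
top3_by_month = by_state_month.loc[top3]

print(by_state_month)
print(top3_by_month)
```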

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use `melt` to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*
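
One possible reading of this hint is sketched below, assuming that the marketing channel is the `sales_channel` column. The rows in `toy` are invented, and `rates_wide`, `rates_long`, and `yes_rate` are illustrative names.

```python
import pandas as pd

# Hypothetical responses per channel.
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Agent",
                      "Web", "Web", "Branch", "Branch"],
    "response": ["Yes", "No", "No", "No", "Yes", "No", "No", "No"],
})

# Share of each response within a channel; every row sums to 1.
rates_wide = pd.crosstab(toy["sales_channel"], toy["response"], normalize="index")

# melt unpivots the "No"/"Yes" columns into one long table.
rates_long = rates_wide.reset_index().melt(
    id_vars="sales_channel", var_name="response", value_name="rate"
)

# Response rate = share of "Yes" answers per channel.
yes_rate = rates_long[rates_long["response"] == "Yes"].set_index("sales_channel")["rate"]

print(yes_rate)
```

The long format is convenient for plotting, but a plain `groupby` on the channel would give the same rates.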

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [9]:
# your code goes here