# Lab | Data Aggregation and Filtering

In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv

This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring.

In [119]:
import pandas as pd

url = "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv"

data_mca = pd.read_csv(url)


In [120]:
data_mca.columns = data_mca.columns.str.replace(' ', '_')
data_mca.columns = data_mca.columns.str.lower()

In [121]:
data_mca["gender"].value_counts()

gender
F    5573
M    5337
Name: count, dtype: int64

In [122]:
data_mca = data_mca.drop(columns=['unnamed:_0'])
data_mca.duplicated().sum()

np.int64(37)

In [123]:
# Keep only the first row for each customer ID; this also removes the fully duplicated rows counted above
data_mca = data_mca.drop_duplicates(subset=['customer'], keep='first')
data_mca.duplicated().sum()

np.int64(0)
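
Note that `duplicated()` with no arguments compares whole rows, while `drop_duplicates(subset=['customer'])` compares only the customer ID, so the second call can remove more rows than the first one counted. A minimal sketch of the difference, using a made-up toy table (the IDs and states below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 are exact copies, and
# customer "B2" appears twice with different states.
toy = pd.DataFrame({
    "customer": ["A1", "A1", "B2", "B2", "C3"],
    "state": ["Oregon", "Oregon", "Nevada", "Arizona", "Oregon"],
})

# Rows that repeat an earlier row in every column.
full_row_dupes = int(toy.duplicated().sum())

# Rows whose customer ID was already seen, whatever the other columns hold.
id_dupes = int(toy.duplicated(subset=["customer"]).sum())

deduped_rows = toy.drop_duplicates(keep="first")
deduped_ids = toy.drop_duplicates(subset=["customer"], keep="first")

print(full_row_dupes, id_dupes)
print(len(deduped_rows), len(deduped_ids))
```

Which variant is appropriate depends on whether a repeated customer ID with different attributes should count as the same record; that is a modelling decision rather than something pandas can settle.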

In [124]:
data_mca.isnull().sum()

customer                            0
state                             290
customer_lifetime_value             0
response                          290
coverage                            0
education                           0
effective_to_date                   0
employmentstatus                    0
gender                              0
income                              0
location_code                       0
marital_status                      0
monthly_premium_auto                0
months_since_last_claim           287
months_since_policy_inception       0
number_of_open_complaints         287
number_of_policies                  0
policy_type                         0
policy                              0
renew_offer_type                    0
sales_channel                       0
total_claim_amount                  0
vehicle_class                     267
vehicle_size                      267
vehicle_type                     4608
dtype: int64

In [125]:
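# Fill missing gender values by sampling M/F in proportion to the known distribution.
# (gender has no missing values in the check above, so this cell leaves the data unchanged)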
import numpy as np

gender_distribution = data_mca['gender'].value_counts(normalize=True)
print("Gender Distribution (Known):")
print(gender_distribution)

prob_male = gender_distribution.get('M', 0)
prob_female = gender_distribution.get('F', 0)


missing_gender_rows = data_mca[data_mca['gender'].isnull()]

random_assignments = np.random.choice(['M', 'F'], size=missing_gender_rows.shape[0], p=[prob_male, prob_female])

data_mca.loc[data_mca['gender'].isnull(), 'gender'] = random_assignments

Gender Distribution (Known):
gender
F    0.509963
M    0.490037
Name: proportion, dtype: float64
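
Because the fill is random, the result changes on every run unless the generator is seeded. A small sketch of the same proportional fill with a seeded NumPy generator, on a made-up column (the values and the seed are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing entries.
toy = pd.DataFrame({"gender": ["F", "M", "F", None, "F", None, "M"]})

# Proportions among the known values only; missing values are excluded by default.
dist = toy["gender"].value_counts(normalize=True)

# A seeded generator makes the random fill repeatable between runs.
rng = np.random.default_rng(seed=0)

mask = toy["gender"].isna()
toy.loc[mask, "gender"] = rng.choice(
    dist.index.to_numpy(), size=int(mask.sum()), p=dist.to_numpy()
)

print(toy["gender"].value_counts())
```

The same pattern applies to any categorical column, since the categories and probabilities both come from `value_counts(normalize=True)`.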


In [126]:
# Fill missing state values by sampling states in proportion to the known distribution.
state_distribution = data_mca['state'].value_counts(normalize=True)
print("State Distribution (Known):")
print(state_distribution)

states = state_distribution.index
probabilities = state_distribution.values

missing_state_rows = data_mca[data_mca['state'].isnull()]

random_assignments = np.random.choice(states, size=missing_state_rows.shape[0], p=probabilities)

data_mca.loc[data_mca['state'].isnull(), 'state'] = random_assignments

State Distribution (Known):
state
California    0.343171
Oregon        0.284600
Arizona       0.187133
Nevada        0.097580
Washington    0.087517
Name: proportion, dtype: float64


In [127]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection;
# the chained in-place form raises a FutureWarning and stops working in pandas 3.0
data_mca['response'] = data_mca['response'].fillna('unknown')


In [128]:
data_mca['months_since_last_claim'] = data_mca['months_since_last_claim'].fillna(data_mca['months_since_last_claim'].mean()).round(1)


In [129]:
data_mca['number_of_open_complaints'] = data_mca['number_of_open_complaints'].fillna(data_mca['number_of_open_complaints'].mean()).round(0)


In [130]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# Select the features and target variable
features = data_mca.drop(columns=['vehicle_class'])
target = data_mca['vehicle_class']

# Encode categorical variables as integers (missing values become the string 'nan' and get their own code)
label_encoders = {}
for column in features.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    features[column] = le.fit_transform(features[column].astype(str))
    label_encoders[column] = le

# Use rows with a known vehicle_class for training; rows where it is missing are the ones to predict
X_train = features[target.notnull()]
y_train = target[target.notnull()]
X_test = features[target.isnull()]

# Train a RandomForestClassifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict the missing values
predicted_vehicle_class = clf.predict(X_test)

# Fill the missing values in the original dataframe
data_mca.loc[data_mca['vehicle_class'].isnull(), 'vehicle_class'] = predicted_vehicle_class

# Verify the missing values are filled
print(data_mca['vehicle_class'].isnull().sum())

0
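
The cell above fills the gaps but does not tell us how reliable the predicted classes are. One way to get a rough estimate is to hold out some rows whose class is known and score the classifier on them. The sketch below does this on synthetic stand-in data; the column values, class labels, and the rule linking premium to class are all assumptions made for illustration, not properties of the real dataset:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): the class depends only on the premium.
rng = np.random.default_rng(0)
premium = rng.uniform(60, 300, size=400)
toy = pd.DataFrame({
    "monthly_premium_auto": premium,
    "income": rng.integers(0, 100_000, size=400),
    "vehicle_class": np.where(premium > 180, "Luxury Car", "Four-Door Car"),
})

X = toy[["monthly_premium_auto", "income"]]
y = toy["vehicle_class"]

# Hold out 25% of the labelled rows to estimate how well the imputer predicts.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf_check = RandomForestClassifier(n_estimators=50, random_state=42)
clf_check.fit(X_fit, y_fit)

holdout_accuracy = accuracy_score(y_holdout, clf_check.predict(X_holdout))
print(f"Holdout accuracy: {holdout_accuracy:.3f}")
```

On the real data the same split would be applied only to rows where `vehicle_class` is not null, before predicting the missing ones.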


In [131]:
# Fill missing values in vehicle_size based on vehicle_class
vehicle_size_mode = data_mca.groupby('vehicle_class')['vehicle_size'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

def fill_vehicle_size(row):
    if pd.isnull(row['vehicle_size']):
        return vehicle_size_mode.get(row['vehicle_class'], np.nan)
    return row['vehicle_size']

data_mca['vehicle_size'] = data_mca.apply(fill_vehicle_size, axis=1)

# Verify the missing values are filled
print(data_mca['vehicle_size'].isnull().sum())

0
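
The row-wise `apply` works, but the same group-mode fill can also be written without a Python-level loop by mapping each row's class to its most frequent size. A minimal sketch on toy data (the class and size labels are hypothetical, and it assumes every class has at least one known size):

```python
import pandas as pd

# Toy data (hypothetical values) with one missing size per class.
toy = pd.DataFrame({
    "vehicle_class": ["SUV", "SUV", "SUV", "Two-Door Car", "Two-Door Car", "Two-Door Car"],
    "vehicle_size": ["Medsize", "Medsize", None, "Small", None, "Small"],
})

# Most frequent size within each class; mode() skips missing values.
size_mode = toy.groupby("vehicle_class")["vehicle_size"].agg(lambda s: s.mode().iloc[0])

# Look up each row's class in the per-class mode, then fill only the gaps.
toy["vehicle_size"] = toy["vehicle_size"].fillna(toy["vehicle_class"].map(size_mode))

print(toy)
```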


In [132]:
# Fill missing values in vehicle_type based on vehicle_class and vehicle_size
vehicle_type_mode = data_mca.groupby(['vehicle_class', 'vehicle_size'])['vehicle_type'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

def fill_vehicle_type(row):
    if pd.isnull(row['vehicle_type']):
        return vehicle_type_mode.get((row['vehicle_class'], row['vehicle_size']), np.nan)
    return row['vehicle_type']

data_mca['vehicle_type'] = data_mca.apply(fill_vehicle_type, axis=1)

# Verify the missing values are filled
print(data_mca['vehicle_type'].isnull().sum())

0


In [133]:
data_mca.isnull().sum()

customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
vehicle_class                    0
vehicle_size                     0
vehicle_type                     0
dtype: int64

1. Create a new DataFrame that only includes customers who have a total_claim_amount greater than $1,000 and have a response of "Yes" to the last marketing campaign.

In [134]:
df_1000 = data_mca[(data_mca['total_claim_amount'] > 1000) & (data_mca['response'] == 'Yes')]
print(df_1000)

      customer       state  customer_lifetime_value response  coverage  \
189    OK31456  California             11009.130490      Yes   Premium   
236    YJ16163      Oregon             11009.130490      Yes   Premium   
419    GW43195      Oregon             25807.063000      Yes  Extended   
442    IP94270     Arizona             13736.132500      Yes   Premium   
587    FJ28407  California              5619.689084      Yes   Premium   
654    VV47792  California             25807.063000      Yes  Extended   
680    QL45827  Washington             25807.063000      Yes  Extended   
815    MA15172  California             10179.717040      Yes   Premium   
1276   WA45944     Arizona             25807.063000      Yes  Extended   
1422   YK65946      Oregon              8427.172486      Yes  Extended   
1527   TU53781      Oregon              8427.172486      Yes  Extended   
1570   VL84149      Oregon             25807.063000      Yes  Extended   
1629   UH35128      Oregon            
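
As a side note on the filter itself, each condition needs its own parentheses because `&` binds more tightly than `>` and `==`. A small sketch on made-up rows, also showing an equivalent `DataFrame.query` form (the customer IDs and amounts are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values); customer D4 sits exactly on the threshold.
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4"],
    "total_claim_amount": [1200.0, 800.0, 1500.0, 1000.0],
    "response": ["Yes", "Yes", "No", "Yes"],
})

# Boolean mask: both conditions must hold for a row to be kept.
mask = (toy["total_claim_amount"] > 1000) & (toy["response"] == "Yes")
high_claim_yes = toy[mask].copy()  # .copy() gives an independent DataFrame

# The same filter written with DataFrame.query.
high_claim_yes_q = toy.query("total_claim_amount > 1000 and response == 'Yes'")

print(high_claim_yes)
```

Only the row with a claim strictly above 1000 and a "Yes" response survives; the row at exactly 1000 is excluded by the strict comparison.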

2. Using the original DataFrame, analyze the average total_claim_amount by each policy type and gender for customers who have responded "Yes" to the last marketing campaign. Write your conclusions.

In [135]:
responded_yes = data_mca[data_mca['response'] == 'Yes']

average_claims = responded_yes.groupby(['policy_type', 'gender'])['total_claim_amount'].mean().reset_index()

average_claims

      policy_type gender  total_claim_amount
0  Corporate Auto      F          431.835494
1  Corporate Auto      M          410.032861
2   Personal Auto      F          453.143064
3   Personal Auto      M          456.083128
4    Special Auto      F          460.360832
5    Special Auto      M          414.799291
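
To compare genders within each policy type, it can help to put them side by side. A sketch of one way to do that, unstacking the gender level into columns on toy data (all values below are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values); the last row has response "No" and is filtered out.
toy = pd.DataFrame({
    "policy_type": ["Personal Auto", "Personal Auto", "Corporate Auto", "Corporate Auto", "Personal Auto"],
    "gender": ["F", "M", "F", "M", "F"],
    "response": ["Yes", "Yes", "Yes", "Yes", "No"],
    "total_claim_amount": [400.0, 500.0, 300.0, 200.0, 900.0],
})

yes = toy[toy["response"] == "Yes"]

# One row per policy type, one column per gender.
wide = (
    yes.groupby(["policy_type", "gender"])["total_claim_amount"]
    .mean()
    .unstack("gender")
)

# Difference between the average claim of female and male customers.
wide["F_minus_M"] = wide["F"] - wide["M"]

print(wide)
```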


3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers.

In [136]:
customer_count_by_state = data_mca.groupby('state')['customer'].count().reset_index()

states_with_more_than_500_customers = customer_count_by_state[customer_count_by_state['customer'] > 500]
states_with_more_than_500_customers

        state  customer
0     Arizona      1706
1  California      3132
2      Nevada       895
3      Oregon      2599
4  Washington       802
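
Two related variations, sketched on toy data: counting distinct customers with `nunique()` (which gives the same result as `count()` only when each customer appears once), and using `groupby(...).filter(...)` to keep the original rows of the qualifying states rather than just their counts. The data and the threshold of 1 are hypothetical stand-ins for the real table and the 500-customer cutoff:

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "customer": ["A1", "B2", "C3", "D4", "E5", "F6"],
    "state": ["Oregon", "Oregon", "Oregon", "Nevada", "Nevada", "Arizona"],
})

threshold = 1  # stand-in for the 500-customer cutoff

# Number of distinct customers per state.
counts = toy.groupby("state")["customer"].nunique()
busy_states = counts[counts > threshold]

# Keep every original row that belongs to a state above the threshold.
rows_in_busy_states = toy.groupby("state").filter(
    lambda g: g["customer"].nunique() > threshold
)

print(busy_states)
print(rows_in_busy_states)
```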


4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions.

In [137]:

clv_stats = data_mca.groupby(['education', 'gender'])['customer_lifetime_value'].agg(['max', 'min', 'median']).reset_index()

print(clv_stats)

              education gender          max          min       median
0              Bachelor      F  73225.95652  1904.000852  5678.050167
1              Bachelor      M  67907.27050  1898.007675  5555.830205
2               College      F  61850.18803  1898.683686  5621.789489
3               College      M  61134.68307  1918.119700  5989.773931
4                Doctor      F  44856.11397  2395.570000  5332.462694
5                Doctor      M  32677.34284  2267.604038  5620.587184
6  High School or Below      F  55277.44589  2144.921535  6044.020038
7  High School or Below      M  83325.38119  1940.981221  6176.700392
8                Master      F  51016.06704  2417.777032  5801.128252
9                Master      M  50568.25912  2272.307310  5617.956917
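
If the summary is going to be extended, for example with the spread between the largest and smallest value, named aggregation gives the result columns explicit names. A sketch on toy data (the values and the `clv_` column names are illustrative assumptions):

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "education": ["Bachelor", "Bachelor", "Bachelor", "Master", "Master"],
    "gender": ["F", "F", "F", "M", "M"],
    "customer_lifetime_value": [2000.0, 5000.0, 9000.0, 3000.0, 7000.0],
})

# Named aggregation: keyword = output column name, value = aggregation to apply.
stats = (
    toy.groupby(["education", "gender"])["customer_lifetime_value"]
    .agg(clv_max="max", clv_min="min", clv_median="median")
    .reset_index()
)

# Spread between the largest and smallest value in each group.
stats["clv_range"] = stats["clv_max"] - stats["clv_min"]

print(stats)
```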


## Bonus

5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows.
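
A possible starting point, sketched on toy data rather than the real table. It assumes that the month comes from `effective_to_date`, that dates are written as month/day/two-digit year, and that "policies sold" means the sum of `number_of_policies`; each of these is an assumption to check against the actual data:

```python
import pandas as pd

# Toy data (hypothetical values and date format).
toy = pd.DataFrame({
    "state": ["Oregon", "Oregon", "Nevada", "Nevada", "Oregon"],
    "effective_to_date": ["1/15/11", "2/3/11", "1/20/11", "1/28/11", "2/19/11"],
    "number_of_policies": [1, 2, 3, 1, 2],
})

# Extract the month number from the date column.
toy["month"] = pd.to_datetime(toy["effective_to_date"], format="%m/%d/%y").dt.month

# States as rows, months as columns; combinations with no data become 0.
policies_by_state_month = toy.pivot_table(
    index="state",
    columns="month",
    values="number_of_policies",
    aggfunc="sum",
    fill_value=0,
)

print(policies_by_state_month)
```

If "policies sold" is instead meant as the number of rows, `aggfunc="count"` would be the matching choice.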

6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.

*Hint:*
- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*
- *Next, you will select the top 3 states with the highest number of policies sold.*
- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*
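
The three steps in the hint can be sketched on toy data as follows. The table already has a `month` column here, and "policies sold" is again taken to be the sum of `number_of_policies`; both are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "state": ["California", "California", "Oregon", "Oregon",
              "Arizona", "Nevada", "Washington", "Arizona"],
    "month": [1, 2, 1, 2, 1, 1, 2, 2],
    "number_of_policies": [5, 4, 3, 3, 2, 1, 1, 2],
})

# Step 1: policies per state and month.
by_state_month = toy.groupby(["state", "month"], as_index=False)["number_of_policies"].sum()

# Step 2: the three states with the highest overall total.
top_states = (
    by_state_month.groupby("state")["number_of_policies"].sum().nlargest(3).index
)

# Step 3: keep only the monthly rows of those states.
top3 = (
    by_state_month[by_state_month["state"].isin(top_states)]
    .sort_values(["state", "month"])
    .reset_index(drop=True)
)

print(top3)
```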

7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.

*Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded "Yes") by marketing channel.*
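
One way to read this hint, sketched on toy data: build a wide table of response shares per channel, then use `melt` to unpivot it into one row per channel and response. The column name `sales_channel` matches the dataset, but the channel labels and responses below are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "sales_channel": ["Agent", "Agent", "Agent", "Web", "Web",
                      "Branch", "Branch", "Branch", "Branch"],
    "response": ["Yes", "No", "No", "Yes", "Yes", "No", "No", "No", "Yes"],
})

# Wide table: share of each response within each channel (each row sums to 1).
rates_wide = pd.crosstab(
    toy["sales_channel"], toy["response"], normalize="index"
).reset_index()

# Unpivot: one row per (channel, response) pair.
rates_long = rates_wide.melt(
    id_vars="sales_channel", var_name="response", value_name="rate"
)

# Response rate = share of "Yes" answers per channel.
yes_rate = rates_long[rates_long["response"] == "Yes"].set_index("sales_channel")["rate"]

print(yes_rate)
```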

External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9

In [138]:
# your code goes here