# Lab Customer Analysis Round 5 & 6

## **Processing Data**

1. One Hot/Label Encoding (categorical).
2. Concat DataFrames
3. X-y split. 
4. Feature scaling - Normalize (numerical). 
5. Train-test split.
6. Apply linear regression.
7. Model Validation
    - R2.
    - MSE.
    - RMSE.
    - MAE.
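The steps above can be sketched end to end on a small synthetic dataset. This is only an illustration under assumptions: the `toy` frame, its coefficients, and the noise level are made up and do not come from the marketing data. Note that the split is done before the scaler is fitted, so the scaler only sees training rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the marketing dataset.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "income": rng.integers(0, 100_000, size=n),
    "monthly_premium_auto": rng.integers(60, 300, size=n),
    "coverage": rng.choice(["Basic", "Extended", "Premium"], size=n),
})
toy["total_claim_amount"] = (
    4.0 * toy["monthly_premium_auto"]
    - 0.002 * toy["income"]
    + rng.normal(0, 20, size=n)
)

# Steps 1-2: encode the categorical column; numeric columns are kept alongside it.
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

# Step 3: X-y split.
X = encoded.drop(columns="total_claim_amount")
y = encoded["total_claim_amount"]

# Step 5 comes before step 4 here: split first, then fit the scaler on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: fit the linear regression on the scaled training set.
model = LinearRegression().fit(X_train_scaled, y_train)

# Step 7: validate on the held-out test set.
pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, pred)
metrics = {
    "r2": r2_score(y_test, pred),
    "mse": mse,
    "rmse": float(np.sqrt(mse)),
    "mae": mean_absolute_error(y_test, pred),
}
print(metrics)
```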

In [832]:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, Normalizer, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import statsmodels.api as sm
from statsmodels.formula.api import ols


In [833]:
df = pd.read_csv('/Users/dooinnkim/ironhack_da_may_2023/lab-customer-analysis-round-5/files_for_lab/csv_files/marketing_customer_analysis.csv')

In [834]:
df.head()

| | Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
| 1 | QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
| 2 | AI49188 | Nevada | 12887.43165 | No | Premium | Bachelor | 2/19/11 | Employed | F | 48767 | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
| 3 | WW63253 | California | 7645.861827 | No | Basic | Bachelor | 1/20/11 | Unemployed | M | 0 | ... | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize |
| 4 | HB64268 | Washington | 2813.692575 | No | Basic | Bachelor | 2/3/11 | Employed | M | 43836 | ... | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize |


In [835]:
df.shape

(9134, 24)

In [836]:
df.isna().sum()

Customer                         0
State                            0
Customer Lifetime Value          0
Response                         0
Coverage                         0
Education                        0
Effective To Date                0
EmploymentStatus                 0
Gender                           0
Income                           0
Location Code                    0
Marital Status                   0
Monthly Premium Auto             0
Months Since Last Claim          0
Months Since Policy Inception    0
Number of Open Complaints        0
Number of Policies               0
Policy Type                      0
Policy                           0
Renew Offer Type                 0
Sales Channel                    0
Total Claim Amount               0
Vehicle Class                    0
Vehicle Size                     0
dtype: int64

In [837]:
df.columns = [column.lower().replace(' ', '_') for column in df.columns]

In [838]:
df.dtypes

customer                          object
state                             object
customer_lifetime_value          float64
response                          object
coverage                          object
education                         object
effective_to_date                 object
employmentstatus                  object
gender                            object
income                             int64
location_code                     object
marital_status                    object
monthly_premium_auto               int64
months_since_last_claim            int64
months_since_policy_inception      int64
number_of_open_complaints          int64
number_of_policies                 int64
policy_type                       object
policy                            object
renew_offer_type                  object
sales_channel                     object
total_claim_amount               float64
vehicle_class                     object
vehicle_size                      object
dtype: object

In [839]:
df['effective_to_date'] = pd.to_datetime(df['effective_to_date'])
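The date strings shown in the head of the table look like month/day/two-digit-year (for example `2/24/11`). As a small sketch, assuming that format holds for the whole column, passing an explicit `format` removes any ambiguity about day/month order:

```python
import pandas as pd

# Sample values copied from the table above; the format string is an assumption about the column.
dates = pd.Series(["2/24/11", "1/31/11", "2/3/11"])
parsed = pd.to_datetime(dates, format="%m/%d/%y")

print(parsed.dt.month.tolist())
print(parsed.dt.year.unique().tolist())
```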

In [840]:
df.dtypes

customer                                 object
state                                    object
customer_lifetime_value                 float64
response                                 object
coverage                                 object
education                                object
effective_to_date                datetime64[ns]
employmentstatus                         object
gender                                   object
income                                    int64
location_code                            object
marital_status                           object
monthly_premium_auto                      int64
months_since_last_claim                   int64
months_since_policy_inception             int64
number_of_open_complaints                 int64
number_of_policies                        int64
policy_type                              object
policy                                   object
renew_offer_type                         object
sales_channel                            object
total_claim_amount                      float64
vehicle_class                            object
vehicle_size                             object
dtype: object

## 1. One Hot/Label Encoding (categorical).

In [841]:
df_copy = df.copy()
df_dummy = df.copy()

In [842]:
list(df_copy.columns)

['customer',
 'state',
 'customer_lifetime_value',
 'response',
 'coverage',
 'education',
 'effective_to_date',
 'employmentstatus',
 'gender',
 'income',
 'location_code',
 'marital_status',
 'monthly_premium_auto',
 'months_since_last_claim',
 'months_since_policy_inception',
 'number_of_open_complaints',
 'number_of_policies',
 'policy_type',
 'policy',
 'renew_offer_type',
 'sales_channel',
 'total_claim_amount',
 'vehicle_class',
 'vehicle_size']

In [843]:
# exclude customer and effective_to_date
df_copy_cols = ['state',
 'customer_lifetime_value',
 'response',
 'coverage',
 'education',
 'employmentstatus',
 'gender',
 'income',
 'location_code',
 'marital_status',
 'monthly_premium_auto',
 'months_since_last_claim',
 'months_since_policy_inception',
 'number_of_open_complaints',
 'number_of_policies',
 'policy_type',
 'policy',
 'renew_offer_type',
 'sales_channel',
 'total_claim_amount',
 'vehicle_class',
 'vehicle_size']

df_copy=df_copy[df_copy_cols]

In [844]:
df_copy.head()

| | state | customer_lifetime_value | response | coverage | education | employmentstatus | gender | income | location_code | marital_status | ... | months_since_policy_inception | number_of_open_complaints | number_of_policies | policy_type | policy | renew_offer_type | sales_channel | total_claim_amount | vehicle_class | vehicle_size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Washington | 2763.519279 | No | Basic | Bachelor | Employed | F | 56274 | Suburban | Married | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize |
| 1 | Arizona | 6979.535903 | No | Extended | Bachelor | Unemployed | F | 0 | Suburban | Single | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize |
| 2 | Nevada | 12887.43165 | No | Premium | Bachelor | Employed | F | 48767 | Suburban | Married | ... | 38 | 0 | 2 | Personal Auto | Personal L3 | Offer1 | Agent | 566.472247 | Two-Door Car | Medsize |
| 3 | California | 7645.861827 | No | Basic | Bachelor | Unemployed | M | 0 | Suburban | Married | ... | 65 | 0 | 7 | Corporate Auto | Corporate L2 | Offer1 | Call Center | 529.881344 | SUV | Medsize |
| 4 | Washington | 2813.692575 | No | Basic | Bachelor | Employed | M | 43836 | Rural | Single | ... | 44 | 0 | 1 | Personal Auto | Personal L1 | Offer1 | Agent | 138.130879 | Four-Door Car | Medsize |


In [845]:
list(df_copy.select_dtypes("object"))

['state',
 'response',
 'coverage',
 'education',
 'employmentstatus',
 'gender',
 'location_code',
 'marital_status',
 'policy_type',
 'policy',
 'renew_offer_type',
 'sales_channel',
 'vehicle_class',
 'vehicle_size']

In [846]:
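encoder = OneHotEncoder(drop='first')

categorical_var = list(df_copy.select_dtypes("object"))

encoder.fit(df_copy[categorical_var])

# Column names for the encoded frame: every category except the first (dropped) one of each variable.
cols = []
for i in range(len(categorical_var)):
    cols += list(encoder.categories_[i][1:])

# toarray() returns a plain ndarray; todense() returns the discouraged np.matrix type.
df_encoded = pd.DataFrame(encoder.transform(df_copy[categorical_var]).toarray(), columns=cols, index=df_copy.index)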




In [847]:
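# Alternative encoding with get_dummies.
# Note: df_dummy still contains `customer` here, so every customer ID becomes its own
# dummy column (see the customer_* columns in the X_train output further down).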

df_dummy = pd.get_dummies(df_dummy, drop_first=True)
df_dummy = df_dummy.drop('effective_to_date', axis=1)
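`pd.get_dummies` encodes every object column it is given, including identifier columns. The sketch below uses a hypothetical four-row frame (the IDs are borrowed from the table head, the rest is illustrative) to show how an ID column inflates the feature count, and how dropping it first avoids that:

```python
import pandas as pd

# Hypothetical miniature of the dataset: one ID column, one categorical, one numeric.
toy = pd.DataFrame({
    "customer": ["BU79786", "QZ44356", "AI49188", "WW63253"],
    "gender": ["F", "F", "F", "M"],
    "income": [56274, 0, 48767, 0],
})

# Encoding with the ID column present: one dummy per distinct ID (minus the dropped first).
with_id = pd.get_dummies(toy, drop_first=True)

# Dropping the ID column before encoding keeps only meaningful features.
without_id = pd.get_dummies(toy.drop(columns="customer"), drop_first=True)

print(with_id.shape[1], without_id.shape[1])
print(list(without_id.columns))
```

With 9134 distinct customers, the same effect would add thousands of columns that carry no generalizable signal.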

## 2. Concat DataFrames

In [848]:
df_copy.drop(categorical_var, axis=1, inplace=True)


df_copy = pd.concat([df_copy, df_encoded], axis=1)

## 3. X-y split. 

In [849]:
X = df_dummy.drop(['total_claim_amount'], axis=1)
y = df_dummy['total_claim_amount']

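# Alternative: uncomment to use the OneHotEncoder frame (df_copy) instead of the get_dummies frame.
# X = df_copy.drop(['total_claim_amount'], axis=1)
# y = df_copy['total_claim_amount']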

In [850]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [851]:
X_train

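| | customer_lifetime_value | income | monthly_premium_auto | months_since_last_claim | months_since_policy_inception | number_of_open_complaints | number_of_policies | customer_AA11235 | customer_AA16582 | customer_AA30683 | ... | sales_channel_Branch | sales_channel_Call Center | sales_channel_Web | vehicle_class_Luxury Car | vehicle_class_Luxury SUV | vehicle_class_SUV | vehicle_class_Sports Car | vehicle_class_Two-Door Car | vehicle_size_Medsize | vehicle_size_Small |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 434 | 5015.009472 | 48567 | 130 | 12 | 15 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4641 | 5149.301306 | 26877 | 131 | 5 | 2 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4952 | 4904.894731 | 12902 | 139 | 3 | 51 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1489 | 8510.525936 | 0 | 121 | 5 | 94 | 0 | 8 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 812 | 3278.531880 | 70247 | 83 | 13 | 19 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5734 | 7334.328083 | 87957 | 61 | 31 | 63 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5191 | 5498.940679 | 22520 | 73 | 17 | 64 | 0 | 3 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5390 | 8992.779137 | 0 | 129 | 13 | 4 | 0 | 7 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 860 | 14635.451580 | 0 | 139 | 5 | 56 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |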


## 4. Feature scaling - Normalize (numerical)

In [852]:
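# StandardScaler standardizes each column to zero mean and unit variance.
# (sklearn's Normalizer is a different tool: it rescales each row to unit norm.)
scaler = StandardScaler()
# Fit on the training set only, so no information from the test set leaks into the scaling.
scaler.fit(X_train)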

StandardScaler()

## 5. Train-test split.

In [853]:
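# The split itself was done in In [850]; here the scaler fitted on X_train is applied to both sets.
X_train_scaled = scaler.transform(X_train)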
X_test_scaled = scaler.transform(X_test)
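To check what the fit-on-train, transform-both pattern does, here is a small sketch on synthetic data (the array `X` and its distribution are assumptions for illustration). The training columns end up with mean 0 and standard deviation 1; the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features centered around 50 with spread 10.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(2), X_test_scaled.std(axis=0).round(2))
```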


## 6. Apply linear regression.

In [854]:
# Start the model object:
lm = LinearRegression()

# Fit the model object on the training set.
# Note: this uses the unscaled X_train; X_train_scaled from the previous step is not used.
lm.fit(X_train, y_train)

LinearRegression()
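For ordinary least squares with an intercept, standardizing the features changes the coefficients but should not change the predictions, as long as the feature matrix has full column rank. The sketch below checks this on synthetic data (all arrays and coefficients are made up). The guarantee may not carry over to a design with more columns than training rows, which the get_dummies variant appears to have because of the `customer_*` dummies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10_000.0])
y = X @ np.array([2.0, 0.05, 0.0003]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

# Model fitted on the raw features.
raw_pred = LinearRegression().fit(X_train, y_train).predict(X_test)

# Model fitted on standardized features (scaler learned from the training rows).
scaler = StandardScaler().fit(X_train)
scaled_model = LinearRegression().fit(scaler.transform(X_train), y_train)
scaled_pred = scaled_model.predict(scaler.transform(X_test))

print(np.allclose(raw_pred, scaled_pred, atol=1e-6))
```

Scaling still matters for regularized or distance-based models, and it makes coefficient magnitudes comparable across features.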

## 7. Model Validation

In [855]:
# Get predictions for the test set:
predictions = lm.predict(X_test)
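# Calculate your metrics:
# squared=False returns RMSE (squared=True returns MSE). The `squared` argument is deprecated
# since scikit-learn 1.4 in favor of root_mean_squared_error.
rmse = mean_squared_error(y_test, predictions, squared=False)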
mae = mean_absolute_error(y_test, predictions)
print("R2_score:", round(r2_score(y_test, predictions),2)) 
print("RMSE:", rmse)
print("MAE:", mae)

R2_score: 0.67
RMSE: 164.51668080262516
MAE: 124.34199099093314
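All four metrics listed for this step (R2, MSE, RMSE, MAE) can be computed without relying on the `squared` argument by taking the square root of the MSE directly. A minimal sketch on made-up values (`y_true` and `y_pred` are hypothetical, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions with errors of +10, -10, +20, -20.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = float(np.sqrt(mse))                 # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"R2: {r2:.2f}  MSE: {mse:.1f}  RMSE: {rmse:.3f}  MAE: {mae:.1f}")
```

RMSE penalizes large errors more than MAE does, so a wide gap between the two hints at a few large misses.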


## Which one is the better model? (Not sure yet how to interpret this)

### Get_dummies (considered both numerical & categorical)

- R2_score: 0.67
- RMSE: 164.51668080262516
- MAE: 124.34199099093314

### OneHotEncoder (considered only categorical)

- R2_score: 0.77
- RMSE: 138.50093390700894
- MAE: 94.52272816801197
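One way to read these numbers, assuming both runs were evaluated on the same test split: a higher R2 and a lower RMSE and MAE indicate a better fit. The sketch below just tabulates the figures reported above and picks the better run per metric; the dictionary layout is an illustrative choice.

```python
# Metrics copied from the two runs reported above.
results = {
    "get_dummies": {"r2": 0.67, "rmse": 164.51668080262516, "mae": 124.34199099093314},
    "OneHotEncoder": {"r2": 0.77, "rmse": 138.50093390700894, "mae": 94.52272816801197},
}

# R2: higher is better. RMSE and MAE: lower is better.
winners = {
    "r2": max(results, key=lambda name: results[name]["r2"]),
    "rmse": min(results, key=lambda name: results[name]["rmse"]),
    "mae": min(results, key=lambda name: results[name]["mae"]),
}
print(winners)
```

On these numbers the OneHotEncoder run is better on all three metrics. A plausible reason, not verified here, is that the get_dummies run kept the `customer` ID column, so thousands of per-customer dummy columns were fed to the model.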