# <span style="color:green">Comparing regression models | Ainara Guerra</span> 

<span style="color:rgb(255, 0, 255)">**💗Steps in pink were practiced in previous labs of this unit. They have been revised using the TAs' lab feedback and ideas from other students' exercises.**</span>

<span style="color:green"> **🟩To review only the exercise for this lab, jump to the green headings.**</span>

### <span style="color:rgb(255, 0, 255)">--- Import the necessary libraries</span>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import matplotlib.ticker as mk
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import os #we will use the function listdir to list files in a folder
import math #to apply absolute value

### <span style="color:rgb(255, 0, 255)">--- Load the database</span>

In [2]:
data = pd.read_csv('we_fn_use_c_marketing_customer_value_analysis.csv')
data.head()

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size
0,BU79786,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,69,32,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,QZ44356,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,94,13,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,AI49188,Nevada,12887.43165,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,108,18,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,WW63253,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,106,18,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,HB64268,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,73,12,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


### <span style="color:rgb(255, 0, 255)">--- Let's look at its main features (head, shape, info).</span>

In [3]:
data.shape

(9134, 24)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer                       9134 non-null   object 
 1   State                          9134 non-null   object 
 2   Customer Lifetime Value        9134 non-null   float64
 3   Response                       9134 non-null   object 
 4   Coverage                       9134 non-null   object 
 5   Education                      9134 non-null   object 
 6   Effective To Date              9134 non-null   object 
 7   EmploymentStatus               9134 non-null   object 
 8   Gender                         9134 non-null   object 
 9   Income                         9134 non-null   int64  
 10  Location Code                  9134 non-null   object 
 11  Marital Status                 9134 non-null   object 
 12  Monthly Premium Auto           9134 non-null   i

In [5]:
data.describe()

Unnamed: 0,Customer Lifetime Value,Income,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Total Claim Amount
count,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0,9134.0
mean,8004.940475,37657.380009,93.219291,15.097,48.064594,0.384388,2.96617,434.088794
std,6870.967608,30379.904734,34.407967,10.073257,27.905991,0.910384,2.390182,290.500092
min,1898.007675,0.0,61.0,0.0,0.0,0.0,1.0,0.099007
25%,3994.251794,0.0,68.0,6.0,24.0,0.0,1.0,272.258244
50%,5780.182197,33889.5,83.0,14.0,48.0,0.0,2.0,383.945434
75%,8962.167041,62320.0,109.0,23.0,71.0,0.0,4.0,547.514839
max,83325.38119,99981.0,298.0,35.0,99.0,5.0,9.0,2893.239678


In [6]:
# Let's first check whether the dataset contains any fully duplicated rows
duplicate_rows = data[data.duplicated()]
duplicate_rows

Unnamed: 0,Customer,State,Customer Lifetime Value,Response,Coverage,Education,Effective To Date,EmploymentStatus,Gender,Income,Location Code,Marital Status,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy Type,Policy,Renew Offer Type,Sales Channel,Total Claim Amount,Vehicle Class,Vehicle Size


In [7]:
data.isna().sum()

Customer                         0
State                            0
Customer Lifetime Value          0
Response                         0
Coverage                         0
Education                        0
Effective To Date                0
EmploymentStatus                 0
Gender                           0
Income                           0
Location Code                    0
Marital Status                   0
Monthly Premium Auto             0
Months Since Last Claim          0
Months Since Policy Inception    0
Number of Open Complaints        0
Number of Policies               0
Policy Type                      0
Policy                           0
Renew Offer Type                 0
Sales Channel                    0
Total Claim Amount               0
Vehicle Class                    0
Vehicle Size                     0
dtype: int64

### <span style="color:rgb(255, 0, 255)">--- Clean up the column names</span>

In [8]:
data.columns

Index(['Customer', 'State', 'Customer Lifetime Value', 'Response', 'Coverage',
       'Education', 'Effective To Date', 'EmploymentStatus', 'Gender',
       'Income', 'Location Code', 'Marital Status', 'Monthly Premium Auto',
       'Months Since Last Claim', 'Months Since Policy Inception',
       'Number of Open Complaints', 'Number of Policies', 'Policy Type',
       'Policy', 'Renew Offer Type', 'Sales Channel', 'Total Claim Amount',
       'Vehicle Class', 'Vehicle Size'],
      dtype='object')

In [9]:
cols = [col_name.lower().replace(' ', '_') for col_name in data]
data.columns = cols
data.columns

Index(['customer', 'state', 'customer_lifetime_value', 'response', 'coverage',
       'education', 'effective_to_date', 'employmentstatus', 'gender',
       'income', 'location_code', 'marital_status', 'monthly_premium_auto',
       'months_since_last_claim', 'months_since_policy_inception',
       'number_of_open_complaints', 'number_of_policies', 'policy_type',
       'policy', 'renew_offer_type', 'sales_channel', 'total_claim_amount',
       'vehicle_class', 'vehicle_size'],
      dtype='object')

### <span style="color:rgb(255, 0, 255)">--- Drop columns that we no longer need</span>

In [10]:
data = data.drop(['customer'], axis=1)
data.head()  # we don't need customer for the model because it is an ID

Unnamed: 0,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,income,location_code,marital_status,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size
0,Washington,2763.519279,No,Basic,Bachelor,2/24/11,Employed,F,56274,Suburban,Married,69,32,5,0,1,Corporate Auto,Corporate L3,Offer1,Agent,384.811147,Two-Door Car,Medsize
1,Arizona,6979.535903,No,Extended,Bachelor,1/31/11,Unemployed,F,0,Suburban,Single,94,13,42,0,8,Personal Auto,Personal L3,Offer3,Agent,1131.464935,Four-Door Car,Medsize
2,Nevada,12887.43165,No,Premium,Bachelor,2/19/11,Employed,F,48767,Suburban,Married,108,18,38,0,2,Personal Auto,Personal L3,Offer1,Agent,566.472247,Two-Door Car,Medsize
3,California,7645.861827,No,Basic,Bachelor,1/20/11,Unemployed,M,0,Suburban,Married,106,18,65,0,7,Corporate Auto,Corporate L2,Offer1,Call Center,529.881344,SUV,Medsize
4,Washington,2813.692575,No,Basic,Bachelor,2/3/11,Employed,M,43836,Rural,Single,73,12,44,0,1,Personal Auto,Personal L1,Offer1,Agent,138.130879,Four-Door Car,Medsize


### <span style="color:rgb(255, 0, 255)">--- Check for the format of date columns</span>

In [11]:
data["effective_to_date"] = pd.to_datetime(data["effective_to_date"], errors='coerce')

In [12]:
data["effective_to_date"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 9134 entries, 0 to 9133
Series name: effective_to_date
Non-Null Count  Dtype         
--------------  -----         
9134 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 71.5 KB
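
The dates in this column follow a month/day/two-digit-year pattern (for example `2/24/11`). As a minimal sketch, assuming every value uses that same pattern, the standard-library parser below shows how the format string reads such a value. `pd.to_datetime` accepts a `format` argument with the same directives, which makes the parsing explicit instead of inferred. The `raw_dates` list is just the first three values shown in the table above.

```python
from datetime import datetime

# First three values of effective_to_date as shown in data.head()
raw_dates = ["2/24/11", "1/31/11", "2/19/11"]

# %m = month, %d = day, %y = two-digit year
parsed = [datetime.strptime(value, "%m/%d/%y") for value in raw_dates]

for raw, dt in zip(raw_dates, parsed):
    print(raw, "->", dt.date())
```
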


### <span style="color:rgb(255, 0, 255)">--- Split the variables into numerical and categorical</span>

In [14]:
num = data.select_dtypes(include = np.number)
num = num.drop('total_claim_amount', axis=1)
date = data["effective_to_date"]
cat = data.select_dtypes(include='object')  # np.object was removed in NumPy 1.24
target = data['total_claim_amount']

In [15]:
num.head()

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies
0,2763.519279,56274,69,32,5,0,1
1,6979.535903,0,94,13,42,0,8
2,12887.43165,48767,108,18,38,0,2
3,7645.861827,0,106,18,65,0,7
4,2813.692575,43836,73,12,44,0,1


In [16]:
cat.head()

Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size
0,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize
1,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize
2,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize
3,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L2,Offer1,Call Center,SUV,Medsize
4,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L1,Offer1,Agent,Four-Door Car,Medsize


### <span style="color:rgb(255, 0, 255)"> --- Feature selection in categoricals</span>

<span style="color:rgb(255, 0, 255)"> The goal of feature selection is to improve model accuracy by reducing the number of irrelevant or redundant features that may introduce noise or bias in the model.</span>

In [17]:
#Inspired by Luis's Code
#Check unique values for each categorical value
for column in cat.columns:
    print('─' * 10)
    print("This feature ", '\033[1m' + column + '\033[0m' ," has ", cat[column].nunique(), " categories \n The single values are: ", cat[column].unique(),"\n" )
    print("Here the detail: \n" , cat[column].value_counts())
    print("\n\n")

──────────
This feature  state  has  5  categories 
 The single values are:  ['Washington' 'Arizona' 'Nevada' 'California' 'Oregon'] 

Here the detail: 
 California    3150
Oregon        2601
Arizona       1703
Nevada         882
Washington     798
Name: state, dtype: int64



──────────
This feature  response  has  2  categories 
 The single values are:  ['No' 'Yes'] 

Here the detail: 
 No     7826
Yes    1308
Name: response, dtype: int64



──────────
This feature  coverage  has  3  categories 
 The single values are:  ['Basic' 'Extended' 'Premium'] 

Here the detail: 
 Basic       5568
Extended    2742
Premium      824
Name: coverage, dtype: int64



──────────
This feature  education  has  5  categories 
 The single values are:  ['Bachelor' 'College' 'Master' 'High School or Below' 'Doctor'] 

Here the detail: 
 Bachelor                2748
College                 2681
High School or Below    2622
Master                   741
Doctor                 

In [18]:
#Clean the categorical variables and reduce the unique values

cat_temp= cat.copy() # make a copy in case we mess something up.

# Inspired by Luis's code:
# Grouping education into LOW, MEDIUM and HIGH
cat_temp['education_grouped'] = np.where(cat['education'] == 'High School or Below' ,"LOW",cat_temp['education'])
cat_temp['education_grouped'] = np.where(cat_temp['education_grouped'] == 'College' ,"MEDIUM",cat_temp['education_grouped'])
cat_temp['education_grouped'] = np.where((cat_temp['education_grouped'] == 'Bachelor') |(cat_temp['education_grouped'] == 'Master') |(cat_temp['education_grouped'] == 'Doctor') ,"HIGH",cat_temp['education_grouped'])

In [23]:
# Grouping employmentstatus into bigger groups: Employed, Unemployed and Other
cat_temp['employmentstatus_grouped'] = np.where((cat_temp['employmentstatus'] == 'Medical Leave')|(cat_temp['employmentstatus'] == 'Disabled')| (cat_temp['employmentstatus'] == 'Retired'),"Other",cat_temp['employmentstatus'])

In [20]:
# Grouping vehicle_class to reduce the number of unique values:
# Luxury (Luxury SUV, Luxury Car), Four-Door Car (includes SUV) and Two-Door Car (includes Sports Car)
cat_temp['vehicle_class_grouped'] = np.where(cat_temp['vehicle_class'] == 'SUV',"Four-Door Car",cat_temp['vehicle_class'])
cat_temp['vehicle_class_grouped'] = np.where((cat_temp['vehicle_class'] == 'Luxury SUV') | (cat_temp['vehicle_class'] == 'Luxury Car'),"Luxury",cat_temp['vehicle_class_grouped'])
cat_temp['vehicle_class_grouped'] = np.where(cat_temp['vehicle_class'] == 'Sports Car',"Two-Door Car",cat_temp['vehicle_class_grouped'])

In [21]:
# Another way of doing this, with a mapping dictionary:
policy_map = {
    'Personal L3': 'Personal',
    'Personal L2': 'Personal',
    'Personal L1': 'Personal',
    'Corporate L3': 'Corporate',
    'Corporate L2': 'Corporate',
    'Corporate L1': 'Corporate',
    'Special L2': 'Special',
    'Special L3': 'Special',
    'Special L1': 'Special'
}

# Apply the mapping to create a new column named "policy_grouped"
cat_temp['policy_grouped'] = cat_temp['policy'].apply(lambda x: policy_map[x])
cat_temp.head()

Unnamed: 0,state,response,coverage,education,employmentstatus,gender,location_code,marital_status,policy_type,policy,renew_offer_type,sales_channel,vehicle_class,vehicle_size,education_grouped,employmentstatus_grouped,vehicle_class_grouped,policy_grouped
0,Washington,No,Basic,Bachelor,Employed,F,Suburban,Married,Corporate Auto,Corporate L3,Offer1,Agent,Two-Door Car,Medsize,HIGH,Employed,Two-Door Car,Corporate
1,Arizona,No,Extended,Bachelor,Unemployed,F,Suburban,Single,Personal Auto,Personal L3,Offer3,Agent,Four-Door Car,Medsize,HIGH,Unemployed,Four-Door Car,Personal
2,Nevada,No,Premium,Bachelor,Employed,F,Suburban,Married,Personal Auto,Personal L3,Offer1,Agent,Two-Door Car,Medsize,HIGH,Employed,Two-Door Car,Personal
3,California,No,Basic,Bachelor,Unemployed,M,Suburban,Married,Corporate Auto,Corporate L2,Offer1,Call Center,SUV,Medsize,HIGH,Unemployed,Four-Door Car,Corporate
4,Washington,No,Basic,Bachelor,Employed,M,Rural,Single,Personal Auto,Personal L1,Offer1,Agent,Four-Door Car,Medsize,HIGH,Employed,Four-Door Car,Personal
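
Since every policy label in the mapping has the form "group + level" (for example `Personal L3`), the group can also be derived from the label itself. The sketch below is only an illustration of that idea, assuming all labels keep this two-word shape; it checks the result against the same dictionary used above. In pandas the equivalent would be along the lines of `cat_temp['policy'].str.split().str[0]`.

```python
# Same mapping as in the cell above
policy_map = {
    'Personal L3': 'Personal', 'Personal L2': 'Personal', 'Personal L1': 'Personal',
    'Corporate L3': 'Corporate', 'Corporate L2': 'Corporate', 'Corporate L1': 'Corporate',
    'Special L3': 'Special', 'Special L2': 'Special', 'Special L1': 'Special',
}

# The group is the first word of each label
derived = {label: label.split()[0] for label in policy_map}

# Both approaches agree for every label in the mapping
print(derived == policy_map)
```
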


In [24]:
cat_final=cat_temp.drop(columns=['education', 'employmentstatus', 'policy', 'policy_type','vehicle_class'])
cat_final.head()

Unnamed: 0,state,response,coverage,gender,location_code,marital_status,renew_offer_type,sales_channel,vehicle_size,education_grouped,employmentstatus_grouped,vehicle_class_grouped,policy_grouped
0,Washington,No,Basic,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Corporate
1,Arizona,No,Extended,F,Suburban,Single,Offer3,Agent,Medsize,HIGH,Unemployed,Four-Door Car,Personal
2,Nevada,No,Premium,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Personal
3,California,No,Basic,M,Suburban,Married,Offer1,Call Center,Medsize,HIGH,Unemployed,Four-Door Car,Corporate
4,Washington,No,Basic,M,Rural,Single,Offer1,Agent,Medsize,HIGH,Employed,Four-Door Car,Personal


In [25]:
df = pd.concat([cat_final,num, target], axis=1)
df.head()

Unnamed: 0,state,response,coverage,gender,location_code,marital_status,renew_offer_type,sales_channel,vehicle_size,education_grouped,employmentstatus_grouped,vehicle_class_grouped,policy_grouped,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception,number_of_open_complaints,number_of_policies,total_claim_amount
0,Washington,No,Basic,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Corporate,2763.519279,56274,69,32,5,0,1,384.811147
1,Arizona,No,Extended,F,Suburban,Single,Offer3,Agent,Medsize,HIGH,Unemployed,Four-Door Car,Personal,6979.535903,0,94,13,42,0,8,1131.464935
2,Nevada,No,Premium,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Personal,12887.43165,48767,108,18,38,0,2,566.472247
3,California,No,Basic,M,Suburban,Married,Offer1,Call Center,Medsize,HIGH,Unemployed,Four-Door Car,Corporate,7645.861827,0,106,18,65,0,7,529.881344
4,Washington,No,Basic,M,Rural,Single,Offer1,Agent,Medsize,HIGH,Employed,Four-Door Car,Personal,2813.692575,43836,73,12,44,0,1,138.130879


In [26]:
df.isna().sum()
# we successfully concatenated everything (no missing values were introduced)

state                            0
response                         0
coverage                         0
gender                           0
location_code                    0
marital_status                   0
renew_offer_type                 0
sales_channel                    0
vehicle_size                     0
education_grouped                0
employmentstatus_grouped         0
vehicle_class_grouped            0
policy_grouped                   0
customer_lifetime_value          0
income                           0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
total_claim_amount               0
dtype: int64

### <span style="color:rgb(255, 0, 255)">--- Normalize the continuous variables</span>


In [27]:
num_df = df.select_dtypes(include = np.number)
num_df = num_df.drop(columns=['total_claim_amount']) #not include the target
cat_df = df.select_dtypes(include='object')  # np.object was removed in NumPy 1.24
target = df['total_claim_amount']

num_df_continous = num_df.drop(['number_of_open_complaints','number_of_policies'], axis=1)
num_df_continous.head()

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception
0,2763.519279,56274,69,32,5
1,6979.535903,0,94,13,42
2,12887.43165,48767,108,18,38
3,7645.861827,0,106,18,65
4,2813.692575,43836,73,12,44


In [28]:
# Scale the continuous numerical variables to the [0, 1] range with MinMaxScaler

transformer = MinMaxScaler().fit(num_df_continous)
num_minmax = transformer.transform(num_df_continous)
num_norm = pd.DataFrame(num_minmax,columns=num_df_continous.columns)
num_normalized = num_norm.copy()
num_normalized.head()

Unnamed: 0,customer_lifetime_value,income,monthly_premium_auto,months_since_last_claim,months_since_policy_inception
0,0.010629,0.562847,0.033755,0.914286,0.050505
1,0.062406,0.0,0.139241,0.371429,0.424242
2,0.13496,0.487763,0.198312,0.514286,0.383838
3,0.070589,0.0,0.189873,0.514286,0.656566
4,0.011245,0.438443,0.050633,0.342857,0.444444
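
To make the scaling step concrete, here is a small worked example of the min-max formula, x' = (x - min) / (max - min), applied by hand to the first row. The minimum and maximum values are copied from the `data.describe()` output above; the helper name `min_max_scale` is just for illustration.

```python
def min_max_scale(x, x_min, x_max):
    """Rescale x to the [0, 1] range given the column minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

# (min, max) per column, taken from data.describe()
ranges = {
    "customer_lifetime_value": (1898.007675, 83325.38119),
    "income": (0.0, 99981.0),
    "monthly_premium_auto": (61.0, 298.0),
}

# Values of the first row of the dataset
first_row = {
    "customer_lifetime_value": 2763.519279,
    "income": 56274,
    "monthly_premium_auto": 69,
}

scaled = {
    name: round(min_max_scale(first_row[name], *ranges[name]), 6)
    for name in ranges
}
print(scaled)
```

The printed values agree with the first row of the scaled table above (0.010629, 0.562847 and 0.033755).
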


### <span style="color:rgb(255, 0, 255)">--- Encoding the categorical variables</span>

One hot to state

One hot to marital status

One hot to policy (grouped)

One hot to renew offer

One hot to sales channel

One hot to vehicle class (grouped)

Ordinal vehicle size

Ordinal to coverage

Ordinal to employmentstatus

Ordinal to location code

In [30]:
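# For one hot we will use get_dummies
cat_df_encoded1 = cat_df.copy()
# dtype=int keeps the dummies numeric: recent pandas versions return booleans by default,
# and those would be dropped later by select_dtypes(include=np.number)
cat_df_encoded1 = pd.get_dummies(cat_df_encoded1[['state','marital_status','policy_grouped','renew_offer_type','sales_channel','vehicle_class_grouped']], dtype=int)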
cat_df_encoded1.head()

Unnamed: 0,state_Arizona,state_California,state_Nevada,state_Oregon,state_Washington,marital_status_Divorced,marital_status_Married,marital_status_Single,policy_grouped_Corporate,policy_grouped_Personal,policy_grouped_Special,renew_offer_type_Offer1,renew_offer_type_Offer2,renew_offer_type_Offer3,renew_offer_type_Offer4,sales_channel_Agent,sales_channel_Branch,sales_channel_Call Center,sales_channel_Web,vehicle_class_grouped_Four-Door Car,vehicle_class_grouped_Luxury,vehicle_class_grouped_Two-Door Car
0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1
1,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0
2,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1
3,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0
4,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,0


In [31]:
# For ordinal we will assign numbers
cat_df_encoded2 = cat_df.copy()
cat_df_encoded2 ['vehicle_size_enc'] = cat_df["vehicle_size"].map({"Small" : 0, "Medsize" : 1, "Large" : 2})
cat_df_encoded2 ['coverage_enc'] = cat_df["coverage"].map({"Basic" : 0, "Extended" : 1, "Premium" : 2})
cat_df_encoded2 ['employmentstatus_enc'] = cat_df["employmentstatus_grouped"].map({"Employed" : 0, "Unemployed" : 1, "Other": 2})
cat_df_encoded2 ['location_code_enc'] = cat_df["location_code"].map({"Rural" : 0, "Suburban" : 1, "Urban" : 2})
cat_df_encoded2.head()

Unnamed: 0,state,response,coverage,gender,location_code,marital_status,renew_offer_type,sales_channel,vehicle_size,education_grouped,employmentstatus_grouped,vehicle_class_grouped,policy_grouped,vehicle_size_enc,coverage_enc,employmentstatus_enc,location_code_enc
0,Washington,No,Basic,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Corporate,1,0,0,1
1,Arizona,No,Extended,F,Suburban,Single,Offer3,Agent,Medsize,HIGH,Unemployed,Four-Door Car,Personal,1,1,1,1
2,Nevada,No,Premium,F,Suburban,Married,Offer1,Agent,Medsize,HIGH,Employed,Two-Door Car,Personal,1,2,0,1
3,California,No,Basic,M,Suburban,Married,Offer1,Call Center,Medsize,HIGH,Unemployed,Four-Door Car,Corporate,1,0,1,1
4,Washington,No,Basic,M,Rural,Single,Offer1,Agent,Medsize,HIGH,Employed,Four-Door Car,Personal,1,0,0,0


In [32]:
cat_df_enc_final = pd.concat([cat_df_encoded1, cat_df_encoded2], axis=1)
cat_df_enc_final = cat_df_enc_final.select_dtypes(include = np.number)
cat_df_enc_final.head()

Unnamed: 0,state_Arizona,state_California,state_Nevada,state_Oregon,state_Washington,marital_status_Divorced,marital_status_Married,marital_status_Single,policy_grouped_Corporate,policy_grouped_Personal,policy_grouped_Special,renew_offer_type_Offer1,renew_offer_type_Offer2,renew_offer_type_Offer3,renew_offer_type_Offer4,sales_channel_Agent,sales_channel_Branch,sales_channel_Call Center,sales_channel_Web,vehicle_class_grouped_Four-Door Car,vehicle_class_grouped_Luxury,vehicle_class_grouped_Two-Door Car,vehicle_size_enc,coverage_enc,employmentstatus_enc,location_code_enc
0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,1
1,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,1,1
2,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,1,2,0,1
3,0,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1,1
4,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0
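
The difference between the two encodings can be seen on a handful of values without any library. This is only an illustrative sketch using the `coverage` and `state` values of the first five rows: ordinal encoding keeps one column and relies on an order that I chose, while one-hot encoding creates one 0/1 column per category and assumes no order.

```python
coverage = ["Basic", "Extended", "Premium", "Basic", "Basic"]
states = ["Washington", "Arizona", "Nevada", "California", "Washington"]

# Ordinal: a single column, the order is defined by hand
coverage_order = {"Basic": 0, "Extended": 1, "Premium": 2}
coverage_enc = [coverage_order[value] for value in coverage]

# One hot: one 0/1 column per category present in the sample
categories = sorted(set(states))
one_hot = [[int(state == category) for category in categories] for state in states]

print(coverage_enc)
print(categories)
for row in one_hot:
    print(row)
```
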


***

# <span style="color:green"> Instructions </span> 

1. In this final lab, we will model our data. Import sklearn train_test_split and separate the data.
2. Try a simple linear regression with all the data to see whether we are getting good results.
3. Great! Now define a function that takes a list of models and train (and tests) them so we can try a lot of them without repeating code.
4. Use the function to check LinearRegressor and KNeighborsRegressor.
5. You can check also the MLPRegressor for this task!
6. Check and discuss the results.

<span style="color:green"> Number 1 is in the libraries box at the top </span> 

### <span style="color:green"> 2. Try a simple linear regression with all the data to see whether we are getting good results.
 </span> 

In [36]:
X = pd.concat([num_normalized,cat_df_enc_final], axis = 1)
Y = df["total_claim_amount"]
#Separation between train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42) 
#Train model
model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression()

In [37]:
#Test model
predictions  = model.predict(X_test)
predictions.shape

(2741,)

In [38]:
# evaluating sklearn's LR model
r2 = r2_score(y_test, predictions)
MSE = mean_squared_error(y_test, predictions)
RMSE = np.sqrt(MSE)  # avoids squared=False, which is deprecated in recent scikit-learn versions
print("r2 = ", r2)
print("RMSE = ", RMSE)
print("MSE = ", MSE)

r2 =  0.6092500730615791
RMSE =  178.60179187074277
MSE =  31898.600059440116
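
As a reminder of what these three numbers mean, the sketch below computes them by hand on a tiny made-up example (the `y_true` and `y_pred` lists are invented, not taken from the dataset). It also checks that the RMSE printed above is simply the square root of the MSE printed above.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 - (sum of squared errors) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, only to illustrate the formulas
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

toy_mse = mse(y_true, y_pred)
toy_rmse = math.sqrt(toy_mse)
toy_r2 = r2(y_true, y_pred)
print("MSE =", toy_mse, "RMSE =", toy_rmse, "r2 =", toy_r2)

# RMSE and MSE from the notebook output above
rmse_from_mse = math.sqrt(31898.600059440116)
print(math.isclose(rmse_from_mse, 178.60179187074277, rel_tol=1e-9))
```

RMSE is in the same units as `total_claim_amount`, which makes it easier to read than MSE.
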


### <span style="color:green"> 3. Great! Now define a function that takes a list of models and train (and tests) them so we can try a lot of them without repeating code.
 </span> 

In [39]:
# With a lot of internet search: 

from sklearn.model_selection import train_test_split

def train_test_models(models, X, y, test_size=0.3, random_state=42):
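    """
    Train each model on the same train/test split and score it on the test set.

    Parameters:
        models (list): A list of models to train and test.
        X (pandas.DataFrame): The input features.
        y (pandas.Series): The target variable.
        test_size (float): The proportion of the data to use for testing.
        random_state (int): The random seed for reproducibility.

    Returns:
        dict: The class name of each model mapped to its R² score on the test set.
    """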

    # We split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Train and test each model
    results = {}
    for model in models:
        model_name = type(model).__name__
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        results[model_name] = score

    return results
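
A possible variant of this function, sketched here under a few assumptions, reports RMSE and MAE next to R² so the models can be compared on more than one metric. The name `evaluate_models` and the synthetic data from `make_regression` are only there to make the example run on its own; they are not part of the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def evaluate_models(models, X, y, test_size=0.3, random_state=42):
    """Fit each model on the same split and return R2, RMSE and MAE per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    report = {}
    for model in models:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        report[type(model).__name__] = {
            "r2": float(r2_score(y_test, pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "mae": float(mean_absolute_error(y_test, pred)),
        }
    return report


# Synthetic data, used only so that the sketch is self-contained
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
report = evaluate_models([LinearRegression(), KNeighborsRegressor()], X_demo, y_demo)
for name, metrics in report.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Because the results are keyed by class name, two models of the same class (for example two `KNeighborsRegressor` with different `n_neighbors`) would overwrite each other; a list of `(name, model)` pairs would avoid that.
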


### <span style="color:green"> 4. Use the function to check LinearRegressor and KNeighborsRegressor.
 </span> 

### <span style="color:green">  5. You can check also the MLPRegressor for this task!</span>


In [48]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

models = [LinearRegression(), KNeighborsRegressor(), MLPRegressor()]
results = train_test_models(models, X, Y)
print(results)

{'LinearRegression': 0.6092500730615791, 'KNeighborsRegressor': 0.5250993059574394, 'MLPRegressor': 0.7002193474308254}


### <span style="color:green"> 6. Check and discuss the results.</span>

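<span style="color:green"> None of the models performs well, but they are not equal. In this run MLPRegressor scores highest (R² ≈ 0.70), linear regression comes second (R² ≈ 0.61) and KNeighborsRegressor is last (R² ≈ 0.53). Every model has its own needs, and the data as prepared here (without much cleaning, such as removing outliers) seems to suit linear regression better than KNN. Note that MLPRegressor was run without a fixed random_state, so its score can change between runs.</span>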
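
One cleaning step mentioned above is removing outliers. A common rule of thumb, which was not applied in this lab, is to keep only the values within 1.5 times the interquartile range (IQR) of the quartiles. The sketch below shows that rule on invented numbers; the `claims` list and the helper `iqr_bounds` are hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (low, high) limits of the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up claim amounts with one very large value
claims = [120.0, 250.0, 300.0, 380.0, 410.0, 520.0, 2900.0]

low, high = iqr_bounds(claims)
kept = [value for value in claims if low <= value <= high]

print("limits:", low, high)
print("kept:", kept)
```
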