# Lab | Comparing regression models

For this lab, we will use the same dataset as in the previous labs. We recommend working in the same notebook, since you will reuse the variables you created in those labs.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

In [3]:
data = pd.read_csv('we_fn_use_c_marketing_customer_value_analysis.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9134 entries, 0 to 9133
Data columns (total 24 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer                       9134 non-null   object 
 1   State                          9134 non-null   object 
 2   Customer Lifetime Value        9134 non-null   float64
 3   Response                       9134 non-null   object 
 4   Coverage                       9134 non-null   object 
 5   Education                      9134 non-null   object 
 6   Effective To Date              9134 non-null   object 
 7   EmploymentStatus               9134 non-null   object 
 8   Gender                         9134 non-null   object 
 9   Income                         9134 non-null   int64  
 10  Location Code                  9134 non-null   object 
 11  Marital Status                 9134 non-null   object 
 12  Monthly Premium Auto           9134 non-null   i

In [4]:
data = data.set_index('Customer')

### 1. In this final lab, we will model our data. Import sklearn train_test_split and separate the data.


In [5]:
y = data['Total Claim Amount']
X = data.drop(['Total Claim Amount'],axis=1)

In [6]:
numericals = X.select_dtypes(np.number)

In [7]:
transformer = StandardScaler().fit(numericals)
x_standardized = transformer.transform(numericals)

In [8]:
categoricals = X.select_dtypes(exclude=np.number)

In [9]:
encoder = OneHotEncoder(handle_unknown='error', drop='first').fit(categoricals)
encoded = encoder.transform(categoricals).toarray()

In [10]:
X = np.concatenate((x_standardized, encoded), axis=1)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
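
The cells above fit the scaler and the encoder on the full dataset before splitting. One possible alternative, sketched below, is to split first and let a `Pipeline` fit the preprocessing on the training rows only, so no test-set statistics reach the model. The toy DataFrame and the names `toy`, `toy_y`, `preprocess` and `pipe` are made up for illustration and are not part of the lab's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the lab's DataFrame (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 200),
    "premium": rng.normal(90, 30, 200),
    "coverage": rng.choice(["Basic", "Extended", "Premium"], 200),
})
toy_y = 3.0 * toy["premium"] + rng.normal(0, 10, 200)

# Same numeric / categorical separation as in the lab
num_cols = toy.select_dtypes(np.number).columns.tolist()
cat_cols = toy.select_dtypes(exclude=np.number).columns.tolist()

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])

# Split first; the pipeline then fits the transformers on the training rows only
Xtr, Xte, ytr, yte = train_test_split(toy, toy_y, test_size=0.2, random_state=100)
pipe.fit(Xtr, ytr)
toy_r2 = pipe.score(Xte, yte)  # R² on the held-out rows
print(round(toy_r2, 3))
```

Here `handle_unknown="ignore"` is a design choice for the sketch: a category that appears only in the test rows is encoded as all zeros instead of raising an error.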

### 2. Try a simple linear regression with all the data to see whether we are getting good results.

In [12]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [13]:
predictions = model.predict(X_test)

In [14]:
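# R², MAE and RMSE on the test set; np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn releases
r2_score(y_test, predictions), mean_absolute_error(y_test, predictions), np.sqrt(mean_squared_error(y_test, predictions))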

(0.7675830894926592, 96.47843501944618, 136.0607007788132)
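
The three numbers above are R², MAE and RMSE, in that order. As a rough illustration of what each one measures, the sketch below recomputes them with plain NumPy on four made-up values. The helper name `regression_metrics` is hypothetical and the numbers are not from the lab's dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(residuals))                # average absolute error
    rmse = np.sqrt(np.mean(residuals ** 2))         # penalises large errors more than MAE
    return r2, mae, rmse

# Made-up values for illustration only
r2, mae, rmse = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(r2, mae, rmse)
```

MAE and RMSE are in the units of the target, so for the lab they read as an error in `Total Claim Amount`; R² is unitless.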

### 3. Great! Now define a function that takes a list of models and trains (and tests) them, so we can try many of them without repeating code.

In [40]:
def r_model(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

    model = LinearRegression()
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('Linear Regression:', r2_score(y_test, predictions))

    model = KNeighborsRegressor(n_neighbors=4)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print('KNeighborsRegressor ', model.score(X_test, y_test))

    model = MLPRegressor()
    model.fit(X_train, y_train)
    expected_y  = y_test
    predicted_y = model.predict(X_test)
    print('MLP Regressor:',r2_score(expected_y, predicted_y))
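
The function above hard-codes its three models, while the exercise asks for one that takes a list. A minimal sketch of that variant follows, assuming every model exposes scikit-learn's `fit`/`predict` interface. The name `evaluate_models` and the synthetic `make_regression` data are illustrative and not part of the lab.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def evaluate_models(models, X, y, test_size=0.2, random_state=100):
    """Fit each model on the same split and return {class name: test R²}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scores = {}
    for model in models:
        model.fit(X_train, y_train)
        scores[type(model).__name__] = r2_score(y_test, model.predict(X_test))
    return scores

# Synthetic data for illustration only (assumption, not the lab's dataset)
demo_X, demo_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
demo_scores = evaluate_models(
    [LinearRegression(), KNeighborsRegressor(n_neighbors=4)], demo_X, demo_y
)
for name, score in demo_scores.items():
    print(f"{name}: {score:.3f}")
```

Because scores are keyed by class name, passing two instances of the same class would overwrite one entry; a list of `(label, model)` pairs would avoid that.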

### 4. Use the function to check LinearRegression and KNeighborsRegressor.

In [41]:
r_model(X,y)

Linear Regression: 0.7675830894926592
KNeighborsRegressor  0.6464390890978609
MLP Regressor: 0.8244893746290227


### 5. You can also check the MLPRegressor for this task!

In [None]:
# Done above: r_model() already evaluates MLPRegressor

### 6. Check and discuss the results.

MLPRegressor has the highest R² score (about 0.82), followed by LinearRegression (about 0.77); KNeighborsRegressor is clearly lower (about 0.65).
Though I am not completely sure why those differences occur, the fact that the data was taken raw, without cleaning, will have an effect on this.
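
All scores above come from a single train/test split, so part of the gap between models may be due to that particular split. One way to check this, sketched here under the assumption that k-fold cross-validation suits the task, is to compare the mean and spread of R² across folds. The synthetic data and the names `cv_X`, `cv_y` and `cv_results` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only (assumption, not the lab's dataset)
cv_X, cv_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

cv_results = {}
for cv_model in (LinearRegression(), KNeighborsRegressor(n_neighbors=4)):
    # One R² value per fold; each fold serves once as the held-out set
    fold_scores = cross_val_score(cv_model, cv_X, cv_y, cv=5, scoring="r2")
    cv_results[type(cv_model).__name__] = (fold_scores.mean(), fold_scores.std())

for name, (mean_r2, std_r2) in cv_results.items():
    print(f"{name}: mean R² = {mean_r2:.3f}, std = {std_r2:.3f}")
```

If the ranking of the models stays the same across folds, the difference is less likely to be an artifact of one split.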