# Further Tuning of the ML Model for the InsuranceCharges Data

Since we saw that the model performs no differently when sex is included, we remove that feature from our data. In `ML-DBmodel.ipynb` we saw that LinearRegression (i.e. Lasso with $\alpha$ = 0) was the best-performing model.

In this notebook we will try to improve the $R^2$ value of the LinearRegression model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
# For connecting to the SQL database
import pymssql
from config import database
from config import username
from config import password
from config import server

In [2]:
# Create a connection to the database
conn = pymssql.connect(server, username, password, database)
cursor = conn.cursor()

In [3]:
#Run the query to gather the table
table = 'dbo.InsuranceCharges'

querycosts = '''Select I.ChargeID, I.ChargeValue, I.AgeID, A.AgeLabel, I.ChildrenID, C.ChildrenLabel, I.RegionID, 
R.RegionLabel, I.SexID, S.SexLabel, I.SmokerID, Sm.SmokerLabel, I.BMI from InsuranceCharges I
inner join Age A on I.AgeID = A.AgeID
inner join Children C on  I.ChildrenID = C.ChildrenID
inner join Region R on  I.RegionID = R.RegionID
inner join Sex S on  I.SexID = S.SexID
inner join Smoker Sm on  I.SmokerID = Sm.SmokerID
'''
#Load the query to a pandas dataframe
df_costs = pd.read_sql(querycosts, conn)



In [4]:
df = df_costs[['AgeLabel', 'BMI', 'ChildrenLabel', 'RegionLabel', 'ChargeValue', 'SmokerLabel']]

#Make sure all columns are numerical in order to feed them into the ML model
df = df.astype({"AgeLabel": int, "ChildrenLabel": int})
df_dummies = pd.get_dummies(df, columns = ['RegionLabel', 'SmokerLabel'], drop_first = True)

df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AgeLabel               1338 non-null   int32  
 1   BMI                    1338 non-null   float64
 2   ChildrenLabel          1338 non-null   int32  
 3   ChargeValue            1338 non-null   float64
 4   RegionLabel_northwest  1338 non-null   uint8  
 5   RegionLabel_southeast  1338 non-null   uint8  
 6   RegionLabel_southwest  1338 non-null   uint8  
 7   SmokerLabel_True       1338 non-null   uint8  
dtypes: float64(2), int32(2), uint8(4)
memory usage: 36.7 KB


Now we isolate the target variable (`ChargeValue`) and assign it to `y`. It holds the medical costs incurred by each individual, whose attributes form `X` (i.e. `AgeLabel`, `BMI`, `ChildrenLabel`, `RegionLabel`, `SmokerLabel`).

In [5]:
X = df_dummies.drop(columns = 'ChargeValue').copy()
y = df_dummies[['ChargeValue']].copy()

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 9)

In [7]:
model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression())

In [8]:
param_grid = {'polynomialfeatures__degree': np.arange(1, 8)}

grid = GridSearchCV(model, param_grid, cv=7)

In [9]:
grid.fit(X_train, y_train)

GridSearchCV(cv=7,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'polynomialfeatures__degree': array([1, 2, 3, 4, 5, 6, 7])})

In [10]:
grid.best_params_

{'polynomialfeatures__degree': 3}

The best parameter is:
* `polynomialfeatures__degree`: 3
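Beyond `best_params_`, the fitted grid search keeps per-candidate scores in `cv_results_`, which shows how close the other degrees came. The following is a minimal sketch on synthetic data (not the insurance table); `X_demo`, `y_demo`, `demo_grid` and `cv_summary` are illustrative names, and the grid is deliberately smaller than the one used above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a quadratic signal plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(200, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] ** 2 + rng.normal(0, 0.1, 200)

demo_grid = GridSearchCV(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    {'polynomialfeatures__degree': np.arange(1, 5)},
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate degree, best-ranked first
cv_summary = pd.DataFrame(demo_grid.cv_results_)[
    ['param_polynomialfeatures__degree', 'mean_test_score',
     'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(cv_summary)
```

Applied to the `grid` object fitted above, the same selection of `cv_results_` columns would list the mean cross-validated $R^2$ for each of the seven degrees.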

In [11]:
best_model = grid.best_estimator_

In [12]:
best_model.score(X_test, y_test)

0.8703310497833336

With an increase of approximately 10 percentage points, the new model has an $R^2$ score of about 87% on the test set.
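For a scikit-learn regressor, or a pipeline ending in one, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks this by hand on synthetic data (an assumption for illustration, not the insurance table); `reg`, `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): a linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, 100)

# Fit on the first 80 rows, evaluate on the remaining 20
reg = LinearRegression().fit(X_demo[:80], y_demo[:80])
y_true = y_demo[80:]
y_pred = reg.predict(X_demo[80:])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = reg.score(X_demo[80:], y_true)

print(r2_manual, r2_sklearn)
```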

This was mostly achieved through the use of `PolynomialFeatures`, since we saw that values of $\alpha$ near 0 made little-to-no difference.
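To get a feel for what the polynomial expansion does, it helps to count the columns it produces. With 7 input features and the default bias column, a degree-$d$ expansion yields $\binom{7 + d}{d}$ columns. This is a small sketch, assuming the 7 feature columns shown in `df_dummies.info()`; `counts` is an illustrative name.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 7  # feature columns in X after get_dummies (ChargeValue excluded)
counts = {}
for degree in range(1, 8):
    # Fitting on a single dummy row is enough to learn the output width
    poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_features)))
    counts[degree] = poly.n_output_features_
    print(degree, counts[degree], comb(n_features + degree, degree))
```

Degree 3 gives 120 columns, while degrees 6 and 7 give 1716 and 3432, more than the roughly 1070 training rows left by an 80/20 split of 1338 records. That is one plausible reason the grid search settles on a low degree.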

We can use the model to create predictions for potential insurance customers, but first we export the model using joblib.

### Joblib Machine Learning Model Export

In [13]:
from joblib import dump, load

In [14]:
dump(best_model, 'LinearRegressionTunedJoblib.model')

['LinearRegressionTunedJoblib.model']

In [15]:
load_model = load('LinearRegressionTunedJoblib.model')
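A quick sanity check after a `dump`/`load` round trip is to confirm that the restored pipeline reproduces the original predictions. The sketch below does this on synthetic data with a temporary file; `pipe`, `restored` and the `demo.model` filename are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1]

pipe = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_demo, y_demo)

# Write to a temporary directory so nothing is left behind on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.model')
    dump(pipe, path)
    restored = load(path)

same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Note that joblib files are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that wrote them.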

In [16]:
X_test.head()

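| | AgeLabel | BMI | ChildrenLabel | RegionLabel_northwest | RegionLabel_southeast | RegionLabel_southwest | SmokerLabel_True |
|---|---|---|---|---|---|---|---|
| 227 | 35 | 31.0 | 1 | 0 | 0 | 1 | 0 |
| 1235 | 55 | 37.715 | 3 | 1 | 0 | 0 | 0 |
| 985 | 53 | 36.1 | 1 | 0 | 0 | 1 | 0 |
| 326 | 18 | 31.35 | 0 | 0 | 1 | 0 | 0 |
| 352 | 36 | 26.885 | 0 | 1 | 0 | 0 | 0 |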


In [17]:
#Create a new entry to predict the costs

prediction1 = {'AgeLabel': 22,
              'BMI': 22.3,
              'ChildrenLabel': 0,
              'RegionLabel_northwest': 0,
              'RegionLabel_southeast': 1,
              'RegionLabel_southwest': 0,
              'SmokerLabel_True': 0}


In [18]:
test_prediction_df = pd.DataFrame(prediction1, index=[1])
test_prediction_df

| | AgeLabel | BMI | ChildrenLabel | RegionLabel_northwest | RegionLabel_southeast | RegionLabel_southwest | SmokerLabel_True |
|---|---|---|---|---|---|---|---|
| 1 | 22 | 22.3 | 0 | 0 | 1 | 0 | 0 |


In [19]:
load_model.predict(test_prediction_df)

array([[3473.15487341]])

According to the model, a 22-year-old non-smoker living in the southeast with a BMI of 22.3 and no children is predicted to incur about 3473.15 dollars in healthcare costs.
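Typing the dummy columns by hand is error-prone, so a small helper can build the one-row input from plain values. This is a sketch only: `make_input_row` and `FEATURE_COLUMNS` are hypothetical names, and it assumes that `northeast` is the region dropped by `drop_first=True`, which is consistent with the three region columns listed in `df_dummies.info()`.

```python
import pandas as pd

# Column order taken from the X_test.head() output above
FEATURE_COLUMNS = ['AgeLabel', 'BMI', 'ChildrenLabel', 'RegionLabel_northwest',
                   'RegionLabel_southeast', 'RegionLabel_southwest', 'SmokerLabel_True']


def make_input_row(age, bmi, children, region, smoker):
    """Build a one-row DataFrame in the column layout the model was trained on."""
    row = {col: 0 for col in FEATURE_COLUMNS}
    row.update({'AgeLabel': age, 'BMI': bmi, 'ChildrenLabel': children,
                'SmokerLabel_True': int(smoker)})
    region_col = f'RegionLabel_{region}'
    if region_col in row:
        row[region_col] = 1
    elif region != 'northeast':  # assumed baseline category: all region dummies stay 0
        raise ValueError(f'Unknown region: {region!r}')
    return pd.DataFrame([row], columns=FEATURE_COLUMNS)


example_row = make_input_row(age=22, bmi=22.3, children=0, region='southeast', smoker=False)
print(example_row)
```

The resulting frame has the same layout as `test_prediction_df`, so it could be passed to `load_model.predict` in the same way.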