# Further Tuning of the ML Model for the InsuranceCharges Data

Because including sex did not change the model's performance, we drop it from the data. In `ML-DBmodel.ipynb` we saw that `LinearRegression` (i.e. Lasso with $\alpha = 0$) was the best-performing model.

In this notebook we try to improve the $R^2$ value of the `LinearRegression` model.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
#For connecting with SQL database
import pymssql
from config import database
from config import username
from config import password
from config import server

In [2]:
#Create connection to database
conn = pymssql.connect(server, username, password, database)
cursor = conn.cursor()

In [3]:
#Run the query to gather the table
table = 'dbo.InsuranceCharges'

querycosts = '''Select I.ChargeID, I.ChargeValue, I.AgeID, A.AgeLabel, I.ChildrenID, C.ChildrenLabel, I.RegionID, 
R.RegionLabel, I.SexID, S.SexLabel, I.SmokerID, Sm.SmokerLabel, I.BMI from InsuranceCharges I
inner join Age A on I.AgeID = A.AgeID
inner join Children C on  I.ChildrenID = C.ChildrenID
inner join Region R on  I.RegionID = R.RegionID
inner join Sex S on  I.SexID = S.SexID
inner join Smoker Sm on  I.SmokerID = Sm.SmokerID
'''
#Load the query to a pandas dataframe
df_costs = pd.read_sql(querycosts, conn)
df_costs



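| | ChargeID | ChargeValue | AgeID | AgeLabel | ChildrenID | ChildrenLabel | RegionID | RegionLabel | SexID | SexLabel | SmokerID | SmokerLabel | BMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 11082.577 | 38 | 55 | 1 | 0 | 1 | northwest | 1 | female | 1 | False | 26.980 |
| 1 | 2 | 14711.744 | 3 | 20 | 1 | 0 | 1 | northwest | 1 | female | 2 | True | 22.420 |
| 2 | 3 | 1743.214 | 2 | 19 | 1 | 0 | 4 | southwest | 1 | female | 1 | False | 28.900 |
| 3 | 4 | 8516.829 | 28 | 45 | 3 | 2 | 2 | southeast | 1 | female | 1 | False | 28.600 |
| 4 | 5 | 12268.632 | 38 | 55 | 3 | 2 | 1 | northwest | 1 | female | 1 | False | 32.775 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 1334 | 37484.449 | 5 | 22 | 3 | 2 | 2 | southeast | 2 | male | 2 | True | 37.070 |
| 1334 | 1335 | 4462.722 | 15 | 32 | 2 | 1 | 1 | northwest | 2 | male | 1 | False | 33.820 |
| 1335 | 1336 | 48970.248 | 42 | 59 | 2 | 1 | 2 | southeast | 2 | male | 2 | True | 41.140 |
| 1336 | 1337 | 19673.336 | 11 | 28 | 1 | 0 | 1 | northwest | 2 | male | 1 | False | 33.820 |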


In [4]:
df = df_costs[['AgeLabel', 'BMI', 'ChildrenLabel', 'RegionLabel', 'ChargeValue', 'SmokerLabel']]
df

| | AgeLabel | BMI | ChildrenLabel | RegionLabel | ChargeValue | SmokerLabel |
|---|---|---|---|---|---|---|
| 0 | 55 | 26.980 | 0 | northwest | 11082.577 | False |
| 1 | 20 | 22.420 | 0 | northwest | 14711.744 | True |
| 2 | 19 | 28.900 | 0 | southwest | 1743.214 | False |
| 3 | 45 | 28.600 | 2 | southeast | 8516.829 | False |
| 4 | 55 | 32.775 | 2 | northwest | 12268.632 | False |
| ... | ... | ... | ... | ... | ... | ... |
| 1333 | 22 | 37.070 | 2 | southeast | 37484.449 | True |
| 1334 | 32 | 33.820 | 1 | northwest | 4462.722 | False |
| 1335 | 59 | 41.140 | 1 | southeast | 48970.248 | True |
| 1336 | 28 | 33.820 | 0 | northwest | 19673.336 | False |


In [5]:
#Work on an explicit copy so the casts below do not raise SettingWithCopyWarning
df = df.copy()
df['AgeLabel'] = df['AgeLabel'].astype('int64')
df['ChildrenLabel'] = df['ChildrenLabel'].astype('int64')


In [6]:
df_dummies = pd.get_dummies(df, columns = ['RegionLabel', 'SmokerLabel'], drop_first = True)

In [7]:
df_dummies

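| | AgeLabel | BMI | ChildrenLabel | ChargeValue | RegionLabel_northwest | RegionLabel_southeast | RegionLabel_southwest | SmokerLabel_True |
|---|---|---|---|---|---|---|---|---|
| 0 | 55 | 26.980 | 0 | 11082.577 | 1 | 0 | 0 | 0 |
| 1 | 20 | 22.420 | 0 | 14711.744 | 1 | 0 | 0 | 1 |
| 2 | 19 | 28.900 | 0 | 1743.214 | 0 | 0 | 1 | 0 |
| 3 | 45 | 28.600 | 2 | 8516.829 | 0 | 1 | 0 | 0 |
| 4 | 55 | 32.775 | 2 | 12268.632 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 22 | 37.070 | 2 | 37484.449 | 0 | 1 | 0 | 1 |
| 1334 | 32 | 33.820 | 1 | 4462.722 | 1 | 0 | 0 | 0 |
| 1335 | 59 | 41.140 | 1 | 48970.248 | 0 | 1 | 0 | 1 |
| 1336 | 28 | 33.820 | 0 | 19673.336 | 1 | 0 | 0 | 0 |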
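
To illustrate what `drop_first = True` does in the cell above, here is a minimal sketch on a toy frame. The rows are made up for illustration, and the fourth region name (`northeast`) is an assumption, since it never appears in the output shown here. The first level of each encoded column (in sorted order) is dropped and acts as the baseline, which is why only three region columns and a single smoker column remain.

```python
import pandas as pd

# Toy rows for illustration only; the values are not from the database.
# 'northeast' is an assumed fourth region label.
toy = pd.DataFrame({
    "RegionLabel": ["northeast", "northwest", "southeast", "southwest"],
    "SmokerLabel": [False, True, False, True],
    "BMI": [25.0, 30.1, 27.4, 33.0],
})

# drop_first=True removes the first (sorted) level of each encoded column,
# so that level becomes the implicit baseline category.
toy_dummies = pd.get_dummies(
    toy, columns=["RegionLabel", "SmokerLabel"], drop_first=True
)
dummy_columns = list(toy_dummies.columns)
print(dummy_columns)
```

A row with zeros in all three region columns therefore belongs to the baseline region.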


In [8]:
X = df_dummies.drop(columns = 'ChargeValue').copy()
y = df_dummies[['ChargeValue']].copy()

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state = 9)

In [10]:
model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearRegression())

In [11]:
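#make_pipeline names each step after its lowercased class, so parameters are addressed as <step>__<param>
#np.arange(8) tries degrees 0 through 7: 8 x 2 x 2 = 32 combinations, times 7 folds = 224 fits
#Note: 'normalize' was deprecated in scikit-learn 1.0 and removed in 1.2
param_grid = {'polynomialfeatures__degree': np.arange(8),
              'linearregression__fit_intercept': [True, False],
              'linearregression__normalize': [True, False]}

grid = GridSearchCV(model, param_grid, cv=7)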
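
On scikit-learn 1.2 or later the `normalize` entry above is no longer accepted. A possible adaptation, sketched here on synthetic data (an assumption, not the insurance table), moves scaling into the pipeline as a `StandardScaler` step and starts the degree search at 1. The variable names `X_demo`, `y_demo`, `demo_model` and `demo_grid` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (an assumption; not the insurance table).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = (1.0 + X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.1, size=120))

# Scaling is an explicit pipeline step instead of the removed `normalize` flag.
demo_model = make_pipeline(PolynomialFeatures(), StandardScaler(), LinearRegression())

# Step names are the lowercased class names, joined to parameters with "__".
demo_grid = GridSearchCV(
    demo_model,
    param_grid={
        "polynomialfeatures__degree": [1, 2, 3],
        "linearregression__fit_intercept": [True, False],
    },
    cv=5,
)
demo_grid.fit(X_demo, y_demo)
print(demo_grid.best_params_)
```

`demo_model.get_params().keys()` lists every parameter name that can be used as a key in `param_grid`.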

In [12]:
grid.fit(X_train, y_train)

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)


28 fits failed out of a total of 224.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
24 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Christian\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Christian\anaconda3\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\Users\Christian\anaconda3\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\Users\Christian\anaconda3\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args,

GridSearchCV(cv=7,
             estimator=Pipeline(steps=[('polynomialfeatures',
                                        PolynomialFeatures()),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'linearregression__fit_intercept': [True, False],
                         'linearregression__normalize': [True, False],
                         'polynomialfeatures__degree': array([0, 1, 2, 3, 4, 5, 6, 7])})

In [13]:
grid.best_params_

{'linearregression__fit_intercept': False,
 'linearregression__normalize': True,
 'polynomialfeatures__degree': 2}

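The best params are
* fit_intercept: False
* normalize: True
* polynomialfeatures__degree: 2

`PolynomialFeatures` adds a constant column by default, so the model still has an intercept term, and scikit-learn ignores `normalize` when `fit_intercept` is `False`.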
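
The `fit_intercept: False` reported by `grid.best_params_` does not mean the model lacks a constant term: `PolynomialFeatures` uses `include_bias=True` by default, and that column of ones can take over the role of the intercept. A minimal check on synthetic data (an assumption, not the insurance table) compares the predictions of both settings:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; not the insurance table).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
y_demo = (5.0 + X_demo[:, 0] * X_demo[:, 1] - 2.0 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))
X_new = rng.normal(size=(10, 3))

# Same polynomial expansion, differing only in fit_intercept.
with_intercept = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression(fit_intercept=True)
)
without_intercept = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression(fit_intercept=False)
)

pred_with = with_intercept.fit(X_demo, y_demo).predict(X_new)
pred_without = without_intercept.fit(X_demo, y_demo).predict(X_new)

# The bias column absorbs the constant term, so both models should agree.
same_predictions = np.allclose(pred_with, pred_without)
print(same_predictions)
```

In the second pipeline the constant term shows up as the coefficient of the bias column rather than in `intercept_`.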

In [14]:
best_model = grid.best_estimator_

In [15]:
best_model.score(X_test, y_test)

0.8762270532286466

With an increase of approximately 10 percentage points over the earlier model, the new model reaches an $R^2$ score of 0.876 (87.6%) on the test set.
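
For reference, the value returned by `score` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it by hand on made-up numbers (not model output) and compares the result with `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```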

This was mostly achieved through the polynomial features, since varying $\alpha$ near 0 made little to no difference.

We can use the model to create predictions for potential insurance customers, but first we export the model using joblib.

### Joblib Machine Learning Model Export

In [16]:
from joblib import dump, load

In [17]:
dump(best_model, 'LinearRegressionTunedJoblib.model')

['LinearRegressionTunedJoblib.model']

In [18]:
load_model = load('LinearRegressionTunedJoblib.model')
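
A quick sanity check after exporting is to confirm that the reloaded model predicts exactly like the original. The sketch below does this for a small stand-in pipeline fitted on synthetic data (an assumption), writing to a temporary directory under the hypothetical file name `demo.model`:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data and model (assumptions; not the tuned model above).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1]
fitted = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_demo, y_demo)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.model")  # hypothetical file name
    dump(fitted, path)
    restored = load(path)

# The restored pipeline should reproduce the original predictions.
round_trip_ok = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

Joblib files are pickle-based, so they should only be loaded from trusted sources and preferably with the same scikit-learn version that wrote them.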

In [19]:
X_test.head()

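| | AgeLabel | BMI | ChildrenLabel | RegionLabel_northwest | RegionLabel_southeast | RegionLabel_southwest | SmokerLabel_True |
|---|---|---|---|---|---|---|---|
| 227 | 35 | 31.0 | 1 | 0 | 0 | 1 | 0 |
| 1235 | 55 | 37.715 | 3 | 1 | 0 | 0 | 0 |
| 985 | 53 | 36.1 | 1 | 0 | 0 | 1 | 0 |
| 326 | 18 | 31.35 | 0 | 0 | 1 | 0 | 0 |
| 352 | 36 | 26.885 | 0 | 1 | 0 | 0 | 0 |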


In [29]:
#Create a new entry to predict the costs

prediction1 = {'AgeLabel': 22,
              'BMI': 22.3,
              'ChildrenLabel': 0,
              'SmokerLabel_True': 0,
              'RegionLabel_northwest': 0,
              'RegionLabel_southeast': 1,
              'RegionLabel_southwest': 0}


In [30]:
test_prediction_df = pd.DataFrame(prediction1, index=[1])
test_prediction_df

| | AgeLabel | BMI | ChildrenLabel | SmokerLabel_True | RegionLabel_northwest | RegionLabel_southeast | RegionLabel_southwest |
|---|---|---|---|---|---|---|---|
| 1 | 22 | 22.3 | 0 | 0 | 0 | 1 | 0 |


In [31]:
load_model.predict(test_prediction_df)

Feature names must be in the same order as they were in fit.



array([[2744.01271322]])
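
The warning above points at a real problem: in `test_prediction_df` the `SmokerLabel_True` column comes before the region columns, whereas the `X_test.head()` output shows it last. If the values are consumed by position, the region flag ends up in the wrong slot, so the predicted value should be treated with caution. A minimal sketch of one way to align the columns by name before predicting (`fit_columns` and `aligned_df` are hypothetical names; in the notebook `X_train.columns` could serve the same purpose):

```python
import pandas as pd

# Column order used when fitting, copied from the X_test.head() output above.
fit_columns = ["AgeLabel", "BMI", "ChildrenLabel",
               "RegionLabel_northwest", "RegionLabel_southeast",
               "RegionLabel_southwest", "SmokerLabel_True"]

prediction1 = {"AgeLabel": 22, "BMI": 22.3, "ChildrenLabel": 0,
               "SmokerLabel_True": 0, "RegionLabel_northwest": 0,
               "RegionLabel_southeast": 1, "RegionLabel_southwest": 0}

# Select the columns by name in the training order, so every value
# sits in the position the model expects.
aligned_df = pd.DataFrame(prediction1, index=[1])[fit_columns]
print(aligned_df)

# In the notebook, the aligned frame would then be passed to the model:
# load_model.predict(aligned_df)
```

Selecting by name also raises a `KeyError` if an expected column is missing, which catches typos in the input dictionary early.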