Hi, in this notebook I try the LazyPredict library to find out which model best fits my dataset and gives the most accurate predictions. To keep it simple, I skipped some steps such as EDA, because I already covered them in an earlier notebook. You can find it here: https://www.kaggle.com/code/bahadirozcanli/medical-cost-prediction-eda-linear-regression

# Importing Libraries and Reading Dataset

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv("C:/Users/bahadiroz/OneDrive - Otokoç Otomotiv Ticaret ve Sanayi A.Ş/Desktop/Bahadır/insurance.csv")
df.head(5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [3]:
# keep a copy of the original DataFrame as a backup
df2 = df.copy()

The models need numeric input, so we will encode the categorical features (sex, smoker and region) as integers. We will use the Label Encoding technique.

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# sex
df.sex = le.fit_transform(df.sex)
# smoker
df.smoker = le.fit_transform(df.smoker)
# region
df.region = le.fit_transform(df.region)

df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,3,16884.924
1,18,1,33.77,1,0,2,1725.5523
2,28,1,33.0,3,0,2,4449.462
3,33,1,22.705,0,0,1,21984.47061
4,32,1,28.88,0,0,1,3866.8552
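To see which integer each category received, it can help to inspect the encoder's `classes_` attribute. The sketch below uses a small toy frame rather than the real dataset. The fourth region name, `northeast`, is an assumption: the preview shows only three regions, but the code 3 for `southwest` implies a fourth. `LabelEncoder` sorts the labels, so the codes are simply alphabetical positions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values mirroring the categories in the preview above (not the real dataset).
# "northeast" is an assumed fourth region; it does not appear in the preview.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

mappings = {}
for col in ["smoker", "region"]:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    # classes_ is sorted, so the position in the array is the assigned integer code
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

print(mappings)
```

Note that these integer codes impose an arbitrary order on `region`. Tree-based models usually cope with that, while a linear model reads the codes as a numeric scale.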


# Modelling

In [5]:
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import metrics

In [6]:
y = df["charges"]  # dependent variable (target)
X = df.drop(["charges"], axis=1)  # independent variables (features)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

I start with a simple linear regression model as a baseline. We will compare its performance with the LazyRegressor results.

In [8]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [9]:
print(lr.intercept_, "\n")
print(lr.coef_.reshape(-1,1))

-11946.606567263041 

[[ 2.57056264e+02]
 [-1.87914567e+01]
 [ 3.35781491e+02]
 [ 4.25091456e+02]
 [ 2.36478181e+04]
 [-2.71284266e+02]]
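The coefficients above are printed in the column order of `X` (age, sex, bmi, children, smoker, region), so the large fifth value belongs to `smoker`. A minimal sketch of how to label coefficients by feature name follows. It uses synthetic data with a known relationship; the names `X_toy`, `y_toy`, `lr_toy` and `coef_table` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship (hypothetical, not the insurance data)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=50),
    "bmi": rng.uniform(18, 40, size=50),
})
y_toy = 2.0 * X_toy["age"] + 3.0 * X_toy["bmi"] + 5.0

lr_toy = LinearRegression().fit(X_toy, y_toy)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(lr_toy.coef_, index=X_toy.columns, name="coefficient")
print(coef_table)
print("intercept:", lr_toy.intercept_)
```

In this notebook the same pattern would be `pd.Series(lr.coef_, index=X.columns)`.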


In [10]:
y_pred = lr.predict(X_test)

In [11]:
results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred, 'Residuals': y_test - y_pred})
results


Unnamed: 0,Actual,Predicted,Residuals
764,9095.07,8924.41,170.66
887,5272.18,7116.30,-1844.12
890,29330.98,36909.01,-7578.03
1293,9301.89,9507.87,-205.98
259,33750.29,27013.35,6736.94
...,...,...,...
109,47055.53,39116.97,7938.56
575,12222.90,11814.56,408.34
535,6067.13,7638.11,-1570.98
543,63770.43,40959.08,22811.35


In [12]:
#model performance

print("MAE: ", metrics.mean_absolute_error(y_test, y_pred))
print("MSE: ", metrics.mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # avoids the squared= argument, which is deprecated in newer scikit-learn releases
print("R2: ", metrics.r2_score(y_test, y_pred), "\n")
print("Score: ", lr.score(X_test, y_test))

MAE:  4186.5088983664355
MSE:  33635210.431178406
RMSE:  5799.587091438356
R2:  0.7833463107364539 

Score:  0.7833463107364539
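For reference, these metrics can be reproduced by hand from the residuals. The sketch below uses four made-up values (`y_true` and `y_hat` are hypothetical) so the arithmetic is easy to follow. It also shows why RMSE is just the square root of MSE. `lr.score` returns R2, which is why the last two numbers in the output above are identical.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only to keep the arithmetic simple
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 400.0])

residuals = y_true - y_hat                       # [-10, 10, -30, 0]
mae = np.mean(np.abs(residuals))                 # mean absolute error
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # root mean squared error
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                       # share of variation explained

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```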


These metrics give us a baseline: linear regression explains about 78% of the variance in charges on the test set. Let's run LazyRegressor and compare this result against other models.

In [13]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models,predictions = reg.fit(X_train, X_test, y_train, y_test)
models

100%|██████████████████████████████████████████████████████████████████████████████████| 42/42 [00:10<00:00,  4.08it/s]


Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
GradientBoostingRegressor,0.88,0.88,4351.11,0.08
RandomForestRegressor,0.86,0.87,4571.5,0.21
LGBMRegressor,0.86,0.86,4590.16,0.1
HistGradientBoostingRegressor,0.86,0.86,4599.76,0.34
XGBRegressor,0.85,0.85,4750.36,0.1
ExtraTreesRegressor,0.85,0.85,4825.67,0.17
BaggingRegressor,0.85,0.85,4828.92,0.02
KNeighborsRegressor,0.83,0.83,5068.57,0.01
AdaBoostRegressor,0.82,0.82,5267.06,0.02
PoissonRegressor,0.79,0.8,5632.71,0.01
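The table reports both R-Squared and Adjusted R-Squared. A minimal sketch of the usual textbook formula follows, on the assumption that LazyPredict computes the adjusted value the same way. The function name `adjusted_r2` and the sample numbers are illustrative and not taken from the notebook.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Textbook adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: 100 test rows, 6 features, R-squared of 0.80
example = adjusted_r2(0.80, n_samples=100, n_features=6)
print(round(example, 4))
```

With only six features and a few hundred test rows the penalty is small, which would explain why the two columns in the table differ by at most 0.01.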


In [14]:
gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)
print("Score: ", gbr.score(X_test, y_test))

Score:  0.8779726251291786


## Conclusion

LazyPredict ran 42 different models for us and reported their metrics. The best one, GradientBoostingRegressor, reaches an R-squared of about 0.88 on the test set, compared with 0.78 for linear regression. The linear regression model we tried first ranks only 13th :) LazyPredict looks like a useful library for shortlisting models, at least as a first step.
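Conceptually, this kind of comparison is a loop: fit each candidate on the training split, score it on the test split, and sort the results. The sketch below illustrates that idea with three scikit-learn regressors on synthetic data. It is not LazyPredict's actual implementation, and the names `candidates` and `leaderboard` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data (a hypothetical stand-in for the insurance features)
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)          # train on the training split
    pred = model.predict(X_te)     # evaluate on the held-out split
    rows.append({
        "Model": name,
        "R-Squared": r2_score(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
    })

# Best model first, as in the LazyRegressor table
leaderboard = pd.DataFrame(rows).set_index("Model").sort_values("R-Squared", ascending=False)
print(leaderboard)
```

Because the ranking comes from a single train/test split and default hyperparameters, it is best treated as a shortlist rather than a final verdict.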

## Resources

https://lazypredict.readthedocs.io/en/latest/index.html

https://www.kaggle.com/code/mervanzekinci/lazypredict-on-breast-cancer
    
