<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 3.0**


---

# Predicting Health Insurance costs

When acquiring health insurance, we usually pay a fixed, low amount of money in return for being covered against high charges during a moment of healthcare need or emergency. Given this, it is important for insurance companies to predict the cost of customers in case such an event arises, so that their business remains feasible. This is a difficult problem, because it is hard to predict when and how someone will become ill. However, certain aspects of people's behaviour, habits and medical history might tell us how much these patients will cost the insurance company.

<p align=center>
<img src="img/health_insurance.png" width="30%"><br>
<i><sup>Image credits: pch.vector (<a href="https://br.freepik.com/vetores-gratis/pai-apertando-as-maos-com-agente-de-seguros_6974887.htm">www.freepik.com</a>)</sup></i>
</p>

In this notebook we will look at a Health Insurance Cost dataset, using regression machine learning models in [PyCaret](https://pycaret.org/). PyCaret is a popular low-code library that provides an automated way to create data analysis workflows using Machine Learning. It aims to reduce the time spent coding the models, leaving more time for the analyses themselves.

# The Data

The data for this project was obtained from [Kaggle](https://www.kaggle.com/annetxu/health-insurance-cost-prediction). There is not much information about it on the page, but it is a simple dataset (with only 7 columns) that describes characteristics of the individuals and their insurance charges over the analysed period (which is unknown). For ease of access, I have downloaded the dataset and included it in the `data` folder of this project.


In [None]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sweetviz as sv

# Importing pycaret tools
from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

# Getting the data
df = pd.read_csv("data/insurance.csv")

# Life, the Universe, and Everything
np.random.seed(42)

# Defining plot parameters
# plt.style.use('dark_background')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rcParams['font.stretch'] = 'normal'
plt.rcParams['font.style'] = 'normal'
plt.rcParams['font.variant'] = 'normal'

# Checking first entries of the dataset
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [19]:
# Dataset size
df.shape

(1338, 7)

## Data variables

As mentioned above, the dataset comes with 1338 observations and 7 columns only, which are:

* `age` = The age of the individual insurance client.
* `sex` = The biological sex.
* `bmi` = Body Mass Index, a health measure based on weight divided by the squared height.
* `children` = The number of children the individual has.
* `smoker` = If they smoke or not.
* `region` = The region where they live (related to the dataset origin, other information unknown).
* `charges` = The incurred charges originating from the specific individual. *This is our target variable*.
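
As a quick illustration of the `bmi` definition in the list above, the sketch below computes the index from a weight in kilograms and a height in metres. The function name and the sample values are hypothetical and are not taken from the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI: weight (kg) divided by the squared height (m)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))  # 22.86
```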

First, we will inspect our variables and make sure that their types match what we expect.

In [None]:
# Checking types
df.info()

In [20]:
# Describing our dataset
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


Now, let us use SweetViz for a more comprehensive view of our variables.

In [21]:
sweetviz = sv.analyze(df)
sweetviz.show_notebook()


We can see that our `age` variable only ranges between 18 and 64. If there were older people in the dataset, it would be a good idea to create an additional column categorizing the ages, since we know that older people often need more health care than younger people. Also, the age of the individuals has a roughly uniform distribution, with the exception of an increase at the lower end of the age range.

The `sex` variable, which might also influence our predictions, is well balanced. The `region` variable is also well balanced.

The `children` variable is skewed towards the lower end of the distribution, but it is a potential factor that might be associated with our outcome, since children can get ill more easily through, for example, school contact.

Our target variable, `charges`, has some obvious outliers, but these are of extreme interest when predicting costs in a health insurance scenario.

The `smoker` variable is unbalanced, with ~20% of smokers in the dataset. Smoking is also a health risk factor, so the variable will be left as is.

Another possible risk factor for health conditions is **adiposity**. Weight and height alone (the measurements used to calculate BMI, the Body Mass Index) are not good enough to evaluate someone's adiposity levels, and we should be careful not to fat shame other people in the name of their health condition (readers can find more on this [here](https://www.goodhousekeeping.com/health/a35422452/fat-phobia/)). However, historically the BMI and the categories defined by it have been associated with negative health conditions. Thus, here we will classify the data according to the [BMI standards](https://www.who.int/europe/news-room/fact-sheets/item/a-healthy-lifestyle---who-recommendations) and compare how this measurement influences our predictions.

In [11]:
# Copying dataset
df_bmi = df.copy()

# Removing column from other df
df = df.drop('bmi', axis = 1)

In [16]:
# Adding new column with BMI category
conditions = [
    (df_bmi['bmi'] < 18.5),
    (df_bmi['bmi'] >= 18.5) & (df_bmi['bmi'] < 25),
    (df_bmi['bmi'] >= 25) & (df_bmi['bmi'] < 30),
    (df_bmi['bmi'] >= 30) & (df_bmi['bmi'] < 35),
    (df_bmi['bmi'] >= 35) & (df_bmi['bmi'] < 40),
    (df_bmi['bmi'] >= 40)]
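classes = ['Underweight', 'Regular', 'Overweight', 'Obesity 1', 'Obesity 2', 'Obesity 3']
# A string default keeps the result a string column (rows with a missing BMI would match no condition)
df_bmi['bmi_class'] = np.select(conditions, classes, default='Unknown')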

df_bmi

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,bmi_class
0,19,female,27.900,0,yes,southwest,16884.92400,Overweight
1,18,male,33.770,1,no,southeast,1725.55230,Obesity 1
2,28,male,33.000,3,no,southeast,4449.46200,Obesity 1
3,33,male,22.705,0,no,northwest,21984.47061,Regular
4,32,male,28.880,0,no,northwest,3866.85520,Overweight
...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,Obesity 1
1334,18,female,31.920,0,no,northeast,2205.98080,Obesity 1
1335,18,female,36.850,0,no,southeast,1629.83350,Obesity 2
1336,21,female,25.800,0,no,southwest,2007.94500,Overweight
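
The same banding can also be expressed with `pandas.cut`, which avoids writing each condition by hand. This is only an alternative sketch on a few hypothetical BMI values (one per band); the bin edges mirror the conditions used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values, one for each band
sample = pd.DataFrame({"bmi": [17.0, 22.705, 27.9, 33.77, 36.85, 41.2]})

# right=False makes the bins left-closed, i.e. ">= lower and < upper"
bins = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Underweight", "Regular", "Overweight",
          "Obesity 1", "Obesity 2", "Obesity 3"]
sample["bmi_class"] = pd.cut(sample["bmi"], bins=bins, labels=labels, right=False)

print(sample)
```

The result is an ordered categorical column, which can be convenient when the classes need to be sorted or plotted in order.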


Now let's separate our train and test datasets:

In [3]:
# Creating test dataset
test = df.sample(frac=0.1)

# Creating train data by dropping test data
train = df.drop(test.index)

# Resetting indexes
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)

In [4]:
# Checking sizes
print(train.shape)
print(test.shape)

(1204, 7)
(134, 7)
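
The split above depends on the global NumPy seed set at the top of the notebook. A slightly more explicit variant, sketched here on a stand-in DataFrame with the same number of rows, passes `random_state` directly to `sample`; the frame and its column values are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the insurance data
demo = pd.DataFrame({"age": np.arange(1338), "charges": np.arange(1338) * 10.0})

# random_state pins the sample, independently of the global NumPy seed
test_demo = demo.sample(frac=0.1, random_state=42)
train_demo = demo.drop(test_demo.index)

train_demo = train_demo.reset_index(drop=True)
test_demo = test_demo.reset_index(drop=True)

print(train_demo.shape, test_demo.shape)  # (1204, 2) (134, 2)
```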


Now, let's start with PyCaret.

# Regression with PyCaret

First, we pass our data to PyCaret.

## Creating regressor

In [5]:
# Creating regressor using PyCaret
reg = setup(data=train, target='charges')

Unnamed: 0,Description,Value
0,session_id,6592
1,Target,charges
2,Original Data,"(1204, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(842, 14)"


From this initial report, we can see that our dataset has no missing values. Thus, we can proceed to create our pipeline.

## PyCaret pipeline

In [6]:
# Creating pipeline
reg = setup(data=train,
            target='charges',
            normalize=True,
            normalize_method='zscore',
            log_experiment=False,
            experiment_name='HealthInsuranceCosts'
            )

Unnamed: 0,Description,Value
0,session_id,6092
1,Target,charges
2,Original Data,"(1204, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(842, 14)"
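
With `normalize_method='zscore'`, each numeric feature is rescaled to zero mean and unit variance. The NumPy sketch below shows that transformation on the five ages printed by `df.head()`; PyCaret applies the equivalent step inside its own pipeline, so this is only an illustration of the idea.

```python
import numpy as np

ages = np.array([19.0, 18.0, 28.0, 33.0, 32.0])  # first five ages in the dataset

# z = (x - mean) / std, using the population standard deviation
z = (ages - ages.mean()) / ages.std()

print(z.round(3))
print("mean:", abs(z.mean()).round(6), "std:", z.std().round(6))
```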


Now that our setup has been initiated, let us compare how each regressor model behaves with our dataset.

In [7]:
# Checking regression models
best = compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,2696.6335,22955900.8419,4768.4209,0.8483,0.4352,0.3115,0.012
rf,Random Forest Regressor,2839.6653,25253762.9326,5001.5729,0.8341,0.4764,0.3398,0.044
lightgbm,Light Gradient Boosting Machine,3094.1874,25922340.7893,5079.3896,0.8295,0.5739,0.3924,0.014
et,Extra Trees Regressor,2934.4377,29064814.464,5367.8101,0.8089,0.4932,0.3431,0.041
ada,AdaBoost Regressor,4345.1227,29825187.8898,5453.2272,0.8022,0.6145,0.7045,0.006
catboost,CatBoost Regressor,2614.9669,22979650.9722,4533.3996,0.7503,0.4363,0.3249,0.214
lr,Linear Regression,4381.2086,39218866.0,6234.8892,0.7394,0.5944,0.4362,0.542
ridge,Ridge Regression,4399.4389,39265183.2,6238.8486,0.7393,0.5927,0.4402,0.004
llar,Lasso Least Angle Regression,4381.4788,39243358.575,6236.4471,0.7393,0.5834,0.4366,0.005
br,Bayesian Ridge,4396.9691,39274026.2449,6239.3833,0.7392,0.5939,0.4396,0.004
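
For reference, the metric columns in the table above can be computed from a vector of true values and a vector of predictions. The sketch below uses plain NumPy and hypothetical numbers; it assumes that MAPE is expressed as a fraction rather than a percentage, which is consistent with the magnitudes shown in the table.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the error metrics named in the comparison table."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MAE": np.mean(np.abs(err)),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
        # RMSLE assumes non-negative targets and predictions
        "RMSLE": np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)),
        # MAPE as a fraction, not a percentage
        "MAPE": np.mean(np.abs(err / y_true)),
    }

# Hypothetical charges and predictions
y_true = [1000.0, 2000.0, 4000.0]
y_pred = [1100.0, 1800.0, 4000.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```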


From the comparisons, the Gradient Boosting Regressor achieved the best metrics in nearly all categories. Let's see how the parameters were set in this model:

In [8]:
# Printing parameters for the best model
print(best)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=6092, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)


With our best model identified, we must now actually build the model and train it on our data (this step is not done during `compare_models()`). For comparison purposes, we will also use the second and third best algorithms identified. However, since LightGBM is not recommended for datasets with fewer than 10,000 observations, we will compare the GBR with CatBoost and the Random Forest Regressor instead.

## The Gradient Boosting Regressor

The Gradient Boosting Regressor, or simply GBR, is a powerful machine learning model used for predictions. It is based on the construction of weak learners, which are combined additively into an ensemble of predictors that minimizes the loss function<sup><a href="https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/">1</a>,<a href="https://en.wikipedia.org/wiki/Gradient_boosting">2</a></sup>.

It basically has three components<sup><a href="https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/">1</a></sup>:

1. A loss function;
2. Weak learners;
3. The additive model in which the weak learners are added to minimize the loss function.

In this model, the weak learners are added one by one, while existing ones remain unchanged. This is done through a [gradient descent](https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html) procedure (iterative method to minimize some function).
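
To make the three components concrete, here is a minimal from-scratch sketch for the squared-error loss, using one-split decision stumps as weak learners. It is not how scikit-learn or PyCaret implement GBR (the model printed above uses trees with `max_depth=3`); the function names and the synthetic data are assumptions made for illustration, while `n_estimators=100` and `learning_rate=0.1` simply echo the defaults shown earlier.

```python
import numpy as np

def fit_stump(x, residual):
    """Pick the threshold on x whose two-leaf fit has the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:      # skip the max so both leaves are non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]                           # (threshold, left value, right value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosting(x, y, n_estimators=100, learning_rate=0.1):
    """Start from the mean, then add one stump at a time, fitted to the residuals."""
    init = y.mean()
    prediction = np.full(len(y), init)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the residual is the negative gradient (up to a constant)
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * predict_stump(stump, x)
        stumps.append(stump)                  # earlier stumps stay unchanged
    return init, learning_rate, stumps

def predict_boosting(model, x):
    init, learning_rate, stumps = model
    return init + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic one-feature data (hypothetical, not the insurance dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = fit_boosting(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - predict_boosting(model, x)) ** 2)
print(f"MSE predicting the mean: {mse_baseline:.4f}")
print(f"MSE after boosting:      {mse_boosted:.4f}")
```

Each round fits a new stump to what the current ensemble still gets wrong and adds a shrunken copy of it, which is the gradient descent step in function space described above.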

### Creating our model

In [9]:
# Creating first model
gbr = create_model('gbr')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2795.5364,25100055.8505,5009.9956,0.8278,0.4087,0.3441
1,2554.3471,21800752.6608,4669.1276,0.8338,0.4843,0.3062
2,2724.051,28289713.9886,5318.8076,0.8369,0.4548,0.2404
3,2414.419,17531458.3113,4187.0584,0.9017,0.3739,0.2915
4,2944.1349,26212703.185,5119.8343,0.8467,0.4384,0.3275
5,2395.9946,15798288.5339,3974.7061,0.9011,0.4579,0.341
6,2749.9466,24144587.7778,4913.7143,0.8365,0.4639,0.3286
7,2681.8429,18751621.3255,4330.3142,0.8476,0.417,0.329
8,2544.6183,21962012.1999,4686.3645,0.8388,0.429,0.3268
9,3161.4446,29967814.5856,5474.2867,0.8122,0.424,0.2802


This shows the result we had previously, as the model has been instantiated using the same hyperparameters.
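
Each row in the table above (folds 0 to 9) is one cross-validation fold: the model is trained on nine tenths of the data and scored on the remaining tenth. The sketch below reproduces that procedure with NumPy on synthetic data, with an ordinary least-squares fit standing in for the regressor; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # synthetic features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

k = 10
indices = rng.permutation(len(y))      # shuffle before splitting
folds = np.array_split(indices, k)

scores = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Least-squares fit with an intercept column, trained without fold i
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
    scores.append(r2_score(y[val_idx], A_val @ coef))

print(f"R2 per fold: {np.round(scores, 4)}")
print(f"Mean R2: {np.mean(scores):.4f}")
```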

### Tuning the GBR model

In [10]:
# Creating tuned model
tuned_gbr = tune_model(gbr, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2916.4319,25763917.7437,5075.817,0.8233,0.3953,0.3358
1,2561.4537,21650544.271,4653.0145,0.8349,0.4808,0.2962
2,2607.1248,26493010.3627,5147.1361,0.8472,0.4737,0.2631
3,2394.8074,18472942.3,4298.0161,0.8964,0.3943,0.2944
4,2928.0001,26038612.0445,5102.8043,0.8477,0.4125,0.292
5,2431.6552,16362678.4869,4045.0808,0.8975,0.4527,0.3398
6,2710.7128,23214485.3937,4818.1413,0.8428,0.4691,0.3204
7,2481.9027,16755110.497,4093.3007,0.8639,0.3971,0.3129
8,2544.3852,22330513.1358,4725.5172,0.8361,0.4114,0.3009
9,3125.2488,28971591.3294,5382.5265,0.8185,0.4191,0.2689
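
`tune_model` tries a number of hyperparameter combinations (`n_iter=100` here) and keeps the one with the best cross-validated `R2`. Assuming the search is a random search, which is the default as far as I am aware, a roughly comparable sketch with scikit-learn's `RandomizedSearchCV` is shown below. The search space and the synthetic data are hypothetical; PyCaret defines its own grid internally.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

# Hypothetical search space, chosen only for illustration
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled candidates (the notebook uses 100)
    scoring="r2",
    cv=5,
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV R2: {search.best_score_:.4f}")
```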


## Trying CatBoost and Random Forest Regressors

In [11]:
## CatBoost

# Creating model
cat = create_model('catboost')

# Tuning model
tuned_cat = tune_model(cat, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2654.7153,23525639.5913,4850.3237,0.8386,0.4071,0.3473
1,2624.8895,21122843.3021,4595.9595,0.8389,0.4994,0.3568
2,2637.035,26387745.6258,5136.9004,0.8478,0.4694,0.2943
3,2507.6466,19324039.5779,4395.9117,0.8916,0.4126,0.3396
4,2903.0326,25984782.5167,5097.5271,0.848,0.3993,0.2753
5,2389.5573,15262064.7041,3906.6693,0.9044,0.4564,0.3644
6,2574.3745,22007042.1946,4691.1664,0.851,0.4555,0.314
7,2474.4257,17075949.9792,4132.3056,0.8613,0.4107,0.3264
8,2536.4533,21944563.49,4684.5025,0.839,0.4291,0.3338
9,3108.2624,28000732.9769,5291.5719,0.8245,0.4135,0.2769


In [12]:
## Random Forest

# Creating model
rf = create_model('rf')

# Tuning model
tuned_rf = tune_model(rf, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2763.8864,22349852.0795,4727.563,0.8467,0.378,0.3234
1,2351.5717,20013781.7562,4473.6765,0.8474,0.4498,0.2436
2,2739.1745,27240800.8126,5219.2721,0.8429,0.4619,0.2673
3,2442.9324,18470030.9704,4297.6774,0.8964,0.4463,0.3517
4,2889.7019,26116032.9007,5110.3848,0.8473,0.4266,0.2824
5,2611.991,18036414.3949,4246.93,0.887,0.5136,0.3848
6,2794.6951,24136478.3533,4912.889,0.8366,0.4598,0.3311
7,2648.707,18368671.5629,4285.8688,0.8507,0.4101,0.3221
8,2719.9109,23165531.6151,4813.0584,0.83,0.4278,0.3328
9,3137.3462,27309659.7132,5225.8645,0.8289,0.412,0.2774


# References

1: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

2: https://en.wikipedia.org/wiki/Gradient_boosting