<img alt="Colaboratory logo" width="15%" src="https://raw.githubusercontent.com/carlosfab/escola-data-science/master/img/novo_logo_bg_claro.png">

#### **Data Science na Prática 3.0**


---

# Predicting Health Insurance costs

When acquiring health insurance, we typically pay a low, fixed amount in return for coverage of much higher charges in a moment of healthcare need or emergency. For this reason, it is important for insurance companies to predict what each customer will cost should such an event arise, so that their business remains viable. This is a difficult problem, because it is hard to predict when and how someone will become ill. However, certain aspects of people's behaviour, habits and medical history may indicate how much they will cost the insurance company.

<p align=center>
<img src="img/health_insurance.png" width="30%"><br>
<i><sup>Image credits: pch.vector (<a href="https://br.freepik.com/vetores-gratis/pai-apertando-as-maos-com-agente-de-seguros_6974887.htm">www.freepik.com</a>)</sup></i>
</p>

In this notebook we will look at a health insurance cost dataset, using regression machine learning models in [PyCaret](https://pycaret.org/). PyCaret is a popular low-code library that automates much of the machine learning workflow. It aims to reduce the time spent coding models, leaving more time for the analysis itself.

# The Data

The data for this project was obtained on [Kaggle](https://www.kaggle.com/annetxu/health-insurance-cost-prediction). There is not much information about it on the page, but it is a simple dataset (only 7 columns) that lists characteristics of the individuals and their insurance charges over an unspecified period. For ease of access, I have downloaded the dataset and included it in the `data` folder of this project.


In [30]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing pycaret tools
from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

# Getting the data
df = pd.read_csv("data/insurance.csv")

# Setting the random seed for reproducibility (42: Life, the Universe, and Everything)
np.random.seed(42)

# Defining plot parameters
# plt.style.use('dark_background')
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'
plt.rcParams['font.stretch'] = 'normal'
plt.rcParams['font.style'] = 'normal'
plt.rcParams['font.variant'] = 'normal'

# Checking first entries of the dataset
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [31]:
# Dataset size
df.shape

(1338, 7)

## Data variables

As mentioned above, the dataset comes with 1338 observations and 7 columns only, which are:

* `age` = The age of the individual insurance client.
* `sex` = The biological sex.
* `bmi` = Body Mass Index, a health measure based on weight divided by the squared height.
* `children` = The number of children the individual has.
* `smoker` = If they smoke or not.
* `region` = The region where they live (related to the dataset origin, other information unknown).
* `charges` = The charges incurred by the individual. *This is our target variable*.
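
As a quick illustration of the `bmi` definition in the list above, here is a minimal sketch. The function name `body_mass_index` and the example person are hypothetical, and I am assuming the usual units (kilograms and metres); the dataset already ships with BMI precomputed.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by squared height (kg/m^2)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))  # 22.86
```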

We begin by separating our train and test datasets:

In [32]:
# Creating test dataset
test = df.sample(frac=0.1)

# Creating train data by dropping test data
train = df.drop(test.index)

# Resetting indexes
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)

In [33]:
# Checking sizes
print(train.shape)
print(test.shape)

(1204, 7)
(134, 7)
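
The split above depends on the global NumPy seed set in the first cell, so it only reproduces if the cells run in the same order. A minimal sketch of an alternative, assuming we pass `random_state` directly to `sample`: `df_demo` is a hypothetical stand-in with the same number of rows as the insurance data, not the real dataset, and the exact rows selected would differ from the split used in this notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame with as many rows as the insurance dataset (1338)
df_demo = pd.DataFrame({"row_id": np.arange(1338)})

# Passing random_state makes the split independent of the global NumPy seed
test_demo = df_demo.sample(frac=0.1, random_state=42)
train_demo = df_demo.drop(test_demo.index)

# Sampling again with the same random_state selects the same rows
test_again = df_demo.sample(frac=0.1, random_state=42)

print(train_demo.shape, test_demo.shape)
print(test_demo.index.equals(test_again.index))
```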


Now, let's start with PyCaret.

# Regression with PyCaret

First, we pass our data to PyCaret.

## Creating regressor

In [34]:
# Creating regressor using PyCaret
reg = setup(data=train, target='charges')

,Description,Value
0,session_id,1637
1,Target,charges
2,Original Data,"(1204, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(842, 14)"


From this initial report, we can see that our dataset has no missing values. Thus, we can proceed to create our pipeline.
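
The same check can be done directly in pandas, independently of PyCaret. The sketch below retypes the five rows shown by `df.head()` into a small frame called `sample` (a hypothetical name) so that it runs without the CSV file; on the real data you would call `df.isnull().sum()` instead.

```python
import pandas as pd

# The five rows shown by df.head(), retyped by hand for illustration
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isnull().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values?", has_missing)
```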

## PyCaret pipeline

In [35]:
# Creating pipeline
reg = setup(data=train,
            target='charges',
            normalize=True,
            log_experiment=True,
            experiment_name='HealthInsuranceCosts'
            )

,Description,Value
0,session_id,1555
1,Target,charges
2,Original Data,"(1204, 7)"
3,Missing Values,False
4,Numeric Features,2
5,Categorical Features,4
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(842, 14)"
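
Setting `normalize=True` asks PyCaret to rescale the numeric features. As far as I know, the default method is the z-score, but that is an assumption about PyCaret's defaults rather than something this notebook verifies. A minimal NumPy sketch of z-score scaling, using the `age` and `bmi` values from the `df.head()` output:

```python
import numpy as np

# age and bmi from the five rows shown by df.head()
features = np.array([
    [19, 27.9],
    [18, 33.77],
    [28, 33.0],
    [33, 22.705],
    [32, 28.88],
], dtype=float)

# z-score: subtract each column's mean and divide by its standard deviation
means = features.mean(axis=0)
stds = features.std(axis=0)
scaled = (features - means) / stds

print(scaled.round(3))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, so features measured in different units (years, kg/m²) end up on a comparable scale.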


Now that our setup has been initialized, let us compare how each regression model performs on our dataset.

In [36]:
# Checking regression models
best = compare_models()

,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,2539.4673,21135761.992,4540.3636,0.8422,0.4174,0.2944,0.01
lightgbm,Light Gradient Boosting Machine,2814.839,22662453.8094,4716.8562,0.8321,0.5321,0.3581,0.011
catboost,CatBoost Regressor,2715.1368,23217321.2964,4774.0721,0.8257,0.4668,0.3209,0.137
rf,Random Forest Regressor,2709.3979,23461419.5539,4776.3626,0.8238,0.4566,0.3277,0.03
ada,AdaBoost Regressor,3822.8099,24953164.0648,4967.8106,0.8162,0.5821,0.6301,0.005
et,Extra Trees Regressor,2668.016,26530860.7187,5098.2335,0.8014,0.4783,0.315,0.028
xgboost,Extreme Gradient Boosting,2991.7688,28185934.3,5249.5122,0.7883,0.5226,0.3742,0.033
llar,Lasso Least Angle Regression,4199.1669,37197066.4636,6071.6675,0.7276,0.6233,0.4171,0.004
ridge,Ridge Regression,4224.9894,37274399.0,6078.0762,0.727,0.6302,0.4209,0.202
br,Bayesian Ridge,4221.9547,37278440.2135,6078.3222,0.7269,0.6231,0.4203,0.004
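
To make the column names in this table concrete, here is a minimal NumPy sketch of the six error metrics under their usual definitions. The helper `regression_metrics` is hypothetical, PyCaret's own implementations may differ in details (for example in how MAPE handles zeros), and the predictions below are made up for illustration.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute the six error metrics shown in the comparison table."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # RMSLE compares log(1 + value), so it needs non-negative inputs
    log_errors = np.log1p(y_true) - np.log1p(y_pred)
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSLE": float(np.sqrt(np.mean(log_errors ** 2))),
        "MAPE": float(np.mean(np.abs(errors / y_true))),
    }


# True charges from the df.head() output; the predictions are made up
y_true = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
y_pred = [15000.0, 2500.0, 5000.0, 18000.0, 4200.0]

for name, value in regression_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

Lower is better for every metric except R2, where values closer to 1 are better.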


From the comparison, the Gradient Boosting Regressor achieved the best score on every error metric (MAE, MSE, RMSE, R2, RMSLE and MAPE); only its training time was not the lowest. Let's see how the parameters were set in this model:

In [37]:
# Printing parameters for the best model
print(best)

GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=1555, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)


With our best model identified, we now build it explicitly with `create_model()`, which trains the model and reports its score on each cross-validation fold. For comparison purposes, we will also try two of the runners-up. Since LightGBM is generally not recommended for datasets with fewer than 10,000 observations, we skip it and compare GBR with the CatBoost and Random Forest regressors instead.

## The Gradient Boosting Regressor

The Gradient Boosting Regressor, or simply GBR, is a powerful machine learning model used for predictions. It builds a sequence of weak learners and adds them together into an ensemble of predictors that minimizes a loss function<sup><a href="https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/">1</a>,<a href="https://en.wikipedia.org/wiki/Gradient_boosting">2</a></sup>.

It basically has three components<sup><a href="https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/">1</a></sup>:

1. A loss function;
2. Weak learners;
3. The additive model in which the weak learners are added to minimize the loss function.

In this model, the weak learners are added one at a time, while the existing ones remain unchanged. Each new learner is chosen through a [gradient descent](https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html) procedure (an iterative method for minimizing a function), so that adding it reduces the loss.
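
The three components can be sketched from scratch in NumPy. This is a simplified illustration, not scikit-learn's implementation: it uses squared error as the loss, single-split "stumps" as weak learners instead of the depth-3 trees printed earlier, one feature only, and synthetic data. All function names are hypothetical; the defaults `n_estimators=100` and `learning_rate=0.1` mirror the parameters printed for the best model.

```python
import numpy as np


def fit_stump(x, residuals):
    """Weak learner: the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def fit_gradient_boosting(x, y, n_estimators=100, learning_rate=0.1):
    """Additive model: each new stump is fitted to the current residuals."""
    init = y.mean()  # initial constant prediction
    prediction = np.full(len(y), init)
    stumps = []
    for _ in range(n_estimators):
        residuals = y - prediction  # negative gradient of 0.5 * squared error
        stump = fit_stump(x, residuals)
        stumps.append(stump)  # earlier stumps are never modified
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return init, learning_rate, stumps


def predict_gradient_boosting(model, x):
    init, learning_rate, stumps = model
    prediction = np.full(len(x), init)
    for stump in stumps:
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return prediction


# Synthetic one-feature data (not the insurance dataset)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = fit_gradient_boosting(x, y)
baseline_mse = np.mean((y - y.mean()) ** 2)
boosted_mse = np.mean((y - predict_gradient_boosting(model, x)) ** 2)
print(f"MSE of the constant prediction: {baseline_mse:.4f}")
print(f"MSE after boosting:             {boosted_mse:.4f}")
```

For squared error, the residuals are exactly the negative gradient of the loss, which is why fitting each stump to the residuals amounts to a gradient descent step.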

### Creating our model

In [38]:
# Creating first model
gbr = create_model('gbr')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2738.7052,24341200.2327,4933.6802,0.7817,0.4145,0.2716
1,2481.7072,20160172.5428,4490.0081,0.8579,0.4245,0.3208
2,2325.8705,18656503.2155,4319.3174,0.8973,0.4266,0.2659
3,2523.5614,18969543.6851,4355.404,0.8146,0.4047,0.3177
4,2074.9967,11538916.9453,3396.8981,0.931,0.3416,0.2782
5,2946.6367,31932549.9785,5650.8893,0.7243,0.5238,0.2831
6,3253.5243,31198907.148,5585.5982,0.7494,0.4869,0.3664
7,2088.1727,13619574.974,3690.4708,0.8859,0.3322,0.235
8,2189.7268,15519168.0314,3939.4375,0.9056,0.3286,0.2799
9,2771.7719,25421083.1662,5041.9325,0.8745,0.4903,0.3251


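Averaged over the ten folds, these scores match the result we had previously from `compare_models()` (for example, a mean R2 of 0.8422 and a mean MAE of 2539.4673), as the model has been instantiated using the same hyperparameters.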
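
Each row of the table above corresponds to one cross-validation fold: the model is trained on nine tenths of the training data and scored on the remaining tenth. A minimal NumPy sketch of that procedure, assuming a plain shuffled 10-fold split (PyCaret's exact splitting strategy may differ) and using an ordinary least squares model on synthetic data as a stand-in for GBR:

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=10, seed=42):
    """Shuffle the row indices and cut them into n_splits validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_splits)


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic data with 842 rows, the size of the transformed train set
rng = np.random.default_rng(0)
X = rng.normal(size=(842, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=842)

folds = k_fold_indices(len(y))
scores = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit least squares (with intercept) on the nine training folds
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    # Score on the held-out fold
    A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
    scores.append(r2_score(y[val_idx], A_val @ coef))

print(np.round(scores, 4))
print("Mean R2:", round(float(np.mean(scores)), 4))
```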

### Tuning the GBR model

In [39]:
# Creating tuned model
tuned_gbr = tune_model(gbr, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2707.3854,23269902.6268,4823.8887,0.7913,0.4045,0.2845
1,2481.1552,18772888.7051,4332.7692,0.8677,0.4609,0.3921
2,2384.9571,17933399.8974,4234.7845,0.9013,0.4587,0.3302
3,2538.4465,18320903.8021,4280.2925,0.821,0.4115,0.3357
4,2328.7713,12196671.0178,3492.3733,0.9271,0.3964,0.3511
5,2800.4239,29421213.9566,5424.1326,0.746,0.5227,0.3323
6,3127.674,28966789.6299,5382.0804,0.7673,0.4635,0.3562
7,2112.7539,14040107.2197,3747.0131,0.8824,0.4163,0.3051
8,2402.9425,15910158.2335,3988.754,0.9032,0.4137,0.387
9,2719.8435,23944500.8022,4893.3118,0.8818,0.4749,0.3594
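
The `tune_model` call above tries `n_iter=100` hyperparameter combinations and keeps the one with the best cross-validated R2. To my understanding this is a random search, so the idea can be sketched with scikit-learn's `RandomizedSearchCV`. The search space, the synthetic data and the smaller `n_iter` and `cv` values below are my own assumptions, chosen to keep the example fast; they are not the grid PyCaret uses.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (not the insurance dataset)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Hypothetical search space for illustration
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of random combinations to try
    scoring="r2",       # same metric as optimize='R2'
    cv=5,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print("Best mean CV R2:", round(search.best_score_, 4))
```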


## Trying CatBoost and Random Forest Regressors

In [40]:
## CatBoost

# Creating model
cat = create_model('catboost')

# Tuning model
tuned_cat = tune_model(cat, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2863.6522,23553623.4099,4853.2075,0.7887,0.395,0.2704
1,2391.4409,19483001.9546,4413.9554,0.8627,0.4033,0.3067
2,2522.2848,18953729.2614,4353.5881,0.8957,0.4338,0.2856
3,2354.0851,17758084.4517,4214.0342,0.8265,0.3716,0.2715
4,2227.7487,12818967.0768,3580.3585,0.9233,0.3595,0.2938
5,2846.5038,30158597.7807,5491.6844,0.7396,0.5103,0.277
6,2912.8299,27154423.9789,5210.9907,0.7819,0.4265,0.3003
7,2286.6014,15436661.3115,3928.9517,0.8707,0.406,0.2738
8,2168.1594,14249342.5524,3774.8301,0.9133,0.3405,0.3028
9,2716.5632,22786209.0322,4773.4902,0.8875,0.4639,0.3243


In [41]:
## Random Forest

# Creating model
rf = create_model('rf')

# Tuning model
tuned_rf = tune_model(rf, optimize='R2', choose_better=True, n_iter=100)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2674.9982,22629109.7389,4757.0064,0.797,0.3969,0.2602
1,2387.1684,18609215.9893,4313.8401,0.8688,0.4059,0.3157
2,2455.0424,18376693.4143,4286.8046,0.8989,0.4196,0.2602
3,2483.2129,18406675.1355,4290.3001,0.8201,0.3795,0.2856
4,2149.2056,12066127.4626,3473.6332,0.9278,0.3965,0.3214
5,2778.9318,30992592.839,5567.0991,0.7324,0.4924,0.251
6,3308.2958,31008606.6404,5568.5372,0.7509,0.5114,0.4032
7,1969.2737,13716073.4226,3703.5218,0.8851,0.3724,0.2412
8,2198.5696,14674567.4233,3830.7398,0.9108,0.3221,0.2695
9,2528.7589,23004308.1972,4796.2807,0.8864,0.4372,0.2731
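
To compare the candidates side by side, we can average the per-fold R2 values from the four tables above. The numbers below are copied by hand from those tables; the dictionary keys are my own labels, and I am assuming that the single table printed after the CatBoost and Random Forest cells belongs to the last call in each cell.

```python
import numpy as np

# Per-fold R2 values copied from the tables printed in this notebook
r2_by_model = {
    "GBR (create_model)": [0.7817, 0.8579, 0.8973, 0.8146, 0.931,
                           0.7243, 0.7494, 0.8859, 0.9056, 0.8745],
    "GBR (tune_model)": [0.7913, 0.8677, 0.9013, 0.821, 0.9271,
                         0.746, 0.7673, 0.8824, 0.9032, 0.8818],
    "CatBoost cell": [0.7887, 0.8627, 0.8957, 0.8265, 0.9233,
                      0.7396, 0.7819, 0.8707, 0.9133, 0.8875],
    "Random Forest cell": [0.797, 0.8688, 0.8989, 0.8201, 0.9278,
                           0.7324, 0.7509, 0.8851, 0.9108, 0.8864],
}

mean_r2 = {name: float(np.mean(scores)) for name, scores in r2_by_model.items()}

for name, scores in r2_by_model.items():
    print(f"{name}: mean R2 = {np.mean(scores):.4f}, std = {np.std(scores):.4f}")
```

The means come out at roughly 0.8422, 0.8489, 0.8490 and 0.8478, so the differences between the tuned candidates are small compared with the fold-to-fold variation (R2 ranges from about 0.73 to 0.93 across folds).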


# Regression Project

As we said before, our templates will get simpler and simpler!

The goal of this project is to build a regression model to predict the cost of health insurance, using [this data from Kaggle](https://www.kaggle.com/annetxu/health-insurance-cost-prediction).

Here, as always, we want to give you as much autonomy and independence as possible. Download the data, make it available, load it here, and develop the project.

## Objectives

* Acquire the data and make it available
* Carry out a complete analysis of the data and of the problem (by hand, or with Pandas Profiling, SweetViz, etc.)
* Develop an efficient machine learning solution for regression with PyCaret

REMEMBER: documentation, storytelling, an article with code!

Get to work, and good luck!


# References

1: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

2: https://en.wikipedia.org/wiki/Gradient_boosting