---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/Credit-Card-Fraud-Detection-/main/images/myself.png" align=right>

# **Health Insurance Cost Prediction Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

**Health insurance** is a contract that covers medical expenses in exchange for an annual fee. It protects you from unexpected medical costs and offers many other benefits.

This project uses individual customer information (age, BMI, smoking status, etc.) together with the annual cost of each customer's health insurance to build a model that estimates an appropriate price for a policy from the individual's data. Several **supervised machine learning models** will be tested; the one with the lowest error will be selected and then fine-tuned to improve its predictions.
<p align="center">
<img width=50% src="https://github.com/gabrielcapela/AutoML-Projects/blob/main/Regression/images/national-cancer-institute-NFvdKIhxYlU-unsplash.jpg?raw=true">
</p>

The purpose of this project is to apply **Automated Machine Learning**, in order to demonstrate the practicality of this type of tool. The data used was taken from [Kaggle](https://www.kaggle.com/datasets/annetxu/health-insurance-cost-prediction/data).

# Business Understanding 

In the health insurance sector, pricing plans is a **critical challenge** for insurers. The cost of health insurance varies according to several factors, such as age, medical history, lifestyle habits and demographic characteristics of policyholders. Inadequate pricing can lead to financial losses for the company or uncompetitive prices, impacting customer retention and acquisition.

This project seeks to develop a Machine Learning model capable of **accurately predicting health insurance costs** based on individual characteristics of policyholders. Automating this process with **AutoML** helps identify hidden patterns in the data and optimize pricing efficiently, reducing errors and increasing transparency in decisions.

The main objectives of this study are:
*   Improving cost predictability – Creating a model capable of estimating insurance costs with a high degree of accuracy.

*   Supporting decision-making – Helping insurers define fairer and more sustainable prices.

*   Exploring the influence of risk factors – Identifying which variables have the most significant impact on insurance costs.

This data-driven approach is expected not only to improve the efficiency of the sector, but also to offer customers prices that better match their risk profile.

# Data Understanding

The dataset used can be downloaded from this [page](https://github.com/gabrielcapela/AutoML-Projects/blob/main/Regression/insurance.csv).

## Obtaining and Summary Analysis of Data

Let's start by importing the data and previewing the first few rows to illustrate the meaning of each column:

In [82]:
# Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [110]:
# Importing the dataset
df = pd.read_csv('https://raw.githubusercontent.com/gabrielcapela/AutoML-Projects/refs/heads/main/Regression/insurance.csv')
# Reporting the dataset dimensions
print(f"The data has {df.shape[0]} rows and {df.shape[1]} variables")
# Showing the first 5 rows
df.head()

The data has 1338 rows and 7 variables


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


Below is the meaning of each variable:

* **age**: The age of the insured individual in years.

* **sex**: The gender of the insured individual (male or female).

* **bmi**: Body Mass Index (BMI), a measure of weight relative to height.

* **children**: The number of children the insured has.

* **smoker**: Indicates whether the insured individual is a smoker (yes or no).

* **region**: The geographic region where the insured individual resides.

* **charges**: The total annual health insurance cost (in dollars) for the individual, which is the target variable in this regression model.

## Pandas Profiling

In line with the AutoML philosophy, I will use **Pandas Profiling** (now distributed as the `ydata_profiling` package) in the Data Understanding phase. Pandas Profiling **automates the generation of comprehensive Exploratory Data Analysis (EDA) reports**: it summarizes key statistics, identifies missing values, detects correlations, and visualizes distributions, giving a quick yet thorough view of the dataset. The goal is to **increase productivity and reduce manual effort**, leaving more time for interpreting results instead of performing repetitive EDA tasks.

In [134]:
#Importing the required package
from ydata_profiling import ProfileReport

#Generating the report using Pandas Profiling
profile = ProfileReport(df, explorative=True)

# Saving the report as an HTML file
profile.to_file("insurance_report.html")


Click [**HERE**](https://gabrielcapela.github.io/health_insurance_cost_prediction/insurance_report.html) to see the report

**Some observations** can already be made:

*   The variables **sex**, **smoker** and **region** are categorical; the first two are binary, and **region** has four classes.

*   **Charges** is strongly correlated with **age** and **smoker**.

*   **Children** has 574 (42.9%) zeros. These are not missing values; they are policyholders without children.

*   The **smoker** variable is imbalanced: only about 20% of the records are smokers.
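
The observations above can also be checked directly with pandas instead of being read off the report. The sketch below is only illustrative: `preview` is a hypothetical five-row frame copied from the `df.head()` output shown earlier, so that the example runs without a download. The same expressions can be applied to the full `df`.

```python
import pandas as pd

# Hypothetical mini-frame: the five rows shown by df.head() earlier.
preview = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Number of distinct classes in each categorical column.
n_classes = preview[["sex", "smoker", "region"]].nunique()

# Share of rows with zero children (zeros are valid values, not missing data).
zero_children_share = (preview["children"] == 0).mean()

# Class balance of the smoker flag.
smoker_share = preview["smoker"].value_counts(normalize=True)

# Missing values per column.
missing = preview.isna().sum()

print(n_classes, zero_children_share, smoker_share, missing, sep="\n\n")
```

With only five rows the numbers differ from the report (for example, only three of the four regions appear), so this is a way to reproduce the checks, not the figures themselves.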

# Data Preparation

## About PyCaret

**PyCaret** is a Python AutoML library that simplifies the process of building, testing, and optimizing machine learning models. It automates tasks such as model selection, hyperparameter tuning, and performance evaluation, making the workflow faster and more accessible.

For more details, visit [PyCaret](https://pycaret.org).

By default, PyCaret performs several **data preprocessing steps automatically**, such as missing value imputation and categorical variable encoding; others, such as feature scaling and outlier removal, can be switched on through setup() parameters. Manual preprocessing can sometimes lead to better results, but we will stick to PyCaret’s built-in capabilities to **evaluate the efficiency of AutoML**.

## Division of data

Although PyCaret's setup() function already splits the data into training and test sets, **setting aside a separate test set before** handing the data to AutoML makes the model evaluation more reliable and robust. The main benefits are:

*   Avoiding data leakage – If the model is selected and tuned against PyCaret's internal test set, it can indirectly "learn" patterns that it would not have access to in a real application.

*   Having an external set for final validation – This allows a more realistic evaluation of the model's performance before putting it into production.
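
A minimal sketch of such a pre-split, assuming a hypothetical ten-row frame named `demo` in place of the insurance data. The seed passed to `random_state` is an arbitrary choice; its only purpose is to make the split repeatable, so that later evaluations always refer to the same held-out rows.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data: 10 rows, 2 columns.
demo = pd.DataFrame({"age": range(20, 30), "charges": range(1000, 11000, 1000)})

# Hold out 20% of the rows; the seed (an arbitrary choice) makes the split repeatable.
test_demo = demo.sample(frac=0.20, random_state=42)
train_demo = demo.drop(test_demo.index)

# The two parts must be disjoint and together cover every row.
overlap = set(test_demo.index) & set(train_demo.index)
print(len(train_demo), len(test_demo), len(overlap))
```

Checking that the overlap is empty is a cheap guard against accidentally evaluating on rows the model has already seen.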

In [113]:
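# Separating the data into test (20%) and training (80%) sets
# random_state fixes the shuffle so the split is reproducible (42 is an arbitrary seed)
test = df.sample(frac=0.20, random_state=42)
train = df.drop(test.index)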

test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)

print(test.shape)
print(train.shape)

(268, 7)
(1070, 7)


# Modeling

In [114]:
# Importing the necessary packages
from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

## Defining the setup

PyCaret's setup() is the main function for initializing a machine learning experiment. It prepares the data, sets up the pipeline configuration, and applies preprocessing automatically.

In [115]:
# Creating the PyCaret setup
reg = setup(data=train, target='charges')

Unnamed: 0,Description,Value
0,Session id,7601
1,Target,charges
2,Target type,Regression
3,Original data shape,"(1070, 7)"
4,Transformed data shape,"(1070, 10)"
5,Transformed train set shape,"(749, 10)"
6,Transformed test set shape,"(321, 10)"
7,Numeric features,3
8,Categorical features,3
9,Preprocess,True


## Creating the pipeline


The pipeline is an **automated preprocessing and modeling workflow** that ensures that all transformations applied to the training data are replicated on the test data and on new predictions.

When we call setup(), PyCaret creates a pipeline with steps such as missing value imputation and categorical variable encoding; optional steps such as normalization, outlier removal, and feature selection are added only when requested. Below, setup() is called again with normalize=True so that feature scaling is included. This pipeline is automatically applied to all trained models, ensuring consistency and **eliminating the need to repeat these steps manually**.
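
To make these transformations concrete, here is a rough pandas approximation of the kind of steps such a pipeline applies. It is not PyCaret's internal code: the frame `raw`, the 0/1 mappings and the z-score formula are assumptions chosen for illustration, using the five preview rows from earlier.

```python
import pandas as pd

# Hypothetical mini-frame with the same feature columns as the insurance data.
raw = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

features = raw.copy()

# Binary categoricals -> 0/1 integer codes (ordinal encoding).
features["sex"] = features["sex"].map({"female": 0, "male": 1})
features["smoker"] = features["smoker"].map({"no": 0, "yes": 1})

# Multi-class categorical -> one indicator column per class (one-hot encoding).
features = pd.get_dummies(features, columns=["region"], dtype=float)

# z-score scaling of the numeric columns (population standard deviation).
for col in ["age", "bmi", "children"]:
    features[col] = (raw[col] - raw[col].mean()) / raw[col].std(ddof=0)

print(features.shape)
print(features.columns.tolist())
```

One-hot encoding is why the transformed data has more columns than the original: each region becomes its own indicator column.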

In [116]:
# Creating the pipeline
reg = setup(data = train,
            target = 'charges',
            normalize = True,
            #log_experiment = True,
            experiment_name = 'test_01')

Unnamed: 0,Description,Value
0,Session id,6253
1,Target,charges
2,Target type,Regression
3,Original data shape,"(1070, 7)"
4,Transformed data shape,"(1070, 10)"
5,Transformed train set shape,"(749, 10)"
6,Transformed test set shape,"(321, 10)"
7,Numeric features,3
8,Categorical features,3
9,Preprocess,True


## Comparing the models

PyCaret's **compare_models()** function automatically evaluates multiple regression models and compares them based on performance metrics. It uses **10-fold cross-validation** (fold=10), ensuring a more robust evaluation by splitting the training data into 10 parts and averaging the results.
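
As a rough illustration of how the rows are divided, the sketch below builds 10 contiguous folds in plain Python. The helper `k_fold_indices` is hypothetical and written only for this example; PyCaret relies on its own cross-validation machinery, which may shuffle or size folds differently.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, valid
        start += size

# 749 rows is the training-split size reported by setup() above.
folds = list(k_fold_indices(749, 10))
fold_sizes = [len(valid) for _, valid in folds]
print(fold_sizes)
```

Each model is fitted 10 times, each time validated on a different fold, and the reported metric is the mean of the 10 scores.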

By default, models are evaluated using the following metrics:

*   R2 (Coefficient of Determination)
*   RMSE (Root Mean Squared Error)
*   MAE (Mean Absolute Error)
*   MSE (Mean Squared Error)
*   RMSLE (Root Mean Squared Log Error)
*   MAPE (Mean Absolute Percentage Error)

This function allows you to quickly identify the most promising model without having to train each one manually.
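
For reference, the six metrics listed above can be written out with their usual textbook definitions. The function name `regression_metrics` and the sample values are hypothetical; this is a sketch for intuition, not PyCaret's implementation.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, R2, RMSLE and MAPE for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    # RMSLE compares log(1 + value), so it needs non-negative inputs.
    rmsle = math.sqrt(
        sum((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    # MAPE is the mean relative error; it is undefined when a true value is 0.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "RMSLE": rmsle, "MAPE": mape}

# Hypothetical charges and predictions, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 380.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE, MSE and RMSE are expressed in the units of the target (dollars, or dollars squared for MSE), while R2, RMSLE and MAPE are scale-free, which makes them easier to compare across datasets.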

In [135]:
import logging

# Suppressing the log output from LightGBM and other libraries
logging.getLogger("pycaret").setLevel(logging.WARNING)
logging.getLogger("lightgbm").setLevel(logging.WARNING)

# Now running compare_models
best = compare_models()


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
gbr,Gradient Boosting Regressor,2494.3084,20606071.3122,4459.9776,0.8458,0.4245,0.288,0.183
rf,Random Forest Regressor,2643.5354,23043346.7037,4726.1899,0.8255,0.4571,0.3151,0.311
lightgbm,Light Gradient Boosting Machine,2839.1794,23973797.8786,4823.7448,0.8204,0.5623,0.3551,0.387
et,Extra Trees Regressor,2656.9173,25964641.199,5027.3925,0.8038,0.4768,0.3224,0.265
ada,AdaBoost Regressor,4170.5466,27448761.4954,5207.6281,0.7908,0.6123,0.701,0.1
knn,K Neighbors Regressor,3434.3103,30726879.2,5525.411,0.763,0.4859,0.3695,0.097
br,Bayesian Ridge,4307.53,38105732.7398,6151.5977,0.7158,0.5525,0.4358,0.127
ridge,Ridge Regression,4305.479,38106179.9403,6151.8287,0.7157,0.551,0.4353,0.08
llar,Lasso Least Angle Regression,4304.1731,38106848.9015,6151.9776,0.7157,0.5509,0.4349,0.081
lar,Least Angle Regression,4304.3603,38107834.4706,6152.0693,0.7157,0.5509,0.435,0.092


By default, the grid is sorted by R2, from highest to lowest.

In [136]:
# Checking the best model 
print(best)

GradientBoostingRegressor(random_state=6253)


## Instantiating the model

PyCaret's compare_models() function evaluates and compares different regression algorithms, displays a grid with their cross-validated metrics, and returns the best-ranked estimator (the `best` object printed above). It **does not show the fold-by-fold results** of any single model, however. To train one specific algorithm and inspect its performance in each fold, **select and instantiate it** with the create_model() function.

The chosen model is the **Gradient Boosting Regressor**, which is based on decision trees that **combine multiple weak estimators to create a stronger and more accurate model**. It works by training the trees sequentially, where each new tree learns to correct the errors of the previous ones, reducing the residual error of the model.
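
The idea of trees correcting each other's errors can be sketched in a few lines of plain Python. This toy version uses one-split trees ("stumps") on a single feature and squared-error loss; all function names and the data are hypothetical, and scikit-learn's GradientBoostingRegressor is considerably more elaborate.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_trees, learning_rate=0.1):
    """Boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)                      # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi) for xi, pi in zip(x, preds)]
    return base, stumps, preds

def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)

# Hypothetical toy data: charges rising with age.
ages = [18, 25, 32, 40, 48, 55, 63]
charges = [1700.0, 2900.0, 4400.0, 6200.0, 8800.0, 11500.0, 14000.0]

for n in (1, 10, 100):
    _, _, fitted = gradient_boost(ages, charges, n_trees=n)
    print(n, round(mse(charges, fitted), 1))
```

The training error shrinks as stumps are added, because each one is fitted to what the previous ones left unexplained. The learning rate damps each correction, which in practice trades training speed for better generalization.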

In [137]:
# Instantiating the model
gbr = create_model('gbr')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,2692.0808,25258915.6574,5025.8249,0.8378,0.4651,0.2847
1,2627.5565,21212231.2786,4605.6738,0.8843,0.3967,0.2439
2,2914.5591,30823481.7479,5551.8899,0.8179,0.4883,0.3436
3,1999.9408,13476677.2338,3671.0594,0.9179,0.3613,0.2777
4,3009.0081,34314709.4011,5857.8758,0.68,0.5787,0.2448
5,2414.9851,19645954.7801,4432.3757,0.8399,0.4593,0.3501
6,2722.7516,23964166.2875,4895.3209,0.8103,0.4066,0.3088
7,2502.4707,15051773.0486,3879.6615,0.918,0.3691,0.3014
8,2097.4436,11009028.2647,3317.9856,0.8705,0.3071,0.2518
9,1962.2882,11303775.4218,3362.1088,0.8813,0.4131,0.2735


## Model Tuning

The Model Tuning step in PyCaret consists of **adjusting the model's hyperparameters** to optimize its performance. Instead of keeping the default values, tune_model() **searches for a better combination of parameters**; by default it runs a random search that samples 10 candidate combinations from a predefined grid and scores each one with 10-fold cross-validation.
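
A minimal sketch of a random search that keeps the baseline when no candidate beats it. The search space, the baseline parameters and `toy_score` (a stand-in for a cross-validated R2) are all hypothetical and not taken from PyCaret.

```python
import random

def random_search(score_fn, grid, baseline_params, n_iter=10, seed=0):
    """Try n_iter random combinations from grid; keep the baseline if nothing beats it."""
    rng = random.Random(seed)
    best_params, best_score = baseline_params, score_fn(baseline_params)
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

# Hypothetical search space for a boosting model.
grid = {
    "n_estimators": [50, 100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "max_depth": [2, 3, 5, 8],
}

def toy_score(params):
    # Hypothetical score that peaks at learning_rate=0.1, max_depth=3, n_estimators=100.
    return (
        0.85
        - abs(params["learning_rate"] - 0.1)
        - 0.01 * abs(params["max_depth"] - 3)
        - 0.0001 * abs(params["n_estimators"] - 100)
    )

defaults = {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3}
best_params, best_score = random_search(toy_score, grid, defaults)
print(best_params, round(best_score, 4))
```

Because the hypothetical baseline already sits at the peak of `toy_score`, the search returns it unchanged. A search of this kind can only match or improve on its starting point; it is not guaranteed to find something better.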

In [138]:
# Hyperparameter tuning
tuned_gbr = tune_model(gbr, optimize='R2')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,3018.7954,27184014.1774,5213.8291,0.8254,0.4952,0.3282
1,2590.9251,21081326.1634,4591.4405,0.885,0.3957,0.2457
2,3006.9781,32118494.6325,5667.3181,0.8103,0.4703,0.3083
3,1967.5294,13712204.3697,3702.9994,0.9164,0.3939,0.3093
4,3082.5206,36577248.927,6047.9128,0.6589,0.585,0.2471
5,2400.0926,19009689.9243,4360.0103,0.845,0.4108,0.3131
6,2751.8003,24506509.1942,4950.405,0.806,0.4198,0.3211
7,2697.1972,18040227.5162,4247.3789,0.9017,0.377,0.3157
8,2166.6087,11732893.192,3425.3311,0.862,0.3032,0.2468
9,2159.3293,13238070.1823,3638.4159,0.861,0.4722,0.3057


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In this case, the tuned model did not outperform the original, so PyCaret returned the original model and `tuned_gbr` holds the untuned Gradient Boosting Regressor. Even so, it is worth **highlighting the importance and practicality** of this function.

# Evaluation

## Interactive Model Evaluation

The evaluate_model() function in PyCaret allows you to interactively visualize the performance of the fitted model. It generates several graphs and metrics that **help analyze the quality of predictions and identify potential problems**, such as overfitting.

In [125]:
# Evaluating the model
evaluate_model(tuned_gbr)

## Making the predictions

The predict_model() function in PyCaret is used to **generate predictions with a previously trained and tuned model**. It can be applied to both test data and new data sets to evaluate the model's performance.

When called without a specific data set, predict_model(tuned_gbr) **returns predictions for the test data automatically separated by PyCaret**.

Later on, we will also use it for data separated at the beginning, never exposed to PyCaret.

In [126]:
# Making the predictions
predict_model(tuned_gbr);

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2597.8849,23370476.3041,4834.302,0.8639,0.4358,0.3044


The R2 on PyCaret's hold-out set is 0.8639, **very similar** to the mean cross-validated R2 of 0.8458 obtained on the training split.

## Finalizing the model

The finalize_model() function in PyCaret trains the selected model on the **entire dataset passed to setup()**, so that it uses as much information as possible before deployment.

By default, PyCaret sets aside a portion of the data for testing, but **when finalizing the model, it is re-trained on 100% of that data**, hold-out portion included, before being applied to new data.

In [100]:
# Finalizing the model
final_gbr = finalize_model(tuned_gbr)

In [128]:
type(final_gbr)

pycaret.internal.pipeline.Pipeline

In [None]:
# Making predictions with the model trained with all the data provided to PyCaret
pred_holdout = predict_model(final_gbr, data=train)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2168.0185,14726116.5617,3837.4623,0.902,0.3537,0.2546


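As expected, the metrics are better, because the model is being evaluated on the same data it was trained on. This in-sample score is optimistic (it rewards **overfitting**) and should not be read as an estimate of performance on new data.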
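
The gap between in-sample and out-of-sample error is easiest to see with a model that simply memorizes its training data. The nearest-neighbour predictor and the numbers below are hypothetical and unrelated to the fitted Gradient Boosting model.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Predict with the target of the closest training point (pure memorization)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ages and charges, split into seen and unseen rows.
seen_x, seen_y = [20, 30, 40, 50], [2000.0, 3500.0, 6000.0, 9000.0]
unseen_x, unseen_y = [25, 45], [2600.0, 7800.0]

in_sample = mean_abs_error(
    seen_y, [nearest_neighbour_predict(seen_x, seen_y, x) for x in seen_x]
)
out_of_sample = mean_abs_error(
    unseen_y, [nearest_neighbour_predict(seen_x, seen_y, x) for x in unseen_x]
)
print(in_sample, out_of_sample)
```

The memorizing model makes no error on rows it has seen and a clearly larger one on rows it has not, which is why only scores on unseen data say anything about generalization.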

In [130]:
# Checking the parameters
print(final_gbr)

Pipeline(memory=Memory(location=None),
         steps=[('numerical_imputer',
                 TransformerWrapper(include=['age', 'bmi', 'children'],
                                    transformer=SimpleImputer())),
                ('categorical_imputer',
                 TransformerWrapper(include=['sex', 'smoker', 'region'],
                                    transformer=SimpleImputer(strategy='most_frequent'))),
                ('ordinal_encoding',
                 TransformerWrapper(include=['sex', 'smoker'],
                                    transfor...
                                                                        {'col': 'smoker',
                                                                         'data_type': dtype('O'),
                                                                         'mapping': no     0
yes    1
NaN   -1
dtype: int64}]))),
                ('onehot_encoding',
                 TransformerWrapper(include=['region'],
                      

## Predicting on new data


We will now use the **dataset set aside before using PyCaret** to check the reliability of the model.

In [None]:
# Making Predictions with Unseen Data
unseen_predictions = predict_model(final_gbr, data=test)
unseen_predictions.head()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2338.6557,16011178.8905,4001.3971,0.8718,0.4091,0.2845


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,prediction_label
0,18,female,28.215,0,no,northeast,2200.830811,3907.452642
1,26,male,32.900002,2,yes,southwest,36085.21875,34719.015207
2,50,male,32.299999,2,no,southwest,9630.397461,11540.35269
3,49,female,27.17,0,no,southeast,8601.329102,11452.363425
4,57,female,38.0,2,no,southwest,12646.207031,14812.582825


The R2 is 0.8718, **very similar** to the result obtained on PyCaret's test set (0.8639), which indicates that the model generalizes acceptably to unseen data.

# Saving the model

The save_model() function is used to save a trained and finalized model, allowing it to be easily loaded and used later without the need for retraining. This is especially **useful for production deployment**, where the model can be saved and reused to make predictions in real time or on new datasets.

The model is saved as a pickle file (.pkl), which is easy to store and to move between environments, provided they have compatible versions of Python and of the libraries used.
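
The save-and-reload round trip can be illustrated with Python's standard `pickle` module. The dictionary `state` is a hypothetical stand-in for a fitted pipeline, and its values are arbitrary; PyCaret handles the serialization of the real object internally.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: just its learned parameters.
state = {"model": "mean_baseline", "mean_charge": 13270.0}

def predict(model_state, rows):
    """Predict the stored mean for every row (illustration only)."""
    return [model_state["mean_charge"] for _ in rows]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_model.pkl")
    with open(path, "wb") as fh:
        pickle.dump(state, fh)          # serialize the object to disk
    with open(path, "rb") as fh:
        restored = pickle.load(fh)      # rebuild it without retraining

print(predict(restored, [{"age": 40}, {"age": 52}]))
```

The restored object is a new copy with the same contents, so it produces the same predictions as the original. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.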

In [131]:
save_model(final_gbr,'Modelo_Final_03_06_25')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['age', 'bmi', 'children'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['sex', 'smoker', 'region'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordinal_encoding',
                  TransformerWrapper(include=['sex', 'smoker'],
                                     transfor...
                                                                         {'col': 'smoker',
                                                                          'data_type': dtype('O'),
                                                                          'mapping': no     0
 yes    1
 NaN   -1
 dtype: int64}]))),
                 ('onehot_encoding',
                  TransformerWrapper(include=['region'],
    

## Loading a model

In [132]:
# Loading a saved model
saved_final_gbr = load_model('Modelo_Final_03_06_25')

Transformation Pipeline and Model Successfully Loaded


In [133]:
# Making new predictions with the saved model
new_prediction = predict_model(saved_final_gbr, data=test)
new_prediction.head()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,2227.6783,14881735.4755,3857.6852,0.8866,0.3522,0.2624


Unnamed: 0,age,sex,bmi,children,smoker,region,charges,prediction_label
0,50,male,34.200001,2,yes,southwest,42856.839844,42078.306828
1,27,female,24.1,0,no,southwest,2974.125977,7262.706422
2,51,male,33.330002,3,no,southeast,10560.492188,11780.298853
3,38,female,34.799999,2,no,southwest,6571.543945,8053.835581
4,27,male,32.584999,3,no,northeast,4846.919922,7291.681332
