---


<img width=25% src="https://raw.githubusercontent.com/gabrielcapela/Credit-Card-Fraud-Detection-/main/images/myself.png" align=right>

# **Health Insurance Cost Prediction Project**

*by Gabriel Capela*

[<img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white"/>](https://www.linkedin.com/in/gabrielcapela)
[<img src="https://img.shields.io/badge/Medium-12100E?style=for-the-badge&logo=medium&logoColor=white" />](https://medium.com/@gabrielcapela)

---

**Health insurance** is a contract that covers medical expenses in exchange for an annual fee. It protects you from unexpected medical costs and offers many other benefits.

This project uses individual customer information (age, BMI, smoking status, etc.) together with the annual cost of each customer's health insurance to build a model that estimates an appropriate insurance price from an individual's data. Several **supervised machine learning models** will be tested; the one with the lowest error will be selected and then fine-tuned to improve its predictions.
<p align="center">
<img width=50% src="https://github.com/gabrielcapela/AutoML-Projects/blob/main/Regression/images/national-cancer-institute-NFvdKIhxYlU-unsplash.jpg?raw=true">
</p>

The purpose of this project is to apply **Automated Machine Learning** (AutoML) and demonstrate how practical this type of tool is. The data was taken from [Kaggle](https://www.kaggle.com/datasets/annetxu/health-insurance-cost-prediction/data), and the Cross-Industry Standard Process for Data Mining ([CRISP-DM](https://www.ibm.com/docs/pt-br/spss-modeler/saas?topic=dm-crisp-help-overview)) methodology guides the stages of this project.

# Business Understanding 

# Data Understanding

The dataset can be downloaded from this [page](https://github.com/gabrielcapela/AutoML-Projects/blob/main/Regression/insurance.csv).

## Obtaining and Summary Analysis of Data

In [6]:
# Importing the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
# Importing the dataset
df = pd.read_csv('https://raw.githubusercontent.com/gabrielcapela/AutoML-Projects/refs/heads/main/Regression/insurance.csv')
# Showing the first 5 rows
df.head()

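|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |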


Below is the meaning of each variable:

* **age**: The age of the insured individual in years.

* **sex**: The gender of the insured individual (male or female).

* **bmi**: Body Mass Index (BMI), a measure of weight relative to height.

* **children**: The number of children covered by the insurance.

* **smoker**: Indicates whether the insured individual is a smoker (yes or no).

* **region**: The geographic region where the insured individual resides.

* **charges**: The total annual health insurance cost (in dollars) for the individual, which is the target variable in this regression model.
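
The variable descriptions above can be turned into a small sanity check. The sketch below is illustrative only: `check_schema`, `ALLOWED_VALUES` and `NUMERIC_COLUMNS` are hypothetical names introduced here, and the sample rows are copied from the `df.head()` output rather than loaded from the full file.

```python
import io

import pandas as pd

# First rows of the dataset, copied from the df.head() output above
CSV_SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""

# Hypothetical data dictionary derived from the variable list above
ALLOWED_VALUES = {
    "sex": {"male", "female"},
    "smoker": {"yes", "no"},
}
NUMERIC_COLUMNS = ["age", "bmi", "children", "charges"]


def check_schema(frame):
    """Return a list of problems; an empty list means the frame matches the description."""
    problems = []
    for column in NUMERIC_COLUMNS:
        if not pd.api.types.is_numeric_dtype(frame[column]):
            problems.append(f"{column} is not numeric")
    for column, allowed in ALLOWED_VALUES.items():
        unexpected = set(frame[column].unique()) - allowed
        if unexpected:
            problems.append(f"{column} has unexpected values: {sorted(unexpected)}")
    return problems


sample = pd.read_csv(io.StringIO(CSV_SAMPLE))
print(check_schema(sample))
```

The same function could be called on the full `df` loaded earlier, provided its columns carry the names listed above.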

## Pandas Profiling

In line with the AutoML philosophy, I will use **Pandas Profiling** (imported below from the `ydata_profiling` package) in the Data Understanding phase of this project. Pandas Profiling **automates the generation of comprehensive Exploratory Data Analysis (EDA) reports**: it summarizes important statistics, identifies missing values, detects correlations, and visualizes distributions, giving a quick yet detailed view of the dataset. The goal of this tool is to **increase productivity and reduce manual effort**, so that I can focus on interpreting the results instead of performing repetitive EDA tasks.

In [5]:
# Importing the required package
from ydata_profiling import ProfileReport

# Generating the report using Pandas Profiling
profile = ProfileReport(df, explorative=True)

# Saving the report as an HTML file
profile.to_file("insurance_report.html")

# Displaying the report inside a Jupyter Notebook (this doesn't appear on GitHub)
# profile.to_notebook_iframe()

Click [**HERE**](https://gabrielcapela.github.io/health_insurance_cost_prediction/insurance_report.html) to see the report

**Some observations** can already be made:

*   The variables **sex**, **smoker** and **region** are categorical; the first two are binary, and **region** has four classes.

*   **Charges** is highly correlated with **age** and **smoker**.

*   **Children** has 574 (42.9%) zeros, but these are not missing values; they are simply people without children.

*   The **smoker** variable is imbalanced: only about 20% of the records are positive (smoker).
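
These observations can also be reproduced directly with pandas. The sketch below is a rough cross-check, not the report's own method: `quick_checks` is a hypothetical helper, it uses plain Pearson correlation (the profiling report may use other correlation measures), and the five sample rows are far too few to be representative, so the function is meant to be called on the full `df`.

```python
import pandas as pd

# Five rows copied from the df.head() output; only the columns needed here
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def quick_checks(frame):
    """Compute the shares and correlations discussed in the observations."""
    # Encode smoker as 1/0 so it can be averaged and correlated
    smoker_flag = (frame["smoker"] == "yes").astype(int)
    return {
        "zero_children_share": float((frame["children"] == 0).mean()),
        "smoker_share": float(smoker_flag.mean()),
        "corr_charges_age": float(frame["charges"].corr(frame["age"])),
        "corr_charges_smoker": float(frame["charges"].corr(smoker_flag)),
    }


print(quick_checks(sample))
```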

# Data Preparation

**PyCaret** (the AutoML tool that will be used) can perform several **Data Preparation steps automatically**, such as missing-value imputation, categorical-variable encoding, feature scaling and outlier removal. Imputation and encoding are applied by default, while scaling and outlier removal are enabled through `setup` parameters such as `normalize` and `remove_outliers`. Manual preparation can produce better results, but we will stick to what PyCaret provides in order to assess the efficiency of AutoML.
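
To make the encoding step concrete, here is a manual approximation written with pandas. It is a sketch under assumptions, not PyCaret's internal code: it assumes binary columns become a single 0/1 column and that **region** is one-hot encoded over four classes (`northeast` is assumed as the fourth, since it does not appear in the first five rows). `encode_manually` and `REGIONS` are names introduced for illustration.

```python
import pandas as pd

# First five rows of the dataset, as shown by df.head()
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Assumed full set of region classes (the source only states that there are four)
REGIONS = ["northeast", "northwest", "southeast", "southwest"]


def encode_manually(frame):
    """Encode the categorical columns by hand: binary flags plus one-hot regions."""
    out = frame.copy()
    # Binary categoricals become a single 0/1 column each
    out["sex"] = (out["sex"] == "male").astype(int)
    out["smoker"] = (out["smoker"] == "yes").astype(int)
    # Declaring all categories keeps a column for regions absent from this sample
    out["region"] = pd.Categorical(out["region"], categories=REGIONS)
    return pd.get_dummies(out, columns=["region"], dtype=int)


encoded = encode_manually(sample)
print(encoded.shape)
print(list(encoded.columns))
```

Under these assumptions the seven original columns expand to ten: three numeric features, two binary flags, four region indicators and the target.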

# Modeling

In [11]:
# Splitting the data into test (30%) and training (70%) sets
test = df.sample(frac=0.30)
train = df.drop(test.index)

test.reset_index(inplace=True, drop=True)
train.reset_index(inplace=True, drop=True)

print(test.shape)
print(train.shape)

(401, 7)
(937, 7)
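
Because `df.sample` is called without a seed, each run selects different test rows. A reproducible variant could look like the sketch below; `split_frame` is a hypothetical helper, the seed value is arbitrary, and the demo frame is synthetic data with the same number of rows as the insurance dataset (401 + 937 = 1338).

```python
import numpy as np
import pandas as pd


def split_frame(frame, test_frac=0.30, seed=42):
    """Split a frame into train and test parts; the same seed gives the same split."""
    test = frame.sample(frac=test_frac, random_state=seed)
    train = frame.drop(test.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Synthetic stand-in for the real data (values are random, not insurance charges)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "row_id": np.arange(1338),
    "charges": rng.uniform(1000, 50000, size=1338),
})

train_a, test_a = split_frame(demo)
train_b, test_b = split_frame(demo)

print(test_a.shape, train_a.shape)
print(test_a.equals(test_b))  # identical splits across calls with the same seed
```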


In [12]:
# Importing the necessary packages
from pycaret.regression import setup, compare_models, models, create_model, predict_model
from pycaret.regression import tune_model, plot_model, evaluate_model, finalize_model
from pycaret.regression import save_model, load_model

In [14]:
# Creating the PyCaret setup
reg = setup(data=train, target='charges')

|   | Description | Value |
|---|---|---|
| 0 | Session id | 4421 |
| 1 | Target | charges |
| 2 | Target type | Regression |
| 3 | Original data shape | (937, 7) |
| 4 | Transformed data shape | (937, 10) |
| 5 | Transformed train set shape | (655, 10) |
| 6 | Transformed test set shape | (282, 10) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 3 |
| 9 | Preprocess | True |
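
The train and test set sizes in this summary come from a second, internal split that PyCaret applies to the 937 training rows. The arithmetic below is a sketch that assumes the default `train_size` of 0.7 and assumes the hold-out part is rounded up, as scikit-learn's `train_test_split` does for a fractional test size; `holdout_sizes` is a hypothetical helper.

```python
import math


def holdout_sizes(n_rows, train_size=0.7):
    """Return (n_train, n_test) for a fractional split with the test part rounded up."""
    n_test = math.ceil(n_rows * (1 - train_size))
    return n_rows - n_test, n_test


print(holdout_sizes(937))  # (655, 282)
```

This means the model is trained on 655 rows and validated on 282, while the 401 rows set aside earlier remain completely unseen.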


In [None]:
# Creating the pipeline
# (the target must be a column of `train`; the experiment name is an arbitrary label)
reg = setup(data = train,
            target = 'charges',
            normalize = True,
            log_experiment = True,
            experiment_name = 'insurance_01')
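
The notebook stops at the setup step. Based on the functions imported above and the plan described in the introduction (compare models, keep the one with the lowest error, fine-tune it), the remaining stages might look like the sketch below. This is an assumption about the workflow, not the author's final code: `run_automl` is a hypothetical wrapper, and the choice of MAE as the ranking metric and the `session_id` value are arbitrary.

```python
def run_automl(train, test, target="charges", metric="MAE"):
    """Sketch of a PyCaret regression workflow; returns the final model and test predictions."""
    # Imported lazily so that defining this function does not require PyCaret
    from pycaret.regression import (
        setup, compare_models, tune_model, predict_model, finalize_model,
    )

    # session_id fixes the random seed so that runs are repeatable
    setup(data=train, target=target, session_id=123)

    # Train the available models with cross-validation and keep the best by `metric`
    best = compare_models(sort=metric)

    # Search hyperparameters for the selected model, optimizing the same metric
    tuned = tune_model(best, optimize=metric)

    # Refit on all rows passed to setup (training plus internal hold-out)
    final = finalize_model(tuned)

    # Score the rows that were set aside before setup was called
    predictions = predict_model(final, data=test)
    return final, predictions
```

It would be called as `final_model, predictions = run_automl(train, test)` using the `train` and `test` frames created at the start of this section.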

# Evaluation