
In [None]:
# Run this cell first to download the Kaggle data source used below.
# Note: this notebook environment differs from Kaggle's Python
# environment, so some libraries used by the notebook may be missing.
import kagglehub
mirichoi0218_insurance_path = kagglehub.dataset_download('mirichoi0218/insurance')

print('Data source import complete.')


![](https://accessiahealth.org/app/uploads/2024/06/iStock-1351105760.jpg)

## Introduction

Healthcare costs remain one of the most significant economic pressures on individuals, insurers, and society at large. This project uses the Insurance dataset, which records age, sex, body mass index (BMI), number of children, smoking status, region, and the medical insurance charges billed, to build a predictive model for insurance cost (the `charges` column).

The goal is two-fold: (1) understand which factors most strongly drive medical insurance costs, and (2) provide a cost-estimation tool that stakeholders (insurers, policy makers, individuals) can use for decision-making. The notebook works through exploratory data analysis, encoding and scaling of features, linear regression modelling, and evaluation to turn the raw variables into insights about the financial risk embedded in health insurance.

## 1- Import Libraries

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Silence all warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

## 2- Load Dataset

In [None]:
# Read the CSV from the directory returned by kagglehub in the first cell.
df = pd.read_csv(f"{mirichoi0218_insurance_path}/insurance.csv")

## 3- Analyzing the Dataset

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.describe().T

## 4- EDA

In [None]:
plt.figure(figsize=(6,4))
sns.histplot(x="age", data=df, kde=True)
plt.title('Distribution of Age')
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
num_vars = ['age', 'bmi', 'children', 'charges']
for i, var in enumerate(num_vars, 1):
    plt.subplot(1, 4, i)
    sns.boxplot(y=df[var], color='coral')
    plt.title(f'{var} Box Plot')
plt.tight_layout()
plt.show()

In [None]:
le=LabelEncoder()
cols_to_encode = ["sex", "smoker", "region"]

for col in cols_to_encode:
    df[col] = le.fit_transform(df[col])

The categorical (object) columns `sex`, `smoker`, and `region` are converted to integer codes with `LabelEncoder`, which numbers the categories in alphabetical order.
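Integer codes imply an ordering, which is harmless for two-level columns such as `sex` and `smoker` but arbitrary for a nominal column such as `region`. A minimal sketch of one common alternative, one-hot encoding with `pd.get_dummies`, is shown here on a small made-up frame; the rows are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Tiny illustrative frame: values are made up, column names mirror the dataset.
sample = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# One indicator column per category; drop_first=True removes one level per
# column so the indicators are not perfectly collinear with the intercept.
encoded = pd.get_dummies(
    sample, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

With this encoding each region gets its own coefficient in a linear model instead of sharing a single slope over arbitrary integer codes.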

In [None]:
df.head()

In [None]:
numeric = df.select_dtypes(include=np.number)
plt.figure(figsize=(10,8))
sns.heatmap(numeric.corr(), annot=True, fmt=".2f", cmap='viridis', center=0)
plt.title('Correlation heatmap of numeric features')
plt.show()

In [None]:
df=df.drop(columns="region")

In [None]:
df.head()

## 5- Separate Target and Features

In [None]:
x=df.drop(columns="charges")
y=df["charges"]

In [None]:
scaler=StandardScaler()
x=scaler.fit_transform(x)

## 6- Linear Regression Model

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

model = LinearRegression()
model.fit(x_train, y_train)
model.score(x_test,y_test)

In [None]:
y_pred = model.predict(x_test)

# R^2 and MSE
print("R^2 Score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
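MSE is expressed in squared currency units, which makes it hard to read against the charges themselves. A minimal sketch of two error metrics in the same units as `charges`, RMSE and MAE, follows. The arrays are made-up stand-ins for `y_test` and `y_pred`, so the printed numbers say nothing about the actual model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values standing in for y_test and y_pred from the cell above.
y_true_demo = np.array([1000.0, 5000.0, 12000.0, 30000.0])
y_pred_demo = np.array([1500.0, 4000.0, 13000.0, 26000.0])

# RMSE is the square root of MSE; MAE is the mean absolute error.
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
```

RMSE penalises large misses more heavily than MAE, so a wide gap between the two hints at a few very poorly predicted cases.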


In [None]:
scores = cross_val_score(model, x, y, cv=5, scoring='r2')
print("Cross Validation Scores: ",scores)
print("Mean Score:", scores.mean())
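The scaler in section 5 is fitted on the full dataset before splitting and cross-validation, so statistics from held-out rows are used during preprocessing. One way to avoid this, sketched below under the assumption that a scikit-learn pipeline is acceptable here, is to wrap the scaler and the model together so the scaler is refitted on each training fold. The data is synthetic and only stands in for the notebook's features and target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the insurance data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.5, size=200)

# The scaler is fitted inside each training fold only, never on held-out rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")

print("Cross-validation R^2 scores:", cv_scores)
print("Mean R^2:", cv_scores.mean())
```

In the notebook, the unscaled features and `y` from section 5 would take the place of `X_demo` and `y_demo`.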

## Conclusion

The analysis points to smoker status, BMI, and age as the main drivers of individual insurance charges. Smokers in particular incur markedly higher charges than non-smokers, which underlines the cost impact of lifestyle choices.
Other variables, such as number of children, contribute less than might be expected, and region was dropped before the model was fitted. This suggests that cost predictions cannot rely purely on demographic segmentation. Moreover, although the model achieved respectable performance, a sizeable portion of the variance remains unexplained, which underlines the complexity and individualized nature of healthcare expenses.
Ultimately, the model is a useful decision-support tool, but it should be used with its limitations in mind. Future work could integrate longitudinal data, negotiated insurance claims, policy details, or non-linear interaction effects to improve predictive accuracy and yield deeper risk insights.
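
As a rough illustration of the interaction effects mentioned as future work, the sketch below adds a product column to a linear model. The choice of a smoker × BMI interaction and the column name `smoker_bmi` are assumptions made for illustration; the notebook did not test them. The data is synthetic, with the interaction deliberately built in.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a built-in smoker x bmi effect; not the insurance dataset.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "bmi": rng.normal(30, 6, n),
    "smoker": rng.integers(0, 2, n),
})
target = (
    200 * demo["bmi"]
    + 1000 * demo["smoker"] * demo["bmi"]
    + rng.normal(0, 2000, n)
)

# Baseline: main effects only.
base_r2 = LinearRegression().fit(demo, target).score(demo, target)

# Add the hypothetical interaction column and refit.
with_inter = demo.assign(smoker_bmi=demo["smoker"] * demo["bmi"])
inter_r2 = LinearRegression().fit(with_inter, target).score(with_inter, target)

print(f"R^2 without interaction: {base_r2:.3f}")
print(f"R^2 with interaction:    {inter_r2:.3f}")
```

Both scores are computed on the training data, so they only show that the extra column lets the model fit this synthetic pattern. On the real dataset the comparison would need a held-out test set or cross-validation.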