# Final Project: Regression Analysis of Medical Insurance Charges  

**Author:** Brandon   
**Date:** 2025-11-23  

This project uses regression analysis to model and predict medical insurance charges based on patient characteristics.  
The dataset includes information such as age, sex, BMI, number of children, smoking status, and region.  
The main goal is to understand how these features relate to insurance costs and to build models that can predict charges for new patients.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


## 2. Data Exploration and Preparation


### 2.1 Explore data patterns and distributions


In [7]:
df = pd.read_csv("../../data/insurance.csv")
df.head()


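| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |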
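
Beyond `df.head()`, a few summary calls help characterize the distributions. The following is a minimal sketch, assuming the column names shown in the output above. So that it runs on its own, it rebuilds only the five displayed rows as a stand-in (`sample` is a hypothetical name), so its numbers are not representative of the full dataset.

```python
import pandas as pd

# Stand-in sample: only the five rows displayed by df.head() above.
# In the notebook, the full `df` loaded from insurance.csv would be used instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Summary statistics for the numeric columns
numeric_summary = sample.describe()

# Count of missing values per column
missing_counts = sample.isna().sum()

# Mean charges by smoking status
mean_charges_by_smoker = sample.groupby("smoker")["charges"].mean()

print(numeric_summary)
print(missing_counts)
print(mean_charges_by_smoker)
```

On the full dataset, the same calls could be complemented with histograms of `charges` and `bmi` to check for skew.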


## 3. Feature Selection and Justification


### 3.1 Choose features and target

The target variable for this regression problem is **charges**, because the entire purpose of the dataset is to predict medical insurance costs.

I selected the following features: age, bmi, children, sex, smoker, and region, plus the two engineered features bmi_over_30 and age_smoker_interaction. These variables plausibly influence medical costs, and several (especially smoking status and BMI) have well-known associations with healthcare spending.

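The one-hot encoded version of the dataset (df_encoded) converts the categorical variables (sex, smoker, and region) into indicator columns that a regression model can use. Passing drop_first=True drops one level per variable, which avoids perfectly collinear dummy columns.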
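
The engineered features named above could be constructed before encoding along the following lines. This is a sketch under stated assumptions: the exact definitions (a strict BMI threshold of 30, and age multiplied by a 0/1 smoker indicator) are inferred from the feature names rather than taken from the notebook, and `sample` is a hypothetical stand-in built from the five rows shown earlier.

```python
import pandas as pd

# Stand-in sample built from the five rows displayed earlier (hypothetical name).
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Assumed definition: 1 if BMI is strictly above 30, else 0
sample["bmi_over_30"] = (sample["bmi"] > 30).astype(int)

# Assumed definition: age multiplied by a 0/1 smoker indicator
is_smoker = (sample["smoker"] == "yes").astype(int)
sample["age_smoker_interaction"] = sample["age"] * is_smoker

# One-hot encode the remaining categorical columns, dropping the first level of each
sample_encoded = pd.get_dummies(sample, drop_first=True)
print(sample_encoded.columns.tolist())
```

Creating the engineered columns before calling `pd.get_dummies` keeps them in the encoded frame, so they carry through to `X` when `charges` is dropped.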


In [9]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()


| | age | bmi | children | charges | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 16884.924 | False | True | False | False | True |
| 1 | 18 | 33.77 | 1 | 1725.5523 | True | False | False | True | False |
| 2 | 28 | 33.0 | 3 | 4449.462 | True | False | False | True | False |
| 3 | 33 | 22.705 | 0 | 21984.47061 | True | False | True | False | False |
| 4 | 32 | 28.88 | 0 | 3866.8552 | True | False | True | False | False |


In [10]:
X = df_encoded.drop("charges", axis=1)
y = df_encoded["charges"]

X.head(), y.head()


(   age     bmi  children  sex_male  smoker_yes  region_northwest  \
 0   19  27.900         0     False        True             False   
 1   18  33.770         1      True       False             False   
 2   28  33.000         3      True       False             False   
 3   33  22.705         0      True       False              True   
 4   32  28.880         0      True       False              True   
 
    region_southeast  region_southwest  
 0             False              True  
 1              True             False  
 2              True             False  
 3             False             False  
 4             False             False  ,
 0    16884.92400
 1     1725.55230
 2     4449.46200
 3    21984.47061
 4     3866.85520
 Name: charges, dtype: float64)

## Reflection 3

I selected these features because they represent meaningful health and demographic factors that influence medical spending. Smoking status is especially important because smokers tend to have substantially higher insurance costs, and age is another major driver. BMI and obesity status are also associated with higher risk.

I included the engineered feature age_smoker_interaction to capture an interaction effect: charges may rise faster with age for smokers than for non-smokers, so older smokers may have disproportionately higher insurance costs. Including region and sex provides additional context even if their impact is smaller. Together, these features should help improve model accuracy.
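
To make the role of the interaction term concrete, consider a simplified linear model with only three terms. This is an illustrative sketch: the coefficients $\beta_i$ are hypothetical placeholders, not fitted values, and $s \in \{0, 1\}$ denotes the smoker indicator.

$$
\widehat{\text{charges}} = \beta_0 + \beta_1 \,\text{age} + \beta_2 \, s + \beta_3 \,(\text{age} \times s)
$$

For non-smokers ($s = 0$) the slope with respect to age is $\beta_1$, while for smokers ($s = 1$) it is $\beta_1 + \beta_3$. A positive $\beta_3$ would therefore correspond to charges increasing faster with age among smokers, which a model without the interaction term cannot represent.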


## 4. Train a Model (Linear Regression)


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape
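
Once the data is split, fitting and scoring a linear regression could look roughly like the sketch below. It uses synthetic stand-in data so that it runs on its own; the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical, and in the notebook the `X_train`, `X_test`, `y_train`, and `y_test` objects from the split above would take their place.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical), not the insurance dataset
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Same 80/20 split settings as the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training portion only, then predict on the held-out portion
model = LinearRegression()
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Evaluate with the metrics imported at the top of the notebook
r2 = r2_score(y_te, y_pred)
mae = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))
print(f"R^2: {r2:.3f}, MAE: {mae:.3f}, RMSE: {rmse:.3f}")
```

Reporting the metrics on the test split, rather than the training split, gives a fairer estimate of how the model would perform on new patients.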
