# Final Project: Regression Analysis of Medical Insurance Charges  

**Author:** Brandon   
**Date:** 2025-11-23  

This project uses regression analysis to model and predict medical insurance charges based on patient characteristics.  
The dataset includes information such as age, sex, BMI, number of children, smoking status, and region.  
The main goal is to understand how these features relate to insurance costs and to build models that can predict charges for new patients.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error


## 2. Data Exploration and Preparation


### 2.1 Explore data patterns and distributions


In [7]:
df = pd.read_csv("../../data/insurance.csv")
df.head()


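| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |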
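Beyond `df.head()`, a few quick checks help describe patterns and distributions before modeling. The sketch below is one possible way to do that. The helper name `quick_profile` is hypothetical, and the small `sample` frame holds only the five rows shown above as a stand-in for the full file, so its summary statistics are not representative of the real dataset.

```python
import pandas as pd

# Stand-in for df: only the five rows displayed above, NOT the full insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def quick_profile(frame: pd.DataFrame) -> dict:
    """Collect basic checks worth running before modeling (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")
    return {
        "n_rows": len(frame),
        # Missing values decide whether an imputer is needed at all.
        "missing_per_column": frame.isna().sum().to_dict(),
        # Count, mean, spread, and quartiles of every numeric column.
        "numeric_summary": numeric.describe(),
        # Positive skew in the target suggests a long right tail.
        "charges_skew": frame["charges"].skew(),
        # Class balance of the smoker flag.
        "smoker_counts": frame["smoker"].value_counts().to_dict(),
    }


profile = quick_profile(sample)
print(profile["missing_per_column"])
print(profile["smoker_counts"])
print(profile["numeric_summary"])
```

In the notebook itself, the same call would be `quick_profile(df)` on the loaded dataset.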


## 3. Feature Selection and Justification


### 3.1 Choose features and target

The target variable for this regression problem is **charges**, because the entire purpose of the dataset is to predict medical insurance costs.

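I selected the following features: age, bmi, children, sex, smoker, and region. These variables plausibly influence medical costs, and smoking status and BMI in particular are well-known drivers of healthcare spending. The engineered features bmi_over_30 and age_smoker_interaction are natural additions, but they are not created in the code below: the feature matrix contains only the eight columns produced by one-hot encoding.

The one-hot encoded version of the dataset (df_encoded) converts each categorical variable into True/False indicator columns that linear regression can use. Passing `drop_first=True` drops one level per variable (for example, sex_female), which avoids perfectly collinear indicator columns.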


In [9]:
df_encoded = pd.get_dummies(df, drop_first=True)
df_encoded.head()


| | age | bmi | children | charges | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.9 | 0 | 16884.924 | False | True | False | False | True |
| 1 | 18 | 33.77 | 1 | 1725.5523 | True | False | False | True | False |
| 2 | 28 | 33.0 | 3 | 4449.462 | True | False | False | True | False |
| 3 | 33 | 22.705 | 0 | 21984.47061 | True | False | True | False | False |
| 4 | 32 | 28.88 | 0 | 3866.8552 | True | False | True | False | False |


In [10]:
X = df_encoded.drop("charges", axis=1)
y = df_encoded["charges"]

X.head(), y.head()


(   age     bmi  children  sex_male  smoker_yes  region_northwest  \
 0   19  27.900         0     False        True             False   
 1   18  33.770         1      True       False             False   
 2   28  33.000         3      True       False             False   
 3   33  22.705         0      True       False              True   
 4   32  28.880         0      True       False              True   
 
    region_southeast  region_southwest  
 0             False              True  
 1              True             False  
 2              True             False  
 3             False             False  
 4             False             False  ,
 0    16884.92400
 1     1725.55230
 2     4449.46200
 3    21984.47061
 4     3866.85520
 Name: charges, dtype: float64)

## Reflection 3

I selected these features because they represent meaningful health and demographic factors that influence medical spending. Smoking status is especially important because smokers tend to have drastically higher insurance costs, and age is another major driver. Higher BMI, particularly in the obese range, is also associated with higher risk.

An engineered feature such as age_smoker_interaction would capture the possibility that older smokers have disproportionately higher insurance costs, an interaction effect that a purely additive linear model cannot represent. It is not among the eight columns of X shown above, so the baseline model does not use it. Including region and sex provides additional context even if their impact is smaller. These combined features should help improve model accuracy.
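The two engineered features named in this section are not built in the cells shown. A minimal sketch of how they might be derived follows. It assumes that bmi_over_30 is a 0/1 flag for BMI strictly above 30 and that age_smoker_interaction is age multiplied by a 0/1 smoker flag; both definitions and the helper name `add_engineered_features` are assumptions, not taken from the notebook. The `sample` frame reuses the five rows displayed earlier.

```python
import pandas as pd


def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with two illustrative engineered columns (assumed definitions)."""
    out = frame.copy()
    # Assumption: "over 30" means strictly greater than 30.
    out["bmi_over_30"] = (out["bmi"] > 30).astype(int)
    # Assumption: the interaction is age for smokers and 0 for non-smokers.
    is_smoker = (out["smoker"] == "yes").astype(int)
    out["age_smoker_interaction"] = out["age"] * is_smoker
    return out


# The five rows shown in the df.head() output, restricted to the columns needed here.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "smoker": ["yes", "no", "no", "no", "no"],
})

engineered = add_engineered_features(sample)
print(engineered)
```

To use these columns in the models, the function would have to be applied to `df` before the `pd.get_dummies` call so that they end up in `X`.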


## 4. Train a Model (Linear Regression)


In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


((1070, 8), (268, 8))

In [12]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)


LinearRegression()


In [14]:
y_pred = lin_reg.predict(X_test)

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

r2, mae, rmse


(0.7835929767120723, 4181.1944737536505, np.float64(5796.2846592762735))

In [15]:
results = pd.DataFrame({
    "Metric": ["R²", "MAE", "RMSE"],
    "Score": [r2, mae, rmse]
})

results

| | Metric | Score |
|---|---|---|
| 0 | R² | 0.783593 |
| 1 | MAE | 4181.194474 |
| 2 | RMSE | 5796.284659 |
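To make the three metrics concrete, the sketch below computes them by hand on a tiny made-up example and checks the results against scikit-learn. The arrays `y_true` and `y_hat` are illustrative values, not data from this project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values only (not insurance data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_hat

# MAE: average absolute error, in the same units as the target.
mae_manual = np.mean(np.abs(residuals))

# RMSE: square root of the mean squared error; penalizes large misses more.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, rmse_manual, r2_manual)
print(
    mean_absolute_error(y_true, y_hat),
    np.sqrt(mean_squared_error(y_true, y_hat)),
    r2_score(y_true, y_hat),
)
```

Because RMSE squares the residuals before averaging, a few very expensive patients can raise it well above MAE, which matches the gap between the two scores in the table.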


## 5. Improve the Model or Try Alternates (Implement Pipelines)


In [16]:
baseline_r2 = r2
baseline_mae = mae
baseline_rmse = rmse

baseline_r2, baseline_mae, baseline_rmse


(0.7835929767120723, 4181.1944737536505, np.float64(5796.2846592762735))

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

pipe_linear = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

pipe_linear.fit(X_train, y_train)

y_pred_pipe_linear = pipe_linear.predict(X_test)

r2_p1 = r2_score(y_test, y_pred_pipe_linear)
mae_p1 = mean_absolute_error(y_test, y_pred_pipe_linear)
rmse_p1 = np.sqrt(mean_squared_error(y_test, y_pred_pipe_linear))

r2_p1, mae_p1, rmse_p1


(0.7835929767120722, 4181.194473753652, np.float64(5796.284659276274))

In [18]:
from sklearn.preprocessing import PolynomialFeatures

pipe_poly3 = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

pipe_poly3.fit(X_train, y_train)

y_pred_pipe_poly3 = pipe_poly3.predict(X_test)

r2_p2 = r2_score(y_test, y_pred_pipe_poly3)
mae_p2 = mean_absolute_error(y_test, y_pred_pipe_poly3)
rmse_p2 = np.sqrt(mean_squared_error(y_test, y_pred_pipe_poly3))

r2_p2, mae_p2, rmse_p2


(0.8486414814914012, 2937.918592600429, np.float64(4847.496054555803))

In [19]:
comparison = pd.DataFrame({
    "Model": [
        "Baseline Linear Regression",
        "Pipeline 1: Scaled Linear Regression",
        "Pipeline 2: Poly(3) + Scaled Linear Regression"
    ],
    "R²": [baseline_r2, r2_p1, r2_p2],
    "MAE": [baseline_mae, mae_p1, mae_p2],
    "RMSE": [baseline_rmse, rmse_p1, rmse_p2]
})

comparison


| | Model | R² | MAE | RMSE |
|---|---|---|---|---|
| 0 | Baseline Linear Regression | 0.783593 | 4181.194474 | 5796.284659 |
| 1 | Pipeline 1: Scaled Linear Regression | 0.783593 | 4181.194474 | 5796.284659 |
| 2 | Pipeline 2: Poly(3) + Scaled Linear Regression | 0.848641 | 2937.918593 | 4847.496055 |


## Reflection 5

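The scaled linear regression pipeline produced the same metrics as the baseline model (R² of 0.7836, MAE of about 4,181, RMSE of about 5,796), differing only in the last floating-point digits. This is expected. Ordinary least squares is solved directly rather than by iterative optimization, so standardizing the features rescales the coefficients but leaves the predictions unchanged. The median imputer is also a no-op here: the baseline fit succeeded without imputation, so the feature matrix has no missing values. Scaling matters more for regularized models such as Ridge or Lasso, and for models trained with gradient-based optimizers, where features with very different ranges affect the fit.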
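The relationship between scaling and ordinary least squares can be checked directly. The sketch below fits `LinearRegression` with and without `StandardScaler` on synthetic data whose features have very different ranges; the data and the names `X_demo`, `y_demo`, `plain`, and `scaled` are illustrative assumptions, not part of this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, -0.5, 0.01]) + rng.normal(size=200)

# Same estimator, once on raw features and once behind a scaler.
plain = LinearRegression().fit(X_demo, y_demo)
scaled = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(X_demo, y_demo)

# Largest disagreement between the two sets of predictions.
max_gap = np.max(np.abs(plain.predict(X_demo) - scaled.predict(X_demo)))
print("max prediction gap:", max_gap)

# The coefficients differ because they are expressed in different units.
print("raw coefficients:   ", plain.coef_)
print("scaled coefficients:", scaled.named_steps["model"].coef_)
```

The predictions agree up to floating-point error while the coefficients change, which is why R², MAE, and RMSE are the same for the baseline and Pipeline 1.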

The polynomial features pipeline increased model flexibility by allowing non-linear relationships and interactions between the inputs. In this case it improved every metric on the test set: R² rose from 0.784 to 0.849, MAE fell from about 4,181 to about 2,938, and RMSE fell from about 5,796 to about 4,847. The cost is added complexity and a higher risk of overfitting, because a degree-3 expansion of eight inputs produces far more columns than the original model uses. Overall, polynomial features are a useful tool when the basic linear model leaves a lot of variance unexplained, whereas scaling alone does not change an ordinary least squares fit.
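One way to gauge the added complexity is to count the columns that `PolynomialFeatures` generates for eight inputs, the width of `X` in this notebook. The sketch below does only that; the zero-filled `placeholder` array exists purely so the transformer can learn the input width.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 8  # width of X after one-hot encoding
placeholder = np.zeros((1, n_inputs))  # values are irrelevant; only the shape matters

counts = {}
for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(placeholder)
    counts[degree] = poly.n_output_features_

print(counts)  # {1: 8, 2: 44, 3: 164}
```

With 1,070 training rows, 164 columns is still fewer columns than rows, but many of them are redundant. For a 0/1 indicator column, for example, the square and cube equal the column itself. Comparing training and test scores, or using cross-validation, would show how much of the improvement generalizes.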


## 6. Final Thoughts & Insights

### 6.1 Summarize findings

In this project, I built several regression models to predict medical insurance charges based on patient characteristics such as age, BMI, number of children, sex, smoking status, and region. The baseline linear regression model explained about 78% of the variance in test-set charges (R² = 0.784), which confirms that these features are strongly related to medical costs. Smoking status, age, and BMI are the most likely drivers of higher charges, although this notebook does not inspect the fitted coefficients to confirm their relative importance.

After building pipelines with scaling and polynomial features, I compared performance across models. Scaling alone left the metrics unchanged. The pipeline with polynomial features of degree 3 gave the model more flexibility to capture non-linear relationships and interactions between features. It raised R² to 0.849 and lowered MAE from about 4,181 to about 2,938 and RMSE from about 5,796 to about 4,847, at the cost of higher model complexity.
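A simple way to check which features carry the most weight is to rank the coefficients of a model fitted on standardized inputs, so that they are comparable across features. The sketch below assumes a pipeline whose final step is named `"model"`, as in this notebook. The helper `coefficient_table` is hypothetical and the demonstration data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def coefficient_table(fitted_pipeline, feature_names):
    """Rank features by absolute coefficient of the pipeline's 'model' step."""
    model = fitted_pipeline.named_steps["model"]
    table = pd.DataFrame({
        "feature": list(feature_names),
        "coefficient": model.coef_,
    })
    table["abs_coefficient"] = table["coefficient"].abs()
    return table.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)


# Synthetic demonstration: "a" matters most, "b" a little, "c" not at all.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    "a": rng.normal(size=300),
    "b": rng.normal(size=300),
    "c": rng.normal(size=300),
})
demo_y = 10.0 * demo_X["a"] + 1.0 * demo_X["b"] + rng.normal(scale=0.1, size=300)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
]).fit(demo_X, demo_y)

ranking = coefficient_table(demo_pipe, demo_X.columns)
print(ranking)
```

In this notebook, the equivalent call would be `coefficient_table(pipe_linear, X.columns)`. Standardized coefficients are only a rough guide to importance when features are correlated.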

### 6.2 Challenges faced

One challenge in this project was dealing with the skewed distribution and large outliers in the `charges` variable. These high-cost cases increase error metrics like RMSE and make it harder for a simple linear model to perform well across the entire range. Another challenge was choosing which engineered features and transformations would add value without making the model unnecessarily complex or overfitted.

It was also important to think carefully about how to encode categorical variables and scale numerical features so the models could use all the available information effectively. Balancing interpretability and performance is always a key challenge in regression modeling.
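One alternative to encoding with `pd.get_dummies` before the split is to put encoding and scaling inside the pipeline with a `ColumnTransformer`, so that numeric columns are standardized and categorical columns are one-hot encoded separately. The sketch below is a possible setup, not the approach used in this notebook. The `sample` frame holds only the five rows displayed earlier, and the names `numeric_cols`, `categorical_cols`, and `preprocess` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "bmi", "children"]
categorical_cols = ["sex", "smoker", "region"]

preprocess = ColumnTransformer(
    transformers=[
        # Standardize only the numeric columns.
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode categoricals, dropping the first level of each.
        ("cat", OneHotEncoder(drop="first"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Stand-in data: the five rows shown in the df.head() output.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
})

encoded = preprocess.fit_transform(sample)
print(encoded.shape)
```

On these five rows the result has seven columns, because only three regions appear. The full dataset yields eight, matching the width of `X` in the notebook. Placing `preprocess` as the first step of a `Pipeline` would keep the indicator columns unscaled and make the whole workflow a single fitted object.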

### 6.3 If I had more time

If I had more time, I would experiment with additional models such as regularized regression methods (Ridge, Lasso, Elastic Net) and tree-based models like Random Forests or Gradient Boosting Regressors. I would also explore log-transforming the target variable `charges` to reduce skew and see if that improves the stability of the models.
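The log-transform idea can be tried without changing the rest of the workflow by wrapping the regressor in scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and converts predictions back to the original scale. The sketch below uses synthetic right-skewed data. The choice of `np.log1p` and `np.expm1` is one reasonable option, and nothing here has been run on the insurance dataset.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target (illustrative only).
rng = np.random.default_rng(2)
X_demo = rng.uniform(0.0, 1.0, size=(300, 2))
y_demo = np.exp(
    1.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)
)

# Fit on log1p(y); predictions are mapped back with expm1 automatically.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, y_demo)

preds = log_model.predict(X_demo)
print(preds[:5])
```

Because predictions come back in the original units, R², MAE, and RMSE can be compared directly with the earlier models. The `regressor` argument also accepts a full `Pipeline`.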

Another next step would be to perform more systematic hyperparameter tuning using techniques like GridSearchCV or RandomizedSearchCV, and to do deeper feature importance analysis and partial dependence plots to better understand how individual features influence predicted charges.
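As a rough illustration of how regularization and tuning could be combined, the sketch below searches over the polynomial degree and the Ridge penalty with `GridSearchCV`. The data are synthetic, and the grid values and step names are assumptions chosen for the example rather than recommendations for this dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with one quadratic effect (illustrative only).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] + rng.normal(scale=0.3, size=200)

pipe = Pipeline([
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {
    "poly__degree": [1, 2, 3],
    "model__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

In the notebook, the search would be fitted on `X_train` and `y_train` only, with the held-out test set used once at the end to score `search.best_estimator_`.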

## Reflection 6

From this project, I learned how to take a real-world regression problem from start to finish: loading data, exploring and cleaning it, selecting and engineering features, training baseline models, and then improving them with pipelines and more advanced transformations. I also saw how important it is to think about data distributions, outliers, and encoding choices before jumping into modeling.

Working with multiple models and comparing them side by side helped me understand the trade-offs between simplicity and complexity. Overall, this project strengthened my ability to structure a machine learning workflow, evaluate models with appropriate metrics, and communicate results in a clear, organized way.
