Importing the libraries needed to work with my data and build my model. I use pandas to handle my CSV files, train_test_split and GridSearchCV to split my data and tune the model, Ridge as my regression model, and mean_squared_error and r2_score to check how well my predictions come out.

In [1]:
#importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

Loading my original data, insurance.csv, to keep the real min and max values for age, BMI, and charges. I also load my cleaned and encoded dataset, which already has my feature engineering done. This lets me train the model without extra steps.

In [3]:
#loading the original dataset for min and max values
df_raw = pd.read_csv('Downloads/insurance.csv')

#loading the cleaned and encoded dataset
df = pd.read_csv('Downloads/insurance_cleaned.csv')
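As a minimal sketch of why I keep the raw min and max values: min-max scaling maps a value into the 0 to 1 range, and the same two bounds are needed to undo it later. The function names and the bounds 18 and 64 below are made up for illustration and are not read from insurance.csv.

```python
def min_max_scale(value, lo, hi):
    """Map a raw value into the 0-1 range using the given bounds."""
    return (value - lo) / (hi - lo)


def min_max_unscale(scaled, lo, hi):
    """Invert min_max_scale: map a 0-1 value back to the raw range."""
    return scaled * (hi - lo) + lo


# hypothetical bounds, standing in for df_raw['age'].min() and .max()
age_lo, age_hi = 18, 64

scaled_age = min_max_scale(35, age_lo, age_hi)
restored_age = min_max_unscale(scaled_age, age_lo, age_hi)

print("scaled age:", scaled_age)
print("restored age:", restored_age)
```

The round trip only works if both directions use the same bounds, which is why the raw file stays loaded next to the cleaned one.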

Splitting my cleaned dataset into X and y. X has all my input features except the charges column, and y is the column I want my model to predict. This sets up the data for training.

In [5]:
#splitting data into features (X) and target (y)
X = df.drop('charges', axis=1)
y = df['charges']

Splitting my data into training and test sets. I use 80% for training my model and keep 20% aside to test how well the model works on new data.

In [7]:
#splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Setting up Ridge Regression and defining a small grid of alpha values to test. I use GridSearchCV with 5-fold cross-validation to check which alpha works best, then fit it to my training data to get the best Ridge model.

In [9]:
#setting up ridge regression and defining parameter grid
ridge = Ridge()
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

#using grid search to find the best alpha
grid_search = GridSearchCV(ridge, param_grid, cv=5)
grid_search.fit(X_train, y_train)

#getting the best ridge model
best_ridge = grid_search.best_estimator_
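To see how close the alpha candidates were to each other, the fitted search object keeps a score for every value in the grid. The sketch below shows one way to read them, using synthetic stand-in data rather than my insurance data; the names X_demo, y_demo, demo_search, and cv_table are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (not the insurance dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.random((200, 3)), columns=['f1', 'f2', 'f3'])
y_demo = 0.5 * X_demo['f1'] + 0.2 * X_demo['f2'] + rng.normal(0, 0.05, 200)

# same grid and fold count as the search above
demo_search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
demo_search.fit(X_demo, y_demo)

# one row per alpha; the score is R-squared, the default for regressors
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ['param_alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(cv_table.sort_values('rank_test_score'))
```

If the mean scores are nearly identical across the grid, the choice of alpha matters less than the best_params_ output alone might suggest.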

Using my best Ridge model to predict insurance charges on my test data. I calculate the mean squared error, which came out to about 0.00856. Because the charges in my cleaned dataset are scaled, this error is in scaled units, not dollars. A lower MSE means my predictions are closer to the actual amounts. I also check the R-squared score, which is about 0.78. This means my model explains about 78% of the variation in insurance charges on the test set, so features like age, BMI, and smoking status together capture most of what drives what people pay.

In [11]:
#making predictions on test data
y_pred = best_ridge.predict(X_test)

#calculating mean squared error and r-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

#displaying results
print("best alpha value:", grid_search.best_params_)
print("mean squared error:", mse)
print("r-squared:", r2)

best alpha value: {'alpha': 1.0}
mean squared error: 0.008557261073215893
r-squared: 0.7836639847116438
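The MSE above is hard to read on its own because it is in scaled units. A rough way to express it in dollars, assuming the charges were min-max scaled (which is what the inverse transform in my helper function further down implies), is to take the square root and multiply by the charge range. The function name and the min and max values here are hypothetical placeholders, not numbers from my data.

```python
import math


def scaled_mse_to_dollar_rmse(mse_scaled, charge_min, charge_max):
    """Convert an MSE on min-max scaled charges to an RMSE in dollars.

    Min-max scaling divides by (charge_max - charge_min), so errors on the
    scaled target are the dollar errors divided by that same range.
    """
    return math.sqrt(mse_scaled) * (charge_max - charge_min)


# placeholder bounds, standing in for df_raw['charges'].min() and .max()
example_rmse = scaled_mse_to_dollar_rmse(0.01, charge_min=1000.0, charge_max=61000.0)
print(f"typical error in dollars: ${example_rmse:.2f}")
```

In the notebook itself, the call would take mse together with the real minimum and maximum of the charges column in df_raw.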


Here, I define a helper function that lets me test my model with new customer details. I start by scaling the raw age and BMI using the original min and max values so they match the same range my model was trained on. Then I create the right dummy variables for sex, smoker status, region, age group, and BMI category so the new input lines up with my encoded columns. After that, I run the prediction using my trained Ridge model and convert the scaled prediction back to a real dollar amount. This makes it easy to see what a new customer might actually pay based on their profile.

In [13]:
#defining function to estimate cost for new customer
def estimate_insurance_cost(
    age, bmi, children, sex, smoker, region, age_group, bmi_category
):
    #getting min and max for age and bmi
    age_min = df_raw['age'].min()
    age_max = df_raw['age'].max()
    bmi_min = df_raw['bmi'].min()
    bmi_max = df_raw['bmi'].max()

    #getting min and max for charges
    charge_min = df_raw['charges'].min()
    charge_max = df_raw['charges'].max()

    #scaling age and bmi
    age_scaled = (age - age_min) / (age_max - age_min)
    bmi_scaled = (bmi - bmi_min) / (bmi_max - bmi_min)

    #creating dummy variables
    sex_male = 1 if sex.lower() == 'male' else 0
    smoker_yes = 1 if smoker.lower() == 'yes' else 0

    region_northwest = 1 if region.lower() == 'northwest' else 0
    region_southeast = 1 if region.lower() == 'southeast' else 0
    region_southwest = 1 if region.lower() == 'southwest' else 0

    age_group_adult = 1 if age_group.lower() == 'adult' else 0
    age_group_middle_aged = 1 if age_group.lower() == 'middle_aged' else 0
    age_group_senior = 1 if age_group.lower() == 'senior' else 0

    bmi_category_normal = 1 if bmi_category.lower() == 'normal' else 0
    bmi_category_overweight = 1 if bmi_category.lower() == 'overweight' else 0
    bmi_category_obese = 1 if bmi_category.lower() == 'obese' else 0

    #building input row
    new_input = pd.DataFrame([{
        'age': age_scaled,
        'bmi': bmi_scaled,
        'children': children,
        'sex_male': sex_male,
        'smoker_yes': smoker_yes,
        'region_northwest': region_northwest,
        'region_southeast': region_southeast,
        'region_southwest': region_southwest,
        'age_group_adult': age_group_adult,
        'age_group_middle_aged': age_group_middle_aged,
        'age_group_senior': age_group_senior,
        'bmi_category_normal': bmi_category_normal,
        'bmi_category_overweight': bmi_category_overweight,
        'bmi_category_obese': bmi_category_obese
    }])

    #predicting scaled cost
    scaled_prediction = best_ridge.predict(new_input)[0]

    #converting scaled prediction to real dollars
    real_prediction = scaled_prediction * (charge_max - charge_min) + charge_min

    print(f"estimated scaled charges: {scaled_prediction:.4f}")
    print(f"estimated insurance cost in dollars: ${real_prediction:.2f}")

    return real_prediction
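One thing the helper depends on is that the input row has the same columns, in the same order, as the data the model was trained on. A small sketch of how that could be enforced with pandas reindex is below. The column list is a shortened, hypothetical stand-in for X.columns, and this is a possible safeguard rather than something my function above already does.

```python
import pandas as pd

# hypothetical, shortened stand-in for list(X.columns)
training_columns = ['age', 'bmi', 'children', 'smoker_yes']

# input row built in a different order and missing the 'children' column
row = pd.DataFrame([{'smoker_yes': 1, 'age': 0.37, 'bmi': 0.32}])

# reorder to match training and fill any missing column with 0
aligned = row.reindex(columns=training_columns, fill_value=0)

print(aligned)
```

Filling with 0 is only safe for dummy columns, where 0 means the category is absent. A missing numeric feature should be treated as an error instead.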

In this last part, I test the helper function with example inputs for a new customer. The function scales the inputs, makes the prediction with my trained model, and shows both the scaled value and the final dollar amount. The scaled number shows where the prediction falls between the lowest and highest charges in my data, and the dollar amount converts that back to what a real insurance cost would look like for the customer’s profile. This gives me a quick check that the model's estimate lands in a plausible range for that profile.

In [17]:
#testing the helper function with example input
estimate_insurance_cost(
    age=35,
    bmi=28,
    children=1,
    sex='male',
    smoker='yes',
    region='southeast',
    age_group='adult',
    bmi_category='overweight'
)

estimated scaled charges: 0.4389
estimated insurance cost in dollars: $28617.93


28617.92989058414

Brief Overview of the Project:

For this project, I trained a Ridge Regression model to estimate insurance charges using prepared customer data. I used a small grid search to tune the model’s alpha parameter and found that an alpha value of 1.0 worked best with my data. The final model explains about 78 percent of the variation in charges on the test set, which suggests that the features I used, including age, BMI, and smoking status, capture most of what drives costs. I also created a simple function that lets me test new examples by scaling the inputs the same way as my training data and then converting the prediction back to a real dollar amount. This makes it possible to get a realistic insurance cost estimate for different customer profiles.