# U.S. Medical Insurance Costs

In this project, we’ll explore a medical insurance dataset to understand what drives insurance costs and what insights we can uncover for practical business use.

This code was written by Cody on 12/10/2025.
It will be used to complete the Portfolio Project: U.S. Medical Insurance in the Codecademy Data Scientist: Analytics Career Path.
Below are the boilerplate imports that I bring in at the start of projects.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import requests as req
import json
import csv
import datetime as dt
from itertools import cycle

# 1. Importing the Data

We’ll start by loading insurance.csv and taking a quick look at its columns, data types, and any missing or unusual values.
I load the file from a save path on my computer. Change this path to the location of the file on your own computer or network.

In [None]:
# Load the insurance data; be sure to use your own file path
df = pd.read_csv("Filepath/to/your/insurance.csv")
print(df)
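
Printing the whole frame shows the rows, but not the data types or missing values mentioned above. The sketch below shows one way to check those directly. It uses a small made-up `sample` frame (hypothetical values, not rows from insurance.csv) so that it runs on its own; in the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in rows with the same columns as insurance.csv.
# One BMI value is deliberately missing to show what the check reports.
sample = pd.DataFrame({
    "age": [25, 41, 58, 33],
    "sex": ["female", "male", "male", "female"],
    "bmi": [24.5, 31.2, None, 28.0],
    "children": [0, 2, 1, 3],
    "smoker": ["no", "yes", "no", "no"],
    "region": ["southwest", "southeast", "northeast", "northwest"],
    "charges": [3200.0, 38000.0, 12500.0, 6100.0],
})

print(sample.head())    # first rows
print(sample.dtypes)    # data type of each column

# Count missing values per column and fully duplicated rows
missing_per_column = sample.isna().sum()
duplicate_rows = int(sample.duplicated().sum())
print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

If `missing_per_column` is all zeros and there are no unexpected duplicates, the later averages can be computed without any cleaning step.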

# 2. Exploring the Dataset
Next, we’ll review basic stats and distributions for key fields like age, BMI, and charges, along with category counts for things like sex and region. This gives us a feel for what the data looks like. The dataset contains the following fields:

* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

In [None]:
age = df["age"]
sex = df["sex"]
bmi = df["bmi"]
children = df["children"]
smoker = df["smoker"]
region = df["region"]
charges = df["charges"]
print(df.describe(include="all"))

In the following block, I will determine and store key information for each characteristic. This will allow me to compile more useful insights later in the project.
Although print(df.describe(include="all")) provides much of this information, I believe that storing these values in named variables will be helpful for later visualizations and characterizations.

In [None]:
#age specifics#
total_age = df["age"].sum()
average_age = total_age / df["age"].count()
oldest = df["age"].max()
youngest = df["age"].min()
average_age_male = df[df["sex"] == "male"]["age"].mean()
average_age_female = df[df["sex"] == "female"]["age"].mean()
oldest_male = df[df["sex"] == "male"]["age"].max()
youngest_male = df[df["sex"] == "male"]["age"].min()
oldest_female = df[df["sex"] == "female"]["age"].max()
youngest_female = df[df["sex"] == "female"]["age"].min()

#sex specifics#
males = (df["sex"] == "male").sum()
females = (df["sex"] == "female").sum()

#bmi specifics#
total_bmi = df["bmi"].sum()
average_bmi = total_bmi / df["bmi"].count()
largest_bmi = df["bmi"].max()
smallest_bmi = df["bmi"].min()
aveage_bmi_male = df[df["sex"] == "male"]["bmi"].mean()
aveage_bmi_female = df[df["sex"] == "female"]["bmi"].mean()
largest_bmi_male = df[df["sex"] == "male"]["bmi"].max()
smallest_male = df[df["sex"] == "male"]["bmi"].min()
largest_bmi_female = df[df["sex"] == "female"]["bmi"].max()
smallest_female = df[df["sex"] == "female"]["bmi"].min()

#children specifics#
total_children = df["children"].sum()
average_children = total_children / df["children"].count()
most_children = df["children"].max()
least_children = df["children"].min()
aveage_children_male = df[df["sex"] == "male"]["children"].mean()
aveage_children_female = df[df["sex"] == "female"]["children"].mean()
most_children_male = df[df["sex"] == "male"]["children"].max()
least_children_male = df[df["sex"] == "male"]["children"].min()
most_children_female = df[df["sex"] == "female"]["children"].max()
least_children_female = df[df["sex"] == "female"]["children"].min()

#region specifics#
region_counts = df["region"].value_counts()
male_region_counts = df[df["sex"] == "male"]["region"].value_counts()
female_region_counts = df[df["sex"] == "female"]["region"].value_counts()


#smoker specifics#
smoker_counts = df["smoker"].value_counts()
male_smoker_counts = df[df["sex"] == "male"]["smoker"].value_counts()
female_smoker_counts = df[df["sex"] == "female"]["smoker"].value_counts()
highest_smoker_region = df[df["smoker"] == "yes"]["region"].value_counts().idxmax()
lowest_smoker_region = df[df["smoker"] == "yes"]["region"].value_counts().idxmin()


#charges specifics#
total_charges = df["charges"].sum()
average_charges = total_charges / df["charges"].count()
most_expensive = df["charges"].max()
least_expensive = df["charges"].min()
aveage_expense_male = df[df["sex"] == "male"]["charges"].mean()
aveage_expense_female = df[df["sex"] == "female"]["charges"].mean()
most_expensive_male = df[df["sex"] == "male"]["charges"].max()
least_expensive_male = df[df["sex"] == "male"]["charges"].min()
most_expensive_female = df[df["sex"] == "female"]["charges"].max()
least_expensive_female = df[df["sex"] == "female"]["charges"].min()
average_expense_smoker = df[df["smoker"] == "yes"]["charges"].mean()
average_expense_nonsmoker = df[df["smoker"] == "no"]["charges"].mean()
most_expensive_smoker = df[df["smoker"] == "yes"]["charges"].max()
least_expensive_smoker = df[df["smoker"] == "yes"]["charges"].min()
most_expensive_nonsmoker = df[df["smoker"] == "no"]["charges"].max()
least_expensive_nonsmoker = df[df["smoker"] == "no"]["charges"].min()
most_expensive_region = df.groupby("region")["charges"].mean().idxmax()
least_expensive_region = df.groupby("region")["charges"].mean().idxmin()
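
As a compact alternative to the individual variables above, the same kind of per-group statistics can be produced in one pass with `groupby` and `agg`. This is only a sketch on a made-up `sample` frame (hypothetical values), not a replacement for the cell above; the names `by_sex` and `smoker_share` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; in the notebook, use df instead of sample.
sample = pd.DataFrame({
    "sex": ["male", "female", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "age": [30, 40, 50, 20, 40, 30],
    "charges": [20000.0, 5000.0, 8000.0, 18000.0, 11000.0, 7000.0],
})

# Mean, min, and max of age and charges for each sex in a single table
by_sex = sample.groupby("sex")[["age", "charges"]].agg(["mean", "min", "max"])
print(by_sex)

# Share of smokers and non-smokers as fractions of all rows
smoker_share = sample["smoker"].value_counts(normalize=True)
print(smoker_share)
```

Individual values can still be pulled out when needed, for example `by_sex.loc["male", ("age", "mean")]`.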

I will now print descriptions for each of the specific indicators above.

In [56]:
#printing results#
print("Age Statistics:")
print(f"    Average Age: {average_age}")
print(f"    Oldest Age: {oldest}")
print(f"    Youngest Age: {youngest}")
print(f"    Average Age Male: {average_age_male}")
print(f"    Average Age Female: {average_age_female}")
print(f"    Oldest Male: {oldest_male}")
print(f"    Youngest Male: {youngest_male}")
print(f"    Oldest Female: {oldest_female}")
print(f"    Youngest Female: {youngest_female}")
print("\nSex Statistics:")
print(f"    Males: {males}")
print(f"    Females: {females}")
print("\nBMI Statistics:")
print(f"    Average BMI: {average_bmi}")
print(f"    Largest BMI: {largest_bmi}")
print(f"    Smallest BMI: {smallest_bmi}")
print(f"    Average BMI Male: {aveage_bmi_male}")
print(f"    Average BMI Female: {aveage_bmi_female}")
print(f"    Largest BMI Male: {largest_bmi_male}")
print(f"    Smallest BMI Male: {smallest_male}")
print(f"    Largest BMI Female: {largest_bmi_female}")
print(f"    Smallest BMI Female: {smallest_female}")
print("\nChildren Statistics:")
print(f"    Average Number of Children: {average_children}")
print(f"    Most Children: {most_children}")
print(f"    Least Children: {least_children}")
print(f"    Average Number of Children Male: {aveage_children_male}")
print(f"    Average Number of Children Female: {aveage_children_female}")
print(f"    Most Children Male: {most_children_male}")
print(f"    Least Children Male : {least_children_male}")
print(f"    Most Children Female: {most_children_female}")
print(f"    Least Children Female: {least_children_female}")
print("\nRegion Statistics:")
print(region_counts)
print(male_region_counts)
print(female_region_counts)
print("\nSmoker Statistics:")
print(smoker_counts)
print(male_smoker_counts)
print(female_smoker_counts)
print(f"    Region with Highest Smokers: {highest_smoker_region}")
print(f"    Region with Lowest Smokers: {lowest_smoker_region}")
print("\nCharges Statistics:")
print(f"    Total Charges: {total_charges}")
print(f"    Average Charges: {average_charges}")
print(f"    Most Expensive Charge: {most_expensive}")
print(f"    Least Expensive Charge: {least_expensive}")


Age Statistics:
    Average Age: 39.20702541106129
    Oldest Age: 64
    Youngest Age: 18
    Average Age Male: 38.917159763313606
    Average Age Female: 39.503021148036254
    Oldest Male: 64
    Youngest Male: 18
    Oldest Female: 64
    Youngest Female: 18

Sex Statistics:
    Males: 676
    Females: 662

BMI Statistics:
    Average BMI: 30.66339686098655
    Largest BMI: 53.13
    Smallest BMI: 15.96
    Average BMI Male: 30.943128698224854
    Average BMI Female: 30.37774924471299
    Largest BMI Male: 53.13
    Smallest BMI Male: 15.96
    Largest BMI Female: 48.07
    Smallest BMI Female: 16.815

Children Statistics:
    Average Number of Children: 1.0949177877429
    Most Children: 5
    Least Children: 0
    Average Number of Children Male: 1.1153846153846154
    Average Number of Children Female: 1.0740181268882176
    Most Children Male: 5
    Least Children Male : 0
    Most Children Female: 5
    Least Children Female: 0

Region Statistics:
region
southeast    364
sou

# 3. Visualizing Patterns
We’ll use simple visualizations—histograms, boxplots, scatterplots, and a correlation heatmap—to quickly spot trends. This helps us answer questions like how smoking, BMI, or age influence charges.

In [None]:
sns.set_style("whitegrid")

# Histogram of Age
plt.figure(figsize=(8,4))
sns.histplot(df['age'], bins=20, kde=True, color='skyblue')
plt.title("Distribution of Patient Ages")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# Boxplot of Charges by Smoker Status
plt.figure(figsize=(8,4))
sns.boxplot(x='smoker', y='charges', data=df, color='skyblue')
plt.title("Insurance Charges by Smoker Status")
plt.xlabel("Smoker")
plt.ylabel("Charges")
plt.show()

# Scatterplot: BMI vs Charges
plt.figure(figsize=(8,4))
sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
plt.title("BMI vs Insurance Charges (Smoker Highlighted)")
plt.xlabel("BMI")
plt.ylabel("Charges")
plt.show()

# Correlation Heatmap of Numeric Features
plt.figure(figsize=(6,5))
numeric_cols = ['age', 'bmi', 'children', 'charges']
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()
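
The heatmap above only covers the numeric columns, so smoking status is left out of it. One way to include it, sketched here under the assumption that a simple 0/1 encoding is acceptable, is to add a flag column before calling `corr()`. The `sample` frame and the `smoker_flag` column name are hypothetical.

```python
import pandas as pd

# Hypothetical rows; in the notebook, use df instead of sample.
sample = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "bmi": [22.0, 31.0, 27.5, 35.0, 29.0, 33.0],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "charges": [2000.0, 30000.0, 6000.0, 42000.0, 12000.0, 34000.0],
})

# Encode smoker as 1 (yes) / 0 (no) so it can enter the correlation matrix
encoded = sample.assign(smoker_flag=(sample["smoker"] == "yes").astype(int))
corr = encoded[["age", "bmi", "smoker_flag", "charges"]].corr()

# Correlation of each feature with charges, strongest first
print(corr["charges"].sort_values(ascending=False))
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as the numeric-only matrix.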

# 4. Analyzing Relationships
Here, we’ll dig a bit deeper into correlations and group comparisons to identify which features most strongly impact insurance costs.

In [55]:
print(f"    Average Charge Male: {aveage_expense_male}")
print(f"    Average Charge Female: {aveage_expense_female}")
print(f"    Most Expensive Charge Male: {most_expensive_male}")
print(f"    Least Expensive Charge Male: {least_expensive_male}")
print(f"    Most Expensive Charge Female: {most_expensive_female}")
print(f"    Least Expensive Charge Female: {least_expensive_female}")
print(f"    Average Charge Smoker: {average_expense_smoker}")
print(f"    Average Charge Non-Smoker: {average_expense_nonsmoker}")
print(f"    Most Expensive Charge Smoker: {most_expensive_smoker}")
print(f"    Least Expensive Charge Smoker: {least_expensive_smoker}")
print(f"    Most Expensive Charge Non-Smoker: {most_expensive_nonsmoker}")
print(f"    Least Expensive Charge Non-Smoker: {least_expensive_nonsmoker}")
print(f"    Region with Highest Average Charge: {most_expensive_region}")
print(f"    Region with Lowest Average Charge: {least_expensive_region}")

    Average Charge Male: 13956.751177721893
    Average Charge Female: 12569.578843835347
    Most Expensive Charge Male: 62592.87309
    Least Expensive Charge Male: 1121.8739
    Most Expensive Charge Female: 63770.42801
    Least Expensive Charge Female: 1607.5101
    Average Charge Smoker: 32050.23183153284
    Average Charge Non-Smoker: 8434.268297856204
    Most Expensive Charge Smoker: 63770.42801
    Least Expensive Charge Smoker: 12829.4551
    Most Expensive Charge Non-Smoker: 36910.60803
    Least Expensive Charge Non-Smoker: 1121.8739
    Region with Highest Average Charge: southeast
    Region with Lowest Average Charge: southwest
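
Based on the printed averages, smokers cost about 3.8 times as much as non-smokers on average (32050.23 / 8434.27). To see how two factors combine, a pivot table of mean charges is one option. The sketch below uses made-up rows and assumes a BMI cut-off of 30 for the high-BMI group; the `bmi_group` labels and `OBESE_BMI` constant are hypothetical names.

```python
import pandas as pd

# Hypothetical rows; in the notebook, use df instead of sample.
sample = pd.DataFrame({
    "bmi": [24.0, 32.0, 28.0, 36.0, 31.0, 26.0, 34.0, 22.0],
    "smoker": ["no", "no", "yes", "yes", "no", "yes", "yes", "no"],
    "charges": [4000.0, 6000.0, 20000.0, 44000.0,
                8000.0, 22000.0, 40000.0, 2000.0],
})

OBESE_BMI = 30  # assumed threshold for the high-BMI group

# Label each row by whether its BMI is at or above the threshold
sample["bmi_group"] = (sample["bmi"] >= OBESE_BMI).map(
    {True: "bmi>=30", False: "bmi<30"}
)

# Mean charges for every smoker / BMI-group combination
mean_charges = sample.pivot_table(
    index="smoker", columns="bmi_group", values="charges", aggfunc="mean"
)
print(mean_charges)
```

A table like this makes it easier to judge whether high BMI matters mostly for smokers or for everyone.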


# 5. Business Use Cases
From our insights, we’ll outline potential applications such as premium prediction, customer segmentation, risk scoring, and wellness program targeting.

Based on our analysis, there are several ways these insights could be applied in a real-world insurance context:

* Customer Segmentation
  * Group customers into categories based on risk factors, demographics, or health indicators.
  * Enables targeted marketing, personalized offers, or plan recommendations.
* Risk Scoring
  * Assess the likelihood of high medical costs for individuals or groups.
  * Supports underwriting decisions and prioritization of high-risk clients.
* Wellness Program Targeting
  * Identify high-risk segments (e.g., smokers, high BMI) who could benefit from wellness initiatives.
  * Helps design incentives or preventive health programs to reduce long-term costs.
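
The segmentation and risk-scoring ideas above could start from something as simple as a rule-based tier. The sketch below is purely illustrative: the tier names, the BMI threshold of 30, and the rule itself are assumptions, not something derived from the dataset or used by any insurer.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; in the notebook, use df instead.
customers = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "bmi": [34.0, 31.5, 24.0, 22.0],
})

is_smoker = customers["smoker"] == "yes"
is_high_bmi = customers["bmi"] >= 30  # assumed threshold

# Conditions are checked in order: both factors -> high, either one -> medium
customers["risk_tier"] = np.select(
    [is_smoker & is_high_bmi, is_smoker | is_high_bmi],
    ["high", "medium"],
    default="low",
)

tier_counts = customers["risk_tier"].value_counts()
print(customers)
print(tier_counts)
```

The "high" tier here would be the natural first audience for a wellness program; a real scoring scheme would need validation against actual charges.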

# 6. Wrap-Up
We now have a clear snapshot of the dataset and a solid foundation for building data-driven insurance insights.

We took a deep dive into the insurance dataset, looking at age, sex, BMI, children, region, smoking status, and charges. Here’s what we found and some actionable takeaways:

1. Age Insights

The average patient age is ~39 years, with the youngest at 18 and the oldest at 64.

Males are slightly younger on average (38.9) than females (39.5).

Oldest and youngest ages are similar across sexes, so age distribution is fairly balanced.

Takeaway: The dataset covers adults aged 18 to 64, averaging about 39, so it represents a working-age insured population.

2. Sex Distribution

Males: 676, Females: 662 → nearly even split.

Takeaway: No major gender bias—analyses or modeling won’t need heavy adjustments for sex.

3. BMI Patterns

Average BMI: ~30.7 (borderline obese)

Males slightly higher BMI (30.9) than females (30.4)

Extreme BMI values: 15.96 – 53.13, showing some outliers.

Takeaway: BMI is an important risk factor; programs targeting healthy weight could influence costs and claims.

4. Children

Average number of children: ~1.1 per patient

Most children: 5, least: 0

Distribution fairly even across males and females

Takeaway: Number of dependents may slightly impact charges, but effect seems minor.

5. Region

Most patients: Southeast (364), then Southwest, Northwest, Northeast (~324–325 each)

Male and female distribution roughly proportional across regions

Takeaway: Southeast is the largest region in the dataset (364 of 1,338 patients, about 27%); regional trends in charges and smoking should be considered for risk assessment.

6. Smoking

Smokers: 274, Non-smokers: 1064 → roughly 1 in 5 patients smoke (about 20%)

Highest number of smokers: Southeast, lowest: Southwest

Smokers have dramatically higher charges (~$32k vs ~$8.4k for non-smokers)

Takeaway: Smoking is a key driver of cost. Programs targeting smoking cessation could significantly reduce expenses.

7. Charges

Total charges: ~$17.76M, average: ~$13.3k per patient

Males have slightly higher average charges than females (~$13.96k vs ~$12.57k)

Smokers drive the highest charges, with the most expensive single charge at ~$63.8k

Region-wise, Southeast has the highest average charges, Southwest the lowest

Takeaway: Smoking is the clearest driver of charges in this analysis, with BMI and region also appearing to play a role. Risk-based pricing and targeted wellness programs could help manage costs.

Key Insights & Actionable Takeaways

Target High-Risk Groups: Smokers and high-BMI patients are driving the highest costs. Wellness programs or incentives could reduce claims.

Region-Specific Strategies: Southeast shows higher charges and more smokers. Consider region-specific health campaigns or premium adjustments.

Predictive Modeling Potential: Age, BMI, smoking status, and region are promising candidate predictors of charges and a reasonable starting point for premium or risk-scoring models.

Balanced Dataset: Sex and age distributions are fairly even, simplifying segmentation or modeling efforts.
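
To illustrate the predictive modeling idea, here is a minimal sketch of an ordinary least squares fit using NumPy. It is not part of the analysis above: the data is synthetic, generated from made-up coefficients with no noise, so the fit simply recovers those coefficients. On the real dataset, `df` would replace `toy` and the fit would be far less exact.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic features loosely shaped like the insurance columns
toy = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.uniform(16, 50, size=n),
    "smoker": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
smoker_flag = (toy["smoker"] == "yes").astype(float)

# Made-up coefficients, used only to generate noise-free example charges
toy["charges"] = 250 * toy["age"] + 300 * toy["bmi"] + 24000 * smoker_flag - 10000

# Design matrix: intercept, age, bmi, smoker flag
X = np.column_stack([np.ones(n), toy["age"], toy["bmi"], smoker_flag])
coef, *_ = np.linalg.lstsq(X, toy["charges"].to_numpy(), rcond=None)

for name, value in zip(["intercept", "age", "bmi", "smoker"], coef):
    print(f"{name}: {value:.2f}")
```

A proper model would also need a train/test split, an encoding for region, and an error metric before it could support premium or risk-scoring decisions.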