### **About the Project: Medical Cost Prediction Using Linear Regression**

This project focuses on predicting individual medical insurance costs by leveraging demographic, lifestyle, and health-related data. The dataset includes details about beneficiaries, such as their age, gender, body mass index (BMI), smoking habits, and residential regions in the U.S., alongside the corresponding medical expenses billed to their health insurance.

The primary objective is to develop a **linear regression model** that accurately estimates medical charges based on these features. This model will provide valuable insights into the key factors influencing healthcare costs, enabling insurance companies to optimize premium calculations and empowering individuals to make informed lifestyle choices.

---

### **Key Features of the Dataset**

1. **Demographic Variables**:
   - **`age`**: Age of the primary beneficiary.
   - **`sex`**: Gender of the insurance holder (male or female).
   - **`region`**: Residential region in the U.S. (northeast, southeast, southwest, northwest).

2. **Lifestyle and Health Metrics**:
   - **`bmi`**: Body Mass Index, a standard measure of body weight relative to height.
   - **`smoker`**: Smoking status of the beneficiary (yes or no).

3. **Family Characteristics**:
   - **`children`**: Number of dependents covered under the insurance.

4. **Target Variable**:
   - **`charges`**: Total medical costs billed to the beneficiary's health insurance.

---

### **Objectives**
1. Conduct **exploratory data analysis (EDA)** to uncover patterns and relationships between variables.
2. Build and optimize a **linear regression model** to predict the medical charges (`charges`).
3. Assess model performance using metrics such as **R-squared**, **Mean Squared Error (MSE)**, and residual analysis.
4. Analyze the significance of predictors like smoking habits, BMI, and age in determining healthcare costs.
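
The two metrics named in objective 3 can be made concrete with a minimal, dependency-free sketch. The function names `mean_squared_error` and `r_squared` and the toy numbers are illustrative assumptions, not part of this project's code; a real workflow would more likely use the equivalents in `sklearn.metrics`.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the variance in y_true explained by y_pred."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical, not taken from the dataset)
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MSE = {mse:.1f}, R^2 = {r2:.3f}")
```

MSE is expressed in squared units of `charges`, so it is sensitive to a few large errors, while R-squared is unitless and easier to compare across models.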

---

### **Expected Outcomes**
- A detailed understanding of how demographic, health, and lifestyle factors drive medical insurance costs.
- A high-performing linear regression model capable of making reliable cost predictions.
- Actionable insights into the most influential predictors, such as smoking and BMI, to guide healthcare policy and individual decision-making.

---

### **Applications**
This project illustrates the practical use of machine learning in healthcare by delivering insights that benefit multiple stakeholders:
- **Insurance Companies**: Refine risk assessment and premium structures.
- **Healthcare Providers**: Recognize cost patterns to enhance resource allocation.
- **Policyholders**: Identify personal factors contributing to insurance expenses.

---

### **Problem Statement**

The increasing cost of healthcare services poses a significant challenge for both insurance providers and policyholders. Identifying the underlying factors influencing these costs (such as age, BMI, and smoking habits) is critical for making fair, data-driven decisions in pricing and healthcare planning.

This project aims to build a predictive model using linear regression to estimate individual medical charges based on demographic, lifestyle, and health attributes. By uncovering patterns in medical cost distribution, this project seeks to enable proactive risk management and encourage healthier lifestyle choices among beneficiaries.
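
As a minimal sketch of what "linear regression" means here, the one-predictor case has a closed-form least squares solution. The helper `fit_simple_ols` and the toy data are hypothetical and only illustrate the idea; the full model would use several predictors (including encoded categorical ones) and would normally be fitted with a library.

```python
def fit_simple_ols(x, y):
    """Closed-form least squares fit of y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data: charges rise by exactly 250 per year of age
ages = [20, 30, 40, 50, 60]
toy_charges = [3000 + 250 * a for a in ages]

intercept, slope = fit_simple_ols(ages, toy_charges)
print(f"charges ~ {intercept:.0f} + {slope:.0f} * age")
```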


## Import Data and Required Packages

Importing the pandas, NumPy, Matplotlib, Seaborn, and warnings libraries.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


In [4]:
df = pd.read_csv('data/medical_insurance.csv')
df.head()

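|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |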


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [7]:
# 1. Basic Statistical Summary
def basic_stats():
    print("Dataset Overview:")
    print(f"Total Entries: {len(df)}")
    print("\nNumerical Columns Summary:")
    print(df.describe())
    
    print("\nCategorical Columns Distribution:")
    for col in ['sex', 'smoker', 'region']:
        print(f"\n{col.capitalize()} Distribution:")
        print(df[col].value_counts(normalize=True))

In [8]:
# 2. Correlation and Distribution Analysis
def correlation_analysis():
    # Correlation Heatmap
    plt.figure(figsize=(10, 8))
    numeric_cols = ['age', 'bmi', 'children', 'charges']
    correlation_matrix = df[numeric_cols].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
    plt.title('Correlation Heatmap of Numerical Features')
    plt.tight_layout()
    plt.savefig('correlation_heatmap.png')
    plt.close()

In [9]:
# 3. Detailed Feature Analysis
def feature_analysis():
    # Age Distribution
    plt.figure(figsize=(12, 4))
    plt.subplot(131)
    df['age'].hist(bins=30)
    plt.title('Age Distribution')
    
    # BMI Distribution
    plt.subplot(132)
    df['bmi'].hist(bins=30)
    plt.title('BMI Distribution')
    
    # Charges Distribution
    plt.subplot(133)
    df['charges'].hist(bins=30)
    plt.title('Insurance Charges Distribution')
    
    plt.tight_layout()
    plt.savefig('numerical_distributions.png')
    plt.close()

In [10]:
# 4. Categorical Feature Impact on Charges
def categorical_impact():
    plt.figure(figsize=(15, 5))
    
    # Sex Impact
    plt.subplot(131)
    sns.boxplot(x='sex', y='charges', data=df)
    plt.title('Charges by Sex')
    
    # Smoker Impact
    plt.subplot(132)
    sns.boxplot(x='smoker', y='charges', data=df)
    plt.title('Charges by Smoker Status')
    
    # Region Impact
    plt.subplot(133)
    sns.boxplot(x='region', y='charges', data=df)
    plt.title('Charges by Region')
    
    plt.tight_layout()
    plt.savefig('categorical_charges_impact.png')
    plt.close()

In [11]:
# 5. Advanced Analysis: Multivariate Relationships
def multivariate_analysis():
    plt.figure(figsize=(15, 5))
    
    # BMI vs Charges by Smoker Status
    plt.subplot(131)
    sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
    plt.title('BMI vs Charges by Smoker Status')
    
    # Age vs Charges by Sex
    plt.subplot(132)
    sns.scatterplot(x='age', y='charges', hue='sex', data=df)
    plt.title('Age vs Charges by Sex')
    
    # Children vs Charges
    plt.subplot(133)
    sns.boxplot(x='children', y='charges', data=df)
    plt.title('Charges by Number of Children')
    
    plt.tight_layout()
    plt.savefig('multivariate_relationships.png')
    plt.close()

# Execute all analyses
basic_stats()
correlation_analysis()
feature_analysis()
categorical_impact()
multivariate_analysis()

# Key Insights Compilation (defined here but not called in the run shown below)
def compile_insights():
    insights = {
        "Dataset Composition": {
            "Total Entries": len(df),
            "Unique Categories": {col: df[col].nunique() for col in ['sex', 'smoker', 'region']}
        },
        "Distribution Characteristics": {
            "Age": {"Mean": df['age'].mean(), "Median": df['age'].median()},
            "BMI": {"Mean": df['bmi'].mean(), "Median": df['bmi'].median()},
            "Charges": {"Mean": df['charges'].mean(), "Median": df['charges'].median()}
        },
        "Key Observations": [
            f"Smokers have significantly higher insurance charges (Mean: {df[df['smoker']=='yes']['charges'].mean():.2f})",
            f"Non-smokers have lower charges (Mean: {df[df['smoker']=='no']['charges'].mean():.2f})",
            f"BMI appears to have a positive correlation with insurance charges (Correlation: {df['bmi'].corr(df['charges']):.2f})"
        ]
    }
    
    print("\n--- Key Insights ---")
    for section, details in insights.items():
        print(f"\n{section}:")
        if isinstance(details, dict):
            for k, v in details.items():
                print(f"{k}: {v}")
        else:
            for item in details:
                print(f"- {item}")


Dataset Overview:
Total Entries: 1338

Numerical Columns Summary:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

Categorical Columns Distribution:

Sex Distribution:
sex
male      0.505232
female    0.494768
Name: proportion, dtype: float64

Smoker Distribution:
smoker
no     0.795217
yes    0.204783
Name: proportion, dtype: float64

Region Distribution:
region
southeast    0.272048
southwest    0.242900
northwest    0.242900
northeast    0.242152
Name: proportion, dtype: float64


Key insights from the statistical summary:

Numerical Variables Overview:
1. Age
- Mean: 39.2 years
- Range: 18-64 years
- Median: 39 years

2. BMI (Body Mass Index)
- Mean: 30.66
- Range: 15.96-53.13
- Median: 30.4

3. Children
- Mean: 1.09
- Range: 0-5 children
- Median: 1 child

4. Insurance Charges
- Mean: $13,270.42
- Range: $1,121.87-$63,770.43
- Median: $9,382.03

Categorical Variables Distribution:
1. Sex
- Male: 50.5%
- Female: 49.5%

2. Smoker Status
- Non-smokers: 79.5%
- Smokers: 20.5%

3. Region
- Roughly equal distribution across:
  - Southeast: 27.20%
  - Southwest: 24.29%
  - Northwest: 24.29%
  - Northeast: 24.22%

Key Observations:
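- Nearly balanced gender representation
- Majority are non-smokers (about four in five)
- Roughly uniform regional distribution, with the southeast slightly larger (27.2%)
- Wide variation in insurance charges: the standard deviation ($12,110) is close to the mean ($13,270), and a mean well above the median ($9,382) indicates a right-skewed distribution
- Moderate age and BMI variability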
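
One way to put numbers on "wide" versus "moderate" variability is the coefficient of variation (standard deviation divided by mean). The sketch below only reuses figures printed by `df.describe()` above; the variable names are illustrative.

```python
# (mean, std) pairs copied from the df.describe() output above
summary = {
    "age": (39.207025, 14.049960),
    "bmi": (30.663397, 6.098187),
    "charges": (13270.422265, 12110.011237),
}

# Coefficient of variation: spread relative to the typical value
cv = {name: std / mean for name, (mean, std) in summary.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.2f}")

# A mean well above the median is a quick indicator of right skew
charges_median = 9382.033
mean_to_median = summary["charges"][0] / charges_median
print(f"charges mean / median = {mean_to_median:.2f}")
```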

In [12]:
# 6. Advanced Statistical Analysis
from scipy import stats  # required by stats.shapiro and stats.f_oneway below
def advanced_statistical_analysis():
    # Normality Test
    print("Normality Tests (Shapiro-Wilk):")
    numeric_cols = ['age', 'bmi', 'children', 'charges']
    for col in numeric_cols:
        _, p_value = stats.shapiro(df[col])
        print(f"{col}: p-value = {p_value:.4f}")
    
    # ANOVA for categorical variables' impact on charges
    print("\nANOVA Tests:")
    categorical_cols = ['sex', 'smoker', 'region']
    for col in categorical_cols:
        groups = [group['charges'].values for name, group in df.groupby(col)]
        f_statistic, p_value = stats.f_oneway(*groups)
        print(f"{col}: F-statistic = {f_statistic:.4f}, p-value = {p_value:.4f}")
    
    # Advanced Correlation Analysis
    correlation_matrix = df[numeric_cols].corr()
    print("\nCorrelation Matrix:")
    print(correlation_matrix)

In [15]:
from scipy import stats

# Detailed Visualization
def comprehensive_visualization():
    plt.figure(figsize=(20, 15))
    
    # 1. Charges Distribution by Critical Factors
    plt.subplot(2, 3, 1)
    sns.boxplot(x='smoker', y='charges', data=df)
    plt.title('Charges by Smoking Status')
    
    # 2. BMI vs Charges with Smoker Overlay
    plt.subplot(2, 3, 2)
    sns.scatterplot(x='bmi', y='charges', hue='smoker', data=df)
    plt.title('BMI vs Charges by Smoking Status')
    
    # 3. Age Distribution of Smokers vs Non-Smokers
    plt.subplot(2, 3, 3)
    sns.boxplot(x='smoker', y='age', data=df)
    plt.title('Age Distribution by Smoking Status')
    
    # 4. Charges by Number of Children and Smoker Status
    plt.subplot(2, 3, 4)
    sns.boxplot(x='children', y='charges', hue='smoker', data=df)
    plt.title('Charges by Children and Smoking')
    
    # 5. Regional Charges Analysis
    plt.subplot(2, 3, 5)
    sns.boxplot(x='region', y='charges', data=df)
    plt.title('Charges by Region')
    
    # 6. BMI Category Distribution
    plt.subplot(2, 3, 6)
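    # Note: this adds a 'bmi_category' column to df as a side effect;
    # statistical_insights() depends on it, so this function must run first.
    # pd.cut bins are right-closed, so a BMI of exactly 25.0 is labelled 'Normal'.
    df['bmi_category'] = pd.cut(df['bmi'], 
        bins=[0, 18.5, 25, 30, 100], 
        labels=['Underweight', 'Normal', 'Overweight', 'Obese'])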
    df['bmi_category'].value_counts().plot(kind='pie', autopct='%1.1f%%')
    plt.title('BMI Category Distribution')
    
    plt.tight_layout()
    plt.savefig('comprehensive_insurance_analysis.png')
    plt.close()


In [17]:
# Key Statistical Insights
def statistical_insights():
    insights = {
        "Smoking Impact": {
            "Average Charges (Smokers)": df[df['smoker']=='yes']['charges'].mean(),
            "Average Charges (Non-Smokers)": df[df['smoker']=='no']['charges'].mean(),
            "Percentage Increase": (df[df['smoker']=='yes']['charges'].mean() / 
                                    df[df['smoker']=='no']['charges'].mean() - 1) * 100
        },
        "BMI Correlation": {
            "Correlation with Charges": df['bmi'].corr(df['charges']),
            "Average Charges by BMI Category": df.groupby('bmi_category')['charges'].mean()
        },
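        "Age and Charges": {
            "Age-Charges Correlation": df['age'].corr(df['charges']),
            # Note: bins are right-closed, i.e. (0, 30], (30, 45], (45, 60];
            # ages 61-64 fall outside them and are dropped from this summary.
            "Charges by Age Group": df.groupby(pd.cut(df['age'], bins=[0, 30, 45, 60]))['charges'].mean()
        }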
    }
    
    print("\n--- Comprehensive Statistical Insights ---")
    for section, details in insights.items():
        print(f"\n{section}:")
        for k, v in details.items():
            print(f"{k}: {v}")

# Execute Analyses
advanced_statistical_analysis()
comprehensive_visualization()
statistical_insights()

Normality Tests (Shapiro-Wilk):
age: p-value = 0.0000
bmi: p-value = 0.0000
children: p-value = 0.0000
charges: p-value = 0.0000

ANOVA Tests:
sex: F-statistic = 4.3997, p-value = 0.0361
smoker: F-statistic = 2177.6149, p-value = 0.0000
region: F-statistic = 2.9696, p-value = 0.0309

Correlation Matrix:
               age       bmi  children   charges
age       1.000000  0.109272  0.042469  0.299008
bmi       0.109272  1.000000  0.012759  0.198341
children  0.042469  0.012759  1.000000  0.067998
charges   0.299008  0.198341  0.067998  1.000000

--- Comprehensive Statistical Insights ---

Smoking Impact:
Average Charges (Smokers): 32050.23183153284
Average Charges (Non-Smokers): 8434.268297856204
Percentage Increase: 280.0001458298317

BMI Correlation:
Correlation with Charges: 0.19834096883362895
Average Charges by BMI Category: bmi_category
Underweight     8657.620652
Normal         10435.440719
Overweight     10997.803881
Obese          15560.926321
Name: charges, dtype: float64

Age

Analysis Breakdown:

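Normality Tests:
- Shapiro-Wilk rejects normality for all four variables (age, bmi, children, charges; p < 0.05). With 1,338 observations the test flags even small departures, so the result is best read alongside the histograms.
- For `charges`, this suggests considering a transformation (for example, a log) or non-parametric methods. Linear regression assumes approximately normal residuals, not normally distributed predictors.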
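
A common response to a right-skewed target is a log transform. The sketch below uses synthetic log-normal values as a stand-in for `charges` (an assumption for illustration only; it does not use the dataset) and compares skewness before and after the transform. The helper `skewness` is hypothetical.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


random.seed(0)
# Synthetic right-skewed values, NOT the real charges column
synthetic_charges = [random.lognormvariate(9.0, 1.0) for _ in range(2000)]
log_charges = [math.log(v) for v in synthetic_charges]

skew_raw = skewness(synthetic_charges)
skew_log = skewness(log_charges)
print(f"skewness raw = {skew_raw:.2f}, after log = {skew_log:.2f}")
```

Whether a log transform actually helps for the real data would need to be checked on the model residuals.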

ANOVA Tests:
1. Sex: Statistically significant but small difference in charges (F = 4.40, p = 0.0361)
2. Smoker: Highly significant difference in charges (F = 2177.61, p < 0.0001)
3. Region: Statistically significant but small variation in charges (F = 2.97, p = 0.0309)
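
For readers unfamiliar with the F-statistic reported above, here is a hand-rolled sketch of the one-way ANOVA computation. The notebook itself uses `scipy.stats.f_oneway`; the function `one_way_anova_f` and the toy groups below are hypothetical and serve only to show where the number comes from.

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy groups (hypothetical): the third group has a clearly higher mean
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(toy_groups)
print(f"F = {f_stat:.2f}")
```

A large F means the group means differ by much more than the spread within groups would suggest, which is the pattern seen for `smoker`.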

Correlation Matrix:
- Weak correlations overall among the numerical features
- Strongest correlations with charges:
  1. Age-Charges: 0.299 (weak-to-moderate positive)
  2. BMI-Charges: 0.198 (weak positive)
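
The coefficients above are Pearson correlations, which is what `DataFrame.corr()` computes by default. A minimal sketch of the formula follows, with a hypothetical helper `pearson_r` and toy data. Squaring a coefficient gives the share of variance a one-predictor linear fit would explain, which helps calibrate words like "weak" and "moderate".

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy data (hypothetical)
r_toy = pearson_r([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
print(f"toy r = {r_toy:.2f}")

# Age-charges coefficient copied from the correlation matrix above
age_r = 0.299008
age_r2 = age_r ** 2
print(f"age alone explains about {age_r2:.1%} of the variance in charges")
```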

Smoking Impact:
- Smokers' average charges: $32,050
- Non-smokers' average charges: $8,434
- Smokers' average charges are about 280% higher, roughly 3.8 times the non-smoker average
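
The percentage figure follows directly from the two group means printed in the output above. This short check reproduces the arithmetic; the variable names are illustrative.

```python
# Group means copied from the "Smoking Impact" output above
smoker_mean = 32050.23183153284
non_smoker_mean = 8434.268297856204

ratio = smoker_mean / non_smoker_mean   # how many times larger
pct_increase = (ratio - 1) * 100        # relative increase in percent

print(f"ratio = {ratio:.2f}x, increase = {pct_increase:.1f}%")
```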

Charges by Age Group (bins are right-closed; ages 61-64 fall outside the last bin and are excluded):
- 18-30 years: $9,397
- 31-45 years: $12,647
- 46-60 years: $16,341
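
The age groups come from `pd.cut(df['age'], bins=[0, 30, 45, 60])`, whose intervals are right-closed by default, while the maximum age in the data is 64. The sketch below mimics that binning rule in plain Python to show which ages land outside the bins; `age_group` is a hypothetical helper, and the alternative edges are only a suggestion.

```python
from bisect import bisect_left


def age_group(age, edges=(0, 30, 45, 60)):
    """Label for right-closed bins (edges[i-1], edges[i]]; None if outside."""
    i = bisect_left(edges, age)
    if i == 0 or i == len(edges):
        return None  # below or equal to the first edge, or above the last
    return f"({edges[i - 1]}, {edges[i]}]"


print(age_group(30))   # boundary value falls in the lower bin
print(age_group(64))   # outside the original bins

# Extending the last edge to the observed maximum keeps every beneficiary
print(age_group(64, edges=(0, 30, 45, 64)))
```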

Key Insights:
- Smoking is the most significant charge determinant
- Charges increase with age
- Mild variations by sex and region

Recommended Next Steps:
1. Non-parametric statistical modeling
2. Investigate interaction effects
3. Consider predictive modeling
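
As a starting point for step 2, an interaction effect can be explored by adding a product feature, for example BMI multiplied by a smoker indicator, so that a linear model can fit a different BMI slope for smokers. The sketch below is a hedged illustration: the feature names `smoker_flag` and `bmi_x_smoker` and the helper `add_interaction` are assumptions, and the two rows are copied from the `df.head()` output above.

```python
# First two rows from df.head(), reduced to the columns used here
rows = [
    {"bmi": 27.9, "smoker": "yes"},
    {"bmi": 33.77, "smoker": "no"},
]


def add_interaction(row):
    """Return a copy of row with a 0/1 smoker flag and a BMI x smoker term."""
    smoker_flag = 1 if row["smoker"] == "yes" else 0
    return {
        **row,
        "smoker_flag": smoker_flag,
        # Equals bmi for smokers and 0 for non-smokers
        "bmi_x_smoker": row["bmi"] * smoker_flag,
    }


features = [add_interaction(r) for r in rows]
for f in features:
    print(f)
```

Whether the interaction term improves the model would need to be confirmed on held-out data.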