## PHASE 3 PROJECT
### INTRODUCTION
Rising healthcare costs pose a major challenge for insurance providers, particularly due to high-risk individuals who require frequent and expensive medical care. This project applies Machine Learning Classification techniques to predict high-risk policyholders using demographic, lifestyle, medical, and insurance data. The goal is to enable proactive risk management and improve decision-making within health insurance operations.

### Business Problem Statement
Health insurance providers face rising medical costs driven by a subset of high-risk policyholders who require frequent and costly healthcare services. Currently, identifying these individuals early remains a challenge, leading to inefficient cost management and reactive care strategies.
This project aims to develop a classification model that predicts whether an insured individual is high-risk using demographic, medical, lifestyle, and insurance data. Accurate identification of high-risk individuals will enable the insurer to implement proactive interventions, improve resource allocation, and reduce overall healthcare costs.

### Business Objective
The objective of this project is to predict whether an insured individual is high-risk based on demographic, medical, lifestyle, and insurance-related data. This will allow the company to:
  - Implement proactive care management programs
  - Improve risk-based pricing strategies
  - Reduce avoidable medical costs

### Dataset Overview
The dataset used in this project contains records of 100,000 insured individuals collected from a health insurance system. It includes a wide range of variables capturing demographic characteristics, lifestyle behaviors, medical history, insurance plan details, and healthcare utilization. These features provide a comprehensive view of factors that influence an individual’s health risk profile.

In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('medical_insurance.csv')
df.head()

index,person_id,age,sex,region,urban_rural,income,education,marital_status,employment_status,household_size,...,liver_disease,arthritis,mental_health,proc_imaging_count,proc_surgery_count,proc_physio_count,proc_consult_count,proc_lab_count,is_high_risk,had_major_procedure
0,75722,52,Female,North,Suburban,22700.0,Doctorate,Married,Retired,3,...,0,1,0,1,0,2,0,1,0,0
1,80185,79,Female,North,Urban,12800.0,No HS,Married,Employed,3,...,0,1,1,0,0,1,0,1,1,0
2,19865,68,Male,North,Rural,40700.0,HS,Married,Retired,5,...,0,0,1,1,0,2,1,0,1,0
3,76700,15,Male,North,Suburban,15600.0,Some College,Married,Self-employed,5,...,0,0,0,1,0,0,1,0,0,0
4,92992,53,Male,Central,Suburban,89600.0,Doctorate,Married,Self-employed,2,...,0,1,0,2,0,1,1,0,1,0


In [4]:
# Checking for missing values
df.isnull().sum()

person_id                          0
age                                0
sex                                0
region                             0
urban_rural                        0
income                             0
education                          0
marital_status                     0
employment_status                  0
household_size                     0
dependents                         0
bmi                                0
smoker                             0
alcohol_freq                   30083
visits_last_year                   0
hospitalizations_last_3yrs         0
days_hospitalized_last_3yrs        0
medication_count                   0
systolic_bp                        0
diastolic_bp                       0
ldl                                0
hba1c                              0
plan_type                          0
network_tier                       0
deductible                         0
copay                              0
policy_term_years                  0
p

In [5]:
# To check class balance, which influences model choice and evaluation metrics
df['is_high_risk'].value_counts()


is_high_risk
0    63219
1    36781
Name: count, dtype: int64

The target variable is_high_risk is moderately imbalanced: approximately 63.2% of individuals are classified as non–high-risk and 36.8% as high-risk. Although the imbalance is not severe, evaluation should look beyond accuracy. Recall for the high-risk class is particularly important, because failing to identify high-risk individuals can lead to higher healthcare costs and missed preventive interventions.
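To make the point about accuracy concrete, the sketch below rebuilds the class shares from the counts reported above and scores a hypothetical baseline that always predicts the majority class. The baseline is an illustration only and is not a model used in this project.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported by df['is_high_risk'].value_counts() above
counts = pd.Series({0: 63219, 1: 36781}, name="count")
share = (counts / counts.sum() * 100).round(1)
print(share)

# Hypothetical baseline: always predict the majority class (0)
y_true = [0] * 63219 + [1] * 36781
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, pos_label=1)
print(f"accuracy={baseline_accuracy:.3f}, high-risk recall={baseline_recall:.3f}")
```

Such a baseline reaches the majority-class share in accuracy while identifying no high-risk individuals, which is why recall on the positive class is tracked separately.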

### Numerical Features vs Target
This section compares key numerical variables between high-risk and non–high-risk individuals.

In [7]:
# Age vs Risk
df.groupby('is_high_risk')['age'].describe()
# High-risk individuals are older on average (mean age about 59 vs 41), suggesting age is an important factor in risk classification.


is_high_risk,count,mean,std,min,25%,50%,75%,max
0,63219.0,40.702352,13.296727,0.0,32.0,41.0,50.0,72.0
1,36781.0,59.242217,13.185118,0.0,51.0,59.0,68.0,100.0


In [8]:
# BMI vs Risk
df.groupby('is_high_risk')['bmi'].describe()
# BMI is only slightly higher among high-risk individuals (mean about 27.5 vs 26.7), so BMI alone separates the two groups weakly.

is_high_risk,count,mean,std,min,25%,50%,75%,max
0,63219.0,26.666371,4.889411,12.0,23.4,26.7,29.8,47.3
1,36781.0,27.547644,5.123517,12.0,24.0,27.7,31.2,50.4


High-risk individuals incur significantly higher annual medical costs compared to non–high-risk individuals. This highlights the financial impact of high-risk policyholders and reinforces the business value of accurately identifying high-risk individuals early for proactive care management and cost control.

In [11]:
# Annual Medical Cost vs Risk
df.groupby('is_high_risk')['annual_medical_cost'].describe()
# High-risk individuals have a higher average annual medical cost (about 4,042 vs 2,408).
# This aligns with the insurer's cost concerns.

is_high_risk,count,mean,std,min,25%,50%,75%,max
0,63219.0,2408.490469,2390.271795,55.55,988.975,1710.91,2967.9,65431.24
1,36781.0,4042.38144,3883.946666,102.95,1699.83,2913.31,4995.88,65724.9


In [12]:
df.groupby('is_high_risk')['annual_medical_cost'].mean()


is_high_risk
0    2408.490469
1    4042.381440
Name: annual_medical_cost, dtype: float64
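
As a quick worked calculation, the two group means above can be turned into a ratio and an absolute gap. The values are copied from the output; the variable names are illustrative.

```python
# Group means copied from the output above
mean_cost_low = 2408.490469   # is_high_risk == 0
mean_cost_high = 4042.381440  # is_high_risk == 1

cost_ratio = mean_cost_high / mean_cost_low  # how many times higher
cost_gap = mean_cost_high - mean_cost_low    # absolute difference per person
print(f"ratio: {cost_ratio:.2f}x, gap: {cost_gap:,.2f}")
```

Because both groups have long right tails (maximum values above 65,000), the medians in the `describe()` output are a useful cross-check on this comparison of means.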

In [15]:
# Chronic Conditions vs Risk
df.groupby('is_high_risk')['chronic_count'].describe()
# High-risk individuals have a higher average number of chronic conditions

is_high_risk,count,mean,std,min,25%,50%,75%,max
0,63219.0,0.379649,0.54488,0.0,0.0,0.0,1.0,3.0
1,36781.0,1.317827,0.835109,0.0,1.0,1.0,2.0,6.0


High-risk individuals exhibit a significantly higher number of chronic conditions compared to non–high-risk individuals, confirming the importance of medical history in risk classification.

In [17]:
# Individual Chronic Conditions
conditions = [
    'hypertension', 'diabetes', 'asthma', 'copd',
    'cardiovascular_disease', 'cancer_history',
    'kidney_disease', 'liver_disease', 'arthritis',
    'mental_health'
]

# For each condition, show the share of high-risk individuals with and without it.
# Presence of a condition is associated with a markedly higher share of high-risk individuals.
for condition in conditions:
    print(f"\n{condition}")
    print(pd.crosstab(df[condition], df['is_high_risk'], normalize='index') * 100)



hypertension
is_high_risk          0          1
hypertension                      
0             70.319503  29.680497
1             35.419022  64.580978

diabetes
is_high_risk          0          1
diabetes                          
0             66.197337  33.802663
1             31.537298  68.462702

asthma
is_high_risk         0         1
asthma                          
0             65.15678  34.84322
1             32.24053  67.75947

copd
is_high_risk          0          1
copd                              
0             64.382553  35.617447
1             32.016690  67.983310

cardiovascular_disease
is_high_risk                    0          1
cardiovascular_disease                      
0                       64.891498  35.108502
1                       32.206371  67.793629

cancer_history
is_high_risk            0          1
cancer_history                      
0               63.906632  36.093368
1               31.938633  68.061367

kidney_disease
is_high_risk            0 

In [19]:
# Procedures & Interventions vs Risk
df.groupby('is_high_risk')['proc_surgery_count'].mean()
# Medical procedures signal severity and cost intensity

is_high_risk
0    0.108543
1    0.244882
Name: proc_surgery_count, dtype: float64

In [20]:
# Diagnostic & Support Procedures
procedure_cols = [
    'proc_imaging_count',
    'proc_lab_count',
    'proc_consult_count',
    'proc_physio_count'
]

df.groupby('is_high_risk')[procedure_cols].mean()
# High-risk individuals undergo more medical procedures, indicating higher resource utilization.

is_high_risk,proc_imaging_count,proc_lab_count,proc_consult_count,proc_physio_count
0,0.45586,0.457979,0.459324,0.459482
1,0.599059,0.597075,0.59528,0.592453


In [21]:
# Days Hospitalized
df.groupby('is_high_risk')['days_hospitalized_last_3yrs'].mean()
# High-risk individuals spend slightly more days in hospital on average (about 0.43 vs 0.34 days over three years).

is_high_risk
0    0.341369
1    0.428319
Name: days_hospitalized_last_3yrs, dtype: float64

In [22]:
# Risk Score Validation
df.groupby('is_high_risk')['risk_score'].describe()
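# The high-risk group has a much higher average risk score (about 0.79 vs 0.36), confirming internal consistency of the dataset.
# The two ranges do not overlap (class 0 max 0.5934, class 1 min 0.6044), which suggests is_high_risk may be derived from risk_score;
# using risk_score as a predictor could therefore leak the target.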

is_high_risk,count,mean,std,min,25%,50%,75%,max
0,63219.0,0.362822,0.149725,0.0,0.2527,0.3736,0.4835,0.5934
1,36781.0,0.789747,0.130678,0.6044,0.6703,0.7692,0.9011,1.0
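
In the summary above, the largest risk_score in class 0 (0.5934) is below the smallest in class 1 (0.6044). A minimal sketch of how to check for this kind of perfect separation follows. It uses a hypothetical six-row frame, not the real data. That the label is derived from risk_score is an inference from the summary, not something the dataset description states.

```python
import pandas as pd

# Hypothetical miniature frame; values chosen to mimic the summary above
toy = pd.DataFrame({
    "risk_score":   [0.25, 0.37, 0.59, 0.61, 0.77, 0.90],
    "is_high_risk": [0,    0,    0,    1,    1,    1],
})

max_low = toy.loc[toy["is_high_risk"] == 0, "risk_score"].max()
min_high = toy.loc[toy["is_high_risk"] == 1, "risk_score"].min()

# If the ranges do not overlap, a single threshold reproduces the label exactly
perfectly_separated = max_low < min_high
threshold = (max_low + min_high) / 2
rule_matches_label = (
    (toy["risk_score"] > threshold).astype(int) == toy["is_high_risk"]
).all()
print(perfectly_separated, rule_matches_label)
```

If the same check passes on the full dataset, a model given risk_score can reach near-perfect scores without learning anything from the other features.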


In [23]:
# Insurance Plan & Financial Features
# Plan Type vs Risk
pd.crosstab(df['plan_type'], df['is_high_risk'], normalize='index') * 100


plan_type,is_high_risk=0 (%),is_high_risk=1 (%)
EPO,62.70088,37.29912
HMO,63.142586,36.857414
POS,63.606645,36.393355
PPO,63.352006,36.647994


In [24]:
# Deductible & Premiums
df.groupby('is_high_risk')[['deductible', 'annual_premium', 'monthly_premium']].mean()
# Deductibles are nearly identical across the two groups, while high-risk individuals pay higher premiums on average (about 710 vs 508 annually), which may reflect risk-based pricing.

is_high_risk,deductible,annual_premium,monthly_premium
0,1224.212658,507.948997,42.329089
1,1231.043202,710.148617,59.179036


In [25]:
# Major Procedure vs Risk (Sanity Check)
pd.crosstab(df['had_major_procedure'], df['is_high_risk'], normalize='index') * 100
# Individuals who had a major procedure are more often high-risk (about 47.1% vs 34.7%), which is the expected direction.

had_major_procedure,is_high_risk=0 (%),is_high_risk=1 (%)
0,65.319764,34.680236
1,52.940483,47.059517


### Overall EDA Summary
Exploratory data analysis reveals clear and consistent differences between high-risk and non–high-risk individuals across demographic, lifestyle, medical, and insurance-related features. High-risk individuals exhibit greater healthcare utilization, higher medical costs, increased prevalence of chronic conditions, and more frequent medical procedures. These patterns validate the relevance of the dataset for predictive modeling and provide strong justification for the selected target variable and feature set.

### Data Preprocessing
Prior to model training, the dataset was prepared by separating the target variable from predictor features and removing unique identifiers. Numerical features were imputed using the median and scaled to ensure comparable ranges, while categorical features were imputed using the most frequent category and one-hot encoded. A train–test split was applied with stratification to preserve the class distribution. All preprocessing steps were implemented using a pipeline to prevent data leakage and ensure reproducibility.
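
The steps described above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a minimal illustration on a hypothetical eight-row frame, not the project's actual pipeline. The `toy` frame, its category labels, and the step names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature stand-in for the insurance data (not the real dataset)
toy = pd.DataFrame({
    "age": [52, 79, 68, 15, 53, 41, 60, 33],
    "bmi": [27.1, np.nan, 31.4, 22.0, 29.3, 24.8, 30.2, 26.5],
    "alcohol_freq": ["Weekly", np.nan, "Daily", np.nan,
                     "Weekly", "Occasional", "Daily", "Weekly"],
    "plan_type": ["PPO", "HMO", "PPO", "EPO", "POS", "HMO", "PPO", "EPO"],
    "is_high_risk": [0, 1, 1, 0, 1, 0, 1, 0],
})
X_toy = toy.drop(columns=["is_high_risk"])
y_toy = toy["is_high_risk"]

numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

# Numerical: median imputation, then scaling
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical: most-frequent imputation, then one-hot encoding
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Statistics (medians, modes, means, categories) are learned from the training rows only
X_tr_ready = preprocess.fit_transform(X_tr)
X_te_ready = preprocess.transform(X_te)
print(X_tr_ready.shape, X_te_ready.shape)
```

Fitting the transformer on the training split and only transforming the test split is what keeps test-set information out of the imputation and scaling statistics.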

In [26]:
# Separate Features and Target. To clearly define inputs (X) and output (y)
# Target variable
y = df['is_high_risk']

# Features (drop identifiers and target)
X = df.drop(columns=['is_high_risk', 'person_id'])


In [27]:
# Identify numerical and categorical columns; each type needs different preprocessing steps.
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object']).columns

numerical_cols, categorical_cols


(Index(['age', 'income', 'household_size', 'dependents', 'bmi',
        'visits_last_year', 'hospitalizations_last_3yrs',
        'days_hospitalized_last_3yrs', 'medication_count', 'systolic_bp',
        'diastolic_bp', 'ldl', 'hba1c', 'deductible', 'copay',
        'policy_term_years', 'policy_changes_last_2yrs', 'provider_quality',
        'risk_score', 'annual_medical_cost', 'annual_premium',
        'monthly_premium', 'claims_count', 'avg_claim_amount',
        'total_claims_paid', 'chronic_count', 'hypertension', 'diabetes',
        'asthma', 'copd', 'cardiovascular_disease', 'cancer_history',
        'kidney_disease', 'liver_disease', 'arthritis', 'mental_health',
        'proc_imaging_count', 'proc_surgery_count', 'proc_physio_count',
        'proc_consult_count', 'proc_lab_count', 'had_major_procedure'],
       dtype='object'),
 Index(['sex', 'region', 'urban_rural', 'education', 'marital_status',
        'employment_status', 'smoker', 'alcohol_freq', 'plan_type',
        'netw

In [29]:
# Train–Test Split. To evaluate how the model performs on unseen data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y # To maintain class distribution in both sets
)
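
To illustrate what `stratify=y` does, the sketch below repeats the same split settings on synthetic labels with the dataset's class counts and compares the high-risk share in each part. The placeholder feature column and the `_demo` names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset; features are placeholders
y_demo = np.array([0] * 63219 + [1] * 36781)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr_demo.mean()
test_rate = y_te_demo.mean()
print(f"high-risk share: train={train_rate:.4f}, test={test_rate:.4f}")
```

Both shares stay close to the overall 36.8%, so metrics computed on the test set reflect the same class mix the model was trained on.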


In [30]:
# Handle Missing Values. Most ML models cannot handle missing data
from sklearn.impute import SimpleImputer
# Numerical → median (robust to outliers)
# Categorical → most frequent

In [31]:
# Encode Categorical Variables. 
# Identify Categorical Columns
categorical_cols = X.select_dtypes(include='object').columns
categorical_cols


Index(['sex', 'region', 'urban_rural', 'education', 'marital_status',
       'employment_status', 'smoker', 'alcohol_freq', 'plan_type',
       'network_tier'],
      dtype='object')

In [32]:
# Apply Dummy Encoding
X_encoded = pd.get_dummies(
    X,
    columns=categorical_cols,
    drop_first=True # Avoids dummy variable trap (perfect multicollinearity)
)


In [33]:
# Check the New Shape
X_encoded.shape


(100000, 71)

In [34]:
# Train–Test Split on the encoded features (this replaces the earlier split on the raw X)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [35]:
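# Scale features: fit on the training set only, then apply to the test set.
# Note: this scales every column in X_train, including the dummy variables.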
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
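
As a small illustration of why the scaler is fitted on the training set only, the sketch below applies the same pattern to synthetic numbers. The arrays are random placeholders, not project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50, scale=10, size=(800, 2))
test_demo = rng.normal(loc=50, scale=10, size=(200, 2))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from train
test_scaled = scaler_demo.transform(test_demo)        # reuses the train statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```

The training columns come out with mean 0 and standard deviation 1. The test columns are only approximately standardized, which is expected because their statistics were never used for fitting.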


### Categorical Encoding
Categorical variables were converted into numerical format using dummy variable encoding (pd.get_dummies). To avoid multicollinearity, one category from each categorical variable was dropped. This transformation enabled the use of machine learning algorithms that require numerical input.
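
A minimal sketch of this encoding on a hypothetical four-row frame shows which columns `drop_first=True` produces and how a missing value is encoded. The `plan_type` levels are taken from the crosstab earlier. The `alcohol_freq` labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; alcohol_freq labels are assumptions, not the dataset's actual categories
toy = pd.DataFrame({
    "plan_type": ["EPO", "HMO", "POS", "PPO"],
    "alcohol_freq": ["Weekly", np.nan, "Daily", "Weekly"],
})

encoded = pd.get_dummies(toy, columns=["plan_type", "alcohol_freq"], drop_first=True)
print(encoded.columns.tolist())

# A missing value yields zeros in every dummy column for that variable,
# so it looks the same as the dropped baseline category
alcohol_dummies = [c for c in encoded.columns if c.startswith("alcohol_freq_")]
missing_row = encoded.loc[1, alcohol_dummies]
print(missing_row)
```

This matters for alcohol_freq, which has 30,083 missing values in the dataset: with `pd.get_dummies` and no prior imputation, those rows are encoded identically to the baseline category.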