# 🧠 Stroke Risk Prediction Using Machine Learning

## Project Overview

**Domain:** Healthcare Analytics & Predictive Modeling

**Objective:**  
This project focuses on analyzing patient demographic, clinical, and lifestyle data to predict the risk of stroke using machine learning techniques. The goal is to support early identification of high-risk individuals and enable data-driven healthcare decisions.

**Problem Statement:**  
Stroke is one of the leading causes of death and long-term disability worldwide. Due to its sudden onset and severe consequences, early identification of individuals at high risk is essential for timely medical intervention and preventive healthcare planning.

Traditional medical assessments may not always efficiently identify stroke risk at an early stage. Therefore, there is a need for an automated and data-driven approach that can analyze multiple patient attributes simultaneously to estimate stroke risk accurately.


**Why This Project Matters:**  

- Enables early detection of stroke risk  
- Supports preventive and personalized healthcare planning  
- Assists healthcare professionals with data-driven insights  
- Demonstrates practical application of machine learning in healthcare  

## 📊 Dataset Description

The dataset used in this project is the Stroke Prediction Dataset, which contains **5,110 patient records** with **12 features** related to demographic information, medical conditions, and lifestyle habits.

- Number of records: 5,110  
- Number of features: 12  

The dataset includes attributes such as age, gender, hypertension, heart disease, average glucose level, BMI, and smoking status, which are relevant for assessing stroke risk.


## 🔍 Phase 1: Data Loading and Initial Understanding

This phase focuses on importing the dataset, examining its structure, and understanding the basic characteristics of the data before performing preprocessing or model building.

In [1]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np

# Display settings for pandas
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.float_format', '{:.2f}'.format)  # Format decimals to 2 places

print("Libraries imported successfully!")

Libraries imported successfully!


## 1.1 Dataset Source and Loading Method

**Data Source:** Kaggle – Stroke Prediction Dataset  
**Source Link:** https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset   

### Loading Objectives
The dataset will be loaded from a CSV file and initially examined to understand:
- The total number of records (rows), representing individual patients  
- The total number of features (columns), representing patient attributes  
- Feature names along with their corresponding data types  
- The overall structure and quality of the dataset  

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Display dataset dimensions
print("Dataset loaded successfully!")
print("=" * 100)
print(f"Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

Dataset loaded successfully!
Dataset Shape: 5,110 rows × 12 columns


## 1.2 Data Cleaning and Pre-processing

Before performing exploratory analysis or building models, the dataset must be cleaned and prepared. Real-world data often contains missing values, inconsistencies, and formatting issues that can affect results.

This section focuses on improving data quality by handling missing values, checking for duplicate records, verifying data types, and applying basic transformations. These steps ensure the dataset is reliable and suitable for further analysis.

## 1.3 Dataset Information

Understanding the dataset structure helps in deciding how the data should be processed and analyzed. It provides clarity on the number of observations, the type of features present, and the role of each column.

The stroke prediction dataset contains demographic, health-related, and lifestyle information for individuals. Each row represents one individual, while each column corresponds to a specific attribute.

The target variable in this dataset is **stroke**, which indicates whether an individual has experienced a stroke. This variable is binary in nature, where:
- 0 represents no stroke
- 1 represents occurrence of stroke

All other columns except `id`, which is only a record identifier, act as input features that contribute to predicting the target outcome.
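
A quick way to inspect a binary target like this is to count each class and express it as a share of all rows. The snippet below is a minimal sketch on a made-up ten-row column (the name `toy` and its values are illustrative assumptions, not the real data); the same calls could be applied to the target column of `df`.

```python
import pandas as pd

# Hypothetical toy target standing in for the real column (0 = no stroke, 1 = stroke)
toy = pd.DataFrame({"stroke": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Absolute number of rows per class, ordered by class label
class_counts = toy["stroke"].value_counts().sort_index()

# Share of each class as a percentage of all rows
class_share = (toy["stroke"].value_counts(normalize=True).sort_index() * 100).round(1)

print(class_counts.to_dict())
print(class_share.to_dict())
```

Looking at the percentage as well as the raw count makes any class imbalance visible before modeling starts.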


## 1.4 Preview of Dataset (First 10 Records)

Before starting detailed cleaning and analysis, it is useful to view a small portion of the dataset. Displaying the first few records helps in understanding the overall structure, column arrangement, and the type of values stored in each feature.

In this project, the first 10 rows of the dataset are displayed to get an initial overview of the data.

In [3]:
# Display the first 10 rows of the dataset
df.head(10)

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |
| 5 | 56669 | Male | 81.0 | 0 | 0 | Yes | Private | Urban | 186.21 | 29.0 | formerly smoked | 1 |
| 6 | 53882 | Male | 74.0 | 1 | 1 | Yes | Private | Rural | 70.09 | 27.4 | never smoked | 1 |
| 7 | 10434 | Female | 69.0 | 0 | 0 | No | Private | Urban | 94.39 | 22.8 | never smoked | 1 |
| 8 | 27419 | Female | 59.0 | 0 | 0 | Yes | Private | Rural | 76.15 | NaN | Unknown | 1 |
| 9 | 60491 | Female | 78.0 | 0 | 0 | Yes | Private | Urban | 58.57 | 24.2 | Unknown | 1 |


## 1.5 Dataset Information and Data Types

Examining the data types of each feature helps in understanding how the dataset can be processed and analyzed. Different data types require different handling techniques, especially when preparing the data for analysis or modeling.

Reviewing data types is useful for:
- Recognizing features that may need conversion or encoding
- Choosing suitable statistical and visualization methods
- Identifying missing or incomplete values
- Understanding how efficiently the dataset uses memory

The dataset contains a combination of numerical health indicators and categorical personal attributes. The target variable, stroke, is represented as a numeric value, making it suitable for classification-based analysis.

To summarize this information, a dataset overview function is used to display the structure, data types, and completeness of the data.

In [4]:
# Show structural details of the dataset
print("Stroke Dataset Overview")
print("-" * 60)
df.info()

Stroke Dataset Overview
------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [5]:
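# Cast numerical columns to their intended types
# Note: astype(int) truncates toward zero, so any fractional age loses its decimal part
df['age'] = df['age'].astype(int)
# bmi and avg_glucose_level already load as float64 (see df.info() above),
# so these two casts are no-ops kept only for explicitness
df['bmi'] = df['bmi'].astype(float)
df['avg_glucose_level'] = df['avg_glucose_level'].astype(float)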
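
Casting a float column to `int` drops the fractional part rather than rounding it. The sketch below illustrates the difference on made-up ages, assuming the column can hold fractional values (the series `ages` and its contents are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical ages in years, including fractional values
ages = pd.Series([67.0, 1.8, 0.08, 80.0], name="age")

# astype(int) truncates toward zero: 1.8 -> 1, 0.08 -> 0
truncated = ages.astype(int)

# Rounding first keeps the nearest whole year: 1.8 -> 2, 0.08 -> 0
rounded = ages.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

If fractional ages matter for the analysis, leaving the column as `float64` avoids the information loss altogether.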

In [6]:
# Standardize categorical values (remove spaces and capitalize first letter)
cat_cols = df.select_dtypes(include='object').columns

for col in cat_cols:
    df[col] = df[col].str.strip().str.capitalize()
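
To make the effect of this step concrete, here is a small sketch on made-up category strings (the series `raw` is a hypothetical sample). `str.strip()` removes leading and trailing whitespace, and `str.capitalize()` upper-cases the first character and lower-cases the rest.

```python
import pandas as pd

# Hypothetical raw category values with stray spaces and mixed casing
raw = pd.Series([" never smoked", "Self-employed ", "Govt_job", "URBAN"])

# Same transformation as in the loop above
cleaned = raw.str.strip().str.capitalize()

print(cleaned.tolist())
```

Only the first letter of the whole string is capitalized, so multi-word values such as "never smoked" become "Never smoked" rather than "Never Smoked".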

In [7]:
# Column names are reformatted by replacing underscores with spaces and capitalizing each word to improve readability and presentation.
df.columns = (
    df.columns
    .str.replace('_', ' ')
    .str.title()
)
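
Because this step changes the labels used to access columns, it helps to see the mapping explicitly. The sketch below applies the same two string operations to an empty, hypothetical frame named `toy` that carries a few of the original column names.

```python
import pandas as pd

# Hypothetical empty frame with a subset of the original column names
toy = pd.DataFrame(columns=["id", "avg_glucose_level", "Residence_type", "bmi"])

# Replace underscores with spaces, then title-case each word
toy.columns = toy.columns.str.replace("_", " ").str.title()

print(list(toy.columns))
```

After renaming, later cells must use the new labels, for example `df['Bmi']` instead of `df['bmi']`.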

## 1.6 Missing Value Analysis

Missing data is a common issue in real-world datasets and can affect the reliability of analysis if ignored. Identifying missing values early helps in choosing the most suitable method to handle them.

In the stroke prediction dataset, missing values are examined to determine which features require attention and how they should be treated without distorting the overall data distribution.

In [8]:
# Check for missing values in all columns
print("Missing Values Summary:")
missing_values = df.isnull().sum()
print(missing_values)
print("="*70)

# Columns with missing values
if missing_values.sum() > 0:
    print("\nColumns with missing values:")
    print(missing_values[missing_values > 0])
else:
    print("\nNo missing values found in the dataset.")

Missing Values Summary:
Id                     0
Gender                 0
Age                    0
Hypertension           0
Heart Disease          0
Ever Married           0
Work Type              0
Residence Type         0
Avg Glucose Level      0
Bmi                  201
Smoking Status         0
Stroke                 0
dtype: int64

Columns with missing values:
Bmi    201
dtype: int64
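
`isnull()` only counts true nulls (`NaN`/`None`); a placeholder category such as "Unknown" is an ordinary string and is not counted. The following sketch, built on a hypothetical four-row frame named `toy`, shows how null counts, null percentages, and placeholder counts can be reported side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one true null in Bmi, two placeholder values in Smoking Status
toy = pd.DataFrame({
    "Bmi": [36.6, np.nan, 32.5, 34.4],
    "Smoking Status": ["Formerly smoked", "Never smoked", "Unknown", "Unknown"],
})

# True nulls per column, as counts and as percentages of all rows
null_counts = toy.isnull().sum()
missing_pct = (toy.isnull().mean() * 100).round(1)

# Placeholder values are invisible to isnull() and must be counted explicitly
unknown_count = int((toy["Smoking Status"] == "Unknown").sum())

print(null_counts.to_dict())
print(missing_pct.to_dict())
print(unknown_count)
```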


The BMI feature contains missing values. Since BMI is a numerical variable related to health measurements, replacing missing values with a central tendency measure helps maintain data consistency.

The median is chosen instead of the mean to reduce the influence of extreme values.
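
The difference between the two statistics is easy to see on a small example. This sketch uses made-up BMI values with one extreme entry (the series `bmi` is hypothetical) and fills the missing value with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values: four typical readings, one extreme value, one missing
bmi = pd.Series([22.0, 24.0, 26.0, 28.0, 90.0, np.nan])

# Both statistics skip NaN by default
mean_bmi = bmi.mean()      # pulled upward by the extreme value
median_bmi = bmi.median()  # stays at the middle of the observed values

# Impute the missing entry with the median
filled = bmi.fillna(median_bmi)

print(mean_bmi, median_bmi)
print(filled.tolist())
```

Here the single extreme value lifts the mean to 38.0 while the median stays at 26.0, which is closer to a typical reading.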

In [9]:
# Fill missing BMI values using the median
df['Bmi'] = df['Bmi'].fillna(df['Bmi'].median())

After handling the missing values, the dataset is checked again to ensure that all missing entries have been addressed successfully.

In [10]:
# Verify that missing values are handled
df.isnull().sum() 

Id                   0
Gender               0
Age                  0
Hypertension         0
Heart Disease        0
Ever Married         0
Work Type            0
Residence Type       0
Avg Glucose Level    0
Bmi                  0
Smoking Status       0
Stroke               0
dtype: int64

## 1.7 Duplicate Record Analysis

Duplicate records can lead to biased analysis by repeating the same information multiple times. Therefore, the dataset is checked for duplicate rows to ensure that each record represents a unique individual.

In [11]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print(f"\nWarning: {duplicates} duplicate row(s) detected!")
else:
    print("\nNo duplicate rows found in the dataset.")

Number of duplicate rows: 0

No duplicate rows found in the dataset.
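
`df.duplicated()` flags only rows that match in every column. If the aim is to confirm that each identifier appears once, the check can be restricted to the identifier column with the `subset` argument. The sketch below uses a hypothetical three-row frame named `toy` in which an identifier repeats with different attribute values.

```python
import pandas as pd

# Hypothetical sample: the same Id appears twice with a different Age
toy = pd.DataFrame({
    "Id": [9046, 51676, 9046],
    "Age": [67, 61, 68],
})

# Full-row comparison: no row is identical in every column
full_row_duplicates = int(toy.duplicated().sum())

# Identifier-only comparison: the repeated Id is flagged
id_duplicates = int(toy.duplicated(subset=["Id"]).sum())

print(full_row_duplicates, id_duplicates)
```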


## Phase 1: Analysis Summary

### Overview of the Dataset
- **Number of Records:** 5,110 patients  
- **Features Included:** 12 attributes  
- **Outcome Variable:** `stroke` (occurrence rate: 4.9%)  

### Patient Demographics and Profile
- **Mean Age:** 43 years (range: 0–82)  
- **Average BMI:** 28.9 (with some missing entries)  
- **Clinical Flags:** hypertension and heart disease are each recorded as binary indicators (0 = absent, 1 = present)  
- **Gender Breakdown:** fairly balanced between male and female  

### Data Quality Assessment

- **Missing Data:**  
  - BMI: 201 entries missing (~3.9%), imputed with the column median  
  - Smoking status: no null entries, but some records carry the placeholder category “Unknown”  

- **Duplicate Entries:**  
  - No complete duplicates found  

- **Data Type Considerations:**  
  - `bmi` and `avg_glucose_level` load as `float64`; `age` loads as `float64` and is cast to integer  
  - Categorical variables such as `gender`, `ever_married`, and `work_type` need proper encoding  

- **Inconsistencies in Values:**  
  - `smoking_status` includes an “Unknown” category alongside “formerly smoked”, “never smoked”, and “smokes”; “Unknown” is effectively missing information and needs an explicit handling decision  

### Key Takeaways
- Stroke occurrence is low (4.9%), suggesting an imbalanced dataset; this must be addressed during modeling  
- Factors such as age, BMI, and comorbidities (hypertension, heart disease) are likely influential  
- Lifestyle aspects such as smoking habits could also impact stroke risk  
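
One common way to account for class imbalance during modeling is to weight each class inversely to its frequency. The sketch below computes such weights with pandas on a hypothetical 20-row target `y`. The formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn uses for `class_weight="balanced"`; it is shown here as one option, not as the method this project adopts.

```python
import pandas as pd

# Hypothetical imbalanced target: 19 negatives, 1 positive
y = pd.Series([0] * 19 + [1], name="Stroke")

# Rows per class
counts = y.value_counts()

# Inverse-frequency weights: n_samples / (n_classes * class_count)
weights = len(y) / (counts.size * counts)

print(weights.sort_index().round(3).to_dict())
```

The rarer class receives the larger weight (10.0 here, against roughly 0.526 for the majority class). A dictionary of this form could be passed to estimators that accept per-class weights.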