##Data Cleaning and Processing of Stroke Prediction Data



This notebook is organized to guide the process of data cleaning and processing of the stroke prediction data. The workflow begins with importing and exploring the dataset, followed by thorough data cleaning and preprocessing, including handling missing values and encoding categorical variables. Multiple imputation strategies for missing BMI values are evaluated, and the cleaned dataset is standardized for model training.

In [28]:
#Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [29]:
#read the data file with pandas

df = pd.read_csv('../data/healthcare-dataset-stroke-data.csv')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [3]:
#describe shape of dataframe

df.shape


(5110, 12)

In [4]:
#check the column names of dataframe

df.columns


Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [5]:
#format the column names of dataframe to ensure consistency (lowercase)

df = df.rename(columns=str.lower)

In [7]:
#describe the information of dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [8]:
#descriptives of columns with numerical values

df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [9]:
# check missing values

df.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [10]:
#Check the number of values for each column

for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

id: 5110 unique values
gender: 3 unique values
age: 104 unique values
hypertension: 2 unique values
heart_disease: 2 unique values
ever_married: 2 unique values
work_type: 5 unique values
residence_type: 2 unique values
avg_glucose_level: 3979 unique values
bmi: 418 unique values
smoking_status: 4 unique values
stroke: 2 unique values


In [11]:
#Check number of duplicate rows

print(f"Duplicate rows: {df.duplicated().sum()}")


Duplicate rows: 0


In [12]:
#Gender with three values - check the frequency

df.gender.value_counts()


gender
Female    2994
Male      2115
Other        1
Name: count, dtype: int64

In [13]:
#remove data with gender=other

df = df[df.gender != 'Other']

df.gender.value_counts()

gender
Female    2994
Male      2115
Name: count, dtype: int64

##Handling Missing Values of BMI

In this section, I create four copies of the original DataFrame. Each copy will handle missing values in the BMI column using a different imputation method: mean, median, and two regression-based predictions.

In [14]:
#Original dataframe

df_original = df.copy()
df_original.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [15]:
#Replace missing value by mean

df_mean = df.copy()
df_mean['bmi'] = df_mean['bmi'].fillna(df_mean['bmi'].mean())
df_mean.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.89456,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [16]:
#Replace missing value by Median

df_median = df.copy()
df_median['bmi'] = df_median['bmi'].fillna(df_median['bmi'].median())
df_median.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,28.1,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [17]:
#regression to predict bmi by age, avg_glucose_level and replace it for missing values

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df_reg = df.copy()

# Create a model to predict BMI
bmi_df = df_reg.dropna(subset=['bmi'])
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict missing BMI values
missing_bmi_df = df_reg[df_reg['bmi'].isnull()].copy()
X_missing = missing_bmi_df[['age', 'avg_glucose_level']]
predicted_bmi = model.predict(X_missing)

# Replace the missing BMI values with the predicted values
df_reg.loc[df_reg['bmi'].isnull(), 'bmi'] = predicted_bmi

df_reg.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,32.66509,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [18]:
# prompt: calculate r2 of the model

from sklearn.metrics import r2_score

# Calculate R-squared for the regression model used to predict BMI
r2 = r2_score(y_test, model.predict(X_test))
print(f"R-squared of the BMI prediction model: {r2:.4f}")


R-squared of the BMI prediction model: 0.1173


In [19]:
#Replace missing 'bmi' values with predicted values - Fit a quadratic regression model by age
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df_qua=df.copy()

# Create a model to predict BMI
bmi_df = df_reg.dropna(subset=['bmi'])
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the quadratic regression model
model = LinearRegression()
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
model.fit(X_poly, y_train)

# Predict missing BMI values
missing_bmi_df = df_qua[df_qua['bmi'].isnull()].copy()

# Replace the missing BMI values with the predicted values
df_qua.loc[df_qua['bmi'].isnull(), 'bmi'] = model.predict(poly.transform(missing_bmi_df[['age', 'avg_glucose_level']]))

df_qua.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,33.870789,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [20]:
# prompt: calculate r2

# Calculate R-squared for the quadratic regression model used to predict BMI
X_test_poly = poly.transform(X_test)
r2_quad = r2_score(y_test, model.predict(X_test_poly))
print(f"R-squared of the quadratic BMI prediction model: {r2_quad:.4f}")

R-squared of the quadratic BMI prediction model: 0.2595


In [21]:
#Compare the mean bmi values of the five dataframes

print("Mean BMI values:")
print("Original:", df_original['bmi'].mean())
print("Mean with Mean Imputation:", df_mean['bmi'].mean())
print("Mean with Median Imputation:", df_median['bmi'].mean())
print("Mean with Regression Imputation:", df_reg['bmi'].mean())
print("Mean with Quadratic Imputation:", df_qua['bmi'].mean())




Mean BMI values:
Original: 28.894559902200488
Mean with Mean Imputation: 28.894559902200484
Mean with Median Imputation: 28.863300058719908
Mean with Regression Imputation: 28.947396221024402
Mean with Quadratic Imputation: 28.93548254383332


##Encode categorical features.

In [22]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [30]:
#drop id

df = df.drop('id', axis=1)

In [33]:
df = pd.get_dummies(df, columns=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'])
df.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,61.0,0,0,202.21,,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,80.0,0,1,105.92,32.5,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,49.0,0,0,171.23,34.4,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,79.0,1,0,174.12,24.0,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


##One example of data processing: 
1. Use quardratic regression to predict missing values of BMI
2. Normalize or scale numerical features after filling the values
(Categorical variables encoded already)

In [35]:
#Replace bmi missing values with quardratic regression predicted values

df_replaced=df.copy()

# Create a model to predict BMI
bmi_df = df_replaced.dropna(subset=['bmi'])
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the quadratic regression model
model = LinearRegression()
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
model.fit(X_poly, y_train)

# Predict missing BMI values
missing_bmi_df = df_replaced[df_replaced['bmi'].isnull()].copy()

# Replace the missing BMI values with the predicted values
df_replaced.loc[df_replaced['bmi'].isnull(), 'bmi'] = model.predict(poly.transform(missing_bmi_df[['age', 'avg_glucose_level']]))

df_replaced.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,61.0,0,0,202.21,33.921354,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,80.0,0,1,105.92,32.5,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,49.0,0,0,171.23,34.4,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,79.0,1,0,174.12,24.0,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


In [37]:
#normalize numerical data

from sklearn.preprocessing import StandardScaler

# Select numerical features to normalize
numerical_features = ['age', 'avg_glucose_level', 'bmi']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the numerical data and transform the data
df_replaced[numerical_features] = scaler.fit_transform(df_replaced[numerical_features])

df_replaced.head()


Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,1.051434,0,1,2.706375,0.991353,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,0.78607,0,0,2.121559,0.645152,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,1.62639,0,1,-0.005028,0.461449,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,0.255342,0,0,1.437358,0.707015,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,1.582163,1,0,1.501184,-0.637132,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


##Create a cleaned and finalized dataframe

In [38]:
df_final = df_replaced.copy()
df_final.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,1.051434,0,1,2.706375,0.991353,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,0.78607,0,0,2.121559,0.645152,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,1.62639,0,1,-0.005028,0.461449,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,0.255342,0,0,1.437358,0.707015,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,1.582163,1,0,1.501184,-0.637132,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


In [39]:
df_final.to_csv('stroke_cleaned_final.csv', index=False)