##Data Cleaning and Processing of Stroke Prediction Data



This notebook walks through cleaning and preprocessing the stroke prediction data. The workflow imports and explores the dataset, then cleans and preprocesses it, including handling missing values and encoding categorical variables. Several imputation strategies for missing BMI values are compared, and the numerical features of the cleaned dataset are standardized for model training.

In [84]:
#Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [85]:
#Load the data from the CSV file

df = pd.read_csv('../data/healthcare-dataset-stroke-data.csv')
df.shape
#5110 rows and 12 columns

(5110, 12)

In [86]:
#Get some basic information

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [87]:
df.head()


Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


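After exploring the data, we find that hypertension and heart_disease are already binary (0/1) columns and need no further transformation. ever_married holds Yes/No strings and is mapped to 0/1 in Step 2.
Categorical variables to one-hot encode:
- gender
- work_type
- Residence_type (residence_type once the column names are lowercased)
- smoking_status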
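
Before encoding, it can help to look at the distinct labels in each categorical column. The sketch below uses a small toy frame (the values and the name `toy` are illustrative assumptions, not the real data) to show how stray capitalization or whitespace inflates the number of categories.

```python
import pandas as pd

# Toy frame standing in for the stroke data (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "female ", "Other"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "smokes"],
})

# Count distinct raw labels per categorical column to spot inconsistencies
for col in ["gender", "smoking_status"]:
    print(col, toy[col].nunique(), sorted(toy[col].unique()))

# After lowercasing and stripping, 'Female' and 'female ' collapse into one label
normalized_gender = toy["gender"].str.lower().str.strip()
print(normalized_gender.nunique())
```

Running the same loop over the four categorical columns of the real frame would show whether any cleanup is needed before one-hot encoding.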


In [89]:
# Step 1: Standardize all string values to avoid inconsistencies
# Lowercase and strip leading/trailing whitespace in every string (object) column, then lowercase the column names.
# This is useful for standardizing categorical data before encoding.

str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.lower().str.strip())
df.columns = df.columns.str.lower()
df_cleaned = df.copy()


In [90]:
#Step 2: Convert binary columns to 0/1
#Convert ever_married (yes/no) to binary values


df_cleaned['ever_married'] = df['ever_married'].str.lower().str.strip().map({'yes': 1, 'no': 0})

In [112]:
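#Step 3: Convert categorical columns to one-hot encoding

categorical_cols = ['gender', 'work_type', 'residence_type', 'smoking_status']

# One-hot encode and drop the first category in each (to avoid multicollinearity).
# Note: encoding starts from df, not df_cleaned, so the ever_married mapping from Step 2
# is not carried over (ever_married is still an object column in the output) and is redone later.
df_cleaned = pd.get_dummies(df, columns=categorical_cols, drop_first=True)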

# View new columns
df_cleaned.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           5110 non-null   int64  
 1   age                          5110 non-null   float64
 2   hypertension                 5110 non-null   int64  
 3   heart_disease                5110 non-null   int64  
 4   ever_married                 5110 non-null   object 
 5   avg_glucose_level            5110 non-null   float64
 6   bmi                          4909 non-null   float64
 7   stroke                       5110 non-null   int64  
 8   gender_male                  5110 non-null   bool   
 9   gender_other                 5110 non-null   bool   
 10  work_type_govt_job           5110 non-null   bool   
 11  work_type_never_worked       5110 non-null   bool   
 12  work_type_private            5110 non-null   bool   
 13  work_type_self-emp
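
To see why `drop_first=True` helps with multicollinearity, here is a minimal sketch on a toy column (the frame `toy` and its values are assumptions for illustration). With every level kept, the dummy columns always sum to 1, so any one of them is fully determined by the others.

```python
import pandas as pd

# Toy column with two levels (illustrative values only)
toy = pd.DataFrame({"residence_type": ["urban", "rural", "urban"]})

full = pd.get_dummies(toy, columns=["residence_type"])
reduced = pd.get_dummies(toy, columns=["residence_type"], drop_first=True)

# With all levels kept, each row's dummies sum to exactly 1 (redundant information)
row_sums = full.sum(axis=1)

print(list(full.columns))     # both levels present
print(list(reduced.columns))  # first level (alphabetically) dropped
```

The dropped level becomes the implicit baseline: a row with all remaining dummies equal to 0 belongs to it.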

In [121]:
#Create a function to convert boolean columns to binary (0/1) for consistency

def convert_booleans_to_binary(df):
    # Work on a copy of the frame that was passed in, so the input is left untouched
    df_final = df.copy()

    # 1. Convert string 'true'/'false' to boolean True/False
    def to_bool(x):
        if isinstance(x, str) and x.strip().lower() in ('true', 'false'):
            return x.strip().lower() == 'true'
        return x

    for col in df_final.select_dtypes(include='object').columns:
        df_final[col] = df_final[col].map(to_bool)

    # 2. Convert boolean True/False to 1/0
    bool_cols = df_final.select_dtypes(include='bool').columns
    df_final[bool_cols] = df_final[bool_cols].astype(int)

    return df_final


In [125]:
df_final = convert_booleans_to_binary(df_cleaned)



In [126]:
df_final.head()

Unnamed: 0,id,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,stroke,gender_male,gender_other,work_type_govt_job,work_type_never_worked,work_type_private,work_type_self-employed,residence_type_urban,smoking_status_never smoked,smoking_status_smokes,smoking_status_unknown
0,9046,67.0,0,1,yes,228.69,36.6,1,1,0,0,0,1,0,1,0,0,0
1,51676,61.0,0,0,yes,202.21,,1,0,0,0,0,0,1,0,1,0,0
2,31112,80.0,0,1,yes,105.92,32.5,1,1,0,0,0,1,0,0,1,0,0
3,60182,49.0,0,0,yes,171.23,34.4,1,0,0,0,0,1,0,1,0,1,0
4,1665,79.0,1,0,yes,174.12,24.0,1,0,0,0,0,0,1,0,1,0,0


In [136]:
#Convert ever_married to binary values (anything other than yes/no falls back to 0)
df_final['ever_married'] = (
    df_cleaned['ever_married']
    .astype(str)                      # ensures the .str methods are available
    .str.lower()
    .str.strip()
    .map({'yes': 1, 'no': 0})
    .fillna(0)
    .astype(int)
)

In [144]:
#Confirm all binary columns are now 0/1
df_final.head()

Unnamed: 0,id,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,stroke,gender_male,gender_other,work_type_govt_job,work_type_never_worked,work_type_private,work_type_self-employed,residence_type_urban,smoking_status_never smoked,smoking_status_smokes,smoking_status_unknown
0,9046,67.0,0,1,1,228.69,36.6,1,1,0,0,0,1,0,1,0,0,0
1,51676,61.0,0,0,1,202.21,28.893237,1,0,0,0,0,0,1,0,1,0,0
2,31112,80.0,0,1,1,105.92,32.5,1,1,0,0,0,1,0,0,1,0,0
3,60182,49.0,0,0,1,171.23,34.4,1,0,0,0,0,1,0,1,0,1,0
4,1665,79.0,1,0,1,174.12,24.0,1,0,0,0,0,0,1,0,1,0,0


In [138]:
#Step 5: Fill missing BMI values with the column mean

df_final['bmi'] = df_final['bmi'].fillna(df_final['bmi'].mean())

At this point, all cleaning steps that precede feature scaling are complete. Next, we split the data into training and test sets.

In [139]:
#Train/test split to avoid contamination of the data
# Splitting before scaling ensures that the scaling parameters (mean, standard deviation) are derived only from the training data. This prevents leakage from the test set into training, which could lead to overly optimistic performance estimates.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

numerical_cols = ['age', 'avg_glucose_level', 'bmi']
target = 'stroke'

# 1. Split first
X = df_cleaned.drop(columns=['stroke'])
y = df_cleaned['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
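
The split-then-scale order can be checked on synthetic data. The sketch below is only an illustration: the feature, the labels, and all names ending in `_demo` are made up, and the point is simply that the scaler's statistics come from the training split alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 1))  # synthetic "age"-like feature
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler().fit(X_tr)    # statistics come from the training split only
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)   # test split reuses the training mean and std

print(X_tr_scaled.mean(), X_tr_scaled.std())  # close to 0 and 1 by construction
print(X_te_scaled.mean(), X_te_scaled.std())  # generally not exactly 0 and 1
```

The test split not landing exactly on mean 0 and standard deviation 1 is expected: it was transformed with parameters it did not contribute to.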



We will compare several models: some require scaled numerical columns and some do not. We therefore keep a raw (unscaled) copy for the models that do not need scaling.

In [140]:
# 1. Raw (unscaled) for Isolation Forest
X_train_raw = X_train.copy()
X_test_raw = X_test.copy()

In [141]:
# 2. Scaled for SVM, LOF, Autoencoder
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()


In [142]:
X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])
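
As an alternative to copying the frame and overwriting columns by hand, scikit-learn's `ColumnTransformer` can scale only the numerical columns and pass the rest through. This is a sketch on a toy frame (the values and the names `train_demo`, `num_cols`, and `ct` are assumptions), not a replacement for the cells above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

num_cols = ["age", "avg_glucose_level", "bmi"]
train_demo = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23],
    "bmi": [36.6, 30.0, 32.5, 34.4],       # illustrative values, no missing entries
    "hypertension": [0, 0, 0, 1],
})

# Scale only the numeric columns; pass the binary flag through untouched
ct = ColumnTransformer([("num", StandardScaler(), num_cols)], remainder="passthrough")
train_out = ct.fit_transform(train_demo)   # NumPy array: scaled columns first, then the rest

print(train_out.shape)
```

Calling `ct.transform` on a test frame with the same columns would then reuse the training statistics, in the same way as `scaler.transform` above.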

In [146]:
# Convert ever_married to 0/1 in both raw splits
for split in (X_train_raw, X_test_raw):
    split['ever_married'] = (
        split['ever_married']
        .astype(str)                     # ensure string
        .str.lower()                     # make lowercase
        .str.strip()                     # remove spaces
        .map({'yes': 1, 'no': 0})        # convert to binary
        .fillna(0)                       # default for unexpected values
        .astype(int)                     # make it numeric
    )


In [147]:
X_train_raw.select_dtypes(include='object').columns

Index([], dtype='object')

In [149]:
train_median = X_train_raw.median()           # computed once, from the training split only
X_train_raw = X_train_raw.fillna(train_median)
X_test_raw = X_test_raw.fillna(train_median)  # Note: use *train* median here!
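
The same "fit on train, apply to test" pattern is available through scikit-learn's `SimpleImputer`. A minimal sketch on made-up BMI values (the arrays `train_bmi` and `test_bmi` are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

train_bmi = np.array([[36.6], [np.nan], [32.5], [34.4], [24.0]])
test_bmi = np.array([[np.nan], [28.0]])

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_bmi)   # median is learned from the training rows
test_filled = imputer.transform(test_bmi)         # the test NaN receives the *train* median

print(imputer.statistics_)   # learned median per column
print(test_filled.ravel())
```

Here the training median is the average of 32.5 and 34.4, and that value fills the missing test entry.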


In [158]:
# Isolation Forest — use raw

from sklearn.ensemble import IsolationForest

# Initialize Isolation Forest
iso = IsolationForest(n_estimators=100, contamination=0.10, random_state=42)

# Fit on the training split, then predict on the test split (+1 = inlier, -1 = outlier)
iso.fit(X_train_raw)
iso_preds = iso.predict(X_test_raw)





In [159]:
# Map outliers (-1) to the positive class (1 = stroke) and inliers (+1) to 0
iso_preds_binary = (iso_preds == -1).astype(int)

In [None]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y_test, iso_preds_binary)
print("Confusion Matrix:")

print(conf_matrix)


Confusion Matrix:
[[883  89]
 [ 37  13]]
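
The stroke-class precision and recall follow directly from this matrix. The short calculation below just redoes the arithmetic from the four counts shown above (rows are true classes, columns are predicted classes).

```python
import numpy as np

cm = np.array([[883, 89],
               [37, 13]])   # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # 13 / (13 + 89)
recall = tp / (tp + fn)           # 13 / (13 + 37)
accuracy = (tp + tn) / cm.sum()   # (13 + 883) / 1022

print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

Only 13 of the 50 stroke cases are flagged, and most flagged rows are not strokes, which is why accuracy alone is misleading on this imbalanced target.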


In [161]:
from sklearn.metrics import classification_report

print("Classification Report:")
print(classification_report(y_test, iso_preds_binary, target_names=['Non-stroke', 'Stroke']))


Classification Report:
              precision    recall  f1-score   support

  Non-stroke       0.96      0.91      0.93       972
      Stroke       0.13      0.26      0.17        50

    accuracy                           0.88      1022
   macro avg       0.54      0.58      0.55      1022
weighted avg       0.92      0.88      0.90      1022



Model Result:
With contamination=0.02:

Classification Report:
              precision    recall  f1-score   support

  Non-stroke       0.95      0.98      0.97       972
      Stroke       0.18      0.10      0.13        50

    accuracy                           0.93      1022
   macro avg       0.57      0.54      0.55      1022
weighted avg       0.92      0.93      0.92      1022

Raising contamination to 0.10 makes the model more aggressive in flagging strokes, but the result is still not great:

               precision    recall  f1-score   support

  Non-stroke       0.96      0.91      0.93       972
      Stroke       0.13      0.26      0.17        50

    accuracy                           0.88      1022
   macro avg       0.54      0.58      0.55      1022
weighted avg       0.92      0.88      0.90      1022
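
The effect of `contamination` can be illustrated on synthetic data: it sets the share of training rows that the fitted model treats as anomalies. The sketch below uses random features (not the stroke data) and hypothetical names such as `X_syn` and `flag_rates`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 3))   # synthetic features, not the stroke data

flag_rates = {}
for c in (0.02, 0.05, 0.10):
    model_c = IsolationForest(n_estimators=100, contamination=c, random_state=42).fit(X_syn)
    # Fraction of training rows predicted as anomalies (-1)
    flag_rates[c] = (model_c.predict(X_syn) == -1).mean()

print(flag_rates)
```

A higher contamination flags more rows, which tends to raise recall on the rare class at the cost of precision, consistent with the two reports above.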




In [None]:

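# One-Class SVM: use scaled data
from sklearn.svm import OneClassSVM
svm = OneClassSVM(nu=0.1)  # nu is a placeholder assumption, not a tuned value
svm.fit(X_train_scaled)
svm_preds = svm.predict(X_test_scaled)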


In [None]:

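# 2. Initialize the scaler and fit it on the numerical columns of the training data
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])

# 3. Apply the same transformation to the test data
X_test_scaled = X_test.copy()
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])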

In [None]:
#Step 4: Scale numerical columns
# Note: this fits the scaler on the full dataset, so test rows influence the scaling parameters.

from sklearn.preprocessing import StandardScaler

numerical_cols = ['age', 'avg_glucose_level', 'bmi']
scaler = StandardScaler()
df_cleaned[numerical_cols] = scaler.fit_transform(df_cleaned[numerical_cols])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              5110 non-null   int64  
 1   age                             5110 non-null   float64
 2   hypertension                    5110 non-null   int64  
 3   heart_disease                   5110 non-null   int64  
 4   ever_married                    5110 non-null   object 
 5   avg_glucose_level               5110 non-null   float64
 6   bmi                             5110 non-null   float64
 7   stroke                          5110 non-null   int64  
 8   gender_Male                     5110 non-null   bool   
 9   gender_Other                    5110 non-null   bool   
 10  work_type_Never_worked          5110 non-null   bool   
 11  work_type_Private               5110 non-null   bool   
 12  work_type_Self-employed         51

In [17]:
#regression to predict bmi by age, avg_glucose_level and replace it for missing values

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df_reg = df.copy()

# Create a model to predict BMI
bmi_df = df_reg.dropna(subset=['bmi'])
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict missing BMI values
missing_bmi_df = df_reg[df_reg['bmi'].isnull()].copy()
X_missing = missing_bmi_df[['age', 'avg_glucose_level']]
predicted_bmi = model.predict(X_missing)

# Replace the missing BMI values with the predicted values
df_reg.loc[df_reg['bmi'].isnull(), 'bmi'] = predicted_bmi

df_reg.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,32.66509,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [18]:

from sklearn.metrics import r2_score

# Calculate R-squared for the regression model used to predict BMI
r2 = r2_score(y_test, model.predict(X_test))
print(f"R-squared of the BMI prediction model: {r2:.4f}")


R-squared of the BMI prediction model: 0.1173
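
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four made-up BMI values and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([30.0, 25.0, 35.0, 28.0])   # illustrative observed values
y_pred = np.array([29.0, 27.0, 33.0, 28.0])   # illustrative predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

An R-squared near 0.12 therefore means the linear model removes only about 12% of the variance that a constant mean prediction leaves.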


In [19]:
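#Replace missing 'bmi' values with predictions from a quadratic regression on age and avg_glucose_level
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df_qua=df.copy()

# Create a model to predict BMI
# Note: df_reg already has its missing BMI values filled by the linear model above,
# so dropna keeps every row and the training data here includes those imputed values.
bmi_df = df_reg.dropna(subset=['bmi'])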
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the quadratic regression model
model = LinearRegression()
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
model.fit(X_poly, y_train)

# Predict missing BMI values
missing_bmi_df = df_qua[df_qua['bmi'].isnull()].copy()

# Replace the missing BMI values with the predicted values
df_qua.loc[df_qua['bmi'].isnull(), 'bmi'] = model.predict(poly.transform(missing_bmi_df[['age', 'avg_glucose_level']]))

df_qua.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,33.870789,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [20]:

# Calculate R-squared for the quadratic regression model used to predict BMI
X_test_poly = poly.transform(X_test)
r2_quad = r2_score(y_test, model.predict(X_test_poly))
print(f"R-squared of the quadratic BMI prediction model: {r2_quad:.4f}")

R-squared of the quadratic BMI prediction model: 0.2595
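
The comparison that follows refers to `df_original`, `df_mean`, and `df_median`, whose defining cells do not appear in this notebook. A plausible reconstruction, shown on a toy frame and stated purely as an assumption, is a plain copy plus mean and median fills:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data; in the notebook this would be the unmodified stroke frame
df_original = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# Assumed definition: fill missing BMI with the mean of the observed values
df_mean = df_original.copy()
df_mean["bmi"] = df_mean["bmi"].fillna(df_original["bmi"].mean())

# Assumed definition: fill missing BMI with the median of the observed values
df_median = df_original.copy()
df_median["bmi"] = df_median["bmi"].fillna(df_original["bmi"].median())

print(df_original["bmi"].mean(), df_mean["bmi"].mean(), df_median["bmi"].mean())
```

Under these definitions, mean imputation leaves the column mean unchanged, which would be consistent with the near-identical "Original" and "Mean Imputation" values reported next.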


In [21]:
#Compare the mean bmi values of the five dataframes

print("Mean BMI values:")
print("Original:", df_original['bmi'].mean())
print("Mean with Mean Imputation:", df_mean['bmi'].mean())
print("Mean with Median Imputation:", df_median['bmi'].mean())
print("Mean with Regression Imputation:", df_reg['bmi'].mean())
print("Mean with Quadratic Imputation:", df_qua['bmi'].mean())




Mean BMI values:
Original: 28.894559902200488
Mean with Mean Imputation: 28.894559902200484
Mean with Median Imputation: 28.863300058719908
Mean with Regression Imputation: 28.947396221024402
Mean with Quadratic Imputation: 28.93548254383332
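
Comparing means alone can hide a side effect of mean imputation: the filled values sit exactly at the mean, so the spread of the column shrinks. A small sketch on made-up values (the series `bmi` is illustrative only):

```python
import numpy as np
import pandas as pd

bmi = pd.Series([36.6, np.nan, 32.5, 34.4, 24.0, np.nan])

observed_std = bmi.std()                     # NaNs are skipped
imputed = bmi.fillna(bmi.mean())
imputed_std = imputed.std()                  # filled values add no deviation from the mean

print(bmi.mean(), imputed.mean())   # means agree
print(observed_std, imputed_std)    # standard deviation drops after imputation
```

Checking the standard deviation alongside the mean would give a fuller picture when choosing between the imputation strategies.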


##Encode categorical features.

In [22]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [30]:
#drop id

df = df.drop('id', axis=1)

In [33]:
df = pd.get_dummies(df, columns=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'])
df.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,61.0,0,0,202.21,,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,80.0,0,1,105.92,32.5,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,49.0,0,0,171.23,34.4,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,79.0,1,0,174.12,24.0,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


##One example of data processing: 
1. Use quadratic regression to predict missing values of BMI
2. Standardize the numerical features after filling the values
(Categorical variables are already encoded)
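
The quadratic regression in step 1 can also be written as a single scikit-learn pipeline, which keeps the polynomial expansion and the linear fit together. This is a sketch on synthetic data: the generated `age`, `glucose`, and `bmi_syn` values are assumptions and are not fitted to the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=200)
glucose = rng.uniform(60, 250, size=200)
# Synthetic BMI with a curved dependence on age (illustrative only)
bmi_syn = 18 + 0.6 * age - 0.006 * age**2 + 0.01 * glucose + rng.normal(0, 1.0, size=200)

X_syn = np.column_stack([age, glucose])

# Degree-2 expansion followed by ordinary least squares
quad_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad_model.fit(X_syn, bmi_syn)
r2_syn = quad_model.score(X_syn, bmi_syn)

# Plain linear fit on the same features, for comparison
linear_r2 = LinearRegression().fit(X_syn, bmi_syn).score(X_syn, bmi_syn)

print(r2_syn, linear_r2)
```

With the pipeline, `quad_model.predict` accepts the raw two-column input directly, so there is no separate `poly.transform` call to forget.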

In [35]:
#Replace missing bmi values with quadratic regression predictions

df_replaced=df.copy()

# Create a model to predict BMI
bmi_df = df_replaced.dropna(subset=['bmi'])
X = bmi_df[['age', 'avg_glucose_level']]
y = bmi_df['bmi']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the quadratic regression model
model = LinearRegression()
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)
model.fit(X_poly, y_train)

# Predict missing BMI values
missing_bmi_df = df_replaced[df_replaced['bmi'].isnull()].copy()

# Replace the missing BMI values with the predicted values
df_replaced.loc[df_replaced['bmi'].isnull(), 'bmi'] = model.predict(poly.transform(missing_bmi_df[['age', 'avg_glucose_level']]))

df_replaced.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,228.69,36.6,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,61.0,0,0,202.21,33.921354,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,80.0,0,1,105.92,32.5,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,49.0,0,0,171.23,34.4,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,79.0,1,0,174.12,24.0,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


In [37]:
#Standardize numerical data (zero mean, unit variance)

from sklearn.preprocessing import StandardScaler

# Select numerical features to standardize
numerical_features = ['age', 'avg_glucose_level', 'bmi']

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the numerical data and transform the data
df_replaced[numerical_features] = scaler.fit_transform(df_replaced[numerical_features])

df_replaced.head()


Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,1.051434,0,1,2.706375,0.991353,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,0.78607,0,0,2.121559,0.645152,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,1.62639,0,1,-0.005028,0.461449,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,0.255342,0,0,1.437358,0.707015,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,1.582163,1,0,1.501184,-0.637132,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
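
The scaled values above are z-scores: each value minus the column mean, divided by the column's population standard deviation. The sketch below reproduces that by hand on five ages and shows that `inverse_transform` recovers the original units (the array `ages` is a small illustrative sample, so its z-scores differ from the full-data values shown above).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[67.0], [61.0], [80.0], [49.0], [79.0]])

sc = StandardScaler()
z = sc.fit_transform(ages)                    # z = (x - mean) / std, with population std (ddof=0)
manual = (ages - ages.mean()) / ages.std()    # same computation written out
restored = sc.inverse_transform(z)            # back to the original units

print(np.allclose(z, manual), np.allclose(restored, ages))
```

Keeping the fitted scaler around is what makes it possible to interpret model inputs in years, mg/dL, or BMI units later on.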


##Create a cleaned and finalized dataframe

In [38]:
df_final = df_replaced.copy()
df_final.head()

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,ever_married_No,...,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,1.051434,0,1,2.706375,0.991353,1,False,True,False,False,...,False,True,False,False,False,True,False,True,False,False
1,0.78607,0,0,2.121559,0.645152,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False
2,1.62639,0,1,-0.005028,0.461449,1,False,True,False,False,...,False,True,False,False,True,False,False,False,True,False
3,0.255342,0,0,1.437358,0.707015,1,True,False,False,False,...,False,True,False,False,False,True,False,False,False,True
4,1.582163,1,0,1.501184,-0.637132,1,True,False,False,False,...,False,False,True,False,True,False,False,False,True,False


In [39]:
df_final.to_csv('stroke_cleaned_final.csv', index=False)