## ⚕️ Patient Stroke Prediction

Using *medical patient data*, let's try to predict whether a patient will have a **stroke**.

We will use a variety of classification models to make our predictions.

Data source: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import accuracy_score, f1_score

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv("archive/healthcare-dataset-stroke-data.csv")
data

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
# Drop id column
df = df.drop('id', axis=1)
df

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
5105,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [7]:
{column: len(df[column].unique()) for column in df.select_dtypes('object').columns}

{'gender': 3,
 'ever_married': 2,
 'work_type': 5,
 'Residence_type': 2,
 'smoking_status': 4}

In [8]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'gender': array(['Male', 'Female', 'Other'], dtype=object),
 'ever_married': array(['Yes', 'No'], dtype=object),
 'work_type': array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
       dtype=object),
 'Residence_type': array(['Urban', 'Rural'], dtype=object),
 'smoking_status': array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
       dtype=object)}

In [9]:
# Binary encoding (map avoids the downcasting FutureWarning that replace raises in recent pandas)
df['ever_married'] = df['ever_married'].map({'No': 0, 'Yes': 1})

In [10]:
df['Residence_type'] = df['Residence_type'].map({'Rural': 0, 'Urban': 1})

In [11]:
df

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,Male,67.0,0,1,1,Private,1,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,1,Self-employed,0,202.21,,never smoked,1
2,Male,80.0,0,1,1,Private,0,105.92,32.5,never smoked,1
3,Female,49.0,0,0,1,Private,1,171.23,34.4,smokes,1
4,Female,79.0,1,0,1,Self-employed,0,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...
5105,Female,80.0,1,0,1,Private,1,83.75,,never smoked,0
5106,Female,81.0,0,0,1,Self-employed,1,125.20,40.0,never smoked,0
5107,Female,35.0,0,0,1,Self-employed,0,82.99,30.6,never smoked,0
5108,Male,51.0,0,0,1,Private,0,166.29,25.6,formerly smoked,0


In [12]:
{column: df[column].unique() for column in df.select_dtypes('object').columns}

{'gender': array(['Male', 'Female', 'Other'], dtype=object),
 'work_type': array(['Private', 'Self-employed', 'Govt_job', 'children', 'Never_worked'],
       dtype=object),
 'smoking_status': array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
       dtype=object)}

In [14]:
pd.get_dummies(df['work_type'], dtype=int)

Unnamed: 0,Govt_job,Never_worked,Private,Self-employed,children
0,0,0,1,0,0
1,0,0,0,1,0
2,0,0,1,0,0
3,0,0,1,0,0
4,0,0,0,1,0
...,...,...,...,...,...
5105,0,0,1,0,0
5106,0,0,0,1,0
5107,0,0,0,1,0
5108,0,0,1,0,0


In [15]:
def onehot_encode(df, column):
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df
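
As a quick illustration of what `onehot_encode` produces, the sketch below applies the same helper to a small made-up frame (the `toy` DataFrame and its `color` column are hypothetical, not part of the stroke data). Each category becomes its own 0/1 column prefixed with the original column name, and the original column is dropped.

```python
import pandas as pd

def onehot_encode(df, column):
    # Same helper as in the notebook: expand one categorical column
    # into 0/1 indicator columns and drop the original column
    df = df.copy()
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    df = pd.concat([df, dummies], axis=1)
    df = df.drop(column, axis=1)
    return df

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"value": [1, 2, 3], "color": ["red", "blue", "red"]})
encoded = onehot_encode(toy, column="color")
print(encoded)
```

Because the function works on a copy, the input frame `toy` still has its `color` column afterwards.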

In [16]:
# One hot encoding
for column in ['gender', 'work_type', 'smoking_status']:
    df = onehot_encode(df, column=column)

In [17]:
df

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke,gender_Female,gender_Male,gender_Other,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
0,67.0,0,1,1,1,228.69,36.6,1,0,1,0,0,0,1,0,0,0,1,0,0
1,61.0,0,0,1,0,202.21,,1,1,0,0,0,0,0,1,0,0,0,1,0
2,80.0,0,1,1,0,105.92,32.5,1,0,1,0,0,0,1,0,0,0,0,1,0
3,49.0,0,0,1,1,171.23,34.4,1,1,0,0,0,0,1,0,0,0,0,0,1
4,79.0,1,0,1,0,174.12,24.0,1,1,0,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5105,80.0,1,0,1,1,83.75,,0,1,0,0,0,0,1,0,0,0,0,1,0
5106,81.0,0,0,1,1,125.20,40.0,0,1,0,0,0,0,0,1,0,0,0,1,0
5107,35.0,0,0,1,0,82.99,30.6,0,1,0,0,0,0,0,1,0,0,0,1,0
5108,51.0,0,0,1,0,166.29,25.6,0,0,1,0,0,0,1,0,0,0,1,0,0


In [18]:
# Split df into X and y
y = df['stroke']
X = df.drop('stroke', axis=1)

In [19]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
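
When the positive class is rare, a plain random split can leave the training and test sets with slightly different positive rates. One option, not used in this notebook, is to pass `stratify=y` to `train_test_split` so both splits keep the same class ratio. The sketch below shows the effect on synthetic labels (`X_demo` and `y_demo` are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positives out of 1000 (5%)
y_demo = np.array([1] * 50 + [0] * 950)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, shuffle=True, random_state=1, stratify=y_demo
)

# Both splits keep the 5% positive rate
print(y_tr.mean(), y_te.mean())
```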

In [20]:
X_train

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,gender_Female,gender_Male,gender_Other,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
4313,57.00,0,0,1,1,134.76,29.1,0,1,0,0,0,1,0,0,1,0,0,0
376,0.88,0,0,0,0,88.11,15.5,1,0,0,0,0,0,0,1,1,0,0,0
4913,20.00,0,0,0,1,84.49,20.5,0,1,0,0,0,1,0,0,0,0,1,0
1791,13.00,0,0,0,0,137.45,18.2,0,1,0,0,0,0,0,1,1,0,0,0
2166,28.00,0,0,1,0,169.49,27.2,0,1,0,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2895,46.00,0,0,1,1,137.77,29.3,0,1,0,0,0,1,0,0,0,0,1,0
2763,45.00,0,0,1,0,63.73,32.0,1,0,0,0,0,1,0,0,1,0,0,0
905,31.00,0,0,1,0,76.26,35.6,1,0,0,0,0,1,0,0,0,0,1,0
3980,45.00,0,0,1,0,218.10,55.0,1,0,0,0,0,1,0,0,0,0,0,1


In [21]:
X_train.isna().sum()

age                                 0
hypertension                        0
heart_disease                       0
ever_married                        0
Residence_type                      0
avg_glucose_level                   0
bmi                               143
gender_Female                       0
gender_Male                         0
gender_Other                        0
work_type_Govt_job                  0
work_type_Never_worked              0
work_type_Private                   0
work_type_Self-employed             0
work_type_children                  0
smoking_status_Unknown              0
smoking_status_formerly smoked      0
smoking_status_never smoked         0
smoking_status_smokes               0
dtype: int64

In [22]:
# KNN imputation of missing values
imputer = KNNImputer()

imputer.fit(X_train)

X_train = pd.DataFrame(imputer.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)
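
To make the imputation step concrete: `KNNImputer` fills each missing value with the mean of that feature over the nearest rows (5 by default), with distances computed on the features that are present. Note that the cell above fits the imputer on the training set only and then transforms both splits, so no information from the test set leaks into the fill values. The sketch below uses a tiny made-up array and `n_neighbors=2` so the result can be checked by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second feature is missing in the third row
X_demo = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [10.0, 100.0],
])

demo_imputer = KNNImputer(n_neighbors=2)
X_filled = demo_imputer.fit_transform(X_demo)

# The nearest rows by the first feature are [2.0, 20.0] and [1.0, 10.0],
# so the gap is filled with (20 + 10) / 2 = 15
print(X_filled[2, 1])
```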

In [24]:
X_train.isna().sum().sum()

np.int64(0)

In [25]:
X_test.isna().sum().sum()

np.int64(0)

In [26]:
# Scale X
scaler = StandardScaler()

scaler.fit(X_train)

X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)

In [27]:
X_train

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,gender_Female,gender_Male,gender_Other,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
4313,0.608843,-0.330374,-0.238161,0.719002,0.994146,0.638258,0.010074,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344
376,-1.886989,-0.330374,-0.238161,-1.390817,-1.005888,-0.386663,-1.754554,0.851774,-0.851774,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344
4913,-1.036663,-0.330374,-0.238161,-1.390817,0.994146,-0.466196,-1.105794,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344
1791,-1.347974,-0.330374,-0.238161,-1.390817,-1.005888,0.697358,-1.404223,-1.174021,1.174021,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344
2166,-0.680878,-0.330374,-0.238161,0.719002,-1.005888,1.401291,-0.236455,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2895,0.119639,-0.330374,-0.238161,0.719002,0.994146,0.704389,0.036024,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344
2763,0.075165,-0.330374,-0.238161,0.719002,-1.005888,-0.922303,0.386355,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344
905,-0.547458,-0.330374,-0.238161,0.719002,-1.005888,-0.647013,0.853462,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344
3980,0.075165,-0.330374,-0.238161,0.719002,-1.005888,2.469274,3.370652,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036


In [28]:
X_train.mean()

age                              -3.426576e-16
hypertension                      3.178273e-17
heart_disease                     3.376915e-17
ever_married                      3.376915e-17
Residence_type                   -3.376915e-17
avg_glucose_level                -1.628865e-16
bmi                              -3.913249e-16
gender_Female                     4.370126e-17
gender_Male                      -4.370126e-17
gender_Other                      0.000000e+00
work_type_Govt_job               -4.668089e-17
work_type_Never_worked           -3.277594e-17
work_type_Private                 1.688458e-17
work_type_Self-employed          -3.178273e-17
work_type_children                2.185063e-17
smoking_status_Unknown            9.336178e-17
smoking_status_formerly smoked    1.112396e-16
smoking_status_never smoked       6.952473e-18
smoking_status_smokes             3.774200e-17
dtype: float64

In [29]:
X_train.var()

age                               1.00028
hypertension                      1.00028
heart_disease                     1.00028
ever_married                      1.00028
Residence_type                    1.00028
avg_glucose_level                 1.00028
bmi                               1.00028
gender_Female                     1.00028
gender_Male                       1.00028
gender_Other                      0.00000
work_type_Govt_job                1.00028
work_type_Never_worked            1.00028
work_type_Private                 1.00028
work_type_Self-employed           1.00028
work_type_children                1.00028
smoking_status_Unknown            1.00028
smoking_status_formerly smoked    1.00028
smoking_status_never smoked       1.00028
smoking_status_smokes             1.00028
dtype: float64
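
The variances come out as 1.00028 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas' `.var()` reports the sample variance (`ddof=1`). The ratio between the two is n / (n - 1), and with the 3577 training rows here that is about 1.00028. `gender_Other` shows 0 because that column is constant in the training split. A small check on random data (the array `x` is synthetic):

```python
import numpy as np

n = 3577  # number of training rows (70% of 5110)
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=n)  # synthetic feature

# Standardize the way StandardScaler does: population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

# Sample variance (ddof=1), which is what pandas' .var() uses by default
sample_var = z.var(ddof=1)
print(round(sample_var, 5), round(n / (n - 1), 5))
```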

### Training

In [30]:
models = {
    "                   Logistic Regression": LogisticRegression(),
    "                   K-Nearest Neighbors": KNeighborsClassifier(),
    "                         Decision Tree": DecisionTreeClassifier(),
    "Support Vector Machine (Linear Kernel)": LinearSVC(),
    "   Support Vector Machine (RBF Kernel)": SVC(),
    "                     Gradient Boosting": GradientBoostingClassifier(),
    "                               XGBoost": XGBClassifier(eval_metric="logloss"),
    "                              LightGBM": LGBMClassifier(),
    "                              CatBoost": CatBoostClassifier(verbose=0)
}

In [31]:
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

                   Logistic Regression trained.
                   K-Nearest Neighbors trained.
                         Decision Tree trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Number of positive: 166, number of negative: 3411
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000267 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 653
[LightGBM] [Info] Number of data points in the train set: 3577, number of used features: 17
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.046408 -> initscore=-3.022773
[LightGBM] [Info] Start training from score -3.022773
                              LightGBM trained.
                              CatBoost trained.


In [32]:
y_train.value_counts()

stroke
0    3411
1     166
Name: count, dtype: int64

In [33]:
print("Model performance\n-------------------------")
for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred)
    print(f"\n{name} Accuracy: {accuracy:.3f}%\n\t\t\t     F1-Score: {f1:.5f}")

Model performance
-------------------------

                   Logistic Regression Accuracy: 94.586%
			     F1-Score: 0.00000

                   K-Nearest Neighbors Accuracy: 94.260%
			     F1-Score: 0.00000

                         Decision Tree Accuracy: 90.085%
			     F1-Score: 0.14607

Support Vector Machine (Linear Kernel) Accuracy: 94.586%
			     F1-Score: 0.00000

   Support Vector Machine (RBF Kernel) Accuracy: 94.586%
			     F1-Score: 0.00000

                     Gradient Boosting Accuracy: 94.586%
			     F1-Score: 0.00000

                               XGBoost Accuracy: 93.151%
			     F1-Score: 0.03670

                              LightGBM Accuracy: 94.064%
			     F1-Score: 0.06186

                              CatBoost Accuracy: 94.390%
			     F1-Score: 0.02273


In [34]:
from sklearn.metrics import confusion_matrix 

In [38]:
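# Confusion matrix for the first model (Logistic Regression)
# Rows are true classes, columns are predicted classes
confusion_matrix(y_test, list(models.values())[0].predict(X_test))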

array([[1450,    0],
       [  83,    0]])
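
The confusion matrix explains the gap between accuracy and F1: logistic regression predicts class 0 for every test patient. Rows are true classes and columns are predicted classes, so all 1450 non-stroke patients are classified correctly and all 83 stroke patients are missed. The arithmetic can be checked directly from the matrix above:

```python
# Counts read from the confusion matrix (rows: true class, columns: predicted class)
tn, fp = 1450, 0
fn, tp = 83, 0
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)  # share of actual stroke cases that were detected

# Accuracy of a model that always predicts "no stroke" on this test set
majority_baseline = (tn + fp) / total

print(f"accuracy = {accuracy:.5f}, recall = {recall:.5f}, baseline = {majority_baseline:.5f}")
```

Accuracy equals the majority-class baseline here, which is why it says little about how well stroke cases are detected.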

#### Handling Class Imbalance With Oversampling

In [43]:
oversampled_data = pd.concat([X_train, y_train], axis=1).copy()
oversampled_data

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,gender_Female,gender_Male,gender_Other,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,stroke
4313,0.608843,-0.330374,-0.238161,0.719002,0.994146,0.638258,0.010074,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344,0
376,-1.886989,-0.330374,-0.238161,-1.390817,-1.005888,-0.386663,-1.754554,0.851774,-0.851774,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344,0
4913,-1.036663,-0.330374,-0.238161,-1.390817,0.994146,-0.466196,-1.105794,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344,0
1791,-1.347974,-0.330374,-0.238161,-1.390817,-1.005888,0.697358,-1.404223,-1.174021,1.174021,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344,0
2166,-0.680878,-0.330374,-0.238161,0.719002,-1.005888,1.401291,-0.236455,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2895,0.119639,-0.330374,-0.238161,0.719002,0.994146,0.704389,0.036024,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344,0
2763,0.075165,-0.330374,-0.238161,0.719002,-1.005888,-0.922303,0.386355,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344,0
905,-0.547458,-0.330374,-0.238161,0.719002,-1.005888,-0.647013,0.853462,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344,0
3980,0.075165,-0.330374,-0.238161,0.719002,-1.005888,2.469274,3.370652,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036,0


In [46]:
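# Number of extra stroke rows needed to match the majority class
num_samples = y_train.value_counts()[0] - y_train.value_counts()[1]
# Draw that many stroke rows from the training data, with replacement
new_samples = oversampled_data.query("stroke == 1").sample(num_samples, replace=True, random_state=1)

# Append the duplicated rows, then shuffle and reset the index
oversampled_data = pd.concat([oversampled_data, new_samples], axis=0).sample(frac=1.0, random_state=1).reset_index(drop=True)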
oversampled_data

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,gender_Female,gender_Male,gender_Other,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes,stroke
0,0.030692,-0.330374,-0.238161,0.719002,-1.005888,0.604862,2.073132,0.851774,-0.851774,0.0,2.606344,-0.069103,-1.157341,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036,0
1,-1.481394,-0.330374,-0.238161,-1.390817,0.994146,-0.769169,-0.470009,-1.174021,1.174021,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344,0
2,-0.236146,-0.330374,-0.238161,0.719002,-1.005888,-0.536721,0.542057,0.851774,-0.851774,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036,0
3,-1.703759,-0.330374,-0.238161,-1.390817,-1.005888,-0.177285,-1.559926,-1.174021,1.174021,0.0,-0.383679,-0.069103,-1.157341,-0.438105,2.549304,1.515511,-0.461023,-0.763376,-0.425344,0
4,1.098047,-0.330374,4.198834,0.719002,0.994146,2.595164,0.373380,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,2.169088,-0.763376,-0.425344,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6817,1.142520,3.026868,-0.238161,-1.390817,0.994146,-0.653824,-0.054802,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,1.309970,-0.425344,0
6818,-0.058254,-0.330374,-0.238161,0.719002,-1.005888,-0.489924,-0.470009,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,1.515511,-0.461023,-0.763376,-0.425344,1
6819,0.742262,-0.330374,4.198834,0.719002,0.994146,-0.302956,0.892388,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036,1
6820,-0.547458,-0.330374,-0.238161,0.719002,0.994146,0.063950,0.526487,-1.174021,1.174021,0.0,-0.383679,-0.069103,0.864050,-0.438105,-0.392264,-0.659843,-0.461023,-0.763376,2.351036,0


In [48]:
oversampled_data["stroke"].value_counts()

stroke
0    3411
1    3411
Name: count, dtype: int64

In [50]:
y_train_oversampled = oversampled_data["stroke"]
X_train_oversampled = oversampled_data.drop("stroke", axis=1)
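
The oversampling steps above are applied to the training data only, so the test set keeps its original class distribution. The same idea can be wrapped in a small helper; the sketch below is one possible version (`oversample_minority` and the `toy` frame are hypothetical names, not part of the notebook) and assumes a binary target.

```python
import pandas as pd

def oversample_minority(frame, target, random_state=1):
    # Duplicate minority-class rows (sampling with replacement)
    # until both classes have the same number of rows
    counts = frame[target].value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    n_extra = int(counts[majority_label] - counts[minority_label])
    extra = frame[frame[target] == minority_label].sample(
        n_extra, replace=True, random_state=random_state
    )
    # Append the duplicates, shuffle, and reset the index
    combined = pd.concat([frame, extra], axis=0)
    return combined.sample(frac=1.0, random_state=random_state).reset_index(drop=True)

# Hypothetical toy data: 8 negatives and 2 positives
toy = pd.DataFrame({"x": range(10), "stroke": [0] * 8 + [1] * 2})
balanced = oversample_minority(toy, "stroke")
print(balanced["stroke"].value_counts())
```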

In [51]:
for name, model in models.items():
    model.fit(X_train_oversampled, y_train_oversampled)
    print(name + " trained.")

                   Logistic Regression trained.
                   K-Nearest Neighbors trained.
                         Decision Tree trained.
Support Vector Machine (Linear Kernel) trained.
   Support Vector Machine (RBF Kernel) trained.
                     Gradient Boosting trained.
                               XGBoost trained.
[LightGBM] [Info] Number of positive: 3411, number of negative: 3411
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000631 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 648
[LightGBM] [Info] Number of data points in the train set: 6822, number of used features: 17
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
                              LightGBM trained.
                              CatBoost trained.


In [52]:
print("Model performance\n-------------------------")
for name, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred) * 100
    f1 = f1_score(y_test, y_pred)
    print(f"\n{name} Accuracy: {accuracy:.3f}%\n\t\t\t     F1-Score: {f1:.5f}")

Model performance
-------------------------

                   Logistic Regression Accuracy: 73.190%
			     F1-Score: 0.25678

                   K-Nearest Neighbors Accuracy: 85.845%
			     F1-Score: 0.11429

                         Decision Tree Accuracy: 92.433%
			     F1-Score: 0.13433

Support Vector Machine (Linear Kernel) Accuracy: 72.929%
			     F1-Score: 0.25760

   Support Vector Machine (RBF Kernel) Accuracy: 78.082%
			     F1-Score: 0.19617

                     Gradient Boosting Accuracy: 79.256%
			     F1-Score: 0.22439

                               XGBoost Accuracy: 92.042%
			     F1-Score: 0.14085

                              LightGBM Accuracy: 91.520%
			     F1-Score: 0.19753

                              CatBoost Accuracy: 91.129%
			     F1-Score: 0.19048


In [53]:
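# Confusion matrix for Logistic Regression after oversampling
# Rows are true classes, columns are predicted classes
confusion_matrix(y_test, list(models.values())[0].predict(X_test))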

array([[1051,  399],
       [  12,   71]])
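
After oversampling, logistic regression trades accuracy for recall: it now catches 71 of the 83 stroke cases in the test set, at the cost of 399 false alarms. Working through the matrix above shows where the reported 73.190% accuracy and 0.25678 F1-score come from:

```python
# Counts read from the confusion matrix (rows: true class, columns: predicted class)
tn, fp = 1051, 399
fn, tp = 12, 71

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted strokes that were real
recall = tp / (tp + fn)     # share of real strokes that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy * 100:.3f}%")
print(f"precision = {precision:.5f}")
print(f"recall    = {recall:.5f}")
print(f"f1        = {f1:.5f}")
```

Recall is high while precision is low, so the F1-score stays modest even though most stroke cases are now detected.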