<a id="introduction"></a>
**INTRODUCTION**

* A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes.

* A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.

<strong> Attribute Information </strong>
*  id: unique identifier
*  gender: "Male", "Female" or "Other"
*  age: age of the patient
*  hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
*  heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
*  ever_married: "No" or "Yes"
*  work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
*  Residence_type: "Rural" or "Urban"
*  avg_glucose_level: average glucose level in blood
*  bmi: body mass index
*  smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
*  stroke: 1 if the patient had a stroke or 0 if not <br>

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

In [1]:
# importing libraries
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [3]:
df.drop(['id'], axis = 1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                4909 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


Only the bmi feature has missing values: 4909 of its 5110 entries are non-null, so 201 are missing.
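
The missing-value count can also be computed directly instead of being read off the `df.info()` listing. A minimal sketch on a small toy frame (the frame name `toy` and its values are assumptions for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the stroke data; values are illustrative only.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 1, 1],
})

# Count missing entries per column and keep only the columns that have any.
missing = toy.isna().sum()
cols_with_missing = missing[missing > 0]
print(cols_with_missing)
```

Run against the real `df`, the same two lines should report bmi as the only column with gaps.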

In [4]:
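# filling missing bmi values with the column mean
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())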

In [5]:
df.stroke.value_counts()

0    4861
1     249
Name: stroke, dtype: int64
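
To make the imbalance concrete, the counts printed above can be turned into proportions. This is plain arithmetic on those two numbers; the variable names are chosen here for illustration:

```python
# Class counts as printed by df.stroke.value_counts() above.
n_no_stroke = 4861
n_stroke = 249
n_total = n_no_stroke + n_stroke

minority_share = n_stroke / n_total        # fraction of stroke cases
majority_baseline = n_no_stroke / n_total  # accuracy of always predicting 0

print(f"stroke share: {minority_share:.2%}")
print(f"always-predict-0 accuracy: {majority_baseline:.2%}")
```

Fewer than 5% of rows are stroke cases, so a model that always predicts 0 would already be about 95% accurate. That is why the classes are rebalanced before training and why accuracy alone is a weak metric here.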

In [6]:
# over-sampling the minority class

from sklearn.utils import resample, shuffle
df_majority = df[df['stroke'] == 0]
df_minority = df[df['stroke'] == 1]
# draw 4800 minority rows with replacement so both classes end up roughly equal in size
df_minority_upsampled = resample(df_minority, replace=True, n_samples=4800, random_state=123)
balanced_df = pd.concat([df_minority_upsampled, df_majority])
balanced_df = shuffle(balanced_df)
balanced_df.stroke.value_counts()
df = balanced_df.copy()

# label encoding

residence_mapping = {'Urban': 0, 'Rural': 1}
df['Residence_type'] = df['Residence_type'].map(residence_mapping)
marriage_mapping = {'No': 0, 'Yes': 1}
df['ever_married'] = df['ever_married'].map(marriage_mapping)

# one-hot encoding

dfDummies = pd.get_dummies(df[["gender","work_type","smoking_status"]],drop_first=True)
df.drop(["gender","work_type","smoking_status"], axis=1, inplace=True)
df = pd.concat([df, dfDummies], axis=1)

# scaling (StandardScaler is already imported at the top)

std = StandardScaler()
columns = ['avg_glucose_level', 'bmi', 'age']
df[columns] = std.fit_transform(df[columns])

df.head(5)

Unnamed: 0,age,hypertension,heart_disease,ever_married,Residence_type,avg_glucose_level,bmi,stroke,gender_Male,gender_Other,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
3217,0.279922,0,0,1,0,-0.557102,-0.841624,0,0,0,0,1,0,0,0,0,0
107,-0.034441,0,0,1,1,1.338323,1.561037,1,0,0,0,0,1,0,0,0,1
234,1.043377,0,0,1,0,1.822447,-0.395621,1,1,0,0,0,1,0,0,0,0
1121,0.639195,0,0,1,0,-0.092241,-0.985496,0,0,0,0,0,1,0,0,1,0
204,-0.034441,0,0,1,1,-0.551105,0.223028,1,1,0,0,0,0,0,0,0,1
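
The cell above oversamples and scales the full dataset before any train/test split. A common alternative ordering is to split first, then oversample and fit the scaler on the training portion only, so that copies of the same row cannot end up on both sides of the split. The sketch below illustrates that ordering on synthetic data; the frame `toy`, its values, and the reduced column set are assumptions for illustration and are not the notebook's actual pipeline:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the stroke data (illustrative values only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.uniform(1, 90, size=200),
    "bmi": rng.uniform(15, 45, size=200),
    "stroke": [1] * 20 + [0] * 180,
})

# 1) Split first, stratifying on the target.
train, test = train_test_split(
    toy, test_size=0.2, random_state=101, stratify=toy["stroke"]
)

# 2) Oversample the minority class inside the training set only.
train_major = train[train["stroke"] == 0]
train_minor = train[train["stroke"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=123
)
train_bal = pd.concat([train_major, train_minor_up]).sample(frac=1, random_state=123)

# 3) Fit the scaler on training rows, then reuse it unchanged on the test rows.
num_cols = ["age", "bmi"]
scaler = StandardScaler().fit(train_bal[num_cols])
X_train = scaler.transform(train_bal[num_cols])
X_test = scaler.transform(test[num_cols])
y_train, y_test = train_bal["stroke"], test["stroke"]

print(train_bal["stroke"].value_counts())
print(X_test.shape)
```

With this ordering the test set keeps the original class ratio, so its metrics describe performance on data that looks like the raw dataset.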


In [10]:
#splitting data

y = df["stroke"]
X = df.drop(['stroke'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 101,stratify=y)

# ML model

model_xgb = 'Extreme Gradient Boost'
xgb = XGBClassifier(learning_rate=0.01, n_estimators=15, max_depth=10,gamma=0.6, subsample=0.52,colsample_bytree=0.6,seed=27, 
                    reg_lambda=2, booster='dart', colsample_bylevel=0.6, colsample_bynode=0.5)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(y_test, xgb_predicted)
xgb_acc_score = accuracy_score(y_test, xgb_predicted)
print("confusion matrix")
print(xgb_conf_matrix)
print("-"*30)
print("AUC-ROC score of Extreme Gradient Boost:",roc_auc_score(y_test, xgb_predicted)*100,'\n')
print("-"*30)
print(classification_report(y_test,xgb_predicted))

confusion matrix
[[718 255]
 [ 88 872]]
------------------------------
AUC-ROC score of Extreme Gradient Boost: 82.31286399451866 

------------------------------
              precision    recall  f1-score   support

           0       0.89      0.74      0.81       973
           1       0.77      0.91      0.84       960

    accuracy                           0.82      1933
   macro avg       0.83      0.82      0.82      1933
weighted avg       0.83      0.82      0.82      1933
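
The figures in the report can be re-derived by hand from the confusion matrix, which also shows what the printed AUC-ROC value measures. `roc_auc_score` was given hard 0/1 predictions, so the ROC curve has a single operating point and its area reduces to the mean of the two per-class recalls (balanced accuracy). A short arithmetic check, with variable names chosen here for illustration:

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 718, 255
fn, tp = 88, 872

recall_0 = tn / (tn + fp)      # specificity
recall_1 = tp / (tp + fn)      # sensitivity
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)
accuracy = (tn + tp) / (tn + fp + fn + tp)

# With hard labels, ROC AUC equals the mean of the two recalls.
auc_from_labels = (recall_0 + recall_1) / 2

print(round(precision_0, 2), round(recall_0, 2))
print(round(precision_1, 2), round(recall_1, 2))
print(round(accuracy, 2), auc_from_labels * 100)
```

For a threshold-independent AUC, one option would be to pass the positive-class probabilities, for example `xgb.predict_proba(X_test)[:, 1]`, to `roc_auc_score` instead of the predicted labels. That would give a different number from the one shown above.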



In [11]:
# saving model and scaler for later use

import pickle
# context managers make sure the files are flushed and closed
with open('model.pkl', 'wb') as f:
    pickle.dump(xgb, f)
with open('scaler.pkl', 'wb') as f:
    pickle.dump(std, f)
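
Reusing the saved artifacts later means loading them and calling `transform`, not `fit_transform`, on new rows, with the columns in the order the scaler was fitted on (avg_glucose_level, bmi, age). The sketch below shows the round trip for a stand-in scaler. It is fitted on three rows copied from the `df.head()` output purely for illustration, and the temporary path is an assumption, so the resulting numbers will not match the notebook's real `scaler.pkl`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in scaler fitted on a few (avg_glucose_level, bmi, age) rows.
train_values = np.array([
    [228.69, 36.6, 67.0],
    [105.92, 32.5, 80.0],
    [171.23, 34.4, 49.0],
])
scaler = StandardScaler().fit(train_values)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.pkl")
    # Save the fitted scaler.
    with open(path, "wb") as f:
        pickle.dump(scaler, f)
    # Later, for example in a serving script: load it back.
    with open(path, "rb") as f:
        loaded_scaler = pickle.load(f)

# New rows are transformed with the stored mean and variance, never re-fitted.
new_row = np.array([[174.12, 24.0, 79.0]])
scaled = loaded_scaler.transform(new_row)
print(scaled)
```

The model would be loaded the same way from `model.pkl`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.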