   **PROBLEM STATEMENT**

* A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes. A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.
* According to the WHO, stroke is the second leading cause of death. If we can warn people in advance that they are likely to have a stroke, they can change their lifestyle and adopt healthier habits. So, based on features like BMI, age, work type and smoking status, I built an ML binary classification model.

Jump to:

[EDA](#EDA)

[Modeling](#model)

[Deployment](#deploy)


In [None]:
# importing libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.metrics import classification_report

df = pd.read_csv('/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
print('Let\'s have a look at the dataset:')
df.head()

In [None]:
print('Details of the dataset:')
print ("Rows     : " , df.shape[0])
print ("Columns  : " , df.shape[1])
print ("\nFeatures : \n" , df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values : \n",df.nunique())

In [None]:
df.describe().T

*Let's find out what is missing*

In [None]:
df.isnull().sum().sort_values(ascending=False)

Only one column has missing values. We will fill it before building the model. Before that, let's do some exploratory data analysis. This is a crucial step, as it helps to turn data into insights.

## <a id='EDA'>EDA</a>

In [None]:
sns.boxenplot(x='avg_glucose_level',hue='gender',data=df, color='Red')
plt.title('Distribution of Average glucose level');

In [None]:
plt.figure(figsize=(10,5))
sns.boxenplot(x='bmi',data=df, color = 'Green')
plt.title('Distribution of BMI');

In [None]:
sns.set_style('whitegrid')
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['age'], kde=True, stat='density', color='black')
plt.xlim(0)
plt.title('Age distribution');

The majority of average glucose level records lie around 100, which is in the normal range of glucose. The same goes for bmi: most values fall in a typical range. Some outliers are also present in the data. Age varies from 0 to 80.


## Exploring data of people who suffered a stroke

In [None]:
# getting data of people who suffered stroke
stroke = df.loc[df['stroke']==1]
sns.countplot(data=stroke,x='ever_married', palette="flare")
plt.title("Stroke vs Ever-Married");

*Among people who had a stroke, married people significantly outnumber single people. Interesting!! Note that this is a count of stroke cases, not a rate within each group.*

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data=stroke,x='work_type', palette="bwr")
plt.title("Stroke vs Work Type");

In absolute numbers, people working in the private sector account for the most stroke cases.

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(data=stroke,x='smoking_status', palette="Set2")
plt.title("Stroke vs Smoking Status");

Taken together, former smokers and current smokers account for the most stroke cases.

In [None]:
sns.countplot(data=stroke,x='Residence_type', palette="summer")
plt.title("Stroke vs Residence Type");

Here we have a close distribution between rural and urban residence types. It looks like residence type does not have much effect.

In [None]:
sns.countplot(data=stroke,x='hypertension', palette="spring")
plt.title("Stroke vs Hypertension");

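In absolute numbers, most stroke cases occur in people without hypertension. This is a count, not a rate, so on its own it does not show that they are at higher risk.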

In [None]:
sns.countplot(data=stroke,x='heart_disease', palette="spring")
plt.title("Stroke vs Heart Disease");

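Likewise, most stroke cases occur in people without any previous heart disease. Again, this is a count, not a rate, so on its own it does not show that they are at higher risk.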
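The count plots above show only how many stroke cases fall into each category, so they are dominated by whichever category is most common in the data. One way to compare risk is to look at the stroke rate within each group. The sketch below uses a small made-up table (the values are invented, not taken from the dataset); on the real data the same `groupby` call could be applied to `df`.

```python
import pandas as pd

# Toy data for illustration only; values are invented, not from the Kaggle dataset.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
    "stroke":       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Raw number of stroke cases per group (what a countplot of stroke cases shows).
stroke_counts = toy.groupby("hypertension")["stroke"].sum()

# Stroke rate per group: the mean of a 0/1 column is the fraction of positives.
stroke_rate = toy.groupby("hypertension")["stroke"].mean()

print(stroke_counts)
print(stroke_rate)
```

In this toy table the group without hypertension has more stroke cases (2 vs 1) but a lower stroke rate (0.25 vs 0.5), which is why counts alone can be misleading.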

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
df.plot(kind="hist", y="age", bins=70, color="b", ax=axes[0][0])
df.plot(kind="hist", y="bmi", bins=100, color="r", ax=axes[0][1])
df.plot(kind="hist", y="heart_disease", bins=6, color="g", ax=axes[1][0])
df.plot(kind="hist", y="avg_glucose_level", bins=100, color="orange", ax=axes[1][1])
plt.show()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
df.plot(kind='scatter', x='age', y='avg_glucose_level', alpha=0.5, color='green', ax=axes[0], title="Age vs. avg_glucose_level")
df.plot(kind='scatter', x='bmi', y='avg_glucose_level', alpha=0.5, color='red', ax=axes[1], title="bmi vs. avg_glucose_level")
plt.show()

As age increases, average glucose level also tends to increase.

In [None]:
print('Now, let\'s check correlation between features.')
plt.figure(figsize=(10,10))
# numeric_only=True skips the text columns (required on pandas 2.x)
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='summer');

## <a id='model'>Modeling</a>

First of all, I will impute the missing values.

Here, I have imputed them with the mean. Mean imputation is generally considered poor practice, though, because it doesn’t take feature correlation into account. For example, if we take the average bmi over an age range of 15 to 80, then an eighty-year-old will appear to have a much higher bmi than he actually should. So you can try filling the missing values based on age or gender instead.

In [None]:
#filling missing values
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
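As an alternative to the global mean, missing bmi values can be filled with the mean of a comparable group. The sketch below is one possible approach, not the method used in this notebook. It assumes arbitrarily chosen age bands (the `age_band` column and its bin edges are hypothetical) and runs on a toy table rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; values are invented.
toy = pd.DataFrame({
    "age": [5, 8, 35, 40, 70, 75],
    "bmi": [16.0, np.nan, 28.0, np.nan, 30.0, np.nan],
})

# Hypothetical age bands; the bin edges are an arbitrary choice.
toy["age_band"] = pd.cut(toy["age"], bins=[0, 18, 60, 120],
                         labels=["child", "adult", "senior"])

# Mean bmi within each age band, aligned back to the original rows.
band_mean = toy.groupby("age_band", observed=True)["bmi"].transform("mean")

# Fill within the band first, then fall back to the global mean
# in case a whole band has no observed bmi.
toy["bmi"] = toy["bmi"].fillna(band_mean).fillna(toy["bmi"].mean())

print(toy)
```

Each missing value is replaced by the average of its own age band, so a child is no longer assigned a bmi that is mostly driven by adults.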

In [None]:
df.stroke.value_counts()

* This is a highly imbalanced dataset, so I up-sample the minority class using the resample function from sklearn.utils.
* Then I label-encode the categorical features that have two classes. For categorical features with more than two categories, I use one-hot encoding.
* Finally, after scaling the numerical features, I use an XGBoost classifier to build the model.
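One caveat with this order of steps: if the minority class is up-sampled before the train/test split, copies of the same row can land in both sets, and scaling the full data lets test-set statistics influence training. A common way around this is to split first, then resample and fit the scaler on the training part only. The sketch below illustrates that order on synthetic data; the column names, sizes and `random_state` values are placeholders, not this notebook's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced data for illustration only (5% positives).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.uniform(1, 80, size=200),
    "bmi": rng.normal(28, 5, size=200),
    "stroke": [1] * 10 + [0] * 190,
})

# 1. Split first, so the test set keeps the original class balance.
train, test = train_test_split(data, test_size=0.2, random_state=0,
                               stratify=data["stroke"])

# 2. Up-sample the minority class inside the training set only.
majority = train[train["stroke"] == 0]
minority = train[train["stroke"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
train_bal = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)

# 3. Fit the scaler on training rows, then apply it to the test rows.
num_cols = ["age", "bmi"]
scaler = StandardScaler().fit(train_bal[num_cols])
X_train = scaler.transform(train_bal[num_cols])
X_test = scaler.transform(test[num_cols])
y_train, y_test = train_bal["stroke"], test["stroke"]

print(y_train.value_counts())
print(y_test.value_counts())
```

With this order, the training classes are balanced while the test set stays untouched, so the reported metrics reflect the imbalance a deployed model would actually face.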

In [None]:
# over-sampling the minority class

from sklearn.utils import resample,shuffle
df_majority = df[df['stroke']==0]
df_minority = df[df['stroke']==1]
df_minority_upsampled = resample(df_minority,replace=True,n_samples=4800,random_state = 123)
balanced_df = pd.concat([df_minority_upsampled,df_majority])
balanced_df = shuffle(balanced_df)
balanced_df.stroke.value_counts()
df=balanced_df.copy()

# label encoding

residence_mapping = {'Urban': 0, 'Rural': 1}
df['Residence_type'] = df['Residence_type'].map(residence_mapping)
marriage_mapping = {'No': 0, 'Yes': 1}
df['ever_married'] = df['ever_married'].map(marriage_mapping)

# one-hot encoding

dfDummies = pd.get_dummies(df[["gender","work_type","smoking_status"]],drop_first=True)
df.drop(["gender","work_type","smoking_status"], axis=1, inplace=True)
df = pd.concat([df, dfDummies], axis=1)

# scaling

from sklearn.preprocessing import StandardScaler
std=StandardScaler()
columns = ['avg_glucose_level','bmi','age']
df[columns] = std.fit_transform(df[columns])

df.head(5)

In [None]:
#splitting data

y = df["stroke"]
# 'id' is a row identifier, not a predictor, so it is dropped along with the target
X = df.drop(['stroke','id'],axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 101,stratify=y)

# ML model

model_xgb = 'Extreme Gradient Boost'
xgb = XGBClassifier(learning_rate=0.01, n_estimators=15, max_depth=10,gamma=0.6, subsample=0.52,colsample_bytree=0.6,seed=27, 
                    reg_lambda=2, booster='dart', colsample_bylevel=0.6, colsample_bynode=0.5)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(y_test, xgb_predicted)
xgb_acc_score = accuracy_score(y_test, xgb_predicted)
print("confusion matrix")
print(xgb_conf_matrix)
print("-"*30)
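# AUC is computed from the predicted probability of the positive class, not from hard 0/1 labels
print("AUC-ROC score of Extreme Gradient Boost:",roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1])*100,'\n')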
print("-"*30)
print(classification_report(y_test,xgb_predicted))
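ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from predicted scores or probabilities. If `roc_auc_score` is given hard 0/1 predictions instead, only a single threshold is evaluated and the number can differ. The sketch below compares the two on synthetic data; `LogisticRegression` is used purely as a stand-in for any classifier with a `predict_proba` method, and the data is made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = LogisticRegression().fit(X_train, y_train)

# AUC from the predicted probability of the positive class.
auc_from_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# AUC from hard labels: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```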

## <a id='deploy'>DEPLOYMENT</a>

When we build the model on our local system, it can make predictions while the Python session is running, but as soon as we close the session everything is lost. So it is important to save the model to avoid repeating all the steps. In Python this is called pickling, or serialization, and it can be done with the pickle module.

In [None]:
# saving model and scaler for later use

# import pickle
# pickle.dump(xgb, open('model.pkl','wb'))
# pickle.dump(std, open('scaler.pkl', 'wb'))
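A round trip makes the idea concrete: an object is written to disk with `pickle.dump` and restored with `pickle.load`, after which it behaves like the original. The sketch below uses a `StandardScaler` fitted on made-up numbers and a temporary directory; the file name is a placeholder.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on made-up numbers (illustration only).
train_values = np.array([[20.0], [30.0], [40.0]])
scaler = StandardScaler().fit(train_values)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "scaler.pkl"  # placeholder file name

    # Serialize the fitted object to disk.
    with open(path, "wb") as f:
        pickle.dump(scaler, f)

    # Later (for example inside the web app), load it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler transforms new values exactly like the original.
new_values = np.array([[25.0], [35.0]])
print(restored.transform(new_values))
```

The same pattern applies to the trained model: the web app loads the pickled model and scaler once at start-up and reuses them for every request.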

Using the Flask framework, I deployed the model on Heroku, a platform as a service (PaaS).

YouTube tutorial: https://www.youtube.com/watch?v=mrExsjcvF4o

My GitHub link: https://github.com/ayushikaushik/ML-deployment-stroke-prediction

Website: https://strokes-prediction-api.herokuapp.com/

Basic Steps:
1. Train model.
2. Create web app using Flask.
3. Commit the code to GitHub.
4. Create an account on Heroku.
5. Link GitHub to Heroku.
6. Deploy the model.
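Step 2 might look roughly like the sketch below. This is not the code from the linked repository: the route name, the JSON field names, the feature list and the `DummyModel` stand-in are all assumptions made for illustration. In a real app the model and scaler would be loaded from the pickled files instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical feature order; a real app must use the same columns,
# in the same order, as the training data.
FEATURES = ["age", "avg_glucose_level", "bmi"]


class DummyModel:
    """Stand-in for the trained classifier (illustration only)."""

    def predict(self, rows):
        # Flags a row when age > 60; an arbitrary rule, not a real model.
        return [1 if row[0] > 60 else 0 for row in rows]


model = DummyModel()  # in practice: pickle.load(open('model.pkl', 'rb'))


@app.route("/predict", methods=["POST"])
def predict():
    # Read the JSON body; fall back to an empty dict if it is missing or invalid.
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify({"error": f"missing fields: {missing}"}), 400

    # Build one feature row in the expected order and predict.
    row = [float(payload[name]) for name in FEATURES]
    prediction = int(model.predict([row])[0])
    return jsonify({"stroke": prediction})
```

The endpoint can be exercised without starting a server by using Flask's built-in test client, for example `app.test_client().post("/predict", json={"age": 70, "avg_glucose_level": 100, "bmi": 25})`.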