## Stroke Prediction

Dataset found here: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

The goal is to predict whether a patient will have a stroke based on the clinical features identified below. The dataset describes 11 input attributes; 10 remain as features once `id` is dropped.

In [None]:
import pandas as pd

df = pd.read_csv('/content/stroke.csv')
df = df.drop(columns=['id'])
df.head()

In [None]:
df.info()

In [None]:
import seaborn as sns

# plot target
_ = sns.countplot(data=df, x='stroke')

Findings:
- The `bmi` column has missing values
- The target classes are imbalanced in favor of the negative (no-stroke) outcome
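
Both findings can be checked numerically rather than read off `df.info()` and the count plot. The following is a minimal sketch on a hypothetical toy frame; `summarize_quality` and `toy` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def summarize_quality(frame: pd.DataFrame, target: str) -> dict:
    """Return missing-value counts per column and the target's class shares."""
    missing = frame.isna().sum()
    return {
        "missing": missing[missing > 0].to_dict(),
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical toy frame standing in for the stroke data.
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.5, 27.0, np.nan],
    "stroke": [0, 0, 0, 0, 1],
})
summary = summarize_quality(toy, "stroke")
print(summary)
```

Calling `summarize_quality(df, 'stroke')` on the real frame should report the `bmi` gap and the share of positive cases in one step.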



In [None]:
# replace missing BMI values
df['bmi'] = df['bmi'].fillna(df['bmi'].median())

In [None]:
# identify numerical/categorical features
numerical = ['age', 'avg_glucose_level', 'bmi']
categorical = ['gender', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

In [None]:
# count unique values in categorical columns
for col in categorical:
  print(f"{col} has {df[col].nunique()} values")

In [None]:
# describe numerical statistics
df[numerical].describe()

In [None]:
# analyze skew
df[numerical].skew()

Findings:
- Age is skewed slightly left, which suggests a larger presence of older patients, although the mean and median ages are only 43 and 45 years respectively
- BMI is skewed moderately right, with a mean and median of roughly 28, which falls in the overweight range
- Glucose levels are skewed right: the median of 91.89 mg/dL is within the normal range, but the mean of 106.15 mg/dL falls in the range associated with prediabetes (for a fasting measurement)
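
The gap between mean and median is what a right skew looks like in summary statistics: a few large values pull the mean up while the median stays put. A small sketch with made-up glucose readings (the values in `glucose` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical right-skewed sample: most readings are low, a few are very high.
glucose = pd.Series([80, 85, 88, 90, 92, 95, 100, 180, 230])

mean, median, skew = glucose.mean(), glucose.median(), glucose.skew()
print(f"mean={mean:.2f}, median={median:.2f}, skew={skew:.2f}")
```

The two high readings raise the mean well above the median and give a positive skew, the same pattern seen in `avg_glucose_level`.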

In [None]:
def plot_numerical_distribution1(ax, data, column):
  _ = sns.histplot(ax=ax, data=data, x=column, kde=True).set(title='Distribution of ' + column)

def plot_categorical_distribution1(ax, data, column):
  _ = sns.countplot(ax=ax, data=data, x=column).set(title='Distribution of ' + column)

def plot_numerical_distribution2(ax, data, column):
  _ = sns.histplot(ax=ax, data=data, x=column, kde=True, hue='stroke').set(title='Distribution of ' + column)

def plot_categorical_distribution2(ax, data, column):
  _ = sns.countplot(ax=ax, data=data, x=column, hue='stroke').set(title='Distribution of ' + column)

In [None]:
import matplotlib.pyplot as plt

# plot numerical feature distributions, split by stroke outcome
fig, axes = plt.subplots(1, 3, figsize=(16, 6))
for ax, col in zip(axes, numerical):
  #plot_numerical_distribution1(ax, df, col)
  plot_numerical_distribution2(ax, df, col)

In [None]:
# find categorical value proportions
for col in categorical:
  print(f"{round(df[col].value_counts(normalize=True), 4)}\n")

In [None]:
# plot categorical feature distributions, split by stroke outcome
fig, axes = plt.subplots(2, 4, figsize=(16, 12))
for ax, col in zip(axes.flat, categorical):
  #plot_categorical_distribution1(ax, df, col)
  plot_categorical_distribution2(ax, df, col)
# hide the unused panels (7 features on a 2x4 grid)
for ax in axes.flat[len(categorical):]:
  ax.set_visible(False)

In [None]:
# find the observed stroke rate for each categorical value
for col in categorical:
  for val in df[col].unique():
    rate = df.loc[df[col] == val, 'stroke'].mean() * 100
    print(f"{col} = {val} -> {rate:.2f}% stroke rate")
  print('\n')

Findings:
- People with hypertension are more likely to have a stroke than those without
- People with heart disease are more likely to have a stroke than those without
- Males are slightly more likely to have a stroke
- Currently or previously married people are more likely to have a stroke
- Smoking status makes little difference to the outcome
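
These rates are easier to compare when each one is shown next to its group size, since a rate computed from a handful of rows is not very informative. A possible `groupby` version, shown on a hypothetical toy frame (`stroke_rate_by` and `toy` are illustrative names):

```python
import pandas as pd

def stroke_rate_by(frame, column, target="stroke"):
    """Stroke rate (%) and row count for each value of `column`."""
    out = frame.groupby(column)[target].agg(rate="mean", n="size")
    out["rate"] = (out["rate"] * 100).round(2)
    return out.sort_values("rate", ascending=False)

# Hypothetical data: 1 of 4 patients without hypertension had a stroke, 2 of 4 with it.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 1, 1, 0],
})
rates = stroke_rate_by(toy, "hypertension")
print(rates)
```

Applied to the notebook's data, this would be called as `stroke_rate_by(df, col)` for each `col` in `categorical`.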

In [None]:
# plot correlations between numeric columns (string columns are excluded)
corr = df.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')

In [None]:
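from sklearn.preprocessing import LabelEncoder, StandardScaler

# convert categorical to numerical (the encoder is refit on each column)
# note: the resulting integer codes are arbitrary and carry no ordering
le = LabelEncoder()
df[categorical] = df[categorical].apply(le.fit_transform)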

# prepare data for partitioning
X1 = df.drop(columns=['stroke'])
y1 = df['stroke']

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# score every feature against the target with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k='all')
selector.fit(X1, y1)

# pair each feature name with its F-score, highest first
fscores = pd.DataFrame({'Attribute': X1.columns, 'Score': selector.scores_})
fscores = fscores.sort_values(by='Score', ascending=False)
print(fscores)

Findings:
- Age appears to be the most relevant feature for predicting stroke
- Heart disease and hypertension are also quite relevant, supporting the findings above
- Glucose level and marital status are relevant as well

In [None]:
from imblearn.over_sampling import SMOTE

# set to True to keep only features with an F-score above 50
feature_select = False
ncols = fscores[fscores['Score'] > 50]['Attribute']

# oversample to address class imbalance
# note: resampling before the train/test split puts synthetic rows in the test set
sm = SMOTE(random_state=14)

# standardize data
ss = StandardScaler()

if feature_select:
  X, y = sm.fit_resample(X1[ncols], y1)
else:
  X, y = sm.fit_resample(X1, y1)
X = ss.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# partition data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=14)

#model = SVC(kernel='rbf')
#model = xgb.XGBClassifier()
model = RandomForestClassifier(n_estimators=750, random_state=14)
model.fit(X_train, y_train)

# identify most important features (names must match the columns used to build X)
feature_names = ncols if feature_select else X1.columns
for col, feat in zip(feature_names, model.feature_importances_):
  print(f"{col}: {feat}")
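
The cells above oversample and standardize the full dataset before splitting it, so the test set contains synthetic rows and the scaler has seen test data. One possible alternative, sketched here on synthetic data, splits first and fits the scaler and the resampling on the training part only. Random oversampling via `sklearn.utils.resample` is used as a simpler stand-in for SMOTE, and all names (`X_demo`, `clf`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic imbalanced data standing in for X1, y1 (roughly 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=14
)

# 1. Split first, stratified so both parts keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=14
)

# 2. Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# 3. Balance only the training part (random oversampling, not SMOTE).
pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
pos_up = resample(pos, replace=True, n_samples=len(neg), random_state=14)
X_bal = np.vstack([neg, pos_up])
y_bal = np.concatenate([np.zeros(len(neg), dtype=int),
                        np.ones(len(pos_up), dtype=int)])

# 4. Train on the balanced data, evaluate on the untouched test part.
clf = RandomForestClassifier(n_estimators=100, random_state=14).fit(X_bal, y_bal)
test_recall = recall_score(y_te, clf.predict(X_te))
print(f"test positives: {y_te.sum()} of {len(y_te)}, recall: {test_recall:.2f}")
```

Because the test part keeps the original class ratio, accuracy alone is less informative there, and recall on the positive class is the more relevant number.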

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

y_pred = model.predict(X_test)

print(f"Accuracy Score: {round(accuracy_score(y_test, y_pred) * 100, 2)}%")
print(classification_report(y_test, y_pred))

In [None]:
# confusion_matrix expects the true labels first, then the predictions
cmd = ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred))
cmd.plot()
plt.show()

Findings:
- Age is by far the most relevant feature in the model, followed by glucose level and BMI, once again supporting the findings above
- The random forest classifier with 750 estimators outperformed the other random forest configurations, the XGBoost classifier, and the SVM
- The model performs well overall on the SMOTE-balanced test split, but the number of false negatives is of some concern
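
Counting false negatives depends on the argument order of `confusion_matrix`: with the true labels first, rows are true classes and columns are predicted classes. A small sketch with hypothetical labels (`y_true` and `y_hat` are made up for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = stroke, 0 = no stroke.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
recall = tp / (tp + fn)  # share of actual strokes the model caught
print(f"TN={tn} FP={fp} FN={fn} TP={tp} recall={recall:.2f}")
```

Here two of the three actual strokes are missed, so recall is 0.33 even though 5 of 8 predictions are correct.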