Predicting whether or not a patient will have a stroke using medical records. Focus on comparison between Statistical Modeling and Ensemble Classifiers performance for large class imbalance.

dhelms1/stroke_prediction

Stroke Prediction


Data & Project Overview

Stroke is the third leading cause of death in the United States, with over 140,000 people dying annually. Each year approximately 795,000 people suffer a stroke, with nearly 75% of these occurring in people over the age of 65. High blood pressure is the most important risk factor for stroke (Stroke Center). The data originated from the Kaggle repository for stroke prediction, with 11 features recorded for 5110 observations. The goal of this project is to explore the data, find correlations between the features and the response variable stroke, and use those findings to engineer new features. We then compare Statistical Modeling and Ensemble Modeling to see which achieves better results. Note that these models are evaluated by F-Beta and Recall scores, since avoiding a missed diagnosis is the main focus.
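The recall-focused evaluation described above can be computed directly with scikit-learn. Below is a minimal sketch with made-up labels; the choice of beta=2 is an assumption (any beta > 1 weights recall over precision):

```python
from sklearn.metrics import fbeta_score, recall_score

# Hypothetical labels: 1 = stroke, 0 = no stroke.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# beta > 1 weights recall more heavily than precision, matching
# the goal of avoiding missed diagnoses.
f2 = fbeta_score(y_true, y_pred, beta=2)
rec = recall_score(y_true, y_pred)
print(f"F2: {f2:.3f}, Recall: {rec:.3f}")  # F2: 0.750, Recall: 0.750
```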

Extra Libraries:

Data Format:

For the statistical modeling section, the data was reformatted in two ways to accommodate the large class imbalance (around 20x more observations of "No Stroke" compared to "Stroke"):

  • Training data was balanced using SMOTE (Synthetic Minority Oversampling Technique) to increase the minority "Stroke" class to a 3:4 ratio with the majority "No Stroke" class. This resulted in around 3400 majority observations (0) and 2500 minority observations (1).
  • Testing data was balanced using the NearMiss algorithm, which undersampled the majority class to a 4:3 ratio with the minority class. This resulted in around 120 majority and 90 minority observations for evaluation. Note: when evaluating on oversampled data, the results did not seem as accurate, since repeated observations inflated the scores. The goal is a model prepared for real-world data rather than higher metrics on repeated data.

For the ensemble modeling section, the data was reformatted in the following ways to accommodate the class imbalance:

  • Training data was left untouched since the ensemble algorithms we used are able to handle the imbalance within the model itself.
  • Testing data was resampled so that we would have a "Stroke" to "No Stroke" ratio of 2:3, resulting in around 50 minority and 75 majority observations (slightly smaller than the statistical modeling data).
  • An important note is that the extra observations from the majority class (after being undersampled) in the testing data were added back into the training data so that we had more data to train on. This was due to the algorithms being able to handle class imbalance (so more majority observations would not have a negative effect).

IMPORTANT: see the conclusion for details about the differences in data between the two models.


Findings

Statistical Modeling:

For the statistical modeling section, we first fit an initial model using all features that resulted in the following output:

Following this, 3 models were fit using a feature subset selection process (Best, Forwards, and Backwards). Each of the 3 models selected the same subset of 4 features, all of which were statistically significant (p-value < 0.05) and included: age, bmi, age_over_45, & never_smoked. From the base model (fit on all features), we have the following improvements:

  • True Negatives increased by 2, while False Positives decreased by 2 (fewer people classified as stroke that did not have a stroke).
  • False Negatives decreased by 21, while True Positives increased by 21 (more people classified as stroke that actually had a stroke).
  • Precision increased from 57% to 66%.
  • Recall increased from 58% to 82% (this was the most important evaluation metric to improve).
  • Accuracy increased from 63% to 74%.

Note: Backwards Feature Selection is shown, but all 3 methods had the same confusion matrix and evaluation metrics.
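The forward and backward passes of the subset selection can be sketched with scikit-learn's `SequentialFeatureSelector` (a stand-in for whatever implementation the notebook uses; it does not cover Best Subset, and the synthetic data here has no named features):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 11 features, of which we select a subset of 4.
X, y = make_classification(n_samples=500, n_features=11, n_informative=4,
                           random_state=0)

model = LogisticRegression(max_iter=1000)
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward", scoring="recall")
backward = SequentialFeatureSelector(model, n_features_to_select=4,
                                     direction="backward", scoring="recall")
fwd_idx = forward.fit(X, y).get_support(indices=True)
bwd_idx = backward.fit(X, y).get_support(indices=True)
print(fwd_idx, bwd_idx)
```

Scoring on recall keeps the selection aligned with the project's missed-diagnosis focus.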

Ensemble Modeling:

For the ensemble modeling section, we used the Imbalanced-Learn API (linked above in the extra libraries section), which has ensemble methods similar to Sklearn's but better suited to handling imbalanced classes. The model chosen for this section is the BalancedRandomForestClassifier, which had the highest recall of the 3 initial models that were fit (see the ensemble modeling section in the notebook for more details). The initial BRFC model had the following results:

After fitting the above model, we proceeded with hyperparameter tuning to see if we could improve our results. However, the tuned model was (for the most part) equivalent to the initial model, with one exception: the number of estimators was reduced from 100 to 25, which had no negative impact on the predicted results and doubled the computational speed of the model. The confusion matrix and evaluation metrics for the tuned model were the same as above (so they will not be shown), but the following feature importances were graphed for the tuned model:

An important note is that age, bmi, and avg_glucose_levels were not normalized, while all other features had discrete values in the range [0,1]; normalizing the inputs produced the same feature importances but worse performance.

Note: The tuned model was also run through the Exhaustive Feature Selector from mlxtend to find the best combination of features (ranging from 2 to 7 features), but the subset model had slightly worse performance. The original tuned model was kept for this reason (see the end of the notebook for more info).
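The exhaustive search can be reproduced without mlxtend as a plain loop over subsets scored by cross-validated recall. This sketch is deliberately scaled down (5 features, subsets of size 2-3, a plain RandomForest) since the notebook's full 2-to-7-of-11 search is much larger:

```python
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small stand-in problem: 5 features, subsets of size 2-3.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

best_score, best_subset = -1.0, None
for k in (2, 3):
    for subset in combinations(range(X.shape[1]), k):
        clf = RandomForestClassifier(n_estimators=25, random_state=0)
        # Score each candidate subset by cross-validated recall.
        score = cross_val_score(clf, X[:, list(subset)], y, cv=3,
                                scoring="recall").mean()
        if score > best_score:
            best_score, best_subset = score, subset
print(best_subset, round(best_score, 3))
```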


Conclusion

Comparing the statistical and ensemble models, we can see that the ensemble model performs better (although not by a large margin). We have the following differences based on the final models of both types:

  • Accuracy: Ensemble model has a 7% advantage.
  • Precision: Ensemble model has a 5% advantage.
  • Recall: Ensemble model has a 6% advantage.

An important note is that the two model types were trained and tested on different data. The statistical model's training data was balanced by oversampling the minority class, since the imbalance could not be handled within the model itself (while its testing data was undersampled). The ensemble model's training data was left unbalanced, but its testing data was undersampled and slightly smaller (the observations left over from undersampling were also added back into training, since imbalanced data was not a concern).

Overall, the ensemble model seems to be slightly stronger in all evaluation aspects, and the computational speeds were similar for both models. One place to improve would be making the data splits more similar, which would lend more validity to our claim about which model is stronger.
