Brainstroke Prediction

Kaggle dataset link: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset


Libraries used for dataset processing

  • NumPy
  • Pandas

Libraries used for graphical representation

  • Matplotlib
  • Seaborn

Libraries used for scaling and oversampling

  • scikit-learn (sklearn.preprocessing)
  • imbalanced-learn (imblearn)
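
A typical set of imports for the above (the aliases and the choice of StandardScaler are assumptions, not taken from the repository):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler  # assumed scaler; the repo only names sklearn.preprocessing
from imblearn.over_sampling import RandomOverSampler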

PREPROCESSING


  • Removed the id column – it adds no insight to the analysis, and dropping it reduces the dimensionality.
df = df.drop(['id'], axis=1)
  • Checked the count of NULL values across the attributes of the dataset.
print(df.isna().sum())
  • Only the bmi attribute had NULL values.
  • Plotted the distribution of bmi – it looked skewed – so the missing values were imputed with the median (see the snippet below).
  • The records with missing bmi were not dropped, because the dataset is highly imbalanced on the target attribute (stroke) and a good portion of the records with missing bmi belong to the positive stroke class.
  • The dataset is imbalanced because only a small number of records have a positive value for the stroke target attribute.
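
A minimal sketch of the median imputation, assuming the column is named bmi as in the Kaggle dataset:

df['bmi'] = df['bmi'].fillna(df['bmi'].median())  # median imputation for the skewed bmi distribution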

  • The gender attribute had 3 categories – Male, Female and Other. There was only 1 record of type "Other", so it was merged into the majority category to avoid an extra encoded column.
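
A sketch of that replacement, mapping the single "Other" record to the most frequent gender value rather than hard-coding the majority class:

df['gender'] = df['gender'].replace('Other', df['gender'].mode()[0])  # mode()[0] is the majority category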

  • Most of the attributes in the dataset are binary – the numeric binary values were converted to string labels so they could be dummy encoded alongside the other categorical attributes.

    • Dummy encoding is similar to one-hot encoding – the encoded columns hold 1/0 values and additional attributes/columns are created.
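
A sketch of the encoding step with pandas; the column names follow the Kaggle dataset description, and the exact list used in the notebook may differ:

cat_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status',
            'hypertension', 'heart_disease']  # assumed categorical/binary columns
df[['hypertension', 'heart_disease']] = df[['hypertension', 'heart_disease']].astype(str)  # numeric bins -> string bins
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)  # drop_first gives dummy (not full one-hot) encoding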
  • Random oversampling was done on the dataset to balance the skew in the target attribute.

    • This boosts the number of records in the minority (positive stroke) class by duplicating existing records.
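
A sketch of the oversampling step with imblearn's RandomOverSampler; the random_state value is an arbitrary choice:

from imblearn.over_sampling import RandomOverSampler

X = df.drop(['stroke'], axis=1)   # features
y = df['stroke']                  # target
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)  # balanced copies of the minority class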

EDA - Exploratory Data Analysis


  • Plotted each attribute (pie charts and histograms) to analyse any trends.
  • Plotted the target attribute against the other attributes to look for correlations.
  • Plotted a heatmap – the correlation plot between the attributes.
    • The heatmap showed very little correlation between the attributes.
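
A minimal sketch of the correlation heatmap with Seaborn, assuming the encoded DataFrame df from the preprocessing steps:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')  # pairwise correlation of all encoded attributes
plt.title('Attribute correlation heatmap')
plt.show()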

MODEL BUILDING


  • Created an 80-20 train-test split of the oversampled dataset.
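
A sketch of the split with scikit-learn, using the oversampled X_res / y_res from the earlier step (the random_state is arbitrary):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)  # 80-20 split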

Applied various machine learning models for predictive analysis

  1. Decision tree
  2. KNN
  3. XG-Boost
  4. Random forest
  5. Logistic regression
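
A sketch of how the five models could be instantiated and trained; hyperparameters are left at their defaults since the README does not specify them:

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

models = {
    'Decision tree': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(),
    'XG-Boost': XGBClassifier(),
    'Random forest': RandomForestClassifier(),
    'Logistic regression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)  # X_train, y_train from the split above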

Analysed the results using the confusion matrix (accuracy, precision, recall), plotted the ROC curves and generated the AUC scores.
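
A sketch of that evaluation for one fitted model; RocCurveDisplay assumes a recent scikit-learn version:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, roc_auc_score, RocCurveDisplay)

rf = models['Random forest']          # models, X_test, y_test from the sketches above
y_pred = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('AUC      :', roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
RocCurveDisplay.from_estimator(rf, X_test, y_test)  # ROC curve plot
plt.show()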

Accuracies calculated:

  1. Decision tree : 97.89%
  2. KNN : 97.22%
  3. XG-Boost : 97.48%
  4. Random forest : 99.48%
  5. Logistic regression : 76.34%

Chosen model - RANDOM FOREST

Results were validated using k-fold cross-validation (20 splits) to check for overfitting.

  • Random Forest k-fold accuracy: 95.01%
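
A sketch of the k-fold check with scikit-learn's cross_val_score; cv=20 matches the 20 splits mentioned above:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

scores = cross_val_score(RandomForestClassifier(), X_res, y_res, cv=20, scoring='accuracy')  # X_res, y_res from the oversampling sketch
print('Mean k-fold accuracy:', scores.mean())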