# Problem Statement
Stroke is the second leading cause of death globally and, according to the World Health Organization (WHO), is responsible for approximately 11% of all deaths. Despite advances in healthcare, predicting who will have a stroke remains difficult. This project aims to estimate the probability that an individual will experience a stroke from a set of health parameters. Using factors such as age, gender, BMI, and medical history, we train machine learning models to estimate that risk. Such a model could support early identification and allow timely intervention before adverse outcomes occur.

# The Dataset
The dataset contains real-world health records covering patients’ demographics and health attributes, and is used to predict whether a patient is likely to have a stroke. The goal is to provide insights that support early intervention and prevention strategies. Each row describes one patient through the following attributes:
1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown" ("Unknown" means that the information is unavailable for this patient)
12. stroke: 1 if the patient had a stroke or 0 if not
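
As a quick illustration of the schema above, the sketch below builds a tiny DataFrame with the same column names and then checks the class balance and any missing values. The rows are made-up placeholders rather than records from the actual dataset, and `toy`, `class_share` and `missing` are hypothetical names used only for this example.

```python
import pandas as pd

# Made-up rows that only mirror the column names and value formats listed above;
# they are NOT taken from the real dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Other"],
    "age": [67.0, 45.0, 80.0, 12.0],
    "hypertension": [0, 1, 0, 0],
    "heart_disease": [1, 0, 0, 0],
    "ever_married": ["Yes", "Yes", "Yes", "No"],
    "work_type": ["Private", "Self-employed", "Private", "children"],
    "Residence_type": ["Urban", "Rural", "Urban", "Rural"],
    "avg_glucose_level": [228.7, 95.1, 105.9, 88.3],
    "bmi": [36.6, None, 32.5, 18.0],  # None is a placeholder for a missing value
    "smoking_status": ["formerly smoked", "never smoked", "smokes", "Unknown"],
    "stroke": [1, 0, 0, 0],
})

# Share of each target class; a heavy skew towards 0 signals class imbalance.
class_share = toy["stroke"].value_counts(normalize=True)

# Number of missing values per column.
missing = toy.isna().sum()

print(class_share)
print(missing[missing > 0])
```

The same two checks can be run unchanged on the real DataFrame once it has been loaded with `pd.read_csv`.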

# The Import Statements
To work with the dataset and implement the machine learning models, we first import the necessary libraries.

In [1]:
# Pandas is used for handling data structures and data manipulation, such as reading data from CSV files, managing DataFrames, and performing operations like filtering, grouping, and aggregation.
import pandas as pd


# Matplotlib's pyplot module is used for creating visualisations such as line charts, bar charts, histograms, scatter plots, and more. It provides control over plot elements like titles, labels, and legends.
import matplotlib.pyplot as plt


# The RandomForestClassifier from scikit-learn is used for creating a random forest model, which is an ensemble machine learning method based on decision trees, commonly used for classification tasks.
from sklearn.ensemble import RandomForestClassifier

# The MLPClassifier (Multilayer Perceptron) is a neural network model that supports multi-layer architecture for performing classification tasks, especially useful for more complex datasets.
from sklearn.neural_network import MLPClassifier

# KNeighborsClassifier is used for implementing the k-Nearest Neighbors algorithm, which is a simple, instance-based learning algorithm for classification tasks based on feature similarity.
from sklearn.neighbors import KNeighborsClassifier

# Confusion matrix is a performance evaluation metric used to summarize the results of a classification model by showing the count of true positive, true negative, false positive, and false negative predictions.
from sklearn.metrics import confusion_matrix

# Classification report is used to provide a detailed performance evaluation of a classifier, including precision, recall, F1 score, and accuracy for each class in the dataset.
from sklearn.metrics import classification_report

# ConfusionMatrixDisplay is used to visually represent the confusion matrix in the form of a heatmap, making it easier to interpret the classification results.
from sklearn.metrics import ConfusionMatrixDisplay

# SMOTE (Synthetic Minority Oversampling Technique) is used to handle imbalanced datasets by generating synthetic examples for the minority class, improving model performance when dealing with imbalanced data.
from imblearn.over_sampling import SMOTE

# train_test_split is used to split the dataset into training and testing subsets. This ensures that the model is trained on one part of the data and tested on another, allowing for evaluation of model performance on unseen data.
from sklearn.model_selection import train_test_split
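
To show how several of these imports fit together, here is a minimal sketch of a stratified split followed by a k-Nearest Neighbors fit and a confusion matrix. It is not the notebook's actual pipeline: the data is synthetic (generated with `make_classification`, which the notebook does not import), and the settings `n_neighbors=5` and `test_size=0.2` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (about 90% class 0); not the stroke dataset.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42
)

# stratify=y keeps the class ratio similar in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversampling such as SMOTE would normally be applied to the training
# split only, so that the test split contains no synthetic samples.

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
```

Splitting before any resampling is the design choice worth noting here: it keeps the evaluation on data the model has never seen in any form.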

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv


# Creating the Machine Learning Model Class
Because we will implement several machine learning models and compare their performance, and because we want to follow object-oriented design principles, we create a class called `StrokeRiskPredictor` that holds the data preprocessing steps and each model's implementation as separate methods.

In [3]:
class StrokeRiskPredictor:
    def __init__(self, data_path):
        # Load the dataset from the specified path
        self.data = pd.read_csv(data_path)