<a href="https://colab.research.google.com/github/f-assiamah/Projects/blob/main/Project_Stroke_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Data Import

In [2]:
#Importing data
import pandas as pd

# Specify the path to your CSV file
file_path = 'https://drive.google.com/uc?export=download&id=1IKUzSjd0qkuLdyT2iC5O8zbpXb-mK_fT'

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  


#Cleaning and Preprocessing the Dataset

In [3]:
df = pd.read_csv(file_path)

# Some bmi values are missing, so we replace them with the mean of the non-missing values.
# The result is assigned back to the column because the chained
# df['bmi'].fillna(..., inplace=True) form stops updating df in pandas 3.0.
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

# Most machine learning models cannot use categorical variables directly.
# One-hot encoding converts each categorical variable into binary (0 or 1) columns.
# drop_first=True drops one column per variable, because it is implied by the others.
df = pd.get_dummies(df, columns=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'], drop_first=True)

# Scale the numerical features so they are on a similar scale and no single feature overpowers the others.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(df[['age', 'avg_glucose_level', 'bmi']])
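To make the imputation and one-hot encoding steps concrete, here is a minimal sketch on a three-row toy DataFrame. The `toy` and `encoded` names and the values are assumptions chosen for illustration, not rows taken from the stroke dataset.

```python
import pandas as pd

# Toy data (assumed for illustration): one missing bmi value, one categorical column.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "bmi": [36.6, None, 32.5],
})

# Mean imputation: the missing value becomes the mean of 36.6 and 32.5.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].mean())

# One-hot encoding with drop_first=True: the first category (Female) is dropped,
# so a single gender_Male column is enough to describe both genders.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True)
print(encoded)
```

With two categories, one binary column carries all the information, which is why `drop_first=True` removes the other.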


#Splitting the Data into Features (X) and Target (y)

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and the target variable (y).
# The id column is dropped because it is only an identifier, and stroke is dropped from X because it is the target the models predict.
X = df.drop(columns=['id', 'stroke'])
y = df['stroke']

#Splitting Data into Training and Test Sets

In [None]:
#80% of the data will be used to train the model and 20% of it to test the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
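Because stroke cases are rare, a purely random split can leave the training and test sets with slightly different stroke rates. One option, shown here only as a sketch and not as what this notebook does, is the `stratify` argument of `train_test_split`. The toy labels below are an assumption (940 non-stroke and 60 stroke cases), chosen so the proportions are easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed for illustration): 6% positive class.
y = np.array([0] * 940 + [1] * 60)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Stroke rate in train: {y_train.mean():.2f}")
print(f"Stroke rate in test:  {y_test.mean():.2f}")
```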

#Training and Evaluating the Models using Test Data with Object Oriented Programming

In [None]:
#Importing libraries from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Using OOP to create a 'Model' class that wraps a scikit-learn estimator.
class Model:
    def __init__(self, model, name):
        self.model = model
        self.name = name

    # Train the model on the training features (X_train) and labels (y_train).
    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    # Evaluate the model on the 20% test split, which it did not see during training.
    # y_pred holds the predictions made by the model.
    def evaluate(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        return accuracy, conf_matrix

    # Display the results.
    def display_results(self, accuracy, conf_matrix):
        print(f"Model: {self.name}")
        print(f"Accuracy: {accuracy:.2f}")
        print(f"Confusion Matrix:\n{conf_matrix}\n")


#Initialising the models

In [None]:
# Initialise the models, setting their parameters and preparing them for training and evaluation.
# The Logistic Regression, Random Forest, KNN and Decision Tree models are each initialised as an instance of the Model class, which gives an organised and consistent way to train and evaluate them.
logistic_regression_model = Model(LogisticRegression(random_state=42), "Logistic Regression")
random_forest_model = Model(RandomForestClassifier(random_state=42), "Random Forest")
knn_model = Model(KNeighborsClassifier(), "K-Nearest Neighbors")
decision_tree_model = Model(DecisionTreeClassifier(random_state=42), "Decision Tree")

#Logistic Regression
logistic_regression_model.train(X_train, y_train)
accuracy_lr, conf_matrix_lr = logistic_regression_model.evaluate(X_test, y_test)
logistic_regression_model.display_results(accuracy_lr, conf_matrix_lr)

#Random Forest
random_forest_model.train(X_train, y_train)
accuracy_rf, conf_matrix_rf = random_forest_model.evaluate(X_test, y_test)
random_forest_model.display_results(accuracy_rf, conf_matrix_rf)

#KNN
knn_model.train(X_train, y_train)
accuracy_knn, conf_matrix_knn = knn_model.evaluate(X_test, y_test)
knn_model.display_results(accuracy_knn, conf_matrix_knn)

#Decision Tree
decision_tree_model.train(X_train, y_train)
accuracy_dt, conf_matrix_dt = decision_tree_model.evaluate(X_test, y_test)
decision_tree_model.display_results(accuracy_dt, conf_matrix_dt)
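The four train, evaluate and display steps above follow the same pattern, so they could also be driven by a loop. The sketch below is one possible way to do that. It redefines a trimmed-down `Model` class and uses synthetic data from `make_classification` as a stand-in; both are assumptions for illustration, not the stroke dataset or the full class above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


class Model:
    """Trimmed-down version of the wrapper class used in this notebook."""

    def __init__(self, model, name):
        self.model = model
        self.name = name

    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    def evaluate(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)


# Synthetic, imbalanced stand-in data (an assumption, not the stroke dataset).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [
    Model(LogisticRegression(random_state=42), "Logistic Regression"),
    Model(RandomForestClassifier(random_state=42), "Random Forest"),
    Model(KNeighborsClassifier(), "K-Nearest Neighbors"),
    Model(DecisionTreeClassifier(random_state=42), "Decision Tree"),
]

# Train and evaluate every model with the same three lines.
accuracies = {}
for m in models:
    m.train(X_train, y_train)
    accuracy, conf_matrix = m.evaluate(X_test, y_test)
    accuracies[m.name] = accuracy
    print(f"{m.name}: accuracy={accuracy:.2f}")
```

Collecting the accuracies in a dictionary also makes it straightforward to build a results table afterwards.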

#Logistic Regression



The Logistic Regression model achieved an accuracy of 94%, meaning 94% of the test cases were predicted correctly.

The confusion matrix shows that the model correctly identified all 960 people without a stroke. However, it missed every stroke case: all 62 people who had a stroke were misclassified as not having one.

So while the model is good at predicting non-stroke cases (the majority class), it fails to identify actual stroke cases (the minority class). Its 94% accuracy is exactly what a model that always predicts "no stroke" would score, because 960 of the 1,022 test cases are non-stroke.
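The reading of the confusion matrix above can be checked with a few lines of arithmetic. This sketch hard-codes the Logistic Regression matrix from this notebook's output and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix reported for Logistic Regression in this notebook.
# Rows = actual class (0, 1), columns = predicted class (0, 1).
conf_matrix = np.array([[960, 0],
                        [62, 0]])

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = conf_matrix.ravel()
accuracy = (tp + tn) / conf_matrix.sum()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy:.4f}")
```

All of the accuracy comes from the true negatives; the true positive count is zero.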

#Random Forest

The Random Forest model, like Logistic Regression, achieved an accuracy of 94% on the test data, meaning 94% of the cases were predicted correctly.

Again, the confusion matrix shows that the model correctly identified all 960 people without a stroke, but it missed every stroke case and misclassified all 62 people who had a stroke as not having one.

This shows that the Random Forest model is also good at predicting non-stroke cases but fails to identify actual stroke cases.


#K-Nearest Neighbours

The K-Nearest Neighbours (KNN) model achieved an accuracy of 93% on the test data, meaning 93% of the cases were predicted correctly.

The confusion matrix shows that the model correctly predicted 955 of the 960 non-stroke cases and incorrectly predicted the other 5 as having a stroke. It also misclassified all 62 people who had a stroke as not having one, so none of the stroke cases were correctly identified.

This shows that the KNN model is fairly accurate at predicting non-stroke cases but fails to identify the minority class of stroke cases.

#Decision Tree

The Decision Tree model achieved an accuracy of 91% on the test data, meaning 91% of the cases were predicted correctly.

The confusion matrix showed that the model correctly identified 913 people without a stroke but misclassified 47 people as having a stroke when they did not. Unlike the other models, the Decision Tree model correctly identified 14 actual stroke cases but still misclassified 48 stroke cases as not having had a stroke.

Although the Decision Tree has the lowest accuracy of the four models at 91%, it is the only model that correctly predicted any stroke cases. This makes it currently the best model for predicting stroke cases, even though its accuracy on non-stroke cases is lower. However, it still missed 48 of the 62 stroke cases, so like the other models it struggles to identify the minority class.
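The comparison between the four models is easier to see when precision and recall for the stroke class are computed side by side. The sketch below uses the confusion matrices printed in this notebook's output; `stroke_metrics` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Confusion matrices as reported in this notebook's output
# (rows = actual class, columns = predicted class).
matrices = {
    "Logistic Regression": np.array([[960, 0], [62, 0]]),
    "Random Forest": np.array([[960, 0], [62, 0]]),
    "K-Nearest Neighbors": np.array([[955, 5], [62, 0]]),
    "Decision Tree": np.array([[913, 47], [48, 14]]),
}


def stroke_metrics(cm):
    """Return (precision, recall) for the stroke class (class 1)."""
    tn, fp, fn, tp = cm.ravel()
    predicted_stroke = tp + fp
    # Precision is undefined when no stroke cases were predicted; report 0.0 in that case.
    precision = tp / predicted_stroke if predicted_stroke else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)


results = {name: stroke_metrics(cm) for name, cm in matrices.items()}
for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Only the Decision Tree has a non-zero recall for the stroke class, which is why it ranks first here despite having the lowest accuracy.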


#Classification report

In [None]:
from sklearn.metrics import classification_report

# Make predictions on the test set for logistic regression
y_pred = logistic_regression_model.model.predict(X_test)
# Print the classification report
print(classification_report(y_test, y_pred))

# Make predictions on the test set for random forest
y_pred = random_forest_model.model.predict(X_test)
print(classification_report(y_test, y_pred))

# Make predictions on the test set for KNN
y_pred = knn_model.model.predict(X_test)
print(classification_report(y_test, y_pred))

# Make predictions on the test set for decision tree
y_pred = decision_tree_model.model.predict(X_test)
print(classification_report(y_test, y_pred))


The classification report provides an evaluation of the performance of the models. It includes precision, recall, f1-score and support.  

#**Logistic Regression and Random Forest:**#

**Class 0 (Non-Stroke)**
- Precision: 0.94. This means 94% of the cases the models predicted as non-stroke were actually non-stroke (960 of 1,022).
- Recall: 1.00. This means 100% of the actual non-stroke cases were identified by both models.
- F1-Score: 0.97. The F1 score balances precision and recall. A score of 0.97 is high, so the models are good at predicting non-stroke cases.
- Support: 960. This is the actual number of non-stroke cases within the test data.

**Class 1 (Stroke)**
- Precision: 0.00. The models did not predict any stroke cases, so there were no correct stroke predictions.
- Recall: 0.00. This means 0% of the actual stroke cases were identified by the models.
- F1-Score: 0.00. Precision and recall are both 0, so the F1 score is also 0.
- Support: 62. This is the actual number of stroke cases within the test data.


**Accuracy**
- The overall accuracy of both models is 0.94, which means they correctly predicted 94% of all test cases, stroke and non-stroke combined.


**Macro Average**
- Precision: 0.47, Recall: 0.50, F1-Score: 0.48
The macro average is the unweighted mean of each metric across the two classes. The models perform well on non-stroke cases (Class 0) but score 0 on stroke cases (Class 1), so the macro average is much lower than the weighted average. For example, macro precision is (0.94 + 0.00) / 2 = 0.47.


**Weighted Average**
- Precision: 0.88, Recall: 0.94, F1-Score: 0.91
The weighted average weights each class's score by its number of cases (support). Because the test data contains far more non-stroke cases (960) than stroke cases (62), the strong Class 0 scores dominate and pull the weighted averages up.
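The difference between the two averages can be reproduced by hand. This sketch derives the per-class F1 scores from the Logistic Regression / Random Forest confusion matrix reported in this notebook ([[960, 0], [62, 0]]) and then averages them both ways; the variable names are assumptions for illustration.

```python
# Support (number of actual cases) per class in the test data.
support = {0: 960, 1: 62}

# Per-class F1, using F1 = 2*TP / (2*TP + FP + FN) with each class treated as "positive" in turn.
# Class 0: TP=960, FP=62 (strokes predicted as non-stroke), FN=0.
# Class 1: TP=0, so F1 is 0.
f1 = {
    0: 2 * 960 / (2 * 960 + 62 + 0),
    1: 0.0,
}

# Macro average: plain mean over classes, every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class contributes in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"Macro F1:    {macro_f1:.2f}")
print(f"Weighted F1: {weighted_f1:.2f}")
```

The macro figure exposes the failure on the stroke class, while the weighted figure is dominated by the 960 non-stroke cases.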

**Warning**
- Running the code produces a warning (scikit-learn's UndefinedMetricWarning) for these two models. Because neither model predicted any stroke cases, calculating precision for Class 1 would mean dividing by zero, so the precision is undefined. scikit-learn reports it as 0 and raises the warning.
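The warning can be reproduced on a handful of labels. The sketch below uses toy labels (an assumption for illustration) in which the "model" never predicts class 1, and then shows scikit-learn's `zero_division` parameter, which states the fallback value explicitly. `classification_report` accepts the same parameter.

```python
import warnings
from sklearn.metrics import precision_score

# Toy labels (assumed for illustration): two stroke cases, but the
# "model" never predicts class 1, like Logistic Regression above.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]

# Default behaviour: precision is reported as 0.0 and a warning is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_precision = precision_score(y_true, y_pred)
warning_names = [w.category.__name__ for w in caught]

# Stating the fallback value explicitly gives the same score without the warning.
explicit_precision = precision_score(y_true, y_pred, zero_division=0)

print(default_precision, warning_names)
print(explicit_precision)
```

Silencing the warning does not change the result: a precision of 0 for the stroke class is still a sign that the model never predicts a stroke.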

#**K-Nearest Neighbours (KNN):**#

**Class 0 (Non-Stroke)**
- Precision: 0.94. This means 94% of the cases the model predicted as non-stroke were actually non-stroke.
- Recall: 0.99. This means 99% of the actual non-stroke cases were identified by the model.
- F1-Score: 0.97. The F1 score balances precision and recall. A score of 0.97 is high, which means the model is good at predicting non-stroke cases.
- Support: 960. This is the number of non-stroke cases within the test data.

**Class 1 (Stroke)**
- Precision: 0.00. The model made 5 stroke predictions and none of them were correct.
- Recall: 0.00. This means 0% of the actual stroke cases were identified by the model.
- F1-Score: 0.00. Precision and recall are both 0, so the F1 score is also 0.
- Support: 62. This is the actual number of stroke cases within the test data.

**Accuracy**

The overall accuracy of the KNN model is 0.93, which means it accurately predicted 93% of the total cases, which includes both stroke and non-stroke.

**Macro Average**

- Precision: 0.47, Recall: 0.50, F1-Score: 0.48.
The macro average is the unweighted average of precision, recall, and the F1-score. The model works well when predicting non-stroke cases (Class 0) but poorly when predicting stroke cases (Class 1), and as a result, the macro average is lower compared to the weighted averages.

**Weighted Average**

- Precision: 0.88, Recall: 0.93, F1-Score: 0.91.
The weighted average takes into account the number of cases in each class, stroke and non-stroke. The dataset consists of significantly more non-stroke cases than stroke cases, thus increasing the overall scores and making them higher.

**Warning**

The warning shown when the code is run does not come from the KNN report. KNN predicted 5 stroke cases (all incorrectly), so the precision for Class 1 is defined and equals 0 / 5 = 0. The warning comes from the Logistic Regression and Random Forest reports, where no stroke cases were predicted and calculating precision would mean dividing by zero.

#**Decision Tree:**#

**Class 0 (Non-Stroke)**

- Precision: 0.95. This means 95% of the cases the model predicted as non-stroke were actually non-stroke.
- Recall: 0.95. This result means 95% of the non-stroke cases were accurately identified by the model.
- F1-Score: 0.95. The F1 score balances precision and recall. A score of 0.95 is relatively high, meaning the model is effective at predicting non-stroke cases.
- Support: 960. This is the number of non-stroke cases within the test data.

**Class 1 (Stroke)**

- Precision: 0.23. This means 23% of the cases the model predicted as stroke were actual stroke cases (14 of 61).
- Recall: 0.23. This means 23% of the actual stroke cases were correctly identified by the model (14 of 62).
- F1-Score: 0.23. The F1 score balances precision and recall. A score of 0.23 shows that the model is still struggling to accurately predict stroke cases.
- Support: 62. This is the actual number of stroke cases within the test data.

**Accuracy**

The overall accuracy of the Decision Tree model is 0.91, meaning it accurately predicted 91% of the total cases, including both stroke and non-stroke.

**Macro Average**

- Precision: 0.59, Recall: 0.59, F1-Score: 0.59.
The macro average is the unweighted average of precision, recall, and the F1-score. The Decision Tree model has a more balanced performance across both classes compared to the other models, but it is still not very effective at predicting stroke cases.

**Weighted Average**

- Precision: 0.91, Recall: 0.91, F1-Score: 0.91.
The weighted average takes into account the number of cases in each class, stroke and non-stroke. Since there are significantly more non-stroke cases than stroke cases, the overall scores are higher, reflecting the model's strength in predicting the majority class.



In [4]:
#Importing libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Using OOP to create a 'Model' class that wraps a scikit-learn estimator.
class Model:
    def __init__(self, model, name):
        self.model = model
        self.name = name

    # Train the model on the training features (X_train) and labels (y_train).
    def train(self, X_train, y_train):
        self.model.fit(X_train, y_train)

    # Evaluate the model on the 20% test split, which it did not see during training.
    # y_pred holds the predictions made by the model.
    def evaluate(self, X_test, y_test):
        y_pred = self.model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        conf_matrix = confusion_matrix(y_test, y_pred)
        return accuracy, conf_matrix

    # Display the results.
    def display_results(self, accuracy, conf_matrix):
        print(f"Model: {self.name}")
        print(f"Accuracy: {accuracy:.2f}")
        print(f"Confusion Matrix:\n{conf_matrix}\n")


def main():
  """#Data Import"""
  #Importing data

  #Specify the path to your CSV file
  file_path = 'https://drive.google.com/uc?export=download&id=1IKUzSjd0qkuLdyT2iC5O8zbpXb-mK_fT'

  # Load the CSV file into a pandas DataFrame
  df = pd.read_csv(file_path)

  # Display the first few rows of the DataFrame
  print(df.head())

  """#Cleaning and Preprocessing the Dataset"""
  df = pd.read_csv(file_path)

  # Some bmi values are missing, so I replace them with the mean of the non-missing values.
  # The result is assigned back to the column because the chained
  # df['bmi'].fillna(..., inplace=True) form stops updating df in pandas 3.0.
  df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

  # Most machine learning models cannot use categorical variables directly.
  # One-hot encoding converts each categorical variable into binary (0 or 1) columns.
  # drop_first=True drops one column per variable, because it is implied by the others.
  df = pd.get_dummies(df, columns=['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status'], drop_first=True)

  # Scale the numerical features so they are on a similar scale and no single feature overpowers the others.
  scaler = StandardScaler()
  df[['age', 'avg_glucose_level', 'bmi']] = scaler.fit_transform(df[['age', 'avg_glucose_level', 'bmi']])


  """#Splitting the Data into Features (x) and Target (y)"""
  # Split the data into features (X) and the target variable (y).
  # The id column is dropped because it is only an identifier, and stroke is dropped from X because it is the target the models predict.
  X = df.drop(columns=['id', 'stroke'])
  y = df['stroke']

  """#Splitting Data into Training and Test Sets"""
  #80% of the data will be used to train the model and 20% of it to test the model.
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


  """#Training and Evaluating the Models using Test Data with Object Oriented Programming"""
  """#Initialising the models"""
  # Initialise the models, setting their parameters and preparing them for training and evaluation.
  # The Logistic Regression, Random Forest, KNN and Decision Tree models are each initialised as an instance of the Model class, which gives an organised and consistent way to train and evaluate them.
  logistic_regression_model = Model(LogisticRegression(random_state=42), "Logistic Regression")
  random_forest_model = Model(RandomForestClassifier(random_state=42), "Random Forest")
  knn_model = Model(KNeighborsClassifier(), "K-Nearest Neighbors")
  decision_tree_model = Model(DecisionTreeClassifier(random_state=42), "Decision Tree")

  #Logistic Regression
  logistic_regression_model.train(X_train, y_train)
  accuracy_lr, conf_matrix_lr = logistic_regression_model.evaluate(X_test, y_test)
  logistic_regression_model.display_results(accuracy_lr, conf_matrix_lr)

  #Random Forest
  random_forest_model.train(X_train, y_train)
  accuracy_rf, conf_matrix_rf = random_forest_model.evaluate(X_test, y_test)
  random_forest_model.display_results(accuracy_rf, conf_matrix_rf)

  #KNN
  knn_model.train(X_train, y_train)
  accuracy_knn, conf_matrix_knn = knn_model.evaluate(X_test, y_test)
  knn_model.display_results(accuracy_knn, conf_matrix_knn)

  #Decision Tree
  decision_tree_model.train(X_train, y_train)
  accuracy_dt, conf_matrix_dt = decision_tree_model.evaluate(X_test, y_test)
  decision_tree_model.display_results(accuracy_dt, conf_matrix_dt)

  print(f"Logistic Regression Accuracy: {accuracy_lr:.2f}")
  print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
  print(f"K-Nearest Neighbors Accuracy: {accuracy_knn:.2f}")
  print(f"Decision Tree Accuracy: {accuracy_dt:.2f}")

  """#Making a table to show the results of each model"""
  #Creating a dictionary to store the model names and their accuracies
  data = {
    'Machine Learning Model': ['Logistic Regression', 'Random Forest', 'K-Nearest Neighbors', 'Decision Tree'],
    'Optimal Accuracy': [accuracy_lr, accuracy_rf, accuracy_knn, accuracy_dt],
  }

  #Creating a pandas DataFrame from the dictionary
  df_results = pd.DataFrame(data)

  #Display the table
  #creating a new line before the df_results
  print()
  print(df_results)

main()

      id  gender   age  hypertension  heart_disease ever_married  \
0   9046    Male  67.0             0              1          Yes   
1  51676  Female  61.0             0              0          Yes   
2  31112    Male  80.0             0              1          Yes   
3  60182  Female  49.0             0              0          Yes   
4   1665  Female  79.0             1              0          Yes   

       work_type Residence_type  avg_glucose_level   bmi   smoking_status  \
0        Private          Urban             228.69  36.6  formerly smoked   
1  Self-employed          Rural             202.21   NaN     never smoked   
2        Private          Rural             105.92  32.5     never smoked   
3        Private          Urban             171.23  34.4           smokes   
4  Self-employed          Rural             174.12  24.0     never smoked   

   stroke  
0       1  
1       1  
2       1  
3       1  
4       1  



Model: Logistic Regression
Accuracy: 0.94
Confusion Matrix:
[[960   0]
 [ 62   0]]

Model: Random Forest
Accuracy: 0.94
Confusion Matrix:
[[960   0]
 [ 62   0]]

Model: K-Nearest Neighbors
Accuracy: 0.93
Confusion Matrix:
[[955   5]
 [ 62   0]]

Model: Decision Tree
Accuracy: 0.91
Confusion Matrix:
[[913  47]
 [ 48  14]]

Logistic Regression Accuracy: 0.94
Random Forest Accuracy: 0.94
K-Nearest Neighbors Accuracy: 0.93
Decision Tree Accuracy: 0.91

  Machine Learning Model  Optimal Accuracy
0    Logistic Regression          0.939335
1          Random Forest          0.939335
2    K-Nearest Neighbors          0.934442
3          Decision Tree          0.907045


#Evaluation

The objective of this project was to create a machine learning model that is able to accurately predict a stroke. I used four models, Logistic Regression, Random Forest, KNN, and Decision Tree to do this. These models were trained on a dataset with multiple health-related features to identify patterns that are likely to indicate a stroke.

Although all of the models achieved high accuracy, with Logistic Regression and Random Forest at 94%, the models struggled to identify the minority class of stroke cases. As a result, the models all had low performance for stroke detection.

Of all the models, I believe the Decision Tree is currently the best at predicting stroke cases. Although it has the lowest overall accuracy (91%) and is less accurate on non-stroke cases (95% recall, compared with 99% to 100% for the other models), it is the only model that correctly identified any stroke cases (14 of 62).

The dataset used is very unbalanced, with far more non-stroke cases than stroke cases, and the models could be improved by addressing this imbalance. Possible techniques include class weighting (giving the minority class a higher weight), SMOTE (creating synthetic examples of the minority class) and undersampling (reducing the number of majority-class examples).

If the training data had a more even number of stroke and non-stroke cases, the models would likely be better trained and therefore have better stroke detection performance.
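As a rough illustration of class weighting, the sketch below computes the weights that scikit-learn's "balanced" setting would assign for a label distribution like the one in the test data (960 non-stroke, 62 stroke). These counts are used only as a stand-in; in practice the weights would be derived from the training labels, and this is a sketch rather than a step this notebook actually runs.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels with the class counts seen in the test data (an assumption for illustration).
y = np.array([0] * 960 + [1] * 62)

# "balanced" weights are n_samples / (n_classes * count_of_class),
# so the rarer the class, the larger its weight.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

print(f"Weight for non-stroke (0): {weights[0]:.3f}")
print(f"Weight for stroke (1):     {weights[1]:.3f}")
```

Estimators such as `LogisticRegression`, `RandomForestClassifier` and `DecisionTreeClassifier` accept `class_weight='balanced'` directly, which applies this weighting during training so that mistakes on stroke cases cost more than mistakes on non-stroke cases.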