# <code style="background:#6c89cc; color:black">Project Details</code>

The objective of the project is to let students have a hands-on experience on machine 
learning application development, this will help the students to have a better 
understanding on the topics and algorithms learned. Specifically in problem formulation, 
data collection and processing, data analysis, experiment design, machine learning 
methods comparison, performance evaluation, and result analysis. Each group has its 
flexibility to choose its own problem and data.

# <code style="background:#6c89cc; color:black">Problem Background</code>

Diseases are something that plague human beings. The top 10 causes of death in the world are all diseases, not accidents. Out of these 10, 7 of them account for diseases that are noncommunicable. These means that they are diseases that are not transmitted from one person to another. These account for 68% of the top 10 causes of death. If we frame this context at a national level, Singapore shares the same sentiment. The top 10 causes of death are also due to diseases. One such disease that we wanted to explore is stroke as it is in the top 5 causes of death worldwide and in Singappore. By taking into account the age-old idiom, "Prevention is better than cure", we want to delve deep into the details to figure out how to improve one's health.
<br>
<br>
https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
<br>
https://www.moh.gov.sg/others/resources-and-statistics/principal-causes-of-death

# <code style="background:#6c89cc; color:black">Problem Statement</code>

This project aims to analyse and predict the probability of stroke based of the sourced dataset. They take into account lifestyle habits and medical history to aid with with the analysis of data as well as the training of models.

# <code style="background:#6c89cc; color:black">Data Collection</code>

The dataset used for this problem statement was sourced from Kaggle, a dataset hosting website. It consists of 11 variables, 10 of which are features and 1 of it being the target. The features variables consist of behavioural attributes as well as hereditary attributes that affect a person's probability of getting a stroke.

# <code style="background:#6c89cc; color:black">Load the Libraries</code>

In [21]:
# Load libraries
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV
# from flask import Flask, request, render_template
import joblib

import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

from sklearn import preprocessing

import numpy as np
import seaborn as sns

from matplotlib.pylab import seed

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
# ...and other regression models

# <code style="background:#6c89cc; color:black">Load the Dataset</code>

We will be using the variable "df" to denote the instances of the dataset. "df" will be called in the following cells and be used for different purposes.

In [22]:
df = pd.read_csv("../Datasets/stroke_data.csv")

In [23]:
df.shape

(40910, 11)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40910 entries, 0 to 40909
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sex                40907 non-null  float64
 1   age                40910 non-null  float64
 2   hypertension       40910 non-null  int64  
 3   heart_disease      40910 non-null  int64  
 4   ever_married       40910 non-null  int64  
 5   work_type          40910 non-null  int64  
 6   Residence_type     40910 non-null  int64  
 7   avg_glucose_level  40910 non-null  float64
 8   bmi                40910 non-null  float64
 9   smoking_status     40910 non-null  int64  
 10  stroke             40910 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 3.4 MB


# <code style="background:#6c89cc; color:black">Data Exploration</code>

In [25]:
#Assessing for missing values
df.isnull().sum()

sex                  3
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [26]:
#Evaluating number of unique values in each column
unique_vals=[]

for col in df.columns:
    unival=df[col].nunique()
    unique_vals.append(unival)

#Presenting the findings using a dataframe
pd.DataFrame(unique_vals,columns=['Unique_Values'],index=df.columns)

Unnamed: 0,Unique_Values
sex,2
age,111
hypertension,2
heart_disease,2
ever_married,2
work_type,5
Residence_type,2
avg_glucose_level,2903
bmi,370
smoking_status,2


In [27]:
df.columns

Index(['sex', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [28]:
cols=['sex', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke']

# <code style="background:#6c89cc; color:black">Feature Engineering</code>

It is important to conduct feature engineering as we want to ensure that the data is well prepped to be used for the training models. It consists of handling missing values, encoding categorial variables and feature scaling to name a few.

Here, we drop the columns with null values and check to ensure that they have been dropped.

In [29]:
df = df.dropna()

In [30]:
#Assessing for missing values
df.isnull().sum()

sex                  0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

In [31]:
df.shape

(40907, 11)

# <code style="background:#6c89cc; color:black">Creating Models</code>

The next step is to create the X and Y arrays. These will then further be split into two more arrays, with each having a "train" and "test" set.

In [32]:
X = df[['sex', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status']].values
y = df['stroke'].values

accuracies = {}

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [34]:
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

In [35]:
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [36]:
X_train_tensor = torch.FloatTensor(X_train)
y_train_tensor = torch.LongTensor(y_train)

X_test_tensor = torch.FloatTensor(X_test)
y_test_tensor = torch.LongTensor(y_test)

In [37]:
from torch.utils.data import TensorDataset, DataLoader

# Create TensorDatasets
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

# Create DataLoader
batch_size = 8
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

In [38]:
# Properties of train and test datasets
print("Properties of train/test datasets")
print("Train dataset length:", len(train_dataset))
print("First element of train dataset:", train_dataset[0])
print("Test dataset length:", len(test_dataset))
print("First element of test dataset:", test_dataset[0])

# Properties of train and test loader
print("\n Properties of loaders")
print("Number of batches in train loader:", len(train_loader))
print("Batch size in train loader:", train_loader.batch_size)
print("Number of batches in test loader:", len(test_loader))
print("Batch size in test loader:", test_loader.batch_size)

Properties of train/test datasets
Train dataset length: 27407
First element of train dataset: (tensor([-1.1240,  1.0927, -0.5221, -0.3813,  0.4644,  0.6891,  0.9688, -0.7199,
         0.0705, -0.9752]), tensor(0))
Test dataset length: 13500
First element of test dataset: (tensor([-1.1240, -0.9938, -0.5221, -0.3813, -2.1531, -1.8677, -1.0322, -0.0179,
        -0.7185,  1.0255]), tensor(0))

 Properties of loaders
Number of batches in train loader: 3426
Batch size in train loader: 8
Number of batches in test loader: 1688
Batch size in test loader: 8


In [58]:
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define model with proper input/output dimensions
class BinaryClassificationMLP(nn.Module):
    def __init__(self, input_size):
        super(BinaryClassificationMLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()  # Outputs probability between 0-1
        )
    
    def forward(self, x):
        return self.layers(x)

# Use binary cross-entropy loss
criterion = nn.BCELoss()

# Create a new model with correct input size
model = BinaryClassificationMLP().to(device)

TypeError: BinaryClassificationMLP.__init__() missing 1 required positional argument: 'input_size'

In [56]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Properties of optimizer and criterion
print("Properties of optimizer and criterion:")
print("Optimizer:", optimizer)
print("Criterion:", criterion)

Properties of optimizer and criterion:
Optimizer: SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    fused: None
    lr: 0.01
    maximize: False
    momentum: 0
    nesterov: False
    weight_decay: 0
)
Criterion: CrossEntropyLoss()


In [57]:
# Training loop
num_epochs = 100
logs = []

# Enable debug info to check shapes
print(f"First batch shape check:")
for inputs, labels in train_loader:
    print(f"Input shape: {inputs.shape}, Labels shape: {labels.shape}")
    break

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)  # Make sure model is on the right device

for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, labels in train_loader:
        # Move tensors to the correct device and ensure correct types
        inputs = inputs.to(device).float()
        labels = labels.to(device).float().view(-1, 1)  # Reshape labels for regression task
        
        # Forward pass
        outputs = model(inputs)  # Compute model predictions
        
        # Make sure output and label dimensions match
        if outputs.shape != labels.shape:
            outputs = outputs.view(labels.shape)
            
        loss = criterion(outputs, labels)  # Compute the loss

        # Backward pass and optimization
        optimizer.zero_grad()  # Clear gradients
        loss.backward()  # Compute gradients
        optimizer.step()  # Update model parameters
        
        running_loss += loss.item()
        
    # Calculate average loss for the epoch
    epoch_loss = running_loss / len(train_loader)
    
    if epoch % 20 == 0:
        # Log progress every 20 epochs
        logs.append(f"Epoch {epoch}, Loss: {epoch_loss:.4f}")
        print(f"Epoch {epoch}, Loss: {epoch_loss:.4f}")

First batch shape check:
Input shape: torch.Size([8, 10]), Labels shape: torch.Size([8])
Epoch 0, Loss: 0.0000
Epoch 20, Loss: 0.0000
Epoch 40, Loss: 0.0000
Epoch 60, Loss: 0.0000
Epoch 80, Loss: 0.0000


In [59]:
# Set the model to evaluation mode
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        # Move tensors to device and ensure correct formats
        inputs = inputs.to(device).float()
        labels = labels.to(device)  # For classification, keep as integers
        
        # Set the forward pass
        outputs = model(inputs)
        
        # For regression tasks, compare if predictions are close to targets
        # For classification tasks, get the predicted class
        if outputs.shape != labels.shape:
            if outputs.dim() > 1 and outputs.size(1) > 1:
                # Classification task with multiple output classes
                _, predicted = torch.max(outputs, 1)
                correct += (predicted == labels).sum().item()
            else:
                # Regression task
                outputs = outputs.view(-1)
                predicted = outputs.round()  # Round to nearest integer for evaluation
                correct += (predicted == labels).sum().item()
        else:
            # Direct comparison
            predicted = outputs.round()
            correct += (predicted == labels).sum().item()
            
        # Get the number of observations in the batch
        total += labels.size(0)

mlp_accuracy = correct / total
print(f'Test Accuracy: {mlp_accuracy * 100:.2f}%')

Test Accuracy: 49.96%
