#### GROUP 17: Shuvojyoti Singha, Artjom Smorgulenko, Kaan Özkiliç, Samuel Pasierb

# Stroke Prediction using Support Vector Machine (SVM)

## Background
This notebook uses a support vector machine (SVM) to classify records in a stroke prediction dataset. The dataset contains patient information such as gender, glucose level, and other health-related attributes; the aim is to predict whether a patient is likely to have a stroke, using a model trained on this data.

This notebook file evaluates how different TODO: add

# Constants
Column names are defined as constants for readability and to avoid repetition.

In [1]:
ID = "ID"
GENDER = "Gender"
AGE = "Age"
HYPERTENSION = "Hypertension"
HEART_DISEASE = "Heart Disease"
EVER_MARRIED = "Ever Married"
WORK_TYPE = "Work Type"
RESIDENCE_TYPE = "Residence Type"
AVG_GLUCOSE_LEVEL = "Average Glucose Level"
BMI = "BMI"
SMOKING_STATUS = "Smoking Status"
STROKE = "Stroke"

# Import

This notebook uses the following libraries:
- **pandas**: for data handling,
- **numpy**: for numerical data handling,
- **matplotlib**: for plotting graphs,
- **sklearn**: for the SVM implementation, model evaluation, and the train/test split.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from matplotlib import pyplot as plt

# Loading Dataset

The dataset is downloaded from Kaggle. It includes a number of columns that are relevant to stroke prediction.

*Link to [the dataset on Kaggle](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).*

In [3]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


# Data Cleaning

Missing data is handled and categorical values are converted to numerical values.

- Missing values are filled with 0.
- The ID and work type columns are dropped for convenience.
- Categorical values are replaced with integers.

In [4]:
df.fillna(0, inplace=True)
df.isnull().sum()

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64
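
Because `fillna(0)` runs before the check above, the output shows only zeros. Counting missing values before filling shows how many entries are affected. The following is a minimal sketch on a small hypothetical frame (`toy` and its values are made up for illustration, modelled on the preview above where the second row has no `bmi` value):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are illustrative only.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Count missing entries per column before filling.
missing_before = toy.isnull().sum()
print(missing_before)

# Same strategy as the notebook: replace every missing value with 0.
filled = toy.fillna(0)
missing_after = filled.isnull().sum()
print(missing_after)
```

Knowing the count beforehand makes it easier to judge how much a placeholder such as a BMI of 0 might influence the model.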

In [5]:
# Drop the unused columns; errors="ignore" keeps the cell safe to re-run.
df = df.drop(columns=["id", "work_type"], errors="ignore")
print(df.columns)
# Rename the remaining columns using the constants defined above.
df.columns = [GENDER, AGE, HYPERTENSION, HEART_DISEASE, EVER_MARRIED, RESIDENCE_TYPE, AVG_GLUCOSE_LEVEL, BMI, SMOKING_STATUS, STROKE]
df.head()

Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status',
       'stroke'],
      dtype='object')


Unnamed: 0,Gender,Age,Hypertension,Heart Disease,Ever Married,Residence Type,Average Glucose Level,BMI,Smoking Status,Stroke
0,Male,67.0,0,1,Yes,Urban,228.69,36.6,formerly smoked,1
1,Female,61.0,0,0,Yes,Rural,202.21,0.0,never smoked,1
2,Male,80.0,0,1,Yes,Rural,105.92,32.5,never smoked,1
3,Female,49.0,0,0,Yes,Urban,171.23,34.4,smokes,1
4,Female,79.0,1,0,Yes,Rural,174.12,24.0,never smoked,1


In [6]:
# Integer codes for the categorical columns:
# gender: male = 0, female = 1, other = 2
# ever married: no = 0, yes = 1
# residence type: rural = 0, urban = 1
# smoking status: formerly smoked = 0, never smoked = 1, smokes = 2, unknown = 3
# (work type is not encoded because the column was dropped above)

df[GENDER] = df[GENDER].replace({"Male": 0, "Female": 1, "Other": 2}).astype(int)
df[EVER_MARRIED] = df[EVER_MARRIED].replace({"Yes": 1, "No": 0}).astype(int)
df[RESIDENCE_TYPE] = df[RESIDENCE_TYPE].replace({"Rural": 0, "Urban": 1}).astype(int)
df[SMOKING_STATUS] = df[SMOKING_STATUS].replace({"formerly smoked": 0, "never smoked": 1, "smokes": 2, "Unknown": 3}).astype(int)
df.head()



Unnamed: 0,Gender,Age,Hypertension,Heart Disease,Ever Married,Residence Type,Average Glucose Level,BMI,Smoking Status,Stroke
0,0,67.0,0,1,1,1,228.69,36.6,0,1
1,1,61.0,0,0,1,0,202.21,0.0,1,1
2,0,80.0,0,1,1,0,105.92,32.5,1,1
3,1,49.0,0,0,1,1,171.23,34.4,2,1
4,1,79.0,1,0,1,0,174.12,24.0,1,1
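
The encoding step can also be sketched with `Series.map`, which is a swapped-in alternative to the `replace` call used in the cell above, not the notebook's own method. The frame `toy` and the names `gender_codes` and `smoking_codes` are hypothetical; the labels and integer codes are taken from the cell above.

```python
import pandas as pd

# Hypothetical sample using category labels that appear in the dataset preview.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Smoking Status": ["formerly smoked", "never smoked", "smokes"],
})

# Same integer codes as in the notebook cell.
gender_codes = {"Male": 0, "Female": 1, "Other": 2}
smoking_codes = {"formerly smoked": 0, "never smoked": 1, "smokes": 2, "Unknown": 3}

encoded = toy.copy()
# map() turns any label missing from the dictionary into NaN,
# so unexpected categories can be detected before casting to int.
encoded["Gender"] = toy["Gender"].map(gender_codes)
encoded["Smoking Status"] = toy["Smoking Status"].map(smoking_codes)
assert not encoded.isnull().any().any(), "found a label without a code"

encoded = encoded.astype(int)
print(encoded)
```

Note that these codes impose an arbitrary order on unordered categories such as smoking status; the sketch keeps the same codes as the notebook for consistency.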


# Separating Data and Target

The input data is separated from the target values: `df_data` holds the input features and `df_target` holds the label indicating whether a stroke occurred.

In [None]:
df_data = df.iloc[:, :-1]
df_target = df.iloc[:, -1]
print(df_data)
print(df_target)
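
The positional split above relies on the target being the last column. As a sketch under that assumption, the same separation can be written by column name, which does not depend on column order. The frame `toy` and its values are hypothetical; `STROKE` mirrors the constant defined earlier.

```python
import pandas as pd

STROKE = "Stroke"

# Hypothetical three-row frame with the target as the last column.
toy = pd.DataFrame({
    "Age": [67.0, 61.0, 80.0],
    "Hypertension": [0, 0, 1],
    STROKE: [1, 1, 0],
})

# Positional split, as in the notebook cell.
data_by_position = toy.iloc[:, :-1]
target_by_position = toy.iloc[:, -1]

# Name-based split; does not depend on column order.
data_by_name = toy.drop(columns=[STROKE])
target_by_name = toy[STROKE]

print(data_by_name.equals(data_by_position))
print(target_by_name.equals(target_by_position))
```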

# Splitting into Training and Testing Sets

The dataset is split into training and testing sets. This cell uses an 80/20 split.

In [None]:
data_train, data_test, target_train, target_test = train_test_split(df_data, df_target, test_size=0.2, shuffle=True, random_state=0)
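
To check what an 80/20 split produces, the sizes and class ratios of both parts can be inspected. The sketch below uses synthetic stand-in data (`X`, `y`, and their values are made up). It also passes `stratify=y`, which is an optional addition not used in the cell above; it is shown here only because it keeps the class ratio equal in both parts when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 10 positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 10 + [0] * 90)

# 80/20 split; stratify=y is an assumption of this sketch, not part of the notebook cell.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0, stratify=y
)

# Sizes of the two parts and the share of positive labels in each.
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```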