# Stroke Prediction using k-Nearest Neighbors (kNN)

## Background
This notebook applies the k-nearest neighbors (kNN) algorithm to a stroke prediction dataset. The dataset contains patient information such as gender, glucose level and other health-related attributes; the aim is to predict whether a patient is likely to have a stroke, based on the training data.

It evaluates how different values of k, the training/testing ratio and k-fold cross-validation change the accuracy of the prediction.

# Constants
Constants for the column names are defined for readability and to avoid repetition.

In [None]:
ID = "ID"
GENDER = "Gender"
AGE = "Age"
HYPERTENSION = "Hypertension"
HEART_DISEASE = "Heart Disease"
EVER_MARRIED = "Ever Married"
WORK_TYPE = "Work Type"
RESIDENCE_TYPE = "Residence Type"
AVG_GLUCOSE_LEVEL = "Average Glucose Level"
BMI = "BMI"
SMOKING_STATUS = "Smoking Status"
STROKE = "Stroke"

# Import

This notebook uses the following libraries:
- **pandas**: for loading and handling the data,
- **numpy**: for numerical operations such as averaging accuracies,
- **matplotlib**: for plotting graphs,
- **sklearn**: for the kNN algorithm, model evaluation and the training-testing split.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from matplotlib import pyplot as plt

# Loading Dataset

The dataset is downloaded from Kaggle. It includes a number of columns that are relevant for predicting stroke.

*Link to [the dataset on Kaggle](https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset).*

In [None]:
df = pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

# Data Cleaning

Missing data is handled and categorical values are converted to numerical values.

- Missing values are filled with 0.
- The `id` column (a row identifier) and the `work_type` column are dropped for convenience and are not used as features.
- Categorical values are replaced with integers.
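
The cleaning steps above can be sketched on a tiny, made-up table. The column names mirror the raw CSV headers, but the rows and the `toy` and `gender_codes` names are assumptions for illustration only. The sketch uses `Series.map`, which turns unmapped labels into NaN, whereas the notebook uses `replace`.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic two columns of the dataset (values are made up).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "bmi": [28.1, np.nan, 31.4],
})

# Fill the missing numeric value with 0, as the notebook does.
toy["bmi"] = toy["bmi"].fillna(0)

# Map each category to an integer code; an unmapped label would become NaN.
gender_codes = {"Male": 0, "Female": 1, "Other": 2}
toy["gender"] = toy["gender"].map(gender_codes).astype(int)

print(toy)
```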

In [None]:
df.fillna(0, inplace=True)
df.isnull().sum()

In [None]:
# Drop the unused columns; errors="ignore" keeps the cell safe to re-run
df = df.drop(columns=["id", "work_type"], errors="ignore")
print(df.columns)
df.columns = [GENDER, AGE, HYPERTENSION, HEART_DISEASE, EVER_MARRIED, RESIDENCE_TYPE, AVG_GLUCOSE_LEVEL, BMI, SMOKING_STATUS, STROKE]
df.head()

In [None]:
# Integer codes used for the categorical columns:
# gender: male = 0, female = 1, other = 2
# ever married: yes = 1, no = 0
# residence type: rural = 0, urban = 1
# smoking status: formerly smoked = 0, never smoked = 1, smokes = 2, unknown = 3
# (work type was dropped above, so it needs no mapping)

df[GENDER] = df[GENDER].replace({"Male": 0, "Female": 1, "Other": 2}).astype(int)
df[EVER_MARRIED] = df[EVER_MARRIED].replace({"Yes": 1, "No": 0}).astype(int)
df[RESIDENCE_TYPE] = df[RESIDENCE_TYPE].replace({"Rural": 0, "Urban": 1}).astype(int)
df[SMOKING_STATUS] = df[SMOKING_STATUS].replace({"formerly smoked": 0, "never smoked": 1, "smokes": 2, "Unknown": 3}).astype(int)
df.head()

# Separating Data and Target

The input data is separated from the target values. `df_data` holds the input features and `df_target` holds the label indicating whether a stroke occurred.

In [None]:
df_data = df.iloc[:, :-1]
df_target = df.iloc[:, -1]
df_data

In [None]:
df_target

# Splitting into Training and Testing Sets

The dataset is split into training and testing sets. In this cell, an 80/20 split is used.

In [None]:
data_train, data_test, target_train, target_test = train_test_split(df_data, df_target, test_size=0.2, shuffle=True, random_state=0)
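
As a rough illustration of what `test_size=0.2` does, the sketch below splits ten made-up samples and checks the resulting sizes. The arrays `X` and `y` are hypothetical stand-ins for `df_data` and `df_target`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus made-up labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# 20% of 10 samples go to the test set, the rest to the training set.
print(len(X_train), len(X_test))
```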

# k-NN Classifier

kNN with k = 3 is used to predict stroke occurrences. The model is trained on the training set and makes predictions for the test set.

In [None]:
K = 3
knn = KNeighborsClassifier(K)
knn.fit(data_train, target_train)
target_pred = knn.predict(data_test)
print(target_pred)
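
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of kNN with Euclidean distance and a majority vote. It is not how sklearn implements the algorithm internally; the `knn_predict` helper and the data points are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (age, average glucose level) pairs with made-up labels.
train_points = [(30, 85), (35, 90), (40, 95), (70, 200), (75, 210)]
train_labels = [0, 0, 0, 1, 1]

# The two closest points carry label 1 and the third closest carries label 0.
print(knn_predict(train_points, train_labels, query=(72, 205), k=3))  # 1
```

Because the vote depends on raw distances, features with large numeric ranges (glucose level in this toy example) dominate the result.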

# Evaluating the Model's Accuracy

In this part, `accuracy_score` from the sklearn library measures how well the model predicted stroke cases: it returns the fraction of test samples whose predicted label matches the true label.

In [None]:
accuracy_score(target_test, target_pred)
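
Accuracy is simply the number of correct predictions divided by the number of samples. The sketch below computes it by hand on made-up labels; the `accuracy` helper is hypothetical and stands in for `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels: 2 positives out of 10 samples.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
# A model that always predicts "no stroke".
y_pred = [0] * 10

print(accuracy(y_true, y_pred))  # 0.8
```

The example also shows that when positive cases are rare, a model that never predicts a stroke can still reach a high accuracy, so the score is worth reading with the class balance in mind.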

# Parameter Experimentation

To find the best value for k, values from 1 to 20 are tested. For each value, the model is trained and its accuracy on the test set is recorded; the results are then sorted by accuracy to find the best-performing k.

The results are visualized with a plot.

In [None]:
k_values = {}

for i in range(1, 21):
    knn = KNeighborsClassifier(i)
    knn.fit(data_train, target_train)
    target_pred = knn.predict(data_test)
    k_values[i] = accuracy_score(target_test, target_pred)

# Sort the accuracies from the best to the worst
sorted_outcome = sorted(k_values.items(), key=lambda x: x[1], reverse=True)
BEST_K_VALUE = sorted_outcome[0][0]

# Get 3 highest accuracies for different k_values
print(sorted_outcome[:3])

# Get 3 lowest accuracies for different k_values
print(sorted_outcome[-3:])

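# Bin edges from 0.90 to 0.94, for an optional table of accuracy counts
bins = [x / 100 for x in range(90, 95)]

# Optional: count how many k values fall into each accuracy bin
# accuracies = pd.Series(list(k_values.values()))
# print(pd.crosstab(pd.cut(accuracies, bins=bins), "Count"))
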
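# Bars are ordered from the most accurate to the least accurate k value
labels = [str(k_value) for k_value, _ in sorted_outcome]
scores = [acc for _, acc in sorted_outcome]
plt.bar(labels, scores, linewidth=3)
plt.ylim(bottom=0.9, top=0.95)
plt.title("Accuracy depending on the K value")
plt.xlabel("K value")
# The values are fractions between 0 and 1, not percentages
plt.ylabel("Accuracy")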
plt.show()
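
Picking the best k from a dictionary of accuracies can also be done with `max`. The sketch below uses made-up accuracies (the `k_accuracy` name and its values are assumptions) and shows that, on a tie, both `max` and the stable reverse sort used above return the k value that was inserted first.

```python
# Made-up accuracies per k value; k = 3 and k = 5 are tied for the best score.
k_accuracy = {1: 0.91, 3: 0.94, 5: 0.94, 7: 0.93}

# max() returns the first key with the highest accuracy.
best_k = max(k_accuracy, key=k_accuracy.get)

# Sorting in descending order is stable, so tied entries keep insertion order.
best_k_sorted = sorted(k_accuracy.items(), key=lambda x: x[1], reverse=True)[0][0]

print(best_k, best_k_sorted)  # 3 3
```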

# Train-Test Split Analysis

To see how the test size affects the accuracy, a split analysis is conducted.

For test sizes from 20% to 50% (in steps of 1%), the model is trained and tested using the best k value found above. The results are plotted.

In [None]:
knn = KNeighborsClassifier(BEST_K_VALUE)
accuracies = []

for i in range(20, 51):
    data_train, data_test, target_train, target_test = train_test_split(df_data, df_target, test_size=i / 100, shuffle=True, random_state=0)
    knn.fit(data_train, target_train)
    target_pred = knn.predict(data_test)
    accuracies.append(accuracy_score(target_test, target_pred))

# Convert the accuracies from fractions to percentages for the plot
plt.plot(range(20, 51), [acc * 100 for acc in accuracies], linewidth=3)
plt.title("Accuracy change depending on the train/test split")
plt.ylabel("Accuracy (%)")
plt.xlabel("Test size (%)")
plt.show()

# K-Fold Cross Validation

k-fold cross-validation is applied to measure the performance of the model.

In this case, the dataset is divided into 5 folds; the model is trained and tested 5 times, each time using a different fold as the test set and the remaining folds as the training set.

In [None]:
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
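
To see what the splitter produces, the sketch below runs `KFold` on ten made-up samples and prints the indices that land in each test fold. The `X` array and the `kf_demo` name are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples with a single feature.
X = np.arange(10).reshape(10, 1)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X), start=1):
    # Each fold holds out 2 of the 10 samples and trains on the other 8.
    print(f"fold {fold}: train size={len(train_idx)}, test={sorted(test_idx.tolist())}")
    test_indices.extend(test_idx.tolist())
```

Across the five folds, every sample appears in a test set exactly once.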

# Train and Evaluate the Model with K-Fold Cross-Validation

In [None]:
k_neighbors = 3
accuracies = []

for train_index, test_index in kf.split(df_data):
    data_train, data_test = df_data.iloc[train_index], df_data.iloc[test_index]
    target_train, target_test = df_target.iloc[train_index], df_target.iloc[test_index]

    knn = KNeighborsClassifier(n_neighbors=k_neighbors)
    knn.fit(data_train, target_train)

    target_pred = knn.predict(data_test)

    accuracy = accuracy_score(target_test, target_pred)
    accuracies.append(accuracy)

average_accuracy = np.mean(accuracies)
print(average_accuracy)
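
The manual loop above can be written more compactly with sklearn's `cross_val_score`. The sketch below is an alternative formulation on randomly generated data, not the notebook's dataset; `X`, `y`, `kf_demo` and `scores` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Randomly generated stand-in data: 100 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; a fresh copy of the model is fitted for each fold.
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=3), X, y, cv=kf_demo, scoring="accuracy"
)

print(scores.mean())
```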

## Use Case Identification

1. In medicine, kNN can be used to estimate a patient's risk of a disease such as stroke by comparing the new patient's features (blood pressure, age, habits, etc.) with those of similar past patients.

2. In business, companies such as online marketplaces or banks can use kNN to estimate a new customer's income range, credit or loan risk, average spending, etc. by comparing the customer with the profiles of past customers.
