# Hands On Workshop - Big Data in Healthcare 8400
### Hadas Volkov
## Predicting Diabetes - A Machine Learning Approach

In 2018, about 10.5% of Americans were estimated to have diabetes. Furthermore, about one-fifth of those cases were undiagnosed. Early detection is key in diabetes because early treatment can prevent serious complications. When a problem with blood sugar is found, doctors and patients can take steps to prevent permanent damage to the heart, kidneys, eyes, nerves, blood vessels, and other vital organs. </br>
A patient must undergo several tests and be checked for multiple factors before being diagnosed with diabetes. This long process makes it difficult for doctors to keep track of the results and can lead to inaccurate conclusions, which makes detection challenging. Thanks to recent advances in machine learning algorithms, it is now possible to predict the disease quickly and accurately in candidate patients.

### About the Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. [Pima Indians Diabetes Database on Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download)

### Data Analysis
The platform chosen for this exercise is a Jupyter notebook. For data scientists, notebooks are a crucial tool. Notebooks are a form of interactive computing, in which users write and execute code, visualize the results, and share insights. Typically, data scientists use notebooks for experiments and exploration tasks. </br>
You are not expected to fully understand the code; it is here if you’d like to dive deeper. This form of presentation allows me to integrate the computing environment with the text and to show you how a data scientist works. </br>
You are asked to follow the notebook and execute each code block by highlighting the block and using the ‘play’ button above, or use the keyboard shortcut ‘shift+enter’ to execute.


#### Python and Python packages
The Python programming language has dominated the field of machine learning in recent years. There are two main reasons. First, Python is relatively simple for non-coders to pick up and has an intuitive syntax. Second, and more importantly, there is an abundance of packages available to Python users, especially data scientists. A Python package is a collection of code written in Python that offers some specific functionality to the user. For example, the ‘pandas’ package makes it easy to handle csv and text files, and ‘scikit-learn’ implements most common machine learning algorithms, making it easy for us to quickly test a variety of methods on our data.


In the code block below we’ll import some packages and functions for our analysis. Please execute the block before moving forward.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.plotting import plot_decision_regions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from scipy.stats import pearsonr
from sklearn.svm import SVC

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Preprocessing

We will start by importing our dataset into our environment and saving it under the name ‘df’ (short for dataframe). The data is kept in a csv file named ‘diabetes.csv’ in the same directory as this notebook. </br>
The command ‘df.head()’ will display the first five rows of the file.

In [None]:
#importing dataset
df = pd.read_csv('diabetes.csv') 
df.head()

The dataset contains 768 observations with eight feature variables and one target variable. Before starting to analyze the data and draw any conclusions, it is essential to check for missing values. The simplest way to do so is to use the 'df.info()' function, which lists the column names along with the number of non-null values and the type of data in each column.

In [None]:
# Get information on the dataset
df.info()

Five features in the data contain null values: Glucose, BloodPressure, SkinThickness, Insulin and BMI. A null value is a special marker indicating that a data value does not exist in the database. In other words, it is just a placeholder for values that are missing or that we do not know. </br>
We can perform a quick calculation and print the percentage of missing values for each feature.

In [None]:
# Print the number and percentage of missing values for each column
# The header uses the same column widths as the rows so that the table lines up
print(f"{'Column':<50}{'Total missing values':<30}{'% of missing values'}")
for col in df.columns:
    n_missing = df[col].isnull().sum()
    print(f"{col:<50}{n_missing:<30}{n_missing * 100 / df.shape[0]:.2f}")

Next, we can try and discard rows containing these values.

In [None]:
# Drop all rows containing null values and get information on the resulting dataset
df.dropna().info()

Removing all rows containing null values significantly reduces our dataset, to only 392 entries. Instead, we can perform mean imputation (also called mean substitution), which replaces the missing values of a variable with the mean of its non-missing values.

In [None]:
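# Mean imputation on columns containing null values
for col in ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']:
    # Assign the result back; fillna(inplace=True) on df[col] may not update df in newer pandas versions
    df[col] = df[col].fillna(df[col].mean())
df.info()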
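
Note that the cell above computes each mean over the full dataset, so rows that later end up in the test set also influence the imputed values. The following is a minimal sketch of one alternative, using a made-up toy array and hypothetical names (`train`, `test`, `impute_with`) that are not part of this notebook: the means are computed on the training rows only and then reused for the test rows.

```python
import numpy as np

# Toy data, made up for illustration: rows are patients, columns are two features.
train = np.array([[100.0, 30.0],
                  [np.nan, 34.0],
                  [140.0, np.nan]])
test = np.array([[np.nan, 28.0]])

# Column means computed from the training rows only, ignoring missing values.
train_means = np.nanmean(train, axis=0)

def impute_with(values, means):
    """Return a copy of `values` with each NaN replaced by the mean of its column."""
    filled = values.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = means[cols]
    return filled

train_filled = impute_with(train, train_means)
test_filled = impute_with(test, train_means)  # the test rows reuse the training means
print(train_filled)
print(test_filled)
```

For a dataset of this size the difference is usually small, but keeping the test rows out of every fitted statistic makes the evaluation easier to trust.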

### Data Visualization

To get an initial ‘feel’ for the data, we can plot a few visualizations.

In [None]:
# Boxplots
plt.figure(figsize=(18, 6), dpi=80)
sns.boxplot(data=df, orient="h",
            palette="Set1")
plt.show()

In [None]:
# Heatmap correlation
sns.heatmap(df.corr())
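
By default, `df.corr()` computes the Pearson correlation coefficient for every pair of columns. As a rough sketch of what a single cell of the heatmap represents, the coefficient can be computed by hand. The function name `pearson_r` and the values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

# Made-up values for five hypothetical patients.
glucose = [85, 90, 120, 150, 170]
outcome = [0, 0, 0, 1, 1]

r = pearson_r(glucose, outcome)
print(round(r, 3))
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.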

In [None]:
# Pairwise relationships between features, colored by outcome
graph = sns.PairGrid(df, hue='Outcome')
# Type of graph for diagonal
graph = graph.map_diag(plt.hist)
# Type of graph for non-diagonal
graph = graph.map_offdiag(plt.scatter)
graph = graph.add_legend()

#This might take a few seconds

Do you see any interesting relationships in the data?

### Classification

To proceed with training a classifier, we need to separate our dataset into features and targets. By convention, features are denoted with an ‘X’ and target variables with a ‘y’.

In [None]:
X = df.drop('Outcome',axis=1)
y = df['Outcome']
sns.countplot(x="Outcome", data=df)

Next, we will split the dataset into training and testing groups. We will follow common practice and use 80% of the dataset as the training set and 20% as the test set.

In [None]:
# Randomly split the dataset to train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2,random_state=0)
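
Conceptually, a random split shuffles the row indices and then cuts them into two groups. The sketch below illustrates the idea with NumPy only. The function `split_indices` is hypothetical, and it will not reproduce the exact rows selected by `train_test_split`.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle the row indices and cut them into a training part and a test part."""
    rng = np.random.default_rng(seed)      # fixed seed so the split is repeatable
    shuffled = rng.permutation(n_rows)
    n_test = int(np.ceil(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(768)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as `random_state=0` in the cell above: running the split again gives the same two groups.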

Before applying a classification algorithm, we need to scale the feature variables of our dataset. We will rescale the values of each feature so that their mean is 0 and their standard deviation is 1. </br>
Why is this scaling important?

In [None]:
scaling_x = StandardScaler()
X_train = scaling_x.fit_transform(X_train)
X_test = scaling_x.transform(X_test)
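
One way to see the effect of scaling is to compare two patients feature by feature before and after standardization, z = (x - mean) / std. The sketch below uses made-up values and a hypothetical mean and standard deviation for two columns of the dataset (Insulin and DiabetesPedigreeFunction); none of the numbers are computed from the real data.

```python
import numpy as np

# Two made-up patients described by (Insulin, DiabetesPedigreeFunction).
a = np.array([100.0, 0.2])
b = np.array([110.0, 2.0])

# Hypothetical per-feature mean and standard deviation, for illustration only.
mean = np.array([120.0, 0.5])
std = np.array([100.0, 0.3])

# Per-feature difference between the two patients before scaling.
raw_gap = np.abs(a - b)

# The same difference after standardizing each feature with z = (x - mean) / std.
scaled_gap = np.abs((a - mean) / std - (b - mean) / std)

print(raw_gap)
print(scaled_gap)
```

Before scaling, the Insulin difference is numerically the larger one simply because Insulin is measured on a larger scale. After scaling, the second feature, where the two patients differ by several standard deviations, contributes the most. Distance-based methods such as KNN are sensitive to exactly this.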

### Training and Evaluating Models

In the main part of this workshop, we will experiment with a few common prediction algorithms and assess their performance on our dataset.

#### K-Nearest Neighbors

KNN is a non-parametric, lazy learning algorithm. Non-parametric means that it makes no assumption about the underlying data distribution. In other words, the model structure is determined from the dataset. This is very helpful in practice, since most real-world datasets do not follow theoretical mathematical assumptions. Lazy means that the algorithm does not build a model during training; it simply stores the training data, and all of it is used at prediction time. </br>
In KNN, K is the number of nearest neighbors, and it is the core deciding factor. K is generally chosen to be an odd number when there are two classes, so that the vote cannot end in a tie. When K=1, the algorithm is known as the nearest neighbor algorithm. This is the simplest case.

![title](img/knn.png)
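
To make the voting idea concrete, here is a minimal from-scratch sketch of a KNN prediction, assuming Euclidean distance and a simple majority vote. The function `knn_predict` and the toy points are made up for illustration; the cells that follow use scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = np.bincount(y_train[nearest])                # count each label among the neighbors
    return int(np.argmax(votes))                         # the most common label wins

# Made-up 2D points: class 0 near the origin, class 1 near (3, 3).
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
                  [3.0, 3.0], [2.8, 3.3], [3.4, 2.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([2.9, 3.1]), k=3))  # prints 1
```

Notice that there is no training step: the function needs the full training set every time it makes a prediction, which is what "lazy" refers to.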

In [None]:
# KNN with one neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
print("Classification accuracy with one neighbor: ", "{:.3f}".format(accuracy_score(y_test, y_pred)))

With only one neighbor our model’s classification accuracy is 75.3%. </br>
Let’s scan a range of neighbor values to see whether it is possible to better this result

In [None]:
error1= []
error2= []
neighbors = range(1,50)
for k in neighbors:
    # using KNN algorithm
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train,y_train)
    y_pred1 = knn.predict(X_train)

    # storing the errors
    error1.append(np.mean(y_train!= y_pred1))
    y_pred2 = knn.predict(X_test)
    error2.append(np.mean(y_test != y_pred2))

# plotting the training and testing errors
plt.plot(neighbors, error1, label="train")
plt.plot(neighbors, error2, label="test")
plt.xlabel('k Value')
plt.ylabel('Error')
plt.legend()

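min_knn_value = min(error2)
min_knn_index = error2.index(min_knn_value)
# List positions start at 0 while k starts at 1, so look up the matching k value
best_k = neighbors[min_knn_index]
print("Minimum error with {} neighbors: {:.3f}".format(best_k, min_knn_value))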

The minimum test error is obtained with 31 neighbors. We can compute the accuracy for that k. Let’s also compute the confusion matrix and a classification report to find out the precision of our model.

In [None]:
knn = KNeighborsClassifier(n_neighbors=31)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm,annot=True)

# Classification report
print(classification_report(y_test, y_pred))
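
The numbers in the classification report can be derived from the confusion matrix. The sketch below does this for the positive class, using a made-up 2x2 matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The counts are illustrative and are not the results of the model above.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes.
cm_toy = np.array([[90, 10],
                   [20, 34]])
tn, fp, fn, tp = cm_toy.ravel()

accuracy = (tp + tn) / cm_toy.sum()                  # share of all predictions that are correct
precision = tp / (tp + fp)                           # of the patients predicted diabetic, how many are
recall = tp / (tp + fn)                              # of the diabetic patients, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In a screening setting, recall is often the number to watch, since a false negative means a diabetic patient goes undetected.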

#### Support Vector Machines

Support Vector Machines (SVM) are usually considered a classification approach, but they can be employed in both classification and regression problems. They can handle multiple continuous and categorical variables. An SVM constructs a hyperplane in multidimensional space to separate the different classes. The optimal hyperplane is found iteratively by minimizing an error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.

![title](img/svm.webp.crdownload)
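
For a linear kernel, the separating hyperplane is described by a weight vector `w` and an offset `b`, and a point is classified by the sign of `w · x + b`. The sketch below uses hand-picked, hypothetical values for `w` and `b` (they are not fitted to any data) to show how the decision and the margin width follow from them.

```python
import numpy as np

# Hypothetical hyperplane parameters, chosen by hand for illustration (not fitted).
w = np.array([1.0, 1.0])
b = -3.0

def decision_function(points):
    """Signed score of each point: negative on one side of the hyperplane, positive on the other."""
    return points @ w + b

def predict(points):
    """Assign class 1 to points with a non-negative score and class 0 otherwise."""
    return (decision_function(points) >= 0).astype(int)

points = np.array([[0.5, 0.5], [1.0, 1.0], [2.0, 2.0], [3.0, 2.5]])
scores = decision_function(points)
labels = predict(points)

# Width of the margin between the two lines where the score equals -1 and +1.
margin_width = 2 / np.linalg.norm(w)

print(scores)
print(labels)
print(round(margin_width, 3))
```

In this toy setup, the points with scores of exactly -1 and +1 lie on the edges of the margin and would play the role of support vectors. After fitting scikit-learn's `SVC` with `kernel='linear'`, the learned values are available as `coef_` and `intercept_`.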

In [None]:
# Create an SVM classifier
clf = SVC(kernel='linear') # Linear Kernel
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm,annot=True)

# Classification report
print(classification_report(y_test, y_pred))

#### Random Forests Classifier

A random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the final prediction by means of voting. It also provides a fairly good indicator of feature importance.
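
The voting step can be sketched in a few lines. The per-tree predictions below are made up, and `majority_vote` is a hypothetical helper. Note also that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than by counting hard votes, so this is an illustration of the idea and not of the library's internals.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up predictions from five hypothetical trees (rows) for three patients (columns).
per_tree = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]

# zip(*per_tree) regroups the predictions by patient instead of by tree.
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree)]
print(forest_prediction)  # prints [1, 0, 1]
```

Because each tree sees a different random sample of the data, individual trees make different mistakes, and the vote tends to cancel some of them out.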

In [None]:
# Create a Random Forest Classifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm,annot=True)

# Classification report
print(classification_report(y_test, y_pred))

Random forests also offer a good indicator for feature selection. The code below plots the relative importance, or contribution, of each feature to the prediction. The relevance score of each feature is computed automatically during the training phase and then scaled so that the sum of all scores is 1.

In [None]:
# Calculate feature importance
feature_imp = pd.Series(rfc.feature_importances_,index=df.columns[:-1]).sort_values(ascending=False)
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

## Discussion

* Which of the prediction algorithms presented here is preferable for the task?
* Can you think of other methods to achieve similar or better performance?
* What other measurements can we compute to compare these methods?