# Hands On Workshop - Big Data in Healthcare 8400
### Hadas Volkov
## Predicting Diabetes - A Machine Learning Approach

In 2018, about 10.5% of Americans were estimated to have diabetes. Furthermore, about one-fifth of those cases were undiagnosed. Early detection is key in diabetes because early treatment can prevent serious complications. When a problem with blood sugar is found, doctors and patients can take steps to prevent permanent damage to the heart, kidneys, eyes, nerves, blood vessels, and other vital organs.
To be diagnosed with diabetes, a patient must go through several tests and be checked for multiple factors. This long process makes it difficult for doctors to keep track of the results and can lead to inaccurate conclusions, which makes detection challenging. Thanks to recent advances in machine learning algorithms, it is now possible to predict the disease in candidate patients quickly and accurately.

### About the Dataset
This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The data is available as the [Pima Indians Diabetes Database on Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download).

### Data Analysis
The platform chosen for this exercise is a Jupyter notebook. For data scientists, notebooks are a crucial tool. Notebooks are a form of interactive computing, in which users write and execute code, visualize the results, and share insights. Typically, data scientists use notebooks for experiments and exploration tasks.
You are not expected to fully understand the code; it is here if you’d like to dive deeper. Rather, this form of presentation lets me integrate the computing environment into the workshop and show you how a data scientist works.
You are asked to follow the notebook and execute each code block, either by highlighting the block and using the ‘play’ button above or by using the keyboard shortcut ‘shift+enter’.


#### Python and Python packages
The Python programming language has dominated the field of machine learning in recent years. There are two main reasons. First, Python is relatively simple for non-coders to pick up and has an intuitive syntax. Second, and more important, is the abundance of packages available to Python users, especially data scientists. A Python package is a collection of reusable code that offers some specific functionality to the user. For example, the ‘pandas’ package makes it easy to handle CSV and text files, and ‘scikit-learn’ wraps most common machine learning algorithms, making it easy for us to quickly test a variety of methods on our data.
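
As a rough illustration of that last point, the sketch below trains two scikit-learn classifiers on a small synthetic dataset and compares their accuracy. The synthetic data from 'make_classification', the variable names, and the parameter values are assumptions chosen for illustration; they are not part of the workshop analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every scikit-learn classifier exposes the same fit/score interface,
# so trying another algorithm only requires changing the model object.
models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {accuracies[name]:.2f}")
```

Because both models share the same interface, adding a third algorithm to the comparison is a one-line change to the 'models' dictionary.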


In the code block below we’ll import some packages and functions for our analysis. Please execute the block before moving forward.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.plotting import plot_decision_regions
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

We will start by importing our dataset into our environment and saving it under the name ‘df’ (short for DataFrame, the pandas table type). The data is kept in a CSV file named ‘diabetes.csv’ in the same directory as this notebook.
The command ‘df.head()’ displays the first five rows of the file.

In [5]:
#importing dataset
df = pd.read_csv('diabetes.csv')
df.head()

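|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |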


The dataset contains 768 observations with eight feature variables and one target variable. Before analyzing the data and drawing any conclusions, it is essential to check the dataset for missing values. The simplest way to do so is the 'df.info()' function, which lists each column's name together with its number of non-null values and its data type.

In [6]:
# Get information on the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
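
Note that 'df.info()' only counts values that pandas itself treats as missing (NaN). A minimal sketch of two complementary checks is shown here on a small hypothetical frame named 'toy', whose values are made up for illustration: 'isnull().sum()' counts NaN entries per column, and an explicit comparison with zero counts zeros that may be placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration only
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BMI": [33.6, np.nan, 23.3, 0.0],
})

nan_counts = toy.isnull().sum()   # entries pandas recognizes as missing
zero_counts = (toy == 0).sum()    # zeros that may be placeholders for missing data

print(nan_counts)
print(zero_counts)
```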


Although 'df.info()' reports no null values, five features in the data contain zero values: Glucose, BloodPressure, SkinThickness, Insulin and BMI. A value of zero is not physiologically plausible for any of these measurements, so these zeros most likely mark missing data. We can either discard rows containing these values or replace them with the column's mean value.
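
Both options can be sketched as follows. This is an illustration under assumptions, not the workshop's own cleaning step: the frame 'toy' and its values are hypothetical, and zeros are first converted to NaN so that pandas treats them as missing and excludes them from the column mean.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the values are invented for illustration only
toy = pd.DataFrame({
    "Glucose": [100, 0, 140, 120],
    "BMI": [30.0, 25.0, 0.0, 35.0],
    "Outcome": [0, 0, 1, 1],
})
zero_as_missing = ["Glucose", "BMI"]  # columns where zero is not a plausible value

# Mark implausible zeros as missing so they do not distort the statistics
marked = toy.assign(
    **{col: toy[col].replace(0, np.nan) for col in zero_as_missing}
)

# Option 1: discard every row that has a missing measurement
dropped = marked.dropna(subset=zero_as_missing)

# Option 2: replace each missing value with the mean of the observed
# values in the same column
imputed = marked.fillna(marked[zero_as_missing].mean())

print(dropped)
print(imputed)
```

Dropping rows shrinks an already small dataset, while mean replacement keeps every row at the cost of introducing estimated values.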