# Introduction

<p align="justify">Welcome! In this case we'll be exploring how to use advanced analytic and machine learning techniques to predict diabetes among the Pima Indian population. 
<br>
<br>
<details>
<summary>Some of the skills you'll explore are (Click to Expand):</summary>
<ul>
    <li>R Programming</li>
    <li>Data Cleaning</li>
    <li>Exploratory Data Analysis</li>
    <li>Data Visualization</li>
    <li>Leveraging Domain Knowledge</li>
    <li>Machine Learning</li>
    <li>Gradient Boosting Machines</li>
</ul>
</details><br>
Don't worry if you're unsure what some of these terms are. They'll be explained throughout the case. Let's begin! 
<img src="https://i.stack.imgur.com/zlAi2.png" style="float: left; width: 33%; margin-right: 1%; margin-bottom: 0.5em;">
<img src="https://cdn.images.express.co.uk/img/dynamic/11/590x/Diabetes-symptoms-870995.jpg" style="float: left; width: 35%; margin-bottom: 0.5em;">
<img src="https://46gyn61z4i0t1u1pnq2bbk2e-wpengine.netdna-ssl.com/wp-content/uploads/2018/11/sankey-diagram-1.png" style="float: left; width: 28%; margin-left: 1%; margin-bottom: 0.5em;">

## Case Scenario

Imagine you're a statistical officer for the Indian Health Service (IHS). Your state has recently obtained a federal grant aimed at improving health equity. This includes improving care among traditionally disadvantaged groups. From working within the IHS for several years now, you know that American Indians are more likely to suffer from metabolic diseases, particularly diabetes. 

<img width="400" height=325 src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/73/Indian_Health_Service_Logo.svg/1200px-Indian_Health_Service_Logo.svg.png">

With this grant, you now have the chance to leverage new technologies in analytics and machine learning to improve care among this group. If you can predict diabetes in patients before it manifests symptomatically, you can preemptively target preventative services toward this population. However, you have to first develop a way to predict diabetes. How can you do this?

Continue the case to find out how.

### Extra: What is Diabetes?

Diabetes describes a group of metabolic disorders that is characterized by abnormally high blood glucose (blood sugar). Insulin is a critical hormone that allows the body to take up glucose and use it as energy. Diabetes occurs when your body does not produce enough insulin or cannot use insulin well. When this happens, glucose stays in your blood instead of being used as energy. Over time, this glucose in your blood can lead to various health problems. These can include serious complications such as blindness, limb amputation, and stroke. 

<img src="http://cdn.shopify.com/s/files/1/0582/0445/files/blood-sugar-levels-and-paleo_diagram_of_excessive_blood_glucose.jpg?14686865159083212781" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

Diabetes is more common among individuals who are over 45 years old, have a family history of diabetes, or are overweight. Lifestyle factors such as physical inactivity or smoking can also increase the risk. Diabetes is one of the most common conditions in the United States, with over 30 million people living with it. The total cost of diabetes to the health care system is estimated at $327 billion. Current treatment includes insulin and other medication to control blood glucose levels. Lifestyle modifications before the onset of diabetes can make a huge difference in whether a person develops the disease. An algorithm that could reliably predict diabetes would give providers critical information to intervene before a patient develops it.

## How To Run The Case (Do Not Skip)

Before we begin the case, we need to know how to use Jupyter Notebook and run the case. First, look for the `Run` button. Its location is shown below; it can be found in the toolbar above. 


<img src="https://i.imgur.com/jr4dpLW.png">

The cell below is a code cell. You will be running numerous code cells like the one below throughout the case. Select the cell and select the run button above. 

In [None]:
# This is an example of a code cell
cat('Congratulations! \n')
cat('You\'ve run your first code cell.\n')


<img width = 50 height = 50 style="float: left; margin-right: 10px;" src="https://upload.wikimedia.org/wikipedia/commons/b/b9/Stop_sign_dark_red.svg">Stop! If you have not learned to run a code cell, restart this section. You will not be able to go through the case at all if you are unable to run code cells. Otherwise, it's time to meet our data!

## Meeting Our Data

We'll be using a set of patient data made available by the National Institute of Diabetes and Digestive and Kidney Diseases. It's important to note that several constraints were placed on which patients were included in the dataset. All patients are female, at least 21 years old, and of Pima Indian heritage. The variables were chosen because they are significant risk factors for diabetes among Pimas or other population groups. The data is hosted on the [University of California Irvine's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/diabetes). 

**Acknowledgements** <br>
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.

### Data First Look

Let's get a first glimpse of our data.

In [None]:
pima_diabetes <- read.csv(file="data/diabetes.csv",  encoding="UTF-8", header=TRUE, sep=",")
head(pima_diabetes)

What do you notice? Do you notice anything unusual about the data? Don't worry if you don't notice anything. We will be getting to know our data better as we go through the case. 

### Data Variable Information 

There are several variables or labels which you might not understand. The way to combat this is by consulting the data dictionary. 

> A data dictionary describes a dataset and provides information on the meaning of each variable. Always look for documentation or a data dictionary before starting an analysis.

Unfortunately, no documentation is provided on the original data page. However, the original study the data comes from is available [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245318/pdf/procascamc00018-0276.pdf). The relevant information has been added below for your convenience. 

<center>

| *Variable*               | *Definition*                                                         |
| ------------------------ | -------------------------------------------------------------------- |
| Pregnancies              | Number of times pregnant                                             |
| Glucose                  | Plasma glucose concentration following a 2 hour oral glucose tolerance test                                             |
| BloodPressure            | Diastolic Blood Pressure (mm Hg)                                     | 
| SkinThickness            | Triceps Skin Fold Thickness (mm)                                     |
| Insulin                  | 2-Hour Serum Insulin Levels                                          |
| BMI                      | Weight in kg / (Height in m)^2                                       |
| DiabetesPedigreeFunction | Measure of expected genetic influence on the subject's diabetes risk |
| Age                      | Age (Years)                                                          |
| Outcome                  | Diabetes as defined as plasma glucose concentration > 200 mg/dl two hours after ingestion of 75 gram carbohydrate solution |


</center>

# Setup (Do Not Skip)

The code below configures the settings the case needs to run properly. Do not worry if you do not understand what the code is doing; this will not impact your understanding of the case. Run the code below to complete the setup. Do not skip this step!

In [None]:
# Increase max number of columns displayed in output tables
options(repr.matrix.max.cols = 50)
set.seed(10) # Make sure your ML results are the same

# Calling external libraries for additional functionality
suppressMessages(library(tidyverse))
suppressMessages(library(randomForest))
suppressMessages(library(forcats))
suppressMessages(library(cowplot))
suppressMessages(library(caret))
suppressMessages(library(e1071))
suppressMessages(library(pROC))
suppressMessages(library(mice))
suppressMessages(library(gbm))

# Create function to test models
test_model <- function(features){
    # Set up the model
    total_variables <- append(features,"Outcome")
    play_trainingData <- subset(training_data,select = total_variables)
    play_testData <- subset(test_data,select = total_variables)
    # Add in NA action to exclude missing 
    model_gbm <- train(Outcome~., data=play_trainingData, method="gbm", na.action = na.omit,verbose=FALSE)

    # Predict
    prediction_gbm <- predict(model_gbm, play_testData)

    # Confusion Matrix
    cm <- confusionMatrix(prediction_gbm, play_testData$Outcome)
    print(cm)
    # Create a ROC curve
    ROC <- roc(response = play_testData$Outcome, predictor = factor(prediction_gbm, 
                                                               ordered = TRUE))

    # Plot ROC with ggplot2
    plot_ROC <- ggroc(ROC)
    print(plot_ROC)
    cat('AUC:', round(auc(ROC), 2),'\n')
    test <- varImp(model_gbm)
    ggplot(test)
    }

cat('Setup complete!')

><img width = 50 height = 50 style="float: left; margin-right: 10px;" src="https://upload.wikimedia.org/wikipedia/commons/b/b9/Stop_sign_dark_red.svg">Stop! If you have not run the code cell above, please do so. The case will not work properly if you do not complete the setup. Otherwise, let's begin!

# Cleaning Our Data

The first step in any analytic project is to clean our data. This is a critical step which will ensure our data is correct, consistent, and ready for analysis. If we do not properly process our data, we will not be able to analyze it effectively, regardless of how advanced our analytic technique is. A common saying in data analysis is "Junk in, junk out". 

## Inspecting Our Data

We'll begin by reading in our data. This means we will be loading the data onto the system so R can understand the data. 

In [None]:
# Note: Unicode Transformation Format – 8 (UTF-8) is a standard to encode characters in different languages
cat('Data loading, please wait\n')
pima_diabetes <- read.csv(file="data/diabetes.csv",  encoding="UTF-8", header=TRUE, sep=",")
cat('Data loaded!')

Now let's get an overview of our data

In [None]:
head(pima_diabetes)
str(pima_diabetes)
summary(pima_diabetes)

For the most part all values are classified correctly as numeric or categorical besides the `Outcome` variable.  In addition there seem to be some odd values. There are individuals with a `BMI` of 0 or 67.1, likely implausible values. Other instances of implausible values include a `Glucose` of 0, 17 `Pregnancies`, and `SkinThickness` of 99 mm. We will consider all of these characteristics as we clean our data. 

## Recoding Variables

Sometimes, the data you receive may be coded in a way that is not easily understandable. Recoding the data (turning the data into a more human-readable format) can help make your analysis quicker and smoother. 

The only variable we need to recode for our data is `Outcome`. Let's recode the variable `Outcome` into something meaningful. Based upon the data dictionary, we can see that `1` indicates the patient is classified as having diabetes, and `0` indicates they are not classified as having diabetes. 

In [None]:
# Recoding
pima_diabetes$Outcome <- ifelse(pima_diabetes$Outcome == 1, 'Diabetes', 
                               ifelse(pima_diabetes$Outcome == 0, 'No diabetes', NA))

# Convert from character to factor
pima_diabetes$Outcome <- as.factor(pima_diabetes$Outcome)
cat('Data Recoded')

Let's confirm our changes

In [None]:
head(pima_diabetes[c('Outcome')])

## Checking for Missing Values

Next, we will check for missing values. Missing values may unknowingly influence our results. For instance, if the missing data differs from the non-missing data in a significant way, this could lead you to draw erroneous conclusions from incomplete data. 

Let's examine the number of missing values

In [None]:
cat('Number of Missing Data for Each Variable:')
sapply(pima_diabetes, function(x) sum(is.na(x)))

This is very interesting. There are no missing values. If this is true, then this is an incredibly clean dataset. However, this is rarely the case. It's more likely that R cannot recognize the missing or erroneous values as being missing. 

All of our variables are numeric so it is unlikely there will be categories of erroneously coded data. The more likely option is that there are implausible values. 
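One quick way to look for such disguised missing values is to count how often each numeric column contains a zero. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical, not taken from the case data) to show the idea. Note that a zero is not always an error: zero pregnancies is a perfectly valid value, so the counts still need clinical judgement.

```r
# Toy data frame (made-up values) where 0 sometimes stands in for "not recorded"
toy <- data.frame(
  Glucose = c(148, 85, 0, 89),
  BloodPressure = c(72, 0, 64, 0),
  Pregnancies = c(6, 1, 0, 3)   # here 0 is a legitimate value
)

# is.na() finds nothing, because the gaps are stored as zeros
na_counts <- sapply(toy, function(x) sum(is.na(x)))
print(na_counts)

# Counting zeros reveals the candidates for hidden missing values
zero_counts <- sapply(toy, function(x) sum(x == 0))
print(zero_counts)
```

The same `sapply()` pattern could be pointed at `pima_diabetes` to see which columns deserve a closer look.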

## Removing Implausible Values

It is important to check your data for implausible values. These values can influence your final analysis in undesirable ways. Often, they are the result of mistakes rather than actual outliers and can lead to erroneous conclusions.

Earlier we saw several variables with possibly implausible values. These include `Glucose`, `BloodPressure`, `SkinThickness`, `Pregnancies`, `Insulin`, and `BMI`. Let's take a closer look.

In [None]:
cat('Glucose:')
quantile(pima_diabetes$Glucose, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))
cat('Blood Pressure:')
quantile(pima_diabetes$BloodPressure, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))
cat('Skin Thickness:')
quantile(pima_diabetes$SkinThickness, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))
cat('Serum Insulin:')
quantile(pima_diabetes$Insulin, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))
cat('BMI:')
quantile(pima_diabetes$BMI, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))
cat('Pregnancies')
quantile(pima_diabetes$Pregnancies, c(0, .01, .05, .10, .25, .50, .75, .90, .95, .99, 1))

Based on the percentiles above, we can get a better sense of implausible and plausible values. Based on the results we can make the following judgements:

**Glucose:**
- Glucose is unlikely to go below 50 with the 1st percentile being 57. The 0 glucose measurement is physiologically impossible and likely to be an error or missing

**Blood Pressure:**
- Diastolic blood pressure is unlikely to go below 30 with the 5th percentile being 38.7. Patients at this level are likely to be considered hypotensive or in shock. 0 diastolic blood pressure is physiologically impossible and likely to be an error or missing.

**Skin Thickness:**
-  It is biologically unlikely that skin thickness would be above 80 mm or below 10 mm. This is confirmed by our percentile data with a huge jump between the 99th and 100th percentile. 

**Serum Insulin:**
-  2 hour serum insulin levels are physiologically unlikely to fall below 10 or above 300. This is confirmed by our percentile results. The jump from 95th to 99th and 100th percentile is enormous. In addition the values 0 for serum insulin are physiologically impossible and likely errors. 

**BMI:**
- BMI is unlikely to go above the 99th percentile of 50.759. The maximum of 67.1 seems biologically implausible. 
- BMI values below the 5th percentile of 21.8 are possible. However, BMI values of 0 are likely mistakes. 

Based on these findings, let's reclassify implausible values as `NA`.

In [None]:
pima_diabetes$Glucose[pima_diabetes$Glucose < 50] <- NA
pima_diabetes$BloodPressure[pima_diabetes$BloodPressure < 30] <- NA
pima_diabetes$SkinThickness[pima_diabetes$SkinThickness < 10 | pima_diabetes$SkinThickness > 80] <- NA
# Note: the upper cut-off used here (500) is more lenient than the 300 discussed above
pima_diabetes$Insulin[pima_diabetes$Insulin < 10 | pima_diabetes$Insulin > 500] <- NA
pima_diabetes$BMI[pima_diabetes$BMI < 10 | pima_diabetes$BMI > 50] <- NA

cat('Data Recoded')

Our data now contains missing values and looks less clean than before, but this is a more realistic picture. 

## Creating Clinically Relevant Variables

Our data includes the raw values for BMI. This is not a very useful measure by itself. Let's convert it into something more clinically meaningful. 

### BMI

BMI stands for Body Mass Index. This is a measure of body weight based upon a person's weight and height. This measure is commonly used to classify individuals as being overweight or a healthy weight. Below is the formula for BMI. 

\begin{equation*}
\large BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2}
\end{equation*}

We will create a new variable which reflects the clinical cut-offs for BMI. 

**Knowledge Check:** What are the clinical cut-offs for BMI?

<center>

| *Category*     | *BMI Range*     |
| -------------- | --------------- |
| Underweight    | BMI < 18.5      |
| Healthy Weight | 18.5 ≤ BMI < 25 |
| Overweight     | 25 ≤ BMI < 30   |
| Obese          | BMI ≥ 30        |

</center>
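
The cut-offs in the table can be encoded in more than one way. As a minimal sketch, base R's `cut()` turns a numeric vector into labelled intervals in a single call. The `bmi` values below are made up to sit on and around each boundary; the case itself builds the variable with nested `ifelse()` calls instead.

```r
# Hypothetical BMI values chosen to sit on and around each cut-off
bmi <- c(17.0, 18.5, 24.9, 25.0, 29.9, 30.0, NA)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Healthy Weight", "Overweight", "Obese"),
  right = FALSE
)

print(data.frame(bmi, bmi_category))
```

Missing BMI values stay missing, which matches what we want after recoding implausible values as `NA`.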

Let's create the new variable `bmi_interp` based on these cut-offs.

In [None]:
# Create 'bmi_interp'
pima_diabetes <- mutate(pima_diabetes, bmi_interp = ifelse(BMI < 18.5, 'Underweight', 
                                        ifelse(BMI >= 18.5 & BMI < 25, 'Healthy Weight',
                                              ifelse(BMI >= 25 & BMI < 30, 'Overweight',
                                                    ifelse(BMI >= 30, 'Obese', NA)))))

# Convert from character to categorical
pima_diabetes$bmi_interp <- as.factor(pima_diabetes$bmi_interp)

cat('\'bmi_interp\' variable created!')

Let's confirm our results

In [None]:
head(pima_diabetes[c('BMI', 'bmi_interp')])

#### Limitations and Considerations when using BMI

BMI is a simple, inexpensive, and common measure for body fat. However, there are several clinical considerations to keep in mind when using this measure. It's critical to keep in mind BMI is only a surrogate measure since it uses weight instead of actual body fat content in its calculations. Below are three examples of factors that can influence BMI:

- age: older adults usually have more body fat than younger adults for the same BMI
- gender: women tend to have greater amounts of body fat compared to men for the same BMI
- muscle mass: muscular individuals or athletes may have higher BMI due to increased muscle mass

[Source](https://www.cdc.gov/obesity/downloads/bmiforpactitioners.pdf)

# Exploratory Data Analysis 

Now that we've cleaned our data we can begin exploring our data and selecting variables (also known as features) which we predict will be good candidates for our predictive model. How will we know which features are good candidates? One way we can quantitatively assess our variables is through descriptive analysis and data visualization. We will explore our data based on their data type (quantitative or categorical). 

What are quantitative and categorical variables?

- **Quantitative:** variables whose values are numbers that measure or count something (e.g. age in years, percentages)
- **Categorical:** variables whose values are selected from a fixed set of groups (e.g. dog breeds, male/female) 
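
In R, this distinction usually shows up in the column type. The sketch below uses a tiny made-up data frame (the name `toy` and its values are hypothetical) and checks each column with `is.numeric()` and `is.factor()`.

```r
# Toy data frame with two quantitative columns and one categorical column
toy <- data.frame(
  Age = c(50L, 31L, 32L),
  BMI = c(33.6, 26.6, 23.3),
  bmi_interp = factor(c("Obese", "Overweight", "Healthy Weight"))
)

# TRUE for integer and double columns, FALSE for factors
is_quantitative <- sapply(toy, is.numeric)

# TRUE only for factor (categorical) columns
is_categorical <- sapply(toy, is.factor)

print(rbind(is_quantitative, is_categorical))
```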

### Why Can't We Just Use All or Most Variables?

You might be wondering why we need to select variables at all. Why not just use all of the variables? After all, more data leads to better models, right? This is a common misconception that even experienced analysts need to watch out for. Including too many features in your prediction model can lead to what is known as 'overfitting'. Overfitting occurs when you build a model that adheres too closely to your current data set and is unable to predict observations that are not from that data set. In other words, the model is tuned too closely to your current data and does not generalize to outside data sources. 

<img src="https://3gp10c1vpy442j63me73gy3s-wpengine.netdna-ssl.com/wp-content/uploads/2018/03/Screen-Shot-2018-03-22-at-11.22.15-AM-e1527613915658.png" align="center" style="width: 50%; margin-bottom: 0.5em; margin-top: 0.5em;">
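
To make overfitting more concrete, here is a minimal sketch on simulated data. Everything in it (the variable names, the straight-line relationship, the noise level) is made up for illustration and is not part of the case data. A simple model and a much more flexible model are fitted to the same training rows and then scored on rows held back for testing.

```r
set.seed(42)

# Simulated data: the true relationship is a straight line plus noise
n <- 30
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
train <- data.frame(x = x[1:20], y = y[1:20])
test <- data.frame(x = x[21:30], y = y[21:30])

# Root mean squared error of a model's predictions on a data set
rmse <- function(model, data) sqrt(mean((data$y - predict(model, data))^2))

simple_fit <- lm(y ~ x, data = train)              # 1 feature
flexible_fit <- lm(y ~ poly(x, 10), data = train)  # 10 polynomial features

results <- data.frame(
  model = c("straight line", "degree-10 polynomial"),
  train_rmse = c(rmse(simple_fit, train), rmse(flexible_fit, train)),
  test_rmse = c(rmse(simple_fit, test), rmse(flexible_fit, test))
)
print(results)
```

The flexible model always fits the training rows at least as well as the straight line. Overfitting shows up when its error on the test rows is clearly worse than its error on the training rows.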

## Assessing Numeric Variables

First we will examine the quantitative or numeric variables. The code below will give us an overview of the structure of our data. Look for variables with the label `int` or `num`. These are two kinds of quantitative variables in R. 

In [None]:
str(pima_diabetes)

We can see that most of our predictor variables are numeric, with the exception of `bmi_interp`. Let's examine the distribution of the quantitative variables. 

In [None]:
### Pregnancies ###
# Create Plot
pregnancy_plot <- ggplot(pima_diabetes, aes(x=Pregnancies)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=15, color = 'black', fill = 'lightblue') +
labs(x='Number of Pregnancies', y='Frequency Count', caption = 'Dashed Line Represents Median Pregnancies (3)')

# Display + Median Line
pregnancy_plot <- pregnancy_plot + geom_vline(aes(xintercept=median(Pregnancies)),
            color="blue", linetype="dashed", size=1)

### Glucose ###
# Create Plot
glucose_plot <- ggplot(pima_diabetes, aes(x=Glucose)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=20, color = 'black', fill = 'lightblue') +
labs(x='Blood Glucose ', y='Frequency Count', caption = 'Dashed Line Represents Median Blood Glucose (117)') 

# Display + Median Line
glucose_plot <- glucose_plot + geom_vline(aes(xintercept=median(Glucose, na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) 

### BloodPressure ###
# Create Plot
bp_plot <- ggplot(pima_diabetes, aes(x=BloodPressure)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=20, color = 'black', fill = 'lightblue') +
labs(x='Blood Pressure (mm Hg)', y='Frequency Count', caption = 'Dashed Line Represents Median Blood Pressure (72)') 

# Display + Median Line
bp_plot <- bp_plot + geom_vline(aes(xintercept=median(BloodPressure, na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) 

### Skin Thickness ###
# Create Plot
skin_plot <- ggplot(pima_diabetes, aes(x=SkinThickness)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=20, color = 'black', fill = 'lightblue') +
labs(x='Skin Thickness (mm)', y='Frequency Count', caption = 'Dashed Line Represents Median Skin Thickness (29)') 

# Display + Median Line
skin_plot <- skin_plot + geom_vline(aes(xintercept=median(SkinThickness, na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) 

### Diabetes Pedigree ###
# Create Plot
pedigree_plot <- ggplot(pima_diabetes, aes(x=DiabetesPedigreeFunction)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=20, color = 'black', fill = 'lightblue') +
labs(x='Diabetes Pedigree Function', y='Frequency Count', caption = 'Dashed Line Represents Median \nDiabetes Pedigree Function (0.37)') 

# Display + Median Line
pedigree_plot <- pedigree_plot + geom_vline(aes(xintercept=median(DiabetesPedigreeFunction, na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) 

### Age ###
# Create Plot
age_plot <- ggplot(pima_diabetes, aes(x=Age)) +
geom_histogram(alpha = 0.5, position = 'identity', bins=20, color = 'black', fill = 'lightblue') +
labs(x='Age (years)', y='Frequency Count', caption = 'Dashed Line Represents Median Age (29)') 

# Display + Median Line
age_plot <- age_plot + geom_vline(aes(xintercept=median(Age, na.rm=TRUE)),
            color="blue", linetype="dashed", size=1) 

plot_grid(pregnancy_plot, glucose_plot, bp_plot, skin_plot, pedigree_plot, age_plot, ncol=2)

There do not appear to be any extreme values or prominent clusters. Now let's see if there is a relationship between diabetes and our predictors.

In [None]:
### Pregnancy ###
violin_plot_pregnancy <- ggplot(pima_diabetes, aes(x=Outcome, y=Pregnancies, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Number Pregnancies', x='Diabetes Status') +
coord_flip()

### Glucose ###
violin_plot_glucose <- ggplot(pima_diabetes, aes(x=Outcome, y=Glucose, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Blood Glucose', x='Diabetes Status') +
coord_flip()

### Blood Pressure ###
violin_plot_bp <- ggplot(pima_diabetes, aes(x=Outcome, y=BloodPressure, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Blood Pressure (mm Hg)', x='Diabetes Status') +
coord_flip()

### Skin Thickness ###
violin_plot_skin <- ggplot(pima_diabetes, aes(x=Outcome, y=SkinThickness, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Skin Thickness (mm)', x='Diabetes Status') +
coord_flip()

### Diabetes Pedigree function ###
violin_plot_pedigree <- ggplot(pima_diabetes, aes(x=Outcome, y=DiabetesPedigreeFunction, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Diabetes Pedigree Function', x='Diabetes Status') +
coord_flip()

### Age ###
violin_plot_age <- ggplot(pima_diabetes, aes(x=Outcome, y=Age, color = Outcome, fill = Outcome)) + 
geom_violin(alpha = 0.3, trim = FALSE) + # By default tails are trimmed
stat_summary(fun.y=median, geom="point", shape = 23, size = 2) +
theme(legend.position='none') +
labs(y='Age', x='Diabetes Status') +
coord_flip()

plot_grid(violin_plot_pregnancy, violin_plot_glucose, violin_plot_bp, violin_plot_skin, 
          violin_plot_pedigree, violin_plot_age, ncol = 2)

Overall, most variables have different distributions for 'No diabetes' and 'Diabetes'. This suggests that these variables can discriminate between our two outcomes and are likely good candidate predictor variables. However, there is one variable which does not appear to have a very different distribution. Which variable is it?

The exception is `BloodPressure`. However, we cannot completely discount `BloodPressure`. [Relevant literature](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3314178/) has shown a link between hypertension (high blood pressure) and diabetes. For this reason, `BloodPressure` could still be a feature in our model. 
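
A visual comparison like this can be backed up with a simple number per group. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical) and `tapply()` to compute the median of a predictor within each outcome group. A large gap between the medians points in the same direction as clearly separated violin plots.

```r
# Toy data (made-up values): one predictor and the outcome label
toy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  Outcome = factor(c("Diabetes", "No diabetes", "Diabetes",
                     "No diabetes", "Diabetes", "No diabetes"))
)

# Median glucose within each outcome group; na.rm skips missing values
median_by_outcome <- tapply(toy$Glucose, toy$Outcome, median, na.rm = TRUE)
print(median_by_outcome)
```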

### Balancing Feature Selection with Domain Knowledge

There may be times when your analysis shows that a feature has no effect or is not statistically significant. Such findings always need to be balanced with clinical knowledge. If you know that something is important clinically, that knowledge should be weighed against incidental statistical findings. Statistical effects and significance can change based on the characteristics of your data. Always use your clinical/domain knowledge to inform the analytic process when possible. 

## Assessing Categorical Variables

We can now examine our final candidate predictor variable, `bmi_interp`. 

In [None]:
# Create Plot
bmi_plot <- ggplot(data=(subset(pima_diabetes, !is.na(bmi_interp))), aes(x=bmi_interp, fill = Outcome)) + 
geom_bar(position='fill') +
labs(y='Proportion', x='BMI Status', fill = "Outcome") +
theme(legend.position = 'none') + theme(axis.title.y=element_blank())

# Display Plot
bmi_plot

We can see that a higher BMI category is associated with a higher proportion of diabetes. This also makes sense clinically, since many of the metabolic risk factors behind obesity and overweight underpin diabetes. All in all, this indicates that `bmi_interp` is a strong predictor variable from both a data science and a clinical perspective. 

## Logistic Regression

Now that our variables have been successfully converted and our outcome has been defined, we can analyze our data. Logistic regression is a mathematical model that estimates the probability of a binary outcome (such as whether or not a patient has diabetes). It is named after the logistic curve, which takes the S-shape depicted below.
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/640px-Logistic-curve.svg.png?1566122052688" alt="Logistic Curve" title="Logistic Curve" />
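
The S-shape comes from the logistic function, $p = \frac{1}{1 + e^{-z}}$, which squeezes any number $z$ into a probability between 0 and 1. As a small sketch (the helper name `logistic` and the input values are chosen only for illustration), we can compute it by hand and compare it with R's built-in `plogis()`.

```r
# Logistic function: maps any real number to a value between 0 and 1
logistic <- function(z) 1 / (1 + exp(-z))

z <- c(-6, -2, 0, 2, 6)
probabilities <- logistic(z)

# Very negative inputs give values near 0, very positive inputs near 1,
# and an input of 0 gives exactly 0.5
print(round(probabilities, 3))

# Base R ships the same function as plogis()
print(all.equal(probabilities, plogis(z)))
```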

**Pre-Check:** What is our primary outcome? What information will a logistic regression model tell you about our outcome?

Our primary outcome is whether the individual has diabetes. The logistic regression model will allow us to see how individual variables affect whether an individual has diabetes **while controlling for other variables in the model**. For instance, we can see whether being older affects having diabetes while controlling for weight, genetics, etc.

Very useful indeed!

**Follow-Up:** What is statistical significance? What is a generally accepted level of statistical significance in healthcare research?

Logistic regression will allow us to analyze which variables have a statistically significant effect on whether an individual has diabetes. It is a commonly used technique in health analytics because it is easy to interpret and is thought to model the multi-factorial causes of disease well. 

Statistical significance describes how unlikely it is that a relationship as strong as the one observed in your data would arise by chance alone. What does this mean? Let's say our logistic regression model finds that weight has a statistically significant effect on whether a patient has diabetes. This means that the observed relationship between weight and diabetes is stronger than we would expect to see by chance if no real relationship existed. 

The conventional threshold for significance is a p-value below 0.05. This means that, if there were truly no relationship, data showing a relationship at least this strong would be expected less than 5% of the time. The image below displays a sample R output.

<img src="https://drchrispook.files.wordpress.com/2017/02/anova-output-from-r1.jpg" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

Let's create our logistic model.

In [None]:
# Creating a logistic regression model
mylogit <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
               bmi_interp +Insulin+ DiabetesPedigreeFunction + Age,
               data = pima_diabetes, family = "binomial")
mylogit.sum <- summary(mylogit)
mylogit.sum

The above model allows us to see which variables are considered to have a statistically significant effect on risk for diabetes. For instance, `Glucose` has a statistically significant effect with a p-value of 2.07e-10. 

Keep in mind that a variable that is not statistically significant is not necessarily a poor feature. If you have domain knowledge which indicates a feature is particularly important for your outcome, you should still consider including that feature in your model. Clinical significance is more important than statistical significance!
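
When reading the signs of the coefficients, it may be worth checking which outcome level the model treats as the event. For a factor outcome, `glm()` with a binomial family treats the first factor level as "failure" and models the probability of the other level. Because `as.factor()` orders levels alphabetically, `Diabetes` comes before `No diabetes`. The sketch below uses a small made-up data frame (the name `toy` and its values are hypothetical) to show how `relevel()` flips the reference level and, with it, the sign of the coefficients.

```r
# Toy data (made-up values) with the same outcome labels used in this case
toy <- data.frame(
  Glucose = c(85, 90, 100, 110, 120, 130, 140, 150, 160, 170),
  Outcome = factor(c("No diabetes", "No diabetes", "No diabetes", "Diabetes",
                     "No diabetes", "Diabetes", "No diabetes", "Diabetes",
                     "Diabetes", "Diabetes"))
)

# Levels are alphabetical, so "Diabetes" is the first (reference) level
print(levels(toy$Outcome))

# Default ordering: the model estimates the probability of "No diabetes"
fit_default <- glm(Outcome ~ Glucose, data = toy, family = "binomial")

# Make "No diabetes" the reference so the model estimates the probability of "Diabetes"
toy$Outcome_releveled <- relevel(toy$Outcome, ref = "No diabetes")
fit_releveled <- glm(Outcome_releveled ~ Glucose, data = toy, family = "binomial")

# Same magnitude, opposite sign
print(coef(fit_default)["Glucose"])
print(coef(fit_releveled)["Glucose"])
```

The p-values are unaffected by this choice; only the direction in which the coefficients are read changes.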

# Building A Predictive Model

We are now ready to build a predictive model. If all goes well, we will have a tool that will accurately predict which patients will develop diabetes.

**Pre-Check:** So far we haven't done any machine learning. What we've done can be considered traditional statistical analysis. What differentiates machine learning from statistical analysis?

In machine learning, data is split into a training set and a test set. A machine learning model is then trained on the training set to predict whatever outcome of interest it was designed to predict (in our case we're predicting whether the patient has diabetes). The model's predictive performance is then evaluated using the test set. 

<img src="https://www.sqlservercentral.com/wp-content/uploads/2019/05/Image-2.jpg" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

An important decision you have to make when building your model is what kind of predictive technique you will use. For our case, we will be using a model called gradient boosting machines. To understand gradient boosting machines, we first must understand what boosting and decision trees are. Boosting is the process of converting a weak learner into a strong learner. Decision trees are charts which help make a decision or prediction. Each branch represents a possible outcome. The ends of the branches represent end results or decisions.

Decision trees are common in medical settings. For instance, below is an algorithm for evaluating febrile seizures. This is an example of a decision tree.

<img src="https://img.grepmed.com/uploads/1105/febrileseizure-management-algorithm-diagnosis-complex-original.png" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

In gradient boosting machines, we train numerous small decision trees one after another. With each training iteration, the algorithm looks at where the trees built so far still make errors and fits a new tree to correct those errors. This process continues until we have our final model. This final model is a weighted sum of the predictions made by all the decision trees built by the algorithm. 
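
The idea can be sketched in a few lines of base R. This is a simplified illustration, not the algorithm inside the `gbm` package: it assumes made-up data, a numeric outcome with squared error (so the error each new tree is fit to is simply the residual), and one-split trees known as stumps. The names `fit_stump`, `predict_stump`, `learning_rate` and `n_trees` are chosen for this sketch only.

```r
# Made-up data: a noisy curved relationship between x and y
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

# A "stump" is a decision tree with a single split.
# fit_stump() tries every candidate threshold and keeps the one that gives
# the smallest squared error when each side is predicted by its mean.
fit_stump <- function(x, target) {
  candidates <- head(sort(unique(x)), -1)  # drop the largest x so both sides are non-empty
  best <- NULL
  best_sse <- Inf
  for (threshold in candidates) {
    left <- target[x <= threshold]
    right <- target[x > threshold]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(threshold = threshold, left = mean(left), right = mean(right))
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

learning_rate <- 0.1  # weight given to each new tree
n_trees <- 100

# Start from a constant prediction, then repeatedly fit a stump to the
# current errors and add a down-weighted copy of it to the ensemble
prediction <- rep(mean(y), length(y))
mse_history <- numeric(n_trees)

for (i in seq_len(n_trees)) {
  errors <- y - prediction
  stump <- fit_stump(x, errors)
  prediction <- prediction + learning_rate * predict_stump(stump, x)
  mse_history[i] <- mean((y - prediction)^2)
}

cat("MSE after 1 tree:", round(mse_history[1], 3), "\n")
cat("MSE after", n_trees, "trees:", round(mse_history[n_trees], 3), "\n")
```

Each stump on its own is a weak learner, yet the training error shrinks as more of them are added. For a yes/no outcome like ours, the same loop is run with a different error measure.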

We now need to split our data into training and test data. We will be splitting our data into 80% training data and 20% test data.

In [None]:
# Splitting the data into a training set and a test set
# Setting the seed value so we get the same result when we repeat
set.seed(100)

# Converting columns to numeric, then imputing NA values with the mice package so that our model works
pima_diabetes <- pima_diabetes %>%
    mutate(
        Glucose = as.numeric(Glucose),
        BloodPressure = as.numeric(BloodPressure),
        SkinThickness = as.numeric(SkinThickness)
    )

suppressWarnings(mice_impute <- mice(pima_diabetes, m=1, maxit=10))
pima_diabetes <- complete(mice_impute, 1)

# Determining which rows will be in the training data (80% of rows, rounded down)
training_index <- sample(nrow(pima_diabetes), floor(0.8 * nrow(pima_diabetes)), replace = FALSE)

# Create Training Set
training_data <- pima_diabetes[training_index,]

# Create Test Set
test_data <- pima_diabetes[-training_index,]

cat('Training and Test Data Created!')
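
After splitting, it can be worth confirming that every row landed in exactly one of the two sets and that the outcome is similarly common in both. The sketch below does this on a made-up data frame so that it runs on its own. The names `toy_data`, `toy_index`, `toy_train` and `toy_test`, the 500 rows and the 35% outcome rate are all assumptions for illustration. The same checks could be applied to `training_data` and `test_data`.

```r
set.seed(100)

# Made-up data: 500 patients with a random yes/no outcome
toy_data <- data.frame(id = 1:500, Outcome = rbinom(500, 1, 0.35))

# Same splitting pattern as above: 80% of row numbers drawn without replacement
toy_index <- sample(nrow(toy_data), floor(0.8 * nrow(toy_data)), replace = FALSE)
toy_train <- toy_data[toy_index, ]
toy_test <- toy_data[-toy_index, ]

# Check 1: the two sets together account for every row
cat("Train rows:", nrow(toy_train), " Test rows:", nrow(toy_test), "\n")

# Check 2: no patient appears in both sets
n_overlap <- length(intersect(toy_train$id, toy_test$id))
cat("Rows in both sets:", n_overlap, "\n")

# Check 3: the share of positive outcomes should be roughly similar
cat("Outcome rate (train):", round(mean(toy_train$Outcome), 2), "\n")
cat("Outcome rate (test):", round(mean(toy_test$Outcome), 2), "\n")
```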

Now let's fit our model to the training data. We will then take a look at our model's performance using the test data and a confusion matrix.

> If you're unsure what a confusion matrix is, please consult section 5.0.1 ('What is a Confusion Matrix')

In [None]:
# Set up the model
model <- (Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + bmi_interp + Insulin + DiabetesPedigreeFunction + Age)

# Train the model; na.action = na.omit drops rows with missing values
model_gbm <- suppressWarnings(train(model, data = training_data, method = "gbm", na.action = na.omit, verbose = FALSE))

# Predict
prediction_gbm <- predict(model_gbm, test_data[,-9])

# Confusion Matrix
confusionMatrix(prediction_gbm, test_data$Outcome)

Here we can see several useful metrics for our model. For instance, we have an `Accuracy` of 0.83, a `Sensitivity` of 0.63 and a `Specificity` of 0.91. 

You may be wondering whether our model performs well enough. That depends on the type of condition we're predicting, and on whether alternative predictive models or tools exist and how our new model compares with them. Additional research should always be done to consider whether a model's result is not only statistically significant, but **clinically significant**. 

### What Is A Confusion Matrix

For a yes/no outcome, a confusion matrix is a 2x2 table which counts the 4 different combinations of predicted vs. actual values. The combinations are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

<img src="https://miro.medium.com/max/320/1*Z54JgbS4DUwWSknhDCvNTQ.png" align="center" style="margin-bottom: 0.5em; margin-top: 0.5em;">

These 4 interpretations can be combined to generate many useful metrics. For our purpose there are three we will focus on. The first is accuracy:

\begin{equation*}
\large \text{Accuracy} = \frac{\large TP + TN}{\large TP + TN + FP + FN}
\end{equation*}

Accuracy allows us to measure how often our model predicted correctly. The second metric is sensitivity:

\begin{equation*}
\large \text{Sensitivity} = \frac{\large TP}{\large TP + FN}
\end{equation*}

Sensitivity asks the question: when the outcome is actually positive (i.e. in our case, when a patient actually has diabetes), how often will the model correctly predict positive (i.e. how often will the model flag that patient as high risk)? The final metric is specificity:

\begin{equation*}
\large \text{Specificity} = \frac{\large TN}{\large FP + TN}
\end{equation*}

Specificity asks the question: when the outcome is actually negative (i.e. in our case, when a patient does not have diabetes), how often will the model correctly predict negative (i.e. how often will the model classify that patient as low risk)? 

> Note: Sometimes you may see precision and recall used instead of sensitivity and specificity. While recall is equivalent to sensitivity, precision is equivalent to something known as positive predictive value. Going into the differences is beyond the scope of this case. Just know that these measures provide more information than accuracy alone. Precision and recall are commonly used in computer science while sensitivity and specificity are more commonly used in medicine. 
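
To make the formulas concrete, here is a small worked example. The counts are hypothetical and chosen only to keep the arithmetic easy. They are not results from our model. The vectors `actual` and `predicted` are made-up labels, where 1 means diabetes and 0 means no diabetes.

```r
# Hypothetical labels for 150 patients (1 = diabetes, 0 = no diabetes)
actual    <- c(rep(1, 50), rep(0, 100))
predicted <- c(rep(1, 30), rep(0, 20),   # the 50 patients with diabetes
               rep(1, 10), rep(0, 90))   # the 100 patients without diabetes

# Count the four cells of the confusion matrix
TP <- sum(predicted == 1 & actual == 1)  # 30
FN <- sum(predicted == 0 & actual == 1)  # 20
FP <- sum(predicted == 1 & actual == 0)  # 10
TN <- sum(predicted == 0 & actual == 0)  # 90

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 120 / 150 = 0.8
sensitivity <- TP / (TP + FN)                   # 30 / 50   = 0.6
specificity <- TN / (FP + TN)                   # 90 / 100  = 0.9

# Precision (positive predictive value), mentioned in the note above
precision   <- TP / (TP + FP)                   # 30 / 40   = 0.75

cat("Accuracy:", accuracy, " Sensitivity:", sensitivity,
    " Specificity:", specificity, " Precision:", precision, "\n")
```

Notice that accuracy looks reasonable at 0.8 even though the model misses 20 of the 50 patients with diabetes, which is why sensitivity and specificity are reported alongside it.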

## Evaluating our Model

We will be evaluating our model using a receiver operating characteristic (ROC) curve and the area under the curve (AUC) value. 

> If you're unsure what a ROC or AUC value is, please consult section 5.1.1 ('Understanding ROC Curves and AUC Values')

In [None]:
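# Create a ROC curve
# Note: prediction_gbm holds predicted classes rather than probabilities,
# so this curve is built from a single cut-off point
ROC <- roc(response = test_data$Outcome,
           predictor = factor(prediction_gbm, ordered = TRUE))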

# Plot ROC with ggplot2
plot_ROC <- ggroc(ROC)
plot_ROC

In [None]:
cat('AUC:', round(auc(ROC), 2))

The closer to the top left corner our ROC curve, the better. The higher our AUC value, the better. These metrics provide useful measures when tuning our model. They are also better overall measures than accuracy alone. We can compare different models using these two metrics. 

### Understanding ROC Curves and AUC Values

A ROC curve plots sensitivity (also known as the true positive rate) against 1 - specificity (also known as the false positive rate) at each possible cut-off for calling a prediction positive. A model that performs no better than a coin flip will have a ROC curve which is just a diagonal line. A model whose curve reaches the top left corner is a perfect model. The area under the curve (AUC) summarizes the whole ROC curve in a single number. The closer the ROC curve is to the top left corner, the higher the AUC value. The higher the AUC value, the better. AUC values range from 0 to 1, with 0.5 corresponding to the diagonal line. 
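
To see where the curve and the AUC value come from, the sketch below builds both by hand in base R. The labels in `actual` and the predicted probabilities in `score` are made up for illustration. In practice a package such as `pROC` does this work for you.

```r
# Hypothetical labels (1 = diabetes) and predicted probabilities for 10 patients
actual <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
score  <- c(0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10)

# Try every observed score as a cut-off: predict positive when score >= cut-off.
# Starting at Inf gives the point where nobody is predicted positive.
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE))

# True positive rate (sensitivity) and false positive rate (1 - specificity)
tpr <- sapply(cutoffs, function(cutoff) sum(score >= cutoff & actual == 1) / sum(actual == 1))
fpr <- sapply(cutoffs, function(cutoff) sum(score >= cutoff & actual == 0) / sum(actual == 0))

# Area under the curve using the trapezoid rule
auc_value <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
cat("AUC:", auc_value, "\n")  # 0.72

# Plot the curve together with the diagonal "coin flip" line
plot(fpr, tpr, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)
```

An AUC of 0.72 can also be read as follows: if we pick one patient with diabetes and one without at random, the model gives the patient with diabetes the higher score 72% of the time.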

<img src="https://miro.medium.com/max/406/1*pk05QGzoWhCgRiiFbz-oKQ.png" style="float: center; width: 34%; margin-bottom: 0.5em;">

## Explaining our Model

An important part of building any model is explaining it. We will be measuring the variable importance for our model. The higher the variable importance, the more important that variable is for predicting patients at high risk of diabetes. This allows us to quantitatively compare which variables are more important and how much more important they are. 

In [None]:
test <- varImp(model_gbm)
ggplot(test)

We can see that `Glucose` was by far the most important variable. This makes sense since diabetes is characterized by abnormally high blood glucose levels. What's surprising is how much of an effect `DiabetesPedigreeFunction` has. There are two main variants of diabetes, Type 1 and Type 2. Type 2 diabetes has a strong genetic component, so this suggests that many patients in our data have Type 2 diabetes. Variable importance can be a good way to look for surprising results. Any surprising variables can be the subject of further investigation!

# Function

You've now learned everything you need to begin testing your own models! Analytics is an iterative process and requires constantly tuning and testing different models against one another. You will now build your own model and see how it performs.

One of the most important decisions in model building is deciding which features/variables to include in your model. This will have a huge impact on your model's performance. Run the code below to see all the features in our data. 

In [None]:
colnames(training_data)

From the list of features above, pick the features you believe are the most predictive of diabetes. You can type your features in between the parentheses below. Please follow the format shown in the example below.

<code>c("Pregnancies","BloodPressure")</code>

Be careful! If there are any typos, this will not work and you will need to run the code below again with the typos corrected. 

In [None]:
features <- c("Pregnancies","Glucose","BloodPressure")
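
If you'd like to catch typos before training, you can compare your feature names against the column names and preview the formula they would produce. This is a hypothetical sketch and does not show how `test_model()` works internally. The vector `available` is typed out by hand here so the example runs on its own (in the notebook, `colnames(training_data)` would play that role), and `toy_formula`, `unknown` and `misspelled` are names made up for this illustration.

```r
# Feature names as they would be typed in the cell above
features <- c("Pregnancies", "Glucose", "BloodPressure")

# Stand-in for colnames(training_data); typed out here as an assumption
available <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
               "bmi_interp", "Insulin", "DiabetesPedigreeFunction", "Age", "Outcome")

# Any name that is not a column is probably a typo
unknown <- setdiff(features, available)
if (length(unknown) > 0) {
  cat("Check the spelling of:", paste(unknown, collapse = ", "), "\n")
} else {
  cat("All features found!\n")
}

# reformulate() turns a character vector of names into a model formula
toy_formula <- reformulate(termlabels = features, response = "Outcome")
print(toy_formula)

# Example of a typo being caught
misspelled <- setdiff(c("Glucose", "BloodPresure"), available)
print(misspelled)
```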

Now we will see how our model performs using the features you selected. The code below will output evaluation metrics. 

In [None]:
test_model(features)

Experiment with different features. As mentioned earlier, feature selection makes a huge difference in model performance. Determine which features work best and see if you can beat the model we built earlier in the case! You can rerun the <code>test_model()</code> function with new features as many times as you like. Please refer to the instructions after our variable list to see how to select new features. 

Congratulations! You've reached the end of the case! This case provided just one example of how analytics and healthcare can be combined to solve clinical problems. I hope your curiosity has been piqued. Data analytics will become increasingly important in healthcare as time goes on. It is a skill well worth investing in. 