# [MCB 32]: Lab 09 - Diabetes Classification
---

<div class="alert alert-info"> 
**NOTE:** You will work in pairs for this lab and turn in one copy of the notebook per pair. Write the names of the two people who worked on this notebook together in the box below.
</div>

Student #1: **FIRST STUDENT**

Student #2: **SECOND STUDENT**

### Professor Robin Ball


In this lab, we will explore a data set on diabetes to find out which health measurements are most strongly correlated with diabetes. After we establish the relationships between these measurements and diabetes, we will build a classifier that, given a patient's health measurements, classifies whether that patient has diabetes or not.

### Table of Contents
1. [The Setup](#section 1)
2. [Exploring the Data](#section 2)
3. [Conceptual Background](#section 3)
4. [Building the Classifier](#section 4)
5. [Testing the Classifier](#section 5)

### Completing the Notebook

<div class="alert alert-info"> 

**QUESTION** cells are in blue and ask you to make graphs, answer conceptual questions, or do other lab tasks. To receive full credit for your lab, you must complete all **QUESTION** cells.

</div>

To run a code cell once it's been selected, 
- press Shift-Enter, or
- click the Run button in the toolbar at the top of the screen. 

If a code cell is running, you will see an asterisk (\*) appear in the square brackets to the left of the cell. Once the cell has finished running, a number will replace the asterisk and any output from the code will appear under the cell.

---

# 1. The Setup<a id='section 1'></a>

The first thing we need to do is set up our environment so we can build tables and visualize the data, so just run the next cells.

In [None]:
!pip install datascience numpy matplotlib scipy scikit-learn

In [None]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
%matplotlib inline
plt.style.use('fivethirtyeight')

Then we will load in the data on the patients into a table in the next cell.

It is important to understand what the information inside our table is, including the columns, rows, and individual entries. Each row corresponds to a particular patient and the health measurements recorded for that patient when the data was gathered. As a reminder, here is a breakdown of the columns in our table.

Variable       | Description
-------------- | ------------------------------------------------------------------
Pregnancies | The number of pregnancies the patient has had
Glucose | Plasma glucose concentration 2 hours after the oral glucose tolerance test
BloodPressure | Diastolic Blood Pressure (mm Hg)
SkinThickness | Triceps skin fold thickness (mm)
Insulin | 2-hour serum insulin
BMI | Body mass index (weight in kg/(height in meters)^2)
DiabetesPedigreeFunction | A function which extrapolates the genetic risk a patient has of getting diabetes based on history of the disease in family/relatives
Age | Age of patient (years)
Outcome | Indicates whether the patient has diabetes or not. 0: No, 1: Yes


In [None]:
diabetes = Table.read_table("diabetes.csv")

In [None]:
diabetes

---

# 2. Exploring the Data<a id='section 2'></a>

Now would be a good time to do some preliminary exploration into our dataset. When doing this we can try to find different relationships or correlations between variables in our dataset. This can give us an idea about the structure and nature of our dataset and the variables inside of it.

The following cell shows us the dimensions of our table. It looks like our table has 9 columns and 768 rows. Each row corresponds to one patient, so we have 768 patients.

In [None]:
diabetes.num_columns, diabetes.num_rows

We can also check whether there is any missing or anomalous data in our dataset.

In [None]:
diabetes.group('Insulin')

The table above shows the number of patients at each level of insulin. Do you notice anything interesting about one of the counts? One of the counts is 374, and it corresponds to an insulin level of 0. It isn't possible to have an insulin level of 0. Can there be some other explanation as to why so many of the patients in our dataset have a measured insulin level of 0?

What happened is that many patients did not have their insulin level measured, so the constructors of this dataset used 0 as a default value for those patients. This is a common practice, and it is something data scientists often have to recognize and work around. Having a lot of 0's for insulin can hurt our analysis of this dataset. To make up for it, we can either ignore the rows whose insulin value is 0 or ignore the entire insulin column in our table.
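
As a small illustration of the first of these options, the sketch below treats 0 as "not measured" and leaves those entries out before averaging. The readings are made up for the example and are not taken from `diabetes.csv`.

```python
# Made-up insulin readings; 0 stands for "not measured", as in our dataset.
insulin = [0, 94, 168, 0, 88, 0, 543, 0]

# Count the placeholder zeros.
num_missing = insulin.count(0)

# Keep only the readings that were actually measured.
measured = [value for value in insulin if value != 0]

# Averaging with the zeros included drags the mean down.
mean_with_zeros = sum(insulin) / len(insulin)
mean_measured = sum(measured) / len(measured)

print(num_missing, mean_with_zeros, mean_measured)
```

Half of the made-up readings are placeholders, so the mean of the measured values is twice the mean computed with the zeros left in.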

For example, in the following cell, we decide to select all rows or individuals in our dataset who don't have 0 as their insulin, or, in other words, had their insulin level measured. Then, we construct a histogram which compares the distribution of insulin for those with diabetes in our dataset and the distribution of insulin for those without diabetes. Which one do you think should have a lower insulin level on average, those with diabetes or those without?

In [None]:
diabetes.where('Insulin', are.not_equal_to(0)).hist('Insulin', group='Outcome')

<div class="alert alert-info"> 

**QUESTION**: In the cell below, explain the differences you see in insulin levels between those with diabetes and those without. Explain why this difference exists using your knowledge of biology and what you read in the lab09 overview. (Hint: which distribution or histogram is "higher" or further to the right? Remember, 0 represents those without diabetes and 1 represents those with diabetes.)

</div>

**Replace this text with your answer (double click the cell to enable editing)**

We can also try to find other relationships between the variables in our data. We can try to answer questions like: How is BMI related to Age in the people in our dataset? How do people with diabetes compare to those without diabetes when it comes to their glucose levels in our dataset?

The following scatterplot plots an individual's Age vs. the number of pregnancies they have had. Look at the scatterplot and try to see what the relationship between Age and Pregnancies is.

In [None]:
diabetes.scatter('Age', 'Pregnancies', fit_line=True)

<div class="alert alert-info"> 

**QUESTION**: In the cell below, explain the relationship between age and number of pregnancies in the above scatterplot. Use the plotted line to help explain this relationship. How does the number of pregnancies seem to change with increasing age for the people in our dataset?

</div>

**Replace this text with your answer**

---

# 3. Conceptual Background<a id='section 3'></a>

Now that we have explored our data a little, it's a good time to delve further into the idea of classification and, more generally, machine learning.

The next section is dense so it might be helpful to walk through the concepts with a partner or in a group.

Machine learning is a field that tries to make accurate models and predictions about the world using data. Machine learning algorithms can become quite complex, to the point where an algorithm can start to learn from data on its own. In today's lab, we will build a machine learning model using supervised learning. The model we will build with our data is a classifier.

Classification is a type of machine learning that takes in data about a subject and tries to predict which category or group that subject belongs to, based on knowledge we already have. That knowledge generally comes from a dataset of subjects whose categories we already know. The categories, or classes, for our data are diabetic and non-diabetic. We already have a dataset of individuals for whom we know whether each individual is diabetic or not.

Our classifier will use a K-nearest neighbors algorithm to classify an input as diabetic or non-diabetic. Please don't get too caught up on the scary name. Essentially, this process finds the closest individuals in our dataset to the new individual we are trying to classify. It does this by calculating distances between our individual of interest's health measurements and the health measurements of all the individuals in the dataset. Once it has these k-closest individuals, it can then classify our new individual as diabetic or non-diabetic based on the classifications of the individuals we found are closest to him or her.

Here is an example: 

Say I am trying to classify Natalie with a 3-Nearest Neighbor classifier and find that her three nearest neighbors in my dataset are Jose, Kimberly, and Luke. I know that Luke and Kimberly have diabetes and that Jose doesn't, because they are in my dataset of people whose classification I know. Because a majority of the three nearest neighbors to Natalie are diabetic, I would classify Natalie as diabetic.
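
The vote in Natalie's example can be sketched in a few lines of plain Python. The helper name `majority_class` is made up for this illustration and is not part of the lab's code.

```python
from collections import Counter

def majority_class(neighbor_labels):
    """Return the most common label among the nearest neighbors."""
    counts = Counter(neighbor_labels)
    label, _ = counts.most_common(1)[0]
    return label

# Natalie's three nearest neighbors and their known classifications.
natalie_neighbors = {
    "Jose": "non-diabetic",
    "Kimberly": "diabetic",
    "Luke": "diabetic",
}

natalie_classification = majority_class(natalie_neighbors.values())
print(natalie_classification)
```

Two of the three neighbors are diabetic, so the vote comes out as diabetic, matching the reasoning above.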

<div class="alert alert-info">
**QUESTION**: Classify Adel based on the following information and put your answer into the quotations for the variable adel_classification 
</div>


I find that Adel's 5-nearest neighbors are Tracy, Hanh, Jordan, Justin, and Sam. Sam and Tracy have diabetes, but Jordan, Justin, and Hanh do not. How should I classify Adel?

In [None]:
adel_classification = ""

Now is a good time to take a deeper look at the data we have for diabetes and see if we can see any problems with the data in relation to the classifier we will be building.

In [None]:
diabetes

One problem with the data is that the different columns, or health measurements, we take on an individual have different scales. For example, Glucose has numbers in the hundreds, Age in the tens, and the Diabetes Pedigree Function is typically a number between 0 and 1. Could this be a problematic feature of our data set, given that we are measuring the distances between two individuals with these health measurements?

The problem with having different scales when we calculate the distance between two people is that two people could be very close to each other in their Diabetes Pedigree Function (DPF) score, but we may not be able to see this because Glucose is on a much larger scale. For example, say we have two pairs of people and we just have their Glucose and DPF data in the form (Glucose, DPF). The first pair of people have data of (140, 0) and (150, 1), and the second pair have data of (140, 0.350) and (153, 0.352). The first pair has a distance of $\sqrt{10^2 + 1^2} = \sqrt{101} \approx 10.05$, but the second pair has a distance of approximately $\sqrt{169} = 13$. What we aren't taking into account is that a difference of 1 in DPF scores is actually very significant because it spans nearly the whole range of typical DPF scores. Meanwhile, the second pair's difference in Glucose is only three larger than the first pair's, yet our distance calculation puts a lot of weight on it even though it's not that significant a difference in Glucose levels. The second pair should actually be considered closer, because their DPF scores are almost identical and their glucose measurements are not much further apart than the first pair's.
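
To check the arithmetic for the two pairs, here is a minimal sketch of the distance calculation on the raw (Glucose, DPF) values. The function name `distance` is just for this illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two (Glucose, DPF) points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

first_pair = distance((140, 0), (150, 1))
second_pair = distance((140, 0.350), (153, 0.352))

print(first_pair)   # sqrt(101), a little over 10
print(second_pair)  # very close to 13
```

On the raw scale the second pair looks further apart, purely because the Glucose difference dominates the sum.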

One way we can account for this is to put all of the data into standard units. Essentially, instead of using the raw data, we record the number of standard deviations that a specific entry is above or below the mean of its column. This puts most data points in a range from about -3 to 3. Once we do this, scaling will no longer distort our distance calculations.

In [None]:
def standard_units(x):
    mean = np.mean(x)
    std = np.std(x)
    normalized_x = (x - mean)/std
    return normalized_x

In [None]:
outcomes = diabetes.column("Outcome")
diabetes_std = util.table_apply(diabetes, standard_units)
diabetes_std = diabetes_std.drop("Outcome")
diabetes_std = diabetes_std.with_column("Outcome", outcomes)
diabetes_std
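
To see what the conversion does to a single column, here is a small sketch using only the standard library. The glucose readings are made up, and `to_standard_units` is a stand-in written for this example (it uses the population standard deviation, which is also NumPy's default for `np.std`).

```python
import statistics

def to_standard_units(values):
    """Convert a list of numbers to standard units using the population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Made-up glucose readings, for illustration only.
glucose = [85, 100, 115, 130, 170]
glucose_su = to_standard_units(glucose)

# Values below the mean become negative, values above it become positive.
print([round(z, 2) for z in glucose_su])

# After the conversion the column has mean 0 and standard deviation 1
# (up to floating-point rounding).
new_mean = statistics.fmean(glucose_su)
new_sd = statistics.pstdev(glucose_su)
```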

Now that we have an idea of how our classification works and we've fixed our data, how will we know how well our classifier works? In order to know this, we would have to have access to new data which we have not created our classifier with and that still has the known classification of the individuals in it. We could go scouring the internet for a new data set, but luckily data scientists have come up with a clever trick to get around this.

The idea is to split up the data set we already have into what are called training and testing sets. The training set will be larger and will consist of the data that we use to create our classifier, while the testing set will not be accessed or viewed until we want to test how well our classifier works. Because we already know the classification of everyone in our data set, we will know whether each individual in our testing set has diabetes. Then we will try to classify everyone in our testing set with our classifier and see how often it makes the right prediction.

The next cell calls a function on our table, .split(k), which will automatically split our tables into one with k rows that are randomly sampled from the original table and one table with the rest of the rows in our original table. We usually want between 10-30% of our data in the testing set and the rest in the training.

In [None]:
diabetes_testing, diabetes_training = diabetes_std.split(170)

In [None]:
diabetes_training.num_rows, diabetes_testing.num_rows
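
The idea behind the split can be sketched in plain Python. This is only an illustration of random splitting, not how the table's own method is implemented; `split_rows` and its `seed` parameter are made up for the example.

```python
import random

def split_rows(rows, k, seed=None):
    """Randomly split rows into a list of k rows and a list of the rest."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:k], shuffled[k:]

# Row numbers stand in for the 768 patients in our table.
patients = range(768)
testing, training = split_rows(patients, 170, seed=0)

print(len(training), len(testing))
print(len(testing) / len(patients))  # share of the data held out for testing
```

With 170 of 768 rows held out, about 22% of the data goes to the testing set, which falls inside the 10-30% range mentioned above.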

---

# 4. Building the Classifier<a id='section 4'></a>

Now that we finally have an understanding of classification and a training data set we can build our classifier with, we can finally start to build it. From now on we will be exclusively using our training set to build our model.

What we need to figure out is which health measurements in our data will give us the most accurate classifications. We can do this by finding out which variables differ the most between people with diabetes and people without diabetes. One thing we can check is the difference in means between the two groups of people.

In [None]:
diabetes_training.select("BloodPressure", "Outcome").group("Outcome", np.mean)

What we did in the above cell is select the BloodPressure and Outcome columns in our table (remember, Outcome has two possible values, 0 and 1, for non-diabetic and diabetic), and then group by Outcome. Grouping separates our entries by the unique values in that column, so our entries are split into the people who have 0 and the people who have 1 for their Outcome value. We also pass the function np.mean into the group method. This tells the computer how it should aggregate the BloodPressure entries in each group: np.mean tells it to find the mean BloodPressure of each group. So the second column contains the means for each group, 0 and 1.

Remember that our variables are in standard units now, so the values indicate how many standard deviations above or below the mean each group is on average. If you are unfamiliar with this terminology, just think of a standard unit as a measure of how far above or below the mean a specific number is. A negative number is below the mean, and a positive number is above the mean. Numbers that are close to zero are close to the mean of that column. About as far as a number usually gets from the mean is 3 or -3, with 3 being far above the mean and -3 being far below it.

For our table above we see group 0 has an average of -0.0685657, meaning that, on average, non-diabetic patients have a BloodPressure below the mean BloodPressure for everyone in our dataset. The mean for group 1 is 0.0584094, meaning the average person with diabetes has a BloodPressure above that same mean. (Because the training set is sampled at random, the exact values you see may differ slightly.) Both values are pretty close to zero, though, so there isn't a big difference in means between the two groups.
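
For readers who want to see the grouping step spelled out, here is a plain-Python sketch of "split by outcome, then average". The values are made up, and `group_means` is a stand-in written for this example rather than the table method itself.

```python
from collections import defaultdict

def group_means(values, outcomes):
    """Mean of `values` for each distinct outcome."""
    grouped = defaultdict(list)
    for value, outcome in zip(values, outcomes):
        grouped[outcome].append(value)
    return {outcome: sum(vals) / len(vals) for outcome, vals in grouped.items()}

# Made-up BloodPressure values in standard units, for illustration only.
blood_pressure = [-0.5, 0.25, -0.25, 0.75, 0.0, 0.5]
outcome = [0, 1, 0, 1, 0, 1]

means = group_means(blood_pressure, outcome)
difference = means[1] - means[0]  # gap between the diabetic and non-diabetic means

print(means)
print(difference)
```

The larger this gap is for a given column, the more that column separates the two groups on average.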

<div class="alert alert-info">
**QUESTION**: Now try finding the mean of different health measurements for each group, 0 and 1, in our training data. Try doing it for the "Age" column and the "BMI" column. (Hint: look at where we passed the "BloodPressure" column label in the previous line of code.) Observe whether the means are above or below zero and how far above or below zero they are for each group. Write which variable seems to have the biggest disparity so far in the biggest_mean_difference variable below.
    </div>

In [None]:
#Replace the Ellipsis with your code to compute the mean table for each column
bmi_means = ...
bmi_means

In [None]:
age_means = ...
age_means

In [None]:
biggest_mean_difference = ""
biggest_mean_difference

Seeing a difference in the mean between the two groups of those who are diabetic and those who are not in our training data is helpful in deciding which health measurements have the biggest disparities between those two groups. However, it is not that intuitive or easy to see and it might be better to try to visualize the data. If we could see the distribution of the health measurements for the different groups then it might be easier to decide which health measurements seem to be the biggest difference makers between diabetic and non-diabetic patients. 

Luckily, we have a tool that we have seen before called a histogram, which places our data into bins. Each bin has a height, which reflects the percentage of the data in that bin, and a width, which is the range of the data in that bin. You don't have to completely understand the intricacies of a histogram. What is important to see is that taller bars contain more of the data and that the histogram overall shows the distribution of the data. Therefore, if we create a histogram for each group for, say, BloodPressure, we can see a more complete picture of how diabetic and non-diabetic people differ in BloodPressure.

In [None]:
diabetes_training.select("BloodPressure", "Outcome").hist(group="Outcome")

Diabetic patients have the yellow histogram and non-diabetic patients have the blue histogram. We can see that there are some differences between the two distributions. The diabetic histogram seems to have values that are, on average, above the mean. We can see this because its center, or where it peaks, is above 0 (again, we are in standard units), while the non-diabetic histogram has a center below zero. We can see from the histograms that those with diabetes seem to have higher blood pressure than those without diabetes in our dataset. We can also see that the difference is not that large because the histograms overlap quite a bit. This agrees with the table above, where we found the difference in BloodPressure means between groups 0 and 1.

<div class="alert alert-info">
**QUESTION**: In the cells below, create the histogram indicated by the comment. Use the above histogram call as a scaffold for your histograms. Once you have created these histograms answer the questions that follow them.
    </div>

In [None]:
#Create the histogram for the "Age" column
diabetes_training.select(..., "Outcome").hist(group="Outcome")

In [None]:
#Create the histogram for the "BMI" column
...

In [None]:
#Create the histogram for the "DiabetesPedigreeFunction" column
...

In [None]:
#Create the histogram for the "Glucose" column
...

In [None]:
#Create the histogram for the "Pregnancies" column
...

In [None]:
#Create the histogram for the "SkinThickness" column
...

<div class="alert alert-info">
**QUESTION**: Looking at the above histograms that you made, which variable seems to have the biggest disparity between patients who are diabetic and patients who are not diabetic? (Essentially, which variable results in the two most different histograms?) Using your knowledge of diabetes, why would this variable be so important in distinguishing between diabetic people and non-diabetic people? Should we use this variable in our classifier?
    </div>

**Replace this text with your answer**

<div class="alert alert-info">
**QUESTION**: Look at the histogram provided for BloodPressure and the one you made for BMI. Based on these histograms, which variable seems to be the more discriminating of the two between people with diabetes and people without diabetes? Which one would be better to use for our classifier?
    </div>

**Replace this text with your answer**

---

# 5. Testing the Classifier<a id='section 5'></a>

After creating and analyzing the above histograms, we now have a good idea of which variables would be the most effective to use in our classifier. The cells below implement the K-Nearest Neighbors algorithm we talked about in the Conceptual Background section earlier in this lab. You are not expected to understand how these functions work in detail, but it is good to understand how the classifier works at a high level.

The `classifier_accuracy` function takes in our training set and uses it to predict the classifications of the individuals in our testing set. Then it checks how many of our predictions were right and returns a decimal between 0 and 1. This decimal is the ratio of how many predictions we got right to how many predictions we made. Multiply this number by 100 and you get the percentage accuracy of our classifier. To call it, pass the training data as the first argument, the test data as the second argument, and the number of neighbors you want the K-Nearest Neighbors algorithm to use as the third, like this:

``` python
classifier_accuracy(training, testing, k)
```

Run the cells below to make sure we can access the function.

In [None]:
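def get_k_nearest(test, table, k):
    """Return distances to, and row indices of, the k rows of `table` nearest each row of `test`."""
    x = test.rows
    x_array = np.array(x)
    # Fit scikit-learn's nearest-neighbor search on the training rows.
    nbrs = NearestNeighbors(n_neighbors=k).fit(np.array(table.rows))
    distances, indices = nbrs.kneighbors(x_array)
    return distances, indices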
    

In [None]:
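def classifier_accuracy(table, test, k):
    """Return the fraction of rows in `test` that k-nearest neighbors classifies correctly.

    `table` is the training set; "Outcome" must be the last column of both tables.
    """
    # Find each test row's k nearest training rows, ignoring the Outcome column.
    distance, indices = get_k_nearest(test.drop("Outcome"), table.drop("Outcome"), k)
    rows = table.rows
    num_right = 0
    classifications = []
    test_rows = test.rows
    for index in indices:
        # Collect the known outcomes of this test row's nearest neighbors.
        classes_nearest = []
        diabetes_count = 0
        for i in index:
            classes_nearest += [rows[i][-1]]
        for elem in classes_nearest:
            if elem == 1:
                diabetes_count += 1
        # Majority vote: predict diabetic only if more than half the neighbors are.
        if diabetes_count > k/2:
            classifications += [1]
        else:
            classifications += [0]
    # Compare each prediction with the known outcome of the test row.
    for i in range(len(classifications)):
        if classifications[i] == test_rows[i][-1]:
            num_right += 1
    return num_right/len(test_rows)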
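
If you would like to see the same idea without any libraries, the sketch below is a simplified stand-in for the functions above, run on a handful of made-up points. The names `knn_predict` and `accuracy` and all of the data are invented for this illustration; the lab itself uses the functions defined in the cells above.

```python
import math

def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest training points."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], new_point),
    )
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    # Labels are 0 or 1, so the sum counts the diabetic neighbors.
    return 1 if sum(nearest_labels) > k / 2 else 0

def accuracy(train_points, train_labels, test_points, test_labels, k):
    """Fraction of test points whose predicted label matches the known label."""
    predictions = [knn_predict(train_points, train_labels, p, k) for p in test_points]
    num_right = sum(pred == actual for pred, actual in zip(predictions, test_labels))
    return num_right / len(test_labels)

# Made-up (Glucose, BMI) points in standard units, for illustration only.
train_points = [(-1.0, -0.8), (-0.6, -1.1), (-0.9, -0.2), (0.8, 1.0), (1.2, 0.4), (0.7, 0.9)]
train_labels = [0, 0, 0, 1, 1, 1]
test_points = [(-0.7, -0.6), (1.0, 0.8), (0.1, 0.2), (-0.2, -0.1)]
test_labels = [0, 1, 0, 0]

toy_accuracy = accuracy(train_points, train_labels, test_points, test_labels, k=3)
print(toy_accuracy)
```

The third test point sits between the two clusters and is the one the toy classifier gets wrong, which is typical of where nearest-neighbor classifiers make mistakes.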

For `diabetes_training_best` and `diabetes_testing_best`, input the labels of the columns you think would be best to use for our classifier. You can figure this out from the analysis we did on the histograms before. You can also compute the difference in means like we did before and see which difference is the largest. Try different combinations of columns even if you think they won't work as well as others. Seeing how different columns or variables change our accuracy is important and can verify the assumptions we made about the importance of those specific columns. You can also change the number of neighbors you want to use and see how the accuracy changes. Currently, it is set at `5`, but you could change it to another number. Make sure the number is odd, though, because that way we can guarantee a majority of either diabetic or non-diabetic neighbors.
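
To see why an odd number of neighbors matters, here is a minimal sketch of the voting rule used in `classifier_accuracy` (predict 1 only when more than half of the neighbors are diabetic). The helper `vote` is made up for this illustration.

```python
def vote(neighbor_outcomes):
    """Return 1 only if strictly more than half of the neighbors are diabetic."""
    k = len(neighbor_outcomes)
    return 1 if sum(neighbor_outcomes) > k / 2 else 0

even_k_result = vote([1, 1, 0, 0])     # k = 4, tied 2-2: falls through to 0
odd_k_result = vote([1, 1, 0, 0, 1])   # k = 5, 3-2: a clear majority, so 1

print(even_k_result, odd_k_result)
```

With an even `k`, a tie is silently resolved as non-diabetic, so the result depends on the rule rather than on the neighbors.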

You can make your classifier use all the columns if you would like. Just make sure to keep the "Outcome" column because we want to know what the classification of the people are in both the training and testing sets. Try to make your accuracy as high as possible by using what you think are the most important columns/variables. Just enter it in like this:

``` python
diabetes_training_best = diabetes_training.select("Column1", "Column2", "Column3", ..., "Outcome")
diabetes_testing_best = diabetes_testing.select("Column1", "Column2", "Column3", ..., "Outcome")
```

<div class="alert alert-info">
**QUESTION**: Follow the example above to try to reach the highest possible accuracy. Use a variety of different columns in the dataset, and change the number of nearest neighbors to see if your accuracy changes. For each attempt that you run, make a new cell below as marked.
</div>

In [None]:
# Replace ... with the names of columns you want to include in the classifier.
diabetes_training_best = diabetes_training.select(..., "Outcome")
diabetes_testing_best = diabetes_testing.select(..., "Outcome")
classifier_accuracy(diabetes_training_best, diabetes_testing_best, 5)

In [None]:
# Create more boxes here and below with your other attempts.

The highest percentage accuracy you got was probably around 75%. So what's the big deal? Well, try thinking about it this way. 

If you were to guess heads or tails on a coin flip 170 times (which is the size of our testing set), the chance of getting 75% or more of those guesses right is minuscule. If you did get 75% or more of them right, that would probably be reason to believe that you could predict the future.
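
For the curious, that chance can be computed exactly. The sketch below assumes each guess is independently right with probability 1/2 and adds up the binomial probabilities for 128 or more correct guesses out of 170.

```python
import math

n = 170                        # number of guesses, the size of our testing set
needed = math.ceil(0.75 * n)   # smallest number of correct guesses that reaches 75%

# Count the ways to get at least `needed` guesses right, out of 2**n equally likely outcomes.
ways = sum(math.comb(n, k) for k in range(needed, n + 1))
probability = ways / 2 ** n

print(needed, probability)
```

The result is far smaller than one in a billion, which is why an accuracy near 75% is strong evidence that the classifier is doing much better than blind guessing.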

Essentially, that is what our classifier is doing. It is making guesses (educated ones) about a person given the nature of the data surrounding that person. A tool like this can be very important to doctors and instrumental in saving people's lives. It is important to make a classifier such as this as accurate as possible and even to bias it to classify people as diabetic more often. We do this because we don't want people walking around thinking they don't have diabetes when they actually do, so what we're afraid of is a false negative.
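
One way to picture this bias is to lower the number of diabetic neighbors required before predicting "diabetic". The sketch below is a made-up illustration, not part of the lab's classifier: `classify`, `count_errors`, and all of the data are invented for the example.

```python
def count_errors(predictions, actual):
    """Count false negatives (missed diabetics) and false positives."""
    false_negatives = sum(p == 0 and a == 1 for p, a in zip(predictions, actual))
    false_positives = sum(p == 1 and a == 0 for p, a in zip(predictions, actual))
    return false_negatives, false_positives

def classify(diabetic_neighbors, threshold):
    """Predict 1 when at least `threshold` of the neighbors are diabetic."""
    return [1 if count >= threshold else 0 for count in diabetic_neighbors]

# Made-up data: number of diabetic neighbors (out of k = 5) and the true outcome.
diabetic_neighbors = [0, 1, 2, 2, 3, 4, 5, 2]
actual = [0, 0, 1, 0, 1, 1, 1, 1]

majority = classify(diabetic_neighbors, threshold=3)  # plain majority vote
cautious = classify(diabetic_neighbors, threshold=2)  # biased toward "diabetic"

majority_errors = count_errors(majority, actual)
cautious_errors = count_errors(cautious, actual)
print(majority_errors, cautious_errors)
```

In this toy data the lower threshold removes both false negatives at the cost of one false positive, which is the trade-off described above.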

<div class="alert alert-info">
**QUESTION**: Look up a research article on Pubmed about one of these variables and how it relates to diabetes. Write the citation in the box below and explain the main findings in 2-3 sentences.  [PubMed Link](https://www.ncbi.nlm.nih.gov/pubmed/)
</div>

**Replace this text with your answer**

### Saving the Notebook as an HTML

Congrats on finishing your final lab notebook! This time, you will be submitting this notebook as an ```HTML``` file. To turn in this lab assignment follow the steps below:

>1. In the toolbar above, click on `File` > `Download as` > `HTML`
2. The file should have been saved in an HTML format in your downloads folder.
3. Click to open in a web browser of your choice to make sure that everything looks okay.

Your lab instructor will explain to you what to do afterwards.

---

Notebook developed by: Jason Webb

Data Science Modules: http://data.berkeley.edu/education/modules