In [2]:
import os
import pandas as pd

# I. Overview of Machine Learning
## Scenario
In this module, we'll build a classifier to predict whether or not a patient has diabetes.

Let's say that you're assigned the task of determining which of a set of 100 patients have diabetes. You don't get to meet with the patients or run any tests. All that you're given is some information that was collected for each of them, such as how many pregnancies they've had and what their current glucose levels are.

**Option 1**

If you're a clinician and an expert in diabetes, this information might be enough for you to make your decisions. You might follow existing guidelines, such as those provided by the [National Diabetes Education Initiative](http://www.ndei.org/ADA-diabetes-management-guidelines-diagnosis-A1C-testing.aspx.html). For each patient, you consult the guidelines and decide whether they **have diabetes** or **do not have diabetes**.

**Option 2**

However, maybe the information that you've been provided isn't detailed or relevant enough. Or maybe you don't have the medical knowledge to make such a complex decision. In that case, an alternative approach could be to compare the patients you've been asked to classify to other similar patients who have already been diagnosed. You ask your data warehouse managers to retrieve the 8 columns of information for 900 other patients in the same population, plus whether or not they had diabetes. Now, you can see what kind of patterns occur in patients who have diabetes and those who do not, and you can use those patterns to make decisions about whether these new patients have diabetes.

Machine learning uses this second option to make decisions. A **classifier** is an algorithm that takes data as input and learns how to make decisions based on that data. In our case, our classifier will decide whether a patient has diabetes.
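To make Option 2 concrete, here is a minimal sketch of one possible way to "compare a new patient to similar, already-diagnosed patients": a 1-nearest-neighbor rule. It is only an illustration, not necessarily the classifier built in this module. The function name `predict_nearest`, the choice of two features (glucose and BMI), and the two new patients are assumptions made for this example; the five diagnosed patients are the first five rows of the dataset shown later in this notebook.

```python
import math

# (Glucose, BMI) -> Outcome for five already-diagnosed patients.
# Values come from the first five rows of the dataset shown later.
diagnosed = [
    ((148, 33.6), 1),
    ((85, 26.6), 0),
    ((183, 23.3), 1),
    ((89, 28.1), 0),
    ((137, 43.1), 1),
]


def predict_nearest(new_patient, known_patients):
    """Return the label of the known patient closest to new_patient."""
    _, nearest_label = min(
        known_patients,
        key=lambda item: math.dist(item[0], new_patient),
    )
    return nearest_label


# Hypothetical new patients (made-up values, for illustration only).
print(predict_nearest((90, 27.0), diagnosed))   # closest to (89, 28.1)
print(predict_nearest((170, 30.0), diagnosed))  # closest to (183, 23.3)
```

Note that this sketch measures distance on the raw values, so the feature with the larger numeric range (glucose) dominates the comparison. A real classifier would usually rescale the features first.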

## Definitions
- **Task** - what we want our classifier to do. We want our classifier to predict if a patient is **positive** or **negative** for diabetes
- **Model/Classifier** - this is the algorithm that we will use to make predictions
- **Training Data** - the data that we give our algorithm so that it can learn patterns. In our scenario, the training data is the information for the 900 patients who have already been determined to have or not have diabetes
- **Features** - the information that is collected for each patient in the dataset, such as number of pregnancies and glucose levels
- **Label/Outcome** - a classification for each patient in the training data. For example, if a patient has diabetes, then we might put a "1" in the outcome column, and if they don't, we might put a "0"
- **Training** - how our model learns to make predictions
- **Evaluation** - once our model has been trained, we measure how well its predictions match the known outcomes
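The terms above can be mapped onto a few lines of code. The following is a rough sketch under stated assumptions, not the model used in this module: the "model" is a single glucose cutoff, "training" places that cutoff halfway between the mean glucose of positive and negative training patients, and "evaluation" checks the prediction for one held-out patient. The names `train`, `held_out`, `threshold`, and `predict` are hypothetical; the five patients are the first rows of the dataset shown later in this notebook.

```python
import pandas as pd

# Five patients taken from the first rows of the dataset shown later.
data = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Training data: patients whose diagnosis the model may learn from.
train = data.iloc[:4]
# Held-out data: used only for evaluation.
held_out = data.iloc[4:]

# Features are the input columns; the label is the Outcome column.
feature_cols = ["Glucose", "BMI"]
X_train, y_train = train[feature_cols], train["Outcome"]

# Training: "learn" a glucose cutoff halfway between the two class means.
mean_pos = X_train.loc[y_train == 1, "Glucose"].mean()
mean_neg = X_train.loc[y_train == 0, "Glucose"].mean()
threshold = (mean_pos + mean_neg) / 2


# Model: predict positive (1) when glucose is above the learned cutoff.
def predict(features):
    return (features["Glucose"] > threshold).astype(int)


# Evaluation: compare predictions with the known labels of held-out patients.
accuracy = (predict(held_out[feature_cols]) == held_out["Outcome"]).mean()
print(threshold, accuracy)
```

Evaluating on a single patient says very little in practice; the point here is only to show where each term from the list appears in code.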

# Our Dataset
We will use the [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database/home), which can be downloaded from Kaggle. This dataset was originally created by the National Institute of Diabetes and Digestive and Kidney Diseases and contains data for 768 patients. All patients are women at least 21 years old of [Pima Indian heritage](https://en.wikipedia.org/wiki/Pima_people).

## What's in the Data?
Let's take a look at our dataset. We'll read in the data from a comma-separated file and look at it as a table where each row represents a different patient:

In [3]:
df = pd.read_csv('diabetes.csv')

In [4]:
df.head()

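| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |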
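After loading a file, it can help to confirm that the table has the shape and column types you expect. The sketch below is one way to do that. So that it runs without `diabetes.csv`, it reads the five rows shown above from an in-memory string; the names `sample_csv`, `sample`, and `feature_cols` are assumptions for this example. With the real file you would run the same checks on `df`.

```python
import io

import pandas as pd

# The five rows displayed above, as CSV text (stand-in for diabetes.csv).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
sample = pd.read_csv(sample_csv)

# Number of rows (patients) and columns (features plus the label).
n_rows, n_cols = sample.shape
print(n_rows, n_cols)

# Every column except Outcome is a feature.
feature_cols = [col for col in sample.columns if col != "Outcome"]
print(feature_cols)

# Column types: integer counts and measurements, floats for BMI and pedigree.
print(sample.dtypes)
```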


### Features
The **"features"** in a dataset are the information collected for each data point. In this scenario, the features are the 8 types of information collected for each patient.

Take a few minutes with your group and look through some of the features. Try to get a sense of what each attribute is measuring. Optionally, do some programmatic analysis to look at the mean, standard deviation, etc.

- **Pregnancies**: Number of times pregnant
- **Glucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- **BloodPressure**: Diastolic blood pressure (mm Hg)
- **SkinThickness**: Triceps skin fold thickness (mm)
- **Insulin**: 2-hour serum insulin (μU/mL)
- **BMI**: Body mass index (weight in kg/(height in m)^2)
- **DiabetesPedigreeFunction**: Diabetes pedigree function which considers family history of diabetes
- **Age**: Age (years)

In [5]:
# Optional: Do additional analysis
df.Pregnancies.describe()

count    768.000000
mean       3.845052
std        3.369578
min        0.000000
25%        1.000000
50%        3.000000
75%        6.000000
max       17.000000
Name: Pregnancies, dtype: float64
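The same kind of summary can be computed for several features at once. As a small sketch, the example below uses `DataFrame.agg` on three columns from the five rows shown earlier, so the numbers describe only those five patients and not the full dataset. The names `sample` and `summary` are assumptions for this example; on the real data you could call `df.describe()` to get the full table.

```python
import pandas as pd

# Three feature columns from the five rows shown earlier.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "Age": [50, 31, 32, 21, 33],
})

# One row per statistic, one column per feature.
summary = sample.agg(["mean", "std", "min", "max"])
print(summary)
```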

### Label
The **label** signifies what **class** each row belongs to. A **"1"** means that the patient has diabetes (positive class), while a **"0"** means that the patient does not have diabetes (negative class). This is contained in the *Outcome* column.

In [6]:
df.Outcome.value_counts()

0    500
1    268
Name: Outcome, dtype: int64
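Raw counts are easier to interpret as proportions. The sketch below recomputes them from the counts printed above, so it runs without the CSV file; the names `counts` and `proportions` are assumptions for this example. On the real data, `df.Outcome.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts as reported by df.Outcome.value_counts() above.
counts = pd.Series({0: 500, 1: 268}, name="Outcome")

# Fraction of patients in each class.
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 65% of the patients are in the negative class and 35% are in the positive class, so a classifier that always predicted "0" would already be right about 65% of the time.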

# Up Next
Next, we'll look more closely at our dataset and analyze our features and class distribution.

[II. Data Analysis](./II_DataAnalysis.ipynb)