# Overview
One of the most important steps in a machine learning project is exploring the dataset to understand both the data and the task. In this notebook, we will explore our dataset and analyze the variables it contains. See the Kaggle website for an explanation of this dataset and where it comes from: [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database/home).


In [None]:
import sklearn
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

# Outcome variable
The **outcome variable** is the value that we want our classifier to learn. This is also known as the **dependent variable**, or in machine learning terminology the **label**. In this case, the outcome variable is whether or not a patient has diabetes, and it is contained in the "**Outcome**" column. **"1"** means that the patient has diabetes (positive class), while **"0"** means that the patient does not have diabetes (negative class).
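
One reason to count the labels is to see how evenly the two classes are represented. The sketch below uses a small, made-up label Series (the name `labels` and its values are assumptions for illustration, not taken from the diabetes data) and reports the share of each class.

```python
import pandas as pd

# Hypothetical toy labels (not the real dataset): 1 = positive, 0 = negative
labels = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="label")

# Fraction of rows in each class, ordered by class value
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)
```

A very lopsided split would be worth keeping in mind later when evaluating a classifier.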

### TODO
Get the count of patients with and without diabetes in the **"Outcome"** column. Then plot it as a barplot.

In [None]:
df.____("Outcome").____()

In [None]:
df.____("Outcome").____().plot.____()

# Independent Variables
The **independent variables** are the other values in a dataset, which can be used to predict the outcome variable. These are called **features** or **predictors** in ML terminology. Later, we'll separate the features and labels. First, let's take a look at our entire dataset.


### TODO
Call `df.describe()` to get a summary of the dataset. Then create a histogram for each of the columns in the dataset by calling `df.hist()`.

- Take a look at the values in each column. Consider the min, max, and mean values
- Do the values in the dataset make sense with the real-world values they represent?
- Are there any outstanding questions you have? Does anything not make sense?
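
As a rough illustration of the kind of sanity check described above, the sketch below summarizes a made-up table and flags columns whose minimum is physically implausible. The frame `toy`, its column names, and its values are hypothetical and are not taken from the diabetes data.

```python
import pandas as pd

# Hypothetical toy measurements (not the real dataset)
toy = pd.DataFrame({
    "heart_rate": [72, 65, 0, 80, 90],           # beats per minute
    "temp_c": [36.6, 37.1, 36.9, 38.2, 36.4],    # degrees Celsius
})

# describe() returns one row per statistic and one column per variable
summary = toy.describe()
print(summary.loc[["min", "mean", "max"]])

# A minimum of 0 beats per minute is not plausible for a living patient,
# so it is worth asking how that value was recorded
suspicious = summary.loc["min"] <= 0
flagged = suspicious[suspicious].index.tolist()
print(flagged)
```

Whether a flagged value is a recording error, a placeholder for a missing measurement, or a real observation depends on how the data was collected.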

In [None]:
__.____()

In [None]:
len(df)

In [None]:
_ = df.____(figsize=(12, 8))

# Correlation
One informative analysis we can do with our dataset is to look at the **correlation** between our variables. We've dealt with this before when analyzing the relationship between *systolic* and *diastolic blood pressure* in MIMIC patients. We can then look at which variables correlate most strongly with the outcome variable - these may be the features which are informative to our classifier.
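
For reference, the default correlation computed by pandas is Pearson's $r$. The sketch below computes it by hand on made-up values (the Series `x` and `y` are hypothetical and do not come from MIMIC or the diabetes data) and compares the result with `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values (not from MIMIC or the diabetes dataset)
x = pd.Series([110.0, 120.0, 130.0, 140.0, 150.0])
y = pd.Series([70.0, 75.0, 85.0, 88.0, 95.0])

# Pearson's r: sum of cross-deviations divided by the square root of the
# product of the summed squared deviations (equivalent to cov / (sd_x * sd_y))
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity (Pearson is the default method)
r_pandas = x.corr(y)
print(round(r_manual, 3), round(r_pandas, 3))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship.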

### TODO
Let's visualize the correlation of our variables. We can calculate the correlation between each pair of variables by generating a **correlation matrix**, which is a table where rows and columns represent the variables and each cell contains the correlation of the two intersecting variables. We can then visualize this with a **heat map**, which uses a grid and encodes each value as a color.

- Create a dataframe containing the correlation matrix of each column with each other column by calling the `.corr()` method of the DataFrame
- Generate a matplotlib figure. To increase the size, we'll set the `figsize` to be (12, 9)
- Call `sns.heatmap` and pass in the correlation matrix. Other arguments for customizing the plot have already been filled in
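
Because a correlation matrix is symmetric, the cells above the diagonal repeat the ones below it, which is why a heat map often hides them with a mask. The sketch below shows this on a made-up frame (`toy` and its columns are hypothetical) using only pandas and NumPy, without drawing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real dataset)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()  # symmetric, with 1.0 on the diagonal

# True marks the cells to hide: the diagonal and everything above it
mask = np.zeros_like(corr.to_numpy(), dtype=bool)
mask[np.triu_indices_from(mask)] = True

# DataFrame.where keeps values where the condition is True, so ~mask
# leaves only the cells below the diagonal and sets the rest to NaN
lower_only = corr.where(~mask)
print(lower_only.round(2))
```

Passing the same kind of boolean array as `mask=` to `sns.heatmap` hides the masked cells in the plot.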

In [None]:
# 1. Generate the correlation matrix
corr = df.____()
corr

In [None]:
# 2. Set up the matplotlib figure size
fig, ax = plt.subplots(figsize=(__, __))

# Some additional settings to make the plot prettier
# and only show below the diagonal
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Use the built-in bool: the np.bool alias is unavailable in NumPy 1.24-1.26
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# 3. Draw the heatmap with the correct aspect ratio
sns.____(____, mask=mask, cmap=cmap,
            vmax=.3, center=0, annot=True,
            square=True, linewidths=.5,
            cbar_kws={"shrink": .25})

### Discussion
- Explain the heat map. How does this visually represent the correlation of the dataset?
- Looking at the plot above, which variables correlate most strongly with whether a patient has diabetes?
- What other variables have a strong correlation with each other?
- Do any of the correlations surprise you?

# Deep-Dive Comparative Analysis
Now that we know which variables are likely to be related to our outcome variable, let's do a deeper analysis comparing some of the most important variables between positive and negative patients.

### TODO
- Using boolean indexing, separate the data into two subgroups:
    - `neg`: Patients without diabetes
    - `pos`: Patients with diabetes
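
As a reminder of how boolean indexing works, the sketch below splits a made-up frame into two subgroups. The frame `toy` and its columns `value` and `flag` are hypothetical stand-ins and are not part of the diabetes data.

```python
import pandas as pd

# Hypothetical toy frame (not the real dataset); "flag" stands in for a 0/1 label
toy = pd.DataFrame({"value": [5, 9, 2, 7], "flag": [0, 1, 0, 1]})

# The comparison produces a boolean Series aligned with the rows
is_flagged = toy["flag"] == 1
print(is_flagged.tolist())

# Indexing with that Series keeps only the rows where it is True;
# ~ inverts the condition to select the remaining rows
flagged = toy[is_flagged]
unflagged = toy[~is_flagged]
print(len(flagged), len(unflagged))
```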
    

In [None]:
____ = df[df["Outcome"] == __]
neg = ____

### TODO
Now, pick one of the variables with a strong correlation to diabetes so we can analyze how it can be informative to our task. Save the column name to a variable called `var_name`.

Next, generate three plots:
- A histogram of the overall distribution of the variable (not separated by group)
- Two histograms overlaid on top of each other comparing the positive and negative patients
- A boxplot stratified by Outcome
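
A boxplot is essentially a picture of a few quantiles, so the same comparison can be made numerically. The sketch below computes the quartiles for two made-up subgroups; the Series `group_a` and `group_b` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy measurements for two subgroups (not the real dataset)
group_a = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0])
group_b = pd.Series([6.0, 8.0, 9.0, 10.0, 12.0])

# The box in a boxplot spans the first to third quartile,
# with a line drawn at the median
for name, values in [("group_a", group_a), ("group_b", group_b)]:
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    print(name, q1, median, q3)

# Difference between the two medians
median_shift = group_b.median() - group_a.median()
print(median_shift)
```

A large shift in the median relative to the width of the boxes suggests the variable separates the two groups reasonably well.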

In [None]:
var_name = ____

In [None]:
# 1. Generate a histogram of the entire population
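# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE is the closest replacement
sns.histplot(df[var_name], kde=True, stat="density")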
plt.title("Overall Distribution of {}".format(var_name))

In [None]:
# 2. Create separate histograms for positive and negative patients
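# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE is the closest replacement
ax = sns.histplot(neg[var_name], label="Negative", color="C0", kde=True, stat="density")
sns.histplot(pos[var_name], label="Positive", color="C1", kde=True, stat="density", ax=ax)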
ax.legend()
plt.title("Distribution of {} stratified by Outcome".format(var_name))

In [None]:
# 3. Create a boxplot for each population
sns.boxplot(x="Outcome", y=var_name, data=df)
plt.title("Boxplot of {} stratified by Outcome".format(var_name))

# Next Steps
Now that we have an understanding of what's in our dataset, there are a few data processing steps which we need to take care of before we can start modeling.

[./02-data-prep.ipynb](./02-data-prep.ipynb)