# Overview
In the last step, we analyzed our data and found a few issues with some of the columns. Before we start training our machine learning models, we should fix these issues.

In [None]:
import sklearn
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline

import seaborn as sns
sns.set()

In [None]:
df = pd.read_csv("diabetes.csv")
df.head()

## Data Cleanup
No dataset is perfect. When you look at some of our features, you might notice that not everything makes sense. For example, if you look at the minimum of some of these columns, you notice that some patients have a BMI and blood pressure of 0. Does that sound right?

Chances are these are **missing values**: those patients don't really have a BMI of 0, but maybe the researchers didn't collect those patients' BMI and so just put 0 in as a substitute.
You might also see these as "NaN", meaning "not a number", but in this case they were assigned a value of 0.
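
To make the difference between a placeholder 0 and a true "NaN" concrete, here is a minimal sketch on a made-up DataFrame. The values and the name `toy` are assumptions for illustration only, not rows from `diabetes.csv`. It uses a slightly different route from the exercise below: it first converts the zeros to `NaN`, then fills them with the column mean.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; 0 stands in for "not recorded".
toy = pd.DataFrame({
    "BMI": [30.0, 0.0, 20.0, 25.0],
    "BloodPressure": [80.0, 70.0, 0.0, 60.0],
})

# Mark the placeholder zeros as real missing values.
toy_nan = toy.replace(0, np.nan)
missing_counts = toy_nan.isna().sum()

# mean() skips NaN by default, so the zeros no longer drag the average down.
filled = toy_nan.fillna(toy_nan.mean())

print(missing_counts)
print(filled)
```

Once the zeros are `NaN`, pandas ignores them when computing the mean, which is the same effect the exercise gets by filtering out the rows equal to 0.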

## Questions to discuss
- Why might these values be missing?
- Does every "0" in a column mean that the value is missing?
- What are some potential problems of building a classifier with missing values?
- What should we do about them?

### TODO
Impute the missing values by finding the mean of the columns.
- Specify which columns have missing values. Save this to a list called `cols_with_missing_vals`
- Loop through these columns in the DataFrame
- For each column, filter to rows where the value is not **0**. We don't want the 0's to artificially lower the mean
- Replace the 0's with the imputed value by using the `.replace()` method of the column

In [None]:
cols_with_missing_vals = []

In [None]:
for col in ____:
    not_missing = df[df[col] != __] 
    imputed_value = not_missing[col].____()
    df[col] = df[col].replace(__, ____)

In [None]:
df.describe()

# Labels and features
Next, we need to separate our data into two variables. The first holds the **features** and the second holds the **labels**.

## Features
The **"features"** in a dataset are the information collected for each data point. This is the data which we'll provide to our model to learn from. The features are also known as the *independent variables*.
### Discussion
Which column(s) are the **features** of our dataset?

## Labels
The **label** signifies what **class** each row belongs to. This is also known as the *dependent variable*. In our task, **"1"** means that the patient has diabetes (positive class), while a **"0"** means that the patient does not have diabetes (negative class). This is contained in the "**Outcome**" column.

By convention, the features are usually called **`X`**, while the labels are called **`y`**.
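
As a rough illustration of this convention, the sketch below separates a tiny made-up DataFrame into features and labels. The column names (`height_cm`, `weight_kg`, `label`) and the variables `X_toy` and `y_toy` are hypothetical and unrelated to the diabetes data.

```python
import pandas as pd

# Made-up data: two feature columns and one label column.
toy = pd.DataFrame({
    "height_cm": [150, 160, 170],
    "weight_kg": [50, 60, 70],
    "label": [0, 1, 0],
})

# Features: every column except the label (a DataFrame).
X_toy = toy.drop(columns=["label"])

# Labels: the single label column (a Series).
y_toy = toy["label"]

print(X_toy.shape, y_toy.shape)
```

`drop(columns=[...])` is equivalent to passing the list with `axis=1`; either way, the original DataFrame is left unchanged.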

### TODO
Split our DataFrame into two variables, `X` and `y`. Create `X` by calling `df.drop()` and passing in a list of columns which shouldn't be included in your features. Create `y` by accessing the column containing the outcome variable.

In [None]:
X = df.drop([____], axis=1)
y = df[____]

In [None]:
X.head()

In [None]:
y.head()

# Test-Train Split
We'll also split up our dataset into a *train* and *test* set. Our ultimate goal is to be able to predict whether a set of brand-new patients has diabetes. These new patients have never been seen before by our classifier. 

A common practice in machine learning development is to take a portion of our data and leave those rows out during training. Afterwards, we see how our classifiers perform on these held-out rows.

https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets

## Questions to discuss
- Why is it important to evaluate on testing data that is separate from our training data?
- How should you select the data that you'll leave out for testing?
- What are the costs of taking out data for testing?
- What proportions should be used for testing and training?

### TODO
- Split up `X` and `y` into corresponding train and test sets using the `train_test_split` function from sklearn.
- The training set should contain **80%** of the data, while the testing set should contain **20%**
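
For reference, here is a small sketch of how `train_test_split` behaves on made-up arrays, using a different proportion than the exercise asks for. The names `X_toy`, `y_toy`, and the split variables are assumptions for illustration. `random_state` and `stratify` are optional arguments: the first makes the shuffle repeatable, and the second keeps the class balance the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 8 rows, 2 features, balanced binary labels.
X_toy = np.arange(16).reshape(8, 2)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hold out 25% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.25,
    random_state=0,   # fixed seed so the split is reproducible
    stratify=y_toy,   # preserve the class proportions in both sets
)

print(len(X_tr), len(X_te))
```

Rows are shuffled before splitting by default, so without a fixed `random_state` each run would hold out different rows.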

In [None]:
# Split up data
from sklearn.model_selection import ____
X_train, X_test, y_train, y_test = ____(X, y, train_size=__)

In [None]:
len(X_train)

In [None]:
len(X_test)

# Save our dataset
In our next notebook, we'll start doing actual machine learning on our dataset. But since we've done so much work cleaning up and splitting our dataset, we'll save our processed data to disk so that we can load it in without having to redo all of these steps.

Using the Python `pickle` package, save the dataset to a file called **"diabetes_data.pkl"**. `pickle` is a serialization package in Python and can be used to save Python objects to disk and then reload them in another session.

In [None]:
import pickle
with open("diabetes_data.pkl", "wb") as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)
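
Loading the data back in a later session mirrors the save step. The sketch below shows the round trip with a made-up tuple of lists and a temporary file, so it does not touch **"diabetes_data.pkl"**. The names `splits`, `toy_data.pkl`, and the `_toy` variables are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for (X_train, X_test, y_train, y_test).
splits = ([[1, 2], [3, 4]], [[5, 6]], [0, 1], [1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data.pkl")

    # Save: open in binary write mode.
    with open(path, "wb") as f:
        pickle.dump(splits, f)

    # Load: open in binary read mode and unpack in the same order.
    with open(path, "rb") as f:
        X_train_toy, X_test_toy, y_train_toy, y_test_toy = pickle.load(f)

print(X_train_toy, y_train_toy)
```

The tuple must be unpacked in the same order it was saved. Only unpickle files you trust, since loading a pickle can execute arbitrary code.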

# Next Steps
Now that we have our data prepared, we're finally ready to do some machine learning! In our final in-class notebook, we will train models on our dataset and evaluate how well they perform.

[./03-model-training-and-evaluation.ipynb](./03-model-training-and-evaluation.ipynb)