<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Sklearn,-Data-preprocessing,-and-KNN" data-toc-modified-id="Sklearn,-Data-preprocessing,-and-KNN-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Sklearn, Data preprocessing, and KNN</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Separating-Training-and-Test-data" data-toc-modified-id="Separating-Training-and-Test-data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Separating Training and Test data</a></span></li><li><span><a href="#Dataset-Transformations-in-Sklearn" data-toc-modified-id="Dataset-Transformations-in-Sklearn-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Dataset Transformations in Sklearn</a></span></li><li><span><a href="#Imputer" data-toc-modified-id="Imputer-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Imputer</a></span></li><li><span><a href="#StandardScaler" data-toc-modified-id="StandardScaler-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>StandardScaler</a></span><ul class="toc-item"><li><span><a href="#StandardScaler-directions:" data-toc-modified-id="StandardScaler-directions:-2.4.1"><span class="toc-item-num">2.4.1&nbsp;&nbsp;</span>StandardScaler directions:</a></span></li></ul></li><li><span><a href="#PCA" data-toc-modified-id="PCA-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>PCA</a></span><ul class="toc-item"><li><span><a href="#PCA-directions:" data-toc-modified-id="PCA-directions:-2.5.1"><span class="toc-item-num">2.5.1&nbsp;&nbsp;</span>PCA directions:</a></span></li></ul></li><li><span><a href="#Preparing-our-test-data" data-toc-modified-id="Preparing-our-test-data-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Preparing our test data</a></span></li></ul></li><li><span><a href="#KNeighborsRegressor" data-toc-modified-id="KNeighborsRegressor-3"><span 
class="toc-item-num">3&nbsp;&nbsp;</span>KNeighborsRegressor</a></span></li><li><span><a href="#Extension" data-toc-modified-id="Extension-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Extension</a></span></li></ul></div>

# Sklearn, Data preprocessing, and KNN
Now that we've implemented our own version of KNN, we can be a little lazier and let scikit-learn take care of the algorithm implementation for us. The challenges below will walk you through some preprocessing techniques and the use of sklearn models.

In [1]:
# Run this cell to import dependencies!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
cars = pd.read_csv('data/sklearn-auto-mpg.csv')

cars.head()

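| | mpg | cylinders | displacement | horsepower | weight | acceleration |
|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 |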


If you want to work with this data as a NumPy array instead of a DataFrame, you can store a reference to `cars.values`.

In [2]:
data = cars.values
data[:5]

array([[   18. ,     8. ,   307. ,   130. ,  3504. ,    12. ],
       [   15. ,     8. ,   350. ,   165. ,  3693. ,    11.5],
       [   18. ,     8. ,   318. ,   150. ,  3436. ,    11. ],
       [   16. ,     8. ,   304. ,   150. ,  3433. ,    12. ],
       [   17. ,     8. ,   302. ,   140. ,  3449. ,    10.5]])

# Preprocessing

## Separating Training and Test data
We will use sklearn's built-in [data-splitting utility](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

1. Import the `train_test_split` utility from `sklearn.model_selection`.
2. As we did earlier, we must split the `data` into `X` and `y` ourselves.
3. After separating the labels from the features, call `train_test_split`, specifying a `test_size` of 0.2 and a `random_state` of 42. `train_test_split` shuffles our data before dividing it, and `random_state` seeds that shuffle so the split is reproducible.
4. `train_test_split` returns `X_train`, `X_test`, `y_train`, and `y_test` (in that order). Store these results as variables.
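
The steps above might look roughly like the following sketch. It uses a synthetic array in place of `cars.values` so that it runs on its own, and it assumes that the first column (`mpg`) is the label and the remaining columns are features; adjust the slicing if your labels live elsewhere.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `cars.values`: 50 rows, 6 columns (illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))

# Assumption: column 0 (mpg) is the label, columns 1-5 are the features.
y = data[:, 0]
X = data[:, 1:]

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

With 50 rows and `test_size=0.2`, 10 rows end up in the test set and 40 in the training set.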


## Dataset Transformations in Sklearn
Most of the built-in sklearn models we will be using today are [*dataset transformations*](http://scikit-learn.org/stable/data_transforms.html). A model is a *dataset transformation* if it has a *fit* method, a *transform* method, and a *fit_transform* method. When we create a new instance of a *dataset transformation*, the first thing we must do is train it on our training data. We do this with the *fit* method. Once we have fit the model, we may use it to transform our training data, or any new, unseen data. This is what the *transform* method is for. If we want to fit a model to our training data and transform that same data in one step, we may use the *fit_transform* method. Predictors, such as the KNN regressor we build at the end, follow the same fit-first pattern but expose *predict* in place of *transform*. Since nearly all sklearn models follow this design pattern, once we know the syntax for one model, we pretty much know the syntax for every model. How convenient!
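
As a small illustration of the fit / transform / fit_transform pattern, here is a sketch using `StandardScaler` (introduced later in this notebook) on a tiny made-up array. The arrays `X_train` and `X_new` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: 3 training rows, 1 unseen row.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[4.0, 40.0]])

# fit: learn the per-feature mean and standard deviation from the training data.
scaler = StandardScaler()
scaler.fit(X_train)

# transform: apply the learned statistics to any data, seen or unseen.
X_train_scaled = scaler.transform(X_train)
X_new_scaled = scaler.transform(X_new)

# fit_transform: fit and transform the same data in one call.
X_train_scaled_again = StandardScaler().fit_transform(X_train)

print(np.allclose(X_train_scaled, X_train_scaled_again))
```

Note that `X_new` is transformed with the statistics learned from `X_train`; the scaler is never re-fit on unseen data.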

## Imputer
The first *dataset transformation* we will look at is the imputer. Now that we've separated our data into training data and test data, we have to take care of missing values. If we feed data with missing values directly into sklearn's knn model, it will produce a runtime error. There are a few potential approaches for dealing with missing values. One way would be to entirely remove the observation. This is less than ideal since it throws away otherwise useful information.

There are two more commonly used approaches:
1. In the first approach, fill in missing values with either
    * the mean value of the feature,
    * the median value of the feature, or 
    * the mode value of the feature
2. In the second approach, we first use approach number 1, but in addition we add a new, binary feature to our dataset that takes on the value 0 if the feature was NOT missing, and 1 if the feature was missing. That way, our model can learn if the absence of a certain feature is helpful for predicting labels.

We will use the [sklearn imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html) to take approach 1. Note that in scikit-learn 0.20 and later this class is `SimpleImputer` in `sklearn.impute`; the older `sklearn.preprocessing.Imputer` was removed in version 0.22.

1. Check to see if `X_train` or `X_test` contain missing values (hint: use functional tools like filter and map, or `np.isnan`)
1. If we have missing values, we should import the imputer (`SimpleImputer` from `sklearn.impute`, or `Imputer` from `sklearn.preprocessing` on versions older than 0.20)
1. Create a new instance of an imputer. We'd like to replace missing values (NaNs) with the mean value of the respective feature. The imputer behaves this way by default, but we can also explicitly set it this way by setting the `missing_values` and `strategy` parameters. Save the imputer to the variable `imp`.
1. Fit the imputer to `X_train`.
1. Reassign `X_train` to the result of transforming `X_train`.
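
A minimal sketch of these steps, assuming a recent scikit-learn where the imputer is `sklearn.impute.SimpleImputer`. The small `X_train` array below is made up for illustration and is not the cars data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training data with two missing entries.
X_train = np.array([
    [1.0, 100.0],
    [np.nan, 200.0],
    [3.0, np.nan],
])

# Step 1: check for missing values.
has_missing = bool(np.isnan(X_train).any())
print(has_missing)

# Steps 3-5: mean imputation (these are also the defaults).
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit(X_train)
X_train = imp.transform(X_train)

print(X_train)
```

Each NaN is replaced by the mean of the observed values in its column, so the first column's gap becomes 2.0 and the second column's gap becomes 150.0.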

## StandardScaler
A big flaw in our knn model from yesterday is that the [variance](https://en.wikipedia.org/wiki/Variance) in the weight of a vehicle is far greater than the variance in any of the other features. This means that the weight of a vehicle contributed by far the most to the distance a vehicle was from other vehicles, so our model was far less sensitive to variation in cylinders or horsepower.

To combat this issue, we can scale our data. Consider the weight of each vehicle. Suppose the mean weight of a vehicle in our training set is 3000kg, and the standard deviation of the weight variable is 550kg (if weights are roughly normally distributed, this means that about 68% of vehicles in our training set fall between 2450kg and 3550kg, i.e. 3000kg +/- 550kg). To scale the weight feature, we subtract 3000kg from the weight of each vehicle, and then divide by 550kg. This leaves us with data that has a mean of 0 and a standard deviation of 1. If we scale all our features, we will mitigate one of the biggest issues with our naive implementation of knn.
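
The arithmetic can be checked by hand with NumPy. The four weights below are hypothetical values chosen to line up with the 3000kg mean and 550kg standard deviation used in the paragraph above.

```python
import numpy as np

# Hypothetical vehicle weights in kg.
weights = np.array([2450.0, 3000.0, 3550.0, 4100.0])

# Mean and standard deviation quoted in the text.
mean, std = 3000.0, 550.0

# Subtract the mean, then divide by the standard deviation.
z = (weights - mean) / std
print(z)  # -1, 0, 1, 2: each value is "how many standard deviations from the mean"

# Scaling a feature by its own mean and standard deviation
# always yields mean 0 and standard deviation 1.
scaled = (weights - weights.mean()) / weights.std()
print(scaled.mean(), scaled.std())
```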

### StandardScaler directions:
1. Before using StandardScaler, use matplotlib to create a histogram of each feature

2. Now, import [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) from sklearn.preprocessing
2. Create a new instance of the StandardScaler transformer and save it to a variable __scaler__. If you set *copy = False*, the data will be transformed in place. (*note: this will only work if our data is stored in a numpy array*)
2. Use fit_transform to simultaneously fit __scaler__ to the training data and transform it to the scaled version.
2. Use matplotlib to create a histogram for each scaled feature. Compare the shape of these histograms to the shapes of the histograms you generated above. Compare the mean and variances, too.
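
One possible way to carry out these directions is sketched below. It uses two synthetic features on very different scales as a stand-in for `X_train`; the feature names and distributions are assumptions made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train: two features on very different scales.
rng = np.random.default_rng(0)
X_train = np.column_stack([
    rng.normal(3000, 550, size=200),  # weight-like feature
    rng.normal(6, 1.5, size=200),     # cylinders-like feature
])
feature_names = ["weight", "cylinders"]  # hypothetical labels for the plot

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Top row: raw features. Bottom row: scaled features.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for j, name in enumerate(feature_names):
    axes[0, j].hist(X_train[:, j], bins=20)
    axes[0, j].set_title(f"{name} (raw)")
    axes[1, j].hist(X_scaled[:, j], bins=20)
    axes[1, j].set_title(f"{name} (scaled)")
fig.tight_layout()

# Scaling shifts and stretches each feature but does not change its shape.
print(X_scaled.mean(axis=0), X_scaled.var(axis=0))
```

The histograms keep their shape; only the x-axis changes, with every scaled feature centered at 0 with variance 1. Outside a notebook, call `plt.show()` to display the figure.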

## PCA
We'll perform one last transformation on our data prior to fitting our knn regression model. PCA, short for Principal Component Analysis, is a dimensionality reduction algorithm. That is, PCA can help us turn our dataset, which contains 5 unique features, into a dataset with fewer unique features.

__Why would we want to do this? Aren't we throwing away useful data?__
Well, it sort of depends. Suppose we added another feature to our dataset - weight in lbs. This new feature doesn't give us any extra information, since we already know the weight of each car in kg. To put it differently, dimensionality reduction algorithms identify features that are highly correlated, and help us combine them into a single feature. This means we take up less memory and don't allow highly correlated features to add to the distance between points in our knn model. It also mitigates risk of overfitting our model to the training data.

__A high level description of PCA:__ PCA essentially fits an ellipse to our data. An n-dimensional ellipse has n different axes. If we have a dataset with n features, the n axes that PCA finds are called the principal components. PCA orders the principal components by their length. The longer a component is relative to the other axes, the more variation in the dataset is captured by that component. We may discard sufficiently small components because they don't contribute much information about a data point. Below is an example of a dataset with two features. Its principal components are overlaid as arrows: ![title](GaussianScatterPCA.svg.png)

### PCA directions:
1. Import [PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) from sklearn.decomposition
1. instantiate a new PCA transformer and save it to the variable __pca__
1. fit __pca__ to the training data
1. print out the values of pca.components\_, pca.explained\_variance\_, and pca.explained\_variance\_ratio\_
1. interpret these values and determine how many principal components to keep
1. redo step 2, this time passing the parameter n\_components = the number of components you decided to keep
1. fit the model to __X_train__ and use it to transform __X_train__
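
The sketch below walks through these directions on synthetic data that mirrors the kg/lbs example from earlier: two perfectly correlated weight columns plus one unrelated column. The 99% cumulative-variance cutoff used to pick the number of components is an assumption for illustration, not a rule from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: weight in kg, the same weight in lbs, and an unrelated feature.
rng = np.random.default_rng(0)
weight_kg = rng.normal(1400, 250, size=200)
weight_lbs = weight_kg * 2.20462
acceleration = rng.normal(15, 3, size=200)
X_train = np.column_stack([weight_kg, weight_lbs, acceleration])

# Standardize first so that no feature dominates because of its units.
X_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

# Fit PCA with all components and inspect how much variance each one explains.
pca = PCA()
pca.fit(X_train)
print(pca.explained_variance_ratio_)

# Assumed rule of thumb: keep enough components to explain 99% of the variance.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.99) + 1)

# Re-instantiate with the chosen number of components, then fit and transform.
pca = PCA(n_components=n_keep)
X_train_reduced = pca.fit_transform(X_train)
print(n_keep, X_train_reduced.shape)
```

Because the lbs column carries no information beyond the kg column, the third component explains essentially none of the variance and two components are kept.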

## Preparing our test data
We've adequately prepared our training data and are almost ready to fit knn models with our transformed data. In order for the trained model to make valid predictions for our __X_test__, though, we must first transform __X_test__ with the same models we fit to our training data (using *transform*, not *fit_transform*, so that nothing is re-learned from the test set). Do this in the cell below.
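
A sketch of the idea, using random arrays in place of the real `X_train` and `X_test` so that it runs on its own. The variable names `imp`, `scaler`, and `pca` follow the earlier directions; `n_components=3` is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Random stand-ins for the real training and test features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
X_test = rng.normal(size=(10, 5))
X_train[0, 1] = np.nan
X_test[3, 2] = np.nan

imp = SimpleImputer(strategy="mean")
scaler = StandardScaler()
pca = PCA(n_components=3)  # arbitrary choice for this sketch

# Training data: each transformer is fit and applied in turn.
X_train = imp.fit_transform(X_train)
X_train = scaler.fit_transform(X_train)
X_train = pca.fit_transform(X_train)

# Test data: transform only, reusing what was learned from the training data.
X_test = imp.transform(X_test)
X_test = scaler.transform(X_test)
X_test = pca.transform(X_test)

print(X_train.shape, X_test.shape)
```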

# KNeighborsRegressor
We're finally ready to create our knn models!

1. import the KNeighborsRegressor from sklearn.neighbors
1. import mean_squared_error from sklearn.metrics
2. loop through values k = 1 to k = math.floor(math.sqrt(len(__X_train__))) (k must be at least 1, since the model needs at least one neighbor)
3. for each value of k:
    * instantiate a knn model with n_neighbors = k
    * fit the model to __X_train__ and __y_train__
    * use model.predict to generate an array of predictions for __X_test__
    * store the value of k in a list called __ks__, and store the mean_squared_error of the predictions in a list called __errors__
3. use matplotlib to plot the value of k vs the mean_squared_error produced by the corresponding model. Which value of k works best?

See the [KNeighborsRegressor documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) for details.
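
The loop might be sketched as follows. The data here is synthetic (a noisy linear function of three random features) so that the block is self-contained; with the real data you would use your preprocessed `X_train`, `X_test`, `y_train`, and `y_test`. Note that the loop starts at k = 1, since `n_neighbors` must be a positive integer.

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, standing in for the preprocessed cars data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

ks, errors = [], []
max_k = math.floor(math.sqrt(len(X_train)))  # 80 training rows -> max_k = 8
for k in range(1, max_k + 1):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)           # the model must be fit before predicting
    predictions = model.predict(X_test)
    ks.append(k)
    errors.append(mean_squared_error(y_test, predictions))

# The k with the lowest test error on this particular split.
best_k = ks[int(np.argmin(errors))]
print(best_k)
```

Passing `ks` and `errors` to `plt.plot` draws the error curve asked for in the last step.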

# Extension
1. Change up your preprocessing steps, trying to produce a model with a lower mean squared error. You may:
    * Tune the hyperparameters for your existing transformers (hyperparameters are the parameters you pass when you instantiate a new transformer)
    * Switch transformers or use fewer transformers (e.g. remove observations with NaNs instead of using imputer, switch out PCA for [Kernel PCA](http://scikit-learn.org/stable/auto_examples/decomposition/plot_kernel_pca.html))
1. Each transformer we use in preprocessing comes with implicit assumptions about how our data is structured/distributed. For each transformer, ask yourself what assumptions we are making about our data in order to conclude that the transformation step will be useful.
1. Combine your preprocessing steps and knn predictor into a single sklearn [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html)
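
For the last extension, a Pipeline could look something like the sketch below. The step names, `n_components=3`, and `n_neighbors=5` are arbitrary choices for illustration, and the data is synthetic; this is one possible arrangement rather than a reference solution.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a few missing values.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 5))
X_test = rng.normal(size=(20, 5))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=0.1, size=80)
y_test = 3.0 * X_test[:, 0] + rng.normal(scale=0.1, size=20)
X_train[2, 1] = np.nan
X_test[5, 3] = np.nan

# Each step is a (name, estimator) pair; the final step is the predictor.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])

# fit runs fit_transform on every transformer, then fits the regressor.
pipe.fit(X_train, y_train)

# predict runs transform (not fit_transform) on every transformer first.
predictions = pipe.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(mse)
```

A Pipeline applies the "fit on train, transform on test" rule automatically, which removes one common source of mistakes.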