# Regression (kNN and Linear) against a single feature

In [None]:
# The %... syntax is an IPython magic command, and is not part of the Python language.
# In this case we're just telling the plotting library to draw things in
# the notebook, instead of in a separate window.
%matplotlib inline

# See all the "as ..." constructs? They're just aliasing the package names.
# That way we can call functions like plt.plot() instead of matplotlib.pyplot.plot().

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options

In [None]:
dfcars=pd.read_csv("data/mtcars-cleaned.csv")
dfcars.head()

## Numpy indexing and the train-test split

We can use `range` to construct an object which represents the sequence of integers from 0 up to, but not including, some N. This is done as `range(N)`.

In [None]:
length_dataframe = dfcars.shape[0]
range(length_dataframe)

The range can be materialized by passing it to the `list` constructor. Why do it this way? Suppose you wanted `range(1000000)`. You don't want to store a million numbers in memory when you can always generate the next one by adding 1 to the previous one:

In [None]:
list(range(length_dataframe))
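
To make the memory argument concrete, here is a small sketch that compares the size of a `range` object with the size of the list built from it. The variable names are illustrative only, and `sys.getsizeof` reports the size of the container itself, not of the integers it refers to, so treat the numbers as a rough indication.

```python
import sys

n = 1_000_000
lazy = range(n)             # stores only start, stop and step
materialized = list(lazy)   # stores a reference to every one of the n integers

range_bytes = sys.getsizeof(lazy)
list_bytes = sys.getsizeof(materialized)
print(range_bytes, list_bytes)

# A range still supports len() and indexing without being materialized.
print(len(lazy), lazy[10])
```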

Let's use `range` in the construction of training and test sets. Recall that we split our data into training and test sets so that we can evaluate our model on the test set. The diagram below illustrates a situation in which we use 80% of our dataset for training and the remaining 20% for testing.

![](images/train-test.png)

Our general strategy is to do this randomly. `sklearn` gives us an easy-to-use function for this purpose. Notice that we split the range, which then leads to a materialization into lists of indices.

In [None]:
from sklearn.model_selection import train_test_split
split = train_test_split(range(length_dataframe), train_size=0.8)

In [None]:
split
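
Because the split is random, the indices change on every run. A minimal sketch of a reproducible split follows, assuming a dataset of 32 rows (a number chosen only for illustration) and using the `random_state` parameter of `train_test_split` to fix the shuffle.

```python
from sklearn.model_selection import train_test_split

n_rows = 32  # assumed number of rows, for illustration only
idx_train, idx_test = train_test_split(range(n_rows), train_size=0.8, random_state=0)

# 80% of 32 is 25.6, which is rounded down to 25 training indices.
print(len(idx_train), len(idx_test))

# Every index lands in exactly one of the two sets.
all_indices = sorted(list(idx_train) + list(idx_test))
print(all_indices == list(range(n_rows)))
```

Fixing `random_state` is a convenience for debugging and for sharing results; leaving it unset gives a fresh split each time.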

Let's assign index lists to each member of the split:

In [None]:
i_train, i_test = split
i_train

As another way of picking certain "rows" from a dataframe, we can use this list of indices to select the car weights for the training set. This works because `dfcars` has the default integer index, so row labels and row positions coincide.

In [None]:
dfcars.wt[i_train]

Notice that this does not work for the entire dataframe!

In [None]:
dfcars[i_train]

This is because indexing a dataframe with square brackets refers to columns, not rows, so `dfcars[i_train]` looks for columns labelled with those integers. To select rows by position in a dataframe we use `iloc`:

In [None]:
dfcars.iloc[i_train]
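
The difference between the two kinds of indexing can be seen on a toy dataframe. The sketch below uses a small stand-in for `dfcars` with made-up values; only the column names are borrowed from the text.

```python
import pandas as pd

# Toy stand-in for dfcars; the values are made up for illustration.
df = pd.DataFrame({"wt": [2.6, 3.2, 3.4, 1.8], "mpg": [21.0, 21.4, 18.7, 30.4]})
rows = [0, 2]

try:
    df[rows]             # looks for columns labelled 0 and 2
    lookup_failed = False
except KeyError:
    lookup_failed = True

subset = df.iloc[rows]   # selects rows by integer position
print(lookup_failed)
print(subset)
```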

## Creating features for regression

Our next job is to create the weight feature training set for our regression. We can use the `Pandas` series or the corresponding `numpy` array. The example below uses the `numpy` array.

In [None]:
xtrain = dfcars.wt.values[i_train]
xtrain

> YOUR TURN NOW

>Create the test set of car weights in the variable `xtest`.

In [None]:
# your code here

In [None]:
ytrain = dfcars.mpg.values[i_train]
ytest = dfcars.mpg.values[i_test]

## The shape of things in scikit-learn

Scikit-learn is the main `python` machine learning library. It consists of many learners which can learn models from data, as well as a lot of utility functions such as `train_test_split`. It can be used in `python` by the incantation `import sklearn`.

The library has a very well-defined interface. This makes the library a joy to use, and surely contributes to its popularity. As the scikit-learn API paper [Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).] says:

>All objects within scikit-learn share a uniform common basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions and a transformer interface for converting data. The estimator interface is at the core of the library. It defines instantiation mechanisms of objects and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection or dimensionality reduction are also provided as estimators.


Let's see the structure scikit-learn needs to make these fits. For supervised learning, `.fit` always takes two arguments:

`estimator.fit(Xtrain, ytrain)`.

Critically, `Xtrain` must be in the form of an array of arrays, with each inner array corresponding to one sample and its elements corresponding to the feature values for that sample.

The `ytrain`, on the other hand, is a simple array of responses, which are continuous for regression problems.

![](images/sklearn2.jpg)
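
A quick way to see this shape requirement is to hand `fit` a flat vector of features. The sketch below uses made-up numbers and `LinearRegression` purely as an example estimator; the flat vector is rejected with a `ValueError`, while the two-dimensional version is accepted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,): a flat vector of feature values
y = np.array([2.0, 4.0, 6.0, 8.0])   # shape (4,): one response per sample

try:
    LinearRegression().fit(x, y)      # 1-D features are rejected
    rejected = False
except ValueError:
    rejected = True

X = x.reshape(-1, 1)                  # shape (4, 1): 4 samples, 1 feature each
model = LinearRegression().fit(X, y)
print(rejected, X.shape, y.shape)
```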

Let us see what our shapes look like:

In [None]:
xtrain.shape

This is not what we want! We have 25 samples, but we want the data to look like a list of 25 feature vectors (each of size 1 here). So we must *reshape*.

In [None]:
Xtrain = xtrain.reshape(xtrain.shape[0], 1)
Xtrain

In [None]:
Xtrain.shape

Notice our notation: we started with the vector `xtrain`, a vector of length 25 (shape (25,)) and constructed a **design matrix** `Xtrain` of size 25 x 1. We use CAPS for the first letter to remind ourselves of this.
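
There is more than one way to turn a vector into a one-column design matrix. The following sketch, on a made-up vector of length 5, shows the explicit reshape alongside two common numpy alternatives that give the same result.

```python
import numpy as np

x = np.arange(5.0)                # shape (5,)
a = x.reshape(x.shape[0], 1)      # explicit row count
b = x.reshape(-1, 1)              # -1 lets numpy infer the row count
c = x[:, np.newaxis]              # adds a trailing axis of length 1
print(a.shape, b.shape, c.shape)
```

The `reshape(-1, 1)` form is handy because it does not need to know the number of samples in advance.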

In [None]:
ytrain.shape

`ytrain` is expected to be a vector. 

> YOUR TURN NOW

> Let us also reshape `xtest` into `Xtest`.

In [None]:
# your code here

### Regress

In [None]:
#import linear model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

#create linear model
regression = LinearRegression()

#fit linear model
regression.fit(Xtrain, ytrain)

At this point we have fit our model using the `fit` API method in `sklearn`. Now comes the next critical method, `predict`. The test set `Xtest` has the same structure as `Xtrain`, and is used in the `.predict` interface. Once we have fit the estimator, we predict the results on the test set by:

`estimator.predict(Xtest)`.

The results of this are a simple array of predictions, of the same form and shape as `ytest`.

In [None]:
#predict y-values
predicted_y = regression.predict(Xtest)
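
To check the `fit` and `predict` pattern in isolation, it can help to run it on data where the answer is known. The sketch below uses synthetic points that lie exactly on the line y = 2x + 1 (an assumption made for illustration, not the car data) and confirms that the predictions come back as a flat array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_fit = np.array([[0.0], [1.0], [2.0], [3.0]])   # design matrix, shape (4, 1)
y_fit = 2.0 * X_fit.ravel() + 1.0                # responses on the line y = 2x + 1
X_new = np.array([[4.0], [5.0]])                 # unseen inputs, shape (2, 1)

est = LinearRegression().fit(X_fit, y_fit)
y_new = est.predict(X_new)

print(est.coef_, est.intercept_)   # fitted slope and intercept
print(y_new.shape, y_new)          # one prediction per row of X_new
```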

`sklearn` will now provide you with a default way to score your model, which for regression problems is $R^2$.

In [None]:
#score predictions (sklearn gives you R^2 as well)
r2 = regression.score(Xtest, ytest)
r2
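
For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares of the responses being scored. The sketch below recomputes it by hand on small made-up arrays and compares the result with `score`; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training and test data for illustration.
X_tr = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_tr = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
X_te = np.array([[1.5], [2.5], [4.5]])
y_te = np.array([1.4, 2.7, 4.4])

model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

ss_res = np.sum((y_te - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_te - y_te.mean()) ** 2)    # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, model.score(X_te, y_te))
```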

> YOUR TURN NOW

> Find the $R^2$ on the training set. Is it better or worse?

In [None]:
#your turn now
regression.score(Xtrain, ytrain)

We can also access the mean squared error:

In [None]:
# sklearn's argument order is (y_true, y_pred); the value is the same either way for MSE
mean_squared_error(ytest, predicted_y)
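
The mean squared error is simply the average of the squared differences between actual and predicted values. A minimal check on made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.0])   # made-up actual values
y_pred = np.array([2.0, 5.0, 4.0])   # made-up predictions

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)
```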

> YOUR TURN NOW

>Plot the predicted y against the actual y to see how we did. Ideally we'd want to be on the 45 degree line between predicted y and actual y. In general we'll want the test set data to be distributed around this line.

In [None]:
# your code here

We can also predict the results on a grid of x's to draw the regression line. This is akin to treating the grid like a test set, but not quite, because the grid may contain points from the training set.

In [None]:
plt.plot(dfcars.wt, dfcars.mpg, 'o')
xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)
plt.plot(xgrid, regression.predict(xgrid.reshape(100, 1)));

## Nearest Neighbor regression

Now that we know the `sklearn` API, let's repeat the above for nearest neighbor regression with 5 neighbors.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knnreg = KNeighborsRegressor(n_neighbors=5)

In [None]:
knnreg.fit(Xtrain, ytrain)
r2 = knnreg.score(Xtest, ytest)
r2
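
With the default settings, a k-nearest-neighbor regressor predicts the plain average of the responses of the k training points closest to the query. The sketch below reproduces one prediction by hand on made-up data with no distance ties; the names `query` and `manual` are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-feature data for illustration.
X = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
query, k = 1.4, 2

distances = np.abs(X.ravel() - query)   # distance from the query to each training point
nearest = np.argsort(distances)[:k]     # indices of the k closest training points
manual = y[nearest].mean()              # average of their responses

knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
print(manual, knn.predict([[query]])[0])
```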

> YOUR TURN NOW

> How do we do on the training set?

In [None]:
# your code here

Let's vary the number of neighbors and see what we get.

In [None]:
regdict = {}
for k in [1, 2, 4, 6, 8, 10, 15]:
    knnreg = KNeighborsRegressor(n_neighbors=k)
    knnreg.fit(Xtrain, ytrain)
    regdict[k] = knnreg

In [None]:
with sns.plotting_context('poster'):
    plt.plot(dfcars.wt, dfcars.mpg, 'o', label="data")
    xgrid = np.linspace(np.min(dfcars.wt), np.max(dfcars.wt), 100)
    for k in [1, 6, 15]:
        predictions = regdict[k].predict(xgrid.reshape(100, 1))
        plt.plot(xgrid, predictions, label="{}nn".format(k))
    plt.legend();

Notice how the 1NN goes through every point on the training set but utterly fails elsewhere. Let's look at the scores on the training set.

In [None]:
ks = range(1, 15)
scores_train = []
for k in ks:
    knnreg = KNeighborsRegressor(n_neighbors=k)
    knnreg.fit(Xtrain, ytrain)
    score_train = knnreg.score(Xtrain, ytrain)
    scores_train.append(score_train)
plt.plot(ks, scores_train,'o-');

Why do we get a perfect $R^2$ at k=1?
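
One way to explore this question is on synthetic data. In the sketch below, each training point has a distinct x value (an assumption; if two training points shared the same x but had different responses, 1NN could no longer reproduce both), so with k=1 every training point is its own nearest neighbor and is predicted exactly.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)   # ten distinct x values
y = rng.normal(size=10)                          # arbitrary noisy responses

one_nn = KNeighborsRegressor(n_neighbors=1).fit(X, y)
five_nn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

# Score both models on the same data they were fitted on.
print(one_nn.score(X, y), five_nn.score(X, y))
```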

> YOUR TURN NOW

> Make the same plot on the test set:

In [None]:
# your code here

What is the best k?