# DAT19 Class 5 - Model Evaluation


# Cross Validation with KNN

A big part of this lab is getting comfortable with general sklearn syntax. Each family of classification algorithms has its own knobs and levers for tuning, but the models share an overall structure that will help you as you move forward.
1. All models need to be trained. Sklearn models have a `.fit` method for doing so.
2. We need to use the model to make a guess. The `.predict` method takes data and returns the model's guess for the value. The details of what it accepts and returns depend on the specific model.
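
To make the two steps concrete, here is a minimal sketch of the fit/predict pattern. `MajorityClassifier` is a hypothetical toy class written for illustration, not an sklearn model; it only imitates the shape of the interface.

```python
from collections import Counter


class MajorityClassifier(object):
    """Toy model (hypothetical, not part of sklearn) that mimics the fit/predict pattern."""

    def fit(self, X, y):
        # "Training" here just memorizes the most common label in y.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self  # returning self allows MajorityClassifier().fit(X, y) chaining

    def predict(self, X):
        # Guess the memorized label for every row of X.
        return [self.majority_ for _ in X]


X_toy = [[1.0], [2.0], [3.0], [4.0]]
y_toy = ["a", "b", "b", "b"]

model = MajorityClassifier().fit(X_toy, y_toy)
toy_predictions = model.predict([[5.0], [6.0]])
print(toy_predictions)  # ['b', 'b']
```

Real sklearn models do far more work inside `.fit`, but the calling pattern (train on features and labels, then predict from features alone) has the same shape.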

Last time, we imported our data from the UCI Machine Learning repository using pandas. Scikit-learn also includes some well-known datasets. So, for convenience, we will import the iris data set from sklearn this time.

In [None]:
import numpy as np

In [None]:
# from the datasets module, load the iris data into a variable called sk_iris
from sklearn import datasets

sk_iris = datasets.load_iris()

In [None]:
type(sk_iris)

In [None]:
print sk_iris

In [None]:
help(sk_iris)

That's interesting:
```Container object for datasets: dictionary-like object that exposes its keys as attributes.```
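
What "exposes its keys as attributes" means can be sketched in a few lines. `AttrDict` below is a hypothetical stand-in written for this lab, not sklearn's actual implementation; it only shows why `sk_iris['data']` and `sk_iris.data` can refer to the same thing.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails, so dict methods still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


toy = AttrDict(data=[[5.1, 3.5], [4.9, 3.0]], target=[0, 0])

same_object = toy["data"] is toy.data
print(same_object)         # True
print(sorted(toy.keys()))  # ['data', 'target']
```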

In [None]:
print sk_iris['DESCR']

In [None]:
sk_iris['feature_names']

In [None]:
sk_iris['target_names']

Remember last time when we put all the features in a matrix and the labels (what we are trying to predict) into a vector?

Let's re-assign the data to standard named variables. Sklearn makes this very easy.

In [None]:
X = sk_iris.data #features
y = sk_iris.target #class labels
Names = sk_iris.target_names

In [None]:
print type(X)
print np.shape(X)
print X
## features now in a matrix

In [None]:
print type(Names)
print np.shape(y)
print Names

# matrix of features, vector of class labels

Now we get into cross validation! The first step is to split the data into a training set and a test set.

In [None]:
# is there a function to do that in sklearn?
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18

In [None]:
ind = range(150) #What data structure is ind? What is its shape?
## list with values from 0-149

np.random.shuffle(ind) #Why must we randomly shuffle the (indices for the) training data before splitting it?
## shuffling because classes are in order; if we don't shuffle we won't get folds that represent dataset as a whole

test_ind = ind[:150/5] #Would this work if 20% of the number of records were not an integer?
## grab beginning of list 150/5 because we're using 1/5 of entire set

train_ind = ind[150/5:]
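
The two questions in the cell above can be checked with a small sketch. It uses plain Python lists and a made-up label list shaped like the iris targets (50 of each class, stored in class order); the variable names are assumptions for illustration.

```python
import random

# Iris-like labels: 50 of each class, stored in class order.
labels = [0] * 50 + [1] * 50 + [2] * 50
n_test = len(labels) // 5  # floor division always yields an integer count

# Without shuffling, the first 30 records all belong to class 0.
unshuffled_test = labels[:n_test]
print(sorted(set(unshuffled_test)))  # [0]

# With shuffled indices, the test set is a random draw from the whole data set.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
indices = list(range(len(labels)))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
shuffled_test = [labels[i] for i in test_idx]
print(sorted(set(shuffled_test)))

# Floor division also copes with sizes that 5 does not divide evenly.
print(151 // 5)  # 30
```

A model tested only on class 0 tells us nothing about how it handles the other two classes, which is why the indices are shuffled before slicing.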

In [None]:
print test_ind
print 'length of test index is ' + str(len(test_ind))
print '\n'
print train_ind
print 'length of training index is ' + str(len(train_ind))

## example of train/test split!!!!

In [None]:
## Another example of train/test split

X_train = []
y_train = []

X_test = []
y_test = []


## for each index in the list, grab the matching row and label and append them
## (the loop variable is named i so it does not overwrite the ind list above)
for i in test_ind:
    X_test.append(X[i])
    y_test.append(y[i])

for i in train_ind:
    X_train.append(X[i])
    y_train.append(y[i])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

## test_size = fraction of the records held out for the test set (here 20%)
## we're assigning the results to multiple variables at once
## the function returns a list that Python can unpack into those names

Wait a minute, what's going on with this syntax above? Does anything about it look unusual to you?
Let's take a look at the [function documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and the [user guide](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
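
The unusual part is that one call fills four names at once. Here is a minimal sketch of that unpacking, using a hypothetical helper `toy_split` that returns its pieces in the same order as `train_test_split`. Unlike the real function, this toy version does not shuffle.

```python
def toy_split(X, y, n_test):
    """Hypothetical helper: return four pieces in the order train_test_split uses."""
    return [X[n_test:], X[:n_test], y[n_test:], y[:n_test]]


X_demo = [[1], [2], [3], [4], [5]]
y_demo = [0, 0, 1, 1, 1]

# One call, four names: Python unpacks the returned list position by position.
Xd_train, Xd_test, yd_train, yd_test = toy_split(X_demo, y_demo, n_test=2)

print(Xd_test)   # [[1], [2]]
print(yd_train)  # [1, 1, 1]
```

If the number of names on the left did not match the number of items returned, Python would raise a `ValueError`.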

In [None]:
tts_return = train_test_split( X, y, test_size=0.20, random_state=0)
print len(tts_return)
print type(tts_return)
#tts_return

In [None]:
print np.shape(X_train)
X_train

In [None]:
print np.shape(y_train)
y_train
# the labels below line up with the feature rows above:
# the first row of X_train corresponds to the first entry of y_train

In [None]:
y_test

In [None]:
#Quick Question: How can we double check that we got the number of features and labels that we expected?


Now, we'll train our model and use it to make predictions, following the steps we outlined last time.

In [None]:
# Import the KNN classifier, which we will train on the training data
from sklearn.neighbors import KNeighborsClassifier

In [None]:
myknn = KNeighborsClassifier(2).fit(X_train,y_train)
## want 2 neighbors in the model
## fit it using the training features and training labels from train_test_split

Let's figure out how good our model is. The traditional score is the percentage of labels that the model identified correctly. This is called **accuracy**. (It is not the same as **precision**, which is the fraction of the predictions for a given class that are correct.) There are other types of statistical scores, but we will start here. We'll ask our model to predict the labels for our test set, then generate a score.

In [None]:
## predict feeds in data and gets predictions out
predictions = myknn.predict(X_test)
print predictions
type(predictions)

In [None]:
correct = 0

for a,b in zip(y_test,myknn.predict(X_test)):
    if a == b:
        correct += 1
    else:
        pass

print "Number correct:",correct
print "Score:",float(correct)/len(y_test)


### this is only one sample of out-of-sample (OOS) test accuracy

That was easy enough. Sklearn models also have a `.score` method that computes this for us.

In [None]:
myknn.score(X_test, y_test)

Sklearn also has a way of showing more information about the predictions. Here, we use `sklearn.metrics.classification_report` to generate a more informative picture: precision, recall, f1-score, and support for each class. The Wikipedia page on precision and recall (linked below) is a good starting point if you're looking to understand more.

https://en.wikipedia.org/wiki/Precision_and_recall
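
To see what those four columns measure, here is a minimal pure-Python sketch that computes them by hand for a made-up set of labels. `per_class_report` is a hypothetical helper for illustration, not the sklearn function, and the demo labels are invented.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each label in y_true."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
        fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
        precision = float(tp) / (tp + fp) if (tp + fp) else 0.0
        recall = float(tp) / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        report[label] = (precision, recall, f1, tp + fn)  # support = true count
    return report


y_true_demo = [0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2]
demo_report = per_class_report(y_true_demo, y_pred_demo)
accuracy = sum(1 for t, p in zip(y_true_demo, y_pred_demo) if t == p) / float(len(y_true_demo))

for label in sorted(demo_report):
    print("class %d: precision=%.2f recall=%.2f f1=%.2f support=%d"
          % ((label,) + demo_report[label]))
print("accuracy=%.2f" % accuracy)
```

In this toy example class 0 has precision 1.0 but recall 0.5: every prediction of class 0 was right, but only half of the true class 0 records were found. Accuracy (4 of 6 correct) hides that difference.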

In [None]:
from sklearn import metrics

print metrics.classification_report([sk_iris['target_names'][label] for label in y_test], 
                                    [sk_iris['target_names'][label] for label in myknn.predict(X_test)])

## Exercise

#### 1. How does the model perform as we increase the number of neighbors?  To answer this, plot the score as a function of the number of neighbors.

In [66]:
# Create a list of the various numbers of neighbors to use to build models
# Create training and test sets
# Iterate through that list and for each number of neighbors:
#    Build a KNN model
#    Evaluate it
#    Record the score with the number of neighbors for that model
# Plot results
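
The outline above can be sketched roughly as follows. This is a minimal pure-Python sketch under stated assumptions: it uses synthetic two-class data rather than iris, and `knn_predict` and `knn_score` are hypothetical helpers written for illustration, not sklearn functions. Only odd neighbor counts are tried so that a two-class vote cannot tie.

```python
import random
from collections import Counter


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def knn_score(X_train, y_train, X_test, y_test, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(1 for x, t in zip(X_test, y_test)
               if knn_predict(X_train, y_train, x, k) == t)
    return float(hits) / len(y_test)


# Synthetic two-class data (an assumption for this sketch, not the iris data).
rng = random.Random(0)
points = [([rng.gauss(c, 1.0), rng.gauss(c, 1.0)], label)
          for label, c in ((0, 0.0), (1, 3.0)) for _ in range(50)]
rng.shuffle(points)
test, train = points[:20], points[20:]
Xk_train, yk_train = [p for p, _ in train], [l for _, l in train]
Xk_test, yk_test = [p for p, _ in test], [l for _, l in test]

# Record (k, score) for each candidate number of neighbors.
neighbor_counts = [1, 3, 5, 7, 9, 11]
results = [(k, knn_score(Xk_train, yk_train, Xk_test, yk_test, k))
           for k in neighbor_counts]
for k, score in results:
    print("k=%d score=%.3f" % (k, score))
```

The `results` pairs could then be handed to a plotting library such as matplotlib, with the number of neighbors on the x-axis and the score on the y-axis.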

In [67]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
from sklearn.neighbors import KNeighborsClassifier

In [68]:
sk_iris = datasets.load_iris()

In [69]:
X = sk_iris.data #features
y = sk_iris.target #class labels
Names = sk_iris.target_names

In [70]:
ind = range(150) 

np.random.shuffle(ind)

test_ind = ind[:150/5] 

train_ind = ind[150/5:]

In [71]:
# Create a list of the various numbers of neighbors to use to build models
# Create training and test sets

X_train = []
y_train = []

X_test = []
y_test = []


## for each index in the list, grab the matching row and label and append them
## (the loop variable is named i so it does not overwrite the ind list above)
for i in test_ind:
    X_test.append(X[i])
    y_test.append(y[i])

for i in train_ind:
    X_train.append(X[i])
    y_train.append(y[i])

In [72]:
print X_test
np.shape(X_test)

[array([ 5.6,  3. ,  4.1,  1.3]), array([ 7.7,  3.8,  6.7,  2.2]), array([ 6.4,  3.1,  5.5,  1.8]), array([ 6.4,  2.7,  5.3,  1.9]), array([ 7.3,  2.9,  6.3,  1.8]), array([ 6.5,  3. ,  5.8,  2.2]), array([ 5. ,  3.2,  1.2,  0.2]), array([ 6.3,  2.3,  4.4,  1.3]), array([ 4.4,  3. ,  1.3,  0.2]), array([ 5.7,  2.9,  4.2,  1.3]), array([ 6.7,  3.1,  4.7,  1.5]), array([ 5.6,  3. ,  4.5,  1.5]), array([ 6.3,  2.5,  4.9,  1.5]), array([ 6. ,  3. ,  4.8,  1.8]), array([ 6.1,  2.8,  4. ,  1.3]), array([ 6.1,  2.6,  5.6,  1.4]), array([ 5.8,  2.7,  4.1,  1. ]), array([ 4.9,  3.1,  1.5,  0.1]), array([ 4.8,  3. ,  1.4,  0.3]), array([ 4.9,  3.1,  1.5,  0.1]), array([ 5.4,  3.4,  1.5,  0.4]), array([ 4.8,  3.4,  1.9,  0.2]), array([ 4.9,  3. ,  1.4,  0.2]), array([ 7.4,  2.8,  6.1,  1.9]), array([ 6.3,  3.4,  5.6,  2.4]), array([ 5.9,  3.2,  4.8,  1.8]), array([ 5.8,  2.6,  4. ,  1.2]), array([ 5.8,  4. ,  1.2,  0.2]), array([ 4.6,  3.4,  1.4,  0.3]), array([ 5.5,  2.3,  4. ,  1.3])]


(30, 4)

In [73]:
print y_test
np.shape(y_test)

[1, 2, 2, 2, 2, 2, 0, 1, 0, 1, 1, 1, 1, 2, 1, 2, 1, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 0, 0, 1]


(30,)

In [74]:
print X_train
np.shape(X_train)

[array([ 5.8,  2.7,  5.1,  1.9]), array([ 5. ,  2. ,  3.5,  1. ]), array([ 6.7,  3.3,  5.7,  2.5]), array([ 5.7,  2.8,  4.1,  1.3]), array([ 4.6,  3.6,  1. ,  0.2]), array([ 6.7,  2.5,  5.8,  1.8]), array([ 4.8,  3.4,  1.6,  0.2]), array([ 7.2,  3.6,  6.1,  2.5]), array([ 6. ,  2.2,  4. ,  1. ]), array([ 6.5,  3.2,  5.1,  2. ]), array([ 6.9,  3.1,  4.9,  1.5]), array([ 6. ,  2.7,  5.1,  1.6]), array([ 6.1,  3. ,  4.9,  1.8]), array([ 6.4,  2.8,  5.6,  2.2]), array([ 6.3,  3.3,  4.7,  1.6]), array([ 5.5,  2.4,  3.7,  1. ]), array([ 5. ,  3.6,  1.4,  0.2]), array([ 5.7,  4.4,  1.5,  0.4]), array([ 4.7,  3.2,  1.6,  0.2]), array([ 5.3,  3.7,  1.5,  0.2]), array([ 5.7,  2.6,  3.5,  1. ]), array([ 6.8,  3. ,  5.5,  2.1]), array([ 6.7,  3.1,  4.4,  1.4]), array([ 7.1,  3. ,  5.9,  2.1]), array([ 6.1,  2.8,  4.7,  1.2]), array([ 5.1,  3.3,  1.7,  0.5]), array([ 5.4,  3.9,  1.3,  0.4]), array([ 5.4,  3. ,  4.5,  1.5]), array([ 5. ,  3.5,  1.6,  0.6]), array([ 5.8,  2.8,  5.1,  2.4]), array([ 6

(120, 4)

In [75]:
print y_train
np.shape(y_train)

[2, 1, 2, 1, 0, 2, 0, 2, 1, 2, 1, 1, 2, 2, 1, 1, 0, 0, 0, 0, 1, 2, 1, 2, 1, 0, 0, 1, 0, 2, 2, 1, 0, 0, 1, 0, 1, 1, 2, 2, 0, 0, 0, 2, 1, 1, 1, 2, 0, 0, 0, 0, 2, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 2, 1, 0, 2, 2, 2, 0, 0, 0, 0, 2, 1, 1, 1, 1, 0, 0, 2, 2, 0, 0, 1, 1, 2, 0, 0, 2, 1, 1, 2, 1, 2, 1, 2, 2, 0, 0, 2, 1, 0, 1, 2, 0, 2, 1, 2, 1, 0, 2, 1, 2, 2, 0, 2, 1, 2, 1]


(120,)

In [82]:
myknn = KNeighborsClassifier(11).fit(X_train,y_train)

In [83]:
predictions = myknn.predict(X_test)
print predictions
type(predictions)

[1 2 2 2 2 2 0 1 0 1 1 1 1 2 1 2 1 0 0 0 0 0 0 2 2 2 1 0 0 1]


numpy.ndarray

In [84]:
correct = 0

for tumble,weed in zip(y_test,myknn.predict(X_test)):
    if tumble == weed:
        correct += 1
    else:
        pass

print "Number correct:",correct
print "Score:",float(correct)/len(y_test)

Number correct: 29
Score: 0.966666666667


#### 2. Do different train/test splits affect our score (accuracy)? How much do the scores vary each time you shuffle and split?

In [85]:
# Yes, different train/test splits affect score (accuracy)
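
One way to see how much the score moves is to repeat the shuffle and split several times and compare the results. The sketch below is a minimal pure-Python illustration under stated assumptions: the data are synthetic one-dimensional values from two overlapping classes (not iris), and `one_nn_score` is a hypothetical helper that scores a 1-nearest-neighbor rule.

```python
import random


def one_nn_score(train, test):
    """Accuracy of a 1-nearest-neighbor rule on (value, label) pairs."""
    hits = sum(1 for v, l in test
               if min(train, key=lambda r: abs(r[0] - v))[1] == l)
    return float(hits) / len(test)


# Overlapping synthetic classes (assumed data, not iris) so that mistakes occur.
data_rng = random.Random(1)
records = [(data_rng.gauss(center, 1.0), label)
           for label, center in ((0, 0.0), (1, 2.0)) for _ in range(75)]

scores = []
for seed in range(10):
    shuffled = records[:]                  # copy so each run starts from the same data
    random.Random(seed).shuffle(shuffled)  # a different shuffle for each seed
    test, train = shuffled[:30], shuffled[30:]
    scores.append(one_nn_score(train, test))

print("min=%.3f max=%.3f mean=%.3f"
      % (min(scores), max(scores), sum(scores) / len(scores)))
```

The gap between the minimum and maximum gives a rough sense of how much a single train/test split can mislead, which is the motivation for averaging over several splits.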

In [86]:
### Using same dataset, different test/train percentage splits ###

In [87]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split  # was sklearn.cross_validation before scikit-learn 0.18
from sklearn.neighbors import KNeighborsClassifier

In [None]:
sk_iris = datasets.load_iris()

In [None]:
X = sk_iris.data #features
y = sk_iris.target #class labels
Names = sk_iris.target_names

In [None]:
ind = range(150) 

np.random.shuffle(ind)

test_ind = ind[:150/5] 

train_ind = ind[150/5:]