## Class 9
### 29 June 2015

# *Bias-Variance Tradeoff*
- Black line is not the decision boundary but the ideal distinction?
- Low k
    - high variance
        - how much the prediction for a particular point varies between realizations of the model
        - with low k, there's a lot of change
    - low bias: the model matches the training data closely
    - overfitting: the model tries to match every training observation rather than the underlying signal
- High k
    - high bias: the model doesn't really capture the signal present in the training data
    - low variance: the color distribution generally doesn't change much between realizations of the model
- How do we choose the correct model?
- By changing a tuning parameter (e.g. k) you change the complexity of the model, thus changing the tradeoff between bias and variance
    - Finding the optimum model complexity
    - For KNN, a low value of the tuning parameter (k) means higher model complexity
        - lower k makes the predictions more complex (less smooth than with larger k values, which introduce more bias)
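
To make the variance idea concrete, here is a minimal from-scratch sketch (not scikit-learn's implementation). It assumes a made-up 1-D dataset in which the true class is 1 when x > 0.5 and 20% of labels are flipped as noise. The helper names `make_data`, `knn_predict`, and `disagreement` are hypothetical. It trains on many independently drawn training sets and, as one simple proxy for variance, measures how often the prediction at a fixed query point differs from the majority prediction across those realizations.

```python
import random
from collections import Counter

def make_data(rng, n=60, noise=0.2):
    """Draw a toy 1-D training set: class 1 if x > 0.5, with some labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label  # flip the label to simulate noise
        data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def disagreement(k, n_models=50, seed=0):
    """Average share of models that disagree with the majority prediction."""
    rng = random.Random(seed)
    # Each "model" is a KNN fit on its own freshly drawn training set
    models = [make_data(rng) for _ in range(n_models)]
    queries = [i / 20 for i in range(21)]
    total = 0.0
    for q in queries:
        preds = [knn_predict(m, q, k) for m in models]
        majority_count = Counter(preds).most_common(1)[0][1]
        total += 1 - majority_count / n_models
    return total / len(queries)

low_k = disagreement(k=1)
high_k = disagreement(k=15)
print("k=1 :", round(low_k, 3))
print("k=15:", round(high_k, 3))
```

Under these assumptions, the k=1 predictions change far more from one training set to the next than the k=15 predictions, which is the high-variance behavior described above.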

# Model Evaluation

- Create a procedure that *estimates* how well a model is likely to perform on out-of-sample data and use that to choose between models
- k = 1 generally does not "generalize" because it fits the noise too much
- want a model that best generalizes

### *Train and test on entire data set* ###
1. Train the model on the entire dataset
2. Test the model on the same data, and evaluate it by comparing the predicted response values with the actual response values.

In [3]:
# read the iris data into a DataFrame
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)

In [4]:
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [5]:
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]

In [6]:
# store response vector in "y"
y = iris.species_num

In [7]:
X.shape

(150, 4)

In [8]:
# X is our new data frame, with a shape of 150 x 4 bc we want to use those 4 features

**KNN (K=50)**

In [9]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model
knn = KNeighborsClassifier(n_neighbors=50)

# train the model on the entire dataset
knn.fit(X, y)

# predict the response values for the observations in X ("test the model")
# makes predictions for all 150
knn.predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)

In [10]:
# store the predicted response values
y_pred = knn.predict(X)

**Evaluation Metric**
- a numeric calculation to quantify the performance of the model
- choose one based on goals of problem
- most common choice for classification:
    - Classification accuracy: a reward function, % of correct predictions --> something you want to maximize
    - Classification error: loss function, % of incorrect predictions --> something you want to minimize
    - error = (1 - accuracy)
- we're using accuracy
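
As a quick illustration of the two metrics, here is a minimal from-scratch sketch. The function names `accuracy` and `error` and the label lists are made up for this example. It only shows the arithmetic behind the definitions above and is not how scikit-learn implements its metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred)
                  if actual == predicted)
    return correct / len(y_true)

def error(y_true, y_pred):
    """Fraction of predictions that do not match the actual labels."""
    return 1 - accuracy(y_true, y_pred)

# Made-up labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

acc = accuracy(y_true, y_pred)  # 6 of the 8 predictions match
err = error(y_true, y_pred)
print(acc, err)
```

Both functions always need the actual and the predicted labels, in the same order.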

In [11]:
# compute classification accuracy
# within metrics, there are a lot of different metrics available
# always need y and y_pred (actual vs predicted)
from sklearn import metrics
print(metrics.accuracy_score(y, y_pred))

0.94


With k = 50, 94% of the model's predictions are correct.

This is known as **training accuracy** because the model is tested on the same data that was used to build it.

In [12]:
# Trying KNN with k = 1

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_pred = knn.predict(X)
print(metrics.accuracy_score(y, y_pred))

1.0


But we know that k=1 is not the best value for k: despite having low bias, it has high variance.
- Of course it has low bias. It's predicting on the exact same data it was trained on, with a very complex model, so the predicted responses will match (or nearly match) the observed responses

Training accuracy: rewards overly complex models that won't necessarily generalize
- Unnecessarily complex models overfit the training data
- Learns the "noise" rather than the "signal"
- A model that is too complex learns the quirks of your training data
- Not a good estimate of out-of-sample accuracy
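
The following from-scratch sketch shows why training accuracy can mislead. It assumes made-up 2-D points whose labels are assigned by coin flip, so there is no signal to learn at all. The helper `nearest_neighbor_predict` is hypothetical and is not scikit-learn's implementation. With k = 1, each training point is its own nearest neighbor, so the model reproduces every training label.

```python
import random

def nearest_neighbor_predict(train_points, train_labels, x):
    """Return the label of the single training point closest to x (k = 1)."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_points)),
               key=lambda i: squared_distance(train_points[i], x))
    return train_labels[best]

rng = random.Random(42)

# Made-up training data: random points, labels are pure noise (coin flips)
points = [(rng.random(), rng.random()) for _ in range(100)]
labels = [rng.randint(0, 1) for _ in range(100)]

# Training accuracy: predict on the same points the model memorized
train_preds = [nearest_neighbor_predict(points, labels, p) for p in points]
train_acc = sum(p == t for p, t in zip(train_preds, labels)) / len(labels)

# Fresh points generated the same way, never seen by the model
new_points = [(rng.random(), rng.random()) for _ in range(100)]
new_labels = [rng.randint(0, 1) for _ in range(100)]
new_preds = [nearest_neighbor_predict(points, labels, p) for p in new_points]
new_acc = sum(p == t for p, t in zip(new_preds, new_labels)) / len(new_labels)

print("training accuracy:", train_acc)
print("accuracy on new points:", new_acc)
```

The training accuracy is perfect even though the labels are noise, while the accuracy on new points is no better than guessing in expectation. That gap is what makes training accuracy a poor estimate of out-of-sample accuracy.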

### *Train/test split* ###

1. Split the data into two pieces: training and testing
2. Train the model on the training set
3. Test the model on the testing set: how well did we do?
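
The splitting step can be sketched without any library. This is a minimal illustration under stated assumptions: the function name `simple_train_test_split`, its `seed` parameter, and the toy data are hypothetical, and it is not how scikit-learn's `train_test_split` is implemented. It shuffles the row indices, holds out a `test_size` fraction for testing, and keeps each row paired with its label.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.4, seed=0):
    """Shuffle the indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

# Made-up data: ten "observations" whose label is 1 for odd values
rows = list(range(10))
labels = [r % 2 for r in rows]

train_rows, test_rows, train_labels, test_labels = simple_train_test_split(
    rows, labels, test_size=0.4)

print(len(train_rows), "training rows;", len(test_rows), "testing rows")
```

A model would then be fit on `train_rows` and `train_labels` only and scored on `test_rows` and `test_labels`, so that the evaluation uses data the model has not seen.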

In [13]:
## "unpacking"
def min_max(nums):
    # return a list of the smallest and largest value
    smallest = min(nums)
    largest = max(nums)
    return [smallest, largest]

In [14]:
min_and_max = min_max([1, 2, 3])
print(min_and_max)
print(type(min_and_max))

[1, 3]
<type 'list'>


In [16]:
# this is the unpacking: the returned list can be unpacked into separate variables
the_min, the_max = min_max([1, 2, 3])
print(the_min)
print(type(the_min))
print(the_max)
print(type(the_max))

1
<type 'int'>
3
<type 'int'>


In [17]:
# using the train-test-split function
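# (older scikit-learn versions exposed this function in sklearn.cross_validation,
# which has since been removed in favor of sklearn.model_selection)
from sklearn.model_selection import train_test_split
print(train_test_split(X, y, test_size=0.4))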

[array([[ 4.9,  3.1,  1.5,  0.1],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 6.7,  3.3,  5.7,  2.1],
       [ 5.7,  3. ,  4.2,  1.2],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 5.6,  3. ,  4.1,  1.3],
       [ 6.4,  3.2,  5.3,  2.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.2,  2.2,  4.5,  1.5],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 5.8,  2.6,  4. ,  1.2],
       [ 7.7,  3. ,  6.1,  2.3],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 4.4,  3. ,  1.3,  0.2],
       [ 4.6,  3.4,  1.4,  0.3],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5.1,  3.8,  1.9,  0.4],
       [ 5.5,  2.3,  4. ,  1.3],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 5.1,  3.5,  1.4,  0.2],
       [ 6.3,  2.5,  5. ,  1.9],
       [ 6.8,  3. ,  5.5,  2.1],
       [ 5.9,  3. ,  5.1,  1.8],
       [ 7.9,  3.8,  6.4,  2. ],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 