# Bias Variance

## Import stuff

We'll need: pandas, numpy, matplotlib, sklearn.

In [143]:
import pandas as pd
import numpy as np
import matplotlib as plt
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import neighbors



## Get the data

Dataset URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

In [121]:
a = pd.read_csv('iris.csv')
c = a['species']
a

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa


## Import stuff for Train-Test Split

We'll need: train_test_split

In [122]:
from sklearn.model_selection import train_test_split

Determine X_train, X_test, y_train, y_test using sklearn.

In [123]:
# determine X and y
X = a.drop(['species'],axis=1)
y = a['species']

# split into train and test
X_train,X_test,y_train,y_test = train_test_split(X,y)
print(np.shape(X_train))

(112, 4)


## Import stuff for kNN

We'll need: KNeighborsClassifier, accuracy_score.

In [124]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

## High Bias

Let's start by training a K Nearest Neighbors classifier with high bias.

In [126]:
classifier = KNeighborsClassifier(n_neighbors=110)
classifier.fit(X_train,y_train)
y_predicted = classifier.predict(X_test)

accuracy_score(y_test, y_predicted)

0.5

#### What happened here?

With a super high number of n_numbers (110 in this case), the classifier has a very high bias. Therefore, the overall accuracy is very low, or 47.3%. Too much bias results in underfitting. 

Now let's train a K Nearest Neighbors classifier with high variance.

In [129]:
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(X_train,y_train)
y_predicted = classifier.predict(X_test)

accuracy_score(y_test, y_predicted)

0.9736842105263158

#### What happened here?

Since the n_neighbors value is super low, the variance is very high. However, the data is still quite accurate, with an accuracy score of 92%. Optimizing on only 1 nearest point means that the noise is modeled very high. Shuffling the data will greatly change the model every time. 

Can we do better? Try experimenting with the value of K.

In [144]:
for k in range(1,110): 
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(X_train,y_train)
    y_predicted = classifier.predict(X_test)
    a.append([k,accuracy_score(y_test,y_predicted)])
    y.append(accuracy_score(y_test,y_predicted))
    

x = np.reshape(a, (109,2))

plt.plot(range(1,110),y)

#the maximum accuracy is when k=2; look at matrix "x"

classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(X_train,y_train)
y_predicted = classifier.predict(X_test)

accuracy_score(y_test, y_predicted)

  result = result.union(other)


TypeError: cannot concatenate object of type "<class 'numpy.float64'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

#### What happened here?

The maximum accuracy_score was recieved when k=2. 

#### Explain the Bias-Variance tradeoff.

This tradeoff represents the struggle to find a value that minimizes both sources of error - bias and variance. With a high variance, the bias is near 0, which means that we aren't doing a good job of predicting the data values in the test data. With a very high bias, we aren't correctly finding trends in the data and are making a super general model.

#### How do bias and variance affect training and testing error?

High variance means that the data is overfitted; it doesn't fit well on a cross-validation set. A high bias would imply underfitting, where the model is not fit well to basically any data. High variance implies a low training error, but high testing error. Bias is vice-cersa. 

# K-Fold Cross Validation

#### What is K-Fold Cross Validation?

KFCV splits a group into k groups. Cross-validation is used to estimate a model's skill on unseen data. There's always a tradeoff when you want more test sets, which would mean less training sets. 

If you have 200 data points, split it into 10 bins, so 20 points per bin. Run k separate learning experiments. Split each 20 into test and training sets. Average out the test results from the k experiments. 

#### How can we use K-Fold Cross Validation to determine the optimal value of K to use in kNN?

First find a value that lowers the variance. Don't choose a very large K otherwise that would limit the number of iterations that are possible. A larger K means less bias towards overestimating but a higher variance (and less efficient). 

## Import stuff for K-Fold Cross Validation

We'll need: cross_val_score.

In [132]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Now let's use cross validation to tune the hyperparameter K (the number of neighbors to consider). We want to choose the value of K that minimizes the test error.

In [148]:
# list of possible K-values for KNN
k_values = [i for i in range(1, 50)]

X_train,X_test,y_train,y_test = train_test_split(X,y)



Plot the accuracy of the classifier for each value of K.

In [154]:
e = []
f = []
average = []
for i in k_values:
    KNC = neighbors.KNeighborsClassifier(n_neighbors = i)
    KNC.fit(X_train,y_train)
    e.append(i)
    f.append(KNC.score(X_test,y_test))
    best = cross_val_score(KNC,X_test,y_test)
    average.append(np.mean(best))

plt.plot(e, average)

ValueError: Expected n_neighbors <= n_samples,  but n_samples = 25, n_neighbors = 26

What is the optimal value of K found using 20-fold cross validation?

In [156]:
#10 is generally an optimal value

#### Explain stratified cross validation.

This involves selecting folds so that the mean response value of the folds is approximately equal in all the folds. 

# Grid Search

#### What is Grid Search?

Grid Search involves having a set of models with different parameter values, and then evaluating them with a cross-validation to select the most accurate one. 

## Import stuff for GridSearch

We'll need: GridSearchCV.

In [157]:
from sklearn.model_selection import GridSearchCV
from sklearn import svm, datasets


Try using GridSearch to determine the optimal value of K.

In [161]:
# list of possible k values for KNN
k_values = list(range(1, 50))
parameters = {"n_neighbors":k_values}

clf = GridSearchCV(KNC, parameters)
clf.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=26, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)