# KNN
<br> 1. Load the IRIS dataset. It contains 150 samples, 50 for each of three iris flower varieties. Each row contains four features (length and width of sepals and petals in cm).
<br> 2. Build a classification model that takes the four features as input and predicts the species of IRIS flower (setosa/versicolor/virginica).

In [1]:
# Import packages
from matplotlib import pyplot as plt
import numpy as np

# Import IRIS dataset
from sklearn import datasets
iris = datasets.load_iris()

# divide this data into features and labels
X = iris.data
Y = iris.target

print(f"X is of type: {type(X)}")
print(f"Y is of type: {type(Y)}")

# How does our data look
print(f"First 5 rows of our data: {X[:5, :]}")
print(f"Unique labels: {np.unique(Y)}")

X is of type: <class 'numpy.ndarray'>
Y is of type: <class 'numpy.ndarray'>
First 5 rows of our data: [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Unique labels: [0 1 2]
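The integer labels map to species names, and the dataset object also carries the feature names. A small sketch of how to inspect them, using only attributes that `load_iris()` returns (`class_counts` is an illustrative variable name, not part of the original notebook):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Overall shape: one row per flower, one column per feature
print(f"Data shape: {X.shape}")

# Human-readable names for the four feature columns and the three labels
print(f"Feature names: {iris.feature_names}")
print(f"Class names: {list(iris.target_names)}")

# Number of samples per label (index 0, 1, 2)
class_counts = np.bincount(Y)
print(f"Samples per class: {class_counts}")
```

Label `i` in `Y` corresponds to `iris.target_names[i]`.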


## Let's split our data into 80% training and 20% testing.

In [2]:
from sklearn.model_selection import train_test_split

# Split the data into 80% training and 20% testing sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=42)

print(f"Xtrain: {Xtrain.shape}")
print(f"Ytrain: {Ytrain.shape}")
print(f"Xtest: {Xtest.shape}")
print(f"Ytest: {Ytest.shape}")

print(f"First 5 rows of Xtrain: {Xtrain[:5, :]}")

Xtrain: (120, 4)
Ytrain: (120,)
Xtest: (30, 4)
Ytest: (30,)
First 5 rows of Xtrain: [[4.6 3.6 1.  0.2]
 [5.7 4.4 1.5 0.4]
 [6.7 3.1 4.4 1.4]
 [4.8 3.4 1.6 0.2]
 [4.4 3.2 1.3 0.2]]
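A plain random split does not guarantee that each species is equally represented in the test set. As an optional variation (not what the notebook uses), `train_test_split` accepts a `stratify` argument that preserves class proportions. A minimal sketch, with `Xtr`, `Xte`, `Ytr`, `Yte` as illustrative names so the notebook's own variables are not overwritten:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)

# stratify=Y keeps the 50/50/50 class balance in both subsets
Xtr, Xte, Ytr, Yte = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

train_counts = np.bincount(Ytr)
test_counts = np.bincount(Yte)
print(f"Training samples per class: {train_counts}")
print(f"Testing samples per class: {test_counts}")
```

With only 30 test samples, a balanced test set makes the per-class results easier to compare.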


## Build classification model, train, test and measure accuracy.

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Build a KNN classifier that votes among the 5 nearest neighbors
knn_model = KNeighborsClassifier(n_neighbors=5)

# Fit the model using our training data
knn_model.fit(Xtrain, Ytrain)

# Training accuracy
knn_acc = metrics.accuracy_score(Ytrain, knn_model.predict(Xtrain))
print(f"KNN training accuracy: {knn_acc}")


KNN training accuracy: 0.9666666666666667


In [4]:
# Perform testing and measure test accuracy.
knn_acc_test = metrics.accuracy_score(Ytest, knn_model.predict(Xtest))
print(f"KNN test accuracy: {knn_acc_test}")

KNN test accuracy: 1.0
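To make the idea behind KNN concrete, here is a rough NumPy sketch of the prediction step: compute the Euclidean distance from each query point to every training point, take the k closest, and return the majority label. `knn_predict` is a hypothetical helper written for illustration; it is not how scikit-learn implements `KNeighborsClassifier` internally, and tie-breaking may differ.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split


def knn_predict(X_train, y_train, X_query, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training points for each query row
    nearest = np.argsort(dists, axis=1)[:, :k]
    neighbor_labels = y_train[nearest]

    # Majority vote; on a tie, argmax picks the smallest label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])


X, Y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

Ypred = knn_predict(Xtrain, Ytrain, Xtest, k=5)
manual_acc = np.mean(Ypred == Ytest)
print(f"Manual KNN test accuracy: {manual_acc}")
```

KNN has no real training phase: `fit` only stores the data, and all the work happens at prediction time.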


# SVM - Support Vector Machine
<br> Same as before -
<br> 1. Import libraries and dataset
<br> 2. Train-test split
<br> 3. Extend the previous code by building an SVM classification model for this data using a linear kernel.

In [5]:
# Let's start with a linear kernel with the default parameters.

from sklearn import svm
from sklearn import metrics

# Fit an SVM using linear kernel
svm_model = svm.SVC(kernel='linear')
svm_model.fit(Xtrain, Ytrain)

# Training accuracy
svm_acc = metrics.accuracy_score(Ytrain, svm_model.predict(Xtrain))
print(f"SVM training accuracy: {svm_acc}")


SVM training accuracy: 0.975


In [6]:
# The support vectors are available through the support_vectors_ attribute of the fitted SVM model.

print("Support vector type:", type(svm_model.support_vectors_))
print(f"Support vector shape: {svm_model.support_vectors_.shape}")

print(f"Data has a total of {svm_model.support_vectors_.shape[0]} support vectors.")

Support vector type: <class 'numpy.ndarray'>
Support vector shape: (25, 4)
Data has a total of 25 support vectors.
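Besides `support_vectors_`, a fitted `SVC` also exposes `n_support_` (number of support vectors per class) and `support_` (their row indices in the training data). A short sketch, assuming the same split and linear kernel as above:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

svm_model = svm.SVC(kernel='linear')
svm_model.fit(Xtrain, Ytrain)

# How many support vectors each class contributes
print(f"Support vectors per class: {svm_model.n_support_}")

# Row indices of the support vectors within Xtrain
print(f"First 5 support vector indices: {svm_model.support_[:5]}")

# Indexing Xtrain with support_ recovers the support vectors themselves
recovered = Xtrain[svm_model.support_]
print(f"Matches support_vectors_: {np.allclose(recovered, svm_model.support_vectors_)}")
```

The per-class counts show which species lie closest to the decision boundaries.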


In [7]:
# Test accuracy
svm_acc_test = metrics.accuracy_score(Ytest, svm_model.predict(Xtest))

print(f"SVM test accuracy: {svm_acc_test}")

SVM test accuracy: 1.0


## SVM: Polynomial kernel
<!-- <br> See in presentation, IRIS dataset class 1 and class 2 are not linearly separable.
<br> So, we remove out class 0 from X (data) and Y (label).
<br> Numpy __in1d()__ function can be used to do that. -->

In [10]:
# Train with a degree-2 polynomial kernel
svc_model = svm.SVC(kernel='poly', degree=2, gamma='auto')
svc_model.fit(Xtrain, Ytrain)

# Training/testing accuracy
svc_acc = metrics.accuracy_score(Ytrain, svc_model.predict(Xtrain))
print(f"SVM-poly training accuracy: {svc_acc}")

svc_acc_test = metrics.accuracy_score(Ytest, svc_model.predict(Xtest))
print(f"SVM-poly testing accuracy: {svc_acc_test}")

SVM-poly training accuracy: 0.9833333333333333


## SVM: RBF kernel


In [13]:
# Train with an RBF kernel (the degree parameter only applies to the poly kernel)
svc_model = svm.SVC(kernel='rbf', gamma='auto')
svc_model.fit(Xtrain, Ytrain)

# Training/testing accuracy
svc_acc = metrics.accuracy_score(Ytrain, svc_model.predict(Xtrain))
print(f"SVM-rbf training accuracy: {svc_acc}")

svc_acc_test = metrics.accuracy_score(Ytest, svc_model.predict(Xtest))
print(f"SVM-rbf testing accuracy: {svc_acc_test}")

SVM-rbf training accuracy: 0.9916666666666667
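To compare the three kernels side by side, the cells can be folded into one loop. A minimal sketch, assuming the same 80/20 split; `kernel_configs` and `results` are illustrative names that do not appear in the original notebook:

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

# Keyword arguments for each kernel, matching the settings used in this notebook
kernel_configs = {
    "linear": {"kernel": "linear"},
    "poly": {"kernel": "poly", "degree": 2, "gamma": "auto"},
    "rbf": {"kernel": "rbf", "gamma": "auto"},
}

results = {}
for name, params in kernel_configs.items():
    model = svm.SVC(**params)
    model.fit(Xtrain, Ytrain)

    # Score the training and testing splits separately
    train_acc = metrics.accuracy_score(Ytrain, model.predict(Xtrain))
    test_acc = metrics.accuracy_score(Ytest, model.predict(Xtest))
    results[name] = (train_acc, test_acc)
    print(f"SVM-{name}: training accuracy {train_acc:.4f}, testing accuracy {test_acc:.4f}")
```

Keeping the two accuracies in separate variables and printing both from the same tuple avoids mixing them up.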


## Effect of C parameter on SVM
<br> C is a __regularization parameter__ that controls the trade-off between achieving a low training error and keeping a wide margin, which affects how well the classifier generalizes to unseen data.
<br> The strength of the regularization is inversely proportional to C. C must be strictly positive. The penalty is a squared l2 penalty.
<br> C is 1 by default.
<br> For noisy data, C should have a lower value (more regularization, a wider margin, and more support vectors).
<br> A high C value corresponds to a high penalty for misclassification, so the learnt margin will be narrow, resulting in a small number of support vectors.
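For reference, the role of C can be read from the standard soft-margin objective for a binary linear SVM. This is stated as background only: scikit-learn's `SVC` solves the equivalent dual problem and handles the three IRIS classes with a one-vs-one scheme.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0
$$

Here $\xi_i$ measures how far sample $i$ violates the margin. A large C makes violations expensive, so the optimizer accepts a narrower margin to fit the training data. A small C makes violations cheap, so the margin widens and more samples end up on or inside it as support vectors.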

In [19]:
# Higher C value.

svc_model = svm.SVC(kernel='linear', C=1e2)
svc_model.fit(Xtrain, Ytrain)

# Support vectors
print(f"Data has a total of {svc_model.support_vectors_.shape[0]} support vectors.")

# Training/testing accuracy
svc_acc = metrics.accuracy_score(Ytrain, svc_model.predict(Xtrain))
print(f"SVM-linear training accuracy: {svc_acc}")

svc_acc_test = metrics.accuracy_score(Ytest, svc_model.predict(Xtest))
print(f"SVM-linear testing accuracy: {svc_acc_test}")

Data has a total of 14 support vectors.
SVM-linear training accuracy: 0.975


In [20]:
# Lower C value.

svc_model = svm.SVC(kernel='linear', C=1e-2)
svc_model.fit(Xtrain, Ytrain)

# Support vectors
print(f"Data has a total of {svc_model.support_vectors_.shape[0]} support vectors.")

# Training/testing accuracy
svc_acc = metrics.accuracy_score(Ytrain, svc_model.predict(Xtrain))
print(f"SVM-linear training accuracy: {svc_acc}")

svc_acc_test = metrics.accuracy_score(Ytest, svc_model.predict(Xtest))
print(f"SVM-linear testing accuracy: {svc_acc_test}")

Data has a total of 101 support vectors.
SVM-linear training accuracy: 0.925
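The same experiment can be run as a sweep over several C values to see the trend in one place. A minimal sketch, assuming the same split and a linear kernel; `c_values` and `n_sv` are illustrative names, not part of the original notebook:

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

X, Y = datasets.load_iris(return_X_y=True)
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

c_values = [1e-2, 1.0, 1e2]
n_sv = {}  # maps each C to the number of support vectors found

for C in c_values:
    model = svm.SVC(kernel='linear', C=C)
    model.fit(Xtrain, Ytrain)
    n_sv[C] = model.support_vectors_.shape[0]

    train_acc = metrics.accuracy_score(Ytrain, model.predict(Xtrain))
    test_acc = metrics.accuracy_score(Ytest, model.predict(Xtest))
    print(f"C={C:g}: {n_sv[C]} support vectors, "
          f"training accuracy {train_acc:.4f}, testing accuracy {test_acc:.4f}")
```

As C shrinks, the margin widens and the number of support vectors grows, which matches the counts reported by the two cells above.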
