#Simple Multiclass Classification with KNN

In this section we look at a very simple multiclass classification task: using the k-nearest neighbours (KNN) algorithm to identify species of iris flowers.

The Iris dataset is a well-known dataset containing 3 species of iris, with 50 examples of each class. The attributes recorded for each flower are sepal length, sepal width, petal length and petal width.

Before we get started we will need to import the following libraries:

1. NumPy - Fundamental package for scientific computing with Python
2. Pandas - Library providing high-performance, easy-to-use data structures and data analysis tools
3. SKLearn - Simple and efficient tools for data mining and data analysis


In [0]:
# Import all the necessary libraries
from __future__ import print_function
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

##Load Dataset
First of all we want to read the CSV file that contains the dataset we will be using.

ToDo:
- Read in the dataset 
- View the first few entries of the dataset

In [0]:
# Load the tabular data into the notebook
url = 'https://ai-camp-content.s3.amazonaws.com/IRIS.csv'

# Read the data from the URL
iris = pd.read_csv(url)

In [0]:
# View first few entries in the dataset
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


##Manipulating the Data

ToDo:

- Separate the X and Y values in the dataset.

Hint: remember that y is the single target feature (the species) and x holds all the other predictor features.

In [0]:
# Separate the X and Y values of the dataset
x = iris.iloc[:,:4]
y = iris.iloc[:,4]
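
Positional slicing with `iloc` depends on the column order of the file. As a minimal sketch of an alternative, the same separation can be done by column name. The small `sample` frame below is hypothetical data that only borrows the column names shown in the table above; it is not the real dataset.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the Iris CSV
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Predictors: every column except the target
x_sample = sample.drop(columns="species")
# Target: the single species column
y_sample = sample["species"]

print(x_sample.shape)     # (3, 4)
print(y_sample.tolist())
```

Selecting by name keeps working if the columns are reordered, which positional slicing does not.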

##Split the Data


ToDo:
- Split the data using SKLearn's train_test_split

In [55]:
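# Split the dataset into the Training set and Test set
# train_test_split returns the pieces in this order: x_train, x_test, y_train, y_test
# Note: the outputs recorded in this notebook come from a run with the train and
# test names swapped, so they show 120 evaluation rows rather than 30
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)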

37         Iris-setosa
105     Iris-virginica
19         Iris-setosa
47         Iris-setosa
57     Iris-versicolor
59     Iris-versicolor
58     Iris-versicolor
7          Iris-setosa
75     Iris-versicolor
84     Iris-versicolor
95     Iris-versicolor
55     Iris-versicolor
96     Iris-versicolor
70     Iris-versicolor
100     Iris-virginica
132     Iris-virginica
87     Iris-versicolor
137     Iris-virginica
73     Iris-versicolor
48         Iris-setosa
5          Iris-setosa
67     Iris-versicolor
101     Iris-virginica
135     Iris-virginica
92     Iris-versicolor
110     Iris-virginica
71     Iris-versicolor
9          Iris-setosa
46         Iris-setosa
114     Iris-virginica
            ...       
145     Iris-virginica
12         Iris-setosa
94     Iris-versicolor
62     Iris-versicolor
28         Iris-setosa
45         Iris-setosa
39         Iris-setosa
149     Iris-virginica
104     Iris-virginica
4          Iris-setosa
31         Iris-setosa
51     Iris-versicolor
99     Iris
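
The return order of `train_test_split` is easy to get wrong, so here is a small sketch that checks the resulting shapes. It assumes scikit-learn's bundled copy of Iris (`load_iris`) as a stand-in for the CSV, and the `random_state` and `stratify` arguments are additions for repeatability and balanced classes, not something the notebook above uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for the CSV file
features, labels = load_iris(return_X_y=True)

# Return order: train features, test features, train labels, test labels
x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=0,    # fixed seed so the split is repeatable
    stratify=labels,   # keep the class proportions in both parts
)

print(x_tr.shape, x_te.shape)   # (120, 4) (30, 4)
print(np.bincount(y_te))        # number of test rows per class
```

With 150 rows and `test_size=0.2`, the training set should hold 120 rows and the test set 30. If the shapes come out the other way round, the names on the left-hand side are in the wrong order.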

##Define and fit the model
We now need to create the classifier that we will use and train it with our data. 

ToDo:
- Define the model (KNeighborsClassifier)
- Fit the model
- Predict the species for the test data
- Find out how accurate our model was

In [0]:
# Create the KNN classifier
classifier = KNeighborsClassifier()


# Fit a model using the classifier and our training dataset
classifier.fit(x_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
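
To give a feel for what the classifier does at prediction time, here is a rough from-scratch sketch of the KNN idea using the same settings the output above reports (5 neighbours by default, Euclidean distance, uniform votes). The function `knn_predict` and the toy data are hypothetical illustrations; scikit-learn's actual implementation is more efficient and may break ties differently.

```python
from collections import Counter
import numpy as np

def knn_predict(train_x, train_y, query, k=5):
    """Predict one query point's label by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
toy_x = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
toy_y = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(toy_x, toy_y, np.array([1.1, 1.0]), k=3))
print(knn_predict(toy_x, toy_y, np.array([5.0, 5.1]), k=3))
```

Note that "fitting" a KNN model mostly means storing the training data; the real work happens when a prediction is requested.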

In [0]:
# Predict the species for the test dataset, then display the true labels for comparison
y_prediction = classifier.predict(x_test)

y_test

46         Iris-setosa
127     Iris-virginica
144     Iris-virginica
63     Iris-versicolor
78     Iris-versicolor
28         Iris-setosa
94     Iris-versicolor
77     Iris-versicolor
60     Iris-versicolor
29         Iris-setosa
72     Iris-versicolor
67     Iris-versicolor
118     Iris-virginica
24         Iris-setosa
10         Iris-setosa
138     Iris-virginica
135     Iris-virginica
44         Iris-setosa
3          Iris-setosa
91     Iris-versicolor
123     Iris-virginica
104     Iris-virginica
146     Iris-virginica
92     Iris-versicolor
53     Iris-versicolor
61     Iris-versicolor
71     Iris-versicolor
143     Iris-virginica
103     Iris-virginica
42         Iris-setosa
            ...       
102     Iris-virginica
59     Iris-versicolor
93     Iris-versicolor
17         Iris-setosa
119     Iris-virginica
34         Iris-setosa
40         Iris-setosa
26         Iris-setosa
51     Iris-versicolor
105     Iris-virginica
89     Iris-versicolor
14         Iris-setosa
48         

##Analysis of Model Performance
Now that we have our model, we need to look at how well it performed. A well-fitting model is one where the difference between the actual values and the predicted values is small and unbiased across the training, validation and test datasets.

ToDo:

- View the model's confusion matrix
- View the model's accuracy
- Use SKLearn's functions to evaluate precision, recall and f1-score

In [0]:
# View the model's confusion matrix
print("Confusion Matrix: \n%s" % metrics.confusion_matrix(y_test,y_prediction))
# The confusion matrix is 3x3 because there are 3 possible classes to predict

Confusion Matrix: 
[[38  0  0]
 [ 0 41  3]
 [ 0  2 36]]


###Scores

In [0]:
# View the model's accuracy
print('Accuracy: \n%s' % metrics.accuracy_score(y_test, y_prediction))

Accuracy: 
0.9583333333333334


In [0]:
# Use scikit-learn's built-in functions to evaluate precision, recall and f1-score:
print(metrics.classification_report(y_test, y_prediction))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        38
Iris-versicolor       0.95      0.93      0.94        44
 Iris-virginica       0.92      0.95      0.94        38

       accuracy                           0.96       120
      macro avg       0.96      0.96      0.96       120
   weighted avg       0.96      0.96      0.96       120
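
The scores above can be recomputed by hand from the confusion matrix, which is a useful way to see what each metric means. The sketch below copies the matrix printed earlier and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = np.array([[38, 0, 0],
               [0, 41, 3],
               [0, 2, 36]])

true_positives = np.diag(cm)                 # correct predictions per class
accuracy = true_positives.sum() / cm.sum()   # correct predictions / all predictions
precision = true_positives / cm.sum(axis=0)  # column sums = times each class was predicted
recall = true_positives / cm.sum(axis=1)     # row sums = actual members of each class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))
print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
```

The row sums (38, 44, 38) are the "support" column of the report, and the per-class values round to the same precision, recall and f1-score figures shown there.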

