# Basic classification model


**Classification** is a part of supervised learning.

A **classification** model learns from labelled examples and then predicts a label for new inputs based on the past data it has learned from.

One famous example, and probably the most popular, is "ham" or "spam" email filtering: the machine learns from existing emails and labels each new incoming one as "ham" (a legitimate email) or "spam".

In this tutorial we will discuss basic classification models used by nearly everyone who studies or works with machine learning.

There are many classification algorithms in use, but for beginners here are two (2) that are commonly used at square one of the journey into machine learning.

At the end of this tutorial we will test our models. You will get basic hands-on experience with scikit-learn and classification models, and a first taste of machine learning. Have fun and enjoy!

# What dataset we'll be using
In this tutorial we will be using the [iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset. It is a famous dataset that is well suited to beginner practice, and it ships with `scikit-learn`, so it is already available once the package is installed.

Using the classification models we build, we will see how well they can predict the flower type of new inputs based on their characteristics (Sepal Length, Sepal Width, Petal Length and Petal Width).
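
As a quick look at what the dataset contains, the sketch below loads it and prints the names of the four measurements and the three flower types. It assumes only that scikit-learn is installed; `feature_names` and `target_names` are attributes of the object returned by `load_iris()`.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Names of the four measurements (the inputs) and of the three classes (the labels).
print(iris.feature_names)
print(iris.target_names)
```

Note that scikit-learn spells the class names in lowercase (`setosa`, `versicolor`, `virginica`).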

# Importing the dataset and the packages we will be needing

In [1]:
# import numpy

import numpy as np
# import the iris dataset and save it

from sklearn import datasets
# import the function for splitting the data into training and test sets

from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Slicing the dataset for X (inputs) and y (labels)
Put simply, all the input values used for prediction are stored in one variable, here `X`, and all the labels (targets) that correspond to `X` are stored in `y`. For every row of `X` there is a matching value in `y`.

In this dataset `X` has a shape of `(150, 4)` (150 flowers with 4 measurements each) and `y` has a shape of `(150,)`.
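
The shapes can be checked directly. This is a minimal sketch that reloads the dataset so it runs on its own; it also prints the first row of `X` next to its label to show how the two line up.

```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)  # (150, 4): 150 flowers, 4 measurements each
print(y.shape)  # (150,): one label per flower

# The first flower's measurements and the label that belongs to it.
print(X[0], y[0])
```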

We will also split the data randomly into a training set and a test set; their sizes depend on the train size you set. The split matters because evaluating on data the model has already seen gives a biased picture, so a held-out test set is what really tells us whether our model is doing fine or not. The pieces are saved to train and test variables using scikit-learn's `train_test_split()`.

In [2]:
# slicing the dataset into our Xs and ys
X, y = iris.data, iris.target
# train_size=.7 keeps 70% for training; the remaining 30% will be our test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.7, shuffle=True)

# also, ignore the warning (if one appears)
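
Because the split is random, the rows printed in this tutorial will differ from run to run. If a repeatable split is wanted, one option is sketched below. It uses the `random_state` and `stratify` parameters of `train_test_split()`; the seed value `0` is an arbitrary choice for illustration, not something the tutorial itself uses.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state fixes the shuffle so the split is identical on every run;
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```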



# Visualize X_train and y_train

In [3]:
print(X_train)

[[ 5.8  2.7  4.1  1. ]
 [ 5.   3.2  1.2  0.2]
 [ 5.   2.   3.5  1. ]
 [ 5.4  3.9  1.7  0.4]
 [ 5.6  3.   4.1  1.3]
 [ 5.1  3.8  1.5  0.3]
 [ 6.4  2.7  5.3  1.9]
 [ 5.1  3.7  1.5  0.4]
 [ 6.5  3.   5.8  2.2]
 [ 5.6  2.5  3.9  1.1]
 [ 5.4  3.4  1.5  0.4]
 [ 4.7  3.2  1.3  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.1  3.4  1.5  0.2]
 [ 4.6  3.2  1.4  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 6.3  3.3  6.   2.5]
 [ 5.2  3.4  1.4  0.2]
 [ 6.8  2.8  4.8  1.4]
 [ 6.9  3.1  4.9  1.5]
 [ 5.2  4.1  1.5  0.1]
 [ 5.   3.3  1.4  0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 6.9  3.1  5.1  2.3]
 [ 6.7  2.5  5.8  1.8]
 [ 6.3  2.8  5.1  1.5]
 [ 6.4  3.2  5.3  2.3]
 [ 6.   2.2  5.   1.5]
 [ 4.8  3.   1.4  0.3]
 [ 5.   3.5  1.3  0.3]
 [ 5.7  2.8  4.1  1.3]
 [ 5.   2.3  3.3  1. ]
 [ 7.7  3.   6.1  2.3]
 [ 5.4  3.   4.5  1.5]
 [ 5.4  3.4  1.7  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 6.9  3.2  5.7  2.3]
 [ 7.7  3.8  6.7  2.2]
 [ 6.7  3.1  5.6  2.4]
 [ 5.9  3.   5.1  1.8]
 [ 6.3  3.3  4.7  1.6]
 [ 5.7  2.9  4.2  1.3]
 [ 7.2  3.   5.8  1.6]
 [ 5.   3. 

In [4]:
print(y_train)

[1 0 1 0 1 0 2 0 2 1 0 0 0 0 0 0 2 0 1 1 0 0 0 2 2 2 2 2 0 0 1 1 2 1 0 0 2
 2 2 2 1 1 2 0 2 2 1 2 2 0 1 1 0 2 2 2 2 1 0 1 1 1 0 1 2 1 0 0 0 1 2 0 1 2
 2 1 1 2 2 1 1 0 2 2 2 2 0 2 0 1 1 1 0 1 0 2 1 0 1 0 0 2 2 1 2]
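
The labels are plain integers (0, 1 and 2). To see how many flowers of each class ended up in the training set, a small sketch with `np.bincount` can help. The split is redone here so the block is self-contained, which means the counts will not necessarily match the run shown above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# np.bincount counts how often each integer label (0, 1, 2) occurs.
counts = np.bincount(y_train, minlength=3)
print(counts)
```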


# K-Nearest Neighbours

[K-Nearest Neighbour](https://brilliant.org/wiki/k-nearest-neighbors/) or **KNN** is a very simple and effective machine learning algorithm that classifies an input by looking at its `k` nearest neighbours in the training data. Scikit-learn already provides an implementation of **KNN**.

`k` is a parameter: the number of neighbours to consult (e.g. k=1, k=2, or k=n).
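
To make the idea concrete, here is a minimal from-scratch sketch of the voting step. It is an illustration only, not scikit-learn's implementation; the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbours wins.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two points of class 0 and three points of class 1.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]])
y_toy = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([4.0, 4.1]), k=3))  # 1
```

In the first call two of the three nearest points belong to class 0, so class 0 wins the vote; in the second call all three nearest points belong to class 1.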

# Importing the package we will be needing and making our `k` a parameter

In [5]:
# import scikit learn's KNN
from sklearn.neighbors import KNeighborsClassifier

# k is the number of neighbours; I'll use 15 in this example, feel free to change it
k = 15
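
If you are unsure which `k` to pick, one common approach is to compare a few values with cross-validation. The sketch below uses scikit-learn's `cross_val_score`; the candidate values and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for a few candidate values of k.
scores = {}
for k in (1, 5, 15, 25):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()
    print(k, round(scores[k], 3))
```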

# Creating our KNN model

In [6]:
# initializing our KNN model
neigh = KNeighborsClassifier(n_neighbors=k)

# Fitting our X and y to the model
By fitting our training sets we are telling the model: here are all the inputs (`X_train`) and their corresponding labels (`y_train`).

In [7]:
neigh.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=2,
           weights='uniform')

# We will now test its accuracy
It is necessary to check the accuracy when creating a model for your problem; that way you can figure out whether there are problems with the data you are training on. You will then know whether adjustments are still needed or whether the model is `just fine`. Remember that there are problems concerning [overfitting and underfitting](https://www.lasseschultebraucks.com/overfitting-underfitting-ml/).

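Note: Accuracy should be measured on the remaining test set (`X_test`, `y_test`). The test set is unseen by the model, and this is how we can tell whether it really knows how to predict correctly.

In [8]:
# note: this call scores on the full dataset, which includes the training rows;
# neigh.score(X_test, y_test) would score on the unseen test set only
neigh.score(X, y)

0.96666666666666667

Our model scored over 96%, which is good for this model since the dataset is small and simple. Because this score also covers the training rows, treat it as an optimistic estimate.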
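
The note above says accuracy is measured on the test set. A minimal sketch of scoring the training and test sets separately is shown below; a training score far above the test score would hint at overfitting. The block rebuilds the split and the model so it runs on its own, and `random_state=0` is an arbitrary seed chosen for illustration, so the numbers will not necessarily match the run above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

neigh = KNeighborsClassifier(n_neighbors=15)
neigh.fit(X_train, y_train)

# Accuracy on rows the model has seen versus rows it has never seen.
train_acc = neigh.score(X_train, y_train)
test_acc = neigh.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```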

# Testing whether our model is working and predicting correctly
**0 = Setosa**, **1 = Versicolour**, **2 = Virginica**

**Remember this table**

| Type | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
|-------|-------|----|----|----|
| Setosa | 4.3 - 5.8 | 2.3 - 4.4 | 1.0 - 1.9 | 0.1 - 0.6 |
| Versicolour | 4.9 - 7.0 | 2.0 - 3.4 | 3.0 - 5.1 | 1.0 - 1.8 |
| Virginica | 4.9 - 7.9 | 2.2 - 3.8 | 4.5 - 6.9 | 1.4 - 2.5 |
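
Ranges like the ones in the table can be computed from the data itself. The sketch below takes the per-class minimum and maximum of each measurement; the dictionary name `ranges` is just an illustrative choice.

```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# For each class, take the column-wise minimum and maximum of its rows.
ranges = {}
for label, name in enumerate(iris.target_names):
    rows = X[y == label]
    ranges[name] = (rows.min(axis=0), rows.max(axis=0))
    print(name, "min:", ranges[name][0], "max:", ranges[name][1])
```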

# Create a list of flower names from the predicted labels
Compare the original predicted values with the appended names for reference.

In [9]:
# predict once and reuse the result
predictions = neigh.predict(X_test)

# map each numeric label to its flower name
names = {0: 'Setosa', 1: 'Versicolour', 2: 'Virginica'}
flower = [names[index] for index in predictions]

print(predictions)
print(flower)

[0 2 1 1 2 1 1 2 2 0 0 1 0 1 1 2 2 2 1 1 2 1 0 2 1 1 0 2 1 0 2 0 0 1 2 0 0
 2 1 0 0 1 0 0 2]
['Setosa', 'Virginica', 'Versicolour', 'Versicolour', 'Virginica', 'Versicolour', 'Versicolour', 'Virginica', 'Virginica', 'Setosa', 'Setosa', 'Versicolour', 'Setosa', 'Versicolour', 'Versicolour', 'Virginica', 'Virginica', 'Virginica', 'Versicolour', 'Versicolour', 'Virginica', 'Versicolour', 'Setosa', 'Virginica', 'Versicolour', 'Versicolour', 'Setosa', 'Virginica', 'Versicolour', 'Setosa', 'Virginica', 'Setosa', 'Setosa', 'Versicolour', 'Virginica', 'Setosa', 'Setosa', 'Virginica', 'Versicolour', 'Setosa', 'Setosa', 'Versicolour', 'Setosa', 'Setosa', 'Virginica']
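
An alternative to writing the label names by hand is to index into the dataset's own `target_names` array. This is a small sketch using the first five predictions printed above as sample input; note that scikit-learn's spelling is lowercase and uses `versicolor` rather than `Versicolour`.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# The first five predicted labels from the output above, used as sample input.
predictions = np.array([0, 2, 1, 1, 2])

# Indexing target_names with an array of labels returns the matching names.
flower = iris.target_names[predictions]
print(flower)  # ['setosa' 'virginica' 'versicolor' 'versicolor' 'virginica']
```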


# To check the predictions, compare each row of X_test with its predicted label

In [10]:
print(X_test, flower)

[[ 4.4  2.9  1.4  0.2]
 [ 6.7  3.3  5.7  2.5]
 [ 7.   3.2  4.7  1.4]
 [ 6.3  2.3  4.4  1.3]
 [ 5.9  3.2  4.8  1.8]
 [ 5.8  2.6  4.   1.2]
 [ 4.9  2.4  3.3  1. ]
 [ 6.7  3.   5.2  2.3]
 [ 7.9  3.8  6.4  2. ]
 [ 4.8  3.4  1.9  0.2]
 [ 5.5  4.2  1.4  0.2]
 [ 5.1  2.5  3.   1.1]
 [ 5.   3.4  1.6  0.4]
 [ 6.2  2.9  4.3  1.3]
 [ 6.1  2.8  4.   1.3]
 [ 6.4  2.8  5.6  2.2]
 [ 7.7  2.8  6.7  2. ]
 [ 6.5  3.2  5.1  2. ]
 [ 5.5  2.3  4.   1.3]
 [ 5.5  2.4  3.7  1. ]
 [ 6.5  3.   5.2  2. ]
 [ 6.6  2.9  4.6  1.3]
 [ 5.2  3.5  1.5  0.2]
 [ 6.4  2.8  5.6  2.1]
 [ 6.1  3.   4.6  1.4]
 [ 5.6  3.   4.5  1.5]
 [ 4.9  3.   1.4  0.2]
 [ 6.2  3.4  5.4  2.3]
 [ 6.7  3.1  4.7  1.5]
 [ 4.4  3.2  1.3  0.2]
 [ 6.3  2.5  5.   1.9]
 [ 5.   3.4  1.5  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 6.   3.4  4.5  1.6]
 [ 6.   3.   4.8  1.8]
 [ 4.6  3.4  1.4  0.3]
 [ 5.1  3.5  1.4  0.3]
 [ 6.3  2.7  4.9  1.8]
 [ 5.9  3.   4.2  1.5]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.7  1.5  0.2]
 [ 6.1  2.9  4.7  1.4]
 [ 4.3  3.   1.1  0.1]
 [ 4.4  3. 
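
Reading the rows by eye works for a handful of samples. To check every prediction against the true label, the predictions can be compared with `y_test` directly. The sketch below rebuilds the split and the model so it runs on its own (`random_state=0` is an arbitrary seed for illustration) and also prints a confusion matrix from `sklearn.metrics`.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=0)

neigh = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)
predictions = neigh.predict(X_test)

# Element-wise comparison: True where the prediction matches the true label.
correct = predictions == y_test
print(correct.sum(), "of", len(y_test), "correct")

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)
```

The diagonal of the matrix counts correct predictions; any off-diagonal entry shows which class was mistaken for which.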

# Mathematical explanations below:
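
As a brief sketch of the mathematics behind the classifier used above: the fitted model reports `metric='minkowski'` with `p=2`, which is the Euclidean distance. For two flowers $x$ and $x'$ with four measurements each:

$$
d(x, x') = \left( \sum_{j=1}^{4} \lvert x_j - x'_j \rvert^{p} \right)^{1/p}
\quad\text{which for } p = 2 \text{ becomes}\quad
d(x, x') = \sqrt{\sum_{j=1}^{4} (x_j - x'_j)^2}
$$

With `weights='uniform'`, each of the `k` nearest neighbours gets one vote, and the predicted class is the one with the most votes. Writing $N_k(x)$ for the set of the `k` training points closest to $x$ (a notation introduced here for illustration):

$$
\hat{y}(x) = \arg\max_{c \in \{0, 1, 2\}} \sum_{i \in N_k(x)} \mathbf{1}[\, y_i = c \,]
$$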

# References
[Brilliant](https://brilliant.org/wiki/k-nearest-neighbors/)

[Lasse Schultebraucks](https://www.lasseschultebraucks.com/overfitting-underfitting-ml/)