#**Demo: KNN Algorithm**

##**Problem Definition**
A flower seller in Japan has planted lots of iris plants, and before selling them he wants to classify the flowers by species. He doesn't want to spend much time classifying them by hand, so to automate the task he hires you to build a model that predicts a flower's species from its features.

##**Dataset Description**
The dataset contains 3 iris species: Iris-Setosa, Iris-Versicolor and Iris-Virginica. It consists of 150 samples and 4 features: sepal length, sepal width, petal length and petal width. These features will be used to predict the target variable, i.e. the species.

##**Objective**

>* **Classify:** We want to predict whether a flower belongs to Iris-Setosa, Iris-Virginica or Iris-Versicolor.
>* **Understanding KNN:** We use KNN for classification here, so let's see how KNN works.
>* **Collecting the data**
>* **Splitting the dataset for training and testing:** Since we want to know how good our model is, we will split the main dataset into training and testing datasets. The test data will be used later for evaluating.
>* **Implementing KNN from scratch**
>* **Implementing KNN using sklearn**
>* **Training the model:** We will create the model by training the algorithm on the training dataset (which contains the actual labels).
>* **Testing the model:** We will test the model on the test dataset to check how well it works when it sees a new sample.
>* **Model Performance:** We will measure our model's performance by comparing the predicted values with the actual values.



**K-nearest neighbours (KNN)** is a classification algorithm. It is a supervised machine learning algorithm, which means it requires labelled training data.

Here 'k' is the number of nearest neighbours we want to consider when predicting. It is a hyper-parameter, which means you have to try out different values of k, compare the resulting accuracy, and then choose a suitable value.
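
A minimal sketch of that search over k, assuming a separate validation split is used for the comparison. The 25% split size, `random_state=0`, and the candidate values of k are arbitrary choices for illustration and are not part of the original demo.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Assumption: hold out 25% of the data only for comparing values of k.
X_tr, X_val, y_tr, y_val = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

val_accuracy = {}
for k in (1, 3, 5, 7, 9):  # candidate values, chosen arbitrarily
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    val_accuracy[k] = accuracy_score(y_val, model.predict(X_val))

# Pick the k with the highest validation accuracy (first one wins on ties).
best_k = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy)
print("best k:", best_k)
```

Odd values of k are a common choice because they reduce the chance of tied votes, although with three classes ties are still possible.
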
###**How does KNN work?**
>* Plot all the instances from training samples in a vector space.
>* Plot the point of the query instance in that vector space. 
>* Calculate the Euclidean distance from the query instance to every training instance and choose the k nearest neighbours.
>> Euclidean distance = $\sqrt{\sum_{i=1}^n (x_i-y_i)^2}$,
>> where n is the number of features,
>> x is a training instance and y is the query instance.

>* Take the labels of the k nearest neighbours found in the previous step.
>* The label that occurs most often among them is assigned to the query instance. This is called hard voting.
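
The steps above can be sketched in plain Python for a single query point. The training points, labels, and the helper name `knn_classify` are made up for illustration (two features only, values loosely resembling petal length and width); `math.dist` needs Python 3.8 or newer.

```python
import math
from collections import Counter

# Toy training set (made-up values): (petal length, petal width) and a label
train_points = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.1, 1.9)]
train_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]


def knn_classify(query, points, labels, k=3):
    """Hypothetical helper: classify one query point by hard voting."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [math.dist(query, p) for p in points]
    # Step 2: indexes of the k smallest distances
    nearest = sorted(range(len(points)), key=lambda i: distances[i])[:k]
    # Step 3: the most frequent label among the k neighbours wins
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_classify((4.6, 1.3), train_points, train_labels, k=3)
print(prediction)  # versicolor
```

For the query `(4.6, 1.3)` the three closest points carry the labels versicolor, versicolor and virginica, so the vote goes to versicolor.
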


### Importing the libraries

In [0]:
#data analysis
import pandas as pd
import numpy as np

#dataset from sklearn
from sklearn import datasets
#machine learning
from sklearn.model_selection import train_test_split
#algorithm
from sklearn.neighbors import KNeighborsClassifier
#metrics
from sklearn.metrics import accuracy_score,confusion_matrix

### Collecting the data

In [2]:
iris_data=datasets.load_iris()
print("Features: ", iris_data.feature_names)
print("Labels: ", iris_data.target_names)

Features:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Labels:  ['setosa' 'versicolor' 'virginica']


In [3]:
features=pd.DataFrame(iris_data.data)
features.columns=iris_data.feature_names
labels=pd.DataFrame(iris_data.target)
labels.columns=['class']
dataframe=pd.concat([features,labels],axis=1)
dataframe.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


###**Splitting the dataset for training and testing**

In [4]:
X_train,X_test,y_train,y_test=train_test_split(dataframe.iloc[:,0:-1],dataframe.iloc[:,-1],test_size=0.20,random_state=3)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)


##**Implementing KNN from scratch**

In [0]:
x1=X_train.iloc[:,0] #sepal length of X_train
x2=X_train.iloc[:,1] #sepal width of X_train
x3=X_train.iloc[:,2] #petal length of X_train
x4=X_train.iloc[:,3] #petal width of X_train
    

In [0]:
y_pred=[]
k=3 #number of nearest neighbours to consider
#a=sepal length, b=sepal width, c=petal length, d=petal width of one X_test sample
for a,b,c,d in zip( X_test.iloc[:,0], X_test.iloc[:,1], X_test.iloc[:,2], X_test.iloc[:,3]):
  dist=((a-x1)**2 + (b-x2)**2 + (c-x3)**2 + (d-x4)**2)**0.5 #euclidean distance to every training sample
  dist=np.array(dist)
  indexes = np.argsort(dist) #positions of the training samples, sorted by ascending distance
  l2=[y_train.iloc[i] for i in indexes[:k]]   #labels of the k nearest instances
  y_pred.append(max(l2,key=l2.count)) #most frequent label among the k nearest labels
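
The loop above is written for exactly four features. As a sketch of how the same idea could be generalised, the function below handles any number of features and any k by using NumPy broadcasting. The name `knn_predict` and the toy arrays are hypothetical and only illustrate the technique; this is not the notebook's own implementation.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query_X, k=3):
    """Hypothetical helper: predict a label for every row of query_X."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)
    # Pairwise differences, shape (n_query, n_train, n_features)
    diff = query_X[:, None, :] - train_X[None, :, :]
    # Euclidean distances, shape (n_query, n_train)
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Positions of the k closest training rows for each query row
    nearest = np.argsort(dist, axis=1)[:, :k]
    # Hard voting over the labels of those k neighbours
    return [Counter(row.tolist()).most_common(1)[0][0] for row in train_y[nearest]]


# Made-up data: two well-separated clusters labelled 0 and 1
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_pred = knn_predict(toy_X, toy_y, [[0.2, 0.1], [5.5, 5.5]], k=3)
print(toy_pred)  # [0, 1]
```

With the notebook's variables, a call such as `knn_predict(X_train, y_train, X_test, k=3)` should accept the pandas objects directly, since `np.asarray` converts them to arrays.
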
  

##**Evaluating our KNN model**

In [7]:
print(accuracy_score(y_test,y_pred))

0.9666666666666667


##**Implementing KNN using sklearn**

In [8]:
model=KNeighborsClassifier(n_neighbors=3,metric='euclidean')  # here k=3
model.fit(X_train,y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

###**Testing our model**

In [0]:
y_pred2=model.predict(X_test)

###**Model Performance**

In [10]:
print(accuracy_score(y_test,y_pred2))

0.9666666666666667


sklearn's model and our from-scratch implementation reach exactly the same accuracy, about 0.967, which means 29 of the 30 test samples are classified correctly.

In [11]:
print(confusion_matrix(y_test,y_pred2))

[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]
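
In sklearn's `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes, in label order (here 0 = setosa, 1 = versicolor, 2 = virginica). The sketch below copies the matrix printed above and derives accuracy, per-class recall and per-class precision from it; the variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [10, 0, 0],
    [0, 9, 1],
    [0, 0, 10],
]
class_names = ["setosa", "versicolor", "virginica"]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

# Recall: share of each true class that was predicted correctly (row-wise)
recall = {name: cm[i][i] / sum(cm[i]) for i, name in enumerate(class_names)}
# Precision: share of each predicted class that was actually correct (column-wise)
precision = {
    name: cm[i][i] / sum(row[i] for row in cm) for i, name in enumerate(class_names)
}

print(accuracy)
print(recall)
print(precision)
```

The single off-diagonal entry shows that one versicolor flower was predicted as virginica, which matches the accuracy of 29/30 reported earlier.
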
