**This notebook demonstrates how to classify labeled images using the k-NN algorithm**

This notebook loads the animals dataset from Google Drive and then applies the k-NN algorithm to it.

![alt text](http://csopensource.com/wp-content/uploads/2018/07/knnmeme.jpeg)

---

The k-Nearest Neighbor (k-NN) classifier is arguably the simplest machine learning and image classification algorithm.
It doesn’t actually “**learn**” anything. Instead, the algorithm relies directly on the **distance between feature vectors** (which, in our case, are the raw RGB pixel intensities of the images).

The k-NN algorithm classifies unknown data points by finding the most common
class among the k closest examples. Each of the k closest data points casts a vote, and the
category with the highest number of votes wins.
Or, in plain English: “***Tell me who your neighbors are, and I’ll tell you who you are***”
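
The voting rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later in the notebook; the function name `knn_predict` and the toy 2-D points are assumptions made up for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Euclidean (L2) distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Each neighbor casts one vote for its own label
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors (hypothetical values, not taken from the animals dataset)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["cat", "cat", "cat", "panda", "panda", "panda"]

prediction = knn_predict(train_points, train_labels, query=(1.1, 1.0), k=3)
print(prediction)  # cat
```

Note that every prediction scans the whole training set, which is why there is nothing to "train" but each query is comparatively expensive.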

k-Nearest Neighbors is a

> **Supervised machine learning algorithm**

> **Non-parametric** algorithm, as it **does not** make an **assumption** about the **underlying data distribution**


> It has no explicit training step. Here **K stands for the number of neighbours**. It uses a **distance metric** such as the **L1 or L2 distance** to predict the label of a new point in N-dimensional space
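
As a small illustration of the two metrics mentioned above, the sketch below computes the L1 (Manhattan) and L2 (Euclidean) distances between two vectors. The helper names `l1_distance` and `l2_distance` and the sample vectors are hypothetical, chosen only for this example.

```python
import math


def l1_distance(a, b):
    # L1 (Manhattan) distance: sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    # L2 (Euclidean) distance: square root of the sum of squared differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = (0, 0, 0)
b = (3, 4, 0)
print(l1_distance(a, b))  # 7
print(l2_distance(a, b))  # 5.0
```

The two metrics can rank neighbours differently, which is one reason the choice of distance is treated as a hyperparameter.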

**Pros**:


1.   Learns complex models easily.
2.   Robust to noisy data.
3.   No training phase is involved, as it directly relies on the labels of the K nearest neighbours.
4.   Effective if the training set is large.
5.   Works well if the data is low dimensional.


**Cons**

1.   It is difficult to choose the value of K.
2.   It is difficult to estimate which distance metric will give the best result.
3.   Not effective if the data is high dimensional: large storage is required, computational efficiency is low, the data becomes sparse, intuition about distance breaks down, and the nearest neighbours become less relevant.
4.   Classifying a new testing point requires a comparison to every single data point in our training data, which scales as O(N), making work with larger datasets computationally prohibitive.







We will load the dataset from Google Drive. To do that, we import the Google Colab `drive` module.

In [None]:

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Let's **list** the folders in our dataset. Here the dataset folder is `animals`.

In [None]:
!ls /content/gdrive/My\ Drive/datasets/animals

cats  dogs  panda


Here is the class `SimpleDatasetLoader`, which loads the dataset from the drive. Its `load` method returns a tuple of two NumPy arrays: the image data and the labels.

# Implementing SimpleDatasetLoader

In [None]:
# Class to load the dataset images from the drive
import os
import cv2
import numpy as np


class SimpleDatasetLoader:
    # Method: Constructor
    def __init__(self, preprocessors=None):
        """
        :param preprocessors: List of image preprocessors
        """
        self.preprocessors = preprocessors

        if self.preprocessors is None:
            self.preprocessors = []

    # Method: Used to load a list of images for pre-processing
    def load(self, image_paths, verbose=-1):
        """
        :param image_paths: List of image paths
        :param verbose: Parameter for printing information to console
        :return: Tuple of data and labels
        """
        data, labels = [], []

        for i, image_path in enumerate(image_paths):
            image = cv2.imread(image_path)
            label = image_path.split(os.path.sep)[-2]

            # self.preprocessors is always a list (see __init__), so apply each one in order
            for p in self.preprocessors:
                image = p.preprocess(image)

            data.append(image)
            labels.append(label)

            if verbose > 0 and i > 0 and (i+1) % verbose == 0:
                print('[INFO]: Processed {}/{}'.format(i+1, len(image_paths)))

        return (np.array(data), np.array(labels))
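
The loader above derives each label from the name of the directory that contains the image, so the dataset is assumed to be laid out as `animals/<class name>/<image file>`. A minimal sketch of just that step follows; the helper name `label_from_path` and the file name `dogs_00001.jpg` are hypothetical.

```python
import os


def label_from_path(image_path):
    # The class label is the name of the directory that directly contains the image
    return image_path.split(os.path.sep)[-2]


# Build the path with os.path.join so the separator matches the current platform
example_path = os.path.join("datasets", "animals", "dogs", "dogs_00001.jpg")
print(label_from_path(example_path))  # dogs
```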






Machine learning algorithms such as k-NN require all images in a dataset to have a **fixed feature vector size**.

In the case of images, this requirement implies that our images must be preprocessed and scaled to have identical widths and heights.

There are a number of ways to accomplish this resizing and scaling, ranging from more advanced methods that preserve the aspect ratio of the original image to simple methods that ignore the aspect ratio and simply squash the width and height to the required dimensions.

The class `SimplePreprocessor` builds an image preprocessor that resizes the image, ignoring the aspect ratio.
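
For contrast, an aspect-aware approach might first crop the image to the target aspect ratio and only then resize it. The sketch below shows one possible cropping step using NumPy slicing alone; the function name `center_crop_to_aspect` is hypothetical, this is not part of the notebook's pipeline, and the cropped result would still need to be resized (for example with `cv2.resize`, as `SimplePreprocessor` does).

```python
import numpy as np


def center_crop_to_aspect(image, target_width, target_height):
    """Crop the largest centered region whose aspect ratio matches the target."""
    height, width = image.shape[:2]
    target_ratio = target_width / target_height

    if width / height > target_ratio:
        # Image is too wide: keep the full height, trim the left and right edges
        new_width = int(round(height * target_ratio))
        x0 = (width - new_width) // 2
        return image[:, x0:x0 + new_width]

    # Image is too tall (or already matches): keep the full width, trim top and bottom
    new_height = int(round(width / target_ratio))
    y0 = (height - new_height) // 2
    return image[y0:y0 + new_height, :]


# Synthetic 60 x 100 (height x width) RGB image standing in for a real photo
image = np.zeros((60, 100, 3), dtype=np.uint8)
cropped = center_crop_to_aspect(image, target_width=32, target_height=32)
print(cropped.shape)  # (60, 60, 3)
```

Cropping discards pixels near the edges, whereas squashing keeps every pixel but distorts shapes; which trade-off works better depends on the dataset.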


# Implementing SimplePreprocessor

In [None]:
# Class: image preprocessor
class SimplePreprocessor:
    # Method: Constructor
    def __init__(self, width, height, interpolation=cv2.INTER_AREA):
        """
        :param width: Image width
        :param height: Image height
        :param interpolation: Interpolation algorithm
        """
        self.width = width
        self.height = height
        self.interpolation = interpolation

    # Method: Used to resize the image to a fixed size (ignoring the aspect ratio)
    def preprocess(self, image):
        """
        :param image: Image
        :return: Re-sized image
        """
        return cv2.resize(image, (self.width, self.height), interpolation=self.interpolation)

# Implementing k-NN

---
• Step #1 – **Gather Our Dataset**: The dataset consists of 3,000 images, with 1,000 images each for the dog, cat, and panda classes. Each image is represented in the RGB color space. We will preprocess each image by resizing it to 32 × 32 pixels. Taking into account the three RGB channels, the resized image dimensions imply that each image in the dataset is represented by 32 × 32 × 3 = 3,072 integers.

• Step #2 – **Split the Dataset**: We will split the data into two parts, one for training and the other for testing.

• Step #3 – **Train the Classifier**: Our k-NN classifier will be trained on the raw pixel intensities of the images in the training set.

• Step #4 – **Evaluate**: Once our k-NN classifier is trained, we can evaluate its performance on the test set.
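
The arithmetic in Step #1 can be checked with a small NumPy sketch. It uses an all-zero array as a stand-in for the loaded images (an assumption for illustration; the real pixel values come from the dataset) and assumes 8-bit pixels, which is what `cv2.imread` returns by default.

```python
import numpy as np

# Stand-in for the loaded dataset: 3,000 images of 32 x 32 pixels with 3 channels,
# stored as 8-bit unsigned integers
images = np.zeros((3000, 32, 32, 3), dtype=np.uint8)

# Flatten each image into one feature vector of 32 * 32 * 3 = 3072 values
features = images.reshape((images.shape[0], -1))
print(features.shape)  # (3000, 3072)

# One byte per value, so the matrix occupies 3000 * 3072 bytes
size_mb = features.nbytes / (1024 * 1024)
print('{:.1f}MB'.format(size_mb))  # 8.8MB
```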

In [None]:

from imutils import paths
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from __main__ import SimplePreprocessor
from __main__ import SimpleDatasetLoader




# Get list of image paths
image_paths = list(paths.list_images("/content/gdrive/My Drive/datasets/animals"))

# Initialize SimplePreprocessor and SimpleDatasetLoader and load data and labels
print('[INFO]: Images loading....')
sp = SimplePreprocessor(32, 32)
sdl = SimpleDatasetLoader(preprocessors=[sp])
(data, labels) = sdl.load(image_paths, verbose=500)

# Reshape from (3000, 32, 32, 3) to (3000, 32*32*3=3072)
data = data.reshape((data.shape[0], 3072))

# Print information about memory consumption (bytes converted to megabytes)
print('[INFO]: Features Matrix: {:.1f}MB'.format(data.nbytes / (1024 * 1024.0)))

# Encode labels as integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# Split data into training (75%) and testing (25%) data
(train_x, test_x, train_y, test_y) = train_test_split(data, labels, test_size=0.25, random_state=42)

# Train and evaluate the k-NN classifier on the raw pixel intensities
print('[INFO]: Classification starting....')
model = KNeighborsClassifier(n_neighbors=7,
                             n_jobs=1)
model.fit(train_x, train_y)
print(classification_report(test_y, model.predict(test_x),
                            target_names=le.classes_))

[INFO]: Images loading....
[INFO]: Processed 500/3000
[INFO]: Processed 1000/3000
[INFO]: Processed 1500/3000
[INFO]: Processed 2000/3000
[INFO]: Processed 2500/3000
[INFO]: Processed 3000/3000
[INFO]: Features Matrix: 8.8MB
[INFO]: Classification starting....
              precision    recall  f1-score   support

        cats       0.44      0.67      0.53       239
        dogs       0.43      0.49      0.46       249
       panda       0.93      0.39      0.54       262

   micro avg       0.51      0.51      0.51       750
   macro avg       0.60      0.51      0.51       750
weighted avg       0.61      0.51      0.51       750



# How to find Best K?

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import numpy as np

# Baseline: accuracy on the test set with K=2
model = KNeighborsClassifier(n_neighbors=2, n_jobs=1)
model.fit(train_x, train_y)

# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(test_y, model.predict(test_x))
print(accuracy)

# Cross-validated grid search over candidate values of K, using the training set only
n_neighbors = np.array([7, 8, 9, 10, 12, 15, 20])
param_grid = dict(n_neighbors=n_neighbors)
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(train_x, train_y)
print(grid.best_score_)
print(grid.best_estimator_.n_neighbors)

0.4146666666666667




0.47688888888888886
20


In [None]:
import matplotlib.pyplot as plt
#print(grid.cv_results_)
print(grid.param_grid)
print(grid.best_score_)
print(grid.scorer_)


{'n_neighbors': array([ 7,  8,  9, 10, 12, 15, 20])}
0.47688888888888886
<function _passthrough_scorer at 0x7fe7edff9d90>


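K=20 neighbours gives the best cross-validated score among the values tried (about 0.477). Note that 20 is also the largest value in the grid, so larger values of K may be worth trying.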
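
To see how the score changes with K rather than only the single best value, the per-candidate results in `cv_results_` can be inspected. The sketch below does this on small synthetic two-class data, which is an assumption made so the example runs on its own; in the notebook the same loop would apply to the `grid` object fitted on `train_x` and `train_y`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for train_x / train_y (illustration only)
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(2.0, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X, y)

# cv_results_ holds one entry per candidate, in the same order as 'params'
for params, mean_score in zip(grid.cv_results_['params'],
                              grid.cv_results_['mean_test_score']):
    print('k={:2d}  mean CV accuracy={:.3f}'.format(params['n_neighbors'], mean_score))

print('best k:', grid.best_params_['n_neighbors'])
```

If the mean score is still rising at the largest K in the grid, that is a hint to extend the grid upward.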

**Classifying a new testing point
requires a comparison to every single data point in our training data, which scales as O(N), making
work with larger datasets computationally prohibitive.**

# How to make KNN Faster?



In [None]:
# Re-run the pipeline with algorithm='kd_tree' (algorithm='ball_tree' is another tree-based option)

from imutils import paths
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from __main__ import SimplePreprocessor
from __main__ import SimpleDatasetLoader




# Get list of image paths
image_paths = list(paths.list_images("/content/gdrive/My Drive/datasets/animals"))

# Initialize SimplePreprocessor and SimpleDatasetLoader and load data and labels
print('[INFO]: Images loading....')
sp = SimplePreprocessor(32, 32)
sdl = SimpleDatasetLoader(preprocessors=[sp])
(data, labels) = sdl.load(image_paths, verbose=500)

# Reshape from (3000, 32, 32, 3) to (3000, 32*32*3=3072)
data = data.reshape((data.shape[0], 3072))

# Print information about memory consumption (bytes converted to megabytes)
print('[INFO]: Features Matrix: {:.1f}MB'.format(data.nbytes / (1024 * 1024.0)))

# Encode labels as integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# Split data into training (75%) and testing (25%) data
(train_x, test_x, train_y, test_y) = train_test_split(data, labels, test_size=0.25, random_state=42)

# Train and evaluate the k-NN classifier on the raw pixel intensities
print('[INFO]: Classification starting....')
model = KNeighborsClassifier(n_neighbors=7,
                             n_jobs=1,algorithm='kd_tree')
model.fit(train_x, train_y)
print(classification_report(test_y, model.predict(test_x),
                            target_names=le.classes_))

[INFO]: Images loading....
[INFO]: Processed 500/3000
[INFO]: Processed 1000/3000
[INFO]: Processed 1500/3000
[INFO]: Processed 2000/3000
[INFO]: Processed 2500/3000
[INFO]: Processed 3000/3000
[INFO]: Features Matrix: 8.8MB
[INFO]: Classification starting....
              precision    recall  f1-score   support

        cats       0.44      0.67      0.53       239
        dogs       0.43      0.49      0.46       249
       panda       0.93      0.39      0.54       262

   micro avg       0.51      0.51      0.51       750
   macro avg       0.60      0.51      0.51       750
weighted avg       0.61      0.51      0.51       750



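The construction of a KD tree is very fast: because **partitioning** is performed only along the data axes, no D-dimensional distances need to be computed. Once constructed, the nearest neighbor of a query point can be determined with only O(log N) distance computations. Though the KD tree approach is **very fast for low-dimensional neighbor searches**, it becomes inefficient as the number of dimensions D grows very large: this is one manifestation of the so-called “curse of dimensionality”. Our feature vectors have D = 3,072 dimensions, so a KD tree is unlikely to speed things up much here. Because the KD tree search is still exact, the classification report above is identical to the one from the first run.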
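
A rough way to see the effect is to time brute-force and KD tree searches on low-dimensional data, where the tree is expected to help. The sketch below uses synthetic 3-dimensional points (an assumption for illustration, not the animals dataset); the timings depend on the machine and library version, so no particular speed-up is claimed.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic low-dimensional data (3 features) standing in for a real dataset
rng = np.random.RandomState(0)
X_train = rng.rand(5000, 3)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_test = rng.rand(1000, 3)

predictions = {}
for algorithm in ('brute', 'kd_tree'):
    model = KNeighborsClassifier(n_neighbors=7, algorithm=algorithm)
    model.fit(X_train, y_train)
    start = time.perf_counter()
    predictions[algorithm] = model.predict(X_test)
    elapsed = time.perf_counter() - start
    print('{:8s} predict time: {:.4f}s'.format(algorithm, elapsed))

# Both searches are exact, so the predicted labels should agree
agreement = np.mean(predictions['brute'] == predictions['kd_tree'])
print('agreement: {:.3f}'.format(agreement))
```

Repeating the experiment with many more features per point is one way to observe the KD tree losing its advantage as the dimensionality grows.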