<h1><center>K Nearest Neighbor Classification</center></h1>

The goal of this tutorial is to develop your understanding of a basic Machine Learning classification model and learn some basic area of machine learning like supervised learning. It is necessary to not only understand the mathematics behind the models but also to know how to convert that knowledge into code that can work.

### By the end of this tutorial, you’ll have learned:

   - What KNN is
   - KNN Algorithm
   - How to implement KNN in Python, step by step

### Supervised Learning

Supervised learning in machine learning is the task of learning a funcion that maps some inputs to outputs based on some input-output pair. It infers a function from labeled training data consisting of a set of training examples. In supervised learning we are given labeled data with which we can compare our model's predictions with. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

Classification is an example of supervised learning. We will be looking at one of the most basic classification algorithm called K Nearest Neighbor (KNN) in this tutorial.

#### What is Classification?

In machine learning, classification is the problem of identifying to which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known (is labeled). For example assigning an email spam or not spam is an example of classification. or Identifying a patient to have cancer or not is also an example of classification.

### K-Nearest Neighbor

K-Nearest Neighbors, or KNN is one of the simplest machine learning algorithms and is commonly used in many cases which are rather simple. KNN is a non-parametric algorithm which means that it doesn't make any assumptions about the data. KNN makes its decision based on distance of one example from others. This distance can simply be Euclidean distance. Also, KNN is a lazy algorithm which means that there is little to no training phase. Therefore, new data can be immediately classified.

#### Advantages and Disadvantages of KNN

##### Advantages
   - Makes no assumptions about the data.
   - Simple algorithm.
   - Used for classification easily.
    
##### Disadvantages
   - A lot of memory is required.
   - Is very sensitive to irrelevant features.
   - Also, very sensitive to the scale of data we are computing the distance with. 

### Algorithm of KNN

   - Pick some value of k i.e. 3, 5 or 7.
   - Choose k nearest neighbor of the new data point according to its Eucliean distance.
   - For each data point in test we do.
        - Calculate the distance between test data and each row of training data with the help of Euclidean distance.
        - Now, sort in ascending order according to the distance computed.
        - Choose top k from the distance array.
        - Now, assign a class to the test sample based on most frequent class of these rows.
   - End Algorithm.

#### Example

Let's see an example to understand better.
Suppose we have some data which is plotted as.

![title](one.png)

You can see that there are two classes in data one is red and other is blue.

Now, consider that we have a test data point (black colored) and we have to predict whether it belongs to red class or the blue class. 
We will compute Euclidean distance of test point with k nearest neighbors. Here k = 3

![title](two.png)

Now, we have computed the distance of test point with the neighbors and can see that it is much closer to the red class. Hence 
this data point will be classified as red class.

### Implementation using python

These are some of the essential libraries that are needed to use some built-in functionalities provided by python. Some well-known libraries are like Numpy, sklearn.

Sklearn provide a lot of support for machine learning algorithms. We will be using GridSearchCV funstion provided by sklearn which can be very useful.

##### Grid Search CV

Grid search is the process of performing hyper-parameter tuning in order to determine the optimal values of the hyper-parameter for a given model. This is significant as the performance of the entire model is based on the hyper parameter values specified.

##### Why use it?

Chooing best parameters in machine learning process is like a nightmare for practioners. There are libraries that have been implemented, such as GridSearchCV of the sklearn library, in order to automate this process and make the process bit easier for ML practioners.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

##### KNN function

Here we will create a custom KNN method with 5 parameters i.e. training examples, training labels, test examples, test label and a list of possible values of k to train on.

First, we create an object of KNeighborsClassifier() we imported from sklearn.
Then we created a dictionary named parameters and stored the list k in it.
Third step is to pass the classifier i.e. knn and the paramters to GridSearchCV and fit this model on training data
GridSearchCV will give us best parameter of training and we will make predictions on test data using that paramter. 
model.predict() predicts the labels on test data and we can check its accuracy by the function accuracy_score() we imported using sklearn.

In [2]:
def KNN(x_tr, y_tr, x_te, y_te, k):
    print('\nTraining Started for values of k', [each for each in k],'.......')
    # Create an knn object using imported KNeighborsClassifier() from sklearn
    knn = KNeighborsClassifier()
    # parameters i.e. k neighbors list
    parameters = {'n_neighbors':k}
    
    # Training the model
    model = GridSearchCV(knn, param_grid = parameters, cv=3)
    model.fit(x_tr, y_tr)
    print('Best value of k is ',model.best_params_)
    
    # Making Predictions on test data
    print('\nPredicting on Test data.......')
    pred = model.predict(x_te)
    print('\nAccuracy of model on test is', accuracy_score(y_te, pred)*100,'%')
    return accuracy_score(y_te, pred)

This custom method is irrelevant for this tutorial. It just some pre-processing done on the data that I am using for KNN classification which is Google Playstore Dataset.

In [3]:
def Data_preProcess():
    # Processing Apps.csv
    data = pd.read_csv('apps.csv')
    columns = ['App', 'Category', 'Rating', 'Size', 'Type', 'Price', 'Genres']
    data[columns]
    new_data = data[columns].copy()
    new_data = new_data.fillna(0)
    for each in range(0, len(new_data['Rating'])):
        if new_data['Rating'][each] == 0:
            new_data.at[each, 'Rating'] = new_data['Rating'].mean()
    price_list = [float(each.replace("$","")) for each in new_data.Price]
    new_data.Price = price_list
    
    # Processing User_reviews.csv
    data2 = pd.read_csv('user_reviews.csv')
    column = ['App', 'Sentiment_Polarity', 'Sentiment_Subjectivity', 'Sentiment']
    data2[column]
    new_data2 = data2[column].copy()
    
    # Merging the two datasets into one final dataset
    df = new_data.merge(new_data2, on='App')
    df.Sentiment = df['Sentiment'].replace(to_replace='Positive', value=1).replace(to_replace='Negative', value=-1).replace(to_replace='Neutral', value=0)
    df.Sentiment_Polarity = df.Sentiment_Polarity.fillna(df.Sentiment_Polarity.mean())
    df.Sentiment_Subjectivity = df.Sentiment_Subjectivity.fillna(df.Sentiment_Subjectivity.mean())
    df = df[df['Sentiment'].notna()]
    df.Type = df['Type'].replace(to_replace='Free', value=1).replace(to_replace='Paid', value=0)
    df = df.drop(['Size'], axis=1)
    
    # Separating dataset into samples and labels
    X = df.iloc[:, 0:7]
    y = df.iloc[:, 8:9]
    
    # Encoding the dataset 
    X = pd.get_dummies(X)
    print('\nFinished pre-processing data....')
    return X, y

We create a main function and all the processing is done in this function. Like we will call the above created methods in this main function. Also, we are applying some data normalization techiques in this function and calling the custom function on our data.

You might not need normalization according to the data you use. 

In [4]:
if __name__ == "__main__":
    X, y = Data_preProcess()
    # Splitting the data into Train and Test data
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)
    
    # Normalizing the data
    sc_X = StandardScaler()
    x_train = sc_X.fit_transform(x_train)
    x_test = sc_X.transform(x_test)
    y_train = np.array(y_train)
    y_test = np.array(y_test)
    
    #Running KNN
    k = [3,5,7]
    acc = KNN(x_train[:5000], y_train[:5000].ravel(), x_test, y_test, k)


Finished pre-processing data....

Training Started for values of k [3, 5, 7] .......
Best value of k is  {'n_neighbors': 7}

Predicting on Test data.......

Accuracy of model on test is 86.07469428225184 %


We got an Accuracy of 86 % which very decent for this model.