# K-Nearest Neighbors

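Principle: to make a prediction for a new data point, the algorithm finds the closest data points in the training dataset, its “nearest neighbors.”
Simplest version: take the single closest neighbor (*k=1*).
Advanced version: take the K nearest neighbors and use a voting system where the majority class wins (*k=K*).
Let's start with a first classifier and see how it goes. Note that `KNeighborsClassifier()` called without arguments uses scikit-learn's default of 5 neighbors; the *k=1* case requires passing `n_neighbors=1`.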
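
To make the voting principle concrete, here is a minimal sketch of a nearest-neighbor prediction written with NumPy. It is not how scikit-learn implements the algorithm; the function name `knn_predict`, the toy points, and the labels `"A"` and `"B"` are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    train_points = np.asarray(train_points, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented for this example only
points = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [0.4, 0.4]]
labels = ["A", "A", "B", "B", "B"]

pred_k1 = knn_predict(points, labels, [0.3, 0.3], k=1)
pred_k3 = knn_predict(points, labels, [0.3, 0.3], k=3)
print(pred_k1, pred_k3)
```

On this toy data the single closest point is labeled `"B"`, while two of the three closest points are labeled `"A"`, so the prediction changes with k.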

In [2]:
# Imports
import pandas as pd
from sklearn.preprocessing import Normalizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split


import sys
sys.path.append('../utils')
from utils import perf, thomas_parser

Before going further with this first algorithm, let's retrieve the ground truth from a CSV file generated on 13 January 2020 at 2:25 PM. This ground truth gathers results for a thousand malware samples collected on 25 June 2019.
The data are then split into inputs and outputs as follows:

In [4]:
gt = pd.read_csv('../../dumps/references/2020.01.13-14.25.csv')
# Inputs: every column except the label
cols = [col for col in gt.columns if col not in ['label']]
data = gt[cols]
# Output: the label column
target = gt['label']
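
As a small illustration of this input/output split, the sketch below applies the same idea to a toy DataFrame. The column names `feature_a` and `feature_b` and the values are hypothetical; only the `label` column name comes from the notebook. It also shows that `DataFrame.drop(columns=...)` gives the same inputs as the list comprehension.

```python
import pandas as pd

# Toy frame with made-up feature columns and a 'label' column
toy = pd.DataFrame({
    "feature_a": [1, 2, 3],
    "feature_b": [0.5, 0.1, 0.9],
    "label": ["x", "y", "x"],
})

# Inputs: everything except the label, selected as in the notebook
inputs = toy[[col for col in toy.columns if col not in ["label"]]]
# Equivalent selection using drop()
inputs_via_drop = toy.drop(columns=["label"])
# Output: the label column only
outputs = toy["label"]

print(list(inputs.columns))
print(inputs.equals(inputs_via_drop))
```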

Let's now create the training and test sets. Since the dataset is quite small, the *80/20* rule is applied to the sizes of the sets. The *random_state* parameter seeds the random shuffling applied before the split, so a fixed value makes the split reproducible. Since the purpose of the experiment is to compare results, deterministic behavior is preferred at first, so the value is set to 0.

In [5]:
data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.20, random_state=0)
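
The effect of a fixed seed can be checked on a small synthetic array. This is only a sketch: the arrays `X` and `y` are generated here and have nothing to do with the ground-truth file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples with 2 features, and alternating labels
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.20, random_state=0)
split_b = train_test_split(X, y, test_size=0.20, random_state=0)

# Each of the four returned arrays is identical across the two calls
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a
print(same_split, X_train.shape, X_test.shape)
```

With `random_state` left unset, each call would shuffle differently and the two splits would generally not match.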

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that use distance measures such as K-Nearest Neighbors.
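
A quick sketch of what this row-wise rescaling does, using a made-up array. With the default L2 norm, each row is divided by its own Euclidean length.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up samples: one row per observation
rows = np.array([
    [3.0, 4.0],
    [10.0, 0.0],
    [1.0, 1.0],
])

# Normalizer uses the L2 norm by default and works row by row
unit_rows = Normalizer().fit_transform(rows)

# Every row now has Euclidean length 1
norms = np.linalg.norm(unit_rows, axis=1)
print(unit_rows)
print(norms)
```

The first row `[3, 4]` has length 5, so it becomes `[0.6, 0.8]`. Because each row is handled independently, the transformation does not depend on the training set.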

Let's normalize our sets then.

In [6]:
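# Rescale each sample (row) to unit norm; Normalizer is stateless, so fit() learns nothing from the data
scaler = Normalizer()
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)

# KNeighborsClassifier() uses the default n_neighbors=5; pass n_neighbors=1 for the k=1 case
neigh = KNeighborsClassifier()
neigh.fit(data_train, target_train)
print("Test set accuracy: {:.2f}".format(neigh.score(data_test, target_test)))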

Test set accuracy: 0.82
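
To compare the *k=1* case with larger values of K, one possible approach is to loop over several values of `n_neighbors`. The sketch below runs on synthetic data from `make_classification`, not on the ground-truth file, so its scores say nothing about the malware dataset; the chosen values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real dataset (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Fit one classifier per value of k and record its test accuracy
scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores[k] = model.score(X_test, y_test)
    print("k={}: test set accuracy {:.2f}".format(k, scores[k]))
```

The same loop could be applied to `data_train` and `data_test` from the cells above to see how the accuracy changes with k.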
