# Project objective
This project reviews the k nearest neighbour (kNN) method and its Python implementation using the Wine dataset.

Information about the dataset, some technical details about the machine learning method used, and the mathematical definitions of the evaluation metrics are provided in this notebook.

# Packages we work with in this notebook
We are going to use the following libraries and packages:

* **numpy**: NumPy is the fundamental package for scientific computing with Python. (http://www.numpy.org/)
* **sklearn**: Scikit-learn is a machine learning library for the Python programming language. (https://scikit-learn.org/stable/)


In [0]:
import numpy as np
import sklearn as sk

# Introduction to the dataset

**Name**: Wine dataset

**Summary**: Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. Hence, the task is multiclass (3 class) classification.

**Number of features**: 13 (real, positive)

**Number of data points (instances)**: 178

**Dataset accessibility**: The dataset is available as part of the sklearn package.

**Link to the dataset**: http://archive.ics.uci.edu/ml/datasets/Wine/




## Loading the dataset and separating features and labels
The dataset is available as part of the sklearn package. Hence, we do not need to download the data directly from the UCI ML repository.

In [2]:
from sklearn.datasets import load_wine

# Loading wine data
target_dataset = load_wine()

# separating the feature array (X) and the labels (y)
input_features = target_dataset.data
output_var = target_dataset.target
# printing the number of samples (data points) and features
n_samples, n_features = input_features.shape
print("number of samples (data points):", n_samples)
print("number of features:", n_features)

number of samples (data points): 178
number of features: 13


## Splitting data to training and testing sets

If we do not have a separate dataset for validation and/or testing, we need to split the data into training and test sets so that we can check the generalizability of the model we train.

**test_size**: Traditionally, 30%-40% of the dataset is used for the test set. If you split the data into training, validation and test sets, you can use 60%, 20% and 20% of the dataset, respectively.

**Note**: We need the validation and test sets to be big enough to check the generalizability of our model. At the same time, we would like to have as much data as possible in the training set to train a better model.

**random_state**: As the name suggests, this is used to initialize the internal random number generator, which decides how the data are split into train and test indices. Fixing it to a given value makes the split reproducible.


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(input_features, output_var, test_size=0.30, random_state=5)
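
The 60%/20%/20% split mentioned above is not a single call in scikit-learn. One possible way to get it, sketched here, is to call `train_test_split` twice. The variable names (`X_tr`, `X_val`, `X_te`, ...) and the use of `stratify` are illustrative assumptions and are not part of the original notebook; the names are chosen so that the notebook's own `X_train` and `X_test` are not overwritten.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Step 1: hold out 20% of the data as the test set.
# stratify=y keeps the class proportions similar in each part (an optional choice).
X_rest, X_te, y_rest, y_te = train_test_split(
    X, y, test_size=0.20, random_state=5, stratify=y
)

# Step 2: split the remaining 80% into 75% training and 25% validation,
# which corresponds to roughly 60% and 20% of the full dataset.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5, stratify=y_rest
)

print("train / validation / test sizes:", len(X_tr), len(X_val), len(X_te))
```

With a validation set available, model settings such as k can be tuned on the validation data, and the test set is kept aside for the final check only.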

## Building the supervised learning model
We want to build a 3-class classification model, as the output variable is categorical with 3 classes. Here we build a simple k nearest neighbour model.

## k nearest neighbour (kNN)
k nearest neighbour uses a distance metric such as Euclidean distance to measure the similarity of a data point (sample) to the data points (samples) in the training set. Based on the user-specified k, it finds the k closest training points to the target data point. It then assigns the most frequent label among these k closest points (majority voting) as the class label of the target sample. The class label can also be assigned by weighted voting of the k closest data points, for example with closer neighbours counting more.

This process is the basis for identifying the regions of the feature space that belong to each class. For small k (k=1 or 2), the space may look like a collection of islands belonging to different classes, while higher k values connect the islands into larger class territories.
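
To make the steps above concrete, here is a minimal from-scratch sketch of kNN prediction with plain majority voting. It is only an illustration: the function name `knn_predict` and the toy data are hypothetical, and scikit-learn's `KNeighborsClassifier` (used in the next cell) is implemented differently and supports further options such as distance-weighted voting.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those k neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two small clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))  # prints 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))  # prints 1
```

Each query point is surrounded by three training points of a single class, so the vote is unanimous in both cases.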


In [4]:
from sklearn.neighbors import KNeighborsClassifier

# Create the k nearest neighbour classifier object
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')

# Train the model using the training sets
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='distance')
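
The value `n_neighbors=5` above is one choice among many. As a rough sketch of how different k values could be compared, the block below computes 5-fold cross-validated accuracy on the training portion for a few values of k. The list of k values, the use of cross-validation, and the `_cv` variable names are assumptions made for this illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

# Same split settings as in the notebook; the "_cv" suffix avoids overwriting its variables
X_train_cv, X_test_cv, y_train_cv, y_test_cv = train_test_split(
    X, y, test_size=0.30, random_state=5
)

# Mean 5-fold cross-validated accuracy on the training set for each k
cv_accuracy = {}
for k in (1, 3, 5, 7, 9, 15):
    model = KNeighborsClassifier(n_neighbors=k, weights='distance')
    scores = cross_val_score(model, X_train_cv, y_train_cv, cv=5)
    cv_accuracy[k] = scores.mean()

for k, acc in cv_accuracy.items():
    print(f"k={k:2d}  mean CV accuracy={acc:.3f}")
```

Comparing k values on the training data in this way leaves the test set untouched for the final evaluation.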

## Prediction of test (or validation) set
We now use the trained model to predict the labels of the test set (X_test), which we will then compare with the true labels, y_test.

In [0]:
# Make predictions using the testing set
y_pred = knn.predict(X_test)

## Evaluating performance of the model
We need to assess the performance of the model using its predictions on the test set. We use precision and recall, which are defined as follows:

* **precision** is also referred to as positive predictive value (PPV)

How many selected items are relevant?

$${\displaystyle {\text{precision}}={\frac {tp}{tp+fp}}\,}$$


* **recall** in this context is also referred to as the true positive rate or sensitivity

How many relevant items are selected?

$${\displaystyle {\text{recall}}={\frac {tp}{tp+fn}}\,}$$

Here tp, fp and fn are the numbers of true positives, false positives and false negatives for a given class. Precision and recall will be reported for each class.

In [7]:
from sklearn import metrics

print("precision of the predictions:", metrics.precision_score(y_test, y_pred, average=None))
print("recall of the predictions:", metrics.recall_score(y_test, y_pred, average=None))

precision of the predictions: [0.89473684 0.63157895 0.5       ]
recall of the predictions: [0.73913043 0.66666667 0.61538462]


Based on the reported values, precision is higher than recall for the 1st class (0.89 vs. 0.74), while recall is higher than precision for the 2nd and 3rd classes (0.67 vs. 0.63 and 0.62 vs. 0.50).
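
The per-class precision and recall definitions given earlier in this section can also be worked out by hand from a confusion matrix. The sketch below does this with NumPy; the counts in `cm` are made up for illustration and are not the results of this notebook.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class).
# These counts are invented for illustration only.
cm = np.array([[9, 1, 0],
               [2, 6, 2],
               [1, 1, 8]])

tp = np.diag(cm)            # samples of each class that were predicted correctly
fp = cm.sum(axis=0) - tp    # samples predicted as the class but belonging to another class
fn = cm.sum(axis=1) - tp    # samples belonging to the class but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("precision:", precision)  # 0.75, 0.75, 0.8
print("recall:", recall)        # 0.9, 0.6, 0.8
```

For the model trained above, `metrics.confusion_matrix(y_test, y_pred)` returns a matrix with the same layout (true classes in rows, predicted classes in columns), so the same arithmetic should reproduce the values reported by `precision_score` and `recall_score` with `average=None`.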