# Foundations of Artificial Intelligence and Machine Learning
## A Program by IIIT-H and TalentSprint


#### To be done in the Lab

In this experiment, we will use the fruits data set that we explored earlier and learn how a simple $k$-nearest-neighbour (kNN) classifier works.

Let us consider a simple situation. Given some data about a fruit, we want to label it automatically.

Fruits are characterized by 
 * weight in grams as a float
 * colour as an integer
     - 1 $\rightarrow$ red
     - 2 $\rightarrow$ orange
     - 3 $\rightarrow$ yellow
     - 4 $\rightarrow$ green
     - 5 $\rightarrow$ blue
     - 6 $\rightarrow$ purple
 * label as a string
     - "A" $\rightarrow$ Apple
     - "B" $\rightarrow$ Banana
     
We are given some sample data such as (303, 3, "A"), meaning that a fruit weighing 303 grams with yellow colour is an apple. A set of such *training samples* is given in “01-train.csv”. This is a small set of 17 **labeled** samples.

We are also given a set of **test** data where only the weight and colour are given, e.g. (373, 1). We should design a simple nearest neighbour classifier that finds the fruit label, i.e. "A" or "B", meaning Apple or Banana.

We have 30 such test cases. We are also given an additional file which has the correct labels for the test cases, so we can compare our predictions with these. If your predicted label matches the correct one, you have done well!

Here are the details of all the files:
  * **01-train.csv** $\Rightarrow$ The original input data. 
    - 18 lines
    - the first line is a header
    - each of the remaining 17 lines has three pieces of data:
       * weight in grams :: float
       * colour code :: 1, 2, 3, 4, 5 
       * label :: "A", "B"
  * **01-test1.csv** $\Rightarrow$ The first test data set.
    - 31 lines
    - the first line is a header
    - each of the remaining 30 lines has two pieces of data
       * weight in grams :: float
       * colour code :: 1, 2, 3, 4, 5
  * **01-test1-labels.csv** $\Rightarrow$ The labels for test data set above. That is, each line has just the correct label.
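
Once predictions are available, comparing them with the labels file amounts to counting the positions where the two agree. The following is a minimal sketch of that comparison. The function name `accuracy` and the two short lists are made up for illustration and are not taken from the data files.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the correct one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Made-up labels for illustration only; not taken from the data files
predicted = ["A", "B", "A", "A"]
actual = ["A", "B", "B", "A"]
print(accuracy(predicted, actual))  # 0.75
```

Three of the four made-up predictions match, so the sketch reports 0.75.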

In [None]:
## Let us set up the file names
FRUITS_TRAIN = "../Datasets/AIML_DS_TRAIN_SAMPLE.csv"
FRUITS_TEST1 = "../Datasets/AIML_DS_TEST1_SAMPLE.csv"
FRUITS_LABELS1 = "../Datasets/AIML_DS_TEST-LABLES_SAMPLE.csv"

In [None]:
# Let us first read the data from the file and do a quick visualization
import pandas as pd
train = pd.read_csv(FRUITS_TRAIN)
train

In [None]:
apples = train[train.Label == "A"]
bananas = train[train.Label == "B"]
import matplotlib.pyplot as plt
plt.plot(apples.Weight, apples.Colour, "ro")
plt.plot(bananas.Weight, bananas.Colour, "y+")
plt.xlabel("Weight -- in grams")
plt.ylabel("Colour -- r-o-y-g-b-p")
plt.legend(["Apples", "Bananas"])
plt.show()

We see that similar fruits lie close together in the feature (weight, colour) space. Now let us plot one given test sample, (373, 1), in black.

In [None]:
plt.plot(apples.Weight, apples.Colour, "ro")
plt.plot(bananas.Weight, bananas.Colour, "y+")
plt.xlabel("Weight -- in grams")
plt.ylabel("Colour -- r-o-y-g-b-p")
plt.legend(["Apples", "Bananas"])
plt.plot([373], [1], "ko")
plt.show()

From the visualization alone, we can infer that the unknown fruit is likely to be an apple. 

Instead of eyeballing the plot one sample at a time as above, the job now is to use a kNN classifier with, say, $k = 3$ and the *Euclidean* distance to determine the correct label for each of the 30 data points in the file "01-test1.csv".

Let us first write a distance function to calculate the *Euclidean* distance between two fruits.

$distance = \sqrt{\sum_i (a_i - b_i)^2}$
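
As a quick numeric check of the Euclidean distance, the sketch below works through one pair of points by hand: the test sample (373, 1) and the training sample (303, 3) mentioned earlier. The variable names are chosen here for illustration only.

```python
import math

# Two fruits as (weight in grams, colour code); both pairs appear in the text above
a = (373, 1)
b = (303, 3)

# Sum the squared differences per feature, then take the square root
squared_sum = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
distance = math.sqrt(squared_sum)

print(squared_sum)         # 70**2 + (-2)**2 = 4904
print(round(distance, 2))  # 70.03
```

Note that the weight difference contributes 4900 of the 4904, so with these unscaled features the weight largely determines which neighbours are nearest.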

In [None]:
import math
def dist(a, b):
    ''' a is the n-dimensional co-ordinate of point 1
        b is the n-dimensional co-ordinate of point 2'''
    sqSum = 0
    for i in range(len(a)):
        sqSum += (a[i] - b[i]) ** 2
    return math.sqrt(sqSum)

Now let us write code to find the $k$ nearest neighbours of a given fruit.

In [None]:
def kNN(k, train, given):
    ''' Return the k training samples nearest to given, as (distance, label) pairs '''
    distances = []
    # loop over all training samples
    for t in train.values:
        # compute and store the distance of each training sample from the given sample
        distances.append((dist(t[:2], given), t[2]))
    distances.sort()        # ascending order of distance
    return distances[:k]    # the first k entries are the k nearest neighbours

In [None]:
print(kNN(3, train, (373, 1)))
print(kNN(5, train, (373, 1)))

Now let us load the test data and find the $k$ nearest neighbours for the first 10 samples.

In [None]:
trial = pd.read_csv(FRUITS_TEST1).values[:10]
for t in trial:
    print(t, kNN(3, train, t))

## Summary
In the above experiment, we saw that a simple nearest neighbour method can work with only a small number of labelled examples. So far, however, we have only listed the nearest neighbours of each test sample. We still need to write some more code to count the labels among the $k$ nearest neighbours and pick the majority class as the prediction.
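
The counting step mentioned above can be sketched as a simple majority vote. The function name `majority_label` and the example neighbour list are assumptions made for illustration; the list only mimics the `(distance, label)` shape that `kNN` returns and does not come from the data files.

```python
from collections import Counter

def majority_label(neighbours):
    """Return the most common label among (distance, label) pairs."""
    labels = [label for _, label in neighbours]
    # most_common(1) returns a list holding the single (label, count) pair
    return Counter(labels).most_common(1)[0][0]

# Made-up neighbour list in the same (distance, label) shape that kNN returns
example = [(12.0, "A"), (15.5, "B"), (20.1, "A")]
print(majority_label(example))  # A
```

With two classes and an odd $k$ such as 3, a tie cannot occur, which is one reason to prefer an odd $k$ here.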

## Acknowledgment
This experiment is based on the blog post http://www.jiaaro.com/KNN-for-humans. 