# Mandatory assignment 2

In [15]:
import numpy as np
from math import sqrt, floor, pow
import pandas as pd

## Preparing the data
The dataset is downloaded from a URL and loaded into a pandas DataFrame.

In [16]:
URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

In [17]:
df_iris = pd.read_csv(URL,
                      header=None,
                      names=['sepal length', 'sepal width',
                             'petal length', 'petal width', 'class'])
df_iris

| | sepal length | sepal width | petal length | petal width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


The labels are strings, so it is easier and more efficient to convert them to numbers.
This is done with the `pd.factorize()` function, which gives each distinct class an integer code in order of first appearance.
The dataframe is then converted to a NumPy array.

In [18]:
df_iris['class'] = pd.factorize(df_iris['class'])[0]
np_iris = df_iris.to_numpy()
n_columns = np_iris.shape[1] # number of columns
np_iris

array([[5.1, 3.5, 1.4, 0.2, 0. ],
       [4.9, 3. , 1.4, 0.2, 0. ],
       [4.7, 3.2, 1.3, 0.2, 0. ],
       [4.6, 3.1, 1.5, 0.2, 0. ],
       [5. , 3.6, 1.4, 0.2, 0. ],
       [5.4, 3.9, 1.7, 0.4, 0. ],
       [4.6, 3.4, 1.4, 0.3, 0. ],
       [5. , 3.4, 1.5, 0.2, 0. ],
       [4.4, 2.9, 1.4, 0.2, 0. ],
       [4.9, 3.1, 1.5, 0.1, 0. ],
       [5.4, 3.7, 1.5, 0.2, 0. ],
       [4.8, 3.4, 1.6, 0.2, 0. ],
       [4.8, 3. , 1.4, 0.1, 0. ],
       [4.3, 3. , 1.1, 0.1, 0. ],
       [5.8, 4. , 1.2, 0.2, 0. ],
       [5.7, 4.4, 1.5, 0.4, 0. ],
       [5.4, 3.9, 1.3, 0.4, 0. ],
       [5.1, 3.5, 1.4, 0.3, 0. ],
       [5.7, 3.8, 1.7, 0.3, 0. ],
       [5.1, 3.8, 1.5, 0.3, 0. ],
       [5.4, 3.4, 1.7, 0.2, 0. ],
       [5.1, 3.7, 1.5, 0.4, 0. ],
       [4.6, 3.6, 1. , 0.2, 0. ],
       [5.1, 3.3, 1.7, 0.5, 0. ],
       [4.8, 3.4, 1.9, 0.2, 0. ],
       [5. , 3. , 1.6, 0.2, 0. ],
       [5. , 3.4, 1.6, 0.4, 0. ],
       [5.2, 3.5, 1.5, 0.2, 0. ],
       [5.2, 3.4, 1.4, 0.2, 0. ],
       [4.7, 3
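
As a small illustration of the label conversion, the sketch below runs `pd.factorize()` on a handful of made-up labels rather than the full `class` column. The variable names `labels`, `codes` and `uniques` are chosen here for illustration only. Besides the integer codes, `pd.factorize()` also returns the distinct values, which can be used to translate a numeric prediction back to a class name.

```python
import pandas as pd

# Toy labels standing in for the 'class' column (hypothetical data, not the full dataset)
labels = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica'])

# factorize returns one integer code per row plus the distinct values
codes, uniques = pd.factorize(labels)
print(codes)          # codes are numbered in order of first appearance
print(list(uniques))  # position i holds the class name for code i

# Translate a numeric code back to its class name
name_of_code_0 = uniques[0]
print(name_of_code_0)
```

In the notebook only the codes (index `[0]` of the result) are kept, so the lookup back to the class name has to be done from memory of the class order.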

<a id='funcy'></a>
## Defining a function for Euclidean distance
${d(p,q) = \sqrt{ \sum_{i=1}^{N} (q_i -p_i)^2}}$

The Euclidean distance can be implemented in a simple, if crude, way with a for loop.
The loop iterates over the features of the dataset and compares one data vector `q` to another data vector `p`, feature by feature.

In [19]:
def euclidian_distance(q, p, N):
    total = 0.0  # running sum of squares (avoids shadowing the built-in sum())
    for i in range(N):
        p_float = float(p[i])  # convert NumPy float to Python float
        q_float = float(q[i])  # convert NumPy float to Python float
        total += pow(q_float - p_float, 2)  # add the squared difference of feature i
    return sqrt(total)  # square root of the summed squares
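
For comparison, the same formula can be written without an explicit loop. The following is a minimal sketch using `np.linalg.norm`; the function name `euclidean_distance_vectorized` is hypothetical and not part of the assignment code, and the two input vectors are the feature values of the first two rows of the dataset.

```python
import numpy as np

def euclidean_distance_vectorized(q, p):
    """Euclidean distance between two equal-length feature vectors."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # The norm of the difference vector is sqrt(sum((q_i - p_i)^2))
    return float(np.linalg.norm(q - p))

d = euclidean_distance_vectorized([5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2])
print(d)  # approximately 0.5385
```

The loop version makes each step of the formula visible, while the vectorized version leaves the iteration to NumPy and does not need the number of features as an argument.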

## Control calculation of distance between two rows
By computing the distance between the first two rows by hand, we can check that the `euclidian_distance()` function yields the correct output.



In [20]:
manual_first_distance = sqrt(pow(5.1 - 4.9, 2) + pow(3.5 - 3.0, 2) + pow(0, 2) + pow(0, 2))
print("Manual calculation of Euclidean distance between first two rows:", manual_first_distance)

Manual calculation of Euclidean distance between first two rows: 0.5385164807134502


We now calculate the Euclidean distance between the first two rows using the `euclidian_distance()` function defined in cell 19 above.
The data is extracted directly from the dataset and passed to the function.

In [21]:
row0 = np_iris[0, 0:4]
row1 = np_iris[1, 0:4]
first_distance = euclidian_distance(row0, row1, n_columns - 1)
print("First distance calculated with the euclidian_distance() method yields:", first_distance)
print("The result is the same as the manual calculation:", manual_first_distance == first_distance)

First distance calculated with the euclidian_distance() method yields: 0.5385164807134502
The result is the same as the manual calculation: True
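
The check above compares two floats with `==`, which works here because both values are computed by the same sequence of operations. When the two values come from different routes, a tolerance-based comparison is usually safer. The sketch below uses `math.isclose`; the variable names and the second route (taking the square root of the rounded sum 0.04 + 0.25) are illustrative assumptions.

```python
from math import sqrt, isclose

# Route 1: the manual calculation for the first two rows
manual = sqrt((5.1 - 4.9) ** 2 + (3.5 - 3.0) ** 2)
# Route 2: square root of the rounded squared differences
alternative = sqrt(0.04 + 0.25)

same_exact = manual == alternative                        # may fail due to rounding
same_close = isclose(manual, alternative, rel_tol=1e-9)   # tolerant comparison
print(same_exact, same_close)
```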


## Predicting a new value
The task at hand is to classify a flower with the features `[7.0,3.1,1.3,0.7]`.
To do this we iterate over every row in the dataset.
For every row we compute the Euclidean distance between the new flower and that row's features.

### Calculating all Euclidean distances

In [22]:
distances = []
new_dp = np.array([7.0,3.1,1.3,0.7])

for i in range(len(np_iris)):
    p_row = np_iris[i,0:4] # extracting the 4 features of the i'th row from the numpy array
    label = np_iris[i, 4] # extracting the label from the i'th row of the numpy array
    distances.append([euclidian_distance(new_dp, p_row, n_columns - 1), label])  # append distance of i'th row to list of distances
distances

[[2.0074859899884734, 0.0],
 [2.1633307652783933, 0.0],
 [2.355843797877949, 0.0],
 [2.459674775249769, 0.0],
 [2.12367605815953, 0.0],
 [1.8574175621006705, 0.0],
 [2.4535688292770597, 0.0],
 [2.092844953645635, 0.0],
 [2.657066051117284, 0.0],
 [2.1931712199461306, 0.0],
 [1.7916472867168916, 0.0],
 [2.2956480566497994, 0.0],
 [2.2847319317591728, 0.0],
 [2.7748873851023217, 0.0],
 [1.584297951775486, 0.0],
 [1.8734993995195195, 0.0],
 [1.813835714721705, 0.0],
 [1.984943324127921, 0.0],
 [1.5811388300841893, 0.0],
 [2.0736441353327724, 0.0],
 [1.7492855684535897, 0.0],
 [2.024845673131659, 0.0],
 [2.519920633670831, 0.0],
 [1.9621416870348587, 0.0],
 [2.353720459187964, 0.0],
 [2.085665361461421, 0.0],
 [2.0663978319771825, 0.0],
 [1.9209372712298545, 0.0],
 [1.8947295321496413, 0.0],
 [2.3748684174075834, 0.0],
 [2.2759613353482084, 0.0],
 [1.6673332000533063, 0.0],
 [2.1540659228538015, 0.0],
 [1.928730152198591, 0.0],
 [2.1931712199461306, 0.0],
 [2.0663978319771825, 0.0],
 [1.63
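
The loop above can also be expressed with NumPy broadcasting, which computes all distances in one expression. This is a sketch on a three-row stand-in for `np_iris` (rows copied from the table shown earlier); the names `data`, `new_point`, `features`, `labels` and `all_distances` are chosen for illustration.

```python
import numpy as np

# Small stand-in for np_iris: four feature columns plus a numeric label column
data = np.array([
    [5.1, 3.5, 1.4, 0.2, 0.0],
    [4.9, 3.0, 1.4, 0.2, 0.0],
    [6.7, 3.0, 5.2, 2.3, 2.0],
])
new_point = np.array([7.0, 3.1, 1.3, 0.7])

features = data[:, :4]  # all rows, first four columns
labels = data[:, 4]     # all rows, label column

# Broadcasting subtracts new_point from every row; axis=1 sums across the features
all_distances = np.sqrt(((features - new_point) ** 2).sum(axis=1))
print(all_distances)
```

The first two values should agree with the first two distances in the list printed above.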

### Sorting the distances
The next step is to sort the Euclidean distances from smallest to largest.
Since the label is attached to each distance, we immediately get an idea of which labels the nearest neighbours carry.

In [23]:
distances = sorted(distances)
distances

[[1.5811388300841893, 0.0],
 [1.584297951775486, 0.0],
 [1.6309506430300091, 0.0],
 [1.6673332000533063, 0.0],
 [1.7492855684535897, 0.0],
 [1.7916472867168916, 0.0],
 [1.813835714721705, 0.0],
 [1.8574175621006705, 0.0],
 [1.8734993995195195, 0.0],
 [1.881488772222678, 0.0],
 [1.8947295321496413, 0.0],
 [1.9209372712298545, 0.0],
 [1.928730152198591, 0.0],
 [1.9621416870348587, 0.0],
 [1.984943324127921, 0.0],
 [1.9974984355438181, 0.0],
 [2.0074859899884734, 0.0],
 [2.024845673131659, 0.0],
 [2.0639767440550294, 0.0],
 [2.0663978319771825, 0.0],
 [2.0663978319771825, 0.0],
 [2.073644135332772, 0.0],
 [2.0736441353327724, 0.0],
 [2.078460969082653, 0.0],
 [2.085665361461421, 0.0],
 [2.092844953645635, 0.0],
 [2.1071307505705477, 0.0],
 [2.12367605815953, 0.0],
 [2.1330729007701543, 0.0],
 [2.1540659228538015, 0.0],
 [2.1633307652783933, 0.0],
 [2.1931712199461306, 0.0],
 [2.1931712199461306, 0.0],
 [2.1931712199461306, 0.0],
 [2.2405356502408083, 0.0],
 [2.2759613353482084, 0.0],
 [2.


### Choosing a k value
We will now define the value of k as an integer.
k determines how many of the closest neighbours we inspect in order to classify the new flower.
Choosing a low value for k means that noise can interfere with the classification, while choosing a high value lets more distant points, possibly from other classes, take part in the vote.
After reading articles and forums online, I found a common rule of thumb for choosing k.
The rule of thumb is simply to use a k equal to the square root of the number of rows in the dataset used for comparison.
This is the reason for my choice of k: with 150 rows the square root is about 12.25, which is rounded down to 12.

In [24]:
k = floor(sqrt(len(np_iris)))
k

12

### Extracting K nearest neighbours
The following code slices the sorted list to get the k nearest neighbours.
It then extracts the labels from these.

In [25]:
k_nearest_neighbours = distances[0:k]
k_nearest_labels = []
for i in range(k):
    k_nearest_labels.append(k_nearest_neighbours[i][1])
k_nearest_labels

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
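
Sorting the whole list is not strictly required when only the k smallest distances are needed. As a hedged alternative, the sketch below uses `heapq.nsmallest` from the standard library on a few made-up `[distance, label]` pairs; `pairs`, `k_small` and `nearest_labels` are illustrative names, and the numbers are not taken from the dataset.

```python
import heapq

# Hypothetical [distance, label] pairs in the same shape as the `distances` list
pairs = [[2.01, 0.0], [1.58, 0.0], [4.90, 2.0], [1.63, 0.0], [3.20, 1.0]]
k_small = 3

# nsmallest returns the k items with the smallest key, in ascending order
nearest = heapq.nsmallest(k_small, pairs, key=lambda pair: pair[0])
nearest_labels = [label for _, label in nearest]
print(nearest_labels)
```

For a dataset as small as this one the difference is negligible, so the full sort used in the notebook is a perfectly reasonable choice.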

### Defining a function for counting instances
To decide which class to assign to the new flower, we use a function that returns the most frequent element of a list.

In [26]:
from collections import Counter

def most_frequent(labels):
    occurrence_count = Counter(labels)  # maps each label to its number of occurrences
    print(occurrence_count)
    return occurrence_count.most_common(1)[0][0]  # label with the highest count

### Predicting the class of the new flower
Using the `most_frequent()` function, we arrive at the final predicted class for the new flower.

In [27]:
pred = most_frequent(k_nearest_labels)
print("prediction is for label: ", pred)

Counter({0.0: 12})
prediction is for label:  0.0


**The flower with registered features `[7.0,3.1,1.3,0.7]` is predicted to be of class 0, which corresponds to Iris-setosa**
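
The individual steps of this notebook (distances, sorting, selecting k neighbours, majority vote) can be collected into one function. The following is a minimal sketch, not the assignment's actual implementation: `knn_predict` and the five-row array `toy` are hypothetical, with rows copied from the table shown earlier, and the last column is assumed to hold the numeric label.

```python
from collections import Counter
import numpy as np

def knn_predict(data, new_point, k):
    """Majority vote among the k rows of `data` closest to `new_point`.

    Assumes the last column of `data` holds the numeric label.
    """
    features, labels = data[:, :-1], data[:, -1]
    dists = np.sqrt(((features - new_point) ** 2).sum(axis=1))
    nearest_idx = np.argsort(dists)[:k]  # indices of the k smallest distances
    votes = Counter(labels[nearest_idx].tolist())
    return votes.most_common(1)[0][0]

toy = np.array([
    [5.1, 3.5, 1.4, 0.2, 0.0],
    [4.9, 3.0, 1.4, 0.2, 0.0],
    [4.7, 3.2, 1.3, 0.2, 0.0],
    [6.7, 3.0, 5.2, 2.3, 2.0],
    [6.3, 2.5, 5.0, 1.9, 2.0],
])
pred = knn_predict(toy, np.array([7.0, 3.1, 1.3, 0.7]), k=3)
print(pred)
```

On this toy subset the three nearest rows all carry label 0, so the vote agrees with the prediction made on the full dataset.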

## Conclusion
The predicted class is 0, and looking at the sorted data this comes as no surprise: at a glance, about 20 of the nearest neighbours are all class 0. For a more difficult dataset, however, the sorted outcome would probably be less obvious.

Another note concerns the computational expense, which is somewhat greater than necessary because this is a Jupyter notebook in which all values are printed and inspected at every step. The main idea is nevertheless presented, and I feel I have gained a good first insight into the use of KNN in practice.