# LAB #2: Numpy

## Introduction
In this laboratory, you will build your own version of the K-Nearest Neighbors algorithm (a.k.a. KNN) using
the NumPy library.
## 0 Preliminary steps
### 0.1 NumPy
Make sure you have the NumPy library installed; its use is strongly recommended for this laboratory.
NumPy is the fundamental package for scientific computing with Python. You can read more about it in
the official documentation.


In [None]:
pip install numpy

### 0.2 Iris dataset download 
For this lab, you will need two of the datasets you have already met: Iris and MNIST. Please refer to
Laboratory 1 for a complete description of the datasets.
Iris. You can download it from:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

## 1 Exercises 
Note that exercises marked with a ($\star$) are optional; you should focus on completing the other ones first.

### 1.1 Iris Analysis with NumPy
As you might remember from Lab. 1, the Iris dataset collects the measurements of different Iris flowers.
Each data point is characterized by 4 **features** (sepal length, sepal width, petal length, petal width) and is associated with 1 **label** (i.e. an Iris species: Setosa, Versicolor, or Virginica).

1. Load the Iris dataset. You can use the `csv` library that we saw in the last laboratory or read the file with the standard `open(filename, mode)`. 
In the second case, remember to split the fields correctly and to strip newline characters. In either case, check for empty lines. 
This time, store the 4 features in a numpy array `x` of shape (n_samples, 4) and the labels in a different array `y` of shape (n_samples,), converting the 3 species to corresponding numerical values, e.g.:
      - Iris-setosa: 0
      - Iris-versicolor: 1
      - Iris-virginica: 2

In [8]:
import numpy as np
import csv

# Define a dictionary to map flower species to numerical values
label_dict = {
    "Iris-setosa": 0,
    "Iris-versicolor": 1,
    "Iris-virginica": 2
}

# Initialize empty lists to store data
features = []
labels = []

# Open the CSV file
with open('iris.csv', 'r') as file:
    reader = csv.reader(file)
    
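    # Skip the first row, assuming it is a header. The UCI iris.data file has no
    # header: with that file this call discards the first sample, so remove it.
    next(reader, None)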
    
    # Read the data
    for row in reader:
        if not row:  # Check for empty lines
            continue
        
        # Extract features and label
        feature = list(map(float, row[:-1]))
        label = label_dict[row[-1]]
        
        # Append to the lists
        features.append(feature)
        labels.append(label)

# Convert lists to numpy arrays
x = np.array(features)
y = np.array(labels)

print(x.shape)  # Should print (n_samples, 4)
print(y.shape)  # Should print (n_samples,)
print(y)

(149, 4)
(149,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2]


2. Compute again the mean and standard deviation of each feature for each class, this time using the NumPy functions.

In [16]:
# For each class j and each feature i, compute mean and standard deviation
means = [[], [], []]
std_dev = [[], [], []]

for j in range(3):
    for i in range(4):
        # Select feature i of the samples that belong to class j
        means[j].append(np.mean(x[y == j, i]))
        std_dev[j].append(np.std(x[y == j, i]))

print(means)
print(std_dev)


3. Compute the distance between two pairs of samples (e.g., the $35^{th}$ and the $80^{th}$, and the $12^{th}$ and the $14^{th}$) 
by means of the `np.linalg.norm(a-b)` function, which computes the Euclidean distance. 
  - Can you guess whether the two samples in each pair belong to the same species?
  - From the means and standard deviations computed before, can you guess which species? 

In [13]:
s1 = 35
s2 = 80
print(np.linalg.norm(x[s1]-x[s2]))


2.7586228448267445
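
As a sanity check on what `np.linalg.norm(a - b)` computes, the minimal sketch below compares it with the Euclidean distance written out explicitly. The two vectors `a` and `b` are hypothetical 4-feature samples chosen only for illustration; they are not taken from the Iris file.

```python
import numpy as np

# Two hypothetical 4-feature samples (illustrative values, not real Iris rows)
a = np.array([5.0, 3.5, 1.5, 0.2])
b = np.array([6.0, 3.0, 4.5, 1.4])

# Euclidean distance computed by NumPy
dist_norm = np.linalg.norm(a - b)

# The same quantity written out: square root of the sum of squared differences
dist_manual = np.sqrt(np.sum((a - b) ** 2))

print(dist_norm, dist_manual)
```

Keep in mind that NumPy indexing is zero-based, so the $35^{th}$ sample of an array `x` is `x[34]`.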



### 1.2 KNN design and implementation
In this exercise, you will implement your own version of the K-Nearest Neighbors (KNN) algorithm, and you will use it to assign an
Iris species (i.e. a label) to flowers whose species is unknown.

The KNN algorithm is straightforward. Suppose that some measurements (e.g., the Iris features) and their
corresponding labels (e.g., the Iris species) are known in advance for a set of samples. 

<img src="https://mlarchive.com/wp-content/uploads/2022/09/img2.png" width="800">

Then, whenever we want to label a new sample, we look at the K most similar points (a.k.a. neighbors) and assign a label accordingly. 

<img src="https://mlarchive.com/wp-content/uploads/2022/09/img1-1.png" width="800">


The simplest solution is a majority voting scheme: the new sample receives the label that is most common among its K neighbors. 
This approach is naive only at first sight: the local similarity assumed by KNN (nearby points tend to share the same label) happens to be roughly true. 
Even though this reasoning does not always generalize well, KNN provides a valid baseline for your tasks.
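
The voting idea can be sketched for a single new point as follows. This is only an illustration on a hypothetical toy dataset (`X_known`, `y_known` and `query` are made-up names and values), not the class you are asked to implement later.

```python
import numpy as np

# Hypothetical toy data: 2 features per point, labels 0 or 1
X_known = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
y_known = np.array([0, 0, 0, 1, 1])
query = np.array([1.1, 1.0])  # the point we want to label
k = 3

# Euclidean distance from the query to every known point
dists = np.linalg.norm(X_known - query, axis=1)

# Indices of the k closest known points
nearest = np.argsort(dists)[:k]

# Majority vote: count the labels of the neighbors and pick the most frequent
votes = np.bincount(y_known[nearest])
predicted = np.argmax(votes)
print(predicted)
```

Here the three closest points all carry label 0, so the query is assigned label 0.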




1. Let’s identify a portion of our data for which we will try to guess the species. Randomly select 20%
of the records and store the first four columns (i.e. the features representing each flower) in a
two-dimensional numpy array of shape ($N_{test} \times 4$), where $N_{test}$ is 20% of the total number of samples. You can call it `X_test`.
For the same records, store the label column (i.e. the one with the species values) in another array, namely `y_test`. 
This is the data that will be used to check that your KNN implementation works correctly and to measure its accuracy (i.e. the testing data).

In [50]:
# Shuffle the sample indices and split them 80% / 20%
indices = np.arange(len(x))
np.random.shuffle(indices)
train_len = 80 * len(indices) // 100
train_idx = indices[:train_len]
test_idx = indices[train_len:]

# Training data (80% of the records)
x_train = x[train_idx]
y_train = y[train_idx]

# Test data (the remaining 20% of the records)
x_test = x[test_idx]
y_test = y[test_idx]

print(y_train.shape)  # (n_train,)
print(x_train.shape)  # (n_train, 4)
print(x_test.shape)   # (n_test, 4)
print(y_test.shape)   # (n_test,)

2. Store the remaining 80% of the records in the same way. In this case, use the names `X_train` and `y_train` for the arrays.
This is the data that your model will use as ground-truth knowledge (i.e. the training data, from which we extract the knowledge and which we will use for comparison).


In [51]:
# The training indices were already computed when the data was split above:
# they are the first 80% of the shuffled indices
x_train = x[train_idx]
y_train = y[train_idx]

print(x_train.shape)  # (n_train, 4)
print(y_train.shape)  # (n_train,)

3. Focus now on the KNN technique. 
Starting next month, you will use the `scikit-learn` package. Many of its functionalities
are exposed via an object-oriented interface. With this paradigm in mind, implement the KNN
algorithm and expose it as a Python class. The bare skeleton of your class should look like this (you
are free to add other methods if you want to).

```
class KNearestNeighbors:
    def __init__(self, k):
        self.k = k
    def fit(self, X, y):
        """
        Store the 'prior knowledge' of your model that will be used
        to predict new labels.
        :param X : input data points, ndarray, shape = (R,C).
        :param y : input labels, ndarray, shape = (R,).
        """
        pass # TODO: implement it!
    
    def predict(self, X):
        """Run the KNN classification on X.
        :param X: input data points, ndarray, shape = (N,C).
        :return: labels : ndarray, shape = (N,).
        """
        pass # TODO: implement it!

```


Implement the fit method first. Here, you should only keep track of the main attributes that will be used by the algorithm.
In this version of the algorithm, does the KNN need to store all the samples of `X_train` and `y_train`?

In [None]:
class KNearestNeighbors:
    def __init__(self, k):
        self.k = k
    def fit(self, X, y):
        """
        Store the 'prior knowledge' of your model that will be used
        to predict new labels.
        :param X : input data points, ndarray, shape = (R,C).
        :param y : input labels, ndarray, shape = (R,).
        """
        pass # TODO: implement it!
    
    def predict(self, X):
        """Run the KNN classification on X.
        :param X: input data points, ndarray, shape = (N,C).
        :return: labels : ndarray, shape = (N,).
        """
        pass # TODO: implement it!
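
One possible way to fill in `fit` is sketched below. KNN has no real training phase: every prediction compares the new point with the known samples, so this sketch simply keeps all of them. The attribute names `X_train` and `y_train`, and the choice to return `self` (a convention borrowed from scikit-learn), are assumptions and not part of the given skeleton.

```python
import numpy as np

class KNearestNeighbors:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # Memorize the whole training set: it is needed at prediction time
        self.X_train = np.asarray(X, dtype=float)  # shape (R, C)
        self.y_train = np.asarray(y)               # shape (R,)
        return self  # assumption: allows chaining, e.g. KNearestNeighbors(3).fit(X, y)

# Usage on hypothetical data: 5 samples with 4 features each
model = KNearestNeighbors(k=3).fit(np.zeros((5, 4)), np.arange(5))
print(model.X_train.shape, model.y_train.shape)
```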


4. Implement the `predict` method. The method receives as input a numpy array with N rows and C
columns, corresponding to N flowers. It assigns one of the three Iris species to each row
using the KNN algorithm, and returns the predicted species as a numpy array. For the actual implementation,
identify the K nearest neighbors of each row using the Euclidean distance, where K is given by the parameter `k`.
Then, assign the label using a majority voting scheme.
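
A minimal sketch of the whole class, under a few assumptions, could look like the following. It assumes that labels are non-negative integers (as in the 0/1/2 encoding used for Iris), that ties are broken in favor of the smallest label, and that the training set is small enough to compute all pairwise distances at once. The attribute names and the toy data are hypothetical.

```python
import numpy as np

class KNearestNeighbors:
    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        # Store the training data (assumed attribute names)
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Pairwise differences via broadcasting: shape (N, R, C)
        diff = X[:, np.newaxis, :] - self.X_train[np.newaxis, :, :]
        # Euclidean distances between each test and training point: shape (N, R)
        dists = np.sqrt((diff ** 2).sum(axis=2))
        # Indices of the k nearest training points for each test point
        nearest = np.argsort(dists, axis=1)[:, :self.k]
        neighbor_labels = self.y_train[nearest]
        # Majority vote per row; argmax returns the smallest label on ties
        return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Toy usage: two well-separated groups of points
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
model = KNearestNeighbors(k=3).fit(X_toy, y_toy)
y_hat = model.predict(np.array([[0.05, 0.05], [0.95, 1.05]]))
print(y_hat)
```

The broadcasting step trades memory for speed: it builds an array of shape (N, R, C), which is fine for Iris but may be too large for bigger datasets, where a loop over the test points is a reasonable alternative.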


5. Now let’s fit the KNN model with the `X_train` and `y_train` data. Then, use your KNN model
to predict the species for each record in `X_test` and store the predictions in a numpy array called `y_pred`.
As we did in the previous lab, check how many Iris species in `y_pred` have been guessed correctly with respect to the ones in `y_test` by computing the accuracy. 
A prediction is correct if `y_pred[i] == y_test[i]`. The accuracy is the ratio between the number of correct guesses and the total number of guesses. 
If all labels are assigned correctly (`(y_pred == y_test).all() == True`), the accuracy of the model is 100%. 
Instead, if none of the guessed species corresponds to the real one (`(y_pred == y_test).any() == False`), the accuracy is 0%.
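
The accuracy computation can be sketched as below. The two arrays are hypothetical values used only to show the arithmetic; in the exercise they would come from your test split and from your model's `predict` method.

```python
import numpy as np

# Hypothetical ground-truth labels and predictions (10 samples)
y_test = np.array([0, 1, 2, 2, 1, 0, 1, 2, 0, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2, 0, 1])

# Element-wise comparison gives a boolean array; its mean is the
# fraction of correct guesses
accuracy = (y_pred == y_test).mean()
print(f"Accuracy: {accuracy:.0%}")
```

In this toy case 8 of the 10 predictions match, so the accuracy is 80%.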


6. ($\star$) As a software developer, you might want to increase the functionalities of your product and
publish newer versions over time. The better your code is structured and organized, the lower the
effort to release updates.
As such, extend your KNN implementation by adding the parameter `distance`. This has to be one among:
    - Euclidean distance: $ euclidean(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $
    - Manhattan distance: $ manhattan(p,q) = \sum_{i=1}^n |p_i - q_i|$
    - Cosine distance: $ cosine(p, q) = 1 - \frac{\sum_{i=1}^n p_i q_i}{\sqrt{\sum_{i=1}^n p_i^2} \cdot \sqrt{\sum_{i=1}^n q_i^2}}$

If any of these distances is not already implemented in `numpy`, implement it yourself.
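
The three distances could be written with plain NumPy operations as in the sketch below. The function names and the `DISTANCES` lookup dictionary are hypothetical. Note that `cosine` is implemented here as one minus the cosine similarity, so that smaller values mean more similar points, as for the other two distances; it is undefined when either vector has zero norm.

```python
import numpy as np

def euclidean(p, q):
    # Square root of the sum of squared differences
    return np.sqrt(np.sum((p - q) ** 2))

def manhattan(p, q):
    # Sum of absolute differences
    return np.sum(np.abs(p - q))

def cosine(p, q):
    # 1 - cosine similarity (assumes neither vector has zero norm)
    similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return 1.0 - similarity

# Hypothetical lookup table: maps the value of `distance` to a function
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan, "cosine": cosine}

# Two orthogonal unit vectors as a quick check
p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
for name, func in DISTANCES.items():
    print(name, func(p, q))
```

Selecting the function through a dictionary keeps `predict` unchanged when a new distance is added later.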


7. ($\star$) Again, extend now your KNN implementation by adding the parameter `weights` to the constructor,
as shown below:

```
class KNearestNeighbors:
    def __init__(self, k, distance_metric="euclidean", weights="uniform"):
        self.k = k
        self.distance_metric = distance_metric
        self.weights = weights
```

Change your KNN implementation to accept a new weighting scheme for the labels. If `weights="distance"`,
weight each neighbor's vote by the inverse of its distance (for the distance, again, use
`distance_metric`). The weight for a neighbor $n$ of the point $p$ is:

$
w(p, n) = \frac{1}{\text{distance\_metric}(p, n)}
$

Instead, if the default is chosen (`weights="uniform"`), use the majority voting you already implemented
in Exercise 4.

<img src="https://mlarchive.com/wp-content/uploads/2022/09/img5.png">
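
The inverse-distance vote can be sketched as follows. The helper `weighted_vote` and its parameters are hypothetical names; the small constant `eps` is an added assumption to avoid a division by zero when a neighbor coincides with the query point. Labels are assumed to be non-negative integers.

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_dists, n_classes, eps=1e-12):
    # Each neighbor votes with weight 1 / distance
    weights = 1.0 / (neighbor_dists + eps)
    # Sum the weights per class and pick the class with the largest total
    scores = np.bincount(neighbor_labels, weights=weights, minlength=n_classes)
    return int(np.argmax(scores))

# Hypothetical neighbors: one very close point of class 0, two farther of class 1
labels = np.array([0, 1, 1])
dists = np.array([0.1, 1.0, 2.0])

winner_weighted = weighted_vote(labels, dists, n_classes=3)
winner_uniform = int(np.bincount(labels, minlength=3).argmax())
print(winner_weighted, winner_uniform)
```

In this example the uniform vote picks class 1 (two votes against one), while the weighted vote picks class 0, because the single close neighbor has weight 10 against a total of 1.5 for class 1.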


8. ($\star$) Test the modularity of the implementation by applying it to a different dataset. Ideally, you should
not change the code of your KNN Python class.
- Download the MNIST dataset and sample only 100 points per digit. You will end up with a dataset of 1000 samples.
- Define again four numpy arrays as you did in Exercises 1 and 2.
- Apply your KNN as you did for the Iris dataset.
- Evaluate the accuracy on MNIST’s `y_test`.
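
The per-digit sampling step can be sketched as below. Since the loading code depends on how you stored MNIST, the sketch uses randomly generated stand-in arrays (`X_all`, `y_all`) with the same layout as flattened MNIST images (784 pixels per row); these arrays and all variable names are hypothetical and should be replaced by the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MNIST (hypothetical data): 2000 fake images of 28*28 = 784
# pixels, with 200 samples for each digit from 0 to 9
n_total = 2000
X_all = rng.random((n_total, 784))
y_all = np.repeat(np.arange(10), n_total // 10)

# For each digit, draw 100 distinct row indices among the rows with that label
per_digit = 100
selected = np.concatenate([
    rng.choice(np.flatnonzero(y_all == digit), size=per_digit, replace=False)
    for digit in range(10)
])

X_small = X_all[selected]
y_small = y_all[selected]
print(X_small.shape, y_small.shape)
```

The resulting `X_small` and `y_small` can then be shuffled and split into training and test arrays exactly as done for Iris.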