# Lab #1

The purpose of this laboratory is to get you acquainted with Python. 
More specifically, you will learn how to:
- read different types of datasets (CSV and JSON). 
- extract some useful information (mean and standard deviation) from the datasets using only basic Python.
- create a simple rule-based classifier that is already capable of performing some classification.


## Preliminaries
### Python availability
Make sure that Python 3 is installed on your device by running the command `python --version`. The version should be in the form `3.x.x`.

In [1]:
! python --version
! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv
! wget "https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv" -O mnist.csv

Python 3.12.6


--2024-11-19 12:19:43--  https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: 'iris.csv'

     0K ....                                                   12,8M=0s

2024-11-19 12:19:43 (12,8 MB/s) - 'iris.csv' saved [4551]

--2024-11-19 12:19:44--  https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18289443 (17M) [text/plain]
Saving to: 'mnist.csv'

     0K .......... .......... .......... .......... ..........  0% 4,96M 4s

In [2]:

IRIS = "iris.csv"
MNIST = "mnist.csv"

In [3]:
from typing import List
def mean_fn(data: List[List[float]]) -> List[float]:
    return [sum(col)/len(col) for col in zip(*data)]

def std_fn(data: List[List[float]], means: List[float]) -> List[float]:
    assert len(data[0]) == len(means)
    return [pow(sum((x - m)**2 for x in col) / len(col), 1/2) for col, m in zip(zip(*data), means)]

### Dataset Download
For this lab, three different datasets will be used. Here, you will learn more about them and how to retrieve
them.

#### Iris
Iris is a particularly famous *toy dataset* (i.e. a dataset with a small number of rows and columns, mostly
used for initial small-scale tests and proofs of concept). 
This specific dataset contains information about the **Iris**, a genus that includes 260-300 species of plants. 
The Iris dataset contains measurements for 150 Iris flowers, each belonging to one of three species (50 flowers each): 

| Iris Virginica | Iris Versicolor | Iris Setosa |
|:--------------:|:---------------:|:-----------:|
| <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/1200px-Iris_virginica_2.jpg" alt="Iris Virginica" width="200" /> | <img src="https://www.waternursery.it/document/img_prodotti/616/1646318149.jpeg" alt="Iris Versicolor" width="200" /> | <img src="https://d2j6dbq0eux0bg.cloudfront.net/images/28296135/2323483832.jpg" alt="Iris Setosa" width="200" /> |

Each of the 150 flowers contained in the Iris dataset is represented by 5 values:
- sepal length, in cm
- sepal width, in cm
- petal length, in cm
- petal width, in cm
- Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica (the label)

Each row of the dataset represents a distinct flower (as such, the dataset will have 150 rows). Each
row then contains 5 values (4 measurements and a species label).
The dataset is described in more detail on the [UCI Machine Learning Repository website](https://archive.ics.uci.edu/dataset/53/iris). The dataset
can either be downloaded directly from there (iris.data file), or from a terminal, using the `wget` tool. The
following command downloads the dataset from the original URL and stores it in a file named iris.csv.

`wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv`

The dataset is available as a Comma-Separated Values (CSV) file. These files are typically used to
represent tabular data. 
- Each row is represented on one of the lines. 
- Each of the rows contains a fixed number of columns. 
- Each of the columns (in each row) is separated by a comma (,).

To read CSV files, Python offers a module called `csv` (see the official [documentation](https://docs.python.org/3/library/csv.html)). This module provides `csv.reader()`, which
reads a file row by row. For each row, it returns a list of columns that can be processed as needed. 


Let's read the dataset and print its first five rows.




In [4]:
import csv

print("Reading first lines of IRIS dataset")
with open(IRIS) as f:
    for i, cols in enumerate(csv.reader(f)):
        print(cols)
        if i >= 4:
            break

Reading first lines of IRIS dataset
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']


Note that, by default, `csv.reader` returns every field as a string (`str`).
If you want to treat the fields as numbers, remember to cast them explicitly (e.g. with `float()` or `int()`).
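
As a minimal sketch of this casting step, the snippet below parses a two-row in-memory sample (copied from the output above) instead of the real file. The names `sample` and `parse_row` are illustrative and not part of the lab.

```python
import csv
import io

# Two rows copied from the Iris output above, held in memory for the example
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

def parse_row(cols):
    """Cast the four measurements to float and keep the label as a string."""
    return [float(c) for c in cols[:-1]] + [cols[-1]]

rows = [parse_row(cols) for cols in csv.reader(io.StringIO(sample))]
print(rows[0])  # [5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
```

After the cast, the measurements can be summed and averaged directly, which is not possible while they are still strings.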

#### MNIST
The MNIST dataset is another particularly famous dataset. It contains several thousand handwritten
digits (0 to 9).
- Each handwritten digit is stored as a $28 \times 28$ 8-bit grayscale image.
- This means that each digit has $784$ ($28^2$) pixels.
- Each pixel has a value that ranges from 0 (black) to 255 (white).

<img src="https://machinelearningmastery.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png" alt="MNIST images" width="500" />

The dataset can be downloaded from the following link:

[https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv](https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv)

In this case, MNIST is represented as a CSV file. Similarly to the Iris dataset, each row of the MNIST
dataset holds one sample: here, the pixels of an image depicting a digit. For the sake of simplicity, this dataset contains only a small fraction (10,000
digits out of 70,000) of the real MNIST dataset.

For each digit, 785 values are available: 
- the first one is the numerical value depicted in the image (e.g. for Figure 2 it would be 5). 
- the following 784 columns represent the grayscale image in row-major order (for more information about row- and column-major order of matrices, see [Wikipedia](https://en.wikipedia.org/wiki/Row-_and_column-major_order)).
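
To make the row-major layout concrete, here is a small sketch that converts a flat pixel index (counted from 0 and excluding the label column) into a grid position, and that splits a flat list into rows. The helper names `to_row_col` and `to_grid` are assumptions introduced for illustration.

```python
WIDTH = 28  # MNIST images are 28 x 28

def to_row_col(index, width=WIDTH):
    """Map a flat row-major pixel index to a (row, column) pair."""
    return divmod(index, width)

def to_grid(pixels, width=WIDTH):
    """Split a flat row-major list of pixels into a list of rows."""
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

print(to_row_col(0))    # (0, 0): top-left corner
print(to_row_col(406))  # (14, 14): centre of the image
print(to_row_col(783))  # (27, 27): bottom-right corner

# A dummy image whose pixel values equal their own flat index
grid = to_grid(list(range(784)))
print(len(grid), len(grid[0]))  # 28 28
```

In row-major order, consecutive values belong to the same image row, so every block of 28 values corresponds to one line of the picture.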

The MNIST dataset in CSV format can be read with the same approach used for Iris, keeping in mind
that, in this case, the digit label (i.e. the first column) is an integer from 0 to 9, while the following 784
values are integers between 0 and 255.

## Exercises
Note that exercises marked with a (*) are optional: you should focus on completing the other ones first.
### Iris analysis
1. Load the previously downloaded Iris dataset as a list of lists (each of the 150 lists should have 5 elements). You can make use of the `csv` module presented above.

In [5]:
import csv

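def try_callback(n: str, cb):
    # Try to convert n with cb (e.g. float); fall back to the raw string (e.g. for the species label)
    try:
        return cb(n)
    except ValueError:
        return n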


iris_data = []
with open(IRIS) as f:
    for cols in csv.reader(f):
        if len(cols) != 5: continue
        iris_data.append([try_callback(x, float) for x in cols])

# Print the first five rows
for row in iris_data[:5]:
    print(row)

[5.1, 3.5, 1.4, 0.2, 'Iris-setosa']
[4.9, 3.0, 1.4, 0.2, 'Iris-setosa']
[4.7, 3.2, 1.3, 0.2, 'Iris-setosa']
[4.6, 3.1, 1.5, 0.2, 'Iris-setosa']
[5.0, 3.6, 1.4, 0.2, 'Iris-setosa']

2. Compute and print the mean and the standard deviation for each of the 4 measurement columns (i.e. sepal length and width, petal length and width). Remember that, for a given list of n values $x = (x_1, x_2, ..., x_n)$, the mean $\mu$ and the standard deviation $\sigma$ are defined respectively as:
$$\mu = {1 \over n} \sum_{i=1}^{n} x_i $$

$$ \sigma = \sqrt{ {1 \over n} \sum_{i=1}^{n} (x_i - \mu)^2} $$
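
As a quick worked example of the two formulas, the sketch below applies them to a small hand-picked list (the `values` list is made up for illustration). Because the formula divides by $n$, it is the population standard deviation, which corresponds to `statistics.pstdev` in the standard library rather than `statistics.stdev`.

```python
from math import sqrt
from statistics import pstdev

# Hand-picked values, chosen so that the results are round numbers
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(values)
mu = sum(values) / n
sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)

print(mu, sigma)       # 5.0 2.0
print(pstdev(values))  # same value as sigma, computed by the standard library
```

The library call is shown only as a cross-check: the exercise itself asks for a solution in basic Python.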

In [6]:
mean = mean_fn([x[:-1] for x in iris_data])
std = std_fn([x[:-1] for x in iris_data], mean)

print(mean)
print(std)

[5.843333333333334, 3.0540000000000003, 3.758666666666666, 1.1986666666666668]
[0.8253012917851409, 0.43214658007054346, 1.758529183405521, 0.7606126185881716]



3. Compute and print the mean and the standard deviation for each of the 4 measurement columns, separately for each of the three Iris species (versicolor, virginica and setosa).

In [7]:
species = set(x[-1] for x in iris_data)
splitted_data = {spec: [x[:-1] for x in iris_data if x[-1] == spec] for spec in species}

mean_species = {spec: mean_fn(data) for spec, data in splitted_data.items()}
std_species = {spec: std_fn(data, mean_species[spec]) for spec, data in splitted_data.items()}

for spec in species:
	print(f"{spec} mean: {mean_species[spec]}")
	print(f"{spec} std: {std_species[spec]}")
	print()

species_data = {
	spec: {
		"data": data,
		"mean": mean_species[spec],
		"std": std_species[spec]
	} for spec, data in splitted_data.items()
}

Iris-virginica mean: [6.587999999999999, 2.9739999999999998, 5.5520000000000005, 2.026]
Iris-virginica std: [0.6294886813914926, 0.3192553836664309, 0.546347874526844, 0.2718896835115301]

Iris-versicolor mean: [5.936, 2.77, 4.26, 1.3259999999999998]
Iris-versicolor std: [0.5109833656783751, 0.31064449134018135, 0.4651881339845203, 0.19576516544063705]

Iris-setosa mean: [5.006, 3.418, 1.464, 0.24400000000000002]
Iris-setosa std: [0.3489469873777391, 0.37719490982779713, 0.17176728442867112, 0.10613199329137281]




4. Based on the results of exercises 2 and 3, which of the 4 measurements would you consider the most characteristic of the three species? (In other words, which measurement would you consider “best”, if you were to guess the Iris species based only on those four values?)

The fourth (petal width), because its per-species means are the most distinct and its standard deviations are low.
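
One possible way to back this choice with numbers is sketched below. The score is an assumption made for this example, not a method from the lab: for each measurement, it takes the smallest gap between neighbouring class means and divides it by the sum of the two standard deviations involved. The values are rounded from the output of Exercise 3.

```python
# Per-species means and standard deviations, rounded from the output of Exercise 3
means = {
    "Iris-setosa":     [5.006, 3.418, 1.464, 0.244],
    "Iris-versicolor": [5.936, 2.770, 4.260, 1.326],
    "Iris-virginica":  [6.588, 2.974, 5.552, 2.026],
}
stds = {
    "Iris-setosa":     [0.349, 0.377, 0.172, 0.106],
    "Iris-versicolor": [0.511, 0.311, 0.465, 0.196],
    "Iris-virginica":  [0.629, 0.319, 0.546, 0.272],
}
names = ["sepal length", "sepal width", "petal length", "petal width"]

def separation(col):
    """Smallest gap between neighbouring class means, in units of their summed stds."""
    ordered = sorted(means, key=lambda s: means[s][col])
    return min(
        (means[b][col] - means[a][col]) / (stds[a][col] + stds[b][col])
        for a, b in zip(ordered, ordered[1:])
    )

scores = {name: separation(i) for i, name in enumerate(names)}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")

best = max(scores, key=scores.get)
print("Best separated measurement:", best)
```

Under this heuristic, petal width scores highest, with petal length close behind. A different score could rank the two petal measurements differently.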


5. Based on the considerations of Exercise 3, assign the flowers with the following measurements to what you consider would be the most likely species.
````
5.2, 3.1, 4.0, 1.2
4.9, 2.5, 5.6, 2.0
5.4, 3.2, 1.9, 0.4
````

````
5.2, 3.1, 4.0, 1.2 -> Versicolor
4.9, 2.5, 5.6, 2.0 -> Virginica
5.4, 3.2, 1.9, 0.4 -> Setosa
````


6. (*) Create a rule-based classifier similar to the one seen in class. This classifier, again, will receive some rules and will classify each sample into one of the three species.
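
A rule for this exercise can be as simple as a pair of thresholds on one measurement. The sketch below assumes petal width as the measurement and places each threshold at the midpoint between two neighbouring class means (rounded from Exercise 3). On a single feature, this is equivalent to picking the species with the nearest mean. The function name and the constants are illustrative.

```python
# Petal-width means per species, rounded from the results of Exercise 3
SETOSA_MEAN, VERSICOLOR_MEAN, VIRGINICA_MEAN = 0.244, 1.326, 2.026

# Assumed thresholds: midpoints between neighbouring class means
LOW = (SETOSA_MEAN + VERSICOLOR_MEAN) / 2      # about 0.785
HIGH = (VERSICOLOR_MEAN + VIRGINICA_MEAN) / 2  # about 1.676

def classify_by_petal_width(sample):
    """Classify a flower from its 4 measurements using only petal width (index 3)."""
    petal_width = sample[3]
    if petal_width < LOW:
        return "Iris-setosa"
    if petal_width < HIGH:
        return "Iris-versicolor"
    return "Iris-virginica"

# The three flowers from Exercise 5
for flower in [(5.2, 3.1, 4.0, 1.2), (4.9, 2.5, 5.6, 2.0), (5.4, 3.2, 1.9, 0.4)]:
    print(flower, "->", classify_by_petal_width(flower))
```

The rules are checked in order, so the first condition that holds decides the label and the last `return` acts as the default class.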

In [8]:
def classifier(data: List[float]) -> str:
    distances = {
        spec: abs(data[-1] - specs["mean"][-1]) for spec, specs in species_data.items()
    }
    return min(distances, key=distances.get)

# Classify the three flowers from Exercise 5
for sample in (
    (5.2, 3.1, 4.0, 1.2),
    (4.9, 2.5, 5.6, 2.0),
    (5.4, 3.2, 1.9, 0.4)
):
    print(classifier(sample))


Iris-versicolor
Iris-virginica
Iris-setosa

7. (*) Compute the predictions for all the elements in the dataset and store them in a list. Then, compute the accuracy of the classifier that you created. Remember that the accuracy metric is:

$$ {\text{number of correct predictions (TP + TN)} \over \text{total number of predictions (TP+TN+FP+FN)}} $$

A prediction is correct when it matches the label of the sample (the $5^{th}$ column).
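
The accuracy formula can be illustrated on a tiny hand-made example. The `labels` and `predictions` lists below are invented for illustration, and `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Invented labels and predictions: 3 of the 4 predictions are correct
labels      = ["Iris-setosa", "Iris-setosa",     "Iris-versicolor", "Iris-virginica"]
predictions = ["Iris-setosa", "Iris-versicolor", "Iris-versicolor", "Iris-virginica"]

print(accuracy(labels, predictions))  # 0.75
```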

In [9]:
with_prediction = [
    x + [classifier(x[:-1])] for x in iris_data
]

correct = sum(1 for x in with_prediction if x[-1] == x[-2])
print("Accuracy", correct/len(with_prediction))

Accuracy 0.96


### MNIST Analysis

1. Load the previously downloaded MNIST dataset. You can make use of the csv module already presented.

In [10]:
import csv

mnist_data = []
with open(MNIST) as f:
    for cols in csv.reader(f):
        if len(cols) != 785: continue
        mnist_data.append(list([try_callback(x, int) for x in cols]))

2. Create a function that, given a position $1 \le k \le 10{,}000$, prints the $k^{th}$ sample of the dataset (i.e. the $k^{th}$ row of the CSV file) as a grid of $28 \times 28$ characters. More specifically, you should map each range of pixel values to the following characters:
    - [0; 64) &rarr; " "
    - [64; 128) &rarr; "."
    - [128; 192) &rarr; "*"
    - [192; 256) &rarr; "#"

So, for example, you should map the sequence `0, 72, 192, 138, 250` to the string `" .#*#"` (the leading space comes from the first value, 0).

*Note*: Remember to start a new line every time you read 28 characters.

Example of output: 
```
         .#      **
        .##..*#####
       #########*.
      #####***.
     ##*
    *##
    ##
   .##
    ###*
    .#####.
        *###*
           *###*
              ###
              .##
              ###
            .###
      .    *###.
     .#  .*###*
     .######.
      *##*.
```
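
Since the four ranges all have width 64, the mapping can be sketched as an integer division followed by a lookup in a 4-character string. The names `CHARS`, `pixel_to_char` and `render_row` are assumptions introduced for this example.

```python
CHARS = " .*#"  # position in the string = pixel value // 64

def pixel_to_char(value):
    """Map a pixel value in [0, 255] to one of the four characters."""
    return CHARS[value // 64]

def render_row(pixels):
    """Turn a sequence of pixel values into a string, one character per pixel."""
    return "".join(pixel_to_char(p) for p in pixels)

# repr() makes the leading space visible
print(repr(render_row([0, 72, 192, 138, 250])))  # ' .#*#'
```

The first pixel (0) falls in [0; 64) and therefore becomes a space, which is easy to overlook when the string is printed without quotes.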


In [11]:
from random import randrange

def pretty_print(data: List[int]) -> None:
    # One character per 64-wide range of pixel values: [0, 64) -> " ", ..., [192, 256) -> "#"
    chars = " .*#"

    # data[0] is the label, so the 784 pixels start at index 1; print 28 per line
    for i in range(1, len(data), 28):
        print(''.join(chars[x // 64] for x in data[i: i+28]))

# Print a randomly chosen sample
pretty_print(mnist_data[randrange(len(mnist_data))])

                            
                            
                            
                            
                            
              *#####.       
            *########       
          .#####***##*      
          ###*.    ##*      
         *##*      ##.      
         ###       ##*      
         ##*      *##       
         ###      ##        
         .##*    .#.        
          ###    ..         
           ###              
           .###. .****.     
           .#########*.     
          #######.          
         ####*###.          
         ##*   *#.          
         ##.   *#*          
         *##. .##*          
          #######.          
           *####.           
                            
                            
                            


3. Compute the Euclidean distance between each pair of the 784-dimensional vectors of the digits at
the following positions: $26^{th}$, $30^{th}$, $32^{nd}$, $35^{th}$.

*Note*: Remember that Python lists are indexed from 0, so the $k^{th}$ value will be at position $k-1$.
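
As a small illustration of both points, the sketch below computes the Euclidean distance between two short made-up vectors and shows the $k \rightarrow k-1$ index shift on a toy list. `euclidean` is a hypothetical helper name; `math.dist` (available since Python 3.8) appears only as a cross-check.

```python
from math import dist, sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0, 3, 0]
b = [4, 0, 0]
print(euclidean(a, b))  # 5.0, since sqrt(4^2 + 3^2) = 5
print(dist(a, b))       # 5.0, same result from the standard library

# The k-th element of a list lives at index k - 1
samples = ["first", "second", "third"]
k = 2
print(samples[k - 1])   # second
```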

In [12]:
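def euc_distance(a, b):
    # Square root of the sum of squared element-wise differences
    return pow(sum([(x - y) ** 2 for x, y in zip(a, b)]), 1 / 2)

# Zero-based indexes of the 26th, 30th, 32nd and 35th samples
indexes = [25, 29, 31, 34]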

for i in indexes:
    for j in indexes:
        if i == j: continue
        print(i, j, euc_distance(mnist_data[i][1:], mnist_data[j][1:]))
    print()

25 29 3539.223219860539
25 31 3556.4199695761467
25 34 3223.2069434027967

29 25 3539.223219860539
29 31 1171.8293391104355
29 34 2531.0033583541526

31 25 3556.4199695761467
31 29 1171.8293391104355
31 34 2515.5599774205343

34 25 3223.2069434027967
34 29 2531.0033583541526
34 31 2515.5599774205343



4. Based on the distances computed in the previous step, and knowing that the digits listed in Exercise 3 are (not necessarily in this order) $0, 1, 1, 7$, can you assign the correct label to each of the digits of Exercise 3?

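25: 0
29: 1
31: 1
34: 7

Samples 29 and 31 are by far the closest pair (distance of about 1172), so they are the two 1's. Sample 34 is closer to both of them (about 2500) than sample 25 is (about 3500). Since a 7 looks more like a 1 than a 0 does, 34 is most likely the 7 and 25 the 0.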

5. There are 1,135 images representing 1’s and 980 images representing 0’s in the dataset. For all 0’s and 1’s separately, count the number of times each of the 784 pixels is black (use 128 as the threshold value). You can do this by building a list `Z` and a list `O`, each containing 784 elements, holding the counts for the 0’s and the 1’s respectively. `Z[i]` and `O[i]` contain the number of times the $i^{th}$ pixel was black for either class. For each index $i$, compute `abs(Z[i] - O[i])`. The $i$ with the highest value represents the pixel that best separates the digits “0” and “1” (i.e. the pixel that is most often black for one class and white for the other). Where is this pixel located within the grid, and why?
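
The counting procedure can be sketched on toy data first. The example below uses made-up $2 \times 2$ "images" (4 pixels each) instead of MNIST, and it assumes that the pixels to count are those whose value is above the threshold. Which side of the threshold should be treated as "black" is a choice, so flip the comparison if you need the other convention. All names and values are illustrative.

```python
THRESHOLD = 128
WIDTH = 2  # toy images are 2 x 2; MNIST images would use 28

def count_above(images, threshold=THRESHOLD):
    """For each pixel position, count the images whose value exceeds the threshold."""
    return [sum(1 for v in column if v > threshold) for column in zip(*images)]

# Made-up flattened images for two classes
zeros = [[200, 0, 0, 200], [210, 0, 10, 220]]
ones = [[0, 255, 0, 0], [0, 240, 0, 200], [0, 250, 0, 0]]

Z = count_above(zeros)  # [2, 0, 0, 2]
O = count_above(ones)   # [0, 3, 0, 1]

diffs = [abs(z - o) for z, o in zip(Z, O)]  # [2, 3, 0, 1]
best = max(range(len(diffs)), key=diffs.__getitem__)
row, col = divmod(best, WIDTH)
print(f"Best pixel: index {best}, row {row}, column {col}")  # index 1, row 0, column 1
```

`zip(*images)` transposes the data, so each `column` holds the values of one pixel position across all images of a class.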

In [13]:
only_zero = [x[1:] for x in mnist_data if x[0] == 0]
only_one = [x[1:] for x in mnist_data if x[0] == 1]

print(len(only_zero), len(only_one))

# For each of the 784 pixel positions, count the images of each class whose value exceeds 128
Z = [sum(1 for el in col if el > 128) for col in zip(*only_zero)]
O = [sum(1 for el in col if el > 128) for col in zip(*only_one)]

a = [(i, abs(z-o)) for i, (z, o) in enumerate(zip(Z, O))]

m = (-1, 0)
for i, dist in a:
    if dist > m[1]: m = (i, dist)
    print(f"{i}) {dist= }, {Z[i]=}, {O[i]=}")
print(f"Highest dist {m[1]} at pixel {m[0]} located in position {m[0] //28}x{m[0]%28}")


980 1135
0) dist= 0, Z[i]=0, O[i]=0
1) dist= 0, Z[i]=0, O[i]=0
2) dist= 0, Z[i]=0, O[i]=0
3) dist= 0, Z[i]=0, O[i]=0
4) dist= 0, Z[i]=0, O[i]=0
5) dist= 0, Z[i]=0, O[i]=0
6) dist= 0, Z[i]=0, O[i]=0
7) dist= 0, Z[i]=0, O[i]=0
8) dist= 0, Z[i]=0, O[i]=0
9) dist= 0, Z[i]=0, O[i]=0
10) dist= 0, Z[i]=0, O[i]=0
11) dist= 0, Z[i]=0, O[i]=0
12) dist= 0, Z[i]=0, O[i]=0
13) dist= 0, Z[i]=0, O[i]=0
14) dist= 0, Z[i]=0, O[i]=0
15) dist= 0, Z[i]=0, O[i]=0
16) dist= 0, Z[i]=0, O[i]=0
17) dist= 0, Z[i]=0, O[i]=0
18) dist= 0, Z[i]=0, O[i]=0
19) dist= 0, Z[i]=0, O[i]=0
20) dist= 0, Z[i]=0, O[i]=0
21) dist= 0, Z[i]=0, O[i]=0
22) dist= 0, Z[i]=0, O[i]=0
23) dist= 0, Z[i]=0, O[i]=0
24) dist= 0, Z[i]=0, O[i]=0
25) dist= 0, Z[i]=0, O[i]=0
26) dist= 0, Z[i]=0, O[i]=0
27) dist= 0, Z[i]=0, O[i]=0
28) dist= 0, Z[i]=0, O[i]=0
29) dist= 0, Z[i]=0, O[i]=0
30) dist= 0, Z[i]=0, O[i]=0
31) dist= 0, Z[i]=0, O[i]=0
32) dist= 0, Z[i]=0, O[i]=0
33) dist= 0, Z[i]=0, O[i]=0
34) dist= 0, Z[i]=0, O[i]=0
35) dist= 0, Z[i]=0, 

6. (*) Extract a subset of the MNIST dataset composed of only 0 and 1 digits. Create a rule-based classifier that takes as input the rule that you discovered in Exercise 5. Then, as before, compute the predictions of this classifier on all the samples in the subset.

In [14]:
class RuleModel:
    def __init__(self, default_class):
        """
        Create the rule-based model.
        :param default_class: default class when no rule applies
        """
        self.__default_class = default_class
        # Per-instance rule list (a class-level list would be shared by every model)
        self.__rules = []

    def add_rule(self, rule, output_class):
        """
        Add rule to the model.
        :param rule: lambda function with the conditions on the input sample
        :param output_class: output label to be assigned when the rule is satisfied
        """
        self.__rules.append((rule, output_class))

    def predict(self, x):
        """
        Apply rules to a sample. The first rule that applies represents the output label.
        :param x: input sample, in whatever form the rules expect (here, a list of 784 pixel values)
        """
        for rule, out_class in self.__rules:
            if rule(x):
                return out_class
        return self.__default_class
    
model = RuleModel(0)
model.add_rule(lambda x: x[406] > 128, 1)

correct_predictions = 0
label = 1
for row in only_one:
    prediction = model.predict(row)
    if prediction == label:
        correct_predictions += 1
    else:
        print(f"Error, predicted {prediction}, actual label {label}")
label = 0
for row in only_zero:
    prediction = model.predict(row)
    if prediction == label:
        correct_predictions += 1
    else:
        print(f"Error, predicted {prediction}, actual label {label}")

# Compute accuracy
accuracy = correct_predictions / (len(only_one) + len(only_zero))
print(f"Accuracy of the model: {accuracy}")

Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 0, actual label 1
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Error, predicted 1, actual label 0
Accuracy of the model: 0.9895981087470449
