# Lab #1

The purpose of this laboratory is to get you acquainted with Python. 
More specifically, you will learn how to:
- read different types of datasets (CSV and JSON). 
- extract some useful information (mean and standard deviation) from the datasets using only basic Python.
- create a simple rule-based classifier that is already capable of performing some classification.


## Preliminaries
### Python availability
Make sure that Python 3 is installed on your device with the command `python3 --version`. The version should be in the form `3.x.x`.

In [3]:
! python3 --version

Python 3.10.12


### Dataset Download
For this lab, three different datasets will be used. Here, you will learn more about them and how to retrieve
them.

#### Iris
Iris is a particularly famous *toy dataset* (i.e. a dataset with a small number of rows and columns, mostly
used for initial small-scale tests and proofs of concept). 
This specific dataset contains information about the **Iris**, a genus that includes 260-300 species of plants. 
The Iris dataset contains measurements for 150 Iris flowers, each belonging to one of three species (50 flowers each): 

| Iris Virginica            |  Iris Versicolor          |   Iris Setosa  |
|:-------------------------:|:-------------------------:|:---------------|
| <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/1200px-Iris_virginica_2.jpg" alt="Iris Virginica" width="200" /> | <img src="https://www.waternursery.it/document/img_prodotti/616/1646318149.jpeg" alt="Iris Versicolor" width="200" /> | <img src="https://d2j6dbq0eux0bg.cloudfront.net/images/28296135/2323483832.jpg" alt="Iris Setosa" width="200" /> |

Each of the 150 flowers contained in the Iris dataset is represented by 5 values:
- sepal length, in cm
- sepal width, in cm
- petal length, in cm
- petal width, in cm
- Iris species, one of: Iris-setosa, Iris-versicolor, Iris-virginica (the label)

Each row of the dataset represents a distinct flower (as such, the dataset will have 150 rows). Each
row then contains 5 values (4 measurements and a species label).
The dataset is described in more detail on the [UCI Machine Learning Repository website](https://archive.ics.uci.edu/dataset/53/iris). The dataset
can either be downloaded directly from there (iris.data file), or from a terminal, using the `wget` tool. The
following command downloads the dataset from the original URL and stores it in a file named iris.csv.

`wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv`

The dataset is available as a Comma-Separated Values (CSV) file. These files are typically used to
represent tabular data. 
- Each row is represented on one of the lines. 
- Each of the rows contains a fixed number of columns. 
- Each of the columns (in each row) is separated by a comma (,).

To read CSV files, Python offers a module called `csv` (see the official [documentation](https://docs.python.org/3/library/csv.html)). This module provides `csv.reader()`, which
reads a file row by row. For each row, it returns a list of columns that can be processed as needed. 


Let's download the dataset and print the first three rows.




In [24]:
#! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data" -O iris.csv
import csv

# newline="" is the setting recommended by the csv module documentation
with open("iris.csv", "r", newline="") as f:
    # The file has no header line: every row is a flower
    for i, cols in enumerate(csv.reader(f)):
        print(cols)
        if i >= 2:  # stop after the first three rows
            break


['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']


Note that, by default, `csv.reader` returns every field as a string (`str`). 
If you want to treat the fields as numbers, remember to cast them correctly!
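
A minimal sketch of such a cast, assuming the five-column layout shown above; the helper name `parse_row` is hypothetical and not part of the lab material:

```python
def parse_row(cols):
    # Hypothetical helper: the first four fields are measurements (floats),
    # the last field is the species label and stays a string
    measurements = [float(v) for v in cols[:4]]
    label = cols[4]
    return measurements, label

values, species = parse_row(['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'])
print(values, species)
```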

#### MNIST
The MNIST dataset is another particularly famous dataset. It contains several thousands of hand-written
digits (0 to 9). 
- Each hand-written digit is contained in a $28 \times 28$ 8-bit grayscale image. 
- This means that each digit has $784$ ($28^2$) pixels.
- Each pixel has a value that ranges from 0 (black) to 255 (white).

<img src="https://machinelearningmastery.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png" alt="MNIST images" width="500" />

The dataset can be downloaded from the following link:

[https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv](https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv)



In this case, MNIST is represented as a CSV file. Similarly to the Iris dataset, each row of the MNIST
dataset represents a digit. For the sake of simplicity, this dataset contains only a small fraction (10,000
digits out of 70,000) of the real MNIST dataset. 

For each digit, 785 values are available: 
- the first one is the numerical value depicted in the image (e.g. for Figure 2 it would be 5). 
- the following 784 columns represent the grayscale image in row-major order (for more information about row- and column-major order of matrices, see [Wikipedia](https://en.wikipedia.org/wiki/Row-_and_column-major_order)).
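
To make the row-major layout concrete, here is a small sketch that uses a stand-in list of 784 numbers instead of a real digit. The names `pixel_index` and `to_grid` are illustrative assumptions, not functions provided by the lab:

```python
WIDTH = 28  # side of the image, as described above

def pixel_index(row, col, width=WIDTH):
    # Row-major order: rows are stored one after another
    return row * width + col

def to_grid(flat, width=WIDTH):
    # Split a flat list into consecutive rows of `width` values
    return [flat[i:i + width] for i in range(0, len(flat), width)]

flat = list(range(784))  # stand-in for the 784 pixel values of one digit
grid = to_grid(flat)
print(len(grid), len(grid[0]))                # 28 28
print(grid[2][5] == flat[pixel_index(2, 5)])  # True
```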

The MNIST dataset in CSV format can be read with the same approach used for Iris, keeping in mind
that, in this case, the digit label (i.e. the first column) is an integer from 0 to 9, while the following 784
values are integers between 0 and 255. Let's download it and print the first digit formatted as a $28 \times 28$ matrix.

In [25]:
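#! wget https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv -O mnist_test.csv
import csv

with open("mnist_test.csv") as f:
    # Read only the first row: one label followed by 784 pixel values
    first = next(csv.reader(f))

label, pixels = int(first[0]), [int(p) for p in first[1:]]
print("Label:", label)
for r in range(28):
    # Each image row is a slice of 28 consecutive values (row-major order)
    print(" ".join(f"{p:3d}" for p in pixels[28 * r:28 * (r + 1)]))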

--2023-10-10 12:11:17--  https://raw.githubusercontent.com/dbdmg/data-science-lab/master/datasets/mnist_test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18289443 (17M) [text/plain]
Saving to: ‘mnist_test.csv’


2023-10-10 12:11:17 (24.7 MB/s) - ‘mnist_test.csv’ saved [18289443/18289443]





## Exercises
Note that exercises marked with a (*) are optional; you should focus on completing the other ones first.
### Iris analysis
1. Load the previously downloaded Iris dataset as a list of lists (each of the 150 lists should have 5 elements). You can make use of the `csv` module presented above.

In [3]:
import csv

rows = []

with open("iris.csv") as iris:
    for cols in csv.reader(iris):
        # Keep only complete rows; blank lines are read as empty lists
        if len(cols) == 5:
            rows.append(cols)

print(rows)

[['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'], ['4.9', '3.0', '1.4', '0.2', 'Iris-setosa'], ['4.7', '3.2', '1.3', '0.2', 'Iris-setosa'], ['4.6', '3.1', '1.5', '0.2', 'Iris-setosa'], ['5.0', '3.6', '1.4', '0.2', 'Iris-setosa'], ['5.4', '3.9', '1.7', '0.4', 'Iris-setosa'], ['4.6', '3.4', '1.4', '0.3', 'Iris-setosa'], ['5.0', '3.4', '1.5', '0.2', 'Iris-setosa'], ['4.4', '2.9', '1.4', '0.2', 'Iris-setosa'], ['4.9', '3.1', '1.5', '0.1', 'Iris-setosa'], ['5.4', '3.7', '1.5', '0.2', 'Iris-setosa'], ['4.8', '3.4', '1.6', '0.2', 'Iris-setosa'], ['4.8', '3.0', '1.4', '0.1', 'Iris-setosa'], ['4.3', '3.0', '1.1', '0.1', 'Iris-setosa'], ['5.8', '4.0', '1.2', '0.2', 'Iris-setosa'], ['5.7', '4.4', '1.5', '0.4', 'Iris-setosa'], ['5.4', '3.9', '1.3', '0.4', 'Iris-setosa'], ['5.1', '3.5', '1.4', '0.3', 'Iris-setosa'], ['5.7', '3.8', '1.7', '0.3', 'Iris-setosa'], ['5.1', '3.8', '1.5', '0.3', 'Iris-setosa'], ['5.4', '3.4', '1.7', '0.2', 'Iris-setosa'], ['5.1', '3.7', '1.5', '0.4', 'Iris-setosa'], ['4.6', '

2. Compute and print the mean and the standard deviation for each of the 4 measurement columns (i.e. sepal length and width, petal length and width). Remember that, for a given list of n values $x = (x_1, x_2, \dots, x_n)$, the mean $\mu$ and the standard deviation $\sigma$ are defined respectively as:
$$\mu = {1 \over n} \sum_{i=1}^n x_i $$

$$ \sigma = \sqrt{ {1 \over n} \sum_{i=1}^n (x_i - \mu)^2} $$
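
As a quick sanity check of these definitions, the sketch below applies them to a small list of toy values (not taken from Iris). The cross-check with `statistics.pstdev` is an addition for illustration only. It is the population standard deviation, which divides by $n$ as in the formula above.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]  # toy values, not taken from Iris

mu = sum(x) / len(x)
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
print(mu, sigma)  # 5.0 2.0

# Cross-check against the standard library (population standard deviation)
print(statistics.pstdev(x))
```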

In [2]:
import math

def mean(values):
    # Arithmetic mean: sum of the values divided by how many there are
    return sum(values) / len(values)

def std_dev(values, mu):
    # Population standard deviation around the given mean
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

# One list per measurement column
sepal_length = []
sepal_width = []
petal_length = []
petal_width = []

for row in rows:
    if len(row) == 0:  # skip blank lines before indexing into the row
        continue
    sepal_length.append(float(row[0]))
    sepal_width.append(float(row[1]))
    petal_length.append(float(row[2]))
    petal_width.append(float(row[3]))

columns = [sepal_length, sepal_width, petal_length, petal_width]

# Same column order as above: sepal length, sepal width, petal length, petal width
mu = [mean(col) for col in columns]
sigma = [std_dev(col, m) for col, m in zip(columns, mu)]

print(mu)
print(sigma)




3. Compute and print the mean and the standard deviation for each of the 4 measurement columns, separately for each of the three Iris species (`Iris-versicolor`, `Iris-virginica` and `Iris-setosa`). *Remember* that the label is stored in the $5^{th}$ (last) cell of the row.

In [5]:
import math

def mean(values):
    # Arithmetic mean: sum of the values divided by how many there are
    return sum(values) / len(values)

def std_dev(values, mu):
    # Population standard deviation around the given mean
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))

species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# For each measurement, one list per species (same order as `species`)
sepal_length = [[], [], []]
sepal_width = [[], [], []]
petal_length = [[], [], []]
petal_width = [[], [], []]

for row in rows:
    if len(row) == 0:  # skip blank lines before indexing into the row
        continue
    i = species.index(row[4])

    sepal_length[i].append(float(row[0]))
    sepal_width[i].append(float(row[1]))
    petal_length[i].append(float(row[2]))
    petal_width[i].append(float(row[3]))

mu = [[], [], []]
for i in range(3):
    mu[i].append(mean(sepal_length[i]))
    mu[i].append(mean(sepal_width[i]))
    mu[i].append(mean(petal_length[i]))
    mu[i].append(mean(petal_width[i]))

sigma = [[], [], []]
for i in range(3):
    sigma[i].append(std_dev(sepal_length[i], mu[i][0]))
    sigma[i].append(std_dev(sepal_width[i], mu[i][1]))
    sigma[i].append(std_dev(petal_length[i], mu[i][2]))
    sigma[i].append(std_dev(petal_width[i], mu[i][3]))

print("Sepal length | Sepal width | Petal length | Petal width\n")
for i in range(3):
    print(f"{species[i]}\n")
    # Values are rounded to 3 decimals for readability
    print(f"\tMean:\n"
          f"\t{[round(m, 3) for m in mu[i]]}\n"
          f"\tStandard deviation:\n"
          f"\t{[round(s, 3) for s in sigma[i]]}\n")


Sepal length | Sepal width | Petal length | Petal width

Iris-setosa

	Mean:
	[5.006, 3.418, 1.464, 0.244]
	Standard deviation:
	[0.349, 0.377, 0.172, 0.106]

Iris-versicolor

	Mean:
	[5.936, 2.77, 4.26, 1.326]
	Standard deviation:
	[0.511, 0.311, 0.465, 0.196]

Iris-virginica

	Mean:
	[6.588, 2.974, 5.552, 2.026]
	Standard deviation:
	[0.629, 0.319, 0.546, 0.272]



4. Based on the results of exercises 2 and 3, which of the 4 measurements would you consider the most characteristic of the three species? (In other words, which measurement would you consider “best”, if you were to guess the Iris species based only on those four values?)

 *Insert the best measurement*



5. Based on the considerations of Exercise 3, assign the flowers with the following measurements to what you consider would be the most likely species.


`5.2, 3.1, 4.0, 1.2`: Versicolor

`4.9, 2.5, 5.6, 2.0`: Virginica

`5.4, 3.2, 1.9, 0.4`: Setosa


6. (*) Create a rule-based classifier similar to the one seen in class. This classifier, again, will receive some rules and will classify each sample into one of the three species. Based on your analysis in Exercise 4, where you identified the most discriminative feature, provide the classifier with 3 rules, one for each Iris species, based on this feature.

7. (*) Compute the predictions for all the elements in the dataset and store them in a list. Then, compute the accuracy of the classifier that you created. You will see it in more detail later, but the accuracy metric can be computed as:

$$ Acc = {\text{number of correct predictions} \over \text{total number of predictions}} $$

One can compute the number of correct predictions by checking how many times the predicted class is equal to the label of the sample ($5^{th}$ column).
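
One possible shape for such a classifier and for the accuracy computation is sketched below. It assumes, purely for illustration, that petal length is the chosen feature. The thresholds, the function names `classify` and `accuracy`, and the four hand-made samples are all hypothetical and are not taken from the lab or from the real dataset.

```python
def classify(petal_length, rules):
    # `rules` is a list of (upper_bound, species) pairs checked in order;
    # the first rule whose bound is above the value decides the class
    for upper_bound, species in rules:
        if petal_length < upper_bound:
            return species
    return None  # no rule matched

def accuracy(predictions, labels):
    # Fraction of predictions that are equal to the true label
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical thresholds, one rule per species
rules = [
    (2.5, "Iris-setosa"),
    (4.9, "Iris-versicolor"),
    (float("inf"), "Iris-virginica"),
]

# Tiny hand-made sample of (petal length, label) pairs, not the real dataset
samples = [(1.4, "Iris-setosa"), (4.0, "Iris-versicolor"),
           (5.6, "Iris-virginica"), (5.0, "Iris-versicolor")]

predictions = [classify(length, rules) for length, _ in samples]
labels = [label for _, label in samples]
print(predictions)
print(accuracy(predictions, labels))  # 0.75
```

The last sample is deliberately misclassified by these thresholds, which is why the accuracy is 3 out of 4.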

### MNIST Analysis

1. Load the previously downloaded MNIST dataset. You can make use of the csv module already presented.

2. Create a function that, given a position $1 \le k \le 10{,}000$, prints the $k^{th}$ sample of the dataset (i.e. the $k^{th}$ row of the CSV file) as a grid of $28 \times 28$ characters. More specifically, you should map each range of pixel values to the following characters:
    - [0; 64) &rarr; " "
    - [64; 128) &rarr; "."
    - [128; 192) &rarr; "*"
    - [192; 256) &rarr; "#"
So, for example, you should map the sequence `0, 72, 192, 138, 250` to the string `" .#*#"` (note the leading space, produced by the value 0).
*Note*: Remember to start a new line after every 28 characters.

Example of output of the $130^{th}$ sample: 
```
         .#      **
        .##..*#####
       #########*.
      #####***.
     ##*
    *##
    ##
   .##
    ###*
    .#####.
        *###*
           *###*
              ###
              .##
              ###
            .###
      .    *###.
     .#  .*###*
     .######.
      *##*.
```
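
The character mapping of this exercise can be sketched as follows. The function names `pixel_to_char` and `render` are illustrative choices, and the input is the short example sequence from the text rather than a real MNIST row:

```python
def pixel_to_char(value):
    # Ranges taken from the list in the exercise text
    if value < 64:
        return " "
    if value < 128:
        return "."
    if value < 192:
        return "*"
    return "#"

def render(pixels, width=28):
    # Map every pixel to a character and break the line every `width` pixels
    chars = [pixel_to_char(p) for p in pixels]
    lines = ["".join(chars[i:i + width]) for i in range(0, len(chars), width)]
    return "\n".join(lines)

print(repr(render([0, 72, 192, 138, 250])))  # ' .#*#'
```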


3. Compute the Euclidean distance between each pair of the 784-dimensional vectors of the digits at
the following positions: $26^{th}$, $30^{th}$, $32^{nd}$, $35^{th}$.

*Note*: Remember that Python lists are indexed from 0, so the $k^{th}$ value will be at position $k-1$.
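
A minimal sketch of the pairwise computation, shown on toy two-dimensional vectors rather than on real digits. The dictionary `vectors` and its values are made up for illustration; on the lab data each entry would instead hold the 784 pixel values of one digit.

```python
import math
from itertools import combinations

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy stand-ins keyed by position; not real MNIST vectors
vectors = {26: [0, 0], 30: [3, 4], 32: [6, 8]}

# combinations() yields each unordered pair exactly once
for i, j in combinations(vectors, 2):
    print(i, j, euclidean_distance(vectors[i], vectors[j]))
```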

4. Based on the distances computed in the previous step, and knowing that the digits listed in Exercise 3 are (not necessarily in this order) $0, 1, 1, 7$, can you assign the correct label to each of the digits of Exercise 3?

The $0$ is the sample at index $25$ because it has the largest distance from all the other digits.

The $1$s are the samples at indices $29$ and $31$ because they have the shortest distance between them.

The $7$ is the sample at index $34$ because it is more similar to the samples representing the $1$s than to the sample representing the $0$.

5. There are 1,135 images representing 1’s and 980 images representing 0’s in the dataset. For all 0’s and 1’s separately, count the number of times each of the 784 pixels is black (use 128 as the threshold value). You can do this by building a list `Z` and a list `O`, each with 784 elements, containing the counts for the 0’s and the 1’s respectively: `Z[i]` and `O[i]` hold the number of times the $i^{th}$ pixel was black for the corresponding class. For each index $i$, compute `abs(Z[i] - O[i])`. The $i$ with the highest value identifies the pixel that best separates the digits “0” and “1” (i.e. the pixel that is most often black for one class and white for the other). Where is this pixel located within the grid? Why?
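
The counting procedure can be sketched on toy data as follows. The four-pixel "images" are made up, the helper name `count_black` is hypothetical, and treating "black" as a value below the threshold is an assumption based on the convention stated earlier (0 is black, 255 is white). Flip the comparison if your interpretation differs.

```python
def count_black(images, threshold=128):
    # counts[i] = number of images whose i-th pixel is below the threshold.
    # Assumption: "black" means value < threshold (0 = black, 255 = white).
    counts = [0] * len(images[0])
    for pixels in images:
        for i, value in enumerate(pixels):
            if value < threshold:
                counts[i] += 1
    return counts

# Toy 4-pixel "images", not real MNIST rows
zeros = [[0, 255, 0, 255], [0, 200, 0, 255]]
ones = [[255, 255, 0, 0], [255, 0, 0, 0]]

Z = count_black(zeros)
O = count_black(ones)
diff = [abs(z - o) for z, o in zip(Z, O)]
best = diff.index(max(diff))  # first index with the largest difference
print(Z, O, diff, best)
```

On the real data, with 28 columns per row, pixel `i` sits at row `i // 28` and column `i % 28` of the grid.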

6. (*) Consider only the MNIST rows representing the digits 0 and 1. Create a rule-based classifier that takes as input a rule based on the pixel that you discovered in Exercise 5. As before, compute the predictions of this classifier on all the samples in the dataset.