# Reading the Iris Dataset

The Iris dataset is one of the best-known datasets in the supervised classification field.

This dataset contains data for 3 types of Iris flowers. For each instance, the data contains 4 features: the sepal length, sepal width, petal length and petal width, all measured in *cm* \[2\].

We said that this dataset is well known in *supervised* classification. It is supervised because each row in the data comes with the actual Iris flower type.

## Notebook Technologies

To read the `iris.data` file we are going to make use of a single library named `pandas` \[1\].

`pandas` is an open-source Python package that provides fast, flexible and expressive data structures specially designed to handle tabular data such as CSV files, SQL tables, etc.



## Loading the data

In [10]:
from pathlib import Path

import pandas as pd

In [11]:
DATA_PATH = Path('../../data')

If we take a quick look at the `iris.data` file:

```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
```

we can quickly see that it is a standard CSV file. For this reason, we can load it using the `read_csv` function provided by `pandas`.

It is also important to note that the file has no column names, so we don't know what each column encodes. However, thanks to \[2\], we know the column names, and for now we can *hard code* them when loading the file with the `names` argument.

In [12]:
columns = ['sepal_length', 
           'sepal_width', 
           'petal_length', 
           'petal_width', 
           'class']

df = pd.read_csv(DATA_PATH / 'raw' / 'iris.data', 
                 header=None, names=columns)
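To see what `header=None` and `names` do without needing the file on disk, here is a minimal sketch that parses two of the rows previewed above from an in-memory string. The `raw_text`, `demo_df` and `wrong_df` names, and the use of `io.StringIO`, are assumptions made for illustration only; they are not part of the notebook.

```python
from io import StringIO

import pandas as pd

# Two rows copied from the iris.data preview (illustrative stand-in for the file).
raw_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None tells pandas that the first line is data, not column names;
# names supplies the labels to use instead.
demo_df = pd.read_csv(StringIO(raw_text), header=None, names=columns)

# With the defaults, the first data row is consumed as the header instead.
wrong_df = pd.read_csv(StringIO(raw_text))

print(list(demo_df.columns))
print(len(demo_df), len(wrong_df))
```

With the default settings one observation is silently lost to the header, which is why the column names are passed explicitly here.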

A `pandas.DataFrame` provides a handy method called `head`, which lets us preview the first $n$ rows of our tabular data (5 by default).

In [13]:
df.head()

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
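As a small sketch of how the $n$ argument behaves, the snippet below calls `head` with and without it on a toy frame. The `toy` frame and its single `x` column are hypothetical stand-ins, not the Iris data; `tail` is shown only as the counterpart for the last rows.

```python
import pandas as pd

# Small stand-in frame with 10 rows (hypothetical values, not the Iris data).
toy = pd.DataFrame({'x': range(10)})

first_default = toy.head()   # no argument: returns the first 5 rows
first_three = toy.head(3)    # explicit n: returns the first 3 rows
last_two = toy.tail(2)       # counterpart of head for the last rows

print(len(first_default), len(first_three), len(last_two))
```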


Similarly, the `sample` method returns $n$ rows but, unlike `head`, it picks them at random.

In [14]:
df.sample(5)

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 79 | 5.7 | 2.6 | 3.5 | 1.0 | Iris-versicolor |
| 88 | 5.6 | 3.0 | 4.1 | 1.3 | Iris-versicolor |
| 46 | 5.1 | 3.8 | 1.6 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 | Iris-versicolor |
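Because `sample` is random, rerunning the cell will usually show different rows. If a repeatable preview is wanted, one option is the `random_state` argument, sketched below on a hypothetical toy frame (the `toy` name and the seed value `0` are arbitrary choices for illustration).

```python
import pandas as pd

# Stand-in frame with 100 rows (hypothetical values, not the Iris data).
toy = pd.DataFrame({'x': range(100)})

# Passing the same random_state yields the same rows on every call.
first_draw = toy.sample(5, random_state=0)
second_draw = toy.sample(5, random_state=0)

# By default rows are drawn without replacement, so no index repeats.
print(first_draw.equals(second_draw))
print(first_draw.index.is_unique)
```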


## Exporting the *cleaned* data

For this simple dataset we did not make any *big* changes: we only added the column names. In our opinion, that is enough to justify storing a new version of the dataset. This way, the column names will not need to be specified the next time we load the data.

In [16]:
df.to_csv(DATA_PATH / 'processed' / 'iris.csv', index=False)
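The following sketch checks the claim that a file written this way can be reloaded without passing `names`. It uses a temporary directory and a two-row stand-in frame instead of the real `processed/iris.csv`; the `original`, `reloaded` and `out_path` names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# Two-row stand-in for the loaded Iris frame (rows taken from the preview above).
original = pd.DataFrame(
    [[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
     [4.9, 3.0, 1.4, 0.2, 'Iris-setosa']],
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'iris.csv'
    # index=False keeps the row index out of the file, so no extra column appears on reload.
    original.to_csv(out_path, index=False)
    # The header row written by to_csv is picked up automatically.
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Leaving out `index=False` would write the row index as an additional unnamed column, which would then show up as `Unnamed: 0` when the file is read back.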

## References

\[1\] [Pandas Documentation](https://pandas.pydata.org/docs/)

\[2\] [UCI Machine Learning Repository](https://archive.ics.uci.edu/)