# Parse Data
The Iris dataset from [UCI's MLR](https://archive.ics.uci.edu/ml/datasets/iris) is provided in a comma-delimited format without column names.
Our task in making it machine-usable is simple: add column names and convert it to CSV format with a header.

In [1]:
import pandas as pd

In [2]:
data_path = 'data/raw/iris.data'

## Load the data, apply column names
We can load the data and provide column names all in one command.

First, print the first three lines to show what the file looks like.

In [3]:
with open(data_path) as fp:
    print("\n".join(l.strip() for _, l in zip(range(3), fp)))

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
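The same preview can be written with `itertools.islice`, which some readers find easier to follow than the `zip(range(3), fp)` idiom. This is a minimal sketch: `preview_lines` is a hypothetical helper name, and an in-memory `io.StringIO` holding the three rows shown above stands in for the real file so the snippet runs on its own.

```python
import io
from itertools import islice


def preview_lines(fp, n=3):
    """Return the first n lines of an open text file, without trailing whitespace."""
    return [line.strip() for line in islice(fp, n)]


# Stand-in for open(data_path): the three rows printed above.
sample_file = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Ask for two lines only, to show that islice stops early.
first_two = preview_lines(sample_file, n=2)
print("\n".join(first_two))
```

Like the original, this reads only as many lines as requested, so it stays cheap on large files.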


As I claimed, the data has no header. Reading the data page at UCI MLR, the columns are:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class

In [4]:
data = pd.read_csv(data_path,  # reuse the path defined above
                   header=None,  # the raw file has no header row
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

In [5]:
data.head(2)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa


You can see that we parsed the data correctly: the first line of the file appears as row 0 instead of being consumed as a header.
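Beyond eyeballing `head()`, a few assertions can confirm the parse. The sketch below is one possible check, not part of the original notebook: it rebuilds a three-row stand-in for the raw file from the lines printed earlier, and the names `COLUMNS`, `raw`, and `sample` are introduced here for illustration.

```python
import io

import pandas as pd

COLUMNS = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# Stand-in for data/raw/iris.data: the three rows printed earlier.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
sample = pd.read_csv(raw, header=None, names=COLUMNS)

# With header=None, every line of the file becomes a data row.
assert len(sample) == 3

# The four measurement columns should parse as floats.
measurement_cols = COLUMNS[:-1]
assert all(pd.api.types.is_float_dtype(sample[c]) for c in measurement_cols)

# The first line of the file should be row 0, not a header.
assert sample.loc[0, 'class'] == 'Iris-setosa'
assert sample.loc[0, 'sepal_length'] == 5.1
```

If `header=None` were omitted, pandas would treat the first line as column names and the row count would come up one short, so the length check catches that mistake.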

## Create a numerical version of the class column
Many machine learning models work only with numerical data.
Let's convert the "class" column into integer IDs.

In [6]:
classes = sorted(data['class'].unique())  # sorted so the label order does not depend on tie-breaking among equal counts
print(f'Found {len(classes)} classes: {classes}')

Found 3 classes: ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']


Look up the index of each label in the list, then store it in a new column.

In [7]:
data['class_id'] = data['class'].apply(classes.index)

In [8]:
data.drop_duplicates(['class', 'class_id'])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class,class_id
0,5.1,3.5,1.4,0.2,Iris-setosa,0
50,7.0,3.2,4.7,1.4,Iris-versicolor,1
100,6.3,3.3,6.0,2.5,Iris-virginica,2


Alright! We've now got a unique numerical value for each class.
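As an alternative to looking up each label in a Python list, pandas offers `pd.factorize`, which returns the integer codes and the unique labels in one call. The sketch below uses a hypothetical four-row frame named `demo` rather than the real data, purely to keep it self-contained.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only (not the real dataset).
demo = pd.DataFrame({
    'class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa'],
})

# factorize assigns codes in order of first appearance and returns the unique labels.
codes, labels = pd.factorize(demo['class'])
demo['class_id'] = codes

print(list(labels))               # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(demo['class_id'].tolist())  # [0, 1, 2, 0]
```

The list-lookup approach does a linear search per row, which is fine for three classes; `factorize` may be worth considering when the number of distinct labels is large.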

## Save the data
CSV is a good choice, as most programs can read it.

In [9]:
data.to_csv('data/clean/iris.csv', index=False)  # No index, as it's not meaningful here
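To confirm that the saved file has a header and no index column, it can be read back and compared with the frame that was written. This is a sketch under stated assumptions: it writes a hypothetical two-row frame named `demo` to a temporary directory instead of `data/clean/`, so it does not touch the real output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame with the same columns as the cleaned data.
demo = pd.DataFrame({
    'sepal_length': [5.1, 4.9],
    'sepal_width': [3.5, 3.0],
    'petal_length': [1.4, 1.4],
    'petal_width': [0.2, 0.2],
    'class': ['Iris-setosa', 'Iris-setosa'],
    'class_id': [0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'iris.csv'
    demo.to_csv(out_path, index=False)

    # The first line of the file should be the column names.
    header = out_path.read_text().splitlines()[0]
    reloaded = pd.read_csv(out_path)

print(header)  # sepal_length,sepal_width,petal_length,petal_width,class,class_id

# index=False means no extra unnamed column appears on reload.
assert list(reloaded.columns) == list(demo.columns)
assert reloaded.shape == demo.shape
```

Note that `to_csv` does not create missing directories, so the `data/clean` folder needs to exist before the save cell runs.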