# Data files

During your project, you'll be working with data from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Some of this data is already available in scikit-learn's `datasets` package:

In [1]:
from sklearn import datasets

iris = datasets.load_iris()

The returned object is a dictionary-like `Bunch` holding the data along with information about the dataset:

In [2]:
print(list(iris.keys()))

['data', 'target', 'target_names', 'DESCR', 'feature_names']


The data is contained in `iris['data']`:

In [3]:
print(iris['data'][:5, :])

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


The labels are in `iris['target']`:

In [4]:
print(iris['target'])

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
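The integer labels can be turned back into species names with the `target_names` entry listed among the keys above. The following is a minimal sketch, assuming scikit-learn is installed; the variable name `species` is only illustrative.

```python
from sklearn import datasets

iris = datasets.load_iris()

# target holds integer class ids; target_names holds the matching species names.
# Indexing the names array with the id array maps every id back to its name.
species = iris['target_names'][iris['target']]

print(iris['target_names'])
print(species[:3], species[-3:])
```

This works because indexing a NumPy array with an array of integers returns one element per index.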


If we're using a dataset from a file, however, we'll need to load it into Python before we can use it. For this, we'll use NumPy's [input and output](https://docs.scipy.org/doc/numpy/reference/routines.io.html) functions.

We can get the original [iris data](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data) from the UCI site. It looks like this:

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5.0,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    5.4,3.7,1.5,0.2,Iris-setosa
    4.8,3.4,1.6,0.2,Iris-setosa

Notice that there's a comma (`,`) between every value. That's the *delimiter*. Some data files use a space or a tab instead, but a comma is the most common choice. The other thing to notice is that, while the data in scikit-learn already has the labels as numbers, the original file stores them as names. We'll need to convert them.
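To see what the delimiter changes in practice, here is a small sketch that reads the same two rows once with commas and once with spaces. The in-memory strings stand in for real files and are only for illustration.

```python
import io

import numpy as np

# Two tiny in-memory "files" holding the same rows (values copied from the
# iris sample above); only the delimiter differs.
comma_text = "5.1,3.5,1.4,0.2\n4.9,3.0,1.4,0.2\n"
space_text = "5.1 3.5 1.4 0.2\n4.9 3.0 1.4 0.2\n"

comma_rows = np.genfromtxt(io.StringIO(comma_text), delimiter=',')
# When no delimiter is given, genfromtxt splits on any run of whitespace.
space_rows = np.genfromtxt(io.StringIO(space_text))

print(comma_rows)
print(np.array_equal(comma_rows, space_rows))
```

Both calls should produce the same 2x4 array, so the only thing that has to match the file is the `delimiter` argument.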

First, let's load the data. We'll read only the data columns (columns 0-3) by passing the `usecols` argument.

In [13]:
import numpy as np

iris_data = np.genfromtxt("exercises/iris.data", delimiter=',', usecols=range(4))
print(iris_data[:5, :])

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]


To get the labels, we'll read only the last column, again using the `usecols` argument. This time we'll also tell NumPy to read the values as strings by passing `dtype=str`.

In [6]:
iris_labels = np.genfromtxt("exercises/iris.data", delimiter=',', usecols=4, dtype=str)
print(iris_labels)

['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 ...]

Almost there! The labels are still strings, though. To fix this, we'll first make a new array of zeros for our final target:

In [7]:
iris_target = np.zeros(iris_labels.size, dtype=int)
print(iris_target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]


Now we'll fill it with the right value for each label:

In [8]:
for i in range(len(iris_labels)):
    if iris_labels[i] == "Iris-setosa":
        iris_target[i] = 0
    elif iris_labels[i] == "Iris-versicolor":
        iris_target[i] = 1
    else:
        iris_target[i] = 2

print(iris_target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


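Here's another way of doing that using a boolean mask, one form of *fancy indexing*. Because `np.unique` returns the label names in sorted order, each label gets the same number as before: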

In [9]:
iris_targets = np.zeros(iris_labels.size, dtype=int)
labels = np.unique(iris_labels)  # sorted unique label names
for index, label in enumerate(labels):
    # the boolean mask selects every position holding this label
    iris_targets[iris_labels == label] = index

print(iris_targets)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
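The loop can also be replaced by a single call, since `np.unique` can return the index of each entry alongside the unique values. This is a sketch on a short stand-in array rather than the full `iris_labels` loaded above.

```python
import numpy as np

# A short stand-in for the iris_labels array loaded above.
iris_labels = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica",
                        "Iris-setosa", "Iris-virginica"])

# labels: the sorted unique names.
# iris_targets: for each entry, the index of its name in labels.
labels, iris_targets = np.unique(iris_labels, return_inverse=True)

print(labels)
print(iris_targets)
```

Keeping `labels` around is useful later, because `labels[iris_targets]` recovers the original names.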


Once you have your data, remember to split it into test and training sets:

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_data, iris_targets, test_size=0.25, random_state=1234)
print(X_train.shape, X_test.shape)

(112, 4) (38, 4)
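With a small dataset, a random split can leave the classes unevenly represented. One option, sketched here on scikit-learn's built-in copy of the data, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both sets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'],
    test_size=0.25, random_state=1234,
    stratify=iris['target'],  # keep class proportions similar in both splits
)

# Count how many samples of each class ended up in each split.
print(np.bincount(y_train), np.bincount(y_test))
```

Whether stratifying matters depends on the dataset; it tends to help most when some classes have few samples.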


## Mixed data

If your data mixes numerical and categorical columns, you might find it easiest to read it with Python's built-in [CSV reader](https://docs.python.org/3.6/library/csv.html):

In [16]:
import csv

data = []
targets = []
with open("exercises/iris.data", newline='') as csvfile:
    irisreader = csv.reader(csvfile, delimiter=',')
    for row in irisreader:
        if len(row) == 5:
            data.append([float(i) for i in row[:4]])
            if row[4] == "Iris-setosa":
                targets.append(0)
            elif row[4] == "Iris-versicolor":
                targets.append(1)
            else:
                targets.append(2)

data = np.array(data)
print(data[:5, :])
print(targets)

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
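When there are more than a few categories, the `if`/`elif` chain gets long. A dictionary lookup is one alternative. The sketch below reads a few rows in the same format from an in-memory string; the names `sample` and `label_ids` are illustrative and not part of the exercise files.

```python
import csv
import io

import numpy as np

# A few rows in the same format as iris.data, held in memory for illustration.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "\n"  # a blank line, such as a file might have at its end
)

# Lookup table from label name to integer id.
label_ids = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

data, targets = [], []
for row in csv.reader(sample, delimiter=','):
    if len(row) != 5:  # skip blank or malformed rows
        continue
    data.append([float(value) for value in row[:4]])
    targets.append(label_ids[row[4]])  # raises KeyError on an unknown label

data = np.array(data)
targets = np.array(targets)
print(data.shape, targets)
```

Unlike the `else` branch above, the dictionary raises an error on an unexpected label instead of silently assigning it to the last class.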


## Normalization

Some algorithms benefit from normalized data. Normalization is the process of rescaling the data into a specific range. For example, if we wanted every column of our data to lie between 0 and 1, we could do the following:

In [11]:
maxs = np.max(iris_data, axis=0)
mins = np.min(iris_data, axis=0)
print(maxs, mins)

[7.9 4.4 6.9 2.5] [4.3 2.  1.  0.1]


In [12]:
normalized = (iris_data - mins) / (maxs - mins)
print(normalized[:5, :])

[[0.22222222 0.625      0.06779661 0.04166667]
 [0.16666667 0.41666667 0.06779661 0.04166667]
 [0.11111111 0.5        0.05084746 0.04166667]
 [0.08333333 0.45833333 0.08474576 0.04166667]
 [0.19444444 0.66666667 0.06779661 0.04166667]]


In [13]:
print("Max: ", np.max(normalized), "Min: ", np.min(normalized))

Max:  1.0 Min:  0.0


Here we've normalized each column using the maximum and minimum of that column. scikit-learn provides many different ways of [preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization) your data, which may improve your algorithms' performance.
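As one example of those preprocessing tools, `MinMaxScaler` applies the same column-wise min/max rescaling. The sketch below uses scikit-learn's built-in copy of the iris data and checks the result against the manual formula.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

iris_data = datasets.load_iris()['data']

# MinMaxScaler rescales each column to the range [0, 1] by default.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(iris_data)

# The same calculation done by hand, column by column.
mins = iris_data.min(axis=0)
maxs = iris_data.max(axis=0)
manual = (iris_data - mins) / (maxs - mins)

print(np.allclose(scaled, manual))
```

With a train/test split, a common practice is to call `fit` on the training set only and then `transform` both sets, so the test data does not influence the scaling.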