<table>
<tr>
  <td><img src="figures/iris_setosa.jpg"></td>
  <td><img src="figures/iris_versicolor.jpg"></td>
  <td><img src="figures/iris_virginica.jpg"></td>
</tr>

<tr>
  <td>Iris Setosa</td><td>Iris Versicolor</td><td>Iris Virginica</td>
</tr>
</table>

From <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Wikipedia</a>:

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

<img src="figures/petal_sepal.jpg" alt="Sepal" style="width: 25%; float: left; padding: 1em;"/>

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

<br/>
"Petal-sepal". Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Petal-sepal.jpg#/media/File:Petal-sepal.jpg

In [None]:
%matplotlib inline

import pandas as pd
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

# Exercise

In this exercise, we train a classifier to tell us the type of a flower, based on several of its measured properties.

### Load the data

1. Using the ``read_csv`` command in Pandas, load the dataset in the file ``data/iris.csv``.
2. Use the ``head`` method of the resulting DataFrame to see the first few entries.
3. Use the ``tail`` method to see the last few entries.
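The three steps above can be sketched on a small stand-in for the file. This is only an illustration: the inline CSV text and its column names (other than ``Name``, which is used later in this notebook) are assumptions, and the real ``data/iris.csv`` may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/iris.csv (column names are assumed).
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

first_rows = toy.head(2)  # first 2 rows (head() alone returns 5)
last_rows = toy.tail(2)   # last 2 rows (tail() alone returns 5)
print(first_rows)
print(last_rows)
```

For the exercise itself, the same calls apply with the file path passed to ``read_csv`` instead of the in-memory buffer.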

In [None]:
iris = ...
iris.head()

In [None]:
iris.tail()

### Examine the data

Let's use ``groupby`` to group measurements by flower name and then display summary statistics.  Look at the numbers, and see if you can identify any big differences between species:

In [None]:
iris.groupby('Name').describe()

### Visualize the data

- Use Seaborn's ``pairplot`` function to display the different features of the data.
- Remember to set the keyword `hue` to the correct column name so that the classes are colored separately.

You can see that this dataset should be very easy to classify.

Now, use the ``pop`` method of the DataFrame to grab the labels from the Name column:

In [None]:
data = iris.copy()
labels = ...

In [None]:
labels[:15]
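To show what ``pop`` does, here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset). ``pop`` removes the column from the DataFrame in place and returns it as a Series, which is why the notebook works on a copy.

```python
import pandas as pd

toy = pd.DataFrame({
    "PetalLength": [1.4, 4.7, 6.0],
    "Name": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

features = toy.copy()          # work on a copy so the original stays intact
names = features.pop("Name")   # removes "Name" from features and returns it

print(list(features.columns))  # the copy no longer has "Name"
print(list(toy.columns))       # the original still has both columns
print(list(names))
```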

### Transform labels to numbers

Let's transform those text labels to numbers (0, 1, or 2).  Strictly speaking, this isn't necessary, since NumPy arrays can contain objects, and comparisons work fine on those.  But in general, working with numbers speeds things up a lot.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
labels = ...
labels
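The encoding step can be sketched on a short made-up list of labels. ``LabelEncoder`` sorts the distinct labels and assigns each an integer code in that order; the label strings here are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["setosa", "virginica", "versicolor", "setosa"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

print(encoded)               # integer code per label
print(toy_encoder.classes_)  # position in this array is the integer code

# The mapping is reversible.
decoded = toy_encoder.inverse_transform(encoded)
```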

### Split data into training & testing data

Using ``sklearn.model_selection.train_test_split``, split the dataset and labels into two: training data and testing data.

**Since this is such an easy problem, we only want 10% of the samples (``train_size=0.1``) to go towards training.**

Note, it's important to pass both data and labels into the ``train_test_split`` function together so that they can be split the same way (i.e., do not run ``train_test_split`` once for each).

**Hint:** *If you set the ``random_state`` parameter, you will always get the same split, and your experiments will be comparable.*

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

In [None]:
data_train, data_test, labels_train, labels_test = ...

Check the lengths of the training and testing data and labels.

In [None]:
len(data_train), len(data_test)
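A minimal sketch of such a split, using the copy of the Iris data bundled with scikit-learn rather than the CSV file (an assumption made so the example is self-contained). It passes features and labels in one call and checks that a fixed ``random_state`` reproduces the same split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# One call splits features and labels consistently.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=0
)
print(len(X_train), len(X_test))

# The same random_state gives the same split again.
X_train2, _, y_train2, _ = train_test_split(
    X, y, train_size=0.1, random_state=0
)
same_split = np.array_equal(X_train, X_train2) and np.array_equal(y_train, y_train2)
print(same_split)
```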

### Classify and evaluate

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### The Gaussian Naive Bayes classifier is a good entry-level benchmark

In [None]:
nb = GaussianNB()
nb.fit(...)
labels_predicted_nb = ...
accuracy_score(labels_test, labels_predicted_nb)
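The fit / predict / score pattern looks roughly like this on synthetic data. The blobs generated below are an assumption for illustration and are not the Iris exercise; the variable names are hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Three synthetic clusters in four dimensions.
X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)        # learn per-class means and variances
predicted = model.predict(X_test)  # one predicted label per test sample

# Fraction of test samples whose predicted label matches the true one.
acc = accuracy_score(y_test, predicted)
print(acc)
```

Any scikit-learn classifier can be dropped in for ``GaussianNB`` here, since they share the same ``fit`` and ``predict`` methods.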

### Now, try the evaluation, but with a RandomForest classifier

In [None]:
rf = RandomForestClassifier(random_state=42)
...

And let's look at the feature importances, just for the fun of it:

In [None]:
# pair each feature column (``data`` no longer contains 'Name') with its importance
list(zip(data.columns, rf.feature_importances_))
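As a sketch of what the importances look like, here is a random forest fitted on synthetic data. The dataset and the feature names ``f0`` to ``f3`` are hypothetical. There is one importance per feature, and together they sum to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic problem: 4 features, of which 2 carry class information.
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X, y)

# Pair each feature name with its importance score.
pairs = list(zip(feature_names, forest.feature_importances_))
for name, importance in pairs:
    print(name, round(float(importance), 3))

total = float(np.sum(forest.feature_importances_))
```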

# Extra/advanced exercise

In this example, we're going to work with a digits dataset.  Each digit is represented as an 8x8 matrix, with integer values from 0 to 16 representing pixel intensity (0 is empty background and 16 is the strongest stroke; the ``binary`` colormap used below draws 0 as white and 16 as black).  To form a feature vector, the matrix is "unravelled", i.e. all the values are unpacked from the matrix, row by row, into a single, long vector.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
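The "unravelling" described above can be checked directly: flattening one 8x8 image row by row should reproduce the corresponding row of ``digits.data``. A small sketch (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

image = digits.images[0]  # one digit as an 8x8 matrix
vector = image.ravel()    # unpack row by row into a length-64 vector

print(image.shape, vector.shape)

# The flattened image matches the precomputed feature vector.
matches = np.array_equal(vector, digits.data[0])
print(matches)
```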

In [None]:
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

### Visualization

A good first step for many problems is to visualize the data using one of the
*Dimensionality Reduction* techniques we saw earlier.  We'll start with the
most straightforward one, Principal Component Analysis (PCA).

PCA seeks orthogonal linear combinations of the features which show the greatest
variance, and as such, can help give you a good idea of the structure of the
data set.  Here we'll use `PCA` with the randomized SVD solver, because it's faster for large `N`.

In [None]:
# RandomizedPCA was removed in scikit-learn 0.20; PCA with svd_solver='randomized' replaces it
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver='randomized', random_state=1999)
proj = pca.fit_transform(digits.data)

In [None]:
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap='RdBu')
plt.colorbar();

Here we see that the digits do cluster fairly well, so we can expect even
a fairly naive classification scheme to do a decent job separating them.
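To make the description of PCA as "orthogonal linear combinations with the greatest variance" concrete, the sketch below fits ``sklearn.decomposition.PCA`` with the randomized solver on the digits data. It checks that the two component directions are orthonormal and that the explained variance is ordered from largest to smallest. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # shape (n_samples, 64)

model = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(X)

# Each row of components_ is one direction in the 64-dimensional pixel space.
# Orthonormal rows give an identity Gram matrix.
gram = model.components_ @ model.components_.T
print(np.round(gram, 6))

# Fraction of the total variance captured by each component, largest first.
ratios = model.explained_variance_ratio_
print(ratios)
```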

A weakness of PCA is that it produces a linear dimensionality reduction:
this may miss some interesting relationships in the data.  If we want to
see a nonlinear mapping of the data, we can use one of the several
methods in the `manifold` module.  Here we'll use Isomap (ISOmetric MAPping)
which is a manifold learning method based on graph theory:

In [None]:
from sklearn.manifold import Isomap
iso = Isomap(n_neighbors=5, n_components=2)
proj = iso.fit_transform(digits.data)

In [None]:
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap='RdBu')
plt.colorbar();

These visualizations show that classification should be possible!

### Now, train classifiers as we did with the Iris dataset, and see how they do!