## Iris Classification

### Background

Suppose we are hobby botanists who want to distinguish the species of some iris flowers we have found.

We have collected some measurements associated with each iris: the length and width of the petals and the length and width of the sepals, all measured in centimeters.

We also have the measurements of some irises that have previously been identified by an expert botanist as belonging to the species *setosa*, *versicolor*, or *virginica*.  For our purposes, we'll assume that these are the only species we'll see in the wild.

Our goal is to build a model so that we can predict the species of a new iris based on its measurements.  This is a *classification* problem, with the species of irises as the classes.  The desired output for a data point is the species of an iris.  For a particular data point, the species it belongs to is called its *label*.

### Iris Data

The data we'll use for this example is the Iris dataset, a classic dataset in machine learning and statistics.  It is included in scikit-learn's *datasets* module.

In [26]:
# Standard imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import sklearn
from IPython.display import display
import mglearn

In [27]:
# Load the iris dataset
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [28]:
# Here are the keys of the iris Bunch object
print("Keys of iris_dataset:\n", iris_dataset.keys())

Keys of iris_dataset:
 dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
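
The object returned by `load_iris` is a scikit-learn `Bunch`, which behaves like a dictionary but also exposes its keys as attributes. The following is a minimal sketch of that equivalence, assuming a standard scikit-learn installation; the variable name `same_object` is only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same array object
same_object = iris_dataset['data'] is iris_dataset.data
print("Same object:", same_object)
```

The rest of this example keeps the dictionary-style access, but either form should work.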


In [29]:
# The value of the 'DESCR' key is a short description of the dataset
print(iris_dataset['DESCR'][:310] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
...


In [30]:
# The value of the 'target_names' key is an array of strings, containing 
#   the species we want to predict.
print("Target names:", iris_dataset['target_names'])

Target names: ['setosa' 'versicolor' 'virginica']


In [31]:
# The value of 'feature_names' is a list of strings, giving the 
#    description of each feature.
print("Feature names:\n", iris_dataset['feature_names'])

Feature names:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [32]:
# The data itself is contained in the 'target' and 'data' fields.
#   'data' contains the numeric measurements.
print("Type of data:", type(iris_dataset['data']))

Type of data: <class 'numpy.ndarray'>


In [33]:
print("Shape of data:", iris_dataset['data'].shape)

Shape of data: (150, 4)


In [34]:
# Here are the feature values for the first 5 samples
print("First five rows of data:\n", iris_dataset['data'][:5])

First five rows of data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
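
Each row of the data array is one flower (a sample) and each column is one measurement (a feature), in the order given by 'feature_names'. As a small sketch of how the two line up, the first row can be paired with the feature names; the name `first_flower` is only illustrative.

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Pair each value in the first row with the name of its column
first_flower = dict(zip(iris_dataset['feature_names'],
                        iris_dataset['data'][0]))
for name, value in first_flower.items():
    print(f"{name}: {value}")
```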


In [35]:
# The target array contains the species of each of the flowers that
#   were measured, also as a NumPy array
print("Type of target:", type(iris_dataset['target']))

Type of target: <class 'numpy.ndarray'>


In [36]:
print("Shape of target:", iris_dataset['target'].shape)

Shape of target: (150,)


In [39]:
# The species are encoded as integers from 0 to 2
# The values are given by the 'target_names':
#   0 = setosa
#   1 = versicolor
#   2 = virginica
print("Target:\n", iris_dataset['target'])

Target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
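
Because the integer labels are valid indices into 'target_names', they can be translated back into species names with NumPy indexing. The sketch below also counts the samples per class; the names `species` and `class_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Use the integer labels as indices into target_names to recover species names
species = iris_dataset['target_names'][iris_dataset['target']]
print("First label:", species[0])
print("Last label:", species[-1])

# Count how many samples fall into each class
class_counts = np.bincount(iris_dataset['target'])
print("Samples per class:", class_counts)
```

Note that the rows are sorted by species, so any split of this data needs to shuffle the rows first.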


### Measuring Success: Training and Testing Data

To assess our model's performance, we show it new data (data it did not see while it was being built) for which we have labels.  This is usually done by splitting the labeled data we have collected into two parts.

One part, which we use to build our model, is called the *training data* or *training set*.

The other part, which we use to assess how well the model works, is called the *test data*, *test set*, or *hold-out set*.

scikit-learn contains a function that shuffles the dataset and splits it for you, the *train_test_split* function.  By default, this function extracts 75% of the rows as the training set, along with the labels for this data.  The remaining 25%, together with its labels, becomes the test set.

In scikit-learn, data is usually denoted by *X*, while labels are denoted by *y*.  This is inspired by the standard formula f(x)=y, where x is the input and y is the output.  We use a capital *X* because the data is a two-dimensional array (a matrix), and a lowercase *y* because the target is a one-dimensional array (a vector).

In [40]:
# Split our data and assign the outputs
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

In [41]:
# Training data shapes
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (112, 4)
y_train shape: (112,)


In [42]:
# Test data shapes
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_test shape: (38, 4)
y_test shape: (38,)
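
The sizes 112 and 38 follow from the default 25% test fraction: 25% of 150 rows is 37.5, and the test size is rounded up. The sketch below checks this arithmetic and confirms that passing `test_size=0.25` explicitly gives the same split as the default for the same `random_state`; the variable names ending in `2`, along with `n_train` and `n_test`, are only illustrative.

```python
import math
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Default split (no test_size given) versus an explicit 25% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 25% of 150 rows is 37.5, which is rounded up to 38 test rows
n_test = math.ceil(0.25 * len(X))
n_train = len(X) - n_test
print("Expected sizes:", n_train, n_test)
print("Same split:", np.array_equal(X_test, X_test2))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same rows in each set.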


### First Look at Our Data

We want to take a quick look at our data to see whether any abnormalities stand out.  One way to do this is to visualize it.

Here, we'll create a scatterplot for each pair of features (together known as *pair plots*).  This is a reasonable approach because we only have 4 features, but it would not be reasonable for a much larger number of features.
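
To see why pair plots stop being practical as the number of features grows, it helps to count the pairs. The sketch below lists the pairs for the four iris features and then evaluates the general count n(n-1)/2 for a larger, purely hypothetical dataset with 100 features; the helper `n_pairs` is illustrative and not part of any library.

```python
from itertools import combinations

feature_names = ['sepal length (cm)', 'sepal width (cm)',
                 'petal length (cm)', 'petal width (cm)']

# Every unordered pair of distinct features gets one scatterplot
pairs = list(combinations(feature_names, 2))
print("Number of pairs for 4 features:", len(pairs))


def n_pairs(n_features):
    """Number of unordered feature pairs: n * (n - 1) / 2."""
    return n_features * (n_features - 1) // 2


# Hypothetical dataset with 100 features, for comparison
print("Number of pairs for 100 features:", n_pairs(100))
```

Six plots are easy to scan by eye; several thousand are not.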