Now for the fun part, machine learning!

The `seaborn` package is a wrapper around `matplotlib`, but in this case I'm just using its built-in data sets.

In [1]:
import seaborn as sns
import numpy as np

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

**Step 1: Acquire data**

In [2]:
iris = sns.load_dataset('iris')

The iris data set is the 'hello world' of machine learning.  It's a very simple data set with data about 150 iris flowers.  There are four measurements and an associated species.  There are 3 species, with 50 instances of each in the data set.  Our task is to create a 'classifier' which, when given values for the measurements, will (hopefully) correctly predict the species to which that flower belongs.

In [3]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa
8,4.4,2.9,1.4,0.2,setosa
9,4.9,3.1,1.5,0.1,setosa
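
The figures quoted above (150 flowers, four measurements, three species with 50 instances each) can be checked directly. The sketch below assumes scikit-learn's bundled copy of the data set, `sklearn.datasets.load_iris`, which is not what this notebook loads; it is used here only so the check runs without a download, and its column names differ from the `seaborn` version.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris copy stands in for sns.load_dataset('iris')
bunch = load_iris()

n_rows, n_features = bunch.data.shape         # flowers x measurements
species_names = list(bunch.target_names)      # names of the species
per_species = Counter(bunch.target.tolist())  # instances per species code

print(n_rows, n_features)
print(species_names)
print(per_species)
```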


**Step 2: Prepare Data**

Scikit-learn (`sklearn`) works most naturally with numeric values, so I'm mapping a numeric value to each species.

*Machine-learning aficionados: Yes, I am using ordinal values here.  However, it's a small example and it doesn't really make a difference in the results.  Also, I get to save time by avoiding the explanation of one-hot encoding.*

In [4]:
species_codes = {
    'setosa': 0,
    'versicolor': 1,
    'virginica': 2
}

iris['target'] = [species_codes[s] for s in iris.species]

iris.iloc[[0,1,50,51,100,101]]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,target
0,5.1,3.5,1.4,0.2,setosa,0
1,4.9,3.0,1.4,0.2,setosa,0
50,7.0,3.2,4.7,1.4,versicolor,1
51,6.4,3.2,4.5,1.5,versicolor,1
100,6.3,3.3,6.0,2.5,virginica,2
101,5.8,2.7,5.1,1.9,virginica,2
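
For comparison, scikit-learn ships a `LabelEncoder` that builds the same kind of mapping automatically. This is a minimal sketch rather than what the notebook does: the `species` series is a small hand-made example (hypothetical, not the notebook's data). `LabelEncoder` assigns codes in sorted label order, which happens to match the hand-written `species_codes` dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels, for illustration only
species = pd.Series(['setosa', 'versicolor', 'virginica', 'setosa'])

encoder = LabelEncoder()
codes = encoder.fit_transform(species)       # text labels -> integer codes
decoded = encoder.inverse_transform(codes)   # integer codes -> text labels

print(list(encoder.classes_))
print(list(codes))
print(list(decoded))
```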


**Step 3: Split data**

I'm going to split the data set into two parts: a training data set and a test data set.  The training data set will be used to generate a *model*, which represents the 'knowledge' that has been derived from analyzing the data.  The test set will then be passed into the model, which will predict a species for each instance.  The predictions of the model will be compared to the known values in the test data set to evaluate the accuracy of the model.

The function I am using for this comes from scikit-learn.  It requires that the features (independent variables) be separated from the targets (dependent variables).  This is easy with `pandas`.

In [5]:
X = iris[iris.columns[:4]]
y = iris['target']

In [6]:
X.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [7]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

The `train_test_split` function accepts a keyword argument, `test_size`, that sets the size of the test data set.  In this case I'm using 50%.  With such a small data set, this leaves less data for training and intentionally introduces some errors.  In practice 20-30% would suffice, but with the Iris data set the resulting model then fits the data 'perfectly'.

The function returns four values: the training and test sets for the features and the same for the targets.

In [8]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=.5)

Notice something interesting about these new data sets:

In [9]:
train_X.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
88,5.6,3.0,4.1,1.3
13,4.3,3.0,1.1,0.1
32,5.2,4.1,1.5,0.1
119,6.0,2.2,5.0,1.5
91,6.1,3.0,4.6,1.4


In [10]:
train_y.head()

88     1
13     0
32     0
119    2
91     1
Name: target, dtype: int64

The indices are randomized.  This is an attempt to get a roughly equal distribution of the targets in both the training and test data sets.  What would happen if we simply selected the first 50% of the Iris data set?

In [11]:
from collections import Counter

c = Counter(y[np.arange(75)].values)
c.most_common()

[(0, 50), (1, 25)]

Since the rows are grouped by species, the training data set would include no virginica and the test data set would include no setosa.  This would undoubtedly result in an unacceptable model.

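Randomly rearranging the indices is better, but the classes are still not perfectly balanced.  Note that `train_test_split` also shuffles at random by default, so its split is only guaranteed to preserve the class proportions when the `stratify` argument is passed.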

In [12]:
import numpy as np

idx = np.arange(150)
np.random.shuffle(idx)

c = Counter(y[idx[:75]].values)
c.most_common()

[(2, 29), (0, 24), (1, 22)]

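For comparison, here are the class counts in the test data set produced by `train_test_split`.  The effect of an uneven split is more obvious with a larger data set.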

In [13]:
c = Counter(test_y.values)
c.most_common()

[(2, 29), (0, 28), (1, 18)]
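
If an even split matters, `train_test_split` can be asked for one explicitly through its `stratify` argument, and `random_state` makes the shuffle repeatable from run to run. The sketch below is an aside, not the split used in this notebook: it assumes scikit-learn's bundled copy of Iris (`load_iris`), and the `*_s` names and the seed value `0` are made up for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris copy stands in for the notebook's X and y
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both halves;
# random_state fixes the shuffle so the split is reproducible
train_X_s, test_X_s, train_y_s, test_y_s = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)

train_counts = Counter(train_y_s.tolist())
test_counts = Counter(test_y_s.tolist())

print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

With 50 instances per species and a 50% split, each half receives 25 instances of every species.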

**Step 5: Train model**

Here is where the heavy lifting begins.  However, it's the simplest part of the code.  This is the power of scikit-learn.

In [14]:
iris_model = GaussianNB()

In [15]:
iris_model.fit(train_X, train_y)

GaussianNB(priors=None)
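
To get a feel for what `fit` has stored, the fitted object can be inspected. Roughly speaking, a Gaussian naive Bayes model records how common each class is and the mean and variance of each feature within each class. The sketch below assumes the bundled `load_iris` copy of the data and fits on all 150 rows, so the numbers will not match the notebook's `iris_model`; `demo_model` is a name made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Assumption: fit on the full bundled Iris data, purely for inspection
X_demo, y_demo = load_iris(return_X_y=True)
demo_model = GaussianNB().fit(X_demo, y_demo)

print(demo_model.classes_)      # class labels seen during fit
print(demo_model.class_prior_)  # fraction of training rows in each class
print(demo_model.theta_)        # per-class mean of each feature (3 x 4)
```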

**Step 6: Test model**

Using the `iris_model`, pass the features from each row of the test data set to the `predict` method. (The `reshape` call just turns the row's values from a 1-D array into a 2-D array with a single row, which is the shape scikit-learn expects; don't worry about it.)  The result will be an array of predicted values, one for each row passed in.  Since only one row is passed at a time, there will only be one value.  Compare it to the known value in the test data set and keep track of the number of correct predictions.

In [16]:
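# Running count of correct predictions
predictions = 0

for idx in test_X.index:
    # Predict the species for a single row of features
    prediction = iris_model.predict(test_X.loc[idx].values.reshape(1, -1))
    # Compare the prediction with the known target for the same row
    if test_y.loc[idx] == prediction[0]:
        predictions += 1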
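
To see what the `reshape(1, -1)` call in the loop does, here is a small standalone sketch using the measurements of the first flower. The variable names are made up for illustration.

```python
import numpy as np

row = np.array([5.1, 3.5, 1.4, 0.2])  # one flower's four measurements (1-D)
as_single_row = row.reshape(1, -1)    # -1 lets NumPy infer the column count

print(row.shape)            # (4,)
print(as_single_row.shape)  # (1, 4)
```

The values are unchanged; only the shape differs, from four values to one row of four values.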

In [17]:
predictions

70
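
The row-by-row loop makes each step visible, but `predict` also accepts the whole test set in one call and returns one prediction per row. The sketch below compares the two approaches. It assumes the bundled `load_iris` data and a fixed `random_state`, so its count will not match the notebook's result above; all names ending in `_demo`, plus `correct_loop` and `correct_vectorised`, are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled Iris data and a fixed seed, for a repeatable illustration
X_demo, y_demo = load_iris(return_X_y=True)
tr_X, te_X, tr_y, te_y = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)
demo_model = GaussianNB().fit(tr_X, tr_y)

# Row by row, as in the loop above
correct_loop = 0
for features, target in zip(te_X, te_y):
    if demo_model.predict(features.reshape(1, -1))[0] == target:
        correct_loop += 1

# All rows at once: one call, one prediction per row
all_predictions = demo_model.predict(te_X)
correct_vectorised = int((all_predictions == te_y).sum())

print(correct_loop, correct_vectorised)
```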

**Step 7: Evaluate model**

The simplest measure of accuracy is the ratio of correct predictions to total predictions.  

In [18]:
accuracy = predictions / len(test_X)
accuracy

0.9333333333333333

In [19]:
'{}%'.format(round(accuracy * 100, 2))

'93.33%'
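
The same ratio is available from scikit-learn as `accuracy_score`. A minimal sketch, using two short hand-made lists (hypothetical targets and predictions, not the notebook's results) so that the arithmetic is easy to follow:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 2, 1, 0]  # hypothetical known targets
y_pred = [0, 1, 2, 1, 1, 0]  # hypothetical predictions, one of them wrong

# Ratio of correct predictions to total predictions, computed by hand
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The same measure from scikit-learn
library = accuracy_score(y_true, y_pred)

# Format as a percentage with two decimal places
as_percent = f'{library:.2%}'
print(manual, library, as_percent)
```

Five of the six predictions are correct, so both calculations give 5/6.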