# "HUMAN LEARNING" WITH IRIS DATA

Can you predict the species of an iris using petal and sepal measurements?

### OUTLINE:
1. Read the iris data into a pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use groupby, sorting, and plotting to look for differences between species.
4. Write down a set of rules that could be used to predict species based on measurements.
5. Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

## PART 1: READ DATA

In [None]:
# read the iris data into a pandas DataFrame, including column names
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   header=None, names=col_names)

In [None]:
iris.head()

## PART 2: GATHER BASIC INFORMATION

In [None]:
iris.shape

In [None]:
iris.describe()

In [None]:
iris.species.value_counts()

In [None]:
iris.index

In [None]:
iris.dtypes

In [None]:
iris.isnull().sum()

## PART 3: GROUPBY, SORTING, AND PLOTTING

In [None]:
# use groupby to look for differences between the species
iris.groupby('species').sepal_length.mean()

In [None]:
iris.groupby('species').mean()

In [None]:
iris.groupby('species').describe()

In [None]:
# use sorting to look for differences between the species
iris.sort_values('sepal_length').values

In [None]:
# equivalent call: pass the sort column(s) as a list via the `by` keyword
iris.sort_values(by=['sepal_length']).values

In [None]:
iris.sort_values('sepal_width').values

In [None]:
iris.sort_values('petal_length').values

In [None]:
iris.sort_values('petal_width').values
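
Scanning sorted arrays by eye is slow. A more compact way to look for thresholds is to compare the per-species minimum and maximum of each measurement: where the ranges of two species do not overlap, a single cutoff separates them. The sketch below assumes a small hand-made DataFrame named `sample` (hypothetical values, not the full dataset) so that it runs on its own; in the notebook, the same `groupby(...).agg(...)` call could be applied to `iris`.

```python
import pandas as pd

# hypothetical sample rows, used only so this sketch is self-contained
sample = pd.DataFrame({
    'petal_length': [1.4, 1.5, 4.7, 4.5, 6.0, 5.1],
    'petal_width':  [0.2, 0.3, 1.4, 1.5, 2.5, 1.9],
    'species': ['Iris-setosa', 'Iris-setosa',
                'Iris-versicolor', 'Iris-versicolor',
                'Iris-virginica', 'Iris-virginica'],
})

# per-species min and max of each measurement (columns become a MultiIndex)
ranges = sample.groupby('species').agg(['min', 'max'])
print(ranges)

# non-overlapping ranges suggest a usable cutoff between two groups
setosa_max = ranges.loc['Iris-setosa', ('petal_length', 'max')]
others_min = ranges.drop('Iris-setosa')[('petal_length', 'min')].min()
print(setosa_max < others_min)
```

If the largest setosa petal length is below the smallest petal length of the other species, any cutoff between the two values separates setosa from the rest on this sample.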

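In [None]:
# use plotting to look for differences between the species
plt.style.use('fivethirtyeight')

In [None]:
# overlaid histograms of petal width, one per species
iris.groupby('species')['petal_width'].plot(kind='hist', legend=True);

In [None]:
# separate histograms of petal width, one panel per species, with shared axes
iris.petal_width.hist(by=iris.species, sharex=True, sharey=True)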

In [None]:
iris.boxplot(column='petal_width', by='species', figsize=(20,10))

In [None]:
iris.boxplot(by='species', showmeans=True, figsize=(20,10));

In [None]:
# map species to a numeric value so that plots can be colored by category
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# color the points by species using the reversed 'summer' colormap
cm = plt.cm.summer_r
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cm, figsize=(10,5))

In [None]:
# scatter_matrix lives in pd.plotting (the top-level pd.scatter_matrix was removed)
pd.plotting.scatter_matrix(iris, c=iris.species_num, figsize=(20,10));

## PART 4: CUSTOM FUNCTION

~~~
If petal length is less than 3, predict setosa.
Else if petal width is less than 1.8, predict versicolor.
Otherwise predict virginica.
~~~

In [None]:
# positional indices of the columns used by the rules
PETAL_LENGTH = 2
PETAL_WIDTH = 3

# define a function that accepts a row of data (as an array) and returns a predicted species
def classify_iris(row):
    if row[PETAL_LENGTH] < 3:
        return 0    # setosa
    elif row[PETAL_WIDTH] < 1.8:
        return 1    # versicolor
    else:
        return 2    # virginica

In [None]:
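# predict for a single row to test the function
# (pass a NumPy row: integer-position lookups on a label-indexed Series are deprecated in recent pandas)
classify_iris(iris.values[0])      # first row

In [None]:
classify_iris(iris.values[149])    # last row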

In [None]:
# store predictions for all rows
predictions = [classify_iris(row) for row in iris.values]

In [None]:
predictions
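
A long list of integers is hard to inspect, and a Python loop over rows is not the only way to apply the rules. As an alternative sketch, the same three rules can be evaluated on whole columns at once with `numpy.select`, which picks the first matching condition for each row. The DataFrame `sample` and its values are hypothetical; in the notebook, the columns of `iris` would be used instead.

```python
import numpy as np
import pandas as pd

# hypothetical rows chosen to exercise each branch of the rules
sample = pd.DataFrame({
    'petal_length': [1.4, 4.7, 4.8, 6.0, 4.5],
    'petal_width':  [0.2, 1.4, 1.8, 2.5, 1.7],
})

# conditions are checked in order; the first one that is True wins
conditions = [
    sample.petal_length < 3,    # rule 1: setosa
    sample.petal_width < 1.8,   # rule 2: versicolor
]
choices = [0, 1]

# rows matching neither condition fall through to virginica (2)
vectorized = np.select(conditions, choices, default=2)
print(vectorized)
```

Because the conditions are ordered, this mirrors the `if` / `elif` / `else` structure of `classify_iris` without an explicit loop.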

In [None]:
# calculate the fraction of correct predictions (accuracy)
import numpy as np
np.mean(iris.species_num == predictions)    # 0.96
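
Overall accuracy does not show which species the rules confuse. One way to see this is to cross-tabulate actual against predicted labels, so that off-diagonal cells count the misclassifications. The labels below are hypothetical stand-ins for `iris.species_num` and `predictions`, used only to keep the sketch self-contained.

```python
import pandas as pd

# hypothetical labels (0 = setosa, 1 = versicolor, 2 = virginica)
actual = pd.Series([0, 0, 1, 1, 1, 2, 2, 2], name='actual')
predicted = pd.Series([0, 0, 1, 1, 2, 2, 2, 1], name='predicted')

# rows are actual species, columns are predicted species
confusion = pd.crosstab(actual, predicted)
print(confusion)

# fraction of matching labels, equivalent to the accuracy computed above
accuracy = (actual == predicted).mean()
print(accuracy)
```

In the notebook, `pd.crosstab(iris.species_num, pd.Series(predictions, name='predicted'))` would be the analogous call, assuming `predictions` is in the same row order as `iris`.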