## The scenario:

You are a business intelligence manager at a fast-moving startup that deals in flowers. Iris Mania is sweeping the world, and certain species fetch upwards of AU$50,000 for a single flower!

A new iris has just been delivered. Its species is not known and the resident florist is
on holidays.

The business has a sample data set with typical measurements for
three species of iris.

Our mystery flower has the following characteristics: <br><br>
Sepal length = 4.2 cm <br>
Sepal width =  4.1 cm <br>
Petal length = 1.3 cm <br>
Petal width =  0.25 cm<br>


Which species is it likely to be?

In [None]:
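# %matplotlib inline replaces the discouraged %pylab magic; plt is imported explicitly
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns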

In [None]:
#Read in data, check for missing values
df = pd.read_csv('./iris.csv')
df.info()

In [None]:
#Check species names and number of species samples collected
print(df['species'].unique())
print(df['species'].value_counts())

In [None]:
#Explore average values of sepal + petal characteristics
groups = df.groupby(by = ['species'])
groups.mean()
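One rough way to read the table of means is to measure how far the mystery flower sits from each species' average. The sketch below is a nearest-centroid comparison, which is not the method used later in this notebook. `nearest_centroid` is a hypothetical helper, and the centroid values in the usage example are made-up placeholders, not numbers taken from `iris.csv`.

```python
import math

def nearest_centroid(flower, centroids):
    """Return (species, distance) for the centroid closest to `flower`.

    flower    : sequence of measurements, e.g.
                (sepal_length, sepal_width, petal_length, petal_width)
    centroids : dict mapping species name -> sequence of mean measurements
    """
    # Euclidean distance from the flower to each species' mean
    distances = {name: math.dist(flower, mean) for name, mean in centroids.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

# In the notebook the centroids could come from the groupby above, e.g.
#   centroids = groups.mean().T.to_dict('list')
# The numbers below are placeholders so the block runs on its own.
toy_centroids = {
    'species_a': [5.0, 3.5, 1.5, 0.2],
    'species_b': [6.0, 2.8, 4.3, 1.3],
}
mystery = [4.2, 4.1, 1.3, 0.25]  # measurements quoted in the scenario
print(nearest_centroid(mystery, toy_centroids))
```

Comparing against means ignores how spread out each species is, so treat the result as a first guess rather than a classification.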


### Let's use the first iris as our unidentified flower

In [None]:
s_length, s_width, p_length, p_width, v = df.loc[0]

In [None]:
print('s_length',s_length)
print('s_width',s_width)
print('p_length', p_length)
print('p_width',p_width)
print('species', v)

In [None]:
# Frequency plot of characteristics. First, sepal length:
# plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=[7, 5])
for species in df['species'].unique():
    # kdeplot draws the same density curve as the deprecated distplot(hist=False)
    sns.kdeplot(df[df['species'] == species]['sepal_length'], label=species)
plt.legend()

ymax = ax.get_ylim()[1]

# Dashed line marks the unknown flower's sepal length
plt.vlines(x=s_length, ymin=0, ymax=ymax, linestyles='dashed', colors='k')

print('Unknown species sepal length = {} cm'.format(s_length))

In [None]:
# Sepal width. plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=[7, 5])
for species in df['species'].unique():
    # kdeplot draws the same density curve as the deprecated distplot(hist=False)
    sns.kdeplot(df[df['species'] == species]['sepal_width'], label=species)
plt.legend()

ymax = ax.get_ylim()[1]

# Dashed line marks the unknown flower's sepal width
plt.vlines(x=s_width, ymin=0, ymax=ymax, linestyles='dashed', colors='k')

print('Unknown species sepal width = {} cm'.format(s_width))

In [None]:
# Petal length. plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=[7, 5])
for species in df['species'].unique():
    # kdeplot draws the same density curve as the deprecated distplot(hist=False)
    sns.kdeplot(df[df['species'] == species]['petal_length'], label=species)
plt.legend()

ymax = ax.get_ylim()[1]

# Dashed line marks the unknown flower's petal length
plt.vlines(x=p_length, ymin=0, ymax=ymax, linestyles='dashed', colors='k')

print('Unknown species petal length = {} cm'.format(p_length))

In [None]:
# Petal width. plt.subplots returns (figure, axes), in that order
fig, ax = plt.subplots(figsize=[7, 5])
for species in df['species'].unique():
    # kdeplot draws the same density curve as the deprecated distplot(hist=False)
    sns.kdeplot(df[df['species'] == species]['petal_width'], label=species)
plt.legend()

ymax = ax.get_ylim()[1]

# Dashed line marks the unknown flower's petal width
plt.vlines(x=p_width, ymin=0, ymax=ymax, linestyles='dashed', colors='k')

print('Unknown species petal width = {} cm'.format(p_width))
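The density plots show by eye where the dashed line falls within each species' curve. A numeric counterpart, offered here as an assumption rather than as part of the original analysis, is a z-score: how many standard deviations the unknown measurement lies from each species' mean. `z_scores` is a hypothetical helper and the sample values are toy numbers invented for illustration.

```python
from statistics import mean, stdev

def z_scores(value, samples_by_species):
    """Map each species to (value - mean) / sample standard deviation."""
    return {
        name: (value - mean(samples)) / stdev(samples)
        for name, samples in samples_by_species.items()
    }

# In the notebook the samples could be pulled per species, e.g.
#   {s: g['sepal_length'].tolist() for s, g in df.groupby('species')}
# Toy values so the block runs on its own:
toy_samples = {
    'species_a': [4.0, 5.0, 6.0],
    'species_b': [6.0, 7.0, 8.0],
}
for name, z in z_scores(5.0, toy_samples).items():
    print(name, round(z, 2))
```

A z-score near zero means the measurement is typical for that species. A large absolute value means it sits far out in the tail of that curve.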

## Given our data, which species is the unknown iris likely to be?

In [None]:
df.loc[0]

In [None]:
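# Hold out the first row as the "unknown" flower and train on the rest.
# X holds the four measurements; y holds the species label (the target).
unknown_X = df.iloc[0, :-1]
unknown_y = df.iloc[0, -1]
X = df.iloc[1:, :-1]
y = df.iloc[1:, -1]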

In [None]:
unknown_X

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Default settings: 5 neighbours, Minkowski distance with p=2 (Euclidean), uniform weights
knn = KNeighborsClassifier()
knn.fit(X, y)


In [None]:
# predict expects a 2-D array: one row per flower, one column per feature
knn.predict(unknown_X.values.reshape(1, -1))
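To make the prediction less of a black box, here is a minimal from-scratch sketch of a k-nearest-neighbours vote using scikit-learn's default settings (5 neighbours, Euclidean distance, uniform weights). It is an illustration only, not scikit-learn's implementation, and tie-breaking may differ. `knn_predict` and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k=5):
    """Predict a label by majority vote among the k nearest points (Euclidean)."""
    # Pair each training point's distance to the query with its label, nearest first
    by_distance = sorted(
        (math.dist(query, p), label) for p, label in zip(points, labels)
    )
    # Uniform weights: each of the k nearest neighbours gets exactly one vote
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy two-feature data, invented for illustration
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
print(knn_predict((1.1, 1.0), points, labels, k=3))
```

In the notebook, `X.values.tolist()` and `y.tolist()` could be passed as `points` and `labels`, with the unknown flower's four measurements as `query`.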