# Determining the species of irises with Machine Learning

In this notebook, I'll try to develop a model that can classify an iris based on features such as petal and sepal size. The data is loaded directly from seaborn. Since the dataset is complete and has very little noise, only light manipulation is needed before going straight to modeling.

### Notebook by <a href="http://twitter.com/mxdeegan237">Eric Hamers</a>


## Imports

In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set_style('darkgrid')

In [2]:
data = sns.load_dataset('iris')

## Manipulating Data

As we can see below, there is no missing data and only one column ('species') holds categorical data. Since this is the column we want to predict, we can transform it into a numerical label.

In [3]:
data.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [4]:
data['species'].value_counts()

setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64

In [5]:
species_dict = {
    'setosa': 1,
    'virginica': 2,
    'versicolor': 3
}

In [6]:
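# Assign the result back instead of using inplace=True on a selected column,
# which may not update `data` in recent pandas versions.
data['species'] = data['species'].map(species_dict)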

In [7]:
data['species'].value_counts()

3    50
2    50
1    50
Name: species, dtype: int64
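
The integer codes are arbitrary labels, not an ordering, and scikit-learn classifiers would also accept the species strings directly. If the codes are kept, it helps to be able to translate predictions back into names. The sketch below inverts `species_dict`; the names `code_to_species` and `decode` are hypothetical helpers that are not part of the original notebook.

```python
# Same encoding as species_dict in the notebook.
species_dict = {'setosa': 1, 'virginica': 2, 'versicolor': 3}

# Hypothetical helper: invert the mapping so codes can be read as names.
code_to_species = {code: name for name, code in species_dict.items()}


def decode(codes):
    """Translate an iterable of integer codes back to species names."""
    return [code_to_species[c] for c in codes]


decoded = decode([1, 3, 2])
print(decoded)
```

The same function could be applied to the output of any `predict` call further down, for example `decode(pred)`.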

In [8]:
data.head()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        1
1           4.9          3.0           1.4          0.2        1
2           4.7          3.2           1.3          0.2        1
3           4.6          3.1           1.5          0.2        1
4           5.0          3.6           1.4          0.2        1


## Modeling

In [9]:
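# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection.
from sklearn.model_selection import train_test_split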



In [10]:
X = data.drop('species', axis=1)
y = data['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
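
With `test_size=0.3`, 45 of the 150 rows end up in the test set, and a plain random split does not guarantee that the three species are equally represented there. One possible variation, not used in this notebook, is to pass `stratify=y`. The sketch below assumes scikit-learn's bundled `load_iris` as a stand-in for `sns.load_dataset('iris')` (its targets are coded 0, 1, 2 rather than 1, 2, 3), and the `_demo`, `_tr` and `_te` names are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load_iris ships with scikit-learn; used here so the sketch needs no download.
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# stratify=y_demo keeps the class proportions identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

test_counts = Counter(y_te.tolist())
print(len(y_tr), len(y_te))
print(test_counts)
```

A stratified test set would hold the same number of flowers from each species, which makes per-class precision and recall easier to compare.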

## Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
logreg = LogisticRegression()

In [13]:
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
pred = logreg.predict(X_test)

In [15]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [16]:
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

[[13  0  0]
 [ 0 12  0]
 [ 0  2 18]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       0.86      1.00      0.92        12
          3       1.00      0.90      0.95        20

avg / total       0.96      0.96      0.96        45



In [17]:
acc_logreg = accuracy_score(y_test, pred)

In [18]:
acc_logreg

0.9555555555555556
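
The accuracy and the per-class precision and recall in the report can be recovered by hand from the confusion matrix, where rows are the true classes and columns the predicted ones. The sketch below copies the logistic regression matrix printed above and recomputes the figures in plain Python; the variable names are illustrative.

```python
# Confusion matrix from the logistic regression cell.
# Rows = true class, columns = predicted class, classes in order 1, 2, 3.
cm = [
    [13, 0, 0],
    [0, 12, 0],
    [0, 2, 18],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))  # diagonal entries
accuracy = correct / total

precision = []
recall = []
for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                  # row sum
    precision.append(cm[k][k] / predicted_k)
    recall.append(cm[k][k] / actual_k)

print(accuracy)
print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
```

The two class-3 flowers predicted as class 2 are what lower the precision of class 2 and the recall of class 3.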

## Gaussian Naive Bayes

In [19]:
from sklearn.naive_bayes import GaussianNB

In [20]:
nb = GaussianNB()

In [21]:
nb.fit(X_train, y_train)

GaussianNB(priors=None)

In [22]:
predictions_nb = nb.predict(X_test)

In [23]:
print(confusion_matrix(y_test, predictions_nb))
print(classification_report(y_test, predictions_nb))

[[13  0  0]
 [ 0 11  1]
 [ 0  1 19]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       0.92      0.92      0.92        12
          3       0.95      0.95      0.95        20

avg / total       0.96      0.96      0.96        45



In [24]:
acc_nb = accuracy_score(y_test, predictions_nb) * 100

In [25]:
acc_nb

95.555555555555557

## K-Nearest Neighbors

In [26]:
from sklearn.neighbors import KNeighborsClassifier

In [27]:
knn = KNeighborsClassifier(n_neighbors=3)

In [28]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [29]:
predictions_knn = knn.predict(X_test)

In [30]:
print(confusion_matrix(y_test, predictions_knn))
print(classification_report(y_test, predictions_knn))

[[13  0  0]
 [ 0 12  0]
 [ 0  0 20]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       1.00      1.00      1.00        12
          3       1.00      1.00      1.00        20

avg / total       1.00      1.00      1.00        45



In [31]:
acc_knn = accuracy_score(y_test, predictions_knn) * 100

In [32]:
acc_knn

100.0
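
A perfect score on only 45 test rows may partly reflect a favourable split. One way to check, sketched here as an optional extra and not part of the original analysis, is 5-fold cross-validation over the whole dataset. The example assumes scikit-learn's bundled `load_iris` in place of the seaborn copy, and the names `knn_cv` and `cv_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Same hyperparameter as in the notebook: 3 neighbours.
knn_cv = KNeighborsClassifier(n_neighbors=3)

# For a classifier, cv=5 performs stratified 5-fold cross-validation,
# returning one accuracy value per fold.
cv_scores = cross_val_score(knn_cv, iris.data, iris.target, cv=5)

print(cv_scores)
print(cv_scores.mean())
```

The mean over the folds is usually a steadier estimate than a single train/test split.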

## Decision Trees

In [33]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
dtc = DecisionTreeClassifier()

In [35]:
dtc.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [36]:
predictions_dtc = dtc.predict(X_test)

In [37]:
acc_dtc = accuracy_score(y_test, predictions_dtc) * 100

In [38]:
acc_dtc

95.555555555555557
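
To compare the four models side by side, the scores can be collected in one place. The sketch below hard-codes the values printed above so it runs on its own; in the notebook the existing variables could be used directly. Note that `acc_logreg` was stored as a fraction while the other three were multiplied by 100, so it is rescaled here. The decision tree was fitted with `random_state=None`, so its score may differ between runs.

```python
# Scores as reported in the cells above.
acc_logreg = 0.9555555555555556   # fraction
acc_nb = 95.555555555555557       # percentage
acc_knn = 100.0                   # percentage
acc_dtc = 95.555555555555557      # percentage

scores = {
    'Logistic Regression': acc_logreg * 100,  # bring onto the same scale
    'Gaussian Naive Bayes': acc_nb,
    'K-Nearest Neighbors': acc_knn,
    'Decision Tree': acc_dtc,
}

# Sort from best to worst test accuracy.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f'{name:<22}{score:6.2f}%')
```

On this particular split, K-Nearest Neighbors ranks first and the other three models tie.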