# Determining iris species with Machine Learning

In this notebook, we try to determine the species of a flower from its characteristics, such as the length and width of its petals and sepals. The data are imported directly from seaborn. Because the dataset is complete, only a very simple data-manipulation step is needed before we move straight on to building the models.

### Notebook by Eric Hamers

## Importing libraries and data

In [254]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set_style('darkgrid')

In [255]:
data = sns.load_dataset('iris')

## Preparing the data

As we can verify below, there are no missing values and only one categorical column ('species'). Since that is the column we want to predict, we can simply convert its categories to discrete integer values.

In [256]:
data.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [257]:
data['species'].value_counts()

setosa        50
virginica     50
versicolor    50
Name: species, dtype: int64

In [258]:
species_dict = {
    'setosa': 1,
    'virginica': 2,
    'versicolor': 3
}

In [259]:
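# Assign the mapped column back; an inplace replace on a selected column may not update `data` in recent pandas
data['species'] = data['species'].map(species_dict)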

In [262]:
data['species'].value_counts()

3    50
2    50
1    50
Name: species, dtype: int64
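The encoding step above can be sketched on a small stand-in Series. This is a minimal illustration, assuming a hypothetical four-row sample instead of the full dataset; `inverse_dict` is a helper name introduced here and does not appear in the notebook.

```python
import pandas as pd

# Hypothetical mini-sample standing in for data['species']
species = pd.Series(['setosa', 'virginica', 'versicolor', 'setosa'])

species_dict = {'setosa': 1, 'virginica': 2, 'versicolor': 3}

# map returns a new Series; any label missing from the dict would become NaN
encoded = species.map(species_dict)

# Invert the dictionary to recover the species names from the integer codes
inverse_dict = {code: name for name, code in species_dict.items()}
decoded = encoded.map(inverse_dict)

print(encoded.tolist())
print(decoded.tolist() == species.tolist())
```

Keeping the inverse mapping around makes it easy to read confusion matrices and reports in terms of species names rather than codes.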

In [265]:
data.head()

   sepal_length  sepal_width  petal_length  petal_width  species
0           5.1          3.5           1.4          0.2        1
1           4.9          3.0           1.4          0.2        1
2           4.7          3.2           1.3          0.2        1
3           4.6          3.1           1.5          0.2        1
4           5.0          3.6           1.4          0.2        1


## Modeling

Now we can split the data into training and test sets, using train_test_split from scikit-learn.

In [266]:
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split

In [267]:
X = data.drop('species', axis=1)
y = data['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
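As a quick check of what this split produces, the sketch below reproduces it on scikit-learn's bundled copy of iris, which is an assumption made here so the example runs without downloading the seaborn dataset. It also passes `stratify=y`, an optional variation that the notebook does not use, to keep the three classes equally represented in the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris stands in for sns.load_dataset('iris')
X, y = load_iris(return_X_y=True)

# Same test_size and random_state as the notebook, plus stratification (not in the original)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

print(X_train.shape, X_test.shape)   # 105 training rows and 45 test rows
print(np.bincount(y_test))           # test samples per class
```

Without stratification the test set can end up unbalanced, as in the reports in this notebook, where the class supports are 13, 12 and 20.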

## Logistic Regression

In [268]:
from sklearn.linear_model import LogisticRegression

In [269]:
logreg = LogisticRegression()

In [270]:
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [271]:
pred = logreg.predict(X_test)

In [272]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [273]:
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

[[13  0  0]
 [ 0 12  0]
 [ 0  2 18]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       0.86      1.00      0.92        12
          3       1.00      0.90      0.95        20

avg / total       0.96      0.96      0.96        45



In [274]:
acc_logreg = accuracy_score(y_test, pred)

In [275]:
acc_logreg

0.9555555555555556

With the logistic regression model, we reached an accuracy of about 96% (43 of 45 test samples classified correctly).
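The figures in the report can be recovered by hand from the confusion matrix. The sketch below copies the matrix printed above for logistic regression and computes accuracy, per-class precision and per-class recall with NumPy; it is an illustration of the arithmetic, not how scikit-learn computes them internally.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([[13, 0, 0],
               [0, 12, 0],
               [0, 2, 18]])

correct = np.diag(cm)                     # correctly classified samples per class

accuracy = correct.sum() / cm.sum()       # 43 / 45
precision = correct / cm.sum(axis=0)      # correct / all samples predicted as that class
recall = correct / cm.sum(axis=1)         # correct / all samples truly in that class

print(round(accuracy, 4))
print(precision.round(2), recall.round(2))
```

Class 2 has precision 12/14 because two class 3 flowers were predicted as class 2, and those same two errors bring the recall of class 3 down to 18/20.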

## Gaussian Naive Bayes

In [283]:
from sklearn.naive_bayes import GaussianNB

In [284]:
nb = GaussianNB()

In [285]:
nb.fit(X_train, y_train)

GaussianNB(priors=None)

In [286]:
predictions_nb = nb.predict(X_test)

In [287]:
print(confusion_matrix(y_test, predictions_nb))
print(classification_report(y_test, predictions_nb))

[[13  0  0]
 [ 0 11  1]
 [ 0  1 19]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       0.92      0.92      0.92        12
          3       0.95      0.95      0.95        20

avg / total       0.96      0.96      0.96        45



In [288]:
acc_nb = accuracy_score(y_test, predictions_nb) * 100

In [289]:
acc_nb

95.555555555555557

With Gaussian Naive Bayes we got a result similar to logistic regression: about 96% accuracy.

## K-Nearest Neighbors

In [303]:
from sklearn.neighbors import KNeighborsClassifier

In [340]:
knn = KNeighborsClassifier(n_neighbors=3)

In [341]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [342]:
predictions_knn = knn.predict(X_test)

In [343]:
print(confusion_matrix(y_test, predictions_knn))
print(classification_report(y_test, predictions_knn))

[[13  0  0]
 [ 0 12  0]
 [ 0  0 20]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       1.00      1.00      1.00        12
          3       1.00      1.00      1.00        20

avg / total       1.00      1.00      1.00        45



In [344]:
acc_knn = accuracy_score(y_test, predictions_knn) * 100

In [345]:
acc_knn

100.0

With K-Nearest Neighbors (n_neighbors=3), all 45 test samples were classified correctly, an accuracy of 100% on this split.

## Decision Trees
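A perfect score on a single 45-sample test set can depend on which rows happened to land in it. One way to get a steadier estimate is k-fold cross-validation, sketched below. This is not part of the original notebook: it assumes scikit-learn's bundled iris data and uses 5 folds as an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris stands in for the seaborn dataset
X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=3)

# Accuracy on each of 5 folds; each fold serves once as the held-out set
scores = cross_val_score(knn, X, y, cv=5)

print(scores)
print(f'mean accuracy: {scores.mean():.3f} (std: {scores.std():.3f})')
```

The mean across folds is usually a fairer basis for comparing models than one train/test split.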

In [297]:
from sklearn.tree import DecisionTreeClassifier

In [298]:
dtc = DecisionTreeClassifier()

In [299]:
dtc.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [300]:
predictions_dtc = dtc.predict(X_test)

In [301]:
print(confusion_matrix(y_test, predictions_dtc))
print(classification_report(y_test, predictions_dtc))

[[13  0  0]
 [ 0 11  1]
 [ 0  1 19]]
             precision    recall  f1-score   support

          1       1.00      1.00      1.00        13
          2       0.92      0.92      0.92        12
          3       0.95      0.95      0.95        20

avg / total       0.96      0.96      0.96        45



In [250]:
acc_dtc = accuracy_score(y_test, predictions_dtc) * 100

In [302]:
acc_dtc

95.555555555555557

Using a simple decision tree, we reached an accuracy similar to logistic regression and Naive Bayes: about 96%.
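The four models can also be compared side by side by fitting them in one loop on the same split. The sketch below assumes scikit-learn's bundled iris data; the `models` and `results` dictionaries are names introduced here, and `max_iter=1000` and the tree's `random_state=0` are choices made for this example. The numbers it prints may differ from those above, depending on the scikit-learn version and its defaults.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris stands in for the seaborn dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Hypothetical registry of the four classifiers used in this notebook
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gaussian Naive Bayes': GaussianNB(),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Fit each model on the training set and store its test accuracy in percent
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test)) * 100

# Print the models from most to least accurate
for name, acc in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {acc:.1f}%')
```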