# Naive Bayes - Example - Iris

**Context**  
This well-known dataset is used to identify the species of an iris flower from the measurements provided.

**Content**  
The dataset comes from Seaborn.  
It contains 150 rows with the following columns:  

| Variable     | Definition                    | Unit or values                |
| ------------ | ----------------------------- | ----------------------------- |
| sepal_length | Sepal length                  | Centimeters                   |
| sepal_width  | Sepal width                   | Centimeters                   |
| petal_length | Petal length                  | Centimeters                   |
| petal_width  | Petal width                   | Centimeters                   |
| species      | Species **(target variable)** | setosa, versicolor, virginica |

In the Seaborn version loaded here, the target variable is the `species` column of the same DataFrame; it is separated from the features later in the notebook.

**Problem statement**  
The goal is to predict the iris species from the sepal and petal measurements.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## Load Data

In [2]:
# Load the data
df = sns.load_dataset('iris')
df.head()

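|   | sepal_length | sepal_width | petal_length | petal_width | species |
| - | ------------ | ----------- | ------------ | ----------- | ------- |
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | setosa  |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | setosa  |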
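The import cell brings in `sklearn.datasets` but the notebook never uses it. As a rough sketch of an alternative to `sns.load_dataset`, which typically fetches the file over the network, the iris data can also be read from the copy bundled with scikit-learn. The helper name `load_iris_frame` is hypothetical, `as_frame=True` assumes scikit-learn 0.23 or later, and individual values may differ slightly between the two copies of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris


def load_iris_frame() -> pd.DataFrame:
    """Return iris with Seaborn-style column names (hypothetical helper)."""
    bunch = load_iris(as_frame=True)
    frame = bunch.frame.copy()
    # scikit-learn names the columns 'sepal length (cm)', ..., 'target'
    frame.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'species']
    # Map the integer target back to the species names
    frame['species'] = frame['species'].map(dict(enumerate(bunch.target_names)))
    return frame


df_alt = load_iris_frame()
print(df_alt.shape)
```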


In [3]:
# Rename columns
df.columns = ['longitud_sepalo', 'ancho_sepalo', 'longitud_petalo', 'ancho_petalo', 'especie']

In [4]:
print(df['especie'].unique())

['setosa' 'versicolor' 'virginica']


In [5]:
# Encode the target as integers
df['especie'] = df['especie'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2}).astype('int')
df.head()

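|   | longitud_sepalo | ancho_sepalo | longitud_petalo | ancho_petalo | especie |
| - | --------------- | ------------ | --------------- | ------------ | ------- |
| 0 | 5.1             | 3.5          | 1.4             | 0.2          | 0       |
| 1 | 4.9             | 3.0          | 1.4             | 0.2          | 0       |
| 2 | 4.7             | 3.2          | 1.3             | 0.2          | 0       |
| 3 | 4.6             | 3.1          | 1.5             | 0.2          | 0       |
| 4 | 5.0             | 3.6          | 1.4             | 0.2          | 0       |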


## Normalization

In [6]:
# Independent variables (features)
X = df[['longitud_sepalo', 'ancho_sepalo', 'longitud_petalo', 'ancho_petalo']]
X.head()

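|   | longitud_sepalo | ancho_sepalo | longitud_petalo | ancho_petalo |
| - | --------------- | ------------ | --------------- | ------------ |
| 0 | 5.1             | 3.5          | 1.4             | 0.2          |
| 1 | 4.9             | 3.0          | 1.4             | 0.2          |
| 2 | 4.7             | 3.2          | 1.3             | 0.2          |
| 3 | 4.6             | 3.1          | 1.5             | 0.2          |
| 4 | 5.0             | 3.6          | 1.4             | 0.2          |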


In [7]:
# Normalize (standardize each feature to zero mean and unit variance)
scaler = StandardScaler()
X_adj = scaler.fit_transform(X)
print(X_adj)

[[-9.00681170e-01  1.01900435e+00 -1.34022653e+00 -1.31544430e+00]
 [-1.14301691e+00 -1.31979479e-01 -1.34022653e+00 -1.31544430e+00]
 [-1.38535265e+00  3.28414053e-01 -1.39706395e+00 -1.31544430e+00]
 [-1.50652052e+00  9.82172869e-02 -1.28338910e+00 -1.31544430e+00]
 [-1.02184904e+00  1.24920112e+00 -1.34022653e+00 -1.31544430e+00]
 [-5.37177559e-01  1.93979142e+00 -1.16971425e+00 -1.05217993e+00]
 [-1.50652052e+00  7.88807586e-01 -1.34022653e+00 -1.18381211e+00]
 [-1.02184904e+00  7.88807586e-01 -1.28338910e+00 -1.31544430e+00]
 [-1.74885626e+00 -3.62176246e-01 -1.34022653e+00 -1.31544430e+00]
 [-1.14301691e+00  9.82172869e-02 -1.28338910e+00 -1.44707648e+00]
 [-5.37177559e-01  1.47939788e+00 -1.28338910e+00 -1.31544430e+00]
 [-1.26418478e+00  7.88807586e-01 -1.22655167e+00 -1.31544430e+00]
 [-1.26418478e+00 -1.31979479e-01 -1.34022653e+00 -1.44707648e+00]
 [-1.87002413e+00 -1.31979479e-01 -1.51073881e+00 -1.44707648e+00]
 [-5.25060772e-02  2.16998818e+00 -1.45390138e+00 -1.31544430e
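As a minimal sketch of what `StandardScaler` computes, each column is centered on its mean and divided by its population standard deviation (`ddof=0`). The array below holds only the first five rows of the two sepal columns shown earlier, so its results will not match the notebook output, which uses the statistics of all 150 rows.

```python
import numpy as np

# First five rows of the sepal length and sepal width columns (illustration only)
x = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [4.7, 3.2],
              [4.6, 3.1],
              [5.0, 3.6]])

# Column-wise standardization: z = (x - mean) / std, with ddof=0
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

print(z)
print(z.mean(axis=0), z.std(axis=0))
```

After the transformation each column should have a mean of approximately 0 and a standard deviation of approximately 1.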

In [8]:
# Dependent variable (target)
y = df['especie']
y.head()

0    0
1    0
2    0
3    0
4    0
Name: especie, dtype: int32

In [9]:
print('X:', len(X_adj), 'y:', len(y))

X: 150 y: 150


## Modeling

In [10]:
# Training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_adj, y, test_size=0.3, random_state=0)

In [11]:
print('X_train:', len(X_train), 'y_train:', len(y_train))
print('X_test:',  len(X_test),  'y_test:',  len(y_test))

X_train: 105 y_train: 105
X_test: 45 y_test: 45
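In the cells above, the scaler is fitted on all 150 rows before the split, so the test rows contribute to the scaling statistics. One possible variant, sketched below, splits the raw features first and lets a pipeline fit the scaler on the training rows only. It assumes scikit-learn's bundled iris copy as a stand-in for the Seaborn DataFrame, and the names `X_raw`, `y_raw`, and `pipe` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled iris copy stands in for the Seaborn DataFrame
X_raw, y_raw = load_iris(return_X_y=True)

# Split the unscaled features with the same settings as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training rows only,
# then applies the same transformation to the test rows
pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X_tr, y_tr)

print(len(X_tr), len(X_te))
print(pipe.score(X_te, y_te))
```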


In [12]:
# Training
model = GaussianNB()
model.fit(X_train, y_train)
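To illustrate the idea behind Gaussian Naive Bayes, the sketch below estimates a prior, a mean, and a variance per class and feature, then picks the class with the highest log-posterior. It is a simplified stand-in, not scikit-learn's implementation: the function names, the toy data, and the fixed `1e-9` added to the variances are all assumptions made for the example.

```python
import numpy as np


def fit_gnb(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps variances positive (arbitrary choice for this sketch)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gnb(params, X):
    """Return the class with the highest log-posterior for each row."""
    classes, priors, means, variances = params
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    # Features are treated as independent, so their log-likelihoods add up
    log_post = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]


# Toy data made up for illustration: two well-separated classes
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

params = fit_gnb(X_toy, y_toy)
toy_pred = predict_gnb(params, np.array([[1.1, 2.1], [5.1, 5.9]]))
print(toy_pred)  # [0 1]
```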

In [13]:
# Predictions
prediction = model.predict(X_test)
prediction

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 1, 1, 1, 2, 0, 2, 0,
       0])
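The predictions are the integer codes assigned to the target earlier in the notebook. A short sketch of mapping them back to species names follows; `code_to_species` and `sample_predictions` are illustrative names, and only the first five predicted values are used.

```python
# Inverse of the encoding applied to the target earlier in the notebook
code_to_species = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# First five predictions from the output above
sample_predictions = [2, 1, 0, 2, 0]

decoded = [code_to_species[code] for code in sample_predictions]
print(decoded)  # ['virginica', 'versicolor', 'setosa', 'virginica', 'setosa']
```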

## Evaluation

In [14]:
print(confusion_matrix(y_test, prediction))

[[16  0  0]
 [ 0 18  0]
 [ 0  0 11]]
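In scikit-learn's confusion matrix, rows correspond to the true class and columns to the predicted class, so the diagonal counts the correct predictions. The sketch below copies the matrix printed above and derives the accuracy and the per-class sample counts from it.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[16, 0, 0],
               [0, 18, 0],
               [0, 0, 11]])

correct = np.trace(cm)       # diagonal: correctly classified samples
total = cm.sum()             # all test samples
support = cm.sum(axis=1)     # true samples per class

print(correct, total, correct / total)  # 45 45 1.0
print(support)                          # [16 18 11]
```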


In [15]:
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      1.00      1.00        18
           2       1.00      1.00      1.00        11

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
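The figures in the report can be reproduced from the confusion matrix. As a sketch, precision divides each diagonal entry by its column sum, recall divides it by its row sum, and the F1 score is their harmonic mean; the macro average is the plain mean and the weighted average uses the support as weights.

```python
import numpy as np

# Confusion matrix from the evaluation above (rows: true, columns: predicted)
cm = np.array([[16, 0, 0],
               [0, 18, 0],
               [0, 0, 11]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # column sums: samples predicted as each class
recall = tp / cm.sum(axis=1)      # row sums: true samples of each class
f1 = 2 * precision * recall / (precision + recall)

support = cm.sum(axis=1)
macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)

print(precision, recall, f1)
print(macro_f1, weighted_f1)
```

With no off-diagonal entries, every metric equals 1.00, which matches the report on this 45-sample test set.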

