# Wines

- **Using a decision-tree ensemble (scikit-learn's `ExtraTreesClassifier`) to classify wines as red or white.**
- Data imported from Kaggle:
 - https://www.kaggle.com/datasets/dell4010/wine-dataset

### Objective
Build a machine learning model that learns which characteristics distinguish a red wine from a white wine.

In [11]:
# Import pandas
import pandas as pd

In [12]:
# Load the dataset
base = pd.read_csv('wine_dataset.csv')
base.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


In [13]:
# Inspect the dataset
# It has 6497 rows and 13 columns
base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  style                 6497 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


**Data preprocessing**

In [14]:
# Convert the values in the `style` column to numbers:
# 'red' becomes 0 and 'white' becomes 1
vinho = {'red': 0, 'white': 1}
base['target'] = base['style'].map(vinho)

# Drop `style`, since `target` now carries the same information
base = base.drop('style', axis=1)
base.head()


| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | 0 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | 0 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | 0 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
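
A dictionary-based `map` silently turns any label that is missing from the dictionary into `NaN`, so it can be worth checking for unmapped values before training. The sketch below uses a tiny made-up frame rather than the real dataset; the names `toy` and `style_to_target` and the extra `'rose'` label are hypothetical and exist only to trigger the case.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"style": ["red", "white", "white", "rose"]})

style_to_target = {"red": 0, "white": 1}
toy["target"] = toy["style"].map(style_to_target)

# Labels missing from the dictionary come back as NaN, so collect them explicitly.
missing_mask = toy["target"].isna()
unmapped = sorted(set(toy.loc[missing_mask, "style"]))
print(unmapped)  # ['rose']
```

In the notebook itself, `base.info()` already reports 6497 non-null values in `style`; a check like this only guards against labels other than `'red'` and `'white'`.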


### Splitting the data

- x holds every column of the table except target, which contains the answer
- y holds the target column, which says whether the wine is red (0) or white (1)

**Next, we split the data into training and test sets**

In [15]:
# Separate x and y
x = base.drop('target', axis=1)
y = base.target

In [16]:
# Split into training and test data with train_test_split

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
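
The split above draws a different random partition on every run, so the score will vary slightly between executions. A minimal sketch of a repeatable, class-balanced split follows. It runs on synthetic arrays rather than the wine data, and the `random_state=42` seed and the `stratify` argument are assumptions added for illustration; the notebook does not use them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 25 of class 0 and 75 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 25 + [1] * 75)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,
    random_state=42,   # arbitrary seed, chosen only for repeatability
    stratify=y_demo,   # keep the class ratio roughly equal in both parts
)

# Share of class 1 in the test part; it stays close to the overall 0.75.
print(len(y_te), y_te.mean())
```

Stratifying matters more when one class is much rarer than the other, since a purely random split can then leave the test set with very few examples of the minority class.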

### Creating and training the model

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

In [17]:
# Create and train the model
from sklearn.ensemble import ExtraTreesClassifier

modelo = ExtraTreesClassifier()
modelo.fit(x_train, y_train)
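
Since the objective is to learn which characteristics separate the two styles, one possible follow-up is to read the fitted model's `feature_importances_` attribute. The sketch below is an illustration on synthetic data generated with `make_classification`; the column names `feature_0` to `feature_4`, the `n_estimators=50` setting and the seeds are assumptions, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the wine features; column names are placeholders.
X_arr, y_arr = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

demo_model = ExtraTreesClassifier(n_estimators=50, random_state=0)
demo_model.fit(X_demo, y_arr)

# Impurity-based importances: non-negative values that sum to 1 across features.
importances = pd.Series(demo_model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's objects, the same idea would pair `modelo.feature_importances_` with `x_train.columns`.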

In [18]:
# Check the score: the mean accuracy on the test set
modelo.score(x_test, y_test)

0.9938461538461538

**The model did very well, classifying 99.38% of the test wines correctly**
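
A single train/test split gives one estimate of accuracy. One way to see how much that estimate depends on the particular split is k-fold cross-validation, sketched here on synthetic data. The 5 folds, `n_estimators=50` and the seeds are assumptions for the example and were not used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the wine data.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Fit and score the model on 5 different train/validation partitions.
scores = cross_val_score(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo,
    cv=5,
)

# Mean accuracy across folds and its spread.
print(scores.mean(), scores.std())
```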

- Now let's generate predictions and inspect the confusion matrix

In [21]:
# Predict the labels for x_test

previsao = modelo.predict(x_test)

In [22]:
# Import the confusion matrix
from sklearn.metrics import confusion_matrix

# Compare y_test (the true labels) with previsao (what the model predicted)
confusion_matrix(y_test, previsao)

# Rows are true labels and columns are predicted labels, in the order 0 (red), 1 (white).
# Taking white (1) as the positive class:
# true negative,  false positive
# false negative, true positive

array([[ 453,    8],
       [   4, 1485]])
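
The reported score can be recovered from this matrix by hand. The sketch below copies the four counts shown above, treats white (label 1) as the positive class, which is scikit-learn's usual binary convention, and derives the overall accuracy plus the recall for each style. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class,
# in label order 0 (red), 1 (white).
cm = np.array([[453, 8],
               [4, 1485]])

# With white (label 1) as the positive class, ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # correct predictions over all 1950 test wines
red_recall = tn / (tn + fp)       # share of true reds predicted as red
white_recall = tp / (fn + tp)     # share of true whites predicted as white

print(f"accuracy={accuracy:.4f} red={red_recall:.4f} white={white_recall:.4f}")
# accuracy=0.9938 red=0.9826 white=0.9973
```

The diagonal holds the 1938 correct predictions out of 1950, which matches the 0.9938 score; the 12 errors are 8 reds predicted as white and 4 whites predicted as red.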