# [DCC 030] Aprendizado Profundo para Processamento de Linguagem Natural: Projeto Final
__Aluno:__
- Eduardo Villani de Carvalho Filho - 2015104008
---
# [Parte 1] O trabalho


A técnicas de NLP podem ser utilizadas nos mais diversos problemas,
sejam estes voltados para linguagem ou até abstraído para problemas
não relacionados à linguagem natural. Isso porque, mesmo que não
seja algo de linguagem, podemos abstrair objetos para que sejam transformados
em formas de linguagem.

Um desses problemas que podemos usar NLP é em aplicações de recomendações
de itens. Existem diversas formas de se recomendar itens, como comparações
de ratings entre os itens, consumo local, etc.

Neste trabalho, iremos explorar a recomendação de itens por meio
da relação entre a descrição de itens  e, se dado essas descrições,
podemos determinar se este item pertence a algum tipo de classe.

## Implementação

Para facilitar a implementação, o projeto foi totalmente modularizado,
mas será explicado o objetivo de cada classe e seu funcionamento.


## O Dataset

Para esse problema, iremos usar um dataset de vinhos. Cada entrada é
um vinho diferente, contendo informações de região, descrição, nome, etc.

Vamos pegar um exemplo com a classe que carrega o dataset.

In [49]:
from copy import deepcopy
from typing import Union

import pandas as pd
import numpy as np
from nltk import RegexpTokenizer


class WineDataSet:
    def __init__(self, filter_by_topn_varieties: Union[None, int] = None):
        self._df = pd.read_csv('inputs/wine_dataset/winemag-data-130k-v2.csv')
        if filter_by_topn_varieties is not None:
            self._df = self._df[
                self._df['variety'].isin(self._df.varieties_count()[:filter_by_topn_varieties]['variety'])]
        self._clean_description()

    def __len__(self):
        return len(self._df)

    def __iter__(self):
        return np.array(self._df)

    @property
    def data(self) -> pd.DataFrame:
        return deepcopy(self._df)

    @property
    def countries(self):
        return self._df['country'].unique()

    @property
    def varieties(self):
        return self._df['variety'].unique()

    def varieties_count(self):
        return self._df.groupby('variety').count()['country'] \
            .sort_values(ascending=False) \
            .reset_index() \
            .rename(columns={'country': 'count'})

    def _clean_description(self):
        def remove_non_ascii(s):
            return "".join(i for i in s if ord(i) < 128)

        def make_lower_case(text):
            return text.lower()

        def remove_stop_words(text):
            from nltk.corpus import stopwords
            text = text.split()
            stops = set(stopwords.words("english"))
            text = [w for w in text if not w in stops]
            text = " ".join(text)
            return text

        def remove_punctuation(text):
            tokenizer = RegexpTokenizer(r'\w+')
            text = tokenizer.tokenize(text)
            text = " ".join(text)
            return text

        df = self._df
        df['description_cleaned'] = df['description'].apply(remove_non_ascii)
        df['description_cleaned'] = df.description_cleaned.apply(func=make_lower_case)
        df['description_cleaned'] = df.description_cleaned.apply(func=remove_stop_words)
        df['description_cleaned'] = df.description_cleaned.apply(func=remove_punctuation)
        df['description_cleaned'] = df['description_cleaned'].str.replace('\d+', '')

### Exemplo do Dataset

In [15]:
wine_dataset = WineDataSet()
wine_dataset.data.head(10)


Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery,description_cleaned
0,0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia,aromas include tropical fruit broom brimstone ...
1,1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos,ripe fruity wine smooth still structured firm ...
2,2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm,tart snappy flavors lime flesh rind dominate g...
3,3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian,pineapple rind lemon pith orange blossom start...
4,4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks,much like regular bottling comes across rathe...
5,5,Spain,Blackberry and raspberry aromas show a typical...,Ars In Vitro,87,15.0,Northern Spain,Navarra,,Michael Schachner,@wineschach,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,Tandem,blackberry raspberry aromas show typical navar...
6,6,Italy,"Here's a bright, informal red that opens with ...",Belsito,87,16.0,Sicily & Sardinia,Vittoria,,Kerin O’Keefe,@kerinokeefe,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,Terre di Giurfo,here s bright informal red opens aromas candie...
7,7,France,This dry and restrained wine offers spice in p...,,87,24.0,Alsace,Alsace,,Roger Voss,@vossroger,Trimbach 2012 Gewurztraminer (Alsace),Gewürztraminer,Trimbach,dry restrained wine offers spice profusion bal...
8,8,Germany,Savory dried thyme notes accent sunnier flavor...,Shine,87,12.0,Rheinhessen,,,Anna Lee C. Iijima,,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer,Heinz Eifel,savory dried thyme notes accent sunnier flavor...
9,9,France,This has great depth of flavor with its fresh ...,Les Natures,87,27.0,Alsace,Alsace,,Roger Voss,@vossroger,Jean-Baptiste Adam 2012 Les Natures Pinot Gris...,Pinot Gris,Jean-Baptiste Adam,great depth flavor fresh apple pear fruits tou...


### Tamanho do dataset

In [4]:
wine_dataset.data.shape

(129971, 15)

### Variedades de uvas

In [16]:
wine_dataset.varieties_count()


Unnamed: 0,variety,count
0,Pinot Noir,13269
1,Chardonnay,11750
2,Cabernet Sauvignon,9470
3,Red Blend,8935
4,Bordeaux-style Red Blend,6915
...,...,...
702,Vranac,0
703,Ojaleshi,0
704,Tsapournakos,0
705,Tsolikouri,0


### Países produtores

In [17]:
wine_dataset.data.groupby('country').count()['title'].sort_values(ascending=False)

country
US                        54504
France                    22093
Italy                     19540
Spain                      6645
Portugal                   5691
Chile                      4472
Argentina                  3800
Austria                    3345
Australia                  2329
Germany                    2165
New Zealand                1419
South Africa               1401
Israel                      505
Greece                      466
Canada                      257
Hungary                     146
Bulgaria                    141
Romania                     120
Uruguay                     109
Turkey                       90
Slovenia                     87
Georgia                      86
England                      74
Croatia                      73
Mexico                       70
Moldova                      59
Brazil                       52
Lebanon                      35
Morocco                      28
Peru                         16
Ukraine                      14


In [18]:
del wine_dataset

## Entidade de Vinhos e Dicionario de Vinhos

Para facilitar algumas operações, foram criadas duas entidades: Wine e WineDict, que é um
dicionário de Wines. Isso é para ajudar no sistema de recomendação de vinhos. Uma terceira
entidade foi feita para facilitar a conversão de rows do dataset para a entidade Wine.

In [78]:
import json


class Wine:
    def __init__(self,
                 points: int,
                 title: str,
                 description: dict,
                 taster_name: str,
                 taster_twitter_handle: str,
                 price: float,
                 designation: str,
                 variety: str,
                 region_1: str,
                 region_2: str,
                 province: str,
                 country: str,
                 winery: str,
                 ):
        self._data = {
            "points": int(points),
            "title": title,
            "description": description,
            "taster_name": taster_name,
            "taster_twitter_handle": taster_twitter_handle,
            "price": float(price),
            "designation": designation,
            "variety": variety,
            "region_1": region_1,
            "region_2": region_2,
            "province": province,
            "country": country,
            "winery": winery
        }

    def __str__(self):
        return json.dumps(self._data, indent=2)

    def __call__(self):
        return self._data

    __repr__ = __str__

    @property
    def title(self) -> str:
        return self._data['title']

    @title.setter
    def title(self, value):
        self._data['title'] = value

    def __getitem__(self, key):
        return self._data[key]

    @property
    def cleaned_description(self) -> str:
        return self._data['description']['cleaned']

In [44]:
from typing import Union

from models.wine import Wine


class WineDict:
    def __init__(self):
        self._data = {}
        self._title2index = {}
        self._index2title = {}
        self._repeated_wines_name = {}

    def append(self, wine: Wine):
        index = len(self._data)
        if wine.title in self._title2index:
            try:
                self._repeated_wines_name[wine.title].append(index)
            except KeyError:
                self._repeated_wines_name[wine.title] = [self._title2index[wine.title], index]
            wine.title = f"{wine.title} ({len(self._repeated_wines_name[wine.title])})"

        self._data[index] = wine
        self._index2title[index] = wine.title
        self._title2index[wine.title] = index

    def __iter__(self):
        return self._data.__iter__()

    def __getitem__(self, key: Union[str, int]):
        if isinstance(key, int):
            return self._data[key]
        return self._data[self._title2index[key]]

    def __str__(self):
        return str(self._data)

    __repr__ = __str__

    @property
    def title2index(self):
        return self._title2index

    @property
    def index2title(self):
        return self._index2title

In [45]:
class Row2Json:
    def __new__(cls, row):
        return {
            "points": int(row[4]),
            "title": row[11] if not isinstance(row[11], float) else None,
            "description": {
                'original': row[2] if not isinstance(row[2], float) else None,
                'cleaned': row[14] if not isinstance(row[14], float) else None
            },
            "taster_name": row[9] if not isinstance(row[9], float) else None,
            "taster_twitter_handle": row[10] if not isinstance(row[10], float) else None,
            "price": float(row[5]),
            "designation": row[3] if not isinstance(row[3], float) else None,
            "variety": row[12] if not isinstance(row[12], float) else None,
            "region_1": row[7] if not isinstance(row[7], float) else None,
            "region_2": row[8] if not isinstance(row[8], float) else None,
            "province": row[6] if not isinstance(row[9], float) else None,
            "country": row[1] if not isinstance(row[1], float) else None,
            "winery": row[13] if not isinstance(row[13], float) else None
        }

## Inicializando o Dict

Um exemplo do que acontece para incializar o WineDict com Wines

In [79]:
from tqdm import tqdm

wine130k = WineDataSet()
wine_dict = WineDict()
for row in tqdm(np.array(wine130k.data)):
    wine_dict.append(Wine(**Row2Json(row)))
del row, wine130k

100%|██████████| 129971/129971 [00:03<00:00, 33692.00it/s]


## Acessando vinhos

Há duas formas de se acessar um vinho: index ou o seu nome

In [83]:
wine_dict[0]

{
  "points": 87,
  "title": "Nicosia 2013 Vulk\u00e0 Bianco  (Etna)",
  "description": {
    "original": "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",
    "cleaned": "aromas include tropical fruit broom brimstone dried herb palate overly expressive offering unripened apple citrus dried sage alongside brisk acidity"
  },
  "taster_name": "Kerin O\u2019Keefe",
  "taster_twitter_handle": "@kerinokeefe",
  "price": NaN,
  "designation": "Vulk\u00e0 Bianco",
  "variety": "White Blend",
  "region_1": "Etna",
  "region_2": null,
  "province": "Sicily & Sardinia",
  "country": "Italy",
  "winery": "Nicosia"
}

In [112]:
wine_dict[wine_dict[1000].title]

{
  "points": 88,
  "title": "Arcane Cellars 2006 Cabernet Sauvignon (Rogue Valley)",
  "description": {
    "original": "Arcane's Cab is stylistically apart from either California or Washington. It defines its own space. There's plenty of new oak, but the fruit, acid and tannins stand up to it. This is sharp and tangy; cranberry and raspberry, strawberry and citric acids all playing their part. Still young, give it some time in a decanter or in your cellar to come together and show its best.",
    "cleaned": "arcane s cab stylistically apart either california washington defines space there s plenty new oak fruit acid tannins stand it sharp tangy cranberry raspberry strawberry citric acids playing part still young give time decanter cellar come together show best"
  },
  "taster_name": "Paul Gregutt",
  "taster_twitter_handle": "@paulgwine\u00a0",
  "price": 24.0,
  "designation": null,
  "variety": "Cabernet Sauvignon",
  "region_1": "Rogue Valley",
  "region_2": "Southern Oregon",
  

## Wine2Index e Index2Wine

Para facilitar mapear index para vinhos e vice-versa, temos os seguintes atributos:

In [7]:
wine_dict.index2title[0]

'Nicosia 2013 Vulkà Bianco  (Etna)'

In [8]:
wine_dict.title2index[wine_dict[1000].title]

1000

# [Parte 2] Wine Recommender

A primeira parte deste trabalho consiste em fazer um recomendador de vinhos. Podemos notar
pelo dataset que temos uma descrição de vinhos. Essa é feita por sommeliers e traz
informações de características de vinhos. Resolvi ir por esse lado, pois é um
ponto importante na escolha de vinhos.

In [84]:
wine_dict[100]['description']['original']

"Fresh apple, lemon and pear flavors are accented by a hint of smoked nuts in this bold, full-bodied Pinot Gris. Rich and a bit creamy in mouthfeel yet balanced briskly, it's a satisfying white with wide pairing appeal. Drink now through 2019."

In [85]:
wine_dict[200]['description']['original']


'Aromas of mature black-skinned berry, tobacco and dark spice lead the way. The rounded palate offers dried cherry, raspberry, coffee and a licorice note alongside velvety if not very persistent tannins. Showing the heat of the vintage, this is evolving quickly so enjoy over the next several years.'

## Representando vinhos

Vamos partir do principio que esss vinhos são representados pelas suas descrições, eles
são caractéristicas que realmente definem os vinhos. Então, vinhos parecidos, terão
descrições parecidas. Uma forma de representa essas palavras númericas é com
o word2vec. A limpeza de dados já foi feita pelo WineDataset, então só temos que
treinar um modelo.

## Gerando o Corpus

Com precisamos de um corpus para treinar o Word2Vec, iremos gerar um para ele.

In [53]:
from typing import Union

import numpy as np
from tqdm import tqdm

from models.wine_dict import WineDict


class WineCorpus:
    def __init__(self, text: Union[str, WineDict]):
        corpus = []
        sentence_len = []

        self._corpus = None
        self._max_len = None
        self._min_len = None
        self._avg_len = None
        self._std_len = None
        if isinstance(text, WineDict):
            for w in tqdm(text):
                wine = text[w]
                cd = wine.cleaned_description.split()
                sentence_len.append(len(cd))
                corpus.append(cd)
            self._corpus = corpus
            del text, w
        elif isinstance(text, str):
            with open(text, 'r') as f:
                for sentence in tqdm(f.readlines()):
                    corpus.append(sentence.replace('\n', '').split())
                    sentence_len.append(len(sentence.split()))
                self._corpus = corpus
            f.close()
        else:
            for w in tqdm(text):
                wine = text[w]
                cd = wine.cleaned_description.split()
                sentence_len.append(len(cd))
                corpus.append(cd)
            self._corpus = corpus
            del text, w
        self.init_data(sentence_len)

    def init_data(self, sentence_len):
        sentence_len = np.array(sentence_len)
        self._max_len = max(sentence_len)
        self._min_len = min(sentence_len)
        self._avg_len = int(np.mean(sentence_len))
        self._std_len = int(np.std(sentence_len))

    def __iter__(self):
        return self._corpus.__iter__()

    @property
    def corpus(self):
        return self._corpus

    @property
    def max_len(self):
        return self._max_len

    @property
    def min_len(self):
        return self._min_len

    @property
    def avg_len(self):
        return self._avg_len

    @property
    def std_len(self):
        return self._std_len

    def save(self, save_path: str):
        with open(save_path, "w") as f:
            for sentence in self._corpus:
                s = " ".join(map(str, sentence))
                f.write(f"{s}\n")
        f.close()

### Montando o corpus

In [86]:
wine_corpus = WineCorpus(wine_dict)

129971it [00:00, 160996.84it/s]


### O corpus

In [114]:
wine_corpus.corpus[0]

['aromas',
 'include',
 'tropical',
 'fruit',
 'broom',
 'brimstone',
 'dried',
 'herb',
 'palate',
 'overly',
 'expressive',
 'offering',
 'unripened',
 'apple',
 'citrus',
 'dried',
 'sage',
 'alongside',
 'brisk',
 'acidity']

### Alguns dados dos corpus

In [115]:
print(f'''
MAX: {wine_corpus.max_len}
AVG: {wine_corpus.avg_len}
MIN: {wine_corpus.min_len}
SDT: {wine_corpus.std_len}
''')


MAX: 80
AVG: 25
MIN: 2
SDT: 6



### Salvando o corpus

In [116]:
wine_corpus.save('corpus/all_wines.txt')

## Wine2Vec

Aqui vamos definir a entidade que define a relação de cada vinho com as palavras.

In [55]:
import logging
import os
from typing import Union, List

import numpy as np
from dask.array import from_array
from gensim.models import Word2Vec
from tqdm import tqdm

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class Wine2Vec:
    def __init__(self, sentences,
                 iter: int = 5,
                 min_count: int = 5,
                 size: int = 300,
                 workers: int = os.cpu_count() - 1,
                 sg: int = 1,
                 hs: int = 0,
                 negative: int = 5,
                 window: int = 5):
        self._sentences = sentences
        self._iter = iter
        self._min_count = min_count
        self._size = size
        self._workers = workers
        self._sg = sg
        self._hs = hs
        self._negative = negative
        self._window = window

        self._model = None
        self._wine_embedding = None

    @property
    def wine2vec(self):
        return self._wine_embedding

    @property
    def model(self) -> Word2Vec:
        return self._model

    @model.setter
    def model(self, value):
        self._model = value

    def train(self, ):
        model = Word2Vec(
            sentences=self._sentences,
            iter=self._iter,
            min_count=self._min_count,
            size=self._size,
            workers=self._workers,
            sg=self._sg,
            hs=self._hs,
            negative=self._negative,
            window=self._window
        )

        self._model = model

    def save(self, path: str):
        self._model.save(f"{path}")

    def load(self, load_path: str):
        from gensim.models import Word2Vec
        self._model = Word2Vec.load(load_path)

    def similarity(self, w1, w2):
        return self._model.wv.similarity(w1, w2)

    def word(self, word):
        return self._model.wv[word]

    def most_similar(self, words: Union[str, List[str]], topn: int = 10):
        return self._model.wv.most_similar(words, topn=topn)

    def wine_embeddings(self):
        embeddings = []
        for sentence in tqdm(self._sentences):
            sentence_phrase = []
            for word in sentence:
                sentence_phrase.append(self.word(word))
            embeddings.append(np.array(sentence_phrase))
        wine_embedding = []
        for e in tqdm(embeddings):
            wine_embedding.append(np.mean(e, axis=0))
        wine_embedding = np.array(wine_embedding)
        self._wine_embedding = from_array(wine_embedding)

### Gerando o modelo

In [87]:
iter_ = 30
min_count = 1
size = 300
window = wine_corpus.max_len
file_path = f'wine2vec_pretrained/wine2vec_model_i{iter_}_mc{min_count}_s{size}_w{window}'

wine2vec = Wine2Vec(
    sentences=wine_corpus,
    iter=iter_,
    min_count=min_count,
    size=size,
    window=window
)

In [40]:
wine2vec.train()

2021-03-25 17:04:25,149 : INFO : collecting all words and their counts
2021-03-25 17:04:25,151 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-03-25 17:04:25,202 : INFO : PROGRESS: at sentence #10000, processed 255602 words, keeping 11569 word types
2021-03-25 17:04:25,249 : INFO : PROGRESS: at sentence #20000, processed 510235 words, keeping 15714 word types
2021-03-25 17:04:25,302 : INFO : PROGRESS: at sentence #30000, processed 764360 words, keeping 18665 word types
2021-03-25 17:04:25,364 : INFO : PROGRESS: at sentence #40000, processed 1022370 words, keeping 21089 word types
2021-03-25 17:04:25,405 : INFO : PROGRESS: at sentence #50000, processed 1277854 words, keeping 23117 word types
2021-03-25 17:04:25,445 : INFO : PROGRESS: at sentence #60000, processed 1535119 words, keeping 24868 word types
2021-03-25 17:04:25,489 : INFO : PROGRESS: at sentence #70000, processed 1787073 words, keeping 26336 word types
2021-03-25 17:04:25,529 : INFO : PROGRESS:

### Salvando o modelo

In [41]:
wine2vec.save(file_path)

2021-03-25 17:26:37,702 : INFO : saving Word2Vec object under wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80, separately None
2021-03-25 17:26:37,717 : INFO : not storing attribute vectors_norm
2021-03-25 17:26:37,729 : INFO : not storing attribute cum_table
2021-03-25 17:26:40,799 : INFO : saved wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80


### Carregando o modelo

In [88]:
wine2vec.load(file_path)

2021-03-25 20:17:01,720 : INFO : loading Word2Vec object from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80
2021-03-25 20:17:03,264 : INFO : loading wv recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.wv.* with mmap=None
2021-03-25 20:17:03,271 : INFO : setting ignored attribute vectors_norm to None
2021-03-25 20:17:03,271 : INFO : loading vocabulary recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.vocabulary.* with mmap=None
2021-03-25 20:17:03,272 : INFO : loading trainables recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.trainables.* with mmap=None
2021-03-25 20:17:03,272 : INFO : setting ignored attribute cum_table to None
2021-03-25 20:17:03,273 : INFO : loaded wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80


### Vendo o funcionamento do modelo

In [89]:
wine2vec.most_similar('aromas')

2021-03-25 20:17:08,666 : INFO : precomputing L2-norms of word weight vectors


[('palate', 0.6787787675857544),
 ('notes', 0.6084168553352356),
 ('scents', 0.5799988508224487),
 ('finish', 0.5651005506515503),
 ('nose', 0.561394453048706),
 ('lead', 0.5575189590454102),
 ('flavors', 0.541110634803772),
 ('offers', 0.5409454107284546),
 ('herbs', 0.517071008682251),
 ('closes', 0.5008352994918823)]

In [90]:
wine2vec.similarity('apple', 'citrus')

0.74055004

In [121]:
del wine2vec, wine_corpus, wine_dict

## O Recomenadnor

Finalmente, iremos desenvolver o recomendador. Ele partirá do principio que
cada item é representado por uma palavra de sua descrição, logo
um item é a média dessas palavras em um espaço n-dimensional (no nosso caso, um espaço 300-dimensional)

Ele usará todas as demais classes geradas, então, iremos aproveitar alguns modelos já treinados.

In [58]:
import json
from typing import Union

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

from models.row2json import Row2Json
from models.wine import Wine
from models.wine2vec import Wine2Vec
from models.wine_corpus import WineCorpus
from models.wine_dataset import WineDataSet
from models.wine_dict import WineDict


class WineRecommender:
    def __init__(self):
        self._wine130k = WineDataSet()
        self._wine_dict = WineDict()
        for row in tqdm(np.array(self._wine130k.data)):
            self._wine_dict.append(Wine(**Row2Json(row)))
        del row

        self._wine_corpus: Union[None, WineCorpus] = None
        self._wine2vec: Union[None, Wine2Vec] = None

    @property
    def wine2vec(self) -> Wine2Vec:
        return self._wine2vec

    @property
    def wine_corpus(self) -> WineCorpus:
        return self._wine_corpus

    @property
    def wine_dict(self) -> WineDict:
        return self._wine_dict

    @property
    def wine_dataset(self):
        return self._wine130k

    def build_corpus(self, save_path: Union[None, str] = None):
        self._wine_corpus = WineCorpus(self._wine_dict)
        if save_path is not None:
            self._wine_corpus.save(save_path)

    def load_corpus(self, load_path: str):
        self._wine_corpus = WineCorpus(load_path)

    def build_item2vec(self,
                       iter: int = 5,
                       min_count: int = 5,
                       size: int = 300,
                       window: int = 5,
                       save_path: Union[None, str] = None):

        self._wine2vec = Wine2Vec(
            sentences=self._wine_corpus.corpus,
            iter=iter,
            min_count=min_count,
            size=size,
            window=window
        )
        self._wine2vec.train()
        if save_path is not None:
            self._wine2vec.save(save_path)

    def load_item2vec(self, load_path: str):
        self._wine2vec = Wine2Vec(self._wine_corpus)
        self._wine2vec.load(load_path)
        if self._wine2vec.wine2vec is None:
            self._wine2vec.wine_embeddings()

    def build_gword2vec(self,
                        google_pretrained: str,
                        iter: int = 5,
                        min_count: int = 5,
                        size: int = 300,
                        window: int = 5,
                        save_path: Union[None, str] = None):
        # google_word2vec = KeyedVectors.load_word2vec_format(file_path, binary=True)
        from gensim.models import Word2Vec
        import os
        self._wine2vec = Wine2Vec(
            sentences=self._wine_corpus.corpus,
            min_count=min_count,
            size=size,
            window=window
        )

        self._wine2vec.model = Word2Vec(
            min_count=min_count,
            size=size,
            workers=os.cpu_count() - 1,
            window=window
        )

        self._wine2vec.model.build_vocab(self._wine_corpus)
        self._wine2vec.model.intersect_word2vec_format(google_pretrained,
                                                       lockf=1.0, binary=True)
        self._wine2vec.model.train(self._wine_corpus.corpus, total_examples=self._wine2vec.model.corpus_count,
                                   epochs=iter)
        if save_path is not None:
            self._wine2vec.save(save_path)

    def recommend(self, wine: int):
        import json
        wine_index = None
        if isinstance(wine, str):
            wine_index = self._wine_dict.title2index[wine]
        elif isinstance(wine, int):
            wine_index = wine
        if self._wine2vec.wine2vec is None:
            self._wine2vec.wine_embeddings()
        wine_embeddings = self._wine2vec.wine2vec[wine_index:wine_index + 1]
        similarities = cosine_similarity(wine_embeddings, self._wine2vec.wine2vec)
        bigger_similarity = np.argsort(-similarities)[0][1]
        return json.dumps(self._wine_dict[int(bigger_similarity)](), indent=2)

    def recommend_by_description(self, description: str):
        words = description.split()
        words_embeddings = []
        for w in words:
            try:
                words_embeddings.append(self._wine2vec.model[w])
            except KeyError:
                pass

        words_embeddings = np.array(words_embeddings)
        similarities = cosine_similarity(words_embeddings, self._wine2vec.wine2vec)
        bigger_similarity = np.argsort(-similarities)[0][0]
        return json.dumps(self._wine_dict[int(bigger_similarity)](), indent=2)

In [59]:
wine_recommender = WineRecommender()
wine_recommender.load_corpus('corpus/all_wines.txt')

iter = 30
min_count = 1
size = 300
window = wine_recommender.wine_corpus.max_len
file_path = f'wine2vec_pretrained/wine2vec_model_i{iter}_mc{min_count}_s{size}_w{window}'

wine_recommender.load_item2vec(file_path)

100%|██████████| 129971/129971 [00:01<00:00, 78515.01it/s] 
100%|██████████| 129971/129971 [00:01<00:00, 92316.66it/s] 
2021-03-25 20:01:45,642 : INFO : loading Word2Vec object from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80
2021-03-25 20:01:46,460 : INFO : loading wv recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.wv.* with mmap=None
2021-03-25 20:01:46,461 : INFO : setting ignored attribute vectors_norm to None
2021-03-25 20:01:46,462 : INFO : loading vocabulary recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.vocabulary.* with mmap=None
2021-03-25 20:01:46,463 : INFO : loading trainables recursively from wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80.trainables.* with mmap=None
2021-03-25 20:01:46,464 : INFO : setting ignored attribute cum_table to None
2021-03-25 20:01:46,465 : INFO : loaded wine2vec_pretrained/wine2vec_model_i30_mc1_s300_w80
129971it [00:13, 9902.92it/s] 
100%|██████████| 129971/129971 [00:19<00:00, 6676.50it/

### A representação de um vinho

In [92]:
wine_recommender.wine2vec.wine2vec[0].compute()

array([ 7.79037029e-02,  5.50034232e-02,  1.78660098e-02,  5.25492206e-02,
        4.02740426e-02, -1.46335233e-02, -3.55600640e-02,  5.85262366e-02,
       -2.59448644e-02, -8.53577927e-02, -3.84667739e-02, -9.83122643e-03,
        2.24168181e-01,  4.19204980e-02,  4.41841818e-02,  6.59767389e-02,
        1.47900537e-01, -9.37720835e-02,  3.66962552e-02, -8.75810832e-02,
       -2.05426831e-02,  8.68328065e-02, -1.30920798e-01,  4.80929017e-03,
       -1.15127429e-01,  1.12668738e-01, -7.93112367e-02, -1.22390650e-01,
       -1.35365722e-03,  1.53373972e-01,  1.87498212e-01, -3.97933982e-02,
       -1.40859693e-01, -1.88364647e-02,  9.81048122e-02, -3.50236818e-02,
       -5.89289963e-02, -4.07277308e-02,  1.99139863e-03, -1.05597116e-01,
        9.71265808e-02, -8.73626582e-03, -3.41257155e-02, -1.28449142e-01,
        9.34429914e-02, -1.52593566e-04,  2.81453375e-02,  1.28495693e-01,
        9.31473598e-02,  8.29847008e-02,  3.63429412e-02,  1.53439581e-01,
       -1.44884706e-01,  

### Recomendando vinhos

#### Index

In [93]:
wine = wine_recommender.wine_dict[0]

In [94]:
wine

{
  "points": 87,
  "title": "Nicosia 2013 Vulk\u00e0 Bianco  (Etna)",
  "description": {
    "original": "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.",
    "cleaned": "aromas include tropical fruit broom brimstone dried herb palate overly expressive offering unripened apple citrus dried sage alongside brisk acidity"
  },
  "taster_name": "Kerin O\u2019Keefe",
  "taster_twitter_handle": "@kerinokeefe",
  "price": NaN,
  "designation": "Vulk\u00e0 Bianco",
  "variety": "White Blend",
  "region_1": "Etna",
  "region_2": null,
  "province": "Sicily & Sardinia",
  "country": "Italy",
  "winery": "Nicosia"
}

In [95]:
wine_recommended = wine_recommender.recommend(0)

In [96]:
import json
json.loads(wine_recommended)


{'points': 86,
 'title': 'Principe di Corleone 2015 Bianca di Corte Grillo (Sicilia)',
 'description': {'original': 'Aromas of yellow flower, citrus and dried herb float out of the glass. The palate is a bit on the lean side, offering lime, grilled sage and bitter almond alongside zesty acidity.',
  'cleaned': 'aromas yellow flower citrus dried herb float glass palate bit lean side offering lime grilled sage bitter almond alongside zesty acidity'},
 'taster_name': 'Kerin O’Keefe',
 'taster_twitter_handle': '@kerinokeefe',
 'price': 13.0,
 'designation': 'Bianca di Corte',
 'variety': 'Grillo',
 'region_1': 'Sicilia',
 'region_2': None,
 'province': 'Sicily & Sardinia',
 'country': 'Italy',
 'winery': 'Principe di Corleone'}

#### Nome do Vinho

In [97]:
wine = wine_recommender.wine_dict[1].title

In [98]:
wine_recommender.wine_dict[1]

{
  "points": 87,
  "title": "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
  "description": {
    "original": "This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red berry fruits and freshened with acidity. It's  already drinkable, although it will certainly be better from 2016.",
    "cleaned": "ripe fruity wine smooth still structured firm tannins filled juicy red berry fruits freshened acidity already drinkable although certainly better "
  },
  "taster_name": "Roger Voss",
  "taster_twitter_handle": "@vossroger",
  "price": 15.0,
  "designation": "Avidagos",
  "variety": "Portuguese Red",
  "region_1": null,
  "region_2": null,
  "province": "Douro",
  "country": "Portugal",
  "winery": "Quinta dos Avidagos"
}

In [99]:
wine_recommended = wine_recommender.recommend(wine)

In [100]:
import json
json.loads(wine_recommended)

{'points': 86,
 'title': 'Château Fortin Plaisance 2012  Saint-Émilion',
 'description': {'original': "This ripe and fruity wine has an edge of tannins to give it structure. It's full of juicy black fruits, rich berry flavors and fresh acidity. The wine will age as the tannins round out further, so drink from 2018.",
  'cleaned': 'ripe fruity wine edge tannins give structure full juicy black fruits rich berry flavors fresh acidity wine age tannins round further drink '},
 'taster_name': 'Roger Voss',
 'taster_twitter_handle': '@vossroger',
 'price': 22.0,
 'designation': None,
 'variety': 'Bordeaux-style Red Blend',
 'region_1': 'Saint-Émilion',
 'region_2': None,
 'province': 'Bordeaux',
 'country': 'France',
 'winery': 'Château Fortin Plaisance'}

#### Recomenando por textos

Uma implementação foi considerar que, dado um texto, ele pode representar
um vinho que você queira.

In [101]:
json.loads(wine_recommender.recommend_by_description('oak coffee vanilla'))

  words_embeddings.append(self._wine2vec.model[w])


{'points': 85,
 'title': 'Angeli di Varano 2010 Riserva Stile Libero  (Conero)',
 'description': {'original': 'Notes of toasted oak and vanilla are the opening notes of this wine. The palate has ripe plum and blackberry that are overwhelmed by oak, espresso, coffee and sweet vanilla bean. Too much oak and too little fruit.',
  'cleaned': 'notes toasted oak vanilla opening notes wine palate ripe plum blackberry overwhelmed oak espresso coffee sweet vanilla bean much oak little fruit'},
 'taster_name': 'Kerin O’Keefe',
 'taster_twitter_handle': '@kerinokeefe',
 'price': 32.0,
 'designation': 'Riserva Stile Libero',
 'variety': 'Montepulciano',
 'region_1': 'Conero',
 'region_2': None,
 'province': 'Central Italy',
 'country': 'Italy',
 'winery': 'Angeli di Varano'}

In [None]:
del wine, wine_recommended


# [Parte 3] Wine Classifier


Vimos que é possível recomendar vinhos de acordo com vinhos
que são parecidos por suas descrições. Então, será que é possível
determinar um tipo de vinho que uma pessoa quer dar uma descrição?

Para ver isso, vamos montar um classificador de vinhos usando o wine2vec
com um modelo.

## Escolhendo as variedades

Como vimos na primeira parte, há mais de 700 variedades de vinhos
nesse dataset, e muitos deles com menos de 200 itens. Então, por
simplificação, iremos considerar somente a top 7 variedades (o que
cobre mais de 50% do dataset)


In [61]:
from copy import deepcopy

import numpy as np
from dask.array import from_array
from keras_preprocessing.sequence import pad_sequences
from keras_preprocessing.text import Tokenizer

from models.wine_dataset import WineDataSet


class WineMLDataGetter:
    def __init__(self, wine_dataset: WineDataSet, max_len, topn_varieties: int = 7, balance_class=False):
        filter_list = wine_dataset.varieties_count()['variety'][:topn_varieties].tolist()
        filtered_df = wine_dataset.data[wine_dataset.data['variety'].isin(filter_list)]

        if balance_class:
            aux_df = deepcopy(filtered_df)
            d = aux_df.groupby('variety')
            d = d.apply(lambda x: x.sample(d.size().min()).reset_index(drop=True))
            d = d.reset_index(drop=True)
            filtered_df = d
            del aux_df, d

        wine_embeddings_filter = filtered_df.index.values

        self._varieties_list = from_array(filter_list)
        self._wine_embeddings_filter = from_array(wine_embeddings_filter)

        self._variety2index = {variety: index for index, variety in enumerate(filter_list)}
        self._index2variety = {index: variety for index, variety in enumerate(filter_list)}
        # self._X = wine_embeddings[wine_embeddings_filter].compute()

        self._X = deepcopy(filtered_df['description_cleaned'].tolist())
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(self._X)
        self._X, self._X_tokenizer = tokenizer.texts_to_sequences(self._X), tokenizer
        self._X = pad_sequences(self._X, maxlen=84, padding="pre", truncating="post")

        self._index2word = self._X_tokenizer.index_word
        self._word2index = self._X_tokenizer.word_index

        self._index2word.update({0: 'pad'})
        self._word2index.update({'pad': 0})

        self._Y = deepcopy(filtered_df['variety'])
        self._Y.replace(self._variety2index, inplace=True)
        self._Y = np.array(self._Y.tolist())

    @property
    def word2index(self):
        return self._word2index

    @property
    def index2word(self):
        return self._index2word

    @property
    def X(self):
        return self._X

    @property
    def Y(self):
        return self._Y

In [102]:
wine_data_getter = WineMLDataGetter(
        wine_dataset=wine_recommender.wine_dataset,
        max_len=wine_recommender.wine_corpus.max_len,
        topn_varieties=7,
        balance_class=False
)

## Separando em Treino, Teste e Validação

Usaremos uma proporção 80(20):20 para gerar e validar nosso modelo.

In [63]:
from datetime import datetime

from sklearn.model_selection import train_test_split


class TrainTestAndValidation:
    def __init__(self, X, Y, test_size, valid_size):
        TEST_SIZE = test_size
        X_train, X_test, Y_train, Y_test = train_test_split(
            X,
            Y,
            test_size=TEST_SIZE,
            random_state=4,
            stratify=Y
        )
        VALID_SIZE = valid_size
        X_train, X_validation, Y_train, Y_validation = train_test_split(
            X_train,
            Y_train,
            test_size=VALID_SIZE,
            random_state=4,
            stratify=Y_train
        )
        print(f'{datetime.now()} - TRAINING DATA')
        print('Shape of input sequences: {}'.format(X_train.shape))
        print('Shape of output sequences: {}'.format(len(Y_train)))
        print("-" * 50)
        print(f'{datetime.now()} - VALIDATION DATA')
        print('Shape of input sequences: {}'.format(X_validation.shape))
        print('Shape of output sequences: {}'.format(len(Y_validation)))
        print("-" * 50)
        print(f'{datetime.now()} - TESTING DATA')
        print('Shape of input sequences: {}'.format(X_test.shape))
        print('Shape of output sequences: {}'.format(len(Y_test)))

        self._X_train, self._X_test, self._Y_train, self._Y_test = X_train, X_test, Y_train, Y_test
        self._X_validation, self._Y_validation = X_validation, Y_validation

    @property
    def X_train(self):
        return self._X_train

    @property
    def X_test(self):
        return self._X_test

    @property
    def X_validation(self):
        return self._X_validation

    @property
    def Y_train(self):
        return self._Y_train

    @property
    def Y_test(self):
        return self._Y_test

    @property
    def Y_validation(self):
        return self._Y_validation

In [103]:
train_test_and_validation = TrainTestAndValidation(
    wine_data_getter.X,
    wine_data_getter.Y,
    test_size=0.2,
    valid_size=0.2
)

2021-03-25 20:21:10.640594 - TRAINING DATA
Shape of input sequences: (38728, 84)
Shape of output sequences: 38728
--------------------------------------------------
2021-03-25 20:21:10.640755 - VALIDATION DATA
Shape of input sequences: (9683, 84)
Shape of output sequences: 9683
--------------------------------------------------
2021-03-25 20:21:10.640813 - TESTING DATA
Shape of input sequences: (12103, 84)
Shape of output sequences: 12103


## O Classificador


Para o classificador, usaremos um modelo LSTM Bidirecional, com 5 camadas
todas com ativação relu. Foi colocado um early_stopping em caso de ocorrer over-fitting
antes das épocas definidas

In [65]:
import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as L
from tensorflow.python.keras.callbacks import EarlyStopping
from tensorflow.python.keras.layers import Embedding
from tensorflow.python.keras.losses import SparseCategoricalCrossentropy
from tensorflow.python.keras.models import load_model
from tqdm import tqdm


class WineClassifier:
    def __init__(self, vocabulary_size, embedding_size, max_seq_length, embedding_weights, num_classes):
        VOCABULARY_SIZE = vocabulary_size
        EMBEDDING_SIZE = embedding_size
        MAX_SEQ_LENGTH = max_seq_length
        embedding_weights = embedding_weights
        NUM_CLASSES = num_classes

        import tensorflow.python.keras.activations as a
        self._model = tf.keras.Sequential([
            Embedding(
                input_dim=VOCABULARY_SIZE,
                output_dim=EMBEDDING_SIZE,
                input_length=MAX_SEQ_LENGTH,
                weights=[embedding_weights],
                trainable=True
            ),
            L.Bidirectional(L.LSTM(64, return_sequences=False)),
            L.Dense(64, activation=a.relu),
            L.Dense(32, activation=a.relu),
            L.Dense(32, activation=a.relu),
            L.Dense(16, activation=a.relu),
            L.Dense(NUM_CLASSES, activation=a.softmax)
        ])

        self._model.compile(
            loss=SparseCategoricalCrossentropy(),
            optimizer='adam',
            metrics=['accuracy']
        )
        self._results = None
        # self._model.summary()

    @property
    def model(self):
        return self._model

    def fit(self, X_train, y_train, epochs, batch_size, X_validation, y_validation):
        early_stopping = EarlyStopping()
        self._results = self._model.fit(
            X_train.reshape(*X_train.shape, 1),
            np.array(y_train),
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(X_validation.reshape(*X_validation.shape, 1), np.array(y_validation)),
            callbacks=[early_stopping]
        )

    def load(self, load_path: str):
        self._model = load_model(load_path)

    def save(self, save_path: str):
        self._model.save(save_path)

    def evaluate(self, X_test, y_test):
        self._model.evaluate(
            X_test,
            y_test
        )

    def predict(self, X_test):
        return self._model.predict_classes(X_test)


def emb_weights(word2vec, word2index, vocabulary_size, embedding_size):
    VOCABULARY_SIZE = vocabulary_size
    EMBEDDING_SIZE = embedding_size
    word2id = word2index
    embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))
    for word, index in tqdm(word2id.items()):
        try:
            embedding_weights[index, :] = word2vec[word]
        except KeyError:
            pass
    return embedding_weights

In [104]:
vocabulary_size = len(wine_data_getter.word2index) + 1
embedding_size = 300

wine_classifier = WineClassifier(
    vocabulary_size=vocabulary_size,
    embedding_size=embedding_size,
    max_seq_length=wine_recommender.wine_corpus.max_len,
    embedding_weights=emb_weights(
        word2vec=wine_recommender.wine2vec.model,
        word2index=wine_data_getter.word2index,
        vocabulary_size=vocabulary_size,
        embedding_size=embedding_size
    ),
    num_classes=len(set(train_test_and_validation.Y_test))
)

  embedding_weights[index, :] = word2vec[word]
100%|██████████| 23567/23567 [00:00<00:00, 64427.15it/s]


## Treinando o modelo

In [102]:
wine_classifier.fit(
    X_train=train_test_and_validation.X_train,
    y_train=train_test_and_validation.Y_train,
    epochs=40,
    batch_size=100,
    X_validation=train_test_and_validation.X_validation,
    y_validation=train_test_and_validation.Y_validation
)

wine_classifier.save('wine_classifider_model_100b_40e')

2021-03-25 19:21:31,698 : INFO : Assets written to: wine_classifider_model_100b_40e_balanced/assets


Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40


In [105]:
wine_classifier.load('wine_classifider_model_100b_40e')

## Validando o modelo

In [106]:
wine_classifier.evaluate(train_test_and_validation.X_test, train_test_and_validation.Y_test)





In [107]:
Y_pred = wine_classifier.predict(train_test_and_validation.X_test)




### Funções para imprimir tabelas

In [70]:
def overall_accuracy(cm):
    hits = np.diag(cm).sum()
    all_data = cm.sum()
    print("Hits: {}".format(hits))
    print("Total Data: {}".format(all_data))
    print("Percentage Accuracy: {:.2f}%".format((hits / all_data) * 100))
    print("\n")


def overall_accuracy_by_class(cm, tags):
    from texttable import Texttable

    diag_values = np.diag(cm / cm.astype(np.float).sum(axis=1))
    data_list = [['Variety', 'Value (%)']]
    for index, result in enumerate(diag_values):
        data_list.append([tags[index], "{:.2f}".format(result * 100)])
    t = Texttable(1000)
    t.add_rows(data_list)
    t.set_cols_align(["l", "r"])
    print(t.draw())


def print_confusion_matrix(cm, tags):
    from texttable import Texttable

    len_cm = len(cm)
    cm_aux = cm / cm.astype(np.float).sum(axis=1) * 100
    tags_names = ['Variety'] + list(tags)
    t = Texttable(1000)
    data_list = [tags_names]
    for index, result in enumerate(cm_aux):
        data_list.append([tags[index]] + list(cm_aux[index]))
    t.add_rows(data_list)
    print(t.draw())

In [108]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(
        y_true=train_test_and_validation.Y_test,
        y_pred=Y_pred
    )

In [109]:
overall_accuracy(cm)


Hits: 10290
Total Data: 12103
Percentage Accuracy: 85.02%




In [110]:
print_confusion_matrix(cm, wine_data_getter._varieties_list.compute())

+--------------------------+------------+------------+--------------------+-----------+--------------------------+----------+-----------------+
|         Variety          | Pinot Noir | Chardonnay | Cabernet Sauvignon | Red Blend | Bordeaux-style Red Blend | Riesling | Sauvignon Blanc |
| Pinot Noir               | 85.989     | 1.531      | 8.448              | 4.080     | 7.014                    | 0.289    | 0.302           |
+--------------------------+------------+------------+--------------------+-----------+--------------------------+----------+-----------------+
| Chardonnay               | 0.753      | 94.258     | 0.053              | 0.279     | 0.868                    | 5.395    | 4.129           |
+--------------------------+------------+------------+--------------------+-----------+--------------------------+----------+-----------------+
| Cabernet Sauvignon       | 4.331      | 0.298      | 83.369             | 8.720     | 2.531                    | 0.193    | 0         

In [111]:
overall_accuracy_by_class(cm, wine_data_getter._varieties_list.compute())

+--------------------------+-----------+
|         Variety          | Value (%) |
| Pinot Noir               |    85.990 |
+--------------------------+-----------+
| Chardonnay               |    94.260 |
+--------------------------+-----------+
| Cabernet Sauvignon       |    83.370 |
+--------------------------+-----------+
| Red Blend                |    83.010 |
+--------------------------+-----------+
| Bordeaux-style Red Blend |    80.550 |
+--------------------------+-----------+
| Riesling                 |    89.790 |
+--------------------------+-----------+
| Sauvignon Blanc          |    68.580 |
+--------------------------+-----------+


# [Parte 4] Conclusões e resultados

Notamos que tivemos resultados agradáveis para o recomendador. Usar a média das
palavras para representar um vinho foi bom, mas há outras técnicas a serem exploradas
como o Doc2Vec. Também faltou colocar outras informações (preço, rating, país/região, tipo).
No mais, os resultados foram bons para esse modelo.

Para o classificador, ele teve uma média geral de 85%, bem animador. O maior problema
foi classificado o Sauvgnon Blan. Talvez por esse ser uma classe com poucas entidades
comparadas aos demais. No mais, está dentro do estado da arte, apesar de
mudar um pouco no que é usado para classificar o vinho. Vi muitos lugares
considerando características físicas e quimicas deles. Aqui, usamos a descrição
do sabor.
