# Project 4 - Wine Recommendation - Wine Reviews

# Assignment

## Wine Reviews - Kaggle


The Wine Reviews dataset contains the review text and the score that each expert gave to a range of wines. The author was inspired to build the dataset after watching Somm, a documentary about master sommeliers.

## Environment setup


* For this project, go to the Kaggle link and click "Download" just below it. If you do not have a Kaggle account, create one and then return to this step to download the file. Unzip the archive.
* Create the project on GitHub
* Use cookiecutter to organize the project
* Read all the material available on Kaggle to understand the data
* If needed, read other projects that used this same dataset

## Project

<ol>
    <li>Follow the CRISP-DM steps to develop the project.</li>
    <li>Examine the data and look for important observations, such as:</li>
    <ol>
        <li>Which wine is the most expensive? And the cheapest?</li>
        <li>Which expert reviewed the most wines? What is their average score?</li>
        <li>Which region has the highest-rated wines? And which has the cheapest wines?</li>
    </ol>
    <li>Build the recommendation system</li>
    <li>Write the report</li>
    <li>Make your project repository available</li>
</ol>
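The exploratory questions in the list above map naturally onto a few pandas aggregations. The following is only a minimal sketch on a made-up toy table: the rows are invented, and the column names (`taster_name` in particular) are an assumption about the Kaggle export rather than something confirmed by this notebook.

```python
import pandas as pd

# Toy stand-in for the Kaggle file; every value here is invented for illustration.
# Column names are assumed to follow the Kaggle export (taster_name is an assumption).
wines = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "taster_name": ["Ana", "Ana", "Bruno", "Ana"],
    "province": ["California", "Tuscany", "California", "Tuscany"],
    "points": [90, 85, 94, 87],
    "price": [40.0, 12.0, 120.0, 18.0],
})

# Most expensive and cheapest wine.
most_expensive = wines.loc[wines["price"].idxmax(), "title"]
cheapest = wines.loc[wines["price"].idxmin(), "title"]

# Reviewer with the most reviews, and that reviewer's mean score.
top_taster = wines["taster_name"].value_counts().idxmax()
top_taster_mean = wines.loc[wines["taster_name"] == top_taster, "points"].mean()

# Province with the highest mean score, and province with the lowest mean price.
by_province = wines.groupby("province").agg(
    mean_points=("points", "mean"),
    mean_price=("price", "mean"),
)
best_rated_province = by_province["mean_points"].idxmax()
cheapest_province = by_province["mean_price"].idxmin()
```

On the real data, rows with a missing price or reviewer would probably need to be handled before these aggregations are meaningful.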

## Exercises

Split the notebook into exploratory analysis, construction of the recommendation system, and report.
1. (3 points) **Exploratory analysis**: analyze this dataset, create charts, and write down your ideas.
2. (3 points) **Recommendation system**: build your recommendation system.
3. (4 points) **Report**: write a report in Markdown. Be creative in presenting your findings and follow the steps below:
    * Imagine that you run a startup and this is the first report you will present
    * Give your product a name
    * Introduce the problem (be concise; use few words)
    * Include charts and sentences that support your arguments
    * Show the solutions already on the market
    * Describe your solution and why it is the best

# 2. (3 points) **Recommendation system**: build your recommendation system


## Recommendation System

### Loading the required data

In [2]:
import pandas as pd

system_data = pd.read_csv('../data/processed/wines_recomendation_system.csv')
user_data = pd.read_csv('../data/processed/wines_user_consult.csv')

### Implementing the candidate models

Two models will be tested: one using TfidfVectorizer and the other using CountVectorizer.
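To make the difference between the two vectorizers concrete, here is a small sketch on a three-sentence toy corpus (the sentences are invented, not taken from the dataset). Both vectorizers produce matrices with the same shape, but CountVectorizer stores raw term counts while TfidfVectorizer down-weights terms that appear in many documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented toy corpus; "and" appears in every document, "cherry" in only one.
docs = [
    "ripe cherry and oak",
    "ripe plum and oak",
    "crisp lemon and lime",
]

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
count_matrix = count_vec.fit_transform(docs)
tfidf_matrix = tfidf_vec.fit_transform(docs)

# Both vectorizers build the same vocabulary, so the matrices share a shape.
same_shape = count_matrix.shape == tfidf_matrix.shape

# Compare the weights of a common word and a rare word in the first document.
and_col = count_vec.vocabulary_["and"]
cherry_col = count_vec.vocabulary_["cherry"]
first_counts = count_matrix[0].toarray().ravel()
first_tfidf = tfidf_matrix[0].toarray().ravel()

print("counts:", first_counts[and_col], first_counts[cherry_col])
print("tf-idf:", first_tfidf[and_col], first_tfidf[cherry_col])
```

In the count matrix both words have the same weight in the first document, whereas in the TF-IDF matrix the rarer word "cherry" receives a larger weight than "and".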

#### TfidfVectorizer implementation

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_jobid = tfidf_vectorizer.fit_transform(system_data.description)
tfidf_jobid

<117236x22352 sparse matrix of type '<class 'numpy.float64'>'
	with 2878782 stored elements in Compressed Sparse Row format>

#### CountVectorizer implementation

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_jobid = count_vectorizer.fit_transform(system_data.description)
count_jobid

<117236x22352 sparse matrix of type '<class 'numpy.int64'>'
	with 2878782 stored elements in Compressed Sparse Row format>

#### Testing the models

Test sample.

In [5]:
data_test = system_data.sample(frac = 0.01)
data_test.shape

(1172, 2)

In [6]:
from sklearn.neighbors import NearestNeighbors

# Fit a cosine-distance nearest-neighbour index on the TF-IDF matrix.
KNN_tfidf = NearestNeighbors(n_neighbors=11, metric='cosine')
KNN_tfidf.fit(tfidf_jobid)
avg_dist = 0
for row in range(len(data_test)):
    test = data_test.iloc[row]['description']
    user_tfidf = tfidf_vectorizer.transform([test])
    NNs_tfidf = KNN_tfidf.kneighbors(user_tfidf, return_distance=True)
    # Skip the first neighbour: the query comes from the fitted data, so it matches itself.
    avg_dist += NNs_tfidf[0][0][1:].mean()
avg_dist /= len(data_test)

print('Mean distance on the test set:', avg_dist)

Mean distance on the test set: 0.6809220977485865


In [7]:
# Same evaluation, this time on the raw count matrix.
KNN = NearestNeighbors(n_neighbors=11, metric='cosine')
KNN.fit(count_jobid)
avg_dist = 0
for row in range(len(data_test)):
    test = data_test.iloc[row]['description']
    user_count = count_vectorizer.transform([test])
    NNs = KNN.kneighbors(user_count, return_distance=True)
    # Skip the first neighbour: the query comes from the fitted data, so it matches itself.
    avg_dist += NNs[0][0][1:].mean()
avg_dist /= len(data_test)

print('Mean distance on the test set:', avg_dist)

Mean distance on the test set: 0.5687918682750208


Because the CountVectorizer model produced the lower mean distance on the test sample (0.569 versus 0.681 for TfidfVectorizer), it will be used as the production model.
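The comparison above can also be condensed into a single helper, since `kneighbors` accepts all test rows in one call. This is a minimal sketch on an invented toy corpus; the function name `mean_neighbor_distance` is hypothetical, and the sketch assumes every query is also present in the fitted corpus, so the closest neighbour is the query itself.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def mean_neighbor_distance(vectorizer, corpus, queries, n_neighbors):
    """Mean cosine distance from each query to its neighbours, skipping the closest one."""
    matrix = vectorizer.fit_transform(corpus)
    knn = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine").fit(matrix)
    # One call handles every query; distances has shape (n_queries, n_neighbors).
    distances, _ = knn.kneighbors(vectorizer.transform(queries))
    # Column 0 is assumed to be the query itself, because queries come from the corpus.
    return float(distances[:, 1:].mean())


# Invented toy corpus standing in for the wine descriptions.
corpus = [
    "ripe cherry and oak",
    "ripe plum and oak",
    "crisp lemon and lime",
    "fresh lemon and lime zest",
]
queries = corpus[:2]

count_score = mean_neighbor_distance(CountVectorizer(), corpus, queries, n_neighbors=3)
tfidf_score = mean_neighbor_distance(TfidfVectorizer(), corpus, queries, n_neighbors=3)
print("CountVectorizer:", count_score)
print("TfidfVectorizer:", tfidf_score)
```

Because every query contributes the same number of neighbours, the overall mean here equals the average of per-query means computed in the loops.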

Below is a test example showing how the results differ between the TfidfVectorizer model and the CountVectorizer model.

Results from the TfidfVectorizer model

In [8]:
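# NNs_tfidf still holds the neighbours of the last test description from the loop above.
answer = user_data.loc[NNs_tfidf[1][0, :]].copy()
# Despite its name, this column stores the cosine distance (0 means identical).
answer['similarity'] = NNs_tfidf[0][0, :]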
answer

Unnamed: 0,country,description,points,price,province,title,variety,winery,style,similarity
55504,US,From an estate vineyard and given very little ...,91,53.0,California,Small Vines 2013 Chardonnay (Sonoma Coast),Chardonnay,Small Vines,Light-Bodied White Wines,0.0
111139,US,There's a chalky baking soda touch to the nose...,89,14.0,California,Jekel 2015 Riesling (Monterey),Riesling,Jekel,Aromatic White Wines,0.721589
82949,US,"Vibrant in apple blossom and coconut, this vol...",90,45.0,California,Silverpoint Cellars 2013 Chardonnay (Sonoma Co...,Chardonnay,Silverpoint Cellars,Light-Bodied White Wines,0.735321
43165,US,"Fresh apple blossoms, lemon-lime soda and frag...",88,25.0,California,Darcie Kent Vineyards 2015 Rava Blackjack Vine...,Grüner Veltliner,Darcie Kent Vineyards,Light-Bodied White Wines,0.750956
95325,US,Given time in both French oak and stainless-st...,86,14.0,California,Ferrari-Carano 2014 Fumé Blanc (Sonoma County),Fumé Blanc,Ferrari-Carano,Light-Bodied White Wines,0.757356
84391,Italy,This has heady floral aromas of honeysuckle an...,88,17.0,Tuscany,Guicciardini Strozzi 2015 Villa Cusona (Verna...,Vernaccia,Guicciardini Strozzi,Full-Bodied White Wines,0.775202
63480,US,"This white invites you in with floral, citrus ...",88,35.368644,California,Moniker 2012 Chardonnay (Mendocino County),Chardonnay,Moniker,Light-Bodied White Wines,0.775815
70727,US,This is a pretty wine in terms of floral aroma...,89,34.0,California,Pine Ridge 2013 Dijon Clones Chardonnay (Carne...,Chardonnay,Pine Ridge,Light-Bodied White Wines,0.77972
54088,US,"Feral and smoky, this light-bodied, light-colo...",91,55.0,California,Small Vines 2013 Pinot Noir (Sonoma Coast),Pinot Noir,Small Vines,Light-Bodied Red Wines,0.780392
68548,South Africa,"Pretty aromas of yellow florals, wood-grilled ...",85,10.0,Western Cape,Releaf 2011 Made With Organically Grown Grapes...,Chenin Blanc,Releaf,Aromatic White Wines,0.783559


Results from the CountVectorizer model

In [9]:
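# NNs still holds the neighbours of the last test description from the loop above.
answer = user_data.loc[NNs[1][0, :]].copy()
# Despite its name, this column stores the cosine distance (0 means identical).
answer['similarity'] = NNs[0][0, :]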
answer

Unnamed: 0,country,description,points,price,province,title,variety,winery,style,similarity
55504,US,From an estate vineyard and given very little ...,91,53.0,California,Small Vines 2013 Chardonnay (Sonoma Coast),Chardonnay,Small Vines,Light-Bodied White Wines,3.330669e-16
63480,US,"This white invites you in with floral, citrus ...",88,35.368644,California,Moniker 2012 Chardonnay (Mendocino County),Chardonnay,Moniker,Light-Bodied White Wines,0.6726732
82949,US,"Vibrant in apple blossom and coconut, this vol...",90,45.0,California,Silverpoint Cellars 2013 Chardonnay (Sonoma Co...,Chardonnay,Silverpoint Cellars,Light-Bodied White Wines,0.6894705
54088,US,"Feral and smoky, this light-bodied, light-colo...",91,55.0,California,Small Vines 2013 Pinot Noir (Sonoma Coast),Pinot Noir,Small Vines,Light-Bodied Red Wines,0.7039217
21575,US,"Gravelly texture gives way to faint apple, pre...",88,42.0,California,Clif Family 2015 Chardonnay (Napa Valley),Chardonnay,Clif Family,Light-Bodied White Wines,0.7113249
70727,US,This is a pretty wine in terms of floral aroma...,89,34.0,California,Pine Ridge 2013 Dijon Clones Chardonnay (Carne...,Chardonnay,Pine Ridge,Light-Bodied White Wines,0.719255
81567,US,Grown on a family-owned vineyard and aged in 2...,90,55.0,California,Tognetti 2013 Aloise Francisco Vineyards Chard...,Chardonnay,Tognetti,Light-Bodied White Wines,0.7239738
78065,US,This vineyard-designate rocks its way into you...,93,45.0,California,Lost Canyon 2013 Morelli Lane Vineyard Pinot N...,Pinot Noir,Lost Canyon,Light-Bodied Red Wines,0.7239738
116250,US,"Stone, lime salt and lemon zest pique the entr...",90,32.0,California,Kokomo 2015 Peters Vineyard Chardonnay (Russia...,Chardonnay,Kokomo,Light-Bodied White Wines,0.7241614
90418,US,"A woody, toasted note speaks to well-integrate...",91,24.0,California,Hall 2015 Sauvignon Blanc (Napa Valley),Sauvignon Blanc,Hall,Light-Bodied White Wines,0.7284623


Final model

In [12]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer

system_data = pd.read_csv('../data/processed/wines_recomendation_system.csv')
user_data = pd.read_csv('../data/processed/wines_user_consult.csv')

# Bag-of-words counts for every wine description.
count_vectorizer = CountVectorizer()
count_jobid = count_vectorizer.fit_transform(system_data.description)

# Cosine-distance index over the count vectors.
KNN = NearestNeighbors(n_neighbors=11, metric='cosine')
KNN.fit(count_jobid)

# Use row 10 as a sample query, kept as a one-row DataFrame.
test = system_data.loc[10, :].to_frame().T

user_count = count_vectorizer.transform(test.description)
NNs = KNN.kneighbors(user_count, return_distance=True)

# Look up the neighbours in the user-facing table and attach their cosine distances.
answer = user_data.loc[NNs[1][0, :]].copy()
answer['similarity'] = NNs[0][0, :]

In [13]:
answer

Unnamed: 0,country,description,points,price,province,title,variety,winery,style,similarity
10,US,"Soft, supple plum envelopes an oaky structure ...",87,19.0,California,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature,Full-Bodied Red Wines,0.0
112230,France,"This is a caramel-flavored blend of Merlot, Ta...",85,12.0,Southwest France,Domaine de Pellehaut 2016 Harmonie de Gascogne...,Rosé,Domaine de Pellehaut,Rosé Wines,0.622036
39264,France,"A big, burly wine with attractive tannins and ...",86,15.0,Southwest France,Domaine Brunet 2009 Malbec (Cahors),Malbec,Domaine Brunet,Full-Bodied Red Wines,0.622876
93044,France,This is a soft and fruity wine. It has gentle ...,86,12.0,Southwest France,Domaine D'en Ségur 2016 Le Rosé (Côtes du Tarn),Rosé,Domaine D'en Ségur,Rosé Wines,0.650851
106762,US,"Oodles of rich, ripe fruit in this soft, gentl...",86,28.0,California,Leal Vineyards 2005 MacWilliamson Vineyard Pet...,Petite Sirah,Leal Vineyards,Full-Bodied Red Wines,0.66045
45831,Spain,"The goofy name is questionable, but the wine i...",90,25.0,Catalonia,Picolino 2012 Suprafabulicious Red (Priorat),Red Blend,Picolino,Full-Bodied Red Wines,0.661938
102621,US,Herbal peppercorn cradles a soft composition w...,88,90.0,California,Terra Valentine 2011 Marriage Red (Spring Moun...,Bordeaux-style Red Blend,Terra Valentine,Full-Bodied Red Wines,0.665748
5978,US,"A majority Cabernet Sauvignon, with 15% Merlot...",87,48.0,California,Terra Valentine 2012 Estate Grown Cabernet Sau...,Cabernet Sauvignon,Terra Valentine,Full-Bodied Red Wines,0.666377
32283,France,This dense wine has black plum and chocolate f...,91,35.368644,Bordeaux,Château Vieux Maillet 2009 Pomerol,Bordeaux-style Red Blend,Château Vieux Maillet,Full-Bodied Red Wines,0.670017
82165,Australia,"Fairly full-bodied and lush, with berry fruit ...",87,30.0,South Australia,Kangarilla Road 2008 Black St. Peters Zinfande...,Zinfandel,Kangarilla Road,Medium-Bodied Red Wines,0.670017
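The final model above queries with a row that already exists in the data. For a free-text query typed by a user, the same steps could be wrapped in a small function. The sketch below is an illustration under stated assumptions: the function name `recommend`, the toy catalog, and the `cosine_distance` column name are all hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors


def recommend(description, vectorizer, knn, catalog, k=3):
    """Return the k catalog rows closest to a free-text description."""
    query = vectorizer.transform([description])
    distances, indices = knn.kneighbors(query, n_neighbors=k)
    # kneighbors returns positional indices, so iloc is used for the lookup.
    result = catalog.iloc[indices[0]].copy()
    result["cosine_distance"] = distances[0]
    return result


# Invented toy catalog standing in for the processed wine table.
catalog = pd.DataFrame({
    "title": ["Toy Cabernet", "Toy Merlot", "Toy Riesling", "Toy Chardonnay"],
    "description": [
        "soft plum and oak with firm tannins",
        "ripe cherry and plum with silky tannins",
        "crisp lime and green apple",
        "buttery apple and toasted oak",
    ],
})

vectorizer = CountVectorizer()
knn = NearestNeighbors(n_neighbors=3, metric="cosine")
knn.fit(vectorizer.fit_transform(catalog["description"]))

picks = recommend("soft red with plum and tannins", vectorizer, knn, catalog, k=2)
print(picks[["title", "cosine_distance"]])
```

Words that the vectorizer never saw during fitting (here "red") are simply ignored at query time, so a query made up entirely of unseen words would produce an all-zero vector and meaningless distances.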


Exporting the production model

In [14]:
from joblib import dump, load

dump(KNN, '../app/classifier.joblib')
dump(count_vectorizer, '../app/vectorizer.joblib')

['../app/vectorizer.joblib']

In [15]:
# Reload both artifacts to confirm they reproduce the results of the final model.
clf = load('../app/classifier.joblib')
vetorizador = load('../app/vectorizer.joblib')

user_count_vec = vetorizador.transform(test.description)
NNs_clf = clf.kneighbors(user_count_vec, return_distance=True)
answer = user_data.loc[NNs_clf[1][0, :]]
answer['similarity'] = NNs_clf[0][0, :]
answer

Unnamed: 0,country,description,points,price,province,title,variety,winery,style,similarity
10,US,"Soft, supple plum envelopes an oaky structure ...",87,19.0,California,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,Kirkland Signature,Full-Bodied Red Wines,0.0
112230,France,"This is a caramel-flavored blend of Merlot, Ta...",85,12.0,Southwest France,Domaine de Pellehaut 2016 Harmonie de Gascogne...,Rosé,Domaine de Pellehaut,Rosé Wines,0.622036
39264,France,"A big, burly wine with attractive tannins and ...",86,15.0,Southwest France,Domaine Brunet 2009 Malbec (Cahors),Malbec,Domaine Brunet,Full-Bodied Red Wines,0.622876
93044,France,This is a soft and fruity wine. It has gentle ...,86,12.0,Southwest France,Domaine D'en Ségur 2016 Le Rosé (Côtes du Tarn),Rosé,Domaine D'en Ségur,Rosé Wines,0.650851
106762,US,"Oodles of rich, ripe fruit in this soft, gentl...",86,28.0,California,Leal Vineyards 2005 MacWilliamson Vineyard Pet...,Petite Sirah,Leal Vineyards,Full-Bodied Red Wines,0.66045
45831,Spain,"The goofy name is questionable, but the wine i...",90,25.0,Catalonia,Picolino 2012 Suprafabulicious Red (Priorat),Red Blend,Picolino,Full-Bodied Red Wines,0.661938
102621,US,Herbal peppercorn cradles a soft composition w...,88,90.0,California,Terra Valentine 2011 Marriage Red (Spring Moun...,Bordeaux-style Red Blend,Terra Valentine,Full-Bodied Red Wines,0.665748
5978,US,"A majority Cabernet Sauvignon, with 15% Merlot...",87,48.0,California,Terra Valentine 2012 Estate Grown Cabernet Sau...,Cabernet Sauvignon,Terra Valentine,Full-Bodied Red Wines,0.666377
32283,France,This dense wine has black plum and chocolate f...,91,35.368644,Bordeaux,Château Vieux Maillet 2009 Pomerol,Bordeaux-style Red Blend,Château Vieux Maillet,Full-Bodied Red Wines,0.670017
82165,Australia,"Fairly full-bodied and lush, with berry fruit ...",87,30.0,South Australia,Kangarilla Road 2008 Black St. Peters Zinfande...,Zinfandel,Kangarilla Road,Medium-Bodied Red Wines,0.670017
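The reloaded model returns the same table as the final model, which can also be checked programmatically instead of by eye. The following is a minimal sketch of such a round-trip check; it uses an invented toy corpus and a temporary directory rather than the `../app/` paths, so the file locations here are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Invented toy corpus standing in for the wine descriptions.
corpus = [
    "ripe cherry and oak",
    "ripe plum and oak",
    "crisp lemon and lime",
]
vectorizer = CountVectorizer()
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(vectorizer.fit_transform(corpus))

query = ["ripe plum and oak"]
expected_dist, expected_idx = knn.kneighbors(vectorizer.transform(query))

# Save both artifacts, then load them back from disk.
with tempfile.TemporaryDirectory() as tmp:
    knn_path = os.path.join(tmp, "classifier.joblib")
    vec_path = os.path.join(tmp, "vectorizer.joblib")
    dump(knn, knn_path)
    dump(vectorizer, vec_path)
    loaded_knn = load(knn_path)
    loaded_vectorizer = load(vec_path)

loaded_dist, loaded_idx = loaded_knn.kneighbors(loaded_vectorizer.transform(query))

# The reloaded objects should return the same neighbours and distances.
roundtrip_ok = bool(
    np.array_equal(expected_idx, loaded_idx)
    and np.allclose(expected_dist, loaded_dist)
)
print("round trip ok:", roundtrip_ok)
```

The vectorizer and the neighbour index have to be saved and loaded as a pair, since the index only makes sense for vectors produced with the same vocabulary.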
