### 2. Title Recommender

- Follows the same approach as the wine recommendation by variety
- Content-based filtering using cosine similarity
- Uses the title (name of the wine) and the description (review)
- *CountVectorizer*, *TfidfTransformer*
    - Get the commonly used words in the reviews of the same title
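
For reference, the cosine similarity between two description vectors $\mathbf{a}$ and $\mathbf{b}$ (for example, the TF-IDF vectors of two wines) is the standard definition below. The symbols are introduced here for illustration only.

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
$$

Because TF-IDF entries are non-negative, the value lies between 0 (no shared terms) and 1 (identical term profile).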

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

import matplotlib.pyplot as plt
from matplotlib import rcParams
import os
import seaborn as sns

In [2]:
# Dataset
wines2 = pd.read_csv('datasets/winemag-data-130k-v2.csv')
wines2.sample(5)  # briefly check the data

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
121397,121397,US,This wine has a severity of sour cherry and ye...,,85,18.0,California,Dry Creek Valley,Sonoma,Virginie Boone,@vboone,Haraszthy 2013 Zinfandel (Dry Creek Valley),Zinfandel,Haraszthy
127989,127989,Portugal,"A structured, finely balanced wine that has in...",Mural,87,11.0,Douro,,,Roger Voss,@vossroger,Quinta do Portal 2013 Mural Red (Douro),Portuguese Red,Quinta do Portal
70728,70728,Portugal,This wine is full and richly fruity. It has a ...,Dory,87,12.0,Lisboa,,,Roger Voss,@vossroger,Adega Mãe 2015 Dory White (Lisboa),Portuguese White,Adega Mãe
54517,54517,US,This wine conveys a powerful sense of extracti...,Huber Vineyard,90,39.0,California,Sta. Rita Hills,Central Coast,Matt Kettmann,@mattkettmann,Pali 2014 Huber Vineyard Chardonnay (Sta. Rita...,Chardonnay,Pali
23711,23711,US,Fruity and slightly candied flavors make this ...,Zin Nymph Sophia Favaloro Vineyard White,87,18.0,California,Contra Costa County,Central Coast,Jim Gordon,@gordone_cellars,Rock Wall 2016 Zin Nymph Sophia Favaloro Viney...,Zinfandel,Rock Wall


In [13]:
title_df = wines2[["title", "variety", "points", "price", "description"]]

# Drop duplicated or NAs
title_df = title_df.drop_duplicates().dropna()
title_df.head(10)

Unnamed: 0,title,variety,points,price,description
1,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,87,15.0,"This is ripe and fruity, a wine that is smooth..."
2,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,87,14.0,"Tart and snappy, the flavors of lime flesh and..."
3,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,87,13.0,"Pineapple rind, lemon pith and orange blossom ..."
4,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,87,65.0,"Much like the regular bottling from 2012, this..."
5,Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...,Tempranillo-Merlot,87,15.0,Blackberry and raspberry aromas show a typical...
6,Terre di Giurfo 2013 Belsito Frappato (Vittoria),Frappato,87,16.0,"Here's a bright, informal red that opens with ..."
7,Trimbach 2012 Gewurztraminer (Alsace),Gewürztraminer,87,24.0,This dry and restrained wine offers spice in p...
8,Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...,Gewürztraminer,87,12.0,Savory dried thyme notes accent sunnier flavor...
9,Jean-Baptiste Adam 2012 Les Natures Pinot Gris...,Pinot Gris,87,27.0,This has great depth of flavor with its fresh ...
10,Kirkland Signature 2011 Mountain Cuvée Caberne...,Cabernet Sauvignon,87,19.0,"Soft, supple plum envelopes an oaky structure ..."


In [14]:
title_df.shape

(111592, 5)

In [15]:
# Number of unique titles
len(title_df["title"].unique().tolist())

110637

In [18]:
# Count the number of reviews per wine title
title_rev_num = title_df["title"].value_counts()
rev_num_df = pd.DataFrame({"title":title_rev_num.index, 'rev_num':title_rev_num.values})

# Separate titles that have more than one review from titles that have exactly one review
multi_rev_title = rev_num_df[(rev_num_df["rev_num"]>1)]["title"].tolist()
one_rev_title = rev_num_df[(rev_num_df["rev_num"]==1)]["title"].tolist()

In [19]:
# Set index
title_df = title_df.set_index("title")
title_df

title,variety,points,price,description
Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,87,15.0,"This is ripe and fruity, a wine that is smooth..."
Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,87,14.0,"Tart and snappy, the flavors of lime flesh and..."
St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore),Riesling,87,13.0,"Pineapple rind, lemon pith and orange blossom ..."
Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley),Pinot Noir,87,65.0,"Much like the regular bottling from 2012, this..."
Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra),Tempranillo-Merlot,87,15.0,Blackberry and raspberry aromas show a typical...
...,...,...,...,...
Dr. H. Thanisch (Erben Müller-Burggraef) 2013 Brauneberger Juffer-Sonnenuhr Spätlese Riesling (Mosel),Riesling,90,28.0,Notes of honeysuckle and cantaloupe sweeten th...
Citation 2004 Pinot Noir (Oregon),Pinot Noir,90,75.0,Citation is given as much as a decade of bottl...
Domaine Gresser 2013 Kritt Gewurztraminer (Alsace),Gewürztraminer,90,30.0,Well-drained gravel soil gives this wine its c...
Domaine Marcel Deiss 2012 Pinot Gris (Alsace),Pinot Gris,90,32.0,"A dry style of Pinot Gris, this is crisp with ..."


In [65]:
title_df.loc["Force Majeure 2014 Parabellum Red (Red Mountain)"]

title,variety,points,price,description
Force Majeure 2014 Parabellum Red (Red Mountain),Bordeaux-style Red Blend,91,55.0,"This wine is a blend of Merlot (42%), Cabernet..."
Force Majeure 2014 Parabellum Red (Red Mountain),Rhône-style Red Blend,92,45.0,This wine is 61% Syrah and 39% Mourvèdre. Appe...


In [75]:
# TfidfTransformer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer


title_df2 = pd.DataFrame(columns = ["title", "variety", "points", "price","description"])

# CountVectorizer object: English stop words removed, bigrams only (ngram_range=(2, 2))
cv = CountVectorizer(stop_words = "english", ngram_range=(2,2))

# TfidfTransformer object
tfidf_transformer = TfidfTransformer(smooth_idf = True, use_idf = True)

for title in multi_rev_title:
    df = title_df.loc[[title]]
    
    # Bigram counts for the reviews of a specific title
    word_count = cv.fit_transform(df["description"])
    
    # Compute IDF values
    tfidf_transformer.fit(word_count)
    
    # Get the 100 most common bigrams (lowest IDF values) used in the reviews
    df_idf = pd.DataFrame(tfidf_transformer.idf_, index = cv.get_feature_names_out(), columns = ["idf_weights"])
    df_idf.sort_values(by = ["idf_weights"], inplace = True)
    
    # Put the 100 most common bigrams in a list
    common_words = df_idf.iloc[:100].index.tolist()
    
    # Convert the list to a string and build the row for this title
    common_words_str = ", ".join(common_words)
    # Keep variety, points, and price from the second review (index 1) of this title
    variety_lst = df.loc[title,"variety"].tolist()
    points_lst = df.loc[title, "points"].tolist()
    price_lst = df.loc[title,"price"].tolist()
    new_row = {"title": title, "variety": variety_lst[1], "points": points_lst[1], "price": price_lst[1], "description": common_words_str}
    
    title_df2 = pd.concat([title_df2, pd.DataFrame([new_row])], ignore_index=True)

In [77]:
# Set index again
title_df2 = title_df2.set_index("title")
title_df2

title,variety,points,price,description
Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma County),Sparkling Blend,88,22.0,"sparkling wine, pinot noir, bread dough, green..."
Segura Viudas NV Aria Estate Extra Dry Sparkling (Cava),Sparkling Blend,81,14.0,"extra dry, sweet extra, white fruit, acidic ed..."
Segura Viudas NV Extra Dry Sparkling (Cava),Sparkling Blend,87,10.0,"lemon lime, extra dry, dry apple, powdered sug..."
Bailly-Lapierre NV Brut (Crémant de Bourgogne),Chardonnay,91,25.0,"bottle age, pinot noir, north burgundy, blanc ..."
J Vineyards & Winery NV Brut Rosé Sparkling (Russian River Valley),Sparkling Blend,90,45.0,"abundant white, rough scoury, rough mouth, ros..."
...,...,...,...,...
Joseph Drouhin 2012 Premier Cru (Chambolle-Musigny),Pinot Noir,93,113.0,"acidity having, packed acidity, parcels premie..."
Force Majeure 2014 Parabellum Red (Red Mountain),Rhône-style Red Blend,92,45.0,"35 cabernet, lead bodied, lead bold, likely in..."
Mumm Napa NV Blanc de Blancs Sparkling (Napa Valley),Sparkling Blend,84,22.0,"pinot gris, 90 chardonnay, sugary orange, spar..."
L'Antica Quercia 2007 Arió Extra Dry (Prosecco di Conegliano),Prosecco,86,18.0,"2007 vintage, mineral nose, mouth grapes, nose..."


In [80]:
# Append the titles that have only one review, keeping their original descriptions
title_df2 = pd.concat([title_df2, title_df.loc[one_rev_title]])
title_df2

title,variety,points,price,description
Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma County),Sparkling Blend,88,22.0,"sparkling wine, pinot noir, bread dough, green..."
Segura Viudas NV Aria Estate Extra Dry Sparkling (Cava),Sparkling Blend,81,14.0,"extra dry, sweet extra, white fruit, acidic ed..."
Segura Viudas NV Extra Dry Sparkling (Cava),Sparkling Blend,87,10.0,"lemon lime, extra dry, dry apple, powdered sug..."
Bailly-Lapierre NV Brut (Crémant de Bourgogne),Chardonnay,91,25.0,"bottle age, pinot noir, north burgundy, blanc ..."
J Vineyards & Winery NV Brut Rosé Sparkling (Russian River Valley),Sparkling Blend,90,45.0,"abundant white, rough scoury, rough mouth, ros..."
...,...,...,...,...
Rijk's 2009 Touch of Oak Shiraz (Tulbagh),Shiraz,86,40.0,"Although the label reads “Touch of Oak,” this ..."
Punch 2008 Tokalon Vineyard-Eastridge Vineyard Cabernet Sauvignon (California),Cabernet Sauvignon,86,75.0,"A middle-of-the-road Cabernet, with ripe black..."
Protopapas 2010 Chardonnay (Pageon),Chardonnay,86,17.0,"Aromas of pineapple, orange and lemon lead thi..."
Pietrafitta 2009 Vernaccia di San Gimignano,Vernaccia,86,16.0,"Making wine since 961, Pietrafitta presents a ..."


In [None]:
# TF-IDF = TF * IDF
# The score is high when a term appears frequently in a description (high TF)
# but is rare across all descriptions (high IDF)
from sklearn.feature_extraction.text import TfidfVectorizer

# Bigrams only, with English stop words removed
tfidf = TfidfVectorizer(ngram_range=(2,2), stop_words = "english")

# Count the bigrams in each description, compute IDF, and multiply TF by IDF
tfidf_matrix = tfidf.fit_transform(title_df2["description"])

# The text is now a sparse matrix of TF-IDF values: one row per title, one column per bigram
tfidf_matrix.shape
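
A minimal sketch of how the cosine-similarity step could use a TF-IDF matrix like `tfidf_matrix`, shown on a small made-up catalogue. The `toy_df` data and the `recommend_by_title` helper are illustrative assumptions, not the notebook's own code, and the sketch assumes each title appears only once in the index. Because `TfidfVectorizer` L2-normalizes each row by default, `linear_kernel` (a plain dot product) gives the same values as cosine similarity.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical catalogue indexed by title (not from the dataset)
toy_df = pd.DataFrame(
    {
        "description": [
            "black cherry, firm tannins, vanilla oak",
            "black cherry, firm tannins, dark chocolate",
            "lemon zest, green apple, crisp acidity",
            "green apple, crisp acidity, white peach",
        ]
    },
    index=pd.Index(["Red A", "Red B", "White A", "White B"], name="title"),
)

toy_tfidf = TfidfVectorizer(ngram_range=(2, 2), stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_df["description"])

# Rows are L2-normalized, so the dot product equals cosine similarity
toy_sim = linear_kernel(toy_matrix, toy_matrix)


def recommend_by_title(title, df, sim, top_n=1):
    """Return the top_n titles most similar to `title`, excluding itself."""
    pos = df.index.get_loc(title)  # integer position, assuming unique titles
    scores = pd.Series(sim[pos], index=df.index).drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend_by_title("Red A", toy_df, toy_sim))
```

On the full dataset the dense similarity matrix would have one row and one column per title, so computing similarities for a single query row at a time may be more practical than building the whole matrix.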