# Wine Reviews Recommendation Systems

**Prepared by Elizabeth Webster**

*November 2022*

## Overview

This notebook builds a recommendation system for Wine Enthusiast's tasters using the Surprise library.

## Business Problem

This project is being prepared for a small winery in Walla Walla. They are just starting out and currently produce only a few wines. Their winemaker wants insight into how to make wines that will be rated highly.

In this section of the project, I will create a recommendation system for Wine Enthusiast's tasters in order to understand which wines are most often highly recommended.

## Dataset

The data comes from Wine Enthusiast and contains roughly 130,000 wine reviews (129,971 rows). Each review includes a description, variety, winery, country, taster name, and other fields.

# Data Understanding

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, SVDpp, SlopeOne, NMF 
from surprise.prediction_algorithms import NormalPredictor, KNNWithZScore, BaselineOnly
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read as UTF-8 so curly apostrophes and accented characters in names survive.
df = pd.read_csv('Data/winemag-data-130k-v2.csv.zip', encoding='utf-8', index_col=0)
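
The `encoding` argument matters for the taster names, some of which contain non-ASCII characters. The sketch below assumes the CSV is stored as UTF-8 (inferred from the pattern of garbled characters, not confirmed by the source) and shows how decoding those bytes as Latin-1 turns one curly apostrophe into three characters, and how the damage can be reversed.

```python
# A taster name from the dataset; U+2019 is the curly apostrophe.
name = "Kerin O\u2019Keefe"

# Decoding the UTF-8 bytes as Latin-1 splits the 3-byte apostrophe
# into three separate characters.
garbled = name.encode("utf-8").decode("latin-1")

# Reversing the two steps restores the original text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

Because the garbled and clean spellings are different strings, a taster looked up by name will only be found if the spelling matches the one produced at load time.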

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


In [4]:
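# Keep only the columns the recommender needs.
# Rows with a missing taster_name are kept, so NaN later shows up as a user of its own.
rec_df = df[['points', 'taster_name', 'variety']]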
rec_df.head()

|   | points | taster_name | variety |
|---|--------|-------------|---------|
| 0 | 87 | Kerin O’Keefe | White Blend |
| 1 | 87 | Roger Voss | Portuguese Red |
| 2 | 87 | Paul Gregutt | Pinot Gris |
| 3 | 87 | Alexander Peartree | Riesling |
| 4 | 87 | Paul Gregutt | Pinot Noir |


In [5]:
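# Average points per taster; select 'points' explicitly so the text column is not averaged.
rec_df.groupby('taster_name')[['points']].mean()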

| taster_name | points |
|-------------|--------|
| Alexander Peartree | 85.855422 |
| Anna Lee C. Iijima | 88.415629 |
| Anne Krebiehl MW | 90.562551 |
| Carrie Dykes | 86.395683 |
| Christina Pickard | 87.833333 |
| Fiona Adams | 86.888889 |
| Jeff Jenssen | 88.319756 |
| Jim Gordon | 88.626287 |
| Joe Czerwinski | 88.536235 |
| Kerin O’Keefe | 88.867947 |

In [6]:
reader = Reader(rating_scale=(80,100))
data = Dataset.load_from_df(rec_df[['taster_name', 'variety', 'points']],reader)

In [8]:
trainset = data.build_full_trainset()
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  20 

Number of items:  708


In [32]:
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [33]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 2.7856298788014238, 'mae': 2.248002500016892}
{'rmse': {'n_factors': 100, 'reg_all': 0.02}, 'mae': {'n_factors': 50, 'reg_all': 0.02}}
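
The two scores reported by the grid search are averages over the cross-validation folds. As a reminder of what they measure, here is a minimal sketch of RMSE and MAE on a handful of hypothetical point scores (the numbers are invented for illustration, not taken from the dataset).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true scores and model estimates on the 80-100 scale.
actual = [87, 90, 92, 85]
predicted = [88.0, 89.0, 90.0, 88.0]

print(round(rmse(actual, predicted), 4))
print(mae(actual, predicted))
```

Read this way, an RMSE of about 2.79 on the 80 to 100 scale means the model's estimates are typically off by a little under three points.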


In [34]:
# cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [35]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([2.81895081, 2.81854303, 2.82900165, 2.83455793, 2.80618284]))
('test_mae', array([2.27421766, 2.27431611, 2.27991496, 2.28361643, 2.26902035]))
('fit_time', (7.812848091125488, 9.4933922290802, 9.550081968307495, 9.52426290512085, 9.5621337890625))
('test_time', (43.87547063827515, 42.92222285270691, 43.07898497581482, 43.33851885795593, 43.352028131484985))
-----------------------
2.8214472528607524
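
The KNN models above rank neighbours by Pearson similarity between tasters. The following is a simplified sketch of that idea, computed over the varieties two tasters have both scored. It is not Surprise's implementation, and the scores are hypothetical.

```python
import math

def pearson(ratings_u, ratings_v):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure a trend
    xs = [ratings_u[i] for i in common]
    ys = [ratings_v[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant rater has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores: B rates two points higher than A but agrees on the order.
taster_a = {"Pinot Noir": 90, "Riesling": 88, "Malbec": 86}
taster_b = {"Pinot Noir": 92, "Riesling": 90, "Malbec": 88, "Syrah": 91}
taster_c = {"Pinot Noir": 86, "Riesling": 88, "Malbec": 90}

print(pearson(taster_a, taster_b))
print(pearson(taster_a, taster_c))
```

Pearson similarity ignores a taster's overall generosity and looks only at whether two tasters move in the same direction, which is why A and B count as perfectly similar despite the two-point offset.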


In [36]:
# cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [37]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([2.75913742, 2.76863796, 2.78020751, 2.78121568, 2.77353012]))
('test_mae', array([2.20994114, 2.2158177 , 2.22625775, 2.22439083, 2.21243304]))
('fit_time', (8.161241054534912, 8.349957942962646, 8.331696033477783, 8.293354988098145, 8.450613975524902))
('test_time', (42.2398579120636, 42.06733298301697, 42.256608963012695, 42.61938691139221, 42.24858474731445))


2.7725457389696295
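
To put the three models side by side, the sketch below collects the RMSE values printed above and ranks them. The comparison is only approximate, since each run used its own random cross-validation splits.

```python
from statistics import mean

# Per-fold RMSE values copied from the cross-validation output above.
knn_basic_rmse = [2.81895081, 2.81854303, 2.82900165, 2.83455793, 2.80618284]
knn_baseline_rmse = [2.75913742, 2.76863796, 2.78020751, 2.78121568, 2.77353012]

results = {
    "SVD (best grid-search score)": 2.7856298788014238,
    "KNNBasic": mean(knn_basic_rmse),
    "KNNBaseline": mean(knn_baseline_rmse),
}

# Lower RMSE is better, so sort ascending.
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")

best = min(results, key=results.get)
```

The spread between the best and worst model is about 0.05 points, so the choice of algorithm makes little practical difference on this data.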

## Making Predictions

In [9]:
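# Note: the grid search above scored best on RMSE with n_factors=100, reg_all=0.02;
# the values used here differ from that result.
svd = SVD(n_factors=50, reg_all=0.05)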
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f8b44320a00>

In [10]:
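# Note: predict() takes raw ids (taster name, variety name). The integers 2 and 4
# match no raw id, so the estimate below is the fallback for an unknown pair.
svd.predict(2,4)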

Prediction(uid=2, iid=4, r_ui=None, est=88.44713820775404, details={'was_impossible': False})

In [11]:
def predict_scores(number_of_users, variety_number):
    for number in range(number_of_users):
        print(svd.predict(number, variety_number))

In [12]:
predict_scores(10, 4)

user: 0          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 1          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 2          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 3          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 4          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 5          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 6          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 7          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 8          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
user: 9          item: 4          r_ui = None   est = 88.45   {'was_impossible': False}
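
The ten estimates above are identical, and they equal the value of 88.447 that appears elsewhere in this notebook as the global mean. Surprise's `predict` expects raw ids (here, taster and variety names), and for an unknown user or item the SVD model treats the corresponding bias and factors as zero. A toy sketch of that estimate, with hypothetical parameter values chosen only for illustration:

```python
def svd_estimate(user, item, global_mean, user_bias, item_bias,
                 user_factors, item_factors):
    """Toy biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    known_user = user in user_bias
    known_item = item in item_bias
    est = global_mean
    if known_user:
        est += user_bias[user]
    if known_item:
        est += item_bias[item]
    if known_user and known_item:
        est += sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    return est

# Hypothetical learned parameters, not values from the fitted model.
params = dict(
    global_mean=88.5,
    user_bias={"Paul Gregutt": 0.5},
    item_bias={"Pinot Noir": 0.25},
    user_factors={"Paul Gregutt": [0.5, 1.0]},
    item_factors={"Pinot Noir": [0.5, 0.25]},
)

known = svd_estimate("Paul Gregutt", "Pinot Noir", **params)
unknown = svd_estimate(2, 4, **params)  # integers are not raw ids here
print(known, unknown)
```

Passing names rather than integers, for example `svd.predict('Paul Gregutt', 'Pinot Noir')`, is what lets the learned biases and factors contribute to the estimate.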


In [13]:
testset = trainset.build_anti_testset()
predictions = svd.test(testset)

In [28]:
predictions

[Prediction(uid='Kerin O’Keefe', iid='Portuguese Red', r_ui=88.44713820775404, est=88.94096588502134, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Pinot Gris', r_ui=88.44713820775404, est=88.47080678317882, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Pinot Noir', r_ui=88.44713820775404, est=88.66026402910238, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Tempranillo-Merlot', r_ui=88.44713820775404, est=88.07238283018036, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Malbec', r_ui=88.44713820775404, est=88.86719316882518, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Tempranillo Blend', r_ui=88.44713820775404, est=89.11793521403439, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Meritage', r_ui=88.44713820775404, est=88.73338088949608, details={'was_impossible': Fa
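
In the output above, `r_ui` is the same for every row because the anti-testset contains only taster and variety pairs with no real score, and Surprise fills the missing rating with the global mean by default. A minimal sketch of that construction in plain Python (the helper name `anti_testset_sketch` is hypothetical, and the three ratings are rows from the table near the top of this notebook):

```python
def anti_testset_sketch(ratings, fill):
    """Return every (user, item) pair that has no rating, with a placeholder score."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]

ratings = [
    ("Roger Voss", "Portuguese Red", 87),
    ("Paul Gregutt", "Pinot Gris", 87),
    ("Paul Gregutt", "Pinot Noir", 87),
]

# The placeholder is the mean of the known ratings.
global_mean = sum(r for _, _, r in ratings) / len(ratings)
anti = anti_testset_sketch(ratings, fill=global_mean)

for row in anti:
    print(row)
```

Since `r_ui` is a placeholder here, only the `est` column carries information, and error metrics computed on an anti-testset would not be meaningful.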

In [24]:
def get_top_n(predictions, n=5):
    """Return a dict mapping each user id to its n highest-estimated (item, score) pairs."""

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [25]:
top_n = get_top_n(predictions, n=5)

In [26]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

Kerin O’Keefe ['Bual', 'Sangiovese Grosso', 'Tannat-Cabernet Franc', 'Sherry', 'Tinta del Pais']
Roger Voss ['Tokaji', 'Tinto Fino', 'Syrah-Viognier', 'Sangiovese Grosso', 'Sherry']
Paul Gregutt ['Bual', 'Picolit', 'Muscadelle', 'Tokaji', 'Sangiovese Grosso']
Alexander Peartree ['Picolit', 'Tinta del Pais', 'Tokaji', 'Sangiovese Grosso', 'Malbec-Petit Verdot']
Michael Schachner ['Picolit', 'Aglianico', 'Tokaji', 'Bual', 'Zibibbo']
Anna Lee C. Iijima ['Bual', 'Muscadelle', 'Malbec-Tannat', 'Picolit', 'Tannat-Cabernet Franc']
Virginie Boone ['Bual', 'Picolit', 'Muscadelle', 'Tokaji', 'Malbec-Tannat']
Matt Kettmann ['Bual', 'Tokaji', 'Muscadelle', 'Malbec-Tannat', 'Tannat-Cabernet Franc']
nan ['Tinta del Pais', 'Tokaji', 'Bual', 'Cabernet Franc-Malbec', 'Muscadelle']
Sean P. Sullivan ['Tokaji', 'Bual', 'Muscadelle', 'Petit Manseng', 'Malbec-Tannat']
Jim Gordon ['Bual', 'Tokaji', 'Malbec-Tannat', 'Muscadelle', 'Tannat-Cabernet Franc']
Joe Czerwinski ['Bual', 'Petit Manseng', 'Tannat-Cabe

In [27]:
top_n

defaultdict(list,
            {'Kerin O’Keefe': [('Bual', 91.08138579635235),
              ('Sangiovese Grosso', 90.81811147285775),
              ('Tannat-Cabernet Franc', 90.6513947561229),
              ('Sherry', 90.60036839249057),
              ('Tinta del Pais', 90.51846337604151)],
             'Roger Voss': [('Tokaji', 91.26381160648951),
              ('Tinto Fino', 90.42218655165497),
              ('Syrah-Viognier', 90.05839185019533),
              ('Sangiovese Grosso', 90.01868681017326),
              ('Sherry', 89.98900994795994)],
             'Paul Gregutt': [('Bual', 91.35990086084843),
              ('Picolit', 90.91295209243047),
              ('Muscadelle', 90.8162209775091),
              ('Tokaji', 90.75821433873958),
              ('Sangiovese Grosso', 90.5129063116403)],
             'Alexander Peartree': [('Picolit', 88.23045975356668),
              ('Tinta del Pais', 88.05246847325598),
              ('Tokaji', 88.01562920763746),
              ('S
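
Since the goal is to see which wines are most often highly recommended, one simple follow-up is to count how many tasters' top-5 lists each variety appears in. The sketch below does this for the first three tasters only, using the lists printed above. Extending it to the full `top_n` dictionary is an assumption about how one might proceed, not something the notebook does.

```python
from collections import Counter

# Top-5 varieties for three tasters, copied from the output above.
top_5 = {
    "Kerin O\u2019Keefe": ["Bual", "Sangiovese Grosso", "Tannat-Cabernet Franc",
                           "Sherry", "Tinta del Pais"],
    "Roger Voss": ["Tokaji", "Tinto Fino", "Syrah-Viognier",
                   "Sangiovese Grosso", "Sherry"],
    "Paul Gregutt": ["Bual", "Picolit", "Muscadelle", "Tokaji",
                     "Sangiovese Grosso"],
}

# Count the number of top-5 lists each variety appears in.
counts = Counter(variety for varieties in top_5.values() for variety in varieties)

for variety, n_lists in counts.most_common(4):
    print(variety, n_lists)
```

With the real `top_n` dictionary, the same count could be taken with `Counter(iid for ratings in top_n.values() for iid, _ in ratings)`.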

In [12]:
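def predict_all_scores(dataset, variety):
    """Print the SVD-estimated score each listed taster would give `variety`.

    `dataset` is the Surprise trainset the model was fitted on.
    """
    # Names must match taster_name exactly as loaded; an unknown name
    # falls back to a default estimate instead of raising an error.
    taster_list = ['Roger Voss', 'Michael Schachner', 'Kerin O’Keefe',
                   'Virginie Boone', 'Paul Gregutt', 'Matt Kettmann',
                   'Joe Czerwinski', 'Sean P. Sullivan', 'Anna Lee C. Iijima',
                   'Jim Gordon', 'Lauren Buzzeo', 'Susan Kostrzewa',
                   'Mike DeSimone', 'Jeff Jenssen', 'Alexander Peartree',
                   'Carrie Dykes', 'Fiona Adams', 'Christina Pickard']
    # Raises ValueError if the variety was never seen in training.
    dataset.to_inner_iid(variety)
    for taster in taster_list:
        # predict() expects raw ids (names), not inner ids.
        estimated_score = svd.predict(taster, variety).est
        print(taster, 'scores', variety, ':', estimated_score)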