# Wine Reviews Recommendation Systems

**Prepared by Elizabeth Webster**

*November 2022*

## Overview

Build a recommendation system for Wine Enthusiast's tasters using the Surprise library.

## Business Problem

This project is being prepared for a small winery in Walla Walla. They are just starting out and currently produce only a few wines. Their winemaker wants insight into how to make wines that will be rated highly.

In this section of the project, I will create a recommendation system for Wine Enthusiast's tasters in order to understand which wines are most often highly recommended.

## Dataset

The data that I am using comes from Wine Enthusiast and includes roughly 130,000 wine reviews (129,971 rows). Each review includes the description, variety, winery, country, taster name, and other fields.

# Data Understanding

In [73]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, SVDpp, SlopeOne, NMF 
from surprise.prediction_algorithms import NormalPredictor, KNNWithZScore, BaselineOnly
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise import accuracy
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

In [57]:
# The CSV is UTF-8 encoded; reading it as latin-1 garbles accented characters in names and titles.
df = pd.read_csv('Data/winemag-data-130k-v2.csv.zip', encoding='utf-8', index_col=0)

In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB
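The `Non-Null Count` column above implies that a sizable share of reviews has no `taster_name`, which later serves as the user id. A minimal sketch of that arithmetic, using only counts printed by `df.info()` (the helper name `missing_share` is illustrative and not part of the notebook):

```python
# Counts copied from the df.info() output above.
TOTAL_ROWS = 129971
NON_NULL = {"taster_name": 103727, "price": 120975, "designation": 92506}


def missing_share(non_null: int, total: int) -> float:
    """Return the fraction of rows in which a column is null."""
    return (total - non_null) / total


for column, count in NON_NULL.items():
    missing = TOTAL_ROWS - count
    print(f"{column}: {missing} missing ({missing_share(count, TOTAL_ROWS):.1%})")
```

About one review in five has no taster attached, so whether to drop those rows before modeling is a choice worth making deliberately.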


In [97]:
df.title.value_counts()

Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma County)                                              11
Korbel NV Brut Sparkling (California)                                                                9
Segura Viudas NV Extra Dry Sparkling (Cava)                                                          8
Ruinart NV Brut Rosé  (Champagne)                                                                    7
Segura Viudas NV Aria Estate Extra Dry Sparkling (Cava)                                              7
                                                                                                    ..
En Garde 2007 Reserve Cabernet Sauvignon (Diamond Mountain District)                                 1
Bonny Doon 2014 Pinot Doonier Sparkling (California)                                                 1
Le Farnete 2014  Carmignano                                                                          1
Santa Alicia 2014 Reserva Espiritu de Los Andes Estate Bottled Cabernet S

In [98]:
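# Keep only the rating triple: score, taster (user), and wine title (item).
# Rows with a missing taster_name are not dropped here.
rec_df = df[['points', 'taster_name', 'title']]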
rec_df.head()

|   | points | taster_name | title |
|---|--------|-------------|-------|
| 0 | 87 | Kerin O’Keefe | Nicosia 2013 Vulkà Bianco (Etna) |
| 1 | 87 | Roger Voss | Quinta dos Avidagos 2011 Avidagos Red (Douro) |
| 2 | 87 | Paul Gregutt | Rainstorm 2013 Pinot Gris (Willamette Valley) |
| 3 | 87 | Alexander Peartree | St. Julian 2013 Reserve Late Harvest Riesling ... |
| 4 | 87 | Paul Gregutt | Sweet Cheeks 2012 Vintner's Reserve Wild Child... |


In [99]:
reader = Reader(rating_scale=(80,100))
data = Dataset.load_from_df(rec_df[['taster_name', 'title', 'points']],reader)

In [100]:
trainset, testset = train_test_split(data, test_size=0.25)
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items)

Number of users:  20 

Number of items:  91106
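With only 20 users and about 91,000 items, the user-item matrix is extremely sparse, and the `value_counts` output suggests each title appears only a handful of times at most. One way to quantify this is the matrix density, sketched below. The function `rating_density` is a hypothetical helper, and the toy numbers are invented; on the real split one would pass `trainset.n_ratings`, `trainset.n_users`, and `trainset.n_items`.

```python
def rating_density(n_ratings: int, n_users: int, n_items: int) -> float:
    """Share of user-item cells that actually hold a rating."""
    if n_users <= 0 or n_items <= 0:
        raise ValueError("n_users and n_items must be positive")
    return n_ratings / (n_users * n_items)


# Toy example: 3 users, 4 items, 6 observed ratings -> half the cells are filled.
print(rating_density(n_ratings=6, n_users=3, n_items=4))
```

A low density limits how much a collaborative-filtering model can learn from overlap between tasters, which is worth keeping in mind when reading the error scores that follow.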


In [101]:
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [102]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 2.8280199469021787, 'mae': 2.2035071117452802}
{'rmse': {'n_factors': 50, 'reg_all': 0.02}, 'mae': {'n_factors': 100, 'reg_all': 0.02}}
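`best_score` reports each error metric as the best mean over the cross-validation folds. As a reminder of what the two metrics measure, here is a minimal pure-Python sketch of RMSE and MAE on a few made-up ratings. The function names and the numbers are illustrative; this is not Surprise's implementation.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Invented ratings on the same 80-100 scale used by the Reader above.
actual = [90.0, 85.0, 88.0, 93.0]
predicted = [88.0, 87.0, 88.0, 90.0]

print("RMSE:", rmse(actual, predicted))
print("MAE: ", mae(actual, predicted))
```

Read this way, an RMSE near 2.8 means the tuned SVD model misses a taster's score by roughly 3 points on the 80 to 100 scale.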


In [103]:
# cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [104]:
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([2.92570483, 2.9432257 , 2.95864957, 2.93067864, 2.94354489]))
('test_mae', array([2.26925654, 2.28296538, 2.29768739, 2.26903934, 2.28516609]))
('fit_time', (0.007861137390136719, 0.005543947219848633, 0.00569605827331543, 0.005543947219848633, 0.005298137664794922))
('test_time', (0.1432960033416748, 0.14139199256896973, 0.13251876831054688, 0.13090300559997559, 0.11562895774841309))
-----------------------
2.940360728039037


In [105]:
# cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [106]:
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([2.79633606, 2.78475221, 2.81275495, 2.78378813, 2.788628  ]))
('test_mae', array([2.13444708, 2.11817893, 2.14682591, 2.11399016, 2.12377018]))
('fit_time', (0.2769172191619873, 0.28096699714660645, 0.2944509983062744, 0.30768299102783203, 0.3059689998626709))
('test_time', (0.25560903549194336, 0.11307692527770996, 0.11180901527404785, 0.23702096939086914, 0.11268115043640137))


2.7932518702113724
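To compare the three models side by side, the RMSE values printed in this section can be averaged and ranked. This sketch hard-codes the numbers from the outputs above (the SVD figure is the grid search's best mean RMSE, so it has a single entry); the dictionary layout is just one convenient choice.

```python
from statistics import mean

# RMSE values copied from the cell outputs above.
rmse_by_model = {
    "SVD (grid search best)": [2.8280199469021787],
    "KNNBasic (pearson)": [2.92570483, 2.9432257, 2.95864957, 2.93067864, 2.94354489],
    "KNNBaseline (pearson)": [2.79633606, 2.78475221, 2.81275495, 2.78378813, 2.788628],
}

mean_rmse = {name: mean(scores) for name, scores in rmse_by_model.items()}

# Print models from lowest (best) to highest mean RMSE.
for name, score in sorted(mean_rmse.items(), key=lambda item: item[1]):
    print(f"{name:<24} {score:.4f}")
```

On these numbers KNNBaseline has the lowest mean RMSE, with the tuned SVD close behind and KNNBasic last.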

## Making Predictions

In [107]:
# Note: the grid search above found its best RMSE with reg_all=0.02; this cell uses 0.05.
svd = SVD(n_factors=50, reg_all=0.05)
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f880aea97c0>

In [108]:
predictions = svd.test(testset)

In [109]:
predictions

[Prediction(uid='Kerin O’Keefe', iid="Coppo 2015 Camp du Rouss  (Barbera d'Asti)", r_ui=91.0, est=88.97442718246911, details={'was_impossible': False}),
 Prediction(uid='Anne Krebiehl\xa0MW', iid='Artner 2013 Kirchweingarten Blaufränkisch (Carnuntum)', r_ui=93.0, est=90.33465009099069, details={'was_impossible': False}),
 Prediction(uid='Anna Lee C. Iijima', iid="Pellegrini Vineyards 2010 Vintner's Pride Estate Grown Encore Red (North Fork of Long Island)", r_ui=88.0, est=88.51368273696669, details={'was_impossible': False}),
 Prediction(uid='Roger Voss', iid='Mirabeau 2013 Rosé (Côtes de Provence)', r_ui=85.0, est=85.6609226729196, details={'was_impossible': False}),
 Prediction(uid='Paul Gregutt', iid='A Blooming Hill Vineyard 2012 Pinot Noir (Chehalem Mountains)', r_ui=85.0, est=89.0097803559408, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid="Cantina del Nebbiolo 2012 del Comune di Serralunga d'Alba  (Barolo)", r_ui=90.0, est=88.9744271

In [110]:
accuracy.rmse(predictions)

RMSE: 2.8092


2.809188510597363

In [111]:
def get_top_n(predictions, n=5):

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
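To see what `get_top_n` returns, it helps to run the same logic on a handful of made-up predictions. The sketch below repeats the function so that it runs on its own and uses a `namedtuple` as a stand-in for Surprise's `Prediction` objects; the tasters, wines, and scores are invented.

```python
from collections import defaultdict, namedtuple

# Stand-in with the same field order as Surprise's Prediction tuple.
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def get_top_n(predictions, n=5):
    """Group predictions by user and keep each user's n highest estimates."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


toy_predictions = [
    Prediction("Taster A", "Wine 1", 90.0, 89.5, {}),
    Prediction("Taster A", "Wine 2", 92.0, 93.1, {}),
    Prediction("Taster A", "Wine 3", 85.0, 86.0, {}),
    Prediction("Taster B", "Wine 1", 88.0, 87.2, {}),
]

top = get_top_n(toy_predictions, n=2)
for taster, wines in top.items():
    print(taster, wines)
```

When the input is `svd.test(testset)`, every candidate is a wine the taster actually reviewed in the held-out split, so the result ranks known reviews by predicted score rather than suggesting unseen wines.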

In [112]:
top_n = get_top_n(predictions, n=5)

In [113]:
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

Kerin O’Keefe ['Elvio Cogno 2013 Ravera  (Barolo)', 'Cantina Produttori San Michele Appiano 2012 Sanct Valentin Sauvignon (Alto Adige)', 'Bucci 2012 Villa Bucci Riserva  (Verdicchio dei Castelli di Jesi Classico Superiore)', "Brovia 2009 Ca' Mia  (Barolo)", 'Castello di Verduno 2009 Monvigliero Riserva  (Barolo)']
Anne Krebiehl MW ['Domaine Roland Schmitt 2013 Altenberg de Bergbieten Grand Cru Riesling (Alsace)', 'Eichinger 2013 Gaisberg Reserve Riesling (Kamptal)', 'Anton Bauer 2015 Grande Reserve Grüner Veltliner (Wagram)', 'Allram 2015 Renner Grüner Veltliner (Kamptal)', 'Gruber Röschitz 2012 Hundspoint Grüner Veltliner (Weinviertel)']
Anna Lee C. Iijima ['Reichsgraf von Kesselstatt 2015 Braunberger Juffer-Sonnenuhr Spätlese Grosse Lage Riesling (Mosel)', 'Reichsgraf von Kesselstatt 2014 Graach Josephshöfer Monopol Spätlese Grosse Lage Riesling (Mosel)', 'Johannishof 2015 Johannisberger Klaus Spätlese Riesling (Rheingau)', 'Thörle 2013 Saulheimer Probstey Trocken Rieslin

In [114]:
top_n

defaultdict(list,
            {'Kerin O’Keefe': [('Elvio Cogno 2013 Ravera  (Barolo)',
               95.40413740960271),
              ('Cantina Produttori San Michele Appiano 2012 Sanct Valentin Sauvignon (Alto Adige)',
               93.00792336347334),
              ('Bucci 2012 Villa Bucci Riserva  (Verdicchio dei Castelli di Jesi Classico Superiore)',
               92.94840105296184),
              ("Brovia 2009 Ca' Mia  (Barolo)", 92.93036350849425),
              ('Castello di Verduno 2009 Monvigliero Riserva  (Barolo)',
               92.74478450553092)],
             'Anne Krebiehl\xa0MW': [('Domaine Roland Schmitt 2013 Altenberg de Bergbieten Grand Cru Riesling (Alsace)',
               93.12939458678753),
              ('Eichinger 2013 Gaisberg Reserve Riesling (Kamptal)',
               92.90299717208276),
              ('Anton Bauer 2015 Grande Reserve Grüner Veltliner (Wagram)',
               92.51261967938498),
              ('Allram 2015 Renner Grüner Velt

In [93]:
def predict_all_scores(algo, trainset, title):
    """Print every known taster's estimated score for one wine title."""
    for inner_uid in trainset.all_users():
        taster = trainset.to_raw_uid(inner_uid)
        if not isinstance(taster, str):
            continue  # skip non-string ids such as NaN from missing taster names
        # predict() takes raw ids (taster name, wine title), not inner ids.
        estimated_score = algo.predict(taster, title).est
        print(taster, 'scores', title, ':', estimated_score)

In [94]:
# Pass a real wine title; integers such as 5 are not item ids in this dataset.
predict_all_scores(svd, trainset, 'Rainstorm 2013 Pinot Gris (Willamette Valley)')