# Wine Reviews Recommendation Systems

**Prepared by Elizabeth Webster**

*November 2022*

## Overview

Create a recommendation system for Wine Enthusiast's tasters using Surprise.

## Business Problem

This project is being prepared for a small winery in Walla Walla. They are just starting out and currently produce only a few wines. Their winemaker wants insight into how to produce wines that will be rated highly.

In this section of the project, I will create a recommendation system for Wine Enthusiast's tasters in order to suggest wines to certain tasters. By understanding the wines that are recommended, the winery will get an idea of what type of wines to create and who to market them to.

## Dataset

The data comes from Wine Enthusiast and contains roughly 130,000 wine reviews. Each review includes the description, variety, winery, country, taster name, etc.

For this section of the project, we will focus only on points (ratings), taster name (users), and title (items).

# Data Understanding

In [1]:
# Import necessary libraries
# (train_test_split is imported from surprise below; a scikit-learn import of the same name would be shadowed)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from surprise import Reader, Dataset
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, SVDpp, SlopeOne, NMF 
from surprise.prediction_algorithms import NormalPredictor, KNNWithZScore, BaselineOnly
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
from surprise import accuracy
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import dataframe
# Reading as UTF-8 keeps accented characters intact (latin-1 garbles names such as Rosé and O’Keefe)
df = pd.read_csv('Data/winemag-data-130k-v2.csv.zip',
                 encoding='utf-8',
                 index_col=0)

In [3]:
# Explore dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   region_1               108724 non-null  object 
 7   region_2               50511 non-null   object 
 8   taster_name            103727 non-null  object 
 9   taster_twitter_handle  98758 non-null   object 
 10  title                  129971 non-null  object 
 11  variety                129970 non-null  object 
 12  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


In [4]:
df.title.value_counts()

Gloria Ferrer NV Sonoma Brut Sparkling (Sonoma County)       11
Korbel NV Brut Sparkling (California)                         9
Segura Viudas NV Extra Dry Sparkling (Cava)                   8
Gloria Ferrer NV Blanc de Noirs Sparkling (Carneros)          7
Ruinart NV Brut Rosé  (Champagne)                             7
                                                             ..
Mauro Sebaste 2009 Santa Rosalia  (Dolcetto d'Alba)           1
Soutiran NV Alexandre Premier Cru Brut  (Champagne)           1
Château Haut-Monplaisir 2008 Prestige Malbec (Cahors)         1
En Garde 2011 Adamus Red (Diamond Mountain District)          1
Emblem 2007 Oso Vineyard Cabernet Sauvignon (Napa Valley)     1
Name: title, Length: 118840, dtype: int64

In [6]:
df.taster_name.value_counts()

Roger Voss            25514
Michael Schachner     15134
Kerin O’Keefe         10776
Virginie Boone         9537
Paul Gregutt           9532
Matt Kettmann          6332
Joe Czerwinski         5147
Sean P. Sullivan       4966
Anna Lee C. Iijima     4415
Jim Gordon             4177
Anne Krebiehl MW       3685
Lauren Buzzeo          1835
Susan Kostrzewa        1085
Mike DeSimone           514
Jeff Jenssen            491
Alexander Peartree      415
Carrie Dykes            139
Fiona Adams              27
Christina Pickard         6
Name: taster_name, dtype: int64

For our recommendation system, we only need three columns:

* `taster_name` - users
* `title` - items
* `points` - ratings (the target)

In [5]:
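# Create dataframe with specified columns
# (.copy() gives an independent frame, so the in-place dropna below does not act on a view of df)
rec_df = df.loc[:, ['points', 'taster_name', 'title']].copy()
rec_df.head()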

| | points | taster_name | title |
|---|---|---|---|
| 0 | 87 | Kerin O’Keefe | Nicosia 2013 Vulkà Bianco (Etna) |
| 1 | 87 | Roger Voss | Quinta dos Avidagos 2011 Avidagos Red (Douro) |
| 2 | 87 | Paul Gregutt | Rainstorm 2013 Pinot Gris (Willamette Valley) |
| 3 | 87 | Alexander Peartree | St. Julian 2013 Reserve Late Harvest Riesling ... |
| 4 | 87 | Paul Gregutt | Sweet Cheeks 2012 Vintner's Reserve Wild Child... |


In [12]:
# Check for missing values
print('Missing Taster Names:', rec_df.taster_name.isna().sum())
print('Missing Titles:', rec_df.title.isna().sum())
print('Missing Points:', rec_df.points.isna().sum())

Missing Taster Names: 26244
Missing Titles: 0
Missing Points: 0


In [14]:
# Drop missing values
rec_df.dropna(subset=['taster_name'], inplace=True)

In [15]:
# Check for missing values
print('Missing Taster Names:', rec_df.taster_name.isna().sum())

Missing Taster Names: 0


Now that the dataset is cleaned, we can move on to building the model.

# Building a Model

In [16]:
# Read the data into Surprise
reader = Reader(rating_scale=(80,100))
data = Dataset.load_from_df(rec_df[['taster_name', 'title', 'points']],reader)

In [35]:
# Perform a train test split
trainset, testset = train_test_split(data, test_size=0.25)
print('Number of trainset users: ', trainset.n_users)
print('Number of trainset items: ', trainset.n_items)

Number of trainset users:  19
Number of trainset items:  72331


## Testing Different Models

The metric I am interested in is RMSE - Root Mean Squared Error. I am looking for the model with the lowest RMSE.
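As a reminder of what the metric measures, here is a minimal sketch of RMSE (and MAE, which Surprise reports alongside it) computed from pairs of true and estimated scores. The example pairs are hypothetical and are not taken from the dataset.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true, estimated) point scores, for illustration only
example = [(86, 89), (91, 89), (85, 88), (94, 88)]
print(round(rmse(example), 3))
print(round(mae(example), 3))
```

Because errors are squared before averaging, RMSE penalizes a single large miss (the 6-point error here) more heavily than MAE does.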

### SVD - Singular Value Decomposition

In [18]:
# Run a grid search for parameters
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)
g_s_svd.fit(data)

In [19]:
# Print the best score and parameters
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 2.6853571420358646, 'mae': 2.0570073799480846}
{'rmse': {'n_factors': 100, 'reg_all': 0.02}, 'mae': {'n_factors': 100, 'reg_all': 0.02}}
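To make the size of this search concrete, the sketch below enumerates the parameter combinations the grid covers. It uses only the standard library and mirrors the `params` dictionary above; it is an illustration of the idea, not Surprise's internal implementation.

```python
from itertools import product

params = {'n_factors': [20, 50, 100],
          'reg_all': [0.02, 0.05, 0.1]}

# Every combination of one value per parameter
names = list(params)
combos = [dict(zip(names, values)) for values in product(*params.values())]

print(len(combos))
for combo in combos:
    print(combo)
```

Each combination is cross-validated separately, so the number of model fits is the number of combinations multiplied by the number of folds.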


### KNN Basic

In [20]:
# Cross validating with KNNBasic
knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

In [21]:
# Find mean rmse
for i in cv_knn_basic.items():
    print(i)
print('-----------------------')
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([2.81872178, 2.83520903, 2.83093154, 2.82588793, 2.828012  ]))
('test_mae', array([2.13933723, 2.16744591, 2.15432195, 2.15352458, 2.16025291]))
('fit_time', (0.00648188591003418, 0.004214048385620117, 0.0042879581451416016, 0.004130840301513672, 0.003963947296142578))
('test_time', (0.14880824089050293, 0.1419227123260498, 0.12962603569030762, 0.11473917961120605, 0.10323691368103027))
-----------------------
2.8277524552580626


### KNN Baseline

In [22]:
# Cross validating with KNNBaseline
knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


In [23]:
# Find mean rmse
for i in cv_knn_baseline.items():
    print(i)

np.mean(cv_knn_baseline['test_rmse'])

('test_rmse', array([2.6889015 , 2.68124213, 2.67527997, 2.69097158, 2.67882287]))
('test_mae', array([2.0432891 , 2.0323867 , 2.02651135, 2.04292702, 2.02259507]))
('fit_time', (0.20996713638305664, 0.22531390190124512, 0.2353348731994629, 0.22832727432250977, 0.22541379928588867))
('test_time', (0.1498420238494873, 0.08863711357116699, 0.09263920783996582, 0.09062600135803223, 0.08864808082580566))


2.6830436096283536

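KNNBaseline has the lowest mean RMSE (2.683), with the best SVD configuration close behind (2.685) and KNNBasic last (2.828). The gap between the first two is smaller than the fold-to-fold variation in the KNNBaseline scores (2.675 to 2.691), so the two are effectively tied. I will use Singular Value Decomposition for our final model.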
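For a side-by-side view, the scores printed above can be collected and sorted. The numbers below are copied from the recorded outputs; the dictionary labels are my own, and the fold spread is included only as a rough yardstick for how large a difference between models needs to be before it matters.

```python
from statistics import mean

# Per-fold test RMSE values copied from the cross-validation outputs above
knn_basic_folds = [2.81872178, 2.83520903, 2.83093154, 2.82588793, 2.828012]
knn_baseline_folds = [2.6889015, 2.68124213, 2.67527997, 2.69097158, 2.67882287]

scores = {
    'SVD (grid search best)': 2.6853571420358646,
    'KNNBasic': mean(knn_basic_folds),
    'KNNBaseline': mean(knn_baseline_folds),
}

# Sort from lowest (best) to highest RMSE
ranked = sorted(scores.items(), key=lambda kv: kv[1])
for name, score in ranked:
    print(f'{name}: {score:.4f}')

# Spread between the best and worst KNNBaseline fold
baseline_spread = max(knn_baseline_folds) - min(knn_baseline_folds)
print(f'KNNBaseline fold spread: {baseline_spread:.4f}')
```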

## Build SVD Model

In [24]:
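# Instantiate model and fit trainset
# Note: these values differ from the grid search's best parameters above
# (n_factors=100, reg_all=0.02); the results below were recorded with these settings
svd = SVD(n_factors=50, reg_all=0.05)
svd.fit(trainset)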

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f8b81c313a0>
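For intuition about what the fitted model computes, Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and by default clips the result to the rating scale. The sketch below reproduces that rule in plain Python. The function name and all numeric values are hypothetical and are not taken from the fitted model.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(80, 100)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u, clipped to the scale."""
    raw = mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical global mean, biases, and 2-factor vectors (illustration only)
est = svd_estimate(mu=88.4, b_u=0.3, b_i=-0.5, p_u=[0.1, -0.2], q_i=[0.5, 0.25])
print(round(est, 2))
```

With only 19 tasters, the user bias term captures how generous or strict each taster tends to be, which may explain why several predictions for the same taster sit close to one value.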

## Making Predictions

In [25]:
# Make predictions with the testset
predictions = svd.test(testset)

In [26]:
predictions

[Prediction(uid='Paul Gregutt', iid='Melrose 2013 Baco Noir (Umpqua Valley)', r_ui=86.0, est=88.97770409559391, details={'was_impossible': False}),
 Prediction(uid='Mike DeSimone', iid='Montefiore 2014 Cabernet Sauvignon (Judean Hills)', r_ui=91.0, est=89.06880023135628, details={'was_impossible': False}),
 Prediction(uid='Kerin O’Keefe', iid='Tenuta La Fuga 2011 Le Due Sorelle Riserva  (Brunello di Montalcino)', r_ui=91.0, est=88.8164603899712, details={'was_impossible': False}),
 Prediction(uid='Anna Lee C. Iijima', iid='Barrel Oak 2008 Reserve Viognier (Virginia)', r_ui=85.0, est=88.46653057193892, details={'was_impossible': False}),
 Prediction(uid='Michael Schachner', iid='Ànima Negra 2012 Quíbia Falanis White (Vi de la Terra Mallorca)', r_ui=86.0, est=87.038168595377, details={'was_impossible': False}),
 Prediction(uid='Anna Lee C. Iijima', iid='Robert Weil 2015 Kiedrich Gräfenberg Spätlese Riesling (Rheingau)', r_ui=94.0, est=88.46653057193892, details={'was_im

In [27]:
# Check our predictions for accuracy
accuracy.rmse(predictions)

RMSE: 2.6998


2.6997501588312436

### Top Predictions

In [32]:
# Build a function for retrieving top predictions
def get_top_n(predictions, n=5):

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [33]:
top_n = get_top_n(predictions, n=5)

In [34]:
# Check top 5 predictions for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings], '\n')

Paul Gregutt ['Doyenne 2008 Grand Ciel Vineyard Syrah (Red Mountain)', 'Gramercy 2009 The Third Man Red Red (Columbia Valley (WA))', 'Bethel Heights 2014 West Block Pinot Noir (Eola-Amity Hills)', 'DanCin 2014 Melodia Pinot Noir (Oregon)', 'Sparkman 2010 Ruckus Syrah (Red Mountain)'] 

Mike DeSimone ['Kavaklidere 2010 Pendore Syrah (Aegean)', 'Dalton 2013 Alma Scarlet Red (Galilee)', 'Tabor 2011 Adama Cabernet Sauvignon (Galilee)', "Segal's 2013 Fusion Red (Galilee)", 'Teliani Valley 2015 Semi-Sweet Khvanchkara Red (Georgia)'] 

Kerin O’Keefe ['Michele Chiarlo 2011 Cerequio  (Barolo)', 'Venturini Massimino 2005 Riserva  (Amarone della Valpolicella Classico)', "Brovia 2009 Ca' Mia  (Barolo)", 'Castello di Verduno 2009 Monvigliero Riserva  (Barolo)', 'Cantina Produttori San Michele Appiano 2012 Sanct Valentin Sauvignon (Alto Adige)'] 

Anna Lee C. Iijima ['Schloss Vollrads 2014 Spätlese Riesling (Rheingau)', 'Robert Weil 2014 Kiedrich Turmberg Trocken Riesling (Rheingau)', "Osprey's D
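One caveat about these lists: the predictions passed to `get_top_n` come from `testset`, so every wine shown is one the taster has already scored. To recommend wines a taster has not reviewed, predictions are needed for the unseen (taster, wine) pairs instead. Surprise offers `Trainset.build_anti_testset()` for this purpose. The sketch below shows the same idea in plain Python on hypothetical toy data; `unseen_pairs` is an illustrative helper, not part of Surprise.

```python
from collections import defaultdict

def unseen_pairs(ratings):
    """Return every (user, item) pair that does not appear in `ratings`."""
    items = {iid for _, iid, _ in ratings}
    seen = defaultdict(set)
    for uid, iid, _ in ratings:
        seen[uid].add(iid)
    # For each user, keep the items they have not rated yet
    return [(uid, iid) for uid in sorted(seen) for iid in sorted(items - seen[uid])]

# Hypothetical toy ratings: (taster, wine, points)
toy = [('A', 'w1', 90), ('A', 'w2', 85), ('B', 'w2', 88), ('B', 'w3', 92)]
print(unseen_pairs(toy))
```

Scoring these pairs with the fitted model and passing the results to `get_top_n` would give recommendations restricted to wines each taster has not yet reviewed.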

# Conclusions

These recommendations can be used to understand: 
* Which wine varieties are most often recommended
* Who to market specific varieties to

The red varieties that are showing up the most are:
* Pinot Noir
* Cabernet Sauvignon
* Syrah

The white varieties that are showing up the most are:
* Riesling
* Chardonnay

The tasters who are recommended wines from our city, Walla Walla, or our region, the Pacific Northwest, are:
* Sean P. Sullivan
* Paul Gregutt
* Virginie Boone

# Recommendations

Pinot Noir, Cabernet Sauvignon, Syrah, Riesling, and Chardonnay appear most often in the recommendation system's top predictions. I would recommend producing these varieties, since they reach a larger target audience and are more approachable. Once the winery has established itself with this strong base, it could move on to more niche wines.

Once the wines have been produced, I would recommend sending them to Sean, Paul, or Virginie for tasting. According to the model, these tasters favor the above varieties from our region, so they are the most likely to rate our wines highly.

# Next Steps

**Cold Start Problem** - One next step would be addressing the cold start problem, or how to recommend wines for a user that we have no information on.  The strategy I would use is having the new user provide preferences on a few of the wines to get us started.
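One possible way to choose which wines to ask a new user about is to start with the wines that have the most reviews, since those give the model the most to compare against. This selection rule is my assumption rather than part of the analysis above, and `seed_items` is a hypothetical helper shown on toy data.

```python
from collections import Counter

def seed_items(ratings, k=3):
    """Return the k items with the most ratings, as candidates to show a new user."""
    counts = Counter(iid for _, iid, _ in ratings)
    return [iid for iid, _ in counts.most_common(k)]

# Hypothetical toy ratings: (taster, wine, points)
toy_ratings = [
    ('A', 'w1', 90), ('B', 'w1', 88),
    ('A', 'w2', 85), ('B', 'w2', 87), ('C', 'w2', 91),
    ('C', 'w3', 92),
]
print(seed_items(toy_ratings, k=2))
```

The new user's scores for these seed wines could then be appended to the ratings data before refitting the model.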