# **Wine Recommendations**

**Evan Slack, Dylan Wagman, Ryan Maiman**

Final Report

ME 193: Data Analytics, Spring 2021

Professor Howard Hamilton

### 1. Abstract

With hundreds of thousands of unique wines in existence, it can be daunting to identify which wine a user will enjoy. Wine connoisseurs want to know what type of wine they will like based on their current preferences. Our solution parses over 130,000 wine reviews scraped from WineEnthusiast.com to offer a wine recommendation based on user input. The data is used to determine the most important characteristics and build a model that groups wines by similarity. To build this recommendation system, we applied a Word2Vec algorithm through Amazon SageMaker, a cloud computing and machine learning service. SageMaker was used to build a model and apply a word embeddings algorithm to the testing data.

### 2. Introduction

**How It Works**

Recommendation systems have become increasingly prevalent in digital services. Therefore, our team felt it would be beneficial to learn the ins and outs of creating these types of systems. A wine recommendation system fit perfectly with both our professional and personal interests. Thus, this notebook was born.

Our algorithm takes a wine chosen by the user as input. The user selects a wine in the dataset from a drop-down menu. The wine's descriptors are then taken into account using a Word2Vec algorithm, explained in more detail under the Model Description section. These descriptors are compared to those of the other wines in the dataset, and the most similar wines are output as recommendations.

**Related Works**

Recommendation systems are not a new idea; however, many current systems use a collaborative filtering algorithm. Collaborative filtering works by creating a matrix of users and their preferences. A separate user then inputs their preferences, which are compared to the preferences of the existing users. If a similar set of preferences is found, recommendations are made based on the matching user's other preferences.
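As a rough illustration of this idea (it is not the method used in this project), the sketch below builds a tiny user-by-wine preference matrix and recommends wines from the most similar existing user. The users, wines, ratings, and the `recommend` helper are all made up for the example.

```python
import math

# Hypothetical ratings (1-5); 0 means the user has not rated that wine.
wines = ["Pinot Noir", "Merlot", "Riesling", "Malbec"]
ratings = {
    "user_a": [5, 4, 0, 5],
    "user_b": [1, 0, 5, 2],
    "user_c": [0, 2, 5, 4],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(new_user):
    """Suggest wines the most similar existing user rated highly but new_user has not rated."""
    best_match = max(ratings, key=lambda name: cosine_similarity(ratings[name], new_user))
    suggestions = [
        wine
        for wine, theirs, mine in zip(wines, ratings[best_match], new_user)
        if mine == 0 and theirs >= 4
    ]
    return best_match, suggestions

print(recommend([5, 5, 0, 0]))
```

With only three users the matrix is easy to fill; with many wines and few reviewers, most entries would be empty, which is the sparsity problem that makes this approach hard to apply.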

Examples of recommendation systems that use collaborative filtering in some way are Spotify, Amazon, and Netflix. These systems have millions of users and thousands to millions of items. They can create and update their user preference matrices in real time to provide users with accurate recommendations. Likely, these companies are using collaborative filtering in conjunction with other algorithms to provide the best suggestions for their users.

Initially, we wanted to pursue this sort of method for our recommendation system, but after analyzing our dataset we realized that it would not be the most effective due to the lack of a significant number of users. As a result, the user x preference matrix was too sparse. We instead chose to use word embeddings to find similar wine descriptors.

Word embeddings can be found in a lot of the technology we interact with on a daily basis. When writing emails or text messages, software will commonly provide next-word (autofill) suggestions. Such predictions can be based on a Continuous Bag-of-Words implementation of Word2Vec, a word embeddings algorithm discussed later in the Model Description section.

### 3. Data

**Overview**

The dataset consists of 130,000+ reviews from WineEnthusiast.com. It was scraped and made available for download on Kaggle. Once downloaded, the data was consolidated and cleaned. The wine characteristics include descriptions written in prose, prices, wineries, and varieties. An additional wine feature, 'name', was created by combining the winery and wine variety. An example of the dataset and the full list of characteristics can be seen below.

Dataset URL: https://www.kaggle.com/zynicide/wine-reviews

![WineList](Images/WineList.png)

**Preprocessing**

To create our recommendation system, the reviews written in prose had to be standardized. To do this, the following steps were completed:
1. A large text corpus with all of the reviews was created
2. The corpus was tokenized to remove stop words
3. Similar words were grouped together
4. Uninformative descriptors were removed
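
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the word lists `STOP_WORDS`, `DESCRIPTOR_MAP`, and `UNINFORMATIVE`, the helper `extract_descriptors`, and the two sample reviews are hypothetical and heavily abbreviated.

```python
import re

# Hypothetical, heavily abbreviated word lists; real ones would be far larger.
STOP_WORDS = {"a", "and", "the", "with", "of", "this", "is", "on"}
DESCRIPTOR_MAP = {"cherries": "cherry", "oaky": "oak", "oaked": "oak"}
UNINFORMATIVE = {"wine", "drink", "now", "flavors"}

def extract_descriptors(review):
    """Turn one prose review into a list of standardized descriptors."""
    tokens = re.findall(r"[a-z]+", review.lower())        # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]    # remove stop words
    tokens = [DESCRIPTOR_MAP.get(t, t) for t in tokens]    # group similar words
    return [t for t in tokens if t not in UNINFORMATIVE]   # drop uninformative terms

reviews = [
    "This wine is oaky, with ripe cherries on the finish.",
    "Drink now: tart cherry and oaked spice flavors.",
]
# The corpus is the collection of all standardized reviews.
corpus = [extract_descriptors(review) for review in reviews]
print(corpus)
```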

### 4. Model Description

**Word embeddings** is a natural language processing technique in which words from a body of text are mapped to a vector space where words with similar meanings lie closer together. Most machine learning techniques cannot process strings of letters directly; they expect numerical inputs instead. Therefore, in order to perform numerical experiments on bodies of text, the words must first be converted into a vector space representation. Once in this space, a simple cosine distance metric can be used to measure how similar two words are.

There are several methods for mapping words to this numerical space; we chose to work with an unsupervised learning algorithm called **Word2Vec**. This algorithm was created by a machine learning team at Google in 2013 and is capable of capturing semantic relationships between words from a large dataset. There are two distinct types of Word2Vec: Continuous Bag-of-Words (CBOW) and **Skip-Gram with Negative Sampling** (SGNS). CBOW takes the surrounding context words as input and tries to predict the word they surround, whereas SGNS takes one word as input and tries to predict its context (the words that appear near it). As we are trying to take a descriptor from a wine review and predict similar descriptors, we are more concerned with the SGNS architecture.
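
To make the difference between the two architectures concrete, the sketch below generates the training examples each one would see for a short list of descriptors. The helper names `skipgram_pairs` and `cbow_pairs`, the window size, and the descriptor list are illustrative assumptions; negative sampling, which additionally draws random non-context words as negative examples, is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=2):
    """(context words, center) pairs: CBOW predicts the center word from its context."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

descriptors = ["cherry", "oak", "spice", "tobacco"]
print(skipgram_pairs(descriptors, window=1))
print(cbow_pairs(descriptors, window=1))
```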

![SkipGram](Images/SkipGram.png)

The model is built with a three-layer neural network, as seen in the image above. Each input computes a dot product with the weight matrix of the hidden layer; the output layer then computes a dot product between the hidden layer's output and its own weight matrix. Finally, a softmax activation function is applied to this output to determine the probability of words appearing in the context of the input word.

The softmax probability function is as follows: 

$$ s(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K}e^{z_j}} $$

where 
$$z = \text{output layer elements}$$
$$K = \text{number of output layer elements}$$
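
The forward pass and softmax described above can be sketched in a few lines of plain Python. The weight values, the three-word vocabulary, and the function names `softmax` and `forward` are hypothetical; a trained model would have a much larger vocabulary and learned weights.

```python
import math

def softmax(z):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    largest = max(z)
    exps = [math.exp(value - largest) for value in z]  # subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]

def forward(word_index, W_hidden, W_output):
    """Skip-gram forward pass: one-hot input -> hidden layer -> softmax over the vocabulary."""
    hidden = W_hidden[word_index]  # a one-hot input simply selects one row of W_hidden
    scores = [sum(h * w for h, w in zip(hidden, column)) for column in zip(*W_output)]
    return softmax(scores)

# Hypothetical weights: vocabulary of 3 words, hidden layer of size 2.
W_hidden = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]  # shape (3, 2)
W_output = [[0.1, 0.7, -0.2], [0.5, -0.3, 0.8]]    # shape (2, 3)

probabilities = forward(1, W_hidden, W_output)
print(probabilities, sum(probabilities))
```

Each entry of `probabilities` is the model's estimate that the corresponding vocabulary word appears in the context of the input word.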

### 5. Implementation

In order to begin creating the recommendation system, we had to first import the necessary libraries and modules.

```python
# import top level modules
import os
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor, NearestNeighbors
```

The dataset, originally downloaded from Kaggle, was loaded from the saved data files, and any wines with too few descriptors (five or fewer) were removed from the dataset.

```python
# load in data generated from blazing text algorithm 
# this includes original dataset with wine details, and the individual wine embedding vectors 
full_wine_df = pd.read_pickle(os.path.join("Data", "full_wine_df.pkl"))
full_wine_df_test = pd.read_pickle(os.path.join("Data", "full_wine_df_test.pkl"))
df_reds = pd.read_pickle(os.path.join("Data", "df_reds.pkl"))
full_wine_df.head()
df_reds.head()


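# use only well-described wines, i.e. wines with more than a threshold number of descriptors;
# reset the index so row labels match the row positions returned by the nearest-neighbors model
full_wine_df = full_wine_df.loc[full_wine_df['descriptor_count'] > 5].reset_index(drop=True)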
```

A nearest-neighbors model was fit to the review vectors to get a better understanding of the dataset and the relationships between the wines. A cosine metric was used because it was the most succinct.

```python
# make input X vector for Nearest Neighbors
X_in = pd.DataFrame(full_wine_df['review_vector'].values.tolist())
X_in.dropna(inplace=True)

# build KNN model with training data
knn = NearestNeighbors(n_neighbors=10, algorithm= 'brute', metric='cosine')
model_knn = knn.fit(X_in)
```
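
Conceptually, a brute-force nearest-neighbors search with a cosine metric computes the distance from the query to every stored vector and keeps the smallest ones. The sketch below shows this in plain Python on made-up three-dimensional vectors; the helper names are hypothetical, the real review vectors have many more dimensions, and scikit-learn's implementation is vectorized.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity: near 0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbors(query, vectors, k):
    """Return (distance, index) pairs for the k vectors closest to query, nearest first."""
    distances = [(cosine_distance(query, vector), i) for i, vector in enumerate(vectors)]
    return sorted(distances)[:k]

# Hypothetical review vectors for four wines.
review_vectors = [
    [0.9, 0.1, 0.2],  # wine 0
    [0.8, 0.2, 0.1],  # wine 1
    [0.1, 0.9, 0.3],  # wine 2
    [0.2, 0.8, 0.4],  # wine 3
]

neighbors = nearest_neighbors(review_vectors[0], review_vectors, k=3)
# The query wine is part of the data, so it is its own nearest neighbor; drop it.
suggestions = [index for _, index in neighbors[1:]]
print(suggestions)
```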


**Training**

```python
# input the desired wine name
name_test = "Ponzi Pinot Noir"

def print_suggestions(source_df, name, skip_first):
    """Print the three training-set wines closest to `name`, which is looked up in source_df."""
    main_wine = source_df.loc[source_df['name'] == name]
    if main_wine.empty:
        raise KeyError(name)
    wine_test_vector = main_wine['review_vector'].values.tolist()
    distance, indice = model_knn.kneighbors(wine_test_vector, n_neighbors=4)
    # a training-set wine is its own nearest neighbour, so skip the first hit in that case
    start = 1 if skip_first else 0
    distance_list = distance[0].tolist()[start:start + 3]
    indice_list = indice[0].tolist()[start:start + 3]

    print('Similar to:', name)
    print('The input wine is:', list(main_wine['descriptors'])[0])
    print('_________')
    for n, (d, i) in enumerate(zip(distance_list, indice_list), start=1):
        # neighbour indices are row positions in the training data the model was fit on
        wine_name = full_wine_df['name'].iloc[i]
        details = df_reds.loc[df_reds['name'] == wine_name].iloc[0]
        print('Suggestion', str(n), ':', wine_name)
        # print('distance of', "{:.3f}".format(d))
        print('Price ($):', details['price'])
        print('Country:', details['country'])
        print('Points:', details['points'])
        print('')

# look up the input wine in the training set and print its nearest neighbours
print_suggestions(full_wine_df, name_test, skip_first=True)
```

**Testing**

```python
# generate a random sample from the test set
name_test = full_wine_df_test.sample().name.iloc[0]

# first attempt to find the wine in the training set, otherwise look in the test set
try:
    print_suggestions(full_wine_df, name_test, skip_first=True)
except KeyError:
    # a test-set wine is not part of the fitted model, so its nearest hit is a real suggestion
    print_suggestions(full_wine_df_test, name_test, skip_first=False)
```

**User Interface**

The GUI uses ipywidgets within Jupyter Notebook. We chose the Combobox widget because it lets the user choose from a drop-down menu while also typing into the field to narrow down the selection.

```python
import ipywidgets as widgets
from IPython.display import display

name = df_reds['name']

all_names = pd.unique(name).tolist()

input_names = widgets.Combobox(
    placeholder='Choose a wine',
    options= all_names,
    description='',
    ensure_option=True,
    disabled=False
)

print("Enter Wine Name")
display(input_names)
```

### 6. Experimental Results

**Recommendation Strength**

A cosine distance metric was used to determine the strength of the recommendation. A wine with a distance metric closer to zero is a better suggestion than a wine with a larger distance metric. The output below shows a potential recommendation result. In this case, the Domaine Saint Andrieu Rhone-style Red Blend is quantitatively the best suggestion. This Red Blend can also be classified as the best qualitative suggestion based on the descriptors. Stonier Pinot Noir and the Red Blend have the most similar flavors, such as fruit/cherry/berry and earth/wood/root.

![StonierPinot](Images/StonierPinot.jpeg)

**User Interface**

The user can enter a wine they have enjoyed by choosing from a drop-down menu containing all of the wines in the dataset. This menu is quite large; however, it shrinks once the user begins to type. The interface was created with Jupyter's interactive ipywidgets.

![GUI](Images/GUI.png)

The user output is simpler than the output shown above in the Stonier Pinot Noir experiment. We thought that the most valuable information for the users' wine recommendations would be the wine names, prices, countries, and wine points. Thus, the user output interface looks like the following:

<img src="Images/UserOutput.jpeg" alt="UserOutput" width="500"/>

### 7. Conclusion

### 8. Contributions

**Evan**: Created bar and scatterplots, clustered data using kmeans

**Dylan**: Scraped and standardized the data from the Vivino website, created a random forest tree

**Ryan**: Project proposal and delivery, project progress report

### 9. References

**Related Work**

R. Schuring, “RoboSomm Chapter 3: Wine Embeddings and a Wine Recommender,” Medium, 28-Dec-2019. [Online]. Available: https://towardsdatascience.com/robosomm-chapter-3-wine-embeddings-and-a-wine-recommender-9fc678f1041e. [Accessed: 11-May-2021].

**Algorithm Information**

“Gensim: topic modelling for humans,” models.word2vec – Word2vec embeddings - gensim, 29-Apr-2021. [Online]. Available: https://radimrehurek.com/gensim/models/word2vec.html. [Accessed: 11-May-2021].

D. Radečić, “Softmax Activation Function Explained,” Medium, 18-Jun-2020. [Online]. Available: https://towardsdatascience.com/softmax-activation-function-explained-a7e1bc3ad60. [Accessed: 11-May-2021].

NSS, “Understanding Word Embeddings: From Word2Vec to Count Vectors,” Analytics Vidhya, 19-Oct-2020. [Online]. Available: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/. [Accessed: 11-May-2021].

S. Engdahl, “Amazon SageMaker BlazingText: Parallelizing Word2Vec on Multiple CPUs or GPUs,” AWS Machine Learning Blog, 2008. [Online]. Available: https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-blazingtext-parallelizing-word2vec-on-multiple-cpus-or-gpus/. [Accessed: 11-May-2021].

S. Doshi, “Skip-Gram: NLP context words prediction algorithm,” Medium, 17-Mar-2019. [Online]. Available: https://towardsdatascience.com/skip-gram-nlp-context-words-prediction-algorithm-5bbf34f84e0c. [Accessed: 11-May-2021].

**Code and Libraries**

Scikit-learn
>Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.