# **Assignment 2: Building a Crowdsourced Beer Recommendation System**

### Group Members: Bhavna Kaparaju (bgk378), Callie Gilmore (cgg756), Dawson Cook (dcc2436), India Lindsay (igl257), Ali Daanesh Sayyed (as92998)

### 8:30am Class

## Task A: Scraping Top Rated Beers on Beer Advocate (source: https://www.beeradvocate.com/beer/top-rated/)

In [None]:
import requests
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd

In [None]:
def get_product_links():
    rows = []  # one dict per scraped review

    base_url = 'https://www.beeradvocate.com'

    page = requests.get(base_url+'/beer/top-rated/')
    soup = BeautifulSoup(page.text,'html.parser')

    # absolute XPath of the i-th review block on a product page (brittle: breaks if the page layout changes)
    review_xpath = '/html/body/div[2]/div/div[2]/div[2]/div[2]/div/div/div[3]/div/div/div[2]/div[8]/div/div[{}]/div[2]'

    cells = soup.find_all('td')
    for cell in cells:
        if len(cell) == 2:  # keep only table cells with exactly two child nodes
            product = cell.find('b').text
            external_url = cell.find('a')['href']

            product_page = requests.get(base_url+external_url)
            tree = html.fromstring(product_page.content)
            for i in range(1,26):  # review positions 1 through 25
                try:
                    rating = tree.xpath((review_xpath + '/span[2]/text()').format(i))[0]
                    comments = tree.xpath((review_xpath + '/text()').format(i))[1:6]
                except IndexError:
                    continue  # no review at this position

                comment = ""
                for com in comments:
                    comment += com.strip('\n') + " "

                rows.append({'product_name': product,
                             'product_review': comment,
                             'user_rating': rating})

    # build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['product_name', 'product_review', 'user_rating'])

df = get_product_links()

In [None]:
df.to_csv('beer_reviews.csv',index=False)

In [None]:
beer_review = df.copy()

The following dataframe contains 6,230 reviews for a total of 250 beer products.

In [None]:
df.head()

Unnamed: 0,product_name,product_review,user_rating
0,Kentucky Brunch Brand Stout,Smell: early morning pancakes and coffee befor...,5.0
1,Kentucky Brunch Brand Stout,2019 vintage. Pours a very dark brown color wi...,4.53
2,Kentucky Brunch Brand Stout,It's hyped... There is a lot of breweries doin...,1.49
3,Kentucky Brunch Brand Stout,Reviewing 2019 vintage. This pours thick and c...,4.52
4,Kentucky Brunch Brand Stout,2018 version. Poured dark with a small head. S...,4.99


## Task B: Specifying 3 Attributes in a Product

In [None]:
from google.colab import files 
import io
import pandas as pd
from string import punctuation
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Word Frequency Analysis and Attribute Selection

We calculated the frequency of each word mentioned in each review while specifically seeking to identify any words that describe the attributes of beer products. 

If you're interested in learning more about the many attributes of beer, reference this link: https://www.dummies.com/food-drink/drinks/beer/beer-for-dummies-cheat-sheet/

In [None]:
from collections import Counter

# 'well-balanced' can never match because punctuation is stripped below; 'wellbalanced' covers it
attributes = ['aggressive', 'balanced', 'wellbalanced', 'well-balanced', 'complex', 'crisp', 'fruity', 'fruitforward', 'hoppy', 'malty', 'malt', 'robust']

stop_words = set(stopwords.words('english'))

df['product_review'] = df['product_review'].astype(str)

# count every non-stopword token across all reviews
counts = Counter()
for com in df['product_review']:
  com = re.sub(r'[^\w\s]', '', com)  # strip punctuation
  counts.update(word for word in com.lower().split() if word not in stop_words)

# one row per word, flagged 'yes' if it is in our list of known beer attributes
wordcounter = pd.DataFrame(list(counts.items()), columns=['Word', 'Count'])
wordcounter['Attribute'] = wordcounter['Word'].isin(attributes).map({True: 'yes', False: 'no'})

In [None]:
wordcounter.sort_values(by='Count', ascending=False, inplace=True)
wordcounter

In [None]:
wordcounter.to_csv('word_count.csv', index=False) 
files.download('word_count.csv') 

### Looking through the most frequently occurring words in the beer product reviews, we identified the following attributes used with the highest occurrence within reviews:

1.   Citrus
2.   Fruit
3.   Smooth

Citrus is used to describe beers that contain citrusy elements within their flavor. Fruit can be used to describe a beer that contains flavors reminiscent of various fruits. Citrus and fruit may overlap in their descriptions of several beers; however, they do have clear differences. Citrus references fruity flavors that are tart, bright, and slightly acidic, such as orange, lemon, and grapefruit. Fruit references all flavors that are related to a fruit taste and extends to multiple categories of fruits, including berries, drupes, and pomes. Smooth describes the texture of the beer and the consistency of the flavor.

For the purpose of our recommendation system, we will assume a customer has specified these three attributes as being important. We will seek to identify beer products that are described as containing these attributes.

## Task C: Performing a Similarity Analysis Using Cosine Similarity with Specified Attributes


In [None]:
df = beer_review.copy() #importing beer reviews dataframe

In [None]:
import io
import pandas as pd
from string import punctuation
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re 
stop_words = set( stopwords.words('english'))
import numpy as np
import string

In [None]:
import numpy
import re
from scipy import spatial

def tokenize(sentences):
    words = []
    for sentence in sentences:
        w = word_extraction(sentence)
        words.extend(w)
    words = sorted(list(set(words)))
    return words

def word_extraction(sentence):
    ignore = ['a', "the", "is"]
    # replace every non-word character with a space, then lowercase before filtering
    words = re.sub(r"[^\w]", " ", sentence).lower().split()
    cleaned_text = [w for w in words if w not in ignore]
    return cleaned_text

def generate_bow(allsentences):
    # binary bag-of-words vector: 1 if the attribute appears in the review, else 0
    vocab = ['citrus', 'fruit', 'smooth']
    words = word_extraction(allsentences)
    bag_vector = numpy.zeros(len(vocab))
    for w in words:
        for i,word in enumerate(vocab):
            if word == w:
                bag_vector[i] = 1
    return bag_vector

def CosSim(x):
    # cosine similarity against a review that mentions all three attributes
    feature_vector = [1.0,1.0,1.0]
    cosine_similarity = 1 - spatial.distance.cosine(feature_vector, x)
    return cosine_similarity

df['Similarity_Score'] = df['product_review'].map(generate_bow).map(CosSim).fillna(0)

In [None]:
df2 = df[['product_name', 'product_review', 'Similarity_Score']]
df2.to_csv (r'C:\Users\bhavn\Documents\Text Analysis\Assignment 2\export_dataframe.csv', index = False, header=True)
similarity_score = df2.copy()

We performed a similarity analysis (using cosine similarity) to identify whether the three attributes were contained within the review. Similarity scores range between 0 and 1. A similarity score of 1 indicates that all three attributes were mentioned within the review while a score of 0 indicates that none of the attributes were mentioned. 
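
Because each review is reduced to a 0/1 vector over three attributes, only four similarity values are possible. The following is a minimal sketch of that arithmetic in plain Python; the helper name `cosine_similarity` is ours and not part of the notebook, and it returns 0.0 for an all-zero vector, which mirrors the `fillna(0)` step above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

target = [1, 1, 1]  # the customer asked for citrus, fruit, and smooth

# One presence vector for each possible number of attributes mentioned (0 through 3)
scores = {k: cosine_similarity(target, [1] * k + [0] * (3 - k)) for k in range(4)}
for k, score in scores.items():
    print(k, "attribute(s) mentioned:", round(score, 5))
```

A review that mentions exactly one attribute scores 1/√3 ≈ 0.57735, which is the value shown for several reviews in the output of this task.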

In [None]:
df2.head()

Unnamed: 0,product_name,product_review,Similarity_Score
0,Kentucky Brunch Brand Stout,Smell: early morning pancakes and coffee befor...,0.0
1,Kentucky Brunch Brand Stout,2019 vintage. Pours a very dark brown color wi...,0.0
2,Kentucky Brunch Brand Stout,It's hyped... There is a lot of breweries doin...,0.0
3,Kentucky Brunch Brand Stout,Reviewing 2019 vintage. This pours thick and c...,0.57735
4,Kentucky Brunch Brand Stout,2018 version. Poured dark with a small head. S...,0.57735


## Task D: Performing a Feature Level Sentiment Analysis for Each of the 3 Features

For each review, if the attribute was mentioned in the review, we wanted to understand the sentiment behind the author's description of the attribute in relation to the beer. This is important as we will only want to recommend beer products in which the desired attributes were mentioned in a positive manner. 

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re 
import spacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #using Vader tool for sentiment analysis 


#loading nltk packages
nltk.download('stopwords')
nltk.download('punkt')

stop = stopwords.words('english')
stop.remove('not') #can convey different sentiment


products = beer_review.copy() #importing dataframe of beer reviews

In [None]:
products['product_review'] = products['product_review'].astype(str) #converting review to string

In [None]:
def word_window(string,n,keyword):

    '''Searches for keyword in text and returns up to n words on either side of
    each occurrence (keyword included) as one space-separated string'''

    # split on whitespace only: capitalization and punctuation are kept because
    # they carry sentiment; stopwords are removed
    string_tokens = string.split()
    string_no_sw = [word for word in string_tokens if word not in stop]

    # list to store the words around each occurrence of the keyword
    window = []

    for i, word in enumerate(string_no_sw):
        if word == keyword:
            start = max(i - n, 0)  # do not run past the start of the review
            # slicing stops at the end of the list, so no upper bound check is needed
            window.extend(string_no_sw[start:i + n + 1])

    return ' '.join(window)

In [None]:
def sentiment_score(df,attr):

    '''Given a list of three attributes, finds the sentiment for the window of
    words surrounding each attribute mention in each review'''

    attr1, attr2, attr3 = attr

    analyzer = SentimentIntensityAnalyzer() #creating sentiment-intensity-analyzer object
    rows = [] #one dict per attribute mention, or per review without any mention

    for product_name, rev in zip(df['product_name'], df['product_review']):

        has_attributes = False #variable to check if review contains attributes

        for word in rev.split(): #for each word in the review

            word2 = word.lower() #compare in lower case to identify presence of attribute

            if word2 in attr: #if word is an attribute
                has_attributes = True #we know this review contains an attribute
                window = word_window(rev,3,word) #grabbing window of 3 surrounding words in each direction
                score = analyzer.polarity_scores(window)['compound'] #grab the overall compound score
                #multiple mentions of the same attribute in one review each get their own row

                #using np.nan instead of zero so unmentioned attributes are not included when taking averages
                new_row = {'Product_Name':product_name,'Review':rev,attr1:np.nan, attr2:np.nan, attr3:np.nan}
                new_row[word2] = score
                rows.append(new_row)

        if not has_attributes: #review contained no attributes: keep it in the dataframe with np.nan for all attributes
            rows.append({'Product_Name':product_name,'Review':rev,attr1:np.nan, attr2:np.nan, attr3:np.nan})

    #build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns = ['Product_Name','Review',attr1, attr2, attr3])
    

In [None]:
list_attr = ['citrus','smooth','fruit']
sentiment_df = sentiment_score(products,list_attr)

If the attribute was mentioned in the review, we grabbed a window of words surrounding the attribute and calculated the sentiment of the words used to describe that attribute. If an attribute was not mentioned within a review, a value of NaN was assigned to the attribute.

The following dataframe contains each product, the review for each product, and the sentiment score if the review contains one of our selected attributes: citrus, smooth, fruit.

In [None]:
sentiment_df 

Unnamed: 0.1,Unnamed: 0,Product_Name,Review,citrus,smooth,fruit
0,0,Kentucky Brunch Brand Stout,Smell: early morning pancakes and coffee befor...,,,
1,1,Kentucky Brunch Brand Stout,2019 vintage. Pours a very dark brown color wi...,,,
2,2,Kentucky Brunch Brand Stout,It's hyped... There is a lot of breweries doin...,,,
3,3,Kentucky Brunch Brand Stout,Reviewing 2019 vintage. This pours thick and c...,,0.6249,
4,4,Kentucky Brunch Brand Stout,2018 version. Poured dark with a small head. S...,,0.5859,



We then calculated the average sentiment score for the mention of each attribute within all reviews for each product. A negative score indicates negative sentiment while a positive score indicates positive sentiment. A score of 0 indicates neutral sentiment. 
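
The reason for storing NaN rather than 0 for unmentioned attributes shows up in this averaging step: pandas' `mean` skips NaN values by default, so only reviews that actually mention an attribute contribute to its average. The sketch below reproduces that behavior in plain Python with hypothetical scores; `mean_skipping_nan` is an illustrative helper, not a function from the notebook.

```python
import math

def mean_skipping_nan(values):
    """Average only the scores that exist; return NaN when none do."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan

# Hypothetical 'smooth' scores for one beer: two mentions, two reviews without a mention
smooth_scores = [0.6249, math.nan, 0.5859, math.nan]

skipped = mean_skipping_nan(smooth_scores)
# What the average would be if missing mentions had been stored as 0 instead
zero_filled = sum(0.0 if math.isnan(v) else v for v in smooth_scores) / len(smooth_scores)

print(round(skipped, 4), round(zero_filled, 4))
```

Storing zeros would pull every average toward neutral in proportion to how many reviews never mention the attribute, which is why the two numbers differ.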

In [None]:
avg_sentiment = sentiment_df.groupby(['Product_Name'])[['citrus','smooth','fruit']].mean()

In [None]:

avg_sentiment.head()

Unnamed: 0,Product_Name,citrus,smooth,fruit
0,3rd Anniversary Imperial IPA,0.51948,0.352267,0.0
1,4th Anniversary,0.0,0.028733,-0.197929
2,A Deal With The Devil,,0.284975,0.31845
3,A Deal With The Devil - Double Oak-Aged,,0.626633,0.0
4,Aaron,,0.173029,-0.106067


The following dataframe contains the overall average sentiment for each attribute across all reviews:

In [None]:
avg_attribute_sent = sentiment_df[['citrus','smooth','fruit']].mean().to_frame(name = 'Avg Sentiment')

In [None]:

avg_attribute_sent 

Unnamed: 0.1,Unnamed: 0,Avg Sentiment
0,citrus,0.205478
1,smooth,0.242716
2,fruit,0.257298


In [None]:
sentiment_df.to_csv("Sentiment_Scores.csv") #each product with each attribute if in review and sentiment score 
avg_sentiment.to_csv("Avg_Sentiment_Product.csv") #each product w avg sentiment score for each attribute

In [None]:
avg_attribute_sent.to_csv("Avg_Sentiment_Attribute.csv") #each attribute w avg sentiment score 

## Task E: We calculated the overall evaluation score for each beer as a combination of the average similarity score and average sentiment score. Our goal was to identify 3 beer products with the highest evaluation score to recommend to the customer.

In [None]:
import pandas as pd

In [None]:
df = similarity_score.copy() #importing similarity score dataframe from task C
df2 = avg_sentiment.copy() #importing dataframe with avg sentiment score for each product by the three attributes 

In [None]:
df_Product_Simscore = df.groupby(['product_name'])[['Similarity_Score']].mean()

In [None]:
df_merged = df_Product_Simscore.merge(df2, left_on='product_name', right_on='Product_Name')

In [None]:
df_merged = df_merged.fillna(0)

In [None]:
df_merged.head()

In [None]:
df_merged['Citrus_evaluation_score'] = (df_merged['Similarity_Score'] + df_merged['citrus']) / 2
df_merged['Smooth_evaluation_score'] = (df_merged['Similarity_Score'] + df_merged['smooth']) / 2
df_merged['Fruit_evaluation_score'] = (df_merged['Similarity_Score'] + df_merged['fruit']) / 2
df_merged['Overall_evaluation_score'] = (df_merged['Similarity_Score'] + df_merged['fruit'] + df_merged['citrus'] + df_merged['smooth']) / 4

In [None]:
df_Evaluation_Scores = df_merged[['Product_Name', 'Citrus_evaluation_score', 'Smooth_evaluation_score', 'Fruit_evaluation_score', 'Overall_evaluation_score']]

### The following dataframe contains the overall evaluation score and the evaluation score by attribute. Given that our customer prefers beers that are citrusy, fruity, and/or smooth, we would recommend the following three beers as they have the highest overall evaluation scores.
1. Double Nelson 
2. Ghost In The Machine - Double Dry-Hopped
3. Double Dry Hopped Fort Point Pale Ale

In [None]:
df_Evaluation_Scores.sort_values(by='Overall_evaluation_score', ascending=False)[0:3]

Unnamed: 0,Product_Name,Citrus_evaluation_score,Smooth_evaluation_score,Fruit_evaluation_score,Overall_evaluation_score
89,Double Nelson,0.301038,0.340869,0.522069,0.452169
119,Ghost In The Machine - Double Dry-Hopped,0.366213,0.581688,0.306648,0.449287
84,Double Dry Hopped Fort Point Pale Ale,0.24401,0.51461,0.429322,0.429511
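
The scoring used in this task can be summarized for a single beer as follows. This is a sketch with hypothetical input values; `evaluation_scores` is an illustrative name and does not appear in the notebook.

```python
def evaluation_scores(similarity, sentiment):
    """Combine one beer's average similarity with its per-attribute average sentiment."""
    # Per-attribute score: simple mean of similarity and that attribute's sentiment
    per_attribute = {name: (similarity + value) / 2 for name, value in sentiment.items()}
    # Overall score: mean of the similarity and all attribute sentiments
    overall = (similarity + sum(sentiment.values())) / (1 + len(sentiment))
    return per_attribute, overall

# Hypothetical averages for a single beer
per_attribute, overall = evaluation_scores(
    0.26, {'citrus': 0.34, 'smooth': 0.42, 'fruit': 0.78}
)
print(per_attribute, overall)
```

Note that the overall score gives similarity a weight of one quarter, and that the two inputs sit on different scales: the bag-of-words similarity lies between 0 and 1, while VADER's compound sentiment lies between -1 and 1.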


## Task F: We then sought to analyze how our recommendations would shift if we used word vectors to calculate similarity score (using the spaCy package).


In [None]:
import io
import pandas as pd
from string import punctuation
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re 
stop_words = set( stopwords.words('english'))
import numpy as np
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
df = beer_review.dropna().copy() #copy so the new columns below are added to an independent dataframe

In [None]:
df['cleaned_review'] = df['product_review'].apply(lambda x :x.translate(str.maketrans('', '', string.punctuation)))
df['cleaned_review'] = df['cleaned_review'].apply(lambda x :x.lower())

df['cleaned_review'] = df['cleaned_review'].apply(word_tokenize).apply(set).apply(list)
def remove_stopwords(s):
    return [w for w in s if not w in stop_words] 
    
df['cleaned_review'] =  df['cleaned_review'].apply(remove_stopwords)

In [None]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [None]:
def join_words(comment):   
    """Joins the tokenized words to a sentence"""
    return " ".join(comment) 

df['joined_review'] = df['cleaned_review'].map(join_words)

In [None]:
def calculate_similarity(comment):
    """Compute similarity score"""
    base = nlp(comment)
    compare = nlp(input_attributes)
    return base.similarity(compare)

In [None]:
input_list = ['citrus', 'fruit', 'smooth']
input_attributes =  " ".join(input_list)
df['spacy_similarity'] = df['joined_review'].map(calculate_similarity)

The dataframe below shows the spaCy similarity score for each review.

In [None]:
spacy_sim = df[['product_name','product_review','user_rating','spacy_similarity']]

Unnamed: 0,product_name,product_review,user_rating,spacy_similarity
0,Kentucky Brunch Brand Stout,Smell: early morning pancakes and coffee befor...,5.00,0.663443
1,Kentucky Brunch Brand Stout,2019 vintage. Pours a very dark brown color wi...,4.53,0.745518
2,Kentucky Brunch Brand Stout,It's hyped... There is a lot of breweries doin...,1.49,0.506517
3,Kentucky Brunch Brand Stout,Reviewing 2019 vintage. This pours thick and c...,4.52,0.619972
4,Kentucky Brunch Brand Stout,2018 version. Poured dark with a small head. S...,4.99,0.695420
...,...,...,...,...
6224,Madagascar,Pours a very dark chestnut brown with two-and-...,4.07,0.777653
6225,Madagascar,Vintage 2018 Heavy booze aroma with vanilla ac...,4.15,0.788488
6226,Madagascar,"Opaque, near black body, with a frothy, cola-l...",4.21,0.758685
6227,Madagascar,"Drank from a 1 pint, 6 fl oz bottle purchased ...",4.61,0.448734


In [None]:
#import pandas as pd

In [None]:

df2 = avg_sentiment.copy() #importing dataframe with avg sentiment score for each product by the three attributes 

In [None]:
df_Product_spacyscore = spacy_sim.groupby(['product_name'])[['spacy_similarity']].mean()

In [None]:
df_merged = df_Product_spacyscore.merge(df2, left_on='product_name', right_on='Product_Name')

In [None]:
df_merged = df_merged.fillna(0)

In [None]:
df_merged.head()

In [None]:
df_merged['Citrus_evaluation_score'] = (df_merged['spacy_similarity'] + df_merged['citrus']) / 2
df_merged['Smooth_evaluation_score'] = (df_merged['spacy_similarity'] + df_merged['smooth']) / 2
df_merged['Fruit_evaluation_score'] = (df_merged['spacy_similarity'] + df_merged['fruit']) / 2
df_merged['Overall_evaluation_score'] = (df_merged['spacy_similarity'] + df_merged['fruit'] + df_merged['citrus'] + df_merged['smooth']) / 4

In [None]:
df_Evaluation_Scores_Spacy = df_merged[['Product_Name', 'Citrus_evaluation_score', 'Smooth_evaluation_score', 'Fruit_evaluation_score', 'Overall_evaluation_score']]

Using spaCy's similarity calculations, we found that the three beers we should recommend are the same beers recommended using the regular cosine similarity score calculations:
- Double Nelson
- Ghost In The Machine - Double Dry-Hopped
- Double Dry Hopped Fort Point Pale Ale

However, the overall evaluation scores are higher than the previous evaluation scores (shown in the second dataframe below).


In [None]:
df_Evaluation_Scores_Spacy.sort_values(by='Overall_evaluation_score', ascending=False)[0:3]

Unnamed: 0,Product_Name,Citrus_evaluation_score,Smooth_evaluation_score,Fruit_evaluation_score,Overall_evaluation_score
89,Double Nelson,0.542417,0.582248,0.763448,0.572859
119,Ghost In The Machine - Double Dry-Hopped,0.542725,0.7582,0.48316,0.537543
84,Double Dry Hopped Fort Point Pale Ale,0.437351,0.707951,0.622663,0.526182


Both approaches compute a cosine similarity, but they represent text differently: the regular cosine similarity relies on a bag-of-words model, while spaCy uses word vectors.

The bag-of-words model represents each document by the words it contains, so its cosine similarity score depends on how many words two documents share. In this context, we recorded only whether each of the three attributes appeared, so a review that explicitly contained all three attributes was given a cosine similarity score of 1.

A word vector is a dense numeric representation of a word learned from the contexts in which the word appears, so words that are used in similar contexts end up with similar vectors. The Python library spaCy provides pre-trained word vectors with its larger English models (here `en_core_web_lg`), and by default it scores the similarity of two texts as the cosine similarity of their averaged word vectors. In this context, if a review contained words that are closely associated with one of our attributes, even if the review did not explicitly mention the attributes, then it was given a high cosine similarity score.
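
To make the contrast with the bag-of-words approach concrete, here is a toy sketch of similarity over averaged word vectors. The three-dimensional vectors are invented for illustration (real spaCy vectors have hundreds of dimensions and different values), and the helper names are ours.

```python
import math

# Made-up vectors: the first dimension loosely stands for "fruitiness",
# the second for "softness", the third for "bitterness"
toy_vectors = {
    'citrus':     [0.90, 0.10, 0.00],
    'fruit':      [0.80, 0.30, 0.10],
    'smooth':     [0.10, 0.90, 0.20],
    'grapefruit': [0.85, 0.15, 0.05],
    'creamy':     [0.20, 0.80, 0.30],
    'bitter':     [0.00, 0.10, 0.90],
}

def average_vector(words):
    """Element-wise mean of the vectors of all known words."""
    vectors = [toy_vectors[w] for w in words if w in toy_vectors]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = average_vector(['citrus', 'fruit', 'smooth'])

# Neither review mentions an attribute word, so both would score 0 under bag-of-words
related = cosine(query, average_vector(['grapefruit', 'creamy']))
unrelated = cosine(query, average_vector(['bitter']))
print(round(related, 3), round(unrelated, 3))
```

The review built from related words scores far higher than the one built from an unrelated word, even though neither contains the attribute words themselves.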

Given that our similarity scores are higher when using the spaCy library and the exact same beers were recommended, it is likely that many of our reviews use words that are closely associated with the three attributes (citrus, smooth, and fruit) without explicitly stating them. This lets us draw on a greater number of reviews, which may make our beer recommendations more reliable.

If the spaCy cosine similarity score resulted in different beers being recommended, that would indicate that the word vector approach is associating words with our attributes that may have a different meaning within the context of beers. 

The evaluation scores found using regular cosine calculations can be seen below. 

In [None]:
df_Evaluation_Scores.sort_values(by='Overall_evaluation_score', ascending=False)[0:3]

Unnamed: 0,Product_Name,Citrus_evaluation_score,Smooth_evaluation_score,Fruit_evaluation_score,Overall_evaluation_score
89,Double Nelson,0.301038,0.340869,0.522069,0.452169
119,Ghost In The Machine - Double Dry-Hopped,0.366213,0.581688,0.306648,0.449287
84,Double Dry Hopped Fort Point Pale Ale,0.24401,0.51461,0.429322,0.429511


## Task G: How Recommendations Differ if based purely on Rating and not Feature Sentiment and Similarity Scores

If we relied purely on the ratings posted on the beer review website to recommend 3 products, our recommendations would fail to capture the true opinion of the reviewers. A product rating is indicative of the reviewer's preferences for that particular beer. However, one customer's preferences may not match another customer's preferences. If we relied merely on the stars to recommend products, we would not be accounting for the variety of customers' preferences. Rather, our recommendations would rely purely on a simplistic quantitative value.

Beers have a wide variety of attributes, including taste, smell, color, consistency, and vintage. By receiving input from the customer regarding their desired attributes, we are able to personalize the recommendations and account for both the complex nature of beer and customer preferences. If given 3 desired beer attributes, we can identify beers whose reviews both contain these attributes (using similarity score) and mention these attributes in a positive manner (using sentiment analysis of attributes). We can suit the needs of any type of customer, from a beer novice to a beer connoisseur, as long as they know what they like.

If we merely recommended the top 3 products based on the reviewers with the highest user ratings and the beer's product ratings, we would be recommending the following beers to every customer:
- Schaarbeekse Kriek
- Trappist Westvleteren 12 (XII)
- Everett Porter

It is also extremely difficult to select the top 3 beers when 9 beers all have the same rating of 5.0. Here, we additionally relied upon the number of reviews per product to select the top 3. In the dataframe below, you can observe the beers that received 5-star ratings.
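
The tie-break can be sketched as a sort on rating first and review count second. The rows are copied from the 5.0-rated entries in the output of the cell below; the variable names are ours.

```python
# (product, rating, number of reviews with that rating)
five_star = [
    ('Mother Of All Storms', 5.0, 15),
    ('Focal Banger', 5.0, 16),
    ('Everett Porter', 5.0, 17),
    ('Kentucky Brunch Brand Stout', 5.0, 7),
    ('Double Barrel Jesus', 5.0, 10),
    ('Dinner', 5.0, 13),
    ('Trappist Westvleteren 12 (XII)', 5.0, 17),
    ('Marshmallow Handjee', 5.0, 14),
    ('Schaarbeekse Kriek', 5.0, 19),
]

# Sort by rating, then by review count, both descending
top3 = sorted(five_star, key=lambda row: (row[1], row[2]), reverse=True)[:3]
top3_names = [name for name, _, _ in top3]
print(top3_names)
```

Two of the beers tie at 17 reviews, so their relative order depends on the input order; a further criterion would be needed to rank them against each other.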

In [None]:
highest_rated_beers = beer_review.groupby(['product_name','user_rating',]).count().sort_values(by = 'user_rating', ascending=False)
highest_rated_beers.iloc[:10]

Unnamed: 0_level_0,Unnamed: 1_level_0,product_review
product_name,user_rating,Unnamed: 2_level_1
Mother Of All Storms,5.0,15
Focal Banger,5.0,16
Everett Porter,5.0,17
Kentucky Brunch Brand Stout,5.0,7
Double Barrel Jesus,5.0,10
Dinner,5.0,13
Trappist Westvleteren 12 (XII),5.0,17
Marshmallow Handjee,5.0,14
Schaarbeekse Kriek,5.0,19
Barrel Aged Imperial German Chocolate Cupcake Stout,4.96,16





However, when seeking to recommend a product based on the customer's preferences and the insights gained from the reviews, we found the following three beers:

- Double Nelson
- Ghost In The Machine - Double Dry-Hopped
- Double Dry Hopped Fort Point Pale Ale

These three beers would not have been recommended if we did not incorporate cosine similarity or sentiment analysis as they were not in the list of top rated beers. However, the customer is more likely to enjoy these beers as we know they match their specified preferences. 

An additional benefit of this approach is that if the customer ranked their desired attributes in order of preference, we would have additional insight into recommending a beer. For example, if this customer preferred fruity beers over smooth and citrus beers, we would recommend the Double Nelson as it had the highest evaluation score for the fruit attribute. 
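
One way such a ranking could be used is a weighted average of the per-attribute evaluation scores. The sketch below is an assumption about how this might be done, not part of our analysis: the weights (3, 2, 1) are hypothetical, while the scores are taken from the spaCy-based evaluation table in this notebook.

```python
def weighted_score(scores, weights):
    """Weighted average of per-attribute evaluation scores."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Per-attribute evaluation scores from the spaCy-based table
beers = {
    'Double Nelson':
        {'citrus': 0.542417, 'smooth': 0.582248, 'fruit': 0.763448},
    'Ghost In The Machine - Double Dry-Hopped':
        {'citrus': 0.542725, 'smooth': 0.7582, 'fruit': 0.48316},
    'Double Dry Hopped Fort Point Pale Ale':
        {'citrus': 0.437351, 'smooth': 0.707951, 'fruit': 0.622663},
}

# Hypothetical weights for a customer who ranks fruit first, then smooth, then citrus
weights = {'fruit': 3, 'smooth': 2, 'citrus': 1}

ranked = sorted(beers, key=lambda name: weighted_score(beers[name], weights), reverse=True)
print(ranked)
```

With these weights Double Nelson stays in first place, while the other two beers swap positions relative to the unweighted ranking.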


In [None]:
df_Evaluation_Scores_Spacy.sort_values(by='Overall_evaluation_score', ascending=False)[0:3]

Unnamed: 0,Product_Name,Citrus_evaluation_score,Smooth_evaluation_score,Fruit_evaluation_score,Overall_evaluation_score
89,Double Nelson,0.542417,0.582248,0.763448,0.572859
119,Ghost In The Machine - Double Dry-Hopped,0.542725,0.7582,0.48316,0.537543
84,Double Dry Hopped Fort Point Pale Ale,0.437351,0.707951,0.622663,0.526182
