# Building a Crowdsourced Recommendation System

**High level description:** The objective of this group assignment is to create the building blocks of a crowdsourced recommendation system. The system should accept user inputs about the desired attributes of a product and return 3 recommendations. 
Obtain reviews of craft beer from beeradvocate.com. I would suggest using the following link, which shows the top 250 beers sorted by ratings: 
https://www.beeradvocate.com/beer/top-rated/
The nice feature of the above link is that it is a single-page listing of 250 top-rated beers (avoids the pagination feature, which you need in cases where listings go on for many pages). The way beeradvocate.com organizes reviews is that it provides about 25 reviews per page. The output file should have 3 columns: product_name, product_review, and user_rating. 

Your submission (python notebook) should include the following: 
(i)	Names of all team members inside the python notebook (only one submission per team) including morning/late morning cohort information. 
(ii)	All scripts 
(iii)	The sentiment and similarity scores for the three products you recommended in task E.
(iv)	Your analyses for and answer to task F. Make sure you show the ratings, similarity scores and sentiments for the products you recommend in tasks E and F. Use tables whenever possible.


## Task A. 
Extract about 5-6k reviews. 


---



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import warnings
warnings.filterwarnings('ignore')

nltk.download('punkt')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
#Read output from the scraping
beer_df = pd.read_csv('beeradvocate.csv')

In [None]:
beer_df

Unnamed: 0,web-scraper-order,web-scraper-start-url,beer,beer-href,score,comment,name
0,1634682101-4857,https://www.beeradvocate.com/beer/top-rated/,A Deal With The Devil - Double Oak-Aged,https://www.beeradvocate.com/beer/profile/2490...,,4DAloveofSTOUT from Illinois\n\n4.73/5 rDev +...,A Deal With The Devil - Double Oak-Aged\nAncho...
1,1634681662-147,https://www.beeradvocate.com/beer/top-rated/,Darkstar November,https://www.beeradvocate.com/beer/profile/3382...,,Thomas_Wikman from Texas\n\n4.59/5 rDev +2.5%...,Darkstar November\nBottle Logic Brewing
2,1634681792-1521,https://www.beeradvocate.com/beer/top-rated/,Last Snow,https://www.beeradvocate.com/beer/profile/3180...,,HattedClassic from Virginia\n\n4.36/5 rDev -2...,Last Snow\nFunky Buddha Brewery
3,1634681852-2139,https://www.beeradvocate.com/beer/top-rated/,Art,https://www.beeradvocate.com/beer/profile/2251...,,rodmanfor3 from Vermont\n\n4.79/5 rDev +5.7%\...,Art\nHill Farmstead Brewery
4,1634682188-5808,https://www.beeradvocate.com/beer/top-rated/,Julius,https://www.beeradvocate.com/beer/profile/2874...,,Sachin-Tendulkar from New York\n\n4.8/5 rDev ...,Julius\nTree House Brewing Company
...,...,...,...,...,...,...,...
6226,1634682129-5143,https://www.beeradvocate.com/beer/top-rated/,Atrial Rubicite,https://www.beeradvocate.com/beer/profile/2401...,,thedaveofbeer from Massachusetts\n\n4.87/5 rD...,Atrial Rubicite\nJester King Brewery
6227,1634682113-4992,https://www.beeradvocate.com/beer/top-rated/,King Sue,https://www.beeradvocate.com/beer/profile/2322...,,imnodoctorbut from Texas\n\n4.54/5 rDev -1.5%...,King Sue\nToppling Goliath Brewing Company
6228,1634682153-5412,https://www.beeradvocate.com/beer/top-rated/,Lou Pepe - Kriek,https://www.beeradvocate.com/beer/profile/388/...,,TrilliumFan from Massachusetts\n\n4.75/5 rDev...,Lou Pepe - Kriek\nBrasserie Cantillon
6229,1634681929-2982,https://www.beeradvocate.com/beer/top-rated/,Pseudo Sue,https://www.beeradvocate.com/beer/profile/2322...,,detgfrsh from Texas\n\n4.34/5 rDev -4%\nlook:...,Pseudo Sue\nToppling Goliath Brewing Company


In [None]:
#Parsed_df will be the final dataframe
parsed_df = pd.DataFrame([], columns= ['beer_name', 'beer_detail', 'reviewer', 'rating', 'score', 'att_scores', 'full_review'])

#We loop over each scraped review
for i in range(beer_df.shape[0]):

  #assign beer name as-is
  str_beer = beer_df['beer'].iloc[i]
  str_name = beer_df['name'].iloc[i]

  #Process the comment column to parse the entry
  str_comment = beer_df['comment'].iloc[i]
  try:
    review_lines = str_comment.split('\n')
  except AttributeError:
    #Missing comments are read as NaN (a float), which has no split(); skip them
    continue

  #Iterate through the different lines in the comment
  #Some are the review, and some are metadata
  comment = []
  #Use a separate counter so the outer loop variable i is not overwritten
  line_no = 0
  for line in review_lines:
    if line == '' or line == ' ':
      continue
    line_no += 1
    if line_no == 1:
      #Name of the reviewer
      reviewer_name = line
    elif line_no == 2:
      #Rating
      rating = line
      score = rating.split('/')[0]
    elif line_no == 3:
      #attribute-based score
      attribute_scores = line
    else:
      #review
      comment.append(line)
  
  #Join the lines that are review in one
  full_comment = ' '.join(comment)

  # Form of the new DF
  new_df = pd.DataFrame([[str_beer, str_name, reviewer_name, rating, score, attribute_scores, full_comment]],
                        columns =['beer_name', 'beer_detail', 'reviewer', 'rating', 'score', 'att_scores', 'full_review'])

  parsed_df = pd.concat([parsed_df,new_df])

In [None]:
parsed_df

Unnamed: 0,beer_name,beer_detail,reviewer,rating,score,att_scores,full_review
0,A Deal With The Devil - Double Oak-Aged,A Deal With The Devil - Double Oak-Aged\nAncho...,4DAloveofSTOUT from Illinois,4.73/5 rDev +0.2%,4.73,look: 4 | smell: 4.75 | taste: 4.75 | feel: 5 ...,2017 Vintage This is up there with some of the...
0,Darkstar November,Darkstar November\nBottle Logic Brewing,Thomas_Wikman from Texas,4.59/5 rDev +2.5%,4.59,look: 4.75 | smell: 4.5 | taste: 4.75 | feel: ...,"It is truly delicious. A little bit foamy, the..."
0,Last Snow,Last Snow\nFunky Buddha Brewery,HattedClassic from Virginia,4.36/5 rDev -2.9%,4.36,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4 |...,The beer pours a nice creamy brown head that d...
0,Art,Art\nHill Farmstead Brewery,rodmanfor3 from Vermont,4.79/5 rDev +5.7%,4.79,look: 4.5 | smell: 4.75 | taste: 4.75 | feel: ...,L: Golden yellow pour with minimal head and vi...
0,Julius,Julius\nTree House Brewing Company,Sachin-Tendulkar from New York,4.8/5 rDev +2.6%,4.8,look: 4.5 | smell: 4.5 | taste: 5 | feel: 5 | ...,Canned 08/26/20 Pours a hazy orange with two f...
...,...,...,...,...,...,...,...
0,Atrial Rubicite,Atrial Rubicite\nJester King Brewery,thedaveofbeer from Massachusetts,4.87/5 rDev +5.4%,4.87,look: 5 | smell: 4.75 | taste: 5 | feel: 4.75 ...,It took a lot to get one of these so I was anx...
0,King Sue,King Sue\nToppling Goliath Brewing Company,imnodoctorbut from Texas,4.54/5 rDev -1.5%,4.54,look: 3.75 | smell: 4 | taste: 4.75 | feel: 5 ...,"tallboy 4-pack, dated best by Sept; it's 6/13 ..."
0,Lou Pepe - Kriek,Lou Pepe - Kriek\nBrasserie Cantillon,TrilliumFan from Massachusetts,4.75/5 rDev +2.2%,4.75,look: 4.75 | smell: 4.75 | taste: 4.75 | feel:...,"On tap at Moeder Lambic, first one in a lambic..."
0,Pseudo Sue,Pseudo Sue\nToppling Goliath Brewing Company,detgfrsh from Texas,4.34/5 rDev -4%,4.34,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4.2...,From a tallboy can packaged 5/18/21. Cloudy go...


In [None]:
#Download a copy to store it
parsed_df.to_csv('parsed_reviews.csv') 
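
The assignment asks for an output file with three columns: product_name, product_review, and user_rating. The sketch below shows one way a single raw comment could be split into those fields. It assumes the layout seen in the scraped data (reviewer line, rating line, attribute-score line, then the review text). The helper name `parse_comment` is hypothetical, and the sample comment is a shortened, made-up string in the same layout.

```python
def parse_comment(beer_name, raw_comment):
    """Split one scraped comment into the three assignment columns.

    Assumes the layout: reviewer line, rating line ("4.79/5 rDev +5.7%"),
    attribute-score line, then one or more lines of review text.
    """
    # Drop empty lines and surrounding whitespace
    lines = [ln.strip() for ln in raw_comment.split('\n') if ln.strip()]
    if len(lines) < 4:
        return None  # not enough lines to contain a review body
    # The numeric rating is the part before the slash on the second line
    user_rating = float(lines[1].split('/')[0])
    # Everything after the three metadata lines is the review text
    product_review = ' '.join(lines[3:])
    return {
        'product_name': beer_name,
        'product_review': product_review,
        'user_rating': user_rating,
    }


# Made-up comment in the scraped layout, for illustration only
example = parse_comment(
    'Art',
    'rodmanfor3 from Vermont\n\n4.79/5 rDev +5.7%\n'
    'look: 4.5 | smell: 4.75\n\nGolden yellow pour.\nVery soft feel.',
)
print(example)
```

Returning `None` for short comments makes incomplete entries explicit instead of reusing values from a previous review.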

## Task B. 

Assume that a customer, who will be using this recommender system, has specified 3 attributes in a product. E.g., one website describes multiple attributes of beer:
https://www.dummies.com/food-drink/drinks/beer/beer-for-dummies-cheat-sheet/
*	Aggressive (Boldly assertive aroma and/or taste) 
*	Balanced: Malt and hops in similar proportions; equal representation of malt sweetness and hop bitterness in the flavor — especially at the finish
*	Complex: Multidimensional; many flavors and sensations on the palate
*	Crisp: Highly carbonated; effervescent
*	Fruity: Flavors reminiscent of various fruits or Hoppy: Herbal, earthy, spicy, or citric aromas and flavors of hops or Malty: Grainy, caramel-like; can be sweet or dry
*	Robust: Rich and full-bodied

A word frequency analysis of beer reviews may be a better way to find important attributes. 
Assume that a customer has specified three attributes of the product as being important to him or her. 


---





In [None]:
#Start by using the processed data
parsed_beer_df = pd.read_csv('parsed_reviews.csv')

In [None]:
#Understand the contents of the reviews DF
parsed_beer_df

Unnamed: 0.1,Unnamed: 0,reviewer,rating,att_scores,full_review,beer_name,beer_detail,score
0,0,4DAloveofSTOUT from Illinois,4.73/5 rDev +0.2%,look: 4 | smell: 4.75 | taste: 4.75 | feel: 5 ...,2017 Vintage This is up there with some of the...,A Deal With The Devil - Double Oak-Aged,A Deal With The Devil - Double Oak-Aged\nAncho...,4.73
1,0,Thomas_Wikman from Texas,4.59/5 rDev +2.5%,look: 4.75 | smell: 4.5 | taste: 4.75 | feel: ...,"It is truly delicious. A little bit foamy, the...",Darkstar November,Darkstar November\nBottle Logic Brewing,4.59
2,0,HattedClassic from Virginia,4.36/5 rDev -2.9%,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4 |...,The beer pours a nice creamy brown head that d...,Last Snow,Last Snow\nFunky Buddha Brewery,4.36
3,0,rodmanfor3 from Vermont,4.79/5 rDev +5.7%,look: 4.5 | smell: 4.75 | taste: 4.75 | feel: ...,L: Golden yellow pour with minimal head and vi...,Art,Art\nHill Farmstead Brewery,4.79
4,0,Sachin-Tendulkar from New York,4.8/5 rDev +2.6%,look: 4.5 | smell: 4.5 | taste: 5 | feel: 5 | ...,Canned 08/26/20 Pours a hazy orange with two f...,Julius,Julius\nTree House Brewing Company,4.80
...,...,...,...,...,...,...,...,...
6222,0,thedaveofbeer from Massachusetts,4.87/5 rDev +5.4%,look: 5 | smell: 4.75 | taste: 5 | feel: 4.75 ...,It took a lot to get one of these so I was anx...,Atrial Rubicite,Atrial Rubicite\nJester King Brewery,4.87
6223,0,imnodoctorbut from Texas,4.54/5 rDev -1.5%,look: 3.75 | smell: 4 | taste: 4.75 | feel: 5 ...,"tallboy 4-pack, dated best by Sept; it's 6/13 ...",King Sue,King Sue\nToppling Goliath Brewing Company,4.54
6224,0,TrilliumFan from Massachusetts,4.75/5 rDev +2.2%,look: 4.75 | smell: 4.75 | taste: 4.75 | feel:...,"On tap at Moeder Lambic, first one in a lambic...",Lou Pepe - Kriek,Lou Pepe - Kriek\nBrasserie Cantillon,4.75
6225,0,detgfrsh from Texas,4.34/5 rDev -4%,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4.2...,From a tallboy can packaged 5/18/21. Cloudy go...,Pseudo Sue,Pseudo Sue\nToppling Goliath Brewing Company,4.34


In [None]:
list_words = []

#List (bag) of words, but each word is only included once per review 
for rows in parsed_beer_df['full_review']:
  list_words.extend(set(word_tokenize(rows.lower())))

#Create dictionary with the Counter of the words
dict_word_count = Counter(list_words)

#Convert to a dataframe and download
word_count_df = pd.DataFrame.from_dict(dict_word_count, orient='index', columns = ['Count'])
word_count_df.to_csv('word_count.csv')
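
The counts above include very common words. As a rough sketch of how candidate attributes might be surfaced more directly, the example below counts in how many reviews each word appears (once per review, as above) while skipping stop words. It uses a simple regular-expression tokenizer instead of `word_tokenize`, and both the tiny stop-word list and the sample reviews are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a fuller list such as
# NLTK's English stop words would normally be used).
STOP_WORDS = {'a', 'an', 'the', 'and', 'of', 'with', 'is', 'it', 'this', 'to'}


def document_frequency(reviews):
    """Count in how many reviews each non-stop-word token appears."""
    counts = Counter()
    for review in reviews:
        # A set keeps each word at most once per review
        tokens = set(re.findall(r"[a-z']+", review.lower()))
        counts.update(tokens - STOP_WORDS)
    return counts


# Made-up reviews, for illustration only
sample_reviews = [
    'Pours a hazy orange with a sweet aroma.',
    'Sweet taste, and the aroma is sweet too.',
    'The taste is dry.',
]
df_counts = document_frequency(sample_reviews)
print(df_counts.most_common(3))
```

Counting each word once per review measures how many reviewers mention an attribute, which is less sensitive to a single long review repeating the same word.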

In [None]:
# proposed attributes for beer
BEER_ATTRIBUTES = ['taste', 'pours', 'sweet', 'dark', 'carbonation', 'overall',
                   'mouthfeel', 'aroma', 'body', 'black', 'light', 'vanilla',
                   'medium', 'smooth', 'flavor', 'fruit', 'thick', 'feel',
                   'smell', 'flavors', 'bitterness', 'sweetness', 'creamy',
                   'dry', 'caramel', 'balanced']

word_count_df.loc[BEER_ATTRIBUTES,:]

Unnamed: 0,Count
taste,2646
pours,1781
sweet,1680
dark,1665
carbonation,1644
overall,1507
mouthfeel,1493
aroma,1480
body,1430
black,1385


## Task C. 

Perform a similarity analysis using cosine similarity (without word embeddings) with the 3 attributes specified by the customer and the reviews. From the output file, calculate the average similarity between each product and the preferred attributes. 
For similarity analysis, use cosine similarity with bag of words. The script should accept as input a file with the product attributes, and calculate similarity scores (between 0 and 1) between these attributes and each review. That is, the output file should have 3 columns – product_name (for each product, the product_name will repeat as many times as there are reviews of the product), product_review and similarity_score. 
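
As a point of reference, a plain bag-of-words cosine similarity over the full vocabulary can be sketched as follows. This is a minimal standard-library illustration, not the approach used in the cells below (which restrict the vocabulary to the three attributes and use 0/1 indicators). The function name `bow_cosine` and the sample review are assumptions.

```python
import math
import re
from collections import Counter


def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(re.findall(r"[a-z']+", text_a.lower()))
    counts_b = Counter(re.findall(r"[a-z']+", text_b.lower()))
    # Only words present in both texts contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


attributes = 'taste carbonation aroma'
review = 'great aroma and great taste'  # made-up review
score = bow_cosine(attributes, review)
print(round(score, 4))
```

Because counts are never negative, the score always falls between 0 and 1, as the task requires. With the full vocabulary, long reviews are penalized by their larger norm even when they mention every attribute.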


---




In [None]:
#Read parsed reviews
parsed_beer_df = pd.read_csv('parsed_reviews.csv')

# Convert to lowercase
parsed_beer_df['full_review'] = parsed_beer_df['full_review'].str.lower()

#List of attributes to work with (Mutable)
USER_BEER_ATTRIBUTES = ['taste', 'carbonation', 'aroma']

In [None]:
parsed_beer_df

Unnamed: 0.1,Unnamed: 0,reviewer,rating,att_scores,full_review,beer_name,beer_detail,score
0,0,4DAloveofSTOUT from Illinois,4.73/5 rDev +0.2%,look: 4 | smell: 4.75 | taste: 4.75 | feel: 5 ...,2017 vintage this is up there with some of the...,A Deal With The Devil - Double Oak-Aged,A Deal With The Devil - Double Oak-Aged\nAncho...,4.73
1,0,Thomas_Wikman from Texas,4.59/5 rDev +2.5%,look: 4.75 | smell: 4.5 | taste: 4.75 | feel: ...,"it is truly delicious. a little bit foamy, the...",Darkstar November,Darkstar November\nBottle Logic Brewing,4.59
2,0,HattedClassic from Virginia,4.36/5 rDev -2.9%,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4 |...,the beer pours a nice creamy brown head that d...,Last Snow,Last Snow\nFunky Buddha Brewery,4.36
3,0,rodmanfor3 from Vermont,4.79/5 rDev +5.7%,look: 4.5 | smell: 4.75 | taste: 4.75 | feel: ...,l: golden yellow pour with minimal head and vi...,Art,Art\nHill Farmstead Brewery,4.79
4,0,Sachin-Tendulkar from New York,4.8/5 rDev +2.6%,look: 4.5 | smell: 4.5 | taste: 5 | feel: 5 | ...,canned 08/26/20 pours a hazy orange with two f...,Julius,Julius\nTree House Brewing Company,4.80
...,...,...,...,...,...,...,...,...
6222,0,thedaveofbeer from Massachusetts,4.87/5 rDev +5.4%,look: 5 | smell: 4.75 | taste: 5 | feel: 4.75 ...,it took a lot to get one of these so i was anx...,Atrial Rubicite,Atrial Rubicite\nJester King Brewery,4.87
6223,0,imnodoctorbut from Texas,4.54/5 rDev -1.5%,look: 3.75 | smell: 4 | taste: 4.75 | feel: 5 ...,"tallboy 4-pack, dated best by sept; it's 6/13 ...",King Sue,King Sue\nToppling Goliath Brewing Company,4.54
6224,0,TrilliumFan from Massachusetts,4.75/5 rDev +2.2%,look: 4.75 | smell: 4.75 | taste: 4.75 | feel:...,"on tap at moeder lambic, first one in a lambic...",Lou Pepe - Kriek,Lou Pepe - Kriek\nBrasserie Cantillon,4.75
6225,0,detgfrsh from Texas,4.34/5 rDev -4%,look: 4 | smell: 4.25 | taste: 4.5 | feel: 4.2...,from a tallboy can packaged 5/18/21. cloudy go...,Pseudo Sue,Pseudo Sue\nToppling Goliath Brewing Company,4.34


In [None]:
import re

def find_messages_with_str(df, *argv):
  """Return a NEW dataframe with the reviews that contain all of the given
  words (word1 & word2 & ... wordn). Matching is by substring, so 'taste'
  also matches 'aftertaste'.

  Parameters:
         df = Dataframe with a 'full_review' column
      *argv = Every argument is a different word to be matched
  """

  #Create a regular expression pattern containing all words
  regex_pattern = ''
  for arg in argv:
    regex_pattern += '(?=.*' + re.escape(arg) + ')'

  #filter the dataframe through the regular expression and return the new df
  result = df['full_review'].str.contains(pat = regex_pattern, regex=True)
  #copy so that adding the indicator column does not modify a view of df
  filtered_df = df.loc[result].copy()
  #flag the rows; the column is named after the first word only
  filtered_df[str(argv[0])] = 1
  return filtered_df



#Find reviews with words in the user attributes
tables = []

for queryWord in USER_BEER_ATTRIBUTES:
  temp_df = find_messages_with_str(parsed_beer_df, queryWord)
  print(temp_df.shape)
  tables.append(temp_df)

(3204, 9)
(1651, 9)
(1964, 9)
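
The counts above come from substring matching, so a review containing only "aftertaste" is also counted for "taste". If whole-word matching were preferred, a word-boundary pattern is one option. The sketch below is an alternative, not what the cell above does, and the helper name `mentions_word` is hypothetical.

```python
import re


def mentions_word(review, word):
    """True if `word` appears as a whole word in `review` (case-insensitive)."""
    # \b anchors the match at word boundaries; re.escape guards special characters
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, review, flags=re.IGNORECASE) is not None


print(mentions_word('a lingering aftertaste', 'taste'))  # substring only -> False
print(mentions_word('the taste is superb', 'taste'))     # whole word -> True
```

Whole-word matching is stricter in both directions: it also skips inflected forms such as "tastes", so the choice depends on how literally the customer's attribute should be read.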


In [None]:
new_df = tables[0]
#Loop over the 3 different dfs for each attribute
for idx in range (1, len(tables)):
  # Get a smaller dataframe with just the columns of interest
  temp_df = tables[idx][ ['full_review', 'beer_name', 'score', USER_BEER_ATTRIBUTES[idx]] ]

  # rename the columns to be able to identify them later
  new_full_review_name = 'full_review' + str(idx)
  new_beer_name_name = 'beer_name' + str(idx)
  score_name = 'score' + str(idx)
  temp_df = temp_df.rename(columns={'full_review': new_full_review_name, 'beer_name': new_beer_name_name, 'score': score_name})

  #join the dataframes together
  new_df = pd.concat([new_df, temp_df], axis=1)

#Finally get the dataframe with the (ordered) columns of interest
new_df = new_df[ ['full_review', 'full_review1', 'full_review2', 'beer_name', 'beer_name1', 'beer_name2', 'score', 'score1', 'score2', USER_BEER_ATTRIBUTES[0], USER_BEER_ATTRIBUTES[1], USER_BEER_ATTRIBUTES[2]]] 
new_df

Unnamed: 0,full_review,full_review1,full_review2,beer_name,beer_name1,beer_name2,score,score1,score2,taste,carbonation,aroma
1,,,"it is truly delicious. a little bit foamy, the...",,,Darkstar November,,,4.59,,,1.0
2,the beer pours a nice creamy brown head that d...,,,Last Snow,,,4.36,,,1.0,,
3,,l: golden yellow pour with minimal head and vi...,,,Art,,,4.79,,,1.0,
4,canned 08/26/20 pours a hazy orange with two f...,,,Julius,,,4.80,,,1.0,,
5,serving: 16 oz can (“pkg05/07/21… best by 09/0...,serving: 16 oz can (“pkg05/07/21… best by 09/0...,serving: 16 oz can (“pkg05/07/21… best by 09/0...,King Sue,King Sue,King Sue,4.30,4.30,4.30,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6220,"nice looking stout. smell is of bourbon, sweet...",,,Space Trace,,,4.20,,,1.0,,
6222,,,it took a lot to get one of these so i was anx...,,,Atrial Rubicite,,,4.87,,,1.0
6223,"tallboy 4-pack, dated best by sept; it's 6/13 ...","tallboy 4-pack, dated best by sept; it's 6/13 ...","tallboy 4-pack, dated best by sept; it's 6/13 ...",King Sue,King Sue,King Sue,4.54,4.54,4.54,1.0,1.0,1.0
6224,"on tap at moeder lambic, first one in a lambic...",,,Lou Pepe - Kriek,,,4.75,,,1.0,,


Notice that when a review mentions only 1 or 2 of the attributes, the columns for the missing attributes are NaN. We can fix this:

In [None]:
#Remove NaN accordingly: empty string for text columns, 0 for numeric ones
#(assigning the result back avoids chained in-place fillna on single columns)
text_cols = ['full_review', 'full_review1', 'full_review2', 'beer_name', 'beer_name1', 'beer_name2']
numeric_cols = ['score', 'score1', 'score2'] + USER_BEER_ATTRIBUTES
new_df[text_cols] = new_df[text_cols].fillna('')
new_df[numeric_cols] = new_df[numeric_cols].fillna(0)
new_df

Unnamed: 0,full_review,full_review1,full_review2,beer_name,beer_name1,beer_name2,score,score1,score2,taste,carbonation,aroma
1,,,"it is truly delicious. a little bit foamy, the...",,,Darkstar November,0.00,0.00,4.59,0.0,0.0,1.0
2,the beer pours a nice creamy brown head that d...,,,Last Snow,,,4.36,0.00,0.00,1.0,0.0,0.0
3,,l: golden yellow pour with minimal head and vi...,,,Art,,0.00,4.79,0.00,0.0,1.0,0.0
4,canned 08/26/20 pours a hazy orange with two f...,,,Julius,,,4.80,0.00,0.00,1.0,0.0,0.0
5,serving: 16 oz can (“pkg05/07/21… best by 09/0...,serving: 16 oz can (“pkg05/07/21… best by 09/0...,serving: 16 oz can (“pkg05/07/21… best by 09/0...,King Sue,King Sue,King Sue,4.30,4.30,4.30,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
6220,"nice looking stout. smell is of bourbon, sweet...",,,Space Trace,,,4.20,0.00,0.00,1.0,0.0,0.0
6222,,,it took a lot to get one of these so i was anx...,,,Atrial Rubicite,0.00,0.00,4.87,0.0,0.0,1.0
6223,"tallboy 4-pack, dated best by sept; it's 6/13 ...","tallboy 4-pack, dated best by sept; it's 6/13 ...","tallboy 4-pack, dated best by sept; it's 6/13 ...",King Sue,King Sue,King Sue,4.54,4.54,4.54,1.0,1.0,1.0
6224,"on tap at moeder lambic, first one in a lambic...",,,Lou Pepe - Kriek,,,4.75,0.00,0.00,1.0,0.0,0.0


In [None]:
#Combine repeated columns into 1 (max picks the filled value, since the missing ones are '' or 0)
new_df['full_review'] = new_df[['full_review','full_review1','full_review2']].max(axis = 1)
new_df['beer_name'] = new_df[['beer_name','beer_name1','beer_name2']].max(axis = 1)
new_df['score'] = new_df[['score','score1','score2']].max(axis = 1)
new_df.drop(columns = ['full_review1', 'full_review2', 'beer_name1', 'beer_name2', 'score1', 'score2'], inplace=True)
new_df

Unnamed: 0,full_review,beer_name,score,taste,carbonation,aroma
1,"it is truly delicious. a little bit foamy, the...",Darkstar November,4.59,0.0,0.0,1.0
2,the beer pours a nice creamy brown head that d...,Last Snow,4.36,1.0,0.0,0.0
3,l: golden yellow pour with minimal head and vi...,Art,4.79,0.0,1.0,0.0
4,canned 08/26/20 pours a hazy orange with two f...,Julius,4.80,1.0,0.0,0.0
5,serving: 16 oz can (“pkg05/07/21… best by 09/0...,King Sue,4.30,1.0,1.0,1.0
...,...,...,...,...,...,...
6220,"nice looking stout. smell is of bourbon, sweet...",Space Trace,4.20,1.0,0.0,0.0
6222,it took a lot to get one of these so i was anx...,Atrial Rubicite,4.87,0.0,0.0,1.0
6223,"tallboy 4-pack, dated best by sept; it's 6/13 ...",King Sue,4.54,1.0,1.0,1.0
6224,"on tap at moeder lambic, first one in a lambic...",Lou Pepe - Kriek,4.75,1.0,0.0,0.0


In [None]:
def cosine_similarity(v1, v2):
  dot_product = np.dot(v1, v2)
  return dot_product / (np.linalg.norm(v1) * np.linalg.norm(v2))


attribute_vector = np.array([1,1,1])

#Code to calculate cosine similarity between (1,1,1) and the reviews
similarity_list = []
for index, row in new_df.iterrows():
    vector = np.array([ row[USER_BEER_ATTRIBUTES[0]], row[USER_BEER_ATTRIBUTES[1]], row[USER_BEER_ATTRIBUTES[2]] ])
    similarity = cosine_similarity(attribute_vector, vector)
    similarity_list.append(similarity)

new_df['similarity'] = similarity_list

In [None]:
#Final output
new_df

Unnamed: 0,full_review,beer_name,score,taste,carbonation,aroma,similarity
1,"it is truly delicious. a little bit foamy, the...",Darkstar November,4.59,0.0,0.0,1.0,0.57735
2,the beer pours a nice creamy brown head that d...,Last Snow,4.36,1.0,0.0,0.0,0.57735
3,l: golden yellow pour with minimal head and vi...,Art,4.79,0.0,1.0,0.0,0.57735
4,canned 08/26/20 pours a hazy orange with two f...,Julius,4.80,1.0,0.0,0.0,0.57735
5,serving: 16 oz can (“pkg05/07/21… best by 09/0...,King Sue,4.30,1.0,1.0,1.0,1.00000
...,...,...,...,...,...,...,...
6220,"nice looking stout. smell is of bourbon, sweet...",Space Trace,4.20,1.0,0.0,0.0,0.57735
6222,it took a lot to get one of these so i was anx...,Atrial Rubicite,4.87,0.0,0.0,1.0,0.57735
6223,"tallboy 4-pack, dated best by sept; it's 6/13 ...",King Sue,4.54,1.0,1.0,1.0,1.00000
6224,"on tap at moeder lambic, first one in a lambic...",Lou Pepe - Kriek,4.75,1.0,0.0,0.0,0.57735
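
Only three distinct similarity values appear in the table because each review is reduced to a 0/1 vector over the three attributes. For a review that mentions k of the 3 attributes, the cosine with (1,1,1) is k / (sqrt(3) * sqrt(k)) = sqrt(k/3). A short check of that arithmetic (the function name `indicator_similarity` is hypothetical):

```python
import math


def indicator_similarity(num_matched, num_attributes=3):
    """Cosine between an all-ones vector and a 0/1 vector with `num_matched` ones."""
    if num_matched == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    # dot product = num_matched; norms are sqrt(num_attributes) and sqrt(num_matched)
    return num_matched / (math.sqrt(num_attributes) * math.sqrt(num_matched))


for k in (1, 2, 3):
    print(k, round(indicator_similarity(k), 6))
```

This reproduces the values 0.57735, 0.816497 and 1.0 shown in the outputs, and makes clear that the score depends only on how many attributes a review mentions, not on how often or in what context.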


In [None]:
#Download output
new_df.to_csv('new_df_similarity.csv') 

## Task D. 

For every review, perform a sentiment analysis. 


---



In [None]:
# Sort by similarity and keep only the 1,000 most similar reviews
new_df_sorted = new_df.sort_values('similarity', ascending=False).head(1000)

In [None]:
new_df_sorted

Unnamed: 0,full_review,beer_name,score,taste,carbonation,aroma,similarity
582,"holy crap is this rated high, i had no idea. w...",Saison Bernice,4.00,1.0,1.0,1.0,1.000000
5851,bottle: poured alight peachy color ale with a ...,Abricot Du Fermier,4.22,1.0,1.0,1.0,1.000000
5854,"i mean, damn near perfect. pours a nice medium...",Pseudo Sue,4.64,1.0,1.0,1.0,1.000000
3264,"recently canned, and it feels even better than...",Double Dry Hopped Double Mosaic Daydream,4.46,1.0,1.0,1.0,1.000000
701,"pours a hazy golden with a thick, creamy white...",Very Hazy,4.58,1.0,1.0,1.0,1.000000
...,...,...,...,...,...,...,...
4161,"on tap at armsby abbey hazy orange liquid, on...",Society & Solitude #6,4.54,1.0,0.0,1.0,0.816497
4219,enjoyed at the copenhagen beer fest. reviewed ...,Mexican Brunch,4.74,1.0,0.0,1.0,0.816497
3819,look: this pours black with almost no carbonat...,Bourbon Paradise,4.05,1.0,1.0,0.0,0.816497
3816,"canned 8/5/20, purchased 8/22 and drunk the ne...",Gggreennn!,4.60,1.0,0.0,1.0,0.816497


In [None]:
#Calculate sentiment analysis on each of the top 1000 reviews

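#Create the analyzer once and reuse it for every review
analyzer = SentimentIntensityAnalyzer()

def get_sentiment_score(review):
  #compound is VADER's normalized overall score, between -1 and 1
  sentiment_score = analyzer.polarity_scores(review)
  return sentiment_score['compound']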

new_df_sorted['sentiment_score'] = new_df_sorted['full_review'].map(get_sentiment_score)
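
The compound score ranges from -1 (most negative) to +1 (most positive). If a categorical label is wanted for tables, one option is to bucket the score. The sketch below uses the ±0.05 cut-offs commonly suggested in the VADER documentation. The helper name `label_from_compound` is hypothetical, and the threshold is a convention, not something fixed by this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Two compound scores taken from the output table, plus two made-up ones
for value in (0.9252, 0.5267, 0.0, -0.4):
    print(value, label_from_compound(value))
```

Since the reviews come from a top-rated list, nearly all labels are likely to be positive, so the raw compound score carries more ranking information than the label.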

In [None]:
#Final output
new_df_sorted

Unnamed: 0,full_review,beer_name,score,taste,carbonation,aroma,similarity,sentiment_score
582,"holy crap is this rated high, i had no idea. w...",Saison Bernice,4.00,1.0,1.0,1.0,1.000000,0.9252
5851,bottle: poured alight peachy color ale with a ...,Abricot Du Fermier,4.22,1.0,1.0,1.0,1.000000,0.9482
5854,"i mean, damn near perfect. pours a nice medium...",Pseudo Sue,4.64,1.0,1.0,1.0,1.000000,0.9670
3264,"recently canned, and it feels even better than...",Double Dry Hopped Double Mosaic Daydream,4.46,1.0,1.0,1.0,1.000000,0.9904
701,"pours a hazy golden with a thick, creamy white...",Very Hazy,4.58,1.0,1.0,1.0,1.000000,0.9640
...,...,...,...,...,...,...,...,...
4161,"on tap at armsby abbey hazy orange liquid, on...",Society & Solitude #6,4.54,1.0,0.0,1.0,0.816497,0.5267
4219,enjoyed at the copenhagen beer fest. reviewed ...,Mexican Brunch,4.74,1.0,0.0,1.0,0.816497,0.9790
3819,look: this pours black with almost no carbonat...,Bourbon Paradise,4.05,1.0,1.0,0.0,0.816497,0.6874
3816,"canned 8/5/20, purchased 8/22 and drunk the ne...",Gggreennn!,4.60,1.0,0.0,1.0,0.816497,0.9680


## Task E. 

Assume an evaluation score for each beer = average similarity score + average sentiment score. 
Now recommend 3 products to the customer. 
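
The scoring rule can be sketched on toy data as follows. The beer names and numbers are hypothetical, and the standard library is used here only to keep the example self-contained; the notebook itself does the same thing with a pandas `groupby`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-review rows: (beer_name, similarity, sentiment)
rows = [
    ('Beer A', 1.0, 0.75),
    ('Beer A', 0.5, 0.25),
    ('Beer B', 1.0, 0.5),
    ('Beer C', 0.5, 0.875),
]


def top_products(rows, n=3):
    """Rank beers by average similarity + average sentiment."""
    by_beer = defaultdict(list)
    for name, similarity, sentiment in rows:
        by_beer[name].append((similarity, sentiment))
    scores = {
        name: mean(s for s, _ in vals) + mean(t for _, t in vals)
        for name, vals in by_beer.items()
    }
    # Highest evaluation score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


ranking = top_products(rows)
print(ranking)
```

Averaging the per-review sum gives the same result as summing the two averages, so either order of operations works. A beer with very few reviews can rank highly on a single enthusiastic review; requiring a minimum number of reviews is one possible guard.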


---




In [None]:
#Calculate evaluation score as the sum of 2 scores
new_df_sorted['evaluation_score'] = new_df_sorted['similarity'] + new_df_sorted['sentiment_score']
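#numeric_only=True averages only the numeric columns (the review text cannot be averaged)
bow_top_3_recommend = new_df_sorted.groupby(['beer_name']).mean(numeric_only=True).sort_values('evaluation_score', ascending=False).head(3)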
bow_top_3_recommend

Unnamed: 0_level_0,score,taste,carbonation,aroma,similarity,sentiment_score,evaluation_score
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Fuzzy,4.356667,1.0,1.0,1.0,1.0,0.9918,1.9918
Upper Case,4.38,1.0,1.0,1.0,1.0,0.9781,1.9781
Scaled Way Up,4.28,1.0,1.0,1.0,1.0,0.9752,1.9752


**The top 3 beers we recommend based on the evaluation score with BOW similarity are Fuzzy, Upper Case, and Scaled Way Up.**

These are the individual reviews that went into the evaluation_score calculation:

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Fuzzy']:
  print(review, "\n")

had this at the firestone walker 2017 invitational. the long line paired well with the hangover i was nursing in the heat. pours innocent and yellow with a chunkee or two floating around and a super thin mild head that was gone before i took a sip. wow, what a great aroma. peach, brett, barrel, funky yet dry. pretty top notch in that department.  taste, oh man i could drill this all day. nice soft white peach envelopes the mouth, not overly acidic or sour, just right. carbonation on point, the nice little white wine notes are complimentary. it is definitely upper echelon fruited wild ale territory. doubt i would ever trade for such a thing because it is so damn valued and there are comparable beers like this in my neck of the woods (de garde- the peach). its almost as good as o'so's peach wild ale. so i'd say it is running #2 in the midwest, which is still a huge compliment. peach and sour fans will dig it. aug 31, 2017 

375ml bottle into a teku blend 4 (aka vintage 2020) a: pops and 

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Upper Case']:
  print(review, "\n")

traded for this a while ago, as the date on the bottom of the can read 10/31/16 (yes, i know it's old!) with "happy halloween" written below it. sure, this beer was quite filling but the flavors, stickiness, and lack of any massive sediment at the bottom of the liquid led me to believe that this was a quality brew, even down to the last sip! this beer looked like nectar as i poured it out with the waxy, hazy liquid being quite appealing and thicker than anything that i've had from trillium. hardly any lacing was left behind as there were just a few spots below a faint thin ring near the top of my pint glass. lots of mango, papaya, ripe fruit, and tropical hops in the nose as this had quite a heavy aroma, albeit not anything terribly strong. more hops, grape, white wine, faint wood, and hints of a wild ale came through in the taste as this was much more complex than i was first led to believe. this malted out for sure but the pilsner, and flaked wheat malt held on along with some of the

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Scaled Way Up']:
  print(review, "\n")

thanks for the can kevin! a- tallboy to a small snifter with a murky yellow-orange body and a one finger, frothy cap. decent head retention as the crown falls to a small ring around the edges. spotty lacing is seen rarely around the glass. s- bright orange juices and juicy citrus fruits (mandarin, tangerine) come out on the front end of the aroma and burst out of the glass. plenty of other hops to back that with peach, bubble gum, mango, apricot, papaya and dank weedy herbs. hops hops hops. bright, clean and fresh. t- the citrus hops pick up a lot and shift more towards the grapefruit and orange part of the spectrum but stay fresh, juicy and zesty. the bitterness also picks up and is on the level of biting into a citrus peel, intense and sharp. peach and mango are downsized here and bubble gum, dank weed, pine resin, berries, bitter seville orange marmalade and hop powder astringency join in as well.  mf- smooth creaminess and puffy texture to the medium bodied ale. a high level of car

## Task F. 

How would your recommendation change if you use word vectors (the spaCy package would be the easiest to use with pretrained word vectors) instead of plain vanilla bag-of-words cosine similarity? One way to analyze the difference would be to consider the % of reviews that mention a preferred attribute. E.g., if you recommend a product, what % of its reviews mention an attribute specified by the customer? Do you see any difference across bag-of-words and word vector approaches? This article may be useful: https://medium.com/swlh/word-embeddings-versus-bag-of-words-the-curious-case-of-recommender-systems-6ac1604d4424?source=friends_link&sk=d746da9f094d1222a35519387afc6338
Note that the article doesn’t claim that bag-of-words will always be better than word embeddings for recommender systems. It lays out conditions under which it is likely to be the case. That is, depending on the attributes you use, you may or may not see the same effect. 
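
The "% of reviews that mention a preferred attribute" measure could be computed along these lines. This is a sketch under stated assumptions: the function name `mention_rate` is hypothetical, the sample reviews are made up, a "mention" is taken to mean a whole-word match of at least one attribute, and the input is a plain dict instead of the notebook's dataframe.

```python
import re


def mention_rate(reviews_by_beer, attributes):
    """Percentage of each beer's reviews that mention at least one attribute."""
    patterns = [
        re.compile(r'\b' + re.escape(attr) + r'\b', re.IGNORECASE)
        for attr in attributes
    ]
    rates = {}
    for beer, reviews in reviews_by_beer.items():
        hits = sum(1 for review in reviews if any(p.search(review) for p in patterns))
        rates[beer] = 100.0 * hits / len(reviews) if reviews else 0.0
    return rates


# Made-up reviews, for illustration only
sample = {
    'Beer X': ['great aroma and taste', 'nice label on the bottle'],
    'Beer Y': ['lively carbonation', 'the aroma is floral'],
}
rates = mention_rate(sample, ['taste', 'carbonation', 'aroma'])
print(rates)
```

Comparing this percentage for the products recommended by each approach gives a direct check of whether a high similarity score reflects explicit mentions of the customer's attributes.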


---




In [None]:
#download spacy en_core_web_lg
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
# Load necessary imports for word embeddings with spacy
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

# Calculate the spacy similarity between the 3 attributes and the reviews
# The attribute document is the same for every review, so build it once
attribute_doc = nlp(' '.join(USER_BEER_ATTRIBUTES))
similarity_list = []
for review in new_df_sorted['full_review']:
  review_doc = nlp(review)
  spacy_similarity = review_doc.similarity(attribute_doc)
  similarity_list.append(spacy_similarity)

new_df_sorted['spacy_similarity'] = similarity_list
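
By default, spaCy's `Doc.similarity` compares document vectors with cosine similarity, and a document vector is the average of its token vectors. The toy sketch below imitates that computation with invented 3-dimensional vectors, to show why a review can score high without containing any of the attribute words. The vectors and helper names are hypothetical and have nothing to do with the real `en_core_web_lg` values.

```python
import math

# Hypothetical word vectors, invented for illustration only
TOY_VECTORS = {
    'taste': [1.0, 0.0, 0.0],
    'flavor': [0.9, 0.1, 0.0],
    'aroma': [0.0, 1.0, 0.0],
    'smell': [0.1, 0.9, 0.0],
    'bottle': [0.0, 0.0, 1.0],
}


def doc_vector(text):
    """Average the vectors of the known words in `text`."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query = doc_vector('taste aroma')
print(cosine(query, doc_vector('flavor smell')))  # near 1 despite no shared words
print(cosine(query, doc_vector('bottle')))        # 0.0, unrelated direction
```

In this toy setting a review that says "flavor smell" is almost indistinguishable from one that says "taste aroma", whereas its bag-of-words similarity would be 0. That is the kind of difference Task F asks about.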

In [None]:
new_df_sorted

Unnamed: 0,full_review,beer_name,score,taste,carbonation,aroma,similarity,sentiment_score,evaluation_score,spacy_similarity
582,"holy crap is this rated high, i had no idea. w...",Saison Bernice,4.00,1.0,1.0,1.0,1.000000,0.9252,1.925200,0.467122
5851,bottle: poured alight peachy color ale with a ...,Abricot Du Fermier,4.22,1.0,1.0,1.0,1.000000,0.9482,1.948200,0.634302
5854,"i mean, damn near perfect. pours a nice medium...",Pseudo Sue,4.64,1.0,1.0,1.0,1.000000,0.9670,1.967000,0.571053
3264,"recently canned, and it feels even better than...",Double Dry Hopped Double Mosaic Daydream,4.46,1.0,1.0,1.0,1.000000,0.9904,1.990400,0.545924
701,"pours a hazy golden with a thick, creamy white...",Very Hazy,4.58,1.0,1.0,1.0,1.000000,0.9640,1.964000,0.644332
...,...,...,...,...,...,...,...,...,...,...
4161,"on tap at armsby abbey hazy orange liquid, on...",Society & Solitude #6,4.54,1.0,0.0,1.0,0.816497,0.5267,1.343197,0.659910
4219,enjoyed at the copenhagen beer fest. reviewed ...,Mexican Brunch,4.74,1.0,0.0,1.0,0.816497,0.9790,1.795497,0.531322
3819,look: this pours black with almost no carbonat...,Bourbon Paradise,4.05,1.0,1.0,0.0,0.816497,0.6874,1.503897,0.527160
3816,"canned 8/5/20, purchased 8/22 and drunk the ne...",Gggreennn!,4.60,1.0,0.0,1.0,0.816497,0.9680,1.784497,0.509815


In [None]:
# Calculate a single evaluation score
new_df_sorted['spacy_evaluation_score'] = new_df_sorted['spacy_similarity'] + new_df_sorted['sentiment_score']
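# numeric_only=True keeps the text column (full_review) out of the mean on newer pandas versions
spacy_top_3_recommend = new_df_sorted.groupby(['beer_name']).mean(numeric_only=True).sort_values('spacy_evaluation_score', ascending=False).head(3)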

In [None]:
spacy_top_3_recommend

Unnamed: 0_level_0,score,taste,carbonation,aroma,similarity,sentiment_score,evaluation_score,spacy_similarity,spacy_evaluation_score
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Scaled Way Up,4.28,1.0,1.0,1.0,1.0,0.9752,1.9752,0.620603,1.595803
Pseudo Sue,4.48,1.0,0.833333,1.0,0.969416,0.9543,1.923716,0.616153,1.570453
Beer Geek Vanilla Shake - Bourbon Barrel-Aged,4.456,1.0,0.6,1.0,0.926599,0.94632,1.872919,0.621647,1.567967


**The top 3 recommended beers using the spaCy similarity are: Scaled Way Up, Pseudo Sue, and Beer Geek Vanilla Shake - Bourbon Barrel-Aged.** Notice that "Scaled Way Up" is the only beer recommended by both approaches.

**Note:** The field "score" is the original score from the reviews, whereas sentiment_score, evaluation_score, and spacy_evaluation_score are all calculated by us.

These are the individual reviews that were considered in this calculation. 

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Scaled Way Up']:
  print(review, "\n")

thanks for the can kevin! a- tallboy to a small snifter with a murky yellow-orange body and a one finger, frothy cap. decent head retention as the crown falls to a small ring around the edges. spotty lacing is seen rarely around the glass. s- bright orange juices and juicy citrus fruits (mandarin, tangerine) come out on the front end of the aroma and burst out of the glass. plenty of other hops to back that with peach, bubble gum, mango, apricot, papaya and dank weedy herbs. hops hops hops. bright, clean and fresh. t- the citrus hops pick up a lot and shift more towards the grapefruit and orange part of the spectrum but stay fresh, juicy and zesty. the bitterness also picks up and is on the level of biting into a citrus peel, intense and sharp. peach and mango are downsized here and bubble gum, dank weed, pine resin, berries, bitter seville orange marmalade and hop powder astringency join in as well.  mf- smooth creaminess and puffy texture to the medium bodied ale. a high level of car

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Pseudo Sue']:
  print(review, "\n")

i mean, damn near perfect. pours a nice medium orange, good head and carbonation. the aroma wafts of tropical fruits and herbal hops. the flavor / taste is probably the most complex of all the pale ale’s i’ve tried: a panoply of tropical and citrus flavors, herbal, sneaky hoppy, perfect balance. i don’t know whether i like this or zombie better because they are sorta different animals but god this is great, cheers! may 08, 2021 

pours a beautiful hazy orange with a huge creamy head.  all citrus in the aroma.  flavor is orange and mango finishes bitter grapefruit peel and pine.  medium body with lively carbonation.  if you want to know what citra hops taste like try this one. great hoppy beer with just the right amount of bitterness apr 25, 2021 

poured from a can into a pint glass appearance – the beer pours a super hazy yellow color with a one finger head of pure white puffy foam. the head has a great level of retention, slowly fading over time to leave a decent amount of foamy laci

In [None]:
for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == 'Beer Geek Vanilla Shake - Bourbon Barrel-Aged']:
  print(review, "\n")

500 ml bottle into tulip glass, no bottle dating. pours dense pitch black color with a 1 finger dense and rocky tan head with great retention, that reduces to a nice cap that lasts. great spotty soapy lacing clings on the glass. aromas and flavors of huge milk/dark chocolate, cocoa, caramel, brown sugar, coffee, vanilla, bourbon, coconut, toasted oak, and dark/brown bread; with lighter notes of molasses, toffee, and yeast/oak earthiness. damn nice aromas with great balance and complexity of dark/roast/bready malts, vanilla, and bourbon barrel notes; with big strength. taste of huge milk/dark chocolate, cocoa, caramel, brown sugar, coffee, vanilla, bourbon, coconut, toasted oak, and dark/brown bread; with lighter notes of molasses, toffee, and yeast/oak earthiness. light-moderate pine, herbal, floral, grassy bitterness; and yeast spiciness on the finish. lingering notes of milk/dark chocolate, cocoa, caramel, brown sugar, coffee, vanilla, bourbon, coconut, toasted oak, and dark/brown br

In [None]:
# Calculate the similarity between 'carbonation' and each individual word in the
# reviews of Pseudo Sue and Beer Geek Vanilla Shake - Bourbon Barrel-Aged
columns = ['word1', 'word2', 'spacy_similarity']
word2 = 'carbonation'
doc2 = nlp(word2)  # parse the target word once instead of once per review word

rows = []
for beer in ['Pseudo Sue', 'Beer Geek Vanilla Shake - Bourbon Barrel-Aged']:
  for review in new_df_sorted['full_review'][new_df_sorted['beer_name'] == beer]:
    for word in review.split(" "):
      if word == '' or word == ' ':
        continue
      rows.append(pd.DataFrame([[word, word2, nlp(word).similarity(doc2)]], columns = columns))

# Concatenate once at the end; every one-row frame keeps its index of 0
spacy_sim_top_3 = pd.concat(rows)

In [None]:
#Get the words that are similar to carbonation that are NOT carbonation
relevant_similarities_df = spacy_sim_top_3[ (spacy_sim_top_3['word2'] == 'carbonation')].sort_values('spacy_similarity', ascending= False)
relevant_similarities_df = relevant_similarities_df[(~relevant_similarities_df['word1'].isin(['carbonation', 'carbonation.']))].head(20)

In [None]:
relevant_similarities_df

Unnamed: 0,word1,word2,spacy_similarity
0,mouthfeel,carbonation,0.820975
0,mouthfeel,carbonation,0.820975
0,mouthfeel,carbonation,0.820975
0,mouthfeel,carbonation,0.820975
0,mouthfeel,carbonation,0.820975
0,mouthfeel.,carbonation,0.728304
0,mouthfeel:,carbonation,0.706142
0,aftertaste,carbonation,0.680229
0,fizzy,carbonation,0.664632
0,hoppy,carbonation,0.631161


Note that for 2 of the 3 recommended beers, some of the reviews lack an explicit mention of "carbonation":

- **Pseudo Sue:** 16.67% of reviews do NOT mention carbonation
- **Beer Geek Vanilla Shake - Bourbon Barrel-Aged:** 40.00% of reviews do NOT mention carbonation
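
The shares above follow from the per-beer mean of the binary `carbonation` column (for Pseudo Sue, 1 - 0.833333 ≈ 16.67%). A minimal sketch of that calculation, assuming one row per review and a 0/1 attribute column as in `new_df_sorted`; the beer names and rows here are hypothetical toy data chosen only to reproduce the same two shares:

```python
import pandas as pd

# Hypothetical toy data: one row per review, 1.0 if the review mentions
# the attribute explicitly, 0.0 otherwise (same layout as new_df_sorted).
toy_reviews = pd.DataFrame({
    "beer_name": ["Beer A"] * 6 + ["Beer B"] * 5,
    "carbonation": [1.0, 1.0, 1.0, 1.0, 1.0, 0.0,
                    1.0, 1.0, 1.0, 0.0, 0.0],
})

# The mean of a 0/1 column is the share of reviews that mention the attribute.
mention_share = toy_reviews.groupby("beer_name")["carbonation"].mean()

# Percentage of reviews that do NOT mention the attribute.
pct_not_mentioned = ((1 - mention_share) * 100).round(2)
print(pct_not_mentioned)
```

The same two lines can be applied to the `taste` and `aroma` columns to compare how often each approach's recommendations mention every requested attribute.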

However, these reviews contain other words that the word embeddings rate as similar to "carbonation". Some of them (for example, "fizzy") do suggest that the beer is well carbonated, while others (for example, "hoppy" or "aftertaste") do not. This does not by itself mean that word embeddings are worse, or that BOW is worse. In fact, both approaches have **shortcomings**. 


- The BOW approach is explicit. Without a comprehensive list of target attribute terms, BOW may miss semantically valuable reviews that simply aren’t explicit enough for this method to find them. 
- The spaCy approach is implicit, allowing semantically meaningful information to be captured even if it’s not explicit. However, there is an increased chance that the top recommendations will not have high BOW similarity across all 3 attributes. Furthermore, the strength of the recommendation is inconsistent and relies on the robustness of the word embedding vectors. It is possible for this approach to:
    - Introduce noise into the model, recommending products based on words that do not really mean the same thing as the attribute in the specialized context. For example, ‘roasty’ or 'mouthfeel' could mean different things.
    - Fail to detect key terms based on industry jargon not built into the word embedding. For example, while ‘nose’ is an industry-specific term for aroma, the spaCy similarity between ‘nose’ and ‘aroma’ is only 0.466.
    - Overrepresent shorter reviews. Empirically, we see that a notable difference between the 2 methods is that the reviews attached to the cosine outputs (BOW similarity) are significantly longer than those of word embeddings. This outcome could be a reflection of the hypothesis presented by Josh in the Medium article: closely related embeddings that hold disparate meanings in the review likely appear nearer to each other in shorter, condensed reviews.



In [None]:
doc1 = nlp("nose")
doc2 = nlp("aroma")
print(doc1.similarity(doc2))

0.46648263262026535
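
For context on the review-length point above: spaCy's default `Doc.similarity` is the cosine similarity between averaged word vectors. The sketch below mimics that averaging with NumPy and made-up 3-dimensional vectors. The words, vectors, and helper names are assumptions for illustration only, not values from `en_core_web_lg`, and the result illustrates the mechanism rather than what happens in our review data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def doc_vector(words, vectors):
    """Average the word vectors, mirroring spaCy's default document vector."""
    return np.mean([vectors[w] for w in words], axis=0)

# Made-up toy embeddings (hypothetical, NOT taken from en_core_web_lg).
toy_vectors = {
    "carbonation": np.array([1.0, 0.0, 0.0]),
    "fizzy": np.array([0.8, 0.6, 0.0]),
    "chocolate": np.array([0.0, 0.0, 1.0]),
    "coffee": np.array([0.0, 0.2, 1.0]),
}

query = doc_vector(["carbonation"], toy_vectors)
short_review = doc_vector(["fizzy"], toy_vectors)
long_review = doc_vector(["fizzy", "chocolate", "coffee"], toy_vectors)

short_sim = cosine(query, short_review)
long_sim = cosine(query, long_review)
print(short_sim, long_sim)
```

In this toy setup, the extra unrelated words pull the averaged vector of the longer review away from the query, so the same "fizzy" mention scores lower than it does in the short review.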


However, each approach also has **strengths**:
* The BOW approach will guarantee results, in the sense that the recommended products will always have a mention of the attributes. 
* The spaCy approach will be able to identify the attribute even when it is only implicit. In our example, we can identify 'fizzy' as a word that is significantly associated with “carbonation”.

It must also be noted that one beer made the top three list for both approaches. Similar to the logic of using a stacked ML model, identifying recommendations that are common to both recommendation models seems to achieve the best of both worlds; even a flat average of the two scores should help to get a better recommendation. 
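
As a rough illustration of the flat average mentioned above, the sketch below averages the two evaluation scores for the three beers in the spaCy top-3 table (values copied from that table). The `combined_score` column name is our own assumption, and a real comparison would run over every beer rather than these three.

```python
import pandas as pd

# Per-beer scores copied from the spaCy top-3 table above.
scores = pd.DataFrame(
    {
        "evaluation_score": [1.975200, 1.923716, 1.872919],        # BOW similarity + sentiment
        "spacy_evaluation_score": [1.595803, 1.570453, 1.567967],  # spaCy similarity + sentiment
    },
    index=["Scaled Way Up", "Pseudo Sue", "Beer Geek Vanilla Shake - Bourbon Barrel-Aged"],
)

# Flat (unweighted) average of the two evaluation scores.
scores["combined_score"] = scores[["evaluation_score", "spacy_evaluation_score"]].mean(axis=1)

combined_ranking = scores.sort_values("combined_score", ascending=False)
print(combined_ranking["combined_score"])
```

Because both evaluation scores contain the same sentiment term, this flat average amounts to the sentiment score plus the mean of the two similarity scores.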

Finally, we can propose a new method to get the best possible results by combining the 2 methods, as well as a framework to decide when to implement each method.

**New Recommendation Method**

In order to get the best results possible, we recommend doing one or more of the following: 

* Train a word embeddings model using the data from specialized forums in the same subject. 
* Use word embeddings to calculate similarity with the reviews, but manually veto words that are really not similar in that context. 
* Ask the end-user to provide feedback on reviews that do not contain a certain attribute explicitly, and train a machine learning model to predict these scenarios. 
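
The manual veto idea from the list above could look roughly like the sketch below. The similarity values are copied from the word-similarity table earlier in this task; the veto list and the threshold are hypothetical choices of ours that a domain expert would need to review.

```python
import pandas as pd

# Word-level similarities to 'carbonation', copied from the table above.
similar_words = pd.DataFrame({
    "word1": ["mouthfeel", "aftertaste", "fizzy", "hoppy"],
    "spacy_similarity": [0.820975, 0.680229, 0.664632, 0.631161],
})

# Hypothetical veto list: words we judge NOT to signal carbonation in beer reviews.
veto_words = {"aftertaste", "hoppy"}
# Hypothetical cut-off for treating a word as an implicit mention.
similarity_threshold = 0.65

# Keep words that are similar enough AND have not been vetoed.
implicit_mentions = similar_words[
    (similar_words["spacy_similarity"] >= similarity_threshold)
    & (~similar_words["word1"].isin(veto_words))
]
print(implicit_mentions)
```

With these assumed settings, only "mouthfeel" and "fizzy" would count as implicit mentions of carbonation.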

**Framework to decide between Word Embeddings and BOW**

If the following criteria are met, then BOW is the preferred option:
* The available word embeddings are trained on generic text corpora, and not on the specialized lingo of the subject.
* The reviews are not bound to a character count, and therefore both very long and very short reviews exist.
* The subject uses a wide set of specialized terms. 



## Task G. 

How would your recommendations differ if you ignored the similarity and feature sentiment scores and simply chose the 3 highest rated products from your entire dataset? Would these products meet the requirements of the user looking for recommendations? Why or why not? Justify your answer with analysis. Use the similarity and sentiment scores as well as overall ratings to answer this question. 
Here is a sample web implementation of a recommender system based on the same principles (runningshoe4you.com), but in this assignment, we will not have the time for this type of full automation.


---




Without any sort of cosine similarity or sentiment analysis, we are back to using our original dataset, relying only on the information and ratings we receive from BeerAdvocate. Given that our previous two recommendation outputs both rely on a similarity score and sentiment analysis, disregarding these two metrics gives us completely different results. In fact, it is not surprising that the output does not match the user's inputs, and therefore does not overlap with the results from our other models.

We can see this by selecting the top 3 beers based only on their BeerAdvocate score.

In [None]:
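# numeric_only=True keeps the text column (full_review) out of the mean on newer pandas versions
new_df.groupby(['beer_name']).mean(numeric_only=True).sort_values('score', ascending=False).head(3)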

Unnamed: 0_level_0,score,taste,carbonation,aroma,similarity
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
SR-71,4.771818,0.909091,0.0,0.181818,0.599091
It Was All A Dream,4.755833,0.833333,0.083333,0.416667,0.657066
Kentucky Brunch Brand Stout,4.74375,0.9375,0.125,0.3125,0.66703


As you can see, NONE of the top 3 beers mentions ALL 3 attributes consistently across its reviews, and none of them was recommended by the previous 2 methods.

The reason is visible in the numbers: these beers do not match the user's preferences, the BOW similarity is low, and the sentiments are inconsistent as well. 

In [None]:
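#Get the evaluation_score for the top 3 beers by BeerAdvocate score
# (top_3_score_recommend is rebuilt here from the query in the previous cell, so this cell runs on its own)
top_3_score_recommend = new_df.groupby(['beer_name']).mean(numeric_only=True).sort_values('score', ascending=False).head(3)
short_df = new_df[new_df['beer_name'].isin(list(top_3_score_recommend.index))].copy()  # copy to avoid SettingWithCopyWarning
short_df['sentiment_score'] = short_df['full_review'].map(get_sentiment_score)
short_df['evaluation_score'] = short_df['similarity'] + short_df['sentiment_score']
recommended_beers_df = short_df.groupby(['beer_name']).mean(numeric_only=True).sort_values('score', ascending=False).head(3)
recommended_beers_df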

Unnamed: 0_level_0,score,taste,carbonation,aroma,similarity,sentiment_score,evaluation_score
beer_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
SR-71,4.771818,0.909091,0.0,0.181818,0.599091,0.933009,1.5321
It Was All A Dream,4.755833,0.833333,0.083333,0.416667,0.657066,0.494033,1.151099
Kentucky Brunch Brand Stout,4.74375,0.9375,0.125,0.3125,0.66703,0.762381,1.429411


After looking at the individual reviews, one can see that some users do not tie their review text to the score they give. As a result, some beers can have a high score (because the users simply gave them a high score) even though the review text does not match it (including very negative sentiments), and of course the reviews often do not even mention the attributes the user asked for.

However, in certain circumstances, this may actually be a suitable outcome; perhaps a user has no preference in any aspect of the beer, and wants to simply sample the overall "best" beer - in the scope of an aggregated average.

This kind of recommender system wouldn't be as elegant or precise a solution, but it would serve as a simple baseline for the average use case, especially when no attributes are given. It's not uncommon when shopping online to sort by top reviews or top selling; while these may not be tailored to an individual user, they do reflect an aggregate of opinion, which benefits the user's decision-making. In addition, our scraped data also includes attribute ratings; with these, a recommender could be built to output the top suggestions for particular ratings, giving users a moderately more filtered result.

In classic business sense, the answer to the question of whether this kind of watered-down recommender system would meet a user's requirements is: it depends. The system offers a starting point based on the community's cumulative experience, but it doesn't go further than that, as a cosine similarity or sentiment based system would. How deep and individually tailored does the user need this recommender to be? We can measure this "need" by the number of input preferences that the user enters, and tailor the type of recommender system accordingly. 
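
One way to picture tailoring the recommender to the number of preferences entered is a small dispatcher that falls back to the rating-only baseline when no attributes are given. This is only a sketch: the `recommend` function is hypothetical, the per-beer values are copied from the tables above, and it assumes `evaluation_score` has already been computed for the attributes the user entered.

```python
import pandas as pd

def recommend(beers, attributes, n=3):
    """Hypothetical dispatcher: rank by rating alone when the user gives no
    attributes, otherwise rank by the precomputed evaluation score."""
    sort_column = "evaluation_score" if attributes else "score"
    return beers.sort_values(sort_column, ascending=False).head(n)

# Per-beer means copied from the tables above (a subset, for illustration).
beers = pd.DataFrame(
    {
        "score": [4.771818, 4.755833, 4.28, 4.48],
        "evaluation_score": [1.5321, 1.151099, 1.9752, 1.923716],
    },
    index=["SR-71", "It Was All A Dream", "Scaled Way Up", "Pseudo Sue"],
)

baseline = recommend(beers, attributes=[], n=2)
tailored = recommend(beers, attributes=["taste", "carbonation", "aroma"], n=2)
print(list(baseline.index))
print(list(tailored.index))
```

On this subset, the rating-only baseline returns SR-71 and It Was All A Dream, while the attribute-aware ranking returns Scaled Way Up and Pseudo Sue.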