# Building a Crowdsourced Recommendation System

## Team Members


| Name | EID |
| --- | --- |
| Brandt Green | bwg537 |
| Jackson Hittner | jbh3692 |
| Bret Jaco | bcj646 |
| Brandon Pover | bnp669 |
| Matthew Tran | mct2345 |


In [1]:
import pandas as pd
import numpy as np
import warnings
import nltk
import string
from nltk.corpus import stopwords
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from typing import Text
from bs4 import BeautifulSoup
import time
import requests
english_stopwords = stopwords.words('english')

## Task A: Scrape Data
### Warning - scraping code block below will take around 10 minutes to run.

In [2]:

# base_url = 'https://www.beeradvocate.com'
# beer_url = 'https://www.beeradvocate.com/beer/top-rated/'

# request = requests.get(beer_url)
# html = (request.content)
# soup = BeautifulSoup(html)

# table_rows = soup.table.find_all('tr')

# all_beer_links = []
# for row in table_rows[1:]:
#     link = row.find_all('td')[1].find('a')['href']
#     all_beer_links.append(base_url + link)

# beer_data = {}
# beer_data['product_name'] = []
# beer_data['product_review'] = []
# beer_data['user_rating'] = []

# for beer_link in all_beer_links:
#     time.sleep(2)
#     page_soup = BeautifulSoup(requests.get(beer_link).content)
#     beer_name = page_soup.find('div',class_='titleBar').h1.find(text=True)

#     reviews_list = page_soup.find_all('div',class_='user-comment')
#     # reviews_list = page_soup.find_all('div',class_='rating_fullview_content_2')
#     for review in reviews_list:
#         user_rating = review.find('span',class_='BAscore_norm').text.strip()
#         product_review = ' '.join(review.find('span',class_='BAscore_norm').parent.find_all(text=True,recursive=False))
#         beer_data['product_name'].append(beer_name)
#         beer_data['product_review'].append(product_review)
#         beer_data['user_rating'].append(user_rating)        

# df = pd.DataFrame(beer_data)
# df.to_csv('beer_data.csv')

## Task B: Finding Beer Attributes

In [3]:
beer_data = pd.read_csv('beer_data.csv')
beer_data.product_review = beer_data.product_review.str.strip().str[4:] # Drop the first 4 characters of every review, which are not part of the review text
beer_data

Unnamed: 0,product_name,product_review,user_rating
0,Kentucky Brunch Brand Stout,2020 vintage acquired during the pandemic. I...,5.00
1,Kentucky Brunch Brand Stout,"Long time waiting to tick this one, and I ha...",4.56
2,Kentucky Brunch Brand Stout,This review is for the 2019 batch. It was bo...,5.00
3,Kentucky Brunch Brand Stout,Supreme maple OD! Soooo easy drinking & well...,5.00
4,Kentucky Brunch Brand Stout,I have now had 4 different years of KBBS and...,5.00
...,...,...,...
6214,The Streets,"Had the good fortune to get 24 of these, \nT...",4.85
6215,The Streets,Incredible beer. Tasted from can. Robust aro...,5.00
6216,The Streets,Cloudy orange appearance with white head tha...,4.52
6217,The Streets,Can dated 3/20/17. This is the third can con...,4.75


### To find what beer attributes people discuss the most, we wanted to examine the most frequent adjectives people use when talking about beer. We first cleaned up the words and then analyzed the frequency distributions of words.

First, we clean the text by lowercasing it and removing every character that is not a letter or whitespace.

In [27]:
regex_pattern = r"[^a-zA-Z\s]" # Raw-string regex matching every character that is not a letter or whitespace
beer_data['cleaned_review'] = beer_data.product_review.str.lower().str.replace(pat=regex_pattern,repl='',regex=True) # Lowercase, then delete the matched characters
beer_data.head()

Unnamed: 0,product_name,product_review,user_rating,cleaned_review
0,Kentucky Brunch Brand Stout,2020 vintage acquired during the pandemic. I...,5.0,vintage acquired during the pandemic it was...
1,Kentucky Brunch Brand Stout,"Long time waiting to tick this one, and I ha...",4.56,long time waiting to tick this one and i hav...
2,Kentucky Brunch Brand Stout,This review is for the 2019 batch. It was bo...,5.0,this review is for the batch it was bottle ...
3,Kentucky Brunch Brand Stout,Supreme maple OD! Soooo easy drinking & well...,5.0,supreme maple od soooo easy drinking wellta...
4,Kentucky Brunch Brand Stout,I have now had 4 different years of KBBS and...,5.0,i have now had different years of kbbs and ...
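
To make the cleaning step concrete, here is a minimal sketch that applies the same pattern with Python's built-in `re` module to a made-up review string. The sample string and the `clean_text` helper are illustrative assumptions, not part of the notebook. Because matched characters are deleted rather than replaced with a space, hyphenated words end up fused together.

```python
import re

# Same pattern as above: anything that is not a letter or whitespace.
PATTERN = re.compile(r"[^a-zA-Z\s]")

def clean_text(text: str) -> str:
    """Hypothetical helper: lowercase the text, then delete non-letter characters."""
    return PATTERN.sub("", text.lower())

sample = "Soooo easy-drinking & well-made! 2020 vintage."
cleaned = clean_text(sample)

# Splitting on whitespace shows the surviving tokens.
print(cleaned.split())
```

If fused words such as "wellmade" are undesirable, replacing matches with a space instead of an empty string would keep the parts separate, at the cost of changing the token counts reported in this notebook.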


Next, we use NLTK to tokenize the entire corpus. We build two token lists: one with every word in the corpus and one with only the adjectives. Both lists exclude stop words.

In [5]:
entire_corpus = beer_data.cleaned_review.str.cat(sep=' ') # Entire corpus in one big string
all_tokens = nltk.word_tokenize(entire_corpus) # Tokenize everything
tokens_no_stop_words = [token for token in all_tokens if token not in english_stopwords] # Remove stop words from all tokens

Top words across all tokens:

In [6]:
word_counts_all = pd.DataFrame(data=nltk.FreqDist(tokens_no_stop_words).most_common(), columns=['word','frequency']) # most_common() with no argument returns every word
word_counts_all.head(15)

Unnamed: 0,word,frequency
0,beer,4946
1,head,3792
2,taste,3133
3,chocolate,2868
4,dark,2753
5,sweet,2424
6,like,2351
7,one,2246
8,coffee,2201
9,bourbon,2142
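
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list and the tiny stop-word set below are made up for illustration and stand in for the real corpus and NLTK's English stop-word list.

```python
from collections import Counter

tokens = ["the", "beer", "is", "dark", "and", "the", "beer", "is", "sweet"]
toy_stopwords = {"the", "is", "and"}  # illustrative stand-in for NLTK's stop-word list

# Drop stop words, then count what is left.
kept = [token for token in tokens if token not in toy_stopwords]
frequencies = Counter(kept)

# The most frequent remaining word and its count.
print(frequencies.most_common(1))
```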


Some of the above were helpful, but we want to understand what **attributes** people care about the most so we examined adjectives only:

In [7]:
tagged_tokens = nltk.pos_tag(tokens_no_stop_words) # Here, we get the parts of speech for each token, this is needed to filter by adjectives in a minute
adjectives_only = [word for word, tag in tagged_tokens if tag in ['JJ','JJR','JJS']] # Filter for adjectives
word_counts_adjectives = pd.DataFrame(data=nltk.FreqDist(adjectives_only).most_common(), columns=['word','frequency'])
word_counts_adjectives.head(30)

Unnamed: 0,word,frequency
0,sweet,2003
1,nice,1972
2,good,1932
3,black,1590
4,white,1552
5,overall,1545
6,great,1435
7,finish,1416
8,dark,1400
9,nose,1397


This list is much more helpful! After exploring it, we chose three attributes and saved them to a CSV file called 'attributes.csv', which we read later to simulate a customer sending us their three chosen attributes.

Our chosen attributes are: "smooth", "creamy", and "tropical".

## Task C: Cosine Similarity

In [8]:
customers_attributes = list(pd.read_csv('attributes.csv')['attributes'])
# customers_attributes = ['robust', 'crisp', 'hoppy'] # This line is for testing
# customers_attributes = ['smooth', 'creamy', 'tropical'] # Uncomment to override attributes.csv when testing
customers_attributes = [word.lower() for word in customers_attributes]
customers_attributes

['smooth', 'creamy', 'tropical']

### To calculate the similarity between the 3 desired attributes provided by our customer and the products, we calculate the cosine similarity between our attributes and each review using a bag-of-words approach. Then we average the similarity scores across each product to find the most similar products.
<br><br>
To calculate the similarities, we first need a document-term matrix in which each row represents a review and each column represents a word; the value in each cell is the number of occurrences of that word in that review. So we first create a data frame of zeros and then fill in the cells with the appropriate word counts:

In [9]:
all_words = sorted(set(word_counts_all['word'])) # unique words in the corpus
df_words = pd.DataFrame(np.zeros((len(beer_data), len(all_words)))) # dataframe of zeros
df_words.columns = all_words

def get_tokens_no_stops(text:str):
    """Just tokenizes a string and removes stop words. It returns a list of tokens. This function is used to get the tokens for each review separately."""
    tokens = nltk.word_tokenize(text)
    return [token for token in tokens if token not in english_stopwords]

all_reviews_series = beer_data.cleaned_review.apply(get_tokens_no_stops) # Returns a series where the values are the tokenized versions of each review

# The loop below populates each cell in the df_words matrix with its appropriate value.
for index, review in enumerate(all_reviews_series):
    unique_words = set(review)
    for word in unique_words:
        df_words.loc[index, word] = review.count(word)

# Just a quick check to make sure that the above actually worked
df_words.sum(axis=1).head(3) # Each row should have some numbers now.

0    17.0
1    25.0
2    39.0
dtype: float64
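
The nested loop above writes one cell at a time, which is slow for thousands of reviews. As a rough alternative sketch (not the notebook's method), each review can be counted once with `collections.Counter` and the rows assembled afterwards. The two toy reviews are assumptions for illustration.

```python
from collections import Counter

tokenized_reviews = [
    ["smooth", "creamy", "smooth"],
    ["dark", "roasty"],
]

# One Counter per review: word -> number of occurrences in that review.
review_counts = [Counter(tokens) for tokens in tokenized_reviews]

# Vocabulary = sorted union of all words seen in any review.
vocabulary = sorted(set().union(*review_counts))

# Counter returns 0 for missing words, so every row has one entry per vocabulary word.
matrix = [[counts[word] for word in vocabulary] for counts in review_counts]

print(vocabulary)
print(matrix)
```

Passing the result to `pd.DataFrame(matrix, columns=vocabulary)` would give a frame with the same layout as `df_words`.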

Now that we have our word vectors for each document, we can calculate the cosine similarity between each review and the chosen attributes: 

In [10]:
def calculate_cosine_similarity(word_vector:pd.Series) -> float:
    """Calculate the cosine similarity between a review's word-count vector and the customer attribute vector (bag of words)."""
    # Because our attribute vectors are just 1's, the dot product is simply the sum of the word counts in the document word vector
    numerator_total = 0
    # This for loop below is just so that we don't try to index into our word vector if the attribute is not inside of our word vector. 
    for word_attribute in customers_attributes:
        if word_attribute in word_vector:
            numerator_total += word_vector[word_attribute] # add the count of the word attribute to our numerator sum

    denominator = np.sqrt(np.sum(np.power(word_vector,2))) * np.sqrt(len(customers_attributes)) # norm of the review vector times norm of the all-ones attribute vector

    return numerator_total/denominator

# Returns a series of the similarities with attribute vector
cosine_similarities = pd.DataFrame(df_words.apply(calculate_cosine_similarity, axis=1)) 

# Below, we merge the similarity information onto the product information we have above
df_scores = pd.merge(beer_data['product_name'],cosine_similarities, left_index=True, right_index=True)
df_scores.columns = ['product_name','cosine_similarity']
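
As a worked example of the formula used in `calculate_cosine_similarity`, the sketch below computes the score by hand for one made-up review. The word counts are assumptions for illustration; the attribute vector is all ones, as in the notebook.

```python
import math

attributes = ["smooth", "creamy", "tropical"]
review_counts = {"smooth": 2, "creamy": 1, "dark": 2}  # toy word counts for one review

# Dot product with an all-ones attribute vector = sum of the attribute counts (2 + 1 + 0).
dot_product = sum(review_counts.get(word, 0) for word in attributes)

# Euclidean norms: sqrt(4 + 1 + 4) for the review, sqrt(3) for the attributes.
review_norm = math.sqrt(sum(count ** 2 for count in review_counts.values()))
attribute_norm = math.sqrt(len(attributes))

# Guard against empty reviews, whose norm is zero.
similarity = dot_product / (review_norm * attribute_norm) if review_norm else 0.0
print(round(similarity, 4))
```

The zero-norm guard is an addition for this sketch; the function above would divide by zero for a review with no remaining tokens.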

### Top 5 beers based on average cosine similarity:
Now that we have the cosine similarity of each review with the chosen customer attributes, we can take the average similarity across each beer as our measure of total beer similarity to the attributes. The top five beers ranked by cosine similarity are presented below.

In [11]:
avg_cosine_similarities = df_scores.groupby('product_name')['cosine_similarity'].mean().sort_values(ascending=False)
avg_cosine_similarities[:5]

product_name
Double Dry Hopped Double Mosaic Daydream    0.115292
Doubleganger                                0.112345
King Julius                                 0.104317
Gggreennn!                                  0.101801
King JJJuliusss                             0.100903
Name: cosine_similarity, dtype: float64

## Task D: Sentiment Analysis
A recommendation based only on word similarity is incomplete, so we also include a measure of sentiment. Below, we calculate a sentiment score for each review using VADER.

In [12]:
sid = SentimentIntensityAnalyzer()
df_scores['polarity'] = beer_data.product_review.apply(sid.polarity_scores) # Get the sentiment scores for each review
df_scores['sentiment_score'] = df_scores.polarity.apply(lambda score_dict: score_dict['compound']) # Extract the compound score for each review and put it into its own column
df_scores = df_scores.drop(columns='polarity')

With the sentiment scores calculated, we show the top 5 beers ranked by average sentiment below:

In [13]:
avg_sentiments = df_scores.groupby('product_name')['sentiment_score'].mean().sort_values(ascending=False)
avg_sentiments.head()

product_name
Cable Car Kriek                              0.911064
Genealogy Of Morals - Bourbon Barrel-Aged    0.904744
Cable Car                                    0.904356
Mother Of All Storms                         0.903976
Expedition Stout - Bourbon Barrel-Aged       0.899836
Name: sentiment_score, dtype: float64

## Task E: Recommendation
To recommend 3 beers to our customer, we calculate a combined evaluation score for each beer as:
 $$\text{EvaluationScore} = \text{Average}(\text{CosineSimilarity}) + \text{Average}(\text{SentimentScore})$$

Calculate the average evaluation scores for the beers and sort in descending order:

In [14]:
df_scores['evaluation_score'] = df_scores['cosine_similarity'] + df_scores['sentiment_score']
average_metrics = df_scores.groupby('product_name').mean().sort_values(by='evaluation_score',ascending=False)
average_metrics.head(3)

product_name,cosine_similarity,sentiment_score,evaluation_score
Genealogy Of Morals - Bourbon Barrel-Aged,0.035481,0.904744,0.940225
Double Dry Hopped Double Mosaic Daydream,0.115292,0.823236,0.938528
Mother Of All Storms,0.0334,0.903976,0.937376
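
The code above adds the two scores per review and then averages per beer, while the formula averages first and then adds. Because the mean is linear, both orders give the same result. The sketch below illustrates this with made-up scores for a single beer.

```python
import math
from statistics import mean

cosine_scores = [0.10, 0.00, 0.20]     # toy per-review cosine similarities
sentiment_scores = [0.90, 0.80, 0.70]  # toy per-review compound sentiment scores

# Order used in the code: add per review, then average.
mean_of_sums = mean([c + s for c, s in zip(cosine_scores, sentiment_scores)])

# Order used in the formula: average each score, then add.
sum_of_means = mean(cosine_scores) + mean(sentiment_scores)

print(math.isclose(mean_of_sums, sum_of_means))
```

Note that the two terms sit on different scales in this data set (the summary statistics later in the notebook report a mean of about 0.04 for cosine similarity versus about 0.76 for sentiment), so the sentiment term carries most of the weight in the sum.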


### Recommend the beers below!

In [15]:
recommended_beers = list(average_metrics.index[:3])
recommended_beers

['Genealogy Of Morals - Bourbon Barrel-Aged',
 'Double Dry Hopped Double Mosaic Daydream',
 'Mother Of All Storms']

## Task F: spaCy vs. Bag of Words

We have our recommendations using a bag-of-words approach for calculating similarity, but how would the results compare if we used spaCy's word vectors instead?

To answer this question, we calculate the similarity between each review and our customer attributes using spaCy, following the same approach as above.

Below, we add the spaCy similarity metric to our 'df_scores' dataframe.

Warning: the cell below takes a few minutes to run!

In [16]:
import spacy
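nlp = spacy.load('en_core_web_sm') # Note: the small model ships without static word vectors, so these similarity scores are approximate

attributes_doc = nlp(' '.join(customers_attributes)) # Parse the attribute string once instead of once per review

def get_spacy_similarity(text:str) -> float:
    """Get the spaCy similarity between a review and the customer attributes."""
    return nlp(text).similarity(attributes_doc)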

df_scores['spacy_similarity'] = beer_data['cleaned_review'].apply(get_spacy_similarity)



The average spaCy similarity for each beer, sorted from highest to lowest, is computed below:

In [17]:
avg_spacy_similarity = df_scores.groupby('product_name')['spacy_similarity'].mean().sort_values(ascending=False)

Get the top 3 recommendations below:

In [18]:
df_scores['evaluation_score_spacy'] = df_scores['spacy_similarity'] + df_scores['sentiment_score']
average_metrics_spacy = df_scores.groupby('product_name')[['sentiment_score','spacy_similarity','evaluation_score_spacy']].mean().sort_values(by='evaluation_score_spacy',ascending=False)
average_metrics_spacy.head(3)

product_name,sentiment_score,spacy_similarity,evaluation_score_spacy
Mother Of All Storms,0.903976,0.468992,1.372968
Zenne Y Frontera,0.88006,0.487365,1.367425
Cable Car Kriek,0.911064,0.444171,1.355235


In [19]:
recommended_beers_spacy = list(average_metrics_spacy.index[:3])
recommended_beers_spacy

['Mother Of All Storms', 'Zenne Y Frontera', 'Cable Car Kriek']

Now we can compare how the two measures perform. We look at the reviews of the recommended products and calculate the percentage of reviews for each product that mention each of the preferred attributes.

In [20]:
def get_word_percent_df(recommended_beers:list,df_words:pd.DataFrame):
    """Compute the fraction of reviews for each recommended beer that contain each customer attribute. Pass in the list of recommended beers and the entire df_words dataframe."""
    
    df_words_recommended =  beer_data[['product_name']].merge(df_words,left_index=True,right_index=True)
    df_words_recommended = df_words_recommended[df_words_recommended['product_name'].isin(recommended_beers)] # Only look at reviews for recommended beers
    df_words_recommended = df_words_recommended[['product_name'] + customers_attributes] # Only keep the columns with the customer attributes chosen
    # df_words_recommended['any_attribute'] = df_words_recommended.sum(axis=1).astype(bool) 
    df_words_recommended[customers_attributes] = df_words_recommended[customers_attributes].astype(bool)

    return df_words_recommended.groupby('product_name').sum()/df_words_recommended.groupby('product_name').count()
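
To illustrate what `get_word_percent_df` computes, here is a dependency-free sketch on made-up data: for each beer, the fraction of its reviews that mention each attribute at least once. The beer names and reviews are hypothetical.

```python
from collections import defaultdict

attributes = ["smooth", "creamy", "tropical"]
reviews = [
    ("Beer A", ["smooth", "dark", "smooth"]),
    ("Beer A", ["creamy", "smooth"]),
    ("Beer B", ["tropical", "hazy"]),
    ("Beer B", ["bitter"]),
]

review_totals = defaultdict(int)
mention_totals = defaultdict(lambda: dict.fromkeys(attributes, 0))

for beer, tokens in reviews:
    review_totals[beer] += 1
    for attribute in attributes:
        # Count a review once per attribute, however often the word repeats.
        mention_totals[beer][attribute] += attribute in tokens

fractions = {
    beer: {a: mention_totals[beer][a] / review_totals[beer] for a in attributes}
    for beer in review_totals
}
print(fractions)
```

Counting each review at most once per attribute mirrors the `astype(bool)` step in the function above.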


### Percent of reviews containing an attribute using the cosine similarity approach:

In [21]:
get_word_percent_df(recommended_beers,df_words)

product_name,smooth,creamy,tropical
Double Dry Hopped Double Mosaic Daydream,0.16,0.44,0.6
Genealogy Of Morals - Bourbon Barrel-Aged,0.28,0.08,0.0
Mother Of All Storms,0.32,0.12,0.0


### Percent of reviews containing an attribute using the spacy similarity approach:

In [22]:
get_word_percent_df(recommended_beers_spacy, df_words)

product_name,smooth,creamy,tropical
Cable Car Kriek,0.12,0.0,0.0
Mother Of All Storms,0.32,0.12,0.0
Zenne Y Frontera,0.12,0.0,0.0


## Task G: Comparison with the Highest-Rated Beers

How do the recommendations we have provided above compare to the basic approach of just recommending the highest rated beers?

In [23]:
# Add ratings to our scores dataframe by merging them onto a new df containing all scores
df_scores_all = df_scores.copy()
df_scores_all = df_scores_all.merge(beer_data[['user_rating']],left_index=True, right_index=True)

# Aggregate all the average metrics into one dataframe which makes comparisons easy
avg_metrics_all = df_scores_all.groupby('product_name').mean().sort_values(by='user_rating',ascending=False)

highest_avg_ratings = avg_metrics_all['user_rating']
recommended_beers_avg_rating = list(highest_avg_ratings.index[:3])
recommended_beers_avg_rating

['Chemtrailmix', 'Vanilla Bean Assassin', 'Blessed']

### Let's see how these beers chosen from the top rated beers compare to the rankings based on our similarity scores:

The table below puts the numbers for the highest-rated beers into the context of the entire data set:

In [24]:
avg_metrics_all.describe()

Unnamed: 0,cosine_similarity,sentiment_score,evaluation_score,spacy_similarity,evaluation_score_spacy,user_rating
count,250.0,250.0,250.0,250.0,250.0,250.0
mean,0.035475,0.757478,0.792953,0.437689,1.195166,4.477383
std,0.023869,0.081259,0.080378,0.038093,0.0887,0.110307
min,0.001089,0.468052,0.501756,0.319957,0.897976,4.1928
25%,0.017426,0.718742,0.74831,0.417146,1.148008,4.3973
50%,0.032159,0.767404,0.802559,0.441315,1.20354,4.4744
75%,0.046235,0.814274,0.853881,0.463082,1.256228,4.542
max,0.115292,0.911064,0.940225,0.520613,1.372968,4.7716


In [25]:
avg_metrics_all.loc[recommended_beers_avg_rating]

product_name,cosine_similarity,sentiment_score,evaluation_score,spacy_similarity,evaluation_score_spacy,user_rating
Chemtrailmix,0.016081,0.78636,0.802441,0.42987,1.21623,4.7716
Vanilla Bean Assassin,0.007693,0.826438,0.834131,0.418218,1.244655,4.74625
Blessed,0.016965,0.784832,0.801797,0.38277,1.167602,4.7428


The table above shows that the average similarities between the highest-rated beers and the chosen customer attributes are low. This is clearest for cosine similarity, where all three beers fall in the bottom quartile (below the 25th-percentile value of 0.017426). In other words, reviews of the highest-rated beers rarely discuss the attributes this particular customer prefers.
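
The quartile claim can be checked directly against the two tables above. The sketch below hard-codes the relevant values from those tables (the 25th-percentile cosine similarity and the three beers' averages) and compares them.

```python
# 25th percentile of average cosine similarity, from the describe() output.
cosine_q25 = 0.017426

# Average cosine similarity of the three highest-rated beers.
top_rated_cosine = {
    "Chemtrailmix": 0.016081,
    "Vanilla Bean Assassin": 0.007693,
    "Blessed": 0.016965,
}

# True means the beer sits below the 25th percentile, i.e. in the bottom quartile.
in_bottom_quartile = {beer: score < cosine_q25 for beer, score in top_rated_cosine.items()}
print(in_bottom_quartile)
```

Repeating the comparison for `spacy_similarity` (25th percentile 0.417146) gives a less uniform picture: only Blessed (0.38277) falls below it.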

The sentiment and evaluation scores of these high-rated beers are unremarkable: most sit close to the data set medians (0.767 and 0.803, respectively). If we assume that our evaluation scores are the better predictor of what a customer wants, then recommending the highest-rated beers will likely give the customer a mediocre experience.

Simply recommending the highest-rated beers may be a safe choice, since you are unlikely to recommend something the customer will truly dislike. However, you are also unlikely to satisfy the customer on a deeper level and generate the customer passion needed to thrive. Businesses can die of customer indifference.

In [26]:
get_word_percent_df(recommended_beers_avg_rating, df_words)

product_name,smooth,creamy,tropical
Blessed,0.2,0.04,0.0
Chemtrailmix,0.16,0.08,0.0
Vanilla Bean Assassin,0.0625,0.0,0.0


The table above shows the fraction of reviews for each beer that mention each customer-provided attribute. These fractions are lower than those for the beers recommended with bag-of-words cosine similarity and roughly comparable to those for the spaCy recommendations. Low values are expected, because choosing the highest-rated beers does not use the attribute information at all. Of the three attributes, only 'smooth' shows up with any regularity, which may simply reflect that 'smooth' appears in many beer reviews.

<br>

### Overall, the three highest-rated products do not meet the requirements of the user seeking recommendations. We also feel that completely ignoring a customer's stated preferences and blindly recommending high-rated beers should be considered business malpractice. The discrepancy between user ratings and recommendations based on specific attributes demonstrates the need for crowdsourced recommender systems that surface products better suited to customer-specific requirements. Ultimately, this relates to the long tail of products: if users deferred to the highest-rated products, they might come away unsatisfied with their purchases, none the wiser that better-suited products exist.