<h1><center> Analysis of textual reviews </center></h1>

Our goal for this part is to analyze the textual reviews and see if we can derive insights from them. In particular, we aim to predict whether a given review has a positive or negative sentiment. We think that by considering the sentiment associated with a review, we can study the popularity of certain beers better than by looking at the numerical ratings alone. The intuition behind this is that a textual review is a richer source of information than a rating between 0 and 5. We hope that sentiment analysis lets us capture this additional information and use it as an additional measure of beer popularity.

We will therefore present a first implementation of a pipeline that:


- Extracts textual reviews from `reviews.txt` into a dataframe
- Annotates the reviews (positive/negative for later training)
- Represents each review in an embedding space
- Trains an SVM classifier on the review embeddings
- Uses the classifier to predict the sentiment of a given review

We will then use this model to rank beer popularity. A beer is said to be more popular than another if a higher proportion of its reviews are classified as positive. In particular, we will look at the 10 countries with the highest review output and try to find their most popular beer. As mentioned above, we do not yet have any reason to believe that this ranking is more "useful" than a rating-based ranking; we just want to study to what extent the two differ, and then expand on these results in the following milestone.
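
The popularity definition above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `positive_share_per_beer` and the toy predictions are hypothetical, and the classifier outputs are assumed to be available as `(beer_name, predicted_class)` pairs.

```python
from collections import defaultdict


def positive_share_per_beer(predictions):
    """Return {beer_name: share of its reviews classified as positive}.

    `predictions` is an iterable of (beer_name, predicted_class) pairs,
    where predicted_class is 1 (positive) or 0 (negative).
    """
    counts = defaultdict(lambda: [0, 0])  # beer -> [n_positive, n_total]
    for beer_name, predicted_class in predictions:
        counts[beer_name][0] += int(predicted_class == 1)
        counts[beer_name][1] += 1
    return {beer: pos / total for beer, (pos, total) in counts.items()}


# Hypothetical classifier outputs, for illustration only
toy_predictions = [
    ("Beer A", 1), ("Beer A", 1), ("Beer A", 0),
    ("Beer B", 1), ("Beer B", 0), ("Beer B", 0),
]
shares = positive_share_per_beer(toy_predictions)
# Beers ordered from the highest to the lowest share of positive reviews
ranking = sorted(shares, key=shares.get, reverse=True)
print(ranking)
```

With this definition, a beer with few reviews can reach an extreme share easily, so a minimum number of reviews per beer may be worth requiring.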


Having clarified the motivation behind this task, we can start by importing the necessary packages:

In [2]:
import pandas as pd
import math
import json
import os
import pickle
import numpy as np
import torch

from random import shuffle
from joblib import dump, load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The textual reviews are stored in `reviews.txt`, and we start by extracting them into a dataframe:

In [3]:
def read_data(filepath="reviews.txt", num_rows=200, extract_text_reviews=False):
    """
    Extracts reviews from a .txt file into a dataframe. Converts dates to a readable format and stores numeric values as int/float
    
    Input:
        filepath: str, path of dataset ("reviews.txt" by default)
        num_rows: int, number of rows to extract from dataset (200 by default)
        extract_text_reviews: boolean, set to True to extract text reviews as well (False by default)
        
    Return:
        df: shape: (min(num_rows, num_datapoints), 15) if extract_text_reviews=False
                   (min(num_rows, num_datapoints), 16) if extract_text_reviews=True
            contains the content of file pointed to by filepath.
    """
    # set column names
    column_names = ["beer_name", "beer_id", "brewery_name", "brewery_id", "style", "abv", "date", "user_name", "user_id", "appearance", "aroma", "palate", "taste", "overall", "rating"]
    if extract_text_reviews:
        column_names.append("text")
        
    # initialise empty dataframe 
    data_dict = {col: [] for col in column_names}

    # read from file line by line
    with open(filepath) as data_file:
        for line in data_file:
            # skip if line is empty
            if line == '\n':
                continue
            
            # get attribute (beer_name, beer_id, etc)
            attribute = line[:line.index(':')]
        
            # skip if attribute is text 
            if attribute == "text" and not extract_text_reviews:
                continue
            
            # add value of attribute to the corresponding list   
            data_dict[attribute].append(line[line.index(':')+2:-1])
            
            # stop reading from file if we gathered num_rows datapoints
            if len(data_dict["rating"]) == num_rows:
                break

    # convert to dataframe            
    df = pd.DataFrame(data_dict)
    # retrieve numerical value of ratings (from string to float/int); astype returns a new dataframe, so reassign it
    df = df.astype({'beer_id':'int32', 'brewery_id':'int32', 'abv':'float', 'date':'int32', 'appearance':'float', 'aroma':'float', 'palate':'float', 'taste':'float', 'overall':'float', 'rating':'float'})
    # convert unix time to readable format
    df['date'] = pd.to_datetime(df['date'],unit='s')
    
    return df
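
To make the file format handled by `read_data` more concrete, here is a small sketch that parses `attribute: value` lines from an in-memory sample. The sample records and the helper name `parse_records` are made up for illustration, and the sketch assumes that records are separated by blank lines, which the real file may or may not guarantee.

```python
import io


def parse_records(lines, fields):
    """Group 'attribute: value' lines into one dict per record.

    A blank line is assumed to mark the end of a record.
    """
    records, current = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        # split on the first ': ' only, so values may themselves contain colons
        key, _, value = line.partition(": ")
        if key in fields:
            current[key] = value
    if current:
        records.append(current)
    return records


# Hypothetical sample in the same 'attribute: value' layout
sample = io.StringIO(
    "beer_name: Example Ale\n"
    "rating: 3.5\n"
    "text: Pours amber: nice head.\n"
    "\n"
    "beer_name: Example Stout\n"
    "rating: 4.25\n"
    "text: Roasty.\n"
)
records = parse_records(sample, fields={"beer_name", "rating", "text"})
print(len(records), records[0]["text"])
```

Splitting on the first separator only matters for the `text` field, since review texts often contain colons.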

In [4]:
# the line below extracts ALL datapoints (with text), change num_rows to 100 if the cell takes too long to run
df_with_text = read_data(filepath="reviews.txt", num_rows=math.inf, extract_text_reviews=True) 

# get first 5 reviews
df_with_text.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating,text
0,Régab,142544,Societe des Brasseries du Gabon (SOBRAGA),37262,Euro Pale Lager,4.5,2015-08-20 10:00:00,nmann08,nmann08.184925,3.25,2.75,3.25,2.75,3.0,2.88,"From a bottle, pours a piss yellow color with ..."
1,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2009-02-20 11:00:00,StJamesGate,stjamesgate.163714,3.0,3.5,3.5,4.0,3.5,3.67,Pours pale copper with a thin head that quickl...
2,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2006-03-13 11:00:00,mdagnew,mdagnew.19527,4.0,3.5,3.5,4.0,3.5,3.73,"500ml Bottle bought from The Vintage, Antrim....."
3,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-12-01 11:00:00,helloloser12345,helloloser12345.10867,4.0,3.5,4.0,4.0,4.5,3.98,Serving: 500ml brown bottlePour: Good head wit...
4,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-08-30 10:00:00,cypressbob,cypressbob.3708,4.0,4.0,4.0,4.0,4.0,4.0,"500ml bottlePours with a light, slightly hazy ..."


We inspect how many reviews we have in our dataset and notice that there are 2,589,586 reviews in total.

In [5]:
# print shapes of each dataframe 
print("Shape of data frame with text column: ", df_with_text.shape, end='\n')

Shape of data frame with text column:  (2589586, 16)


Since our analysis takes into account the reviewing behaviour of each country, we must add a `country` column that indicates the country of origin of each review. We get this additional column from the `users.csv` file.

In [6]:
users_df = pd.read_csv("users.csv", delimiter = ',').set_index("user_id")
df_with_text['country'] = users_df.loc[df_with_text["user_id"]].location.to_list()

Our final dataset for the sentiment analysis task is therefore the following:

In [7]:
df_with_text.head()

Unnamed: 0,beer_name,beer_id,brewery_name,brewery_id,style,abv,date,user_name,user_id,appearance,aroma,palate,taste,overall,rating,text,country
0,Régab,142544,Societe des Brasseries du Gabon (SOBRAGA),37262,Euro Pale Lager,4.5,2015-08-20 10:00:00,nmann08,nmann08.184925,3.25,2.75,3.25,2.75,3.0,2.88,"From a bottle, pours a piss yellow color with ...","United States, Washington"
1,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2009-02-20 11:00:00,StJamesGate,stjamesgate.163714,3.0,3.5,3.5,4.0,3.5,3.67,Pours pale copper with a thin head that quickl...,"United States, New York"
2,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2006-03-13 11:00:00,mdagnew,mdagnew.19527,4.0,3.5,3.5,4.0,3.5,3.73,"500ml Bottle bought from The Vintage, Antrim.....",Northern Ireland
3,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-12-01 11:00:00,helloloser12345,helloloser12345.10867,4.0,3.5,4.0,4.0,4.5,3.98,Serving: 500ml brown bottlePour: Good head wit...,Northern Ireland
4,Barelegs Brew,19590,Strangford Lough Brewing Company Ltd,10093,English Pale Ale,4.5,2004-08-30 10:00:00,cypressbob,cypressbob.3708,4.0,4.0,4.0,4.0,4.0,4.0,"500ml bottlePours with a light, slightly hazy ...",Northern Ireland


Since this is a supervised task, we need a portion of the reviews to be labelled as positive (1) or negative (0). Rather than labelling reviews by hand, we derive the labels from the `rating` value. A review whose rating is at least `positive_threshold` (say 4) is considered positive. Conversely, a review whose rating is at most `negative_threshold` (say 2) is considered negative. Reviews with a rating strictly between the two thresholds are left out. To ensure better separability, we can make the thresholds more extreme (higher for `positive_threshold` and lower for `negative_threshold`).
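
The labelling rule can be written as a small function. This is only a sketch of the rule described above; the helper name `label_from_rating` is hypothetical, and the example ratings are taken from the first rows of the dataframe shown earlier.

```python
def label_from_rating(rating, positive_threshold=4.0, negative_threshold=2.0):
    """Return 1 (positive), 0 (negative) or None (rating too ambiguous to label)."""
    if rating >= positive_threshold:
        return 1
    if rating <= negative_threshold:
        return 0
    return None


# Ratings of three of the reviews displayed above
labels = [label_from_rating(r) for r in (2.88, 3.67, 4.0)]
print(labels)
```

Only the review rated 4.0 receives a label here; the other two fall between the thresholds and would not be used for training.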

We present the following function that extracts the reviews beyond these thresholds and annotates them. The annotated reviews are then saved in the file `beer_reviews/annotated_reviews.json`.

In [8]:
def extract_and_annotate_reviews(data_dict, positive_threshold=5, negative_threshold=2, sample_size=500):
    
    # if annotation has already been performed, load the saved version and return it
    annotated_data_path = "beer_reviews/annotated_reviews.json"
    if os.path.exists(annotated_data_path):
        with open(annotated_data_path, encoding="utf-8") as data:
            annotated_data = json.load(data)
        return annotated_data

    # otherwise iterate over the reviews and annotate them according to their rating:
    # a review with a numeric rating >= positive_threshold is annotated as positive (class 1),
    # a review with a numeric rating <= negative_threshold is annotated as negative (class 0).
    # This annotation will later serve to train the SVM classifier.
    positive_ratings = []
    negative_ratings = []
    for k in range(len(data_dict["rating"])):
        if float(data_dict["rating"][k]) >= positive_threshold:
            positive_ratings.append({"id": k, "text": data_dict["text"][k], "class": 1})
        elif float(data_dict["rating"][k]) <= negative_threshold:
            negative_ratings.append({"id": k, "text": data_dict["text"][k], "class": 0})

    # shuffle the entire data to get more variations
    shuffle(positive_ratings)
    shuffle(negative_ratings)

    # then select sample_size reviews of each class from the shuffled data
    annotated_data = positive_ratings[:sample_size] + negative_ratings[:sample_size]

    # save it in json format
    with open(
        "beer_reviews/annotated_reviews.json",
        "w",
        encoding="utf-8",
    ) as outfile:
        json.dump(annotated_data, outfile, indent=4)
    return annotated_data

In [11]:
data_dict = df_with_text.to_dict()
annotated_data = extract_and_annotate_reviews(data_dict, positive_threshold=4, negative_threshold=2, sample_size=500)

Now that we have our annotated dataset, it remains to generate the embedding of each review and aggregate the embeddings in a data matrix. We perform this task using an off-the-shelf BERT model and tokenizer.

In [13]:
def cls_embedding(context):
    """Generate the [CLS] embedding of a single review"""
    context = context.lower()
    input_ids = tokenizer.encode(context, add_special_tokens=True)
    with torch.no_grad():
        last_hidden_states = model(torch.tensor([input_ids]))[0]

    # Retrieve the embedding of the classification ([CLS]) token, i.e. the first token of the sequence
    token_embeddings = np.asarray(last_hidden_states[0])
    return token_embeddings[0]


def embed_reviews(annotated_data):
    """Generate the classification (CLS) embedding
    of each annotated review. Returns the embeddings X and the labels y
    (the binary quantization of ratings, used later as ground truth for the SVM)"""
    if os.path.exists("beer_reviews/X.dat") and os.path.exists("beer_reviews/y.dat"):
        with open("beer_reviews/X.dat", "rb") as input_file:
            X = pickle.load(input_file)
        with open("beer_reviews/y.dat", "rb") as input_file:
            y = pickle.load(input_file)
        return X, y

    # Create embeddings
    X = []
    y = []
    missing_ = []
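    for i in range(len(annotated_data)):
        try:
            X.append(np.asarray(cls_embedding(annotated_data[i]["text"])))
            y.append(annotated_data[i]["class"])
        except Exception:
            # the review could not be embedded (e.g. it exceeds BERT's 512-token limit); skip it
            missing_.append(i)
            continue

    # Save the embeddings and labels once, after all reviews have been processed
    with open("beer_reviews/X.dat", "wb") as output_file:
        pickle.dump(X, output_file)
    with open("beer_reviews/y.dat", "wb") as output_file:
        pickle.dump(y, output_file)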
    return X, y

In [14]:
X, y = embed_reviews(annotated_data)

Token indices sequence length is longer than the specified maximum sequence length for this model (680 > 512). Running this sequence through the model will result in indexing errors
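
The warning above means that at least one review is longer than the 512 tokens BERT accepts; in this notebook such reviews presumably raise an error in `cls_embedding` and are skipped by the `try`/`except` in `embed_reviews`. One alternative is to truncate long reviews instead of dropping them. The sketch below shows the idea on plain lists of token ids; the helper name `truncate_token_ids` is hypothetical and the ids are placeholders. In practice, Hugging Face tokenizers can do this directly through the `truncation=True` and `max_length` arguments of `encode`.

```python
def truncate_token_ids(input_ids, max_length=512):
    """Shorten a token id sequence to max_length, keeping the final special token.

    The first id ([CLS]) is kept because it starts the sequence; the last id
    ([SEP]) is re-appended after cutting the middle of the sequence.
    """
    if len(input_ids) <= max_length:
        return list(input_ids)
    return list(input_ids[: max_length - 1]) + [input_ids[-1]]


# Placeholder ids standing in for [CLS] and [SEP]
CLS_ID, SEP_ID = 101, 102
# A fake 680-token review, the length reported in the warning above
long_review_ids = [CLS_ID] + list(range(1000, 1678)) + [SEP_ID]
truncated = truncate_token_ids(long_review_ids)
print(len(long_review_ids), len(truncated))
```

Truncation keeps only the beginning of a long review, so any sentiment expressed near the end is lost; whether that is acceptable here is a judgment call.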


We can now fit an SVM classifier on `X` and `y` using `sklearn`:


In [18]:
def SVM_fit(X, y):
    """Train a linear SVM classifier"""
    model_path = "beer_reviews/linear_SVM.joblib"
    # Load model if it has already been trained
    if os.path.exists(model_path):
        # Load the previously trained model
        clf = load(model_path)
    else:
        # Split dataset and fit model
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
        clf = SVC(kernel="linear", C=0.025)
        clf.fit(X_train, y_train)
        # Save model
        dump(clf, model_path)
    return clf

In [20]:
classifier = SVM_fit(X, y)
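
Note that `SVM_fit` sets aside a test split but never scores the classifier on it. A possible next step is to measure how well the predictions agree with the held-out labels. The sketch below computes accuracy and confusion counts in plain Python on made-up labels; the function names are hypothetical. With scikit-learn, `clf.score(X_test, y_test)` would give the accuracy directly, provided `SVM_fit` is changed to return the test split.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


# Made-up labels and predictions, for illustration only
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_accuracy, toy_counts)
```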

It is now time to make some predictions on beer popularity per country. First we create a dictionary of the form:
> `key`: country


> `value`: list of reviews originating from this country 

We will use this dictionary to extract the top-K countries with the highest number of reviews. Then, for each of these top-K countries, we will predict the sentiment of a sample of its reviews. Since each review is for a given beer, we can rank beer popularity for a country by looking at the "most-positive" reviews from this country according to our classifier. This finally gives us a list of the most popular beers per country (among the top-K), based only on the textual reviews.
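
The grouping step described above can be sketched with a `defaultdict`. The helper name `group_indices_by_country` and the toy location list are hypothetical; the sketch stores review indices rather than review texts, which keeps the dictionary small.

```python
from collections import defaultdict


def group_indices_by_country(countries):
    """Map each country to the list of indices of the reviews written from it."""
    groups = defaultdict(list)
    for index, country in enumerate(countries):
        # skip reviews whose location is missing
        if isinstance(country, str):
            groups[country].append(index)
    return dict(groups)


# Hypothetical 'country' column, one entry per review
toy_countries = ["Canada", "Northern Ireland", "Canada", None]
groups = group_indices_by_country(toy_countries)
# Countries ordered by number of reviews, largest first
top_countries = sorted(groups, key=lambda c: len(groups[c]), reverse=True)[:1]
print(groups, top_countries)
```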

We start by extracting the country-reviews dictionary described above:

In [32]:
def country_reviews_dict(data_dict):
    """Create a {"country": [index of review 1, index of review 2, etc...]}- dict"""
    if os.path.exists("beer_reviews/country_indices.json"):
        with open("beer_reviews/country_indices.json", encoding="utf-8") as input_file:
            country_dict = json.load(input_file)
        return country_dict

    # keep only clean location strings (entries containing the HTML residue "</a>" are discarded)
    country_set = {str(country) for country in data_dict["country"].values() if "</a>" not in str(country)}
    country_dict = dict()
    # data_dict["text"] maps each row index to its review text; we store the row indices
    for i in data_dict["text"]:
        country_review = data_dict["country"][i]
        if country_review in country_set:
            country_dict.setdefault(country_review, []).append(i)

    # Save dictionary in a json format
    with open(
        "beer_reviews/country_indices.json",
        "w",
        encoding="utf-8",
    ) as outfile:
        json.dump(country_dict, outfile, indent=4)
    return country_dict

In [34]:
country_dict = country_reviews_dict(data_dict)

As described above, we use this dictionary to find the 10 countries with the most reviews:

In [48]:
def top_k_countries(country_dict, k=15):
    """Output highest percentiles of
    most frequent countries in the review
    dataset, and return the top k countries"""
    freq_list = []
    for country_, ix_list in country_dict.items():
        freq_list.append(len(ix_list))
    print("Review frequency percentiles:")
    for percentile_ in [50, 75, 90, 95, 99]:
        print("The", percentile_, "th percentile of number of reviews is:\t", np.percentile(freq_list, percentile_))

    # Sort the country/indices-dict according to frequency
    sorted_country_dict = [d for _, d in sorted(zip(freq_list, country_dict))]
    sorted_country_dict.reverse()
    top_k_sorted = sorted_country_dict[:k]
    return top_k_sorted

In [49]:
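# note: this rebinds the name `top_k_countries` from the function to its result,
# so the function definition must be re-run before this cell can be run again
top_k_countries = top_k_countries(country_dict, 10)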
print("\n Countries with the highest review output:")
print(top_k_countries)

Review frequency percentiles:
The 50 th percentile of number of reviews is:	 266.5
The 75 th percentile of number of reviews is:	 8276.0
The 90 th percentile of number of reviews is:	 50875.600000000086
The 95 th percentile of number of reviews is:	 78053.4
The 99 th percentile of number of reviews is:	 164699.85000000038

 Countries with the highest review output:
['United States, Pennsylvania', 'United States, California', 'United States, New York', 'United States, Illinois', 'United States, Massachusetts', 'Canada', 'United States, Ohio', 'United States, Texas', 'United States, Washington', 'United States, New Jersey']
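
In the list above, nine of the ten entries are US states, because the location of US users includes the state. If a ranking by country in the strict sense is wanted, the state can be stripped before counting. The following sketch shows one way to do so; the helper name `country_of` and the review counts are made up, and whether states should be merged at all is a design decision that this notebook leaves open.

```python
from collections import Counter


def country_of(location):
    """Return the part of a location string before the first comma."""
    return location.split(",")[0].strip()


# Hypothetical review counts per location, for illustration only
review_counts = {
    "United States, Pennsylvania": 3,
    "United States, California": 2,
    "Canada": 4,
}
totals = Counter()
for location, n_reviews in review_counts.items():
    totals[country_of(location)] += n_reviews
print(totals.most_common(2))
```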


For each of the countries listed above, we find the most popular beer (i.e. the one ranked highest by our classifier) and summarize the results in `output_dict`:

In [73]:
def most_popular_beers_per_country(data_dict, top_k_countries, country_dict, clf, l=10):
    """For each of the k most frequent countries, classify a random
    sample of up to 100 reviews and rank them by predicted class.
    Returns --> {"k^th country_": [beer of the 1st ranked
                    (positively predicted) review,
                        beer of the 2nd ranked review,
                        ...]}"""
    if os.path.exists("beer_reviews/most_popular_beers.json"):
        with open("beer_reviews/most_popular_beers.json", encoding="utf-8") as input_file:
            output_dict = json.load(input_file)
        return output_dict
    
    output_dict = {}
    for country in top_k_countries:
        # indices of the reviews written from this country
        index_list = country_dict[country]
        output_dict[country] = []

        # Classify a random sample of at most 100 reviews from this country
        shuffle(index_list)
        sampled_indices = index_list[:100]
        predictions_ = []
        for index in sampled_indices:
            try:
                embedding_ = cls_embedding(data_dict["text"][index])
                # predict returns an array of length 1; store the class as a plain int
                predicted_class = int(clf.predict(np.asarray([embedding_]))[0])
                predictions_.append(predicted_class)
            except Exception:
                # reviews that cannot be embedded are counted as negative
                predictions_.append(0)
        # Rank the sampled reviews by predicted class (positive first)
        sorted_index_list = [d for _, d in sorted(zip(predictions_, sampled_indices))]
        sorted_index_list.reverse()
        top_l_reviews = sorted_index_list[:l]

        # Retrieve beer name from review index (ix)
        for review_ix in top_l_reviews:
            output_dict[country].append(data_dict["beer_name"][review_ix])

    # Save it in a json format
    with open(
        "beer_reviews/most_popular_beers.json",
        "w",
        encoding="utf-8",
    ) as outfile:
        json.dump(output_dict, outfile, indent=4)
    return output_dict

In [74]:
output_dict = most_popular_beers_per_country(data_dict, top_k_countries, country_dict, classifier, l=10)

In [81]:
print("---- Top 10 countries in reviewing output ----\n")
for country in output_dict:
    print(country, end='\n')

print("\n---- Most popular beer for the selected countries ----\n")
for country in output_dict:
    top_beer = output_dict[country][0]
    print(f"{country}: {top_beer}", end='\n')

---- Top 10 countries in reviewing output ----

United States, Pennsylvania
United States, California
United States, New York
United States, Illinois
United States, Massachusetts
Canada
United States, Ohio
United States, Texas
United States, Washington
United States, New Jersey

---- Most popular beer for the selected countries ----

United States, Pennsylvania: Gatekeeper
United States, California: Porter
United States, New York: Wild Goose Nut Brown Ale
United States, Illinois: Celis Grand Cru
United States, Massachusetts: Concord IPA
Canada: Once Upon A Time 1901 KK
United States, Ohio: RedLegg Ale
United States, Texas: Cottonwood Pumpkin Spiced Ale
United States, Washington: Queen Nina's Imperial IPA
United States, New Jersey: Saint Botolph's Town


The list above showcases the results of our review analysis pipeline. As a byproduct of the ranking, we also generated data collections such as `annotated_reviews`, `most_popular_beers` and `country_indices` that we can use for the final milestone. More importantly, we have a new metric for beer popularity that could be used in addition to the numerical ratings to provide a more complete analysis. The next steps in this direction consist of identifying the differences between the two metrics and determining whether they can be combined into a single indicator of popularity. This indicator would then be used as the final metric of beer popularity.
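
As a purely illustrative starting point for such a combined indicator, one option is a weighted average of the normalized mean rating and the share of positively classified reviews. Everything in the sketch below is an assumption rather than a result of this notebook: the function name `combined_popularity`, the linear form, the default weight of 0.5 and the example values are all hypothetical.

```python
def combined_popularity(mean_rating, positive_share, weight=0.5, max_rating=5.0):
    """Blend a mean rating and a positive-review share into a score in [0, 1].

    weight = 1 uses only the rating, weight = 0 uses only the sentiment share.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    normalized_rating = mean_rating / max_rating
    return weight * normalized_rating + (1.0 - weight) * positive_share


# Hypothetical beer with a mean rating of 4.0 and 60% positive reviews
score = combined_popularity(mean_rating=4.0, positive_share=0.6)
print(score)
```

Comparing the rankings obtained for several values of `weight` would be one way to see how much the two metrics disagree.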