# Unlocking the Palate - Evaluating Taste Consensus Among Beer Reviewers

---

Group [**BlackAda**](https://en.wikipedia.org/wiki/Blackadder)

> - Ludek Cizinsky ([ludek.cizinsky@epfl.ch](mailto:ludek.cizinsky@epfl.ch))
> - Peter Nutter ([peter.nutter@epfl.ch](mailto:peter.nutter@epfl.ch))
> - Pierre Lardet ([pierre.lardet@epfl.ch](mailto:pierre.lardet@epfl.ch))
> - Christopher Bastin ([christopher.bastin@epfl.ch](mailto:christopher.bastin@epfl.ch))
> - Mika Senghaas ([mika.senghaas@epfl.ch](mailto:mika.senghaas@epfl.ch))

## Introduction

---

Navigating the world of beer reviews can be a daunting task for non-experts. Beer aficionados often describe brews as having nuanced flavors such as "grassy notes" and "biscuity/crackery malt," with hints of "hay." But do these descriptions reflect the actual tasting experience? Following a "wisdom-of-the-crowd" approach, a descriptor can be considered meaningful if many independent reviewers use similar descriptors for a beer's taste. To quantify consensus, we use natural language processing techniques to extract descriptors of a beer's taste and represent these descriptors numerically to compute similarity or consensus scores. The consensus scores between beer reviews will reveal whether there is a shared understanding of taste among beer geeks.

## Dependencies

---

We load the dependencies required for this project to run.

In [None]:
# Enable continuous module reloading
%load_ext autoreload
%autoreload 1
%aimport src

# Standard library
import os

# External library
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import pandas as pd
import numpy as np
import spacy

# Custom modules
from src import utils
from src import extractors
from src import embedders
from src import consensus
from src import pipeline

And set some global variables.

In [None]:
# Plotting settings
colorstyle = "RdBu"
sns.set_style('dark')
sns.set_palette(colorstyle)

# Pandas settings
pd.options.display.max_colwidth = 150 

# Load the small English spaCy model (tokenisation, lemmas, POS tags)
nlp = spacy.load("en_core_web_sm")

# URL for the full dataset
DATA_URL = "https://drive.google.com/u/2/uc?id=1IqcAJtYrDB1j40rBY5M-PGp6KNX-E3xq&export=download"

# Subsetting options
SUBSET = True
NUM_SUBSET_SAMPLES = 10_000

# Paths
ROOT_DIR = os.getcwd()
DATA_DIR = os.path.join(ROOT_DIR, "data")

# Random seed
SEED = 42

# Ensure data directory exists
os.makedirs(DATA_DIR, exist_ok=True)

## Data

---

We will be working with the beer review data from the [BeerAdvocate](https://www.beeradvocate.com/) platform. 


### Data Download

Due to its size (uncompressed 1.6 GB), the dataset is not included in the repository but must be downloaded. The course staff has provided the data via Google Drive. On the first run of this notebook, we download the compressed data file from Google Drive and extract it to the `data` folder. The compressed file is ~1.5 GB in size. 

After extraction and the removal of unnecessary files (archives, ratings file, ...), the data folder should contain the following files: `beers.csv`, `breweries.csv`, `users.csv`, `reviews.txt`. The total size of the data is ~2.9 GB.

*NB: Data loading takes around **~8min** on the first run. Subsequent runs of this cell are instant.*

In [None]:
# Download the BeerAdvocate dataset if it doesn't exist
if not os.path.exists(os.path.join(DATA_DIR, "reviews.txt")):
    utils.download_data(DATA_URL, data_dir=DATA_DIR)
print(f"Beer reviews downloaded to {DATA_DIR} ✅.")

### Data Loading

Next, we load the data into a Pandas DataFrame. On the first run, we load all the reviews from the `reviews.txt` file and populate it with some additional meta-data from the other files. We then save the DataFrame to a `.feather` file for faster loading in the future. On subsequent runs, we load the DataFrame from the `.feather` file if it exists.

*NB: Running this cell for the first time reads in all `2.5M` reviews which takes **~7min**. Subsequent runs should be much faster, taking about **~1min**.*

In [None]:
# Load either a random subset of reviews (NUM_SUBSET_SAMPLES) or all reviews
if SUBSET:
    reviews = utils.load_data(DATA_DIR, num_samples=NUM_SUBSET_SAMPLES, seed=SEED)
else:
    reviews = utils.load_data(DATA_DIR, seed=SEED)

msg = "Subset of Data" if SUBSET else "Full Data"
print(f"Loaded {len(reviews)} reviews ✅. ({msg})")

### Sanity Checks

During the data loading (`utils.load_data`) we perform some basic data pre-processing and merging. Specifically, we do the following:
- We merge the reviews data with some additional meta-data about the beers, users and breweries (e.g. beer style, user location, ...) and collect everything in a single multi-column DataFrame.
- We cast each column to the correct type, e.g. `date` is converted to a `datetime` object.
- We remove all reviews with missing values (there are only very few of them).

We check that each of these steps is performed correctly and that the data is consistent.

In [None]:
# Check that additional information is loaded in the reviews
additional_cols = [("user", "location")]

for col in additional_cols:
    err_msg = f"❌ Additional column {col} not loaded."
    assert col in reviews.columns, err_msg
print(f"✅ Additional columns loaded.")

In [None]:
# Check that columns have correct type (e.g. review time is a datetime)
example_types = {("review", "date"): "datetime64[ns]", ("review", "rating"): "float64", ("review", "text"): "object"}

for col, dtype in example_types.items():
    err_msg = f"❌ Column has type {reviews[col].dtype} but should be {dtype}"
    assert reviews[col].dtype == dtype, err_msg
print(f"✅ All columns have correct type.")

In [None]:
# Check that there are no missing values (NaNs)
missing_values = reviews.isna().sum()

err_msg = f"❌ There are {missing_values.sum()} missing values in the dataset!"
assert missing_values.sum() == 0, err_msg
print(f"✅ There are no missing values.")

### Understanding the Data

Let's explore the data a bit. In this section we will investigate the total number of reviews and various statistics and distributions about the reviews, beers, users and breweries.

*Note: We have a full notebook with more detailed EDA of the data in the [`playground/eda.ipynb`](https://github.com/epfl-ada/ada-2023-project-blackada/blob/main/playground/eda.ipynb) notebook. In this notebook we focus on the parts of the data exploration that are important for our project.*

In [None]:
# Show the first 3 rows of the data
reviews.head(3)

We see that all data is in a single data frame with multi-column indexing. Each row corresponds to a single review of a beer and denotes the user (`user`), beer (`beer`) and brewery (`brewery`) meta information, as well as the actual review data (`review`) in separate column indices. For example, we can look at the keys individually for the first three reviews.

In [None]:
# Meta-information on beer for first 3 samples
reviews["beer"].head(3)

In [None]:
# Meta-information on user for first 3 samples
reviews["user"].head(3)

In [None]:
# Meta-information on brewery for first 3 samples
reviews["brewery"].head(3)

In [None]:
# Information about review for first 3 samples
reviews["review"].head(3)

As we see, for each review, we have information on the following features:
    
1. **Review** (`review`): Review Text, Ratings (Appearance, Aroma, Palate, Taste, Overall, Rating), Date
2. **User** (`user`): User ID, User Name, #Ratings, #Reviews, Joined Date, Location
3. **Beer** (`beer`): Beer ID, Beer Name, Beer Style, ABV (Alcohol By Volume), #Ratings, #Reviews
4. **Brewery** (`brewery`): Brewery ID, Brewery Name, Location, #Beers
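
To make the multi-column layout concrete, here is a minimal sketch on a made-up two-row frame. The column names mirror the groups listed above, but the values and the name `toy` are purely illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with two-level column labels (group, field), mirroring the layout above.
# All values are invented for illustration only.
columns = pd.MultiIndex.from_tuples(
    [("beer", "name"), ("beer", "style"), ("user", "location"), ("review", "rating")]
)
toy = pd.DataFrame(
    [
        ["Toy Lager", "Lager", "Switzerland", 3.5],
        ["Toy Stout", "Stout", "England", 4.0],
    ],
    columns=columns,
)

beer_block = toy["beer"]           # all beer columns as a regular DataFrame
styles = toy[("beer", "style")]    # a single column, selected with a tuple key
same_styles = toy.beer["style"]    # attribute access on the top-level key
```

Selecting with the top-level key returns a sub-frame, while a full `(group, field)` tuple returns a single column. This is the access pattern used in the cells of this notebook.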

### Groups

In our analysis we want to compute the consensus between the language used in reviews of a) all beers, b) beers of the same style, c) beers from the same brewery and, finally, d) individual beers. The hypothesis is that the finer-grained the grouping, the higher the consensus between the reviewers. However, for the analysis to be meaningful we need to ensure that there are enough reviews in each group. We therefore compute the number of reviews in each group and plot the distribution of the number of reviews per group.

In [None]:
unique_beer_styles = reviews.beer["style"].drop_duplicates()
unique_breweries = reviews.brewery.drop_duplicates()
unique_beers = reviews.beer.drop_duplicates()

print(f"Number of unique beer styles: {len(unique_beer_styles)}")
print(f"Number of unique breweries: {len(unique_breweries)}")
print(f"Number of unique beers: {len(unique_beers)}")

In [None]:
# Compute the number of reviews for each element in each group
reviews_per_beer_style = reviews.groupby(by=("beer", "style")).size().sort_values(ascending=False)
reviews_per_brewery = reviews.groupby(by=("brewery", "id")).size().sort_values(ascending=False)
reviews_per_beer = reviews.groupby(by=("beer", "id")).size().sort_values(ascending=False)

# Plot the ranked number of reviews per beer style, brewery and beer
fig, axs = plt.subplots(ncols=3, figsize=(20, 5))
for ax, reviews_per_group in zip(axs, [reviews_per_beer_style, reviews_per_brewery, reviews_per_beer]):
    sns.lineplot(x=range(len(reviews_per_group)), y=reviews_per_group.values, ax=ax)
    ax.plot([0, len(reviews_per_group)], [100, 100], linestyle="--", color="black")
    a, b = reviews_per_group.index.names[0]
    ax.set(
        title=f"#Reviews per {a.capitalize()} {b.capitalize()}",
        xlabel="Rank",
        ylabel="Counts (Log)",
        yscale="log"
        )
    

In [None]:
MIN_REVIEWS = 100

# Keep only beer styles, breweries and beers with at least MIN_REVIEWS reviews
included_beer_styles = reviews_per_beer_style[reviews_per_beer_style >= MIN_REVIEWS].index
included_breweries = reviews_per_brewery[reviews_per_brewery >= MIN_REVIEWS].index
included_beers = reviews_per_beer[reviews_per_beer >= MIN_REVIEWS].index

# Create one boolean mask per grouping
min_reviews_beer_style_mask = reviews.beer["style"].isin(included_beer_styles)
min_reviews_breweries_mask = reviews.brewery["id"].isin(included_breweries)
min_reviews_beer_mask = reviews.beer["id"].isin(included_beers)

# Keep only reviews that satisfy all three masks
original_reviews = reviews.copy()
reviews = reviews[min_reviews_beer_style_mask & min_reviews_breweries_mask & min_reviews_beer_mask]

print(f"✅ Filtering done. Reviews after filtering: {len(reviews)} (Removed {len(original_reviews) - len(reviews)} reviews)")

### Reviews Statistics

The textual reviews are central to our analysis and we will be using them to extract the taste descriptors. Let's look at some statistics about the reviews to ensure that they are of good quality.

In [None]:
# Let's show some example reviews
pd.DataFrame(reviews.review.head(10)["text"])

We see that this sample of 10 reviews consists only of reviews that are detailed and descriptive about the beer and its taste. This suggests that the majority of reviews are of good quality and suited to our analysis. However, we suspect that there might be some meaningless "spam" that could skew our results. We investigate this by checking for outliers in review length, using two simple proxies: the number of words and the number of characters in a review.

In [None]:
# Compute character and word lengths of reviews
character_lengths = reviews.review.text.str.len()
word_lengths = reviews.review.text.apply(lambda x: len(x.split()))

# Plot the distributions of character and word lengths
fig, ax = plt.subplots(ncols=2, figsize=(20, 5))
sns.histplot(x=character_lengths, kde=True, ax=ax[0])
sns.histplot(x=word_lengths, kde=True, ax=ax[1])

character_lengths_stats = character_lengths.describe()
word_lengths_stats = word_lengths.describe()

ax[0].set(
    title="Distribution of Character Lengths in Reviews",
    xlabel="Character Length",
    ylabel="Frequency",
)
ax[1].set(
    title="Distribution of Word Lengths in Reviews",
    xlabel="Word Lengths",
    ylabel="Frequency",
)

# Show summary statistics
pd.DataFrame([character_lengths_stats, word_lengths_stats], index=["Character Lengths", "Word Lengths"])

We see that most reviews are around **~680 characters** and **~118 words** long. There is a slight right-skew in the distribution, meaning that there are some very long reviews. The very short reviews are probably not very helpful for our analysis as the numeric representation will not be meaningful. Let's look at those reviews to see if further processing is required.

In [None]:
# Show the shortest 0.1% of reviews (by character count)
n = int(len(reviews) * 0.001)
character_sorted = list(character_lengths.sort_values().index.values)
shortest_character_length_reviews = reviews.review[reviews.index.isin(character_sorted[:n])]

pd.DataFrame(shortest_character_length_reviews.text)

In [None]:
# Show the shortest 0.1% of reviews (by word count)
n = int(len(reviews) * 0.001)
words_sorted = list(word_lengths.sort_values().index.values)
shortest_word_length_reviews = reviews.review[reviews.index.isin(words_sorted[:n])]

pd.DataFrame(shortest_word_length_reviews.text)

Upon inspecting the shortest reviews, we can see that most of the shortest reviews by character count are regular reviews that are simply short. However, among the reviews with very few words we can see some "spam" reviews that are not helpful for our analysis. It is likely that our extractors will struggle with these kinds of reviews. Therefore, we remove all reviews that have fewer than 10 words.

In [None]:
# Remove the shortest reviews by word count from the dataset
MIN_WORDS = 10
filtered_review = reviews.copy()
reviews = reviews[word_lengths >= MIN_WORDS]

print(f"Removed {(word_lengths < MIN_WORDS).sum()} reviews with less than {MIN_WORDS} words ✅")
print(f"Number of reviews: {len(reviews)}")

## Analysis

---

Nice - the beer style and beer names are quite diverse. In our later analysis we will use these sub-groups to compute consensus scores among reviews for beer styles and specific beers. This analysis suggests that we will have enough sub-groups.

Let's denote all the $n=2400935$ reviews as $r_i \in \mathcal{R}$, where $\mathcal{R}$ denotes the set of all reviews.

$$
\mathcal{R} = \{r_1, r_2, \dots, r_n\}, \text{ with } |\mathcal{R}| = n
$$

As a baseline, we will compute the consensus score over all of these reviews through a consensus function $\mathcal{C}: 2^{\mathcal{R}} \rightarrow \mathbb{R}$ that maps any set of reviews to a consensus score. We hypothesise that this consensus score will be higher for the sub-groups of reviews that pertain to specific beer styles, breweries and beers than for the entire set of reviews.

We then repeat the analysis on sub-groups of increasing granularity: all reviews for a specific beer style ($S_i \in \mathcal{S}$, where $|\mathcal{S}|=104$), for a specific brewery ($Br_i \in \mathcal{Br}$, where $|\mathcal{Br}|=11117$) and finally for a specific beer ($B_i \in \mathcal{B}$, where $|\mathcal{B}|=141833$).

\begin{align*}
S_i &= \{r_j \mid S(r_j) = S_i\} \\
B_i &= \{r_j \mid B(r_j) = B_i\} \\
Br_i &= \{r_j \mid Br(r_j) = Br_i\},
\end{align*}

where we use mapping functions $S: \mathcal{R} \rightarrow \mathcal{S}$, $B: \mathcal{R} \rightarrow \mathcal{B}$, $Br: \mathcal{R} \rightarrow \mathcal{Br}$ to get the beer style, beer and brewery for a review, respectively.

Within each grouping, the sub-groups cover the full set of reviews, e.g.

$$
S_1 \cup S_2 \cup \dots \cup S_{104} = \mathcal{R},
$$

and are pairwise disjoint, since each review belongs to exactly one beer style, beer and brewery, e.g.

$$
S_i \cap S_j = \emptyset \quad \text{for all } i \neq j.
$$
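
The covering and disjointness properties can be checked directly on data. The following is a minimal sketch on a made-up table; the column names `review_id` and `style` and all values are assumptions for illustration, not the notebook's actual columns.

```python
import pandas as pd

# Hypothetical review ids with their beer style; values are illustrative only.
toy = pd.DataFrame(
    {"review_id": [1, 2, 3, 4, 5], "style": ["IPA", "Stout", "IPA", "Lager", "Stout"]}
)

# One set of review ids per style, i.e. the sets S_1, ..., S_k
groups = {style: set(g["review_id"]) for style, g in toy.groupby("style")}

all_reviews = set(toy["review_id"])

# The union of all groups should recover the full set of reviews
covers_everything = set().union(*groups.values()) == all_reviews

# Given full coverage, the groups are disjoint exactly when no review is
# counted twice, i.e. when the group sizes add up to n
pairwise_disjoint = sum(len(s) for s in groups.values()) == len(all_reviews)
```

Both flags are `True` here because a `groupby` on a single column assigns every row to exactly one group.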

### Extractors

Before we embed reviews into a numerical representation, we preprocess them using different **extractor modules**. For this project, we have considered the following methods:

(1) `DummyExtractor`: This is a dummy extractor that does not do any preprocessing. It simply returns the input text as is.

(2) `LemmaExtractor`: Tokenizes the text and then uses only the *lemmas* of the extracted tokens. A lemma is the base form of a word. For example, the lemma of **was** is **be**. Thus, the `LemmaExtractor` can be thought of as a text normaliser which maps all tokens to a normalised space.

(3) `AdjectiveExtractor`: As the name suggests, extracts only the tokens which were classified by `spaCy` as **adjectives**.
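
The implementation in `src.extractors` is not shown in this notebook, so the following is only a sketch of what the lemma and adjective extraction could look like. It assumes nothing more than tokens exposing `text`, `lemma_` and `pos_` (attributes that spaCy tokens provide). The function names and the hand-annotated `FakeToken` stand-in are hypothetical and only serve to keep the example runnable without a language model.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class TokenLike(Protocol):
    """Minimal token interface; spaCy tokens expose these attributes."""
    text: str
    lemma_: str
    pos_: str


def lemma_text(tokens: Iterable[TokenLike]) -> str:
    """Join the lemma of every token (LemmaExtractor-like behaviour)."""
    return " ".join(tok.lemma_ for tok in tokens)


def adjective_text(tokens: Iterable[TokenLike]) -> str:
    """Keep only tokens tagged as adjectives (AdjectiveExtractor-like behaviour)."""
    return " ".join(tok.text for tok in tokens if tok.pos_ == "ADJ")


@dataclass
class FakeToken:
    """Hand-annotated stand-in for a spaCy token, for illustration only."""
    text: str
    lemma_: str
    pos_: str


demo_tokens = [
    FakeToken("Dry", "dry", "ADJ"),
    FakeToken("grains", "grain", "NOUN"),
    FakeToken("were", "be", "AUX"),
    FakeToken("chalky", "chalky", "ADJ"),
]
print(lemma_text(demo_tokens))      # dry grain be chalky
print(adjective_text(demo_tokens))  # Dry chalky
```

Because a spaCy `Doc` iterates over its tokens, the same two functions could be applied to the output of `nlp(text)`.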


In [None]:
# Define all extractor models
extractor_models: list[extractors.ExtractorBase] = [
    extractors.DummyExtractor(),
    extractors.LemmaExtractor(),
    extractors.AdjectiveExtractor()
]

We want to understand the behaviour of each of the extractors in detail. To do this, we process an example review.

In [None]:
# Define demo review
demo_review = \
"""Pours with a frothy head then settles to a thin head with thin lacing. 
Transparent. Golden to bronze in color. Dry grains. 
Light notes of citrus - orange. Pilsner-esque. Very light malt sweetness - caramel. 
Moves to a dry hoppy-ness. Light bodied. Dry. Somewhat chalky. Meh. 
Just average. Not one I would suggest to a friend, but thank for the organic 
ingredients.
"""

# Preprocess the example with Spacy
processed_demo_review = [nlp(demo_review)]

In [None]:
# Run the extractors against the example
transformed_all = []
for extractor in extractor_models:
    transformed_example = extractor.transform(processed_demo_review)
    transformed_all.append(transformed_example[0])

Starting with the `DummyExtractor`, we can use it as a reference baseline for the other two extractors.

In [None]:
print("DummyExtractor:\n", transformed_all[0].strip())

Let's look at the `LemmaExtractor` next.

In [None]:
print("LemmaExtractor\n", transformed_all[1].strip())

As the output above shows, `LemmaExtractor` has normalised the words to their base form (lemma). A couple of examples:

(1) `grains` -> `grain` (get rid of the plural form)

(2) `settles` -> `settle` (remove `s` from the he/she/it form)

(3) `bodied` -> `body` (stem form)

Apart from the lemmatisation, we can also see how `spaCy` tokenizes the text. In particular, it treats punctuation marks as separate tokens. For example, `.` is a separate token.

Lastly, we run the `AdjectiveExtractor` on the example review.

In [None]:
print("AdjectiveExtractor\n", transformed_all[2].strip())

Finally, looking at the `AdjectiveExtractor`, we can see that it strips the text down to only adjectives, thereby potentially losing some useful information. On the other hand,
for the purposes of our analysis, this might in fact be useful, as we want our embeddings to be based only on the descriptive words related to beer and to avoid the noise.

Now, let's run the extractors against the selected subsample and then investigate the results in more detail. We start by preprocessing the text.

*NB:* takes around **4 minutes** to run.

In [None]:
processed_reviews = [nlp(text) for text in tqdm(reviews.review.text.tolist())]
reviews[("review", "docs")] = processed_reviews

Now, we will run the extractors against the subsample and save the results.
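
The helper `utils.get_word_frequency` used in the next cell is not shown in this notebook. A minimal sketch of such a helper, assuming it counts whitespace-separated words and returns a DataFrame with the `word` and `frequency` columns that the plot expects, could look as follows. The function name `word_frequency_sketch` is hypothetical.

```python
from collections import Counter

import pandas as pd


def word_frequency_sketch(texts: list[str]) -> pd.DataFrame:
    """Count whitespace-separated, lower-cased words across all texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count, so head(10) yields the top-10 words
    return pd.DataFrame(counts.most_common(), columns=["word", "frequency"])


freq_demo = word_frequency_sketch(["dry light dry", "light sweet dry"])
```

Here `freq_demo` lists `dry` (3), `light` (2) and `sweet` (1), in that order.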

In [None]:
# We map the list of docs to the list of preprocessed strings
extracted_reviews = [extractor.transform(processed_reviews) for extractor in extractor_models]
frequencies = [utils.get_word_frequency(text) for text in extracted_reviews]

# Plot the word frequency of top-10 words for each extractor
fig, axes = plt.subplots(1, 3, figsize=(10, 5))
for ax, freq, extractor in zip(axes, frequencies, extractor_models):
    sns.barplot(x='frequency', y='word', data=freq.head(10), ax=ax)
    ax.set_title(extractor.name)
    ax.set_xlabel('Word Frequency')
    ax.set_ylabel('Word')

plt.tight_layout()

In summary, each of the extractors works as expected. Given the manual inspection of the extraction process, we hypothesise that the `AdjectiveExtractor` is the most suitable one for our task because the adjectives are most related to the taste of a beer. Thus, numerically representing only the subset of adjectives is going to be the closest proxy to an embedding of the beer's taste.

### Embedders

We need an embedding module to turn the extracted information from the reviews into a numeric representation.

In [None]:
# Initialise embedders
embedding_models: list[embedders.EmbedderBase] = [
    embedders.CountEmbedder(),
    embedders.TfidfEmbedder(),
    embedders.BertEmbedder(),
    embedders.SentenceTransformerEmbedder(),
]

Let's go over how each one works.

- `CountEmbedder` uses sklearn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Count vectorisation assigns each word in the vocabulary to one dimension of the feature vector; the value in that dimension is the number of times the word occurs in the review.

- `TfidfEmbedder` uses sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). TFIDF is similar to count vectorisation, but also multiplies each count by an 'inverse document frequency' term. This term weights a word in the vocabulary inversely to how many reviews in the corpus contain it: very common words are penalised, and rarer words are given more weight. In its textbook form, the weight is

$$w_{i,j} = tf_{i,j} \times \log\frac{N}{df_i}$$

where $w_{i,j}$ denotes the TFIDF of the $i$-th term in review $j$, $tf_{i,j}$ is the 'term frequency' (the count vectorization) of term $i$ in review $j$, $N$ is the total number of reviews and $df_i$ is the 'document frequency' of term $i$, i.e. the number of reviews in which $i$ appears. The second factor corresponds to the 'inverse document frequency' (IDF) of TFIDF. Note that sklearn uses a smoothed variant of the IDF term and, by default, L2-normalises each vector, so its exact values differ from this formula.
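
As a small worked example of the weighting, the sketch below computes TFIDF weights by hand, using the common logarithmic form of the IDF term. The three texts are made up, and the values will not match sklearn's `TfidfVectorizer`, which by default smooths the IDF term and L2-normalises each vector.

```python
import math
from collections import Counter

# Three tiny "reviews"; the texts are illustrative only.
docs = ["sweet malt sweet", "bitter hops", "sweet hops"]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

# Document frequency: number of reviews that contain each term at least once
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))


def tfidf(term: str, tokens: list[str]) -> float:
    """Term count in one review times log(N / df) for that term."""
    return tokens.count(term) * math.log(n_docs / doc_freq[term])


w_sweet = tfidf("sweet", tokenised[0])  # 2 * log(3 / 2)
w_malt = tfidf("malt", tokenised[0])    # 1 * log(3 / 1)
```

Although "sweet" occurs twice in the first review, the rarer word "malt" receives the larger weight, which is the penalisation of common words described above.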

- `BertEmbedder` uses `bert-base-uncased` [from HuggingFace](https://huggingface.co/bert-base-uncased). BERT is a bidirectional encoder-only transformer. There are many options for extracting embeddings from the model, since there are 12 layers and one embedding for each input token. Currently, the implementation takes the penultimate hidden state of the model and takes the mean across all tokens in the input (see [this guide](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)).

- `SentenceTransformerEmbedder` uses the recommended `all-MiniLM-L6-v2` model from the `sentence-transformers` [library](https://www.sbert.net/docs/pretrained_models.html). These models take outputs from BERT, conduct pooling similar to the above (e.g. by default, mean of the last layer), and are trained on various sentence-related NLP problems using [Siamese networks](https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942).
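
The mean pooling step mentioned for both transformer-based embedders can be sketched with plain NumPy. This is an illustration under stated assumptions rather than the project's implementation: the function name `mean_pool` is hypothetical, and whether the embedders in `src.embedders` exclude padded positions is not stated in this notebook.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per text, ignoring padded positions.

    hidden_states: (batch, tokens, dim) array taken from one transformer layer.
    attention_mask: (batch, tokens) array, 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Two texts, three token slots, two dimensions; the second text has one padded slot
states = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(states, mask)  # one vector per text, shape (2, 2)
```

The padded slot `[9.0, 9.0]` does not influence the second pooled vector, so texts of different lengths remain comparable.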

In [None]:
# Create a dataframe with some sample reviews
embedder_demo = pd.DataFrame(
    {
        "text": [
            "The beer is nice, with sweet nutty flavours",
            "This is a very different sentence",
            "Not sweet enough. I like my beer sweet. ",
            "Not sweet at all. Terrible beer. ",
            "Not sweet at all. But I like bitter beers so it is a nice beer. ",
            "Piss yellow beer",
            "Sweet beer"
        ]
    }
)

# Loop through each model and add the embeddings to the dataframe
for embedder in embedding_models:
    embedder_demo[embedder.name] = embedder.transform(embedder_demo["text"]).tolist()

# Show the dataframe
embedder_demo.head()

We can see how each of the embedders behaves. `CountEmbedder` outputs mostly 0s and 1s, and occasionally higher numbers, since the values are the counts of each vocabulary word in a review. TFIDF similarly outputs 0s wherever a review does not contain a given word; the nonzero terms are less easily interpretable, but roughly correspond to the counts with the IDF term taken into account.
The BERT and SentenceTransformer embeddings are not directly interpretable.

We can now compare the models' behaviour with desired behaviour using cosine similarity. Given these sample texts, for simplicity, we compute the cosine similarity in the embeddings between the 1st sentence and each subsequent sentence.
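
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The helper `utils.cosine_similarity` is not shown in this notebook; the following minimal sketch shows one way to compute it, where the name `cosine_similarity_sketch` and the handling of all-zero vectors are assumptions.

```python
import numpy as np


def cosine_similarity_sketch(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
sim_ab = cosine_similarity_sketch(a, b)      # 1 / sqrt(2)
sim_aa = cosine_similarity_sketch(a, 2 * a)  # scaling does not change the angle
```

Since the measure ignores vector length, a long and a short review with the same word proportions would score close to 1 under the count-based embedders.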

In [None]:
def get_similarity(review1: str, review2: str) -> dict[str, float]:
    """Computes the similarity between two reviews for each embedding model"""
    texts = [review1, review2]
    similarities = {}
    for model in embedding_models:
        embeddings = model.transform(texts)
        similarities[model.name] = utils.cosine_similarity(embeddings[0], embeddings[1])
    return similarities

df_methods = pd.DataFrame(index=[model.name for model in embedding_models])

# Compute the similarity between the first and the nth sentence
for i in range(1, len(embedder_demo)):
    df_methods["Similarity " + str(i+1)] = get_similarity(embedder_demo["text"][0], embedder_demo["text"][i]).values()

print("Similarity between first and nth sentence:")
df_methods.head()

Here we can see the obvious pitfall of using CountVectorizer and TFIDF: they lose all context. Sentences 3 and 4 would ideally have the lowest similarity with sentence 1 since they are opposite in meaning. However, these samples all use the same words, which Count and TFIDF therefore interpret as being similar. If, during the pipeline, we were to group beers by some measure that affects their sweetness, then in order to confirm our hypothesis we would like to see an increase of similarity inside each group, but we may instead see lower similarity due to negations.

However, BERT and SentenceTransformers are not necessarily better. The values are far less interpretable, and the sentence embedder falls for a similar negation trap, since similarities 3 and 4 are higher than 7. Interestingly, SentenceTransformer was far better than BERT at differentiating between sentences on different topics (similarity 2). BERT's scores are all broadly similar and roughly in the order we might expect, but we have little faith that this translates any better than the sentence transformer to the real reviews, since these short samples play into BERT's context-aware strengths.

For now, we will try to draw conclusions using TFIDF. It is the most interpretable (we can extract the most impactful words at the end), and so long as there are enough reviews that are long enough, we should see a meaningful vocabulary emerge. If the TFIDF embeddings seem to be limiting us in the future, we can experiment with other methods.

Now let's try with some real sample reviews.

In [None]:
# Create a dataframe with some sample reviews
embedder_sample = pd.DataFrame({ "text": reviews.sample(4, random_state=0).review.text.values.tolist() })

# Compute the similarity between the first and the nth sentence
results = pd.DataFrame(index=[model.name for model in embedding_models])
for i in range(1, len(embedder_sample)):
    results["Similarity " + str(i+1)] = get_similarity(embedder_sample["text"][0], embedder_sample["text"][i]).values()

print("Similarity between first and nth sentence:")
results.head()

Unsurprisingly, Count and TFIDF agree on the ordering of similarity. However, BERT and SentenceTransformer disagree both with this ordering and with each other.
BERT is the less 'sure', with very high and close values, as in the previous example.

Reading the reviews, it's very hard to define what the ordering *should* be, therefore it is hard to define which embedder has done a better job in this sample. Further investigation will be carried out for P3.

### Consensus Clustering

The final step in the pipeline is to compute the consensus scores for a set of beer reviews. These are implemented as child classes of the `consensus.ConsensusBase` class. Currently, we have implemented the following consensus functions:

- `CosineSimilarity`: Computes the pairwise cosine similarity between all reviews in a set of reviews. The consensus score is the mean of all pairwise cosine similarities.
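
A minimal sketch of this consensus score, assuming that each unordered pair of distinct reviews is counted once and self-similarities are excluded, is shown below. Whether `consensus.CosineSimilarity` makes the same choices is not stated here, and the name `mean_pairwise_cosine` is hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all unordered pairs of distinct rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T                              # (n, n) similarity matrix
    upper = np.triu_indices(len(embeddings), k=1)     # each pair once, no diagonal
    return float(sims[upper].mean())


# Three toy embeddings: the first two are identical, the third is orthogonal
toy_embeddings = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_score = mean_pairwise_cosine(toy_embeddings)  # pairs give 1, 0 and 0
```

The similarity matrix has `n * n` entries, so memory grows quadratically with the group size. The sketch also needs at least two reviews per group for the mean to be defined.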

In [None]:
# Initialise the consensus models
consensus_models: list[consensus.ConsensusBase] = [
    consensus.CosineSimilarity(),
]

## Pipeline

---

We can bring the previous steps together into a `pipeline.TextAnalysis` object that exposes a `transform` method, which chains extraction, embedding and consensus computation.

We can now easily pass in a group of reviews and obtain a consensus score.
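
The internals of `pipeline.TextAnalysis` are not shown in this notebook. As a rough sketch of the idea, a pipeline of this kind can be a thin wrapper that applies three stages in order. Everything below (`PipelineSketch`, the toy stages, the two-word vocabulary) is hypothetical and operates on plain strings, whereas the actual pipeline is called on the preprocessed spaCy docs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

import numpy as np

VOCAB = ["sweet", "bitter"]  # toy vocabulary, for illustration only


def to_lower(texts: Sequence[str]) -> list[str]:
    """Toy extractor: lower-case every text."""
    return [t.lower() for t in texts]


def count_embed(texts: list[str]) -> np.ndarray:
    """Toy embedder: count occurrences of each vocabulary word."""
    return np.array([[t.split().count(w) for w in VOCAB] for t in texts], dtype=float)


def mean_pairwise_dot(emb: np.ndarray) -> float:
    """Toy consensus: mean dot product over all pairs of distinct rows."""
    sims = emb @ emb.T
    return float(sims[np.triu_indices(len(emb), k=1)].mean())


@dataclass
class PipelineSketch:
    """Chains extract -> embed -> consensus; each stage is a plain callable."""
    extract: Callable[[Sequence[str]], list[str]]
    embed: Callable[[list[str]], np.ndarray]
    consensus: Callable[[np.ndarray], float]

    def transform(self, reviews: Sequence[str]) -> float:
        return self.consensus(self.embed(self.extract(reviews)))


toy_pipe = PipelineSketch(to_lower, count_embed, mean_pairwise_dot)
toy_consensus = toy_pipe.transform(["Sweet beer", "sweet SWEET malt", "Bitter hops"])
```

Keeping the stages as interchangeable callables is what allows different extractors, embedders and consensus functions to be swapped without touching the rest of the analysis.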

In [None]:
# Chosen extractor, embedder and consensus model
adjective_extractor = extractors.AdjectiveExtractor()
tfidf_embedder = embedders.TfidfEmbedder()
cosine_consensus = consensus.CosineSimilarity()

# Combine into pipeline
pipe = pipeline.TextAnalysis(adjective_extractor, tfidf_embedder, cosine_consensus)

In [None]:
# Overall consensus score
overall_consensus = pipe.transform(reviews.review.docs)

print(f"Overall Consensus: {overall_consensus}")

In [None]:
# Consensus within beer style
consensus_scores_per_beer_style = reviews.groupby(by=("beer", "style")).apply(lambda x: pipe.transform(x.review.docs))

# Consensus within brewery
consensus_scores_per_brewery = reviews.groupby(by=("brewery", "id")).apply(lambda x: pipe.transform(x.review.docs))

# Consensus within beer
consensus_scores_per_beer = reviews.groupby(by=("beer", "id")).apply(lambda x: pipe.transform(x.review.docs))

In [None]:
print(f"Consensus score per beer style: {consensus_scores_per_beer_style.mean()}")
print(f"Consensus score per brewery: {consensus_scores_per_brewery.mean()}")
print(f"Consensus score per beer: {consensus_scores_per_beer.mean()}")

While the consensus score does not change significantly, the consensus does increase for all 3 sub-groupings (style, brewery and beer). This is promising for confirming our hypothesis that language used differs between beer types.

## Conclusion

---

There are many more avenues which we can explore to answer our research questions, including groupings by other variables, interpretability of the language use (through TFIDF) and combination with an additional dataset of critics' reviews.

In this notebook we have outlined our entire data processing pipeline for our project. We have shown that it is both theoretically and computationally feasible, and that the results are promising. 


