# Unlocking the Palate - Evaluating Taste Consensus Among Beer Reviewers

---

Group [**BlackAda**](https://en.wikipedia.org/wiki/Blackadder)

> - Ludek Cizinsky ([ludek.cizinsky@epfl.ch](mailto:ludek.cizinsky@epfl.ch))
> - Peter Nutter ([peter.nutter@epfl.ch](mailto:peter.nutter@epfl.ch))
> - Pierre Lardet ([pierre.lardet@epfl.ch](mailto:pierre.lardet@epfl.ch))
> - Christian Bastin ([christian.bastin@epfl.ch](mailto:christian.bastin@epfl.ch))
> - Mika Senghaas ([mika.senghaas@epfl.ch](mailto:mika.senghaas@epfl.ch))

## Introduction

---

Navigating the world of beer reviews can be daunting for non-experts. Beer aficionados often describe brews with nuanced flavors such as "grassy notes" and "biscuity/crackery malt" with hints of "hay." But do these descriptions reflect the actual tasting experience? Following a "wisdom-of-the-crowd" approach, we consider a descriptor meaningful if many independent reviewers use similar descriptors for the same beer's taste. To quantify this consensus, we use natural language processing techniques to extract taste descriptors from each review, represent them numerically, and compute similarity (consensus) scores between reviews. These consensus scores reveal whether beer geeks share an understanding of taste.
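
To make the notion of a consensus score concrete, here is a minimal sketch in plain Python. It represents each review as a bag of descriptor counts and averages the cosine similarity over all pairs of reviews. The function names `cosine` and `consensus`, the toy descriptors, and the choice to average over pairs are assumptions made for illustration; they are not taken from the project's `src` modules.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty review shares nothing with any other review
    return dot / (norm_a * norm_b)


def consensus(descriptor_lists: list[list[str]]) -> float:
    """Mean cosine similarity over all pairs of reviews (assumed aggregation)."""
    vectors = [Counter(descriptors) for descriptors in descriptor_lists]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # Fewer than two reviews: no pair to compare
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical descriptors extracted from three reviews of the same beer
reviews_toy = [
    ["grassy", "biscuity", "malt"],
    ["grassy", "malt", "hay"],
    ["chocolate", "roasted"],
]
score = consensus(reviews_toy)
```

The first two toy reviews share two of their three descriptors, while the third shares none, so the score lands well below 1. A score near 1 would indicate that reviewers use nearly the same vocabulary.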

## Dependencies

---

We load the dependencies required for this project to run.

In [1]:
# Enable continuous module reloading
%load_ext autoreload
%autoreload 2 
%aimport src

# Standard library
import os

# Custom modules
from src import utils

And set some global variables.

In [2]:
# URL for the full dataset
DATA_URL = "https://drive.google.com/u/2/uc?id=1IqcAJtYrDB1j40rBY5M-PGp6KNX-E3xq&export=download"

# Number of samples to use for the subset
NUM_SUBSET_SAMPLES = 100

# Paths
ROOT_DIR = os.getcwd()
DATA_DIR = os.path.join(ROOT_DIR, "data")

# Ensure data directory exists
os.makedirs(DATA_DIR, exist_ok=True)

## Data

---

We will be working with the beer review data from the [BeerAdvocate](https://www.beeradvocate.com/) platform. 


### Data Download

Due to its size (1.6 GB uncompressed), the dataset is not included in the repository and must be downloaded. The course staff provide the data via Google Drive. On the first run of this notebook, we download the compressed data file (~1.5 GB) from Google Drive and extract it to the `data` folder.

After extracting the archive and removing unnecessary files (archives, ratings file, ...), the data folder should contain the following files: `beers.csv`, `breweries.csv`, `users.csv`, `reviews.txt`. The total size of the data is ~2.9 GB.

*NB: Downloading and extracting the data takes **~8min** on the first run. Subsequent runs of this cell are instant.*

In [31]:
# Download the BeerAdvocate dataset if it doesn't exist
if not os.path.exists(os.path.join(DATA_DIR, "reviews.txt")):
    utils.download_data(DATA_URL, data_dir=DATA_DIR)
print(f"Beer reviews downloaded to {DATA_DIR} ✅.")

Beer reviews downloaded to /Users/peter/Developer/ada-2023-project-blackada/data ✅.
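
Since the download step should leave exactly four files behind, a quick sanity check can catch a partial or failed extraction. The sketch below is one possible way to do this; `EXPECTED_FILES` and `missing_data_files` are hypothetical names that are not part of `src.utils`.

```python
import os

# File names taken from the description of the data folder above
EXPECTED_FILES = ("beers.csv", "breweries.csv", "users.csv", "reviews.txt")


def missing_data_files(data_dir: str) -> list[str]:
    """Return the expected data files that are absent from data_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(data_dir, name))
    ]
```

Calling `missing_data_files(DATA_DIR)` after the download cell should return an empty list. A non-empty result names the files that still need to be fetched.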


### Data Loading

Next, we load the data into a Pandas DataFrame. On the first run, we read all reviews from the `reviews.txt` file and enrich them with metadata from the other files. We then save the DataFrame to a `.feather` file for faster loading in the future. On subsequent runs, we load the DataFrame from the `.feather` file if it exists.

*NB: Running this cell for the first time reads in all ~2.6M reviews, which takes **~7min**. Subsequent runs should be much faster, taking **~1min**.*

In [3]:
# Load all reviews and a small subset of reviews (NUM_SUBSET_SAMPLES = 100)
reviews = utils.load_data(DATA_DIR)
sub_reviews = utils.load_data(DATA_DIR, num_samples=NUM_SUBSET_SAMPLES)

print(f"Loaded {len(reviews)} reviews ✅. (+{len(sub_reviews)} reviews in subset)")

Loaded 2589586 reviews ✅. (+100 reviews in subset)
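
The caching described above follows a common build-once, reload-later pattern. The sketch below illustrates it with `pickle` from the standard library as a stand-in for the `.feather` format, so that it runs without extra dependencies. `load_cached` and `build` are hypothetical names; this is not how `utils.load_data` is necessarily implemented.

```python
import os
import pickle
from typing import Any, Callable


def load_cached(cache_path: str, build: Callable[[], Any]) -> Any:
    """Return the cached object if present; otherwise build it and cache it."""
    if os.path.exists(cache_path):
        # Fast path: reuse the result of an earlier run
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Slow path: run the expensive step once, then persist the result
    obj = build()
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```

With a DataFrame, the two `pickle` calls could be swapped for `pandas.read_feather` and `DataFrame.to_feather`, which is presumably closer to what the notebook does.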


### EDA

Let's explore the data a bit. We will analyse:

- ...

In [33]:
# TODO

In [4]:
import numpy as np
import pandas as pd
import spacy
import torch

from src.consensus import CosineSimilarity
from src.extractors import *
from src.pipeline import TextAnalysis
from src.embedders import *

# spaCy model used to parse the review texts
nlp = spacy.load("en_core_web_sm")

# Fix the random seeds for reproducibility
SEED = 42
DEVICE = "mps"  # Use "cpu" if not on an Apple Silicon Mac
np.random.seed(SEED)
torch.manual_seed(SEED)

# Extractors
lemma_extractor = LemmaExtractor()
adjective_extractor = AdjectiveExtractor()
dummy_extractor = DummyExtractor()

# Embedders
count_embeddor = CountEmbeddor()
tfidf_embeddor = TfidfEmbeddor()
bert_embeddor = BertEmbeddor(device=DEVICE)
sentence_transformer_embeddor = SentenceTransformerEmbeddor(device=DEVICE)

# Consensus metric
cosine_metric = CosineSimilarity()

# Pipelines: extractor -> embedder -> consensus metric
lemma_and_count = TextAnalysis(lemma_extractor, count_embeddor, cosine_metric)
lemma_and_tfidf = TextAnalysis(lemma_extractor, tfidf_embeddor, cosine_metric)
adj_and_bert = TextAnalysis(adjective_extractor, bert_embeddor, cosine_metric)
adj_and_sen_tran = TextAnalysis(
    adjective_extractor, sentence_transformer_embeddor, cosine_metric
)
dummy_and_bert = TextAnalysis(dummy_extractor, bert_embeddor, cosine_metric)


def run_pipelines(docs, pipelines):
    """Run every pipeline on the same documents and collect the results by name."""
    results = {}
    for name, pipeline in pipelines.items():
        results[name] = pipeline.transform(docs)
    return results


pipelines = {
    "Lemma & Count": lemma_and_count,
    "Lemma & TFIDF": lemma_and_tfidf,
    "Adjective & BERT": adj_and_bert,
    "Adjective & SentenceTransformer": adj_and_sen_tran,
    "Dummy & BERT": dummy_and_bert,
}



In [5]:
most_popular_styles = reviews["beer"]["style"].value_counts()
most_popular_beers = reviews["beer"]["name"].value_counts()

In [6]:
# Consensus scores for 100 sampled reviews from each of the 10 most-reviewed styles
style_results = []
for style in most_popular_styles.index[:10]:
    sample = reviews[reviews["beer"]["style"] == style].sample(100, random_state=SEED)
    docs = sample["review"]["text"].apply(nlp)  # Parse each review with spaCy
    style_results.append(run_pipelines(docs, pipelines))
df_style_results = pd.DataFrame(style_results)
df_style_results

| | Lemma & Count | Lemma & TFIDF | Adjective & BERT | Adjective & SentenceTransformer | Dummy & BERT |
|---|---|---|---|---|---|
| 0 | 0.387268 | 0.15179 | 0.913955 | 0.441771 | 0.958266 |
| 1 | 0.439708 | 0.169459 | 0.925934 | 0.484051 | 0.958078 |
| 2 | 0.418018 | 0.161084 | 0.915397 | 0.476532 | 0.957206 |
| 3 | 0.402188 | 0.155264 | 0.916692 | 0.480332 | 0.953592 |
| 4 | 0.432146 | 0.170018 | 0.920941 | 0.526817 | 0.953939 |
| 5 | 0.424493 | 0.170841 | 0.916305 | 0.505684 | 0.942794 |
| 6 | 0.420441 | 0.155198 | 0.928033 | 0.503213 | 0.950007 |
| 7 | 0.406826 | 0.160338 | 0.9267 | 0.484078 | 0.955128 |
| 8 | 0.422228 | 0.149259 | 0.914928 | 0.425485 | 0.951311 |
| 9 | 0.402771 | 0.160068 | 0.924918 | 0.514777 | 0.94644 |


In [16]:
# Consensus scores for 100 sampled reviews from each of the 10 most-reviewed beers
beer_results = []
for beer in most_popular_beers.index[:10]:
    sample = reviews[reviews["beer"]["name"] == beer].sample(100, random_state=SEED)
    docs = sample["review"]["text"].apply(nlp)  # Parse each review with spaCy
    beer_results.append(run_pipelines(docs, pipelines))
df_beer_results = pd.DataFrame(beer_results)
df_beer_results

In [13]:
df_beer_results.mean(axis=0)

NameError: name 'df_beer_results' is not defined

In [14]:
df_style_results.mean(axis=0)

Lemma & Count                      0.415609
Lemma & TFIDF                      0.160332
Adjective & BERT                   0.920380
Adjective & SentenceTransformer    0.484274
Dummy & BERT                       0.952676
dtype: float64