# Unlocking the Palate - Evaluating Taste Consensus Among Beer Reviewers

---

Group [**BlackAda**](https://en.wikipedia.org/wiki/Blackadder)

> - Ludek Cizinsky ([ludek.cizinsky@epfl.ch](mailto:ludek.cizinsky@epfl.ch))
> - Peter Nutter ([peter.nutter@epfl.ch](mailto:peter.nutter@epfl.ch))
> - Pierre Lardet ([pierre.lardet@epfl.ch](mailto:pierre.lardet@epfl.ch))
> - Christian Bastin ([christian.bastin@epfl.ch](mailto:christian.bastin@epfl.ch))
> - Mika Senghaas ([mika.senghaas@epfl.ch](mailto:mika.senghaas@epfl.ch))

## Introduction

---

Navigating the world of beer reviews can be daunting for non-experts. Beer aficionados often describe brews in terms of nuanced flavors such as "grassy notes" and "biscuity/crackery malt" with hints of "hay". But do these descriptions reflect the actual tasting experience? Following a "wisdom-of-the-crowd" approach, we consider a descriptor meaningful if many independent reviewers use similar descriptors for the same beer's taste. To quantify this consensus, we use natural language processing techniques to extract taste descriptors from reviews and represent them numerically, which lets us compute similarity (consensus) scores between reviews. These scores will reveal whether beer geeks share a common understanding of taste.
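
To make the idea of a consensus score concrete, here is a minimal sketch. It is not the method used later in this project: it assumes a plain bag-of-words representation and mean pairwise cosine similarity, and the names `bag_of_words`, `cosine`, and `consensus_score` are hypothetical helpers introduced only for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus_score(reviews):
    """Mean pairwise cosine similarity over the reviews of one beer."""
    vectors = [bag_of_words(r) for r in reviews]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two reviews: no consensus can be measured
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Illustrative (made-up) review snippets, not taken from the dataset
agree = ["grassy notes with hay", "grassy, hay and biscuity malt"]
disagree = ["grassy notes with hay", "roasted coffee and chocolate"]
print(consensus_score(agree) > consensus_score(disagree))  # True
```

A raw word-count representation ignores synonyms ("biscuity" vs. "crackery"), which is one reason a richer numerical representation of descriptors may be preferable in practice.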

## Dependencies

---

We load the dependencies required for this project to run.

In [None]:
# Enable continuous module reloading
%load_ext autoreload
%autoreload 1
%aimport src

# Standard library
import os

# Custom modules
from src import utils

We also set some global variables.

In [None]:
# URL for the full dataset
DATA_URL = "https://drive.google.com/u/2/uc?id=1IqcAJtYrDB1j40rBY5M-PGp6KNX-E3xq&export=download"

# Number of samples to use for the subset
NUM_SUBSET_SAMPLES = 10000

# Paths
ROOT_DIR = os.getcwd()
DATA_DIR = os.path.join(ROOT_DIR, "data")

# Ensure data directory exists
os.makedirs(DATA_DIR, exist_ok=True)

## Data

---

We will be working with the beer review data from the [BeerAdvocate](https://www.beeradvocate.com/) platform. 


### Data Download

Due to its size (uncompressed 1.6 GB), the dataset is not included in the repository but must be downloaded. The course staff has provided the data via Google Drive. On the first run of this notebook, we download the compressed data file from Google Drive and extract it to the `data` folder. The compressed file is ~1.5 GB in size. 

After extraction and removal of unnecessary files (archives, ratings file, ...), the data folder should contain the following files: `beers.csv`, `breweries.csv`, `users.csv`, `reviews.txt`. The total size of the data is ~2.9 GB.

*NB: Downloading the data takes **~8min** on the first run. Subsequent runs of this cell are instant.*

In [None]:
# Download the BeerAdvocate dataset if it doesn't exist
if not os.path.exists(os.path.join(DATA_DIR, "reviews.txt")):
    utils.download_data(DATA_URL, data_dir=DATA_DIR)
print(f"Beer reviews downloaded to {DATA_DIR} ✅.")
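
The internals of `utils.download_data` are not shown in this notebook. As a rough sketch of the download, extract, and clean-up steps described above, a helper could look like the following. The function `fetch_and_extract`, the archive name, and the assumption that the archive is a `.tar.gz` file are all hypothetical. Google Drive may also interpose a confirmation page for large files, which this sketch does not handle.

```python
import os
import shutil
import urllib.request

# Files we expect to keep after extraction (as listed in the text above)
KEEP_FILES = {"beers.csv", "breweries.csv", "users.csv", "reviews.txt"}


def fetch_and_extract(url, data_dir, archive_name="archive.tar.gz", keep=KEEP_FILES):
    """Download an archive, unpack it into data_dir, and delete other files."""
    os.makedirs(data_dir, exist_ok=True)
    archive_path = os.path.join(data_dir, archive_name)

    # Download the archive (assumption: the URL serves the file directly)
    urllib.request.urlretrieve(url, archive_path)

    # Unpack; the format is inferred from the (assumed) .tar.gz extension
    shutil.unpack_archive(archive_path, data_dir)

    # Remove every top-level file that is not in the keep set (incl. the archive)
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path) and name not in keep:
            os.remove(path)

    return sorted(os.listdir(data_dir))


# Hypothetical usage with the globals defined earlier:
# fetch_and_extract(DATA_URL, DATA_DIR)
```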

### Data Loading

Next, we load the data into a Pandas DataFrame. On the first run, we read all reviews from the `reviews.txt` file and enrich them with additional metadata from the other files. We then save the DataFrame to a `.feather` file for faster loading. On subsequent runs, we load the DataFrame directly from the `.feather` file if it exists.

*NB: Running this cell for the first time reads in all `2.5M` reviews, which takes **~7min**. Subsequent runs should be much faster, taking **~1min**.*

In [None]:
# Load all reviews and a subset of reviews (10,000)
reviews = utils.load_data(DATA_DIR)
sub_reviews = utils.load_data(DATA_DIR, num_samples=NUM_SUBSET_SAMPLES)

print(f"Loaded {len(reviews)} reviews ✅. (+{len(sub_reviews)} reviews in subset)")
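
How `utils.load_data` reads `reviews.txt` is not shown here. The sketch below illustrates one way such a file could be parsed, under the assumption that each review is a block of `key: value` lines and that blocks are separated by blank lines. The function `parse_reviews` and the field names in the example are hypothetical.

```python
def parse_reviews(lines):
    """Yield one dict per review from an iterable of 'key: value' lines.

    Assumption: reviews are separated by blank lines.
    """
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current review, if any
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so values may contain colons
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    # Emit the last review if the input does not end with a blank line
    if record:
        yield record


# Made-up sample with hypothetical field names
sample = [
    "beer_name: Foo Lager",
    "rating: 4.5",
    "text: Nose: grassy with hints of hay",
    "",
    "beer_name: Bar Stout",
    "rating: 3.0",
]
for review in parse_reviews(sample):
    print(review["beer_name"], review["rating"])
```

Because the function is a generator, it can be fed an open file handle directly and never needs to hold the whole file in memory at once.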

### EDA

Let's explore the data a bit. We will analyse:

- ...

In [None]:
# TODO