# Data Analysis and Cleaning of MatchedBeer Dataset

The MatchedBeer dataset combines data from BeerAdvocate and RateBeer, two of the most prominent online platforms for beer ratings. This combined dataset was created to address the herding effect in online ratings, where early ratings can skew subsequent user opinions. By merging the two platforms, the MatchedBeer dataset allows for the study of the same beers rated independently on each platform, thereby providing a unique opportunity to analyze unbiased rating patterns.

## Matching Process Overview
The matching process is described in detail in the paper ["When Sheep Shop: Measuring Herding Effects in Product Ratings with Natural Experiments"](https://dlab.epfl.ch/people/west/pub/Lederrey-West_WWW-18.pdf), focusing on accurately aligning beers from both websites. The matching is performed in two phases to prioritize precision:

1. **Brewery Alignment**: Breweries across BeerAdvocate and RateBeer are aligned based on their exact location (state or country), with names represented as TF-IDF vectors to compute cosine similarity.
2. **Beer Alignment**: Within matched breweries, beers are aligned based on name similarity and require an exact match in alcohol by volume (ABV). Before matching, any shared tokens in brewery and beer names are removed to improve accuracy.

The algorithm favors precision: a match is accepted only if its cosine similarity exceeds a strict threshold and there is a significant gap between the best and second-best candidates. The reported precision is 99.6% for brewery matching and 100% for beer matching, so only highly confident matches are included, at the cost of excluding some potential matches.
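
The acceptance rule (a similarity threshold plus a gap to the runner-up) can be sketched in plain Python. This is an illustration only, not the paper's implementation: the word-level tokenization, the smoothed IDF, the helper names `tfidf_vectors`, `cosine` and `match`, and the threshold values 0.8 and 0.3 are all assumptions, and the location check and shared-token removal are left out.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Build one sparse TF-IDF vector (dict: token -> weight) per name."""
    docs = [Counter(name.lower().split()) for name in names]
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in doc)
    # Smoothed IDF (an assumption) so very common tokens keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match(names_a, names_b, min_sim=0.8, min_gap=0.3):
    """Return {name_a: (name_b, sim, gap)}, keeping confident matches only."""
    vectors = tfidf_vectors(names_a + names_b)  # shared vocabulary and IDF
    vecs_a, vecs_b = vectors[:len(names_a)], vectors[len(names_a):]
    matches = {}
    for name_a, vec_a in zip(names_a, vecs_a):
        scored = sorted(
            ((cosine(vec_a, vec_b), name_b) for vec_b, name_b in zip(vecs_b, names_b)),
            reverse=True,
        )
        best_sim, best_name = scored[0]
        runner_up = scored[1][0] if len(scored) > 1 else 0.0
        gap = best_sim - runner_up
        # Accept only if the best candidate is both similar and unambiguous.
        if best_sim >= min_sim and gap >= min_gap:
            matches[name_a] = (best_name, best_sim, gap)
    return matches


# Toy brewery names; "Blue Clay Brewing Company" is a made-up distractor.
ba = ["Red Clay Brewing Company", "Leikam Brewing"]
rb = ["Leikam Brewing", "Red Clay Brewing Company", "Blue Clay Brewing Company"]
result = match(ba, rb)
```

Raising `min_gap` makes the rule stricter: a name with a close runner-up is dropped rather than risk a wrong match, which is the trade-off between precision and coverage described above.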

## Structure of MatchedBeer Dataset
The dataset contains essential information on beers, breweries, users, and user ratings from both platforms. Since the user bases of BeerAdvocate and RateBeer differ geographically, the combined dataset reflects a more balanced view. BeerAdvocate is more U.S.-centric, while RateBeer has a more internationally diverse user base. Despite these differences, the matched dataset has been verified for both internal and external validity, making it well-suited for analyzing the influence of initial ratings on user perceptions across platforms.

MatchedBeer provides insights into how early ratings impact subsequent ones by comparing ratings for the same beer on both sites, especially when early ratings on each platform differ significantly. By standardizing and aligning ratings across time and platforms, MatchedBeer reveals patterns that allow for the identification of herding effects, helping researchers understand the biases in user-generated ratings.

In [1]:
# Import all the necessary libraries
import polars as pl
import tqdm
import os

# Import the file in the utils folder
import sys
sys.path.append('../../utils')
from data_desc import describe_numbers, describe
from matched_dataset import load_and_process_files

# Define the paths
SRC_DATA_PATH = '../../../data/MatchedBeer'
DST_DATA_PATH = '../../../data/MatchedBeer/processed'

# Create the DST_DATA_PATH if it does not exist
os.makedirs(DST_DATA_PATH, exist_ok=True)

#### Data Exploration

In [2]:
# Build the paths of the four source files
file_names = ["beers.csv", "breweries.csv", "ratings.csv", "users_approx.csv"]
file_paths = [os.path.join(SRC_DATA_PATH, name) for name in file_names]
dataframes_list = load_and_process_files(file_paths)

df_beers = dataframes_list[0]
df_breweries = dataframes_list[1]
df_ratings = dataframes_list[2]
df_users = dataframes_list[3]

##### Beers dataset
We'll start by looking at the beers dataset.

In [3]:
df_beers.sample(5)

abv_ba,avg_ba,avg_computed_ba,avg_matched_valid_ratings_ba,ba_score_ba,beer_id_ba,beer_name_ba,beer_wout_brewery_name_ba,brewery_id_ba,brewery_name_ba,bros_score_ba,nbr_matched_valid_ratings_ba,nbr_ratings_ba,nbr_reviews_ba,style_ba,zscore_ba,abv_rb,avg_rb,avg_computed_rb,avg_matched_valid_ratings_rb,beer_id_rb,beer_name_rb,beer_wout_brewery_name_rb,brewery_id_rb,brewery_name_rb,nbr_matched_valid_ratings_rb,nbr_ratings_rb,overall_score_rb,style_rb,style_score_rb,zscore_rb,diff_scores,sim_scores
f64,f64,f64,f64,f64,i64,str,str,i64,str,f64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,str,i64,str,i64,i64,f64,str,f64,f64,f64,f64
7.7,,,,,243668,"""Heel""","""Heel""",43890,"""Pinellas Ale Works""",,0,0,0,"""American Brown Ale""",,7.7,,,,435328,"""Pinellas Heel""","""Heel""",25774,"""Pinellas Ale Works""",0,0,,"""Brown Ale""",,,1.0,1.0
4.9,3.62,3.5575,3.856667,,180967,"""Session IPA""","""Session IPA""",40535,"""Duck Foot Brewing Co.""",,3,4,3,"""American IPA""",-0.911524,4.9,3.16,3.285714,3.285714,347019,"""Duck Foot Session IPA""","""Session IPA""",22965,"""Duck Foot Brewing Company""",7,7,,"""Session IPA""",,-0.255436,0.723186,1.0
7.2,4.01,4.01,4.06,,253776,"""Citrus Tsunami""","""Tsunami Citrus""",47387,"""Parkersburg Brewing Company""",,2,3,1,"""American IPA""",0.035585,7.2,3.36,3.9,3.9,447459,"""Parkersburg Citrus Tsunami""","""Tsunami Citrus""",28555,"""Parkersburg Brewing Company""",3,3,,"""India Pale Ale (IPA)""",,0.773624,1.0,1.0
5.7,3.99,4.01,3.97,,213395,"""Super Clinic""","""Clinic Super""",24056,"""Melvin Brewing / Thai Me Up""",,3,6,3,"""American IPA""",0.062831,5.7,3.36,3.9,3.9,394670,"""Melvin Super Clinic IPA""","""Clinic Super IPA""",12066,"""Melvin Brewing Company &#40;Th…",3,3,,"""India Pale Ale (IPA)""",,0.820444,0.442664,0.903038
4.8,3.25,3.25,,,141762,"""Rapperswiler Lager""","""Rapperswiler Lager""",31982,"""Bier Factory Rapperswil AG""",,0,1,0,"""Euro Pale Lager""",-1.12621,4.8,3.21,3.3375,3.3375,192379,"""Bier Factory Rapperswil Rapper…","""Rapperswiler Lager""",5192,"""Bier Factory Rapperswil""",8,8,,"""Premium Lager""",,-0.183587,1.0,1.0


This dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for each beer in BA/RB | 
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `beer_wout_brewery_name_ba` / `beer_wout_brewery_name_rb` | Beer name without brewery name in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings in BA/RB |
| `nbr_reviews_ba` | Number of reviews in BA |
| `avg_ba` / `avg_rb` | Average rating in BA/RB |
| `avg_computed_ba` / `avg_computed_rb` | Computed average rating in BA/RB |
| `zscore_ba` / `zscore_rb` | Standardized score in BA/RB |
| `nbr_matched_valid_ratings_ba` / `nbr_matched_valid_ratings_rb` | Number of matched valid ratings in BA/RB |
| `avg_matched_valid_ratings_ba` / `avg_matched_valid_ratings_rb` | Average of matched valid ratings in BA/RB |
| `ba_score_ba` | BeerAdvocate score |
| `bros_score_ba` | Bros score in BA |
| `overall_score_rb` | Overall score in RB |
| `style_score_rb` | Style-specific score in RB |
| `diff_scores` | Matching score: gap between the best and second-best name similarity for the match |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB names |

Now that we know the structure of the dataset, let's check how complete it is and how many entries are missing.

In [4]:
describe(df_beers)

+------------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|------------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| abv_ba                       | Float64 |            45640 |             0 |    0.00 % |                   351 |            0.77 % |
| avg_ba                       | Float64 |            40285 |          5355 |   11.73 % |                   373 |            0.93 % |
| avg_computed_ba              | Float64 |            40285 |          5355 |   11.73 % |                 12090 |           30.01 % |
| avg_matched_valid_ratings_ba | Float64 |            28272 |         17368 |   38.05 % |                  6658 |           23.55 % |
| ba_score_ba                  | Float64 |            10499 | 

In [5]:
describe_numbers(df_beers, filters =["beer_id_ba", "brewery_id_ba", "beer_id_rb", "brewery_id_rb"])

+------------------------------+-----------+-----------+-----------+------------+-----------+----------+---------+
| Column                       |      Mean |       Std |       25% |        50% |       75% |      Min |     Max |
|------------------------------+-----------+-----------+-----------+------------+-----------+----------+---------|
| abv_ba                       |   6.32086 |   1.85773 |         5 |          6 |       7.2 |     0.38 |    67.5 |
| avg_ba                       |   3.73237 |  0.439988 |      3.52 |       3.78 |         4 |        1 |       5 |
| avg_computed_ba              |   3.72785 |  0.424824 |      3.52 |     3.7725 |         4 |        1 |       5 |
| avg_matched_valid_ratings_ba |   3.74777 |  0.484775 |      3.52 |    3.80833 |     4.045 |        1 |       5 |
| ba_score_ba                  |    84.871 |   3.28287 |        83 |         85 |        86 |       63 |     100 |
| bros_score_ba                |   86.1623 |    8.5394 |        83 |         88 

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns show identical distributions across BeerAdvocate and RateBeer, with a mean ABV of approximately 6.32%.
   - This is expected, since beer matching requires an exact ABV match. The maximum ABV of 67.5% is an outlier, likely a rare extreme beer, as typical ABVs are much lower.

2. Average Ratings (BeerAdvocate vs. RateBeer)
   - The `avg_ba` (BeerAdvocate) mean is 3.73, while `avg_rb` (RateBeer) is significantly lower at 3.14.
   - This aligns with previous findings that BeerAdvocate users tend to rate beers more generously than RateBeer users. The higher `std` value for `avg_ba` also indicates a wider range in BeerAdvocate’s ratings, suggesting more variability in user opinions.

3. Computed Averages (Consistency in Ratings)
   - `avg_computed_ba` and `avg_computed_rb` refer to recalculated averages, showing slightly lower standard deviations than their direct counterparts (`avg_ba` and `avg_rb`), indicating a slight smoothing effect.
   - Both BeerAdvocate and RateBeer have computed averages (`avg_computed_ba` vs. `avg_computed_rb`) that are fairly close to the respective raw averages, implying that the original averages align well with user trends.

4. Rating Counts
   - The median `nbr_ratings_ba` (3) and `nbr_ratings_rb` (5) are both low, showing that many beers have few ratings on either site, though the typical beer has slightly more ratings on RateBeer.
   - The count of matched valid ratings is also more dispersed on RateBeer (standard deviation of 80.10 for `nbr_matched_valid_ratings_rb` vs. 42.51 for BeerAdvocate), meaning a few beers attract far more ratings than the rest on RateBeer.

5. Z-scores (Standardized Ratings)
   - The mean `zscore_ba` (-0.41) is lower than the mean `zscore_rb` (-0.10). Assuming the z-scores are standardized against each platform's full catalogue, the negative means indicate that the matched beers are rated somewhat below their platform's average, and more so on BeerAdvocate.
   - BeerAdvocate's z-scores also exhibit a broader spread, with a standard deviation of 0.81 compared to RateBeer's 0.73, which is consistent with more dispersed ratings on BeerAdvocate.

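6. Matching Scores (`diff_scores` and `sim_scores`)
   - `sim_scores` has a high mean (0.986). Judging by the sample rows, where identical names score 1.0, it measures the cosine similarity between the matched names rather than any agreement between ratings, so the high value reflects the strict matching threshold.
   - `diff_scores` has a mean of 0.798. It appears to be the gap between the best and second-best similarity for each match, so a large value means the chosen match was clearly separated from the alternatives.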
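
The z-score item is easier to read with a toy example. The sketch below assumes a z-score is `(value - mean) / std` computed against a reference population; how the dataset actually derives `zscore_ba` and `zscore_rb` is not stated here, and the helper `zscores` and all numbers are hypothetical.

```python
import statistics


def zscores(values, reference):
    """Standardize `values` against the mean and std of `reference`."""
    mean = statistics.mean(reference)
    std = statistics.pstdev(reference)
    return [(v - mean) / std for v in values]


# Toy data: averages of all beers on a platform, and of a matched subset.
platform_avgs = [2.5, 3.0, 3.5, 4.0, 4.5]
matched_avgs = [3.0, 3.5]

matched_z = zscores(matched_avgs, platform_avgs)
mean_matched_z = statistics.mean(matched_z)
```

Over the whole reference population the mean z-score is zero by construction, so a negative mean for a subset simply says that the subset sits below the reference average.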

##### Breweries dataset

Now we can take a look at the breweries dataframe. For each platform, every brewery has a unique identifier, a name, a location, and the number of beers it produces. We also have the `sim_scores` and `diff_scores` of each brewery match. The brewery location can be very useful for our research questions; we will look into it later in this notebook.

In [6]:
df_breweries.sample(5)

id_ba,location_ba,name_ba,nbr_beers_ba,id_rb,location_rb,name_rb,nbr_beers_rb,diff_scores,sim_scores
i64,str,str,i64,i64,str,str,i64,f64,f64
13444,"""Denmark""","""Heimdal-Bryg 2""",0,5786,"""Denmark""","""Heimdal-Bryg""",11,0.345058,1.0
14294,"""England""","""Newby Wyke Brewery""",5,3192,"""England""","""Newby Wyke""",77,0.764933,0.97721
42076,"""United States, Alabama""","""Red Clay Brewing Company""",8,25762,"""United States, Alabama""","""Red Clay Brewing Company""",8,0.386143,1.0
42578,"""United States, Oregon""","""Leikam Brewing""",8,24989,"""United States, Oregon""","""Leikam Brewing""",8,0.770562,1.0
2696,"""United States, Colorado""","""Jarre Creek Ranch Brewery""",4,1690,"""United States, Colorado""","""Jarre Creek Ranch Brewing""",9,0.553282,0.96798


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `id_ba` / `id_rb` | Unique identifier for each brewery in BA/RB | 
| `name_ba` / `name_rb` | Name of the brewery in BA/RB |
| `location_ba` / `location_rb` | Geographic location of the brewery in BA/RB |
| `nbr_beers_ba` / `nbr_beers_rb` | Number of beers produced by the brewery in BA/RB |
| `diff_scores` | Matching score: gap between the best and second-best name similarity for the match |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB brewery names |

In [7]:
describe(df_breweries)

+--------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| id_ba        | Int64   |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| location_ba  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_ba      | String  |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| nbr_beers_ba | Int64   |             8281 |             0 |    0.00 % |                   231 |            2.79 % |
| id_rb        | Int64   |             8281 |             0 |    0.00 % |                  8235 |           99.44 % |
| location_rb  | String  |             8281 |           

In [8]:
describe(df_breweries, filters=["id_ba", "id_rb"])

+--------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| location_ba  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_ba      | String  |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| nbr_beers_ba | Int64   |             8281 |             0 |    0.00 % |                   231 |            2.79 % |
| location_rb  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_rb      | String  |             8281 |             0 |    0.00 % |                  8235 |           99.44 % |
| nbr_beers_rb | Int64   |             8281 |           

For our breweries on both platforms, we have no missing values (in this matched data).

##### Users dataset

**Note**: The `users_approx` data is essentially the same as the exact-match users data, except that users are matched with a TF-IDF vectorizer and cosine similarity rather than by exact username. Two accounts are matched if their usernames are close enough and their locations are the same.

For each of our users (for each platform), we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [9]:
df_users.sample(5)

joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,user_id_ba,user_name_ba,user_name_lower_ba,joined_rb,location_rb,nbr_ratings_rb,user_id_rb,user_name_rb,user_name_lower_rb,sim_scores
f64,str,i64,i64,str,str,str,f64,str,i64,i64,str,str,f64
1252100000.0,"""United States, Texas""",309,0,"""frogfan.366893""","""frogfan""","""frogfan""",1004900000.0,"""United States, Texas""",9,2133,"""frogfan99""","""frogfan99""",0.866025
1490200000.0,"""Spain""",1,0,"""ramonet.1194922""","""Ramonet""","""ramonet""",1365000000.0,"""Spain""",2,252277,"""Ramonetti""","""ramonetti""",0.866025
1300000000.0,"""United States, North Carolina""",84,84,"""raszputini.579433""","""raszputini""","""raszputini""",1289300000.0,"""United States, North Carolina""",58,116932,"""raszputini""","""raszputini""",1.0
1120100000.0,"""United States, Oregon""",7,7,"""kevinphish.26808""","""kevinphish""","""kevinphish""",1139800000.0,"""United States, Oregon""",23,33493,"""kevinphish""","""kevinphish""",1.0
1393000000.0,"""United States, Texas""",1920,77,"""domvan.783801""","""Domvan""","""domvan""",1422400000.0,"""United States, Texas""",1,355050,"""domvan""","""domvan""",1.0


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `user_id_ba` / `user_id_rb` | Unique identifier for each user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `user_name_lower_ba` / `user_name_lower_rb` | Lowercase username in BA/RB |
| `joined_ba` / `joined_rb` | Unix timestamp (in seconds) of when the user joined BA/RB |
| `location_ba` / `location_rb` | Geographic location of the user in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings submitted by the user in BA/RB |
| `nbr_reviews_ba` | Number of reviews submitted by the user in BA |
| `sim_scores` | Cosine similarity between the matched BA and RB usernames |

In [10]:
describe(df_users)

+--------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column             | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| joined_ba          | Float64 |             3341 |             0 |    0.00 % |                  2352 |           70.40 % |
| location_ba        | String  |             3341 |             0 |    0.00 % |                   111 |            3.32 % |
| nbr_ratings_ba     | Int64   |             3341 |             0 |    0.00 % |                   660 |           19.75 % |
| nbr_reviews_ba     | Int64   |             3341 |             0 |    0.00 % |                   475 |           14.22 % |
| user_id_ba         | String  |             3341 |             0 |    0.00 % |                  3316 |           99.25 % |
| user_n

In [11]:
describe_numbers(df_users, filters=["user_id_rb", "joined_ba", "joined_rb"])

+----------------+----------+-----------+-------+-------+-------+----------+-------+
| Column         |     Mean |       Std |   25% |   50% |   75% |      Min |   Max |
|----------------+----------+-----------+-------+-------+-------+----------+-------|
| nbr_ratings_ba |  211.612 |   685.867 |     2 |    13 |   106 |        1 | 12046 |
| nbr_reviews_ba |  118.232 |   454.174 |     1 |     3 |    29 |        0 |  7593 |
| nbr_ratings_rb |  270.923 |   1143.09 |     2 |     5 |    38 |        1 | 20678 |
| sim_scores     | 0.986855 | 0.0425066 |     1 |     1 |     1 | 0.800641 |     1 |
+----------------+----------+-----------+-------+-------+-------+----------+-------+


To make the `joined` columns more readable, we convert them from Unix timestamps (in seconds) to datetimes.

In [12]:
# Seconds -> milliseconds, then reinterpret as a millisecond-precision datetime
df_users = df_users.with_columns(
    (pl.col(["joined_ba", "joined_rb"]).cast(pl.Int64) * 1000).cast(pl.Datetime("ms"))
)
df_users.sample(5)

joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,user_id_ba,user_name_ba,user_name_lower_ba,joined_rb,location_rb,nbr_ratings_rb,user_id_rb,user_name_rb,user_name_lower_rb,sim_scores
datetime[ms],str,i64,i64,str,str,str,datetime[ms],str,i64,i64,str,str,f64
2013-08-01 10:00:00,"""United States, Oregon""",73,2,"""hoppytoday.745949""","""HoppyToday""","""hoppytoday""",2010-07-10 10:00:00,"""United States, Oregon""",5,109636,"""HoppyToday""","""hoppytoday""",1.0
2013-07-01 10:00:00,"""England""",11,11,"""edking.739491""","""EdKing""","""edking""",2011-02-25 11:00:00,"""England""",2213,124233,"""EdKing""","""edking""",1.0
2010-05-10 10:00:00,"""United States, Florida""",32,32,"""konadrinker.457918""","""konadrinker""","""konadrinker""",2010-05-15 10:00:00,"""United States, Florida""",31,106412,"""Konadrinker""","""konadrinker""",1.0
2013-05-11 10:00:00,"""United States, Michigan""",1,1,"""darb85.732106""","""darb85""","""darb85""",2013-11-21 11:00:00,"""United States, Michigan""",207,289453,"""darb85""","""darb85""",1.0
2008-12-08 11:00:00,"""United States, New York""",10,10,"""sonicstylee.274700""","""sonicstylee""","""sonicstylee""",2008-09-18 10:00:00,"""United States, New York""",11,81776,"""sonicstylee""","""sonicstylee""",1.0


##### Ratings

In [13]:
df_ratings.sample(5)

abv_ba,appearance_ba,aroma_ba,beer_id_ba,beer_name_ba,brewery_id_ba,brewery_name_ba,date_ba,overall_ba,palate_ba,rating_ba,review_ba,style_ba,taste_ba,text_ba,user_id_ba,user_name_ba,abv_rb,appearance_rb,aroma_rb,beer_id_rb,beer_name_rb,brewery_id_rb,brewery_name_rb,date_rb,overall_rb,palate_rb,rating_rb,style_rb,taste_rb,text_rb,user_id_rb,user_name_rb
f64,f64,f64,i64,str,i64,str,i64,f64,f64,f64,bool,str,f64,str,str,str,f64,f64,f64,i64,str,i64,str,i64,f64,f64,f64,str,f64,str,i64,str
6.0,4.0,4.0,212102,"""Roses Are Brett""",24299,"""To Øl""",1493460000,4.5,4.0,4.1,True,"""Saison / Farmhouse Ale""",4.0,"""330 ml bottle into tulip glass…","""superspak.456300""","""superspak""",6.0,4.0,8.0,421055,"""To Øl Roses are Brett""",12119,"""To Øl""",1492941600,17.0,4.0,4.1,"""Saison""",8.0,"""330 ml bottle into tulip glass…",105791,"""superspak"""
6.3,4.0,3.75,131281,"""Star Baby IPA""",35337,"""Brenner Brewing Co.""",1437559200,3.75,3.75,3.87,True,"""American IPA""",4.0,"""This pours a dark amber color …","""lemke10.493944""","""Lemke10""",6.3,4.0,8.0,271366,"""Brenner Star Baby IPA""",19943,"""Brenner Brewing Company""",1437472800,14.0,4.0,3.7,"""India Pale Ale (IPA)""",7.0,"""This pours a dark amber color …",147999,"""Lemke10"""
6.2,3.5,3.5,52475,"""Tule Duck Red Ale""",16851,"""Buckbean Brewing Company""",1326366000,3.5,3.0,3.45,True,"""American Amber / Red Ale""",3.5,"""Received in trade from mdascha…","""johnnnniee.125029""","""johnnnniee""",6.2,3.0,6.0,86608,"""Buckbean Tule Duck Red Ale""",9472,"""Buckbean Brewing Company""",1326279600,12.0,3.0,3.0,"""Amber Ale""",6.0,"""Received in trade from mdascha…",60182,"""johnnnniee"""
6.1,4.0,4.5,22790,"""Blind Pig IPA""",863,"""Russian River Brewing Company""",1244628000,4.5,4.0,4.22,True,"""American IPA""",4.0,"""Bottle dated 020209, got as an…","""cakanator.195218""","""Cakanator""",6.1,4.0,9.0,48429,"""Russian River Blind Pig IPA""",1480,"""Russian River Brewing""",1266836400,18.0,4.0,4.4,"""India Pale Ale (IPA)""",9.0,"""Notes from last year.Bottle da…",70429,"""cakanator"""
5.3,3.5,4.0,8698,"""Hopyard IPA""",3353,"""Rockyard Brewing""",1301824800,3.5,2.5,3.52,True,"""American IPA""",3.5,"""Some citrus orange taste, but …","""sammy.3853""","""Sammy""",5.3,3.0,6.0,3641,"""Rockyard Hopyard IPA""",628,"""Rockyard Brewing Company""",1301738400,13.0,2.0,3.0,"""India Pale Ale (IPA)""",6.0,"""Some citrus orange taste, but …",11905,"""Sammy"""


In [14]:
describe(df_ratings)

+-----------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column          | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| abv_ba          | Float64 |            21964 |             0 |    0.00 % |                   187 |            0.85 % |
| appearance_ba   | Float64 |            19295 |          2669 |   12.15 % |                    18 |            0.09 % |
| aroma_ba        | Float64 |            19295 |          2669 |   12.15 % |                    18 |            0.09 % |
| beer_id_ba      | Int64   |            21964 |             0 |    0.00 % |                  9025 |           41.09 % |
| beer_name_ba    | String  |            21964 |             0 |    0.00 % |                  8719 |           39.70 % |
| brewery_id_ba   | Int64   |   

The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for the beer in BA/RB |
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `user_id_ba` / `user_id_rb` | Unique identifier for the user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `date_ba` / `date_rb` | Unix timestamp (in seconds) of the review in BA/RB |
| `rating_ba` / `rating_rb` | Final rating given by the user in BA/RB |
| `overall_ba` / `overall_rb` | Score for the overall aspect in BA/RB |
| `appearance_ba` / `appearance_rb` | Appearance score in BA/RB |
| `aroma_ba` / `aroma_rb` | Aroma score in BA/RB |
| `palate_ba` / `palate_rb` | Palate score in BA/RB |
| `taste_ba` / `taste_rb` | Taste score in BA/RB |
| `text_ba` / `text_rb` | Review text in BA/RB |
| `review_ba` | Whether or not there is a BA review available |

In [15]:
describe_numbers(df_ratings, filters=["beer_id_ba", "beer_id_rb", "brewery_id_ba", "brewery_id_rb", "user_id_ba", "user_id_rb"])

+---------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Column        |        Mean |         Std |         25% |         50% |         75% |         Min |         Max |
|---------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------|
| abv_ba        |     6.80022 |     2.09619 |         5.2 |         6.4 |           8 |         0.5 |          39 |
| appearance_ba |     3.78329 |    0.562042 |         3.5 |           4 |           4 |           1 |           5 |
| aroma_ba      |     3.70404 |    0.597862 |         3.5 |        3.75 |           4 |           1 |           5 |
| date_ba       | 1.34367e+09 | 1.12069e+08 | 1.27098e+09 | 1.36835e+09 | 1.43471e+09 | 9.84136e+08 | 1.50141e+09 |
| overall_ba    |     3.73419 |    0.614882 |         3.5 |           4 |           4 |           1 |           5 |
| palate_ba     |     3.67871 |    0.618717 |         3.5 |        3.75 

Here we see:

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns are identical again, as expected, since beer matching requires an exact ABV match.

2. Rating Categories on BeerAdvocate (BA)
   - `appearance_ba`, `aroma_ba`, `palate_ba`, `taste_ba`, and `overall_ba` ratings on BeerAdvocate have similar means (ranging from 3.68 to 3.78) with a consistent standard deviation of around 0.6.
   - The median for each category is close to 4, suggesting that BeerAdvocate users tend to give positive ratings overall, as the maximum for these is 5.
   - The overall rating (`rating_ba`) has a mean of 3.71, indicating a generally favorable tendency in ratings across these categories.

3. Rating Categories on RateBeer (RB)
   - `appearance_rb`, `palate_rb`, and `rating_rb` have means similar to those on BeerAdvocate (3.53 to 3.69), suggesting that for these categories, user ratings are comparable across platforms.
   - `aroma_rb` and `taste_rb`, however, have a maximum of 10, with means of 6.96 and 7.01, respectively. RateBeer scores these two aspects out of 10 rather than 5, which will need to be accounted for in our analysis.
   - `overall_rb` has a mean of 14.40 and a maximum of 20, because RateBeer scores the overall aspect out of 20. Again, this will need to be accounted for.

4. General Trends
   - BeerAdvocate uses a 1-5 range for every category, while RateBeer uses a broader scale for `aroma_rb`, `taste_rb`, and `overall_rb`. Cross-platform comparisons must therefore put these columns on a common scale first.
   - Despite these scale differences, the means for comparable categories like `appearance`, `palate`, and `rating` are fairly aligned between platforms, implying similar user perceptions on these aspects of beer.
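
One way to account for the scale differences is to map every RateBeer aspect score onto the 1-5 range before comparing it with BeerAdvocate. The sketch below uses a linear min-max mapping. The bounds in `RB_SCALES` are assumptions that should be checked against the full `describe_numbers` output, and `rescale` is a hypothetical helper, not part of the notebook's utilities.

```python
# Assumed (min, max) of each RateBeer aspect scale; verify against the data.
RB_SCALES = {
    "appearance_rb": (1, 5),
    "aroma_rb": (1, 10),
    "palate_rb": (1, 5),
    "taste_rb": (1, 10),
    "overall_rb": (1, 20),
}


def rescale(value, old_min, old_max, new_min=1.0, new_max=5.0):
    """Linearly map `value` from [old_min, old_max] to [new_min, new_max]."""
    fraction = (value - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# Aspect scores of one RateBeer rating from the sample shown earlier.
rb_rating = {
    "appearance_rb": 4.0,
    "aroma_rb": 8.0,
    "palate_rb": 4.0,
    "taste_rb": 8.0,
    "overall_rb": 17.0,
}
rescaled = {col: rescale(score, *RB_SCALES[col]) for col, score in rb_rating.items()}
```

Columns already on a 1-5 scale pass through unchanged, so the same mapping can be applied uniformly to all aspect columns. A min-max mapping is only one option; standardizing each column to z-scores would be an alternative if the distributions, not just the ranges, differ.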

Finally, we'll convert the `date_ba` and `date_rb` columns from Unix timestamps into datetime objects.


In [16]:
# Seconds -> milliseconds, then reinterpret as a millisecond-precision datetime
df_ratings = df_ratings.with_columns(
    (pl.col(["date_ba", "date_rb"]).cast(pl.Int64) * 1000).cast(pl.Datetime("ms"))
)

df_ratings.sample(5)

abv_ba,appearance_ba,aroma_ba,beer_id_ba,beer_name_ba,brewery_id_ba,brewery_name_ba,date_ba,overall_ba,palate_ba,rating_ba,review_ba,style_ba,taste_ba,text_ba,user_id_ba,user_name_ba,abv_rb,appearance_rb,aroma_rb,beer_id_rb,beer_name_rb,brewery_id_rb,brewery_name_rb,date_rb,overall_rb,palate_rb,rating_rb,style_rb,taste_rb,text_rb,user_id_rb,user_name_rb
f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,bool,str,f64,str,str,str,f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,str,f64,str,i64,str
4.5,4.0,3.5,80811,"""Terrapin Easy Rider""",2372,"""Terrapin Beer Company""",2012-07-21 10:00:00,3.5,3.0,3.68,True,"""American Pale Ale (APA)""",4.0,"""12 oz bottle into a pint glass…","""rangerred.112838""","""rangerred""",4.5,5.0,6.0,172124,"""Terrapin Easy Rider""",2851,"""Terrapin Beer Company""",2012-07-21 10:00:00,17.0,2.0,3.8,"""Amber Ale""",8.0,"""12 oz bottle into a pint glass…",137945,"""rangerred"""
5.0,3.75,3.25,87019,"""Preservation Ale""",812,"""Marin Brewing Company""",2013-07-05 10:00:00,3.25,3.75,3.43,True,"""American Pale Ale (APA)""",3.5,"""Pours pale gold and mildly haz…","""popery.196587""","""popery""",5.0,4.0,5.0,176350,"""Marin Preservation Ale""",222,"""Marin Brewing Company""",2013-07-07 10:00:00,11.0,4.0,3.0,"""American Pale Ale""",6.0,"""Pours pale gold and mildly haz…",69847,"""popery"""
7.5,4.0,4.0,238194,"""Prairie Ace""",30356,"""Prairie Artisan Ales""",2016-09-18 10:00:00,4.0,4.0,4.0,True,"""Saison / Farmhouse Ale""",4.0,"""500 ml bottle into tulip glass…","""superspak.456300""","""superspak""",7.5,4.0,8.0,443132,"""Prairie Ace""",15476,"""Prairie Artisan Ales""",2016-09-12 10:00:00,16.0,4.0,4.0,"""Saison""",8.0,"""500 ml bottle into tulip glass…",105791,"""superspak"""
6.1,3.5,3.5,61889,"""Zepptemberfest""",2455,"""French Broad Brewing Co.""",2011-03-10 11:00:00,4.0,3.5,3.4,True,"""Märzen / Oktoberfest""",3.0,"""Appearance: Dispensed from the…","""chaingangguy.8942""","""ChainGangGuy""",6.1,3.0,7.0,110459,"""French Broad Zepptemberfest""",2879,"""French Broad Brewing""",2011-03-10 11:00:00,14.0,4.0,3.4,"""Oktoberfest/Märzen""",6.0,"""Appearance: Dispensed from the…",32031,"""ChainGangGuy"""
4.7,4.0,4.0,52096,"""Black Duck Porter""",20766,"""Greenport Harbor Brewing Compa…",2012-08-19 10:00:00,3.5,3.5,3.65,True,"""English Porter""",3.5,"""On cask at Barcade in Brooklyn…","""huhzubendah.168653""","""Huhzubendah""",4.7,4.0,7.0,109038,"""Greenport Harbor Black Duck Po…",10720,"""Greenport Harbor Brewing Co.""",2012-08-19 10:00:00,13.0,3.0,3.3,"""Porter""",6.0,"""On cask at Barcade in Brooklyn…",96987,"""Huhzubendah"""
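
The conversion can be spot-checked on a single value with the standard library. This sketch assumes the raw values are Unix timestamps in seconds and interprets them as UTC; `epoch_seconds_to_utc` is a hypothetical helper, not part of the notebook's utilities.

```python
from datetime import datetime, timezone


def epoch_seconds_to_utc(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


# `date_ba` of the first row in the raw ratings sample shown earlier.
converted = epoch_seconds_to_utc(1493460000)
print(converted.isoformat())
```

For this value the result is 29 April 2017 at 10:00 UTC, which matches the `10:00:00` times that show up in the converted columns.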


## Processing and saving the cleaned data
Here we sync the matched beer dataset with the processed RateBeer and BeerAdvocate datasets. This is needed because some rows were removed from the BeerAdvocate and RateBeer datasets during their own cleaning, so any matched entry that no longer exists on either platform has to be dropped here as well. For each table, we keep only the matching keys and scores, then inner-join the cleaned platform tables onto them.
##### Preliminary

In [17]:
BEERADVOCATE_PATH = '../../../data/BeerAdvocate/processed'
RATEBEER_PATH = '../../../data/RateBeer/processed'


##### Users dataset

In [18]:
df_filtered_users = df_users.select(["user_id_ba", "user_id_rb", "sim_scores"])
print("Data before cleaning", df_filtered_users.shape[0])

# BeerAdvocate
df_users_ba = pl.read_parquet(BEERADVOCATE_PATH + "/users.pq")
df_users_ba = df_users_ba.rename({col: f"{col}_ba" for col in df_users_ba.columns})
df_filtered_users = df_filtered_users.join(df_users_ba, on="user_id_ba", how="inner")
del df_users_ba

# RateBeer
df_users_rb = pl.read_parquet(RATEBEER_PATH + "/users.pq")
df_users_rb = df_users_rb.rename({col: f"{col}_rb" for col in df_users_rb.columns})
df_filtered_users = df_filtered_users.join(df_users_rb, on="user_id_rb", how="inner")
del df_users_rb

print("Data after cleaning", df_filtered_users.shape[0])
print(f"Percentage of data lost: {100 - (df_filtered_users.shape[0] / df_users.shape[0] * 100)}%")

# Save the data
df_filtered_users.write_parquet(DST_DATA_PATH + "/users.pq")

Data before cleaning 3341
Data after cleaning 3341
Percentage of data lost: 0.0%


In [19]:
df_filtered_users.sample(5)

user_id_ba,user_id_rb,sim_scores,user_name_ba,joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,nbr_interactions_ba,user_name_rb,joined_rb,location_rb,nbr_ratings_rb,nbr_reviews_rb,nbr_interactions_rb
str,i64,f64,str,datetime[ms],str,i64,i64,i64,str,datetime[ms],str,i32,i32,i32
"""mcnealc31.335631""",119458,1.0,"""McNealc31""",2009-05-31 10:00:00,"""United States, Ohio""",243.0,7.0,250,"""mcnealc31""",2010-12-20 11:00:00,"""United States, Ohio""",211,211,211
"""holybrew.185690""",50421,1.0,"""holybrew""",2008-01-10 11:00:00,"""India""",2.0,,2,"""holybrew""",2007-02-25 11:00:00,"""India""",14,14,14
"""doob.1172376""",73490,1.0,"""Doob""",2016-09-05 10:00:00,"""England""",1.0,,1,"""Doob""",2008-04-15 10:00:00,"""England""",1,1,1
"""metalgdog.540033""",127030,1.0,"""metalgdog""",2010-12-17 11:00:00,"""United States, New York""",9.0,408.0,417,"""metalgdog""",2011-04-16 10:00:00,"""United States, New York""",1,1,1
"""deekyn.643035""",137131,1.0,"""deekyn""",2011-12-20 11:00:00,"""United States, California""",,219.0,219,"""deekyn""",2011-10-05 10:00:00,"""United States, California""",1135,1135,1135


##### Beers dataset

In [20]:
df_filtered_beers = df_beers.select(["beer_id_ba", "beer_id_rb", "diff_scores", "sim_scores"])
print("Data before cleaning", df_filtered_beers.shape[0])

# BeerAdvocate
df_beers_ba = pl.read_parquet(BEERADVOCATE_PATH + "/beers.pq")
df_beers_ba = df_beers_ba.rename({col: f"{col}_ba" for col in df_beers_ba.columns})
df_filtered_beers = df_filtered_beers.join(df_beers_ba, on="beer_id_ba", how="inner")

# RateBeer
df_beers_rb = pl.read_parquet(RATEBEER_PATH + "/beers.pq")
df_beers_rb = df_beers_rb.rename({col: f"{col}_rb" for col in df_beers_rb.columns})
df_filtered_beers = df_filtered_beers.join(df_beers_rb, on="beer_id_rb", how="inner")

print("Data after cleaning", df_filtered_beers.shape[0])
print(f"Percentage of data lost: {(1 - df_filtered_beers.shape[0] / df_beers.shape[0]) * 100:.2f}%")

# Save the data
df_filtered_beers.write_parquet(DST_DATA_PATH + "/beers.pq")

Data before cleaning 45640
Data after cleaning 45636
Percentage of data lost: 0.01%


In [21]:
df_filtered_beers.sample(5)

beer_id_ba,beer_id_rb,diff_scores,sim_scores,beer_name_ba,brewery_id_ba,brewery_name_ba,style_ba,abv_ba,rating_score_avg_ba,rating_score_std_ba,rating_score_median_ba,nbr_interactions_ba,nbr_ratings_ba,nbr_reviews_ba,beer_name_rb,brewery_id_rb,brewery_name_rb,style_rb,abv_rb,rating_score_avg_rb,rating_score_std_rb,rating_score_median_rb,nbr_interactions_rb,nbr_ratings_rb,nbr_reviews_rb
i64,i64,f64,f64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64
118056,258812,1.0,1.0,"""Cameo""",24075,"""Funkwerks""","""Saison / Farmhouse Ale""",5.5,3.785,0.473377,3.875,8,6,2,"""Funkwerks Cameo""",12024,"""Funkwerks""","""Saison""",5.5,3.371429,0.281154,3.4,7,7,7
88753,25982,0.69937,1.0,"""Krönes Eifeler Landbier""",30621,"""Gemünder Brauerei""","""Kellerbier / Zwickelbier""",5.1,3.559091,0.5622,3.46,11,5,6,"""Krönes Eifeler Landbier""",3902,"""Gemünder Brauerei""","""Zwickel/Keller/Landbier""",5.1,3.035052,0.367152,3.0,97,97,97
122583,268000,1.0,1.0,"""Serpent Bite""",34960,"""Orpheus Brewing""","""Saison / Farmhouse Ale""",6.5,3.930108,0.537151,4.0,93,73,20,"""Orpheus Serpent Bite""",19782,"""Orpheus Brewing""","""Sour/Wild Ale""",6.5,3.712121,0.357734,3.8,33,33,33
247157,446460,1.0,1.0,"""Tart Sunshine""",32656,"""Arizona Wilderness Brewing Co.""","""American Wild Ale""",4.0,4.065769,0.195431,4.0,26,23,3,"""Arizona Wilderness Tart Sunshi…",17445,"""Arizona Wilderness Brewing Com…","""Sour/Wild Ale""",4.0,3.76129,0.231916,3.8,31,31,31
40664,83368,1.0,1.0,"""Winterbok""",16787,"""Bronckhorster Brewing Company""","""Belgian Strong Dark Ale""",7.0,3.33,,3.33,1,0,1,"""Bronckhorster Winterbok""",10949,"""Bronckhorster Brewing Company""","""Doppelbock""",7.0,3.3,,3.3,1,1,1


##### Breweries dataset

In [22]:
df_filtered_breweries = df_breweries.select(["id_ba", "id_rb", "diff_scores", "sim_scores"])
print("Data before cleaning", df_filtered_breweries.shape[0])

# BeerAdvocate
df_breweries_ba = pl.read_parquet(BEERADVOCATE_PATH + "/breweries.pq")
df_breweries_ba = df_breweries_ba.rename({col: f"{col}_ba" for col in df_breweries_ba.columns})
df_filtered_breweries = df_filtered_breweries.join(df_breweries_ba, on="id_ba", how="inner")

# RateBeer
df_breweries_rb = pl.read_parquet(RATEBEER_PATH + "/breweries.pq")
df_breweries_rb = df_breweries_rb.rename({col: f"{col}_rb" for col in df_breweries_rb.columns})
df_filtered_breweries = df_filtered_breweries.join(df_breweries_rb, on="id_rb", how="inner")

print("Data after cleaning", df_filtered_breweries.shape[0])
print(f"Percentage of data lost: {(1 - df_filtered_breweries.shape[0] / df_breweries.shape[0]) * 100:.2f}%")

# Save the data
df_filtered_breweries.write_parquet(DST_DATA_PATH + "/breweries.pq")

Data before cleaning 8281
Data after cleaning 7428
Percentage of data lost: 10.30%
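
The percentages printed above follow from the row counts before and after the join. A small sketch of that arithmetic, using a hypothetical helper `pct_lost` that is not defined in the notebook:

```python
def pct_lost(before: int, after: int) -> float:
    """Return the share of rows dropped, in percent."""
    if before == 0:
        # Nothing to lose from an empty table; avoid dividing by zero.
        return 0.0
    return (1 - after / before) * 100


# Row counts taken from the outputs of the cells above.
print(f"Beers:     {pct_lost(45640, 45636):.2f}%")
print(f"Breweries: {pct_lost(8281, 7428):.2f}%")
```

For breweries this is 853 dropped rows out of 8281, and for beers 4 out of 45640.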


In [23]:
df_filtered_breweries.sample(5)

id_ba,id_rb,diff_scores,sim_scores,location_ba,name_ba,beers_count_ba,location_rb,name_rb,beers_count_rb
i64,i64,f64,f64,str,str,i64,str,str,i32
41074,22687,0.342566,1.0,"""Australia""","""Pirate Life Brewing""",10,"""Australia""","""Pirate Life Brewing""",18
20914,1342,0.845173,0.845173,"""Lithuania""","""Volfas Engelman""",37,"""Lithuania""","""Volfas Engelman (Olvi)""",94
40399,16413,1.0,1.0,"""Spain""","""Gothia Launia""",1,"""Spain""","""Gothia Launia""",3
41596,29499,0.770562,1.0,"""United States, Pennsylvania""","""Helicon Brewing""",13,"""United States, Pennsylvania""","""Helicon Brewing""",23
37557,19205,0.69256,1.0,"""Switzerland""","""La Nébuleuse""",5,"""Switzerland""","""La Nébuleuse""",29


##### Ratings dataset

In [24]:
df_ratings.head()

abv_ba,appearance_ba,aroma_ba,beer_id_ba,beer_name_ba,brewery_id_ba,brewery_name_ba,date_ba,overall_ba,palate_ba,rating_ba,review_ba,style_ba,taste_ba,text_ba,user_id_ba,user_name_ba,abv_rb,appearance_rb,aroma_rb,beer_id_rb,beer_name_rb,brewery_id_rb,brewery_name_rb,date_rb,overall_rb,palate_rb,rating_rb,style_rb,taste_rb,text_rb,user_id_rb,user_name_rb
f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,bool,str,f64,str,str,str,f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,str,f64,str,i64,str
11.3,4.5,4.5,645,"""Trappistes Rochefort 10""",207,"""Brasserie de Rochefort""",2011-12-25 11:00:00,5.0,4.5,4.8,True,"""Quadrupel (Quad)""",5.0,"""Best before 27.07.2016Directly…","""erzengel.248045""","""Erzengel""",11.3,4.0,10.0,2360,"""Rochefort Trappistes 10""",406,"""Brasserie Rochefort""",2013-12-22 11:00:00,19.0,4.0,4.6,"""Abt/Quadrupel""",9.0,""" a) Geruch malzig-schwer-sÃ¼Ã…",83106,"""Erzengel"""
5.0,,,28191,"""Myanmar Lager Beer""",9369,"""Myanmar Brewery and Distillery""",2011-11-30 11:00:00,,,3.0,True,"""American Adjunct Lager""",,"""nan""","""visionthing.639993""","""visionthing""",5.0,2.0,3.0,17109,"""Myanmar Lager Beer""",2921,"""Myanmar Brewery and Distillery""",2011-11-29 11:00:00,6.0,2.0,1.7,"""Pale Lager""",4.0,"""Can. Weak and watery, not the …",91324,"""visionthing"""
5.0,3.5,3.5,57911,"""Cantillon Tyrnilambic Baie D’A…",388,"""Brasserie Cantillon""",2012-08-04 10:00:00,4.0,4.0,3.85,True,"""Lambic - Fruit""",4.0,"""Bottle @ One Pint Pub, Helsink…","""tiong.608427""","""tiong""",5.0,4.0,8.0,35298,"""Cantillon Tyrnilambic Baie dA…",1069,"""Cantillon""",2012-11-22 11:00:00,17.0,4.0,4.1,"""Lambic Style - Fruit""",8.0,"""Bottle @ One Pint Pub, Helsink…",98624,"""tiong"""
5.0,4.0,3.5,57913,"""Cantillon Pikkulinnun Viskilam…",388,"""Brasserie Cantillon""",2012-08-04 10:00:00,4.0,4.0,3.68,True,"""Lambic - Unblended""",3.5,"""Originally rated on 16.11.2009…","""tiong.608427""","""tiong""",5.0,4.0,8.0,113596,"""Cantillon Pikkulinnun Viskilam…",1069,"""Cantillon""",2014-11-17 11:00:00,16.0,4.0,4.1,"""Lambic Style - Unblended""",9.0,"""Draught @Â Pikkulintu, Helsink…",98624,"""tiong"""
6.0,4.0,4.0,81125,"""Drie Fonteinen Oude Geuze - Ar…",2216,"""Brouwerij 3 Fonteinen""",2012-08-29 10:00:00,4.0,4.0,4.0,True,"""Gueuze""",4.0,"""750ml bottle, originally rated…","""tiong.608427""","""tiong""",6.0,4.0,8.0,173481,"""3 Fonteinen Oude Geuze (Armand…",2058,"""Brouwerij 3 Fonteinen""",2012-08-18 10:00:00,16.0,4.0,4.0,"""Lambic Style - Gueuze""",8.0,"""750ml bottleBottling date: 201…",98624,"""tiong"""


In [25]:
df_ratings_filtered = df_ratings.select(["beer_id_ba", "brewery_id_ba", "user_id_ba", "beer_id_rb", "brewery_id_rb", "user_id_rb"])

print("Data before cleaning", df_ratings_filtered.shape[0])

# BeerAdvocate
df_ratings_ba = pl.read_parquet(BEERADVOCATE_PATH + "/ratings.pq")
df_ratings_ba = df_ratings_ba.rename({col: f"{col}_ba" for col in df_ratings_ba.columns})
df_ratings_filtered = df_ratings_filtered.join(df_ratings_ba, on=["beer_id_ba", "brewery_id_ba", "user_id_ba"], how="inner")

# RateBeer
df_ratings_rb = pl.read_parquet(RATEBEER_PATH + "/ratings.pq")
df_ratings_rb = df_ratings_rb.rename({col: f"{col}_rb" for col in df_ratings_rb.columns})
df_ratings_filtered = df_ratings_filtered.join(df_ratings_rb, on=["beer_id_rb", "brewery_id_rb", "user_id_rb"], how="inner")

print("Data after cleaning", df_ratings_filtered.shape[0])
print(f"Percentage of data lost: {(1 - df_ratings_filtered.shape[0] / df_ratings.shape[0]) * 100:.2f}%")

# Save the data
df_ratings_filtered.write_parquet(DST_DATA_PATH + "/ratings.pq")

Data before cleaning 21964
Data after cleaning 21964
Percentage of data lost: 0.00%


### Discussion
As we observed in the data exploration of RateBeer and MatchedBeer, we lose very little information on beers, users, or ratings: no users or ratings are dropped, and only 4 of the 45640 matched beers (0.01%) are removed. The only notable loss is for breweries, where approximately 10% of the rows (853 of 8281) are dropped. Given that no rating has been lost, and that the filtering was done while processing the files used for the matching here, we can be confident that the dropped breweries were not relevant, since they did not have any reviews.