# Data Analysis and Cleaning of MatchedBeer Dataset

The MatchedBeer dataset combines data from BeerAdvocate and RateBeer, two of the most prominent online platforms for beer ratings. It was created to study the herding effect in online ratings, where early ratings can skew the ratings that follow. Because the same beers are rated independently on each platform, the merged dataset makes it possible to compare how their ratings evolve under different early ratings.

## Matching Process Overview
The matching process, described in detail in the paper ["When Sheep Shop: Measuring Herding Effects in Product Ratings with Natural Experiments"](https://dlab.epfl.ch/people/west/pub/Lederrey-West_WWW-18.pdf), aims to align beers from both websites accurately. The matching is performed in two phases to prioritize precision:

1. **Brewery Alignment**: Breweries from BeerAdvocate and RateBeer are only compared when their locations (state or country) match exactly; their names are then represented as TF-IDF vectors and compared by cosine similarity.
2. **Beer Alignment**: Within matched breweries, beers are aligned based on name similarity and require an exact match in alcohol by volume (ABV). Before matching, any shared tokens in brewery and beer names are removed to improve accuracy.

The algorithm ensures high precision by setting strict thresholds for cosine similarity and enforcing a significant gap between the best and second-best matches. This approach results in a dataset where brewery matching has a reported precision of 99.6%, and beer matching achieves a precision of 100%, meaning that only highly confident matches are included, though some potential matches may be excluded.
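
The name-matching idea above can be sketched in plain Python. This is a minimal illustration only: the whitespace tokenization, the smoothed IDF formula, the thresholds `min_sim` and `min_gap`, and the brewery names are all assumptions made for the example, not the values or code used in the paper, and the exact location and ABV checks are left out.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Return one {token: weight} TF-IDF vector per name (whitespace tokens)."""
    docs = [Counter(name.lower().split()) for name in names]
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in doc)
    # Smoothed IDF (an assumption) so that very common tokens keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_names(left, right, min_sim=0.8, min_gap=0.3):
    """Keep a match only if it is similar enough AND clearly ahead of the runner-up.

    min_sim and min_gap are illustrative thresholds, not the paper's values.
    """
    vectors = tfidf_vectors(left + right)
    left_vecs, right_vecs = vectors[: len(left)], vectors[len(left):]
    matches = {}
    for name, vec in zip(left, left_vecs):
        scored = sorted(
            ((cosine(vec, rv), cand) for cand, rv in zip(right, right_vecs)),
            reverse=True,
        )
        if not scored:
            continue
        best_sim, best_name = scored[0]
        second_sim = scored[1][0] if len(scored) > 1 else 0.0
        gap = best_sim - second_sim
        if best_sim >= min_sim and gap >= min_gap:
            matches[name] = (best_name, best_sim, gap)
    return matches


# Made-up brewery names for illustration.
ba_names = ["Old Mill Ale Works", "River Bend Brewing"]
rb_names = ["Old Mill Ale Works", "River Bend Brewery", "River Bend Brewhouse"]
matches = match_names(ba_names, rb_names)
print(matches)
```

In this toy run only "Old Mill Ale Works" is matched. "River Bend Brewing" is dropped because its best candidate is neither similar enough nor ahead of the second-best one, which is the precision-over-recall trade-off described above.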

## Structure of MatchedBeer Dataset
The dataset contains essential information on beers, breweries, users, and user ratings from both platforms. Since the user bases of BeerAdvocate and RateBeer differ geographically, the combined dataset reflects a more balanced view. BeerAdvocate is more U.S.-centric, while RateBeer has a more internationally diverse user base. Despite these differences, the matched dataset has been verified for both internal and external validity, making it well-suited for analyzing the influence of initial ratings on user perceptions across platforms.

MatchedBeer provides insights into how early ratings impact subsequent ones by comparing ratings for the same beer on both sites, especially when early ratings on each platform differ significantly. By standardizing and aligning ratings across time and platforms, MatchedBeer reveals patterns that allow for the identification of herding effects, helping researchers understand the biases in user-generated ratings.

In [1]:
# Import all the necessary libraries
import polars as pl
import tqdm
import os

# Import the file in the utils folder
import sys
sys.path.append('../../utils')
from data_desc import describe_numbers, describe
from matched_dataset import load_and_process_files

# Define the paths
SRC_DATA_PATH = '../../../data/MatchedBeer'
DST_DATA_PATH = '../../../data/MatchedBeer/processed'

# Create the DST_DATA_PATH if it does not exist
os.makedirs(DST_DATA_PATH, exist_ok=True)

#### Data Exploration

In [2]:
file_paths = [SRC_DATA_PATH + "/beers.csv", SRC_DATA_PATH + "/breweries.csv", SRC_DATA_PATH + "/ratings.csv", SRC_DATA_PATH + "/users_approx.csv"]
dataframes_list = load_and_process_files(file_paths)

df_beers = dataframes_list[0]
df_breweries = dataframes_list[1]
df_ratings = dataframes_list[2]
df_users = dataframes_list[3]

##### Beers dataset
We'll start by looking at the beers dataset.

In [3]:
df_beers.sample(5)

abv_ba,avg_ba,avg_computed_ba,avg_matched_valid_ratings_ba,ba_score_ba,beer_id_ba,beer_name_ba,beer_wout_brewery_name_ba,brewery_id_ba,brewery_name_ba,bros_score_ba,nbr_matched_valid_ratings_ba,nbr_ratings_ba,nbr_reviews_ba,style_ba,zscore_ba,abv_rb,avg_rb,avg_computed_rb,avg_matched_valid_ratings_rb,beer_id_rb,beer_name_rb,beer_wout_brewery_name_rb,brewery_id_rb,brewery_name_rb,nbr_matched_valid_ratings_rb,nbr_ratings_rb,overall_score_rb,style_rb,style_score_rb,zscore_rb,diff_scores,sim_scores
f64,f64,f64,f64,f64,i64,str,str,i64,str,f64,i64,i64,i64,str,f64,f64,f64,f64,f64,i64,str,str,i64,str,i64,i64,f64,str,f64,f64,f64,f64
5.4,3.5,3.5,,,251920,"""Wet Season""","""Wet Season""",39655,"""Chainline Brewing Company""",,0,1,0,"""Saison / Farmhouse Ale""",-1.016361,5.4,3.07,3.35,3.35,452911,"""Chainline Wet Season""","""Wet Season""",22191,"""Chainline Brewing Company""",2,2,,"""Belgian Ale""",,-0.239896,1.0,1.0
7.8,3.56,3.56,,,179246,"""20th Anniversary IPA""","""20th Anniversary IPA""",1394,"""Skagit River Brewery""",,0,1,0,"""American IPA""",-0.788544,7.8,3.0,3.3,3.3,339413,"""Skagit River 20th Anniversary …","""20th Anniversary IPA""",658,"""Skagit River Brewing Co.""",1,1,,"""Imperial IPA""",,-0.229064,0.724698,1.0
5.4,3.47,3.472727,,83.0,145554,"""Emmett""","""Emmett""",22412,"""Noble Ale Works""",,0,11,0,"""Belgian Dark Ale""",-0.790864,5.4,2.77,2.8,2.8,296604,"""Noble Ale Works Emmett""","""Emmett""",11238,"""Noble Ale Works""",3,3,,"""Belgian Ale""",,-1.097199,1.0,1.0
6.0,4.02,4.02,4.015,85.0,58892,"""Saison Du Repos""","""Du Repos Saison""",15526,"""Hopfenstark""",,4,11,4,"""Saison / Farmhouse Ale""",0.201509,6.0,3.54,3.595,3.595,103844,"""Hopfenstark Saison Du Repos""","""Du Repos Saison""",8201,"""Hopfenstark""",40,40,90.0,"""Saison""",84.0,0.420911,0.777297,1.0
4.7,3.75,3.75,,,283173,"""Black Crow Lager""","""Crow Black Lager""",35910,"""Crowbar & Bryggeri""",,0,1,0,"""Euro Dark Lager""",-0.559449,4.7,3.16,3.223077,3.223077,222558,"""Crowbar Black Crow Lager""","""Crow Black Lager""",15726,"""Crowbar & Bryggeri""",13,13,48.0,"""Schwarzbier""",60.0,-0.322717,0.664474,1.0


This dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for each beer in BA/RB | 
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `beer_wout_brewery_name_ba` / `beer_wout_brewery_name_rb` | Beer name without brewery name in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings in BA/RB |
| `nbr_reviews_ba` | Number of reviews in BA |
| `avg_ba` / `avg_rb` | Average rating in BA/RB |
| `avg_computed_ba` / `avg_computed_rb` | Computed average rating in BA/RB |
| `zscore_ba` / `zscore_rb` | Standardized score in BA/RB |
| `nbr_matched_valid_ratings_ba` / `nbr_matched_valid_ratings_rb` | Number of matched valid ratings in BA/RB |
| `avg_matched_valid_ratings_ba` / `avg_matched_valid_ratings_rb` | Average of matched valid ratings in BA/RB |
| `ba_score_ba` | BeerAdvocate score |
| `bros_score_ba` | Bros score in BA |
| `overall_score_rb` | Overall score in RB |
| `style_score_rb` | Style-specific score in RB |
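| `diff_scores` | Matching score: appears to be the margin of the best name match over the second-best candidate (not a rating difference) |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB names |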

Now that we know the structure of the dataset, let's check how complete it is and how many entries are missing.

In [4]:
describe(df_beers)

+------------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|------------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| abv_ba                       | Float64 |            45640 |             0 |    0.00 % |                   351 |            0.77 % |
| avg_ba                       | Float64 |            40285 |          5355 |   11.73 % |                   373 |            0.93 % |
| avg_computed_ba              | Float64 |            40285 |          5355 |   11.73 % |                 12090 |           30.01 % |
| avg_matched_valid_ratings_ba | Float64 |            28272 |         17368 |   38.05 % |                  6658 |           23.55 % |
| ba_score_ba                  | Float64 |            10499 | 
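
The `describe` helper comes from the project's `utils` folder and its implementation is not shown here. The visible rows are consistent with the null percentage being taken over all rows (5355 / 45640 ≈ 11.73 % for `avg_ba`) and the unique percentage over the non-null rows (373 / 40285 ≈ 0.93 %). A plain-Python sketch of such a summary, with the hypothetical name `summarize_column` and made-up values, could look like this:

```python
def summarize_column(values):
    """Null and unique-value statistics for one column given as a list."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    n_nulls = total - len(non_null)
    n_unique = len(set(non_null))
    return {
        "not_null": len(non_null),
        "nulls": n_nulls,
        # Share of missing entries among all rows.
        "nulls_pct": round(100 * n_nulls / total, 2) if total else 0.0,
        "unique": n_unique,
        # Share of distinct values among the non-null entries.
        "unique_pct": round(100 * n_unique / len(non_null), 2) if non_null else 0.0,
    }


# Made-up column with two missing entries and one repeated value.
avg_ba = [3.5, 3.56, None, 3.5, None, 4.02, 3.75, 3.47]
stats = summarize_column(avg_ba)
print(stats)
```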

In [5]:
describe_numbers(df_beers, filters=["beer_id_ba", "brewery_id_ba", "beer_id_rb", "brewery_id_rb"])

+------------------------------+-----------+-----------+-----------+------------+-----------+----------+---------+
| Column                       |      Mean |       Std |       25% |        50% |       75% |      Min |     Max |
|------------------------------+-----------+-----------+-----------+------------+-----------+----------+---------|
| abv_ba                       |   6.32086 |   1.85773 |         5 |          6 |       7.2 |     0.38 |    67.5 |
| avg_ba                       |   3.73237 |  0.439988 |      3.52 |       3.78 |         4 |        1 |       5 |
| avg_computed_ba              |   3.72785 |  0.424824 |      3.52 |     3.7725 |         4 |        1 |       5 |
| avg_matched_valid_ratings_ba |   3.74777 |  0.484775 |      3.52 |    3.80833 |     4.045 |        1 |       5 |
| ba_score_ba                  |    84.871 |   3.28287 |        83 |         85 |        86 |       63 |     100 |
| bros_score_ba                |   86.1623 |    8.5394 |        83 |         88 

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns show identical distributions across BeerAdvocate and RateBeer, with a mean ABV of approximately 6.32%.
   - This similarity suggests a good alignment in the types of beers reviewed on both platforms. The maximum ABV of 67.5% is possibly an outlier, likely representing rare extreme beers, as typical ABVs are much lower.

2. Average Ratings (BeerAdvocate vs. RateBeer)
   - The `avg_ba` (BeerAdvocate) mean is 3.73, while `avg_rb` (RateBeer) is significantly lower at 3.14.
   - This aligns with previous findings that BeerAdvocate users tend to rate beers more generously than RateBeer users. The higher `std` value for `avg_ba` also indicates a wider range in BeerAdvocate’s ratings, suggesting more variability in user opinions.

3. Computed Averages (Consistency in Ratings)
   - `avg_computed_ba` and `avg_computed_rb` refer to recalculated averages, showing slightly lower standard deviations than their direct counterparts (`avg_ba` and `avg_rb`), indicating a slight smoothing effect.
   - Both BeerAdvocate and RateBeer have computed averages (`avg_computed_ba` vs. `avg_computed_rb`) that are fairly close to the respective raw averages, implying that the original averages align well with user trends.

4. Rating Counts
   - The median `nbr_ratings_ba` (3) and `nbr_ratings_rb` (5) are low, showing that many beers have only a few ratings on either site, with the typical beer having slightly more ratings on RateBeer.
   - The count of matched valid ratings is also more dispersed on RateBeer (standard deviation of 80.10 for `nbr_matched_valid_ratings_rb` vs. 42.51 for `nbr_matched_valid_ratings_ba`), meaning that a few beers attract far more ratings than the rest.

5. Z-scores (Standardized Ratings)
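   - The mean `zscore_ba` (-0.41) is lower than the mean `zscore_rb` (-0.10), so the matched beers sit further below the platform-wide average on BeerAdvocate than on RateBeer. Since z-scores are standardized within each platform, this describes relative position only and does not by itself show that either site rates more generously in absolute terms.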
   - BeerAdvocate’s z-scores also exhibit a broader spread, evidenced by a standard deviation of 0.81 compared to RateBeer's 0.73, reinforcing the idea that BeerAdvocate reviews show more polarized opinions.

6. Score Difference and Similarity
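   - `diff_scores` and `sim_scores` come from the matching step rather than from the ratings: in the sample above, `sim_scores` is 1.0 wherever the brewery-stripped beer names are identical, and `diff_scores` does not track the gap between `avg_ba` and `avg_rb`. They appear to be the cosine similarity of the matched names and its margin over the second-best candidate.
   - Under that reading, the high mean `sim_scores` (0.986) and the mean `diff_scores` of 0.798 indicate that most matches are near-exact and clearly separated from the runner-up, consistent with the precision-oriented matching described earlier.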
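
Point 5 relies on standardized ratings. A minimal sketch of the usual definition, z = (x - mean) / std, is given below; it assumes standardization over all ratings of one platform, whereas the dataset authors may standardize within finer groups, and the ratings are made up. It shows that a constant offset between platforms disappears after standardization, which is why z-scores are comparable across sites while raw averages are not.

```python
from statistics import mean, pstdev


def zscores(ratings):
    """Standardize ratings to zero mean and unit (population) standard deviation."""
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        return [0.0 for _ in ratings]
    return [(r - mu) / sigma for r in ratings]


# Made-up ratings: the second platform rates every beer 0.6 points lower.
ba_ratings = [3.5, 3.8, 4.0, 4.3]
rb_ratings = [r - 0.6 for r in ba_ratings]

ba_z = zscores(ba_ratings)
rb_z = zscores(rb_ratings)
print(ba_z)
print(rb_z)
```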

##### Breweries dataset

Now we can take a look at the breweries dataframe. For each platform, every brewery has a unique identifier, a name, a location, and the number of beers listed for it. The `sim_scores` and `diff_scores` columns describe the match between the two platforms' entries. The brewery location can be very useful for our research questions; we will look into it later in this notebook.

In [6]:
df_breweries.sample(5)

id_ba,location_ba,name_ba,nbr_beers_ba,id_rb,location_rb,name_rb,nbr_beers_rb,diff_scores,sim_scores
i64,str,str,i64,i64,str,str,i64,f64,f64
33759,"""England""","""Nene Valley Brewery""",5,13712,"""England""","""Nene Valley Brewery""",29,0.434105,1.0
28716,"""United States, Oregon""","""Below Grade Brewing""",9,14417,"""United States, Oregon""","""Below Grade Brewing LLC""",10,0.403559,0.873627
40868,"""Belgium""","""Brouwerij Den Toetëlèr""",5,12610,"""Belgium""","""Brouwerij Den Toetëlèr""",19,0.474433,1.0
49207,"""United States, Minnesota""","""Broken Clock Brewing Cooperati…",4,30631,"""United States, Minnesota""","""Broken Clock Brewing Cooperati…",5,0.38763,1.0
2772,"""United States, Wisconsin""","""Titletown Brewing Company""",81,524,"""United States, Wisconsin""","""Titletown Brewing Company""",221,0.658297,1.0


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `id_ba` / `id_rb` | Unique identifier for each brewery in BA/RB | 
| `name_ba` / `name_rb` | Name of the brewery in BA/RB |
| `location_ba` / `location_rb` | Geographic location of the brewery in BA/RB |
| `nbr_beers_ba` / `nbr_beers_rb` | Number of beers produced by the brewery in BA/RB |
| `diff_scores` | Matching score: appears to be the margin of the best name match over the second-best candidate |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB brewery names |

In [7]:
describe(df_breweries)

+--------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| id_ba        | Int64   |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| location_ba  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_ba      | String  |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| nbr_beers_ba | Int64   |             8281 |             0 |    0.00 % |                   231 |            2.79 % |
| id_rb        | Int64   |             8281 |             0 |    0.00 % |                  8235 |           99.44 % |
| location_rb  | String  |             8281 |           

In [8]:
describe(df_breweries, filters=["id_ba", "id_rb"])

+--------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| location_ba  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_ba      | String  |             8281 |             0 |    0.00 % |                  8281 |          100.00 % |
| nbr_beers_ba | Int64   |             8281 |             0 |    0.00 % |                   231 |            2.79 % |
| location_rb  | String  |             8281 |             0 |    0.00 % |                   205 |            2.48 % |
| name_rb      | String  |             8281 |             0 |    0.00 % |                  8235 |           99.44 % |
| nbr_beers_rb | Int64   |             8281 |           

For our breweries on both platforms, we have no missing values (in this matched data).

##### Users dataset

**Note**: The `users_approx` data has essentially the same structure as the exactly matched users data. The difference is that users are not required to have identical usernames on the two platforms: usernames are compared using a TF-IDF vectorizer and cosine similarity, and two accounts are matched if their usernames are close enough and their locations are the same.

For each user and each platform, we have the ID, the username, the timestamp of when they joined, their location, and their number of ratings (and, for BeerAdvocate, reviews). The location and the number of ratings and reviews per user are the most interesting fields for us.

In [9]:
df_users.sample(5)

joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,user_id_ba,user_name_ba,user_name_lower_ba,joined_rb,location_rb,nbr_ratings_rb,user_id_rb,user_name_rb,user_name_lower_rb,sim_scores
f64,str,i64,i64,str,str,str,f64,str,i64,i64,str,str,f64
1405900000.0,"""Taiwan""",499,248,"""jackieku.830313""","""JackieKu""","""jackieku""",1414100000.0,"""Taiwan""",3,341016,"""jackieku""","""jackieku""",1.0
1314600000.0,"""Netherlands""",499,332,"""mar02x.618252""","""Mar02x""","""mar02x""",1311200000.0,"""Netherlands""",1,132153,"""Mar02x""","""mar02x""",1.0
1297000000.0,"""United States, Virginia""",289,0,"""miamilice.563376""","""Miamilice""","""miamilice""",1323200000.0,"""United States, Virginia""",1,144492,"""Miamilice""","""miamilice""",1.0
1299900000.0,"""United States, Georgia""",56,0,"""sirzekel.578924""","""sirzekel""","""sirzekel""",1447900000.0,"""United States, Georgia""",48,391650,"""sirzekel""","""sirzekel""",1.0
1112300000.0,"""United States, Texas""",12,12,"""2ndstage.17173""","""2ndstage""","""2ndstage""",1086300000.0,"""United States, Texas""",4,12887,"""2ndstage""","""2ndstage""",1.0


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `user_id_ba` / `user_id_rb` | Unique identifier for each user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `user_name_lower_ba` / `user_name_lower_rb` | Lowercase username in BA/RB |
| `joined_ba` / `joined_rb` | Time when the user joined BA/RB (Unix timestamp in seconds) |
| `location_ba` / `location_rb` | Geographic location of the user in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings submitted by the user in BA/RB |
| `nbr_reviews_ba` | Number of reviews submitted by the user in BA |
| `sim_scores` | Cosine similarity between the matched BA and RB usernames |

In [10]:
describe(df_users)

+--------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column             | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| joined_ba          | Float64 |             3341 |             0 |    0.00 % |                  2352 |           70.40 % |
| location_ba        | String  |             3341 |             0 |    0.00 % |                   111 |            3.32 % |
| nbr_ratings_ba     | Int64   |             3341 |             0 |    0.00 % |                   660 |           19.75 % |
| nbr_reviews_ba     | Int64   |             3341 |             0 |    0.00 % |                   475 |           14.22 % |
| user_id_ba         | String  |             3341 |             0 |    0.00 % |                  3316 |           99.25 % |
| user_n

In [11]:
describe_numbers(df_users, filters=["user_id_rb", "joined_ba", "joined_rb"])

+----------------+----------+-----------+-------+-------+-------+----------+-------+
| Column         |     Mean |       Std |   25% |   50% |   75% |      Min |   Max |
|----------------+----------+-----------+-------+-------+-------+----------+-------|
| nbr_ratings_ba |  211.612 |   685.867 |     2 |    13 |   106 |        1 | 12046 |
| nbr_reviews_ba |  118.232 |   454.174 |     1 |     3 |    29 |        0 |  7593 |
| nbr_ratings_rb |  270.923 |   1143.09 |     2 |     5 |    38 |        1 | 20678 |
| sim_scores     | 0.986855 | 0.0425066 |     1 |     1 |     1 | 0.800641 |     1 |
+----------------+----------+-----------+-------+-------+-------+----------+-------+


To make the `joined` columns more readable, we convert them from Unix timestamps (in seconds) to datetime values.

In [12]:
df_users = df_users.with_columns((pl.col("joined_ba").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users = df_users.with_columns((pl.col("joined_rb").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users.sample(5)

joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,user_id_ba,user_name_ba,user_name_lower_ba,joined_rb,location_rb,nbr_ratings_rb,user_id_rb,user_name_rb,user_name_lower_rb,sim_scores
datetime[ms],str,i64,i64,str,str,str,datetime[ms],str,i64,i64,str,str,f64
2008-03-21 11:00:00,"""United States, California""",71,63,"""jjh19.205113""","""jjh19""","""jjh19""",2009-11-25 11:00:00,"""United States, California""",42,98075,"""jjh19""","""jjh19""",1.0
2015-01-17 11:00:00,"""Norway""",1,0,"""ketil.928679""","""Ketil""","""ketil""",2010-12-17 11:00:00,"""Norway""",6,119220,"""ketilgr""","""ketilgr""",0.816497
2013-04-19 10:00:00,"""United States, Pennsylvania""",162,3,"""chuckv.729274""","""chuckv""","""chuckv""",2013-04-20 10:00:00,"""United States, Pennsylvania""",3,255217,"""Chuckv""","""chuckv""",1.0
2013-09-28 10:00:00,"""United States, Illinois""",2,2,"""kwilk.756262""","""Kwilk""","""kwilk""",2013-09-26 10:00:00,"""United States, Illinois""",1,281042,"""Kwilk""","""kwilk""",1.0
2004-06-04 10:00:00,"""United States, New York""",120,0,"""sjjn.6634""","""sjjn""","""sjjn""",2004-07-09 10:00:00,"""United States, New York""",8,13561,"""sjjn""","""sjjn""",1.0


##### Ratings dataset

In [13]:
df_ratings.sample(5)

abv_ba,appearance_ba,aroma_ba,beer_id_ba,beer_name_ba,brewery_id_ba,brewery_name_ba,date_ba,overall_ba,palate_ba,rating_ba,review_ba,style_ba,taste_ba,text_ba,user_id_ba,user_name_ba,abv_rb,appearance_rb,aroma_rb,beer_id_rb,beer_name_rb,brewery_id_rb,brewery_name_rb,date_rb,overall_rb,palate_rb,rating_rb,style_rb,taste_rb,text_rb,user_id_rb,user_name_rb
f64,f64,f64,i64,str,i64,str,i64,f64,f64,f64,bool,str,f64,str,str,str,f64,f64,f64,i64,str,i64,str,i64,f64,f64,f64,str,f64,str,i64,str
6.7,5.0,4.0,198896,"""The Bright Side""",34599,"""Brew Gentlemen""",1459418400,4.25,4.0,4.11,True,"""American IPA""",4.0,"""pour is pale hazed yellow with…","""cpetrone84.343431""","""cpetrone84""",6.7,5.0,8.0,374945,"""The Brew Gentlemen The Bright …",19561,"""Brew Gentlemen""",1459418400,17.0,4.0,4.2,"""India Pale Ale (IPA)""",8.0,"""pour is pale hazed yellow with…",99545,"""cpetrone84"""
7.0,,,111053,"""Ava IPA""",17980,"""Lawson's Finest Liquids""",1391425200,,,4.0,True,"""American IPA""",,"""nan""","""tom10101.528127""","""tom10101""",7.0,3.0,8.0,248339,"""Lawsons Finest Ava IPA""",9863,"""Lawsons Finest Liquids""",1391425200,16.0,4.0,3.9,"""India Pale Ale (IPA)""",8.0,"""On tap @ The Reservoir (Waterb…",126758,"""tom10101"""
6.0,,,96755,"""Grassroots Arctic Saison""",23205,"""Grassroots Brewing""",1381658400,,,3.75,True,"""Saison / Farmhouse Ale""",,"""nan""","""sammy.3853""","""Sammy""",6.0,4.0,5.0,216712,"""Grassroots Arctic Saison""",11094,"""Grassroots Brewing""",1381572000,16.0,4.0,3.5,"""Saison""",6.0,"""From craftshack. A little funk…",11905,"""Sammy"""
12.0,4.5,4.75,3833,"""AleSmith Speedway Stout""",396,"""AleSmith Brewing Company""",1361790000,4.75,4.5,4.71,True,"""American Double / Imperial Sto…",4.75,"""Opened at a tasting alongside …","""tectactoe.666880""","""tectactoe""",12.0,4.0,9.0,14232,"""AleSmith Speedway Stout""",432,"""AleSmith Brewing Company""",1399370400,17.0,4.0,4.3,"""Imperial Stout""",9.0,"""Bottle: Deep, dark - black wit…",217974,"""tectactoe"""
10.2,4.0,4.0,37692,"""Twelve""",392,"""Weyerbacher Brewing Co.""",1268996400,4.0,4.0,4.0,True,"""American Barleywine""",4.0,"""Thanks to bu11zeye for sharing…","""mora2000.164611""","""Mora2000""",10.2,4.0,7.0,73531,"""Weyerbacher Twelve""",241,"""Weyerbacher Brewing Co.""",1268996400,15.0,4.0,3.7,"""Barley Wine""",7.0,"""Thanks to bu11zeye for sharing…",78912,"""Mora2000"""


In [14]:
describe(df_ratings)

+-----------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column          | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| abv_ba          | Float64 |            21964 |             0 |    0.00 % |                   187 |            0.85 % |
| appearance_ba   | Float64 |            19295 |          2669 |   12.15 % |                    18 |            0.09 % |
| aroma_ba        | Float64 |            19295 |          2669 |   12.15 % |                    18 |            0.09 % |
| beer_id_ba      | Int64   |            21964 |             0 |    0.00 % |                  9025 |           41.09 % |
| beer_name_ba    | String  |            21964 |             0 |    0.00 % |                  8719 |           39.70 % |
| brewery_id_ba   | Int64   |   

The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for the beer in BA/RB |
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `user_id_ba` / `user_id_rb` | Unique identifier for the user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `date_ba` / `date_rb` | Time of the rating in BA/RB (Unix timestamp in seconds) |
| `rating_ba` / `rating_rb` | Final aggregate rating given by the user in BA/RB |
| `overall_ba` / `overall_rb` | Score for the "overall" aspect in BA/RB |
| `appearance_ba` / `appearance_rb` | Appearance score in BA/RB |
| `aroma_ba` / `aroma_rb` | Aroma score in BA/RB |
| `palate_ba` / `palate_rb` | Palate score in BA/RB |
| `taste_ba` / `taste_rb` | Taste score in BA/RB |
| `text_ba` / `text_rb` | Review text in BA/RB |
| `review_ba` | Whether or not there is a BA review available |

In [15]:
describe_numbers(df_ratings, filters=["beer_id_ba", "beer_id_rb", "brewery_id_ba", "brewery_id_rb", "user_id_ba", "user_id_rb"])

+---------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+
| Column        |        Mean |         Std |         25% |         50% |         75% |         Min |         Max |
|---------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------|
| abv_ba        |     6.80022 |     2.09619 |         5.2 |         6.4 |           8 |         0.5 |          39 |
| appearance_ba |     3.78329 |    0.562042 |         3.5 |           4 |           4 |           1 |           5 |
| aroma_ba      |     3.70404 |    0.597862 |         3.5 |        3.75 |           4 |           1 |           5 |
| date_ba       | 1.34367e+09 | 1.12069e+08 | 1.27098e+09 | 1.36835e+09 | 1.43471e+09 | 9.84136e+08 | 1.50141e+09 |
| overall_ba    |     3.73419 |    0.614882 |         3.5 |           4 |           4 |           1 |           5 |
| palate_ba     |     3.67871 |    0.618717 |         3.5 |        3.75 

Here we see:

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns are again identical, as expected, since beers are only matched when their ABV agrees exactly.

2. Rating Categories on BeerAdvocate (BA)
   - `appearance_ba`, `aroma_ba`, `palate_ba`, `taste_ba`, and `overall_ba` ratings on BeerAdvocate have similar means (ranging from 3.68 to 3.78) with a consistent standard deviation of around 0.6.
   - The median for each category is close to 4, suggesting that BeerAdvocate users tend to give positive ratings overall, as the maximum for these is 5.
   - The final aggregate rating (`rating_ba`, not to be confused with the `overall_ba` aspect score) has a mean of 3.71, indicating a generally favorable tendency in ratings across these categories.

3. Rating Categories on RateBeer (RB)
   - `appearance_rb`, `palate_rb`, and `rating_rb` have means similar to those on BeerAdvocate (3.53 to 3.69), suggesting that for these categories, user ratings are comparable across platforms.
   - `aroma_rb` and `taste_rb`, however, have higher maximum scores of 10, with means of 6.96 and 7.01, respectively. This difference likely reflects a different scoring system or metric on RateBeer, which will need to be accounted for in our analysis.
   - `overall_rb` shows a much broader scale, with a mean of 14.40 and a maximum of 20, indicating that RateBeer may use a larger rating scale for this category than BeerAdvocate does. Again this will need to be accounted for.

4. General Trends
   - BeerAdvocate ratings are generally more constrained within a 1-5 range for all categories, while RateBeer uses a broader scale for certain categories (`aroma_rb`, `taste_rb`, and `overall_rb`). This suggests a difference in rating systems that should be considered in cross-platform comparisons.
   - Despite these scale differences, the means for comparable categories like `appearance`, `palate`, and `rating` are fairly aligned between platforms, implying similar user perceptions on these aspects of beer.
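
One simple way to account for the different scales is a linear rescaling of each RateBeer aspect onto the 1-5 range. The sketch below is only an assumption about how this could be done, not necessarily what the later analysis uses; the scale minima in `RB_SCALES` are assumed (only the maxima are visible in the summary above). It also checks an observation from the sampled rows shown earlier, where `rating_rb` equals the sum of the five RateBeer aspect scores divided by 10.

```python
# Assumed (min, max) per RateBeer aspect; the minima are an assumption.
RB_SCALES = {
    "appearance": (1, 5),
    "aroma": (1, 10),
    "palate": (1, 5),
    "taste": (1, 10),
    "overall": (1, 20),
}


def rescale(value, old_min, old_max, new_min=1.0, new_max=5.0):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    return new_min + (value - old_min) * (new_max - new_min) / (old_max - old_min)


def rb_to_five_point(row):
    """Rescale every RateBeer aspect score of one rating to the 1-5 range."""
    return {aspect: rescale(row[aspect], *RB_SCALES[aspect]) for aspect in RB_SCALES}


# RateBeer aspect scores of the first sampled rating above (rating_rb = 4.2).
sample = {"appearance": 5.0, "aroma": 8.0, "palate": 4.0, "taste": 8.0, "overall": 17.0}

rescaled = rb_to_five_point(sample)
rating_from_aspects = sum(sample.values()) / 10
print(rescaled)
print(rating_from_aspects)
```

After rescaling, all aspect scores lie between 1 and 5 and can be compared directly with their BeerAdvocate counterparts.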

Finally, we convert the `date_ba` and `date_rb` columns from Unix timestamps (in seconds) into datetime values.


In [16]:
df_ratings = df_ratings.with_columns((pl.col("date_ba").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_ratings = df_ratings.with_columns((pl.col("date_rb").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))

df_ratings.sample(5)

abv_ba,appearance_ba,aroma_ba,beer_id_ba,beer_name_ba,brewery_id_ba,brewery_name_ba,date_ba,overall_ba,palate_ba,rating_ba,review_ba,style_ba,taste_ba,text_ba,user_id_ba,user_name_ba,abv_rb,appearance_rb,aroma_rb,beer_id_rb,beer_name_rb,brewery_id_rb,brewery_name_rb,date_rb,overall_rb,palate_rb,rating_rb,style_rb,taste_rb,text_rb,user_id_rb,user_name_rb
f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,bool,str,f64,str,str,str,f64,f64,f64,i64,str,i64,str,datetime[ms],f64,f64,f64,str,f64,str,i64,str
6.75,5.0,4.5,20518,"""Sanctification""",863,"""Russian River Brewing Company""",2011-03-07 11:00:00,5.0,4.5,4.63,True,"""American Wild Ale""",4.5,"""Shared by friends at a tasting…","""cpetrone84.343431""","""cpetrone84""",6.75,5.0,8.0,38052,"""Russian River Sanctification""",1480,"""Russian River Brewing""",2012-04-10 10:00:00,19.0,4.0,4.5,"""Sour/Wild Ale""",9.0,"""Shared by friends at a tasting…",99545,"""cpetrone84"""
8.5,4.0,4.0,83362,"""Emil Lindén Doppel Weizen Bock""",14439,"""Sigtuna Brygghus""",2012-07-27 10:00:00,3.5,4.0,3.7,True,"""Weizenbock""",3.5,"""A clouded amber/orange brew wi…","""rarbring.354372""","""rarbring""",8.5,4.0,8.0,179864,"""Sigtuna Emil Lindén Doppel Wei…",6774,"""Sigtuna Brygghus""",2012-09-03 10:00:00,14.0,4.0,3.7,"""Weizen Bock""",7.0,"""Aroma: Smelling fruity and spi…",104819,"""rarbring"""
10.5,3.5,4.0,199544,"""Madness & Solitude""",22511,"""Hill Farmstead Brewery""",2015-12-14 11:00:00,4.0,3.75,3.95,True,"""American Double / Imperial IPA""",4.0,"""Nicely done blend. More of an…","""sammy.3853""","""Sammy""",10.5,3.0,6.0,375312,"""Hill Farmstead Madness & Solit…",11233,"""Hill Farmstead Brewery""",2015-12-13 11:00:00,16.0,3.0,3.5,"""Imperial IPA""",7.0,"""Youngs house. Getting bourbon.…",11905,"""Sammy"""
7.75,4.0,4.5,12770,"""Damnation""",863,"""Russian River Brewing Company""",2010-09-05 10:00:00,4.0,4.0,4.12,True,"""Belgian Strong Pale Ale""",4.0,"""A: Hazy, straw yellow hue with…","""huhzubendah.168653""","""Huhzubendah""",7.75,4.0,8.0,13146,"""Russian River Damnation""",1480,"""Russian River Brewing""",2012-07-24 10:00:00,15.0,4.0,3.8,"""Belgian Strong Ale""",7.0,"""A: Hazy, straw yellow hue with…",96987,"""Huhzubendah"""
4.9,2.5,3.5,60321,"""Pagoa Pilsner - Horia""",21833,"""Euskal Garagardoa S.L.""",2014-08-12 10:00:00,3.25,3.0,3.34,True,"""German Pilsener""",3.5,"""330 ml. bottle. Golden colour,…","""janubio.14451""","""janubio""",4.9,3.0,6.0,13987,"""Pagoa Pilsner Horia""",2560,"""Euskal Garagardoa S.A.""",2014-08-12 10:00:00,15.0,3.0,3.3,"""Pilsener""",6.0,"""330 ml. bottle. Golden colour,…",13895,"""janubio"""
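
The casts used for the `joined_*` and `date_*` columns multiply by 1000 because the raw values are Unix timestamps in seconds, while `pl.Datetime("ms")` counts milliseconds. A standard-library sketch of the same conversion, using the `date_ba` value 1459418400 from one of the rows sampled before the conversion (the helper names are hypothetical):

```python
from datetime import datetime, timezone


def epoch_seconds_to_millis(seconds):
    """Convert a Unix timestamp in seconds to milliseconds."""
    return int(seconds) * 1000


def epoch_seconds_to_utc(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


raw_date = 1459418400  # seconds since 1970-01-01 00:00:00 UTC
millis = epoch_seconds_to_millis(raw_date)
as_datetime = epoch_seconds_to_utc(raw_date)
print(millis, as_datetime.isoformat())
```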


## Processing and saving the cleaned data
Here we sync the matched beer dataset with the processed RateBeer and BeerAdvocate datasets. This is needed because some rows were removed from the BeerAdvocate and RateBeer datasets during data cleaning, so the matched dataset must be restricted to the entries that still exist in both. For each table, we keep only the matching metadata (IDs and scores) and inner-join it with the processed table of each platform, suffixing the columns with `_ba` or `_rb`.
##### Preliminary

In [17]:
BEERADVOCATE_PATH = '../../../data/BeerAdvocate/processed'
RATEBEER_PATH = '../../../data/RateBeer/processed'


##### Users dataset

In [18]:
df_filtered_users = df_users.select(["user_id_ba", "user_id_rb", "sim_scores"])
print("Data before cleaning", df_filtered_users.shape[0])

# BeerAdvocate
df_users_ba = pl.read_parquet(BEERADVOCATE_PATH + "/users.pq")
df_users_ba = df_users_ba.rename({col: f"{col}_ba" for col in df_users_ba.columns})
df_filtered_users = df_filtered_users.join(df_users_ba, on="user_id_ba", how="inner")
del df_users_ba

# RateBeer
df_users_rb = pl.read_parquet(RATEBEER_PATH + "/users.pq")
df_users_rb = df_users_rb.rename({col: f"{col}_rb" for col in df_users_rb.columns})
df_filtered_users = df_filtered_users.join(df_users_rb, on="user_id_rb", how="inner")
del df_users_rb

print("Data after cleaning", df_filtered_users.shape[0])
print(f"Percentage of data lost: {100 - (df_filtered_users.shape[0] / df_users.shape[0] * 100)}%")

# Save the data
df_filtered_users.write_parquet(DST_DATA_PATH + "/users.pq")

Data before cleaning 3341


Data after cleaning 3341
Percentage of data lost: 0.0%


In [19]:
df_filtered_users.sample(5)

user_id_ba,user_id_rb,sim_scores,user_name_ba,joined_ba,location_ba,nbr_ratings_ba,nbr_reviews_ba,nbr_interactions_ba,user_name_rb,joined_rb,location_rb,nbr_reviews_rb,nbr_ratings_rb,nbr_interactions_rb
str,i64,f64,str,datetime[ms],str,i64,i64,i64,str,datetime[ms],str,i32,i32,i32
"""vizzionman.783726""",304309,1.0,"""vizzionman""",2014-02-20 11:00:00,"""United States, Wisconsin""",2.0,163.0,165,"""vizzionman""",2014-02-20 11:00:00,"""United States, Wisconsin""",1,0,0
"""zencigar.613666""",132548,1.0,"""zencigar""",2011-08-03 10:00:00,"""United States, North Carolina""",,4.0,4,"""zencigar""",2011-07-27 10:00:00,"""United States, North Carolina""",11,0,0
"""dspazz.55705""",35306,1.0,"""dspazz""",2005-12-19 11:00:00,"""United States, Massachusetts""",1.0,,1,"""dspazz""",2006-03-28 10:00:00,"""United States, Massachusetts""",1,0,0
"""matt.661184""",135122,0.866025,"""Matt""",2012-02-24 11:00:00,"""United States, California""",1.0,,1,"""MattH""",2011-09-08 10:00:00,"""United States, California""",676,0,0
"""mrphield.416905""",122122,1.0,"""MrPhield""",2010-01-18 11:00:00,"""United States, Wisconsin""",4.0,,4,"""MrPhield""",2011-01-26 11:00:00,"""United States, Wisconsin""",7,0,0


##### Beers dataset

In [20]:
df_filtered_beers = df_beers.select(["beer_id_ba", "beer_id_rb", "diff_scores", "sim_scores"])
print("Data before cleaning", df_filtered_beers.shape[0])

# BeerAdvocate
df_beers_ba = pl.read_parquet(BEERADVOCATE_PATH + "/beers.pq")
df_beers_ba = df_beers_ba.rename({col: f"{col}_ba" for col in df_beers_ba.columns})
df_filtered_beers = df_filtered_beers.join(df_beers_ba, on="beer_id_ba", how="inner")

# RateBeer
df_beers_rb = pl.read_parquet(RATEBEER_PATH + "/beers.pq")
df_beers_rb = df_beers_rb.rename({col: f"{col}_rb" for col in df_beers_rb.columns})
df_filtered_beers = df_filtered_beers.join(df_beers_rb, on="beer_id_rb", how="inner")

print("Data after cleaning", df_filtered_beers.shape[0])
print(f"Percentage of data lost: {(1 - df_filtered_beers.shape[0] / df_beers.shape[0]) * 100:.2f}%")

# Save the data
df_filtered_beers.write_parquet(DST_DATA_PATH + "/beers.pq")

Data before cleaning 45640


Data after cleaning 45636
Percentage of data lost: 0.01%


In [21]:
df_filtered_beers.sample(5)

beer_id_ba,beer_id_rb,diff_scores,sim_scores,beer_name_ba,brewery_id_ba,brewery_name_ba,style_ba,abv_ba,rating_score_avg_ba,rating_score_std_ba,rating_score_median_ba,nbr_interactions_ba,nbr_ratings_ba,nbr_reviews_ba,beer_name_rb,brewery_id_rb,brewery_name_rb,style_rb,abv_rb,rating_score_avg_rb,rating_score_std_rb,rating_score_median_rb,nbr_interactions_rb,nbr_ratings_rb,nbr_reviews_rb
i64,i64,f64,f64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64
250807,450020,0.753007,1.0,"""Essence Of Wetness""",43390,"""Cloudburst Brewing""","""American IPA""",6.7,4.233333,0.020817,4.24,3,2,1,"""Cloudburst Essence of Wetness""",23950,"""Cloudburst Brewing""","""India Pale Ale (IPA)""",6.7,3.966667,0.294392,3.85,6,0,6
60856,125398,0.906397,1.0,"""Dog Soldier Golden Ale""",23122,"""Cavalry Brewing""","""American Blonde Ale""",4.0,3.01,0.812794,3.2,25,11,14,"""Cavalry Brewing Dog Soldier Go…",11765,"""Cavalry Brewing Company""","""Golden Ale/Blond Ale""",4.0,3.015385,0.597001,3.0,13,0,13
201496,397870,1.0,1.0,"""Arkenstone""",34225,"""Twin Leaf Brewery""","""Märzen / Oktoberfest""",5.8,3.755,0.091924,3.755,2,2,0,"""Twin Leaf Arkenstone""",19018,"""Twin Leaf Brewery""","""Oktoberfest/Märzen""",5.8,3.3,0.519615,3.6,3,0,3
236956,407531,0.610242,0.842181,"""Black + Brett W/ Brouwerij De …",27920,"""Redchurch Brewery""","""American Double / Imperial Sto…",10.0,,,,0,0,0,"""Redchurch / De Molen Black & B…",13106,"""Redchurch""","""Imperial Stout""",10.0,3.895,0.318611,3.9,40,0,40
57942,105430,1.0,1.0,"""Blondy""",512,"""De 3 Horne Bierbrouwerij""","""Belgian Pale Ale""",5.3,3.68,,3.68,1,0,1,"""De 3 Horne Blondy""",1430,"""De 3 Horne""","""Belgian Ale""",5.3,2.704348,0.32957,2.7,23,0,23


##### Breweries dataset

In [22]:
df_filtered_breweries = df_breweries.select(["id_ba", "id_rb", "diff_scores", "sim_scores"])
print("Data before cleaning", df_filtered_breweries.shape[0])

# BeerAdvocate
df_breweries_ba = pl.read_parquet(BEERADVOCATE_PATH + "/breweries.pq")
df_breweries_ba = df_breweries_ba.rename({col: f"{col}_ba" for col in df_breweries_ba.columns})
df_filtered_breweries = df_filtered_breweries.join(df_breweries_ba, on="id_ba", how="inner")

# RateBeer
df_breweries_rb = pl.read_parquet(RATEBEER_PATH + "/breweries.pq")
df_breweries_rb = df_breweries_rb.rename({col: f"{col}_rb" for col in df_breweries_rb.columns})
df_filtered_breweries = df_filtered_breweries.join(df_breweries_rb, on="id_rb", how="inner")

print("Data after cleaning", df_filtered_breweries.shape[0])
print(f"Percentage of data lost: {(1 - df_filtered_breweries.shape[0] / df_breweries.shape[0]) * 100:.2f}%")

# Save the data
df_filtered_breweries.write_parquet(DST_DATA_PATH + "/breweries.pq")

Data before cleaning 8281
Data after cleaning 7428
Percentage of data lost: 10.30%


In [23]:
df_filtered_breweries.sample(5)

id_ba,id_rb,diff_scores,sim_scores,location_ba,name_ba,beers_count_ba,location_rb,name_rb,beers_count_rb
i64,i64,f64,f64,str,str,i64,str,str,i32
10101,8809,0.629784,1.0,"""Japan""","""Fujiyama Beer, K.K.""",2,"""Japan""","""Fujiyama Beer""",5
30884,16496,0.355602,1.0,"""United States, New York""","""The North Brewery""",157,"""United States, New York""","""The North Brewery""",47
35245,19082,0.706339,1.0,"""United States, Michigan""","""Beggars Brewery""",7,"""United States, Michigan""","""Beggars Brewery""",8
35337,19943,0.467084,0.896098,"""United States, Wisconsin""","""Brenner Brewing Co.""",16,"""United States, Wisconsin""","""Brenner Brewing Company""",22
36911,11730,0.389979,0.821878,"""Germany""","""Brauerei zum Klosterhof""",1,"""Germany""","""Brauerei Zum Klosterhof Heidel…",19


##### Ratings dataset

In [24]:
df_ratings_filtered = df_ratings.select(["beer_id_ba", "brewery_id_ba", "user_id_ba", "beer_id_rb", "brewery_id_rb", "user_id_rb"])

print("Data before cleaning", df_ratings_filtered.shape[0])

# BeerAdvocate
df_ratings_ba = pl.read_parquet(BEERADVOCATE_PATH + "/ratings.pq")
df_ratings_ba = df_ratings_ba.rename({col: f"{col}_ba" for col in df_ratings_ba.columns})
df_ratings_filtered = df_ratings_filtered.join(df_ratings_ba, on=["beer_id_ba", "brewery_id_ba", "user_id_ba"], how="inner")

# RateBeer
df_ratings_rb = pl.read_parquet(RATEBEER_PATH + "/ratings.pq")
df_ratings_rb = df_ratings_rb.rename({col: f"{col}_rb" for col in df_ratings_rb.columns})
df_ratings_filtered = df_ratings_filtered.join(df_ratings_rb, on=["beer_id_rb", "brewery_id_rb", "user_id_rb"], how="inner")

print("Data after cleaning", df_ratings_filtered.shape[0])
print(f"Percentage of data lost: {(1 - df_ratings_filtered.shape[0] / df_ratings.shape[0]) * 100:.2f}%")

# Save the data
df_ratings_filtered.write_parquet(DST_DATA_PATH + "/ratings.pq")

Data before cleaning 21964


Data after cleaning 21964
Percentage of data lost: 0.00%


In [25]:
df_ratings_filtered.sample(5)

beer_id_ba,brewery_id_ba,user_id_ba,beer_id_rb,brewery_id_rb,user_id_rb,rating_ba,review_ba,abv_ba,brewery_name_ba,user_name_ba,appearance_ba,palate_ba,text_ba,aroma_ba,overall_ba,taste_ba,style_ba,beer_name_ba,date_ba,rating_rb,palate_rb,abv_rb,beer_name_rb,taste_rb,date_rb,style_rb,appearance_rb,overall_rb,brewery_name_rb,text_rb,aroma_rb,user_name_rb
i64,i64,str,i64,i64,i64,f64,bool,f64,str,str,f64,f64,str,f64,f64,f64,str,str,datetime[μs],f64,f64,f64,str,f64,datetime[μs],str,f64,f64,str,str,f64,str
93047,18006,"""xmnwildx12.524792""",211651,10393,121664,3.75,False,6.0,"""Half Acre Beer Company""","""XmnwildX12""",,,,,,,"""American Pale Lager""","""Stargrazer""",2013-12-05 12:00:00,3.8,4.0,6.0,"""Half Acre Stargrazer""",8.0,2013-05-19 12:00:00,"""Premium Lager""",4.0,15.0,"""Half Acre Beer Company""","""On tap at Half Acre, Golden po…",7.0,"""XmnwildX12"""
6604,913,"""corby112.268461""",13159,2040,55936,4.1,True,5.3,"""Brauerei Spezial""","""corby112""",4.0,4.0,"""On draft at the Memphis Taproo…",4.0,4.5,4.0,"""Rauchbier""","""Spezial Rauchbier Märzen""",2010-06-06 12:00:00,4.0,4.0,5.3,"""Spezial Rauchbier Märzen""",7.0,2010-06-07 12:00:00,"""Smoked""",4.0,17.0,"""Brauerei Spezial""","""On draft at the Memphis Taproo…",8.0,"""corby112"""
98054,31793,"""jonb5.458362""",173892,14561,112403,3.75,True,5.6,"""Ratsherrn Brauerei GmbH""","""jonb5""",,,"""330ml bottle, poured into La C…",,,,"""English Pale Ale""","""Ratsherrn Pale Ale""",2014-08-23 12:00:00,3.5,4.0,5.6,"""Ratsherrn Pale Ale""",6.0,2014-08-23 12:00:00,"""American Pale Ale""",4.0,15.0,"""Ratsherrn Brauerei""","""330ml bottle, poured into La C…",6.0,"""Jonb5"""
170636,141,"""puboflyons.237852""",381451,20,54937,4.09,True,7.3,"""Smuttynose Brewing Company""","""puboflyons""",4.5,4.0,"""From the 650 ml. bottle with a…",4.25,4.0,4.0,"""Milk / Sweet Stout""","""Smuttynose Rocky Road Stout (B…",2015-12-30 12:00:00,4.2,4.0,7.3,"""Smuttynose Big Beer Series: Ro…",8.0,2015-12-30 12:00:00,"""Stout""",5.0,16.0,"""Smuttynose Brewing Company""","""650 ml. bottle with a bottling…",9.0,"""puboflyons"""
98525,273,"""chaingangguy.8942""",225244,34,32031,3.14,True,9.0,"""SweetWater Brewing Company""","""ChainGangGuy""",3.25,3.0,"""22 ounce bottle - $5.98 at She…",3.5,3.0,3.0,"""Belgian Strong Pale Ale""","""Sweetwater Dank Tank The Price…",2013-11-08 12:00:00,3.1,3.0,9.0,"""Sweetwater Dank Tank The Price…",6.0,2013-11-07 12:00:00,"""Belgian Strong Ale""",3.0,12.0,"""Sweetwater Brewing Company""","""22 ounce bottle - $5.98 at She…",7.0,"""ChainGangGuy"""


### Discussion
As we observed in the data exploration of RateBeer and MatchedBeer, we lose very little information about beers, users, or ratings. The only notable loss is in the breweries table, where approximately 10% of the rows were removed. Given that no rating was lost, and that the filtering was applied when processing the files used for the matching here, we can be confident that the removed breweries were not relevant, since they did not have any reviews.