# Data Analysis and Cleaning of MatchedBeer Dataset

The MatchedBeer dataset combines data from BeerAdvocate and RateBeer, two of the most prominent online platforms for beer ratings. It was created to study the herding effect in online ratings, where early ratings skew subsequent user opinions. Because MatchedBeer pairs the same beers across the two platforms, where they are rated independently, it offers a unique opportunity to analyze rating patterns free of this bias.

## Matching Process Overview
The matching process, which aligns beers across the two websites, is described in detail in the paper ["When Sheep Shop: Measuring Herding Effects in Product Ratings with Natural Experiments"](https://dlab.epfl.ch/people/west/pub/Lederrey-West_WWW-18.pdf). The matching is performed in two phases to prioritize precision:

1. **Brewery Alignment**: Breweries from BeerAdvocate and RateBeer are only compared if their locations (state or country) match exactly. Their names are then represented as TF-IDF vectors and compared using cosine similarity.
2. **Beer Alignment**: Within matched breweries, beers are aligned based on name similarity, and a match additionally requires identical alcohol by volume (ABV). Before matching, tokens shared by the brewery name and the beer name are removed to improve accuracy.

The algorithm favors precision over recall: a pair is accepted only if its cosine similarity exceeds a strict threshold and clearly beats the second-best candidate. The reported precision is 99.6% for brewery matching and 100% for beer matching. Only highly confident matches are included, so some true matches may be missing from the dataset.
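
To make the two acceptance criteria concrete, here is a minimal sketch of name matching with TF-IDF vectors and cosine similarity, using only the standard library. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names, and the `min_sim` and `min_gap` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Return one sparse TF-IDF vector (dict: token -> weight) per name."""
    tokenized = [name.lower().split() for name in names]
    n_docs = len(tokenized)
    # Document frequency: in how many names does each token occur?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption; other variants exist)
            token: count * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query, candidates, min_sim=0.8, min_gap=0.3):
    """Return the best candidate, or None if the match is not confident.

    min_sim and min_gap are illustrative placeholders, not the paper's values.
    """
    vectors = tfidf_vectors([query] + candidates)
    scores = sorted(
        ((cosine(vectors[0], vec), name)
         for vec, name in zip(vectors[1:], candidates)),
        reverse=True,
    )
    best_sim, best_name = scores[0]
    second_sim = scores[1][0] if len(scores) > 1 else 0.0
    # Accept only a similar match that clearly beats the runner-up
    if best_sim >= min_sim and best_sim - second_sim >= min_gap:
        return best_name
    return None


# Hypothetical candidates that share a location with the query brewery
candidates = ["Stone Brewing Company", "Sierra Nevada Brewing Company"]
confident = best_match("Stone Brewing Company", candidates)
ambiguous = best_match("Brewing Company", candidates)
print(confident, ambiguous)
```

In this toy example the exact name is accepted, while the generic query `"Brewing Company"` is rejected because no candidate is similar enough. This mirrors the trade-off described above: ambiguous pairs are dropped rather than risk a wrong match.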

## Structure of MatchedBeer Dataset
The dataset contains essential information on beers, breweries, users, and user ratings from both platforms. Since the user bases of BeerAdvocate and RateBeer differ geographically, the combined dataset reflects a more balanced view. BeerAdvocate is more U.S.-centric, while RateBeer has a more internationally diverse user base. Despite these differences, the matched dataset has been verified for both internal and external validity, making it well-suited for analyzing the influence of initial ratings on user perceptions across platforms.

MatchedBeer provides insights into how early ratings impact subsequent ones by comparing ratings for the same beer on both sites, especially when early ratings on each platform differ significantly. By standardizing and aligning ratings across time and platforms, MatchedBeer reveals patterns that allow for the identification of herding effects, helping researchers understand the biases in user-generated ratings.

In [None]:
# Import all the necessary libraries
import polars as pl
import tqdm
import os

# Import the file in the utils folder
import sys
sys.path.append('../../utils')
from data_desc import remove_whitespaces, describe_numbers, describe

In [None]:
# Define the paths
SRC_DATA_PATH = '../../../data/MatchedBeer'
DST_DATA_PATH = '../../../data/MatchedBeer/processed'

# Create the DST_DATA_PATH if it does not exist
os.makedirs(DST_DATA_PATH, exist_ok=True)

#### Data Exploration

In [None]:
# Load the MatchedBeer files and suffix every column with _ba or _rb,
# depending on which platform it comes from
def load_and_process_files(file_paths):
    processed_dataframes = []
    
    for file_path in file_paths:
        # Load the first row to get the data source identifiers (e.g., "ba", "ba_1", "rb", etc.)
        header_row = pl.read_csv(file_path, n_rows=1, has_header=False)
        column_source = header_row.row(0)
        
        # Clean the source identifiers to only keep "ba" or "rb" without any suffixes
        source_identifiers = [source.split('_')[0] for source in column_source]
        
        # Load the actual data, skipping the first row
        df = pl.read_csv(file_path, skip_rows=1)
        
        # Rename columns based on the cleaned source identifiers
        new_column_names = [
            f"{name}_{source}".replace("_duplicated_0", "") for name, source in zip(df.columns, source_identifiers)
        ]
        df = df.rename({old: new for old, new in zip(df.columns, new_column_names)})
        
        # Append the processed DataFrame to the list
        processed_dataframes.append(df)
    
    return processed_dataframes
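
The renaming step above relies on the files having two header rows: one naming the source and one naming the column. As a minimal sketch of the same idea, the following example uses the standard-library `csv` module instead of polars. The sample content and the helper name `combine_header_rows` are made up for illustration; only the two-row layout is taken from the function above.

```python
import csv
import io

# Hypothetical file content: row 1 holds the source identifiers,
# row 2 holds the column names, and the remaining rows hold the data.
SAMPLE_CSV = """ba,ba_1,rb,rb_1,scores,scores_1
beer_id,abv,beer_id,abv,diff,sim
1,5.0,10,5.0,0.8,0.99
2,6.5,20,6.5,0.7,0.98
"""


def combine_header_rows(csv_text):
    """Merge the two header rows into names such as 'beer_id_ba'."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Strip numeric suffixes: "ba_1" -> "ba", "scores_1" -> "scores"
    sources = [cell.split("_")[0] for cell in rows[0]]
    columns = [f"{name}_{source}" for name, source in zip(rows[1], sources)]
    # Values stay strings here; polars would infer the column types
    records = [dict(zip(columns, row)) for row in rows[2:]]
    return columns, records


columns, records = combine_header_rows(SAMPLE_CSV)
print(columns)
```

Because the names are built directly from the two header rows, this version never produces the `_duplicated_0` suffix that the polars version has to strip.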

In [None]:
path = "./Dataset/MatchedBeer/"
file_paths = [path + "beers.csv", path + "breweries.csv", path + "ratings.csv", path + "users.csv"]
dataframes_list = load_and_process_files(file_paths)

df_beers = dataframes_list[0]
df_breweries = dataframes_list[1]
df_ratings = dataframes_list[2]
df_users = dataframes_list[3]

##### Beers dataset
We'll start by looking at the beers dataset.

In [None]:
df_beers.sample(5)

This dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for each beer in BA/RB | 
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `beer_wout_brewery_name_ba` / `beer_wout_brewery_name_rb` | Beer name without brewery name in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings in BA/RB |
| `nbr_reviews_ba` | Number of reviews in BA |
| `avg_ba` / `avg_rb` | Average rating in BA/RB |
| `avg_computed_ba` / `avg_computed_rb` | Computed average rating in BA/RB |
| `zscore_ba` / `zscore_rb` | Standardized score in BA/RB |
| `nbr_matched_valid_ratings_ba` / `nbr_matched_valid_ratings_rb` | Number of matched valid ratings in BA/RB |
| `avg_matched_valid_ratings_ba` / `avg_matched_valid_ratings_rb` | Average of matched valid ratings in BA/RB |
| `ba_score_ba` | BeerAdvocate score |
| `bros_score_ba` | Bros score in BA |
| `overall_score_rb` | Overall score in RB |
| `style_score_rb` | Style-specific score in RB |
| `diff_scores` | Matching score: gap between the best and second-best candidate match |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB entries |

Now that we have an idea of what the structure of the beers dataset looks like, let's see whether it is complete or contains a lot of missing entries.

In [None]:
describe(df_beers)

In [None]:
describe_numbers(df_beers, filters=["beer_id_ba", "brewery_id_ba", "beer_id_rb", "brewery_id_rb"])

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns show identical distributions across BeerAdvocate and RateBeer, with a mean ABV of approximately 6.32%.
   - This is expected, because beers are only matched when their ABV agrees exactly. The maximum ABV of 67.5% is likely an outlier representing a rare extreme beer, as typical ABVs are much lower.

2. Average Ratings (BeerAdvocate vs. RateBeer)
   - The `avg_ba` (BeerAdvocate) mean is 3.73, while `avg_rb` (RateBeer) is significantly lower at 3.14.
   - This aligns with previous findings that BeerAdvocate users tend to rate beers more generously than RateBeer users. The higher `std` value for `avg_ba` also indicates a wider range in BeerAdvocate’s ratings, suggesting more variability in user opinions.

3. Computed Averages (Consistency in Ratings)
   - `avg_computed_ba` and `avg_computed_rb` refer to recalculated averages, showing slightly lower standard deviations than their direct counterparts (`avg_ba` and `avg_rb`), indicating a slight smoothing effect.
   - Both BeerAdvocate and RateBeer have computed averages (`avg_computed_ba` vs. `avg_computed_rb`) that are fairly close to the respective raw averages, implying that the original averages align well with user trends.

4. Rating Counts
   - The median `nbr_ratings_ba` (3) and `nbr_ratings_rb` (5) are low, showing that many beers have few ratings on both sites, though the typical beer has slightly more ratings on RateBeer.
   - The count of matched valid ratings also varies more on RateBeer (standard deviation of 80.10 for `nbr_matched_valid_ratings_rb` vs. 42.51 for `nbr_matched_valid_ratings_ba`), meaning that some beers attract far more ratings than others there.

5. Z-scores (Standardized Ratings)
   - The mean `zscore_ba` (-0.41) is lower than the mean `zscore_rb` (-0.10), so matched beers tend to fall further below the mean on BeerAdvocate than on RateBeer. Relative to each platform's own baseline, this may indicate slightly more favorable ratings on RateBeer.
   - BeerAdvocate’s z-scores also exhibit a broader spread (standard deviation of 0.81 compared to 0.73 for RateBeer), which is consistent with more polarized opinions on BeerAdvocate.

6. Matching Scores (`diff_scores` and `sim_scores`)
   - These two columns appear to describe the matching step rather than the ratings: `sim_scores` is the cosine similarity between the matched names, and `diff_scores` is the gap between the best and second-best candidate.
   - `sim_scores` is high (mean of 0.986) and `diff_scores` has a mean of 0.798, which is consistent with the strict similarity threshold and the large gap that the matching process requires.
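
Since the z-scores are what make ratings comparable across platforms, here is a minimal sketch of standardization using the standard library. It only illustrates the general formula (subtract the mean, divide by the standard deviation). How the dataset's `zscore_ba` and `zscore_rb` columns were actually computed (for example, over which group of ratings) is not shown here, and the sample ratings are made up.

```python
from statistics import mean, pstdev


def zscores(ratings):
    """Standardize a list of ratings to zero mean and unit variance."""
    mu = mean(ratings)
    sigma = pstdev(ratings)  # population standard deviation
    if sigma == 0:
        # All ratings are identical, so none of them deviates from the mean
        return [0.0 for _ in ratings]
    return [(rating - mu) / sigma for rating in ratings]


# Hypothetical ratings for the same three beers on each platform.
# The RB ratings are uniformly one point lower than the BA ratings.
ratings_ba = [3.5, 4.0, 4.5]
ratings_rb = [2.5, 3.0, 3.5]

z_ba = zscores(ratings_ba)
z_rb = zscores(ratings_rb)
print(z_ba, z_rb)
```

Both lists standardize to the same values, because a constant offset between platforms disappears once each platform is measured against its own mean. This is why raw averages such as `avg_ba` and `avg_rb` can differ substantially while z-scores remain comparable.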

##### Breweries dataset

Now we can take a look at our breweries dataframe. For each platform, every brewery has a unique identifier, a name, a location, and the number of beers it produces. We also have the similarity and difference scores of the match between the two platforms. The brewery location can be very useful for our research questions, and we will look into it in more detail later in this notebook.

In [None]:
df_breweries.sample(5)

The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `id_ba` / `id_rb` | Unique identifier for each brewery in BA/RB | 
| `name_ba` / `name_rb` | Name of the brewery in BA/RB |
| `location_ba` / `location_rb` | Geographic location of the brewery in BA/RB |
| `nbr_beers_ba` / `nbr_beers_rb` | Number of beers produced by the brewery in BA/RB |
| `diff_scores` | Matching score: gap between the best and second-best candidate match |
| `sim_scores` | Matching score: cosine similarity between the matched BA and RB brewery names |

In [None]:
describe(df_breweries)

In [None]:
describe_numbers(df_breweries, filters=["id_ba", "id_rb"])

For our breweries on both platforms, we have no missing values (in this matched data).

##### Users dataset

**Note**: The users_approx data is essentially the same, except that users are not matched by exact username across the two platforms. Instead, usernames are compared using a TF-IDF vectorizer and cosine similarity, and two accounts are matched if their usernames are close enough and their locations are the same.

For each of our users (for each platform), we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [None]:
df_users.sample(5)

The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `user_id_ba` / `user_id_rb` | Unique identifier for each user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `user_name_lower_ba` / `user_name_lower_rb` | Lowercase username in BA/RB |
| `joined_ba` / `joined_rb` | Date when the user joined BA/RB |
| `location_ba` / `location_rb` | Geographic location of the user in BA/RB |
| `nbr_ratings_ba` / `nbr_ratings_rb` | Number of ratings submitted by the user in BA/RB |
| `nbr_reviews_ba` | Number of reviews submitted by the user in BA |

In [None]:
describe(df_users)

In [None]:
describe_numbers(df_users, filters=["user_id_ba", "user_id_rb"])

The `joined` columns are stored as Unix timestamps in seconds. To make them more readable, we convert them to milliseconds and cast them to a datetime type.

In [None]:
df_users = df_users.with_columns((pl.col("joined_ba").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users = df_users.with_columns((pl.col("joined_rb").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users.sample(5)

##### Ratings

In [None]:
df_ratings.sample(5)

In [None]:
describe(df_ratings)

The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `beer_id_ba` / `beer_id_rb` | Unique identifier for the beer in BA/RB |
| `beer_name_ba` / `beer_name_rb` | Name of the beer in BA/RB |
| `brewery_id_ba` / `brewery_id_rb` | Unique identifier for the brewery in BA/RB |
| `brewery_name_ba` / `brewery_name_rb` | Name of the brewery in BA/RB |
| `style_ba` / `style_rb` | Beer style in BA/RB |
| `abv_ba` / `abv_rb` | Alcohol By Volume percentage in BA/RB |
| `user_id_ba` / `user_id_rb` | Unique identifier for the user in BA/RB |
| `user_name_ba` / `user_name_rb` | Username of the reviewer in BA/RB |
| `date_ba` / `date_rb` | Date of the review in BA/RB |
| `rating_ba` / `rating_rb` | Overall rating given by the user in BA/RB |
| `overall_ba` / `overall_rb` | Overall score in BA/RB |
| `appearance_ba` / `appearance_rb` | Appearance score in BA/RB |
| `aroma_ba` / `aroma_rb` | Aroma score in BA/RB |
| `palate_ba` / `palate_rb` | Palate score in BA/RB |
| `taste_ba` / `taste_rb` | Taste score in BA/RB |
| `text_ba` / `text_rb` | Review text in BA/RB |
| `review_ba` | Whether or not there is a BA review available |

In [None]:
describe_numbers(df_ratings, filters=["beer_id_ba", "beer_id_rb", "brewery_id_ba", "brewery_id_rb", "user_id_ba", "user_id_rb"])

Here we see:

1. Alcohol By Volume (ABV)
   - The `abv_ba` and `abv_rb` columns are again identical, as expected given that beers are matched on exact ABV.

2. Rating Categories on BeerAdvocate (BA)
   - `appearance_ba`, `aroma_ba`, `palate_ba`, `taste_ba`, and `overall_ba` ratings on BeerAdvocate have similar means (ranging from 3.68 to 3.78) with a consistent standard deviation of around 0.6.
   - The median for each category is close to 4, suggesting that BeerAdvocate users tend to give positive ratings overall, as the maximum for these is 5.
   - The overall rating (`rating_ba`) has a mean of 3.71, indicating a generally favorable tendency in ratings across these categories.

3. Rating Categories on RateBeer (RB)
   - `appearance_rb`, `palate_rb`, and `rating_rb` have means similar to those on BeerAdvocate (3.53 to 3.69), suggesting that for these categories, user ratings are comparable across platforms.
   - `aroma_rb` and `taste_rb`, however, have higher maximum scores of 10, with means of 6.96 and 7.01, respectively. This difference likely reflects a different scoring system or metric on RateBeer, which will need to be accounted for in our analysis.
   - `overall_rb` shows a much broader scale, with a mean of 14.40 and a maximum of 20, indicating that RateBeer may use a larger rating scale for this category than BeerAdvocate does. Again this will need to be accounted for.

4. General Trends
   - BeerAdvocate ratings are generally more constrained within a 1-5 range for all categories, while RateBeer uses a broader scale for certain categories (`aroma_rb`, `taste_rb`, and `overall_rb`). This suggests a difference in rating systems that should be considered in cross-platform comparisons.
   - Despite these scale differences, the means for comparable categories like `appearance`, `palate`, and `rating` are fairly aligned between platforms, implying similar user perceptions on these aspects of beer.
