# Data analysis and cleaning of BeerAdvocate dataset

BeerAdvocate is a popular platform for beer enthusiasts to rate and review beers. The dataset provides information about beers, breweries, users and the ratings of the users for the beers on the platform.

On the website of BeerAdvocate, users can create an account. After creating an account with a username and password, they are asked to provide their location, but this is optional and not verified. 

Users can enter a beer into the database. Upon entering a beer, they have to select the location of the brewery where the beer is produced. They are then able to provide the beer name, the beer style and optionally an ABV (alcohol percentage).

Once a beer is in the database, any user can rate it. Users can express their opinion on a beer in two ways: through ratings and through reviews. In the current version of the platform, a rating consists of Look, Smell, Taste, Feel and Overall scores ([the rating system received a major update in 2014](https://www.beeradvocate.com/community/threads/beeradvocate-returns-to-its-roots-rating-system-revamped.238804/)). Scores range from 1 to 5 in quarter-point steps (e.g. 3.75). Because the dataset contains data acquired both before and after this update, the format of the ratings varies slightly over time.

Users can then optionally add a review, i.e. a text description of what they think about the beer. In some cases, as explained in the article, users must write a text review when their rating is an outlier, which helps to reduce fraudulent reviews.

Our sources to better study this dataset are the following:
- [BeerAdvocate's forum](https://www.beeradvocate.com/community/threads/new-ba-score-beer-ranking-more-updates.537406/)
- [BeerAdvocate's website](https://www.beeradvocate.com)

## Preliminary work

### Install and import all the needed libraries

In [1]:
# Import all the necessary libraries
import polars as pl
import tqdm
import os

# Import the file in the utils folder
import sys
sys.path.append('../../utils')
from data_desc import remove_whitespaces, describe_numbers, describe

In [2]:
# Define the paths
SRC_DATA_PATH = '../../../data/BeerAdvocate'
DST_DATA_PATH = '../../../data/BeerAdvocate/processed'

# Create the DST_DATA_PATH if it does not exist
os.makedirs(DST_DATA_PATH, exist_ok=True)

## Data exploration

In [3]:
df_beers = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/beers.csv"))
df_breweries = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/breweries.csv"))
df_users = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/users.csv"))
df_ratings = remove_whitespaces(pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq"))

##### Beers dataset
We'll start by looking at the beers dataset.

In [4]:
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,i64,f64,f64,f64,f64,f64,f64,i64,f64
117715,"""Chinook Copper Ale""",16396,"""Northwest Brewing Company""","""American Amber / Red Ale""",11,1,2.72,80.0,,5.3,3.273636,,0,
250308,"""Bridge Street IPA""",36855,"""2 Witches Winery & Brewing Co.""","""American IPA""",1,0,3.98,,,7.2,3.98,,0,
198914,"""Foxes Rock""",36729,"""Cumberland Breweries""","""English India Pale Ale (IPA)""",3,1,3.75,,,4.5,3.843333,,0,
157379,"""Frangelic Stout With Madagasca…",1199,"""Founders Brewing Company""","""American Stout""",3,0,3.62,,,4.2,3.62,,0,
272450,"""HMB Stout""",38602,"""Mad Beach Craft Brewing Compan…","""American Stout""",1,1,4.25,,,5.3,4.25,0.548961,1,4.25


The dataset has the following structure

| Column Name | Description | 
|-------------|-------------|
| `beer_id` | Unique identifier for each beer |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `nbr_ratings` | Number of ratings (text + non text) the beer has received |
| `nbr_reviews` | Number of written reviews (only text) the beer has received |
| `avg` | Average rating of the beer |
| `ba_score` | BeerAdvocate score for the beer (fraction of raters who gave the beer a 3.75 or higher) |
| `bros_score` | Score given by the Bros (Todd and Jason Alström, the BeerAdvocate founders) |
| `abv` | Alcohol By Volume percentage of the beer |
| `avg_computed` | Computed average rating (in some cases it differs from `avg` due to different calculation methods) |
| `zscore` | Standardized score indicating how many standard deviations the beer's rating is from the mean |
| `nbr_matched_valid_ratings` | Number of matched valid ratings |
| `avg_matched_valid_ratings` | Average of matched valid ratings |

Now that we have an idea of what the structure of our dataset looks like, let's see if our dataset is complete or whether it contains a lot of missing entries. 

In [5]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   | Int64   |           280823 |             0 |    0.00 % |                280823 |          100.00 % |
| beer_name                 | String  |           280823 |             0 |    0.00 % |                236090 |           84.07 % |
| brewery_id                | Int64   |           280823 |             0 |    0.00 % |                 14325 |            5.10 % |
| brewery_name              | String  |           280823 |             0 |    0.00 % |                 14098 |            5.02 % |
| style                     | String  |           280823 |             0 |    0.00 

In [6]:
describe_numbers(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+-----------+----------+-----------+-----------+-----------+---------+---------+
| Column                    |      Mean |      Std |       25% |       50% |       75% |     Min |     Max |
|---------------------------+-----------+----------+-----------+-----------+-----------+---------+---------|
| nbr_ratings               |   29.8873 |   231.01 |         1 |         2 |         8 |       0 |   16509 |
| nbr_reviews               |   9.22142 |  68.8664 |         0 |         1 |         2 |       0 |    3899 |
| avg                       |   3.72103 | 0.476003 |       3.5 |      3.78 |      4.01 |       0 |       5 |
| ba_score                  |   84.6333 |  4.05272 |        83 |        85 |        86 |      46 |     100 |
| bros_score                |   84.8066 |  10.5077 |        81 |        87 |        91 |      31 |     100 |
| abv                       |   6.49137 |  2.05407 |         5 |         6 |       7.5 |    0.01 |    67.5 |
| avg_computed     

We can see that:
- The beer name and brewery information are present for every beer within the dataset
- Some data are missing for abv, ba_score, bros_score and the other aggregated scores
- The ba_score and the bros_score are in a 0 - 100 range
- The ba_score and bros_score have a very similar mean, but the bros_score has a far larger spread. This seems to signal that users of the website tend to give less extreme ratings than the Bros (the owners of the website)
- The avg score is in a 0 - 5 range (meaning that the scores are on a 0 - 5 scale)
- The abv at first seems to have some outliers. The max in our table above shows we have values with ABV of 67.5 percent.

However, a closer look at the beers with such a high ABV shows that they actually exist and are not outliers. For example, the 'Brewdog Sink the Bismarck!' beer has 41% ABV and the 'Brewmeister Armageddon' has 65%. These are therefore perfectly fine entries and we will keep them in our dataframe.

Now we'll focus on dealing with the missing values. <br><br>
Regarding the ABV values, we will not fill the missing values here. ABV cannot be easily guessed, and approximating it with the mean or median would introduce a significant bias in the dataset.

In [7]:
# Number of ratings received by beers with a missing ABV, and total number of ratings
rev_nan_beers = df_beers.filter(pl.col("abv").is_null()).select(pl.col("nbr_ratings")).sum().item()
tot_rev = df_beers.select(pl.col("nbr_ratings")).sum().item()

# Print the percentage of ratings that belong to beers with a missing ABV
print(f"Percentage of missing values in the 'abv' column: {rev_nan_beers / tot_rev * 100:.2f}%")

Percentage of missing values in the 'abv' column: 2.04%


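Note that this percentage is weighted by the number of ratings: it is the share of ratings that belong to beers with a missing ABV. By ignoring these beers we therefore lose only a small fraction of the ratings. This makes sense, since a beer with a high number of ratings is unlikely to have a missing ABV value.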

Regarding the null values of the aggregated scores:
- In most cases these null values are associated with beers that don't have any reviews. Filling them with the mean or median of the dataset would not make sense, since it would introduce a bias.
- Most of the values will change after the processing and cleaning of the data.

For these two reasons we are not going to worry too much now about the missing values of the aggregated scores.

Since beers are added by hand by users, the same beer can end up being created twice.

In [8]:
beers_duplicates = df_beers.filter(pl.struct(["beer_name", "brewery_id"]).is_duplicated())
beers_duplicates.head(5)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,i64,f64,f64,f64,f64,f64,f64,i64,f64
207341,"""10|05 Coffee Porter (San Sebas…",33192,"""Brew By Numbers""","""English Porter""",1,0,4.05,,,6.5,4.05,,0,
255983,"""10|05 Coffee Porter (San Sebas…",33192,"""Brew By Numbers""","""English Porter""",1,0,4.13,,,6.5,4.13,,0,
27803,"""Nappa Scar""",12981,"""Yorkshire Dales Brewing Compan…","""Extra Special / Strong Bitter …",1,1,3.5,,,4.8,3.5,,0,
289247,"""Nappa Scar""",12981,"""Yorkshire Dales Brewing Compan…","""English Bitter""",1,0,3.5,,,4.0,3.5,,0,
259896,"""Sage Farm""",44830,"""Half Hours on Earth""","""Saison / Farmhouse Ale""",3,1,3.54,,,6.0,3.76,,0,


Since some of the values (e.g. the abv or the style) are not always consistent between duplicates, we cannot tell which entry is the correct one. To reduce the risk of errors or bias, we are going to drop every copy of a duplicated beer.

In [9]:
# Collect the ids of every duplicated beer and drop all of them
dropped_beers_ids = beers_duplicates.get_column("beer_id").to_list()
df_beers = df_beers.filter(~pl.col("beer_id").is_in(dropped_beers_ids))

To better handle the data we are also going to split nbr_ratings into two distinct columns:
- nbr_interactions: the total number of interactions (ratings with text + ratings without text)
- nbr_ratings: only the ratings without text

In [10]:
# Keep the original total (ratings with and without text) in a new nbr_interactions column
df_beers = df_beers.with_columns(pl.col("nbr_ratings").alias("nbr_interactions"))
# Redefine nbr_ratings as the number of ratings without text
df_beers = df_beers.with_columns((pl.col("nbr_ratings") - pl.col("nbr_reviews")).alias("nbr_ratings"))
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings,nbr_interactions
i64,str,i64,str,str,i64,i64,f64,f64,f64,f64,f64,f64,i64,f64,i64
210799,"""Passion For Blood""",43919,"""Inoculum Ale Works""","""Berliner Weissbier""",0,0,,,,3.2,,,0,,0
260538,"""Sour LFU Rouge""",30340,"""Foolproof Brewing Company""","""Saison / Farmhouse Ale""",1,0,3.35,,,8.0,3.35,,0,,1
850,"""Killarney's Red Lager""",29,"""Anheuser-Busch""","""American Amber / Red Lager""",13,13,2.47,74.0,81.0,5.0,2.442308,,0,,26
246813,"""Brown Dwarf""",33824,"""Bottle Logic Brewing""","""American Brown Ale""",2,0,3.77,,,4.5,3.77,-0.445024,0,,2
259195,"""Opaque Minds""",35259,"""Alvarado Street Brewery""","""American IPA""",24,4,4.24,88.0,,7.0,4.223929,,0,,28


Let's now see the data after the cleaning process.

In [11]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   | Int64   |           280791 |             0 |    0.00 % |                280791 |          100.00 % |
| beer_name                 | String  |           280791 |             0 |    0.00 % |                236078 |           84.08 % |
| brewery_id                | Int64   |           280791 |             0 |    0.00 % |                 14325 |            5.10 % |
| brewery_name              | String  |           280791 |             0 |    0.00 % |                 14098 |            5.02 % |
| style                     | String  |           280791 |             0 |    0.00 

In [12]:
describe_numbers(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+-----------+----------+-----------+-----------+-----------+---------+---------+
| Column                    |      Mean |      Std |       25% |       50% |       75% |     Min |     Max |
|---------------------------+-----------+----------+-----------+-----------+-----------+---------+---------|
| nbr_ratings               |   20.6667 |  167.707 |         0 |         1 |         5 |       0 |   12698 |
| nbr_reviews               |     9.222 |  68.8703 |         0 |         1 |         2 |       0 |    3899 |
| avg                       |   3.72101 | 0.476008 |       3.5 |      3.78 |      4.01 |       0 |       5 |
| ba_score                  |   84.6329 |  4.05295 |        83 |        85 |        86 |      46 |     100 |
| bros_score                |   84.8039 |  10.5067 |        81 |        87 |        91 |      31 |     100 |
| abv                       |   6.49145 |  2.05412 |         5 |         6 |       7.5 |    0.01 |    67.5 |
| avg_computed     

##### Breweries dataset

Now we can take a look at our breweries dataframe. For each brewery we have a location, a name and the number of beers they produce, along with a unique identifier. The brewery location can be very useful for our research questions. We will look into this more later on in this notebook.

In [13]:
df_breweries.sample(5)

id,location,name,nbr_beers
i64,str,str,i64
7287,"""Austria""","""Brauerei Schönbach Ing. Hubert…",0
49642,"""Canada""","""Annex Ale Project""",1
2301,"""Italy""","""Manerba Brewery""",6
42214,"""United States, New Hampshire""","""Neighborhood Beer Co.""",21
45955,"""United States, Washington""","""Stones Throw Brewing Co.""",23


The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `id` | Unique identifier for each brewery |
| `location` | Geographic location of the brewery (country, plus the state for the United States) |
| `name` | Name of the brewery |
| `nbr_beers` | Number of beers produced by the brewery |

In [14]:
describe(df_breweries)

+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column    | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id        | Int64  |            16758 |             0 |    0.00 % |                 16758 |          100.00 % |
| location  | String |            16758 |             0 |    0.00 % |                   297 |            1.77 % |
| name      | String |            16758 |             0 |    0.00 % |                 16237 |           96.89 % |
| nbr_beers | Int64  |            16758 |             0 |    0.00 % |                   273 |            1.63 % |
+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [15]:
describe_numbers(df_breweries, filters=["id"])

+-----------+---------+---------+-------+-------+-------+-------+-------+
| Column    |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-----------+---------+---------+-------+-------+-------+-------+-------|
| nbr_beers | 21.0563 | 69.4178 |     2 |     6 |    18 |     0 |  1196 |
+-----------+---------+---------+-------+-------+-------+-------+-------+


For our breweries, we have no missing values. <br><br>
From the analysis we can see that some of our breweries have a non-unique name. We need to verify whether multiple breweries with the same name exist in the same location.

In [16]:
breweries_duplicates = df_breweries.filter(pl.struct(["name", "location"]).is_duplicated())
breweries_duplicates

id,location,name,nbr_beers
i64,str,str,i64
34598,"""Wales""","""Rhymney Brewery""",2
12936,"""Wales""","""Rhymney Brewery""",13
11410,"""England""","""Bartrams Brewery""",11
7095,"""England""","""Bartrams Brewery""",0
31304,"""England""","""Dorset Piddle Brewery""",0
…,…,…,…
16078,"""United States, Ohio""","""The Brew Keeper""",27
1127,"""United States, Georgia""","""Buckhead Brewery and Grill""",26
16099,"""United States, Florida""","""Spanish Springs Brewing Compan…",0
25439,"""United States, Michigan""","""Malty Dog Brewery & Supplies""",0


We see that some breweries share both their name and their location. There are two possible explanations:
- Multiple users have inserted the same brewery into the database
- Multiple distinct breweries in the same location have the same name

Both explanations are reasonable, but we can't tell which one is correct. To avoid introducing errors in our dataset, we are going to drop all of these breweries.

In [17]:
# Collect the ids of every duplicated brewery as a plain list for is_in()
dropped_breweries_ids = breweries_duplicates.get_column("id").to_list()
df_breweries = df_breweries.filter(~pl.col("id").is_in(dropped_breweries_ids))

Since we modified both the breweries and the beers datasets, let's recompute the shared values. At the same time we are going to drop the breweries that don't have any beer associated with them.

In [18]:
# Drop all the beers whose brewery has been dropped
df_beers = df_beers.filter(~pl.col("brewery_id").is_in(dropped_breweries_ids))

# Recompute the number of beers per each brewery
aggregated_value = df_beers.group_by("brewery_id").agg(pl.col("beer_id").count().alias("beers_count")).rename({"brewery_id": "id"}).cast(pl.Int64)

# Add the new column to the breweries dataset
df_breweries = df_breweries.join(aggregated_value, on="id", how="inner")
df_breweries = df_breweries.drop("nbr_beers")

df_breweries.sample(5)

id,location,name,beers_count
i64,str,str,i64
9826,"""United States, Wyoming""","""Altitude Chophouse & Brewery""",47
24677,"""Lithuania""","""Nat&#363;ralios Sultys UAB""",1
9538,"""United States, Washington""","""Atomic Ale Brewpub And Eatery""",28
2309,"""United States, Nevada""","""Ellis Island Casino & Brewery""",14
24747,"""United States, Colorado""","""Dad & Dude's Breweria""",11


In [19]:
describe(df_breweries)

+-------------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id          | Int64  |            14028 |             0 |    0.00 % |                 14028 |          100.00 % |
| location    | String |            14028 |             0 |    0.00 % |                   277 |            1.97 % |
| name        | String |            14028 |             0 |    0.00 % |                 13966 |           99.56 % |
| beers_count | Int64  |            14028 |             0 |    0.00 % |                   270 |            1.92 % |
+-------------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [20]:
describe_numbers(df_breweries, filters=["id"])

+-------------+---------+---------+-------+-------+-------+-------+-------+
| Column      |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-------------+---------+---------+-------+-------+-------+-------+-------|
| beers_count | 19.5155 | 40.0882 |     3 |     8 |    20 |     1 |  1196 |
+-------------+---------+---------+-------+-------+-------+-------+-------+


While some beers still share the same name, we are now sure that these beers are produced by different breweries. <br><br>
We are going to recompute the number of ratings, interactions and reviews for each beer later in this notebook.

##### Users dataset

For each of our users, we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [21]:
df_users.sample(5)

nbr_ratings,nbr_reviews,user_id,user_name,joined,location
i64,i64,str,str,i64,str
34,4,"""mhsmall84.849591""","""mhsmall84""",1408615200,"""United States, Wisconsin"""
1,0,"""mike280z.1207531""","""Mike280z""",1501322400,"""United States, North Carolina"""
1,0,"""lapislee.1137977""","""LapisLee""",1461664800,"""United States, Virginia"""
1,1,"""awareness.400927""","""awareness""",1260097200,"""United States, Florida"""
1,0,"""ctkingsley.920839""","""ctkingsley""",1420369200,


The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `user_name` | Username of the reviewer |
| `joined` | Date when the user joined BeerAdvocate (Unix timestamp in seconds) |
| `location` | Geographic location of the user |
| `nbr_ratings` | Number of ratings submitted by the user |
| `nbr_reviews` | Number of written reviews submitted by the user |

In [22]:
describe(df_users)

+-------------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| nbr_ratings | Int64  |           153704 |             0 |    0.00 % |                  2053 |            1.34 % |
| nbr_reviews | Int64  |           153704 |             0 |    0.00 % |                  1265 |            0.82 % |
| user_id     | String |           153704 |             0 |    0.00 % |                153704 |          100.00 % |
| user_name   | String |           153704 |             0 |    0.00 % |                153704 |          100.00 % |
| joined      | Int64  |           151052 |          2652 |    1.73 % |                  5525 |            3.66 % |
| location    | String |           122425 |         31279 |   20.35 % | 

In [23]:
describe_numbers(df_users, filters=["user_id"])

+-------------+-------------+-------------+-------------+-------------+-------------+-----------+------------+
| Column      |        Mean |         Std |         25% |         50% |         75% |       Min |        Max |
|-------------+-------------+-------------+-------------+-------------+-------------+-----------+------------|
| nbr_ratings |     54.6052 |     252.389 |           1 |           3 |          16 |         1 |      12046 |
| nbr_reviews |     16.8479 |     139.847 |           0 |           0 |           2 |         0 |       8970 |
| joined      | 1.35724e+09 | 9.19513e+07 | 1.30312e+09 | 1.39194e+09 | 1.41769e+09 | 840794400 | 1501495200 |
+-------------+-------------+-------------+-------------+-------------+-------------+-----------+------------+


As we did for the beers dataset, we are going to split the counts into nbr_interactions, nbr_ratings and nbr_reviews. We are also going to cast the joined column to a datetime.

In [24]:
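# Same split as for the beers: total interactions vs. ratings without text
df_users = df_users.with_columns(pl.col("nbr_ratings").alias("nbr_interactions"))
df_users = df_users.with_columns((pl.col("nbr_ratings") - pl.col("nbr_reviews")).alias("nbr_ratings"))
# Convert the joined column from Unix seconds to a millisecond-precision datetime
df_users = df_users.with_columns((pl.col("joined").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))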
df_users.sample(5)

nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
i64,i64,str,str,datetime[ms],str,i64
4,2,"""wemustdrink.1143257""","""WemustDRINK""",2016-05-11 10:00:00,"""United States, Georgia""",6
0,3,"""shocksmarin.735619""","""shocksmarin""",2013-06-04 10:00:00,"""United States, California""",3
230,8,"""ericbeech_87.799334""","""Ericbeech_87""",2014-05-11 10:00:00,"""Canada""",238
1,2,"""eggman2814.718703""","""eggman2814""",2013-02-16 11:00:00,"""United States, Illinois""",3
3,0,"""aucbai.838252""","""aucbai""",2014-08-02 10:00:00,"""United States, Alabama""",3


We see that many users don't have a location associated with them, which is reasonable since the location is an optional field. Strangely enough, some users don't have a join timestamp either. We are going to check how many reviews and ratings come from these users, to ensure that we still have enough temporal and spatial data to work with.

In [25]:
# Total number of interactions, and interactions by users with a missing location and/or join date
tot_number_users = df_users.select("nbr_interactions").sum().item()
tot_users_no_location_nbr_interactions = df_users.filter(pl.col("location").is_null()).select("nbr_interactions").sum().item()
tot_users_no_timestamp_nbr_interactions = df_users.filter(pl.col("joined").is_null()).select("nbr_interactions").sum().item()
tot_users_either_of_two = df_users.filter(pl.col("location").is_null() | pl.col("joined").is_null()).select("nbr_interactions").sum().item()

# Print the share of interactions that belong to those users
print(f"Percentage of interactions (reviews or ratings) with missing user location: {tot_users_no_location_nbr_interactions / tot_number_users * 100:.2f} %")
print(f"Percentage of interactions (reviews or ratings) with missing user timestamp: {tot_users_no_timestamp_nbr_interactions / tot_number_users * 100:.2f} %")
print(f"Percentage of interactions (reviews or ratings) with missing user location or timestamp: {tot_users_either_of_two / tot_number_users * 100:.2f} %")

Percentage of interactions (reviews or ratings) with missing user location: 5.96 %
Percentage of interactions (reviews or ratings) with missing user timestamp: 4.15 %
Percentage of interactions (reviews or ratings) with missing user location or timestamp: 5.96 %


We see that users without a location or a join date account for only a small share of the reviews and ratings. It's also interesting to note that the share of interactions missing either field equals the share missing a location, which means that almost all of the users without a join date also lack a location. We still have enough data for our purposes.

We are going to recompute the number of ratings, interactions and reviews for each beer later on, when we work with the ratings dataset.

##### Ratings

In [26]:
df_ratings.sample(5)

user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,text,aroma,overall,taste,style,beer_name,brewery_id,date
str,f64,bool,f64,str,str,i64,f64,f64,str,f64,f64,f64,str,str,i64,datetime[μs]
"""benoit917.450638""",4.0,False,7.2,"""Maine Beer Company""","""benoit917""",54522,,,,,,,"""American Amber / Red Ale""","""Zoe""",20681,2013-01-09 12:00:00
"""wowcoolman.507084""",3.98,False,8.5,"""The Hop Concept Brewing""","""Wowcoolman""",157830,4.25,4.25,,3.75,4.0,4.0,"""American Double / Imperial IPA""","""Citrus & Piney IPA (The Hop Fr…",38886,2015-06-02 12:00:00
"""ncstateplaya.264065""",3.25,False,5.0,"""Triangle Brewing Company""","""ncstateplaya""",108882,,,,,,,"""Bière de Garde""","""Farmhouse Ale""",16186,2014-09-09 12:00:00
"""tmoney2591.322390""",3.33,True,6.3,"""5 Rabbit Cerveceria""","""TMoney2591""",76133,4.0,3.5,"""Served in a New Holland tulip …",3.5,3.5,3.0,"""Märzen / Oktoberfest""","""Vida Y Muerte Muertzenbier""",25544,2012-11-05 12:00:00
"""zbr101.740056""",4.5,False,5.0,"""Grassroots Brewing""","""ZBR101""",93400,4.5,4.5,,4.5,4.5,4.5,"""Saison / Farmhouse Ale""","""Grassroots Brother Soigné""",23205,2015-10-18 12:00:00


In [27]:
describe(df_ratings)

+--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| user_id      | String                                   |          8393032 |             0 |    0.00 % |                153704 |            1.83 % |
| rating       | Float64                                  |          8393032 |             0 |    0.00 % |                   401 |            0.00 % |
| review       | Boolean                                  |          8393032 |             0 |    0.00 % |                     2 |            0.00 % |
| abv          | Float64                                  |          8221727 |        171305 |

The dataset has the following structure

| Column Name | Description |
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `rating` | Global rating of the beer from the user |
| `review` | Flag to tell if the rating has text or not |
| `abv` | Alcohol By Volume percentage of the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `user_name` | Username of the reviewer |
| `beer_id` | Unique identifier for each beer |
| `appearance` | Rating of the appearance of the beer |
| `palate` | Rating of the palate of the beer |
| `text` | Text of the review |
| `aroma` | Rating of the aroma of the beer |
| `overall` | Overall rating of the beer |
| `taste` | Rating of the taste of the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `date` | Date when the review was submitted |

Since this dataset includes data from the breweries, users and beers datasets, and we have modified some of them, let's remove the ratings whose beer_id or brewery_id is no longer present in the respective datasets. No users were dropped above, so the user_id needs no filtering.

In [28]:
# Get the beers and breweries ids
beers_ids = df_beers.get_column("beer_id").to_list()
brewery_ids = df_breweries.get_column("id").to_list()

# Filter the ratings dataset
df_ratings = df_ratings.filter(pl.col("beer_id").is_in(beers_ids)).filter(pl.col("brewery_id").is_in(brewery_ids))

We are going to check whether some of the reviews contain a text but have the review flag set to false:

In [29]:
df_ratings_review_false_but_text = df_ratings.filter(~pl.col("review") & pl.col("text").is_not_null())
df_ratings_review_false_but_text.sample(5)

user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,text,aroma,overall,taste,style,beer_name,brewery_id,date
str,f64,bool,f64,str,str,i64,f64,f64,str,f64,f64,f64,str,str,i64,datetime[μs]
"""rudimon.938886""",2.31,False,5.0,"""3 Daughters Brewing""","""rudimon""",107095,2.5,2.5,"""ehh eh ehh eh ehhh""",2.75,2.25,2.0,"""American Blonde Ale""","""Beach Blonde Ale""",33476,2015-02-04 12:00:00
"""davesbeerreviews.1069877""",3.5,False,5.5,"""Bonfire Brewing Co.""","""DavesBeerReviews""",97822,3.5,3.5,"""Great with weiner schnitzel""",3.5,3.5,3.5,"""American Brown Ale""","""Demshitz Brown Ale""",24817,2016-03-12 12:00:00
"""ashleydenee.931319""",2.78,False,7.0,"""Full Sail Brewery & Tasting Ro…","""Ashleydenee""",150615,3.0,3.0,"""A very heavy, bitter beer. Sti…",2.5,3.0,2.75,"""Doppelbock""","""Bock""",5316,2015-01-22 12:00:00
"""munnster76.952023""",4.29,False,7.0,"""Bell's Brewery, Inc.""","""munnster76""",1093,4.5,4.5,"""A very good IPA, just not up t…",4.25,4.25,4.25,"""American IPA""","""Two Hearted Ale""",287,2015-03-01 12:00:00
"""donnierickles.969207""",4.78,False,12.0,"""Founders Brewing Company""","""donnierickles""",17538,4.25,4.75,"""Boozy, malty IPA. Love it.""",5.0,4.75,4.75,"""American Double / Imperial IPA""","""Founders Devil Dancer""",1199,2015-07-02 12:00:00


In [30]:
print("The percentage of reviews that are False but have a text is: {:.2f} %".format(df_ratings_review_false_but_text.shape[0] / df_ratings.shape[0] * 100))

The percentage of reviews that are False but have a text is: 1.50 %


At first this looks like an error, but these are most likely the entries flagged as 'ambiguous' in the [paper](https://arxiv.org/pdf/1210.3926), so their text should be ignored. Since nothing else looks unusual, we do not drop these rows and simply ignore their text.

In [31]:
describe(df_ratings)

+--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| user_id      | String                                   |          8229373 |             0 |    0.00 % |                152921 |            1.86 % |
| rating       | Float64                                  |          8229373 |             0 |    0.00 % |                   401 |            0.00 % |
| review       | Boolean                                  |          8229373 |             0 |    0.00 % |                     2 |            0.00 % |
| abv          | Float64                                  |          8064274 |        165099 |

In [32]:
describe_numbers(df_ratings, filters=["beer_id", "brewery_id", "user_id"])

+------------+---------+----------+-------+-------+-------+-------+-------+
| Column     |    Mean |      Std |   25% |   50% |   75% |   Min |   Max |
|------------+---------+----------+-------+-------+-------+-------+-------|
| rating     | 3.88197 |  0.62109 |  3.54 |     4 |  4.25 |     1 |     5 |
| abv        |   7.335 |  2.46824 |   5.5 |   6.9 |   8.8 |  0.01 |  67.5 |
| appearance | 3.93564 | 0.558574 |  3.75 |     4 |  4.25 |     1 |     5 |
| palate     | 3.86437 | 0.608519 |   3.5 |     4 |  4.25 |     1 |     5 |
| aroma      | 3.86777 |  0.62057 |   3.5 |     4 |  4.25 |     1 |     5 |
| overall    | 3.89903 | 0.616302 |   3.5 |     4 |  4.25 |     1 |     5 |
| taste      | 3.90194 | 0.643627 |   3.5 |     4 |  4.25 |     1 |     5 |
+------------+---------+----------+-------+-------+-------+-------+-------+


## Data cleaning and saving
Now that we have done a preliminary analysis of the data, we modify some of the datasets to make them better suited to our analysis. We also save the datasets in Parquet format with the correct data types, which makes them faster to load in the future.
##### Users dataset
The dataset is already clean. We only recompute the per-user counts (`nbr_ratings`, `nbr_reviews` and `nbr_interactions`) from the filtered ratings and save the result in Parquet format.

In [None]:
# Drop the outdated nbr_ratings, nbr_reviews and nbr_interactions columns
df_users = df_users.drop(["nbr_ratings", "nbr_reviews", "nbr_interactions"])

# Split the rows: reviews have the review flag set, plain ratings do not
reviews = df_ratings.filter(pl.col("review"))
ratings = df_ratings.filter(~pl.col("review"))

# Count the ratings, reviews and total interactions of each user
number_of_ratings = ratings.group_by("user_id").agg(pl.col("rating").count().alias("nbr_ratings")).select("user_id", "nbr_ratings")
number_of_reviews = reviews.group_by("user_id").agg(pl.col("rating").count().alias("nbr_reviews")).select("user_id", "nbr_reviews")
number_of_interactions = df_ratings.group_by("user_id").agg(pl.col("rating").count().alias("nbr_interactions")).select("user_id", "nbr_interactions")

# Add the recomputed counts to the users dataset
df_users = df_users.join(number_of_ratings, on="user_id", how="left")
df_users = df_users.join(number_of_reviews, on="user_id", how="left")
df_users = df_users.join(number_of_interactions, on="user_id", how="left")
# Save the processed datasets
df_users.write_parquet(f"{DST_DATA_PATH}/users.pq")

##### Breweries dataset
The dataset is already clean and processed, so we only need to save it in Parquet format.

In [34]:
df_breweries.write_parquet(f"{DST_DATA_PATH}/breweries.pq")

##### Ratings dataset
The dataset is already cleaned and processed, so we only need to save it in Parquet format. We save both a full version of the dataset and a version without the text of the reviews.

In [35]:
df_ratings.write_parquet(f"{DST_DATA_PATH}/ratings.pq")
df_ratings_no_text = df_ratings.drop("text")
df_ratings_no_text.write_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")

##### Beers dataset
The previous datasets needed little work, but here several steps are required:
- We change some of the columns so that the result is similar to the BeerAdvocate dataset, which simplifies the comparison between the two datasets.
- Since we have removed some reviews and changed the other datasets, we recompute the aggregated values in the beers dataset.

Additional data regarding the matching of the two datasets will be added in another notebook.

In [None]:
rows = []
for row in tqdm.tqdm(df_beers.rows(named=True)):
    # Create the new dictionary
    new_row = {}

    # Compute general information
    new_row["beer_id"] = row["beer_id"]
    new_row["beer_name"] = row["beer_name"]
    new_row["brewery_id"] = row["brewery_id"]
    new_row["brewery_name"] = row["brewery_name"]
    new_row["style"] = row["style"]
    new_row["abv"] = row["abv"]

    # Compute aggregated information from the ratings of this beer
    reviews_elements = df_ratings_no_text.filter(pl.col("beer_id") == row["beer_id"])
    new_row["rating_score_avg"] = reviews_elements["rating"].mean()
    new_row["rating_score_std"] = reviews_elements["rating"].std()
    new_row["rating_score_median"] = reviews_elements["rating"].median()

    # Compute nbr_interactions (all rows), nbr_ratings (review flag False) and nbr_reviews (review flag True)
    new_row["nbr_interactions"] = reviews_elements.shape[0]
    new_row["nbr_ratings"] = reviews_elements.filter(~pl.col("review")).shape[0]
    new_row["nbr_reviews"] = reviews_elements.filter(pl.col("review")).shape[0]

    # Append the new row to the list of rows
    rows.append(new_row)

# Transform the data into a Polars DataFrame
df_beers_aggregated = pl.DataFrame(rows)

# Save the data
df_beers_aggregated.write_parquet(f"{DST_DATA_PATH}/beers.pq")

# Free the memory; the datasets are reloaded from disk in the next section
del df_beers
del df_users
del df_breweries
del df_ratings

100%|██████████| 273764/273764 [09:00<00:00, 506.43it/s]


## Computation of dropped values
Here we compute how much data has been dropped from each dataset. This step ensures that we still have enough data to work with after the data cleaning process.
##### Users dataset

In [None]:
# Load the original and the cleaned datasets
df_users_original = pl.read_csv(f"{SRC_DATA_PATH}/users.csv")
df_users = pl.read_parquet(f"{DST_DATA_PATH}/users.pq")
# Count the rows before and after cleaning
original_number_of_users = df_users_original.shape[0]
new_number_of_users = df_users.shape[0]

# Print the results
print(f"Number of users in the original dataset: {original_number_of_users}")
print(f"Number of users in the new dataset: {new_number_of_users}")
print(f"Percentage of users removed: {(original_number_of_users - new_number_of_users) / original_number_of_users * 100:.2f}%")

Number of users in the original dataset: 153704
Number of users in the new dataset: 153704
Percentage of users removed: 0.00%


##### Beers dataset

In [None]:
# Load the original datasets
df_beers_original = pl.read_csv(f"{SRC_DATA_PATH}/beers.csv")
df_beers = pl.read_parquet(f"{DST_DATA_PATH}/beers.pq")
# Do some computation
original_number_of_beers = df_beers_original.shape[0]
new_number_of_beers = df_beers.shape[0]

# Print the results
print(f"Number of beers in the original dataset: {original_number_of_beers}")
print(f"Number of beers in the new dataset: {new_number_of_beers}")
print(f"Percentage of beers removed: {(original_number_of_beers - new_number_of_beers) / original_number_of_beers * 100:.2f}%")

Number of beers in the original dataset: 280823
Number of beers in the new dataset: 273764
Percentage of beers removed: 2.51%


##### Breweries dataset

In [None]:
# Load the original and the cleaned breweries datasets
# (the original file name breweries.csv is assumed by analogy with users.csv and beers.csv)
df_breweries_original = pl.read_csv(f"{SRC_DATA_PATH}/breweries.csv")
df_breweries = pl.read_parquet(f"{DST_DATA_PATH}/breweries.pq")
# Count the rows before and after cleaning
original_number_of_breweries = df_breweries_original.shape[0]
new_number_of_breweries = df_breweries.shape[0]

# Print the results
print(f"Number of breweries in the original dataset: {original_number_of_breweries}")
print(f"Number of breweries in the new dataset: {new_number_of_breweries}")
print(f"Percentage of breweries removed: {(original_number_of_breweries - new_number_of_breweries) / original_number_of_breweries * 100:.2f}%")


##### Ratings dataset

In [39]:
# Load the original datasets
df_reviews_original = pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq")
df_reviews = pl.read_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")

# Do some computation
original_number_of_reviews = df_reviews_original.filter(pl.col("review")).shape[0]
original_number_of_ratings = df_reviews_original.filter(~pl.col("review")).shape[0]
new_number_of_reviews = df_reviews.filter(pl.col("review")).shape[0]
new_number_of_ratings = df_reviews.filter(~pl.col("review")).shape[0]

# Print the results
print(f"Number of reviews in the original dataset: {original_number_of_reviews}")
print(f"Number of reviews in the new dataset: {new_number_of_reviews}")
print(f"Percentage of reviews removed: {(original_number_of_reviews - new_number_of_reviews) / original_number_of_reviews * 100:.2f}%")
print()
print(f"Number of ratings in the original dataset: {original_number_of_ratings}")
print(f"Number of ratings in the new dataset: {new_number_of_ratings}")
print(f"Percentage of ratings removed: {(original_number_of_ratings - new_number_of_ratings) / original_number_of_ratings * 100:.2f}%")

Number of reviews in the original dataset: 2589586
Number of reviews in the new dataset: 2540122
Percentage of reviews removed: 1.91%

Number of ratings in the original dataset: 5803446
Number of ratings in the new dataset: 5689251
Percentage of ratings removed: 1.97%
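
The percentage computed in each of the cells of this section follows the same formula, so it could be factored into a small helper. The function name `removed_percentage` is hypothetical and not part of the notebook; the counts used in the example are the ones printed above.

```python
def removed_percentage(original_count: int, new_count: int) -> float:
    """Return the share of rows removed by the cleaning, in percent."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    return (original_count - new_count) / original_count * 100


# Counts taken from the output of the ratings cell
print(f"Percentage of reviews removed: {removed_percentage(2589586, 2540122):.2f}%")
print(f"Percentage of ratings removed: {removed_percentage(5803446, 5689251):.2f}%")
```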


##### Final discussion
Across the datasets checked above, only a small share of the rows was removed (0% of the users, about 2.5% of the beers and about 2% of the ratings and reviews), so we still have enough data to work with after cleaning.
## Conclusion
In this notebook we have cleaned the BeerAdvocate dataset and now have data ready to be used for further analysis. <br>
We focused on understanding the general structure of the data, cleaning it and producing processed Parquet files that can be easily loaded later.