# Data cleaning of RateBeer dataset
RateBeer is another large beer rating platform. Its dataset structure is similar to BeerAdvocate's, with some unique features. In contrast to BeerAdvocate, which is owned by two brothers, [RateBeer is owned by the Belgium-based beer producer AB InBev](https://www.ratebeer.com/about.asp).

On the current version of the website, users review a beer by first giving it a rating. The rating is between 0 and 5 and accepts one digit after the decimal point (e.g. 3.8). The user is then prompted for a textual review and asked to rate the beer on five attributes, each with its own scale. All of them accept only integer values. They are specified as follows:
- Aroma (1-10)
- Appearance (1-5)
- Taste (1-10)
- Mouthfeel (1-5)
- Overall (1-20)

Every review has a text and a rating for each of the attributes above.

Some of this information was taken from [here](https://github.com/OrganicIrradiation/ratebeer).

## Files importing

In [1]:
import polars as pl
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import geopandas as gpd
import numpy as np
import tqdm
import os
# Import the utils
import sys
sys.path.append('../../utils')
from data_desc import remove_whitespaces, describe_numbers, describe

In [2]:
# Define the paths
SRC_DATA_PATH = '../../../data/RateBeer'
DST_DATA_PATH = '../../../data/RateBeer/processed'
os.makedirs(DST_DATA_PATH, exist_ok=True)

## Data exploration and cleaning

In [3]:
df_beers = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/beers.csv"))
df_breweries = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/breweries.csv"))
df_users = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/users.csv"))
df_ratings = remove_whitespaces(pl.read_parquet(f"{SRC_DATA_PATH}/reviews.pq"))

##### Beers dataset
We'll start by looking at the beers dataset.

In [4]:
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,f64,f64,f64,f64,f64,f64,i64,f64
520879,"""Torque Foundation""",26012,"""Torque Brewing""","""American Pale Ale""",4,,,3.1,5.0,3.275,,0,
67351,"""John S. Rhodell Cranberry Whea…",3127,"""John S. Rhodell Brewery""","""Fruit Beer""",0,,,,5.0,,,0,
161115,"""Hopworks Noggin Floggin Barley…",8636,"""Hopworks Urban Brewery""","""Barley Wine""",52,79.0,41.0,3.39,11.0,3.428846,,0,
262231,"""KrügelBIER ABi""",19499,"""KrügelBIER""","""Altbier""",1,,,2.79,5.0,3.0,,0,
29065,"""Malt River Weiss Wheat""",3039,"""Malt River Brewing""","""Wheat Ale""",1,,,2.5,,2.1,,0,


The dataset has the following structure
| Column Name | Description |
|-------------|-------------|
| `beer_id` | Unique identifier for each beer |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `nbr_ratings` | Number of ratings the beer has received |
| `overall_score` | Overall score of the beer on RateBeer |
| `style_score` | Score of the beer within its specific style category |
| `avg` | Average rating of the beer |
| `abv` | Alcohol By Volume percentage of the beer |
| `avg_computed` | Computed average rating (may differ from `avg` due to different calculation methods) |
| `zscore` | Standardized score indicating how many standard deviations the beer's rating is from the mean  |
| `nbr_matched_valid_ratings` | Number of matched valid ratings |
| `avg_matched_valid_ratings` | Average of matched valid ratings |


Now that we have an idea of what the features of our dataset look like, let's analyze its content.

In [5]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   | Int64   |           442081 |             0 |    0.00 % |                442081 |          100.00 % |
| beer_name                 | String  |           442081 |             0 |    0.00 % |                441675 |           99.91 % |
| brewery_id                | Int64   |           442081 |             0 |    0.00 % |                 23199 |            5.25 % |
| brewery_name              | String  |           442081 |             0 |    0.00 % |                 23183 |            5.24 % |
| style                     | String  |           442081 |             0 |    0.00 

In [6]:
describe_numbers(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+-----------+----------+-----------+------------+----------+----------+---------+
| Column                    |      Mean |      Std |       25% |        50% |      75% |      Min |     Max |
|---------------------------+-----------+----------+-----------+------------+----------+----------+---------|
| nbr_ratings               |   16.1103 |  80.9888 |         1 |          3 |        9 |        0 |    5272 |
| overall_score             |   55.6808 |  28.4827 |        34 |         53 |       83 |        0 |     100 |
| style_score               |   54.9861 |  28.6554 |        33 |         51 |       82 |        0 |     100 |
| avg                       |   3.02658 | 0.304503 |      2.87 |       3.02 |     3.18 |        0 |    4.52 |
| abv                       |   6.06548 |  1.92296 |       4.8 |        5.6 |        7 |     0.01 |     100 |
| avg_computed              |   3.24465 |  0.50752 |         3 |        3.3 |  3.59412 |      0.5 |       5 |
| zscore  

From the table we can see that:
- The overall score and the style score are both on a 0-100 scale.
- Some of the `abv` values are wrong: the maximum is 100, which is not possible for a beer.
- Some beers have the same name. This is possible, but we need to check whether they are the same beer or not.

Let's remove the outliers from the `abv` column. In BeerAdvocate, the highest ABV we see is 67.5%. We therefore assume that any beer above 65% is an error, while those below are possibly valid, and we remove the beers whose ABV is too high. After applying this threshold, we checked the remaining very high ABV beers on the website and found that their values are indeed correct.

In [7]:
beer_ids_removed = df_beers.filter(pl.col("abv") > 65)["beer_id"].to_list()
df_beers = df_beers.filter((pl.col("abv") <= 65) | (pl.col("abv").is_null()))

Next, we check whether the same beer has been entered twice for the same brewery, by looking for duplicated (`beer_name`, `brewery_id`) pairs.

In [8]:
beers_duplicates = df_beers.filter(pl.struct(["beer_name", "brewery_id"]).is_duplicated())
beers_duplicates.head(10)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,f64,f64,f64,f64,f64,f64,i64,f64
353940,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
353941,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
375995,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",0,,,,5.0,,,0,
375996,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",1,,,2.72,5.0,2.0,,0,
422836,"""Syndicate Roses Name Abbey Du…",26083,"""Syndicate Beer & Grill""","""Abbey Dubbel""",1,,,3.0,6.0,3.3,,0,
408166,"""Syndicate Roses Name Abbey Du…",26083,"""Syndicate Beer & Grill""","""Abbey Dubbel""",3,,,2.78,6.0,2.6,,0,
285897,"""Prince Edward Island 8 Cord Do…",3091,"""Prince Edward Island Brewing C…","""Imperial IPA""",7,,,3.39,8.5,3.642857,,0,
310659,"""Prince Edward Island 8 Cord Do…",3091,"""Prince Edward Island Brewing C…","""Imperial IPA""",0,,,,8.5,,,0,
510163,"""Isle de Garde Bitter Ordinaire""",21105,"""Isle de Garde""","""Bitter""",2,,,3.12,3.9,3.5,,0,
510165,"""Isle de Garde Blonde Anglaise""",21105,"""Isle de Garde""","""English Pale Ale""",0,,,,5.0,,,0,


In [9]:
print("Number of duplicate beers: ", beers_duplicates.shape[0])

Number of duplicate beers:  438


There are some duplicates, but not a significant number. Since the style or ABV values are sometimes inconsistent between duplicates, and since so few rows are affected, we drop them all.

In [10]:
beer_ids_removed = beers_duplicates["beer_id"].to_list() + beer_ids_removed
df_beers = df_beers.filter(~pl.col("beer_id").is_in(beer_ids_removed))

##### Breweries dataset

Now we can take a look at our breweries dataframe. For each brewery we have a location, a name and the number of beers it produces, along with a unique identifier. The brewery location can be very useful for our research questions. We will look into this more later on in this notebook.

In [11]:
df_breweries.sample(5)

id,location,name,nbr_beers
i64,str,str,i64
9763,"""United States, Iowa""","""Depot Deli and Lounge""",5
15852,"""New Zealand""","""Queenstown Brewers""",4
23559,"""United States, Maine""","""Orono Brewing Company""",35
17443,"""England""","""Stod Fold""",10
26993,"""Switzerland""","""Blaser Getränke""",1


The dataset has the following structure
| Column Name | Description |
|-------------|-------------|
| `id` | Unique identifier for each brewery |
| `location` | Geographic location of the brewery |
| `name` | Name of the brewery |
| `nbr_beers` | Number of beers produced by the brewery |

Now that we have an idea of what the features of our dataset look like, let's analyze its content.

In [12]:
describe(df_breweries)

+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column    | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id        | Int64  |            24189 |             0 |    0.00 % |                 24189 |          100.00 % |
| location  | String |            24189 |             0 |    0.00 % |                   267 |            1.10 % |
| name      | String |            24189 |             0 |    0.00 % |                 24173 |           99.93 % |
| nbr_beers | Int64  |            24189 |             0 |    0.00 % |                   271 |            1.12 % |
+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [13]:
describe_numbers(df_breweries, filters=["id"])

+-----------+---------+---------+-------+-------+-------+-------+-------+
| Column    |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-----------+---------+---------+-------+-------+-------+-------+-------|
| nbr_beers | 19.0227 | 31.5426 |     3 |     8 |    20 |     0 |   295 |
+-----------+---------+---------+-------+-------+-------+-------+-------+


Since users can enter breweries by hand, let's check whether there are duplicates.

In [14]:
breweries_duplicates = df_breweries.filter(pl.struct(["name", "location"]).is_duplicated())
breweries_duplicates

id,location,name,nbr_beers
i64,str,str,i64
23620,"""Canada""","""Ridge Brewing Company""",15
3317,"""Canada""","""Ridge Brewing Company""",12
10376,"""Lithuania""","""UAB Brauer""",1
4714,"""Lithuania""","""UAB Brauer""",7
11243,"""United States, California""","""California Brewing Company""",6
2141,"""United States, California""","""California Brewing Company""",3
9467,"""England""","""Freedom""",4
6811,"""England""","""Freedom""",31


Since the duplicated entries are consistent, we fix this by remapping one of the two duplicates onto the other (we keep the smaller `id`).

In [15]:
# Define an aggregated dataset to map the duplicates ids
aggregated = breweries_duplicates.join(breweries_duplicates, on=["name", "location"], how="inner")
aggregated = aggregated.filter(pl.col("id") < pl.col("id_right"))

# Create a mapping between the duplicates
mapping = aggregated.select(["id", "id_right"]).to_pandas() # destination, source
mapping = dict(zip(mapping["id_right"], mapping["id"]))

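Let's also apply the mapping to the beers dataset; then we can recompute the number of beers for each brewery. Because the final join is an inner join, breweries with no beers left in the beers dataset are dropped.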

In [16]:
# Drop the breweries whose ids are keys in the mapping (the duplicates being merged)
df_breweries = df_breweries.filter(~pl.col("id").is_in(list(mapping.keys())))

# Change the ids in the beers dataset
df_beers = df_beers.with_columns(pl.col("brewery_id").replace(mapping))

# Recompute the number of beers for each brewery
aggregated_value = df_beers.group_by("brewery_id").agg(pl.col("beer_id").count().alias("beers_count")).rename({"brewery_id": "id"})

# Join the aggregated value with the breweries dataframe
df_breweries = df_breweries.join(aggregated_value, on="id", how="inner")
df_breweries = df_breweries.drop("nbr_beers")

# Cast the beers_count column to an integer
df_breweries = df_breweries.with_columns(pl.col("beers_count").cast(pl.Int32))

##### Users dataset

For each of our users, we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [17]:
df_users.sample(5)

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,f64,str
1,371213,"""quinnymcfc""",1432700000.0,
32,374837,"""asbe""",1435200000.0,"""Turkey"""
1,340994,"""jaymick82""",1414100000.0,
241,126614,"""Sparf""",1302300000.0,"""Sweden"""
12,12786,"""wvernon1981""",1085900000.0,"""United States, Oregon"""


Let's convert the `joined` column from a Unix timestamp in seconds to a datetime.

In [18]:
df_users = df_users.with_columns((pl.col("joined").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users.sample(5)

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
126,22630,"""murawski""",2005-05-27 10:00:00,
5,276768,"""Kimmelman""",2013-08-27 10:00:00,
1,28536,"""patrickmach""",2005-10-12 10:00:00,
1,47740,"""ostebo""",2007-01-08 11:00:00,"""Sweden"""
1,465495,"""coachgriffcscs""",2017-05-13 10:00:00,


The dataset has the following structure
| Column Name | Description |
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `user_name` | Username of the reviewer |
| `joined` | Date when the user joined RateBeer |
| `location` | Geographic location of the user |
| `nbr_ratings` | Number of ratings submitted by the user |

Now that we have an idea of what the features of our dataset look like, let's study the content of our dataset

In [19]:
describe(df_users)

+-------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| nbr_ratings | Int64                                    |            70174 |             0 |    0.00 % |                  2250 |            3.21 % |
| user_id     | Int64                                    |            70174 |             0 |    0.00 % |                 70120 |           99.92 % |
| user_name   | String                                   |            70174 |             0 |    0.00 % |                 70174 |          100.00 % |
| joined      | Datetime(time_unit='ms', time_zone=None) |            70144 |            30 |    0.0

In [20]:
describe_numbers(df_users, filters=["user_id", "joined"])

+-------------+---------+---------+-------+-------+-------+-------+-------+
| Column      |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-------------+---------+---------+-------+-------+-------+-------+-------|
| nbr_ratings | 108.821 | 754.493 |     1 |     2 |    10 |     1 | 46749 |
+-------------+---------+---------+-------+-------+-------+-------+-------+


Strangely, there are some duplicated `user_id` values.

In [21]:
user_duplicates = df_users.filter(pl.col("user_id").is_duplicated())
user_duplicates

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,3070,"""<span class=""__cf_email__"" dat…",2002-01-26 11:00:00,"""United States, Ohio"""
…,…,…,…,…
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,


It seems that something went wrong in the creation of these users: their usernames contain HTML code. Let's investigate this further.

In [22]:
strange_users = df_users.filter(pl.col("user_name").str.contains("<"))
strange_users

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,266696,"""AdrianBM ()</A></I> - MEXICO -…",2013-07-01 10:00:00,"""Mexico"""
…,…,…,…,…
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,


In [23]:
strange_users_duplicates = user_duplicates.filter(pl.col("user_name").str.contains("span"))
strange_users_duplicates

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,3070,"""<span class=""__cf_email__"" dat…",2002-01-26 11:00:00,"""United States, Ohio"""
…,…,…,…,…
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""


We see that the two subgroups of users (those with a duplicated id and those with a badly formatted username) are almost the same. We therefore drop both groups, since there was most likely an issue in the creation of their accounts or in the scraping of the data.

In [24]:
duplicates_ids = user_duplicates.select("user_id").unique()
strange_ids = strange_users.select("user_id").unique()
user_ids_remove = pl.concat([duplicates_ids, strange_ids]).unique(subset="user_id")["user_id"].to_list()
df_users = df_users.filter(~pl.col("user_id").is_in(user_ids_remove))

##### Ratings

In [25]:
df_ratings.sample(5)

rating,palate,abv,beer_id,beer_name,user_id,taste,date,style,appearance,overall,brewery_name,text,aroma,user_name,brewery_id
f64,f64,f64,i64,str,i64,f64,datetime[μs],str,f64,f64,str,str,f64,str,i64
4.6,4.0,,469751,"""Cycle Trademark Dispute (Red)""",144930,9.0,2015-08-28 12:00:00,"""Imperial Stout""",5.0,19.0,"""Cycle Brewing""","""Black, thick, and lovely. Cinn…",9.0,"""Bobbyearl""",19030
3.6,3.0,6.2,6887,"""Lagunitas India Pale Ale (IPA)""",125360,6.0,2013-08-10 12:00:00,"""India Pale Ale (IPA)""",4.0,16.0,"""Lagunitas Brewing Company &#40…","""Bottle @ Mads, 10 / 8 2013. Da…",7.0,"""Valgreen""",1167
3.6,3.0,6.2,72662,"""Weyerbacher Muse""",3641,8.0,2007-10-14 12:00:00,"""Saison""",3.0,15.0,"""Weyerbacher Brewing Co.""","""Cloudy red, moderate white hea…",7.0,"""Suttree""",241
3.7,3.0,6.5,191346,"""Rooie Dop The Daily Grind""",106360,8.0,2013-06-15 12:00:00,"""Porter""",3.0,16.0,"""Rooie Dop""","""Dark, dark brown colored pour.…",7.0,"""suprchunk""",14217
2.3,2.0,5.2,10218,"""Vedett Extra Blond""",283717,5.0,2016-04-02 12:00:00,"""Pale Lager""",2.0,10.0,"""Duvel Moortgat""","""Lagar simples sem muitas prome…",4.0,"""leduardol""",247


The dataset has the following structure
| Column Name | Description |
|-------------|-------------|
| `rating` | Global rating of the beer from the user (computed from all the other grades) |
| `palate` | Rating of the palate of the beer |
| `abv` | Alcohol By Volume percentage of the beer |
| `beer_id` | Unique identifier for each beer |
| `beer_name` | Name of the beer |
| `user_id` | Unique identifier for each user |
| `taste` | Rating of the taste of the beer |
| `date` | Date when the review was submitted |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `appearance` | Rating of the appearance of the beer |
| `overall` | Overall rating of the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `text` | Textual review of the beer |
| `aroma` | Rating of the aroma of the beer |
| `user_name` | Username of the reviewer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |

This dataset embeds data from the breweries, users and beers datasets, which we have modified. Let's therefore drop the reviews whose `user_id`, `beer_id` or `brewery_id` is no longer present in the respective dataset.

In [26]:
# Do a remapping of the breweries ids
df_ratings = df_ratings.with_columns(pl.col("brewery_id").replace(mapping))

# Select the ids of the beers, users and breweries
beer_ids = df_beers.select("beer_id")
user_ids = df_users.select("user_id")
brewery_ids = df_breweries.select("id")

# Filter out the ratings where the beer_id, user_id or brewery_id is not in the respective dataset
df_ratings = df_ratings.filter(pl.col("beer_id").is_in(beer_ids["beer_id"]))
df_ratings = df_ratings.filter(pl.col("user_id").is_in(user_ids["user_id"]))
df_ratings = df_ratings.filter(pl.col("brewery_id").is_in(brewery_ids["id"]))

Since we have removed some ratings, we need to recompute the number of reviews for each user and for each beer. We'll do this later in the notebook.

In [27]:
describe(df_ratings)

+--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| rating       | Float64                                  |          7120381 |             0 |    0.00 % |                    46 |            0.00 % |
| palate       | Float64                                  |          7120381 |             0 |    0.00 % |                     5 |            0.00 % |
| abv          | Float64                                  |          6945449 |        174932 |    2.46 % |                   862 |            0.01 % |
| beer_id      | Int64                                    |          7120381 |             0 |

In [28]:
describe_numbers(df_ratings, filters=["beer_id", "user_id", "brewery_id"])

+------------+---------+----------+-------+-------+-------+-------+-------+
| Column     |    Mean |      Std |   25% |   50% |   75% |   Min |   Max |
|------------+---------+----------+-------+-------+-------+-------+-------|
| rating     | 3.28556 | 0.686496 |     3 |   3.4 |   3.7 |   0.5 |     5 |
| palate     | 3.28472 | 0.796805 |     3 |     3 |     4 |     1 |     5 |
| abv        | 6.50365 |  2.15749 |     5 |     6 |   7.7 |  0.01 |  57.7 |
| taste      | 6.49668 |  1.53813 |     6 |     7 |     7 |     1 |    10 |
| appearance | 3.44041 | 0.773787 |     3 |     3 |     4 |     1 |     5 |
| overall    | 13.2154 |  3.15212 |    12 |    14 |    15 |     1 |    20 |
| aroma      | 6.41841 |  1.53867 |     6 |     7 |     7 |     1 |    10 |
+------------+---------+----------+-------+-------+-------+-------+-------+


## Data cleaning and saving
Now that we have done some preliminary analysis on the data we will modify some of the datasets to make them more suitable for our analysis. We will also save the datasets in parquet format with the correct data types for faster loading in the future.
##### Users dataset
The dataset is already clean; we only need to recompute each user's number of reviews from the filtered ratings and save the result in Parquet format.

In [29]:
# In the process we are going to lose all users without a rating
number_of_ratings = df_ratings.group_by("user_id").agg(pl.col("rating").count().alias("nbr_ratings").cast(pl.Int32))
df_users = df_users.join(number_of_ratings, on="user_id", how="inner")
df_users = df_users.drop("nbr_ratings")
df_users = df_users.rename({"nbr_ratings_right": "nbr_reviews"})
df_users = df_users.with_columns(pl.lit(0).cast(pl.Int32).alias("nbr_ratings"))
df_users = df_users.with_columns(pl.col("nbr_ratings").alias("nbr_interactions"))

# Write the cleaned datasets to parquet
df_users.write_parquet(f"{DST_DATA_PATH}/users.pq")

In [30]:
df_users.sample(5)

user_id,user_name,joined,location,nbr_reviews,nbr_ratings,nbr_interactions
i64,str,datetime[ms],str,i32,i32,i32
282458,"""zdunseth""",2013-10-05 10:00:00,"""Israel""",17,0,0
360987,"""nhjmoyer""",2015-03-08 11:00:00,"""United States, Texas""",6,0,0
441203,"""marzinihw""",2017-01-01 11:00:00,"""Italy""",1,0,0
64718,"""bluealfonse""",2007-11-25 11:00:00,"""United States, Pennsylvania""",1,0,0
481752,"""Mazzerbigmelz""",2017-07-13 10:00:00,"""England""",1,0,0


In [31]:
describe(df_users)

+------------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column           | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|------------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| user_id          | Int64                                    |            70097 |             0 |    0.00 % |                 70097 |          100.00 % |
| user_name        | String                                   |            70097 |             0 |    0.00 % |                 70097 |          100.00 % |
| joined           | Datetime(time_unit='ms', time_zone=None) |            70067 |            30 |    0.04 % |                  5960 |            8.51 % |
| location         | String                                   |       

In [32]:
describe_numbers(df_users, filters=["user_id"])

+------------------+---------+---------+-------+-------+-------+-------+-------+
| Column           |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|------------------+---------+---------+-------+-------+-------+-------+-------|
| nbr_reviews      | 101.579 | 703.595 |     1 |     2 |     9 |     1 | 43239 |
| nbr_ratings      |       0 |       0 |     0 |     0 |     0 |     0 |     0 |
| nbr_interactions |       0 |       0 |     0 |     0 |     0 |     0 |     0 |
+------------------+---------+---------+-------+-------+-------+-------+-------+


##### Breweries dataset
The dataset is already clean and processed; we just need to save it in Parquet format.

In [33]:
df_breweries.write_parquet(f"{DST_DATA_PATH}/breweries.pq")

In [34]:
df_breweries.sample(5)

id,location,name,beers_count
i64,str,str,i32
13846,"""South Korea""","""Weizen Brauhaus""",2
18518,"""Italy""","""9punto1""",4
6046,"""Argentina""","""Cervecería Calchaquí""",3
15983,"""Canada""","""Arrowhead Brewing""",14
1741,"""Czech Republic""","""Pivovary Vratislavice (Hols a.…",35


In [35]:
describe(df_breweries)

+-------------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id          | Int64  |            23190 |             0 |    0.00 % |                 23190 |          100.00 % |
| location    | String |            23190 |             0 |    0.00 % |                   267 |            1.15 % |
| name        | String |            23190 |             0 |    0.00 % |                 23178 |           99.95 % |
| beers_count | Int32  |            23190 |             0 |    0.00 % |                   261 |            1.13 % |
+-------------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [36]:
describe_numbers(df_breweries, filters=["id"])

+-------------+---------+---------+-------+-------+-------+-------+-------+
| Column      |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-------------+---------+---------+-------+-------+-------+-------+-------|
| beers_count | 19.0443 | 31.1287 |     3 |     8 |    20 |     1 |   290 |
+-------------+---------+---------+-------+-------+-------+-------+-------+


##### Ratings dataset
The dataset is already cleaned and processed; we just need to save it in Parquet format. We will save both a full version of the dataset and a version without the text of the reviews.

In [37]:
df_ratings.write_parquet(f"{DST_DATA_PATH}/ratings.pq")
df_ratings_no_text = df_ratings.drop("text")
df_ratings_no_text.write_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")

In [38]:
df_ratings.sample(5)

rating,palate,abv,beer_id,beer_name,user_id,taste,date,style,appearance,overall,brewery_name,text,aroma,user_name,brewery_id
f64,f64,f64,i64,str,i64,f64,datetime[μs],str,f64,f64,str,str,f64,str,i64
2.3,3.0,4.8,1417,"""Warsteiner Premium Verum""",267475,5.0,2013-12-30 12:00:00,"""Pilsener""",2.0,8.0,"""Warsteiner Brauerei""","""Cerveja com leve sabor maltado…",5.0,"""leopolito""",242
0.8,1.0,4.5,89192,"""Sibirskaya Korona Laim""",62191,2.0,2009-03-10 12:00:00,"""Radler/Shandy""",1.0,2.0,"""Omsk (Sun-InBev)""","""500ml Bottle - Pale green in c…",2.0,"""SaintMatty""",2163
3.7,3.0,,156696,"""5 Seasons Westside Fresh Hop A…",128299,8.0,2013-02-15 12:00:00,"""American Pale Ale""",3.0,16.0,"""5 Seasons Westside""","""On draft at brewpub poured an …",7.0,"""burg326""",10476
3.2,3.0,7.5,281972,"""Prairie Cherry Funk""",96420,6.0,2014-11-07 12:00:00,"""Sour/Wild Ale""",3.0,14.0,"""Krebs Brewing Company / Petes…","""Bottle pours a pinkish orange …",6.0,"""Atom""",1908
3.7,3.0,9.0,322605,"""Saint Archer Mosaic IPA""",236434,8.0,2017-03-20 12:00:00,"""Imperial IPA""",3.0,15.0,"""Saint Archer Brewing Company &…","""From a bomber. Pours hazy gold…",8.0,"""snoworsummer""",16572


In [39]:
describe(df_ratings_no_text)

+--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| rating       | Float64                                  |          7120381 |             0 |    0.00 % |                    46 |            0.00 % |
| palate       | Float64                                  |          7120381 |             0 |    0.00 % |                     5 |            0.00 % |
| abv          | Float64                                  |          6945449 |        174932 |    2.46 % |                   862 |            0.01 % |
| beer_id      | Int64                                    |          7120381 |             0 |

In [40]:
describe_numbers(df_ratings_no_text, filters=["beer_id", "user_id", "brewery_id"])

+------------+---------+----------+-------+-------+-------+-------+-------+
| Column     |    Mean |      Std |   25% |   50% |   75% |   Min |   Max |
|------------+---------+----------+-------+-------+-------+-------+-------|
| rating     | 3.28556 | 0.686496 |     3 |   3.4 |   3.7 |   0.5 |     5 |
| palate     | 3.28472 | 0.796805 |     3 |     3 |     4 |     1 |     5 |
| abv        | 6.50365 |  2.15749 |     5 |     6 |   7.7 |  0.01 |  57.7 |
| taste      | 6.49668 |  1.53813 |     6 |     7 |     7 |     1 |    10 |
| appearance | 3.44041 | 0.773787 |     3 |     3 |     4 |     1 |     5 |
| overall    | 13.2154 |  3.15212 |    12 |    14 |    15 |     1 |    20 |
| aroma      | 6.41841 |  1.53867 |     6 |     7 |     7 |     1 |    10 |
+------------+---------+----------+-------+-------+-------+-------+-------+


##### Beers dataset
The previous datasets needed little work, but this one requires several steps:
- We change some of the columns so that the dataset mirrors the BeerAdvocate one, which simplifies the comparison between the two.
- Since we removed some reviews and modified the other datasets, we recompute the aggregated values in the beers dataset.

Additional data regarding the matching of the two datasets will be added in another notebook.

In [41]:
rows = []
for row in tqdm.tqdm(df_beers.rows(named=True)):
    # Create the new dictionary
    new_row = {}

    # Compute general information
    new_row["beer_id"] = row["beer_id"]
    new_row["beer_name"] = row["beer_name"]
    new_row["brewery_id"] = row["brewery_id"]
    new_row["brewery_name"] = row["brewery_name"]
    new_row["style"] = row["style"]
    new_row["abv"] = row["abv"]

    # Compute aggregated information from the ratings of this beer
    reviews_elements = df_ratings_no_text.filter(pl.col("beer_id") == row["beer_id"])
    new_row["rating_score_avg"] = reviews_elements["rating"].mean()
    new_row["rating_score_std"] = reviews_elements["rating"].std()
    new_row["rating_score_median"] = reviews_elements["rating"].median()

    # Compute the counts: nbr_interactions and nbr_reviews are the same here, nbr_ratings is set to 0
    new_row["nbr_interactions"] = reviews_elements.shape[0]
    new_row["nbr_ratings"] = 0
    new_row["nbr_reviews"] = reviews_elements.shape[0]

    # Append the new row to the list of rows
    rows.append(new_row)

# Transform the data into a Polars DataFrame
df_beers = pl.DataFrame(rows)

# Save the data
df_beers.write_parquet(f"{DST_DATA_PATH}/beers.pq")

100%|██████████| 441638/441638 [10:16<00:00, 716.72it/s]


In [42]:
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,abv,rating_score_avg,rating_score_std,rating_score_median,nbr_interactions,nbr_ratings,nbr_reviews
i64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64
358156,"""Mulberry Duck Copper Bottom""",16345,"""Mulberry Duck""","""Bitter""",4.6,3.15,0.494975,3.15,2,0,2
411417,"""Stockyards Thunder Hef""",26830,"""Stockyards Brewing Co""","""German Hefeweizen""",5.0,3.4,0.173205,3.5,3,0,3
326868,"""Basket Case Hey Joe Breakfast …",16196,"""Basket Case Brewing Company""","""Sweet Stout""",5.5,3.6,0.0,3.6,2,0,2
88064,"""Franconia Lager""",9573,"""Franconia Brewing Company""","""Premium Lager""",5.0,3.105882,0.488002,3.1,34,0,34
209640,"""Gaverhopke Extra Special Aged …",3184,"""t Gaverhopke""","""Belgian Strong Ale""",11.0,3.018182,0.695276,3.0,33,0,33


In [43]:
describe(df_beers)

+---------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column              | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id             | Int64   |           441638 |             0 |    0.00 % |                441638 |          100.00 % |
| beer_name           | String  |           441638 |             0 |    0.00 % |                441453 |           99.96 % |
| brewery_id          | Int64   |           441638 |             0 |    0.00 % |                 23190 |            5.25 % |
| brewery_name        | String  |           441638 |             0 |    0.00 % |                 23178 |            5.25 % |
| style               | String  |           441638 |             0 |    0.00 % |                    94 |            0.02 % |


In [44]:
describe_numbers(df_beers, filters=["beer_id", "brewery_id"])

+---------------------+----------+----------+----------+----------+----------+-------+---------+
| Column              |     Mean |      Std |      25% |      50% |      75% |   Min |     Max |
|---------------------+----------+----------+----------+----------+----------+-------+---------|
| abv                 |  6.06439 |   1.9058 |      4.8 |      5.6 |        7 |  0.01 |      60 |
| rating_score_avg    |   3.2446 |  0.50751 |        3 |      3.3 |  3.59375 |   0.5 |       5 |
| rating_score_std    | 0.361435 | 0.214694 | 0.212132 | 0.341501 | 0.465788 |     0 | 2.75772 |
| rating_score_median |  3.25695 | 0.513003 |        3 |      3.3 |      3.6 |   0.5 |       5 |
| nbr_interactions    |  16.1227 |  81.0252 |        1 |        3 |        9 |     0 |    5271 |
| nbr_ratings         |        0 |        0 |        0 |        0 |        0 |     0 |       0 |
| nbr_reviews         |  16.1227 |  81.0252 |        1 |        3 |        9 |     0 |    5271 |
+---------------------+-------

## Computation of dropped values
Here we compute how much data was dropped from each dataset. This check ensures that enough data remains after the cleaning process and helps spot possible issues in it.
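The same percentage formula is repeated in each of the cells below. As a small sketch, it could be factored into a helper; the name `dropped_percentage` is hypothetical and not part of the notebook's utilities, and the row counts are the ones printed by the cells in this section.

```python
def dropped_percentage(original_rows: int, cleaned_rows: int) -> float:
    """Return the share of rows removed by the cleaning, in percent."""
    return (original_rows - cleaned_rows) / original_rows * 100


# Row counts (original, cleaned) as printed by the cells in this section.
counts = {
    "users": (70174, 70097),
    "beers": (442081, 441638),
    "breweries": (24189, 23190),
    "ratings": (7122074, 7120381),
}

for name, (original, cleaned) in counts.items():
    print(f"Percentage of {name} removed: {dropped_percentage(original, cleaned):.2f}%")
```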
##### Users dataset

In [45]:
# Load the original datasets
df_users_original = pl.read_csv(f"{SRC_DATA_PATH}/users.csv")

# Do some computation
original_number_of_users = df_users_original.shape[0]
new_number_of_users = df_users.shape[0]

# Print the results
print(f"Number of users in the original dataset: {original_number_of_users}")
print(f"Number of users in the new dataset: {new_number_of_users}")
print(f"Percentage of users removed: {(original_number_of_users - new_number_of_users) / original_number_of_users * 100:.2f}%")

Number of users in the original dataset: 70174
Number of users in the new dataset: 70097
Percentage of users removed: 0.11%


##### Beers dataset

In [46]:
# Load the original datasets
df_beers_original = pl.read_csv(f"{SRC_DATA_PATH}/beers.csv")

# Do some computation
original_number_of_beers = df_beers_original.shape[0]
new_number_of_beers = df_beers.shape[0]

# Print the results
print(f"Number of beers in the original dataset: {original_number_of_beers}")
print(f"Number of beers in the new dataset: {new_number_of_beers}")
print(f"Percentage of beers removed: {(original_number_of_beers - new_number_of_beers) / original_number_of_beers * 100:.2f}%")

Number of beers in the original dataset: 442081
Number of beers in the new dataset: 441638
Percentage of beers removed: 0.10%


##### Breweries dataset

In [47]:
# Load the original datasets
df_breweries_original = pl.read_csv(f"{SRC_DATA_PATH}/breweries.csv")

# Do some computation
original_number_of_breweries = df_breweries_original.shape[0]
new_number_of_breweries = df_breweries.shape[0]

# Print the results
print(f"Number of breweries in the original dataset: {original_number_of_breweries}")
print(f"Number of breweries in the new dataset: {new_number_of_breweries}")
print(f"Percentage of breweries removed: {(original_number_of_breweries - new_number_of_breweries) / original_number_of_breweries * 100:.2f}%")

Number of breweries in the original dataset: 24189
Number of breweries in the new dataset: 23190
Percentage of breweries removed: 4.13%


##### Ratings dataset

In [48]:
# Load the original datasets
df_reviews_original = pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq")

# Do some computation
original_number_of_reviews = df_reviews_original.shape[0]
new_number_of_reviews = df_ratings.shape[0]

# Print the results
print(f"Number of reviews in the original dataset: {original_number_of_reviews}")
print(f"Number of reviews in the new dataset: {new_number_of_reviews}")
print(f"Percentage of reviews removed: {(original_number_of_reviews - new_number_of_reviews) / original_number_of_reviews * 100:.2f}%")

Number of reviews in the original dataset: 7122074
Number of reviews in the new dataset: 7120381
Percentage of reviews removed: 0.02%


##### Final discussion
The cleaning process removes very little data: about 0.1% of the users and beers and 0.02% of the ratings, so we can still work with a large amount of data. The loss is somewhat higher for the breweries dataset (4.13%), mostly because we removed breweries that have no associated beers. This is not a significant issue, since we still have all the data needed for our analysis.

## Conclusion
In this notebook we have cleaned the RateBeer dataset, and the data is now ready for further analysis. <br>
We focused on understanding the general structure of the data, cleaning it, and producing processed Parquet files that can be easily loaded in later notebooks.