# Data cleaning of RateBeer dataset
RateBeer is another large beer rating platform. Its dataset is structured like BeerAdvocate's, with a few features of its own. In contrast to BeerAdvocate, which is owned by two brothers, [RateBeer is owned by Belgian-based beer producer AB InBev](https://www.ratebeer.com/about.asp).

On the current version of the website, users review a beer by first giving it a rating. The rating ranges from 0 to 5 and accepts one digit after the decimal point (e.g. 3.8). The user is then prompted for a textual review and asked to rate the beer on five attributes. Each attribute has its own scale and only accepts integer values:
- Aroma (1-10)
- Appearance (1-5)
- Taste (1-10)
- Mouthfeel (1-5)
- Overall (1-20)

Every review has a text and a rating for each of the attributes above.

Some of this information was taken from the [OrganicIrradiation/ratebeer repository](https://github.com/OrganicIrradiation/ratebeer).

## Files importing

In [69]:
import polars as pl
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
import geopandas as gpd
import numpy as np
import tqdm
import os
from src.utils.data_desc import *

In [70]:
# Change this variable if you want to save the processed files
SAVE_PROCESSED = True

# Define the paths
SRC_DATA_PATH = './data/RateBeer'
DST_DATA_PATH = './data/RateBeer/processed'
if SAVE_PROCESSED:
    # Create the destination folder if it does not exist yet
    os.makedirs(DST_DATA_PATH, exist_ok=True)

## Data exploration and cleaning

In [71]:
df_beers = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/beers.csv"))
df_breweries = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/breweries.csv"))
df_users = remove_whitespaces(pl.read_csv(f"{SRC_DATA_PATH}/users.csv"))
df_ratings = remove_whitespaces(pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq"))

##### Beers dataset
We'll start by looking at the beers dataset.

In [72]:
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,f64,f64,f64,f64,f64,f64,i64,f64
121161,"""Oggis Anonymous IPA""",20314,"""Left Coast Brewing Company""","""Imperial IPA""",2,,,2.96,8.5,3.6,,0,
33952,"""Wold Top Wold Gold""",3760,"""Wold Top""","""Golden Ale/Blond Ale""",100,35.0,47.0,2.99,4.8,2.99,-0.695148,100,2.99
439514,"""Labrewatory Alco-hall & Oats I…",28175,"""Labrewatory""","""Specialty Grain""",0,,,,6.0,,,0,
511449,"""OEC Artista Zynergia: Oudilis …",19612,"""OEC Brewing &#40;Ordinem Ecent…","""Sour/Wild Ale""",2,,,3.3,6.0,4.0,,0,
367315,"""Loose Shoe Palomino""",22647,"""Loose Shoe Brewing Company""","""Cream Ale""",1,,,2.92,5.3,2.9,,0,


The sample above shows the structure of the dataset. Now that we have an idea of what its features look like, let's analyze its content.

In [73]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    | Type    |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   | Int64   |           442081 |             0 |    0.00 % |                442081 |          100.00 % |
| beer_name                 | String  |           442081 |             0 |    0.00 % |                441675 |           99.91 % |
| brewery_id                | Int64   |           442081 |             0 |    0.00 % |                 23199 |            5.25 % |
| brewery_name              | String  |           442081 |             0 |    0.00 % |                 23183 |            5.24 % |
| style                     | String  |           442081 |             0 |    0.00 

In [74]:
describe_numbers(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+-----------+----------+-----------+------------+----------+----------+---------+
| Column                    |      Mean |      Std |       25% |        50% |      75% |      Min |     Max |
|---------------------------+-----------+----------+-----------+------------+----------+----------+---------|
| nbr_ratings               |   16.1103 |  80.9888 |         1 |          3 |        9 |        0 |    5272 |
| overall_score             |   55.6808 |  28.4827 |        34 |         53 |       83 |        0 |     100 |
| style_score               |   54.9861 |  28.6554 |        33 |         51 |       82 |        0 |     100 |
| avg                       |   3.02658 | 0.304503 |      2.87 |       3.02 |     3.18 |        0 |    4.52 |
| abv                       |   6.06548 |  1.92296 |       4.8 |        5.6 |        7 |     0.01 |     100 |
| avg_computed              |   3.24465 |  0.50752 |         3 |        3.3 |  3.59412 |      0.5 |       5 |
| zscore  

From the table we can see that:
- The overall score and the style score are both on a 0-100 scale.
- Some of the abv values are wrong: the maximum is 100, which is not possible for a beer.
- Some beers have the same name. This is possible, but we need to check if they are the same beer or not.

Let's deal with the outliers in the abv column. In BeerAdvocate, the highest ABV we see is 67.5%. We therefore assume that any beer above 65% is an error, and we plan to remove these beers from our dataframe.

In [75]:
# [TODO] Choose what to do with the outliers of ABV

Next, we check whether the same beer has been inserted twice for the same brewery. We do this by looking for rows that share the same combination of 'beer_name' and 'brewery_id'.

In [76]:
beers_duplicates = df_beers.filter(pl.struct(["beer_name", "brewery_id"]).is_duplicated())
beers_duplicates.head(10)

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,f64,f64,f64,f64,f64,f64,i64,f64
353940,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
353941,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
375995,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",0,,,,5.0,,,0,
375996,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",1,,,2.72,5.0,2.0,,0,
422836,"""Syndicate Roses Name Abbey Du…",26083,"""Syndicate Beer & Grill""","""Abbey Dubbel""",1,,,3.0,6.0,3.3,,0,
408166,"""Syndicate Roses Name Abbey Du…",26083,"""Syndicate Beer & Grill""","""Abbey Dubbel""",3,,,2.78,6.0,2.6,,0,
285897,"""Prince Edward Island 8 Cord Do…",3091,"""Prince Edward Island Brewing C…","""Imperial IPA""",7,,,3.39,8.5,3.642857,,0,
310659,"""Prince Edward Island 8 Cord Do…",3091,"""Prince Edward Island Brewing C…","""Imperial IPA""",0,,,,8.5,,,0,
510163,"""Isle de Garde Bitter Ordinaire""",21105,"""Isle de Garde""","""Bitter""",2,,,3.12,3.9,3.5,,0,
510165,"""Isle de Garde Blonde Anglaise""",21105,"""Isle de Garde""","""English Pale Ale""",0,,,,5.0,,,0,


In [77]:
beers_duplicates

beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,overall_score,style_score,avg,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
i64,str,i64,str,str,i64,f64,f64,f64,f64,f64,f64,i64,f64
353940,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
353941,"""Talón de Aquiles""",24064,"""cerartmex""","""Porter""",0,,,,5.0,,,0,
375995,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",0,,,,5.0,,,0,
375996,"""Eterno Instante""",25141,"""Cervecería Eterno Instante""","""Porter""",1,,,2.72,5.0,2.0,,0,
422836,"""Syndicate Roses Name Abbey Du…",26083,"""Syndicate Beer & Grill""","""Abbey Dubbel""",1,,,3.0,6.0,3.3,,0,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
332721,"""Akitu Camellia IPA""",23008,"""Akitu Brewing""","""India Pale Ale (IPA)""",3,,,3.13,7.0,3.4,,0,
337288,"""Kempisch Vuur Blond""",14263,"""Pirlot""","""Belgian Ale""",1,,,3.11,5.0,3.8,,0,
15839,"""Kempisch Vuur Blond""",14263,"""Pirlot""","""Belgian Ale""",3,,,3.04,5.5,3.633333,,0,
480038,"""Brasserie du Quercorb Cascada""",22081,"""Brasserie du Quercorb""","""Golden Ale/Blond Ale""",0,,,,3.8,,,0,


There are some duplicates, but they are few: judging by the row counts reported in this notebook (442,081 beers before the removal, 441,643 after), they make up about 0.1% of the beers. Since the style or abv values are sometimes inconsistent between duplicates, and since so few rows are affected, we drop them.

In [78]:
beer_ids_removed = beers_duplicates["beer_id"].to_list()
df_beers = df_beers.filter(~pl.col("beer_id").is_in(beer_ids_removed))

##### Breweries dataset

Now we can take a look at our breweries dataframe. For each brewery we have a location, a name and the number of beers it produces, along with a unique identifier. The brewery location can be very useful for our research questions. We will look into this later in this notebook.

In [79]:
df_breweries.sample(5)

id,location,name,nbr_beers
i64,str,str,i64
12717,"""Germany""","""Brauhaus Weißes Häusl""",5
8390,"""United States, California""","""Last Name Brewing""",29
19726,"""United States, Colorado""","""38 State Brewing Company""",35
18455,"""Canada""","""Brasserie Générale""",42
16713,"""Portugal""","""Faustino Microcervejeira""",30


Now that we have an idea of what the features of our dataset look like, let's analyze its content.

In [80]:
describe(df_breweries)

+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column    | Type   |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id        | Int64  |            24189 |             0 |    0.00 % |                 24189 |          100.00 % |
| location  | String |            24189 |             0 |    0.00 % |                   267 |            1.10 % |
| name      | String |            24189 |             0 |    0.00 % |                 24173 |           99.93 % |
| nbr_beers | Int64  |            24189 |             0 |    0.00 % |                   271 |            1.12 % |
+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [81]:
describe_numbers(df_breweries, filters=["id"])

+-----------+---------+---------+-------+-------+-------+-------+-------+
| Column    |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-----------+---------+---------+-------+-------+-------+-------+-------|
| nbr_beers | 19.0227 | 31.5426 |     3 |     8 |    20 |     0 |   295 |
+-----------+---------+---------+-------+-------+-------+-------+-------+


Since users can insert breweries by hand, let's check whether there are any duplicates.

In [82]:
breweries_duplicates = df_breweries.filter(pl.struct(["name", "location"]).is_duplicated())
breweries_duplicates

id,location,name,nbr_beers
i64,str,str,i64
23620,"""Canada""","""Ridge Brewing Company""",15
3317,"""Canada""","""Ridge Brewing Company""",12
10376,"""Lithuania""","""UAB Brauer""",1
4714,"""Lithuania""","""UAB Brauer""",7
11243,"""United States, California""","""California Brewing Company""",6
2141,"""United States, California""","""California Brewing Company""",3
9467,"""England""","""Freedom""",4
6811,"""England""","""Freedom""",31


Since the data are consistent here, we fix the issue by remapping one id of each duplicated pair to the other.

In [83]:
# Self-join the duplicates on name and location to pair up the duplicated ids
aggregated = breweries_duplicates.join(breweries_duplicates, on=["name", "location"], how="inner")
# Keep each pair once, with the smaller id on the left
aggregated = aggregated.filter(pl.col("id") < pl.col("id_right"))

# Map each source id (the larger one) to its destination id (the smaller one)
mapping = dict(zip(aggregated["id_right"].to_list(), aggregated["id"].to_list()))

Let's apply the mapping to the beers dataset as well, and then recompute the number of beers for each brewery.

In [84]:
# Drop the breweries whose id is remapped to another one
df_breweries = df_breweries.filter(~pl.col("id").is_in(list(mapping.keys())))

# Change the ids in the beers dataset
df_beers = df_beers.with_columns(pl.col("brewery_id").replace(mapping))

# Recompute the number of beers for each brewery
aggregated_value = df_beers.group_by("brewery_id").agg(pl.col("beer_id").count().alias("beers_count")).rename({"brewery_id": "id"})

# Join the aggregated value with the breweries dataframe
df_breweries = df_breweries.join(aggregated_value, on="id", how="inner")
df_breweries = df_breweries.drop("nbr_beers")

# Cast the beers_count column to an integer
df_breweries = df_breweries.with_columns(pl.col("beers_count").cast(pl.Int32))

##### Users dataset

For each of our users, we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [85]:
df_users.sample(5)

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,f64,str
5,51608,"""Moke""",1174300000.0,"""Canada"""
2,84041,"""OldPutin""",1228000000.0,"""United States, Nevada"""
21,291300,"""hna""",1386200000.0,"""Norway"""
1,10748,"""mdmedic""",1077400000.0,"""United States, Maryland"""
1,331215,"""smcorbett""",1408200000.0,"""New Zealand"""


The joined column is stored as a Unix timestamp in seconds, so let's convert it to a datetime.

In [86]:
df_users = df_users.with_columns((pl.col("joined").cast(pl.Int64) * 1000).cast(pl.Datetime("ms")))
df_users.sample(5)

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
17,64219,"""SirPip""",2007-11-16 11:00:00,"""United States, Minnesota"""
1,375999,"""JavierN74""",2015-07-05 10:00:00,"""Spain"""
2,31891,"""Nietzsche""",2006-01-05 11:00:00,"""England"""
1,234845,"""NickPatel324""",2012-12-22 11:00:00,"""United States, Florida"""
2,397685,"""Zigzaz""",2016-01-15 11:00:00,"""Poland"""


Now that we have an idea of what the features of our dataset look like, let's study the content of our dataset

In [87]:
describe(df_users)

+-------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| nbr_ratings | Int64                                    |            70174 |             0 |    0.00 % |                  2250 |            3.21 % |
| user_id     | Int64                                    |            70174 |             0 |    0.00 % |                 70120 |           99.92 % |
| user_name   | String                                   |            70174 |             0 |    0.00 % |                 70174 |          100.00 % |
| joined      | Datetime(time_unit='ms', time_zone=None) |            70144 |            30 |    0.0

In [88]:
describe_numbers(df_users, filters=["user_id", "joined"])

+-------------+---------+---------+-------+-------+-------+-------+-------+
| Column      |    Mean |     Std |   25% |   50% |   75% |   Min |   Max |
|-------------+---------+---------+-------+-------+-------+-------+-------|
| nbr_ratings | 108.821 | 754.493 |     1 |     2 |    10 |     1 | 46749 |
+-------------+---------+---------+-------+-------+-------+-------+-------+


Strangely, some user_ids are duplicated: the table above reports 70,120 unique ids for 70,174 rows.

In [89]:
user_duplicates = df_users.filter(pl.col("user_id").is_duplicated())
user_duplicates

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,3070,"""<span class=""__cf_email__"" dat…",2002-01-26 11:00:00,"""United States, Ohio"""
…,…,…,…,…
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,


It seems there were some issues when these users were created: their usernames contain HTML code. Let's investigate further.

In [90]:
strange_users = df_users.filter(pl.col("user_name").str.contains("<span"))
strange_users

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,1502,"""<span class=""__cf_email__"" dat…",2001-08-28 10:00:00,"""Canada"""
…,…,…,…,…
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""


In [91]:
strange_users_duplicates = user_duplicates.filter(pl.col("user_name").str.contains("span"))
strange_users_duplicates

nbr_ratings,user_id,user_name,joined,location
i64,i64,str,datetime[ms],str
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
2,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,3070,"""<span class=""__cf_email__"" dat…",2002-01-26 11:00:00,"""United States, Ohio"""
…,…,…,…,…
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""
1,427060,"""<span class=""__cf_email__"" dat…",2016-09-08 10:00:00,
1,46797,"""<span class=""__cf_email__"" dat…",2006-12-19 11:00:00,"""United States, Arkansas"""


The two groups of users (those with a duplicated id and those with a badly formatted username) are almost the same. We therefore drop both, since there was most likely an issue in the creation of their accounts or in the scraping of the data.

In [92]:
duplicates_ids = user_duplicates.select("user_id").unique()
strange_ids = strange_users.select("user_id").unique()
user_ids_remove = pl.concat([duplicates_ids, strange_ids]).unique(subset="user_id")["user_id"].to_list()
df_users = df_users.filter(~pl.col("user_id").is_in(user_ids_remove))

##### Ratings

In [93]:
df_ratings.sample(5)

rating,palate,abv,beer_id,beer_name,user_id,taste,date,style,appearance,overall,brewery_name,text,aroma,user_name,brewery_id
f64,f64,f64,i64,str,i64,f64,datetime[μs],str,f64,f64,str,str,f64,str,i64
3.9,4.0,7.0,245290,"""Schneider Weisse Tap X Meine P…",10229,8.0,2017-03-17 12:00:00,"""Weizen Bock""",3.0,16.0,"""Schneider Weisse G. Schneider …","""On Tap at Prost! Boise Cloudy …",8.0,"""IndianaRed""",313
2.9,3.0,5.5,86016,"""Brick House Easter Rising Stou…",56136,5.0,2008-05-01 12:00:00,"""Stout""",5.0,11.0,"""Brick House Brewery & Restaura…","""Just not good. Pretty plain an…",5.0,"""Chapel""",2257
2.6,2.0,6.0,585,"""Rogue HazelNut Brown Nectar""",40472,5.0,2007-12-21 12:00:00,"""Brown Ale""",3.0,12.0,"""Rogue Ales""","""On tap at the Bocktown, nice p…",4.0,"""cheap""",96
3.3,3.0,11.0,137762,"""BFM Abbaye de Saint Bon-Chien …",9357,7.0,2011-12-13 12:00:00,"""Sour/Wild Ale""",2.0,14.0,"""BFM (Brasserie des Franches-Mo…","""Bottle shared by craftycarl21.…",7.0,"""BeerandBlues2""",1063
2.1,2.0,4.5,123014,"""Trafalgar Cherry Ale""",17002,4.0,2010-09-07 12:00:00,"""Fruit Beer""",3.0,8.0,"""Trafalgar Ales and Meads""","""My bottle [650ml] shared with …",4.0,"""blankboy""",1976


This dataset includes data from the breweries, users and beers datasets, which we have just modified. Let's therefore remove the reviews whose user_id, beer_id or brewery_id is no longer present in the respective dataset.

In [94]:
# Do a remapping of the breweries ids
df_ratings = df_ratings.with_columns(pl.col("brewery_id").replace(mapping))

# Select the ids of the beers, users and breweries
beer_ids = df_beers.select("beer_id")
user_ids = df_users.select("user_id")
brewery_ids = df_breweries.select("id")

# Filter out the ratings where the beer_id, user_id or brewery_id is not in the respective dataset
df_ratings = df_ratings.filter(pl.col("beer_id").is_in(beer_ids["beer_id"]))
df_ratings = df_ratings.filter(pl.col("user_id").is_in(user_ids["user_id"]))
df_ratings = df_ratings.filter(pl.col("brewery_id").is_in(brewery_ids["id"]))

Since we have removed some ratings we need to recompute the number of reviews for each user and the number of reviews for each beer. We'll do this later in the notebook.

In [95]:
describe(df_ratings)

+--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       | Type                                     |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+------------------------------------------+------------------+---------------+-----------+-----------------------+-------------------|
| rating       | Float64                                  |          7120395 |             0 |    0.00 % |                    46 |            0.00 % |
| palate       | Float64                                  |          7120395 |             0 |    0.00 % |                     5 |            0.00 % |
| abv          | Float64                                  |          6945463 |        174932 |    2.46 % |                   865 |            0.01 % |
| beer_id      | Int64                                    |          7120395 |             0 |

In [96]:
describe_numbers(df_ratings, filters=["beer_id", "user_id", "brewery_id"])

+------------+---------+----------+-------+-------+-------+-------+-------+
| Column     |    Mean |      Std |   25% |   50% |   75% |   Min |   Max |
|------------+---------+----------+-------+-------+-------+-------+-------|
| rating     | 3.28556 | 0.686497 |     3 |   3.4 |   3.7 |   0.5 |     5 |
| palate     | 3.28472 | 0.796806 |     3 |     3 |     4 |     1 |     5 |
| abv        | 6.50373 |  2.15864 |     5 |     6 |   7.7 |  0.01 |    73 |
| taste      | 6.49668 |  1.53813 |     6 |     7 |     7 |     1 |    10 |
| appearance | 3.44041 | 0.773788 |     3 |     3 |     4 |     1 |     5 |
| overall    | 13.2154 |  3.15213 |    12 |    14 |    15 |     1 |    20 |
| aroma      | 6.41841 |  1.53867 |     6 |     7 |     7 |     1 |    10 |
+------------+---------+----------+-------+-------+-------+-------+-------+


## Data cleaning and saving
Now that we have done some preliminary analysis on the data we will modify some of the datasets to make them more suitable for our analysis. We will also save the datasets in parquet format with the correct data types for faster loading in the future.
##### Users dataset
The dataset is already clean. We only recompute the number of ratings per user from the filtered ratings, then save it in parquet format.

In [97]:
if (SAVE_PROCESSED):
    # Recompute the number of ratings per user from the filtered ratings
    number_of_ratings = df_ratings.group_by("user_id").agg(pl.col("rating").count().alias("number_of_ratings"))
    df_users = df_users.join(number_of_ratings, on="user_id", how="inner")
    df_users = df_users.drop("nbr_ratings")

    # Write the cleaned datasets to parquet
    df_users.write_parquet(f"{DST_DATA_PATH}/users.pq")

In [98]:
df_users.sample(5)

user_id,user_name,joined,location,number_of_ratings
i64,str,datetime[ms],str,u32
196799,"""chadpete21""",2012-06-27 10:00:00,,1
73390,"""vegasron""",2008-04-13 10:00:00,"""United States, Idaho""",5
273788,"""Tgn""",2013-08-10 10:00:00,"""England""",26
463139,"""jamiedrinko909""",2017-05-09 10:00:00,,1
246560,"""Iglebekk""",2013-02-25 11:00:00,,1


##### Breweries dataset
The dataset is already clean and processed; we just need to save it in parquet format.

In [99]:
if (SAVE_PROCESSED):
    df_breweries.write_parquet(f"{DST_DATA_PATH}/breweries.pq")

In [100]:
df_breweries.sample(5)

id,location,name,beers_count
i64,str,str,i32
24344,"""United States, Nebraska""","""Bottle Rocket Brewing Company""",18
12641,"""Germany""","""Neuenahrer Brauhaus""",9
2561,"""United States, Pennsylvania""","""Jacks Mountain Restaurant and …",9
6581,"""Germany""","""Andreasbräu""",23
18469,"""Italy""","""Microbirrificio Birra Elvo""",8


##### Ratings dataset
The dataset is already cleaned and processed; we just need to save it in parquet format. We save both a full version of the dataset and a version without the text of the reviews.

In [101]:
if (SAVE_PROCESSED):
    df_ratings.write_parquet(f"{DST_DATA_PATH}/ratings.pq")
    df_ratings_no_text = df_ratings.drop("text")
    df_ratings_no_text.write_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")

In [102]:
df_ratings.sample(5)

rating,palate,abv,beer_id,beer_name,user_id,taste,date,style,appearance,overall,brewery_name,text,aroma,user_name,brewery_id
f64,f64,f64,i64,str,i64,f64,datetime[μs],str,f64,f64,str,str,f64,str,i64
1.3,2.0,8.6,5588,"""Bavaria 8.6 (Original)""",30639,2.0,2008-03-19 12:00:00,"""Imperial Pils/Strong Pale Lage…",2.0,4.0,"""Bavaria Brouwerij (Netherlands…","""Thanks Harry (tongue firmly in…",3.0,"""garthicus""",1012
3.5,3.0,6.2,356281,"""Sweetwater Hash Brown""",39,7.0,2015-09-25 12:00:00,"""Brown Ale""",3.0,15.0,"""Sweetwater Brewing Company""","""Hoppy brown ale. OK sounds goo…",7.0,"""PhillyBeer2112""",34
3.7,3.0,7.0,490098,"""To Øl Shock Series: !PA Citra …",83882,8.0,2017-03-06 12:00:00,"""India Pale Ale (IPA)""",3.0,14.0,"""To Øl""","""Bottle at Beerlovets. Clear go…",9.0,"""Travlr""",12119
3.7,4.0,6.2,121259,"""Big Sky Cowboy Coffee Porter""",244071,7.0,2013-02-20 12:00:00,"""Porter""",3.0,15.0,"""Big Sky Brewing Company""","""Bottle to pint glass. Very un…",8.0,"""gonzosauce""",1009
3.6,4.0,8.0,313826,"""SPB Crowd Control""",9892,7.0,2017-04-30 12:00:00,"""Imperial IPA""",4.0,14.0,"""Southern Prohibition Brewery""","""Can.Hazy dark gold with white …",7.0,"""SpringsLicker""",12261


##### Beers dataset
The previous datasets needed little work, but this one requires several steps:
- We change some of the columns so that the dataset resembles the BeerAdvocate one, which simplifies the comparison between the two.
- Since we have removed some reviews and changed the other datasets, we recompute the aggregated values in the beers dataset.

Additional data regarding the matching of the two datasets will be added in another notebook.

In [103]:
if (SAVE_PROCESSED):
    rows = []
    for row in tqdm.tqdm(df_beers.rows(named=True)):
        # Create the new dictionary
        new_row = {}

        # Compute general information
        new_row["beer_id"] = row["beer_id"]
        new_row["beer_name"] = row["beer_name"]
        new_row["brewery_id"] = row["brewery_id"]
        new_row["brewery_name"] = row["brewery_name"]
        new_row["style"] = row["style"]
        new_row["abv"] = row["abv"]

        # Compute aggregated information from the ratings of this beer
        reviews_elements = df_ratings_no_text.filter(pl.col("beer_id") == row["beer_id"])
        new_row["rating_score_avg"] = reviews_elements["rating"].mean()
        new_row["rating_score_std"] = reviews_elements["rating"].std()
        new_row["rating_score_median"] = reviews_elements["rating"].median()

        # Compute the number of interactions, nbr_ratings and nbr_reviews (here they are the same)
        new_row["nbr_interactions"] = reviews_elements.shape[0]
        new_row["nbr_ratings"] = reviews_elements.shape[0]
        new_row["nbr_reviews"] = reviews_elements.shape[0]

        # Append the new row to the list of rows
        rows.append(new_row)

    # Transform the data into a Polars DataFrame
    df_beers = pl.DataFrame(rows)

    # Save the data
    df_beers.write_parquet(f"{DST_DATA_PATH}/beers.pq")

100%|██████████| 441643/441643 [11:00<00:00, 668.80it/s]


In [104]:
df_beers.sample(5)

beer_id,beer_name,brewery_id,brewery_name,style,abv,rating_score_avg,rating_score_std,rating_score_median,nbr_interactions,nbr_ratings,nbr_reviews
i64,str,i64,str,str,f64,f64,f64,f64,i64,i64,i64
302281,"""Hophurst Flaxen""",21509,"""Hophurst Brewery""","""Golden Ale/Blond Ale""",3.7,2.96,0.61887,3.1,5,5,5
521323,"""Republica Tina Bazuca""",31638,"""Republica Brewing""","""India Pale Ale (IPA)""",7.0,,,,0,0,0
491319,"""White Frontier Coffee Porter -…",21794,"""White Frontier""","""Porter""",5.0,3.9,,3.9,1,1,1
133403,"""Ise Kadoya Yuzu Ale""",6309,"""Ise Kadoya""","""Fruit Beer""",5.0,3.519048,0.227198,3.5,21,21,21
525612,"""Moody Tongue Apertif""",20128,"""Moody Tongue Brewing Company""","""Pilsener""",5.0,2.1,,2.1,1,1,1


##### Size comparison
It's important, after cleaning the data, to check the size of the datasets to see if we have lost a significant amount of data or not.
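As a minimal sketch of such a check, the hypothetical helper below reports the fraction of rows kept. It is applied to the only before/after counts visible in this notebook, those of the beers dataset (442,081 rows in `describe`, 441,643 rows in the progress bar above).

```python
def retained_fraction(rows_before: int, rows_after: int) -> float:
    """Return the share of rows that survived the cleaning (hypothetical helper)."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    return rows_after / rows_before


# Row counts of the beers dataset reported earlier in this notebook
beers_before, beers_after = 442_081, 441_643
print(f"beers retained: {retained_fraction(beers_before, beers_after):.2%}")
```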
