# Data cleaning of BeerAdvocate dataset

BeerAdvocate is a popular platform for beer enthusiasts to rate and review beers. The dataset provides information about beers, breweries, users and the ratings of the users for the beers on the platform.

On the website of BeerAdvocate, users can create an account. After creating an account with a username and password, they are asked to provide their location, but this is optional and not verified. 

Users can enter a beer into the database. Upon entering a beer, they have to select the location of the brewery where the beer is produced. They are then able to provide the beer name, the beer style and optionally an ABV (alcohol percentage).

After entering the beer into the database, every user can rate this beer. There are two types of ways a user can express their opinion on a beer: through ratings and reviews. A rating (in the current version of the platform, [the rating system has received a major update in 2014](https://www.beeradvocate.com/community/threads/beeradvocate-returns-to-its-roots-rating-system-revamped.238804/)) consists of Look, Smell, Taste, Feel and an Overall rating. Ratings range from 1-5 and can take quarter points as well (e.g. 3.75). Given that the dataset contains data that has been acquired before and after the major update the reviews are slightly different over time.

They can then optionally add a review as well. A review is a text description of what the user thinks about the beer. In some cases, as expressed in the article, the users must insert a text review if the reviews are outliers, to ensure that fraudulent reviews are reduced.

Our sources to better study this dataset are the following:
- [BeerAdvocate's forum](https://www.beeradvocate.com/community/threads/new-ba-score-beer-ranking-more-updates.537406/)
- [BeerAdvocate's website](https://www.beeradvocate.com)

## Preliminary work

### Install and import all the needed libraries

In [1]:
# Import all the necessary libraries
import polars as pl
import tqdm
import os
import pandas as pd
from tabulate import tabulate
import matplotlib.pyplot as plt
import geopandas as gpd
import numpy as np

# Import the file in the utils folder
import sys
sys.path.append('../utils')

In [2]:
# Define the paths
SRC_DATA_PATH = '../../data'
DST_DATA_PATH = '../../data/processed'

# Create the DST_DATA_PATH if it does not exist
if not os.path.exists(DST_DATA_PATH):
    os.makedirs(DST_DATA_PATH)

In [3]:
# Fix the describe function to handle pandas Series correctly
def describe(df: pd.DataFrame, filters=[]) -> None:
	headers = ["Column", "Type", "Not null count", "Nulls count", "Nulls %", "Unique values count", "Unique values %"]
	content = []
	for col in df.columns:
		if col in filters:
			continue
		col_type = df[col].dtype
		col_count = df[col].count()
		col_null = df[col].isna().sum()
		col_unique = df[col].nunique()
		col_percentage_unique = col_unique / col_count * 100
		col_percentage_null = col_null / (col_count + col_null) * 100
		content.append([col, col_type, col_count, col_null, f"{col_percentage_null:.2f} %", col_unique, f"{col_percentage_unique:.2f} %"])
	print(tabulate(content, headers, tablefmt="psql", colalign=("left", "right", "right", "right", "right", "right", "right")))

def describe_number(df: pd.DataFrame, filters=[]) -> None:
	headers = ["Column", "Type", "Min", "Max", "Mean", "Std", "25%", "50%", "75%"]
	content = []
	for col in df.columns:
		if col in filters:
			continue
		if df[col].dtype in [int, float]:
			col_type = df[col].dtype
			col_min = df[col].min()
			col_max = df[col].max()
			col_mean = df[col].mean()
			col_std = df[col].std()
			col_25 = df[col].quantile(0.25)
			col_50 = df[col].quantile(0.50)
			col_75 = df[col].quantile(0.75)
			content.append([col, col_type, col_min, col_max, col_mean, col_std, col_25, col_50, col_75])
	print(tabulate(content, headers, tablefmt="psql", colalign=("left", "right", "right", "right", "right", "right", "right", "right", "right")))

## Data exploration

In [4]:
df_beers = pd.read_csv(f"{SRC_DATA_PATH}/beers.csv")
df_breweries = pd.read_csv(f"{SRC_DATA_PATH}/breweries.csv")
df_users = pd.read_csv(f"{SRC_DATA_PATH}/users.csv")
df_ratings = pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq")

##### Beers dataset
We'll start by looking at the beers dataset.

In [5]:
df_beers.sample(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
35599,210636,Mickey Mountain,39185,WarPigs Brewpub,American Double / Imperial IPA,1,0,4.1,,,8.6,4.1,0.253277,0,
133598,104816,Blue Moon Mountain Abbey Ale,306,Coors Brewing Company,American Brown Ale,313,49,3.27,76.0,,5.6,3.274345,,0,
110411,137834,Beer'd,12,John Harvard's Brewery & Ale House,American Amber / Red Ale,1,0,3.0,,,6.1,3.0,,0,
256461,154275,Pretty Please,32799,Only Child Brewing Company,Milk / Sweet Stout,7,1,4.01,,,9.2,3.895714,-0.149335,1,4.05
229467,125977,Kuhnhenn Summer Wonder,2097,Kuhnhenn Brewing Company,Bock,22,1,4.14,87.0,,13.0,4.24,,0,


The dataset has the following structure

| Column Name | Description | 
|-------------|-------------|
| `beer_id` | Unique identifier for each beer |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `nbr_ratings` | Number of ratings (text + non text) the beer has received |
| `nbr_reviews` | Number of written reviews (only text) the beer has received |
| `avg` | Average rating of the beer |
| `ba_score` | BeerAdvocate score for the beer (fraction of raters who gave the beer a 3.75 or higher) |
| `bros_score` | Score given by the Bros (Todd and Jason Alström, the BeerAdvocate founders) |
| `abv` | Alcohol By Volume percentage of the beer |
| `avg_computed` | Computed average rating (in some cases it differ from `avg` due to different calculation methods) |
| `zscore` | Standardized score indicating how many standard deviations the beer's rating is from the mean |
| `nbr_matched_valid_ratings` | Number of matched valid ratings |
| `avg_matched_valid_ratings` | Average of matched valid ratings |

Now that we have an idea of what the structure of our dataset looks like, let's see if our dataset is complete or whether it contains a lot of missing entries. 

In [6]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    |    Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   |   int64 |           280823 |             0 |    0.00 % |                280823 |          100.00 % |
| beer_name                 |  object |           280823 |             0 |    0.00 % |                236209 |           84.11 % |
| brewery_id                |   int64 |           280823 |             0 |    0.00 % |                 14325 |            5.10 % |
| brewery_name              |  object |           280823 |             0 |    0.00 % |                 14098 |            5.02 % |
| style                     |  object |           280823 |             0 |    0.00 

In [7]:
describe_number(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+---------+---------+---------+-----------+----------+-----------+-----------+-----------+
| Column                    |    Type |     Min |     Max |      Mean |      Std |       25% |       50% |       75% |
|---------------------------+---------+---------+---------+-----------+----------+-----------+-----------+-----------|
| nbr_ratings               |   int64 |       0 |   16509 |   29.8873 |   231.01 |         1 |         2 |         8 |
| nbr_reviews               |   int64 |       0 |    3899 |   9.22142 |  68.8664 |         0 |         1 |         2 |
| avg                       | float64 |       0 |       5 |   3.72103 | 0.476003 |       3.5 |      3.78 |      4.01 |
| ba_score                  | float64 |      46 |     100 |   84.6333 |  4.05272 |        83 |        85 |        86 |
| bros_score                | float64 |      31 |     100 |   84.8066 |  10.5077 |        81 |        87 |        91 |
| abv                       | float64 |    0.01 

We can see that:
- The beer name and brewery information are present for every beer within the dataset
- Some data are missing for abv, ba_score, bros_score and the other aggregated scores
- The ba_score and the bros_score are in a 0 - 100 range
- The ba_score and bros_score have a very similar mean, but the bros_score has a far larger spread. This seems to signal that users of the website tend to give less extreme ratings that the bros (the owners of the website)
- The avg score is in a 0 - 5 range (meaning that the scores are in a 0 - 5 scale)
- The abv at first seems to have some outliers. The max in our table above shows we have values with ABV of 67.5 percent.

However, after careful analysis of the beers with such high ABV they actually exist and are not outliers. For example, the 'Brewdog Sink the Bismarck!' beer has 41% ABV and the 'Brewmeister Armageddon' has 65%. These are therefore perfectly fine entries and we will keep them in our dataframe.

Now we'll focus on dealing with the missing values. <br><br>
Regarding the ABV values, we will not be filling the NaN values here. This is due to the fact that it's not something that can be easily guessed and by approximating it with the mean or median values we would introduce a significant bias in the dataset.

In [8]:
# Get the number of ratings for the beers with a null abv and the total number of ratings
rev_nan_beers = df_beers[df_beers["abv"].isnull()]["nbr_ratings"].sum()
tot_rev = df_beers["nbr_ratings"].sum()

# Compute the percentage of missing values for the reviews dataset
print(f"Number of ratings lost if we drop the rows with missing ABV: {rev_nan_beers}")

Number of ratings lost if we drop the rows with missing ABV: 171305


We see that by ignoring the beers with a nan abv we only ignore a small fraction of the reviews. This makes sense since it's unlikely that a beer with a high number of ratings would have a missing abv values.

Regarding the null values of the aggregated scores:
- In most cases these null values are associated with beers that doesn't have any review. It doesn't make sense to fill these values with the mean or median of the dataset since it would introduce a bias in the dataset.
- Most of the values will change after the processing and cleaning of the data.

For these two reasons we are not going to worry too much now about the missing values of the aggregated scores.

Since the beer are added by hand by people it can happen that a beer is created twice.

In [9]:
beers_duplicates = df_beers[df_beers.duplicated(subset=["beer_name", "brewery_id"], keep=False)]
beers_duplicates.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
7862,207341,10|05 Coffee Porter (San Sebastian),33192,Brew By Numbers,English Porter,1,0,4.05,,,6.5,4.05,,0,
7863,255983,10|05 Coffee Porter (San Sebastian),33192,Brew By Numbers,English Porter,1,0,4.13,,,6.5,4.13,,0,
9299,27803,Nappa Scar,12981,Yorkshire Dales Brewing Company,Extra Special / Strong Bitter (ESB),1,1,3.5,,,4.8,3.5,,0,
9300,289247,Nappa Scar,12981,Yorkshire Dales Brewing Company,English Bitter,1,0,3.5,,,4.0,3.5,,0,
25499,259896,Sage Farm,44830,Half Hours on Earth,Saison / Farmhouse Ale,3,1,3.54,,,6.0,3.76,,0,


Since some of the values (e.g. the abv or the style) are not always consistent between duplicates, to reduce the risk of errors or bias we are going to just drop the duplicates.

In [10]:
df_beers = df_beers[~df_beers['beer_id'].isin(beers_duplicates['beer_id'])]

To better handle the data we are also going to split the nbr_ratings into two distinct columns:
- nbr_interactions: will be used to denote the number of interactions in total (number of text reviews + number of non text reviews)
- nbr_ratings: will be used to denote only the non text reviews

In [11]:
df_beers['nbr_interactions'] = df_beers['nbr_ratings']
df_beers['nbr_ratings'] = df_beers['nbr_ratings'] - df_beers['nbr_reviews']

In [12]:
col_to_drop = ['bros_score','avg_computed','zscore','nbr_matched_valid_ratings','avg_matched_valid_ratings']
df_beers = df_beers.drop(columns=col_to_drop)
df_beers.sample(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
92095,197192,Summer Blonde,17271,Bootlegger's Brewery,American Blonde Ale,2,0,3.55,,4.9,2
103839,154189,Southern German Hefeweizen,31527,Seapine Brewing Company,Hefeweizen,2,0,3.99,,5.7,2
180417,199810,The Hammock Saison,42846,Crooked Hammock Brewery,Saison / Farmhouse Ale,2,0,3.69,,5.0,2
158856,231251,Blue Hazel Maibock,1853,Sly Fox Brewing Company,Maibock / Helles Bock,3,1,3.99,,7.2,4
156090,154486,Scratch Beer 169 - 2015 (Baltic Porter),694,Tröegs Brewing Company,Baltic Porter,6,2,3.91,,6.2,8


##### Breweries dataset

Now we can take a look at our breweries dataframe. For each brewery we have a location, a name and the amount of beers they produce, along with a unique identifier. The brewery location can be very useful for our research questions. We will look into this more later on in this notebook.

In [13]:
df_breweries.sample(5)

Unnamed: 0,id,location,name,nbr_beers
223,14060,England,Shoes Brewery,1
13218,21364,"United States, Michigan",Michigan Brewers Guild,0
11827,46729,"United States, Minnesota",Yoerg Brewing Company,1
5469,23479,Australia,Ranga Brewing Co.,1
8184,29876,"United States, Georgia",Strawn Brewing Company,7


The dataset has the following structure
| Column Name | Description
|-------------|-------------|
| `id` | Unique identifier for each brewery |
| `location` | Geographic location of the brewery (country) |
| `name` | Name of the brewery |
| `nbr_beers` | Number of beers produced by the brewery |

In [14]:
describe(df_breweries)

+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column    |   Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id        |  int64 |            16758 |             0 |    0.00 % |                 16758 |          100.00 % |
| location  | object |            16758 |             0 |    0.00 % |                   297 |            1.77 % |
| name      | object |            16758 |             0 |    0.00 % |                 16237 |           96.89 % |
| nbr_beers |  int64 |            16758 |             0 |    0.00 % |                   273 |            1.63 % |
+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [15]:
describe_number(df_breweries, filters=["id"])

+-----------+--------+-------+-------+---------+---------+-------+-------+-------+
| Column    |   Type |   Min |   Max |    Mean |     Std |   25% |   50% |   75% |
|-----------+--------+-------+-------+---------+---------+-------+-------+-------|
| nbr_beers |  int64 |     0 |  1196 | 21.0563 | 69.4178 |     2 |     6 |    18 |
+-----------+--------+-------+-------+---------+---------+-------+-------+-------+


For our breweries, we have no missing values. <br><br>
From the analysis we can see that some of our breweries have a non-unique name. We need to verify if multiple breweries with the same name exists in the same country.

In [16]:
breweries_duplicates = df_breweries[df_breweries.duplicated(subset=["name", "location"], keep=False)]
breweries_duplicates.head(5)

Unnamed: 0,id,location,name,nbr_beers
68,34598,Wales,Rhymney Brewery,2
69,12936,Wales,Rhymney Brewery,13
213,11410,England,Bartrams Brewery,11
214,7095,England,Bartrams Brewery,0
344,31304,England,Dorset Piddle Brewery,0


We see that there are some breweries that have the same name and are located in the same country. This introduces a possible source of error:
- Either multiple users have inserted the same brewery into the database multiple times
- Multiple breweries in one country have the same name

Both assumption are reasonable but we can't be sure which one is the correct one. To avoid introducing errors in our dataset we are going to drop the duplicates.

In [17]:
df_breweries = df_breweries[~df_breweries['id'].isin(breweries_duplicates['id'])]

Since we modified both the breweries and the beers dataset let's recompute the shared values. At the same time we are going to drop the breweries that doesn't have any beer associated with them.

In [18]:
# Filter the beers dataset to remove the ones that do not have a corresponding brewery
df_beers = df_beers[df_beers["brewery_id"].isin(df_breweries["id"])]

# Recount the number of beers per brewery
count_beers = df_beers["brewery_id"].value_counts()
df_breweries["nbr_beers"] = df_breweries["id"].map(count_beers).dropna().astype(int)

# Show a sample of the breweries dataset
df_breweries.sample(5)

Unnamed: 0,id,location,name,nbr_beers
13716,758,"United States, Texas",Uncle Buck's Brewery & Steakhouse,43.0
15846,4421,"United States, Illinois",Founders Hill Brewing,
10177,44419,"United States, Colorado",Jaks Brewing Co.,8.0
6092,46764,Spain,Alimentos Para El Siglo XXI,1.0
9637,339,"United States, New Jersey",Flying Fish Brewing Company,69.0


Make the location of the brewery homogeneous.

In [19]:
print(df_breweries['location'].unique())

['Kyrgyzstan' 'Gabon' 'Northern Ireland' 'Wales' 'Scotland' 'England'
 'Singapore' 'China' 'Chad' 'Saint Lucia' 'Cameroon' 'Burkina Faso'
 'Zambia' 'Romania' 'Nigeria' 'South Korea' 'Georgia' 'Hong Kong' 'Guinea'
 'Montenegro' 'Benin' 'Mexico' 'Fiji Islands' 'Guam' 'Laos' 'Senegal'
 'Honduras' 'Morocco' 'Indonesia' 'Monaco' 'Ukraine' 'Canada' 'Jordan'
 'Portugal' 'Guernsey' 'India' 'Puerto Rico' 'Japan' 'Iran' 'Hungary'
 'Bulgaria' 'Guinea-Bissau' 'Liberia' 'Togo' 'Niger' 'Croatia' 'Lithuania'
 'Cyprus' 'Italy' 'Andorra' 'Botswana' 'Turks and Caicos Islands'
 'Papua New Guinea' 'Mongolia' 'Ethiopia' 'Denmark' 'French Polynesia'
 'Greece' 'Sri Lanka' 'Syria' 'Germany' 'Jersey' 'Armenia' 'Mozambique'
 'Palestine' 'Bangladesh' 'Turkmenistan' 'Reunion' 'Eritrea' 'Switzerland'
 'Malta' 'Israel' 'El Salvador' 'French Guiana' 'Tonga' 'Zimbabwe' 'Samoa'
 'Barbados' 'Chile' 'Cambodia' 'Cook Islands' 'Trinidad & Tobago' 'Bhutan'
 'Uzbekistan' 'Egypt' 'Uruguay' 'Dominican Republic' 'Equatorial Gu

In [20]:
def change_location(x):
    # Define some remapping
    remapping = {
        'Wales' : 'United Kingdom',
        'Scotland' : 'United Kingdom',
        'Northern Ireland' : 'United Kingdom',
        'Virgin Islands (British)' : 'United Kingdom',
        'Virgin Islands (U.S.)' : 'United Kingdom',
        'United Kingdom, Wales' : 'United Kingdom',
        'United Kingdom, Scotland' : 'United Kingdom',
        'United Kingdom, Northern Ireland' : 'United Kingdom',
        'United Kingdom, England' : 'United Kingdom',
    }

    # Do some hard coded remapping
    if 'United States' in x['location'] or 'Utah' in x['location']:
        x['location'] = 'United States of America'
    elif 'England' in x['location']:
        x['location'] = 'United Kingdom'
    elif 'Canada' in x['location']:
        x['location'] = 'Canada'

    # Do the remapping
    if x['location'] in remapping:
        x['location'] = remapping[x['location']]
    return x

df_breweries = df_breweries.apply(change_location, axis=1)
for c in sorted(df_breweries['location'].unique()):
    print(c)

Albania
Algeria
Andorra
Angola
Antigua & Barbuda
Argentina
Armenia
Aruba
Australia
Austria
Azerbaijan
Bahamas
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Bulgaria
Burkina Faso
Cambodia
Cameroon
Canada
Cape Verde Islands
Cayman Islands
Central African Republic
Chad
Chile
China
Colombia
Congo
Cook Islands
Costa Rica
Croatia
Cuba
Curaçao
Cyprus
Czech Republic
Denmark
Dominica
Dominican Republic
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Faroe Islands
Fiji Islands
Finland
France
French Guiana
French Polynesia
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Haiti
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Isle of Man
Israel
Italy
Ivory Coast
Jamaica
Japan
Jersey
Jordan
Kazakhstan
Kenya
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Macau
Macedonia
Madagasca

In [21]:
to_remove = ['UNKNOWN']
df_breweries = df_breweries[~df_breweries['location'].isin(to_remove)]

While we still have some beers with the same name we are sure that these beers are produced by different breweries. <br><br>
We are going to recompute the number of ratings, interactions and reviews for each beer later in this notebook.

##### Users dataset

For each of our users, we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [22]:
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location
130657,1,1,hippiekinkster.238749,hippiekinkster,1217412000.0,"United States, Georgia"
128821,2,0,badgerboner_angrybits.868878,badgerboner_angrybits,1411639000.0,"United States, New Hampshire"
60967,1,1,mstank.56143,mstank,1135163000.0,"United States, California"
44020,5,0,maria09.809590,Maria09,1403258000.0,"United States, Illinois"
441,2965,1096,haybeerman.221381,Haybeerman,1211364000.0,"United States, Colorado"


This dataset has the following structure
| Column Name | Description 
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `user_name` | Username of the reviewer |
| `joined` | Date when the user joined BeerAdvocate |
| `location` | Geographic location of the user |
| `nbr_ratings` | Number of ratings submitted by the user |
| `nbr_reviews` | Number of written reviews submitted by the user |

In [23]:
describe(df_users)

+-------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      |    Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| nbr_ratings |   int64 |           153704 |             0 |    0.00 % |                  2053 |            1.34 % |
| nbr_reviews |   int64 |           153704 |             0 |    0.00 % |                  1265 |            0.82 % |
| user_id     |  object |           153704 |             0 |    0.00 % |                153704 |          100.00 % |
| user_name   |  object |           153703 |             1 |    0.00 % |                153703 |          100.00 % |
| joined      | float64 |           151052 |          2652 |    1.73 % |                  5524 |            3.66 % |
| location    |  object |           122425 |         31279 |   2

In [24]:
describe_number(df_users)

+-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------+
| Column      |    Type |         Min |        Max |        Mean |         Std |         25% |         50% |         75% |
|-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------|
| nbr_ratings |   int64 |           1 |      12046 |     54.6052 |     252.389 |           1 |           3 |          16 |
| nbr_reviews |   int64 |           0 |       8970 |     16.8479 |     139.847 |           0 |           0 |           2 |
| joined      | float64 | 8.40794e+08 | 1.5015e+09 | 1.35724e+09 | 9.19513e+07 | 1.30312e+09 | 1.39194e+09 | 1.41769e+09 |
+-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------+


We are going to change the structure in nbr_interactions, nbr_ratings, nbr_reviews as done with the beers dataset and we are also going to cast the joined column to a datetime object.

In [25]:
df_users['nbr_interactions'] = df_users['nbr_ratings']
df_users['nbr_ratings'] = df_users['nbr_ratings'] - df_users['nbr_reviews']
df_users['joined'] = pd.to_datetime(df_users['joined'], unit='s')
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
19549,17,0,armen.896651,Armen,2014-11-22 11:00:00,"United States, California",17
153404,0,1,mfgeorge.189021,mfgeorge,NaT,,1
40476,2,0,hwklax.874827,hwklax,2014-10-08 10:00:00,,2
92183,6,2,flaran.583540,flaran,2011-03-22 11:00:00,"United States, California",8
91211,0,3,arielle.367625,Arielle,2009-09-07 10:00:00,"United States, Arizona",3


In [26]:
df_users = df_users[df_users['location'].notna()]
df_users = df_users[df_users['joined'].notna()]
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
94436,3,0,rybarnes.450551,rybarnes,2010-04-20 10:00:00,"United States, Pennsylvania",3
52263,1,0,chumby.746276,Chumby,2013-08-02 10:00:00,Australia,1
50262,2,0,kp34bb.1013550,Kp34bb,2015-07-13 10:00:00,"United States, New Jersey",2
20751,26,0,sullaspiaggia.329320,sullaspiaggia,2009-05-11 10:00:00,"United States, California",26
99831,1,0,13csigg.1178571,13csigg,2016-10-29 10:00:00,"United States, South Carolina",1


##### Ratings

In [27]:
df_ratings.sample(5)

user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,text,aroma,overall,taste,style,beer_name,brewery_id,date
str,f64,bool,f64,str,str,i64,f64,f64,str,f64,f64,f64,str,str,i64,datetime[μs]
"""oline73.371504""",3.75,False,11.0,"""Tröegs Brewing Company""","""oline73""",181664,3.75,3.75,,3.75,3.75,3.75,"""American Wild Ale""","""Tröegs Wild Elf""",694,2016-11-30 12:00:00
"""therealjayz.925954""",3.81,False,7.0,"""O'so Brewing Company & Tap Hou…","""therealJAYZ""",61176,4.0,4.25,,3.5,4.0,3.75,"""American IPA""","""Hop Whoopin""",16386,2016-03-04 12:00:00
"""arithmeticus.464859""",3.71,True,6.9,"""Shipyard Brewing Co.""","""Arithmeticus""",80735,4.5,4.0,"""Poured chilled in 4-oz taster,…",3.5,4.0,3.5,"""English India Pale Ale (IPA)""","""Shipyard Monkey Fist IPA""",139,2012-08-05 12:00:00
"""bostonhops.621388""",2.86,True,10.5,"""Mayflower Brewing Company""","""BostonHops""",56461,4.0,4.0,"""bomber served in on oversized …",3.0,2.5,2.5,"""English Barleywine""","""Mayflower Barley Wine""",16105,2012-05-10 12:00:00
"""dougt.677143""",4.16,False,6.3,"""Voodoo Brewery""","""DougT""",254845,3.0,4.0,,4.5,4.0,4.25,"""American IPA""","""Pork Chop Sandwiches""",13371,2017-05-21 12:00:00


In [28]:
idx_list = np.arange(df_ratings.shape[0])
df_ratings_text = df_ratings.select('text')
df_ratings_text = df_ratings_text.with_columns(pl.Series(name='idx', values=idx_list))
df_ratings_no_text = df_ratings.drop('text').to_pandas()
df_ratings_no_text['idx'] = idx_list

In [29]:
df_ratings_text.head(5)

text,idx
str,i64
"""From a bottle, pours a piss ye…",0
"""Pours pale copper with a thin …",1
"""500ml Bottle bought from The V…",2
"""Serving: 500ml brown bottlePou…",3
"""500ml bottlePours with a light…",4


The dataset has the following structure

In [30]:
describe(df_ratings_no_text, filters=['idx'])

+--------------+----------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       |           Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+----------------+------------------+---------------+-----------+-----------------------+-------------------|
| user_id      |         object |      8.39303e+06 |             0 |    0.00 % |                153704 |            1.83 % |
| rating       |        float64 |      8.39303e+06 |             0 |    0.00 % |                   401 |            0.00 % |
| review       |           bool |      8.39303e+06 |             0 |    0.00 % |                     2 |            0.00 % |
| abv          |        float64 |      8.22173e+06 |        171305 |    2.04 % |                   843 |            0.01 % |
| brewery_name |         object |      8.39303e+06 |             0 |    0.00 % |                 13440 |            0.16 % |


The dataset has the following structure
| Column Name | Description 
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `rating` | Global rating of the beer from the user |
| `review` | Flag to tell if the rating has text or not |
| `abv` | Alcohol By Volume percentage of the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `user_name` | Username of the reviewer |
| `beer_id` | Unique identifier for each beer |
| `appearance` | Rating of the appearance of the beer |
| `palate` | Rating of the palate of the beer |
| `text` | Text of the review |
| `aroma` | Rating of the aroma of the beer |
| `overall` | Overall rating of the beer |
| `taste` | Rating of the taste of the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `date` | Date when the review was submitted |

In [31]:
describe_number(df_ratings_no_text, filters=['idx'])

+------------+---------+-------+--------+---------+----------+-------+-------+-------+
| Column     |    Type |   Min |    Max |    Mean |      Std |   25% |   50% |   75% |
|------------+---------+-------+--------+---------+----------+-------+-------+-------|
| rating     | float64 |     1 |      5 | 3.88213 | 0.620509 |  3.54 |     4 |  4.25 |
| abv        | float64 |  0.01 |   67.5 | 7.33027 |  2.45911 |   5.5 |   6.9 |   8.8 |
| beer_id    |   int64 |     3 | 293296 | 66754.4 |  64818.2 |  9074 | 52266 | 96548 |
| appearance | float64 |     1 |      5 | 3.93721 | 0.558413 |  3.75 |     4 |  4.25 |
| palate     | float64 |     1 |      5 | 3.86586 | 0.608391 |   3.5 |     4 |  4.25 |
| aroma      | float64 |     1 |      5 |  3.8696 | 0.620445 |   3.5 |     4 |  4.25 |
| overall    | float64 |     1 |      5 | 3.90053 | 0.616106 |   3.5 |     4 |  4.25 |
| taste      | float64 |     1 |      5 | 3.90329 | 0.643316 |   3.5 |     4 |  4.25 |
| brewery_id |   int64 |     1 |  49815 | 9

In [32]:
df_ratings_no_text.head(5)

Unnamed: 0,user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,aroma,overall,taste,style,beer_name,brewery_id,date,idx
0,nmann08.184925,2.88,True,4.5,Societe des Brasseries du Gabon (SOBRAGA),nmann08,142544,3.25,3.25,2.75,3.0,2.75,Euro Pale Lager,Régab,37262,2015-08-20 12:00:00,0
1,stjamesgate.163714,3.67,True,4.5,Strangford Lough Brewing Company Ltd,StJamesGate,19590,3.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2009-02-20 12:00:00,1
2,mdagnew.19527,3.73,True,4.5,Strangford Lough Brewing Company Ltd,mdagnew,19590,4.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2006-03-13 12:00:00,2
3,helloloser12345.10867,3.98,True,4.5,Strangford Lough Brewing Company Ltd,helloloser12345,19590,4.0,4.0,3.5,4.5,4.0,English Pale Ale,Barelegs Brew,10093,2004-12-01 12:00:00,3
4,cypressbob.3708,4.0,True,4.5,Strangford Lough Brewing Company Ltd,cypressbob,19590,4.0,4.0,4.0,4.0,4.0,English Pale Ale,Barelegs Brew,10093,2004-08-30 12:00:00,4


In [33]:
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['brewery_id'].isin(df_breweries['id'])]
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['beer_id'].isin(df_beers['beer_id'])]
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['user_id'].isin(df_users['user_id'])]
df_ratings_no_text.shape[0]

7737522

In [34]:
to_remove = ['user_name']
df_ratings_no_text = df_ratings_no_text.drop(columns=to_remove)

In [35]:
breweries_location = df_breweries[['id', 'location']]
df_ratings_no_text = df_ratings_no_text.merge(breweries_location, left_on='brewery_id', right_on='id', how='left')
df_ratings_no_text = df_ratings_no_text.drop(columns=['id'])
df_ratings_no_text.head(5)

Unnamed: 0,user_id,rating,review,abv,brewery_name,beer_id,appearance,palate,aroma,overall,taste,style,beer_name,brewery_id,date,idx,location
0,nmann08.184925,2.88,True,4.5,Societe des Brasseries du Gabon (SOBRAGA),142544,3.25,3.25,2.75,3.0,2.75,Euro Pale Lager,Régab,37262,2015-08-20 12:00:00,0,Gabon
1,stjamesgate.163714,3.67,True,4.5,Strangford Lough Brewing Company Ltd,19590,3.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2009-02-20 12:00:00,1,United Kingdom
2,mdagnew.19527,3.73,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2006-03-13 12:00:00,2,United Kingdom
3,helloloser12345.10867,3.98,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,4.0,3.5,4.5,4.0,English Pale Ale,Barelegs Brew,10093,2004-12-01 12:00:00,3,United Kingdom
4,cypressbob.3708,4.0,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,4.0,4.0,4.0,4.0,English Pale Ale,Barelegs Brew,10093,2004-08-30 12:00:00,4,United Kingdom


In [36]:
df_ratings_text = df_ratings_text.filter(df_ratings_text['idx'].is_in(df_ratings_no_text['idx']))

## Final processing and saving [TO DO]
### Beers dataset

In [37]:
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
0,166064,Nashe Moskovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.7,0
1,166065,Nashe Pivovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,3.8,0
2,166066,Nashe Shakhterskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.8,0
3,166067,Nashe Zhigulevskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.0,0
4,166063,Zhivoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.5,0


In [39]:
nbr_ratings_count = df_ratings_no_text[df_ratings_no_text['review'] == False].value_counts('beer_id')
nbr_reviews_count = df_ratings_no_text[df_ratings_no_text['review'] == True].value_counts('beer_id')
df_beers['nbr_ratings'] = df_beers['beer_id'].map(nbr_ratings_count).fillna(0).astype(int)
df_beers['nbr_reviews'] = df_beers['beer_id'].map(nbr_reviews_count).fillna(0).astype(int)
df_beers['nbr_interactions'] = df_beers['nbr_ratings'] + df_beers['nbr_reviews']
df_beers = df_beers[df_beers['nbr_interactions'] > 0]
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,0,1,2.88,,4.5,1
24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,0,4,3.85,,4.5,4
25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,14,58,3.45,80.0,4.8,72
26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,2,6,3.86,,6.0,8
27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,14,47,3.56,82.0,4.2,61


In [42]:
df_beers['avg'] = df_ratings_no_text['rating'].dropna().groupby(df_ratings_no_text['beer_id']).mean()
df_beers['std'] = df_ratings_no_text['rating'].dropna().groupby(df_ratings_no_text['beer_id']).std()
df_beers['median'] = df_ratings_no_text['rating'].dropna().groupby(df_ratings_no_text['beer_id']).median()
df_beers['appearance'] = df_ratings_no_text['appearance'].dropna().groupby(df_ratings_no_text['beer_id']).mean()
df_beers['palate'] = df_ratings_no_text['palate'].dropna().groupby(df_ratings_no_text['beer_id']).mean()
df_beers['aroma'] = df_ratings_no_text['aroma'].dropna().groupby(df_ratings_no_text['beer_id']).mean()
df_beers['overall'] = df_ratings_no_text['overall'].dropna().groupby(df_ratings_no_text['beer_id']).mean()

In [43]:
df_beers.to_parquet(f"{DST_DATA_PATH}/beers.pq")
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions,std,median,appearance,palate,aroma,overall
23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,0,1,3.681172,,4.5,1,0.436157,3.75,3.787791,3.702035,3.686047,3.867733
24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,0,4,3.781655,,4.5,4,0.405151,3.8,3.93932,3.694175,3.735437,3.813107
25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,14,58,,80.0,4.8,72,,,,,,
26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,2,6,3.460897,,6.0,8,0.396166,3.5,3.6,3.441667,3.297222,3.527778
27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,14,47,3.346196,82.0,4.2,61,0.448685,3.35,3.566038,3.40566,3.301887,3.490566


### Brewery dataset

In [None]:
df_breweries.to_parquet(f"{DST_DATA_PATH}/breweries.pq")
df_breweries.head(5)

### Ratings dataset

In [None]:
df_ratings_no_text.to_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")
df_ratings_no_text.head(5)

In [None]:
df_ratings_text.to_parquet(f"{DST_DATA_PATH}/ratings_text.pq")
df_ratings_text.head(5)

### Users dataset

In [None]:
df_users.to_parquet(f"{DST_DATA_PATH}/users.pq")
df_users.head(5)