# Data cleaning of BeerAdvocate dataset

BeerAdvocate is a popular platform for beer enthusiasts to rate and review beers. The dataset provides information about beers, breweries, users and the ratings of the users for the beers on the platform.

On the website of BeerAdvocate, users can create an account. After creating an account with a username and password, they are asked to provide their location, but this is optional and not verified. 

Users can enter a beer into the database. Upon entering a beer, they have to select the location of the brewery where the beer is produced. They are then able to provide the beer name, the beer style and optionally an ABV (alcohol percentage).

After a beer has been entered into the database, every user can rate it. Users can express their opinion on a beer in two ways: through ratings and through reviews. A rating (in the current version of the platform; [the rating system received a major update in 2014](https://www.beeradvocate.com/community/threads/beeradvocate-returns-to-its-roots-rating-system-revamped.238804/)) consists of Look, Smell, Taste, Feel and an Overall score. Scores range from 1 to 5 in quarter-point steps (e.g. 3.75). Because the dataset contains data acquired both before and after this update, the structure of the ratings varies slightly over time.

Users can then optionally add a review, which is a text description of what they think about the beer. In some cases, as explained in the linked article, a text review is mandatory when the rating is an outlier, in order to reduce the number of fraudulent ratings.
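
The scale described above can be expressed as a small validity check. This is only a sketch: the helper name `is_valid_score` is ours, and it applies to the individual aspect scores on the current quarter-point scale, not to the aggregated `rating` value (which takes values such as 3.18 in the data), nor necessarily to scores entered before the 2014 update.

```python
def is_valid_score(score: float) -> bool:
    """Return True if the score lies in [1, 5] and is a multiple of 0.25."""
    score = float(score)
    if not 1.0 <= score <= 5.0:
        return False
    # Multiples of 0.25 are exactly representable as floats,
    # so multiplying by 4 gives an exact integer for valid scores
    return (score * 4).is_integer()


print(is_valid_score(3.75))  # on the quarter-point grid
print(is_valid_score(3.8))   # between two grid points
print(is_valid_score(5.25))  # outside the 1-5 range
```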

Our sources to better study this dataset are the following:
- [BeerAdvocate's forum](https://www.beeradvocate.com/community/threads/new-ba-score-beer-ranking-more-updates.537406/)
- [BeerAdvocate's website](https://www.beeradvocate.com)

## Preliminary work

### Install and import all the needed libraries

In [1]:
# Import all the necessary libraries
import polars as pl
import os
import pandas as pd
from tabulate import tabulate
import geopandas as gpd
import numpy as np

# Import the file in the utils folder
import sys
sys.path.append('../utils')

In [2]:
# Define the paths
SRC_DATA_PATH = '../../data'
DST_DATA_PATH = '../../data/processed'

# Create the DST_DATA_PATH if it does not exist
if not os.path.exists(DST_DATA_PATH):
    os.makedirs(DST_DATA_PATH)

In [3]:
# Print a per-column summary: dtype, null counts and unique-value counts
def describe(df: pd.DataFrame, filters=()) -> None:
	headers = ["Column", "Type", "Not null count", "Nulls count", "Nulls %", "Unique values count", "Unique values %"]
	content = []
	for col in df.columns:
		if col in filters:
			continue
		col_type = df[col].dtype
		col_count = df[col].count()
		col_null = df[col].isna().sum()
		col_unique = df[col].nunique()
		col_percentage_unique = col_unique / col_count * 100
		col_percentage_null = col_null / (col_count + col_null) * 100
		content.append([col, col_type, col_count, col_null, f"{col_percentage_null:.2f} %", col_unique, f"{col_percentage_unique:.2f} %"])
	print(tabulate(content, headers, tablefmt="psql", colalign=("left", "right", "right", "right", "right", "right", "right")))

# Print min, max, mean, std and quartiles for every int/float column
def describe_number(df: pd.DataFrame, filters=()) -> None:
	headers = ["Column", "Type", "Min", "Max", "Mean", "Std", "25%", "50%", "75%"]
	content = []
	for col in df.columns:
		if col in filters:
			continue
		if df[col].dtype in [int, float]:
			col_type = df[col].dtype
			col_min = df[col].min()
			col_max = df[col].max()
			col_mean = df[col].mean()
			col_std = df[col].std()
			col_25 = df[col].quantile(0.25)
			col_50 = df[col].quantile(0.50)
			col_75 = df[col].quantile(0.75)
			content.append([col, col_type, col_min, col_max, col_mean, col_std, col_25, col_50, col_75])
	print(tabulate(content, headers, tablefmt="psql", colalign=("left", "right", "right", "right", "right", "right", "right", "right", "right")))

In [4]:
# Load the Natural Earth country polygons; the country names are used later to validate locations
world = gpd.read_file("https://naciscdn.org/naturalearth/110m/cultural/ne_110m_admin_0_countries.zip").rename(columns={'ADMIN': 'name'})
countries = np.array(world['name'].unique())

## Data exploration

In [5]:
df_beers = pd.read_csv(f"{SRC_DATA_PATH}/beers.csv")
df_breweries = pd.read_csv(f"{SRC_DATA_PATH}/breweries.csv")
df_users = pd.read_csv(f"{SRC_DATA_PATH}/users.csv")
df_ratings = pl.read_parquet(f"{SRC_DATA_PATH}/ratings.pq")

##### Beers dataset
We'll start by looking at the beers dataset.

In [6]:
df_beers.sample(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
100658,65161,57 Wild,16866,The Bruery,American Wild Ale,25,15,3.91,85.0,,6.5,3.9012,,0,
101448,191928,Costa West,36847,The Good Beer Company,Saison / Farmhouse Ale,1,1,4.0,,,5.0,4.0,,0,
24729,128771,Flying Monkeys Berry,10796,Flying Monkeys Craft Brewery,Lambic - Fruit,1,0,3.25,,,,3.25,,0,
146515,113243,KGB,16357,SanTan Brewing Co.,Russian Imperial Stout,4,0,3.94,,,9.2,3.9375,,0,
237166,226268,Helles (Weyermann Malt),16378,Heater Allen Brewing,Munich Helles Lager,7,2,4.24,,,5.0,4.14,,0,


The dataset has the following structure

| Column Name | Description | 
|-------------|-------------|
| `beer_id` | Unique identifier for each beer |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `nbr_ratings` | Number of ratings (text + non text) the beer has received |
| `nbr_reviews` | Number of written reviews (only text) the beer has received |
| `avg` | Average rating of the beer |
| `ba_score` | BeerAdvocate score for the beer (fraction of raters who gave the beer a 3.75 or higher) |
| `bros_score` | Score given by the Bros (Todd and Jason Alström, the BeerAdvocate founders) |
| `abv` | Alcohol By Volume percentage of the beer |
| `avg_computed` | Computed average rating (in some cases it differs from `avg` due to different calculation methods) |
| `zscore` | Standardized score indicating how many standard deviations the beer's rating is from the mean |
| `nbr_matched_valid_ratings` | Number of matched valid ratings |
| `avg_matched_valid_ratings` | Average of matched valid ratings |

Now that we have an idea of what the structure of our dataset looks like, let's see if our dataset is complete or whether it contains a lot of missing entries. 

In [7]:
describe(df_beers)

+---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column                    |    Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|---------------------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| beer_id                   |   int64 |           280823 |             0 |    0.00 % |                280823 |          100.00 % |
| beer_name                 |  object |           280823 |             0 |    0.00 % |                236209 |           84.11 % |
| brewery_id                |   int64 |           280823 |             0 |    0.00 % |                 14325 |            5.10 % |
| brewery_name              |  object |           280823 |             0 |    0.00 % |                 14098 |            5.02 % |
| style                     |  object |           280823 |             0 |    0.00 

In [8]:
describe_number(df_beers, filters=["beer_id", "brewery_id"])

+---------------------------+---------+---------+---------+-----------+----------+-----------+-----------+-----------+
| Column                    |    Type |     Min |     Max |      Mean |      Std |       25% |       50% |       75% |
|---------------------------+---------+---------+---------+-----------+----------+-----------+-----------+-----------|
| nbr_ratings               |   int64 |       0 |   16509 |   29.8873 |   231.01 |         1 |         2 |         8 |
| nbr_reviews               |   int64 |       0 |    3899 |   9.22142 |  68.8664 |         0 |         1 |         2 |
| avg                       | float64 |       0 |       5 |   3.72103 | 0.476003 |       3.5 |      3.78 |      4.01 |
| ba_score                  | float64 |      46 |     100 |   84.6333 |  4.05272 |        83 |        85 |        86 |
| bros_score                | float64 |      31 |     100 |   84.8066 |  10.5077 |        81 |        87 |        91 |
| abv                       | float64 |    0.01 

We can see that:
- The beer name and brewery information are present for every beer in the dataset
- Some values are missing for abv, ba_score, bros_score and the other aggregated scores
- The ba_score and the bros_score are on a 0 - 100 scale
- The ba_score and bros_score have very similar means, but the bros_score has a far larger spread (a standard deviation of about 10.5 against about 4.1). This suggests that users of the website tend to give less extreme ratings than the Bros (the owners of the website)
- The avg score lies in a 0 - 5 range (meaning that the scores are on a 0 - 5 scale)
- The abv at first seems to contain some outliers: the maximum in the table above is an ABV of 67.5 percent.

However, a closer look shows that beers with such a high ABV really exist and are not outliers. For example, the 'Brewdog Sink the Bismarck!' beer has 41% ABV and the 'Brewmeister Armageddon' has 65%. These are therefore perfectly valid entries and we will keep them in our dataframe.

Now we'll focus on dealing with the missing values. <br><br>
Regarding the ABV values, we will not fill the NaN values here: the ABV of a beer cannot be easily guessed, and approximating it with the mean or the median would introduce a significant bias in the dataset.
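
To illustrate the kind of bias meant here, the sketch below fills the gaps of a toy ABV series with the median and compares the spread before and after. The values are made up for the example; the point is only that every imputed row sits at the same value, which pulls the standard deviation down.

```python
import numpy as np
import pandas as pd

# Toy ABV values (made up); NaN marks beers without a declared ABV
abv = pd.Series([4.5, 5.0, 5.2, 6.5, 9.0, 12.0, np.nan, np.nan, np.nan, np.nan])

# Median imputation: every missing entry receives the same value
filled = abv.fillna(abv.median())

std_observed = abv.std()   # NaN values are skipped by default
std_filled = filled.std()

print(f"std of the observed values: {std_observed:.2f}")
print(f"std after the median fill:  {std_filled:.2f}")
```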

In [9]:
# Get the number of ratings for the beers with a null abv and the total number of ratings
rev_nan_beers = df_beers[df_beers["abv"].isnull()]["nbr_ratings"].sum()
tot_rev = df_beers["nbr_ratings"].sum()

# Report how many ratings belong to beers with a missing ABV
print(f"Number of ratings lost if we drop the rows with missing ABV: {rev_nan_beers}")

Number of ratings lost if we drop the rows with missing ABV: 171305


We see that by ignoring the beers with a missing abv we only lose a small fraction of the ratings (171305, roughly 2% of all ratings in the dataset). This makes sense, since a beer with a high number of ratings is unlikely to have a missing abv value.
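
The share of ratings affected can be computed directly from the beers table. The snippet below is a minimal sketch on a toy dataframe (the numbers are invented and only the column names follow the dataset):

```python
import numpy as np
import pandas as pd

# Toy beers table: two of the four beers have no ABV
beers = pd.DataFrame({
    "beer_id": [1, 2, 3, 4],
    "abv": [5.0, np.nan, 7.2, np.nan],
    "nbr_ratings": [900, 10, 80, 10],
})

# Ratings attached to beers without an ABV, and the overall total
ratings_missing_abv = beers.loc[beers["abv"].isna(), "nbr_ratings"].sum()
total_ratings = beers["nbr_ratings"].sum()
share_lost = ratings_missing_abv / total_ratings * 100

print(f"{ratings_missing_abv} of {total_ratings} ratings ({share_lost:.2f} %)")
```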

Regarding the null values of the aggregated scores:
- In most cases these null values are associated with beers that don't have any reviews. It doesn't make sense to fill these values with the mean or median of the dataset, since that would introduce a bias.
- Most of the values will change after the processing and cleaning of the data.

For these two reasons we are not going to worry too much now about the missing values of the aggregated scores.

Since beers are added by hand by users, the same beer may have been created more than once.

In [10]:
beers_duplicates = df_beers[df_beers.duplicated(subset=["beer_name", "brewery_id"], keep=False)]
beers_duplicates.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,bros_score,abv,avg_computed,zscore,nbr_matched_valid_ratings,avg_matched_valid_ratings
7862,207341,10|05 Coffee Porter (San Sebastian),33192,Brew By Numbers,English Porter,1,0,4.05,,,6.5,4.05,,0,
7863,255983,10|05 Coffee Porter (San Sebastian),33192,Brew By Numbers,English Porter,1,0,4.13,,,6.5,4.13,,0,
9299,27803,Nappa Scar,12981,Yorkshire Dales Brewing Company,Extra Special / Strong Bitter (ESB),1,1,3.5,,,4.8,3.5,,0,
9300,289247,Nappa Scar,12981,Yorkshire Dales Brewing Company,English Bitter,1,0,3.5,,,4.0,3.5,,0,
25499,259896,Sage Farm,44830,Half Hours on Earth,Saison / Farmhouse Ale,3,1,3.54,,,6.0,3.76,,0,


Since some of the values (e.g. the abv or the style) are not always consistent between duplicates, we cannot tell which entry is the correct one. To reduce the risk of errors or bias, we simply drop every copy of a duplicated beer.
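
The choice of `keep=False` matters here: with the default `keep="first"`, pandas flags only the later copies, so one (possibly wrong) entry per beer would survive. A small sketch on toy data (beer ids are made up, and 'Toy Lager' is an invented non-duplicated beer):

```python
import pandas as pd

beers = pd.DataFrame({
    "beer_id": [1, 2, 3],
    "beer_name": ["Nappa Scar", "Nappa Scar", "Toy Lager"],
    "brewery_id": [12981, 12981, 1],
    "abv": [4.8, 4.0, 5.0],
})
key = ["beer_name", "brewery_id"]

# keep=False flags every member of a duplicated group
dups_all = beers[beers.duplicated(subset=key, keep=False)]
# the default keep="first" flags only the second and later copies
dups_later = beers[beers.duplicated(subset=key)]

# Dropping all flagged ids leaves only the beers that were never duplicated
cleaned = beers[~beers["beer_id"].isin(dups_all["beer_id"])]

print(len(dups_all), len(dups_later), cleaned["beer_id"].tolist())
```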

In [11]:
# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
df_beers = df_beers[~df_beers['beer_id'].isin(beers_duplicates['beer_id'])].copy()

To better handle the data we are also going to split the nbr_ratings into two distinct columns:
- nbr_interactions: will be used to denote the number of interactions in total (number of text reviews + number of non text reviews)
- nbr_ratings: will be used to denote only the non text reviews
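
After such a split the three counters should satisfy `nbr_ratings + nbr_reviews == nbr_interactions`, and `nbr_ratings` should never be negative. A quick sanity check could be sketched as follows, on a toy dataframe whose counts are taken from the sample rows shown earlier:

```python
import pandas as pd

beers = pd.DataFrame({
    "beer_id": [1, 2, 3],
    "nbr_ratings": [25, 1, 7],  # raw column: text + non text
    "nbr_reviews": [15, 1, 2],  # text only
})

# Same split as described above
beers["nbr_interactions"] = beers["nbr_ratings"]
beers["nbr_ratings"] = beers["nbr_ratings"] - beers["nbr_reviews"]

# Invariants that should hold after the split
non_negative = bool((beers["nbr_ratings"] >= 0).all())
consistent = bool(
    (beers["nbr_ratings"] + beers["nbr_reviews"] == beers["nbr_interactions"]).all()
)
print(non_negative, consistent)
```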

In [12]:
df_beers['nbr_interactions'] = df_beers['nbr_ratings']
df_beers['nbr_ratings'] = df_beers['nbr_ratings'] - df_beers['nbr_reviews']

In [13]:
col_to_drop = ['bros_score','avg_computed','zscore','nbr_matched_valid_ratings','avg_matched_valid_ratings']
df_beers = df_beers.drop(columns=col_to_drop)
df_beers.sample(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
148205,61681,Stoudts Abbey Triple Blended With Double IPA,394,Stoudts Brewing Co.,Belgian IPA,0,1,3.07,,9.5,1
126511,225481,Mangoteka,34104,Greenpoint Beer & Ale Company,American Wild Ale,0,1,4.32,,4.5,1
267579,189654,La Damoiselle Blonde,42096,Brasserie de l'Alagnon,Belgian Pale Ale,1,1,3.69,,5.0,2
40433,132961,Hüchelner Urstoff,27646,Hüchelner Urstoff Brauhaus Hintermeier,Kölsch,0,1,3.27,,4.7,1
243932,144884,Panzerfaust,34141,Stubborn Beauty Brewing Company,Weizenbock,15,2,3.94,85.0,8.3,17


##### Breweries dataset

Now we can take a look at our breweries dataframe. For each brewery we have a location, a name and the number of beers it produces, along with a unique identifier. The brewery location can be very useful for our research questions. We will look into this in more detail later in this notebook.

In [14]:
df_breweries.sample(5)

Unnamed: 0,id,location,name,nbr_beers
9460,37302,"United States, Massachusetts",Fort Hill Brewery,23
1105,16245,England,Cock & Hen,0
5925,3527,Luxembourg,Brasserie Bofferding,7
6392,8044,Austria,Königsdorfer Bergbräu Löffler,0
16228,2103,"United States, Texas","Laboratory Brewing Company, The",0


In [15]:
print(f"Number of breweries: {df_breweries['id'].nunique()}")
original_nb_breweries = df_breweries['id'].nunique()

Number of breweries: 16758


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `id` | Unique identifier for each brewery |
| `location` | Geographic location of the brewery (country; for the United States, country and state) |
| `name` | Name of the brewery |
| `nbr_beers` | Number of beers produced by the brewery |

In [16]:
describe(df_breweries)

+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+
| Column    |   Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-----------+--------+------------------+---------------+-----------+-----------------------+-------------------|
| id        |  int64 |            16758 |             0 |    0.00 % |                 16758 |          100.00 % |
| location  | object |            16758 |             0 |    0.00 % |                   297 |            1.77 % |
| name      | object |            16758 |             0 |    0.00 % |                 16237 |           96.89 % |
| nbr_beers |  int64 |            16758 |             0 |    0.00 % |                   273 |            1.63 % |
+-----------+--------+------------------+---------------+-----------+-----------------------+-------------------+


In [17]:
describe_number(df_breweries, filters=["id"])

+-----------+--------+-------+-------+---------+---------+-------+-------+-------+
| Column    |   Type |   Min |   Max |    Mean |     Std |   25% |   50% |   75% |
|-----------+--------+-------+-------+---------+---------+-------+-------+-------|
| nbr_beers |  int64 |     0 |  1196 | 21.0563 | 69.4178 |     2 |     6 |    18 |
+-----------+--------+-------+-------+---------+---------+-------+-------+-------+


For our breweries, we have no missing values. <br><br>
From the analysis we can see that some of our breweries have a non-unique name. We need to verify whether multiple breweries with the same name exist in the same country.

In [18]:
breweries_duplicates = df_breweries[df_breweries.duplicated(subset=["name", "location"], keep=False)]
breweries_duplicates.head(5)

Unnamed: 0,id,location,name,nbr_beers
68,34598,Wales,Rhymney Brewery,2
69,12936,Wales,Rhymney Brewery,13
213,11410,England,Bartrams Brewery,11
214,7095,England,Bartrams Brewery,0
344,31304,England,Dorset Piddle Brewery,0


We see that some breweries have the same name and are located in the same country. There are two possible explanations:
- Multiple users have inserted the same brewery into the database
- Multiple breweries in one country really do share the same name

Both explanations are reasonable, but we can't be sure which one is correct. To avoid introducing errors in our dataset we are going to drop the duplicates.

In [19]:
df_breweries = df_breweries[~df_breweries['id'].isin(breweries_duplicates['id'])]

Since we modified both the breweries and the beers datasets, let's recompute the shared values. At the same time we are going to drop the breweries that don't have any beer associated with them.

In [20]:
# Filter the beers dataset to remove the ones that do not have a corresponding brewery
df_beers = df_beers[df_beers["brewery_id"].isin(df_breweries["id"])]

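# Recount the number of beers per brewery. The assignment realigns on the index, so breweries
# without any remaining beer get NaN (and the column becomes float); those rows are removed
# by the dropna() applied after the location cleanup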
count_beers = df_beers["brewery_id"].value_counts()
df_breweries["nbr_beers"] = df_breweries["id"].map(count_beers).dropna().astype(int)

# Show a sample of the breweries dataset
df_breweries.sample(5)

Unnamed: 0,id,location,name,nbr_beers
6930,44496,Czech Republic,Zámecký Pivovar Frýdlant,15.0
9348,43983,"United States, Washington",Quartzite Brewing Company,1.0
4742,49130,Germany,Café Weichhardt,1.0
5967,19536,Spain,La Pubilla - Cerveza Artesanal,1.0
8318,49767,"United States, West Virginia",Weathered Ground Brewery,
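
Note that `nbr_beers` comes back as a float column with a missing value in the sample above. This happens because pandas realigns a Series on the index when it is assigned to a column, so the rows removed by `dropna()` reappear as NaN. The toy example below (made-up breweries and beers) reproduces the effect and sketches one possible way to drop the empty breweries while keeping integer counts; it is an alternative for illustration, not the approach used in the rest of this notebook.

```python
import pandas as pd

breweries = pd.DataFrame({"id": [10, 20, 30], "name": ["A", "B", "C"]})
beers = pd.DataFrame({"beer_id": [1, 2, 3], "brewery_id": [10, 10, 30]})

counts = beers["brewery_id"].value_counts()

# The right-hand side has only two rows, but the assignment realigns on the
# index of `breweries`, so brewery 20 gets NaN and the column is float
breweries["nbr_beers"] = breweries["id"].map(counts).dropna().astype(int)
print(breweries["nbr_beers"].dtype)

# Alternative: map first, drop the rows without beers, then cast to int
recounted = breweries.drop(columns="nbr_beers").assign(
    nbr_beers=breweries["id"].map(counts)
)
recounted = recounted.dropna(subset=["nbr_beers"]).astype({"nbr_beers": int})
print(recounted)
```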


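Next, we make the brewery locations homogeneous. The function below strips leftover HTML from the location, normalises United States entries to the form `United States,State` (entries without a valid state are discarded), collapses the United Kingdom regions and the Canadian provinces into a single country each, and renames a few countries so that they match the names used in the world map loaded earlier.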

In [21]:
import re

def change_location(x):
    # Define some remapping
    remapping = {
        'Wales': 'United Kingdom',
        'Scotland': 'United Kingdom',
        'Northern Ireland': 'United Kingdom',
        'Virgin Islands (British)': 'United Kingdom',
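        # FIXME: the U.S. Virgin Islands are a US territory, so mapping them to the United Kingdom is probably wrong
        'Virgin Islands (U.S.)': 'United Kingdom',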
        'Czech Republic': 'Czechia',
        'Slovak Republic': 'Slovakia',
        'Serbia': 'Republic of Serbia',
    }

    # Clean up HTML-like content and extract meaningful text
    if re.search(r"</a>", x['location']):
        # Extract the part before the HTML tag
        x['location'] = re.split(r"</a>", x['location'])[0].strip()

    # Standardize United States locations
    if 'United States' in x['location']:
        # Check for proper format "United States, State"
        if ',' in x['location']:
            country, state = x['location'].split(',', 1)
            state = state.strip()
            # Exclude entries like "United States, United States" or "United States, {number}"
            if state.lower() == "united states" or re.match(r'^\d+$', state):
                return None
            x['location'] = f"United States,{state}"
        else:
            # If no state information is present, mark for deletion
            return None

    # Handle cases like "Utah</a><br><a ..."
    elif 'Utah' in x['location']:
        x['location'] = 'United States,Utah'  # same "United States,State" format as above

    # Standardize United Kingdom
    elif any(uk_region in x['location'] for uk_region in ['United Kingdom', 'England', 'Scotland', 'Wales', 'Northern Ireland']):
        x['location'] = 'United Kingdom'

    # Standardize Canada: collapse all provinces into just "Canada"
    elif 'Canada' in x['location']:
        x['location'] = 'Canada'

    # Apply remapping if location exists in the dictionary
    if x['location'] in remapping:
        x['location'] = remapping[x['location']]

    return x

# Apply the function, dropping rows where None is returned
df_breweries = df_breweries.apply(change_location, axis=1).dropna()
df_breweries['location'].unique()

array(['Kyrgyzstan', 'Gabon', 'United Kingdom', 'Singapore', 'China',
       'Chad', 'Saint Lucia', 'Cameroon', 'Burkina Faso', 'Zambia',
       'Romania', 'Nigeria', 'South Korea', 'Georgia', 'Hong Kong',
       'Guinea', 'Montenegro', 'Benin', 'Mexico', 'Fiji Islands', 'Guam',
       'Laos', 'Senegal', 'Honduras', 'Morocco', 'Indonesia', 'Monaco',
       'Ukraine', 'Canada', 'Jordan', 'Portugal', 'Guernsey', 'India',
       'Puerto Rico', 'Japan', 'Iran', 'Hungary', 'Bulgaria',
       'Guinea-Bissau', 'Liberia', 'Togo', 'Niger', 'Croatia',
       'Lithuania', 'Cyprus', 'Italy', 'Andorra', 'Botswana',
       'Turks and Caicos Islands', 'Papua New Guinea', 'Mongolia',
       'Ethiopia', 'Denmark', 'French Polynesia', 'Greece', 'Sri Lanka',
       'Syria', 'Germany', 'Jersey', 'Armenia', 'Mozambique', 'Palestine',
       'Bangladesh', 'Turkmenistan', 'Reunion', 'Eritrea', 'Switzerland',
       'Malta', 'Israel', 'El Salvador', 'French Guiana', 'Tonga',
       'Zimbabwe', 'Samoa', 'Barba

In [22]:
# Remove some locations
to_remove = ['UNKNOWN']
countries_to_remove = np.setdiff1d(np.array(df_breweries['location'].unique()), countries)
to_remove = np.concatenate([to_remove, countries_to_remove])
# Filter the breweries dataset
df_breweries = df_breweries[~df_breweries['location'].isin(to_remove)]

print(df_breweries['location'].unique())


['Kyrgyzstan' 'Gabon' 'United Kingdom' 'China' 'Chad' 'Cameroon'
 'Burkina Faso' 'Zambia' 'Romania' 'Nigeria' 'South Korea' 'Georgia'
 'Guinea' 'Montenegro' 'Benin' 'Mexico' 'Laos' 'Senegal' 'Honduras'
 'Morocco' 'Indonesia' 'Ukraine' 'Canada' 'Jordan' 'Portugal' 'India'
 'Puerto Rico' 'Japan' 'Iran' 'Hungary' 'Bulgaria' 'Guinea-Bissau'
 'Liberia' 'Togo' 'Niger' 'Croatia' 'Lithuania' 'Cyprus' 'Italy'
 'Botswana' 'Papua New Guinea' 'Mongolia' 'Ethiopia' 'Denmark' 'Greece'
 'Sri Lanka' 'Syria' 'Germany' 'Armenia' 'Mozambique' 'Palestine'
 'Bangladesh' 'Turkmenistan' 'Eritrea' 'Switzerland' 'Israel'
 'El Salvador' 'Zimbabwe' 'Chile' 'Cambodia' 'Bhutan' 'Uzbekistan' 'Egypt'
 'Uruguay' 'Dominican Republic' 'Equatorial Guinea' 'Russia' 'Tajikistan'
 'Vietnam' 'Namibia' 'Australia' 'Ecuador' 'Vanuatu' 'Uganda' 'Azerbaijan'
 'Argentina' 'Tunisia' 'Belize' 'Luxembourg' 'Madagascar' 'Spain'
 'South Sudan' 'Belarus' 'Ivory Coast' 'Austria' 'Bolivia'
 'Central African Republic' 'Suriname' 'Solomon
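
The filter above keeps a location only if it matches a name in the reference list exactly, which is why the remapping step renames entries such as 'Czech Republic' to 'Czechia'. A minimal sketch of the same logic on toy data, where `reference_names` is a made-up stand-in for the Natural Earth country names:

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened reference list of accepted country names
reference_names = np.array(["Canada", "Germany", "United Kingdom"])

breweries = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "location": ["Germany", "Hong Kong", "United Kingdom", "UNKNOWN"],
})

# Locations present in the data but absent from the reference list
unmatched = np.setdiff1d(breweries["location"].unique(), reference_names)

# Keep only the breweries whose location is a known country name
kept = breweries[~breweries["location"].isin(unmatched)]

print(sorted(unmatched.tolist()))
print(kept["id"].tolist())
```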

In [23]:
print(f"Number of breweries: {df_breweries['id'].nunique()}")
print(f"Number of breweries removed: {original_nb_breweries - df_breweries['id'].nunique()}")

Number of breweries: 7474
Number of breweries removed: 9284


While we still have some beers with the same name, we are now sure that these beers are produced by different breweries. <br><br>
We are going to recompute the number of ratings, interactions and reviews for each beer later in this notebook.
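
One way such a recomputation could look is a group-by over the ratings table, using the `review` flag to separate text reviews from plain ratings. This is only a sketch on toy data, under the assumption that each row of the ratings table is one interaction; it is not necessarily the code used later in the notebook.

```python
import pandas as pd

# Toy ratings: `review` is True when the rating comes with a text
ratings = pd.DataFrame({
    "beer_id": [1, 1, 1, 2],
    "review": [True, False, False, True],
})

per_beer = (
    ratings.groupby("beer_id")
    .agg(
        nbr_interactions=("review", "size"),  # all rows for the beer
        nbr_reviews=("review", "sum"),        # rows with text
    )
    .reset_index()
)
# Ratings without text are the remainder
per_beer["nbr_ratings"] = per_beer["nbr_interactions"] - per_beer["nbr_reviews"]

print(per_beer)
```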

##### Users dataset

For each of our users, we have their number of ratings/reviews, their ID, their name, the timestamp of when they joined and their location. The location is interesting for us, along with the number of reviews of each user.

In [24]:
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location
22896,37,37,tstandley13.135762,TStandley13,1177582000.0,"United States, Massachusetts"
84540,3,0,ashtx.998293,Ashtx,1433930000.0,"United States, Texas"
131701,20,0,moneyt23.1060920,MoneyT23,1446203000.0,"United States, Michigan"
91303,1,1,brewspongeoregon.570535,BrewSpongeOregon,1298286000.0,"United States, Oregon"
71091,4,0,moconn10.404151,moconn10,1260702000.0,"United States, Kentucky"


In [25]:
print(f"Number of users: {df_users['user_id'].nunique()}")
original_nb_users = df_users['user_id'].nunique()

Number of users: 153704


This dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `user_name` | Username of the reviewer |
| `joined` | Date when the user joined BeerAdvocate (stored as a Unix timestamp in seconds) |
| `location` | Geographic location of the user |
| `nbr_ratings` | Number of ratings submitted by the user |
| `nbr_reviews` | Number of written reviews submitted by the user |

In [26]:
describe(df_users)

+-------------+---------+------------------+---------------+-----------+-----------------------+-------------------+
| Column      |    Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|-------------+---------+------------------+---------------+-----------+-----------------------+-------------------|
| nbr_ratings |   int64 |           153704 |             0 |    0.00 % |                  2053 |            1.34 % |
| nbr_reviews |   int64 |           153704 |             0 |    0.00 % |                  1265 |            0.82 % |
| user_id     |  object |           153704 |             0 |    0.00 % |                153704 |          100.00 % |
| user_name   |  object |           153703 |             1 |    0.00 % |                153703 |          100.00 % |
| joined      | float64 |           151052 |          2652 |    1.73 % |                  5524 |            3.66 % |
| location    |  object |           122425 |         31279 |   2

In [27]:
describe_number(df_users)

+-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------+
| Column      |    Type |         Min |        Max |        Mean |         Std |         25% |         50% |         75% |
|-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------|
| nbr_ratings |   int64 |           1 |      12046 |     54.6052 |     252.389 |           1 |           3 |          16 |
| nbr_reviews |   int64 |           0 |       8970 |     16.8479 |     139.847 |           0 |           0 |           2 |
| joined      | float64 | 8.40794e+08 | 1.5015e+09 | 1.35724e+09 | 9.19513e+07 | 1.30312e+09 | 1.39194e+09 | 1.41769e+09 |
+-------------+---------+-------------+------------+-------------+-------------+-------------+-------------+-------------+


As we did for the beers dataset, we split the counters into nbr_interactions, nbr_ratings and nbr_reviews. We also convert the joined column from a Unix timestamp to a datetime object.

In [28]:
df_users['nbr_interactions'] = df_users['nbr_ratings']
df_users['nbr_ratings'] = df_users['nbr_ratings'] - df_users['nbr_reviews']
df_users['joined'] = pd.to_datetime(df_users['joined'], unit='s')
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
28077,1,0,snowfall.860051,snowfall,2014-09-07 10:00:00,"United States, California",1
9611,5,57,spelingchampeon.1111368,spelingchampeon,2016-02-21 11:00:00,"United States, Delaware",62
85612,0,7,pi5porter.573395,pi5porter,2011-02-27 11:00:00,"United States, Utah",7
82130,28,0,misterchief.521791,misterchief,2010-11-04 11:00:00,"United States, California",28
4361,3722,42,krome.355414,krome,2009-08-01 10:00:00,"United States, Illinois",3764
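
The `joined` values are interpreted as seconds since the Unix epoch, and missing values become `NaT`. A minimal illustration, using two timestamps from the sample shown earlier plus one missing value:

```python
import pandas as pd

# Two raw `joined` values from the users sample, and one missing entry
joined_raw = pd.Series([1177582000.0, 1433930000.0, None])

# unit="s" tells pandas that the numbers are seconds since 1970-01-01 (UTC)
joined = pd.to_datetime(joined_raw, unit="s")

print(joined)
```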


In [29]:
df_users.dropna(subset=['location', 'joined'], inplace=True)
df_users.sample(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
8129,7,26,raebies.638517,Raebies,2011-11-21 11:00:00,Australia,33
13142,1,0,mikeebabee.660449,Mikeebabee,2012-02-19 11:00:00,England,1
119544,1,1,keng47.787202,Keng47,2014-03-08 11:00:00,"United States, Pennsylvania",2
52406,34,0,beer-lovin-rat.948511,Beer-Lovin-RAT,2015-02-22 11:00:00,"United States, Kansas",34
99768,0,1,mc_wright.760661,MC_Wright,2013-10-25 10:00:00,"United States, Oregon",1


In [30]:
# Rows with a missing location or join date were already dropped in the previous cell;
# here we only normalise the remaining locations
df_users = df_users.apply(change_location, axis=1)

In [31]:
print(f"Number of users: {df_users['user_id'].nunique()}")
print(f"Number of users removed: {original_nb_users - df_users['user_id'].nunique()}")

Number of users: 122425
Number of users removed: 31279


##### Ratings

In [32]:
df_ratings.sample(5)

user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,text,aroma,overall,taste,style,beer_name,brewery_id,date
str,f64,bool,f64,str,str,i64,f64,f64,str,f64,f64,f64,str,str,i64,datetime[μs]
"""overlord.145338""",3.18,True,4.9,"""Boston Beer Company (Samuel Ad…","""Overlord""",21300,3.5,3.5,"""Samuel Adams talks a good game…",3.0,3.5,3.0,"""Schwarzbier""","""Samuel Adams Black Lager""",35,2007-12-25 12:00:00
"""whitey_from_remedy.935651""",4.36,False,7.5,"""Tröegs Brewing Company""","""Whitey_from_Remedy""",15881,4.5,4.25,,4.0,4.5,4.5,"""American Amber / Red Ale""","""Tröegs Nugget Nectar""",694,2015-02-18 12:00:00
"""bsetz.751251""",4.25,False,7.5,"""Mikkeller ApS""","""Bsetz""",56836,,,,,,,"""American Stout""","""Beer Hop Breakfast""",13307,2013-12-03 12:00:00
"""crytion.473047""",3.5,False,4.0,"""Rivertown Brewing Co.""","""crytion""",56393,,,,,,,"""Oatmeal Stout""","""Oatmeal Stout""",22157,2013-01-29 12:00:00
"""bultrey.4653""",4.3,True,6.0,"""Offshore Ale Company""","""bultrey""",22781,4.0,4.0,"""Bright copper, good persistent…",4.0,4.5,4.5,"""American IPA""","""Offshore India Pale Ale""",236,2005-08-03 12:00:00


In [33]:
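import pyarrow  # required by polars for DataFrame.to_pandas()

# Split the ratings into two parts that share a positional index `idx`:
# the (heavy) review text stays in polars, everything else moves to pandas
idx_list = np.arange(df_ratings.shape[0])
df_ratings_text = df_ratings.select('text')
df_ratings_text = df_ratings_text.with_columns(pl.Series(name='idx', values=idx_list))
df_ratings_no_text = df_ratings.drop('text').to_pandas()
df_ratings_no_text['idx'] = idx_list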

In [34]:
df_ratings_text.head(5)
number_of_ratings = df_ratings.shape[0]
print(f"Number of ratings: {number_of_ratings}")

Number of ratings: 8393032
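
Because both parts carry the same `idx`, the text can be attached again after the lighter table has been filtered. The sketch below shows the idea with two small pandas dataframes; using pandas for the text part is a simplification for the example, since the notebook keeps the text in a polars dataframe.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": ["a.1", "b.2", "c.3"],
    "rating": [3.18, 4.36, 4.3],
    "text": ["Talks a good game", None, "Bright copper"],
})
ratings["idx"] = range(len(ratings))

# Split: text on one side, everything else on the other, both keyed by idx
text_part = ratings[["idx", "text"]]
no_text_part = ratings.drop(columns="text")

# Work on the light table, then bring the text back only for the kept rows
filtered = no_text_part[no_text_part["rating"] > 4.0]
with_text = filtered.merge(text_part, on="idx", how="left")

print(with_text)
```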


In [35]:
df_ratings_no_text.head(5)

Unnamed: 0,user_id,rating,review,abv,brewery_name,user_name,beer_id,appearance,palate,aroma,overall,taste,style,beer_name,brewery_id,date,idx
0,nmann08.184925,2.88,True,4.5,Societe des Brasseries du Gabon (SOBRAGA),nmann08,142544,3.25,3.25,2.75,3.0,2.75,Euro Pale Lager,Régab,37262,2015-08-20 12:00:00,0
1,stjamesgate.163714,3.67,True,4.5,Strangford Lough Brewing Company Ltd,StJamesGate,19590,3.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2009-02-20 12:00:00,1
2,mdagnew.19527,3.73,True,4.5,Strangford Lough Brewing Company Ltd,mdagnew,19590,4.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2006-03-13 12:00:00,2
3,helloloser12345.10867,3.98,True,4.5,Strangford Lough Brewing Company Ltd,helloloser12345,19590,4.0,4.0,3.5,4.5,4.0,English Pale Ale,Barelegs Brew,10093,2004-12-01 12:00:00,3
4,cypressbob.3708,4.0,True,4.5,Strangford Lough Brewing Company Ltd,cypressbob,19590,4.0,4.0,4.0,4.0,4.0,English Pale Ale,Barelegs Brew,10093,2004-08-30 12:00:00,4


Let's first check how complete each column is.

In [36]:
describe(df_ratings_no_text, filters=['idx'])

+--------------+----------------+------------------+---------------+-----------+-----------------------+-------------------+
| Column       |           Type |   Not null count |   Nulls count |   Nulls % |   Unique values count |   Unique values % |
|--------------+----------------+------------------+---------------+-----------+-----------------------+-------------------|
| user_id      |         object |      8.39303e+06 |             0 |    0.00 % |                153704 |            1.83 % |
| rating       |        float64 |      8.39303e+06 |             0 |    0.00 % |                   401 |            0.00 % |
| review       |           bool |      8.39303e+06 |             0 |    0.00 % |                     2 |            0.00 % |
| abv          |        float64 |      8.22173e+06 |        171305 |    2.04 % |                   843 |            0.01 % |
| brewery_name |         object |      8.39303e+06 |             0 |    0.00 % |                 13440 |            0.16 % |


The dataset has the following structure:

| Column Name | Description |
|-------------|-------------|
| `user_id` | Unique identifier for each user |
| `rating` | Global rating of the beer from the user |
| `review` | Flag to tell if the rating has text or not |
| `abv` | Alcohol By Volume percentage of the beer |
| `brewery_name` | Name of the brewery that produced the beer |
| `user_name` | Username of the reviewer |
| `beer_id` | Unique identifier for each beer |
| `appearance` | Rating of the appearance of the beer |
| `palate` | Rating of the palate of the beer |
| `text` | Text of the review |
| `aroma` | Rating of the aroma of the beer |
| `overall` | Overall rating of the beer |
| `taste` | Rating of the taste of the beer |
| `style` | Style or category of the beer (e.g., IPA, Stout, Lager) |
| `beer_name` | Name of the beer |
| `brewery_id` | Unique identifier for the brewery that produced the beer |
| `date` | Date when the review was submitted |

In [37]:
describe_number(df_ratings_no_text, filters=['idx'])

+------------+---------+-------+--------+---------+----------+-------+-------+-------+
| Column     |    Type |   Min |    Max |    Mean |      Std |   25% |   50% |   75% |
|------------+---------+-------+--------+---------+----------+-------+-------+-------|
| rating     | float64 |     1 |      5 | 3.88213 | 0.620509 |  3.54 |     4 |  4.25 |
| abv        | float64 |  0.01 |   67.5 | 7.33027 |  2.45911 |   5.5 |   6.9 |   8.8 |
| beer_id    |   int64 |     3 | 293296 | 66754.4 |  64818.2 |  9074 | 52266 | 96548 |
| appearance | float64 |     1 |      5 | 3.93721 | 0.558413 |  3.75 |     4 |  4.25 |
| palate     | float64 |     1 |      5 | 3.86586 | 0.608391 |   3.5 |     4 |  4.25 |
| aroma      | float64 |     1 |      5 |  3.8696 | 0.620445 |   3.5 |     4 |  4.25 |
| overall    | float64 |     1 |      5 | 3.90053 | 0.616106 |   3.5 |     4 |  4.25 |
| taste      | float64 |     1 |      5 | 3.90329 | 0.643316 |   3.5 |     4 |  4.25 |
| brewery_id |   int64 |     1 |  49815 | 9

In [38]:
# Keep only ratings whose brewery, beer, and user are present in the cleaned tables
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['brewery_id'].isin(df_breweries['id'])]
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['beer_id'].isin(df_beers['beer_id'])]
df_ratings_no_text = df_ratings_no_text[df_ratings_no_text['user_id'].isin(df_users['user_id'])]
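The three filters above keep only ratings that reference a known brewery, beer, and user. Before applying them, it can be useful to know how many rows each one would drop. The following is a minimal sketch on toy data; `count_orphans` is a hypothetical helper and the small tables are invented stand-ins for the notebook's dataframes.

```python
import pandas as pd


def count_orphans(df: pd.DataFrame, column: str, valid_keys: pd.Series) -> int:
    """Return the number of rows in `df` whose `column` value is absent from `valid_keys`."""
    return int((~df[column].isin(valid_keys)).sum())


# Toy stand-ins for the notebook's tables (hypothetical values).
ratings = pd.DataFrame({
    "beer_id": [1, 2, 3, 4],
    "brewery_id": [10, 10, 20, 99],
    "user_id": ["a.1", "b.2", "zz.9", "a.1"],
})
breweries = pd.DataFrame({"id": [10, 20]})
beers = pd.DataFrame({"beer_id": [1, 2, 3]})
users = pd.DataFrame({"user_id": ["a.1", "b.2"]})

# Rows that each filter would remove if applied on its own.
orphans = {
    "brewery_id": count_orphans(ratings, "brewery_id", breweries["id"]),
    "beer_id": count_orphans(ratings, "beer_id", beers["beer_id"]),
    "user_id": count_orphans(ratings, "user_id", users["user_id"]),
}
print(orphans)
```

Each count is computed against the unfiltered table, so the same row may be counted under more than one key.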

In [39]:
to_remove = ['user_name']
df_ratings_no_text = df_ratings_no_text.drop(columns=to_remove)

In [40]:
breweries_location = df_breweries[['id', 'location']]
df_ratings_no_text = df_ratings_no_text.merge(breweries_location, left_on='brewery_id', right_on='id', how='left')
df_ratings_no_text = df_ratings_no_text.drop(columns=['id'])
df_ratings_no_text.sample(5)

Unnamed: 0,user_id,rating,review,abv,brewery_name,beer_id,appearance,palate,aroma,overall,taste,style,beer_name,brewery_id,date,idx,location
579419,ommegangpbr.15278,4.0,True,4.8,Privatbrauerei Gaffel Becker & Co.,4137,4.0,4.0,4.0,4.0,4.0,Kölsch,Gaffel Kölsch,1536,2005-06-28 12:00:00,625094,Germany
986375,globetrotter.1845,2.63,True,5.0,Stella Artois,449,3.0,2.5,2.5,3.0,2.5,Euro Pale Lager,Stella Artois,169,2003-11-08 12:00:00,7940253,Belgium
913925,socon67.471054,3.21,True,4.5,Guinness Ltd.,29602,4.0,3.5,3.0,3.5,3.0,Irish Red Ale,Smithwick's Imported Premium Irish Ale,209,2011-11-08 12:00:00,988302,Ireland
1030338,harkrank.584231,4.25,False,9.0,Brasserie Caracole,2319,,,,,,Belgian Strong Dark Ale,Nostradamus,753,2011-12-23 12:00:00,7991428,Belgium
924,andrenaline.393082,3.7,True,4.2,The Celt Experience,69185,4.0,4.0,4.0,3.5,3.5,English Pale Ale,Golden,20776,2011-09-25 12:00:00,959,United Kingdom
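The merge above attaches the brewery location to every rating. A left merge silently duplicates rows if the right-hand table contains repeated keys, so one possible safeguard is the `validate` argument of `DataFrame.merge`. The sketch below uses invented toy tables and renames the key beforehand, which avoids having to drop the extra `id` column afterwards.

```python
import pandas as pd

# Toy stand-ins for the ratings and breweries tables (hypothetical values).
ratings = pd.DataFrame({"beer_id": [1, 2, 3], "brewery_id": [10, 10, 20]})
breweries = pd.DataFrame({"id": [10, 20], "location": ["Belgium", "Germany"]})

# Rename the key so both tables share the same column name.
locations = breweries[["id", "location"]].rename(columns={"id": "brewery_id"})

# validate="many_to_one" raises pandas.errors.MergeError if a brewery_id
# appears more than once in `locations`.
merged = ratings.merge(locations, on="brewery_id", how="left", validate="many_to_one")

# A left merge against unique keys must preserve the number of ratings.
assert len(merged) == len(ratings)
print(merged)
```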


In [41]:
df_ratings_text = df_ratings_text.filter(df_ratings_text['idx'].is_in(df_ratings_no_text['idx']))

## Final processing and saving
### Beers dataset

In [42]:
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
0,166064,Nashe Moskovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.7,0
1,166065,Nashe Pivovskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,3.8,0
2,166066,Nashe Shakhterskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.8,0
3,166067,Nashe Zhigulevskoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.0,0
4,166063,Zhivoe,39912,Abdysh-Ata (Абдыш Ата),Euro Pale Lager,0,0,,,4.5,0


In [43]:
# Count, per beer, the ratings without text and the ratings with a text review
nbr_ratings_count = df_ratings_no_text[~df_ratings_no_text['review']].value_counts('beer_id')
nbr_reviews_count = df_ratings_no_text[df_ratings_no_text['review']].value_counts('beer_id')
df_beers['nbr_ratings'] = df_beers['beer_id'].map(nbr_ratings_count).fillna(0).astype(int)
df_beers['nbr_reviews'] = df_beers['beer_id'].map(nbr_reviews_count).fillna(0).astype(int)
df_beers['nbr_interactions'] = df_beers['nbr_ratings'] + df_beers['nbr_reviews']
# Drop beers with no remaining interactions; copy so that later column assignments do not act on a slice
df_beers = df_beers[df_beers['nbr_interactions'] > 0].copy()
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions
23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,0,1,2.88,,4.5,1
24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,0,4,3.85,,4.5,4
25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,14,58,3.45,80.0,4.8,72
26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,2,6,3.86,,6.0,8
27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,14,47,3.56,82.0,4.2,61
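The cell above recomputes `nbr_ratings` and `nbr_reviews` from the cleaned ratings with two separate filters. As an alternative sketch, both counts can be obtained in a single `groupby` pass. The data below is invented for illustration, and the column names simply mirror those of the notebook.

```python
import pandas as pd

# Toy ratings table (hypothetical values): `review` is True when the rating has text.
ratings = pd.DataFrame({
    "beer_id": [1, 1, 1, 2, 3],
    "review": [True, False, True, True, False],
})

is_review = ratings["review"].astype(bool)

# Turn the flag into two 0/1 indicator columns and sum them per beer.
counts = (
    ratings.assign(
        nbr_ratings=(~is_review).astype(int),
        nbr_reviews=is_review.astype(int),
    )
    .groupby("beer_id")[["nbr_ratings", "nbr_reviews"]]
    .sum()
)
counts["nbr_interactions"] = counts["nbr_ratings"] + counts["nbr_reviews"]
print(counts)
```

The resulting frame is indexed by `beer_id`, so it could be joined onto the beers table instead of mapping each column separately.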


In [44]:
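# Per-beer statistics of the global rating and per-beer means of the aspect scores;
# beers for which a statistic cannot be computed get 0
mean_rating = df_ratings_no_text['rating'].groupby(df_ratings_no_text['beer_id']).mean().round(2)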
df_beers['avg'] = df_beers['beer_id'].map(mean_rating).fillna(0)

std_rating = df_ratings_no_text['rating'].groupby(df_ratings_no_text['beer_id']).std().round(2)
df_beers['std'] = df_beers['beer_id'].map(std_rating).fillna(0)

median_rating = df_ratings_no_text['rating'].groupby(df_ratings_no_text['beer_id']).median().round(2)
df_beers['median'] = df_beers['beer_id'].map(median_rating).fillna(0)

appearance_rating = df_ratings_no_text['appearance'].groupby(df_ratings_no_text['beer_id']).mean().round(2)
df_beers['appearance'] = df_beers['beer_id'].map(appearance_rating).fillna(0)

palate_rating = df_ratings_no_text['palate'].groupby(df_ratings_no_text['beer_id']).mean().round(2)
df_beers['palate'] = df_beers['beer_id'].map(palate_rating).fillna(0)

aroma_rating = df_ratings_no_text['aroma'].groupby(df_ratings_no_text['beer_id']).mean().round(2)
df_beers['aroma'] = df_beers['beer_id'].map(aroma_rating).fillna(0)

overall_rating = df_ratings_no_text['overall'].groupby(df_ratings_no_text['beer_id']).mean().round(2)
df_beers['overall'] = df_beers['beer_id'].map(overall_rating).fillna(0)

df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,nbr_ratings,nbr_reviews,avg,ba_score,abv,nbr_interactions,std,median,appearance,palate,aroma,overall
23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,0,1,2.88,,4.5,1,0.0,2.88,3.25,3.25,2.75,3.0
24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,0,4,3.84,,4.5,4,0.17,3.86,3.75,3.75,3.62,3.88
25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,14,58,3.43,80.0,4.8,72,0.47,3.5,3.84,3.51,3.47,3.5
26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,2,6,3.89,,6.0,8,0.47,3.98,3.75,3.83,3.79,3.92
27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,14,47,3.56,82.0,4.2,61,0.44,3.63,3.71,3.44,3.51,3.71
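The cell above builds each statistic with its own `groupby` call. A more compact sketch, assuming the same column names, uses pandas named aggregation to compute them in one pass. The toy data is invented, and only one aspect column is shown for brevity.

```python
import pandas as pd

# Toy ratings table (hypothetical values); the second beer has no aspect score.
ratings = pd.DataFrame({
    "beer_id": [1, 1, 2],
    "rating": [3.0, 4.0, 2.5],
    "appearance": [3.5, 4.5, None],
})

# Named aggregation: output column name = (input column, aggregation function).
stats = (
    ratings.groupby("beer_id")
    .agg(
        avg=("rating", "mean"),
        std=("rating", "std"),
        median=("rating", "median"),
        appearance=("appearance", "mean"),
    )
    .round(2)
)
print(stats)

# Like the notebook, missing statistics can then be replaced by 0.
stats_filled = stats.fillna(0)
```

A beer with a single rating has an undefined sample standard deviation, and a beer rated without aspect scores has no aspect mean. Both become 0 after `fillna(0)`. Since scores range from 1 to 5, a 0 can be read as "not available", but it will bias any later average taken over these columns.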


In [45]:
columns_ordering = ['beer_id','beer_name','brewery_id','brewery_name','style','abv','avg','std','median','appearance','aroma','palate','overall','nbr_ratings','nbr_reviews','nbr_interactions']
df_beers = df_beers[columns_ordering]
df_beers.head(5)

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name,style,abv,avg,std,median,appearance,aroma,palate,overall,nbr_ratings,nbr_reviews,nbr_interactions
23,142544,Régab,37262,Societe des Brasseries du Gabon (SOBRAGA),Euro Pale Lager,4.5,2.88,0.0,2.88,3.25,2.75,3.25,3.0,0,1,1
24,19590,Barelegs Brew,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,4.5,3.84,0.17,3.86,3.75,3.62,3.75,3.88,0,4,4
25,19827,Legbiter,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,4.8,3.43,0.47,3.5,3.84,3.47,3.51,3.5,14,58,72
26,20841,St. Patrick's Ale,10093,Strangford Lough Brewing Company Ltd,English Pale Ale,6.0,3.89,0.47,3.98,3.75,3.79,3.83,3.92,2,6,8
27,20842,St. Patrick's Best,10093,Strangford Lough Brewing Company Ltd,English Bitter,4.2,3.56,0.44,3.63,3.71,3.51,3.44,3.71,14,47,61


In [46]:
df_beers.to_parquet(f"{DST_DATA_PATH}/beers.pq")

### Brewery dataset

In [47]:
# Number of beers per brewery after cleaning
nbr_beers = df_beers.groupby('brewery_id').size()
df_breweries['nbr_beers'] = df_breweries['id'].map(nbr_beers).fillna(0).astype(int)
# Keep only breweries that still have at least one beer
df_breweries = df_breweries[df_breweries['nbr_beers'] > 0]
df_breweries.sample(5)

Unnamed: 0,id,location,name,nbr_beers
2893,5366.0,Italy,Fabbrica Birra Busalla,3
971,30246.0,United Kingdom,Oakwell Brewery,3
3576,6684.0,Germany,Schlossbrauerei GmbH Fürstlich-Drehna,5
262,33086.0,United Kingdom,Halfway Brew House,2
6166,46298.0,Spain,MCJ Cerveceros Portuenses SL,2


In [48]:
df_breweries.to_parquet(f"{DST_DATA_PATH}/breweries.pq")

### Ratings dataset

In [49]:
df_ratings_no_text.to_parquet(f"{DST_DATA_PATH}/ratings_no_text.pq")
df_ratings_no_text.head(5)

Unnamed: 0,user_id,rating,review,abv,brewery_name,beer_id,appearance,palate,aroma,overall,taste,style,beer_name,brewery_id,date,idx,location
0,nmann08.184925,2.88,True,4.5,Societe des Brasseries du Gabon (SOBRAGA),142544,3.25,3.25,2.75,3.0,2.75,Euro Pale Lager,Régab,37262,2015-08-20 12:00:00,0,Gabon
1,stjamesgate.163714,3.67,True,4.5,Strangford Lough Brewing Company Ltd,19590,3.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2009-02-20 12:00:00,1,United Kingdom
2,mdagnew.19527,3.73,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,3.5,3.5,3.5,4.0,English Pale Ale,Barelegs Brew,10093,2006-03-13 12:00:00,2,United Kingdom
3,helloloser12345.10867,3.98,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,4.0,3.5,4.5,4.0,English Pale Ale,Barelegs Brew,10093,2004-12-01 12:00:00,3,United Kingdom
4,cypressbob.3708,4.0,True,4.5,Strangford Lough Brewing Company Ltd,19590,4.0,4.0,4.0,4.0,4.0,English Pale Ale,Barelegs Brew,10093,2004-08-30 12:00:00,4,United Kingdom


In [50]:
df_ratings_text.to_pandas().to_parquet(f"{DST_DATA_PATH}/ratings_text.pq")
df_ratings_text.head(5)

text,idx
str,i64
"""From a bottle, pours a piss ye…",0
"""Pours pale copper with a thin …",1
"""500ml Bottle bought from The V…",2
"""Serving: 500ml brown bottlePou…",3
"""500ml bottlePours with a light…",4


### Users dataset

In [51]:
df_users.head(5)

Unnamed: 0,nbr_ratings,nbr_reviews,user_id,user_name,joined,location,nbr_interactions
0,7355,465,nmann08.184925,nmann08,2008-01-07 11:00:00,"United States,Washington",7820
1,17,2504,stjamesgate.163714,StJamesGate,2007-10-08 10:00:00,"United States,New York",2521
2,654,1143,mdagnew.19527,mdagnew,2005-05-18 10:00:00,United Kingdom,1797
3,0,31,helloloser12345.10867,helloloser12345,2004-11-25 11:00:00,United Kingdom,31
4,0,604,cypressbob.3708,cypressbob,2003-11-20 11:00:00,United Kingdom,604


In [52]:
# Count, per user, the ratings without text and the ratings with a text review
nbr_ratings_per_user = df_ratings_no_text[~df_ratings_no_text['review']].value_counts('user_id')
nbr_reviews_per_user = df_ratings_no_text[df_ratings_no_text['review']].value_counts('user_id')

df_users['nbr_ratings'] = df_users['user_id'].map(nbr_ratings_per_user).fillna(0).astype(int)
df_users['nbr_reviews'] = df_users['user_id'].map(nbr_reviews_per_user).fillna(0).astype(int)
df_users['nbr_interactions'] = df_users['nbr_ratings'] + df_users['nbr_reviews']
# Keep only users with at least one remaining interaction, then reorder the columns
df_users = df_users[df_users['nbr_interactions'] > 0]
df_users = df_users[['user_id', 'user_name', 'location', 'joined', 'nbr_ratings', 'nbr_reviews', 'nbr_interactions']]
df_users.sample(5)

Unnamed: 0,user_id,user_name,location,joined,nbr_ratings,nbr_reviews,nbr_interactions
20543,ecoons.269278,ecoons,"United States,Ohio",2008-11-21 11:00:00,1,0,1
2927,tippyselenoid.122522,TippySelenoid,"United States,Missouri",2007-02-13 11:00:00,8,18,26
12883,drkarl.774405,DrKarl,"United States,Illinois",2014-01-05 11:00:00,10,0,10
21586,vmspicex.287840,vmspicex,"United States,California",2009-01-15 11:00:00,0,1,1
23838,wgfxkmf.653380,wgfxkmf,"United States,North Carolina",2012-01-07 11:00:00,10,1,11


In [53]:
df_users.to_parquet(f"{DST_DATA_PATH}/users.pq")
df_users.head(5)

Unnamed: 0,user_id,user_name,location,joined,nbr_ratings,nbr_reviews,nbr_interactions
0,nmann08.184925,nmann08,"United States,Washington",2008-01-07 11:00:00,886,84,970
1,stjamesgate.163714,StJamesGate,"United States,New York",2007-10-08 10:00:00,9,1712,1721
2,mdagnew.19527,mdagnew,United Kingdom,2005-05-18 10:00:00,588,902,1490
3,helloloser12345.10867,helloloser12345,United Kingdom,2004-11-25 11:00:00,0,29,29
4,cypressbob.3708,cypressbob,United Kingdom,2003-11-20 11:00:00,0,542,542
