# RateBeer pre-processing
Because ratings.txt and reviews.txt are hard to parse and to handle, we convert them to an easier-to-use format (Parquet). We also check how much information the two files share, to understand whether we can drop one of them. <br>
Finally, we process the reviews to make them compatible with our pipeline.<br><br><br>
How to run this notebook:
- Unpack all the files in the corresponding folder in the data folder
- Run the notebook

The converted files, reviews.pq and ratings.pq, will be saved in the data folder.

##### Definition of some global variables and imports

In [1]:
import polars as pl
import tqdm
from datetime import datetime
import os

In [2]:
DATA_FOLDER = '../../data/RateBeer'

##### Identification of unique labels

In [3]:
for file_name in ["ratings", "reviews"]:
    unique_labels = set()
    with open(f"{DATA_FOLDER}/{file_name}.txt") as file:
        for line in tqdm.tqdm(file):
            if line == "\n":
                continue
            label, _ = line.split(":", 1)
            unique_labels.add(label.strip())

    print(f"Unique labels in {file_name}.txt:")
    print(unique_labels)
    print()

121075258it [00:42, 2836687.60it/s]


Unique labels in ratings.txt:
{'brewery_name', 'brewery_id', 'beer_id', 'text', 'overall', 'beer_name', 'date', 'abv', 'palate', 'appearance', 'rating', 'user_name', 'taste', 'aroma', 'user_id', 'style'}



121075258it [00:45, 2690329.71it/s]

Unique labels in reviews.txt:
{'brewery_name', 'brewery_id', 'beer_id', 'text', 'overall', 'beer_name', 'date', 'abv', 'palate', 'appearance', 'rating', 'user_name', 'taste', 'aroma', 'user_id', 'style'}
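
The label scan above relies on `line.split(":", 1)` splitting only at the first colon, so a value that itself contains colons (for example a review text) is left intact. A minimal sketch of that behaviour follows; the sample lines and their values are invented for illustration and are not taken from the dataset.

```python
# Hypothetical sample lines; the values are made up for illustration.
sample_lines = [
    "beer_name: Example Ale\n",
    "text: Poured at 12:30, nice head\n",
    "\n",
    "beer_name: Another Beer\n",
]

unique_labels = set()
values = []
for line in sample_lines:
    # Blank lines separate records and carry no label
    if line == "\n":
        continue
    # maxsplit=1 keeps any later colon inside the value
    label, value = line.split(":", 1)
    unique_labels.add(label.strip())
    values.append(value.strip())

print(sorted(unique_labels))
```

Without the `maxsplit` argument, the second sample line would split into three parts and the two-name unpacking would raise a `ValueError`.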

##### Conversion from txt to parquet

In [4]:
# Define the mapping between column names and polars types
mapping_pl = {
    'rating': pl.Float64,
    'palate': pl.Float64,
    'abv': pl.Float64,
    'beer_id': pl.Int64,
    'beer_name': pl.Utf8,
    'user_id': pl.Int64,
    'taste': pl.Float64,
    'date': pl.Datetime,
    'style': pl.Utf8,
    'appearance': pl.Float64,
    'overall': pl.Float64,
    'brewery_name': pl.Utf8,
    'text': pl.Utf8,
    'aroma': pl.Float64,
    'user_name': pl.Utf8,
    'brewery_id': pl.Int64
}

for file_name in ["ratings", "reviews"]:

    # Create an empty list to collect rows
    rows = []

    # Open the file to read the reviews
    with open(f"{DATA_FOLDER}/{file_name}.txt") as f:
        for line in tqdm.tqdm(f):
            # Remove leading/trailing whitespaces
            line = line.strip()
                
            # Create a dictionary to store the content of the row
            content = {label: None for label in mapping_pl.keys()}

            # Process the line until we get a complete record
            while line:
                # Split the line into label and value
                label, value = line.split(":", 1)
                label = label.strip()
                value = value.strip()

                # Skip 'nan' values (these values are used to indicate missing data)
                if value != 'nan':
                    # Cast the value to the correct type based on the mapping
                    if mapping_pl[label] == pl.Int64:
                        value = int(value)
                    elif mapping_pl[label] == pl.Float64:
                        value = float(value)
                    elif mapping_pl[label] == pl.Utf8:
                        value = str(value)
                    elif mapping_pl[label] == pl.Datetime:
                        # Unix timestamp -> naive datetime in the machine's local time zone
                        value = datetime.fromtimestamp(int(value))
                    elif mapping_pl[label] == pl.Boolean:
                        value = value == "True"

                    # Store the value in the content dictionary
                    content[label] = value

                # Read the next line (for multiline records, like reviews)
                line = f.readline().strip()

            # Add the processed row to the list
            rows.append(content)

    # After processing all lines, create a DataFrame from the accumulated rows
    df = pl.DataFrame(rows)

    # Save it as parquet
    df.write_parquet(f'{DATA_FOLDER}/{file_name}.pq')

    # Free the memory
    del df

7122074it [01:28, 80515.47it/s]
7122074it [01:27, 80988.82it/s]
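
The record-parsing logic of the cell above can also be written as a standalone generator, which makes it easy to try on a small in-memory sample before running it on the full files. This is only a sketch under stated assumptions: `CASTS` and `iter_records` are hypothetical names, `CASTS` covers just a few of the 16 labels, the sample values are invented, and timestamps are converted to UTC here, whereas the notebook's `datetime.fromtimestamp(int(value))` uses the local time zone.

```python
import io
from datetime import datetime, timezone

# Hypothetical subset of casts; the notebook's mapping_pl covers all 16 labels.
CASTS = {
    "beer_name": str,
    "beer_id": int,
    "rating": float,
    # Assumption: UTC is used here so the result does not depend on the machine
    "date": lambda v: datetime.fromtimestamp(int(v), tz=timezone.utc),
}


def iter_records(lines):
    """Yield one dict per blank-line-separated record."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one
            if record:
                # Labels missing from the record default to None
                yield {**dict.fromkeys(CASTS), **record}
                record = {}
            continue
        label, value = line.split(":", 1)
        label, value = label.strip(), value.strip()
        # 'nan' marks missing data and is stored as None
        record[label] = None if value == "nan" else CASTS[label](value)
    # Emit the last record when the input does not end with a blank line
    if record:
        yield {**dict.fromkeys(CASTS), **record}


# Made-up sample with two records; the second one has a missing rating.
sample = io.StringIO(
    "beer_name: Example Ale\nbeer_id: 1\nrating: 3.5\ndate: 0\n\n"
    "beer_name: Another Beer\nbeer_id: 2\nrating: nan\n"
)
records = list(iter_records(sample))
print(len(records))
```

Because the generator yields records one at a time, it could feed a batched writer instead of a single in-memory list, although the cell above keeps everything in `rows` before building the DataFrame.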


##### Overlap between ratings and reviews

In [5]:
reviews = pl.read_parquet(f'{DATA_FOLDER}/reviews.pq')
ratings = pl.read_parquet(f'{DATA_FOLDER}/ratings.pq')

assert reviews.equals(ratings)

The assertion passes, so the reviews file and the ratings file contain the same information. We can drop the reviews file and keep only the ratings file (to be consistent with what has been done with the BeerAdvocate dataset).