# Preprocessing

Looking at the datasets, we observe that for both websites the reviews are grouped in a single txt file, 'ratings.txt'. 
To keep as many reviews as possible, we use those txt files and convert them to CSV files for easier manipulation. 

The converted 'ratings.txt' from BeerAdvocate is saved as 'ratings_BA.csv' and the one from RateBeer is saved as 'ratings_RB.csv'.

In [None]:
# imports

import pandas as pd 
import numpy as np

### BeerAdvocate file

In [None]:
with open("./data/BeerAdvocate/ratings.txt") as f:
  
  for i, line in enumerate(f):
    print(line)
    if i > 20:
      break

We observe that 18 consecutive lines of the txt file hold all the information for one rating, so each group of 18 lines becomes one row. 
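The fixed record length means a line index maps directly to a rating number and a position inside that rating. The sketch below illustrates that arithmetic; the helper name `locate` is hypothetical and not part of the notebook.

```python
NB_LINES = 18  # 17 "key: value" lines plus one blank separator per rating (BeerAdvocate layout)

def locate(line_index, nb_lines=NB_LINES):
    """Return (rating_number, position_within_rating) for a 0-based line index."""
    return divmod(line_index, nb_lines)

first = locate(0)    # first line of the first rating
last = locate(17)    # blank separator closing the first rating
nxt = locate(18)     # first line of the second rating
print(first, last, nxt)
```

This is the same arithmetic that lets a range of ratings be turned into a range of line indices by multiplying by the record length.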

In [None]:
n_lines = 0  # not named `len`, which would shadow the built-in function

with open("./data/BeerAdvocate/ratings.txt") as f:
  for i, line in enumerate(f):
    n_lines += 1
    if i > 151074570:  # print the last few lines of the file
      print(line)

true_length = n_lines / 18

# divide by 18 because there are 17 keys and every "row" is followed by a blank line (even at the end)
print("True length of the dataset :", true_length)

The resulting count tells us how many ratings the file holds; we use it to choose the row ranges when converting the txt file to a dataframe.
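As a rough sketch of how a row count can be turned into chunk ranges, the helper below yields `(start, end)` pairs covering all rows. The name `chunk_bounds` and the row count used here are made up for illustration; the real count comes from the cell above.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, end) row ranges that cover n_rows in steps of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Hypothetical row count, for illustration only.
bounds = list(chunk_bounds(8_000_000, 3_000_000))
print(bounds)
```

Each pair could then serve as the `min_row` and `max_row` arguments of a chunked conversion.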

In [None]:
path = "./data/BeerAdvocate/ratings.txt"

keys = ['beer_name','beer_id', 'brewery_id','brewery_name', 
        'style', 'date', 'user_id', 'user_name', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating', 'text']

In [8]:
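def read_lines(file_path, start, end):
    # Yield the non-blank lines whose index falls in [start, end)
    with open(file_path) as f:
        for i, line in enumerate(f):
            if i >= end:
                break  # stop reading once past the requested range
            if i >= start and line.strip():
                yield line.rstrip()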

def txt_to_df(min_row, max_row, path=path, keys=keys, nb_lines=18):
    df_dict = {k: [] for k in keys}
    
    for line in read_lines(path, min_row * nb_lines, max_row * nb_lines):
        key, value = line.split(":", maxsplit=1)
        if key in df_dict:
            df_dict[key].append(value.strip())

    df = pd.DataFrame.from_dict(df_dict)
    df.replace('nan', np.nan, inplace=True)
    df.dropna(subset=['text'], inplace=True)
    
    return df
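`txt_to_df` relies on every rating spanning exactly `nb_lines` lines. As a hedged alternative, not the approach used in this notebook, records can be delimited by the blank lines themselves, which avoids any assumption about the record length. The function name `iter_records` and the sample text are hypothetical.

```python
import io

def iter_records(lines):
    """Yield one dict per rating, using blank lines as record separators."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if record:  # a blank line closes the current record
                yield record
                record = {}
            continue
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:  # ignore lines that contain no colon
            record[key.strip()] = value.strip()
    if record:  # last record when the input does not end with a blank line
        yield record

# Invented two-record sample standing in for the real file.
sample = io.StringIO("beer_id: 1\nrating: 4.5\n\nbeer_id: 2\nrating: 3.0\n")
records = list(iter_records(sample))
print(records)
```

The resulting list of dicts can be passed to `pd.DataFrame` directly. The trade-off is that every record is held as a dict, which uses more memory than the column lists built in `txt_to_df`.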

In [9]:
df = pd.concat([txt_to_df(0, 3_000_000), txt_to_df(3_000_000, 6_000_000), txt_to_df(6_000_000, 9_000_000)], ignore_index=True)

# Finally, we save the dataframe as a csv file
df.to_csv("./data/BeerAdvocate/ratings_BA.csv", index=False)

### RateBeer file

In [None]:
with open("./data/RateBeer/ratings.txt") as f:
  
  for i, line in enumerate(f):
    print(line)
    if i > 20:
      break

Here, each rating spans 17 lines of the txt file. 
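One way to check the record length rather than read it off the first few lines is to tally how many lines each blank-line-terminated block spans. This is a minimal sketch on an invented sample; `block_lengths` is a hypothetical helper, and on the real file it would be called with the open file object.

```python
from collections import Counter
import io

def block_lengths(lines):
    """Count how many lines (blank separator included) each record spans."""
    counts = Counter()
    current = 0
    for line in lines:
        current += 1
        if not line.strip():  # a blank line closes the current record
            counts[current] += 1
            current = 0
    if current:  # trailing record without a closing blank line
        counts[current] += 1
    return counts

# Invented sample: two records of 3 lines and one of 2 lines.
sample = io.StringIO("a: 1\nb: 2\n\na: 3\nb: 4\n\na: 5\n\n")
lengths = block_lengths(sample)
print(lengths)
```

If the tally contains a single length, the fixed-stride conversion is safe; several lengths would indicate records with missing or extra lines.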

In [None]:
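n_lines = 0  # not named `len`, which would shadow the built-in function
with open("./data/RateBeer/ratings.txt") as f:
  for i, line in enumerate(f):
    n_lines += 1
    if i > 151074570:  # print any lines past this index (the tail of the file)
      print(line)

true_length = n_lines / 17

print("True length of the dataset :", true_length)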

In [18]:
path = "./data/RateBeer/ratings.txt"
keys = ['beer_name','beer_id', 'brewery_id','brewery_name', 
        'style', 'date', 'user_id', 'user_name', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating', 'text']

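# Pass path explicitly: the default value of txt_to_df's path argument was
# bound to the BeerAdvocate path when the function was defined.
df = pd.concat([txt_to_df(0, 3_000_000, path=path, nb_lines=17),
                txt_to_df(3_000_000, 6_000_000, path=path, nb_lines=17),
                txt_to_df(6_000_000, 9_000_000, path=path, nb_lines=17)], ignore_index=True)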

In [25]:
df.dtypes

beer_name       object
beer_id         object
brewery_id      object
brewery_name    object
style           object
date            object
user_id         object
user_name       object
appearance      object
aroma           object
palate          object
taste           object
overall         object
rating          object
text            object
dtype: object

Every column was read as a string (object dtype), so we need to convert the fields to the correct types. 

In [26]:
# convert date to int64 
df['date'] = df['date'].astype('int64')

# convert columns 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating' to float64
df[['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']] = df[['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']].astype('float64')

# convert beer_id, brewery_id to int64
df[['beer_id', 'brewery_id']] = df[['beer_id', 'brewery_id']].astype('int64')
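`astype` raises an error as soon as one value cannot be parsed. The sketch below shows a more tolerant variant using `pd.to_numeric` with `errors="coerce"`, plus an optional conversion of the date to a datetime. The frame `df_demo` and its values are invented, and treating `date` as a Unix timestamp in seconds is an assumption about the data.

```python
import pandas as pd

# Hypothetical miniature frame with string-typed columns, as produced by the parser.
df_demo = pd.DataFrame({
    "beer_id": ["1", "2", "3"],
    "date": ["1440064800", "1235127600", "1142247600"],
    "rating": ["4.5", "n/a", "3.0"],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
df_demo["rating"] = pd.to_numeric(df_demo["rating"], errors="coerce")
df_demo["beer_id"] = df_demo["beer_id"].astype("int64")

# Assumption: 'date' holds Unix timestamps in seconds.
df_demo["date"] = pd.to_datetime(df_demo["date"].astype("int64"), unit="s")

print(df_demo.dtypes)
```

Coerced values become NaN, so counting missing values before and after the conversion shows how many entries failed to parse.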

In [28]:
df.dtypes

beer_name        object
beer_id           int64
brewery_id        int64
brewery_name     object
style            object
date              int64
user_id          object
user_name        object
appearance      float64
aroma           float64
palate          float64
taste           float64
overall         float64
rating          float64
text             object
dtype: object

In [29]:
df.to_csv("./data/RateBeer/ratings_RB.csv", index=False)