# NOTEBOOK TO PREPROCESS THE DATA (used in the rest of the project)

## **INFORMATION ON THE CSVs**

*Data*: BeerAdvocate / RateBeer / matched_beer_data

*Difference between ratings and reviews*: **reviews.txt** appears to be a subset of **ratings.txt**. The latter has an additional `review` field (True or False), and **reviews.txt** contains the ratings for which `review` is True.
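One way to sanity-check the subset claim is to keep only the ratings flagged as reviews and compare them with the review records. The sketch below uses a few made-up records in place of the real files, and identifying a rating by its (beer, user) pair is an assumption made for illustration only.

```python
# Toy records standing in for parsed ratings.txt entries (values are made up).
ratings = [
    {"beer_id": "1", "user_id": "a", "rating": 4.0, "review": "True"},
    {"beer_id": "1", "user_id": "b", "rating": 3.5, "review": "False"},
    {"beer_id": "2", "user_id": "a", "rating": 4.5, "review": "True"},
]
# Toy records standing in for parsed reviews.txt entries (no `review` field).
reviews = [
    {"beer_id": "1", "user_id": "a", "rating": 4.0},
    {"beer_id": "2", "user_id": "a", "rating": 4.5},
]


def record_key(entry):
    """Identify a rating by (beer, user); an assumption made for this sketch."""
    return (entry["beer_id"], entry["user_id"])


# Ratings whose `review` flag is True, compared with the review records.
flagged = {record_key(r) for r in ratings if r["review"] == "True"}
review_keys = {record_key(r) for r in reviews}

is_same_set = flagged == review_keys
print(is_same_set)
```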

*Code to print .txt*: 
* """with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(16):
        print(file.readline())"""
* """with open(BA_RATINGS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(17):
        print(file.readline())"""
* !head Data/BeerAdvocate/ratings.txt/ratings.txt
* """from collections import deque
n_last_lines = 10
with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    last_lines = deque(file, maxlen=n_last_lines)
for line in last_lines:
    print(line.strip())"""
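The snippets above can also be wrapped in a small helper. This is a minimal sketch: the name `peek` is hypothetical, and an in-memory string stands in for the real file so the example runs without the data.

```python
import io
from itertools import islice


def peek(file_obj, n_lines=16):
    """Return the first n_lines of an open text file, without trailing newlines."""
    return [line.rstrip("\n") for line in islice(file_obj, n_lines)]


# In-memory stand-in for a .txt dataset (contents are made up).
sample = io.StringIO("beer_name: Example Ale\nbeer_id: 1\n\nbeer_name: Other\n")
first_lines = peek(sample, n_lines=2)
print(first_lines)
```

With the real data, the same helper would be called on the object returned by `open(BA_REVIEWS_DATASET, 'r', encoding='utf-8')`.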

### BeerAdvocate

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* nbr_reviews
* avg
* ba_score
* bros_score
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* joined
* location

**ratings.txt** (line format i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance
* aroma
* palate
* taste
* overall
* rating
* text
* review: *True or False*

**reviews.txt** (line format i.e. Header=None, subset of **ratings.txt**)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 5*
* palate : *up to 5*
* taste : *up to 5*
* overall : *up to 5*
* rating : *up to 5, unknown formula but different weights for each parameter*
* text
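Since the BeerAdvocate weighting is unknown, one possible way to estimate it is a least-squares fit of `rating` on the five aspect scores. The sketch below runs on synthetic data only: `toy_weights` are hypothetical values used to generate the toy ratings, not the real formula, and a linear weighting without intercept is itself an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic aspect scores in [1, 5]: appearance, aroma, palate, taste, overall.
aspects = rng.uniform(1.0, 5.0, size=(200, 5))

# Hypothetical weights used only to generate the toy ratings; NOT the real formula.
toy_weights = np.array([0.1, 0.2, 0.1, 0.4, 0.2])
ratings = aspects @ toy_weights

# Least-squares fit of rating ~ aspects @ weights (no intercept).
estimated, *_ = np.linalg.lstsq(aspects, ratings, rcond=None)
print(np.round(estimated, 3))
```

On the real data, `aspects` would be the five score columns of the BeerAdvocate reviews and `ratings` the `rating` column, after conversion to numeric types.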

----------------------------------------------------------------------------------------------------

### RateBeer

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* overall_score
* style_score
* avg
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* user_id
* user_name
* joined
* location

**ratings.txt = reviews.txt** (line format i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 10*
* palate (=mouthfeel) : *up to 5*
* taste : *up to 10*
* overall : *up to 20*
* rating : *up to 50 (sum of all previous) then divided by 10 --> up to 5*
* text
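The RateBeer rating described above can be written out as a short function. This is a sketch of the stated rule (sum of the five aspect scores, divided by 10); the function name `rb_rating` is hypothetical.

```python
def rb_rating(appearance, aroma, palate, taste, overall):
    """RateBeer rating as described above: sum of the aspect scores, divided by 10."""
    return (appearance + aroma + palate + taste + overall) / 10


# Maximum scores per aspect: 5 + 10 + 5 + 10 + 20 = 50, hence a rating of at most 5.
max_rating = rb_rating(5, 10, 5, 10, 20)
example = rb_rating(4, 7, 3, 8, 15)
print(max_rating, example)
```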

----------------------------------------------------------------------------------------------------

### matched_beer_data

**beers.csv**
#### ba:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* ba_score
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* bros_score
* nbr_matched_valid_ratings
* nbr_ratings
* nbr_reviews
* style
* zscore
#### rb:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* nbr_matched_valid_ratings
* nbr_ratings
* overall_score
* style
* style_score
* zscore
#### scores:
* diff
* sim

**breweries.csv**
#### ba:
* id
* location
* name
* nbr_beers
#### rb:
* id
* location
* name
* nbr_beers
#### scores:
* diff
* sim

**ratings.csv**
#### ba:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* review
* style
* taste
* text
* user_id
* user_name
#### rb:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* style
* taste
* text
* user_id
* user_name


**users_approx.csv**
#### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
#### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower
#### scores:
* sim

**users.csv** (a subset of **users_approx.csv**: it contains the users from **users_approx.csv** whose `sim` is close to 1)
#### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
#### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower
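The relation between **users_approx.csv** and **users.csv** can be pictured as a filter on `sim`. The sketch below uses made-up rows with flattened column names, and `SIM_THRESHOLD` is a hypothetical cut-off: the exact rule used to build the dataset is not stated here.

```python
# Toy rows standing in for users_approx.csv (names and scores are made up).
users_approx = [
    {"ba_user_name": "hopfan", "rb_user_name": "hopfan", "sim": 1.0},
    {"ba_user_name": "maltlover", "rb_user_name": "malt_lover", "sim": 0.91},
    {"ba_user_name": "stoutguy", "rb_user_name": "StoutGuy", "sim": 1.0},
]

SIM_THRESHOLD = 0.99  # hypothetical cut-off; the dataset's actual rule is not stated

# Keep only the pairs whose similarity score is close to 1.
matched_users = [row for row in users_approx if row["sim"] >= SIM_THRESHOLD]
matched_names = [row["ba_user_name"] for row in matched_users]
print(matched_names)
```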

----------------------------------------------------------------------------------------------------

## **LOADING DATA**

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import fastparquet
import pyarrow
import datetime as dt

In [2]:
DATA_FOLDER = '../../data/'
BEER_ADVOCATE_FOLDER = DATA_FOLDER + 'BeerAdvocate/' #BA
RATE_BEER_FOLDER = DATA_FOLDER + 'RateBeer/' #RB

BA_USERS_DATASET = BEER_ADVOCATE_FOLDER + "users.csv"
BA_REVIEWS_DATASET = BEER_ADVOCATE_FOLDER + 'reviews.txt/' + "reviews.txt"

RB_USERS_DATASET = RATE_BEER_FOLDER + "users.csv"
RB_REVIEWS_DATASET = RATE_BEER_FOLDER + 'reviews.txt/' + "ratings.txt"


## **PROCESSING users.csv**

In [3]:
ba_users = pd.read_csv(BA_USERS_DATASET)
rb_users = pd.read_csv(RB_USERS_DATASET)

In [4]:
# Convert Unix timestamps (seconds) to datetimes truncated to the day
ba_users['joined'] = pd.to_datetime(ba_users['joined'], unit='s').dt.normalize()
ba_users['user_id'] = ba_users['user_id'].astype(str)

rb_users['joined'] = pd.to_datetime(rb_users['joined'], unit='s').dt.normalize()
rb_users['user_id'] = rb_users['user_id'].astype(str)
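The cell above turns Unix timestamps (in seconds) into day-resolution datetimes. A minimal sketch of that conversion on two made-up timestamps, using `dt.normalize()` as one way to drop the time of day while keeping a datetime dtype:

```python
import pandas as pd

# Two made-up Unix timestamps (seconds):
# 2014-01-01 13:00:00 and 2014-01-02 00:00:00 (UTC).
joined = pd.Series([1388581200, 1388620800])

as_datetime = pd.to_datetime(joined, unit="s")
as_day = as_datetime.dt.normalize()  # set the time of day to midnight
print(as_day.tolist())
```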

## **PROCESSING reviews.txt**

### BeerAdvocate

In [6]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating', 'text']
chunk_size = 1_000_000
data = []
entry_count = 0
chunk_count = 0
current_entry = {}

with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            key, value = line.split(':', 1)
            key = key.strip()
            value = value.strip()
            current_entry[key] = value
        else:
            if current_entry:
                data.append(current_entry)
                current_entry = {}
                entry_count += 1

                # Save chunk when reaching chunk size
                if entry_count >= chunk_size:
                    chunk_df = pd.DataFrame(data, columns=columns)
                    chunk_file_path = f"../../generated/ba_chunks/ba_reviews_chunk_{chunk_count}.parquet"
                    chunk_df.to_parquet(chunk_file_path)
                    print(f"Saved {chunk_file_path}")
                    data = []
                    entry_count = 0
                    chunk_count += 1

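# Keep the last entry if the file does not end with a blank line
if current_entry:
    data.append(current_entry)

# Process any remaining entries after the loop
if data: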
    chunk_df = pd.DataFrame(data, columns=columns)
    chunk_file_path = f"../../generated/ba_chunks/ba_reviews_chunk_{chunk_count}.parquet"
    chunk_df.to_parquet(chunk_file_path)
    print(f"Saved {chunk_file_path}")

print('Saving done')

Saved ../../generated/ba_chunks/ba_reviews_chunk_0.parquet
Saved ../../generated/ba_chunks/ba_reviews_chunk_1.parquet
Saved ../../generated/ba_chunks/ba_reviews_chunk_2.parquet
Saving done
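The parsing logic of the cell above (blocks of `key: value` lines separated by blank lines) can be isolated in a generator, which makes it easy to test on a small sample. This is a sketch: the name `iter_records` is hypothetical, and skipping lines without a colon is an assumption about how malformed lines should be handled.

```python
import io


def iter_records(file_obj):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    current = {}
    for raw in file_obj:
        line = raw.strip()
        if not line:
            if current:
                yield current
                current = {}
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' separator
            current[key.strip()] = value.strip()
    if current:  # last record when the file does not end with a blank line
        yield current


# In-memory stand-in for reviews.txt (contents are made up).
sample = io.StringIO(
    "beer_name: Example Ale\nrating: 4.5\ntext: Nice: very hoppy\n\n"
    "beer_name: Other Stout\nrating: 3.0\n"
)
records = list(iter_records(sample))
print(records)
```

Because the generator yields one record at a time, the chunked saving can stay in the calling code and be shared between the BeerAdvocate and RateBeer files.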


In [7]:
# glob does not guarantee any order, so sort the chunk files before concatenating
ba_chunk_files = sorted(glob.glob("../../generated/ba_chunks/ba_reviews_chunk_*.parquet"))
ba_reviews = pd.concat([pd.read_parquet(ba_chunk) for ba_chunk in ba_chunk_files], ignore_index=True)
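`glob.glob` returns files in arbitrary order, and a plain string sort would place `chunk_10` before `chunk_2` once there are more than ten chunks. If the original row order matters, the chunk files can be sorted by their numeric index. A minimal sketch, with a hypothetical helper `chunk_index` and made-up file names:

```python
import re


def chunk_index(path):
    """Extract the integer N from a name ending in '_chunk_N.parquet'."""
    match = re.search(r"_chunk_(\d+)\.parquet$", path)
    if match is None:
        raise ValueError(f"unexpected chunk file name: {path}")
    return int(match.group(1))


# Made-up file list in an arbitrary order, as glob may return it.
paths = [
    "ba_reviews_chunk_10.parquet",
    "ba_reviews_chunk_2.parquet",
    "ba_reviews_chunk_0.parquet",
]
ordered = sorted(paths, key=chunk_index)
print(ordered)
```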


In [8]:
# Drop reviews whose text is empty
empty_text = ba_reviews['text'] == ''
ba_reviews.drop(ba_reviews[empty_text].index, inplace=True)
missing_values = np.where(pd.isnull(ba_reviews))
if missing_values[0].size == 0:
    print("The dataset is clean, with no missing values.")
else:
    print("The dataset has missing values at:")
    print("Row indices:", missing_values[0])
    print("Column indices:", missing_values[1])

The dataset is clean, with no missing values.


In [9]:
cols_to_numeric = ['abv', 'date', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
ba_reviews[cols_to_numeric] = ba_reviews[cols_to_numeric].apply(pd.to_numeric, errors='coerce')

cols_to_string = ['beer_id', 'brewery_id', 'user_id']
for col in cols_to_string:
    ba_reviews[col] = ba_reviews[col].astype(str)
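With `errors='coerce'`, any value that cannot be parsed becomes `NaN`. Since the missing-value check in this notebook runs before the numeric conversion, it may be worth counting the `NaN` values again afterwards. A small sketch on made-up values:

```python
import pandas as pd

# Made-up raw values as they come out of the text parser (all strings).
raw = pd.DataFrame({"abv": ["4.5", "abc", "7"], "rating": ["3.9", "4.2", "5"]})

# Unparseable strings such as "abc" are turned into NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")
nan_per_column = numeric.isna().sum()  # values that could not be parsed
print(nan_per_column.to_dict())
```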

In [10]:
# Convert Unix timestamps (seconds) to datetimes truncated to the day
ba_reviews['date'] = pd.to_datetime(ba_reviews['date'], unit='s').dt.normalize()

### RateBeer

In [11]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating', 'text']
chunk_size = 1_000_000
data = []
entry_count = 0
chunk_count = 0
current_entry = {}

with open(RB_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            key, value = line.split(':', 1)
            key = key.strip()
            value = value.strip()
            current_entry[key] = value
        else:
            if current_entry:
                data.append(current_entry)
                current_entry = {}
                entry_count += 1

                # Save chunk when reaching chunk size
                if entry_count >= chunk_size:
                    chunk_df = pd.DataFrame(data, columns=columns)
                    chunk_file_path = f"../../generated/rb_chunks/rb_reviews_chunk_{chunk_count}.parquet"
                    chunk_df.to_parquet(chunk_file_path)
                    print(f"Saved {chunk_file_path}")
                    data = []
                    entry_count = 0
                    chunk_count += 1

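# Keep the last entry if the file does not end with a blank line
if current_entry:
    data.append(current_entry)

# Process any remaining entries after the loop
if data: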
    chunk_df = pd.DataFrame(data, columns=columns)
    chunk_file_path = f"../../generated/rb_chunks/rb_reviews_chunk_{chunk_count}.parquet"
    chunk_df.to_parquet(chunk_file_path)
    print(f"Saved {chunk_file_path}")

print('Saving done')

Saved ../../generated/rb_chunks/rb_reviews_chunk_0.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_1.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_2.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_3.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_4.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_5.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_6.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_7.parquet
Saving done


In [12]:
# glob does not guarantee any order, so sort the chunk files before concatenating
rb_chunk_files = sorted(glob.glob("../../generated/rb_chunks/rb_reviews_chunk_*.parquet"))
rb_reviews = pd.concat([pd.read_parquet(rb_chunk) for rb_chunk in rb_chunk_files], ignore_index=True)


In [13]:
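# Drop reviews whose text is empty
empty_text = rb_reviews['text'] == ''
rb_reviews.drop(rb_reviews[empty_text].index, inplace=True)
# Recompute the check on rb_reviews (otherwise the BeerAdvocate result would be reused)
missing_values = np.where(pd.isnull(rb_reviews))
if missing_values[0].size == 0: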
    print("The dataset is clean, with no missing values.")
else:
    print("The dataset has missing values at:")
    print("Row indices:", missing_values[0])
    print("Column indices:", missing_values[1])


The dataset is clean, with no missing values.


In [14]:
cols_to_numeric = ['abv', 'date', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
rb_reviews[cols_to_numeric] = rb_reviews[cols_to_numeric].apply(pd.to_numeric, errors='coerce')

cols_to_string = ['beer_id', 'brewery_id', 'user_id']
for col in cols_to_string:
    rb_reviews[col] = rb_reviews[col].astype(str)

In [15]:
# Convert Unix timestamps (seconds) to datetimes truncated to the day
rb_reviews['date'] = pd.to_datetime(rb_reviews['date'], unit='s').dt.normalize()

## **SAVING PROCESSED DFs**

In [17]:
ba_users.to_csv('../../generated/new_ba_users.csv')
ba_reviews.to_parquet('../../generated/new_ba_reviews.parquet')

rb_users.to_csv('../../generated/new_rb_users.csv')
rb_reviews.to_parquet('../../generated/new_rb_reviews.parquet')