# NOTEBOOK TO PREPROCESS THE DATA (used in the rest of the project)

## **INFORMATION ON THE CSVs**

*Data*: BeerAdvocate / RateBeer / matched_beer_data

*Difference ratings-reviews*: **reviews.txt** appears to be a subset of **ratings.txt**: the latter has an extra `review` field (True or False), and **reviews.txt** contains exactly the ratings for which that field is True.
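
The subset relation described above can be illustrated on a toy table. This is only a sketch: the rows are invented, and it assumes the `review` field has already been parsed into booleans (in the raw .txt it is text).

```python
import pandas as pd

# Toy stand-in for a parsed ratings.txt; all values are invented for illustration.
ratings = pd.DataFrame({
    "beer_id": [1, 1, 2, 3],
    "user_id": ["u1", "u2", "u1", "u3"],
    "rating": [4.0, 3.5, 4.25, 2.0],
    "review": [True, False, True, False],
})

# Keeping only the rows flagged as reviews should reproduce reviews.txt,
# which lists the same fields minus the `review` flag.
reviews = ratings[ratings["review"]].drop(columns="review")
print(len(ratings), len(reviews))
```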

*Code to print .txt*: 
* """with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(16):
        print(file.readline())"""
* """with open(BA_RATINGS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(17):
        print(file.readline())"""
* !head Data/BeerAdvocate/ratings.txt/ratings.txt
* """from collections import deque
n_last_lines = 10
with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    last_lines = deque(file, maxlen=n_last_lines)
for line in last_lines:
    print(line.strip())"""

### BeerAdvocate

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* nbr_reviews
* avg
* ba_score
* bros_score
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* joined
* location

**ratings.txt** (line format i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance
* aroma
* palate
* taste
* overall
* rating
* text
* review: *True or False*

**reviews.txt** (line format i.e. Header=None, subset of **ratings.txt**)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 5*
* palate : *up to 5*
* taste : *up to 5*
* overall : *up to 5*
* rating : *up to 5, unknown formula but different weights for each parameter*
* text

----------------------------------------------------------------------------------------------------

### RateBeer

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* overall_score
* style_score
* avg
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* user_id
* user_name
* joined
* location

**ratings.txt = reviews.txt** (line format i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 10*
* palate (=mouthfeel) : *up to 5*
* taste : *up to 10*
* overall : *up to 20*
* rating : *up to 50 (sum of all previous) then divided by 10 --> up to 5*
* text
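
As a worked example of the RateBeer scale listed above, the sketch below sums five sub-scores and divides by 10. The sub-score values are hypothetical; only the maxima and the "sum then divide by 10" rule come from the list.

```python
# Hypothetical sub-scores for one RateBeer rating, each within its own scale.
sub_scores = {"appearance": 4, "aroma": 7, "palate": 3, "taste": 8, "overall": 15}
max_scores = {"appearance": 5, "aroma": 10, "palate": 5, "taste": 10, "overall": 20}

# The maxima add up to 50, so the final rating tops out at 5.
assert sum(max_scores.values()) == 50

rating = sum(sub_scores.values()) / 10
print(rating)  # 3.7
```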

----------------------------------------------------------------------------------------------------

### matched_beer_data

**beers.csv**
#### ba:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* ba_score
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* bros_score
* nbr_matched_valid_ratings
* nbr_ratings
* nbr_reviews
* style
* zscore
#### rb:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* nbr_matched_valid_ratings
* nbr_ratings
* overall_score
* style
* style_score
* zscore
#### scores:
* diff
* sim

**breweries.csv**
#### ba:
* id
* location
* name
* nbr_beers
#### rb:
* id
* location
* name
* nbr_beers
#### scores:
* diff
* sim

**ratings.csv**
#### ba:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* review
* style
* taste
* text
* user_id
* user_name
#### rb:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* style
* taste
* text
* user_id
* user_name


**users_approx.csv**
#### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
#### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower
#### scores:
* sim

**users.csv** (a subset of **users_approx**: it contains the users from **users_approx** whose `sim` is close to 1)
#### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
#### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower
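
The `ba` / `rb` / `scores` grouping above suggests a two-row header in the matched files. A minimal sketch of reading such a file and keeping the high-similarity users follows; the sample rows and the `SIM_THRESHOLD` value are assumptions, since the source only says "close to 1".

```python
import io
import pandas as pd

# Invented sample mimicking two header rows: the group, then the column name.
sample = io.StringIO(
    "ba,ba,rb,rb,scores\n"
    "user_id,user_name,user_id,user_name,sim\n"
    "a.1,Alice,10,alice,1.0\n"
    "b.2,Bob,20,bobby,0.82\n"
)

# header=[0, 1] builds MultiIndex columns such as ('ba', 'user_id').
users_approx = pd.read_csv(sample, header=[0, 1])

SIM_THRESHOLD = 0.99  # assumed cut-off, not taken from the dataset documentation
users = users_approx[users_approx[("scores", "sim")] >= SIM_THRESHOLD]
print(users[("ba", "user_name")].tolist())
```

Reading with `skiprows=1` instead drops the group row, so pandas has to rename the repeated column names to keep them distinct.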

----------------------------------------------------------------------------------------------------

## **LOADING DATA**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import glob

In [2]:
DATA_FOLDER = '../../data/'
BEER_ADVOCATE_FOLDER = DATA_FOLDER + 'BeerAdvocate/' #BA
RATE_BEER_FOLDER = DATA_FOLDER + 'RateBeer/' #RB
MATCHED_BEER_FOLDER = DATA_FOLDER + 'matched_beer_data/' #MB

BA_BEERS_DATASET = BEER_ADVOCATE_FOLDER + "beers.csv"
BA_BREWERIES_DATASET = BEER_ADVOCATE_FOLDER + "breweries.csv"
BA_USERS_DATASET = BEER_ADVOCATE_FOLDER + "users.csv"
BA_RATINGS_DATASET = BEER_ADVOCATE_FOLDER + 'ratings.txt/' + "ratings.txt"
BA_REVIEWS_DATASET = BEER_ADVOCATE_FOLDER + 'reviews.txt/' + "reviews.txt"

RB_BEERS_DATASET = RATE_BEER_FOLDER + "beers.csv"
RB_BREWERIES_DATASET = RATE_BEER_FOLDER + "breweries.csv"
RB_USERS_DATASET = RATE_BEER_FOLDER + "users.csv"
RB_RATINGS_DATASET = RATE_BEER_FOLDER + 'ratings.txt/' + "ratings.txt"
RB_REVIEWS_DATASET = RATE_BEER_FOLDER + 'reviews.txt/' + "reviews.txt"

MB_BEERS_DATASET = MATCHED_BEER_FOLDER + "beers.csv"
MB_BREWERIES_DATASET = MATCHED_BEER_FOLDER + "breweries.csv"
MB_USERS_DATASET = MATCHED_BEER_FOLDER + "users.csv"
MB_USERS_APPROX_DATASET = MATCHED_BEER_FOLDER + "users_approx.csv"
MB_RATINGS_DATASET = MATCHED_BEER_FOLDER + "ratings.csv"

In [3]:
ba_beers = pd.read_csv(BA_BEERS_DATASET)
ba_breweries = pd.read_csv(BA_BREWERIES_DATASET)
ba_users = pd.read_csv(BA_USERS_DATASET)

rb_beers = pd.read_csv(RB_BEERS_DATASET)
rb_breweries = pd.read_csv(RB_BREWERIES_DATASET)
rb_users = pd.read_csv(RB_USERS_DATASET)

# mb_beers = pd.read_csv(MB_BEERS_DATASET, skiprows= 1)
# mb_breweries = pd.read_csv(MB_BREWERIES_DATASET, skiprows= 1)
# mb_users = pd.read_csv(MB_USERS_DATASET, skiprows= 1)
# mb_users_approx = pd.read_csv(MB_USERS_APPROX_DATASET, skiprows= 1)
# mb_ratings = pd.read_csv(MB_RATINGS_DATASET, skiprows= 1)

## **CONTINENT & BEER STYLE CATEGORIZATION (function, list)**

In [4]:
def name_to_country(name: str) -> str:
    '''
    Determines the country associated with a given name string
    based on specific formatting rules.
    :param name: str, a string representing a geographical or generic name.
    :return: str, the formatted country name or the original input.
    '''
    if len(name) >= 13:
        if name.split('<')[0] in ['United States', 'Utah', 'New York', 'Illinois']:
            return 'United States'
        if name.split(',')[0] in ['United States']:
            return 'United States'
    return name
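
To make the formatting rule concrete, here is a small usage sketch. The function body is repeated so the snippet runs on its own, and the input strings are invented; the one containing markup assumes that some locations carry HTML residue, which is what the split on `'<'` appears to handle.

```python
def name_to_country(name: str) -> str:
    # Same rule as in the cell above, repeated so the snippet is self-contained.
    if len(name) >= 13:
        if name.split('<')[0] in ['United States', 'Utah', 'New York', 'Illinois']:
            return 'United States'
        if name.split(',')[0] in ['United States']:
            return 'United States'
    return name

# Invented inputs covering each branch of the rule.
examples = [
    'United States, California',   # "country, state" form
    'Utah</a> | <a href="#">map',  # state name followed by markup residue
    'Canada, Ontario',             # left unchanged, handled by the continent map
    'Belgium',                     # shorter than 13 characters, returned as-is
]
for location in examples:
    print(f"{location!r} -> {name_to_country(location)!r}")
```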

In [5]:
country_continent_map = {
    'Kyrgyzstan': 'Asia', 'Gabon': 'Africa', 'Northern Ireland': 'Europe',
    'Wales': 'Europe', 'Scotland': 'Europe', 'England': 'Europe',
    'Singapore': 'Asia', 'China': 'Asia', 'Chad': 'Africa', 
    'Saint Lucia': 'North America', 'Cameroon': 'Africa',
    'Burkina Faso': 'Africa', 'Zambia': 'Africa', 'Romania': 'Europe',
    'Nigeria': 'Africa', 'South Korea': 'Asia', 'Georgia': 'Asia',
    'Hong Kong': 'Asia', 'Guinea': 'Africa', 'Montenegro': 'Europe',
    'Benin': 'Africa', 'Mexico': 'North America', 'Fiji Islands': 'Oceania',
    'Guam': 'Oceania', 'Laos': 'Asia', 'Senegal': 'Africa',
    'Honduras': 'North America', 'Morocco': 'Africa', 'Indonesia': 'Asia',
    'Monaco': 'Europe', 'Ukraine': 'Europe', 'Canada': 'North America',
    'Jordan': 'Asia', 'Portugal': 'Europe', 'Guernsey': 'Europe',
    'India': 'Asia', 'Puerto Rico': 'North America', 'Japan': 'Asia',
    'Iran': 'Asia', 'Hungary': 'Europe', 'Bulgaria': 'Europe',
    'Guinea-Bissau': 'Africa', 'Liberia': 'Africa', 'Togo': 'Africa',
    'Niger': 'Africa', 'Croatia': 'Europe', 'Lithuania': 'Europe',
    'Cyprus': 'Asia', 'Italy': 'Europe', 'Andorra': 'Europe',
    'Botswana': 'Africa', 'Turks and Caicos Islands': 'North America',
    'Papua New Guinea': 'Oceania', 'Mongolia': 'Asia', 'Ethiopia': 'Africa',
    'Denmark': 'Europe', 'French Polynesia': 'Oceania', 'Greece': 'Europe',
    'Sri Lanka': 'Asia', 'Syria': 'Asia', 'Germany': 'Europe', 'Jersey': 'Europe',
    'Armenia': 'Asia', 'Mozambique': 'Africa', 'Palestine': 'Asia',
    'Bangladesh': 'Asia', 'Turkmenistan': 'Asia', 'Reunion': 'Africa',
    'Eritrea': 'Africa', 'Switzerland': 'Europe', 'Malta': 'Europe',
    'Israel': 'Asia', 'El Salvador': 'North America', 'French Guiana': 'South America',
    'Tonga': 'Oceania', 'Zimbabwe': 'Africa', 'Samoa': 'Oceania', 'Barbados': 'North America',
    'Chile': 'South America', 'Cambodia': 'Asia', 'Cook Islands': 'Oceania',
    'Trinidad & Tobago': 'North America', 'Bhutan': 'Asia', 'Uzbekistan': 'Asia',
    'Egypt': 'Africa', 'Uruguay': 'South America', 'Dominican Republic': 'North America',
    'Equatorial Guinea': 'Africa', 'Russia': 'Europe', 'Tajikistan': 'Asia',
    'Vietnam': 'Asia', 'Palau': 'Oceania', 'Namibia': 'Africa',
    'Cayman Islands': 'North America', 'Sao Tome and Principe': 'Africa', 'Australia': 'Oceania',
    'Martinique': 'North America', 'Virgin Islands (British)': 'North America',
    'Ecuador': 'South America', 'Vanuatu': 'Oceania', 'Congo': 'Africa',
    'Uganda': 'Africa', 'Mauritius': 'Africa', 'Azerbaijan': 'Asia',
    'Argentina': 'South America', 'Tunisia': 'Africa', 'Belize': 'North America',
    'Luxembourg': 'Europe', 'Madagascar': 'Africa', 'Aruba': 'North America',
    'Spain': 'Europe', 'Swaziland': 'Africa', 'South Sudan': 'Africa',
    'Belarus': 'Europe', 'Ivory Coast': 'Africa', 'Austria': 'Europe',
    'Bolivia': 'South America', 'Central African Republic': 'Africa',
    'Mali': 'Africa', 'Suriname': 'South America', 'Solomon Islands': 'Oceania',
    'Rwanda': 'Africa', 'Brazil': 'South America', 'Gibraltar': 'Europe',
    'Taiwan': 'Asia', 'Turkey': 'Asia', 'Greenland': 'North America',
    'Moldova': 'Europe', 'Haiti': 'North America', 'Guadeloupe': 'North America',
    'South Africa': 'Africa', 'Lesotho': 'Africa', 'Czech Republic': 'Europe',
    'Micronesia': 'Oceania', 'Paraguay': 'South America', 'Iraq': 'Asia',
    'Faroe Islands': 'Europe', 'Panama': 'North America', 'Netherlands': 'Europe',
    'Peru': 'South America', 'New Zealand': 'Oceania', 'Ghana': 'Africa',
    'Slovenia': 'Europe', 'Serbia': 'Europe', 'Macedonia': 'Europe',
    'Latvia': 'Europe', 'Guatemala': 'North America', 'Cuba': 'North America',
    'Venezuela': 'South America', 'Angola': 'Africa', 'Finland': 'Europe',
    'Nicaragua': 'North America', 'Sweden': 'Europe', 'Seychelles': 'Africa',
    'Poland': 'Europe', 'Cape Verde Islands': 'Africa', 'Libya': 'Africa',
    'Isle of Man': 'Europe', 'Ireland': 'Europe', 'Myanmar': 'Asia',
    'Algeria': 'Africa', 'Kazakhstan': 'Asia', 'Norway': 'Europe',
    'United States': 'North America', 'Costa Rica': 'North America',
    'North Korea': 'Asia', 'Bosnia and Herzegovina': 'Europe', 'Jamaica': 'North America',
    'Lebanon': 'Asia', 'Dominica': 'North America', 'Virgin Islands (U.S.)': 'North America',
    'Colombia': 'South America', 'Iceland': 'Europe', 'Macau': 'Asia',
    'Grenada': 'North America', 'Malaysia': 'Asia', 'Belgium': 'Europe',
    'Saint Vincent and The Grenadines': 'North America', 'Bahamas': 'North America',
    'Philippines': 'Asia', 'Curaçao': 'North America', 'San Marino': 'Europe',
    'France': 'Europe', 'Bermuda': 'North America', 'Mayotte': 'Africa',
    'Antigua & Barbuda': 'North America', 'Estonia': 'Europe', 'Gambia': 'Africa',
    'Pakistan': 'Asia', 'New Caledonia': 'Oceania', 'Slovak Republic': 'Europe',
    'Liechtenstein': 'Europe', 'Tanzania': 'Africa', 'Malawi': 'Africa',
    'Nepal': 'Asia', 'United Arab Emirates': 'Asia', 'Kenya': 'Africa',
    'Thailand': 'Asia', 'Albania': 'Europe', 'Canada, Ontario': 'North America',
    'United Kingdom, England': 'Europe', 'Canada, Manitoba': 'North America',
    'Canada, Nova Scotia': 'North America', 'Canada, Quebec': 'North America',
    'Canada, Newfoundland and Labrador': 'North America', 'Canada, Alberta': 'North America',
    'Canada, British Columbia': 'North America', 'Canada, Saskatchewan': 'North America',
    'UNKNOWN': 'Unknown', 'Canada, New Brunswick': 'North America',
    'United Kingdom, Wales': 'Europe', 'United Kingdom, Scotland': 'Europe'
}

In [6]:
style_classification = {
    "Lager": ["lager"],
    "Ale": ["ale"],
    "IPA": ["ipa"],
    "Stout/Porter": ["stout", "porter"],
    "Wheat/Sour": ["wheat", "sour"],
    "Bitter": ["bitter"],
    "Saké": ["saké"],
}

In [7]:
def style_to_type(style: str) -> str:
    '''
    Classifies a given beer style into a broader type 
    based on predefined keywords.
    :param style: str, a string representing a beer style.
    :return: str, the broader beer type or "Other" if no match is found.
    '''
    for k, v in style_classification.items():
        for keyword in v:
            if keyword in style.lower():
                return k
    return "Other"
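
The keyword match returns the first type whose keyword occurs in the style name, so dictionary order matters. The sketch below repeats the mapping and an equivalent function so it runs alone; the style names are illustrative inputs, not necessarily present in the datasets.

```python
style_classification = {
    "Lager": ["lager"],
    "Ale": ["ale"],
    "IPA": ["ipa"],
    "Stout/Porter": ["stout", "porter"],
    "Wheat/Sour": ["wheat", "sour"],
    "Bitter": ["bitter"],
    "Saké": ["saké"],
}

def style_to_type(style: str) -> str:
    # First matching type wins, following the insertion order of the dict.
    for beer_type, keywords in style_classification.items():
        if any(keyword in style.lower() for keyword in keywords):
            return beer_type
    return "Other"

examples = ["American IPA", "India Pale Ale (IPA)", "Pale Lager", "Imperial Stout", "Pilsener"]
for style in examples:
    print(f"{style} -> {style_to_type(style)}")
```

Note that a name containing both "ale" and "ipa" is classified as "Ale", because "Ale" comes before "IPA" in the dictionary.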

## **PROCESSING BEERADVOCATE**

### breweries.csv

Continent categorization (+2 col: country & continent)

In [8]:
ba_breweries['country'] = ba_breweries['location'].apply(lambda name : name_to_country(name))
ba_breweries['continent'] = ba_breweries['country'].apply(lambda country : country_continent_map.get(country, 'Unknown'))

### beers.csv

Continent & beer type categorization (+2 col: continent & type)

In [9]:
ba_dict_id_br_concat = dict(zip(ba_breweries['id'], ba_breweries['continent']))
ba_beers['continent'] = ba_beers['brewery_id'].apply(lambda id_: ba_dict_id_br_concat.get(id_))
ba_beers['type'] = ba_beers['style'].apply(style_to_type)
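
The brewery-to-continent lookup can be sketched on toy data. This variant uses `Series.map` with a dict rather than `apply` with `dict.get`; ids missing from the dict become NaN instead of None. The frames below are invented.

```python
import pandas as pd

# Invented miniature versions of the breweries and beers tables.
breweries = pd.DataFrame({"id": [1, 2], "continent": ["Europe", "North America"]})
beers = pd.DataFrame({"beer_id": [10, 11, 12], "brewery_id": [1, 2, 99]})

# Build the id -> continent lookup once, then map every beer through it.
id_to_continent = dict(zip(breweries["id"], breweries["continent"]))
beers["continent"] = beers["brewery_id"].map(id_to_continent)
print(beers)
```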

### users.csv


Continent categorization (+2 col: country & continent)

In [10]:
ba_users['location'] = ba_users['location'].astype(str)
ba_users['country'] = ba_users['location'].apply(lambda name : name_to_country(name))
ba_users['continent'] = ba_users['country'].apply(lambda country : country_continent_map.get(country, 'Unknown'))

Date conversion: `joined` goes from a Unix timestamp (seconds) to a `dd/mm/YYYY` string

In [11]:
ba_users['joined'] = pd.to_datetime(ba_users['joined'], unit='s').dt.strftime('%d/%m/%Y')
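
A small sketch of the same conversion on hand-picked timestamps, to show what the resulting column looks like. The values are chosen for illustration only.

```python
import pandas as pd

# Unix timestamps in seconds: the epoch, one day later, and 1e9 seconds.
joined = pd.Series([0, 86_400, 1_000_000_000])

as_datetime = pd.to_datetime(joined, unit='s')
as_text = as_datetime.dt.strftime('%d/%m/%Y')
print(as_text.tolist())  # ['01/01/1970', '02/01/1970', '09/09/2001']
```

After `strftime` the column holds plain strings, so sorting it is lexicographic rather than chronological; keeping the datetime version around may be preferable for time-based analysis.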

### ratings.txt / reviews.txt

ratings.txt != reviews.txt

ratings :
[151 074 576 lines i.e. 151 074 576/18 = 8 393 032 ratings]

reviews :
[44 022 962 lines i.e. 44 022 962/17 = 2 589 586 reviews]
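
The divisors in the counts above can be checked with a few lines of arithmetic. The sketch assumes each record is its listed fields (17 for ratings, 16 for reviews) followed by one blank separator line.

```python
FIELDS_PER_RATING = 17  # fields listed for BeerAdvocate ratings.txt
FIELDS_PER_REVIEW = 16  # same fields without the `review` flag

ratings_lines = 151_074_576
reviews_lines = 44_022_962

# +1 accounts for the assumed blank line that separates records.
n_ratings, ratings_rest = divmod(ratings_lines, FIELDS_PER_RATING + 1)
n_reviews, reviews_rest = divmod(reviews_lines, FIELDS_PER_REVIEW + 1)

print(n_ratings, ratings_rest)
print(n_reviews, reviews_rest)
```

A remainder of zero in both cases is consistent with every record having the full set of fields.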

----------------------------------------------------------------------------------------------------

Conversion of the .txt file to a DataFrame

In [12]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating']
chunk_size = 1_000_000
data = []
entry_count = 0
chunk_count = 0
current_entry = {}

with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            if line.startswith('text:'):
                continue
            if ':' in line:
                key, value = line.split(':', 1)
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:
                data.append(current_entry)
                current_entry = {}
                entry_count += 1

                # Save chunk when reaching chunk size
                if entry_count >= chunk_size:
                    chunk_df = pd.DataFrame(data, columns=columns)
                    chunk_file_path = f"../../generated/ba_chunks/ba_reviews_chunk_{chunk_count}.parquet"
                    chunk_df.to_parquet(chunk_file_path)
                    print(f"Saved {chunk_file_path}")
                    data = []
                    entry_count = 0
                    chunk_count += 1
                    
# Process any remaining entries after the loop
if data:
    chunk_df = pd.DataFrame(data, columns=columns)
    chunk_file_path = f"../../generated/ba_chunks/ba_reviews_chunk_{chunk_count}.parquet"
    chunk_df.to_parquet(chunk_file_path)
    print(f"Saved {chunk_file_path}")

Saved ../../generated/ba_chunks/ba_reviews_chunk_0.parquet
Saved ../../generated/ba_chunks/ba_reviews_chunk_1.parquet
Saved ../../generated/ba_chunks/ba_reviews_chunk_2.parquet
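
The record-parsing logic of the cell above can be tried on an in-memory sample. This is a sketch under the same assumptions (one `key: value` pair per line, a blank line between records, the `text` field skipped); the helper name `parse_records` and the sample content are hypothetical.

```python
import io

# Invented two-record sample in the "key: value" line format.
sample = io.StringIO(
    "beer_name: Example Ale\n"
    "beer_id: 1\n"
    "rating: 4.25\n"
    "text: Nice: malty and dry\n"
    "\n"
    "beer_name: Example Stout\n"
    "beer_id: 2\n"
    "rating: 3.5\n"
    "text: Roasty\n"
    "\n"
)

def parse_records(lines):
    """Yield one dict per blank-line-terminated record, skipping the text field."""
    entry = {}
    for line in lines:
        line = line.strip()
        if not line:
            if entry:
                yield entry
                entry = {}
            continue
        if line.startswith('text:') or ':' not in line:
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, value = line.split(':', 1)
        entry[key.strip()] = value.strip()
    if entry:  # extra guard for a file that does not end with a blank line
        yield entry

records = list(parse_records(sample))
print(records)
```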


In [13]:
ba_chunk_files = glob.glob("../../generated/ba_chunks/ba_reviews_chunk_*.parquet")
ba_reviews = pd.concat([pd.read_parquet(ba_chunk) for ba_chunk in ba_chunk_files], ignore_index=True)
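
`glob.glob` does not guarantee any particular order, so the concatenated rows may not follow the order of the original file. If that order matters, the chunk paths could be sorted by their numeric suffix first. A minimal sketch, with hypothetical file names and a hypothetical helper `chunk_index`:

```python
import re

# Hypothetical chunk names, listed out of order on purpose.
chunk_files = [
    "ba_reviews_chunk_10.parquet",
    "ba_reviews_chunk_2.parquet",
    "ba_reviews_chunk_0.parquet",
]

def chunk_index(path: str) -> int:
    """Extract the integer N from a name ending in 'chunk_N.parquet'."""
    match = re.search(r"chunk_(\d+)\.parquet$", path)
    if match is None:
        raise ValueError(f"unexpected chunk file name: {path}")
    return int(match.group(1))

# A plain sorted() would put chunk_10 before chunk_2; the numeric key avoids that.
ordered = sorted(chunk_files, key=chunk_index)
print(ordered)
```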

In [14]:
cols_to_numeric = ['beer_id', 'brewery_id', 'abv', 'date', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
ba_reviews[cols_to_numeric] = ba_reviews[cols_to_numeric].apply(pd.to_numeric, errors = 'coerce')
ba_reviews['date'] = pd.to_datetime(ba_reviews['date'], unit='s').dt.strftime('%d/%m/%Y')
ba_reviews['continent'] = ba_reviews['brewery_id'].apply(lambda id_: ba_dict_id_br_concat.get(int(id_)))
ba_reviews['type'] = ba_reviews['style'].apply(style_to_type)
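
With `errors='coerce'`, values that cannot be parsed become NaN, and calling `int()` on NaN raises a `ValueError`. The sketch below shows both effects on invented values, which may be worth keeping in mind if any `brewery_id` fails to parse.

```python
import pandas as pd

# Invented raw ids as they would come out of the text parser (all strings).
raw = pd.Series(['12', 'nan', 'abc', '7'])

numeric = pd.to_numeric(raw, errors='coerce')
print(numeric.tolist())  # unparseable entries are now NaN

try:
    int(float('nan'))
except ValueError as exc:
    print(f"int(NaN) fails: {exc}")

# Dropping the NaN rows first makes the integer conversion safe.
valid_ids = numeric.dropna().astype(int)
print(valid_ids.tolist())
```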

## **PROCESSING RATEBEER**

### breweries.csv

Continent categorization (+2 col: country & continent)

In [15]:
rb_breweries['country'] = rb_breweries['location'].apply(lambda name : name_to_country(name))
rb_breweries['continent'] = rb_breweries['country'].apply(lambda country : country_continent_map.get(country, 'Unknown'))

### beers.csv

Continent & beer type categorization (+2 col: continent & type)

In [16]:
rb_dict_id_br_concat = dict(zip(rb_breweries['id'], rb_breweries['continent']))
rb_beers['continent'] = rb_beers['brewery_id'].apply(lambda id_: rb_dict_id_br_concat.get(id_))
rb_beers['type'] = rb_beers['style'].apply(style_to_type)

### users.csv

Continent categorization (+2 col: country & continent)

In [17]:
rb_users['location'] = rb_users['location'].astype(str)
rb_users['country'] = rb_users['location'].apply(lambda name : name_to_country(name))
rb_users['continent'] = rb_users['country'].apply(lambda country : country_continent_map.get(country, 'Unknown'))

Date conversion: `joined` goes from a Unix timestamp (seconds) to a `dd/mm/YYYY` string

In [18]:
rb_users['joined'] = pd.to_datetime(rb_users['joined'], unit='s').dt.strftime('%d/%m/%Y')

### ratings.txt / reviews.txt

ratings.txt = reviews.txt (i.e. no difference for this dataset)

[121 075 258 lines i.e. 121 075 258/17 = 7 122 074 reviews]

----------------------------------------------------------------------------------------------------

Conversion of the .txt file to a DataFrame

In [19]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating']
chunk_size = 1_000_000
data = []
entry_count = 0
chunk_count = 0
current_entry = {}

with open(RB_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()
        if line:
            if line.startswith('text:'):
                continue
            if ':' in line:
                key, value = line.split(':', 1)
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:
                data.append(current_entry)
                current_entry = {}
                entry_count += 1

                # Save chunk when reaching chunk size
                if entry_count >= chunk_size:
                    chunk_df = pd.DataFrame(data, columns=columns)
                    chunk_file_path = f"../../generated/rb_chunks/rb_reviews_chunk_{chunk_count}.parquet"
                    chunk_df.to_parquet(chunk_file_path)
                    print(f"Saved {chunk_file_path}")
                    data = []
                    entry_count = 0
                    chunk_count += 1
                    
# Process any remaining entries after the loop
if data:
    chunk_df = pd.DataFrame(data, columns=columns)
    chunk_file_path = f"../../generated/rb_chunks/rb_reviews_chunk_{chunk_count}.parquet"
    chunk_df.to_parquet(chunk_file_path)
    print(f"Saved {chunk_file_path}")

Saved ../../generated/rb_chunks/rb_reviews_chunk_0.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_1.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_2.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_3.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_4.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_5.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_6.parquet
Saved ../../generated/rb_chunks/rb_reviews_chunk_7.parquet


In [20]:
rb_chunk_files = glob.glob("../../generated/rb_chunks/rb_reviews_chunk_*.parquet")
rb_reviews = pd.concat([pd.read_parquet(rb_chunk) for rb_chunk in rb_chunk_files], ignore_index=True)

In [21]:
cols_to_numeric = ['beer_id', 'brewery_id', 'abv', 'date', 'appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
rb_reviews[cols_to_numeric] = rb_reviews[cols_to_numeric].apply(pd.to_numeric, errors = 'coerce')
rb_reviews['date'] = pd.to_datetime(rb_reviews['date'], unit='s').dt.strftime('%d/%m/%Y')
rb_reviews['continent'] = rb_reviews['brewery_id'].apply(lambda id_: rb_dict_id_br_concat.get(int(id_)))
rb_reviews['type'] = rb_reviews['style'].apply(style_to_type)

## **SAVING PROCESSED DFs**

In [22]:
ba_breweries.to_csv('../../generated/new_ba_breweries.csv')
ba_beers.to_csv('../../generated/new_ba_beers.csv')
ba_users.to_csv('../../generated/new_ba_users.csv')
ba_reviews.to_parquet('../../generated/new_ba_reviews.parquet')

rb_breweries.to_csv('../../generated/new_rb_breweries.csv')
rb_beers.to_csv('../../generated/new_rb_beers.csv')
rb_users.to_csv('../../generated/new_rb_users.csv')
rb_reviews.to_parquet('../../generated/new_rb_reviews.parquet')
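
Since `to_csv` is called without `index=False`, each saved CSV starts with an unnamed index column. A sketch of reading such a file back, using an in-memory buffer and an invented frame in place of the real files:

```python
import io
import pandas as pd

# Invented frame standing in for one of the processed tables.
df = pd.DataFrame({"id": [1, 2], "continent": ["Europe", "Asia"]})

buffer = io.StringIO()
df.to_csv(buffer)  # same default as in the cell above: the index is written too
header_line = buffer.getvalue().splitlines()[0]
print(header_line)  # the header starts with an empty label for the index column

# index_col=0 restores the index instead of creating an 'Unnamed: 0' column.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
```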