## Data from Beer Reviews

*Data*: BeerAdvocate / RateBeer / matched_beer_data

*Difference between ratings and reviews*: **reviews.txt** appears to be a subset of **ratings.txt**: the latter has an extra `review` field (True or False), and **reviews.txt** contains the ratings for which that field is True.
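
The subset relationship described above could be checked along these lines. This is only a sketch on toy data: the two small DataFrames stand in for the parsed files, the values are made up, and it assumes that (`beer_id`, `user_id`) identifies a rating uniquely.

```python
import pandas as pd

# Toy stand-in for the parsed ratings.txt; values are illustrative only.
ratings = pd.DataFrame({
    "beer_id": [1, 1, 2, 3],
    "user_id": ["a", "b", "a", "c"],
    "rating": [4.0, 3.5, 4.25, 2.0],
    "review": ["True", "False", "True", "False"],  # kept as strings, as read from the file
})
# Toy stand-in for the parsed reviews.txt.
reviews = pd.DataFrame({
    "beer_id": [1, 2],
    "user_id": ["a", "a"],
    "rating": [4.0, 4.25],
})

# Keep only the ratings flagged as reviews and drop the flag column.
flagged = ratings[ratings["review"] == "True"].drop(columns="review")

# Compare both tables after sorting on the assumed key, ignoring row order.
key = ["beer_id", "user_id"]
same_rows = (
    flagged.sort_values(key).reset_index(drop=True)
    .equals(reviews.sort_values(key).reset_index(drop=True))
)
print(same_rows)
```

On the real files the same comparison would need both tables parsed with identical column types, otherwise `equals` reports a mismatch even when the values agree.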

*Lines to print the .txt files*: 
* """with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(16):
        print(file.readline())"""
* """with open(BA_RATINGS_DATASET, 'r', encoding='utf-8') as file:
    for _ in range(17):
        print(file.readline())"""
* !head Data/BeerAdvocate/ratings.txt/ratings.txt
* """from collections import deque
n_last_lines = 10
with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    last_lines = deque(file, maxlen=n_last_lines)
for line in last_lines:
    print(line.strip())"""

# BeerAdvocate

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* nbr_reviews
* avg
* ba_score
* bros_score
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* joined
* location

**ratings.txt** (line-by-line format, i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance
* aroma
* palate
* taste
* overall
* rating
* text
* review: *True or False*

**reviews.txt** (line-by-line format, i.e. Header=None, subset of **ratings.txt**)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 5*
* palate : *up to 5*
* taste : *up to 5*
* overall : *up to 5*
* rating : *up to 5, unknown formula, but the parameters carry different weights*
* text
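
Since the weighting behind the BeerAdvocate `rating` is not known, one possible way to estimate it is a least-squares fit of `rating` on the five aspect scores. The sketch below runs on synthetic data: the weights in `true_w` are invented for the demonstration and are not BeerAdvocate's, and it assumes the rating is a plain weighted sum without intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic aspect scores (appearance, aroma, palate, taste, overall), between 1 and 5.
X = rng.uniform(1.0, 5.0, size=(200, 5))

# Hypothetical weights, used only to generate the toy ratings below.
true_w = np.array([0.1, 0.2, 0.1, 0.4, 0.2])
y = X @ true_w

# Least-squares fit without intercept: find w minimising ||X w - y||.
w_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 3))
```

On the real data, `X` would be the aspect columns of the parsed reviews and `y` the `rating` column, after dropping rows with missing values. A large residual would suggest that the weighted-sum assumption does not hold.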

----------------------------------------------------------------------------------------------------

# RateBeer

*Appearance and Mouthfeel (= Palate) are each scored out of 5. Aroma and Taste are scored out of 10. While Overall is scored out of 20. These all combine to give the beer a total score out of 50, which is then divided and displayed as a score out of 5 for each rating.*
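
The scoring rule above amounts to a sum followed by a division by 10. A minimal sketch, where the function name `rb_rating` and the example scores are made up for illustration:

```python
def rb_rating(appearance, aroma, palate, taste, overall):
    """Combine RateBeer aspect scores into the displayed rating out of 5."""
    # appearance and palate are out of 5, aroma and taste out of 10, overall out of 20
    total = appearance + aroma + palate + taste + overall  # out of 50
    return total / 10

print(rb_rating(appearance=4, aroma=7, palate=3, taste=8, overall=15))
print(rb_rating(appearance=5, aroma=10, palate=5, taste=10, overall=20))
```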

**beers.csv**
* beer_id
* beer_name
* brewery_id
* brewery_name
* style
* nbr_ratings
* overall_score
* style_score
* avg
* abv
* avg_computed
* zscore
* nbr_matched_valid_ratings
* avg_matched_valid_ratings

**breweries.csv**
* id
* location
* name
* nbr_beers

**users.csv**
* nbr_ratings
* user_id
* user_name
* joined
* location

**ratings.txt = reviews.txt** (format ligne i.e. Header=None)
* beer_name
* beer_id
* brewery_name
* brewery_id
* style
* abv
* date
* user_name
* user_id
* appearance : *up to 5*
* aroma : *up to 10*
* palate (=mouthfeel) : *up to 5*
* taste : *up to 10*
* overall : *up to 20*
* rating : *up to 50 (sum of all previous) then divided by 10 --> up to 5*
* text

----------------------------------------------------------------------------------------------------

# matched_beer_data

**beers.csv**
### ba:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* ba_score
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* bros_score
* nbr_matched_valid_ratings
* nbr_ratings
* nbr_reviews
* style
* zscore
### rb:
* abv
* avg
* avg_computed
* avg_matched_valid_ratings
* beer_id
* beer_name
* beer_wout_brewery_name
* brewery_id
* brewery_name
* nbr_matched_valid_ratings
* nbr_ratings
* overall_score
* style
* style_score
* zscore
### scores:
* diff
* sim

**breweries.csv**
### ba:
* id
* location
* name
* nbr_beers
### rb:
* id
* location
* name
* nbr_beers
### scores:
* diff
* sim

**ratings.csv**
### ba:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* review
* style
* taste
* text
* user_id
* user_name
### rb:
* abv
* appearance
* aroma
* beer_id
* beer_name
* brewery_id
* brewery_name
* date
* overall
* palate
* rating
* style
* taste
* text
* user_id
* user_name


**users_approx.csv**
### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower
### scores:
* sim

**users.csv**
### ba:
* joined
* location
* nbr_ratings
* nbr_reviews
* user_id
* user_name
* user_name_lower
### rb:
* joined
* location
* nbr_ratings
* user_id
* user_name
* user_name_lower

----------------------------------------------------------------------------------------------------

# Loading the data

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
DATA_FOLDER = 'C:/Users/leroy/Documents/Etudes/EPFL/Master EST/Cours/MA 1/Applied Data Analysis/bADA55-project/data/'
BEER_ADVOCATE_FOLDER = DATA_FOLDER + 'BeerAdvocate/' #BA
RATE_BEER_FOLDER = DATA_FOLDER + 'RateBeer/' #RB
MATCHED_BEER = DATA_FOLDER + 'matched_beer_data/' #MB

BA_BEERS_DATASET = BEER_ADVOCATE_FOLDER + "beers.csv"
BA_BREWERIES_DATASET = BEER_ADVOCATE_FOLDER + "breweries.csv"
BA_USERS_DATASET = BEER_ADVOCATE_FOLDER + "users.csv"
BA_RATINGS_DATASET = BEER_ADVOCATE_FOLDER + 'ratings.txt/' + "ratings.txt"
BA_REVIEWS_DATASET = BEER_ADVOCATE_FOLDER + 'reviews.txt/' + "reviews.txt"

RB_BEERS_DATASET = RATE_BEER_FOLDER + "beers.csv"
RB_BREWERIES_DATASET = RATE_BEER_FOLDER + "breweries.csv"
RB_USERS_DATASET = RATE_BEER_FOLDER + "users.csv"
RB_RATINGS_DATASET = RATE_BEER_FOLDER + 'ratings.txt/' + "ratings.txt"
RB_REVIEWS_DATASET = RATE_BEER_FOLDER + 'reviews.txt/' + "ratings.txt"

### Beer Advocate

ratings.txt != reviews.txt

ratings :
[151 074 576 lines / 18 lines per entry = 8 393 032 ratings]

reviews :
[44 022 962 lines / 17 lines per entry = 2 589 586 reviews]
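
The counts above can be double-checked with integer division. The sketch assumes that every entry spans a fixed number of lines (its fields plus one blank separator line), which is inferred from the field lists rather than stated by the dataset.

```python
# Line counts quoted above, with the assumed number of lines per entry.
files = {
    "ratings": (151_074_576, 18),
    "reviews": (44_022_962, 17),
}

entries = {}
for name, (n_lines, lines_per_entry) in files.items():
    n_entries, remainder = divmod(n_lines, lines_per_entry)
    entries[name] = (n_entries, remainder)
    print(name, n_entries, remainder)
```

A remainder of zero is consistent with the fixed block size. A non-zero remainder would point to entries with missing fields or to multi-line text.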

In [8]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating']
data = []
current_entry = {}

max_entries = 2589586  # set to 2 589 586 to load all reviews
entry_count = 0

with open(BA_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()  # Strip leading/trailing whitespace
        if line:  # Non-empty line
            if line.startswith('text:'):
                continue  # Skip the review text lines
            if ':' in line:
                key, value = line.split(':', 1)  # Split into key and value
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:  # A blank line ends a block: store the entry
                data.append(current_entry)
                current_entry = {}  # Reset for the next block
                entry_count += 1
                if entry_count >= max_entries:  # Stop once max_entries is reached
                    break

# Add the last entry if the file does not end with a blank line
if current_entry and entry_count < max_entries:
    data.append(current_entry)

ba_reviews = pd.DataFrame(data, columns=columns)
ba_reviews["date"] = pd.to_numeric(ba_reviews["date"])
ba_reviews["date"] = pd.to_datetime(ba_reviews["date"], unit='s').dt.strftime('%d/%m/%Y')
cols = ['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
ba_reviews[cols] = ba_reviews[cols].apply(pd.to_numeric, errors = 'coerce')
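
The parsing loop above could also be written as a reusable generator, so that the same logic serves both the BeerAdvocate and the RateBeer files. This is a sketch, not the notebook's own code: `iter_entries` and its `skip_keys` parameter are hypothetical names, and the sample text is made up.

```python
import io
from itertools import islice

def iter_entries(lines, skip_keys=("text",)):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    entry = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            if entry:  # a blank line closes the current block
                yield entry
                entry = {}
            continue
        key, sep, value = line.partition(":")
        if sep and key.strip() not in skip_keys:
            entry[key.strip()] = value.strip()
    if entry:  # the input did not end with a blank line
        yield entry

# Made-up sample in the same layout as the .txt files.
sample = io.StringIO(
    "beer_name: Toy Lager\nbeer_id: 1\nrating: 3.5\ntext: nice: really\n\n"
    "beer_name: Toy Stout\nbeer_id: 2\nrating: 4.0\ntext: ok\n"
)
entries = list(islice(iter_entries(sample), 2))  # islice plays the role of max_entries
print(entries)
```

With a real file, the list could be built as `list(islice(iter_entries(file), max_entries))` and passed to `pd.DataFrame(..., columns=columns)` as in the cell above.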

In [9]:
ba_beers = pd.read_csv(BA_BEERS_DATASET)
ba_breweries = pd.read_csv(BA_BREWERIES_DATASET)
ba_users = pd.read_csv(BA_USERS_DATASET)
ba_users['joined'] = pd.to_datetime(ba_users['joined'], unit='s').dt.strftime('%d/%m/%Y')

### Rate Beer

ratings.txt = reviews.txt (i.e. no difference for this dataset)

[121 075 258 lines / 17 lines per entry = 7 122 074 reviews]
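
One way to confirm that two large files are byte-for-byte identical is to compare their hashes, reading in chunks to keep memory use low. The sketch below demonstrates the idea on two small temporary files. `file_digest` is a hypothetical helper defined here, and the file contents are made up.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on two small temporary files with identical content.
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    for p in (a, b):
        with open(p, "w", encoding="utf-8") as f:
            f.write("beer_name: Toy Lager\nrating: 3.5\n\n")
    identical = file_digest(a) == file_digest(b)
print(identical)
```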

In [None]:
columns = ['beer_name', 'beer_id', 'brewery_name', 'brewery_id', 'style', 'abv', 'date', 
           'user_name', 'user_id', 'appearance', 'aroma', 'palate', 'taste', 'overall', 
           'rating']
data = []
current_entry = {}

max_entries = 7122074  # set to 7 122 074 to load all reviews
entry_count = 0

with open(RB_REVIEWS_DATASET, 'r', encoding='utf-8') as file:
    for line in file:
        line = line.strip()  # Strip leading/trailing whitespace
        if line:  # Non-empty line
            if line.startswith('text:'):
                continue  # Skip the review text lines
            if ':' in line:
                key, value = line.split(':', 1)  # Split into key and value
                key = key.strip()
                value = value.strip()
                current_entry[key] = value
        else:
            if current_entry:  # A blank line ends a block: store the entry
                data.append(current_entry)
                current_entry = {}  # Reset for the next block
                entry_count += 1
                if entry_count >= max_entries:  # Stop once max_entries is reached
                    break

# Add the last entry if the file does not end with a blank line
if current_entry and entry_count < max_entries:
    data.append(current_entry)

rb_reviews = pd.DataFrame(data, columns=columns)
rb_reviews["date"] = pd.to_numeric(rb_reviews["date"])
rb_reviews["date"] = pd.to_datetime(rb_reviews["date"], unit='s').dt.strftime('%d/%m/%Y')
cols = ['appearance', 'aroma', 'palate', 'taste', 'overall', 'rating']
rb_reviews[cols] = rb_reviews[cols].apply(pd.to_numeric)

In [None]:
rb_beers = pd.read_csv(RB_BEERS_DATASET)
rb_breweries = pd.read_csv(RB_BREWERIES_DATASET)
rb_users = pd.read_csv(RB_USERS_DATASET)
rb_users['joined'] = pd.to_datetime(rb_users['joined'], unit='s').dt.strftime('%d/%m/%Y')

# Tendencies : BeerAdvocate

## Display :

In [None]:
print("ba_beers :\n")
display(ba_beers)
print("=" * 150)
print("\nba_breweries :\n")
display(ba_breweries)
print("=" * 150)
print("\nba_users :\n")
display(ba_users)
print("=" * 150)
print("\nba_reviews :\n")
display(ba_reviews)

## ba_beers

## ba_breweries

## ba_users

## ba_reviews

# Tendencies : Rate Beer

## Display :

In [None]:
print("rb_beers :\n")
display(rb_beers)
print("=" * 150)
print("\nrb_breweries :\n")
display(rb_breweries)
print("=" * 150)
print("\nrb_users :\n")
display(rb_users)
print("=" * 150)
print("\nrb_reviews :\n")
display(rb_reviews)

## rb_beers

beer_id : Unique identifier of each beer in the RateBeer dataset.\
beer_name : Name of the beer.\
brewery_id : Unique identifier of the brewery associated with the beer.\
brewery_name : Name of the brewery that produces the beer.\
style : Style of the beer (for example Pale Lager, Stout, Pilsener), i.e. its type or category.\
nbr_ratings : Total number of ratings received by this beer.\
overall_score : Overall score of the beer, based on the mean of all user ratings (possibly expressed on a 0 to 100 scale).\
style_score : Score of the beer within its style (compared with other beers of the same type).\
avg : -\
abv : Alcohol by volume (ABV) of the beer, expressed as a percentage.\
avg_computed : Mean rating of the beer, i.e. the mean of its ratings (the value shown on the website, rounded to 2 decimal places); *call it the rating*.\
zscore : -\
nbr_matched_valid_ratings : **Number of "valid" ratings for this beer in the matched data (for example, when comparing with another dataset such as BeerAdvocate).**\
avg_matched_valid_ratings : **Mean of the "valid" ratings for this beer in the matched data.**\
**bold = not sure**
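
The reading of `avg_computed` as the mean of a beer's ratings could be tested by recomputing that mean from the parsed reviews and comparing it with the column. The sketch below uses toy tables with made-up values. On the real data the inputs would be `rb_beers` and `rb_reviews`, whose `beer_id` values come out of the text parser as strings.

```python
import pandas as pd

# Toy stand-ins; values are illustrative only.
beers = pd.DataFrame({"beer_id": [1, 2], "avg_computed": [3.5, 4.0]})
reviews = pd.DataFrame({
    "beer_id": ["1", "1", "2"],  # strings, as produced by the text parser
    "rating": [3.0, 4.0, 4.0],
})

# Recompute the mean rating per beer after aligning the key type.
mean_rating = (
    reviews.assign(beer_id=reviews["beer_id"].astype(int))
    .groupby("beer_id")["rating"].mean()
    .rename("rating_mean")
    .reset_index()
)

# Put both values side by side and measure the gap.
check = beers.merge(mean_rating, on="beer_id", how="left")
check["abs_diff"] = (check["avg_computed"] - check["rating_mean"]).abs()
print(check)
```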

In [None]:
display(rb_beers)

In [None]:
unique_style = rb_beers.groupby('style').agg(nbr_beers= ('beer_id', 'count'),
                                             nbr_ratings_sum= ('nbr_ratings', 'sum'),
                                             nbr_ratings_mean= ('nbr_ratings', 'mean'),
                                             overall_score_mean= ('overall_score', 'mean'),
                                             style_score_mean= ('style_score', 'mean'),
                                             avg_mean= ('avg', 'mean'),
                                             abv_mean= ('abv', 'mean'),
                                             avg_computed_mean= ('avg_computed', 'mean')).reset_index()
display(unique_style)

In [None]:
sns.histplot(rb_beers['avg_computed'], bins= 50, stat= 'proportion')
plt.title('Distribution of user ratings')
plt.xlabel('User Rating')
plt.ylabel('Proportion')
plt.show()

In [None]:
sns.histplot(rb_beers['nbr_ratings'], bins= 50, stat= 'proportion')
plt.yscale('log')
plt.title('Distribution of the number of ratings per beer (log scale)')
plt.xlabel("Number of ratings")
plt.ylabel('Proportion (log)')
plt.show()

In [None]:
fig, axs = plt.subplots(10, 10, figsize=(30, 30), sharex=True, sharey=True)
cmap = plt.get_cmap('tab20')
i, j = 0, 0

for idx, name in enumerate(rb_beers['style'].unique()):
    if j == 10:
        i += 1
        j = 0
    
    style_data = rb_beers[rb_beers['style'] == name]['avg_computed'].dropna()
    color = cmap(np.random.rand())

    if not style_data.empty:
        sns.histplot(style_data, bins= 50, binrange= [0, 5], ax= axs[i, j], stat= 'proportion', color= color)
        axs[i, j].set_title(name, fontsize= 15)
        axs[i, j].set_xlabel('')
        axs[i, j].set_ylabel('')
    
    j += 1

fig.text(0.5, -0.01, 'User rating', ha= 'center', fontsize= 20)
fig.text(-0.01, 0.5, 'Proportion', va= 'center', rotation= 'vertical', fontsize= 20)
fig.text(0.5, 1, 'Distribution of user ratings by beer style', ha= 'center', fontsize= 20)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize= (10, 20))
sorted_styles = rb_beers.groupby('style')['avg_computed'].mean().sort_values(ascending= False).index
sns.boxplot(data= rb_beers, x= 'avg_computed', y= 'style', palette= 'coolwarm', order= sorted_styles)
plt.title('Distribution of user ratings by beer style')
plt.xlabel('User rating')
plt.ylabel('Beer style')
plt.show()

In [None]:
fig, axs = plt.subplots(10, 10, figsize=(30, 30), sharex=True, sharey=True)
cmap = plt.get_cmap('tab20')
i, j = 0, 0

for idx, name in enumerate(rb_beers['style'].unique()):
    if j == 10:
        i += 1
        j = 0
    
    style_data = rb_beers[rb_beers['style'] == name]['abv'].dropna()
    color = cmap(np.random.rand())

    if not style_data.empty:
        sns.histplot(style_data, bins= 35, binrange= [0, 70], ax= axs[i, j], stat= 'proportion', color= color) #70 because Snake Venom is the highest one 67.5%, rest is error
        axs[i, j].set_title(name, fontsize= 15)
        axs[i, j].set_xlabel('')
        axs[i, j].set_ylabel('')
    
    j += 1

fig.text(0.5, -0.01, 'Abv', ha= 'center', fontsize= 20)
fig.text(-0.01, 0.5, 'Proportion', va= 'center', rotation= 'vertical', fontsize= 20)
fig.text(0.5, 1, 'Distribution of alcohol content by beer style', ha= 'center', fontsize= 20)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize= (10, 20))
sorted_styles = rb_beers.groupby('style')['abv'].mean().sort_values(ascending= False).index
sns.boxplot(data= rb_beers, x= 'abv', y= 'style', palette= 'coolwarm', order= sorted_styles)
plt.title('Distribution of alcohol content by beer style')
plt.xlabel('Abv')
plt.ylabel('Beer style')
plt.show()

In [None]:
plt.figure(figsize= (10, 10))
top_50 = unique_style.sort_values(ascending= False, by= 'nbr_beers').head(50)
sns.barplot(data= top_50, x= 'nbr_beers', y= 'style', hue= 'style', palette= 'coolwarm')
plt.title('Top 50 styles with the most beers')
plt.xlabel('Number of beers')
plt.ylabel('Beer style')
plt.show()

In [None]:
plt.figure(figsize= (10, 10))
top_50 = unique_style.sort_values(ascending= False, by= 'avg_computed_mean').head(50)
min_top_50 = min(top_50['avg_computed_mean'])
max_top_50 = max(top_50['avg_computed_mean'])
sns.barplot(data= top_50, x= 'avg_computed_mean', y= 'style', hue= 'style', palette= 'coolwarm')
plt.xlim(min_top_50 - 0.1, max_top_50 + 0.1)
plt.title('Top 50 styles with the highest mean beer rating')
plt.xlabel('Mean user rating')
plt.ylabel('Beer style')
plt.show()

## rb_breweries

id : Identifiant unique de chaque brasserie dans la base de données RateBeer.\
location : Emplacement géographique de la brasserie, incluant le pays ou la région.\
name : Nom de la brasserie.\
nbr_beers : Nombre de bières produites par cette brasserie, référencées dans la base de données.

If I build `unique_brewery` from `rb_beers`, I get fewer breweries than in `rb_breweries` (i.e. `rb_breweries` lists additional breweries; for example, `unique_brewery` contains no brewery without beers).

Code to check: *display(rb_breweries[~rb_breweries['id'].isin(unique_brewery['brewery_id'])])*

Comparing *nbr_beers* between `unique_brewery` and `rb_breweries`, `rb_breweries` sometimes reports more beers.
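
Both observations can be reproduced in one pass with a left merge and pandas' `indicator` option. The sketch below runs on toy tables: the ids and counts are made up, and `breweries` and `beers` stand in for `rb_breweries` and `rb_beers`.

```python
import pandas as pd

# Toy stand-ins; ids and counts are illustrative only.
breweries = pd.DataFrame({"id": [1, 2, 3], "nbr_beers": [2, 0, 5]})
beers = pd.DataFrame({"beer_id": [10, 11, 12, 13, 14],
                      "brewery_id": [1, 1, 3, 3, 3]})

# Count the beers actually present per brewery.
counted = (beers.groupby("brewery_id").size()
           .rename("nbr_beers_counted").reset_index())

# The left merge keeps every listed brewery; `_merge` flags those without any beer row.
comparison = breweries.merge(counted, left_on="id", right_on="brewery_id",
                             how="left", indicator=True)
missing = comparison[comparison["_merge"] == "left_only"]
more_listed = comparison[comparison["nbr_beers"] > comparison["nbr_beers_counted"]]
print(missing["id"].tolist(), more_listed["id"].tolist())
```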

In [None]:
display(rb_breweries.sort_values(by= 'id'))

In [None]:
unique_brewery = rb_beers.groupby('brewery_id').agg(nbr_beers= ('beer_id', 'count'),
                                             nbr_ratings_sum= ('nbr_ratings', 'sum'),
                                             nbr_ratings_mean= ('nbr_ratings', 'mean'),
                                             overall_score_mean= ('overall_score', 'mean'),
                                             style_score_mean= ('style_score', 'mean'),
                                             avg_mean= ('avg', 'mean'),
                                             abv_mean= ('abv', 'mean'),
                                             avg_computed_mean= ('avg_computed', 'mean')).reset_index()
display(unique_brewery)

In [None]:
plt.figure(figsize= (10, 10))
top_50 = rb_breweries.sort_values(ascending= False, by= 'nbr_beers').head(50).copy()  # copy so the column assignment below does not touch a slice
top_50['id'] = top_50['id'].astype(str)
sns.barplot(data= top_50, x= 'nbr_beers', y= 'id', hue= 'id', palette= 'coolwarm')
plt.title('Top 50 breweries with the most beers')
plt.xlabel('Number of beers')
plt.ylabel('Brewery ID')
plt.show()

In [None]:
sns.histplot(rb_breweries['nbr_beers'], bins= 50, stat= 'percent')
plt.title('Distribution of the number of beers per brewery')
plt.yscale('log')
plt.xlabel('Number of beers')
plt.ylabel('Percent (log)')
plt.show()

In [None]:
plt.figure(figsize= (10, 10))
top_50 = unique_brewery.sort_values(ascending= False, by= 'avg_computed_mean').head(50).copy()  # copy so the column assignment below does not touch a slice
top_50['brewery_id'] = top_50['brewery_id'].astype(str)
min_top_50 = min(top_50['avg_computed_mean'])
max_top_50 = max(top_50['avg_computed_mean'])
sns.barplot(data= top_50, x= 'avg_computed_mean', y= 'brewery_id', hue= 'brewery_id', palette= 'coolwarm')
plt.xlim(min_top_50 - 0.1, max_top_50 + 0.1)
plt.title('Top 50 breweries with the highest mean beer rating')
plt.xlabel('Mean user rating')
plt.ylabel('Brewery ID')
plt.show()

In [None]:
sns.histplot(unique_brewery['avg_computed_mean'], bins= 50, stat= 'percent')
plt.title('Distribution of mean beer rating per brewery')
plt.yscale('log')
plt.xlabel('Mean user rating')
plt.ylabel('Percent (log)')
plt.show()

## rb_users

nbr_ratings : Total number of ratings the user has given across beers.\
user_id : Unique identifier of the user in the RateBeer database.\
user_name : Pseudonym or user name.\
joined : Date the user signed up on RateBeer, in DD/MM/YYYY format.\
location : Geographic location of the user, including the country and sometimes the region or state.
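
Because `joined` is stored as DD/MM/YYYY text after loading, it sorts alphabetically rather than chronologically. The sketch below shows the conversion from Unix seconds to that text and back to a datetime column. The two timestamps are made-up examples.

```python
import pandas as pd

joined = pd.Series([1220868000, 1348394400])      # toy Unix timestamps, in seconds
as_datetime = pd.to_datetime(joined, unit="s")    # datetime64 values, sortable
as_text = as_datetime.dt.strftime("%d/%m/%Y")     # display strings, as in the loading cells
round_trip = pd.to_datetime(as_text, format="%d/%m/%Y")  # back to datetime (time of day is lost)

print(as_text.tolist())
print(round_trip.dt.year.tolist())
```

Keeping a datetime column alongside the formatted one makes later grouping by year or month straightforward.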

In [None]:
display(rb_users)

## rb_reviews

In [None]:
rb_reviews