In [3]:
from importlib import reload
import utils_functions
reload(utils_functions)

<module 'utils_functions' from '/Users/gustavelapierre/Documents/EPFL/Ada/ada-2024-project-abracadabra/utils_functions.py'>

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import numpy as np
import os
from pathlib import Path
from utils_functions import classify_user_rating_level, plot_category_distrib, PCA_plot, compute_variance_per_attribute, recompute_grade

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from google.cloud.language_v2 import types
from google.oauth2 import service_account
from google.cloud import language_v2
from tqdm.auto import tqdm

# Part 1 Understanding and treating the data

## Part 1.1 Data conversion

The first step in our analysis is to ensure that all data is represented consistently across the project. This involves converting the original **.txt** files containing ratings and reviews into **.csv** format. Each file was examined carefully, and its strings were parsed into dictionaries whose keys are the data fields. More details on the conversion process and methodology can be found in **data/TxtToCsv.ipynb**.

The CSV files can be found at the following link: https://drive.google.com/drive/folders/1lcRRxlPpcyAcqJzanlwcyb5Vmip0s7_D?usp=sharing
(You will need to ask for permission to see the files).

## Part 1.2 Data exploration

With the data now in a consistent format, we explore the datasets in more depth to understand their links and features. In particular, we learn that some breweries, beers and users match between the two websites. Moreover, there might be duplicates within the datasets, with some users having multiple accounts. Breweries also have duplicates: for example, a single brewery in RateBeer can correspond to up to 3 breweries in BeerAdvocate. This notebook also looks at the percentage of NaN values in the ratings and describes the different variables. Further explanations can be found in **data/data_understanding.ipynb**.

Furthermore, another Jupyter notebook explores NaN values in the dataset. It is mainly a secondary file used to evaluate the percentage of NaN values in the columns of a dataframe. In it we also look at the min/max values of the different grades, where we noticed that the two datasets do not necessarily use the same range for their grades. More information can be found in **data/data_cleaning.ipynb**.

Some rows contain NaN values in the datasets. Since certain parts of the analysis do not require every feature, we filter out missing data according to the requirements of each analysis section.

## Part 1.3 Dataset merging
To enhance the robustness of the analysis, we merge data from both RateBeer and BeerAdvocate. This approach increases the number of ratings per beer, which improves reliability and strengthens the controversiality analysis.

The aim is to have a general dataset for users, beers, breweries and ratings.
Each of the users, breweries and ratings datasets contains a new id and, except for ratings, the old ids from both source datasets. Each also contains the name, location and other information. When there is a match, we usually choose which dataset to take the information from: for example, since two breweries in BeerAdvocate can be a single one in RateBeer, we decided to take the name of the brewery from RateBeer. Matched entries are merged into a single one.

For ratings, we keep every rating from both datasets, including those that are matched with each other. We decided to keep both because we noticed that matched ratings may differ in their grades and textual description. We added columns corresponding to the new beer, user and brewery ids. Certain attributes were deleted, as we assumed that we could recompute them or, if needed, reload the original files. Finally, we gave each rating an id, recorded its dataset of origin as 'rb' or 'ad', and added a column called matched which, when filled with a number, contains the id of the rating it is matched with.

Further information can be found in **data/1dataset_matt.ipynb**. The transformed data can be found at the following link:
https://drive.google.com/drive/folders/1McQ7BU24mEsEqouulOPqrmtQJ47E6ZP8?usp=sharing
(You will need to ask for permission).

**Data Loading**

The next cell loads the different datasets, which are used throughout the rest of the notebook.

In [5]:
root = Path(os.getcwd()).parent

# Adjust this to wherever your data is stored; here the Dataset folder contains all three data folders
dataset_path = os.path.join(root,'Dataset')

FULL = "full"
FULL_PATH = os.path.join(dataset_path,FULL)

breweries_df = pd.read_csv(os.path.join(FULL_PATH, 'breweries.csv'))
beers_df = pd.read_csv(os.path.join(FULL_PATH, 'beers.csv'))
users_df = pd.read_csv(os.path.join(FULL_PATH, 'users.csv'))
ratings_df = pd.read_csv(os.path.join(FULL_PATH,'ratings.csv'))

FileNotFoundError: [Errno 2] No such file or directory: '/Users/gustavelapierre/Documents/EPFL/Ada/Dataset/full/breweries.csv'

**Grade transformation**

We noticed in **data_cleaning.ipynb** that the grades do not all use the same range. We decided to rescale every grade to lie between 1 and 5. For example, if an attribute is graded between 1 and 20 and a grade is 16/20, it becomes (16-1)/19*4+1 ≈ 4.16 and not 4. Here 1 is the minimum value of the attribute, 19 is the span of the original range, and 4 is the span of the target range. We chose the range 1 to 5 because most attributes are already graded on it.
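The linear rescaling described above can be sketched as follows. This is only an illustration: `rescale_grade` is a hypothetical helper, not the project's `recompute_grade`, whose actual implementation lives in `utils_functions.py` and may differ.

```python
def rescale_grade(value, old_min, old_max, new_min=1.0, new_max=5.0):
    """Linearly map a grade from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("old_min and old_max must differ")
    # Position of the grade inside its original range, between 0 and 1
    fraction = (value - old_min) / (old_max - old_min)
    # Stretch that position over the target range
    return fraction * (new_max - new_min) + new_min


# A grade of 16 on a scale from 1 to 20
print(round(rescale_grade(16, 1, 20), 2))  # 4.16
```

The endpoints of the original range map exactly onto 1 and 5, so grades already on the 1 to 5 scale are left unchanged.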

In [None]:
ratings_df = recompute_grade(ratings_df)

# Part 2 Exploring the definition of controversiality

This part aims to determine and label which beers are controversial and which are universal. To do this, we explore different aspects defining the controversiality of a beer. 

What does controversial mean? "Giving rise or likely to give rise to controversy or public disagreement". As this definition suggests, controversiality depends on people's opinions. As a result, this analysis relies only on the fields that users can fill in, namely the different ratings (appearance, aroma, palate, taste, overall) and the textual reviews.

## Part 2.1 Ratings and reviews filtering

As mentioned earlier, controversiality depends on disagreement in opinions. Beers with few ratings are more likely to show high variability (e.g. two opposing opinions). To ensure reliable insights and meaningful analysis, we exclude beers with fewer ratings or reviews than a specified threshold. Later, we might apply a weighting factor based on rating count to further refine the controversiality analysis, giving more importance to beers with more ratings.

The threshold deciding whether to keep a beer in the analysis is chosen arbitrarily.
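A minimal sketch of such a filter with pandas is shown here. The function name, the `beer_id` column and the toy threshold are assumptions made for illustration; the column names in the merged dataset may differ.

```python
import pandas as pd


def filter_beers_by_rating_count(ratings, min_ratings, beer_col="beer_id"):
    """Keep only the ratings of beers that have at least `min_ratings` rows."""
    # Number of ratings of each row's beer, aligned with the original index
    counts = ratings.groupby(beer_col)[beer_col].transform("size")
    return ratings[counts >= min_ratings]


# Toy data: beer 2 has a single rating and is dropped
toy = pd.DataFrame({
    "beer_id": [1, 1, 1, 2, 3, 3],
    "overall": [4.0, 2.0, 5.0, 3.0, 1.0, 4.5],
})
kept = filter_beers_by_rating_count(toy, min_ratings=2)
print(sorted(kept["beer_id"].unique().tolist()))
```

The same function could be applied to the subset of rows that have a textual review to obtain the review-based filter.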

In [None]:
# Create dataset filtering beer with too few ratings for the part 2.2
# Create dataset filtering beer with too few textual review for the part 2.3

### Part 2.1.1 Compute variance per attribute

In [None]:
ratings_df = compute_variance_per_attribute(ratings_df)

## Part 2.2 Features controversiality analysis

Controversiality can be analyzed in different ways. For now, the following three definitions are studied:

- We compute the variance of each attribute for each beer, then study which attribute seems to be the most controversial by looking at the distribution of the variances.
- We compute the variance of the **overall** rating provided by the users. We then classify a beer as controversial if this variance is above a certain threshold, and observe which of the four main attributes most influences the controversiality of the overall score.
- We count, for each beer, which attribute is the most controversial and which is the least.
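The first and third definitions can be sketched on toy data as follows. The column names (`beer_id` and the four attribute names) are assumptions, and this is not the project's `compute_variance_per_attribute`, which may be implemented differently.

```python
import pandas as pd

ATTRIBUTES = ["appearance", "aroma", "palate", "taste"]

# Toy ratings for two beers, three ratings each
toy = pd.DataFrame({
    "beer_id":    [1, 1, 1, 2, 2, 2],
    "appearance": [4.0, 4.0, 4.0, 3.0, 4.0, 5.0],
    "aroma":      [2.0, 3.0, 4.0, 3.0, 3.0, 3.0],
    "palate":     [3.0, 3.5, 4.0, 2.5, 4.0, 3.0],
    "taste":      [1.0, 3.0, 5.0, 3.0, 3.5, 4.0],
})

# Sample variance (pandas default, ddof=1) of each attribute within each beer
variances = toy.groupby("beer_id")[ATTRIBUTES].var()

# For each beer, the attribute with the highest and the lowest variance
most_controversial = variances.idxmax(axis=1)
least_controversial = variances.idxmin(axis=1)

# How often each attribute is the most controversial one across beers
print(most_controversial.value_counts())
```

The `variances` table is also what a violin or box plot per attribute would be drawn from.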



Open question: should we apply a weight as a function of the number of ratings?

### Part 2.2.1 Features analysis controversiality

In [None]:
# compute variance of each attributes on each beer
# describe it
# violin plot box plot

### Part 2.2.2 Features analysis from overall controversiality

In [None]:
# classify controversiality as a function of overall variance with threshold or %
# describe it
# box plot

### Part 2.2.3 Feature analysis from count

In [None]:
# count number of max min variance attributes

### Part 2.2.4 Correlation between the variance of the attributes

In [None]:
# Correlation of the attribute's variances with overall variance

### Part 2.2.5 PCA analysis

In [None]:
PCA_plot(ratings_df)

## Part 2.3 Analysis of the reviews


In order to better understand the meaning behind the ratings we previously looked at, we perform two types of analysis: a sentiment analysis, and a semantic similarity analysis between relevant topics and the reviews.

## Sentiment Analysis


We first perform a sentiment analysis on the reviews. This will later enable us to gain more insight when we classify the reviews by topic. The first objective is to find a reliable, multilingual model. To do so, we compare the performance of various models: a [BERT base multilingual uncased model](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment), [Google Cloud NLP](https://cloud.google.com/natural-language/docs/analyzing-sentiment?hl=fr), [GPT-4o mini](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/) and a [distilbert base multilingual cased model](https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student).

As manually labelling the 2GB of reviews would be very tedious, we instead compare the models' outputs with each other.
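One simple way to compare models without ground-truth labels is to measure how often each pair of models assigns the same label to the same review. The sketch below illustrates the idea; the function name, the model names and the labels are made up for illustration and are not real model output.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Return, for every pair of models, the fraction of reviews labelled identically."""
    agreement = {}
    for (name_a, labels_a), (name_b, labels_b) in combinations(predictions.items(), 2):
        if not labels_a or len(labels_a) != len(labels_b):
            raise ValueError("all models must label the same non-empty set of reviews")
        matches = sum(a == b for a, b in zip(labels_a, labels_b))
        agreement[(name_a, name_b)] = matches / len(labels_a)
    return agreement


# Hypothetical labels produced by three models on the same four reviews
toy_predictions = {
    "model_a": ["positive", "negative", "neutral", "positive"],
    "model_b": ["positive", "negative", "positive", "positive"],
    "model_c": ["negative", "negative", "neutral", "positive"],
}
scores = pairwise_agreement(toy_predictions)
for pair, score in scores.items():
    print(pair, score)
```

High agreement does not prove that the models are correct, only that they are consistent with each other, so a small hand-labelled sample would still be useful as a sanity check.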

## Part 2.4 Which beer is controversial then ?

### Part 2.4.1 Labelling

### Part 2.4.2 Statistical testing and validation

# Part 3 Some reasons of controversiality

This part uses the labels attributed to the beers. It looks for patterns in, and reasons for, controversial opinions as a function of "constant" variables such as the abv and style of the beer, the location of the brewery and of the users, the level of expertise of the users...

## Part 3.1 Novice/Enthusiasts/Connoisseur analysis

In this part, we classify users according to how many ratings they have made.

- Novices are users with only a few ratings: 1-20.
- Enthusiasts are users with a moderate number of ratings: 21-100.
- Connoisseurs are users with a high number of ratings: 101+.
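The thresholds above can be expressed as a small function. This is a hypothetical sketch: the parameter names mirror those passed to `classify_user_rating_level`, but the project's own implementation lives in `utils_functions.py` and may differ.

```python
def rating_level(n_ratings, enthusiasts_level=21, connoisseur_level=101):
    """Map a user's number of ratings to one of the three categories."""
    if n_ratings >= connoisseur_level:
        return "connoisseur"
    if n_ratings >= enthusiasts_level:
        return "enthusiast"
    return "novice"


# Boundary values of the three ranges
for n in (1, 20, 21, 100, 101):
    print(n, rating_level(n))
```

Keeping the two thresholds as parameters makes it easy to try other cut-offs.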

It is important to note that this choice has been made arbitrarily. It could be made differently, or could be interactive for the reader of the story, enabling them to label users differently according to how many ratings they think are enough to be a connoisseur, enthusiast or novice.

Another essential thing to take into account is that these classes do not describe users as novices or connoisseurs about **beers**, but about **rating** on these particular websites.

The first step is to classify the users into the three categories mentioned above.

In [1]:
# Use the users dataframe loaded earlier (users_df) and label each user by rating count
users_df = classify_user_rating_level(users_df, enthusiasts_level=21, connoisseur_level=101)
plot_category_distrib(users_df, 'rating_user_level')

NameError: name 'classify_user_rating_level' is not defined

## Part 3.2 Style of the beer and abv

## Part 3.3 Patterns in location and ratings of local or foreign beers

In [None]:
# Under construction