This notebook gives an overview of the models we will use in the project.
Its goal is to experiment with the data and see how the models perform.

The main goal is to provide a pipeline for analyzing the reviews so that we can answer our first research question. The tasks are:
1) What is the sentiment of the review?
2) What styles of beer exist?
3) What emotions does a beer trigger?

## Sentiment analysis

In [8]:
from src.models.sentiment_analysis_model import SentimentAnalysisPipeline

sentiment_analysis = SentimentAnalysisPipeline()
print(sentiment_analysis.predict("I love this beer"))
print(sentiment_analysis.predict("I hate this beer"))

emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'POS', 'score': 0.9920633435249329}]
[{'label': 'NEG', 'score': 0.9812145233154297}]
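
To compare reviews on a single axis, it can help to collapse each prediction into one signed number. The snippet below is only a sketch of that idea. It assumes the pipeline returns a list of `{'label', 'score'}` dicts as printed above. The helper `signed_score` and the `NEU` label are hypothetical and not part of `SentimentAnalysisPipeline`.

```python
from typing import Dict, List

# Assumed mapping from label to sign. Only POS and NEG appear in the output
# above; NEU is an assumption added for completeness.
LABEL_SIGN = {"POS": 1.0, "NEG": -1.0, "NEU": 0.0}


def signed_score(prediction: List[Dict[str, object]]) -> float:
    """Collapse a [{'label': ..., 'score': ...}] prediction into a value in [-1, 1]."""
    top = prediction[0]
    return LABEL_SIGN[str(top["label"])] * float(top["score"])


# Predictions copied from the output above.
positive = signed_score([{"label": "POS", "score": 0.9920633435249329}])
negative = signed_score([{"label": "NEG", "score": 0.9812145233154297}])
print(positive, negative)
```

A positive value then means a positive review, and its magnitude reflects the model's confidence.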


## Topic analysis

In [13]:
from src.models.lda_topics_analysis_model import LDAAnalysis

# Sample corpus with two main topics: computer science and health

reviews = ["The computer network relies on secure software for data protection.",
           "A balanced diet and regular exercise are key to good health.",
           "Programming AI systems requires a deep understanding of data.",
           "Medicine and treatment plans are developed by the doctor for wellness.",
           "Internet connectivity is essential for accessing modern software tools.",
           "Exercise improves overall wellness and complements a healthy diet."
           ]

lda_analysis = LDAAnalysis(reviews)
lda_analysis.load_dataset()
lda_analysis.preprocess()
lda_analysis.train_lda()
lda_analysis.print_topics(num_words=2)
print("The output makes sense: the first topic is about software and the second is about exercise.")

Loaded dataset with 6 reviews.
starting preprocess
preprocessing completed
LDA model training completed.
[(0, '0.816*"software" + 0.184*"exercise"'),
 (1, '0.815*"exercise" + 0.185*"software"')]
The output makes sense: the first topic is about software and the second is about exercise.
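
The topic strings printed above are handy to read but awkward to use programmatically. As a rough sketch, they can be parsed into word-to-weight dictionaries with a regular expression. The helper `parse_topic` is hypothetical and assumes the `weight*"word" + ...` format shown in the output; it is not a method of `LDAAnalysis`.

```python
import re
from typing import Dict

# Matches terms such as 0.816*"software" in a printed topic string.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic: str) -> Dict[str, float]:
    """Turn '0.816*"software" + 0.184*"exercise"' into {'software': 0.816, ...}."""
    return {word: float(weight) for weight, word in TERM_PATTERN.findall(topic)}


# Topics copied from the output above.
topics = [
    (0, '0.816*"software" + 0.184*"exercise"'),
    (1, '0.815*"exercise" + 0.185*"software"'),
]

parsed = {topic_id: parse_topic(text) for topic_id, text in topics}
# Highest-weighted word per topic.
top_words = {topic_id: max(weights, key=weights.get) for topic_id, weights in parsed.items()}
print(top_words)
```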


## Emotion analysis (embeddings)

In [14]:
from src.models.emotions_analysis_model import EmotionsAnalysisPipeline

EmotionsAnalysisPipeline().analyse("This beer is very surprising, I didn't expect it to be so good.")

surprise                      : 0.3949018120765686
disgust                       : 0.16275370121002197
happiness                     : 0.11954759806394577
sadness                       : 0.11160004138946533
anger                         : 0.09775044023990631
fear                          : 0.06247026473283768
neutral                       : 0.10074402391910553
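
To summarize a review by its strongest emotions, the scores can be ranked. The sketch below assumes the scores are available as a plain dictionary, which the notebook does not confirm (`analyse` may only print them). The helper `top_emotions` is hypothetical, and leaving out `neutral` is a design choice made for this illustration.

```python
from typing import Dict, List, Tuple

# Scores copied from the output above; holding them in a dict is an assumption.
scores: Dict[str, float] = {
    "surprise": 0.3949018120765686,
    "disgust": 0.16275370121002197,
    "happiness": 0.11954759806394577,
    "sadness": 0.11160004138946533,
    "anger": 0.09775044023990631,
    "fear": 0.06247026473283768,
    "neutral": 0.10074402391910553,
}


def top_emotions(
    scores: Dict[str, float],
    k: int = 2,
    ignore: Tuple[str, ...] = ("neutral",),
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring emotions, skipping the ignored labels."""
    kept = [(name, value) for name, value in scores.items() if name not in ignore]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


ranked = top_emotions(scores)
print(ranked)
```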


In [1]:
import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Make the modules in ../data and ../utils importable.
data_path = os.path.abspath('../data')
sys.path.append(data_path)

utils_path = os.path.abspath('../utils')
sys.path.append(utils_path)

import reviews_processing
import plotting_utils

In [5]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
project_root = Path.cwd().parents[2]
beer_advocate_path = project_root / "BeerAdvocate"
reviews_path = beer_advocate_path / "reviews_df.csv"
users_path = beer_advocate_path / "users.csv"

sentiment_path = str(beer_advocate_path / "reviews2_df.pkl")

users_reviews = reviews_processing.Reviews(users_path, sentiment_path)
all_states = False

In [3]:
per_sentiment = users_reviews.posneg_sentiment_aggregation_counts(all_states)
# Keep the years 2004 through 2016 (the upper bound of np.arange is exclusive).
year_list = list(np.arange(2004, 2017, 1, dtype=int))
per_sentiment_filt = per_sentiment[per_sentiment['year'].isin(year_list)]
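
The implementation of `posneg_sentiment_aggregation_counts` is not shown here. As a rough illustration of the kind of table such a step could produce, the sketch below counts positive and negative reviews per year in plain Python and applies the same year window. The `records` list is made-up toy data, not taken from the BeerAdvocate dataset, and this is not necessarily how `reviews_processing` does it.

```python
from collections import Counter

# Hypothetical (year, label) pairs standing in for labelled reviews.
records = [
    (2003, "POS"),
    (2004, "POS"),
    (2004, "NEG"),
    (2010, "POS"),
    (2016, "NEG"),
    (2017, "POS"),
]

# Same bounds as the filter above: 2004 included, 2017 excluded.
years = range(2004, 2017)

# Count reviews per (year, label), dropping years outside the window.
counts = Counter((year, label) for year, label in records if year in years)

for (year, label), n in sorted(counts.items()):
    print(year, label, n)
```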

In [4]:
plotting_utils.plot_sentiment_posneg(per_sentiment_filt)