# Imports

In [62]:
import re
import itertools
import pickle
import pandas as pd
import numpy as np
import fasttext, nltk   # NLP library
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

RAW_DATA_FOLDER = '../data/raw/'
PROCESSED_DATA_FOLDER = '../data/processed/'
DATASET = 'BeerAdvocate/'    # Can be either 'BeerAdvocate/' or 'RateBeer/'

# Load Data

In [2]:
reviews = pd.read_pickle(PROCESSED_DATA_FOLDER + DATASET + 'reviews.pkl')
beers = pd.read_pickle(PROCESSED_DATA_FOLDER + DATASET + 'beers.pkl')

# Make sure there are no duplicates on the primary key
reviews.drop_duplicates(subset=['beer_id', 'user_id', 'date'], inplace=True)
beers.drop_duplicates(subset=['beer_id'], inplace=True)

# What are the adjectives that best describe each beer style?

To help consumers choose a beer that fits their tastes, we provide, for each main beer style, a list of adjectives that best describe it.

To determine which adjectives best describe each style, we carry out a lexical analysis of the textual reviews. For a given beer style, the most informative adjectives are those that occur most often in reviews of that style but not too often in reviews of other styles. To adjust for the fact that some adjectives appear more frequently in general (for example 'good' or 'bad'), we use a TF-IDF approach, one of the most popular term-weighting schemes.

This lexical analysis can be decomposed into the following steps:

- Step 1: Group reviews by language.
- Step 2: Group reviews by beer styles.
- Step 3: Extract adjectives from the textual reviews.
- Step 4: Compute the TF-IDF matrix, where each document is the list of adjectives for one beer style.
- Step 5: Keep the adjectives with the greatest weight in the TF-IDF matrix for each beer style.
- Step 6: Visualize the selected adjectives per beer style.

### Step 1: Group reviews by language

To group the reviews by language, we use a pre-trained language identification model provided by the fastText library. To reproduce the language classification, first install the requirements (see `requirements.txt` in the root of the repository), then download the pre-trained model (`lid.176.bin`) [here](https://fasttext.cc/docs/en/language-identification.html) and place it in the `src` folder.

In [4]:
class LanguagePredictor:

    def __init__(self):
        pretrained_lang_model = "../src/lid.176.bin"
        self.model = fasttext.load_model(pretrained_lang_model)

    def predict_lang(self, text):
        predictions = self.model.predict(text)
        language = re.sub(pattern='__label__', repl='', string=predictions[0][0])
        score = predictions[1][0]
        return language, score
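
A practical caveat, offered as a hedged note rather than as part of the original notebook: fastText's `predict` processes one line at a time and raises an error when the text contains a newline, so multi-line reviews may need their whitespace normalised first. The helper names below (`normalize_for_fasttext`, `parse_label`) are hypothetical, and the sample prediction tuple is made up to mimic the `(labels, scores)` structure indexed in `predict_lang`.

```python
import re

def normalize_for_fasttext(text):
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r'\s+', ' ', str(text)).strip()

def parse_label(label):
    """Strip the '__label__' prefix that precedes the language code."""
    return label.replace('__label__', '', 1)

raw_review = "Pours pale copper.\nThin head that quickly fades.\n\nTaste: malty."
clean_review = normalize_for_fasttext(raw_review)

# Made-up prediction with the same (labels, scores) structure as above
fake_prediction = (('__label__en',), (0.92,))
language = parse_label(fake_prediction[0][0])
score = fake_prediction[1][0]

print(clean_review)
print(language, score)
```

Under this assumption, `predict_lang` could call `normalize_for_fasttext(text)` before `self.model.predict`.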

In [5]:
model = LanguagePredictor()
# Run the predictor once per review and unpack (language, score) into two columns
reviews['language'], reviews['score_language'] = zip(*reviews.text.apply(model.predict_lang))
reviews.head()



| | beer_id | user_id | date | text | language | score_language |
|---|---|---|---|---|---|---|
| 0 | 142544 | nmann08.184925 | 2015-08-20 10:00:00 | From a bottle, pours a piss yellow color with ... | en | 0.891358 |
| 1 | 19590 | stjamesgate.163714 | 2009-02-20 11:00:00 | Pours pale copper with a thin head that quickl... | en | 0.924852 |
| 2 | 19590 | mdagnew.19527 | 2006-03-13 11:00:00 | 500ml Bottle bought from The Vintage, Antrim..... | en | 0.783006 |
| 3 | 19590 | helloloser12345.10867 | 2004-12-01 11:00:00 | Serving: 500ml brown bottlePour: Good head wit... | en | 0.852789 |
| 4 | 19590 | cypressbob.3708 | 2004-08-30 10:00:00 | 500ml bottlePours with a light, slightly hazy ... | en | 0.768192 |


In [41]:
language_count = reviews.groupby('language').text.count().sort_values(ascending=False).rename('Number of reviews')
language_count.head()

language
en    2589044
Name: Number of reviews, dtype: int64

In [42]:
nb_reviews = len(reviews)
nb_reviews_low_lang_score = len(reviews[reviews.score_language < 0.9])

print("Percentage of reviews with language score below 0.9: {:.2f}%".format(100*nb_reviews_low_lang_score/nb_reviews))

Percentage of reviews with language score below 0.9: 14.95%


Given the number of reviews in each language, the rest of the analysis is done only on English reviews, as there are not enough reviews in other languages.

In addition, only reviews with a language score (the confidence that the text belongs to the predicted language) greater than or equal to 0.9 are kept. This discards less than 15% of the reviews while ensuring that the remaining ones are in the same language with high confidence. It also removes noisy reviews (empty reviews, reviews mixing several languages, etc.).

In [43]:
LANGUAGE = 'en'
reviews = reviews[(reviews.language == LANGUAGE) & (reviews.score_language >= 0.9)]
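
To make the trade-off behind the 0.9 threshold concrete, here is a small sketch on invented data. The toy labels and scores, and the helper name `retained_share`, are assumptions for illustration only. It reports the fraction of reviews that survive the filter at several thresholds.

```python
import pandas as pd

# Toy data: language labels and confidence scores are invented for illustration
toy = pd.DataFrame({
    'language': ['en', 'en', 'en', 'fr', 'en'],
    'score_language': [0.95, 0.89, 0.99, 0.97, 0.40],
})

def retained_share(df, language, threshold):
    """Fraction of all reviews kept by the language and confidence filter."""
    mask = (df.language == language) & (df.score_language >= threshold)
    return mask.mean()

for threshold in (0.5, 0.8, 0.9):
    print(threshold, retained_share(toy, 'en', threshold))
```

Running the same helper on the real `reviews` dataframe before filtering would show how sensitive the retained share is to the chosen cut-off.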

### Step 2: Group reviews by beer styles

To group the reviews by beer style, we first join the `beers` and `reviews` relations on the `beer_id` key, select the `text` and `style` attributes, and finally group the merged dataframe by `style`.

In [56]:
merged = reviews.merge(beers, on='beer_id')[['text', 'style']]
styles = merged['style'].unique()
style_groups = merged.groupby('style')
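
The join and grouping can be illustrated on a toy example. The rows and the `toy_` names below are invented. The `validate='many_to_one'` argument is an optional safeguard that is not in the original cell: it makes pandas raise an error if `beer_id` is not unique on the beers side.

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    'beer_id': [1, 1, 2, 3],
    'text': ['hoppy and bitter', 'very hoppy', 'dark and roasted', 'orphan review'],
})
toy_beers = pd.DataFrame({
    'beer_id': [1, 2],
    'style': ['American IPA', 'Stout'],
})

# The default inner join drops reviews whose beer_id has no match in toy_beers;
# validate='many_to_one' raises if beer_id is duplicated in toy_beers.
toy_merged = toy_reviews.merge(toy_beers, on='beer_id', validate='many_to_one')[['text', 'style']]
toy_groups = toy_merged.groupby('style')

print(toy_groups.size())
print(toy_groups.get_group('American IPA').text.tolist())
```

Note that the review for `beer_id` 3 disappears, because an inner join keeps only reviews with a matching beer.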

### Step 3: Extract adjectives from the textual reviews

First, the reviews of each beer style are tokenized using the tokenizer provided by the `nltk` package. Then the part-of-speech tagger from the same package is applied to each tokenized review. The adjectives are the tokens tagged `JJ`. The adjectives used in reviews of the same beer style are then stored as one long string (the `adjectives` variable), with a space between adjectives. Since tagging the tokens takes a while, `adjectives` is saved to a pickle file for each beer style. This allows the adjective extraction to be done incrementally, and only once.
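
In the Penn Treebank tagset, `JJ` covers base-form adjectives only; comparatives and superlatives are tagged `JJR` and `JJS`. The sketch below uses hand-written `(token, tag)` pairs in place of real tagger output, and a hypothetical helper `extract_adjectives`, to show what the `JJ`-only filter keeps and what a broader filter would add.

```python
# Hand-written (token, tag) pairs in Penn Treebank notation, standing in for tagger output
tagged_review = [
    ('A', 'DT'), ('hazy', 'JJ'), ('golden', 'JJ'), ('pour', 'NN'),
    ('with', 'IN'), ('a', 'DT'), ('hoppier', 'JJR'), ('finish', 'NN'),
    ('than', 'IN'), ('the', 'DT'), ('bitterest', 'JJS'), ('IPA', 'NNP'),
]

def extract_adjectives(tagged_tokens, tags=('JJ',)):
    """Return the tokens whose part-of-speech tag is in `tags`."""
    return [word for word, tag in tagged_tokens if tag in tags]

base_only = extract_adjectives(tagged_review)
all_degrees = extract_adjectives(tagged_review, tags=('JJ', 'JJR', 'JJS'))

print(base_only)
print(all_degrees)
```

Whether to include `JJR` and `JJS` is a design choice; this notebook keeps `JJ` only.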

In [None]:
for style in styles:
    tokens = style_groups.get_group(style).text.apply(nltk.word_tokenize).tolist()
    tagged_tokens = nltk.pos_tag_sents(tokens)
    adjectives = ' '.join([word for (word, tag) in list(itertools.chain.from_iterable(tagged_tokens)) if tag == 'JJ'])
    
    # pickle needs a file opened in binary write mode
    with open(PROCESSED_DATA_FOLDER + 'adjectives/{}.pkl'.format(style), 'wb') as f:
        pickle.dump(adjectives, f)
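
If a style name contains a path separator, using it directly as a file name would point to a non-existent sub-directory. The helper below, `style_to_filename`, is a hypothetical sketch of one way to avoid this, and the style names are used only as examples. It replaces anything other than word characters, spaces, dots, hyphens and parentheses with an underscore.

```python
import re

def style_to_filename(style):
    """Replace characters that are unsafe in file names with underscores."""
    return re.sub(r'[^\w\-. ()]', '_', style) + '.pkl'

print(style_to_filename('Saison / Farmhouse Ale'))
print(style_to_filename('American IPA'))
```

The same helper would have to be used when the pickle files are read back, and two styles that differ only in replaced characters would collide.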

### Step 4: Compute the TF-IDF matrix

In this step we aim to retrieve the adjectives that are the most informative for each beer style.

As said before, for a given beer style, the most informative adjectives are those that occur most often in its reviews, but we do not want adjectives that appear frequently in every style. To adjust for the fact that some adjectives appear more frequently in general (for example 'good' or 'bad'), we use a TF-IDF approach, one of the most popular term-weighting schemes. Here the terms are the adjectives, and each document is the set of adjectives extracted from the reviews of one beer style.

From [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf):
The term frequency (tf) is the relative frequency of term $t$ within document $d$, and it is computed as follows:

$$ \operatorname{tf}(t, d)=\frac{f_{t, d}}{\sum_{t^{\prime} \in d} f_{t^{\prime}, d}} $$

where $f_{t, d}$ is the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$. Note the denominator is simply the total number of terms in document $d$ (counting each occurrence of the same term separately).

On the other hand, the inverse document frequency (idf) is a measure of how much information the word provides, i.e., whether it is common or rare across all documents, and it is computed as follows:

$$ \operatorname{idf}(t, D)=\log \frac{N}{n_t} $$

with
- $N$ : total number of documents in the corpus, $N=|D|$
- $n_t$ : number of documents where the term $t$ appears (i.e., $\operatorname{tf}(t, d) \neq 0$). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the denominator to $1+n_t$.

Then tf-idf is calculated as
$$
\operatorname{tfidf}(t, d, D)=\operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
$$
A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf-idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf-idf closer to 0.
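
As a worked illustration of these formulas, the sketch below computes tf, idf and tf-idf by hand on a toy corpus. The documents and style names are invented. The idf is written as $\log(N / n_t)$, which is the same quantity as $-\log(n_t / N)$.

```python
import math
from collections import Counter

# Toy corpus: one "document" of adjectives per (made-up) style
toy_docs = {
    'style_a': 'hoppy bitter hoppy good'.split(),
    'style_b': 'dark roasted good'.split(),
    'style_c': 'sour funky good sour'.split(),
}

def tf(term, doc):
    """Relative frequency of `term` in `doc`."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """log(N / n_t); raises ZeroDivisionError if the term is in no document."""
    n_t = sum(term in doc for doc in docs.values())
    return math.log(len(docs) / n_t)

def tfidf(term, doc_name, docs):
    return tf(term, docs[doc_name]) * idf(term, docs)

print(tfidf('hoppy', 'style_a', toy_docs))  # 0.5 * ln(3)
print(tfidf('good', 'style_a', toy_docs))   # 0.0, since 'good' is in every document
```

The adjective 'good' appears in all three documents, so its idf is $\log(3/3) = 0$ and it is filtered out regardless of how often it occurs.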

To compute the TF-IDF matrix, we use `TfidfVectorizer` from the `scikit-learn` package.
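
With its default settings, `TfidfVectorizer` does not apply the textbook formulas exactly: it uses raw term counts for tf, a smoothed idf of $\ln\frac{1+N}{1+n_t} + 1$, and then L2-normalises each row. The sketch below checks this on an invented toy corpus, so the weights in the real matrix should be read as relative scores rather than as values of the formulas above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: one string of adjectives per document
toy_corpus = ['hoppy bitter good', 'dark roasted good', 'sour funky good']

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# Document frequency of each term, in the vectorizer's column order
n_docs = len(toy_corpus)
doc_freq = (toy_matrix.toarray() > 0).sum(axis=0)

# Smoothed idf used by default (smooth_idf=True): ln((1 + N) / (1 + n_t)) + 1
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
idf_matches = np.allclose(toy_vectorizer.idf_, expected_idf)

# Each row is L2-normalised by default (norm='l2')
rows_unit_norm = np.allclose(np.linalg.norm(toy_matrix.toarray(), axis=1), 1.0)

print(idf_matches, rows_unit_norm)
```

One consequence of the `+ 1` term is that an adjective present in every document, such as 'good', keeps a small non-zero weight instead of being zeroed out.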

In [None]:
corpus = []
for style in styles:
    # pickle needs a file opened in binary read mode
    with open(PROCESSED_DATA_FOLDER + 'adjectives/{}.pkl'.format(style), 'rb') as f:
        adjectives = pickle.load(f)
    corpus.append(adjectives)

vectorizer = TfidfVectorizer()
TF_IDF = vectorizer.fit_transform(corpus)
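
The outline lists a Step 5 (keeping the highest-weighted adjectives for each style) that has no dedicated cell in this section. A minimal sketch of one way to do it is given below, assuming the weights are available as a dense dataframe with styles as rows and adjectives as columns. The weights, the 'Stout' row and the helper name `top_adjectives` are invented for illustration.

```python
import pandas as pd

# Toy TF-IDF weights (rows: styles, columns: adjectives); all values are invented
toy_tfidf = pd.DataFrame(
    [[0.70, 0.10, 0.05, 0.60],
     [0.05, 0.80, 0.65, 0.10]],
    index=['American IPA', 'Stout'],
    columns=['hoppy', 'roasted', 'dark', 'bitter'],
)

def top_adjectives(tfidf_df, style, k=2):
    """Return the k adjectives with the largest weight for one style."""
    return tfidf_df.loc[style].nlargest(k)

print(top_adjectives(toy_tfidf, 'American IPA').index.tolist())
print(top_adjectives(toy_tfidf, 'Stout').index.tolist())
```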

### Step 6: Visualize the selected adjectives per beer style

To visualize the selected adjectives per beer style, a word cloud is displayed in which the size of each word is proportional to the TF-IDF weight of the adjective. Setting `max_words=20` keeps only the 20 highest-weighted adjectives of the style, which covers Step 5.

In [None]:
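# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
feature_names = vectorizer.get_feature_names_out()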
TF_IDF = pd.DataFrame(TF_IDF.todense().tolist(), index=styles, columns=feature_names)

# Select the row by label with .loc (.iloc only accepts integer positions)
wordcloud = WordCloud(background_color="white", max_words=20).generate_from_frequencies(TF_IDF.loc['American IPA'])

plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()