# Lab 10: Sentiment Analysis for Beer Reviews (2/2)

<br>

[<img width=500 align="center" margin="20" src="wc_feel.jpg">](http://www.recommend.beer/analysis/)


In this lab we will continue working with beer reviews for sentiment analysis. We will build better classifiers by combining document embeddings with methods like logistic regression and k-NN classification.

Document embeddings are representations of documents as vectors, and are analogous to word embeddings. They can be constructed from word embeddings by composing them, or by building them in a similar manner to word embeddings.

Classifiers are models that take as input a set of features and output a discrete label (a class like 'positive', 'negative', 'neutral'). The input features we will use are the document embeddings of the review. Whereas we previously used rule-based methods based on small lists of words to construct the sentiment classifier, now we will use machine learning methods to approximate a relationship between reviews and their sentiment.

We will explore three ways of building document embeddings in this lab:
- Document embeddings by taking the mean of word embeddings
- Document embeddings by taking a weighted sum of word embeddings
- Document embeddings by using Doc2Vec, an algorithm that is similar in spirit to Word2Vec

As always, load the usual modules.

In [None]:
from datascience import *
import numpy as np
import re
import gensim

from collections import Counter

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)
logging.root.level = logging.CRITICAL 

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

# direct plots to appear within the cell, and set their style
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

## Load and Split Dataset

As before, load the `csv` file that contains our beer reviews.

In [None]:
# load file and build table
filename = "ratings.csv"
data = Table.read_table(filename)
sample_size = data.num_rows
data.show(2)

As before, transform the review scores from strings like `"4/5"` into integers like `4`.

In [None]:
# transform columns with scoring into ints
review_cols = ["review_appearance", "review_aroma", "review_overall", "review_palate", "review_taste"]

def transform_int(score):
    return int(re.match(r'([0-9]*)\/', score)[1])

for col in review_cols:
    review_score = data.apply(transform_int, col)
    data = data.drop(col)
    data = data.with_column(col, review_score)

data.show(2)

Now, label the reviews into 3 classes: positive, neutral and negative. 

Previously, we sorted the reviews by increasing overall score and labelled the first third negative, the next third neutral, and the last third positive. The problem with this is that two reviews with the same score, say 15, may end up in different classes: one labelled neutral and the other positive.

A better approach is to find the overall review scores at the 33rd and 66th percentiles, and use those scores as our class thresholds.
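To see why score-based thresholds avoid splitting ties, here is a minimal sketch on made-up scores. The array `toy_scores` and the other `toy_`/`labels_by_` names are hypothetical and not taken from `ratings.csv`. It compares cutting the sorted list into equal thirds by position with cutting at the 33rd and 66th percentile scores.

```python
import numpy as np

# hypothetical overall scores, already sorted in increasing order
toy_scores = np.array([8, 10, 12, 12, 12, 14, 15, 15, 16])

# positional split: first third negative (1), next third neutral (2), last third positive (3)
n = len(toy_scores)
labels_by_position = np.repeat([1, 2, 3], n // 3)

# threshold split: every review with the same score gets the same label
thresh_neg = np.percentile(toy_scores, q=33)
thresh_neu = np.percentile(toy_scores, q=66)
labels_by_threshold = np.where(toy_scores <= thresh_neg, 1,
                               np.where(toy_scores <= thresh_neu, 2, 3))

# print score, positional label, threshold label side by side
for score, lab_pos, lab_thr in zip(toy_scores, labels_by_position, labels_by_threshold):
    print(score, lab_pos, lab_thr)
```

In this toy data the three reviews scored 12 are split across two classes by the positional rule, but share one class under the threshold rule. The trade-off is that the classes are no longer guaranteed to be the same size, which is worth checking on the real data.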

In [None]:
# sort by increasing review overall score
data = data.sort("review_overall")

# label class
c2i = {"negative": 1,
       "neutral": 2,
       "positive": 3}

# find score associated with tertiles
thresh_neg = np.percentile(data.column("review_overall"), q=33)
thresh_neu = np.percentile(data.column("review_overall"), q=66)

print('negative threshold: %.0f\nneutral threshold: %.0f\n' % (thresh_neg, thresh_neu))

# label classes
def label_class(score, thresh_neg, thresh_neu):
    if score <= thresh_neg:
        return c2i["negative"]
    elif score <= thresh_neu:
        return c2i["neutral"]
    else:
        return c2i["positive"]
scores = data.column("review_overall")
labels = [label_class(score, thresh_neg, thresh_neu) for score in scores] 

# add to data
data = data.with_column("class", labels)
data.show(2)

Let's take a look at the distribution of the resulting classes.

In [None]:
plt.hist(labels)
plt.show()

In the following sections, we again develop different models to score the sentiment of a beer review. As before, split the dataset into a training set (80%) and test set (20%). Recall that the training set is used to develop the model, while the test set is used to evaluate it. Evaluating on reviews the model has never seen gives a more reliable estimate of how well it generalizes.

In [None]:
# split to training and test set
from sklearn.model_selection import train_test_split
seed = 123

train, test = train_test_split(data.to_df(), test_size=0.20, random_state=seed)
train = Table.from_df(train)
test = Table.from_df(test)

x_train = train.column("review_text")
y_train = train.column("class")
scores_train = train.column("review_overall")

x_test = test.column("review_text")
y_test = test.column("class")
scores_test = test.column("review_overall")

## Load GloVe Word Embeddings

Let's load the 100-dimensional GloVe embeddings we worked with in labs 08 and 09. Recall that these are trained using word co-occurrence counts from about 6 billion tokens of text drawn from Wikipedia and the Gigaword news corpus.

In [None]:
import gensim
import gensim.downloader as gdl
from gensim.models import KeyedVectors

glove = gdl.load("glove-wiki-gigaword-100")

## Utility Functions

Let's define the functions we've been using to preprocess documents. Recall that we also lemmatize words, treating them as verbs, to reduce them to their root form.
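As a rough illustration of what the punctuation-stripping and tokenizing steps do to a single review, here is a minimal sketch. The function `toy_tokenize` and the example sentence are hypothetical, and lemmatization is left out so that the sketch runs without NLTK.

```python
import re

def toy_tokenize(doc):
    # drop every character that is not a letter, digit, underscore or whitespace
    doc = re.sub(r'[^\w\s]', '', doc)
    # lowercase, trim, and split on single spaces (double spaces yield empty strings)
    tokens = doc.lower().strip().split(' ')
    # drop empty strings and tokens that contain a digit
    return [tok for tok in tokens if tok and not any(ch.isdigit() for ch in tok)]

tokens = toy_tokenize("Pours a hazy gold,  with 2 fingers of head. The brewer's best!")
print(tokens)
```

Note that in this sketch the apostrophe is stripped along with the rest of the punctuation, so `brewer's` comes out as `brewers`.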

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def is_numeric(string):
    return any(char.isdigit() for char in string)

def has_poss_contr(string):
    return '\'s' in string

def empty_string(string):
    return not string

def remove_string(string):
    return is_numeric(string) or has_poss_contr(string) or empty_string(string)

def preprocess_data(docs):
    docs = [re.sub(r'[^\w\s]', '', doc) for doc in docs]
    docs_tok = [doc.lower().strip().split(' ') for doc in docs]
    docs_tok = [[tok for tok in doc if not remove_string(tok)] for doc in docs_tok]
    docs_tok = [[lemmatizer.lemmatize(tok, pos='v') for tok in doc] for doc in docs_tok]
    return docs_tok

We now define a couple more utility functions. The first filters a list of `tokens` and only keeps those that are in `vocab`. This helps us deal with tokens that may be out of GloVe's vocabulary by removing them. The second (from the previous lab) uses a model to predict the classes of reviews in `features`, and calculates the accuracy against `labels`.
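Both ideas are small enough to try by hand first. The sketch below uses a made-up vocabulary, tokens, labels and predictions (all `toy_` names are hypothetical) to show vocabulary filtering and the accuracy calculation, that is, the fraction of predictions that match the true labels.

```python
import numpy as np

# hypothetical vocabulary and tokens; "zzyzx" stands in for an out-of-vocabulary word
toy_vocab = {"hoppy", "bitter", "smooth"}
toy_tokens = ["hoppy", "zzyzx", "smooth"]
kept = [tok for tok in toy_tokens if tok in toy_vocab]

# accuracy = fraction of predictions that equal the true label
toy_labels = np.array([1, 2, 3, 3])
toy_preds = np.array([1, 3, 3, 3])
acc = np.mean(np.equal(toy_labels, toy_preds))

print(kept, acc)
```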

In [None]:
def filter_not_in_vocab(tokens, vocab):
    return [token for token in tokens if token in vocab]

def evaluate(model, features, labels, split=None):
    pred_class = model.predict(features)
    acc = np.mean(np.equal(labels, pred_class))
    print("Classification accuracy (%s): %f" % (split, acc))
    return pred_class

## 1) Document Embeddings by Taking the Mean of Word Embeddings

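One simple approach to obtaining document embeddings is to average the word embeddings of all the tokens in the document. Writing $\phi(w)$ for the embedding of word $w$ and $|D|$ for the number of tokens in document $D$, the document embedding $a$ is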

$$
    a = \frac{1}{|D|}\sum_{w \in D} \phi(w)
$$

The result will be stored in a matrix, a data structure that stores an array of arrays.

$$
    \begin{pmatrix}
        a_1 & a_2 & \dots & a_d \\
        b_1 & b_2 & \dots & b_d \\
        \vdots & \vdots & \ddots & \vdots \\
        c_1 & c_2 & \dots & c_d \\
    \end{pmatrix}
$$

Each row in this matrix corresponds to a document embedding (a list of numbers `a = [a_1,a_2,...,a_d]`). Each embedding will, in our case, be of size $d=100$ since we average over 100-dimensional GloVe vectors. This will be the vector representation of a review. So, the matrix has $n$ rows and $d = 100$ columns, where $n$ is the number of documents.
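The shape of this matrix is easy to check on a toy case. The sketch below uses a hypothetical dictionary `toy_emb` of 3-dimensional vectors standing in for the 100-dimensional GloVe vectors, and three made-up tokenized documents.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings" standing in for GloVe vectors
toy_emb = {
    "crisp": np.array([1.0, 0.0, 2.0]),
    "hoppy": np.array([3.0, 2.0, 0.0]),
    "flat":  np.array([-1.0, 4.0, 1.0]),
}

toy_docs = [["crisp", "hoppy"], ["flat"], ["crisp", "hoppy", "flat"]]

# one mean vector per document, stacked as the rows of a matrix
doc_matrix = np.vstack([np.mean([toy_emb[w] for w in doc], axis=0) for doc in toy_docs])

print(doc_matrix.shape)  # (3, 3): n = 3 documents, d = 3 dimensions
print(doc_matrix[0])     # [2. 1. 1.]
```

A document whose tokens are all out of vocabulary would leave nothing to average, so such reviews are worth guarding against on real data.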

In [None]:
def get_doc_emb_avg(docs, emb_model):
    
    # vocab: GloVe is a KeyedVectors object, while a Word2Vec model keeps its vectors in .wv
    if emb_model is glove:
        vocab = glove.vocab
    elif emb_model is w2v:
        vocab = w2v.wv.vocab
    
    # preprocess data and filter tokens not in the chosen model's vocab
    docs_tok = preprocess_data(docs)
    docs_tok = [filter_not_in_vocab(doc, vocab) for doc in docs_tok]
    
    # function to average vectors
    def average_vectors(tokens):
        vectors = []
        for token in tokens:
            vectors.append(emb_model[token])
        return np.mean(vectors, axis=0)
    
    # get a list of document vectors
    docs_emb = [average_vectors(tokens) for tokens in docs_tok]
    
    # stack document vectors into matrix
    docs_emb = np.vstack(docs_emb)
    
    return docs_emb

Below, we compute the document embeddings for `x_train` and `x_test` respectively. We also scale the embeddings so that they have mean 0 and variance 1; this is also called "standardization." It's a typical step in many machine learning methods so they will behave well, and we've also seen it for many statistical procedures in YData.
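As a minimal sketch of what standardization does, assuming made-up feature matrices `toy_train` and `toy_test`, the column means and standard deviations are estimated from the training rows only and then reused for the test rows. This mirrors calling `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

# hypothetical features: 3 training rows and 1 test row, 2 columns each
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# statistics come from the training rows only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)

# apply the same shift and scale to both sets
toy_train_std = (toy_train - mu) / sigma
toy_test_std = (toy_test - mu) / sigma

print(toy_train_std.mean(axis=0), toy_train_std.std(axis=0))
print(toy_test_std)
```

After scaling, each training column has mean 0 and variance 1. The test columns generally will not, because they are scaled with the training statistics.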


In [None]:
# get document embeddings 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

docs_emb_train = get_doc_emb_avg(x_train, glove)
docs_emb_test = get_doc_emb_avg(x_test, glove)

# scale the document embeddings 
x_train_feat = scaler.fit_transform(docs_emb_train)
x_test_feat = scaler.transform(docs_emb_test)

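Next, we train our classifier. The first classifier we will consider is logistic regression. In the simplest (binary) case, it predicts between two classes by modeling the log odds of the probability $\hat{p}$ of one class as a linear function of the features: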

$$
\text{logit}(\hat{p}) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_{100} x_{100}
$$

In this case, we learn the coefficients $\beta$, just as we would for linear regression. Taking the weighted sum of components of the document embedding, weighting component $x_j$ by $\beta_j$, gives the log odds of the document belonging to a class.

In our case, we predict among 3 possible classes, so we have to extend binary logistic regression to multinomial logistic regression.
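For reference, here is one standard way to write both models. The notation $\beta_{k,j}$ for class-specific coefficients is introduced here for illustration and is not from the lab. The logit is the log odds, and the multinomial model keeps one set of coefficients per class $k \in \{1, 2, 3\}$, turning the three linear scores into probabilities with a softmax:

$$
\text{logit}(p) = \log\left(\frac{p}{1-p}\right), \qquad
P(y = k \mid x) = \frac{\exp\left(\beta_{k,0} + \sum_{j=1}^{100} \beta_{k,j} x_j\right)}{\sum_{c=1}^{3} \exp\left(\beta_{c,0} + \sum_{j=1}^{100} \beta_{c,j} x_j\right)}
$$

The predicted class is the $k$ with the largest probability.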

In [None]:
# train multinomial logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, multi_class='multinomial', 
                                solver='sag', max_iter=100)
model = clf.fit(x_train_feat, y_train)

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

The logistic regression gives us 0.555 accuracy on the test set. Not bad!

We also try a k-NN classifier.

<img width=350 src="http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/Knn_k1_z96jba.png">

The idea of the k-NN classifier is to find the nearest k neighbors to a document in the vector space. The majority label among these k neighbors becomes the label of that document. We'll start with 1 nearest neighbor.
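The voting rule can be sketched in a few lines. This is a toy version under stated assumptions, not scikit-learn's implementation: the function `toy_knn_predict` and the data are hypothetical, distance is Euclidean, and ties are broken by whichever label `Counter` returns first.

```python
import numpy as np
from collections import Counter

def toy_knn_predict(x_query, x_train, y_train, k):
    # Euclidean distance from the query to every training point
    dists = np.linalg.norm(x_train - x_query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# hypothetical 2-dimensional "document embeddings" with class labels
toy_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
toy_y = np.array([1, 1, 3, 3, 3])
query = np.array([0.9, 0.1])

pred_k1 = toy_knn_predict(query, toy_x, toy_y, k=1)
pred_k3 = toy_knn_predict(query, toy_x, toy_y, k=3)
print(pred_k1, pred_k3)
```

Here the single nearest neighbor has label 3, but two of the three nearest neighbors have label 1, so the prediction changes with $k$.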

In [None]:
# train KNN classifier
from sklearn.neighbors import KNeighborsClassifier  
clf = KNeighborsClassifier(n_neighbors=1)  
model = clf.fit(x_train_feat, y_train) 

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

The k-NN classifier doesn't perform as well as the logistic regression model. Can you suggest any reasons for this?

## Your turn! (1/2)

### Evaluating different k

Currently, we set $k=1$ so we effectively have a nearest neighbor classifier. How would performance change as we vary $k$?

Plot the training accuracy and test accuracy of a k-NN classifier against $k$. You want two lines in your plot. You can use the following outline to get started.

In [None]:
# define arrays for storing accuracies
...

# define how we vary neighbours; don't change this!
neighbors = np.arange(1, 50, 2)
for k in neighbors:
    
    # train a KNN classifier with param k
    ...

    # Note that this gives you the accuracy:
    # 100*sum(y_train==preds_train)/len(preds_train)

    # evaluate model on training set
    ...
    
    # evaluate model on test set
    ...
    
# plot both relationships on a graph
# x axis: neighbours
# y axis: training accuracy, test accuracy
...


## 2) Document Embeddings using Weighted Sum of Word Vectors

Another way to obtain document embeddings is to take a *weighted* sum of word embeddings. Intuitively, some words will be more important than others; this is why we selected lexicons of positive and negative words in the last lab. But how should we weight the word embeddings?

We can use tf-idf as weights. tf stands for *term frequency*, and idf stands for *inverse document frequency*. This is a statistical measure of how important a word is to a particular document among a large corpus of documents. 

Intuitively, the tf term reflects the idea that the importance of a word increases if it appears in a document more times. 

$$
    \text{tf}_D(w) = \frac{\text{# appearances of w in document D}}{\text{# words in D}}
$$

The idf term reflects the idea that the importance of a word decreases if it appears across many documents; e.g., `the` will appear in almost all documents, and so it's not informative.

$$
    \text{idf}(w) = \log\left(\frac{\text{# documents}}{\text{# documents that contain w}}\right)
$$

The tf-idf weight of a term is then just the product of tf and idf:

$$
    \text{tf-idf}_D(w) = \text{tf}_D(w) \times \text{idf}(w)
$$
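To make the three formulas concrete, here is a minimal sketch that computes them by hand on a made-up corpus of three tiny tokenized reviews. The names `toy_docs`, `tf`, `idf` and `tfidf` are hypothetical helpers written for this illustration, using the natural logarithm.

```python
import math
from collections import Counter

# hypothetical tokenized corpus of three tiny reviews
toy_docs = [["the", "hoppy", "hoppy", "finish"],
            ["the", "flat", "finish"],
            ["the", "sweet", "malt"]]

def tf(word, doc):
    # share of the document's tokens that are this word
    return Counter(doc)[word] / len(doc)

def idf(word, docs):
    # assumes the word appears in at least one document
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

for word in ["the", "hoppy", "finish"]:
    print(word, round(tfidf(word, toy_docs[0], toy_docs), 3))
```

In the first toy review, `the` gets weight 0 because it appears in every document, while `hoppy` gets the largest weight because it is frequent in this review and absent from the others. Library implementations such as gensim's `TfidfModel` apply their own defaults (for example, a different logarithm base and length normalization), so their weights need not match these numbers exactly.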

We use the tf-idf weight of each word $w$ to form a weighted sum of the GloVe embedding vectors, which gives the document embedding:

$$
    a = \sum_{w \in D} \text{tf-idf}_D(w) \times \phi(w)
$$

As before, we store document embeddings in a matrix.
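Computing the weighted sum for every document at once amounts to a single matrix product: a documents-by-words matrix of weights times a words-by-dimensions matrix of embeddings. The sketch below uses made-up weights and 2-dimensional embeddings; all `toy_` names are hypothetical.

```python
import numpy as np

# hypothetical vocabulary of 3 words with 2-dimensional embeddings (one row per word)
toy_word_vecs = np.array([[1.0, 0.0],   # "hoppy"
                          [0.0, 1.0],   # "finish"
                          [2.0, 2.0]])  # "malt"

# hypothetical tf-idf weights: one row per document, one column per word
toy_weights = np.array([[0.5, 0.1, 0.0],
                        [0.0, 0.2, 0.3]])

# row i is the sum over words of weight(i, w) * embedding(w)
toy_doc_emb = np.dot(toy_weights, toy_word_vecs)
print(toy_doc_emb)
```

The result has one row per document and one column per embedding dimension, the same layout as the matrix of mean embeddings.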

In [None]:
from gensim.corpora import Dictionary
from gensim.models.tfidfmodel import TfidfModel
from gensim.matutils import sparse2full
    
def get_doc_emb_tfidf(docs, emb_model):
    
    # vocab: GloVe is a KeyedVectors object, while a Word2Vec model keeps its vectors in .wv
    if emb_model is glove:
        vocab = glove.vocab
    elif emb_model is w2v:
        vocab = w2v.wv.vocab
    
    # build dictionary (vocabulary) over the documents passed in
    docs_tok = preprocess_data(docs)
    docs_tok = [filter_not_in_vocab(doc, vocab) for doc in docs_tok]
    docs_dict = Dictionary(docs_tok)
    docs_dict.filter_extremes(no_below=35, no_above=0.08)
    docs_dict.compactify()

    # build tfidf matrix for tokens in dictionary
    docs_corpus = [docs_dict.doc2bow(doc) for doc in docs_tok]
    model_tfidf = TfidfModel(docs_corpus, id2word=docs_dict)
    docs_tfidf  = model_tfidf[docs_corpus]
    docs_vecs   = np.vstack([sparse2full(c, len(docs_dict)) for c in docs_tfidf])

    # build matrix of glove vectors for each token
    tfidf_emb_vecs = np.vstack([emb_model[docs_dict[i]] for i in range(len(docs_dict))])
    
    # build document vectors, weighted by tfidf
    docs_emb = np.dot(docs_vecs, tfidf_emb_vecs) 
    return docs_emb

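We compute the embeddings over the training and test reviews together, so that both share one dictionary and one set of tf-idf weights, and then scale them as before.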

In [None]:
# get document embeddings 
from sklearn.preprocessing import StandardScaler
x_full = np.concatenate((x_train, x_test), axis=0)
docs_emb = get_doc_emb_tfidf(x_full, glove)

# we have to split into the embeddings for train docs and for test docs
docs_emb_train = docs_emb[:train.num_rows,]
docs_emb_test = docs_emb[train.num_rows:,] 

# scale embeddings
x_train_feat = scaler.fit_transform(docs_emb_train)
x_test_feat = scaler.transform(docs_emb_test)

Finally, we train new classifiers using logistic regression and k-NN classification.

In [None]:
# train multinomial logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, multi_class='multinomial', 
                         solver='sag', max_iter=100)
model = clf.fit(x_train_feat, y_train)

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

What can we say about the accuracy compared with the previous logistic regression model? Can you think of any possible explanation for this? How might we try to improve the model?


Next we'll train the k-NN classifier.

In [None]:
# train KNN classifier
from sklearn.neighbors import KNeighborsClassifier  
clf = KNeighborsClassifier(n_neighbors=7)  
model = clf.fit(x_train_feat, y_train) 

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

## 3) Document Embeddings using Doc2Vec

The last method we will consider for document embeddings is called Doc2Vec. The details are not necessary here, but the algorithm is similar to Word2Vec. Training takes a while, so we simply load a previously saved model; the training code is kept, commented out, for reference.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
x_full = np.concatenate((x_train, x_test), axis=0)

docs = preprocess_data(x_full)
tagged_docs = [TaggedDocument(words=d, tags=[str(i)]) for i, d in enumerate(docs)]

In [None]:
# # train d2v model
# max_epochs = 100
# vec_size = 100
# alpha = 0.025

# d2v = Doc2Vec(size=vec_size,
#               alpha=alpha, 
#               min_alpha=0.00025,
#               min_count=1,
#               dm=1)
  
# d2v.build_vocab(tagged_docs)

# for epoch in range(max_epochs):
#     if epoch % 10 == 0:
#         print('iteration {0}'.format(epoch))
#     d2v.train(tagged_docs,
#                 total_examples=d2v.corpus_count,
#                 epochs=d2v.iter)
#     # decrease the learning rate
#     d2v.alpha -= 0.0002
#     # fix the learning rate, no decay
#     d2v.min_alpha = d2v.alpha

# d2v.save("100-d2v.model")

In [None]:
d2v = Doc2Vec.load("100-d2v.model")

As before, scale the embeddings.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# build the matrix
docs_emb = np.vstack([d2v.docvecs[str(i)] for i in range(data.num_rows)])

# we have to split into the embeddings for train docs and for test docs
docs_emb_train = docs_emb[:train.num_rows,]
docs_emb_test = docs_emb[train.num_rows:,] 

# scale embeddings
x_train_feat = scaler.fit_transform(docs_emb_train)
x_test_feat = scaler.transform(docs_emb_test)

Train a logistic regression model...

In [None]:
# train multinomial logistic regression
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, multi_class='multinomial', 
                         solver='sag', max_iter=100)
model = clf.fit(x_train_feat, y_train)

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

The model performs well, with 0.591 accuracy on the test set. 

Train a KNN classifier...

In [None]:
# train KNN classifier
from sklearn.neighbors import KNeighborsClassifier  
clf = KNeighborsClassifier(n_neighbors=1)  
model = clf.fit(x_train_feat, y_train) 

# evaluate model on train set
preds_train = evaluate(model, x_train_feat, y_train, split="train")

# evaluate model on test set
preds_test = evaluate(model, x_test_feat, y_test, split="test")

# from sklearn.metrics import classification_report, confusion_matrix  
# print(confusion_matrix(y_test, preds_test))  
# print(classification_report(y_test, preds_test))  

This seems to do poorly once again.

## Your turn! (2/2)

### Word2Vec embeddings

Currently, we use GloVe word embeddings to build document embeddings. How might the performance change if we use Word2Vec embeddings constructed from the training set instead?

First, train word embeddings on `x_train` by filling in the following chunk.

In [None]:
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', \
                    level=logging.INFO)

... = preprocess_data(...)
w2v = word2vec.Word2Vec(..., size=100, window=10, iter=10, min_count=10)


Second, get document embeddings via the first method (mean of word embeddings). Then, scale the document embeddings. Your code for the following chunks should look like what has been given above.

In [None]:
# get document embeddings

# scale document embeddings

Third, train a logistic regression model, and evaluate its performance.

In [None]:
# train and evaluate logistic regression model

Fourth, train a KNN classifier model, and evaluate its performance.

In [None]:
# train and evaluate KNN classifier model

Discuss the performance of the models. How do they compare with the GloVe embeddings?