# Web and Text Analytics - Sentiment Analysis

### Authors
- Francois Cubelier
- Lisa Bueres
- Romain Charles

## Introduction
This report presents our results for sentiment analysis. We experimented with different types of recurrent neural networks on this task: basic RNN, LSTM and GRU. We studied the effect of attention on these models and tested different word embeddings (GloVe, FastText and Word2Vec). Finally, document-level embeddings were also investigated (WME, Sentence BERT, GloVe average and a task-specific average).

### Sentiment Analysis
Sentiment analysis is the analysis of texts such as reviews in order to detect the polarity implied by the author. The polarity is either positive, negative or neutral.

## Methods

### RNNs

Three types of recurrent neural networks were tested: basic RNN, LSTM and GRU. We also experimented with different hyperparameters:
- The number of hidden layers: 1 or 2
- The size of the hidden state: 100 or 300
- The type of output layer after the main network: linear or MLP (for more details on the MLP, see models/RNN.py)
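
The search space above can be enumerated with a few lines of Python. This is only a sketch: it assumes that every combination of the listed values is evaluated, and the dictionary keys (`model_type`, `num_layers`, `hidden_size`, `output_layer`) are hypothetical names, not the ones used in our training scripts.

```python
from itertools import product

# Values taken from the list above; the key names are illustrative only.
MODEL_TYPES = ["RNN", "LSTM", "GRU"]
NUM_LAYERS = [1, 2]
HIDDEN_SIZES = [100, 300]
OUTPUT_LAYERS = ["linear", "MLP"]


def build_grid():
    """Return one configuration dict per combination of hyperparameters."""
    return [
        {
            "model_type": model_type,
            "num_layers": num_layers,
            "hidden_size": hidden_size,
            "output_layer": output_layer,
        }
        for model_type, num_layers, hidden_size, output_layer in product(
            MODEL_TYPES, NUM_LAYERS, HIDDEN_SIZES, OUTPUT_LAYERS
        )
    ]


grid = build_grid()
print(len(grid))  # 3 * 2 * 2 * 2 = 24 configurations
```

Under this full-grid assumption, each model type is trained with 8 different configurations.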

### Attention
We tested 2 types of attention based on scaled dot-product attention: 
- Last hidden layer attention: 
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top/\sqrt{n})V$, where $V = K$ is the output sequence and $Q$ is the last hidden state.
- Self-attention followed by last hidden layer attention.
First we apply self-attention:
$\mathrm{Attention}_1(Q, K, V) = V' = \mathrm{softmax}(QK^\top/\sqrt{n})V$, where $V = K = Q$ is the output sequence.
Then we apply the last hidden layer attention:
$\mathrm{Attention}_2(Q', K', V') = \mathrm{softmax}(Q'K'^\top/\sqrt{n})V'$, where $V' = K'$ is the output sequence of the self-attention and $Q'$ is the last hidden state of the RNN.
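
The two attention variants can be illustrated with a small NumPy sketch. It is not the code used in our models: it assumes that $n$ is the feature dimension of the keys, that the network is single-layer and unidirectional (so the last hidden state equals the last output), and it works on a single unbatched toy sequence. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(n)) v, with n the feature dimension of the keys."""
    n = k.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(n), axis=-1)
    return weights @ v


def last_hidden_attention(outputs, last_hidden):
    """Variant 1: the last hidden state queries the output sequence."""
    q = last_hidden[None, :]  # shape (1, n)
    return scaled_dot_product_attention(q, outputs, outputs)[0]


def self_then_last_hidden_attention(outputs, last_hidden):
    """Variant 2: self-attention over the outputs, then variant 1 on the result."""
    attended = scaled_dot_product_attention(outputs, outputs, outputs)  # (T, n)
    return last_hidden_attention(attended, last_hidden)


rng = np.random.default_rng(0)
outputs = rng.normal(size=(5, 8))  # toy sequence: T=5 steps, n=8 features
last_hidden = outputs[-1]          # single-layer, unidirectional assumption
context_1 = last_hidden_attention(outputs, last_hidden)
context_2 = self_then_last_hidden_attention(outputs, last_hidden)
print(context_1.shape, context_2.shape)
```

In both variants the result is a single vector of size $n$ that summarises the sequence and can be passed to the output layer.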


### Document level embeddings
There are numerous document level embeddings in the literature.
We tested 4 methods:
- Average GloVe embeddings: a sentence is encoded by the average of its GloVe word embeddings.
- Task specific average: a sentence is encoded by the average of its word embeddings, and the word embeddings are trained from scratch for this task.
- WME: a document embedding based on the distance (derived from the Word Mover's Distance) between the document and randomly generated documents.
- Sentence BERT: a pre-trained document level embedding based on BERT (all-MiniLM-L6-v2 model).

In order to predict the class from the document embedding, 2 types of model were tested: linear model and MLP.
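
The averaging approach used by the first two methods can be sketched as follows. The toy vectors below are made up and only stand in for real word embeddings; whitespace tokenisation, skipping out-of-vocabulary words and the zero-vector fallback are assumptions of this sketch, not a description of our preprocessing.

```python
import numpy as np

# Toy 3-dimensional word vectors standing in for pre-trained GloVe embeddings.
# The words and values are made up for illustration.
TOY_VECTORS = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
    "bad": np.array([-1.0, 0.0, -2.0]),
}
EMBEDDING_DIM = 3


def average_embedding(sentence, vectors=TOY_VECTORS, dim=EMBEDDING_DIM):
    """Encode a sentence as the mean of the vectors of its known words."""
    tokens = sentence.lower().split()
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No known word: fall back to the zero vector (an arbitrary choice).
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vector = average_embedding("Good movie")
print(doc_vector)
```

The resulting fixed-size vector is what the linear model or the MLP receives as input.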

## Experiments

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Experiment 1: Hyperparameters
In this experiment, we compare RNN, LSTM and GRU on the IMDB dataset and examine how the results vary with the hyperparameters.

List hparams and explain.

In [None]:
df_1 = pd.read_csv("results/exp1_all.csv")

### Overall:
As can be seen by computing the mean of the test accuracy and the test loss, GRU and LSTM outperform the basic RNN. GRU and LSTM have very close mean accuracies, but LSTM reaches a lower loss on average. The RNN scores around 30% worse than the others on both the accuracy and the loss, which is a non-negligible difference.

Both GRU and LSTM use gated memory mechanisms, while the simple RNN does not. This allows the first two to retain more information than the RNN when performing the classification.

In [None]:
# Select the metric columns before averaging so that non-numeric columns are ignored.
df_1.groupby("Model Type")[["Test Acc", "Test Loss"]].mean()

### Output Layer Type:
The table below shows that linear layers are better than MLP layers, on both the accuracy and the loss. A possible explanation is that the Multi-Layer Perceptron (MLP) makes the model more complex because of its non-linearity, which leads to overfitting on the training set and thus to reduced performance on the test set. The linear model is simpler, and its structure limits overfitting because it cannot represent overly complex functions. Additionally, a more expressive model placed after the RNN does not force the RNN to learn a good sentence representation, since the representation no longer needs to be linearly separable.

In [None]:
df_1.groupby("Output Layer Type")[["Test Acc", "Test Loss"]].mean()

### Number of layers:
As shown in the table below, adding one layer slightly improves the average accuracy and decreases the loss. However, it also increases the computation time (TODO: add numbers).

In [None]:
df_1.groupby("Number of Layers")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_1.groupby("Hidden Size")[["Test Acc", "Test Loss"]].mean()

Experiment 1 complete results (TODO: in Appendix ?)

In [None]:
df_1.groupby(["Model Type","Output Layer Type", "Hidden Size", "Number of Layers"], sort=True).max()[["Test Acc", "Test Loss", "step"]]

In [None]:
sns.lineplot(x="step", y="Val Loss", hue="Model Type", data=df_1)

In [None]:
sns.barplot(x="Model Type", y="Val Loss", data=df_1)

In [None]:
sns.scatterplot(x="Val Acc", y="Val Loss", hue="Model Type", data=df_1)

In [None]:
sns.jointplot(x="Test Acc", y="Test Loss", hue="Model Type", data=df_1, kind="scatter")

## Experiment 2: Attention
In this experiment, we analyse the effect of adding an attention layer on sentiment analysis performance.

In [None]:
df_2 = pd.read_csv("results/exp2_all.csv")

In [None]:
df_2.groupby("Model Type")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_2.groupby("Attention")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_2.groupby(["Model Type","Attention"], sort=True).max()[["Test Acc", "Test Loss", "step"]]

In [None]:
sns.lineplot(x="step", y="Val Loss", hue="Model Type", data=df_2)

In [None]:
sns.barplot(x="Model Type", y="Val Loss", data=df_2)

In [None]:
sns.scatterplot(x="Val Acc", y="Val Loss", hue="Model Type", data=df_2)

In [None]:
sns.jointplot(x="Test Acc", y="Test Loss", hue="Model Type", data=df_2, kind="scatter")

## Experiment 3: Word level embedding

In [None]:
df_3 = pd.read_csv("results/exp3_all.csv")

In [None]:
df_3.groupby("Model Type")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_3.groupby("Embedding")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_3.groupby(["Model Type","Embedding"], sort=True).max()[["Test Acc", "Test Loss"]]

In [None]:
sns.lineplot(x="step", y="Val Loss", hue="Model Type", data=df_3)

In [None]:
sns.barplot(x="Model Type", y="Val Loss", data=df_3)

In [None]:
sns.scatterplot(x="Val Acc", y="Val Loss", hue="Model Type", data=df_3)

In [None]:
sns.jointplot(x="Test Acc", y="Test Loss", hue="Model Type", data=df_3, kind="scatter")

## Experiment 4: Document level embedding

In [None]:
df_4 = pd.read_csv("results/exp4_all.csv")

In [None]:
df_4.groupby("Model Type")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_4.groupby("Output Layer Type")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_4.groupby("Embedding Size")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_4.groupby(["Model Type","Output Layer Type","Embedding Size"], sort=True)[["Test Acc", "Test Loss"]].mean()

In [None]:
sns.lineplot(x="step", y="Val Loss", hue="Model Type", data=df_4)

In [None]:
sns.barplot(x="Model Type", y="Val Loss", data=df_4)

In [None]:
sns.scatterplot(x="Val Acc", y="Val Loss", hue="Model Type", data=df_4)

In [None]:
sns.jointplot(x="Test Acc", y="Test Loss", hue="Model Type", data=df_4, kind="scatter")

## Experiment 5: Performance on different datasets

In [None]:
df_5 = pd.read_csv("results/exp5_all.csv")

In [None]:
df_5.groupby("Model Type")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_5.groupby("Dataset")[["Test Acc", "Test Loss"]].mean()

In [None]:
df_5.groupby(["Dataset", "Model Type"], sort=True).max()[["Test Acc", "Test Loss", "step"]]

In [None]:
sns.lineplot(x="step", y="Val Loss", hue="Dataset", data=df_5)

In [None]:
sns.barplot(x="Model Type", y="Val Loss", data=df_5)

## Bibliography

- Nils Reimers and Iryna Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. CoRR, abs/1908.10084.
- Lingfei Wu, Ian En-Hsu Yen, Kun Xu, Fangli Xu, Avinash Balakrishnan, Pin-Yu Chen, Pradeep Ravikumar and Michael J. Witbrock (2018). Word Mover's Embedding: From Word2Vec to Document Embedding. CoRR, abs/1811.01713.

