# Torch SST #
I implemented several models in `torch_sst.py`; this notebook evaluates them on the Stanford Sentiment Treebank (SST) dataset.

Models:
* Bag of words with a dense output layer
* Averaging the GloVe vectors for each token and passing the output to a dense layer
* Recurrent networks (vanilla RNN, LSTM, GRU)

Plenty of improvements (e.g. regularization) would lift the performance of all of these models, but I only wanted to compare them against each other. Overall, the BOW model and the bidirectional LSTM/GRU perform best.

In [6]:
import os

import torch
import numpy as np

from imblearn.over_sampling import RandomOverSampler

from torch_sst import BOWClassifier, GloveClassifier, RnnClassifier, load_raw_data, experiment

In [8]:
train_texts, train_labels = load_raw_data('train')
train_texts = np.array(train_texts)
train_labels = np.array(train_labels)

dev_texts, dev_labels = load_raw_data('dev')

In [9]:
rnn_classifier = RnnClassifier(
    hidden_dimension=50, rnn_type='rnn', num_classes=3, epochs=100,
    print_every=10, bidirectional=False, batch_size=256,
    oversample=False
)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 10: 35.23	0.48
Epoch 20: 33.30	0.55
Epoch 30: 31.21	0.60
Epoch 40: 30.50	0.61
Epoch 50: 29.48	0.63
Epoch 60: 28.93	0.63
Epoch 70: 28.27	0.64
Epoch 80: 27.51	0.65
Epoch 90: 27.41	0.66
Epoch 100: 26.66	0.66

## Train ##


              precision    recall  f1-score   support

    negative       0.61      0.78      0.69      3310
     neutral       0.00      0.00      0.00      1624
    positive       0.68      0.81      0.74      3610

    accuracy                           0.65      8544
   macro avg       0.43      0.53      0.47      8544
weighted avg       0.52      0.65      0.58      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.61      0.75      0.67       428
     neutral       0.00      0.00      0.00       229
    positive       0.62      0.80      0.70       444

    accuracy                           0.61      1101
   macro avg       0.41      0.52      0.46      1101
weighted avg       0.49      0.61      0.54      1101
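
The plain RNN above never predicts the neutral class, which is why the next cells turn on `oversample=True`. How `torch_sst.py` implements that flag is not shown here; the following is only a minimal NumPy sketch of random oversampling (the idea behind imblearn's `RandomOverSampler`), with `random_oversample` and the toy data being hypothetical names for illustration.

```python
import numpy as np

def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until every class
    has as many examples as the largest class."""
    rng = np.random.default_rng(seed)
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing examples with replacement from this class only.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    # Shuffle so duplicated examples are not grouped by class.
    keep = rng.permutation(np.concatenate(keep))
    return texts[keep], labels[keep]

# Toy data (assumption): 3 negative, 1 neutral, 3 positive.
toy_texts = ['bad', 'awful', 'poor', 'fine', 'great', 'superb', 'lovely']
toy_labels = ['negative', 'negative', 'negative', 'neutral',
              'positive', 'positive', 'positive']
bal_texts, bal_labels = random_oversample(toy_texts, toy_labels)
print(np.unique(bal_labels, return_counts=True))
```

Only the training split should be resampled; the dev set keeps its original class distribution so the reports stay comparable.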



In [None]:
rnn_classifier = RnnClassifier(
    hidden_dimension=50, rnn_type='rnn', num_classes=3, epochs=100,
    print_every=10, bidirectional=False, batch_size=256,
    oversample=True
)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

In [None]:
rnn_classifier = RnnClassifier(
    hidden_dimension=50, rnn_type='rnn', num_classes=3, epochs=100,
    print_every=10, bidirectional=True, batch_size=256,
    oversample=True
)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

In [3]:
rnn_classifier = RnnClassifier(
    hidden_dimension=50, rnn_type='lstm', num_classes=3, epochs=100,
    print_every=10, bidirectional=True, batch_size=256,
    oversample=True
)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 10: 37.85	0.57
Epoch 20: 31.43	0.68
Epoch 30: 23.06	0.80
Epoch 40: 16.77	0.86
Epoch 50: 12.96	0.90
Epoch 60: 9.46	0.93
Epoch 70: 7.56	0.95
Epoch 80: 6.17	0.96
Epoch 90: 4.72	0.97
Epoch 100: 4.01	0.97

## Train ##
              precision    recall  f1-score   support

    negative       0.77      0.70      0.74      3310
     neutral       0.56      0.51      0.53      1624
    positive       0.71      0.80      0.75      3610

    accuracy                           0.71      8544
   macro avg       0.68      0.67      0.67      8544
weighted avg       0.71      0.71      0.70      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.60      0.53      0.56       428
     neutral       0.26      0.20      0.22       229
    positive       0.57      0.70      0.63       444

    accuracy                           0.53      1101
   macro avg       0.48      0.48      0.47      1101
weighted avg       0.52      0.53      0.52      1101



In [4]:
rnn_classifier = RnnClassifier(
    hidden_dimension=50, rnn_type='gru', num_classes=3, epochs=100,
    print_every=10, bidirectional=True, batch_size=256,
    oversample=True
)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 10: 35.24	0.62
Epoch 20: 24.91	0.77
Epoch 30: 16.23	0.86
Epoch 40: 10.89	0.92
Epoch 50: 7.15	0.95
Epoch 60: 5.20	0.97
Epoch 70: 4.18	0.97
Epoch 80: 2.92	0.98
Epoch 90: 2.75	0.98
Epoch 100: 1.62	0.99

## Train ##
              precision    recall  f1-score   support

    negative       0.88      0.63      0.73      3310
     neutral       0.41      0.82      0.55      1624
    positive       0.78      0.64      0.70      3610

    accuracy                           0.67      8544
   macro avg       0.69      0.69      0.66      8544
weighted avg       0.75      0.67      0.68      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.62      0.40      0.49       428
     neutral       0.23      0.42      0.29       229
    positive       0.58      0.52      0.55       444

    accuracy                           0.45      1101
   macro avg       0.47      0.45      0.44      1101
weighted avg       0.52      0.45      0.47      1101



## RNN with GloVe Vectors ##

### Findings: ###
* LSTM, GRU: macro F1 improved by 7% and 9% respectively over the vanilla RNN.
* Bidirectional: 12% improvement in macro F1.
* Unfreezing embeddings: combining the bidirectional LSTM with unfrozen GloVe embeddings gave the model much more capacity. It fit the train set almost perfectly (macro F1 of 96%) but performed slightly worse on the dev set than the bidirectional LSTM with frozen embeddings, which points to overfitting.
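
The findings are stated in macro F1, which weighs all three classes equally and so penalizes a model that ignores the neutral class. As a small worked example, the numbers below are copied from the dev report of the `update_glove=True` run further down in this notebook; the variable names are mine.

```python
# Per-class dev F1 and support, copied from the unfrozen bidirectional
# LSTM classification report in this notebook.
f1 = {'negative': 0.67, 'neutral': 0.33, 'positive': 0.68}
support = {'negative': 428, 'neutral': 229, 'positive': 444}

# Macro average: plain mean over classes, regardless of class size.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f'macro F1:    {macro_f1:.2f}')     # 0.56
print(f'weighted F1: {weighted_f1:.2f}')  # 0.60
```

The gap between the two comes from the weak neutral class, which is also the smallest one.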

### TODO: ###
* Keep going with the bidirectional LSTM with unfrozen embeddings. It fits the training set well, so try adding regularization.
* Stacked (deep) RNNs
* Regularization / dropout
* Search over learning rate / batch size, hidden dimension, epochs
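
To make the stacked RNN and dropout items concrete, here is a rough sketch of what such a model could look like in PyTorch. `StackedLstmSketch` and its parameters are hypothetical and not part of `torch_sst.py`; it assumes the input is already a batch of embedded tokens.

```python
import torch
from torch import nn

class StackedLstmSketch(nn.Module):
    """Hypothetical stacked bidirectional LSTM classifier (not RnnClassifier)."""

    def __init__(self, embedding_dim=50, hidden_dimension=50, num_layers=2,
                 num_classes=3, dropout=0.3):
        super().__init__()
        # The LSTM's own `dropout` is applied between stacked layers,
        # so it only has an effect when num_layers > 1.
        self.rnn = nn.LSTM(embedding_dim, hidden_dimension,
                           num_layers=num_layers, bidirectional=True,
                           batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(2 * hidden_dimension, num_classes)

    def forward(self, embedded):
        # embedded: (batch, seq_len, embedding_dim), e.g. GloVe lookups.
        _, (h_n, _) = self.rnn(embedded)
        # h_n: (num_layers * 2, batch, hidden). The last two entries are the
        # top layer's final forward and backward states.
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.out(self.dropout(final))

model = StackedLstmSketch()
logits = model(torch.randn(4, 12, 50))  # 4 sentences, 12 tokens each
print(logits.shape)

# weight_decay adds L2 regularization on top of dropout.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
```

The dropout rate, weight decay and layer count here are placeholders that would go into the hyperparameter search from the last bullet.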

In [10]:
rnn_classifier = RnnClassifier(hidden_dimension=50, rnn_type='rnn', num_classes=3, epochs=20, print_every=1, bidirectional=True)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 1: 93.36	0.35
Epoch 2: 92.32	0.38
Epoch 3: 89.65	0.43
Epoch 4: 89.62	0.45
Epoch 5: 85.64	0.48
Epoch 6: 85.21	0.49
Epoch 7: 84.33	0.50
Epoch 8: 83.12	0.51
Epoch 9: 82.37	0.51
Epoch 10: 82.18	0.52
Epoch 11: 82.16	0.52
Epoch 12: 80.77	0.53
Epoch 13: 79.68	0.53
Epoch 14: 79.75	0.54
Epoch 15: 79.11	0.54
Epoch 16: 79.28	0.54
Epoch 17: 78.67	0.54
Epoch 18: 78.18	0.55
Epoch 19: 77.83	0.55
Epoch 20: 76.80	0.56

## Train ##
              precision    recall  f1-score   support

    negative       0.59      0.79      0.68      3310
     neutral       0.30      0.14      0.19      1624
    positive       0.73      0.68      0.71      3610

    accuracy                           0.62      8544
   macro avg       0.54      0.54      0.53      8544
weighted avg       0.60      0.62      0.60      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.61      0.79      0.69       428
     neutral       0.36      0.14      0.20       229
    positive       0.6

In [11]:
rnn_classifier = RnnClassifier(hidden_dimension=50, rnn_type='lstm', num_classes=3, epochs=20, print_every=1, bidirectional=True)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 1: 92.99	0.36
Epoch 2: 87.27	0.46
Epoch 3: 85.62	0.48
Epoch 4: 86.11	0.49
Epoch 5: 82.71	0.51
Epoch 6: 79.37	0.53
Epoch 7: 77.91	0.55
Epoch 8: 77.17	0.55
Epoch 9: 75.94	0.56
Epoch 10: 74.18	0.57
Epoch 11: 72.57	0.59
Epoch 12: 71.01	0.60
Epoch 13: 69.47	0.61
Epoch 14: 68.13	0.63
Epoch 15: 65.58	0.65
Epoch 16: 63.94	0.65
Epoch 17: 60.99	0.68
Epoch 18: 59.46	0.70
Epoch 19: 57.98	0.70
Epoch 20: 55.05	0.73

## Train ##
              precision    recall  f1-score   support

    negative       0.74      0.69      0.72      3310
     neutral       0.42      0.39      0.40      1624
    positive       0.77      0.84      0.80      3610

    accuracy                           0.70      8544
   macro avg       0.65      0.64      0.64      8544
weighted avg       0.69      0.70      0.69      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.64      0.60      0.62       428
     neutral       0.28      0.21      0.24       229
    positive       0.6

In [12]:
rnn_classifier = RnnClassifier(hidden_dimension=50, rnn_type='gru', num_classes=3, epochs=20, print_every=1, bidirectional=True)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 1: 93.14	0.36
Epoch 2: 84.84	0.49
Epoch 3: 78.13	0.55
Epoch 4: 75.27	0.57
Epoch 5: 73.67	0.58
Epoch 6: 71.75	0.60
Epoch 7: 69.56	0.62
Epoch 8: 67.87	0.63
Epoch 9: 65.77	0.65
Epoch 10: 63.09	0.67
Epoch 11: 61.95	0.69
Epoch 12: 57.22	0.72
Epoch 13: 54.84	0.73
Epoch 14: 51.71	0.75
Epoch 15: 47.48	0.79
Epoch 16: 46.71	0.79
Epoch 17: 42.14	0.82
Epoch 18: 38.24	0.84
Epoch 19: 36.06	0.85
Epoch 20: 34.07	0.86

## Train ##
              precision    recall  f1-score   support

    negative       0.73      0.80      0.76      3310
     neutral       0.51      0.37      0.43      1624
    positive       0.74      0.75      0.75      3610

    accuracy                           0.70      8544
   macro avg       0.66      0.64      0.65      8544
weighted avg       0.69      0.70      0.69      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.61      0.67      0.64       428
     neutral       0.23      0.14      0.17       229
    positive       0.6

In [14]:
rnn_classifier = RnnClassifier(hidden_dimension=50, rnn_type='lstm', num_classes=3, epochs=20, print_every=1, bidirectional=True, oversample=True)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 1: 92.71	0.37
Epoch 2: 89.68	0.44
Epoch 3: 90.16	0.43
Epoch 4: 85.91	0.48
Epoch 5: 82.51	0.52
Epoch 6: 79.89	0.53
Epoch 7: 78.22	0.56
Epoch 8: 77.25	0.56
Epoch 9: 75.38	0.58
Epoch 10: 74.27	0.58
Epoch 11: 73.07	0.59
Epoch 12: 71.35	0.60
Epoch 13: 69.31	0.62
Epoch 14: 66.94	0.64
Epoch 15: 65.16	0.65
Epoch 16: 63.26	0.67
Epoch 17: 61.48	0.68
Epoch 18: 59.39	0.70
Epoch 19: 56.97	0.72
Epoch 20: 54.08	0.74

## Train ##
              precision    recall  f1-score   support

    negative       0.75      0.67      0.71      3310
     neutral       0.44      0.45      0.44      1624
    positive       0.77      0.84      0.80      3610

    accuracy                           0.70      8544
   macro avg       0.65      0.65      0.65      8544
weighted avg       0.70      0.70      0.70      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.66      0.59      0.62       428
     neutral       0.25      0.21      0.23       229
    positive       0.6

In [15]:
# Using the best RNN type so far, LSTM, and unfreezing the embedding layer
rnn_classifier = RnnClassifier(hidden_dimension=50, rnn_type='lstm', num_classes=3, epochs=5, print_every=1, bidirectional=True, update_glove=True)
experiment(rnn_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 1: 92.75	0.37
Epoch 2: 83.38	0.49
Epoch 3: 70.56	0.60
Epoch 4: 57.46	0.72
Epoch 5: 44.01	0.81

## Train ##
              precision    recall  f1-score   support

    negative       0.77      0.89      0.83      3310
     neutral       0.53      0.58      0.56      1624
    positive       0.96      0.78      0.86      3610

    accuracy                           0.79      8544
   macro avg       0.75      0.75      0.75      8544
weighted avg       0.80      0.79      0.79      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.63      0.71      0.67       428
     neutral       0.33      0.34      0.33       229
    positive       0.74      0.64      0.68       444

    accuracy                           0.60      1101
   macro avg       0.57      0.56      0.56      1101
weighted avg       0.61      0.60      0.60      1101



## GloVe Averaged Classifier ##

Use pre-trained GloVe vectors. Average the vectors across all tokens.
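
The averaging step can be sketched as follows. This is not the `GloveClassifier` from `torch_sst.py`: the class name, the single hidden layer, the padding index of 0 and the random stand-in vectors are all assumptions for illustration.

```python
import torch
from torch import nn

class AveragedEmbeddingSketch(nn.Module):
    """Hypothetical averaged-embedding classifier."""

    def __init__(self, vectors, hidden_dim=100, num_classes=3,
                 update_glove=False, padding_idx=0):
        super().__init__()
        self.padding_idx = padding_idx
        # freeze=True keeps the pre-trained vectors fixed during training.
        self.embed = nn.Embedding.from_pretrained(
            vectors, freeze=not update_glove, padding_idx=padding_idx)
        self.hidden = nn.Linear(vectors.shape[1], hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # Mask out padding so it does not drag the average towards zero.
        mask = (token_ids != self.padding_idx).unsqueeze(-1).float()
        summed = (self.embed(token_ids) * mask).sum(dim=1)
        mean = summed / mask.sum(dim=1).clamp(min=1.0)
        return self.out(torch.relu(self.hidden(mean)))

vectors = torch.randn(10, 50)  # random stand-in for real GloVe vectors
model = AveragedEmbeddingSketch(vectors)
batch = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])  # 0 = padding
print(model(batch).shape)
```

Because the average discards word order, this model cannot tell "not good" from "good not", which is one reason to compare it against the recurrent models.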

In [16]:
glove_classifier = GloveClassifier(hidden_dim=100, epochs=1000, print_every=100, update_glove=False)
experiment(glove_classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 100: 58.35	0.61
Epoch 200: 56.47	0.63
Epoch 300: 56.22	0.63
Epoch 400: 53.75	0.65
Epoch 500: 52.19	0.67
Epoch 600: 51.34	0.67
Epoch 700: 49.74	0.69
Epoch 800: 48.78	0.69
Epoch 900: 47.81	0.70
Epoch 1000: 46.82	0.71

## Train ##
              precision    recall  f1-score   support

    negative       0.65      0.58      0.61      3310
     neutral       0.32      0.32      0.32      1624
    positive       0.68      0.74      0.71      3610

    accuracy                           0.60      8544
   macro avg       0.55      0.55      0.55      8544
weighted avg       0.60      0.60      0.60      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.58      0.53      0.56       428
     neutral       0.27      0.26      0.27       229
    positive       0.61      0.68      0.64       444

    accuracy                           0.54      1101
   macro avg       0.49      0.49      0.49      1101
weighted avg       0.53      0.54      0.53      1101

Sanity check: when the embedding layer has `requires_grad=True`, its weights change during training; when it is `False`, they stay fixed.

In [17]:
glove_classifier = GloveClassifier(hidden_dim=100, epochs=1, print_every=1, update_glove=False)
initial_weights = glove_classifier.embed.weight.detach().numpy().copy()
glove_classifier.fit(train_texts, train_labels)
later_weights = glove_classifier.embed.weight.detach().numpy()

(initial_weights == later_weights).all()

Epoch 1: 70.29	0.43


True

In [18]:
glove_classifier = GloveClassifier(hidden_dim=100, epochs=1, print_every=1, update_glove=True)
initial_weights = glove_classifier.embed.weight.detach().numpy().copy()
glove_classifier.fit(train_texts, train_labels)
later_weights = glove_classifier.embed.weight.detach().numpy()

(initial_weights == later_weights).all()

Epoch 1: 70.24	0.42


False
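
The same behaviour can be reproduced without `torch_sst.py`. The sketch below is an assumption-laden stand-in (random vectors, a tiny linear head, one SGD step) that returns whether the embedding weights changed, so its sign is the opposite of the equality check above.

```python
import torch
from torch import nn

def embedding_changes(freeze):
    """Run one training step and report whether the embedding weights moved."""
    torch.manual_seed(0)
    embed = nn.Embedding.from_pretrained(torch.randn(10, 8), freeze=freeze)
    head = nn.Linear(8, 3)
    # Only hand trainable parameters to the optimizer.
    params = [p for p in list(embed.parameters()) + list(head.parameters())
              if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.1)

    before = embed.weight.detach().clone()
    tokens = torch.tensor([[1, 2, 3], [4, 5, 6]])
    labels = torch.tensor([0, 2])
    loss = nn.functional.cross_entropy(head(embed(tokens).mean(dim=1)), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return not torch.equal(before, embed.weight.detach())

print(embedding_changes(freeze=True))   # False
print(embedding_changes(freeze=False))  # True
```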

## Bag of Words Classifier ##
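
As listed in the overview, this model feeds bag-of-words counts into a dense output layer. The sketch below shows the idea only; `build_vocab_sketch` and `vectorize` are hypothetical helpers with whitespace tokenization, not the `BOWClassifier.build_vocab` used in the cells that follow.

```python
from collections import Counter

import torch
from torch import nn

def build_vocab_sketch(texts, top_n=10000):
    """Map the top_n most frequent lowercase tokens to integer ids."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(top_n))}

def vectorize(text, word_to_id):
    """Turn a sentence into a vector of token counts."""
    vec = torch.zeros(len(word_to_id))
    for token in text.lower().split():
        if token in word_to_id:  # out-of-vocabulary tokens are dropped
            vec[word_to_id[token]] += 1
    return vec

texts = ['it was a horrible movie', 'it was ok', 'it was an amazing movie']
word_to_id = build_vocab_sketch(texts)

# One dense layer maps counts straight to the three class logits.
dense = nn.Linear(len(word_to_id), 3)
features = torch.stack([vectorize(t, word_to_id) for t in texts])
logits = dense(features)
print(features.shape, logits.shape)
```

With `top_n=10000` the dense layer has one weight per vocabulary word and class, which is enough capacity to memorize the training set, as the reports below show.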

In [19]:
word_to_id, label_to_id = BOWClassifier.build_vocab(train_texts, train_labels, top_n=10000)

In [20]:
def demo(classifier):
    for text in 'It was a horrible disgusting movie', 'it was ok', 'It was an amazing movie!':
        print(text)
        print(classifier.predict(text))
        print()

In [21]:
classifier = BOWClassifier(word_to_id, label_to_id, epochs=10)
experiment(classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 5: 195.33	0.74
Epoch 10: 151.03	0.83

## Train ##
              precision    recall  f1-score   support

    negative       0.82      0.92      0.87      3310
     neutral       0.93      0.50      0.65      1624
    positive       0.84      0.93      0.89      3610

    accuracy                           0.84      8544
   macro avg       0.87      0.78      0.80      8544
weighted avg       0.85      0.84      0.83      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.63      0.72      0.67       428
     neutral       0.31      0.06      0.10       229
    positive       0.63      0.81      0.71       444

    accuracy                           0.62      1101
   macro avg       0.52      0.53      0.49      1101
weighted avg       0.56      0.62      0.57      1101



In [22]:
classifier = BOWClassifier(word_to_id, label_to_id, epochs=50)
experiment(classifier, train_texts, train_labels, dev_texts, dev_labels)

Epoch 5: 195.40	0.74
Epoch 10: 151.03	0.83
Epoch 15: 124.08	0.87
Epoch 20: 105.23	0.90
Epoch 25: 90.88	0.92
Epoch 30: 79.52	0.93
Epoch 35: 70.32	0.94
Epoch 40: 62.69	0.95
Epoch 45: 56.28	0.96
Epoch 50: 50.82	0.96

## Train ##
              precision    recall  f1-score   support

    negative       0.96      0.98      0.97      3310
     neutral       0.96      0.90      0.93      1624
    positive       0.97      0.98      0.98      3610

    accuracy                           0.96      8544
   macro avg       0.96      0.95      0.96      8544
weighted avg       0.96      0.96      0.96      8544

## Dev ##
              precision    recall  f1-score   support

    negative       0.65      0.68      0.67       428
     neutral       0.33      0.21      0.26       229
    positive       0.66      0.74      0.70       444

    accuracy                           0.61      1101
   macro avg       0.54      0.55      0.54      1101
weighted avg       0.59      0.61      0.59      1101

