# Homework 2: Stanford Sentiment Treebank

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"

## Contents

1. [Overview](#Overview)
1. [Methodological note](#Methodological-note)
1. [Set-up](#Set-up)
1. [A softmax baseline](#A-softmax-baseline)
1. [RNNClassifier wrapper](#RNNClassifier-wrapper)
1. [Error analysis](#Error-analysis)
1. [Homework questions](#Homework-questions)
  1. [Sentiment words alone [2 points]](#Sentiment-words-alone-[2-points])
  1. [A more powerful vector-summing baseline [3 points]](#A-more-powerful-vector-summing-baseline-[3-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Overview

This homework and associated bake-off are devoted to the Stanford Sentiment Treebank (SST). The homework questions ask you to implement some baseline systems, and the bake-off challenge is to define a system that does extremely well at the SST task.

We'll focus on the ternary task as defined by `sst.ternary_class_func`.

The SST test set will be used for the bake-off evaluation. This dataset is already publicly distributed, so we are counting on people not to cheat by develping their models on the test set. You must do all your development without using the test set at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. __Much of the scientific integrity of our field depends on people adhering to this honor code__. 

Our only additional restriction is that __you cannot make any use of the subtree labels__. This corresponds to the 'Root' condition in the paper. As we discussed in class, the subtree labels are a really interesting feature of SST, but bringing them in results in a substantially different learning problem.

One of our goals for this homework and bake-off is to encourage you to engage in __the basic development cycle for supervised models__, in which you

1. Write a new feature function. We recommend starting with something simple.
1. Use `sst.experiment` to evaluate your new feature function, with at least `fit_softmax_classifier`.
1. If you have time, compare your feature function with `unigrams_phi` using `sst.compare_models` or `sst.compare_models_mcnemar`. (For discussion, see [this notebook section](sst_02_hand_built_features.ipynb#Statistical-comparison-of-classifier-models).)
1. Return to step 1, or stop the cycle and conduct a more rigorous evaluation with hyperparameter tuning and assessment on the `dev` set.

[Error analysis](#Error-analysis) is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.

## Methodological note

You don't have to use the experimental framework defined below (based on `sst`). However, if you don't use `sst.experiment` as below, then make sure you're training only on `train`, evaluating on `dev`, and that you report with 

```
from sklearn.metrics import classification_report
classification_report(y_dev, predictions)
```
where `y_dev = [y for tree, y in sst.dev_reader(class_func=sst.ternary_class_func)]`. We'll focus on the value at `macro avg` under `f1-score` in these reports.

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [100]:
from collections import Counter
import copy
from functools import partial
import numpy as np
import os
import pandas as pd
import random
import scipy.stats
from sklearn.linear_model import LogisticRegression
import sst
import sys
import torch.nn as nn
from torch_rnn_classifier import TorchRNNClassifier
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_tree_nn import TorchTreeNN
import treelstm
import utils

In [3]:
SST_HOME = os.path.join('data', 'trees')

## A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models.

In [4]:
def unigrams_phi(tree):
    """The basis for a unigrams feature function.
    
    Parameters
    ----------
    tree : nltk.tree
        The tree to represent.
    
    Returns
    -------    
    Counter
        A map from strings to their counts in `tree`. (Counter maps a 
        list to a dict of counts of the elements in that list.)
    
    """
    return Counter(tree.leaves())

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [5]:
def fit_softmax_classifier(X, y):        
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [12]:
softmax_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

    accuracy                          0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101



`softmax_experiment` contains a lot of information that you can use for analysis; see [this section below](#Error-analysis) for starter code.

## RNNClassifier wrapper

This section illustrates how to use `sst.experiment` with RNN and TreeNN models.

To featurize examples for an RNN, we just get the words in order, letting the model take care of mapping them into an embedding space.

In [6]:
def rnn_phi(tree):
    return tree.leaves()    

The model wrapper gets the vocabulary using `sst.get_vocab`. If you want to use pretrained word representations in here, then you can have `fit_rnn_classifier` build that space too; see [this notebook section for details](sst_03_neural_networks.ipynb#Pretrained-embeddings).

In [7]:
def fit_rnn_classifier(X, y):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=True,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod

In [18]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)

Finished epoch 50 of 50; error is 1.9558108299970627

              precision    recall  f1-score   support

    negative      0.593     0.631     0.612       428
     neutral      0.273     0.249     0.260       229
    positive      0.632     0.622     0.627       444

    accuracy                          0.548      1101
   macro avg      0.499     0.500     0.499      1101
weighted avg      0.542     0.548     0.545      1101



In [66]:
def fit_rnn_classifier_layer(X, y, num_layers=1):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=True,
        num_layers=num_layers,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod

In [70]:
fit_rnn_classifier_layers = []
for i in range (1, 7):
    fit_rnn_classifier_layers.append(partial(fit_rnn_classifier_layer, num_layers=i))


In [71]:
rnn_experiment_layers = []
for i, layer in enumerate(fit_rnn_classifier_layers):
    print(f"num_layers: {i+1}")
    rnn_experiment_layers.append(
        sst.experiment(
        SST_HOME,
        rnn_phi,
        layer, 
        vectorize=False,  # For deep learning, use `vectorize=False`.
        assess_reader=sst.dev_reader)
    )

num_layers: 1


Finished epoch 50 of 50; error is 1.9672084599733353

              precision    recall  f1-score   support

    negative      0.632     0.565     0.597       428
     neutral      0.266     0.201     0.229       229
    positive      0.596     0.732     0.657       444

    accuracy                          0.557      1101
   macro avg      0.498     0.499     0.494      1101
weighted avg      0.541     0.557     0.545      1101

num_layers: 2


Finished epoch 50 of 50; error is 2.2303597629070284

              precision    recall  f1-score   support

    negative      0.609     0.661     0.634       428
     neutral      0.259     0.197     0.223       229
    positive      0.610     0.635     0.623       444

    accuracy                          0.554      1101
   macro avg      0.493     0.498     0.493      1101
weighted avg      0.537     0.554     0.544      1101

num_layers: 3


Finished epoch 50 of 50; error is 2.2279613763093953

              precision    recall  f1-score   support

    negative      0.564     0.736     0.638       428
     neutral      0.284     0.249     0.265       229
    positive      0.689     0.529     0.599       444

    accuracy                          0.551      1101
   macro avg      0.512     0.505     0.501      1101
weighted avg      0.556     0.551     0.545      1101

num_layers: 4


Finished epoch 50 of 50; error is 2.1385425627231685

              precision    recall  f1-score   support

    negative      0.592     0.654     0.622       428
     neutral      0.332     0.275     0.301       229
    positive      0.671     0.662     0.667       444

    accuracy                          0.579      1101
   macro avg      0.532     0.530     0.530      1101
weighted avg      0.570     0.579     0.573      1101

num_layers: 5


Finished epoch 50 of 50; error is 2.0896782428026267

              precision    recall  f1-score   support

    negative      0.588     0.647     0.616       428
     neutral      0.264     0.231     0.247       229
    positive      0.653     0.631     0.641       444

    accuracy                          0.554      1101
   macro avg      0.501     0.503     0.501      1101
weighted avg      0.547     0.554     0.550      1101

num_layers: 6


Finished epoch 50 of 50; error is 2.2545953691005707

              precision    recall  f1-score   support

    negative      0.591     0.647     0.618       428
     neutral      0.281     0.157     0.202       229
    positive      0.591     0.671     0.629       444

    accuracy                          0.555      1101
   macro avg      0.488     0.492     0.483      1101
weighted avg      0.527     0.555     0.536      1101



In [72]:
def fit_rnn_classifier_no_bi(X, y):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=False,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod

In [73]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier_no_bi, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)

Finished epoch 50 of 50; error is 2.1071425303816795

              precision    recall  f1-score   support

    negative      0.609     0.572     0.590       428
     neutral      0.300     0.262     0.280       229
    positive      0.611     0.687     0.647       444

    accuracy                          0.554      1101
   macro avg      0.507     0.507     0.506      1101
weighted avg      0.546     0.554     0.549      1101



## Error analysis

This section begins to build an error-analysis framework using the dicts returned by `sst.experiment`. These have the following structure:

```
'model': trained model
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_dataset': same structure as the value of 'train_dataset'
'predictions': predictions on the assessment data
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on the assessment data
```
The following function just finds mistakes, and returns a `pd.DataFrame` for easy subsequent processing:

In [13]:
def find_errors(experiment):
    """Find mistaken predictions.
    
    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.
        
    Returns
    -------
    pd.DataFrame
    
    """
    raw_examples = experiment['assess_dataset']['raw_examples']
    raw_examples = [" ".join(tree.leaves()) for tree in raw_examples]
    df = pd.DataFrame({
        'raw_examples': raw_examples,
        'predicted': experiment['predictions'],
        'gold': experiment['assess_dataset']['y']})
    df['correct'] = df['predicted'] == df['gold']
    return df

In [14]:
softmax_analysis = find_errors(softmax_experiment)

In [19]:
rnn_analysis = find_errors(rnn_experiment)

Here we merge the sotmax and RNN experiments into a single DataFrame:

In [20]:
analysis = softmax_analysis.merge(
    rnn_analysis, left_on='raw_examples', right_on='raw_examples')

analysis = analysis.drop('gold_y', axis=1).rename(columns={'gold_x': 'gold'})

The following code collects a specific subset of examples; small modifications to its structure will give you different interesting subsets:

In [21]:
# Examples where the softmax model is correct, the RNN is not,
# and the gold label is 'positive'

error_group = analysis[
    (analysis['predicted_x'] == analysis['gold'])
    &
    (analysis['predicted_y'] != analysis['gold'])    
    &
    (analysis['gold'] == 'positive')
]

In [22]:
error_group.shape[0]

75

In [23]:
for ex in error_group['raw_examples'].sample(5):
    print("="*70)
    print(ex)

An operatic , sprawling picture that 's entertainingly acted , magnificently shot and gripping enough to sustain most of its 170-minute length .
Lovely and poignant .
Vera 's three actors -- Mollà , Gil and Bardem -- excel in insightful , empathetic performances .
Uses high comedy to evoke surprising poignance .
Bogdanovich tantalizes by offering a peep show into the lives of the era 's creme de la celluloid .


## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Sentiment words alone [2 points]

NLTK includes an easy interface to [Minqing Hu and Bing Liu's __Opinion Lexicon__](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which consists of a list of positive words and a list of negative words. How much of the ternary SST story does this lexicon tell?

For this problem, submit code to do the following:

1. Create a feature function `op_unigrams` on the model of `unigrams_phi` above, but filtering the vocabulary to just items that are members of the Opinion Lexicon. Submit this feature function.

1. Evaluate your feature function with `sst.experiment`, with all the same parameters as were used to create `softmax_experiment` in [A softmax baseline](#A-softmax-baseline) above, except of course for the feature function.

1. Use `utils.mcnemar` to compare your feature function with the results in `softmax_experiment`. The information you need for this is in `softmax_experiment` and your own `sst.experiment` results. Submit your evaluation code. You can assume `softmax_experiment` is already in memory, but your code should create the other objects necessary for this comparison.

In [24]:
from nltk.corpus import opinion_lexicon

# Use set for fast membership checking:
positive = set(opinion_lexicon.positive())
negative = set(opinion_lexicon.negative())
words = set(opinion_lexicon.words())


In [25]:
def op_unigrams_words(tree, words):
    leaves = Counter(copy.deepcopy(tree.leaves()))
    for key in list(leaves):
        if key in words:
            del leaves[key]

    return leaves

op_unigrams = partial(op_unigrams_words, words=words)

In [26]:
softmax_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

    accuracy                          0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101



In [27]:
op_experiment = sst.experiment(
    SST_HOME,
    op_unigrams,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.575     0.607     0.591       428
     neutral      0.266     0.109     0.155       229
    positive      0.580     0.725     0.645       444

    accuracy                          0.551      1101
   macro avg      0.474     0.481     0.463      1101
weighted avg      0.513     0.551     0.522      1101



In [28]:
_ = sst.compare_models(
    SST_HOME,
    unigrams_phi,
    fit_softmax_classifier,
    stats_test=scipy.stats.wilcoxon,
    trials=10,
    phi2=op_unigrams,  # Defaults to same as first required argument.
    train_func2=fit_softmax_classifier, # Defaults to same as second required argument.
    reader=sst.train_reader, 
    train_size=0.7, 
    class_func=sst.ternary_class_func, 
    score_func=utils.safe_macro_f1)

Model 1 mean: 0.518
Model 2 mean: 0.468
p = 0.005


In [29]:
m = utils.mcnemar(
    softmax_experiment['assess_dataset']['y'], 
    op_experiment['predictions'],
    softmax_experiment['predictions'])

p = "p < 0.0001" if m[1] < 0.0001 else m[1]

print("McNemar's test: {0:0.02f} ({1:})".format(m[0], p))

McNemar's test: 20.17 (p < 0.0001)


### A more powerful vector-summing baseline [3 points]

In [Distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features), we looked at a baseline for the ternary SST problem in which each example is modeled as the sum of its 50-dimensional GloVe representations. A `LogisticRegression` model was used for prediction. A neural network might do better with these representations, since there might be complex relationships between the input feature dimensions that a linear classifier can't learn. 

To address this question, rerun the experiment with `TorchShallowNeuralClassifier` as the classifier. Specs:
* Use `sst.experiment` to conduct the experiment. 
* Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
  * The hidden dimensionality at 50, 100, and 200.
  * The hidden activation function as `nn.Tanh` or `nn.ReLU`.
* (For all other parameters to `TorchShallowNeuralClassifier`, use the defaults.)

For this problem, submit code to do the following:

1. Your model wrapper function around `TorchShallowNeuralClassifier`. This function should implement the requisite cross-validation; see [this notebook section](sst_02_hand_built_features.ipynb#Hyperparameter-search) for examples.
1. The classification report as printed by `sst.experiment`. (This will print out when you run `sst.experiment`. That print-out suffices.)
2. The optimal hyperparameters chosen in your experiment. (This too will print out when you run `sst.experiment`. The print-out again suffices.)

We're not evaluating the quality of your model. (We've specified the protocols completely, but there will still be variation in the results.) However, the primary goal of this question is to get you thinking more about this strikingly good baseline feature representation scheme for SST, so we're sort of hoping you feel compelled to try out variations on your own.

In [30]:
def fit_nn_classifier(X, y):
    mod = TorchShallowNeuralClassifier(
        hidden_dim=50, max_iter=100)
    mod.fit(X, y)
    return mod

In [31]:
def fit_shallow_nn_classifier_with_crossvalidation(X, y):
    basemod = fit_nn_classifier(X, y)
    cv = 3
    param_grid = {
                    'hidden_dim': [50, 100, 200], 
                    'hidden_activation': [nn.Tanh(), nn.ReLU()]
                 }    
    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, basemod, cv, param_grid)
    return best_mod

In [32]:
shallow_nn_classifier_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,
    fit_shallow_nn_classifier_with_crossvalidation, 
    class_func=sst.ternary_class_func)

Finished epoch 100 of 100; error is 9.089576269616373e-052

Best params: {'hidden_activation': ReLU(), 'hidden_dim': 200}
Best score: 0.494
              precision    recall  f1-score   support

    negative      0.620     0.596     0.608      1010
     neutral      0.298     0.279     0.288       481
    positive      0.658     0.702     0.679      1073

    accuracy                          0.581      2564
   macro avg      0.526     0.525     0.525      2564
weighted avg      0.576     0.581     0.578      2564



### Your original system [4 points]

Your task is to develop an original model for the SST ternary problem, using only the root-level labels (again, __you cannot make any use of the subtree labels__). There are many options. If you spend more than a few hours on this homework problem, you should consider letting it grow into your final project! Here are some relatively manageable ideas that you might try:

1. We didn't systematically evaluate the `bidirectional` option to the `TorchRNNClassifier`. Similarly, that model could be tweaked to allow multiple LSTM layers (at present there is only one), and you could try adding layers to the classifier portion of the model as well.

1. We've already glimpsed the power of rich initial word representations, and later in the course we'll see that smart initialization usually leads to a performance gain in NLP, so you could perhaps achieve a winning entry with a simple model that starts in a great place.

1. The [practical introduction to contextual word representations](contextualreps.ipynb) (to be discussed later in the quarter) covers pretrained representations and interfaces that are likely to boost the performance of any system.

1. The `TreeNN` and `TorchTreeNN` don't perform all that well, and this could be for the same reason that RNNs don't peform well: the gradient signal doesn't propagate reliably down inside very deep trees. [Tai et al. 2015](https://aclanthology.info/papers/P15-1150/p15-1150) sought to address this with TreeLSTMs, which are fairly easy to implement in PyTorch.

1. In the [distributed representations as features](#Distributed-representations-as-features) section, we just summed  all of the leaf-node GloVe vectors to obtain a fixed-dimensional representation for all sentences. This ignores all of the tree structure. See if you can do better by paying attention to the binary tree structure: write a function `glove_subtree_phi` that obtains a vector representation for each subtree by combining the vectors of its daughters, with the leaf nodes again given by GloVe (any dimension you like) and the full representation of the sentence given by the final vector obtained by this recursive process. You can decide on how you combine the vectors. 

1. If you have a lot of computing resources, then you can fire off a large hyperparameter search over many parameter values. All the model classes for this course are compatible with the `scikit-learn` and [scikit-optimize](https://scikit-optimize.github.io) methods, because they define the required functions for getting and setting parameters.

We want to emphasize that this needs to be an __original__ system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it.

__Please include a brief prose description of your system along with your code, to help the teaching team understand the structure of your system.__

## Bake-off [1 point]

The bake-off will begin on April 22. The announcement will go out on Piazza. As we said above, the bake-off evaluation data is the official SST test set release. For this bake-off, you'll evaluate your original system from the above homework problem on the test set, using the ternary class problem. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.
1. As noted above, __you cannot make any use of the subtree labels__.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187246

The cells below this one constitute your bake-off entry.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on April 24. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [54]:
X_glove.loc['$UNK', :] = np.random.uniform(low=-1.0, high=1.0, size=(1, 300))

In [33]:
def fit_rnn_classifier_glove(X, y, embeddings=None):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)
    embed_dim = 50
    all_embeddings = None
    use_embeddings = False
    if embeddings is not None:
        embed_dim = embeddings.shape[1]
        all_embeddings = X_glove.to_numpy()
        use_embeddings=True
    mod = TorchRNNClassifier(
        embeddings.index, 
        eta=0.05,
        embedding=all_embeddings,
        use_embedding=True,
        batch_size=1000,
        embed_dim=embed_dim,
        hidden_dim=embed_dim,
        max_iter=100,
        l2_strength=0.001,
        bidirectional=True,
        num_layers=3,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod

In [63]:
fit_rnn_classifier_embedded = partial(fit_rnn_classifier_glove, embeddings=X_glove)

In [64]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier_embedded, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)

Finished epoch 100 of 100; error is 4.805607169866562

              precision    recall  f1-score   support

    negative      0.560     0.864     0.680       428
     neutral      0.290     0.079     0.124       229
    positive      0.728     0.619     0.669       444

    accuracy                          0.602      1101
   macro avg      0.526     0.521     0.491      1101
weighted avg      0.571     0.602     0.560      1101



In [37]:
!pip install bert-serving-client

Collecting bert-serving-client
  Using cached bert_serving_client-1.10.0-py2.py3-none-any.whl (28 kB)
Installing collected packages: bert-serving-client
Successfully installed bert-serving-client-1.10.0


In [39]:
#%reload_ext autoreload
#%autoreload 2
import os
import sst
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNClassifier
from sklearn.metrics import classification_report
from bert_serving.client import BertClient 

In [40]:
SST_HOME = os.path.join("data", "trees")

### BERT representations for the SST

With the BERT server running in the background, the following will allow you to process new examples and obtain their BERT representations:

In [41]:
bc = BertClient(check_length=False)

Here we load in the SST train and dev sets, and we flatten the trees into strings of just their leaf nodes. We'll allow BERT to tokenize for us; an alternative is to use `is_tokenized=True` in the call to `bc.encode`, but this [requires a bit more fussing with the representations](https://github.com/hanxiao/bert-as-service#using-your-own-tokenizer) and might be suboptimal.

In [42]:
sst_train_reader = sst.train_reader(
    SST_HOME, class_func=sst.ternary_class_func)

sst_train = [(" ".join(t.leaves()), label) for t, label in sst_train_reader]

In [43]:
sst_dev_reader = sst.dev_reader(
    SST_HOME, class_func=sst.ternary_class_func)

sst_dev = [(" ".join(t.leaves()), label) for t, label in sst_dev_reader]

In [44]:
X_str_train, y_train = zip(*sst_train)

In [45]:
X_str_dev, y_dev = zip(*sst_dev)

Now we process the examples into BERT representations. I've set `show_tokens=True` to help us keep track of what BERT is doing to our texts:

In [46]:
X_bert_train, bert_train_toks = bc.encode(
    list(X_str_train), show_tokens=True)

In [47]:
X_bert_dev, bert_dev_toks = bc.encode(
    list(X_str_dev), show_tokens=True)

### BERT sentence-level classifier

As first illustration, we'll use BERT representations as the input to a classifier model. The first step is to combine the individual word representations into fixed dimensional vectors, so that we can use them as inputs to a classifier. For this, I'll just average the individual vectors:

In [48]:
def bert_reduce_mean(X):
    return X.mean(axis=1)  

This is very much like what we [summed the GloVe representations of these examples](sst_03_neural_networks.ipynb#Distributed-representations-as-features), but now the individual word representations are different depending on the context in which they appear.

Note: If you start the BERT server with `-pooling_strategy REDUCE_MEAN`, then this step is done for you. And [see here for discussion of other pooling strategies](https://github.com/hanxiao/bert-as-service#q-what-are-the-available-pooling-strategies).

In [49]:
X_bert_train_mean = bert_reduce_mean(X_bert_train)

BERT representations are pretty large:

In [50]:
X_bert_train_mean.shape[1]

768

Now we instantiate and fit a classifier. I picked a `TorchShallowNeuralClassifier`. Since the input representations are large, I chose a pretty large `hidden_dim`:

In [51]:
mod = TorchShallowNeuralClassifier(
    max_iter=100, hidden_dim=300)

In [52]:
%time _ = mod.fit(X_bert_train_mean, y_train)

Finished epoch 100 of 100; error is 0.17025146167725325

CPU times: user 36.3 s, sys: 490 ms, total: 36.8 s
Wall time: 25.2 s


Evaluation proceeds as you would expect:

In [53]:
X_bert_dev_mean = bert_reduce_mean(X_bert_dev)

In [54]:
bert_sent_preds = mod.predict(X_bert_dev_mean)

In [55]:
print(classification_report(y_dev, bert_sent_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.689     0.715     0.702       428
     neutral      0.315     0.253     0.281       229
    positive      0.712     0.759     0.735       444

    accuracy                          0.637      1101
   macro avg      0.572     0.576     0.573      1101
weighted avg      0.621     0.637     0.628      1101



### Using the SST experimental framework with BERT

It is straightforward to conduct experiments like the above using `sst.experiment`, which will enable you to do a wider range of experiments without writing or copy-pasting a lot of code. 

Per [the guidelines at Han Xiao's "BERT as a service"](https://github.com/hanxiao/bert-as-service#speed-wrt-client_batch_size), it would be prohibitively slow to call `bc.encode` on all our sentences individually. To address this, I suggest first creating a look-up for the precomputed BERT representations and then having your feature function simply use this look-up:

In [56]:
bert_lookup = {}

for (sents, reps) in ((X_str_train, X_bert_train_mean), 
                      (X_str_dev, X_bert_dev_mean)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_lookup[s] = rep

In [57]:
def bert_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_lookup[s]

In [58]:
def fit_wide_shallow_network(X, y):
    mod = TorchShallowNeuralClassifier(
        max_iter=100, hidden_dim=300)
    mod.fit(X, y)
    return mod

In [59]:
%%time 
_ = sst.experiment(
    SST_HOME,
    bert_sentence_phi,
    fit_wide_shallow_network,
    train_reader=sst.train_reader, 
    assess_reader=sst.dev_reader, 
    class_func=sst.ternary_class_func,
    vectorize=False)

Finished epoch 100 of 100; error is 0.17682028748095036

              precision    recall  f1-score   support

    negative      0.709     0.706     0.707       428
     neutral      0.341     0.275     0.304       229
    positive      0.700     0.773     0.734       444

    accuracy                          0.643      1101
   macro avg      0.583     0.584     0.582      1101
weighted avg      0.629     0.643     0.634      1101

CPU times: user 38.7 s, sys: 486 ms, total: 39.2 s
Wall time: 26.9 s


### BERT word-level representations as RNN features

We can also use BERT representations as the input to an RNN. There is just one key change from how we used these models before:

* Previously, we would feed in lists of tokens, and they would be converted to indices into a fixed embedding space. This presumes that all words have the same representation no matter what their context is. 

* With BERT, we skip the embedding entirely and just feed in lists of BERT vectors, which means that the same word can be represented in different ways.

`TorchRNNClassifier` supports this via `use_embedding=False`. In turn, you needn't supply a vocabulary:

In [89]:
bert_rnn = TorchRNNClassifier(
    vocab=[],
    max_iter=50,
    use_embedding=False)

In [61]:
%time _ = bert_rnn.fit(X_bert_train, y_train)

Finished epoch 50 of 50; error is 2.7157750725746155

CPU times: user 23min 8s, sys: 4min 19s, total: 27min 28s
Wall time: 19min 37s


In [62]:
bert_rnn_preds = bert_rnn.predict(X_bert_dev)

In [63]:
print(classification_report(y_dev, bert_rnn_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.752     0.645     0.694       428
     neutral      0.326     0.376     0.349       229
    positive      0.753     0.797     0.775       444

    accuracy                          0.650      1101
   macro avg      0.610     0.606     0.606      1101
weighted avg      0.664     0.650     0.655      1101



In [94]:
bert_rnn_layers = TorchRNNClassifier(
    vocab=[],
    max_iter=50,
    num_layers=8,
    use_embedding=False)

In [95]:
%time _ = bert_rnn_layers.fit(X_bert_train, y_train)

Finished epoch 50 of 50; error is 2.0444820076227194

CPU times: user 20min 48s, sys: 3min 51s, total: 24min 40s
Wall time: 16min 58s


In [96]:
bert_rnn_preds_layers = bert_rnn_layers.predict(X_bert_dev)

In [97]:
print(classification_report(y_dev, bert_rnn_preds_layers, digits=3))

              precision    recall  f1-score   support

    negative      0.727     0.703     0.715       428
     neutral      0.307     0.275     0.290       229
    positive      0.734     0.797     0.765       444

    accuracy                          0.652      1101
   macro avg      0.590     0.592     0.590      1101
weighted avg      0.643     0.652     0.647      1101

