# Homework and bake-off: Sentiment analysis

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2021"

## Contents

1. [Overview](#Overview)
1. [Methodological note](#Methodological-note)
1. [Set-up](#Set-up)
1. [Train set](#Train-set)
1. [Dev sets](#Dev-sets)
1. [A softmax baseline](#A-softmax-baseline)
1. [RNNClassifier wrapper](#RNNClassifier-wrapper)
1. [Error analysis](#Error-analysis)
1. [Homework questions](#Homework-questions)
  1. [Token-level differences [1 point]](#Token-level-differences-[1-point])
  1. [Training on some of the bakeoff data [1 point]](#Training-on-some-of-the-bakeoff-data-[1-point])
  1. [A more powerful vector-averaging baseline [2 points]](#A-more-powerful-vector-averaging-baseline-[2-points])
  1. [BERT encoding [2 points]](#BERT-encoding-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bakeoff [1 point]](#Bakeoff-[1-point])
1. [Submission Instruction](#Submission-Instruction)

## Overview

This homework and associated bakeoff are devoted to supervised sentiment analysis using the ternary (positive/negative/neutral) version of the Stanford Sentiment Treebank (SST-3) as well as a new dev/test dataset drawn from restaurant reviews. Our goal in introducing the new dataset is to push you to create a system that performs well in both the movie and restaurant domains.

The homework questions ask you to implement some baseline systems, and the bakeoff challenge is to define a system that does well on both the SST-3 test set and the new restaurant test set. Both are ternary tasks, and our central bakeoff score is the mean of the macro-F1 scores for the two datasets. This assigns equal weight to all classes and datasets regardless of size.
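
To make the scoring rule concrete, here is a minimal sketch that computes macro-F1 by hand and then averages it over two datasets. The helper names `macro_f1` and `bakeoff_score` and the toy labels are illustrative assumptions, not part of the course code; `sst.experiment` reports the real number for you.

```python
LABELS = ["negative", "neutral", "positive"]

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def bakeoff_score(datasets):
    """Mean of macro-F1 over (y_true, y_pred) pairs, one pair per dataset."""
    return sum(macro_f1(t, p) for t, p in datasets) / len(datasets)

# Toy (made-up) gold labels and predictions for two datasets:
movie_true = ["positive", "negative", "neutral", "positive"]
movie_pred = ["positive", "negative", "positive", "positive"]
food_true = ["negative", "neutral", "positive"]
food_pred = ["negative", "neutral", "positive"]

score = bakeoff_score([(movie_true, movie_pred), (food_true, food_pred)])
print(round(score, 3))
```

Because each class and each dataset enters the average with the same weight, missing every `neutral` example in the first toy dataset costs a full third of that dataset's score, even though only one example is affected.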

The SST-3 test set will be used for the bakeoff evaluation. This dataset is already publicly distributed, so we are counting on people not to cheat by developing their models on the test set. You must do all your development without using the test set at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. __Much of the scientific integrity of our field depends on people adhering to this honor code__. 

One of our goals for this homework and bakeoff is to encourage you to engage in __the basic development cycle for supervised models__, in which you

1. Design a new system. We recommend starting with something simple.
1. Use `sst.experiment` to evaluate your system, using random train/test splits initially.
1. If you have time, compare your system with others using `sst.compare_models` or `utils.mcnemar`. (For discussion, see [this notebook section](sst_02_hand_built_features.ipynb#Statistical-comparison-of-classifier-models).)
1. Return to step 1, or stop the cycle and conduct a more rigorous evaluation with hyperparameter tuning and assessment on the `dev` set.
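
As a rough illustration of the model comparison in step 3, the sketch below implements McNemar's test from scratch for two classifiers evaluated on the same examples. It is not the code behind `utils.mcnemar`, whose exact signature is not shown here; the function name `mcnemar_sketch` and the toy predictions are assumptions made for the example.

```python
import math

def mcnemar_sketch(y_true, pred_a, pred_b):
    """McNemar's chi-squared statistic (no continuity correction) and p-value."""
    triples = list(zip(y_true, pred_a, pred_b))
    # b: examples that A gets right and B gets wrong.
    b = sum(1 for t, a, o in triples if a == t and o != t)
    # c: examples that A gets wrong and B gets right.
    c = sum(1 for t, a, o in triples if a != t and o == t)
    if b + c == 0:
        return 0.0, 1.0  # The classifiers never disagree in correctness.
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy (made-up) data: ten examples, all with gold label "pos".
y_true = ["pos"] * 10
pred_a = ["pos"] * 9 + ["neg"]       # A is wrong only on the last example.
pred_b = ["neg"] * 5 + ["pos"] * 5   # B is wrong on the first five.

stat, p_value = mcnemar_sketch(y_true, pred_a, pred_b)
print(round(stat, 3), round(p_value, 3))
```

Only the examples on which the two systems disagree in correctness affect the statistic, which is why the test needs both systems to be scored on exactly the same assessment data.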

[Error analysis](#Error-analysis) is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.
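
One simple way to start an error analysis is to collect the misclassified examples and count which confusions are most frequent. The sketch below assumes you already have parallel lists of texts, gold labels, and predictions; the helper names and the toy data are hypothetical.

```python
from collections import Counter

def collect_errors(texts, gold, predicted):
    """Return (text, gold, predicted) triples for the misclassified examples."""
    return [(t, g, p) for t, g, p in zip(texts, gold, predicted) if g != p]

def confusion_counts(errors):
    """Count how often each (gold, predicted) confusion occurs."""
    return Counter((g, p) for _, g, p in errors)

# Toy (made-up) examples:
texts = ["Great food .", "It was fine .", "Never again .", "We ate there ."]
gold = ["positive", "neutral", "negative", "neutral"]
predicted = ["positive", "positive", "negative", "positive"]

errors = collect_errors(texts, gold, predicted)
for (g, p), n in confusion_counts(errors).most_common():
    print(f"gold={g} predicted={p}: {n}")
for text, g, p in errors:
    print(f"{text!r} (gold={g}, predicted={p})")
```

Reading through the examples behind the most frequent confusion is often a quick way to spot a pattern worth addressing.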

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [2]:
from collections import Counter
import random
import numpy as np
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
import torch
import torch.nn as nn
from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
import sst
import sst_mod
from sklearn.metrics import classification_report
from iit import get_IIT_sentiment_dataset, get_IIT_sentiment_devset
from torch_bert_classifier_IIT import TorchBertClassifierIIT
from torch_deep_neural_classifier_iit import TorchDeepNeuralClassifierIIT
import utils

In [3]:
SST_HOME = os.path.join('data', 'sentiment')

## A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models:

In [4]:
def unigrams_phi(text):
    return Counter(text.split())
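
For reference, here is what this feature function returns on a made-up sentence. The function is repeated so that the snippet runs on its own; the example strings are invented for illustration.

```python
from collections import Counter

def unigrams_phi(text):
    return Counter(text.split())

# Tokens are whatever whitespace splitting produces, including punctuation.
feats = unigrams_phi("a great , great movie")
print(feats["great"], feats[","], feats["terrible"])

# There is no case normalization, so "Great" and "great" are distinct features.
mixed_case = unigrams_phi("Great great")
print(mixed_case["great"])
```

Since a `Counter` returns 0 for missing keys, unseen tokens simply have count 0.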

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [5]:
def fit_softmax_classifier(X, y):
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [6]:
softmax_experiment = sst.experiment(
    sst.train_reader(SST_HOME),   # Train on any data you like except SST-3 test!
    unigrams_phi,                 # Free to write your own!
    fit_softmax_classifier,       # Free to write your own!
    assess_dataframes=[sst.dev_reader(SST_HOME), sst.bakeoff_dev_reader(SST_HOME)]) # Free to change this during development!

Assessment dataset 1
              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

    accuracy                          0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101

Assessment dataset 2
              precision    recall  f1-score   support

    negative      0.272     0.692     0.391       565
     neutral      0.429     0.113     0.179      1019
    positive      0.409     0.346     0.375       777

    accuracy                          0.328      2361
   macro avg      0.370     0.384     0.315      2361
weighted avg      0.385     0.328     0.294      2361

Mean of macro-F1 scores: 0.416


In [7]:
def one_hot(label):
    sents = ['positive', 'neutral', 'negative']
    return np.eye(len(sents))[sents.index(label)]

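def build_dataset_subtrees(dataframes, phi, vectorizer=None, vectorize=True):
    # `phi` is unused here: the features are built from the subtree labels,
    # not from the sentence text.
    if isinstance(dataframes, (list, tuple)):
        df = pd.concat(dataframes)
    else:
        df = dataframes

    raw_examples = list(df.sentence.values)

    left_labels = df.left_label.values
    right_labels = df.right_label.values

    # Each example is a 6-dim array: the one-hot left label followed by the
    # one-hot right label. Despite the name, these are arrays rather than
    # dicts, so `DictVectorizer` cannot handle them; use `vectorize=False`.
    feat_dicts = [
        np.concatenate((one_hot(left), one_hot(right)))
        for left, right in zip(left_labels, right_labels)]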

    if 'sentence_label' in df.columns:
        labels = list(df.sentence_label.values)
    else:
        labels = None

    feat_matrix = None
    if vectorize:
        # In training, we want a new vectorizer:
        if vectorizer is None:
            vectorizer = DictVectorizer(sparse=False)
            feat_matrix = vectorizer.fit_transform(feat_dicts)
        # In assessment, we featurize using the existing vectorizer:
        else:
            feat_matrix = vectorizer.transform(feat_dicts)
    else:
        feat_matrix = feat_dicts

    return {'X': feat_matrix,
            'y': labels,
            'vectorizer': vectorizer,
            'raw_examples': raw_examples}
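
To show what a single row of `X` looks like under this featurization, here is a small self-contained sketch. It repeats `one_hot` from the cell above; the label pair is made up.

```python
import numpy as np

SENTIMENTS = ['positive', 'neutral', 'negative']

def one_hot(label):
    return np.eye(len(SENTIMENTS))[SENTIMENTS.index(label)]

# A hypothetical example whose left subtree is positive and right subtree is negative:
vec = np.concatenate((one_hot('positive'), one_hot('negative')))
print(vec)        # First three slots encode the left label, last three the right.
print(vec.shape)
```

The classifier trained on these rows therefore sees only the two subtree labels and nothing about the words in the sentence.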

In [8]:
sentiment_iit_train_df = pd.read_csv('sst_tree_train.csv')
sentiment_iit_dev_df = pd.read_csv('sst_tree_dev.csv')

softmax_tree_experiment = sst_mod.experiment(
    sentiment_iit_train_df,
    unigrams_phi,
    fit_softmax_classifier,
    assess_dataframes=[sentiment_iit_dev_df],
    vectorize=False,
    build_dataset_fn=build_dataset_subtrees)

              precision    recall  f1-score   support

    negative      1.000     1.000     1.000       428
     neutral      1.000     1.000     1.000       229
    positive      1.000     1.000     1.000       444

    accuracy                          1.000      1101
   macro avg      1.000     1.000     1.000      1101
weighted avg      1.000     1.000     1.000      1101



In [9]:
LEFT = 0
RIGHT = 1
softmax_root_model = softmax_tree_experiment['model']

data_size = 10
X_base, X_sources, y_base, y_IIT, interventions, vectorizer = get_IIT_sentiment_dataset(sentiment_iit_dev_df.sample(data_size), softmax_root_model, LEFT, unigrams_phi)

split = (data_size * data_size) // 5

X_base_train = X_base[split:]
X_sources_train = [source[split:] for source in X_sources]
y_base_train = y_base[split:]
y_IIT_train = y_IIT[split:]
interventions_train = interventions[split:]


print(X_base_train.shape)

{('positive', 'positive'): 40, ('negative', 'negative'): 56, ('negative', 'positive'): 4}
torch.Size([80, 162])


In [10]:
embedding_dim = X_base.shape[1]
V1 = 0
V2 = 1
both = 2
# similar to our alignment in the IIT accuracy section?
# aligning V1 to left side of layer 1, and V2 to the right side
# we are defining both as a list with two values -- why not encode it as a single range from 0  to dim * 2?
id_to_coords = {
    V1: {1: [{"layer": 1, "start": 0, "end": embedding_dim}]},
    V2: {1: [{"layer": 1, "start": embedding_dim, "end": embedding_dim * 2}]},
    both: {1: [
        {"layer": 1, "start": 0, "end": embedding_dim},
        {"layer": 1, "start": embedding_dim, "end": embedding_dim * 2}]}}

# gives back an IIT dataset based off of the Premack dataset, coming up with 
# all possible permutations of same/different shape pairs and same/different base-source pairs?
# X_base_train, X_sources_train, y_base_train, y_IIT_train, interventions = get_IIT_equality_dataset("V1", embedding_dim ,data_size)

# this is a different model from the one we defined in the previous cell, but with a similar idea?
model = TorchDeepNeuralClassifierIIT(hidden_dim=embedding_dim*4, hidden_activation=torch.nn.ReLU(), num_layers=3, id_to_coords=id_to_coords)
# model.fit() function internally calls on model.create_dataset(), which creates dataset in a way that pairs off
# source and base inputs?
_ = model.fit(X_base_train, X_sources_train, y_base_train, y_IIT_train, interventions_train)

# this is a runtime error I've also encountered in antra (with no change to the original code)
# could this be due to mismatching pytorch versions??

Stopping after epoch 33. Training loss did not improve more than tol=1e-05. Final error is 2.714747324716882e-06.
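
To make the role of `id_to_coords` more concrete, here is a sketch of an interchange intervention on a single hidden vector in plain NumPy. It reflects an assumption about how coordinates of the form `{"layer", "start", "end"}` are used (copy the source's values into the base's hidden state over the given range); it is not the internals of `TorchDeepNeuralClassifierIIT`, and the helper `interchange` and the toy vectors are hypothetical.

```python
import numpy as np

def interchange(base_hidden, source_hidden, coords):
    """Copy source activations into a copy of the base over each [start, end) range."""
    out = base_hidden.copy()
    for c in coords:
        out[c["start"]:c["end"]] = source_hidden[c["start"]:c["end"]]
    return out

dim = 3
base_hidden = np.zeros(2 * dim)    # Stand-in for the base input's hidden state.
source_hidden = np.ones(2 * dim)   # Stand-in for the source input's hidden state.

left = [{"layer": 1, "start": 0, "end": dim}]
both = left + [{"layer": 1, "start": dim, "end": 2 * dim}]

left_result = interchange(base_hidden, source_hidden, left)
both_result = interchange(base_hidden, source_hidden, both)
print(left_result)
print(both_result)
```

Under this reading, listing two adjacent ranges for `both` has the same effect as one range from `0` to `embedding_dim * 2`.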

In [11]:
X_base_test = X_base[:split]
X_sources_test = [source[:split] for source in X_sources]
y_base_test = y_base[:split]
y_IIT_test = y_IIT[:split]
interventions_test = interventions[:split]

IIT_preds, base_preds = model.model(model.prep_input(X_base_test, X_sources_test, interventions_test))
IIT_preds = np.array(IIT_preds.argmax(axis=1).cpu())
base_preds = np.array(base_preds.argmax(axis=1).cpu())
print(classification_report(y_base_test, base_preds))
print(classification_report(y_IIT_test, IIT_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         6
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00        14

    accuracy                           0.30        20
   macro avg       0.33      0.33      0.33        20
weighted avg       0.30      0.30      0.30        20

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         8
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00        12

    accuracy                           0.40        20
   macro avg       0.33      0.33      0.33        20
weighted avg       0.40      0.40      0.40        20



  _warn_prf(average, modifier, msg_start, len(result))


In [12]:
X_base_dev, X_sources_dev, y_base_dev, y_IIT_dev, interventions_dev = get_IIT_sentiment_devset(
    sst.bakeoff_dev_reader(SST_HOME), LEFT, unigrams_phi, vectorizer)

IIT_preds, base_preds = model.model(model.prep_input(X_base_dev, X_sources_dev, interventions_dev))
base_preds = np.array(base_preds.argmax(axis=1).cpu())
print(classification_report(y_base_dev, base_preds))

              precision    recall  f1-score   support

           0       0.36      0.37      0.36       777
           1       0.46      0.71      0.56      1019
           2       0.00      0.00      0.00       565

    accuracy                           0.43      2361
   macro avg       0.27      0.36      0.31      2361
weighted avg       0.32      0.43      0.36      2361



  _warn_prf(average, modifier, msg_start, len(result))


## BERT IIT Training

In [18]:
def bert_fine_tune_phi(text):
    return text

In [25]:
LEFT = 0
RIGHT = 1
BOTH = 2
dim = 768
half_dim = dim // 2

id_to_coords = {
    LEFT: {3: [{"layer": 0, "start": 0, "end": half_dim}]},
    RIGHT: {3: [{"layer": 0, "start": half_dim, "end": dim}]},
    BOTH: {3: [
        {"layer": 0, "start": 0, "end": half_dim},
        {"layer": 0, "start": half_dim, "end": dim}]}}
    
test_model = TorchBertClassifierIIT(id_to_coords)

data_size = 2
X_base, X_sources, y_base, y_IIT, interventions, vectorizer = get_IIT_sentiment_dataset(sentiment_iit_train_df.sample(data_size), softmax_root_model, LEFT, bert_fine_tune_phi, vectorize=False)

_ = test_model.fit(X_base, X_sources, y_base, y_IIT, interventions)

{('neutral', 'neutral'): 2, ('negative', 'neutral'): 2}


Stopping after epoch 18. Training loss did not improve more than tol=1e-05. Final error is 1.1446082592010498.

In [None]:
X_base_test, X_sources_test, y_base_test, y_IIT_test, interventions_test  = get_IIT_sentiment_devset(sst.dev_reader(SST_HOME), LEFT, bert_fine_tune_phi, None, False)

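# As in the earlier cells, the model returns (IIT outputs, base outputs), in that order.
y_IIT_predict, y_predict = test_model.model(test_model.prep_input(X_base_test, X_sources_test, interventions_test))
# Convert the per-class scores to class indices before scoring.
y_predict = np.array(y_predict.argmax(axis=1).cpu())

print(classification_report(y_base_test, y_predict))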

In [38]:
base = ['This is just a single test']
sources = [base]
coord_ids = [0] * len(base)

LABELS = ['positive', 'neutral', 'negative']
_, y_ = test_model.model(test_model.prep_input(base, sources, coord_ids))
y_ = np.array(y_.argmax(axis=1).cpu())
y_

array([1, 1, 1], dtype=int64)

In [37]:
def fit_iit_bert_classifier_with_hyperparameter_search(X, y):
    basemod = TorchBertClassifierIIT(
        weights_name='bert-base-cased',
        batch_size=8,  # Small batches to avoid memory overload.
        max_iter=1,  # We'll search based on 1 iteration for efficiency.
        n_iter_no_change=5,   # Early-stopping params are for the
        early_stopping=True)  # final evaluation.

    param_grid = {
        'gradient_accumulation_steps': [1, 4, 8],
        'eta': [0.00005, 0.0001, 0.001]}

    bestmod = utils.fit_classifier_with_hyperparameter_search(
        X, y, basemod, cv=3, param_grid=param_grid)

    return bestmod
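
To get a feel for the cost of this search, the sketch below enumerates the grid and counts the model fits, assuming an exhaustive grid search with 3-fold cross-validation. Whether `utils.fit_classifier_with_hyperparameter_search` does exactly this is an assumption here, as is the usual reading of gradient accumulation, in which the effective batch size is `batch_size * gradient_accumulation_steps`. The helper `expand_grid` is hypothetical.

```python
from itertools import product

# Values copied from the cell above.
param_grid = {
    'gradient_accumulation_steps': [1, 4, 8],
    'eta': [0.00005, 0.0001, 0.001]}
batch_size = 8
cv = 3

def expand_grid(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

settings = expand_grid(param_grid)
n_cv_fits = len(settings) * cv
effective_batch_sizes = [
    batch_size * steps for steps in param_grid['gradient_accumulation_steps']]

print(len(settings), n_cv_fits, effective_batch_sizes)
```

With nine settings and three folds, the search already involves 27 BERT fine-tuning runs, which is why `max_iter=1` is a sensible choice during the search.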

In [None]:
%%time
bert_classifier_xval = sst.experiment(
    sst.train_reader(SST_HOME),
    bert_fine_tune_phi,
    fit_iit_bert_classifier_with_hyperparameter_search,
    assess_dataframes=sst.dev_reader(SST_HOME),
    vectorize=False)  # `bert_fine_tune_phi` returns raw text, so skip vectorization.

In [None]:
optimized_bert_classifier = bert_classifier_xval['model']
del bert_classifier_xval

In [None]:
def fit_optimized_hf_bert_classifier(X, y):
    optimized_bert_classifier.max_iter = 1000
    optimized_bert_classifier.fit(X, y)
    return optimized_bert_classifier

In [None]:
%%time
_ = sst.experiment(
    sst.train_reader(SST_HOME),
    bert_fine_tune_phi,
    fit_optimized_hf_bert_classifier,
    assess_dataframes=test_df,
    vectorize=False)  # `bert_fine_tune_phi` returns raw text, so skip vectorization.