# Homework and bake-off: Sentiment analysis

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2021"

## Contents

1. [Overview](#Overview)
1. [Methodological note](#Methodological-note)
1. [Set-up](#Set-up)
1. [Train set](#Train-set)
1. [Dev sets](#Dev-sets)
1. [A softmax baseline](#A-softmax-baseline)
1. [RNNClassifier wrapper](#RNNClassifier-wrapper)
1. [Error analysis](#Error-analysis)
1. [Homework questions](#Homework-questions)
  1. [Token-level differences [1 point]](#Token-level-differences-[1-point])
  1. [Training on some of the bakeoff data [1 point]](#Training-on-some-of-the-bakeoff-data-[1-point])
  1. [A more powerful vector-averaging baseline [2 points]](#A-more-powerful-vector-averaging-baseline-[2-points])
  1. [BERT encoding [2 points]](#BERT-encoding-[2-points])
  1. [Your original system [3 points]](#Your-original-system-[3-points])
1. [Bakeoff [1 point]](#Bakeoff-[1-point])
1. [Submission Instruction](#Submission-Instruction)

## Overview

This homework and associated bakeoff are devoted to supervised sentiment analysis using the ternary (positive/negative/neutral) version of the Stanford Sentiment Treebank (SST-3) as well as a new dev/test dataset drawn from restaurant reviews. Our goal in introducing the new dataset is to push you to create a system that performs well in both the movie and restaurant domains.

The homework questions ask you to implement some baseline systems, and the bakeoff challenge is to define a system that does well on both the SST-3 test set and the new restaurant test set. Both are ternary tasks, and our central bakeoff score is the mean of the macro-F1 scores for the two datasets. This assigns equal weight to all classes and datasets regardless of size.
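
To make the scoring concrete, here is a minimal pure-Python sketch of the mean of macro-F1 scores. The function names `macro_f1` and `bakeoff_score` and the toy labels are assumptions made for illustration; this is not the code that `sst.experiment` uses internally.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def bakeoff_score(results):
    """Mean of macro-F1 over (gold, predicted) pairs, one pair per dataset."""
    scores = [macro_f1(gold, pred) for gold, pred in results]
    return sum(scores) / len(scores)


# Toy data (hypothetical): one "movie" dataset and one "restaurant" dataset.
movie_gold = ['positive', 'negative', 'neutral', 'positive']
movie_pred = ['positive', 'negative', 'positive', 'positive']
restaurant_gold = ['neutral', 'neutral', 'negative']
restaurant_pred = ['neutral', 'neutral', 'negative']

score = bakeoff_score([
    (movie_gold, movie_pred),
    (restaurant_gold, restaurant_pred)])
```

Because each class contributes equally to macro-F1 and each dataset contributes equally to the mean, a system that ignores the `neutral` class is penalized on both datasets, however rare that class is in one of them.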

The SST-3 test set will be used for the bakeoff evaluation. This dataset is already publicly distributed, so we are counting on people not to cheat by developing their models on the test set. You must do all your development without using the test set at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. __Much of the scientific integrity of our field depends on people adhering to this honor code__. 

One of our goals for this homework and bakeoff is to encourage you to engage in __the basic development cycle for supervised models__, in which you

1. Design a new system. We recommend starting with something simple.
1. Use `sst.experiment` to evaluate your system, using random train/test splits initially.
1. If you have time, compare your system with others using `sst.compare_models` or `utils.mcnemar`. (For discussion, see [this notebook section](sst_02_hand_built_features.ipynb#Statistical-comparison-of-classifier-models).)
1. Return to step 1, or stop the cycle and conduct a more rigorous evaluation with hyperparameter tuning and assessment on the `dev` set.
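
For the comparison step, the following is a rough sketch of McNemar's test on paired predictions, written in pure Python. It assumes the common variant with a continuity correction and a chi-squared reference distribution with one degree of freedom; the function name `mcnemar_sketch` is hypothetical, and this is not presented as the implementation behind `utils.mcnemar`.

```python
import math


def mcnemar_sketch(gold, pred_a, pred_b):
    """Return (statistic, p_value) for two classifiers' paired predictions."""
    triples = list(zip(gold, pred_a, pred_b))
    # Only the examples on which the two systems disagree in correctness matter.
    a_only = sum(a == g and b != g for g, a, b in triples)
    b_only = sum(a != g and b == g for g, a, b in triples)
    if a_only + b_only == 0:
        return 0.0, 1.0
    stat = (abs(a_only - b_only) - 1) ** 2 / (a_only + b_only)
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data (hypothetical): system A is right on all 10 examples,
# system B is wrong on the first 8.
gold = ['positive'] * 10
pred_a = ['positive'] * 10
pred_b = ['negative'] * 8 + ['positive'] * 2

stat, p_value = mcnemar_sketch(gold, pred_a, pred_b)
```

A small p-value suggests that the two systems differ in their error patterns on this sample; it says nothing about which domain the difference comes from, so it is worth running the comparison separately on each dev set.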

[Error analysis](#Error-analysis) is one of the most important methods for steadily improving a system, as it facilitates a kind of human-powered hill-climbing on your ultimate objective. Often, it takes a careful human analyst just a few examples to spot a major pattern that can lead to a beneficial change to the feature representations.

## Methodological note

You don't have to use the experimental framework defined below (based on `sst`). The only constraint we need to place on your system is that it must have a `predict_one` method that can map directly from an example text to a prediction, and it must be able to make predictions without having any information beyond the text. (For example, it can't depend on knowing which task the text comes from.) See [the bakeoff section below](#Bakeoff-[1-point]) for examples of functions that conform to this specification.
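
As a toy illustration of that interface only, the sketch below defines a hypothetical lexicon-based classifier whose `predict_one` method maps a raw text to one of the three labels. The class name and word lists are invented for this example; a real entry would wrap a trained model instead.

```python
class ToyLexiconClassifier:
    """Hypothetical stand-in for a trained model with a `predict_one` method."""

    POSITIVE = {"great", "good", "outstanding", "best"}
    NEGATIVE = {"bad", "terrible", "worst", "bland"}

    def predict_one(self, text):
        # The only input is the text itself: no dataset or task identifier.
        tokens = [t.strip(".,!?").lower() for t in text.split()]
        score = (sum(t in self.POSITIVE for t in tokens)
                 - sum(t in self.NEGATIVE for t in tokens))
        if score > 0:
            return "positive"
        if score < 0:
            return "negative"
        return "neutral"


toy_model = ToyLexiconClassifier()
prediction = toy_model.predict_one("The steaks are outstanding.")
```

The same call works for a movie sentence or a restaurant sentence, which is the property the constraint is meant to guarantee.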

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [11]:
from collections import Counter
import numpy as np
import os
import pandas as pd
from sklearn.linear_model import LogisticRegression
import torch.nn as nn
import torch
from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
import sst
import utils
from transformers import BertModel, BertTokenizer
from transformers.file_utils import PaddingStrategy

In [2]:
SST_HOME = os.path.join('data', 'sentiment')

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

## Train set

Our primary train set is the SST-3 train set:

In [4]:
sst_train = sst.train_reader(SST_HOME, include_subtrees=False)

This is the train set we will use for all the regular homework questions. You are welcome to bring in new datasets for your original system. You are also free to add `include_subtrees=True`. This is very likely to lead to better systems, but it substantially increases the overall size of the dataset (from 8,544 examples to 159,274), which will in turn substantially increase the time it takes to run experiments.

See [this notebook](sst_01_overview.ipynb) for additional details of this dataset.

## Dev sets

We have two development sets. SST-3 dev consists of sentences from movie reviews, just like SST-3 train:

In [5]:
sst_dev = sst.dev_reader(SST_HOME)

Our new bakeoff dev set consists of sentences from restaurant reviews:

In [6]:
bakeoff_dev = sst.bakeoff_dev_reader(SST_HOME)

In [7]:
bakeoff_dev.sample(3, random_state=1).to_dict(orient='records')

[{'example_id': 57,
  'sentence': 'I would recommend that you make reservations in advance.',
  'label': 'neutral',
  'is_subtree': 0},
 {'example_id': 590,
  'sentence': 'We were welcomed warmly.',
  'label': 'positive',
  'is_subtree': 0},
 {'example_id': 1968,
  'sentence': 'We have been to Oceanaire twice in the last 6 weeks.',
  'label': 'neutral',
  'is_subtree': 0}]

Here is the label distribution:

In [8]:
sst_dev.label.value_counts(normalize=True)
bakeoff_dev.label.value_counts(normalize=True)

positive    0.403270
negative    0.388738
neutral     0.207993
Name: label, dtype: float64

neutral     0.431597
positive    0.329098
negative    0.239305
Name: label, dtype: float64

The label distribution for the corresponding test set is similar to this.

## A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models:

In [9]:
def unigrams_phi(text):
    return Counter(text.split())

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [10]:
def fit_softmax_classifier(X, y):
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [11]:
softmax_experiment = sst.experiment(
    sst_train,   # Train on any data you like except SST-3 test!
    unigrams_phi,                 # Free to write your own!
    fit_softmax_classifier,       # Free to write your own!
    assess_dataframes=[sst_dev, bakeoff_dev]) # Free to change this during development!

Assessment dataset 1
              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

    accuracy                          0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101

Assessment dataset 2
              precision    recall  f1-score   support

    negative      0.272     0.690     0.390       565
     neutral      0.429     0.113     0.179      1019
    positive      0.409     0.346     0.375       777

    accuracy                          0.328      2361
   macro avg      0.370     0.383     0.315      2361
weighted avg      0.385     0.328     0.294      2361

Mean of macro-F1 scores: 0.416


`softmax_experiment` contains a lot of information that you can use for error analysis; see [this section below](#Error-analysis) for starter code.

## RNNClassifier wrapper

This section illustrates how to use `sst.experiment` with `TorchRNNClassifier`.

To featurize examples for an RNN, we can just get the words in order, letting the model take care of mapping them into an embedding space.

In [12]:
def rnn_phi(text):
    return text.split()

The model wrapper gets the vocabulary using `utils.get_vocab`. If you want to use pretrained word representations in here, then you can have `fit_rnn_classifier` build that space too; see [this notebook section for details](sst_03_neural_networks.ipynb#Pretrained-embeddings). See also [torch_model_base.py](torch_model_base.py) for details on the many optimization parameters that `TorchRNNClassifier` accepts.

In [13]:
def fit_rnn_classifier(X, y):
    sst_glove_vocab = utils.get_vocab(X, mincount=2)
    mod = TorchRNNClassifier(
        sst_glove_vocab,
        early_stopping=True,
        device=device)
    mod.fit(X, y)
    return mod

In [14]:
rnn_experiment = sst.experiment(
    sst.train_reader(SST_HOME),
    rnn_phi,
    fit_rnn_classifier,
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_dataframes=[sst_dev, bakeoff_dev])

Stopping after epoch 37. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 2.1222076565027237

Assessment dataset 1
              precision    recall  f1-score   support

    negative      0.538     0.720     0.616       428
     neutral      0.293     0.170     0.215       229
    positive      0.649     0.579     0.612       444

    accuracy                          0.549      1101
   macro avg      0.494     0.490     0.481      1101
weighted avg      0.532     0.549     0.531      1101

Assessment dataset 2
              precision    recall  f1-score   support

    negative      0.290     0.632     0.397       565
     neutral      0.457     0.275     0.343      1019
    positive      0.458     0.304     0.365       777

    accuracy                          0.370      2361
   macro avg      0.402     0.403     0.369      2361
weighted avg      0.417     0.370     0.363      2361

Mean of macro-F1 scores: 0.425


## Error analysis

This section begins to build an error-analysis framework using the dicts returned by `sst.experiment`. These have the following structure:

```
'model': trained model
'phi': the feature function used
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_datasets': list of datasets, each with the same structure as the value of 'train_dataset'
'predictions': list of lists of predictions on the assessment datasets
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on each of the assessment datasets
```
The following function just finds mistakes, and returns a `pd.DataFrame` for easy subsequent processing:

In [16]:
def find_errors(experiment):
    """Find mistaken predictions.

    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.

    Returns
    -------
    pd.DataFrame

    """
    dfs = []
    for i, dataset in enumerate(experiment['assess_datasets']):
        df = pd.DataFrame({
            'raw_examples': dataset['raw_examples'],
            'predicted': experiment['predictions'][i],
            'gold': dataset['y']})
        df['correct'] = df['predicted'] == df['gold']
        df['dataset'] = i
        dfs.append(df)
    return pd.concat(dfs)

In [17]:
softmax_analysis = find_errors(softmax_experiment)

In [18]:
rnn_analysis = find_errors(rnn_experiment)

Here we merge the softmax and RNN experiments into a single DataFrame:

In [20]:
analysis = softmax_analysis.merge(
    rnn_analysis, left_on='raw_examples', right_on='raw_examples')

analysis = analysis.drop('gold_y', axis=1).rename(columns={'gold_x': 'gold'})

The following code collects a specific subset of examples; small modifications to its structure will give you different interesting subsets:

In [21]:
# Examples where the softmax model is correct, the RNN is not,
# and the gold label is 'positive'

error_group = analysis[
    (analysis['predicted_x'] == analysis['gold'])
    &
    (analysis['predicted_y'] != analysis['gold'])
    &
    (analysis['gold'] == 'positive')
]

In [22]:
error_group

| | raw_examples | predicted_x | gold | correct_x | dataset_x | predicted_y | correct_y | dataset_y |
|---|---|---|---|---|---|---|---|---|
| 2 | And if you 're not nearly moved to tears by a ... | positive | positive | True | 0 | negative | False | 0 |
| 4 | Uses sharp humor and insight into human nature... | positive | positive | True | 0 | negative | False | 0 |
| 10 | Unlike the speedy wham-bam effect of most Holl... | positive | positive | True | 0 | negative | False | 0 |
| 12 | The band 's courage in the face of official re... | positive | positive | True | 0 | neutral | False | 0 |
| 13 | Although German cooking does not come readily ... | positive | positive | True | 0 | negative | False | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3368 | If you go to Maui and can only eat at one rest... | positive | positive | True | 1 | negative | False | 1 |
| 3377 | I went here for Valentines Day and heard good ... | positive | positive | True | 1 | negative | False | 1 |
| 3380 | The best part of the meyer lemon curd filled f... | positive | positive | True | 1 | negative | False | 1 |
| 3410 | Great Location, Very nice atmosphere. | positive | positive | True | 1 | neutral | False | 1 |


In [24]:
for ex in error_group['raw_examples'].sample(3, random_state=1):
    print("="*70)
    print(ex)

The movie 's relatively simple plot and uncomplicated morality play well with the affable cast .
Have visited restaurant in Naples several times and steaks are outstanding.
The Sunday Brunch at Zefferinos is by far the best brunch at which I have ever dined in my life of 55 years.
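
Other subsets follow the same boolean-mask pattern. The sketch below uses a small hand-made DataFrame (the rows and the name `toy_analysis` are invented for illustration) with the same column names as the merged analysis frame, and shows two variations: examples that both models get wrong, and a gold-by-predicted cross-tabulation for one model.

```python
import pandas as pd

# Toy stand-in for the merged analysis frame; `predicted_x` plays the role
# of the softmax predictions and `predicted_y` the RNN predictions.
toy_analysis = pd.DataFrame({
    'raw_examples': ['a', 'b', 'c', 'd'],
    'gold': ['positive', 'neutral', 'negative', 'neutral'],
    'predicted_x': ['positive', 'positive', 'negative', 'negative'],
    'predicted_y': ['negative', 'positive', 'negative', 'neutral']})

# Examples where both models are wrong:
both_wrong = toy_analysis[
    (toy_analysis['predicted_x'] != toy_analysis['gold'])
    &
    (toy_analysis['predicted_y'] != toy_analysis['gold'])
]

# Confusion counts for one model: rows are gold labels, columns are predictions.
confusion = pd.crosstab(toy_analysis['gold'], toy_analysis['predicted_x'])
```

Filtering additionally on the dataset column would restrict either view to a single assessment set, which helps separate movie-domain errors from restaurant-domain errors.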


## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Token-level differences [1 point]

We can begin to get a sense for how our two dev sets differ by considering the most frequent tokens from each. This question asks you to begin such analysis.

Your task: write a function `get_token_counts` that, given a `pd.DataFrame` in the format of our datasets, tokenizes the example sentences based on whitespace and creates a count distribution over all of the tokens. The function should return a `pd.Series` sorted by frequency; if you create a count dictionary `d`, then `pd.Series(d).sort_values(ascending=False)` will give you what you need.

In [25]:
def get_token_counts(df: pd.DataFrame) -> pd.Series:
    '''
    Given a df with a "sentence" col, returns a sorted Series of word counts of entire sentence col vocab.
    '''
    #coerce to arrays
    sentences = df.sentence.values
    
    #join all sentences together and count
    tokens = Counter(' '.join(sentences).split())
    
    #coerce to Series and sort
    sorted_tokens = pd.Series(tokens).sort_values(ascending=False)
                              
    return sorted_tokens

In [26]:
def test_get_token_counts(func):
    df = pd.DataFrame([
        {'sentence': 'a a b'},
        {'sentence': 'a b a'},
        {'sentence': 'a a a b.'}])
    result = func(df)
    for token, expected in (('a', 7), ('b', 2), ('b.', 1)):
        actual = result.loc[token]
        assert actual == expected, \
            "For token {}, expected {}; got {}".format(
            token, expected, actual)

In [27]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_get_token_counts(get_token_counts)

As you develop your original system, you might review these results. The two dev sets have different vocabularies and different low-level encoding details that are sure to impact model performance, especially when one considers that the train set is like `sst_dev` in all these respects. For additional discussion, see [this notebook section](sst_01_overview.ipynb#Tokenization).
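
One rough way to quantify such differences is to ask what share of one dataset's token occurrences are covered by another dataset's vocabulary. The sketch below does this for lists of sentences under whitespace tokenization; the helper names and the two example sentences are assumptions made for illustration.

```python
from collections import Counter


def token_counts(sentences):
    """Whitespace-token counts over a list of sentences."""
    return Counter(tok for s in sentences for tok in s.split())


def coverage(reference_counts, target_counts):
    """Share of target token occurrences whose type appears in the reference."""
    total = sum(target_counts.values())
    covered = sum(
        c for tok, c in target_counts.items() if tok in reference_counts)
    return covered / total if total else 0.0


# Toy sentences (hypothetical): the first has SST-style pre-tokenized
# punctuation, the second has punctuation attached to the final word.
movie_counts = token_counts(["The film 's plot is simple ."])
restaurant_counts = token_counts(["The steaks are outstanding."])

share = coverage(movie_counts, restaurant_counts)
```

Here only `The` is shared. The token `outstanding.` would not match `outstanding` even if that word were in the reference vocabulary, which is exactly the kind of low-level encoding mismatch that whitespace-based feature functions inherit.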

### Training on some of the bakeoff data [1 point]

We have so far presented the bakeoff dev set as purely for evaluation. Since the train set consists entirely of SST-3 data, this makes the bakeoff split especially challenging. We might be able to reduce the challenge by adding some of the bakeoff dev set to the train set, keeping some of it for evaluation. The current question asks you to begin exploring the effects of such training.

Your task: write a function `run_mixed_training_experiment`. The function should:

1. Take as inputs (a) a model training wrapper like `fit_softmax_classifier` and (b) an integer `bakeoff_train_size` specifying the number of examples from `bakeoff_dev` that should be included in the train set.
1. Split `bakeoff_dev` so that the first `bakeoff_train_size` examples are in the train set and the rest are used for evaluation.
1. Use `sst.experiment` with the user-supplied model training wrapper, `unigrams_phi` as defined above, and a train set that consists of SST-3 train and the train portion of `bakeoff_dev` as defined in step 2. The value of `assess_dataframes` should be a list consisting of the SST-3 dev set and the evaluation portion of `bakeoff_dev` as defined in step 2.
1. Return the return value of `sst.experiment`.

The function `test_run_mixed_training_experiment` will help you iterate to the required design.

In [28]:
def run_mixed_training_experiment(wrapper_func, bakeoff_train_size, vectorize=True):
    
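    # Split bakeoff_dev: the first `bakeoff_train_size` rows go to train,
    # and the rest are held out for evaluation.
    bakeoff_dev_train = bakeoff_dev.iloc[:bakeoff_train_size]
    bakeoff_dev_dev = bakeoff_dev.iloc[bakeoff_train_size:]

    # Combine SST-3 train with the bakeoff train portion. `pd.concat` is used
    # because `DataFrame.append` was removed in pandas 2.0.
    sst_train = sst.train_reader(SST_HOME)
    combined_train = pd.concat([sst_train, bakeoff_dev_train], ignore_index=True)

    # Load SST-3 dev, in case it is not already defined in the notebook.
    sst_dev = sst.dev_reader(SST_HOME)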
    
    #run experiment
    experiment_results = sst.experiment(
                            combined_train,   
                            unigrams_phi,                 
                            wrapper_func,      
                            assess_dataframes=[sst_dev, bakeoff_dev_dev],
                            vectorize=vectorize)
    return experiment_results

In [29]:
def test_run_mixed_training_experiment(func):
    bakeoff_train_size = 1000
    experiment = func(fit_softmax_classifier, bakeoff_train_size)

    assess_size = len(experiment['assess_datasets'])
    assert len(experiment['assess_datasets']) == 2, \
        ("The evaluation should be done on two datasets: "
         "SST3 and part of the bakeoff dev set. "
         "You have {} datasets.".format(assess_size))

    bakeoff_test_size = bakeoff_dev.shape[0] - bakeoff_train_size
    expected_eval_examples = bakeoff_test_size + sst_dev.shape[0]
    eval_examples = sum(len(d['raw_examples']) for d in experiment['assess_datasets'])
    assert expected_eval_examples == eval_examples, \
        "Expected {} evaluation examples; got {}".format(
        expected_eval_examples, eval_examples)

In [30]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_run_mixed_training_experiment(run_mixed_training_experiment)

Assessment dataset 1
              precision    recall  f1-score   support

    negative      0.627     0.671     0.648       428
     neutral      0.319     0.162     0.214       229
    positive      0.638     0.757     0.692       444

    accuracy                          0.599      1101
   macro avg      0.528     0.530     0.518      1101
weighted avg      0.567     0.599     0.576      1101

Assessment dataset 2
              precision    recall  f1-score   support

    negative      0.471     0.412     0.440       320
     neutral      0.588     0.590     0.589       612
    positive      0.503     0.548     0.525       429

    accuracy                          0.535      1361
   macro avg      0.521     0.517     0.518      1361
weighted avg      0.534     0.535     0.534      1361

Mean of macro-F1 scores: 0.518


### A more powerful vector-averaging baseline [2 points]

In [Distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features), we looked at a baseline for the ternary SST-3 problem in which each example is modeled as the mean of its GloVe representations. A `LogisticRegression` model was used for prediction. A neural network might do better with these representations, since there might be complex relationships between the input feature dimensions that a linear classifier can't learn. To address this question, we want to get set up to run the experiment with a shallow neural classifier. 
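
As a reminder of what that feature representation looks like, here is a minimal sketch of a vector-averaging feature function. The name `mean_vector_phi`, the three-dimensional toy lookup, and the zero-vector fallback for sentences with no known words are assumptions made for illustration; a real run would use a GloVe lookup loaded from disk.

```python
import numpy as np


def mean_vector_phi(text, lookup, dim):
    """Represent `text` as the mean of the vectors of its known tokens."""
    vecs = [lookup[w] for w in text.split() if w in lookup]
    if not vecs:
        # Assumed fallback: no known tokens, so return a zero vector.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


# Toy embedding space (hypothetical values, 3 dimensions):
toy_lookup = {
    'good': np.array([1.0, 0.0, 0.0]),
    'food': np.array([0.0, 1.0, 0.0])}

rep = mean_vector_phi("good food here", toy_lookup, dim=3)
```

Since the function already returns a fixed-length vector for each example, it is the kind of feature function that is used with `vectorize=False` in `sst.experiment`.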

Your task: write and submit a model wrapper function around `TorchShallowNeuralClassifier`. This function should implement hyperparameter search according to this specification:

* Set `early_stopping=True` for all experiments.
* Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
  * The hidden dimensionality at 50, 100, and 200.
  * The hidden activation function as `nn.Tanh()` and `nn.ReLU()`.
* For all other parameters to `TorchShallowNeuralClassifier`, use the defaults.

See [this notebook section](sst_02_hand_built_features.ipynb#Hyperparameter-search) for examples. You are not required to run a full evaluation with this function using `sst.experiment`, but we assume you will want to.
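
To get a sense of the cost of this search, the sketch below simply enumerates the grid in the specification. Strings stand in for the activation modules so that the snippet runs without PyTorch; the variable names are illustrative only.

```python
from itertools import product

# The grid from the specification above; 'tanh' and 'relu' are placeholders
# for nn.Tanh() and nn.ReLU().
param_grid = {
    'hidden_dim': [50, 100, 200],
    'hidden_activation': ['tanh', 'relu']}
cv = 3

names = sorted(param_grid)
settings = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))]

# Each setting is trained once per cross-validation fold.
n_cv_fits = len(settings) * cv
```

That is 6 settings and 18 cross-validation fits, each with its own early-stopping run, so it is worth checking the wrapper on a small sample before launching it on the full train set.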

We're not evaluating the quality of your model. (We've specified the protocols completely, but there will still be variation in the results.) However, the primary goal of this question is to get you thinking more about this strong baseline feature representation scheme for SST-3, so we're sort of hoping you feel compelled to try out variations on your own.

In [31]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

def fit_shallow_neural_classifier_with_hyperparameter_search(X, y):
    model = TorchShallowNeuralClassifier(early_stopping=True, device=device)
    cv = 3
    param_grid = {
        'hidden_dim': [50,100,200],
        'hidden_activation':[nn.Tanh(), nn.ReLU()]}
    best_model = utils.fit_classifier_with_hyperparameter_search(X, y, model, cv, param_grid=param_grid)
        
    return best_model

In [156]:
best = sst.experiment(sst_train, 
                      unigrams_phi,
                      fit_shallow_neural_classifier_with_hyperparameter_search,
                      assess_dataframes=sst_dev) 

Stopping after epoch 31. Validation score did not improve by tol=1e-05 for more than 10 epochs. Final error is 0.38966774195432663

Best params: {'hidden_activation': ReLU(), 'hidden_dim': 100}
Best score: 0.519
              precision    recall  f1-score   support

    negative      0.624     0.659     0.641       428
     neutral      0.266     0.148     0.190       229
    positive      0.641     0.752     0.692       444

    accuracy                          0.590      1101
   macro avg      0.510     0.520     0.508      1101
weighted avg      0.556     0.590     0.568      1101



In [32]:
def fit_nn_classifier(X, y):
    mod = TorchShallowNeuralClassifier(
        hidden_dim=200,
        early_stopping=True,      # A basic early stopping set-up.
        validation_fraction=0.1,  # If no improvement on the
        tol=1e-5,                 # validation set is seen within
        n_iter_no_change=10)      # `n_iter_no_change`, we stop.
    mod.fit(X, y)
    return mod

### BERT encoding [2 points]

We might hypothesize that encoding our examples with BERT will yield improvements over the GloVe averaging method explored in the previous question, since BERT implements a much more complex and data-driven function for this kind of combination. This question asks you to begin exploring this general hypothesis.

Your task: write a function `hf_cls_phi` that uses Hugging Face functionality to encode individual examples with BERT and returns the final output representation above the [CLS] token.

You are not required to evaluate this feature function, but it is easy to do so with `sst.experiment` and `vectorize=False` (since your feature function directly encodes every example as a vector). Your code should also be a natural basis for even more powerful approaches – for example, it might be even better to pool all the output states rather than using just the first output state. Another option is [fine-tuning](finetuning.ipynb).
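
The pooling idea can be sketched as follows, using NumPy arrays in place of real BERT outputs. The function name `mean_pool` and the optional `attention_mask` argument are assumptions made for this illustration; with a real model, the output tensor would first need to be converted to a NumPy array.

```python
import numpy as np


def mean_pool(reps, attention_mask=None):
    """Average the output states of one example.

    `reps` has shape (1, n, d). If `attention_mask` (shape (1, n), with 1 for
    real tokens and 0 for padding) is given, padded positions are ignored.
    The mask is assumed to contain at least one 1.
    """
    reps = np.asarray(reps, dtype=float)
    if attention_mask is None:
        return reps.mean(axis=1)[0]
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (1, n, 1)
    return ((reps * mask).sum(axis=1) / mask.sum(axis=1))[0]


# Dummy "BERT output" for a 5-position input with 768-dimensional states,
# where the last two positions are padding:
dummy_reps = np.ones((1, 5, 768))
dummy_reps[0, :, 0] = [0.0, 2.0, 4.0, 100.0, 100.0]
dummy_mask = np.array([[1, 1, 1, 0, 0]])

pooled = mean_pool(dummy_reps, dummy_mask)
```

The result has the same shape as the [CLS] representation, so it can be swapped into the same experiments without other changes.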

In [33]:
from transformers import BertModel, BertTokenizer
import vsm

# Instantiate a Bert model and tokenizer based on `bert_weights_name`:
bert_weights_name = 'bert-base-uncased'

model = BertModel.from_pretrained(bert_weights_name)
tokenizer = BertTokenizer.from_pretrained(bert_weights_name)

def hf_cls_phi(text):
    # Get the ids. `vsm.hf_encode` will help; be sure to
    # set `add_special_tokens=True`.
    encode = vsm.hf_encode(text, tokenizer, add_special_tokens=True)
    
    # Get the BERT representations. `vsm.hf_represent` will help:
    reps = vsm.hf_represent(encode, model)

    # Index into `reps` to get the representation above [CLS].
    # The shape of `reps` should be (1, n, 768), where n is the
    # number of tokens. You need the 0th element of the 2nd dim:
    cls_rep = reps[:,0,:][0]

    # These conversions should ensure that you can work with the
    # representations flexibly. Feel free to change the variable
    # name:
    return cls_rep.cpu().numpy()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [34]:
hf_cls_phi("The dog ate my homework").shape

(768,)

In [35]:
def test_hf_cls_phi(func):
    rep = func("Just testing!")

    expected_shape = (768,)
    result_shape = rep.shape
    assert rep.shape == (768,), \
        "Expected shape {}; got {}".format(
        expected_shape, result_shape)

    # String conversion to avoid precision errors:
    expected_first_val = str(0.1709)
    result_first_val = "{0:.04f}".format(rep[0])

    assert expected_first_val == result_first_val, \
        ("Unexpected representation values. Expected the "
        "first value to be {}; got {}".format(
            expected_first_val, result_first_val))

In [37]:
if 'IS_GRADESCOPE_ENV' not in os.environ:
    test_hf_cls_phi(hf_cls_phi)

Note: encoding all of SST-3 train (no subtrees) takes about 11 minutes on my 2015 iMac, CPU only (32GB).

### Your original system [3 points]

Your task is to develop an original model for the SST-3 problem and our new bakeoff dataset. There are many options. If you spend more than a few hours on this homework problem, you should consider letting it grow into your final project! Here are some relatively manageable ideas that you might try:

1. We didn't systematically evaluate the `bidirectional` option to the `TorchRNNClassifier`. Similarly, that model could be tweaked to allow multiple LSTM layers (at present there is only one), and you could try adding layers to the classifier portion of the model as well.

1. We've already glimpsed the power of rich initial word representations, and later in the course we'll see that smart initialization usually leads to a performance gain in NLP, so you could perhaps achieve a winning entry with a simple model that starts in a great place.

1. Our [practical introduction to contextual word representations](finetuning.ipynb) covers pretrained representations and interfaces that are likely to boost the performance of any system.

We want to emphasize that this needs to be an __original__ system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it. See the course website for additional guidance on how original systems will be evaluated.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.  We also ask that you report the best score your system got during development (your best average of macro-F1 scores), just to help us understand how systems performed overall.

<font color='red'>Please review the descriptions in the following comment and follow the instructions.</font>

In [29]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
#   3) The score achieved by your system in place of MY_NUMBER.
#        With no other changes to that line.
#        You should report your score as a decimal value <=1.0
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# NOTE: MODULES, CODE AND DATASETS REQUIRED FOR YOUR ORIGINAL SYSTEM
# SHOULD BE ADDED BELOW THE 'IS_GRADESCOPE_ENV' CHECK CONDITION. DOING
# SO ABOVE THE CHECK MAY CAUSE THE AUTOGRADER TO FAIL.

# START COMMENT: Enter your system description in this cell.
'''
1. I started by utilizing the sst.experiment framework as provided by the course instructor.  However, after running several experiments, including multiple data augmentation
techniques, I did not get remarkable improvements to the combined mean f1_macros scores.  I therefore decided to try to finetune an existing BERT model from the HF library.

2. I did not have a lot of time to try out several different models (i.e. RoBERTa, Distil-BERT, etc.) and after a few trial runs, I realized that experimentation was going to 
require a multi-GPU machine.  I had never used the nn.DataParallel framework before, so I figured I would try it out for this bake-off. 

3. I ended up creating a simple classification model that uses the last layer of the pretrained BERT model linearly connected to an output layer with dropout to improve generalization.
I started with bert-base which worked well, but saw a 5-point F1_score jump by using a bert-large model.  Training time for a 20K sentence dataset at a batch size of 32 took 30-40 minutes 
to complete 3 epochs.  GPU capacity was definitely a limiting factor in the fine-tune training and evaluation of these models.  

4. I ended up combining dev datasets from SST3, Bake off (restaurant reviews) and a random 5,000 sample of the Dynasent dataset.  If I had more time I would have done a better job
doing error analysis.  One thought I had that I did not have time to implement was creating augmented sentences from the known errors of the model and feeding them back into the system
to force the model to recognize where it was messing up and correcting itself.  I evaluated on the f1_scores attained on both the sst3-dev and bakeoff-dev datasets. 
'''
# My peak score was: 0.741
if 'IS_GRADESCOPE_ENV' not in os.environ:
    #pytorch imports
    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader

    #HuggingFace imports
    from transformers import BertModel, BertTokenizer
    from transformers.file_utils import PaddingStrategy
    from transformers import get_linear_schedule_with_warmup
    from datasets import load_dataset

    #data science imports
    from sklearn.metrics import classification_report, f1_score
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from tqdm import tqdm

    #cs224u imports
    import sst, vsm, utils

    #python standard libraries
    from collections import defaultdict
    import time
    import os
    
    #label GPU device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    #instantiate bert tokenizer
    weights_name = 'bert-large-cased'
    bert_tokenizer = BertTokenizer.from_pretrained(weights_name)
    
    #function for converting pd.DataFrame data into a (text, labels) pair of lists
    def create_dataset(df: pd.DataFrame) -> tuple:
        label_types = sorted(df.label.unique().tolist())
        assert ['negative', 'neutral', 'positive'] == label_types
        label_map = {label:index for index, label in enumerate(label_types)}

        text = df.sentence.values.tolist()
        labels = df.label.apply(lambda x: label_map[x]).values.tolist()
        assert len(text) == len(labels)

        return text, labels
    
    #blueprint for use with torch.DataLoader Class
    class getSentences(Dataset):
        '''
        Given an X and Y input, encodes these values for model consumption,
        and is the precursor function for the DataLoader.
        
        Returns a dict of values.
        '''
        def __init__(self, sentences, labels, tokenizer, max_len):
            self.sentences = sentences
            self.labels = labels
            self.tokenizer = tokenizer
            self.max_len = max_len

        def __repr__(self):
            return f'Sentences: {len(self.sentences)}     Labels: {len(self.labels)}'

        def __len__(self):
            return (len(self.sentences))

        def __getitem__(self,index):
            sentence = self.sentences[index]
            encoding = self.tokenizer.encode_plus(
                              sentence,
                              add_special_tokens=True,
                              max_length=self.max_len,
                              truncation=True,
                              return_token_type_ids=False,
                              padding=PaddingStrategy.MAX_LENGTH,
                              return_attention_mask=True,
                              return_tensors='pt')

            return {'text' : sentence,
                    'input_id': encoding['input_ids'].flatten(),
                    'attention_mask':encoding['attention_mask'].flatten(),
                    'labels': torch.tensor(self.labels[index], dtype = torch.long)
                   }
    ######################################################################################################################
    #in place of real dataset I am passing string values so that autograder won't fail
    train_X, train_y, dev_X, dev_y, dev_bake_X, dev_bake_y = 'trainx', 'trainy', 'devx', 'devy', 'dev_bakex', 'dev_bakey'
    ######################################################################################################################
    
    #set user-defined constants
    BATCH_SIZE = 64
    MAX_LEN = 300
    NUM_TRAIN_SAMPLES = len(train_X)
    NUM_VAL_SAMPLES = len(dev_X)
    NUM_BAKE_SAMPLES = len(dev_bake_X)
    
    #encoded inputs for DataLoader
    #Training Data
    #Two sets of dev data, one for SST3 and one for Bake-off
    training_data = getSentences(
                        sentences = train_X,
                        labels = train_y,
                        tokenizer = bert_tokenizer,
                        max_len = MAX_LEN)

    val_data = getSentences(
                    sentences = dev_X,
                    labels = dev_y,
                    tokenizer = bert_tokenizer,
                    max_len = MAX_LEN)

    val_bake_data = getSentences(
                    sentences = dev_bake_X,
                    labels = dev_bake_y,
                    tokenizer = bert_tokenizer,
                    max_len = MAX_LEN)

    train_loader = DataLoader(training_data, BATCH_SIZE, shuffle = True)
    val_loader = DataLoader(val_data, BATCH_SIZE, shuffle = True)
    val_bake_loader = DataLoader(val_bake_data, BATCH_SIZE, shuffle = True)
        
    #blueprint for BERT Classification task
    class BertClassifier(nn.Module):
        def __init__(self, model_name, num_classes):
            super(BertClassifier,self).__init__()
            self.bert = BertModel.from_pretrained(model_name)
            self.dropout = nn.Dropout(p = 0.3)
            self.linear = nn.Linear(self.bert.config.hidden_size,num_classes)
            self.softmax = nn.Softmax(dim = 1)

        def forward(self,input_ids, attention_mask):
            temp = self.bert(input_ids, attention_mask)  
            pooled_output = temp[1]                            
            out = self.dropout(pooled_output)          
            out = self.linear(out)
            return out   # -> raw logits; CrossEntropyLoss applies log-softmax internally
        
    #instantiate model with 3 classes as output
    num_classes = 3
    model = BertClassifier(weights_name, num_classes)
    
    #due to size of model, recommended to run on multi-gpu system
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    #move the model to the device whether or not it was wrapped above
    model.to(device)
        
    #set model hyperparameters
    learning_rate = 1e-5
    EPOCHS = 3
    steps = len(train_loader) * EPOCHS
    loss_fn = torch.nn.CrossEntropyLoss().to(device)
    optim = torch.optim.AdamW(params = model.parameters(),lr = learning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer=optim, num_warmup_steps=0,num_training_steps = steps)
    
    #set model training loop through one epoch
    def train_model(model, data_loader=train_loader, loss_function=loss_fn, optimizer=optim, scheduler=scheduler, n_examples=NUM_TRAIN_SAMPLES):
        '''
        Model training function that represents one pass through the data.
        Returns accuracy and mean total loss.  This function is meant to be 
        paired with an "eval model" function.
        '''
        #set model in train mode so that dropout is active
        model.train()
        batches = len(data_loader)
        train_loss = []
        correct_predictions = 0


        for d in tqdm(data_loader):

            #grab data in batches and move to GPU
            input_ids = d['input_id'].to(device)
            masks = d['attention_mask'].to(device)
            labels = d['labels'].to(device)

            #forward propagation
            predictions = model(input_ids, masks)
            loss = loss_function(predictions, labels)
            _, pred_classes = torch.max(predictions, dim=1)

            #back propagation
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            #collect loss and acc measures
            train_loss.append(loss.item())
            correct_predictions += torch.sum(pred_classes==labels)

        return (correct_predictions/n_examples).cpu().numpy(), np.mean(train_loss)
    
    #set model eval loop through one epoch
    def eval_model(model, data_loader=val_loader, loss_function=loss_fn, n_examples=NUM_VAL_SAMPLES):
        '''
        Model evaluation function that represents one pass through the data.
        Returns accuracy, mean loss, the classification report, and macro-F1.
        This function is meant to be paired with the "train_model" function.
        '''
        
        #set model in eval mode so that dropout is disabled
        model.eval()
        eval_loss = []
        correct_predictions = 0
        all_predictions = []
        all_labels = [] 

        with torch.no_grad():
            for d in tqdm(data_loader):

                input_ids = d['input_id'].to(device)
                masks = d['attention_mask'].to(device)
                labels = d['labels'].to(device)

                #forward prop for inference
                predictions = model(input_ids, masks)
                loss = loss_function(predictions, labels)
                _,pred_classes = torch.max(predictions, dim=1)

                #collect preds/labels for class_report
                all_predictions.extend(pred_classes.cpu().tolist())
                all_labels.extend(labels.cpu().tolist())

                #collect loss and acc measures
                eval_loss.append(loss.item())
                correct_predictions += torch.sum(pred_classes==labels)

        report = classification_report(all_labels, 
                                       all_predictions, 
                                       labels=[0,1,2], 
                                       target_names=['negative', 'neutral', 'positive'])
        
        f1_macro = f1_score(all_labels, all_predictions, average='macro')

        return (correct_predictions / n_examples).cpu().numpy(), np.mean(eval_loss), report, f1_macro
    
    #this is the execution function for the model training/evaluation cycle
    def run(model, epochs: int=3):
        '''
        Execution function for the model train/eval cycle.
        Automatically saves the model weights to file whenever the
        mean macro-F1 exceeds the best value seen so far.
        '''
        
        tracking = defaultdict(list)
        #start from a high threshold so that weak models are not saved to disk
        best_macro = 0.70
        best_report = None

        EPOCHS = epochs

        start = time.perf_counter()

        for epoch in range(EPOCHS):
            print(f'epoch : {epoch+1}/{EPOCHS}')

            train_acc, train_loss = train_model(model, data_loader=train_loader, n_examples=NUM_TRAIN_SAMPLES)
            val_acc , val_loss, report, val_f1 = eval_model(model, data_loader=val_loader, n_examples=NUM_VAL_SAMPLES)
            bakeoff_acc, bakeoff_loss, bakeoff_report, bakeoff_f1 = eval_model(model, data_loader=val_bake_loader, n_examples=NUM_BAKE_SAMPLES)

            mean_f1_macro = np.mean([val_f1, bakeoff_f1])
            print(f'Mean f1_macro = {mean_f1_macro}')

            tracking['train_acc'].append(train_acc)
            tracking['train_loss'].append(train_loss)
            tracking['val_acc'].append((val_acc, report))
            tracking['val_loss'].append(val_loss)
            tracking['bake_acc'].append((bakeoff_acc, bakeoff_report))
            tracking['bake_loss'].append(bakeoff_loss)

            scores = np.round([train_loss, train_acc, val_loss, val_acc, bakeoff_loss, bakeoff_acc],3)
            print(f'train_loss: {scores[0]}, train_acc: {scores[1]}\
                    \nval_loss: {scores[2]}, val_acc: {scores[3]}\
                    \nbake_loss: {scores[4]}, bake_acc: {scores[5]}')

            if mean_f1_macro > best_macro:
                best_model_name = f'/home/americanthinker/notebooks/pytorch/cs224u/saved_models/{weights_name}_{mean_f1_macro}.bin'
                torch.save(model.state_dict(), best_model_name)
                print(f"New model saved with mean f1_macro of: {mean_f1_macro}")
                best_macro = mean_f1_macro
                best_report = [report, bakeoff_report]
            else: 
                print('mean_f1 not better than best macro')

        end = time.perf_counter() - start

        print(f'Total time for {EPOCHS} epochs: {np.round(end/60, 1)} minutes')
        print(f'Classification Report:')
        if best_report:
            print(best_report)

        return tracking
    
    tracker = run(model=model, epochs=5)
    
# STOP COMMENT: Please do not remove this comment.

Some weights of the model checkpoint at bert-large-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DataParallel(
  (module): BertClassifier(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 1024, padding_idx=0)
        (position_embeddings): Embedding(512, 1024)
        (token_type_embeddings): Embedding(2, 1024)
        (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=1024, out_features=1024, bias=True)
                (key): Linear(in_features=1024, out_features=1024, bias=True)
                (value): Linear(in_features=1024, out_features=1024, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=1024, out_features=1024, bias=True)


epoch : 1/5


  0%|          | 0/1 [00:00<?, ?it/s]


TypeError: new(): invalid data type 'str'
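
One detail in the cell above may be worth checking: the scheduler is built with `steps = len(train_loader) * EPOCHS` where `EPOCHS = 3`, while the final call is `run(model=model, epochs=5)`. Assuming `get_linear_schedule_with_warmup` scales the learning rate linearly down to zero at `num_training_steps`, as the Hugging Face documentation describes, epochs 4 and 5 would run with a learning rate of zero. The pure-Python sketch below mimics that multiplier. The name `linear_lr_factor` and the batch count of 313 (roughly 20K sentences at a batch size of 64) are illustrative assumptions, not part of the original code.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warm-up from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0, clamped at 0 once the schedule is exhausted.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


batches_per_epoch = 313   # hypothetical: ~20K sentences / batch size 64
scheduled_epochs = 3      # the EPOCHS value used to compute `steps`
actual_epochs = 5         # the value passed to run(...)
total_steps = batches_per_epoch * scheduled_epochs

for epoch in range(actual_epochs):
    first_step = epoch * batches_per_epoch
    factor = linear_lr_factor(first_step, 0, total_steps)
    print(f"epoch {epoch + 1}: lr factor at first step = {factor:.3f}")
```

If this reading is right, computing `steps` from the same epoch count that is passed to `run` would keep the learning rate positive for the whole run.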

## Bakeoff [1 point]

As we said above, the bakeoff evaluation data is the official SST test-set release and a new test set derived from the same sources and labeling methods as for `bakeoff_dev`.

For this bakeoff, you'll evaluate your original system from the above homework problem on these test sets. Our metric will be the mean of the macro-F1 values, which weights both datasets equally despite their differing sizes.
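
To make the metric concrete, here is a dependency-free sketch of macro-F1 and of the bakeoff score as the mean over two datasets. The function names and the toy labels are made up for illustration. In practice, `sklearn.metrics.f1_score(..., average='macro')` should give the same per-dataset value when all three classes occur in the data.

```python
LABELS = ('negative', 'neutral', 'positive')


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no gold and no predicted examples contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bakeoff_score(sst_gold, sst_pred, rest_gold, rest_pred):
    """Mean of the two macro-F1 values, so each dataset counts equally."""
    return (macro_f1(sst_gold, sst_pred) + macro_f1(rest_gold, rest_pred)) / 2


# Toy data, for illustration only.
sst_gold = ['positive', 'negative', 'neutral', 'positive']
sst_pred = ['positive', 'negative', 'positive', 'positive']
rest_gold = ['negative', 'neutral', 'positive']
rest_pred = ['negative', 'neutral', 'positive']

print(macro_f1(sst_gold, sst_pred))
print(bakeoff_score(sst_gold, sst_pred, rest_gold, rest_pred))
```

Because the two macro-F1 values are averaged directly, the smaller dataset influences the final score as much as the larger one.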

The central requirement for your system is that you define a `predict_one` method for it that maps a text (str) directly to a label prediction – one of 'positive', 'negative', 'neutral'. If you used `sst.experiment` with `vectorize=True`, then the following function (for `softmax_experiment`) will be easy to adapt – you probably just need to change the variable `softmax_experiment` to the variable for your experiment output.

In [None]:
def predict_one_softmax(text):
    # Singleton list of feature dicts:
    feats = [softmax_experiment['phi'](text)]
    # Vectorize to get a feature matrix:
    X = softmax_experiment['train_dataset']['vectorizer'].transform(feats)
    # Standard sklearn `predict` step:
    preds = softmax_experiment['model'].predict(X)
    # Be sure to return the only member of the predictions,
    # rather than the singleton list:
    return preds[0]

If you used an RNN like the one we demoed above, then featurization is a bit more straightforward:

In [None]:
def predict_one_rnn(text):
    # List of tokenized examples:
    X = [rnn_experiment['phi'](text)]
    # Standard `predict` step on a list of lists of str:
    preds = rnn_experiment['model'].predict(X)
    # Be sure to return the only member of the predictions,
    # rather than the singleton list:
    return preds[0]
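
Whichever variant you adapt, it may be worth sanity-checking the function before building the submission file. The helper below is a hypothetical convenience, not part of the course code: it calls a `predict_one`-style function on a few made-up texts and verifies that each result is one of the three allowed label strings. The `dummy_predict_one` stand-in exists only so that the sketch runs on its own.

```python
VALID_LABELS = ('positive', 'negative', 'neutral')


def check_predict_one(predict_one, texts):
    """Return the predictions for `texts`, raising if any is not a valid label."""
    preds = []
    for text in texts:
        pred = predict_one(text)
        if not isinstance(pred, str) or pred not in VALID_LABELS:
            raise ValueError(f"Unexpected prediction {pred!r} for text {text!r}")
        preds.append(pred)
    return preds


def dummy_predict_one(text):
    # Stand-in for a real system: a keyword lookup, for illustration only.
    lowered = text.lower()
    if 'great' in lowered:
        return 'positive'
    if 'terrible' in lowered:
        return 'negative'
    return 'neutral'


examples = ["The food was great.", "Terrible service.", "We arrived at noon."]
print(check_predict_one(dummy_predict_one, examples))
```

A check like this catches the common mistake of returning a singleton list or a class index instead of the label string itself.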

The following function is used to create the bakeoff submission file. Its arguments are your `predict_one` function and an output filename (str).

In [12]:
weights_name = 'bert-large-cased'
bert_tokenizer = BertTokenizer.from_pretrained(weights_name)

class BertClassifier(nn.Module):
    def __init__(self, model_name, num_classes):
        super(BertClassifier,self).__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(p = 0.3)
        self.linear = nn.Linear(self.bert.config.hidden_size,num_classes)
        self.softmax = nn.Softmax(dim = 1)
        
    def forward(self,input_ids, attention_mask):
        temp = self.bert(input_ids, attention_mask)  
        pooled_output = temp[1]                            
        out = self.dropout(pooled_output)          
        out = self.linear(out)
        return out

num_classes = 3
model = BertClassifier(weights_name, 3)
model.to(device)
# if torch.cuda.device_count() > 1:
#     print(f'Device Count: {torch.cuda.device_count()}')
#     model = nn.DataParallel(model)
#     model.to(device)

Some weights of the model checkpoint at bert-large-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwi

In [13]:
checkpoint = torch.load('../../Downloads/bert-large-cased_0.741.bin', map_location=device)
model.load_state_dict(checkpoint)

def predict_one(text):
    encoding = bert_tokenizer.encode_plus(
                          text,
                          add_special_tokens=True,
                          max_length=300,
                          truncation=True,
                          return_token_type_ids=False,
                          padding=PaddingStrategy.MAX_LENGTH,
                          return_attention_mask=True,
                          return_tensors='pt')
    
    input_id = encoding['input_ids'].to(device)
    mask = encoding['attention_mask'].to(device)
    # Disable dropout and gradient tracking for inference:
    model.eval()
    with torch.no_grad():
        output = model(input_id, mask)
    _, prediction_class = torch.max(output, dim=1)
    prediction_class = prediction_class.cpu().numpy().tolist()[0]
    
    label_types = ['negative', 'neutral', 'positive']
    class_map = {index:label for index, label in enumerate(label_types)}
    prediction = class_map[prediction_class]
    
    return prediction
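
The `RuntimeError` raised by the `load_state_dict` call above lists every `bert.*` parameter as missing. A likely cause, given that the training cell wrapped the model in `nn.DataParallel` before calling `torch.save(model.state_dict(), ...)`, is that the saved keys carry a `module.` prefix that the bare `BertClassifier` does not expect. The sketch below shows one way to strip that prefix. `strip_module_prefix` is a hypothetical helper, and a plain dict with placeholder values stands in for the real checkpoint.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix='module.'):
    """Return a copy of `state_dict` whose keys no longer start with `prefix`."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Placeholder strings stand in for the real tensors.
saved = OrderedDict([
    ('module.bert.embeddings.word_embeddings.weight', 'tensor-0'),
    ('module.linear.weight', 'tensor-1'),
    ('module.linear.bias', 'tensor-2'),
])

print(list(strip_module_prefix(saved)))
```

With the real checkpoint, the call would then look like `model.load_state_dict(strip_module_prefix(checkpoint))`. Wrapping the model in `nn.DataParallel` before loading is an alternative that should accept the prefixed keys as they are.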

RuntimeError: Error(s) in loading state_dict for BertClassifier:
	Missing key(s) in state_dict: "bert.embeddings.position_ids", "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias", "bert.encoder.layer.0.attention.self.key.weight", "bert.encoder.layer.0.attention.self.key.bias", "bert.encoder.layer.0.attention.self.value.weight", "bert.encoder.layer.0.attention.self.value.bias", "bert.encoder.layer.0.attention.output.dense.weight", "bert.encoder.layer.0.attention.output.dense.bias", "bert.encoder.layer.0.attention.output.LayerNorm.weight", "bert.encoder.layer.0.attention.output.LayerNorm.bias", "bert.encoder.layer.0.intermediate.dense.weight", "bert.encoder.layer.0.intermediate.dense.bias", "bert.encoder.layer.0.output.dense.weight", "bert.encoder.layer.0.output.dense.bias", "bert.encoder.layer.0.output.LayerNorm.weight", "bert.encoder.layer.0.output.LayerNorm.bias", "bert.encoder.layer.1.attention.self.query.weight", "bert.encoder.layer.1.attention.self.query.bias", "bert.encoder.layer.1.attention.self.key.weight", "bert.encoder.layer.1.attention.self.key.bias", "bert.encoder.layer.1.attention.self.value.weight", "bert.encoder.layer.1.attention.self.value.bias", "bert.encoder.layer.1.attention.output.dense.weight", "bert.encoder.layer.1.attention.output.dense.bias", "bert.encoder.layer.1.attention.output.LayerNorm.weight", "bert.encoder.layer.1.attention.output.LayerNorm.bias", "bert.encoder.layer.1.intermediate.dense.weight", "bert.encoder.layer.1.intermediate.dense.bias", "bert.encoder.layer.1.output.dense.weight", "bert.encoder.layer.1.output.dense.bias", "bert.encoder.layer.1.output.LayerNorm.weight", "bert.encoder.layer.1.output.LayerNorm.bias", "bert.encoder.layer.2.attention.self.query.weight", "bert.encoder.layer.2.attention.self.query.bias", 
"bert.encoder.layer.2.attention.self.key.weight", "bert.encoder.layer.2.attention.self.key.bias", "bert.encoder.layer.2.attention.self.value.weight", "bert.encoder.layer.2.attention.self.value.bias", "bert.encoder.layer.2.attention.output.dense.weight", "bert.encoder.layer.2.attention.output.dense.bias", "bert.encoder.layer.2.attention.output.LayerNorm.weight", "bert.encoder.layer.2.attention.output.LayerNorm.bias", "bert.encoder.layer.2.intermediate.dense.weight", "bert.encoder.layer.2.intermediate.dense.bias", "bert.encoder.layer.2.output.dense.weight", "bert.encoder.layer.2.output.dense.bias", "bert.encoder.layer.2.output.LayerNorm.weight", "bert.encoder.layer.2.output.LayerNorm.bias", "bert.encoder.layer.3.attention.self.query.weight", "bert.encoder.layer.3.attention.self.query.bias", "bert.encoder.layer.3.attention.self.key.weight", "bert.encoder.layer.3.attention.self.key.bias", "bert.encoder.layer.3.attention.self.value.weight", "bert.encoder.layer.3.attention.self.value.bias", "bert.encoder.layer.3.attention.output.dense.weight", "bert.encoder.layer.3.attention.output.dense.bias", "bert.encoder.layer.3.attention.output.LayerNorm.weight", "bert.encoder.layer.3.attention.output.LayerNorm.bias", "bert.encoder.layer.3.intermediate.dense.weight", "bert.encoder.layer.3.intermediate.dense.bias", "bert.encoder.layer.3.output.dense.weight", "bert.encoder.layer.3.output.dense.bias", "bert.encoder.layer.3.output.LayerNorm.weight", "bert.encoder.layer.3.output.LayerNorm.bias", "bert.encoder.layer.4.attention.self.query.weight", "bert.encoder.layer.4.attention.self.query.bias", "bert.encoder.layer.4.attention.self.key.weight", "bert.encoder.layer.4.attention.self.key.bias", "bert.encoder.layer.4.attention.self.value.weight", "bert.encoder.layer.4.attention.self.value.bias", "bert.encoder.layer.4.attention.output.dense.weight", "bert.encoder.layer.4.attention.output.dense.bias", "bert.encoder.layer.4.attention.output.LayerNorm.weight", 
"bert.encoder.layer.4.attention.output.LayerNorm.bias", "bert.encoder.layer.4.intermediate.dense.weight", "bert.encoder.layer.4.intermediate.dense.bias", "bert.encoder.layer.4.output.dense.weight", "bert.encoder.layer.4.output.dense.bias", "bert.encoder.layer.4.output.LayerNorm.weight", "bert.encoder.layer.4.output.LayerNorm.bias", "bert.encoder.layer.5.attention.self.query.weight", "bert.encoder.layer.5.attention.self.query.bias", "bert.encoder.layer.5.attention.self.key.weight", "bert.encoder.layer.5.attention.self.key.bias", "bert.encoder.layer.5.attention.self.value.weight", "bert.encoder.layer.5.attention.self.value.bias", "bert.encoder.layer.5.attention.output.dense.weight", "bert.encoder.layer.5.attention.output.dense.bias", "bert.encoder.layer.5.attention.output.LayerNorm.weight", "bert.encoder.layer.5.attention.output.LayerNorm.bias", "bert.encoder.layer.5.intermediate.dense.weight", "bert.encoder.layer.5.intermediate.dense.bias", "bert.encoder.layer.5.output.dense.weight", "bert.encoder.layer.5.output.dense.bias", "bert.encoder.layer.5.output.LayerNorm.weight", "bert.encoder.layer.5.output.LayerNorm.bias", "bert.encoder.layer.6.attention.self.query.weight", "bert.encoder.layer.6.attention.self.query.bias", "bert.encoder.layer.6.attention.self.key.weight", "bert.encoder.layer.6.attention.self.key.bias", "bert.encoder.layer.6.attention.self.value.weight", "bert.encoder.layer.6.attention.self.value.bias", "bert.encoder.layer.6.attention.output.dense.weight", "bert.encoder.layer.6.attention.output.dense.bias", "bert.encoder.layer.6.attention.output.LayerNorm.weight", "bert.encoder.layer.6.attention.output.LayerNorm.bias", "bert.encoder.layer.6.intermediate.dense.weight", "bert.encoder.layer.6.intermediate.dense.bias", "bert.encoder.layer.6.output.dense.weight", "bert.encoder.layer.6.output.dense.bias", "bert.encoder.layer.6.output.LayerNorm.weight", "bert.encoder.layer.6.output.LayerNorm.bias", "bert.encoder.layer.7.attention.self.query.weight", 
"bert.encoder.layer.7.attention.self.query.bias", "bert.encoder.layer.7.attention.self.key.weight", "bert.encoder.layer.7.attention.self.key.bias", "bert.encoder.layer.7.attention.self.value.weight", "bert.encoder.layer.7.attention.self.value.bias", "bert.encoder.layer.7.attention.output.dense.weight", "bert.encoder.layer.7.attention.output.dense.bias", "bert.encoder.layer.7.attention.output.LayerNorm.weight", "bert.encoder.layer.7.attention.output.LayerNorm.bias", "bert.encoder.layer.7.intermediate.dense.weight", "bert.encoder.layer.7.intermediate.dense.bias", "bert.encoder.layer.7.output.dense.weight", "bert.encoder.layer.7.output.dense.bias", "bert.encoder.layer.7.output.LayerNorm.weight", "bert.encoder.layer.7.output.LayerNorm.bias", "bert.encoder.layer.8.attention.self.query.weight", "bert.encoder.layer.8.attention.self.query.bias", "bert.encoder.layer.8.attention.self.key.weight", "bert.encoder.layer.8.attention.self.key.bias", "bert.encoder.layer.8.attention.self.value.weight", "bert.encoder.layer.8.attention.self.value.bias", "bert.encoder.layer.8.attention.output.dense.weight", "bert.encoder.layer.8.attention.output.dense.bias", "bert.encoder.layer.8.attention.output.LayerNorm.weight", "bert.encoder.layer.8.attention.output.LayerNorm.bias", "bert.encoder.layer.8.intermediate.dense.weight", "bert.encoder.layer.8.intermediate.dense.bias", "bert.encoder.layer.8.output.dense.weight", "bert.encoder.layer.8.output.dense.bias", "bert.encoder.layer.8.output.LayerNorm.weight", "bert.encoder.layer.8.output.LayerNorm.bias", "bert.encoder.layer.9.attention.self.query.weight", "bert.encoder.layer.9.attention.self.query.bias", "bert.encoder.layer.9.attention.self.key.weight", "bert.encoder.layer.9.attention.self.key.bias", "bert.encoder.layer.9.attention.self.value.weight", "bert.encoder.layer.9.attention.self.value.bias", "bert.encoder.layer.9.attention.output.dense.weight", "bert.encoder.layer.9.attention.output.dense.bias", 
"bert.encoder.layer.9.attention.output.LayerNorm.weight", "bert.encoder.layer.9.attention.output.LayerNorm.bias", "bert.encoder.layer.9.intermediate.dense.weight", "bert.encoder.layer.9.intermediate.dense.bias", "bert.encoder.layer.9.output.dense.weight", "bert.encoder.layer.9.output.dense.bias", "bert.encoder.layer.9.output.LayerNorm.weight", "bert.encoder.layer.9.output.LayerNorm.bias", "bert.encoder.layer.10.attention.self.query.weight", "bert.encoder.layer.10.attention.self.query.bias", "bert.encoder.layer.10.attention.self.key.weight", "bert.encoder.layer.10.attention.self.key.bias", "bert.encoder.layer.10.attention.self.value.weight", "bert.encoder.layer.10.attention.self.value.bias", "bert.encoder.layer.10.attention.output.dense.weight", "bert.encoder.layer.10.attention.output.dense.bias", "bert.encoder.layer.10.attention.output.LayerNorm.weight", "bert.encoder.layer.10.attention.output.LayerNorm.bias", "bert.encoder.layer.10.intermediate.dense.weight", "bert.encoder.layer.10.intermediate.dense.bias", "bert.encoder.layer.10.output.dense.weight", "bert.encoder.layer.10.output.dense.bias", "bert.encoder.layer.10.output.LayerNorm.weight", "bert.encoder.layer.10.output.LayerNorm.bias", "bert.encoder.layer.11.attention.self.query.weight", "bert.encoder.layer.11.attention.self.query.bias", "bert.encoder.layer.11.attention.self.key.weight", "bert.encoder.layer.11.attention.self.key.bias", "bert.encoder.layer.11.attention.self.value.weight", "bert.encoder.layer.11.attention.self.value.bias", "bert.encoder.layer.11.attention.output.dense.weight", "bert.encoder.layer.11.attention.output.dense.bias", "bert.encoder.layer.11.attention.output.LayerNorm.weight", "bert.encoder.layer.11.attention.output.LayerNorm.bias", "bert.encoder.layer.11.intermediate.dense.weight", "bert.encoder.layer.11.intermediate.dense.bias", "bert.encoder.layer.11.output.dense.weight", "bert.encoder.layer.11.output.dense.bias", "bert.encoder.layer.11.output.LayerNorm.weight", 
"bert.encoder.layer.11.output.LayerNorm.bias", "bert.encoder.layer.12.attention.self.query.weight", "bert.encoder.layer.12.attention.self.query.bias", "bert.encoder.layer.12.attention.self.key.weight", "bert.encoder.layer.12.attention.self.key.bias", "bert.encoder.layer.12.attention.self.value.weight", "bert.encoder.layer.12.attention.self.value.bias", "bert.encoder.layer.12.attention.output.dense.weight", "bert.encoder.layer.12.attention.output.dense.bias", "bert.encoder.layer.12.attention.output.LayerNorm.weight", "bert.encoder.layer.12.attention.output.LayerNorm.bias", "bert.encoder.layer.12.intermediate.dense.weight", "bert.encoder.layer.12.intermediate.dense.bias", "bert.encoder.layer.12.output.dense.weight", "bert.encoder.layer.12.output.dense.bias", "bert.encoder.layer.12.output.LayerNorm.weight", "bert.encoder.layer.12.output.LayerNorm.bias", "bert.encoder.layer.13.attention.self.query.weight", "bert.encoder.layer.13.attention.self.query.bias", "bert.encoder.layer.13.attention.self.key.weight", "bert.encoder.layer.13.attention.self.key.bias", "bert.encoder.layer.13.attention.self.value.weight", "bert.encoder.layer.13.attention.self.value.bias", "bert.encoder.layer.13.attention.output.dense.weight", "bert.encoder.layer.13.attention.output.dense.bias", "bert.encoder.layer.13.attention.output.LayerNorm.weight", "bert.encoder.layer.13.attention.output.LayerNorm.bias", "bert.encoder.layer.13.intermediate.dense.weight", "bert.encoder.layer.13.intermediate.dense.bias", "bert.encoder.layer.13.output.dense.weight", "bert.encoder.layer.13.output.dense.bias", "bert.encoder.layer.13.output.LayerNorm.weight", "bert.encoder.layer.13.output.LayerNorm.bias", "bert.encoder.layer.14.attention.self.query.weight", "bert.encoder.layer.14.attention.self.query.bias", "bert.encoder.layer.14.attention.self.key.weight", "bert.encoder.layer.14.attention.self.key.bias", "bert.encoder.layer.14.attention.self.value.weight", "bert.encoder.layer.14.attention.self.value.bias", 
"bert.encoder.layer.14.attention.output.dense.weight", "bert.encoder.layer.14.attention.output.dense.bias", "bert.encoder.layer.14.attention.output.LayerNorm.weight", "bert.encoder.layer.14.attention.output.LayerNorm.bias", "bert.encoder.layer.14.intermediate.dense.weight", "bert.encoder.layer.14.intermediate.dense.bias", "bert.encoder.layer.14.output.dense.weight", "bert.encoder.layer.14.output.dense.bias", "bert.encoder.layer.14.output.LayerNorm.weight", "bert.encoder.layer.14.output.LayerNorm.bias", "bert.encoder.layer.15.attention.self.query.weight", "bert.encoder.layer.15.attention.self.query.bias", "bert.encoder.layer.15.attention.self.key.weight", "bert.encoder.layer.15.attention.self.key.bias", "bert.encoder.layer.15.attention.self.value.weight", "bert.encoder.layer.15.attention.self.value.bias", "bert.encoder.layer.15.attention.output.dense.weight", "bert.encoder.layer.15.attention.output.dense.bias", "bert.encoder.layer.15.attention.output.LayerNorm.weight", "bert.encoder.layer.15.attention.output.LayerNorm.bias", "bert.encoder.layer.15.intermediate.dense.weight", "bert.encoder.layer.15.intermediate.dense.bias", "bert.encoder.layer.15.output.dense.weight", "bert.encoder.layer.15.output.dense.bias", "bert.encoder.layer.15.output.LayerNorm.weight", "bert.encoder.layer.15.output.LayerNorm.bias", "bert.encoder.layer.16.attention.self.query.weight", "bert.encoder.layer.16.attention.self.query.bias", "bert.encoder.layer.16.attention.self.key.weight", "bert.encoder.layer.16.attention.self.key.bias", "bert.encoder.layer.16.attention.self.value.weight", "bert.encoder.layer.16.attention.self.value.bias", "bert.encoder.layer.16.attention.output.dense.weight", "bert.encoder.layer.16.attention.output.dense.bias", "bert.encoder.layer.16.attention.output.LayerNorm.weight", "bert.encoder.layer.16.attention.output.LayerNorm.bias", "bert.encoder.layer.16.intermediate.dense.weight", "bert.encoder.layer.16.intermediate.dense.bias", 
"bert.encoder.layer.16.output.dense.weight", "bert.encoder.layer.16.output.dense.bias", "bert.encoder.layer.16.output.LayerNorm.weight", "bert.encoder.layer.16.output.LayerNorm.bias", "bert.encoder.layer.17.attention.self.query.weight", "bert.encoder.layer.17.attention.self.query.bias", "bert.encoder.layer.17.attention.self.key.weight", "bert.encoder.layer.17.attention.self.key.bias", "bert.encoder.layer.17.attention.self.value.weight", "bert.encoder.layer.17.attention.self.value.bias", "bert.encoder.layer.17.attention.output.dense.weight", "bert.encoder.layer.17.attention.output.dense.bias", "bert.encoder.layer.17.attention.output.LayerNorm.weight", "bert.encoder.layer.17.attention.output.LayerNorm.bias", "bert.encoder.layer.17.intermediate.dense.weight", "bert.encoder.layer.17.intermediate.dense.bias", "bert.encoder.layer.17.output.dense.weight", "bert.encoder.layer.17.output.dense.bias", "bert.encoder.layer.17.output.LayerNorm.weight", "bert.encoder.layer.17.output.LayerNorm.bias", "bert.encoder.layer.18.attention.self.query.weight", "bert.encoder.layer.18.attention.self.query.bias", "bert.encoder.layer.18.attention.self.key.weight", "bert.encoder.layer.18.attention.self.key.bias", "bert.encoder.layer.18.attention.self.value.weight", "bert.encoder.layer.18.attention.self.value.bias", "bert.encoder.layer.18.attention.output.dense.weight", "bert.encoder.layer.18.attention.output.dense.bias", "bert.encoder.layer.18.attention.output.LayerNorm.weight", "bert.encoder.layer.18.attention.output.LayerNorm.bias", "bert.encoder.layer.18.intermediate.dense.weight", "bert.encoder.layer.18.intermediate.dense.bias", "bert.encoder.layer.18.output.dense.weight", "bert.encoder.layer.18.output.dense.bias", "bert.encoder.layer.18.output.LayerNorm.weight", "bert.encoder.layer.18.output.LayerNorm.bias", "bert.encoder.layer.19.attention.self.query.weight", "bert.encoder.layer.19.attention.self.query.bias", "bert.encoder.layer.19.attention.self.key.weight", 
"bert.encoder.layer.19.attention.self.key.bias", "bert.encoder.layer.19.attention.self.value.weight", "bert.encoder.layer.19.attention.self.value.bias", "bert.encoder.layer.19.attention.output.dense.weight", "bert.encoder.layer.19.attention.output.dense.bias", "bert.encoder.layer.19.attention.output.LayerNorm.weight", "bert.encoder.layer.19.attention.output.LayerNorm.bias", "bert.encoder.layer.19.intermediate.dense.weight", "bert.encoder.layer.19.intermediate.dense.bias", "bert.encoder.layer.19.output.dense.weight", "bert.encoder.layer.19.output.dense.bias", "bert.encoder.layer.19.output.LayerNorm.weight", "bert.encoder.layer.19.output.LayerNorm.bias", "bert.encoder.layer.20.attention.self.query.weight", "bert.encoder.layer.20.attention.self.query.bias", "bert.encoder.layer.20.attention.self.key.weight", "bert.encoder.layer.20.attention.self.key.bias", "bert.encoder.layer.20.attention.self.value.weight", "bert.encoder.layer.20.attention.self.value.bias", "bert.encoder.layer.20.attention.output.dense.weight", "bert.encoder.layer.20.attention.output.dense.bias", "bert.encoder.layer.20.attention.output.LayerNorm.weight", "bert.encoder.layer.20.attention.output.LayerNorm.bias", "bert.encoder.layer.20.intermediate.dense.weight", "bert.encoder.layer.20.intermediate.dense.bias", "bert.encoder.layer.20.output.dense.weight", "bert.encoder.layer.20.output.dense.bias", "bert.encoder.layer.20.output.LayerNorm.weight", "bert.encoder.layer.20.output.LayerNorm.bias", "bert.encoder.layer.21.attention.self.query.weight", "bert.encoder.layer.21.attention.self.query.bias", "bert.encoder.layer.21.attention.self.key.weight", "bert.encoder.layer.21.attention.self.key.bias", "bert.encoder.layer.21.attention.self.value.weight", "bert.encoder.layer.21.attention.self.value.bias", "bert.encoder.layer.21.attention.output.dense.weight", "bert.encoder.layer.21.attention.output.dense.bias", "bert.encoder.layer.21.attention.output.LayerNorm.weight", 
"bert.encoder.layer.21.attention.output.LayerNorm.bias", "bert.encoder.layer.21.intermediate.dense.weight", "bert.encoder.layer.21.intermediate.dense.bias", "bert.encoder.layer.21.output.dense.weight", "bert.encoder.layer.21.output.dense.bias", "bert.encoder.layer.21.output.LayerNorm.weight", "bert.encoder.layer.21.output.LayerNorm.bias", "bert.encoder.layer.22.attention.self.query.weight", "bert.encoder.layer.22.attention.self.query.bias", "bert.encoder.layer.22.attention.self.key.weight", "bert.encoder.layer.22.attention.self.key.bias", "bert.encoder.layer.22.attention.self.value.weight", "bert.encoder.layer.22.attention.self.value.bias", "bert.encoder.layer.22.attention.output.dense.weight", "bert.encoder.layer.22.attention.output.dense.bias", "bert.encoder.layer.22.attention.output.LayerNorm.weight", "bert.encoder.layer.22.attention.output.LayerNorm.bias", "bert.encoder.layer.22.intermediate.dense.weight", "bert.encoder.layer.22.intermediate.dense.bias", "bert.encoder.layer.22.output.dense.weight", "bert.encoder.layer.22.output.dense.bias", "bert.encoder.layer.22.output.LayerNorm.weight", "bert.encoder.layer.22.output.LayerNorm.bias", "bert.encoder.layer.23.attention.self.query.weight", "bert.encoder.layer.23.attention.self.query.bias", "bert.encoder.layer.23.attention.self.key.weight", "bert.encoder.layer.23.attention.self.key.bias", "bert.encoder.layer.23.attention.self.value.weight", "bert.encoder.layer.23.attention.self.value.bias", "bert.encoder.layer.23.attention.output.dense.weight", "bert.encoder.layer.23.attention.output.dense.bias", "bert.encoder.layer.23.attention.output.LayerNorm.weight", "bert.encoder.layer.23.attention.output.LayerNorm.bias", "bert.encoder.layer.23.intermediate.dense.weight", "bert.encoder.layer.23.intermediate.dense.bias", "bert.encoder.layer.23.output.dense.weight", "bert.encoder.layer.23.output.dense.bias", "bert.encoder.layer.23.output.LayerNorm.weight", "bert.encoder.layer.23.output.LayerNorm.bias", 
"bert.pooler.dense.weight", "bert.pooler.dense.bias", "linear.weight", "linear.bias". 
	Unexpected key(s) in state_dict: "module.bert.embeddings.position_ids", "module.bert.embeddings.word_embeddings.weight", "module.bert.embeddings.position_embeddings.weight", "module.bert.embeddings.token_type_embeddings.weight", "module.bert.embeddings.LayerNorm.weight", "module.bert.embeddings.LayerNorm.bias", "module.bert.encoder.layer.0.attention.self.query.weight", "module.bert.encoder.layer.0.attention.self.query.bias", "module.bert.encoder.layer.0.attention.self.key.weight", "module.bert.encoder.layer.0.attention.self.key.bias", "module.bert.encoder.layer.0.attention.self.value.weight", "module.bert.encoder.layer.0.attention.self.value.bias", "module.bert.encoder.layer.0.attention.output.dense.weight", "module.bert.encoder.layer.0.attention.output.dense.bias", "module.bert.encoder.layer.0.attention.output.LayerNorm.weight", "module.bert.encoder.layer.0.attention.output.LayerNorm.bias", "module.bert.encoder.layer.0.intermediate.dense.weight", "module.bert.encoder.layer.0.intermediate.dense.bias", "module.bert.encoder.layer.0.output.dense.weight", "module.bert.encoder.layer.0.output.dense.bias", "module.bert.encoder.layer.0.output.LayerNorm.weight", "module.bert.encoder.layer.0.output.LayerNorm.bias", "module.bert.encoder.layer.1.attention.self.query.weight", "module.bert.encoder.layer.1.attention.self.query.bias", "module.bert.encoder.layer.1.attention.self.key.weight", "module.bert.encoder.layer.1.attention.self.key.bias", "module.bert.encoder.layer.1.attention.self.value.weight", "module.bert.encoder.layer.1.attention.self.value.bias", "module.bert.encoder.layer.1.attention.output.dense.weight", "module.bert.encoder.layer.1.attention.output.dense.bias", "module.bert.encoder.layer.1.attention.output.LayerNorm.weight", "module.bert.encoder.layer.1.attention.output.LayerNorm.bias", "module.bert.encoder.layer.1.intermediate.dense.weight", "module.bert.encoder.layer.1.intermediate.dense.bias", "module.bert.encoder.layer.1.output.dense.weight", 
"module.bert.encoder.layer.1.output.dense.bias", "module.bert.encoder.layer.1.output.LayerNorm.weight", "module.bert.encoder.layer.1.output.LayerNorm.bias", "module.bert.encoder.layer.2.attention.self.query.weight", "module.bert.encoder.layer.2.attention.self.query.bias", "module.bert.encoder.layer.2.attention.self.key.weight", "module.bert.encoder.layer.2.attention.self.key.bias", "module.bert.encoder.layer.2.attention.self.value.weight", "module.bert.encoder.layer.2.attention.self.value.bias", "module.bert.encoder.layer.2.attention.output.dense.weight", "module.bert.encoder.layer.2.attention.output.dense.bias", "module.bert.encoder.layer.2.attention.output.LayerNorm.weight", "module.bert.encoder.layer.2.attention.output.LayerNorm.bias", "module.bert.encoder.layer.2.intermediate.dense.weight", "module.bert.encoder.layer.2.intermediate.dense.bias", "module.bert.encoder.layer.2.output.dense.weight", "module.bert.encoder.layer.2.output.dense.bias", "module.bert.encoder.layer.2.output.LayerNorm.weight", "module.bert.encoder.layer.2.output.LayerNorm.bias", "module.bert.encoder.layer.3.attention.self.query.weight", "module.bert.encoder.layer.3.attention.self.query.bias", "module.bert.encoder.layer.3.attention.self.key.weight", "module.bert.encoder.layer.3.attention.self.key.bias", "module.bert.encoder.layer.3.attention.self.value.weight", "module.bert.encoder.layer.3.attention.self.value.bias", "module.bert.encoder.layer.3.attention.output.dense.weight", "module.bert.encoder.layer.3.attention.output.dense.bias", "module.bert.encoder.layer.3.attention.output.LayerNorm.weight", "module.bert.encoder.layer.3.attention.output.LayerNorm.bias", "module.bert.encoder.layer.3.intermediate.dense.weight", "module.bert.encoder.layer.3.intermediate.dense.bias", "module.bert.encoder.layer.3.output.dense.weight", "module.bert.encoder.layer.3.output.dense.bias", "module.bert.encoder.layer.3.output.LayerNorm.weight", "module.bert.encoder.layer.3.output.LayerNorm.bias", 
"module.bert.encoder.layer.4.attention.self.query.weight", "module.bert.encoder.layer.4.attention.self.query.bias", "module.bert.encoder.layer.4.attention.self.key.weight", "module.bert.encoder.layer.4.attention.self.key.bias", "module.bert.encoder.layer.4.attention.self.value.weight", "module.bert.encoder.layer.4.attention.self.value.bias", "module.bert.encoder.layer.4.attention.output.dense.weight", "module.bert.encoder.layer.4.attention.output.dense.bias", "module.bert.encoder.layer.4.attention.output.LayerNorm.weight", "module.bert.encoder.layer.4.attention.output.LayerNorm.bias", "module.bert.encoder.layer.4.intermediate.dense.weight", "module.bert.encoder.layer.4.intermediate.dense.bias", "module.bert.encoder.layer.4.output.dense.weight", "module.bert.encoder.layer.4.output.dense.bias", "module.bert.encoder.layer.4.output.LayerNorm.weight", "module.bert.encoder.layer.4.output.LayerNorm.bias", "module.bert.encoder.layer.5.attention.self.query.weight", "module.bert.encoder.layer.5.attention.self.query.bias", "module.bert.encoder.layer.5.attention.self.key.weight", "module.bert.encoder.layer.5.attention.self.key.bias", "module.bert.encoder.layer.5.attention.self.value.weight", "module.bert.encoder.layer.5.attention.self.value.bias", "module.bert.encoder.layer.5.attention.output.dense.weight", "module.bert.encoder.layer.5.attention.output.dense.bias", "module.bert.encoder.layer.5.attention.output.LayerNorm.weight", "module.bert.encoder.layer.5.attention.output.LayerNorm.bias", "module.bert.encoder.layer.5.intermediate.dense.weight", "module.bert.encoder.layer.5.intermediate.dense.bias", "module.bert.encoder.layer.5.output.dense.weight", "module.bert.encoder.layer.5.output.dense.bias", "module.bert.encoder.layer.5.output.LayerNorm.weight", "module.bert.encoder.layer.5.output.LayerNorm.bias", "module.bert.encoder.layer.6.attention.self.query.weight", "module.bert.encoder.layer.6.attention.self.query.bias", "module.bert.encoder.layer.6.attention.self.key.weight", 
"module.bert.encoder.layer.6.attention.self.key.bias", "module.bert.encoder.layer.6.attention.self.value.weight", "module.bert.encoder.layer.6.attention.self.value.bias", "module.bert.encoder.layer.6.attention.output.dense.weight", "module.bert.encoder.layer.6.attention.output.dense.bias", "module.bert.encoder.layer.6.attention.output.LayerNorm.weight", "module.bert.encoder.layer.6.attention.output.LayerNorm.bias", "module.bert.encoder.layer.6.intermediate.dense.weight", "module.bert.encoder.layer.6.intermediate.dense.bias", "module.bert.encoder.layer.6.output.dense.weight", "module.bert.encoder.layer.6.output.dense.bias", "module.bert.encoder.layer.6.output.LayerNorm.weight", "module.bert.encoder.layer.6.output.LayerNorm.bias", "module.bert.encoder.layer.7.attention.self.query.weight", "module.bert.encoder.layer.7.attention.self.query.bias", "module.bert.encoder.layer.7.attention.self.key.weight", "module.bert.encoder.layer.7.attention.self.key.bias", "module.bert.encoder.layer.7.attention.self.value.weight", "module.bert.encoder.layer.7.attention.self.value.bias", "module.bert.encoder.layer.7.attention.output.dense.weight", "module.bert.encoder.layer.7.attention.output.dense.bias", "module.bert.encoder.layer.7.attention.output.LayerNorm.weight", "module.bert.encoder.layer.7.attention.output.LayerNorm.bias", "module.bert.encoder.layer.7.intermediate.dense.weight", "module.bert.encoder.layer.7.intermediate.dense.bias", "module.bert.encoder.layer.7.output.dense.weight", "module.bert.encoder.layer.7.output.dense.bias", "module.bert.encoder.layer.7.output.LayerNorm.weight", "module.bert.encoder.layer.7.output.LayerNorm.bias", "module.bert.encoder.layer.8.attention.self.query.weight", "module.bert.encoder.layer.8.attention.self.query.bias", "module.bert.encoder.layer.8.attention.self.key.weight", "module.bert.encoder.layer.8.attention.self.key.bias", "module.bert.encoder.layer.8.attention.self.value.weight", "module.bert.encoder.layer.8.attention.self.value.bias", 
"module.bert.encoder.layer.8.attention.output.dense.weight", "module.bert.encoder.layer.8.attention.output.dense.bias", "module.bert.encoder.layer.8.attention.output.LayerNorm.weight", "module.bert.encoder.layer.8.attention.output.LayerNorm.bias", "module.bert.encoder.layer.8.intermediate.dense.weight", "module.bert.encoder.layer.8.intermediate.dense.bias", "module.bert.encoder.layer.8.output.dense.weight", "module.bert.encoder.layer.8.output.dense.bias", "module.bert.encoder.layer.8.output.LayerNorm.weight", "module.bert.encoder.layer.8.output.LayerNorm.bias", "module.bert.encoder.layer.9.attention.self.query.weight", "module.bert.encoder.layer.9.attention.self.query.bias", "module.bert.encoder.layer.9.attention.self.key.weight", "module.bert.encoder.layer.9.attention.self.key.bias", "module.bert.encoder.layer.9.attention.self.value.weight", "module.bert.encoder.layer.9.attention.self.value.bias", "module.bert.encoder.layer.9.attention.output.dense.weight", "module.bert.encoder.layer.9.attention.output.dense.bias", "module.bert.encoder.layer.9.attention.output.LayerNorm.weight", "module.bert.encoder.layer.9.attention.output.LayerNorm.bias", "module.bert.encoder.layer.9.intermediate.dense.weight", "module.bert.encoder.layer.9.intermediate.dense.bias", "module.bert.encoder.layer.9.output.dense.weight", "module.bert.encoder.layer.9.output.dense.bias", "module.bert.encoder.layer.9.output.LayerNorm.weight", "module.bert.encoder.layer.9.output.LayerNorm.bias", "module.bert.encoder.layer.10.attention.self.query.weight", "module.bert.encoder.layer.10.attention.self.query.bias", "module.bert.encoder.layer.10.attention.self.key.weight", "module.bert.encoder.layer.10.attention.self.key.bias", "module.bert.encoder.layer.10.attention.self.value.weight", "module.bert.encoder.layer.10.attention.self.value.bias", "module.bert.encoder.layer.10.attention.output.dense.weight", "module.bert.encoder.layer.10.attention.output.dense.bias", 
"module.bert.encoder.layer.10.attention.output.LayerNorm.weight", "module.bert.encoder.layer.10.attention.output.LayerNorm.bias", "module.bert.encoder.layer.10.intermediate.dense.weight", "module.bert.encoder.layer.10.intermediate.dense.bias", "module.bert.encoder.layer.10.output.dense.weight", "module.bert.encoder.layer.10.output.dense.bias", "module.bert.encoder.layer.10.output.LayerNorm.weight", "module.bert.encoder.layer.10.output.LayerNorm.bias", "module.bert.encoder.layer.11.attention.self.query.weight", "module.bert.encoder.layer.11.attention.self.query.bias", "module.bert.encoder.layer.11.attention.self.key.weight", "module.bert.encoder.layer.11.attention.self.key.bias", "module.bert.encoder.layer.11.attention.self.value.weight", "module.bert.encoder.layer.11.attention.self.value.bias", "module.bert.encoder.layer.11.attention.output.dense.weight", "module.bert.encoder.layer.11.attention.output.dense.bias", "module.bert.encoder.layer.11.attention.output.LayerNorm.weight", "module.bert.encoder.layer.11.attention.output.LayerNorm.bias", "module.bert.encoder.layer.11.intermediate.dense.weight", "module.bert.encoder.layer.11.intermediate.dense.bias", "module.bert.encoder.layer.11.output.dense.weight", "module.bert.encoder.layer.11.output.dense.bias", "module.bert.encoder.layer.11.output.LayerNorm.weight", "module.bert.encoder.layer.11.output.LayerNorm.bias", "module.bert.encoder.layer.12.attention.self.query.weight", "module.bert.encoder.layer.12.attention.self.query.bias", "module.bert.encoder.layer.12.attention.self.key.weight", "module.bert.encoder.layer.12.attention.self.key.bias", "module.bert.encoder.layer.12.attention.self.value.weight", "module.bert.encoder.layer.12.attention.self.value.bias", "module.bert.encoder.layer.12.attention.output.dense.weight", "module.bert.encoder.layer.12.attention.output.dense.bias", "module.bert.encoder.layer.12.attention.output.LayerNorm.weight", "module.bert.encoder.layer.12.attention.output.LayerNorm.bias", 
"module.bert.encoder.layer.12.intermediate.dense.weight", "module.bert.encoder.layer.12.intermediate.dense.bias", "module.bert.encoder.layer.12.output.dense.weight", "module.bert.encoder.layer.12.output.dense.bias", "module.bert.encoder.layer.12.output.LayerNorm.weight", "module.bert.encoder.layer.12.output.LayerNorm.bias", "module.bert.encoder.layer.13.attention.self.query.weight", "module.bert.encoder.layer.13.attention.self.query.bias", "module.bert.encoder.layer.13.attention.self.key.weight", "module.bert.encoder.layer.13.attention.self.key.bias", "module.bert.encoder.layer.13.attention.self.value.weight", "module.bert.encoder.layer.13.attention.self.value.bias", "module.bert.encoder.layer.13.attention.output.dense.weight", "module.bert.encoder.layer.13.attention.output.dense.bias", "module.bert.encoder.layer.13.attention.output.LayerNorm.weight", "module.bert.encoder.layer.13.attention.output.LayerNorm.bias", "module.bert.encoder.layer.13.intermediate.dense.weight", "module.bert.encoder.layer.13.intermediate.dense.bias", "module.bert.encoder.layer.13.output.dense.weight", "module.bert.encoder.layer.13.output.dense.bias", "module.bert.encoder.layer.13.output.LayerNorm.weight", "module.bert.encoder.layer.13.output.LayerNorm.bias", "module.bert.encoder.layer.14.attention.self.query.weight", "module.bert.encoder.layer.14.attention.self.query.bias", "module.bert.encoder.layer.14.attention.self.key.weight", "module.bert.encoder.layer.14.attention.self.key.bias", "module.bert.encoder.layer.14.attention.self.value.weight", "module.bert.encoder.layer.14.attention.self.value.bias", "module.bert.encoder.layer.14.attention.output.dense.weight", "module.bert.encoder.layer.14.attention.output.dense.bias", "module.bert.encoder.layer.14.attention.output.LayerNorm.weight", "module.bert.encoder.layer.14.attention.output.LayerNorm.bias", "module.bert.encoder.layer.14.intermediate.dense.weight", "module.bert.encoder.layer.14.intermediate.dense.bias", 
"module.bert.encoder.layer.14.output.dense.weight", "module.bert.encoder.layer.14.output.dense.bias", "module.bert.encoder.layer.14.output.LayerNorm.weight", "module.bert.encoder.layer.14.output.LayerNorm.bias", "module.bert.encoder.layer.15.attention.self.query.weight", "module.bert.encoder.layer.15.attention.self.query.bias", "module.bert.encoder.layer.15.attention.self.key.weight", "module.bert.encoder.layer.15.attention.self.key.bias", "module.bert.encoder.layer.15.attention.self.value.weight", "module.bert.encoder.layer.15.attention.self.value.bias", "module.bert.encoder.layer.15.attention.output.dense.weight", "module.bert.encoder.layer.15.attention.output.dense.bias", "module.bert.encoder.layer.15.attention.output.LayerNorm.weight", "module.bert.encoder.layer.15.attention.output.LayerNorm.bias", "module.bert.encoder.layer.15.intermediate.dense.weight", "module.bert.encoder.layer.15.intermediate.dense.bias", "module.bert.encoder.layer.15.output.dense.weight", "module.bert.encoder.layer.15.output.dense.bias", "module.bert.encoder.layer.15.output.LayerNorm.weight", "module.bert.encoder.layer.15.output.LayerNorm.bias", "module.bert.encoder.layer.16.attention.self.query.weight", "module.bert.encoder.layer.16.attention.self.query.bias", "module.bert.encoder.layer.16.attention.self.key.weight", "module.bert.encoder.layer.16.attention.self.key.bias", "module.bert.encoder.layer.16.attention.self.value.weight", "module.bert.encoder.layer.16.attention.self.value.bias", "module.bert.encoder.layer.16.attention.output.dense.weight", "module.bert.encoder.layer.16.attention.output.dense.bias", "module.bert.encoder.layer.16.attention.output.LayerNorm.weight", "module.bert.encoder.layer.16.attention.output.LayerNorm.bias", "module.bert.encoder.layer.16.intermediate.dense.weight", "module.bert.encoder.layer.16.intermediate.dense.bias", "module.bert.encoder.layer.16.output.dense.weight", "module.bert.encoder.layer.16.output.dense.bias", 
"module.bert.encoder.layer.16.output.LayerNorm.weight", "module.bert.encoder.layer.16.output.LayerNorm.bias", "module.bert.encoder.layer.17.attention.self.query.weight", "module.bert.encoder.layer.17.attention.self.query.bias", "module.bert.encoder.layer.17.attention.self.key.weight", "module.bert.encoder.layer.17.attention.self.key.bias", "module.bert.encoder.layer.17.attention.self.value.weight", "module.bert.encoder.layer.17.attention.self.value.bias", "module.bert.encoder.layer.17.attention.output.dense.weight", "module.bert.encoder.layer.17.attention.output.dense.bias", "module.bert.encoder.layer.17.attention.output.LayerNorm.weight", "module.bert.encoder.layer.17.attention.output.LayerNorm.bias", "module.bert.encoder.layer.17.intermediate.dense.weight", "module.bert.encoder.layer.17.intermediate.dense.bias", "module.bert.encoder.layer.17.output.dense.weight", "module.bert.encoder.layer.17.output.dense.bias", "module.bert.encoder.layer.17.output.LayerNorm.weight", "module.bert.encoder.layer.17.output.LayerNorm.bias", "module.bert.encoder.layer.18.attention.self.query.weight", "module.bert.encoder.layer.18.attention.self.query.bias", "module.bert.encoder.layer.18.attention.self.key.weight", "module.bert.encoder.layer.18.attention.self.key.bias", "module.bert.encoder.layer.18.attention.self.value.weight", "module.bert.encoder.layer.18.attention.self.value.bias", "module.bert.encoder.layer.18.attention.output.dense.weight", "module.bert.encoder.layer.18.attention.output.dense.bias", "module.bert.encoder.layer.18.attention.output.LayerNorm.weight", "module.bert.encoder.layer.18.attention.output.LayerNorm.bias", "module.bert.encoder.layer.18.intermediate.dense.weight", "module.bert.encoder.layer.18.intermediate.dense.bias", "module.bert.encoder.layer.18.output.dense.weight", "module.bert.encoder.layer.18.output.dense.bias", "module.bert.encoder.layer.18.output.LayerNorm.weight", "module.bert.encoder.layer.18.output.LayerNorm.bias", 
"module.bert.encoder.layer.19.attention.self.query.weight", "module.bert.encoder.layer.19.attention.self.query.bias", "module.bert.encoder.layer.19.attention.self.key.weight", "module.bert.encoder.layer.19.attention.self.key.bias", "module.bert.encoder.layer.19.attention.self.value.weight", "module.bert.encoder.layer.19.attention.self.value.bias", "module.bert.encoder.layer.19.attention.output.dense.weight", "module.bert.encoder.layer.19.attention.output.dense.bias", "module.bert.encoder.layer.19.attention.output.LayerNorm.weight", "module.bert.encoder.layer.19.attention.output.LayerNorm.bias", "module.bert.encoder.layer.19.intermediate.dense.weight", "module.bert.encoder.layer.19.intermediate.dense.bias", "module.bert.encoder.layer.19.output.dense.weight", "module.bert.encoder.layer.19.output.dense.bias", "module.bert.encoder.layer.19.output.LayerNorm.weight", "module.bert.encoder.layer.19.output.LayerNorm.bias", "module.bert.encoder.layer.20.attention.self.query.weight", "module.bert.encoder.layer.20.attention.self.query.bias", "module.bert.encoder.layer.20.attention.self.key.weight", "module.bert.encoder.layer.20.attention.self.key.bias", "module.bert.encoder.layer.20.attention.self.value.weight", "module.bert.encoder.layer.20.attention.self.value.bias", "module.bert.encoder.layer.20.attention.output.dense.weight", "module.bert.encoder.layer.20.attention.output.dense.bias", "module.bert.encoder.layer.20.attention.output.LayerNorm.weight", "module.bert.encoder.layer.20.attention.output.LayerNorm.bias", "module.bert.encoder.layer.20.intermediate.dense.weight", "module.bert.encoder.layer.20.intermediate.dense.bias", "module.bert.encoder.layer.20.output.dense.weight", "module.bert.encoder.layer.20.output.dense.bias", "module.bert.encoder.layer.20.output.LayerNorm.weight", "module.bert.encoder.layer.20.output.LayerNorm.bias", "module.bert.encoder.layer.21.attention.self.query.weight", "module.bert.encoder.layer.21.attention.self.query.bias", 
"module.bert.encoder.layer.21.attention.self.key.weight", "module.bert.encoder.layer.21.attention.self.key.bias", "module.bert.encoder.layer.21.attention.self.value.weight", "module.bert.encoder.layer.21.attention.self.value.bias", "module.bert.encoder.layer.21.attention.output.dense.weight", "module.bert.encoder.layer.21.attention.output.dense.bias", "module.bert.encoder.layer.21.attention.output.LayerNorm.weight", "module.bert.encoder.layer.21.attention.output.LayerNorm.bias", "module.bert.encoder.layer.21.intermediate.dense.weight", "module.bert.encoder.layer.21.intermediate.dense.bias", "module.bert.encoder.layer.21.output.dense.weight", "module.bert.encoder.layer.21.output.dense.bias", "module.bert.encoder.layer.21.output.LayerNorm.weight", "module.bert.encoder.layer.21.output.LayerNorm.bias", "module.bert.encoder.layer.22.attention.self.query.weight", "module.bert.encoder.layer.22.attention.self.query.bias", "module.bert.encoder.layer.22.attention.self.key.weight", "module.bert.encoder.layer.22.attention.self.key.bias", "module.bert.encoder.layer.22.attention.self.value.weight", "module.bert.encoder.layer.22.attention.self.value.bias", "module.bert.encoder.layer.22.attention.output.dense.weight", "module.bert.encoder.layer.22.attention.output.dense.bias", "module.bert.encoder.layer.22.attention.output.LayerNorm.weight", "module.bert.encoder.layer.22.attention.output.LayerNorm.bias", "module.bert.encoder.layer.22.intermediate.dense.weight", "module.bert.encoder.layer.22.intermediate.dense.bias", "module.bert.encoder.layer.22.output.dense.weight", "module.bert.encoder.layer.22.output.dense.bias", "module.bert.encoder.layer.22.output.LayerNorm.weight", "module.bert.encoder.layer.22.output.LayerNorm.bias", "module.bert.encoder.layer.23.attention.self.query.weight", "module.bert.encoder.layer.23.attention.self.query.bias", "module.bert.encoder.layer.23.attention.self.key.weight", "module.bert.encoder.layer.23.attention.self.key.bias", 
"module.bert.encoder.layer.23.attention.self.value.weight", "module.bert.encoder.layer.23.attention.self.value.bias", "module.bert.encoder.layer.23.attention.output.dense.weight", "module.bert.encoder.layer.23.attention.output.dense.bias", "module.bert.encoder.layer.23.attention.output.LayerNorm.weight", "module.bert.encoder.layer.23.attention.output.LayerNorm.bias", "module.bert.encoder.layer.23.intermediate.dense.weight", "module.bert.encoder.layer.23.intermediate.dense.bias", "module.bert.encoder.layer.23.output.dense.weight", "module.bert.encoder.layer.23.output.dense.bias", "module.bert.encoder.layer.23.output.LayerNorm.weight", "module.bert.encoder.layer.23.output.LayerNorm.bias", "module.bert.pooler.dense.weight", "module.bert.pooler.dense.bias", "module.linear.weight", "module.linear.bias". 

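The unexpected keys in the error above differ from the missing keys only by a leading `module.` prefix. That pattern typically appears when a checkpoint was saved from a model wrapped in `torch.nn.DataParallel` and is then loaded into an unwrapped model. Assuming that is the cause here, one common workaround is to rename the keys before loading. The sketch below uses a hypothetical helper, `strip_module_prefix`, which is not part of the course code, and a toy dictionary in place of a real checkpoint:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of `state_dict` with `prefix` removed from any key that starts with it.

    Hypothetical helper for illustration; it works on any mapping.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Toy stand-in for a saved checkpoint; in practice the values would be tensors.
saved = OrderedDict([
    ("module.bert.pooler.dense.weight", 0),
    ("module.linear.bias", 1),
])
cleaned = strip_module_prefix(saved)
print(list(cleaned))
```

With a real checkpoint, the cleaned dictionary would then be passed to the unwrapped model's `load_state_dict` method.
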
In [36]:
# create_bakeoff_submission(predict_one)

In [32]:
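def create_bakeoff_submission(
        predict_one_func,
        output_filename='cs224u-sentiment-bakeoff-entry.csv'):
    """Write predictions for both bakeoff test sets to `output_filename`.

    `predict_one_func` maps a single sentence (str) to a predicted label.
    """
    bakeoff_test = sst.bakeoff_test_reader(SST_HOME)
    sst_test = sst.test_reader(SST_HOME)
    # Tag each row with its source dataset before combining the two test sets.
    bakeoff_test['dataset'] = 'bakeoff'
    sst_test['dataset'] = 'sst3'
    df = pd.concat((bakeoff_test, sst_test))

    # Predict one sentence at a time.
    df['prediction'] = df['sentence'].apply(predict_one_func)

    # Leave the DataFrame index out of the CSV.
    df.to_csv(output_filename, index=False)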

Thus, for example, the following will create a bake-off entry based on `predict_one_softmax`:

In [None]:
# This check ensures that the following code runs only in the local environment.
# The call below will not run in the autograder environment.
if 'IS_GRADESCOPE_ENV' not in os.environ:
    create_bakeoff_submission(predict_one_softmax)

This creates a file `cs224u-sentiment-bakeoff-entry.csv` in the current directory. That file should be uploaded as-is. Please do not change its name.

Only one upload per team is permitted, and you should do no tuning of your system based on what you see in our bakeoff prediction file – you should not study that file in any way, beyond perhaps checking that it contains what you expected it to contain. The upload function will do some additional checking to ensure that your file is well-formed.
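
A quick structural check of the entry file, without studying individual predictions, might look like the following sketch. `check_submission` is a hypothetical helper, not part of the course code. The column names come from `create_bakeoff_submission` above. The label strings `positive`, `negative`, and `neutral` are an assumption about how the ternary classes are spelled, so adjust them if your labels differ.

```python
import csv

# Assumed spellings for the three classes; adjust if your labels differ.
EXPECTED_LABELS = {"positive", "negative", "neutral"}
# Dataset tags written by `create_bakeoff_submission`.
EXPECTED_DATASETS = {"bakeoff", "sst3"}
REQUIRED_COLUMNS = ("sentence", "dataset", "prediction")


def check_submission(path="cs224u-sentiment-bakeoff-entry.csv"):
    """Return a list of problems found in the entry file (empty if none)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
        if missing:
            return [f"missing column: {c}" for c in missing]
        rows = list(reader)

    problems = []
    if not rows:
        problems.append("file has no rows")
    # str() guards against None values from short rows.
    bad_labels = {str(r["prediction"]) for r in rows} - EXPECTED_LABELS
    if bad_labels:
        problems.append(f"unexpected predictions: {sorted(bad_labels)}")
    bad_datasets = {str(r["dataset"]) for r in rows} - EXPECTED_DATASETS
    if bad_datasets:
        problems.append(f"unexpected dataset tags: {sorted(bad_datasets)}")
    return problems
```

Calling `check_submission()` in the directory that contains the entry file returns an empty list when none of these checks fail. It does not replace the checks performed by the upload function.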

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points.

## Submission Instruction

Review and follow the [Homework and bake-off code: Formatting guide](hw_formatting_guide.ipynb).
Please do not change the names of the files listed below.

Submit the following files to Gradescope:

- `hw_sentiment.ipynb` (this notebook)
- `cs224u-sentiment-bakeoff-entry.csv` (bake-off output)
