# Glassdoor Sentiment Analysis: Stanford Sentiment Treebank

## Contents

1. [Methodological note](#Methodological-note)
1. [Set-up](#Set-up)
1. [A softmax baseline](#A-softmax-baseline)
1. [RNNClassifier wrapper](#RNNClassifier-wrapper)
1. [Error analysis](#Error-analysis)
1. [Homework questions](#Homework-questions)
  1. [Sentiment words alone [2 points]](#Sentiment-words-alone-[2-points])
  1. [A more powerful vector-summing baseline [3 points]](#A-more-powerful-vector-summing-baseline-[3-points])
  1. [Your original system [4 points]](#Your-original-system-[4-points])
1. [Bake-off [1 point]](#Bake-off-[1-point])

## Methodological note

You don't have to use the experimental framework defined below (based on `sst`). However, if you don't use `sst.experiment` as below, then make sure you're training only on `train`, evaluating on `dev`, and that you report with 

```
from sklearn.metrics import classification_report
classification_report(y_dev, predictions)
```
where `y_dev = [y for tree, y in sst.dev_reader(SST_HOME, class_func=sst.ternary_class_func)]`. We'll focus on the value at `macro avg` under `f1-score` in these reports.
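
To make concrete what the `macro avg` F1 value measures, here is a minimal pure-Python sketch. The helper name `macro_f1` and the toy labels are hypothetical and only for illustration; in practice you would read the number off `classification_report`.

```python
from collections import Counter

def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores (illustrative helper)."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the gold label but was missed
    f1s = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1s.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

gold = ['positive', 'negative', 'neutral', 'positive']
pred = ['positive', 'negative', 'positive', 'positive']
print(round(macro_f1(gold, pred), 3))  # 0.6
```

Because every class counts equally, a model that ignores the small `neutral` class is penalized heavily even if its overall accuracy looks reasonable.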

## Set-up

See [the first notebook in this unit](sst_01_overview.ipynb#Set-up) for set-up instructions.

In [5]:
import sys
sys.path.insert(0, 'cs224u_files')

from collections import Counter
import numpy as np
import os
import pandas as pd
import random
from sklearn.linear_model import LogisticRegression
import sst
import torch.nn as nn
from torch_rnn_classifier import TorchRNNClassifier
from torch_tree_nn import TorchTreeNN
from sklearn.model_selection import StratifiedKFold
import utils

In [6]:
SST_HOME = 'glassdoor-data'

## A softmax baseline

This example is here mainly as a reminder of how to use our experimental framework with linear models.

In [20]:
def unigrams_phi(tree):
    """The basis for a unigrams feature function.
    
    Parameters
    ----------
    tree : nltk.Tree
        The tree to represent.
    
    Returns
    -------    
    Counter
        A map from strings to their counts in `tree`. (Counter maps a 
        list to a dict of counts of the elements in that list.)
    
    """
    return Counter(tree.leaves())
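
As a quick illustration of what this feature function returns, the sketch below runs it on a toy input. `ToyTree` is a hypothetical stand-in that exposes only the `leaves()` method; real examples are `nltk.Tree` objects produced by the `sst` readers.

```python
from collections import Counter

class ToyTree:
    """Hypothetical stand-in exposing only the `leaves()` method of nltk.Tree."""
    def __init__(self, words):
        self._words = words

    def leaves(self):
        return list(self._words)

def unigrams_phi(tree):
    return Counter(tree.leaves())

feats = unigrams_phi(ToyTree("a great , great place to work".split()))
# Counter returns 0 for words that never occurred.
print(feats['great'], feats['place'], feats['terrible'])  # 2 1 0
```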

Thin wrapper around `LogisticRegression` for the sake of `sst.experiment`:

In [21]:
def fit_softmax_classifier(X, y):        
    mod = LogisticRegression(
        fit_intercept=True,
        solver='liblinear',
        multi_class='ovr')
    mod.fit(X, y)
    return mod

The experimental run with some notes:

In [22]:
softmax_experiment = sst.experiment(
    SST_HOME,
    unigrams_phi,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

              precision    recall  f1-score   support

    negative      0.628     0.689     0.657       428
     neutral      0.343     0.153     0.211       229
    positive      0.629     0.750     0.684       444

   micro avg      0.602     0.602     0.602      1101
   macro avg      0.533     0.531     0.518      1101
weighted avg      0.569     0.602     0.575      1101



`softmax_experiment` contains a lot of information that you can use for analysis; see [this section below](#Error-analysis) for starter code.

## RNNClassifier wrapper

This section illustrates how to use `sst.experiment` with RNN and TreeNN models.

To featurize examples for an RNN, we just get the words in order, letting the model take care of mapping them into an embedding space.

In [23]:
def rnn_phi(tree):
    return tree.leaves()    

The model wrapper gets the vocabulary using `utils.get_vocab`. If you want to use pretrained word representations in here, then you can have `fit_rnn_classifier` build that space too; see [this notebook section for details](sst_03_neural_networks.ipynb#Pretrained-embeddings).

In [24]:
def fit_rnn_classifier(X, y):    
    sst_glove_vocab = utils.get_vocab(X, n_words=10000)     
    mod = TorchRNNClassifier(
        sst_glove_vocab, 
        eta=0.05,
        embedding=None,
        batch_size=1000,
        embed_dim=50,
        hidden_dim=50,
        max_iter=50,
        l2_strength=0.001,
        bidirectional=True,
        hidden_activation=nn.ReLU())
    mod.fit(X, y)
    return mod
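
For intuition about the `n_words` cutoff, here is a rough sketch of how a frequency-limited vocabulary can be built. `toy_get_vocab` and the `$UNK` token name are assumptions made for illustration; check `utils.get_vocab` for the actual behavior.

```python
from collections import Counter

def toy_get_vocab(X, n_words=None, unk="$UNK"):
    """Keep the `n_words` most frequent tokens and add an unknown-word token."""
    counts = Counter(w for ex in X for w in ex)
    kept = [w for w, _ in counts.most_common(n_words)]
    return sorted(set(kept) | {unk})

X = [["good", "pay"], ["good", "team"], ["bad", "pay"]]
print(toy_get_vocab(X, n_words=2))  # ['$UNK', 'good', 'pay']
```

Words outside the kept set would be mapped to the unknown-word token at training and prediction time, so a small `n_words` trades coverage for a smaller embedding matrix.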

In [25]:
rnn_experiment = sst.experiment(
    SST_HOME,
    rnn_phi,
    fit_rnn_classifier, 
    vectorize=False,  # For deep learning, use `vectorize=False`.
    assess_reader=sst.dev_reader)

Finished epoch 50 of 50; error is 2.2390241771936417

              precision    recall  f1-score   support

    negative      0.601     0.661     0.630       428
     neutral      0.274     0.227     0.248       229
    positive      0.630     0.624     0.627       444

   micro avg      0.556     0.556     0.556      1101
   macro avg      0.501     0.504     0.501      1101
weighted avg      0.544     0.556     0.549      1101



## Error analysis

This section begins to build an error-analysis framework using the dicts returned by `sst.experiment`. These have the following structure:

```
'model': trained model
'train_dataset':
   'X': feature matrix
   'y': list of labels
   'vectorizer': DictVectorizer,
   'raw_examples': list of raw inputs, before featurizing   
'assess_dataset': same structure as the value of 'train_dataset'
'predictions': predictions on the assessment data
'metric': `score_func.__name__`, where `score_func` is an `sst.experiment` argument
'score': the `score_func` score on the assessment data
```
The following function just finds mistakes, and returns a `pd.DataFrame` for easy subsequent processing:

In [26]:
def find_errors(experiment):
    """Find mistaken predictions.
    
    Parameters
    ----------
    experiment : dict
        As returned by `sst.experiment`.
        
    Returns
    -------
    pd.DataFrame
    
    """
    raw_examples = experiment['assess_dataset']['raw_examples']
    raw_examples = [" ".join(tree.leaves()) for tree in raw_examples]
    df = pd.DataFrame({
        'raw_examples': raw_examples,
        'predicted': experiment['predictions'],
        'gold': experiment['assess_dataset']['y']})
    df['correct'] = df['predicted'] == df['gold']
    return df
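
If you only need a quick tally of which confusions a model makes, the same dict can be summarized without pandas. This is a minimal sketch; `toy_experiment` is a hypothetical miniature containing only the two keys used here, and `confusion_pairs` is not part of `sst`.

```python
from collections import Counter

# Hypothetical miniature of the dict returned by `sst.experiment`.
toy_experiment = {
    'predictions': ['positive', 'negative', 'positive', 'neutral'],
    'assess_dataset': {'y': ['positive', 'neutral', 'negative', 'neutral']}}

def confusion_pairs(experiment):
    """Count (gold, predicted) pairs over the mistaken predictions only."""
    gold = experiment['assess_dataset']['y']
    pred = experiment['predictions']
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)

pairs = confusion_pairs(toy_experiment)
for (g, p), n in pairs.most_common():
    print("gold={} predicted={} count={}".format(g, p, n))
```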

In [27]:
softmax_analysis = find_errors(softmax_experiment)

In [28]:
rnn_analysis = find_errors(rnn_experiment)

Here we merge the softmax and RNN experiments into a single DataFrame:

In [29]:
analysis = softmax_analysis.merge(
    rnn_analysis, left_on='raw_examples', right_on='raw_examples')

analysis = analysis.drop('gold_y', axis=1).rename(columns={'gold_x': 'gold'})

The following code collects a specific subset of examples; small modifications to its structure will give you different interesting subsets:

In [30]:
# Examples where the softmax model is correct, the RNN is not,
# and the gold label is 'positive'

error_group = analysis[
    (analysis['predicted_x'] == analysis['gold'])
    &
    (analysis['predicted_y'] != analysis['gold'])    
    &
    (analysis['gold'] == 'positive')
]

In [31]:
error_group.shape[0]

71

In [32]:
for ex in error_group['raw_examples'].sample(5):
    print("="*70)
    print(ex)

It 's a beautiful madness .
One of the smartest takes on singles culture I 've seen in a long time .
But it still jingles in the pocket .
For the first time in years , De Niro digs deep emotionally , perhaps because he 's been stirred by the powerful work of his co-stars .
With Rabbit-Proof Fence , Noyce has tailored an epic tale into a lean , economical movie .


## Homework questions

Please embed your homework responses in this notebook, and do not delete any cells from the notebook. (You are free to add as many cells as you like as part of your responses.)

### Sentiment words alone [2 points]

NLTK includes an easy interface to [Minqing Hu and Bing Liu's __Opinion Lexicon__](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), which consists of a list of positive words and a list of negative words. How much of the ternary SST story does this lexicon tell?

For this problem, submit code to do the following:

1. Create a feature function `op_unigrams` on the model of `unigrams_phi` above, but filtering the vocabulary to just items that are members of the Opinion Lexicon. Submit this feature function.

1. Evaluate your feature function with `sst.experiment`, with all the same parameters as were used to create `softmax_experiment` in [A softmax baseline](#A-softmax-baseline) above, except of course for the feature function.

1. Use `utils.mcnemar` to compare your feature function with the results in `softmax_experiment`. The information you need for this is in `softmax_experiment` and your own `sst.experiment` results. Submit your evaluation code. You can assume `softmax_experiment` is already in memory, but your code should create the other objects necessary for this comparison.

In [70]:
from nltk.corpus import opinion_lexicon

# Use set for fast membership checking:
positive = set(opinion_lexicon.positive())
negative = set(opinion_lexicon.negative())

def op_unigrams(tree):
    filtered_unigrams = [word for word in tree.leaves() if word in positive or word in negative]
    return Counter(filtered_unigrams)

softmax_op_experiment = sst.experiment(
    SST_HOME,
    op_unigrams,                      # Free to write your own!
    fit_softmax_classifier,            # Free to write your own!
    train_reader=sst.train_reader,     # Fixed by the competition.
    assess_reader=sst.dev_reader,      # Fixed until the bake-off.
    class_func=sst.ternary_class_func) # Fixed by the bake-off rules.

m = utils.mcnemar(
    softmax_experiment['assess_dataset']['y'], 
    softmax_experiment['predictions'],
    softmax_op_experiment['predictions'])

p = "p < 0.0001" if m[1] < 0.0001 else m[1]
print("McNemar's test: {0:0.02f} ({1:})".format(m[0], p))

              precision    recall  f1-score   support

    negative      0.553     0.752     0.638       428
     neutral      0.179     0.031     0.052       229
    positive      0.615     0.664     0.639       444

   micro avg      0.567     0.567     0.567      1101
   macro avg      0.449     0.482     0.443      1101
weighted avg      0.500     0.567     0.516      1101

McNemar's test: 5.33 (0.020980477345314247)
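
For reference, the sketch below shows one common form of McNemar's test, the chi-squared version with a continuity correction, written with the standard library only. Whether `utils.mcnemar` uses exactly this formula is an assumption; `toy_mcnemar` and the toy labels are hypothetical.

```python
import math

def toy_mcnemar(y, pred_a, pred_b):
    """McNemar's chi-squared statistic (continuity-corrected) and p-value."""
    # Only examples where exactly one of the two systems is right matter.
    a_only = sum(1 for g, a, b in zip(y, pred_a, pred_b) if a == g and b != g)
    b_only = sum(1 for g, a, b in zip(y, pred_a, pred_b) if a != g and b == g)
    if a_only + b_only == 0:
        return 0.0, 1.0
    stat = (abs(a_only - b_only) - 1.0) ** 2 / (a_only + b_only)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval

y = ['pos'] * 12
pred_a = ['pos'] * 10 + ['neg'] * 2   # right on the first 10 examples
pred_b = ['neg'] * 10 + ['pos'] * 2   # right on the last 2 examples
stat, pval = toy_mcnemar(y, pred_a, pred_b)
print("{0:0.02f} ({1:0.03f})".format(stat, pval))
```

The test ignores examples that both systems get right or both get wrong, so it answers whether the disagreements are lopsided, not whether either system is good in absolute terms.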


### A more powerful vector-summing baseline [3 points]

In [Distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features), we looked at a baseline for the ternary SST problem in which each example is modeled as the sum of its 50-dimensional GloVe representations. A `LogisticRegression` model was used for prediction. A neural network might do better here, since there might be complex relationships between the input feature dimensions that a linear classifier can't learn. 

To address this question, rerun the experiment with `torch_shallow_neural_classifier.TorchShallowNeuralClassifier` as the classifier. Specs:
* Use `sst.experiment` to conduct the experiment. 
* Using 3-fold cross-validation, exhaustively explore this set of hyperparameter combinations:
  * The hidden dimensionality at 50, 100, and 200.
  * The hidden activation function as `nn.Tanh` or `nn.ReLU`.
* (For all other parameters to `TorchShallowNeuralClassifier`, use the defaults.)
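
To see how large this search is, the grid can be enumerated directly. This is only a counting sketch: the strings stand in for the `nn.Tanh()` and `nn.ReLU()` modules so that it runs without PyTorch.

```python
from itertools import product

param_grid = {
    'hidden_dim': [50, 100, 200],
    'hidden_activation': ['Tanh', 'ReLU']}  # placeholders for nn.Tanh(), nn.ReLU()

names = sorted(param_grid)
settings = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))]

n_folds = 3
print(len(settings), len(settings) * n_folds)  # 6 18
```

Each of the 6 settings is trained once per fold, so the exhaustive search costs 18 model fits before any final refit on the full training set.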

For this problem, submit code to do the following:

1. Your model wrapper function around `TorchShallowNeuralClassifier`. This function should implement the requisite cross-validation; see [this notebook section](sst_02_hand_built_features.ipynb#Hyperparameter-search) for examples.
1. Your average F1 score according to `sst.experiment`. 
1. The optimal hyperparameters chosen in your experiment. (You can just paste in the dict that `sst.experiment` prints.)

We're not evaluating the quality of your model. (We've specified the protocols completely, but there will still be a lot of variation in the results.) However, the primary goal of this question is to get you thinking more about this strikingly good baseline feature representation scheme for SST, so we're sort of hoping you feel compelled to try out variations on your own.

In [87]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

def vsm_leaves_phi(tree, lookup, np_func=np.sum):
    """Represent `tree` as a combination of the vectors of its words.
    
    Parameters
    ----------
    tree : nltk.Tree   
    lookup : dict
        From words to vectors.
    np_func : function (default: np.sum)
        A numpy matrix operation that can be applied columnwise, 
        like `np.mean`, `np.sum`, or `np.prod`. The requirement is that 
        the function take `axis=0` as one of its arguments (to ensure
        columnwise combination) and that it return a vector of a 
        fixed length, no matter what the size of the tree is.
    
    Returns
    -------
    np.array, dimension `X.shape[1]`
            
    """      
    allvecs = np.array([lookup[w] for w in tree.leaves() if w in lookup])    
    if len(allvecs) == 0:
        dim = len(next(iter(lookup.values())))
        feats = np.zeros(dim)
    else:       
        feats = np_func(allvecs, axis=0)      
    return feats

def glove_leaves_phi(tree, np_func=np.sum):
    return vsm_leaves_phi(tree, glove_lookup, np_func=np_func)

def fit_torch_shallow_neural_classifier(X, y):
    base_mod = TorchShallowNeuralClassifier()
    
    cv = 3
    param_grid = {
        'hidden_dim': [50, 100, 200], 
        'hidden_activation': [nn.Tanh(), nn.ReLU()],
    }   

    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, base_mod, cv, param_grid)
    
    return best_mod

DATA_HOME = 'data'
GLOVE_HOME = os.path.join(DATA_HOME, 'glove.6B')

glove_lookup = utils.glove2dict(
    os.path.join(GLOVE_HOME, 'glove.6B.300d.txt'))

In [91]:
_ = sst.experiment(
    SST_HOME,
    glove_leaves_phi,
    fit_torch_shallow_neural_classifier,
    class_func=sst.ternary_class_func,
    vectorize=False)  # Tell `experiment` that we already have our feature vectors.

Finished epoch 100 of 100; error is 2.30353552103042693

Best params: {'hidden_activation': Tanh(), 'hidden_dim': 50}
Best score: 0.520
              precision    recall  f1-score   support

    negative      0.621     0.645     0.633       989
     neutral      0.304     0.191     0.235       498
    positive      0.647     0.735     0.688      1077

   micro avg      0.595     0.595     0.595      2564
   macro avg      0.524     0.524     0.519      2564
weighted avg      0.570     0.595     0.579      2564



### Results
The macro-averaged F1 score was 0.519 and the best parameters were {'hidden_activation': Tanh(), 'hidden_dim': 50}.

### Your original system [4 points]

Your task is to develop an original model for the SST ternary problem. There are many options. If you spend more than a few hours on this homework problem, you should consider letting it grow into your final project! Here are some relatively manageable ideas that you might try:

1. We didn't systematically evaluate the `bidirectional` option to the `TorchRNNClassifier`. Similarly, that model could be tweaked to allow multiple LSTM layers (at present there is only one), and you could try adding layers to the classifier portion of the model as well.

1. We've already glimpsed the power of rich initial word representations, and later in the course we'll see that smart initialization usually leads to a performance gain in NLP, so you could perhaps achieve a winning entry with a simple model that starts in a great place.

1. The [practical introduction to contextual word representations](contextualreps.ipynb) (to be discussed later in the quarter) covers pretrained representations and interfaces that are likely to boost the performance of any system.

1. The `TreeNN` and `TorchTreeNN` don't perform all that well, and this could be for the same reason that RNNs don't perform well: the gradient signal doesn't propagate reliably down inside very deep trees. [Tai et al. 2015](https://aclanthology.info/papers/P15-1150/p15-1150) sought to address this with TreeLSTMs, which are fairly easy to implement in PyTorch.

1. In the [distributed representations as features](sst_03_neural_networks.ipynb#Distributed-representations-as-features) section, we just summed all of the leaf-node GloVe vectors to obtain a fixed-dimensional representation for all sentences. This ignores all of the tree structure. See if you can do better by paying attention to the binary tree structure: write a function `glove_subtree_phi` that obtains a vector representation for each subtree by combining the vectors of its daughters, with the leaf nodes again given by GloVe (any dimension you like) and the full representation of the sentence given by the final vector obtained by this recursive process. You can decide on how you combine the vectors.

1. If you have a lot of computing resources, then you can fire off a large hyperparameter search over many parameter values. All the model classes for this course are compatible with the `scikit-learn` and [scikit-optimize](https://scikit-optimize.github.io) methods, because they define the required functions for getting and setting parameters.
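
As one illustration of the subtree-composition idea in the list above, here is a toy recursive sketch. It is not the system submitted in this notebook. It assumes trees are nested `(left, right)` tuples with strings at the leaves, uses a made-up 2-dimensional `lookup` in place of GloVe, and picks the elementwise mean as an arbitrary combination function; real `nltk.Tree` objects would also need handling for unary nodes.

```python
def combine(left, right):
    """Elementwise mean of two equal-length vectors (one arbitrary choice)."""
    return [(a + b) / 2.0 for a, b in zip(left, right)]

def toy_subtree_phi(tree, lookup, dim):
    """`tree` is either a word (str) or a (left, right) pair of subtrees."""
    if isinstance(tree, str):
        # Unknown words fall back to a zero vector.
        return list(lookup.get(tree, [0.0] * dim))
    left, right = tree
    return combine(
        toy_subtree_phi(left, lookup, dim),
        toy_subtree_phi(right, lookup, dim))

# Made-up 2-d vectors standing in for GloVe.
lookup = {'not': [-1.0, 0.0], 'very': [0.0, 2.0], 'good': [1.0, 2.0]}
tree = ('not', ('very', 'good'))
vec = toy_subtree_phi(tree, lookup, dim=2)
print(vec)  # [-0.25, 1.0]
```

Unlike a flat sum, this gives words attached near the root (here `not`) more weight than deeply embedded ones, which is one way the tree structure can influence the final vector.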

We want to emphasize that this needs to be an __original__ system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it.

__Please include a brief prose description of your system along with your code, to help the teaching team understand the structure of your system.__

### Model Description

We tried several approaches for our own system, experimenting with `TorchShallowNeuralClassifier`, `RandomForestClassifier`, and a custom 4-layer neural network (`TorchCustomNeuralClassifier`) before settling on an RNN built on `TorchRNNClassifierModel`. We extended that class to support multiple LSTM layers and dropout.

Our final system is an RNN that:

1. Uses contextual representations from the BERT `uncased_L-12_H-768_A-12` model, with the per-word vectors as features.
2. Was tuned over the hyperparameter space, using various values for `hidden_dim`, `dropout`, and bidirectionality.
3. Was tried with both 1 and 2 LSTM layers. However, we did not edit the classifier portion of the RNN.

In [52]:
from bert_serving.client import BertClient 

def bert_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_lookup[s]

def bert_rnn_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_word_lookup[s]

def bert_reduce_mean(X):
    return X.mean(axis=1)  

bc = BertClient(check_length=False)

# Read train and dev
sst_train_reader = sst.train_reader(
    SST_HOME, class_func=sst.ternary_class_func)
sst_train = [(" ".join(t.leaves()), label) for t, label in sst_train_reader]

sst_dev_reader = sst.dev_reader(
    SST_HOME, class_func=sst.ternary_class_func)
sst_dev = [(" ".join(t.leaves()), label) for t, label in sst_dev_reader]

# Zip
X_str_train, y_train = zip(*sst_train)
X_str_dev, y_dev = zip(*sst_dev)

# Process examples into tokens
X_bert_train, bert_train_toks = bc.encode(list(X_str_train), show_tokens=True)
X_bert_dev, bert_dev_toks = bc.encode(list(X_str_dev), show_tokens=True)
    
# Reduce mean
X_bert_train_mean = bert_reduce_mean(X_bert_train)
X_bert_dev_mean = bert_reduce_mean(X_bert_dev)

bert_lookup = {}
for (sents, reps) in ((X_str_train, X_bert_train_mean), 
                      (X_str_dev, X_bert_dev_mean)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_lookup[s] = rep
        

bert_word_lookup = {}
for (sents, reps) in ((X_str_train, X_bert_train), 
                      (X_str_dev, X_bert_dev)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_word_lookup[s] = rep

In [123]:
import numpy as np
import torch
import torch.nn as nn
import torch.utils.data
from torch_model_base import TorchModelBase
from utils import progress_bar

class TorchCustomNeuralClassifier(TorchShallowNeuralClassifier):
    """
    Code based on TorchShallowNeuralClassifier.
    
    Fit a model with three hidden layers

    h1 = f(xW1 + b1)
    h2 = dropout(f(h1W2 + b2))
    h3 = f(h2W3 + b3)
    y = softmax(h3W4 + b4)

    with a cross entropy loss.

    Parameters
    ----------
    hidden_dim_1 : int
        Dimensionality of the first hidden layer.
    hidden_dim_2 : int
        Dimensionality of the second hidden layer.
    hidden_dim_3 : int
        Dimensionality of the third hidden layer.
    hidden_activation : vectorized activation function
        The non-linear activation function used by the network for the
        hidden layer. Default `nn.Tanh()`.
    max_iter : int
        Maximum number of training epochs.
    eta : float
        Learning rate.
    optimizer : PyTorch optimizer
        Default is `torch.optim.Adam`.
    l2_strength : float
        L2 regularization strength. Default 0 is no regularization.
    device : 'cpu' or 'cuda'
        The default is to use 'cuda' iff available

    """
    def __init__(self, **kwargs):
        super(TorchCustomNeuralClassifier, self).__init__(**kwargs)

    def define_graph(self):
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim_1),
            self.hidden_activation,
            nn.Linear(self.hidden_dim_1, self.hidden_dim_2),
            self.hidden_activation,
            torch.nn.Dropout(0.1),
            nn.Linear(self.hidden_dim_2, self.hidden_dim_3),
            self.hidden_activation,
            nn.Linear(self.hidden_dim_3, self.n_classes_))

In [4]:
import numpy as np
from operator import itemgetter
import torch
import torch.nn as nn
import torch.utils.data
from torch_model_base import TorchModelBase
from utils import progress_bar
from torch_rnn_classifier import TorchRNNClassifierModel

class TorchMultilayerRNNClassifierModel(TorchRNNClassifierModel):
    def __init__(self,
            vocab_size,
            embed_dim,
            embedding,
            use_embedding,
            hidden_dim,
            output_dim,
            bidirectional,
            device,
            dropout,
            num_layers):
        super(TorchMultilayerRNNClassifierModel, self).__init__(
            vocab_size,
            embed_dim,
            embedding,
            use_embedding,
            hidden_dim,
            output_dim,
            bidirectional,
            device)
        # Graph
        if self.use_embedding:
            self.embedding = self._define_embedding(
                embedding, vocab_size, self.embed_dim)
            self.embed_dim = self.embedding.embedding_dim
        self.rnn = nn.LSTM(
            input_size=self.embed_dim,
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=bidirectional,
            num_layers=num_layers,
            dropout=dropout)
        if bidirectional:
            classifier_dim = hidden_dim * 2
        else:
            classifier_dim = hidden_dim
        self.classifier_layer = nn.Linear(classifier_dim, output_dim)


class TorchMultilayerRNNClassifier(TorchRNNClassifier):
    """LSTM-based Recurrent Neural Network for classification problems.
    The network will work for any kind of classification task.

    Parameters
    ----------
    vocab : list of str
        This should be the vocabulary. It needs to be aligned with
         `embedding` in the sense that the ith element of vocab
        should be represented by the ith row of `embedding`. Ignored
        if `use_embedding=False`.
    embedding : np.array or None
        Each row represents a word in `vocab`, as described above.
    use_embedding : bool
        If True, then incoming examples are presumed to be lists of
        elements of the vocabulary. If False, then they are presumed
        to be lists of vectors. In this case, the `embedding` and
        `embed_dim` arguments are ignored, since no embedding is needed
        and `embed_dim` is set by the nature of the incoming vectors.
    embed_dim : int
        Dimensionality for the initial embeddings. This is ignored
        if `embedding` is not None, as a specified value there
        determines this value. Also ignored if `use_embedding=False`.
    hidden_dim : int
        Dimensionality of the hidden layer.
    bidirectional : bool
        If True, then the final hidden states from passes in both
        directions are used.
    hidden_activation : vectorized activation function
        The non-linear activation function used by the network for the
        hidden layer. Default `nn.Tanh()`.
    max_iter : int
        Maximum number of training epochs.
    eta : float
        Learning rate.
    optimizer : PyTorch optimizer
        Default is `torch.optim.Adam`.
    l2_strength : float
        L2 regularization strength. Default 0 is no regularization.
    device : 'cpu' or 'cuda'
        The default is to use 'cuda' iff available
    dropout : float
        The amount of dropout to be used in the LSTM. The default is 0.
    num_layers : int
        The number of layers in the LSTM. The default is 1.
    """
    def __init__(self,
            vocab=[],
            embedding=None,
            use_embedding=True,
            embed_dim=50,
            bidirectional=False,
            num_layers=1,
            dropout=0,
            **kwargs):
        super(TorchMultilayerRNNClassifier, self).__init__(
            vocab,
            embedding=embedding,
            use_embedding=use_embedding,
            embed_dim=embed_dim,
            bidirectional=bidirectional,
            **kwargs)
        self.num_layers = num_layers
        self.dropout = dropout

    def build_graph(self):
        return TorchMultilayerRNNClassifierModel(
            vocab_size=len(self.vocab),
            embedding=self.embedding,
            use_embedding=self.use_embedding,
            embed_dim=self.embed_dim,
            hidden_dim=self.hidden_dim,
            output_dim=self.n_classes_,
            bidirectional=self.bidirectional,
            device=self.device,
            dropout=self.dropout,
            num_layers=self.num_layers)

In [159]:
from sklearn.ensemble import RandomForestClassifier

def fit_custom_torch_shallow_neural_classifier(X, y):
    base_mod = TorchShallowNeuralClassifier()
    
    cv = 3
    param_grid = {
        'hidden_dim': [250, 275, 300, 325, 350], 
        'hidden_activation': [nn.Tanh(), nn.ReLU()],
    }   

    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, base_mod, cv, param_grid)
    
    return best_mod

def fit_custom_random_forest_classifier(X, y):
    base_mod = RandomForestClassifier()
    
    cv = 3
    param_grid = {
        'max_depth': [50, 100, 150, 200], 
        'max_features': ['auto', None]
    }   

    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, base_mod, cv, param_grid)
    
    return best_mod

def fit_custom_torch_neural_classifier(X, y):
    base_mod = TorchCustomNeuralClassifier()
    
    cv = 3
    param_grid = {
        'hidden_dim_1': [275, 300, 325], 
        'hidden_dim_2': [150, 175], 
        'hidden_dim_3': [100, 125], 
        'hidden_activation': [nn.Tanh()],
    }   

    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, base_mod, cv, param_grid)
    
    return best_mod

def fit_custom_rnn_classifier(X, y):
    bert_rnn = TorchMultilayerRNNClassifier()

    # Warning: takes a VERY long time to run!
    cv = 3
    param_grid = {
        'vocab': [[]],
        'use_embedding': [False],
        'max_iter': [50],
        'bidirectional': [True, False], 
        'hidden_dim': [50, 100], 
        'num_layers': [1, 2],
        'dropout': [0, 0.1]
    }   

    best_mod = utils.fit_classifier_with_crossvalidation(
        X, y, bert_rnn, cv, param_grid)

    return best_mod
        
_ = sst.experiment(
    SST_HOME,
    bert_rnn_sentence_phi,
    fit_custom_rnn_classifier,
    class_func=sst.ternary_class_func,
    vectorize=False)

  "num_layers={}".format(dropout, num_layers))
Finished epoch 50 of 50; error is 0.0034767751931212842

Best params: {'bidirectional': True, 'dropout': 0, 'hidden_dim': 100, 'max_iter': 50, 'num_layers': 2, 'use_embedding': False, 'vocab': []}
Best score: 0.609
              precision    recall  f1-score   support

    negative      0.691     0.757     0.723       988
     neutral      0.381     0.268     0.315       522
    positive      0.763     0.807     0.785      1054

   micro avg      0.678     0.678     0.678      2564
   macro avg      0.612     0.611     0.607      2564
weighted avg      0.658     0.678     0.665      2564



In [160]:
# Store model in pickle for later usage in bakeoff
import pickle
with open('sst_best_model.pkl', 'wb') as f:
    pickle.dump(_['model'], f)

In [162]:
from sklearn.metrics import classification_report
# Test load
with open('sst_best_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model_preds = loaded_model.predict(X_bert_dev)
print(classification_report(y_dev, loaded_model_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.705     0.710     0.708       428
     neutral      0.331     0.227     0.269       229
    positive      0.725     0.838     0.777       444

   micro avg      0.661     0.661     0.661      1101
   macro avg      0.587     0.592     0.585      1101
weighted avg      0.636     0.661     0.645      1101
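
The save-and-reload pattern used above can be checked in isolation with a small round trip. In this sketch, `toy_model` is a hypothetical stand-in for the trained classifier, and a temporary directory is used so that no file is left behind.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object.
toy_model = {'hidden_dim': 100, 'num_layers': 2}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_model.pkl')
    # Context managers make sure the file handles are closed.
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == toy_model)  # True
```

Note that unpickling an instance of a custom class requires the class definition to be importable or already defined in the session, so the cells defining the model classes need to run before the model is loaded.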



## Bake-off [1 point]

The bake-off will begin on April 22. The announcement will go out on Piazza. As we said above, the bake-off evaluation data is the official SST test set release. For this bake-off, you'll evaluate your original system from the above homework problem on the test set, using the ternary class problem. Rules:

1. Only one evaluation is permitted.
1. No additional system tuning is permitted once the bake-off has started.

To enter the bake-off, upload this notebook on Canvas:

https://canvas.stanford.edu/courses/99711/assignments/187246

The cells below this one constitute your bake-off entry.

Systems that enter will receive the additional homework point, and systems that achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

The bake-off will close at 4:30 pm on April 24. Late entries will be accepted, but they cannot earn the extra 0.5 points. Similarly, you cannot win the bake-off unless your homework is submitted on time.

In [6]:
# Enter your bake-off assessment code in this cell. 
# Please do not remove this comment.

# This is a copy of some code above, but cleaned up and edited to handle the test set.
from bert_serving.client import BertClient 

def bert_rnn_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_word_lookup[s]

def fit_custom_rnn_classifier(X, y):
    # Same as above, but without hyperparameter exploration
    bert_rnn = TorchMultilayerRNNClassifier(
        vocab=[],
        use_embedding=False,
        max_iter=50,
        bidirectional=True,
        hidden_dim=100,
        num_layers=2,
        dropout=0
    )
    bert_rnn.fit(X, y)
    return bert_rnn

bc = BertClient(check_length=False)

# Read train and test
sst_train_reader = sst.train_reader(
    SST_HOME, class_func=sst.ternary_class_func)
sst_train = [(" ".join(t.leaves()), label) for t, label in sst_train_reader]

sst_test_reader = sst.test_reader(
    SST_HOME, class_func=sst.ternary_class_func)
sst_test = [(" ".join(t.leaves()), label) for t, label in sst_test_reader]

# Zip
X_str_train, y_train = zip(*sst_train)
X_str_test, y_test = zip(*sst_test)

# Process examples into tokens
X_bert_train, bert_train_toks = bc.encode(list(X_str_train), show_tokens=True)
X_bert_test, bert_test_toks = bc.encode(list(X_str_test), show_tokens=True)        

bert_word_lookup = {}
for (sents, reps) in ((X_str_train, X_bert_train), 
                      (X_str_test, X_bert_test)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_word_lookup[s] = rep
        
# Added vectorize=False
bakeoff_experiment = sst.experiment(
    SST_HOME,
    bert_rnn_sentence_phi,
    fit_custom_rnn_classifier,
    train_reader=sst.train_reader,
    assess_reader=sst.test_reader,
    class_func=sst.ternary_class_func,
    vectorize=False)

Finished epoch 50 of 50; error is 0.0034276168735232204

              precision    recall  f1-score   support

    negative      0.732     0.752     0.742       912
     neutral      0.318     0.260     0.286       389
    positive      0.791     0.831     0.810       909

   micro avg      0.698     0.698     0.698      2210
   macro avg      0.613     0.614     0.613      2210
weighted avg      0.683     0.698     0.690      2210



In [19]:
# On an otherwise blank line in this cell, please enter
# your macro-average F1 value as reported by the code above. 
# Please enter only a number between 0 and 1 inclusive.
# Please do not remove this comment.
0.613