<a href="https://colab.research.google.com/github/chebbal/data-centric-deep-learning/blob/main/week1/Copy_of_Data_Quality_Deep_Learning_Refresher.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Quality: Deep Learning Refresher


> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.


## Overview

If it has been a minute since you have seen deep learning, this notebook is a good opportunity to review how training a neural network works, how to use the PyTorch library, and how to make predictions with a trained model. If you have just completed a deep learning course, or are quite familiar with the space, it never hurts to see the same concept twice! 

We will be assuming that you have used PyTorch before and are comfortable with the ins & outs of machine learning, and the basics of deep learning. Good pre-requisites include the *co:rise Introduction to AML* and *co:rise Deep Learning* courses or equivalent.

## Goals

The theme for week 1 is data quality. In this notebook, we focus on a text classification task in NLP: predicting whether a sentence has positive or negative sentiment. The learning goals are:
1. Train a deep learning model in PyTorch. 
1. Understand pretrained features and foundation models.
1. Make predictions on new examples using a trained model.
1. Evaluate the performance of a trained model.
1. Get familiar with PyTorch and PyTorch Lightning.

## Instructions

1. We provide starter code and data to give your work a common starting point and scaffolding. You should try to keep function signatures unchanged to support any later usage or grading of your project.
1. Ensure you read through the document and starting code before beginning your work. Understand the overall structure and goals of the project to make your implementation efficient. 

> When we ask you to do a task, it will be indented like this!

## Setup

We will need a GPU for this notebook. Go to Runtime > Change runtime type and choose GPU as the hardware accelerator. If you skip this step, the notebook may run very slowly!

---

In [1]:
# Install dependencies
!pip install datasets
!pip install transformers
!pip install pytorch-lightning
!pip install dotmap
!pip install jsonargparse[signatures]
!pip install --upgrade --no-cache-dir gdown
!pip install torchmetrics

# Import all the required dependencies
import nltk                           # for NLP utilities
import random
import torch                          # deep learning utilities
import numpy as np                    # for array manipulation
import pandas as pd                   # for datasets
from tqdm import tqdm                 # for iteration counters
import matplotlib.pyplot as plt       # for plotting and visualization
from dotmap import DotMap             # for configs 
import torchmetrics                   # for metrics
plt.style.use('ggplot')

from datasets import load_dataset     # to download datasets
nltk.download('stopwords')            # for optional section
nltk.download('wordnet') 
  


True

To start, we will introduce many of the concepts necessary for NLP with a bit of exposition and code. Starting in the *Your Challenge* section you will need to add code to complete cells.

### Dataset

We will be using the Stanford Sentiment TreeBank dataset, or SST. It contains a little under 12k sentences, each annotated with a score between 0 and 1, with 1 indicating positive sentiment. For example, an entry in the dataset is:

> "This was the worst restaurant I have ever had the misfortune of eating at."

Our goal is to train a deep learning model to infer the sentiment for sentences in the test set. A well-trained model will be able to correctly predict the sentiment for new unseen sentences. 

In [2]:
# loading the dataset
train_dataset = load_dataset('sst', split='train')
dev_dataset = load_dataset('sst', split='validation')
test_dataset = load_dataset('sst', split='test')



The dataset is pre-split for you into three portions:
- **training**: as the name suggests, this is used to train the model. 
- **dev**: this is typically used to tune any hyperparameters, and to detect overfitting or underfitting. 
- **test**: this is used to measure final performance. It should not be used to choose hyperparameters or to make any other modeling decisions. 

In [3]:
print(f'{train_dataset.num_rows:,} training examples')
print(f'{dev_dataset.num_rows:,} dev examples')
print(f'{test_dataset.num_rows:,} test examples')

8,544 training examples
1,101 dev examples
2,210 test examples


To get a sense of the data, we can inspect the first example in the training dataset.

In [4]:
print(f"Sentence: {train_dataset[0]['sentence']}")
print(f"Label: {train_dataset[0]['label']}")

Sentence: The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .
Label: 0.6944400072097778


Notice the label is a floating point number. We will be simplifying this to just two labels (by rounding to the closest integer). This way, our model will only need to predict positive or negative. 
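
As a quick illustration of the rounding step, here is a minimal sketch. The helper name `binarize` is ours for illustration and is not part of the dataset or of PyTorch. Note that Python's built-in `round` sends exact ties to the nearest even integer, so a score of exactly 0.5 would become 0.

```python
def binarize(score: float) -> int:
    """Map a sentiment score in [0, 1] to a binary label (0 = negative, 1 = positive)."""
    return int(round(score))


print(binarize(0.6944400072097778))  # score of the first training example shown above
print(binarize(0.25))                # a made-up low score, for illustration only
```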

### Tokenization

Natural language is not the easiest form of input for neural networks to digest: the vocabulary (or number of unique words) is very high, most words do not appear in every sentence, and sentences can be of very different lengths. What is commonly done in practice is *tokenization* where every unique word (or word piece) is mapped to a unique integer. For example, cat --> 0, dog --> 1, ... Then a sentence is converted to a vector of integers, rather than a vector of strings. Since proper tokenization can be tricky, we use the Huggingface toolkit.
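
To make the idea concrete, here is a toy word-level tokenizer in plain Python. It is only a sketch: the function names `build_vocab` and `encode` and the example sentences are made up for illustration, and the Huggingface tokenizer used below works on sub-word pieces and adds special tokens, which this sketch does not attempt.

```python
def build_vocab(sentences):
    """Assign each unique lowercase word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_id=-1):
    """Convert a sentence to a list of ids; words outside the vocabulary map to `unk_id`."""
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]


vocab = build_vocab(["the cat sat", "the dog sat down"])
print(vocab)
print(encode("the dog sat", vocab))
print(encode("the bird sat", vocab))  # "bird" was never seen, so it becomes unk_id
```

Words that never appeared when the vocabulary was built have no id, which is one reason practical tokenizers split text into smaller word pieces.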

In [5]:
from transformers import RobertaTokenizer

# the first time running this, it will download a few files
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')


In [6]:
tokenized = tokenizer(
  'hello world!', 
  truncation=True, 
  padding='max_length',  # pad up to the model's maximum length
  return_attention_mask=True,
  return_tensors='pt',
)

In [7]:
tokenized.keys()

dict_keys(['input_ids', 'attention_mask'])

There are two keys in the `tokenized` object:

1. `input_ids` contains a torch tensor of token ids (vocab ids).
2. `attention_mask` marks which positions hold real tokens and which hold padding. The maximum sequence length for this model is 512 tokens, so with `padding='max_length'` any shorter sentence is padded to 512. 

In [8]:
print(tokenized['input_ids'][0][:5])
print(tokenized['input_ids'].squeeze(0).shape)

tensor([    0, 42891,   232,   328,     2])
torch.Size([512])


There are 5 non-padding tokens. The middle three correspond to "hello", " world", and "!". The first is a special token that marks the start of the sentence; the fifth is another special token that marks the end of the sentence.
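
The relationship between padding and the attention mask can be sketched in plain Python. The helper `pad_with_mask` is hypothetical, and the padding id of 1 is an assumption about the `roberta-base` vocabulary (check `tokenizer.pad_token_id` to confirm). We pad to 8 instead of 512 to keep the output readable, and the truncation here is naive compared with what the real tokenizer does.

```python
def pad_with_mask(token_ids, max_length, pad_id):
    """Pad (or naively truncate) a list of ids and build the matching attention mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a padding position
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


ids = [0, 42891, 232, 328, 2]  # the five non-padding ids printed above
PAD_ID = 1                     # assumed padding id, see the note above
input_ids, attention_mask = pad_with_mask(ids, max_length=8, pad_id=PAD_ID)
print(input_ids)
print(attention_mask)
print(sum(attention_mask), "non-padding tokens")
```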

### Data I/O

To start, we will need to build a `Dataset` object, which is used to serve data to a deep learning model. PyTorch offers a `Dataset` class that helps us do this. In particular, this class has three important methods:
1. `__init__`: often, the initialization function is used to preload data and set class variables. 
2. `__getitem__`: this function serves the logic of returning the example with index `index`. The outputs of this will be used in minibatching. 
3. `__len__`: this computes the total number of examples and is used when looping through the dataset.
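
The three methods can be sketched without any deep learning library. The class below is a plain-Python stand-in that follows the same protocol but does not subclass `torch.utils.data.Dataset`; the names `ToyDataset` and `iterate_minibatches`, as well as the sentences and scores, are made up for illustration.

```python
class ToyDataset:
    """A stand-in that follows the map-style dataset protocol."""

    def __init__(self, sentences, scores):
        # preload data and set class variables
        assert len(sentences) == len(scores)
        self.sentences = sentences
        self.scores = scores

    def __getitem__(self, index):
        # return the example stored at position `index`
        return {
            'sentence': self.sentences[index],
            'label': round(self.scores[index]),
        }

    def __len__(self):
        # total number of examples
        return len(self.sentences)


def iterate_minibatches(dataset, batch_size):
    """Yield lists of examples, relying only on __getitem__ and __len__."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset(["great film", "dull plot", "a fine cast"], [0.9, 0.1, 0.7])
batches = list(iterate_minibatches(toy, batch_size=2))
print(len(toy), [len(batch) for batch in batches])
```

A real data loader additionally shuffles the examples and stacks them into tensors, but it drives the dataset through these same two access methods.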

In [9]:
from torch.utils.data import Dataset


class SST(Dataset):
  """
  The Stanford Sentiment TreeBank dataset. 

  Argument
  --------
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    if split == 'dev': split = 'validation'  # match their expectations
    self.split = split
    self.data = load_dataset('sst', split = split)
    self.tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

  def __getitem__(self, index):
    sentence = self.data[index]['sentence']
    # remember we want to round!
    label = round(self.data[index]['label'])

    tokenized = self.tokenizer(
      sentence,
      truncation=True, 
      padding='max_length',  # pad up to the model's maximum length
      return_attention_mask=True,
      return_tensors='pt',
    )
    print( tokenized['input_ids'].squeeze(0).shape)
    output = {
      # think through these ops: why am I reshaping?
      'input_ids': tokenized['input_ids'].squeeze(0),
      'attention_mask': tokenized['attention_mask'].squeeze(0),
      'label': label,
    }
    return output

  def __len__(self):
    return self.data.num_rows

Let's try creating a training dataset and fetching the first item.

In [10]:
dataset = SST(split = 'train')
row = dataset.__getitem__(0)
print(row.keys())

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


torch.Size([512])
dict_keys(['input_ids', 'attention_mask', 'label'])


### Pretrained Features

Training high quality NLP models usually requires lots of data. In fact, the best models are trained over billions of data points crawled from all over the internet. Unfortunately, not everyone has the ability to do this (you and I certainly don't!). 
To make this easier, we leverage *pre-trained models*, or large models customized for a modality (e.g. natural language) that can convert a sentence into a vector representation (of fixed size). 

A popular family of pretrained models for text is BERT (Bidirectional Encoder Representations from Transformers), which has been shown in research and practice to perform well on text classification tasks like sentiment analysis. The Huggingface library (which we used for tokenization) provides pretrained models from this family; in this notebook we load RoBERTa (`roberta-base`), a BERT variant, and use it to compute pretrained features. 

In [11]:
from transformers import RobertaModel

# The first time you run this, it will download a rather large model file. 
# You may get a warning about some weights not being initialized. This is 
# expected and you can ignore it.
pretrained = RobertaModel.from_pretrained("roberta-base")

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Let's try feeding a dataset entry into the model.

In [12]:
dataset = SST(split = 'train')
row = dataset.__getitem__(0)

output = pretrained(
  input_ids = row['input_ids'].unsqueeze(0), 
  attention_mask = row['attention_mask'].unsqueeze(0),
)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


torch.Size([512])


The output contains two fields: 
- `last_hidden_state` is a tensor of shape (1, 512, 768). The first dimension is 1 because we only passed in 1 example. The second dimension is 512 because it is the maximum length allowed. The third dimension is 768, the pretrained model feature size. 
- `pooler_output` is a tensor of shape (1, 768). It represents a "pooled" version of the `last_hidden_state` tensor. We will be treating this as our pretrained features! 

In [13]:
output.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The pretrained model we are using is a big one (`roberta-base` has roughly 125 million parameters)! If we tried to run it inside our training pipeline, training would be slow and we could hit an out-of-memory (OOM) error on Colab. Instead, we computed these features offline so that you can download them.

In [14]:
!gdown --id 17fCCxc0XrfCxLs9uUp9v1OY7v1Sc9pPP
!gdown --id 18_KoA8kLhg_GIxzyjLqMID1I51397F1X
!gdown --id 1koPWrlRKXV_nQmyGVHE7Z4X0VLrAjn-_

Downloading...
From: https://drive.google.com/uc?id=17fCCxc0XrfCxLs9uUp9v1OY7v1Sc9pPP
To: /content/sst-roberta-train.pt
100% 26.2M/26.2M [00:00<00:00, 163MB/s] 
Downloading...
From: https://drive.google.com/uc?id=18_KoA8kLhg_GIxzyjLqMID1I51397F1X
To: /content/sst-roberta-dev.pt
100% 3.38M/3.38M [00:00<00:00, 276MB/s]
Downloading...
From: https://drive.google.com/uc?id=1koPWrlRKXV_nQmyGVHE7Z4X0VLrAjn-_
To: /content/sst-roberta-test.pt
100% 6.79M/6.79M [00:00<00:00, 116MB/s]


In [15]:
train_features = torch.load('sst-roberta-train.pt')
dev_features = torch.load('sst-roberta-dev.pt')
test_features = torch.load('sst-roberta-test.pt')

> Make a new dataset that serves these features. We will need your help this time!

In [16]:
from torch.utils.data import Dataset


class SSTBERT(Dataset):
  """
  The Stanford Sentiment TreeBank dataset with BERT features. 

  Argument
  --------
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    self.features = torch.load(f'sst-roberta-{split}.pt').cpu()
    if split == 'dev': split = 'validation'  # match their expectations
    self.split = split
    self.data = load_dataset('sst', split = split)

  def __getitem__(self, index):
    features = None
    label = None
    # ================================
    # FILL ME OUT
    # 
    # Return a list of two objects, features and label. 
    # You may find the `index` variable useful.
    # Solution code is two lines.
    # 
    # Pseudocode
    # --
    # features = ...
    # label = ...
    # 
    # Types
    # --
    # features: torch.FloatTensor
    # label: numeric
    # ================================
    features =  self.features[index]
    label = self.data[index]['label']
    return features, round(label)

  def __len__(self):
    return self.data.num_rows

In [17]:
ex_feat, ex_label = SSTBERT('train')[0]
print(ex_feat.shape)
print(ex_label)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)


torch.Size([768])
1


## Your Challenge! 

Now that we have reviewed a few of the intricacies of natural language, it is time to train a model. 

> Your challenge will be to fit a multi-layer perceptron on top of the pretrained features to predict sentiment. 

Take care to properly pick a model design, a loss function, an optimizer, and all the hyperparameters we love (and hate) in deep learning. Try a few different options and compare performance on your dev set. When you are ready, pick your best model and evaluate performance on the test set. Take notes on which designs and choices showed improvements.
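
For a binary task like this one, a common choice (though not the only one) is a sigmoid output combined with binary cross-entropy. The arithmetic can be sketched in plain Python; the function names and the example probabilities below are illustrative, and in the notebook itself you would use the PyTorch equivalents rather than these helpers.

```python
import math


def sigmoid(logit):
    """Squash a real-valued logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all examples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


print(sigmoid(0.0))                                 # a logit of 0 means "undecided"
print(binary_cross_entropy([0.9, 0.1], [1, 0]))     # confident and correct: small loss
print(binary_cross_entropy([0.1, 0.9], [1, 0]))     # confident and wrong: large loss
```

The loss grows quickly when the model is confidently wrong, which is what pushes the predicted probabilities toward the true labels during training.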

### Evaluation

To measure performance, we will look at accuracy, although there are many other choices. For more on metrics for binary classifiers, scikit-learn provides a good practical [explanation](https://scikit-learn.org/stable/modules/model_evaluation.html).
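
As a reference point, accuracy from predicted probabilities can be computed as below. This is a plain-Python sketch with made-up probabilities and labels; `accuracy_from_probs` is a hypothetical helper, and counting a probability of exactly 0.5 as positive is our choice here, not something the notebook prescribes.

```python
def accuracy_from_probs(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(pred == label) for pred, label in zip(preds, labels))
    return correct / len(labels)


probs = [0.9, 0.2, 0.6, 0.4]   # illustrative model outputs
labels = [1, 0, 0, 0]          # illustrative ground truth
print(accuracy_from_probs(probs, labels))  # 3 of 4 predictions are correct
```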

In [18]:
# you may find these helpful
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torch.utils.data import random_split

Define the PyTorch model. You can design this as you wish! As a suggestion, don't start too complicated.

In [19]:
class MLP(nn.Module):
  """
  A multi-layer perceptron. 
  """

  def __init__(self, input_dim, output_dim):
    super().__init__()
    # ================================
    # FILL ME OUT
    # 
    # Define any layers or class variables.
    # 
    # Pseudocode
    # --
    # self.fc = ...
    # (add more if you want) ...
    # 
    # Type
    # --
    # self.fc: nn.Module
    # ================================
    self.fc = nn.Sequential(
        nn.Linear(input_dim, 10),
        nn.ReLU(),
        nn.Linear(10, 10),
        nn.ReLU(),   
        #nn.Linear(50, 50),
        #nn.ReLU(),             
        nn.Linear(10, output_dim)              
    )

  def forward(self, x):
    logits = self.fc(x)
    probs = torch.sigmoid(logits)
    # ================================
    # FILL ME OUT
    # 
    # Return a probability between 0 and 1
    # that `x` has positive sentiment. 
    # 
    # Pseudocode
    # --
    # probs = ...
    # 
    # Type
    # --
    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================

    return probs
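
It can be useful to sanity-check the size of a design by hand. The sketch below counts the trainable parameters of a stack of fully connected layers with sizes 768 → 10 → 10 → 1, the sizes used in the `MLP` above; `linear_param_count` is an illustrative helper, not a PyTorch function.

```python
def linear_param_count(in_features, out_features):
    """A linear layer has one weight per input-output pair plus one bias per output."""
    return in_features * out_features + out_features


layer_sizes = [768, 10, 10, 1]
per_layer = [
    linear_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)
print(per_layer)
print(total)
```

The total of 7,811 agrees with the roughly 7.8 K trainable parameters reported in the training log later in this notebook. ReLU layers add no parameters.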

We will be using the PyTorch Lightning framework to do training and evaluation. If you haven't seen the Lightning framework before, check out their [tutorials](https://www.pytorchlightning.ai/tutorials). It offers an easy-to-use framework that makes the many moving pieces of deep learning training feel more manageable. 

In [20]:
import pytorch_lightning as pl

In [21]:
class SSTSystem(pl.LightningModule):

  def __init__(self):
    super().__init__()
    # ================================
    # FILL ME OUT
    # 
    # What should we set `input_dim` and `output_dim`
    # to?
    # 
    # Pseudocode:
    # --
    # input_dim = ...
    # output_dim = ...
    # 
    # Type:
    # --
    # input_dim: integer
    # output_dim: integer
    # ================================
    input_dim = 768
    output_dim = 1
    self.model = MLP(input_dim, output_dim)

  def forward(self, features):
    # ================================
    # FILL ME OUT
    # 
    # Combine the pretrained features and the MLP together
    # to compute probabilities.
    # 
    # Solution code is 1 line.
    # 
    # Pseudocode:
    # --
    # probs = ...
    # 
    # Type:
    # --
    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================
    probs = self.model(features)
    return probs

  def configure_optimizers(self):
    lr = 1e-3
    # ================================
    # FILL ME OUT
    # 
    # Decide an optimizer. As a reminder, the trainable
    # parameters can be accessed using `self.parameters()`.
    # Generally, we recommend using Adam.
    # 
    # Use the learning rate as specified above. 
    # 
    # Solution code is 1 line.
    # 
    # Pseudocode:
    # --
    # optimizer = ...
    # 
    # Type:
    # --
    # optimizer: torch.optim.Optimizer
    # ================================
    optimizer = torch.optim.Adam(params= self.parameters(),lr= lr)
    return optimizer

  def _common_step(self, batch, batch_idx):
    # The `batch` will be a tuple of minibatch elements. 
    features, labels = batch
    
    # ================================
    # FILL ME OUT
    # 
    # Compute the loss. This function will be used
    # for the `training_step`, `validation_step`,  
    # and `test_step`.
    # 
    # This function should return the loss and the 
    # accuracy. Accuracy is computed by rounding the 
    # predicted probabilities to nearest integer.
    # 
    # Solution code is 3 lines.
    # 
    # Pseudocode:
    # --
    # probs = ...
    # loss = ...
    # 
    # Type:
    # --
    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # loss: torch.FloatTensor
    #   shape: 1
    # ================================
    probs = self.forward(features)
    labels = labels.unsqueeze(1).float()
    loss = F.binary_cross_entropy(probs, labels)
    with torch.no_grad():
      # ================================
      # FILL ME OUT
      #
      # Add accuracy computation here. Try to use the outputted 
      # probabilities from the model to compute predictions.
      # Then, compare these to the true labels.
      # 
      # Pseudocode:
      # --
      # preds = ...(use probs)...
      # accuracy = ...
      # 
      # Type:
      # --
      # preds: torch.FloatTensor or np.array
      # accuracy: float
      # ================================
      preds = torch.round(probs.squeeze(1))
      num_correct = torch.sum(preds == labels.squeeze(1)).item()
      total_num = labels.size(0)
      accuracy = num_correct / total_num
    return loss, accuracy

  def training_step(self, train_batch, batch_idx):
    loss, acc = self._common_step(train_batch, batch_idx)
    self.log('train_loss', loss)
    self.log('train_acc', acc, prog_bar=True)
    return loss

  def validation_step(self, dev_batch, batch_idx):
    loss, acc = self._common_step(dev_batch, batch_idx)
    self.log('dev_loss', loss)
    self.log('dev_acc', acc, prog_bar=True)

  def test_step(self, test_batch, batch_idx):
    loss, acc = self._common_step(test_batch, batch_idx)
    self.log('test_loss', loss)
    self.log('test_acc', acc)

  def predict_step(self, batch, batch_idx):
    return self.forward(batch[0])

> Create a `DataModule` for our SST datasets, which makes it easy to handle minibatching. 

For more information, read the documentation [here](https://pytorch-lightning.readthedocs.io/en/stable/extensions/datamodules.html). 

In [22]:
class SSTDataModule(pl.LightningDataModule):
  """
  Data module wrapper around SST datasets.
  
  Arguments
  ---------
  batch_size: (int) minibatch size
    default = 32
  """
  def __init__(self, batch_size: int = 32):
    super().__init__()
    # ================================
    # FILL ME OUT
    # 
    # Initialize the datasets.
    #   
    # Pseudocode:
    # --
    # self.sst_train = ...
    # self.sst_dev = ...
    # self.sst_test = ...
    # 
    # Type:
    # --
    # self.sst_*: SSTBERT
    # ================================
    self.sst_train = SSTBERT('train')
    self.sst_dev = SSTBERT('dev')
    self.sst_test = SSTBERT('test')
    self.batch_size = batch_size

  def train_dataloader(self):
    # Create a dataloader for train dataset.
    return DataLoader(self.sst_train, batch_size=self.batch_size, shuffle=True)

  def val_dataloader(self):
    # Create a dataloader for dev dataset.
    return DataLoader(self.sst_dev, batch_size=self.batch_size)

  def test_dataloader(self):
    # Create a dataloader for test dataset.
    return DataLoader(self.sst_test, batch_size=self.batch_size)

  def predict_dataloader(self):
    # You can also use the test dataset here.
    return DataLoader(self.sst_test, batch_size=self.batch_size)

> Now let's put everything together with `pytorch_lightning.Trainer`.

In [23]:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

In [24]:
def seed_everything(seed, use_cuda=True):
  """
  Important to standardize seeds!
  """
  random.seed(seed)
  torch.manual_seed(seed)
  if use_cuda: 
    torch.cuda.manual_seed_all(seed)
  np.random.seed(seed)
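
To see why seeding matters, here is a small standard-library sketch (the helper `sample_numbers` is made up for illustration). Resetting the seed reproduces exactly the same "random" numbers, which is what makes two training runs comparable. The same principle applies to the PyTorch and NumPy generators seeded above.

```python
import random


def sample_numbers(seed, n=3):
    """Seed the generator, then draw `n` pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = sample_numbers(42)
second = sample_numbers(42)
other = sample_numbers(43)
print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```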

PyTorch is sometimes quite exact about the shape and type of objects it expects. When you run this below, you may face some unexpected complaints. Try looking through the stack trace and double checking the shapes and types of tensors in your implementations, such as the `label` object.

In [25]:
# use our hard work!
dm = SSTDataModule(batch_size=32)
model = SSTSystem()

seed_everything(42, use_cuda=True)

checkpoint_callback = ModelCheckpoint(monitor='dev_loss')

trainer = Trainer(
  # you can add lots more custom config here for more advanced
  # functionality like early stopping, learning rate decay, etc.
  max_epochs=80,
  accelerator='gpu', devices=1,     # make sure you enabled GPU runtime
  callbacks=[checkpoint_callback],  # for tracking best checkpoint
)

trainer.fit(model, dm)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /content/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type | Params
-------------------------------
0 | model | MLP  | 7.8 K 
-------------------------------
7.8 K     Trainable params
0         Non-trainable params
7.8 K     Total params
0.031  


In [26]:
results = trainer.test(model, dm, ckpt_path="best")

Restoring states from the checkpoint path at /content/lightning_logs/version_0/checkpoints/epoch=76-step=20559.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at /content/lightning_logs/version_0/checkpoints/epoch=76-step=20559.ckpt



────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.8135746717453003
        test_loss           0.41376781463623047
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


### How did it go?

What design did you find performed the best? What choices did you observe to improve performance? Briefly describe the different techniques you tried and what led to the best results you found. As a baseline, logistic regression achieves around 73% accuracy after 20 epochs of training.

> **Best test accuracy**: 81.42%

> **Your approach**: Test accuracy improved after adding multiple hidden linear layers and increasing the number of training epochs.

## That's all folks!

This concludes our refresher on deep learning and a few of its tools. Over the next few weeks, we will see more and more of these same techniques and tools. The rest of week 1 will focus on data annotation.

---

# Optional: Feature Comparison

What is the point of the pretrained features? Can't we just derive features directly from text? In this optional section, we will compare the performance we achieved above with the performance on (simpler) Word2Vec features. 

One of the benefits of pretrained features from a neural network is that they work directly on natural language sentences. The same cannot be said for non-neural features. We need to do a bit of preprocessing to pull normalized words out of sentences.

In this part, we will apply some common standardization techniques for NLP:

- Lower case (e.g. My name -> my name)
- Remove whitespace (e.g. '   hi' -> 'hi')
- Replace multiple white spaces with one
- Remove punctuation (e.g. hi! -> hi)
- Stop word removal (remove words we consider not useful to the semantic meaning)
- Tokenization
- Lemmatization (e.g. programming -> program)
- Rare word removal
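
To make the first few steps concrete, here is a minimal sketch that chains them on a single string using only the standard library. The `normalize` helper and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for illustration; the notebook itself relies on NLTK's English stop-word list and also lemmatizes, which this sketch omits.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only;
# the notebook uses NLTK's English stop words instead.
TOY_STOP_WORDS = {"the", "is", "a", "to", "and"}

def normalize(text):
    text = text.lower()            # lower case
    text = text.strip()            # remove outer whitespace
    text = " ".join(text.split())  # collapse repeated whitespace
    # drop punctuation characters
    text = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = text.split()          # whitespace tokenization
    # stop word removal
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

example_tokens = normalize("  The movie is   GREAT, and a joy to watch!  ")
print(example_tokens)  # ['movie', 'great', 'joy', 'watch']
```

Only the content-bearing words survive, which is the same effect you should see in the `processed` field later on.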

In [28]:
import string
from collections import defaultdict
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# A set makes the per-token membership check in remove_stop_words O(1).
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def to_lower_case(x):
  # Return lower case of string.
  return x.lower()

def remove_whitespace(x):
  # Remove any outer whitespace (tabs, newlines, spaces) around text.
  return x.strip()

def remove_extra_whitespace(x):
  # Remove any extra space within a sentence.
  # Input is a string.
  # Output is a string.
  return ' '.join(x.split())

def remove_punctuation(x):
  # Remove all punctuation characters. 
  # Input is a string.
  # Output is a string.
  return ''.join([ch for ch in x if ch not in string.punctuation])

def tokenize(x):
  # Convert a string into tokens.
  # Input is a string.
  # Output is a list of strings.
  return x.split()

def remove_stop_words(tokens):
  # Remove tokens from list if token is a stop word.
  # Input is a list of strings.
  # Output is a list of strings.
  return [tok for tok in tokens if tok not in stop_words]

def lemmatize(tokens):
  # Lemmatize every token in list.
  # Input is a list of strings.
  # Output is a list of strings.
  return [lemmatizer.lemmatize(tok) for tok in tokens]

def build_vocab(list_of_tokens):
  vocab = defaultdict(int)
  # ================================
  # FILL ME OUT
  # 
  # Build a dictionary from token to count. 
  # Input is a list of list of strings. 
  # 
  # Solution code is 3 lines.
  # 
  # Pseudocode:
  # --
  # for tokens in list_of_tokens:
  #   for token in tokens:
  #     (do something...?)
  # ================================
  for tokens in list_of_tokens:
    for token in tokens:
      vocab[token] += 1
  return vocab

def remove_rare_words(tokens, vocab, min_count = 3):
  # Only keep tokens that appear more than min_count 
  # number of times in vocab.
  return [tok for tok in tokens if vocab[tok] > min_count]
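
As a quick sanity check on the counting and filtering logic, the sketch below reproduces it on a toy corpus. The names `toy_corpus`, `toy_vocab`, and `keep_frequent` are made up for this example, and `Counter` is used here as an assumed equivalent of the `defaultdict`-based `build_vocab`.

```python
from collections import Counter

# Toy corpus: a list of token lists, mirroring the input to build_vocab.
toy_corpus = [
    ["good", "movie", "good", "plot"],
    ["good", "movie", "bad", "plot"],
    ["good", "movie", "plot", "plot"],
]

# Count every token across the whole corpus.
toy_vocab = Counter(tok for tokens in toy_corpus for tok in tokens)

def keep_frequent(tokens, vocab, min_count=3):
    # Strict inequality: a token seen exactly min_count times is dropped.
    return [tok for tok in tokens if vocab[tok] > min_count]

filtered = [keep_frequent(tokens, toy_vocab) for tokens in toy_corpus]
print(toy_vocab["movie"])  # 3
print(filtered)  # [['good', 'good', 'plot'], ['good', 'plot'], ['good', 'plot', 'plot']]
```

Note that `movie` appears exactly 3 times and is still removed, because the comparison is `>` rather than `>=`.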


Let's use the functions you programmed above to build a `preprocess` function that cleans an entire dataset.

In [29]:
def preprocess(dataset):
  """
  For every entry in dataset, apply all the preprocessing steps 
  to the `sentence` entry in the following order:

  - to_lower_case
  - remove_whitespace
  - remove_extra_whitespace
  - remove_punctuation
  - tokenize
  - remove_stop_words
  - lemmatize
  - build_vocab
  - remove_rare_words

  For every entry, the output of this pipeline will be a list of 
  strings. Combine these tokens into a single sentence by joining
  with a whitespace. 
  
  For example:
  ['hi', 'i', 'am', 'mike'] -> 'hi i am mike' 

  Note that in the future, if we want to convert this to tokens, we
  do not need to call word_tokenize anymore (which is costly), we can
  just split by whitespace.
  """
  all_tokens = []
  pbar = tqdm(total=len(dataset), leave=True, position=0)
  for entry in dataset:
    text = entry['sentence']
    text = to_lower_case(text)
    text = remove_whitespace(text)
    text = remove_extra_whitespace(text)
    text = remove_punctuation(text)
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    tokens = lemmatize(tokens)
    all_tokens.append(tokens)
    pbar.update()
  pbar.close()

  vocab = build_vocab(all_tokens)

  new_dataset = []
  for i in tqdm(range(len(dataset)), leave=True, position=0):
    tokens = all_tokens[i]
    tokens = remove_rare_words(tokens, vocab, 3)
    row_i = dataset[i]
    row_i['processed'] = ' '.join(tokens)
    new_dataset.append(row_i)

  return new_dataset

In [30]:
train_dataset = preprocess(load_dataset('sst', split='train'))
dev_dataset = preprocess(load_dataset('sst', split='validation'))
test_dataset = preprocess(load_dataset('sst', split='test'))

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 8544/8544 [00:03<00:00, 2567.73it/s]
100%|██████████| 8544/8544 [00:00<00:00, 10855.86it/s]
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 1101/1101 [00:00<00:00, 5079.59it/s]
100%|██████████| 1101/1101 [00:00<00:00, 12366.85it/s]
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 2210/2210 [00:00<00:00, 4870.53it/s]
100%|██████████| 2210/2210 [00:00<00:00, 9365.79it/s]


Double check that the first row in your `train_dataset` object has a `processed` entry. Notice that the processed text is not a complete sentence; it contains only the words that the preprocessing deems useful.

In [31]:
train_dataset[0]

{'label': 0.6944400072097778,
 'processed': 'rock destined 21st century new going make even greater arnold schwarzenegger van steven',
 'sentence': "The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 'tokens': "The|Rock|is|destined|to|be|the|21st|Century|'s|new|``|Conan|''|and|that|he|'s|going|to|make|a|splash|even|greater|than|Arnold|Schwarzenegger|,|Jean-Claud|Van|Damme|or|Steven|Segal|.",
 'tree': '70|70|68|67|63|62|61|60|58|58|57|56|56|64|65|55|54|53|52|51|49|47|47|46|46|45|40|40|41|39|38|38|43|37|37|69|44|39|42|41|42|43|44|45|50|48|48|49|50|51|52|53|54|55|66|57|59|59|60|61|62|63|64|65|66|67|68|69|71|71|0'}

### Word2Vec

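Word2Vec maps individual words to dense vector representations (300 dimensions here) such that words used in similar contexts lie close to each other in vector space. In fact, a famous example for Word2Vec is that features of `king` - features of `man` + features of `woman` lands close to the features of `queen`. Such embeddings are trained on large text corpora, ranging from Twitter to collections of books. Note that the model loaded below, `fasttext-wiki-news-subwords-300`, is a set of fastText vectors rather than the original Word2Vec model; gensim exposes it through the same keyed-vector interface, so we use it as a drop-in source of word embeddings.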
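
The analogy arithmetic can be sketched without downloading anything. The 2-D vectors below are invented for illustration (one axis loosely standing for "royalty", the other for "gender") and are not real embeddings; `cosine` and `analogy` are hypothetical helpers, not part of gensim.

```python
import numpy as np

# Toy 2-D "embeddings" invented for this example:
# axis 0 ~ royalty, axis 1 ~ gender. Real vectors have 300 dimensions.
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    # Word closest to vec(a) - vec(b) + vec(c), excluding the query words.
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

nearest = analogy("king", "man", "woman", toy_vectors)
print(nearest)  # queen
```

With real pretrained vectors the match is approximate rather than exact, which is why the result is found by nearest-neighbor search instead of an equality check.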

In [32]:
import gensim.downloader
# this takes some time to download -- it is a big file!
word2vec = gensim.downloader.load('fasttext-wiki-news-subwords-300')



> Fill out the rest of the `SSTWord2Vec` dataset below to serve Word2Vec features for training.

In [33]:
from torch.utils.data import Dataset

class SSTWord2Vec(Dataset):
  """
  The Stanford Sentiment TreeBank dataset with Word2Vec features. 

  Arguments
  ---------
  word2vec: pretrained gensim keyed vectors used to embed tokens
  split: (str) the dataset portion
    Options - train | dev | test
  """

  def __init__(self, word2vec, split = 'train'):
    super().__init__()
    assert split in ['train', 'dev', 'test'], f"Split {split} not supported."
    if split == 'dev': split = 'validation'  # match their expectations
    self.data = preprocess(load_dataset('sst', split=split))
    self.word2vec = word2vec
    self.split = split

  def __getitem__(self, index):
    sentence = self.data[index]['processed']
    tokens = sentence.split()

    feature = []
    # ================================
    # FILL ME OUT
    # 
    # Return a list of two objects, features and labels. 
    # Compute Word2Vec features on the `processed` sentence.
    # 
    # HINT: Use `self.word2vec.get_vector` to get the embedding from token.
    # 
    # If Word2Vec doesn't have an embedding for the word, skip it!
    # 
    # Pseudocode:
    # --
    # for token in tokens:
    #   try:
    #     feat = ...
    #     add to features ...
    #   except KeyError:
    #     pass
    # ================================
    for token in tokens:
      try:
        feature.append(self.word2vec.get_vector(token))
      except KeyError:
        pass

    if len(feature) == 0:
      # if none of the words are in the dictionary, 
      # we can default to zeros
      feature = np.zeros(300)
    else:
      feature = np.stack(feature)
      # We treat the average of word embeddings as the sentence embedding!
      feature = np.mean(feature, axis=0)

    feature = torch.from_numpy(feature).float()
    label = round(self.data[index]['label'])
    return feature, label

  def __len__(self):
    return len(self.data)
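
The core of `__getitem__` is mean pooling over the word vectors that exist, with a zero vector as the fallback. A minimal sketch of that idea follows, using a plain dictionary as a hypothetical stand-in for the pretrained vectors and a toy dimension of 4 instead of 300; `sentence_embedding` is an illustrative name, not part of the starter code.

```python
import numpy as np

EMBED_DIM = 4  # toy dimension; the notebook's vectors have 300 dimensions

# Hypothetical lookup table standing in for the pretrained embeddings.
toy_embeddings = {
    "great": np.array([1.0, 0.0, 2.0, 0.0]),
    "movie": np.array([3.0, 2.0, 0.0, 0.0]),
}

def sentence_embedding(tokens, embeddings, dim=EMBED_DIM):
    # Skip tokens that have no embedding (out-of-vocabulary words).
    vectors = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not vectors:
        # No known words at all: fall back to a zero vector.
        return np.zeros(dim)
    # Average the word vectors to get one fixed-size sentence vector.
    return np.mean(np.stack(vectors), axis=0)

pooled = sentence_embedding(["great", "zzzunknown", "movie"], toy_embeddings)
empty = sentence_embedding(["zzzunknown"], toy_embeddings)
print(pooled)  # [2. 1. 1. 0.]
print(empty)   # [0. 0. 0. 0.]
```

Averaging discards word order, which is one plausible reason these features trail the BERT features used earlier.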

> Use what you have learned above to complete the setup for a PyTorch Lightning data module, and train the same model as you did above using these Word2Vec features. 

In [35]:
import pytorch_lightning as pl

class SSTWord2VecDataModule(pl.LightningDataModule):
  """
  Data module wrapper around SST datasets with Word2Vec.
  
  Arguments
  ---------
  word2vec: pretrained gensim keyed vectors shared by all splits
  batch_size: (int) minibatch size
    default = 32
  """
  def __init__(self, word2vec, batch_size: int = 32):
    super().__init__()
    # ================================
    # FILL ME OUT
    # 
    # Initialize the datasets.
    #  
    # Pseudocode:
    # --
    # self.sst_train = ...
    # self.sst_dev = ...
    # self.sst_test = ...
    #
    # Type:
    # --
    # self.sst_*: SSTWord2Vec
    # ================================
    self.batch_size = batch_size
    self.sst_train = SSTWord2Vec(word2vec, 'train')
    self.sst_dev = SSTWord2Vec(word2vec, 'dev')
    self.sst_test = SSTWord2Vec(word2vec, 'test')

  def train_dataloader(self):
    # Create a dataloader for train dataset.
    return DataLoader(self.sst_train, batch_size=self.batch_size, shuffle=True)

  def val_dataloader(self):
    # Create a dataloader for dev dataset.
    return DataLoader(self.sst_dev, batch_size=self.batch_size)

  def test_dataloader(self):
    # Create a dataloader for test dataset.
    return DataLoader(self.sst_test, batch_size=self.batch_size)

  def predict_dataloader(self):
    # You can also use the test dataset here.
    return DataLoader(self.sst_test, batch_size=self.batch_size)

In [49]:
class SSTWord2VecSystem(pl.LightningModule):

  def __init__(self):
    super().__init__()
    # ================================
    # FILL ME OUT
    # 
    # What should we set `input_dim` and `output_dim`
    # to? This is different than when we were using
    # BERT above!
    # 
    # Pseudocode:
    # --
    # input_dim = ...
    # output_dim = ...
    #
    # Type:
    # --
    # input_dim: integer
    # output_dim: integer
    # ================================
    input_dim = 300
    output_dim = 1
    self.model = MLP(input_dim, output_dim) 

  def forward(self, features):
    # ================================
    # FILL ME OUT
    # 
    # Combine the pretrained features and the MLP together
    # to compute probabilities.
    # 
    # Solution code is 1 line.
    # 
    # Pseudocode:
    # --
    # probs = ...
    # 
    # Type:
    # --
    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # ================================
    probs = torch.sigmoid(self.model(features))
    return probs

  def configure_optimizers(self):
    lr = 1e-3
    # ================================
    # FILL ME OUT
    # 
    # Decide on an optimizer. As a reminder, the trainable
    # parameters can be accessed using `self.parameters()`.
    # 
    # Solution code is 1 line.
    # 
    # Pseudocode:
    # --
    # optimizer = ...
    # 
    # Type:
    # --
    # optimizer: torch.optim.Optimizer
    # ================================
    optimizer = torch.optim.Adam(self.parameters(), lr)
    return optimizer

  def _common_step(self, batch, batch_idx):
    features, labels = batch
    # ================================
    # FILL ME OUT
    # 
    # Compute the loss. The `batch` will be a tuple 
    # of minibatch elements. This function will be used
    # for the `training_step`, `validation_step`, and
    # `test_step`.
    # 
    # This function should return the loss and the 
    # accuracy. Accuracy is computed by rounding the 
    # predicted probabilities to nearest integer.
    # 
    # Solution code is 4 lines.
    # 
    # Pseudocode:
    # --
    # probs = ...
    # loss = ...
    # 
    # Type:
    # --
    # probs: torch.FloatTensor
    #   shape: batch_size x 1
    # loss: torch.FloatTensor
    #   shape: 1
    # ================================
    probs = self.forward(features)
    # Reshape labels to (batch_size, 1) floats so they match `probs`.
    labels = labels.unsqueeze(1).float()
    loss = F.binary_cross_entropy(probs, labels)

    with torch.no_grad():
      # ================================
      # FILL ME OUT
      #
      # Add accuracy computation here.
      # 
      # Pseudocode:
      # --
      # preds = ...(use probs)...
      # accuracy = ...
      # 
      # Type:
      # --
      # preds: torch.Tensor or np.array
      # accuracy: float
      # ================================
      preds = torch.round(probs.squeeze(1))
      num_correct = torch.sum(preds == labels.squeeze(1)).item()
      accuracy = num_correct / labels.size(0)
    return loss, accuracy

  def training_step(self, train_batch, batch_idx):
    loss, acc = self._common_step(train_batch, batch_idx)
    self.log('train_loss', loss)
    self.log('train_acc', acc, prog_bar=True)
    return loss

  def validation_step(self, dev_batch, batch_idx):
    loss, acc = self._common_step(dev_batch, batch_idx)
    self.log('dev_loss', loss)
    self.log('dev_acc', acc, prog_bar=True)

  def test_step(self, test_batch, batch_idx):
    loss, acc = self._common_step(test_batch, batch_idx)
    self.log('test_loss', loss)
    self.log('test_acc', acc)

  def predict_step(self, batch, batch_idx):
    return self.forward(batch[0])
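
To see what `_common_step` computes, here is a pure-Python sketch of binary cross-entropy and thresholded accuracy on a toy batch. The helper names and the numbers are illustrative only. It assumes that rounding a probability to the nearest integer is equivalent to predicting 1 when the probability exceeds 0.5 (a probability of exactly 0.5 rounds down to 0 under round-half-to-even).

```python
import math

def binary_cross_entropy(probs, labels):
    # Mean over the batch of -[y * log(p) + (1 - y) * log(1 - p)].
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

def accuracy(probs, labels):
    # Predict the positive class when the probability exceeds 0.5.
    preds = [1 if p > 0.5 else 0 for p in probs]
    correct = sum(int(pred == y) for pred, y in zip(preds, labels))
    return correct / len(labels)

toy_probs = [0.9, 0.2, 0.6, 0.4]   # hypothetical model outputs
toy_labels = [1, 0, 0, 1]          # hypothetical ground truth

toy_loss = binary_cross_entropy(toy_probs, toy_labels)
toy_acc = accuracy(toy_probs, toy_labels)
print(round(toy_loss, 4))  # 0.5403
print(toy_acc)             # 0.5
```

The two confident, correct predictions contribute little to the loss, while the two wrong ones dominate it, even though accuracy counts all four equally.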

In [55]:
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

dm = SSTWord2VecDataModule(word2vec, batch_size=32)
model = SSTWord2VecSystem()

seed_everything(42, use_cuda=True)
checkpoint_callback = ModelCheckpoint(monitor='dev_loss')

trainer = Trainer(max_epochs=160, gpus=1,
  callbacks=[checkpoint_callback])
trainer.fit(model, dm)

No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 8544/8544 [00:01<00:00, 4995.11it/s]
100%|██████████| 8544/8544 [00:00<00:00, 12305.30it/s]
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 1101/1101 [00:00<00:00, 4906.98it/s]
100%|██████████| 1101/1101 [00:00<00:00, 12629.06it/s]
No config specified, defaulting to: sst/default
Reusing dataset sst (/root/.cache/huggingface/datasets/sst/default/1.0.0/b8a7889ef01c5d3ae8c379b84cc4080f8aad3ac2bc538701cbe0ac6416fb76ff)
100%|██████████| 2210/2210 [00:00<00:00, 4910.38it/s]
100%|██████████| 2210/2210 [00:00<00:00, 12441.36it/s]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

In [56]:
results = trainer.test(model, dm, ckpt_path="best")

Restoring states from the checkpoint path at /content/lightning_logs/version_5/checkpoints/epoch=107-step=28836.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from checkpoint at /content/lightning_logs/version_5/checkpoints/epoch=107-step=28836.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.7167420983314514
        test_loss           0.6464707851409912
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


### How did it go?

How does this compare to using BERT features?

As a baseline, logistic regression on Word2Vec features achieves around 71% accuracy after 20 epochs of training. Recall that with BERT features, logistic regression reaches 73%, a difference of 2 percentage points!

> **Word2Vec test accuracy**: 71.67% 

> **BERT test accuracy**: 81.42%

Optionally, post your results on Slack! We'd love to see students share their progress 🥰