# Vector-space models: Static representations from contextual models

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2021"

## Contents

1. [Overview](#Overview)
1. [General set-up](#General-set-up)
1. [Loading Transformer models](#Loading-Transformer-models)
1. [The basics of tokenizing](#The-basics-of-tokenizing)
1. [The basics of representations](#The-basics-of-representations)
1. [The decontextualized approach](#The-decontextualized-approach)
  1. [Basic example](#Basic-example)
  1. [Creating a full VSM](#Creating-a-full-VSM)
1. [The aggregated approach](#The-aggregated-approach)
1. [Some related work](#Some-related-work)

## Overview



Can we get good static representations of words from models (like BERT) that supply only contextual representations? On the one hand, contextual models are very successful across a wide range of tasks, in large part because they are trained for a long time on a lot of data. This should be a boon for VSMs as we've designed them so far. On the other hand, the goal of having static representations might seem to be at odds with how these models process and represent examples. Part of the point is to obtain different representations for words depending on the context in which they occur, and a hallmark of the training procedure is that it processes sequences rather than individual words.

[Bommasani et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.431) make a significant step forward in our understanding of these issues. Ultimately, they arrive at a positive answer: excellent static word representations can be obtained from contextual models. They explore two strategies for achieving this:

1. __The decontextualized approach__: just process individual words as though they were isolated texts. Where a word consists of multiple tokens in the model, pool them with a function like mean or max.
1. __The aggregated approach__: process lots and lots of texts containing the words of interest. As before, pool sub-word tokens, and also pool across all the pooled representations.

As Bommasani et al. say, the decontextualized approach "presents an unnatural input" – these models were not trained on individual words, but rather on longer sequences, so the individual words are infrequent kinds of inputs at best (and unattested as far as the model is concerned if the special boundary tokens [CLS] and [SEP] are not included). However, in practice, Bommasani et al. achieve very impressive results with this approach on word similarity/relatedness tasks.

The aggregated approach is even better, but it requires more work and involves more decisions relating to which texts are processed.

This notebook briefly explores both of these approaches, with the goal of making it easy for you to apply these methods in [the associated homework and bakeoff](hw_wordrelatedness.ipynb).

## General set-up



In [1]:
import os
import pandas as pd
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from transformers import RobertaModel, RobertaTokenizer

import utils
import vsm

In [2]:
if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')
device

device(type='cuda', index=0)

In [9]:
DATA_HOME = os.path.join('data', 'vsmdata')

In [4]:
utils.fix_random_seeds()

The `transformers` library does a lot of logging. To avoid ending up with a cluttered notebook, I am changing the logging level. You might want to skip this as you scale up to building production systems, since the logging is very good – it gives you a lot of insights into what the models and code are doing.

In [5]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)
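
If silencing the root logger feels too broad, a narrower option is to raise the level only for loggers in the `transformers` namespace. The sketch below uses only the standard `logging` module. It assumes that the library registers its loggers under the name `transformers`, which matches my understanding of current releases but is worth checking against your installed version. The child logger name is just an illustrative example.

```python
import logging

# Assumption: the library's loggers live under the "transformers" namespace.
hf_logger = logging.getLogger("transformers")
hf_logger.setLevel(logging.ERROR)

# Illustrative child logger: it inherits the ERROR level from its parent,
# while loggers outside this namespace keep their own settings.
child = logging.getLogger("transformers.modeling_utils")
print(child.getEffectiveLevel() == logging.ERROR)
```

This keeps warnings from your own code and from other libraries visible.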

## Loading Transformer models

To start, let's get a feel for the basic API that `transformers` provides. The first step is specifying the pretrained parameters we'll be using:

In [4]:
bert_weights_name = 'bert-large-uncased'

There are lots of other options for pretrained weights. See [this Hugging Face directory](https://huggingface.co/models).

Next, we specify a tokenizer and a model that match both each other and our choice of pretrained weights:

In [5]:
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)

bert_model = BertModel.from_pretrained(bert_weights_name)
bert_model.to(device)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 1024, padding_idx=0)
    (position_embeddings): Embedding(512, 1024)
    (token_type_embeddings): Embedding(2, 1024)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=1024, out_features=1024, bias=True)
            (key): Linear(in_features=1024, out_features=1024, bias=True)
            (value): Linear(in_features=1024, out_features=1024, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=Fals

In [7]:
bert_model.device

device(type='cuda', index=0)

## The basics of tokenizing

It's illuminating to see what the tokenizer does to example texts:

In [13]:
example_text = "Bert knows Snuffleupagus"

Simple tokenization:

In [14]:
bert_tokenizer.tokenize(example_text)

['bert', 'knows', 's', '##nu', '##ffle', '##up', '##ag', '##us']
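
To get a feel for where these pieces come from: BERT uses a WordPiece vocabulary, and to my understanding each word is segmented greedily, longest match first, with `##` marking word-internal pieces. The following is a minimal sketch of that idea, not the library's implementation. The function name and the tiny vocabulary are hypothetical, and the real tokenizer also handles punctuation, casing options, and more.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: treat the word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary; the real one has 30,522 entries.
toy_vocab = {"bert", "knows", "s", "##nu", "##ffle", "##up", "##ag", "##us"}

text = "Bert knows Snuffleupagus"
pieces = [p for w in text.lower().split()
          for p in wordpiece_tokenize(w, toy_vocab)]
print(pieces)
```

With this toy vocabulary, the sketch reproduces the segmentation shown above.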

The `encode` method maps individual strings to indices into the underlying embedding used by the model:

In [15]:
ex_ids = bert_tokenizer.encode(example_text, add_special_tokens=True)

ex_ids

[101, 14324, 4282, 1055, 11231, 18142, 6279, 8490, 2271, 102]

We can get a better feel for what these representations are like by mapping the indices back to "words":

In [30]:
bert_tokenizer.convert_ids_to_tokens(ex_ids)

['[CLS]',
 'bert',
 'knows',
 's',
 '##nu',
 '##ffle',
 '##up',
 '##ag',
 '##us',
 '[SEP]']
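
Conceptually, encoding is a vocabulary lookup plus the two boundary tokens. Here is a toy sketch of that step. The dictionary contains only the ids that appear in the outputs above, and `toy_encode` is a hypothetical helper, not part of `transformers`.

```python
# Ids copied from the outputs above; a real vocabulary is far larger.
toy_ids = {
    "[CLS]": 101, "[SEP]": 102,
    "bert": 14324, "knows": 4282, "s": 1055, "##nu": 11231,
    "##ffle": 18142, "##up": 6279, "##ag": 8490, "##us": 2271}

def toy_encode(tokens, add_special_tokens=True):
    """Look up each token and optionally wrap the ids in [CLS] ... [SEP]."""
    ids = [toy_ids[t] for t in tokens]
    if add_special_tokens:
        ids = [toy_ids["[CLS]"]] + ids + [toy_ids["[SEP]"]]
    return ids

tokens = ['bert', 'knows', 's', '##nu', '##ffle', '##up', '##ag', '##us']
with_special = toy_encode(tokens)
without_special = toy_encode(tokens, add_special_tokens=False)
print(with_special)
print(without_special)
```

The version without special tokens is the kind of input the decontextualized approach works with when boundary tokens are left out.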

Those are all the essential ingredients for working with these parameters in Hugging Face. Of course, the library has a lot of other functionality, but the above suffices for our current application.

## The basics of representations

To obtain the representations for a batch of examples, we use the `forward` method of the model, as follows:

In [31]:
with torch.no_grad():
    reps = bert_model(torch.tensor([ex_ids], device=device), output_hidden_states=True)

In [32]:
for index, hs in enumerate(reps.hidden_states, 1):
    shape = hs.shape
    print(f'Hidden Shape {index}:    {shape}')

Hidden Shape 1:    torch.Size([1, 10, 1024])
Hidden Shape 2:    torch.Size([1, 10, 1024])
Hidden Shape 3:    torch.Size([1, 10, 1024])
Hidden Shape 4:    torch.Size([1, 10, 1024])
Hidden Shape 5:    torch.Size([1, 10, 1024])
Hidden Shape 6:    torch.Size([1, 10, 1024])
Hidden Shape 7:    torch.Size([1, 10, 1024])
Hidden Shape 8:    torch.Size([1, 10, 1024])
Hidden Shape 9:    torch.Size([1, 10, 1024])
Hidden Shape 10:    torch.Size([1, 10, 1024])
Hidden Shape 11:    torch.Size([1, 10, 1024])
Hidden Shape 12:    torch.Size([1, 10, 1024])
Hidden Shape 13:    torch.Size([1, 10, 1024])
Hidden Shape 14:    torch.Size([1, 10, 1024])
Hidden Shape 15:    torch.Size([1, 10, 1024])
Hidden Shape 16:    torch.Size([1, 10, 1024])
Hidden Shape 17:    torch.Size([1, 10, 1024])
Hidden Shape 18:    torch.Size([1, 10, 1024])
Hidden Shape 19:    torch.Size([1, 10, 1024])
Hidden Shape 20:    torch.Size([1, 10, 1024])
Hidden Shape 21:    torch.Size([1, 10, 1024])
Hidden Shape 22:    torch.Size([1, 10, 1024

The return value `reps` is a special `transformers` class that holds a lot of representations. If we want just the final output representations for each token, we use `reps.last_hidden_state`.

Its shape, `torch.Size([1, 10, 1024])`, indicates that our batch has 1 example, with 10 tokens, and each token is represented by a vector of dimensionality 1024.

Aside: Hugging Face `transformers` models also have a `pooler_output` value. For BERT, this corresponds to the output representation above the [CLS] token, which is often used as a summary representation for the entire sequence. However, __we cannot use `pooler_output` in the current context__, as `transformers` adds new randomized parameters on top of it, to facilitate fine-tuning. If we want the [CLS] representation, we need to use `reps.last_hidden_state[:, 0]`.

Finally, if we want access to the output representations from each layer of the model, we use `hidden_states`. This will be `None` unless we set `output_hidden_states=True` when using the `forward` method, as above. 

In [33]:
len(reps.hidden_states)

25

The length 25 corresponds to the initial embedding layer (layer 0) and the 24 layers of this BERT model.
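
Since layer indexing is easy to get wrong, here is a small sketch of how positions in `hidden_states` line up with layers. It uses string labels as stand-ins for the actual tensors, and it assumes the 24-layer configuration of `bert-large-uncased`.

```python
num_layers = 24  # assumption: the layer count of bert-large-uncased

# Stand-in labels for the tensors in `reps.hidden_states`.
hidden_state_names = (
    ["embeddings"] + [f"layer_{i}" for i in range(1, num_layers + 1)])

print(len(hidden_state_names))   # 25
print(hidden_state_names[0])     # embeddings
print(hidden_state_names[1])     # layer_1
print(hidden_state_names[-1])    # layer_24
```

Note that the printing loop earlier in this section starts counting at 1, so its "Hidden Shape 1" is the embedding output (index 0 here).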

The final layer in `hidden_states` is identical to `last_hidden_state`:

In [34]:
reps.hidden_states[-1].shape

torch.Size([1, 10, 1024])

In [35]:
torch.equal(reps.hidden_states[-1], reps.last_hidden_state)

True

## The decontextualized approach

As discussed above, Bommasani et al. (2020) define and explore two general strategies for obtaining static representations for words using a model like BERT. The simpler one involves processing individual words and, where they correspond to multiple tokens, pooling those token representations into a single vector using an operation like mean.

### Basic example

To begin to see what this is like in practice, we'll use the method `vsm.hf_encode`, which maps texts to their ids, taking care to use `unk_token` for texts that can't otherwise be processed by the model.

Where a word corresponds to just one token in the vocabulary, it will get mapped to a single id:

In [36]:
token = bert_tokenizer.tokenize('puppy')

In [37]:
type(bert_tokenizer.encode(token))

list

In [38]:
vsm.hf_encode("puppy", bert_tokenizer)

tensor([[17022]], device='cuda:0')

As we saw above, some words map to multiple tokens:

In [39]:
bert_tokenizer.tokenize('snuffleupagus')

['s', '##nu', '##ffle', '##up', '##ag', '##us']

In [40]:
subtok_ids = vsm.hf_encode("snuffleupagus", bert_tokenizer)

subtok_ids

tensor([[ 1055, 11231, 18142,  6279,  8490,  2271]], device='cuda:0')

Next, the function `vsm.hf_represent` will map a batch of ids to their representations in a user-supplied model, at a specified layer in that model:

In [41]:
subtok_reps = vsm.hf_represent(subtok_ids, bert_model, layer=-1)

subtok_reps.shape

torch.Size([1, 6, 1024])

The shape here: 1 example containing 6 (sub-word) tokens, each of dimension 1024. With `layer=-1`, we obtain the final output representation from the entire model.

The final step is to pool together the 6 tokens. Here, we can use a variety of operations; [Bommasani et al. (2020)](https://www.aclweb.org/anthology/2020.acl-main.431) find that `mean` is the best overall:

In [44]:
mean_tensor = torch.mean(subtok_reps, axis=1)
subtok_reps.shape, mean_tensor.shape

(torch.Size([1, 6, 1024]), torch.Size([1, 1024]))

In [45]:
subtok_pooled = vsm.mean_pooling(subtok_reps)

subtok_pooled.shape

torch.Size([1, 1024])

The function `vsm.mean_pooling` is simply `torch.mean` with `axis=1`. There are also predefined functions `vsm.max_pooling`, `vsm.min_pooling`, and `vsm.last_pooling` (representation for the final token).
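
To make the differences between these pooling options concrete, here is a dependency-free sketch over plain Python lists. The helper names are hypothetical, and I am assuming that max and min are taken elementwise across the sub-word vectors, which is the usual reading but not something this notebook verifies.

```python
def mean_pool(vectors):
    """Elementwise mean across sub-word vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def max_pool(vectors):
    """Elementwise maximum (assumed behavior)."""
    return [max(col) for col in zip(*vectors)]

def min_pool(vectors):
    """Elementwise minimum (assumed behavior)."""
    return [min(col) for col in zip(*vectors)]

def last_pool(vectors):
    """Representation of the final sub-word token."""
    return list(vectors[-1])

# Three hypothetical sub-word vectors of dimension 2.
subtoks = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]

print(mean_pool(subtoks))  # [2.0, 2.0]
print(max_pool(subtoks))   # [3.0, 4.0]
print(min_pool(subtoks))   # [1.0, 0.0]
print(last_pool(subtoks))  # [3.0, 2.0]
```

In every case, a variable number of sub-word vectors is reduced to a single vector of the same dimensionality.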

### Creating a full VSM

Now we want to scale the above process to a large vocabulary, so that we can create a full VSM. The function `vsm.create_subword_pooling_vsm` makes this easy. To start, we get the vocabulary from one of our count VSMs (all of which have the same vocabulary):

In [8]:
vsm_index = pd.read_csv(
    os.path.join(DATA_HOME, 'yelp_window5-scaled.csv.gz'),
    usecols=[0], index_col=0)

vocab = list(vsm_index.index)
len(vocab)

In [50]:
vocab[:5]

['):', ');', '..', '...', ':(']

And then we use `vsm.create_subword_pooling_vsm`:

In [52]:
%%time
pooled_df = vsm.create_subword_pooling_vsm(
    vocab, bert_tokenizer, bert_model, layer=1)

CPU times: user 1min 40s, sys: 87.1 ms, total: 1min 40s
Wall time: 1min 40s


The result, `pooled_df`, is a `pd.DataFrame` with its index given by `vocab`. This can be used directly in the word relatedness evaluations that are central to the homework and associated bakeoff.
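
The overall shape of this computation can be sketched without a model: represent each vocabulary item as a list of sub-word vectors, pool them, and collect the results by word. Everything below is a hypothetical stand-in (in particular, `toy_subword_vectors` just invents vectors from character chunks), so read it as an outline of the loop rather than as what `vsm.create_subword_pooling_vsm` actually does internally.

```python
def toy_subword_vectors(word):
    """Hypothetical stand-in for tokenizing `word` and running the model:
    returns one 2-dimensional vector per 3-character chunk."""
    chunks = [word[i: i + 3] for i in range(0, len(word), 3)]
    return [[float(len(c)), float(ord(c[0]) % 7)] for c in chunks]

def mean_pool(vectors):
    """Elementwise mean across sub-word vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def toy_pooling_vsm(vocab, represent=toy_subword_vectors, pool=mean_pool):
    """Map every vocabulary item to one pooled vector."""
    return {w: pool(represent(w)) for w in vocab}

toy_vsm = toy_pooling_vsm(["zoo", "aquarium"])
print(toy_vsm["zoo"])            # one chunk, so pooling leaves it unchanged
print(len(toy_vsm["aquarium"]))  # three chunks pooled into one 2-d vector
```

A dictionary like this could then be wrapped in a DataFrame indexed by word, which is the form the evaluation code expects.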

In [60]:
pooled_df.loc['zoo']

0      -1.341927
1       0.520249
2      -0.009878
3       0.001161
4      -1.310133
          ...   
1019   -0.662984
1020    0.055996
1021   -1.098887
1022    0.678939
1023   -1.109572
Name: zoo, Length: 1024, dtype: float32

In [61]:
vsm.neighbors('zoo', pooled_df)

zoo         0.000000
aquarium    0.348048
animals     0.428493
museum      0.449523
animal      0.467085
              ...   
seeing      0.910901
letting     0.912019
handed      0.915275
hands       0.916266
waited      0.918262
Length: 6000, dtype: float64
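
The output above is consistent with a cosine distance, since the target word sits at distance 0 from itself. Assuming that is the distance in play, a neighbors function can be sketched in a few lines. The helper names and the 2-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def toy_neighbors(word, vectors):
    """Rank every vocabulary item by distance to `word`, closest first."""
    target = vectors[word]
    dists = {w: cosine_distance(target, v) for w, v in vectors.items()}
    return sorted(dists.items(), key=lambda item: item[1])

# Hypothetical 2-dimensional vectors, not taken from any model.
toy_vectors = {
    "zoo": [1.0, 0.0],
    "aquarium": [0.9, 0.1],
    "waited": [0.0, 1.0]}

ranking = toy_neighbors("zoo", toy_vectors)
for w, d in ranking:
    print(f"{w:10s}{d:.4f}")
```

Because cosine distance ignores vector length, it compares directions only, which is often what we want for representations whose norms vary a lot across words.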

This approach, and the associated code, should work generally for all Hugging Face Transformer-based models. Bommasani et al. (2020) provide a lot of guidance when it comes to how the model, the layer choice, and the pooling function interact.

## Training area

In [19]:
bert_weights_name = 'bert-large-uncased'
bert_tokenizer = BertTokenizer.from_pretrained(bert_weights_name)
bert_model = BertModel.from_pretrained(bert_weights_name)
bert_model.to(device)

In [21]:
from utils import devdf_generator
import time
from tqdm import tqdm

VSM_HOME = os.path.join('data', 'vsmdata')
DATA_HOME = os.path.join('data', 'wordrelatedness')

# Dev set for the word relatedness task
dev = pd.read_csv(os.path.join(DATA_HOME, "cs224u-wordrelatedness-dev.csv"))
highest = devdf_generator(dev, scoring='highest')

# Vocabulary taken from one of the count VSMs
vsm_index = pd.read_csv(
    os.path.join(VSM_HOME, 'yelp_window5-scaled.csv.gz'),
    usecols=[0], index_col=0)
vocab = list(vsm_index.index)
len(vocab)

All problematic word pairs are removed


6000

In [22]:
with torch.no_grad():
    reps = bert_model(torch.tensor([ex_ids], device=device), output_hidden_states=True)

In [23]:
layers = len(reps.hidden_states)
layers

25

In [24]:
scores = {}

start = time.perf_counter()

for layer in tqdm(range(5,25)):
    pooled_df = vsm.create_subword_pooling_vsm(vocab, bert_tokenizer, bert_model, layer=layer)
    df, rho = vsm.word_relatedness_evaluation(highest, pooled_df)
    scores[layer] = rho

elapsed = time.perf_counter() - start
print(f'Total time = {elapsed/60} minutes')

100%|██████████| 20/20 [35:39<00:00, 106.96s/it]

Total time = 35.654120114933356 minutes
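
Once a dictionary like `scores` is filled in, picking and ranking layers takes one line each. The values below are made up purely to show the pattern. They are not results from the run above.

```python
# Made-up scores for illustration only; NOT results from the sweep above.
toy_scores = {5: 0.31, 6: 0.35, 7: 0.33}

# Layer with the highest correlation.
best_layer = max(toy_scores, key=toy_scores.get)

# All layers, best first.
ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)

print(best_layer)
print(ranked)
```

The same two lines apply unchanged to the real `scores` dictionary.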





## The aggregated approach

The aggregated approach is also straightforward to implement given the above tools. To start, we can create a map from vocabulary items into their sequences of ids:

In [30]:
vocab_ids = {w: vsm.hf_encode(w, bert_tokenizer)[0] for w in vocab}

Next, let's assume we have a corpus of texts that contain the words of interest:

In [31]:
corpus = [
    "This is a sailing example",
    "It's fun to go sailing!",
    "We should go sailing.",
    "I'd like to go sailing and sailing",
    "This is merely an example"]

The following embeds every corpus example, keeping `layer=1` representations:

In [39]:
corpus_ids = [vsm.hf_encode(text, bert_tokenizer)
              for text in corpus]
corpus_ids

[tensor([[2023, 2003, 1037, 8354, 2742]], device='cuda:0'),
 tensor([[2009, 1005, 1055, 4569, 2000, 2175, 8354,  999]], device='cuda:0'),
 tensor([[2057, 2323, 2175, 8354, 1012]], device='cuda:0'),
 tensor([[1045, 1005, 1040, 2066, 2000, 2175, 8354, 1998, 8354]],
        device='cuda:0'),
 tensor([[2023, 2003, 6414, 2019, 2742]], device='cuda:0')]

In [46]:
corpus_reps = [vsm.hf_represent(ids, bert_model, layer=1)
               for ids in corpus_ids]
corpus_reps[4].shape

torch.Size([1, 5, 1024])

Finally, we define a convenience function for finding all the occurrences of a sublist in a larger list:

In [34]:
def find_sublist_indices(sublist, mainlist):
    indices = []
    length = len(sublist)
    for i in range(0, len(mainlist)-length+1):
        if mainlist[i: i+length] == sublist:
            indices.append((i, i+length))
    return indices

For example:

In [35]:
find_sublist_indices([1,2], [1, 2, 3, 0, 1, 2, 3])

[(0, 2), (4, 6)]

And here's an example using our `vocab_ids` and `corpus`:

In [36]:
sailing = vocab_ids['sailing']

In [37]:
sailing

tensor([8354], device='cuda:0')

In [103]:
sailing_reps = []

for ids, reps in zip(corpus_ids, corpus_reps):
    # Convert to plain lists so that multi-token words compare correctly
    offsets = find_sublist_indices(sailing.tolist(), ids.squeeze(0).tolist())
    for (start, end) in offsets:
        pooled = vsm.mean_pooling(reps[:, start: end])
        sailing_reps.append(pooled)

sailing_rep = torch.mean(torch.cat(sailing_reps), axis=0).squeeze(0)

In [104]:
sailing_rep.shape

torch.Size([1024])
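
The same two-stage pooling can be written out end to end on plain lists, which makes it easy to check the logic for words that span several tokens. This is a sketch with hypothetical ids and 2-dimensional vectors, and `aggregate` is an illustrative helper, not a function from `vsm`.

```python
def find_sublist_indices(sublist, mainlist):
    """All (start, end) spans where `sublist` occurs in `mainlist`."""
    length = len(sublist)
    return [(i, i + length)
            for i in range(len(mainlist) - length + 1)
            if mainlist[i: i + length] == sublist]

def mean_pool(vectors):
    """Elementwise mean across vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def aggregate(word_ids, corpus_ids, corpus_reps):
    """Pool sub-word vectors per occurrence, then average over occurrences."""
    occurrence_reps = []
    for ids, reps in zip(corpus_ids, corpus_reps):
        for start, end in find_sublist_indices(word_ids, ids):
            occurrence_reps.append(mean_pool(reps[start:end]))
    if not occurrence_reps:
        return None  # the word never occurs in the corpus
    return mean_pool(occurrence_reps)

# Hypothetical ids and vectors; [7, 8] plays the role of a two-token word.
corpus_ids = [[1, 7, 8, 2], [7, 8, 3]]
corpus_reps = [
    [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],
    [[5.0, 6.0], [7.0, 8.0], [0.0, 0.0]]]

word_rep = aggregate([7, 8], corpus_ids, corpus_reps)
print(word_rep)  # [4.0, 5.0]
```

Returning `None` for unseen words is one possible design choice. In a real system you would need to decide how to handle vocabulary items that never appear in the corpus, for example by falling back to the decontextualized representation.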

The above building blocks could be used as the basis for an original system and bakeoff entry for this unit. The major question is probably which data to use for the corpus.

## Some related work

1. [Ethayarajh (2019)](https://www.aclweb.org/anthology/D19-1006/) uses dimensionality reduction techniques (akin to LSA) to derive static representations from contextual models, and explores layer-wise variation in detail, with findings that are likely to align with your experiences using the above techniques.

1. [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) explore techniques similar to those of Bommasani et al. specifically for the supervised task of named entity recognition.

1. [Wang et al. (2020)](https://arxiv.org/pdf/1911.02929.pdf) learn static representations from contextual ones using techniques adapted from the word2vec model.