# **BERT Experiments**

1. [x] load BERT model
2. [x] run inference / get embeddings
3. [x] invert embeddings
4. [x] edit embeddings
5. [x] invert edited embeddings

## References

* http://nlp.seas.harvard.edu/2018/04/03/attention.html
* https://huggingface.co/transformers/notebooks.html
* https://huggingface.co/transformers/_modules/transformers/pipelines/fill_mask.html#FillMaskPipeline


## Environment

In [1]:
!python --version

Python 3.7.10


In [2]:
!pip install tokenizers
!pip install transformers

Collecting tokenizers
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.10.1
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: f

## Dependencies

In [3]:
import torch
import numpy as np
from scipy.special import softmax
from transformers import pipeline
from transformers import AutoModel, AutoTokenizer

## Settings

In [4]:
bert_model_name = 'bert-base-cased'

## 1. Load BERT model

In [5]:
%time unmasker = pipeline('fill-mask', model=bert_model_name)
%time tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


CPU times: user 14.9 s, sys: 2 s, total: 16.9 s
Wall time: 21.3 s
CPU times: user 102 ms, sys: 13.8 ms, total: 115 ms
Wall time: 1.29 s


## 2. Run inference

In [6]:
torch.set_grad_enabled(False)

<torch.autograd.grad_mode.set_grad_enabled at 0x7f6b7f450c10>

In [7]:
text1 = 'Wow. Cats eat fish.'
text2 = 'Wow. Dogs eat meat.'
texts = [text1, text2]

In [8]:
tokens = {}

for text in texts:
  tokens_pt = tokenizer(text, return_tensors='pt')
  tokens[text] = tokens_pt

for text, tokens_pt in tokens.items():
  print(f'text: >{text}<')
  for key, value in tokens_pt.items():
    print(f'\t{key}: {value}')
    if key == 'input_ids':
      print(f'\t\ttokens (str): {[tokenizer.convert_ids_to_tokens(s) for s in value]}')
      print(f'\t\t#decoding: {[tokenizer.decode(v) for v in value]}')

text: >Wow. Cats eat fish.<
	input_ids: tensor([[  101, 11750,   119, 17408,  3940,  3489,   119,   102]])
		tokens (str): [['[CLS]', 'Wow', '.', 'Cats', 'eat', 'fish', '.', '[SEP]']]
		#decoding: ['[CLS] Wow. Cats eat fish. [SEP]']
	token_type_ids: tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
	attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1]])
text: >Wow. Dogs eat meat.<
	input_ids: tensor([[  101, 11750,   119, 16406,  3940,  6092,   119,   102]])
		tokens (str): [['[CLS]', 'Wow', '.', 'Dogs', 'eat', 'meat', '.', '[SEP]']]
		#decoding: ['[CLS] Wow. Dogs eat meat. [SEP]']
	token_type_ids: tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
	attention_mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1]])


In [9]:
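embeddings = {}
for text, tokens_pt in tokens.items():
  # Run only the BERT encoder, without the masked-LM head
  %time output = unmasker.model.bert(**tokens_pt)

  # Final-layer hidden states, shape (1, seq_len, hidden_size)
  last_hidden_state = output.last_hidden_state
  #pooler_output = output.pooler_output

  embeddings[text] = last_hidden_state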

CPU times: user 106 ms, sys: 4.87 ms, total: 111 ms
Wall time: 237 ms
CPU times: user 106 ms, sys: 0 ns, total: 106 ms
Wall time: 107 ms


## 3. Invert embeddings

In [22]:
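for text, embedding in embeddings.items():
  print(text)
  # Project the hidden states back onto the vocabulary with the masked-LM head
  cls_output = unmasker.model.cls(embedding)  # logits, shape (1, seq_len, vocab_size)

  for i in range(cls_output.shape[1]):
    # Convert the logits at position i to probabilities and keep the 10 most likely tokens
    probs = softmax(cls_output[0][i].detach().cpu().numpy())
    indices = probs.argsort()[-10:][::-1]
    print([(tokenizer.decode(int(idx)), round(probs[idx], 3)) for idx in indices])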

Wow. Cats eat fish.
[('.', 0.064), (',', 0.022), ('"', 0.018), ('the', 0.017), (')', 0.017), ('and', 0.01), ('of', 0.008), ('?', 0.008), ('to', 0.008), ('-', 0.007)]
[('.', 0.57), ('"', 0.042), ('the', 0.015), (',', 0.011), (')', 0.009), ('of', 0.008), (';', 0.008), ('?', 0.007), ('and', 0.006), ("'", 0.005)]
[('.', 0.999), ('...', 0.001), (':', 0.0), ('!', 0.0), (',', 0.0), ('?', 0.0), ('said', 0.0), ('the', 0.0), ('freaking', 0.0), ('that', 0.0)]
[('Cats', 0.99), ('Dogs', 0.004), ('They', 0.001), ('cats', 0.0), ('People', 0.0), ('Cat', 0.0), ('Horses', 0.0), ('Animals', 0.0), ('Humans', 0.0), ('You', 0.0)]
[('eat', 0.998), ('ate', 0.001), ('eating', 0.0), ('love', 0.0), ('like', 0.0), ('and', 0.0), ('Eat', 0.0), ('are', 0.0), ('do', 0.0), ('eats', 0.0)]
[('fish', 0.999), ('fishes', 0.001), ('Fish', 0.0), ('trout', 0.0), ('salmon', 0.0), ('it', 0.0), ('this', 0.0), ('fishing', 0.0), ('them', 0.0), ('food', 0.0)]
[('.', 0.994), ('!', 0.004), ('?', 0.002), (';', 0.0), ('...', 0.0), (':'
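
The top-10 lookup in the cell above boils down to a softmax followed by a descending sort. A minimal, dependency-free sketch of that step, assuming a toy five-token vocabulary and made-up logits (`toy_vocab`, `toy_logits`, and the helper `top_k` are hypothetical names, not part of the notebook or of `transformers`):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, vocab, k=3):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(vocab[i], round(probs[i], 3)) for i in order[:k]]

# Hypothetical vocabulary and logits for a single token position
toy_vocab = ['Cats', 'Dogs', 'eat', 'fish', '.']
toy_logits = [4.0, 1.0, 0.0, 0.0, -1.0]

print(top_k(toy_logits, toy_vocab, k=2))
```

The notebook does the same thing per position with `scipy.special.softmax` and `argsort` over the full BERT vocabulary.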

## 4. Edit embeddings

In [13]:
cat_embeddings = embeddings[text1]
dog_embeddings = embeddings[text2]

# Element-wise average of the two sentences' final hidden states
new_embeddings = 0.5*(cat_embeddings + dog_embeddings)
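
The average is well defined here because both sentences tokenize to the same length (8 tokens each in the output of `In [8]`), so the two hidden-state tensors have identical shapes. A minimal sketch of the more general case, a weighted mix with an explicit shape check, written with plain nested lists so it runs without `torch` (`interpolate` and `alpha` are hypothetical names introduced for illustration):

```python
def interpolate(a, b, alpha=0.5):
    """Return (1 - alpha) * a + alpha * b for two [seq_len][hidden] nested lists."""
    if len(a) != len(b):
        raise ValueError(f'sequence lengths differ: {len(a)} vs {len(b)}')
    mixed = []
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            raise ValueError('hidden sizes differ')
        mixed.append([(1 - alpha) * x + alpha * y for x, y in zip(row_a, row_b)])
    return mixed

# Toy "embeddings": 2 token positions, hidden size 2 (made-up values)
cat = [[0.0, 2.0], [1.0, 1.0]]
dog = [[2.0, 4.0], [3.0, 1.0]]

print(interpolate(cat, dog))             # alpha=0.5 is the plain average
print(interpolate(cat, dog, alpha=0.0))  # alpha=0 reproduces the first input
```

With `torch` tensors the same mix can be written as `(1 - alpha) * cat_embeddings + alpha * dog_embeddings`, or with `torch.lerp(cat_embeddings, dog_embeddings, alpha)`.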

## 5. Invert edited embeddings

In [16]:
# Project the averaged hidden states back onto the vocabulary
cls_output = unmasker.model.cls(new_embeddings)

for i in range(cls_output.shape[1]):
  # Top-10 tokens at position i, most likely first
  probs = softmax(cls_output[0][i].detach().cpu().numpy())
  indices = probs.argsort()[-10:][::-1]
  print([tokenizer.decode(int(idx)) for idx in indices])

['.', ',', '"', ')', 'the', 'and', 'of', 'to', '?', '-']
['.', '"', 'the', ',', ')', 'of', ';', '?', 'and', "'"]
['.', '...', ':', '!', '?', ',', 'freaking', 'the', 'said', 'that']
['Dogs', 'Cats', 'Dog', 'They', 'People', 'Horses', 'Humans', 'We', 'You', 'I']
['eat', 'ate', 'eating', 'love', 'and', 'are', 'like', 'Eat', 'do', 'eats']
['fish', 'meat', 'food', 'it', 'pork', 'this', 'them', 'me', 'beef', 'that']
['.', '!', '?', ';', '...', ':', ',', '"', '-', 'and']
['.', '?', ';', '...', '!', ':', '"', ',', 'and', '-']
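
One way to read these lists is to compare them with the top-10 lists printed earlier for the unedited `Wow. Cats eat fish.` embeddings. The sketch below does that for position 5 (the `fish`/`meat` slot), using the two token lists copied from the printed outputs; `rank_changes` is a hypothetical helper, not part of the notebook:

```python
def rank_changes(before, after):
    """Return (entered, dropped): tokens new in `after`, and tokens missing from it."""
    before_set, after_set = set(before), set(after)
    entered = [tok for tok in after if tok not in before_set]
    dropped = [tok for tok in before if tok not in after_set]
    return entered, dropped

# Top-10 tokens at position 5, copied from the printed outputs
cats_only = ['fish', 'fishes', 'Fish', 'trout', 'salmon',
             'it', 'this', 'fishing', 'them', 'food']
mixed = ['fish', 'meat', 'food', 'it', 'pork',
         'this', 'them', 'me', 'beef', 'that']

entered, dropped = rank_changes(cats_only, mixed)
print('entered:', entered)
print('dropped:', dropped)
```

Here `meat`, `pork`, and `beef` enter the list while fish-specific tokens such as `trout` and `salmon` drop out, which suggests the averaged vector sits between the two sentences.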
