# Extracting BERT embeddings from the captions

In this notebook, I attempt to extract BERT word embeddings from the captions, using the steps outlined in section 4.4 of [this Overleaf project](https://www.overleaf.com/project/5ca66278f4224d0690dd9e29) and following [this BERT word embeddings tutorial](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/).

More sources consulted:
 - http://mccormickml.com/2019/07/22/BERT-fine-tuning/
 - [Hugging Face's much easier `bert-base-uncased` model card](https://huggingface.co/bert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.)

# TODO:
 - use TFBert and batch tokenize
 - detokenize using WordPiece, perhaps?

In [108]:
# Imports
import torch
from transformers import BertTokenizer, BertModel
#from pytorch_pretrained_bert import BertTokenizer
#from pytorch_pretrained_bert import BertModel

In [109]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: GeForce RTX 2060


In [110]:
import logging
logging.basicConfig(level=logging.INFO)

In [111]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

## No detokenization: word tokens used as-is by BERT

In [121]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Here is the sentence I want embeddings for."
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', max_length=50)
output = model(**encoded_input)

In [122]:
print(encoded_input)
print(output.pooler_output.size())
print(output.last_hidden_state.size())

{'input_ids': tensor([[ 101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005,
         1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}
torch.Size([1, 768])
torch.Size([1, 50, 768])


In [120]:
#trying on a single sentence
# Print the original sentence.
sentence = "Here is the sentence I want embeddings for."
print(' Original: ', sentence)

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentence))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence)))


 Original:  Here is the sentence I want embeddings for.
Tokenized:  ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
Token IDs:  [2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012]
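
The `##` prefix in the tokenized output marks a piece that continues the previous one. BERT's WordPiece tokenizer splits a word greedily, always taking the longest vocabulary entry that matches at the current position. Below is a minimal sketch of that matching. It assumes a tiny hypothetical vocabulary (`toy_vocab`) instead of the real 30,522-entry one, and `wordpiece_split` is an illustrative helper, not part of `transformers`.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {"em", "##bed", "##ding", "##s", "want", "for"}
pieces = wordpiece_split("embeddings", toy_vocab)
unknown = wordpiece_split("xyz", toy_vocab)
print(pieces)
print(unknown)
```

With this toy vocabulary the word `embeddings` splits into the same four pieces as in the output above.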


In [92]:
#trying on a list of sentences:

sentences = [
    "Here is the sentence I want embeddings for.",
    "Our friends won't buy this analysis, let alone the next one we propose.",
    "snowman on a red background",
    "celebrity celebrates with his team mates after the second goal was scored by football player , during the match .",
    "action shot of basketball player , passing the ball during a game .",
    "a sunny winter day by lake .",
    "the extraordinary domed ceiling , with an artist 's hand - painted mural of delicate clouds , creates a vaulted open and airy feel in the space ."
      
]

In [93]:
def bert_tokenize(max_len, sentences, tokenizer):
    """
    Tokenizes a list of sentences with the BERT tokenizer, padding or truncating each one to max_len.
    Returns the input ids and attention masks as tensors of shape (len(sentences), max_len).
    """
    input_ids = []
    attention_masks = []
    for sent in sentences:
        encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,      # Pad & truncate all sentences to this length.
                        padding = 'max_length',    # Replaces the deprecated pad_to_max_length.
                        truncation = True,         # Truncate explicitly to avoid the tokenizer warning.
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    # Print sentence 0, now as a list of IDs.
    print('Original: ', sentences[0])
    print('Token IDs:', input_ids[0])
    # Display the tokens behind those IDs, one per line.
    for token in tokenizer.convert_ids_to_tokens(input_ids[0].tolist()):
        print(token)

    return input_ids, attention_masks



In [94]:
input_ids, attention_masks = bert_tokenize(50, sentences, tokenizer)


Original:  Here is the sentence I want embeddings for.
Token IDs: tensor([ 101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005,
        1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0])
[CLS]
here
is
the
sentence
i
want
em
##bed
##ding
##s
for
.
[SEP]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
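
The zeros after `102` (`[SEP]`) in the token IDs are `[PAD]` tokens, and the attention mask tells the model to ignore them. Here is a rough sketch of that bookkeeping in plain Python. `pad_and_mask` is a hypothetical helper, and it truncates by simply cutting on the right, whereas the real tokenizer also keeps `[SEP]` as the last token when it truncates.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or cut) a list of token ids to max_len and build its attention mask."""
    ids = token_ids[:max_len]            # simplification: plain right-truncation
    mask = [1] * len(ids)                # 1 = real token
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding


# Token ids of the first sentence, including [CLS] (101) and [SEP] (102).
ids = [101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012, 102]
padded, mask = pad_and_mask(ids, 50)
print(len(padded), sum(mask))
```

The padded list and mask have the same layout as the `input_ids` and `attention_mask` tensors printed earlier for this sentence.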




In [95]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz not found in cache, downloading to C:\Users\apra\AppData\Local\Temp\tmp4dg68ris
100%|██████████| 407873900/407873900 [00:05<00:00, 70118507.15B/s]
INFO:pytorch_pretrained_bert.file_utils:copying C:\Users\apra\AppData\Local\Temp\tmp4dg68ris to cache at C:\Users\apra\.pytorch_pretrained_bert\9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for C:\Users\apra\.pytorch_pretrained_bert\9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.file_utils:removing temp file C:\Users\apra\AppData\Local\Temp\tmp4dg68ris
INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-u

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Lin

In [106]:
def bert_embed(input_ids, model):
    """
    Given a (num_sentences, max_len) tensor of input ids, returns a numpy array of last-layer
    embeddings of shape (num_sentences, max_len, 768).
    TODO: disregard embedding of CLS token
    """
    # [PAD] has id 0 in bert-base-uncased, so the attention mask can be rebuilt from the ids.
    attention_mask = (input_ids != 0).long()
    with torch.no_grad():
        # Pass the whole batch at once: the model expects 2-D (batch, seq_len) input, and feeding
        # one 1-D sentence at a time is what raised "IndexError: Dimension out of range".
        # token_type_ids default to zeros, which is correct for single-sentence input.
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # last_hidden_state is the output of the final encoder layer
    # (encoded_layers[11] in the older pytorch_pretrained_bert API).
    return outputs.last_hidden_state.cpu().numpy()


In [107]:
embeddings = bert_embed(input_ids, model)

## Try detokenization

This just means that the embedding of a word that is split into several pieces, like `em`, `##bed`, `##ding`, `##s`, will be the average of the embeddings of its pieces.
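
A minimal sketch of that averaging, assuming the per-token vectors are available as plain lists of floats (for example from `.tolist()` on one sentence of the model output). `merge_wordpieces` is a hypothetical helper, and toy 2-dimensional vectors stand in for the 768-dimensional BERT outputs.

```python
def merge_wordpieces(tokens, vectors):
    """Average the vectors of WordPiece tokens that belong to the same word.

    tokens  -- list of WordPiece strings, e.g. ['em', '##bed', '##ding', '##s']
    vectors -- list of equal-length float lists, one per token
    Returns (words, word_vectors).
    """
    special = {"[CLS]", "[SEP]", "[PAD]"}
    groups = []  # each entry: [word string, list of the vectors of its pieces]
    for token, vector in zip(tokens, vectors):
        if token in special:
            continue  # special tokens are not words, so skip them
        if token.startswith("##") and groups:
            # Continuation piece: attach it to the word being built.
            groups[-1][0] += token[2:]
            groups[-1][1].append(vector)
        else:
            groups.append([token, [vector]])

    words = [word for word, _ in groups]
    # Element-wise mean over the pieces of each word.
    word_vectors = [
        [sum(dim) / len(pieces) for dim in zip(*pieces)]
        for _, pieces in groups
    ]
    return words, word_vectors


tokens = ["[CLS]", "want", "em", "##bed", "##ding", "##s", "[SEP]", "[PAD]"]
vectors = [[9.0, 9.0], [1.0, 2.0], [1.0, 0.0], [3.0, 0.0],
           [5.0, 4.0], [7.0, 4.0], [9.0, 9.0], [0.0, 0.0]]
words, word_vectors = merge_wordpieces(tokens, vectors)
print(words)
print(word_vectors)
```

Dropping `[CLS]`, `[SEP]` and `[PAD]` here is a design choice for word-level embeddings; a sentence-level representation would treat them differently.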