# Extracting BERT embeddings from the captions

In this notebook, I extract BERT word embeddings from the captions, following the steps outlined in section 4.4 of [this](https://www.overleaf.com/project/5ca66278f4224d0690dd9e29) source and [this](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/) tutorial.

More sources consulted:
 - http://mccormickml.com/2019/07/22/BERT-fine-tuning/
 - [Huggingface's much easier one](https://huggingface.co/bert-base-uncased?text=The+goal+of+life+is+%5BMASK%5D.)

# TODO:
 - use TFBert and batch tokenize
 - detokenize using wordpiece perhaps?

## Setup

In [1]:
## imports
import torch
from transformers import BertTokenizer, BertModel
from tqdm import tqdm
# Older package name for the same library (now `transformers`):
#from pytorch_pretrained_bert import BertTokenizer
#from pytorch_pretrained_bert import BertModel

In [2]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: GeForce RTX 2060
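The later cells in this notebook never move the model or the inputs to `device`, so everything runs on the CPU. A minimal sketch of how the device could be used follows; `toy_model` is a hypothetical stand-in for `BertModel` so the snippet runs without downloading weights. With the real model the equivalent calls would be `model.to(device)` and `input_ids.to(device)`.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for BertModel so the sketch runs without downloading weights.
toy_model = torch.nn.Embedding(num_embeddings=100, embedding_dim=8).to(device)
toy_ids = torch.tensor([[1, 5, 6, 2]]).to(device)

with torch.no_grad():
    out = toy_model(toy_ids)

# Tensors on the GPU must be copied back to the CPU before converting to NumPy.
out_np = out.cpu().numpy()
print(out_np.shape)
```

The `.cpu()` call is a no-op when the tensor is already on the CPU, so the same code works on both kinds of machine.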


In [3]:
import logging
logging.basicConfig(level=logging.INFO)

In [4]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


## Trying out with one sentence

In [5]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Here is the sentence I want embeddings for."
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=50)
output = model(**encoded_input)


In [6]:
print(encoded_input)
print(output.pooler_output.size())
print(output.last_hidden_state.size())

{'input_ids': tensor([[ 101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005,
         1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]])}
torch.Size([1, 768])
torch.Size([1, 50, 768])


In [7]:
#trying on a single sentence
# Print the original sentence.
sentence = "Here is the sentence I want embeddings for."
print(' Original: ', sentence)

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentence))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence)))


 Original:  Here is the sentence I want embeddings for.
Tokenized:  ['here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.']
Token IDs:  [2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012]


## No detokenization: word tokens used as-is by bert (multiple sentences, not vectorized)

In [8]:
#trying on a list of sentences:

sentences = [
    "Here is the sentence I want embeddings for.",
    "Our friends won't buy this analysis, let alone the next one we propose.",
    "snowman on a red background",
    "celebrity celebrates with his team mates after the second goal was scored by football player , during the match .",
    "action shot of basketball player , passing the ball during a game .",
    "a sunny winter day by lake .",
    "the extraordinary domed ceiling , with an artist 's hand - painted mural of delicate clouds , creates a vaulted open and airy feel in the space ."
      
]

In [12]:
def bert_tokenize(max_len, sentences, tokenizer):
    """
    Tokenizes a list of sentences one at a time with the BERT tokenizer,
    padding or truncating each to max_len tokens.
    Returns input ids of shape (num_sentences, 1, max_len) and
    attention masks of shape (num_sentences, max_len).
    """
    input_ids = []
    attention_masks = []
    for sent in sentences:
        encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,
                        return_tensors = 'pt',     # Return pytorch tensors.
                        padding = 'max_length',
                        truncation = True
                   )
        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    # Print sentence 0, now as a list of IDs.
    print('Original: ', sentences[0])
    print('Token IDs:', input_ids[0])
    # Display the tokens that the IDs map back to.
    for token in tokenizer.convert_ids_to_tokens(input_ids[0].tolist()):
        print(token)

    # Add a singleton dimension so that iterating over the result yields
    # one (1, max_len) batch per sentence.
    return torch.unsqueeze(input_ids, 1), attention_masks


In [13]:
input_ids, attention_masks = bert_tokenize(50, sentences, tokenizer)

Original:  Here is the sentence I want embeddings for.
Token IDs: tensor([ 101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005,
        1012,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0])
[CLS]
here
is
the
sentence
i
want
em
##bed
##ding
##s
for
.
[SEP]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]
[PAD]


In [14]:
sentence_num = 6
print('Original: ', sentences[sentence_num], len(sentences[sentence_num]))
print('Token IDs:', input_ids[sentence_num], len(input_ids[sentence_num]))

Original:  the extraordinary domed ceiling , with an artist 's hand - painted mural of delicate clouds , creates a vaulted open and airy feel in the space . 145
Token IDs: tensor([[  101,  1996,  9313, 29208,  5894,  1010,  2007,  2019,  3063,  1005,
          1055,  2192,  1011,  4993, 15533,  1997, 10059,  8044,  1010,  9005,
          1037, 24616,  2330,  1998,  2250,  2100,  2514,  1999,  1996,  2686,
          1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]]) 1


In [15]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [16]:
def bert_embed(input_ids, model):
    """
    Given input ids of shape (num_sentences, 1, max_len), returns a torch tensor of
    embeddings of shape (num_sentences, max_len, 768), taken from BERT's last hidden layer.
    Note: no attention mask is passed, so the [PAD] positions are attended to like real tokens.
    TODO: disregard embedding of CLS token
    """
    input_embeddings = []
    for tokens_tensor in input_ids:
        with torch.no_grad():
            encoded_layers = model(tokens_tensor, output_hidden_states=True)
            # last_hidden_state has shape (1, max_len, 768); drop the batch dimension.
            bert_embedding = encoded_layers.last_hidden_state[0]
            input_embeddings.append(bert_embedding)
    embeddings = torch.stack(input_embeddings)
    return embeddings

In [17]:
embeddings = bert_embed(input_ids, model)
print(embeddings.size())

torch.Size([7, 50, 768])


## Batch Embedding (so it's faster)

In [18]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model

In [19]:
tokenizer

PreTrainedTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [20]:
def bert_tokenize_batch(max_len, sentences, tokenizer):
    """
    Tokenizes a list of sentences in a single call to the BERT tokenizer,
    padding or truncating each to max_len tokens.
    Returns the input ids as a tensor of shape (num_sentences, max_len).
    """
    encoded_dict = tokenizer(
                    sentences,                 # Sentences to encode.
                    add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                    max_length = max_len,      # Pad & truncate all sentences.
                    padding = 'max_length',
                    truncation = True,
                    return_tensors = 'pt',     # Return pytorch tensors.
               )
    # encoded_dict['attention_mask'] is also available here but is not returned.
    input_ids = encoded_dict['input_ids']

    # To inspect the wordpieces of one sentence:
    # print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist()))
    return input_ids

In [21]:
def bert_embed_batch(input_ids, model):
    """
    Given input ids of shape (num_sentences, max_len), returns a torch tensor of
    embeddings of shape (num_sentences, max_len, 768), taken from BERT's last hidden layer.
    Note: no attention mask is passed, so the [PAD] positions are attended to like real tokens.
    TODO: disregard embedding of CLS token
    """
    with torch.no_grad():
        encoded_layers = model(input_ids, output_hidden_states=True)
        bert_embedding = encoded_layers.last_hidden_state
    return bert_embedding

In [22]:
tokenized_sentences = bert_tokenize_batch(50, sentences, tokenizer)

In [24]:
embedded_sentences = bert_embed_batch(tokenized_sentences,model)
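When `BertModel` is called without an `attention_mask`, it treats every position as a real token, so the `[PAD]` positions take part in self-attention. Below is a minimal sketch of a variant that passes the mask. The function name `bert_embed_batch_masked` and the tiny randomly initialised model are assumptions made so the example runs without downloading weights; with the real tokenizer, the mask would come from `encoded_dict['attention_mask']`.

```python
import torch
from transformers import BertConfig, BertModel

def bert_embed_batch_masked(input_ids, attention_mask, model):
    """Variant of bert_embed_batch that also passes the attention mask."""
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state

# Tiny randomly initialised BERT, used only so the sketch runs offline.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config).eval()

# Two made-up padded sequences; 0 is the [PAD] id in bert-base-uncased.
toy_input_ids = torch.tensor([[1, 5, 6, 2, 0, 0], [1, 7, 2, 0, 0, 0]])
toy_attention_mask = (toy_input_ids != 0).long()
toy_embeddings = bert_embed_batch_masked(toy_input_ids, toy_attention_mask, tiny_model)
print(toy_embeddings.size())
```

The output still has one row per position, including the padded ones, so the saved shape would not change. Only the values at the real token positions are affected by masking.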

## Pipeline: folder with one numpy file per image caption, named by image number

There is no need to tokenize the captions and save the tokenized versions: because we use a fixed pretrained BERT tokenizer, this can be done on the fly. We save only the embeddings, then tokenize with the same tokenizer at training/testing time so that the sequence lengths of the captions match those of the embeddings. I think this preserves more information than the alternative strategy of averaging the embeddings of the subwords that belong to the same word in order to keep the word-level sequence length :)
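One way the saved, padded embeddings could be matched to a caption's real length at training time is sketched below. It assumes the attention mask obtained by re-tokenizing the caption with the same settings (`padding='max_length'`, `max_length=50`); `trim_padding` is a hypothetical helper, and the zero array stands in for a loaded embedding.

```python
import numpy as np

def trim_padding(embedding, attention_mask):
    """Keep only the rows of a padded embedding that correspond to real tokens.

    embedding: array of shape (max_len, hidden_size), as saved to disk
    attention_mask: sequence of 0/1 values of length max_len (1 = real token)
    """
    mask = np.asarray(attention_mask).astype(bool)
    return embedding[mask]

max_len, hidden_size = 50, 768
padded = np.zeros((max_len, hidden_size), dtype=np.float32)
# Mask for a caption with 14 tokens including [CLS] and [SEP],
# like the first example sentence in this notebook.
mask = [1] * 14 + [0] * (max_len - 14)
trimmed = trim_padding(padded, mask)
print(trimmed.shape)
```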

In [25]:
import os
import pandas as pd
import numpy as np
data_dir = r'../data/30k_sample/'
meta_file = r'prepped_meta'

In [26]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
model

In [27]:
meta_path = os.path.join(data_dir,meta_file)

In [119]:
meta_df = pd.read_csv(meta_path)
meta_df.iloc[0]

global_index                                                     552309
caption               drum and drumsticks pattern repeat seamless in...
link                  https://thumb7.shutterstock.com/display_pic_wi...
objects               font,line,black-and-white,pattern,design,illus...
mid                   /m/03gq5hm,/m/03scnj,/m/01g6gs,/m/0hwky,/m/02c...
object_confidences    0.920914888381958,0.8690993189811707,0.7811979...
image_path                            data\30k_sample\images\552309.png
size                                                         (450, 470)
Name: 0, dtype: object

In [120]:
captions = meta_df['caption'].tolist()
global_indices = meta_df['global_index'].tolist()

In [121]:
#zip the captions and the indices of the images they belong to together so we can process and save them accordingly
caption_ids = list(zip(captions,global_indices))

In [122]:
caption_ids[0]

('drum and drumsticks pattern repeat seamless in black color for any design .',
 552309)

In [123]:
len(captions)


29928

In [124]:
num_trying = 50

In [125]:
from torch.utils.data import DataLoader

dataloader = DataLoader(caption_ids[:num_trying], batch_size=32, shuffle=False)

In [126]:
print(len(dataloader))

2


In [127]:
embeddings_dir = r'bert_embeddings'
embeddings_path = os.path.join(data_dir,embeddings_dir)
# Create the output folder (and any missing parents) if it does not exist yet.
os.makedirs(embeddings_path, exist_ok=True)
embeddings_path

'../data/30k_sample/bert_embeddings'

In [128]:
cwd = os.getcwd()

In [129]:
cwd

'C:\\Users\\apra\\Desktop\\Previous Classes\\SPRING 2020\\DS 4400\\ML1_Final_Proj\\BERT_stuff'

In [130]:
# Note: with os.chdir commented out, the .npy files are written to the current
# working directory rather than to embeddings_path.
#os.chdir(embeddings_path)
for batch_idx, batch in tqdm(enumerate(dataloader), total=len(dataloader), leave=False):
    batch_captions, global_indices = batch
    tokenized_captions = bert_tokenize_batch(50, batch_captions, tokenizer)
    embedded_captions = bert_embed_batch(tokenized_captions, model)

    for idx, global_idx in enumerate(global_indices):
        # np.save appends the '.npy' extension, giving e.g. '552309.npy'.
        save_path = str(int(global_idx))
        emb = embedded_captions[idx]
        np.save(save_path, emb.numpy())
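A possible alternative to changing the working directory is to build each file path explicitly. The sketch below uses a hypothetical helper, `save_embeddings`, and a temporary folder with dummy data so it runs on its own; in the loop above, `out_dir` would be `embeddings_path` and each tensor would first be converted with `.cpu().numpy()`.

```python
import os
import tempfile
import numpy as np

def save_embeddings(embeddings, global_indices, out_dir):
    """Write one .npy file per caption into out_dir, named by its global index."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for emb, global_idx in zip(embeddings, global_indices):
        path = os.path.join(out_dir, f"{int(global_idx)}.npy")
        np.save(path, np.asarray(emb))
        paths.append(path)
    return paths

# Toy batch standing in for the BERT output of two captions.
toy_batch = np.zeros((2, 50, 768), dtype=np.float32)
toy_indices = [552309, 12345]  # 12345 is a made-up index for illustration
out_dir = os.path.join(tempfile.mkdtemp(), "bert_embeddings")
saved = save_embeddings(toy_batch, toy_indices, out_dir)
print([os.path.basename(p) for p in saved])
```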
            
        
        

                                                                                                                                                                                       

In [131]:
os.getcwd()

'C:\\Users\\apra\\Desktop\\Previous Classes\\SPRING 2020\\DS 4400\\ML1_Final_Proj\\BERT_stuff'

In [132]:
935*32

29920

In [138]:
tokenized_captions = bert_tokenize_batch(50, captions[:1], tokenizer)

In [139]:
embedded_captions = bert_embed_batch(tokenized_captions,model)

In [140]:
embedded_captions

tensor([[[-6.7798e-01,  6.6109e-02,  2.0387e-01,  ..., -7.0361e-02,
           6.2121e-01, -6.3509e-01],
         [ 3.6984e-01,  2.8703e-01,  3.6181e-01,  ...,  1.9766e-02,
           4.4718e-01, -1.4747e+00],
         [-6.4617e-01,  6.4853e-01,  7.7406e-02,  ..., -3.1656e-01,
           2.6238e-01, -6.6055e-01],
         ...,
         [ 3.2611e-01, -3.4134e-01,  9.1669e-01,  ..., -6.6094e-01,
          -7.0788e-04, -1.0672e+00],
         [ 4.1295e-01, -4.7254e-01,  9.3167e-01,  ..., -6.8353e-01,
          -1.8768e-02, -1.0531e+00],
         [ 4.5918e-01, -5.2467e-01,  7.8476e-01,  ..., -6.9273e-01,
          -9.4008e-02, -1.3676e+00]]])

In [141]:
embedded_captions.size()

torch.Size([1, 50, 768])

## Some code to read the bert embedding files

In [142]:
emb = np.load(r'552309.npy')
emb

array([[-6.7797875e-01,  6.6109024e-02,  2.0386510e-01, ...,
        -7.0361063e-02,  6.2120736e-01, -6.3508731e-01],
       [ 3.6984134e-01,  2.8702578e-01,  3.6180964e-01, ...,
         1.9765865e-02,  4.4717944e-01, -1.4747001e+00],
       [-6.4616889e-01,  6.4852601e-01,  7.7405632e-02, ...,
        -3.1655696e-01,  2.6238036e-01, -6.6054714e-01],
       ...,
       [ 3.2611141e-01, -3.4134045e-01,  9.1668588e-01, ...,
        -6.6093838e-01, -7.0826337e-04, -1.0672311e+00],
       [ 4.1295210e-01, -4.7253543e-01,  9.3166596e-01, ...,
        -6.8352866e-01, -1.8768054e-02, -1.0531371e+00],
       [ 4.5918289e-01, -5.2467126e-01,  7.8476018e-01, ...,
        -6.9273007e-01, -9.4007984e-02, -1.3676246e+00]], dtype=float32)

In [143]:
emb.shape

(50, 768)
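To read many embedding files at once, something like the following could work. It assumes every file is named `<global_index>.npy`, as produced by the saving loop; `load_embeddings` is a hypothetical helper, and the demo writes a dummy file into a temporary folder so it is self-contained.

```python
import os
import tempfile
import numpy as np

def load_embeddings(emb_dir):
    """Return {global_index: array} for every <global_index>.npy file in emb_dir."""
    loaded = {}
    for name in sorted(os.listdir(emb_dir)):
        stem, ext = os.path.splitext(name)
        # Skip anything that is not a numerically named .npy file.
        if ext == ".npy" and stem.isdigit():
            loaded[int(stem)] = np.load(os.path.join(emb_dir, name))
    return loaded

# Self-contained demo: write one dummy file, then read it back.
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "552309.npy"), np.ones((50, 768), dtype=np.float32))
by_index = load_embeddings(demo_dir)
print(by_index[552309].shape)
```

Loading everything into a dictionary is only practical for a small sample; for the full set of captions, loading each file on demand by its index is likely the safer choice.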

## Try detokenization

This just means that the embedding of a word that BERT splits into several WordPiece tokens, such as `em`, `##bed`, `##ding`, `##s`, will be the average of the embeddings of its parts.
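A minimal sketch of this averaging, assuming the WordPiece tokens are available as strings (for example from `tokenizer.convert_ids_to_tokens`) and that special tokens such as `[CLS]`, `[SEP]` and `[PAD]` have already been dropped. The helper name `average_wordpieces` is hypothetical, and random vectors stand in for the BERT outputs.

```python
import numpy as np

def average_wordpieces(tokens, embeddings):
    """Merge WordPiece tokens into words and average their embeddings.

    tokens: list of WordPiece strings, e.g. ['em', '##bed', '##ding', '##s']
    embeddings: array of shape (len(tokens), hidden_size)
    Returns (words, word_embeddings) with one row per reconstructed word.
    """
    words, groups = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += tok[2:]
            groups[-1].append(i)
        else:
            words.append(tok)
            groups.append([i])
    word_embeddings = np.stack([embeddings[idx].mean(axis=0) for idx in groups])
    return words, word_embeddings

# Toy example with random vectors standing in for BERT outputs.
tokens = ["i", "want", "em", "##bed", "##ding", "##s", "for", "."]
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(len(tokens), 4))
words, word_embeddings = average_wordpieces(tokens, toy_embeddings)
print(words)
print(word_embeddings.shape)
```

The result has one row per word rather than per token, so the sequence length becomes the number of words in the caption.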