# Lab 3: BERT Pretrained

---




In this lab, we'll learn how to retrieve contextualized embeddings from pretrained BERT models.

In [None]:
pip install transformers

Collecting transformers
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

# 1. Setup

In [None]:
from google.colab import drive
import os

drive.mount('/content/drive')

os.chdir('/content/drive/MyDrive/hw')
os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive/hw'

In [None]:
import random, pickle
import numpy as np
from torch.nn import BCEWithLogitsLoss, BCELoss
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score, precision_recall_fscore_support
import tensorflow as tf
import torch
import pandas as pd

from transformers import AutoConfig, AutoModel, AutoTokenizer, AutoModelForSequenceClassification

import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

import copy
from sklearn.utils import shuffle
import glob

import time
import datetime



For torch to use the GPU, we need to identify the GPU and specify it as the device. Later, when we run the model, we will move the model and the input tensors onto this device.

In [None]:
# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB


# 2. Demo Sentences

Let's have a few demo sentences.

In [None]:
sentences = ["Apple makes a new play for small businesses, taking on Google and Microsoft",
        "Apple loses bid for second bite at Qualcomm patents after license",
        "Apples also happen to be good for you. Very good. Rich in antioxidants, vitamin C, potassium, and fiber, apples are considered one of the healthiest foods you can eat.",
        "Apple chips moisture analysis made easy with near-infrared spectroscopy"]

# 3. Tokenization & Input Formatting

In this section, we'll transform our sentences into the input format that BERT expects.


To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.
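The two steps described above (splitting into tokens, then mapping tokens to vocabulary indices) can be sketched with a toy example. This is a simplified illustration of greedy longest-match sub-word splitting in the style of WordPiece, not the real BERT tokenizer: the vocabulary `TOY_VOCAB` and the helper names `wordpiece` and `encode` are hypothetical, and the real tokenizer also handles punctuation, accents, and a vocabulary of 30,522 entries.

```python
# Toy vocabulary (hypothetical; invented for this illustration only).
TOY_VOCAB = {"[UNK]": 0, "apple": 1, "makes": 2, "spec": 3, "##tro": 4, "##scopy": 5}


def wordpiece(word, vocab):
    """Split one lowercase word into sub-word pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, split into pieces, and look up the IDs."""
    tokens = [p for w in text.lower().split() for p in wordpiece(w, vocab)]
    ids = [vocab[t] for t in tokens]
    return tokens, ids


tokens, ids = encode("Apple makes spectroscopy", TOY_VOCAB)
print(tokens)  # ['apple', 'makes', 'spec', '##tro', '##scopy']
print(ids)     # [1, 2, 3, 4, 5]
```

A word that is missing from the vocabulary is broken into smaller known pieces where possible, which is why the real tokenizer can handle rare words without a huge vocabulary.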

The tokenization must be performed by the tokenizer that ships with BERT; the cell below downloads it for us. We'll be using the "uncased" version here.

For a list of pretrained BERT models, see https://huggingface.co/transformers/pretrained_models.html


In [None]:
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)


tokenizer = AutoTokenizer.from_pretrained(
        "bert-base-uncased"
    )


Loading BERT tokenizer...



Let's apply the tokenizer to one sentence just to see the output.


In [None]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  Apple makes a new play for small businesses, taking on Google and Microsoft
Tokenized:  ['apple', 'makes', 'a', 'new', 'play', 'for', 'small', 'businesses', ',', 'taking', 'on', 'google', 'and', 'microsoft']
Token IDs:  [6207, 3084, 1037, 2047, 2377, 2005, 2235, 5661, 1010, 2635, 2006, 8224, 1998, 7513]
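The IDs printed above do not yet include the special tokens or padding that BERT expects. As a rough sketch of that formatting step, the helper below (a hypothetical `format_for_bert`, written in plain Python) wraps the IDs in `[CLS]` and `[SEP]`, pads to a fixed length, and builds the matching attention mask. The constants 101, 102, and 0 are assumed to be the `[CLS]`, `[SEP]`, and `[PAD]` IDs of `bert-base-uncased`; `tokenizer.cls_token_id`, `tokenizer.sep_token_id`, and `tokenizer.pad_token_id` give the authoritative values.

```python
# Assumed special-token IDs for bert-base-uncased (check them on the tokenizer).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def format_for_bert(token_ids, max_length):
    """Add [CLS]/[SEP], truncate or pad to max_length, and build the attention mask."""
    body = token_ids[: max_length - 2]          # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad               # 0 = padding, ignored by attention
    return input_ids, attention_mask


# Token IDs of the first demo sentence, copied from the output above.
token_ids = [6207, 3084, 1037, 2047, 2377, 2005, 2235, 5661, 1010, 2635, 2006, 8224, 1998, 7513]
input_ids, attention_mask = format_for_bert(token_ids, max_length=20)
print(input_ids)
print(attention_mask)
```

With `max_length=20`, the 14 word tokens plus two special tokens fill 16 positions and the remaining 4 are padding, so the mask contains sixteen 1s followed by four 0s.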


In [None]:
# Load the pretrained BERT model.

bert_model = AutoModel.from_pretrained(
    "bert-base-uncased",
    output_attentions = False, # Whether the model returns attention weights.
    output_hidden_states = True, # Whether the model returns all hidden states.
    )
# Move the model to the device selected earlier (GPU if available, otherwise CPU).
bert_model.to(device)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

# Retrieve Embeddings from each BERT layer

Let's retrieve the hidden states from every layer, then average the last four layers to get one embedding per token.

In [None]:
def get_pretrained_wordvector(sentences, tokenizer, bert_model):

    input_ids = []
    attention_masks = []

    # Tokenize each sentence
    for sent in sentences:
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`.
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 200,          # Pad & truncate all sentences.
                        padding = 'max_length',    # Replaces the deprecated `pad_to_max_length`.
                        truncation = True,         # Cut sentences longer than `max_length`.
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
        # Add the encoded sentence to the list.    
        input_ids.append(encoded_dict['input_ids'])
    
        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])


    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)

    bert_model.eval()
    with torch.no_grad():

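        # The attention mask tells BERT which positions are real tokens and which are padding.
        outputs = bert_model(input_ids.to(device), attention_mask=attention_masks.to(device))
        # Tuple of 13 tensors for bert-base (embedding output + 12 encoder layers),
        # each of shape [batch_size, max_length, 768].
        hidden_states = outputs.hidden_states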

    
    # Stack the last four layers: [4, batch_size, max_length, 768].
    token_embeddings = torch.stack(hidden_states[-4:], dim=0)

    # Move the layer axis next to the hidden axis: [batch_size, max_length, 4, 768].
    token_embeddings = token_embeddings.permute(1, 2, 0, 3)

    # Average over the four layers: [batch_size, max_length, 768].
    token_embeddings = token_embeddings.mean(dim=2)

    return token_embeddings, attention_masks
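To make the stack, permute, and mean steps easier to follow, here is a small NumPy sketch of the same shape manipulation. It uses random arrays in place of real BERT hidden states and small hypothetical sizes (`batch=2`, `seq_len=5`, `hidden=8`) instead of the lab's 4, 200, and 768; only the shapes and the averaging logic are meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 5, 8   # hypothetical small sizes for illustration
num_layers = 13                    # bert-base: embedding output + 12 encoder layers

# Stand-in for the tuple of hidden states returned by the model.
hidden_states = [rng.standard_normal((batch, seq_len, hidden)) for _ in range(num_layers)]

stacked = np.stack(hidden_states[-4:], axis=0)   # (4, batch, seq_len, hidden)
permuted = stacked.transpose(1, 2, 0, 3)         # (batch, seq_len, 4, hidden)
averaged = permuted.mean(axis=2)                 # (batch, seq_len, hidden)

print(stacked.shape, permuted.shape, averaged.shape)

# The result equals a plain element-wise average of the last four layers.
direct = sum(hidden_states[-4:]) / 4
print(np.allclose(averaged, direct))
```

The permutation is not strictly required, since averaging over axis 0 of the stacked array gives the same result, but it keeps one row per token, which is convenient if you later want to concatenate layers instead of averaging them.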

In [None]:
token_embeddings, masks = get_pretrained_wordvector(sentences, tokenizer, bert_model)

print(token_embeddings.size(), masks.size())

torch.Size([4, 200, 768]) torch.Size([4, 200])




# Test Contextualized Word Embedding

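Each sentence starts with the word `Apple` (or `Apples`), which refers to the company in some sentences and to the fruit in others. Let's compare the embeddings at the start of each sentence. Note that `encode_plus` prepends `[CLS]`, so position 0 holds the `[CLS]` vector, which reflects the whole sentence; the vector for the first word piece itself sits at position 1.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Position 0 of every encoded sentence is the `[CLS]` token; use index 1 instead
# to compare the vectors of the first word piece of each sentence.
print(sentences[0])
apple1 = token_embeddings[0][0].cpu().numpy()

print(sentences[1])
apple2 = token_embeddings[1][0].cpu().numpy()

print(sentences[2])
apple3 = token_embeddings[2][0].cpu().numpy()

print(sentences[3])
apple4 = token_embeddings[3][0].cpu().numpy()

# Pairwise cosine similarity between the four vectors (4 x 4 matrix).
cosine_similarity(np.vstack([apple1, apple2, apple3, apple4]))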

Apple makes a new play for small businesses, taking on Google and Microsoft
Apple loses bid for second bite at Qualcomm patents after license
Apples also happen to be good for you. Very good. Rich in antioxidants, vitamin C, potassium, and fiber, apples are considered one of the healthiest foods you can eat.
Apple chips moisture analysis made easy with near-infrared spectroscopy


array([[0.99999994, 0.79188734, 0.68766606, 0.75212693],
       [0.79188734, 1.        , 0.6522256 , 0.78111225],
       [0.68766606, 0.6522256 , 1.0000002 , 0.752947  ],
       [0.75212693, 0.78111225, 0.752947  , 1.0000002 ]], dtype=float32)
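Because `encode_plus` adds `[CLS]` at the front of every sentence, the position of a word in the encoded sequence is shifted by one relative to the plain token list. The sketch below shows one way to look up a token's position and to compute cosine similarity by hand. The helper names `token_position` and `cosine` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings.

```python
import numpy as np


def token_position(tokens, word):
    """Return the index of the first token equal to `word`, or -1 if absent."""
    for i, tok in enumerate(tokens):
        if tok == word:
            return i
    return -1


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Token list as it looks after the special tokens are added.
tokens = ['[CLS]', 'apple', 'makes', 'a', 'new', 'play', '[SEP]']
pos = token_position(tokens, 'apple')
print(pos)  # 1

# Made-up vectors for illustration only.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0])
print(round(cosine(a, b), 4))  # 0.7071
```

In the lab, a token list of this form can be built as `['[CLS]'] + tokenizer.tokenize(sentences[i])`, and the vector for the word is then `token_embeddings[i][pos]`.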

You can use these pretrained contextualized embeddings in your deep learning model just like regular embeddings.
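One common way to feed such embeddings to a downstream model, offered here as a sketch rather than as part of the lab, is to pool the token vectors of each sentence into a single fixed-size vector while ignoring padding. The helper `masked_mean` below is hypothetical and uses NumPy with tiny made-up arrays; it assumes the embeddings have shape `[batch, seq_len, hidden]` and the mask has shape `[batch, seq_len]`, as returned by `get_pretrained_wordvector`.

```python
import numpy as np


def masked_mean(token_embeddings, attention_mask):
    """Average token vectors per sentence, counting only positions where mask == 1."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1, None)                       # avoid division by zero
    return summed / counts


# One sentence with two real tokens and one padding position (made-up values).
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # (1, 3, 2)
mask = np.array([[1, 1, 0]])
print(masked_mean(emb, mask))  # [[2. 3.]]
```

With the lab's tensors, the call would look like `masked_mean(token_embeddings.cpu().numpy(), masks.numpy())`, which yields one 768-dimensional vector per sentence.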
