# BERT embeddings extraction
This notebook contains code to extract contextual embeddings of essays using DeBERTa and BigBird-Pegasus checkpoints.
## Import libraries

In [1]:
%run utils.ipynb

import numpy as np
import pandas as pd
import sklearn as sk 
import matplotlib.pyplot as plt
import seaborn as sns

import transformers
from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

### Checking CUDA availability

In [2]:
torch.cuda.is_available()

True
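
The check above returns `True` on this machine. A minimal sketch, assuming you also want the notebook to run where no GPU is present: pick the device once and reuse it. The variable names `device` and `x` are illustrative and do not appear in the original cells.

```python
import torch

# Use the GPU when PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors and models can then be moved with .to(device) instead of .to('cuda').
x = torch.zeros(2, 3).to(device)
print(device, x.device)
```

With this pattern, the later `.to('cuda')` calls could be written as `.to(device)`.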

## Import data

In [2]:
loader = DataLoader()  # instance name differs from the class so DataLoader is not shadowed
L2WritingData = loader.GetData(source='L2Writing')
L2WritingShuffle = loader.GetShuffled()

In [3]:
L2WritingData.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [5]:
L2WritingShuffle.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
2177,A2957D006D28,It has been said that the first impression are...,4.5,4.0,4.0,4.0,4.0,4.0
2833,CE43DBA12965,The growth of technology has made it convenien...,3.0,3.0,3.0,3.5,3.5,4.0
2352,AEE8A576989C,Dear friend\n\nmy name is STUDENT_NAME\n\nam c...,2.5,2.0,3.0,2.5,2.0,2.5
2446,B5AA232A7261,"Dear, Mrs. Generic_Name\n\nMy opinon is that i...",2.0,2.0,2.0,2.0,2.0,2.0
3409,EA37D9C12C91,life is to much fun and joy and happines we ju...,2.5,2.0,2.0,2.5,2.5,2.5


## Save and load the models

In [6]:
DebertaLargeModel = AutoModel.from_pretrained("microsoft/deberta-v3-large")
DebertaLargeTokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')
#DebertaLargeModel.save_pretrained('./model/deberta-v3-large/')
#DebertaLargeTokenizer.save_pretrained('./model/tokenizer/deberta-v3-large/')

Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2Model: ['mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.dense.bias']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Special tokens have 

In [51]:
BigBirdModel = AutoModel.from_pretrained('google/bigbird-pegasus-large-arxiv')
BigBirdTokenizer = AutoTokenizer.from_pretrained('google/bigbird-pegasus-large-arxiv')
#BigBirdModel.save_pretrained('./model/bigbird-large')
#BigBirdTokenizer.save_pretrained('./model/tokenizer/bigbird-large')

Some weights of the model checkpoint at google/bigbird-pegasus-large-arxiv were not used when initializing BigBirdPegasusModel: ['final_logits_bias', 'lm_head.weight']
- This IS expected if you are initializing BigBirdPegasusModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdPegasusModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [2]:
debertabaseModel = AutoModel.from_pretrained('microsoft/deberta-v3-base')
debertabaseTokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-base')
debertabaseModel.save_pretrained('./model/deberta-v3-base')
debertabaseTokenizer.save_pretrained('./model/tokenizer/deberta-v3-base')

Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2Model: ['mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.LayerNorm.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


('./model/tokenizer/deberta-v3-base\\tokenizer_config.json',
 './model/tokenizer/deberta-v3-base\\special_tokens_map.json',
 './model/tokenizer/deberta-v3-base\\spm.model',
 './model/tokenizer/deberta-v3-base\\added_tokens.json',
 './model/tokenizer/deberta-v3-base\\tokenizer.json')

In [3]:
deberta = AutoModel.from_pretrained('microsoft/deberta-base')
debertatokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
deberta.save_pretrained('./model/deberta-base')
debertatokenizer.save_pretrained('./model/tokenizer/deberta-base')

Some weights of the model checkpoint at microsoft/deberta-base were not used when initializing DebertaModel: ['lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('./model/tokenizer/deberta-base\\tokenizer_config.json',
 './model/tokenizer/deberta-base\\special_tokens_map.json',
 './model/tokenizer/deberta-base\\vocab.json',
 './model/tokenizer/deberta-base\\merges.txt',
 './model/tokenizer/deberta-base\\added_tokens.json',
 './model/tokenizer/deberta-base\\tokenizer.json')

## Tokenizing

In [7]:
data = L2WritingShuffle['full_text'][:5]
tokenized = [DebertaLargeTokenizer(
    i,
    add_special_tokens=True,
    max_length=512,
    return_tensors='pt',
    padding='max_length',
    truncation=True
    ).to('cuda') for i in data]
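
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding='max_length' plus truncation=True (hypothetical helper)."""
    kept = list(token_ids[:max_length])             # truncation: drop ids past max_length
    n_pad = max_length - len(kept)                  # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 7, 8, 9, 10, 11] [1, 1, 1, 1, 1, 1]
```

Every sequence therefore ends up with the same length, and the attention mask records which positions are padding. The pooling steps below rely on that mask.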


In [11]:
DebertaLargeModel = DebertaLargeModel.to('cuda')

In [12]:
with torch.no_grad():
    # pass tensors by keyword so the call does not depend on argument order
    out = DebertaLargeModel(input_ids=tokenized[0]['input_ids'], attention_mask=tokenized[0]['attention_mask'])


## Get embeddings
### CLS embedding

In [30]:
hidden = out.last_hidden_state.detach().cpu()
CLSEmbedding = hidden[:,0]
print(CLSEmbedding.shape)
CLSEmbedding[:5]

torch.Size([1, 1024])


tensor([[ 1.0279e-01, -6.3384e-02,  1.3956e-03,  ...,  6.9370e-03,
         -5.7944e+00,  4.2127e-02]])

### Mean pooling

In [26]:
AttentionMask = tokenized[0]['attention_mask'].detach().cpu().unsqueeze(-1).expand(hidden.size())
SumEmbeddings = torch.sum(AttentionMask*hidden,1)
SumMask = AttentionMask.sum(1)
SumMask = SumMask.clamp(min=1e-9)
MeanPooling = SumEmbeddings/SumMask
print(MeanPooling.shape)
MeanPooling[:5]

torch.Size([1, 1024])


tensor([[ 0.4959, -0.2764, -0.0538,  ...,  0.3599, -0.3727,  0.3367]])
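
The same masked average can be checked by hand on a toy tensor. This is a minimal sketch, assuming a batch-first layout `[batch, seq_len, dim]`; the function name `masked_mean` and the toy values are illustrative and not part of the notebook.

```python
import torch

def masked_mean(hidden, attention_mask):
    """Average token vectors over real tokens only.

    hidden: [batch, seq_len, dim]; attention_mask: [batch, seq_len] with 0/1 entries.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # [batch, seq_len, 1]
    summed = (hidden * mask).sum(dim=1)                   # padded positions contribute 0
    counts = mask.sum(dim=1).clamp(min=1e-9)              # avoid division by zero
    return summed / counts

# Two real tokens followed by one padding position with large values.
toy_hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
mean_pooled = masked_mean(toy_hidden, toy_mask)
print(mean_pooled)  # tensor([[2., 3.]])
```

The padding row is ignored, so the result is the mean of the first two token vectors only.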

### Max Pooling

In [35]:
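AttentionMask = tokenized[0]['attention_mask'].detach().cpu().unsqueeze(-1).expand(hidden.size())
MaskedHidden = hidden.clone()  # copy so the padding fill does not overwrite `hidden`
MaskedHidden[AttentionMask == 0] = -1e9  # padding positions can never win the max
MaxPooling = torch.max(MaskedHidden,1)[0]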
print(MaxPooling.shape)
MaxPooling[:5]

torch.Size([1, 1024])


tensor([[1.8355, 2.8901, 1.5904,  ..., 2.3680, 1.6260, 1.8040]])
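
Max pooling can also be written without assigning into the hidden states. A minimal sketch, assuming the same `[batch, seq_len, dim]` layout: `masked_fill` returns a new tensor, so the input stays usable for other pooling methods. The name `masked_max` and the toy values are illustrative.

```python
import torch

def masked_max(hidden, attention_mask):
    """Per-dimension maximum over real tokens only, leaving `hidden` untouched."""
    mask = attention_mask.unsqueeze(-1).bool()         # [batch, seq_len, 1]
    filled = hidden.masked_fill(~mask, float("-inf"))  # new tensor; padding cannot win the max
    return filled.max(dim=1).values

toy_hidden = torch.tensor([[[1.0, -2.0], [3.0, -4.0], [100.0, 100.0]]])
toy_mask = torch.tensor([[1, 1, 0]])
max_pooled = masked_max(toy_hidden, toy_mask)
print(max_pooled)        # tensor([[ 3., -2.]])
print(toy_hidden[0, 2])  # tensor([100., 100.]), the input is unchanged
```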

## Wrapper

In [24]:
class GetBERTEmbeddings():
    def __init__(self,input,model):
        # Check input
        if model in ['microsoft/deberta-v3-large','google/bigbird-pegasus-large-arxiv']:
            self.input = input
            # load model and tokenizer
            self.model = AutoModel.from_pretrained(model)
            self.tokenizer = AutoTokenizer.from_pretrained(model)
            self.hidden = []
        else:
            raise KeyError('unsupported model: {}'.format(model))
    def tokenize(self,SeqLen=1024):
        self.tokenized = []
        for seq in self.input:
            self.tokenized.append(
                self.tokenizer(seq,
                    add_special_tokens=True,
                    max_length=SeqLen,    # max sequence length, default is 512 for deberta-v3-large
                    return_tensors='pt',  # return in tensor
                    padding='max_length',
                    truncation=True))
        self.input = [i.to('cuda') for i in self.tokenized] # move to gpu
    def inf(self,stop=1000,SeqLen=1024):
        self.tokenize(SeqLen=SeqLen)
        print('tokenized')
        self.model = self.model.to('cuda') # move to gpu
        for run in range(len(self.input) // stop):
            for i in range(stop):
                idx = run*stop + i # position of input i within chunk number `run`
                with torch.no_grad():
                    out = self.model(self.input[idx]['input_ids'],self.input[idx]['attention_mask']) # inference
                self.hidden.append(out.last_hidden_state.detach().cpu()) # detach to cpu
                if i % 10 == 0:
                    print('{}/{}, run:{}'.format(i,stop,run))
                del out 
            torch.cuda.empty_cache() # clear cuda memory for next run
        #if len(self.input) > stop: # remaining ones
        t = len(self.input) // stop * stop
        for i in range(t,len(self.input)):
            with torch.no_grad():
                out = self.model(self.input[i]['input_ids'],self.input[i]['attention_mask']) # inference
            self.hidden.append(out.last_hidden_state.detach().cpu()) # detach to cpu
            if i % 10 == 0:
                print('{}/{}, run:{}'.format(i,stop,'f'))
            del out 
        torch.cuda.empty_cache()
    def CLSEmbedding(self,i):
        self.CLS = self.hidden[i][:,0]
        return self.CLS
    def MaxPooling(self,i):
        hidden = self.hidden[i].clone() # copy so self.hidden stays intact for CLS/mean pooling
        self.AttentionMask = self.tokenized[i]['attention_mask'].unsqueeze(-1).expand(hidden.size()).detach().cpu()
        hidden[self.AttentionMask == 0] = -1e9 # ignore paddings
        self.MaxP = torch.max(hidden,1)[0]
        return self.MaxP
    def MeanPooling(self,i):
        hidden = self.hidden[i].detach().cpu()
        self.AttentionMask = self.tokenized[i]['attention_mask'].unsqueeze(-1).expand(hidden.size()).detach().cpu()
        SumEmbeddings = torch.sum(self.AttentionMask*hidden,1) # ignore paddings
        SumMask = self.AttentionMask.sum(1)
        SumMask = SumMask.clamp(min=1e-9) # prevents division by zero
        self.MeanP = SumEmbeddings/SumMask
        return self.MeanP
    def GetEmbeddings(self,type) :
        EmbeddingType = {
            'CLS':self.CLSEmbedding,
            'MaxP':self.MaxPooling,
            'MeanP':self.MeanPooling
        }
        result = [EmbeddingType[type](i) for i in range(len(self.input)) ]
        return result
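
The `inf` method processes inputs in chunks of `stop` and then handles the remainder, which only works if every input index is visited exactly once. The index arithmetic can be sketched in plain Python; `chunk_indices` is a hypothetical helper used for illustration and is not part of the class.

```python
def chunk_indices(n_items, chunk_size):
    """Yield (chunk_number, item_index) so that each item index appears exactly once."""
    for start in range(0, n_items, chunk_size):
        chunk_number = start // chunk_size
        # the last chunk may be shorter than chunk_size
        for idx in range(start, min(start + chunk_size, n_items)):
            yield chunk_number, idx

pairs = list(chunk_indices(n_items=7, chunk_size=3))
visited = [idx for _, idx in pairs]
chunks = [chunk for chunk, _ in pairs]
print(visited)  # [0, 1, 2, 3, 4, 5, 6]
print(chunks)   # [0, 0, 0, 1, 1, 1, 2]
```

Inside chunk number `c`, the item at local position `i` has global index `c * chunk_size + i`. A single loop of this shape also covers the remainder, so no separate tail loop is needed.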


In [27]:
EmbeddingFetcher = GetBERTEmbeddings(L2WritingShuffle['full_text'][:50],'google/bigbird-pegasus-large-arxiv')
EmbeddingFetcher.inf()


Some weights of the model checkpoint at google/bigbird-pegasus-large-arxiv were not used when initializing BigBirdPegasusModel: ['final_logits_bias', 'lm_head.weight']
- This IS expected if you are initializing BigBirdPegasusModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdPegasusModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenized


0/1000, run:f
10/1000, run:f
20/1000, run:f
30/1000, run:f
40/1000, run:f


In [28]:
EmbeddingFetcher.GetEmbeddings('MeanP')

[tensor([[-0.1633, -0.1339,  0.2622,  ...,  0.0476,  0.0208,  0.2765]]),
 tensor([[ 0.0298, -0.1334,  0.2221,  ...,  0.1089, -0.0831,  0.1603]]),
 tensor([[-0.0224, -0.1506,  0.1303,  ...,  0.0075, -0.0181,  0.0877]]),
 tensor([[ 0.0705, -0.0944,  0.2164,  ...,  0.0218,  0.0009,  0.1355]]),
 tensor([[-0.0431, -0.1462,  0.1437,  ...,  0.0270, -0.0045,  0.0694]]),
 tensor([[-0.0301, -0.1147,  0.1494,  ...,  0.0416, -0.0227,  0.0984]]),
 tensor([[-0.1737, -0.1156,  0.2163,  ...,  0.0438, -0.0238,  0.2262]]),
 tensor([[ 0.0035, -0.1286,  0.1948,  ..., -0.0114, -0.0364,  0.1432]]),
 tensor([[-0.0038, -0.1254,  0.2023,  ...,  0.0964,  0.0020,  0.2621]]),
 tensor([[-0.0493, -0.1919,  0.2049,  ...,  0.1426, -0.0006,  0.1389]]),
 tensor([[-0.0382, -0.1667,  0.1721,  ...,  0.0406, -0.0415,  0.0773]]),
 tensor([[ 0.0833, -0.1503,  0.0911,  ...,  0.0608, -0.0191,  0.0564]]),
 tensor([[ 0.0351, -0.1271,  0.1517,  ...,  0.1259,  0.0118,  0.1516]]),
 tensor([[ 0.0122, -0.1490,  0.1924,  ...,  0.0427,