This is a Python library for extracting word embeddings from pre-trained language models.
pip install --upgrade embedding4bert
Extract word embeddings from pre-trained language models such as BERT or XLNet. The embeddings of the subwords within a word are averaged to form the embedding of the whole word.
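The subword averaging described above can be pictured with a small, library-independent sketch. It is not the package's internal code: the helper name merge_wordpieces and the toy 2-dimensional vectors are assumptions made for illustration, and it only handles BERT-style WordPiece tokens, where a "##" prefix marks a continuation piece.

```python
from typing import List, Tuple


def merge_wordpieces(
    pieces: List[str], vectors: List[List[float]]
) -> Tuple[List[str], List[List[float]]]:
    """Group WordPiece tokens into words and average the vectors of each group."""
    words: List[str] = []
    groups: List[List[List[float]]] = []
    for piece, vec in zip(pieces, vectors):
        if piece.startswith("##") and words:
            # Continuation piece: extend the current word and collect its vector.
            words[-1] += piece[2:]
            groups[-1].append(vec)
        else:
            # A new word starts here.
            words.append(piece)
            groups.append([vec])
    # Average each group dimension by dimension.
    averaged = [
        [sum(column) / len(group) for column in zip(*group)] for group in groups
    ]
    return words, averaged


# Toy input: "python" is split into three pieces, "library" is a single piece.
pieces = ["p", "##yt", "##hon", "library"]
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
words, word_vectors = merge_wordpieces(pieces, vectors)
print(words)         # ['python', 'library']
print(word_vectors)  # [[3.0, 4.0], [7.0, 8.0]]
```

In the real library the vectors come from the model's hidden states and have 768 dimensions for the base models shown below.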
The extract_word_embeddings function of the Embedding4BERT class takes the following arguments:
- mode: str. "sum" (default) or "mean". Take the sum or the average of the representations from the specified layers.
- layers: List[int]. Default: [-1,-2,-3,-4], i.e. the last four layers. The word representation is taken from the layers given in this list.
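As a rough illustration of what the mode and layers arguments mean, the sketch below combines per-layer token vectors by sum or by mean. It is written under stated assumptions and is not the library's implementation: the function name combine_layers and the nested-list layout (layer, then token, then dimension) are hypothetical, chosen so the example runs without any dependencies.

```python
from typing import List, Sequence


def combine_layers(
    hidden_states: List[List[List[float]]],
    layers: Sequence[int] = (-1, -2, -3, -4),
    mode: str = "sum",
) -> List[List[float]]:
    """Combine the selected layers token by token, using a sum or a mean."""
    if mode not in ("sum", "mean"):
        raise ValueError('mode must be "sum" or "mean"')
    # Negative indices count from the last layer, as in the default argument.
    selected = [hidden_states[i] for i in layers]
    combined: List[List[float]] = []
    for token_index in range(len(selected[0])):
        token_vectors = [layer[token_index] for layer in selected]
        vector = [sum(values) for values in zip(*token_vectors)]
        if mode == "mean":
            vector = [value / len(selected) for value in vector]
        combined.append(vector)
    return combined


# Toy hidden states: 3 layers, 2 tokens, 2 dimensions.
hidden_states = [
    [[1.0, 1.0], [2.0, 2.0]],
    [[3.0, 3.0], [4.0, 4.0]],
    [[5.0, 5.0], [6.0, 6.0]],
]
summed = combine_layers(hidden_states, layers=[-1, -2], mode="sum")
averaged = combine_layers(hidden_states, layers=[-1, -2], mode="mean")
print(summed)    # [[8.0, 8.0], [10.0, 10.0]]
print(averaged)  # [[4.0, 4.0], [5.0, 5.0]]
```

Either mode keeps the embedding dimension unchanged, which is consistent with the (14, 768) and (11, 768) shapes in the examples below.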
- Extract BERT word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("bert-base-cased")  # or "bert-base-uncased"
text = 'This is a python library for extracting word representations from BERT.'
tokens, embeddings = emb4bert.extract_word_embeddings(text, mode="sum", layers=[-1,-2,-3,-4]) # Take the sum of last four layers
print(tokens)
print(embeddings.shape)
Expected output:
14 tokens: [CLS] This is a python library for extracting word representations from BERT. [SEP], 19 word-tokens: ['[CLS]', 'This', 'is', 'a', 'p', '##yt', '##hon', 'library', 'for', 'extract', '##ing', 'word', 'representations', 'from', 'B', '##ER', '##T', '.', '[SEP]']
['[CLS]', 'This', 'is', 'a', 'python', 'library', 'for', 'extracting', 'word', 'representations', 'from', 'BERT', '.', '[SEP]']
(14, 768)
- Extract XLNet word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("xlnet-base-cased")
text = 'This is a python library for extracting word representations from BERT.'
tokens, embeddings = emb4bert.extract_word_embeddings(text, mode="mean", layers=[-1,-2,-3]) # Take the mean of the last three layers
print(tokens)
print(embeddings.shape)
Expected output:
11 tokens: This is a python library for extracting word representations from BERT., 16 word-tokens: ['▁This', '▁is', '▁a', '▁', 'py', 'thon', '▁library', '▁for', '▁extract', 'ing', '▁word', '▁representations', '▁from', '▁B', 'ERT', '.']
['▁This', '▁is', '▁a', '▁python', '▁library', '▁for', '▁extracting', '▁word', '▁representations', '▁from', '▁BERT.']
(11, 768)
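The XLNet output above groups SentencePiece pieces differently from BERT's WordPiece tokens: a piece that starts with "▁" opens a new word, and the pieces that follow are attached to it. This is why "BERT." comes out as a single word here. The following is a minimal sketch of that grouping rule only; the helper name group_sentencepiece is hypothetical and this is not the library's own code, although it reproduces the 11 words shown for this sentence.

```python
from typing import List


def group_sentencepiece(pieces: List[str]) -> List[str]:
    """Merge SentencePiece pieces into words, using "▁" as the word-start marker."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("▁") or not words:
            # The marker (or the very first piece) starts a new word.
            words.append(piece)
        else:
            # Any other piece continues the current word.
            words[-1] += piece
    return words


pieces = ['▁This', '▁is', '▁a', '▁', 'py', 'thon', '▁library', '▁for',
          '▁extract', 'ing', '▁word', '▁representations', '▁from',
          '▁B', 'ERT', '.']
words = group_sentencepiece(pieces)
print(len(words))  # 11
print(words)
```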
For attribution in academic contexts, please cite this work as:
@misc{chai2020-embedding4bert,
author = {Chai, Yekun},
title = {embedding4bert: A python library for extracting word embeddings from pre-trained language models},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cyk1337/embedding4bert}}
}