
Embedding4BERT



This is a Python library for extracting word embeddings from pre-trained language models.

User Guide

Installation

pip install --upgrade embedding4bert
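
To verify the install, you can query pip's package metadata from the standard library (a generic check, independent of this library's API):

from importlib.metadata import version
print(version("embedding4bert"))  # prints the installed version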

Usage

Extract word embeddings from pretrained language models, such as BERT or XLNet. The subword embeddings within a word are averaged to represent the whole word embedding. The extract_word_embeddings function of the Embedding4BERT class takes the following arguments:

  • mode: str. "sum" (default) or "mean". Take the sum or the average of the representations from the specified layers.
  • layers: List[int]. Default: [-1,-2,-3,-4], i.e., the last four layers. Word representations are taken from the layers in this list (see the sketch below).
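
For illustration, here is a minimal sketch of what mode and layers correspond to, written against the Hugging Face transformers API directly. This is not the library's internal code, just an equivalent computation:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

inputs = tokenizer("A short example.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # embedding layer + one tensor per encoder layer

stacked = torch.stack([hidden_states[i] for i in [-1, -2, -3, -4]])  # (4, 1, seq_len, 768)
summed = stacked.sum(dim=0)     # mode="sum"
averaged = stacked.mean(dim=0)  # mode="mean"
print(summed.shape)             # (1, seq_len, 768)
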
  1. Extract BERT word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("bert-base-cased") # or "bert-base-uncased"
text = 'This is a python library for extracting word representations from BERT.'
tokens, embeddings = emb4bert.extract_word_embeddings(text, mode="sum", layers=[-1,-2,-3,-4]) # Take the sum of the last four layers
print(tokens)
print(embeddings.shape)

Expected output:

14 tokens: [CLS] This is a python library for extracting word representations from BERT. [SEP], 19 word-tokens: ['[CLS]', 'This', 'is', 'a', 'p', '##yt', '##hon', 'library', 'for', 'extract', '##ing', 'word', 'representations', 'from', 'B', '##ER', '##T', '.', '[SEP]']
['[CLS]', 'This', 'is', 'a', 'python', 'library', 'for', 'extracting', 'word', 'representations', 'from', 'BERT', '.', '[SEP]']
(14, 768)
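
The merge from 19 subword tokens to 14 words follows the WordPiece convention: pieces beginning with "##" are folded into the preceding token, and their vectors are averaged. A rough sketch of that step (illustrative only; the function name and details are not the library's internals):

import numpy as np

def merge_wordpiece(tokens, vectors):
    # Fold "##"-prefixed pieces into the previous word, averaging their vectors.
    words, groups = [], []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
            groups[-1].append(vec)
        else:
            words.append(tok)
            groups.append([vec])
    return words, np.stack([np.mean(g, axis=0) for g in groups])

words, word_vecs = merge_wordpiece(['p', '##yt', '##hon'], np.random.rand(3, 768))
print(words, word_vecs.shape)  # ['python'] (1, 768)
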
  2. Extract XLNet word embeddings.
from embedding4bert import Embedding4BERT
emb4bert = Embedding4BERT("xlnet-base-cased")
text = 'This is a python library for extracting word representations from BERT.'
tokens, embeddings = emb4bert.extract_word_embeddings(text, mode="mean", layers=[-1,-2,-3]) # Take the mean of the last three layers
print(tokens)
print(embeddings.shape)

Expected output:

11 tokens: This is a python library for extracting word representations from BERT., 16 word-tokens: ['▁This', '▁is', '▁a', '', 'py', 'thon', '▁library', '▁for', '▁extract', 'ing', '▁word', '▁representations', '▁from', '▁B', 'ERT', '.']
['▁This', '▁is', '▁a', '▁python', '▁library', '▁for', '▁extracting', '▁word', '▁representations', '▁from', '▁BERT.']
(11, 768)
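
XLNet uses SentencePiece rather than WordPiece: a leading "▁" marks the start of a new word, so pieces without it are merged into the previous word. An analogous sketch (again illustrative, not the library's code):

import numpy as np

def merge_sentencepiece(tokens, vectors):
    # A piece starting with "▁" opens a new word; other pieces extend the last one.
    words, groups = [], []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("▁") or not words:
            words.append(tok.lstrip("▁"))
            groups.append([vec])
        else:
            words[-1] += tok
            groups[-1].append(vec)
    return words, np.stack([np.mean(g, axis=0) for g in groups])

words, word_vecs = merge_sentencepiece(['▁extract', 'ing'], np.random.rand(2, 768))
print(words, word_vecs.shape)  # ['extracting'] (1, 768)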

Citation

For attribution in academic contexts, please cite this work as:

@misc{chai2020-embedding4bert,
  author = {Chai, Yekun},
  title = {embedding4bert: A python library for extracting word embeddings from pre-trained language models},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cyk1337/embedding4bert}}
}

References

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  2. XLNet: Generalized Autoregressive Pretraining for Language Understanding