Pre-trained embeddings

BERT

We are publishing several pre-trained BERT models:

  • RuBERT for Russian language
  • Slavic BERT for Bulgarian, Czech, Polish, and Russian
  • Conversational BERT for informal English
  • Conversational RuBERT for informal Russian
  • Sentence Multilingual BERT for encoding sentences in 101 languages
  • Sentence RuBERT for encoding sentences in Russian

Descriptions of these models are available in the BERT section </features/models/bert> of the docs.

License

The pre-trained models are distributed under the Apache 2.0 license.

Downloads

The TensorFlow models can be run with the original BERT repo code, while the PyTorch models can be run with the Hugging Face Transformers library (a loading sketch follows the table). The download links are:

+----------------------------+----------------------------------------------------+--------------------------+
| Description                | Model parameters                                   | Download links           |
+============================+====================================================+==========================+
| RuBERT                     | vocab size = 120K, parameters = 180M, size = 632MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Slavic BERT                | vocab size = 120K, parameters = 180M, size = 632MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Conversational BERT        | vocab size = 30K, parameters = 110M, size = 385MB  | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Conversational RuBERT      | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Sentence Multilingual BERT | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Sentence RuBERT            | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
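
As a quick check after downloading, here is a minimal sketch of loading one of the PyTorch checkpoints with the Transformers library. The model id DeepPavlov/rubert-base-cased is assumed for illustration (substitute the checkpoint you actually downloaded), and the mean pooling at the end is just one common way to obtain a sentence vector, not part of the published models:

    # A minimal sketch, assuming the `transformers` package and the
    # (assumed) model id "DeepPavlov/rubert-base-cased".
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "DeepPavlov/rubert-base-cased"  # assumed id; use your download
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tokenizer("Привет, мир!", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

    # Naive mean pooling over tokens yields one sentence vector; the
    # Sentence * models in the table are intended for this kind of use.
    sentence_vector = hidden.mean(dim=1)
    print(sentence_vector.shape)  # torch.Size([1, 768])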

ELMo

We are publishing a Russian-language ELMo embeddings model (deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder) for TensorFlow Hub, and an LM model (deeppavlov.models.elmo.elmo.ELMo) for training and fine-tuning ELMo as a language model.
ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See the paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

License

The pre-trained models are distributed under the Apache 2.0 license.

Downloads

The models can be downloaded and run via a DeepPavlov configuration file or as a TensorFlow Hub module (a loading sketch follows the table):

+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| Description               | Dataset parameters                        | Perplexity | Configuration file and TensorFlow Hub module  |
+===========================+===========================================+============+===============================================+
| ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB     | 43.692     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| ELMo on Russian WMT News  | lines = 63M, tokens = 946M, size = 12GB   | 49.876     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| ELMo on Russian Twitter   | lines = 104M, tokens = 810M, size = 8.5GB | 94.145     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
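
For orientation, a minimal sketch of loading one of the modules with tensorflow-hub in TF 1.x graph mode. The module URL is a placeholder for a module_spec link from the table, and the "default" signature with the "elmo" output key is an assumption carried over from the original TF Hub ELMo interface:

    # A minimal sketch, assuming the interface of the original TF Hub
    # ELMo module; MODULE_SPEC is a placeholder for a module_spec link
    # from the table above.
    import tensorflow as tf
    import tensorflow_hub as hub

    MODULE_SPEC = "<module_spec URL from the table>"  # placeholder

    elmo = hub.Module(MODULE_SPEC, trainable=False)
    embeddings = elmo(["все сложное когда-нибудь станет простым"],
                      signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        vectors = sess.run(embeddings)
    print(vectors.shape)  # (batch, max_tokens, embedding_dim)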

fastText

We are publishing pre-trained word vectors for the Russian language. Several models were trained on the joint Russian Wikipedia and Lenta.ru corpus. We also provide one model for Russian conversational language, trained on a Russian Twitter corpus.

All vectors are 300-dimensional. We used the fastText skip-gram model (see Bojanowski et al. (2016)) to train the vectors, with various preprocessing options (see below).

You can get the vectors in either binary or text (vec) format, both for fastText and GloVe; a loading sketch follows the downloads table below.

License

The pre-trained word vectors are distributed under the Apache 2.0 license.

Downloads

The pre-trained fastText skipgram models can be downloaded from:

+------------+------------------------------------------------------+----------+
| Domain     | Preprocessing                                        | Vectors  |
+============+======================================================+==========+
| Wiki+Lenta | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize), lowercasing           | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk wordpunct_tokenize)                   | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize)                        | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize), remove stopwords      | bin, vec |
+------------+------------------------------------------------------+----------+
| Twitter    | tokenize (nltk word_tokenize)                        | bin, vec |
+------------+------------------------------------------------------+----------+
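
Once downloaded, the models can be loaded with gensim, for example. A minimal sketch, assuming gensim's fastText and word2vec loaders; the file names are placeholders for whichever model you fetched from the table:

    # A minimal sketch, assuming gensim >= 3.8; file names are
    # placeholders for the downloaded models.
    from gensim.models import KeyedVectors
    from gensim.models.fasttext import load_facebook_model

    # The .bin format keeps the subword n-grams, so even
    # out-of-vocabulary words get a vector:
    ft = load_facebook_model("ft_ru_wiki_lenta.bin")  # placeholder name
    print(ft.wv["россия"].shape)                 # (300,)
    print(ft.wv.most_similar("россия", topn=3))

    # The .vec format is plain word2vec text: smaller and widely
    # supported, but without the subword (OOV) information:
    wv = KeyedVectors.load_word2vec_format("ft_ru_wiki_lenta.vec")
    print(wv["москва"].shape)                    # (300,)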

Word vectors training parameters

These word vectors were trained with the following parameters (bracketed [...] values are the defaults); a reproduction sketch follows the list:

fastText (skipgram)

  • lr [0.1]
  • lrUpdateRate [100]
  • dim 300
  • ws [5]
  • epoch [5]
  • neg [5]
  • loss [softmax]
  • pretrainedVectors []
  • saveOutput [0]
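
As a rough guide to reproducing this setup, here is a minimal sketch using the official fastText Python bindings; corpus.txt is a placeholder for your own preprocessed corpus, and the explicit arguments simply mirror the list above:

    # A minimal sketch, assuming the official `fasttext` Python
    # bindings; "corpus.txt" is a placeholder for a preprocessed
    # corpus (one sentence per line).
    import fasttext

    model = fasttext.train_unsupervised(
        "corpus.txt",
        model="skipgram",
        dim=300,              # the only non-default value in the list
        lr=0.1,
        lrUpdateRate=100,
        ws=5,
        epoch=5,
        neg=5,
        loss="softmax",       # as listed above
    )
    model.save_model("ft_skipgram_ru_300.bin")  # hypothetical output name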