Pre-trained embeddings

BERT

We are publishing several pre-trained BERT models:

  • RuBERT for Russian language
  • Slavic BERT for Bulgarian, Czech, Polish, and Russian
  • Conversational BERT for informal English
  • Conversational RuBERT for informal Russian
  • Sentence Multilingual BERT for encoding sentences in 101 languages
  • Sentence RuBERT for encoding sentences in Russian

Descriptions of these models are available in the BERT section </features/models/bert> of the docs.

License

The pre-trained models are distributed under the Apache 2.0 license.

Downloads

The TensorFlow models can be run with the original BERT repo code, while the PyTorch models can be run with the Hugging Face Transformers library (a loading sketch follows the table). The download links are:

+----------------------------+----------------------------------------------------+--------------------------+
| Description                | Model parameters                                   | Download links           |
+============================+====================================================+==========================+
| RuBERT                     | vocab size = 120K, parameters = 180M, size = 632MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Slavic BERT                | vocab size = 120K, parameters = 180M, size = 632MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Conversational BERT        | vocab size = 30K, parameters = 110M, size = 385MB  | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Conversational RuBERT      | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Sentence Multilingual BERT | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
| Sentence RuBERT            | vocab size = 120K, parameters = 180M, size = 630MB | [tensorflow], [pytorch]  |
+----------------------------+----------------------------------------------------+--------------------------+
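
As a quick check after downloading, here is a minimal sketch of loading one of the PyTorch checkpoints with the Transformers library. The model id DeepPavlov/rubert-base-cased is assumed for illustration (substitute the checkpoint you actually downloaded), and the mean pooling at the end is just one common way to obtain a sentence vector, not part of the published models:

    # A minimal sketch, assuming the `transformers` package and the
    # (assumed) model id "DeepPavlov/rubert-base-cased".
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "DeepPavlov/rubert-base-cased"  # assumed id; use your download
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)

    inputs = tokenizer("Привет, мир!", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)

    # Naive mean pooling over tokens yields one sentence vector; the
    # Sentence * models in the table are intended for this kind of use.
    sentence_vector = hidden.mean(dim=1)
    print(sentence_vector.shape)  # torch.Size([1, 768])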

ELMo

We are publishing a Russian-language ELMo embeddings model (deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder) for TensorFlow Hub, and an LM model (deeppavlov.models.elmo.elmo.ELMo) for training and fine-tuning ELMo as a language model.
ELMo (Embeddings from Language Models) representations are pre-trained contextual representations from large-scale bidirectional language models. See the paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

License

The pre-trained models are distributed under the Apache 2.0 license.

Downloads

The models can be downloaded and run via a DeepPavlov configuration file or as a TensorFlow Hub module (a loading sketch follows the table):

+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| Description               | Dataset parameters                        | Perplexity | Configuration file and TensorFlow Hub module  |
+===========================+===========================================+============+===============================================+
| ELMo on Russian Wikipedia | lines = 1M, tokens = 386M, size = 5GB     | 43.692     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| ELMo on Russian WMT News  | lines = 63M, tokens = 946M, size = 12GB   | 49.876     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
| ELMo on Russian Twitter   | lines = 104M, tokens = 810M, size = 8.5GB | 94.145     | config_file, module_spec                      |
+---------------------------+-------------------------------------------+------------+-----------------------------------------------+
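
For orientation, a minimal sketch of loading one of the modules with tensorflow-hub in TF 1.x graph mode. The module URL is a placeholder for a module_spec link from the table, and the "default" signature with the "elmo" output key is an assumption carried over from the original TF Hub ELMo interface:

    # A minimal sketch, assuming the interface of the original TF Hub
    # ELMo module; MODULE_SPEC is a placeholder for a module_spec link
    # from the table above.
    import tensorflow as tf
    import tensorflow_hub as hub

    MODULE_SPEC = "<module_spec URL from the table>"  # placeholder

    elmo = hub.Module(MODULE_SPEC, trainable=False)
    embeddings = elmo(["все сложное когда-нибудь станет простым"],
                      signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        vectors = sess.run(embeddings)
    print(vectors.shape)  # (batch, max_tokens, embedding_dim)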

fastText

We are publishing pre-trained word vectors for the Russian language. Several models were trained on the joint Russian Wikipedia and Lenta.ru corpus. We also provide one model for Russian conversational language, trained on a Russian Twitter corpus.

All vectors are 300-dimensional. We used the fastText skip-gram model (see Bojanowski et al. (2016)) to train the vectors, with various preprocessing options (see below).

You can get the vectors in either binary or text (vec) format, both for fastText and GloVe; a loading sketch follows the downloads table below.

License

The pre-trained word vectors are distributed under the Apache 2.0 license.

Downloads

The pre-trained fastText skipgram models can be downloaded from:

+------------+------------------------------------------------------+----------+
| Domain     | Preprocessing                                        | Vectors  |
+============+======================================================+==========+
| Wiki+Lenta | tokenize (nltk word_tokenize), lemmatize (pymorphy2) | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize), lowercasing           | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk wordpunct_tokenize)                   | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize)                        | bin, vec |
+            +------------------------------------------------------+----------+
|            | tokenize (nltk word_tokenize), remove stopwords      | bin, vec |
+------------+------------------------------------------------------+----------+
| Twitter    | tokenize (nltk word_tokenize)                        | bin, vec |
+------------+------------------------------------------------------+----------+
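
Once downloaded, the models can be loaded with gensim, for example. A minimal sketch, assuming gensim's fastText and word2vec loaders; the file names are placeholders for whichever model you fetched from the table:

    # A minimal sketch, assuming gensim >= 3.8; file names are
    # placeholders for the downloaded models.
    from gensim.models import KeyedVectors
    from gensim.models.fasttext import load_facebook_model

    # The .bin format keeps the subword n-grams, so even
    # out-of-vocabulary words get a vector:
    ft = load_facebook_model("ft_ru_wiki_lenta.bin")  # placeholder name
    print(ft.wv["россия"].shape)                 # (300,)
    print(ft.wv.most_similar("россия", topn=3))

    # The .vec format is plain word2vec text: smaller and widely
    # supported, but without the subword (OOV) information:
    wv = KeyedVectors.load_word2vec_format("ft_ru_wiki_lenta.vec")
    print(wv["москва"].shape)                    # (300,)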

Word vectors training parameters

These word vectors were trained with the following parameters (bracketed [...] values are the defaults); a reproduction sketch follows the list:

fastText (skipgram)

  • lr [0.1]
  • lrUpdateRate [100]
  • dim 300
  • ws [5]
  • epoch [5]
  • neg [5]
  • loss [softmax]
  • pretrainedVectors []
  • saveOutput [0]
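
As a rough guide to reproducing this setup, here is a minimal sketch using the official fastText Python bindings; corpus.txt is a placeholder for your own preprocessed corpus, and the explicit arguments simply mirror the list above:

    # A minimal sketch, assuming the official `fasttext` Python
    # bindings; "corpus.txt" is a placeholder for a preprocessed
    # corpus (one sentence per line).
    import fasttext

    model = fasttext.train_unsupervised(
        "corpus.txt",
        model="skipgram",
        dim=300,              # the only non-default value in the list
        lr=0.1,
        lrUpdateRate=100,
        ws=5,
        epoch=5,
        neg=5,
        loss="softmax",       # as listed above
    )
    model.save_model("ft_skipgram_ru_300.bin")  # hypothetical output name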