
Vocabulary construction & embedding training information of blank:en LM #6290

Closed
atakanokan opened this issue Oct 22, 2020 · 5 comments
Labels
docs Documentation and website


@atakanokan

I may have missed these, but I couldn't find much information about the tokenization method, vocabulary construction (e.g. size), and embedding training (e.g. the size of the embeddings and which LM training objective is used to train them) when using a blank:en spaCy model.

Tokenization
I am assuming the default spaCy tokenizer is used, as detailed here: How Tokenizer Works. A short note about this somewhere in the documentation would be helpful for people using their own text corpus.

Vocabulary Construction
No information is given about the size and construction method of the vocabulary (is it just the top X most frequent tokens in the corpus?).

Embeddings

  • What are the dimensions of the embedding vectors?
  • How are they trained on the given corpus? What is the LM loss function?


@svlandeg added the docs label on Oct 22, 2020
@adrianeboyd
Contributor

Each language has its own default tokenizer configuration, defined through the settings in spacy/lang/lg/punctuation.py and spacy/lang/lg/tokenizer_exceptions.py for languages that use spacy's tokenizer (i.e., languages other than Chinese, Japanese, Korean, Thai, or Vietnamese). The configurations provided within the library are tuned for the training corpora used in the provided models, so you may need to customize them for another corpus.
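
As a minimal sketch of that kind of customization with the spacy v2 API (the `\+` infix rule is a hypothetical addition for an arithmetic-heavy corpus, not a library default):

```python
import spacy
from spacy.util import compile_infix_regex

# The default rules come from spacy/lang/en/punctuation.py and
# tokenizer_exceptions.py; here we add a hypothetical infix rule
# that also splits tokens on "+".
nlp = spacy.blank("en")
infixes = list(nlp.Defaults.infixes) + [r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("2+3 apples")])  # ['2', '+', '3', 'apples']
```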

In a blank model, the vocab starts out empty and there are no default word embeddings. In spacy itself there are no methods to train word embeddings, but you can provide vectors from another source like word2vec or fasttext by using python -m spacy init-model to initialize a blank model and add the vectors to it: https://spacy.io/api/cli#init-model
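
For example, roughly (a sketch only: the vectors below are random stand-ins for vectors you'd train externally with word2vec or fasttext):

```python
import numpy
import spacy

# Add externally trained vectors to a blank pipeline. On the command
# line, the rough equivalent is:
#   python -m spacy init-model en ./output_dir --vectors-loc vectors.txt.gz
nlp = spacy.blank("en")

# Stand-ins for real pretrained vectors; normally loaded from disk.
pretrained = {
    "apple": numpy.random.rand(300).astype("float32"),
    "banana": numpy.random.rand(300).astype("float32"),
}
for word, vector in pretrained.items():
    nlp.vocab.set_vector(word, vector)

print(nlp.vocab.vectors.shape)     # (2, 300)
print(nlp("apple")[0].has_vector)  # True
```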

The vocab is not static; it's more of a cache that keeps track of the tokens that have been seen so far. After running spacy.blank("lg") you'll mainly see tokens from the tokenizer exceptions added to the vocab during the tokenizer initialization, and then as you process texts with the pipeline the vocab will grow in size.
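
You can see this cache behaviour directly:

```python
import spacy

# The vocab of a blank pipeline starts out nearly empty and grows
# as texts are processed.
nlp = spacy.blank("en")
print(len(nlp.vocab))  # mostly entries from the tokenizer exceptions

nlp("A previously unseen sentence adds new lexemes.")
print(len(nlp.vocab))  # larger: the new tokens were cached as lexemes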

Each spacy model can have its own tokenizer configuration and vectors, so there's no single answer about the dimensions or how they were trained. The basic info is provided in the model meta and, for spacy's provided models, on the models pages on the website: https://spacy.io/models/
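
Programmatically, assuming a provided model like en_core_web_md is installed, the same fields are available from the model's meta.json:

```python
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])      # name, width, and number of vectors
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```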

I hope that helps!

@adrianeboyd added the resolved label on Oct 23, 2020
@atakanokan
Author

Thank you for the prompt answer, @adrianeboyd!

Got a few follow-ups:

In a blank model, the vocab starts out empty and there are no default word embeddings.

Ok, so every token's embedding is initialized with an all-zero vector.

In spacy itself there are no methods to train word embeddings

So this is a bit confusing. Did you mean that there are no methods to train just word embeddings (i.e., no methods to do language modeling), or that spacy doesn't train word embeddings in the background when doing a downstream supervised task (e.g. text classification, NER, etc.)?

The vocab starts out empty in a blank model and then keeps track of tokens that have been seen

as you process texts with the pipeline the vocab will grow in size.

Is an embedding trained after the tokens are added to the vocab, based on some objective (an unsupervised LM or a supervised downstream task)?

Each spacy model can have its own tokenizer configuration and vectors, so there's no one answer about the dimensions or how they were trained.

Correct me if I'm wrong, but these are pretrained models and thus have pre-determined embedding dimensions and a training objective. Is that also true for a blank:en or blank:lg model?

@github-actions bot removed the resolved label on Oct 23, 2020
@adrianeboyd
Contributor

I think part of the confusion is that the term "model" has gotten kind of overloaded here. For the v3 docs we're trying to separate this into "model" for the individual statistical models within some pipeline components and "pipeline" for the group of ordered components that may or may not contain statistical models.

A blank pipeline initialized with spacy.blank("lg") really and truly does not contain any vectors or statistical models. The vectors are not all-0 vectors: there are really no vectors. If you try to access a vector you'll get an error. A blank pipeline only contains a tokenizer and some lexical attributes, and doesn't contain any components with models.
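
A quick way to verify this yourself:

```python
import spacy

# A blank pipeline has no components with models and no vector
# table at all (not zero vectors -- an empty table).
nlp = spacy.blank("en")
print(nlp.pipe_names)              # []
print(nlp.vocab.vectors.shape)     # (0, 0)
print(nlp("apple")[0].has_vector)  # False
```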

If you add a pipeline component with a statistical model (like a tagger) to your blank pipeline, that model has its own tok2vec layer in spacy v2, which is defined here for tagger, parser, and ner (the textcat models are defined separately):

https://github.com/explosion/spaCy/blob/260c29794a1caa70f8b0702c31fcfecad6bfdadc/spacy/ml/_legacy_tok2vec.py

If there are vectors in the vocab, those vectors can be included as a feature along with NORM/PREFIX/SUFFIX/SHAPE. If you add vectors to a blank model and then add a tagger, the vectors are included automatically. In spacy v2, each pipeline component's model (the models for tagger, parser, etc.) has its own completely separate tok2vec layer. This is more configurable in the upcoming spacy v3, for instance using a transformers model or sharing spacy's tok2vec layer between multiple components; see: https://explosion.ai/blog/spacy-v3-nightly#transformers
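
A v2-style sketch of that setup (the "NOUN" label is hypothetical, just so the component can be initialized):

```python
import spacy

# After adding vectors to the vocab (e.g. with nlp.vocab.set_vector,
# as sketched earlier), a newly added tagger's tok2vec layer will use
# them automatically alongside the NORM/PREFIX/SUFFIX/SHAPE features.
nlp = spacy.blank("en")
# ... add vectors to nlp.vocab here ...
tagger = nlp.create_pipe("tagger")
tagger.add_label("NOUN")  # hypothetical label for the sketch
nlp.add_pipe(tagger)
optimizer = nlp.begin_training()  # initializes the tagger's model + tok2vec
```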

A lot of the model details are a bit buried in undocumented code for spacy v2 and thinc v7, but the model configuration and documentation are much improved in spacy v3 and thinc v8, see https://thinc.ai for the thinc docs and https://nightly.spacy.io for a preview of spacy v3. If you want to define custom models, we'd strongly recommend going ahead and making the jump to spacy v3. The final v3.0.0 release will be very close to the current nightly release candidate.

@atakanokan
Author

Ok this is very explanatory! Thanks @adrianeboyd

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators on Oct 30, 2021