
Vocabulary construction & embedding training information of blank:en LM #6290

Closed
atakanokan opened this issue Oct 22, 2020 · 5 comments
Labels
docs Documentation and website


@atakanokan

I may have missed these, but I couldn't find much information about the tokenization method, vocabulary construction (e.g. size), and embedding training (e.g. the size of the embeddings and which LM training objective is used to train them) when using a blank:en spaCy model.

Tokenization
I am assuming the default spaCy tokenizer is used, as detailed here: How Tokenizer Works. A short note about this somewhere in the documentation would be helpful for people using their own text corpus.

Vocabulary Construction
No information is given about the size and construction method of the vocabulary (is it just the top X most frequent tokens in the corpus?).

Embeddings

  • What are the dimensions of the embedding vectors?
  • How are they trained on the given corpus? What is the LM loss function?


@svlandeg added the docs label on Oct 22, 2020
@adrianeboyd
Contributor

Each language has its own default tokenizer configuration, defined through the settings in spacy/lang/lg/punctuation.py and spacy/lang/lg/tokenizer_exceptions.py for languages that use spacy's tokenizer (i.e., languages other than Chinese, Japanese, Korean, Thai, or Vietnamese). The configurations provided within the library are tuned for the training corpora used in the provided models, so you may need to customize them for another corpus.
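
As a minimal sketch of that kind of customization with the spacy v2 API (the `\+` infix rule is a hypothetical addition for an arithmetic-heavy corpus, not a library default):

```python
import spacy
from spacy.util import compile_infix_regex

# The default rules come from spacy/lang/en/punctuation.py and
# tokenizer_exceptions.py; here we add a hypothetical infix rule
# that also splits tokens on "+".
nlp = spacy.blank("en")
infixes = list(nlp.Defaults.infixes) + [r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("2+3 apples")])  # ['2', '+', '3', 'apples']
```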

In a blank model, the vocab starts out empty and there are no default word embeddings. In spacy itself there are no methods to train word embeddings, but you can provide vectors from another source like word2vec or fasttext by using python -m spacy init-model to initialize a blank model and add the vectors to it: https://spacy.io/api/cli#init-model
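
For example, roughly (a sketch only: the vectors below are random stand-ins for vectors you'd train externally with word2vec or fasttext):

```python
import numpy
import spacy

# Add externally trained vectors to a blank pipeline. On the command
# line, the rough equivalent is:
#   python -m spacy init-model en ./output_dir --vectors-loc vectors.txt.gz
nlp = spacy.blank("en")

# Stand-ins for real pretrained vectors; normally loaded from disk.
pretrained = {
    "apple": numpy.random.rand(300).astype("float32"),
    "banana": numpy.random.rand(300).astype("float32"),
}
for word, vector in pretrained.items():
    nlp.vocab.set_vector(word, vector)

print(nlp.vocab.vectors.shape)     # (2, 300)
print(nlp("apple")[0].has_vector)  # True
```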

The vocab is not static; it's more of a cache that keeps track of the tokens that have been seen so far. After running spacy.blank("lg") you'll mainly see tokens from the tokenizer exceptions added to the vocab during the tokenizer initialization, and then as you process texts with the pipeline the vocab will grow in size.
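
You can see this cache behaviour directly:

```python
import spacy

# The vocab of a blank pipeline starts out nearly empty and grows
# as texts are processed.
nlp = spacy.blank("en")
print(len(nlp.vocab))  # mostly entries from the tokenizer exceptions

nlp("A previously unseen sentence adds new lexemes.")
print(len(nlp.vocab))  # larger: the new tokens were cached as lexemes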

Each spacy model can have its own tokenizer configuration and vectors, so there's no single answer about the dimensions or how they were trained. The basic info is provided in the model meta and, for spacy's provided models, on the models pages on the website: https://spacy.io/models/
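
Programmatically, assuming a provided model like en_core_web_md is installed, the same fields are available from the model's meta.json:

```python
import spacy

nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])      # name, width, and number of vectors
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```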

I hope that helps!

@adrianeboyd added the resolved label on Oct 23, 2020
@atakanokan
Author

Thank you for the prompt answer, @adrianeboyd!

Got a few follow-ups:

In a blank model, the vocab starts out empty and there are no default word embeddings.

Ok, so every token's embedding is initialized with an all-zero vector.

In spacy itself there are no methods to train word embeddings

So this is a bit confusing. Did you mean that there are no methods to train just word embeddings (i.e., no methods to do language modeling), or that spacy doesn't train word embeddings in the background when doing a downstream supervised task (e.g. text classification, NER, etc.)?

The vocab starts out empty in a blank model and then keeps track of tokens that have been seen

as you process texts with the pipeline the vocab will grow in size.

Is an embedding trained after the tokens are added to the vocab, based on some objective (an unsupervised LM or a supervised downstream task)?

Each spacy model can have its own tokenizer configuration and vectors, so there's no one answer about the dimensions or how they were trained.

Correct me if I'm wrong, but these are pretrained models and thus have pre-determined embedding dimensions and a training objective. Is that also true for a blank:en or blank:lg model?

@github-actions bot removed the resolved label on Oct 23, 2020
@adrianeboyd
Contributor

I think part of the confusion is that the term "model" has gotten kind of overloaded here. For the v3 docs we're trying to separate this into "model" for the individual statistical models within some pipeline components and "pipeline" for the group of ordered components that may or may not contain statistical models.

A blank pipeline initialized with spacy.blank("lg") really and truly does not contain any vectors or statistical models. The vectors are not all-0 vectors: there are really no vectors. If you try to access a vector you'll get an error. A blank pipeline only contains a tokenizer and some lexical attributes, and doesn't contain any components with models.
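
A quick way to verify this yourself:

```python
import spacy

# A blank pipeline has no components with models and no vector
# table at all (not zero vectors -- an empty table).
nlp = spacy.blank("en")
print(nlp.pipe_names)              # []
print(nlp.vocab.vectors.shape)     # (0, 0)
print(nlp("apple")[0].has_vector)  # False
```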

If you add a pipeline component with a statistical model (like a tagger) to your blank pipeline, that model has its own tok2vec layer in spacy v2, which is defined here for tagger, parser, and ner (the textcat models are defined separately):

https://github.com/explosion/spaCy/blob/260c29794a1caa70f8b0702c31fcfecad6bfdadc/spacy/ml/_legacy_tok2vec.py

If there are vectors in the vocab, those vectors can be included as a feature along with NORM/PREFIX/SUFFIX/SHAPE. If you add vectors to a blank model and then add a tagger, the vectors are included automatically. In spacy v2, each pipeline component's model (the models for tagger, parser, etc.) has its own completely separate tok2vec layer. This is more configurable in the upcoming spacy v3, for instance using a transformers model or sharing spacy's tok2vec layer between multiple components; see: https://explosion.ai/blog/spacy-v3-nightly#transformers
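
A v2-style sketch of that setup (the "NOUN" label is hypothetical, just so the component can be initialized):

```python
import spacy

# After adding vectors to the vocab (e.g. with nlp.vocab.set_vector,
# as sketched earlier), a newly added tagger's tok2vec layer will use
# them automatically alongside the NORM/PREFIX/SUFFIX/SHAPE features.
nlp = spacy.blank("en")
# ... add vectors to nlp.vocab here ...
tagger = nlp.create_pipe("tagger")
tagger.add_label("NOUN")  # hypothetical label for the sketch
nlp.add_pipe(tagger)
optimizer = nlp.begin_training()  # initializes the tagger's model + tok2vec
```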

A lot of the model details are a bit buried in undocumented code for spacy v2 and thinc v7, but the model configuration and documentation are much improved in spacy v3 and thinc v8, see https://thinc.ai for the thinc docs and https://nightly.spacy.io for a preview of spacy v3. If you want to define custom models, we'd strongly recommend going ahead and making the jump to spacy v3. The final v3.0.0 release will be very close to the current nightly release candidate.

@atakanokan
Author

Ok this is very explanatory! Thanks @adrianeboyd

@github-actions
Contributor

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators on Oct 30, 2021