Vocabulary construction & embedding training information of blank:en LM #6290
Comments
Each language has its own default tokenizer configuration, defined through language-specific settings. Each spaCy model can have its own tokenizer configuration and vectors, so there's no one answer about the dimensions or how they were trained. The basic info is provided in the model meta and, for spaCy's provided models, on the model pages on the website: https://spacy.io/models/ I hope that helps!
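A minimal sketch of what's described above (assuming spaCy v3 is installed): a blank pipeline carries only the language's default tokenizer, and its metadata is exposed via `nlp.meta`.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")
print(nlp.pipe_names)       # [] -- no statistical components yet
print(nlp.meta["lang"])     # "en"

# The default English tokenizer still applies its rules:
doc = nlp("Let's tokenize this, shall we?")
print([t.text for t in doc])
```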
Thank you for the prompt answer, @adrianeboyd. I've got a few follow-ups:
Ok, so every token's embedding is initialized as an all-zero vector.
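This zero-vector behavior can be checked directly (a sketch, assuming spaCy v3): without a vector table, `token.has_vector` is `False` and `token.vector` falls back to zeros.

```python
import spacy

nlp = spacy.blank("en")     # no vector table is loaded
doc = nlp("hello world")
tok = doc[0]
print(tok.has_vector)       # False -- no vectors in the vocab
print(tok.vector.sum())     # 0.0 -- the fallback vector contains no signal
```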
So this is a bit confusing. Did you mean there are no methods to train just word embeddings (i.e., no way to do language modeling), or that spaCy doesn't train word embeddings in the background while doing a downstream supervised task (e.g. text classification, NER, etc.)?
Is an embedding being trained after the tokens are added to the vocab?
Correct me if I'm wrong, but these are pretrained models and thus have pre-determined embedding dimensions and a training objective. But is that also true for a blank:en or blank:lg model?
I think part of the confusion is that the term "model" has gotten kind of overloaded here. For the v3 docs we're trying to separate this into "model" for the individual statistical models within some pipeline components and "pipeline" for the group of ordered components that may or may not contain statistical models.

A blank pipeline contains just a tokenizer and an (initially empty) vocab, with no trained components or vectors.

If you add a pipeline component with a statistical model to your blank pipeline (like a tagger), that model has its own embedding layer and architecture settings.

If there are vectors in the vocab, those vectors can be included as a feature alongside the other token features.

A lot of the model details are a bit buried in undocumented code for spaCy v2 and thinc v7, but the model configuration and documentation are much improved in spaCy v3 and thinc v8; see https://thinc.ai for the thinc docs and https://nightly.spacy.io for a preview of spaCy v3. If you want to define custom models, we'd strongly recommend going ahead and making the jump to spaCy v3. The final v3.0.0 release will be very close to the current nightly release candidate.
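The pipeline-vs-model distinction above can be sketched as follows (assuming spaCy v3; `"tagger"` is one of spaCy's built-in component names). Each added component carries its own statistical model, separate from the tokenizer.

```python
import spacy

nlp = spacy.blank("en")            # a pipeline: tokenizer only, no components
tagger = nlp.add_pipe("tagger")    # a component that carries its own model
print(nlp.pipe_names)              # ["tagger"]
# The component's statistical model (a thinc Model) exists even before
# training; its weights are simply uninitialized at this point.
print(tagger.model)
```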
Ok, this is very explanatory! Thanks @adrianeboyd
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I may have missed these, but I couldn't find much information about the tokenization method, vocabulary construction (e.g. size), and embedding training (e.g. size of embeddings, which LM training objective is used when training the embeddings) when using a blank:en spaCy model.
Tokenization
I am assuming the default spaCy tokenizer is used, as detailed here: How Tokenizer Works. Some brief info about this somewhere in the documentation would be helpful for people using their own text corpus.
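For reference, the rule-based tokenization can be inspected directly with `nlp.tokenizer.explain()` (available since spaCy v2.3), which reports which rule (special case, prefix, suffix, token match) produced each token. A minimal sketch:

```python
import spacy

nlp = spacy.blank("en")
# explain() returns (rule_name, token_text) pairs for a debug view
# of the default tokenizer's decisions.
pieces = nlp.tokenizer.explain("Don't panic!")
for rule, text in pieces:
    print(rule, text)
```

Here "Don't" is split by an English special-case rule into "Do" + "n't", and the trailing "!" is split off by a suffix rule.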
Vocabulary Construction
No information is given about the size and construction method of the vocabulary (is it just the top X most frequent tokens in the corpus?).
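One point worth noting here (a sketch, assuming spaCy v3): a blank model's vocab is not built from a corpus by frequency at all. Lexeme entries are cached lazily as text is processed, so there is no fixed "top-K" vocabulary size.

```python
import spacy

nlp = spacy.blank("en")
before = len(nlp.vocab)                 # lexemes present before any text
nlp("a completely new utterance with unseen words")
after = len(nlp.vocab)                  # lexemes are cached as they appear
print(before, after)
```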
Embeddings
Which page or section is this issue related to?