Adding a transformer model for an existing language #7221

I have trained and evaluated a new transformer model for Danish. I have set up the project in a GitHub repo so that it is reproducible, and I am uploading the model so that it is downloadable. I would love to add it to spaCy as well. I assume this is possible, but the documentation does not seem clear on this?
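(As a minimal sketch of what "downloadable" can look like, assuming the pipeline was packaged with `spacy package` and uploaded somewhere pip can reach; the package name and URL below are placeholders, not a real distribution:)

```python
# Hypothetical package name and URL; any packaged spaCy pipeline can be
# installed from a direct link and then loaded by its package name:
#   pip install "da_custom_trf @ https://example.com/da_custom_trf-0.0.1.tar.gz"
import spacy

nlp = spacy.load("da_custom_trf")  # the installed package, not a built-in model
doc = nlp("Dette er en sætning.")
print([(token.text, token.pos_) for token in doc])
```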
Comments
In terms of adding a new model, we already have similar training corpora set up for the standard Danish models (also DaNE, although possibly an older version) and an internal transformer config template used for all the transformer pipelines.

I'm honestly not sure how many transformer models we want to maintain at this point. It makes sense for cases where we have internal training data that we can't redistribute (English, Chinese, German), but when all the resources are easily available, it might make sense to support projects rather than a lot of huge pretrained pipelines. We'd also like a mechanism to add user projects to spaCy projects (we've thought about using submodules, but haven't tested it thoroughly), which could be another option.

As a note: I don't think those orth variants are going to do anything. They refer to Penn Treebank tags that don't appear in the DaNE corpus, as far as I'm aware.
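(For context, a sketch of why that is: spaCy v3's orth-variant augmenter keys its replacements on fine-grained tags, so variants referencing tags that are absent from the corpus never fire. This assumes the registered `spacy.orth_variants.v1` augmenter; the variant table below is an invented illustration, not the real one:)

```python
import spacy
from spacy.training import Example

# "spacy.orth_variants.v1" swaps token orths keyed on fine-grained tags.
# If those tags (here Penn Treebank-style ones) never occur in the corpus,
# the augmenter never changes an example. Invented variant table:
make_augmenter = spacy.registry.augmenters.get("spacy.orth_variants.v1")
augmenter = make_augmenter(
    level=0.1,  # fraction of examples considered for augmentation
    lower=0.5,  # fraction of augmented examples that are also lowercased
    orth_variants={
        "single": [{"tags": ["NFP"], "variants": ["…", "..."]}],
        "paired": [{"tags": ["``", "''"], "variants": [["'", "'"], ["‘", "’"]]}],
    },
)

nlp = spacy.blank("da")
example = Example.from_dict(nlp.make_doc("Dette er en test…"), {})
print([eg.reference.text for eg in augmenter(nlp, example)])
```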
Indeed, all the resources are easily available. This is simply redistributing an application of a trained model, which could easily be reproduced, so some kind of hub for projects would be perfect. Regarding size, there are also a couple of small Danish Electras that could be trained for a more efficient, and probably more production-friendly, workflow.

Regarding the side note: indeed, they don't do anything. Similarly, the casing variants don't do anything either (at least for the Danish BERT), as the inputs are lowercased; this was mostly to test how casing influences performance (speed-wise). Regardless, this was a very painless experience, and the workflow is really efficient.

(Another option would be to allow users to upload models to Hugging Face or similar and provide a mechanism to download them from there as well.)
Hi, as a note, we'll be adding the Danish BERT from BotXO as the suggested transformer in the quickstart. If you know of better options, we're always happy to have suggestions about things like which transformer to choose for the quickstart and for future models, since it's hard for us to stay up-to-date on the resources for all supported languages.
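(For anyone reading along, the quickstart choice boils down to one model name in the transformer component's config. A minimal sketch via the Python API, assuming spacy-transformers is installed and that the BotXO BERT is available on the Hugging Face hub under the id used below:)

```python
import spacy

# Swap in any Hugging Face model id here; the weights are downloaded when
# the pipeline is first initialized for training.
nlp = spacy.blank("da")
nlp.add_pipe(
    "transformer",
    config={"model": {"name": "Maltehb/danish-bert-botxo"}},
)
```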
Hi @adrianeboyd. The BERT by BotXO is a good option to go with. Alternatively, there is the Danish Ælæctra, a small-sized Electra also uploaded by @MalteHB (I believe under the name "-l-ctra", since the Hugging Face interface does not allow "Æ"). It comes in a cased and an uncased version; if you use the cased version, do note that the lowercase augmenter is ideal, since older Danish capitalizes nouns. One advantage of this model is that it is trained on the Danish Gigaword Corpus, which is (deliberately) much more varied, as opposed to the BERT by BotXO (the one you refer to), which is trained on Common Crawl and private data.
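(The augmenter mentioned here is a registered function, so enabling it is a one-liner in code or config. A minimal sketch, assuming a spaCy version that ships the registered `spacy.lower_case.v1` augmenter:)

```python
import spacy
from spacy.training import Example

# Lowercase a fraction (`level`) of the training examples, so a cased model
# also sees noun-capitalized older Danish in lowercase form.
make_augmenter = spacy.registry.augmenters.get("spacy.lower_case.v1")
augmenter = make_augmenter(level=0.3)

nlp = spacy.blank("da")
example = Example.from_dict(nlp.make_doc("Prinsen red paa sin Hest."), {})
print([eg.reference.text for eg in augmenter(nlp, example)])
```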
In reality, I would probably include both, as the trade-offs are similar to those of your other models. This is what I have done in DaCy, a spaCy-based project for Danish NLP. The Ælæctra model also has the advantage of being more transparent about what data it is trained on. In DaCy I have experimented with all the Danish models using spaCy; you can find the results in the repository under training.
Hi @adrianeboyd, I strongly agree with @KennethEnevoldsen that including both the Danish BERT and Ælæctra, and perhaps even both the cased and uncased versions of Ælæctra, would give some very efficient quickstart model options for Danish NLP with spaCy. Regarding the data, @KennethEnevoldsen is also right that Ælæctra was trained on the Danish Gigaword Corpus, which gives it much more transparency and significantly reduces the probability of it having discriminatory biases and tendencies. I am looking forward to hearing what you opt for, and to continuing to follow your future work on spaCy! Have a nice day! 😄
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.