Adding a transformer model for an existing language. #7221

Closed · KennethEnevoldsen opened this issue Feb 26, 2021 · 6 comments

Labels: lang / da (Danish language data and models) · models (Issues related to the statistical models)

Comments

@KennethEnevoldsen (Contributor) commented Feb 26, 2021

I have trained and evaluated a new transformer model for Danish. I have the project in a GitHub repo so it is reproducible, and I am uploading the model so that it is downloadable. I would love to add it to spaCy as well. I assume this is possible, but the documentation does not seem clear on this.
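For context on distribution: a pipeline packaged with spaCy's `spacy package` command can be pip-installed from any URL and loaded by name, even without being an official model. A minimal sketch, with a hypothetical package name and URL:

```python
# Hypothetical install of a third-party packaged pipeline:
#   pip install https://example.com/da_custom_trf-0.0.1-py3-none-any.whl
import spacy

nlp = spacy.load("da_custom_trf")  # hypothetical package name
doc = nlp("Dette er en dansk sætning.")
print([(ent.text, ent.label_) for ent in doc.ents])
```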

@adrianeboyd added the lang / da (Danish language data and models) and models (Issues related to the statistical models) labels on Feb 26, 2021
@adrianeboyd (Contributor)

In terms of adding a new model, we already have similar training corpora set up for the standard Danish models (also DaNE, although possibly an older version) and an internal transformer config template used for all the trf models, so it would be a matter of choosing to support this (when we add a model we commit to maintaining it in the future) and deciding which transformer model to use. I do think xlm-roberta-large is getting too large to be practical for us to redistribute (also I think you can run into issues with msgpack?), but the smaller Danish model could be a reasonable choice.
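For readers following along: in a spaCy v3 training config, the choice of transformer is a single setting, so swapping xlm-roberta-large for a smaller Danish model is a one-line change. A rough sketch of the relevant slice (the actual internal template isn't shown in this thread; the section below follows spaCy's documented config format):

```python
from thinc.api import Config

# Parse just the transformer section of a hypothetical v3 config; the `name`
# value is the only thing that changes between e.g. xlm-roberta-large and a
# smaller Danish model.
cfg = Config().from_str("""
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "Maltehb/danish-bert-botxo"
tokenizer_config = {"use_fast": true}
""")
print(cfg["components"]["transformer"]["model"]["name"])
```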

I'm honestly not sure how many transformer models we want to maintain at this point. It makes sense for cases where we have internal training data that we can't redistribute (English, Chinese, German), but when all the resources are easily available it might make sense to support projects rather than a lot of huge pretrained pipelines.

We'd also like a mechanism to add user projects to spaCy projects (we've thought about using submodules, but haven't tested it thoroughly), which could be another option.

As a note: I don't think those orth variants are going to do anything. They refer to Penn treebank tags that don't appear in the DaNE corpus as far as I'm aware.

@KennethEnevoldsen (Contributor, Author) commented Feb 27, 2021

Indeed, all the resources are easily available. This is simply redistributing an application of a trained model, which could easily be reproduced, so some kind of hub for projects would be perfect.
But what I hear you saying is that we would have to distribute the model ourselves if we want it distributed? Is there a case where people have the resources to use the model but not to fine-tune it?

Regarding size, there also exist a couple of small Danish ELECTRAs which could be trained for a more efficient, and probably more production-friendly, workflow.

Regarding the side note: indeed, they don't do anything. Similarly, the casing variants don't do anything either (at least for the Danish BERT, which is lowercased). This was mostly to test how casing influences performance (speed-wise).

Regardless, this was a very painless experience. The workflow is really efficient.

(Another option would be to allow users to easily upload models to Hugging Face or similar and provide a mechanism to download them from there as well.)

EDIT:
I have recently added a small ELECTRA (Ælæctra) as well.

@adrianeboyd (Contributor)

Hi, as a note, we'll be adding da_core_news_trf for spaCy v3.1, initially configured to use Maltehb/danish-bert-botxo. For now the pipeline config is basically the same for all trf models to make the training setup easier on our end, so the details might not be quite the same as in your pipeline, but we hope it will be useful.
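As a quick illustration, once v3.1 is out the pipeline should be usable like any other packaged model; a minimal sketch (the download step assumes the standard `spacy download` mechanism):

```python
import spacy

# Download first, e.g.: python -m spacy download da_core_news_trf
nlp = spacy.load("da_core_news_trf")
doc = nlp("København er hovedstaden i Danmark.")
print([(tok.text, tok.pos_) for tok in doc])
```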

If you know of better options, we're always happy to have suggestions about things like which transformer to choose for the quickstart and for future models, since it's hard for us to stay up-to-date on the resources for all supported languages.

@KennethEnevoldsen (Contributor, Author)

Hi @adrianeboyd. The BERT by botxo is a good option to go with.

Alternatively, there is the Danish ÆLÆCTRA, a small ELECTRA also uploaded by @MalteHB (I believe under the name -l-ctra, since the Hugging Face interface does not allow "Æ"). It has a cased and an uncased version, but if you use the cased version, note that the lowercase augmenter is ideal (see the sketch after the note below), since older Danish capitalizes nouns. One advantage of this model is that it is trained on the Danish Gigaword Corpus, which is deliberately much more varied than the data behind the BERT by botxo (the one you refer to), which is trained on Common Crawl and private data.

  • It does, however, perform slightly worse on the DaNE dataset due to its smaller size.
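A minimal sketch of the augmenter mentioned above, assuming it refers to spaCy v3's built-in "spacy.lower_case.v1" augmenter (the level value here is illustrative):

```python
from spacy.training.augment import create_lower_casing_augmenter

# Lowercases a fraction of training examples so a cased model also sees
# lowercased text; useful when older Danish capitalizes nouns.
augmenter = create_lower_casing_augmenter(level=0.1)  # ~10% of examples

# In a training config this corresponds to:
#   [corpora.train.augmenter]
#   @augmenters = "spacy.lower_case.v1"
#   level = 0.1
```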

In reality, I would probably include both, as the trade-offs are similar to those of your other models. This is what I have done in DaCy, a spaCy-based project for Danish NLP. The ÆLÆCTRA model also has the advantage of being more transparent about what data it is trained on.

In DaCy I have experimented with all the Danish models using spaCy. You can find it under training here:
https://github.com/KennethEnevoldsen/DaCy

@MalteHB commented Jun 16, 2021

Hi @adrianeboyd,

I strongly agree with @KennethEnevoldsen that including both the Danish BERT and Ælæctra, and perhaps even both the cased and uncased versions of Ælæctra, would give some very efficient quickstart model options for Danish NLP with spaCy.

And regarding the data, @KennethEnevoldsen is also right that Ælæctra was trained on the Danish Gigaword Corpus, which gives it much more transparency and significantly reduces the probability of it having discriminatory biases and tendencies.

I am very much looking forward to hearing what you opt for, and to continuing to follow your future work on spaCy!

Have a nice day! 😄

@github-actions (bot) commented Apr 7, 2023

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked this conversation as resolved and limited it to collaborators on Apr 7, 2023