UD web should warn about treebanks that are not yet / no longer part of the official release #563
Hi Pierre! It depends on what you mean. We are not allowed to distribute the original text of the FTB, but if you have a license for the treebank you can merge the UD annotations with the original text (for more information, see https://github.com/UniversalDependencies/UD_French-FTB/blob/master/README.md). Or are you saying that the annotations are missing as well? If so, this is a mistake. The FTB treebank was not included in the CoNLL shared task (because the texts are not free), but it should have been included in the official 2.2 release. I am afraid we may have to wait until @dan-zeman is back from vacation before we can sort this out (unless @fginter happens to know something). |
Hi, UD 2.2 validation is stricter than 2.1, and FTB is no longer valid. It may be reintroduced in the future if the bugs are fixed. |
Ah, I should have thought of that. Thanks! |
Hello Joakim, @dan-zeman. If the validation rejected it, then the UD main page listing the corpora should be updated to reflect this. By the way, are there other corpora missing due to validation failures? |
I agree it should be reflected on the main page, but it currently isn't. There are even treebanks that have never been ported from v1 that are still listed (like the English-ESL and the Swedish Sign Language treebanks). It seems appropriate to add an attribute to the treebank row (as opposed to the language row) in the at-a-glance table indicating the latest version in which it was part of the release (or simply a binary indication of whether it is included in the latest release or not). Unfortunately, we don't have this at the moment. |
Hi Pierre, Best, |
Hello Djamé, |
I am leaving the issue open but have changed its title to reflect that the requested action is now to modify the scripts that generate the list of treebanks on the UD title page. Treebanks that have never been released should be marked as such (they may still appear in the upper part if there is another treebank of the same language that has been released). Languages that have repositories with data but have not been released yet should be shown in the lower part (upcoming treebanks). This does not work properly now because treebanks are classified as "upcoming" only if their repositories are empty. And finally, if a treebank has been released at least once BUT is not part of the most recent release because of validity issues, it should also be marked. |
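For illustration only, the classification requested above could be sketched roughly as follows. This is not the actual UD page-generation script: the `Treebank` record, the status labels, and the treebank names in the example are all hypothetical, and it assumes the script can tell, for each repository, whether it contains data and which releases included it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Treebank:
    # Hypothetical record; the real scripts may store this differently.
    name: str
    has_data: bool                      # repository is not empty
    releases: List[str] = field(default_factory=list)  # releases that included it


def classify(tb: Treebank, latest: str) -> str:
    """Return a status label for one treebank row."""
    if latest in tb.releases:
        return "current"         # part of the most recent release
    if tb.releases:
        return "retired"         # released before, dropped from the latest release
    if tb.has_data:
        return "never released"  # has data but was never in any release
    return "empty"               # repository exists but holds no data yet


def language_section(treebanks: List[Treebank]) -> str:
    """Upper part if any treebank of the language was ever released, else lower."""
    if any(tb.releases for tb in treebanks):
        return "released"
    return "upcoming"


# Illustrative data only; names and version lists are made up.
lang_a = [
    Treebank("LangA-One", has_data=True, releases=["2.1", "2.2"]),
    Treebank("LangA-Two", has_data=True, releases=["2.1"]),
    Treebank("LangA-Three", has_data=True),
]
lang_b = [Treebank("LangB-One", has_data=True)]

for tb in lang_a + lang_b:
    print(tb.name, classify(tb, latest="2.2"))
print(language_section(lang_a), language_section(lang_b))
```

The point of the sketch is that "upcoming" is decided at the language level from the release history, not from whether a repository happens to be empty, while the per-treebank label distinguishes retired treebanks from those that were never released.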
Finally solved. After dropping another few treebanks from UD 2.12, we now have a section for "retired treebanks" on the UD home page. |
I just downloaded version 2.2 of the treebanks. Apparently, the French treebank (French-FTB) is missing from it. Is this normal?