UD web should warn about treebanks that are not yet / no longer part of the official release #563

Closed
pnugues opened this issue Aug 15, 2018 · 9 comments

pnugues commented Aug 15, 2018

I just downloaded version 2.2 of the treebanks. Apparently, the French treebank (French-FTB) is missing from it. Is this normal?

Contributor

jnivre commented Aug 15, 2018

Hi Pierre! It depends on what you mean. We are not allowed to distribute the original text of the FTB, but if you have a license for the treebank you can merge the UD annotations with the original text (for more information, see https://github.com/UniversalDependencies/UD_French-FTB/blob/master/README.md). Or are you saying that the annotations are missing as well? If so, this is a mistake. The FTB treebank was not included in the CoNLL shared task (because the texts are not free), but it should have been included in the official 2.2 release. I am afraid we may have to wait until @dan-zeman is back from vacation before we can sort this out (unless @fginter happens to know something).

@dan-zeman
Member

Hi, UD 2.2 validation is stricter than 2.1, and FTB is no longer valid. It may be reintroduced in the future if the bugs are fixed.

Contributor

jnivre commented Aug 15, 2018

Ah, I should have thought of that. Thanks!

Author

pnugues commented Aug 15, 2018

Hello Joakim,
I meant that it was not included in the official release. See here: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2837 and download the files to check it.

@dan-zeman If the validation rejected it, then the UD main page listing the corpora should be updated to reflect this. By the way, are there other corpora missing due to validation failures?

Contributor

jnivre commented Aug 15, 2018

I agree it should be reflected on the main page, but it currently isn't. There are even treebanks that have never been ported from v1 that are still listed (like the English-ESL and the Swedish Sign Language treebanks). It seems appropriate to add an attribute to the treebank row (as opposed to the language row) in the at-a-glance table indicating the latest version in which it was part of the release (or simply a binary indication of whether it is included in the latest release or not). Unfortunately, we don't have this at the moment.

dseddah commented Aug 15, 2018

Hi Pierre,
sorry for not including UD_FTB in the 2.2 release, but as it turned out, the new validator rejected this treebank because of 200 dependency annotation mismatches (200 out of 600k, about 0.03% errors, mostly left-to-right errors for conj and some for fixed). As this treebank comes from a huge data-driven conversion process and because there were so few errors, we didn't want to correct them by hand. We will probably do it as soon as we have some time. Meanwhile, I'll be happy to send you that treebank in dm. Please follow the procedure indicated in the 2.1 readme file. Anyway, I think you already have it, if I'm not mistaken. There will be no major changes in the future fixed version besides these corrections.

Best,
Djamé
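
For readers unfamiliar with the "left-to-right errors" mentioned above: UD expects relations such as conj and fixed to point from an earlier head to a later dependent. The following is a minimal sketch of such a check over CoNLL-U text, not the official validator. The function name `find_right_to_left`, the two-relation set, and the sample sentence are assumptions made for illustration.

```python
# Relations checked in this sketch (assumed subset): the dependent must
# come after its head in the sentence.
LEFT_TO_RIGHT_RELS = {"conj", "fixed"}


def find_right_to_left(conllu_text):
    """Return (sentence_index, token_id, head_id, deprel) for each violation."""
    violations = []
    sent_index = 0
    saw_token = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if saw_token:
                sent_index += 1
                saw_token = False
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        saw_token = True
        token_id = int(cols[0])
        if not cols[6].isdigit():
            continue  # HEAD is unspecified
        head_id = int(cols[6])
        deprel = cols[7].split(":")[0]  # drop any subtype, e.g. conj:xyz
        if deprel in LEFT_TO_RIGHT_RELS and head_id > token_id:
            violations.append((sent_index, token_id, head_id, deprel))
    return violations


# Illustrative (deliberately wrong) annotation: token 1 is attached as conj
# to token 3, so the relation points right-to-left.
SAMPLE = "\n".join([
    "# text = pommes et poires",
    "1\tpommes\tpomme\tNOUN\t_\t_\t3\tconj\t_\t_",
    "2\tet\tet\tCCONJ\t_\t_\t3\tcc\t_\t_",
    "3\tpoires\tpoire\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
])

print(find_right_to_left(SAMPLE))
```

A check like this only locates the offending tokens; deciding how to reattach them still requires looking at each sentence.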

Author

pnugues commented Aug 15, 2018

Hello Djamé,
Yes, I have it, but I was very close to replacing (and erasing) the old versions of all the Universal Dependencies treebanks with the new ones. I will wait for the next release.

@dan-zeman dan-zeman added this to the v2.4 milestone Nov 13, 2018
@dan-zeman dan-zeman changed the title Missing treebank from the 2.2 release UD web should warn about treebanks that are not yet / no longer part of the official release Nov 13, 2018
@dan-zeman
Member

I am leaving the issue open but have changed its title to reflect that now the requested action is to modify the scripts that generate the list of treebanks on the UD title page.

Treebanks that have never been released should be marked as such (they may still appear in the upper part if there is another treebank of the same language that has been released). Languages that have repositories with data but have not been released yet should be shown in the lower part (upcoming treebanks). This does not work properly now because treebanks are classified as "upcoming" only if their repositories are empty. And finally, if a treebank has been released at least once BUT is not part of the most recent release because of validity issues, it should also be marked.
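
The rules above could be sketched roughly as follows. This is not the actual script that generates the UD title page; the `Treebank` class, the status labels, the function names, and the sample data are all hypothetical. The sketch also assumes that a language belongs in the upper part as soon as any of its treebanks has ever been released, which is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Treebank:
    name: str                  # e.g. "French-FTB"
    language: str              # e.g. "French"
    releases: Tuple[str, ...]  # UD versions that included this treebank


def treebank_status(tb: Treebank, latest: str) -> str:
    """Classify one treebank relative to the latest release."""
    if latest in tb.releases:
        return "released"
    if tb.releases:
        # Released at least once, but missing from the latest release.
        return "dropped"
    # Never released, whether or not the repository already holds data.
    return "never released"


def language_section(treebanks: List[Treebank], language: str) -> str:
    """Decide whether a language goes in the upper or lower part of the list."""
    own = [tb for tb in treebanks if tb.language == language]
    # Assumption: one treebank that was ever released is enough for the upper part.
    if any(tb.releases for tb in own):
        return "upper"
    return "lower (upcoming)"


# Illustrative data only; the release lists are not taken from UD metadata.
TREEBANKS = [
    Treebank("French-FTB", "French", ("2.1",)),
    Treebank("French-Other", "French", ("2.1", "2.2")),
    Treebank("Xx-New", "Xx", ()),
]

for tb in TREEBANKS:
    print(tb.name, "->", treebank_status(tb, "2.2"))
print("French ->", language_section(TREEBANKS, "French"))
print("Xx ->", language_section(TREEBANKS, "Xx"))
```

Keeping the list of releases per treebank, rather than a single flag, also makes it possible to show the last version in which a treebank was included.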

@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019
@dan-zeman dan-zeman modified the milestones: v2.5, v2.6 Nov 9, 2019
@dan-zeman dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@dan-zeman
Member

Finally solved. After dropping another few treebanks from UD 2.12, we now have a section for "retired treebanks" on the UD home page.
