Update UDPipe.json - new languages #62
Updating language list based on updated models.txt https://raw.githubusercontent.com/ufal/udpipe/932c63f59d381c4d5bf5f521e4bbc60f1c48e412/releases/models.txt
@kosarko What do you mean by "duplicating"? The language list should contain ISO 639-3 language codes, as specified in the documentation (https://github.com/clarin-eric/switchboard-doc/blob/master/documentation/ToolDescriptionSpec.md). For instance, I didn't find "vsn" in the Wikipedia list; can you tell me what language it is? |
@kosarko The models.txt from UDPipe contains 639-1, 639-2/T and 639-2/B codes. If 639-3 is to be used in this place, you need to keep only the 639-2/T ones.
@kosarko Also, the UD 2.6 models are in UDPipe 2 and unfortunately at a different place than the models.txt; they are currently here: https://github.com/ufal/udpipe/blob/udpipe-2/models-2.6/models_list.sh
And yes, getting a list of ISO codes directly from the service (with separate entries for -1, -2/T, -2/B and -3) instead of weird config files would be much better -- it is planned ;-) |
@emanueldima I don't think it's ready yet; I need to check there's everything from the models_list.sh. And also we should probably remove the 639-2/B codes, right? |
removing duplicate codes and
@emanueldima should be mostly ok now; I did remove:
lzh: classical_chinese
orv: old_russian
pcm: naija (Nigerian Pidgin?)
My script doesn't recognize these codes (maybe an outdated code table). @emanueldima what about switchboard? Can it cope with these? If yes, I'll add them back; if not, it should be safe to merge. |
On 23. 11. 2020, at 19:00, Ondřej Košarko ***@***.***> wrote:
@emanueldima should be mostly ok now; I did remove:
lzh: classical_chinese
orv: old_russian
pcm: naija (Nigerian Pidgin?)
My script doesn't recognize these codes (maybe an outdated code table). @emanueldima what about switchboard? Can it cope with these? If yes, I'll add them back; if not, it should be safe to merge.
But they are all valid ISO 639-3 codes. Example:
https://iso639-3.sil.org/code_tables/639/data?title=lzh&field_iso639_cd_st_mmbrshp_639_1_tid=All&name_3=&field_iso639_element_scope_tid=All&field_iso639_language_type_tid=All&items_per_page=200
This model in particular (Classical Chinese) might be of interest to many researchers. It was the language of both literature and official documents in China, Korea, and possibly other countries from antiquity until about 1920.
|
ISO 639 is a living thing; there are updates and/or deprecations happening.
Do you know a reliable ISO 639 code reader/validator? It seems that the tables from sil.org can't be redistributed (https://iso639-3.sil.org/code_tables/download_tables). I just pulled a library from pip (it hasn't been updated in over 5 years) and I'm using that to convert what's in the UDPipe model list; if switchboard or anything validating these JSON files uses the same library, it might report issues...
______________________________________________________________
From: "Pavel Stranak" ***@***.***>
To: "clarin-eric/switchboard-tool-registry" ***@***.***>
Date: 23.11.2020 19:07
Subject: Re: [clarin-eric/switchboard-tool-registry] Update UDPipe.json - new languages (#62)
|
The Switchboard uses a library to recognize the language automatically. The list of recognized languages is here: https://github.com/optimaize/language-detector#language-support Each ISO 639-1 code is converted to ISO 639-3 using this table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Only the ISO 639-1 and -3 codes are used, all the other language codes are removed. You have removed important languages like |
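The conversion described above can be sketched roughly as follows. This is only an illustration, not the Switchboard's actual code: the table `ISO639_1_TO_3` is a tiny hand-written excerpt, and the function name and the `restrict_to_639_1` flag are hypothetical, added to show what happens to codes such as lzh, orv and pcm when only languages that have a 639-1 counterpart are kept.

```python
# Hypothetical excerpt of a 639-1 -> 639-3 table (not the Switchboard's data).
ISO639_1_TO_3 = {"de": "deu", "fr": "fra", "cy": "cym", "eu": "eus", "zh": "zho"}


def selectable_languages(tool_codes, restrict_to_639_1):
    """Return the 639-3 codes a user could select for a tool (sketch only)."""
    with_639_1 = set(ISO639_1_TO_3.values())
    result = []
    for code in tool_codes:
        # Convert a 639-1 code to 639-3; leave any other code unchanged.
        code = ISO639_1_TO_3.get(code, code)
        if len(code) != 3:
            continue  # two-letter code missing from the table: drop it
        if restrict_to_639_1 and code not in with_639_1:
            continue  # no 639-1 counterpart: dropped under the restriction
        if code not in result:
            result.append(code)
    return result


tool_codes = ["de", "fra", "cym", "lzh", "orv", "pcm"]
restricted = selectable_languages(tool_codes, restrict_to_639_1=True)
unrestricted = selectable_languages(tool_codes, restrict_to_639_1=False)
print(restricted)
print(unrestricted)
```

Note that the sketch does not validate three-letter codes, so a 639-2/B code such as fre would pass through unchanged when the restriction is off.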
iso639-3 code instead of a mix of part2B and part2T
changing lang encoding
as for lzh, orv and pcm there's no Part1 (can't be mapped to 639-1)
Alright, sorry for the mix-up; some of those were Part 2B codes (e.g. ger instead of deu, baq instead of eus, etc.). I'm now using https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab as the authoritative source of the codes. |
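A minimal sketch of this kind of normalization, assuming the tab-separated layout of iso-639-3.tab with the columns Id, Part2B, Part2T and Part1 (verify the header against the file you actually download; the code lower-cases the column names in case their spelling differs). The sample rows are a small hand-written excerpt and the function names are hypothetical; this is not the script used in this pull request.

```python
import csv
import io

# Hand-written excerpt in the assumed layout of iso-639-3.tab.
SAMPLE_TAB = (
    "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    "deu\tger\tdeu\tde\tI\tL\tGerman\t\n"
    "eus\tbaq\teus\teu\tI\tL\tBasque\t\n"
    "cym\twel\tcym\tcy\tI\tL\tWelsh\t\n"
    "fra\tfre\tfra\tfr\tI\tL\tFrench\t\n"
    "lzh\t\t\t\tI\tH\tLiterary Chinese\t\n"
    "orv\t\t\t\tI\tH\tOld Russian\t\n"
    "pcm\t\t\t\tI\tL\tNigerian Pidgin\t\n"
)


def build_lookup(tab_text):
    """Map every code in the table (Id, Part2B, Part2T, Part1) to its 639-3 Id."""
    lookup = {}
    reader = csv.DictReader(
        io.StringIO(tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Lower-case the header names so "Part2B" and "Part2b" both work.
        row = {key.lower(): value for key, value in row.items()}
        for column in ("id", "part2b", "part2t", "part1"):
            code = row.get(column, "")
            if code:
                lookup[code] = row["id"]
    return lookup


def to_iso639_3(codes, lookup):
    """Return (sorted unique 639-3 codes, input codes not found in the table)."""
    known, unknown = set(), []
    for code in codes:
        if code in lookup:
            known.add(lookup[code])
        else:
            unknown.append(code)
    return sorted(known), unknown


lookup = build_lookup(SAMPLE_TAB)
mixed = ["ger", "deu", "de", "baq", "wel", "cym", "fre", "lzh", "vsn"]
codes, unknown = to_iso639_3(mixed, lookup)
print(codes)
print(unknown)
```

Collapsing every variant onto the Id column removes pairs such as wel/cym and fra/fre in one pass, and anything that ends up in `unknown` can be checked by hand instead of being dropped silently.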
you should be able to test the changes here: https://beta-switchboard.clarin.eu/tools |
Seems ok I'd say. Let me reiterate the question about supported languages, you wrote:
So if there is a language that does not have an iso 639-1 code (such as lzh, orv, pcm) it won't be possible to select it in the switchboard ui, correct? |
Not 639-1 but 639-3. Unless the specified language is a 639-3 code and is listed in the Wikipedia table above, the user cannot select it. |
Sorry if I am a bit confused, I will ask again just to make sure I understand it correctly: We should provide ISO 639-3 (or -1) codes, but only languages defined in 639-1 are acceptable, all the other ISO 639-3 codes are ignored? If that is the case, what is the motivation? And are there any plans to change it? It is clear that we now have models not showing up (tested in the beta) and there will only be more. I think especially with SSH use cases in mind we cannot limit ourselves to languages recognised in 639-1. PS: I am fine with many small or old languages not having models that will recognise them automatically. But it should be still possible to find them and select them by name. |
The Switchboard tries to work with the largest set of languages available, so it works internally with 639-3. If a tool works also with 639-3 and uses 639-3 language codes then all should be fine.
Edit: Apparently I need more sleep. You're right, I am restricting artificially the languages to 639-1. This should be treated as a bug and will be fixed. |
Ah, I see now. Yes, that is very much incomplete. 639-3 is a vastly broader set than 639-1. Many languages are simply not listed in 639-1; for many others there is only a macrolanguage (like Chinese or German, even for mutually unintelligible varieties); there are almost no historical languages, no sign languages … For example, look at "German" across all ISO 639 code tables here. |
Thanks for pointing this out. This needs fixing the UI and may take some time, until the next release, to be able to correctly select the language. I suggest to keep the current UDPipe definition; after the fix the languages will become selectable. |
If you can generate UD-pipe metadata for switchboard automatically, you may be interested in our new repository: https://github.com/clarin-eric/switchboard-tool-registry-contrib (see the README). Please note that we (clarin-eric, switchboard developers) cannot maintain the code in there, we only host it. |
A new version (2.2.3) has just been released, which should allow users to select any language. Please report any issue you find with it. |
When I try a text in Classical (Literary) Chinese, it doesn't work with UDPipe. Could there be a code mismatch? When I say the text is "Chinese" in Switchboard, it is sent to UDPipe, proper model selected and executed. When I pick "Literary Chinese" in Switchboard, it redirects to this URL. Model is "undefined" and the text is also not copied into input. https://lindat.mff.cuni.cz/services/udpipe/?data=https%3A%2F%2Fswitchboard.clarin.eu%2Fapi%2Fstorage%2Fd5336eeb-416e-4759-a290-34650dff9af2&model=undefined |
The last changes have not been pushed to production yet, this will do it: #83 You should be able to test the current up-to-date version in our beta: https://beta-switchboard.clarin.eu |
Should be working now in production as well |
Works in the beta. Production, on the other hand, is giving me "Error Bad Gateway" when I try to drag and drop an input file. |
Sorry my bad! The server got stuck during reload and I did not notice. It is all good again now. |
Yes, works fine now. It doesn't detect Literary Chinese automatically, but when selected, it works as expected. Thanks, guys! |
Updating language list based on updated models.txt https://raw.githubusercontent.com/ufal/udpipe/932c63f59d381c4d5bf5f521e4bbc60f1c48e412/releases/models.txt
@stranak @foxik Does this seem correct - there were more models added, but very few new languages.
@emanueldima @andmor- Is it ok to list "duplicate" codes such as wel and cym, or fra and fre?