Update UDPipe.json - new languages #62

kosarko · 2020-10-16T09:44:01Z

Updating language list based on updated models.txt https://raw.githubusercontent.com/ufal/udpipe/932c63f59d381c4d5bf5f521e4bbc60f1c48e412/releases/models.txt

@stranak @foxik Does this seem correct? More models were added, but very few new languages.

@emanueldima @andmor- Is it ok to list "duplicate" codes such as wel and cym or fra and fre?


emanueldima · 2020-10-19T08:09:29Z

@kosarko What do you mean by "duplicating"? The language list should contain ISO 639-3 language codes, as specified in the documentation (https://github.com/clarin-eric/switchboard-doc/blob/master/documentation/ToolDescriptionSpec.md). I didn't find the "vsn" for instance in the wikipedia list, can you tell me what language it is?

foxik · 2020-10-19T08:29:42Z

The vsn is a proposed code for Vedic Sanskrit, see for example http://multitree.org/codes/vsn or https://en.wikipedia.org/wiki/Vedic_Sanskrit; the proposal for the vsn itself is at https://iso639-3.sil.org/request/2011-041.

@kosarko The models.txt from UDPipe contains 639-1, 639-2/T and 639-2/B codes. If 639-3 is to be used in this place, you need to keep only the 639-2/T ones.

@kosarko Also, the UD 2.6 models are in UDPipe 2 and unfortunately at a different place than the models.txt; they are currently here: https://github.com/ufal/udpipe/blob/udpipe-2/models-2.6/models_list.sh

And yes, getting a list of ISO codes directly from the service (with separate entries for -1, -2/T, -2/B and -3) instead of weird config files would be much better -- it is planned ;-)

emanueldima · 2020-11-18T10:00:14Z

@kosarko @foxik Is this ready to merge from your perspective?

kosarko · 2020-11-18T10:16:43Z

@emanueldima I don't think it's ready yet; I need to check there's everything from the models_list.sh. And also we should probably remove the 639-2/B codes, right?

removing duplicate codes and lzh: classical_chinese orv: old_russian pcm: naija (Nigerian Pidgin?)

kosarko · 2020-11-23T18:00:07Z

@emanueldima should be mostly ok now; I did remove:

lzh: classical_chinese
orv: old_russian
pcm: naija (Nigerian Pidgin?)

My script doesn't recognize these codes (maybe an outdated code table). @emanueldima what about the Switchboard? Can it cope with these? If yes, I'll add them back; if not, it should be safe to merge.

stranak · 2020-11-23T18:07:32Z


But they are all valid ISO 639-3 codes. Example: https://iso639-3.sil.org/code_tables/639/data?title=lzh&field_iso639_cd_st_mmbrshp_639_1_tid=All&name_3=&field_iso639_element_scope_tid=All&field_iso639_language_type_tid=All&items_per_page=200 Especially this model (Classical Chinese) might be of interest to many researchers. It was the language of both literature and official documents in China, Korea, and possibly other countries from antiquity until about 1920.

kosarko · 2020-11-23T21:43:43Z

ISO 639 is a living thing; there are updates and deprecations happening. Do you know of a reliable ISO 639 code reader/validator? It seems that the tables from sil.org can't be redistributed (https://iso639-3.sil.org/code_tables/download_tables). I just pulled a library from pip (it hasn't been updated in over 5 years) and I'm using that to convert what's in the UDPipe model list; if the Switchboard or anything validating these JSON files uses the same library, it might report issues...


emanueldima · 2020-11-24T08:19:12Z

The Switchboard uses a library to recognize the language automatically. The list of recognized languages is here: https://github.com/optimaize/language-detector#language-support

Each ISO 639-1 code is converted to ISO 639-3 using this table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Only the ISO 639-1 and -3 codes are used, all the other language codes are removed.

You have removed important languages like deu, ell, fra and added baq, which is not ISO 639-3. Is this what you want?
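The effect of the conversion described above can be illustrated with a short sketch. It is not the Switchboard's actual code: `ISO639_1_TO_3` stands in for the Wikipedia-derived table (only a few entries are shown), and `selectable_languages` is a hypothetical helper. Under those assumptions it shows why a valid 639-3 code with no 639-1 counterpart, or a 639-2/B code such as baq, drops out.

```python
# Hand-picked excerpt standing in for the 639-1 -> 639-3 table (assumption:
# the real table covers every 639-1 code; only four are shown here).
ISO639_1_TO_3 = {"de": "deu", "el": "ell", "fr": "fra", "eu": "eus"}


def selectable_languages(tool_codes):
    """Keep only codes reachable from a 639-1 code; report the rest as dropped."""
    reachable = set(ISO639_1_TO_3.values())
    kept = [c for c in tool_codes if c in reachable]
    dropped = [c for c in tool_codes if c not in reachable]
    return kept, dropped


kept, dropped = selectable_languages(["deu", "fra", "baq", "lzh", "orv", "pcm"])
print(kept)     # ['deu', 'fra']
print(dropped)  # ['baq', 'lzh', 'orv', 'pcm']
```

With such a filter, baq is rejected because it is a 639-2/B code, while lzh, orv and pcm are rejected only because no 639-1 code maps to them.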

iso639-3 code instead of a mix of part2B and part2T changing lang encoding as for lzh, orv and pcm there's no Part1 (can't be mapped to 639-1)

kosarko · 2020-11-24T10:34:07Z

Alright, sorry for the mix-up; some of those were Part2B codes (e.g. ger instead of deu, baq instead of eus, etc.). I'm now using https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab as the authoritative source of the codes.
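A minimal sketch of this kind of normalization, assuming the tab-separated layout of iso-639-3.tab with the header columns `Id`, `Part2B`, `Part2T` and `Part1` (check the header of the downloaded file before relying on these names). The three rows are typed in by hand only to keep the example self-contained; `build_index` and `to_iso639_3` are hypothetical helper names, not part of any library or of the script used in this PR.

```python
import csv
import io

# Hand-typed excerpt in the assumed iso-639-3.tab layout (tab-separated,
# header row). A real run would read the downloaded file instead.
SAMPLE_TAB = (
    "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    "deu\tger\tdeu\tde\tI\tL\tGerman\t\n"
    "eus\tbaq\teus\teu\tI\tL\tBasque\t\n"
    "lzh\t\t\t\tI\tH\tLiterary Chinese\t\n"
)


def build_index(tab_text):
    """Map every known code (639-1, 639-2/B, 639-2/T, 639-3) to its 639-3 Id."""
    index = {}
    reader = csv.DictReader(
        io.StringIO(tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        for column in ("Id", "Part2B", "Part2T", "Part1"):
            code = row[column]
            if code:  # empty cell: the language has no code in that part
                index[code] = row["Id"]
    return index


def to_iso639_3(codes, index):
    """Return sorted, de-duplicated 639-3 codes plus any unrecognized input."""
    known = sorted({index[c] for c in codes if c in index})
    unknown = sorted({c for c in codes if c not in index})
    return known, unknown


index = build_index(SAMPLE_TAB)
known, unknown = to_iso639_3(["ger", "deu", "baq", "eu", "lzh", "xx"], index)
print(known)    # ['deu', 'eus', 'lzh']
print(unknown)  # ['xx']
```

Collapsing everything onto the `Id` column removes the 639-2/B and 639-2/T "duplicates" in one step and keeps codes such as lzh that have no Part1 entry.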

emanueldima · 2020-11-24T11:41:26Z

you should be able to test the changes here: https://beta-switchboard.clarin.eu/tools

kosarko · 2020-11-24T13:39:42Z

Seems ok I'd say.

Let me reiterate the question about supported languages, you wrote:

Each ISO 639-1 code is converted to ISO 639-3 using this table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Only the ISO 639-1 and -3 codes are used, all the other language codes are removed.

So if there is a language that does not have an iso 639-1 code (such as lzh, orv, pcm) it won't be possible to select it in the switchboard ui, correct?

emanueldima · 2020-11-25T08:59:51Z

Not 639-1 but 639-3. If the specified language code is not a 639-3 code listed in the Wikipedia table above, the user cannot select it.

stranak · 2020-11-25T09:19:08Z

Sorry if I am a bit confused, I will ask again just to make sure I understand it correctly: We should provide ISO 639-3 (or -1) codes, but only languages defined in 639-1 are acceptable, all the other ISO 639-3 codes are ignored?

If that is the case, what is the motivation? And are there any plans to change it? It is clear that we now have models not showing up (tested in the beta) and there will only be more. I think especially with SSH use cases in mind we cannot limit ourselves to languages recognised in 639-1.

PS: I am fine with many small or old languages not having models that will recognise them automatically. But it should be still possible to find them and select them by name.

emanueldima · 2020-11-25T09:27:57Z

The Switchboard tries to work with the largest set of languages available, so it works internally with 639-3. If a tool works also with 639-3 and uses 639-3 language codes then all should be fine.

~~We may conflict over the definition of 'what is 639-3'. I used this wikipedia table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Is it incomplete?~~

Edit: Apparently I need more sleep. You're right, I am restricting artificially the languages to 639-1. This should be treated as a bug and will be fixed.

stranak · 2020-11-25T09:32:52Z

> We may conflict over the definition of 'what is 639-3'. I used this wikipedia table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Is it incomplete?

Ah, I see now. Yes, that is very much incomplete. 639-3 is a vastly broader set than 639-1. Many languages are simply not listed in 639-1; for many others there is only a macrolanguage (like Chinese or German, even for mutually unintelligible varieties); there are almost no historical languages, no sign languages … For example, look at "German" across all ISO 639 code tables here.

emanueldima · 2020-11-25T09:42:08Z

Thanks for pointing this out. This requires a fix in the UI, so it may take until the next release before the language can be selected correctly. I suggest keeping the current UDPipe definition; after the fix the languages will become selectable.

clarin-eric/switchboard#150

emanueldima · 2020-11-30T18:14:57Z

If you can generate UDPipe metadata for the Switchboard automatically, you may be interested in our new repository: https://github.com/clarin-eric/switchboard-tool-registry-contrib (see the README). Please note that we (clarin-eric, Switchboard developers) cannot maintain the code in there; we only host it.

emanueldima · 2020-12-01T10:20:13Z

A new version (2.2.3) has just been released, which should allow users to select any language. Please report any issue you find with it.

stranak · 2020-12-01T10:26:04Z

When I try a text in Classical (Literary) Chinese, it doesn't work with UDPipe. Could there be a code mismatch? When I say the text is "Chinese" in Switchboard, it is sent to UDPipe, proper model selected and executed. When I pick "Literary Chinese" in Switchboard, it redirects to this URL. Model is "undefined" and the text is also not copied into input. https://lindat.mff.cuni.cz/services/udpipe/?data=https%3A%2F%2Fswitchboard.clarin.eu%2Fapi%2Fstorage%2Fd5336eeb-416e-4759-a290-34650dff9af2&model=undefined
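One way to check what the redirect actually carries is to decode its query string. A small sketch using only Python's standard library on the URL quoted above:

```python
from urllib.parse import parse_qs, urlsplit

# The redirect URL from the comment above, split only for line length.
url = (
    "https://lindat.mff.cuni.cz/services/udpipe/"
    "?data=https%3A%2F%2Fswitchboard.clarin.eu%2Fapi%2Fstorage%2F"
    "d5336eeb-416e-4759-a290-34650dff9af2&model=undefined"
)

# parse_qs percent-decodes the values and returns a list per parameter.
params = parse_qs(urlsplit(url).query)
model = params["model"][0]
data = params["data"][0]
print(model)  # undefined
print(data)
```

The `model` parameter holds the literal string `undefined`, which is how a missing value typically appears when interpolated into a URL by JavaScript. That would be consistent with a missing language-to-model mapping, though the thread itself does not confirm the cause.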

emanueldima · 2020-12-02T11:22:53Z

The last changes have not been pushed to production yet, this will do it: #83

You should be able to test the current up-to-date version in our beta: https://beta-switchboard.clarin.eu

andmor- · 2020-12-02T12:22:55Z

Should be working now in production as well

stranak · 2020-12-02T12:28:21Z

Works in the beta. Production, on the other hand, is giving me "Error Bad Gateway" when I try to drag and drop an input file.

andmor- · 2020-12-02T12:30:02Z

Sorry my bad! The server got stuck during reload and I did not notice. It is all good again now.

stranak · 2020-12-02T12:32:19Z

Yes, works fine now. It doesn't detect Literary Chinese automatically, but when selected, it works as expected.

Thanks, guys!

Update UDPipe.json

1c2a143

Updating language list based on updated models.txt https://raw.githubusercontent.com/ufal/udpipe/932c63f59d381c4d5bf5f521e4bbc60f1c48e412/releases/models.txt

Update UDPipe.json

4161ff4

removing duplicate codes and lzh: classical_chinese orv: old_russian pcm: naija (Nigerian Pidgin?)

Update UDPipe.json

8fb41c7

iso639-3 code instead of a mix of part2B and part2T changing lang encoding as for lzh, orv and pcm there's no Part1 (can't be mapped to 639-1)

emanueldima merged commit ef0be4a into clarin-eric:master Nov 24, 2020

kosarko mentioned this pull request Oct 4, 2021

Integrate with CLARIN LR Switchboard ufal/nametag#18

Closed
