Update UDPipe.json - new languages #62

Merged
merged 3 commits into from
Nov 24, 2020

Conversation

kosarko
Contributor

@kosarko kosarko commented Oct 16, 2020

Updating language list based on updated models.txt https://raw.githubusercontent.com/ufal/udpipe/932c63f59d381c4d5bf5f521e4bbc60f1c48e412/releases/models.txt

@stranak @foxik Does this seem correct - there were more models added, but very few new languages.

@emanueldima @andmor- Is it ok to list "duplicate" codes such as wel and cym or fra and fre?

@emanueldima
Collaborator

@kosarko What do you mean by "duplicating"? The language list should contain ISO 639-3 language codes, as specified in the documentation (https://github.com/clarin-eric/switchboard-doc/blob/master/documentation/ToolDescriptionSpec.md). I didn't find the "vsn" for instance in the wikipedia list, can you tell me what language it is?

@foxik

foxik commented Oct 19, 2020

The vsn is a proposed code for Vedic Sanskrit, see for example http://multitree.org/codes/vsn or https://en.wikipedia.org/wiki/Vedic_Sanskrit; the proposal for the vsn itself is at https://iso639-3.sil.org/request/2011-041.

@kosarko The models.txt from UDPipe contains 639-1, 639-2/T and 639-2/B codes. If 639-3 is to be used in this place, you need to keep only the 639-2/T ones.

@kosarko Also, the UD 2.6 models are in UDPipe 2 and unfortunately at a different place than the models.txt; they are currently here: https://github.com/ufal/udpipe/blob/udpipe-2/models-2.6/models_list.sh

And yes, getting a list of ISO codes directly from the service (with separate entries for -1, -2/T, -2/B and -3) instead of weird config files would be much better -- it is planned ;-)

@emanueldima
Collaborator

@kosarko @foxik Is this ready to merge from your perspective?

@kosarko
Contributor Author

kosarko commented Nov 18, 2020

@emanueldima I don't think it's ready yet; I need to check that everything from models_list.sh is included. Also, we should probably remove the 639-2/B codes, right?

removing duplicate codes

and
lzh: classical_chinese
orv: old_russian
pcm: naija (Nigerian Pidgin?)
@kosarko
Contributor Author

kosarko commented Nov 23, 2020

@emanueldima should be mostly ok now; I did remove:

lzh: classical_chinese
orv: old_russian
pcm: naija (Nigerian Pidgin?)

My script doesn't recognize these codes (maybe an outdated code table). @emanueldima what about Switchboard? Can it cope with these? If yes, I'll add them back; if not, it should be safe to merge.

@stranak

stranak commented Nov 23, 2020 via email

@kosarko
Contributor Author

kosarko commented Nov 23, 2020 via email

@emanueldima
Collaborator

The Switchboard uses a library to recognize the language automatically. The list of recognized languages is here: https://github.com/optimaize/language-detector#language-support

Each ISO 639-1 code is converted to ISO 639-3 using this table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Only the ISO 639-1 and -3 codes are used, all the other language codes are removed.

You have removed important languages like deu, ell, fra and added baq, which is not ISO 639-3. Is this what you want?

iso639-3 code instead of a mix of part2B and part2T
changing lang encoding as for lzh, orv and pcm there's no Part1 (can't be mapped to 639-1)
@kosarko
Contributor Author

kosarko commented Nov 24, 2020

Alright, sorry for the mix-up; some of those were Part2B codes (e.g. ger instead of deu, baq instead of eus). I am now using https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab as the authoritative source of the codes.
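A minimal sketch of how such a normalization could be scripted, not the script actually used for this PR. It assumes the tab-separated layout of iso-639-3.tab with the columns `Id`, `Part2B`, `Part2T` and `Part1`; the header spelling and the embedded sample rows are assumptions to check against the downloaded file, and `build_lookup` and `to_iso639_3` are hypothetical helper names.

```python
import csv
import io

# Hypothetical excerpt in the layout of iso-639-3.tab (tab-separated).
# The header spelling is an assumption; verify it against the real file.
SAMPLE_TAB = (
    "Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n"
    "deu\tger\tdeu\tde\tI\tL\tGerman\t\n"
    "eus\tbaq\teus\teu\tI\tL\tBasque\t\n"
    "fra\tfre\tfra\tfr\tI\tL\tFrench\t\n"
    "cym\twel\tcym\tcy\tI\tL\tWelsh\t\n"
    "lzh\t\t\t\tI\tH\tLiterary Chinese\t\n"
    "orv\t\t\t\tI\tH\tOld Russian\t\n"
    "pcm\t\t\t\tI\tL\tNigerian Pidgin\t\n"
)


def build_lookup(tab_text):
    """Map every code in the table (639-1, 639-2/B, 639-2/T, 639-3) to its 639-3 Id."""
    lookup = {}
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    for row in reader:
        for column in ("Id", "Part2B", "Part2T", "Part1"):
            code = row[column]
            if code:  # empty cells mean "no code in that part of the standard"
                lookup[code] = row["Id"]
    return lookup


def to_iso639_3(codes, lookup):
    """Return (sorted unique 639-3 codes, sorted codes missing from the table)."""
    known = sorted({lookup[c] for c in codes if c in lookup})
    unknown = sorted({c for c in codes if c not in lookup})
    return known, unknown


lookup = build_lookup(SAMPLE_TAB)
mixed = ["ger", "deu", "de", "baq", "wel", "cym", "fre", "fra", "lzh", "vsn"]
known, unknown = to_iso639_3(mixed, lookup)
print(known)    # 639-3 codes only, duplicates collapsed
print(unknown)  # codes that need a manual decision, e.g. proposed codes
```

Codes that are absent from the table, such as the proposed vsn, are reported separately so they are not silently dropped.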

@emanueldima emanueldima merged commit ef0be4a into clarin-eric:master Nov 24, 2020
@emanueldima
Collaborator

you should be able to test the changes here: https://beta-switchboard.clarin.eu/tools

@kosarko
Contributor Author

kosarko commented Nov 24, 2020

Seems ok I'd say.

Let me reiterate the question about supported languages, you wrote:

Each ISO 639-1 code is converted to ISO 639-3 using this table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Only the ISO 639-1 and -3 codes are used, all the other language codes are removed.

So if there is a language that does not have an iso 639-1 code (such as lzh, orv, pcm) it won't be possible to select it in the switchboard ui, correct?

@emanueldima
Collaborator

Not 639-1 but 639-3. If the specified language code is not a 639-3 code listed in the Wikipedia table above, the user cannot select it.

@stranak

stranak commented Nov 25, 2020

Sorry if I am a bit confused, I will ask again just to make sure I understand it correctly: We should provide ISO 639-3 (or -1) codes, but only languages defined in 639-1 are acceptable, all the other ISO 639-3 codes are ignored?

If that is the case, what is the motivation? And are there any plans to change it? It is clear that we now have models not showing up (tested in the beta) and there will only be more. I think especially with SSH use cases in mind we cannot limit ourselves to languages recognised in 639-1.

PS: I am fine with many small or old languages not having models that will recognise them automatically. But it should be still possible to find them and select them by name.

@emanueldima
Collaborator

emanueldima commented Nov 25, 2020

The Switchboard tries to work with the largest set of languages available, so it works internally with 639-3. If a tool works also with 639-3 and uses 639-3 language codes then all should be fine.

We may conflict over the definition of 'what is 639-3'. I used this wikipedia table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Is it incomplete?

Edit: Apparently I need more sleep. You're right, I am artificially restricting the languages to 639-1. This should be treated as a bug and will be fixed.
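A small sketch of the effect described here, for illustration only; it is not Switchboard's code. The mapping `ISO1_TO_ISO3` is a hypothetical subset of a 639-1 to 639-3 table, `ISO3_CODES` is a hypothetical subset of the full 639-3 code set, and both function names are made up.

```python
# Hypothetical subset of a 639-1 -> 639-3 mapping.
ISO1_TO_ISO3 = {"de": "deu", "el": "ell", "fr": "fra", "eu": "eus", "cy": "cym"}

# Hypothetical subset of the full 639-3 code set.
ISO3_CODES = {"deu", "ell", "fra", "eus", "cym", "lzh", "orv", "pcm"}


def selectable_restricted(tool_codes):
    """Keep only 639-3 codes that some 639-1 code maps to (the restricted behaviour)."""
    reachable = set(ISO1_TO_ISO3.values())
    return [code for code in tool_codes if code in reachable]


def selectable_unrestricted(tool_codes):
    """Keep every code that is a valid 639-3 identifier."""
    return [code for code in tool_codes if code in ISO3_CODES]


tool_codes = ["deu", "fra", "lzh", "orv", "pcm"]
restricted = selectable_restricted(tool_codes)
unrestricted = selectable_unrestricted(tool_codes)
dropped = sorted(set(unrestricted) - set(restricted))
print(restricted)
print(dropped)  # languages without a 639-1 code disappear under the restriction
```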

@stranak

stranak commented Nov 25, 2020

We may conflict over the definition of 'what is 639-3'. I used this wikipedia table: https://en.m.wikipedia.org/wiki/List_of_ISO_639-1_codes. Is it incomplete?

Ah, I see now. Yes, that is very much incomplete. 639-3 is a vastly broader set than 639-1. Many languages are simply not listed in 639-1; for many others there is only a macrolanguage (like Chinese or German, even for mutually unintelligible varieties); there are almost no historical languages, no sign languages … For example, look at "German" across all ISO 639 code tables here.

@emanueldima
Collaborator

Thanks for pointing this out. This requires fixing the UI, so it may take until the next release before the language can be selected correctly. I suggest keeping the current UDPipe definition; after the fix the languages will become selectable.

clarin-eric/switchboard#150

@emanueldima
Collaborator

If you can generate UDPipe metadata for Switchboard automatically, you may be interested in our new repository: https://github.com/clarin-eric/switchboard-tool-registry-contrib (see the README). Please note that we (clarin-eric, Switchboard developers) cannot maintain the code in there; we only host it.

@emanueldima
Collaborator

A new version (2.2.3) has just been released, which should allow users to select any language. Please report any issue you find with it.

@stranak

stranak commented Dec 1, 2020

When I try a text in Classical (Literary) Chinese, it doesn't work with UDPipe. Could there be a code mismatch? When I say the text is "Chinese" in Switchboard, it is sent to UDPipe, proper model selected and executed. When I pick "Literary Chinese" in Switchboard, it redirects to this URL. Model is "undefined" and the text is also not copied into input. https://lindat.mff.cuni.cz/services/udpipe/?data=https%3A%2F%2Fswitchboard.clarin.eu%2Fapi%2Fstorage%2Fd5336eeb-416e-4759-a290-34650dff9af2&model=undefined
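A quick way to inspect such a redirect is to decode its query string. The sketch below uses only the Python standard library and the URL quoted above; treating the literal string "undefined" as a missing model is an assumption about how the value ended up in the link.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "https://lindat.mff.cuni.cz/services/udpipe/"
    "?data=https%3A%2F%2Fswitchboard.clarin.eu%2Fapi%2Fstorage%2F"
    "d5336eeb-416e-4759-a290-34650dff9af2&model=undefined"
)

# parse_qs percent-decodes values and returns a list per parameter.
params = parse_qs(urlsplit(url).query)
model = params.get("model", [""])[0]
data = params.get("data", [""])[0]

# Assumption: "undefined" means no model name was resolved for the language.
model_missing = model in ("", "undefined")

print(data)
print(model, model_missing)
```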

@emanueldima
Collaborator

The last changes have not been pushed to production yet, this will do it: #83

You should be able to test the current up-to-date version in our beta: https://beta-switchboard.clarin.eu

@andmor-
Member

andmor- commented Dec 2, 2020

Should be working now in production as well

@stranak

stranak commented Dec 2, 2020

Works in the beta. Production, on the other hand, is giving me "Error Bad Gateway" when I try to drag and drop an input file.

@andmor-
Member

andmor- commented Dec 2, 2020

Sorry my bad! The server got stuck during reload and I did not notice. It is all good again now.

@stranak

stranak commented Dec 2, 2020

Yes, works fine now. It doesn't detect Literary Chinese automatically, but when selected, it works as expected.

Thanks, guys!


5 participants