ISO 639-3 Language codes #339

Closed
emilymbender opened this issue Sep 13, 2016 · 11 comments

@emilymbender

The current ISO standard for language codes is the set of three-letter codes defined in ISO 639-3:

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39534

The codes can be found here:

http://www-01.sil.org/iso639-3

The two-letter codes do not scale to the world's 6000+ languages; a project with "universal" goals should use ones that do scale. In addition, work citing the UD treebanks tends to follow the codes that this project uses, leading to many publications with obsolete (and/or underinformative) language identifiers.

@fginter
Member

fginter commented Sep 13, 2016

Completely agreed. This was a poor choice at the beginning of the project. Unfortunately, the two-letter codes are currently pervasive throughout the documentation system, the data validation scripts, basically everywhere. Porting everything to the three-letter codes would be great IMHO, but I personally wouldn't want to be the one having to do it. :) As a first approximation, we could start using the three-letter codes for any new language we add.

@emilymbender
Author

Would it be possible to set up an equivalence table somewhere that lists all the languages with three-letter codes and then (for backwards compatibility) links to the two-letter versions where those are still present? (And perhaps encourages people who are publishing on this to use the three-letter codes?)

@fginter
Member

fginter commented Sep 13, 2016

Producing such compatibility links would need changes very deep in the guts of the documentation system. It's totally doable, but someone would then need to do it and make sure nothing breaks. Right now, I at least cannot promise this.

@dan-zeman
Member

dan-zeman commented Sep 13, 2016

I think we are following the common practice that ISO 639-1 codes are used for languages for which they exist, and three-letter codes otherwise. It was even sort of standardized in the Internet RFC 4646 (http://www.ietf.org/rfc/rfc4646.txt), but that RFC takes the three-letter codes from ISO 639-2/T, while ISO 639-3, which came later, is IMHO better. Still, I like using the 639-1 codes for the "big languages"; they are much more common and better known.

I agree that changing the practice retroactively would be a lot of work with no clear benefit. And I would strongly oppose changing the practice for newly added languages without changing it for all languages.
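The convention described in this comment (use the ISO 639-1 code where one exists, otherwise a three-letter code) can be sketched in a few lines of Python. This is only an illustration: the `SAMPLE_LANGUAGES` table and the `preferred_code` helper are hypothetical names, not part of the UD tooling, and the table covers just a handful of languages.

```python
# Hypothetical sample: language name -> (ISO 639-1 code or None, ISO 639-3 code).
# Ancient Greek and Gothic have no ISO 639-1 code.
SAMPLE_LANGUAGES = {
    "English": ("en", "eng"),
    "Finnish": ("fi", "fin"),
    "Czech": ("cs", "ces"),
    "Ancient Greek": (None, "grc"),
    "Gothic": (None, "got"),
}


def preferred_code(iso639_1, iso639_3):
    """Return the two-letter code when one exists, else the three-letter code."""
    return iso639_1 if iso639_1 else iso639_3


# Apply the convention to every language in the sample table.
codes = {name: preferred_code(*pair) for name, pair in SAMPLE_LANGUAGES.items()}
```

Under this convention the "big" languages keep their short codes, while languages without an ISO 639-1 entry fall back to three letters.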

@fginter
Member

fginter commented Sep 13, 2016

I didn't know it was actually common practice to do it this way. :) Live and learn. So I guess the tentative conclusion is to leave this as is because (a) it's not a bug, it's a feature, and (b) the change would be very labour-intensive.

@fginter fginter closed this as completed Sep 13, 2016
@emilymbender
Author

I learned something today, too! I still think it is more appropriate for a linguistics project to use three-letter codes (and not give preferential treatment to "big" languages). And I still think that linguistics publications should indicate which languages they are addressing using the three-letter codes. Perhaps UD could provide a (flat, text, not linked) table that maps all codes to the three-letter codes, so that someone wishing to do this in a publication would have something to refer to?
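A flat text table of the kind requested here could be generated along the following lines. This is a minimal sketch under stated assumptions: `TWO_TO_THREE` is a hypothetical, hand-written sample rather than a complete ISO 639 mapping, the helper names are invented for illustration, and any code that is already three letters long is assumed to be a valid ISO 639-3 code.

```python
# Hypothetical sample mapping from ISO 639-1 to ISO 639-3 codes.
TWO_TO_THREE = {"cs": "ces", "en": "eng", "fi": "fin"}


def to_iso639_3(code):
    """Map a project code to ISO 639-3.

    Assumption: three-letter project codes are already ISO 639-3 codes.
    """
    if len(code) == 3:
        return code
    return TWO_TO_THREE[code]  # raises KeyError for codes missing from the sample


def render_table(project_codes):
    """Render a tab-separated table: project code, then ISO 639-3 code."""
    lines = ["code\tiso639-3"]
    for code in sorted(project_codes):
        lines.append(f"{code}\t{to_iso639_3(code)}")
    return "\n".join(lines)


table = render_table(["fi", "grc", "en"])
```

A plain tab-separated file like this can be cited or pasted into a publication without depending on the documentation system's linking machinery.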

@dan-zeman
Member

Such a list can be created easily and I agree that it would be potentially useful.

I know I will not use it in my publications though. Not only because I like 639-1, but also because publications have page limits and tables are too large even now :-)

@dan-zeman
Member

There is a nice langcode conversion list in Wikipedia, so maybe we could just link there?

http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

@fginter fginter reopened this Sep 14, 2016
@fginter fginter self-assigned this Sep 14, 2016
@fginter
Member

fginter commented Sep 14, 2016

Okay. Good. Reopening and assigning to myself. I'll try to add that link to some reasonable place or even extract the codes for the languages we have and add them somewhere.

@emilymbender
Author

Thanks, Filip!

@fginter
Member

fginter commented Dec 29, 2016

In reaction to #389, I have now added a script that creates symlinks with ISO 639-3 language codes.
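The script itself is not shown in this thread. As a rough idea of what such symlinking can look like, here is a minimal sketch that assumes one directory per two-letter code under a common root; the directory layout, the `TWO_TO_THREE` sample mapping, and the function name are all assumptions made for illustration, not the actual UD script.

```python
import os

# Hypothetical sample mapping from ISO 639-1 to ISO 639-3 codes.
TWO_TO_THREE = {"cs": "ces", "en": "eng", "fi": "fin"}


def create_iso639_3_symlinks(root, mapping=TWO_TO_THREE):
    """Create <root>/<three-letter> symlinks pointing at <root>/<two-letter>.

    Returns the list of link names that were created. Existing paths are
    left untouched, so the function can be re-run safely.
    """
    created = []
    for two, three in sorted(mapping.items()):
        target = os.path.join(root, two)
        link = os.path.join(root, three)
        if os.path.isdir(target) and not os.path.lexists(link):
            # A relative target keeps the link valid if the root is moved.
            os.symlink(two, link, target_is_directory=True)
            created.append(three)
    return created
```

With links like these, both the old two-letter paths and the ISO 639-3 paths resolve to the same content, which gives backwards compatibility without renaming anything.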

@fginter fginter closed this as completed Dec 29, 2016