ISO 639-3 Language codes #339

emilymbender · 2016-09-13T14:10:06Z

The current ISO standard for language codes is the set of three-letter codes defined in ISO 639-3:

http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39534

The codes can be found here:

http://www-01.sil.org/iso639-3

The two-letter codes do not scale to the world's 6000+ languages; a project with "universal" goals should use ones that do scale. In addition, work citing the UD treebanks tends to follow the codes that this project uses, leading to many publications with obsolete (and/or underinformative) language identifiers.

fginter · 2016-09-13T14:18:15Z

Completely agreed. This was a poor choice at the beginning of the project. Unfortunately, the current two-letter codes are extremely prevalent right now throughout the documentation system, the data validation scripts, basically everywhere. Porting everything to the three-letter codes would be great IMHO, but I personally wouldn't want to be the one having to do it. :) As a first approximation, we could start using the three-letter codes for any new language we add.

emilymbender · 2016-09-13T14:20:27Z

Would it be possible to set up an equivalence table somewhere that lists all the languages with three-letter codes and then (for backwards compatibility) links to the two-letter versions where those are still present? (And perhaps encourages people who are publishing on this to use the three-letter codes?)

fginter · 2016-09-13T14:23:39Z

Producing such compatibility links would need changes very deep in the guts of the documentation system. It's totally doable, but someone would then need to do it and make sure nothing breaks. Right now, at least I cannot promise this.

dan-zeman · 2016-09-13T14:25:06Z

I think we are following the common practice that ISO 639-1 codes are used for languages for which they exist, and three-letter codes otherwise. It was even sort-of standardized in the Internet RFC 4646 (http://www.ietf.org/rfc/rfc4646.txt), but that RFC takes the three-letter codes from ISO 639-2/T, while ISO 639-3, which came later, is IMHO better. But I like using the 639-1 codes for the "big languages"; they are much more common and better known.

I agree that changing the practice retroactively would be a lot of work with no clear benefit. And I would strongly oppose changing the practice for newly added languages without changing it for all languages.
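
For readers who want the convention spelled out, here is a minimal sketch of the "two-letter code where one exists, three-letter code otherwise" rule. The dictionary and the function name `preferred_code` are hypothetical and cover only a handful of languages for illustration; a real table would be derived from the ISO 639-3 code tables.

```python
# Illustrative subset only: ISO 639-3 code -> ISO 639-1 code (None if none exists).
ISO639_3_TO_1 = {
    "eng": "en",   # English
    "fin": "fi",   # Finnish
    "ces": "cs",   # Czech
    "grc": None,   # Ancient Greek has no two-letter code
    "cop": None,   # Coptic has no two-letter code
}


def preferred_code(iso639_3: str) -> str:
    """Return the two-letter code if one exists, else the three-letter code."""
    two_letter = ISO639_3_TO_1.get(iso639_3)
    return two_letter if two_letter else iso639_3


print(preferred_code("fin"))
print(preferred_code("cop"))
```

Codes missing from the table fall through unchanged, which is probably the safest default for a sketch like this.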

fginter · 2016-09-13T14:28:48Z

I didn't know that was actually a common practice to do it this way. :) Live and learn. So I guess the tentative conclusion is to leave this as is because (a) it's not a bug it's a feature and (b) the change would be very labour-intensive.

emilymbender · 2016-09-13T14:33:02Z

I learned something today, too! I still think it is more appropriate for a linguistics project to use three-letter codes (and not give preferential treatment to "big" languages). And I still think that linguistics publications should indicate which languages they are addressing using the three-letter codes --- perhaps UD could provide a (flat, text, not linked) table that maps all codes into the three-letter codes so that someone wishing to do this in a publication could have something to refer to?

dan-zeman · 2016-09-13T14:44:29Z

Such a list can be created easily and I agree that it would be potentially useful.

I know I will not use it in my publications though. Not only because I like 639-1, but also because publications have page limits and tables are too large even now :-)
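
A flat text table of the kind discussed above could be generated along these lines. This is only a sketch: the mapping `UD_TO_ISO639_3` is a small hand-picked sample rather than the project's actual language list, and `render_table` is a hypothetical helper name.

```python
# Illustrative sample: code currently used by the project -> ISO 639-3 code.
UD_TO_ISO639_3 = {
    "cs": "ces",
    "en": "eng",
    "fi": "fin",
    "grc": "grc",  # already a three-letter code, maps to itself
}


def render_table(mapping: dict) -> str:
    """Render a tab-separated, plain-text table sorted by the project code."""
    lines = ["UD code\tISO 639-3"]
    for ud_code, iso_code in sorted(mapping.items()):
        lines.append(f"{ud_code}\t{iso_code}")
    return "\n".join(lines)


print(render_table(UD_TO_ISO639_3))
```

Tab-separated output keeps the table flat and easy to paste into a paper's appendix or to read with standard tools.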

dan-zeman · 2016-09-13T15:01:56Z

There is a nice langcode conversion list in Wikipedia, so maybe we could just link there?

http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

fginter · 2016-09-14T06:55:38Z

Okay. Good. Reopening and assigning to myself. I'll try to add that link to some reasonable place or even extract the codes for the languages we have and add them somewhere.

emilymbender · 2016-09-14T14:08:04Z

Thanks, Filip!

fginter · 2016-12-29T12:27:59Z

In reaction to #389 I now added a script which creates symlinks with ISO 639-3 language codes.
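
The script itself is not shown in this thread. As a rough sketch of the idea only, assuming a directory whose entries are named by the existing short codes, something like the following would add ISO 639-3 aliases as relative symlinks. `CODE_MAP` and `link_iso639_3` are hypothetical names, and the mapping is an illustrative subset.

```python
import os
from pathlib import Path

# Illustrative subset: existing short code -> ISO 639-3 code.
CODE_MAP = {"cs": "ces", "en": "eng", "fi": "fin"}


def link_iso639_3(root: Path, code_map: dict) -> list:
    """Create <root>/<iso639-3> symlinks pointing at <root>/<short code>.

    Entries whose target is missing, or whose link name is already taken,
    are skipped. Returns the list of links that were created.
    """
    created = []
    for short, iso in sorted(code_map.items()):
        target = root / short
        link = root / iso
        if target.exists() and not os.path.lexists(link):
            # Relative link, so the tree stays valid if root is moved.
            os.symlink(short, link, target_is_directory=target.is_dir())
            created.append(link)
    return created
```

Running it twice is harmless, since existing links are left alone; on Windows, creating symlinks may require extra privileges.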

fginter closed this as completed Sep 13, 2016

fginter reopened this Sep 14, 2016

fginter self-assigned this Sep 14, 2016

dan-zeman added the b:website label Nov 18, 2016

dan-zeman added this to the universal v2 milestone Nov 18, 2016

spyysalo modified the milestones: lg-specific v2, universal v2 Dec 1, 2016

fginter mentioned this issue Dec 28, 2016

language codes in treebank subdirs #389

Closed

fginter closed this as completed Dec 29, 2016
