ISO 639-3 Language codes #339
Comments
Completely agreed. This was a poor choice at the beginning of the project. Unfortunately, the two-letter codes are now extremely prevalent throughout the documentation system, the data validation scripts, basically everywhere. Porting everything to the three-letter codes would be great IMHO, but I personally wouldn't want to be the one having to do it. :) As a first approximation, we could start using the three-letter codes for any new language we add.
Would it be possible to set up an equivalence table somewhere that lists all the languages with three-letter codes and then (for backwards compatibility) links to the two-letter versions where those are still present? (And perhaps encourages people who are publishing on this to use the three-letter codes?)
Producing such compatibility links would need changes very deep in the guts of the documentation system. It's totally doable, but someone would then need to do it and make sure nothing breaks. Right now, I at least cannot promise this.
I think we are following the common practice that ISO 639-1 codes are used for languages for which they exist, and three-letter codes otherwise. It was even sort-of standardized in the Internet RFC 4646 (http://www.ietf.org/rfc/rfc4646.txt), but that RFC takes the three-letter codes from ISO 639-2/T, while ISO 639-3, which came later, is IMHO better. Still, I like using 639-1 for the "big languages": those codes are much more common and better known. I agree that changing the practice retroactively would be a lot of work with no clear benefit. And I would strongly oppose changing the practice for newly added languages without changing it for all languages.
I didn't know that doing it this way was actually a common practice. :) Live and learn. So I guess the tentative conclusion is to leave this as is because (a) it's not a bug, it's a feature, and (b) the change would be very labour-intensive.
I learned something today, too! I still think it is more appropriate for a linguistics project to use three-letter codes (and not give preferential treatment to "big" languages). And I still think that linguistics publications should indicate which languages they are addressing using the three-letter codes. Perhaps UD could provide a (flat, text, not linked) table that maps all codes to the three-letter codes, so that someone wishing to do this in a publication could have something to refer to?
Such a list can be created easily and I agree that it would be potentially useful. I know I will not use it in my publications though. Not only because I like 639-1, but also because publications have page limits and tables are too large even now :-)
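A flat mapping table of the kind discussed above could be generated along these lines. This is only a minimal sketch, not anything from the UD repository: the dictionary is a hand-picked excerpt, and the names `ISO_639_1_TO_3`, `to_iso_639_3`, and `flat_table` are hypothetical.

```python
# Hypothetical excerpt of an ISO 639-1 -> ISO 639-3 table.
# A real table would cover every language in the project.
ISO_639_1_TO_3 = {
    "cs": "ces",
    "de": "deu",
    "en": "eng",
    "fi": "fin",
    "fr": "fra",
    "sv": "swe",
}


def to_iso_639_3(code: str) -> str:
    """Return the three-letter code for a project language code.

    Codes that are already three letters long are passed through
    unchanged; unknown two-letter codes raise KeyError.
    """
    code = code.lower()
    if len(code) == 3:
        return code
    return ISO_639_1_TO_3[code]


def flat_table(mapping: dict) -> str:
    """Render the mapping as a plain, tab-separated text table."""
    lines = ["ISO 639-1\tISO 639-3"]
    for two, three in sorted(mapping.items()):
        lines.append(f"{two}\t{three}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(flat_table(ISO_639_1_TO_3))
```

Passing three-letter codes through unchanged matches the mixed practice described in this thread, where languages without an ISO 639-1 code already use a three-letter one.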
There is a nice langcode conversion list in Wikipedia, so maybe we could just link there?
Okay. Good. Reopening and assigning to myself. I'll try to add that link to some reasonable place or even extract the codes for the languages we have and add them somewhere.
Thanks, Filip!
In reaction to #389 I have now added a script that creates symlinks with ISO 639-3 language codes.
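The script itself is not shown in this thread. As a rough illustration of the idea only, the sketch below creates three-letter aliases for two-letter directories. The directory layout (one directory per language code under a common root), the `CODES` excerpt, and the function name `add_iso_639_3_links` are all assumptions, not the actual UD implementation.

```python
import os
from pathlib import Path

# Hypothetical excerpt of an ISO 639-1 -> ISO 639-3 mapping.
CODES = {"cs": "ces", "en": "eng", "fi": "fin"}


def add_iso_639_3_links(root, mapping=CODES):
    """Create <root>/<three-letter> symlinks to <root>/<two-letter>.

    Assumes one directory per language code under `root`. Existing
    paths are never overwritten. Returns the link names that were
    created, in sorted order of the two-letter codes.
    """
    root = Path(root)
    created = []
    for two, three in sorted(mapping.items()):
        target = root / two
        link = root / three
        # Skip languages that are absent and links that already exist
        # (is_symlink also catches dangling links).
        if not target.is_dir() or link.exists() or link.is_symlink():
            continue
        # A relative target keeps the link valid if `root` is moved.
        os.symlink(two, link, target_is_directory=True)
        created.append(three)
    return created
```

Because existing paths are skipped, the function can be re-run safely after new languages are added.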
The current ISO standard for language codes is ISO 639-3, with its three-letter codes:
http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=39534
The codes can be found here:
http://www-01.sil.org/iso639-3
The two-letter codes do not scale to the world's 6000+ languages; a project with "universal" goals should use ones that do scale. In addition, work citing the UD treebanks tends to follow the codes that this project uses, leading to many publications with obsolete (and/or underinformative) language identifiers.