proposal: allow extended ordinal suffix terms in locale #66

Open
fbennett opened this Issue Aug 2, 2011 · 65 comments

@fbennett
Member

fbennett commented Aug 2, 2011

This is a placeholder, to be replaced by a description of Sylvester's strategy for flexible ordinal assignment.

@fbennett

Member

fbennett commented Aug 2, 2011

A patch covering this ticket can be found here [link updated].

@rmzelle

Member

rmzelle commented Sep 1, 2011

Would that be the description at http://xbiblio-devel.2463403.n2.nabble.com/schema-bug-td6310661i20.html#a6316075 ? Also, I found some documentation about language-specific ordinal usage at http://en.wikipedia.org/wiki/Ordinal_indicator .

@rmzelle

Member

rmzelle commented Sep 2, 2011

Some more documentation on ordinal suffixes:

Chinese
http://typophile.com/node/42577#comment-262561
http://mandarin.about.com/od/lessons/a/ordinals.htm
Uses a fixed prefix!

Dutch
http://taaladvies.net/taal/advies/vraag/2/
http://www.let.ru.nl/ans/e-ans/07/03/01/body.html
Two systems are in use.
Difficult one:
"ste" for 1, 8, 20 and up
"de" for 2, 4-7, 9-19
Easy one:
"e" for everything

Spanish
http://en.wikipedia.org/wiki/Ordinal_indicator#Italian.2C_Portuguese.2C_and_Spanish
http://typophile.com/node/42577#comment-262249
Normally -o for masculine, but 1.er, 3.er, 11.er, 13.er, 21.er, ..., 123.er, etc.

The second link also highlights that many languages have distinct ordinals for singular and plural. I'm not completely sure, but I think that in the context of CSL we only need the singular versions. An example of the Spanish variants:

Segundo (masculine singular)
Segunda (feminine singular)
Segundos (masculine plural)
Segundas (feminine plural)

Swedish
http://en.wikipedia.org/wiki/Ordinal_indicator#Swedish
"The general rule is that :a (for 1 and 2) or :e (for all other numbers, except 101:a, 42:a, et cetera) is appended to the numeral."

@rmzelle

Member

rmzelle commented Sep 2, 2011

In the xbiblio thread, it is argued that the proposed solution could be included in CSL 1.0.1. While this is possible from a schema validation standpoint, I'm not sure there wouldn't be problems in practice.

A CSL 1.0.1 processor should be able to handle CSL 1.0 styles and locale files. So what happens if we adopt the proposed solution, and the CSL 1.0.1 processor encounters a CSL 1.0 style that redefines the ordinal-01 through ordinal-04 terms (e.g. http://www.zotero.org/styles/mcgill-guide-v7/dev ) or any CSL 1.0 locale file?

@fbennett

Member

fbennett commented Sep 2, 2011

True, they would throw the old suffixes for the exceptions at higher values. Transitioning will take a little work at the implementation level. Possible approaches I can think of:

  • Just make the change (at 1.0.1 or 1.1) and let styles that redefine the ordinal suffixes catch up.
  • Special-case ordinal handling in the English locale, but log a warning.
  • Somehow require style-level locales to redefine all ordinal terms in the corresponding file locale (presumably for 1.1, since the change would be backward-incompatible).

The third solution must be hard if not impossible to implement in validation. If the second solution is adopted, maybe the new handling could be offered in 1.0.1 ... ?

@rmzelle

Member

rmzelle commented Sep 3, 2011

Case: a CSL 1.0.1 compatible processor supporting the new ordinal assignment scheme encounters a CSL 1.0 locale file, which only includes definitions for ordinal-01 through 04. Then what? Should the processor support both schemes, and figure out which one to use based on the locale files?

@rmzelle

Member

rmzelle commented Oct 26, 2011

One more possible upgrade path for CSL 1.0.1: we add new terms, "ordinal-suffix-dd" ("dd" representing two digits) to replace the "ordinal-dd" ones (we can entirely remove the latter from the schema in CSL 1.1). When one or more "ordinal-suffix-dd" terms are present in either the locale file or style, the new ordinal numbering scheme is used. Otherwise the CSL processor falls back to the old scheme.

(and, I like "ordinal-suffix-dd" better than "ordinal-dd" anyway)

@rmzelle

Member

rmzelle commented Dec 1, 2011

The least disruptive way to transition to Sylvester's ordinal-suffix scheme might be:

  • make the new scheme the default
  • if the style has no "default-locale", or if "default-locale" is set to an English locale ("en", "en-*"), and "ordinal-00" is undefined (in both the style itself and the called locale file(s)), use the original CSL 1.0 scheme to assign ordinal suffixes: "ordinal-01" is used for numbers ending on a 1 (except those ending on 11), "ordinal-02" for those ending on a 2 (except those ending on 12), "ordinal-03" for those ending on a 3 (except those ending on 13) and "ordinal-04" for all other numbers.
  • for all other default locales: if "ordinal-00" is undefined, use the value of "ordinal-04" as the default ordinal suffix.
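
For reference, the original CSL 1.0 assignment and the fallback condition from the list above could be sketched roughly as follows. This is only an illustration: the function names and the English suffix table are assumptions made for the example, not code from any existing processor.

```python
# Illustrative sketch of the CSL 1.0 ordinal assignment; all names are hypothetical.

def csl10_term(n: int) -> str:
    """Pick the CSL 1.0 ordinal term name for a non-negative integer."""
    last_two, last = n % 100, n % 10
    if last == 1 and last_two != 11:
        return "ordinal-01"
    if last == 2 and last_two != 12:
        return "ordinal-02"
    if last == 3 and last_two != 13:
        return "ordinal-03"
    return "ordinal-04"


def use_csl10_scheme(default_locale, terms) -> bool:
    """Fallback condition: English (or unset) default locale and no ordinal-00 term."""
    is_english = (
        default_locale is None
        or default_locale == "en"
        or default_locale.startswith("en-")
    )
    return is_english and "ordinal-00" not in terms


# Assumed English suffixes, for demonstration only.
EN_10 = {"ordinal-01": "st", "ordinal-02": "nd", "ordinal-03": "rd", "ordinal-04": "th"}

for n in (1, 2, 3, 4, 11, 12, 13, 21, 112):
    print(f"{n}{EN_10[csl10_term(n)]}")
```
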
@inukshuk

Member

inukshuk commented Jun 20, 2012

I 'reopened' this discussion with Rintze by mail yesterday, but I think it is more productive if I post it here:

It would be extremely frustrating for citeproc-ruby users if Frank's patch would not be included in the schema definition for the imminent CSL release. Here is why:

In Ruby there are existing software packages that do an admirable job at ordinalizing numbers in multiple languages. Obviously, these packages cannot be configured using CSL locales; therefore, we have to set them aside and implement a CSL-compatible approach instead. However, the current specification flat-out forces implementers to follow the rules and exceptions of one arbitrary language: English. When I ordinalize a Dutch number, why should the CSL spec force me to use the value of ordinal-04 for the number 13, for instance? It is ridiculous for me to set aside ordinalizing packages that do an admirable job, under the pretext of better localization, when in fact the replacement cannot be localized.

That's why I tried to come up with a different solution that works in a number of languages for which I want to generate bibliographies and works with CSL locales. The proposed solution yields good results, is based on the decimal system and simply requires translators to be able to add more ordinal terms than 1, 2, 3, and 4. The one flaw of the proposed solution is that it implicitly treats the value of ordinal-00 as the general term – so it may be worthwhile to define a separate case (unless zero does not have an ordinal, in which case we could even keep ordinal-00).

So the proposed solution works admirably, but requires these additional ordinal terms. (Note that it will never change the current values of ordinal-01 through 04 unless these values were hacked, so to speak, to accommodate for the shortcoming of the current specification.) However, by adding the terms you automatically invalidate the CSL locale or style. As a result, citeproc-ruby cannot simply validate CSL styles or locales when loading them. Furthermore, citeproc-ruby has to manage the locale files separately. Changes (in either direction) become more complicated. All this is simply because the schema does not allow any additional ordinals even though adding them would not harm existing cite processor implementations at all!

The only alternative for citeproc-ruby is to implement the algorithm suggested by the specification. This is just silly and I will not do it. That a specification flat out forces developers to implement inferior algorithms is extremely frustrating. To refuse to make the specification more permissive in this regard only ensures that citeproc-ruby's locales will stay invalid and it achieves nothing. Adding Frank's patch on the other hand would allow citeproc-ruby to validate locales and styles again and it would not change anything else at all. If you are seriously concerned about compatibility issues, you can even leave the specification text as is: the algorithm still works as before, translators can still define ordinals 1,2, 3, and 4 (and some languages will still not be able to use the feature).

Of course I would prefer it if we go even further and also change the specification (not only the schema). As Rintze correctly points out above this could lead to some issues – however all these issues can be mitigated by either updating a locale or style or by the cite processor implementation; it would be possible to fix these problems, and there are a number of perfectly good suggestions in this thread. I don't think that it is the specification's purpose to tell the implementers exactly how to deal with these problems, but suggestions can help a great deal of course.

Anyway, this is all about enabling solutions. With the current specification it is impossible to ordinalize numbers in some languages. Frank's patch would make it possible.

And seriously? There is one current implementation that has real problems because of the specification; problems that cannot be solved. I think this is much worse than the potential problems that may arise in some unknown implementation (IIRC Andrea was perfectly fine with the suggestion) – especially since the specification leaves room for these problems to be solved. I honestly don't understand this.

Finally, if we come up with a better solution than the one proposed I'll be more than happy to implement it. But it has to make sense.

@rmzelle

Member

rmzelle commented Jun 20, 2012

In Ruby there are existing software packages that do an admirable job at ordinalizing numbers in multiple languages.

Which are those? They might be of interest to us.

a different solution that works in a number of languages

Which languages are those, exactly? I just have a hard time judging the merits of your algorithm without examples of the additional languages that we would be able to support.

Of course I would prefer it if we go even further and also change the specification (not only the schema). As Rintze correctly points out above this could lead to some issues – however all these issues can be mitigated by either updating a locale or style or by the cite processor implementation

If we adopt your scheme, we should update the specification. I suggested a possible migration path just above your post, but haven't received any feedback on it for months. Is that acceptable?

Again, to reiterate: I know the current algorithm is limited, but I want to be sure that a) any new scheme is actually an improvement, and b) that we don't have any migration issues between schemes. I haven't seen enough documentation for the former and there has been little discussion on the latter.

@inukshuk

Member

inukshuk commented Jun 20, 2012

On Jun 20, 2012, at 1:55 PM, Rintze M. Zelle wrote:

In Ruby there are existing software packages that do an admirable job at ordinalizing numbers in multiple languages.

Which are those? They might be of interest to us.

ActiveSupport extends integers with an #ordinalize method. I've seen French and German adaptions of it that tie into the i18n framework used by Rails.

https://github.com/rails/rails/blob/6c367a0d787705746f262d0bd5ad8c4f13a8c809/activesupport/lib/active_support/inflector/methods.rb#L278

Obviously, this implementation is basically identical to the current CSL scheme.

a different solution that works in a number of languages

Which languages are those, exactly? I just have a hard time judging the merits of your algorithm without examples of the additional languages that we would be able to support.

Don't underestimate the biggest issue for me as a developer: the current spec forces me to use a nonsensical approach to implement even the easy languages that use a fixed affix, like German. It forces me to rely on an arbitrary and redundant locale definition like this:

ordinal-01 = "."
ordinal-02 = "."
ordinal-03 = "."
ordinal-04 = "."

And the implementation is forced to make exceptions for 11, 12 and 13. This makes no sense and it gives no perceivable advantage.

Apart from this, however, you're pointing out a number of languages yourself in this thread. I don't speak Dutch myself, but judging from the rules you mentioned above it would be straightforward to implement with the proposed approach. A translator would have to add the following ordinals:

ordinal-00 = "e"
ordinal-01 = "ste"
ordinal-08 = "ste"
ordinal-20 = "ste"
…
ordinal-29 = "ste"

ordinal-02 = "de"
ordinal-04 = "de"
ordinal-05 = "de"
ordinal-06 = "de"
ordinal-07 = "de"
ordinal-09 = "de"
…
ordinal-19 = "de"

That is to say, this approach works best if you have a general case and a limited number of exceptions or special cases. If there are many exceptions the translation becomes a little tedious, but at least this creates possibilities.

Of course I would prefer it if we go even further and also change the specification (not only the schema). As Rintze correctly points out above this could lead to some issues – however all these issues can be mitigated by either updating a locale or style or by the cite processor implementation

If we adopt your scheme, we should update the specification. I suggested a possible migration path just above your post, but haven't received any feedback on it for months. Is that acceptable?

Because of your concerns I wanted to make clear that it would be sufficient for now to just make the schema more permissive. The real imminent problem for citeproc-ruby is that its locales are automatically invalid. Simply by allowing additional ordinals and not changing anything else for a 1.0.1 release you make everything much easier at literally no cost at all. You could then add the proposed solution or a different solution that works better in a 1.1 release later on.

Regarding the migration path: again, I don't think the specification should enforce a migration path; a suggestion, as you say, is fine, but it should be a cite processor's or application's prerogative to decide on whether or not and how they want to support 1.0.1 and 1.0 locales. Apart from that, I believe the scheme you describe above is perfectly valid. Personally, I would not implement the old scheme, because I really don't see the point of falling back to an algorithm that will not produce better results – if anyone reports an issue they would just have to adapt a locale file. But as I said, everyone else is welcome to, and I don't think it would be difficult for Zotero or Mendeley to implement the strategy you proposed in order to guarantee stable results. Frankly, though, I am convinced that end users would not mind a few minor issues which can be resolved if this means that localization support improves overall.

Again, to reiterate: I know the current algorithm is limited, but I want to be sure that a) any new scheme is actually an improvement, and b) that we don't have any migration issues between schemes. I haven't seen enough documentation for the former and there has been little discussion on the latter.

Here are the old citeproc-ruby tests for #ordinalize (English and German) – this proves nothing new though.

https://github.com/inukshuk/citeproc-ruby/blob/master/spec/csl/locale_spec.rb#L85

If you have doubts: why not just accept Frank's patch and not change the specification. What is the downside of it? Why penalize alternative implementation schemes like that?

@rmzelle

Member

rmzelle commented Jun 21, 2012

@inukshuk (and others), before we settle on anything, a few more questions:

a) do we have any interest in supporting Chinese ordinals (e.g. "第1")? That could be done by using "ordinal-suffix-00" and "ordinal-prefix-00"-like terms instead of "ordinal-00".
b) CSL styles can include locale data that overrides that of the locale files (see http://bit.ly/KWlYLo ), and, traditionally, terms are overridden on a one-by-one basis. Does it make more sense here to replace the entire set of "ordinal-00" terms at a time?
c) "Personally, I would not implement the old scheme, because I really don't see the point of falling back to an algorithm that will not produce better results – if anyone reports an issue they would just have to adapt a locale file"
This is related to the issue above. There are a bunch of CSL 1.0 styles in use that specify overriding "ordinal-01" to "...-04" terms. How will citeproc-ruby deal with those? Just ignore those terms and fall back on the terms in the locale file?

@inukshuk

Member

inukshuk commented Jun 21, 2012

a) Are you aware of any language that uses both prefix and suffix? If not, an alternative approach would be to define an "ordinal-affix" attribute which can be set to "suffix" or "prefix" by translators. Otherwise, yes, the algorithm could be adapted to check for both "ordinal-suffix-NN" and "ordinal-prefix-NN" (ordinal-prefix would have to be set to an empty string for English).

b) I'm not exactly sure what you mean here. Are you referring to cases like the style mentioned previously (http://www.zotero.org/styles/mcgill-guide-v7/dev)? In that case, let me try to walk through how citeproc-ruby would ordinalize 15, 14, 13, 3 and 23:

15: ordinal-15 is not defined, ordinal-05 is not defined so ordinal-00 (from the default fallback locale, English in this case) would be selected: 15th.

14: ordinal-14 is not defined, ordinal-04 is defined in the style: 14th

13: ordinal-13 is not defined in the style, but it is defined in the default locale: 13th

3: ordinal-03 is defined in the style: 3d

23: ordinal-23 is not defined, ordinal-03 is defined in the style: 23d

So, in this case, the style would not have to be changed at all.
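
The walk-through above could be sketched roughly like this. It is a minimal illustration, not citeproc-ruby's actual code: the lookup order (two-digit term, then last-digit term, then ordinal-00) follows the description, while the concrete term values are assumed. Only the "d" for ordinal-03 and the existence of ordinal-04 in the style are implied by the walk-through; the rest of the style and the English locale with ordinal-00 and ordinal-11 through 13 are made up for the example.

```python
from collections import ChainMap

# Assumed English locale terms under the proposed scheme (hypothetical values).
LOCALE_EN = {
    "ordinal-00": "th", "ordinal-01": "st", "ordinal-02": "nd", "ordinal-03": "rd",
    "ordinal-11": "th", "ordinal-12": "th", "ordinal-13": "th",
}

# Assumed style-level overrides in the spirit of the CSL 1.0 style mentioned above.
STYLE = {"ordinal-01": "st", "ordinal-02": "d", "ordinal-03": "d", "ordinal-04": "th"}


def ordinalize(n, terms):
    """Try the two-digit term, then the last-digit term, then the general ordinal-00."""
    for name in (f"ordinal-{n % 100:02d}", f"ordinal-{n % 10:02d}", "ordinal-00"):
        if name in terms:
            return f"{n}{terms[name]}"
    raise KeyError("no ordinal-00 fallback defined")


# Style terms take priority over locale-file terms, one term at a time.
terms = ChainMap(STYLE, LOCALE_EN)
print([ordinalize(n, terms) for n in (15, 14, 13, 3, 23)])
```

With these assumed values the list matches the five results given in the walk-through.
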

c) I would use the overrides just as the specification says. The only case where I can envision this leading to unexpected results is when the value of ordinal-04 differs from the value of ordinal-00 (or whatever we want to call the default value) – in that case, the style author was trying to override the default value but now effectively only overrides the value for numbers which end with a 4.

Now, instead of trying to implement a fallback mechanism for this case, I would just add ordinal-00 (with the same value as ordinal-04) to the style. This addition should not interfere with a CSL 1.0 cite processor at all because it would just ignore the value. The only problem is that this style would not be valid anymore according to the 1.0 schema.

@inukshuk

Member

inukshuk commented Jun 21, 2012

Just a quick follow-up: if we want to include "ordinal-prefix-NN" and "ordinal-suffix-NN" then this is definitely nothing for a 1.0.1 release because the regular "ordinal-NN" would become invalid.

@rmzelle

Member

rmzelle commented Jun 21, 2012

an alternative approach would be to define an "ordinal-affix" attribute which can be set to "suffix" or "prefix" by translators.

Yes, that's possible as well. I think using separate terms is slightly easier to understand from a style author perspective, though.

if we want to include "ordinal-prefix-NN" and "ordinal-suffix-NN" then this is definitely nothing for a 1.0.1 release because the regular "ordinal-NN" would become invalid

We could allow both (but they could be mutually exclusive). "ordinal-NN" for the old scheme, "ordinal-prefix-NN" and "ordinal-suffix-NN" for the new one. With CSL 1.1 we could then retire the old scheme. It would of course require you to support both algorithms.

I'm not exactly sure what you mean here.

Take the Dutch case. There is a simple solution (just adding "e" as a suffix, requiring just the "ordinal-00" term) and a complex solution (requiring a whole bunch of "ordinal-NN" terms). If the Dutch locale file describes the complex case, and a Dutch style wants the simple solution, the style would have to override all those terms. If we decided that the definition of one or more "ordinal-NN" terms in the style replaces the entire collection of any prior "ordinal-NN" term definitions, you wouldn't have this problem.

@inukshuk

Member

inukshuk commented Jun 21, 2012

Ah, now I understand – I didn't realize there were actually two possible solutions for Dutch (so my example above is wrong).

If the default locale is quite complex (i.e., defines many ordinals) and a style wants to use a simple approach instead this would require the simple approach to override all those ordinals defined in the original locale. This is obviously not ideal for the style author.

In the Dutch case (if I manage to get it right this time):

ordinal-00 = "ste"
ordinal-02 = "de"
ordinal-04 = "de"
ordinal-05 = "de"
ordinal-06 = "de"
ordinal-07 = "de"
ordinal-09 = "de"
...
ordinal-19 = "de"

So now, to override this I would have to set all these values to "e" instead of just setting ordinal-00, right?

Personally, I would prefer to not make locale prioritization more complex by adding a special case for ordinal terms; however, I can see your point here: overriding ordinals would definitely become more complicated for translators or style authors. On the other hand, if we introduce the special case then there will be potentially more migration issues: the example I posted above would yield invalid results, for example, because the fallbacks would not be loaded for 13 and 00.
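
To make the trade-off concrete, here is a small sketch comparing one-by-one overriding with replacing the whole ordinal set. The helper `merge_terms` and its `replace_ordinal_set` flag are hypothetical names introduced only for this illustration, and the Dutch terms are the (possibly incomplete) ones listed in this comment.

```python
# Hypothetical helper comparing the two override strategies under discussion.

def merge_terms(locale_terms, style_terms, replace_ordinal_set=False):
    """Merge style-level terms over locale-file terms.

    replace_ordinal_set=False: classic one-by-one override.
    replace_ordinal_set=True: if the style defines any ordinal term, all
    ordinal terms from the locale file are discarded first.
    """
    merged = dict(locale_terms)
    style_has_ordinals = any(name.startswith("ordinal-") for name in style_terms)
    if replace_ordinal_set and style_has_ordinals:
        merged = {k: v for k, v in merged.items() if not k.startswith("ordinal-")}
    merged.update(style_terms)
    return merged


# Complex Dutch locale as listed above (assumed values).
NL_LOCALE = {"ordinal-00": "ste", "ordinal-02": "de"}
NL_LOCALE.update({f"ordinal-{k:02d}": "de" for k in (4, 5, 6, 7)})
NL_LOCALE.update({f"ordinal-{k:02d}": "de" for k in range(9, 20)})
NL_LOCALE["and"] = "en"  # a non-ordinal term, to show it is left alone

SIMPLE_STYLE = {"ordinal-00": "e"}

per_term = merge_terms(NL_LOCALE, SIMPLE_STYLE)
whole_set = merge_terms(NL_LOCALE, SIMPLE_STYLE, replace_ordinal_set=True)

print(per_term["ordinal-02"])  # still "de": the style would have to override every term
print(sorted(k for k in whole_set if k.startswith("ordinal-")))  # only ordinal-00 remains
```

The flip side is visible with the same helper: under the whole-set strategy, a CSL 1.0 style that defines only ordinal-01 through ordinal-04 would end up with no ordinal-00 or ordinal-13 at all, which is the migration concern raised here.
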

@fbennett

Member

fbennett commented Jun 21, 2012

Re (a), there is no need to worry about 第 as a prefix in this context. It's a fixed string for all numbers, and can be handled as a text value where required. CJK styles will diverge significantly from those for European languages, and simple locale switching is unlikely to work anyway; language-specific layouts (currently an MLZ experiment, but you know, eventually) will be easier for style authors to manage.

@rmzelle

Member

rmzelle commented Jun 29, 2012

How would your scheme work for French? It's my understanding that there are gender-specific ordinal-suffixes for "1" (female: "1ère", male: "1er"), and a non-gender-specific ordinal-suffix for higher numbers ("101e") [see also http://commons.wikimedia.org/wiki/Template:Ordinal/testcases ]. Maybe the following?:

ordinal-101 = "e"
ordinal-11 = "e"
ordinal-01 = "ère"/"er"
ordinal-00 = "e"

I guess that would still fail for 1001, 10001, etc., though, which would end up wrongly matching ordinal-01.

@inukshuk

Member

inukshuk commented Jul 2, 2012

By the way, that's a really useful document of test cases!

I implemented ordinalization over the weekend in the CSL-Ruby library (I also implemented the old algorithm for good measure) so now we can test all these cases much easier.

Take a look at the current acceptance tests – they should be easy to read and make it really easy to experiment with different locale settings. As you can see, I added French ordinal tests at the bottom: to my understanding they seem to work – however, you're absolutely right that if I added a gender to calls for 101 or 21 etc. I would not get the neutral 'e'. For example:

When I ordinalize these numbers:
  | num   | form  | gender    |
  | 0     |       |           |
  | 1     |       |           |
  | 1     |       | feminine  |
  | 1     |       | masculine |
  | 1     |       | neutral   |
  | 2     |       |           |
  | 3     |       |           |
  | 999   |       |           |
  | 11    |       |           |
  | 21    |       |           |
  | 101   |       |           |
  | 1001  |       |           |
  | 301   |       |           |
  | 21    |       | masculine |
  | 1001  |       | masculine |
Then the ordinals should be:
  | ordinal |
  | 0e      |
  | 1e      |
  | 1ère    |
  | 1er     |
  | 1e      |
  | 2e      |
  | 3e      |
  | 999e    |
  | 11e     |
  | 21e     |
  | 101e    |
  | 1001e   |
  | 301e    |
  # These are incorrect:
  | 21er    |
  | 1001er  |

In order to work around this, we would have to add gendered versions for 11, 21, … 91 and that would still leave 101, 201, 1001, etc. which would annoyingly return the gendered version of ordinal-01 (when called with a gender). Does anyone have an idea how we could handle this?
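
The problem can be reproduced with a very small sketch. This is not the csl-ruby implementation; it assumes a simplified lookup (two-digit term, last-digit term, then ordinal-00, each tried with the requested gender first and then without), and the French term table is assumed from the test cases above.

```python
# Assumed French terms, keyed by (term name, gender); None means no gender-form.
FR = {
    ("ordinal-00", None): "e",
    ("ordinal-01", None): "e",
    ("ordinal-01", "feminine"): "ère",
    ("ordinal-01", "masculine"): "er",
}


def ordinalize(n, terms, gender=None):
    """Simplified lookup: two-digit term, last-digit term, then ordinal-00."""
    for name in (f"ordinal-{n % 100:02d}", f"ordinal-{n % 10:02d}", "ordinal-00"):
        for g in (gender, None):  # gendered variant first, then the plain term
            if (name, g) in terms:
                return f"{n}{terms[(name, g)]}"
    raise KeyError("no ordinal-00 fallback defined")


print(ordinalize(1, FR, "feminine"))    # 1ère
print(ordinalize(21, FR))               # 21e
print(ordinalize(21, FR, "masculine"))  # 21er, the unwanted result
```

Because 21 and 1001 fall back to ordinal-01 through their last digit, the gendered variant of ordinal-01 wins as soon as a gender is passed.
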

Taking a look at the build status, ordinalization seems to work on all major platforms (the rbx and ree errors pertain to a different part of the library), so you should be able to run the tests yourself, too – this would allow you to experiment with more locales. If you clone csl-ruby, simply go to that directory and run:

$ bundle install
$ bundle exec cucumber

If you just want to run the ordinalization tests you can use the tags:

$ bundle exec cucumber --tags @v1.0,@ordinals
#-> runs the tests for the 1.0 algorithm

$ bundle exec cucumber --tags @v1.0.1,@ordinals
#-> runs the tests for the 1.0.1 algorithm

$ bundle exec cucumber --tags @ordinals,@lang:fr
#-> only runs the french ordinals tests

And so on.

@rmzelle

Member

rmzelle commented Jul 2, 2012

I realized my description for the complex Dutch case was incorrect. According to http://www.let.ru.nl/ans/e-ans/07/03/01/body.html it's:

0, 1, 8, 20-99: "ste"
2-7, 9-19: "de"

For 100 and up the rules stay the same (you just look at the last two digits), so "100ste", "102de", etc. For some reason "101de", "1001de", etc. is also allowed for the numbers ending on "01" (Dutch is weird).
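
Read literally, the corrected rule depends only on the last two digits, which a few lines of Python can express. This is just a restatement of the rule above for checking examples; the function name is made up, and the alternative "101de" form is not modelled.

```python
def dutch_suffix(n: int) -> str:
    """Suffix per the rule above: only the last two digits matter."""
    last_two = n % 100
    if 2 <= last_two <= 7 or 9 <= last_two <= 19:
        return "de"
    return "ste"  # 0, 1, 8 and 20-99


for n in (1, 2, 8, 19, 20, 100, 102):
    print(f"{n}{dutch_suffix(n)}")
```
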

@inukshuk

Member

inukshuk commented Jul 3, 2012

Dutch zeroth is nulde I believe (see here) so this makes it even more difficult.

So I tried to set this up – not claiming this is a particularly practical solution, but at least it is possible (or did I miss something?). Take a look here. The test cases below all work as expected.

Some observations gathered from the French and Dutch cases:

  • For languages where you have a general case and a limited number of exceptions (like French) it would be useful to mark a definition as an exception – this ordinal shall only be used if the number matches exactly (not as a fallback). This helps when the exception is one of the common ordinals (like ordinal-01 in French) – it would not force us to have to override the exception for higher numbers. For example, this could look like:

    <term name="ordinal-01" exception="true" gender-form="masculine">er</term>
    
  • For languages like Dutch it would help a lot if translators could define ordinal ranges. To make this easier for cite processors it may be a good idea to use a dedicated ordinal tag instead of terms, for instance, here are a couple of ideas:

    <ordinal from="20" to="99">ste</ordinal>
    
    or:
    
    <ordinal for="1,8,20-99">ste</ordinal>
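
    A rough sketch of how a processor might read such a range list, purely as an illustration. The parsing helper, the rule table and the choice to match against the last two digits of the number are all assumptions; none of this is specified by the proposal itself.

```python
def parse_ranges(spec: str):
    """Turn a spec such as "1,8,20-99" into a list of inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        low, _, high = part.strip().partition("-")
        ranges.append((int(low), int(high or low)))
    return ranges


def suffix_for(n: int, rules, default: str) -> str:
    """Return the suffix of the first rule whose ranges contain n % 100 (assumed matching)."""
    key = n % 100
    for spec, suffix in rules:
        if any(low <= key <= high for low, high in parse_ranges(spec)):
            return suffix
    return default


# Hypothetical Dutch rules; the default for 0 is an assumption.
NL_RULES = [("1,8,20-99", "ste"), ("2-7,9-19", "de")]

print(parse_ranges("1,8,20-99"))
print(f"223{suffix_for(223, NL_RULES, 'ste')}")
```
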
    
@rmzelle

Member

rmzelle commented Jul 3, 2012

For the "0" case in Dutch: there is also http://nl.wiktionary.org/wiki/nulste, and I think both are used in colloquial Dutch (I'm not sure if one version is incorrect).

I have been thinking of other algorithms we could use to make it possible to cover languages such as French and Dutch while defining as few terms as possible. This might be a bit crazy, but would the following make sense?

French:

<term name="ordinal-00" modulo="1">e</term>
<term name="ordinal-01">e</term>
<term name="ordinal-01" gender-form="feminine">ère</term>
<term name="ordinal-01" gender-form="masculine">er</term>

English:

<term name="ordinal-00" modulo="1">th</term>
<term name="ordinal-01" modulo="10">st</term>
<term name="ordinal-02" modulo="10">nd</term>
<term name="ordinal-03" modulo="10">rd</term>
<term name="ordinal-11" modulo="100">th</term>
<term name="ordinal-12" modulo="100">th</term>
<term name="ordinal-13" modulo="100">th</term>

Dutch:

<term name="ordinal-00" modulo="1">ste</term>
<term name="ordinal-01" modulo="10">ste</term>
<term name="ordinal-02" modulo="100">de</term>
<term name="ordinal-03" modulo="100">de</term>
<term name="ordinal-04" modulo="100">de</term>
<term name="ordinal-05" modulo="100">de</term>
<term name="ordinal-06" modulo="100">de</term>
<term name="ordinal-07" modulo="100">de</term>
<term name="ordinal-08" modulo="10">ste</term>
<term name="ordinal-09" modulo="100">de</term>
<term name="ordinal-10" modulo="100">de</term>
<term name="ordinal-11" modulo="100">de</term>
<term name="ordinal-12" modulo="100">de</term>
<term name="ordinal-13" modulo="100">de</term> 
<term name="ordinal-14" modulo="100">de</term>
<term name="ordinal-15" modulo="100">de</term> 
<term name="ordinal-16" modulo="100">de</term>
<term name="ordinal-17" modulo="100">de</term>
<term name="ordinal-18" modulo="100">de</term>
<term name="ordinal-19" modulo="100">de</term>

The algorithm would be: say the number to ordinalize is "223", and the locale is Dutch. This number is subjected to a modulo calculation using the divisor specified in the ordinal term with the highest number, which is "100" for "ordinal-19" (Dutch). This gives "23" as the remainder. This doesn't match "19", so the whole operation is repeated with the next term ("ordinal-18", "ordinal-17", etc.), until a match is achieved (in this case, that would be "ordinal-00", as 223 mod 1 == 0, so we get "223ste").

E.g. with English and the number "112", there will be a match against "ordinal-12" (112th), and with "23" there will be a match against "ordinal-03" (23rd).

If no modulo attribute is set on an ordinal, the number is directly matched to the ordinal number (so only "1" matches "ordinal-01" in the French case).

This is more computationally expensive than your original algorithm, and I haven't checked it with code, but maybe something like this could be used.
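
One way to check the idea is a short Python sketch like the one below. It is only a reading of the description above, not tested against any processor: the `Term` tuple is a made-up representation of the `<term>` elements, and trying gendered variants before the plain term of the same number is an added assumption, since the description does not say how gender-form interacts with matching.

```python
from typing import NamedTuple, Optional


class Term(NamedTuple):
    number: int                    # the NN in ordinal-NN
    suffix: str
    modulo: Optional[int] = None   # None means: match the number exactly
    gender: Optional[str] = None   # gender-form, if any


def ordinalize(n, terms, gender=None):
    """Walk the terms from the highest ordinal number down; the first match wins."""
    ordered = sorted(terms, key=lambda t: (-t.number, t.gender is None))
    for t in ordered:
        if t.gender is not None and t.gender != gender:
            continue
        remainder = n % t.modulo if t.modulo else n
        if remainder == t.number:
            return f"{n}{t.suffix}"
    raise ValueError(f"no ordinal term matches {n}")


EN = [Term(0, "th", 1), Term(1, "st", 10), Term(2, "nd", 10), Term(3, "rd", 10),
      Term(11, "th", 100), Term(12, "th", 100), Term(13, "th", 100)]

NL = [Term(0, "ste", 1), Term(1, "ste", 10), Term(8, "ste", 10)]
NL += [Term(k, "de", 100) for k in range(2, 20) if k != 8]

FR = [Term(0, "e", 1), Term(1, "e"),
      Term(1, "ère", gender="feminine"), Term(1, "er", gender="masculine")]

print(ordinalize(223, NL))              # 223ste
print(ordinalize(112, EN))              # 112th
print(ordinalize(23, EN))               # 23rd
print(ordinalize(21, FR, "masculine"))  # 21e
```

Under these assumptions the three examples from the description come out as stated, and gendered French calls for 21 or 1001 no longer pick up the suffix of ordinal-01.
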

@gracile-fr

gracile-fr commented Jul 4, 2012

minor comment:
You're right for French. The only small mistake is that the ordinal suffix for 1 is "er" => "1er" (male) and "re" => "1re" (female).
Other ordinals do not have gender-specific suffix except (for your information) "2" which can be "2d" (male) "2de" (female) in a very peculiar case (for purists, disputed): when there's no other "item" after; i.e.: the World War II is, in French, "la Seconde Guerre mondiale" (2de GM) since there is no Third World War. But this distinction between second/deuxième (2d/2e) is disputed.

@inukshuk

Member

inukshuk commented Jul 5, 2012

@gracile-fr thanks for that. I updated the test cases – do they look correct now?

The disputed case sounds fun, too, because I imagine most of the time it is difficult to rule out the existence (or possible existence) of a next item. ;-)

@rmzelle I added a 'nulste' variant to the tests here (the nulde variant is just above). The locale is a bit long but I think it is complete. If we could mark the 'de' cases as exceptions that apply only to a direct match it would become much simpler, too.

Basically I like the idea of translators being able to define a divisor; however, I think we should try to find a solution that is easier to understand and implement. If I understand your idea correctly, to ordinalize 91 in English I would:

  1. Look for ordinal-91
  2. Determine divisor by looking at highest defined number: 100 (defined by ordinal-13)
  3. 91 % 100 = 91 – ordinal-91 does not match
  4. Determine divisor: 100 (defined by ordinal-12)
  5. see 3.
  6. Determine divisor: 100 (defined by ordinal-11)
  7. see 3.
  8. Determine divisor: 10 (defined by ordinal-03)
  9. 91 % 10 = 1 – ordinal-01 matches, so return "91st"

Is that correct? What I don't like about this solution is the process of having to determine the divisor by looking at the highest defined ordinal; I find the ordinal term with the 'highest number' a little arbitrary. For example, to ordinalize 7 in English, I would basically follow the same approach as for 91 above, checking the divisors of 13, 12, 11, 3, 2, and 1 before I know that there is no match.

Looking at English, Dutch and French our requirements seem to be:

  • Define a general case
  • Define one time exceptions (like 1 in French)
  • Define exceptions repeating by modulo 10 or 100

We should definitely add tests for additional languages to see if there are any other important requirements.

@rmzelle

Member

rmzelle commented Jul 5, 2012

@inukshuk, I guess it doesn't really make sense to match the input number against ordinals of a higher number, since you'll never get a match. So we could reduce the tests to:

To ordinalize 91 in English:

  1. Identify the highest ordinal-number that is less than or equal to 91: ordinal-13
  2. Use the modulo divisor from ordinal-13, and match the remainder to the ordinal-number: is 91 % 100 equal to 13? No.
  3. Repeat 2 for the next-highest ordinal-number, until the numbers match:
    • 91 % 100 != 12
    • 91 % 100 != 11
    • 91 % 10 != 3
    • 91 % 10 != 2
    • 91 % 10 == 1
  4. ordinal-01 is the first match, so return "91st"

For 7:

  1. Identify the highest ordinal-number that is less than or equal to 7: ordinal-03
  2. Use the modulo divisor from ordinal-03, and match the remainder to the ordinal-number: is 7 % 10 equal to 3? No.
  3. Repeat 2 for the next-highest ordinal-number, until the numbers match:
    • 7 % 10 != 2
    • 7 % 10 != 1
    • 7 % 1 == 0
  4. ordinal-00 is the first match, so return "7th"
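
The two walk-throughs above can be sketched as follows. This is only an illustration under assumptions: the table `ENGLISH`, the `(divisor, suffix)` layout, and the function name `ordinalize_scan` are hypothetical and are not taken from any CSL processor.

```python
# Hypothetical English table: ordinal-number -> (modulo divisor, suffix).
# ordinal-00 uses divisor 1, so n % 1 == 0 always matches as the last resort.
ENGLISH = {
    0: (1, "th"),
    1: (10, "st"), 2: (10, "nd"), 3: (10, "rd"),
    11: (100, "th"), 12: (100, "th"), 13: (100, "th"),
}


def ordinalize_scan(n, terms=ENGLISH):
    """Walk the defined ordinal-numbers <= n from highest to lowest."""
    for number in sorted((k for k in terms if k <= n), reverse=True):
        divisor, suffix = terms[number]
        if n % divisor == number:  # remainder equals the ordinal-number
            return f"{n}{suffix}"
    raise ValueError(f"no ordinal term matches {n}")


print(ordinalize_scan(91))  # 91st
print(ordinalize_scan(7))   # 7th
```
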
@rmzelle

Member

rmzelle commented Jul 5, 2012

Maybe we can simplify this further if we assume that we only ever need modulo 10 for ordinal-terms from 1 to 10, and modulo 100 for ordinal-terms from 11 to 100 (not entirely sure that's always the case). Then we could keep the same markup of ordinal-terms I proposed, and do:

To ordinalize 91 in English:

  1. Does 91 match any of the ordinal-numbers? No.
  2. Does 91 % 100 = 91 match any of the ordinal-numbers with a modulo divisor of 100? No. (this obviously can be skipped for numbers below 100)
  3. Does 91 % 10 = 1 match any of the ordinal-numbers with a modulo divisor of 10? Yes, ordinal-01
  4. Return term value of ordinal-01: "91st"

For 7:

  1. Does 7 match any of the ordinal-numbers? No.
  2. Does 7 % 100 = 7 match any of the ordinal-numbers with a modulo divisor of 100? No.
  3. Does 7 % 10 = 7 match any of the ordinal-numbers with a modulo divisor of 10? No.
  4. Return term value of ordinal-00: "7th"

Step 2 can be skipped for numbers below 100, and step 3 can be skipped for numbers below 10.
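
A minimal sketch of this simplified lookup, assuming a hypothetical table `TERMS` that stores the modulo divisor next to each suffix and a `FALLBACK` value standing in for ordinal-00 (none of these names come from the proposal itself):

```python
# Hypothetical English terms: ordinal-number -> (modulo divisor, suffix).
TERMS = {
    1: (10, "st"), 2: (10, "nd"), 3: (10, "rd"),
    11: (100, "th"), 12: (100, "th"), 13: (100, "th"),
}
FALLBACK = "th"  # stands in for ordinal-00


def ordinalize(n, terms=TERMS, fallback=FALLBACK):
    # Step 1: direct match against any ordinal-number.
    if n in terms:
        return f"{n}{terms[n][1]}"
    # Steps 2 and 3: last two digits, then last digit; skipped when n is too small.
    for divisor in (100, 10):
        if n < divisor:
            continue
        term = terms.get(n % divisor)
        if term is not None and term[0] == divisor:
            return f"{n}{term[1]}"
    # Step 4: nothing matched, use the fallback term.
    return f"{n}{fallback}"


print(ordinalize(91))  # 91st
print(ordinalize(7))   # 7th
```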

@inukshuk

Member

inukshuk commented Jul 5, 2012

Hmmm… so, in other words, you are working off the assumption that in most languages we'll be able to express all exceptions already in the numbers 1-100, right? So that, basically, our matching algorithm will be something like:

  1. Is there a direct match?
  2. If not, is there a match % 100?
  3. If not, is there a match % 10?
  4. If not, return default.

I do like this – it would be slightly more efficient and more straightforward than my current algorithm, but it has exactly the same drawbacks: you need to specify a lot of ordinals to override one-time exceptions. For example, Dutch 143:

  1. No direct match
  2. 143 % 100 = 43 – if we have ordinal-43 defined, we're good, otherwise:
  3. 143 % 10 = 3 – ordinal-03 will match but is wrong
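
The Dutch 143 case can be reproduced with a small sketch of the four-step lookup. The Dutch data here is an assumption made for illustration ("de" for 2-7 and 9-19, "ste" otherwise), and the function name is hypothetical:

```python
def ordinalize(n, terms, default):
    """Direct match, then n % 100, then n % 10, then the default."""
    for key in (n, n % 100, n % 10):
        if key in terms:
            return f"{n}{terms[key]}"
    return f"{n}{default}"


# Assumed Dutch data: "de" for 2-7 and 9-19, "ste" everywhere else.
dutch = {k: "de" for k in range(2, 20) if k != 8}

print(ordinalize(143, dutch, "ste"))  # 143de: ordinal-03 matches, but is wrong
dutch[43] = "ste"                     # the extra definition the locale would need
print(ordinalize(143, dutch, "ste"))  # 143ste
```

Every number from 22 upward whose last digit is 2-7 or 9 would need the same kind of explicit entry, which is the verbosity problem described here.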

In other words, I think this is a good improvement of the algorithm but it still does not solve the problems we have in Dutch and French: namely that the locales will have to contain lots of definitions.

As I said, we should look at more languages. So far, I think the algorithm is basically fine (even better with your improvement), but would benefit from ordinal definitions that can be marked as being restricted to direct matches or, alternatively ranges or number patterns.

I'll try to look at additional languages tomorrow.

@inukshuk

Member

inukshuk commented Jul 9, 2012

I've started adding test cases for Spanish, Italian, and Swedish – I will try to add a Slavic language, too (probably Polish). It would be great if we could get native speakers to review all of these as there are probably mistakes and interesting cases that are missing from the examples.

Basically, Spanish and Italian seem very easy, although there is a special case with Spanish 1 and 3 for singular, masculine nouns that seems to apply only if the number is prefixed (instead of being used as a suffix); I don't know when exactly this case occurs. However, I've also added singular/plural distinctions to the German long-ordinals (e.g., here) and this works without a problem.

Swedish seems to be relatively regular but I believe it has the same issue as French in that 1 and 2 are treated separately – but I do not know whether that treatment applies to numbers such as 92 or 301 etc. – in other words, it may be that the 'one-time-exception' requirement would be useful for Swedish as well.

@gracile-fr

gracile-fr commented Jul 9, 2012

@inukshuk: "thanks for that. I updated the test cases – do they look correct now?"
https://github.com/inukshuk/csl-ruby/blob/master/features/locales/ordinalize.feature#L213

"1e" is nonsense in French, but maybe it's only for the test?

As you're talking about plural forms:

SINGULAR            PLURAL
premier    1er      premiers    1ers
première   1re      premières   1res
deuxième   2e       deuxièmes   2es
[ second   2d       seconds     2ds
  seconde  2de      secondes    2des ]
troisième  3e       troisièmes  3es
dixième    10e      dixièmes    10es
centième   100e     centièmes   100es

[from http://www.langue-fr.net/spip.php?article239]
@inukshuk

Member

inukshuk commented Jul 9, 2012

@gracile-fr the "1e" was just a guess at the neutral form – does French only have feminine/masculine gender? In that case, we should replace the "1e" with whatever is typically more common (it is the case that is selected by the processor if no gender information is supplied).

Thanks about the plural forms, I'll add those in a minute. Can you tell me the difference between deuxième and second?

@inukshuk

Member

inukshuk commented Jul 16, 2012

I only updated the 'nulste' variant and the modulo="100" is used. For example, if you ordinalize 52, you end up dividing by 10 to get ordinal-02 and because 10 != 100 the value will not be used. This way the locale definition becomes quite effective.

Of course if 'nulde' is the correct form then we actually have the case where ordinal-00 is not the general case and all this does not help much. Perhaps we could work around this by defining ordinal-00 twice, without a modulo and with a modulo, respectively. Without a modulo could be the regular zero ordinal whereas ordinal-00 modulo 1 is the default/fallback value.

@rmzelle

Member

rmzelle commented Jul 16, 2012

Ah, right. Yes, the Dutch "nulste" case looks good, apart from the "0ste" itself.

Would it be any clearer to use "ordinal" term as the default? I guess for Dutch, we could use

<term name="ordinal">de</term>
<term name="ordinal-00" modulo="100">ste</term>
@inukshuk

Member

inukshuk commented Jul 16, 2012

I think that would work; but your example would have to be the other way around, right? That is:

<term name="ordinal">ste</term> <!-- the default/fallback ordinal -->
<term name="ordinal-00">de</term> <!-- the zero ordinal -->

With the new algorithm you only need to specify the modulo when it is required. In French we are using modulo="1" because ordinal-01 should only be used for 1. In Dutch we're using modulo 100 for the group 2-19 (with the exception of 1 and 8), because we want it to be used for, for example 102, 202 etc. but not for 22, 122 etc.

@rmzelle

Member

rmzelle commented Jul 16, 2012

With your algorithm, <term name="ordinal-00" modulo="100">ste</term> would only match 100, 1000, 10000, etc., right?

@inukshuk

Member

inukshuk commented Jul 16, 2012

No, it would match:

  1. 0 (direct match)
  2. Every number n where n % 100 == 0

So it would match 0, 100, 200, 300, … 1000 …

However, the important thing is that the matching algorithm works backwards. So if the number is 1200 it would:

  1. Try to match ordinal-1200
  2. If that didn't work it would calculate the starting modulus, in this case 1000
  3. It would then try to match 1200 % 1000 == 200 so ordinal-200 (modulo 1000)
  4. Then it would divide 1000 by 10 and get the new modulus: 100
  5. It would then try to match 200 % 100 == 0, so ordinal-00 (modulo 100)
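
A rough sketch of this backwards matching, under stated assumptions: the names `probes`, `ordinalize`, and `DUTCH` are hypothetical, a term without a modulo is assumed to match at any modulus, and the loop is assumed to continue down to modulus 10 before falling back to the default.

```python
def probes(n):
    """Yield (ordinal_number, modulus) pairs in the order they are tried."""
    yield n, None                      # direct match, no modulus involved
    modulus = 10 ** (len(str(n)) - 1)  # starting modulus, e.g. 1000 for 1200
    while modulus >= 10:
        yield n % modulus, modulus     # remainder at the current modulus
        modulus //= 10                 # next, smaller modulus


def ordinalize(n, terms, default):
    """terms maps ordinal-number -> (suffix, modulo or None)."""
    for number, modulus in probes(n):
        if number not in terms:
            continue
        suffix, term_modulo = terms[number]
        # A direct match always counts; otherwise the term's modulo must equal
        # the current modulus (assumption: no modulo means "any modulus").
        if modulus is None or term_modulo in (None, modulus):
            return f"{n}{suffix}"
    return f"{n}{default}"


# Assumed Dutch data: "de" for 2-7 and 9-19, repeating every 100.
DUTCH = {k: ("de", 100) for k in range(2, 20) if k != 8}

print(list(probes(1200)))             # [(1200, None), (200, 1000), (0, 100), (0, 10)]
print(ordinalize(52, DUTCH, "ste"))   # 52ste: ordinal-02 is found at modulus 10, but 10 != 100
print(ordinalize(102, DUTCH, "ste"))  # 102de
```
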
@rmzelle

Member

rmzelle commented Jul 16, 2012

My bad. In that case your solution is fine.

@fbennett

Member

fbennett commented Jul 16, 2012

Sounds like you're close to wanting implementation in other processors?

@inukshuk

Member

inukshuk commented Jul 17, 2012

I implemented the new default ordinals and now Dutch ("nulde" form) is much easier to define.

From the point of view of translators, we now have to: a) define the default/fallback ordinal with the name "ordinal" (possibly adding gender forms, singular and plural); b) for all exceptions and irregular forms, define the ordinals as "ordinal-nnn", possibly adding a modulo attribute to specify how the exception should be repeated. modulo="1" means that the exception is never repeated, modulo="10" means the exception is repeated in intervals of ten (e.g., 2, 12, 22, 32) and modulo="100" means that the exception is repeated in intervals of 100 (e.g., 2, 102, 202, 302).
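
To make the three modulo values concrete, here is a small illustrative helper (the name `repeats` and its `upto` parameter are hypothetical) that lists the numbers an "ordinal-02" term would apply to under each setting:

```python
def repeats(number, modulo, upto):
    """List the values below `upto` that an ordinal-`number` term applies to."""
    if modulo == 1:
        # n % 1 is always 0, so "never repeated" is treated as an exact match.
        return [number] if number < upto else []
    return [n for n in range(upto) if n % modulo == number]


print(repeats(2, 1, 310))    # [2]
print(repeats(2, 10, 40))    # [2, 12, 22, 32]
print(repeats(2, 100, 310))  # [2, 102, 202, 302]
```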

@rmzelle

Member

rmzelle commented Jul 17, 2012

Looks great. Should I start working on some text for the CSL specification?

@rmzelle

Member

rmzelle commented Jul 26, 2012

I created a pull request for the specification, detailing a possible new scheme for ordinal suffixes: citation-style-language/documentation#22

@inukshuk (and others), it differs a little bit from what we've discussed up to now. My reasons for deviating were twofold: first, it's always been my philosophy for CSL to hide as much programming logic as possible from the style author (my target audience), so I wanted to avoid the attribute name "modulo". Second, I realized that only "ordinal-00" to "ordinal-09" could toggle between modulo="10" and "100" ("ordinal-10" through "ordinal-99" are limited to modulo="100").

The proposed text should be compatible with @inukshuk's latest algorithm described in #66 (comment) (assuming I have understood it correctly). Instead of the modulo attribute, we use match, with values "1-digit" (modulo="10"), "2-digits" (modulo="100") and "whole-number" (modulo="1").

Let me know what y'all think.
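
For illustration, a sketch of how the match values could drive a lookup. The mapping to modulo values is the one stated above; the lookup order (whole number, then last two digits, then last digit), the table layout, and all names are assumptions rather than part of the proposed specification text.

```python
# Mapping stated above: match value -> equivalent modulo.
MATCH_TO_MODULO = {"whole-number": 1, "2-digits": 100, "1-digit": 10}


def ordinalize(n, terms, default):
    """terms maps ordinal-number -> (match value, suffix)."""
    # Assumed priority: whole number, then last two digits, then last digit.
    for match in ("whole-number", "2-digits", "1-digit"):
        modulo = MATCH_TO_MODULO[match]
        key = n if modulo == 1 else n % modulo
        term = terms.get(key)
        if term is not None and term[0] == match:
            return f"{n}{term[1]}"
    return f"{n}{default}"


FRENCH = {1: ("whole-number", "er")}
ENGLISH = {
    1: ("1-digit", "st"), 2: ("1-digit", "nd"), 3: ("1-digit", "rd"),
    11: ("2-digits", "th"), 12: ("2-digits", "th"), 13: ("2-digits", "th"),
}

print(ordinalize(1, FRENCH, "e"), ordinalize(21, FRENCH, "e"))        # 1er 21e
print(ordinalize(91, ENGLISH, "th"), ordinalize(112, ENGLISH, "th"))  # 91st 112th
```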

@fbennett

Member

fbennett commented Jul 26, 2012

Style authors would have to speak for themselves, but I would probably want a guide that explains how I might go about casting the ordinal conventions of a language I am familiar with into this form. I start with a list of ordinals or a set of rules in a Wikipedia page someplace, and then what's my next step?

@rmzelle

Member

rmzelle commented Jul 26, 2012

I can discuss the English case (the paragraph on gender-specific ordinals already shows the terms needed for French).

@fbennett

Member

fbennett commented Jul 26, 2012

An explanation of how a finished locale works and an explanation of how to work up correct ordinals from scratch might have different requirements. The latter might be a little more verbose. Here's a page that was helpful to me on an unrelated issue, that unfolds in stages for the benefit of someone coming to the topic cold: http://www.greywyvern.com/?post=337

@rmzelle

Member

rmzelle commented Jul 26, 2012

Right. For the most part, though, ordinal terms will only be defined in the locale files. There are very few cases where styles actually need any customization.

Also, my call for feedback at this point is more about whether the solution is acceptable, and whether the attribute/value names are clear. Are they?

@fbennett

Member

fbennett commented Jul 26, 2012

Yes.

@gracile-fr

gracile-fr commented Jul 26, 2012

They're clear, yes.

@inukshuk

Member

inukshuk commented Jul 28, 2012

@rmzelle yes, I think the names are clear.

I'll try to adapt my implementation to see if everything still works the same. I'm particularly interested in the default values of the attributes – we need to get those right so as to not complicate locales and implementations.

I'll report back when I've made the switch.

@inukshuk

Member

inukshuk commented Jul 28, 2012

@rmzelle looking good: here are the latest locales I use for testing: https://github.com/inukshuk/csl-ruby/blob/e417d462f7ce159378d5bc6835c971dbb7a91947/features/locales/ordinalize.feature

@fbennett I would outline the new approach to translators and style authors, along the lines of:

  1. Determine the typical ordinal term; define that term with name 'ordinal'
  2. Determine which numbers use a different ordinal term from the one defined in step 1 and define those explicitly
  3. If there are many such numbers, use the match attribute with 1-digit or 2-digits to define repetition patterns of the numbers
  4. If some of the numbers of step 2 are one-time exceptions, use the match attribute with whole-number to mark them as such
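
One way a translator might check their work while following these steps is to compare a draft term table against a short reference list of known-good ordinals. The sketch below is hypothetical throughout: `suffix_for`, `mismatches`, the table layout, and the lookup order are assumptions for illustration.

```python
def suffix_for(n, default, terms):
    """terms: ordinal-number -> (match, suffix); lookup order is an assumption."""
    for match, key in (("whole-number", n), ("2-digits", n % 100), ("1-digit", n % 10)):
        if terms.get(key, (None, None))[0] == match:
            return terms[key][1]
    return default


def mismatches(reference, default, terms):
    """Return {number: (expected, actual)} for every reference entry that differs."""
    found = {n: suffix_for(n, default, terms) for n in reference}
    return {n: (reference[n], got) for n, got in found.items() if got != reference[n]}


# Step 1: the typical English term is "th".
# Steps 2-3: 1, 2 and 3 repeat on the last digit.
english = {1: ("1-digit", "st"), 2: ("1-digit", "nd"), 3: ("1-digit", "rd")}
reference = {1: "st", 2: "nd", 4: "th", 11: "th", 21: "st", 112: "th"}

print(mismatches(reference, "th", english))  # 11 and 112 still come out wrong
# 11-13 repeat on the last two digits, which overrides the last-digit terms.
english.update({k: ("2-digits", "th") for k in (11, 12, 13)})
print(mismatches(reference, "th", english))  # {}
```
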
@rmzelle

Member

rmzelle commented Jul 29, 2012

@inukshuk, your test data contains a few plural terms (e.g. "ersten" for de-DE). Do we ever need those? I couldn't come up with an example, but then again, I'm not too familiar with languages that are gender-sensitive.

@rmzelle

Member

rmzelle commented Jul 30, 2012

With regard to the values on the "match" attribute, I had some alternatives on my mind: "last-digit", "last-two-digits", and "number" instead of "1-digit", "2-digits", "whole-number". Let me know if anybody prefers one over the other.

@inukshuk

Member

inukshuk commented Jul 30, 2012

@rmzelle as far as citation styles are concerned, I can't think of an example off the top of my head where plural ordinals would be required (for that matter, which styles require long ordinals? Do you know?). Anyway, these data are merely for testing – we might not need the plural ordinals in the style locales.

Normally, you'd need them (but I guess that wasn't your question), for instance: 'the first edition' and 'the first two editions' would be 'die erste Ausgabe' and 'die ersten zwei Ausgaben'.

@rmzelle

Member

rmzelle commented Jul 30, 2012

@inukshuk, we have 7 styles that use long-ordinals.

@rmzelle

Member

rmzelle commented Jul 31, 2012

@gracile-fr

gracile-fr commented Jul 31, 2012

Congrats to all of you!
(@rmzelle : I think that "last-digit", "last-two-digits" and "whole-number" are the clearest names)

@gracile-fr

gracile-fr commented Jul 31, 2012

Also in the spec, the "3e janvier" example is a bad one (I assume it's supposed to be a French example). In dates, ordinals are only used for the first day of the month, i.e. "1er janvier" but "2 janvier", "3 janvier", "11 janvier", etc.
Actually, I had not thought about that before and it might be a problem... :(

@gracile-fr

gracile-fr commented Jul 31, 2012

[Edit: this is a (small) problem which already exists at the moment and requires a minor edit in French documents using csl. The general logic adopted here is not at stake.]

@rmzelle

Member

rmzelle commented Aug 1, 2012

@gracile-fr, unless we find evidence that limiting the use of ordinal suffixes to certain days is desired for languages other than French (anybody know of any?), I'd rather not burden the CSL schema with logic for determining when days get an ordinal.

That said, maybe @fbennett, @inukshuk, and other CSL implementors don't mind adding the exception for "fr" locales? (in which case I would gladly add a note to the specification to this effect).

@fbennett

Member

fbennett commented Aug 1, 2012

If it's a rare exception, that probably is simplest. If others agree, I'm happy to make the adjustment.

@gracile-fr

gracile-fr commented Aug 1, 2012

  1. At the moment, (I think) French styles do not use ordinals in dates, as the probability of needing them is 12/365. Style coders wouldn't even think about it and would use cardinals. This is not a problem in most cases ("dimanche 2 janvier", "lundi 11 juillet"), but it does give "jeudi 1 novembre" instead of "jeudi 1er novembre".

Thus I don't know what the best approach is here: make a general exception when ordinals are used with days and months (ordinalize "1" only) or make an exception when cardinals are used (force the ordinalization of "1"). I'd prefer the second approach.

  2. As for other languages, a quick search reveals that Italian, Portuguese (and also Norman and Occitan ;-)) apply the same rule. In Spanish, this is considered an anglicism. Native speakers would have to confirm, however.

Italian: http://italian.about.com/od/grammar/a/aa042600c_2.htm (bottom of the page)
Portuguese: http://www.easyportuguese.com/Portuguese-Lessons/Ordinal-Number.html (bottom of the page)
French: http://french.lovetoknow.com/Months_of_the_Year_in_French and http://french.stackexchange.com/questions/1553/pourquoi-utilise-t-on-un-ordinal-uniquement-pour-le-premier-du-mois
Spanish: http://spanish.stackexchange.com/questions/1869/what-is-the-correct-way-to-say-the-days-of-a-month
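
As a tiny illustration of the French first-of-the-month rule discussed above, here is a hypothetical helper (not part of any CSL processor) that ordinalizes the day only when it is 1:

```python
def format_french_day(day, month_name):
    """Use the ordinal form only for the first day of the month."""
    day_text = "1er" if day == 1 else str(day)
    return f"{day_text} {month_name}"


print(format_french_day(1, "novembre"))  # 1er novembre
print(format_french_day(2, "janvier"))   # 2 janvier
```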

@rmzelle

Member

rmzelle commented Aug 1, 2012

@gracile-fr, the topic of restricting ordinal suffixes to certain days deserves its own issue. See #99.
