Spec language for multiple numbers #6

Open
bdarcus opened this Issue Mar 31, 2011 · 13 comments

Owner

bdarcus commented Mar 31, 2011

Currently, the specification provides for recognition only of a single number through a cs:number node:

http://citationstyles.org/downloads/specification.html#number

This is a proposal to permit multiple values on cs:number. Proposed language for the specification:

  • If a variable rendered with cs:number contains no characters other than numbers, at least one space separating each number, and optionally one or more connecting comma, hyphen or ampersand characters, the variable is treated as a list of numbers. In this case, the intervening punctuation is ignored, and the list is sorted and rendered with appropriate connecting punctuation (e.g. "4, 3 & 5" rendered with form="ordinal" becomes "3rd-5th").
  • If the variable contains no numbers or no spaces, it is rendered verbatim.
  • In all other cases, the first number encountered is used for rendering (e.g. "12th edition" becomes "12").

This change would permit the use of plural-form labels with the variables edition, volume, and number.


Owner

bdarcus commented Mar 31, 2011

A small adjustment to the proposal:

  • If a variable rendered with cs:number contains no characters other than
    numbers, at least one space separating each number, and optionally one or more
    connecting comma, hyphen or ampersand characters, the variable is treated as a
    list of numbers. In this case, the intervening punctuation is ignored, and the
    list is sorted and rendered with appropriate connecting punctuation (e.g. "4,
    3 & 5" rendered with form="ordinal" becomes "3rd-5th").
  • If the variable contains (1) no numbers, or (2) no spaces, or (3) both
    numbers and spaces, plus characters that are not hyphen, ampersand or comma,
    then it is rendered verbatim (e.g. "1 vol + 1 CD" is rendered as "1 vol + 1
    CD").
  • In all other cases, the first number encountered is used for rendering
    (e.g. "12th edition" becomes "12").

Original Comment By: Frank Bennett
Owner

bdarcus commented Mar 31, 2011

  • Are en-dashes and em-dashes also recognized (in addition to hyphens)?
  • What happens when you sort on variable values that are parsed as number
    lists? Does "4, 3 & 5" (parsed: "3-5") sort before "3, 4 & 6" (parsed: "3, 4,
    6")?
  • I don't understand your "(2) no spaces". In CSL 1.0, "12th" is parsed as
    "12" when cs:number is used, right? If so, I'd like to keep that behavior.

Original Comment By: Rintze Zelle
Owner

bdarcus commented Mar 31, 2011

Re en- and em-dashes, what do you think? I'm open either way.

Re sorting, that's a good point. Using the first number encountered is
probably adequate. I'm not sure whether numeric variables sort numerically in
citeproc-js yet, actually, so that will be the first port of call.

Re (2) no spaces, that's a bad description. Should be something like "numbers
separated solely by non-space characters".


Original Comment By: Frank Bennett
Owner

bdarcus commented Mar 31, 2011

Apparently citeproc-js already does sort numeric variables numerically, and
it's not broken by multiple-value variables. Haven't checked the details, but
it passes a test okay.


Original Comment By: Frank Bennett
Owner

bdarcus commented Mar 31, 2011

As for the parsing of values with a single number, I think it makes sense to
render a value like "4a-c" verbatim. What would be the best logic to detect
something like this? Should we scan for hyphens (or dashes) between the number
and the nearest space (if present), so we still parse "12th Yellow-tailed
Woolly Monkey" as "12"?


Original Comment By: Rintze Zelle
Owner

bdarcus commented Mar 31, 2011

Could do that. Have added to the test for inspection.

http://bitbucket.org/fbennett/citeproc-js/src/tip/tests/fixtures/local/number_EditionOrdinalWithMultiple.txt


Original Comment By: Frank Bennett
Owner

bdarcus commented Mar 31, 2011

Based on Frank's proposal, maybe this would work?

If a variable displayed with cs:number contains both digit and non-digit
characters, an attempt is made to extract the numeric data. If the variable
contains multiple numbers that are separated by spaces (e.g. "2 4"), optionally
with commas (e.g. "2, 4"), ampersands (e.g. "2 & 4") or hyphens/em-dashes/en-
dashes (for number ranges, e.g. "2 - 4"), the numbers are extracted, sorted
and rendered with connecting commas and hyphens in the selected form (e.g. "1,
4, 3 & 5" becomes "1st, 3rd-5th" when rendered with form="ordinal").

Variables that contain (1) no numbers (e.g. "first edition"), (2) a hyphen,
em-dash or en-dash in a word containing at least one digit (e.g. "4a-c"), (3)
two or more numbers without a separating space (e.g. "2a6") or (4) two or more
numbers and any character other than digits, spaces, hyphens, em-dashes,
en-dashes, ampersands or commas (e.g. "1 vol + 1 CD"), are not parsed and are
rendered verbatim. In all other cases, the first number that is encountered is
extracted (e.g. "12" for "12th edition").

Variables can be tested for numeric content with the is-numeric conditional,
e.g. "12th edition" tests "true" whereas "third edition" tests "false" (see
Choose).
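
The rules in this proposal can be sketched in Python as follows. This is only an illustration of the text above, not how citeproc-js or any other processor implements it. The function names (`extract_numbers`, `render_number_list`) are hypothetical, the ordinal suffixes are hard-coded English ones (CSL would take them from the locale), and two points the proposal leaves open are filled in as assumptions: a spaced range such as "2 - 4" is expanded to 2, 3, 4, and only runs of three or more consecutive numbers are collapsed into a range.

```python
import re

def ordinal(n: int) -> str:
    # Assumption: English suffixes; a real CSL processor uses locale terms.
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}" + {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def extract_numbers(value: str):
    """Return a sorted list of ints, or None if the value is rendered verbatim."""
    digit_runs = re.findall(r"\d+", value)
    if not digit_runs:
        return None  # (1) no numbers, e.g. "first edition"
    for word in value.split():
        runs_in_word = re.findall(r"\d+", word)
        if runs_in_word and re.search(r"[\-\u2013\u2014]", word):
            return None  # (2) dash inside a word with a digit, e.g. "4a-c"
        if len(runs_in_word) > 1:
            return None  # (3) numbers not separated by a space, e.g. "2a6"
    if len(digit_runs) == 1:
        return [int(digit_runs[0])]  # first (and only) number, e.g. "12th edition"
    if re.search(r"[^\d\s,&\-\u2013\u2014]", value):
        return None  # (4) several numbers plus other characters, e.g. "1 vol + 1 CD"
    # Several numbers: collect them, expanding "a - b" into the full range (assumption).
    tokens = re.findall(r"\d+|[\-\u2013\u2014]", value)
    numbers = set()
    for i, tok in enumerate(tokens):
        if tok.isdigit():
            numbers.add(int(tok))
        elif 0 < i < len(tokens) - 1 and tokens[i - 1].isdigit() and tokens[i + 1].isdigit():
            low, high = sorted((int(tokens[i - 1]), int(tokens[i + 1])))
            numbers.update(range(low, high + 1))
    return sorted(numbers)

def render_number_list(value: str, form: str = "numeric") -> str:
    numbers = extract_numbers(value)
    if numbers is None:
        return value  # rendered verbatim
    fmt = ordinal if form == "ordinal" else str
    # Group consecutive numbers into [start, end] runs.
    runs = []
    for n in numbers:
        if runs and n == runs[-1][1] + 1:
            runs[-1][1] = n
        else:
            runs.append([n, n])
    parts = []
    for start, end in runs:
        if end - start >= 2:      # assumption: collapse runs of three or more
            parts.append(f"{fmt(start)}-{fmt(end)}")
        elif end > start:
            parts.extend([fmt(start), fmt(end)])
        else:
            parts.append(fmt(start))
    return ", ".join(parts)
```

With these assumptions, `render_number_list("1, 4, 3 & 5", form="ordinal")` gives "1st, 3rd-5th", and "12th Yellow-tailed Woolly Monkey" is still reduced to "12", because the hyphenated word contains no digit.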


Original Comment By: Rintze Zelle
Owner

rmzelle commented Mar 3, 2012

@fbennett, I assume things have changed a bit over time with regard to multiple-number-recognition. Is it much work for you to update us to the current status (or point me to the relevant tests)? (that is, if you think this should go into the spec for 1.0.1)

Member

fbennett commented Mar 3, 2012

A little bit, but it does need a full description. The test linked above currently passes, and covers all the cases I could think of. I've dropped the idea of collapsing sequential numbers to a range, and of inserting commas and ampersands and whatnot; you basically get any credibly numeric string back with ordinalization (or affixes or whatever) applied to the numbers, with the original punctuation joins in place. If it doesn't look like a number, then is-numeric will test false in cs:if, and the string will return unchanged in cs:number.

I can try to write up a description sometime, if it will help.

Owner

rmzelle commented Apr 25, 2012

@fbennett, I've looked at the unit test, and have an amended proposal for the CSL specification. There are some deviations from the test, but I tried to come up with the simplest rule set for the recognition of numbers via cs:number and the is-numeric conditional that still captures most of the behavior in your test. Also, I think the specification shouldn't concern itself with the rescue of crappy metadata. Finally, trying to recognize labels (e.g. "edition" in "2nd edition") seems like a bad idea because of potential localization issues and the risk of overcomplicating stuff.

So, my new rules are:

Variables can be tested for numeric content with the is-numeric conditional. Content is considered numeric if it solely consists of numbers. Numbers may have prefixes and suffixes ("D2", "2b", "L2d"), and may be separated by a comma, hyphen, or ampersand, with or without spaces ("2, 3", "2-4", "2 & 4"). For example, "2nd" tests "true" whereas "second" and "2nd edition" test "false" (see Choose).

If a variable is rendered with cs:number, has numeric content (as determined by the rules for is-numeric) and contains multiple numbers, the content is formatted as:

  • numbers separated by a hyphen are stripped from intervening spaces ("2 - 4" becomes "2-4"). Numbers separated by commas receive a space after the comma ("2,3" and "2 , 3" become "2, 3"), while numbers separated by ampersands receive a space before and after the ampersand ("2&3" becomes "2 & 3").
  • numbers with prefixes or suffixes are never ordinalized or rendered in roman numerals. Numbers without affixes are individually transformed ("2, 3" can become "2nd, 3rd", "second, third" and "ii, iii").
  • cs:label renders the plural ("multiple") form of the term if it uses a number variable with numeric content and multiple numbers ("2nd & 3rd editions")

With these rules, I only get different results for the corner cases:

  • Editions 1–6th --- ‘Editions 1 - 6’ (would become "Editions 1 - 6")
  • 42nd edition --- ‘“42 editionX”’ (would become "“42 editionX”")
  • 42nd–47th editions --- ‘“42 - 47 editionz”’ (would become "“42 - 47 editionz”")
  • 12 13 edition --- ‘12 13’ (would become "12 13")
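
As an illustration, the is-numeric test and the separator and ordinal rules above could look roughly like this in Python. This is a hedged sketch, not the specification's reference behavior or citeproc-js code. The names `is_numeric` and `render_number` are hypothetical, prefixes and suffixes are assumed to be ASCII letters only, the ordinal suffixes are English rather than locale-driven, and the long-ordinal and roman forms are left out.

```python
import re

# Assumption: a "number" is a run of digits with optional ASCII-letter affixes.
NUMBER = r"[A-Za-z]*\d+[A-Za-z]*"
# A comma, ampersand or hyphen, with or without surrounding spaces.
SEPARATOR = r"\s*[,&-]\s*"
NUMERIC_RE = re.compile(rf"{NUMBER}(?:{SEPARATOR}{NUMBER})*")

# Normalized spacing for each separator, following the first bullet.
JOINS = {",": ", ", "&": " & ", "-": "-"}

def is_numeric(value: str) -> bool:
    """True if the content consists solely of numbers and allowed separators."""
    return NUMERIC_RE.fullmatch(value.strip()) is not None

def ordinal(n: int) -> str:
    # Assumption: English suffixes; CSL defines ordinals per locale.
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}" + {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def render_number(value: str, form: str = "numeric") -> str:
    if not is_numeric(value):
        return value  # non-numeric content is returned unchanged
    # Splitting on a captured separator yields [number, sep, number, sep, ...].
    parts = re.split(r"\s*([,&-])\s*", value.strip())
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(JOINS[part])          # separator with normalized spacing
        elif form == "ordinal" and part.isdigit():
            out.append(ordinal(int(part)))   # only affix-free numbers are transformed
        else:
            out.append(part)                 # numbers with affixes stay as they are
    return "".join(out)

def has_multiple_numbers(value: str) -> bool:
    """Whether cs:label would use the plural ("multiple") term form."""
    return is_numeric(value) and len(re.split(SEPARATOR, value.strip())) > 1
```

Under these assumptions, `is_numeric("2nd")` is true while `is_numeric("2nd edition")` is false, `render_number("2 , 3", "ordinal")` gives "2nd, 3rd", and `render_number("2b, 3", "ordinal")` gives "2b, 3rd" because "2b" carries a suffix.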

Owner

bdarcus commented Apr 25, 2012

I don't know what I think of this proposal, but I like the precise spec writing!


Member

fbennett commented Jun 10, 2012

Finally chiming in, sorry for the delay. A few tiny niggles and one suggestion, but the simplicity is good, and I agree on letting bad data lie.

In the first bullet point, "stripped from intervening spaces" should be "stripped of intervening spaces". There might be a slight increase in clarity if "receive a space" were changed to "receive exactly one space" (I found myself skipping back to the input description to be sure that spaces were permitted in input).

Converting a hyphen to en-dash (or whatever the localized range delimiter is) is friendly and good, but there are cases in which an explicit hyphen is desired. I have implemented \- as an escape for that purpose. Not sure if you want that in the specification, but I offer it up for what it's worth.
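
A minimal Python sketch of the escape idea, for illustration only: unescaped hyphens become the range delimiter, and `\-` survives as a literal hyphen. The function name `apply_range_delimiter` and its `range_delimiter` parameter are hypothetical, the en-dash default stands in for whatever the locale defines, and this is not taken from the citeproc-js source.

```python
import re

def apply_range_delimiter(value: str, range_delimiter: str = "\u2013") -> str:
    # Replace every hyphen that is not preceded by a backslash.
    # A function is used as the replacement so the delimiter is inserted literally.
    converted = re.sub(r"(?<!\\)-", lambda _match: range_delimiter, value)
    # Turn the escape sequence "\-" back into a plain hyphen.
    return converted.replace("\\-", "-")
```

For example, `apply_range_delimiter("2-4")` yields "2–4" with an en-dash, while `apply_range_delimiter(r"2\-4")` keeps the explicit hyphen and yields "2-4".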

Owner

rmzelle commented Jun 16, 2012

I took into account Frank's comments and reworked the specification:

ed9c9ec
