Independent formatting of cs:number variable and ordinal suffix #49

Closed
fbennett opened this Issue May 24, 2011 · 25 comments

Member

fbennett commented May 24, 2011

This issue has been raised on xbiblio-devel here, and on the Zotero forums here.

As a possible solution for discussion, an approach similar to the cs:et-al node could be used:

<number variable="edition" form="ordinal"> 
  <ordinal-suffix vertical-align="sup"/> 
  <numeral font-weight="bold"/> 
</number> 

With numeral being required, and ordinal-suffix being optional. The sub-elements would not affect ordering, but would permit independent styling of the elements (so if form= was not ordinal, ordinal-suffix in the example above would have no effect). If number is a singleton, it would just work the way it does now.
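To make the intended behavior concrete, here is a minimal sketch in Python of how a processor might style the two parts independently. It is only an illustration of the proposal, not code from any existing processor: `apply_format` and `render_ordinal` are hypothetical names, the output is assumed to be HTML, and the ordinal is assumed to arrive already localized (e.g. "1st").

```python
import re

def apply_format(text, attrs):
    """Hypothetical helper: wrap text according to CSL-style formatting attributes."""
    if attrs.get("vertical-align") == "sup":
        text = f"<sup>{text}</sup>"
    if attrs.get("font-weight") == "bold":
        text = f"<b>{text}</b>"
    return text

def render_ordinal(value, numeral_attrs, suffix_attrs=None):
    """Split a localized ordinal such as '1st' and style numeral and suffix separately."""
    match = re.fullmatch(r"(\d+)(\D*)", value)
    if match is None:
        return value  # not a plain number: leave it untouched
    numeral, suffix = match.groups()
    out = apply_format(numeral, numeral_attrs)
    if suffix:
        # The suffix formatting only matters when a suffix is actually present.
        out += apply_format(suffix, suffix_attrs or {})
    return out

print(render_ordinal("1st", {"font-weight": "bold"}, {"vertical-align": "sup"}))
# prints: <b>1</b><sup>st</sup>
print(render_ordinal("2", {"font-weight": "bold"}, {"vertical-align": "sup"}))
# prints: <b>2</b>
```

The second call corresponds to the non-ordinal case: with no suffix in the rendered value, the ordinal-suffix formatting has no effect.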

Owner

bdarcus commented May 24, 2011

So just to back up a bit, the use case is that someone wants to specify that the "st" in "1st" gets superscripted.

One suggestion on that zotero thread was to allow HTML markup in the locale term itself.

Here you're suggesting adding two new elements (numeral and ordinal-suffix) to handle this in the style templates instead.

I prefer the second approach myself.

Any objections?

One question though: what's the significance of the explicit lack of order? Clearly you're intending it work similarly as names, but is there any other reason?

Just a quick comment. We need superscript also for "issue": "no" (singular) and "nos" (plural) (same for folio actually). That might be harder than ordinals because the "n" of "no" must not be superscripted!

Member

fbennett commented May 24, 2011

Ah, that's a tough one. To do that within the specification, without resorting to escaped markup, I guess you could do something like this:

<choose>
  <if variable="issue" match="any">
    <group delimiter=" ">
      <group>
        <text value="n"/>
        <text value="o" vertical-align="sup"/>
      </group>
      <text variable="issue"/>
    </group>
  </if>
</choose>

You lose localization that way, though. Pluralization too, since we have no is-plural test attribute.

Member

fbennett commented May 24, 2011

Another approach might be to allow formatting declarations on term substrings:

<locale>
  <terms>
    <term name="issue">
      <single>
        <term-part>n</term-part>
        <term-part vertical-align="sup">o</term-part>
      </single>
      <multiple>
        <term-part>n</term-part>
        <term-part vertical-align="sup">os</term-part>
      </multiple>
    </term>
  </terms>
</locale>

This would require significant reimplementation in the processors, but it should cover all known use cases.

Owner

bdarcus commented May 24, 2011

If we were going to go that way (loath as I am to do so), we might as well add a general-purpose structure (usable for data as well), something like:

csl-span = element span { formatting-attributes, text }
Member

fbennett commented May 25, 2011

I should probably add that, although I wrote a couple of the proposals above, I'm not particularly enthusiastic about any of them. What we have in place at the moment works, and while embedding escaped rich text formatting in a term is not pretty, given that the files in which it might be used are very closely tied to our specific implementations of CSL, I don't see that convention as particularly subversive. I'll certainly follow any consensus that emerges, though.

Owner

bdarcus commented May 25, 2011

Using escaped content is about far more than just aesthetics, and notwithstanding other details, I don't see that as a reasonable solution.

If we accept the use case that people can use localized terms that are more than plain text, and which can contain local (to the term) formatting details like superscripts, then:

  1. It seems to me the easiest, most flexible, way to do that is to change the content model for terms from text to text | rich-text and then to define rich-text.
  2. Given that we need rich-text for some data fields, we might as well kill two birds with one stone (e.g. a common rich-text pattern)
  3. Given previous requests from Sylvester to recast how we deal with rich text in the test suite, it seems to me the solution is what I suggest, which is just a variation on Frank's third proposal.
<locale>
  <terms>
    <term name="issue">
      <single>
        <span>n</span>
        <span vertical-align="sup">o</span>
      </single>
      <multiple>
        <span>n</span>
        <span vertical-align="sup">os</span>
      </multiple>
    </term>
  </terms>
</locale>

The only question I have is whether we allow mixed content:

<single>n <span vertical-align="sup">os</span></single>

Also, there may be some issues with namespaces to consider; if we go with a new span element, is it:

  1. in the CSL namespace?
  2. in the XHTML namespace?
  3. sans namespace

Each of them (but particularly 3) presents issues when dealing with style instances that use default namespaces.
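As a rough illustration of why option 3 is the awkward one, the sketch below uses Python's standard `xml.etree.ElementTree` to show which namespace a `span` ends up in under each option when the enclosing document declares the CSL namespace as its default. The helper `span_tags` and the three sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET

CSL_NS = "http://purl.org/net/xbiblio/csl"
XHTML_NS = "http://www.w3.org/1999/xhtml"

def span_tags(xml_text):
    """Return the namespace-qualified tag of every element whose local name is 'span'."""
    root = ET.fromstring(xml_text)
    return [el.tag for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "span"]

# Option 1: an unprefixed span silently inherits the default (CSL) namespace.
in_csl = f'<single xmlns="{CSL_NS}"><span>n</span></single>'
# Option 2: the span is placed in the XHTML namespace through a prefix.
in_xhtml = f'<single xmlns="{CSL_NS}" xmlns:h="{XHTML_NS}"><h:span>n</h:span></single>'
# Option 3: "no namespace" means undeclaring the default on every span.
no_ns = f'<single xmlns="{CSL_NS}"><span xmlns="">n</span></single>'

print(span_tags(in_csl))    # ['{http://purl.org/net/xbiblio/csl}span']
print(span_tags(in_xhtml))  # ['{http://www.w3.org/1999/xhtml}span']
print(span_tags(no_ns))     # ['span']
```

In a file with a default namespace, a bare `<span>` is never namespace-free, so option 3 would force authors to write `xmlns=""` on each element.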

In any case, curious what other implementors think about all this.

Member

inukshuk commented May 26, 2011

I think Frank's third proposal is quite flexible; using the tag name 'span' may suggest that other HTML input is possible. If we do not allow mixed content and each term-part or span accepts the normal CSL formatting attributes this change may actually not be too difficult to add to processors.

We should also consider whether or not this change affects the proposed gender-specific ordinals (I believe it does not).

Owner

rmzelle commented May 26, 2011

Why is mixed content problematic? Because of possible nesting?

Member

inukshuk commented May 26, 2011

Sorry, I should have elaborated on that: it's not problematic by itself, I was only thinking about possible implementation strategies: if there is no mixed content but simply one or more term-part elements that may each contain formatting attributes, the transformation would be more straightforward.
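To sketch the difference (in Python with the standard `xml.etree.ElementTree`, purely as an illustration; the function names and the unnamespaced sample terms are assumptions): with element-only content a reader can just walk the children, whereas with mixed content it also has to pick up the text that sits before and between the child elements.

```python
import xml.etree.ElementTree as ET

def runs_element_only(xml_text):
    """Read a term form that contains only child elements; loose text is ignored."""
    node = ET.fromstring(xml_text)
    return [(child.text or "", child.get("vertical-align")) for child in node]

def runs_mixed(xml_text):
    """Read a term form that may interleave plain text with child elements."""
    node = ET.fromstring(xml_text)
    runs = []
    if node.text:  # text before the first child
        runs.append((node.text, None))
    for child in node:
        runs.append((child.text or "", child.get("vertical-align")))
        if child.tail:  # text following this child
            runs.append((child.tail, None))
    return runs

element_only = '<single><span>n</span><span vertical-align="sup">o</span></single>'
mixed = '<single>n<span vertical-align="sup">o</span></single>'

print(runs_element_only(element_only))  # [('n', None), ('o', 'sup')]
print(runs_mixed(mixed))                # [('n', None), ('o', 'sup')]
print(runs_element_only(mixed))         # [('o', 'sup')]  (the leading "n" is lost)
```

A mixed-content reader also has to decide whether whitespace between elements is significant or just indentation, which the element-only form avoids.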

Incidentally, is there mixed content anywhere else in CSL?

Owner

rmzelle commented May 26, 2011

Owner

bdarcus commented May 26, 2011

So let's back up. Do we agree we want a single rich text content model for both data and locales? Or two separate models?

Owner

rmzelle commented May 26, 2011

FWIW, so far, the only demand for rich text markup in locales has been superscripting.

Member

fbennett commented May 26, 2011

"Either way, explain your choice."

If that's directed at me, I withdrew that comment when I thought again about the issue; it's obviously not particularly tough to handle. (It's late here, please make allowances.)

Owner

bdarcus commented May 26, 2011

No, not directed at you. Just trying to narrow down the issues.

Owner

bdarcus commented May 26, 2011

@inukshuk - as I said, the idea I'm proposing here is to simplify things by having one rich-text pattern, with just a single element, and no complex nesting. That it looks like HTML is precisely the point, so that it can be used in input data as well.

Member

fbennett commented May 26, 2011

Actually, my comment last night may not have been completely off the mark.

  • Locale terms hit the processor as XML.
  • Input currently arrives as string data.

The parser applied to string data is a completely separate, more forgiving thing from the implementation platform's native XML parser. It needs to be, in order to avoid crashing the processor when user input contains mismatched tags or some other infelicity. Because the input markup parser is more forgiving, aligning locale terms with input markup is no big deal; in citeproc-js, I would just serialize everything below cs:single, cs:multiple or cs:term, and run it through the string parser when generating output (which is pretty much what happens now).
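The contrast between the two parsers can be sketched as follows. This is not the citeproc-js parser; it is a small Python stand-in built on the standard library's `html.parser`, with a made-up set of recognized tags, shown next to a strict XML parse of the same string.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class ForgivingMarkup(HTMLParser):
    """Collect (text, active_tags) runs, tolerating unclosed or stray tags."""
    KNOWN = {"i", "b", "sup", "sub"}  # assumed tag set for this sketch

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KNOWN:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close this tag and anything left open inside it.
            while self.stack.pop() != tag:
                pass
        # A closing tag that was never opened is simply ignored.

    def handle_data(self, data):
        self.runs.append((data, tuple(self.stack)))

def parse_forgiving(text):
    parser = ForgivingMarkup()
    parser.feed(text)
    parser.close()
    return parser.runs

def parses_as_xml(text):
    """True if the string is well-formed once wrapped in a root element."""
    try:
        ET.fromstring(f"<root>{text}</root>")
        return True
    except ET.ParseError:
        return False

bad = "n<sup>o</sup> 5</i>"   # stray closing tag, as user input might contain
print(parses_as_xml(bad))     # False
print(parse_forgiving(bad))   # [('n', ()), ('o', ('sup',)), (' 5', ())]
```

The strict parser rejects the string outright, while the forgiving one still yields usable runs, which is why serializing locale terms and feeding them to the string parser is the easy direction.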

Things might be interesting when running in the other direction. Would you run into bad-data issues when converting records containing text-level markup to MODS? (Just a thought -- I don't know anything about MODS other than that it's an XML schema.)

Member

inukshuk commented May 27, 2011

Generally, I think that specifying what goes into input data (citation data, locale term values etc.) is a bad thing and it should be avoided if that is at all possible. By that, I do not mean that rich text content should be prohibited, but I think a CSL processor specification should be format-agnostic. Imagine someone using a CSL processor to produce LaTeX for example: in that case the citation data would probably be generated from BibTeX files and the values may contain LaTeX directives which the CSL processor ought not to touch.

Of course, HTML is the most important output format right now, but the CSL specification should not prohibit or penalize other formats. If CSL expects users to provide input data in a certain format this opens up many potential compatibility issues. Don't get me wrong, it is great that citeproc-js allows formatting of input values, and given its use in Zotero HTML is obviously the best choice as a format; however by putting a content model based on a given format into the specification we would essentially force all implementations to use that format.

I guess the salient point I'm trying to make is, a rich text content model is extremely useful; if it is not absolutely required, I would try not to make it part of the standard. (Perhaps it could be a recommendation?)

By the way, it just occurred to me that since CSL supports Unicode input, it should even be possible to superscript terms using Unicode superscript characters (not that this would make up for the lack of a rich text content model).

Member

fbennett commented May 27, 2011

Unfortunately, a user reports that many of the superscripted Unicode characters that would commonly be required do not render correctly in Word.

But wait ... that said, it is a limited set of characters, and if CSL specified those characters as the sole means of achieving superscripting in locale terms, a processor could easily convert them to a more easily digestible form, for broken applications that can't cope with the real thing, as it were. That actually might be the happiest solution of all. As Rintze notes, the only demand so far has been for superscripting, and relying on Unicode would kick the problem of markup down the road a bit, allowing more time to gather evidence of requirements and reflect -- and maybe we'll get lucky and not have to do anything. :)
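A minimal sketch of such a conversion, in Python: rather than a hand-written table, it leans on the `<super>` compatibility decompositions in the Unicode character database (via the standard `unicodedata` module) to find the plain equivalent of each superscript character, and wraps consecutive runs in `<sup>`. The function names and the choice of HTML output are assumptions for illustration; a real processor would more likely use an explicit whitelist, since this approach also rewrites characters such as the trade mark sign.

```python
import unicodedata

def superscript_base(ch):
    """Return the plain equivalent of a Unicode superscript character, or None."""
    parts = unicodedata.decomposition(ch).split()
    if parts and parts[0] == "<super>":
        return "".join(chr(int(code, 16)) for code in parts[1:])
    return None

def unicode_sup_to_markup(text):
    """Replace runs of Unicode superscript characters with <sup>...</sup> markup."""
    out = []
    run = []
    for ch in text:
        base = superscript_base(ch)
        if base is not None:
            run.append(base)
            continue
        if run:  # a superscript run just ended
            out.append("<sup>" + "".join(run) + "</sup>")
            run = []
        out.append(ch)
    if run:
        out.append("<sup>" + "".join(run) + "</sup>")
    return "".join(out)

print(unicode_sup_to_markup("n\u1d52 5"))       # n<sup>o</sup> 5
print(unicode_sup_to_markup("n\u1d52\u02e2 5"))  # n<sup>os</sup> 5
print(unicode_sup_to_markup("1\u02e2\u1d57"))    # 1<sup>st</sup>
```

Applications that render the Unicode characters correctly could skip the conversion and pass the term through unchanged.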

Owner

bdarcus commented May 27, 2011

Alright, I see the point on the input format. But what about the test suite (where input does need to be normalized)?

@fbennett - what would be the upshot of the unicode-only route vis-a-vis the CSL spec? Are you just suggesting doing nothing during that "reflection" period?

Member

fbennett commented May 27, 2011

If Sylvester's view is adopted, the input markup would be outside the specification, and would be considered an implementation-level extension. Tests that include rich text markup would be separated from the main body of fixtures, to make it clear that they reflect behavior outside of the core requirements.

For locale term strings, I would be happy with a note in the specification to the effect that implementations should not rely on implementation-specific markup in locale term strings. That would flag the approach that Gracile and I worked out as unwise but not outright illegal, which is sufficient. The remaining alternative for us would be to use Unicode. That produces output that breaks in some environments (because the characters are apparently not supported in some commonly circulated fonts), but it would be up to the processor to fix that.

How does all that sound?

Owner

rmzelle commented May 27, 2011

So once you add Unicode-conversion to citeproc-js, we can change the locale files to use Unicode (e.g. https://github.com/citation-style-language/locales/blob/master/locales-fr-FR.xml currently uses embedded markup)?

Member

fbennett commented May 28, 2011

Yes. Safe conversion of known Unicode superscript characters is now implemented in citeproc-js (at version 1.0.175), using a unicode.org document for reference. The result of conversion is reflected in a new test fixture. I'm not sure if the test should be treated as a member of the standard set or as an extension, but it's there, at least, and can be classified when we get around to housecleaning.

Member

inukshuk commented May 28, 2011

+1 on Frank's suggestion. I think it is important for the specification to allow room for processors to utilize extra information in the input data, but not force them to rely on it.

As regards the test-suite, it is important to mark cases that rely on a specific input format. Obviously, we want to retain all these tests so that we can keep on using them for processors which support the format.

Another option could be to include two alternative results in the test case: one for processors that work some magic on formatted input and one for processors that treat the formatting directives as regular input. But I think the first approach is much cleaner.

Member

fbennett commented May 30, 2011

Simon has picked up 1.0.175 for use in the upcoming Zotero 2.1.7 release. I've taken the markup out of the French locale, and replaced it with Unicode equivalents. Kudos to Bruce for sticking to his guns on this; we have a much cleaner chain for it.

Closing this ticket.

fbennett closed this May 30, 2011
