Possible bug in title casing that affects journal names #120

Open
jkitchin opened this issue Aug 18, 2022 · 5 comments

Comments

@jkitchin

There is a possible bug in citeproc-hash-itemgetter-from-any related to case conversion that affects the journal name in exported citations. This is related to issue #119.

To reproduce it, I used this bibtex entry:

#+BEGIN_SRC bibtex :tangle test.bib
@article{kitchin-2015-examp,
  author =	 {Kitchin, John R.},
  title =	 {Examples of Effective Data Sharing in Scientific Publishing},
  journal =	 {ACS Catalysis},
  volume =	 {5},
  number =	 {6},
  pages =	 {3894-3899},
  year =	 2015,
  doi =		 {10.1021/acscatal.5b00538},
  url =		 { http://dx.doi.org/10.1021/acscatal.5b00538 },
  keywords =	 {DESC0004031, early-career, orgmode, Data sharing },
  eprint =	 { http://dx.doi.org/10.1021/acscatal.5b00538 },
}
#+END_SRC

Then this helper function:

#+BEGIN_SRC emacs-lisp
(org-babel-tangle)

(defun render (csl-style data sentcase)
  (let* ((proc (citeproc-create csl-style
				(citeproc-hash-itemgetter-from-any "test.bib" sentcase)
				(citeproc-locale-getter-from-dir "../citeproc/csl-locales/")
				"en-US"))
	 (cites (list (citeproc-citation-create :cites data
						:mode 'nil
						:capitalize-first nil
						:suppress-affixes nil)))
	 (rendered-citations (progn (citeproc-append-citations cites proc)
				    (citeproc-render-citations proc 'org nil)))
	 (bib (citeproc-render-bib proc 'org)))
    (car bib)))
#+END_SRC

#+RESULTS:
: render

With no-sentcase-wo-langid at its default value of nil, the journal name is sentence-cased, which I think is not correct. You can protect it with braces, as in {ACS} {C}atalysis, but that should not be necessary.


#+BEGIN_SRC emacs-lisp
;; https://raw.githubusercontent.com/citation-style-language/styles/master/elsevier-harvard.csl
(render "elsevier-harvard.csl" '(((id . "kitchin-2015-examp")
				  (prefix . "See ")
				  (suffix . "")
				  (locator . "2")
				  (label . "page")
				  (suppress-author)))
	nil)
#+END_SRC

#+RESULTS:
: <<citeproc_bib_item_1>>Kitchin, J.R., McDonald, J., 2015. Examples of effective data sharing in scientific publishing. Acs catalysis 5, 3894–3899. https://doi.org/10.1021/acscatal.5b00538

If you set no-sentcase-wo-langid to t, this does not happen:

#+BEGIN_SRC emacs-lisp
(render "elsevier-harvard.csl" '(((id . "kitchin-2015-examp")
                                  (prefix . "See ")
                                  (suffix . "")
                                  (locator . "2")
                                  (label . "page")
                                  (suppress-author)))
        t)
#+END_SRC

#+RESULTS:
: <<citeproc_bib_item_1>>Kitchin, J.R., 2015. Examples of Effective Data Sharing in Scientific Publishing. ACS Catalysis 5, 3894–3899. https://doi.org/10.1021/acscatal.5b00538

@andras-simonyi
Owner

andras-simonyi commented Aug 20, 2022

Thanks for reporting! There are two intertwined issues here: (i) how citeproc-el renders CSL input in the given style and (ii) the bib(la)tex -> CSL preprocessing conversion applied to bib(la)tex input.

As for (i), I've checked the CSL style used in the example (elsevier-harvard.csl) and it doesn't require any case transformation for journal titles, so the output should contain the content of the CSL container-title field verbatim. As far as I can see, that is what is happening.

(ii) is a bit more complicated. As it was discussed concerning several citeproc-el issues (e.g., #81), bib(la)tex expects title fields in title-case with protective braces around proper names, while CSL expects sentence-case in the input. Consequently, full conversion requires a title-case -> sentence-case conversion of all title fields, honoring the case-protecting braces. The complicating factor is language: sentence-casing must be limited to English entries, so citeproc-el decides on the basis of the langid field whether to apply the conversion. The tricky case is when there is no langid field, and this is why the no-sentcase-wo-langid argument was introduced in conversion-related functions. By default, when no-sentcase-wo-langid is nil, entries without langid are assumed to be in English, and the title-case -> sentence-case conversion is applied to them, but this can be changed by using a non-nil value, for instance, for a bibliography in which the default entry language is German.

What's happening in this concrete case is that the CSL rendering doesn't touch the case of the container-title field as required by the style, but without a non-nil no-sentcase-wo-langid argument the preprocessing sentence-cases the journal bib(la)tex field from "ACS Catalysis" to "Acs catalysis" as the entry is assumed to be English and none of the words are protected with braces. With non-nil no-sentcase-wo-langid, on the other hand, the entry is assumed to be not requiring case transformation during the preprocessing and left as it is. All in all the outputs seem to be as they should be. If the elsevier-harvard.csl style should title-case journal titles then it might be worth submitting a bug report to the CSL styles repository (@bdarcus, @denismaier) (but the ACS acronym would have to be brace-protected even then).
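To make the preprocessing step above concrete, here is a minimal sketch of a brace-honoring title-case -> sentence-case conversion together with the langid decision. It is only an illustration under stated assumptions, not citeproc-el's actual code: the `my/sketch-` function names are hypothetical, the list of English langid values is assumed, and details such as subtitles after a colon are ignored.

```emacs-lisp
;; Hypothetical illustration only; NOT citeproc-el's implementation.

(defun my/sketch-sentence-case-p (langid no-sentcase-wo-langid)
  "Return non-nil if an entry should be sentence-cased.
LANGID is the entry's langid value or nil.  The list of English
langid values below is an assumption made for this sketch."
  (if langid
      (and (member (downcase langid) '("english" "american" "british")) t)
    ;; No langid: assume English unless the caller opted out.
    (not no-sentcase-wo-langid)))

(defun my/sketch-sentence-case (title)
  "Sentence-case TITLE, leaving {brace-protected} spans untouched.
The braces themselves are dropped from the result."
  (let ((depth 0)
        (seen-first nil)
        (out '()))
    (dolist (ch (string-to-list title))
      (cond
       ((eq ch ?{) (setq depth (1+ depth)))
       ((eq ch ?}) (setq depth (max 0 (1- depth))))
       (t
        ;; Keep the first character and anything inside braces as-is;
        ;; lower-case everything else.
        (push (if (or (> depth 0) (not seen-first)) ch (downcase ch)) out)
        (setq seen-first t))))
    (concat (nreverse out))))

(my/sketch-sentence-case "ACS Catalysis")      ; => "Acs catalysis"
(my/sketch-sentence-case "{ACS} {C}atalysis")  ; => "ACS Catalysis"
```

With a nil `no-sentcase-wo-langid` and no langid, the first function returns non-nil, so the unprotected journal name goes through the conversion; that matches the "Acs catalysis" output reported above.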

@AlecVercruysse

I'm also having this trouble with journal titles rendering in sentence case in my exports. I don't fully follow your comment above. In particular, it seems to me that the 'journal' field of bibtex entries does not require case-protecting brackets in the journal name. For example (as produced by a Better BibTeX export in Zotero):

@article{millikenFullOnChipCMOS2007,
  title = {Full {{On-Chip CMOS Low-Dropout Voltage Regulator}}},
  author = {Milliken, Robert J. and {Silva-Martinez}, Jose and {Sanchez-Sinencio}, Edgar},
  year = {2007},
  month = sep,
  journal = {IEEE Transactions on Circuits and Systems I: Regular Papers},
  volume = {54},
  number = {9},
  pages = {1879--1890},
  issn = {1549-8328},
  doi = {10.1109/TCSI.2007.902615},
  urldate = {2024-05-02},
  copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
  file = {zotero-links/millikenFullOnChipCMOS2007.pdf}
}

This does not have extra case-protecting brackets in the journal field, but when exported via pandoc (shown below), the case is correct:

<p>Test <span class="citation"
data-cites="millikenFullOnChipCMOS2007">[1]</span> …</p>
<div id="refs" class="references csl-bib-body" role="list">
<div id="ref-millikenFullOnChipCMOS2007" class="csl-entry"
role="listitem">
<div class="csl-left-margin">[1] </div><div class="csl-right-inline">R.
J. Milliken, J. Silva-Martinez, and E. Sanchez-Sinencio, <span>“Full
<span>On-Chip CMOS Low-Dropout Voltage Regulator</span>,”</span>
<em>IEEE Transactions on Circuits and Systems I: Regular Papers</em>,
vol. 54, no. 9, pp. 1879–1890, Sep. 2007, doi: <a
href="https://doi.org/10.1109/TCSI.2007.902615">10.1109/TCSI.2007.902615</a>.</div>
</div>
</div>

Considering this, I would think that (i) the org biblatex -> CSL conversion of the journal field should not alter the case, and that (ii) org-citeproc rendering of the CSL should preserve case as well. However, when I modify citeproc-blt-entry-to-csl to not alter the case (i.e., to return the field (container-title . "IEEE Transactions on Circuits...") rather than perform the sentence-case conversion it currently does), I still end up with sentence-cased exports of the journal. This makes me think that neither (i) nor (ii) is doing what it should. As a workaround, when I add <span class="nocase"></span> tags around the whole journal name in the CSL conversion, the export behaves correctly.

I must be missing something... thank you for the help!


(For reference, here's the tex doc I used to test rendering in pandoc):

\documentclass{article}
\usepackage{biblatex}
\addbibresource{../../Documents/cal/cal.bib}
\begin{document}
Test \cite{millikenFullOnChipCMOS2007} \ldots
\printbibliography{}
\end{document}

@andras-simonyi
Owner

andras-simonyi commented May 2, 2024

Hello, as I tried to explain earlier (see also the discussion concerning issue #159), the main problem is that CSL expects its input to be sentence-cased, which is why citeproc-el does the conversion. E.g., the latest CSL specification says that

CSL processors don’t recognize proper nouns. As a result, strings in sentence case can be accurately converted to title case, but not vice versa. For this reason, it is generally preferable to store strings such as titles in sentence case, and only use text-case if a style desires another case.

Sentence case conversion is deprecated and will be removed in a future version.

Since the default case is, in effect, sentence case, CSL styles should explicitly use the text-case attribute to indicate if they want to set some text in title-case. The cleanest solution is, accordingly, to modify the CSL style you are using. If that is not an option then you can try some workarounds, e.g., the itemgetter ("data reader") currently used by Org-mode, citeproc-hash-itemgetter-from-any has an optional parameter no-sentcase-wo-langid, which turns off sentence casing titles of entries that do not have a langid field, so you could easily advise this function to always set this parameter to t.
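The advice workaround mentioned above could look roughly like the following sketch. It assumes that citeproc-hash-itemgetter-from-any takes the bibliography file(s) as its first argument and no-sentcase-wo-langid as its optional second argument, as in the render helper earlier in this thread; the function name `my/citeproc-force-no-sentcase` is hypothetical.

```emacs-lisp
;; A minimal sketch, assuming the argument order
;; (FILE-OR-FILES &optional NO-SENTCASE-WO-LANGID).

(defun my/citeproc-force-no-sentcase (args)
  "Return ARGS with the NO-SENTCASE-WO-LANGID argument forced to t.
Intended as :filter-args advice, which receives the original
argument list and returns the list actually passed on."
  (list (car args) t))

(advice-add 'citeproc-hash-itemgetter-from-any
            :filter-args #'my/citeproc-force-no-sentcase)

;; To undo the workaround later:
;; (advice-remove 'citeproc-hash-itemgetter-from-any
;;                #'my/citeproc-force-no-sentcase)
```

Note that this only affects entries without a langid field; entries explicitly marked as English would presumably still be sentence-cased.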

@AlecVercruysse

Thanks for the quick response and your patience with me, I'm still learning CSL.

I've tried using the online CSL visual editor (editor.citationstyles.org) to change the style manually. I manually exported CSL json from Zotero to use as an example citation:

[
    {
        "id": "millikenFullOnChipCMOS2007",
        "accessed": {
            "date-parts": [
                [
                    "2024",
                    5,
                    2
                ]
            ]
        },
        "author": [
            {
                "family": "Milliken",
                "given": "Robert J."
            },
            {
                "family": "Silva-Martinez",
                "given": "Jose"
            },
            {
                "family": "Sanchez-Sinencio",
                "given": "Edgar"
            }
        ],
        "citation-key": "millikenFullOnChipCMOS2007",
        "container-title": "IEEE Transactions on Circuits and Systems I: Regular Papers",
        "container-title-short": "IEEE Trans. Circuits Syst. I",
        "DOI": "10.1109/TCSI.2007.902615",
        "ISSN": "1549-8328",
        "issue": "9",
        "issued": {
            "date-parts": [
                [
                    "2007",
                    9
                ]
            ]
        },
        "license": "https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html",
        "page": "1879-1890",
        "source": "DOI.org (Crossref)",
        "title": "Full On-Chip CMOS Low-Dropout Voltage Regulator",
        "type": "article-journal",
        "URL": "http://ieeexplore.ieee.org/document/4303304/",
        "volume": "54"
    }
]

Of note, the container-title (and title as well) field in this Zotero export uses unmodified casing, not sentence casing. If I understand your answer, the implication is that the Zotero implementation of the CSL export is non-compliant with the recommendation.

When I edit the default ieee.csl, I see that the text-case attribute at the path Layout > Conditional > If article-journal > Group > container-title (variable) is empty by default. Again, if I understand your point correctly, this would imply that ieee.csl is non-compliant with the recommendation made in the CSL standard (since the IEEE citation standard does not want journal titles to be in sentence case). Regardless, when I explicitly set text-case="title" in the container-title layout specification, the preprocessor still sentence-cases the original input text, so I end up with e.g. "Ieee" instead of "IEEE" in the journal name of my org export.

I understand that if citeproc-el did not attempt to manually convert fields to sentence case, it would be impossible to render a style that required a sentence-cased journal title. Since neither the Zotero nor Pandoc CSL exporters do this, however, this feels like an option that does not need to be enabled by default. Furthermore, citeproc-el suffers from the same issue as the CSL processors used to, in that it cannot recognize proper nouns. If citeproc-el did not attempt the sentence-case conversion on its own, its behaviour would be consistent with other CSL export tools, which I feel would be the best option.

Again, I appreciate your patience with me, and you've clearly been thinking about these questions longer than I have, but this is my two cents. It seems like many of the current issues submitted to citeproc-el revolve around people not getting the casing they expect, and I wonder if making the behavior consistent with the other CSL exporters (Zotero, Pandoc) would resolve this. Please let me know if I'm misunderstanding something!


N.b. it seems like APA requires sentence-cased titles. I tried a Zotero export directly to RTF in APA 7th ed. of a textbook, and it exported in title case. Their recommendation about this is documented here. In short, they make no attempt to perform a sentence case conversion by default and instead ask that the Zotero entries themselves be stored in sentence case. It seems reasonable for citeproc-el to do the same.

@andras-simonyi
Owner

andras-simonyi commented May 7, 2024

Thanks for your comments! I think that it should be taken into consideration that, similarly to Zotero, citeproc-el doesn't do any case conversion on CSL input -- if you were using a CSL-JSON bibliography then it would simply assume that it is already in the required sentence case. The problem is that we are talking about bib(la)tex input, for which the casing expectation is the exact opposite of CSL's: the values of title fields should be in title case, with protecting brackets around strings that should not be touched during case conversion (e.g., "IEEE"). See the long discussion (partly with field experts) about issue #71 for some details.

The long and short of it is that by default (i) bib(la)tex assumes that title fields in the input are in title case with protective brackets if necessary, while (ii) CSL processors assume that title fields are in sentence case, so the default behaviour for a CSL processor in case of bib(la)tex input should be conversion to sentence case, and this is what citeproc-el does. (I don't see how anything else wouldn't lead to users shooting themselves in the foot when trying to use their bibliography files both with bib(la)tex and CSL processors and styles.) A complication is that all of these assumptions should apply only to English entries, which is why citeproc-el already provides a way to limit this behaviour to entries with explicit English langid values. This option can also be abused to stop citeproc-el from converting English entries that do not have a langid.

Having said all that, I don't want to be dogmatic about this issue (which is indeed a recurring one), so I plan to introduce a dedicated user option in Org to disable title field case conversion for langid-less bib(la)tex bibliography entries (it would simply expose the already existing citeproc option). What do you think about this solution?

PS. As far as I can see, Pandoc's bib(la)tex->CSL converter also performs sentence case conversion by default; the difference is that it skips fully capitalized words like "IEEE" (see also jgm/pandoc-citeproc#269).
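As a rough illustration of that difference, a sentence-casing variant that leaves fully capitalized words alone might be sketched as follows. This is neither Pandoc's nor citeproc-el's actual implementation; the function name is hypothetical, words are split naively on spaces, and brace protection is ignored.

```emacs-lisp
;; Hypothetical illustration only.

(defun my/sketch-sentence-case-keep-acronyms (title)
  "Sentence-case TITLE but leave fully capitalized words untouched.
A word counts as fully capitalized here if it consists of two or
more upper-case letters and nothing else."
  (let ((case-fold-search nil)   ; make [[:upper:]] case-sensitive
        (firstp t))
    (mapconcat
     (lambda (word)
       (prog1
           (cond
            ((string-match-p "\\`[[:upper:]]\\{2,\\}\\'" word) word)
            (firstp word)          ; the first word keeps its case
            (t (downcase word)))
         (setq firstp nil)))
     (split-string title " ")
     " ")))

(my/sketch-sentence-case-keep-acronyms "ACS Catalysis")  ; => "ACS catalysis"
```

The acronym survives without braces, but the rest of the journal name is still lower-cased, so a style that wants title-cased journal names would still need text-case="title".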
