closedown biblatex-csl-converter #7

Open
johanneswilm opened this issue Nov 12, 2019 · 21 comments

@johanneswilm

Hey, I just discovered this chart. I have been participating in the maintenance of biblatex-csl-converter over the past few years. Based on your chart it looks like Idea (reworked) gives the same output quality as biblatex-csl-converter. Does that mean that it can be used as a drop in replacement and that it covers all the same features? If that is the case, is there any reason why I would continue to maintain biblatex-csl-converter?

@larsgw
Member

larsgw commented Nov 12, 2019

Does that mean that it can be used as a drop in replacement and that it covers all the same features?

Probably not, the Syntax column is a big simplification. The whole chart is meant as a way to compare different parsers that could replace the current one, so they are only compared on features the current one had or that I wanted for the new one. Some feature differences in idea-reworked, compared to biblatex-csl-converter:

  • different API, functions instead of classes
  • no async function
  • worse error and warning handling: no API for warnings, no error recovery
  • no field checking (although I'm planning to)
  • no EDTF, names, etc. support yet but that is definitely planned as well (however, as part of bibtex-mapping)

So it definitely isn't a drop in replacement, as the API is quite different, and depending on your needs it may not be possible at all to switch.

@johanneswilm
Author

Ok, I understand. So "complete" doesn't mean "feature complete" but rather "completely covers what the other parser did"? Maybe that could be added somewhere, as otherwise it looks a bit misleading, and users who may be better off using one of the other parsers are led to believe that they shouldn't. I'd prefer not to have to set up a different chart making counterclaims, etc. Speed isn't much of a concern for Fidus Writer's use case of biblatex-csl-converter, as it's totally fine to wait 250 ms for a single citation to be converted, and even up to several minutes if a user uploads their entire mega collection, as processing will happen entirely on that user's machine.

Accuracy is more important, and so is keeping maintenance costs down. So if there is another parser that can do exactly the same but is maintained by someone else, I'd like to shut down biblatex-csl-converter. And if there isn't one, then I'd like everyone else out there who needs the same functionality to contribute to biblatex-csl-converter so that we don't have to do all the maintenance by ourselves. That's why it would be nice to make sure people aren't misled by that chart somehow.

And yes, please once you think your parser or one of the other ones covers all the features, let me know and I can see whether it still makes sense to put an AST converter on top and drop biblatex-csl-converter altogether.

@larsgw
Member

larsgw commented Nov 12, 2019

"complete" means nothing more and nothing less than that it parses syntax.bib accurately, which encompasses all the syntax I had in mind for the new parser (apart from syntax within values).

Maybe that could be added somewhere, as otherwise it looks a bit misleading, and users who may be better off using one of the other parsers are led to believe that they shouldn't. I'd prefer not to have to set up a different chart making counterclaims, etc.

That's fair, I just didn't really intend this repository for other users to make choices with. What's missing from the description is "the new BibTeX parser formula for Citation.js". And the comparisons were there either because I wanted to see if my new parser was up to the task, or because someone asked me to add it to the comparison. But I definitely see where you're coming from, and you're not the only one, so I'll change it up and also add more detailed comparisons.

I can see whether it still makes sense to put an AST converter on top

I'm not really sure what you mean by this. How is an AST converter "on top", and if you'd be dropping biblatex-csl-converter, what would it be on top of?

@johanneswilm
Author

But I definitely see where you're coming from, and you're not the only one, so I'll change it up and also add more detailed comparisons.

Thank you very much for that. And yes, just a little bit of wording so that others understand what the purpose of the chart is and that it's not a full feature comparison of everything is all that I'm asking for. The comparison is still quite interesting.

I'm not really sure what you mean by this.

Sorry, let me reword. Currently biblatex-csl-converter outputs exactly the JavaScript object format we use internally in Fidus Writer. So if we switch to something else, then we'll probably need that parser plus a converter from the output of that parser to the format we use internally in Fidus Writer. So there would be a bit of development cost in creating this converter. That's all I was trying to say.

@retorquere
Contributor

I don't mean to pile on just to be antagonistic, but idea-reworked parses syntax.bib (which is invalid BTW -- biblatex chokes on it) into

[
  {
    type: 'book',
    label: 'sweig42',
    properties: {
      author: "Stefan Swe{\\i}g and Xavier D\\'ecoret",
      title: ' The {impossible} ℡—book ',
      publisher: ' D\\"ead Poₑeet Society',
      year: 1942,
      month: '03'
    }
  }
]

I don't know if I'm calling it wrong:

const parser = require('./lib/idea-reworked')
const fs = require('fs')
console.log(parser.parse(fs.readFileSync('test/files/syntax.bib', 'utf-8')))

but it doesn't seem to do diacritics replacement, anything with braces, and for the subscript interpretation it just picks up the first character. Also, biblatex ignores leading and trailing spaces so title and publisher should have been trimmed. And TEL is superscript?

@retorquere
Contributor

Wait, I got that wrong -- syntax.bib has double backslashes in the text, so it's not supposed to do diacritics conversions as there are none. Anyhow, that still leaves braces, subscript and superscript, and trimming.

@larsgw
Member

larsgw commented Nov 13, 2019

which is invalid BTW -- biblatex chokes on it

natbib should not, at least the last time I checked.

  • The double backslashes are by mistake, I'll fix them.
  • I thought I did trimming, but I'll fix that as well.
  • For superscript and subscript, I implemented it like that specifically but I don't know why. I'm converting them to Unicode characters which has limited support, but I think CSL supports <sup> and <sub> markup.
  • TEL gets converted to the corresponding Unicode character in Zotero, which is where I got a lot of stuff from in the first version, and I kept it that way.

@retorquere
Contributor

which is invalid BTW -- biblatex chokes on it

natbib should not, at least the last time I checked.

Fair enough, it does.

* For superscript and subscript, I implemented it like that specifically but I don't know why. I'm converting them to Unicode characters which has limited support,

But that doesn't apply here -- a unicode subscript e does (clearly) exist, the parser just doesn't convert the other two es.

but I think CSL supports <sup> and <sub> markup.

It does. My parser converts to unicode sub/superscript where possible and uses <sup> and <sub> where that's not possible.

* `TEL` gets converted to the corresponding Unicode character in Zotero, which is where I got a lot of stuff from in the first version, and I kept it that way.

I don't really follow -- in syntax.bib I see TEL as \u54\u45\u4C, after conversion it shows up as \u2121. The TEL in the input isn't a single character, it's a word, and title casing by a CSL style is going to affect it differently.

@larsgw
Member

larsgw commented Nov 13, 2019

• I found just transforming the first character (if it's supported) more consistent than to create a string with part sub/superscript and part normal text
• Regarding TEL: that's the point (well, not the title casing) https://github.com/zotero/translators/blob/bae2057067e2fde076252a3b897a7e689a173c71/BibTeX.js#L1707

@retorquere
Contributor

• I found just transforming the first character (if it's supported) more consistent than to create a string with part sub/superscript and part normal text

$_{eee}$ should become either ₑₑₑ or <sub>eee</sub>, not ₑee. The braces mean that the entire string is subscript.
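A minimal sketch of that "whole group or markup" rule, for illustration only. The function name `toSubscript` and the partial character table are assumptions, not code from either parser; the fallback to `<sub>` follows the behaviour described earlier in the thread.

```javascript
// Partial table of characters that have a Unicode subscript form.
// Illustrative only: a real table would cover more characters.
const SUBSCRIPT = new Map([
  ['a', '\u2090'], // subscript a
  ['e', '\u2091'], // subscript e
  ['o', '\u2092'], // subscript o
  ['x', '\u2093']  // subscript x
])
// Digits 0-9 map to U+2080..U+2089.
for (let d = 0; d <= 9; d++) {
  SUBSCRIPT.set(String(d), String.fromCharCode(0x2080 + d))
}

// Convert the full content of _{...}: use Unicode subscripts only when
// every character can be mapped, otherwise fall back to <sub> markup so
// the result is never half subscript and half normal text.
function toSubscript (text) {
  const chars = Array.from(text)
  const allMappable = chars.length > 0 && chars.every(c => SUBSCRIPT.has(c))
  return allMappable
    ? chars.map(c => SUBSCRIPT.get(c)).join('')
    : `<sub>${text}</sub>`
}
```

With this table, `toSubscript('eee')` yields three subscript e characters, while `toSubscript('eeb')` falls back to `<sub>eeb</sub>` because `b` has no entry in the table. Escaping of the text inside the markup is left out of the sketch.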

• Regarding TEL: that's the point (well, not the title casing) https://github.com/zotero/translators/blob/bae2057067e2fde076252a3b897a7e689a173c71/BibTeX.js#L1707

That table is a lossy mapping from unicode to ASCII TeX, you can't always revert this table for TeX to unicode mapping -- TEL being one such instance that should not be reversed. If the unicode char maps to a string that does not contain TeX-reserved characters, you generally do not want to use it as a reverse mapping.
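One way to apply that rule when deriving a TeX-to-Unicode table from an export table, as a rough sketch. The table entries, the names `unicodeToTeX` and `buildReverseTable`, and the exact set of characters treated as reserved are all assumptions made for illustration; in particular, the sketch assumes that braces on their own do not count as reserved, since `{TEL}` is the entry under discussion.

```javascript
// Hypothetical excerpt of a Unicode -> TeX export table.
const unicodeToTeX = {
  '\u2121': '{TEL}',     // TELEPHONE SIGN, the contested entry
  '\u00E9': "\\'{e}",    // e with acute accent
  '\u00DF': '{\\ss}',    // sharp s
  '\u03B1': '$\\alpha$'  // Greek small alpha
}

// Assumption: "TeX-reserved" here means backslash, $, ^, _ or ~.
const RESERVED = /[\\$^_~]/

// Keep only entries whose TeX side contains a reserved character.
// Plain words such as {TEL} are skipped, so they stay ordinary text.
function buildReverseTable (forward) {
  const reverse = {}
  for (const [char, tex] of Object.entries(forward)) {
    if (RESERVED.test(tex)) reverse[tex] = char
  }
  return reverse
}

const texToUnicode = buildReverseTable(unicodeToTeX)
```

Here `texToUnicode` contains the accent, sharp s and alpha entries but no `{TEL}` key, so a title containing `{TEL}` would keep the three letters.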

@retorquere
Contributor

That table is a lossy mapping from unicode to ASCII TeX, you can't always revert this table for TeX to unicode mapping

Case in point: the reverse table is held separately here, and I would argue that the reverse mapping of {TEL} is a poor choice -- {TEL} means "the phrase TEL, not to be messed with in sentence casing". It does not mean "Telephone Sign" (which is the name of \u2121 in the unicode table).

@johanneswilm
Author

johanneswilm commented Nov 13, 2019

Interesting conversation you guys are having here.

but I think CSL supports <sup> and <sub> markup.

Does that mean this parser does not support the other HTML tags either? biblatex-csl-converter currently supports these in CSL export:

const TAGS = {
    'strong': {open: '<b>', close: '</b>'},
    'em': {open: '<i>', close: '</i>'},
    'sub': {open: '<sub>', close: '</sub>'},
    'sup': {open: '<sup>', close: '</sup>'},
    'smallcaps': {open: '<span style="font-variant:small-caps;">', close: '</span>'},
    'nocase': {open: '<span class="nocase">', close: '</span>'},
    'enquote': {open: '“', close: '”'},
    'url': {open: '', close: ''},
    'undefined': {open: '[', close: ']'}
}

@retorquere
Contributor

retorquere commented Nov 13, 2019

citeproc supports these; enquote and later in your table isn't markup so CSL won't mind. I can't find what CSL formally supports, but everything that uses citeproc in its various incarnations will support the markup listed under that link.

@johanneswilm
Author

enquote and later in your table isn't markup so CSL

Right, because as far as I know, citeproc-js doesn't have any corresponding tag for these. All the other ones are in that list you are linking to.

@retorquere
Contributor

Correct.

@johanneswilm
Author

@retorquere Ah, now I understand your reply. My first comment on this here was not formulated very well. I updated it now. I wasn't asking whether citeproc supports it (I know it does), I was wondering about this parser.

@larsgw
Member

larsgw commented Nov 14, 2019

Does that mean this parser does not support the other html tags either?

It does, but not all the commands it seems (code):

const richTextMappings = {
  textit: 'i',
  textbf: 'b',
  textsc: 'sc',
  textsuperscript: 'sup',
  textsubscript: 'sub'
}

@retorquere
Contributor

That misses at least mkbibbold, bf and bfseries for bold, sl, em, it, itshape, mkbibitalic, mkbibemph, emph for italics, sc and scshape for smallcaps, and citeproc doesn't support <sc>, just <span style="font-variant: small-caps;">
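As a sketch of what a fuller lookup could look like, using only the command names listed in this thread. The names `MARKUP`, `COMMAND_KIND` and `wrapCommand` are hypothetical, and the sketch covers just the name-to-markup step, not how far a font switch such as `\bf` extends.

```javascript
// Output markup per formatting kind. Small caps uses the span form,
// since citeproc does not accept <sc>.
const MARKUP = {
  b: { open: '<b>', close: '</b>' },
  i: { open: '<i>', close: '</i>' },
  sup: { open: '<sup>', close: '</sup>' },
  sub: { open: '<sub>', close: '</sub>' },
  sc: { open: '<span style="font-variant: small-caps;">', close: '</span>' }
}

// LaTeX command name (without backslash) -> formatting kind.
const COMMAND_KIND = new Map()
const register = (kind, names) => names.forEach(name => COMMAND_KIND.set(name, kind))
register('b', ['textbf', 'mkbibbold', 'bf', 'bfseries'])
register('i', ['textit', 'sl', 'em', 'it', 'itshape', 'mkbibitalic', 'mkbibemph', 'emph'])
register('sc', ['textsc', 'sc', 'scshape'])
register('sup', ['textsuperscript'])
register('sub', ['textsubscript'])

// Wrap already-converted text in the markup for a command.
// Unknown commands leave the text unchanged.
function wrapCommand (command, text) {
  const kind = COMMAND_KIND.get(command)
  if (kind === undefined) return text
  const { open, close } = MARKUP[kind]
  return open + text + close
}
```

For example, `wrapCommand('mkbibbold', 'bold')` gives `<b>bold</b>`, and `wrapCommand('scshape', 'Name')` gives the small-caps span rather than `<sc>`.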

Parsing stuff like {partially \bf bold} but not this is interesting (in the apocryphal Chinese sense) in that \bf affects everything after it until the end of the current block, so here only the word bold should be bold. That sample is synthetic, just for illustration; in practice you'd see the much more sensible partially {\bf bold} but not this. The interesting aspect there is that the braces do not mean nocase: if a block has a command at the start, it is ignored for case protection by bib(la)tex.

@larsgw
Member

larsgw commented Nov 14, 2019

Okay, that's some more things to add to the list. This does make me lean towards moving more parts of the parsing to earlier in the process.

  • For {partially \bf bold} but not this, are the braces still a nocase, since the \bf is not at the start of the block?
  • <sc> was mentioned in (although not part of) the old specification, I think that is where I got it. It seems to still be included in some test cases

@retorquere
Contributor

Okay, that's some more things to add to the list. This does make me lean towards moving more parts of the parsing to earlier in the process.

I don't see any other way this can be done. In a one-pass parser, it must be done during the parse, since you need the context to make these decisions. In a two-pass parser like mine, the decision can be postponed until the 2nd pass.

For {partially \bf bold} but not this, are the braces still a nocase, since the \bf is not at the start of the block?

Yes:

\documentclass{article}
\usepackage[american]{babel}
\usepackage[backend=biber, style=apa]{biblatex}
\DeclareLanguageMapping{american}{american-apa}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}

@article{03, author = "03", 
title =    "{\bf Next: Bold}",
}

@article{04, author = "04", 
title =    "{Next: \bf Bold}",
}

@article{05, author = "05", 
title =    "{Next: Bold}",
}

\end{filecontents}
\addbibresource{\jobname.bib}
\begin{document}
\nocite{*}
\printbibliography
\end{document}

gives

  1. (n.d.). NEXT: BOLD.
  2. (n.d.). Next: Bold.
  3. (n.d.). Next: Bold.
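The rule being tested here can be written as a small predicate. This is only a sketch: the name `isCaseProtected` is made up, and it assumes the check is simply whether a backslash follows the opening brace directly, with no handling of leading whitespace or nested groups.

```javascript
// Decide whether a brace group protects its content from case changes.
// `groupContent` is the text between the outer braces, e.g. for
// "{Next: \bf Bold}" it is "Next: \bf Bold".
// Assumption: a group is unprotected only when a command (a backslash)
// comes immediately after the opening brace.
function isCaseProtected (groupContent) {
  return !groupContent.startsWith('\\')
}

// The three titles from the test document above.
const results = {
  commandFirst: isCaseProtected('\\bf Next: Bold'), // entry 03
  commandLater: isCaseProtected('Next: \\bf Bold'), // entry 04
  noCommand: isCaseProtected('Next: Bold')          // entry 05
}
```

Under this sketch only entry 03 is treated as unprotected, while entries 04 and 05 keep their capitalisation.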

<sc> was mentioned in (although not part of) the old specification, I think that is where I got it. It seems to still be included in some test cases

I think most will actually still support it, but it's out of spec (even if I think it looks better)

@retorquere
Contributor

I haven't used B-C-C in a while, but it always used to be noticeably faster than the BBT parser. I don't know why the latest tests don't bear this out.
