BBT test suite #18

larsgw · 2020-11-17T12:25:06Z

I am going through the BBT test suite again (at least, the files in retorquere/bibtex-parser/__tests__/better-bibtex/import). @retorquere, some questions:

One file contains {\d} which would normally be {\dh}, \d being a diacritic expecting an argument. Do you handle that normally or is that covered by error recovery?
Does EndNote really export without labels...
The rest works now after some updates on my side

The text was updated successfully, but these errors were encountered:

retorquere · 2020-11-17T12:44:15Z

I am going through the BBT test suite again (at least, the files in retorquere/bibtex-parser/__tests__/better-bibtex/import).

Those are my tests, correct. The snapshots directory have what I currently consider the ground truth, but I can't guarantee that is going to be flawless.

* [One file](https://github.com/retorquere/bibtex-parser/blob/7ad73dfe6039961f65afdd865d9c523066454b57/__tests__/better-bibtex/import/BBT%20does%20not%20import%20groups%20from%20JabRef%205.1%20%231641.bib#L4) contains `{\d}` which would normally be `{\dh}`, `\d` being a diacritic expecting an argument. Do you handle that normally or is that covered by error recovery?

I currently produce a lone diacritic, which unicode allows even if it clearly will have unintended results in this case. Error recovery would be fine if that wouldn't mean the entire entry is lost.

* Does EndNote really export without labels...

It doesn't now, but it did at one time, and you know how old bibtex files have a tendency to stick around.

* The rest works now after some updates on my side

That is fantastic. The current list of commands-with-arguments that I recognize is here (but not yet updated for a work in progress to specify the parse mode for them, that is on a different branch that I've stopped working on seeing how fast yours is). My unicode map is here but I'll need to normalize the maps. The source for the maps is here but if you install the npm module you get

tables/latex.json: command replacements
tables/diacritics.json: diacritic replacements

retorquere · 2020-11-17T12:45:10Z

Wait, what does "works" mean in this context?

larsgw · 2020-11-17T12:52:58Z

Well, "parses". I haven't compared the actual results yet, sorry for the misleading comment.

retorquere · 2020-11-17T13:53:37Z

Ah OK, but no worries.

I'm serious about cooperating if you're interested. I want to have the best non-latex bibtex parser in BBT, but I need to be able to move when issues are brought in -- I have latex expertise on hand than helps me decide how issues are handled, don't mind discussing with more to get to better results.

If that's too much pressure (and I'd understand that completely), I'd probably be working on a fork. I've used other people's parsers before, and they never moved as fast as I wanted.

larsgw · 2020-11-17T23:30:13Z

I am looking through the actual output differences now (Citation.js meaning version in this repo, not the currently released one).

Citation.js does not automatically insert $ (or enter math mode) when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on this .bib #664)
Citation.js does not yet implement \enquote or \mkbibquote
Citation.js' order of nesting  and  is different to that of BBT
Citation.js does not keeps capital letters after punctuation in title fields
Citation.js splits fields that have the list type in BibLaTeX into an array; BBT does not
Citation.js sentence-cases the contents of the type field because the BBT source makes it seem like it does too; however sometimes not? (in case of type = {Ph.D})
BBT does not protect the case of fields like title = {{Something Something}}; we talked about this before but in my tests of natbib & biblatex it does protect the case
Citation.js does not currently case-protect uppercase words in all cases
BBT seems to not split off a name prefix (Better BibTeX.004, technically also Better BibTeX.008)
Citation.js needs more commands (Argument commands #17), needed for this test suite are: \url, \\, \ocirc, \langle, \rangle, \overline, \LaTeX, \mbox, \pm, \%, \,, \bar, \left, \right, \rm, \le, \textrm
collaborator is not a defined field in Citation.js (so does not parse as an author); this can be changed in the config as needed
{\'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but if you do not take \i as the dotless i it normalizes to \xC3\xAD
BBT splits keywords on semicolons (Math markup to unicode not always imported correctly #472)
Citation.js does not include the name of the string key if it is not defined
Citation.js does not put multiple definitions of the same field (in the same entry) in an array in the output (Options to use default import process? #1562, where is that from???)
BBT does not insert  around braces that start with a math environment starting with a command. Not sure what to make of that though... (Overline during Import #1467)
I feel like Title of German entry converted to lowercase during import #1350 doesn't make sense entirely, but I do have to note the language-detection is not complete yet (but the middle two entries are assumed to be English, right?)
BBT unabbreviates journals, and since Citation.js is for the browser as well I cannot just include 41 MB of mappings

Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).

retorquere · 2020-11-18T07:21:11Z

On Wed, 18 Nov 2020, 00:30 Lars Willighagen, ***@***.***> wrote: I am looking through the actual output differences now (Citation.js meaning version in this repo, not the currently released one). - Citation.js does not automatically insert $ (or enter math mode) when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on this .bib #664)

I'd have to test, but it seems to me BBT would be wrong to do this.

- Citation.js does not yet implement \enquote or \mkbibquote

That is a feature that was specifically requested, but I can't imagine that to be hard.

- Citation.js' order of nesting and is different to that of BBT

That is unlikely to make a difference. I'd yield on this one without problem.

- Citation.js does not keeps capital letters after punctuation in title fields

This is one I'd like to keep

- Citation.js splits fields that have the list type in BibLaTeX into an array; BBT does not

BBT is wrong here. I've not implemented it because Zotero doesn't support it.

- Citation.js sentence-cases the contents of the type field because the BBT source makes it seem like it does too; however sometimes not? (in case of type = {Ph.D})

I'd have to investigate. But this strikes me as odd. Type isn't a title-like field I think.

- BBT does not protect the case of fields like title = {{Something Something}}; we talked about this before but in my tests of natbib & biblatex it *does* protect the case

The problem is that the behavior of natbib has changed over time, and you can't tell by looking at the input whether it was written at a time when the unwrapping behavior was in effect. But I've yet to see an entry where double-bracing was applied correctly. AAMOF I usually find double-braced entries as exports by mendeley and the like, and their understanding of bibtex is poor at best.

- Citation.js does not currently case-protect uppercase words in all cases

One I'd like to keep. I can't think of an instance where such a word is not a proper noun.

- BBT seems to not split off a name prefix (Better BibTeX.004, technically also Better BibTeX.008)

That's because Zotero doesn't support it. BBT is wrong here, and I'd yield.

- Citation.js needs more commands (#17 <#17>), needed for this test suite are: \url, \\, \ocirc, \langle, \rangle, \overline, \LaTeX, \mbox, \pm, \%, \,, \bar, \left, \right, \rm, \le, \textrm

Those are all in my mapping I'd think - I still need to normalize the notation in it, but that can be done with a script.

- collaborator is not a defined field in Citation.js (so does not parse as an author); this can be changed in the config as needed

That's great

- {\'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but if you do not take \i as the dotless i it normalizes to \xC3\xAD

I'd accept either one, but with a soft preference for dotless.

- BBT splits keywords on semicolons (Math markup to unicode not always imported correctly #472)

I'd yield on that, although I'd prefer to keep it as an option. The keyword field doesn't have a defined behavior as far as I know. Does it?

- Citation.js does not include the name of the string key if it is not defined

That's one I'd definitely want to keep. People import partial bibtex files where the definitions are kept separately, and they want the information (and I also have a feature to restore those on export).

- Citation.js does not put multiple definitions of the same field (in the same entry) in an array in the output (Options to use default import process? #1562, where is that from???)

I don't know, that looks like an automated export. I'm ambivalent on the array output - it made it easier to write typescript typing.

- BBT does not insert around braces that start with a math environment starting with a command. Not sure what to make of that though... (Overline during Import #1467)

That's an unintended side effect of how I handle math. The rule "no sentence casing when a block starts with a command" causes this, because I simply ignore math mode during parsing as whatever I wanted to do in math mode specifically I did in the grammar phase.

- I feel like Title of German entry converted to lowercase during import #1350 doesn't make sense entirely, but I do have to note the language-detection is not complete yet (but the middle two entries are assumed to be English, right?)

The middle entries are assumed to be English, correct. What doesn't make sense?

- BBT unabbreviates journals, and since Citation.js is for the browser as well I cannot just include 41 MB of mappings

That's no issue because I can post-apply that, and the unabbreviations are in the middle of a major cleanup.

Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).

Far from it it would seem.

larsgw · 2020-11-18T23:18:19Z

Citation.js does not automatically insert $ (or enter math mode) when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on this .bib #664)

I'd have to test, but it seems to me BBT would be wrong to do this.

On Overleaf {foo_bar} produces an error (so undefined behaviour I guess) but it carries on and produces the equivalent of foob.

Citation.js' order of nesting  and  is different to that of BBT

That is unlikely to make a difference. I'd yield on this one without problem.

Well, it is mostly annoying for comparing the test results. Right now Citation.js puts .nocase inside because in the structure of the grammar the bracket string is inside the command, where does the BBT behaviour come from?

Citation.js sentence-cases the contents of the type field because the BBT source makes it seem like it does too; however sometimes not? (in case of type = {Ph.D})

I'd have to investigate. But this strikes me as odd. Type isn't a title-like field I think.

It's listed in https://github.com/retorquere/bibtex-parser/blob/0c8bd92/index.ts#L274-L283 and that's where I took my list from.

BBT does not protect the case of fields like title = {{Something Something}}; we talked about this before but in my tests of natbib & biblatex it does protect the case

The problem is that the behavior of natbib has changed over time, and you can't tell by looking at the input whether it was written at a time when the unwrapping behavior was in effect. But I've yet to see an entry where double-bracing was applied correctly. AAMOF I usually find double-braced entries as exports by mendeley and the like, and their understanding of bibtex is poor at best.

That's fair, most of the usage in the test cases is incorrect as well. I am not entirely sure yet how to handle it though, based on how the parser is structured at the moment it is hard to tell whether a bracket string spans the entire field before parsing the entire field.

Citation.js does not keeps capital letters after punctuation in title fields

This is one I'd like to keep

Citation.js does not currently case-protect uppercase words in all cases

One I'd like to keep. I can't think of an instance where such a word is not a proper noun.

That's two more for the list of reasons to refactor my sentence-casing. The new version isn't even released yet and I'm already refactoring...

{'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but if you do not take \i as the dotless i it normalizes to \xC3\xAD

I'd accept either one, but with a soft preference for dotless.

BBT seems to output \xC3\xAD (LATIN SMALL LETTER I WITH CIRCUMFLEX) and I think that is probably correct but that also means extra mappings for me.

BBT splits keywords on semicolons (Math markup to unicode not always imported correctly #472)

I'd yield on that, although I'd prefer to keep it as an option. The keyword field doesn't have a defined behavior as far as I know. Does it?

It does in BibLaTeX but I don't know if that counts for you. Also there the separator can be customized but if that's easy to do for us depends on whether we count braces as grouping (if we do, the tokenization would be customized which it currently isn't for me).

Citation.js does not include the name of the string key if it is not defined

That's one I'd definitely want to keep. People import partial bibtex files where the definitions are kept separately, and they want the information (and I also have a feature to restore those on export).

Sure, it seems pretty arbitrary to do it either way. How do you restore, replace specific string values?

Citation.js does not put multiple definitions of the same field (in the same entry) in an array in the output (Options to use default import process? #1562, where is that from???)

I don't know, that looks like an automated export. I'm ambivalent on the array output - it made it easier to write typescript typing.

That makes sense, though I'd prefer to rather keep the data model similar to the one defined by BibLaTeX.

I feel like Title of German entry converted to lowercase during import #1350 doesn't make sense entirely, but I do have to note the language-detection is not complete yet (but the middle two entries are assumed to be English, right?)

The middle entries are assumed to be English, correct. What doesn't make sense?

Ah, I see now. The middle entries confused me a bit because there were still some capital letters where I thought they shouldn't be and vice versa. Anyway, I think there is one possible mistake in gaertner2008a: Öffent is not converted to lower case. Is that another one of those things, like uppercase words?

Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).

Far from it it would seem.

More than what I had though, and from a good source. The main negative change were the Greek letters, which I added later.

retorquere · 2020-11-19T07:18:40Z

On Overleaf {foo_bar} produces an error (so undefined behaviour I guess) but it carries on and produces the equivalent of foob.

So yeah that's one of those things -- I'd prefer to keep that. For most of my users, import is single shot because inspecting the import results is tedious, and I have no avenue for interactive feedback to say "you might want to fix this". So I import as much as I can with a reasonable if not always spec-compliant way in cases like this. For larger errors, I create a note that contains an indication of what was wrong and where, but I'd prefer to keep this as slim as possible.

That is unlikely to make a difference. I'd yield on this one without problem.

Well, it is mostly annoying for comparing the test results.

I have a conversion running on a branch. I'll let you know when it's done and I've posted the results to github.

Right now Citation.js puts .nocase inside because in the structure of the grammar the bracket string is inside the command, where does the BBT behaviour come from?

It's the order of these two if blocks.

It's listed in https://github.com/retorquere/bibtex-parser/blob/0c8bd92/index.ts#L274-L283 and that's where I took my list from.

Huh. If it's there, it's because Nick Bart directed me there. He and plk are my sources of bib(la)tex knowledge.

That's fair, most of the usage in the test cases is incorrect as well. I am not entirely sure yet how to handle it though, based on how the parser is structured at the moment it is hard to tell whether a bracket string spans the entire field before parsing the entire field.

Right, I suppose because the case-protection behavior is passed forward as you parse the inner blocks, and you only know at the end whether the outer {{..}} embrace to the same block.

I don't currently know how to do that single-pass. The least-cost option I can think of right now is to parse/convert both behaviors and decide at the end which of the results to pick (for title-like fields).

BBT seems to output \xC3\xAD (LATIN SMALL LETTER I WITH CIRCUMFLEX) and I think that is probably correct but that also means extra mappings for me.

I can adjust my unicode2latex mappings to make them easier to use for you. They're already being processed from the config file.

I'd yield on that, although I'd prefer to keep it as an option. The keyword field doesn't have a defined behavior as far as I know. Does it?

It does in BibLaTeX but I don't know if that counts for you. Also there the separator can be customized but if that's easy to do for us depends on whether we count braces as grouping (if we do, the tokenization would be customized which it currently isn't for me).

If this one brings in more complexity, I'd not mind it being dropped, certainly if the biblatex manual has an opinion on this.

Sure, it seems pretty arbitrary to do it either way. How do you restore, replace specific string values?

Either that (there's a place in Zotero where you can enter them) or I keep them as string references on re-export based on heuristics (basically: "could this whole-field value possibly be a string reference? Then no braces on export") when the user turns that option on. That means if they have an external file with string declarations, a re-export would allow them to use that still.

That makes sense, though I'd prefer to rather keep the data model similar to the one defined by BibLaTeX.

If there is a form of type field in the field object (so something like { name: 'author', type: 'creators' } or { name: 'title', type: ... } I can make typescript declarations that distinguish between them.

Ah, I see now. The middle entries confused me a bit because there were still some capital letters where I thought they shouldn't be and vice versa. Anyway, I think there is one possible mistake in gaertner2008a: Öffent is not converted to lower case. Is that another one of those things, like uppercase words?

No, that's a bug; my parser tries to handled words with a hyphen in it as one word, but expects text to appear directly after the hyphen. So it's not picked up as an ^[uppercase][lowercase]*$ word, but also not as an [uppercase][lowercase]*(-[upper|lowercase]+)* word (regexes being rough approximations here of the real ones), and then it falls back to "do you have an uppercase char?" and that's true.

More than what I had though, and from a good source. The main negative change were the Greek letters, which I added later.

My list was initiated from multiple sources, but it's just become its own thing in interactions with nick and plk.

retorquere · 2020-11-19T14:31:23Z

The converted snapshots are available here

larsgw · 2020-11-19T15:23:45Z

In __tests__/__snapshots__? I don't see any recent commits (or updated markup) there...

retorquere · 2020-11-19T18:09:03Z

I forgot to commit, they're there now.

larsgw · 2020-11-22T14:15:33Z

Working through the tests, I came up with some questions:

In Failure to handle unparsed author names (92), SpringerMaterials is not sentence-cased, do you only sentence-case the first letter or do you treat this as a proper noun?
I think I also found a case where capital letters after a hyphen were sentence-cased though I can't find it; that would be an exception to the proper noun heuristic?
In Better BibTeX.006 the n in \~{n} is not case protected even though it is ± a lower-case word, is that specific to diacritics?
In Better BibTeX.014 you seem to trim the title, is that intended?

retorquere · 2020-11-22T19:06:33Z

Working through the tests, I came up with some questions:

* In `Failure to handle unparsed author names (92)`, `SpringerMaterials` is not sentence-cased, do you only sentence-case the first letter or do you treat this as a proper noun?
* I think I also found a case where capital letters after a hyphen were sentence-cased though I can't find it; that would be an exception to the proper noun heuristic?

I actually use the sentence-caser from citeproc-js; all my parser does is determine what ranges need to be protected from that, and I reset those ranges to the original input.

* In `Better BibTeX.006` the n in `\~{n}` is not case protected even though it is ± a lower-case word, is that specific to diacritics?

My parser doesn't case-protect lowercase words. I agree it's debatable whether that is the right choice.

* In `Better BibTeX.014` you seem to trim the title, is that intended?

Yes -- when I try on overleaf, fields seem to be trimmed when they are rendered.

larsgw · 2020-11-22T21:50:26Z

My parser doesn't case-protect lowercase words. I agree it's debatable whether that is the right choice.

Wait, what's up with this fixture then?

@article{test, title = "aa aa {aa} AA {AA} Aa {Aa}"}

BBT output:

[
  {
    id: 'test',
    properties: {
      title: 'aa aa <span class="nocase">aa</span> AA AA aa Aa'
    },
    type: 'article'
  }
]

retorquere · 2020-11-22T21:53:14Z

Sorry, I was mixing up things. The lowercase thing was for sentence casing, not case protection.

retorquere · 2020-11-22T21:54:14Z

One of the reasons I had started a rewrite; not just performance, but the parser has gotten to be to complicated.

larsgw · 2020-11-22T23:55:00Z

One of the reasons I had started a rewrite; not just performance, but the parser has gotten to be to complicated.

I know the feeling. I am still wondering though, it seems like BBT puts nocase around {a}, {aa} and \~{aa} but not \~{a} (in all cases surrounded by spaces in the middle of the sentence), is that as it should be?

retorquere · 2020-11-23T11:34:46Z

It looks like \~{a} and \~{aa} should not get case protection, but {a} and {aa} should. This is super strange, I'll have to consult with Nick, I always thought that \~{a} would case protect, and {\~{a}} would undo the case protection, but that doesn't seem to hold given this MWE:

\documentclass[american]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=apa, backend=biber]{biblatex}

\usepackage{filecontents}
\begin{filecontents*}{\jobname.bib}

@thesis{ref,
Author = {Daniel and Coley, David A. and Natarajan, Sukumar and Herrera, Manuel and Fosas de Pando, Miguel and Ramaho-Gonzalez, Alfonso},
Title = "Mitigation versus Adaptation: \~{a}oes \~{A}nsulating Dwellings Increase Overheating Risk?",
Journal = {BUILDING " AND ENVIRONMENT},
Year = {2018},
Volume = {143},
Pages = {740-759},
Month = OCT,
Publisher = {PERGAMON-ELSEVIER SCIENCE LTD},
Address = ACM-SSAC,
Type = {mathesis},
DOI = {10.1016/j.buildenv.2018.07.033},
ISSN = {0360-1323},
EISSN = "1873-684X",
}
\end{filecontents*}

\addbibresource{\jobname.bib}

\begin{document}

\cite{ref}

\printbibliography

\end{document}

larsgw · 2020-11-23T16:01:59Z

What compiler are you using? I am getting a weird bug on pdfLaTeX and LaTeX (via Overleaf) with diacritics immediately after puncutation.

retorquere · 2020-11-23T16:28:44Z

pdfLaTeX. But we're likely talking about an edge case where where we might be better off just dropping the lone diacritic.

larsgw added the bug Something isn't working label Nov 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BBT test suite #18

BBT test suite #18

larsgw commented Nov 17, 2020

retorquere commented Nov 17, 2020 •

edited

retorquere commented Nov 17, 2020

larsgw commented Nov 17, 2020

retorquere commented Nov 17, 2020

larsgw commented Nov 17, 2020

retorquere commented Nov 18, 2020 via email •

edited

larsgw commented Nov 18, 2020

retorquere commented Nov 19, 2020

retorquere commented Nov 19, 2020

larsgw commented Nov 19, 2020

retorquere commented Nov 19, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 22, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 22, 2020

retorquere commented Nov 22, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 23, 2020

larsgw commented Nov 23, 2020

retorquere commented Nov 23, 2020

BBT test suite #18

BBT test suite #18

Comments

larsgw commented Nov 17, 2020

retorquere commented Nov 17, 2020 • edited

retorquere commented Nov 17, 2020

larsgw commented Nov 17, 2020

retorquere commented Nov 17, 2020

larsgw commented Nov 17, 2020

retorquere commented Nov 18, 2020 via email • edited

larsgw commented Nov 18, 2020

retorquere commented Nov 19, 2020

retorquere commented Nov 19, 2020

larsgw commented Nov 19, 2020

retorquere commented Nov 19, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 22, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 22, 2020

retorquere commented Nov 22, 2020

larsgw commented Nov 22, 2020

retorquere commented Nov 23, 2020

larsgw commented Nov 23, 2020

retorquere commented Nov 23, 2020

retorquere commented Nov 17, 2020 •

edited

retorquere commented Nov 18, 2020 via email •

edited