Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BBT test suite #18

Open
larsgw opened this issue Nov 17, 2020 · 20 comments
Open

BBT test suite #18

larsgw opened this issue Nov 17, 2020 · 20 comments
Labels
bug Something isn't working

Comments

@larsgw
Copy link
Member

larsgw commented Nov 17, 2020

I am going through the BBT test suite again (at least, the files in retorquere/bibtex-parser/__tests__/better-bibtex/import). @retorquere, some questions:

  • One file contains {\d} which would normally be {\dh}, \d being a diacritic expecting an argument. Do you handle that normally or is that covered by error recovery?
  • Does EndNote really export without labels...
  • The rest works now after some updates on my side
@larsgw larsgw added the bug Something isn't working label Nov 17, 2020
@retorquere
Copy link
Contributor

retorquere commented Nov 17, 2020

I am going through the BBT test suite again (at least, the files in retorquere/bibtex-parser/__tests__/better-bibtex/import).

Those are my tests, correct. The snapshots directory have what I currently consider the ground truth, but I can't guarantee that is going to be flawless.

* [One file](https://github.com/retorquere/bibtex-parser/blob/7ad73dfe6039961f65afdd865d9c523066454b57/__tests__/better-bibtex/import/BBT%20does%20not%20import%20groups%20from%20JabRef%205.1%20%231641.bib#L4) contains `{\d}` which would normally be `{\dh}`, `\d` being a diacritic expecting an argument. Do you handle that normally or is that covered by error recovery?

I currently produce a lone diacritic, which unicode allows even if it clearly will have unintended results in this case. Error recovery would be fine if that wouldn't mean the entire entry is lost.

* Does EndNote really export without labels...

It doesn't now, but it did at one time, and you know how old bibtex files have a tendency to stick around.

* The rest works now after some updates on my side

That is fantastic. The current list of commands-with-arguments that I recognize is here (but not yet updated for a work in progress to specify the parse mode for them, that is on a different branch that I've stopped working on seeing how fast yours is). My unicode map is here but I'll need to normalize the maps. The source for the maps is here but if you install the npm module you get

  • tables/latex.json: command replacements
  • tables/diacritics.json: diacritic replacements

@retorquere
Copy link
Contributor

Wait, what does "works" mean in this context?

@larsgw
Copy link
Member Author

larsgw commented Nov 17, 2020

Well, "parses". I haven't compared the actual results yet, sorry for the misleading comment.

@retorquere
Copy link
Contributor

Ah OK, but no worries.

I'm serious about cooperating if you're interested. I want to have the best non-latex bibtex parser in BBT, but I need to be able to move when issues are brought in -- I have latex expertise on hand than helps me decide how issues are handled, don't mind discussing with more to get to better results.

If that's too much pressure (and I'd understand that completely), I'd probably be working on a fork. I've used other people's parsers before, and they never moved as fast as I wanted.

@larsgw
Copy link
Member Author

larsgw commented Nov 17, 2020

I am looking through the actual output differences now (Citation.js meaning version in this repo, not the currently released one).

  • Citation.js does not automatically insert $ (or enter math mode) when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on this .bib #664)
  • Citation.js does not yet implement \enquote or \mkbibquote
  • Citation.js' order of nesting <i> and <span class="nocase"> is different to that of BBT
  • Citation.js does not keeps capital letters after punctuation in title fields
  • Citation.js splits fields that have the list type in BibLaTeX into an array; BBT does not
  • Citation.js sentence-cases the contents of the type field because the BBT source makes it seem like it does too; however sometimes not? (in case of type = {Ph.D})
  • BBT does not protect the case of fields like title = {{Something Something}}; we talked about this before but in my tests of natbib & biblatex it does protect the case
  • Citation.js does not currently case-protect uppercase words in all cases
  • BBT seems to not split off a name prefix (Better BibTeX.004, technically also Better BibTeX.008)
  • Citation.js needs more commands (Argument commands #17), needed for this test suite are: \url, \\, \ocirc, \langle, \rangle, \overline, \LaTeX, \mbox, \pm, \%, \,, \bar, \left, \right, \rm, \le, \textrm
  • collaborator is not a defined field in Citation.js (so does not parse as an author); this can be changed in the config as needed
  • {\'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but if you do not take \i as the dotless i it normalizes to \xC3\xAD
  • BBT splits keywords on semicolons (Math markup to unicode not always imported correctly #472)
  • Citation.js does not include the name of the string key if it is not defined
  • Citation.js does not put multiple definitions of the same field (in the same entry) in an array in the output (Options to use default import process? #1562, where is that from???)
  • BBT does not insert <span class="nocase"> around braces that start with a math environment starting with a command. Not sure what to make of that though... (Overline during Import #1467)
  • I feel like Title of German entry converted to lowercase during import #1350 doesn't make sense entirely, but I do have to note the language-detection is not complete yet (but the middle two entries are assumed to be English, right?)
  • BBT unabbreviates journals, and since Citation.js is for the browser as well I cannot just include 41 MB of mappings

Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).

@retorquere
Copy link
Contributor

retorquere commented Nov 18, 2020 via email

@larsgw
Copy link
Member Author

larsgw commented Nov 18, 2020

  • Citation.js does not automatically insert $ (or enter math mode) when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on this .bib #664)

I'd have to test, but it seems to me BBT would be wrong to do this.

On Overleaf {foo_bar} produces an error (so undefined behaviour I guess) but it carries on and produces the equivalent of foo<i><sub>b</sub>.

  • Citation.js' order of nesting <i> and <span class="nocase"> is different to that of BBT

That is unlikely to make a difference. I'd yield on this one without problem.

Well, it is mostly annoying for comparing the test results. Right now Citation.js puts .nocase inside because in the structure of the grammar the bracket string is inside the command, where does the BBT behaviour come from?

  • Citation.js sentence-cases the contents of the type field because the BBT source makes it seem like it does too; however sometimes not? (in case of type = {Ph.D})

I'd have to investigate. But this strikes me as odd. Type isn't a title-like field I think.

It's listed in https://github.com/retorquere/bibtex-parser/blob/0c8bd92/index.ts#L274-L283 and that's where I took my list from.

  • BBT does not protect the case of fields like title = {{Something Something}}; we talked about this before but in my tests of natbib & biblatex it does protect the case

The problem is that the behavior of natbib has changed over time, and you can't tell by looking at the input whether it was written at a time when the unwrapping behavior was in effect. But I've yet to see an entry where double-bracing was applied correctly. AAMOF I usually find double-braced entries as exports by mendeley and the like, and their understanding of bibtex is poor at best.

That's fair, most of the usage in the test cases is incorrect as well. I am not entirely sure yet how to handle it though, based on how the parser is structured at the moment it is hard to tell whether a bracket string spans the entire field before parsing the entire field.

  • Citation.js does not keeps capital letters after punctuation in title fields

This is one I'd like to keep

  • Citation.js does not currently case-protect uppercase words in all cases

One I'd like to keep. I can't think of an instance where such a word is not a proper noun.

That's two more for the list of reasons to refactor my sentence-casing. The new version isn't even released yet and I'm already refactoring...

  • {'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but if you do not take \i as the dotless i it normalizes to \xC3\xAD

I'd accept either one, but with a soft preference for dotless.

BBT seems to output \xC3\xAD (LATIN SMALL LETTER I WITH CIRCUMFLEX) and I think that is probably correct but that also means extra mappings for me.

  • BBT splits keywords on semicolons (Math markup to unicode not always imported correctly #472)

I'd yield on that, although I'd prefer to keep it as an option. The keyword field doesn't have a defined behavior as far as I know. Does it?

It does in BibLaTeX but I don't know if that counts for you. Also there the separator can be customized but if that's easy to do for us depends on whether we count braces as grouping (if we do, the tokenization would be customized which it currently isn't for me).

  • Citation.js does not include the name of the string key if it is not defined

That's one I'd definitely want to keep. People import partial bibtex files where the definitions are kept separately, and they want the information (and I also have a feature to restore those on export).

Sure, it seems pretty arbitrary to do it either way. How do you restore, replace specific string values?

  • Citation.js does not put multiple definitions of the same field (in the same entry) in an array in the output (Options to use default import process? #1562, where is that from???)

I don't know, that looks like an automated export. I'm ambivalent on the array output - it made it easier to write typescript typing.

That makes sense, though I'd prefer to rather keep the data model similar to the one defined by BibLaTeX.

  • I feel like Title of German entry converted to lowercase during import #1350 doesn't make sense entirely, but I do have to note the language-detection is not complete yet (but the middle two entries are assumed to be English, right?)

The middle entries are assumed to be English, correct. What doesn't make sense?

Ah, I see now. The middle entries confused me a bit because there were still some capital letters where I thought they shouldn't be and vice versa. Anyway, I think there is one possible mistake in gaertner2008a: Öffent is not converted to lower case. Is that another one of those things, like uppercase words?

Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).

Far from it it would seem.

More than what I had though, and from a good source. The main negative change were the Greek letters, which I added later.

@retorquere
Copy link
Contributor

On Overleaf {foo_bar} produces an error (so undefined behaviour I guess) but it carries on and produces the equivalent of foo<i><sub>b</sub>.

So yeah that's one of those things -- I'd prefer to keep that. For most of my users, import is single shot because inspecting the import results is tedious, and I have no avenue for interactive feedback to say "you might want to fix this". So I import as much as I can with a reasonable if not always spec-compliant way in cases like this. For larger errors, I create a note that contains an indication of what was wrong and where, but I'd prefer to keep this as slim as possible.

That is unlikely to make a difference. I'd yield on this one without problem.

Well, it is mostly annoying for comparing the test results.

I have a conversion running on a branch. I'll let you know when it's done and I've posted the results to github.

Right now Citation.js puts .nocase inside because in the structure of the grammar the bracket string is inside the command, where does the BBT behaviour come from?

It's the order of these two if blocks.

It's listed in https://github.com/retorquere/bibtex-parser/blob/0c8bd92/index.ts#L274-L283 and that's where I took my list from.

Huh. If it's there, it's because Nick Bart directed me there. He and plk are my sources of bib(la)tex knowledge.

That's fair, most of the usage in the test cases is incorrect as well. I am not entirely sure yet how to handle it though, based on how the parser is structured at the moment it is hard to tell whether a bracket string spans the entire field before parsing the entire field.

Right, I suppose because the case-protection behavior is passed forward as you parse the inner blocks, and you only know at the end whether the outer {{..}} embrace to the same block.

I don't currently know how to do that single-pass. The least-cost option I can think of right now is to parse/convert both behaviors and decide at the end which of the results to pick (for title-like fields).

BBT seems to output \xC3\xAD (LATIN SMALL LETTER I WITH CIRCUMFLEX) and I think that is probably correct but that also means extra mappings for me.

I can adjust my unicode2latex mappings to make them easier to use for you. They're already being processed from the config file.

I'd yield on that, although I'd prefer to keep it as an option. The keyword field doesn't have a defined behavior as far as I know. Does it?

It does in BibLaTeX but I don't know if that counts for you. Also there the separator can be customized but if that's easy to do for us depends on whether we count braces as grouping (if we do, the tokenization would be customized which it currently isn't for me).

If this one brings in more complexity, I'd not mind it being dropped, certainly if the biblatex manual has an opinion on this.

Sure, it seems pretty arbitrary to do it either way. How do you restore, replace specific string values?

Either that (there's a place in Zotero where you can enter them) or I keep them as string references on re-export based on heuristics (basically: "could this whole-field value possibly be a string reference? Then no braces on export") when the user turns that option on. That means if they have an external file with string declarations, a re-export would allow them to use that still.

That makes sense, though I'd prefer to rather keep the data model similar to the one defined by BibLaTeX.

If there is a form of type field in the field object (so something like { name: 'author', type: 'creators' } or { name: 'title', type: ... } I can make typescript declarations that distinguish between them.

Ah, I see now. The middle entries confused me a bit because there were still some capital letters where I thought they shouldn't be and vice versa. Anyway, I think there is one possible mistake in gaertner2008a: Öffent is not converted to lower case. Is that another one of those things, like uppercase words?

No, that's a bug; my parser tries to handled words with a hyphen in it as one word, but expects text to appear directly after the hyphen. So it's not picked up as an ^[uppercase][lowercase]*$ word, but also not as an [uppercase][lowercase]*(-[upper|lowercase]+)* word (regexes being rough approximations here of the real ones), and then it falls back to "do you have an uppercase char?" and that's true.

More than what I had though, and from a good source. The main negative change were the Greek letters, which I added later.

My list was initiated from multiple sources, but it's just become its own thing in interactions with nick and plk.

@retorquere
Copy link
Contributor

The converted snapshots are available here

@larsgw
Copy link
Member Author

larsgw commented Nov 19, 2020

In __tests__/__snapshots__? I don't see any recent commits (or updated markup) there...

@retorquere
Copy link
Contributor

I forgot to commit, they're there now.

@larsgw
Copy link
Member Author

larsgw commented Nov 22, 2020

Working through the tests, I came up with some questions:

  • In Failure to handle unparsed author names (92), SpringerMaterials is not sentence-cased, do you only sentence-case the first letter or do you treat this as a proper noun?
  • I think I also found a case where capital letters after a hyphen were sentence-cased though I can't find it; that would be an exception to the proper noun heuristic?
  • In Better BibTeX.006 the n in \~{n} is not case protected even though it is ± a lower-case word, is that specific to diacritics?
  • In Better BibTeX.014 you seem to trim the title, is that intended?

@retorquere
Copy link
Contributor

Working through the tests, I came up with some questions:

* In `Failure to handle unparsed author names (92)`, `SpringerMaterials` is not sentence-cased, do you only sentence-case the first letter or do you treat this as a proper noun?
* I think I also found a case where capital letters after a hyphen were sentence-cased though I can't find it; that would be an exception to the proper noun heuristic?

I actually use the sentence-caser from citeproc-js; all my parser does is determine what ranges need to be protected from that, and I reset those ranges to the original input.

* In `Better BibTeX.006` the n in `\~{n}` is not case protected even though it is ± a lower-case word, is that specific to diacritics?

My parser doesn't case-protect lowercase words. I agree it's debatable whether that is the right choice.

* In `Better BibTeX.014` you seem to trim the title, is that intended?

Yes -- when I try on overleaf, fields seem to be trimmed when they are rendered.

@larsgw
Copy link
Member Author

larsgw commented Nov 22, 2020

My parser doesn't case-protect lowercase words. I agree it's debatable whether that is the right choice.

Wait, what's up with this fixture then?

@article{test, title = "aa aa {aa} AA {AA} Aa {Aa}"}

BBT output:

[
  {
    id: 'test',
    properties: {
      title: 'aa aa <span class="nocase">aa</span> AA AA aa Aa'
    },
    type: 'article'
  }
]

@retorquere
Copy link
Contributor

Sorry, I was mixing up things. The lowercase thing was for sentence casing, not case protection.

@retorquere
Copy link
Contributor

One of the reasons I had started a rewrite; not just performance, but the parser has gotten to be to complicated.

@larsgw
Copy link
Member Author

larsgw commented Nov 22, 2020

One of the reasons I had started a rewrite; not just performance, but the parser has gotten to be to complicated.

I know the feeling. I am still wondering though, it seems like BBT puts nocase around {a}, {aa} and \~{aa} but not \~{a} (in all cases surrounded by spaces in the middle of the sentence), is that as it should be?

@retorquere
Copy link
Contributor

It looks like \~{a} and \~{aa} should not get case protection, but {a} and {aa} should. This is super strange, I'll have to consult with Nick, I always thought that \~{a} would case protect, and {\~{a}} would undo the case protection, but that doesn't seem to hold given this MWE:

\documentclass[american]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\usepackage{csquotes}

\usepackage[style=apa, backend=biber]{biblatex}

\usepackage{filecontents}
\begin{filecontents*}{\jobname.bib}

@thesis{ref,
Author = {Daniel and Coley, David A. and Natarajan, Sukumar and Herrera, Manuel and Fosas de Pando, Miguel and Ramaho-Gonzalez, Alfonso},
Title = "Mitigation versus Adaptation: \~{a}oes \~{A}nsulating Dwellings Increase Overheating Risk?",
Journal = {BUILDING " AND ENVIRONMENT},
Year = {2018},
Volume = {143},
Pages = {740-759},
Month = OCT,
Publisher = {PERGAMON-ELSEVIER SCIENCE LTD},
Address = ACM-SSAC,
Type = {mathesis},
DOI = {10.1016/j.buildenv.2018.07.033},
ISSN = {0360-1323},
EISSN = "1873-684X",
}
\end{filecontents*}

\addbibresource{\jobname.bib}

\begin{document}

\cite{ref}

\printbibliography

\end{document}

@larsgw
Copy link
Member Author

larsgw commented Nov 23, 2020

What compiler are you using? I am getting a weird bug on pdfLaTeX and LaTeX (via Overleaf) with diacritics immediately after puncutation.

@retorquere
Copy link
Contributor

pdfLaTeX. But we're likely talking about an edge case where where we might be better off just dropping the lone diacritic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants