BBT test suite #18
Those are my tests, correct. The snapshots directory has what I currently consider the ground truth, but I can't guarantee that it is going to be flawless.
I currently produce a lone diacritic, which unicode allows even if it clearly will have unintended results in this case. Error recovery would be fine if that wouldn't mean the entire entry is lost.
It doesn't now, but it did at one time, and you know how old bibtex files have a tendency to stick around.
That is fantastic. The current list of commands-with-arguments that I recognize is here (but not yet updated for a work in progress to specify the parse mode for them, that is on a different branch that I've stopped working on seeing how fast yours is). My unicode map is here but I'll need to normalize the maps. The source for the maps is here but if you install the npm module you get
Wait, what does "works" mean in this context?
Well, "parses". I haven't compared the actual results yet, sorry for the misleading comment.
Ah OK, but no worries. I'm serious about cooperating if you're interested. I want to have the best non-latex bibtex parser in BBT, but I need to be able to move when issues are brought in -- I have latex expertise on hand that helps me decide how issues are handled, and I don't mind discussing with more people to get to better results. If that's too much pressure (and I'd understand that completely), I'd probably be working on a fork. I've used other people's parsers before, and they never moved as fast as I wanted.
I am looking through the actual output differences now (Citation.js meaning version in this repo, not the currently released one).
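Other things I could fix quicker and are not included in this list. Something that applies to a lot of the commands: I made my own unicode2latex thing before I found this one, based on https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def (which unfortunately does not include everything).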
On Wed, 18 Nov 2020, 00:30 Lars Willighagen, ***@***.***> wrote:
I am looking through the actual output differences now (Citation.js
meaning version in this repo, not the currently released one).
- Citation.js does not automatically insert $ (or enter math mode)
when literal fields contain _ or ^ (bbt-import-zbb (quietly) chokes on
this .bib #664)
I'd have to test, but it seems to me BBT would be wrong to do this.
- Citation.js does not yet implement \enquote or \mkbibquote
That is a feature that was specifically requested, but I can't imagine
that to be hard.
- Citation.js' order of nesting <i> and <span class="nocase"> is
different to that of BBT
That is unlikely to make a difference. I'd yield on this one without problem.
- Citation.js does not keep capital letters after punctuation in
title fields
This is one I'd like to keep
- Citation.js splits fields that have the list type in BibLaTeX into
an array; BBT does not
BBT is wrong here. I've not implemented it because Zotero doesn't support
it.
- Citation.js sentence-cases the contents of the type field because
the BBT source makes it seem like it does too; however sometimes not? (in
case of type = {Ph.D})
I'd have to investigate. But this strikes me as odd. Type isn't a
title-like field I think.
- BBT does not protect the case of fields like title = {{Something
Something}}; we talked about this before but in my tests of natbib &
biblatex it *does* protect the case
The problem is that the behavior of natbib has changed over time, and you
can't tell by looking at the input whether it was written at a time when
the unwrapping behavior was in effect. But I've yet to see an entry where
double-bracing was applied correctly. AAMOF I usually find double-braced
entries as exports by mendeley and the like, and their understanding of
bibtex is poor at best.
- Citation.js does not currently case-protect uppercase words in all
cases
One I'd like to keep. I can't think of an instance where such a word is
not a proper noun.
- BBT seems to not split off a name prefix (Better BibTeX.004,
technically also Better BibTeX.008)
That's because Zotero doesn't support it. BBT is wrong here, and I'd
yield.
- Citation.js needs more commands (#17),
needed for this test suite are: \url, \\, \ocirc, \langle, \rangle,
\overline, \LaTeX, \mbox, \pm, \%, \,, \bar, \left, \right, \rm, \le,
\textrm
Those are all in my mapping I'd think - I still need to normalize the
notation in it, but that can be done with a script.
- collaborator is not a defined field in Citation.js (so does not
parse as an author); this can be changed in the config as needed
That's great
- {\'{\i}} is annoying, literally that would be \xC4\xB1\xCC\x81 but
if you do not take \i as the dotless i it normalizes to \xC3\xAD
I'd accept either one, but with a soft preference for dotless.
- BBT splits keywords on semicolons (Math markup to unicode not always
imported correctly #472)
I'd yield on that, although I'd prefer to keep it as an option. The
keyword field doesn't have a defined behavior as far as I know. Does it?
- Citation.js does not include the name of the string key if it is not
defined
That's one I'd definitely want to keep. People import partial bibtex files
where the definitions are kept separately, and they want the information
(and I also have a feature to restore those on export).
- Citation.js does not put multiple definitions of the same field (in
the same entry) in an array in the output (Options to use default
import process? #1562, where is that from???)
I don't know, that looks like an automated export. I'm ambivalent on the
array output - it made it easier to write typescript typing.
- BBT does not insert <span class="nocase"> around braces that start
with a math environment starting with a command. Not sure what to make of
that though... (Overline during Import #1467)
That's an unintended side effect of how I handle math. The rule "no
sentence casing when a block starts with a command" causes this, because I
simply ignore math mode during parsing as whatever I wanted to do in math
mode specifically I did in the grammar phase.
- I feel like Title of German entry converted to lowercase during
import #1350 doesn't make sense entirely, but I do have to note the
language-detection is not complete yet (but the middle two entries are
assumed to be English, right?)
The middle entries are assumed to be English, correct.
What doesn't make sense?
- BBT unabbreviates journals, and since Citation.js is for the browser
as well I cannot just include 41 MB of mappings
That's no issue because I can post-apply that, and the unabbreviations are
in the middle of a major cleanup.
Other things I could fix quicker and are not included in this list.
Something that applies to a lot of the commands: I made my own
unicode2latex thing before I found this one, based on
https://github.com/latex3/latex3/blob/master/texmf/tex/latex/base/tuenc.def
(which unfortunately does not include everything).
Far from it, it would seem.
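The `{\'{\i}}` item in the list above comes down to Unicode normalization. The sketch below is only an illustration in plain TypeScript, not code from either parser: it compares the two readings after NFC normalization. The helper `codePoints` is a hypothetical name introduced for display purposes.

```typescript
// The two readings of {\'{\i}} discussed in the list above.
const dotless = "\u0131\u0301"; // U+0131 dotless i + U+0301 combining acute
const dotted = "i\u0301";       // plain i + U+0301 combining acute

// Hypothetical helper, only for display: UTF-16 code units as hex strings.
// All characters used here are in the BMP, so code units equal code points.
function codePoints(text: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < text.length; i++) {
    out.push(text.charCodeAt(i).toString(16));
  }
  return out;
}

// NFC cannot compose the dotless i with the accent, so it stays two code points
// (UTF-8: C4 B1 CC 81). The plain i composes to U+00ED (UTF-8: C3 AD).
const dotlessNfc = dotless.normalize("NFC");
const dottedNfc = dotted.normalize("NFC");

console.log(codePoints(dotlessNfc)); // [ '131', '301' ]
console.log(codePoints(dottedNfc));  // [ 'ed' ]
```

So a parser that maps `\i` to the dotless i ends up with a two-code-point sequence even after normalization, which is what makes snapshot comparison between the two approaches noisy.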
On Overleaf
Well, it is mostly annoying for comparing the test results. Right now Citation.js puts
It's listed in https://github.com/retorquere/bibtex-parser/blob/0c8bd92/index.ts#L274-L283 and that's where I took my list from.
That's fair, most of the usage in the test cases is incorrect as well. I am not entirely sure yet how to handle it though, based on how the parser is structured at the moment it is hard to tell whether a bracket string spans the entire field before parsing the entire field.
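To pin down what "a bracket string spans the entire field" means, here is a minimal standalone check. It is a sketch under assumptions, not how Citation.js or BBT is structured: `spansWholeField` is a hypothetical name, it works on the raw field text rather than on parser tokens, and it treats a backslash as escaping the next character.

```typescript
// Hypothetical helper: true when a single brace group wraps the whole value,
// e.g. "{Something Something}", and false for "{a} {b}" or unbraced text.
function spansWholeField(value: string): boolean {
  if (value.length < 2 || value.charAt(0) !== "{") return false;
  let depth = 0;
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    if (ch === "\\") {
      i++; // skip the escaped character, e.g. \{ or \}
      continue;
    }
    if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      // The first group closed before the end, so it does not span the field.
      if (depth === 0 && i < value.length - 1) return false;
    }
  }
  return depth === 0;
}

console.log(spansWholeField("{Something Something}")); // true
console.log(spansWholeField("{Some} thing {Else}"));   // false
```

A pre-scan like this costs one extra pass over the field text, which may or may not fit a parser that only sees the field as a token stream.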
That's two more for the list of reasons to refactor my sentence-casing. The new version isn't even released yet and I'm already refactoring...
BBT seems to output
It does in BibLaTeX but I don't know if that counts for you. Also there the separator can be customized but if that's easy to do for us depends on whether we count braces as grouping (if we do, the tokenization would be customized which it currently isn't for me).
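As an illustration of why brace grouping matters for a customizable separator, the sketch below splits a keyword field on a single-character separator but only at brace depth zero. `splitKeywords` is a hypothetical function written for this comment, not an API of either parser, and the default separator is an assumption.

```typescript
// Hypothetical helper: split a keywords value on a one-character separator,
// ignoring separators that sit inside a {...} group.
function splitKeywords(value: string, separator: string = ","): string[] {
  const parts: string[] = [];
  let depth = 0;
  let current = "";
  for (let i = 0; i < value.length; i++) {
    const ch = value.charAt(i);
    if (ch === "{") depth++;
    else if (ch === "}") depth = Math.max(0, depth - 1);
    if (ch === separator && depth === 0) {
      parts.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current.trim());
  // Drop empty entries caused by trailing or doubled separators.
  return parts.filter(function (part) { return part.length > 0; });
}

console.log(splitKeywords("a, {b, c}, d")); // [ 'a', '{b, c}', 'd' ]
console.log(splitKeywords("x; y", ";"));    // [ 'x', 'y' ]
```

If braces are not counted as grouping, the depth tracking can simply be dropped and the function reduces to a plain split.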
Sure, it seems pretty arbitrary to do it either way. How do you restore, replace specific string values?
That makes sense, though I'd prefer to rather keep the data model similar to the one defined by BibLaTeX.
Ah, I see now. The middle entries confused me a bit because there were still some capital letters where I thought they shouldn't be and vice versa. Anyway, I think there is one possible mistake in
More than what I had though, and from a good source. The main negative change was the Greek letters, which I added later.
So yeah that's one of those things -- I'd prefer to keep that. For most of my users, import is single shot because inspecting the import results is tedious, and I have no avenue for interactive feedback to say "you might want to fix this". So I import as much as I can with a reasonable if not always spec-compliant way in cases like this. For larger errors, I create a note that contains an indication of what was wrong and where, but I'd prefer to keep this as slim as possible.
I have a conversion running on a branch. I'll let you know when it's done and I've posted the results to github.
It's the order of these two
Huh. If it's there, it's because Nick Bart directed me there. He and plk are my sources of bib(la)tex knowledge.
Right, I suppose because the case-protection behavior is passed forward as you parse the inner blocks, and you only know at the end whether the outer block spans the whole field. I don't currently know how to do that single-pass. The least-cost option I can think of right now is to parse/convert both behaviors and decide at the end which of the results to pick (for title-like fields).
I can adjust my unicode2latex mappings to make them easier to use for you. They're already being processed from the config file.
If this one brings in more complexity, I'd not mind it being dropped, certainly if the biblatex manual has an opinion on this.
Either that (there's a place in Zotero where you can enter them) or I keep them as string references on re-export based on heuristics (basically: "could this whole-field value possibly be a string reference? Then no braces on export") when the user turns that option on. That means if they have an external file with string declarations, a re-export would allow them to use that still.
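A rough sketch of what such an export heuristic could look like follows. The actual BBT rule is not shown in this thread, so the pattern, the function names `couldBeStringReference` and `exportField`, and the `keepReferences` option are all assumptions made for illustration.

```typescript
// Assumption: a string reference is a single identifier-like token.
// The real heuristic may well be stricter or looser than this pattern.
function couldBeStringReference(value: string): boolean {
  return /^[A-Za-z][A-Za-z0-9_:.-]*$/.test(value);
}

// Hypothetical export step: leave possible references unbraced when the
// user has turned the option on, brace everything else.
function exportField(name: string, value: string, keepReferences: boolean): string {
  if (keepReferences && couldBeStringReference(value)) {
    return name + " = " + value;
  }
  return name + " = {" + value + "}";
}

console.log(exportField("journal", "jacs", true));              // journal = jacs
console.log(exportField("journal", "J. Am. Chem. Soc.", true)); // journal = {J. Am. Chem. Soc.}
console.log(exportField("journal", "jacs", false));             // journal = {jacs}
```

Note that a one-word value such as `Nature` would also pass this check, which is a good reason for keeping the behavior behind a user option.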
If there is a form of
No, that's a bug; my parser tries to handle words with a hyphen in them as one word, but expects text to appear directly after the hyphen. So it's not picked up as an
My list was initiated from multiple sources, but it's just become its own thing in interactions with nick and plk.
The converted snapshots are available here
In
I forgot to commit, they're there now.
Working through the tests, I came up with some questions:
I actually use the sentence-caser from citeproc-js; all my parser does is determine what ranges need to be protected from that, and I reset those ranges to the original input.
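The "case everything, then reset the protected ranges" approach can be sketched as below. This is a toy under stated assumptions: `toySentenceCase` is a stand-in and not the citeproc-js sentence-caser, `ProtectedRange` and `sentenceCaseWithProtection` are hypothetical names, ranges are assumed to be sorted and non-overlapping, and casing is assumed not to change the string length.

```typescript
interface ProtectedRange {
  start: number; // inclusive index into the original text
  end: number;   // exclusive index into the original text
}

// Toy stand-in for a sentence-caser: first character upper, the rest lower.
function toySentenceCase(text: string): string {
  return text.charAt(0).toUpperCase() + text.slice(1).toLowerCase();
}

// Case the whole text, then copy the original characters back into every
// protected range so those spans keep their input casing.
function sentenceCaseWithProtection(text: string, ranges: ProtectedRange[]): string {
  const cased = toySentenceCase(text);
  let result = "";
  let pos = 0;
  for (let i = 0; i < ranges.length; i++) {
    const range = ranges[i];
    result += cased.slice(pos, range.start) + text.slice(range.start, range.end);
    pos = range.end;
  }
  return result + cased.slice(pos);
}

console.log(sentenceCaseWithProtection("The NASA Budget Explained", [{ start: 4, end: 8 }]));
// The NASA budget explained
```

The appeal of this design is that the caser itself never needs to know about braces or markup; the parser only has to report index ranges.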
My parser doesn't case-protect lowercase words. I agree it's debatable whether that is the right choice.
Yes -- when I try on overleaf, fields seem to be trimmed when they are rendered.
Wait, what's up with this fixture then?

```bibtex
@article{test, title = "aa aa {aa} AA {AA} Aa {Aa}"}
```

BBT output:

```js
[
  {
    id: 'test',
    properties: {
      title: 'aa aa <span class="nocase">aa</span> AA AA aa Aa'
    },
    type: 'article'
  }
]
```
Sorry, I was mixing up things. The lowercase thing was for sentence casing, not case protection.
One of the reasons I had started a rewrite; not just performance, but the parser has gotten to be too complicated.
I know the feeling. I am still wondering though, it seems like BBT puts nocase around
It looks like
What compiler are you using? I am getting a weird bug on pdfLaTeX and LaTeX (via Overleaf) with diacritics immediately after punctuation.
pdfLaTeX. But we're likely talking about an edge case where we might be better off just dropping the lone diacritic.
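Dropping a lone diacritic could be done as a post-processing step on the converted text. The sketch below is one possible reading of "lone", stated as an assumption: a combining mark from the U+0300 to U+036F block that has no base letter, because it sits at the start of the text or directly after whitespace or basic punctuation. `dropLoneDiacritics` is a hypothetical name.

```typescript
// Hypothetical cleanup: remove combining marks (U+0300..U+036F) that have no
// base letter, i.e. at the start of the text or right after whitespace or
// basic punctuation. Marks that follow a letter are left alone.
function dropLoneDiacritics(text: string): string {
  return text.replace(/(^|[\s.,;:!?])[\u0300-\u036f]+/g, "$1");
}

// Quick check that a mark after punctuation is removed but a normal
// base + mark sequence survives.
console.log(dropLoneDiacritics("a.\u0323 b") === "a. b");   // true
console.log(dropLoneDiacritics("e\u0301") === "e\u0301");   // true
```

Whether a mark after punctuation should instead attach to the punctuation is exactly the edge case in question, so this only shows the "just drop it" option.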
I am going through the BBT test suite again (at least, the files in `retorquere/bibtex-parser/__tests__/better-bibtex/import`). @retorquere, some questions:
`{\d}` which would normally be `{\dh}`, `\d` being a diacritic expecting an argument. Do you handle that normally or is that covered by error recovery?