BBT parser #4
I've updated the package and will update the surrounding documentation.
I think I removed a part of the test file, which removed the concatenation, but that also removed the real culprit: it throws for
I also cannot reproduce this. It imports both using the standalone BBT parser and in BBT, as I'd expect. I've added this test case, but it passes for me.
As an aside, it's a simple fix for me to add missing diacritics (or other constructs), but
BTW, as far as completeness testing goes, I'd suggest testing at least
and optionally
BTW, the BBT parser builds on the astrocite parser, parts of which are by my hand; the BBT parser will therefore necessarily be slower than astrocite. I'm open to looking at the idea parser in this test, but that would need it either to produce an AST which I can postprocess, or for the postprocessing to happen during parsing (which I'd not recommend).
It seems to happen specifically when a user-defined string with the aforementioned forms of diacritics is used in a field (it works fine if they're not used, or if the diacritic is in the field itself, as in the test case you made). Here's proof I'm not crazy :)
I wanted to leave value parsing to a different part of the parser (namely, the mapping), as the mapping is used for Bib.TXT as well, which I assume has names in both formats, dates in EDTF, verbatim fields, and basically everything else. This parser would be specifically for everything that is BibLaTeX/BibTeX except the values, if that makes sense. I get that that makes it a bit of an unfair comparison, especially performance-wise, but I didn't mean this repository as a way of calling people out, just as a way to see if my results were somewhat adequate.
I'm not trying to convince you to switch, either. I knew my old parser was bad, and I wanted to see which method of parsing was best for my purposes (PEG.js, nearley, rolling my own parser, etc.). There isn't much postprocessing going on: it converts commands to Unicode, concatenates fields, and puts everything into an object. No conversion to CSL though, but it's not an AST, and I understand if this is too much pre-processing.

```js
{
  type: String,
  label: String,
  properties: {
    field: "value" // note: this is verbatim, except command -> Unicode
  }
}
```
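For concreteness, a hypothetical call under that shape, consistent with the `{T}est` example further down (`parse` is a stand-in name for the actual entry point):

```js
// Hypothetical usage; 'parse' stands in for the parser's actual entry point.
const input = '@book{label, title = "{T}est" }';
console.log(parse(input));
// => [ { type: 'book', label: 'label', properties: { title: '{T}est' } } ]
```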
Verbatim fields can't be parsed outside the grammar, because verbatim fields have a different parsing mode; the grammar has to know about them. EDTF dates can indeed be done later, but sentence casing and name parsing must be done at the AST level.
It's not just the speed difference: the BBT parser (and biblatex-csl-converter) preserve the intended meaning structurally better than the others in the list.
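To make the name-list point concrete, here's a minimal sketch (my own illustration, not code from either parser) of why splitting has to be brace-aware; a plain string split breaks brace-protected names:

```js
// Illustration: splitting a BibTeX name list on " and " while respecting
// brace depth, so brace-protected (e.g. corporate) names stay intact.
function splitNames(value) {
  const names = [];
  let depth = 0;
  let current = '';
  let i = 0;
  while (i < value.length) {
    if (value[i] === '{') depth++;
    if (value[i] === '}') depth--;
    if (depth === 0 && value.startsWith(' and ', i)) {
      names.push(current);
      current = '';
      i += 5;
      continue;
    }
    current += value[i];
    i++;
  }
  names.push(current);
  return names;
}

'Jane Doe and {Smith and Jones, Inc.}'.split(' and ');
// => [ 'Jane Doe', '{Smith', 'Jones, Inc.}' ]   (wrong)
splitNames('Jane Doe and {Smith and Jones, Inc.}');
// => [ 'Jane Doe', '{Smith and Jones, Inc.}' ]  (intended)
```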
This is now fixed.
(I tried parsing syntax.bib, but it'd require changes to the astrocite parser, and it doesn't seem to be valid bib(la)tex anyway; Overleaf chokes on it, at least.)
Right, I mixed that up. It's in #3 (checkbox 3).
Braces in values are kept for that reason (except around some diacritic commands, as
True, but I think this is good enough for my intended purposes. I can try to add a switch for it to return an AST; given the structure of the parser, I think that should be very possible. Anyway, I updated the README to mention the AST capabilities.
Sure. But it's more complicated than that; the braces usually, but not always, mean nocase. See https://retorque.re/zotero-better-bibtex/support/faq/#why-the-double-braces for some examples and links to details. And then there's still the point that lists (literal lists and names) can only be properly distinguished at the grammar level.
I don't know what Bib.TXT is BTW.
Can't argue with that of course, but then "complete" doesn't mean a whole lot. But at the least, footnote 5 has been fixed now, unless there are more diacritics I missed.
Cool. BTW, if name parsing and the meaning of braces (nocase or not) happen inside the parser, and the parser also converts markup (such as superscript, emph, etc.), an AST may not be required. But I found it easier to do those by transforming the AST; that's actually what the BBT parser adds to the astrocite parser. The actual grammar is just that of astrocite, although I did make changes to the astrocite parser to be able to parse my test suite. My parser also adds a simple form of error recovery, BTW: the astrocite parser is an all-or-nothing parser, whereas the BBT parser will parse entries one by one and give some info on entries that failed to parse.
Sorry, Bib.TXT is just a reskin of BibTeX. I don't think it has gotten much use, but the premise is that it supports Unicode and presents the key/value pairs in a different way, while the values, in theory, stay the same. I say that now, and that's how I implemented it, but I don't really have a way of knowing; that website is my only point of reference. Anyway, if the values stay the same, there are basically two ways of presenting them: Bib(La)TeX and Bib.TXT. My parser only levels the ground between the two; the rest is in the mapping, i.e. the Bib(La)TeX/Bib.TXT-to-CSL mapping.
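Going from memory of the spec (so treat the exact syntax as an assumption), a Bib.TXT entry is a bracketed citation key followed by plain key/value lines, with the values being the same strings Bib(La)TeX would use:

```
[label]
type: book
title: {T}est
```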
I had something like that in my previous parser; I'll see how I can fit it into this one. I guess braces still have to be paired for yours?
I mean... if you're leaning that way, wouldn't TOML or YAML make more sense? At least the more naive parsers (which can sometimes be useful) become trivial.
It may be that we see the meaning of "values" differently. For a title, HTML markup will mostly do, as long as the actual intent (which is, as noted, non-trivial) comes through. But name-lists and literal-lists are not strings, they're lists of strings, and you can't safely deduce where they're to be broken into parts without passing on the structure.
An unclosed open brace will consume all the input after it, yes, but all other errors (including unexpected closing braces) will skip ahead to the first
This mixes tokens (lowercase) and rules (capitalized), but that could be changed as long as there are no naming conflicts.

```bib
@book{label,
  title = "{T}est"
}
```

```js
{
  kind: 'Main',
  loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 33, line: 3, col: 2 } },
  children: [
    {
      kind: 'Entry',
      loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 33, line: 3, col: 2 } },
      children: [
        { kind: 'at', loc: { start: { offset: 0, line: 1, col: 1 }, end: { offset: 1, line: 1, col: 2 } }, value: '@' },
        { kind: 'dataEntryType', loc: { start: { offset: 1, line: 1, col: 2 }, end: { offset: 5, line: 1, col: 6 } }, value: 'book' },
        { kind: 'lbrace', loc: { start: { offset: 5, line: 1, col: 6 }, end: { offset: 6, line: 1, col: 7 } }, value: '{' },
        { kind: 'label', loc: { start: { offset: 6, line: 1, col: 7 }, end: { offset: 11, line: 1, col: 12 } }, value: 'label' },
        { kind: 'comma', loc: { start: { offset: 11, line: 1, col: 12 }, end: { offset: 12, line: 1, col: 13 } }, value: ',' },
        {
          kind: '_',
          loc: { start: { offset: 12, line: 1, col: 13 }, end: { offset: 15, line: 2, col: 2 } },
          children: [
            { kind: 'whitespace', loc: { start: { offset: 12, line: 1, col: 13 }, end: { offset: 15, line: 2, col: 2 } }, value: '\n  ' }
          ],
          value: undefined
        },
        {
          kind: 'EntryBody',
          loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 32, line: 3, col: 0 } },
          children: [
            {
              kind: 'Field',
              loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 32, line: 3, col: 0 } },
              children: [
                { kind: 'identifier', loc: { start: { offset: 15, line: 2, col: 3 }, end: { offset: 20, line: 2, col: 8 } }, value: 'title' },
                {
                  kind: '_',
                  loc: { start: { offset: 20, line: 2, col: 8 }, end: { offset: 21, line: 2, col: 9 } },
                  children: [
                    { kind: 'whitespace', loc: { start: { offset: 20, line: 2, col: 8 }, end: { offset: 21, line: 2, col: 9 } }, value: ' ' }
                  ],
                  value: undefined
                },
                { kind: 'equals', loc: { start: { offset: 21, line: 2, col: 9 }, end: { offset: 22, line: 2, col: 10 } }, value: '=' },
                {
                  kind: '_',
                  loc: { start: { offset: 22, line: 2, col: 10 }, end: { offset: 23, line: 2, col: 11 } },
                  children: [
                    { kind: 'whitespace', loc: { start: { offset: 22, line: 2, col: 10 }, end: { offset: 23, line: 2, col: 11 } }, value: ' ' }
                  ],
                  value: undefined
                },
                {
                  kind: 'Expression',
                  loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 32, line: 3, col: 0 } },
                  children: [
                    {
                      kind: 'ExpressionPart',
                      loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 31, line: 2, col: 19 } },
                      children: [
                        {
                          kind: 'QuoteString',
                          loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 31, line: 2, col: 19 } },
                          children: [
                            { kind: 'quote', loc: { start: { offset: 23, line: 2, col: 11 }, end: { offset: 24, line: 2, col: 12 } }, value: '"' },
                            {
                              kind: 'Text',
                              loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 27, line: 2, col: 15 } },
                              children: [
                                {
                                  kind: 'BracketString',
                                  loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 27, line: 2, col: 15 } },
                                  children: [
                                    { kind: 'lbrace', loc: { start: { offset: 24, line: 2, col: 12 }, end: { offset: 25, line: 2, col: 13 } }, value: '{' },
                                    {
                                      kind: 'Text',
                                      loc: { start: { offset: 25, line: 2, col: 13 }, end: { offset: 26, line: 2, col: 14 } },
                                      children: [
                                        { kind: 'text', loc: { start: { offset: 25, line: 2, col: 13 }, end: { offset: 26, line: 2, col: 14 } }, value: 'T' }
                                      ],
                                      value: 'T'
                                    },
                                    { kind: 'rbrace', loc: { start: { offset: 26, line: 2, col: 14 }, end: { offset: 27, line: 2, col: 15 } }, value: '}' }
                                  ],
                                  value: 'T'
                                }
                              ],
                              value: '{T}'
                            },
                            {
                              kind: 'Text',
                              loc: { start: { offset: 27, line: 2, col: 15 }, end: { offset: 30, line: 2, col: 18 } },
                              children: [
                                { kind: 'text', loc: { start: { offset: 27, line: 2, col: 15 }, end: { offset: 30, line: 2, col: 18 } }, value: 'est' }
                              ],
                              value: 'est'
                            },
                            { kind: 'quote', loc: { start: { offset: 30, line: 2, col: 18 }, end: { offset: 31, line: 2, col: 19 } }, value: '"' }
                          ],
                          value: '{T}est'
                        }
                      ],
                      value: '{T}est'
                    },
                    {
                      kind: '_',
                      loc: { start: { offset: 31, line: 2, col: 19 }, end: { offset: 32, line: 3, col: 0 } },
                      children: [
                        { kind: 'whitespace', loc: { start: { offset: 31, line: 2, col: 19 }, end: { offset: 32, line: 3, col: 0 } }, value: '\n' }
                      ],
                      value: undefined
                    }
                  ],
                  value: '{T}est'
                }
              ],
              value: [ 'title', '{T}est' ]
            }
          ],
          value: { title: '{T}est' }
        },
        { kind: 'rbrace', loc: { start: { offset: 32, line: 3, col: 1 }, end: { offset: 33, line: 3, col: 2 } }, value: '}' }
      ],
      value: { type: 'book', label: 'label', properties: { title: '{T}est' } }
    }
  ],
  value: [ { type: 'book', label: 'label', properties: { title: '{T}est' } } ]
}
```
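Filtering the token nodes out afterwards should be straightforward; a sketch (mine, not part of the parser), assuming the node shape shown above:

```js
// Sketch: drop token nodes (lowercase kinds, including '_' whitespace) and
// keep only rule nodes (capitalized kinds), recursively.
function pruneTokens(node) {
  const isRule = (n) => /^[A-Z]/.test(n.kind);
  const children = (node.children || []).filter(isRule).map(pruneTokens);
  return { kind: node.kind, loc: node.loc, value: node.value, children };
}
```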
What is being mixed? I don't understand. This is the AST produced by the new idea parser?
Yes.
I'm using a tokenizer (moo) which splits up the text into parts like
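For context, a minimal moo lexer looks roughly like this; the token set below is illustrative only (the real lexer uses lexer states and more token types):

```js
const moo = require('moo');

// Illustrative token set, not the actual grammar used in this project.
const lexer = moo.compile({
  whitespace: { match: /[ \t\n]+/, lineBreaks: true },
  at:     '@',
  lbrace: '{',
  rbrace: '}',
  equals: '=',
  comma:  ',',
  quote:  '"',
  text:   /[^@{}=,"\s]+/,
});

lexer.reset('@book{label, title = "{T}est" }');
for (const token of lexer) {
  console.log(token.type, JSON.stringify(token.value));
}
```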
I see. But as far as I can tell, the tokens should be easy enough to filter out, and that should leave a fairly clean nested AST, which I could then inspect and transform. Can I play with this? I am curious what
How would I add test cases to the idea parser? The first thing is, I'd be curious to see if my existing tests parse at all. Error recovery is separate in my parser BTW; if it can be built into the idea parser it will almost certainly be faster, but if not, I could just keep my existing one. The error recovery works by chunking the input into individual entries/strings/comments, then parsing these individually with the astrocite parser, and then reassembling the results (among other things by replacing references to
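A rough sketch of that chunk-and-reassemble approach as described above (`astrociteParse` is a stand-in name, and a real chunker needs to be smarter than a line-anchored regex split):

```js
// Sketch of chunked error recovery: split the input at entry boundaries,
// parse each chunk on its own, and report failures instead of aborting.
function parseWithRecovery(input, astrociteParse) {
  // Naive chunker: an entry starts at an '@' at the beginning of a line.
  const chunks = input.split(/^(?=@)/m);
  const entries = [];
  const errors = [];
  for (const chunk of chunks) {
    if (!chunk.trim()) continue;
    try {
      entries.push(...astrociteParse(chunk));
    } catch (error) {
      errors.push({ chunk: chunk.slice(0, 40), message: error.message });
    }
  }
  return { entries, errors };
}
```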
I'll push the changes to
In principle just by adding files to the
to get a single file's AST output. Note that those can be pretty long; longer than my terminal's scrollback, anyway. For the sake of brevity, the updated test suite only prints success on success.
if I do
I get
which isn't what I expected. Should this have been the AST?
Did you run
Right, now it gives me the AST.
It parses most of my test suite files, with these exceptions:
Cleanup of the AST will be a bit of work; I'll take a look over the weekend.
I'll look at the test results this weekend as well.
One other thing the chunker adds is optional async, BTW. It's not really "background" async, but it will yield to the event loop after every chunk, which allows other tasks to interleave with parsing.
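A minimal sketch of that kind of cooperative yielding (my illustration, not the actual chunker code; `parseChunk` is a stand-in):

```js
// Sketch: yield to the event loop between chunks so a long parse doesn't
// starve other tasks (I/O callbacks, UI events) of run time.
async function parseAllChunks(chunks, parseChunk) {
  const results = [];
  for (const chunk of chunks) {
    results.push(parseChunk(chunk));
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return results;
}
```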
Oh, and wrt verbatim fields: Mendeley gets this wrong for eg
At one time, EndNote also exported items without citation keys. There's a ton of real-life crap in my test suite; just because it parses doesn't necessarily mean the meaning is extracted properly.
BTW, I've put together a quick test runner based on benchmark.js, and the numbers shift a little; some better, some worse: https://gist.github.com/retorquere/79fb0ad7062a85a1d83e4b004d40985e
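For anyone reproducing the numbers, a benchmark.js suite is only a few lines; the parse functions below are stand-ins for the actual parsers under test:

```js
const fs = require('fs');
const Benchmark = require('benchmark');

// Stand-in parse functions; substitute the real parsers being compared.
const parsers = {
  'parser-a': (input) => input.split('@').length,
  'parser-b': (input) => (input.match(/@/g) || []).length,
};

const input = fs.readFileSync('test.bib', 'utf8'); // any sample .bib file

const suite = new Benchmark.Suite();
for (const [name, parse] of Object.entries(parsers)) {
  suite.add(name, () => parse(input));
}
suite.on('cycle', (event) => console.log(String(event.target))).run();
```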
Good idea, I'll make them configurable when I implement it.
Cool! I'll add the figures (and/or the test suite) to the repo.
Another thing (just added to
The BBT parser has been updated. On two of my test files, at least, astrocite runs out of memory, whereas my parser parses them correctly (if slowly; they're 8.2MB and 11MB respectively).
Does the citation-js parser handle verbatim fields (like
A few things were recently fixed in the BBT parser that citation-js may not yet be aware of:
BBT has its own AST parser now, which is based on a version of the astrocite grammar but has seen substantial (and incompatible) changes since. It still seems strange to me to label parsers "complete" merely because they don't crash. Name parsing, verbatim fields, title sentence-casing, and command-argument handling are all crucial parts of parsing bibtex. I'd wager that none of the "complete" parsers will parse
Nice on the updated tests! BBT 3.1.20 fixes all non-gimmick tests and some gimmick tests. What do you think the state of idea-reworked is now? Given how fast it is, I may want to build on it, but I'd need it to be able to pass my own test suite.
The main part missing from idea-reworked right now is the actual mapping to CSL or other output formats. That includes field information as well, such as
I have been working on mappings over at the aptly named bibtex-mappings; I don't remember if I linked it before. The repository contains some data text-mined from documentation (the biblatex docs are especially usable for this), to be combined with hand-crafted mappings. #3 is still pretty up to date; I have mainly been focused on fixing the test suites and the README, plus a workaround for the command concatenation gimmick. I'm trying to get fully back into it and sift through the issues and comments soon.
I understand why you'd want mapping to other objects, but I just want the parsed object (pretty much what _intoFixtureOutput delivers), and I'll take it from there, as I'm targeting conversion to Zotero objects specifically. The command concatenation gimmick would be pretty difficult to address in my parser, but to me that wouldn't be any kind of priority; it's interesting to see that your parser can deal with it successfully, but it's not something I expect to see in the wild. #3 still has a long list of stuff I absolutely need in the todo list, so I'd have to wait on that. I'm subscribed to the issue, but I won't be notified of edits, just of new comments.
Do you happen to have some documentation for the extended name format? I am working on name parsing now, and I did find section 3.6 (Data Annotations) in the BibLaTeX manual, but that's slightly different from how the feature fixture you added works.
I don't have docs handy, no, and maybe I misunderstood it when I built it. What difference do you see?
Apparently, what you have works, but I have not found it in the manual yet. I did find this, on page 82 of http://mirrors.ctan.org/macros/latex/contrib/biblatex/doc/biblatex.pdf:
But the name parts are not overwritten by the annotation.
There's an example of what you implemented here: https://github.com/plk/biblatex/blob/dev/doc/latex/biblatex/examples/93-nameparts.tex |
Looking at the docs, I don't think they're meant to overwrite name parts? They add annotations to specific name parts, and those annotations can be used in specialized styles; I've only seen this used in annotated bibliographies myself.
I updated the feature fixtures to include all the name parts instead of just the last name, and on one I encountered unexpected
I think you're right; still a bit confused about the annotation in the example, though. Why would someone annotate specifically the family part of the name with "student"?
I can't say with certainty, but this looks to me like a synthetic sample meant to show what's possible with annotations, rather than a sample from an actual annotated bibliography.
Those 0004 chars should not be in the output; I'll look into that.
If it helps, I saw it when there were braces in explicit name-part values in the extended name format:

```bib
@article{test,
  author = {family=Duchamp, given=Philippe, given-i={Ph}}
}
```
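For what it's worth, a minimal sketch of splitting such an extended-format name into parts (my illustration, not either parser's code; it assumes no commas nested inside braced values):

```js
// Sketch: parse 'family=Duchamp, given=Philippe, given-i={Ph}' into
// { family: 'Duchamp', given: 'Philippe', 'given-i': 'Ph' }.
function parseExtendedName(name) {
  const parts = {};
  for (const part of name.split(',')) {
    const [key, ...rest] = part.split('=');
    let value = rest.join('=').trim();
    // Strip one level of protective braces, e.g. '{Ph}' -> 'Ph'.
    if (value.startsWith('{') && value.endsWith('}')) value = value.slice(1, -1);
    parts[key.trim()] = value;
  }
  return parts;
}
```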
Thanks, that is fixed in the latest release.
I'm also tinkering with chevrotain to remove a pass from my parser.
Cool! I think I might have heard of chevrotain before but I do not recognize the website... the uppercase function names seem familiar though.
I've tried chevrotain, but if your test results are anything to go by, your parser is 2-3 times faster than my lexer alone. I can't replicate your results because
I see I have not updated those READMEs in a while; I've mainly been focusing on the automated test suite. I do not like the current tokenizer, as it encodes a lot of state, but I believe it works well in terms of speed (you would have to test that, though; it still feels weird to me that that code is actually faster). I used my own tokenizer before, and I was thinking of making something similar to replace
my lexer takes 5s or
What's the downside of using
Given the current state of things, I'd be better off either helping build out citation-js, or even just using citation-js as a lexer. How do you feel about TypeScript?
I would like to try it out at some point, but I also want the citation-js parser to be part of
I wouldn't mind cooperating on a parser, especially seeing how fast yours is, if that's of any interest to you of course. I don't want to rely on a parser I'm not personally involved in, though; my users are often on tight deadlines, and I'd prefer to be able to roll out fixes quickly when necessary. I wouldn't want to incur functional loss against the BBT parser though, and I've grown really fond of TypeScript; it has prevented so many problems at compile time for me over the years.
It does sound interesting, but right now I feel like the parsers are incompatible in that sense (and maybe our goals as well). If it is okay with you, I will get back to you about this later.
Some more thoughts: right now my parser lexes everything with
But it might also not: somehow, I guess this would be the structure of a more normal parser?
I am really curious now why mine is faster. Maybe the lack of back-tracking? I keep being afraid I messed up the benchmarks, though.
Sure. I'm going to tinker with it to see if it's a better base than my lexer in the interim.
That's what my lexer does too.
What is special about
In my chevrotain attempt, the same, and I think by necessity, unless you use kludges for things like name fields (of the kind my main parser has right now).
I don't know how that could make a difference? Are iterators especially efficient at generating tokenized text?
Yeah but if I understand correctly,
That could well be it. I test multiple regexes per state. My lexer did already pick out commands though.
I'm not sure what this says, sorry.
My lexer doesn't back-track, so yours being faster seems structural.
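To make the "multiple regexes per state" point concrete, this is roughly what a stateful moo lexer looks like (illustrative token sets only, not the ones either of us uses); each state carries its own rule list, and braces push/pop between states:

```js
const moo = require('moo');

// Illustrative two-state lexer: '{' enters a braced state with different
// rules (e.g. for verbatim-ish content), '}' returns to the outer state.
const lexer = moo.states({
  main: {
    lbrace: { match: '{', push: 'braced' },
    ws:     { match: /[ \t\n]+/, lineBreaks: true },
    text:   /[^{}\s]+/,
  },
  braced: {
    lbrace:   { match: '{', push: 'braced' },
    rbrace:   { match: '}', pop: 1 },
    verbatim: { match: /[^{}]+/, lineBreaks: true },
  },
});

lexer.reset('outer {inner {nested}} outer');
for (const token of lexer) console.log(token.type, token.value);
```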
WRT the issues reported here on the BBT parser:
The sample below imports in BBT since 5.1.154
but the concat part of the title imported before that too.