Progress on the active parser ("citationjs") #3

Open
10 tasks done
larsgw opened this issue Oct 12, 2019 · 13 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@larsgw
Member

larsgw commented Oct 12, 2019

2020-09-15: update below

One big problem is the question of what should be parsed when parsing syntax, and what should be parsed when mapping to CSL. Consider also that Bib.TXT should be able to use the same mapping.

  • diacritics: when parsing syntax, as Bib.TXT and some BibTeX implementations support UTF-8
  • other known symbol commands and ligatures: when parsing syntax
  • exception: fields tagged as verbatim or url in the specification should not have commands parsed, which means the syntax parser has to know about all the different fields
  • although field data is available, URL escaping should be handled when mapping since Bib.TXT should probably have that behavior too
  • name field parsing: should probably be when mapping
  • list field parsing (splitting on " and "): should probably be when mapping
  • markup: should be done when mapping, as markup differs between formats
  • crossref: should be done when mapping
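The verbatim exception in the list above can be sketched as follows. This is a minimal illustration, not the parser's actual code: `VERBATIM_FIELDS`, `DIACRITICS` and `parseFieldValue` are hypothetical names, and both tables are small assumed subsets.

```javascript
// Hypothetical classification; a real list would follow the BibLaTeX data model.
const VERBATIM_FIELDS = new Set(['url', 'doi', 'file', 'verba', 'verbb', 'verbc'])

// Small illustrative subset of accent commands mapped to combining characters.
const DIACRITICS = { '"': '\u0308', "'": '\u0301', '`': '\u0300', '^': '\u0302', '~': '\u0303' }

function parseFieldValue (field, raw) {
  // Verbatim-like fields are returned untouched: no command parsing at all.
  if (VERBATIM_FIELDS.has(field.toLowerCase())) {
    return raw
  }
  // Handles \"{o} and \"o; the braces of the argument are consumed as a pair.
  return raw.replace(
    /\\(["'`^~])(?:\{([A-Za-z])\}|([A-Za-z]))/g,
    (_, accent, braced, bare) => ((braced || bare) + DIACRITICS[accent]).normalize('NFC')
  )
}
```

With this sketch, a `url` value such as `https://example.org/\~user` is left alone, while the same text in a `title` field would have `\~u` turned into an accented letter. That difference is why the syntax stage needs to know the field types.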

Less crucial things, maybe:

  • let people extend the constants (i.e., add commands, diacritics, ligatures)
  • (only) warn for mis-matched entry brackets

larsgw added the bug and enhancement labels Oct 12, 2019
larsgw mentioned this issue Nov 7, 2019

@larsgw
Member Author

larsgw commented May 7, 2020

I have a different mental model of this now, with two stages of parsing: one of entries, and one of values. name (and date) field parsing is still at mapping for the moment, but verbatim and uri as well as markup and list fields are at stage two of parsing now, and implemented.

@retorquere
Contributor

What does the first stage do? It looks like it does more than tokenization, but assuming the first stage also does command -> unicode mapping, I'm not clear on how you brought verbatim parsing to the 2nd stage. What does > show up as in the 2nd stage?

@larsgw
Member Author

larsgw commented May 8, 2020

Yeah, the first part is full parsing of the file syntax, resolving @strings and concatenation, resulting in a data structure with parsed entries but verbatim values. The second part is parsing markup, commands etc. That also has to include name and date parsing because in the second part, all brackets are "resolved". Both parts include tokenization and further parsing.
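A rough sketch of the @string and concatenation step, under stated assumptions: the right-hand side of a field is taken to be already tokenized into `parts`, and `resolveValue` plus the part shapes are hypothetical names for illustration. Throwing on an undefined @string is a choice made here, not something the thread specifies.

```javascript
// `parts` stands for a tokenized value such as: jan # " 2020"
// -> [{ type: 'ref', value: 'jan' }, { type: 'text', value: ' 2020' }]
function resolveValue (parts, strings) {
  return parts.map(part => {
    if (part.type === 'text') {
      return part.value // kept verbatim, commands and braces included
    }
    const key = part.value.toLowerCase() // @string names are case-insensitive
    if (!strings.has(key)) {
      throw new Error(`Undefined @string "${part.value}"`)
    }
    return strings.get(key)
  }).join('')
}

const strings = new Map([['jan', 'January']])
const value = resolveValue(
  [{ type: 'ref', value: 'jan' }, { type: 'text', value: ' 2020' }],
  strings
)
```

The result is a single string per field that still contains any commands and braces, which the second part would then parse.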

All in all, not optimal so I might have to rethink it. But that would probably also involve rethinking the tokenizer (just category codes?) and as a result the whole grammar.

I also still have to make special cases for commands like frac and vphantom. And sentence casing I guess, though that still feels like something that would be part of a formatter, not a parser.

@retorquere
Contributor

retorquere commented May 8, 2020

But literal lists have the same issue as name lists. With "list fields" do you mean "literal lists"?

Why would frac require a special case? There's nothing special about frac, it's just a two-argument command.

Sentence casing has to know about bracketing, and commands, because the brackets and the placement of commands within them influence the result in fairly intricate ways. If just the end-result is sentence-cased after all else is done, the results won't be right.

@larsgw
Member Author

larsgw commented May 8, 2020

> But literal lists have the same issue as name lists. With "list fields" do you mean "literal lists"?

I mean any list field. Brackets are kept to the second stage, and while I did implement top-level " and " splitting, so to speak, I didn't realize that needed to be done for name fields as well until after posting the comment.
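Top-level " and " splitting could look roughly like the sketch below. `splitList` is a hypothetical name, and the sketch assumes lowercase `and` surrounded by whitespace and ignores escaped braces.

```javascript
function splitList (value) {
  const separator = /\s+and\s+/y // sticky: only matches at lastIndex
  const items = []
  let current = ''
  let depth = 0
  let i = 0
  while (i < value.length) {
    const char = value[i]
    if (char === '{') depth++
    else if (char === '}') depth = Math.max(0, depth - 1)

    // Only split when outside all braces.
    if (depth === 0) {
      separator.lastIndex = i
      const match = separator.exec(value)
      if (match) {
        items.push(current)
        current = ''
        i += match[0].length
        continue
      }
    }
    current += char
    i++
  }
  items.push(current)
  return items
}
```

For example, `splitList('Jane Doe and {Barnes and Noble} and Smith, John')` yields three items, with the braced `and` left intact. The same function would apply to name fields and literal lists alike, since both need the brace depth.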

> Why would frac require a special case? There's nothing special about frac, it's just a two-argument command.

Because I currently don't have any argument commands yet, apart from formatting which I do special-case (and would like to keep doing so). So I'll probably use formatting commands, symbol commands and "other" commands implemented as functions in tandem.
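One way to picture formatting commands, symbol commands and "other" commands living side by side is the sketch below. All table contents and function names are assumptions for illustration; escaped braces and TeX's whitespace rules after control words are ignored.

```javascript
// Illustrative tables only, not the parser's real constants.
const SYMBOL_COMMANDS = new Map([['ss', 'ß'], ['ae', 'æ']])
const FORMATTING_COMMANDS = new Map([['emph', 'i'], ['textbf', 'b']])
const ARGUMENT_COMMANDS = new Map([
  ['frac', { arity: 2, apply: (numerator, denominator) => `${numerator}/${denominator}` }]
])

// Reads one braced argument starting at `start`; returns its raw text and the index after it.
function readArgument (input, start) {
  if (input[start] !== '{') throw new Error(`Expected "{" at index ${start}`)
  let depth = 0
  for (let i = start; i < input.length; i++) {
    if (input[i] === '{') depth++
    else if (input[i] === '}' && --depth === 0) {
      return { raw: input.slice(start + 1, i), end: i + 1 }
    }
  }
  throw new Error('Unbalanced braces')
}

function convert (input) {
  const command = /\\([A-Za-z]+)/y
  let output = ''
  let i = 0
  const nextArgument = () => {
    const { raw, end } = readArgument(input, i)
    i = end
    return convert(raw) // arguments may contain commands themselves
  }
  while (i < input.length) {
    command.lastIndex = i
    const match = command.exec(input)
    if (!match) {
      output += input[i++]
      continue
    }
    const name = match[1]
    i += match[0].length
    if (SYMBOL_COMMANDS.has(name)) {
      output += SYMBOL_COMMANDS.get(name)
    } else if (FORMATTING_COMMANDS.has(name)) {
      const tag = FORMATTING_COMMANDS.get(name)
      output += `<${tag}>${nextArgument()}</${tag}>`
    } else if (ARGUMENT_COMMANDS.has(name)) {
      const { arity, apply } = ARGUMENT_COMMANDS.get(name)
      const args = []
      for (let n = 0; n < arity; n++) args.push(nextArgument())
      output += apply(...args)
    } else {
      output += match[0] // unknown command: keep as typed
    }
  }
  return output
}
```

In this layout `\frac` needs no special case in the loop itself; it is one more entry in the argument-command table, while formatting commands keep their own branch.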

> Sentence casing has to know about bracketing, and commands, because the brackets and the placement of commands within them influence the result in fairly intricate ways. If just the end-result is sentence-cased after all else is done, the results won't be right.

The brackets would be translated to <span class="nocase"> (unless that's insufficient?), which would work for CSL. Things like BibTeX → RIS would not, so that'd be a good reason. Can you give an example of the command thing? Do you mean command at the start of a word?

@retorquere
Contributor

retorquere commented May 8, 2020

I guess it could be corrected if enough information travels along? It gets complicated fairly quickly though

  • {\r AB} is equivalent to ÅB
  • {ÅB} is equivalent to <span class="nocase">ÅB</span>.
  • {AB} is equivalent to <span class="nocase">AB</span>.
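The distinction in this list can be sketched as below, assuming that a top-level brace group whose content starts with a backslash is treated as a command wrapper rather than case protection. `protectCase` is a hypothetical name; commands are left unresolved for a later step, and nested or escaped braces get no special treatment.

```javascript
function protectCase (input) {
  let output = ''
  let i = 0
  while (i < input.length) {
    if (input[i] !== '{') {
      output += input[i++]
      continue
    }
    // Find the brace that closes this top-level group.
    let depth = 0
    let end = i
    for (; end < input.length; end++) {
      if (input[end] === '{') depth++
      else if (input[end] === '}' && --depth === 0) break
    }
    if (end === input.length) throw new Error('Unbalanced braces')

    const content = input.slice(i + 1, end)
    // A group that starts with a command does not protect case.
    output += content.startsWith('\\')
      ? content
      : `<span class="nocase">${content}</span>`
    i = end + 1
  }
  return output
}
```

With this sketch, `{AB}` becomes a `nocase` span, while a group such as `{\"a}` is unwrapped without protection.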

If you don't parse with arguments, how do you parse \c{\u{E}} or \href?

@retorquere
Contributor

Some of the command-and-brace interaction is described at https://retorque.re/zotero-better-bibtex/support/faq/#why-the-double-braces

@larsgw
Member Author

larsgw commented May 8, 2020

Right, diacritics are a special case too. \href used to be a symbol command producing no output and the argument would turn into regular case protection, but I realise I accidentally removed that just now.

I see the BibLaTeX manual also has documentation about their algorithm for sentence casing (pages 253-255). However, it does not say exactly when it is applied: it seems to be part of specific citation styles.

@retorquere
Contributor

Styles decide for themselves whether to apply sentence casing, but given the complexities of sentence casing in bib(la)tex, I can't imagine that each style actually implements it itself.

@larsgw
Member Author

larsgw commented May 8, 2020

No, BibLaTeX has a helper for that (\MakeSentenceCase). However, I don't know for which fields that is applied ("title-like fields" doesn't tell me that much), and each style can decide whether to apply the helper. I meant that usage of the helper is specific to styles, not the backend or something.

@retorquere
Contributor

Ah yes, that is true, but it remains that bib(la)tex expects the input to be title cased, and that CSL expects it in sentence case.
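A deliberately crude sketch of that conversion, assuming case protection has already been carried along as `nocase` spans. `toSentenceCase` is a hypothetical name; the sketch ignores subtitles after a colon, nested spans and commands, so it only shows why the protection markers must survive until casing is applied.

```javascript
function toSentenceCase (title) {
  const PROTECTED = /(<span class="nocase">.*?<\/span>)/
  // With a capturing group, split() returns plain text at even indexes
  // and the protected spans at odd indexes.
  return title.split(PROTECTED).map((part, index) => {
    if (index % 2 === 1) return part // protected: keep as typed
    const lower = part.toLowerCase()
    return index === 0 ? lower.charAt(0).toUpperCase() + lower.slice(1) : lower
  }).join('')
}
```

For instance, `The History Of <span class="nocase">NASA</span> And Its Rockets` keeps `NASA` intact while the remaining words are lowercased.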

@larsgw
Member Author

larsgw commented Sep 15, 2020

So an update for the interested:

Note: Stage 1 is parsing files into in-memory structures, stage 2 is mapping that structure to CSL-JSON.

  • name, list and separated fields are now parsed in stage 1, as well as verbatim and url fields
  • different tokenization rules for different field types help maintain performance
  • date fields are parsed in stage 2: syntax is EDTF (not specific to bibtex) and does not benefit from parsing in stage 1
  • case protection is not fully implemented yet (specifically the case protection inside formatting commands takes some extra thought)
  • same goes for applying sentence case
  • markup is also parsed during stage 1 now, since putting that in stage 2 is not feasible
  • crossref will be a special case between stage 1 and 2 (basically stage 2 but not in the actual mapping itself, more of a pre-processing directive)
  • constants will be able to be edited in the plugin config
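The crossref step between the two stages might be pictured as a pre-processing pass like the one below. The entry shape (`type`, `label`, `properties`) and the name `resolveCrossrefs` are assumptions for illustration; the sketch copies parent fields as-is, follows only one level, and does not model BibLaTeX's field-renaming inheritance rules.

```javascript
function resolveCrossrefs (entries) {
  const byLabel = new Map(entries.map(entry => [entry.label, entry]))
  return entries.map(entry => {
    const parent = byLabel.get(entry.properties.crossref)
    if (!parent) return entry
    // The child's own fields win over inherited ones.
    return { ...entry, properties: { ...parent.properties, ...entry.properties } }
  })
}

const resolved = resolveCrossrefs([
  { type: 'inbook', label: 'chapter1', properties: { title: 'A Chapter', crossref: 'book1' } },
  { type: 'book', label: 'book1', properties: { title: 'A Book', publisher: 'Some Press' } }
])
```

Running this before the actual mapping keeps the mapping itself free of lookups across entries.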

@larsgw
Member Author

larsgw commented Oct 21, 2020

Going through the issues.

Checked

TODO list:

  • try the BBT test suite (link)
  • annotations? (link)
  • \frac, \href (link)
  • \emph takes only the first character of the subsequent text (tokenizer thing?)
  • JabRef groups?
  • check for major simplification of tokenizer(s?)
  • crude recovery? (link)
  • \vphantom (link)
  • figure out the deal with \c{aa} (link)
  • \enquote, maybe (link)

larsgw changed the title from "Some problems with idea-reworked" to "Progress on the active parser ("citationjs")" Oct 21, 2020
larsgw added a commit that referenced this issue Oct 22, 2020