idea-reworked: diacritics misplaced #10

retorquere · 2019-11-13T00:24:18Z

When parsing

publisher =  { D{\u{o}}\}\"ead Po$_{eee}$t Society},

the mark that should be above the o ends up above the }


larsgw · 2019-11-13T07:08:34Z

I assume you tried this in a terminal? I have had rendering issues in mine. Can you try copying the output to somewhere else? Otherwise, maybe I'm overcorrecting.

retorquere · 2019-11-13T07:20:03Z

Nope. When I do

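// Load the reworked parser and read the test file as UTF-8 text
const parser = require('./lib/idea-reworked')
const fs = require('fs')
const parsed = parser.parse(fs.readFileSync('test/files/syntax.bib', 'utf-8'))
// Write the raw publisher value to a file so no terminal rendering is involved
fs.writeFileSync('parsed.txt', parsed[0].properties.publisher)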

I get

$ od -c parsed.txt 
0000000        D   {   {   o   }    ̆  **   }   }   e    ̈  **   a   d    
0000020    P   o 342 202 221   e   e   t       S   o   c   i   e   t   y
0000040
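The `od -c` dump is hard to read where combining marks are involved. A minimal sketch for listing code points instead, so it is easier to see which character a mark follows; `codePoints` is a hypothetical helper for illustration, not part of either parser.

```javascript
// Hypothetical helper (not part of the parser): list the code points of a
// string so that combining marks (U+0300..U+036F) show up as separate entries.
function codePoints(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0)
    const hex = cp.toString(16).toUpperCase().padStart(4, '0')
    // Only the basic Combining Diacritical Marks block is checked here
    const combining = cp >= 0x0300 && cp <= 0x036f
    return { char: ch, codePoint: `U+${hex}`, combining }
  })
}

// "o" followed by COMBINING BREVE: the mark applies to the preceding "o"
console.log(codePoints('o\u0306'))
```

A combining mark always attaches to the character before it, so a mark listed right after `}` is the misplacement reported here.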

larsgw · 2019-11-13T08:50:17Z

Oh, I think I get it. I'll take a look this afternoon then.

retorquere · 2019-11-13T08:54:34Z

BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

retorquere · 2019-11-13T08:55:29Z

The macOS console renders such characters properly in my experience, BTW.

larsgw · 2019-11-13T09:11:53Z

> BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

True, I guess parsing that immediately is unavoidable.

larsgw · 2020-03-10T23:09:38Z

Ah, seems this was fixed in c60e7b6 (it appended to the Text rule which included the braces because of reasons that should be fixed now). Also, it now appends the diacritics to the first character in the string, as it seems it should.
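A minimal sketch of "append the diacritic to the first character", assuming the argument has already been reduced to plain text with its braces stripped. The `COMBINING` table and `applyDiacritic` are hypothetical names for illustration and do not reflect the code in c60e7b6.

```javascript
// A few LaTeX accent commands mapped to Unicode combining marks
// (illustrative subset only).
const COMBINING = {
  u: '\u0306', // breve
  '"': '\u0308', // diaeresis
  c: '\u0327' // cedilla
}

// Hypothetical helper: place the mark right after the first code point of
// the argument, then compose with NFC where a precomposed form exists.
function applyDiacritic(command, argument) {
  const mark = COMBINING[command]
  if (mark === undefined) {
    throw new Error(`unknown diacritic command: ${command}`)
  }
  const chars = Array.from(argument)
  // With an empty argument there is nothing to attach to; return the bare mark
  if (chars.length === 0) return mark
  const [first, ...rest] = chars
  return (first + mark).normalize('NFC') + rest.join('')
}

console.log(applyDiacritic('u', 'oo'))
```

Attaching the mark before concatenating the rest of the text keeps it from drifting onto a later character, such as a brace.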

> BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

Well, you'd have to balance the brackets in this example, right? Anyway, I'll work on this, looks like an easy fix. \vphantom might be more difficult.

larsgw · 2020-03-10T23:18:49Z

Hm, not very happy about this:

f\u{o}o → fŏo
f\u{oo} → fŏo
f\u{{o}o} → f ̆oo

Except the last one is rendered like this:

retorquere · 2020-03-11T13:13:42Z

If I parse those with protected-section generation off, I get the desired result for all three; with it on (which is the default), it only works correctly for the first case. I'm considering changing this, but as far as I can tell, this would be a "category 2" failure. I haven't seen this in the wild, and I don't currently see why anyone would choose f\u{oo} or f\u{{o}o}.

> Well, you'd have to balance the brackets in this example right? Anyway, I'll work on this, looks like an easy fix. \vphantom might be more difficult.

The braces are actually balanced -- the \} ought to be interpreted as just text, not structuring braces, but whether it is read this way during parsing by bibtex depends on the version you're using. But for full compatibility, you can add \vphantom{\}} after it which adds a non-printing balancing brace.
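A sketch of brace counting that treats `\{` and `\}` as literal text, which is the reading described above. `braceDepth` is a hypothetical helper, not taken from either parser, and it ignores other TeX subtleties such as comments and verbatim material.

```javascript
// Hypothetical helper: return the net brace depth of a field value,
// skipping any character that follows a backslash (so \} is plain text).
function braceDepth(text) {
  let depth = 0
  for (let i = 0; i < text.length; i++) {
    const ch = text[i]
    if (ch === '\\') {
      i++ // skip the escaped character, e.g. the } in \}
    } else if (ch === '{') {
      depth++
    } else if (ch === '}') {
      depth--
      if (depth < 0) throw new Error(`unbalanced } at offset ${i}`)
    }
  }
  return depth
}

// The publisher value from the report, with backslashes escaped for JavaScript
const publisher = '{ D{\\u{o}}\\}\\"ead Po$_{eee}$t Society}'
console.log(braceDepth(publisher))
```

A counter that does not skip escapes would see the `\}` as closing the outermost group, which is the case the \vphantom workaround is meant to cover.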

retorquere · 2020-03-12T13:52:18Z

> Hm, not very happy about this:
>
> * `f\u{o}o` → fŏo
> * `f\u{oo}` → fŏo
> * `f\u{{o}o}` → f ̆oo

Turns out to be an easy fix, I have this covered now.

retorquere · 2020-03-12T23:17:17Z

The fix will also take care of constructs like \c{\u A} which I didn't handle properly before.

larsgw · 2020-03-13T00:54:17Z

Ah cool, I'll try something later. Do you mean specifically with \c in the outer layer? There are even more cases than I thought:

1. \c{\u A} → Ă̧
2. \u{\c A} → ̆A̧
3. \u{\i} → ı̆
4. \c{aa} → a̧a (i.e. cedilla on first 'a')
5. \c{\u aa} → ă ̧a (i.e. cedilla precisely in middle, no actual spaces)
6. \c{AA} → A ̧A (i.e. cedilla precisely in middle, no actual spaces)

So first of all, I think there are two classes of diacritics:

* One class (\u, \", etc.) applies to the first character only and does not work if some sort of block is given in the argument. I am not entirely sure what this means in LaTeX terms, but there is clearly a difference between (2) and (3).
* The other class (\c, \b, etc.) applies to the whole group (5, 6) OR to the first character (4), for no reason I can make out. I get that it treats the argument as a whole group in (5), but I'm honestly stumped by (6).

I guess this is just LaTeX at this point and not really Bib(La)TeX...
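For the nested cases, a sketch of what happens on the Unicode side if a parser simply attaches every mark to the same base character. `stackMarks` is a hypothetical helper; this is an assumption about one possible approach, not what LaTeX or either parser does.

```javascript
const BREVE = '\u0306'
const CEDILLA = '\u0327'

// Hypothetical helper: attach several combining marks to one base character.
// Marks are listed innermost first, so \c{\u A} would be ('A', [BREVE, CEDILLA]).
function stackMarks(base, marks) {
  // NFC reorders marks by combining class and composes where possible
  return (base + marks.join('')).normalize('NFC')
}

const cOuter = stackMarks('A', [BREVE, CEDILLA]) // \c{\u A}
const uOuter = stackMarks('A', [CEDILLA, BREVE]) // \u{\c A}
console.log(cOuter === uOuter)
```

Because the breve sits above and the cedilla below, Unicode treats both orders as canonically equivalent, so cases (1) and (2) collapse to the same string under this approach even though LaTeX renders them differently.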

retorquere · 2020-03-13T08:06:15Z

No, not specifically \c in the outer layer, just a nesting of such diacritics. It's pretty strange that 4 and 6 behave differently though. I pasted the above in my parser and I currently get:

1. A → Ă̧
2. ̧̆A → ̆A̧
3. ĭ → ı̆
4. a̧a → a̧a (i.e. cedilla on first 'a': this one I seem to get right)
5. ̧̆aa → ă ̧a (i.e. cedilla precisely in middle, no actual spaces: the cedilla shows up at the front, but it's one composed character)
6. A̧A → A ̧A (i.e. cedilla precisely in middle, no actual spaces: my parser turns it into composed-A+A, not A+composed space+A)

and I'm personally satisfied with that. I consider 4 - 6 to be "gimmick" cases. I don't see a good reason why anyone should do this, and I'm not certain the behavior is well-defined even for LaTeX (even if it will probably be stable). Even 1 and 2 are actually not great examples (I fully realize I brought them) because the cedilla isn't usually (ever?) on A? I only know about ZzCc getting a cedilla.

retorquere · 2020-03-13T08:59:05Z

And Ee apparently (retorquere/zotero-better-bibtex#1455)

larsgw · 2020-03-16T16:48:59Z

> The braces are actually balanced -- the \} ought to be interpreted as just text, not structuring braces, but whether it is read this way during parsing by bibtex depends on the version you're using.

On Overleaf I seem to have

Package: natbib 2010/09/13 8.31b (PWD, AO)
Package: biblatex 2019/08/31 v3.13a programmable bibliographies (PK/MW)

retorquere · 2020-03-16T16:54:29Z

Meh, the \vphantom trick is a reliable workaround. I parse \} as text rather than as an unbalanced brace; erroring out doesn't seem any better to me.

larsgw mentioned this issue Oct 21, 2020

Progress on the active parser ("citationjs") #3
