idea-reworked: diacritics misplaced #10

Open
retorquere opened this issue Nov 13, 2019 · 16 comments

@retorquere
Contributor

When parsing

publisher =  { D{\u{o}}\}\"ead Po$_{eee}$t Society},

the mark that should be above the o ends up above the }

@larsgw
Member

larsgw commented Nov 13, 2019

I assume you tried this in a terminal? I have had rendering issues in mine. Can you try copying the output to somewhere else? Otherwise, maybe I'm overcorrecting.

@retorquere
Contributor Author

Nope. When I do

const parser = require('./lib/idea-reworked')
const fs = require('fs')
let parsed = parser.parse(fs.readFileSync('test/files/syntax.bib', 'utf-8'))
fs.writeFileSync('parsed.txt', parsed[0].properties.publisher)

I get

$ od -c parsed.txt 
0000000        D   {   {   o   }    ̆  **   }   }   e    ̈  **   a   d    
0000020    P   o 342 202 221   e   e   t       S   o   c   i   e   t   y
0000040
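The same check can be done without `od` by listing the code points directly in Node. This is only a sketch: `codePoints` is a hypothetical helper, not part of either parser, and the sample string is the start of the output shown above.

```javascript
// Hypothetical helper (not part of either parser): list a string's code points.
function codePoints(text) {
  // Array.from iterates by code point, so astral characters stay whole.
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  )
}

// Start of the reported output: the combining breve (U+0306) follows "}", not "o".
console.log(codePoints('D{{o}\u0306}').join(' '))
```

A combining mark attaches to whatever precedes it, so seeing U+0306 directly after U+007D confirms that the breve lands on the brace.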

@larsgw
Member

larsgw commented Nov 13, 2019

Oh, I think I get it. I'll take a look this afternoon then.

@retorquere
Contributor Author

BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

@retorquere
Contributor Author

The macOS console renders such characters properly in my experience, BTW.

@larsgw
Member

larsgw commented Nov 13, 2019

> BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

True, I guess parsing that immediately is unavoidable.
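To illustrate what parsing it immediately could look like, here is a minimal sketch of a scanner that classifies `\{` and `\}` as text at the moment they are read, so a later brace-processing stage never sees them as delimiters. The `tokenize` function and its token shapes are assumptions made for this example, not the grammar used in idea-reworked.

```javascript
// Hypothetical scanner: escaped braces become text tokens right away,
// so later stages cannot mistake them for block delimiters.
function tokenize(input) {
  const tokens = []
  for (let i = 0; i < input.length; i++) {
    const ch = input[i]
    const next = input[i + 1]
    if (ch === '\\' && (next === '{' || next === '}')) {
      tokens.push({ type: 'text', value: next })
      i++ // consume the escaped brace as well
    } else if (ch === '{') {
      tokens.push({ type: 'open' })
    } else if (ch === '}') {
      tokens.push({ type: 'close' })
    } else {
      tokens.push({ type: 'text', value: ch })
    }
  }
  return tokens
}

console.log(tokenize('{a\\}}').map((t) => t.type).join(' '))
```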

@larsgw
Member

larsgw commented Mar 10, 2020

Ah, seems this was fixed in c60e7b6 (it appended to the Text rule which included the braces because of reasons that should be fixed now). Also, it now appends the diacritics to the first character in the string, as it seems it should.

> BTW I added the \} to highlight another potential problem -- if the idea is to process braces in a later stage, information is lost that the last } should not be parsed as a block delimiter but as a literal character.

Well, you'd have to balance the brackets in this example right? Anyway, I'll work on this, looks like an easy fix. \vphantom might be more difficult.

@larsgw
Member

larsgw commented Mar 10, 2020

Hm, not very happy about this:

  • f\u{o}o → fŏo
  • f\u{oo} → fŏo
  • f\u{{o}o} → f ̆oo

Except the last one is rendered like this:

[Screenshot_20200311_001319: screenshot of the rendered output]

@retorquere
Contributor Author

retorquere commented Mar 11, 2020

If I parse those with protected-section generation off, I get the desired result for all three; with it on (which is the default), it only works correctly for the first case. I'm considering changing this, but as far as I can tell, this would be a "category 2" failure. I haven't seen this in the wild, and I don't currently see why anyone would choose f\u{oo} or f\u{{o}o}.

> Well, you'd have to balance the brackets in this example right? Anyway, I'll work on this, looks like an easy fix. \vphantom might be more difficult.

The braces are actually balanced -- the \} ought to be interpreted as just text, not structuring braces, but whether it is read this way during parsing by bibtex depends on the version you're using. But for full compatibility, you can add \vphantom{\}} after it which adds a non-printing balancing brace.
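The difference between the two readings can be made concrete by counting brace depth both ways. This is a sketch only: `braceDepth` is a hypothetical helper, and the "naive" mode is an assumption about how a reader that ignores escapes would count, not a description of any particular bibtex version.

```javascript
// Hypothetical helper: net brace depth of a field value.
// escapeAware = true skips the character after a backslash, so "\}" is text;
// escapeAware = false counts every brace, escaped or not.
function braceDepth(value, escapeAware) {
  let depth = 0
  for (let i = 0; i < value.length; i++) {
    const ch = value[i]
    if (ch === '\\') {
      if (escapeAware) i++ // skip the escaped character
    } else if (ch === '{') {
      depth++
    } else if (ch === '}') {
      depth--
    }
  }
  return depth
}

// The publisher value from the report, including its outer braces.
const value = '{ D{\\u{o}}\\}\\"ead Po$_{eee}$t Society}'
console.log(braceDepth(value, true), braceDepth(value, false))
```

A depth of zero means the value is balanced under that reading; a non-zero depth is where a non-printing balancing brace would be needed.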

@retorquere
Contributor Author

> Hm, not very happy about this:
>
> * `f\u{o}o` → fŏo
> * `f\u{oo}` → fŏo
> * `f\u{{o}o}` → f ̆oo

Turns out to be an easy fix, I have this covered now.
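For readers following along, one possible approach is sketched below, assuming the goal is for the mark to land on the first letter of the argument in all three cases. `applyMark` is a hypothetical helper and not the actual fix in either parser; it also strips every brace in the argument, which is a simplification that ignores escaped braces.

```javascript
// Hypothetical helper: attach a combining mark to the first non-brace
// character of a diacritic command's argument, then compose with NFC.
function applyMark(argument, mark) {
  // Simplification: grouping braces are dropped (escaped braces are not handled).
  const chars = Array.from(argument.replace(/[{}]/g, ''))
  if (chars.length === 0) return mark // nothing to attach to
  return (chars[0] + mark).normalize('NFC') + chars.slice(1).join('')
}

const BREVE = '\u0306'
for (const [arg, rest] of [['o', 'o'], ['oo', ''], ['{o}o', '']]) {
  console.log('f' + applyMark(arg, BREVE) + rest)
}
```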

@retorquere
Contributor Author

The fix will also take care of constructs like \c{\u A} which I didn't handle properly before.

@larsgw
Member

larsgw commented Mar 13, 2020

Ah cool, I'll try something later. Do you mean specifically with \c in the outer layer? There's even more cases than I thought:

  1. \c{\u A} → Ă̧
  2. \u{\c A} → ̆A̧
  3. \u{\i} → ı̆
  4. \c{aa} → a̧a (i.e. cedilla on first 'a')
  5. \c{\u aa} → ă ̧a (i.e. cedilla precisely in middle, no actual spaces)
  6. \c{AA} → A ̧A (i.e. cedilla precisely in middle, no actual spaces)

So first of all, I think there are two classes of diacritics:

  • one (\u, \", etc.) that applies to the first char only and does not work if some sort of block is given in the argument. I am not entirely sure what this means in LaTeX terms, but there is clearly a difference between (2) and (3).
  • one (\c, \b, etc.) that applies to the whole group (5, 6) OR the first one (4) depending on some arbitrary reason. I get that it treats it as a whole group in (5) but I'm honestly stumped by (6)

I guess this is just LaTeX at this point and not really Bib(La)TeX...
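One consequence for a parser that emits Unicode is worth spelling out: if both marks end up on the same base letter, the nesting order of cases (1) and (2) cannot survive normalization, because canonical ordering sorts marks from different combining classes. The sketch below assumes a hypothetical `stack` helper and a two-entry command table; it says nothing about how LaTeX itself typesets these.

```javascript
// Subset of diacritic commands, mapped to combining characters (assumed table).
const MARKS = { u: '\u0306', c: '\u0327' } // breve, cedilla

// Hypothetical helper: apply commands innermost-first to one base letter.
function stack(base, commands) {
  let out = base
  for (const name of commands) out += MARKS[name]
  return out.normalize('NFC')
}

const one = stack('A', ['u', 'c']) // \c{\u A}
const two = stack('A', ['c', 'u']) // \u{\c A}
console.log(one === two)
```

So the visual difference between (1) and (2) can only be kept by attaching the outer mark to something other than the base letter.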

@retorquere
Contributor Author

No, not specifically \c in the outer layer, just a nesting of such diacritics. It's pretty strange that 4 and 6 behave differently though. I pasted the above in my parser and I currently get:

A → Ă̧
̧̆A → ̆A̧
ĭ → ı̆
a̧a → a̧a (i.e. cedilla on first 'a': this one I seem to get right)
̧̆aa → ă ̧a (i.e. cedilla precisely in middle, no actual spaces: the cedilla shows up at the front, but it's one composed character)
A̧A → A ̧A (i.e. cedilla precisely in middle, no actual spaces: my parser turns it into composed-A+A, not A+composed space+A)

and I'm personally satisfied with that. I consider 4 - 6 to be "gimmick" cases. I don't see a good reason why anyone should do this, and I'm not certain the behavior is well-defined even for LaTeX (even if it will probably be stable). Even 1 and 2 are actually not great examples (I fully realize I brought them) because the cedilla isn't usually (ever?) on A? I only know about ZzCc getting a cedilla.

@retorquere
Contributor Author

And Ee apparently (retorquere/zotero-better-bibtex#1455)

@larsgw
Member

larsgw commented Mar 16, 2020

> The braces are actually balanced -- the \} ought to be interpreted as just text, not structuring braces, but whether it is read this way during parsing by bibtex depends on the version you're using.

On Overleaf I seem to have

Package: natbib 2010/09/13 8.31b (PWD, AO)
Package: biblatex 2019/08/31 v3.13a programmable bibliographies (PK/MW)

@retorquere
Contributor Author

Meh, the vphantom trick is a reliable workaround. I parse \} as text rather than as an unbalanced brace; it seems to me that erroring out wouldn't be any better.
