Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First name displays as "false" #275

Closed
davidweichiang opened this issue Apr 19, 2019 · 24 comments

Comments

@davidweichiang
Copy link
Collaborator

commented Apr 19, 2019

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

When this gets fixed, can we add a test case that ensures we never see the string “False” in the generated YAML? We should set up a set of unit tests that we can slowly build on.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 19, 2019

When this gets fixed, can we add a test case that ensures we never see the string “False” in the generated YAML? We should set up a set of unit tests that we can slowly build on.

We can, but that's not what's happening here. Generating the site locally and grepping for "false" in author pages, there are exactly two instances where this currently happens: N Veilleux and Vasudevan N.

From that, I conclude that Hugo's YAML parser interprets "N" as a boolean false, while PyYAML (when writing the data) doesn't. Possibly, though I haven't dug into this more, PyYAML follows the YAML 1.2 spec (which only allows "true" and "false" for booleans), while Hugo follows YAML 1.1, which allows a whole bunch of other strings, including "N".

(EDIT: There's also a Nirmal Y...)

The easiest solution would be to get PyYAML to quote these strings always, so they can only be interpreted as strings, but I haven't yet checked whether that's easily possible.

@mbollmann mbollmann self-assigned this Apr 19, 2019

@mbollmann mbollmann added the bug label Apr 19, 2019

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 20, 2019

PyYAML has a version option, but it doesn't seem to fix this:

>>> yaml.dump([{'canonical':{'first': 'N', 'last': 'Y'}}], version=(1,1))
'%YAML 1.1\n---\n- canonical: {first: N, last: Y}\n'
@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 20, 2019

Oh, this is why: yaml/pyyaml#247

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 21, 2019

I've briefly looked into alternatives to PyYAML, as some people strongly argue against it and advocate packages like ruamel.yaml instead. However, I've played around with it and couldn't get it to quote these values when dumping either.

I also tried using different default styles in PyYAML, like default_style='"' to make it always quote everything, but that breaks the parsing in Hugo somehow.

This is a nasty bug...

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 21, 2019

Yes, and it seems the bug report David pointed out---when fixed---actually makes things worse.

I hate to suggest it, for the amount of work it implies, but what about switching to JSON for our internal representations? I see that Hugo supports them both. My impression is that JSON libraries in Python are much more robust. Maybe it will be less work than it seems.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 21, 2019

Update: I tried going the YAML 1.2 route, as I saw that Go's YAML parser apparently supports it. So I tried using ruamel to produce valid YAML 1.2 and explicitly declare the YAML version in the files by prepending %YAML 1.2\n---. Unfortunately, for whatever reason, Hugo's YAML parsing routines don't seem to support this, as I just get

Error: Error building site: "/home/bollmann/repositories/acl-anthology/hugo/data/papers/A00.yaml:1:1": failed to unmarshal YAML: yaml: found incompatible YAML document

Going with another data format actually starts to look appealing... besides JSON, Hugo also supports TOML, which is more human-friendly (and should theoretically have good Python support since pip uses it). But I don't know if there are (other) potential pitfalls in either of them.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

FWIW, I opened an issue with ruamel.yaml and it was fixed in less than a day. I'm currently experimenting with it, the main drawback seems to be that everything takes much longer now as we can't use the C libraries anymore. On the plus side, we could keep the YAML format, and the ruamel library also supports comments.

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

Hmm. The slowness of #277 gives me this idea: in our YAML, what if use the value

https://github.com/acl-org/acl-anthology/issues/275 Y

to encode “Y”, and

https://github.com/acl-org/acl-anthology/issues/275 N

to encode “N”? We then just have to strip this out in the few places where we use it, and it will be somewhat self-documenting. It’s obviously an ugly hack, but we’d keep the build speed.

Have we filed a bug request against the PyYAML project? From my understanding, it is basically impossible to encode the value “Y” or “N”, which seems like a serious bug.

Another idea: when writing names, we could just check in the code for string values of “True” and “False”, and convert them to “Y” and “N”. This would work so long as no one is actually named “True” or “False”.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

If Y/N are the only cases that ever cause trouble, such a hack may work. I was thinking of appending a zero-width space as a potential hacky solution, with the idea of (probably) not even needing to strip it out.

In general, I do also like the idea of using an actively maintained library, though.

Have we filed a bug request against the PyYAML project? From my understanding, it is basically impossible to encode the value “Y” or “N”, which seems like a serious bug.

There is an issue and a PR to fix it, @davidweichiang linked it above, but it's been there for 3 months without being merged. PyYAML doesn't really appear to be actively maintained.

Another idea: when writing names, we could just check in the code for string values of “True” and “False”, and convert them to “Y” and “N”. This would work so long as no one is actually named “True” or “False”.

There are no string values "True" and "False" passed around anywhere with this bug. You'd have to check in the Hugo templates whether any of the variables assumed to be a string is actually a boolean, which is pretty cumbersome, to say the least.

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 22, 2019

This seems to work:

import yaml
import re

for c in "YyNn":
    yaml.add_implicit_resolver(c, re.compile('^'+c+'$'))

print(yaml.dump(['Y', 'N', 'y', 'n', 'Yes', 'No', 'yes', 'no']))
@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

Doesn't work when using the CDumper, which is what gives the speed advantage.

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 22, 2019

That's too bad...does that mean this is a libyaml bug?

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 22, 2019

OK, how about

import yaml
import re

for c in "YyNn":
    yaml.add_implicit_resolver("tag:yaml.org,2002:bool", re.compile('^'+c+'$'), Dumper=yaml.CDumper)

print(yaml.dump(['Y', 'N', 'y', 'n', 'Yes', 'No', 'yes', 'no'], Dumper=yaml.CDumper))

I'm not sure whether it's bad for there two be two implicit resolvers for booleans; maybe the existing resolver should be modified. I wish I knew what an implicit resolver was.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

I'm not sure whether it's bad for there two be two implicit resolvers for booleans; maybe the existing resolver should be modified. I wish I knew what an implicit resolver was.

I think it's a function that handles literals in the YAML without an explicit type annotation (e.g. !bool). I'm not too sure either.

Anyway, the code you posted makes the YAML export take forever (I stopped it at 6 minutes and 17%), but the following appears to do the trick:

yaml.add_implicit_resolver("fix-bools", re.compile('^[yYnN]$'), list("yYnN"), Dumper=Dumper)

I will test that a bit more, but it'd be pretty cool if that worked.

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 22, 2019

In your code, is Dumper == yaml.CDumper?

I'm a little nervous about adding resolvers to the loader as well, so how about:

yaml.CDumper.add_implicit_resolver("tag:yaml.org,2002:bool", re.compile('^[YNyn]$'), "YNyn")
@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

I'm a little nervous about adding resolvers to the loader as well

I think there's nothing wrong with that (if anything, it should be the correct™ thing to do), and the more I look at it the more I'm convinced that adding this line is equivalent to fixing it in the original PyYAML source, with the difference being that now two regexes are called instead of just one.

I'm mainly surprised it works with the CDumper, I thought this is something that had to be fixed in libyaml as well, but testing it locally, it does seem to do the trick.

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

I think the bug David linked above is the opposite of our problem—it’s reporting there are not enough strings that result in a boolean being loaded.

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 22, 2019

I think the bug David linked above is the opposite of our problem—it’s reporting there are not enough strings that result in a boolean being loaded.

That's functionally identical though. If the library considers y|Y|n|N to be boolean literals, they will be escaped (= quoted) when dumping and interpreted as booleans when loading.

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 23, 2019

FYI, the page with the bug has moved to here: https://www.aclweb.org/anthology/people/n/nanette-veilleux/

@mbollmann I agree with your last comment; the only thing is that the tag should be tag:yaml.org,2002:bool, right?

@mbollmann

This comment has been minimized.

Copy link
Collaborator

commented Apr 23, 2019

@mbollmann I agree with your last comment; the only thing is that the tag should be tag:yaml.org,2002:bool, right?

Not quite. I think the tag name's relevant for instantiating (loading) from YAML, but using tag:yaml.org,2002:bool will fail since the constructor for that doesn't recognize y|Y|n|N. So I think loading won't work with these anyway unless we also define a new constructor, which requires that we name the tag something else. But then again... maybe we actually don't need to worry about that. :)

@davidweichiang

This comment has been minimized.

Copy link
Collaborator Author

commented Apr 25, 2019

The above PR appears to work on my machine. The fix is applied only to the dumpers, not to the loaders, in light of @mbollmann's observation above.

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

Great. I'll take a look later this afternoon and merge it if it looks fine on my end, too.

@mjpost mjpost closed this in #282 Apr 25, 2019

@mjpost

This comment has been minimized.

Copy link
Member

commented Apr 25, 2019

LGTM. Merged and rsynced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.