Problems with encoding (with non-ASCII chars) #28

Closed
vh-d opened this Issue Oct 24, 2018 · 12 comments

@vh-d
Contributor

vh-d commented Oct 24, 2018

I ran into encoding issues when parsing files as well as R character strings.

Parsing from text (R characters):

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký ků�
> Encoding(test_output)
[1] "unknown"

The encoding mark is lost, but it can be set again:

> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň
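
The transcript above is consistent with the encoding mark being metadata attached to otherwise unchanged bytes. A minimal base R sketch of that idea follows; it does not call RcppTOML, and the round trip through `charToRaw()`/`rawToChar()` is only an assumed stand-in for what happens when a string passes through the parser. The string is built from `\u` escapes so the encoding of the source file does not matter.

```r
# Test string from the report, written with Unicode escapes.
original <- "\u017dlu\u0165ou\u010dk\u00fd k\u016f\u0148"

# Round trip through raw bytes: the bytes survive, the encoding mark does not.
utf8_bytes <- charToRaw(enc2utf8(original))
unmarked <- rawToChar(utf8_bytes)
mark_before <- Encoding(unmarked)   # "unknown"

# Re-declare the same bytes as UTF-8; no conversion takes place.
remarked <- unmarked
Encoding(remarked) <- "UTF-8"
same_bytes <- identical(charToRaw(remarked), utf8_bytes)
```

Because only the mark changes, `Encoding<-` repairs the display without touching the data, which matches the second `cat()` call above.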

Parsing from files

Example: test.txt

> test_file <- RcppTOML::parseTOML("test.txt")
> test_file$value
[1] "Ĺ˝luĹĄouÄŤkĂ˝ kĹŻĹ\u0088"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

TOML files are assumed to be UTF-8 encoded Unicode text. However, R character strings returned by parseTOML() are marked with "unknown" encoding.

For files, the solution may be relatively easy, I think: we can assume that the input is UTF-8 and mark every output string as "UTF-8".
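
Until the package does this itself, the labeling can be sketched on the R side. The helper below is hypothetical (`mark_utf8` is not part of RcppTOML) and assumes the parsed result is a nested list whose strings really contain UTF-8 bytes; it only sets the mark and performs no conversion.

```r
# Hypothetical helper (not part of RcppTOML): walk a parsed result and
# declare every character vector to be UTF-8. Bytes are left untouched.
mark_utf8 <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"
  } else if (is.list(x)) {
    x[] <- lapply(x, mark_utf8)   # x[] keeps names and other attributes
  }
  x
}

# Usage, assuming test.txt is a UTF-8 TOML file as in the example above:
# cfg <- mark_utf8(RcppTOML::parseTOML("test.txt"))
```

Pure ASCII strings still report "unknown" afterwards, since R never marks them; table keys (list names) are not handled in this sketch.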

@eddelbuettel

Owner

eddelbuettel commented Oct 24, 2018

Just to double check: what platform are you on?

@vh-d

Contributor

vh-d commented Oct 24, 2018

Good point, this was Windows 2008. I will check Linux in a few hours and report back.

@eddelbuettel

Owner

eddelbuettel commented Oct 24, 2018

In a way that's good -- Windows seems to be the worst. So if Encoding() covers it there we should be good on the others (who may already be good).

@vh-d

Contributor

vh-d commented Oct 24, 2018

On Linux:

Parsing character value:

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký kůň
> Encoding(test_output)
[1] "unknown"
> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň

and parsing files:

> test_file <- RcppTOML::parseTOML("~/Downloads/test.txt")
> test_file$value
[1] "Žluťoučký kůň"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

So the problem persists (the encoding mark is lost), but the consequences are negligible because on modern Linux the native ("unknown") encoding is UTF-8 anyway.
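
Whether a session falls into the harmless case can be checked from R itself. A small sketch using base R's `l10n_info()`; the variable name `native_is_utf8` is only illustrative.

```r
# TRUE when the session's native encoding is UTF-8 (typical on modern Linux).
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

# Unmarked UTF-8 bytes are only interpreted correctly when this is TRUE;
# otherwise an explicit Encoding(x) <- "UTF-8" is needed.
if (!native_is_utf8) {
  message("Native encoding is not UTF-8; unmarked strings may be misread.")
}
```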

@vh-d

Contributor

vh-d commented Oct 24, 2018

I suggest

  • conversion to UTF-8 via enc2utf8() for every input whenever fromFile=FALSE and
  • labeling every character on output as "UTF-8"

The first would solve non-file input. The second would solve both file and non-file input, given that, per the TOML spec, users are assumed to provide UTF-8 files.
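
The two suggestions can be sketched as an R-level wrapper. This is an illustration only, not the change that was eventually merged: `parse_toml_utf8` is a hypothetical name, and the `parser` argument is an assumption added so the wrapper works with any function that accepts `(input, fromFile = )` the way `parseTOML()` is called above.

```r
# Hypothetical wrapper; `parser` defaults to RcppTOML::parseTOML but can be
# any function with the same (input, fromFile = ) interface.
parse_toml_utf8 <- function(input, fromFile = TRUE,
                            parser = RcppTOML::parseTOML) {
  # Suggestion 1: convert in-memory input to UTF-8 before parsing.
  if (!fromFile) {
    input <- enc2utf8(input)
  }
  parsed <- parser(input, fromFile = fromFile)
  # Suggestion 2: mark every character vector in the result as UTF-8.
  rapply(parsed,
         function(s) { Encoding(s) <- "UTF-8"; s },
         classes = "character", how = "replace")
}

# Usage, assuming RcppTOML is installed:
# parse_toml_utf8("value='Žluťoučký kůň'", fromFile = FALSE)$value
```

`rapply()` with `how = "replace"` keeps the list structure and leaves non-character values (numbers, dates, logicals) untouched.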

@eddelbuettel

Owner

eddelbuettel commented Oct 25, 2018

Sounds good. Also see #20 which is pretty much the same, no?

@vh-d

Contributor

vh-d commented Oct 25, 2018

Yeah, sorry about that. I will try to come up with a pull request soon, as this bug bites my application quite a bit.

@eddelbuettel

Owner

eddelbuettel commented Oct 25, 2018

That's ok. You are being careful, and you are constructing good examples. One change at a time...

@eddelbuettel

Owner

eddelbuettel commented Oct 25, 2018

Ok, I just pushed a PR with a change I had following the 0.1.4 release. I'd like to make one more change in there and properly document your last change -- see ChangeLog which is a standard (older) format well supported by Emacs :) Can you drop me a full name please, either here or if you prefer by email? And if you want an email different from the one used by git log. Thanks! The 0.1.5 release some time next week will be much improved thanks to your help.

@vh-d

Contributor

vh-d commented Oct 25, 2018

"Can you drop me a full name please, either here or if you prefer by email? And if you want an email different from the one used by git log."

My name is Václav Hausenblas, nice to meet you :-)
Big fan of tinyverse, btw.

"Thanks! The 0.1.5 release some time next week will be much improved thanks to your help."

My pleasure! I am going to play with encoding now...

@eddelbuettel

Owner

eddelbuettel commented Oct 26, 2018

Ok, you're in the ChangeLog now :) And I got my other issue taken care of -- the (new in 0.5.0) local_time type comes back to us now too (as a string, there is no real type for it and I don't think I want to pull in hms just for this).

So if/when you have something for encoding, feel free to branch or fork again and show it :)

@vh-d vh-d referenced this issue Oct 26, 2018

Merged

Fix/encoding #30

@eddelbuettel

Owner

eddelbuettel commented Oct 26, 2018

Fixed in #30
