Problems with encoding (with non-ASCII chars) #28

Closed
vh-d opened this issue Oct 24, 2018 · 12 comments

Contributor

@vh-d vh-d commented Oct 24, 2018

I ran into encoding issues when parsing files as well as R character strings.

Parsing from text (R characters):

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký ků�
> Encoding(test_output)
[1] "unknown"

The encoding attribute is lost, but it can be set again:

> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň

Parsing from files:

Example: test.txt

> test_file <- RcppTOML::parseTOML("test.txt")
> test_file$value
[1] "Ĺ˝luĹĄouÄŤkĂ˝ kĹŻĹ\u0088"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

TOML files are assumed to be UTF-8 Unicode text. However, R character strings obtained from parsing via parseTOML() are marked with "unknown" encoding.

In case of files, the solution may be relatively easy, I think. We can assume that input is UTF-8 and label every string output as "UTF-8".
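A minimal sketch of why relabeling is safe, assuming the parsed strings already hold valid UTF-8 bytes (which is what the examples above suggest). The variable names are illustrative only. Setting the mark with Encoding<- changes how R interprets the bytes, not the bytes themselves:

```r
# UTF-8 bytes for "kůň", built from raw bytes so the result does not depend on
# the encoding of this script or of the session.
bytes <- as.raw(c(0x6b, 0xc5, 0xaf, 0xc5, 0x88))
s <- rawToChar(bytes)
Encoding(s)  # "unknown": no encoding mark yet

marked <- s
Encoding(marked) <- "UTF-8"

# Only the mark changes; the underlying bytes are untouched.
identical(charToRaw(marked), bytes)
```

Pure ASCII strings stay "unknown" even after this assignment, so labeling every output string does not affect them.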

Owner

@eddelbuettel eddelbuettel commented Oct 24, 2018

Just to double check: what platform are you on?

Contributor Author

@vh-d vh-d commented Oct 24, 2018

Good point, this was Windows 2008. I will check Linux in a few hours and report back.

Owner

@eddelbuettel eddelbuettel commented Oct 24, 2018

In a way that's good -- Windows seems to be the worst. So if Encoding() covers it there we should be good on the others (which may already be fine).

Contributor Author

@vh-d vh-d commented Oct 24, 2018

On Linux:

Parsing character value:

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký kůň
> Encoding(test_output)
[1] "unknown"
> Encoding(test_output) <- "UTF-8"
> cat(test_output)
Žluťoučký kůň

and parsing files:

> test_file <- RcppTOML::parseTOML("~/Downloads/test.txt")
> test_file$value
[1] "Žluťoučký kůň"
> Encoding(test_file$value)
[1] "unknown"
> Encoding(test_file$value) <- "UTF-8"
> test_file$value
[1] "Žluťoučký kůň"

So the problem persists (the encoding attribute is lost), but the consequences are negligible because strings marked "unknown" are treated as being in the native encoding, which is UTF-8 on modern Linux.
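To check whether a given session falls into the harmless case, base R's l10n_info() reports whether the native encoding is UTF-8. This is only a quick diagnostic sketch, and the variable name is made up for illustration:

```r
# TRUE when the session's native encoding is UTF-8, in which case strings
# marked "unknown" are already interpreted as UTF-8.
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])
native_is_utf8
```

Where this is FALSE, as on the Windows machine above, unmarked UTF-8 output is likely to be displayed as mojibake.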

Contributor Author

@vh-d vh-d commented Oct 24, 2018

I suggest

  • conversion to UTF-8 via enc2utf8() for every input whenever fromFile=FALSE and
  • labeling every character on output as "UTF-8"

The first would solve non-file input. The second would solve both file and non-file inputs, given that the user is assumed to provide UTF-8 files, per the TOML spec.
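The two steps could be sketched at the R level roughly as follows. This is not the fix that went into the package, only a user-side workaround under stated assumptions: parse_toml_utf8 and its parser argument are hypothetical names, and the parser is assumed to return a (possibly nested) list, as parseTOML() does.

```r
# Hypothetical wrapper, not part of RcppTOML.
parse_toml_utf8 <- function(input, fromFile = TRUE,
                            parser = RcppTOML::parseTOML) {
  # Step 1: for in-memory input, convert to UTF-8 before parsing.
  if (!fromFile) input <- enc2utf8(input)
  res <- parser(input, fromFile = fromFile)
  # Step 2: mark every character vector in the nested result as UTF-8,
  # leaving non-character elements as they are.
  rapply(res, function(s) { Encoding(s) <- "UTF-8"; s },
         classes = "character", how = "replace")
}
```

Usage would look like parse_toml_utf8("test.txt") for files, or parse_toml_utf8(test_input, fromFile = FALSE) for strings. The parser argument exists only so the wrapper can be tried with a stand-in function when RcppTOML is not installed.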

Owner

@eddelbuettel eddelbuettel commented Oct 25, 2018

Sounds good. Also see #20 which is pretty much the same, no?

Contributor Author

@vh-d vh-d commented Oct 25, 2018

Yeah, sorry about that. I will try to come up with a pull request soon, as this bug bites my application quite a bit.

Owner

@eddelbuettel eddelbuettel commented Oct 25, 2018

That's ok. You are being careful, and you are constructing good examples. One change at a time...

Owner

@eddelbuettel eddelbuettel commented Oct 25, 2018

Ok, I just pushed a PR with a change I had following the 0.1.4 release. I'd like to make one more change in there and properly document your last change -- see ChangeLog which is a standard (older) format well supported by Emacs :) Can you drop me a full name please, either here or if you prefer by email? And if you want an email different from the one used by git log. Thanks! The 0.1.5 release some time next week will be much improved thanks to your help.

Contributor Author

@vh-d vh-d commented Oct 25, 2018

Can you drop me a full name please, either here or if you prefer by email? And if you want an email different from the one used by git log.

My name is Václav Hausenblas, nice to meet you :-)
Big fan of tinyverse, btw.

Thanks! The 0.1.5 release some time next week will be much improved thanks to your help.

My pleasure! I am going to play with encoding now...

Owner

@eddelbuettel eddelbuettel commented Oct 26, 2018

Ok, you're in the ChangeLog now :) And I got my other issue taken care of -- the (new in 0.5.0) local_time type comes back to us now too (as a string, there is no real type for it and I don't think I want to pull in hms just for this).

So if/when you have something for encoding feel free to branch or fork again and show it :)

@vh-d vh-d mentioned this issue Oct 26, 2018
Owner

@eddelbuettel eddelbuettel commented Oct 26, 2018

Fixed in #30

eddelbuettel added a commit that referenced this issue Jun 18, 2019
fix #28 again (declare UTF-8 in arrays of strings)