New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/encoding #30

Merged
merged 4 commits into from Oct 26, 2018

Conversation

Projects
None yet
3 participants
@vh-d
Contributor

vh-d commented Oct 26, 2018

Issue #28.

It looks good and surprisingly easy.

> test_input <- enc2utf8("value='Žluťoučký kůň'")
> Encoding(test_input)
[1] "UTF-8"
> test_output <- RcppTOML::parseTOML(test_input, fromFile = FALSE)$value
> cat(test_output)
Žluťoučký kůň
> Encoding(test_output)
[1] "UTF-8"

test.txt

> test_file <- RcppTOML::parseTOML("test.txt")
> test_file$value
[1] "Žluťoučký kůň"
> Encoding(test_file$value)
[1] "UTF-8"

I only have two doubts.

  1. Maybe this is unneccessarily expensive solution? Rf_mkCharCE?
Rcpp::String se(s, CE_UTF8);
return Rcpp::wrap(se);
  1. Maybe doing enc2utf8(input) is too strict. Some R functions for string manipulations have encoding argument allowing for various other encodings and use iconv(x, from=encoding, to='UTF-8') to convert it. But I ditched this idea since it would not apply to cases where fromFile=TRUE and that would be very confusing.

This pull request is just a kick-off. I will add some documentation before merging after we agree on the right solution.

@eddelbuettel

This comment has been minimized.

Owner

eddelbuettel commented Oct 26, 2018

  1. Rf_mkCharCE: I know not that much about encodings, but that is what we use in Rcpp as well. @kevinushey just fixed something in a pending PR there: RcppCore/Rcpp#917

  2. Unsure as well. Particularly across OSs. But on balance ... I think it is worth paying this price on input.

Will ponder but this is probably good as is -- and it is nicely minimal. @kevinushey, can you chime in?

(Oh, and now that I showed you ChangeLog, would you mind adding a commit mentioning the two files you altered?)

@codecov

This comment has been minimized.

codecov bot commented Oct 26, 2018

Codecov Report

Merging #30 into master will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #30      +/-   ##
=========================================
+ Coverage   60.17%   60.2%   +0.02%     
=========================================
  Files           3       3              
  Lines        1341    1342       +1     
=========================================
+ Hits          807     808       +1     
  Misses        534     534
Impacted Files Coverage Δ
R/parseToml.R 100% <100%> (ø) ⬆️
src/parse.cpp 64.62% <100%> (+0.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6a94ce...ba609f5. Read the comment docs.

@kevinushey

This comment has been minimized.

kevinushey commented Oct 26, 2018

LGTM!

I think keeping everything as UTF-8 rather than attempting to accommodate different encodings of text is a good idea, unless RcppTOML works with large enough strings that copying / re-encoding of strings could be expensive enough to matter.

@vh-d

This comment has been minimized.

Contributor

vh-d commented Oct 26, 2018

1. `Rf_mkCharCE`: I know not that much about encodings, but that is what we use in Rcpp as well. @kevinushey just fixed something in a pending PR there: [RcppCore/Rcpp#917](https://github.com/RcppCore/Rcpp/pull/917)

I think it is safe concerning encoding. But I was wondering whether the c++ code I wrote (using additional Rcpp::String) does an unnecessary copy of the whole text or not. If Rf_mkCharCE makes a pointer to an existing string and returns SEXP with an UTF-8 attribute, without copying, than it would be more efficient.

2. Unsure as well.  Particularly across OSs.  But on balance ... I think it is worth paying this price on input.

The only bad scenario I can imagine would be if user has a text labeled as "unknown" encoding but it would not be his/her system's native encoding. Then enc2utf8() may do some harm because R will assume native encoding I think. On the other hand, cpptoml requires UTF-8 and also user should care about encoding when importing text into R anyway. I will put warning in man pages.

(Oh, and now that I showed you ChangeLog, would you mind adding a commit mentioning the two files you altered?)

Sure, I will make a commit to the man page changes and enhance the pull request.

vh-d added some commits Oct 26, 2018

@eddelbuettel eddelbuettel merged commit 926d33b into eddelbuettel:master Oct 26, 2018

3 checks passed

codecov/patch 100% of diff hit (target 60.17%)
Details
codecov/project 60.2% (+0.02%) compared to c6a94ce
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@eddelbuettel

This comment has been minimized.

Owner

eddelbuettel commented Oct 31, 2018

Quick follow-up and another Thanks! for the PRs -- new version is now on CRAN.

@vh-d

This comment has been minimized.

Contributor

vh-d commented Nov 1, 2018

That's great! I have been pleased to contribute!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment