ARROW-7018: [R] Non-UTF-8 data in Arrow <--> R conversion #7527
nealrichardson
left a comment
Some questions of my own @romainfrancois
r/src/array_from_vector.cpp
Outdated
This Rf_mkCharCE(Rf_translateCharUTF8(s), CE_UTF8) is dropped in several places; should we factor this out to a macro or something?
yeah maybe some sort of Rf_mkCharUtf8() or Rf_mkUtf8()
r/src/recordbatch.cpp
Outdated
There are a few places where I call Rf_mkCharCE() and then immediately call CHAR(), which IIUC is boxing in a SEXP and then immediately unboxing it. Maybe that can be eliminated in some places or done better?
I don't think R offers an API for this.
names = function() Schema__field_names(self),
names = function() {
  out <- Schema__field_names(self)
  # Hack: Rcpp should set the encoding
This is a more general problem: it would affect the names() method of any object that returns a std::vector<std::string>. Those strings must (as I understand it) always be UTF-8 in Arrow, but if you don't declare them as UTF-8 in R, they get displayed all mangled on Windows (where a default/unknown encoding is treated as latin1).
Rather than relying on the default Rcpp::wrap method for this, we should probably wrap ourselves. I could naively write this (create CharacterVector, iterate over the std::vector<std::string> and insert Rcpp::String with CE_UTF8) but maybe that's not great?
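To make the mangling concrete, here is a minimal standalone C++ sketch of what goes wrong when UTF-8 bytes are treated as latin1. It deliberately avoids the R and Rcpp APIs; latin1_to_utf8 is a hypothetical helper written for illustration, not a function from Arrow, Rcpp, or R, and it only approximates conceptually what a translation step does for a latin1-marked string.

```cpp
#include <string>

// Hypothetical helper (not part of Arrow, Rcpp, or R): re-encode a
// latin1 (ISO-8859-1) byte string as UTF-8. Every latin1 byte maps to
// the Unicode code point of the same value, so bytes >= 0x80 become
// two-byte UTF-8 sequences and ASCII bytes pass through unchanged.
std::string latin1_to_utf8(const std::string& latin1) {
  std::string utf8;
  utf8.reserve(latin1.size() * 2);
  for (unsigned char byte : latin1) {
    if (byte < 0x80) {
      utf8.push_back(static_cast<char>(byte));
    } else {
      // Leading byte carries the top 2 bits, continuation byte the low 6.
      utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
      utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
    }
  }
  return utf8;
}
```

Given a genuine latin1 "é" (the single byte 0xE9), the helper yields the correct UTF-8 bytes 0xC3 0xA9. Given bytes that are already UTF-8 but are assumed to be latin1, each byte is re-encoded separately, so 0xC3 0xA9 becomes 0xC3 0x83 0xC2 0xA9, which displays as "Ã©". That double encoding is the kind of mangling described above, and it is why the strings need to be declared as UTF-8 rather than left with an unknown encoding.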
I believe cpp11 will rescue us from that sort of trouble.
Yes, I did a little dance when I saw https://cpp11.r-lib.org/articles/motivations.html#utf-8-everywhere
@romainfrancois this is ready for (and seriously needs) your review. Tests should be passing now.
romainfrancois
left a comment
Apart from following up on your hint, it LGTM.
Thanks, I think we should just revisit further work once
Sprinkles Rf_translateCharUTF8 in a few places. I tried to add tests for all of the different scenarios I could think of where we could have non-UTF-8 strings.
Also includes $ and [[ methods for Schema objects.