Encoding info losses if the column names contain non-ASCII characters #144
Comments
Hi @shrektan, thanks for reporting the issue and for adding the (clear) reproducible example! Indeed, the encoding information is not stored correctly for the column names, see this code. The encoding is incorrectly assumed to be the default (local) encoding. Thanks for spotting that, I will fix it ASAP!
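To see why losing the encoding mark matters, here is a small base-R illustration (my own sketch, not fst's internals): the bytes of a string are unchanged when the mark is dropped, but R then reinterprets those bytes in the native (local) encoding.

```r
# Failure mode in miniature: same bytes, different interpretation
x <- "\u00de"            # "Þ", marked as UTF-8 by the R parser
Encoding(x)              # -> "UTF-8"
bytes <- charToRaw(x)    # c3 9e: the UTF-8 encoding of U+00DE
y <- rawToChar(bytes)    # identical bytes, but the encoding mark is gone
Encoding(y)              # -> "unknown": now interpreted as native encoding
# on a non-UTF-8 locale, y prints as two garbage characters
```

On a UTF-8 locale the round trip happens to look fine, which is exactly why this class of bug is easy to miss on Linux and shows up on, e.g., Windows with a latin1 locale.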
Hi @shrektan, I've pushed a commit that addresses the issue with the encoding of column names. A small test script to verify:
utf8_strings <- c("\u0648\u0693\u0644", "\u00de")
# create data frame with UTF-8 strings
df1 <- data.frame(X = utf8_strings, Y = LETTERS[1:2], stringsAsFactors = FALSE)
# set UTF-8 column names
colnames(df1) <- utf8_strings
# column names
colnames(df1)
#> [1] "وړل" "Þ"
# column names encoding
Encoding(colnames(df1))
#> [1] "UTF-8" "UTF-8"
# read/write cycle
fst::write_fst(df1, "encoding.fst")
df2 <- fst::read_fst("encoding.fst")
#> Loading required namespace: data.table
# column names result frame
colnames(df2)
#> [1] "وړل" "Þ"
# encoding result frame
Encoding(colnames(df2))
#> [1] "UTF-8" "UTF-8"
I would be very interested to see if the …
Hi @shrektan, one additional note, …
@MarcusKlik I confirm the fix works for me. Thanks for the quick fix and the nice comment that I missed before. Assuming identical encoding is reasonable to me. BTW, for cross-platform compatibility, I always have to build wrapper functions like …
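A wrapper of the kind mentioned above could look like the following base-R sketch (the function name to_utf8_df is hypothetical, not part of fst): it normalizes the column names and every character column to UTF-8, so data read on one platform behaves identically on another.

```r
# Hypothetical helper (not part of fst): normalize a data frame so that
# all column names and character columns are marked as UTF-8.
to_utf8_df <- function(df) {
  # enc2utf8 respects each element's declared encoding when converting
  colnames(df) <- enc2utf8(colnames(df))
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)
  df
}

# usage sketch: df2 <- to_utf8_df(fst::read_fst("encoding.fst"))
```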
Hi @shrektan, thanks a lot for testing that! To understand your request correctly, are you referring to the use case where you write a …

If you would like all …

However, if all strings are converted to …

Of course, when we use options, it would be the user's choice whether or not a performance hit would be acceptable, so that could be the way to go. Also, to make sure that the selection is a deliberate choice, the …
Would that be a workable solution for your use case?
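One way to keep the default path fast while still making conversion a deliberate, opt-in choice is a cheap pre-scan that detects whether any conversion would be needed at all. A base-R sketch (the helper name needs_utf8_conversion is my own, not fst's API):

```r
# Hypothetical pre-check: TRUE if any string is neither plain ASCII nor
# already marked as UTF-8, i.e. a conversion pass would actually change data.
needs_utf8_conversion <- function(x) {
  any(!Encoding(x) %in% c("unknown", "UTF-8") |
      (Encoding(x) == "unknown" &
       grepl("[^\x01-\x7f]", x, useBytes = TRUE)))
}
```

With such a check, ASCII-only data (a very common case) would pay essentially nothing, and the conversion cost discussed above would only be incurred when strings genuinely need re-encoding.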
To get a feeling for the performance of converting a non-UTF-8 encoded character vector:

data(WorldPhones)
char_vec <- sample(colnames(WorldPhones), 1e6, replace = TRUE)
timings <- microbenchmark::microbenchmark(
char_vec_UTF8 <- iconv(char_vec, to = "UTF-8")
)
# number of characters processed per second (in millions)
1e-6 * sum(nchar(char_vec)) * 1e9 / median(timings$time)
#> [1] 13.17229

We can also do that in C++ (via Rcpp):

Rcpp::cppFunction('
SEXP convert_to_utf8(SEXP char_vec) {
int vec_length = LENGTH(char_vec);
SEXP char_vec_utf8 = Rf_allocVector(STRSXP, vec_length);
PROTECT(char_vec_utf8);
for (int i = 0; i < vec_length; i++) {
    SEXP char_elem = STRING_ELT(char_vec, i);
    // translate to UTF-8 and mark the new CHARSXP as UTF-8
    const char* char_elem_utf8 = Rf_translateCharUTF8(char_elem);
    SET_STRING_ELT(char_vec_utf8, i, Rf_mkCharCE(char_elem_utf8, CE_UTF8));
}
UNPROTECT(1);
return char_vec_utf8;
}')
timings_cpp <- microbenchmark::microbenchmark(
char_vec_UTF8 <- convert_to_utf8(char_vec)
)
# performance in characters processed per second (in millions)
1e-6 * sum(nchar(char_vec)) * 1e9 / median(timings_cpp$time)
#> [1] 106.4347

Although the …
@MarcusKlik Thanks for the very detailed comment (like …) I didn't notice that … The killer feature of …

One of the use cases I can imagine: if the user wants the speed, he/she simply …
Retain column names encoding (fixes fstpackage#144)
Hi, first of all, thanks for this great package! The speed is really fast 👍
I want to report an issue: although fst supports writing and reading non-ASCII strings very well for its content, the Encoding info for the column names is lost if they contain non-ASCII strings.

The minimal reproducible example:
sessionInfo()