New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoding info losses if the column names contain non-ASCII characters #144

Closed
shrektan opened this Issue Apr 9, 2018 · 7 comments

Comments

Projects
None yet
2 participants
@shrektan

shrektan commented Apr 9, 2018

Hi, first of all, thanks for this great package! The speed is really that fast 馃憤

I want to report the issue that although fst supports writing and reading non-ASCII strings very well for its content, the Encoding info for the column names will be missing if there are non-ASCII strings.

The minimal reproducible example

utf8_strings <- c("\u00e7ile", "fa\u00e7ile", "El. pa\u00c5\u00a1tas", "\u00a1tas", "\u00de")
latin1_strings <- iconv(utf8_strings, from = "UTF-8", to = "latin1")
tbl <- data.frame(utf8_strings, latin1_strings, stringsAsFactors = FALSE)
colnames(tbl) <- c(utf8_strings[1], latin1_strings[1])
tbl2 <- local({
  tmp_file <- tempfile(fileext = ".fst")
  on.exit(unlink(tmp_file), add = TRUE)
  fst::write_fst(tbl, tmp_file)
  fst::read_fst(tmp_file)
})
colnames(tbl)
#> [1] "莽ile" "莽ile"
colnames(tbl2)
#> [1] "莽ile"    "\xe7ile"
Encoding(colnames(tbl))
#> [1] "UTF-8"  "latin1"
Encoding(colnames(tbl2))
#> [1] "unknown" "unknown"
Encoding(colnames(tbl2)) <- c("UTF-8", "latin1")
colnames(tbl2)
#> [1] "莽ile" "莽ile"

sessionInfo()

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] usethis_1.3.0

loaded via a namespace (and not attached):
[1] compiler_3.4.3  backports_1.1.2 parallel_3.4.3  tools_3.4.3     yaml_2.1.18     Rcpp_0.12.16    fst_0.8.4 
@MarcusKlik

This comment has been minimized.

Collaborator

MarcusKlik commented Apr 9, 2018

Hi @shrektan, thanks for reporting the issue and for adding the (clear) reproducible example!

Indeed, the encoding information is not stored correctly for the column names, see this code. The encoding is incorrectly assumed to be the default (local) encoding.

Thanks for spotting that and I will fix that ASAP!.

@MarcusKlik MarcusKlik self-assigned this Apr 9, 2018

@MarcusKlik MarcusKlik added the bug label Apr 9, 2018

@MarcusKlik MarcusKlik added this to the fst v0.8.6 milestone Apr 9, 2018

@MarcusKlik

This comment has been minimized.

Collaborator

MarcusKlik commented Apr 10, 2018

Hi @shrektan, I've pushed a commit that addresses the issue with encoding of column names. A small test script to verify:

utf8_strings <- c("\u0648\u0693\u0644", "\u00de")

# create data frame with UTF-8 strings
df1 <- data.frame(X = utf8_strings, Y = LETTERS[1:2], stringsAsFactors = FALSE)

# set UTF-8 column names
colnames(df1) <- utf8_strings

# column names
colnames(df1)
#> [1] "賵趽賱" "脼"

# column names encoding
Encoding(colnames(df1))
#> [1] "UTF-8" "UTF-8"

# read/write cycle
fst::write_fst(df1, "encoding.fst")
df2 <- fst::read_fst("encoding.fst")
#> Loading required namespace: data.table

# column names result frame
colnames(df2)
#> [1] "賵趽賱" "脼"

# encoding result frame
Encoding(colnames(df2))
#> [1] "UTF-8" "UTF-8"

I would be very interested to see if the dev version now works for you!

@MarcusKlik

This comment has been minimized.

Collaborator

MarcusKlik commented Apr 10, 2018

Hi @shrektan, one additional note, fst assumes that the encoding of all elements of a character vector is identical. This assumption is made to avoid checking the encoding of each element in the vector (and having to store it), which would require significant CPU resources. Parameter uniform_encoding = FALSE can be used with write_fst() to check each element separately (but an error will be throw if different encodings are found).

@shrektan

This comment has been minimized.

shrektan commented Apr 11, 2018

@MarcusKlik I confirm the fix works for me. Thanks for the quick fix and the nice comment that I missed before. Assuming the identical encoding is reasonable for me.

BTW, for across-platforms compatibility, I always have to build wrapper functions like write_std_fst() to ensure the data.table written to disk is encoded in UTF-8 with timezone UTC (if not, converted to UTF-8 and UTC first). If there's anyway that fst can naively support this, it will be supper handy. 馃槂 (Although I bet it's very difficult)

@MarcusKlik

This comment has been minimized.

Collaborator

MarcusKlik commented Apr 11, 2018

Hi @shrektan, thanks a lot for testing that!

To understand your request correctly, are you referring to the use case where you write a data.table to a fst file and retrieve it on a computer that's located in a different time-zone or with different settings for the character encoding (and the encoding or time-zone is left at the default setting)?

If you would like all fst files to have it's strings stored in UTF-8 (both columns and column-elements), perhaps the best way would be to provide a package option to signal that requirement, e.g. options(fst_encoding_UTF8 = TRUE). And the same could be done for time-zones.

However, if all strings are converted to UTF-8 before writing to disk, the serialization performance would certainly suffer. The feather package chose to take that performance hit (see this code), but I deliberately left this up to the user to allow for maximum speed when serializing character vectors.

Off course, when we use options, it would be the user's choice whether or not a performance hit would be acceptable, so that could be the way to go. Also, to make sure that the selection is a deliberate choice, the fst package startup message could include a message about converting all strings, something like

fst is set to convert all strings to UTF-8 from option fst_encoding_UTF8

Would that be a workable solution for your use case?

@MarcusKlik

This comment has been minimized.

Collaborator

MarcusKlik commented Apr 11, 2018

To get a feeling for the performance of converting a non UTF-8 string to UTF-8, I did a small benchmark:

data(WorldPhones)
char_vec <- sample(colnames(WorldPhones), 1e6, replace = TRUE)

timings <- microbenchmark::microbenchmark(
  char_vec_UTF8 <- iconv(char_vec, to = "UTF-8")
)

# number of characters processed per second (in millions)
1e-6 * sum(nchar(char_vec)) * 1e9 / median(timings$time)
#> [1] 13.17229

we can also do that in C++ using Rcpp:

Rcpp::cppFunction('

SEXP convert_to_utf8(SEXP char_vec) {
  
  int vec_length = LENGTH(char_vec);
  
  SEXP char_vec_utf8 = Rf_allocVector(STRSXP, vec_length);
  PROTECT(char_vec_utf8);
    
  for (int i = 0; i < vec_length; i++) {
    SEXP char_elem = STRING_ELT(char_vec, i);
    const char* char_elem_utf8 = Rf_translateCharUTF8(char_elem);
    SET_STRING_ELT(char_vec_utf8, i, Rf_mkCharCE(char_elem_utf8, (cetype_t) 1));
  }
  
  UNPROTECT(1);
  
  return char_vec_utf8;
}')

timings_cpp <- microbenchmark::microbenchmark(
  char_vec_UTF8 <- convert_to_utf8(char_vec)
)

# performance in characters processed per second (in millions)
1e-6 * sum(nchar(char_vec)) * 1e9 / median(timings_cpp$time)
#> [1] 106.4347

Although the C++ version is much faster than iconv, it's still only about 100 million characters per second (106 MB/s not counting any NA checks). Such low speeds can be a real bottleneck for fst, because when the multi-threaded character column serialization is finished, my aim is to have a speed of more than a GB/s, about a factor 10 faster than the UTF-8 conversion. So the user should definitely be made aware of the impact of forcing all string conversions :-)

@shrektan

This comment has been minimized.

shrektan commented Apr 12, 2018

@MarcusKlik Thanks for the very detailed comment (like I didn't notice that feather will convert to UTF-8 strings automatically)!

The killer feature of fst is the lighting I/O speed so yes it needs to be notified properly. However, not sure if a new option or a new argument is better.

One of the user cases I can imagine: If the user wants the speed, he/she simply write_fst() as usual. When he/she needs the data to be read on another machine or platform, he/she simply call write_fst() with the new argument without impacting any other part.

Weekend-Warrior added a commit to Weekend-Warrior/fst that referenced this issue Apr 25, 2018

Merge pull request #1 from fstpackage/develop
Retain column names encoding (fixes fstpackage#144)

@MarcusKlik MarcusKlik modified the milestones: Candidate, fst v0.8.6, fst version 0.8.6 May 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment