
ARROW-9186: [R] Allow specifying CSV file encoding #12030

Closed · paleolimbot wants to merge 30 commits into apache:master from paleolimbot:r-csv-encoding

Conversation

paleolimbot (Member) commented:

This PR makes it possible to read non-UTF-8-encoded CSV files, as was done in Python (ARROW-9106). I'm very open to (and would love suggestions on!) changes in the structure, naming, and implementation, since C++ isn't my strong suit. I opted for using R's C-level iconv because it made more sense to me than calling back to R (where I don't know how I'd handle partial multibyte characters at the end of a buffer).
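
For context, a minimal sketch of the approach described above (not the code merged in this PR), assuming R's C-level Riconv_open()/Riconv()/Riconv_close() API and Arrow's AllocateResizableBuffer(); the pending-bytes carry-over and error mapping are illustrative assumptions:

#include <cerrno>
#include <memory>
#include <string>

#include <R_ext/Riconv.h>
#include <arrow/buffer.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Transcodes a stream of Buffers, carrying a partial multibyte sequence
// at the end of one buffer over to the next call.
class RIconvTranscoder {
 public:
  RIconvTranscoder(const char* from, const char* to)
      : cd_(Riconv_open(to, from)) {}
  ~RIconvTranscoder() {
    if (cd_ != reinterpret_cast<void*>(-1)) Riconv_close(cd_);
  }

  arrow::Result<std::shared_ptr<arrow::Buffer>> operator()(
      const std::shared_ptr<arrow::Buffer>& src) {
    if (cd_ == reinterpret_cast<void*>(-1)) {
      return arrow::Status::Invalid("Unsupported encoding");
    }
    // Prepend bytes held over from the previous buffer.
    std::string in = pending_;
    in.append(reinterpret_cast<const char*>(src->data()),
              static_cast<size_t>(src->size()));
    pending_.clear();

    const char* in_buf = in.data();
    size_t in_left = in.size();

    ARROW_ASSIGN_OR_RAISE(
        auto dest,
        arrow::AllocateResizableBuffer(static_cast<int64_t>(in.size()) * 2));
    char* out_buf = reinterpret_cast<char*>(dest->mutable_data());
    size_t out_left = static_cast<size_t>(dest->size());

    while (in_left > 0) {
      size_t ret = Riconv(cd_, &in_buf, &in_left, &out_buf, &out_left);
      if (ret != static_cast<size_t>(-1)) break;
      if (errno == E2BIG) {
        // Output buffer full: grow it and keep going where we left off.
        size_t used = static_cast<size_t>(dest->size()) - out_left;
        RETURN_NOT_OK(dest->Resize(dest->size() * 2));
        out_buf = reinterpret_cast<char*>(dest->mutable_data()) + used;
        out_left = static_cast<size_t>(dest->size()) - used;
      } else if (errno == EINVAL) {
        // Partial multibyte sequence at the end: save it for the next call.
        pending_.assign(in_buf, in_left);
        in_left = 0;
      } else {
        return arrow::Status::Invalid("Invalid byte sequence for encoding");
      }
    }

    // Shrink the logical size to the bytes actually written.
    size_t written = static_cast<size_t>(dest->size()) - out_left;
    RETURN_NOT_OK(dest->Resize(static_cast<int64_t>(written)));
    return std::shared_ptr<arrow::Buffer>(std::move(dest));
  }

 private:
  void* cd_;
  std::string pending_;
};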

Reprex for testing:

library(arrow, warn.conflicts = FALSE)

tf <- tempfile()
on.exit(unlink(tf))

strings <- c("a", "\u00e9", "\U0001f4a9", NA)
file_string <- paste0(
  "col1,col2\n",
  paste(strings, 1:400, sep = ",", collapse = "\n")
)

file_bytes_utf16 <- iconv(file_string, to = "UTF-16LE", toRaw = TRUE)[[1]]

con <- file(tf, open = "wb")
writeBin(file_bytes_utf16, con)
close(con)

fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)

tibble::as_tibble(reader$Read())
#> # A tibble: 400 × 2
#>    col1   col2
#>    <chr> <int>
#>  1 a         1
#>  2 é         2
#>  3 💩        3
#>  4 NA        4
#>  5 a         5
#>  6 é         6
#>  7 💩        7
#>  8 NA        8
#>  9 a         9
#> 10 é        10
#> # … with 390 more rows

@github-actions commented:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@paleolimbot paleolimbot marked this pull request as draft December 24, 2021 02:09
r/src/io.cpp (outdated review thread):

arrow::Result<std::shared_ptr<arrow::Buffer>> operator()(
    const std::shared_ptr<arrow::Buffer>& src) {
  ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(32));
Member:

Pre-allocating src->size() bytes sounds like a better heuristic (or perhaps even a bit more).

Member:

Actually, it would perhaps be even better to use a BufferBuilder and call Reserve accordingly. It will handle overallocation for you.

paleolimbot (Member, Author):

I tried a BufferBuilder solution, but I found it hard to make it readable without allocating a bunch of intermediary buffers and Append()ing them. I did completely rewrite the Buffer-based solution, but if there's a pattern I'm missing I'm happy to rewrite again! I think the current version works for most input with a single allocation (but I will check more thoroughly).

Member:

I'm a bit surprised, because you could just use the Reserve, mutable_data and UnsafeAdvance methods on BufferBuilder: basically, rely on the (re)allocation facilities but do the buffer filling yourself.
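
Concretely, that pattern might look like the following sketch (an illustration, not the actual commit; TranscodeChunk is a hypothetical stand-in for the iconv call, while Reserve(), mutable_data(), UnsafeAdvance() and Finish() are the BufferBuilder methods mentioned above):

#include <cstddef>
#include <memory>

#include <arrow/buffer.h>
#include <arrow/buffer_builder.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical helper wrapping the iconv call: consumes input bytes from *in
// (decrementing *in_left) and writes into *out (decrementing *out_left).
void TranscodeChunk(const char** in, size_t* in_left,
                    char** out, size_t* out_left);

arrow::Result<std::shared_ptr<arrow::Buffer>> TranscodeWithBuilder(
    const std::shared_ptr<arrow::Buffer>& src) {
  arrow::BufferBuilder builder;
  // Initial guess: output is about the same size as the input.
  RETURN_NOT_OK(builder.Reserve(src->size()));

  const char* in = reinterpret_cast<const char*>(src->data());
  size_t in_left = static_cast<size_t>(src->size());
  while (in_left > 0) {
    // Let the builder handle overallocation instead of resizing manually.
    RETURN_NOT_OK(builder.Reserve(64));
    char* out =
        reinterpret_cast<char*>(builder.mutable_data()) + builder.length();
    size_t avail = static_cast<size_t>(builder.capacity() - builder.length());
    size_t out_left = avail;
    TranscodeChunk(&in, &in_left, &out, &out_left);
    // Account for the bytes the transcoder just wrote.
    builder.UnsafeAdvance(static_cast<int64_t>(avail - out_left));
  }

  std::shared_ptr<arrow::Buffer> dest;
  RETURN_NOT_OK(builder.Finish(&dest));
  return dest;
}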

paleolimbot (Member, Author):

I'm totally game to use it, but in the current version I only interact with the buffer object 3 times:

// 1. allocate
ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(initial_size));

// 2. grow when the output side runs out of room
if (out_bytes_left < in_bytes_left) {
  RETURN_NOT_OK(dest->Resize(dest->size() * 1.2));
  out_buf = dest->mutable_data() + out_bytes_used;
  out_bytes_left = dest->size() - out_bytes_used;
}

// 3. set the final size to the bytes actually written (without reallocating)
RETURN_NOT_OK(dest->Resize(out_bytes_used, false));

I found that with the BufferBuilder I needed an extra line for Finish() and the rest was almost the same (Reserve() instead of Resize()). Is there anything I'm missing that would be safer or more readable using the BufferBuilder?

Member:

I think the main point is to avoid reinventing the overallocation logic. I pushed a commit that uses BufferBuilder, feel free to keep it or not depending on how you feel about it.

paleolimbot (Member, Author):

I see! It's definitely better. It also takes care of some of the bookkeeping that was previously duplicated.

@paleolimbot paleolimbot marked this pull request as ready for review January 3, 2022 20:03
@pitrou pitrou requested a review from jonkeane January 5, 2022 19:20
@pitrou (Member) left a comment:

+1 from me. Can someone double-check the R parts?

@pitrou pitrou requested a review from thisisnic January 5, 2022 19:21
@jonkeane (Member) left a comment:

This looks good. Do we need to add any documentation around this? And one question about error handling.

reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)
Member:

What happens if someone calls this without specifying the encoding?

paleolimbot (Member, Author):

The default is "UTF-8":

In arrow/r/R/csv.R, line 416 (at 493c88e):

encoding = "UTF-8") {

Member:

*nods* I was wondering more whether there is any sort of error/detection. Altering your reprex slightly (below), I see now that we get binary columns out. That's not the worst (and there's probably no good way to reliably detect this and do something different anyway):

library(arrow, warn.conflicts = FALSE)

# generate a data frame with funky characters
latin1_chars <- iconv(
  # exclude the comma and control characters
  list(as.raw(setdiff(c(38:126, 161:255), 44))),
  "latin1", "UTF-8"
)

make_text_col <- function(chars, 
                          chars_per_item_min = 1, chars_per_item_max = 20,
                          n_items = 20) {
  choices <- unlist(strsplit(chars, ""))
  text_col <- character(n_items)
  for (i in seq_along(text_col)) {
    text_col[i] <- paste0(
      sample(
        choices, 
        round(runif(1, chars_per_item_min, chars_per_item_max)), 
        replace = TRUE
      ),
      collapse = ""
    )
  }
  text_col
}

set.seed(1843)
n_items <- 1e6

df_latin1 <- data.frame(
  n = 1:n_items,
  latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)

# now check the CSV reader
library(arrow, warn.conflicts = FALSE)

# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()

readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
  iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
  tf_latin1_latin1
)

fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_latin1)
)
tibble::as_tibble(reader$Read())
#> # A tibble: 1,000,000 × 2
#>        n                                                           latin1_chars
#>    <int>                                                             <arrw_bnr>
#>  1     1                             be, 31, 4f, e3, d8, d4, 5c, f9, e9, 76, cd
#>  2     2                                                 5c, ad, bf, ed, 62, dd
#>  3     3             fe, 63, ec, 48, c7, 37, 45, e1, 71, 6b, 77, ca, a4, a6, 3b
#>  4     4                     47, 2f, 67, cc, a9, e3, 51, b0, 38, 52, f8, 74, f3
#>  5     5                                                                     48
#>  6     6 7c, 47, 50, f4, e5, 49, cc, e3, 65, b7, 64, 61, b7, 64, 5d, 7a, 51, a1
#>  7     7                                                     39, f8, f9, c4, 70
#>  8     8                                                         4f, 78, 65, b1
#>  9     9                                                 fa, 71, 65, ff, ed, ca
#> 10    10                     6f, 26, f9, b8, 69, c9, 42, 64, a8, 39, 77, 7d, 58
#> # … with 999,990 more rows

Created on 2022-01-05 by the reprex package (v2.0.1)

paleolimbot (Member, Author):

Right, that's a good point! The current failure mode is weird, I think; I'd prefer to error rather than return a column of a type that the user didn't expect and didn't request. In something like readr you could error and tell the user to use schema = schema(latin1_chars = col_binary(), .default = col_guess()), but the CSV reader interface doesn't really allow col_guess() or .default to my knowledge.
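
For illustration, erroring instead of silently producing binary columns could look roughly like this (a sketch, not something this PR implements; it assumes Arrow's arrow::util::ValidateUTF8() from arrow/util/utf8.h):

#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/utf8.h>

// Validate a decoded chunk and fail with a hint at the `encoding` option
// instead of letting type inference fall back to a binary column.
arrow::Status CheckChunkIsUtf8(const uint8_t* data, int64_t size) {
  arrow::util::InitializeUTF8();  // one-time table setup required by ValidateUTF8
  if (!arrow::util::ValidateUTF8(data, size)) {
    return arrow::Status::Invalid(
        "CSV contained bytes that are not valid UTF-8; "
        "consider setting `encoding` in CsvReadOptions");
  }
  return arrow::Status::OK();
}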

@paleolimbot (Member, Author) commented:

A reprex that might help with testing (since my example is a little dinky). The probable use case is reading encodings like latin1, but I also checked with a bunch of random latin1 characters:

# generate a data frame with funky characters
latin1_chars <- iconv(
  # exclude the comma and control characters
  list(as.raw(setdiff(c(38:126, 161:255), 44))),
  "latin1", "UTF-8"
)

make_text_col <- function(chars, 
                          chars_per_item_min = 1, chars_per_item_max = 20,
                          n_items = 20) {
  choices <- unlist(strsplit(chars, ""))
  text_col <- character(n_items)
  for (i in seq_along(text_col)) {
    text_col[i] <- paste0(
      sample(
        choices, 
        round(runif(1, chars_per_item_min, chars_per_item_max)), 
        replace = TRUE
      ),
      collapse = ""
    )
  }
  text_col
}

set.seed(1843)
n_items <- 1e6

df_latin1 <- data.frame(
  n = 1:n_items,
  latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)

# now check the CSV reader
library(arrow, warn.conflicts = FALSE)

# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()

readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
  iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
  tf_latin1_latin1
)


fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_utf8),
  read_options = CsvReadOptions$create(encoding = "UTF-8")
)

df_latin1_from_utf8 <- tibble::as_tibble(reader$Read())

reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_latin1),
  read_options = CsvReadOptions$create(encoding = "latin1")
)
df_latin1_from_latin1 <- tibble::as_tibble(reader$Read())

identical(df_latin1_from_utf8, df_latin1_from_latin1)
#> [1] TRUE

Created on 2022-01-05 by the reprex package (v2.0.1)

@jonkeane jonkeane closed this in 8f35337 Jan 10, 2022
@ursabot commented Jan 10, 2022:

Benchmark runs are scheduled for baseline = 2e8b836 and contender = 8f35337. 8f35337 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️5.38% ⬆️1.79%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@paleolimbot paleolimbot deleted the r-csv-encoding branch January 20, 2022 19:46