ARROW-9186: [R] Allow specifying CSV file encoding #12030
Conversation
r/src/io.cpp (outdated diff):
arrow::Result<std::shared_ptr<arrow::Buffer>> operator()(
    const std::shared_ptr<arrow::Buffer>& src) {
  ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(32));
Pre-allocating src->size() bytes sounds like a better heuristic (or perhaps even a bit more).
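For illustration only, a minimal sketch of that heuristic; the helper name and the 1.2 factor are placeholders, not part of the patch:

#include <memory>

#include <arrow/buffer.h>
#include <arrow/result.h>

// Size the initial allocation from the input instead of a fixed 32 bytes.
// Encoding conversion can expand the data, so reserving slightly more than
// src->size() avoids an early resize in the common case.
arrow::Result<std::unique_ptr<arrow::ResizableBuffer>> InitialDestBuffer(
    const std::shared_ptr<arrow::Buffer>& src) {
  return arrow::AllocateResizableBuffer(static_cast<int64_t>(src->size() * 1.2));
}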
Actually, it would perhaps be even better to use a BufferBuilder and call Reserve accordingly. It will handle overallocation for you.
I tried a BufferBuilder solution but I found it hard to make it readable without allocating a bunch of intermediary buffers and Append()ing them. I did completely rewrite the Buffer-based solution, but if there's a pattern I'm missing I'm happy to rewrite again! I think the current version works for most input with a single allocation (but I will check more thoroughly).
I'm a bit surprised, because you could just use the Reserve, mutable_data and UnsafeAdvance methods on BufferBuilder: basically, rely on the (re)allocation facilities but do the buffer filling yourself.
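For concreteness, a minimal sketch of that pattern, assuming a Riconv-based converter like the one in this PR. The function name, the iconv_handle argument, the padding passed to Reserve(), and the errno-based error handling are illustrative only, not taken from the committed code:

#include <cerrno>
#include <memory>

#include <arrow/buffer.h>
#include <arrow/buffer_builder.h>
#include <arrow/result.h>
#include <arrow/status.h>

#include <R_ext/Riconv.h>

arrow::Result<std::shared_ptr<arrow::Buffer>> ConvertBuffer(
    void* iconv_handle, const std::shared_ptr<arrow::Buffer>& src) {
  arrow::BufferBuilder builder;
  const char* in_buf = reinterpret_cast<const char*>(src->data());
  size_t in_bytes_left = static_cast<size_t>(src->size());

  while (in_bytes_left > 0) {
    // Ask BufferBuilder for room; it owns the (over)allocation strategy.
    ARROW_RETURN_NOT_OK(builder.Reserve(static_cast<int64_t>(in_bytes_left) + 4));
    char* out_buf =
        reinterpret_cast<char*>(builder.mutable_data()) + builder.length();
    const size_t out_bytes_avail =
        static_cast<size_t>(builder.capacity() - builder.length());
    size_t out_bytes_left = out_bytes_avail;

    size_t ret = Riconv(iconv_handle, &in_buf, &in_bytes_left, &out_buf,
                        &out_bytes_left);
    // Record how many bytes were actually written into the builder.
    builder.UnsafeAdvance(static_cast<int64_t>(out_bytes_avail - out_bytes_left));

    if (ret == static_cast<size_t>(-1) && errno != E2BIG) {
      // E2BIG just means "output full, go around the loop again"; anything
      // else is invalid input or a partial multibyte character spanning
      // buffers, which the real implementation has to handle more carefully.
      return arrow::Status::Invalid("Invalid input during encoding conversion");
    }
  }

  std::shared_ptr<arrow::Buffer> dest;
  ARROW_RETURN_NOT_OK(builder.Finish(&dest));
  return dest;
}

The small padding in Reserve() is only there to guarantee forward progress when a single multibyte character needs more output space than remaining input.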
I'm totally game to use it, but in the current version I only interact with the buffer object 3 times:
// (1) the initial allocation
ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(initial_size));

// (2) growing the buffer whenever the remaining output space runs low
if (out_bytes_left < in_bytes_left) {
  RETURN_NOT_OK(dest->Resize(dest->size() * 1.2));
  out_buf = dest->mutable_data() + out_bytes_used;
  out_bytes_left = dest->size() - out_bytes_used;
}

// (3) trimming to the number of bytes actually written at the end
RETURN_NOT_OK(dest->Resize(out_bytes_used, false));
I found that with the BufferBuilder I needed an extra line for Finish() and the rest was almost the same (Reserve() instead of Resize()). Is there anything I'm missing that would be safer or more readable using the BufferBuilder?
I think the main point is to avoid reinventing the overallocation logic. I pushed a commit that uses BufferBuilder; feel free to keep it or not depending on how you feel about it.
I see! It's definitely better. It also takes care of some of the bookkeeping that was previously duplicated.
+1 from me. Can someone double-check the R parts?
This looks good. Do we need to add any documentation around this? And one question about error handling.
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)
What happens if someone calls this without specifying the encoding?
The default is "UTF-8" (line 416 in 493c88e):

encoding = "UTF-8") {
*nods* I was wondering more whether there is any sort of error or detection. Altering your reprex below slightly, I see now that we get binary columns out. That's not the worst outcome (and there's probably not a good way to reliably detect that and do something different anyway):
library(arrow, warn.conflicts = FALSE)
# generate a data frame with funky characters
latin1_chars <- iconv(
# exclude the comma and control characters
list(as.raw(setdiff(c(38:126, 161:255), 44))),
"latin1", "UTF-8"
)
make_text_col <- function(chars,
chars_per_item_min = 1, chars_per_item_max = 20,
n_items = 20) {
choices <- unlist(strsplit(chars, ""))
text_col <- character(n_items)
for (i in seq_along(text_col)) {
text_col[i] <- paste0(
sample(
choices,
round(runif(1, chars_per_item_min, chars_per_item_max)),
replace = TRUE
),
collapse = ""
)
}
text_col
}
set.seed(1843)
n_items <- 1e6
df_latin1 <- data.frame(
n = 1:n_items,
latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)
# now check the CSV reader
library(arrow, warn.conflicts = FALSE)
# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()
readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
tf_latin1_latin1
)
fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
fs$OpenInputStream(tf_latin1_latin1)
)
tibble::as_tibble(reader$Read())
#> # A tibble: 1,000,000 × 2
#> n latin1_chars
#> <int> <arrw_bnr>
#> 1 1 be, 31, 4f, e3, d8, d4, 5c, f9, e9, 76, cd
#> 2 2 5c, ad, bf, ed, 62, dd
#> 3 3 fe, 63, ec, 48, c7, 37, 45, e1, 71, 6b, 77, ca, a4, a6, 3b
#> 4 4 47, 2f, 67, cc, a9, e3, 51, b0, 38, 52, f8, 74, f3
#> 5 5 48
#> 6 6 7c, 47, 50, f4, e5, 49, cc, e3, 65, b7, 64, 61, b7, 64, 5d, 7a, 51, a1
#> 7 7 39, f8, f9, c4, 70
#> 8 8 4f, 78, 65, b1
#> 9 9 fa, 71, 65, ff, ed, ca
#> 10 10 6f, 26, f9, b8, 69, c9, 42, 64, a8, 39, 77, 7d, 58
#> # … with 999,990 more rows
Created on 2022-01-05 by the reprex package (v2.0.1)
Right...that's a good point! The current mode of failure is weird, I think...I'd prefer to error rather than return a column of a type that the user didn't expect and didn't request. In something like readr you could error and tell the user to use schema = schema(latin1_chars = col_binary(), .default = col_guess())...but the CSV reader interface doesn't really allow col_guess() or .default, to my knowledge.
A reprex that might help with testing (since my example is a little dinky)...the probable use case is reading encodings like latin1, but I also did check with a bunch of random latin1 characters:

# generate a data frame with funky characters
latin1_chars <- iconv(
# exclude the comma and control characters
list(as.raw(setdiff(c(38:126, 161:255), 44))),
"latin1", "UTF-8"
)
make_text_col <- function(chars,
chars_per_item_min = 1, chars_per_item_max = 20,
n_items = 20) {
choices <- unlist(strsplit(chars, ""))
text_col <- character(n_items)
for (i in seq_along(text_col)) {
text_col[i] <- paste0(
sample(
choices,
round(runif(1, chars_per_item_min, chars_per_item_max)),
replace = TRUE
),
collapse = ""
)
}
text_col
}
set.seed(1843)
n_items <- 1e6
df_latin1 <- data.frame(
n = 1:n_items,
latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)
# now check the CSV reader
library(arrow, warn.conflicts = FALSE)
# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()
readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
tf_latin1_latin1
)
fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
fs$OpenInputStream(tf_latin1_utf8),
read_options = CsvReadOptions$create(encoding = "UTF-8")
)
df_latin1_from_utf8 <- tibble::as_tibble(reader$Read())
reader <- CsvTableReader$create(
fs$OpenInputStream(tf_latin1_latin1),
read_options = CsvReadOptions$create(encoding = "latin1")
)
df_latin1_from_latin1 <- tibble::as_tibble(reader$Read())
identical(df_latin1_from_utf8, df_latin1_from_latin1)
#> [1] TRUE

Created on 2022-01-05 by the reprex package (v2.0.1)
Benchmark runs are scheduled for baseline = 2e8b836 and contender = 8f35337. 8f35337 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This PR makes it possible to read non-UTF-8-encoded CSV files, as was done in Python (ARROW-9106). I'm very open to (and would love suggestions on!) changes in the structure, naming, and implementation, since C++ isn't my strong suit. I opted for using R's C-level iconv because it made more sense to me than calling back into R (where I don't know how I'd handle partial multibyte characters at the end of a buffer).
Reprex for testing: