
ARROW-9186: [R] Allow specifying CSV file encoding #12030

Closed · paleolimbot wants to merge 30 commits into apache:master from paleolimbot:r-csv-encoding

Conversation

paleolimbot (Member) commented:

This PR makes it possible to read non-UTF-8-encoded CSV files, as was done in Python (ARROW-9106). I'm very open to (and would love suggestions on!) changes in the structure, naming, and implementation, since C++ isn't my strong suit. I opted for using R's C-level iconv because it made more sense to me than calling back to R (where I don't know how I'd handle partial multibyte characters at the end of a buffer).
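
For context, a minimal sketch of the approach described above (not the code merged in this PR), assuming R's C-level Riconv_open()/Riconv()/Riconv_close() API and Arrow's AllocateResizableBuffer(); the pending-bytes carry-over and error mapping are illustrative assumptions:

#include <cerrno>
#include <memory>
#include <string>

#include <R_ext/Riconv.h>
#include <arrow/buffer.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Transcodes a stream of Buffers, carrying a partial multibyte sequence
// at the end of one buffer over to the next call.
class RIconvTranscoder {
 public:
  RIconvTranscoder(const char* from, const char* to)
      : cd_(Riconv_open(to, from)) {}
  ~RIconvTranscoder() {
    if (cd_ != reinterpret_cast<void*>(-1)) Riconv_close(cd_);
  }

  arrow::Result<std::shared_ptr<arrow::Buffer>> operator()(
      const std::shared_ptr<arrow::Buffer>& src) {
    if (cd_ == reinterpret_cast<void*>(-1)) {
      return arrow::Status::Invalid("Unsupported encoding");
    }
    // Prepend bytes held over from the previous buffer.
    std::string in = pending_;
    in.append(reinterpret_cast<const char*>(src->data()),
              static_cast<size_t>(src->size()));
    pending_.clear();

    const char* in_buf = in.data();
    size_t in_left = in.size();

    ARROW_ASSIGN_OR_RAISE(
        auto dest,
        arrow::AllocateResizableBuffer(static_cast<int64_t>(in.size()) * 2));
    char* out_buf = reinterpret_cast<char*>(dest->mutable_data());
    size_t out_left = static_cast<size_t>(dest->size());

    while (in_left > 0) {
      size_t ret = Riconv(cd_, &in_buf, &in_left, &out_buf, &out_left);
      if (ret != static_cast<size_t>(-1)) break;
      if (errno == E2BIG) {
        // Output buffer full: grow it and keep going where we left off.
        size_t used = static_cast<size_t>(dest->size()) - out_left;
        RETURN_NOT_OK(dest->Resize(dest->size() * 2));
        out_buf = reinterpret_cast<char*>(dest->mutable_data()) + used;
        out_left = static_cast<size_t>(dest->size()) - used;
      } else if (errno == EINVAL) {
        // Partial multibyte sequence at the end: save it for the next call.
        pending_.assign(in_buf, in_left);
        in_left = 0;
      } else {
        return arrow::Status::Invalid("Invalid byte sequence for encoding");
      }
    }

    // Shrink the logical size to the bytes actually written.
    size_t written = static_cast<size_t>(dest->size()) - out_left;
    RETURN_NOT_OK(dest->Resize(static_cast<int64_t>(written)));
    return std::shared_ptr<arrow::Buffer>(std::move(dest));
  }

 private:
  void* cd_;
  std::string pending_;
};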

Reprex for testing:

library(arrow, warn.conflicts = FALSE)

tf <- tempfile()
on.exit(unlink(tf))

strings <- c("a", "\u00e9", "\U0001f4a9", NA)
file_string <- paste0(
  "col1,col2\n",
  paste(strings, 1:400, sep = ",", collapse = "\n")
)

file_bytes_utf16 <- iconv(file_string, to = "UTF-16LE", toRaw = TRUE)[[1]]

con <- file(tf, open = "wb")
writeBin(file_bytes_utf16, con)
close(con)

fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)

tibble::as_tibble(reader$Read())
#> # A tibble: 400 × 2
#>    col1   col2
#>    <chr> <int>
#>  1 a         1
#>  2 é         2
#>  3 💩        3
#>  4 NA        4
#>  5 a         5
#>  6 é         6
#>  7 💩        7
#>  8 NA        8
#>  9 a         9
#> 10 é        10
#> # … with 390 more rows

@github-actions commented:

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@paleolimbot paleolimbot marked this pull request as draft December 24, 2021 02:09
r/src/io.cpp (outdated review thread):

arrow::Result<std::shared_ptr<arrow::Buffer>> operator()(
    const std::shared_ptr<arrow::Buffer>& src) {
  ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(32));
Member:

Pre-allocating src->size() bytes sounds like a better heuristic (or perhaps even a bit more).

Member:

Actually, it would perhaps be even better to use a BufferBuilder and call Reserve accordingly. It will handle overallocation for you.

paleolimbot (Member, Author):

I tried a BufferBuilder solution, but I found it hard to make it readable without allocating a bunch of intermediary buffers and Append()ing them. I did completely rewrite the Buffer-based solution, but if there's a pattern I'm missing I'm happy to rewrite again! I think the current version works for most input with a single allocation (but I will check more thoroughly).

Member:

I'm a bit surprised, because you could just use the Reserve, mutable_data and UnsafeAdvance methods on BufferBuilder: basically, rely on the (re)allocation facilities but do the buffer filling yourself.
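
Concretely, that pattern might look like the following sketch (an illustration, not the actual commit; TranscodeChunk is a hypothetical stand-in for the iconv call, while Reserve(), mutable_data(), UnsafeAdvance() and Finish() are the BufferBuilder methods mentioned above):

#include <cstddef>
#include <memory>

#include <arrow/buffer.h>
#include <arrow/buffer_builder.h>
#include <arrow/result.h>
#include <arrow/status.h>

// Hypothetical helper wrapping the iconv call: consumes input bytes from *in
// (decrementing *in_left) and writes into *out (decrementing *out_left).
void TranscodeChunk(const char** in, size_t* in_left,
                    char** out, size_t* out_left);

arrow::Result<std::shared_ptr<arrow::Buffer>> TranscodeWithBuilder(
    const std::shared_ptr<arrow::Buffer>& src) {
  arrow::BufferBuilder builder;
  // Initial guess: output is about the same size as the input.
  RETURN_NOT_OK(builder.Reserve(src->size()));

  const char* in = reinterpret_cast<const char*>(src->data());
  size_t in_left = static_cast<size_t>(src->size());
  while (in_left > 0) {
    // Let the builder handle overallocation instead of resizing manually.
    RETURN_NOT_OK(builder.Reserve(64));
    char* out =
        reinterpret_cast<char*>(builder.mutable_data()) + builder.length();
    size_t avail = static_cast<size_t>(builder.capacity() - builder.length());
    size_t out_left = avail;
    TranscodeChunk(&in, &in_left, &out, &out_left);
    // Account for the bytes the transcoder just wrote.
    builder.UnsafeAdvance(static_cast<int64_t>(avail - out_left));
  }

  std::shared_ptr<arrow::Buffer> dest;
  RETURN_NOT_OK(builder.Finish(&dest));
  return dest;
}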

paleolimbot (Member, Author):

I'm totally game to use it, but in the current version I only interact with the buffer object 3 times:

// 1. allocate
ARROW_ASSIGN_OR_RAISE(auto dest, arrow::AllocateResizableBuffer(initial_size));

// 2. grow when the output side runs out of room
if (out_bytes_left < in_bytes_left) {
  RETURN_NOT_OK(dest->Resize(dest->size() * 1.2));
  out_buf = dest->mutable_data() + out_bytes_used;
  out_bytes_left = dest->size() - out_bytes_used;
}

// 3. set the final size to the bytes actually written (without reallocating)
RETURN_NOT_OK(dest->Resize(out_bytes_used, false));

I found that with the BufferBuilder I needed an extra line for Finish() and the rest was almost the same (Reserve() instead of Resize()). Is there anything I'm missing that would be safer or more readable using the BufferBuilder?

Member:

I think the main point is to avoid reinventing the overallocation logic. I pushed a commit that uses BufferBuilder, feel free to keep it or not depending on how you feel about it.

paleolimbot (Member, Author):

I see! It's definitely better. It also takes care of some of the bookkeeping that was previously duplicated.

@paleolimbot paleolimbot marked this pull request as ready for review January 3, 2022 20:03
@pitrou pitrou requested a review from jonkeane January 5, 2022 19:20
@pitrou (Member) left a comment:

+1 from me. Can someone double-check the R parts?

@pitrou pitrou requested a review from thisisnic January 5, 2022 19:21
@jonkeane (Member) left a comment:

This looks good. Do we need to add any documentation around this? And one question about error handling.

reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)
Member:

What happens if someone calls this without specifying the encoding?

paleolimbot (Member, Author):

The default is "UTF-8":

In arrow/r/R/csv.R, line 416 (at 493c88e):

encoding = "UTF-8") {

Member:

*nods* I was wondering more whether there is any sort of error/detection. Altering your reprex slightly (below), I see now that we get binary columns out. That's not the worst (and there's probably no good way to reliably detect this and do something different anyway):

library(arrow, warn.conflicts = FALSE)

# generate a data frame with funky characters
latin1_chars <- iconv(
  # exclude the comma and control characters
  list(as.raw(setdiff(c(38:126, 161:255), 44))),
  "latin1", "UTF-8"
)

make_text_col <- function(chars, 
                          chars_per_item_min = 1, chars_per_item_max = 20,
                          n_items = 20) {
  choices <- unlist(strsplit(chars, ""))
  text_col <- character(n_items)
  for (i in seq_along(text_col)) {
    text_col[i] <- paste0(
      sample(
        choices, 
        round(runif(1, chars_per_item_min, chars_per_item_max)), 
        replace = TRUE
      ),
      collapse = ""
    )
  }
  text_col
}

set.seed(1843)
n_items <- 1e6

df_latin1 <- data.frame(
  n = 1:n_items,
  latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)

# now check the CSV reader
library(arrow, warn.conflicts = FALSE)

# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()

readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
  iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
  tf_latin1_latin1
)

fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_latin1)
)
tibble::as_tibble(reader$Read())
#> # A tibble: 1,000,000 × 2
#>        n                                                           latin1_chars
#>    <int>                                                             <arrw_bnr>
#>  1     1                             be, 31, 4f, e3, d8, d4, 5c, f9, e9, 76, cd
#>  2     2                                                 5c, ad, bf, ed, 62, dd
#>  3     3             fe, 63, ec, 48, c7, 37, 45, e1, 71, 6b, 77, ca, a4, a6, 3b
#>  4     4                     47, 2f, 67, cc, a9, e3, 51, b0, 38, 52, f8, 74, f3
#>  5     5                                                                     48
#>  6     6 7c, 47, 50, f4, e5, 49, cc, e3, 65, b7, 64, 61, b7, 64, 5d, 7a, 51, a1
#>  7     7                                                     39, f8, f9, c4, 70
#>  8     8                                                         4f, 78, 65, b1
#>  9     9                                                 fa, 71, 65, ff, ed, ca
#> 10    10                     6f, 26, f9, b8, 69, c9, 42, 64, a8, 39, 77, 7d, 58
#> # … with 999,990 more rows

Created on 2022-01-05 by the reprex package (v2.0.1)

paleolimbot (Member, Author):

Right, that's a good point! The current failure mode is weird, I think; I'd prefer to error rather than return a column of a type that the user didn't expect and didn't request. In something like readr you could error and tell the user to use schema = schema(latin1_chars = col_binary(), .default = col_guess()), but the CSV reader interface doesn't really allow col_guess() or .default to my knowledge.
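
For illustration, erroring instead of silently producing binary columns could look roughly like this (a sketch, not something this PR implements; it assumes Arrow's arrow::util::ValidateUTF8() from arrow/util/utf8.h):

#include <cstdint>

#include <arrow/status.h>
#include <arrow/util/utf8.h>

// Validate a decoded chunk and fail with a hint at the `encoding` option
// instead of letting type inference fall back to a binary column.
arrow::Status CheckChunkIsUtf8(const uint8_t* data, int64_t size) {
  arrow::util::InitializeUTF8();  // one-time table setup required by ValidateUTF8
  if (!arrow::util::ValidateUTF8(data, size)) {
    return arrow::Status::Invalid(
        "CSV contained bytes that are not valid UTF-8; "
        "consider setting `encoding` in CsvReadOptions");
  }
  return arrow::Status::OK();
}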

@paleolimbot (Member, Author) commented:

A reprex that might help with testing (since my example is a little dinky). The probable use case is reading encodings like latin1, but I also checked with a bunch of random latin1 characters:

# generate a data frame with funky characters
latin1_chars <- iconv(
  # exclude the comma and control characters
  list(as.raw(setdiff(c(38:126, 161:255), 44))),
  "latin1", "UTF-8"
)

make_text_col <- function(chars, 
                          chars_per_item_min = 1, chars_per_item_max = 20,
                          n_items = 20) {
  choices <- unlist(strsplit(chars, ""))
  text_col <- character(n_items)
  for (i in seq_along(text_col)) {
    text_col[i] <- paste0(
      sample(
        choices, 
        round(runif(1, chars_per_item_min, chars_per_item_max)), 
        replace = TRUE
      ),
      collapse = ""
    )
  }
  text_col
}

set.seed(1843)
n_items <- 1e6

df_latin1 <- data.frame(
  n = 1:n_items,
  latin1_chars = make_text_col(latin1_chars, n_items = n_items)
)

# now check the CSV reader
library(arrow, warn.conflicts = FALSE)

# make some files
tf_latin1_utf8 <- tempfile()
tf_latin1_latin1 <- tempfile()

readr::write_csv(df_latin1, tf_latin1_utf8)
readr::write_file(
  iconv(list(readr::read_file_raw(tf_latin1_utf8)), "UTF-8", "latin1", toRaw = TRUE)[[1]],
  tf_latin1_latin1
)


fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_utf8),
  read_options = CsvReadOptions$create(encoding = "UTF-8")
)

df_latin1_from_utf8 <- tibble::as_tibble(reader$Read())

reader <- CsvTableReader$create(
  fs$OpenInputStream(tf_latin1_latin1),
  read_options = CsvReadOptions$create(encoding = "latin1")
)
df_latin1_from_latin1 <- tibble::as_tibble(reader$Read())

identical(df_latin1_from_utf8, df_latin1_from_latin1)
#> [1] TRUE

Created on 2022-01-05 by the reprex package (v2.0.1)

@jonkeane jonkeane closed this in 8f35337 Jan 10, 2022
@ursabot commented Jan 10, 2022:

Benchmark runs are scheduled for baseline = 2e8b836 and contender = 8f35337. 8f35337 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️5.38% ⬆️1.79%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@paleolimbot paleolimbot deleted the r-csv-encoding branch January 20, 2022 19:46