
[R] open_dataset() behavior with incorrectly quoted input data #37908

Open
angela-li opened this issue Sep 27, 2023 · 4 comments
Labels
Component: R · Type: usage (Issue is a user question)

Comments

@angela-li
Contributor

angela-li commented Sep 27, 2023

Describe the usage question you have. Please include as many useful details as possible.

I've started using {arrow} to read data in R, and I noticed that it handles messy (aka human-collected!) tabular data somewhat less gracefully than {data.table}'s fread function. (But I do want to work with arrow rather than data.table, since the partitioning functionality is going to be useful for me in the future!)

One issue I came across was how open_dataset() handles incorrectly quoted data, i.e. data where the default quote_char of " accidentally appears inside a column.

Here's how the behavior differs between data.table and arrow. (Here's the test_data.txt file for the code below. It's a .txt file because the original humongous data file is delivered as a .txt.)
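
For a rough sense of that file's shape, here is a minimal sketch that reproduces the behavior (the header names are made up; the two data rows mirror the ones shown in the parse error quoted later in this issue, including the stray " in the second field):

# Minimal illustrative input (hypothetical column names; the unmatched " in
# the first data row is what trips up the parser)
writeLines(c(
  "id|code|year|note|flag",
  '1|"RM-4.15|1958||1',
  "2||||1"
), "test_data.txt")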

# Does not work
library(arrow)
test <- open_dataset("test_data.txt",
             format = "text",
             delim = "|")

# Works - warns about improper quoting, but reads the data correctly
library(data.table)
test <- fread("test_data.txt")

The data.table documentation describes how fread handles this situation reasonably well; see the NEWS file here.

For now, I think I can change parse_options in the open_dataset() call to handle this, but it was quite fiddly - it was hard to track down in the docs how to do it. Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".

# Works, but is frustrating to figure out - and not ideal for the rest of the data
test <- open_dataset("test_data.txt",
             format = "text",
             parse_options = CsvParseOptions$create(
                                   delimiter = "|",
                                   quoting = TRUE,
                                   quote_char = '', # changing this to blank, instead of '"', solves the problem
                                   double_quote = TRUE,
                                   escaping = FALSE,
                                   escape_char = "\\",
                                   newlines_in_values = FALSE,
                                   ignore_empty_lines = TRUE))

I don't know whether improper quoting happens elsewhere in the data, so ideally there would be some way to detect and fix this type of improper quoting systematically (rather than skipping rows manually, or changing the quote_char to blank, which could cause issues for other columns).
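
One heuristic for the detection part - sketched here under the assumption that the file can be read line-by-line with readLines(), and with a made-up helper name - is to flag lines containing an odd number of quote characters before handing the file to arrow:

# Hypothetical helper (illustration only): flag lines whose quote-character
# count is odd, which usually indicates a stray, unmatched quote in some field.
find_unbalanced_quotes <- function(path, quote_char = '"') {
  lines <- readLines(path, warn = FALSE)
  # Count quote characters per line by comparing lengths before/after removal
  n_quotes <- nchar(lines) - nchar(gsub(quote_char, "", lines, fixed = TRUE))
  which(n_quotes %% 2 == 1)
}

# Inspect (or clean up) the offending lines before calling open_dataset()
bad_lines <- find_unbalanced_quotes("test_data.txt")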

Two questions:

  1. (immediate question) Are there more effective usage strategies for handling this type of data?
  2. (longer-term improvement) Would automatically handling this type of messy data be something that the arrow team would consider building into open_dataset()?

Thanks for your help!

Component(s)

R

@thisisnic
Member

Hi @angela-li, thanks for reporting this! Yeah, that looks pretty gnarly in terms of the paths you had to go down to find that information; we should definitely improve our docs around this. Would you mind telling me a bit more about where you looked and how you figured it out?

In terms of question 1, I'm not sure there is a better strategy here.

In terms of question 2, it sounds like a really helpful feature request, though it'd be a pretty substantial change, most likely in the C++ code, and I know that a lot of those devs are focused on other areas of the codebase. Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (unlike parsers such as readr and data.table, which offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being a problem that is better solved by multiple libraries.

thisisnic added a commit that referenced this issue Sep 29, 2023
…$create() docs (#37909)

### Rationale for this change

Add more function documentation for folks who want to manually change `quote_char` or `escape_char` options in `CsvParseOptions$create()`, as I did in #37908.

I had to go through the source code at arrow/r/R/csv.R - [line 587](https://github.com/apache/arrow/blob/7dc9f69a8a77345d0ec7920af9224ef96d7f5f78/r/R/csv.R#L587) to find the default argument, which was a pain.

### What changes are included in this PR?

Documentation changes

### Are these changes tested?

No - these changes don't touch the underlying code. Maintainers might need to rebuild the package to include these changes in the man/ folder.

### Are there any user-facing changes?

Yes, on documentation of methods.

Lead-authored-by: Angela Li <angela-li@users.noreply.github.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@angela-li
Contributor Author

angela-li commented Oct 1, 2023

Would you mind telling me a bit more about where you looked and how you figured it out?

Sure! Here's what I did to try to debug:

  • First I tried to zoom in on the line that was producing the error and subset it with the Linux shell (head, cut, etc.) to find the smallest subset of the data that still reproduced the problem. After inspecting it and comparing it to the row above, I started to get a hunch that it was the extra " character.
  • Then I read that tiny subset of data into data.table::fread() to see if it handled it better. It did, but fread produced a warning: "Warning: Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning."
    • I then went to the data.table source code and searched for the wording of that message in the GitHub repo, which led me to the NEWS file linked above.
  • I realized then that it had something to do with incorrectly quoted data and looked on Stack Overflow, which led me to this PyArrow documentation of quote_char (probably the most useful resource for figuring out the argument I'd need to tweak - so use this as a reference for doc improvements!).
  • From that, I thought there might be an equivalent of pyarrow.csv.ParseOptions in the R arrow package, which led me to the CsvParseOptions page on the pkgdown site. I didn't know what default value quote_char took, so I went and looked at the source code for CsvParseOptions and realized it was a ". And then I slowly started to work out how to fix it!
  • Along the way, I also found someone else on Stack Overflow who ran into the same issue but did not figure it out (but I felt glad I was not the only one with the problem).

Phew!

Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (unlike parsers such as readr and data.table, which offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being a problem that is better solved by multiple libraries.

That makes sense that Arrow is optimized for fast parsing of valid CSVs - that's what I started to suspect after seeing the kinds of examples Arrow is typically used for (oftentimes machine-output data, not messy human-collected data). I'll think about what to do in this case.

@thisisnic
Member

thisisnic commented Oct 3, 2023

Thanks, that's really helpful! I'm wondering if a nice solution here might be to create wrapper functions, e.g. csv_parse_options() etc., which just call CsvParseOptions$create() but are more nicely documented and easier to look up. I feel like this may only partially solve your problem though - the docs would be simpler to find, but perhaps the error message could be improved too.
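
For illustration, a minimal sketch of what such a wrapper might look like (the function name csv_parse_options() and its argument list are hypothetical here, not an existing arrow API):

# Hypothetical wrapper (illustration only): a documented, discoverable front
# door to CsvParseOptions$create() with the key defaults spelled out.
csv_parse_options <- function(delimiter = ",",
                              quote_char = '"',
                              escape_char = "\\",
                              ...) {
  arrow::CsvParseOptions$create(
    delimiter = delimiter,
    quote_char = quote_char,
    escape_char = escape_char,
    ...
  )
}

# e.g. open_dataset("test_data.txt", format = "text",
#                   parse_options = csv_parse_options(delimiter = "|", quote_char = ""))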

The error message:

Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from '/home/nic/test_data.txt'. Is this a 'csv' file?: Could not open CSV input source '/home/nic/test_data.txt': Invalid: CSV parse error: Row #2: Expected 5 columns, got 2: 1|"RM-4.15|1958||1
2||||1

I think the problem here is that the error could be caused by several different problems, so it's not definitive. We technically could search for the quote character in the problematic line and, if there's an uneven number of them, mention that, but I'm unsure whether it's a bit of a fringe case. Perhaps it's not, though, given someone else had the same issue!

@amoeba
Member

amoeba commented Oct 10, 2023

Hi @angela-li, when I've run into situations like yours in the past, I've resorted to adding a cleanup step in between the raw data and the less flexible system (in this case, arrow) in order to get the raw data in a form that can be read without issues. I can imagine this might not be practical for your use case.

This comment got me thinking,

Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".

One other thing you might try that arrow can do right now would be to make use of arrow's UnionDataset functionality. As described above, you essentially need to parse some files with one set of rules and other files with another. open_dataset can actually open other Datasets so you could do something like,

my_ds <- open_dataset(
  list(
    open_dataset("good_file.txt", format = "text"),
    open_dataset("bad_file.txt", format = "text", parse_options = CsvParseOptions$create(...))
  )
) # <- this is a UnionDataset

From here you can work with my_ds normally.
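
For example, the combined dataset can then be queried lazily like any other dataset (the dplyr verbs and the column name flag below are just illustrative):

library(dplyr)

# Nothing is read into memory until collect()
my_ds |>
  filter(flag == 1) |>   # hypothetical column name
  collect()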

This problem also reminds me of lubridate and its orders argument in lubridate::parse_date_time. One limitation of the above approach is that it requires you to know which files are problematic and which are not. So an idea would be to create a list of CsvParseOptions objects and try opening your files in a tryCatch, trying each option in turn. I've included a hacky example below.

flexible_open_dataset.R
library(arrow)

# First create a set of CsvParseOptions to try. Order matters.
default_parse_options <- CsvParseOptions$create(delimiter = "|")
quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
my_parse_options <- list(default_parse_options, quirk_parse_options)

# Then we define two helper functions that attempt to call open_dataset until one succeeds
flexible_open_dataset_single <- function(file, parse_options) {
  for (parse_option in parse_options) {
    ds <- tryCatch({
      open_dataset(file, format = "text", parse_options = parse_option)
    },
    error = function(e) {
      warning(
        "Failed to parse ", file,
        " with provided ParseOption. Trying any remaining options...")
      NULL
    })

    if (!is.null(ds)) {
      break
    }
  }

  ds
}

flexible_open_dataset <- function(files, parse_options) {
  open_dataset(lapply(files, function(f) { flexible_open_dataset_single(f, parse_options) }))
}

# Then, finally, we use our new helper and this should print a warning but otherwise work
my_ds <- flexible_open_dataset(c("test_data.txt", "test_data_good.txt"), my_parse_options)

If we wanted to provide something like this in arrow, one way would be to allow parse_options to take multiple values and use a similar mechanism internally to try each.
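
In user code, that might look something like this (a hypothetical API - open_dataset() does not currently accept a list of parse options):

# Hypothetical future API: parse_options as an ordered list of fallbacks,
# tried in turn for each file
my_ds <- open_dataset(
  c("test_data.txt", "test_data_good.txt"),
  format = "text",
  parse_options = list(default_parse_options, quirk_parse_options)
)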

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…$create() docs (apache#37909)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…$create() docs (apache#37909)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…$create() docs (apache#37909)