
[R] open_dataset() behavior with incorrectly quoted input data #37908

Open
angela-li opened this issue Sep 27, 2023 · 4 comments
Labels
Component: R · Type: usage (Issue is a user question)

Comments

@angela-li
Contributor

angela-li commented Sep 27, 2023

Describe the usage question you have. Please include as many useful details as possible.

I've started using {arrow} to read data in R, and I noticed that it handles messy (aka human-collected!) tabular data somewhat less gracefully than {data.table}'s fread function. (But I do want to work with arrow rather than data.table, since the partitioning functionality is going to be useful for me in the future!)

One issue I came across was how open_dataset() handles incorrectly quoted data, i.e. data where the default quote_char of " accidentally appears inside a column.

Here's how the behavior differs between data.table and arrow. (Here's the test_data.txt file for the code below. It's a .txt file because the original humongous data file is delivered as a .txt.)
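
For a rough sense of that file's shape, here is a minimal sketch that reproduces the behavior (the header names are made up; the two data rows mirror the ones shown in the parse error quoted later in this issue, including the stray " in the second field):

# Minimal illustrative input (hypothetical column names; the unmatched " in
# the first data row is what trips up the parser)
writeLines(c(
  "id|code|year|note|flag",
  '1|"RM-4.15|1958||1',
  "2||||1"
), "test_data.txt")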

# Does not work
library(arrow)
test <- open_dataset("test_data.txt",
             format = "text",
             delim = "|")

# Works - warns about improper quoting, but reads the data correctly
library(data.table)
test <- fread("test_data.txt")

The data.table documentation describes how fread handles this situation reasonably well; see the NEWS file here.

For now, I think I can change parse_options in the open_dataset() call to handle this, but it was quite fiddly - it was hard to track down in the docs how to do it. Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".

# Works, but is frustrating to figure out - and not ideal for the rest of the data
test <- open_dataset("test_data.txt",
             format = "text",
             parse_options = CsvParseOptions$create(
                                   delimiter = "|",
                                   quoting = TRUE,
                                   quote_char = '', # changing this to blank, instead of '"', solves the problem
                                   double_quote = TRUE,
                                   escaping = FALSE,
                                   escape_char = "\\",
                                   newlines_in_values = FALSE,
                                   ignore_empty_lines = TRUE))

I don't know whether improper quoting happens elsewhere in the data, so ideally there would be some way to detect and fix this type of improper quoting systematically (rather than skipping rows manually, or changing the quote_char to blank, which could cause issues for other columns).
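
One heuristic for the detection part - sketched here under the assumption that the file can be read line-by-line with readLines(), and with a made-up helper name - is to flag lines containing an odd number of quote characters before handing the file to arrow:

# Hypothetical helper (illustration only): flag lines whose quote-character
# count is odd, which usually indicates a stray, unmatched quote in some field.
find_unbalanced_quotes <- function(path, quote_char = '"') {
  lines <- readLines(path, warn = FALSE)
  # Count quote characters per line by comparing lengths before/after removal
  n_quotes <- nchar(lines) - nchar(gsub(quote_char, "", lines, fixed = TRUE))
  which(n_quotes %% 2 == 1)
}

# Inspect (or clean up) the offending lines before calling open_dataset()
bad_lines <- find_unbalanced_quotes("test_data.txt")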

Two questions:

  1. (immediate question) Are there more effective usage strategies for handling this type of data?
  2. (longer-term improvement) Would automatically handling this type of messy data be something that the arrow team would consider building into open_dataset()?

Thanks for your help!

Component(s)

R

@thisisnic
Member

Hi @angela-li, thanks for reporting this! Yeah, that looks pretty gnarly in terms of the paths you had to go down to find that information; we should definitely improve our docs around this. Would you mind telling me a bit more about where you looked and how you figured it out?

In terms of question 1, I'm not sure there is a better strategy here.

In terms of question 2, it sounds like a really helpful feature request, though it'd be a pretty substantial change, most likely in the C++ code, and I know that a lot of those devs are focused on other areas of the codebase. Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (unlike parsers such as readr and data.table, which offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being a problem that is better solved by multiple libraries.

thisisnic added a commit that referenced this issue Sep 29, 2023
…$create() docs (#37909)

### Rationale for this change

Add more function documentation for folks who want to manually change `quote_char` or `escape_char` options in `CsvParseOptions$create()`, as I did in #37908.

I had to go through the source code at arrow/r/R/csv.R - [line 587](https://github.com/apache/arrow/blob/7dc9f69a8a77345d0ec7920af9224ef96d7f5f78/r/R/csv.R#L587) to find the default argument, which was a pain.

### What changes are included in this PR?

Documentation changes

### Are these changes tested?

No - these changes don't touch the underlying code. Maintainers might need to rebuild the package to include these changes in the man/ folder.

### Are there any user-facing changes?

Yes, on documentation of methods.

Lead-authored-by: Angela Li <angela-li@users.noreply.github.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@angela-li
Contributor Author

angela-li commented Oct 1, 2023

Would you mind telling me a bit more about where you looked and how you figured it out?

Sure! Here's what I did to try to debug:

  • First I tried to zoom in on the line that was producing the error and subset it with the Linux shell (head, cut, etc.) to find the smallest subset of the data that still reproduced the problem. After inspecting it and comparing it to the row above, I started to get a hunch that it was the extra " character.
  • Then I read that tiny subset of data into data.table::fread() to see if it handled it better. It did, but fread produced a warning: "Warning: Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning."
    • I then went to the data.table source code and searched for the wording of that message in the GitHub repo, which led me to the NEWS file linked above.
  • I realized then that it had something to do with incorrectly quoted data and looked on Stack Overflow, which led me to this PyArrow documentation of quote_char (probably the most useful resource for figuring out the argument I'd need to tweak - so use this as a reference for doc improvements!).
  • From that, I thought there might be an equivalent of pyarrow.csv.ParseOptions in the R arrow package, which led me to the CsvParseOptions page on the pkgdown site. I didn't know what default value quote_char took, so I went and looked at the source code for CsvParseOptions and realized it was a ". And then I slowly started to work out how to fix it!
  • Along the way, I also found someone else on Stack Overflow who ran into the same issue but did not figure it out (but I felt glad I was not the only one with the problem).

Phew!

Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (unlike parsers such as readr and data.table, which offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being a problem that is better solved by multiple libraries.

That makes sense that Arrow is optimized for fast parsing of valid CSVs - that's what I started to suspect after seeing the kinds of examples Arrow is typically used for (oftentimes machine-output data, not messy human-collected data). I'll think about what to do in this case.

@thisisnic
Member

thisisnic commented Oct 3, 2023

Thanks, that's really helpful! I'm wondering if a nice solution here might be to create wrapper functions, e.g. csv_parse_options() etc., which just call CsvParseOptions$create() but are more nicely documented and easier to look up. I feel like this may only partially solve your problem though - the docs would be simpler to find, but perhaps the error message could be improved too.
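
For illustration, a minimal sketch of what such a wrapper might look like (the function name csv_parse_options() and its argument list are hypothetical here, not an existing arrow API):

# Hypothetical wrapper (illustration only): a documented, discoverable front
# door to CsvParseOptions$create() with the key defaults spelled out.
csv_parse_options <- function(delimiter = ",",
                              quote_char = '"',
                              escape_char = "\\",
                              ...) {
  arrow::CsvParseOptions$create(
    delimiter = delimiter,
    quote_char = quote_char,
    escape_char = escape_char,
    ...
  )
}

# e.g. open_dataset("test_data.txt", format = "text",
#                   parse_options = csv_parse_options(delimiter = "|", quote_char = ""))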

The error message:

Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from '/home/nic/test_data.txt'. Is this a 'csv' file?: Could not open CSV input source '/home/nic/test_data.txt': Invalid: CSV parse error: Row #2: Expected 5 columns, got 2: 1|"RM-4.15|1958||1
2||||1

I think the problem here is that the error could be caused by several different problems, so it's not definitive. We technically could search for the quote character in the problematic line and, if there's an uneven number of them, mention that, but I'm unsure whether it's a bit of a fringe case. Perhaps it's not, though, given someone else had the same issue!

@amoeba
Member

amoeba commented Oct 10, 2023

Hi @angela-li, when I've run into situations like yours in the past, I've resorted to adding a cleanup step in between the raw data and the less flexible system (in this case, arrow) in order to get the raw data in a form that can be read without issues. I can imagine this might not be practical for your use case.

This comment got me thinking,

Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".

One other thing you might try that arrow can do right now would be to make use of arrow's UnionDataset functionality. As described above, you essentially need to parse some files with one set of rules and other files with another. open_dataset can actually open other Datasets so you could do something like,

my_ds <- open_dataset(
  list(
    open_dataset("good_file.txt", format = "text"),
    open_dataset("bad_file.txt", format = "text", parse_options = CsvParseOptions$create(...))
  )
) # <- this is a UnionDataset

From here you can work with my_ds normally.
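
For example, the combined dataset can then be queried lazily like any other dataset (the dplyr verbs and the column name flag below are just illustrative):

library(dplyr)

# Nothing is read into memory until collect()
my_ds |>
  filter(flag == 1) |>   # hypothetical column name
  collect()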

This problem also reminds me of lubridate and its orders argument in lubridate::parse_date_time. One limitation of the above approach is that it requires you to know which files are problematic and which are not. So an idea would be to create a list of CsvParseOptions objects and try opening your files in a tryCatch, trying each option in turn. I've included a hacky example below.

flexible_open_dataset.R
library(arrow)

# First create a set of CsvParseOptions to try. Order matters.
default_parse_options <- CsvParseOptions$create(delimiter = "|")
quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
my_parse_options <- list(default_parse_options, quirk_parse_options)

# Then we define two helper functions that attempt to call open_dataset until one succeeds
flexible_open_dataset_single <- function(file, parse_options) {
  for (parse_option in parse_options) {
    ds <- tryCatch({
      open_dataset(file, format = "text", parse_options = parse_option)
    },
    error = function(e) {
      warning(
        "Failed to parse ", file,
        " with provided ParseOption. Trying any remaining options...")
      NULL
    })

    if (!is.null(ds)) {
      break
    }
  }

  ds
}

flexible_open_dataset <- function(files, parse_options) {
  open_dataset(lapply(files, function(f) { flexible_open_dataset_single(f, parse_options) }))
}

# Then, finally, we use our new helper and this should print a warning but otherwise work
my_ds <- flexible_open_dataset(c("test_data.txt", "test_data_good.txt"), my_parse_options)

If we wanted to provide something like this in arrow, one way would be to allow parse_options to take multiple values and use a similar mechanism internally to try each.
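
In user code, that might look something like this (a hypothetical API - open_dataset() does not currently accept a list of parse options):

# Hypothetical future API: parse_options as an ordered list of fallbacks,
# tried in turn for each file
my_ds <- open_dataset(
  c("test_data.txt", "test_data_good.txt"),
  format = "text",
  parse_options = list(default_parse_options, quirk_parse_options)
)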

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
…$create() docs (apache#37909)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…$create() docs (apache#37909)

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…$create() docs (apache#37909)