[R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma #33405

Open
asfimport opened this issue Nov 2, 2022 · 6 comments

@asfimport

read_csv_arrow() incorrectly parses CSV files when a string value contains a comma that appears after a backslash-escaped quote mark. Originally noted by Thomas Klebel https://scicomm.xyz/@tklebel/109270436511066953

This is an example that throws the error:

x <- tempfile()
readr::write_lines(
'
id,text
1,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on \"BLAH\" and X, and Y also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on \"BLAH\" and X, and Y also"

#> Backtrace:
#>     ▆
#>  1. └─arrow (local) `<fn>`(file = x, escape_backslash = TRUE, delim = ",")
#>  2.   └─base::tryCatch(...) at r/R/csv.R:217:2
#>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5.         └─value[[3L]](cond)
#>  6.           └─arrow:::augment_io_error_msg(e, call, schema = schema) at r/R/csv.R:222:6
#>  7.             └─rlang::abort(msg, call = call) at r/R/util.R:251:2

Created on 2022-11-02 with reprex v2.0.2

This version includes four lines that might be expected to error but do not:

x <- tempfile()
readr::write_lines(
'
id,text
2,"some text on X and Y"
3,"some text on X, and Y"
4,"some text on \\"BLAH\\"
5,"some text on X and Y, and \\"BLAH\\" also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 2,"some text on X and Y"
#> 3,"some text on X, and Y"
#> 4,"some text on \"BLAH\"
#> 5,"some text on X and Y, and \"BLAH\" also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> # A tibble: 4 × 2
#>      id text
#>   <int> <chr>
#> 1     2 "some text on X and Y"
#> 2     3 "some text on X, and Y"
#> 3     4 "some text on \\BLAH\\\""
#> 4     5 "some text on X and Y, and \\BLAH\\\" also\""

Created on 2022-11-02 with reprex v2.0.2

I'm not sure whether the problem is R-specific. I've partially reproduced the error using reticulate and pyarrow as follows, but note that it errors at a different point: the pyarrow version appears to fail on the comma preceding the backslash-escaped quote mark:

x <- tempfile()
readr::write_lines(
'
id,text
1,"some text on X and Y"
2,"some text on X, and Y"
3,"some text on \\"BLAH\\"
4,"some text on X and Y, and \\"BLAH\\" also"
5,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on X and Y"
#> 2,"some text on X, and Y"
#> 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
#> 5,"some text on \"BLAH\" and X, and Y also"

csv <- reticulate::import("pyarrow.csv")
opt <- csv$ParseOptions(escape_char='\\')
csv$read_csv(x, parse_options = opt)
#> Error in py_call_impl(callable, dots$args, dots$keywords): pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"

Created on 2022-11-02 with reprex v2.0.2

Reporter: Danielle Navarro / @djnavarro

Note: This issue was originally created as ARROW-18219. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
How was the original CSV file produced? By which utility?

@asfimport

Antoine Pitrou / @pitrou:
Note the CSV is incorrect: the third line misses an ending quote before the newline.
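
To make the missing closing quote easier to spot, here is a minimal sketch in plain Python (no Arrow involved). The helper name `has_unclosed_quote` is hypothetical, and the sketch assumes the same conventions as the reprex: `"` is the quote character and `\` escapes the character that follows it. It only scans a single line, so it is a rough check and not a CSV parser.

```python
def has_unclosed_quote(line, quote='"', escape="\\"):
    """Return True if `line` ends while still inside a quoted field.

    Assumes backslash escaping: the character after `escape` is taken
    literally and never opens or closes a quoted field.
    """
    in_quotes = False
    skip_next = False
    for ch in line:
        if skip_next:
            # This character was escaped by the previous backslash.
            skip_next = False
            continue
        if ch == escape:
            skip_next = True
        elif ch == quote:
            in_quotes = not in_quotes
    return in_quotes


# Rows as they appear in the file written by the reprex above.
rows = [
    r'1,"some text on X and Y"',
    r'3,"some text on \"BLAH\"',
    r'5,"some text on \"BLAH\" and X, and Y also"',
]
flags = [has_unclosed_quote(row) for row in rows]
for row, flag in zip(rows, flags):
    print(flag, row)
```

Under these assumptions only the row with id 3 is flagged: both of its inner quotes are escaped, so the opening quote is never closed and a parser that honors the escape would carry the field over into the next line.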

@asfimport

Antoine Pitrou / @pitrou:
(for the record: the Python CSV reader also fails reading this CSV file "correctly")

@asfimport

Thomas Klebel:
The original CSV I was having issues with looked like this:

id,text
1,Some interesting text
2,"Some text on: \"how to break arrow\" by X, and Y" 

It was created using spark_write_csv, with

sparklyr v1.7.5

Spark v2.3.2

HDFS v3.1.1

@asfimport

Antoine Pitrou / @pitrou:
[~tklebel] That file is parsed correctly by PyArrow using the following options:

>>> s
'id,text\n1,Some interesting text\n2,"Some text on: \\"how to break arrow\\" by X, and Y" \n'
>>> parse_options = csv.ParseOptions(escape_char='\\', quote_char='"', double_quote=False)
>>> csv.read_csv(io.BytesIO(s.encode()), parse_options=parse_options).to_pandas()
   id                                             text
0   1                            Some interesting text
1   2  Some text on: "how to break arrow" by X, and Y 
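
For readers without pyarrow at hand, the effect of these options can be sketched with Python's standard-library `csv` module, which has analogous settings (`escapechar`, `quotechar`, `doublequote`). This is a different parser from Arrow's, so it only illustrates the idea and says nothing about Arrow's internals; the `parse` helper is a hypothetical convenience wrapper.

```python
import csv
import io

# Same content as the string `s` in the session above.
sample = (
    'id,text\n'
    '1,Some interesting text\n'
    '2,"Some text on: \\"how to break arrow\\" by X, and Y" \n'
)


def parse(text, **dialect):
    """Parse CSV text with the stdlib reader and return a list of rows."""
    return list(csv.reader(io.StringIO(text, newline=""), **dialect))


# Default dialect: no escape character, doubled quotes enabled.
default_rows = parse(sample)

# Backslash as the escape character, doubled quotes disabled.
escaped_rows = parse(sample, quotechar='"', escapechar="\\", doublequote=False)

print([len(row) for row in default_rows])
print([len(row) for row in escaped_rows])
print(escaped_rows[2][1])
```

With the default dialect the stdlib reader treats the first `\"` as the end of the quoted field, so the later comma splits the last row into three fields. With the backslash declared as the escape character and `doublequote=False`, every row has two fields and the inner quotes are kept in the text.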
