[R] read_csv_arrow fails when a string contains a backslash-escaped quote mark followed by a comma #33405

asfimport · 2022-11-02T05:38:55Z

read_csv_arrow() incorrectly parses CSV files when a string value contains a comma that appears after a backslash-escaped quote mark. Originally noted by Thomas Klebel https://scicomm.xyz/@tklebel/109270436511066953

This is an example that throws the error:

x <- tempfile()
readr::write_lines(
'
id,text
1,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on \"BLAH\" and X, and Y also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on \"BLAH\" and X, and Y also"

#> Backtrace:
#> ▆
#> 1. └─arrow (local) `<fn>`(file = x, escape_backslash = TRUE, delim = ",")
#> 2. └─base::tryCatch(...) at r/R/csv.R:217:2
#> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 5. └─value[[3L]](cond)
#> 6. └─arrow:::augment_io_error_msg(e, call, schema = schema) at r/R/csv.R:222:6
#> 7. └─rlang::abort(msg, call = call) at r/R/util.R:251:2

^{Created on 2022-11-02 with reprex v2.0.2}

This version includes four lines that could plausibly trigger the error but do not:

x <- tempfile()
readr::write_lines(
'
id,text
2,"some text on X and Y"
3,"some text on X, and Y"
4,"some text on \\"BLAH\\"
5,"some text on X and Y, and \\"BLAH\\" also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 2,"some text on X and Y"
#> 3,"some text on X, and Y"
#> 4,"some text on \"BLAH\"
#> 5,"some text on X and Y, and \"BLAH\" also"
arrow::read_csv_arrow(x, escape_backslash = TRUE)
#> # A tibble: 4 × 2
#> id text 
#> <int> <chr> 
#> 1 2 "some text on X and Y" 
#> 2 3 "some text on X, and Y" 
#> 3 4 "some text on \\BLAH\\\"" 
#> 4 5 "some text on X and Y, and \\BLAH\\\" also\""

^{Created on 2022-11-02 with reprex v2.0.2}

I'm not sure whether the problem is R-specific. I've partially reproduced the error using reticulate and pyarrow as follows, but note that it fails at a different point: the pyarrow version appears to fail on the row where the comma precedes the backslash-escaped quote mark:

x <- tempfile()
readr::write_lines(
'
id,text
1,"some text on X and Y"
2,"some text on X, and Y"
3,"some text on \\"BLAH\\"
4,"some text on X and Y, and \\"BLAH\\" also"
5,"some text on \\"BLAH\\" and X, and Y also"
', x)

cat(system(paste('cat', x), intern = TRUE), sep = "\n")
#> 
#> id,text
#> 1,"some text on X and Y"
#> 2,"some text on X, and Y"
#> 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
#> 5,"some text on \"BLAH\" and X, and Y also"

csv <- reticulate::import("pyarrow.csv")
opt <- csv$ParseOptions(escape_char='\\')
csv$read_csv(x, parse_options = opt)
#> Error in py_call_impl(callable, dots$args, dots$keywords): pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"

^{Created on 2022-11-02 with reprex v2.0.2}

Reporter: Danielle Navarro / @djnavarro

_{Note: This issue was originally created as ARROW-18219. Please see the migration documentation for further details.}

asfimport · 2022-11-02T08:47:15Z

Antoine Pitrou / @pitrou:
How was the original CSV file produced? By which utility?

asfimport · 2022-11-02T08:48:06Z

Antoine Pitrou / @pitrou:
Note the CSV is incorrect: the third line misses an ending quote before the newline.

asfimport · 2022-11-02T10:41:54Z

Antoine Pitrou / @pitrou:
(for the record: the Python CSV reader also fails reading this CSV file "correctly")
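The missing closing quote can be illustrated with Python's standard-library `csv` module. This is only a sketch: the thread does not say which Python reader was used, and the dialect settings below (`escapechar="\\"`, `doublequote=False`) are assumptions chosen to mirror backslash escaping. It feeds rows 3 and 4 from the report to the reader and prints how many fields each parsed record contains.

```python
import csv
import io

# Rows 3 and 4 as they appear in the report. Row 3 ends with an escaped
# quote (\") and has no closing quote before the newline.
data = (
    'id,text\n'
    '3,"some text on \\"BLAH\\"\n'
    '4,"some text on X and Y, and \\"BLAH\\" also"\n'
)


def parse_rows(text):
    """Parse CSV text with backslash escaping (assumed dialect settings)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    return list(reader)


rows = parse_rows(data)
for row in rows:
    # Print the field count next to the parsed fields.
    print(len(row), row)
```

Because the quoted field opened on row 3 is never closed, the reader keeps consuming text across the newline, so the two physical lines end up in a single record with an extra field. That matches the "Expected 2 columns, got 3" message in the pyarrow error above.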

asfimport · 2022-11-03T09:02:43Z

Thomas Klebel:
The original CSV I was having issues with looked like this:

id,text
1,Some interesting text
2,"Some text on: \"how to break arrow\" by X, and Y"

It was created using spark_write_csv, with

sparklyr v1.7.5

Spark v2.3.2

HDFS v3.1.1

asfimport · 2022-11-03T13:18:08Z

Antoine Pitrou / @pitrou:
[~tklebel] That file is parsed correctly by PyArrow using the following options:

>>> import io
>>> from pyarrow import csv
>>> s
'id,text\n1,Some interesting text\n2,"Some text on: \\"how to break arrow\\" by X, and Y" \n'
>>> parse_options = csv.ParseOptions(escape_char='\\', quote_char='"', double_quote=False)
>>> csv.read_csv(io.BytesIO(s.encode()), parse_options=parse_options).to_pandas()
   id                                             text
0   1                            Some interesting text
1   2  Some text on: "how to break arrow" by X, and Y
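As a cross-check that does not depend on Arrow, the same file can be parsed with Python's standard-library `csv` module using analogous settings. This is a hedged sketch: `escapechar` and `doublequote` are the stdlib counterparts I am assuming correspond to `escape_char` and `double_quote` above, and the input is the file as posted by Thomas Klebel.

```python
import csv
import io

# The file as posted earlier in the thread: backslash-escaped quotes
# inside a quoted field, followed by a comma.
s = (
    'id,text\n'
    '1,Some interesting text\n'
    '2,"Some text on: \\"how to break arrow\\" by X, and Y"\n'
)

reader = csv.reader(
    io.StringIO(s, newline=""),
    quotechar='"',
    escapechar="\\",    # treat backslash as the escape character
    doublequote=False,  # do not interpret "" as an escaped quote
)

# First record is the header; the rest are data rows.
header, *records = reader

for record in records:
    print(record)
```

With these settings each data row yields two fields, and the escaped quotes are kept as literal quote characters in the text column.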

asfimport · 2022-11-03T13:19:48Z

Antoine Pitrou / @pitrou:
cc @thisisnic @paleolimbot
