read_csv_arrow() incorrectly parses CSV files when a string value contains a comma that appears after a backslash-escaped quote mark. Originally noted by Thomas Klebel https://scicomm.xyz/@tklebel/109270436511066953
This is an example that throws the error:
x <- tempfile()
readr::write_lines(
'id,text
1,"some text on \\"BLAH\\" and X, and Y also"', x)
cat(system(paste('cat', x), intern=TRUE), sep="\n")
#> id,text
#> 1,"some text on \"BLAH\" and X, and Y also"
arrow::read_csv_arrow(x, escape_backslash=TRUE)
#> Error:
#> ! Invalid: CSV parse error: Expected 2 columns, got 3: 1,"some text on \"BLAH\" and X, and Y also"
#> Backtrace:
#>     ▆
#>  1. └─arrow (local) `<fn>`(file = x, escape_backslash = TRUE, delim = ",")
#>  2.   └─base::tryCatch(...) at r/R/csv.R:217:2
#>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5.         └─value[[3L]](cond)
#>  6.           └─arrow:::augment_io_error_msg(e, call, schema = schema) at r/R/csv.R:222:6
#>  7.             └─rlang::abort(msg, call = call) at r/R/util.R:251:2
Created on 2022-11-02 with reprex v2.0.2
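For comparison, the intended reading of the failing row can be sketched with Python's standard `csv` module. This is an independent reader, not Arrow, and it is used here only to show what a two-column parse of that row would look like when a backslash is treated as the escape character and quote doubling is disabled. The variable names are illustrative.

```python
import csv
import io

# The header plus the failing row from the report: a quoted field that
# contains backslash-escaped quotes followed by a comma.
data = 'id,text\n1,"some text on \\"BLAH\\" and X, and Y also"\n'

# Treat backslash as the escape character and turn off quote doubling,
# so that \" inside a quoted field is read as a literal quote.
reader = csv.reader(
    io.StringIO(data, newline=""),
    quotechar='"',
    escapechar="\\",
    doublequote=False,
)
rows = list(reader)

for row in rows:
    print(len(row), row)
```

Under these settings the comma after the escaped quote stays inside the quoted field, so the data row has two fields rather than three.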
This version includes four rows that might be expected to trigger the error but do not:
x <- tempfile()
readr::write_lines(
'id,text
2,"some text on X and Y"
3,"some text on X, and Y"
4,"some text on \\"BLAH\\"
5,"some text on X and Y, and \\"BLAH\\" also"', x)
cat(system(paste('cat', x), intern=TRUE), sep="\n")
#> id,text
#> 2,"some text on X and Y"
#> 3,"some text on X, and Y"
#> 4,"some text on \"BLAH\"
#> 5,"some text on X and Y, and \"BLAH\" also"
arrow::read_csv_arrow(x, escape_backslash=TRUE)
#> # A tibble: 4 × 2
#>      id text
#>   <int> <chr>
#> 1     2 "some text on X and Y"
#> 2     3 "some text on X, and Y"
#> 3     4 "some text on \\BLAH\\\""
#> 4     5 "some text on X and Y, and \\BLAH\\\" also\""
Created on 2022-11-02 with reprex v2.0.2
I'm not sure if the problem is R specific. I've partially reproduced the error using reticulate and pyarrow as follows, but notice that this errors at a different point: the pyarrow version appears to fail with the comma preceding the backslash-escaped quote mark:
x <- tempfile()
readr::write_lines(
'id,text
1,"some text on X and Y"
2,"some text on X, and Y"
3,"some text on \\"BLAH\\"
4,"some text on X and Y, and \\"BLAH\\" also"
5,"some text on \\"BLAH\\" and X, and Y also"', x)
cat(system(paste('cat', x), intern=TRUE), sep="\n")
#> id,text
#> 1,"some text on X and Y"
#> 2,"some text on X, and Y"
#> 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
#> 5,"some text on \"BLAH\" and X, and Y also"
csv <- reticulate::import("pyarrow.csv")
opt <- csv$ParseOptions(escape_char='\\')
csv$read_csv(x, parse_options=opt)
#> Error in py_call_impl(callable, dots$args, dots$keywords): pyarrow.lib.ArrowInvalid: CSV parse error: Expected 2 columns, got 3: 3,"some text on \"BLAH\"
#> 4,"some text on X and Y, and \"BLAH\" also"
Antoine Pitrou / @pitrou: [~tklebel] That file is parsed correctly by PyArrow using the following options:
>>> import io
>>> from pyarrow import csv
>>> s = 'id,text\n1,Some interesting text\n2,"Some text on: \\"how to break arrow\\" by X, and Y" \n'
>>> parse_options = csv.ParseOptions(escape_char='\\', quote_char='"', double_quote=False)
>>> csv.read_csv(io.BytesIO(s.encode()), parse_options=parse_options).to_pandas()
   id                                            text
0   1                           Some interesting text
1   2  Some text on: "how to break arrow" by X, and Y
Reporter: Danielle Navarro / @djnavarro
Note: This issue was originally created as ARROW-18219. Please see the migration documentation for further details.