[R] CSV parser got out of sync with chunker #39857

larry77 · 2024-01-30T16:12:21Z

Describe the bug, including details regarding any error messages, version, and platform.

Hello,
Unfortunately the example involves a large dataset and, according to my tests, the error appears once the number of lines read exceeds 1.6 million.

The data can be downloaded as a compressed file from (nothing dangerous in the link).

https://e.pcloud.link/publink/show?code=XZqHIeZokLxWCpx940hw3y45fsKqJPAVK0X

Using a script I have had for quite some time, I want to open the tsv (tab separated file) I get when I decompress the file and then save it as a parquet file without holding it (entirely) in memory.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

data <- open_dataset("export.tsv",
  format = "tsv",
  skip_rows = 1, 
  schema = schema(
    AID_MEASURE_ID = string(), 
    DATE_CREATED = string(), 
    DATE_GRANTED = string(), 
    AA_PUBLISHED_DATE = string(), 
    SERVER_REF = string(), 
    AM_TITLE = string(), 
    AM_TITLE_EN = string(), 
    STATUS = string(), 
    AM_PROC_TYPE_CD = string(), 
    COFINANCE = string(), 
    OBJECTIVE = string(), 
    OTHER_OBJECTIVE_EN = string(), 
    AID_INSTRUMENT = string(), 
    OTHER_AID_INSTRUMENT_EN = string(), 
    BENEFICIARY_NAME = string(), 
    BENEFICIARY_NAME_ENGLISH = string(), 
    BENEFICIARY_NATIONAL_ID = string(), 
    BENEFICIARY_NAT_ID_TYPE_SD = string(), 
    BENEFICIARY_TYPE_SD = string(), 
    COUNTRY_SD = string(), 
    REGION_SD = string(), 
    SECTOR_SD = string(), 
    GRANTED_AMOUNT_FROM_EUR = double(), 
    NOMINAL_AMOUNT_EUR_FROM = double(), 
    GRANT_RANGE = string(),
    GRANTED_AMOUNT_RANGE_DESC = string(), 
    GRANTING_AUTHORITY_NAME = string(), 
    GRANTING_AUTHORITY_NAME_EN = string(), 
    NUTS_CD = string(), 
    GRANTING_AUTHORITY_COUNTRY = string()
  )
)

  
write_dataset(
  data,
  format = "parquet",
  path = ".",
  max_rows_per_file = 1e7
)
#> Error: Invalid: CSV parser got out of sync with chunker
    

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] arrow_14.0.0.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.4       cli_3.6.1         knitr_1.45        rlang_1.1.2      
#>  [5] xfun_0.41         purrr_1.0.2       styler_1.10.2     generics_0.1.3   
#>  [9] assertthat_0.2.1  glue_1.6.2        bit_4.0.5         htmltools_0.5.7  
#> [13] fansi_1.0.5       rmarkdown_2.25    R.cache_0.16.0    tibble_3.2.1     
#> [17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.4  
#> [21] compiler_4.3.2    dplyr_1.1.3       fs_1.6.3          pkgconfig_2.0.3  
#> [25] R.oo_1.25.0       R.utils_2.12.2    digest_0.6.33     R6_2.5.1         
#> [29] utf8_1.2.4        reprex_2.0.2      tidyselect_1.2.0  pillar_1.9.0     
#> [33] magrittr_2.0.3    R.methodsS3_1.8.2 tools_4.3.2       withr_2.5.2      
#> [37] bit64_4.0.5

Created on 2024-01-30 with reprex v2.0.2

Any idea of what the issue may be? Thanks!

Component(s)

R

thisisnic · 2024-01-31T17:51:48Z

Thanks for reporting this, @larry77!
Can you confirm which version of the R package you're using? And have you used this code+data with an earlier version and it worked, or is this the first time you're running this?

larry77 · 2024-01-31T20:44:57Z

Hello!
As per reprex, I use arrow 14.0.0.2. All I can say is that I have used an earlier version of R arrow on a slightly shorter dataset (same structure) and it worked. The present version also works on the shorter dataset. However, the present data set is not pathological: I can read it with read_csv from readr and it works. It seems the problem arose once the dataset grew above 1.6 million lines.

thisisnic · 2024-01-31T21:31:00Z

@pitrou I took a look at the C++ code that raises this error, but couldn't quite figure out what had happened here - do you know what it might be?

pitrou · 2024-02-01T16:44:54Z

Hmm, I can reproduce using PyArrow, I'll try to see if I can further diagnose this.

Note, however, that reading this data file requires enabling the newlines_in_values option, because some cell values span multiple lines.

pitrou · 2024-02-01T17:45:02Z

Ok, the error message is weird, but it is really a consequence of having newlines in values.

pitrou · 2024-02-01T17:51:16Z

I'll put up a PR to improve the error message.

pitrou · 2024-02-01T18:00:38Z

Note that, once you enable the newlines_in_values option, reading the CSV file should be successful. For example with PyArrow:

        AID_MEASURE_ID DATE_CREATED DATE_GRANTED  ...                         GRANTING_AUTHORITY_NAME_EN NUTS_CD GRANTING_AUTHORITY_COUNTRY
0             SA.42315     16/09/16     30/08/16  ...                     Ministry of Industry and Trade                            Czechia
1             SA.42315     16/09/16     26/08/16  ...                     Ministry of Industry and Trade                            Czechia
2             SA.42328     19/09/16     16/08/16  ...  Ministry of Industry and Trade, Department of ...                            Czechia
3             SA.41602     21/09/16     01/07/16  ...                                              VLAIO                            Belgium
4             SA.41602     26/09/16     15/07/16  ...                                              VLAIO                            Belgium
...                ...          ...          ...  ...                                                ...     ...                        ...
1677781      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677782      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677783      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677784      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany
1677785      SA.100743     24/01/24     15/03/23  ...                   CCI for Munich and Upper Bavaria     DE2                    Germany

[1677786 rows x 30 columns]

larry77 · 2024-02-01T18:22:20Z

Thanks! Do I have the same `newlines_in_values` also in the R package and open_dataset? Cheers


GH-39857: [C++] Improve error message for "chunker out of sync" condition (#39892)

### Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message.

It turns out that, if the input contains multiline cell values and the `newlines_in_values` option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

### What changes are included in this PR?

* Add some parser tests that showcase the condition encountered in GH-39857
* Improve error message to guide users towards the solution

### Are these changes tested?

There's no functional change, the error message itself isn't tested.

### Are there any user-facing changes?

No.

* Closes: #39857

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
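
The condition described in the rationale can be illustrated with a toy model. This is only a sketch and not Arrow's actual chunker or parser: the helper names `naive_block_end` and `quote_aware_block_end`, the sample data, and the 24-byte block limit are all hypothetical. It shows how cutting a block at the last newline, without tracking quotes, can end the block inside a quoted value.

```python
def naive_block_end(data: bytes, limit: int) -> int:
    """End of the first block: just past the last newline within `limit` bytes."""
    return data.rfind(b"\n", 0, limit) + 1


def quote_aware_block_end(data: bytes, limit: int) -> int:
    """Same, but ignore newlines that fall inside a double-quoted value."""
    in_quotes = False
    end = 0
    for i, byte in enumerate(data[:limit]):
        if byte == ord('"'):
            in_quotes = not in_quotes
        elif byte == ord("\n") and not in_quotes:
            end = i + 1
    return end


# Hypothetical TSV sample whose second field spans two lines.
data = b'id\ttitle\n1\t"first line\nsecond line"\n2\tplain\n'

naive_block = data[: naive_block_end(data, 24)]
aware_block = data[: quote_aware_block_end(data, 24)]

print(naive_block)  # ends in the middle of the quoted value
print(aware_block)  # ends after the header row, a true record boundary
```

In this model the naive block contains an unterminated quoted string, which corresponds to the truncated data the parser is said to see; the quote-aware variant corresponds roughly to what enabling `newlines_in_values` is meant to achieve.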

dmontecino · 2024-08-28T16:36:01Z

Hey thanks for this awesome package. Any news on this error?

larry77 added the Type: bug label Jan 30, 2024

github-actions bot added the Component: R label Jan 30, 2024

thisisnic added the Component: C++ label Jan 31, 2024

pitrou added a commit to pitrou/arrow that referenced this issue Feb 1, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition (1ce1131)

pitrou added a commit to pitrou/arrow that referenced this issue Feb 1, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition (b5a5b51)

pitrou mentioned this issue Feb 1, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition #39892

Merged

github-actions bot assigned pitrou Feb 1, 2024

kou changed the title ~~[R]: CSV parser got out of sync with chunker~~ [R] CSV parser got out of sync with chunker Feb 2, 2024

github-actions bot removed the Component: C++ label Feb 2, 2024

pitrou added a commit to pitrou/arrow that referenced this issue Feb 5, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition (f0e4fbd)

pitrou added a commit to pitrou/arrow that referenced this issue Feb 5, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition (72c2695)

pitrou added a commit to pitrou/arrow that referenced this issue Feb 6, 2024

GH-39857: [C++] Improve error message for "chunker out of sync" condition (7d03352)

pitrou closed this as completed in #39892 Feb 6, 2024

pitrou added this to the 16.0.0 milestone Feb 6, 2024
