[R] CSV parser got out of sync with chunker #39857
Comments
Thanks for reporting this, @larry77! |
Hello! |
@pitrou I took a look at the C++ code that raises this error, but couldn't quite figure out what had happened here - do you know what it might be? |
Hmm, I can reproduce using PyArrow, I'll try to see if I can further diagnose this. Note, however, that this data file will need to set |
Ok, the error message is weird, but it is really a consequence of having newlines in values. |
I'll put up a PR to improve the error message. |
Note that, once you enable the `newlines_in_values` option, reading the CSV file should be successful.
|
Thanks!
Is the same `newlines_in_values` option also available in the R package and
in `open_dataset`?
Cheers
…On Thu, Feb 01, 2024 at 10:00:50AM -0800, Antoine Pitrou wrote:
Note that, once you enable the `newlines_in_values` option, reading the CSV file should be successful. For example with PyArrow:
```
AID_MEASURE_ID DATE_CREATED DATE_GRANTED ... GRANTING_AUTHORITY_NAME_EN NUTS_CD GRANTING_AUTHORITY_COUNTRY
0 SA.42315 16/09/16 30/08/16 ... Ministry of Industry and Trade Czechia
1 SA.42315 16/09/16 26/08/16 ... Ministry of Industry and Trade Czechia
2 SA.42328 19/09/16 16/08/16 ... Ministry of Industry and Trade, Department of ... Czechia
3 SA.41602 21/09/16 01/07/16 ... VLAIO Belgium
4 SA.41602 26/09/16 15/07/16 ... VLAIO Belgium
... ... ... ... ... ... ... ...
1677781 SA.100743 24/01/24 15/03/23 ... CCI for Munich and Upper Bavaria DE2 Germany
1677782 SA.100743 24/01/24 15/03/23 ... CCI for Munich and Upper Bavaria DE2 Germany
1677783 SA.100743 24/01/24 15/03/23 ... CCI for Munich and Upper Bavaria DE2 Germany
1677784 SA.100743 24/01/24 15/03/23 ... CCI for Munich and Upper Bavaria DE2 Germany
1677785 SA.100743 24/01/24 15/03/23 ... CCI for Munich and Upper Bavaria DE2 Germany
[1677786 rows x 30 columns]
```
|
…tion (#39892)

### Rationale for this change

When writing the CSV reader, we thought that the parser not finding the same line limits as the chunker should never happen, hence the terse "chunker out of sync" error message. It turns out that, if the input contains multiline cell values and the `newlines_in_values` option was not enabled, the chunker can happily delimit a block on a newline that's inside a quoted string. The parser will then see truncated data and will stop parsing, yielding a parsed size that's smaller than the first block (see added comment in the code).

### What changes are included in this PR?

* Add some parser tests that showcase the condition encountered in GH-39857
* Improve error message to guide users towards the solution

### Are these changes tested?

There's no functional change, the error message itself isn't tested.

### Are there any user-facing changes?

No.

* Closes: #39857

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
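The failure mode described in the rationale above can be illustrated without Arrow at all. What follows is only a rough sketch using the Python standard library, not Arrow's actual chunker or parser: the sample data, the `chunk` and `splits_quoted_value` helpers, and the block size of 24 are all made up for illustration. It shows how a chunker that treats every newline as a row boundary can end a block inside a quoted value, while a quote-aware one does not.

```python
import csv
import io

# Hypothetical sample: a TSV whose first data row has a quoted, multiline cell.
DATA = 'id\tnote\n1\t"first line\nsecond line"\n2\tplain\n'


def chunk(data, block_size, quote_aware):
    """Cut data into blocks that end on a newline near each block_size boundary.

    With quote_aware=False every newline is accepted as a row boundary.
    With quote_aware=True a newline only counts when the quotes seen since
    the start of the block are balanced (i.e. we are outside a quoted value).
    """
    blocks, start = [], 0
    while start < len(data):
        cut = -1
        pos = data.find("\n", start)
        while pos != -1:
            inside_quotes = data.count('"', start, pos) % 2 == 1
            if not (quote_aware and inside_quotes):
                if pos - start + 1 > block_size and cut != -1:
                    break  # this row would overflow the block; keep the previous cut
                cut = pos
            pos = data.find("\n", pos + 1)
        end = cut + 1 if cut != -1 else len(data)
        blocks.append(data[start:end])
        start = end
    return blocks


def splits_quoted_value(block):
    """An odd number of quote characters means the block ends inside a quoted value."""
    return block.count('"') % 2 == 1


naive_blocks = chunk(DATA, 24, quote_aware=False)
aware_blocks = chunk(DATA, 24, quote_aware=True)

for name, blocks in (("naive", naive_blocks), ("quote-aware", aware_blocks)):
    for block in blocks:
        print(name, repr(block), "SPLITS A VALUE" if splits_quoted_value(block) else "ok")

# A quote-aware parser over the whole input sees a header plus two data rows.
rows = list(csv.reader(io.StringIO(DATA), delimiter="\t"))
print(len(rows) - 1, "data rows")
```

In this toy version the naive chunker's first block stops right after `"first line`, so any parser handed that block alone sees an unterminated quoted value. That is the same kind of disagreement between chunker and parser that the improved error message points users towards fixing with `newlines_in_values`.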
Hey thanks for this awesome package. Any news on this error? |
Describe the bug, including details regarding any error messages, version, and platform.
Hello,
Unfortunately, the example involves a large dataset and, according to my tests, the error appears only when the number of lines read goes above 1.6 million.
The data can be downloaded as a compressed file from (nothing dangerous in the link).
https://e.pcloud.link/publink/show?code=XZqHIeZokLxWCpx940hw3y45fsKqJPAVK0X
Using a script I have had for quite some time, I want to open the TSV (tab-separated) file I get when I decompress the download and then save it as a Parquet file without holding it (entirely) in memory.
Created on 2024-01-30 with reprex v2.0.2
Any idea of what the issue may be? Thanks!
Component(s)
R