[C++] CSV skip wrong rows #26306

Closed
asfimport opened this issue Oct 15, 2020 · 8 comments

@asfimport

It would be helpful to add another option to ReadOptions that enables skipping rows with invalid data (e.g. a data type mismatch with the column type) and continuing with the following rows. The numbers of the skipped rows could be reported at the end of processing.

This way I can deal with the wrongly formatted data, or ignore it if the load success rate is high and I don't care about the exceptions.

Reporter: Maciej / @mskrzypkowski

Note: This issue was originally created as ARROW-10315. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Skipping rows entirely will be difficult. We could add an option to emit nulls in that case, though. What do you think?
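
To illustrate the "emit nulls" alternative, here is a minimal pure-Python sketch (not Arrow's implementation). The function name `convert_column_with_nulls` is hypothetical. It shows why this option is simpler than skipping: each column keeps its original length, so no other column needs to be touched.

```python
def convert_column_with_nulls(values, convert):
    """Convert one column, emitting None where a value cannot be converted.

    Hypothetical helper for illustration; `convert` is any callable that
    raises ValueError on a bad field (e.g. int or float).
    """
    out = []
    for value in values:
        try:
            out.append(convert(value))
        except ValueError:
            # The row is kept; only this cell becomes null.
            out.append(None)
    return out


ids = convert_column_with_nulls(["1", "x", "3"], int)
```

The output column has the same number of entries as the input, so rows stay aligned across columns without any coordination between them.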

@asfimport

Maciej / @mskrzypkowski:
Emitting nulls wouldn't work for me. I may stick with checking the file myself before loading it with Arrow.
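
A pre-check of this kind could look roughly like the following sketch, which uses only Python's standard `csv` module. The function `split_valid_rows` and the `converters` list are assumptions made for illustration; they are not part of Arrow.

```python
import csv
import io


def split_valid_rows(text, converters):
    """Return (header, good_rows, bad_row_numbers) for CSV text with a header.

    `converters` is a hypothetical list of callables, one per column, each
    raising ValueError when a field does not match the column type.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    good_rows, bad_row_numbers = [], []
    # Data rows are numbered from 1, not counting the header line.
    for row_number, row in enumerate(reader, start=1):
        try:
            if len(row) != len(converters):
                raise ValueError("wrong number of fields")
            for convert, field in zip(converters, row):
                convert(field)
        except ValueError:
            bad_row_numbers.append(row_number)
        else:
            good_rows.append(row)
    return header, good_rows, bad_row_numbers


sample = "id,price\n1,2.5\nx,3.0\n3,oops\n4,1.0\n"
header, good, bad = split_valid_rows(sample, [int, float])
```

The good rows could then be written back out and handed to the CSV reader, while the bad row numbers are kept for later inspection. The cost is that every field is parsed twice.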

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
For reference (not that we should copy pandas.read_csv's full API, though ;)), pandas.read_csv has an error_bad_lines option. By default it raises an error on a "bad line" (e.g. not enough values, or too many values), but it can be set to skip such lines instead.

@pitrou when you say that skipping is difficult, is this because if you encounter an error in the value for a certain column, the values are already appended to the builder for the previous columns?

@asfimport

Antoine Pitrou / @pitrou:
Columns are converted independently from each other. So we would need a way to remove individual values from an array after the fact. Not only do we not have such a facility currently, but it wouldn't be costless at runtime either.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
And a summary of some discussion related to this in pandas (an old issue of mine): pandas-dev/pandas#15122

In general it might be useful to add some options on how to deal with lines with a wrong number of elements (e.g. filling with nulls if there are too few, skipping the extra values if there are too many).
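
As a rough illustration of those two policies, the sketch below pads short rows with nulls and drops extra values from long rows. `normalize_row` is a hypothetical helper, not an existing Arrow or pandas function.

```python
def normalize_row(row, width):
    """Force a parsed row to exactly `width` fields.

    Hypothetical helper: short rows are padded with None, and extra
    trailing values in long rows are dropped.
    """
    if len(row) < width:
        return row + [None] * (width - len(row))
    return row[:width]


short = normalize_row(["1"], 3)
long_ = normalize_row(["1", "2", "3", "4"], 3)
```

Both policies silently change the data, which is presumably why they would be opt-in options rather than defaults.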

@asfimport

Antoine Pitrou / @pitrou:
This issue specifically mentions "data type mismatch with column type", which is a different category of errors.

Dealing with lines with a wrong number of elements wouldn't be easier, though the difficulties would reside in the parser.

@asfimport

Micah Kornfield / @emkornfield:

> Columns are converted independently from each other. So we would need a way to remove individual values from an array after the fact. Not only do we not have such a facility currently, but it wouldn't be costless at runtime either.

This seems like a very similar case to having to filter out rows. Would a viable solution be to optionally create a second array (i.e. non-nullable bool), and do the compaction afterwards? Not costless, but I would hope a code path like this would be optimized in the compute kernels?
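
The mask-and-compact idea can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: columns are Python lists, the shared `keep` list stands in for the non-nullable bool array, and the names `convert_column` and `read_skipping_bad_rows` are hypothetical. In Arrow the compaction step would presumably be a filter compute kernel.

```python
def convert_column(values, convert, keep):
    """Convert one column; clear keep[i] for every value that fails."""
    out = []
    for i, value in enumerate(values):
        try:
            out.append(convert(value))
        except ValueError:
            out.append(None)  # placeholder, dropped during compaction
            keep[i] = False
    return out


def read_skipping_bad_rows(columns, converters):
    """Convert columns independently, then drop rows that failed anywhere."""
    n_rows = len(next(iter(columns.values()), []))
    keep = [True] * n_rows  # the "second array": one flag per row
    converted = {
        name: convert_column(values, converters[name], keep)
        for name, values in columns.items()
    }
    # Compaction: apply the same mask to every column.
    compacted = {
        name: [v for v, k in zip(values, keep) if k]
        for name, values in converted.items()
    }
    skipped = [i for i, k in enumerate(keep) if not k]
    return compacted, skipped


raw = {"id": ["1", "x", "3"], "price": ["2.5", "3.0", "oops"]}
table, skipped = read_skipping_bad_rows(raw, {"id": int, "price": float})
```

Each column is still converted on its own; the only shared state is the mask. The skipped row indices (0-based here) fall out of the mask for free, which would cover the "report wrong rows at the end" part of the request.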

@asfimport

Antoine Pitrou / @pitrou:
Well, we don't want the price of creating the second array in the common case, so we would need two codepaths and a refactor to retry with the error-handling codepath when a conversion fails...
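
The two-codepath structure described here might look like the sketch below: a fast path that allocates no mask, and a retry through a tolerant path only when a conversion fails. All function names are hypothetical and the code is pure Python for illustration, not Arrow's design.

```python
def fast_convert(columns, converters):
    """Common case: no mask is allocated; the first bad value raises."""
    return {
        name: [converters[name](v) for v in values]
        for name, values in columns.items()
    }


def tolerant_convert(columns, converters):
    """Error-handling path: build a row mask, then convert surviving rows."""
    n_rows = len(next(iter(columns.values()), []))
    keep = [True] * n_rows
    for name, values in columns.items():
        for i, value in enumerate(values):
            try:
                converters[name](value)
            except ValueError:
                keep[i] = False
    return {
        name: [converters[name](v) for v, k in zip(values, keep) if k]
        for name, values in columns.items()
    }


def convert_with_retry(columns, converters):
    """Try the fast path first; fall back only when a conversion fails."""
    try:
        return fast_convert(columns, converters), "fast"
    except ValueError:
        return tolerant_convert(columns, converters), "tolerant"


conv = {"id": int, "price": float}
ok_table, ok_path = convert_with_retry({"id": ["1", "2"], "price": ["1.0", "2.0"]}, conv)
bad_table, bad_path = convert_with_retry({"id": ["1", "x"], "price": ["1.0", "2.0"]}, conv)
```

Clean input never pays for the mask, but input with errors is converted more than once, which is the refactoring and runtime cost the comment above points at.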
