[C++] CSV skip wrong rows #26306

asfimport · 2020-10-15T14:03:36Z

It would be helpful to add an option to ReadOptions that skips rows containing invalid data (e.g. a value whose type does not match the column type) and continues reading the following rows. The row numbers of the skipped rows could be reported at the end of processing.

This way I can deal with the wrongly formatted data or ignore it if I have a large load success rate and I don’t care about the exceptions.

Reporter: Maciej / @mskrzypkowski

Note: This issue was originally created as ARROW-10315. Please see the migration documentation for further details.

asfimport · 2020-10-15T16:55:12Z

Antoine Pitrou / @pitrou:
Skipping rows entirely will be difficult. We could add an option to emit nulls in that case, though. What do you think?
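To make the "emit nulls" idea concrete, here is a minimal standalone sketch. It does not use Arrow's converter; `ConvertColumnWithNulls` is a hypothetical helper, and `std::optional` stands in for a nullable array slot. The row count is preserved, so other columns are unaffected.

```cpp
#include <charconv>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper (not an Arrow API): converts one column of raw CSV
// cells to integers. A cell that fails to parse becomes a null
// (std::nullopt) instead of aborting the whole read.
std::vector<std::optional<long long>> ConvertColumnWithNulls(
    const std::vector<std::string>& cells) {
  std::vector<std::optional<long long>> out;
  out.reserve(cells.size());
  for (const std::string& cell : cells) {
    long long value = 0;
    const char* begin = cell.data();
    const char* end = begin + cell.size();
    auto [ptr, ec] = std::from_chars(begin, end, value);
    // Accept the cell only if the entire text was consumed without error.
    if (ec == std::errc() && ptr == end) {
      out.emplace_back(value);
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

One consequence of this approach is that a failed conversion becomes indistinguishable from a cell that was genuinely empty.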

asfimport · 2020-10-19T09:03:51Z

Maciej / @mskrzypkowski:
Emitting nulls wouldn't work for me. I may stick with checking the file myself before loading it with Arrow.

asfimport · 2020-10-21T18:47:34Z

Joris Van den Bossche / @jorisvandenbossche:
For reference (not that we should copy pandas.read_csv's full API though ;)), pandas.read_csv has an `error_bad_lines` option. By default it raises an error on a "bad line" (e.g. not enough values, or too many values), but it can be set to skip such lines instead.

@pitrou when you say that skipping is difficult, is this because if you encounter an error in the value for a certain column, the values are already appended to the builder for the previous columns?

asfimport · 2020-10-21T18:51:25Z

Antoine Pitrou / @pitrou:
Columns are converted independently from each other. So we would need a way to remove individual values from an array after the fact. Not only do we not have such a facility currently, but it wouldn't be costless at runtime either.

asfimport · 2020-10-21T18:52:32Z

Joris Van den Bossche / @jorisvandenbossche:
And a summary of some discussion related to this in pandas (an old issue of mine): pandas-dev/pandas#15122

In general it might be useful to add some options on how to deal with lines with a wrong number of elements (e.g. filling with nulls if there are too few, skipping the extra values if there are too many).
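As a rough illustration of the two policies mentioned above (pad short lines with nulls, drop surplus values), here is a small standalone sketch. `Row` and `NormalizeRow` are hypothetical names, not part of Arrow's parser, and the fields are assumed to be already split.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// A parsed line where a missing field is represented as a null.
using Row = std::vector<std::optional<std::string>>;

// Hypothetical helper: forces a line to exactly `num_columns` fields.
// Lines that are too short are padded with nulls; surplus fields on lines
// that are too long are dropped.
Row NormalizeRow(std::vector<std::string> fields, std::size_t num_columns) {
  Row row;
  row.reserve(num_columns);
  for (std::size_t i = 0; i < num_columns; ++i) {
    if (i < fields.size()) {
      row.emplace_back(std::move(fields[i]));
    } else {
      row.emplace_back(std::nullopt);
    }
  }
  return row;
}
```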

asfimport · 2020-10-21T18:55:45Z

Antoine Pitrou / @pitrou:
This issue specifically mentions "data type mismatch with column type", which is a different category of errors.

Dealing with lines with a wrong number of elements wouldn't be easier, though the difficulties would reside in the parser.

asfimport · 2020-10-22T05:46:52Z

Micah Kornfield / @emkornfield:

> Columns are converted independently from each other. So we would need a way to remove individual values from an array after the fact. Not only do we not have such a facility currently, but it wouldn't be costless at runtime either.

This seems like a very similar case to having to filter out rows. Would a viable solution be to optionally create a second array (i.e. non-nullable bool), and do the compaction afterwards? Not costless, but I would hope a code path like this would be optimized in the compute kernels?
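A minimal sketch of that mask-then-compact idea, using plain standard-library containers rather than Arrow arrays or compute kernels. All names here (`ConvertedColumn`, `RowsToKeep`, `Compact`, `SkippedRows`) are hypothetical. Each column is assumed to have been converted independently, leaving a placeholder value and a `false` flag wherever a cell failed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result of converting one column independently of the others.
struct ConvertedColumn {
  std::vector<long long> values;  // one slot per input row
  std::vector<bool> ok;           // ok[r] is false if row r failed to convert
};

// A row is kept only if it converted cleanly in every column.
std::vector<bool> RowsToKeep(const std::vector<ConvertedColumn>& columns,
                             std::size_t num_rows) {
  std::vector<bool> keep(num_rows, true);
  for (const ConvertedColumn& col : columns) {
    assert(col.ok.size() == num_rows);
    for (std::size_t r = 0; r < num_rows; ++r) {
      if (!col.ok[r]) keep[r] = false;
    }
  }
  return keep;
}

// Compaction step: copies only the kept rows of one column.
std::vector<long long> Compact(const std::vector<long long>& values,
                               const std::vector<bool>& keep) {
  assert(values.size() == keep.size());
  std::vector<long long> out;
  for (std::size_t r = 0; r < values.size(); ++r) {
    if (keep[r]) out.push_back(values[r]);
  }
  return out;
}

// Row indices that were dropped, so they can be reported after the read.
std::vector<std::size_t> SkippedRows(const std::vector<bool>& keep) {
  std::vector<std::size_t> skipped;
  for (std::size_t r = 0; r < keep.size(); ++r) {
    if (!keep[r]) skipped.push_back(r);
  }
  return skipped;
}
```

The combined mask also gives the list of bad row numbers that the original request asked for.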

asfimport · 2020-10-22T08:24:40Z

Antoine Pitrou / @pitrou:
Well, we don't want the price of creating the second array in the common case, so we would need two codepaths and a refactor to retry with the error-handling codepath when a conversion fails...
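The two-codepath structure could look roughly like the sketch below. It is a standalone illustration with hypothetical names, not Arrow's actual converter. The fast path allocates no mask and gives up at the first bad cell. Only then is the column converted again with error tracking.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Parses a whole cell as a base-10 integer; fails on any leftover text.
bool ParseInt(const std::string& cell, long long* out) {
  const char* begin = cell.data();
  const char* end = begin + cell.size();
  auto [ptr, ec] = std::from_chars(begin, end, *out);
  return ec == std::errc() && ptr == end;
}

struct MaskedColumn {
  std::vector<long long> values;
  std::vector<bool> ok;  // left empty when every row converted cleanly
};

// Fast path: no mask is allocated. Returns nullopt at the first bad cell.
std::optional<std::vector<long long>> ConvertFast(
    const std::vector<std::string>& cells) {
  std::vector<long long> values(cells.size());
  for (std::size_t i = 0; i < cells.size(); ++i) {
    if (!ParseInt(cells[i], &values[i])) return std::nullopt;
  }
  return values;
}

// Slow path: converts every cell and records which rows failed.
MaskedColumn ConvertTracking(const std::vector<std::string>& cells) {
  MaskedColumn col{std::vector<long long>(cells.size(), 0),
                   std::vector<bool>(cells.size(), true)};
  for (std::size_t i = 0; i < cells.size(); ++i) {
    if (!ParseInt(cells[i], &col.values[i])) {
      col.values[i] = 0;  // placeholder for a row to be dropped later
      col.ok[i] = false;
    }
  }
  return col;
}

// Tries the cheap path first and retries with tracking only on failure.
MaskedColumn ConvertWithFallback(const std::vector<std::string>& cells) {
  if (auto fast = ConvertFast(cells)) {
    return MaskedColumn{std::move(*fast), {}};
  }
  return ConvertTracking(cells);
}
```

In this sketch a clean file pays nothing for the mask, while a file with any bad cell pays for a second pass over the column.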

asfimport closed this as completed Dec 8, 2020
