ARROW-12673: [C++] Add parser handler for incorrect column counts #10202
Conversation
Again, thanks for the contribution. I have just a few thoughts. CC @pitrou to take a look when he gets back.
cpp/src/arrow/csv/parser.cc
Status PushField(const std::string& field) {
  if (field.length() > extra_allocated_) {
    // just in case this happens more, allocate enough for 10x this amount
    auto to_allocate = static_cast<uint32_t>(
        std::max(field.length() * 10, static_cast<std::string::size_type>(128)));
One bonus of preventing user-supplied handlers is that you could change this to PushFields(const std::string&, int) and remove the need to guess how much space to allocate.
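As a rough sketch of the batched alternative suggested here (the `FieldBuffer` class and member names are illustrative stand-ins, not Arrow's actual parser internals): a `PushFields(field, count)` knows the total size up front, so it can reserve exactly once instead of guessing a 10x over-allocation per field.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a batched field buffer. The real parser's storage
// layout differs; this only illustrates the one-shot reservation idea.
class FieldBuffer {
 public:
  void PushFields(const std::string& field, int count) {
    // Reserve exactly what the extra fields need, in one allocation.
    data_.reserve(data_.size() + field.size() * static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
      offsets_.push_back(static_cast<uint32_t>(data_.size()));
      data_ += field;
    }
  }
  std::size_t num_fields() const { return offsets_.size(); }
  std::size_t data_size() const { return data_.size(); }

 private:
  std::string data_;
  std::vector<uint32_t> offsets_;
};
```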
I have a bit of a confession. My original intent was to write a way to handle rows with an incorrect number of columns that would not add nulls or truncate the rows, but instead record them in a custom handler. I actually just piggybacked on this issue since it was a subset of what I wanted to implement.
That being said, I would strongly prefer to keep the custom handlers in the API.
That being said, I can easily add an AddFields optimization which just allocates enough for all the additional fields in a row, under the assumption that this case is rare enough that it won't be called often, without attempting to allocate for the future.
> My original intent was to write a way to handle rows with incorrect number of columns and not add nulls or truncate the rows but instead record them in a custom handler.
That's fine, you're welcome to whatever intent :). Can you create a JIRA or add a comment to the other JIRA describing your needs? That will help others in evaluating the feature.
> With that being said I would strongly like to be able to keep the custom handlers in the API.
Arrow doesn't do a lot of "calling out" today but that might just be happenstance. I'll let others more knowledgeable than me chime in on the subject.
@@ -56,6 +62,8 @@ struct ARROW_EXPORT ParseOptions {
  /// Whether empty lines are ignored. If false, an empty line represents
  /// a single empty value (assuming a one-column CSV file).
  bool ignore_empty_lines = true;
  /// A handler function for rows which do not have the correct number of columns
  util::optional<InvalidRowHandler> invalid_row_handler;
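For illustration, configuring such an option might look like the sketch below. The handler signature, enum, and `ParseOptionsSketch` struct are hypothetical stand-ins for the types under review, not the actual diff:

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical disposition for a row with a bad column count.
enum class RowDisposition { kError, kSkip, kFillNulls };

// Assumed handler shape: expected count, actual count, and the raw row text.
using InvalidRowHandler =
    std::function<RowDisposition(int expected, int actual, const std::string& row)>;

// Stand-in for the ParseOptions struct in the diff above.
struct ParseOptionsSketch {
  bool ignore_empty_lines = true;
  std::optional<InvalidRowHandler> invalid_row_handler;
};

inline ParseOptionsSketch MakeOptions() {
  ParseOptionsSketch opts;
  // Example policy: skip rows with too few columns, error on rows with too many.
  opts.invalid_row_handler = [](int expected, int actual, const std::string&) {
    return actual < expected ? RowDisposition::kSkip : RowDisposition::kError;
  };
  return opts;
}
```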
My vote would be that this is too much complexity. I think simply backfilling nulls (in the InvalidRowHandler::Force style) or skipping rows would be sufficient. Maybe an InvalidRowHandler is still used under the hood to accommodate the two approaches, but the interface to the user should be simpler (an enum or bools).
In the rare event that someone wanted to do something custom (e.g. a unique value for each column) they could fill nulls and then in a later compute step replace null with a default value.
cpp/src/arrow/csv/parser.cc
char s[50];
snprintf(s, sizeof(s), "Expected %d columns, got %d", expected, actual);
return ParseError(s);
Status MismatchingColumns(int32_t expected, int32_t actual, const std::string& row) {
The row could be very long. If we want to provide more context, is there any way we can put the row index in here instead? Given that we only allow \n or \r\n as line breaks, that should make it pretty easy to locate the offending row via row index (i.e. row index == line number).
Yes, a row index or line number (which are not always the same, because in theory newlines can appear inside a value) would be nice, but there didn't seem to be anything at this level that tracks overall parser location. That would also be good for the convert layer, so that if a conversion error occurs, not just the column number but the line/row number could be reported too.
Another option could be to just truncate the row if it is too long. Adding location tracking to put line numbers in errors could be a standalone JIRA / future PR as well.
I'll add truncation with an ellipsis if the row is over, say, 100 characters, and open a JIRA ticket for tracking the absolute line and row numbers during CSV parsing so they can be included in error messages.
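A minimal sketch of that truncation helper; the function name is illustrative and the 100-character cap is just the value proposed above:

```cpp
#include <string>

// Cap the row excerpt included in an error message, appending an ellipsis
// when the row is longer than the limit.
inline std::string TruncateRow(const std::string& row, std::size_t limit = 100) {
  if (row.size() <= limit) return row;
  return row.substr(0, limit) + "...";
}
```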
…h error Add the line which has the incorrect column count to the output so it is easier to identify in large inputs. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Add a function to the parser configuration which allows a custom handler for when there is a row which does not match the expected column count. This also adds an interface and implementation for modifying the current row when this occurs allowing fields to be added or removed. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Add a few simple handlers: one which skips all rows with a bad count, one which adds null fields for missing columns, and one which forces the correct count by adding null fields or removing extra fields. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Like @westonpace, it seems to me that this is adding a lot of complexity. I would be OK with an option to treat missing columns as null.
Is the complexity you don't like the callback function, or the ability to add a field with any value and not just null?
The callback is my main concern. I think we can add a "missing" bit to ParsedValueDesc. Hopefully that wouldn't slow down the conversion of non-missing values.
If a bit is added to ParsedValueDesc, would it be taken away from the offset to try to keep the struct at 32 bits? Also, this proposed change would only resolve ARROW-12001, not the original intent, which was essentially to let the caller determine how rows with incorrect column counts should be handled. At a minimum, I would like to be able to skip such rows and collect information about them, so it could be presented to a user saying when and where malformed rows were found. The fill-in-nulls and truncate-rows handlers were just simple additions to be able to process those rows if that was desired.
Yes.
I agree that being able to point to the row number where an error occurred is useful, but we shouldn't need a callback for that.
For that alone, no. But when you start to think about the combinations of ways these rows could be handled, it gets very complex: for both short rows and long rows you could either error, skip, or fix, and if you don't error, you need to decide whether to report the row or handle it silently. Describing that requires five options for both short and long rows, plus a way to express any combination of them. The distinction between a silent skip and a reported skip matters because currently the best way to report a row is to include its entire text, and if a good number of rows need reporting, that could add noticeable overhead when the caller just wants silent handling. Because of this, I thought it would be easier to expose a callback with some predefined simple implementations; more complex options could then be implemented by the user. If we wanted to avoid the callback and still support that matrix of options, the best way might be two enums, one for short rows and one for long rows, plus a mechanism to track the rows to be reported.
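The two-enum alternative described here might be shaped roughly like this; all names are hypothetical, and this is only one way to encode the short-row/long-row matrix being discussed:

```cpp
// One action per mismatch direction, matching the ERROR / SKIP / FIX
// vocabulary used in this thread.
enum class BadRowAction { kError, kSkip, kFix };

// Hypothetical options struct: independent handling for rows with too few
// columns ("short") and too many columns ("long"), plus a flag distinguishing
// silent skips from reported ones.
struct BadRowOptions {
  BadRowAction short_rows = BadRowAction::kError;
  BadRowAction long_rows = BadRowAction::kError;
  bool report_skipped = true;  // silent vs. reported handling
};
```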
My main issue is that the callback adds a significant amount of complication to the parser implementation, and it's not obvious that it's desirable to constrain ourselves to supporting this functionality in the future.
I am not sure where that leaves this PR. It seems like the only functionality you would like is adding missing values to rows with too few columns. In your opinion, would it be reasonable to add two enums to the parser options, one for handling too few columns and one for too many, with the values ERROR, SKIP and FIX? ERROR would be the current behaviour, SKIP would exclude the row but add it to a tracker of problematic rows, and FIX would either append or truncate values and add the row to the list of problematic rows.
Basically, that seems to be the most reasonable user request.
Well, I'm not sure I understand in which situation skipping would be the right answer. Can you explain a bit more?
I can give you my use case. I have users who upload CSV files to be analyzed. Currently, if the CSV is malformed, an error message is returned immediately on the first error. I would like the behavior to be that on the first parsing pass any bad rows are skipped but tracked: if no bad rows are found, success; otherwise the user can be notified that not all data could be parsed and told which rows had an issue. The user can then decide how to handle the situation, i.e. upload a fixed CSV, keep ignoring the lines, or reparse with the parser "fixing" them. This behavior is similar to the pandas.read_csv options error_bad_lines and warn_bad_lines.
Does this use case make sense? Does it seem like something you want to support?
It makes sense, but I'm not sure we want to support it. Basically, I would like to see if it can be implemented with minimal complication at the heart of the CSV parser internals. In particular, I don't want the handler to be able to modify any data; it should only be allowed to return a decision such as skip or error. Sidenote: if the CSV parser loses sync (for example because of a misquoted CSV cell), you may also have many "invalid" rows.
By modify the row, are you specifically referring to the ability to remove columns from the row, or do you mean that instead of passing the RowModifier into the callback, the callback would return an enum indicating the action to take? Does this mean you no longer mind the idea of a callback, you just want to limit its abilities?
Yes, that could be a problem, but it could be reduced by capping the number of errors to report, or by a threshold such that after so many bad rows an error is generated and parsing stops. If there is a callback, that would be up to the callback implementer.
Should I move these design discussions out of this PR and to the dev mailing list, or just keep them here for continuity?
Sorry for the delay. Before answering in more detail, what kind of information does the callback need? Just a
The current callback as written is passed an object, RowModifier, which it can use to manipulate the current line in a limited manner; but if I were to go with a callback which returned an enum indicating the action to take, I would pass a
That would be fine, under the condition that the only possible actions are "skip" and "error". Fixing the row on-the-fly would produce too much coupling between the parser and its consumers, IMHO.
That wouldn't solve the issue mentioned by ARROW-12001, which requests adding nulls to fill out the line. If you do not wish to address that ticket, then the API would be a callback which takes a row number, the number of expected columns, the number of actual columns, and
That would be OK with me, but the row number would not be available for parallel reading, right?
Hmm, an interesting point. The parallel implementation would indeed struggle to know the correct row number. A byte offset to the start of the row would be possible, however.
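The byte-offset idea works in parallel because each block already knows its starting position in the file, so an absolute position for a bad row is cheap even when the row number is not. A trivial sketch (names hypothetical, not Arrow's internals):

```cpp
#include <cstdint>

// Absolute file position of a row: the block's starting byte offset plus the
// row's offset within that block. Unlike a row number, this needs no
// knowledge of how many lines preceding blocks contain.
inline int64_t AbsoluteRowOffset(int64_t block_start, int64_t offset_in_block) {
  return block_start + offset_in_block;
}
```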
Updating the reporting to use the row number where available, plus the byte offset of the start of the row, would be useful for the parallel parsers, which cannot easily know the row number. If I were to do that, it seems like it should be a separate PR done before this one, on top of/after ARROW-11889.
I started to play with this and was able to do it easily for the parser, but I ran into trouble adding the byte offset information for conversion: that would require at a minimum tracking each row's offset from the start of the block, if not each value's offset for more accuracy. Is adding this overhead acceptable for error reporting?
I'm not sure there is an obvious way to solve this problem in parallel. The parser will start parsing block X before block X-1 has finished parsing, and the parser input (CSVBlock) doesn't know how many lines it has; that is not discovered until parsing time. So, for example, the parser for block 2 might realize there is an error on the third line of block 2, but without knowing how many lines are in block 1 (which may not have finished parsing) it is hard to say what the line number of that error is. You could do a serial pass prior to parsing that just figures out how many lines are in each block, but I suspect that would be too much overhead. You could instead delay reporting an error on block X until blocks 1 to (X-1) have finished parsing, so the line number is known. That would probably be the solution I would take if I needed to do this; however, I don't know off-hand how to implement that delay in a low-complexity way.
Sorry, I don't think I was clear in my comment. I was able to get reporting of the byte offset of the row on which the parser error occurred working, but ran into the issue of getting the offset during conversion. This was to be able to give some location feedback even for parallel parsing, where the row number is difficult, as you mentioned.
I agree this wouldn't be worth it.
That could work, but I am not sure it is worth the complexity.
If the overhead is paid only in case of error, then it would be OK. I don't think it would be a great idea to add more than ~3% overhead to the successful case, though.
Unfortunately, the overhead is needed before you know if there is a failure. You can look at #10662 for an implementation of this.
Starting a new PR with the updated implementation.
Add a new callback handler to the CSV parser which is invoked when the column count in a row does not match the expected count. The callback can return an enum which indicates whether the row should result in an error or be skipped. Replaces #10202 Closes #10790 from n3world/ARROW-12673-Invalid_column_count Lead-authored-by: Nate Clark <nate@neworld.us> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Add a new callback to ParseOptions which is invoked when a row does not have the correct number of columns. Using the RowModifier interface, the callback can skip the row or add or remove fields.
Some basic handlers were added: one which skips all invalid rows, one which adds a null value for missing columns, and one which forces the row to the correct number of columns by adding null values or removing columns at the end of the row.
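A sketch of the skip-and-collect usage discussed in this thread, using local stand-in types that mirror the shape of the merged callback (see cpp/src/arrow/csv/options.h for the real definitions; the names here are taken from this PR's description and may not match the released headers exactly):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the result enum: either fail parsing or drop the row.
enum class InvalidRowResult { kError, kSkip };

// Stand-in for the information handed to the callback.
struct InvalidRow {
  int32_t expected_columns;
  int32_t actual_columns;
  int64_t number;    // row number; may be unavailable under parallel parsing
  std::string text;  // the offending row's contents
};

using InvalidRowHandler = std::function<InvalidRowResult(const InvalidRow&)>;

// Example handler for the use case above: skip bad rows but remember them so
// the caller can report every problem to the user at the end.
struct SkipAndCollect {
  std::vector<InvalidRow> skipped;
  InvalidRowResult operator()(const InvalidRow& row) {
    skipped.push_back(row);
    return InvalidRowResult::kSkip;
  }
};
```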