ARROW-12673: [C++] Add parser handler for incorrect column counts #10202
Conversation
Again, thanks for the contribution. I have just a few thoughts. CC @pitrou to take a look when he gets back.
cpp/src/arrow/csv/parser.cc
Status PushField(const std::string& field) {
  if (field.length() > extra_allocated_) {
    // just in case this happens more, allocate enough for 10x this amount
    auto to_allocate = static_cast<uint32_t>(
        std::max(field.length() * 10, static_cast<std::string::size_type>(128)));
One bonus of preventing user-supplied handlers is that you could change this to PushFields(const std::string&, int) and remove the need to guess how much space to allocate.
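As a rough sketch of the batched alternative suggested here (the `FieldBuffer` class and member names are illustrative stand-ins, not Arrow's actual parser internals): a `PushFields(field, count)` knows the total size up front, so it can reserve exactly once instead of guessing a 10x over-allocation per field.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a batched field buffer. The real parser's storage
// layout differs; this only illustrates the one-shot reservation idea.
class FieldBuffer {
 public:
  void PushFields(const std::string& field, int count) {
    // Reserve exactly what the extra fields need, in one allocation.
    data_.reserve(data_.size() + field.size() * static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
      offsets_.push_back(static_cast<uint32_t>(data_.size()));
      data_ += field;
    }
  }
  std::size_t num_fields() const { return offsets_.size(); }
  std::size_t data_size() const { return data_.size(); }

 private:
  std::string data_;
  std::vector<uint32_t> offsets_;
};
```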
I have a bit of a confession. My original intent was to write a way to handle rows with an incorrect number of columns that would not add nulls or truncate the rows, but instead record them in a custom handler. I actually just piggybacked on this issue since it was a subset of what I wanted to implement.
That being said, I would strongly prefer to keep the custom handlers in the API.
That being said, I can easily add an AddFields optimization which just allocates enough for all the additional fields in a row, under the assumption that this case is rare enough that it won't be called often, without attempting to allocate for the future.
> My original intent was to write a way to handle rows with incorrect number of columns and not add nulls or truncate the rows but instead record them in a custom handler.
That's fine, you're welcome to whatever intent :). Can you create a JIRA or add a comment to the other JIRA describing your needs? That will help others in evaluating the feature.
> With that being said I would strongly like to be able to keep the custom handlers in the API.
Arrow doesn't do a lot of "calling out" today but that might just be happenstance. I'll let others more knowledgeable than me chime in on the subject.
@@ -56,6 +62,8 @@ struct ARROW_EXPORT ParseOptions {
  /// Whether empty lines are ignored. If false, an empty line represents
  /// a single empty value (assuming a one-column CSV file).
  bool ignore_empty_lines = true;
  /// A handler function for rows which do not have the correct number of columns
  util::optional<InvalidRowHandler> invalid_row_handler;
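For illustration, configuring such an option might look like the sketch below. The handler signature, enum, and `ParseOptionsSketch` struct are hypothetical stand-ins for the types under review, not the actual diff:

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical disposition for a row with a bad column count.
enum class RowDisposition { kError, kSkip, kFillNulls };

// Assumed handler shape: expected count, actual count, and the raw row text.
using InvalidRowHandler =
    std::function<RowDisposition(int expected, int actual, const std::string& row)>;

// Stand-in for the ParseOptions struct in the diff above.
struct ParseOptionsSketch {
  bool ignore_empty_lines = true;
  std::optional<InvalidRowHandler> invalid_row_handler;
};

inline ParseOptionsSketch MakeOptions() {
  ParseOptionsSketch opts;
  // Example policy: skip rows with too few columns, error on rows with too many.
  opts.invalid_row_handler = [](int expected, int actual, const std::string&) {
    return actual < expected ? RowDisposition::kSkip : RowDisposition::kError;
  };
  return opts;
}
```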
My vote would be that this is too much complexity. I think simply backfilling nulls (in the InvalidRowHandler::Force style) or skipping rows would be sufficient. Maybe an InvalidRowHandler is still used under the hood to accommodate the two approaches, but the interface to the user should be simpler (an enum or bools).
In the rare event that someone wanted to do something custom (e.g. a unique value for each column) they could fill nulls and then in a later compute step replace null with a default value.
cpp/src/arrow/csv/parser.cc
char s[50];
snprintf(s, sizeof(s), "Expected %d columns, got %d", expected, actual);
return ParseError(s);
Status MismatchingColumns(int32_t expected, int32_t actual, const std::string& row) {
The row could be very long. If we want to provide more context, is there any way we can put the row index in here instead? Given that we only allow \n or \r\n as line breaks, that should make it pretty easy to locate the offending row via row index (i.e. row index == line number).
Yes, a row index or line number (which are not always the same, because in theory newlines can appear inside a value) would be nice, but there didn't seem to be anything at this level that tracks overall parser location. That would also be good for the convert layer, so that if a conversion error occurs, not just the column number but the line/row number could be reported too.
Another option could be to just truncate the row if it is too long. Adding location tracking to put line numbers in errors could be a standalone JIRA / future PR as well.
I'll add truncation with an ellipsis if the row is over, say, 100 characters, and open a JIRA ticket for tracking the absolute line and row numbers during CSV parsing so they can be included in error messages.
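A minimal sketch of that truncation helper; the function name is illustrative and the 100-character cap is just the value proposed above:

```cpp
#include <string>

// Cap the row excerpt included in an error message, appending an ellipsis
// when the row is longer than the limit.
inline std::string TruncateRow(const std::string& row, std::size_t limit = 100) {
  if (row.size() <= limit) return row;
  return row.substr(0, limit) + "...";
}
```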
…h error Add the line which has the incorrect column count to the output so it is easier to identify in large inputs. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Add a function to the parser configuration which allows a custom handler for when there is a row which does not match the expected column count. This also adds an interface and implementation for modifying the current row when this occurs allowing fields to be added or removed. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Add a few simple handlers: one which skips all rows with a bad count, one which adds null fields for missing columns, and one which forces the correct count by adding null fields or removing extra fields. Authored-by: Nate Clark <nate@neworld.us> Signed-off-by: Nate Clark <nate@neworld.us>
Like @westonpace, it seems to me that this is adding a lot of complexity. I would be OK with an option to treat missing columns as null.
Is the complexity you don't like the callback function, or the ability to add a field with any value and not just null?
The callback is my main concern. I think we can add a "missing" bit to ParsedValueDesc. Hopefully that wouldn't slow down the conversion of non-missing values.
If a bit is added to ParsedValueDesc, would it be taken away from the offset to try to keep the struct at 32 bits? Also, this proposed change would only resolve ARROW-12001, not the original intent, which was essentially to let the caller determine how rows with incorrect column counts should be handled. At a minimum, I would like to be able to skip such rows and collect information about them, so it could be presented to a user saying when and where malformed rows were found. The fill-in-nulls and truncate-rows handlers were just simple additions to be able to process those rows if that was desired.
Yes.
I agree that being able to point to the row number where an error occurred is useful, but we shouldn't need a callback for that.
For that alone, no. But when you start to think about the combinations of ways these rows could be handled, it gets very complex: for both short rows and long rows you could either error, skip, or fix, and if you don't error, you need to decide whether to report the row or handle it silently. Describing that requires five options for both short and long rows, plus a way to express any combination of them. The distinction between a silent skip and a reported skip matters because currently the best way to report a row is to include its entire text, and if a good number of rows need reporting, that could add noticeable overhead when the caller just wants silent handling. Because of this, I thought it would be easier to expose a callback with some predefined simple implementations; more complex options could then be implemented by the user. If we wanted to avoid the callback and still support that matrix of options, the best way might be two enums, one for short rows and one for long rows, plus a mechanism to track the rows to be reported.
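The two-enum alternative described here might be shaped roughly like this; all names are hypothetical, and this is only one way to encode the short-row/long-row matrix being discussed:

```cpp
// One action per mismatch direction, matching the ERROR / SKIP / FIX
// vocabulary used in this thread.
enum class BadRowAction { kError, kSkip, kFix };

// Hypothetical options struct: independent handling for rows with too few
// columns ("short") and too many columns ("long"), plus a flag distinguishing
// silent skips from reported ones.
struct BadRowOptions {
  BadRowAction short_rows = BadRowAction::kError;
  BadRowAction long_rows = BadRowAction::kError;
  bool report_skipped = true;  // silent vs. reported handling
};
```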
My main issue is that the callback adds a significant amount of complication to the parser implementation, and it's not obvious that it's desirable to constrain ourselves to supporting this functionality in the future.
I am not sure where that leaves this PR. It seems like the only functionality you would like is adding missing values to rows with too few columns. In your opinion, would it be reasonable to add two enums to the parser options, one for handling too few columns and one for too many, with the values ERROR, SKIP and FIX? ERROR would be the current behaviour, SKIP would exclude the row but add it to a tracker of problematic rows, and FIX would either append or truncate values and add the row to the list of problematic rows.
Basically, that seems to be the most reasonable user request.
Well, I'm not sure I understand in which situation skipping would be the right answer. Can you explain a bit more?
I can give you my use case. I have users who upload CSV files to be analyzed. Currently, if the CSV is malformed, an error message is returned immediately on the first error. I would like the behavior to be that on the first parsing pass any bad rows are skipped but tracked: if no bad rows are found, success; otherwise the user can be notified that not all data could be parsed and told which rows had an issue. The user can then decide how to handle the situation, i.e. upload a fixed CSV, keep ignoring the lines, or reparse with the parser "fixing" them. This behavior is similar to the pandas.read_csv options error_bad_lines and warn_bad_lines.
Does this use case make sense? Does it seem like something you want to support?
It makes sense, but I'm not sure we want to support it. Basically, I would like to see if it can be implemented with minimal complication at the heart of the CSV parser internals. In particular, I don't want the handler to be able to modify any data; it should only be allowed to return a decision such as skip or error. Sidenote: if the CSV parser loses sync (for example because of a misquoted CSV cell), you may also have many "invalid" rows.
By modify the row, are you specifically referring to the ability to remove columns from the row, or do you mean that instead of passing the RowModifier into the callback, the callback would return an enum indicating the action to take? Does this mean you no longer mind the idea of a callback, you just want to limit its abilities?
Yes, that could be a problem, but it could be reduced by capping the number of errors to report, or by a threshold such that after so many bad rows an error is generated and parsing stops. If there is a callback, that would be up to the callback implementer.
Should I move these design discussions out of this PR and to the dev mailing list, or just keep them here for continuity?
Sorry for the delay. Before answering in more detail, what kind of information does the callback need? Just a
The current callback as written is passed an object, RowModifier, which it can use to manipulate the current line in a limited manner; but if I were to go with a callback which returned an enum indicating the action to take, I would pass a
That would be fine, under the condition that the only possible actions are "skip" and "error". Fixing the row on-the-fly would produce too much coupling between the parser and its consumers, IMHO.
That wouldn't solve the issue mentioned by ARROW-12001, which requests adding nulls to fill out the line. If you do not wish to address that ticket, then the API would be a callback which takes a row number, the number of expected columns, the number of actual columns, and
That would be OK with me, but the row number would not be available for parallel reading, right?
Hmm, an interesting point. The parallel implementation would indeed struggle to know the correct row number. A byte offset to the start of the row would be possible, however.
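The byte-offset idea works in parallel because each block already knows its starting position in the file, so an absolute position for a bad row is cheap even when the row number is not. A trivial sketch (names hypothetical, not Arrow's internals):

```cpp
#include <cstdint>

// Absolute file position of a row: the block's starting byte offset plus the
// row's offset within that block. Unlike a row number, this needs no
// knowledge of how many lines preceding blocks contain.
inline int64_t AbsoluteRowOffset(int64_t block_start, int64_t offset_in_block) {
  return block_start + offset_in_block;
}
```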
Updating the reporting to use the row number where available, plus the byte offset of the start of the row, would be useful for the parallel parsers, which cannot easily know the row number. If I were to do that, it seems like it should be a separate PR done before this one, on top of/after ARROW-11889.
I started to play with this and was able to do it easily for the parser, but I ran into trouble adding the byte offset information for conversion: that would require at a minimum tracking each row's offset from the start of the block, if not each value's offset for more accuracy. Is adding this overhead acceptable for error reporting?
I'm not sure there is an obvious way to solve this problem in parallel. The parser will start parsing block X before block X-1 has finished parsing, and the parser input (CSVBlock) doesn't know how many lines it has; that is not discovered until parsing time. So, for example, the parser for block 2 might realize there is an error on the third line of block 2, but without knowing how many lines are in block 1 (which may not have finished parsing) it is hard to say what the line number of that error is. You could do a serial pass prior to parsing that just figures out how many lines are in each block, but I suspect that would be too much overhead. You could instead delay reporting an error on block X until blocks 1 to (X-1) have finished parsing, so the line number is known. That would probably be the solution I would take if I needed to do this; however, I don't know off-hand how to implement that delay in a low-complexity way.
Sorry, I don't think I was clear in my comment. I was able to get reporting of the byte offset of the row on which the parser error occurred working, but ran into the issue of getting the offset during conversion. This was to be able to give some location feedback even for parallel parsing, where the row number is difficult, as you mentioned.
I agree this wouldn't be worth it.
That could work, but I am not sure it is worth the complexity.
If the overhead is paid only in case of error, then it would be OK. I don't think it would be a great idea to add more than ~3% overhead to the successful case, though.
Unfortunately, the overhead is needed before you know if there is a failure. You can look at #10662 for an implementation of this.
Starting a new PR with the updated implementation.
Add a new callback handler to the CSV parser which is invoked when the column count in a row does not match the expected count. The callback can return an enum which indicates whether the row should result in an error or be skipped. Replaces #10202 Closes #10790 from n3world/ARROW-12673-Invalid_column_count Lead-authored-by: Nate Clark <nate@neworld.us> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
Add a new callback to ParseOptions which is invoked when a row does not have the correct number of columns. Using the RowModifier interface, the callback can skip the row or add or remove fields.
Some basic handlers were added: one which skips all invalid rows, one which adds a null value for missing columns, and one which forces the row to the correct number of columns by adding null values or removing columns at the end of the row.
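A sketch of the skip-and-collect usage discussed in this thread, using local stand-in types that mirror the shape of the merged callback (see cpp/src/arrow/csv/options.h for the real definitions; the names here are taken from this PR's description and may not match the released headers exactly):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Stand-in for the result enum: either fail parsing or drop the row.
enum class InvalidRowResult { kError, kSkip };

// Stand-in for the information handed to the callback.
struct InvalidRow {
  int32_t expected_columns;
  int32_t actual_columns;
  int64_t number;    // row number; may be unavailable under parallel parsing
  std::string text;  // the offending row's contents
};

using InvalidRowHandler = std::function<InvalidRowResult(const InvalidRow&)>;

// Example handler for the use case above: skip bad rows but remember them so
// the caller can report every problem to the user at the end.
struct SkipAndCollect {
  std::vector<InvalidRow> skipped;
  InvalidRowResult operator()(const InvalidRow& row) {
    skipped.push_back(row);
    return InvalidRowResult::kSkip;
  }
};
```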