ORC-697: Improve scan tool to report the location of corruption. #582
long badBatches = 0;
long currentRow = 0;
long goodRows = 0;
try (RecordReader rows = reader.rows()) {
Can we handle the corrupted file more gracefully? This looks like a deliberate design choice, but it may be considered a regression in the following case.
BEFORE
$ orc-tools scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to dump data for file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc
AFTER (this PR)
$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to open file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc
java.lang.IndexOutOfBoundsException: Index: 0
at java.util.Collections$EmptyList.get(Collections.java:4456)
at org.apache.orc.OrcProto$StripeFooter.getColumns(OrcProto.java:14080)
at org.apache.orc.impl.reader.StripePlanner.buildEncodings(StripePlanner.java:224)
at org.apache.orc.impl.reader.StripePlanner.parseStripe(StripePlanner.java:126)
at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1168)
at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1203)
at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:268)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:841)
at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:835)
at org.apache.orc.tools.ScanData.main(ScanData.java:171)
at org.apache.orc.tools.Driver.main(Driver.java:126)
In this case, instead of showing a raw exception like IndexOutOfBoundsException directly, can we report a corrupted footer as a general message?
Yes, although the scan tool doesn't have information about where the problem happened when it is in the footer. We'd need to add better exceptions in ReaderImpl, which is a good idea.
Do you want me to hide the actual exceptions behind a "-v" option?
Oh, -v sounds much better to me. Thanks!
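One way to picture the "-v" idea discussed above is a small reporter that always prints the one-line summary and adds the stack trace only in verbose mode. This is a minimal sketch, not the code in the patch: the class name `ScanErrorReporter`, its constructor, and the `report` method are hypothetical names introduced for illustration, and how the scan tool actually parses and applies "-v" is not shown in this thread.

```java
import java.io.PrintStream;

/** Hypothetical helper for illustration; not part of orc-tools. */
final class ScanErrorReporter {
  private final boolean verbose;
  private final PrintStream out;

  ScanErrorReporter(boolean verbose, PrintStream out) {
    this.verbose = verbose;
    this.out = out;
  }

  /** Prints the summary line; appends the stack trace only when verbose. */
  void report(String summary, Throwable cause) {
    out.println(summary);
    if (verbose) {
      cause.printStackTrace(out);
    }
  }
}
```

With `verbose` set to false, a corrupt footer would produce only a line such as "Unable to open file: ..."; with it set to true, the trace follows the summary.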
rows.nextBatch(batch);
}
} catch (Throwable t) {
System.out.printf("Column %d failed at row %d%n", column.getId(),
For this one, can we show the column name together?
$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/missing_length_stream_in_string_dict.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
at org.apache.orc.tools.ScanData.main(ScanData.java:175)
at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: ../examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
at org.apache.orc.tools.ScanData.main(ScanData.java:188)
at org.apache.orc.tools.Driver.main(Driver.java:126)
Actually, if you pass "-s" to the tool, it will print out the entire schema as JSON.
It looks like:
Processing data file examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
{"category": "struct", "id": 0, "max": 11, "fields": [
{ "id": {"category": "int", "id": 1, "max": 1}},
{ "bool_col": {"category": "boolean", "id": 2, "max": 2}},
{ "tinyint_col": {"category": "tinyint", "id": 3, "max": 3}},
{ "smallint_col": {"category": "smallint", "id": 4, "max": 4}},
{ "int_col": {"category": "int", "id": 5, "max": 5}},
{ "bigint_col": {"category": "bigint", "id": 6, "max": 6}},
{ "float_col": {"category": "float", "id": 7, "max": 7}},
{ "double_col": {"category": "double", "id": 8, "max": 8}},
{ "date_string_col": {"category": "string", "id": 9, "max": 9}},
{ "string_col": {"category": "string", "id": 10, "max": 10}},
{ "timestamp_col": {"category": "timestamp", "id": 11, "max": 11}}]}
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
at org.apache.orc.tools.ScanData.main(ScanData.java:175)
at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
at org.apache.orc.tools.ScanData.main(ScanData.java:188)
at org.apache.orc.tools.Driver.main(Driver.java:126)
I'll update the patch to remove the final exception, which happens because it is trying to seek past the end of the file.
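The final exception in the log above appears when the recovery row (300) equals the end of the last stripe, so the seek lands past the last readable row. A guard along the following lines would avoid it. This is only a sketch under stated assumptions: `RecoverySeek`, `RowSeeker`, and `seekIfInRange` are hypothetical names, the interface is a simplified stand-in for a record reader's seek method, and the actual fix pushed to the PR may be structured differently.

```java
/** Hypothetical helper for illustration; not the code from the patch. */
final class RecoverySeek {
  /** Simplified stand-in for the seek operation of a record reader. */
  interface RowSeeker {
    void seekToRow(long row);
  }

  /**
   * Seeks to recoveryRow only if it is a valid row index.
   * Returns false, without seeking, when the row is at or past the end.
   */
  static boolean seekIfInRange(RowSeeker rows, long recoveryRow, long totalRows) {
    if (recoveryRow < 0 || recoveryRow >= totalRows) {
      return false;
    }
    rows.seekToRow(recoveryRow);
    return true;
  }
}
```

For the file in the log, a recovery row of 300 in a 300-row file would make the guard return false, and the scan would end instead of throwing.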
Thanks!
This feature looks helpful. Thank you, @omalley.
Ok, I pushed an update:
+1, LGTM. Thank you for the explanation and updates, @omalley.
The AS-IS PR also looks good to me. Merged to master.
What changes were proposed in this pull request?
This PR updates the scan tool to print information about where the file is corrupted. It
Why are the changes needed?
It helps diagnose where (row & column) an ORC file is corrupted.
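The column-level diagnosis seen in the review output ("Column 9 failed at row 0") can be thought of as retrying the failed batch one column at a time and recording which columns throw. The sketch below illustrates that idea only; `CorruptionScan`, `ColumnReader`, and `failingColumns` are hypothetical names, the interface is a stand-in for reading a single column from an ORC file, and it catches `RuntimeException` where the diff context in the PR catches `Throwable`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of per-column isolation; not the code from the patch. */
final class CorruptionScan {
  /** Stand-in for reading one column of the batch that starts at startRow. */
  interface ColumnReader {
    void readColumn(int columnId, long startRow);
  }

  /** Returns the ids of the columns that fail to read the batch at startRow. */
  static List<Integer> failingColumns(ColumnReader reader, int[] columnIds, long startRow) {
    List<Integer> failed = new ArrayList<>();
    for (int id : columnIds) {
      try {
        reader.readColumn(id, startRow);
      } catch (RuntimeException e) {
        // Record the failure and keep going so every column is checked.
        System.out.printf("Column %d failed at row %d%n", id, startRow);
        failed.add(id);
      }
    }
    return failed;
  }
}
```

Reading columns independently is what lets the report narrow a failed batch down to specific column ids rather than only a row range.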
How was this patch tested?
It was tested on ORC files that were corrupted by bad machines.