ORC-697: Improve scan tool to report the location of corruption. #582

omalley · 2020-12-14T22:38:42Z

What changes were proposed in this pull request?

This PR updates the scan tool to print information about where the file is corrupted. It:

- reads data in batches until there is a problem
- tries re-reading that batch column by column to find which column is corrupted
- figures out the next location that the reader can seek to
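The three steps above can be sketched roughly as follows. This is a toy model rather than the code in ScanData.java: `ColumnSource`, `scan`, `nextSeekPoint`, and the `seekPoints` array are hypothetical names introduced for illustration, and the assumption that recovery positions are a sorted list of row numbers is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the scan strategy; none of these types are ORC APIs. */
final class ScanSketch {
  /** Hypothetical stand-in for a reader: reads rowCount rows of one column. */
  interface ColumnSource {
    void read(int column, long firstRow, int rowCount) throws Exception;
  }

  /** Scans rows [0, totalRows) and returns one message per problem found. */
  static List<String> scan(ColumnSource source, int columnCount, long totalRows,
                           int batchSize, long[] seekPoints) {
    List<String> report = new ArrayList<>();
    long row = 0;
    while (row < totalRows) {
      int rowCount = (int) Math.min(batchSize, totalRows - row);
      try {
        // Step 1: read the whole batch (every column) until something fails.
        for (int column = 0; column < columnCount; column++) {
          source.read(column, row, rowCount);
        }
        row += rowCount;
      } catch (Exception batchFailure) {
        report.add("Unable to read batch at row " + row);
        // Step 2: re-read the same batch one column at a time.
        for (int column = 0; column < columnCount; column++) {
          try {
            source.read(column, row, rowCount);
          } catch (Exception columnFailure) {
            report.add("Column " + column + " failed at row " + row);
          }
        }
        // Step 3: skip ahead to the next position the reader can seek to.
        row = nextSeekPoint(seekPoints, row, totalRows);
      }
    }
    return report;
  }

  /** Assumes seekPoints is sorted ascending; returns totalRows if none is left. */
  static long nextSeekPoint(long[] seekPoints, long row, long totalRows) {
    for (long point : seekPoints) {
      if (point > row) {
        return point;
      }
    }
    return totalRows;
  }

  public static void main(String[] args) {
    // Simulated corruption: column 1 is unreadable for rows 100..199.
    ColumnSource source = (column, firstRow, rowCount) -> {
      if (column == 1 && firstRow < 200 && firstRow + rowCount > 100) {
        throw new java.io.IOException("simulated corruption");
      }
    };
    scan(source, 3, 300, 50, new long[] {0, 100, 200}).forEach(System.out::println);
  }
}
```

In this sketch, returning `totalRows` when no seek point remains ends the loop, so the scan never tries to seek past the end of the data.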

Why are the changes needed?

It helps diagnose where (row & column) an ORC file is corrupted.

How was this patch tested?

It was tested on ORC files that were corrupted by bad machines.

dongjoon-hyun · 2020-12-14T23:49:20Z

java/tools/src/java/org/apache/orc/tools/ScanData.java

+            long badBatches = 0;
+            long currentRow = 0;
+            long goodRows = 0;
+            try (RecordReader rows = reader.rows()) {


Can we handle the corrupted file more gracefully? This looks like a design choice, but it may be considered a regression in the following case.

BEFORE

$ orc-tools scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to dump data for file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc

AFTER (this PR)

$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to open file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc
java.lang.IndexOutOfBoundsException: Index: 0
    at java.util.Collections$EmptyList.get(Collections.java:4456)
    at org.apache.orc.OrcProto$StripeFooter.getColumns(OrcProto.java:14080)
    at org.apache.orc.impl.reader.StripePlanner.buildEncodings(StripePlanner.java:224)
    at org.apache.orc.impl.reader.StripePlanner.parseStripe(StripePlanner.java:126)
    at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
    at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1168)
    at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1203)
    at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:268)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:841)
    at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:835)
    at org.apache.orc.tools.ScanData.main(ScanData.java:171)
    at org.apache.orc.tools.Driver.main(Driver.java:126)

In this case, instead of showing a raw exception like IndexOutOfBoundsException directly, can we report a corrupted footer with a general message?

omalley · 2020-12-15

Yes, although the scan tool doesn't have information about where the problem is when it is in the footer. We'd need to add better exceptions in ReaderImpl, which is a good idea.

Do you want me to hide the actual exceptions behind a "-v" option?

dongjoon-hyun · 2020-12-15

Oh, -v sounds much better to me. Thanks!
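As a rough illustration of the "-v" idea discussed above, the sketch below prints a one-line summary by default and appends the stack trace only when a verbose flag is set. `VerboseReportSketch` and `describe` are hypothetical names, and the flag parsing is a simplification; this is not the option handling in ScanData.java.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;

/** Illustrative only: shows how a verbose flag could gate stack traces. */
final class VerboseReportSketch {
  /** Returns the summary alone, or the summary plus the stack trace if verbose. */
  static String describe(String summary, Throwable error, boolean verbose) {
    if (!verbose) {
      return summary;
    }
    StringWriter trace = new StringWriter();
    error.printStackTrace(new PrintWriter(trace, true));
    return summary + System.lineSeparator() + trace;
  }

  public static void main(String[] args) {
    // Simplified flag handling: treat any "-v" argument as verbose.
    boolean verbose = Arrays.asList(args).contains("-v");
    Throwable error = new IndexOutOfBoundsException("Index: 0");
    System.out.println(describe("Unable to open file: example.orc", error, verbose));
  }
}
```

Keeping the summary line identical in both modes means scripts that parse the default output would not break when "-v" is added.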

dongjoon-hyun · 2020-12-14T23:56:06Z

java/tools/src/java/org/apache/orc/tools/ScanData.java

+          rows.nextBatch(batch);
+        }
+      } catch (Throwable t) {
+        System.out.printf("Column %d failed at row %d%n", column.getId(),


For this one, can we show the column name along with the column ID?

$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/missing_length_stream_in_string_dict.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
    at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
    at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
    at org.apache.orc.tools.ScanData.main(ScanData.java:175)
    at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: ../examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
    at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
    at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
    at org.apache.orc.tools.ScanData.main(ScanData.java:188)
    at org.apache.orc.tools.Driver.main(Driver.java:126)

omalley · 2020-12-15

Actually, if you pass "-s" to the tool, it will print out the entire schema in JSON.

It looks like:

Processing data file examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
{"category": "struct", "id": 0, "max": 11, "fields": [
  { "id": {"category": "int", "id": 1, "max": 1}},
  { "bool_col": {"category": "boolean", "id": 2, "max": 2}},
  { "tinyint_col": {"category": "tinyint", "id": 3, "max": 3}},
  { "smallint_col": {"category": "smallint", "id": 4, "max": 4}},
  { "int_col": {"category": "int", "id": 5, "max": 5}},
  { "bigint_col": {"category": "bigint", "id": 6, "max": 6}},
  { "float_col": {"category": "float", "id": 7, "max": 7}},
  { "double_col": {"category": "double", "id": 8, "max": 8}},
  { "date_string_col": {"category": "string", "id": 9, "max": 9}},
  { "string_col": {"category": "string", "id": 10, "max": 10}},
  { "timestamp_col": {"category": "timestamp", "id": 11, "max": 11}}]}
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
    at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
    at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
    at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
    at org.apache.orc.tools.ScanData.main(ScanData.java:175)
    at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
    at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
    at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
    at org.apache.orc.tools.ScanData.main(ScanData.java:188)
    at org.apache.orc.tools.Driver.main(Driver.java:126)

I'll update the patch to remove the final exception, which happens because it is trying to seek past the end of the file.
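To connect the "Column 9 failed" messages with the schema printed by "-s": ORC column IDs are assigned by a pre-order walk of the schema tree, which is why the root struct has id 0 and its fields follow as 1 through 11. The sketch below reproduces that numbering on a toy tree to look up a name from an ID. `ColumnNameSketch`, `Node`, and `namesById` are hypothetical and are not part of the ORC API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy schema tree; illustrates pre-order column numbering only. */
final class ColumnNameSketch {
  /** Hypothetical schema node: a field name plus child fields (none for primitives). */
  static final class Node {
    final String name;
    final List<Node> children = new ArrayList<>();

    Node(String name, Node... kids) {
      this.name = name;
      this.children.addAll(List.of(kids));
    }
  }

  /** Maps each column ID to a dotted field path; the root gets an empty path. */
  static Map<Integer, String> namesById(Node root) {
    Map<Integer, String> names = new LinkedHashMap<>();
    assign(root, "", names);
    return names;
  }

  private static void assign(Node node, String path, Map<Integer, String> names) {
    // The next free ID equals the number of nodes visited so far (pre-order).
    names.put(names.size(), path);
    for (Node child : node.children) {
      String childPath = path.isEmpty() ? child.name : path + "." + child.name;
      assign(child, childPath, names);
    }
  }

  public static void main(String[] args) {
    // Field names copied from the schema JSON in this thread.
    Node schema = new Node("root",
        new Node("id"), new Node("bool_col"), new Node("tinyint_col"),
        new Node("smallint_col"), new Node("int_col"), new Node("bigint_col"),
        new Node("float_col"), new Node("double_col"), new Node("date_string_col"),
        new Node("string_col"), new Node("timestamp_col"));
    Map<Integer, String> names = namesById(schema);
    for (int id : new int[] {9, 10, 11}) {
      System.out.println("Column " + id + " is " + names.get(id));
    }
  }
}
```

For the schema above, IDs 9, 10, and 11 resolve to date_string_col, string_col, and timestamp_col, matching the "id" values in the JSON output.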

dongjoon-hyun

This feature looks helpful. Thank you, @omalley.

omalley · 2020-12-15T01:05:57Z

Ok, I pushed an update:

- Added a better exception for RecordReaderImpl.
- Added handling for a recovery point that is the end of the file.

dongjoon-hyun

+1, LGTM. Thank you for the explanation and updates, @omalley.
The AS-IS PR also looks good to me. Merged to master.

ORC-697: Improve scan tool to report the location of corruption.

e648e06

dongjoon-hyun reviewed Dec 14, 2020

Minor additions based on feedback.

6a9013e

dongjoon-hyun approved these changes Dec 15, 2020

dongjoon-hyun merged commit 3f3f62f into apache:master Dec 15, 2020
