ORC-697: Improve scan tool to report the location of corruption. #582

Merged
merged 2 commits into from
Dec 15, 2020

@omalley omalley commented Dec 14, 2020

What changes were proposed in this pull request?

This PR updates the scan tool to print information about where the file is corrupted. It

  • reads data in batches until there is a problem
  • tries re-reading that batch column by column to find which column is corrupted
  • figures out the next location that the reader can seek to
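The three steps above can be sketched without the ORC library. This is a minimal simulation, not the code in this PR: `ColumnReader`, `ScanSketch`, and the rule that recovery resumes at the start of the next stripe are assumptions made for illustration, and columns here are numbered from 0 rather than using ORC column ids.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for reading one column of one batch; not an ORC API. */
interface ColumnReader {
  void read(int column, long firstRow, int size) throws Exception;
}

final class ScanSketch {
  /** First stripe start strictly after row, or rowCount if there is none (assumed rule). */
  static long recoveryRow(long[] stripeStarts, long rowCount, long row) {
    for (long start : stripeStarts) {
      if (start > row) {
        return start;
      }
    }
    return rowCount;
  }

  /** Scans in batches and returns one message per failed batch and failed column. */
  static List<String> scan(ColumnReader reader, int columns, long rowCount,
                           int batchSize, long[] stripeStarts) {
    List<String> problems = new ArrayList<>();
    long row = 0;
    while (row < rowCount) {
      int size = (int) Math.min(batchSize, rowCount - row);
      try {
        // Step 1: read the whole batch, all columns together.
        for (int c = 0; c < columns; c++) {
          reader.read(c, row, size);
        }
        row += size;
      } catch (Exception batchFailure) {
        long next = recoveryRow(stripeStarts, rowCount, row);
        problems.add(String.format(
            "Unable to read batch at row %d, recovery at row %d", row, next));
        // Step 2: re-read the same batch one column at a time to find the bad ones.
        for (int c = 0; c < columns; c++) {
          try {
            reader.read(c, row, size);
          } catch (Exception columnFailure) {
            problems.add(String.format("Column %d failed at row %d", c, row));
          }
        }
        // Step 3: skip ahead to the next position the reader can seek to.
        row = next;
      }
    }
    return problems;
  }

  public static void main(String[] args) {
    // Simulated file: 3 columns, 600 rows, two stripes; column 1 is corrupt in stripe 1.
    ColumnReader corrupt = (column, firstRow, size) -> {
      if (column == 1 && firstRow < 300) {
        throw new IllegalStateException("simulated corruption");
      }
    };
    scan(corrupt, 3, 600, 100, new long[] {0, 300}).forEach(System.out::println);
  }
}
```

Because the recovery row is always greater than the failing row, the loop makes progress even when every batch in a stripe is unreadable.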

Why are the changes needed?

It helps diagnose where (row & column) an ORC file is corrupted.

How was this patch tested?

It was tested on ORC files that were corrupted by bad machines.

long badBatches = 0;
long currentRow = 0;
long goodRows = 0;
try (RecordReader rows = reader.rows()) {
Member

Can we handle the corrupted file more gracefully? This looks like it is by design, but it may be considered a regression in the following case.

BEFORE

$ orc-tools scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to dump data for file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc

AFTER (this PR)

$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/stripe_footer_bad_column_encodings.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/stripe_footer_bad_column_encodings.orc [length: 780]
Unable to open file: ../examples/corrupt/stripe_footer_bad_column_encodings.orc
java.lang.IndexOutOfBoundsException: Index: 0
	at java.util.Collections$EmptyList.get(Collections.java:4456)
	at org.apache.orc.OrcProto$StripeFooter.getColumns(OrcProto.java:14080)
	at org.apache.orc.impl.reader.StripePlanner.buildEncodings(StripePlanner.java:224)
	at org.apache.orc.impl.reader.StripePlanner.parseStripe(StripePlanner.java:126)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1117)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1168)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1203)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:268)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:841)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:835)
	at org.apache.orc.tools.ScanData.main(ScanData.java:171)
	at org.apache.orc.tools.Driver.main(Driver.java:126)

Member

In this case, instead of showing a raw exception like IndexOutOfBoundsException directly, can we report a corrupted footer with a general message?

Contributor Author

Yes, although when the problem is in the footer, the scan tool doesn't have any information about where it happened. We'd need to add better exceptions in the ReaderImpl, which is a good idea.

Contributor Author

Do you want me to hide the actual exceptions behind a "-v" option?

Member

Oh, -v sounds much better to me. Thanks!
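A minimal sketch of the "-v" idea discussed here: always print a one-line summary, and print the stack trace only when verbose output is requested. The `ErrorReport` class, the `report` method, and the argument handling are hypothetical and are not taken from the patch.

```java
import java.io.PrintStream;

final class ErrorReport {
  /** Prints a one-line summary; adds the stack trace only when verbose is set. */
  static void report(PrintStream out, String summary, Throwable cause, boolean verbose) {
    out.println(summary);
    if (verbose) {
      cause.printStackTrace(out);
    }
  }

  public static void main(String[] args) {
    // Assumption: verbose mode is requested by passing "-v" as the first argument.
    boolean verbose = args.length > 0 && "-v".equals(args[0]);
    report(System.out, "Unable to open file: example.orc",
        new IndexOutOfBoundsException("Index: 0"), verbose);
  }
}
```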

rows.nextBatch(batch);
}
} catch (Throwable t) {
System.out.printf("Column %d failed at row %d%n", column.getId(),
Member

For this one, can we show the column name together?

$ java -jar tools/target/orc-tools-1.7.0-SNAPSHOT-uber.jar scan ../examples/corrupt/missing_length_stream_in_string_dict.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file ../examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
	at org.apache.orc.tools.ScanData.main(ScanData.java:175)
	at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: ../examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
	at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
	at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
	at org.apache.orc.tools.ScanData.main(ScanData.java:188)
	at org.apache.orc.tools.Driver.main(Driver.java:126)

Contributor Author

Actually, if you pass "-s" to the tool, it will print out the entire schema in json.
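For a flat struct, where the root has id 0 and the fields take ids 1 through n in order, a reported column id can be matched to a field name by hand or with a small lookup. The sketch below is illustrative only: `ColumnNames` and the "Column %d (%s) failed" message format are hypothetical and are not what the tool prints. Nested schemas would need a tree walk instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ColumnNames {
  /** Maps column ids to names for a flat struct: the root is id 0, fields are 1..n. */
  static Map<Integer, String> idsToNames(String... fieldNames) {
    Map<Integer, String> result = new LinkedHashMap<>();
    int id = 1;
    for (String name : fieldNames) {
      result.put(id++, name);
    }
    return result;
  }

  /** Builds a failure message that includes the field name (hypothetical format). */
  static String describe(Map<Integer, String> names, int columnId, long row) {
    String name = names.getOrDefault(columnId, "?");
    return String.format("Column %d (%s) failed at row %d", columnId, name, row);
  }

  public static void main(String[] args) {
    Map<Integer, String> names = idsToNames(
        "id", "bool_col", "tinyint_col", "smallint_col", "int_col", "bigint_col",
        "float_col", "double_col", "date_string_col", "string_col", "timestamp_col");
    System.out.println(describe(names, 9, 0));
  }
}
```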

Contributor Author

@omalley omalley Dec 15, 2020

It looks like:

Processing data file examples/corrupt/missing_length_stream_in_string_dict.orc [length: 1788]
{"category": "struct", "id": 0, "max": 11, "fields": [
{  "id": {"category": "int", "id": 1, "max": 1}},
{  "bool_col": {"category": "boolean", "id": 2, "max": 2}},
{  "tinyint_col": {"category": "tinyint", "id": 3, "max": 3}},
{  "smallint_col": {"category": "smallint", "id": 4, "max": 4}},
{  "int_col": {"category": "int", "id": 5, "max": 5}},
{  "bigint_col": {"category": "bigint", "id": 6, "max": 6}},
{  "float_col": {"category": "float", "id": 7, "max": 7}},
{  "double_col": {"category": "double", "id": 8, "max": 8}},
{  "date_string_col": {"category": "string", "id": 9, "max": 9}},
{  "string_col": {"category": "string", "id": 10, "max": 10}},
{  "timestamp_col": {"category": "timestamp", "id": 11, "max": 11}}]}
Unable to read batch at row 0 in stripe 1 (rows 0-300), recovery at row 300 in stripe 1 (rows 300-300)
java.lang.NullPointerException
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryByteArray(TreeReaderFactory.java:2237)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.nextVector(TreeReaderFactory.java:2198)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1897)
	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:42)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:72)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1236)
	at org.apache.orc.tools.ScanData.main(ScanData.java:175)
	at org.apache.orc.tools.Driver.main(Driver.java:126)
Column 9 failed at row 0
Column 10 failed at row 0
Column 11 failed at row 0
Unable to open file: examples/corrupt/missing_length_stream_in_string_dict.orc
java.lang.IllegalArgumentException: Seek after the end of reader range
	at org.apache.orc.impl.RecordReaderImpl.findStripe(RecordReaderImpl.java:1310)
	at org.apache.orc.impl.RecordReaderImpl.seekToRow(RecordReaderImpl.java:1362)
	at org.apache.orc.tools.ScanData.main(ScanData.java:188)
	at org.apache.orc.tools.Driver.main(Driver.java:126)

I'll update the patch to remove the final exception, which happens because it is trying to seek past the end of the file.

Member

Thanks!

Member

@dongjoon-hyun dongjoon-hyun left a comment

This feature looks helpful. Thank you, @omalley .

Contributor Author

omalley commented Dec 15, 2020

Ok, I pushed an update:

  • Added a better exception for RecordReaderImpl.
  • Added handling for a recovery point that is the end of the file.
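The second point can be pictured as a guard before the seek: when the recovery row is at or past the last row, as in the 300-row example above where recovery lands at row 300, there is nothing left to seek to and the scan should stop. This is a sketch under that assumption; `RecoveryGuard` and `seekTarget` are hypothetical names, not code from the update.

```java
import java.util.OptionalLong;

final class RecoveryGuard {
  /** Returns the row to seek to, or empty when the recovery point is the end of the file. */
  static OptionalLong seekTarget(long recoveryRow, long totalRows) {
    return recoveryRow < totalRows ? OptionalLong.of(recoveryRow) : OptionalLong.empty();
  }

  public static void main(String[] args) {
    // A 300-row file with recovery at row 300: no seek should be attempted.
    System.out.println(seekTarget(300, 300).isPresent());
    // A 600-row file with recovery at row 300: seeking is safe.
    System.out.println(seekTarget(300, 600).isPresent());
  }
}
```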

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you for the explanation and updates, @omalley .
The PR as-is also looks good to me. Merged to master.

@dongjoon-hyun dongjoon-hyun merged commit 3f3f62f into apache:master Dec 15, 2020