DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

cgivre · 2021-03-02T02:20:39Z

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp

Description

This PR addresses:

https://issues.apache.org/jira/browse/DRILL-7875

DRILL-7875 adds a null check in the Splunk reader. In the event that there is a Splunk index with no timestamp field, this will cause Drill to receive several null columns. This rectifies that problem.

Documentation

No user facing changes.

Testing

Manual testing and existing unit tests.

vvysotskyi · 2021-03-02T20:35:07Z

contrib/format-esri/src/main/java/org/apache/drill/exec/store/esri/ShpBatchReader.java

@@ -125,7 +125,7 @@ public boolean next() {

  private void openFile(FileSchemaNegotiator negotiator) {
    try {
-      fileReaderShp = negotiator.fileSystem().open(split.getPath());
+      fileReaderShp = negotiator.fileSystem().openDecompressedInputStream(split.getPath());


I have some doubts regarding the safety of the openDecompressedInputStream() method. It contains the following line:

byte[] bytes = IOUtils.toByteArray(compressedStream);

that will read all the stream to the memory, but it may cause OOM.
Why we can't use here another openPossiblyCompressedStream() method?

@vvysotskyi
Thanks for your comment. The issue that was happening on S3 was that the inputstream that gets passed to the format plugin seemed to be compressed in some way, even if the file itself was not compressed.
This didn't seem to matter for text based file formats, but for formats that read binary data, such as pcap, the stream wasn't being decompressed and the result was that Drill couldn't parse the file.

The openPossiblyCompressedStream() method didn't really solve this issue, because you could end up with a compressed stream that the format plugins couldn't read. I thought another approach would be to put this logic in the format plugins themselves, but I couldn't figure out a way to determine whether the stream was compressed or not after you call the openPossiblyCompressedStream().

Do you have any suggestions as to how to fix so that we can avoid the OOM?

Both methods openPossiblyCompressedStream() and openDecompressedInputStream() have similar logic in determining compression codec (based on specified path) and producing decompression. The difference is only that the last one places whole the stream to the buffer and wraps it to the ByteArrayInputStream. So looks like if openDecompressedInputStream() will not help, the openPossiblyCompressedStream() also wouldn't work.

If for some reason files from s3 are read as compressed, and there is no way for obtaining info for the compression method when reading the file, we can add a property to the file storage plugin config block that will hold the default compression type if any.

luocooong

@vvysotskyi Hello, Is there a new PR #2186 to replace this PR?

vvysotskyi · 2021-03-06T10:17:01Z

Hi @luocooong, no, this PR contains also fixes for the SHP and Splunk plugins (DRILL-7875) that aren't addressed in #2186. So those fixes may be left here, but other changes related to replacing openPossiblyCompressedStream with openDecompressedInputStream should be reverted since they don't fix the actual issue.

cgivre · 2021-03-07T03:20:03Z

@vvysotskyi I fixed this PR to use the latest changes from your s3 fix. Now it just addresses the NPE from Splunk indexes w/o timestamps.

luocooong

@cgivre +1
It's ready to merge once all checks have passed.

vvysotskyi

LGTM, +1

cgivre added the bug label Mar 2, 2021

cgivre self-assigned this Mar 2, 2021

cgivre requested a review from luocooong March 2, 2021 02:38

vvysotskyi reviewed Mar 2, 2021

View reviewed changes

cgivre changed the title ~~DRILL-7874: Drill Fails to Read File Types on S3~~ DRILL-7874: Drill Fails to Read Binary File Types on S3 Mar 2, 2021

luocooong reviewed Mar 6, 2021

View reviewed changes

cgivre force-pushed the s3_fixes branch from c408f2c to cc9641f Compare March 7, 2021 03:16

cgivre changed the title ~~DRILL-7874: Drill Fails to Read Binary File Types on S3~~ DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp Mar 7, 2021

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp

85474c1

cgivre force-pushed the s3_fixes branch from cc9641f to 85474c1 Compare March 7, 2021 03:19

luocooong approved these changes Mar 7, 2021

View reviewed changes

vvysotskyi approved these changes Mar 7, 2021

View reviewed changes

cgivre merged commit 9ad10f5 into apache:master Mar 7, 2021

cgivre deleted the s3_fixes branch March 7, 2021 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

cgivre commented Mar 2, 2021 •

edited

vvysotskyi Mar 2, 2021

cgivre Mar 2, 2021

vvysotskyi Mar 2, 2021

luocooong left a comment

vvysotskyi commented Mar 6, 2021

cgivre commented Mar 7, 2021

luocooong left a comment

vvysotskyi left a comment

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

Conversation

cgivre commented Mar 2, 2021 • edited

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp

Description

Documentation

Testing

vvysotskyi Mar 2, 2021

Choose a reason for hiding this comment

cgivre Mar 2, 2021

Choose a reason for hiding this comment

vvysotskyi Mar 2, 2021

Choose a reason for hiding this comment

luocooong left a comment

Choose a reason for hiding this comment

vvysotskyi commented Mar 6, 2021

cgivre commented Mar 7, 2021

luocooong left a comment

Choose a reason for hiding this comment

vvysotskyi left a comment

Choose a reason for hiding this comment

cgivre commented Mar 2, 2021 •

edited