Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp #2184

Merged
merged 1 commit into from Mar 7, 2021

Conversation

cgivre
Copy link
Contributor

@cgivre cgivre commented Mar 2, 2021

DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp

Description

This PR addresses:

DRILL-7875 adds a null check in the Splunk reader. In the event that there is a Splunk index with no timestamp field, this will cause Drill to receive several null columns. This rectifies that problem.

Documentation

No user facing changes.

Testing

Manual testing and existing unit tests.

@cgivre cgivre added the bug label Mar 2, 2021
@cgivre cgivre self-assigned this Mar 2, 2021
@cgivre cgivre requested a review from luocooong March 2, 2021 02:38
@@ -125,7 +125,7 @@ public boolean next() {

private void openFile(FileSchemaNegotiator negotiator) {
try {
fileReaderShp = negotiator.fileSystem().open(split.getPath());
fileReaderShp = negotiator.fileSystem().openDecompressedInputStream(split.getPath());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some doubts regarding the safety of the openDecompressedInputStream() method. It contains the following line:

byte[] bytes = IOUtils.toByteArray(compressedStream);

that will read all the stream to the memory, but it may cause OOM.
Why we can't use here another openPossiblyCompressedStream() method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi
Thanks for your comment. The issue that was happening on S3 was that the inputstream that gets passed to the format plugin seemed to be compressed in some way, even if the file itself was not compressed.
This didn't seem to matter for text based file formats, but for formats that read binary data, such as pcap, the stream wasn't being decompressed and the result was that Drill couldn't parse the file.

The openPossiblyCompressedStream() method didn't really solve this issue, because you could end up with a compressed stream that the format plugins couldn't read. I thought another approach would be to put this logic in the format plugins themselves, but I couldn't figure out a way to determine whether the stream was compressed or not after you call the openPossiblyCompressedStream().

Do you have any suggestions as to how to fix so that we can avoid the OOM?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both methods openPossiblyCompressedStream() and openDecompressedInputStream() have similar logic in determining compression codec (based on specified path) and producing decompression. The difference is only that the last one places whole the stream to the buffer and wraps it to the ByteArrayInputStream. So looks like if openDecompressedInputStream() will not help, the openPossiblyCompressedStream() also wouldn't work.

If for some reason files from s3 are read as compressed, and there is no way for obtaining info for the compression method when reading the file, we can add a property to the file storage plugin config block that will hold the default compression type if any.

@cgivre cgivre changed the title DRILL-7874: Drill Fails to Read File Types on S3 DRILL-7874: Drill Fails to Read Binary File Types on S3 Mar 2, 2021
Copy link
Member

@luocooong luocooong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vvysotskyi Hello, Is there a new PR #2186 to replace this PR?

@vvysotskyi
Copy link
Member

Hi @luocooong, no, this PR contains also fixes for the SHP and Splunk plugins (DRILL-7875) that aren't addressed in #2186. So those fixes may be left here, but other changes related to replacing openPossiblyCompressedStream with openDecompressedInputStream should be reverted since they don't fix the actual issue.

@cgivre cgivre changed the title DRILL-7874: Drill Fails to Read Binary File Types on S3 DRILL-7875: Drill Fails to Splunk Indexes with no Timestamp Mar 7, 2021
@cgivre
Copy link
Contributor Author

cgivre commented Mar 7, 2021

@vvysotskyi I fixed this PR to use the latest changes from your s3 fix. Now it just addresses the NPE from Splunk indexes w/o timestamps.

Copy link
Member

@luocooong luocooong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cgivre +1
It's ready to merge once all checks have passed.

Copy link
Member

@vvysotskyi vvysotskyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, +1

@cgivre cgivre merged commit 9ad10f5 into apache:master Mar 7, 2021
@cgivre cgivre deleted the s3_fixes branch March 7, 2021 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants