DRILL-7875: Drill Fails to Read Splunk Indexes with no Timestamp #2184
Conversation
```diff
@@ -125,7 +125,7 @@ public boolean next() {

   private void openFile(FileSchemaNegotiator negotiator) {
     try {
-      fileReaderShp = negotiator.fileSystem().open(split.getPath());
+      fileReaderShp = negotiator.fileSystem().openDecompressedInputStream(split.getPath());
```
I have some doubts regarding the safety of the `openDecompressedInputStream()` method. It contains the following line:
`byte[] bytes = IOUtils.toByteArray(compressedStream);`
which reads the entire stream into memory and may cause an OOM. Why can't we use the other `openPossiblyCompressedStream()` method here?
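To make the concern concrete, here is a minimal, self-contained sketch (not Drill code; class and method names are hypothetical) contrasting the buffering pattern, which materializes the whole decompressed payload in memory the way `IOUtils.toByteArray()` does, with a streaming wrapper that decompresses incrementally:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamingDecompression {

  // Buffering approach: the whole payload is materialized in memory at
  // once, which is what risks an OOM on large files.
  static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }

  // Streaming approach: wrap the compressed stream so callers read and
  // decompress incrementally; only a small buffer lives in memory.
  static InputStream openStreaming(InputStream compressed) throws IOException {
    return new GZIPInputStream(compressed);
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "hello splunk".getBytes("UTF-8");
    ByteArrayOutputStream gz = new ByteArrayOutputStream();
    try (GZIPOutputStream gzOut = new GZIPOutputStream(gz)) {
      gzOut.write(payload);
    }
    try (InputStream in = openStreaming(new ByteArrayInputStream(gz.toByteArray()))) {
      System.out.println(new String(readFully(in), "UTF-8")); // hello splunk
    }
  }
}
```

With the streaming approach, peak memory is bounded by the read buffer rather than by the decompressed file size.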
@vvysotskyi
Thanks for your comment. The issue on S3 was that the `InputStream` passed to the format plugin appeared to be compressed in some way, even when the file itself was not compressed. This didn't seem to matter for text-based file formats, but for formats that read binary data, such as pcap, the stream wasn't being decompressed and Drill couldn't parse the file.
The `openPossiblyCompressedStream()` method didn't really solve this, because you could still end up with a compressed stream that the format plugins couldn't read. I considered putting this logic in the format plugins themselves, but I couldn't find a way to determine whether the stream was compressed after calling `openPossiblyCompressedStream()`.
Do you have any suggestions for a fix that avoids the OOM?
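On the question of detecting whether a stream is already compressed: one content-based option (a sketch only, not what Drill's helpers do — they choose the codec from the file path) is to peek at the stream's leading magic bytes. The example below covers only gzip, whose streams always begin with `0x1f 0x8b`:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class GzipSniffer {

  // Peek at the first two bytes without consuming them; gzip streams
  // always start with the magic bytes 0x1f 0x8b.
  static boolean looksGzipped(InputStream in) throws IOException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException("stream must support mark/reset");
    }
    in.mark(2);
    int b1 = in.read();
    int b2 = in.read();
    in.reset();
    return b1 == 0x1f && b2 == 0x8b;
  }

  public static void main(String[] args) throws IOException {
    InputStream gzipped = new BufferedInputStream(
        new ByteArrayInputStream(new byte[] {0x1f, (byte) 0x8b, 8, 0}));
    InputStream plain = new BufferedInputStream(
        new ByteArrayInputStream("plain text".getBytes("UTF-8")));
    System.out.println(looksGzipped(gzipped)); // true
    System.out.println(looksGzipped(plain));   // false
  }
}
```

Each codec would need its own signature, so this is only practical for a small, known set of formats; it also requires a mark-supporting stream (e.g. wrapping in `BufferedInputStream`).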
Both methods, `openPossiblyCompressedStream()` and `openDecompressedInputStream()`, have similar logic for determining the compression codec (based on the specified path) and performing decompression. The only difference is that the latter reads the whole stream into a buffer and wraps it in a `ByteArrayInputStream`. So if `openDecompressedInputStream()` doesn't help, `openPossiblyCompressedStream()` wouldn't work either.
If for some reason files from S3 are read as compressed, and there is no way to obtain the compression method when reading the file, we could add a property to the file storage plugin config block that holds a default compression type, if any.
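If such a property were added, the file storage plugin config might look something like the fragment below (the `defaultCompressionType` property name is hypothetical; the rest mirrors a typical Drill file-plugin config):

```json
{
  "type": "file",
  "connection": "s3a://my-bucket",
  "defaultCompressionType": "none",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false
    }
  }
}
```

Readers could then fall back to this value whenever the codec cannot be inferred from the file extension.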
@vvysotskyi Hello, is PR #2186 intended to replace this PR?
Hi @luocooong, no, this PR also contains fixes for the SHP and Splunk plugins (DRILL-7875) that aren't addressed in #2186. Those fixes may be left here, but other changes related to replacing
@vvysotskyi I fixed this PR to use the latest changes from your s3 fix. Now it just addresses the NPE from Splunk indexes without timestamps.
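The shape of that NPE fix can be sketched as a guard around the timestamp lookup. This is an illustrative stand-in, not the Splunk plugin's actual reader code; the `_time` field name and the helper below are assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SplunkRowSketch {

  // Hypothetical stand-in for one Splunk event; the real reader works
  // with the plugin's own row-writer classes.
  static String timestampOrDefault(Map<String, String> event) {
    String ts = event.get("_time");
    // The fix amounts to a null check here: indexes without a timestamp
    // field return null, which previously propagated into an NPE.
    return ts == null ? "" : ts;
  }

  public static void main(String[] args) {
    Map<String, String> noTimestamp = new LinkedHashMap<>();
    noTimestamp.put("message", "event without _time");
    System.out.println(timestampOrDefault(noTimestamp).isEmpty()); // true
  }
}
```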
@cgivre +1
It's ready to merge once all checks have passed.
LGTM, +1
DRILL-7875: Drill Fails to Read Splunk Indexes with no Timestamp
Description
This PR addresses:
DRILL-7875 adds a `null` check in the Splunk reader. If there is a Splunk index with no timestamp field, Drill will receive several `null` columns; this change rectifies that problem.
Documentation
No user facing changes.
Testing
Manual testing and existing unit tests.