-
Notifications
You must be signed in to change notification settings - Fork 980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-5674: Support ZIP compression #1879
Conversation
} | ||
} | ||
|
||
public List<FileStatus> getFileStatuses() { | ||
return statuses; | ||
} | ||
|
||
public boolean supportDirPrunig() { | ||
public boolean supportDirPruning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. suppportsDirPruning
(with an s)?
The support
form is imperative, it tells this object to support dir pruning. The supports
form asks if this object does or does not support dir pruning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, renamed.
private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class); | ||
private static final Logger logger = LoggerFactory.getLogger(FileSystemPlugin.class); | ||
|
||
private static final List<String> BUILT_IN_CODECS = Collections.singletonList(ZipCodec.class.getCanonicalName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are no other codecs provided "out of the box"? For others, I need to provide a jar and set a config option? Or, should we move the other built-in ones here and out of the config file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.apache.hadoop.io.compress
library supports gzip / bzip2 out of box. Here we only need to add codecs that are missing in this library. I have updated parameter name and added comment to avoid the confusion. TestCompressedFiles
contains tests for all supported formats.
@@ -63,6 +60,6 @@ public FileSelection getSelection(){ | |||
|
|||
@JsonIgnore | |||
public boolean supportDirPruning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above support
--> supports
. Is safe because this value is not serialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
*/ | ||
public class ZipCodec extends DefaultCodec { | ||
|
||
private static final String EXTENSION = ".zip"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any need to support G-zip? (.gz
) or Tar/g-zip (.tar.gz
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.apache.hadoop.io.compress
supports gzip, bzip2 out of box. We are adding only zip codec implementation since it is missing in org.apache.hadoop.io.compress
library.
Regarding .tar.gz
, I am not sure how to support since it basically two compressions one over another and mostly is used for folders. Since Drill allows only to read compressed files not folders, I think for now we don't need it.
*/ | ||
private static class ZipCompressionOutputStream extends CompressionOutputStream { | ||
|
||
private static final String DEFAULT_ENTRY_NAME = "entry.out"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the entry name be the same as the file name so it is sensible if someone unzips the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stream is called by compression codec and we have no control to pass file name etc.
Anyway, Drill is not using output stream, only input stream, I have added implementation only for testing purposes.
import java.util.Map; | ||
import java.util.Set; | ||
import java.util.concurrent.TimeUnit; | ||
import java.util.regex.Pattern; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change your IDE import order to put java above org? That way, there won't be constant import shuffling each time your IDE touches a file. (Yes, we should decide on preferred order and document it somewhere...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I guess we definitely should decide on import order, For now reverted the change.
@@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fs | |||
|
|||
public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig formatPluginConfig) { | |||
super(name, context, fsConf, config, formatPluginConfig, true, | |||
false, true, false, | |||
false, true, true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the middle true
wrong? It is for blockSplittable
. That means we'll start reading at an arbitrary block boundary. Since this is a binary format, it is not clear that we can scan forward to the beginning of the next record as can be done in Sequence File and (restricted) CSV.
Also, if the file is zip-encoded, then it is never block splittable since Zip files cannot be read at an arbitrary offset.
This creates an issue: the block-splittable attribute right now is a constant. But, if any file is zip-encoded, then it is never block splittable. Any way to handle this fact?
And, any way to test this behaviour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Drill uses
BlockMapBuilder
to split file into blocks if possible. According to its code, it tries to split the file ifblockSplittable
is set to true and file IS NOT compressed. So even if format is block splittable but came as compressed file, it won't be split.
Looks like most of compressed formats are not splittable, that's why Drill considers any compressed file not splittable: https://i.stack.imgur.com/jpprr.jpg
- Regarding blockSplittable for Pcang format, you are right such format is not splittable, as well as Pcap, I have updated the value of
blockSplittable
tofalse
for both formats.
@@ -16,7 +16,7 @@ | |||
* limitations under the License. | |||
*/ | |||
/** | |||
* For comments on realization of this format plugin look at : | |||
* For comments on implementation of this format plugin look at: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"look at" --> "see"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
9f8bdd5
to
874ab29
Compare
@paul-rogers addressed code review comments. |
Thanks for the answers to my questions. LGTM. |
1. Added ZipCodec implementation which can read / write single file. 2. Revisited Drill plugin formats to ensure 'openPossiblyCompressedStream' method is used in those which support compression. 3. Added unit tests. 4. General refactoring.
874ab29
to
99731b5
Compare
Jira - DRILL-5674.