
DRILL-5674: Support ZIP compression #1879

Merged: 1 commit, Oct 23, 2019

Conversation

arina-ielchiieva
Member

  1. Added ZipCodec implementation which can read / write a single file.
  2. Revisited Drill format plugins to ensure the 'openPossiblyCompressedStream' method is used in those that support compression (a sketch of this pattern follows below).
  3. Added unit tests.
  4. General refactoring.

Jira - DRILL-5674.
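To make point 2 concrete, below is a minimal sketch of the "open a possibly compressed stream" pattern using only stock Hadoop APIs: look up a codec by file extension and wrap the raw stream when one matches. The class name and wiring are illustrative; how Drill itself exposes this helper is not shown in this snippet.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressedStreamExample {

  // Opens the file and, if a registered codec matches its extension
  // (.gz, .bz2, .zip, ...), wraps the raw stream in a decompressing one.
  public static InputStream openPossiblyCompressedStream(FileSystem fs, Path path, Configuration conf)
      throws IOException {
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path); // null when the extension is unknown
    InputStream raw = fs.open(path);
    return codec == null ? raw : codec.createInputStream(raw);
  }
}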

}
}

public List<FileStatus> getFileStatuses() {
  return statuses;
}

- public boolean supportDirPrunig() {
+ public boolean supportDirPruning() {
Contributor

Good catch. supportsDirPruning (with an s)?

The support form is imperative: it tells this object to support dir pruning. The supports form asks whether this object supports dir pruning.

Member Author

Agree, renamed.

- private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class);
+ private static final Logger logger = LoggerFactory.getLogger(FileSystemPlugin.class);

private static final List<String> BUILT_IN_CODECS = Collections.singletonList(ZipCodec.class.getCanonicalName());
Contributor

Are no other codecs provided "out of the box"? For others, do I need to provide a jar and set a config option? Or should we move the other built-in ones here and out of the config file?

Member Author

The org.apache.hadoop.io.compress library supports gzip / bzip2 out of the box. Here we only need to add codecs that are missing from that library. I have updated the parameter name and added a comment to avoid confusion. TestCompressedFiles contains tests for all supported formats.
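As a hedged illustration of the answer above: Hadoop's built-in codecs (gzip, bzip2, deflate, ...) register themselves automatically, and codec lookup is driven purely by file extension, so only a codec Hadoop does not ship, such as a ZIP codec, needs to be listed explicitly. The paths and the way the custom codec would be appended are assumptions for the example, not the PR's exact configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Codecs Hadoop ships with are discovered automatically; a custom codec
    // such as a ZIP codec would be appended through the "io.compression.codecs"
    // property, e.g. conf.set("io.compression.codecs", "<fully.qualified.ZipCodec>").
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // Lookup is driven purely by file extension.
    CompressionCodec gzip = factory.getCodec(new Path("/data/logs.csv.gz"));
    CompressionCodec zip = factory.getCodec(new Path("/data/logs.csv.zip"));
    System.out.println("gz  -> " + (gzip == null ? "none" : gzip.getClass().getSimpleName()));
    System.out.println("zip -> " + (zip == null ? "none (no ZIP codec registered)" : zip.getClass().getSimpleName()));
  }
}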

@@ -63,6 +60,6 @@ public FileSelection getSelection(){

@JsonIgnore
public boolean supportDirPruning() {
Contributor

As above, support --> supports. This is safe because this value is not serialized.

Member Author

Done.

*/
public class ZipCodec extends DefaultCodec {

private static final String EXTENSION = ".zip";
Contributor

Any need to support gzip (.gz) or tar/gzip (.tar.gz)?

Member Author

org.apache.hadoop.io.compress supports gzip and bzip2 out of the box. We are adding only the ZIP codec implementation since it is missing from the org.apache.hadoop.io.compress library.

Regarding .tar.gz, I am not sure how to support it, since it is basically two formats layered one over the other (an archive plus compression) and is mostly used for folders. Since Drill can only read compressed files, not folders, I think we don't need it for now.
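To make the read path discussed here concrete, a minimal sketch of what such a codec's decompressing stream needs to do: position a java.util.zip.ZipInputStream on the first entry and serve bytes from it, since Drill reads exactly one compressed file per archive. This is an illustration, not the PR's actual ZipCodec code.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative only: exposes the first (and only) entry of a ZIP archive
// as a plain InputStream.
public class SingleEntryZipInputStream extends InputStream {

  private final ZipInputStream zip;

  public SingleEntryZipInputStream(InputStream in) throws IOException {
    this.zip = new ZipInputStream(in);
    ZipEntry entry = zip.getNextEntry(); // advance to the first entry
    if (entry == null) {
      throw new IOException("ZIP archive contains no entries");
    }
  }

  @Override
  public int read() throws IOException {
    return zip.read(); // returns -1 at the end of the current entry
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    return zip.read(b, off, len);
  }

  @Override
  public void close() throws IOException {
    zip.close();
  }
}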

*/
private static class ZipCompressionOutputStream extends CompressionOutputStream {

private static final String DEFAULT_ENTRY_NAME = "entry.out";
Contributor

Should the entry name be the same as the file name so it is sensible if someone unzips the file?

Member Author

This stream is created by the compression codec, and we have no way to pass in the file name.
Anyway, Drill does not use the output stream, only the input stream; I added the output implementation only for testing purposes.
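For completeness, a sketch of the write path described above: because the codec API hands the stream no file name, everything is written into a single entry with a fixed name (mirroring DEFAULT_ENTRY_NAME). This is an illustration under that assumption, not the PR's exact code.

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative only: all bytes go into one ZIP entry with a fixed name,
// since the caller of the codec never supplies the original file name.
public class SingleEntryZipOutputStream extends OutputStream {

  private static final String DEFAULT_ENTRY_NAME = "entry.out";

  private final ZipOutputStream zip;

  public SingleEntryZipOutputStream(OutputStream out) throws IOException {
    this.zip = new ZipOutputStream(out);
    zip.putNextEntry(new ZipEntry(DEFAULT_ENTRY_NAME)); // single entry for the whole stream
  }

  @Override
  public void write(int b) throws IOException {
    zip.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    zip.write(b, off, len);
  }

  @Override
  public void close() throws IOException {
    zip.closeEntry(); // finish the entry before closing the archive
    zip.close();
  }
}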

import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
Contributor

Maybe change your IDE import order to put java above org? That way, there won't be constant import shuffling each time your IDE touches a file. (Yes, we should decide on a preferred order and document it somewhere...)

Member Author

Agree, we definitely should decide on import order. For now I have reverted the change.

@@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fs

public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig formatPluginConfig) {
super(name, context, fsConf, config, formatPluginConfig, true,
-     false, true, false,
+     false, true, true,
Contributor

Isn't the middle true wrong? It is for blockSplittable. That means we'll start reading at an arbitrary block boundary. Since this is a binary format, it is not clear that we can scan forward to the beginning of the next record as can be done in Sequence File and (restricted) CSV.

Also, if the file is zip-encoded, then it is never block splittable since Zip files cannot be read at an arbitrary offset.

This creates an issue: the block-splittable attribute right now is a constant. But, if any file is zip-encoded, then it is never block splittable. Any way to handle this fact?

And, any way to test this behaviour?

Member Author

  1. Drill uses BlockMapBuilder to split a file into blocks if possible. According to its code, it tries to split the file only if blockSplittable is set to true and the file IS NOT compressed. So even if a format is block splittable, a compressed file of that format won't be split (a sketch of this check follows after this reply).

https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/schedule/BlockMapBuilder.java#L115

It looks like most compressed formats are not splittable, which is why Drill considers any compressed file not splittable: https://i.stack.imgur.com/jpprr.jpg

  2. Regarding blockSplittable for the Pcapng format, you are right that it is not splittable, and neither is Pcap. I have updated the value of blockSplittable to false for both formats.

https://blog.marouni.fr/pcap2seq/
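A compact, hedged restatement of the splitting rule described in item 1 above. The class and method names are illustrative; the real check lives in BlockMapBuilder.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Illustrative paraphrase of BlockMapBuilder's behaviour: a file is split into
// block-sized work units only when the format is block splittable AND no
// compression codec matches the file name.
public class SplitDecisionExample {

  private final CompressionCodecFactory codecFactory;

  public SplitDecisionExample(CompressionCodecFactory codecFactory) {
    this.codecFactory = codecFactory;
  }

  public boolean shouldSplit(boolean blockSplittable, Path file) {
    boolean compressed = codecFactory.getCodec(file) != null; // e.g. *.zip, *.gz, *.bz2
    return blockSplittable && !compressed;
  }
}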

@@ -16,7 +16,7 @@
* limitations under the License.
*/
/**
- * For comments on realization of this format plugin look at :
+ * For comments on implementation of this format plugin look at:
Contributor

"look at" --> "see"

Member Author

Updated.

@arina-ielchiieva
Member Author

@paul-rogers addressed code review comments.

@paul-rogers
Contributor

Thanks for the answers to my questions. LGTM.
+1
