
DRILL-5674: Support ZIP compression #1879

Merged: 1 commit, Oct 23, 2019

Conversation

arina-ielchiieva
Member

  1. Added ZipCodec implementation which can read / write a single file.
  2. Revisited Drill format plugins to ensure the 'openPossiblyCompressedStream' method is used in those that support compression (a sketch of this pattern follows below).
  3. Added unit tests.
  4. General refactoring.

Jira - DRILL-5674.
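To make point 2 concrete, below is a minimal sketch of the "open a possibly compressed stream" pattern using only stock Hadoop APIs: look up a codec by file extension and wrap the raw stream when one matches. The class name and wiring are illustrative; how Drill itself exposes this helper is not shown in this snippet.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressedStreamExample {

  // Opens the file and, if a registered codec matches its extension
  // (.gz, .bz2, .zip, ...), wraps the raw stream in a decompressing one.
  public static InputStream openPossiblyCompressedStream(FileSystem fs, Path path, Configuration conf)
      throws IOException {
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(path); // null when the extension is unknown
    InputStream raw = fs.open(path);
    return codec == null ? raw : codec.createInputStream(raw);
  }
}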

}
}

public List<FileStatus> getFileStatuses() {
  return statuses;
}

- public boolean supportDirPrunig() {
+ public boolean supportDirPruning() {
Contributor

Good catch. supportsDirPruning (with an s)?

The support form is imperative: it tells this object to support dir pruning. The supports form asks whether this object supports dir pruning.

Member Author

Agree, renamed.

- private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FileSystemPlugin.class);
+ private static final Logger logger = LoggerFactory.getLogger(FileSystemPlugin.class);

private static final List<String> BUILT_IN_CODECS = Collections.singletonList(ZipCodec.class.getCanonicalName());
Contributor

Are no other codecs provided "out of the box"? For others, do I need to provide a jar and set a config option? Or should we move the other built-in ones here and out of the config file?

Member Author

The org.apache.hadoop.io.compress library supports gzip / bzip2 out of the box. Here we only need to add codecs that are missing from that library. I have updated the parameter name and added a comment to avoid confusion. TestCompressedFiles contains tests for all supported formats.
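As a hedged illustration of the answer above: Hadoop's built-in codecs (gzip, bzip2, deflate, ...) register themselves automatically, and codec lookup is driven purely by file extension, so only a codec Hadoop does not ship, such as a ZIP codec, needs to be listed explicitly. The paths and the way the custom codec would be appended are assumptions for the example, not the PR's exact configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Codecs Hadoop ships with are discovered automatically; a custom codec
    // such as a ZIP codec would be appended through the "io.compression.codecs"
    // property, e.g. conf.set("io.compression.codecs", "<fully.qualified.ZipCodec>").
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // Lookup is driven purely by file extension.
    CompressionCodec gzip = factory.getCodec(new Path("/data/logs.csv.gz"));
    CompressionCodec zip = factory.getCodec(new Path("/data/logs.csv.zip"));
    System.out.println("gz  -> " + (gzip == null ? "none" : gzip.getClass().getSimpleName()));
    System.out.println("zip -> " + (zip == null ? "none (no ZIP codec registered)" : zip.getClass().getSimpleName()));
  }
}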

@@ -63,6 +60,6 @@ public FileSelection getSelection(){

@JsonIgnore
public boolean supportDirPruning() {
Contributor

As above, support --> supports. This is safe because this value is not serialized.

Member Author

Done.

*/
public class ZipCodec extends DefaultCodec {

private static final String EXTENSION = ".zip";
Contributor

Any need to support gzip (.gz) or tar/gzip (.tar.gz)?

Member Author

org.apache.hadoop.io.compress supports gzip and bzip2 out of the box. We are adding only the ZIP codec implementation since it is missing from the org.apache.hadoop.io.compress library.

Regarding .tar.gz, I am not sure how to support it, since it is basically two formats layered one over the other (an archive plus compression) and is mostly used for folders. Since Drill can only read compressed files, not folders, I think we don't need it for now.
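To make the read path discussed here concrete, a minimal sketch of what such a codec's decompressing stream needs to do: position a java.util.zip.ZipInputStream on the first entry and serve bytes from it, since Drill reads exactly one compressed file per archive. This is an illustration, not the PR's actual ZipCodec code.

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative only: exposes the first (and only) entry of a ZIP archive
// as a plain InputStream.
public class SingleEntryZipInputStream extends InputStream {

  private final ZipInputStream zip;

  public SingleEntryZipInputStream(InputStream in) throws IOException {
    this.zip = new ZipInputStream(in);
    ZipEntry entry = zip.getNextEntry(); // advance to the first entry
    if (entry == null) {
      throw new IOException("ZIP archive contains no entries");
    }
  }

  @Override
  public int read() throws IOException {
    return zip.read(); // returns -1 at the end of the current entry
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    return zip.read(b, off, len);
  }

  @Override
  public void close() throws IOException {
    zip.close();
  }
}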

*/
private static class ZipCompressionOutputStream extends CompressionOutputStream {

private static final String DEFAULT_ENTRY_NAME = "entry.out";
Contributor

Should the entry name be the same as the file name so it is sensible if someone unzips the file?

Member Author

This stream is created by the compression codec, and we have no way to pass in the file name.
Anyway, Drill does not use the output stream, only the input stream; I added the output implementation only for testing purposes.
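For completeness, a sketch of the write path described above: because the codec API hands the stream no file name, everything is written into a single entry with a fixed name (mirroring DEFAULT_ENTRY_NAME). This is an illustration under that assumption, not the PR's exact code.

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative only: all bytes go into one ZIP entry with a fixed name,
// since the caller of the codec never supplies the original file name.
public class SingleEntryZipOutputStream extends OutputStream {

  private static final String DEFAULT_ENTRY_NAME = "entry.out";

  private final ZipOutputStream zip;

  public SingleEntryZipOutputStream(OutputStream out) throws IOException {
    this.zip = new ZipOutputStream(out);
    zip.putNextEntry(new ZipEntry(DEFAULT_ENTRY_NAME)); // single entry for the whole stream
  }

  @Override
  public void write(int b) throws IOException {
    zip.write(b);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    zip.write(b, off, len);
  }

  @Override
  public void close() throws IOException {
    zip.closeEntry(); // finish the entry before closing the archive
    zip.close();
  }
}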

import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
Contributor

Maybe change your IDE import order to put java above org? That way, there won't be constant import shuffling each time your IDE touches a file. (Yes, we should decide on a preferred order and document it somewhere...)

Member Author

Agree, we definitely should decide on import order. For now I have reverted the change.

@@ -47,7 +47,7 @@ public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fs

public PcapngFormatPlugin(String name, DrillbitContext context, Configuration fsConf, StoragePluginConfig config, PcapngFormatConfig formatPluginConfig) {
super(name, context, fsConf, config, formatPluginConfig, true,
-     false, true, false,
+     false, true, true,
Contributor

Isn't the middle true wrong? It is for blockSplittable. That means we'll start reading at an arbitrary block boundary. Since this is a binary format, it is not clear that we can scan forward to the beginning of the next record as can be done in Sequence File and (restricted) CSV.

Also, if the file is zip-encoded, then it is never block splittable since Zip files cannot be read at an arbitrary offset.

This creates an issue: the block-splittable attribute right now is a constant. But, if any file is zip-encoded, then it is never block splittable. Any way to handle this fact?

And, any way to test this behaviour?

Member Author

  1. Drill uses BlockMapBuilder to split a file into blocks if possible. According to its code, it tries to split the file only if blockSplittable is set to true and the file IS NOT compressed. So even if a format is block splittable, a compressed file of that format won't be split (a sketch of this check follows after this reply).

https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/schedule/BlockMapBuilder.java#L115

It looks like most compressed formats are not splittable, which is why Drill considers any compressed file not splittable: https://i.stack.imgur.com/jpprr.jpg

  2. Regarding blockSplittable for the Pcapng format, you are right that it is not splittable, and neither is Pcap. I have updated the value of blockSplittable to false for both formats.

https://blog.marouni.fr/pcap2seq/
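A compact, hedged restatement of the splitting rule described in item 1 above. The class and method names are illustrative; the real check lives in BlockMapBuilder.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Illustrative paraphrase of BlockMapBuilder's behaviour: a file is split into
// block-sized work units only when the format is block splittable AND no
// compression codec matches the file name.
public class SplitDecisionExample {

  private final CompressionCodecFactory codecFactory;

  public SplitDecisionExample(CompressionCodecFactory codecFactory) {
    this.codecFactory = codecFactory;
  }

  public boolean shouldSplit(boolean blockSplittable, Path file) {
    boolean compressed = codecFactory.getCodec(file) != null; // e.g. *.zip, *.gz, *.bz2
    return blockSplittable && !compressed;
  }
}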

@@ -16,7 +16,7 @@
* limitations under the License.
*/
/**
- * For comments on realization of this format plugin look at :
+ * For comments on implementation of this format plugin look at:
Contributor

"look at" --> "see"

Member Author

Updated.

@arina-ielchiieva
Member Author

@paul-rogers addressed code review comments.

@paul-rogers
Contributor

Thanks for the answers to my questions. LGTM.
+1
