PARQUET-777: Add Parquet CLI. #384

rdblue · 2016-11-13T22:51:25Z

This adds a new parquet-cli module with an improved command-line tool. The parquet-cli/README.md file has instructions for building and testing locally.

rdblue · 2016-11-13T22:57:29Z

There are still Cloudera copyright headers, which @tomwhite is planning to change to the Apache header on Cloudera's behalf. Thanks, Tom!

xhochy

Looks really good. I'll definitely going to use this for debugging my parquet-cpp generated files.

xhochy · 2016-11-14T12:00:35Z

parquet-cli/README.md

+Then, run the command-line using the `target/dependencies` for the classpath:
+
+```
+java -cp 'target/parquet-cli-1.9.1-runtime.jar:target/dependency/*' org.apache.parquet.cli.Main


Can you make this a version independent string? Directly run straight into while trying it out.

Fixed. Thanks for taking a look!

sadikovi · 2016-11-14T19:05:55Z

@rdblue is parquet-tools going to be removed, and will parquet-cli be available on maven?

rdblue · 2016-11-14T19:32:35Z

@sadikovi, the plan is for parquet-cli to be available through maven in the next release. I'm not sure about removing parquet-tools. I don't use them and I think that this is going to be easier to maintain, but I'm not going to insist they be removed.

julienledem · 2016-12-06T00:09:12Z

Let's not remove parquet-tools right now since the commands are not the same. If people shift to the new tool then we can deprecate and then remove it.

piyushnarang

Took a high level look. I like the direction this PR is going, the revamped cli looks nice! Few comments at a high level:

As-is this PR is really massive :-). I think it would be great if we could break it up to make it iterative. That would allow the community to provide more through reviews as well. One suggestion to do this would be to start with the skeleton Command, BaseCommand list of proposed commands. We could then follow up with concrete implementations of subsets of individual commands. What do you think?
Seems like we're missing thrift / protobuf support right? Would be nice to have thrift / proto equivalents to some of the avro command variants.
If possible it would be nice to add some unit tests for each of the commands.

piyushnarang · 2016-12-07T01:36:18Z

parquet-cli/pom.xml

+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet</artifactId>
+    <relativePath>../pom.xml</relativePath>
+    <version>1.9.1-SNAPSHOT</version>


{project.version} ?

This matches the other module POM files and is updated by Maven's release plugin. I don't have much of an opinion either way on best practice, but I think that this should match the others for now. Feel free to open a follow-up issue to change over to ${project.version} if you think we should.

I think this is the one place where you can't put project.version since project.version come from the parent

piyushnarang · 2016-12-07T01:50:37Z

parquet-cli/src/main/java/org/apache/parquet/cli/BaseCommand.java

+      case PARQUET:
+        Configuration conf = new Configuration(getConf());
+        // TODO: add these to the reader builder
+        AvroReadSupport.setRequestedProjection(conf, projection);


don't we also need thrift / protobuf support? Or am I missing something?

We don't really need thrift and protobuf support. It would be nice to be able to translate those formats into Parquet in the CLI tool, but it doesn't have to do everything yet.

piyushnarang · 2016-12-07T01:55:50Z

parquet-cli/src/main/java/org/apache/parquet/cli/BaseCommand.java

+import java.util.List;
+import java.util.NoSuchElementException;
+
+public abstract class BaseCommand implements Command, Configurable {


can we decouple things in this class a bit to make it more readable? Seems like there are a few things, all the FS support (create file, path resolution etc) which could be pulled out into a separate class. Classloader stuff could be pulled out, parquet avro record reader could be broken out, SeekableFSDataInputStream could be broken out too.

Ideally we should only define the methods here that are going to be overridden but specific command classes. Other pieces of code that specific sub commands need could be used via composition?

The FS operations are convenience methods to help commands. That's basically the purpose of this base class and they require access to the Configuration set on the command instance. I think it makes sense to keep those here, so they're easily accessed by commands.

I agree that some of the static classes can be refactored and placed in a util package.

Removed the static embedded classes.

piyushnarang · 2016-12-07T01:58:09Z

parquet-cli/src/main/java/org/apache/parquet/cli/Help.java

+      for (String cmd : helpCommands) {
+        JCommander commander = jc.getCommands().get(cmd);
+        if (commander == null) {
+          console.error("Unknown command: {}", cmd);


maybe list the set of known commands here?

Fixed. It now prints the generic help after the log message.

piyushnarang · 2016-12-07T02:00:25Z

parquet-cli/src/main/java/org/apache/parquet/cli/Util.java

+  private static final long GB = 1024 * MB;
+
+  public static String humanReadable(float bytes) {
+    if (bytes > GB) {


Added. It won't get used since this is mostly for printing sizes inside of a row group, but it will be good to have if we end up printing total file size later.

piyushnarang · 2016-12-07T18:23:24Z

parquet-cli/src/main/java/org/apache/parquet/cli/commands/CheckParquet251Command.java

+import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;
+
+@Parameters(commandDescription = "Check Parquet files for corrupt page and column stats (PARQUET-251)")
+public class CheckParquet251Command extends BaseCommand {


how about naming this check corrupt stats instead of check parquet 251?

This only checks for PARQUET-251, where min and max aren't the actual min and max values for some String columns. It doesn't check, for example, that the correct sort order was used or that the min and max for numeric columns are valid. Because it is limited to checking for PARQUET-251, I'd leave the name as it is.

sadikovi · 2016-12-07T20:49:09Z

parquet-cli/src/main/java/org/apache/parquet/cli/Util.java

+    }
+  }
+
+  public static String humanReadable(long bytes) {


Should method be like snippet below instead of copying logic (there is also a possible typo around %.02f B in above case and %d B below)?

public static String humanReadable(long bytes) { float fbytes = (float) bytes; return humanReadable(fbytes); }

The bytes argument in the one above is a float because it is typically used as an average, showing the average number of bytes for values in a row group or page. That's why the float version uses %.02f and not %d. This function is called when the number of bytes is exact, so it uses %d.

rdblue · 2016-12-07T22:15:22Z

@piyushnarang, thanks for taking the time to review. I'll fix a few things that you pointed out and I've responded to some comments to clarify.

I think it would be great if we could break it up to make it iterative.

I don't think I have the time to break this into a series of commits. Perhaps it would be helpful to review a section at a time along the lines of the commits you suggested? It probably isn't very valuable to break it across commits in the long run, unless there are a huge number of changes that are needed.

Seems like we're missing thrift / protobuf support right?

This shouldn't hold up what's already done. It would be great if someone added the ability to read thrift and protobuf, but that can be done later.

If possible it would be nice to add some unit tests for each of the commands.

It is possible to test, and a good idea in the long run. I wanted to get this out to users sooner than I would be able to find time to thoroughly test. Is it okay to focus on areas that must be tested to in order to get this committed, and add tests as it evolves later?

isnotinvain · 2016-12-08T21:58:15Z

I want to echo what @piyushnarang has commented here, if not for this PR but for future ones.

This PR is really large. The larger a PR is, the less likely it is to be thoroughly reviewed. I think we really need to aim for smaller / more incremental PRs than this in general. I don't know if it's worth the time to split this one up now that it's been done, but I think we should avoid setting a precedent of PRs this large being common, especially for all new code (as opposed to large refactors that sometimes have to be large).

Additionally, we need to also make sure that new code is accompanied by matching new tests. Wanting to get something into the hands of users early w/o the matching tests is also not something I think we should set a precedent for either. I don't think that means 100% test coverage, but there should be test coverage for the most important parts.

Was this a project developed separately, and now it's being imported into the parquet codebase? That sort of situation is difficult, as it may have been built incrementally and reviewed thoroughly, and it's odd to break it up and import it piece by piece after the fact, so I do understand that.

rdblue · 2016-12-08T23:08:29Z

@isnotinvain, this was developed as a separate project. I originally thought it would be better as a separate Parquet repository since it isn't really a part of Parquet MR (but does depend on it). I think it's more important to get this project released than to add thorough unit tests at this point because we've been using it internally and know that it works for the most part. It also provides some great help for people experimenting with Parquet that you can't get elsewhere, like CSV, JSON, and Avro file conversions.

I completely agree that we should aim for smaller PRs with test coverage. I think that's more the norm, like #390 that is a series of commits and has thorough tests for the ByteBufferInputStream classes.

julienledem

My main comment would be that there should be tests for at least the utility methods (dynamic methods, schema stuff etc) If some code was borrowed from another project, probably the tests should come along.

julienledem · 2017-01-26T20:53:20Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/SeekableFSDataInputStream.java

+/**
+ * A wrapper for FSDataInputStream that implements Avro's SeekableInput.
+ */
+public class SeekableFSDataInputStream extends InputStream implements SeekableInput {


should this be in parquet-avro?

julienledem · 2017-01-26T20:54:34Z

parquet-cli/pom.xml

+    <groupId>org.apache.parquet</groupId>
+    <artifactId>parquet</artifactId>
+    <relativePath>../pom.xml</relativePath>
+    <version>1.9.1-SNAPSHOT</version>


I think this is the one place where you can't put project.version since project.version come from the parent

julienledem · 2017-01-26T20:55:29Z

parquet-cli/pom.xml

+              <relocations>
+                <relocation>
+									<!-- relocate Avro in the runtime jar to avoid conflicts with
+                       on-cluster Avro versions.


Guava shouldn't be necessary because Hadoop uses 11.0.2 and this specifies 11.0.

julienledem · 2017-01-26T20:56:47Z

parquet-cli/src/main/java/org/apache/parquet/cli/BaseCommand.java

+      case PARQUET:
+        Configuration conf = new Configuration(getConf());
+        // TODO: add these to the reader builder
+        AvroReadSupport.setRequestedProjection(conf, projection);


julienledem · 2017-01-26T21:05:06Z

parquet-cli/src/main/java/org/apache/parquet/cli/Help.java

+              console.info("\n    {}", example);
+            } else {
+              console.info("    {} {} {}",
+                  new Object[] {programName, cmd, example});


console doesn't have varargs methods?

julienledem · 2017-01-26T21:14:30Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/Expressions.java

+import java.util.regex.Pattern;
+
+
+public class Expressions {


julienledem · 2017-01-26T21:14:46Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/Expressions.java

+public class Expressions {
+  private static final Pattern NUMERIC_RE = Pattern.compile("^\\d+$");
+
+  public static Object select(Schema schema, Object datum, String path) {


javadoc? It would make it easier to follow what this does

julienledem · 2017-01-26T21:15:52Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/Expressions.java

+      case MAP:
+        Preconditions.checkArgument(datum instanceof Map,
+            "Not a map: %s", datum);
+        Map<Object, Object> map = (Map<Object, Object>) datum;


you can also use Map<?,?> and cast on get to avoid warnings

julienledem · 2017-01-26T21:18:05Z

parquet-cli/src/main/java/org/apache/parquet/cli/util/Codecs.java

+    }
+  }
+
+  public static CodecFactory avroCodec(String codec) {


in parquet-avro?

julienledem · 2017-01-26T21:19:02Z

parquet-cli/src/main/java/org/apache/parquet/cli/json/AvroJsonReader.java

+
+  @Override
+  public Iterator<E> iterator() {
+    return this;


the contract of iterable would be to get a new one everytime

julienledem · 2017-01-26T21:27:35Z

@isnotinvain @piyushnarang In general I agree that we should avoid large patches and get smaller ones. However this particular patch doesn't touch any of the existing parquet code and by itself has a lot of independent parts that are easy to review independently in one patch (various commands, some utility packages, etc) and I don't think that it would help much to have this particular patch split into smaller ones since it would basically be split by files. It would be more work on Ryan's side with very little benefit on the reviewer end. So I think we can make an exception here and review as is. I did a first pass review above.

julienledem · 2017-05-12T22:45:46Z

we should merge this.

xhochy · 2017-05-13T07:03:26Z

+1, I'm already using this regularly!

rdblue · 2017-05-18T20:00:38Z

Great! I'll rebase and commit it shortly. I think I was working on tests when I last had a chance... so I'll make sure that's reasonably complete.

xhochy requested changes Nov 14, 2016

View reviewed changes

xhochy approved these changes Nov 14, 2016

View reviewed changes

piyushnarang suggested changes Dec 7, 2016

View reviewed changes

sadikovi reviewed Dec 7, 2016

View reviewed changes

rdblue closed this Dec 7, 2016

rdblue reopened this Dec 7, 2016

julienledem reviewed Jan 26, 2017

View reviewed changes

julienledem approved these changes May 12, 2017

View reviewed changes

rdblue force-pushed the PARQUET-777-add-parquet-cli branch 2 times, most recently from a7515ba to ce807af Compare May 19, 2017 17:15

rdblue and others added 5 commits July 28, 2017 15:33

PARQUET-777: Add Parquet CLI.

d718363

Replace source file headers with Apache header.

aed223d

PARQUET-777: Update README.md with better instructions.

f953fd4

PARQUET-777: Update for review feedback.

affdfb9

PARQUET-777: Move dynamic support classes, add tests.

de49eff

rdblue force-pushed the PARQUET-777-add-parquet-cli branch from ce807af to de49eff Compare July 28, 2017 22:33

asfgit closed this in ddbeb4d Jul 28, 2017

rdblue deleted the PARQUET-777-add-parquet-cli branch July 28, 2017 23:26

PARQUET-777: Add Parquet CLI. #384

PARQUET-777: Add Parquet CLI. #384

Conversation

rdblue commented Nov 13, 2016 • edited

rdblue commented Nov 13, 2016

xhochy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sadikovi commented Nov 14, 2016

rdblue commented Nov 14, 2016

julienledem commented Dec 6, 2016

piyushnarang left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdblue commented Dec 7, 2016

isnotinvain commented Dec 8, 2016

rdblue commented Dec 8, 2016

julienledem left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

julienledem commented Jan 26, 2017

julienledem commented May 12, 2017

xhochy commented May 13, 2017

rdblue commented May 18, 2017

rdblue commented Nov 13, 2016 •

edited