Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-777: Add Parquet CLI. #384

Closed
wants to merge 5 commits into from

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Nov 13, 2016

This adds a new parquet-cli module with an improved command-line tool. The parquet-cli/README.md file has instructions for building and testing locally.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 13, 2016

There are still Cloudera copyright headers, which @tomwhite is planning to change to the Apache header on Cloudera's behalf. Thanks, Tom!

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. I'll definitely going to use this for debugging my parquet-cpp generated files.

Then, run the command-line using the `target/dependencies` for the classpath:

```
java -cp 'target/parquet-cli-1.9.1-runtime.jar:target/dependency/*' org.apache.parquet.cli.Main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this a version independent string? Directly run straight into while trying it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks for taking a look!

@sadikovi
Copy link

@rdblue is parquet-tools going to be removed, and will parquet-cli be available on maven?

@rdblue
Copy link
Contributor Author

rdblue commented Nov 14, 2016

@sadikovi, the plan is for parquet-cli to be available through maven in the next release. I'm not sure about removing parquet-tools. I don't use them and I think that this is going to be easier to maintain, but I'm not going to insist they be removed.

@julienledem
Copy link
Member

Let's not remove parquet-tools right now since the commands are not the same. If people shift to the new tool then we can deprecate and then remove it.

Copy link

@piyushnarang piyushnarang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a high level look. I like the direction this PR is going, the revamped cli looks nice! Few comments at a high level:

  1. As-is this PR is really massive :-). I think it would be great if we could break it up to make it iterative. That would allow the community to provide more through reviews as well. One suggestion to do this would be to start with the skeleton Command, BaseCommand list of proposed commands. We could then follow up with concrete implementations of subsets of individual commands. What do you think?
  2. Seems like we're missing thrift / protobuf support right? Would be nice to have thrift / proto equivalents to some of the avro command variants.
  3. If possible it would be nice to add some unit tests for each of the commands.

<groupId>org.apache.parquet</groupId>
<artifactId>parquet</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.9.1-SNAPSHOT</version>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{project.version} ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the other module POM files and is updated by Maven's release plugin. I don't have much of an opinion either way on best practice, but I think that this should match the others for now. Feel free to open a follow-up issue to change over to ${project.version} if you think we should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the one place where you can't put project.version since project.version come from the parent

case PARQUET:
Configuration conf = new Configuration(getConf());
// TODO: add these to the reader builder
AvroReadSupport.setRequestedProjection(conf, projection);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we also need thrift / protobuf support? Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't really need thrift and protobuf support. It would be nice to be able to translate those formats into Parquet in the CLI tool, but it doesn't have to do everything yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

import java.util.List;
import java.util.NoSuchElementException;

public abstract class BaseCommand implements Command, Configurable {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we decouple things in this class a bit to make it more readable? Seems like there are a few things, all the FS support (create file, path resolution etc) which could be pulled out into a separate class. Classloader stuff could be pulled out, parquet avro record reader could be broken out, SeekableFSDataInputStream could be broken out too.

Ideally we should only define the methods here that are going to be overridden but specific command classes. Other pieces of code that specific sub commands need could be used via composition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FS operations are convenience methods to help commands. That's basically the purpose of this base class and they require access to the Configuration set on the command instance. I think it makes sense to keep those here, so they're easily accessed by commands.

I agree that some of the static classes can be refactored and placed in a util package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the static embedded classes.

for (String cmd : helpCommands) {
JCommander commander = jc.getCommands().get(cmd);
if (commander == null) {
console.error("Unknown command: {}", cmd);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe list the set of known commands here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. It now prints the generic help after the log message.

private static final long GB = 1024 * MB;

public static String humanReadable(float bytes) {
if (bytes > GB) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TB too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. It won't get used since this is mostly for printing sizes inside of a row group, but it will be good to have if we end up printing total file size later.

import static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.BINARY;

@Parameters(commandDescription = "Check Parquet files for corrupt page and column stats (PARQUET-251)")
public class CheckParquet251Command extends BaseCommand {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about naming this check corrupt stats instead of check parquet 251?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only checks for PARQUET-251, where min and max aren't the actual min and max values for some String columns. It doesn't check, for example, that the correct sort order was used or that the min and max for numeric columns are valid. Because it is limited to checking for PARQUET-251, I'd leave the name as it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

}
}

public static String humanReadable(long bytes) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should method be like snippet below instead of copying logic (there is also a possible typo around %.02f B in above case and %d B below)?

public static String humanReadable(long bytes) {
  float fbytes = (float) bytes;
  return humanReadable(fbytes);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bytes argument in the one above is a float because it is typically used as an average, showing the average number of bytes for values in a row group or page. That's why the float version uses %.02f and not %d. This function is called when the number of bytes is exact, so it uses %d.

@rdblue
Copy link
Contributor Author

rdblue commented Dec 7, 2016

@piyushnarang, thanks for taking the time to review. I'll fix a few things that you pointed out and I've responded to some comments to clarify.

I think it would be great if we could break it up to make it iterative.

I don't think I have the time to break this into a series of commits. Perhaps it would be helpful to review a section at a time along the lines of the commits you suggested? It probably isn't very valuable to break it across commits in the long run, unless there are a huge number of changes that are needed.

Seems like we're missing thrift / protobuf support right?

This shouldn't hold up what's already done. It would be great if someone added the ability to read thrift and protobuf, but that can be done later.

If possible it would be nice to add some unit tests for each of the commands.

It is possible to test, and a good idea in the long run. I wanted to get this out to users sooner than I would be able to find time to thoroughly test. Is it okay to focus on areas that must be tested to in order to get this committed, and add tests as it evolves later?

@rdblue rdblue closed this Dec 7, 2016
@rdblue rdblue reopened this Dec 7, 2016
@isnotinvain
Copy link
Contributor

I want to echo what @piyushnarang has commented here, if not for this PR but for future ones.

This PR is really large. The larger a PR is, the less likely it is to be thoroughly reviewed. I think we really need to aim for smaller / more incremental PRs than this in general. I don't know if it's worth the time to split this one up now that it's been done, but I think we should avoid setting a precedent of PRs this large being common, especially for all new code (as opposed to large refactors that sometimes have to be large).

Additionally, we need to also make sure that new code is accompanied by matching new tests. Wanting to get something into the hands of users early w/o the matching tests is also not something I think we should set a precedent for either. I don't think that means 100% test coverage, but there should be test coverage for the most important parts.

Was this a project developed separately, and now it's being imported into the parquet codebase? That sort of situation is difficult, as it may have been built incrementally and reviewed thoroughly, and it's odd to break it up and import it piece by piece after the fact, so I do understand that.

@rdblue
Copy link
Contributor Author

rdblue commented Dec 8, 2016

@isnotinvain, this was developed as a separate project. I originally thought it would be better as a separate Parquet repository since it isn't really a part of Parquet MR (but does depend on it). I think it's more important to get this project released than to add thorough unit tests at this point because we've been using it internally and know that it works for the most part. It also provides some great help for people experimenting with Parquet that you can't get elsewhere, like CSV, JSON, and Avro file conversions.

I completely agree that we should aim for smaller PRs with test coverage. I think that's more the norm, like #390 that is a series of commits and has thorough tests for the ByteBufferInputStream classes.

Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment would be that there should be tests for at least the utility methods (dynamic methods, schema stuff etc) If some code was borrowed from another project, probably the tests should come along.

/**
* A wrapper for FSDataInputStream that implements Avro's SeekableInput.
*/
public class SeekableFSDataInputStream extends InputStream implements SeekableInput {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be in parquet-avro?

<groupId>org.apache.parquet</groupId>
<artifactId>parquet</artifactId>
<relativePath>../pom.xml</relativePath>
<version>1.9.1-SNAPSHOT</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the one place where you can't put project.version since project.version come from the parent

<relocations>
<relocation>
<!-- relocate Avro in the runtime jar to avoid conflicts with
on-cluster Avro versions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and guava?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guava shouldn't be necessary because Hadoop uses 11.0.2 and this specifies 11.0.

case PARQUET:
Configuration conf = new Configuration(getConf());
// TODO: add these to the reader builder
AvroReadSupport.setRequestedProjection(conf, projection);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

console.info("\n {}", example);
} else {
console.info(" {} {} {}",
new Object[] {programName, cmd, example});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

console doesn't have varargs methods?

import java.util.regex.Pattern;


public class Expressions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc?

public class Expressions {
private static final Pattern NUMERIC_RE = Pattern.compile("^\\d+$");

public static Object select(Schema schema, Object datum, String path) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc? It would make it easier to follow what this does

case MAP:
Preconditions.checkArgument(datum instanceof Map,
"Not a map: %s", datum);
Map<Object, Object> map = (Map<Object, Object>) datum;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also use Map<?,?> and cast on get to avoid warnings

}
}

public static CodecFactory avroCodec(String codec) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in parquet-avro?


@Override
public Iterator<E> iterator() {
return this;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the contract of iterable would be to get a new one everytime

@julienledem
Copy link
Member

@isnotinvain @piyushnarang In general I agree that we should avoid large patches and get smaller ones. However this particular patch doesn't touch any of the existing parquet code and by itself has a lot of independent parts that are easy to review independently in one patch (various commands, some utility packages, etc) and I don't think that it would help much to have this particular patch split into smaller ones since it would basically be split by files. It would be more work on Ryan's side with very little benefit on the reviewer end. So I think we can make an exception here and review as is. I did a first pass review above.

@julienledem
Copy link
Member

we should merge this.

@xhochy
Copy link
Member

xhochy commented May 13, 2017

+1, I'm already using this regularly!

@rdblue
Copy link
Contributor Author

rdblue commented May 18, 2017

Great! I'll rebase and commit it shortly. I think I was working on tests when I last had a chance... so I'll make sure that's reasonably complete.

@rdblue rdblue force-pushed the PARQUET-777-add-parquet-cli branch 2 times, most recently from a7515ba to ce807af Compare May 19, 2017 17:15
@rdblue rdblue force-pushed the PARQUET-777-add-parquet-cli branch from ce807af to de49eff Compare July 28, 2017 22:33
@asfgit asfgit closed this in ddbeb4d Jul 28, 2017
@rdblue rdblue deleted the PARQUET-777-add-parquet-cli branch July 28, 2017 23:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants