[ADAM-565] Upgrade to Parquet filter2 API. #616

Merged
merged 2 commits into bigdatagenomics:master from tomwhite:ADAM-565-filter2 on Mar 16, 2015

Conversation

@tomwhite
Member

tomwhite commented Mar 12, 2015

See discussion on #565

@AmplabJenkins

AmplabJenkins Mar 12, 2015

Can one of the admins verify this patch?

@massie

Member

massie commented Mar 12, 2015

Jenkins, add to whitelist and test please.

Really appreciate the contribution, @tomwhite !

adam-core/src/test/scala/org/bdgenomics/adam/util/ADAMFunSuite.scala
@@ -25,6 +25,7 @@ trait ADAMFunSuite extends SparkFunSuite {
   override val properties: Map[String, String] = Map(("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
     ("spark.kryo.registrator", "org.bdgenomics.adam.serialization.ADAMKryoRegistrator"),
     ("spark.kryoserializer.buffer.mb", "4"),
-    ("spark.kryo.referenceTracking", "true"))
+    ("spark.kryo.referenceTracking", "true"),
+    ("spark.hadoop.parquet.task.side.metadata", "false"))

@massie

massie Mar 12, 2015

Member

We can't use parquet.task.side.metadata with the filter2 API? Or just better to not use it in general?

@tomwhite

tomwhite Mar 13, 2015

Member

Filtering will still work, it's just that row group elimination only occurs on the client side. We could remove this change (leave parquet.task.side.metadata set to true) and the test will still pass.

@massie

massie Mar 13, 2015

Member

In ParquetInputFormat.java, there is the following comment,

/**
   * key to turn on or off task side metadata loading (default true)
   * if true then metadata is read on the task side and some tasks may finish immediately.
   * if false metadata is read on the client which is slower if there is a lot of metadata but tasks will only be spawn if there is work to do.
   */
  public static final String TASK_SIDE_METADATA = "parquet.task.side.metadata";

which I read to mean that setting this to false means metadata is read on the client side instead of the task side. I thought that "task side" is the opposite of "client side" but I could very well be misunderstanding this too.
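For reference, the same switch can be flipped directly on a Hadoop Configuration; a minimal sketch using the key quoted above (the default, true, reads footers in the tasks):

  import org.apache.hadoop.conf.Configuration

  // false = read Parquet footers in the client (driver) before tasks are launched;
  // true (the default) = read them in the tasks themselves.
  val hadoopConf = new Configuration()
  hadoopConf.setBoolean("parquet.task.side.metadata", false)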

@tomwhite

tomwhite Mar 16, 2015

Member

Task side is on the cluster nodes, client side is where the Spark client is running. The reason that task side is the default now for reading metadata is to avoid situations where a large number (thousands) of Parquet files are being processed and the client has to read all the footers before the job can start. See https://issues.apache.org/jira/browse/PARQUET-139

Row group filtering will occur in both cases (client side and task side) - I was wrong to say above that it only occurs on the client side. So I've made a minor adjustment to the code to remove this change, and keep the Parquet default of task side row group elimination. I used the debugger to confirm that row group elimination was still happening in the test.

@massie

Member

massie commented Mar 12, 2015

This PR looks good to me. Not only will we get a performance bump but the code is cleaner and easier to reason about. Thanks, Tom.

@massie

Member

massie commented Mar 16, 2015

Jenkins, add to whitelist and test this please.

@fnothaft

Member

fnothaft commented Mar 16, 2015

I am +1 to merge, preconditioned on one change.

I think we should prefer to remove predicate pushdown from the automagical loaders instead of warning that the predicate pushdown is ignored. I believe that @laserson would favor this (doing this would resolve #609). My reasoning is that ignoring a projection is OK, because you only pay a performance penalty; getting "more" fields back doesn't change the logic of your query, it just hurts your performance. Ignoring a predicate should be verboten, because you'll get a different set of records back than if the predicate had been applied.

I'll comment in the diff to indicate where these changes would go.
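To make the asymmetry concrete, a tiny analogy in plain Scala (not ADAM code): a projection keeps a subset of fields, a predicate keeps a subset of records, so dropping the former only costs I/O while dropping the latter changes the answer.

  val records = Seq(
    Map("contig" -> "chr1", "start" -> 100, "name" -> "a"),
    Map("contig" -> "chr2", "start" -> 200, "name" -> "b"))

  val projection = Set("contig", "start")
  val predicate = (r: Map[String, Any]) => r("contig") == "chr1"

  // Ignoring the projection: still two records, just with an extra field each.
  val projected = records.map(_.filter { case (k, _) => projection(k) })
  // Ignoring the predicate: two records instead of one -- the query result itself changes.
  val filtered = records.filter(predicate)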

@fnothaft

Member

fnothaft commented Mar 16, 2015

I think this looks great, BTW. I'm stoked to move over to the new filter2 API.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

@massie

massie Mar 16, 2015

Member

Just to clarify. You want the predicate argument removed? If so, how will the predicate be available to the final loadParquetVariantAnnotations which can use predicate pushdown? Does it make more sense to fast fail here when there is Some(predicate) but the format is vcf instead of Parquet?

@fnothaft

fnothaft Mar 16, 2015

Member

Yeah; this is the discussion that @laserson raised on #609. Uri makes good arguments on the ticket for that issue. I disagreed at first, but have come around to agree with Uri's arguments.

As an aside, if you are going to fail if a predicate is provided and the format isn't VCF, there's not really any benefit in allowing people to provide a predicate to the automagical function. Essentially, the automagical function just exists as a shim of convenience, so it should be difficult to make the automagical loaders throw an error.

@massie

massie Mar 16, 2015

Member

Looks like these methods are mostly used for the Adam2* functions anyway. Agree it makes sense to just remove predicates altogether since they aren't being used (or useful).
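For readers new to filter2, a minimal sketch of the kind of FilterPredicate the new Option[FilterPredicate] parameters accept (the column name and bounds are made up for illustration; at the time of this PR the classes live in the parquet.filter2.predicate package, which moved to org.apache.parquet.filter2.predicate in later Parquet releases):

  import parquet.filter2.predicate.{ FilterApi, FilterPredicate }

  // Hypothetical column and range, purely illustrative.
  val start = FilterApi.longColumn("start")
  val inRange: FilterPredicate = FilterApi.and(
    FilterApi.gtEq(start, java.lang.Long.valueOf(10000L)),
    FilterApi.lt(start, java.lang.Long.valueOf(20000L)))

Such a value would be passed as Some(inRange) to the Parquet-specific loaders, while the format-sniffing loaders drop the parameter entirely, per the discussion above and #609.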

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
-   predicate: Option[Class[U]] = None,
-   projection: Option[Schema] = None): RDD[Gene] = {
+ def loadGenes(filePath: String,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala
    filePath: String,
-   predicate: Option[Class[U]] = None,
+   predicate: Option[FilterPredicate] = None,

@fnothaft

fnothaft Mar 16, 2015

Member

Suggest removal.

@AmplabJenkins

AmplabJenkins Mar 16, 2015

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/626/

Build result: FAILURE

[...truncated 6 lines...]
Cloning repository https://github.com/bigdatagenomics/adam.git
 > git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse refs/remotes/origin/pr/616/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/616/merge^{commit} # timeout=10
Checking out Revision 49bbf3f095af2bfd8416c891de95565bcc661c0c (refs/remotes/origin/pr/616/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 49bbf3f095af2bfd8416c891de95565bcc661c0c
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.3.0,centos
Triggering ADAM-prb ? 1.0.4,centos
Triggering ADAM-prb ? 2.2.0,centos
ADAM-prb ? 2.3.0,centos completed with result FAILURE
ADAM-prb ? 1.0.4,centos completed with result FAILURE
ADAM-prb ? 2.2.0,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Setting status of 057ea28 to FAILURE with url https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/626/ and message: Merged build finished.
Test FAILed.

@massie

Member

massie commented Mar 16, 2015

This is an issue with Jenkins that @shaneknapp is looking into.

@shaneknapp

Contributor

shaneknapp commented Mar 16, 2015

jenkins, test this please

@tomwhite

Member

tomwhite commented Mar 16, 2015

Added a new commit with @fnothaft's suggested changes. Please let me know if I should squash the commits.

@fnothaft

Member

fnothaft commented Mar 16, 2015

@tomwhite I would probably squash the first three into the [ADAM-565] commit, and would relabel the last commit with [ADAM-609].

@tomwhite

Member

tomwhite commented Mar 16, 2015

Done.

@AmplabJenkins

AmplabJenkins Mar 16, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/627/
Test PASSed.

@fnothaft

Member

fnothaft commented Mar 16, 2015

Thanks @tomwhite! I will merge once the new @AmplabJenkins run completes.

@AmplabJenkins

AmplabJenkins Mar 16, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/628/
Test PASSed.

fnothaft added a commit that referenced this pull request Mar 16, 2015

Merge pull request #616 from tomwhite/ADAM-565-filter2
[ADAM-565] Upgrade to Parquet filter2 API.
[ADAM-609] Remove predicate pushdown from the automagical loaders.

@fnothaft fnothaft merged commit 830f3ed into bigdatagenomics:master Mar 16, 2015

1 check passed

default Merged build finished.
@fnothaft

Member

fnothaft commented Mar 16, 2015

Merged! Thanks @tomwhite!

@tomwhite

Member

tomwhite commented Mar 16, 2015

Thanks for your help @massie and @fnothaft!

@laserson

Contributor

laserson commented Mar 16, 2015

Beautiful! Great work @tomwhite :)
