refactor RDD loading; explicitly load alignments #468

Merged
fnothaft merged 1 commit into bigdatagenomics:master from ryan-williams/loading on Nov 14, 2014

ryan-williams commented Nov 7, 2014

There’s really no case where we load a generic Parquet file of
SpecificRecords without a strong requirement on their type. On the
other hand, there is a case where we know which type we want
(AlignmentRecord) but don’t know which code path to read it through
(sam, bam, ifq, fq, parquet, …).

Here I’ve made the caller say explicitly when it wants
AlignmentRecords (previously it did this by labeling the type of the
return value), while still benefiting from the smarts around
file-extension inference.

Parquet reads with a known type can still go through one place, so
that code does not need to be duplicated; it relies only on
SpecificRecord.
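
As a rough illustration of the split described above, here is a minimal Python sketch. ADAM itself is Scala, and every name below (`load_alignments`, `load_parquet`, the reader stubs, the `Record` alias) is hypothetical and not ADAM's API. The caller asks for alignments explicitly, the file extension picks the code path, and Parquet reads of a known type share one generic entry point. Treating any unrecognized extension as Parquet is an assumption made for the sketch.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, List

# Hypothetical stand-in for a record; not ADAM's AlignmentRecord.
Record = Dict[str, str]


def load_parquet(path: str, record_type: str) -> List[Record]:
    """Single generic entry point for Parquet reads of a known type (stub)."""
    return [{"source": "parquet", "type": record_type, "path": path}]


def _load_sam_or_bam(path: str) -> List[Record]:
    """Stub for a SAM/BAM reader."""
    return [{"source": "sam/bam", "type": "AlignmentRecord", "path": path}]


def _load_interleaved_fastq(path: str) -> List[Record]:
    """Stub for an interleaved FASTQ reader."""
    return [{"source": "ifq", "type": "AlignmentRecord", "path": path}]


def _load_fastq(path: str) -> List[Record]:
    """Stub for a FASTQ reader."""
    return [{"source": "fq", "type": "AlignmentRecord", "path": path}]


# Extension -> reader table; the extensions are the ones named in the description.
_ALIGNMENT_READERS: Dict[str, Callable[[str], List[Record]]] = {
    ".sam": _load_sam_or_bam,
    ".bam": _load_sam_or_bam,
    ".ifq": _load_interleaved_fastq,
    ".fq": _load_fastq,
}


def load_alignments(path: str) -> List[Record]:
    """The caller states it wants alignments; the extension selects the reader."""
    suffix = PurePosixPath(path).suffix.lower()
    reader = _ALIGNMENT_READERS.get(suffix)
    if reader is not None:
        return reader(path)
    # Assumption for this sketch: anything else is read as Parquet.
    return load_parquet(path, "AlignmentRecord")
```

With this shape, a call such as `load_alignments("reads.sam")` never reaches the Parquet reader, while other record types would call `load_parquet` directly with their own type.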

fnothaft commented Nov 7, 2014

+1, I think this is a good idea. We should extend it to variants as well, at a later time.

AmplabJenkins commented Nov 7, 2014

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/382/

Build result: FAILURE

GitHub pull request #468 of commit b9b4718 automatically merged.
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-slave-01 (centos) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/468/merge^{commit} # timeout=10
Checking out Revision 81f2b1f (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 81f2b1f
 > git rev-list a34074082d54668273cacb5528d8aaed1b9c6ea8 # timeout=10
Triggering ADAM-prb » 1.0.4,centos
Triggering ADAM-prb » 2.2.0,centos
Triggering ADAM-prb » 2.3.0,centos
ADAM-prb » 1.0.4,centos completed with result FAILURE
ADAM-prb » 2.2.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,centos completed with result FAILURE
Test FAILed.

massie commented Nov 7, 2014

Jenkins retest this.

ryan-williams commented Nov 7, 2014

@massie do you know why Jenkins failed? or why the retest? (I thought tests were passing for me locally; trying again now as well)

fnothaft commented Nov 7, 2014

@ryan-williams it looks like the SAM loading tests are failing:

Error Message

Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
Stacktrace

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/reads124698313342189278134.sam is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [58, 55, 53, 10]
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:190)
      at parquet.hadoop.ParquetFileReader.readAllFootersInParallelUsingSummaryFiles(ParquetFileReader.java:146)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:458)
      at parquet.hadoop.ParquetInputFormat.getFooters(ParquetInputFormat.java:443)
      at parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:344)
      at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
      at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
      at scala.Option.getOrElse(Option.scala:120)
      at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
      at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
      at org.bdgenomics.adam.plugins.Take10Plugin.run(Take10Plugin.scala:30)
      at org.bdgenomics.adam.cli.PluginExecutor.run(PluginExecutor.scala:121)
      at 
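
For readers decoding the byte values in that error: a Parquet file ends with the 4-byte ASCII magic `PAR1`, and the expected and found arrays in the message are just byte values. The snippet below is a minimal Python sketch of that tail check, not the Parquet library's code; the helper name `tail_magic_matches` is hypothetical. The bytes actually found decode to `:75` followed by a newline, which looks like the end of a plain-text SAM file and is consistent with the SAM input having been sent down the Parquet code path.

```python
PARQUET_MAGIC = b"PAR1"  # trailing magic bytes of a Parquet file


def tail_magic_matches(data: bytes) -> bool:
    """Return True if the last four bytes equal the Parquet magic."""
    return data[-4:] == PARQUET_MAGIC


# Byte values copied from the error message above.
expected = bytes([80, 65, 82, 49])
found = bytes([58, 55, 53, 10])

print(expected)  # b'PAR1'
print(found)     # b':75\n'
print(tail_magic_matches(found))  # False
```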
ryan-williams commented Nov 7, 2014

yea seems real, let me look, I swear they were passing very shortly before I pushed

ryan-williams commented Nov 7, 2014

cool, that should have fixed it, just needed to bring PluginExecutor into the future

AmplabJenkins commented Nov 7, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/383/
Test PASSed.

ryan-williams commented Nov 14, 2014

bump! this good to go?

fnothaft commented Nov 14, 2014

Ah, yes! Thanks for the ping. Can you rebase? I'll merge once rebased.

ryan-williams commented Nov 14, 2014

I believe it is now rebased

fnothaft added a commit that referenced this pull request Nov 14, 2014

Merge pull request #468 from ryan-williams/loading
refactor RDD loading; explicitly load alignments

fnothaft merged commit a2673e1 into bigdatagenomics:master Nov 14, 2014

1 check was pending

default Merged build started.
fnothaft commented Nov 14, 2014

Merged! Thanks @ryan-williams!

AmplabJenkins commented Nov 14, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/399/
Test PASSed.
