Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1481] Refactor ADAMContext loadXxx methods for consistency #1487

Merged
merged 8 commits into from Apr 19, 2017

Conversation

@heuermh
Copy link
Member

@heuermh heuermh commented Apr 12, 2017

Fixes #1481.

I flagged most of the pathName method parameter param doc strings with (todo: confirm) because it is not immediately clear which support glob/directories and which do not.

There is also some inconsistency wrt when .gz is supported and when it is not.

@coveralls
Copy link

@coveralls coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.03%) to 81.667% when pulling 072ccf5 on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1943/
Test PASSed.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 13, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1944/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse origin/pr/1487/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains b99480b8032090dce76fdfb1cce900bb7800a5ff # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/merge^{commit} # timeout=10Checking out Revision b99480b8032090dce76fdfb1cce900bb7800a5ff (origin/pr/1487/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b99480b8032090dce76fdfb1cce900bb7800a5ffFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls
Copy link

@coveralls coveralls commented Apr 13, 2017

Coverage Status

Coverage decreased (-0.1%) to 81.508% when pulling 8082b2c on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@coveralls
Copy link

@coveralls coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.04%) to 81.674% when pulling 8082b2c on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 13, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1945/
Test PASSed.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Apr 13, 2017

@heuermh After merging #1482, this branch has conflicts. Can you rebase?

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 13, 2017

Will do. I have more changes coming.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Apr 13, 2017

Sounds good! Would you like me to hold off on reviewing until you're done with the changes?

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 13, 2017

If you have feedback on some of the open questions now, that would be helpful.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Apr 13, 2017

What are the open questions? Are they the following? If so, answers below:

I flagged most of the pathName method parameter param doc strings with (todo: confirm) because it is not immediately clear which support glob/directories and which do not.

All of the loadX methods should support globs and directories.

There is also some inconsistency wrt when .gz is supported and when it is not.

  • GZIPed VCF is loaded (and even split if BGZIPed)
  • All of the files that use sc.textFile (FASTA, all of the feature formats) will load both GZIPed files (as a single split) and BZIP2 files
  • FileInputFormat (FASTQ) will load GZIPed data, but there is an issue, see #1457

I'm not sure about GZIPed SAM.

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 13, 2017

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

Others forward to loadAvro and sc.newAPIHadoopFile without going through getFsAndFiles and related, do those support globs and directories? E.g. loadInterleavedFastq forwards pathName: String to sc.newAPIHadoopFile.

There is also some inconsistency wrt when .gz is supported and when it is not.

Thanks, we need to make sure .gz and .bz2 extensions pass format guessing checks where appropriate then.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Apr 13, 2017

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L369-L375 ;)

Others forward to loadAvro and sc.newAPIHadoopFile without going through getFsAndFiles and related, do those support globs and directories? E.g. loadInterleavedFastq forwards pathName: String to sc.newAPIHadoopFile.

In general, they should, unless the underlying InputFormat is doing something funny.

Thanks, we need to make sure .gz and .bz2 extensions pass format guessing checks where appropriate then.

+1

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 13, 2017

All of the loadX methods should support globs and directories.

Some clearly do not, loadIndexedBam, for example.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L369-L375 ;)

Dang it! Who made that work and forgot to update the @param docs? :)

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Apr 13, 2017

Dang it! Who made that work and forgot to update the @param docs? :)

My bad! ;)

@coveralls
Copy link

@coveralls coveralls commented Apr 13, 2017

Coverage Status

Coverage increased (+0.04%) to 81.674% when pulling 579126a on heuermh:load-cleanup into 93b32c6 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 13, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1946/

Build result: FAILURE

[...truncated 16 lines...] > /home/jenkins/git2/bin/git rev-parse 579126a^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a --contains 579126a # timeout=10 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1487/head^{commit} # timeout=10Checking out Revision 579126a (origin/pr/1487/head) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 579126af17f2b882c715080dd75e309bfda3ebf5First time build. Skipping changelog.Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.3.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.11,2.0.0,centosTriggering ADAM-prb ? 2.6.0,2.10,2.0.0,centosTriggering ADAM-prb ? 2.3.0,2.10,1.6.1,centosTriggering ADAM-prb ? 2.6.0,2.11,1.6.1,centosADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.3.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,2.0.0,centos completed with result FAILUREADAM-prb ? 2.6.0,2.10,2.0.0,centos completed with result FAILUREADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result FAILUREADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh heuermh force-pushed the heuermh:load-cleanup branch from 579126a to 6c988b7 Apr 14, 2017
@coveralls
Copy link

@coveralls coveralls commented Apr 14, 2017

Coverage Status

Coverage increased (+0.3%) to 81.854% when pulling c7814c9 on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1947/
Test PASSed.

@coveralls
Copy link

@coveralls coveralls commented Apr 14, 2017

Coverage Status

Coverage increased (+0.2%) to 81.809% when pulling f3d1d3c on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1948/
Test PASSed.

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 14, 2017

This one is ready to review!

Copy link
Member

@fnothaft fnothaft left a comment

Let's check for compressors using the Hadoop factory, but otherwise this LGTM! Thanks @heuermh! This is a great cleanup.

* @param pathName The path name to load SAM/BAM/CRAM formatted alignments from. Globs/directories are
* supported (todo: confirm).
* @param pathName The path name to load SAM/BAM/CRAM formatted alignment records from.
* Globs/directories are supported (todo: confirm).

This comment has been minimized.

@fnothaft

fnothaft Apr 15, 2017
Member

Can confirm; globs/directories are supported.

*/
private[adam] def isBedExt(pathName: String): Boolean = {
pathName.endsWith(".bed") ||
pathName.endsWith(".bed.gz") ||

This comment has been minimized.

@fnothaft

fnothaft Apr 15, 2017
Member

As an aside, for all of the compressed file format checks, we should use Hadoop's CompressionCodecFactory to check if the path is a compressed file. This allows users to swap in additional compression codecs.

This comment has been minimized.

@heuermh

heuermh Apr 15, 2017
Author Member

ok, I suppose we could do an isCompressed check in ADAMContext, since it has access to the Hadoop configuration, then trim off the extension before checking with FileExtensions

@coveralls
Copy link

@coveralls coveralls commented Apr 18, 2017

Coverage Status

Coverage increased (+0.4%) to 81.942% when pulling b021e2c on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 18, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1954/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

This LGTM from a code perspective, but I think we should mention in each of the docstrings that we do support compressed text files, and where that comes from. I think we're very close here; thanks for driving this @heuermh!

* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .fa/.fasta as FASTA format,
* * .fq/.fastq as FASTQ format, and
* * .ifq as interleaved FASTQ format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Should mention that we read the configured compression codecs from the Hadoop config for this, which by default include .gz and .bz2, but can include more.

* If the path name has a .fa/.fa.gz/.fa.bz2/.fasta/.fasta.gz/.fasta.bz2 extension,
* load as FASTA format. Else, fall back to Parquet + Avro.
* If the path name has a .fa/.fasta extension, load as FASTA format.
* Else, fall back to Parquet + Avro.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

@@ -92,7 +92,7 @@ class JavaADAMContext(val ac: ADAMContext) extends Serializable {
*
* Loads path names ending in:
* * .bam/.cram/.sam as BAM/CRAM/SAM format and
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .ifq as interleaved FASTQ format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

* * .gff3 as GFF3 format,
* * .gtf/.gff as GTF/GFF2 format,
* * .narrow[pP]eak as NarrowPeak format, and
* * .interval_list as IntervalList format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .fa/.fasta as FASTA format,
* * .fq/.fastq as FASTQ format, and
* * .ifq as interleaved FASTQ format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

@@ -1721,7 +1744,7 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
*
* Loads path names ending in:
* * .bam/.cram/.sam as BAM/CRAM/SAM format and
* * .ifq/.ifq.gz/.ifq.bz2 as interleaved FASTQ format.
* * .ifq as interleaved FASTQ format.

This comment has been minimized.

@fnothaft

fnothaft Apr 19, 2017
Member

Ditto.

@coveralls
Copy link

@coveralls coveralls commented Apr 19, 2017

Coverage Status

Coverage increased (+0.4%) to 82.018% when pulling 9b13e6e on heuermh:load-cleanup into 6c32a02 on bigdatagenomics:master.

@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Apr 19, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1955/
Test PASSed.

Copy link
Member

@fnothaft fnothaft left a comment

LGTM! Do you want to squash in any specific way, or would you like me to hit the green squash button?

@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 19, 2017

Green squash is fine!

@fnothaft fnothaft merged commit 93d17b4 into bigdatagenomics:master Apr 19, 2017
3 checks passed
3 checks passed
codacy/pr Good work! A positive pull request.
Details
coverage/coveralls Coverage increased (+0.4%) to 82.018%
Details
default Merged build finished.
Details
@heuermh heuermh deleted the heuermh:load-cleanup branch Apr 19, 2017
@heuermh
Copy link
Member Author

@heuermh heuermh commented Apr 19, 2017

Thank you!

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.