Resolves various single file save/header issues #1104

Merged
merged 7 commits into from Sep 7, 2016

Conversation

Projects
None yet
4 participants
@fnothaft
Member

fnothaft commented Aug 8, 2016

Resolves #1009, #1059, #1058, #676.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1372/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1372/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 10, 2016

Member

Resolves #1106 now too. Ready for review/merge.

Member

fnothaft commented Aug 10, 2016

Resolves #1106 now too. Ready for review/merge.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 10, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1376/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1376/
Test PASSed.

+ val start = feature.getStart + 1 // IntervalList ranges are 1-based
+ val end = feature.getEnd // IntervalList ranges are closed
+ val strand = Features.asString(feature.getStrand, emptyUnknown = false)
+ val attributes = Features.gatherAttributes(feature)

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

This goes back to shoving all attributes into the IntervalList format name column, right? I saw that as an abuse to the format and generated a name with reasonable defaults, see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/Features.scala#L207.

@heuermh

heuermh Aug 11, 2016

Member

This goes back to shoving all attributes into the IntervalList format name column, right? I saw that as an abuse to the format and generated a name with reasonable defaults, see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/Features.scala#L207.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

I'm not 100% here—I'm a bit unfamiliar with the IntervalList format, even after this PR—but the implementation here passes round trip (with the exception of some ordering bits) with the interval list file in our test resources directory. For me to understand the difference a bit better, could you perhaps elucidate how this code would change the output? E.g., how would the resultant interval list line look different?

@fnothaft

fnothaft Aug 11, 2016

Member

I'm not 100% here—I'm a bit unfamiliar with the IntervalList format, even after this PR—but the implementation here passes round trip (with the exception of some ordering bits) with the interval list file in our test resources directory. For me to understand the difference a bit better, could you perhaps elucidate how this code would change the output? E.g., how would the resultant interval list line look different?

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

The javadoc for IntervalList (all the spec I'm aware of) describes the last column as "Interval name (an, ideally unique, name for the interval)". I haven't seen many of these files in the wild, so I don't know if | delimited attributes in the Interval name column is a reasonable thing to do.

Thus I found the previous code somewhat unsavory and arbitrary (as some hard coded keys are shoved into dbxrefs), see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L199.

I hadn't got to round tripping; I guess I was planning to shove the Interval name column into Feature.name field when reading IntervalList format and use Features.nameOf when writing. And then find a different file for the unit test.

@heuermh

heuermh Aug 11, 2016

Member

The javadoc for IntervalList (all the spec I'm aware of) describes the last column as "Interval name (an, ideally unique, name for the interval)". I haven't seen many of these files in the wild, so I don't know if | delimited attributes in the Interval name column is a reasonable thing to do.

Thus I found the previous code somewhat unsavory and arbitrary (as some hard coded keys are shoved into dbxrefs), see https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L199.

I hadn't got to round tripping; I guess I was planning to shove the Interval name column into Feature.name field when reading IntervalList format and use Features.nameOf when writing. And then find a different file for the unit test.

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

Just wanted to get back to this, the reason I've shoved records in here is that this is the format we got in the interval list file that described the exome panel. I don't know what the "best" approach is, but I don't think interval list is a rigorously specified format, so we're somewhat making things up as we go here.

@fnothaft

fnothaft Sep 6, 2016

Member

Just wanted to get back to this, the reason I've shoved records in here is that this is the format we got in the interval list file that described the exome panel. I don't know what the "best" approach is, but I don't think interval list is a rigorously specified format, so we're somewhat making things up as we go here.

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

Yeah, without seeing more examples of these in the wild, I'm not sure if the name column should be mapped to Feature.name or to Feature.attributes.

@heuermh

heuermh Sep 6, 2016

Member

Yeah, without seeing more examples of these in the wild, I'm not sure if the name column should be mapped to Feature.name or to Feature.attributes.

This comment has been minimized.

@heuermh

heuermh Sep 7, 2016

Member

ok with me for now, will resolve in future issue

@heuermh

heuermh Sep 7, 2016

Member

ok with me for now, will resolve in future issue

+ )
+
+ // remove header file
+ //fs.delete(headPath, true)

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

Why is this commented out?

@heuermh

heuermh Aug 11, 2016

Member

Why is this commented out?

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Ah, thanks for catching this. I had commented it out while debugging something.

@fnothaft

fnothaft Aug 11, 2016

Member

Ah, thanks for catching this. I had commented it out while debugging something.

strand match {
case Strand.FORWARD => "+"
case Strand.REVERSE => "-"
case Strand.INDEPENDENT => "."
case Strand.UNKNOWN => "?"
- case _ => ""
+ case _ => {

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

So "" is read in as null and then written back out as "" if emptyUnknown, right?

@heuermh

heuermh Aug 11, 2016

Member

So "" is read in as null and then written back out as "" if emptyUnknown, right?

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Correct.

@fnothaft

fnothaft Aug 11, 2016

Member

Correct.

@@ -308,7 +338,7 @@ class NarrowPeakParser extends FeatureParser {
// skip empty or invalid lines
if (fields.length < 3) {
- if (log.isDebugEnabled) log.debug("Empty or invalid NarrowPeak line: {}", line)
+ log.warn("Empty or invalid NarrowPeak line: {}", line)

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

Blank lines (and browser & track header lines) are valid in BED and NarrowPeak files, and encouraged between different tracks in the same file, so log.warn is probably too severe. See e.g. https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/resources/dvl1.200.bed#L34

@heuermh

heuermh Aug 11, 2016

Member

Blank lines (and browser & track header lines) are valid in BED and NarrowPeak files, and encouraged between different tracks in the same file, so log.warn is probably too severe. See e.g. https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/test/resources/dvl1.200.bed#L34

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Ah yeah, thanks for flagging this. I wanted to make another pass back over this code and attach ValidationStringency flags and etc to this. How about I:

  • Remove logging on blank/header lines
  • Add a validation stringency flag:
    • Throw exception on malformed lines with strict stringency
    • log.warn on lenient
    • No logging on silent

Thoughts?

@fnothaft

fnothaft Aug 11, 2016

Member

Ah yeah, thanks for flagging this. I wanted to make another pass back over this code and attach ValidationStringency flags and etc to this. How about I:

  • Remove logging on blank/header lines
  • Add a validation stringency flag:
    • Throw exception on malformed lines with strict stringency
    • log.warn on lenient
    • No logging on silent

Thoughts?

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

That sounds good, sorry I didn't get it in before.

@heuermh

heuermh Aug 11, 2016

Member

That sounds good, sorry I didn't get it in before.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

No sweat!

@fnothaft

fnothaft Aug 11, 2016

Member

No sweat!

- val commandDescription = "Convert a file with sequence features into corresponding ADAM format"
+object TransformFeatures extends BDGCommandCompanion {
+ val commandName = "transformFeatures"
+ val commandDescription = "Convert a file with sequence features into corresponding ADAM format and vice versa"

This comment has been minimized.

@heuermh

heuermh Aug 11, 2016

Member

If I have this correctly, on either INPUT or OUTPUT if the file extension is not recognized as a plain-text feature file, then it is assumed to be parquet. That way this one command can convert both ways. The usage strings below should be updated.

@heuermh

heuermh Aug 11, 2016

Member

If I have this correctly, on either INPUT or OUTPUT if the file extension is not recognized as a plain-text feature file, then it is assumed to be parquet. That way this one command can convert both ways. The usage strings below should be updated.

This comment has been minimized.

@fnothaft

fnothaft Aug 11, 2016

Member

Correct. I'll update the usage strings.

@fnothaft

fnothaft Aug 11, 2016

Member

Correct. I'll update the usage strings.

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

LGTM

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Aug 11, 2016

Member

Looks pretty good. Will try it out on various files tomorrow (and also hopefully re-remember how to use our internal cluster).

Member

heuermh commented Aug 11, 2016

Looks pretty good. Will try it out on various files tomorrow (and also hopefully re-remember how to use our internal cluster).

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

I attempted a round-trip BAM file conversion like:

adam-submit transform testb.bam testb.adam
adam-submit transform testb.adam testb.roundtrip.bam -single

and although the resulting testb.roundtrip.bam is 104MB ( the original was 101MB) running samtools flagstat testb.roundtrip.bam reports 0 counts for all stats, while there were 984817 reads reported by flagstat on the orignal file.

Also samtools view -h -o testb.roundtrip.conver2sam.sam testb.roundtrip.bam will not produce any reads data in an output sam file, only the header. So although no corruption error is logged by samtools, the file must be corrupt from the point of view of samtools 1.3

When not using -single
adam-submit transform testb.adam testb.roundtrip.bam part-r-0000* parquet files appear to be correctly generated.

If you have trouble replicating - let me know and I'll get you the file.

Member

jpdna commented Aug 19, 2016

I attempted a round-trip BAM file conversion like:

adam-submit transform testb.bam testb.adam
adam-submit transform testb.adam testb.roundtrip.bam -single

and although the resulting testb.roundtrip.bam is 104MB ( the original was 101MB) running samtools flagstat testb.roundtrip.bam reports 0 counts for all stats, while there were 984817 reads reported by flagstat on the orignal file.

Also samtools view -h -o testb.roundtrip.conver2sam.sam testb.roundtrip.bam will not produce any reads data in an output sam file, only the header. So although no corruption error is logged by samtools, the file must be corrupt from the point of view of samtools 1.3

When not using -single
adam-submit transform testb.adam testb.roundtrip.bam part-r-0000* parquet files appear to be correctly generated.

If you have trouble replicating - let me know and I'll get you the file.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 19, 2016

Member

Thanks for looking at this @jpdna! By any chance, have you tried running any of the Picard metrics tools on the file? Our round trip SAM/BAM unit tests are passing, so I believe that htsjdk is able to read the file fine. I wonder if we're writing the BAM header in a way so that htsjdk and htslib read the file differently? I'll take a look tomorrow.

Member

fnothaft commented Aug 19, 2016

Thanks for looking at this @jpdna! By any chance, have you tried running any of the Picard metrics tools on the file? Our round trip SAM/BAM unit tests are passing, so I believe that htsjdk is able to read the file fine. I wonder if we're writing the BAM header in a way so that htsjdk and htslib read the file differently? I'll take a look tomorrow.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

Interestingly - Picard seems to not like even the ORIGINAL test input file

java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=testb.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq
[Fri Aug 19 00:02:08 EDT 2016] picard.sam.SamToFastq INPUT=testb.bam FASTQ=out.fastq SECOND_END_FASTQ=out_second_end.fastq    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Aug 19 00:02:08 EDT 2016] Executing as jp@jp1 on Linux 4.3.5-300.fc23.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.2.2(20d49152d0840a960fcb97df76dbaca260b39244_1461168806) IntelDeflater
[Fri Aug 19 00:02:09 EDT 2016] picard.sam.SamToFastq done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2024275968
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 40060, Read name ERR001301.6170436, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:652)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:637)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:607)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:174)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Here is the original bam file I am feeding into this round-trip test:
https://drive.google.com/file/d/0B6jh69UgixwpVVJwdE1qTXhjcEk/view?usp=sharing

I'd write this off to just an incorrect BAM file that samtools is too permissive on, but the test file was derived from the 1000 genomes files available here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam

I also just tried that whole 1000 genomes with the Picard command above, and Picard produces the same error, but samtools flagstat and view seems to work fine on it.

So - it would be interesting to know exactly what is up with these 1000 genomes files that makes htsjdk unable to read them ( related to "bin" field I guess) - and thus ADAM can't either, and see if this jdk error message can at least be passed on more clearly in ADAM rather than just dropping the reads - if indeed it is legitimate to refuse to process these.

Member

jpdna commented Aug 19, 2016

Interestingly - Picard seems to not like even the ORIGINAL test input file

java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=testb.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq
[Fri Aug 19 00:02:08 EDT 2016] picard.sam.SamToFastq INPUT=testb.bam FASTQ=out.fastq SECOND_END_FASTQ=out_second_end.fastq    OUTPUT_PER_RG=false RG_TAG=PU RE_REVERSE=true INTERLEAVE=false INCLUDE_NON_PF_READS=false CLIPPING_MIN_LENGTH=0 READ1_TRIM=0 READ2_TRIM=0 INCLUDE_NON_PRIMARY_ALIGNMENTS=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Fri Aug 19 00:02:08 EDT 2016] Executing as jp@jp1 on Linux 4.3.5-300.fc23.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15; Picard version: 2.2.2(20d49152d0840a960fcb97df76dbaca260b39244_1461168806) IntelDeflater
[Fri Aug 19 00:02:09 EDT 2016] picard.sam.SamToFastq done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=2024275968
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Record 40060, Read name ERR001301.6170436, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
    at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:441)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:652)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:637)
    at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:607)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:545)
    at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:519)
    at picard.sam.SamToFastq.doWork(SamToFastq.java:174)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Here is the original bam file I am feeding into this round-trip test:
https://drive.google.com/file/d/0B6jh69UgixwpVVJwdE1qTXhjcEk/view?usp=sharing

I'd write this off to just an incorrect BAM file that samtools is too permissive on, but the test file was derived from the 1000 genomes files available here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam

I also just tried that whole 1000 genomes with the Picard command above, and Picard produces the same error, but samtools flagstat and view seems to work fine on it.

So - it would be interesting to know exactly what is up with these 1000 genomes files that makes htsjdk unable to read them ( related to "bin" field I guess) - and thus ADAM can't either, and see if this jdk error message can at least be passed on more clearly in ADAM rather than just dropping the reads - if indeed it is legitimate to refuse to process these.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

I googled a bit and find that setting VALIDATION_STRINGENCY=LENIENT when running picard allows picard to process this 1000 genomes file file.

 java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq VALIDATION_STRINGENCY=LENIENT

as discussed here : http://gatkforums.broadinstitute.org/gatk/discussion/4290/sam-bin-field-error-for-the-gatk-run

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

Member

jpdna commented Aug 19, 2016

I googled a bit and find that setting VALIDATION_STRINGENCY=LENIENT when running picard allows picard to process this 1000 genomes file file.

 java -jar ~/Apps/picard/release_2.2.2/picard-tools-2.2.2/picard.jar SamToFastq I=NA12878.chrom22.ILLUMINA.bwa.CEU.high_coverage.20100311.bam F=out.fastq SECOND_END_FASTQ=out_second_end.fastq VALIDATION_STRINGENCY=LENIENT

as discussed here : http://gatkforums.broadinstitute.org/gatk/discussion/4290/sam-bin-field-error-for-the-gatk-run

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 19, 2016

Member

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

I don't think that applies for ADAM, as we're not using the index.

Member

fnothaft commented Aug 19, 2016

So perhaps we need to see if we can allow the same "lenient" mode in ADAM?

I don't think that applies for ADAM, as we're not using the index.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

The ADAM parquet file testb.adam examined with adam-submit flagstat testb.adam does appear fine and agrees withsamtools flagstat of the original test.bam.

Also looking further at the per partition "bam" files named part-r-0000* produced by adam-submit transform testb.adam testb.roundtrip.NOSINGLE.bam without -single are reporting reasonable samtools flagstat count of reads, though the
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
warning is logged by samtools.

Member

jpdna commented Aug 19, 2016

The ADAM parquet file testb.adam examined with adam-submit flagstat testb.adam does appear fine and agrees withsamtools flagstat of the original test.bam.

Also looking further at the per partition "bam" files named part-r-0000* produced by adam-submit transform testb.adam testb.roundtrip.NOSINGLE.bam without -single are reporting reasonable samtools flagstat count of reads, though the
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.
warning is logged by samtools.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

Trying to output a -single sam file (rather than bam) works fine as:
adam-submit transform testb.adam testb.rountrip.sam -single
and samtools is then happy to operate on that sam file, including converting it to bam.

Also - I did same test and got same result of samtools not being to read the round tripped bam when using the unsorted.sam file we use in our unit tests.

adam-submit transform unsorted.sam unsorted.adam
../adam/bin/adam-submit transform unsorted.adam unsorted.roundtrip.bam -single


samtools flagstat unsorted.roundtrip.bam
0 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (N/A : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Member

jpdna commented Aug 19, 2016

Trying to output a -single sam file (rather than bam) works fine as:
adam-submit transform testb.adam testb.rountrip.sam -single
and samtools is then happy to operate on that sam file, including converting it to bam.

Also - I did same test and got same result of samtools not being to read the round tripped bam when using the unsorted.sam file we use in our unit tests.

adam-submit transform unsorted.sam unsorted.adam
../adam/bin/adam-submit transform unsorted.adam unsorted.roundtrip.bam -single


samtools flagstat unsorted.roundtrip.bam
0 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (N/A : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 19, 2016

Member

<3 htslib/htsjdk format disagreements

@jpdna thanks for testing this on samtools. I'll go ahead and debug this further. As an aside, we should add an integration test that reads the files in from samtools, or some other tool that uses htslib. I know that htslib and htsjdk also disagree on VCF, but this one is somewhat shocking. If we can root cause the issue, we should report it upstream.

Member

fnothaft commented Aug 19, 2016

<3 htslib/htsjdk format disagreements

@jpdna thanks for testing this on samtools. I'll go ahead and debug this further. As an aside, we should add an integration test that reads the files in from samtools, or some other tool that uses htslib. I know that htslib and htsjdk also disagree on VCF, but this one is somewhat shocking. If we can root cause the issue, we should report it upstream.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 19, 2016

Member

I'd be happy to have a go at making such integration test with samtools, I'll open an issue. We can then test the fix with it when ready.

Member

jpdna commented Aug 19, 2016

I'd be happy to have a go at making such integration test with samtools, I'll open an issue. We can then test the fix with it when ready.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 19, 2016

Member

Awesome! Thanks @jpdna!

Member

fnothaft commented Aug 19, 2016

Awesome! Thanks @jpdna!

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 23, 2016

Member

I'm looking at this now. It looks like we write the header completely correctly. I'm not sure what exactly is going on, so I'm unrolling some of the code so that we can write uncompressed BAM (at least for the header) so that I can look at a hexdump to see what's going wrong.

Otherwise, we've got some bad stuff going on in the AlignmentRecordRDD round trip tests, so I'm going to patch this up in this PR as well.

Member

fnothaft commented Aug 23, 2016

I'm looking at this now. It looks like we write the header completely correctly. I'm not sure what exactly is going on, so I'm unrolling some of the code so that we can write uncompressed BAM (at least for the header) so that I can look at a hexdump to see what's going wrong.

Otherwise, we've got some bad stuff going on in the AlignmentRecordRDD round trip tests, so I'm going to patch this up in this PR as well.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Aug 30, 2016

Member

Just pinging on this @fnothaft - I think you are continuing to dig through the hexdumps?

Member

jpdna commented Aug 30, 2016

Just pinging on this @fnothaft - I think you are continuing to dig through the hexdumps?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 30, 2016

Member

Thanks for the ping @jpdna ; let me get back on this...

Member

fnothaft commented Aug 30, 2016

Thanks for the ping @jpdna ; let me get back on this...

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 30, 2016

Member

Just rebased and pushed a fix for the samtools issue. It is isolated in 74928e6, I'll squash that down into one of the other commits later. I've tested this locally on a small file with samtools. @jpdna can you check this on your side as well?

Member

fnothaft commented Aug 30, 2016

Just rebased and pushed a fix for the samtools issue. It is isolated in 74928e6, I'll squash that down into one of the other commits later. I've tested this locally on a small file with samtools. @jpdna can you check this on your side as well?

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Aug 30, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1438/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1438/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 31, 2016

Member

Ping for review/merge?

Member

fnothaft commented Aug 31, 2016

Ping for review/merge?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Aug 31, 2016

Member

I'm not sure we resolved the IntervalList format name column discussion above, and a few of the other review comments still need addressing.

On my end, I want to run through some tests on the cluster.

Member

heuermh commented Aug 31, 2016

I'm not sure we resolved the IntervalList format name column discussion above, and a few of the other review comments still need addressing.

On my end, I want to run through some tests on the cluster.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Aug 31, 2016

Member

Ah, sorry! I did a quick pass over the PR earlier and thought I'd addressed the comments. Evidently, I read way too fast. I'll make a pass over those tomorrow.

Member

fnothaft commented Aug 31, 2016

Ah, sorry! I did a quick pass over the PR earlier and thought I'd addressed the comments. Evidently, I read way too fast. I'll make a pass over those tomorrow.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 6, 2016

Member

@heuermh just rebased and addressed your review comments. Let me know your thoughts.

Member

fnothaft commented Sep 6, 2016

@heuermh just rebased and addressed your review comments. Let me know your thoughts.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 6, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1460/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1460/
Test PASSed.

- log.warn("Empty or invalid NarrowPeak line: {}", line)
+ if (stringency == ValidationStringency.STRICT) {
+ throw new IllegalArgumentException("Empty or invalid NarrowPeak line: %s".format(line))
+ } else if (stringency == ValidationStringency.STRICT) {

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

this should be LENIENT

@heuermh

heuermh Sep 6, 2016

Member

this should be LENIENT

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

Just patched, thanks for catching.

@fnothaft

fnothaft Sep 6, 2016

Member

Just patched, thanks for catching.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 6, 2016

Member

@heuermh if this looks good, can you merge? Merging this will simplify #1145.

Member

fnothaft commented Sep 6, 2016

@heuermh if this looks good, can you merge? Merging this will simplify #1145.

@@ -169,18 +179,28 @@ class GFF3Parser extends FeatureParser {
* In lieu of a specification, see:
* https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/samtools/util/IntervalList.html
*/
-class IntervalListParser extends Serializable {
- def parse(line: String): (Option[SequenceRecord], Option[Feature]) = {
+private[rdd] class IntervalListParser extends Serializable with Logging {

This comment has been minimized.

@heuermh

heuermh Sep 6, 2016

Member

Since we are reusing the SAM header code for IntervalList format I was thinking we could remove all the startsWith("@") stuff here.

This method should only need to handle the five tab-delimited columns (Sequence name, Start position (1-based), End position (1-based, end inclusive), Strand (either + or -), Interval name (an, ideally unique, name for the interval).

@heuermh

heuermh Sep 6, 2016

Member

Since we are reusing the SAM header code for IntervalList format I was thinking we could remove all the startsWith("@") stuff here.

This method should only need to handle the five tab-delimited columns (Sequence name, Start position (1-based), End position (1-based, end inclusive), Strand (either + or -), Interval name (an, ideally unique, name for the interval).

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

We're reusing the SAM header writing code, not the SAM header reading code. We might be able to reuse that code, but that's less straightforward. Perhaps we can punt that to a later ticket?

@fnothaft

fnothaft Sep 6, 2016

Member

We're reusing the SAM header writing code, not the SAM header reading code. We might be able to reuse that code, but that's less straightforward. Perhaps we can punt that to a later ticket?

This comment has been minimized.

@fnothaft

fnothaft Sep 6, 2016

Member

Essentially, I just don't have the bandwidth to take that on now.

@fnothaft

fnothaft Sep 6, 2016

Member

Essentially, I just don't have the bandwidth to take that on now.

This comment has been minimized.

@heuermh

heuermh Sep 7, 2016

Member

ok, how about just getting rid of the arbitrary dbxref key stuff at

https://github.com/fnothaft/adam/blob/f21ead579471ab7854aa519ee7efc884f2f6f38b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L226

then? Don't set dbxrefs at all, just attributes.

@heuermh

heuermh Sep 7, 2016

Member

ok, how about just getting rid of the arbitrary dbxref key stuff at

https://github.com/fnothaft/adam/blob/f21ead579471ab7854aa519ee7efc884f2f6f38b/adam-core/src/main/scala/org/bdgenomics/adam/rdd/features/FeatureParser.scala#L226

then? Don't set dbxrefs at all, just attributes.

This comment has been minimized.

@fnothaft

fnothaft Sep 7, 2016

Member

I assume this was meant to be on a different line? That change tracks all the way back to #686. I'd want to get @ryan-williams' thoughts on that code before we yanked it out or changed it. I think we're getting into a debate that's somewhat orthogonal to this PR, which is what exactly the .interval_list format is. I've just added code to said parser so that it passes round trip tests. If we'd like to change the parser, I think we should open an issue and discuss that there, and not block this PR on it. That's just my opinion though. Perhaps let's discuss on the call tomorrow?

@fnothaft

fnothaft Sep 7, 2016

Member

I assume this was meant to be on a different line? That change tracks all the way back to #686. I'd want to get @ryan-williams' thoughts on that code before we yanked it out or changed it. I think we're getting into a debate that's somewhat orthogonal to this PR, which is what exactly the .interval_list format is. I've just added code to said parser so that it passes round trip tests. If we'd like to change the parser, I think we should open an issue and discuss that there, and not block this PR on it. That's just my opinion though. Perhaps let's discuss on the call tomorrow?

This comment has been minimized.

@heuermh

heuermh Sep 7, 2016

Member

Let's create an issue to 1) look into using SAM header code when reading IntervalList format and 2) clean up this dbxref key stuff

@heuermh

heuermh Sep 7, 2016

Member

Let's create an issue to 1) look into using SAM header code when reading IntervalList format and 2) clean up this dbxref key stuff

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 6, 2016

Member

Reviewing meow

Member

heuermh commented Sep 6, 2016

Reviewing meow

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 6, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1461/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1461/
Test PASSed.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 7, 2016

Member

Created a new issue #1152 to resolve IntervalList discussions later.

Is this pull request ready to merge then?

Member

heuermh commented Sep 7, 2016

Created a new issue #1152 to resolve IntervalList discussions later.

Is this pull request ready to merge then?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 7, 2016

Member

Just rebased. Should be good for a merge once tests pass.

Member

fnothaft commented Sep 7, 2016

Just rebased. Should be good for a merge once tests pass.

[ADAM-1009] Port single file save code over to VCF.
Factor single file save code out of AlignmentRecordRDD. Resolves #1009.

fnothaft added some commits Aug 8, 2016

[ADAM-1059] Implemented single file save with SAM header for Interval…
…List.


Resolves #1059. Factored out SAM Header writing code, used it to implement
single file save for IntervalList files.
[ADAM-1058] Add support for -single across all Feature formats.
Resolves #1058. Added `asSingleFile` to all legacy Feature format save methods
in `org.bdgenomics.adam.rdd.features.FeatureRDD`. Renamed `Features2ADAM` to
`TransformFeatures` and added the `-single` option, as well as the ability to
export back to legacy feature formats.
[ADAM-676] Clean up header issues for sharded files.
Resolves #676. In #964, we resolved the "header not set" issues for single file
SAM/BAM output. This change propegates this fix to sharded SAM/BAM output, and
VCF.
[ADAM-1106] Improve feature unit test coverage.
Resolves #1106. Improves round trip coverage by checking feature counts after
writing out, and spot checking converted feature lines.

* Split out all string encoding functions for features to make them easier to
  test, and wrote tests to spot check the correctness of these features.
* Made end-to-end feature tests pass.
* Eliminated some redundant tests.
* Added assertions to check that the same number of features were present in the
  input and output files when writing between different formats.
* Rewrote interval list parser and encoder.
* Fixed various bugs:
  * GFF2/GTF/GFF3 parser would discard lines that lacked an attribute column.
  * NarrowPeak parser would discard INDEPENDENT strand.
  * NarrowPeak signal intensities should be written as Ints, not Floats.
@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 7, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1466/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1466/
Test PASSed.

@heuermh heuermh merged commit 79749ae into bigdatagenomics:master Sep 7, 2016

1 check passed

default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 7, 2016

Member

Merged via multiple commits.

Thank you, @fnothaft!

Member

heuermh commented Sep 7, 2016

Merged via multiple commits.

Thank you, @fnothaft!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment