-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding metrics for estimating strand specificity. #733
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nh13 I think this looks much better. I do have some minor, and probably easily addressed concerns about the actual algorithm, though in fairness some of those pre-date your changes!!
I think it would be better to use the strand of the read you are observing in your loop - which just makes the check marginally more complicated - rather than the strand reported for R1 from the record you have.
If you were to want to tighten things up in general, another way to allay that concern would be to filter for all the strand-specificity checks, to only inserts that are fully contained within the span of a gene (and with the existing check for the read overlapping an exon). That way you are guaranteed when you look at the mate pair's strand, that the mate pair is also within the gene's locus, and not you're not looking at a pair that happens to be overlap the start/end of the gene by chance. Gene even has a start
and end
field populated correctly that would make that check trivial. FWIW though this will probably have minimal effect on any reasonable library.
@@ -66,6 +66,19 @@ | |||
/** Number of aligned reads that are mapped to the incorrect strand. 0 if library is not strand-specific. */ | |||
public long INCORRECT_STRAND_READS; | |||
|
|||
/** The fraction of reads that support the model where R1 is on the strand of transcription and R2 is on the | |||
* opposite strand of transcription. For unpaired reads, this is the fraction of reads that are on the same strand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just stop the sentence at "opposite strand."
|
||
if (!rec.getReadPairedFlag() || SamPairUtil.getPairOrientation(rec) == SamPairUtil.PairOrientation.FR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this will fail with reads that have unmapped mates, which I can't find being filtered out above. SamPairUtil.getPairOrientation
will throw an IllegalArgumentException
for mate-paired reads with an unmapped mate.
if (!rec.getReadPairedFlag() || SamPairUtil.getPairOrientation(rec) == SamPairUtil.PairOrientation.FR) { | ||
final boolean firstReadNegativeStrand; | ||
// Get the strand of the first read. | ||
if (!rec.getReadPairedFlag() || rec.getFirstOfPairFlag()) firstReadNegativeStrand = rec.getReadNegativeStrandFlag(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of things here that seem a little off to me:
- You're looking at first of pair strand for your test, but you only really know whether the read you have in hand (maybe R1 or R2) overlaps an exon.
- You're only looking at R1 strand, but you're single-counting it for fragments and double-counting it for pairs.
@tfenne care to take one more look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. One minor suggestion. 👍
else { | ||
properOrientation = SamPairUtil.getPairOrientation(rec) == SamPairUtil.PairOrientation.FR; | ||
leftMostAlignedBase = Math.min(rec.getAlignmentStart(), rec.getMateAlignmentStart()); | ||
rightMostAlignedBase = Math.max(rec.getAlignmentEnd(), SAMUtils.getMateAlignmentEnd(rec)); // requires the mate cigar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could just use the inferred insert size here to avoid requiring the mate cigar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is true for FR
reads, but not FF
reads. Unfortunately, Picard/htsjdk computes insert size as 5' to 5' rather than left-most to right-most bases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Picard/htsjdk computes insert size"? I was under the impression that it leaves it alone as it was placed by the mapper. Can you point to the code that does this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example is SamPairUtil.computeInsertSize
within SamPairUtil.setMateInfo
which is used in AbstractAlignmentMerger.transferAlignmentInfoToPairedRead
.
@yfarjoun it would be great to have a look by someone else, either yourself, or someone you suggest? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @nh13 This looks very nice.
I have some comments about the implementation and tests. I realize that some of the tests I requested were missing but given that you changed some of the filters (getNotPrimaryAlignmentFlag) I'd like to be sure that we're not missing something...
@@ -66,6 +66,19 @@ | |||
/** Number of aligned reads that are mapped to the incorrect strand. 0 if library is not strand-specific. */ | |||
public long INCORRECT_STRAND_READS; | |||
|
|||
/** The fraction of reads that support the model where R1 is on the strand of transcription and R2 is on the | |||
* opposite strand. For unpaired reads, this is the fraction of reads that are on the same strand than that of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"For unpaired reads, it is the fraction of read that are on the transcription strand (out of all the reads)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
/** | ||
* The fraction of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite | ||
* strand. For unpaired reads, this is the fraction of reads that are on the opposite strand than that of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -232,25 +228,60 @@ public void acceptRecord(SAMRecord rec) { | |||
|
|||
// Strand-specificity is tallied on read basis rather than base at a time. A read that aligns to more than one | |||
// gene is not counted. | |||
if (!rec.getNotPrimaryAlignmentFlag() && overlapsExon && strandSpecificity != StrandSpecificity.NONE && overlappingGenes.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this now start counting secondary and supplementary alignments as full reads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See line 140 (https://github.com/broadinstitute/picard/pull/733/files#diff-5c77d4872231d96dbaeb66da41c02f84R140) where there is a return
statement if rec.getNotPrimaryAlignmentFlag()
.
if (rec.getMateUnmappedFlag()) { | ||
properOrientation = false; // mate is unmapped! | ||
leftMostAlignedBase = rightMostAlignedBase = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
else { | ||
properOrientation = SamPairUtil.getPairOrientation(rec) == SamPairUtil.PairOrientation.FR; | ||
leftMostAlignedBase = Math.min(rec.getAlignmentStart(), rec.getMateAlignmentStart()); | ||
rightMostAlignedBase = Math.max(rec.getAlignmentEnd(), SAMUtils.getMateAlignmentEnd(rec)); // requires the mate cigar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will blow-up without MC tag won't it? If so that's a change in spec, since I think this collector didn't need to have MC in the past...so I'd prefer if you could use a poorer solution with a warning in case you find a record with no MC tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example use the inferred read length if the mate cigar is missing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I'll fight the battle to change your mind at a latter date. I am going to omit a warning, since then I'd have to find a way to pass a log in, and I'm not up for that.
"REF_FLAT=" + getRefFlatFile(sequence).getAbsolutePath(), | ||
"RIBOSOMAL_INTERVALS=" + rRnaIntervalsFile.getAbsolutePath(), | ||
"STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND", | ||
"IGNORE_SEQUENCE=" +ignoredSequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after +
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here and elsewhere (copy paste)
builder.addPair("read_one_end_unmapped", sequenceIndex, 50, 51, false, true, "50M", null, false, false, -1); | ||
|
||
// Reads that are enclosed in the gene, paired and frag: one count per template | ||
builder.addFrag("frag_enclosed_forward", sequenceIndex, 150, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These have null cigars...so how many bases get counted towards coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answered my own question. 36.
final int sequenceIndex = builder.getHeader().getSequenceIndex(sequence); | ||
|
||
// Reads that start *before* the gene: not counted | ||
builder.addPair("pair_prior_1", sequenceIndex, 45, 50, false, false, "50M", "50M", true, false, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add:
- secondary alignments
- supplementary alignments
- records that do not have MC tag
- duplicate reads.
- ignored sequence reads
- non-PF reads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add 5 since that is tested in the other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, duplicate reads are OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: previously were were including supplementary reads when calculating CORRECT_STRAND_READS
and INCORRECT_STRAND_READS
. Now we are not.
|
||
final RnaSeqMetrics metrics = output.getMetrics().get(0); | ||
Assert.assertEquals(metrics.PF_ALIGNED_BASES, 722); | ||
Assert.assertEquals(metrics.PF_BASES, 576); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it just me or is it strange that PF_BASES < PF_ALIGNED_BASES?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to set the read length on the builder, the call to add*
didn't check the cigar versus the bases length (yikes).
samFile.deleteOnExit(); | ||
|
||
final SAMFileWriter samWriter = new SAMFileWriterFactory().makeSAMWriter(builder.getHeader(), false, samFile); | ||
for (final SAMRecord rec: builder.getRecords()) samWriter.addAlignment(rec); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Streams for the win?!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the rest of the code. Keeping it for now. I am not in love with Java streams yet 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun addressed your concerns.
@@ -66,6 +66,19 @@ | |||
/** Number of aligned reads that are mapped to the incorrect strand. 0 if library is not strand-specific. */ | |||
public long INCORRECT_STRAND_READS; | |||
|
|||
/** The fraction of reads that support the model where R1 is on the strand of transcription and R2 is on the | |||
* opposite strand. For unpaired reads, this is the fraction of reads that are on the same strand than that of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
/** | ||
* The fraction of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite | ||
* strand. For unpaired reads, this is the fraction of reads that are on the opposite strand than that of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -232,25 +228,60 @@ public void acceptRecord(SAMRecord rec) { | |||
|
|||
// Strand-specificity is tallied on read basis rather than base at a time. A read that aligns to more than one | |||
// gene is not counted. | |||
if (!rec.getNotPrimaryAlignmentFlag() && overlapsExon && strandSpecificity != StrandSpecificity.NONE && overlappingGenes.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See line 140 (https://github.com/broadinstitute/picard/pull/733/files#diff-5c77d4872231d96dbaeb66da41c02f84R140) where there is a return
statement if rec.getNotPrimaryAlignmentFlag()
.
if (rec.getMateUnmappedFlag()) { | ||
properOrientation = false; // mate is unmapped! | ||
leftMostAlignedBase = rightMostAlignedBase = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
else { | ||
properOrientation = SamPairUtil.getPairOrientation(rec) == SamPairUtil.PairOrientation.FR; | ||
leftMostAlignedBase = Math.min(rec.getAlignmentStart(), rec.getMateAlignmentStart()); | ||
rightMostAlignedBase = Math.max(rec.getAlignmentEnd(), SAMUtils.getMateAlignmentEnd(rec)); // requires the mate cigar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example is SamPairUtil.computeInsertSize
within SamPairUtil.setMateInfo
which is used in AbstractAlignmentMerger.transferAlignmentInfoToPairedRead
.
"REF_FLAT=" + getRefFlatFile(sequence).getAbsolutePath(), | ||
"RIBOSOMAL_INTERVALS=" + rRnaIntervalsFile.getAbsolutePath(), | ||
"STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND", | ||
"IGNORE_SEQUENCE=" +ignoredSequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed here and elsewhere (copy paste)
final int sequenceIndex = builder.getHeader().getSequenceIndex(sequence); | ||
|
||
// Reads that start *before* the gene: not counted | ||
builder.addPair("pair_prior_1", sequenceIndex, 45, 50, false, false, "50M", "50M", true, false, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add 5 since that is tested in the other tests.
final int sequenceIndex = builder.getHeader().getSequenceIndex(sequence); | ||
|
||
// Reads that start *before* the gene: not counted | ||
builder.addPair("pair_prior_1", sequenceIndex, 45, 50, false, false, "50M", "50M", true, false, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, duplicate reads are OK.
final int sequenceIndex = builder.getHeader().getSequenceIndex(sequence); | ||
|
||
// Reads that start *before* the gene: not counted | ||
builder.addPair("pair_prior_1", sequenceIndex, 45, 50, false, false, "50M", "50M", true, false, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NB: previously were were including supplementary reads when calculating CORRECT_STRAND_READS
and INCORRECT_STRAND_READS
. Now we are not.
|
||
final RnaSeqMetrics metrics = output.getMetrics().get(0); | ||
Assert.assertEquals(metrics.PF_ALIGNED_BASES, 722); | ||
Assert.assertEquals(metrics.PF_BASES, 576); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to set the read length on the builder, the call to add*
didn't check the cigar versus the bases length (yikes).
@yfarjoun thanks for the helpful comments, take another look and let me know what you think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments. thanks!
/** | ||
* The fraction of reads for which the transcription strand model could not be inferred. | ||
*/ | ||
public long NUM_UNEXPLAINED_TRANSCRIPTION_STRAND_READS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inconclusive rather than unexplained?
*/ | ||
public double PCT_FIRST_READ_TRANSCRIPTION_STRAND_READS; | ||
|
||
/** | ||
* The fraction of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite | ||
* strand. For unpaired reads, this is the fraction of reads that are on the opposite strand than that of the | ||
* transcription strand. | ||
* strand. For unpaired reads, it is the fraction of read that are on opposite strand than that of the the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the the
|
||
/** | ||
* The fraction of reads that support the model where R2 is on the strand of transcription and R1 is on the opposite | ||
* strand. For unpaired reads, it is the fraction of read that are on opposite strand than that of the the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the the
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fraction->number (x2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still worded in a confusing manner (for unpaired reads)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing unpaired description as it is just plain confusing.
@@ -226,9 +226,13 @@ public void acceptRecord(SAMRecord rec) { | |||
} | |||
} | |||
|
|||
if (metrics.PF_ALIGNED_BASES > metrics.PF_BASES) { | |||
throw new IllegalStateException("HUH?"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LOL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
// Get the alignment length of the mate. Approximate it with the current read length if no | ||
// mate cigar is found. | ||
final Cigar mateCigar = SAMUtils.getMateCigar(rec); | ||
final int mateReferenceLength = (mateCigar == null) ? rec.getReadLength() : mateCigar.getReferenceLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for humoring me.
builder.addPair("pair_prior_1", sequenceIndex, 45, 50, false, false, "50M", "50M", true, false, -1); | ||
builder.addPair("pair_prior_2", sequenceIndex, 49, 50, false, false, "50M", "50M", true, false, -1); | ||
|
||
// Read that is enclosed in the gene, but one end does not map: not counted | ||
// Read that is enclosed in the gene, but one end does not map: not counted: count as unexamined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not counted:
public long NUM_SECOND_READ_TRANSCRIPTION_STRAND_READS; | ||
|
||
/** | ||
* The fraction of reads for which the transcription strand model could not be inferred. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fraction->number
@@ -66,19 +66,43 @@ | |||
/** Number of aligned reads that are mapped to the incorrect strand. 0 if library is not strand-specific. */ | |||
public long INCORRECT_STRAND_READS; | |||
|
|||
/** The number of reads that support the model where R1 is on the strand of transcription and R2 is on the | |||
* opposite strand. For unpaired reads, it is the fraction of read that are on the transcription strand (out of all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reads
@nh13 I just realized that somewhere between my review and @yfarjoun's review something changed that I disagree with. Previously the two percentages were calculated using the sum of the I think this is wrong because what that represents is reads that aren't properly paired or in the expected Also, while I'm here, could we change |
@tfenne thanks for catching this, I'll update. |
👍 from me. @yfarjoun ? |
actually, given that oracle DB has problems with long names...could you shorten the length of the metrics names to 30 or less? |
sorry for the late comment. (I was starting to think about the needed DB changes..) |
@yfarjoun I shortened them to 30 characters or less. 👍 , then I can rebase and once tests pass merge? |
Added two metrics that measure the fraction of observed reads that support R1 being on the strand of transcription versus the opposite strand of transcription. For paired end reads, R2 is assumed to be on the opposite strand of that of R1.
37f496d
to
86682fb
Compare
Added two metrics that measure the fraction of observed reads that
support R1 being on the strand of transcription versus the opposite
strand of transcription. For paired end reads, R2 is assumed to be on
the opposite strand of that of R1.