New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flagstat results still don't match samtools flagstat #946

Closed
jpdna opened this Issue Feb 16, 2016 · 5 comments

Comments

Projects
None yet
3 participants
@jpdna
Member

jpdna commented Feb 16, 2016

Even after commit #943 fixed the main problem with counting of read_1 and read_2 there is still a difference between adam and samtools in counting total number of pairs. I'm examining the samtools code for flagstat to make sure we apply the same same filters.

@jpdna jpdna self-assigned this Feb 16, 2016

@heuermh heuermh added this to the 0.19.0 milestone Feb 16, 2016

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Feb 16, 2016

Member

at least part of this problem is due to samtools filtering out reads with the 0x800 flag set while ADAM does not. However, trying to add this logic ran into another problem that it appears that within FlagStat.scala the p.getSupplementaryAlignment flag is providing incorrect results, as
samtools view -f 0x800 /jp5t/data/AshkenazimTrio/HG002.mate_pair.sorted.chr22.bam
says there are 39225 reads with 0x800 set and ADAM seems to think there are 0.
I'm investigating whether the parquet file is correct. I tried already to add AlignmentRecordField.supplementaryAlignment to the flagstat and adamcontext projections but answer still incorrect.

Member

jpdna commented Feb 16, 2016

at least part of this problem is due to samtools filtering out reads with the 0x800 flag set while ADAM does not. However, trying to add this logic ran into another problem that it appears that within FlagStat.scala the p.getSupplementaryAlignment flag is providing incorrect results, as
samtools view -f 0x800 /jp5t/data/AshkenazimTrio/HG002.mate_pair.sorted.chr22.bam
says there are 39225 reads with 0x800 set and ADAM seems to think there are 0.
I'm investigating whether the parquet file is correct. I tried already to add AlignmentRecordField.supplementaryAlignment to the flagstat and adamcontext projections but answer still incorrect.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Feb 16, 2016

Member

Some more background describing the difference between ADAM flagstat and Samtools Flagstat output.

On the same example file SRR062643.bam:
Samtools Flagstat

samtools flagstat SRR062643.bam
59676190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
2453666 + 0 supplementary
0 + 0 duplicates
59506112 + 0 mapped (99.71% : N/A)
57222524 + 0 paired in sequencing
28611262 + 0 read1
28611262 + 0 read2
55815676 + 0 properly paired (97.54% : N/A)
56938646 + 0 with itself and mate mapped
113800 + 0 singletons (0.20% : N/A)
539234 + 0 with mate mapped to a different chr
150020 + 0 with mate mapped to a different chr (mapQ>=5)

ADAM flagstat

59676190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 primary duplicates
0 + 0 primary duplicates - both read and mate mapped
0 + 0 primary duplicates - only read mapped
0 + 0 primary duplicates - cross chromosome
0 + 0 secondary duplicates
0 + 0 secondary duplicates - both read and mate mapped
0 + 0 secondary duplicates - only read mapped
0 + 0 secondary duplicates - cross chromosome
59506112 + 0 mapped (99.71%:0.00%)
59676190 + 0 paired in sequencing
29837734 + 0 read1
29838456 + 0 read2
57896985 + 0 properly paired (97.02%:0.00%)
59384680 + 0 with itself and mate mapped
121432 + 0 singletons (0.20%:0.00%)
2931808 + 0 with mate mapped to a different chr

Looking at code in FlagStat.scala and SAMRecordConverter.scala as well as the Samtools Flagstat code here
there are three causes:

  1. A likely bug in ADAM code in SAMRecordConverter.scala only sets the SupplementaryAlingmentFlag (0x800) when the SecondaryAlignmentFlag (0x100) is set, thus the 0x800 bit in input was often ignored
  2. Samtools Flagstat counts both mapped and unmapped reads in the (paired in sequencing, read1, read2) statistics, ADAM currently does not count these because ADAM does not set its own PrimaryAlignment flag for unmapped or undefined reference reads, and this value is needed in computing the stats.
  3. Samtools does not count Secondary(non primary) or Supplemental reads in read pair counts stats

Fixes to these issues have been implemented in coming pull request, these changes result in perfect concordance between ADAM and Samtools flagstats

Member

jpdna commented Feb 16, 2016

Some more background describing the difference between ADAM flagstat and Samtools Flagstat output.

On the same example file SRR062643.bam:
Samtools Flagstat

samtools flagstat SRR062643.bam
59676190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
2453666 + 0 supplementary
0 + 0 duplicates
59506112 + 0 mapped (99.71% : N/A)
57222524 + 0 paired in sequencing
28611262 + 0 read1
28611262 + 0 read2
55815676 + 0 properly paired (97.54% : N/A)
56938646 + 0 with itself and mate mapped
113800 + 0 singletons (0.20% : N/A)
539234 + 0 with mate mapped to a different chr
150020 + 0 with mate mapped to a different chr (mapQ>=5)

ADAM flagstat

59676190 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 primary duplicates
0 + 0 primary duplicates - both read and mate mapped
0 + 0 primary duplicates - only read mapped
0 + 0 primary duplicates - cross chromosome
0 + 0 secondary duplicates
0 + 0 secondary duplicates - both read and mate mapped
0 + 0 secondary duplicates - only read mapped
0 + 0 secondary duplicates - cross chromosome
59506112 + 0 mapped (99.71%:0.00%)
59676190 + 0 paired in sequencing
29837734 + 0 read1
29838456 + 0 read2
57896985 + 0 properly paired (97.02%:0.00%)
59384680 + 0 with itself and mate mapped
121432 + 0 singletons (0.20%:0.00%)
2931808 + 0 with mate mapped to a different chr

Looking at code in FlagStat.scala and SAMRecordConverter.scala as well as the Samtools Flagstat code here
there are three causes:

  1. A likely bug in ADAM code in SAMRecordConverter.scala only sets the SupplementaryAlingmentFlag (0x800) when the SecondaryAlignmentFlag (0x100) is set, thus the 0x800 bit in input was often ignored
  2. Samtools Flagstat counts both mapped and unmapped reads in the (paired in sequencing, read1, read2) statistics, ADAM currently does not count these because ADAM does not set its own PrimaryAlignment flag for unmapped or undefined reference reads, and this value is needed in computing the stats.
  3. Samtools does not count Secondary(non primary) or Supplemental reads in read pair counts stats

Fixes to these issues have been implemented in coming pull request, these changes result in perfect concordance between ADAM and Samtools flagstats

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Feb 24, 2016

Member

Fixed via caa60d6

Member

heuermh commented Feb 24, 2016

Fixed via caa60d6

@heuermh heuermh closed this Feb 24, 2016

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Mar 28, 2016

Member

Two related questions:

  • are any of these changes such that we don't actually agree with what samtools counts, but are mimicking it for concordance's sake?
  • are there things that samtools doesn't output that we'd like to?

What I'm getting at is: should we have a flag that outputs a version that we like, and another flag (or more likely the default) that is char-for-char identical to samtools, as a sanity check?

Member

ryan-williams commented Mar 28, 2016

Two related questions:

  • are any of these changes such that we don't actually agree with what samtools counts, but are mimicking it for concordance's sake?
  • are there things that samtools doesn't output that we'd like to?

What I'm getting at is: should we have a flag that outputs a version that we like, and another flag (or more likely the default) that is char-for-char identical to samtools, as a sanity check?

@ryan-williams

This comment has been minimized.

Show comment
Hide comment
@ryan-williams

ryan-williams Apr 19, 2016

Member

FTR, I just ran an ADAM flagstat from 0.19.0 against a v1.3 samtools; this gist has each output and the diff:

< 8021767 + 0 primary duplicates
< 7984598 + 0 primary duplicates - both read and mate mapped
< 37169 + 0 primary duplicates - only read mapped
< 44210 + 0 primary duplicates - cross chromosome
< 0 + 0 secondary duplicates
< 0 + 0 secondary duplicates - both read and mate mapped
< 0 + 0 secondary duplicates - only read mapped
< 0 + 0 secondary duplicates - cross chromosome
---
> 0 + 0 secondary
> 199776 + 0 supplementary
> 8021767 + 0 duplicates
14c9
< 53968350 + 0 properly paired (98.30%:0.00%)
---
> 53968350 + 0 properly paired (98.66% : N/A)
16c11
< 79540 + 0 singletons (0.14%:0.00%)
---
> 79540 + 0 singletons (0.15% : N/A)

The only line that ADAM doesn't include (in spirit if not char for char) is the "199776 + 0 supplementary", which seems probably insignificant?

The slightly off percentages also seem to be of minimal concern since the raw numbers for those lines concord.

Member

ryan-williams commented Apr 19, 2016

FTR, I just ran an ADAM flagstat from 0.19.0 against a v1.3 samtools; this gist has each output and the diff:

< 8021767 + 0 primary duplicates
< 7984598 + 0 primary duplicates - both read and mate mapped
< 37169 + 0 primary duplicates - only read mapped
< 44210 + 0 primary duplicates - cross chromosome
< 0 + 0 secondary duplicates
< 0 + 0 secondary duplicates - both read and mate mapped
< 0 + 0 secondary duplicates - only read mapped
< 0 + 0 secondary duplicates - cross chromosome
---
> 0 + 0 secondary
> 199776 + 0 supplementary
> 8021767 + 0 duplicates
14c9
< 53968350 + 0 properly paired (98.30%:0.00%)
---
> 53968350 + 0 properly paired (98.66% : N/A)
16c11
< 79540 + 0 singletons (0.14%:0.00%)
---
> 79540 + 0 singletons (0.15% : N/A)

The only line that ADAM doesn't include (in spirit if not char for char) is the "199776 + 0 supplementary", which seems probably insignificant?

The slightly off percentages also seem to be of minimal concern since the raw numbers for those lines concord.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment