Difference running markdups with and without projection #1014

Closed
jpdna opened this Issue Apr 24, 2016 · 1 comment

jpdna commented Apr 24, 2016

I see a seemingly minor but unexpected difference when running ADAM mark duplicates with and without the -limit_projection flag. My assumption was that this flag is just a performance optimization and should not affect the result.

I noticed the difference when running ADAM flagstat on two ADAM files produced by mark duplicates, one generated with the -limit_projection flag:

adam-submit --executor-memory=14G --master spark://127.0.0.1:7077 --conf spark.sql.shuffle.partitions=121 -- transform ./input.adam ./output.dupsmarked.adam -mark_duplicate_reads -limit_projection

and one generated by the same command without -limit_projection.

The difference is a count of 3 shifting between two flagstat categories, "primary duplicates - both read and mate mapped" and "primary duplicates - only read mapped". The overall "primary duplicates" count is the same in both runs.

Without -limit_projection:

38632222 + 0 in total (QC-passed reads + QC-failed reads)
1765963 + 0 primary duplicates
1290944 + 0 primary duplicates - both read and mate mapped
475019 + 0 primary duplicates - only read mapped
535062 + 0 primary duplicates - cross chromosome
0 + 0 secondary duplicates
0 + 0 secondary duplicates - both read and mate mapped
0 + 0 secondary duplicates - only read mapped
0 + 0 secondary duplicates - cross chromosome
36854876 + 0 mapped (95.40%:0.00%)
32767453 + 0 paired in sequencing
16383937 + 0 read1
16383516 + 0 read2
26170970 + 0 properly paired (67.74%:0.00%)
29212761 + 0 with itself and mate mapped
1777346 + 0 singletons (4.60%:0.00%)
1324347 + 0 with mate mapped to a different chr
676289 + 0 with mate mapped to a different chr (mapQ>=5)

With -limit_projection:

38632222 + 0 in total (QC-passed reads + QC-failed reads)
1765963 + 0 primary duplicates
1290947 + 0 primary duplicates - both read and mate mapped
475016 + 0 primary duplicates - only read mapped
535062 + 0 primary duplicates - cross chromosome
0 + 0 secondary duplicates
0 + 0 secondary duplicates - both read and mate mapped
0 + 0 secondary duplicates - only read mapped
0 + 0 secondary duplicates - cross chromosome
36854876 + 0 mapped (95.40%:0.00%)
32767453 + 0 paired in sequencing
16383937 + 0 read1
16383516 + 0 read2
26170970 + 0 properly paired (67.74%:0.00%)
29212761 + 0 with itself and mate mapped
1777346 + 0 singletons (4.60%:0.00%)
1324347 + 0 with mate mapped to a different chr
676289 + 0 with mate mapped to a different chr (mapQ>=5)
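
To pin down exactly which categories moved, the two flagstat outputs can be compared mechanically. The following is a minimal sketch, not part of ADAM: the helper names `parse_flagstat` and `diff_flagstat` are hypothetical, and only the three primary duplicate rows from the outputs above are included for brevity.

```python
import re

def parse_flagstat(text):
    """Map each flagstat category label to its (passed, failed) counts."""
    counts = {}
    for line in text.strip().splitlines():
        m = re.match(r"\s*(\d+) \+ (\d+) (.+)", line)
        if m:
            passed, failed, label = m.groups()
            # Drop a trailing percentage such as "(95.40%:0.00%)" from the label.
            label = re.sub(r"\s*\([\d.]+%:[\d.]+%\)$", "", label).strip()
            counts[label] = (int(passed), int(failed))
    return counts

def diff_flagstat(a, b):
    """Return {label: change in QC-passed count} for categories that differ."""
    return {
        label: b[label][0] - a[label][0]
        for label in a
        if label in b and a[label] != b[label]
    }

# Rows copied from the two runs reported in this issue.
without_projection = """
1765963 + 0 primary duplicates
1290944 + 0 primary duplicates - both read and mate mapped
475019 + 0 primary duplicates - only read mapped
"""
with_projection = """
1765963 + 0 primary duplicates
1290947 + 0 primary duplicates - both read and mate mapped
475016 + 0 primary duplicates - only read mapped
"""

delta = diff_flagstat(parse_flagstat(without_projection),
                      parse_flagstat(with_projection))
for label, change in delta.items():
    print(f"{change:+d}  {label}")
```

The two changes cancel out (+3 and -3), which is consistent with the same reads being marked as duplicates in both runs and only their mate-mapping classification differing.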

@fnothaft fnothaft added this to the 0.20.0 milestone Jul 20, 2016

@heuermh heuermh modified the milestones: 0.20.0, 0.22.0 Oct 13, 2016

fnothaft commented Mar 3, 2017

As referenced from #941, this was caused by an issue in the projection that dropped the read-in-fragment field. Closing; @jpdna let me know if I am incorrect.

@fnothaft fnothaft closed this Mar 3, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Mar 8, 2017
