Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency problem when using ‘Scatter Gather‘ model in Mutect2 #8152

Closed
imDpeng opened this issue Jan 11, 2023 · 3 comments
Closed

Consistency problem when using ‘Scatter Gather‘ model in Mutect2 #8152

imDpeng opened this issue Jan 11, 2023 · 3 comments

Comments

@imDpeng
Copy link

imDpeng commented Jan 11, 2023

Bug Report

Affected tool(s) or class(es)

Mutect2

Affected version(s)

I test this problem in two versions, V4.1.4.1 and V4.3.0.0.They all have this problem.

Description

Following the recommendations of the 'Best Practice Workflows', I run mutect2 in the following command.

java -jar -Djava.io.tmpdir=${tmpDir} -Xms2g -Xmx16g
/mnt/bin/gatk-4.1.4.1/gatk-package-4.1.4.1-SNAPSHOT-local.jar Mutect2
--native-pair-hmm-threads 32
-R ${Fasta}
-I ${cancer_bam}
-I ${normal_bam}
--tumor-sample cancer --normal-sample normal
-L ${all_chrome_bed}
--bam-output ${bam_output}
-O ${vcf_output}

To improve parallelism, I try to split my all chrome bed to 25 files.Parallel running the flowing command brings me signficient performance improvement.

java -jar -Djava.io.tmpdir=${tmpDir} -Xms2g -Xmx16g
/mnt/bin/gatk-4.1.4.1/gatk-package-4.1.4.1-SNAPSHOT-local.jar Mutect2
--native-pair-hmm-threads 32
-R ${Fasta}
-I ${cancer_bam}
-I ${normal_bam}
--tumor-sample cancer --normal-sample normal
-L ${chr1_bed}
--bam-output ${chr1_bam_output}
-O ${chr1_vcf_output}

java -jar -Djava.io.tmpdir=${tmpDir} -Xms2g -Xmx16g
/mnt/bin/gatk-4.1.4.1/gatk-package-4.1.4.1-SNAPSHOT-local.jar Mutect2
--native-pair-hmm-threads 32
-R ${Fasta}
-I ${cancer_bam}
-I ${normal_bam}
--tumor-sample cancer --normal-sample normal
-L ${chr2_bed}
--bam-output ${chr2_bam_output}
-O ${chr2_vcf_output}

But when I examined the vcf results produced by both modes of operation, I found consistency issues.

Expected behavior

Let's focus on chromosome 2.I expect 100% consistency between the following two runs.

  1. The vcf file is obtained using a bed file containing only chromosome 2.
  2. Use bed file with all chromosomes to get all calling results, then filter to get chromesome 2 calling result.

Actual behavior

  1. The first method above gives one more result than the second.
  2. There are 168 vcf results inconsistent, out of 1247 total.One of the inconsistencies is shown below.

21 44513281 . ACCACCG A . . AS_SB_TABLE=1282,1133|5,3;DP=3142;ECNT=2;MBQ=33,35;MFRL=173,197;MMQ=60,60;MPOS=25;NALOD=1.21;NLOD=161.38;POPAF=6.00;RPA=3,2;RU=CCACCG;STR;TLOD=11.75 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|1:1810,7:4.456e-03:1817:555,1:575,4:1581,6:0|1:44513281_ACCACCG_A:44513281:980,830,4,3 0|0:605,1:3.469e-03:606:179,0:184,1:560,1:0|1:44513281_ACCACCG_A:44513281:302,303,1,0

21 44513281 . ACCACCG A . . AS_SB_TABLE=1282,1134|5,3;DP=3143;ECNT=2;MBQ=33,35;MFRL=172,197;MMQ=60,60;MPOS=25;NALOD=1.21;NLOD=161.08;POPAF=6.00;RPA=3,2;RU=CCACCG;STR;TLOD=11.75 GT:AD:AF:DP:F1R2:F2R1:FAD:PGT:PID:PS:SB 0|1:1811,7:4.462e-03:1818:555,1:572,4:1579,6:0|1:44513281_ACCACCG_A:44513281:982,829,4,3 0|0:605,1:3.474e-03:606:178,0:184,1:559,1:0|1:44513281_ACCACCG_A:44513281:300,305,1,0

Feature request

How to Scatter Gather correctly can ensure that the results are exactly the same.

@davidbenjamin
Copy link
Contributor

If your all_chrome_bed is identical to the concatenation of chr1_bed, chr2_bed etc I would have expected no discrepancies. Fortunately, the discrepancies are very small, and the annotations with the discrepancies will not exist in the next version of Mutect, which we will be releasing in a few months.

@limwz01
Copy link

limwz01 commented Mar 7, 2024

@davidbenjamin May I know which release fixed this bug, if it was fixed? Thanks!

@davidbenjamin
Copy link
Contributor

I'm going to close this issue because it's not a bug. Several things in the code of Mutect2 and FilterMutectCalls adapt as they traverse the genome and it's possible that some learned parameter shifts minutely. For example, the assembly graph pruning algorithm uses knowledge of previously assembled regions to better distinguish between errors and somatic variation. It's also possible that somewhere we forgot to give something a fixed random seed.

In full honesty, I wish that I knew exactly what causes the 3142 to become 3143, and I regret that I don't have time for it. Nonetheless, in principle it is not cause for alarm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants