Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LCR regions still in final VCF #3609

Closed
adammaikai opened this issue Jan 21, 2022 · 2 comments
Closed

LCR regions still in final VCF #3609

adammaikai opened this issue Jan 21, 2022 · 2 comments

Comments

@adammaikai
Copy link

My germline WGS run completed successfully, however, I double checked the LCR regions using a bedtools intersectBed of the final vcf and the provided LCR.bed.gz, and variants were still present.

intersectBed -wa -a RF_0890-joint-gatk-haplotype-annotated.vcf.gz -b /shared/workspace/software/bcbio/genomes/Hsapiens/hg38/coverage/problem_regions/repeats/LCR.bed.gz | head

chr1	94808	.	CCTTTT	C,<NON_REF>	7.6	PASS	BaseQRankSum=0.319;ClippingRankSum=-1.282;DP=7;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MMQ=39,23,60;MQ0=0;MQRankSum=-1.282;RAW_MQandDP=8545,7;ReadPosRankSum=1.15	GT:AD:DP:GQ:PL:SB	0/1:4,1,0:5:15:15,0,165,27,168,195:0,4,0,1
chr1	269239	.	TA	T,TAA,TAAA,<NON_REF>	78.6	PASS	BaseQRankSum=0.613;ClippingRankSum=-0.52;DP=63;ExcessHet=3.0103;MLEAC=0,1,0,0;MLEAF=0,0.5,0,0;MMQ=60,40,40,45,60;MQ0=0;MQRankSum=-4.102;RAW_MQandDP=150027,63;ReadPosRankSum=0.745	GT:AD:DP:GQ:PL:SB	0/2:40,2,8,3,0:53:86:86,191,1174,0,807,839,113,1058,923,1387,211,1076,885,1128,1126:23,17,4,9
chr1	832736	.	AGTTTT	A,<NON_REF>	887.0	PASS	DP=30;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;MMQ=60,60,60;MQ0=0;RAW_MQandDP=107881,30	GT:AD:DP:GQ:PL:SB	1/1:0,20,0:20:61:901,61,0,901,61,901:0,0,11,9
chr1	1016060	.	CT	C,<NON_REF>	165.6	PASS	BaseQRankSum=-1.208;ClippingRankSum=0.083;DP=24;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MMQ=60,60,60;MQ0=0;MQRankSum=0;RAW_MQandDP=82281,24;ReadPosRankSum=-0.207	GT:AD:DP:GQ:PL:SB	0/1:8,11,0:19:99:173,0,144,197,172,369:5,3,4,7
chr1	1028320	.	GCC	G,<NON_REF>	661.0	PASS	DP=15;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1,0;MMQ=60,60,60;MQ0=0;RAW_MQandDP=54000,15	GT:AD:DP:GQ:PL:SB	1/1:0,15,0:15:45:675,45,0,675,45,675:0,0,0,15
chr1	1048767	.	CAAAAAAAAAAA	C,<NON_REF>	388.6	PASS	BaseQRankSum=2.617;ClippingRankSum=0.083;DP=21;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MMQ=60,60,60;MQ0=0;MQRankSum=0;RAW_MQandDP=74916,21;ReadPosRankSum=0.327	GT:AD:DP:GQ:PL:SB	0/1:8,10,1:19:99:396,0,306,420,336,756:4,4,8,3
chr1	1190111	.	CA	C,<NON_REF>	0.0	PASS	BaseQRankSum=-0.587;ClippingRankSum=1.658;DP=31;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MMQ=60,60,60;MQ0=0;MQRankSum=0;RAW_MQandDP=111600,31;ReadPosRankSum=0.501	GT:AD:DP:GQ:PL:SB	0/0:23,2,0:25:25:0,25,499,68,505,548:13,10,1,1
chr1	1214845	.	CA	C,CAA,<NON_REF>	0.0	PASS	BaseQRankSum=1.363;ClippingRankSum=-0.679;DP=38;ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0,0,0;MMQ=60,60,60,60;MQ0=0;MQRankSum=0;RAW_MQandDP=136800,38;ReadPosRankSum=0	GT:AD:DP:GQ:PL:SB	0/0:27,3,2,1:33:20:0,20,636,46,554,639,93,626,641,701:13,14,3,3
chr1	1251356	.	AT	A,<NON_REF>	344.6	PASS	BaseQRankSum=-1.245;ClippingRankSum=-1.57;DP=25;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.5,0;MMQ=60,60,60;MQ0=0;MQRankSum=-0.687;RAW_MQandDP=88418,25;ReadPosRankSum=2.163	GT:AD:DP:GQ:PL:SB	0/1:5,17,1:23:56:352,0,56,367,107,475:4,1,7,11
chr1	1256888	.	GA	G,TA,<NON_REF>	1.8	PASS	BaseQRankSum=-0.18;ClippingRankSum=0.566;DP=12;ExcessHet=3.0103;MLEAC=0,0,0;MLEAF=0,0,0;MMQ=60,60,60,60;MQ0=0;MQRankSum=0;RAW_MQandDP=37354,12;ReadPosRankSum=-0.484	GT:AD:DP:GQ:PL:SB	0/1:4,2,1,0:7:9:9,0,79,29,69,286,38,91,205,202:3,1,2,1

I even see the bed file being used in the pipeline from the commands log, for example:

bcbio-nextgen-commands.log:[2021-12-03T13:28Z] bedtools subtract -nonamecheck -a /scratch/germline/RF_0890/gatk-haplotype/chr7/RF_0890-joint-chr7_152549199_159345973-regions.bed -b /shared/workspace/software/bcbio/genomes/Hsapiens/hg38/coverage/problem_regions/repeats/LCR.bed.gz > /scratch/germline/RF_0890/bcbiotx/tmpwxxyfsbq/RF_0890-joint-chr7_152549199_159345973-regions-nolcr.bed

Did I misunderstand that step in the pipeline?

Version info

  • bcbio version (bcbio_nextgen.py --version): 1.2.8
  • OS name and version (lsb_release -ds): Ubuntu 16.04.7 LTS

To Reproduce
Exact bcbio command you have used:

cd $workspace && bcbio_nextgen.py \
	-w template $template_yaml \
	$metadata \
	$workspace \
	--workdir $workspace

bcbio_nextgen.py \
	$workspace/"${base_meta%.*}"/config/"${base_meta%.*}".yaml \
	-n 16 \
	--workdir $workspace

Your yaml configuration file:

details:
- algorithm:
    aligner: bwa
    mark_duplicates: true
    #recalibrate: gatk
    exclude_regions: [altcontigs, highdepth, lcr]
    variantcaller: gatk-haplotype
    tools_on: [gatk4, gvcf, qualimap_full]
    tools_off: [gemini]
    svcaller: cnvkit
    svprioritize: actionable/ACMG56
    effects: false
  analysis: variant2
  genome_build: hg38
upload:
  dir: final

Log files (could be found in work/log)

Please attach (10MB max): bcbio-nextgen-commands.log, and bcbio-nextgen-debug.log.

bcbio-nextgen-commands.log
bcbio-nextgen.log

@naumenko-sa
Copy link
Contributor

Hi @adammaikai

Thanks for reporting. A fair point!

I think the filtration does happen, but what we see is a boundary effect: there should not be many of these leftover variants.

I've checked it on a smaller WES chr22 example. Please note that I reported both -wa and -wb entries:

$ bedtools intersect -wa -wb -a NA12878-gatk-haplotype.vcf.gz -b /n/app/bcbio/1.2.9/genomes/Hsapiens/hg38/coverage/problem_regions/repeats/LCR.bed.gz | head | awk -F '\t' '{print $1,$2,$4,$5,$(NF-2),$(NF-1),$NF}'
chr22 20953424 AAAC A chr22 20953424 20953452
chr22 21614427 CA C chr22 21614427 21614460

So these two variants are right on the left boundary of the interval, and the exclusion logic did not capture them.

In your case:
chr1 94808 . CCTTTT C,<NON_REF> vs chr 1 94809 94847
chr1 1256888 . GA G,TA,<NON_REF> vs chr1 1256888 1256911

So I think these leftover variants are reported due to discrepancies between bedtools intersect and bedtools subtract.

You may inspect the nolcr bed files in `work/gatk-haplotype/chrXX/nolcr.bed.
In my case it has :

chr22 20952753 20953424

(note that the interval > chr2:20953424 is gone).

So this bed file is applied to filter the result vcf and chr22 20953424 AAAC A is included.

Sergey

@adammaikai
Copy link
Author

Thank you for the clarification, that was very helpful.
Adam

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants