Truvari bench hangs indefinitely #182

Closed
TimD1 opened this issue Dec 15, 2023 · 3 comments

@TimD1

TimD1 commented Dec 15, 2023

Version :
v4.1.0

Describe the bug :
Truvari bench starts running, but eventually stops making progress without printing anything. No threads seem to be actively working once this happens.

To Reproduce :
truvari bench \
    -b $data/t2t-q100-v0.9/split/t2t-q100.all.vcf.gz \
    -c $data/pav-v4.0/split/pav.all.vcf.gz \
    -f $data/refs/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta \
    --bSample HG002 \
    --cSample HG002 \
    --includebed $data/giab-tr-v4.20/GIABTR.HG002.benchmark.regions.bed \
    --no-ref a \
    --sizemin 1 --sizefilt 1 --sizemax 1000 \
    --pick single \
    --typeignore \
    --dup-to-ins \
    -o ./truvari_tr/${query_names[i]} \
    2> ./truvari_tr/${query_names[i]}.log

Expected behavior :
Truvari should run to completion.

Example Data :
I've provided the query (PAV) VCF, truth (T2T-Q100) VCF, and log file in a Google Drive link.

Additional context :
With two other WGS query VCFs and the same truth VCF for the same GIAB-TR benchmarking regions with these parameters, Truvari completes in about 15 minutes. I let this one run for 3 days, to no avail. The other two WGS query VCFs later hang indefinitely on Truvari refine.

@ACEnglish
Owner

There's a huge deletion on chromosome 4, ID=chr4-49657849-DEL-140440990. It spans many TR regions, and each region performs a pysam.VariantFile.fetch that has to parse it. That variant alone takes up 38M in a gzipped VCF. Removing it from the VCF allows the job to complete.

Chromosome 9 also has some larger variants that might need to be pre-filtered.

# LEN              ID
-140440990	chr4-49657849-DEL-140440990
-22543055	chr9-43222012-DEL-22543055
-20115032	chr9-42684836-DEL-20115032
-19608353	chr9-40910205-DEL-19608353
-4215131	chr21-5393558-DEL-4215131
-2828305	chr5-46867696-DEL-2828305
-2818263	chr9-62556860-DEL-2818263
-2240586	chr9-60559282-DEL-2240586
-1664861	chr9-40910205-DEL-1664861
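
For anyone wanting to try the pre-filtering suggested above, here is a minimal sketch using only the Python standard library. It is not part of Truvari: `variant_length` and `drop_large_variants` are hypothetical helper names, and the approach assumes sequence-resolved REF/ALT alleles, falling back to the first INFO/SVLEN value for symbolic alleles. Records whose length cannot be determined are kept.

```python
import re

def variant_length(ref, alt, info):
    """Signed length change of one ALT allele, or None if it cannot be determined."""
    if alt.startswith("<") or "[" in alt or "]" in alt or alt in (".", "*"):
        # Symbolic or breakend allele: use the first INFO/SVLEN value if present.
        match = re.search(r"(?:^|;)SVLEN=(-?\d+)", info)
        return int(match.group(1)) if match else None
    return len(alt) - len(ref)

def drop_large_variants(lines, max_len):
    """Yield VCF lines, skipping records with any allele longer than max_len."""
    for line in lines:
        if line.startswith("#"):
            # Header lines pass through untouched.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts, info = fields[3], fields[4].split(","), fields[7]
        lengths = [variant_length(ref, alt, info) for alt in alts]
        if any(n is not None and abs(n) > max_len for n in lengths):
            continue
        yield line
```

The generator could be fed lines from `gzip.open(path, "rt")`. The filtered output would still need to be compressed with bgzip and indexed with tabix before being passed to Truvari.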

@TimD1
Author

TimD1 commented Dec 16, 2023

Thanks for looking into this! I never thought to check if large INDELs were slowing things down, since I assumed the --sizemax flag was excluding them from the analysis entirely.

@TimD1
Author

TimD1 commented Jan 4, 2024

Reopening this issue, but now with Truvari refine (Truvari bench now succeeds on this input in ~15 minutes, thanks!). I've filtered out all remotely large variants and all inversions as follows:

bcftools view \
    -i 'TYPE=="SNP" || (ILEN < 1000 && ILEN > -1000)' \
    pav.all.vcf.gz |
    grep -v "INV" > pav.most.vcf
bgzip -f pav.most.vcf
tabix -p vcf pav.most.vcf.gz

I ran Truvari WFA refine on the bench results, limiting the GIAB-TR regions to candidate.refine.bed. I've included my log file here: pav.log.

On this attempt, I defined the default /tmp directory to be located on an external hard drive. It crashed after 2.5 hours with a MemoryError, after using 475GB of /x/tmp storage.

On a previous attempt, it hung for a few days after filling the /tmp directory with 185GB of data and then the current directory with another 600GB of data.

All of this data is located in files named tmp********, and they appear to be FASTA files auto-generated by samtools or something. How much space is this expected to take, and how long should the analysis run for?

Thanks in advance, sorry for opening so many issues. I'd be happy to help provide any more info to get this working.
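
To put numbers on the temporary-file growth described above, a small standard-library sketch like the following could be run periodically against the temp directory while the job is going. `temp_usage` is a hypothetical helper, not a Truvari function, and the `tmp*` pattern is an assumption based on the file names reported here. Names of that shape are what Python's `tempfile` module produces by default, but this thread does not establish which tool actually writes these files.

```python
from pathlib import Path

def temp_usage(directory, pattern="tmp*"):
    """Return (file_count, total_bytes) for files matching pattern directly in directory."""
    count = 0
    total = 0
    for path in Path(directory).glob(pattern):
        try:
            if path.is_file():
                total += path.stat().st_size
                count += 1
        except OSError:
            # A file may vanish between listing and stat; skip it.
            continue
    return count, total
```

For example, `temp_usage("/x/tmp")` would report how many matching files exist at that moment and their combined size in bytes. It only looks at the top level of the directory, not subdirectories.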
