
Job won't run to completion #7

Open
Samir-A opened this issue Apr 12, 2023 · 7 comments

Comments

@Samir-A

Samir-A commented Apr 12, 2023

Hello,

I really like your tool, but I was wondering why my jobs won't run to completion. They always stop at "Performing weighted average proximity (WAP) hotspot enrichments..." Is it possible you can help me figure out why this is the case?

The commands I run are:

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_Nmasked.txt -o Cd1_Nmasked_Intron_noncoding --prefix test_global -p 5

python MutEnricher-master/mutEnricher.py noncoding /data2/samir/SNP_Hotspot/bed/Introns_merged_mm10.bed VCF_Cd1_snps_filtered.txt -o Cd1_snps_filtered_final_Intron_noncoding --prefix test_global -p 5

My BED file is attached (I changed the extension to .txt so I could upload it); it was sorted and merged:
Introns_merged_mm10.bed.txt
head:
chr1 3216968 3421701 1_Intron
chr1 3421901 3670551 2_Intron
chr1 4120073 4142611 3_Intron
chr1 4142766 4147811 4_Intron
chr1 4147963 4148611 5_Intron
chr1 4148744 4163854 6_Intron
chr1 4163941 4170204 7_Intron
chr1 4170404 4197533 8_Intron
chr1 4197641 4206659 9_Intron
chr1 4206837 4226610 10_Intron

My vcf.txt files are here:
/data2/samir/SNP_Hotspot/vcf/Cd1_Nmasked_final_sorted.mm10.vcf.gz Cd1_Nmasked
/data2/samir/SNP_Hotspot/vcf/Cd1_snps_filtered_final_sorted.mm10.vcf.gz Cd1_snps_filtered

I bgzipped, sorted, and indexed them with bgzip and bcftools.

run.txt

run2.txt

@asoltis
Owner

asoltis commented Apr 12, 2023

Are you seeing an error or is the procedure just taking a "long" time?

If you disable the WAP procedure (with the --no-wap flag in the noncoding module), does it complete?

I took a look at the regions in your BED file and noticed there are some extremely long regions (e.g. > 100kb, see histograms). The WAP procedure is likely taking a long time on these as it is looping over the length of the region. I would suggest either a) disabling the procedure (as mentioned above) or b) running on a reduced set of regions excluding these very long ones.
Introns_merged_mm10_lengths_hist.pdf
Introns_merged_mm10_lengths_hist_ylim1000.pdf
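For suggestion (b), the reduced region set could be produced with a short script. This is a minimal sketch, not part of MutEnricher: the 100 kb cutoff and the output filename are illustrative, and it assumes a whitespace-delimited BED like the head shown above.

```python
def filter_bed(in_path, out_path, max_len=100_000):
    """Copy BED records whose length is <= max_len to out_path.

    Returns (kept, dropped) counts. The 100 kb default cutoff is
    illustrative, not a MutEnricher parameter.
    """
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            start, end = int(fields[1]), int(fields[2])
            if end - start <= max_len:
                fout.write(line)
                kept += 1
            else:
                dropped += 1
    return kept, dropped

# e.g. filter_bed("Introns_merged_mm10.bed", "Introns_merged_mm10.short.bed")
```

The filtered file could then be passed to the noncoding module in place of the original BED.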

@asoltis
Owner

asoltis commented Apr 12, 2023

Also, how long a runtime are we talking about? With the WAP procedure it may not be the region lengths per se; it may simply be taking time because there are >185K total regions in your BED file. If you are able, dedicating more processors to the run can also speed things up.

@Samir-A
Author

Samir-A commented Apr 12, 2023

I have been running it for 2 days now, but I have some other runs with different annotation files that finished (with no hotspots) in around an hour. Thank you for the help, I will try your suggestions!

@Samir-A
Author

Samir-A commented Apr 12, 2023

I couldn't find --no-wap in the documentation. Can you please explain what we might lose if we use --no-wap? Because these are introns, after merging all the regions together I don't know another way to shorten the regions without losing a substantial number of introns.

Thank you

@asoltis
Owner

asoltis commented Apr 12, 2023

You can find a brief explanation for that option in the code's help text (python mutEnricher.py noncoding), which is simply:

"Select flag to skip weighted average proximity (WAP) procedure."

You can still perform the main burden and hotspot analyses without this procedure, as it is just another method to assess clustering of mutations (the [prefix]_region_WAP_enrichments.txt output will only report the overall burden p-values, and the [prefix]_region_WAP_hotspot_Fisher_enrichments.txt output will only combine the burden and hotspot p-values). The WAP procedure may considerably extend runtime with many long regions because it is a permutation procedure that can run up to 1e6 iterations; it runs for a minimum of 1,000 iterations but can exit early if a stopping criterion is met. Given your run times, it seems advisable to disable this procedure. For particular regions of interest, you could run the code again with a limited BED file and the WAP procedure enabled to generate these statistics.
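As a generic illustration of the early-exit idea described above (a hedged sketch of a sequential permutation test in general, not MutEnricher's actual WAP implementation; the statistic, iteration counts, and stopping rule are all illustrative):

```python
import random

def perm_pvalue(observed, statistic, sampler,
                min_iter=1000, max_iter=1_000_000, stop_hits=100, seed=0):
    """Permutation p-value with early stopping (illustrative sketch).

    Runs at least min_iter permutations; once stop_hits permuted statistics
    have reached the observed value, the p-value estimate is already stable,
    so the loop exits early instead of running to max_iter.
    Returns (p_value, iterations_used).
    """
    rng = random.Random(seed)
    hits = n = 0
    while n < max_iter:
        n += 1
        if statistic(sampler(rng)) >= observed:
            hits += 1
        if n >= min_iter and hits >= stop_hits:
            break
    return (hits + 1) / (n + 1), n  # add-one correction avoids p = 0
```

An unremarkable observation accumulates hits quickly and exits at min_iter, while a strongly clustered one rarely gets matched by permutations and runs to max_iter; with many long regions, those full-length loops are what can dominate runtime.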

@Samir-A
Author

Samir-A commented Apr 13, 2023

Thank you for the explanation! But we believe we found the issue: it is due to our VCF files. We ran a job with your example VCFs and our annotation file, and the job ran to completion. Is there a specific way the VCF files should be prepared?

@asoltis
Owner

asoltis commented Apr 13, 2023

The sample VCFs were prepared from human genome alignments, so it is possible that few variants in your VCFs line up with your BED regions (you can check in the outputs); this could explain the quicker run time. How many VCFs in total are you using? If you run without the WAP procedure, do you see many more variants corresponding to the regions in your BED files?

There is no special processing needed for noncoding analyses, other than sorting and indexing with tabix.
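One quick way to check how many variants actually land in the BED regions is a stand-alone count. A minimal sketch, assuming an uncompressed VCF and a merged (non-overlapping, sorted) BED like the one in this thread; the filenames are placeholders and none of this is MutEnricher's own API:

```python
import bisect

def load_bed(path):
    """Read a BED file into {chrom: sorted [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            regions.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    for ivs in regions.values():
        ivs.sort()
    return regions

def count_in_regions(vcf_path, regions):
    """Return (variants inside a region, total variants) for a plain-text VCF.

    Assumes the BED intervals are merged (non-overlapping), as in this thread.
    """
    starts = {c: [s for s, _ in ivs] for c, ivs in regions.items()}
    inside = total = 0
    with open(vcf_path) as f:
        for line in f:
            if line.startswith("#"):
                continue
            chrom, pos = line.split("\t")[:2]
            total += 1
            ivs = regions.get(chrom)
            if not ivs:
                continue
            pos0 = int(pos) - 1  # VCF POS is 1-based; BED is 0-based
            # last interval whose start is <= pos0, then check its end
            i = bisect.bisect_right(starts[chrom], pos0) - 1
            if i >= 0 and pos0 < ivs[i][1]:
                inside += 1
    return inside, total
```

If the inside count is near zero, that would line up with the quick completion seen with mismatched genome builds.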
