ramdom BUGS for germline copy number variants calling with GATK 4. PythonScriptExecutorException #6235

Closed
xysj1989 opened this issue Oct 28, 2019 · 9 comments · Fixed by #6244


xysj1989 commented Oct 28, 2019

Hi,

I am trying to call common and rare germline copy number variants with GATK 4 for more than 100 human samples, using the hg19 human reference genome. For this project I have 500 GB of memory, 10 TB of storage, and 300 CPU cores. The software versions are:

GATK Version: 4.1.2.0
Openjdk Version: 1.8.0_232
Python Version: 3.6.8

I am not using the WDL workflow. I followed the Notebook#11684 documentation and built a local pipeline, splitting the project by chromosome (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrMT).

After finishing the pipeline, I tested it with 6 samples.

When I submit the script for each chromosome separately, every sub-project runs fine: I get the corresponding VCF files from my input BAM files (10 cores and 10 GB per sub-project). So the GATK and Python environment for germline copy number variant calling should be OK.

However, when I submit all 25 sub-projects at the same time (12 cores and 12 GB per sub-project), some sub-projects randomly fail with one of the following two PythonScriptExecutorException errors:

.............................................................(BUG 001)..........................................................

Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.3351404099122294482.py", line 8, in <module>
import gcnvkernel
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/gcnvkernel/__init__.py", line 1, in <module>
from pymc3 import __version__ as pymc3_version
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/__init__.py", line 5, in <module>
from .distributions import *
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/__init__.py", line 1, in <module>
from . import timeseries
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/timeseries.py", line 5, in <module>
from .continuous import get_tau_sd, Normal, Flat
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 16, in <module>
from pymc3.theanof import floatX
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/theanof.py", line 89, in <module>
empty_gradient = tt.zeros(0, dtype='float32')
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 2558, in zeros
return alloc(np.array(0, dtype=dtype), *shape)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 3091, in __call__
ret = super(Alloc, self).__call__(val, *shapes, **kwargs)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
no_recycling=[])
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1623, in cthunk_factory
module = get_module_cache().module_from_key(
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 48, in get_module_cache
return cmodule.get_module_cache(config.compiledir, init_args=init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1587, in get_module_cache
_module_cache = ModuleCache(dirname, **init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 703, in __init__
self.refresh()
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 794, in refresh
files = os.listdir(root)
FileNotFoundError: [Errno 2] No such file or directory: '/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/tmpmy0w17z3'
00:34:39.396 DEBUG ScriptExecutor - Result: 1
00:34:39.397 INFO DetermineGermlineContigPloidy - Shutting down engine
[October 27, 2019 12:34:39 AM EDT] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 0.66 minutes.
Runtime.totalMemory()=2151677952
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_determine_ploidy_and_depth.3351404099122294482.py --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig8898090777596224038.tsv --output_calls_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir/ploidy-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --mean_bias_sd=1.000000e-02 --psi_j_scale=1.000000e-03 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --interval_list=/tmp/intervals8430607484736018931.tsv --contig_ploidy_prior_table=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/3-contig_ploidy_priors/22.contig_ploidy_priors.csv --output_model_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir/ploidy-model
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:411)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:288)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /usr/local/apps/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/apps/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar DetermineGermlineContigPloidy -L /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/2-Filter_Interval/22.preprocessed.Filtered.interval_list --interval-merging-rule OVERLAPPING_ONLY -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349574.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349575.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349488.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349489.bam.csv --contig-ploidy-priors /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/3-contig_ploidy_priors/22.contig_ploidy_priors.csv --output /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir --output-prefix ploidy --verbosity DEBUG

.............................................................(BUG 002)..........................................................
Stderr: Traceback (most recent call last):
File "/tmp/segment_gcnv_calls.3402406683372415608.py", line 9, in <module>
import gcnvkernel
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/gcnvkernel/__init__.py", line 1, in <module>
from pymc3 import __version__ as pymc3_version
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/__init__.py", line 5, in <module>
from .distributions import *
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/__init__.py", line 1, in <module>
from . import timeseries
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/timeseries.py", line 5, in <module>
from .continuous import get_tau_sd, Normal, Flat
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 16, in <module>
from pymc3.theanof import floatX
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/theanof.py", line 89, in <module>
empty_gradient = tt.zeros(0, dtype='float32')
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 2558, in zeros
return alloc(np.array(0, dtype=dtype), *shape)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 3091, in __call__
ret = super(Alloc, self).__call__(val, *shapes, **kwargs)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
no_recycling=[])
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1623, in cthunk_factory
module = get_module_cache().module_from_key(
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 48, in get_module_cache
return cmodule.get_module_cache(config.compiledir, init_args=init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1587, in get_module_cache
_module_cache = ModuleCache(dirname, **init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 703, in __init__
self.refresh()
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 794, in refresh
files = os.listdir(root)
FileNotFoundError: [Errno 2] No such file or directory: '/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/tmp3mkfuhpw'
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.executeSegmentGermlineCNVCallsPythonScript(PostprocessGermlineCNVCalls.java:509)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.generateSegmentsVCFFileFromAllShards(PostprocessGermlineCNVCalls.java:447)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.onTraversalSuccess(PostprocessGermlineCNVCalls.java:304)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1041)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)

................................................................................................................................................
................................................................................................................................................
................................................................................................................................................

These exceptions happen randomly in the following two tools:
(1) DetermineGermlineContigPloidy
(2) PostprocessGermlineCNVCalls

I have tried 6 times. Each time, fewer than 6 sub-projects (chromosomes) failed with one of the two PythonScriptExecutorException errors above, while the other sub-projects completed normally. The set of failed chromosomes was different each time.

(1) Could you please help me solve this problem? Does it mean that the current version of the GATK germline CNV calling process does not support running several jobs in parallel on a high-performance computing cluster, because of a potential conflict between the concurrent jobs?

(2) I notice that several temporary directories and files are generated under "/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/". I did not specify this location myself, and the files are never deleted. Are these temporary files generated by Theano? How can I redirect them to a directory of my choice?

Best regards.

@xysj1989 xysj1989 changed the title ramdom BUGS for germline copy number variants with GATK 4. PythonScriptExecutorException ramdom BUGS for germline copy number variants calling with GATK 4. PythonScriptExecutorException Oct 28, 2019
@xysj1989
Author

xysj1989 commented Oct 29, 2019

Hi,

Last night I tried again, and when I submitted all 25 sub-projects, a similar exception occurred in GermlineCNVCaller. The problem seems to come from gcnvkernel when parallel jobs are submitted at the same time.

.............................................................(BUG 003)..........................................................

00:50:20.554 DEBUG ScriptExecutor - /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-07410307475890858352.tsv
00:50:20.554 DEBUG ScriptExecutor - /gpfs/gsfs7/users/gatk_users1/0-Project//0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-12290301678667639499.tsv
00:50:20.554 DEBUG ScriptExecutor - /gpfs/gsfs7/users/gatk_users1/0-Project//0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-21824691337189197401.tsv
00:50:20.554 DEBUG ScriptExecutor - /gpfs/gsfs7/users/gatk_users1/0-Project//0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-31776045115104931009.tsv
00:50:20.554 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
00:50:20.554 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
00:50:20.554 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
00:50:20.554 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
00:50:20.554 DEBUG ScriptExecutor - --max_bias_factors=5
00:50:20.554 DEBUG ScriptExecutor - --psi_t_scale=1.000000e-03
00:50:20.554 DEBUG ScriptExecutor - --log_mean_bias_std=1.000000e-01
00:50:20.554 DEBUG ScriptExecutor - --init_ard_rel_unexplained_variance=1.000000e-01
00:50:20.554 DEBUG ScriptExecutor - --num_gc_bins=20
00:50:20.554 DEBUG ScriptExecutor - --gc_curve_sd=1.000000e+00
00:50:20.554 DEBUG ScriptExecutor - --active_class_padding_hybrid_mode=50000
00:50:20.554 DEBUG ScriptExecutor - --enable_bias_factors=True
00:50:20.554 DEBUG ScriptExecutor - --disable_bias_factors_in_active_class=False
00:50:20.554 DEBUG ScriptExecutor - --p_alt=1.000000e-06
00:50:20.554 DEBUG ScriptExecutor - --cnv_coherence_length=1.000000e+04
00:50:20.555 DEBUG ScriptExecutor - --max_copy_number=5
00:50:20.555 DEBUG ScriptExecutor - --p_active=0.010000
00:50:20.555 DEBUG ScriptExecutor - --class_coherence_length=10000.000000
00:50:20.555 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
00:50:20.555 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
00:50:20.555 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
00:50:20.555 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
00:50:20.555 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
00:50:20.555 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
00:50:20.555 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
00:50:20.555 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
00:50:20.555 DEBUG ScriptExecutor - --min_training_epochs=10
00:50:20.555 DEBUG ScriptExecutor - --max_training_epochs=50
00:50:20.555 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
00:50:20.555 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
00:50:20.555 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
00:50:20.555 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
00:50:20.555 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
00:50:20.555 DEBUG ScriptExecutor - --max_calling_iters=10
00:50:20.555 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
00:50:20.555 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
00:50:20.555 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
00:50:20.555 DEBUG ScriptExecutor - --disable_caller=false
00:50:20.555 DEBUG ScriptExecutor - --disable_sampler=false
00:50:20.555 DEBUG ScriptExecutor - --disable_annealing=false
Traceback (most recent call last):
File "/data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/cohort_denoising_calling.7177495255490777642.py", line 10, in <module>
import gcnvkernel
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/gcnvkernel/__init__.py", line 1, in <module>
from pymc3 import __version__ as pymc3_version
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/__init__.py", line 5, in <module>
from .distributions import *
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/__init__.py", line 1, in <module>
from . import timeseries
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/timeseries.py", line 5, in <module>
from .continuous import get_tau_sd, Normal, Flat
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 16, in <module>
from pymc3.theanof import floatX
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/theanof.py", line 89, in <module>
empty_gradient = tt.zeros(0, dtype='float32')
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 2558, in zeros
return alloc(np.array(0, dtype=dtype), *shape)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 3091, in __call__
ret = super(Alloc, self).__call__(val, *shapes, **kwargs)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
no_recycling=[])
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1623, in cthunk_factory
module = get_module_cache().module_from_key(
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 48, in get_module_cache
return cmodule.get_module_cache(config.compiledir, init_args=init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1587, in get_module_cache
_module_cache = ModuleCache(dirname, **init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 703, in __init__
self.refresh()
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 794, in refresh
files = os.listdir(root)
FileNotFoundError: [Errno 2] No such file or directory: '/spin1/home/linux/sangj2/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/tmpyvtzjxzm'
00:50:23.369 DEBUG ScriptExecutor - Result: 1
00:50:23.370 INFO GermlineCNVCaller - Shutting down engine
[October 29, 2019 12:50:23 AM EDT] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 0.72 minutes.
Runtime.totalMemory()=2335703040
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/cohort_denoising_calling.7177495255490777642.py --ploidy_calls_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/1-Contig-Ploidy/14.Contig_Ploidy_Dir/ploidy-calls --output_calls_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/2-Germline-CNV/14.Germline-CNV/CNV-calls --output_tracking_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/2-Germline-CNV/14.Germline-CNV/CNV-tracking --modeling_interval_list=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/intervals8729903857029540703.tsv --output_model_path=/gpfs/gsfs7/users/sangj2/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/2-Germline-CNV/14.Germline-CNV/CNV-model --enable_explicit_gc_bias_modeling=True --read_count_tsv_files /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-07410307475890858352.tsv /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-12290301678667639499.tsv /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-21824691337189197401.tsv /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-31776045115104931009.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 
--enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=1.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=200 --min_training_epochs=10 --max_training_epochs=50 --initial_temperature=1.500000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:441)
at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:292)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
.........................................................................................................................................................
.........................................................................................................................................................
.........................................................................................................................................................

@samuelklee
Contributor

@droazen Mark has been CNV tech lead for some time now, so I’ll let him take a first crack at this or delegate. However, I will point out #4782, which is tangentially related. Looks like handling the global compiler lock appropriately should also address the main issue. Finally, I’ll add that we should include such computing environments in our future testing infrastructure.

@xysj1989
Author


Dear samuelklee,

Thank you for looking into this. I have just finished testing with GATK 4.1.3.0, which suffers from the same exceptions. I hope to hear good news from you and Mark soon.

Best regards and thank you!

@samuelklee
Contributor

samuelklee commented Oct 30, 2019

@xysj1989 We primarily run this workflow using the WDL on Terra. In this case, each GermlineCNVCaller shard is run on a separate VM using the GATK Docker. Hopefully, we can always at least guarantee that this default mode of running the workflow is functional and covered by tests.

However, if you'd like to instead run multiple instances of GermlineCNVCaller locally, you may need to make sure certain environment variables are set appropriately. For example, I think you can address (2) above (the location of the temporary theano directory) by either setting environment variables or modifying your Theano configuration (see http://deeplearning.net/software/theano/library/config.html). You may also want to check the GermlineCNVCaller task in the WDL to see how other variables are set there.
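As a rough illustration of the environment-variable route, here is a minimal Python sketch that builds a per-job environment with Theano's `base_compiledir` overridden through `THEANO_FLAGS`. The helper name `theano_env` and the example path are hypothetical and not part of GATK; the sketch only assumes that `THEANO_FLAGS` is a comma-separated list of `key=value` settings, as described in the Theano configuration docs linked above.

```python
import os


def theano_env(base_compiledir, base_env=None):
    """Return a copy of the environment with Theano's base_compiledir overridden.

    Hypothetical helper for illustration; it does not import or call Theano.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("THEANO_FLAGS", "")
    # THEANO_FLAGS is a comma-separated list of key=value settings.
    # Keep every existing setting except an older base_compiledir.
    kept = [
        flag
        for flag in existing.split(",")
        if flag and not flag.startswith("base_compiledir=")
    ]
    kept.append(f"base_compiledir={base_compiledir}")
    env["THEANO_FLAGS"] = ",".join(kept)
    return env


if __name__ == "__main__":
    # Placeholder path; choose one directory per concurrent job.
    print(theano_env("/scratch/example/theano_chr22")["THEANO_FLAGS"])
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching each GATK job, so that every job compiles into its own directory.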

Let me look into whether you can also address (1) in this way, or if this will require a GATK code change, and get back to you. (Of course, if you figure it out before me, please follow up!) Thanks again for bringing this to our attention.

@samuelklee
Contributor

samuelklee commented Oct 31, 2019

OK, looks like you can get around the compiler lock issues by pointing each invocation of GermlineCNVCaller to a different compilation directory. For example, invoke gatk as follows:

THEANORC=PATH/TO/THEANORC_# gatk GermlineCNVCaller ...

This uses the THEANORC environment variable to set the .theanorc configuration file to PATH/TO/THEANORC_# for this instance of GATK (where you should fill in # appropriately). Each PATH/TO/THEANORC_# should be a file containing the following:

[global]
base_compiledir = PATH/TO/COMPILEDIR_#

Where again, # is filled in appropriately. The goal is to point each GermlineCNVCaller instance to a different compilation directory. @xysj1989 can you let me know if this works for you?
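For a local pipeline that launches one job per chromosome, the per-instance files described above could be generated with a small script. This is a minimal sketch, not part of GATK: the function names, the `theanorc_<job>` and `compiledir_<job>` naming, and the root directory are all assumptions made for illustration.

```python
import configparser
import os
from pathlib import Path


def write_theanorc(job_id, root):
    """Write a per-job Theano config file and return its path."""
    root = Path(root)
    compiledir = root / f"compiledir_{job_id}"
    compiledir.mkdir(parents=True, exist_ok=True)

    # interpolation=None so that a '%' in a path is written literally.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"base_compiledir": str(compiledir)}

    rc_path = root / f"theanorc_{job_id}"
    with open(rc_path, "w") as handle:
        config.write(handle)
    return rc_path


def job_environment(job_id, root):
    """Copy the current environment and point THEANORC at the job's file."""
    env = dict(os.environ)
    env["THEANORC"] = str(write_theanorc(job_id, root))
    return env
```

Each job would then be started with its own environment, for example `subprocess.run(gatk_command, env=job_environment("chr22", scratch_root))`, where `gatk_command` and `scratch_root` stand for your own argument list and a directory you control.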

This is a bit of a hack. We could probably avoid this by changing the GATK code to use a specified or temporary directory for the theano directory without too much effort.

However, there is an upside to using a non-temporary directory to avoid recompilation of the model upon subsequent runs. In this case, we'd just want to let the user be able to specify the theano directory (rather than dump things in ~/.theano unexpectedly). We should think about whether this should be opt-in, i.e., should we preserve the original behavior of using ~/.theano by default?
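To make the trade-off concrete, here is a minimal Python sketch of the two behaviours being weighed: a user-specified directory that persists (so compiled modules can be reused on later runs) versus a temporary directory that is removed afterwards. The context manager `theano_compiledir` is hypothetical and is not how GATK implements anything; it only illustrates the opt-in idea.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def theano_compiledir(user_dir=None):
    """Yield a compilation directory for one run.

    If user_dir is given, it is created if needed and left in place
    afterwards. Otherwise a fresh temporary directory is created and
    deleted when the block exits, leaving no trace.
    """
    if user_dir is not None:
        path = Path(user_dir)
        path.mkdir(parents=True, exist_ok=True)
        yield path
        return

    path = Path(tempfile.mkdtemp(prefix="theano_compiledir_"))
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

With this shape, the default leaves nothing behind, and reuse of compiled modules across runs becomes something the user asks for explicitly by passing a directory.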

@mwalker174 opinions? @droazen or engine team, thoughts on what the policy should be for python/R scripts doing this sort of thing? Is it generally true that the GATK leaves no trace, other than producing the expected output?

@xysj1989
Author

Dear samuelklee,

Thank you very much for your reply. I also found this problem last night. It seems that the problem originates in Theano and PyMC3 rather than in GATK 4. Similar problems have been reported, for example (1) pymc-devs/pymc#1463 (2) https://stackoverflow.com/questions/52270853/how-to-get-rid-of-theano-gof-compilelock and (3) https://groups.google.com/forum/#!topic/theano-users/eJ2vl2PUTk4

Last night I already tried to reset base_compiledir for Theano in two ways: (1) creating a ~/.theanorc file, as you suggested, and (2) modifying the ~/.bashrc file on my login node by adding the line: export THEANO_FLAGS="base_compiledir=/scratch/gatk-user1/z-Temp/z-Temp-Theano-$chr"

However, on our cluster, when I submit the 25 jobs (one per chromosome), they are assigned to compute nodes at random. This means I would have to set the Theano environment variable on each of those nodes, which is quite difficult for me because the assignment is random.

So now I'm going to add the lines below to the ~/.theanorc on my login node to see what happens. Maybe it will work.
#######
[global]
config.compile.timeout = 100000

However, I would really appreciate it if someone on your team could add an option to specify a temporary directory for the theano directory, which can be bound to the corresponding node shared by other GATK threads.

Thank you and Best regards.

@samuelklee
Contributor

@xysj1989 I would think that if you use the python gatk launch script, prefaced immediately by THEANORC as above, you should be able to tie each GermlineCNVCaller run to a separate compilation directory even if you don’t have control over which nodes you are running on. Increasing the timeout means that different runs will not be able to compile models at the same time, which will add some overhead; however, I think setting separate directories avoids this.
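
To illustrate why this does not depend on the node: the variable is part of the command line itself, so it travels with the job wherever the scheduler places it. A small sketch, assuming the per-run config files already exist on storage visible to every node; `launch_line` is a hypothetical helper, and the real GermlineCNVCaller arguments are left out.

```python
import shlex


def launch_line(theanorc_path, gatk_args):
    """Build one shell line that sets THEANORC for this gatk invocation only.

    Hypothetical helper: gatk_args stands in for the real tool arguments.
    """
    command = shlex.join(["gatk", "GermlineCNVCaller", *gatk_args])
    return f"THEANORC={shlex.quote(str(theanorc_path))} {command}"


if __name__ == "__main__":
    # Print one line per chromosome; paths are placeholders.
    for chrom in ["chr1", "chr2"]:
        print(launch_line(f"/path/to/theanorc_{chrom}", []))
```

Each printed line could be placed in the job script for that chromosome, so nothing needs to be configured on the compute nodes in advance.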

In any case, I will try to issue a PR allowing you to directly set the directory or use a temporary one soon. Thanks again for raising the issue!

@xysj1989
Author

xysj1989 commented Nov 1, 2019

Dear samuelklee,

@samuelklee Thank you for your suggested solution, which sounds really fantastic. I am currently testing the pipelines with THEANORC=PATH/TO/THEANORC added before each "GATK sub-function" call. I will report the results when they have finished!

Best regards.

@xysj1989
Author

xysj1989 commented Nov 4, 2019

@samuelklee The method you suggested successfully solved my problem. I have tested it 12 times and the "theano-gof-compilelock" error no longer occurs. Thank you very much!
