error in bcbio structural variant calling #653

Closed
shang-qian opened this issue Oct 28, 2014 · 47 comments

@shang-qian

Hi Brad,

Thanks for your help. I want to call structural variants, but I get an error: parallel, svtyper, cnvnator_wrapper.py, cnvnator-multi, and annotate_rd.py are not found in the PATH:

[2014-10-27 23:05] Uncaught exception occurred
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 20, in run
_do_run(cmd, checks, log_stdout)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 93, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; speedseq sv -v -B ......
Sourcing executables from /public/software/bcbio-nextgen/tools/bin/speedseq.config ...
which: no parallel in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-nextgen/anaconda/bin:.....)
which: no svtyper in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
which: no cnvnator_wrapper.py in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
which: no cnvnator-multi in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-....
which: no annotate_rd.py in (/public/software/bcbio-nextgen/tools/bin:/....)
Calculating alignment stats... sambamba-view: (Broken pipe)
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/tools/share/lumpy-sv/pairend_distro.py", line 12, in
import numpy as np
ImportError: No module named numpy

How can I fix this? Thanks again.

Shangqian

@chapmanb
Member

Shangqian;
Thanks for the report and apologies about the issue. The problem was that speedseq, which wraps the lumpy calling, calls out to a lumpy python script that requires numpy. If your system python does not have numpy installed, it results in this error. The other messages about svtyper and cnvnator are not a problem as we don't use those within bcbio.

I pushed a fix which resolves the issue by ensuring we use the Python installed with bcbio, which does contain numpy. If you upgrade with:

bcbio_nextgen.py upgrade -u development

it will grab the latest code and should work cleanly now. Thanks again.
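
For reference, a quick way to check which Python has numpy available (paths follow the install locations in the traceback above) is:

python -c 'import numpy; print numpy.__version__'
/public/software/bcbio-nextgen/anaconda/bin/python -c 'import numpy; print numpy.__version__'

The first command checks the system Python that speedseq was picking up; the second checks the bcbio-bundled Python, which should succeed.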

@shang-qian
Author

Hi Brad,
Thanks so much; the genome SV calling works well now. There is another small question I am not sure about:
In my analysis, I had already called VCFs from a family using a single caller (gatk-hc). Now I want to use three callers and ensemble the resulting VCFs. My yaml file follows. Is it right? Thanks.
details:

  - files: [../input/sample08.bam]
    description: sample08
    metadata:
      batch: ceph
      sex: male
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: 5000000
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [freebayes, gatk-haplotype]
      quality_format: Standard
      coverage_interval: regional
      validate: ../input/GiaB_NIST_RTG_v0_2.vcf.gz
      validate_regions: ../input/GiaB_NIST_RTG_v0_2_regions.bed
      variant_regions: ../input/NGv3.bed
      ensemble:
        format-filters: [DP < 4]
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
        classifier-params:
          type: svm
        trusted-pct: 0.65
  - files: [../input/sample09.bam]
    description: sample09
    metadata:
      batch: ceph
      sex: female
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: 5000000
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [freebayes, gatk-haplotype]
      quality_format: Standard
      coverage_interval: regional
      validate: ../input/GiaB_NIST_RTG_v0_2.vcf.gz
      validate_regions: ../input/GiaB_NIST_RTG_v0_2_regions.bed
      variant_regions: ../input/NGv3.bed
      ensemble:
        format-filters: [DP < 4]
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
        classifier-params:
          type: svm
        trusted-pct: 0.65
  - files: [../input/sample10.bam]
    description: sample10
    metadata:
      batch: ceph
      sex: male
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: 5000000
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [freebayes, gatk-haplotype]
      quality_format: Standard
      coverage_interval: regional
      remove_lcr: true
      validate: ../input/GiaB_NIST_RTG_v0_2.vcf.gz
      validate_regions: ../input/GiaB_NIST_RTG_v0_2_regions.bed
      variant_regions: ../input/NGv3.bed
      ensemble:
        format-filters: [DP < 4]
        classifiers:
          balance: [AD, FS, Entropy]
          calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
        classifier-params:
          type: svm
        trusted-pct: 0.65

@chapmanb
Member

Shangqian;
That generally looks good, although you only have 2 variant callers listed. You'll want to have 3 or more to get good results from ensemble calling: samtools and platypus are two other good choices. Glad the fix worked for you.
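
For example, a minimal change to the variantcaller line in the YAML above would be (samtools is picked here as the third caller; platypus would work the same way):

variantcaller: [freebayes, gatk-haplotype, samtools]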

@shang-qian
Author

Sorry, that was a typo on my part with the three callers :). Thanks again for your helpful suggestions and contributions. bcbio is great and very useful for me.

@shang-qian
Author

Hi Brad,

When I run the configuration above, I get the following error:
Exception in thread "main" java.lang.Exception: VCF files do not have consistent headers: ["ceph-gatk-haplotype.vcf.gz" "ceph-samtools.vcf.gz"]

I know the problem is in the VCF files, so I opened the two VCFs and found that the header sample orders differ: gatk-hc has sample10/sample8/sample9, but samtools has sample8/sample10/sample9. I fixed it by editing the headers to use the same order, but manually modifying them every time does not seem like a good approach.

Is there an automatic way to get consistent headers, either by modifying the input yaml file or within bcbio-nextgen?
Thanks.

kind regards,
Shangqian

chapmanb added a commit to chapmanb/bcbio.variation that referenced this issue Nov 7, 2014
…with identical file names. Thanks to Severine Catreux. Resort multisample inputs that have inconsistent sample orders. Fixes bcbio/bcbio-nextgen#653
@chapmanb
Member

chapmanb commented Nov 7, 2014

Shangqian;
Sorry about the issue. bcbio.variation did not explicitly sort input VCFs, which can cause issues with different callers that insist on sorting in specific ways. I pushed a fix which should resort these to a consistent order prior to doing ensemble calling. If you upgrade your tools with:

bcbio_nextgen.py upgrade --tools

and re-run it should hopefully work cleanly now. Thanks again for the reports.
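
For anyone who cannot upgrade right away, a possible manual workaround (a sketch, assuming bcftools is available; sample names come from the YAML above and the output name is illustrative) is to rewrite the offending VCF with a fixed sample order, since bcftools view -s emits samples in the order listed:

bcftools view -s sample08,sample09,sample10 ceph-samtools.vcf.gz -O z -o ceph-samtools-resorted.vcf.gz
tabix -p vcf ceph-samtools-resorted.vcf.gz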

@shang-qian
Author

Hi Brad,

Thanks for your response. I have updated bcbio-nextgen. Thanks a lot.
In addition, the log from my cancer yaml run shows that there was not enough memory for GATK, although this issue did not occur in the exome pipeline. Can you help me fix this? The error log content follows:

[2014-11-10 17:06] ##### ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
[2014-11-10 17:06] ##### ERROR ------------------------------------------------------------------------------------------
[2014-11-10 17:06] Uncaught exception occurred
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /public/software/bcbio-nextgen/tools/bin/gatk-framework -Xms166m -Xmx1166m -XX:+UseSerialGC -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -T PrintReads -L 9:96714156-127734373 -R /public/software/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -I /public/users/xieshangqian/project/LungC/bcbio/work/align/syn3-tumor/2_2014-11-03_dream-syn3-sort.bam --downsample_to_coverage 10000 -BQSR /public/users/xieshangqian/project/LungC/bcbio/work/align/syn3-tumor/2_2014-11-03_dream-syn3-sort.grp -o /public/users/xieshangqian/project/LungC/bcbio/work/bamprep/syn3-tumor/9/tx/tmpRv7YoC/2_2014-11-03_dream-syn3-sort-9_96714155_127734373-prep-prealign.bam

Kind regards,
Shangqian

@chapmanb
Member

Shangqian;
It looks like you need to allocate additional memory to GATK in your /public/software/bcbio-nextgen/galaxy/bcbio_system.yaml file, specifically increasing the -Xmx value under gatk. The cancer dataset is high depth (100x) and it looks like GATK needs additional memory to run effectively. Hope this helps.
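
As a sketch, the relevant section of bcbio_system.yaml might look like the following; the 3500m value is illustrative, so adjust it to the memory available per core on your machine:

resources:
  gatk:
    jvm_opts: ["-Xms500m", "-Xmx3500m"]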

@shang-qian
Author

Thank you, Brad. The cancer pipeline ran well. Many thanks for your help every time.

By the way, does bcbio require paired-end Read1 and Read2 to have the same length for bwa-mem alignment? Read1 and Read2 files of different lengths, trimmed by Trimmomatic, gave the error "paired reads have different names". As far as I know, bwa-mem should handle reads of different lengths, so is there some special setting or parameter in bcbio that I have missed? Thanks again. :)

@chapmanb
Member

Shangqian;
Glad that the cancer calling finished without any problems. bcbio/bwa-mem do not require reads to be the same length, but do require that all reads are paired. How did you run trimmomatic? The best approach is to use the paired end (PE) mode and feed the paired output into bcbio:

http://www.usadellab.org/cms/index.php?page=trimmomatic

It sounds like you may have trimmed each file separately or included the unpaired reads, which creates non-matching pair names in your fastq files. Hope this helps.
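
As a sketch, a paired-end Trimmomatic run looks like this (file names, the adapter file, and the 0.32 jar version are illustrative); only the two paired outputs should then be fed into bcbio:

java -jar trimmomatic-0.32.jar PE \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1.paired.fastq.gz sample_R1.unpaired.fastq.gz \
  sample_R2.paired.fastq.gz sample_R2.unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36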

@shang-qian
Author

Hi Brad,
Thanks so much for your response. I used the NA12891 data for testing, where bcbio/bwa-mem had been OK before, so I am uncertain where the problem comes from now. My test was as follows:
I input the NA12891 R1 and R2 fastq files, and the error also occurred. These are the error messages:

[2014-11-20 18:16] [mem_sam_pe] paired reads have different names: "FFECCFHG>DEGCGGABGBCGEIDCFGGH:DF######", "AF@?EEFDB>B3<>FCD?BFBCGGGGFEHGEHE7GHHHHDEIHGE=>FECDE=AECCH/7?>DC@IH8DCFCFDC"
[2014-11-20 18:16] samblaster: Loaded 84 header sequence entries.
[2014-11-20 18:16] Uncaught exception occurred
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /public/software/bcbio-nextgen/tools/bin/bwa mem -M -t 16 -R '@rg\tID:1\tPL:illumina\tPU:s1\tSM:s1' -v 1 /public/software/bcbio-nextgen/genomes/Hsapiens/GRCh37/bwa/GRCh37.fa <(/public/software/bcbio-nextgen/tools/bin/grabix grab /public/users/xieshangqian/project/NAR/bcbio/work/align_prep/NA12891.R1.fastq.gz 20000000 39999999) <(/public/software/bcbio-nextgen/tools/bin/grabix grab /public/users/xieshangqian/project/NAR/bcbio/work/align_prep/NA12891.R2.fastq.gz 20000000 39999999) | /public/software/bcbio-nextgen/tools/bin/samblaster --splitterFile >(/public/software/bcbio-nextgen/tools/bin/samtools view -S -u /dev/stdin | /public/software/bcbio-nextgen/tools/bin/sambamba sort -t 16 -m 682M --tmpdir /public/users/xieshangqian/project/NAR/bcbio/work/tx/tmpnH0wTs/spl -o /public/users/xieshangqian/project/NAR/bcbio/work/align/s1/split/tx/tmp4ArZBx/s1-sort-20000000_39999999-sr.bam /dev/stdin) --discordantFile >(/public/software/bcbio-nextgen/tools/bin/samtools view -S -u /dev/stdin | /public/software/bcbio-nextgen/tools/bin/sambamba sort -t 16 -m 682M --tmpdir /public/users/xieshangqian/project/NAR/bcbio/work/tx/tmpnH0wTs/disc -o /public/users/xieshangqian/project/NAR/bcbio/work/align/s1/split/tx/tmpjoA75N/s1-sort-20000000_39999999-disc.bam /dev/stdin) | /public/software/bcbio-nextgen/tools/bin/samtools view -S -u /dev/stdin | /public/software/bcbio-nextgen/tools/bin/sambamba sort -t 16 -m 682M --tmpdir /public/users/xieshangqian/project/NAR/bcbio/work/tx/tmpnH0wTs/full -o /public/users/xieshangqian/project/NAR/bcbio/work/align/s1/split/tx/tmpZuouw8/s1-sort-20000000_39999999.bam /dev/stdin

I think the error is because the paired reads have different names, so I grepped for "FFECCFHG>DEGCGGABGBCGEIDCFGGH:DF######" and "AF@?EEFDB>B3<>FCD?BFBCGGGGFEHGEHE7GHHHHDEIHGE=>FECDE=AECCH/7?>DC@IH8DCFCFDC", which are both on line 20000000 of the NA12891 R1 and R2 fastq files. The results are:
R1 read: line 19999997-20000000
@206B4ABXX100825:6:61:6782:130154/1
AAATCTCACCACTTAACCCATACCAGACCAGACCCAAAAGGAAAGGCCGGGTTCAGTAACAACAACCTGGGTTCAA
+
DEFDIGHEAHDGFCCGGHHECAGHEFECH=HD>FFECCFHG>DEGCGGABGBCGEIDCFGGH:DF######
R2 read: line 19999997-20000000
@206B4ABXX100825:6:61:6782:130154/2
TTGTAGGGGTGTGATGCCGTGGACCCCTTCTTGAACCCCCAAGCTCGTCTTGCATTTGGGGCTCTAGCATGCAGCT
+
@af@?EEFDB>B3<>FCD?BFBCGGGGFEHGEHE7GHHHHDEIHGE=>FECDE=AECCH/7?>DC@IH8DCFCFDC

These results show that R1 and R2 sequences of the same length also hit the error, so I thought it might be a data problem. I then used awk to extract 1M fastq lines (19999997-20999996) as test NA12891_test R1 and R2 files.

When I ran these test files, which contain just lines 19999997-20999996 from the original files and include the @206B4ABXX100825:6:61:6782:130154/1 and /2 reads, everything worked well without any error.

So I am uncertain where the problem is. Any advice would be appreciated. Thanks again.

Shangqian

@chapmanb
Member

Shangqian;
Thanks for including the full traceback; that is very helpful. This is due to a change in grabix, the tool we use for indexing fastq files when running in sections. You may have updated the code or tools separately, and this fix requires updating both together. You can fix it either by removing align_split_size and running without splitting, or by getting the latest code and tools:

bcbio_nextgen.py upgrade -u development --tools

You may need to also remove alignprep/*.gbi to force the creation of new indexes. Hope this fixes the issue for you.
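
Based on the align_prep path in the traceback above, removing the stale indexes would look something like this (the path is from this particular run; adjust to your own work directory):

rm /public/users/xieshangqian/project/NAR/bcbio/work/align_prep/*.gbi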

@shang-qian
Author

Brad,
Many thanks for your earlier detailed advice; the whole exome and genome runs are working on our HPC now. There are two more questions that need your help:

  1. There are 32 samples in my exome dataset, and I want to run bcbio across multiple nodes. The 0.8.2 documentation says "bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 12 -s lsf -q queue" can do this, but I am a little confused about the -s and -q parameters. Should I change lsf and queue to match our cluster, or is keeping the defaults OK?
  2. RNA pipeline error: "[2014-11-28 21:23] ../rnaseq/ref-transcripts.dexseq.gff3 was not found, so exon-level counting is being skipped."
     The ../rnaseq/ folder only contains the ref-transcripts.dexseq.gff file, so how can I fix this? Is linking the .gff file to a .gff3 name with "ln -s ref-transcripts.dexseq.gff ref-transcripts.dexseq.gff3" right?

My yaml file is:
details:

  - algorithm:
      adapters:
        - truseq
        - polya
      aligner: tophat2
      quality_format: Standard
      strandedness: unstranded
      trim_reads: read_through
    analysis: RNA-seq
    description: Test_rep2
    files:
      - /public/users/zhusimin/Xie_project/Xiang_RNA/raw/fastq_raw/Ptf1aKO/ptf1amut1_R1.fastq
      - /public/users/zhusimin/Xie_project/Xiang_RNA/raw/fastq_raw/Ptf1aKO/ptf1amut1_R2.fastq
    genome_build: mm10

Thanks again for your helpful advice.

Best,
Shangqian

@roryk
Collaborator

roryk commented Dec 1, 2014

Hi Shangqian,

Sorry about the DEXSeq issue; linking will fix it, since our pre-built indices have the wrong extension.

For the scheduler and queue: on your HPC, is there a job scheduler that you submit your jobs to, which distributes them over the nodes? There are a number of different schedulers; LSF is one, and there are others like SLURM and SGE. If you can find out which scheduler your HPC is running, pass that as the scheduler (-s), and pass the queue you are allowed to submit jobs to as the queue (-q).
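
For example, on a Torque/PBS cluster with a queue named high (the setup that comes up later in this thread), the call would look like:

bcbio_nextgen.py bcbio_sample.yaml -t ipython -n 12 -s torque -q high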

roryk added a commit that referenced this issue Dec 1, 2014
If you explicitly set a get_x or set_x function, it will not be
overwritten. This lets special complicated cases get handled inside
there without changing the function signature everywhere.

Fix to allow for .gff and .gff3 extensions for the DEXseq file.
Addresses an issue raised in #653.
@roryk
Collaborator

roryk commented Dec 1, 2014

Shangqian,

I fixed this DEXSeq behavior so now it will find either .dexseq.gff or .dexseq.gff3 files here: 0e9c746.

@shang-qian
Author

Hi Roryk,

Thanks for your prompt response; it helps me so much. Thanks a lot.

best,
Shangqian

@shang-qian
Author

RoryK,

The gff problem has been fixed, but the other issue remains. The error shows:
[2014-12-02 11:12] multiprocessing: generate_transcript_counts
Error in find.package("DEXSeq") : there is no package called 'DEXSeq'

So how can I install the DEXSeq package for bcbio? Thanks.

@roryk
Collaborator

roryk commented Dec 2, 2014

Hi Shangqian,

Hm-- it should be getting installed automatically. If you fire up R and do:

source("http://bioconductor.org/biocLite.R")
biocLite("DEXSeq")

it should install it.

@shang-qian
Author

Hi Roryk,

I installed DEXSeq with R on the node, but when I run bcbio it still can't find the package, so I think DEXSeq is not being picked up by the bcbio run.
I also found the DEXSeq package in ./tools/lib/R/site-library, so I tried to use it by setting R_LIBRARY_PATH, but the same error occurred.
Can you show me how to add the DEXSeq package to R under the bcbio environment? Thanks.

Shangqian

@roryk
Collaborator

roryk commented Dec 2, 2014

Hi Shangqian,

I agree, that seems like it should work, thanks for helping to debug this. Hmm-- if you type:

Rscript -e 'find.package("DEXSeq")'

Does it output a directory or say the package cannot be found? If it works, does it also work with R_LIBRARY_PATH unset?
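
As an aside, R's standard environment variable for extra library paths is R_LIBS (rather than the R_LIBRARY_PATH mentioned above), so one hedged check against the bcbio site-library would be:

R_LIBS=/public/software/bcbio-nextgen/tools/lib/R/site-library Rscript -e 'find.package("DEXSeq")'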

chapmanb added a commit that referenced this issue Dec 2, 2014
…ndle GFF retrieval when no GTF file in DEXSeq unit tests. Fixes #653
@chapmanb
Member

chapmanb commented Dec 2, 2014

Shangqian;
Apologies, @roryk and I traced this back to bcbio not injecting the installed site-libraries for R into the search path when looking for DEXSeq. I pushed a fix which does this, so if you upgrade to the latest development version:

bcbio_nextgen.py upgrade -u development 

it should hopefully work cleanly now. Thanks for the bug report and hope this fixes it for you.

@shang-qian
Author

Brad and Roryk,
Thanks for the fix. I am upgrading the bcbio now.

By the way, in my earlier whole genome SV tests, bcbio ran normally. But five days ago I submitted real lung cancer data for SV analysis, and this error appeared this morning:

[2014-12-03 09:01] Index BAM file: 1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam
[2014-12-03 09:01] Samtools-htslib-API: bam_index_build2() not yet implemented
[2014-12-03 09:01] /bin/bash: line 1: 26699 Aborted /public/software/bcbio-nextgen/tools/bin/samtools index /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/tx/tmpdRaD64/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam.bai
[2014-12-03 09:01] Index BAM file (single core): 1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam
[2014-12-03 09:01] Samtools-htslib-API: bam_index_build2() not yet implemented
[2014-12-03 09:01] /bin/bash: line 1: 26702 Aborted /public/software/bcbio-nextgen/tools/bin/samtools index /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/tx/tmpdRaD64/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam.bai
[2014-12-03 09:01] Uncaught exception occurred
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
_do_run(cmd, checks, log_stdout)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 95, in _do_run
raise subprocess.CalledProcessError(exitcode, error_msg)
CalledProcessError: Command 'set -o pipefail; /public/software/bcbio-nextgen/tools/bin/samtools index /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/tx/tmpdRaD64/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam.bai
Samtools-htslib-API: bam_index_build2() not yet implemented
/bin/bash: line 1: 26702 Aborted /public/software/bcbio-nextgen/tools/bin/samtools index /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam /public/users/xieshangqian/project/LungC/bcbio_sv/work/bamprep/Lungtissue/7/tx/tmpdRaD64/1_2014-11-28_ceu-sort-7_78680799_94537032-prep.bam.bai
' returned non-zero exit status 134

I can't find the cause of this error, because the same command ran normally for another bam file. Can you help me? Thanks.

Shangqian

chapmanb added a commit that referenced this issue Dec 3, 2014
@chapmanb
Member

chapmanb commented Dec 3, 2014

Shangqian;
Thanks for the report. The new version of samtools index does not support specifying the output .bam.bai file, which triggered this error. I'm confused as to why the code used samtools for indexing, since it should use sambamba index by default; perhaps there is something problematic about your sambamba install. Either way, I pushed a small fix to work around this issue, so if you update it should hopefully work cleanly now. Thanks again.
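
For reference, a sketch of the two indexing styles with a hypothetical input name: newer samtools releases write the .bai next to the BAM, while sambamba index accepts an explicit output path:

samtools index sample.bam
sambamba index sample.bam sample.bam.bai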

@shang-qian
Author

Brad,
I am upgrading bcbio, but when it runs:
[localhost] local: /public/software/bcbio-nextgen/tools/bin/brew info speedseq
it exits with an error:
Fatal error: local() encountered an error (return code 1) while executing '/public/software/bcbio-nextgen/tools/bin/brew info speedseq'

I ran it 5 times and got the same error every time. Can you check this?
Thanks.

@chapmanb
Member

chapmanb commented Dec 3, 2014

Shangqian;
Sorry about the problem. I'm not sure why that command would fail. Does it provide any useful error messages if you run it outside of the upgrade process?

/public/software/bcbio-nextgen/tools/bin/brew info speedseq

@shang-qian
Author

Hi Brad,
it gives the following messages:
[root@compute-0-15 bin]# /public/software/bcbio-nextgen/tools/bin/brew info speedseq
speedseq: stable 2014-08-22
https://github.com/cc2qe/speedseq
/public/software/bcbio-nextgen/tools/Cellar/speedseq/2014-08-22 (4 files, 92K) *
Built from source
From: https://github.com/chapmanb/homebrew-cbl/blob/master/speedseq.rb
==> Dependencies
Error: No available formula for sambamba

Is this problem caused by the sambamba package? How can I fix it?

@chapmanb
Member

chapmanb commented Dec 3, 2014

Shangqian;
That's strange, it seems like your recipes are not getting updated since sambamba should be present in homebrew-science. This should happen automatically but you can run:

/public/software/bcbio-nextgen/tools/bin/brew update

which should pull it in. Hope this helps.

@shang-qian
Author

Brad,
Thanks for your response. When I run the command in question, it yields the error below:

[root@compute-0-15 bin]# /public/software/bcbio-nextgen/tools/bin/brew update
Unpacking objects: 100% (12/12), done.
error: Your local changes to 'bedtools.rb' would be overwritten by merge. Aborting.
Please, commit your changes or stash them before you can merge.
Error: Failed to update tap: homebrew/science
Already up-to-date.

Should I remove homebrew/science and re-upgrade?

@chapmanb
Member

chapmanb commented Dec 4, 2014

Shangqian;
I'm not sure how the bedtools formula got changed manually but that explains the issues. You can fix with:

cd /public/software/bcbio-nextgen/tools/Library/Taps/homebrew/homebrew-science
git checkout bedtools.rb

then you should be able to re-run the updater and find everything working. Hope this helps figure it out.

@shang-qian
Author

Brad,

Thanks for your advice. I upgraded bcbio and DEXSeq is working now, but when I test the exome pipeline, this command:
[2014-12-07 20:54] java -Xms750m -Xmx2500m -Djava.io.tmpdir=/public/users/xieshangqian/Testcode/testdata/bcbio/work/ensemble/test/tmp -jar /public/software/bcbio-nextgen/tools/share/java/bcbio_variation/bcbio.variation-0.1.9-standalone.jar variant-ensemble /public/users/xieshangqian/Testcode/testdata/bcbio/work/ensemble/test/config/test-ensemble.yaml /public/software/bcbio-nextgen/genomes/Hsapiens/GRCh37/seq/GRCh37.fa /public/users/xieshangqian/Testcode/testdata/bcbio/work/ensemble/test/test-ensemble.vcf /public/users/xieshangqian/Testcode/testdata/bcbio/work/gatk-haplotype/test-effects-ploidyfix-combined-gatkclean.vcf.gz /public/users/xieshangqian/Testcode/testdata/bcbio/work/freebayes/test-effects-ploidyfix-filter.vcf.gz /public/users/xieshangqian/Testcode/testdata/bcbio/work/samtools/test-effects-ploidyfix-filter.vcf.gz

it yields the following:
[2014-12-07 20:59] Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String
[2014-12-07 20:59] at htsjdk.variant.variantcontext.CommonInfo.getAttributeAsInt(CommonInfo.java:242)
[2014-12-07 20:59] at htsjdk.variant.variantcontext.VariantContext.getAttributeAsInt(VariantContext.java:703)
[2014-12-07 20:59] at org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils.simpleMerge(GATKVariantContextUtils.java:946)
[2014-12-07 20:59] at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:309)
[2014-12-07 20:59] at org.broadinstitute.gatk.tools.walkers.variantutils.CombineVariants.map(CombineVariants.java:117)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
[2014-12-07 20:59] at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
[2014-12-07 20:59] at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:314)
[2014-12-07 20:59] at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
[2014-12-07 20:59] at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
[2014-12-07 20:59] at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
[2014-12-07 20:59] at bcbio.run.broad$run_gatk$fn__1805.invoke(broad.clj:34)
[2014-12-07 20:59] at bcbio.run.broad$run_gatk.invoke(broad.clj:31)
[2014-12-07 20:59] at bcbio.variation.combine$combine_variants.doInvoke(combine.clj:71)
[2014-12-07 20:59] at clojure.lang.RestFn.invoke(RestFn.java:1557)
[2014-12-07 20:59] at bcbio.variation.recall$get_min_merged.invoke(recall.clj:158)
[2014-12-07 20:59] at bcbio.variation.recall$fn__7040.invoke(recall.clj:173)
[2014-12-07 20:59] at clojure.lang.MultiFn.invoke(MultiFn.java:249)
[2014-12-07 20:59] at bcbio.variation.recall$create_merged$fn__7045.invoke(recall.clj:187)
[2014-12-07 20:59] at clojure.core$map$fn__4207.invoke(core.clj:2487)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4214.invoke(core.clj:2496)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4207.invoke(core.clj:2479)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4211.invoke(core.clj:2490)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4207.invoke(core.clj:2479)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4214.invoke(core.clj:2496)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$map$fn__4207.invoke(core.clj:2479)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$reduce1.invoke(core.clj:890)
[2014-12-07 20:59] at clojure.core$reverse.invoke(core.clj:904)
[2014-12-07 20:59] at clojure.math.combinatorics$combinations.invoke(combinatorics.clj:73)
[2014-12-07 20:59] at bcbio.variation.compare$variant_comparison_from_config$iter__7582__7586$fn__7587.invoke(compare.clj:255)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.RT.seq(RT.java:484)
[2014-12-07 20:59] at clojure.core$seq.invoke(core.clj:133)
[2014-12-07 20:59] at clojure.core$tree_seq$walk__4647$fn__4648.invoke(core.clj:4475)
[2014-12-07 20:59] at clojure.lang.LazySeq.sval(LazySeq.java:42)
[2014-12-07 20:59] at clojure.lang.LazySeq.seq(LazySeq.java:60)
[2014-12-07 20:59] at clojure.lang.LazySeq.more(LazySeq.java:96)
[2014-12-07 20:59] at clojure.lang.RT.more(RT.java:607)
[2014-12-07 20:59] at clojure.core$rest.invoke(core.clj:73)
[2014-12-07 20:59] at clojure.core$flatten.invoke(core.clj:6478)
[2014-12-07 20:59] at bcbio.variation.compare$variant_comparison_from_config.invoke(compare.clj:254)
[2014-12-07 20:59] at bcbio.variation.ensemble$consensus_calls.invoke(ensemble.clj:113)
[2014-12-07 20:59] at bcbio.variation.ensemble$_main.doInvoke(ensemble.clj:133)
[2014-12-07 20:59] at clojure.lang.RestFn.applyTo(RestFn.java:137)
[2014-12-07 20:59] at clojure.core$apply.invoke(core.clj:617)
[2014-12-07 20:59] at bcbio.variation.core$_main.doInvoke(core.clj:35)
[2014-12-07 20:59] at clojure.lang.RestFn.applyTo(RestFn.java:137)
[2014-12-07 20:59] at bcbio.variation.core.main(Unknown Source)

This command has been running for at least one day (it is still running now).
Before the upgrade, I know it did not take this long, so I think something may need fixing. Is this normal?
Thanks again.

Shangqian

@chapmanb
Member

chapmanb commented Dec 8, 2014

Shangqian;
Sorry about the issue. This is from a problem with vcfallelicprimitives and multi-allele sites and was recently fixed in the development version. See more details here:

#679 (comment)

If you upgrade with bcbio_nextgen.py upgrade -u development, remove the freebayes and checkpoints directories (rm -rf freebayes && rm -rf checkpoints_parallel), and re-run it should hopefully work cleanly. Hope this fixes it for you.

@shang-qian
Author

Thanks, Brad. The upgrade solved the problem. Thanks again.

@shang-qian
Author

Brad and Roryk,
Many thanks for all your help.

Our HPC has 32 nodes (20 cores each) and uses the PBS scheduler for job submission. I previously submitted a PBS file to run bcbio on a single node and it worked well. Now I need to analyse many samples with bcbio, so I want to run across multiple nodes as Rory suggested. The following is my test PBS file for nodes c13 and c17:

#PBS -N exome_s10
#PBS -j oe
#PBS -l nodes=c13:ppn=3+c17:ppn=5
#PBS -l walltime=5000:00:00
#PBS -q high
cd ~/Testcode/testdata/bcbio/work/
bcbio_nextgen.py ../config/test_exome_single.yaml -t ipython -n 8 -s torque -q high

When I qsub this file, it yields an error like this:

[2014-12-10 11:45] compute-0-13.local: Resource requests: bwa, sambamba, samtools; memory: 2.0; cores: 16, 1, 16
[2014-12-10 11:45] compute-0-13.local: Configuring 1 jobs to run, using 8 cores each with 16.2g of memory reserved for each job
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipython_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipython_nbconvert_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipcontroller_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipengine_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipcluster_config.py'
[ProfileCreate] Generating default config file: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/iplogger_config.py'
2014-12-10 11:45:36.491 [IPClusterStart] Config changed:
2014-12-10 11:45:36.491 [IPClusterStart] {'BcbioTORQUEEngineSetLauncher': {'mem': '16.2', 'cores': 8, 'tag': '', 'resources': ''}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'e1bf1e39-9d63-4884-ba38-345be349dbd2'}, 'TORQUELauncher': {'queue': 'high'}, 'BcbioTORQUEControllerLauncher': {'mem': '16.2', 'cores': 2, 'tag': '', 'resources': ''}, 'IPClusterStart': {'delay': 10, 'n': 1, 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioTORQUEEngineSetLauncher', 'controller_launcher_class': u'cluster_helper.cluster.BcbioTORQUEControllerLauncher'}}
2014-12-10 11:45:36.503 [IPClusterStart] Using existing profile dir: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython'
2014-12-10 11:45:36.504 [IPClusterStart] Searching path [u'/public/users/xieshangqian/Testcode/testdata/bcbio/work', u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython'] for config files
2014-12-10 11:45:36.504 [IPClusterStart] Attempting to load config file: ipython_config.py
2014-12-10 11:45:36.505 [IPClusterStart] Loaded config file: /public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipython_config.py
2014-12-10 11:45:36.506 [IPClusterStart] Attempting to load config file: ipcluster_e1bf1e39_9d63_4884_ba38_345be349dbd2_config.py
2014-12-10 11:45:36.507 [IPClusterStart] Loaded config file: /public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipcontroller_config.py
2014-12-10 11:45:36.507 [IPClusterStart] Attempting to load config file: ipcluster_e1bf1e39_9d63_4884_ba38_345be349dbd2_config.py
2014-12-10 11:45:36.508 [IPClusterStart] Loaded config file: /public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipengine_config.py
2014-12-10 11:45:36.509 [IPClusterStart] Attempting to load config file: ipcluster_e1bf1e39_9d63_4884_ba38_345be349dbd2_config.py
2014-12-10 11:45:36.510 [IPClusterStart] Loaded config file: /public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython/ipcluster_config.py
2014-12-10 12:01:09.032 [IPClusterStop] Using existing profile dir: u'/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython'
2014-12-10 12:01:09.094 [IPClusterStop] Stopping cluster [pid=21885] with [signal=2]
Traceback (most recent call last):
File "/public/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 216, in
main(**kwargs)
File "/public/software/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 42, in main
run_main(**kwargs)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
fc_dir, run_info_yaml)
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 81, in _run_toplevel
for xs in pipeline.run(config, run_info_yaml, parallel, dirs, samples):
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 140, in run
multiplier=alignprep.parallel_multiplier(samples)) as run_parallel:
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/contextlib.py", line 17, in enter
return self.gen.next()
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/distributed/prun.py", line 53, in start
with ipython.create(parallel, dirs, config) as view:
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/contextlib.py", line 17, in enter
return self.gen.next()
File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/cluster_helper/cluster.py", line 913, in cluster_view
raise IOError("Cluster startup timed out.")
IOError: Cluster startup timed out.

So I am a little confused: have I understood your advice correctly? Do I need to modify my PBS file or the bcbio system? By the way, after I qsub the PBS file, the job only runs on node c13, not on c17. Thanks again for your kind responses.

Shangqian

@roryk
Collaborator

roryk commented Dec 10, 2014

Hi Shangqian,

Is your HPC busy? If you have to wait a long time for a job to start, bcbio-nextgen will time out. When you submitted it, were the jobs pending for a long time, or did they move to running status? If they are pending and bcbio-nextgen is timing out, you can make bcbio-nextgen wait longer by adding --timeout time-in-minutes to the bcbio_nextgen.py command, so it won't time out while it is waiting. Hope that helps; let us know how it goes.

Best,

Rory
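
For example, extending the command from the PBS script above (the 120-minute value is illustrative):

bcbio_nextgen.py ../config/test_exome_single.yaml -t ipython -n 8 -s torque -q high --timeout 120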

@shang-qian
Author

Hi Rory,
I checked our HPC nodes; they were all idle before I submitted my job.
OK, I will try adding the --timeout option. If there are more problems, I will still need your help :). Thanks.
Best,
Shangqian

@roryk
Collaborator

roryk commented Dec 10, 2014

Hi Shangqian,

If the nodes were idle, then it might be an issue running on Torque. When you submit the job, does everything get to the running state and still time out, or are the jobs pending? If the jobs are in the running state but it still times out, that would be very helpful to know.

@shang-qian
Author

Hi Rory,
The jobs are in the running state, and the same error happened.

@roryk
Collaborator

roryk commented Dec 10, 2014

Great. When the jobs are in the running state, are a controller job and an engine job both running too? There should be three jobs running: the bcbio_nextgen job from the script you wrote to submit to the scheduler, plus a controller and a set of engines. Were all of those running, or just the one bcbio_nextgen job?

If the controller and engine jobs were running too, are there files with engine and ipcontroller in their names in your directory? If you look at those, you should see some errors about heartbeats between the engine and controller. Do you see something like that?

@shang-qian
Author

Thanks, Rory
This is the qstat result:
Job id Name User Time Use S Queue
11063.cluster exome_s10 xieshangqian 00:00:05 R high

and top shows that bcbio_nextgen.py is running.
How should I check whether the controller or engine jobs are running?

@roryk
Collaborator

roryk commented Dec 10, 2014

They should be appearing on there if they are running, so it seems like they aren't starting. Are there engine and ipcontroller files in the directory? There should be job submission scripts for each of them.

@shang-qian
Author

Yes, they are in the lib and pkgs folders:
/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/sqlalchemy/engine
/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/IPython/parallel/engine
/public/software/bcbio-nextgen/anaconda/pkgs/sqlalchemy-0.9.7-py27_0/lib/python2.7/site-packages/sqlalchemy/engine
/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.3.1-py27_0/lib/python2.7/site-packages/IPython/parallel/engine
/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.2.0-py27_0/lib/python2.7/site-packages/IPython/parallel/engine
/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.3.0-py27_0/lib/python2.7/site-packages/IPython/parallel/engine

/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.3.1-py27_0/bin/ipcontroller
/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.2.0-py27_0/bin/ipcontroller
/public/software/bcbio-nextgen/anaconda/pkgs/ipython-2.3.0-py27_0/bin/ipcontroller
/public/software/bcbio-nextgen/anaconda/bin/ipcontroller

@roryk
Collaborator

roryk commented Dec 10, 2014

Hm-- nothing in the work directory? They should look like torque_controller with a bunch of letters and numbers after them. If we can track down those files we can try to figure out why they aren't getting run.

@shang-qian
Author

Yes, I found them in the work directory. The contents are:
Controller:
#!/bin/sh
#PBS -q high
#PBS -V
#PBS -N bcbio-c
#PBS -j oe
#PBS -l nodes=1:ppn=2
#PBS -l walltime=239:00:00
cd $PBS_O_WORKDIR
/public/software/bcbio-nextgen/anaconda/bin/python2.7 -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from cluster_helper.cluster import VMFixIPControllerApp; VMFixIPControllerApp.launch_instance()' --ip=* --log-to-file --profile-dir="/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython" --cluster-id="3b8f9f6a-98f4-47d4-a9db-d3f15d4f3669" --nodb --hwm=1 --scheme=leastload --HeartMonitor.max_heartmonitor_misses=120 --HeartMonitor.period=60000

Engines:
#!/bin/sh
#PBS -q high
#PBS -V
#PBS -j oe
#PBS -N bcbio-e
#PBS -t 1-1
#PBS -l nodes=1:ppn=5
#PBS -l mem=10444mb
#PBS -l walltime=239:00:00
cd $PBS_O_WORKDIR
/public/software/bcbio-nextgen/anaconda/bin/python2.7 -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from IPython.parallel.apps.ipengineapp import launch_new_instance; launch_new_instance()' --timeout=960 --IPEngineApp.wait_for_url_file=960 --EngineFactory.max_heartbeat_misses=120 --profile-dir="/public/users/xieshangqian/Testcode/testdata/bcbio/work/log/ipython" --cluster-id="3b8f9f6a-98f4-47d4-a9db-d3f15d4f3669"

@chapmanb
Member

Shangqian;
Thanks for the help debugging. If you manually submit one of these:

qsub torque_controller*

does it provide any useful error messages? It sounds like something with the submission is problematic with your setup and maybe this will provide a clue. Thanks again.

@lpantano
Collaborator

Hi,

I would try to submit one of these files. Normally they don't start because something in them clashes with the cluster configuration. If you submit one of these files on its own, you can check for any error with them that we didn't think of.

It happened to me on a queue where, for instance, you had to submit jobs with more than two cores: the cluster manager would not let those jobs into the queue and bcbio got stuck.


@shang-qian
Author

Hi lpantano,
Thanks for your trying, I had also qsub the torque_controller* file as Brad's advice, The result is similar to you. It is in the running state in one node with two cores, but the time use is always zero. And in fact, it doesn't work in the node that I qsub.

@lpantano
Collaborator

Hi,

Could you run a file with the same header but different contents, just to make sure the problem is only related to the cluster and not to ipython or bcbio? Then check the output files and whether it finishes. Something like:

#!/bin/sh
#PBS -q high
#PBS -V
#PBS -N bcbio-c
#PBS -j oe
#PBS -l nodes=1:ppn=2
#PBS -l walltime=239:00:00

cd /SOME/PATH/WITH/FILES
sleep 10
ls

