canfam3 dbSNP - ensembl 75 #386

Closed
caddymob opened this issue Apr 3, 2014 · 56 comments

@caddymob

caddymob commented Apr 3, 2014

Greetings! Can we add the canine dbSNP VCF to the variation resources in 9dcb447, please? I realize recalibration will not be available, but getting rsIDs sure would be nice :)

The vcf can be obtained here: ftp://ftp.ensembl.org/pub/release-75/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz

The only thing is that the canine genome for bcbio has "chr" prefixes on contigs where the dbSNP does not... I seem to recall you have an Ensembl <--> UCSC conversion method from when we added the rn5 genome, so I'm hoping this is easy without just awk'ing on a 'chr' :)

Thanks!

@roryk
Collaborator

roryk commented Apr 9, 2014

Thanks Jason, sorry for the delay in getting back to you. It looks like that file doesn't have any scaffolds or anything in it, so if you just add 'chr' to the beginning of every line, gzip it back up, and add the dbsnp line pointing to it in the resources file in the genome directory, it should be all good. Could you try that and see if it ends up working as expected?
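For reference, a quick local conversion along those lines might look like the following (just a sketch using the Ensembl file name from the link above; the output name is arbitrary, VCF header lines starting with '#' are left untouched, and Ensembl's MT contig is mapped to UCSC's chrM):

zcat Canis_familiaris.vcf.gz \
  | awk 'BEGIN {FS = OFS = "\t"}
         /^#/  {print; next}
         {if ($1 == "MT") $1 = "chrM"; else $1 = "chr" $1; print}' \
  | gzip > canFam3-dbSNP.vcf.gz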

@roryk
Collaborator

roryk commented Apr 10, 2014

It turns out @chapmanb already had a script to do something similar for mouse in the utilities section of cloudbiolinux. I generalized it a bit to work for other genomes and will point you to it when it is working.

@roryk
Collaborator

roryk commented Apr 10, 2014

Hi Jason I fixed up that dbSNP preparer here: chapmanb/cloudbiolinux@ed4c5b3

It will require a reinstallation of the canFam3 genome; the original genome preparation had a bug where it was not karyotype sorting the multiFASTA genome file. I fixed that bug here: chapmanb/cloudbiolinux@1496390

The last bit is to stick that canFam dbSNP file up so bcbio-nextgen can grab it, fix up that resources file, and test it; then you should be all set.

@chapmanb
Member

Rory -- thanks for doing all the work of preparing this and updating the genome. I finalized the last bits of automating the download and sticking the dbSNP file on S3, so we should be good to go. Jason, please let us know if you run into any problems. We haven't used this so if anything looks off please yell and we can reopen and fix this.

@caddymob
Author

Hey guys, sorry for the delay - we just tried to pull down the latest, and before we even get to the canine dbSNP, we are borking on STAR. It seems the rnaseq.gtf is not on S3... Any suggestions?

~> bcbio_nextgen.py upgrade -u development --tools --genomes canFam3 --aligners star --data

2014-04-29 10:50:49 ERROR 404: Not Found.


Warning: local() encountered an error (return code 8) while executing 'wget --no-check-certificate -O canFam3-star.tar.xz 'https://s3.amazonaws.com/biodata/genomes/canFam3-star.tar.xz''

ERR [genomes.py(450)] Genome preparation method s3 failed, trying next
Traceback (most recent call last):
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 444, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 685, in _download_s3_index
    out_file = shared._remote_fetch(env, url)
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/custom/shared.py", line 190, in _remote_fetch
    raise IOError("Failure to retrieve remote file")
IOError: Failure to retrieve remote file
ERR [genomes.py(450)] Genome preparation method s3 failed, trying next
Traceback (most recent call last):
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 444, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 685, in _download_s3_index
    out_file = shared._remote_fetch(env, url)
  File "/home/jpeden/tmpbcbio-install/cloudbiolinux/cloudbio/custom/shared.py", line 190, in _remote_fetch
    raise IOError("Failure to retrieve remote file")
IOError: Failure to retrieve remote file
INFO: Preparing genome canFam3 with index star
INFO: Preparing genome canFam3 with index star
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/../rnaseq/ref-transcripts.gtf not found, skipping creating the STAR index.
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/../rnaseq/ref-transcripts.gtf not found, skipping creating the STAR index.
/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/pkgutil.py:186: ImportWarning: Not importing directory '/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio': missing __init__.py
  file, filename, etc = imp.find_module(subname, path)

@roryk
Collaborator

roryk commented Apr 29, 2014

Hi Jason,

Sorry about that -- the upgrade system is a little brittle when it comes to the genomes. Is the canFam genome installed already? Could you try it in two parts? First install the canFam3 genome with:

bcbio_nextgen.py upgrade --data --genomes canFam3

and then install the STAR index with:

bcbio_nextgen.py upgrade -u development --genomes canFam3 --aligners star --data

Does that work?

@caddymob
Author

Unfortunately no, same error. Yes, the canFam3.fa is in there - here is everything in the canine genome dir:

find /packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/canFam3.fa
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/canFam3.fa.gz
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/canFam3.fa.fai
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/canFam3.dict
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq/canFam3-resources.yaml
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa/canFam3.fa.pac
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa/canFam3.fa.ann
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa/canFam3.fa.amb
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa/canFam3.fa.bwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bwa/canFam3.fa.sa
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.3.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.4.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.1.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.2.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.rev.1.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.rev.2.bt2
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie2/canFam3.fa
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.3.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.4.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.1.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.2.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.rev.1.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/bowtie/canFam3.rev.2.ebwt
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/ucsc
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/snpeff
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/snpeff/CanFam3.1.74
/packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/snpeff/CanFam3.1.74/snpEffectPredictor.bin

@roryk
Collaborator

roryk commented Apr 29, 2014

Gotcha. Thanks Jason. Sorry, I am a tool-- I never made the RNA-seq bit for canFam3.

I added support for it with the prepper script in cloudbiolinux here: chapmanb/cloudbiolinux@581c92e99f514fb6

When it is done I'll have Brad stick it up on S3. If you don't want to wait, you can prep it yourself: grab the cloudbiolinux repo, use the Python that is installed with bcbio-nextgen, and run:

python cloudbiolinux/utils/prepare_tx_gff.py path-to-picard-dir canFam3

inside the /packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog directory.
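Putting those steps together, something like this should work (a sketch only; the anaconda Python path follows the install layout shown above, and the Picard directory is a placeholder):

cd /packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog
git clone https://github.com/chapmanb/cloudbiolinux.git
/packages/bcbio/0.7.9/bcbio/anaconda/bin/python \
    cloudbiolinux/utils/prepare_tx_gff.py /path/to/picard-dir canFam3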

@caddymob
Author

Hey Rory -

We ran the prepper script - thanks for this. Once we had the rnaseq GTFs and so forth built, we had to ln -s rnaseq-2014-04-29 rnaseq to get STAR to start building locally. It's going now. I'm not sure if this is just an effect of stepping in manually with cloudbiolinux, but I wanted to let you and the internet know for posterity's sake. Thanks!

@roryk
Collaborator

roryk commented Apr 30, 2014

Hi Jason,

Thanks a lot for testing; it is good to know that it works on someone else's machine. The pieces are there to automatically prep and install arbitrary genomes from Ensembl -- we just need to glue them together, and knowing that the rnaseq prep part is good to go is helpful.

chapmanb added a commit to chapmanb/cloudbiolinux that referenced this issue May 6, 2014
@caddymob
Author

caddymob commented May 7, 2014

Hey gents - so we finally got RNA-seq working with STAR on the canines but neglected to check the dbSNP for canine. Turns out we don't have a variation directory. We just pulled down a development update with:

bcbio_nextgen.py upgrade -u development --genomes canFam3 --data

Did we miss something?

@chapmanb
Member

chapmanb commented May 7, 2014

Jason;
Apologies, my mistake. I hadn't made these available in the download configuration. It should hopefully grab them now if you re-run:

bcbio_nextgen.py upgrade --genomes canFam3 --data

@caddymob
Author

caddymob commented May 7, 2014

Not quite - dying like this:

~> bcbio_nextgen.py upgrade --genomes canFam3 --data
Traceback (most recent call last):
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 34, in <module>
    from bcbio.distributed import runfn
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/runfn.py", line 11, in <module>
    from bcbio.distributed import multitasks
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 3, in <module>
    from bcbio import structural, utils, chipseq
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/structural/__init__.py", line 6, in <module>
    import toolz as tz

@caddymob
Author

caddymob commented May 7, 2014

Actually, the upgrade broke the whole build - running a pipeline, not an upgrade, also dies with the same error on toolz.

@chapmanb
Member

chapmanb commented May 8, 2014

Jason;
Sorry about the issue. I'm not sure how the upgrade managed to complete without installing toolz; it's in the requirements.txt file and should also be installed via conda. I'll have to try to replicate, but in the meantime you can fix it with:

/packages/bcbio/0.7.9/bcbio/anaconda/bin/pip install toolz

Hope this gets things working again

@caddymob
Author

caddymob commented May 9, 2014

Thanks Brad - that worked for getting the install back up. Unfortunately, the darn dogs are still an issue with the dbSNP file... 10 hours later with 512 cores, I get the error below with gatk-haplotype. What do you think: should I just replace the INFO whitespace with underscores, or give freebayes a shot? Is there a module in bcbio to clean the dbSNP file that I've missed?

Thanks for all the help!

INFO  18:34:37,282 ReadShardBalancer$1 - Done loading BAM index data
INFO  18:35:17,325 ProgressMeter -   chr27:3844849        4.37e+07   34.8 m       47.0 s     74.5%        46.7 m    11.9 m
INFO  18:35:47,478 ProgressMeter -  chr27:32106717        4.44e+07   35.4 m       47.0 s     75.7%        46.7 m    11.3 m
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 3.1-1-g07a4bf8):
##### ERROR
##### ERROR This means that one or more arguments or inputs in your command are incorrect.
##### ERROR The error message below tells you what is the problem.
##### ERROR
##### ERROR If the problem is an invalid argument, please check the online documentation guide
##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
##### ERROR
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
##### ERROR
##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 2332475: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "RSPOS=38742484;RV;GENEINFO=0:CD9 antigen gene;dbSNPBuildID=118;SAO=0;VC=snp;VP=050000000004000100020100", for input source: /packages/bcbio/0.7.9/bcbio/genomes/Cfamiliaris_Dog/canFam3/variation/canFam3-dbSNP-2014-04-10.vcf.gz
##### ERROR ------------------------------------------------------------------------------------------

chapmanb added a commit to chapmanb/cloudbiolinux that referenced this issue May 10, 2014
@chapmanb
Member

Jason;
Sorry about the issue; it looks like we need to do additional cleanup of that dbSNP file during preparation. I updated the prep script and uploaded a new version which passes GATK VariantEval, so it should hopefully work cleanly for you. If you update your data to get the new files and configuration, you should be good to go:

bcbio_nextgen.py upgrade --data

Hope this works for you. Thanks again for all the patience getting this going.
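For anyone hitting the same GATK error on a local copy, the kind of cleanup involved can be sketched like this (illustrative only, not the actual prep-script change; the input is the file named in the error message and the output name is arbitrary). Spaces in the INFO column (column 8) are replaced with underscores, then the file is recompressed and indexed:

zcat canFam3-dbSNP-2014-04-10.vcf.gz \
  | awk 'BEGIN {FS = OFS = "\t"}
         /^#/ {print; next}
         {gsub(/ /, "_", $8); print}' \
  | bgzip > canFam3-dbSNP-cleaned.vcf.gz
tabix -p vcf canFam3-dbSNP-cleaned.vcf.gz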

@caddymob
Author

Hey Brad!

So, it took a couple of tries - first, the run was still referencing the old canFam3-dbSNP-2014-04-10.vcf.gz file. I wiped my work dir and re-ran; it still died. So we rm'd the canFam3-dbSNP-2014-04-10.vcf.gz file and symlinked the new canFam3-dbSNP-2014-05-10.vcf.gz file - this seemed to run up to the very end. I had the job farmed out across 32 nodes and it died with this error:

Traceback (most recent call last):
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 62, in <module>
    main(**kwargs)
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 40, in main
    run_main(**kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 44, in run_main
    fc_dir, run_info_yaml)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
[2014-05-14 18:54] pnap-pe5-s21: INFO  11:54:35,349 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 332, in run
    samples = region.parallel_prep_region(samples, run_parallel)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/region.py", line 104, in parallel_prep_region
    "piped_bamprep", _add_combine_info, file_key, ["config"])
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 58, in parallel_split_combine
    split_output = parallel_fn(parallel_name, split_args)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipython.py", line 57, in run
    for data in view.map_sync(fn, items, track=False):
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 366, in map_sync
    return self.map(f,*sequences,**kwargs)
  File "<string>", line 2, in map
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 66, in sync_results
    ret = f(self, *args, **kwargs)
  File "<string>", line 2, in map
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 51, in save_ids
    ret = f(self, *args, **kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 1123, in map
    return pf.map(*sequences)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 280, in map
    ret = self(*sequences)
[2014-05-14 18:54] pnap-pe5-s24: INFO  11:54:34,476 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
  File "<string>", line 2, in __call__
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 91, in sync_view_results
    return f(self, *args, **kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 263, in __call__
    return r.get()
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
[2014-05-14 18:54] pnap-pe4-s21: INFO  11:54:35,926 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    raise self._exception
IPython.parallel.error.CompositeError: [2014-05-14 18:54] pnap-pe5-s32: INFO  11:54:34,045 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
one or more exceptions from call to method: piped_bamprep
[367:apply]: OSError: [Errno 7] Argument list too long
[226:apply]: OSError: [Errno 7] Argument list too long
[365:apply]: OSError: [Errno 7] Argument list too long
[65:apply]: OSError: [Errno 7] Argument list too long
.... 2 more exceptions ...
[2014-05-14 18:54] pnap-pe5-s32: INFO  11:54:41,390 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

I then re-launched from a single node running -t local and still getting the Argument list too long error here:

[2014-05-14 16:12] Select unanalyzed reads
[2014-05-14 16:12] Uncaught exception occurred
Traceback (most recent call last):
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run
    _do_run(cmd, checks, log_stdout)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 100, in _do_run
    stderr=subprocess.STDOUT, close_fds=True)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/subprocess.py", line 709, in __init__
    errread, errwrite)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/subprocess.py", line 1326, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long
An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (13, 0))

Any ideas?

chapmanb added a commit that referenced this issue May 15, 2014
@chapmanb
Member

Jason;
Sorry, that was my mistake with the dbSNP file; I forgot to bump the genome resource version to pull in the latest canFam3 YAML files. Apologies for making you waste extra time on that.

For the command line error, I pushed a fix that writes the regions to a file. This normally allows us to work around the command line length issues so hopefully it'll resolve this as well. This doesn't have anything specifically to do with canFam3, but is likely a result of having a lot of chromosomes and thus a lot of regions.
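To illustrate the general idea behind the workaround (hypothetical paths and a GATK-style example, not the exact bcbio change): 'Argument list too long' comes from the kernel's limit on the total size of a command line, so the long region list moves into a file that the tool reads instead of being expanded into thousands of arguments:

getconf ARG_MAX    # the kernel limit behind "Argument list too long"

java -jar GenomeAnalysisTK.jar -T PrintReads \
    -R canFam3.fa -I sample.bam -L regions.bed -o sample-regions.bam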

Hopefully this will keep things moving along. Thanks again for all the patience getting this running.

@caddymob
Author

Bummer - no... It still dies. Again, this is a fresh job from the beginning:

[2014-05-16 00:23] pnap-pe5-s11: INFO  17:23:01,323 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
Traceback (most recent call last):
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 62, in <module>
[2014-05-16 00:23] pnap-pe5-s11: INFO  17:23:00,784 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
    main(**kwargs)
  File "/packages/bcbio/0.7.9/tooldir/bin/bcbio_nextgen.py", line 40, in main
    run_main(**kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 44, in run_main
    fc_dir, run_info_yaml)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 82, in _run_toplevel
    for xs in pipeline.run(config, config_file, parallel, dirs, pipeline_items):
[2014-05-16 00:23] pnap-pe5-s09: INFO  17:23:01,476 ProgressMeter -        Location processed.reads  runtime per.1M.reads completed total.runtime remaining
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 332, in run
    samples = region.parallel_prep_region(samples, run_parallel)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/region.py", line 104, in parallel_prep_region
    "piped_bamprep", _add_combine_info, file_key, ["config"])
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/split.py", line 58, in parallel_split_combine
    split_output = parallel_fn(parallel_name, split_args)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/ipython.py", line 57, in run
    for data in view.map_sync(fn, items, track=False):
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 366, in map_sync
    return self.map(f,*sequences,**kwargs)
  File "<string>", line 2, in map
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 66, in sync_results
    ret = f(self, *args, **kwargs)
  File "<string>", line 2, in map
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 51, in save_ids
    ret = f(self, *args, **kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/view.py", line 1123, in map
    return pf.map(*sequences)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 280, in map
    ret = self(*sequences)
  File "<string>", line 2, in __call__
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 91, in sync_view_results
    return f(self, *args, **kwargs)
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/remotefunction.py", line 263, in __call__
    return r.get()
  File "/packages/bcbio/0.7.9/bcbio/anaconda/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
    raise self._exception
IPython.parallel.error.CompositeError: one or more exceptions from call to method: piped_bamprep
[97:apply]: OSError: [Errno 7] Argument list too long
[0:apply]: OSError: [Errno 7] Argument list too long
[46:apply]: OSError: [Errno 7] Argument list too long
[168:apply]: OSError: [Errno 7] Argument list too long
.... 2 more exceptions ...

chapmanb added a commit that referenced this issue May 16, 2014
@chapmanb
Member

Jason;
Bummer, sorry for the continued problems. I fixed it a different way: instead of needing the long list of regions, it now uses sambamba's ability to select regions from a BED file. If you update again to the latest development and re-run, fingers crossed it will work now. There is no need to start the processing from the beginning -- it should pick up right where it left off.
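For anyone following along, the sambamba side of that looks roughly like this (file names are made up; sambamba view pulls the regions from a BED file, against an indexed BAM, instead of taking them on the command line):

sambamba view -f bam -L regions/analysis_blocks.bed \
    -o sample-prep-regions.bam sample-sort.bam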

If it still fails, could you pass along the last command line in log/bcbio-nextgen-commands.log? It's also possible we're looking at the wrong command and so might be missing another issue. Thanks again.

@chapmanb chapmanb reopened this May 16, 2014
@caddymob
Author

Hmmm, better, sorta - 39966 parallel analysis blocks; however, it still hangs at Timing: variant calling. Here is a gist of the regions/*analysis_blocks.bed file: https://gist.github.com/caddymob/4ff0ba0daef4a5402fd4

As for the re-run - yes, it is actually starting back at BWA on the single node, not just stepping through. I've ssh'd in to the node to check - strange right? I will try the checkpoints_parallel delete on our next round :)

chapmanb added a commit that referenced this issue May 21, 2014
…ce chromosomes to avoid excessive numbers of split regions. #386
@chapmanb
Member

Jason;
Thanks for the BED file; that helped me pin down what I think the underlying issue is. I believe the process we used to filter split regions actually ends up producing fewer "split blocks" than chromosomes, since there are so many contigs. There was a "sanity" check in place, but it is not reasonable for these cases because the extra contigs dominate the larger chromosomes we really want to split.

So I pushed yet another fix which I hope will help, if you're able to retry. I don't think you'll need nomap_split_targets any longer either. Fingers crossed.

The re-run of the bwa steps is also totally wrong, and I'm at a loss for why that is. Would you be able to post an ls -lh of one of the align directories for a sample, some of the commands that are being re-run, and the input sample YAML file? That could direct me to the right place to look for bugs.

Thanks again for the help debugging.

@mjafin
Contributor

mjafin commented May 21, 2014

Hi Brad,
Would the latest fixes improve/affect my monkey runs too (with 7000+ chrs and scaffolds)?

@chapmanb
Member

Miika;
I hope so. I didn't want to send you both on debugging missions but I hope it will reduce the total number of blocks and avoid splitting on the small scaffolds/chromosomes which will avoid having too many concurrent messages. If you have time to test it would be great to know if it helps. Thanks much.

@caddymob
Author

Hey Brad -

So, I wiped the regions/*analysis_blocks.bed and it seemed to distribute, but we're still coming up with 39966 parallel analysis blocks... And we're still hanging at variant calling... I did remove nomap_split_targets from my YAML, but it seems like it is not any different from yesterday...

@chapmanb
Member

Jason;
Argh, sorry. Do you know of any public dog whole genome data in SRA I could use for testing? Maybe if I can replicate this locally we can avoid all the back and forth and guessing, since I feel like I'm making you try lots of things without much success. Thanks again and sorry for the lack of progress so far.

@caddymob
Author

No, I don't... but I'll send you a genome or two over Aspera... stand by

@chapmanb
Member

Jason;
Thanks for sending the files to test on, that was super helpful to get us testing the same thing. I ran your two genome example with the YAML configuration and am seeing a reduced number of total blocks with the latest code:

[2014-05-22 14:51] Identified 4953 parallel analysis blocks
Block sizes:
  min: 759
  5%: 4573.8
  25%: 7533.0
  median: 20768.0
  75%: 1241968.0
  95%: 1454726.4
  99%: 1731813.16
  max: 2502250
Between block sizes:
  min: 101
  5%: 106.0
  25%: 158.0
  median: 237.0
  75%: 557.25
  95%: 1623.15
  99%: 3215.66
  max: 24501

Is it possible you were running a different install than intended or not fully up to date with the latest development version? I also couldn't replicate the issue you saw with restarting bwa from scratch. All of my runs skip past that quickly and restart right at the last end point. So I'm officially confused as to what is going on, but I hope if you can get the same code as the latest it will work better for you.

I did identify a bug in preparing the canFam3 genome, where it could pick up fasta files from the transcripts directory and generate a reference genome with ~35k contigs since it included the transcripts. That is now fixed, but I don't think it is your issue since my install with that problem died immediately during alignment (some tools don't like 35k contigs, I guess). However, you might want to double check that your genome install looks right (grep '>' /path/to/seq/canFam3.fa).

I'll do more testing on memory usage and variant call hanging, but I wanted to report back about the region and re-run alignment discrepancies and see if we could resolve those before debugging more. Hope this helps some, and thanks again for all the patience debugging this.

@caddymob
Author

Hey Brad,

Thanks so much for testing - not sure why you get drastically fewer blocks - my run has 6 dogs of 3 different breeds. Did you test just one dog or both dogs I sent? I sent 2 different breeds in hopes of maximizing variation diversity (if this even matters in this case?)... As for the canFam3 ref, I have the following - it looks correct to me, ~3,500 contigs, not ~35,000:

jcorneveaux@galaxy:/packages/bcbio/genomes/Cfamiliaris_Dog/canFam3/seq$ grep '>' canFam3.fa | cut -f 1 -d "_" | sort | uniq -c
      1 >chr1
      1 >chr10
      1 >chr11
      1 >chr12
      1 >chr13
      1 >chr14
      1 >chr15
      1 >chr16
      1 >chr17
      1 >chr18
      1 >chr19
      1 >chr2
      1 >chr20
      1 >chr21
      1 >chr22
      1 >chr23
      1 >chr24
      1 >chr25
      1 >chr26
      1 >chr27
      1 >chr28
      1 >chr29
      1 >chr3
      1 >chr30
      1 >chr31
      1 >chr32
      1 >chr33
      1 >chr34
      1 >chr35
      1 >chr36
      1 >chr37
      1 >chr38
      1 >chr4
      1 >chr5
      1 >chr6
      1 >chr7
      1 >chr8
      1 >chr9
      1 >chrM
   3228 >chrUn
      1 >chrX

@chapmanb
Member

Jason;
I ran both together as one batch. I wouldn't expect a big difference in the number of blocks based on organism diversity. My hunch is that you might not be using the latest development commit, maybe due to the confusion Jim mentioned with installations. If you can sort that out and run the latest development, I think you'll be good to go, fingers crossed.

@caddymob
Author

Hey Brad - well, after all that we're down to 5008 analysis blocks! My job is stuck in the queue ATM, but I think this 10X reduction in regions was the problem, so we should be set. Will let you know if I hit any more snags, but I wanted to update you for now. Thanks for all the work, and apologies for the confusion. That was intense!

chapmanb added a commit that referenced this issue May 23, 2014
… algorithm argument in bcbio_system.yaml. Experimental feature to test parallelization memory usage on genomes with larger number of regions #386
@chapmanb
Member

Jason;
Brilliant, thanks for the updates. Glad we're seeing the same thing now, and fingers crossed that the block reduction will help. I also did some more work on this today to play with other ideas and came up with an approach that JSON-encodes and compresses messages to IPython. This gives a ~5x reduction in message size, which I'm hoping will help with memory as well. If you update to the latest development, you can enable this by adding compress_msg to the algorithm section of your bcbio_system.yaml:

algorithm:
  compress_msg: true

I'm testing this from my side as well, so I haven't made it automatically enabled. If it helps with scaling, we could also add the same idea to function outputs (right now it is only inputs), which would give us a 10x reduction in message sizes.

Miika, if you have time to test these with your genome as well I'd welcome feedback if they help/hurt/do nothing.

Thanks again to all of you for the testing and patience getting this running.

@mjafin
Contributor

mjafin commented May 23, 2014

Thanks Brad, will throw 25 monkey exomes at it next week!

@mjafin
Contributor

mjafin commented May 26, 2014

OK so here's a quick update:

  • 1C_pe-analysis_blocks.bed contains 2375 lines (1C_pe-noanalysis_blocks.bed contains 9259 lines), where 1C is one of my samples. This seems to be quite a drop from the previous >8k!
  • Restarts are now super quick - no hanging as previously!
  • I had one crash, I think in the callable regions step, where I got an OS error about no available disk space. There's more than a hundred terabytes available, so this probably has something to do with file handles. We've previously upped the file handles in the queue to S_DESCRIPTORS=20000, so maybe that's still not enough (I think it's on a per-process basis; a quick check is sketched below). I managed to get past this step by reducing the number of slots for the job, then restarted the job with a much higher slot count in the variant calling phase.
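A quick way to look at the per-process descriptor limits on a worker node (sketch only; whether descriptors are actually the cause here is unclear, and the reply below points at temporary space instead):

ulimit -n        # current soft limit on open file descriptors
ulimit -Hn       # hard limit
ulimit -n 20000  # raise the soft limit for this shell, if the hard limit allows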

@chapmanb
Member

Brilliant -- thank you for the update. Glad to hear it's generally working better and the improvements in splitting short contigs help reduce the total block count. I'm also stress testing this and have a few more small fixes pending but mostly for GATK HaplotypeCaller scaling and avoiding pyzmq errors which it looks like you're not seeing.

For the disk space issue, is it possible your temporary file space could fill up? pybedtools uses the system temporary directory but a useful fix would be to use set_tempdir to point it at local temporary space instead:

https://pythonhosted.org/pybedtools/topical-design-principles.html#principle-1-temporary-files-are-created-and-deleted-automatically

I'll work on integrating this as well. Thanks for the pointers and testing.

@mjafin
Contributor

mjafin commented May 26, 2014

@chapmanb I've set $TMPDIR to point to the main file server but if something else gets used (like /tmp) then it's certainly a possibility it's actually running out of disk space! Where do I set set_tempdir in bcbio_system.yaml (or somewhere else) to be universally adopted?

Edit: I've got this already in bcbio_system.yaml

  tmp:
    dir: /ngs/tmp

(pointing to our gpfs)

@chapmanb
Member

Miika;
I pushed fixes which use local temporary space (using the bcbio_system.yaml conventions as well) for pybedtools which should hopefully avoid this going forward. Thanks again for reporting it and let me know if you run into anything else at all. I appreciate all this help testing at scale.

@mjafin
Contributor

mjafin commented May 28, 2014

Well, in my testing of the larger data set, I got this:

[2014-05-28 08:59] ukapdlnx115: Timing: alignment post-processing
[2014-05-28 09:00] ukapdlnx115: ipython: piped_bamprep
[2014-05-28 09:07] ukapdlnx115: Timing: variant calling
[2014-05-28 12:12] ukapdlnx115: ipython: variantcall_sample
Killed

Sorry, not very informative! :D
Maybe I should restrict variant calling to the standard chromosomes only?

chapmanb added a commit that referenced this issue May 31, 2014
…ping of return messages through IPython, increase available GC sockets to IPython, and perform better downsampling of GATK calls. #386
@chapmanb
Member

Miika;
'Killed' normally indicates we hit a ton of memory usage/swap and the kernel killed the process. I've been doing more large scale testing and just checked in fixes which might help some by compacting return messages from IPython as well. I also identified a few other places where we had issues on larger runs and cleaned those up. There is definitely going to be a point with a ton of concurrent messages that the controller becomes a bottleneck, but this should help extend it some.

If you still run into issues, practically the best thing to do is skip the extra chromosomes if you don't actually need variant calls there. Hope this helps fix things up and get your processing done.

@mjafin
Contributor

mjafin commented Jun 4, 2014

@chapmanb I tried running the analysis excluding the extra bits from the bed file, but I guess the underlying problem is that all regions still get processed prior to variant calling (the 'bamprep' stage).

[2014-06-03 23:34] ukapdlnx117: Index BAM file: 10_2014-05-24_ensemble-variant-sort-NW_005100582.1_0_3823-prep.bam
[2014-06-03 23:35] ukapdlnx117: Index BAM file: 10_2014-05-24_ensemble-variant-sort-noanalysis-prep.bam
[2014-06-03 23:35] ukapdlnx115: Index BAM file: 10_2014-05-24_ensemble-variant-sort-nochrom-prep.bam
[2014-06-04 00:36] ukapdlnx115: Timing: variant calling
[2014-06-04 03:25] ukapdlnx115: ipython: variantcall_sample
Killed

I think I can run one sample at a time, but 48 samples is clearly too many. Any suggestions?

@chapmanb
Member

chapmanb commented Jun 4, 2014

Miika;
Thanks again for helping test this, and sorry about the continued problems. How many regions do you end up with after excluding the extra regions? For example:

$ wc -l regions/*-analysis_blocks.bed
4 regions/TestBatch1-analysis_blocks.bed

Those should get excluded from bamprep and variant calling, but it sounds like I may need to push some fixes if that's not happening correctly. It would be useful to know whether the analysis blocks have the regions or not, and then I can try to home in on potential issues.

More generally, how much memory do you have on the box that is running the main script? It looks like that is the one being killed, so it would be helpful to get a sense of how badly we're doing relative to 48 samples times n blocks. Thanks again.

@mjafin
Contributor

mjafin commented Jun 4, 2014

Thanks again Brad, here's the output:

   2396 regions/10V_pe-analysis_blocks.bed
   2367 regions/11C_pe-analysis_blocks.bed
   2450 regions/11M_pe-analysis_blocks.bed
   2367 regions/11V_pe-analysis_blocks.bed
   2357 regions/12C_pe-analysis_blocks.bed
   2355 regions/12M_pe-analysis_blocks.bed
   2348 regions/12V_pe-analysis_blocks.bed
   2366 regions/13C_pe-analysis_blocks.bed
   2381 regions/14M_pe-analysis_blocks.bed
....

(It goes on)
The above bed files contain regions from the extra scaffolds. The regions are not in the bedprep/genes_padded_no_scaffolds.bed file itself.

The box has 192GB RAM.

@chapmanb
Member

chapmanb commented Jun 6, 2014

Miika;
That's strange; I can't reproduce this and can successfully exclude chromosomes by leaving them out of the variant_regions input BED. As a sanity check, when you added this BED, did you remove the old callable regions block files?

rm -f align/*/*blocks.bed
rm -rf regions

For the example files Jason sent for dog, this reduces from 4952 regions with all the extra bits to 1763 after keeping only the standard chr* contigs.

This is with the latest development version and no changes other than adding a variant_regions BED including regions of only the standard chromosomes. Is there anything different you can think of that might help me reproduce to dig deeper? Thanks again for testing this.

@mjafin
Contributor

mjafin commented Jun 6, 2014

Thanks Brad; rather than continuing to hijack this thread, I opened a new one: #445
(I restarted the run from scratch for just one sample to see if I get the same behaviour you describe)

@roryk
Collaborator

roryk commented Oct 3, 2015

Looks like this was continued on and fixed in #445, so closing.

@roryk roryk closed this as completed Oct 3, 2015