Parallelization? #4

Open
olgabot opened this issue Jun 19, 2016 · 2 comments

@olgabot

olgabot commented Jun 19, 2016

Would it be possible to run the sra download and fastq-dump in parallel across multiple samples? I'm downloading a ~2k-sample dataset on our supercomputer and it's just taking a while; I'm hoping I could parallelize this across 8 nodes.

Here's the command I was using:

bionode-ncbi download sra SRP061902 | bionode-sra fastq-dump  --split-files -
@thejmazz
Member

thejmazz commented Jun 19, 2016

Hi Olga, I am working on a bionode streaming workflow engine for a GSoC project. Feel free to follow the project's Gitter chat room and propose problems, questions, and features there!

I have finished investigating existing tools (bash, make, Snakemake, Nextflow, ...) and have begun the implementation - you can see my NGS workflows post, which concludes with a proposed API surface.

As far as I know, you cannot natively parallelize with just those two modules - @bmpvieira will know more. However, parallelization will be an integral part of the engine: running two tasks simultaneously when neither depends on the other finishing (two streaming tasks can of course run at the same time), and providing config options to pass thread counts on to the underlying tools.
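
Until then, the same per-sample fan-out can be approximated in plain shell with GNU xargs. This is only a sketch: it assumes a file run_ids.txt with one SRA run accession per line (splitting SRP061902 into individual run accessions first, e.g. with bionode-ncbi search, is left out):

# Run up to 8 download+dump pipelines concurrently, one accession per job.
# run_ids.txt (assumed) holds one SRA run accession per line.
xargs -a run_ids.txt -P 8 -I {} sh -c \
  'bionode-ncbi download sra {} | bionode-sra fastq-dump --split-files -'

Each accession becomes an independent job, which is exactly the no-dependency case the engine will schedule automatically.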

@thejmazz
Member

In the meantime, Nextflow might work for you:

#!/usr/bin/env nextflow

SAMPLES = 'SRP061902'

process downloadSamples {
  input: val samples from SAMPLES
  output: file '**/*.sra' into reads

  """
  bionode-ncbi download sra $samples
  """
}

process dump {
  // consume each .sra file emitted into the 'reads' channel above
  input: file sample from reads

  """
  fastq-dump --split-files $sample
  """
}

This will not stream the downloads into fastq-dump, though. I tried to use stdin/stdout across processes with Nextflow, but it was too much for the JVM.

I am not sure whether that will feed dump each .sra file individually, or whether sample will end up as an array of .sra file names, but something along those lines might fit your use case - until bionode is ready ;) Nextflow also has a bunch of configuration options for running on a cluster.
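
For example, a minimal nextflow.config along these lines (the executor and queue names here are assumptions - they depend on your scheduler):

// nextflow.config - applies to all processes in the script above
process {
  executor = 'sge'   // or 'slurm', 'pbs', ... to submit tasks to the cluster
  queue    = 'all.q' // assumed queue name
  maxForks = 8       // cap concurrent tasks, matching the 8 nodes above
}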
