New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How can be a pipeline extended to deal with multiple samples? #47

Closed
zhangzhen opened this Issue May 8, 2017 · 3 comments

Comments

3 participants
@zhangzhen

zhangzhen commented May 8, 2017

A pipeline is written to deal with one sample at a time. How can be the pipeline extended to deal with multiple samples?

@thejmazz

This comment has been minimized.

Member

thejmazz commented May 8, 2017

Hi @zhangzhen, thanks for the question!

The pipeline returns a thunk which returns a Promise (and also alternatively accepts a callback), so it would be feasible to create a bunch of pipelines for each sample, and then run all of them (probably in serial rather than parallel for now). Something like

function buildPipeline(sampleId) { ... }

const runs = samples.map(id => buildPipeline(id))

// using bluebird
Promise.mapSeries(runs, (run) => run())
  .then((results) => console.log(results)
  .catch(console.error)

This is not using any watermill specific API. Running a pipeline across multiple samples is something we have thought about before, but the current API so far is a "low level" atomic task one. (join, junction, and fork also follow the task API) I think the best approach for implementing new features is to nail down the task API, and build new features on top of it, try different approaches (I think Observables could be useful for this sort of stuff), and then perhaps later we can move certain things into core.

For example, we've mused with the idea of a stream producing a bunch of sample IDs (perhaps from filtering down a stream from bionode-ncbi) and having that stream of IDs transformed into running pipelines.

@zhangzhen

This comment has been minimized.

zhangzhen commented May 9, 2017

Thank you for your reply.
There are still questions to be discussed that are listed below:

  1. Multiple samples involves multiple files including output files generated by each task in a pipeline. In the case of multiple samples, does using the glob pattern to resolve the input and output files of each task work properly?
  2. If input data comes from another source, for example, directly from an Illumina sequencer, how is a pipeline constructed?
  3. Can the pipeline be parameterized and used in command line?
  4. If a task is stopped unexpectedly, can it be resumed?

P.S. your blog article "NGS Workflows" is one of the best articles that discuss and compare bioinformatics pipelines. It's very practical! I am encouraged to use your watermill to do my research work. Thank for your excellent work.

@thejmazz

This comment has been minimized.

Member

thejmazz commented May 9, 2017

Thank you very much for the kind remarks :)

  1. Yes, it should work properly; the directory for every task is based on that task's UID which is built from hashes of input, output, and params. The task directory can also be overwritten by the user. See
  1. As long as the source is a file or stream it can work.
  2. Not yet - what do you have in mind for this? It's fairly simple to build node CLIs using yargs or something similar.
  3. You can run it again, and tasks will check if the hash of files found on disk match the stored hashes in watermill's task metadata folder, and will skip over already completed work. However, this is not 100% functional - I think there is the case where if you change parameters from an upstream task, the downstream one won't necessarily be forced to rerun

If you choose to use watermill in practice, please be weary that it may not cover all use cases you desire, there will be bugs, and the tool is undergoing rapid development (we have GSOC 2017 student now!). That being said, feedback, bug reports, suggestions for example pipelines, etc, is extremely useful.

@bmpvieira bmpvieira added this to In Progress in Bionode Project Board May 11, 2017

@bmpvieira bmpvieira added the question label May 11, 2017

@bmpvieira bmpvieira closed this Jun 1, 2017

@bmpvieira bmpvieira moved this from In Progress to Done in Bionode Project Board Jun 1, 2017

@bmpvieira bmpvieira added the mozsprint label Jun 1, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment