"task binding" without hardcoded file suffixes #66

thejmazz opened this issue Jul 13, 2017 · 1 comment

// We want to run a bunch of variant callers
// Then count common SNPs

const variantCallerInput = {
  reference: '*_genomic.fna',
  bam: '*.bam',
  _bai: '*.bam.bai'
}

// two tasks once stream support ;)
const mpileupAndCall = task({
  input: variantCallerInput,
  output: 'variants.vcf',
  name: 'samtools mpileup | bcftools call'
}, ({ input }) => `
samtools mpileup -uf ${input.reference} ${input.bam} | \
bcftools call -c - > variants.vcf
`)

const variantCaller2 = task({
  input: variantCallerInput,
  output: '*.vcf'
}, ({ input }) => `caller ${input.reference} ${input.bam}`)

// Maybe declare a count predicate for an input,
// whereas { input: '*.vcf' } could fail when more than one file is found, since it expects exactly 1
const counter = task({
  input: [ { pattern: '*.vcf', count: (n) => n >= 1 } ],
  output: '*.snpcount'
}, ({ input }) => `counter ${input.join(' ')}`)

// Currently, this would actually work
// (minus the count: (n) => <boolean expression> part):
// when the counter task resolves input, it will find multiple files matching
// *.vcf from the junction node, which is its immediate parent.
// But say we did count: (n) => n === 5, and let it pick up matching files until 5 were reached:
// this would work if there were 5 callers, but it could pick up more files from upstream tasks.
// SO: maybe what's needed is a way to declare which nodes a task has
// permission to resolve input files from
const pipeline = join(
  junction(mpileupAndCall, variantCaller2),
  counter
)

Related problems:

  • use the stdout output of an upstream task as a file input for another task
  • run something for each variant caller as it finishes, but also run the task that requires all variant calls once they are all ready (or the chromosome example: run something for each chromosome, then do more, and run something that needs the result of all chromosomes when they are ready)

tiagofilipe12 commented Jul 13, 2017

Some ideas to solve this might be:

  • to allow several inputs with the same pattern *.foo:
{ input: [ { pattern: '*.foo', multiple: true } ] }

However, this wouldn't be able to match certain tasks. This is somewhat related to #65.

  • to reference other tasks within the input part of a task definition:
{ input: [ { pattern: '*.foo', taskNames: [ task1, task2, ... ] } ] }
// taskNames would take the names of the tasks in an array, for watermill
// to search for the resolvedOutput in their respective tasks

Then on its execution it would be a function similar to the one that follows:

({ inTasks }) => `countsnps ${ => obj.output).join(' ')}`
  • Or maybe a mixture of the two previous options?

@bmpvieira bmpvieira moved this from In Progress to Backlog in Bionode Project Board Aug 23, 2017