New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added task.name to uid generator #69

Closed
wants to merge 1 commit into
base: dev
from

Conversation

3 participants
@tiagofilipe12
Member

tiagofilipe12 commented Jul 27, 2017

In this PR I added a simple fix to a problem with duplicated uids when tasks are duplicated. Please check this journal entry since it has a detailed description on the issue related with this and on how I tried to solve it.

This duplication of uids not only prevented pipeline branches to run the same task, even if we declare different const for each task (with same contents), but also when branching with junction using same task it will execute with the first hit regardless of the branch it came from.

@tiagofilipe12 tiagofilipe12 self-assigned this Jul 27, 2017

@tiagofilipe12 tiagofilipe12 requested a review from thejmazz Jul 27, 2017

@thejmazz

This comment has been minimized.

Member

thejmazz commented Jul 28, 2017

This is a simple PR but a very detailed write-up - give me a few days to understand the notebook entry and then I will merge!

PS. the dotted lines for not-yet-ran tasks visualization is amazing 💯

@thejmazz

This comment has been minimized.

Member

thejmazz commented Jul 28, 2017

Why not use something else unique, like the time of task creation (or maintain an incrementation for each task generated in this store - root of store could even store a unique hash which represents state of graph at every time a task is added to it), to generate the uid? (that is, to include in addition to that hashing function). This seems to me like an solution to a problem by introducing manual effort that can otherwise be implemented via a stateful solution.

This bug is basically an oversight on my part having never ran a junction with that configuration, having never encountered that issue with fork.

It's cleaner than

const everythingForTaskButName = { /* input, output, props, operationCreator */ }
const [ task1, task2 ] = [ 'my-task', 'my-task' ]
  .reduce((sum, val, i) => sum.concat([ val + '_' + i]) , [ ])
// or something else over-engineered and difficult to read or even runs

const pipeline = junction(
  join(somethingA, addNameToTask(everythingButTaskName)(task1),
  join(somethingB, addNameToTask(everythingButTaskName)(task2))
)

Maybe we should be able to identify as well if a task with the exact same input, output, params has already ran (+ whatever else is required to be confident it is OK to use these existing results), and so can resolve from the output of that.

@thejmazz

This comment has been minimized.

Member

thejmazz commented Jul 28, 2017

Maybe we even need to generate a uid after the task finishes - would it make sense for a "wants" uid, which many tasks can share, and also a "resolved" one which includes + resolvedOutput, using hashes of input contents + params, (so as to indicate there is no point running again). I say both because I imagine that a uid is still needed before any tasks run - but then one maybe one is needed after to verify what (the hash of output files combined with hash of output map), given which input (hash of params map + input contents).

In this way, one set of uids could generate a simple DAG for the general design of the pipeline, whereas the other includes all nodes for all executions of the pipeline (same consted task declarations) over multiple input samples.

@thejmazz

This comment has been minimized.

Member

thejmazz commented Jul 28, 2017

All that being said - it is very interesting how what this PR proposes is quite similar to the method used with Nextflow to implement simple forking behaviour. The whole .into stuff

@tiagofilipe12

This comment has been minimized.

Member

tiagofilipe12 commented Aug 1, 2017

You are right, we should make this as automatic as possible rather than needing to duplicate tasks and add a name to it. Also, interesting fork seems to handle this already, since it does not have problems with dealing with the same tasks multiple times.

I will make a new PR to replace this (when it is ready) because there are some other fixes to orchestrators that need to be addressed.

@tiagofilipe12

This comment has been minimized.

Member

tiagofilipe12 commented Aug 4, 2017

This PR is obsolete because of #70.

@tiagofilipe12 tiagofilipe12 moved this from DAG related to Done in Watermill board Aug 5, 2017

@tiagofilipe12 tiagofilipe12 moved this from Done to DAG related in Watermill board Aug 5, 2017

@tiagofilipe12 tiagofilipe12 removed this from DAG related in Watermill board Aug 5, 2017

@tiagofilipe12 tiagofilipe12 deleted the uid_fix branch Aug 17, 2017

@bmpvieira bmpvieira removed the enhancement label Aug 22, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment