-
Notifications
You must be signed in to change notification settings - Fork 6
Description
A tentative plan for the way forward. I think once the core API is stable, time would be best spent reimplementing real world workflows that can be most improved with Streams. The primary concerns are
- task orchestration
- integration with SGE
After that work can begin on DSL, Docker, admin panel app, nbind etc.
If you think there is not enough time partitioned to plans as their should be, or if some should be swapped, triaged for others, etc, don't hesitate to let me know. I'd like us all to agree upon realistic plans for these next 8 weeks that are exciting and fully satisfy the original overarching goal.
Week 5
- formalize task orchestration API
- blocking vs streaming vs async see Task Architecture #1
- join
- parallel
- forking
- transform stream as task
- tests on core API with basic examples, CI, and coverage
- pipe tasks through each other instead of one task with
shellPipetaking an array
Summary
- refactored original prototype
taskobject now a stream that can wrap streams/non-streams- This makes task easier to use with existing stream modules
- chatted with Matthias and Max about API, internal code
- happy with API design so far, keep it going forward
- change
new Fileto vannila JS obect{file: 'foo'}- easier to new devs, no peer dependencies - ab782f- ability for other devs to work with, over shorter/cleaner syntax of "new" from consumer perspective - DSL will be clean
- dont emit custom events after "end" of "finish" of stream, use those instead, and leave an object in the stream object "to be retrieved" (rather than emitting data with custom event)
- chatted with Davide
- does big social data analytics type stuff
- interested in using waterwheel, contributing
- walked him through clone and install
For this week, continue improving waterwheel, examples with real world workflows. We already have a basic genomic one, I'd like to try out an RNASeq pipeline, or whatever you guys suggest. - stick to improving genomic workflow with sound reentrancy
Task orchestration core codebase largely resolved, parallel/forking/join becomes easy when task returns a regularly compliant stream (i.e. no more custom events). Forking not done yet, but because task is now a steam, can be done with existing modules - e.g. multi-write-stream.
Week 6
- Unit tests for task orchestration Unit tests for task orchestration #18
- check resolution scenarios (from previous task vs fs)
- simple joins
- simple parallels
- joins and parallels should return valid task streams -> further composable
- simple forking
- transform task
- stream search results, filter down to IDs, run pipeline on each ID
- file validation File Validation #22
- existence
- not null check enabled by default
- pass in custom validator function(s)
- reentrancy Test Reentrancy #27
- file timestamp or checksum
- force rerun of all/specific task
-
--resumeoption -
teestream to file
Pushed down:
- integrate with clustering systems like SGE - try to solve this bionode-sra issue
- run tasks in their own folders
- implement new real world workflows from papers
Week summary:
- more unit tests on task orchestration, still need more for complex scenarios
- basic reentrancy using existence of file, non null file, custom validator on file
- pass in custom validators as an array, functions that take file path, return true/false
- working simple variant calling example
- came across, how to let user provide validators that take more than one file (e.g. reads_1 and reads_2)
- came across this problem Resolve input from all tasks in join #35
Week 7
- play with "pass files from other tasks" problem
- integrate with Docker - specify a container for each Task
[ ] formalize YAML/hackfile based DSL
Week Summary:
- this week didn't feel very productive, but
- took time to think about how to restructure codebase with consideration of the "pass files from other tasks" problem
- set up a gitbook, partially documenting the restructure approach: https://thejmazz.gitbooks.io/bionode-waterwheel/content/
- tasks have a hash for
params,input, andoutput, and series of tasks are arranged hierarchically using these - the pipeline at any point in time is a very stateful entity - the config and tasks of the pipeline are now managed in a redux store, this lets me describe every change to a live task (e.g. resolving output, running output validators, finding matching input from previous tasks) with an action, the action results in a reducer being called that returns a new state - reducers are small and more easily testable since they are pure functions. the giant codebase of
taskis now gradually moving into many smaller reducers which a smaller scope
Week 8
- implement new real world workflows from papers
Week summary:
- refactor fully completed - pipeline state managed by redux, actions are dispatched for each step in task lifecycle --> bug reports can be submitted with a snapshot of the exact state
- big functions --> small, testable, pure functions
- improved simple vcf example to be updated to refactored codebase
- began implementation of "hierarchical output dump"
- each task has a "trajectory" which is an array of keys of the output dump
- task will match input patterns to absolute paths in the output dump, going through each "trajection" in the trajectory
- keys of the DAG of the output dump are
JSON.stringify(params)of each task - works somewhat but needs improvement (WIP)
Week 9
- implement new real world workflows from papers
Week 10
- implement new real world workflows from papers
Week 11
- Project website
- Complete documentation, examples, use cases, etc
Week 12
- Final cleanup of website, docs, testing, examples
Extras/Pushed out
- prototype a simple pipeline with nbind - an in browser functional Waterwheel pipeline will be a great way to introduce and teach the module
- web/electron admin panel app - view tasks, edit tasks, see progress, see logs in realtime