Old Roadmap #35

Closed
5 of 6 tasks
bmpvieira opened this issue Mar 11, 2017 · 4 comments

@bmpvieira
Member

bmpvieira commented Mar 11, 2017

Roadmap

This is a WIP roadmap started as a Mozilla Working Open Workshop 2017 activity

Modular and universal bioinformatics

Bionode.io is a community that aims to build highly reusable tools and code for bioinformatics by leveraging the Node.JS ecosystem and community.

Why

Genomic data is flooding the web, and we need tools that can scale to analyse it quickly and in real time, in order to:

  • Potentially save lives with real-time analysis in scenarios such as rapid responses to bacterial or viral outbreaks (especially now with portable real-time DNA sequencers like Oxford Nanopore);
  • Make research advance faster (with quicker data analysis) and be more reusable (with modular tools);
  • Democratize science by reducing the computational resources required for some data analysis (because not everyone has access to terabytes of RAM and petabytes of space) and allowing things to run in a browser without complicated software installations.

Core features

  • Run everywhere: JavaScript/Node.JS is the most "write once, run anywhere" programming language available. Bionode data analysis tools and pipelines should run on distributed high-performance computing servers for big data, but also locally on user machines and in the browser for web applications (e.g., genome browsers).
  • Use Streams: The core architecture of the code should be based on Node.JS Streams. This allows data to be processed in real time and in chunks, with self-regulation through backpressure (i.e., if one step is slow, the whole pipeline adjusts). For example, while you are downloading a data set, you could already be analysing it before the download completes, without worrying about latency and timeout issues. In practice, this means Bionode pipelines would use fewer computing resources (memory and disk space), analyse the data in real time and finish faster than other approaches (a minimal sketch follows this list).
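
To make the streaming idea concrete, here is a minimal plain-Node.js sketch (not Bionode code): a download is piped through a transform that analyses chunks as they arrive, with backpressure regulating the flow. The URL and file names are placeholders.

```js
const https = require('https');
const fs = require('fs');
const { Transform, pipeline } = require('stream');

// Transform that counts newline-delimited records while passing data through.
const countLines = new Transform({
  transform (chunk, enc, cb) {
    this.count = (this.count || 0) + chunk.toString().split('\n').length - 1;
    cb(null, chunk); // forward the chunk unchanged
  }
});

https.get('https://example.org/reads.fastq', (res) => {
  pipeline(
    res,                                  // chunks arrive while the download runs
    countLines,                           // analysis happens before the download ends
    fs.createWriteStream('reads.fastq'),  // slow disk? backpressure slows the download
    (err) => {
      if (err) return console.error('pipeline failed:', err);
      console.log('records seen:', countLines.count);
    }
  );
});
```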

Short term - what we're working on now

  • Funding for full-time development
  • Showcase data analysis pipeline

Medium term - what we're working on next!

  • CWL integration
  • Dat integration

Longer term items - working on this soon!

  • C++ integration
  • Workflow GUIs
  • BioJS integration

Achievements

  • Prototype code and tools
  • GSoC 2016
  • Google Campus Hackathon in London, UK
  • Workshop at the Mozilla Festival 2016 in London, UK
  • Workshop at the Bioinformatics Open Days in Braga, Portugal
  • Workshop at the Institute of Molecular Medicine in Lisbon, Portugal

@Wandalen

Any chance to get involved from outside?

@thejmazz
Member

Any chance to get involved from outside?

Of course @Wandalen! I think it is safe to say one of bionode's goals is to attract more contributors. A big part of that is having extremely well documented and explained core APIs (e.g. the watermill task API), so that a plugin ecosystem for bioinformatics can emerge.

One of the most immediate and substantial things to work on would be bringing streaming tasks to bionode-watermill. Right now, tasks declare input and output as glob patterns (in the future, regexes and other options). However, you cannot stream between tasks. Supporting this should be a matter of the task declaring its input/output as stdin/stdout (or similar); then the plumbing that connects two tasks (which gets more complicated with orchestration operators like fork and junction) needs to observe this and pass the stream through appropriately. Ideally, whether a task receives streams or file inputs might eventually be abstracted away, if it improves API consistency.
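
As a rough illustration of that plumbing in plain Node.js (not watermill's API): once both tasks declare stream I/O, the connection reduces to piping one process's stdout into the next process's stdin. The tools and arguments below are placeholders.

```js
const { spawn } = require('child_process');

const taskA = spawn('samtools', ['view', '-b', 'reads.sam']); // "output: stdout"
const taskB = spawn('samtools', ['sort', '-']);               // "input: stdin"

// The orchestrator's job would reduce to wiring stdout into stdin,
// instead of writing and re-reading an intermediate file.
taskA.stdout.pipe(taskB.stdin);
taskB.stdout.pipe(process.stdout); // downstream consumer (or the next task)

// stderr still needs to be observed per task (e.g. to detect failures).
taskA.stderr.pipe(process.stderr);
taskB.stderr.pipe(process.stderr);
```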

This might actually be less overhead than the way I have already done it, which does a number of things involving files (it became clear very early on that bioinformatics workflows are, for the most part, about reading and writing files - once we can handle that perfectly, and researchers are happy they can implement what they are used to, it will be time to unveil the curtain of streams and its pros/cons in certain pipeline situations). The current file-based flow does the following (a minimal sketch follows the list):

  • check file exists
  • run file validators (global, e.g. not null; task-specific, e.g. a VCF validator)
  • give middleware chance to observe file/stream
  • check input/output match correctly
  • create task folders, create symlinks for files
  • proper file unique identifiers
  • trigger events that signify upstream dependency changed (NOT DONE YET)
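
A minimal sketch of the first two checks in the list, under the assumption of a hypothetical resolveOutput helper (this is not watermill's internals, just the idea):

```js
const fs = require('fs');

// A trivial global validator ("not null"-style): the file must be non-empty.
function notEmpty (path) {
  return fs.statSync(path).size > 0;
}

// Hypothetical helper: verify a declared output exists and passes validators.
function resolveOutput (path, validators = [notEmpty]) {
  if (!fs.existsSync(path)) {
    throw new Error(`expected output missing: ${path}`);
  }
  for (const validate of validators) {
    if (!validate(path)) throw new Error(`validation failed: ${path}`);
  }
  return path; // a real implementation would also symlink this into a task folder
}

// usage: resolveOutput('variants.vcf', [notEmpty /* , vcfValidator */]);
```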

For streams, it's basically a bunch of functions that use read and write pipes. It would also be good to observe streams and write them to file, in case a task needs to be rerun, or to store intermediate results (probably a good point for discussion - why use streams if we are writing files anyway, someone might ask? - an interesting place to weigh pros and cons - e.g., by using streams, we can check stderr for messages indicative of failure for a specific tool despite a 0 exit code). Another interesting aspect of streams is that we can spawn three child processes and pass streams between them, enabling storage of intermediate results and experimentation with parameters on a per-tool basis. E.g. for A | B | C, you could declare b1 and b2, and have everything run through both variants of B automatically.
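
A hedged sketch of that A | B | C idea in plain Node.js: tee A's output to disk (for reruns or intermediate results) while fanning it out to two parameter variants of B. Tool names, flags and file names are placeholders.

```js
const { spawn } = require('child_process');
const fs = require('fs');
const { PassThrough } = require('stream');

const A  = spawn('toolA', ['input.fastq']);   // placeholder tool
const b1 = spawn('toolB', ['--param', '1']);  // variant 1 of B
const b2 = spawn('toolB', ['--param', '2']);  // variant 2 of B

// Tee A's output: keep the intermediate result on disk while also
// fanning it out to both variants of B.
const tee = new PassThrough();
A.stdout.pipe(tee);
tee.pipe(fs.createWriteStream('A.intermediate'));
tee.pipe(b1.stdin);
tee.pipe(b2.stdin);

// Watch stderr for failure messages, even when a tool exits with code 0.
for (const proc of [A, b1, b2]) {
  proc.stderr.on('data', (d) => process.stderr.write(d));
}
```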

I see you've made BufferFromFile, very cool! Perhaps a simple watermill pipeline using that module could be a good place for you to get started.

Other than watermill, our modules are built for performing bioinformatics tasks via JS APIs, Node streams, and proxying to tools (binaries). Our "wrapper modules" should probably be deprecated and replaced with watermill task modules once watermill matures; it is too much work to maintain a wrapper module for every tool.

We are also trying to ease the use of somewhat undocumented web APIs. For example, bionode-ncbi provides access to NCBI, but not for all databases, and it needs to parse FTP/HTML responses. In bionode-blast I wrote a helper function that pulls down a "JSON" response which is actually a zip archive (malformed, but still extractable with less spec-strict modules) and returns JSON. I hope to improve that module by strictly documenting all parameter types and their validations, and by providing a documented REST API - perhaps using things like JSON Schema, schema-salad (my JS version does nothing atm), typed JavaScript with Flow or TypeScript, or Swagger/RAML. Imagine having a REST API that documents its response schema (an array of objects): you could then feed it into an object-streaming pipeline and operate on its shape confidently (and IDEs pick up on the types). Overall, some way to create ultra-documented APIs that work as CLI tools, JS modules, and proxied REST APIs - all from one main definition - would be very cool.
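
As a small illustration of the response-schema idea, here is a sketch using JSON Schema with the ajv validator; the record fields are invented for the example and are not bionode-ncbi's actual output shape.

```js
const Ajv = require('ajv'); // JSON Schema validator (npm install ajv)

// Illustrative schema only - not NCBI's actual response shape.
const recordSchema = {
  type: 'object',
  required: ['uid', 'accession'],
  properties: {
    uid: { type: 'string' },
    accession: { type: 'string' },
    length: { type: 'integer', minimum: 0 }
  }
};

const validate = new Ajv().compile(recordSchema);

// In an object-streaming pipeline, each record could be checked against the
// schema before downstream steps rely on its shape.
const record = { uid: '1234', accession: 'NC_000001', length: 248956422 };
console.log(validate(record) ? 'valid record' : validate.errors);
```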

Hopefully that gives you a taste of what I see in the roadmap for bionode, and introduces you to topics you might be interested in working on!

@Wandalen

Interesting, @thejmazz. I want to play with the project. Thank you for such a deep dive - a good starting point.

@bmpvieira
Member Author

Sorry, but I'm moving this discussion to #42 just because it's a more meaningful number (see what I did there?) and easier to remember when I'm making a link to this issue! 😎

@bmpvieira changed the title from "Roadmap" to "Old Roadmap" on May 14, 2017