DynamoDB backed bio4j prototype
- difficulty hard
- technologies scala++, aws+++, blueprints++, dynamodb+++, s3++, ec2++
This is a really challenging project. We think it is possible to build a prototype of a DynamoDB-based bio4j distribution. The basic idea is to take advantage of how well our very particular needs fit a model like this: writes and reads are completely isolated and happen at different times, and we have a strongly typed, immutable domain model which determines a lot about the data topology.
@eparejatobes and @evdokim already have some design ideas on how to do this, extending the ideas of the "Hexastore" paper by using composite keys and many carefully designed tables. The key point is that the Bio4j case is really specific, as explained above: all the static information about the data topology opens up a lot of opportunities for optimization, and the fact that the raw data store architecture is exposed makes for a really simple and efficient model for module composition.
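To make the composite-key idea more concrete, here is a minimal sketch of how Hexastore-style sextuple indexing might map onto hash and range keys. It assumes one table per ordering of (subject, predicate, object), with the first component as the hash key and the other two joined by `#` into the range key, so that a lookup is a hash key match plus a prefix match on the range key (DynamoDB queries support this via `begins_with` on the sort key). All names here (`Statement`, `Index`, `Item`, `InMemoryStore`) are hypothetical, the in-memory store only stands in for DynamoDB, and this is not necessarily the design the mentors have in mind.

```scala
// Hypothetical sketch: these names are illustrative, not part of bio4j or the AWS SDK.
object HexastoreSketch {

  // One edge of the graph: source vertex, edge type, target vertex.
  // Assumption: identifiers never contain the '#' separator.
  final case class Statement(s: String, p: String, o: String)

  // One of the six orderings of (subject, predicate, object) used by Hexastore.
  sealed abstract class Index(val name: String,
                              val permute: Statement => (String, String, String))
  case object SPO extends Index("spo", t => (t.s, t.p, t.o))
  case object SOP extends Index("sop", t => (t.s, t.o, t.p))
  case object PSO extends Index("pso", t => (t.p, t.s, t.o))
  case object POS extends Index("pos", t => (t.p, t.o, t.s))
  case object OSP extends Index("osp", t => (t.o, t.s, t.p))
  case object OPS extends Index("ops", t => (t.o, t.p, t.s))

  val allIndexes: Seq[Index] = Seq(SPO, SOP, PSO, POS, OSP, OPS)

  // What would be written to one table: a hash key plus a composite range key.
  final case class Item(table: String, hashKey: String, rangeKey: String)

  // Every statement is written six times, once per ordering.
  def itemsFor(t: Statement): Seq[Item] =
    allIndexes.map { ix =>
      val (a, b, c) = ix.permute(t)
      Item(ix.name, a, s"$b#$c")
    }

  // In-memory stand-in for the six tables.
  final class InMemoryStore(statements: Seq[Statement]) {
    private val byTable: Map[String, Seq[Item]] =
      statements.flatMap(itemsFor).groupBy(_.table)

    // Hash key equality plus a prefix match on the range key.
    def query(index: Index, hashKey: String, rangePrefix: String = ""): Seq[Item] =
      byTable.getOrElse(index.name, Seq.empty)
        .filter(i => i.hashKey == hashKey && i.rangeKey.startsWith(rangePrefix))
  }

  def main(args: Array[String]): Unit = {
    val store = new InMemoryStore(Seq(
      Statement("protein1", "organism", "taxonA"),
      Statement("protein2", "organism", "taxonA"),
      Statement("protein1", "keyword", "kw1")
    ))
    // All edges pointing at taxonA, whatever their type.
    store.query(OPS, "taxonA").foreach(println)
    // Only the "organism" edges leaving protein1.
    store.query(SPO, "protein1", "organism#").foreach(println)
  }
}
```

Since writes and reads happen at different times, paying for six writes per statement at import time may be acceptable in exchange for answering any pattern with one or two bound components from a single table. Static knowledge of the domain model could also allow dropping orderings that no module API ever queries.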
All this will be implemented in Scala, and requires a deep understanding of the DynamoDB model and of eventually consistent data stores in general. Implementing the writing/import part would involve working with EC2 + S3 + SQS, using tools such as nispero. Only a subset of the general data model will be implemented, possibly using nispero again.
- A literature survey of graph databases
- Amazon DynamoDB design patterns and best practices
- Hexastore: sextuple indexing for semantic web data management
- imGraph: A distributed in-memory graph database
- G-SPARQL: a hybrid engine for querying large attributed graphs
- A Distributed In-Memory SPARQL Query Processor based on Message Passing
- A Distributed Graph Engine for Web Scale RDF Data
A proof-of-concept implementation of a core bio4j module (like UniProtKB) on top of DynamoDB, capable of executing queries through the corresponding module API.
- @eparejatobes (mailto:eparejatobes@ohnosequences.com) He coordinates all the distributed systems and architecture aspects, and has a lot of experience designing and building systems based on AWS. He is also currently working on abstract models for query languages and graph databases.
- @evdokim (mailto:ekovach@ohnosequences.com) He is the main developer behind nispero and related extensions, and has built several data analysis tools based on DynamoDB, EC2, SQS and S3.