DynamoDB backed bio4j prototype

difficulty hard
technologies scala++, aws+++, blueprints++, dynamodb+++, s3++, ec2++

This is a really challenging project. We think it is possible to build a prototype of a DynamoDB-based bio4j distribution. The basic idea is to take advantage of how well our rather particular needs fit a model like this. For example, writes and reads are completely isolated and happen at different times, and we have a strongly typed, immutable domain model that determines much of the data topology.

@eparejatobes and @evdokim already have some design ideas on how to do this, extending the ideas of the "Hexastore" paper through careful use of composite keys and many tables. The key point is that the Bio4j case is really specific, as explained above. In particular, the static information about the data topology opens up many opportunities for optimization. Also, because the raw data store architecture is exposed, module composition can follow a really simple and efficient model.
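As a rough illustration of the Hexastore idea with composite keys, here is a minimal in-memory sketch. It is not the mentors' design: the names `Triple`, `Item`, `items`, `query` and the `#` separator are all hypothetical, and a plain list stands in for DynamoDB. It assumes one table per ordering of (subject, predicate, object), with the first element as hash key and the other two joined into a composite range key, so that a lookup becomes a hash-key match plus a range-key prefix match (the kind of condition DynamoDB's `begins_with` expresses on a range key).

```scala
object HexastoreSketch {
  // A single edge of the graph; field names are illustrative only.
  final case class Triple(s: String, p: String, o: String)

  // One stored item: which table it lives in, its hash key and its composite range key.
  final case class Item(table: String, hashKey: String, rangeKey: String)

  // Assumed separator for composite range keys; it must not occur inside identifiers.
  val Sep = "#"

  // The six orderings of (subject, predicate, object), one hypothetical table each.
  val orderings: List[(String, Triple => (String, String, String))] = List(
    ("spo", (t: Triple) => (t.s, t.p, t.o)),
    ("sop", (t: Triple) => (t.s, t.o, t.p)),
    ("pso", (t: Triple) => (t.p, t.s, t.o)),
    ("pos", (t: Triple) => (t.p, t.o, t.s)),
    ("osp", (t: Triple) => (t.o, t.s, t.p)),
    ("ops", (t: Triple) => (t.o, t.p, t.s))
  )

  // Every triple is written once per ordering, which suits a write-once import phase.
  def items(t: Triple): List[Item] =
    orderings.map { case (table, project) =>
      val (first, second, third) = project(t)
      Item(table, first, second + Sep + third)
    }

  // In-memory stand-in for a key-condition query: exact hash key, range key prefix.
  def query(store: Seq[Item], table: String, hashKey: String, rangePrefix: String): Seq[Item] =
    store.filter { i =>
      i.table == table && i.hashKey == hashKey && i.rangeKey.startsWith(rangePrefix)
    }
}
```

With made-up data, finding every subject linked to a given object by a given predicate would read `query(store, "pos", "organism", "taxon9606" + Sep)`. The six-fold write amplification is the price of answering any triple pattern with a single key lookup, which is easier to accept when writes and reads happen at different times.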

All this will be implemented in Scala, and requires a deep understanding of the DynamoDB model and of eventually consistent data stores in general. Implementing the writing/import part would involve working with EC2, S3 and SQS, using tools such as nispero. Only a subset of the general data model will be implemented, possibly using nispero again.

references

expected outcome

A proof-of-concept implementation of a core bio4j module (like UniProtKB) on top of DynamoDB, capable of executing queries through the corresponding module API.

mentors

@eparejatobes (mailto:eparejatobes@ohnosequences.com)
He coordinates all the distributed systems and architecture aspects, and has a lot of experience designing and building systems based on AWS. He is also currently working on abstract models for query languages and graph databases.
@evdokim (mailto:ekovach@ohnosequences.com)
He is the main developer behind nispero and related extensions, and has built several data analysis tools based on DynamoDB, EC2, SQS and S3.
