
DynamoDB backed bio4j prototype

Eduardo Pareja Tobes edited this page Mar 3, 2014 · 8 revisions
  • difficulty hard
  • technologies scala++, aws+++, blueprints++, dynamodb+++, s3++, ec2++

This is a challenging project. We think it is possible to build a prototype of a DynamoDB-based bio4j distribution. The basic idea is to take advantage of how well our very particular needs fit a model like this: writes and reads are completely isolated and happen at different times, we have a strongly typed, immutable domain model that determines much of the data topology, etc.

@eparejatobes and @evdokim already have some design ideas on how to do this, extending the ideas of the "Hexastore" paper by using composite keys and many carefully designed tables. It is key to realize that the Bio4j case is very specific, as explained above. In particular, the static information about the data topology opens up many opportunities for optimization. Also, the fact that the raw data store architecture is exposed makes for a simple and efficient model for module composition.
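To make the Hexastore idea a bit more concrete, here is a minimal in-memory sketch in Scala, not the actual design. It assumes a graph edge is viewed as a (source, edge type, target) triple, that each of the six orderings of those components gets its own table, and that the first component becomes the hash key while the other two are joined into a composite range key. All names here (`Triple`, `IndexEntry`, the `edges_*` table names, the `#` separator) are hypothetical and do not come from bio4j or the AWS SDK.

```scala
// Illustrative only: none of these names come from bio4j or the DynamoDB SDK.
object HexastoreSketch {

  // An edge of the graph viewed as a triple: source vertex, edge type, target vertex.
  final case class Triple(source: String, edgeType: String, target: String)

  // One item in one index table: a hash key plus a composite range key.
  final case class IndexEntry(table: String, hashKey: String, rangeKey: String)

  // Separator for composite range keys; assumed never to occur inside identifiers.
  val Sep = "#"

  // The six orderings of (s, p, o) used by a Hexastore-style index.
  val orderings: List[String] = List("spo", "sop", "pso", "pos", "osp", "ops")

  // Pick the triple component named by one letter of an ordering.
  private def component(t: Triple, c: Char): String = c match {
    case 's' => t.source
    case 'p' => t.edgeType
    case 'o' => t.target
  }

  // One entry per ordering: the first component is the hash key,
  // the remaining two are joined into the composite range key.
  def indexEntries(t: Triple): List[IndexEntry] =
    orderings.map { ord =>
      val parts = ord.toList.map(c => component(t, c))
      IndexEntry(
        table = s"edges_$ord",
        hashKey = parts.head,
        rangeKey = parts.tail.mkString(Sep)
      )
    }
}
```

Under these assumptions, a question such as "which sources have an edge of type p pointing to target o" becomes a lookup in the `edges_pos` table with hash key p and a range key starting with `o#`, which is the kind of prefix condition a DynamoDB query can express on the range key. Since the edge types are known statically, a real design could presumably drop the orderings that no module query ever needs.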

All this will be implemented in Scala, and requires a deep understanding of the DynamoDB model and of eventually consistent data stores in general. Implementing the writing/import part would involve working with EC2 + S3 + SQS, using tools such as nispero. Only a subset of the general data model will be implemented, possibly again using nispero.

references

expected outcome

A proof-of-concept implementation of a core bio4j module (like UniProtKB) on top of DynamoDB, capable of executing queries through the corresponding module API.

mentors