
Backend architecture: Datomic, datahike, OpenCrux, datalevin, Fluree #9

Open
tangjeff0 opened this issue Apr 30, 2020 · 23 comments

@tangjeff0 (Collaborator) commented Apr 30, 2020

|  | Datomic | datahike | OpenCrux |
| --- | --- | --- | --- |
| scalability | 100B datoms | millions of entities | dependent on document size |
| time | uni-temporal | uni-temporal | bi-temporal |
| license | closed-source | EPL 1.0 | MIT |
| storage services | DynamoDB, Cassandra, JDBC SQLs | LevelDB, Redis, PostgreSQL | RocksDB, LMDB, Kafka, JDBC SQLs |
@tangjeff0 tangjeff0 changed the title Decide on a backend architecture: Datomic, datahike, or OpenCrux Architect backend: Datomic, datahike, or OpenCrux Apr 30, 2020
@tangjeff0 tangjeff0 mentioned this issue May 1, 2020
@tangjeff0 (Collaborator, Author)

From Jeroen in the Slack:

Maybe start with Datomic (the best known and most mature option) and postpone this decision? Unless someone has a clear vision on this. I think these three should be mostly compatible at query and data model level. If something becomes difficult with Datomic, reconsider (e.g. when implementing collaboration features)? Or when people have trouble setting up the free version, and don’t want to pay for the commercial version, reconsider. If this is an upfront certainty, go for datahike or OpenCrux right away? (edited)
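Jeroen's compatibility point can be illustrated with a query sketch: the `:find`/`:where` Datalog shape below (using hypothetical Athens-style attributes, not actual Athens schema) is common to Datomic, datahike, and DataScript, while Crux's dialect is close but runs against an explicit db snapshot of schemaless documents.

```clojure
;; Hypothetical attribute names, for illustration only.
;; Find the titles of pages containing a block that mentions "backend".
'[:find ?title
  :where
  [?page :node/title ?title]
  [?block :block/page ?page]
  [?block :block/string ?s]
  [(clojure.string/includes? ?s "backend")]]
```

Swapping stores then mostly means changing how the query is executed (e.g. `d/q` against a connection's db value vs. `crux/q` against a node snapshot), not the query itself.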

@refset commented May 2, 2020

I can only speak for Crux on these points...

Pros:

  1. Regular, fully-featured releases w/ transparent roadmap (e.g. upcoming JSON and SQL support might help non-Clojure Athens users to build tools/integrations): https://github.com/juxt/crux/projects/1
  2. Low memory requirements makes it particularly suitable for self-hosting (this is mostly because the query engine is lazy)
  3. Setting up a collaborative Crux-backed environment could be as simple as having a group of users share access to a managed Kafka service, see https://juxt.pro/blog/posts/crux-confluent-cloud.html (vs. always having to maintain a bunch of centralised DB infrastructure somewhere)
  4. Dev team that is excited and keen to see Athens succeed
  5. There's a tantalising possibility that bitemporality could be an invaluable capability in a collaborative context. We're already thinking about the feasibility of using Hybrid Logical Clocks in place of a simple valid-time timestamp (see: https://jaredforsyth.com/posts/hybrid-logical-clocks/ & CockroachDB)
  6. An EQL-like syntax is available for "pull": xtdb/xtdb#849

Cons:

  1. We're still in Beta - so there may be a few API changes, but nothing too fundamental
  2. Crux is schemaless, but not magic, so you still need to have some idea of what your schema looks like :)
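To illustrate the "schemaless, but not magic" point, here is a minimal sketch using the beta-era `crux.api` namespace (names as of Crux circa 2020, before the XTDB rename; exact node options are version-dependent): documents need no declared schema, but the application must still agree on attribute names for queries to work.

```clojure
(require '[crux.api :as crux])

;; in-memory node; a real deployment would configure Kafka/RocksDB etc.
(def node (crux/start-node {}))

;; put a document: no schema is declared, but :block/string is still a
;; naming convention the application has to commit to
(crux/submit-tx node [[:crux.tx/put
                       {:crux.db/id :block-1
                        :block/string "Hello Athens"}]])
(crux/sync node) ;; wait for the transaction to be indexed

;; query a point-in-time db snapshot
(crux/q (crux/db node)
        '{:find [?e ?s]
          :where [[?e :block/string ?s]]})
```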

Hope that helps!

Edit: this might be of interest: https://findka.com/blog/migrating-to-biff/ (Firebase-like stack on top of Crux)

@tangjeff0 (Collaborator, Author)

From Christopher Small, author of datsync

My hope is that DatSync will be able to support Datahike on the backend, and I have no objections to supporting Crux if there aren't technical blockers.
Is it [Athens] mainly focused on small deployments for a sort of DIY self-hosted Roam? If so, and you'd mostly be expecting data from small numbers of users, you can probably get things working with any of these tools.
If however, you are hoping to have large centralized (but OSS) hosting available, then you'd need to think about scalability, and I think your best option there would be Datomic.
Datahike has pretty decent query performance, but writes have the potential to be a bottleneck, so you can look into where that pain point hits. For lots of (relatively) small deployments though, datahike would be perfect.
If I knew more about Crux I might be able to say more about its advantages, but if you are looking to use DataScript on the client, Datahike is a fork, and so likely to be a better impedance match.

@refset commented May 2, 2020

We've not looked at datsync in any detail but we have spent some time thinking about crux->datascript replication already: https://github.com/crux-labs/crux-datascript/blob/master/src/crux_datascript/core.clj

@whilo commented May 4, 2020

Just to also chime in and add a few things that have not been said yet:

Yes, Datahike is still very much compatible with DataScript, and moreover we are aiming to port our query engine with durability back over to ClojureScript in our next release as well (after 0.3.0, which is pending). So Datahike will be able to substitute for DataScript and optionally provide client-side durability at the same time. We have implemented all our abstractions as replikativ libraries in a platform-neutral way from the start; the main thing missing is ClojureScript asynchronous IO support in Datahike's query engine code. This is a very doable task, it was just easier and more attractive to get the JVM version working well first.

Replicating Datahike will be possible with P2P web technology, as demonstrated in https://lambdaforge.io/2019/12/08/replicate-datahike-wherever-you-go.html. We are convinced that we need to find better business models than the current data-silo approach.

We also provide a Datomic compatible core API that is used by our commercial clients, so if you decide to stick to the common subset, you will be able to swap Datahike in at any point. If you hit missing features or incompatibilities, please open an issue. We are currently working on our write throughput and I am confident that we can scale to Datomic size deployments in principle, it was just a matter of priorities.
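The Datomic-compatible core API mentioned above can be sketched as follows (a minimal example, assuming Datahike's 0.3.x-style map config; the `:block/*` attributes are hypothetical, and `:schema-flexibility :read` is used here so the sketch can transact without declaring a schema first):

```clojure
(require '[datahike.api :as d])

;; in-memory store for the sketch; real use would pick :file, :pg, etc.
(def cfg {:store {:backend :mem :id "athens-demo"}
          :schema-flexibility :read})

(d/create-database cfg)
(def conn (d/connect cfg))

;; transact a couple of blocks, then query them back with Datalog,
;; using the same d/transact + d/q shape as datomic.api
(d/transact conn [{:block/uid "b1" :block/string "Hello Athens"}
                  {:block/uid "b2" :block/string "Backed by Datahike"}])

(d/q '[:find ?uid ?s
       :where
       [?e :block/uid ?uid]
       [?e :block/string ?s]]
     (d/db conn))
```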

We, the members of LambdaForge, are also big fans of the Zettelkasten method (even before we were aware of Roam) and use https://org-roam.readthedocs.io/en/latest/ at the moment. We would be super happy to see a reliable open-source implementation like Athens succeed, so keep going 💯!

I think ideally the backends should be exchangeable, so even if you decide on one, be mindful of where you buy into its specific semantics.

@jelmerderonde (Contributor)

Although I don't consider myself an expert in databases, I guess one of the (future) advantages of Datahike would be that it could potentially enable "local first" as described here: https://www.inkandswitch.com/local-first.html. For me this would be great to have in a tool like Athens because you could easily edit offline on multiple machines, while having confidence that your edits could later on combine seamlessly.

@tangjeff0 (Collaborator, Author)

Thanks so much for sharing that link. Several engineers (including myself) are quite interested in local first applications. We've discussed databases like OrbitDB, Gun, and Scuttlebutt. Datahike is very interesting for this reason.

@tangjeff0 tangjeff0 changed the title Architect backend: Datomic, datahike, or OpenCrux Research and debate backend architecture: Datomic, datahike, OpenCrux, OrbitDB May 15, 2020
@jelmerderonde (Contributor)

@tangjeff0 no problem. I guess Datahike isn't quite there yet, but maybe @whilo can share something about whether Datahike would allow a local-first workflow in the future?

@whilo commented May 16, 2020

Yes. Since our early work on http://replikativ.io/, which predated most of these other local-first approaches (but did not attract a large community back then, and lacked a nice programming model such as Datalog), we have wanted to be local-first. We aim to port Datahike back to ClojureScript in our next iteration. Do you think Open Collective would work to fund this work? Any help would be appreciated, as we are currently still hammering out Datomic compatibility and some scalability issues in the JVM version.

@tangjeff0 (Collaborator, Author)

Will re-open after v1 is complete.

@tangjeff0 tangjeff0 reopened this May 29, 2020
@tangjeff0 tangjeff0 changed the title Research and debate backend architecture: Datomic, datahike, OpenCrux, OrbitDB Backend architecture: Datomic, datahike, OpenCrux, OrbitDB... May 29, 2020
@tangjeff0 tangjeff0 mentioned this issue Jun 10, 2020
4 tasks
@pepoospina

TL;DR:

Do you plan to support block-level access control and notifications/subscriptions? If so, how do you plan to do this? Maybe the DB is a deal-breaker.


Hi there. I've been discussing with @tangjeff0 a little bit on Twitter about your plans and how they could be linked with ours.

I also had some experience working with heavily nested and linked content with my previous project www.collectiveone.org and I have a couple of comments regarding the DB and how to handle the multi-player case:

  • access control at the block level: Ideally you want access control at the block level, but you need some sort of "default" inheritance logic to be able to switch access control of a whole area at once. Notion does this at the page level, with rules like "permissions of this page are defined by this other workspace...". Besides inheritance, I would like composability, so that you can say things like "those with access to A AND/OR access to B can access this". Also, access control must be super fast, as it is computed almost every time a block is read.

  • subscriptions and notifications: Ideally you also need some sort of inheritance logic here, so that if I want to be notified of changes to a block, I get notified every time any of its child blocks changes. Each user can have different notification settings for each object, and one block can be in many places at the same time, so once there is an event on one block it is very hard to determine who should receive that email/push notification.

I did this in Postgres the last time I tried and relied heavily on recursion, navigating the DB in many directions before determining what to do or who to send a message to. This was too slow. I am not an expert in big-data systems, so I really wonder how these problems should actually be handled.
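The inheritance rule described above can be sketched as a toy resolver (not Athens code; block IDs and ACL shapes are invented for illustration): a block either carries an explicit ACL or inherits its nearest ancestor's, resolved by walking up the parent chain. Memoizing or caching that walk is what keeps per-read permission checks fast.

```clojure
;; Toy block store: each block has a parent and an optional explicit ACL.
(def blocks
  {:page-a  {:parent nil      :acl #{"alice" "bob"}}
   :block-1 {:parent :page-a  :acl nil}           ; inherits from :page-a
   :block-2 {:parent :block-1 :acl #{"carol"}}})  ; explicit override

(defn effective-acl
  "Return the ACL that applies to id: its own, or the nearest ancestor's."
  [blocks id]
  (when-let [{:keys [parent acl]} (get blocks id)]
    (or acl (effective-acl blocks parent))))

(effective-acl blocks :block-1) ;; => #{"alice" "bob"}
(effective-acl blocks :block-2) ;; => #{"carol"}
```

Composability ("access to A AND/OR access to B") would replace the plain sets with a small expression language evaluated at the same resolution step.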

@tangjeff0 tangjeff0 pinned this issue Jul 9, 2020
@tangjeff0 (Collaborator, Author)

Another factor I'd like to point out is the conflict resolution story, whether it be distributed or centralized.

@vHanda commented Oct 16, 2020

Another option could be to use a Git repository as a backend. This would require creating a REST API on top of the Git repo to parse the documents, but it would result in greater compatibility with existing tools. One would easily be able to have the files locally, and even use other markdown editors or more advanced editors like Obsidian. And there is already a mobile app ready (GitJournal; I'm the author).

This would result in a very different architecture though. I'm willing to help, if you want to go down this route. I would love more tools to be compatible with each other.

@almereyda

You could also consider https://github.com/terminusdb/terminusdb-server

@tangjeff0 tangjeff0 unpinned this issue Jan 19, 2021
@tangjeff0 tangjeff0 pinned this issue Jan 19, 2021
@tangjeff0 tangjeff0 unpinned this issue Jan 22, 2021
@agentydragon (Contributor)

I'd just like to add that for me, Athens being open-source is a significant advantage over Roam, and if Athens ends up requiring a closed-source backend to be most useful, that advantage would be diminished.

Also it would be nice to abstract the backend-talking code to allow people to potentially run Athens on other backends, as long as they support some defined protocol.

@tangjeff0 (Collaborator, Author) commented Feb 25, 2021

A protocol is always most ideal but hardest to pull off. Crux, Datomic, DataScript, and Datahike will inevitably have some differences from each other.

Agree that a closed-source backend diminishes value. Inevitably parts of our infrastructure will be closed, but if there is a fully open-source full-stack solution for users to self-host, super great.
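The kind of backend abstraction being discussed could be sketched as a Clojure protocol (a hypothetical design sketch, not Athens's actual code; the operation names and `:block/uid` attribute are invented, and the DataScript example assumes `:block/uid` is declared `:db.unique/identity` in the schema):

```clojure
(require '[datascript.core :as ds])

;; Athens-facing code would depend only on these operations; each store
;; (DataScript, Datahike, Crux, ...) supplies its own implementation.
(defprotocol AthensBackend
  (transact! [this tx-data] "Apply a vector of block/page assertions.")
  (q [this query] "Run a Datalog query against the current db value.")
  (pull-block [this uid] "Fetch one block's attributes by :block/uid."))

;; e.g. an in-memory DataScript implementation:
(defrecord DataScriptBackend [conn]
  AthensBackend
  (transact! [_ tx-data] (ds/transact! conn tx-data))
  (q [_ query] (ds/q query @conn))
  (pull-block [_ uid] (ds/pull @conn '[*] [:block/uid uid])))
```

The differences mentioned above would surface exactly at this boundary, e.g. Crux's document model vs. datom-based pull, which is what makes a fully uniform protocol hard.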

Also just learned about https://github.com/fluree/ from Matei. Clojure, Web3, open-source.

https://www.youtube.com/watch?v=uSum3uynHy4&feature=youtu.be

@pepoospina

Hi there! I'm glad to see some movement here 🙂

We have been working on an interface specification for our Athens-like app so that the backend is abstracted. We have also been working hard on a NodeJS + DGraph backend API that is open-sourced under an AGPL-like license.

I'd bet the interface supports (or will support) all the needs of Athens. Who knows! 💪. It includes backlinks and search features, granular access control (and thus multi-player), and fast data creation and fetching.

Reusing our backend, or just the interface, would also provide interoperability among our apps. Users would be able to embed and edit blocks from Athens in Intercreativity, for example. They could also "fork" them, as we want to support Git-like flows with content.

Oh, and eventually Athens could connect to other data storage solutions. We have prototypes for OrbitDB, Ethereum, Kusama, and IndexedDB (local).

This is a recent demo of our latest milestone (a simple case where users mix private with public content). We are about to release a new version where users can explore a feed of blog posts.

If you want to run it, this repo should run ok on Ubuntu or Mac. It is our latest development version.

Oh, and this is our discord in case you want to reach us. 👋

@mateicanavra commented Feb 26, 2021

The video @tangjeff0 mentioned above covers both the broad vision and technical details of Fluree better than I could, but here's my quick take:

Fluree is an in-memory, semantic graph database backed by a permissioned blockchain, built with Clojure, and open source.

It can be containerized (with Kubernetes support) and optionally decentralized (e.g. using StorJ via Tardigrade), run as a standalone JVM service, or embedded inside the browser as a web worker. Read more about the query server (fluree-db) and the ledger server (fluree-ledger).

Since Fluree extends RDF (the official W3C standard for data interchange), it immediately becomes interoperable with the linked open datasets of the semantic web. One interesting use case would be to directly query DBpedia or Wikidata from within Athens and combine the results with your own data at runtime, without an API. Additionally, an RDF foundation means you can build ontologies with any of the modeling languages built on top of it (RDFS, OWL, etc., which are official W3C recommendations), which opens up capabilities for inferencing and automated reasoning.

From my view, Fluree could be a powerhouse tool to strongly differentiate Athens from Roam and every other "tool for networked thought." Between RDF standards and a permissioned blockchain (which allows for block/cell-level access control), you could seamlessly and securely deploy Athens at an individual, team, or enterprise level using the same scalable infrastructure.

Would love to get the Fluree team's thoughts here...

@lambduhh (Contributor)

@quoll I would like to advocate for the adoption of https://github.com/threatgrid/asami but feel like it would be better left up to the expert :) Athens is currently ClojureScript/re-frame/DataScript/posh (I'm working on sunsetting posh rn actually)

What are your thoughts on whether Asami would be a good fit as a graph-DB for us?

Selfishly, I will admit I would LOVE the excuse to combine our opensource powers to leverage the benefits of bi-directional knowledge linking, use Asami in the wild and possibly have the opportunity to work with you in a technical aspect to help implement it if we do end up going this way... and I don't think I'd be the only one!

@quoll commented Feb 28, 2021

Love to help. I hope to have Asami 2.0-alpha out by the end of the week. This will have storage when on the JVM. JavaScript is coming, but in the meantime it will have save/load functions.
Unfortunately, Asami doesn’t have all the APIs of the other stores, e.g. the Pull API.

@agentydragon (Contributor)

I've looked a bit into Datahike.

As someone new to Clojure, this makes me less nervous about depending on a backend that has a Datomic-like API, and optimistic about Datahike, because it would still allow freedom in the backing storage system.

@Tr3yb0 commented Apr 15, 2021

@mateicanavra laid out Fluree for us very well in his comment above. I will elaborate a little on some of the points made and bring up one additional one, which is one of the most powerful parts of Fluree.

The foundation of RDF is intended to enable data interoperability across the semantic web and provides a very flexible data model for the applications built on top of it. Our immutable ledger brings both decentralization and horizontal scaling in the transaction tier, if that is needed, as well as the benefits of querying historical data states from earlier in the blockchain.

We have segregated the query and transaction tiers, such that the query engine and an LRU cache of data can be loaded in-memory on the client device using a service worker. I would imagine that for a personal Athens graph, that may be the entire thing, which enables millisecond query responses. The db (query peer) is also linearly scalable, but I'm not sure that really applies to the use case here.

The biggest advantage Fluree brings is SmartFunctions. Because each transaction is encrypted with the user's private key, the data can be permissioned at the individual RDF element level. You could write the SmartFunctions in such a way that no one else would have access to them, and a user could share as desired.

@refset commented Apr 20, 2021

Noting this work-in-progress Datahike backend for the benefit of those following this issue: https://github.com/athensresearch/athens-backend

Also, I recently pulled together a comparison matrix for various Clojure-Datalog stores: https://clojurelog.github.io/

@tangjeff0 tangjeff0 changed the title Backend architecture: Datomic, datahike, OpenCrux, OrbitDB... Backend architecture: Datomic, datahike, OpenCrux, datalevin Apr 27, 2021
@tangjeff0 tangjeff0 changed the title Backend architecture: Datomic, datahike, OpenCrux, datalevin Backend architecture: Datomic, datahike, OpenCrux, datalevin, Fluree May 14, 2021
@sid597 sid597 added the i/high impact label Jun 9, 2021
sid597 added a commit to sid597/athens that referenced this issue Jun 9, 2021
Made code better based on jeff's great review.
neotyk added a commit that referenced this issue Jul 14, 2021
Sid migrate `:drop/...` events