Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] In-memory mode #4813

Closed
marvin-hansen opened this issue Feb 19, 2020 · 3 comments
Closed

[Feature Request] In-memory mode #4813

marvin-hansen opened this issue Feb 19, 2020 · 3 comments
Labels
kind/feature Something completely new we should consider.

Comments

@marvin-hansen
Copy link

marvin-hansen commented Feb 19, 2020

TL'DR:
Add in-memory mode to boost performance and use in-memory (predicate) compression to pack as much data as possible in the given memory. That smokes a lot of benchmarks and helps a lot with graph OLAP.

Experience Report

Right now, we get about a 6 - 10 ms latency with a 5 node Dgraph cluster, and that's nice, but only the case because we load balance across all alphas and the current graph size is small enough that it fits completely in memory. Specifically, we run a count query against all predicates and fetch them all in a quiet hour so they are in memory when needed. That works very well, latency is very low and it's relatively easy to script.

What you wanted to do

For our unified graph, we actually wanted one single DB for operations and analytics in one cluster.
But we had some troubles with DGraph and some hard questions about scalability that are still not answered so we moved forward.

Why that wasn't great, with examples

We simply could not figure out if DGraph is going to sustain performance beyond a few billion predicates. There are no case studies of very large deployments, and that indicates there are most likely none.

What you actually did

We decided to factor out analytics and instead use an in-memory graph that goes all the way up to 500 billion ops per second and, in fact, have to integrate a secondary graph for analytics because of the alternative already ships with all important graph analytic algorithms so it makes sense.

What would be the greatest possible solution?

Please consider adding an persistent in-memory mode to DGraph.
Persistent in-memory means, the entire graph is in memory AND synced on disk so you get both, the speed from querying everything from memory AND the safety of having all data on disk in case a node goes down or needs a restart, say for an update.

Preserve "correctness" and transactional safety but improve performance with an option to hold the entire graph in memory and sync memory to disk. A bit like Redis-graph, but actually useful.

I know, one USP of DGraph is exactly that you can manage graphs way bigger than available memory because it only retains the keys in memory but stores the actual values on disk. Sometimes, you either don't have a graph that big, or equally possible, you have a lot of memory so you can hold the entire graph in memory. Because memory prices are falling already for years, you get so much cheap memory on GCP, that in-memory analytics becomes a no-brainer.

We are actually moving all analytics to a persistent in-memory graph, because, it's actually 30% cheaper than an equivalent GPU accelerated column-based storage but equivalent in performance on most tasks. In terms of system complexity, in-memory is way simpler to handle and scales much either because allocating memory or adding more high-mem nodes can be done with auto-scaling in Kubernetes.

As a rule of thumb, you need twice the memory relative to the graph size plus ~20% for the system. If a graph is about 150 GB in size, you need at least 300 GB memory plus some 50GB reserved for the system, so in total, you need 350GB. With predicate compression, however, it is certainly feasible to pack twice the graph size in the same memory and that's a huge additional relative to the total gain in performance. Obviously, memory usage is high but we are good with that.

Any external references to support your case

Trillion Triple Benchmark:
https://dzone.com/articles/got-a-minute-check-out-anzographs-record-shatterin

@marvin-hansen marvin-hansen changed the title In-memory {Ludicrous} mode [Feature Request] In-memory {Ludicrous} mode Feb 19, 2020
@marvin-hansen marvin-hansen mentioned this issue Feb 19, 2020
24 tasks
@marvin-hansen marvin-hansen changed the title [Feature Request] In-memory {Ludicrous} mode [Feature Request] In-memory mode Feb 19, 2020
@danielmai
Copy link
Contributor

An in-memory option is available today. By default, Dgraph mmaps LSM tree tables. You can set this to in-memory by setting the flag --badger.tables=ram.

@marvin-hansen
Copy link
Author

marvin-hansen commented Feb 19, 2020

@danielmai Hang on does the flag --badger.tables=ram syncs memory to disk?

I was talking about holding the graph in memory AND syncing memory to disk so you get the speed from memory AND keep your data safe on disk. Otherwise, it's only volatile and, I guess, nobody wants that for serious data....

@MichelDiz MichelDiz added the kind/feature Something completely new we should consider. label Feb 26, 2020
@minhaj-shakeel
Copy link
Contributor

Github issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.

drawing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Something completely new we should consider.
Development

No branches or pull requests

4 participants