Support for API Bulk Loading #86
Comments
Origin comment by: @jeffreylovitz

The major bottleneck in the speed of graph creation is reallocating the matrices to accommodate new vertices and edges. Using a clause like

The binary encoding we chose for the bulk loader saves some time in processing and reduces some of the pressure on the Redis input buffer (which has limits of roughly 1 million tokens and/or 1 gigabyte of data). That benefit is fairly trivial compared to the speed gained by reducing the frequency of matrix reallocations, however.

I think that the most straightforward approach, as you suggest, would be writing a script similar to the CSV loader that encodes data in the same binary format. What format is your input data in?

I like the idea of improving bulk load capabilities very much. As one word of caution, when using the
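The reallocation cost described above can be sketched with a toy comparison (this is illustrative Python, not RedisGraph code): growing a matrix's dimension by exactly one slot per inserted vertex triggers one reallocation per insert, while doubling the capacity whenever it is exhausted needs only a logarithmic number of reallocations.

```python
# Toy sketch (not RedisGraph code): count "reallocations" under two
# growth policies when inserting vertices one at a time.

def reallocs_exact_fit(n_inserts):
    """Grow capacity by 1 on every insert: one realloc per vertex."""
    return n_inserts

def reallocs_doubling(n_inserts, initial_capacity=1):
    """Double capacity whenever it is exhausted: O(log n) reallocs."""
    capacity, reallocs = initial_capacity, 0
    for size in range(1, n_inserts + 1):
        if size > capacity:
            capacity *= 2
            reallocs += 1
    return reallocs

print(reallocs_exact_fit(1_000_000))   # 1000000 reallocations
print(reallocs_doubling(1_000_000))    # 20 reallocations
```

This is why batching inserts, or reserving slack space up front, dominates any savings from the wire encoding itself.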
Origin comment by: @mboldisc
Origin comment by: @jeffreylovitz

One reason for this is that the memory cost associated with matrix size is pretty steep. Each label has one associated matrix, and each relationship type has two. For traversals to work properly, all labels and relations in a query have to be of the same size (since we perform a series of matrix multiplications). Per matrix, the cost of an allocated-but-unused vertex is 16 bytes, which is not in itself that bad, but you can see how it would quickly add up.

I think that one interesting approach would be to change the matrix sizing policy, either through user specification (like your idea 1) or through some simple heuristics that determine whether the graph is write-heavy (and would thus benefit from fewer reallocs, suggesting more generous sizing as in idea 2) or read-heavy (and would thus benefit more from reducing wasted RAM).

I'll discuss this with the rest of our team, but I'm not sure when we'll be able to place it in our roadmap! If you'd like to experiment with some changes yourself, I can point you to the areas in the code where we set the resizing policies and the dimensions to which matrices should be resized (a la idea 2).
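The figures in this comment (16 bytes per allocated-but-unused vertex per matrix; one matrix per label, two per relationship type) make the over-allocation cost easy to estimate. A quick back-of-envelope sketch, with the example counts chosen arbitrarily for illustration:

```python
# Back-of-envelope cost of over-allocating matrix dimensions, using the
# 16-bytes-per-unused-vertex-per-matrix figure quoted above.

BYTES_PER_UNUSED_VERTEX = 16

def wasted_bytes(unused_vertices, n_labels, n_relation_types):
    # one matrix per label, two per relationship type
    n_matrices = n_labels + 2 * n_relation_types
    return unused_vertices * BYTES_PER_UNUSED_VERTEX * n_matrices

# e.g. 1M slack vertices, 5 labels, 10 relationship types
# -> 1_000_000 * 16 * 25 = 400 MB of wasted RAM
print(wasted_bytes(1_000_000, 5, 10))
```

This is the trade-off behind the read-heavy vs. write-heavy heuristic: slack space buys fewer reallocations but its RAM cost scales with the number of labels and relationship types.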
Origin comment by: @mboldisc
Origin comment by: @jeffreylovitz

One of the first things that happens when a query is received is that the matrix synchronization policy is set. As the comments indicate, these policies serve two roles at the moment: managing the dimensions of matrices as well as flushing all pending changes to those matrices. I think that we'll likely split those roles into two separate policies when we try to optimize this area more thoroughly.

The first role is fairly safe to change, and you can create a buffer of unused space in matrices by making changes in or around the matrix dimension setter.

The second role is a bit trickier, but also important, as one can break atomicity guarantees if read queries are being issued to the database while matrices are out of sync. Pending changes are flushed with calls to

In normal execution, synchronization occurs whenever a matrix is fetched. By adding some buffer space

The RedisGraph team discussed this issue today, and we think we should be able to make significant improvements in allowing batch operations like this, but we won't be able to make changes until we're sure that they're safe for all use cases. From what you've described, I think that a more bespoke approach would work fine for you in the meantime; we ought to be able to introduce a formal solution in the next few months.

I hope that this was helpful! Let me know if you have any questions or need help, and please tell us what your experience is like. This is an area we can make big improvements in, so additional sets of eyes are a huge help.
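The two roles described in this comment can be sketched as a toy Python class. None of these names come from the RedisGraph codebase; this is only an illustration of the pattern: over-allocate dimensions on write (role one), and flush all pending changes before any read so readers never see an out-of-sync matrix (role two).

```python
# Illustrative only: a buffered "matrix" that separates the resize policy
# from the synchronization step, as described in the comment above.

class BufferedMatrix:
    def __init__(self, slack=1024):
        self.committed = {}   # synced (row, col) -> value, visible to readers
        self.pending = []     # writes not yet flushed
        self.dim = 0
        self.slack = slack    # extra capacity reserved on each resize

    def resize_for(self, index):
        if index >= self.dim:
            # role one: over-allocate so upcoming writes need no resize
            self.dim = index + self.slack

    def write(self, row, col, value):
        self.resize_for(max(row, col))
        self.pending.append((row, col, value))   # cheap: no sync yet

    def sync(self):
        # role two: flush every pending change before reads are served
        for row, col, value in self.pending:
            self.committed[(row, col)] = value
        self.pending.clear()

    def read(self, row, col):
        self.sync()   # readers always see a fully synced matrix
        return self.committed.get((row, col))

m = BufferedMatrix()
for i in range(10_000):
    m.write(i, i + 1, True)   # resizes every `slack` writes, not every write
print(m.read(0, 1))           # True
```

The atomicity hazard mentioned above corresponds to calling `read` without the `sync` step: a reader would miss everything still sitting in `pending`.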
Origin comment by: @kkonevets

I ended up using LMDB to implement a graph database on top of it; it turned out to be easy to do. At the time I tried it, Redis didn't persist state to disk (it was purely in-memory), so I switched to LMDB.
Created by: @mboldisc
Source: RedisGraph/RedisGraph#476
I've been trying to load hundreds of thousands of vertices into RedisGraph using Redis Python APIs.
Each vertex is loaded as follows:
MERGE (n:MyNode{id:'12345'})
Using the basic Redis Python client and the asyncio client, I see around 200 inserts per second. I also tried pipelining, and it had the same performance. Redis consumes one entire CPU during the bulk load; it appears to be maxed out on Redis's single core, and Redis doesn't support multi-processing out of the box.
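One common workaround for the per-query overhead (independent of the bulk loader) is to batch many single-vertex MERGEs into one query, assuming the server version supports Cypher's UNWIND clause. The helper below is purely illustrative, not from this thread; note also that interpolating values into the query string is shown only for clarity, since real code should sanitize or parameterize inputs.

```python
# Hypothetical sketch: collapse many single-vertex MERGEs into one
# UNWIND query, so each GRAPH.QUERY call creates a whole batch of nodes.

def build_batch_merge(ids):
    # build a Cypher list literal of {id: '...'} maps
    rows = ", ".join("{id: '%s'}" % i for i in ids)
    return ("UNWIND [%s] AS row "
            "MERGE (n:MyNode {id: row.id})" % rows)

query = build_batch_merge(["12345", "12346", "12347"])
print(query)
# With redis-py this would be sent roughly as (requires a live server;
# the graph name "mygraph" is an assumption):
#   r.execute_command("GRAPH.QUERY", "mygraph", query, "--compact")
```

Batching a few hundred ids per call amortizes the per-query parsing and matrix-synchronization cost over the whole batch instead of paying it once per vertex.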
I assume there's a reason why the CSV bulk loader script was written using binary data. I'd prefer a solution that doesn't involve CSV records for bulk loading data. Any recommendations? My first thought is to convert the bulk loading script into a simple API. I'd be willing to help if others think this is useful. Other thoughts?
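To make the binary-encoding idea concrete, here is a hedged sketch of what a length-prefixed node encoding could look like. The actual RedisGraph bulk-insert wire format is defined by the CSV loader script and differs from this; `encode_node` and the layout below are illustrative only. The point is that a compact binary blob avoids per-node query parsing entirely.

```python
# Illustrative binary node encoding (NOT the real bulk-loader format):
# a 4-byte little-endian property count, then each key and value as a
# 4-byte length prefix followed by UTF-8 bytes.
import struct

def encode_string_property(value):
    data = value.encode("utf-8")
    return struct.pack("<I", len(data)) + data

def encode_node(properties):
    blob = struct.pack("<I", len(properties))
    for key, value in properties.items():
        blob += encode_string_property(key)
        blob += encode_string_property(value)
    return blob

blob = encode_node({"id": "12345"})
print(len(blob))  # 4 + (4 + 2) + (4 + 5) = 19 bytes
```

An API-level bulk loader would stream blobs like this to the server in large chunks, staying under the input-buffer limits mentioned above while sidestepping query parsing.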