
Support for API Bulk Loading #86

Open
gkorland opened this issue Sep 24, 2023 · 10 comments
Labels: enhancement (New feature or request), no-issue-activity

Comments

@gkorland
Contributor

Created by: @mboldisc
Source: RedisGraph/RedisGraph#476
I've been trying to load hundreds of thousands of vertices into RedisGraph using Redis Python APIs.

Each vertex is loaded as follows:
MERGE (n:MyNode{id:'12345'})

Using both the basic Redis Python client and the asyncio client, I see around 200 inserts per second. I also tried pipelining, with the same performance. Redis consumes one entire CPU core during the bulk load, so the workload appears to be capped by Redis's single-threaded execution; Redis doesn't support multi-core processing out of the box. A minimal sketch of this loading pattern is below.
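
For illustration, a rough sketch of the per-vertex loading pattern described above, using the redis-py client. The graph name "g", the ID range, and the batch size are assumptions for the sketch, not values from the original report:

```python
# Sketch only: assumes a local Redis instance with the RedisGraph module loaded.
import redis

r = redis.Redis(host="localhost", port=6379)

# One MERGE per vertex -- worst-case behavior for matrix reallocation,
# as discussed in the replies below.
for i in range(100_000):
    r.execute_command("GRAPH.QUERY", "g", f"MERGE (n:MyNode {{id: '{i}'}})")

# Pipelining batches the network round trips, but each MERGE still
# grows the graph one vertex at a time on the server side.
with r.pipeline(transaction=False) as pipe:
    for i in range(100_000):
        pipe.execute_command("GRAPH.QUERY", "g", f"MERGE (n:MyNode {{id: '{i}'}})")
        if (i + 1) % 1_000 == 0:
            pipe.execute()
    pipe.execute()
```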

I assume there's a reason the CSV bulk loader script was written to use binary data. I'd prefer a solution that doesn't involve CSV records for bulk loading. Any recommendations? My first thought is to convert the bulk loading script into a simple API. I'd be willing to help if others think this would be useful. Other thoughts?

@gkorland gkorland added enhancement New feature or request no-issue-activity labels Sep 24, 2023
@gkorland
Contributor Author

Origin comment by: @jeffreylovitz
hi @mboldisc,

The major bottleneck in the speed of graph creation is reallocating the matrices to accommodate new vertices and edges. Using a clause like MERGE will perform these reallocations one vertex at a time, which is basically worst-case behavior!

The binary encoding we chose for the bulk loader saves some time in processing and reduces some of the pressure on the Redis input buffer (which has limits of roughly 1 million tokens and/or 1 gigabyte of data). That benefit is fairly trivial compared to the speed gained by reducing the frequency of matrix reallocations, however.

I think that the most straightforward approach, as you suggest, would be writing a script similar to the CSV loader that encodes data in the same binary format. What format is your input data in?

I very much like the idea of improving bulk load capabilities. As one word of caution: when using the GRAPH.BULK endpoint, the module is incapable of de-duplicating, which makes its commands much closer to CREATE than MERGE. Any script using that endpoint should be capable of sanitizing its inputs, as sketched below.
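
As a minimal illustration of that caveat (the helper name and record shape are hypothetical, not part of the bulk loader), a client could de-duplicate records by ID before handing them to a GRAPH.BULK-style loader:

```python
# Hypothetical helper: GRAPH.BULK behaves like CREATE, so duplicates must be
# removed client-side before loading.
def dedupe_by_id(records):
    """Yield each record whose 'id' has not been seen before."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec

rows = [{"id": "12345"}, {"id": "12345"}, {"id": "67890"}]
print(list(dedupe_by_id(rows)))  # [{'id': '12345'}, {'id': '67890'}]
```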

@gkorland
Contributor Author

Origin comment by: @mboldisc
Ah, thanks for the response! I need to think about this a bit. I really need a MERGE rather than a CREATE: I'm ingesting various data sources and dumping them into the graph while trying not to clobber existing data. Neo4j has a batched JDBC PreparedStatement approach; maybe we need something similar.

@gkorland
Contributor Author

Origin comment by: @mboldisc
It sounds like the bottleneck is the matrix memory allocation and its associated processing. Here are two other potential ideas; let me know if either would help (a sketch of the second follows this list):

  1. Add a configuration parameter to pre-allocate a matrix of a given size, with the elements reserved for new vertex data. Say, for example, I don't expect more than six million vertices in my graph; I could set a config such as redisgraph.vertex.allocation=6000000.

  2. Similar to the idea above, when adding new vertices beyond the current maximum size, could the matrix expand beyond the immediately requested size, perhaps doubling on each expansion? Many list data structures grow past the currently requested element count to reserve future capacity.
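
A minimal sketch of the capacity-doubling policy in idea 2 (the function name and numbers are illustrative, not RedisGraph code):

```python
def grow_dim(current_dim: int, needed: int) -> int:
    """Return a matrix dimension of at least `needed`, doubling from the
    current dimension so the cost of reallocations is amortized."""
    dim = max(current_dim, 1)
    while dim < needed:
        dim *= 2
    return dim

# One reallocation now covers roughly a million subsequent inserts.
assert grow_dim(1_000_000, 1_000_001) == 2_000_000
```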

@gkorland
Contributor Author

Origin comment by: @jeffreylovitz
Those are both good suggestions, and in the past we have in fact implemented variants of both. The trade-off between perfectly sized matrices (which we currently maintain in all contexts except GRAPH.BULK usage) and matrices with empty space for growth is a tricky call.

One reason for this is that the memory cost associated with matrix size is pretty steep. Each label has one associated matrix, and each relationship type has two. For traversals to work properly, all labels and relations in a query have to be of the same size (since we perform a series of matrix multiplications). Per matrix, the cost of an allocated-but-unused vertex is 16 bytes, which is not in itself that bad, but you can see how it would quickly add up.
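
To make that concrete, a back-of-the-envelope calculation under an assumed graph shape (the label and relationship-type counts are illustrative):

```python
# Memory cost of over-allocated vertex space, using the figures above:
# one matrix per label, two per relationship type, 16 bytes per unused vertex.
labels, rel_types = 10, 10
matrices = labels + 2 * rel_types        # 30 matrices
unused_vertices = 1_000_000              # reserved but not yet used
bytes_per_entry = 16
total = matrices * unused_vertices * bytes_per_entry
print(f"{total / 2**30:.2f} GiB of slack")  # ~0.45 GiB
```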

I think that one interesting approach would be to change matrix sizing policy either through user specification (like your idea 1) or through some simple heuristics that determine whether the graph is write-heavy (and would thus benefit from fewer reallocs, suggesting more generous sizing like in idea 2) or read-heavy (and would thus benefit more from reducing wasted RAM).

I'll discuss this with the rest of our team, but I'm not sure when we'll be able to place it in our roadmap!

If you'd like to experiment with some changes yourself, I can point you to the areas in the code where we set the resizing policies and the dimensions to which matrices should be resized (a la idea 2).

@gkorland
Contributor Author

Origin comment by: @mboldisc
@jeffreylovitz - If you have a couple minutes to point out where to start in the code, I'd appreciate it.

@gkorland
Contributor Author

Origin comment by: @mboldisc
I should also mention that I was off by an order of magnitude in MERGE statement speed. It's more like 25 operations per second.

@gkorland
Contributor Author

Origin comment by: @jeffreylovitz
I'd be happy to!

One of the first things that happens when a query is received is that the matrix synchronization policy is set:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L220

As the comments indicate, these policies serve two roles at the moment - managing the dimensions of matrices as well as flushing all pending changes to those matrices. I think that we'll likely split those roles into two separate policies when we try to optimize this area more thoroughly.

The first role is fairly safe to change, and you can create a buffer of unused space in matrices by making changes in or around the matrix dimension setter:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L300

The second role is a bit trickier, but also important, as one can break atomicity guarantees if read queries are being issued to the database while matrices are out of sync. Pending changes are flushed with calls to GrB_Matrix_nvals(). The default synchronization function (used for all endpoints except GRAPH.DELETE and GRAPH.BULK) is here:
https://github.com/RedisLabsModules/RedisGraph/blob/06bad49ffc3624a59c5bd3dac3d0a2e7b83172e8/src/graph/graph.c#L172

In normal execution, synchronization occurs whenever a matrix is fetched - a MERGE query, for example, will fetch every matrix associated with the edges and vertices your pattern describes.

By adding some buffer space in Graph_RequiredMatrixDim() and changing the associated calls to never shrink the retrieved matrix, some performance gains should be possible. If concurrent queries are not a risk for your use case, you should be able to get even greater gains by changing when pending changes are flushed to GraphBLAS.

The RedisGraph team discussed this issue today, and we think we should be able to make significant improvements to support batch operations like this, but we won't make changes until we're sure they're safe for all use cases. From what you've described, I think a more bespoke approach would work fine for you in the meantime; we ought to be able to introduce a formal solution in the next few months.

I hope that this was helpful! Let me know if you have any questions or need help, and please tell us what your experience is like. This is an area we can make big improvements in, so additional sets of eyes are a huge help.

@gkorland
Contributor Author

Origin comment by: @github-actions[bot]
Stale issue message

@gkorland
Contributor Author

Origin comment by: @kkonevets
@mboldisc Have you gotten to a ready-to-use bulk loader for your input format? I need to do the same (#709).

@kkonevets

Origin comment by: @kkonevets
@mboldisc Have you gotten to a ready-to-use bulk loader for your input format? I need to do the same (#709).

I ended up implementing a graph database on top of LMDB; it turns out that's easy to do.

At the time I tried it, Redis didn't persist state to disk (it was purely in-memory), so I switched to LMDB.
