Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add streaming match generation #34

Merged
merged 1 commit into from Feb 13, 2024
Merged

Add streaming match generation #34

merged 1 commit into from Feb 13, 2024

Conversation

davidmezzetti
Copy link
Contributor

This PR proposes "streaming match generation" for GrandCypher. It depends on the changes in this corresponding PR aplbrain/grandiso-networkx#39.

This change is designed to improve the performance of large openCypher queries. It employs a streaming match generation strategy to iteratively process each result.

For example, with this code snippet:

import time

import networkx as nx

from grandcypher import GrandCypher

host = nx.DiGraph()

start = time.time()
size = 1000000
for x in range(size):
    host.add_node(x, name=x)
    host.add_edge(x, x + 1)

print(f"Built graph in {time.time() - start}s")

qry = """
MATCH (A) -[]-> (B) -[]-> (C) -[]-> (D)
RETURN A.name, B.name, C.name, D.name
LIMIT 1
"""

start = time.time()
res = GrandCypher(host).run(qry)
print(f"Query took {time.time() - start}s")
print(res)

The current code base takes:

Built graph in 3.705066442489624s
Query took 86.24252390861511s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}

With this PR, it takes:

Built graph in 2.572826623916626s
Query took 0.003414630889892578s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}

While this query on the surface seems ridiculous, it's not too different than:

SELECT * from table LIMIT 1

For situations where you may just want to sample a few rows.

Here are a couple additional examples:

qry = """
MATCH (A{name: 0}) -[]-> (B) -[]-> (C) -[]-> (D)
RETURN A.name, B.name, C.name, D.name
LIMIT 1
"""

Current:

Built graph in 2.583895444869995s
Query took 0.6473803520202637s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}

With this PR:

Built graph in 3.0622048377990723s
Query took 0.003484010696411133s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}
qry = """
MATCH (A) -[]-> (B{name: 1}) -[]-> (C) -[]-> (D)
RETURN A.name, B.name, C.name, D.name
LIMIT 1
"""

Current:

Built graph in 2.8512825965881348s
Query took 25.05796718597412s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}

With this PR:

Built graph in 2.822657346725464s
Query took 0.0033731460571289062s
{'A.name': [0], 'B.name': [1], 'C.name': [2], 'D.name': [3]}
qry = """
MATCH (A) -[]-> (B) -[]-> (C) -[]-> (D)
WHERE D.name = 100
RETURN A.name, B.name, C.name, D.name
LIMIT 1
"""

Current:

Built graph in 3.100694179534912s
Query took 85.86176109313965s
{'A.name': [97], 'B.name': [98], 'C.name': [99], 'D.name': [100]}

With this PR:

Built graph in 2.6066715717315674s
Query took 0.013916015625s
{'A.name': [97], 'B.name': [98], 'C.name': [99], 'D.name': [100]}

This change also adds an optional default limit that can be applied to all queries. It's worth noting that this only improves performance when a query has results. For queries without results, it still goes through the whole graph looking for results. I thought about adding a maximum iterations parameter but didn't include that with this PR.

@j6k4m8
Copy link
Member

j6k4m8 commented Feb 13, 2024

Wow these are some pretty compelling performance numbers!! 😍

@j6k4m8
Copy link
Member

j6k4m8 commented Feb 13, 2024

@davidmezzetti can you help me understand — I think this change does NOT depend on the grandiso change, but the performance will still be (relatively) unaffected until grandiso is updated...is that right? i.e., we can merge this in before we merge the upstream change, but not expect to get any performance boost?

My instinct is to wait to merge (until grandiso rolls to pypi) anyway so we can bump the version dep for grandiso here in the next release, just asking for awareness :)

@davidmezzetti
Copy link
Contributor Author

That's correct, this change works with the current version of grandiso. But as you said, it wouldn't benefit from the short circuiting logic without the changes from grandiso.

@j6k4m8
Copy link
Member

j6k4m8 commented Feb 13, 2024

Ok I've thought about it a lot and I think the "right" answer is to NOT bump the version dep on grandiso in this library, since it's a nonbreaking and backwards compatible change and I'd rather avoid the possibility of adding incompatibilities into existing projects; but I will merge this presently and anyone who installs from now on should get the latest of both, and enjoy the huge performance bump from your work!!

Thank you again! Excited to play around with this some more :)

@j6k4m8 j6k4m8 merged commit 0fab013 into aplbrain:master Feb 13, 2024
6 checks passed
@davidmezzetti
Copy link
Contributor Author

Sounds like a good approach to me!

@davidmezzetti davidmezzetti deleted the match-streaming branch February 13, 2024 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants