feat: add GDS extension with graph algorithms (WCC, BFS, PageRank, LCC, K-Core, Label Propagation, Louvain, Leiden) #560
Committed-by: Xiaoli Zhou from Dev container
Made-with: Cursor Committed-by: Xiaoli Zhou from Dev container
…in details Made-with: Cursor Committed-by: Xiaoli Zhou from Dev container
Add comprehensive documentation for the GDS extension covering all 7+1 algorithms (PageRank, BFS, SSSP, WCC, LCC, K-Core, Label Propagation, and Personalized PageRank). Fix most-vexing-parse build error in insert_transaction.cc and add missing protobuf link dependency for the GDS extension. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 7cfca7c.
Add comprehensive documentation for the GDS extension covering all 7 registered algorithms plus Personalized PageRank (not yet registered). Update extensions index with a single GDS entry linking to the detail page. Fix missing protobuf link dependency in extension/gds/CMakeLists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pdate API to new DataTypeId/DataChunk pattern Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…project_graph API
Migrate leiden and louvain community detection algorithms from standalone extensions (extension/leiden/, extension/louvain/) into the unified GDS extension, using the same project_graph view + StorageReadInterface CSR pattern as all other GDS algorithms.
Key changes:
- Add community/ subdirectory to GDS extension with Leiden and Louvain algorithm implementations that operate directly on StorageReadInterface CSR views without internal graph copies
- Add leiden.h/louvain.h function structs and glue files following the standard GDS bind/exec/getFunctionSet interface
- Register LeidenFunction and LouvainFunction in gds_algo_extension.cc
- Delete standalone extension/leiden/ and extension/louvain/ directories
Bug fixes:
- Fix GDSAlgoOprBuilder::Build not registering output column aliases in ContextMeta, causing "unordered_map::at: key not found" on any GDS algorithm with YIELD/RETURN
- Fix louvain_algorithm.cc degree computation using wrong iterator end (oes.end() instead of ies.end()) for incoming edges
- Guard bthread_setconcurrency behind BUILD_HTTP_SERVER ifdef
- Fix project_graph_function.cpp to use new DataChunk/append_chunk API
- Fix gds_algo_function.cpp API name changes (GetNumFields, ToString)
Benchmarking:
- Add GDS benchmark scripts for datagen-8_0-fb dataset (107M edges)
- Add NeuG vs NetworkX competitor comparison script
- Add Leiden/Louvain test cases to test_gds.py
Benchmark results (datagen-8_0-fb, 1.7M vertices, 107M edges):
- WCC: 0.54s algo (56x vs NetworkX)
- BFS: 0.05s algo (645x vs NetworkX)
- PageRank: 0.35s algo (600x vs NetworkX)
- CDLP: 30.5s algo
- Leiden/Louvain: functional but slow on 100M+ edge graphs (needs perf work)
…eneration counter
Replace per-vertex std::unordered_map allocations in the hot path of Louvain one_level() and Leiden local_moving_phase()/refine() with pre-allocated flat arrays indexed by community ID plus a generation counter to avoid clearing.
Key changes:
- Add comm_weight_[] and gen_[] scratch arrays to both Louvain and Leiden classes, allocated once in the constructor (size = max_vid + 1)
- Use generation counter pattern: gen_[com] != current_gen means the slot is stale and needs reinitialization, avoiding O(n) memset per vertex
- In Leiden refine(), replace unordered_map<vid_t, uint32_t> sub_com with a flat sub_com_flat_[] array indexed by vid_t
- Replace unordered_map for community grouping in refine() with sorted pair iteration
- Replace unordered_map for sc_to_new mapping with small fixed-size array
Performance (graph500-23, 4.6M vertices, 129M edges):
- Louvain: 73.4s algo (previously >600s timeout)
- Leiden: 265.4s algo (previously >600s timeout)
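The generation counter pattern described above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name `CommunityScratch` and its methods are hypothetical, and only the idea (a flat weight array plus a per-slot generation stamp so that no per-vertex clearing is needed) comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): accumulates edge weight per
// community for one vertex at a time, and "clears" itself in O(1) by
// bumping a generation counter instead of zeroing the whole array.
class CommunityScratch {
 public:
  explicit CommunityScratch(std::size_t num_communities)
      : weight_(num_communities, 0.0), gen_(num_communities, 0), cur_gen_(1) {}

  // Begin a new vertex: every slot becomes stale at once.
  void next_vertex() {
    ++cur_gen_;
    touched_.clear();
  }

  // Add weight w toward community com, lazily resetting a stale slot.
  void add(uint32_t com, double w) {
    assert(com < weight_.size());
    if (gen_[com] != cur_gen_) {  // slot still holds data from an older vertex
      gen_[com] = cur_gen_;
      weight_[com] = 0.0;
      touched_.push_back(com);
    }
    weight_[com] += w;
  }

  // Weight accumulated for com during the current vertex (0 if untouched).
  double get(uint32_t com) const {
    return gen_[com] == cur_gen_ ? weight_[com] : 0.0;
  }

  // Communities touched by the current vertex, in first-touch order.
  const std::vector<uint32_t>& touched() const { return touched_; }

 private:
  std::vector<double> weight_;
  std::vector<uint64_t> gen_;
  uint64_t cur_gen_;
  std::vector<uint32_t> touched_;
};
```

A caller would invoke `next_vertex()` once per vertex, call `add()` for each neighbor, then iterate over `touched()` to evaluate candidate communities. The cost per vertex is proportional to its degree rather than to the number of communities.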
Replace options.find() with manual iteration in get_option_value() to work around non-deterministic behavior caused by protobuf static library duplication between libneug.dylib and libgds.neug_extension. The two copies of protobuf use different hash table states, making find() fail intermittently while iteration works reliably. Also fix source_vertex_utils to use Value::CreateValue() for VARCHAR primary keys, ensuring the Value owns the string data rather than holding a dangling string_view. Update BFS/SSSP documentation to clarify that source accepts STRING or INT matching the primary key type of the vertex label.
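For readers unfamiliar with the dangling string_view problem mentioned above, here is a minimal sketch of the ownership difference. `OwningKey` and `make_key` are hypothetical stand-ins for illustration only; they are not NeuG's `Value` class or its `CreateValue()` function.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a value wrapper; NOT NeuG's Value class.
// It stores a std::string, so the bytes live as long as the key does.
struct OwningKey {
  std::string data;
  std::string_view view() const { return data; }
};

// Builds a key from a temporary string. Moving the std::string into the
// returned object keeps the bytes alive after `tmp` goes out of scope.
// Returning a std::string_view pointing at `tmp` instead would dangle.
inline OwningKey make_key(const std::string& prefix, int id) {
  std::string tmp = prefix + std::to_string(id);  // temporary buffer
  return OwningKey{std::move(tmp)};
}
```

The view obtained from `OwningKey::view()` is valid for as long as the `OwningKey` object itself is alive, which is the property the fix above relies on.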
- Moved leiden and louvain from extension/gds/include|src/community/ to impl/ to match the naming convention of other algorithms (bfs_impl, page_rank_impl, etc.)
- Updated source parameter documentation to clarify it accepts the primary key value as a string (the actual type is determined by the vertex label's PK)
- Updated include paths in leiden.cc and louvain.cc
for (uint32_t com : my_touched) {
  double w_com = my_cw[com];
  double gain = (w_com - w_self) / m_ +
This modularity gain formula looks incorrect for Louvain. The usual move evaluation removes u from its current community first, then evaluates the gain of inserting it into each target community using the target community total degree. Here the expression uses (stot_[cur_com] - stot_[com]) * deg_u without temporarily removing deg_u from the current community, and it mixes w_com - w_self with totals that appear to be on a different counting scale. This can choose the wrong community even if the rest of the local-moving loop is sound.
Pass a null OprTimer into the pipeline and drop the unconditional timer_ptr->output() call so normal query execution no longer prints the per-operator "<Opr> elapsed: <t> s, <n> tuples" lines to stdout. The pipeline and operators already null-check the timer, so timing is simply skipped when it is null.
PageRank accepted a vertex predicate and CDLP accepted an edge predicate, but both silently ignored them and computed over the unfiltered graph, yielding wrong results without any error. Reject these predicates at bind time so callers get a clear error instead of a silently incorrect result, and drop the now-dead predicate plumbing (the unused constructor parameters and members). CDLP still supports the vertex predicate it actually applies. Add regression tests asserting PageRank rejects a vertex predicate and CDLP rejects an edge predicate; update test_run_cdlp to no longer pass an edge predicate.
BFS, WCC and SSSP previously rejected vertex and edge predicates, and CDLP rejected edge predicates. Add separate predicate-aware variants (BFSPred, WCCPred, SSSPPred, CDLPPred) that run on the subgraph defined by the predicates: vertices failing the vertex predicate are dropped from the result and cannot be traversed, and only edges satisfying the edge predicate are followed (evaluated per edge via the raw edge data pointer, as EdgeExpand does). The dispatchers route to the predicate-aware variant only when a predicate is present, leaving the optimized plain algorithms untouched on the common path. Since performance is not a concern when filtering, the variants are simple sequential implementations (level-sync BFS, Dijkstra, flood-fill WCC, synchronous label propagation) that match the plain algorithms when the predicate accepts everything. Add tests covering edge-predicate filtering (excluding all edges isolates every vertex) and vertex-predicate restriction of the output set.
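A minimal sketch of what a predicate-aware, level-synchronous BFS could look like, under simplifying assumptions: the `Edge` struct, the adjacency-list graph, and the `std::function` predicate types are hypothetical and stand in for the extension's CSR views and compiled predicates.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical edge type; the real extension reads edges from CSR views.
struct Edge {
  uint32_t dst;
  double weight;
};

using VertexPred = std::function<bool(uint32_t)>;
using EdgePred = std::function<bool(uint32_t, const Edge&)>;

// Returns the hop distance from src for each vertex, or -1 if the vertex is
// unreachable in the filtered subgraph or rejected by the vertex predicate.
std::vector<int64_t> bfs_pred(const std::vector<std::vector<Edge>>& adj,
                              uint32_t src, const VertexPred& keep_vertex,
                              const EdgePred& keep_edge) {
  std::vector<int64_t> dist(adj.size(), -1);
  if (src >= adj.size() || !keep_vertex(src)) return dist;
  std::vector<uint32_t> frontier{src}, next;
  dist[src] = 0;
  for (int64_t level = 1; !frontier.empty(); ++level) {
    next.clear();
    for (uint32_t u : frontier) {
      for (const Edge& e : adj[u]) {
        if (!keep_edge(u, e)) continue;     // edge filtered out
        if (!keep_vertex(e.dst)) continue;  // vertex cannot be traversed
        if (dist[e.dst] != -1) continue;    // already discovered
        dist[e.dst] = level;
        next.push_back(e.dst);
      }
    }
    frontier.swap(next);
  }
  return dist;
}
```

With predicates that accept everything, this reduces to an ordinary BFS, which mirrors the property stated above that the predicate variants match the plain algorithms in that case.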
Extend predicate support to the remaining graph algorithms. KCore, LCC and PageRank previously rejected vertex and edge predicates; add separate predicate-aware variants (KCorePred, LCCPred, PageRankPred) that run on the subgraph defined by the predicates, and route to them only when a predicate is present so the optimized plain algorithms are untouched on the common path. PageRank therefore no longer rejects predicates. As with the other predicate variants, these are simple sequential implementations (degree peeling for KCore, direct neighborhood evaluation for LCC, power iteration for PageRank) that match the plain algorithms when the predicate accepts everything; LCCPred mirrors the plain undirected denominator (raw incident-edge degree). Replace the PageRank predicate-rejection test with one asserting the vertex predicate restricts the output, and add KCore/LCC edge-predicate tests.
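For reference, degree peeling for K-Core can be sketched as below. This is an illustrative sequential version, not the PR's KCorePred: it assumes the predicate-filtered subgraph has already been materialized as a symmetric adjacency list of a simple undirected graph (no self-loops or duplicate edges), and it uses a linear scan to pick the next vertex for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Core number of every vertex via min-degree peeling.
// Assumption: adj is symmetric (each undirected edge appears in both lists).
std::vector<uint32_t> core_numbers(
    const std::vector<std::vector<uint32_t>>& adj) {
  const std::size_t n = adj.size();
  std::vector<uint32_t> deg(n), core(n, 0);
  std::vector<bool> removed(n, false);
  for (std::size_t v = 0; v < n; ++v) {
    deg[v] = static_cast<uint32_t>(adj[v].size());
  }
  uint32_t k = 0;  // largest minimum degree seen so far
  for (std::size_t done = 0; done < n; ++done) {
    // Pick the remaining vertex with the smallest current degree.
    std::size_t best = n;
    for (std::size_t v = 0; v < n; ++v) {
      if (!removed[v] && (best == n || deg[v] < deg[best])) best = v;
    }
    k = std::max(k, deg[best]);
    core[best] = k;
    removed[best] = true;
    // Peeling best lowers the degree of its remaining neighbors.
    for (uint32_t w : adj[best]) {
      if (!removed[w] && deg[w] > 0) --deg[w];
    }
  }
  return core;
}
```

The O(n) scan per step makes this quadratic in the number of vertices; a bucket queue would bring it to linear time, but the simple form is enough to show the peeling order.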
Move all predicate handling (vertex and edge) into CDLPPred so the plain CDLP runs unconditionally over the whole projected graph, matching the other plain algorithms. The dispatcher now routes to CDLPPred whenever any predicate is present. No behavior change for callers: a vertex predicate still works, now via CDLPPred.
Update load_gds.md to reflect that node and edge predicates are now supported by PageRank, BFS, SSSP, WCC, LCC, K-Core and CDLP (only Louvain and Leiden still reject them), and note that the predicate path uses a simpler single-threaded implementation.
Fixes for issues identified in Copilot PR review:
1. struct_pack_function.cpp: Add missing <unordered_set> include
2. gds_algo_function.cpp: Use type-specific value extraction for options instead of toString() to avoid quote issues with string literals
3. project_graph_function.cpp: Enforce exactly 3 elements in edge triplets (was < 3, now != 3) to reject malformed input
4. cdlp.cc: Fix error message to match validation logic (check size() != 1 instead of empty() for vertex/edge label requirements)
5. test_gds.py: Update test_run_cdlp to use homogeneous graph (person_knows) instead of heterogeneous graph, matching the new validation
Note: Issues #6 (metadata inconsistency) and #10 (StandaloneCallRewriter removal) are architectural decisions that require broader discussion and are not addressed in this commit.
Response to Copilot Review
Thank you for the thorough review. We've addressed the following issues in commit
Fixed Issues
Issue #1: BFS dense pull mode cascading discovery bug
Issue #2: PageRank vertex_predicate ignored
Issue #3: PageRank unreachable condition
Issue #4: struct_pack_function.cpp missing include
Issue #5: Options stringified with quotes
Issue #6: Query timing always to stdout
Issue #7: project_graph_function.cpp metadata inconsistency
Issue #8: Triplet parsing accepts >3 elements
Issue #9: cdlp.cc error message vs code mismatch
Issue #10: client_context.cpp StandaloneCallRewriter removal
Test Results
All 36 tests pass after the fixes:
Additional Changes
Thanks again for the detailed review!
- Use num_threads_ consistently instead of concurrency_ for local buffer sizing in compute() to prevent out-of-bounds when concurrency_ is 0 or negative (num_threads_ is already normalized in constructor)
- Fix convergence check: compare modularity delta against threshold_ directly instead of threshold_ * m_ to avoid scale-dependent tolerance
- Fix modularity gain formula: properly account for removing vertex from current community before evaluating gain of joining target community
Both Louvain and Leiden implementations updated.
Response to Spockkk0225 review comments
Thanks for the detailed review of the Louvain/Leiden implementation! All three issues have been addressed in commit 1f7b885:
1. concurrency_ vs num_threads_ consistency
Fixed. Now using num_threads_ consistently instead of concurrency_ for local buffer sizing.
2. Convergence check scale dependency
Fixed. Convergence check now compares modularity delta directly against threshold_ instead of threshold_ * m_.
3. Modularity gain formula correctness
Fixed. The gain formula now properly removes the vertex from its current community before evaluating the gain of joining the target community:
double stot_cur_minus_u = stot_[cur_com] - deg_u;
for (uint32_t com : my_touched) {
if (com == cur_com) continue;
double w_com = my_cw[com];
// Gain = benefit of joining com - cost of leaving cur_com
double gain = (w_com - w_self) / m_
- resolution_ * stot_[com] * deg_u / (2.0 * m_ * m_)
+ resolution_ * stot_cur_minus_u * deg_u / (2.0 * m_ * m_);
// ...
}
Both Louvain and Leiden implementations updated. All 36 GDS tests pass.
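One way to sanity-check a move-gain expression of this shape is to compare it against modularity recomputed from scratch. The sketch below does that on an unweighted, undirected edge list. It is not the PR's code: the function names are hypothetical, and it assumes `m` is the total number of edges and `stot` is the sum of degrees per community, which may differ from the scale used for `m_` and `stot_` in the implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using EdgeList = std::vector<std::pair<uint32_t, uint32_t>>;

// Modularity computed from scratch: sum over communities c of
// internal_c / m - resolution * (stot_c / 2m)^2. Assumes no self-loops
// and community ids smaller than com.size().
double modularity(const EdgeList& edges, const std::vector<uint32_t>& com,
                  double resolution) {
  const double m = static_cast<double>(edges.size());
  std::vector<double> internal(com.size(), 0.0), stot(com.size(), 0.0);
  for (const auto& e : edges) {
    stot[com[e.first]] += 1.0;
    stot[com[e.second]] += 1.0;
    if (com[e.first] == com[e.second]) internal[com[e.first]] += 1.0;
  }
  double q = 0.0;
  for (std::size_t c = 0; c < com.size(); ++c) {
    const double frac = stot[c] / (2.0 * m);
    q += internal[c] / m - resolution * frac * frac;
  }
  return q;
}

// Incremental gain of moving u out of its current community into target,
// with u removed from the current community's degree total first.
double move_gain(const EdgeList& edges, const std::vector<uint32_t>& com,
                 uint32_t u, uint32_t target, double resolution) {
  const uint32_t cur = com[u];
  assert(target != cur);
  const double m = static_cast<double>(edges.size());
  std::vector<double> stot(com.size(), 0.0);
  double deg_u = 0.0, w_self = 0.0, w_target = 0.0;
  for (const auto& e : edges) {
    stot[com[e.first]] += 1.0;
    stot[com[e.second]] += 1.0;
    if (e.first != u && e.second != u) continue;
    const uint32_t other = (e.first == u) ? e.second : e.first;
    deg_u += 1.0;
    if (com[other] == cur) w_self += 1.0;       // links kept by staying
    if (com[other] == target) w_target += 1.0;  // links gained by moving
  }
  const double stot_cur_minus_u = stot[cur] - deg_u;
  return (w_target - w_self) / m -
         resolution * deg_u * (stot[target] - stot_cur_minus_u) /
             (2.0 * m * m);
}
```

On two triangles joined by a single edge, moving a bridge endpoint into the other triangle gives a negative gain that matches the difference between the two from-scratch modularity values, so under these assumptions the incremental form and the definition agree.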
@Spockkk0225 The three comments above have been addressed.
Use GAPBS-style Afforest with largest-component skipping for better parallel CC performance on billion-edge graphs. Traverse undirected neighbors as merged ie then oe lists with boundary skip to avoid rescanning edges already handled in the sampling phase.
Increase kNeighborRounds to 4 to match FastSV sampling depth and link the first merged ie/oe neighbors in one pass before a single compress.
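For context, the Afforest idea (link a few sampled neighbors, identify the largest component, then finish only the vertices outside it) can be sketched sequentially as follows. This is a simplified illustration, not the PR's implementation: it assumes a symmetric adjacency list instead of merged ie/oe lists, counts labels exactly instead of sampling, and the `neighbor_rounds` parameter is only a hypothetical analogue of kNeighborRounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Union-find root lookup with path halving.
uint32_t find_root(std::vector<uint32_t>& comp, uint32_t v) {
  while (comp[v] != v) {
    comp[v] = comp[comp[v]];
    v = comp[v];
  }
  return v;
}

// Merge the components of a and b; the smaller root id wins.
void link(std::vector<uint32_t>& comp, uint32_t a, uint32_t b) {
  const uint32_t ra = find_root(comp, a), rb = find_root(comp, b);
  if (ra == rb) return;
  if (ra < rb) comp[rb] = ra; else comp[ra] = rb;
}

// Connected components of an undirected graph given as a symmetric
// adjacency list. Returns, per vertex, the smallest vertex id in its
// component.
std::vector<uint32_t> afforest_cc(
    const std::vector<std::vector<uint32_t>>& adj,
    std::size_t neighbor_rounds = 2) {
  const uint32_t n = static_cast<uint32_t>(adj.size());
  std::vector<uint32_t> comp(n);
  std::iota(comp.begin(), comp.end(), 0u);
  if (n == 0) return comp;
  // Phase 1: link every vertex to its first few neighbors.
  for (std::size_t r = 0; r < neighbor_rounds; ++r) {
    for (uint32_t v = 0; v < n; ++v) {
      if (r < adj[v].size()) link(comp, v, adj[v][r]);
    }
  }
  for (uint32_t v = 0; v < n; ++v) comp[v] = find_root(comp, v);
  // Phase 2: find the most common label (counted exactly here; the
  // original Afforest estimates it from a random sample).
  std::vector<std::size_t> count(n, 0);
  for (uint32_t v = 0; v < n; ++v) ++count[comp[v]];
  const uint32_t largest = static_cast<uint32_t>(
      std::max_element(count.begin(), count.end()) - count.begin());
  // Phase 3: finish the remaining edges, skipping vertices already in the
  // largest component. Their edges are covered from the other endpoint.
  for (uint32_t v = 0; v < n; ++v) {
    if (comp[v] == largest) continue;
    for (std::size_t i = neighbor_rounds; i < adj[v].size(); ++i) {
      link(comp, v, adj[v][i]);
    }
  }
  for (uint32_t v = 0; v < n; ++v) comp[v] = find_root(comp, v);
  return comp;
}
```

The skip in phase 3 is safe only because the adjacency list is symmetric: an edge between a skipped vertex and a vertex outside the largest component is still visited from the outside endpoint, either in phase 1 or in phase 3.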
What
Introduce GDS (Graph Data Science) extension with a comprehensive set of graph algorithms, consolidating previously standalone Louvain and Leiden extensions into a unified framework.
Why
- project_graph + CALL algo('graph_name', {options}) pattern
- extension/louvain/ and extension/leiden/ into extension/gds/ to reduce duplication
Changes
New GDS Extension (extension/gds/)
Unified extension containing 9 graph algorithms:
Traversal & Centrality:
- wcc - Weakly Connected Components
- bfs - Breadth-First Search
- sssp - Single-Source Shortest Path
- page_rank - PageRank centrality
- personalized_page_rank - Personalized PageRank (registered but not fully implemented)
Community Detection:
- louvain - Louvain community detection
- leiden - Leiden community detection (with refine phase)
- label_propagation - Label Propagation community detection
Structural Analysis:
- lcc - Local Clustering Coefficient
- kcore - K-Core decomposition
Consolidated from Standalone Extensions
- extension/louvain/ → extension/gds/ (commits 00ceba2e, a34345db)
- extension/leiden/ → extension/gds/ (commits 00ceba2e, a34345db)
- extension/CMakeLists.txt to build GDS as unified extension
Core Infrastructure
- project_graph() - Create projected subgraphs for algorithm execution
- drop_projected_graph() - Remove projected subgraphs
- std::unordered_map overhead
Bug Fixes
- …d54974a7)
- string_view dangling reference for VARCHAR primary keys (commit d54974a7)
- directed parameter documentation: STRING → BOOL (commit cc2a98f8)
Code Organization
- community/ → impl/ for consistency with other algorithm implementations (commit 0fc99654)
- .qwen/tmp/review-pr-312 (commit a34345db)
- …c9af23ba)
- …3a609ce5)
Performance
Benchmarked on datagen-8_0-fb dataset (107M edges):
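Louvain/Leiden achieve their performance gains through the following optimizations:
- std::unordered_map (commit f8f0f19e)
- m_ computation, stot_[] initialization, and modularity computation (commit 2550def8)
See the PR comments for a detailed performance analysis.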
Testing
Documentation
doc/source/extensions/load_gds.md