A research implementation based on mgBench for evaluating graph modeling strategies
This repository contains the experimental setup for benchmarking different graph modeling approaches in Neo4j. The work extends the mgBench framework to evaluate the performance impact of graph modeling transformations.
- Four Graph Variants: Property-To-Node (V1, research_uplift_opt.py), Path-To-Edge (V2, research_transitive_opt.py), Edge-To-Node (V3, research_intermediate_opt.py), Aggregation-Materialization (V4, research_denormalized_opt.py)
- Seven Analytical Queries: Complex traversals, aggregations, and pattern matching
- Two Dataset Sizes: Large (280K nodes) and Small (28K nodes)
- Comprehensive Metrics: Throughput, latency, CPU, memory, and storage
- File: research_uplift_opt.py
- Description: Elevates frequently queried properties to first-class nodes
- Transformations: Country properties converted to Country nodes with FROM_COUNTRY and LOCATED_IN relationships
- Impact: +0.004% nodes, +15% edges for explicit property representation
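The Property-To-Node idea can be illustrated with a small in-memory sketch (the actual research_uplift_opt.py operates on Neo4j; the data layout and the label check here are illustrative assumptions, only the FROM_COUNTRY and LOCATED_IN relationship names come from the description above):

```python
def uplift_country_property(nodes, edges):
    """Promote each node's plain 'country' property to a shared Country node.

    nodes: dict of node_id -> property dict (mutated in place)
    edges: list of (source_id, relationship_type, target_id) (appended to)
    Returns the set of Country node ids that now exist.
    """
    country_ids = {}
    for node_id, props in list(nodes.items()):
        country = props.pop("country", None)  # drop the inline property
        if country is None:
            continue
        # One shared Country node per distinct country value.
        cid = country_ids.setdefault(country, f"country:{country}")
        nodes.setdefault(cid, {"label": "Country", "name": country})
        # Hypothetical rule: people point via FROM_COUNTRY, other entities
        # (e.g. studios) via LOCATED_IN, matching the README's two edge types.
        rel = "FROM_COUNTRY" if props.get("label") == "Person" else "LOCATED_IN"
        edges.append((node_id, rel, cid))
    return set(country_ids.values())
```

Because many entities share a few country values, this adds very few nodes but one new edge per entity, which is consistent with the "+0.004% nodes, +15% edges" figure above.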
- File: research_transitive_opt.py
- Description: Materializes transitive paths as direct edges
- Transformations: COLLABORATED_WITH edges between Person nodes with shared movie collaborations
- Impact: Preserves node count, adds derived edges for path optimization
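A minimal sketch of this Path-To-Edge transformation on plain Python data, assuming ACTED_IN pairs as input (the real script derives the same COLLABORATED_WITH edges inside Neo4j):

```python
from collections import defaultdict
from itertools import combinations

def materialize_collaborations(acted_in):
    """acted_in: iterable of (person_id, movie_id) pairs.

    Returns an undirected set of COLLABORATED_WITH edges, one per pair of
    people who appear in at least one common movie. Node count is unchanged;
    only derived edges are added, as described above.
    """
    cast = defaultdict(set)
    for person, movie in acted_in:
        cast[movie].add(person)
    collabs = set()
    for people in cast.values():
        # Sort so each undirected pair is emitted in a canonical order.
        for a, b in combinations(sorted(people), 2):
            collabs.add((a, "COLLABORATED_WITH", b))
    return collabs
```

The benefit at query time is that a two-hop pattern (Person-Movie-Person) collapses into a single-hop edge lookup.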
- File: research_intermediate_opt.py
- Description: Converts relationships to intermediate role nodes
- Transformations: Role-specific nodes (ActorRole, DirectorRole, etc.) with two-hop Person-Role-Movie patterns
- Impact: Significant node increase for fine-grained relationship modeling
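The Edge-To-Node reification can be sketched as follows; the ActorRole label and the two-hop Person-Role-Movie shape come from the description above, while the HAS_ROLE and IN_MOVIE relationship names are illustrative assumptions:

```python
def reify_roles(edges):
    """edges: list of (person_id, rel_type, movie_id, props) tuples.

    Replaces each ACTED_IN edge with an intermediate ActorRole node and a
    two-hop Person -> ActorRole -> Movie pattern; other edges pass through.
    Returns (role_nodes, new_edges).
    """
    role_nodes = {}
    new_edges = []
    for i, (person, rel, movie, props) in enumerate(edges):
        if rel != "ACTED_IN":
            new_edges.append((person, rel, movie))
            continue
        role_id = f"role:{i}"
        # Former edge properties (e.g. character, salary) move onto the node,
        # which is what makes the modeling fine-grained.
        role_nodes[role_id] = {"label": "ActorRole", **props}
        new_edges.append((person, "HAS_ROLE", role_id))
        new_edges.append((role_id, "IN_MOVIE", movie))
    return role_nodes, new_edges
```

Every reified edge adds one node and one extra hop, which explains the significant node increase noted above.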
- File: research_denormalized_opt.py
- Description: Pre-computes analytical properties for query optimization
- Transformations: Materialized aggregates on Person, Genre, Language, and Studio nodes
- Impact: Enhanced node properties with pre-computed metrics
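A sketch of the Aggregation-Materialization idea for Person nodes; the specific property names (movie_count, avg_rating) are assumptions for illustration, not necessarily what the script stores:

```python
from collections import defaultdict

def materialize_person_aggregates(persons, acted_in, movie_ratings):
    """Pre-compute per-person aggregates and store them as node properties.

    persons: dict person_id -> property dict (mutated in place)
    acted_in: iterable of (person_id, movie_id) pairs
    movie_ratings: dict movie_id -> rating
    """
    movies_of = defaultdict(list)
    for person, movie in acted_in:
        movies_of[person].append(movie)
    for person, props in persons.items():
        movies = movies_of.get(person, [])
        ratings = [movie_ratings[m] for m in movies if m in movie_ratings]
        # Analytical queries now read these properties directly instead of
        # traversing to every Movie node at query time.
        props["movie_count"] = len(movies)
        props["avg_rating"] = sum(ratings) / len(ratings) if ratings else None
    return persons
```

The trade-off is classic denormalization: faster reads at the cost of keeping the materialized values consistent when the underlying graph changes.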
- Core Nodes: Person, Movie, Studio, Genre, Language, Award
- Relationship Types: ACTED_IN, DIRECTED, PRODUCED, WROTE, COMPOSED_FOR, HAS_GENRE, IN_LANGUAGE, WON
- Rich Properties: Salaries, ratings, budgets, temporal data, and role-specific attributes
Seven analytical queries designed to stress different database capabilities:
- Multi-dimensional aggregations and filtering
- Complex graph traversals and pattern matching
- Relationship property analysis
- Cross-entity analytics and collaboration detection
- Execution Model: 60-second time windows with 4 concurrent workers
- Cache Conditions: Hot runs preceded by warm-up queries to simulate production cache behavior
- Measurement: Continuous resource monitoring via Docker container metrics
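The execution model above can be sketched as a fixed time window with concurrent workers; run_query is a stand-in for a real Bolt-client call, while the worker count and window length mirror the settings listed:

```python
import threading
import time

def timed_window_benchmark(run_query, workers=4, window_s=60.0):
    """Run `run_query` from `workers` threads until the window closes.

    Each worker issues queries back-to-back; the total completed count over
    the window yields throughput in queries per second.
    """
    deadline = time.monotonic() + window_s
    counts = [0] * workers

    def worker(i):
        while time.monotonic() < deadline:
            run_query()
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = sum(counts)
    return {"queries": total, "throughput_qps": total / window_s}
```

Resource monitoring (CPU, memory) would run alongside this loop, e.g. by sampling Docker container statistics, as noted above.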
- Throughput: Queries per second under concurrent load
- Latency: Response time percentiles (P50, P95, P99)
- Resource Usage: CPU utilization and peak memory consumption
- Storage Efficiency: Disk footprint including indexes and graph structure
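The latency percentiles listed above (P50, P95, P99) can be computed from the per-query samples with a simple nearest-rank method:

```python
def latency_percentiles(samples_ms, percentiles=(50, 95, 99)):
    """Nearest-rank percentiles over a list of per-query latencies (ms)."""
    if not samples_ms:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples_ms)
    n = len(ordered)
    out = {}
    for p in percentiles:
        # Nearest-rank: rank = ceil(p/100 * n), 1-indexed into sorted samples.
        rank = max(1, -(-p * n // 100))
        out[f"p{p}"] = ordered[rank - 1]
    return out
```

Nearest-rank always returns an observed sample (no interpolation), which keeps tail percentiles honest for skewed latency distributions.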
This implementation builds upon the mgBench framework, leveraging its:
- Concurrent workload execution engine
- Bolt protocol client for Neo4j communication
- Dataset management and indexing automation
- Result collection and aggregation pipelines
For details on the original benchmarking tool, see the mgBench documentation and methodology.