Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) by LRriver · Pull Request #303 · apache/hugegraph-ai

LRriver · 2025-09-30T14:18:17Z

LLM-based Gremlin QA Synthesis and Generalization in Vertical Scenarios.

🏗️ Project Structure

Vertical_Text2Gremlin/
├── README.md
├── __pycache__/
├── data/
├── db_data/
├── graph2gremlin.py
├── gremlin_checker.py
├── gremlin_qa_dataset.csv
├── instruct_convert.py
├── llm_handler.py
└── qa_generalize.py

./graph2gremlin.py: Generates the initial Gremlin data from templates and graph data (the templates guarantee correctness), then translates and preliminarily generalizes the Gremlin queries and questions.
./gremlin_checker.py: Performs syntax checking using Antlr4.
./llm_handler.py: The LLM interaction module. For each batch of seed data it feeds the QA pairs to the LLM (queries already undergo a small round of generalization during seed data generation) so that the LLM learns how Text2Gremlin pairs are written. It first generalizes the Gremlin, then translates and generalizes the question.
./qa_generalize.py: Calls gremlin_checker and llm_handler for seed data generalization.
./instruct_convert.py: Handles instruction format conversion and the division of training and test sets.
./db_data: Contains schema and graph data.
./data/seed_data: Seed data (to be uploaded).
./data/vertical_training_sets: Vertical scenario generalization data (to be uploaded).
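
The final step, instruction format conversion and train/test division, could look roughly like the sketch below. The record fields (`instruction`, `input`, `output`), the prompt text, the 10% test ratio, and both function names are assumptions for illustration and are not taken from instruct_convert.py.

```python
import random

# Hypothetical prompt text; the real instruction wording is not given in the PR.
DEFAULT_INSTRUCTION = "Translate the question into a Gremlin query."


def to_instruction_records(qa_pairs, instruction=DEFAULT_INSTRUCTION):
    """Turn (question, gremlin) pairs into instruction-style records."""
    return [
        {"instruction": instruction, "input": question, "output": gremlin}
        for question, gremlin in qa_pairs
    ]


def split_train_test(records, test_ratio=0.1, seed=42):
    """Shuffle deterministically, then cut off a test slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


qa_pairs = [
    (f"Who does person {i} know?",
     f"g.V().has('person', 'id', {i}).out('knows')")
    for i in range(20)
]
records = to_instruction_records(qa_pairs)
train, test = split_train_test(records)
print(len(train), len(test))
```

Fixing the random seed keeps the split reproducible across runs, which matters when the same test set is reused to compare fine-tuned models.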

Gremlin Corpus Generation System Based on Recursive Backtracking in General Scenarios.

📋 Project Overview
This PR adds a complete Text-to-Gremlin corpus generation system built on recipe-guided generation with recursive backtracking. It automatically generates large-scale, diverse training data from Gremlin query templates.

🏗️ Project Structure

AST_Text2Gremlin/                   # Project root directory
├── base/                           # Core system directory
│   ├── generator.py                # Main generator entry point
│   ├── GremlinTransVisitor.py      # ANTLR syntax tree visitor
│   ├── TraversalGenerator.py       # Recursive backtracking generator
│   ├── Schema.py                   # Graph database Schema management
│   ├── GremlinBase.py              # Base component library
│   ├── Config.py                   # Configuration management
│   ├── cypher2gremlin_dataset.csv  # Dataset of 3514 real queries
│   └── test/                       # Test suite
├── config.json                     # Global configuration file
├── db_data/                        # Schema and data files
└── README.md                       # Detailed technical documentation

🎯 Core Features

Recipe-Guided Generation
- Parse Gremlin queries into Recipes using ANTLR
- Perform intelligent parameter generalization based on Schema
- Generate large numbers of valid variants through recursive backtracking
Large-Scale Data Processing
- Support batch loading of query templates from CSV files
- Process the 3514 real queries in the cypher2gremlin dataset
- Global deduplication to ensure corpus quality
Complete Error Handling
- Support complex query types (g.call(), .with(), etc.)
- Individual failures don't affect overall processing
- Detailed statistics and error reporting
Intelligent Constraint Mechanism
- Schema connectivity validation
- Syntax validity checking
- Combinatorial explosion control (320k → 7k valid combinations)
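
To illustrate how recursive backtracking and schema connectivity validation work together, here is a minimal sketch. It is not the code in TraversalGenerator.py: the toy schema, the `generate_variants` function, and the restriction to `hasLabel` plus `out` steps are all assumptions made for this example.

```python
# Toy schema (assumed): edge label -> (source vertex label, target vertex label)
SCHEMA_EDGES = {
    "knows": ("person", "person"),
    "created": ("person", "software"),
}
VERTEX_LABELS = ["person", "software"]


def generate_variants(num_hops, max_results=1000):
    """Enumerate g.V().hasLabel(x).out(e1)...out(en) variants valid under the schema."""
    results = []

    def backtrack(start, current, edges):
        if len(results) >= max_results:
            return  # hard cap against combinatorial explosion
        if len(edges) == num_hops:
            hops = "".join(f".out('{e}')" for e in edges)
            results.append(f"g.V().hasLabel('{start}'){hops}")
            return
        for edge, (src, dst) in SCHEMA_EDGES.items():
            if src != current:
                continue  # prune: this edge cannot leave the current vertex label
            edges.append(edge)            # choose
            backtrack(start, dst, edges)  # explore
            edges.pop()                   # un-choose (backtrack)

    for label in VERTEX_LABELS:
        backtrack(label, label, [])
    return results


for query in generate_variants(num_hops=2):
    print(query)
```

With this toy schema, the two-hop template has 8 unconstrained combinations (2 start labels × 2 edges × 2 edges), of which only 2 survive the connectivity check. Pruning at each step, rather than filtering complete queries afterwards, avoids ever building the invalid branches.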

📊 System Capabilities

Query type support: V/E traversals, graph algorithm calls, complex filtering, etc.
Generation scale: A single complex template can generate 6000+ valid variants
Processing efficiency: Batch processing of 3514 templates with robust error handling
Output quality: JSON format with query-description pairs and detailed metadata
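
The global deduplication and JSON output described above might be sketched as follows. Only the `total_unique_queries` key appears in this PR (in the usage snippet); the `corpus`, `query`, and `description` keys, the whitespace normalization, and the `build_corpus` function are assumptions for illustration.

```python
import json


def build_corpus(pairs):
    """Deduplicate (query, description) pairs globally and wrap them with metadata."""
    seen = set()
    corpus = []
    for query, description in pairs:
        key = " ".join(query.split())  # collapse whitespace runs before comparing
        if key in seen:
            continue  # the same query was already produced by another template
        seen.add(key)
        corpus.append({"query": key, "description": description})
    return {"total_unique_queries": len(corpus), "corpus": corpus}


pairs = [
    ("g.V().hasLabel('person')", "Find all person vertices"),
    ("g.V().hasLabel('person')", "List every person"),
    ("g.V().out('knows')", "Find vertices reached via knows edges"),
]
result = build_corpus(pairs)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Keeping a single `seen` set across all templates is what makes the deduplication global: two different templates that expand to the same query contribute only one entry.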

🧪 Technical Features

Recursive backtracking algorithm: Systematically explore parameter combination space
Recipe abstraction: Structure queries into generalizable Recipes
Constraint optimization: 97%+ of candidate combinations filtered out as invalid (320k → 7k)
Modular design: Core components can be used and tested independently
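
One way to picture the Recipe abstraction is as an ordered list of steps whose arguments can be swapped independently. The sketch below is a simplified stand-in, not the classes added in this PR: the `Step` and `Recipe` definitions, `render`, and `replace_arg` are hypothetical, and the Recipe is built by hand rather than parsed with ANTLR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str          # e.g. "V", "hasLabel", "out"
    args: tuple = ()   # literal arguments of the step


@dataclass(frozen=True)
class Recipe:
    steps: tuple

    def render(self):
        """Serialize the Recipe back into a Gremlin query string."""
        parts = ["g"]
        for step in self.steps:
            rendered = ", ".join(
                f"'{a}'" if isinstance(a, str) else str(a) for a in step.args
            )
            parts.append(f"{step.name}({rendered})")
        return ".".join(parts)

    def replace_arg(self, step_index, arg_index, value):
        """Return a new Recipe with one argument generalized to another value."""
        steps = list(self.steps)
        old = steps[step_index]
        args = list(old.args)
        args[arg_index] = value
        steps[step_index] = Step(old.name, tuple(args))
        return Recipe(tuple(steps))


recipe = Recipe((Step("V"), Step("hasLabel", ("person",)), Step("out", ("knows",))))
print(recipe.render())
print(recipe.replace_arg(1, 0, "software").render())
```

Because the dataclasses are frozen, every generalization yields a new Recipe and the original template is never mutated, which keeps backtracking over argument choices straightforward.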

📈 Application Value

Text-to-Gremlin training: Provide large-scale training data for NLP models
Query diversity: Generate rich query variants from limited templates
Data quality: Ensure syntactic correctness and semantic reasonableness of generated queries
Extensibility: Support extension of new schemas and query types

🔧 Usage

# Basic usage
from generator import generate_corpus_from_templates

templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
result = generate_corpus_from_templates(templates)
print(f"Generated {result['total_unique_queries']} unique queries")

📋 Documentation

README.md: Quick start guide

imbajin

Merge this PR now, enhance it later & could merge into master branch

LRriver added 16 commits September 30, 2025 20:52

feat: add configuration management module with dictionary paths and generation parameters

52f2e01

feat: add Gremlin parsing base classes with Step, Traversal core data structures

fadaaf7

feat: add Gremlin expression processing module with predicates and connectors support

b775d29

feat: add graph database schema management with vertex/edge labels and properties

f0588a1

feat: add Gremlin base component library with synonym replacement and data instances

5f3b039

feat: add ANTLR syntax tree visitor with Gremlin query to Recipe parsing and call/with support

822272f

feat: add recursive backtracking traversal generator for diverse query variants from Recipe

441b32c

feat: add main corpus generator with batch processing, global deduplication and error handling

2de2096

config: add global configuration file with generation parameters and path settings

c92f09a

data: add cypher2gremlin dataset with 3514 real query templates

25ca990

docs: add project README with quick start guide and usage instructions

25a2876

feat: add ANTLR-generated Gremlin grammar package with lexer, parser and visitor classes

541aa20

data: add schema and graph data

eb7eb01

feat: add template directory with schema dictionary and synonym files

f0579e8

test: add gremlin statement generalization generation test module

9c13457

test: add generator unit tests for corpus generation validation

b14ffb3

dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files) and enhancement (New feature or request) labels Sep 30, 2025

LRriver added 11 commits September 30, 2025 22:31

Add graph2gremlin.py: Initial template-based Gremlin data generation with correctness guarantee and preliminary question generalization

7cd8427

Add gremlin_checker.py: Syntax checking using Antlr4

4da021c

Add llm_handler.py: LLM interaction model for query generalization and translation

bc10fe2

Add qa_generalize.py: Seed data generalization using gremlin_checker and llm_handler

6ea48d5

Add instruct_convert.py: Instruction format conversion and train/test set division

78f8c2a

Add da_data: Schema and graph data

b7f3f4a

Add data/seed_data: Seed data directory

332b879

Add data/vertical_training_sets: Vertical domain scenario generalized data directory

8a94bad

Add books on Gremlin syntax knowledge to process data.

676d28c

Add a dataset of Gremlin QA pairs synthesized based on LLM.

90f346f

Add README.md

4120356

LRriver changed the title ~~Gremlin Corpus Generation System Based on Recursive Backtracking~~ Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) Sep 30, 2025

LRriver added 3 commits October 5, 2025 23:00

Compatible with OpenAI format

67b523a

Increase Gremlin syntax vocabulary that supports generalization, and add data control policies.

bccc147

modify README.md

44592b4

imbajin changed the base branch from main to text2gql October 30, 2025 11:37

LRriver added 2 commits October 30, 2025 19:53

Add Apache-2.0 license, fix review comments

a1d614c

Modify the .licenserc.yaml file to ignore license checks for .interp, .tokens, and .csv files.

471e141

imbajin approved these changes Oct 30, 2025

dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 30, 2025

imbajin merged commit f84276b into apache:text2gql Oct 30, 2025
10 checks passed

imbajin merged 32 commits into apache:text2gql from LRriver:text2gremlin

2 participants
