Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) #303

Merged
imbajin merged 32 commits into apache:text2gql from LRriver:text2gremlin
Oct 30, 2025

@LRriver LRriver commented Sep 30, 2025

LLM-based Gremlin QA Synthesis and Generalization in Vertical Scenarios.

🏗️ Project Structure

Vertical_Text2Gremlin/
├── README.md
├── __pycache__/
├── data/
├── db_data/
├── graph2gremlin.py
├── gremlin_checker.py
├── gremlin_qa_dataset.csv
├── instruct_convert.py
├── llm_handler.py
└── qa_generalize.py
  • ./graph2gremlin.py: Generates the initial Gremlin data from templates and graph data; the templates guarantee correctness. It then translates and preliminarily generalizes the Gremlin queries and their questions.
  • ./gremlin_checker.py: Checks Gremlin syntax using ANTLR4.
  • ./llm_handler.py: LLM interaction module. It feeds the seed QA data to the LLM in batches (queries already receive a small round of generalization during seed data generation) so that the LLM learns how text2gremlin pairs are written. The LLM generalizes the Gremlin first, then translates and generalizes the query.
  • ./qa_generalize.py: Calls gremlin_checker and llm_handler to generalize the seed data.
  • ./instruct_convert.py: Converts the data to instruction format and splits it into training and test sets.
  • ./db_data: Contains schema and graph data.
  • ./data/seed_data: Seed data (to be uploaded).
  • ./data/vertical_training_sets: Vertical scenario generalization data (to be uploaded).
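
The generalization loop described above (batch the seed pairs, ask the LLM for variants, keep only the variants that pass the syntax check) could look roughly like the sketch below. Every name in it is hypothetical and not taken from the PR: `generalize_batch` stands in for whatever llm_handler exposes, and `is_valid_gremlin` stands in for the ANTLR4-based check in gremlin_checker.

```python
from typing import Callable, Iterable, Iterator

QAPair = tuple[str, str]  # (question, Gremlin query)


def batched(items: list[QAPair], size: int) -> Iterator[list[QAPair]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def generalize_seeds(
    seeds: list[QAPair],
    generalize_batch: Callable[[list[QAPair]], Iterable[QAPair]],  # hypothetical LLM call
    is_valid_gremlin: Callable[[str], bool],  # hypothetical syntax checker
    batch_size: int = 4,
) -> list[QAPair]:
    """Expand seed pairs batch by batch, keeping only new, syntactically valid pairs."""
    seen = set(seeds)
    accepted: list[QAPair] = []
    for batch in batched(seeds, batch_size):
        for question, query in generalize_batch(batch):
            pair = (question, query)
            # Drop exact duplicates and anything the checker rejects.
            if pair in seen or not is_valid_gremlin(query):
                continue
            seen.add(pair)
            accepted.append(pair)
    return accepted
```

Passing the LLM call and the checker in as plain callables keeps the loop testable without a model or a parser; in the actual scripts they may be wired together differently.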

Gremlin Corpus Generation System Based on Recursive Backtracking in General Scenarios.

📋 Project Overview
This PR adds a complete Text-to-Gremlin corpus generation system built on recipe-guided generation with recursive backtracking. It automatically generates large-scale, diverse training data from Gremlin query templates.

🏗️ Project Structure

AST_Text2Gremlin/                   # Project root directory
├── base/                           # Core system directory
│   ├── generator.py                # Main generator entry point
│   ├── GremlinTransVisitor.py      # ANTLR syntax tree visitor
│   ├── TraversalGenerator.py       # Recursive backtracking generator
│   ├── Schema.py                   # Graph database Schema management
│   ├── GremlinBase.py              # Base component library
│   ├── Config.py                   # Configuration management
│   ├── cypher2gremlin_dataset.csv  # Dataset of 3514 real queries
│   └── test/                       # Test suite
├── config.json                     # Global configuration file
├── db_data/                        # Schema and data files
└── README.md                       # Detailed technical documentation

🎯 Core Features

  1. Recipe-Guided Generation

    • Parse Gremlin queries into Recipes using ANTLR
    • Perform intelligent parameter generalization based on Schema
    • Generate large numbers of valid variants through recursive backtracking
  2. Large-Scale Data Processing

    • Support batch loading of query templates from CSV files
    • Process 3514 real cypher2gremlin dataset entries
    • Global deduplication to ensure corpus quality
  3. Complete Error Handling

    • Support complex query types (g.call(), .with(), etc.)
    • Individual failures don't affect overall processing
    • Detailed statistics and error reporting
  4. Intelligent Constraint Mechanism

    • Schema connectivity validation
    • Syntax validity checking
    • Combinatorial explosion control (roughly 320k candidate combinations reduced to about 7k valid ones)
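
To illustrate how recursive backtracking combined with a schema connectivity check prunes the search space, here is a minimal sketch. It is not the code in TraversalGenerator.py: the `Schema` class, the list-of-step-names recipe, and `generate_variants` are all assumptions made for illustration, and only the `hasLabel` and `out` steps are handled.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Schema:
    """Toy schema (hypothetical): edge label -> (source vertex label, target vertex label)."""
    edges: dict[str, tuple[str, str]]

    def vertex_labels(self) -> list[str]:
        return sorted({label for pair in self.edges.values() for label in pair})

    def out_edges(self, vertex_label: str) -> list[tuple[str, str]]:
        """Return (edge label, target label) pairs that leave `vertex_label`."""
        return sorted(
            (edge, dst) for edge, (src, dst) in self.edges.items() if src == vertex_label
        )


def generate_variants(recipe: list[str], schema: Schema, limit: int = 1000) -> list[str]:
    """Fill each recipe step with schema values, backtracking out of dead ends."""
    results: list[str] = []

    def backtrack(index: int, current: Optional[str], parts: list[str]) -> None:
        if len(results) >= limit:
            return
        if index == len(recipe):
            results.append("g.V()" + "".join(parts))
            return
        step = recipe[index]
        if step == "hasLabel":
            # If the label is already implied by earlier steps, only that label is consistent.
            labels = [current] if current is not None else schema.vertex_labels()
            choices = [(f".hasLabel('{label}')", label) for label in labels]
        elif step == "out":
            if current is None:
                edges = sorted((e, dst) for e, (_, dst) in schema.edges.items())
            else:
                # Connectivity constraint: only edges that start at the current label.
                edges = schema.out_edges(current)
            choices = [(f".out('{edge}')", dst) for edge, dst in edges]
        else:
            raise ValueError(f"unsupported step: {step}")
        for text, next_label in choices:
            parts.append(text)
            backtrack(index + 1, next_label, parts)
            parts.pop()  # undo the choice before trying the next one

    backtrack(0, None, [])
    return results


schema = Schema(edges={"knows": ("person", "person"), "created": ("person", "software")})
variants = generate_variants(["hasLabel", "out", "out"], schema)
```

With this toy schema, picking labels independently for the three steps would give 2 × 2 × 2 = 8 candidates, but only the two that respect connectivity (both starting from `person` and following `knows` first) are produced.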

📊 System Capabilities

  • Query type support: V/E traversals, graph algorithm calls, complex filtering, etc.
  • Generation scale: A single complex template can generate 6000+ valid variants
  • Processing efficiency: Batch processing of 3514 templates with robust error handling
  • Output quality: JSON format with query-description pairs and detailed metadata
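
As a rough illustration of a deduplicated JSON output with query-description pairs and metadata, consider the sketch below. The PR does not show the real output layout, so the `metadata`, `corpus`, `query`, `description`, and `total_generated` field names are assumptions; only `total_unique_queries` appears elsewhere in this description.

```python
import json


def build_corpus(pairs: list[tuple[str, str]], source: str = "templates") -> dict:
    """Deduplicate (query, description) pairs globally and wrap them with metadata."""
    seen: set[str] = set()
    entries: list[dict[str, str]] = []
    for query, description in pairs:
        key = query.strip()  # crude normalization; a real system might compare parsed forms
        if key in seen:
            continue
        seen.add(key)
        entries.append({"query": key, "description": description})
    return {
        "metadata": {
            "source": source,
            "total_generated": len(pairs),
            "total_unique_queries": len(entries),
        },
        "corpus": entries,
    }


pairs = [
    ("g.V().hasLabel('person')", "Find all people"),
    ("g.V().hasLabel('person') ", "List every person"),  # duplicate query, trailing space
    ("g.V().out('knows')", "Find everyone who is known by someone"),
]
print(json.dumps(build_corpus(pairs), indent=2))
```

The first description seen for a query wins here; whether the real system keeps one description or several per query is not stated in the PR.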

🧪 Technical Features

  • Recursive backtracking algorithm: Systematically explore parameter combination space
  • Recipe abstraction: Structure queries into generalizable Recipes
  • Constraint optimization: 97%+ of candidate combinations are filtered out as invalid
  • Modular design: Core components can be used and tested independently

📈 Application Value

  • Text-to-Gremlin training: Provide large-scale training data for NLP models
  • Query diversity: Generate rich query variants from limited templates
  • Data quality: Ensure syntactic correctness and semantic reasonableness of generated queries
  • Extensibility: Support extension of new schemas and query types

🔧 Usage

# Basic usage
from generator import generate_corpus_from_templates

templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
result = generate_corpus_from_templates(templates)
print(f"Generated {result['total_unique_queries']} unique queries")
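
For batch runs, templates can be read from a CSV file and processed one at a time so that a single bad template does not stop the run. The sketch below is an assumption-laden illustration, not the PR's code: the `gremlin` column name is a guess at the CSV layout, and `generate_one` is a hypothetical per-template callable standing in for the real generator.

```python
import csv
from typing import Callable, Iterable, TextIO


def load_templates(handle: TextIO, column: str = "gremlin") -> list[str]:
    """Read non-empty query templates from one CSV column (column name is assumed)."""
    reader = csv.DictReader(handle)
    return [
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


def process_templates(
    templates: Iterable[str],
    generate_one: Callable[[str], Iterable[str]],  # hypothetical per-template generator
) -> dict:
    """Run the generator per template, collecting failures instead of aborting."""
    unique: set[str] = set()
    failures: list[tuple[str, str]] = []
    for template in templates:
        try:
            unique.update(generate_one(template))
        except Exception as exc:  # isolate individual failures
            failures.append((template, str(exc)))
    return {"total_unique_queries": len(unique), "failed_templates": failures}
```

A typical call would open the CSV with `open(path, newline="", encoding="utf-8")`, pass the handle to `load_templates`, and hand the resulting list to the generator.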

📋 Documentation

  • README.md: Quick start guide

@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels Sep 30, 2025
@LRriver LRriver changed the title from "Gremlin Corpus Generation System Based on Recursive Backtracking" to "Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios)" Sep 30, 2025
@imbajin imbajin changed the base branch from main to text2gql October 30, 2025 11:37

@imbajin imbajin left a comment

Merge this PR now, enhance it later & could merge into master branch

@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 30, 2025
@imbajin imbajin merged commit f84276b into apache:text2gql Oct 30, 2025
10 checks passed