# LangGraph CLI Synthetic Dataset Generation

This notebook demonstrates how to use **NVIDIA NeMo Data Designer** to create a synthetic dataset for training an AI agent to translate natural language queries into structured CLI tool calls.

## What is NeMo Data Designer?

[NeMo Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is a powerful synthetic data generation library that transforms data designs into high-quality datasets. It supports:

- **Sampling-based columns**: Generate values from statistical distributions (uniform, categorical, etc.)
- **LLM-based columns**: Use language models to generate realistic text or structured outputs
- **Expression columns**: Compute values from other columns using Jinja2 expressions
- **Jinja templating**: Create dynamic prompts with conditional logic

## Use Case: LangGraph CLI Agent

We're generating training data for an agent that interprets natural language requests like:
> "Create a new project using the react-agent template"

and converts them to structured tool calls:
```json
{"command": "new", "template": "react-agent", "path": null, ...}
```

This synthetic data will enable fine-tuning an LLM to perform accurate tool-calling for the LangGraph CLI.

## Step 0: Provide NVIDIA API Key



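In [1]:
import getpass
import os

# Prompt for the key without echoing it; skip the prompt if it is already set.
if not os.environ.get("NVIDIA_API_KEY"):
    os.environ["NVIDIA_API_KEY"] = getpass.getpass("NVIDIA API key: ")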

## Step 1: Setup Data Designer

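Import the Data Designer building blocks used in this notebook and create a `DataDesigner` instance. The library must already be installed in the notebook environment.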

In [2]:
from data_designer.essentials import (
    DataDesigner,
    DataDesignerConfigBuilder,
    SamplerColumnConfig,
    LLMTextColumnConfig,
    LLMStructuredColumnConfig,
    SamplerType,
    CategorySamplerParams,
    UniformSamplerParams,
    ModelConfig,
    ChatCompletionInferenceParams,
)

designer = DataDesigner()

## Step 2: Design the Synthetic Data Schema

Define the structure of our synthetic dataset with four building blocks:

- **Pydantic Model**: `CLIToolCall` schema for structured outputs
- **Model Config**: LLM settings for data generation
- **Sampler Columns**: Statistical distributions for seed values
- **LLM Columns**: Generate natural language inputs and structured outputs
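
To build intuition for the `weights` arguments passed to the category samplers in the next cell, here is a small standard-library sketch. It treats weights as relative frequencies, so `[1, 3]` yields `True` about a quarter of the time. This is an illustration built on `random.choices`, not Data Designer's implementation: the helper name `sample_category` is hypothetical, and the assumption that Data Designer normalizes weights the same way should be checked against its documentation.

```python
import random
from collections import Counter


def sample_category(values, weights, n, seed=0):
    """Draw n values with probability proportional to weights (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(values, weights=weights, k=n)


# Mirrors the `include_path` sampler: values=[True, False], weights=[1, 3]
draws = sample_category([True, False], weights=[1, 3], n=10_000)
counts = Counter(draws)
true_share = counts[True] / len(draws)

# The share should land close to 1 / (1 + 3) = 0.25
print(f"True share: {true_share:.3f}")
```

Under the same assumption, `no_browser` (`[1, 4]`) would be `True` in roughly 20% of rows and `watch` (`[1, 2]`) in roughly 33%.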

In [4]:
from pydantic import BaseModel, Field
from typing import Optional

class CLIToolCall(BaseModel):
    command: str = Field(..., description="CLI command: new, dev, up, build, or dockerfile")
    template: Optional[str] = Field(None, description="Template name for 'new' command")
    path: Optional[str] = Field(None, description="Project path for 'new' command")
    port: Optional[int] = Field(None, description="Port for 'dev' or 'up' command")
    no_browser: Optional[bool] = Field(None, description="Skip browser for 'dev' command")
    watch: Optional[bool] = Field(None, description="Watch mode for 'up' command")
    tag: Optional[str] = Field(None, description="Image tag for 'build' command")
    output_path: Optional[str] = Field(None, description="Output path for 'dockerfile' command")

# Model config
model_configs = [
    ModelConfig(
        alias="command-generator",
        provider="nvidia",
        model="nvidia/nemotron-3-nano-30b-a3b",
        inference_parameters=ChatCompletionInferenceParams(
            temperature=1.0,
            top_p=1.0,
            max_tokens=1000
        )
    )
]

config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

# Sampler columns
config_builder.add_column(
    SamplerColumnConfig(
        name="command",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["new", "dev", "up", "build", "dockerfile"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="template",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["basic", "react-agent", "memory-agent", "retrieval-agent", "data-enrichment"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="include_path",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 3])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="port",
        sampler_type=SamplerType.UNIFORM,
        params=UniformSamplerParams(low=3000, high=9000),
        convert_to="int"
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="no_browser",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 4])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="watch",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=[True, False], weights=[1, 2])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="image_tag",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["myapp:latest", "latest", "langgraph-app:v1"])
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="dockerfile_path",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["Dockerfile", "Dockerfile.custom", "docker/Dockerfile"])
    )
)

# Input column - generates natural language requests
config_builder.add_column(
    LLMTextColumnConfig(
        name="input",
        model_alias="command-generator",
        prompt=(
            "Generate a natural user request for the LangGraph CLI.\n\n"
            "Command: {{ command }}\n\n"
            "{% if command == 'new' %}"
            "The user wants to create a new project with the '{{ template }}' template."
            "{% if include_path %} They want it in a custom directory.{% endif %}"
            "{% elif command == 'dev' %}"
            "The user wants to start the dev server on port {{ port }}."
            "{% if no_browser %} They don't want to auto-open a browser.{% endif %}"
            "{% elif command == 'up' %}"
            "The user wants to launch the server container on port {{ port }}."
            "{% if watch %} They want to watch for code changes.{% endif %}"
            "{% elif command == 'build' %}"
            "The user wants to build a Docker image with tag '{{ image_tag }}'."
            "{% elif command == 'dockerfile' %}"
            "The user wants to generate a Dockerfile at '{{ dockerfile_path }}'."
            "{% endif %}\n\n"
            "Write one natural, conversational sentence."
        ),
        system_prompt="Output only a single sentence. No explanation.",
    )
)

# Output column - generates structured CLI tool calls
config_builder.add_column(
    LLMStructuredColumnConfig(
        name="output",
        model_alias="command-generator",
        prompt=(
            "Convert this user request to a LangGraph CLI tool-call.\n\n"
            "Command type: {{ command }}\n"
            "User request: {{ input }}\n\n"
            "{% if command == 'new' %}"
            "Set: template, and path if specified."
            "{% elif command == 'dev' %}"
            "Set: port, and no_browser if specified."
            "{% elif command == 'up' %}"
            "Set: port, and watch if specified."
            "{% elif command == 'build' %}"
            "Set: tag."
            "{% elif command == 'dockerfile' %}"
            "Set: output_path."
            "{% endif %}\n\n"
            "Only set fields relevant to the command. Leave others as null."
        ),
        system_prompt="Output ONLY the JSON object. No preamble, no quotes, no meta-commentary.",
        output_format=CLIToolCall,
    )
)

## Step 3: Preview Data Generation

Before generating a large dataset, we use the preview feature to validate our configuration and inspect sample outputs. This follows the recommended workflow:

1. **Design phase** → Define columns and prompts
2. **Preview** → Generate small batches to validate quality
3. **Iterate** → Refine prompts and constraints
4. **Batch generation** → Create full dataset

The preview runs the full pipeline on a small sample (5 records here). The result's `dataset` attribute is a pandas DataFrame containing all generated columns.

This lets us verify that:
- Natural language inputs sound realistic
- Structured outputs conform to our Pydantic schema
- The command-to-input-to-output pipeline produces coherent training pairs

In [5]:
# Preview with 5 records to validate the configuration
preview_result = designer.preview(config_builder=config_builder, num_records=5)
preview_df = preview_result.dataset
print(preview_df[['input', 'output']].head(5))

[13:12:30] [INFO] 📺 Preview generation in progress
[13:12:32] [INFO] ✅ Validation passed
[13:12:32] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[13:12:32] [INFO] 🩺 Running health checks for models...
[13:12:32] [INFO]   |-- 👀 Checking 'nvidia/nemotron-3-nano-30b-a3b' in provider named 'nvidia' for model alias 'command-generator'...
[13:12:33] [INFO]   |-- ✅ Passed!
[13:12:33] [INFO] 🎲 Preparing samplers to generate 5 records across 8 columns
[13:12:34] [INFO] 📝 llm-text model config for column 'input'
[13:12:34] [INFO]   |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[13:12:34] [INFO]   |-- model alias: 'command-generator'
[13:12:34] [INFO]   |-- model provider: 'nvidia'
[13:12:34] [INFO]   |-- inference parameters: generation_type=chat-completion, max_parallel_requests=4, temperature=1.00, top_p=1.00, max_tokens=1000
[13:12:34] [INFO] 🐙 Processing llm-text column 'input' with 4 concurrent workers
[13:12:36] [INFO] 🗂️ llm-structured model 

                                               input  \
0  Run the LangGraph server inside the container ...   
1  Could you start the server container on port 8...   
2  Could you spin up the dev server on port 6446 ...   
3  Could you build the Docker image and tag it as...   
4  Could you start the server container and expos...   

                                              output  
0  {'command': 'up', 'template': None, 'path': No...  
1  {'command': 'up', 'template': None, 'path': No...  
2  {'command': 'dev', 'port': 6446, 'no_browser':...  
3  {'command': 'build', 'template': None, 'path':...  
4  {'command': 'up', 'template': None, 'path': No...  
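
Beyond eyeballing the preview, the coherence of each training pair can be checked mechanically. The sketch below flags non-null fields that do not belong to a record's command, following the "Set: ..." rules in the output prompt. The names `RELEVANT_FIELDS` and `unexpected_fields` are hypothetical helpers, and the two sample records are made up for illustration.

```python
# Fields each command is expected to populate, taken from the output prompt.
RELEVANT_FIELDS = {
    "new": {"template", "path"},
    "dev": {"port", "no_browser"},
    "up": {"port", "watch"},
    "build": {"tag"},
    "dockerfile": {"output_path"},
}


def unexpected_fields(tool_call: dict) -> list[str]:
    """Return non-null fields that do not belong to the record's command."""
    command = tool_call.get("command")
    if command not in RELEVANT_FIELDS:
        return ["command"]  # unknown or missing command
    allowed = RELEVANT_FIELDS[command] | {"command"}
    return sorted(
        name
        for name, value in tool_call.items()
        if value is not None and name not in allowed
    )


# Hypothetical records shaped like the `output` column
good = {"command": "dev", "port": 6446, "no_browser": True, "template": None}
bad = {"command": "build", "tag": "myapp:latest", "port": 8000}

print(unexpected_fields(good))  # []
print(unexpected_fields(bad))   # ['port']
```

In the notebook this could be applied as `preview_df["output"].apply(unexpected_fields)`, assuming each cell holds a dict as the printed preview suggests.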


## Step 4: Generate Full Dataset

Once we're satisfied with the preview results, we generate the full dataset. The `designer.create()` method processes data in batches with parallel LLM requests for efficiency.

> NOTE: We use the following to suppress warnings caused by a cosmetic mismatch between Data Designer and build.nvidia.com responses.

In [6]:
# Generate 250 synthetic training examples
generate_result = designer.create(config_builder=config_builder, num_records=250)

[13:13:21] [INFO] 🎨 Creating Data Designer dataset
[13:13:21] [INFO] ✅ Validation passed
[13:13:21] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[13:13:21] [INFO] 🩺 Running health checks for models...
[13:13:21] [INFO]   |-- 👀 Checking 'nvidia/nemotron-3-nano-30b-a3b' in provider named 'nvidia' for model alias 'command-generator'...
[13:13:23] [INFO]   |-- ✅ Passed!
[13:13:23] [INFO] ⏳ Processing batch 1 of 1
[13:13:23] [INFO] 🎲 Preparing samplers to generate 250 records across 8 columns
[13:13:23] [INFO] 📝 llm-text model config for column 'input'
[13:13:23] [INFO]   |-- model: 'nvidia/nemotron-3-nano-30b-a3b'
[13:13:23] [INFO]   |-- model alias: 'command-generator'
[13:13:23] [INFO]   |-- model provider: 'nvidia'
[13:13:23] [INFO]   |-- inference parameters: generation_type=chat-completion, max_parallel_requests=4, temperature=1.00, top_p=1.00, max_tokens=1000
[13:13:23] [INFO] 🐙 Processing llm-text column 'input' with 4 concurrent workers

In [7]:
dataset_df = generate_result.load_dataset()
print(f"Generated {len(dataset_df)} records")

Generated 250 records
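
With the full dataset loaded, it is worth confirming that the five commands are reasonably balanced before training. The `command` sampler lists its values without weights, so a roughly even split is the expectation, assuming unweighted values are sampled uniformly. The helper `command_shares` and the stand-in rows below are hypothetical; the sketch works on plain dicts so it runs without the generated data.

```python
from collections import Counter


def command_shares(records):
    """Return each command's share of the records, keyed by command name."""
    counts = Counter(record["command"] for record in records)
    total = sum(counts.values())
    return {command: count / total for command, count in counts.items()}


# Hypothetical stand-in rows; in the notebook use dataset_df.to_dict("records")
sample_records = [
    {"command": "new"}, {"command": "dev"}, {"command": "up"},
    {"command": "build"}, {"command": "dockerfile"}, {"command": "dev"},
]

for command, share in sorted(command_shares(sample_records).items()):
    print(f"{command:<12}{share:.2f}")
```

A command whose share is far from one fifth on the real 250 records would suggest revisiting the sampler configuration before training.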


## Step 5: Save Dataset for Training

Export the generated dataset to JSONL format for use with the GRPO training pipeline. We split into training and validation sets.

In [8]:
from sklearn.model_selection import train_test_split
import json
from pathlib import Path

# Split into train/val (90/10)
train_df, val_df = train_test_split(dataset_df, test_size=0.1, random_state=42)

# Create output directory
output_dir = Path("data/langgraph_cli")
output_dir.mkdir(parents=True, exist_ok=True)

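# Save as JSONL: one {"input", "output"} JSON object per line
def save_jsonl(df, path):
    with open(path, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            record = {"input": row["input"], "output": row["output"]}
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")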

save_jsonl(train_df, output_dir / "train.jsonl")
save_jsonl(val_df, output_dir / "val.jsonl")

print(f"Saved {len(train_df)} training and {len(val_df)} validation examples")

Saved 225 training and 25 validation examples
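
Before handing the files to a training job, a quick read-back catches truncated or malformed lines. The following is a minimal sketch using only the standard library. `load_jsonl` and `check_records` are hypothetical helper names, and the demo writes two made-up rows to a temporary file so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_records(records, required=("input", "output")):
    """Return the indices of records that lack any required key."""
    return [
        i for i, rec in enumerate(records)
        if any(key not in rec for key in required)
    ]


# Self-contained demo on a temporary file with hypothetical rows;
# in the notebook, point load_jsonl at output_dir / "train.jsonl" instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    rows = [
        {"input": "Build the image and tag it latest",
         "output": {"command": "build", "tag": "latest"}},
        {"input": "Start the dev server"},  # missing "output" on purpose
    ]
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8"
    )
    loaded = load_jsonl(demo_path)

print(len(loaded), check_records(loaded))  # 2 [1]
```

On the real files, an empty list from `check_records` means every line parsed as JSON and carries both keys.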
