# NeMo DataDesigner: Pipeline Test Notebook

Interactive testing of the NDD pipeline before wiring it into the full stack.

1. **Setup**: imports, env var verification, provider/model instantiation
2. **Tag data**: load existing projects, build co-occurrence map
3. **Pipeline preview**: run preview with 2-3 records
4. **Validate output**: check columns match WaywoProjectDB fields
5. **Prompt iteration**: tweak and re-run
6. **Embedding test**: generate embedding for a row
7. **Full save test**: save a generated project to DB

## 1. Setup

In [1]:
import sys
sys.path.insert(0, '/app')

import nest_asyncio
nest_asyncio.apply()

import asyncio
import json
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger(__name__)

In [2]:
# Verify env vars
from src.settings import LLM_BASE_URL, LLM_MODEL_NAME, LLM_API_KEY, EMBEDDING_URL

print(f'LLM_BASE_URL:   {LLM_BASE_URL}')
print(f'LLM_MODEL_NAME: {LLM_MODEL_NAME}')
print(f'LLM_API_KEY:    {LLM_API_KEY[:10]}...' if len(LLM_API_KEY) > 10 else f'LLM_API_KEY: {LLM_API_KEY}')
print(f'EMBEDDING_URL:  {EMBEDDING_URL}')

LLM_BASE_URL:   http://192.168.6.19:8002/v1
LLM_MODEL_NAME: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
LLM_API_KEY: not-needed
EMBEDDING_URL:  http://192.168.5.96:8000


In [3]:
# Build provider and models
from src.ndd_config import build_ndd_provider, build_ndd_models
from data_designer.interface.data_designer import DataDesigner

provider = build_ndd_provider()
models = build_ndd_models(creativity=0.85)

dd = DataDesigner(
    model_providers=[provider],
    artifact_path='/app/data/ndd_artifacts',
)
print('DataDesigner initialized')

[01:39:40] [INFO] NDD provider: waywo-llm -> http://192.168.6.19:8002/v1
[01:39:40] [INFO] NDD models: ['waywo-creative', 'waywo-structured', 'waywo-judge'], model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, creative temp=0.85


DataDesigner initialized


## 2. Tag Data: Co-occurrence Map

In [4]:
from src.db.projects import get_all_projects, get_all_hashtags
from src.ndd_pipeline import build_tag_cooccurrence

# Load all valid projects
projects = get_all_projects(is_valid=True)
print(f'Loaded {len(projects)} valid projects')

# Get all unique tags
all_tags = get_all_hashtags()
print(f'Unique tags: {len(all_tags)}')
print(f'Sample: {all_tags[:20]}')

Loaded 444 valid projects
Unique tags: 915
Sample: ['3d', '3dprinting', '6502', 'ableton', 'abtesting', 'academic', 'access-control', 'accessibility', 'accounting', 'actions', 'activitypub', 'adblocking', 'adtech', 'adultart', 'advisory', 'agegating', 'agent', 'agents', 'aggregation', 'aggregator']


In [5]:
# Build co-occurrence map
cooccurrence = build_tag_cooccurrence(projects)
print(f'Tags with co-occurrence data: {len(cooccurrence)}')

# Inspect top tags
for tag in ['ai', 'web', 'python', 'saas', 'open-source']:
    if tag in cooccurrence:
        top5 = cooccurrence[tag][:5]
        print(f'  {tag}: {top5}')
    else:
        print(f'  {tag}: (not in co-occurrence data)')

[01:40:03] [INFO] Built tag co-occurrence map: 915 tags, avg 7 co-tags each


Tags with co-occurrence data: 915
  ai: [('productivity', 33), ('opensource', 20), ('saas', 20), ('education', 12), ('llm', 12)]
  web: [('saas', 3), ('productivity', 3), ('javascript', 2), ('monitoring', 2), ('uptime', 2)]
  python: [('opensource', 4), ('ai', 4), ('rag', 2), ('django', 1), ('activitypub', 1)]
  saas: [('productivity', 22), ('ai', 20), ('monitoring', 5), ('devops', 4), ('web', 3)]
  open-source: [('typescript', 1), ('linq', 1), ('database', 1), ('query', 1), ('emulator', 1)]
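
The output above suggests that the map stores, for each tag, a list of `(co_tag, count)` pairs sorted by descending count. The implementation of `build_tag_cooccurrence` is not shown in this notebook, so the following is only a minimal sketch of how such a map could be built from plain hashtag lists. The function name `build_cooccurrence_sketch` and the alphabetical tie-breaking are assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence_sketch(tag_lists):
    """Map each tag to its co-occurring tags, most frequent first.

    `tag_lists` is an iterable of per-project hashtag lists.
    Hypothetical helper; not the project's build_tag_cooccurrence.
    """
    counts = defaultdict(Counter)
    for tags in tag_lists:
        unique = sorted(set(tags))  # ignore duplicate tags within one project
        # Every ordered pair (a, b) means "b appeared alongside a".
        for a, b in permutations(unique, 2):
            counts[a][b] += 1
    # Sort by descending count, then alphabetically for stable output.
    return {
        tag: sorted(co.items(), key=lambda kv: (-kv[1], kv[0]))
        for tag, co in counts.items()
    }


sample = [
    ["ai", "saas", "productivity"],
    ["ai", "productivity"],
    ["python", "ai"],
]
cooc = build_cooccurrence_sketch(sample)
print(cooc["ai"])  # [('productivity', 2), ('python', 1), ('saas', 1)]
```

Note that `opensource` and `open-source` appear as separate tags in the real data, so any counting scheme like this one treats them as unrelated unless tags are normalized first.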


## 3. Pipeline Preview

Generate 2-3 records to test the pipeline without a full run.

In [6]:
from src.ndd_pipeline import build_pipeline_config

# Build config with seed tags
config = build_pipeline_config(
    models=models,
    seed_tags=['ai', 'python', 'developer-tools'],
    tag_cooccurrence=cooccurrence,
    all_tags=all_tags,
)

# Show pipeline structure
print('Pipeline columns:')
for col in config.get_column_configs():
    print(f'  [{col.column_type}] {col.name}')

[01:40:09] [INFO] Built NDD pipeline: 4 samplers, 1 LLM text, 1 structured, 1 judge, 4 expression


Pipeline columns:
  [sampler] primary_tag
  [sampler] secondary_tags
  [sampler] target_audience
  [sampler] target_complexity
  [llm-text] project_idea
  [llm-structured] metadata
  [llm-judge] idea_quality
  [expression] title
  [expression] short_description
  [expression] description
  [expression] hashtags


In [7]:
# Run preview (this calls the LLM, so it may take 30-60s)
preview = dd.preview(config, num_records=2)
df = preview.dataset

print(f'Generated {len(df)} records')
print(f'Columns: {list(df.columns)}')
df

[01:40:11] [INFO] 👀 Preview generation in progress
[01:40:15] [INFO] ✅ Validation passed
[01:40:15] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[01:40:15] [INFO] 🩺 Running health checks for models...
[01:40:15] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-judge' (skip_health_check=True)
[01:40:15] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-structured' (skip_health_check=True)
[01:40:15] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-creative' (skip_health_check=True)
[01:40:15] [INFO] 🎲 Preparing samplers to generate 2 records across 4 columns
[01:40:17] [INFO] 📝 llm-text model config for column 'project_idea'
[01:40:17] [INFO]   |-- model: 'nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16'
[01:40:17] [INFO]   |-- model alias: 'waywo-creative'
[01:40:17] [INFO]   |-- model provider: 'waywo-llm'
[01:40:17] [INFO]   |-- inference parameters: generation_type=chat-completion, max_parallel_requests

Generated 2 records
Columns: ['primary_tag', 'secondary_tags', 'target_audience', 'target_complexity', 'project_idea', 'metadata', 'idea_quality', 'title', 'short_description', 'description', 'hashtags']


Unnamed: 0,primary_tag,secondary_tags,target_audience,target_complexity,project_idea,metadata,idea_quality,title,short_description,description,hashtags
0,ai,llm,creative technologists,7,\n**Project Overview** \nCreate **“MuseWeaver...,{'title': 'MuseWeaver AI Storytelling Platform...,{'idea_score': {'reasoning': 'The concept comb...,MuseWeaver AI Storytelling Platform,AI-driven collaborative narrative creation wit...,MuseWeaver merges constraint-guided LLM genera...,"['ai', 'interactive-storytelling', 'collaborat..."
1,python,ai,frontend developers,9,\n**Project Overview – “PromptUI”** \nPromptU...,"{'title': 'PromptUI AI UI Generator', 'short_d...",{'idea_score': {'reasoning': 'The concept merg...,PromptUI AI UI Generator,AI-powered CLI that generates UI components fr...,PromptUI parses natural-language UI specs or d...,"['python', 'ai-code-generation', 'frontend-aut..."


In [8]:
# Inspect a generated row in detail
if len(df) > 0:
    row = df.iloc[0]
    print(f'=== Row 0 ===')
    print(f'primary_tag:      {row.get("primary_tag")}')
    print(f'secondary_tags:   {row.get("secondary_tags")}')
    print(f'target_audience:  {row.get("target_audience")}')
    print(f'target_complexity:{row.get("target_complexity")}')
    print(f'\n--- project_idea (first 300 chars) ---')
    print(str(row.get('project_idea', ''))[:300])
    print(f'\n--- metadata ---')
    print(row.get('metadata'))
    print(f'\n--- Extracted fields ---')
    print(f'title:             {row.get("title")}')
    print(f'short_description: {row.get("short_description")}')
    print(f'description:       {str(row.get("description", ""))[:200]}')
    print(f'hashtags:          {row.get("hashtags")}')
    print(f'\n--- Scores ---')
    print(f'idea_quality:      {row.get("idea_quality")}')

=== Row 0 ===
primary_tag:      ai
secondary_tags:   llm
target_audience:  creative technologists
target_complexity:7

--- project_idea (first 300 chars) ---

**Project Overview**  
Create **“MuseWeaver,”** an AI-driven collaborative storytelling platform that lets creative technologists co-author interactive narrative experiences (e.g., choose-your-own-adventure games, AR-enabled tales, or web-based interactive comics) in real time. The core problem it 

--- metadata ---
{'title': 'MuseWeaver AI Storytelling Platform', 'short_description': 'AI-driven collaborative narrative creation with real-time branching', 'description': 'MuseWeaver merges constraint-guided LLM generation with a semantic branching graph, enabling real-time co-authoring and instant export to game engines or AR web experiences. It solves fragmentation in interactive story workflows and lets small teams ship immersive narratives quickly.', 'hashtags': ['ai', 'interactive-storytelling', 'collabo

## 4. Validate Output

Check that output columns can map to `WaywoProjectDB` fields.

In [9]:
required_fields = ['title', 'short_description', 'description', 'hashtags']

print('Checking required fields in output...')
for field in required_fields:
    present = field in df.columns
    sample = str(df.iloc[0].get(field, 'MISSING'))[:60] if present and len(df) > 0 else 'N/A'
    status = 'OK' if present else 'MISSING'
    print(f'  {status}: {field} = {sample}')

# Check scores
print('\nChecking scores...')
if 'idea_quality' in df.columns and len(df) > 0:
    quality = df.iloc[0]['idea_quality']
    print(f'  idea_quality raw: {quality} (type: {type(quality).__name__})')
    # Scores might be nested in the quality dict/string
    if isinstance(quality, str):
        try:
            quality = json.loads(quality)
        except json.JSONDecodeError:
            pass
    if isinstance(quality, dict):
        print(f'  idea_score:       {quality.get("idea_score")}')
        print(f'  complexity_score: {quality.get("complexity_score")}')
    else:
        print('  (unexpected format; may need parsing adjustment)')

Checking required fields in output...
  OK: title = MuseWeaver AI Storytelling Platform
  OK: short_description = AI-driven collaborative narrative creation with real-time br
  OK: description = MuseWeaver merges constraint-guided LLM generation with a se
  OK: hashtags = ['ai', 'interactive-storytelling', 'collaborative-narrative'

Checking scores...
  idea_quality raw: {'idea_score': {'reasoning': 'The concept combines several cutting-edge areas - LLM-driven narrative generation, semantic branching graphs, and seamless export to game engines and AR web platforms - into a unified collaborative authoring tool. It addresses a clear pain point (fragmented interactive-story workflows) and offers a novel workflow that enables small teams to produce immersive, real-time co-authored stories. While it builds on existing technologies, the integration and real-time branching logic create a distinctive, high-impact product that stands out from typical story-generation tools.', 
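
The check above only confirms that the columns exist in the DataFrame. A per-record check can also catch empty strings or a `hashtags` value of the wrong type before it reaches the database. The sketch below is one possible approach; `validate_record` and `REQUIRED_TEXT_FIELDS` are hypothetical names, and the rule that `hashtags` must be a list of strings is an assumption about what `WaywoProjectDB` expects.

```python
REQUIRED_TEXT_FIELDS = ("title", "short_description", "description")


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in REQUIRED_TEXT_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or empty")
    hashtags = record.get("hashtags")
    if not isinstance(hashtags, list) or not all(isinstance(t, str) for t in hashtags):
        problems.append("hashtags: expected a list of strings")
    return problems


good = {
    "title": "Example",
    "short_description": "Short",
    "description": "Longer text",
    "hashtags": ["ai", "llm"],
}
bad = {"title": " ", "short_description": "Short", "hashtags": "['ai', 'llm']"}

print(validate_record(good))  # []
print(validate_record(bad))
```

In the notebook this could be applied to a preview row with `validate_record(df.iloc[0].to_dict())`.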

## 5. Prompt Iteration

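Iterate by rebuilding the config with different parameters and re-running the preview. The cell below changes the seed tags and raises the creativity setting, which the log reports as the creative model's temperature.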

In [10]:
# Try with different seed tags and higher creativity
models_wild = build_ndd_models(creativity=1.1)

config_wild = build_pipeline_config(
    models=models_wild,
    seed_tags=['blockchain', 'gaming'],
    tag_cooccurrence=cooccurrence,
    all_tags=all_tags,
)

dd_wild = DataDesigner(
    model_providers=[provider],
    artifact_path='/app/data/ndd_artifacts',
)

preview_wild = dd_wild.preview(config_wild, num_records=2)
df_wild = preview_wild.dataset

print(f'High-creativity results ({len(df_wild)} rows):')
for i, row in df_wild.iterrows():
    print(f'\n  [{i}] {row.get("title", "?")} - {row.get("short_description", "?")}')
    print(f'      Tags: {row.get("hashtags")}')

[01:42:52] [INFO] NDD models: ['waywo-creative', 'waywo-structured', 'waywo-judge'], model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16, creative temp=1.1
[01:42:52] [INFO] Built NDD pipeline: 4 samplers, 1 LLM text, 1 structured, 1 judge, 4 expression
[01:42:52] [INFO] 👀 Preview generation in progress
[01:42:52] [INFO] ✅ Validation passed
[01:42:52] [INFO] ⛓️ Sorting column configs into a Directed Acyclic Graph
[01:42:52] [INFO] 🩺 Running health checks for models...
[01:42:52] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-judge' (skip_health_check=True)
[01:42:52] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-structured' (skip_health_check=True)
[01:42:52] [INFO]   |-- ⏭️  Skipping health check for model alias 'waywo-creative' (skip_health_check=True)
[01:42:52] [INFO] 🎲 Preparing samplers to generate 2 records across 4 columns
[01:42:52] [INFO] 📝 llm-text model config for column 'project_idea'
[01:42:52] [INFO]   |-- model: 

High-creativity results (2 rows):

  [0] Data Hunt - Gamified data discovery and model training via NFTs
      Tags: ['gaming', 'machine-learning', 'data-science', 'blockchain', 'web3']

  [1] Dungeon Companion AI - AI companion that learns playstyle and adapts dungeon difficulty
      Tags: ['gaming', 'roguelike-game', 'artificial-intelligence', 'reinforcement-learning']


## 6. Embedding Test

Take a generated row and run it through the embedding pipeline.

In [11]:
from src.clients.embedding import create_embedding_text, get_single_embedding

if len(df) > 0:
    row = df.iloc[0]
    
    # Parse hashtags if needed
    hashtags = row.get('hashtags', [])
    if isinstance(hashtags, str):
        try:
            hashtags = json.loads(hashtags)
        except json.JSONDecodeError:
            hashtags = [hashtags]
    
    # Create embedding text
    emb_text = create_embedding_text(
        title=str(row.get('title', '')),
        description=str(row.get('description', '')),
        hashtags=hashtags if isinstance(hashtags, list) else [],
    )
    print(f'Embedding text ({len(emb_text)} chars):')
    print(emb_text[:300])
    
    # Generate embedding
    async def gen_emb():
        return await get_single_embedding(emb_text)
    
    embedding = asyncio.get_event_loop().run_until_complete(gen_emb())
    print(f'\nEmbedding shape: {len(embedding)} dimensions')
    print(f'First 5 values: {embedding[:5]}')
else:
    print('No preview data to test embedding with')

[01:46:02] [INFO] 📡 Calling embedding service for 1 text(s)


Embedding text (395 chars):
MuseWeaver AI Storytelling Platform
MuseWeaver merges constraint-guided LLM generation with a semantic branching graph, enabling real-time co-authoring and instant export to game engines or AR web experiences. It solves fragmentation in interactive story workflows and lets small teams ship immersive


[01:46:02] [INFO] ✅ Got 1 embedding(s)



Embedding shape: 4096 dimensions
First 5 values: [-0.00848388671875, -0.0101318359375, 0.01495361328125, 0.01226806640625, 0.006011962890625]
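
Printing the length and the first few values confirms that the service responded, but a slightly stricter sanity check can catch a wrong dimensionality or non-finite values before the vector is stored. The helper below is a hypothetical sketch, not part of `src.clients.embedding`; the expected dimension of 4096 comes from the output above and may differ for other embedding models.

```python
import math


def check_embedding(vec, expected_dim=None):
    """Validate an embedding vector and return its L2 norm.

    Raises ValueError on a dimension mismatch or on NaN/inf values.
    """
    if expected_dim is not None and len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")
    if not all(math.isfinite(x) for x in vec):
        raise ValueError("embedding contains NaN or infinite values")
    return math.sqrt(sum(x * x for x in vec))


norm = check_embedding([3.0, 4.0], expected_dim=2)
print(norm)  # 5.0
```

In the notebook this could be called as `check_embedding(embedding, expected_dim=4096)`. A norm close to 1.0 would indicate that the service returns normalized vectors, which matters when choosing between cosine and dot-product similarity.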


## 7. Full Save Test

Save a generated project to the database with `source="nemo_data_designer"`.

In [13]:
from datetime import datetime
from src.models import WaywoProject
from src.db.projects import save_project, get_project

def extract_judge_score(quality: dict, score_name: str, default: int = 5) -> int:
    """Extract an integer score from the NDD judge output.
    
    Judge columns return: {score_name: {"score": N, "reasoning": "..."}}
    """
    val = quality.get(score_name, default)
    if isinstance(val, dict):
        val = val.get("score", default)
    return max(1, min(10, int(val)))

if len(df) > 0:
    row = df.iloc[0]
    
    # Parse fields from the generated row
    hashtags = row.get('hashtags', [])
    if isinstance(hashtags, str):
        try:
            hashtags = json.loads(hashtags)
        except json.JSONDecodeError:
            hashtags = [hashtags]
    
    # Parse scores from idea_quality
    quality = row.get('idea_quality', {})
    if isinstance(quality, str):
        try:
            quality = json.loads(quality)
        except json.JSONDecodeError:
            quality = {}
    
    idea_score = extract_judge_score(quality, 'idea_score')
    complexity_score = extract_judge_score(quality, 'complexity_score')
    
    now = datetime.utcnow()
    
    project = WaywoProject(
        id=0,  # will be auto-assigned
        source_comment_id=None,
        source='nemo_data_designer',
        is_valid_project=True,
        title=str(row.get('title', 'Untitled')),
        short_description=str(row.get('short_description', '')),
        description=str(row.get('description', '')),
        hashtags=hashtags if isinstance(hashtags, list) else [],
        project_urls=[],
        url_summaries={},
        primary_url=None,
        url_contents={},
        idea_score=idea_score,
        complexity_score=complexity_score,
        workflow_logs=['Generated by NeMo DataDesigner'],
        created_at=now,
        processed_at=now,
    )
    
    print(f'Saving project: {project.title}')
    print(f'  source: {project.source}')
    print(f'  source_comment_id: {project.source_comment_id}')
    print(f'  scores: idea={project.idea_score}, complexity={project.complexity_score}')
    print(f'  hashtags: {project.hashtags}')
    
    # Save with embedding
    project_id = save_project(project, embedding=embedding if 'embedding' in dir() else None)
    print(f'\nSaved! Project ID: {project_id}')
    
    # Verify it can be read back
    saved = get_project(project_id)
    print(f'\nRead back from DB:')
    print(f'  id: {saved.id}')
    print(f'  title: {saved.title}')
    print(f'  source: {saved.source}')
    print(f'  source_comment_id: {saved.source_comment_id}')
else:
    print('No preview data to save')

Saving project: MuseWeaver AI Storytelling Platform
  source: nemo_data_designer
  source_comment_id: None
  scores: idea=8, complexity=7
  hashtags: ["['ai', 'interactive-storytelling', 'collaborative-narrative', 'game-dev']"]

Saved! Project ID: 453

Read back from DB:
  id: 453
  title: MuseWeaver AI Storytelling Platform
  source: nemo_data_designer
  source_comment_id: None
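
The `hashtags` line in the saved output shows the whole list stored as a single string element. This suggests the column value arrives as a Python-style list string with single quotes, which `json.loads` rejects, so the fallback wraps the raw string in a list. A minimal sketch of a more tolerant parser follows; `parse_hashtags` is a hypothetical helper, and falling back to `ast.literal_eval` is one possible fix rather than the project's own.

```python
import ast
import json


def parse_hashtags(raw):
    """Coerce a hashtags value into a list of strings.

    Accepts real lists/tuples, JSON arrays, and Python-repr list strings.
    """
    if isinstance(raw, (list, tuple)):
        return [str(t) for t in raw]
    if not isinstance(raw, str):
        return []
    text = raw.strip()
    if not text:
        return []
    # Try strict JSON first, then a safe Python literal (handles single quotes).
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(value, (list, tuple)):
            return [str(t) for t in value]
    # Not a list in either syntax: treat the string as a single tag.
    return [text]


print(parse_hashtags("['ai', 'interactive-storytelling', 'game-dev']"))
# ['ai', 'interactive-storytelling', 'game-dev']
```

`ast.literal_eval` only evaluates literals, so it is safe to use on model-generated strings, unlike `eval`.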



In [14]:
# Verify the saved project is returned when filtering by source
from src.db.projects import get_all_projects

ndd_projects = get_all_projects(source='nemo_data_designer')
print(f'Projects with source=nemo_data_designer: {len(ndd_projects)}')
for p in ndd_projects:
    print(f'  [{p.id}] {p.title} (idea={p.idea_score}, complexity={p.complexity_score})')

Projects with source=nemo_data_designer: 1
  [453] MuseWeaver AI Storytelling Platform (idea=8, complexity=7)


In [None]:
# Optional: delete the test project if you don't want to keep it
# from src.db.projects import delete_project
# if 'project_id' in dir():
#     delete_project(project_id)
#     print(f'Deleted test project {project_id}')