# Golden Dataset Generator - Cuttlefish JIRA

This notebook creates a synthetic golden dataset using RAGAS from JIRA issue data and uploads it to LangSmith.

**Features:**
- Self-contained (no dependencies on other notebooks)
- Generates 15 synthetic Q&A pairs from JIRA issues
- Uses "description" and "title" fields as input sources
- Uploads dataset to LangSmith for evaluation tracking
- Uses "Abstracted SDG" approach from RAGAS

## Step 1: Install Dependencies

In [1]:
# Install required packages
!pip install -q ragas langsmith langchain-community datasets pandas numpy pillow rapidfuzz
!pip install -q langchain-openai langchain-core qdrant-client

## Step 2: API Keys Configuration

In [2]:
import os
import getpass
from uuid import uuid4

# OpenAI API Key
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

# LangSmith API configuration
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
if "LANGCHAIN_API_KEY" not in os.environ:
    os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Enter your LangSmith API Key: ")
os.environ["LANGCHAIN_PROJECT"] = f"Cuttlefish - JIRA Golden Dataset Generation - {uuid4().hex[0:8]}"

print("✅ API keys configured successfully!")

✅ API keys configured successfully!


## Step 3: Load JIRA Issue Data

In [3]:
import csv
import pandas as pd
from langchain_core.documents import Document
from datetime import datetime, timedelta

# Set CSV field size limit for large descriptions
csv.field_size_limit(10000000)

# Load JIRA data from CSV
print("Loading JIRA issue data...")
jira_documents = []

with open('./JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv', 'r', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)
    
    for i, row in enumerate(reader):
        # Combine title and description for rich content
        # (`or ''` guards against None values from short or malformed rows)
        title = (row.get('title') or '').strip()
        description = (row.get('description') or '').strip()
        
        # Skip empty entries
        if not title and not description:
            continue
            
        # Create combined content with title and description
        if title and description:
            content = f"Issue Title: {title}\n\nDescription: {description}"
        elif title:
            content = f"Issue Title: {title}"
        else:
            content = f"Description: {description}"
        
        # Create document with metadata
        doc = Document(
            page_content=content,
            metadata={
                "key": row.get('key', ''),
                "project": row.get('project', ''),
                "project_name": row.get('project_name', ''),
                "priority": row.get('priority', ''),
                "type": row.get('type', ''),
                "status": row.get('status', ''),
                "created": row.get('created', ''),
                "title": title,
                "description_length": len(description)
            }
        )
        
        jira_documents.append(doc)
        
        # Limit to first 100 documents for manageable processing
        if len(jira_documents) >= 100:
            break

print(f"✅ Loaded {len(jira_documents)} JIRA issue documents")
print(f"Example document preview: {jira_documents[0].page_content[:300]}...")

# Show project distribution
projects = [doc.metadata['project'] for doc in jira_documents]
project_counts = pd.Series(projects).value_counts().head(5)
print(f"\nTop 5 projects in sample:")
for project, count in project_counts.items():
    print(f"  {project}: {count} issues")

Loading JIRA issue data...
✅ Loaded 100 JIRA issue documents
Example document preview: Issue Title: MAX_VERSIONS not respected.

Description: Below is a report from the list.  I confirmed playing in shell that indeed we have this problem.  Lets fix for 0.2.1.{code}Hello.I made some tests with HBase 0.2.0 (RC2), focused on insertion andtimestamps behaviour. I had some surprising result...

Top 5 projects in sample:
  HBASE: 100 issues
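
The title/description combination used in the loading cell can also be written as a small pure function, which makes the empty-field handling easy to check on its own. This is a minimal sketch: the name `build_issue_content` is hypothetical and does not appear in the notebook.

```python
from typing import Optional


def build_issue_content(title: Optional[str], description: Optional[str]) -> Optional[str]:
    """Combine a JIRA title and description the same way the loading cell does.

    Returns None when both fields are empty so the caller can skip the row.
    `build_issue_content` is a hypothetical helper name used for illustration.
    """
    # Treat missing (None) values as empty strings before stripping
    title = (title or "").strip()
    description = (description or "").strip()

    if title and description:
        return f"Issue Title: {title}\n\nDescription: {description}"
    if title:
        return f"Issue Title: {title}"
    if description:
        return f"Description: {description}"
    return None


print(build_issue_content("MAX_VERSIONS not respected.", None))
```

Returning `None` for fully empty rows keeps the "skip empty entries" decision in one place instead of spreading it across the loop.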


## Step 4: Configure Synthetic Data Generation Models

In [4]:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Set up the LLM and embedding models for synthetic data generation
# gpt-4o-mini generates the Q&A pairs; text-embedding-3-small provides the embeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini", temperature=0.1))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(model="text-embedding-3-small"))

print("✅ Synthetic data generation models configured successfully!")
print("Using GPT-4o-mini for better technical content understanding")


✅ Synthetic data generation models configured successfully!
Using GPT-4o-mini for better technical content understanding


## Step 5: Generate Synthetic Golden Dataset

In [5]:
# Generate synthetic dataset using the abstracted SDG approach
print("Generating synthetic dataset using RAGAS...")
print(f"Using first 50 documents from {len(jira_documents)} total JIRA issue documents")

# Initialize the test set generator
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate 15 synthetic Q&A pairs focused on JIRA technical issues
synthetic_dataset = generator.generate_with_langchain_docs(
    jira_documents[:50],  # Use first 50 docs to balance quality and speed
    testset_size=15  # Generate exactly 15 Q&A pairs as requested
)

print(f"✅ Generated {len(synthetic_dataset)} synthetic Q&A pairs successfully!")

Generating synthetic dataset using RAGAS...
Using first 50 documents from 100 total JIRA issue documents


Applying CustomNodeFilter:   0%|          | 0/50 [00:00<?, ?it/s]
Node cdc91764-06db-49fb-ba8f-d3bff675f972 does not have a summary. Skipping filtering.
Node 9f03bac9-613b-4132-bb68-3b299aeeac58 does not have a summary. Skipping filtering.
Node b6b877c5-8237-401e-aabe-3f420f97a5c8 does not have a summary. Skipping filtering.
Node f4c3e8f9-792f-4bbe-9d6f-935fd4f94014 does not have a summary. Skipping filtering.
Node 146f9f7a-e6a5-4ee2-b3fd-9d5efb56f4c2 does not have a summary. Skipping filtering.
Node 399f46ee-6442-48d9-9d8e-99a11566d90c does not have a summary. Skipping filtering.
Node 5ce4c6f0-dfaf-41d6-bc82-34d395c7f840 does not have a summary. Skipping filtering.
Node f29037d1-c266-4ecc-b742-603085814bd8 does not have a summary. Skipping filtering.
Node fca28fa9-aaa5-452f-8148-b746b2f921e7 does not have a summary. Skipping filtering.
Node 78d5b51a-bec3-439d-8bce-4fe048b3af41 does not have a summary. Skipping filtering.
Applying CustomNodeFilter:  20%|██        | 10/50 [00:00

✅ Generated 15 synthetic Q&A pairs successfully!


## Step 6: Convert to Pandas and Display Dataset

In [6]:
import pandas as pd

# Convert the synthetic dataset to pandas DataFrame
golden_df = synthetic_dataset.to_pandas()

print(f"Golden dataset shape: {golden_df.shape}")
print(f"Columns: {list(golden_df.columns)}")
print("\n" + "="*80)
print("JIRA GOLDEN DATASET PREVIEW:")
print("="*80)

# Display first few examples
display(golden_df.head(3))

print("\n" + "="*80)
print("EXAMPLE JIRA QUESTION-ANSWER PAIR:")
print("="*80)

# Show a detailed example
example_idx = 0
print(f"Question: {golden_df.iloc[example_idx]['user_input']}")
print(f"\nExpected Answer: {golden_df.iloc[example_idx]['reference']}")
print(f"\nReference Contexts ({len(golden_df.iloc[example_idx]['reference_contexts'])} chunks):")
for i, context in enumerate(golden_df.iloc[example_idx]['reference_contexts'][:2]):  # Show first 2 contexts
    print(f"  Context {i+1}: {context[:250]}...")

Golden dataset shape: (15, 4)
Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']

JIRA GOLDEN DATASET PREVIEW:


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the issue with HBase regarding the MAX... | [Issue Title: MAX_VERSIONS not respected.\n\nD... | The issue reported is that the MAX_VERSIONS pa... | single_hop_specifc_query_synthesizer |
| 1 | What issues are associated with the node at IP... | [Issue Title: Splitting log in a hostile envir... | The node at IP address 10.252.219.207 is exper... | single_hop_specifc_query_synthesizer |
| 2 | Wut is the problem with DemoClient.java? | [Issue Title: Thrift host and port are hardcod... | The problem with DemoClient.java is that the T... | single_hop_specifc_query_synthesizer |



EXAMPLE JIRA QUESTION-ANSWER PAIR:
Question: What is the issue with HBase regarding the MAX_VERSIONS parameter and how does it affect data storage?

Expected Answer: The issue reported is that the MAX_VERSIONS parameter is not being respected in HBase, leading to the conclusion that despite setting the VERSIONS parameter of the columns to 3, it appears that all versions of the data are being stored. This raises the question of whether there is a garbage collector process that removes old versions and, if so, when this process occurs.

Reference Contexts (1 chunks):
  Context 1: Issue Title: MAX_VERSIONS not respected.

Description: Below is a report from the list.  I confirmed playing in shell that indeed we have this problem.  Lets fix for 0.2.1.{code}Hello.I made some tests with HBase 0.2.0 (RC2), focused on insertion and...


In [7]:
golden_df

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What is the issue with HBase regarding the MAX... | [Issue Title: MAX_VERSIONS not respected.\n\nD... | The issue reported is that the MAX_VERSIONS pa... | single_hop_specifc_query_synthesizer |
| 1 | What issues are associated with the node at IP... | [Issue Title: Splitting log in a hostile envir... | The node at IP address 10.252.219.207 is exper... | single_hop_specifc_query_synthesizer |
| 2 | Wut is the problem with DemoClient.java? | [Issue Title: Thrift host and port are hardcod... | The problem with DemoClient.java is that the T... | single_hop_specifc_query_synthesizer |
| 3 | How does the class org.apache.maven.surefire.j... | [Issue Title: MapReduce based tests broken on ... | The class org.apache.maven.surefire.junit4.JUn... | single_hop_specifc_query_synthesizer |
| 4 | What compatibility issue arises in version 0.9... | [Issue Title: 0.94: HBASE-9865 breaks coproces... | In version 0.94, the change from 'public List<... | single_hop_specifc_query_synthesizer |
| 5 | What issues arise from the Short-Circuit Copro... | [<1-hop>\n\nIssue Title: Short-Circuit Coproce... | The Short-Circuit Coprocessor not correctly lo... | multi_hop_abstract_query_synthesizer |
| 6 | What issues are related to backporting in the ... | [<1-hop>\n\nIssue Title: Backport HBASE-3890 '... | The issues related to backporting in the HBase... | multi_hop_abstract_query_synthesizer |
| 7 | What issues are associated with backporting HB... | [<1-hop>\n\nIssue Title: Backport HBASE-3890 '... | The issues associated with backporting HBASE-3... | multi_hop_abstract_query_synthesizer |
| 8 | Why does the snapshot restoration process some... | [<1-hop>\n\nIssue Title: MAX_VERSIONS not resp... | The snapshot restoration process sometimes fai... | multi_hop_abstract_query_synthesizer |
| 9 | What are the implications of the RegionTooBusy... | [<1-hop>\n\nIssue Title: RegionTooBusyExceptio... | The RegionTooBusyException indicates that a re... | multi_hop_abstract_query_synthesizer |


## Step 7: Upload Dataset to LangSmith

In [8]:
from langsmith import Client
from datetime import datetime

# Initialize LangSmith client
client = Client()

# Create a unique dataset name with timestamp
dataset_name = f"cuttlefish-jira-golden-dataset-{datetime.now().strftime('%Y%m%d-%H%M%S')}"

# Create the dataset in LangSmith
try:
    langsmith_dataset = client.create_dataset(
        dataset_name=dataset_name,
        description="Cuttlefish golden dataset for RAG evaluation using 15 synthetic JIRA issue Q&A pairs generated with RAGAS from title and description fields"
    )
    print(f"✅ Created LangSmith dataset: {dataset_name}")
    print(f"Dataset ID: {langsmith_dataset.id}")
except Exception as e:
    print(f"Error creating dataset: {e}")
    # If dataset already exists, get it
    langsmith_dataset = client.read_dataset(dataset_name=dataset_name)

✅ Created LangSmith dataset: cuttlefish-jira-golden-dataset-20250731-122634
Dataset ID: cb6d3deb-3b08-4b8e-8270-3eb5de4d328e


In [9]:
# Upload examples to LangSmith dataset
print("Uploading JIRA examples to LangSmith...")

upload_count = 0
for idx, row in golden_df.iterrows():
    try:
        client.create_example(
            inputs={
                "question": row["user_input"]
            },
            outputs={
                "answer": row["reference"]
            },
            metadata={
                "reference_contexts": row["reference_contexts"],
                "synthesizer_name": row.get("synthesizer_name", "unknown"),
                "evolution_type": row.get("evolution_type", "unknown"),
                "episode_done": row.get("episode_done", False),
                "dataset_size": len(golden_df),
                "source": "cuttlefish_jira_generator",
                "data_source": "JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv",
                "content_fields": "title + description"
            },
            dataset_id=langsmith_dataset.id
        )
        upload_count += 1
    except Exception as e:
        print(f"Error uploading example {idx}: {e}")
        continue

print(f"✅ Successfully uploaded {upload_count}/{len(golden_df)} examples to LangSmith dataset!")

Uploading JIRA examples to LangSmith...
✅ Successfully uploaded 15/15 examples to LangSmith dataset!
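
The upload loop maps each row onto three dictionaries: inputs, outputs, and metadata. A sketch of that mapping as a standalone function is shown here, working on plain dicts so it can be checked without a LangSmith connection. The name `row_to_example` and the sample values are hypothetical; only keys already used in the upload cell appear.

```python
from typing import Any, Dict


def row_to_example(row: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Build the inputs/outputs/metadata payload for one golden-dataset row.

    Hypothetical helper: it mirrors keys from the upload cell and nothing more.
    """
    return {
        "inputs": {"question": row["user_input"]},
        "outputs": {"answer": row["reference"]},
        "metadata": {
            "reference_contexts": list(row["reference_contexts"]),
            # Fall back to "unknown" when no synthesizer name was recorded
            "synthesizer_name": row.get("synthesizer_name", "unknown"),
        },
    }


# Made-up sample row, not taken from the generated dataset
sample_row = {
    "user_input": "example question",
    "reference": "example answer",
    "reference_contexts": ["example context"],
}
payload = row_to_example(sample_row)
print(payload["metadata"]["synthesizer_name"])
```

Because the payload keys match the keyword arguments used in the upload cell, each payload could be passed along as `client.create_example(**payload, dataset_id=langsmith_dataset.id)`.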


## Step 8: Summary and Next Steps

In [10]:
print("\n" + "="*80)
print("CUTTLEFISH JIRA GOLDEN DATASET GENERATION COMPLETE!")
print("="*80)
print(f"📊 Dataset Name: {dataset_name}")
print(f"📊 Dataset ID: {langsmith_dataset.id}")
print(f"📊 Number of Q&A Pairs: {len(golden_df)}")
print(f"📊 Source Documents Used: {len(jira_documents)} (first 50 for generation)")
print(f"📊 Data Source: JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv")
print(f"📊 Content Fields: title + description")
print(f"📊 Generation Method: RAGAS Abstracted SDG")
print(f"📊 Models Used:")
print(f"   - LLM: gpt-4o-mini")
print(f"   - Embeddings: text-embedding-3-small")

print("\n🎯 READY FOR JIRA RAG EVALUATION!")
print("This dataset can now be used to evaluate RAG chains focused on JIRA issue resolution.")
print("\n📋 Dataset Structure:")
print(f"   - Questions: {golden_df['user_input'].nunique()} unique")
print(f"   - Answer Length: {golden_df['reference'].str.len().mean():.0f} chars avg")
print(f"   - Context Chunks: {golden_df['reference_contexts'].apply(len).mean():.1f} per question")

# Display synthesizer distribution
if 'synthesizer_name' in golden_df.columns:
    synthesizer_counts = golden_df['synthesizer_name'].value_counts()
    print(f"\n📈 Synthesizer Distribution:")
    for synthesizer, count in synthesizer_counts.items():
        print(f"   - {synthesizer}: {count} questions")

print("\n🔧 JIRA Content Analysis:")
# Analyze question types
questions = golden_df['user_input'].tolist()
technical_keywords = ['error', 'exception', 'bug', 'fix', 'issue', 'problem', 'failure']
technical_questions = sum(1 for q in questions if any(keyword in q.lower() for keyword in technical_keywords))
print(f"   - Technical Questions: {technical_questions}/{len(questions)}")
print(f"   - Average Question Length: {golden_df['user_input'].str.len().mean():.0f} chars")


CUTTLEFISH JIRA GOLDEN DATASET GENERATION COMPLETE!
📊 Dataset Name: cuttlefish-jira-golden-dataset-20250731-122634
📊 Dataset ID: cb6d3deb-3b08-4b8e-8270-3eb5de4d328e
📊 Number of Q&A Pairs: 15
📊 Source Documents Used: 100 (first 50 for generation)
📊 Data Source: JIRA_OPEN_DATA_LARGESET_DATESHIFTED.csv
📊 Content Fields: title + description
📊 Generation Method: RAGAS Abstracted SDG
📊 Models Used:
   - LLM: gpt-4o-mini
   - Embeddings: text-embedding-3-small

🎯 READY FOR JIRA RAG EVALUATION!
This dataset can now be used to evaluate RAG chains focused on JIRA issue resolution.

📋 Dataset Structure:
   - Questions: 15 unique
   - Answer Length: 578 chars avg
   - Context Chunks: 1.7 per question

📈 Synthesizer Distribution:
   - single_hop_specifc_query_synthesizer: 5 questions
   - multi_hop_abstract_query_synthesizer: 5 questions
   - multi_hop_specific_query_synthesizer: 5 questions

🔧 JIRA Content Analysis:
   - Technical Questions: 14/15
   - Average Question Length: 129 chars
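
The "Technical Questions" figure comes from a case-insensitive substring match against a fixed keyword list. The sketch below isolates that check using made-up sample questions; `count_technical` is a hypothetical name. Note that substring matching also hits longer words, such as "fix" inside "prefix", so the count is best read as a rough indicator.

```python
from typing import Iterable, Sequence

TECHNICAL_KEYWORDS = ("error", "exception", "bug", "fix", "issue", "problem", "failure")


def count_technical(questions: Iterable[str], keywords: Sequence[str] = TECHNICAL_KEYWORDS) -> int:
    """Count questions containing at least one keyword (case-insensitive substring match)."""
    return sum(
        1 for q in questions
        if any(keyword in q.lower() for keyword in keywords)
    )


# Made-up sample questions for illustration
sample_questions = [
    "Why does the RegionTooBusyException occur?",  # matches "exception"
    "How is the table prefix configured?",         # matches "fix" inside "prefix"
    "Which release added snapshots?",              # no keyword
]
print(count_technical(sample_questions))  # prints 2
```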


## Optional: Save Dataset Locally

In [11]:
# Save the dataset locally for backup/reference
# Compute the timestamp once so the CSV and JSON filenames always match
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
local_filename = f"cuttlefish_jira_golden_dataset_{timestamp}.csv"
golden_df.to_csv(local_filename, index=False)
print(f"💾 Dataset saved locally as: {local_filename}")

# Also save as JSON for better context preservation
json_filename = f"cuttlefish_jira_golden_dataset_{timestamp}.json"
golden_df.to_json(json_filename, orient='records', indent=2)
print(f"💾 Dataset saved locally as JSON: {json_filename}")

print(f"\n📁 Files created:")
print(f"   - CSV: {local_filename}")
print(f"   - JSON: {json_filename}")
print(f"   - LangSmith Dataset: {dataset_name}")

💾 Dataset saved locally as: cuttlefish_jira_golden_dataset_20250731_122806.csv
💾 Dataset saved locally as JSON: cuttlefish_jira_golden_dataset_20250731_122806.json

📁 Files created:
   - CSV: cuttlefish_jira_golden_dataset_20250731_122806.csv
   - JSON: cuttlefish_jira_golden_dataset_20250731_122806.json
   - LangSmith Dataset: cuttlefish-jira-golden-dataset-20250731-122634
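
JSON is saved alongside CSV because `reference_contexts` holds a list per row: CSV flattens the list to its string representation, while JSON keeps it as a list. The sketch below shows the difference using only the standard library and a made-up record (the notebook itself uses the pandas `to_csv` and `to_json` methods).

```python
import csv
import io
import json

# Made-up record shaped like one golden-dataset row
record = {
    "user_input": "example question",
    "reference_contexts": ["first context chunk", "second context chunk"],
}

# CSV round trip: the list is written as its string form and read back as a str
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_buffer.seek(0)
csv_row = next(csv.DictReader(csv_buffer))

# JSON round trip: the list structure is preserved
json_row = json.loads(json.dumps([record]))[0]

print(type(csv_row["reference_contexts"]).__name__)   # prints str
print(type(json_row["reference_contexts"]).__name__)  # prints list
```

When reloading the dataset for evaluation, the JSON file is therefore the more convenient source for the context lists.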


## Dataset Quality Check

In [12]:
# Quality check: Display all questions to review variety and quality
print("\n" + "="*80)
print("QUALITY CHECK: ALL GENERATED QUESTIONS")
print("="*80)

for i, question in enumerate(golden_df['user_input'], 1):
    print(f"{i:2d}. {question}")

print("\n" + "="*80)
print("CONTEXT ANALYSIS")
print("="*80)

# Analyze context sources
all_contexts = []
for contexts in golden_df['reference_contexts']:
    all_contexts.extend(contexts)

print(f"Total context chunks used: {len(all_contexts)}")
print(f"Average context length: {sum(len(ctx) for ctx in all_contexts) / len(all_contexts):.0f} chars")

# Check for JIRA-specific content in contexts
jira_terms = ['HBASE', 'FLEX', 'JBIDE', 'SPR', 'exception', 'error', 'bug', 'fix']
jira_contexts = sum(1 for ctx in all_contexts if any(term in ctx for term in jira_terms))
print(f"Contexts with JIRA-specific terms: {jira_contexts}/{len(all_contexts)} ({jira_contexts/len(all_contexts)*100:.1f}%)")


QUALITY CHECK: ALL GENERATED QUESTIONS
 1. What is the issue with HBase regarding the MAX_VERSIONS parameter and how does it affect data storage?
 2. What issues are associated with the node at IP address 10.252.219.207 during the log splitting process?
 3. Wut is the problem with DemoClient.java?
 4. How does the class org.apache.maven.surefire.junit4.JUnit4TestSet relate to the errors encountered in MapReduce tests on Hadoop 2.0.0-alpha?
 5. What compatibility issue arises in version 0.94 related to HBASE-9865?
 6. What issues arise from the Short-Circuit Coprocessor not correctly looking up tables on the server, and how does this relate to the problems caused by a corrupt HFile leading to resource leaks and OOM errors in region servers?
 7. What issues are related to backporting in the HBase cluster and how do they affect connection to the new active master?
 8. What issues are associated with backporting HBASE-3890 to version 0.94 and how do they relate to the sporadic failures en