# Synthetic Query Generation

Synthetic data generation has become a crucial technique in AI development, especially when working with specialized domains where obtaining high-quality human-labeled data is challenging.

As highlighted in a recent [Answer.AI blog post](https://www.answer.ai/posts/2024-10-15-how-to-synthesize-data.html), the key to effective synthetic data lies in balancing two critical factors:

1. **Quality** - Ensuring the generated queries are accurate, relevant, and useful
2. **Diversity** - Creating a wide range of query types, styles, and perspectives

If you're interested in synthetic data generation in general, we highly recommend reading this blog post. In this notebook, we will implement a similar process targeted at query generation.

In this notebook, we will get to the main topic of this workshop: *synthetic* ***query*** *generation*.

By the end of this notebook, you'll have a comprehensive understanding of how to generate high-quality, diverse synthetic queries for fine-tuning embedding models on specialized domains.

## Setup

> ***Important:*** *As we won't need it in this notebook and usage is limited, make sure you are* ***not*** *using a GPU runtime. Click on `Runtime` > `Change runtime type` > Select `CPU` and Save.*

> *Also, to make sure there are no older sessions running, click on `Runtime` > `Manage sessions` > `Terminate other sessions`*

We will use the same setup as in notebook `01_intro.ipynb`.

In [None]:
!wget "https://drive.google.com/uc?export=download&id=1kTbWY9JJf0fFoqZGh6d-DRHQel6sT-9Y" -O ./sample_data.csv
!wget "https://drive.google.com/uc?export=download&id=1hBGWmXKW2LhMZ9rOd05UTg_aUp_nJ_Wt" -O ./fewshot_examples.csv

In [None]:
import os
import random
import time
from typing import Dict, Any, Optional, List
from google import genai
from google.colab import userdata
from google.genai import types
from IPython.display import display, Markdown

os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")  # alternatively paste your key here
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

In [None]:
def generate_text(
    prompt: str,
    model: str = "gemini-2.0-flash",
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    system_instructions: Optional[str] = None
) -> str:
    """
    Generate text using Google's Gemini model with configurable parameters.

    Args:
        prompt: The user prompt to send to the model
        model: Model name to use (default: gemini-2.0-flash)
        temperature: Sampling temperature (0.0-2.0, lower is more deterministic)
        max_tokens: Maximum number of tokens to generate
        system_instructions: Optional system instruction to guide the model

    Returns:
        Generated text response as string
    """
    try:
        # Create config with only non-None parameters
        config_params = {}
        # Compare against None so that an explicit temperature of 0.0 is still applied
        if temperature is not None:
            config_params["temperature"] = temperature
        if max_tokens is not None:
            config_params["max_output_tokens"] = max_tokens
        if system_instructions:
            config_params["system_instruction"] = system_instructions

        # Create the config object
        config = types.GenerateContentConfig(**config_params)

        # Generate content
        response = client.models.generate_content(
            model=model,
            contents=[prompt],
            config=config
        )

        return response.text
    except Exception as e:
        return f"Error generating text: {str(e)}"

## 1. Basic Query Generation

Let's start with the most basic approach to generating synthetic queries: simply asking an LLM to generate questions about the EU AI Act. This method requires minimal setup and can quickly produce a set of queries to serve as a starting point.

We'll use Gemini to generate some initial queries about the EU AI Act without providing any specific context from the actual documents.

**Exercise 1a:**
> Write a short system prompt that tells Gemini more about the application we're building.

In [None]:
system_prompt = """describe in 1-3 sentences what we're trying to achieve"""

**Exercise 1b:**
> Now write a simple prompt that asks the LLM to generate a question about the EU's AI Act. Then call Gemini with this prompt as well as the system prompt you wrote above. Note that you might need to add an instruction to tell it only to return the query and nothing else.

In [None]:
basic_prompt = """write your prompt here"""

# Generate the questions
response = ...
print(response)

**Exercise 1c:**
> Now run the same prompt 5 times

In [None]:
...

**Reflection:**
> How would you evaluate the quality of these queries?

## 2. Grounded Generation

To address the limitations we observed above, we'll now explore a more effective approach: grounding our query generation in actual passages from the EU AI Act.

This technique significantly improves both the quality and diversity of our synthetic queries by:
1. Ensuring queries are relevant to the actual content of the document
2. Naturally increasing diversity as different passages cover different aspects of the legislation
3. Incorporating accurate terminology and concepts from the source material

Let's load a few random passages from the AI Act.

In [None]:
import pandas as pd
df = pd.read_csv("sample_data.csv")
len(df)

In [None]:
passages = df.sample(5, random_state=10)["passage"].values.tolist()

Print out one passage, which we'll use for grounded generation.

In [None]:
passage = passages[0]
display(Markdown(passage))

In addition to the passage, we will also provide the LLM with a list of criteria which define a high quality query.

**Exercise 2a:**
> Write a list of criteria which define a high quality query. For example, a high quality query should be specific, relevant and answerable by the passage.

In [None]:
criteria = """
1. criterion 1
2. criterion 2
...
"""

**Exercise 2b:**
> Write a prompt that asks the LLM to generate a query based on the provided criteria and passage.

<details>
<summary>Click to see a hint</summary>

Add the passage using a placeholder:

```
"""
Write a query about the following passage:

## Passage

{passage}

## Criteria

Follow these criteria:
1. criterion 1
2. criterion 2
...
"""
```

</details>

In [None]:
grounded_prompt = """write your prompt here"""

**Tip:**
> Always print out the fully formatted prompt to make sure everything is correct

<details>
<summary>Click to see a hint</summary>

Format the prompt by passing a passage to it:

```
prompt = grounded_prompt.format(passage=passage)
```

</details>

In [None]:
prompt = ...
print(prompt)

In [None]:
response = ...
print(response)

**Exercise 2c:**
> Call Gemini 5 times with this prompt, each time using a different passage.

**Tip:**
> *Look at your data!* Always make sure to print out the context used for grounding, along with the generated query. Only if you can see all the relevant information yourself will you be able to judge a query's quality.

In [None]:
...

**Reflection:**
> What do you think? Did this technique improve the generated questions?

**Exercise 2d:**
> Now call Gemini 5 times with this prompt, each time using the same passage.

In [None]:
...

## 3. Persona-Based Generation

Our previous approach successfully grounded queries in relevant passages, but we still observed limited diversity in query styles and perspectives. To address this limitation, we'll now explore persona-based generation, a powerful technique for further enhancing the diversity of our synthetic queries.

Persona-based generation involves creating detailed character profiles (personas) that guide the LLM to generate content from specific perspectives. This approach was introduced in the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas".

The key benefits of persona-based generation for our query task include:

1. **Stylistic diversity** - Different personas use different language patterns, terminology, and complexity levels
2. **Varied perspectives** - Personas with different backgrounds approach topics with unique concerns and priorities
3. **Realistic variation** - Real users come from diverse backgrounds and have different levels of domain knowledge

Let's implement this approach by creating a set of personas with varying backgrounds, knowledge levels, and interests in the EU AI Act.

**Exercise 3a:**
> Create a list of 5 diverse personas who might have questions about the EU AI Act. Consider including different professions, technical backgrounds, and reasons for interest in the legislation.

In [None]:
personas = [
    "A data protection officer at a large European enterprise implementing AI compliance",
    "A software developer specializing in machine learning applications",
    ...
]

In addition to personas, we will also generate a few different query styles to further increase diversity.

**Exercise 3b:**
> Create a list of 5 different query styles. Try to make them as realistic and diverse as possible.

In [None]:
query_styles = [
    "Technical language with domain-specific terminology",
    "Simple direct question with basic vocabulary",
    ...
]
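
As a rough illustration of why combining the two lists helps, the sketch below pairs every persona with every query style using `itertools.product`. It reuses the two example personas and two example styles from the cells above. The names `example_personas`, `example_styles`, and `combinations` are placeholders chosen for this sketch.

```python
import itertools
import random

example_personas = [
    "A data protection officer at a large European enterprise implementing AI compliance",
    "A software developer specializing in machine learning applications",
]
example_styles = [
    "Technical language with domain-specific terminology",
    "Simple direct question with basic vocabulary",
]

# Every persona can be paired with every style: 2 x 2 = 4 distinct contexts.
combinations = list(itertools.product(example_personas, example_styles))
print(len(combinations))

# Draw one (persona, style) pair at random; seeding makes the draw repeatable.
rng = random.Random(0)
persona, query_style = rng.choice(combinations)
print(f"Persona: {persona}\nQuery style: {query_style}")
```

With 5 personas and 5 styles, the same pairing yields 25 different contexts for a single passage.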

**Exercise 3c:**
> Enhance our prompt from part 2 to also include persona and query style. Then call Gemini once with this prompt based on a single persona and query style.

<details>
<summary>Click to see a hint</summary>

Structure your prompt like this:

```
"""
Write a query based on a passage, persona and query style.

## Passage

{passage}

## Context:

Persona: {persona}
Query style: {query_style}

## Criteria

Follow these criteria:
1. criterion 1
2. criterion 2
...
"""
```

</details>

In [None]:
persona_prompt = """write your prompt here"""

<details>
<summary>Click to see a hint</summary>

Pass all necessary inputs to your prompt:

```
prompt = persona_prompt.format(
    passage=passage, persona=personas[0], query_style=query_styles[0]
)
```

</details>

In [None]:
prompt = ...
print(prompt)

In [None]:
response = ...
print(response)

**Exercise 3d:**
> Run Gemini 5 times with this prompt on the same passage, but sample a different persona and query style each time you run it.

<details>
<summary>Click to see a hint</summary>

Sample a random item using `random.choice`:

```
persona = random.choice(personas)
```

</details>

In [None]:
...

**Reflection:**
> What do you think? Are we already happy with the diversity and quality of the queries?

## 4. Few-Shot Generation

So far, we've made significant progress in generating diverse, relevant queries using passage grounding and persona-based techniques. Our queries are now much more varied in style, complexity, and perspective. However, we're still relying heavily on the LLM to interpret our instructions, personas, and query styles correctly.

To gain more control over the generation process and further improve quality, we implement few-shot learning. This technique involves providing the LLM with carefully curated examples of exactly what we want it to produce and is frequently used in papers such as InPars (2022), Promptagator (2022), SWIM-IR (2024) and Gecko (2024).

By showing the model high-quality examples that demonstrate the desired output format, style, and quality, we can:
1. **Increase consistency** - Examples provide concrete guidance on expected output format and quality
2. **Improve adherence to criteria** - Seeing examples helps the model better understand our quality criteria
3. **Reduce misinterpretations** - Examples clarify how personas and query styles should be applied
4. **Raise the quality bar** - Well-crafted examples set a higher standard for the generated queries

Let's enhance our prompt with a few carefully selected examples that demonstrate the kind of high-quality, diverse queries we want to generate.

**Load example data**

Load the 4 few-shot examples we have prepared. For your own use case, you can either write few-shot examples by hand or generate them with the help of an LLM. However, make sure to review them carefully and if necessary, filter or improve them to ensure high quality.

*Note that you need to provide not only example queries, but all additional inputs to the prompt, such as the passage, persona, and query style.*

In [None]:
dfe = pd.read_csv("fewshot_examples.csv")
dfe.head()

In [None]:
def format_few_shot_examples(examples, k=None):
    """
    Format a list of few-shot examples into a markdown-formatted string for prompts.

    Args:
        examples: List of dictionaries containing 'passage', 'persona', 'query_style', and 'query'
        k: Optional number of examples to randomly sample (if None, use all examples)

    Returns:
        A markdown-formatted string containing the examples
    """
    # If k is specified, randomly sample k examples
    if k is not None and k < len(examples):
        examples = random.sample(list(examples), k)

    formatted_examples = []
    for i, example in enumerate(examples):
        example_text = f"## Example {i+1}\n\n"
        example_text += f"### Passage:\n\n{example['passage']}\n\n"
        example_text += f"### Context:\n\nPersona: {example['persona']}\nQuery style: {example['query_style']}\n\n"
        example_text += f"### Generated Query: {example['query']}"
        formatted_examples.append(example_text)

    return "\n\n".join(formatted_examples)

We format our few-shot examples as a single string using the function above.

In [None]:
examples = dfe.to_dict(orient="records")
examples_formatted = format_few_shot_examples(examples)
print(examples_formatted)
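
The optional `k` argument of `format_few_shot_examples` relies on `random.sample`, which draws items without replacement. Here is a minimal sketch of that step on placeholder data. The `toy_examples` records are made up for illustration and are not the workshop's few-shot examples.

```python
import random

# Placeholder records with the keys that format_few_shot_examples expects.
toy_examples = [
    {
        "passage": f"passage {i}",
        "persona": f"persona {i}",
        "query_style": f"style {i}",
        "query": f"query {i}",
    }
    for i in range(4)
]

random.seed(42)  # only to make this sketch repeatable
subset = random.sample(toy_examples, 2)  # 2 distinct examples, no repeats

for example in subset:
    print(example["query"])
```

Removing the seed gives a different subset on each call, which is what varies the prompt between generations.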

**Exercise 4a:**
> Enhance our prompt from 3c to also include few-shot examples. Print out the full final prompt and then call Gemini once with this prompt.

**Tip:**
> You can further increase the diversity of the data generation process by randomly sampling a subset of examples.

<details>
<summary>Click to see a hint</summary>

Structure your prompt like this:

```
"""
Write a query based on a passage, persona and query style.

# Criteria:

Follow these criteria
1. criterion 1
2. criterion 2
...

# Examples

{examples_formatted}

# Your task

Now generate a query about the following passage.

### Passage:

{passage}

### Context:

Persona: {persona}
Query style: {query_style}

### Generated Query:"""
```

</details>

In [None]:
fewshot_prompt = """write your prompt here"""

In [None]:
prompt = ...
print(prompt)

In [None]:
response = ...
print(response)

**Exercise 4b:**
> As before, run 5 iterations with the same passage, and sample a different persona and query style each time you run it. However, at each iteration call both `persona_prompt` and `fewshot_prompt` so that we're able to directly compare them. Also, store the generated queries, along with their sampled personas and query styles in a list, as we'll use them in the next section.

In [None]:
results = []

print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")

for i in range(5):
    # Sample a random persona and query style
    ...

    # Format the prompts with the same passage, persona, and query style
    ...

    # Generate responses from both prompts
    ...
    time.sleep(0.5)  # Avoid overloading the API

    # Store results
    results.append({
        "persona": ...,
        "query_style": ...,
        "persona_query": ...,
        "fewshot_query": ...
    })

    # Print results for comparison
    ...

**Reflection:**
> What do you notice about the queries generated by the two prompts? Which one do you think is better?

## 5. Quality Filtering

We've now explored several techniques for generating diverse, high-quality synthetic queries. By combining passage grounding, persona-based generation, and few-shot examples, we've significantly improved both the quality and diversity of our synthetic data. However, even with these advanced techniques, not every generated query will meet our standards.

A crucial final step in synthetic data generation is quality filtering. As highlighted in the Answer.AI blog:

> "To address these concerns, let’s use another prompt. It will evaluate and filter the generations. We’ll use the 5-point scoring system in The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. It proved most effective at evaluating the quality of data."

The FineWeb paper introduced an additive scoring approach where points are accumulated based on satisfying specific quality criteria. The LLM is instructed to first write a critique of the generated example and then score it based on the provided scoring system.
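
To make the additive idea concrete, here is a small sketch. It is not the FineWeb or Answer.AI implementation: in practice the LLM judges each criterion inside its critique and reports only the total. The criterion names below are hypothetical stand-ins.

```python
# Hypothetical per-criterion judgments for one generated query.
# In the real pipeline the LLM makes these judgments inside its critique.
judgments = {
    "relevant_to_passage": True,
    "answerable_from_passage": True,
    "specific": True,
    "matches_persona": False,
    "matches_query_style": False,
}

# Additive scoring: every satisfied criterion contributes exactly one point.
score = sum(judgments.values())
print(score)  # 3
```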

Here is the prompt used in the blog post:

In [None]:
eval_prompt_template = """
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

<translation>
{translation}
</translation>

After examining the translation:

- Briefly justify your total score in a single line.
- Conclude with the score of the translation."""

**Exercise 5a:**
> Rewrite the scoring instructions above to fit our use case. Reuse the quality criteria you wrote earlier. We have already prefilled the rest for you.

**Tip:**
> Asking the model to return its response in JSON format will simplify parsing and postprocessing.

In [None]:
quality_filter_prompt = """
Write your scoring instructions here

## Passage:

{passage}

## Context:
Persona: {persona}
Query style: {query_style}

## Generated Query: {query}

After examining the query:

- Briefly justify your total score in a single line.
- Conclude with the score of the query (1-5).

Return the evaluation in valid JSON format with double quotes:

Evaluation = {{"critique": str, "score": int}}
Return: Evaluation
"""
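
The doubled braces around the `Evaluation` schema are needed because the template is later filled in with `str.format`, which treats single braces as placeholders. Here is a short sketch of that behaviour on a shortened, made-up template:

```python
# Shortened stand-in for the template above (not the full prompt).
template = 'Query: {query}\nEvaluation = {{"critique": str, "score": int}}'

# str.format fills {query} and turns {{ and }} into literal single braces.
filled = template.format(query="What is a high-risk AI system?")
print(filled)
```

Without the doubling, `str.format` would try to look up a field named `"critique"` and raise a `KeyError`.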

In [None]:
prompt = ...
print(prompt)

In [None]:
response = ...
print(response)

We can use this function to parse the response.

In [None]:
import json

def extract_json(response_text):
    """Extract and parse JSON from LLM response text, handling various formats."""
    try:
        # First try to extract JSON from markdown code blocks if present
        if "```" in response_text:
            # Extract content between code blocks
            json_text = response_text.split("```")[1]
            # Remove language indicator if present
            if json_text.startswith("json"):
                json_text = json_text[4:].strip()
        else:
            json_text = response_text.strip()

        # Parse the JSON
        return json.loads(json_text)
    except (json.JSONDecodeError, IndexError) as e:
        print(f"Error parsing JSON: {e}")
        print(f"Raw response: {response_text}")
        # Return a default value in case of error
        return {"critique": "Error parsing response", "score": 0}

In [None]:
extract_json(response)
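
To see what the fence handling in `extract_json` does, the sketch below walks through the same steps on a made-up response. The sample text is invented for illustration, and `FENCE` is only a helper constant so the example can avoid writing literal triple backticks.

```python
import json

FENCE = "`" * 3  # three backticks, built indirectly to keep this block readable

# Made-up model output wrapped in a markdown code fence, as LLMs often return it.
sample_response = (
    f'{FENCE}json\n'
    '{"critique": "Specific and answerable.", "score": 4}\n'
    f'{FENCE}'
)

# Same steps as extract_json: take the text between the first pair of fences,
# then drop the optional "json" language tag.
inner = sample_response.split(FENCE)[1]
if inner.startswith("json"):
    inner = inner[4:].strip()

evaluation = json.loads(inner)
print(evaluation["score"])
```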

**Exercise 5b:**
> Now run the prompt on all results you stored in the previous section, parse the responses and store the critiques and scores together with the previous results. To simplify things we already wrote this part for you.

In [None]:
# Print the passage so we can easily check it
print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")

# Evaluate all queries and store results
for i, result in enumerate(results):
    print(f"Evaluating query {i+1}/{len(results)}...\n")

    # Evaluate persona-based query
    persona_filter_prompt = quality_filter_prompt.format(
        passage=passage,
        persona=result["persona"],
        query_style=result["query_style"],
        query=result["persona_query"]
    )
    persona_response = generate_text(persona_filter_prompt, system_instructions=system_prompt)

    # Evaluate few-shot query
    fewshot_filter_prompt = quality_filter_prompt.format(
        passage=passage,
        persona=result["persona"],
        query_style=result["query_style"],
        query=result["fewshot_query"]
    )
    fewshot_response = generate_text(fewshot_filter_prompt, system_instructions=system_prompt)
    time.sleep(0.5)

    # Parse and store evaluations
    try:
        persona_eval = extract_json(persona_response)
        fewshot_eval = extract_json(fewshot_response)

        # Add evaluations to results
        results[i]["persona_critique"] = persona_eval.get("critique", "Error")
        results[i]["persona_score"] = persona_eval.get("score", 0)
        results[i]["fewshot_critique"] = fewshot_eval.get("critique", "Error")
        results[i]["fewshot_score"] = fewshot_eval.get("score", 0)

        # Print personas and query styles
        print(f"  PERSONA: {results[i]['persona']}")
        print(f"  QUERY STYLE: {results[i]['query_style']}")

        # Print scores during generation
        print(f"\n  ZERO-SHOT SCORE: {results[i]['persona_score']}")
        print(f"     Query: {results[i]['persona_query'].strip()}")
        print(f"     Critique: {results[i]['persona_critique'].strip()}")
        print(f"\n  FEW-SHOT SCORE: {results[i]['fewshot_score']}")
        print(f"     Query: {results[i]['fewshot_query'].strip()}")
        print(f"     Critique: {results[i]['fewshot_critique'].strip()}")

    except Exception as e:
        print(f"Error processing result {i}: {e}")

    print("\n" + "-"*80 + "\n")

# Calculate average scores
avg_persona_score = sum(r.get("persona_score", 0) for r in results) / len(results)
avg_fewshot_score = sum(r.get("fewshot_score", 0) for r in results) / len(results)

print("=" * 80)
print("AVERAGE SCORES:")
print(f"Zero-shot queries: {avg_persona_score:.2f}")
print(f"Few-shot queries: {avg_fewshot_score:.2f}")
print("=" * 80)
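
The scores can then drive the actual filtering step. Here is a minimal sketch, assuming a cutoff of 4 (an arbitrary choice for illustration that should be tuned on your own data) and simplified placeholder records rather than the real entries in `results`:

```python
MIN_SCORE = 4  # assumed cutoff, not a recommendation from the source

# Placeholder records; real ones would be built from the `results` list.
scored_queries = [
    {"query": "placeholder query A", "score": 5},
    {"query": "placeholder query B", "score": 2},
    {"query": "placeholder query C", "score": 4},
    {"query": "placeholder query D", "score": 0},  # 0 marks a parsing failure
]

# Keep only queries that reach the cutoff; failed parses (score 0) drop out too.
kept = [item for item in scored_queries if item["score"] >= MIN_SCORE]
print([item["query"] for item in kept])
```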

## Conclusion

In this notebook, we've explored a comprehensive approach to synthetic query generation for specialized domains like the EU AI Act. We've progressed from basic generation to increasingly sophisticated techniques:

1. **Basic generation** demonstrated the limitations of simple prompting
2. **Grounded generation** improved relevance by anchoring queries to specific passages
3. **Persona-based generation** enhanced diversity through varied perspectives and styles
4. **Few-shot learning** provided more control over output quality and format
5. **Quality filtering** ensured only the best queries make it into our final dataset

These techniques allow us to create synthetic queries that are both high-quality and diverse.

However, generating good queries is only the first step in creating a robust training dataset for embedding models. In the next notebook, we'll explore the crucial process of mining positive and negative examples - identifying which passages truly answer our queries and which ones don't. This step is essential for creating the clean, well-structured training data needed for effective embedding model fine-tuning.