# Synthetic Query Generation

Synthetic data generation has become a crucial technique in AI development, especially when working with specialized domains where obtaining high-quality human-labeled data is challenging.

As highlighted in a recent [Answer.AI blog post](https://www.answer.ai/posts/2024-10-15-how-to-synthesize-data.html), the key to effective synthetic data lies in balancing two critical factors:

1. **Quality** - Ensuring the generated queries are accurate, relevant, and useful
2. **Diversity** - Creating a wide range of query types, styles, and perspectives

If you're interested in synthetic data generation in general, we highly recommend checking out this blog post. We will implement a similar process targeted at query generation in this notebook.

In this notebook, we will get to the main topic of this workshop: *synthetic* ***query*** *generation*. While some methods focus on generating both queries and passages, in most real-world settings you will already have some corpus of documents or passages you want to retrieve data from. Thus, what you still need to collect, write, or generate is a set of realistic and diverse user queries.

We'll walk through a progressive approach to generating synthetic queries based on the EU AI Act:
1. Start with basic query generation and demonstrate its limitations
2. Improve the results with grounded generation based on specific passages
3. Increase diversity using persona-based generation
4. Enhance quality through few-shot learning with carefully crafted examples
5. Finally, implement quality filtering to remove low-quality examples

By the end of this notebook, you'll have a comprehensive understanding of how to generate high-quality, diverse synthetic queries for fine-tuning embedding models on specialized domains.

## Setup

> ***Important:*** *As we won't need it in this notebook and usage is limited, make sure you are* ***not*** *using a GPU runtime. Click on `Runtime` > `Change runtime type` > Select `CPU` and Save.*

> *Also, to make sure there are no older sessions running, click on `Runtime` > `Manage sessions` > `Terminate other sessions`*

We will use the same setup as in notebook `01_intro.ipynb`.

In [None]:
!wget "https://drive.google.com/uc?export=download&id=1kTbWY9JJf0fFoqZGh6d-DRHQel6sT-9Y" -O ./sample_data.csv
!wget "https://drive.google.com/uc?export=download&id=1hBGWmXKW2LhMZ9rOd05UTg_aUp_nJ_Wt" -O ./fewshot_examples.csv

In [None]:
import os
import time
from typing import Dict, Any, Optional, List
from google import genai
from google.colab import userdata
from google.genai import types
from IPython.display import display, Markdown

os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")  # alternatively paste your key here
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

In [None]:
def generate_text(
    prompt: str,
    model: str = "gemini-2.0-flash",
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    system_instructions: Optional[str] = None
) -> str:
    """
    Generate text using Google's Gemini model with configurable parameters.

    Args:
        prompt: The user prompt to send to the model
        model: Model name to use (default: gemini-2.0-flash)
        temperature: Sampling temperature (0.0-2.0, lower is more deterministic)
        max_tokens: Maximum number of tokens to generate
        system_instructions: Optional system instruction to guide the model

    Returns:
        Generated text response as string
    """
    try:
        # Create config with only non-None parameters
        config_params = {}
        if temperature is not None:
            config_params["temperature"] = temperature
        if max_tokens is not None:
            config_params["max_output_tokens"] = max_tokens
        if system_instructions:
            config_params["system_instruction"] = system_instructions

        # Create the config object
        config = types.GenerateContentConfig(**config_params)

        # Generate content
        response = client.models.generate_content(
            model=model,
            contents=[prompt],
            config=config
        )

        return response.text
    except Exception as e:
        return f"Error generating text: {str(e)}"
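
Because `generate_text` returns an error string instead of raising, a failed call looks like a normal answer unless the caller checks for the prefix. The following is a minimal sketch, not part of the original notebook, of a retry helper with exponential backoff. The names `generate_with_retry` and `ERROR_PREFIX` and the parameters `max_attempts` and `base_delay` are hypothetical and introduced only for illustration.

```python
import time
from typing import Callable

# Assumption: this matches the prefix that generate_text returns on failure.
ERROR_PREFIX = "Error generating text:"


def generate_with_retry(
    generate_fn: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call generate_fn(prompt), retrying on empty responses or error strings."""
    result = ""
    for attempt in range(max_attempts):
        result = generate_fn(prompt)
        if result and not result.startswith(ERROR_PREFIX):
            return result
        # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)
    # All attempts failed: hand the last result back to the caller.
    return result or ""
```

In the notebook, `generate_fn` could be a small lambda that forwards the prompt to `generate_text` together with whatever keyword arguments are needed.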

## 1. Basic Query Generation

Let's start with the most basic approach to generating synthetic queries: simply asking an LLM to generate questions about the EU AI Act. This method requires minimal setup and can quickly produce a set of queries to serve as a starting point.

We'll use Gemini to generate some initial queries about the EU AI Act without providing any specific context from the actual documents.

**Exercise 1a:**
> Write a short system prompt that tells Gemini more about the application we're building.

In [None]:
system_prompt = """You are a highly skilled data annotator working on a Q&A application focused on the European Union's AI Act.

We are training a custom embedding model to power this application, and we need your expertise to generate realistic user queries that people might ask about this legislation."""

**Exercise 1b:**
> Now write a simple prompt that asks the LLM to generate a question about the EU's AI Act. Then call Gemini with this prompt as well as the system prompt you wrote above. Note that you might need to add an instruction to tell it only to return the query and nothing else.

In [None]:
basic_prompt = """Generate a single, realistic question that a user might ask about the EU AI Act.

Return only the question text without any additional explanation, commentary, or quotation marks."""

# Generate a single question
response = generate_text(basic_prompt, system_instructions=system_prompt)
print(response)

What are the potential fines for companies that violate the AI Act?



**Exercise 1c:**
> Now run the same prompt 5 times.

In [None]:
for i in range(5):
    response = generate_text(basic_prompt, system_instructions=system_prompt)
    print(response)

How does the AI Act define general-purpose AI?

What are the penalties for non-compliance with the AI Act?

What are the penalties for violating the AI Act?

What are the penalties for non-compliance with the AI Act?

What are the penalties for non-compliance with the AI Act?



**Reflection:**
> How would you evaluate the quality of these queries?

While some generated queries may already be reasonable, we can observe several limitations:

1. **Limited thematic diversity** - The model generates similar questions across different runs, focusing on the same few high-level themes
2. **Lack of specificity** - Questions are broad and generic rather than targeting specific details in the AI Act
3. **Limited stylistic diversity** - All questions follow a similar format and complexity level, missing the variety of how real users might phrase their queries
4. **Missing context and accuracy issues** - Without grounding in the actual text, questions may not align with the actual content, terminology, or current state of the EU AI Act

We also noticed that the queries are *very sensitive to the prompt* and small reformulations can lead to completely different questions.
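
To make the first limitation measurable, one simple option is to count how many of the generated queries are actually distinct. The sketch below is an illustration rather than part of the original workshop code; `normalize` and `unique_ratio` are hypothetical helper names, and the input list is copied from the Exercise 1c output above.

```python
from collections import Counter
from typing import List


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(query.lower().split())


def unique_ratio(queries: List[str]) -> float:
    """Fraction of distinct queries after normalization (1.0 = all different)."""
    if not queries:
        return 0.0
    return len({normalize(q) for q in queries}) / len(queries)


# The five queries printed in Exercise 1c.
queries_1c = [
    "How does the AI Act define general-purpose AI?",
    "What are the penalties for non-compliance with the AI Act?",
    "What are the penalties for violating the AI Act?",
    "What are the penalties for non-compliance with the AI Act?",
    "What are the penalties for non-compliance with the AI Act?",
]

print(unique_ratio(queries_1c))  # 0.6 (3 distinct queries out of 5)
# Show the most frequent query and how often it occurred.
print(Counter(normalize(q) for q in queries_1c).most_common(1))
```

Exact matching only catches verbatim repeats; the two "penalties" variants still count as different even though they ask the same thing.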

## 2. Grounded Generation

To address the limitations we observed above, we'll now explore a more effective approach: grounding our query generation in actual passages from the EU AI Act.

This technique significantly improves both the quality and diversity of our synthetic queries by:
1. Ensuring queries are relevant to the actual content of the document
2. Naturally increasing diversity as different passages cover different aspects of the legislation
3. Incorporating accurate terminology and concepts from the source material

Let's load a few random passages from the AI Act.

In [None]:
import pandas as pd
df = pd.read_csv("sample_data.csv")
len(df)

100

In [None]:
passages = df.sample(5, random_state=10)["passage"].values.tolist()

Print out one passage, which we'll use for grounded generation:

In [None]:
passage = passages[0]
display(Markdown(passage))

Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.

In addition to the passage, we will also provide the LLM with a list of criteria which define a high quality query.

**Exercise 2a:**
> Write a list of criteria which define a high quality query. For example, a high quality query should be specific, relevant and answerable by the passage.

In [None]:
criteria = """
1. **Realistic** - The query should resemble something a real user might ask, using natural language and phrasing
2. **Relevant** - The query should be directly related to information contained in the passage
3. **Specific** - The query should focus on one specific topic or concept mentioned in the passage rather than being overly broad
4. **Answerable** - The passage should contain sufficient information to provide an answer to the query
5. **User-oriented** - The query should represent something a user would naturally ask *without* having seen the passage
6. **Original** - The query should not simply restate or rephrase the passage content, but use its own language
"""

**Exercise 2b:**
> Write a prompt that asks the LLM to generate a query based on the provided criteria and passage.

In [None]:
grounded_prompt = """Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act.

## Passage:

{passage}

## Criteria:

Make sure to follow these criteria which define a high quality query:
1. **Realistic** - The query should resemble something a real user might ask, using natural language and phrasing
2. **Relevant** - The query should be directly related to information contained in the passage
3. **Specific** - The query should focus on one specific topic or concept mentioned in the passage rather than being overly broad
4. **Answerable** - The passage should contain sufficient information to provide an answer to the query
5. **User-oriented** - The query should represent something a user would naturally ask *without* having seen the passage
6. **Original** - The query should not simply restate or rephrase the passage content, but use its own language

Return ONLY the query text, without any additional explanation, commentary, or quotation marks."""
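
The prompt above repeats the criteria inline even though Exercise 2a already stored them in `criteria`. One possible alternative, sketched here under that assumption, is to give the template a second placeholder so the criteria live in one place. `grounded_prompt_v2`, `build_grounded_prompt`, and `criteria_example` are hypothetical names used only for this illustration.

```python
# Abbreviated stand-in for the `criteria` string from Exercise 2a.
criteria_example = """
1. **Relevant** - The query should be directly related to information contained in the passage
2. **Answerable** - The passage should contain sufficient information to provide an answer to the query
"""

# Same wording as the grounded prompt, but with a {criteria} placeholder.
grounded_prompt_v2 = """Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act.

## Passage:

{passage}

## Criteria:

Make sure to follow these criteria which define a high quality query:
{criteria}

Return ONLY the query text, without any additional explanation, commentary, or quotation marks."""


def build_grounded_prompt(passage: str, criteria: str) -> str:
    """Fill both placeholders; strip() avoids stray blank lines around the inserts."""
    return grounded_prompt_v2.format(
        passage=passage.strip(), criteria=criteria.strip()
    )


print(build_grounded_prompt("Article 99 - Penalties ...", criteria_example))
```

Keeping the criteria in a single variable means a later change to the list automatically reaches every prompt that uses it.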

**Tip:**
> Always print out the fully formatted prompt to make sure everything is correct.

In [None]:
prompt = grounded_prompt.format(passage=passage)
print(prompt)

Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act.

## Passage:

Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national la

In [None]:
response = generate_text(prompt, system_instructions=system_prompt)
print(response)

What factors are considered when determining the size of a fine under the EU AI Act?



**Exercise 2c:**
> Call Gemini 5 times with this prompt, each time using a different passage.

**Tip:**
> *Look at your data!* Always make sure to print out the context used for grounding, along with the generated query. Only if you can see all the relevant information yourself will you be able to judge a query's quality.

In [None]:
for i, current_passage in enumerate(passages, start=1):
    prompt = grounded_prompt.format(passage=current_passage)
    response = generate_text(prompt, system_instructions=system_prompt)
    print(f"# PASSAGE {i}:\n")
    display(Markdown(current_passage))
    print("\n# QUERY:", response.strip())
    print("\n" + "-"*80 + "\n")

# PASSAGE 1:



Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.


# QUERY: What factors are considered when determining the size of a fine under the AI Act?

--------------------------------------------------------------------------------

# PASSAGE 2:



Chapter IX - POST-MARKET MONITORING, INFORMATION SHARING AND MARKET SURVEILLANCE

Section 3 - Enforcement

Article 81 - Union safeguard procedure

1.   Where, within three months of receipt of the notification referred to in Article 79(5), or within 30 days in the case of non-compliance with the prohibition of the AI practices referred to in Article 5, objections are raised by the market surveillance authority of a Member State to a measure taken by another market surveillance authority, or where the Commission considers the measure to be contrary to Union law, the Commission shall without undue delay enter into consultation with the market surveillance authority of the relevant Member State and the operator or operators, and shall evaluate the national measure. On the basis of the results of that evaluation, the Commission shall, within six months, or within 60 days in the case of non-compliance with the prohibition of the AI practices referred to in Article 5, starting from the notification referred to in Article 79(5), decide whether the national measure is justified and shall notify its decision to the market surveillance authority of the Member State concerned. The Commission shall also inform all other market surveillance authorities of its decision.


# QUERY: What happens if a market surveillance authority objects to a measure taken by another member state's authority under the AI Act, and what role does the Commission play in resolving this?

--------------------------------------------------------------------------------

# PASSAGE 3:



Preamble

(100)When a general-purpose AI model is integrated into or forms part of an AI system, this system should be considered to be general-purpose AI system when, due to this integration, this system has the capability to serve a variety of purposes. A general-purpose AI system can be used directly, or it may be integrated into other AI systems.


# QUERY: Under the AI Act, how is a general-purpose AI system defined when it's integrated into another AI system?

--------------------------------------------------------------------------------

# PASSAGE 4:



Preamble

(66)Requirements should apply to high-risk AI systems as regards risk management, the quality and relevance of data sets used, technical documentation and record-keeping, transparency and the provision of information to deployers, human oversight, and robustness, accuracy and cybersecurity. Those requirements are necessary to effectively mitigate the risks for health, safety and fundamental rights. As no other less trade restrictive measures are reasonably available those requirements are not unjustified restrictions to trade.


# QUERY: What are the key requirements for high-risk AI systems under the AI Act?

--------------------------------------------------------------------------------

# PASSAGE 5:



ANNEX X

5.   European Travel Information and Authorisation System
(a) Regulation (EU) 2018/1240 of the European Parliament and of the Council of 12 September 2018 establishing a European Travel Information and Authorisation System (ETIAS) and amending Regulations (EU) No 1077/2011, (EU) No 515/2014, (EU) 2016/399, (EU) 2016/1624 and (EU) 2017/2226 (OJ L 236, 19.9.2018, p. 1).
(b) Regulation (EU) 2018/1241 of the European Parliament and of the Council of 12 September 2018 amending Regulation (EU) 2016/794 for the purpose of establishing a European Travel Information and Authorisation System (ETIAS) (OJ L 236, 19.9.2018, p. 72).
6.   European Criminal Records Information System on third-country nationals and stateless persons
Regulation (EU) 2019/816 of the European Parliament and of the Council of 17 April 2019 establishing a centralised system for the identification of Member States holding conviction information on third-country nationals and stateless persons (ECRIS-TCN) to supplement the European Criminal Records Information System and amending Regulation (EU) 2018/1726 (OJ L 135, 22.5.2019, p. 1).
7.   Interoperability
(a) Regulation (EU) 2019/817 of the European Parliament and of the Council of 20 May 2019 on establishing a framework for interoperability between EU information systems in the field of borders and visa and amending Regulations (EC) No 767/2008, (EU) 2016/399, (EU) 2017/2226, (EU) 2018/1240, (EU) 2018/1726 and (EU) 2018/1861 of the European Parliament and of the Council and Council Decisions 2004/512/EC and 2008/633/JHA (OJ L 135, 22.5.2019, p. 27).


# QUERY: How does the AI Act relate to the European Travel Information and Authorisation System (ETIAS)?

--------------------------------------------------------------------------------



**Reflection:**
> What do you think? Did this technique improve the generated questions?

The grounded generation approach has significantly improved our synthetic queries compared to the basic approach:
1. **Increased relevance** - Each query is now directly connected to specific content in the AI Act
2. **Much higher diversity** - Different passages naturally lead to different query topics
3. **Improved specificity** - Queries target particular aspects mentioned in the passages

And of course, another important advantage of using grounded generation is that we directly collect positive **query-passage pairs** which we can use for fine-tuning our embedding model later.

However, some limitations remain:
1. **Stylistic uniformity** - All queries follow a similar question format, complexity level, and formal tone
2. **Limited user perspectives** - The queries don't reflect the diverse backgrounds, knowledge levels, and intentions of real users
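
The query-passage pairs mentioned above can be collected while generating. The following is a minimal sketch of one way to do that, assuming a callable that maps a passage to a query; `collect_pairs` and `fake_make_query` are hypothetical names, and the stand-in generator only exists so the example runs without an API key.

```python
from typing import Callable, Dict, List


def collect_pairs(
    passages: List[str],
    make_query: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Generate one query per passage and keep them together as positive pairs."""
    pairs = []
    for passage in passages:
        query = (make_query(passage) or "").strip()
        # Skip empty results and the error strings returned by generate_text.
        if not query or query.startswith("Error generating text:"):
            continue
        pairs.append({"query": query, "passage": passage})
    return pairs


# Stand-in generator (assumption) so the sketch runs offline.
def fake_make_query(passage: str) -> str:
    return f"What does this passage say about {passage.split()[0]}?"


pairs = collect_pairs(
    ["Penalties apply to operators.", "Annex X lists regulations."],
    fake_make_query,
)
print(pairs[0])
```

In the notebook, `make_query` could wrap `generate_text` with the formatted grounded prompt, and the resulting list of dictionaries can be passed to `pd.DataFrame` for inspection or export.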

**Exercise 2d:**
> Now call Gemini 5 times with this prompt, each time using the same passage.

In [None]:
prompt = grounded_prompt.format(passage=passage)
print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")
for i in range(5):
    response = generate_text(prompt, system_instructions=system_prompt)
    print(f"# QUERY {i+1}:", response)

# PASSAGE:



Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.


--------------------------------------------------------------------------------

# QUERY 1: What factors are considered when determining the size of a fine under the AI Act?

# QUERY 2: What factors are considered when determining the size of a fine under the AI Act?

# QUERY 3: What factors are considered when determining the amount of a fine under the EU AI Act?

# QUERY 4: What factors are considered when determining the size of a fine under the AI Act?

# QUERY 5: What factors are considered when determining the size of a fine under the AI Act?



Calling the model multiple times on the same passage still yields very little diversity: four of the five queries are word-for-word identical, and the fifth differs by only two words. Let's address this in the next section.
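
To put a number on the near-duplication in the five queries printed for Exercise 2d, one rough option is the mean pairwise Jaccard similarity over word sets. This is a sketch under simple assumptions (lowercased alphanumeric tokens, no stemming), not a method from the original notebook; `tokens`, `jaccard`, and `mean_pairwise_similarity` are hypothetical helper names.

```python
import re
from itertools import combinations
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set of a query."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_similarity(queries: List[str]) -> float:
    """Average Jaccard similarity over all unordered query pairs."""
    pairs = list(combinations(queries, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# The five queries printed for Exercise 2d.
queries_2d = [
    "What factors are considered when determining the size of a fine under the AI Act?",
    "What factors are considered when determining the size of a fine under the AI Act?",
    "What factors are considered when determining the amount of a fine under the EU AI Act?",
    "What factors are considered when determining the size of a fine under the AI Act?",
    "What factors are considered when determining the size of a fine under the AI Act?",
]

# Values close to 1.0 indicate near-duplicate queries.
print(f"{mean_pairwise_similarity(queries_2d):.3f}")
```

A lexical measure like this misses paraphrases that share few words, so it is best read as a quick sanity check rather than a full diversity metric.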

## 3. Persona-Based Generation

Our previous approach successfully grounded queries in relevant passages, but we still observed limited diversity in query styles and perspectives. To address this limitation, we'll now explore persona-based generation, a powerful technique for further enhancing the diversity of our synthetic queries.

Persona-based generation involves creating detailed character profiles (personas) that guide the LLM to generate content from specific perspectives. This approach was introduced in the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas".

The key benefits of persona-based generation for our query task include:

1. **Stylistic diversity** - Different personas use different language patterns, terminology, and complexity levels
2. **Varied perspectives** - Personas with different backgrounds approach topics with unique concerns and priorities
3. **Realistic variation** - Real users come from diverse backgrounds and have different levels of domain knowledge

Let's implement this approach by creating a set of personas with varying backgrounds, knowledge levels, and interests in the EU AI Act.

**Exercise 3a:**
> Create a list of diverse personas who might have questions about the EU AI Act. Consider including different professions, technical backgrounds, and reasons for interest in the legislation.

In [None]:
personas = [
    "A data protection officer at a large European enterprise implementing AI compliance",
    "A legal consultant who specializes in technology and intellectual property law",
    "A software developer specializing in machine learning applications",
    "A small business owner who develops software solutions for local retail stores",
    "A privacy advocate with a background in civil liberties and digital rights",
    "A journalist who covers technology trends for a mainstream news outlet",
    "A venture capital investor focusing on early-stage technology startups",
    "A municipal government official responsible for digital transformation initiatives",
    "A healthcare professional working with diagnostic technologies at a major hospital",
    "A university student majoring in computer science with an interest in ethics"
]

In addition to personas, we will also define a few different query styles to further increase diversity.

**Exercise 3b:**
> Create a list of different query styles. Try to make them as realistic and diverse as possible.

In [None]:
query_styles = [
    "Technical language with domain-specific terminology",
    "Simple direct question with basic vocabulary",
    "Informal conversational question",
    "Search engine keyword query without full sentence structure",
    "Academic/research-oriented inquiry with formal language",
    "Hypothetical scenario-based question"
]
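
With 10 personas and 6 query styles there are 60 possible combinations. Independent random draws can repeat a combination, so if distinct pairs are wanted, one option is to sample from the full product without replacement. The sketch below illustrates this; `sample_combinations` and the shortened demo lists are hypothetical and stand in for the lists defined above.

```python
import itertools
import random
from typing import List, Optional, Tuple


def sample_combinations(
    personas: List[str],
    query_styles: List[str],
    k: int,
    seed: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Draw k distinct (persona, query_style) pairs without replacement."""
    all_pairs = list(itertools.product(personas, query_styles))
    if k > len(all_pairs):
        raise ValueError(f"Only {len(all_pairs)} distinct combinations available")
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(all_pairs, k)


# Shortened stand-ins (assumption) for the personas and query styles above.
demo_personas = ["A journalist", "A software developer", "A privacy advocate"]
demo_styles = ["Informal conversational question", "Search engine keyword query"]

for persona, style in sample_combinations(demo_personas, demo_styles, k=4, seed=0):
    print(f"{persona} | {style}")
```

The pairs are guaranteed to be distinct as pairs, but the same persona can still appear more than once with different styles.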

**Exercise 3c:**
> Enhance our prompt from part 2 to also include persona and query style. Then call Gemini once with this prompt based on a single persona and query style.

In [None]:
persona_prompt = """Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act.

You will also be provided a persona and query style that the generated query should follow.

## Passage:

{passage}

## Context:

Persona: {persona}
Query style: {query_style}

## Criteria:

Make sure to follow these criteria which define a high quality query:
1. **Realistic** - The query should resemble something a real user might ask, using natural language and phrasing
2. **Relevant** - The query should be directly related to information contained in the passage
3. **Specific** - The query should focus on one specific topic or concept mentioned in the passage rather than being overly broad
4. **Answerable** - The passage should contain sufficient information to provide an answer to the query
5. **User-oriented** - The query should represent something a user would naturally ask *without* having seen the passage
6. **Original** - The query should not simply restate or rephrase the passage content, but use its own language

Return ONLY the query text, without any additional explanation, commentary, or quotation marks."""

In [None]:
prompt = persona_prompt.format(
    passage=passage, persona=personas[0], query_style=query_styles[0]
)
print(prompt)

Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act. You will also be provided a persona and query style that the generated query should follow.

## Passage:

Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been appli

In [None]:
response = generate_text(prompt, system_instructions=system_prompt)
print("# PERSONA:", personas[0], "\n")
print("# QUERY STYLE:", query_styles[0], "\n")
print("# GENERATED QUERY:", response)

# PERSONA: A data protection officer at a large European enterprise implementing AI compliance 

# QUERY STYLE: Technical language with domain-specific terminology 

# GENERATED QUERY: When assessing administrative fines under the AI Act, how does the regulation account for scenarios where an organization has already faced penalties for similar or related infringements under other EU or national laws?



**Exercise 3d:**
> Run Gemini 5 times with this prompt on the same passage, but sample a different persona and query style each time you run it.

In [None]:
import random
random.choice(personas)

'A healthcare professional working with diagnostic technologies at a major hospital'

In [None]:
print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")
for i in range(5):
    persona = random.choice(personas)
    query_style = random.choice(query_styles)
    prompt = persona_prompt.format(
        passage=passage, persona=persona, query_style=query_style
    )
    response = generate_text(prompt, system_instructions=system_prompt)
    print("# PERSONA:", persona, "\n")
    print("# QUERY STYLE:", query_style, "\n")
    print("# GENERATED QUERY:", response.strip())
    print("\n" + "-"*80 + "\n")

# PASSAGE:



Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.


--------------------------------------------------------------------------------

# PERSONA: A small business owner who develops software solutions for local retail stores 

# QUERY STYLE: Academic/research-oriented inquiry with formal language 

# GENERATED QUERY: Within the framework of Article 99, Paragraph 7 of the proposed AI Act, could you elucidate the specific factors that market surveillance authorities must consider when determining the imposition and magnitude of administrative fines for non-compliant AI systems, particularly concerning the interplay between the nature of the infringement, the operator's cooperation, and the mitigation of harm to affected individuals?

--------------------------------------------------------------------------------

# PERSONA: A software developer specializing in machine learning applications 

# QUERY STYLE: Technical language with domain-specific terminology 

# GENERATED QUERY: Under the AI Act, how are factors like model size and market

**Reflect:**
> What do you think? Are we already happy with the diversity and quality of the queries?

Our persona-based approach has successfully addressed the stylistic uniformity issue we observed earlier. By combining:

1. **Diverse personas** - Characters with different professional backgrounds, knowledge levels, and interests
2. **Varied query styles** - Different linguistic patterns from technical to conversational

We've achieved significantly greater diversity in our generated queries:

- **Language variation** - From technical terminology to casual, informal language
- **Complexity differences** - From simple, direct questions to multi-part hypothetical scenarios
- **Perspective diversity** - Different personas focus on different aspects that matter to them (compliance, investment implications, etc.)

Even when working with the same passage, the combination of personas and query styles produces remarkably different queries. This approach maintains the relevance benefits of grounded generation while adding the natural diversity we see in real-world user queries.

In practical terms, this means our synthetic dataset will better represent the wide range of ways users might seek information about the EU AI Act, leading to more robust embedding models that can handle diverse query formulations.
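
One practical caveat: `random.choice` draws the persona and the query style independently on every call, so a small batch can contain the same pair more than once. Below is a minimal sketch of one way to guarantee distinct pairs, by sampling without replacement from the Cartesian product. The short lists are stand-ins for the `personas` and `query_styles` defined earlier in the notebook, and `sample_combinations` is a hypothetical helper name, not part of the workshop code.

```python
import itertools
import random

# Stand-in values for illustration only; the notebook defines the full lists earlier.
personas = [
    "A data protection officer at a large European enterprise implementing AI compliance",
    "A software developer specializing in machine learning applications",
    "A small business owner who develops software solutions for local retail stores",
]
query_styles = [
    "Technical language with domain-specific terminology",
    "Simple direct question with basic vocabulary",
]


def sample_combinations(personas, query_styles, n, seed=None):
    """Return n distinct (persona, query_style) pairs (hypothetical helper)."""
    all_pairs = list(itertools.product(personas, query_styles))
    if n > len(all_pairs):
        raise ValueError(f"Only {len(all_pairs)} distinct pairs available, got n={n}")
    # A dedicated Random instance keeps the draw reproducible when a seed is given
    rng = random.Random(seed)
    return rng.sample(all_pairs, n)


for persona, query_style in sample_combinations(personas, query_styles, n=5, seed=0):
    print(persona, "|", query_style)
```

Each returned pair can then be passed to `persona_prompt.format(...)` in place of the two independent `random.choice` calls.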

## 4. Few-Shot Generation

So far, we've made significant progress in generating diverse, relevant queries using passage grounding and persona-based techniques. Our queries are now much more varied in style, complexity, and perspective. However, we're still relying heavily on the LLM to interpret our instructions, personas, and query styles correctly.

To gain more control over the generation process and further improve quality, we use few-shot learning: we provide the LLM with carefully curated examples of exactly what we want it to produce. The technique is used in papers such as InPars (2022), Promptagator (2022), SWIM-IR (2024) and Gecko (2024).

By showing the model high-quality examples that demonstrate the desired output format, style, and quality, we can:
1. **Increase consistency** - Examples provide concrete guidance on expected output format and quality
2. **Improve adherence to criteria** - Seeing examples helps the model better understand our quality criteria
3. **Reduce misinterpretations** - Examples clarify how personas and query styles should be applied
4. **Raise the quality bar** - Well-crafted examples set a higher standard for the generated queries

Let's enhance our prompt with a few carefully selected examples that demonstrate the kind of high-quality, diverse queries we want to generate.

**Load example data**

Load the 4 few-shot examples we have prepared. For your own use case, you can either write few-shot examples by hand or generate them with the help of an LLM. Either way, review them carefully and, if necessary, filter or improve them to ensure high quality.

*Note that you need to provide not only example queries, but all additional inputs to the prompt, such as the passage, persona, and query style.*

In [None]:
dfe = pd.read_csv("fewshot_examples.csv")

In [None]:
dfe.head()

Unnamed: 0,passage,persona,query_style,query
0,"Chapter IX - POST-MARKET MONITORING, INFORMATI...",A legal consultant who specializes in technolo...,Technical language with domain-specific termin...,What authority do EU regulatory bodies have to...
1,Preamble\n\n(174)Given the rapid technological...,A journalist who covers technology trends for ...,Simple direct question with basic vocabulary,How often will the EU evaluate if their AI reg...
2,Chapter II - PROHIBITED AI PRACTICES\n\nArticl...,A municipal government official responsible fo...,Search engine keyword query without full sente...,prohibited AI social scoring systems governmen...
3,ANNEX IV\n\n(b) the design specifications of t...,A software developer specializing in machine l...,Informal conversational question with filler w...,"Hey, so um, what kind of documentation do I ne..."


In [None]:
examples = dfe.to_dict(orient="records")

In [None]:
def format_few_shot_examples(examples, k=None):
    """
    Format a list of few-shot examples into a markdown-formatted string for prompts.

    Args:
        examples: List of dictionaries containing 'passage', 'persona', 'query_style', and 'query'
        k: Optional number of examples to randomly sample (if None, use all examples)

    Returns:
        A markdown-formatted string containing the examples
    """
    # If k is specified, randomly sample k examples
    if k is not None and k < len(examples):
        examples = random.sample(list(examples), k)

    formatted_examples = []
    for i, example in enumerate(examples):
        example_text = f"## Example {i+1}\n\n"
        example_text += f"### Passage:\n\n{example['passage']}\n\n"
        example_text += f"### Context:\n\nPersona: {example['persona']}\nQuery style: {example['query_style']}\n\n"
        example_text += f"### Generated Query: {example['query']}"
        formatted_examples.append(example_text)

    return "\n\n".join(formatted_examples)

**Exercise 4a:**
> Enhance our prompt from 3c to also include few-shot examples. Use the function above to format the examples as a single string. Print out the full final prompt and then call Gemini once with this prompt.

**Tip:**
> You can further increase the diversity of the data generation process by randomly sampling a subset of examples.

In [None]:
examples_formatted = format_few_shot_examples(examples, k=3)

In [None]:
print(examples_formatted)

## Example 1

### Passage:

ANNEX IV

(b) the design specifications of the system, namely the general logic of the AI system and of the algorithms; the key design choices including the rationale and assumptions made, including with regard to persons or groups of persons in respect of who, the system is intended to be used; the main classification choices; what the system is designed to optimise for, and the relevance of the different parameters; the description of the expected output and output quality of the system; the decisions about any possible trade-off made regarding the technical solutions adopted to comply with the requirements set out in Chapter III, Section 2;
(c) the description of the system architecture explaining how software components build on or feed into each other and integrate into the overall processing; the computational resources used to develop, train, test and validate the AI system;
(d) where relevant, the data requirements in terms of datasheets describing t

In [None]:
fewshot_prompt = """Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act. You will also be provided a persona and query style that the generated query should follow.

# Criteria:

Make sure to follow these criteria which define a high quality query:
1. **Realistic** - The query should resemble something a real user might ask, using natural language and phrasing
2. **Relevant** - The query should be directly related to information contained in the passage
3. **Specific** - The query should focus on one specific topic or concept mentioned in the passage rather than being overly broad
4. **Answerable** - The passage should contain sufficient information to provide an answer to the query
5. **User-oriented** - The query should represent something a user would naturally ask *without* having seen the passage
6. **Original** - The query should not simply restate or rephrase the passage content, but use its own language

# Examples

Here are some examples of high quality queries according to our criteria.

{examples_formatted}

# Your task

Generate a high quality query based on the following passage and context. Return ONLY the query text, without any additional explanation, commentary, or quotation marks.

### Passage:

{passage}

### Context:

Persona: {persona}
Query style: {query_style}

### Generated Query:"""

In [None]:
prompt = fewshot_prompt.format(
    examples_formatted=format_few_shot_examples(examples),
    passage=passage,
    persona=personas[0],
    query_style=query_styles[0]
)
print(prompt)

Based on the following passage, generate a query that a user might ask when seeking information about the EU's AI Act. You will also be provided a persona and query style that the generated query should follow.

# Criteria:

Make sure to follow these criteria which define a high quality query:
1. **Realistic** - The query should resemble something a real user might ask, using natural language and phrasing
2. **Relevant** - The query should be directly related to information contained in the passage
3. **Specific** - The query should focus on one specific topic or concept mentioned in the passage rather than being overly broad
4. **Answerable** - The passage should contain sufficient information to provide an answer to the query
5. **User-oriented** - The query should represent something a user would naturally ask *without* having seen the passage
6. **Original** - The query should not simply restate or rephrase the passage content, but use its own language

# Examples

Here are some ex

In [None]:
response = generate_text(prompt, system_instructions=system_prompt)
print("# PERSONA:", personas[0], "\n")
print("# QUERY STYLE:", query_styles[0], "\n")
print("# GENERATED QUERY:", response)

# PERSONA: A data protection officer at a large European enterprise implementing AI compliance 

# QUERY STYLE: Technical language with domain-specific terminology 

# GENERATED QUERY: Under the AI Act, what factors are considered when determining administrative fine amounts for non-compliance, especially concerning the operator's cooperation and implemented technical measures?



**Exercise 4b:**
> As before, run 5 iterations with the same passage, and sample a different persona and query style each time you run it. However, at each iteration call both `persona_prompt` and `fewshot_prompt` so that we're able to directly compare them. Also, store the generated queries, along with their sampled personas and query styles in a list, as we'll use them in the next section.

In [None]:
results = []

print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")

for i in range(5):
    # Sample a random persona and query style
    persona = random.choice(personas)
    query_style = random.choice(query_styles)

    # Format the prompts with the same passage, persona, and query style
    persona_prompt_formatted = persona_prompt.format(
        passage=passage, persona=persona, query_style=query_style
    )

    fewshot_prompt_formatted = fewshot_prompt.format(
        examples_formatted=format_few_shot_examples(examples),
        passage=passage,
        persona=persona,
        query_style=query_style
    )

    # Generate responses from both prompts
    persona_response = generate_text(persona_prompt_formatted, system_instructions=system_prompt)
    fewshot_response = generate_text(fewshot_prompt_formatted, system_instructions=system_prompt)
    time.sleep(0.5)

    # Store results
    results.append({
        "persona": persona,
        "query_style": query_style,
        "persona_query": persona_response.strip(),
        "fewshot_query": fewshot_response.strip()
    })

    # Print results for comparison
    print("# PERSONA:", persona)
    print("\n# QUERY STYLE:", query_style)
    print("\n# ZERO-SHOT QUERY:", persona_response.strip())
    print("\n# FEW-SHOT QUERY:", fewshot_response.strip())
    print("\n" + "-"*80 + "\n")

# PASSAGE:



Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.


--------------------------------------------------------------------------------

# PERSONA: A data protection officer at a large European enterprise implementing AI compliance

# QUERY STYLE: Simple direct question with basic vocabulary

# ZERO-SHOT QUERY: What things are considered when deciding the size of fines for breaking the AI Act?

# FEW-SHOT QUERY: What factors are considered when determining fines for AI Act violations?

--------------------------------------------------------------------------------

# PERSONA: A data protection officer at a large European enterprise implementing AI compliance

# QUERY STYLE: Simple direct question with basic vocabulary

# ZERO-SHOT QUERY: What things do they look at to decide how big a fine to give?

# FEW-SHOT QUERY: What factors are considered when determining AI Act fine amounts?

--------------------------------------------------------------------------------

# PERSONA: A legal consultant who specializes in technology and intellectua

**Reflect:**
> What do you notice about the queries generated by the two prompts? Which one do you think is better?

Few-shot learning provides us with another tool to steer the generation process by showing the model concrete examples of what we want.

However, when comparing the few-shot approach with our zero-shot persona-based approach, we observe:
1. **Similar quality level** - We don't see an obvious improvement in overall quality compared to our previous persona-based approach, which was already producing good results
2. **More consistent formatting** - The few-shot examples help ensure consistent output styles that match our examples
3. **More predictable outputs** - The generated queries tend to follow patterns similar to the examples provided

A major limitation here is that our 4 few-shot examples are themselves not diverse enough and don't cover all possible personas and query styles.

While not showing clear quality improvements in this case, few-shot learning remains valuable when:
- You have domain experts who can create high-quality example queries
- You need more precise control over output format and style
- You want to reduce variance in the generated outputs

Let's now move on to the final section of this notebook.

## 5. Quality Filtering

We've now explored several techniques for generating diverse, high-quality synthetic queries. By combining passage grounding, persona-based generation, and few-shot examples, we've significantly improved both the quality and diversity of our synthetic data. However, even with these advanced techniques, not every generated query will meet our standards.

A crucial final step in synthetic data generation is quality filtering. As highlighted in the Answer.AI blog:

> "To address these concerns, let’s use another prompt. It will evaluate and filter the generations. We’ll use the 5-point scoring system in The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. It proved most effective at evaluating the quality of data."

The FineWeb paper uses an additive scoring approach where points are accumulated based on satisfying specific quality criteria. The LLM is instructed to first write a critique of the generated example and then score it based on the provided scoring system.
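
To make the arithmetic of additive scoring concrete: each criterion is a yes/no check, and the score is the number of checks that pass. The sketch below only illustrates that sum. The criterion names are hypothetical labels based on the quality criteria from the few-shot prompt, and the values are set by hand, whereas in this notebook the judgments are made by the LLM.

```python
# Hypothetical per-criterion judgments for one query; in the notebook these
# decisions are made by the LLM judge, not by hand.
criteria_met = {
    "relevant": True,
    "specific": True,
    "answerable": True,
    "realistic": True,
    "user_oriented_and_original": False,
}


def additive_score(criteria_met):
    """Add one point per satisfied criterion (0 up to the number of criteria)."""
    return sum(1 for met in criteria_met.values() if met)


print(additive_score(criteria_met))  # 4
```

Note that a query failing every check scores 0, so the possible range with five criteria is 0 to 5.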

Here is the prompt used in the blog post:

In [None]:
eval_prompt_template = """
Below is an extract of a translation. Evaluate its quality as a senior translator would, considering its suitability for professional use. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the translation conveys the basic meaning of the source text, even if it includes some minor errors or awkward phrasing.
- Add another point if the translation is generally accurate but lacks refinement in style or fails to capture some nuances of the original. It might use inconsistent terminology or have occasional lapses in register.
- Award a third point if the translation is appropriate for professional use and accurately conveys key concepts of the source text. It demonstrates good understanding of both languages, though it may not be flawless or could include some slight inconsistencies. It resembles the work of a competent translator but may have room for improvement in fluency or precision.
- Grant a fourth point if the translation is highly accurate and reads naturally in the target language, exhibiting a consistent and appropriate style. It could be similar to the work of an experienced translator, offering faithful rendering of content and tone, with minimal errors, and effectively handling complex concepts or cultural references. The result is coherent, well-expressed, and valuable for its intended purpose.
- Bestow a fifth point if the translation is outstanding, demonstrating mastery of both source and target languages. It captures subtle nuances, maintains the author's voice and intent, and reads as if it were originally written in the target language. The translator has made excellent choices in dealing with challenging elements like wordplay, idiomatic expressions, or culture-specific content.

<translation>
{translation}
</translation>

After examining the translation:

- Briefly justify your total score in a single line.
- Conclude with the score of the translation."""

**Exercise 5a:**
> Rewrite the critique and scoring prompt above to fit our use case. Reuse the quality criteria you have written before. The prompt should take as input: passage, persona, query style and generated query which you saved in the previous exercise.

**Tip:**
> Ask the model to return its response in JSON format. This will simplify parsing and postprocessing.

In [None]:
quality_filter_prompt = """
Below is a synthetic query generated for the EU AI Act. Evaluate its quality as an expert in information retrieval would, considering its suitability for training embedding models. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the query is relevant, being directly related to information contained in the passage.
- Add another point if the query is specific, focusing on one particular topic or concept mentioned in the passage rather than being overly broad.
- Award a third point if the query is answerable, with the passage containing sufficient information to provide an answer.
- Grant a fourth point if the query is realistic given the persona and query style, using appropriate language, terminology, and phrasing that matches what someone with that background would use.
- Bestow a fifth point if the query is both user-oriented (representing something a user would naturally ask without having seen the passage) and original (not simply restating or rephrasing the passage content, but using its own language).

Be critical but fair in your evaluation and don't easily award points.

## Passage:

{passage}

## Context:
Persona: {persona}
Query style: {query_style}

## Generated Query: {query}

After examining the query:

- Briefly justify your total score in a single line.
- Conclude with the score of the query (0-5).

Return the evaluation in valid JSON format with double quotes:

Evaluation = {{"critique": str, "score": int}}
Return: Evaluation
"""

In [None]:
prompt = quality_filter_prompt.format(
    passage=passage,
    persona=results[0]["persona"],
    query_style=results[0]["query_style"],
    query=results[0]["persona_query"]
)
print(prompt)


Below is a synthetic query generated for the EU AI Act. Evaluate its quality as an expert in information retrieval would, considering its suitability for training embedding models. Use the additive 5-point scoring system described below. Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the query is relevant, being directly related to information contained in the passage.
- Add another point if the query is specific, focusing on one particular topic or concept mentioned in the passage rather than being overly broad.
- Award a third point if the query is answerable, with the passage containing sufficient information to provide an answer.
- Grant a fourth point if the query is realistic given the persona and query style, using appropriate language, terminology, and phrasing that matches what someone with that background would use.
- Bestow a fifth point if the query is both user-oriented (representing something a user would naturally ask without havin

In [None]:
response = generate_text(prompt, system_instructions=system_prompt)
print(response)

```json
{
  "critique": "The query is relevant, specific, answerable, and realistic for the persona; however, it directly reflects the passage's content rather than being a novel user question.",
  "score": 4
}
```


We can use this function to parse the response:

In [None]:
import json

def extract_json(response_text):
    """Extract and parse JSON from LLM response text, handling various formats."""
    try:
        # First try to extract JSON from markdown code blocks if present
        if "```" in response_text:
            # Extract content between code blocks
            json_text = response_text.split("```")[1]
            # Remove language indicator if present
            if json_text.startswith("json"):
                json_text = json_text[4:].strip()
        else:
            json_text = response_text.strip()

        # Parse the JSON
        return json.loads(json_text)
    except (json.JSONDecodeError, IndexError) as e:
        print(f"Error parsing JSON: {e}")
        print(f"Raw response: {response_text}")
        # Return a default value in case of error
        return {"critique": "Error parsing response", "score": 0}

In [None]:
extract_json(response)

{'critique': "The query is relevant, specific, answerable, and realistic for the persona; however, it directly reflects the passage's content rather than being a novel user question.",
 'score': 4}
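
Parsing succeeds whenever the response is valid JSON, but the judge may still return the score as a string or outside the expected range. Below is a minimal sketch of a validation step that could run on the parsed dictionary. `validate_evaluation` is a hypothetical helper, and the 0 to 5 range is an assumption derived from the five additive criteria.

```python
def validate_evaluation(evaluation, min_score=0, max_score=5):
    """Return a cleaned {"critique": str, "score": int} dict (hypothetical helper).

    Falls back to score 0 when the score is missing, non-numeric, or out of range.
    """
    if not isinstance(evaluation, dict):
        return {"critique": "Invalid evaluation", "score": 0}

    critique = str(evaluation.get("critique", "")).strip()
    try:
        # Accepts ints as well as numeric strings such as "4"
        score = int(evaluation.get("score"))
    except (TypeError, ValueError):
        return {"critique": critique or "Invalid score", "score": 0}

    if not min_score <= score <= max_score:
        return {"critique": critique or "Score out of range", "score": 0}
    return {"critique": critique, "score": score}


print(validate_evaluation({"critique": "Relevant and specific.", "score": "4"}))
```

Mapping every invalid case to score 0 mirrors the default that `extract_json` returns on a parsing error, so a downstream threshold filter drops these records automatically.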

**Exercise 5b:**
> Now run the prompt on all results you stored in the previous section, parse the results and store the critiques and scores.

**Tip:**
> Remember to always print out all necessary context so that you can fully understand the critiques and scores.

In [None]:
# Print the passage so we can easily check it
print("# PASSAGE:\n")
display(Markdown(passage))
print("\n" + "-"*80 + "\n")

# Evaluate all queries and store results
for i, result in enumerate(results):
    print(f"Evaluating query {i+1}/{len(results)}...\n")

    # Evaluate persona-based query
    persona_filter_prompt = quality_filter_prompt.format(
        passage=passage,
        persona=result["persona"],
        query_style=result["query_style"],
        query=result["persona_query"]
    )
    persona_response = generate_text(persona_filter_prompt, system_instructions=system_prompt)

    # Evaluate few-shot query
    fewshot_filter_prompt = quality_filter_prompt.format(
        passage=passage,
        persona=result["persona"],
        query_style=result["query_style"],
        query=result["fewshot_query"]
    )
    fewshot_response = generate_text(fewshot_filter_prompt, system_instructions=system_prompt)
    time.sleep(0.5)

    # Parse and store evaluations
    try:
        persona_eval = extract_json(persona_response)
        fewshot_eval = extract_json(fewshot_response)

        # Add evaluations to results
        results[i]["persona_critique"] = persona_eval.get("critique", "Error")
        results[i]["persona_score"] = persona_eval.get("score", 0)
        results[i]["fewshot_critique"] = fewshot_eval.get("critique", "Error")
        results[i]["fewshot_score"] = fewshot_eval.get("score", 0)

        # Print personas and query styles
        print(f"  PERSONA: {results[i]['persona']}")
        print(f"  QUERY STYLE: {results[i]['query_style']}")

        # Print scores during generation
        print(f"\n  ZERO-SHOT SCORE: {results[i]['persona_score']}")
        print(f"     Query: {results[i]['persona_query'].strip()}")
        print(f"     Critique: {results[i]['persona_critique'].strip()}")
        print(f"\n  FEW-SHOT SCORE: {results[i]['fewshot_score']}")
        print(f"     Query: {results[i]['fewshot_query'].strip()}")
        print(f"     Critique: {results[i]['fewshot_critique'].strip()}")

    except Exception as e:
        print(f"Error processing result {i}: {e}")

    print("\n" + "-"*80 + "\n")

# Calculate average scores
avg_persona_score = sum(r.get("persona_score", 0) for r in results) / len(results)
avg_fewshot_score = sum(r.get("fewshot_score", 0) for r in results) / len(results)

print("=" * 80)
print("AVERAGE SCORES:")
print(f"Zero-shot queries: {avg_persona_score:.2f}")
print(f"Few-shot queries: {avg_fewshot_score:.2f}")
print("=" * 80)

# PASSAGE:



Chapter XII - PENALTIES

Article 99 - Penalties

7.   When deciding whether to impose an administrative fine and when deciding on the amount of the administrative fine in each individual case, all relevant circumstances of the specific situation shall be taken into account and, as appropriate, regard shall be given to the following: (a) the nature, gravity and duration of the infringement and of its consequences, taking into account the purpose of the AI system, as well as, where appropriate, the number of affected persons and the level of damage suffered by them; (b) whether administrative fines have already been applied by other market surveillance authorities to the same operator for the same infringement; (c) whether administrative fines have already been applied by other authorities to the same operator for infringements of other Union or national law, when such infringements result from the same activity or omission constituting a relevant infringement of this Regulation; (d) the size, the annual turnover and market share of the operator committing the infringement; (e) any other aggravating or mitigating factor applicable to the circumstances of the case, such as financial benefits gained, or losses avoided, directly or indirectly, from the infringement; (f) the degree of cooperation with the national competent authorities, in order to remedy the infringement and mitigate the possible adverse effects of the infringement; (g) the degree of responsibility of the operator taking into account the technical and organisational measures implemented by it; (h) the manner in which the infringement became known to the national competent authorities, in particular whether, and if so to what extent, the operator notified the infringement; (i) the intentional or negligent character of the infringement; (j) any action taken by the operator to mitigate the harm suffered by the affected persons.


--------------------------------------------------------------------------------

Evaluating query 1/5...

  PERSONA: A data protection officer at a large European enterprise implementing AI compliance
  QUERY STYLE: Simple direct question with basic vocabulary

  ZERO-SHOT SCORE: 4
     Query: What things are considered when deciding the size of fines for breaking the AI Act?
     Critique: The query is relevant, specific, and answerable, using reasonable language for the persona, but it could be more original by not directly mirroring the passage's content.

  FEW-SHOT SCORE: 4
     Query: What factors are considered when determining fines for AI Act violations?
     Critique: The query is relevant, specific, and answerable, using appropriate terminology; however, it could be more original and less of a restatement of the passage's main point. Thus, only 4 points are awarded.

--------------------------------------------------------------------------------

Evaluating query 2/5...



Our LLM-as-a-judge implementation provides a systematic evaluation of each generated query. The additive scoring approach offers both numerical assessment and detailed feedback on specific strengths and weaknesses of each query.
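To make the idea of additive scoring concrete, here is a minimal sketch of how one point per satisfied criterion adds up to a score with a short critique. The criterion names are an assumption, inferred from the critiques printed above (relevance, specificity, answerability, persona-appropriate language, originality); the actual judge prompt may phrase or weight them differently, and `additive_score` is a hypothetical helper, not part of the notebook.

```python
from dataclasses import dataclass

# Hypothetical criteria, inferred from the critiques in the output above.
# The real judge prompt may use different criteria or wording.
CRITERIA = ("relevant", "specific", "answerable", "persona_appropriate", "original")


@dataclass
class Judgement:
    score: int
    critique: str


def additive_score(checks: dict[str, bool]) -> Judgement:
    """Award one point per satisfied criterion and name the ones that failed."""
    missing = [c for c in CRITERIA if not checks.get(c, False)]
    score = len(CRITERIA) - len(missing)
    critique = "All criteria met." if not missing else "Missing: " + ", ".join(missing)
    return Judgement(score=score, critique=critique)


# A query that meets every criterion except originality.
example = additive_score({
    "relevant": True,
    "specific": True,
    "answerable": True,
    "persona_appropriate": True,
    "original": False,
})
print(example.score, example.critique)
```

In practice the LLM judge makes these per-criterion decisions itself; the sketch only illustrates why a query that merely restates the passage ends up one point short of the maximum.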

As expected, the evaluation shows comparable scores for the persona-based zero-shot and few-shot approaches: in the first example above, both queries received a score of 4.

With these quality scores, we can now:
- Select the highest-scoring query for each persona and query style combination
- Filter our dataset to include only queries that meet or exceed our quality threshold

This filtering ensures only high-quality synthetic queries make it into our final training dataset.
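The two filtering steps could be sketched roughly as follows. The record layout, the field names, the threshold value, and the helpers `best_per_combination` and `filter_by_threshold` are all assumptions made for illustration. The first two records reuse the queries and scores printed above; the last two are placeholders with made-up scores.

```python
# Hypothetical record layout; the field names are illustrative assumptions.
scored_queries = [
    {"persona": "data protection officer", "style": "simple direct question",
     "method": "zero-shot", "score": 4,
     "query": "What things are considered when deciding the size of fines for breaking the AI Act?"},
    {"persona": "data protection officer", "style": "simple direct question",
     "method": "few-shot", "score": 4,
     "query": "What factors are considered when determining fines for AI Act violations?"},
    # Placeholder records with made-up scores, only to exercise the filter.
    {"persona": "placeholder persona", "style": "placeholder style",
     "method": "zero-shot", "score": 2, "query": "placeholder query 1"},
    {"persona": "placeholder persona", "style": "placeholder style",
     "method": "few-shot", "score": 3, "query": "placeholder query 2"},
]

QUALITY_THRESHOLD = 4  # assumed value; tune it for your own dataset


def best_per_combination(records):
    """Keep the highest-scoring record per (persona, style); ties keep the first seen."""
    best = {}
    for record in records:
        key = (record["persona"], record["style"])
        if key not in best or record["score"] > best[key]["score"]:
            best[key] = record
    return list(best.values())


def filter_by_threshold(records, threshold):
    """Keep only records whose score meets or exceeds the threshold."""
    return [r for r in records if r["score"] >= threshold]


best = best_per_combination(scored_queries)
final_dataset = filter_by_threshold(best, QUALITY_THRESHOLD)

for record in final_dataset:
    print(record["score"], record["method"], record["query"])
```

Selecting the best query per combination first preserves diversity across personas and styles, while the threshold then drops combinations for which even the best query is too weak.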

## Conclusion

In this notebook, we've explored a comprehensive approach to synthetic query generation for specialized domains like the EU AI Act. We've progressed from basic generation to increasingly sophisticated techniques:

1. **Basic generation** demonstrated the limitations of simple prompting
2. **Grounded generation** improved relevance by anchoring queries to specific passages
3. **Persona-based generation** enhanced diversity through varied perspectives and styles
4. **Few-shot learning** provided more control over output quality and format
5. **Quality filtering** ensured that only the best queries make it into our final dataset

These techniques allow us to create synthetic queries that are both high-quality and diverse.

However, generating good queries is only the first step in creating a robust training dataset for embedding models. In the next notebook, we'll explore the crucial process of mining positive and negative examples, that is, identifying which passages truly answer our queries and which ones don't. This step is essential for creating the clean, well-structured training data needed to fine-tune embedding models effectively.