Conversation

@czhiming-maker commented Sep 11, 2025

Issue Link / Problem Description

ragas version: 0.3.3

I ran into a problem when trying to generate multi-hop queries from some documents.

The console log shows a ThemesPersonasInput model validation error:
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ThemesPersonasInput
themes.0
Input should be a valid string [type=string_type, input_value=('车险代理', '车险'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.11/v/string_type
themes.1
Input should be a valid string [type=string_type, input_value=('交强险', '交强保险'), input_type=tuple]

I also found that someone has already reported this issue: #2274

Changes Made

The ThemesPersonasInput model defines the "themes" field as List[str], but the actual themes value is List[tuple[str, str]], which is why this error occurs.

I changed the "themes" field to List[t.Union[str, t.Tuple[str, str]]] to fix the error, because the generated themes value is always a list of tuples; the code below shows why:
https://github.com/explodinggradients/ragas/blob/main/src/ragas/testset/transforms/relationship_builders/traditional.py#L162
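
For illustration, a minimal sketch of the field change described above, assuming a pydantic v2 BaseModel; the real ThemesPersonasInput has other fields (such as personas) that are omitted here:

```python
import typing as t

from pydantic import BaseModel


class ThemesPersonasInput(BaseModel):
    # Before this PR the field was `themes: t.List[str]`, which rejects tuples.
    # Accepting tuples as well is the change proposed here; other fields of the
    # real model are omitted from this sketch.
    themes: t.List[t.Union[str, t.Tuple[str, str]]]


# A tuple such as ('车险代理', '车险') now validates instead of raising string_type errors.
print(ThemesPersonasInput(themes=[("车险代理", "车险"), "交强险"]))
```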

Testing

Manual testing steps:

  1. Apply the change described above.
  2. Generate a multi-hop query again.
  3. Generation succeeds; see the screenshot below.
[screenshot: successful multi-hop query generation]

Test code:

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import JSONLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os
os.environ["GOOGLE_API_KEY"] = "xxxx-O5-xxxxx"

loader = JSONLoader(
    file_path="data/documents.json",
    jq_schema=".[].content",
)
documents = loader.load()

generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-2.5-flash"))
generator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
))

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())
```

References

dosubot added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Sep 11, 2025
@czhiming-maker changed the title from "fix ThemesPersonasInput model error when generate multi-hop query" to "fix ThemesPersonasInput model validation error when generate multi-hop query" on Sep 11, 2025
@czhiming-maker changed the title from "fix ThemesPersonasInput model validation error when generate multi-hop query" to "fix: ThemesPersonasInput model validation error when generate multi-hop query" on Sep 12, 2025
@anistark (Member) left a comment

Thanks for the PR @czhiming-maker

This might break the definitions and type checkers.

Have you considered updating traditional.py to generate only strings instead of tuples?

Or we could do explicit modeling for it. Open for discussion if needed.

@czhiming-maker (Author) commented

> Thanks for the PR @czhiming-maker
>
> This might break the definitions and type checkers.
>
> Have you considered updating traditional.py to generate only strings instead of tuples?
>
> Or we could do explicit modeling for it. Open for discussion if needed.

Good idea.
I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings.
Thanks.
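
A rough, hypothetical sketch of that direction; the actual traditional.py logic is not reproduced here, the overlaps data is made up, and the flattening convention shown is only one possible choice:

```python
# Hypothetical flattening of overlapped items into plain strings, so the
# "themes" field can stay t.List[str]. The overlaps data below is illustrative.
overlaps = [("车险代理", "车险"), ("交强险", "交强保险")]

themes: list[str] = []
for a, b in overlaps:
    # Keep one readable string per overlap instead of the raw (a, b) tuple;
    # joining both sides is just one possible convention.
    themes.append(a if a == b else f"{a} / {b}")

print(themes)  # ['车险代理 / 车险', '交强险 / 交强保险']
```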

@anistark (Member) commented

> Good idea. I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings. Thanks.

Great. Let us know how it goes. Closing this PR for the time being. Feel free to open a new one if there are any other changes you'd want to add. :)

@czhiming-maker (Author) commented

> Good idea. I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings. Thanks.
>
> Great. Let us know how it goes. Closing this PR for the time being. Feel free to open a new one if there are any other changes you'd want to add. :)

OK, I have opened a new PR for this: #2313.
Thank you.

anistark pushed a commit that referenced this pull request on Oct 9, 2025 (#2347):

## Issue Link / Problem Description
- Fixes #2275
- I also encountered this problem in my test of the multi-hop specific query synthesizer:

  1 validation error for ThemesPersonasInput
  themes.0
    Input should be a valid string [type=string_type, input_value=['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type

- @czhiming-maker has not updated it in the two weeks since, so I'm giving it a try.


## Changes Made
- Add a helper function _extract_themes_from_overlaps, following the discussion in #2275 (a rough sketch is shown below).
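
Purely as an illustration, a hypothetical sketch of what such a helper might look like; the merged implementation may differ, and the accepted input shapes (strings, tuples, or lists of strings) are inferred from the error messages quoted above:

```python
import typing as t


def _extract_themes_from_overlaps(
    overlapped_items: t.Sequence[t.Union[str, t.Sequence[str]]],
) -> t.List[str]:
    """Flatten overlapped items into plain strings for ThemesPersonasInput.themes."""
    themes: t.List[str] = []
    for item in overlapped_items:
        if isinstance(item, str):
            themes.append(item)
        else:
            # Tuples/lists such as ('交强险', '交强保险') or
            # ['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'] become individual strings.
            themes.extend(str(part) for part in item)
    return themes


print(_extract_themes_from_overlaps([("交强险", "交强保险"), "car insurance"]))
# ['交强险', '交强保险', 'car insurance']
```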

## Testing
<!-- Describe how this should be tested -->
### How to Test

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os

loader = DirectoryLoader("./data/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Set your Azure OpenAI credentials
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_API_VERSION = "2024-12-01-preview" 
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini")
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT
)

# Initialize LLM with JSON mode enabled
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    model_kwargs={
        "response_format": {"type": "json_object"}  # Force clean JSON output
    }
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())
```

## References
- Related issues: #2275 


## Screenshots/Examples (if applicable)
Samples generation in test:

![Samples generated during testing](https://github.com/user-attachments/assets/890d14bc-cd31-4940-8ce3-df8bc5ea459b)


Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
anistark pushed a commit that referenced this pull request on Nov 17, 2025 (#2347).