Conversation

@czhiming-maker commented Sep 11, 2025

Issue Link / Problem Description

ragas version: 0.3.3

I ran into a problem when trying to generate multi-hop queries from some documents.

The console log shows a ThemesPersonasInput model validation error:
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ThemesPersonasInput
themes.0
Input should be a valid string [type=string_type, input_value=('车险代理', '车险'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.11/v/string_type
themes.1
Input should be a valid string [type=string_type, input_value=('交强险', '交强保险'), input_type=tuple]

I also found that someone has already reported this issue: #2274

Changes Made

The ThemesPersonasInput model defines the "themes" field as List[str], but the actual themes value is List[tuple[str, str]], which is why this error occurs.

I changed the "themes" field to List[t.Union[str, t.Tuple[str, str]]] to fix the error, because the generated themes value is always a list of tuples; the code below shows why:
https://github.com/explodinggradients/ragas/blob/main/src/ragas/testset/transforms/relationship_builders/traditional.py#L162
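
For illustration, a minimal sketch of the field change described above, assuming a pydantic v2 BaseModel; the real ThemesPersonasInput has other fields (such as personas) that are omitted here:

```python
import typing as t

from pydantic import BaseModel


class ThemesPersonasInput(BaseModel):
    # Before this PR the field was `themes: t.List[str]`, which rejects tuples.
    # Accepting tuples as well is the change proposed here; other fields of the
    # real model are omitted from this sketch.
    themes: t.List[t.Union[str, t.Tuple[str, str]]]


# A tuple such as ('车险代理', '车险') now validates instead of raising string_type errors.
print(ThemesPersonasInput(themes=[("车险代理", "车险"), "交强险"]))
```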

Testing

Manual testing steps:

  1. Apply the change described above.
  2. Generate a multi-hop query again.
  3. Generation succeeds; see the screenshot below.
[screenshot: successful multi-hop query generation]

Test code:

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import JSONLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os
os.environ["GOOGLE_API_KEY"] = "xxxx-O5-xxxxx"

loader = JSONLoader(
    file_path="data/documents.json",
    jq_schema=".[].content",
)
documents = loader.load()

generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-2.5-flash"))
generator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
))

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())
```

References

dosubot added the size:XS label (This PR changes 0-9 lines, ignoring generated files) on Sep 11, 2025
@czhiming-maker changed the title from "fix ThemesPersonasInput model error when generate multi-hop query" to "fix ThemesPersonasInput model validation error when generate multi-hop query" on Sep 11, 2025
@czhiming-maker changed the title from "fix ThemesPersonasInput model validation error when generate multi-hop query" to "fix: ThemesPersonasInput model validation error when generate multi-hop query" on Sep 12, 2025
@anistark (Member) left a comment

Thanks for the PR @czhiming-maker

This might break the definitions and type checkers.

Have you considered updating traditional.py to generate only strings instead of tuples?

Or we could do explicit modeling for it. Open for discussion if needed.

@czhiming-maker (Author) commented

> Thanks for the PR @czhiming-maker
>
> This might break the definitions and type checkers.
>
> Have you considered updating traditional.py to generate only strings instead of tuples?
>
> Or we could do explicit modeling for it. Open for discussion if needed.

Good idea.
I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings.
Thanks.
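
A rough, hypothetical sketch of that direction; the actual traditional.py logic is not reproduced here, the overlaps data is made up, and the flattening convention shown is only one possible choice:

```python
# Hypothetical flattening of overlapped items into plain strings, so the
# "themes" field can stay t.List[str]. The overlaps data below is illustrative.
overlaps = [("车险代理", "车险"), ("交强险", "交强保险")]

themes: list[str] = []
for a, b in overlaps:
    # Keep one readable string per overlap instead of the raw (a, b) tuple;
    # joining both sides is just one possible convention.
    themes.append(a if a == b else f"{a} / {b}")

print(themes)  # ['车险代理 / 车险', '交强险 / 交强保险']
```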

@anistark (Member) commented

> Good idea. I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings. Thanks.

Great. Let us know how it goes. Closing this PR for the time being. Feel free to open a new one if there are any other changes you'd want to add. :)

@czhiming-maker (Author) commented

> Good idea. I've decided not to change the original schema's structure definition. I'll replace the assignment of overlapped_items in traditional.py from tuples to strings. Thanks.
>
> Great. Let us know how it goes. Closing this PR for the time being. Feel free to open a new one if there are any other changes you'd want to add. :)

OK, I have opened a new PR for this: #2313.
Thank you.

anistark pushed a commit that referenced this pull request on Oct 9, 2025 (#2347):

## Issue Link / Problem Description
- Fixes #2275
- I also encountered this problem in my test of the multi-hop specific query synthesizer:

  1 validation error for ThemesPersonasInput
  themes.0
    Input should be a valid string [type=string_type, input_value=['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type

- @czhiming-maker has not updated it in the two weeks since, so I'm giving it a try.


## Changes Made
- Add a helper function _extract_themes_from_overlaps, following the discussion in #2275 (a rough sketch is shown below).
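
Purely as an illustration, a hypothetical sketch of what such a helper might look like; the merged implementation may differ, and the accepted input shapes (strings, tuples, or lists of strings) are inferred from the error messages quoted above:

```python
import typing as t


def _extract_themes_from_overlaps(
    overlapped_items: t.Sequence[t.Union[str, t.Sequence[str]]],
) -> t.List[str]:
    """Flatten overlapped items into plain strings for ThemesPersonasInput.themes."""
    themes: t.List[str] = []
    for item in overlapped_items:
        if isinstance(item, str):
            themes.append(item)
        else:
            # Tuples/lists such as ('交强险', '交强保险') or
            # ['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'] become individual strings.
            themes.extend(str(part) for part in item)
    return themes


print(_extract_themes_from_overlaps([("交强险", "交强保险"), "car insurance"]))
# ['交强险', '交强保险', 'car insurance']
```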

## Testing
<!-- Describe how this should be tested -->
### How to Test

```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os

loader = DirectoryLoader("./data/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Set your Azure OpenAI credentials
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_API_VERSION = "2024-12-01-preview" 
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini")
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT
)

# Initialize LLM with JSON mode enabled
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    model_kwargs={
        "response_format": {"type": "json_object"}  # Force clean JSON output
    }
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())
```

## References
- Related issues: #2275 


## Screenshots/Examples (if applicable)
Samples generation in test:

![Samples generated during testing](https://github.com/user-attachments/assets/890d14bc-cd31-4940-8ce3-df8bc5ea459b)


Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
anistark pushed a commit that referenced this pull request on Nov 17, 2025 (#2347).