
@czhiming-maker czhiming-maker commented Sep 25, 2025

contd... #2275

Issue Link / Problem Description

ragas version: 0.3.3

I ran into a problem when trying to generate multi-hop queries from some documents.

The console log shows a validation error for the ThemesPersonasInput model:
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ThemesPersonasInput
themes.0
Input should be a valid string [type=string_type, input_value=('车险代理', '车险'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.11/v/string_type
themes.1
Input should be a valid string [type=string_type, input_value=('交强险', '交强保险'), input_type=tuple]

I also found that someone has already reported this issue: #2274

Changes Made

I found that the ThemesPersonasInput model defines the "themes" field as List[str], but the actual themes value is a List[tuple[str, str]], which is why this error occurs.

So I replaced the assignment of overlapped_items in traditional.py to pass strings instead of tuples, fixing the validation error on the ThemesPersonasInput "themes" field.
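The mismatch is easy to reproduce outside of ragas. Below is a minimal sketch, assuming pydantic v2, using a stand-in model that mirrors only the `themes: List[str]` field of the real `ThemesPersonasInput`:

```python
from typing import List

from pydantic import BaseModel, ValidationError


# Stand-in for the real ragas model; only the `themes` field is needed
# to reproduce the reported error.
class ThemesPersonasInput(BaseModel):
    themes: List[str]


# Passing entity pairs (tuples) where strings are expected fails validation,
# producing one string_type error per list item, as in the log above.
try:
    ThemesPersonasInput(themes=[("车险代理", "车险"), ("交强险", "交强保险")])
except ValidationError as exc:
    err = exc

print(err.error_count())  # 2
```

Each tuple in the list triggers its own `string_type` error (`themes.0`, `themes.1`), matching the traceback in the description.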

Testing

Manual testing steps:

  1. apply the code change described above
  2. generate multi-hop queries again
  3. the generation now completes successfully

Test code:

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import JSONLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os
os.environ["GOOGLE_API_KEY"] = "xxxx-O5-xxxxx"

loader = JSONLoader(
    file_path="data/documents.json",
    jq_schema=".[].content",
)
documents = loader.load()

generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-2.5-flash"))
generator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
))

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())

References

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Sep 25, 2025
@anistark
Member

anistark commented Oct 3, 2025

Thanks for the PR @czhiming-maker

While this is a better approach, the original data structure seems to have been lost: entity pairs are no longer handled.
Think of the different formats:

  • tuples: [("car insurance", "车险")]
  • strings: ["car insurance"]
  • dicts: {"car insurance": [...]}

Here's what I'm thinking:

Add a helper method to make the intent clear. In specific.py, add this method to the MultiHopSpecificQuerySynthesizer class:

  def _extract_themes_from_overlaps(
      self, overlapped_items: t.Any
  ) -> t.List[str]:
      """
      Extract unique entity names from overlapped items.

      Handles multiple formats:
      - List[Tuple[str, str]]: Entity pairs from overlap detection
      - List[str]: Direct entity names
      - Dict[str, Any]: Keys as entity names
      """
      if isinstance(overlapped_items, dict):
          return list(overlapped_items.keys())

      if not isinstance(overlapped_items, list):
          return []

      unique_entities = set()
      for item in overlapped_items:
          if isinstance(item, tuple):
              # Extract both entities from the pair
              for entity in item:
                  if isinstance(entity, str):
                      unique_entities.add(entity)
          elif isinstance(item, str):
              unique_entities.add(item)

      return list(unique_entities)

Then in traditional.py, replace the assignment with:

themes = self._extract_themes_from_overlaps(overlapped_items)
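For illustration, here is the suggested helper as a standalone function (the module-level name `extract_themes_from_overlaps` is mine, for demonstration only), showing that all three formats normalize to the flat list of strings that the "themes" field expects:

```python
import typing as t


# Standalone version of the proposed helper, lifted out of the
# synthesizer class so it can be exercised directly.
def extract_themes_from_overlaps(overlapped_items: t.Any) -> t.List[str]:
    if isinstance(overlapped_items, dict):
        return list(overlapped_items.keys())

    if not isinstance(overlapped_items, list):
        return []

    unique_entities = set()
    for item in overlapped_items:
        if isinstance(item, tuple):
            # Extract both entities from the pair
            for entity in item:
                if isinstance(entity, str):
                    unique_entities.add(entity)
        elif isinstance(item, str):
            unique_entities.add(item)

    return list(unique_entities)


# All three formats yield plain strings:
print(sorted(extract_themes_from_overlaps([("car insurance", "车险")])))
print(extract_themes_from_overlaps(["car insurance"]))
print(extract_themes_from_overlaps({"car insurance": []}))
```

Unknown input types fall through to an empty list rather than raising, which keeps the synthesizer from crashing on unexpected graph data.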

@czhiming-maker
Author


Very nice!
Sorry, I only just saw this suggestion; I just got back from vacation.
I see someone has already opened a new PR based on this suggestion:
#2347

@anistark
Member

anistark commented Oct 9, 2025

No worries @czhiming-maker

Closing this, as #2347 has been merged.

