
@czhiming-maker czhiming-maker commented Sep 25, 2025

contd... #2275

Issue Link / Problem Description

ragas version: 0.3.3

I ran into a problem when trying to generate multi-hop queries from some documents.

The console log shows a validation error for the ThemesPersonasInput model:
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ThemesPersonasInput
themes.0
Input should be a valid string [type=string_type, input_value=('车险代理', '车险'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.11/v/string_type
themes.1
Input should be a valid string [type=string_type, input_value=('交强险', '交强保险'), input_type=tuple]

I also found that someone has already reported this issue: #2274

Changes Made

I found that the ThemesPersonasInput model defines the "themes" field as List[str], but the actual themes value is a List[tuple[str, str]], which is why this error occurs.

So I replaced the assignment of overlapped_items in traditional.py to pass strings instead of tuples, fixing the validation error on the ThemesPersonasInput "themes" field.
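The mismatch is easy to reproduce outside of ragas. Below is a minimal sketch, assuming pydantic v2, using a stand-in model that mirrors only the `themes: List[str]` field of the real `ThemesPersonasInput`:

```python
from typing import List

from pydantic import BaseModel, ValidationError


# Stand-in for the real ragas model; only the `themes` field is needed
# to reproduce the reported error.
class ThemesPersonasInput(BaseModel):
    themes: List[str]


# Passing entity pairs (tuples) where strings are expected fails validation,
# producing one string_type error per list item, as in the log above.
try:
    ThemesPersonasInput(themes=[("车险代理", "车险"), ("交强险", "交强保险")])
except ValidationError as exc:
    err = exc

print(err.error_count())  # 2
```

Each tuple in the list triggers its own `string_type` error (`themes.0`, `themes.1`), matching the traceback in the description.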

Testing

Manual testing steps:

  1. apply the code change described above
  2. generate multi-hop queries again
  3. the generation now completes successfully

Test code:

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from langchain_community.document_loaders import JSONLoader
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)

import os
os.environ["GOOGLE_API_KEY"] = "xxxx-O5-xxxxx"

loader = JSONLoader(
    file_path="data/documents.json",
    jq_schema=".[].content",
)
documents = loader.load()

generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(model="gemini-2.5-flash"))
generator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
))

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4, 
        query_distribution=distribution,
    )

    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")

import asyncio
asyncio.run(generate())

References

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Sep 25, 2025
@anistark
Member

anistark commented Oct 3, 2025

Thanks for the PR @czhiming-maker

While this is a better approach, the original data structure seems to have been lost: entity pairs are no longer handled.
Think of the different formats:

  • tuples: [("car insurance", "车险")]
  • strings: ["car insurance"]
  • dicts: {"car insurance": [...]}

Here's what I'm thinking:

Add a helper method to make the intent clear. In specific.py, add this method to the MultiHopSpecificQuerySynthesizer class:

  def _extract_themes_from_overlaps(
      self, overlapped_items: t.Any
  ) -> t.List[str]:
      """
      Extract unique entity names from overlapped items.

      Handles multiple formats:
      - List[Tuple[str, str]]: Entity pairs from overlap detection
      - List[str]: Direct entity names
      - Dict[str, Any]: Keys as entity names
      """
      if isinstance(overlapped_items, dict):
          return list(overlapped_items.keys())

      if not isinstance(overlapped_items, list):
          return []

      unique_entities = set()
      for item in overlapped_items:
          if isinstance(item, tuple):
              # Extract both entities from the pair
              for entity in item:
                  if isinstance(entity, str):
                      unique_entities.add(entity)
          elif isinstance(item, str):
              unique_entities.add(item)

      return list(unique_entities)

Then in traditional.py, replace the assignment with:

themes = self._extract_themes_from_overlaps(overlapped_items)
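For illustration, here is the suggested helper as a standalone function (the module-level name `extract_themes_from_overlaps` is mine, for demonstration only), showing that all three formats normalize to the flat list of strings that the "themes" field expects:

```python
import typing as t


# Standalone version of the proposed helper, lifted out of the
# synthesizer class so it can be exercised directly.
def extract_themes_from_overlaps(overlapped_items: t.Any) -> t.List[str]:
    if isinstance(overlapped_items, dict):
        return list(overlapped_items.keys())

    if not isinstance(overlapped_items, list):
        return []

    unique_entities = set()
    for item in overlapped_items:
        if isinstance(item, tuple):
            # Extract both entities from the pair
            for entity in item:
                if isinstance(entity, str):
                    unique_entities.add(entity)
        elif isinstance(item, str):
            unique_entities.add(item)

    return list(unique_entities)


# All three formats yield plain strings:
print(sorted(extract_themes_from_overlaps([("car insurance", "车险")])))
print(extract_themes_from_overlaps(["car insurance"]))
print(extract_themes_from_overlaps({"car insurance": []}))
```

Unknown input types fall through to an empty list rather than raising, which keeps the synthesizer from crashing on unexpected graph data.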

@czhiming-maker
Author


Very nice!
Sorry, I only just saw this suggestion; I just got back from vacation.
I see someone has already opened a new PR based on this suggestion:
#2347

@anistark
Member

anistark commented Oct 9, 2025

No worries @czhiming-maker

Closing this, as #2347 has been merged.

