fix: ThemesPersonasInput model validation error when generate multi-hop query #2275
Conversation
Thanks for the PR @czhiming-maker.
This might break the definitions and type checkers.
Have you considered updating traditional.py to generate only strings instead of tuples?
Or we could also do explicit modeling for it. Open for discussion if needed.
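For reference, the first option mentioned above (having traditional.py emit plain strings rather than tuples) could look roughly like this. This is only a sketch; the function name, `overlapped_items`, and the joining format are assumptions, not the actual traditional.py code.

```python
# Sketch of the "emit strings instead of tuples" option discussed above.
# Names and the "<->" join format are illustrative, not the real internals.
from typing import List, Tuple

def overlaps_as_strings(overlapped_items: List[Tuple[str, str]]) -> List[str]:
    """Join each (item_a, item_b) overlap pair into a single string."""
    return [f"{a} <-> {b}" for a, b in overlapped_items]

print(overlaps_as_strings([("车险代理", "车险"), ("交强险", "交强保险")]))
# ['车险代理 <-> 车险', '交强险 <-> 交强保险']
```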
Good idea.
Great. Let us know how it goes. Closing this PR for the time being. Feel free to open a new one if there are any other changes you'd want to add. :)
OK, I have opened a new PR about this: #2313
PR #2347

## Issue Link / Problem Description

- Fixes #2275
- I also encountered this problem in my test of the multi-hop specific query synthesizer:
  1 validation error for ThemesPersonasInput
  themes.0
  Input should be a valid string [type=string_type, input_value=['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone'], input_type=list]
  For further information visit https://errors.pydantic.dev/2.11/v/string_type
- @czhiming-maker has not updated it and two weeks have passed, so I am giving the update a try.

## Changes Made

- Add a helper function `_extract_themes_from_overlaps`, based on the discussion in #2275 (a rough sketch of the idea follows after this description).

## Testing

### How to Test

```python
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.synthesizers.multi_hop.specific import (
    MultiHopSpecificQuerySynthesizer,
)
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
import os

loader = DirectoryLoader("./data/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Set your Azure OpenAI credentials
AZURE_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
AZURE_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY", "")
AZURE_API_VERSION = "2024-12-01-preview"
AZURE_DEPLOYMENT = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini")
EMBEDDING_DEPLOYMENT = "text-embedding-ada-002"

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    api_version=AZURE_API_VERSION,
    azure_deployment=EMBEDDING_DEPLOYMENT,
)

# Initialize LLM with JSON mode enabled
llm = AzureChatOpenAI(
    azure_endpoint=AZURE_ENDPOINT,
    api_key=AZURE_API_KEY,
    openai_api_version=AZURE_API_VERSION,
    azure_deployment=AZURE_DEPLOYMENT,
    temperature=0.3,
    model_kwargs={
        "response_format": {"type": "json_object"}  # Force clean JSON output
    },
)

generator_llm = LangchainLLMWrapper(llm)
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

distribution = [
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 1),
]

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)


async def generate():
    # generate testset
    testset = generator.generate_with_langchain_docs(
        documents,
        testset_size=4,
        query_distribution=distribution,
    )
    testset.to_evaluation_dataset().to_jsonl("testset.jsonl")


import asyncio

asyncio.run(generate())
```

## References

- Related issues: #2275

## Screenshots/Examples (if applicable)

Samples Generation in test
<img width="1701" height="111" alt="image" src="https://github.com/user-attachments/assets/890d14bc-cd31-4940-8ce3-df8bc5ea459b" />

Co-authored-by: Kenzo Yan <kenzo.yan@granlund.fi>
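The "Changes Made" section above names a helper `_extract_themes_from_overlaps`. Below is a minimal sketch of what such a helper might do; the signature, the flattening behaviour, and the deduplication are assumptions for illustration, not the merged implementation.

```python
# Hypothetical sketch of _extract_themes_from_overlaps: flatten overlap values
# that may arrive as plain strings or as (str, str) pairs / lists of strings
# into a flat, deduplicated list of strings that satisfies a List[str] field.
from typing import Iterable, List, Sequence, Union

def _extract_themes_from_overlaps(
    overlaps: Iterable[Union[str, Sequence[str]]],
) -> List[str]:
    themes: List[str] = []
    for item in overlaps:
        if isinstance(item, str):
            themes.append(item)
        else:  # a pair (or list) of overlapping entity names
            themes.extend(item)
    return list(dict.fromkeys(themes))  # dedupe while keeping first-seen order

print(_extract_themes_from_overlaps(
    [("Vedenjäähdytyskoneen", "Vedenjäähdytyskone"), "another theme"]
))
# ['Vedenjäähdytyskoneen', 'Vedenjäähdytyskone', 'another theme']
```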
## Issue Link / Problem Description
ragas version: 0.3.3
I hit a problem when I try to generate multi-hop queries based on some documents.
The console log shows a ThemesPersonasInput model validation error:
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ThemesPersonasInput
themes.0
Input should be a valid string [type=string_type, input_value=('车险代理', '车险'), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.11/v/string_type
themes.1
Input should be a valid string [type=string_type, input_value=('交强险', '交强保险'), input_type=tuple]
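The failure is easy to reproduce outside Ragas with a stripped-down model. The class below is a stand-in that only mirrors the `themes: List[str]` field; it is not the real ThemesPersonasInput definition.

```python
# Minimal reproduction: tuples passed where a List[str] field is expected.
from typing import List
from pydantic import BaseModel, ValidationError

class ThemesPersonasStandIn(BaseModel):  # simplified stand-in, not the Ragas model
    themes: List[str]

try:
    ThemesPersonasStandIn(themes=[("车险代理", "车险"), ("交强险", "交强保险")])
except ValidationError as exc:
    # Prints "Input should be a valid string [type=string_type, ...]"
    # for themes.0 and themes.1, matching the log above.
    print(exc)
```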
I also found that someone has already reported this issue: #2274
## Changes Made
I found that the ThemesPersonasInput model defines the "themes" field as List[str], but the actual themes value is a List[Tuple[str, str]], which is why this error occurs.
I changed the "themes" field to List[t.Union[str, t.Tuple[str, str]]] to resolve the error,
because the themes value, as generated, is always a list of tuples; the code below shows why:
https://github.com/explodinggradients/ragas/blob/main/src/ragas/testset/transforms/relationship_builders/traditional.py#L162
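The field change described above, sketched on the same kind of stand-in model (the real ThemesPersonasInput carries additional fields, which are omitted here):

```python
# Sketch of widening the "themes" field so the tuple values produced by the
# relationship builder validate. Simplified stand-in, not the Ragas model.
import typing as t
from pydantic import BaseModel

class ThemesPersonasStandIn(BaseModel):
    themes: t.List[t.Union[str, t.Tuple[str, str]]]

# Both plain strings and (str, str) tuples now pass validation.
ok = ThemesPersonasStandIn(themes=["交强险", ("车险代理", "车险")])
print(ok.themes)  # ['交强险', ('车险代理', '车险')]
```

As the reviewer notes in the conversation above, widening the declared type may break definitions and type checkers, which is why the follow-up PR (#2347) instead adds a helper to extract themes from the overlaps.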
## Testing
Manual testing steps:
Test code:
## References