
fix(testset): Ensure each document is used only once for question gen… #880

Open · wants to merge 2 commits into main
Conversation

princepride


…eration

Previously, the code used a nested loop to iterate over the distributions and generate questions for each document. However, this approach had a potential issue where a single document could be used multiple times for generating questions, leading to redundancy and inefficient usage of the available documents.

To address this issue, the code has been modified to use a cumulative approach for determining the range of documents assigned to each evolution type based on their probability distribution. The key changes include:

1. Introduced a `start_index` variable to keep track of the starting document index for each evolution type.
2. Calculated the `end_index` for each evolution type by adding the rounded value of `probability * test_size` to the `start_index`.
3. Used an inner loop to iterate from `start_index` to `end_index` and submit tasks to the executor for each document within that range.
4. Updated the `start_index` to `end_index` after processing each evolution type to ensure the next evolution type starts from the correct position.
5. If `total_evolutions` is less than `test_size` after processing all evolution types, randomly selected evolution types to fill the remaining documents using the `choices` function.

With these modifications, each document is guaranteed to be used only once for question generation, avoiding redundancy and ensuring efficient utilization of the available documents. The cumulative probability approach ensures that the document ranges for different evolution types do not overlap, maintaining the desired probability distribution.

This fix improves the quality and diversity of the generated questions by preventing the repeated use of documents and ensuring a more balanced distribution of questions across the available documents.
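Under these assumptions (the function and variable names below are hypothetical illustrations, not the actual ragas API), the cumulative assignment described in the steps above can be sketched as:

```python
import random

def assign_documents(distributions, test_size):
    """Assign each document index 0..test_size-1 to exactly one
    evolution type, using cumulative ranges so ranges never overlap.
    Sketch only; names do not match the ragas codebase."""
    assignments = []  # (evolution_type, doc_index) pairs
    start_index = 0
    for evolution_type, probability in distributions.items():
        # Each type gets a contiguous slice proportional to its probability.
        end_index = min(start_index + round(probability * test_size), test_size)
        for doc_index in range(start_index, end_index):
            assignments.append((evolution_type, doc_index))
        start_index = end_index  # the next type continues where this one ended

    # Rounding may leave some documents unassigned; fill the remainder
    # with randomly chosen evolution types via random.choices.
    remaining = test_size - start_index
    for evolution_type in random.choices(list(distributions), k=remaining):
        assignments.append((evolution_type, start_index))
        start_index += 1
    return assignments
```

Because `start_index` only ever moves forward, no document index can appear in two ranges, which is the non-overlap property the description relies on.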
@princepride
Author

It's quite strange that my code changes do not involve ragas/llms/base.py, but it seems that "ChatVertexAI" is not exported from the module "langchain_community.chat_models". The error message suggests importing it from "langchain_community.chat_models.vertexai" instead.

This is the error message I got when I tried to merge my branch into the main branch in explodinggradients#880:

```
found 0 vulnerabilities
/home/runner/work/ragas/ragas/src/ragas/llms/base.py
  /home/runner/work/ragas/ragas/src/ragas/llms/base.py:10:45 - error: "ChatVertexAI" is not exported from module "langchain_community.chat_models"
    Import from "langchain_community.chat_models.vertexai" instead (reportPrivateImportUsage)
1 error, 0 warnings, 0 informations
Error: Process completed with exit code 123.
```

I checked the documentation at https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.vertexai.ChatVertexAI.html; it says ChatVertexAI comes from the vertexai module.
@omkar-334
Contributor

omkar-334 commented May 15, 2024

> each document is guaranteed to be used only once for question generation

Do you mean document as in files or the node/embeddings?

@princepride
Author

princepride commented May 15, 2024

> each document is guaranteed to be used only once for question generation
>
> Do you mean document as in files or the node/embeddings?

I mean current_nodes, which was initialized from:

```python
current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]
```

And each time it used indices between 0 and (probability * test_size), so in every distribution it would always use the front part of current_nodes.
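As a toy illustration of that overlap (a hypothetical simplification of the old loop, not the actual ragas code):

```python
# Toy reconstruction of the old per-distribution loop: every evolution
# type indexes current_nodes from 0, so the front nodes are reused and
# the back nodes are never touched.
test_size = 4
distributions = {"simple": 0.5, "reasoning": 0.5}

used = []
for evolution_type, probability in distributions.items():
    for i in range(round(probability * test_size)):
        used.append(i)  # always counts up from 0

print(used)  # → [0, 1, 0, 1]: nodes 0 and 1 reused, nodes 2 and 3 unused
```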

@ciekawy

ciekawy commented May 15, 2024

The general idea seems viable, though I'm not sure if an explicit requirement to use each document only once is really what we need. I can easily imagine that for longer documents you may need:

  • to create more than one question, verifying that the various topics in the document can be retrieved properly (e.g. I had a case recently with my RAG where, due to lossy semantic chunking - more like extensive summaries - some data was missing in the actual app)
  • multi-context questions, which obviously should be able to refer to previously used documents

@omkar-334
Contributor

> I can easily imagine that for longer documents you may need

By document, they mean the current_nodes. I think the length of the document/file is irrelevant, as it has been embedded into nodes.
Here the node is used only once, for question generation. This node can still be used in other questions for finding relevant context.

@omkar-334
Contributor

> I'm not sure if explicit requirement to use each document only once is really what we need.

I think you're right. But this is a recurring issue in the generated datasets: a few questions are similar, with only the phrasing/wording changed. I was looking into a method such that nodes used for generating seed questions are not used again for generation, although they can still be used as context for other questions. This approach seems viable, though.

@shahules786
Member

Hey guys,
first of all, apologies for the late reply @princepride @omkar-334 @ciekawy.
This is an interesting issue. I have noticed it before; that is why I implemented penalizing the selection of repeated chunks using this logic here:

  • wins here refers to how many times the node has been used
  • an adjustment factor is used to weigh down nodes as they are increasingly selected

On top of that, I just merged PR Fix testset generator issue on context selection #937, which randomizes the selected docs for each evolution.
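A minimal sketch of that penalization idea (the `alpha` adjustment factor and the weight formula here are assumptions for illustration, not the actual ragas logic):

```python
import random

def pick_node(nodes, wins, alpha=0.5):
    """Sample one node, weighing down nodes by their win count.
    wins[i] counts how many times nodes[i] has already been selected;
    alpha is a hypothetical adjustment factor."""
    weights = [1.0 / (1.0 + alpha * wins[i]) for i in range(len(nodes))]
    idx = random.choices(range(len(nodes)), weights=weights, k=1)[0]
    wins[idx] += 1
    return nodes[idx]

# Repeatedly selected nodes become less likely to be chosen again,
# spreading selections across the docstore.
random.seed(0)
nodes = ["node_a", "node_b", "node_c"]
wins = [0, 0, 0]
for _ in range(30):
    pick_node(nodes, wins)
```

Unlike a hard "use once" rule, this keeps repeated use possible (e.g. for multi-context questions) while discouraging it.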

What do you guys think? I am working on improving test generation this week and would love to chat with any of you: https://cal.com/shahul-ragas/30min

4 participants