
fix(testset): Ensure each document is used only once for question gen… #880

Open · wants to merge 2 commits into main
Conversation

princepride


…eration

Previously, the code used a nested loop to iterate over the distributions and generate questions for each document. However, this approach had a potential issue where a single document could be used multiple times for generating questions, leading to redundancy and inefficient usage of the available documents.

To address this issue, the code has been modified to use a cumulative approach for determining the range of documents assigned to each evolution type based on their probability distribution. The key changes include:

1. Introduced a `start_index` variable to keep track of the starting document index for each evolution type.
2. Calculated the `end_index` for each evolution type by adding the rounded value of `probability * test_size` to the `start_index`.
3. Used an inner loop to iterate from `start_index` to `end_index` and submit tasks to the executor for each document within that range.
4. Updated the `start_index` to `end_index` after processing each evolution type to ensure the next evolution type starts from the correct position.
5. If `total_evolutions` is less than `test_size` after processing all evolution types, randomly selected evolution types to fill the remaining documents using the `choices` function.

With these modifications, each document is guaranteed to be used only once for question generation, avoiding redundancy and ensuring efficient utilization of the available documents. The cumulative probability approach ensures that the document ranges for different evolution types do not overlap, maintaining the desired probability distribution.

This fix improves the quality and diversity of the generated questions by preventing the repeated use of documents and ensuring a more balanced distribution of questions across the available documents.
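Under these assumptions (the function and variable names below are hypothetical illustrations, not the actual ragas API), the cumulative assignment described in the steps above can be sketched as:

```python
import random

def assign_documents(distributions, test_size):
    """Assign each document index 0..test_size-1 to exactly one
    evolution type, using cumulative ranges so ranges never overlap.
    Sketch only; names do not match the ragas codebase."""
    assignments = []  # (evolution_type, doc_index) pairs
    start_index = 0
    for evolution_type, probability in distributions.items():
        # Each type gets a contiguous slice proportional to its probability.
        end_index = min(start_index + round(probability * test_size), test_size)
        for doc_index in range(start_index, end_index):
            assignments.append((evolution_type, doc_index))
        start_index = end_index  # the next type continues where this one ended

    # Rounding may leave some documents unassigned; fill the remainder
    # with randomly chosen evolution types via random.choices.
    remaining = test_size - start_index
    for evolution_type in random.choices(list(distributions), k=remaining):
        assignments.append((evolution_type, start_index))
        start_index += 1
    return assignments
```

Because `start_index` only ever moves forward, no document index can appear in two ranges, which is the non-overlap property the description relies on.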
@princepride
Author

It's quite strange that my code changes do not involve ragas/llms/base.py, but it seems that "ChatVertexAI" is not exported from the module "langchain_community.chat_models". The error message suggests importing it from "langchain_community.chat_models.vertexai" instead.

This is the error message I got when I tried to merge my branch into the main branch in explodinggradients#880:

```
found 0 vulnerabilities
/home/runner/work/ragas/ragas/src/ragas/llms/base.py
  /home/runner/work/ragas/ragas/src/ragas/llms/base.py:10:45 - error: "ChatVertexAI" is not exported from module "langchain_community.chat_models"
    Import from "langchain_community.chat_models.vertexai" instead (reportPrivateImportUsage)
1 error, 0 warnings, 0 informations
Error: Process completed with exit code 123.
```

I checked the documentation at https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.vertexai.ChatVertexAI.html; it says ChatVertexAI comes from the vertexai module.
@omkar-334
Contributor

omkar-334 commented May 15, 2024

> each document is guaranteed to be used only once for question generation

Do you mean document as in files or the node/embeddings?

@princepride
Author

princepride commented May 15, 2024

> each document is guaranteed to be used only once for question generation
>
> Do you mean document as in files or the node/embeddings?

I mean current_nodes, which was initialized from:

```python
current_nodes = [
    CurrentNodes(root_node=n, nodes=[n])
    for n in self.docstore.get_random_nodes(k=test_size)
]
```

And each time it used indices between 0 and (probability * test_size), so in every distribution it would always use the front part of current_nodes.
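As a toy illustration of that overlap (a hypothetical simplification of the old loop, not the actual ragas code):

```python
# Toy reconstruction of the old per-distribution loop: every evolution
# type indexes current_nodes from 0, so the front nodes are reused and
# the back nodes are never touched.
test_size = 4
distributions = {"simple": 0.5, "reasoning": 0.5}

used = []
for evolution_type, probability in distributions.items():
    for i in range(round(probability * test_size)):
        used.append(i)  # always counts up from 0

print(used)  # → [0, 1, 0, 1]: nodes 0 and 1 reused, nodes 2 and 3 unused
```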

@ciekawy

ciekawy commented May 15, 2024

The general idea seems viable, though I'm not sure if an explicit requirement to use each document only once is really what we need. I can easily imagine that for longer documents you may need:

  • to create more than one question, verifying that the various topics in the document can be retrieved properly (e.g. I had a case recently with my RAG where, due to lossy semantic chunking - more like extensive summaries - some data was missing in the actual app)
  • multi-context questions, which obviously should be able to refer to previously used documents

@omkar-334
Contributor

> I can easily imagine that for longer documents you may need

By document, they mean the current_nodes. I think the length of the document/file is irrelevant, as it has been embedded into nodes.
Here the node is used only once, for question generation. This node can still be used in other questions for finding relevant context.

@omkar-334
Contributor

> I'm not sure if explicit requirement to use each document only once is really what we need.

I think you're right. But this is a recurring issue in the generated datasets: a few questions are similar, with only the phrasing/wording changed. I was looking into a method such that nodes used for generating seed questions are not used again for generation, although they can still be used as context for other questions. This approach seems viable, though.

@shahules786
Member

Hey guys,
first of all, apologies for the late reply @princepride @omkar-334 @ciekawy.
This is an interesting issue. I have noticed it before; that is why I implemented penalizing the selection of repeated chunks using this logic here:

  • wins here refers to how many times the node has been used
  • an adjustment factor is used to weigh down nodes as they are increasingly selected

On top of that, I just merged PR Fix testset generator issue on context selection #937, which randomizes the selected docs for each evolution.
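A minimal sketch of that penalization idea (the `alpha` adjustment factor and the weight formula here are assumptions for illustration, not the actual ragas logic):

```python
import random

def pick_node(nodes, wins, alpha=0.5):
    """Sample one node, weighing down nodes by their win count.
    wins[i] counts how many times nodes[i] has already been selected;
    alpha is a hypothetical adjustment factor."""
    weights = [1.0 / (1.0 + alpha * wins[i]) for i in range(len(nodes))]
    idx = random.choices(range(len(nodes)), weights=weights, k=1)[0]
    wins[idx] += 1
    return nodes[idx]

# Repeatedly selected nodes become less likely to be chosen again,
# spreading selections across the docstore.
random.seed(0)
nodes = ["node_a", "node_b", "node_c"]
wins = [0, 0, 0]
for _ in range(30):
    pick_node(nodes, wins)
```

Unlike a hard "use once" rule, this keeps repeated use possible (e.g. for multi-context questions) while discouraging it.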

What do you guys think? I am working on improving test generation this week and would love to chat with any of you: https://cal.com/shahul-ragas/30min

4 participants