
Enhanced data type handling and improved code stability in computing functions #3

Closed
Martyniqo opened this issue Aug 11, 2024 · 4 comments

Martyniqo commented Aug 11, 2024

Problem:
I encountered an issue in computation.py: the computation functions don't always handle different input data types properly. Specifically, when the input data is a one-dimensional array or a list, they can fail with errors.

What happened:
While running the following:

evaluator = RAGChecker(
    extractor_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    checker_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    batch_size_extractor=32,
    batch_size_checker=32
)
evaluator.evaluate(rag_results, all_metrics)

I received this error message while trying to compute retriever_metrics and generator_metrics:
Error during evaluation: object of type 'numpy.bool_' has no len()

This error indicates that len() was called on a NumPy boolean scalar rather than on an array or list, so the input data apparently wasn't in the shape the computation functions expect.
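
For context, the failure can be reproduced in isolation; this is a minimal sketch of the failure mode, not code from computation.py:

import numpy as np

flag = np.bool_(True)  # a boolean scalar, not an array or list
len(flag)              # TypeError: object of type 'numpy.bool_' has no len()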

What I changed:
Added Type Checks: I updated several functions (including evaluate_precision and evaluate_retrieval, shown below) to verify that the input is a numpy array or a list before proceeding. This should prevent len() and array reductions from being applied to scalars like numpy.bool_.

Better Handling of 1D Arrays: In evaluate_retrieval and evaluate_context_utilization, I added a check for one-dimensional input. If the input is 1D, the code skips the per-axis reductions, which avoids errors from operations like np.max(..., axis=1) (see the sketch below).
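
A minimal sketch of the 1D failure mode (illustrative only, not code from computation.py):

import numpy as np

retrieved2answer = np.array([True, False, True])  # 1D input where a 2D array is expected
np.max(retrieved2answer, axis=1)                  # AxisError: axis 1 is out of bounds for array of dimension 1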

Impact on computation accuracy:
While these changes should make the code more robust, they could also affect the computed metrics: when the input doesn't match the expected format, the new branches fall back to default values (such as 0.) instead of raising an error. I've tested the changes, but I can't rule out unintended effects on the results, so please let me know if anything looks off.

Changes in code:

  • def evaluate_precision
    if isinstance(answer2response, (np.ndarray, list)) and len(answer2response) > 0:
        result.metrics[metrics.precision] = np.mean(answer2response)
    else:
        result.metrics[metrics.precision] = 0.

  • def evaluate_retrieval
    if isinstance(retrieved2answer, (np.ndarray, list)) and len(retrieved2answer) > 0:
        if isinstance(retrieved2answer[0], (np.ndarray, list)) and len(retrieved2answer[0]) > 0:
            claim_recalled = np.max(retrieved2answer, axis=1)
            result.metrics[metrics.claim_recall] = np.mean(claim_recalled)
            psg_useful = np.max(retrieved2answer, axis=0)
            result.metrics[metrics.context_precision] = np.mean(psg_useful)
        else:
            claim_recalled = retrieved2answer
            result.metrics[metrics.claim_recall] = np.mean(claim_recalled)
            result.metrics[metrics.context_precision] = 0.
    else:
        result.metrics[metrics.claim_recall] = 0.
        result.metrics[metrics.context_precision] = 0.

  • def evaluate_context_utilization
    if isinstance(retrieved2answer, (np.ndarray, list)) and len(retrieved2answer) > 0:
        if np.ndim(retrieved2answer) == 1 or (np.ndim(retrieved2answer) > 1 and len(retrieved2answer[0]) > 0):
            claim_recalled = np.max(retrieved2answer, axis=1) if np.ndim(retrieved2answer) > 1 else retrieved2answer
            if np.sum(claim_recalled) > 0:
                claim_used = claim_recalled & response2answer
                result.metrics[metrics.context_utilization] = np.sum(claim_used) / np.sum(claim_recalled)
            else:
                result.metrics[metrics.context_utilization] = 0.
        else:
            result.metrics[metrics.context_utilization] = 0.
    else:
        result.metrics[metrics.context_utilization] = 0.

computation-v2.zip

@HuXiangkun (Contributor)

Hi @Martyniqo, this is great feedback! We will review your code and get back to you soon. Also, would you consider creating a pull request for your changes? We can review and merge your code accordingly. Thanks for your effort!

Martyniqo (Author) commented Aug 13, 2024 via email

@rudongyu (Contributor)

Hi @Martyniqo, are you running the example here with only the backbone LLM changed?

I ran the code below but couldn't reproduce the error.

from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics


# initialize ragresults from json/dict
with open("examples/checking_inputs.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# set-up the evaluator
evaluator = RAGChecker(
    extractor_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    checker_name='bedrock/meta.llama3-1-8b-instruct-v1:0',
    batch_size_extractor=32, batch_size_checker=32
)

evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

Since the inputs to the functions in computation.py come from our upstream package RefChecker, the data types should be well-controlled as long as your input data has the same format as the example here.
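
For reference, a minimal input in the shape of examples/checking_inputs.json looks roughly like this (sketch only, with illustrative values; the example file in the repo is the authoritative reference):

import json
from ragchecker import RAGResults

# Minimal sample mirroring the structure of examples/checking_inputs.json.
sample = {
    "results": [{
        "query_id": "000",
        "query": "What's the longest river in the world?",
        "gt_answer": "The Nile is the longest river in the world.",
        "response": "The longest river in the world is the Nile.",
        "retrieved_context": [
            {"doc_id": "000", "text": "The Nile is a major north-flowing river in northeastern Africa."}
        ]
    }]
}

rag_results = RAGResults.from_json(json.dumps(sample))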

If you are running RAGChecker on your own data, could you provide some samples leading to the error?

@rudongyu (Contributor)

The bug has been fixed by modifying the output formats of the dependency RefChecker. Please install the latest version to avoid the error. Feel free to reopen the issue if you find something wrong with your data.
