
Feat:Add RAG Benchmark method #1193

Merged — 50 commits merged into geekan:main on Apr 25, 2024

Conversation

@YangQianli92 (Contributor) commented Apr 15, 2024

Features

  • New MetaGPT-RAG assessment module, covering RougeL, Bleu, Recall, Hit Rate, MRR and other evaluation metrics (a minimal sketch of the Hit Rate/MRR computation follows this list).
  • Makes it easy to review the effect of the different RAG modules.
  • Supports custom evaluation datasets; follow the provided sample to structure your own data.
  • Added reranker support for Cohere and FlagEmbedding.
  • Based on the above, we evaluated the various RAG components of MetaGPT; the settings used are listed below and the results are shown in the attached figures:
    • LLM: chatgpt-3.5-1106-turbo
    • Embedding: text-embedding-3-small
    • chunk_size: 256
    • chunk_overlap: 0
    • similarity_top_k: 5
    • ranker_top_n: 3
      [evaluation result figures]
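
For reference, here is a minimal sketch of how Hit Rate and MRR are conventionally computed from retrieval results. This is the textbook definition, not necessarily the exact implementation in metagpt/rag/benchmark/base.py; the document IDs and queries below are hypothetical.

```python
# Hedged sketch: standard Hit Rate / MRR definitions, not the PR's actual code.
def hit(retrieved: list[str], ground_truths: list[str]) -> float:
    """1.0 if any ground-truth chunk appears among the retrieved chunks, else 0.0."""
    return 1.0 if any(gt in retrieved for gt in ground_truths) else 0.0

def reciprocal_rank(retrieved: list[str], ground_truths: list[str]) -> float:
    """1 / rank of the first retrieved chunk matching a ground truth (0.0 if no match)."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in ground_truths:
            return 1.0 / rank
    return 0.0

# Per-query scores are averaged over the dataset to obtain the reported Hit Rate and MRR.
queries = [
    (["doc_a", "doc_b", "doc_c"], ["doc_b"]),  # hypothetical retrieval result / ground truth
    (["doc_x", "doc_y", "doc_z"], ["doc_q"]),
]
mrr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
hit_rate = sum(hit(r, g) for r, g in queries) / len(queries)
print(f"MRR={mrr:.3f}, HitRate={hit_rate:.3f}")  # MRR=0.250, HitRate=0.500
```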

codecov-commenter commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 8.41121%, with 98 lines in your changes missing coverage. Please review.

Project coverage is 70.26%. Comparing base (933d6c1) to head (debe6b0).
Report is 22 commits behind head on main.

Files Patch % Lines
metagpt/rag/benchmark/base.py 0.00% 86 Missing ⚠️
metagpt/rag/factories/ranker.py 16.66% 10 Missing ⚠️
metagpt/rag/benchmark/__init__.py 0.00% 2 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1193      +/-   ##
==========================================
- Coverage   70.60%   70.26%   -0.34%     
==========================================
  Files         314      316       +2     
  Lines       18714    18821     +107     
==========================================
+ Hits        13213    13225      +12     
- Misses       5501     5596      +95     


Resolved (now outdated) review threads on: metagpt/rag/factories/ranker.py, metagpt/rag/schema.py, metagpt/rag/benchmark/base.py, examples/rag_bm.py
@geekan (Owner) commented Apr 22, 2024

/review


PR Review

⏱️ Estimated effort to review [1-5]

4, due to the extensive amount of new code across multiple files, involving complex functionalities such as data retrieval, ranking, and evaluation metrics. The PR integrates new features and configurations which require careful review to ensure correctness and performance.

🧪 Relevant tests

No

🔍 Possible issues

Possible Bug: The method rag_evaluate_single in rag_bm.py might return incorrect metrics if an exception is thrown and caught. The method catches all exceptions and returns a default metric set which might not accurately reflect the error state or provide meaningful feedback for debugging.

Performance Concern: The extensive use of synchronous file I/O operations and potentially large data processing in loops could lead to performance bottlenecks, especially noticeable when processing large datasets or when used in a high-latency network environment.

🔒 Security concerns

No

Code feedback:
relevant file: examples/rag_bm.py
suggestion:

Consider implementing more granular exception handling in the rag_evaluate_single method to differentiate between different types of errors (e.g., network issues, data format errors) and handle them appropriately. This will improve the robustness and debuggability of the module. [important]

relevant line: except Exception as e:
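
A possible shape for such handling, sketched with illustrative exception types and a placeholder metric dict rather than the module's actual API:

```python
import json
import logging

logger = logging.getLogger(__name__)

# Placeholder metric shape; the real default metrics in rag_bm.py may differ.
DEFAULT_METRICS = {"bleu": 0.0, "rougel": 0.0, "recall": 0.0, "hit_rate": 0.0, "mrr": 0.0}

def evaluate_single_guarded(evaluate_fn, question: str, reference: str) -> dict:
    """Run one evaluation, mapping distinct failure classes to distinct error tags."""
    try:
        return evaluate_fn(question, reference)
    except (ConnectionError, TimeoutError) as e:               # network / LLM endpoint issues
        logger.warning("network error for %r: %s", question, e)
        return {**DEFAULT_METRICS, "error": "network"}
    except (json.JSONDecodeError, KeyError, ValueError) as e:  # malformed dataset rows
        logger.error("data format error for %r: %s", question, e)
        return {**DEFAULT_METRICS, "error": "data_format"}
```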

relevant file: examples/rag_bm.py
suggestion:

To enhance performance, consider using asynchronous file operations or a more efficient data handling mechanism to manage I/O operations, especially when loading or writing large datasets in the rag_evaluate_pipeline method. [important]

relevant line: write_json_file((EXAMPLE_BENCHMARK_PATH / dataset.name / "bm_result.json").as_posix(), results, "utf-8")
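
One way to act on this without changing the result format is to push the blocking write onto a worker thread. A minimal sketch, assuming a plain synchronous JSON write as a stand-in for the repo's write_json_file helper:

```python
import asyncio
import json
from pathlib import Path

def write_json_file_sync(path: str, data, encoding: str = "utf-8") -> None:
    """Blocking JSON write (stand-in for the repo's write_json_file helper)."""
    Path(path).write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding=encoding)

async def write_results_async(path: str, results) -> None:
    # Offload the blocking write to a thread so the evaluation event loop stays responsive.
    await asyncio.to_thread(write_json_file_sync, path, results)

# Usage inside the pipeline would then look roughly like:
# await write_results_async((EXAMPLE_BENCHMARK_PATH / dataset.name / "bm_result.json").as_posix(), results)
```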

relevant file: metagpt/rag/benchmark/base.py
suggestion:

Optimize the compute_metric method by caching results of expensive operations like bleu_score and rougel_score if the same responses and references are being evaluated multiple times. This can significantly reduce computation time in scenarios with repetitive data. [medium]

relevant line: bleu_avg, bleu1, bleu2, bleu3, bleu4 = self.bleu_score(response, reference)
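
If identical (response, reference) pairs recur, memoizing the scorers is straightforward. A sketch with a stand-in scorer (the real bleu_score/rougel_score live in base.py and return richer tuples):

```python
from functools import lru_cache

def _expensive_score(response: str, reference: str) -> float:
    """Stand-in for an expensive scorer such as bleu_score or rougel_score."""
    shared = set(response.split()) & set(reference.split())
    return len(shared) / max(len(reference.split()), 1)

@lru_cache(maxsize=4096)
def cached_score(response: str, reference: str) -> float:
    # Identical (response, reference) pairs are computed once, then served from cache.
    return _expensive_score(response, reference)
```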

relevant file: examples/rag_bm.py
suggestion:

Refactor the rag_evaluate_pipeline method to break down its functionality into smaller, more manageable functions. This improves modularity and makes the code easier to maintain and test. [medium]

relevant line: async def rag_evaluate_pipeline(self, dataset_name: list[str] = ["all"]):
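
A hypothetical decomposition along those lines (function names are illustrative and the bodies are elided). As a side note, the mutable default `["all"]` in the signature above is usually better replaced with `None` or a tuple:

```python
# Hypothetical split of the pipeline into small, independently testable steps.
async def load_dataset(name: str) -> list[dict]:
    ...  # read questions/references for one dataset

async def evaluate_dataset(rows: list[dict]) -> list[dict]:
    ...  # run retrieval + generation + metric computation per row

async def save_results(name: str, results: list[dict]) -> None:
    ...  # persist results, e.g. bm_result.json

async def rag_evaluate_pipeline(dataset_names: tuple[str, ...] = ("all",)) -> None:
    for name in dataset_names:
        rows = await load_dataset(name)
        results = await evaluate_dataset(rows)
        await save_results(name, results)
```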


✨ Review tool usage guide:

Overview:
The review tool scans the PR code changes and generates a PR review that includes several types of feedback, such as possible PR issues, security threats, and relevant tests in the PR. More feedback types can be added by configuring the tool.

The tool can be triggered automatically every time a new PR is opened, or can be invoked manually by commenting on any PR.

  • When commenting, to edit configurations related to the review tool (pr_reviewer section), use the following template:
/review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=...
[pr_reviewer]
some_config1=...
some_config2=...

See the review usage page for a comprehensive guide on using this tool.

@better629 (Collaborator)
lgtm

geekan merged commit 2476672 into geekan:main on Apr 25, 2024
0 of 3 checks passed
@YangQianli92 (Contributor, Author) commented Apr 26, 2024

In the PR above, there is a slight error in the MRR calculation of the benchmark metrics. I have submitted another PR, #1228, to fix this bug, and all results have been recalculated after the fix.
