AutoRank-LLM is a Python package that automatically evaluates and ranks Large Language Models (LLMs) without human intervention. It uses a recursive peer-review system with adaptive weighting to produce a fair, dynamic ranking of LLMs based on their performance on language understanding and generation tasks.
- Recursive Evaluation: LLMs evaluate each other in multiple rounds to establish a performance-based ranking.
- Adaptive Weighting: Evaluations are weighted by the evaluator's current ranking to ensure that more skilled LLMs have a greater influence.
- Dynamic Adjustment: Rankings are updated after each round based on the weighted evaluations, allowing the system to stabilize gradually (a sketch of one such update round follows this list).
- Normalization and Scaling: Scores are normalized after each round to prevent inflation or deflation and keep the ranking on a consistent scale.
- Machine Learning Integration: Potential for continuous learning and refinement of the system based on evaluation outcomes.
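The core update can be illustrated with a short sketch. Everything below (the function name, the data shapes, and the exact weighting formula) is an illustrative assumption, not the package's actual internals:

# Hypothetical sketch of one evaluation round with adaptive weighting.
def run_round(scores, evaluations):
    """Re-score each model from peer evaluations weighted by evaluator rank.

    scores:      dict mapping model name -> current score (sums to 1.0)
    evaluations: dict mapping (evaluator, target) -> raw score in [0, 1]
    """
    new_scores = {}
    for target in scores:
        # Weighted mean of peer evaluations; higher-ranked evaluators count more
        num = sum(scores[e] * raw for (e, t), raw in evaluations.items()
                  if t == target and e != target)
        den = sum(scores[e] for (e, t), _ in evaluations.items()
                  if t == target and e != target)
        new_scores[target] = num / den if den else scores[target]
    # Normalize so scores stay on a consistent scale across rounds
    total = sum(new_scores.values()) or 1.0
    return {m: s / total for m, s in new_scores.items()}

Repeating this round after round lets the ranking stabilize: an evaluator's influence tracks its own current score, and normalization keeps totals from drifting.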
To install AutoRank-LLM, follow these steps:
- Ensure you have Python 3.6 or higher installed.
- Clone the repository or download the source code.
- Navigate to the root directory of the project.
- Install the package with pip:
pip install .
Or, for a development installation:
pip install -e .
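If you prefer an isolated environment, you can create a standard virtual environment first (optional, ordinary Python tooling rather than anything specific to this package):

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install .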
To use AutoRank-LLM, you'll need to write a Python script or open a Python shell.
Here's a simple example:
from autorank_llm.evaluator import LLMEvaluator
# Define the model names and the task
model_names = ["mistral:instruct", "orca-mini", "llama2"]
task = "3 names of a pet cow."
# Create an evaluator instance
evaluator = LLMEvaluator(model_names, task)
# Perform the evaluation
evaluator.evaluate_llms()
# Retrieve and display the rankings
rankings = evaluator.get_rankings()
print(rankings)
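The exact shape of the returned rankings depends on the implementation; assuming get_rankings() returns a mapping from model name to score, you could print a sorted leaderboard like this:

# Hypothetical: assumes rankings is a dict of {model_name: score}
for model, score in sorted(rankings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.3f}")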
Only Ollama models are supported for now.
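Before running an evaluation, pull the models you plan to use (assuming a local Ollama installation):

ollama pull mistral:instruct
ollama pull orca-mini
ollama pull llama2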
Run the unit tests on the package:
python -m unittest discover tests
Contributions to AutoRank-LLM are welcome and greatly appreciated. Here's how you can contribute:
- Report Issues: If you find a bug or have a suggestion for improving the package, please open an issue.
- Submit Pull Requests: Feel free to fork the repository and submit pull requests. Whether it's fixing a bug, adding a feature, or improving documentation, your help is valuable.
- Feedback: Your feedback is crucial to the ongoing improvement of AutoRank-LLM. Share your experiences and ideas for future development.
Please ensure your contributions adhere to the following guidelines:
- Write clear and concise commit messages.
- Make sure your code follows the PEP 8 style guide for Python code.
- Update the documentation to reflect any changes in the API or functionality.
- Add tests for new features to ensure they work as expected.
AutoRank-LLM is released under the Apache License 2.0. Please review the license for more details.
AutoRank-LLM was inspired by the need for a fair and unbiased way to evaluate the growing number of Large Language Models in the field. We are grateful to the open-source community for their contributions and support.
AutoRank-LLM offers a robust and scalable solution to the challenge of evaluating and ranking LLMs. By leveraging the collective assessment capabilities of LLMs themselves, it provides a unique approach to understanding and benchmarking language model performance.