[Suggestion] Leading Models Answer Consolidator! #460
I tested this idea's effectiveness manually with a series of questions in various styles, and the final consolidated response was better than any individual model's response in every test. I suppose the model chosen to grade could be biased toward its own model's response (if it had produced one of the responses). To try to combat that, I created a new thread each time for the evaluation, grading, and consolidation process, and didn't state where the answers came from.
I acknowledge this might get expensive for individuals and be slower, so it's not practical to do all the time.
@keithclift24 - are you able to build this? Please let me know quickly if you are able to.
@keithclift24 I need help with the "response consolidation logic". Please let me know how you performed what I call the "merge" phase. I have a few ideas: replace the system prompt, add all 1..N answers as Assistant messages followed by the user instructions, or put them all into one user message sandwiched between instructions, but I have not selected the method(s) to do it. I'll build the UI with maximum flexibility, even for custom setups, but I need a proven way of doing it. My investigation into LlamaIndex and LangChain yielded nothing. Please also reach out on Discord (you'll find me on the big-AGI server).
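For illustration, the "sandwich into one user message" arrangement mentioned above could be sketched like this. This is a hypothetical helper, not the project's actual implementation; the system-prompt wording and message shape (OpenAI-style role/content dicts) are assumptions:

```python
def build_merge_messages(question, answers, instructions):
    """Sketch: pack N anonymous candidate answers into a single user
    message, sandwiched between the merge instructions and a final ask.
    Answers are numbered but never attributed to a model, to reduce
    self-preference bias in the grading model."""
    body = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    user = (
        f"{instructions}\n\nQuestion: {question}\n\n{body}\n\n"
        "Now produce a single consolidated answer."
    )
    return [
        {"role": "system", "content": "You are an impartial judge and editor."},
        {"role": "user", "content": user},
    ]
```

The alternative arrangement (one Assistant message per answer, then the user instructions) trades anonymity for a more natural chat shape; which works better would need the kind of manual testing described earlier in this thread.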
I really am a novice at coding and have never worked on a serious project (besides playing around and learning), so I am afraid I'd be more trouble than help. I generated the to-do list above based on my "idea", with GPT-4 helping me describe the suggestion in terms you all may understand (so the requirements to-do list could be gibberish for all I know, but it looked relatively logical). Sorry, I'd love to help, but it's way over my head unfortunately.
When I was talking about testing it manually above, I literally just asked a question of each of GPT-4 Turbo, Claude 3 Opus, and Gemini Advanced, then copied the responses into a single .txt file on my PC and labeled them "Answer 1... Answer 2... Answer 3...". Then, in a new thread, I asked GPT-4 Turbo: "To the question '[Question I asked the 3 models]', I want you to objectively grade the attached 3 answers (you decide various logical criteria, but use a 1-100 scale)" and attached the .txt file saved on my PC.

Then I asked it to consolidate the 3 answers into one response with the best of each, focusing on further improving the final response by resolving the "weaknesses" identified in the grading. (Then, for my own testing, I asked it to grade that result against the original 3 answers.) So when it comes to the most efficient way to have the big-agi.com code base accomplish this whole process (let alone the grading/merge), I don't have any advice, unfortunately.
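The manual workflow just described (grade anonymous answers, then merge while resolving the identified weaknesses) can be expressed as a small two-call pipeline. This is only a sketch of the process from the comment above; `ask` is a hypothetical callable wrapping any LLM, and each call stands in for the fresh chat thread used in the manual tests:

```python
def consolidate(question, answers, ask):
    """Sketch of the manual grade-then-merge process:
    1) grade N anonymous answers on a 1-100 scale,
    2) merge them, fixing the weaknesses found in step 1.
    `ask` is a hypothetical (prompt -> str) LLM adapter."""
    labeled = "\n\n".join(f'Answer {i + 1}: "{a}"' for i, a in enumerate(answers))
    grading = ask(
        f'To the question "{question}", objectively grade these answers '
        f"(you decide the criteria, use a 1-100 scale):\n\n{labeled}"
    )
    merged = ask(
        "Consolidate the answers into one response keeping the best of each, "
        f"and resolve the weaknesses identified in this grading:\n\n"
        f"{grading}\n\n{labeled}"
    )
    return merged
```

Feeding the grading output back into the merge prompt is what carries the "resolve the weaknesses" step from the manual process into code.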
This is still insightful. Be ready to try it out very, very soon: it's a huge feature, the UX is great, and it's days away from being done.
Most definitely, I will! I love the project and look forward to tracking it and maybe helping some. After using the website off and on over the last 6 months, a few other things I would find most useful (I'm sure you've heard these):
For both we have tickets already. |
Hi @keithclift24, are you on Discord? This feature was released to the first testers today, and I'd love for you to take a look and give your opinion.
Yes, username "kmc24", and I'm a member of the big-agi server |
@enricoros I think you're good to close but please merge with the "Chat: Best-Of N effect #381" and "BEAM - feature thread #470". |
@keithclift24 your suggestion here was seminal and great - and it's amazing we shipped fast. Your advice (and the community's) came just at the right time to be able to shape this. We can now think of what would make V2 great:
Best of the Best: Integrate multi-model responses with peer review to generate a consolidated response, taking the best elements from each leading model's answer to a query and improving overall accuracy and quality beyond what any one individual model provides.
Why
This suggestion is grounded in the goal of enhancing the user experience by providing the most accurate, comprehensive, and nuanced responses possible. Users will be able to receive the best possible answers to their queries, synthesized from the strengths of multiple leading language models (LLMs) such as GPT-4 Turbo, Claude Opus, and Gemini Ultra. This approach not only ensures high-quality responses but also introduces a layer of reliability and depth not achievable by any single LLM. The peer review mechanism, performed by an independent LLM, further refines this process, ensuring that users have access to information that is not only diverse but also critically evaluated and consolidated.
Description
Integrate multiple leading LLMs (e.g., GPT-4 Turbo, Claude Opus, Gemini Ultra) to simultaneously process a user's prompt, and employ a separate, additional LLM chat (e.g., ChatGPT or whichever is the leading model at the time) to independently review and grade the responses of the 3 different models based on accuracy, relevance, and comprehensiveness. The system then consolidates these insights to deliver a single, optimized response that leverages the collective intelligence and strengths of all participating models.
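The "simultaneously process" part of this description implies a fan-out step before the review step. A minimal sketch of that shape, assuming a hypothetical `call(model, prompt)` adapter and placeholder model names (none of these names come from the big-AGI code base):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out_and_review(prompt, models, call):
    """Sketch: query several models in parallel, then hand all the
    responses to a reviewer model for grading and consolidation.
    `call(model, prompt) -> str` is a hypothetical LLM adapter;
    the model names are placeholders, not real identifiers."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda m: call(m, prompt), models))
    review_prompt = (
        "Review, grade, and consolidate the following responses into one "
        "optimized answer:\n\n"
        + "\n\n".join(f"Response {i + 1}: {a}" for i, a in enumerate(answers))
    )
    return call("reviewer-model", review_prompt)
```

Running the candidate calls concurrently keeps the wall-clock cost close to the slowest single model plus one review pass, which addresses part of the "slower and more expensive" concern raised earlier in the thread.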
Requirements