Here is a screenshot of an example workflow.
After you have responses from different systems/models, you can open a Pairwise Comparison Node that runs an evaluation LLM on all possible pairs (forward and backward).
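To illustrate what "all possible pairs (forward and backward)" means for request counts, here is a minimal sketch using ordered permutations. This is not the node's actual code; the function name and sample responses are hypothetical:

```python
# Hypothetical sketch: enumerate every ordered (A, B) pairing of responses,
# i.e. both forward (A vs. B) and backward (B vs. A) comparisons.
from itertools import permutations

def all_ordered_pairs(responses):
    """Return every ordered (A, B) pair of distinct responses."""
    return list(permutations(responses, 2))

pairs = all_ordered_pairs(["sys1 answer", "sys2 answer", "sys3 answer"])
# n responses yield n * (n - 1) ordered pairs, so 3 responses
# mean 6 LLM evaluation calls; this grows quickly with n, which is
# why options for random subsets or forward-only pairing matter.
```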
Looking at it now, I realize I need to change how the rubric is set, because the A and B responses are hard-coded in the scoring code. Other improvements I have in mind:

- Settings for saving different rubrics, plus options to toggle forward-backward pairings and random subsets.
- Better integration with the progress bar; it currently doesn't tell you how many requests will be sent to the LLM.
- Choosing the best several responses instead of just a pair.
- Expanded response options, perhaps matching the lm-sys/FastChat approach of A, B, tie, bad.
- Flexibility in scoring, perhaps within settings or via an optional additional Python/JS evaluation node.
- A hint or ground-truth flow, perhaps with a node that aggregates all responses and is checked by a human before entering the Comparison Node as a hint.

Note: I'm intending this particularly for testing generative web search systems, so I haven't been thinking about other goals.
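For the FastChat-style response options mentioned above, one plausible way to turn per-pair A/B/tie/bad verdicts into per-system scores is a simple tally. This is only a sketch of an assumed scoring rule (win = 1, tie = 0.5 each, bad = 0 for both), not how the node actually scores:

```python
# Hypothetical aggregation of FastChat-style pairwise verdicts.
from collections import Counter

def score_verdicts(verdicts):
    """Tally (system_a, system_b, verdict) triples, where verdict is
    'A', 'B', 'tie', or 'bad' (both responses poor).
    Assumed rule: win = 1 point, tie = 0.5 each, bad = 0 for both."""
    scores = Counter()
    for a, b, verdict in verdicts:
        if verdict == "A":
            scores[a] += 1.0
        elif verdict == "B":
            scores[b] += 1.0
        elif verdict == "tie":
            scores[a] += 0.5
            scores[b] += 0.5
        else:  # 'bad': neither gets points, but both appear in the tally
            scores[a] += 0.0
            scores[b] += 0.0
    return dict(scores)

scores = score_verdicts([
    ("sys1", "sys2", "A"),
    ("sys2", "sys1", "tie"),
    ("sys1", "sys3", "bad"),
])
```

Running both forward and backward pairings through a tally like this also helps average out position bias in the evaluation LLM.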
Here is the Pairwise Comparison Inspector Grouped List tab
and the Scores tab:
I will make another repo with my custom providers (very rudimentary at the moment).
I'll make a more careful write-up of my process and learnings.
Wow!!! I love the idea and the Scores tab. It’s incredible what you’ve been able to do! Please keep me updated.
Some thoughts:
How do you know the scores are 'correct'? Maybe it's just an example here, so the rubric is fairly general, but I still think verification may be an issue for users. It was my main issue with "prompt royale."
The LLM Scorer node is, IMO, the worst-designed node in ChainForge; it should have a much better UI with locked output types. I implemented it under time pressure. I see an inherited design problem with the pairwise scorer node; it would be nice to think later about how to improve this UI design for users.