Here is a screenshot of an example workflow.
After you have responses from different systems/models, you can open a Pairwise Comparison Node that runs an evaluation LLM on all possible pairs (forward and backward).
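To illustrate what "all possible pairs (forward and backward)" means for request counts, here is a minimal sketch using ordered permutations. This is not the node's actual code; the function name and sample responses are hypothetical:

```python
# Hypothetical sketch: enumerate every ordered (A, B) pairing of responses,
# i.e. both forward (A vs. B) and backward (B vs. A) comparisons.
from itertools import permutations

def all_ordered_pairs(responses):
    """Return every ordered (A, B) pair of distinct responses."""
    return list(permutations(responses, 2))

pairs = all_ordered_pairs(["sys1 answer", "sys2 answer", "sys3 answer"])
# n responses yield n * (n - 1) ordered pairs, so 3 responses
# mean 6 LLM evaluation calls; this grows quickly with n, which is
# why options for random subsets or forward-only pairing matter.
```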
Looking at it now, I realize I need to change how the rubric is set, because the A and B responses are hard-coded in the scoring code. Other improvements I have in mind:

- Settings for saving different rubrics, plus options to toggle forward-backward pairings and random subsets.
- Better integration with the progress bar; it currently doesn't tell you how many requests will be sent to the LLM.
- Choosing the best several responses instead of just a pair.
- Expanded response options, perhaps matching the lm-sys/FastChat approach of A, B, tie, bad.
- Flexibility in scoring, perhaps within settings or via an optional additional Python/JS evaluation node.
- A hint or ground-truth flow, perhaps with a node that aggregates all responses and is checked by a human before entering the Comparison Node as a hint.

Note: I'm intending this particularly for testing generative web search systems, so I haven't been thinking about other goals.
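For the FastChat-style response options mentioned above, one plausible way to turn per-pair A/B/tie/bad verdicts into per-system scores is a simple tally. This is only a sketch of an assumed scoring rule (win = 1, tie = 0.5 each, bad = 0 for both), not how the node actually scores:

```python
# Hypothetical aggregation of FastChat-style pairwise verdicts.
from collections import Counter

def score_verdicts(verdicts):
    """Tally (system_a, system_b, verdict) triples, where verdict is
    'A', 'B', 'tie', or 'bad' (both responses poor).
    Assumed rule: win = 1 point, tie = 0.5 each, bad = 0 for both."""
    scores = Counter()
    for a, b, verdict in verdicts:
        if verdict == "A":
            scores[a] += 1.0
        elif verdict == "B":
            scores[b] += 1.0
        elif verdict == "tie":
            scores[a] += 0.5
            scores[b] += 0.5
        else:  # 'bad': neither gets points, but both appear in the tally
            scores[a] += 0.0
            scores[b] += 0.0
    return dict(scores)

scores = score_verdicts([
    ("sys1", "sys2", "A"),
    ("sys2", "sys1", "tie"),
    ("sys1", "sys3", "bad"),
])
```

Running both forward and backward pairings through a tally like this also helps average out position bias in the evaluation LLM.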
Here is the Pairwise Comparison Inspector Grouped List tab
and the Scores tab:
I will make another repo with my custom providers (very rudimentary at the moment).
I'll make a more careful write-up of my process and learnings.
Wow!!! I love the idea and the Scores tab. It’s incredible what you’ve been able to do! Please keep me updated.
Some thoughts:
How do you know the scores are 'correct'? Maybe it's just an example here, so the rubric is fairly general, but I still think verification may be an issue for users. It was my main issue with "prompt royale."
The LLM Scorer node is, IMO, the worst-designed node in ChainForge; it should have a much better UI with locked output types. I implemented it under time pressure. I see an inherited design problem with the pairwise scorer node; it would be nice to think later about how to improve this UI design for users.