
Add example workflow #1

Open · 1 of 2 tasks
danielsgriffin opened this issue Dec 20, 2023 · 2 comments

danielsgriffin (Owner) commented Dec 20, 2023

Here is a screenshot of an example workflow.
After you have responses from different systems/models, you can open a Pairwise Comparison Node that will run an evaluation LLM on all possible pairs (forward and backward).
[screenshot: example workflow]

Looking at it now, I realize I need to change how the rubric is set, because the A and B responses are hard-coded in the scoring code. Other improvements I could make:

  • Settings for saving different rubrics.
  • Options to toggle forward-backward pairings and random subsets on or off.
  • Better integration with the progress bar; it also does not tell you how many requests will be sent to the LLM.
  • Choosing the best several responses instead of just a pair.
  • Expanded response options, perhaps matching the lm-sys/FastChat approach of A, B, tie, bad (see the sketch after this list).
  • Flexibility in scoring, perhaps within settings or via an optional additional Python/JS evaluation node.
  • A hint or ground-truth flow, perhaps with a node that aggregates all responses and is checked by a human before entering the Comparison Node as a hint.

Note: I'm intending this particularly for testing generative web search systems, so I haven't been thinking about other goals.
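To make the forward-backward pairing and the FastChat-style verdict set concrete, here is a minimal Python sketch. This is not the node's actual code: `judge` is a hypothetical stand-in for the evaluation-LLM call, and the tally structure is just one way to aggregate verdicts.

```python
from itertools import permutations
from collections import Counter

# Hypothetical stand-in for the evaluation LLM call; not ChainForge code.
def judge(prompt, response_a, response_b):
    """Return one of "A", "B", "tie", "bad" (the lm-sys/FastChat verdict set)."""
    raise NotImplementedError("call your evaluation LLM here")

def pairwise_scores(prompt, responses):
    """Run the judge over every ordered pair of responses (forward and
    backward) and tally verdicts per system. `responses` maps a system
    name to its response text."""
    tally = {name: Counter() for name in responses}
    for (name_a, resp_a), (name_b, resp_b) in permutations(responses.items(), 2):
        verdict = judge(prompt, resp_a, resp_b)
        if verdict == "A":
            tally[name_a]["win"] += 1
            tally[name_b]["loss"] += 1
        elif verdict == "B":
            tally[name_b]["win"] += 1
            tally[name_a]["loss"] += 1
        else:  # "tie" or "bad" is recorded against both sides
            tally[name_a][verdict] += 1
            tally[name_b][verdict] += 1
    return tally

# n systems -> n * (n - 1) ordered pairs, i.e. n * (n - 1) judge calls:
# this is the request count the node should surface in the progress bar.
```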

Here is the Pairwise Comparison Inspector's Grouped List tab:
[screenshot: Grouped List tab]

and the Scores tab:
[screenshot: Scores tab]

  • I will make another repo with my custom providers (very rudimentary at the moment).
  • I'll make a more careful write-up of my process and learnings.
danielsgriffin (Owner, Author) commented:
See here: ChainForge_SearchProviders

ianarawjo commented Dec 20, 2023

Wow!!! I love the idea and the Scores tab. It’s incredible what you’ve been able to do! Please keep me updated.

Some thoughts:

  • How do you know the scores are ‘correct’? Maybe it's just an example here, so the rubric is fairly general, but I still think verification may be an issue for users. It was my main issue with “prompt royale.” (One rough consistency check is sketched below.)
  • The LLM scorer node is, IMO, the worst-designed node in ChainForge; it should have a much better UI with locked output types. I implemented it under time pressure. I see an inherited design problem with the pairwise scorer node; it would be nice to think about how to improve this UI design for users later on.
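On the verification question above: since the node already runs both orderings of each pair, one cheap, partial check is the judge's self-consistency under position swaps. This is only a rough sketch, reusing the hypothetical `judge` from the earlier snippet; agreement under swapping does not prove the scores are correct, but low agreement does suggest the judge is reacting to position rather than the rubric.

```python
from itertools import combinations

def position_consistency(prompt, responses):
    """Fraction of unordered pairs where the judge's forward and backward
    verdicts agree once the A/B labels are swapped."""
    flip = {"A": "B", "B": "A", "tie": "tie", "bad": "bad"}
    consistent = total = 0
    for (_, resp_a), (_, resp_b) in combinations(responses.items(), 2):
        forward = judge(prompt, resp_a, resp_b)   # A = resp_a, B = resp_b
        backward = judge(prompt, resp_b, resp_a)  # A = resp_b, B = resp_a
        total += 1
        consistent += forward == flip[backward]
    return consistent / total if total else 1.0
```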
