Multiple styles of computing reward with DPO #1

Closed

natolambert opened this issue Jan 5, 2024 · 1 comment · Fixed by #96

Collaborator

natolambert commented Jan 5, 2024

The DPO reward computation currently matches the paper, but we should add the ability to normalize by length:

- Divide by the length of the response (chosen or rejected).
- Take a norm-style approach, i.e. a length-weighted average.
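To make the first option concrete, here is a minimal sketch of the paper-style implicit reward next to a length-normalized variant. It assumes per-token log-probabilities for the response are already available from both the policy and the reference model. The function name `dpo_reward`, its parameters, and the default `beta` are hypothetical and are not taken from this repository. The length-weighted average option is not specified in enough detail here to sketch, so it is left out.

```python
from typing import Sequence


def dpo_reward(
    policy_logps: Sequence[float],
    ref_logps: Sequence[float],
    beta: float = 0.1,  # assumed default, not taken from the repo
    normalize_by_length: bool = False,
) -> float:
    """Implicit DPO reward for one response, from per-token log-probs."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must cover the same tokens")
    if not policy_logps:
        raise ValueError("response must contain at least one token")

    # Paper-style reward: beta * (log pi(y|x) - log pi_ref(y|x)),
    # where each sequence log-prob is the sum of its per-token log-probs.
    total = sum(p - r for p, r in zip(policy_logps, ref_logps))

    if normalize_by_length:
        # Option 1 from the issue: divide by the response length in tokens.
        total /= len(policy_logps)

    return beta * total


# Toy per-token log-probs for a two-token response (made-up numbers).
policy = [-1.0, -2.0]
reference = [-1.5, -2.5]

summed = dpo_reward(policy, reference, beta=1.0)
averaged = dpo_reward(policy, reference, beta=1.0, normalize_by_length=True)
```

With normalization on, a long response no longer accumulates a larger reward magnitude simply because it has more tokens, which matters when the chosen and rejected responses differ a lot in length.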

natolambert added the enhancement label

natolambert assigned ValentinaPy

ljvmiranda921 mentioned this issue

Update per token reward #25

Merged

Collaborator Author

natolambert commented Mar 5, 2024

Length normalization is supported as of #31 but has not been tested.

natolambert mentioned this issue

DPO ref free sweep prep #96

Merged

natolambert closed this as completed in #96
