
Add ArmoRM to RewardBench #135

Merged · 2 commits merged into allenai:main on May 24, 2024

Conversation

@Haoxiang-Wang (Contributor) commented on May 24, 2024

This PR adds a new reward model, ArmoRM, to RewardBench.

  • Architecture: (architecture diagram image not reproduced here)
RewardBench Leaderboard

Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight)
ArmoRM-Llama3-8B-v0.1 | Llama-3 8B | ArmoRM + MoE | 88.97 | 96.9 | 76.8 | 92.2 | 97.3 | 74.3
Cohere May 2024 | Unknown | Unknown | 88.25 | 96.4 | 71.3 | 92.7 | 97.7 | 78.2
GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.25 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9
FsfairX-LLaMA3-RM-v0.1 | Llama-3 8B | Bradley-Terry | 83.61 | 99.4 | 65.1 | 87.8 | 86.4 | 74.9
Starling-RM-34B | Yi-34B | Bradley-Terry | 81.44 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4

Demo Code

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

# We load a random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    # Multi-objective rewards for the response
    multi_obj_rewards = output.rewards.cpu().float()
    # The gating layer's output is conditioned on the prompt
    gating_output = output.gating_output.cpu().float()
    # The preference score for the response, aggregated from the
    # multi-objective rewards with the gating layer
    preference_score = output.score.cpu().float()

# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives.
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards
# with the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3)

# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness', 'beavertails-is_safe',
              'prometheus-score', 'argilla-overall_quality', 'argilla-judge_lm', 'code-complexity',
              'code-style', 'code-explanation', 'code-instruction-following', 'code-readability']

example_index = 0
for i in range(K):
    attribute = attributes[top_obj_dims[example_index, i].item()]
    coeff = top_obj_coeffs[example_index, i].item()
    print(f"{attribute}: {round(coeff, 5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3, 3, 4, 2, 2] for the five helpsteer objectives:
# helpfulness, correctness, coherence, complexity, verbosity.
# We can linearly transform our predicted rewards to the
# original reward space to compare with the ground truth.
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125   2.859375  3.484375  1.3847656 1.296875 ]
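
As a usage follow-up, the same preference score can rank several candidate responses to one prompt. The sketch below reuses model, tokenizer, prompt, response, and device from the demo above; the second candidate string is made up for illustration.

# Rank candidate responses for the same prompt by ArmoRM's preference score
candidates = [response, "Pretty."]  # second candidate: a deliberately terse, made-up reply
scores = []
with torch.no_grad():
    for cand in candidates:
        msgs = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": cand}]
        ids = tokenizer.apply_chat_template(msgs, return_tensors="pt").to(device)
        scores.append(model(ids).score.cpu().float().item())
print(scores)
print("Preferred response:", candidates[scores.index(max(scores))][:60])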

Modified Files

  • rewardbench/models/
    • __init__.py: add config for ArmoRM
    • armorm.py: add ArmoRM pipeline
  • scripts/
    • configs/eval_config.yaml: eval config for ArmoRM
    • run_rm.py:
      • Enable TF32 (to use Tensor Cores on Ampere GPUs)
      • Add a model config option, torch_dtype. Our ArmoRM natively uses torch.bfloat16 (torch.float32 also works, but it requires more GPU memory), and Int-8 quantization makes ArmoRM inference very slow (even slower than FP32). This new option allows evaluating ArmoRM under torch.bfloat16 (see the sketch after this list).
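
The exact wiring inside run_rm.py is not reproduced in this description, so the following is only a minimal sketch of the two changes under stated assumptions: the TF32 switches are the standard PyTorch flags, while the model_config dictionary and its keys are hypothetical stand-ins for the real per-model config in rewardbench/models/__init__.py and eval_config.yaml.

import torch
from transformers import AutoModelForSequenceClassification

# Enable TF32 so matmuls use Tensor Cores on Ampere (and newer) GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Hypothetical per-model config entry (illustrative keys only)
model_config = {"torch_dtype": torch.bfloat16, "trust_remote_code": True}

model = AutoModelForSequenceClassification.from_pretrained(
    "RLHFlow/ArmoRM-Llama3-8B-v0.1",
    device_map="cuda",
    torch_dtype=model_config["torch_dtype"],
    trust_remote_code=model_config["trust_remote_code"],
)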

Evaluation Commands

python scripts/run_rm.py --model=RLHFlow/ArmoRM-Llama3-8B-v0.1 --trust_remote_code
python scripts/run_rm.py --model=RLHFlow/ArmoRM-Llama3-8B-v0.1 --trust_remote_code --pref_sets

@natolambert (Collaborator) left a comment:

Code LGTM if tests + style pass.
If style fails, run:

make style
make quality

@natolambert merged commit 0851402 into allenai:main on May 24, 2024
3 checks passed