
Add ArmoRM to RewardBench #135

Merged · 2 commits merged into allenai:main on May 24, 2024

Conversation

@Haoxiang-Wang (Contributor) commented on May 24, 2024

This PR adds a new reward model, ArmoRM, to RewardBench.

  • Architecture: (architecture diagram image not reproduced here)
RewardBench Leaderboard

Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight)
ArmoRM-Llama3-8B-v0.1 | Llama-3 8B | ArmoRM + MoE | 88.97 | 96.9 | 76.8 | 92.2 | 97.3 | 74.3
Cohere May 2024 | Unknown | Unknown | 88.25 | 96.4 | 71.3 | 92.7 | 97.7 | 78.2
GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.25 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9
FsfairX-LLaMA3-RM-v0.1 | Llama-3 8B | Bradley-Terry | 83.61 | 99.4 | 65.1 | 87.8 | 86.4 | 74.9
Starling-RM-34B | Yi-34B | Bradley-Terry | 81.44 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4

Demo Code

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

# We load a random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)

with torch.no_grad():
    output = model(input_ids)
    # Multi-objective rewards for the response
    multi_obj_rewards = output.rewards.cpu().float()
    # The gating layer's output is conditioned on the prompt
    gating_output = output.gating_output.cpu().float()
    # The preference score for the response, aggregated from the
    # multi-objective rewards with the gating layer
    preference_score = output.score.cpu().float()

# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives.
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards
# with the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3)

# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness', 'beavertails-is_safe',
              'prometheus-score', 'argilla-overall_quality', 'argilla-judge_lm', 'code-complexity',
              'code-style', 'code-explanation', 'code-instruction-following', 'code-readability']

example_index = 0
for i in range(K):
    attribute = attributes[top_obj_dims[example_index, i].item()]
    coeff = top_obj_coeffs[example_index, i].item()
    print(f"{attribute}: {round(coeff, 5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3, 3, 4, 2, 2] for the five helpsteer objectives:
# helpfulness, correctness, coherence, complexity, verbosity.
# We can linearly transform our predicted rewards to the
# original reward space to compare with the ground truth.
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125   2.859375  3.484375  1.3847656 1.296875 ]
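
As a usage follow-up, the same preference score can rank several candidate responses to one prompt. The sketch below reuses model, tokenizer, prompt, response, and device from the demo above; the second candidate string is made up for illustration.

# Rank candidate responses for the same prompt by ArmoRM's preference score
candidates = [response, "Pretty."]  # second candidate: a deliberately terse, made-up reply
scores = []
with torch.no_grad():
    for cand in candidates:
        msgs = [{"role": "user", "content": prompt},
                {"role": "assistant", "content": cand}]
        ids = tokenizer.apply_chat_template(msgs, return_tensors="pt").to(device)
        scores.append(model(ids).score.cpu().float().item())
print(scores)
print("Preferred response:", candidates[scores.index(max(scores))][:60])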

Modified Files

  • rewardbench/models/
    • __init__.py: add config for ArmoRM
    • armorm.py: add ArmoRM pipeline
  • scripts/
    • configs/eval_config.yaml: eval config for ArmoRM
    • run_rm.py:
      • Enable TF32 (to use Tensor Cores on Ampere GPUs)
      • Add a model config option, torch_dtype. Our ArmoRM natively uses torch.bfloat16 (torch.float32 also works, but it requires more GPU memory), and Int-8 quantization makes ArmoRM inference very slow (even slower than FP32). This new option allows evaluating ArmoRM under torch.bfloat16 (see the sketch after this list).
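
The exact wiring inside run_rm.py is not reproduced in this description, so the following is only a minimal sketch of the two changes under stated assumptions: the TF32 switches are the standard PyTorch flags, while the model_config dictionary and its keys are hypothetical stand-ins for the real per-model config in rewardbench/models/__init__.py and eval_config.yaml.

import torch
from transformers import AutoModelForSequenceClassification

# Enable TF32 so matmuls use Tensor Cores on Ampere (and newer) GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Hypothetical per-model config entry (illustrative keys only)
model_config = {"torch_dtype": torch.bfloat16, "trust_remote_code": True}

model = AutoModelForSequenceClassification.from_pretrained(
    "RLHFlow/ArmoRM-Llama3-8B-v0.1",
    device_map="cuda",
    torch_dtype=model_config["torch_dtype"],
    trust_remote_code=model_config["trust_remote_code"],
)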

Evaluation Commands

python scripts/run_rm.py --model=RLHFlow/ArmoRM-Llama3-8B-v0.1 --trust_remote_code
python scripts/run_rm.py --model=RLHFlow/ArmoRM-Llama3-8B-v0.1 --trust_remote_code --pref_sets

@natolambert (Collaborator) left a comment:

Code LGTM if tests + style pass.
If style fails, run:

make style
make quality

@natolambert merged commit 0851402 into allenai:main on May 24, 2024
3 checks passed