This is the source code for the paper "Beyond Reward: Offline Preference-guided Policy Optimization" (OPPO).
The main code is in the `oppo` folder.
It contains two parts:
- `scripted` contains code to reproduce results using preferences generated by a "scripted teacher".
- `human` contains code to train and evaluate OPPO using human-labeled preferences, which come from Preference Transformer; please refer to their codebase for further details and consider citing their paper if needed.
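As background, a "scripted teacher" typically labels preferences synthetically by comparing the ground-truth returns of two trajectory segments. A minimal sketch of that scheme (hypothetical helper name, not this repository's actual API):

```python
import numpy as np

def scripted_preference(seg_a_rewards, seg_b_rewards):
    """Label a preference between two trajectory segments by comparing
    their ground-truth returns (a common scripted-teacher scheme;
    hypothetical helper, not this repo's implementation)."""
    ret_a = float(np.sum(seg_a_rewards))
    ret_b = float(np.sum(seg_b_rewards))
    if ret_a > ret_b:
        return 0    # segment A preferred
    if ret_b > ret_a:
        return 1    # segment B preferred
    return 0.5      # tie: segments equally preferred

# Example: the segment with the higher cumulative reward wins.
print(scripted_preference([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))  # -> 0
```

This removes the need for human annotators when the offline dataset already carries reward signals.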
If you find our work useful, please cite:

```bibtex
@misc{kang2023reward,
    title={Beyond Reward: Offline Preference-guided Policy Optimization},
    author={Yachen Kang and Diyuan Shi and Jinxin Liu and Li He and Donglin Wang},
    year={2023},
    eprint={2305.16217},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
Our code is largely based on Decision Transformer.
Human labels were obtained thanks to Preference Transformer.
Our experiments largely used the D4RL dataset.
The Lift and Can environments come from the Robomimic and Robosuite projects.