Intro

This repository contains the code to replicate the experiments of the paper "ROSARL: Reward-Only Safe Reinforcement Learning". The paper introduces a new framework for safe RL in which the agent learns safe policies solely from scalar rewards, using any suitable RL algorithm. This is achieved by replacing the rewards at unsafe terminal states with the minmax penalty: the strict upper-bound reward whose optimal policy minimises the probability of reaching unsafe states.

Trajectories from learned policies of baselines vs ours

Supported RL Algorithms and General Usage

ROSARL is compatible with any RL algorithm. One can simply estimate the minmax penalty during learning and replace the environment's rewards at unsafe states with it. See learning_minmax_penalty.py for a simple method that estimates the minmax penalty during learning using the value function learned by the RL algorithm.
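
For illustration, below is a minimal sketch of that recipe. The class and wrapper names are made up for this example, the penalty update is one simple choice rather than the exact rule used in the experiments, and detecting unsafe terminal states via Safety Gym's cost signal is an assumption about the environment; see learning_minmax_penalty.py for the actual implementation.

import gym
import numpy as np


class MinmaxPenaltyEstimator:
    """Tracks the range of value estimates produced by the learner and keeps a
    penalty that lies strictly below the smallest value seen so far."""

    def __init__(self):
        self.v_min = np.inf
        self.v_max = -np.inf

    def update(self, value):
        # `value` is any state-value estimate V(s) computed by the RL algorithm.
        self.v_min = min(self.v_min, float(value))
        self.v_max = max(self.v_max, float(value))

    @property
    def penalty(self):
        if not np.isfinite(self.v_min):
            return 0.0
        # One simple choice: push the penalty below the current value range.
        return self.v_min - (self.v_max - self.v_min)


class MinmaxPenaltyWrapper(gym.Wrapper):
    """Replaces the reward with the current penalty estimate whenever an unsafe
    terminal state is reached (detected here via Safety Gym's `cost` signal)."""

    def __init__(self, env, estimator):
        super().__init__(env)
        self.estimator = estimator

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done and info.get('cost', 0) > 0:
            reward = self.estimator.penalty
        return obs, reward, done, info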

Running the Safety Gym experiments

Installation

These experiments use the Safety Gym code (modified to include environments that terminate when unsafe states are reached) and the Safety Starter Agents code (modified to include TRPO-Minmax, which is TRPO modified to use the learned minmax penalty).

First install MuJoCo and OpenAI's mujoco-py bindings. Then install the required packages:

pip install -r requirements.txt

cd safety_ai_gym/safety-gym
pip install -e .

cd ../safety-starter-agents
pip install -e .

Getting Started

Example Script: To run TRPO-Minmax on the Safexp-PointGoal1-TerminalUnsafe-v0 environment from Safety Gym, using neural networks of size (64,64):

from safe_rl import trpo_minmax
import gym, safety_gym

trpo_minmax(
	env_fn = lambda : gym.make('Safexp-PointGoal1-TerminalUnsafe-v0'),
	ac_kwargs = dict(hidden_sizes=(64,64))
	)

Reproduce Experiments from Paper: To reproduce an experiment from the paper, run:

cd /path/to/safety-starter-agents/scripts
python experiment.py --algo ALGO --task TASK --robot ROBOT --unsafe_terminate UNSAFE_TERMINATE --seed SEED \
	--exp_name EXP_NAME --cpu CPU

where

  • ALGO is in ['ppo', 'ppo_lagrangian', 'trpo', 'trpo_lagrangian', 'cpo'].
  • TASK is in ['goal1', 'goal2', 'button1', 'button2', 'push1', 'push2'].
  • ROBOT is in ['point', 'car', 'doggo'].
  • UNSAFE_TERMINATE is in [0, 1, 2]. 0 uses the original Safety Gym environments, which do not terminate when unsafe states are reached; 2 uses the modified environments, which terminate when unsafe states are reached.
  • SEED is an integer. The paper experiments used seeds 0, 10, and 20, but results may not reproduce perfectly deterministically across machines.
  • CPU is an integer giving the number of CPUs to parallelize across.

EXP_NAME is an optional argument for the name of the folder where results will be saved. The save folder will be placed in /path/to/safety-starter-agents/data.
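
For example, a hypothetical invocation (the experiment name here is illustrative) that runs TRPO on the point-robot Goal1 task with terminating unsafe states, seed 0, and 4 CPUs:

cd /path/to/safety-starter-agents/scripts
python experiment.py --algo trpo --task goal1 --robot point --unsafe_terminate 2 --seed 0 --exp_name trpo_point_goal1 --cpu 4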

Plot Results: Plot results with:

cd /path/to/safety-starter-agents/scripts
python plot.py data/path/to/experiment

Watch Trained Policies: Test policies with:

cd /path/to/safety-starter-agents/scripts
python enjoy.py data/path/to/experiment
(Table of example trajectory videos: success and failure cases for TRPO, TRPO-Lagrangian, CPO, and TRPO-Minmax (ours).)

Cite the Paper

@article{NangueTasse2023,
    author = {Nangue Tasse, Geraud and Love, Tamlin and Nemecek, Mark and James, Steven and Rosman, Benjamin},
    title = {{ROSARL: Reward-Only Safe Reinforcement Learning}},
    year = {2023}
}
