Safe Reinforcement Learning with Contextual Information: Theory and Application to Comorbidity Management
Authors' names blinded for peer review
This repository contains the implementation of the Contextual Optimistic-Pessimistic Safe exploration (COPS)
reinforcement learning algorithm. We also implement and compare against baseline safe RL methods, including DOPE, OptCMDP, and OptPessLP. COPS is tested on the ACCORD dataset for blood pressure (BP) and blood glucose (BG) management, as well as on a synthetic inventory control problem.
```bash
# for Linux machines
conda env create -f environment_linux.yml

# activate the environment
conda activate cops
```

You may optionally install the Gurobi optimization solver for faster computation:

```bash
conda config --add channels https://conda.anaconda.org/gurobi
conda install gurobi
```

After applying for an academic license in your Gurobi account, run the license retrieval command it provides. To remove the Gurobi solver, run:

```bash
conda remove gurobi
```
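If you install Gurobi, a quick way to confirm that the installation and license work is to solve a trivial model with `gurobipy`. This check is not part of the repository; it is only a convenience:

```python
# Optional sanity check for the Gurobi installation (not part of this repository).
try:
    import gurobipy as gp
    from gurobipy import GRB

    m = gp.Model("license_check")
    x = m.addVar(lb=0.0, ub=1.0, name="x")
    m.addConstr(x <= 0.5, name="cap")
    m.setObjective(x, GRB.MAXIMIZE)
    m.optimize()
    print("Gurobi is working, objective =", m.ObjVal)  # expect 0.5
except Exception as exc:  # ImportError or a license error
    print("Gurobi unavailable; the scripts also run without the Gurobi flag:", exc)
```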
In this work, we evaluate COPS on both the ACCORD dataset and a synthetic inventory control problem. The synthetic inventory control problem is in the `InventoryControl/` folder.

Create the corresponding datasets for each case by running the `create_datasets_contextual.ipynb` notebook in each folder. The datasets will be saved in the `data/` folder.
- Navigate to the `COPS/` folder and run `create_datasets_contextual.ipynb` to create the input data files:
  - Train the true offline models for P, R, and C using all data.
- Run `model_contextual.ipynb` to:
  - take the input data files created above
  - discretize the context features (see the sketch after this list)
  - set up the basic model settings, action space, state space, etc.
  - estimate P the same way as DOPE (empirical estimates) and output the model settings to `output/model_contextual.pkl`
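Since `model_contextual.ipynb` discretizes the context features, here is a minimal sketch of what such a step can look like, assuming the contexts live in a pandas DataFrame; the feature names and bin count below are placeholders, not the notebook's actual settings:

```python
# Hedged sketch: quantile-based discretization of continuous context features.
# Column names and n_bins are illustrative only.
import pandas as pd

def discretize_context(df: pd.DataFrame, feature_cols, n_bins: int = 3) -> pd.DataFrame:
    """Map each continuous context feature to an integer bin index."""
    binned = pd.DataFrame(index=df.index)
    for col in feature_cols:
        # qcut gives roughly equally populated bins; duplicates="drop" handles ties
        binned[col] = pd.qcut(df[col], q=n_bins, labels=False, duplicates="drop")
    return binned

# example with hypothetical feature names:
# context_bins = discretize_context(accord_df, ["age", "bmi", "baseline_sbp"])
```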
- Run `train_feedback_estimators.ipynb` to get the offline R and C estimators (see the sketch after this list):
  - takes `../data/ACCORD_BPBGClass_v2_contextual.csv` as input
  - R (CVDRisk): logistic regression with features (context_vector, state, action)
  - C (SBP or Hba1c): linear regression with features (context_vector, action)
  - the offline R and C models are saved to `output/CVDRisk_estimator.pkl`, `output/A1C_feedback_estimator.pkl`, and `output/SBP_feedback_estimator.pkl`
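For reference, the two offline feedback estimators described above can be fit along these lines with scikit-learn; the array names and shapes are placeholders rather than the notebook's exact variables:

```python
# Hedged sketch of the offline R and C estimators (placeholder variable names).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def fit_feedback_estimators(context, state, action, cvd_event, feedback):
    """context/state/action: 2-D feature arrays with one row per record.
       cvd_event: binary labels for R; feedback: continuous SBP or HbA1c values for C."""
    # R (CVDRisk): logistic regression on (context_vector, state, action)
    r_model = LogisticRegression(max_iter=1000).fit(
        np.hstack([context, state, action]), cvd_event)
    # C (SBP or HbA1c): linear regression on (context_vector, action)
    c_model = LinearRegression().fit(np.hstack([context, action]), feedback)
    return r_model, c_model
```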
- Run `python Contextual.py` or `python Contextual.py 1` to run the main contextual algorithm; pass `1` to use the GUROBI solver.
- Navigate to the `Benchmarks/` folder.
- Run `model.ipynb` to:
  - set up the state features and action features for the state space and action space
  - get empirical estimates of P, R, and C from the dataset and save them to `output/model.pkl`
  - set up CONSTRAINTS and the baseline Constraint
  - solve the optimal policy and save it to `output/solution.pkl` (see the LP sketch after this list)
  - solve the baseline policy and save it to `output/base.pkl`
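For intuition on the "solve the optimal policy" step, below is a minimal occupancy-measure LP for a small constrained MDP, written with `scipy` in a discounted setting for brevity. It illustrates the idea only; it is not the repository's exact (episodic) formulation or its Gurobi-based solver:

```python
# Hedged sketch: solve a small constrained MDP via its occupancy-measure LP.
import numpy as np
from scipy.optimize import linprog

def solve_cmdp(P, R, C, cost_budget, mu0, gamma=0.95):
    """P: (S, A, S) transition tensor, R/C: (S, A) reward and cost, mu0: initial distribution.
       Returns a policy pi(a|s) and the (1 - gamma)-normalized optimal value."""
    S, A = R.shape
    # flow constraints on the occupancy measure q(s, a):
    # sum_a q(s', a) - gamma * sum_{s,a} P(s'|s,a) q(s,a) = (1 - gamma) * mu0(s')
    A_eq = np.zeros((S, S * A))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] -= gamma * P[s, a, sp]
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
    b_eq = (1 - gamma) * mu0
    # expected (normalized) cost must respect the budget: sum_{s,a} q(s,a) C(s,a) <= cost_budget
    res = linprog(c=-R.reshape(-1),
                  A_ub=C.reshape(1, -1), b_ub=np.array([cost_budget]),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    q = res.x.reshape(S, A)
    policy = q / np.maximum(q.sum(axis=1, keepdims=True), 1e-12)  # pi(a|s) proportional to q(s,a)
    return policy, -res.fun
```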
- `UtilityMethods.py` defines the `utils` class, which:
  - provides the linear programming solver
  - updates the empirical estimate of P during exploration (see the sketch below)
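The "update the empirical estimate of P" step is essentially count-based; a minimal sketch of that idea, with illustrative names rather than the actual `utils` interface, is:

```python
# Hedged sketch: count-based empirical transition estimate updated during exploration.
import numpy as np

class TransitionEstimator:
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))

    def update(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.counts[s, a, s_next] += 1

    def p_hat(self):
        """Empirical P_hat(s'|s,a); unvisited (s, a) pairs fall back to a uniform distribution."""
        totals = self.counts.sum(axis=2, keepdims=True)
        uniform = 1.0 / self.counts.shape[2]
        return np.where(totals > 0, self.counts / np.maximum(totals, 1.0), uniform)
```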
- Then run `python DOPE.py` to run the main DOPE algorithm; use `python DOPE.py 1` to use the GUROBI solver.
  - computes the objective regrets and constraint regrets of the learned policy (see the sketch below)
  - saves `DOPE_opsrl_RUNNUMBER.pkl` and `regrets_RUNNUMBER.pkl`
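The regret bookkeeping can be summarized as follows; this is one common definition with hypothetical array names, and the exact quantities stored in the `.pkl` files may differ:

```python
# Hedged sketch: cumulative objective and constraint regrets over episodes.
import numpy as np

def cumulative_regrets(episode_values, episode_costs, optimal_value, cost_budget):
    """Objective regret: accumulated shortfall of each episode's value vs. the optimal
       constrained value. Constraint regret: accumulated violation of the cost budget."""
    obj_regret = np.cumsum(optimal_value - np.asarray(episode_values))
    cons_regret = np.cumsum(np.maximum(np.asarray(episode_costs) - cost_budget, 0.0))
    return obj_regret, cons_regret
```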
- Plot the Objective Regret and Constraint Regret (see the plotting sketch after this list):
  - run `python plot1.py output/DOPE_opsrl150.pkl 30000`
  - the plots are saved in the `output/` folder
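Continuing the hypothetical arrays from the regret sketch above, the plots produced by `plot1.py` are essentially curves of this form (this snippet is illustrative and is not `plot1.py` itself):

```python
# Hedged sketch: plot cumulative regret curves over episodes.
import matplotlib.pyplot as plt

def plot_regrets(obj_regret, cons_regret):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].plot(obj_regret)
    axes[0].set(xlabel="Episode", ylabel="Cumulative Objective Regret")
    axes[1].plot(cons_regret)
    axes[1].set(xlabel="Episode", ylabel="Cumulative Constraint Regret")
    fig.tight_layout()
    plt.show()
```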
- Use the same model preparation scheme as in DOPE by running `model.ipynb`. If you have already run it for DOPE, there is no need to run it again.
- Run `python OptCMDP.py 1` to run the OptCMDP algorithm:
  - similar to DOPE, but it does not run K0 episodes of the baseline policy; instead, it solves the Extended LP directly
  - it chooses a random policy for the first episode to get started
- Run `python plot1.py output/OptCMDP_opsrl10.pkl 10000`
  - you should expect to see a non-zero Constraint Regret
- Use the same model preparation scheme as in DOPE by running `model.ipynb`. If you have already run it for DOPE, there is no need to run it again.
- Run `python OptPessLP.py` to run the algorithm (see the sketch after this list):
  - similar to DOPE, but it decides whether to run the baseline policy based on the estimated cost of the baseline policy
  - if the estimated cost is too high, it runs the baseline policy; otherwise it solves the Extended LP
  - the radius is larger and there is no tuning parameter
- Run `python plot1.py output/OptPessLP_opsrl10.pkl 10000`
  - you should expect to see Objective Regret growing linearly with episodes and zero Constraint Regret
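The episode-selection rule described above amounts to a simple check; the sketch below is a paraphrase with placeholder names, not code from `OptPessLP.py`:

```python
# Hedged sketch of the OptPessLP episode-selection rule (placeholder names).
def choose_policy_for_episode(estimated_baseline_cost, cost_threshold):
    """Fall back to the safe baseline when the estimated cost is too high;
       otherwise solve the Extended LP for an optimistic exploratory policy."""
    if estimated_baseline_cost > cost_threshold:
        return "baseline"
    return "extended_lp"
```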
- Navigate to the `COPS` folder and run `model_contextual.ipynb` again to generate `CONTEXT_VECTOR_dict`: a dict keyed by each patient's MaskID, with values `context_fea`, `init_state`, `state_index_recorded`, and `action_index_recorded`.
  - save the trained `knn_model`, `knn_model_label`, and `scaler`, which are used to select the Clinician's actions for unmatched states (see the sketch below)
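The kNN-based fallback for clinician actions can be sketched as follows with scikit-learn, assuming feature rows built from (context, state) and labels taken from the recorded action indices; this mirrors the saved `scaler` and `knn_model`, but the exact features used in the notebook may differ:

```python
# Hedged sketch: scaler + kNN classifier used to pick a clinician action for unmatched states.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

def fit_clinician_knn(features, recorded_action_indices, n_neighbors=5):
    """features: one row of (context, state) features per record;
       recorded_action_indices: the clinician's action index for that record."""
    scaler = StandardScaler().fit(features)
    knn_model = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_model.fit(scaler.transform(features), recorded_action_indices)
    return scaler, knn_model

def clinician_action(scaler, knn_model, query_features):
    """Predict the clinician action for states with no exact match in the data."""
    return knn_model.predict(scaler.transform(query_features))
```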
- Run `python Contextual_test.py 1` to run the simulation of COPS versus the Clinician.
  - The difference between solving the constrained LP problem and solving the extended LP problem is that the extended LP adds equations relating to the confidence bound of the estimated P_hat. During testing we still call the `compute_extended_LP()` function, but we skip the confidence-bound equations by setting `Inference` to `True`.
  - The results are saved to `output_final/Contextual_test_BPClass.csv`.
This code base is developed based on the following project. We sincerely thank the authors for open-sourcing it.