# Oil Environment Code Demo

This problem, adapted from [here](https://www.pnas.org/content/109/3/764), is a continuous variant of the “Grid World” environment. An agent surveys a d-dimensional map in search of hidden “oil deposits”. The world is endowed with an unknown survey function that encodes the probability of observing oil at each location. To move to a new location, the agent pays a cost proportional to the distance moved, and surveying the land produces noisy estimates of the true value of that location. In addition, due to varying terrain, the location the agent actually reaches is perturbed as a function of the state and action.


Formally, this is a $d$-dimensional reinforcement learning environment with state space $X = [0, 1]^d$.  The action space is $A = [0,1]^d$, corresponding to the ability to attempt a move to any desired location within the state space.  A reward function $f_h(x,a)$ gives the reward for moving the agent to that location.  Moving also incurs an additional cost $\alpha d(x,a)$ that scales with the distance moved.
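
The definitions above can be sketched as a single environment step. This is only an illustration under stated assumptions: the quadratic `reward_fn`, the use of the $\ell_\infty$ distance for $d(x,a)$, the additive Gaussian terrain noise, and the names `alpha` and `noise_scale` are all hypothetical and are not taken from the package.

```python
import numpy as np

def reward_fn(location):
    """Hypothetical survey function: quadratic, peaked at the centre of [0, 1]^d."""
    return 1.0 - float(np.mean((location - 0.5) ** 2))

def step(state, action, alpha=0.1, noise_scale=0.0, rng=None):
    """One illustrative transition: aim for `action`, pay for the distance, collect a reward."""
    rng = np.random.default_rng() if rng is None else rng
    state = np.asarray(state, dtype=float)
    action = np.asarray(action, dtype=float)
    # Assumed movement cost: alpha times the l_inf distance between state and target.
    cost = alpha * float(np.max(np.abs(action - state)))
    # Assumed terrain effect: additive Gaussian noise, clipped back into [0, 1]^d.
    noise = noise_scale * rng.standard_normal(action.shape)
    new_state = np.clip(action + noise, 0.0, 1.0)
    # Assumed reward: survey value at the location actually reached, minus the cost.
    reward = reward_fn(new_state) - cost
    return new_state, reward

new_state, reward = step(state=[0.2], action=[0.5], alpha=0.1)
print(new_state, reward)
```

With `noise_scale=0.0` the agent lands exactly on its target; a positive value perturbs the landing point while keeping it inside the unit cube.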

In this notebook we run a sample experiment for the setting when $d = 1$ and the reward function is taken to be a quadratic.  We compare several heuristics to existing reinforcement learning algorithms.

Here is an example illustrating the problem on a one-dimensional line:

![Oil_Line_Diagram](diagrams/oil_line_diagram.png)
    
* Assuming a reasonable cost to move, the agent will likely want to move towards the right. If movement is heavily penalized, the agent could choose to stay in place or possibly move to the left.
* Exactly how far the agent moves is determined by the cost to move.
* Finally, the agent may not end up exactly at its target location, because the “terrain” perturbs its movement.
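
The trade-off described in these bullets can be checked numerically. The sketch below assumes a hypothetical quadratic survey function peaked at 0.75 and an agent starting at 0.25 (neither value comes from the diagram or the package), ignores terrain noise, and searches a fine grid for the target that maximizes survey value minus movement cost.

```python
import numpy as np

def survey_value(x, peak=0.75):
    """Hypothetical quadratic survey function on [0, 1], maximal at `peak`."""
    return 1.0 - (x - peak) ** 2

def best_target(state, alpha, num_points=1001):
    """Grid search for the target location with the highest net one-step reward."""
    grid = np.linspace(0.0, 1.0, num_points)
    net_reward = survey_value(grid) - alpha * np.abs(grid - state)
    return float(grid[np.argmax(net_reward)])

for alpha in (0.0, 0.5, 2.0):
    print(alpha, best_target(state=0.25, alpha=alpha))
```

With no movement cost the agent goes all the way to the peak, a moderate cost makes it stop part of the way there (about 0.5 in this setup), and a heavy cost makes staying at 0.25 the best choice.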

### Package Installation

In [1]:
import or_suite
import numpy as np

import copy

import os
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy
import pandas as pd


import gym

### Experiment Parameters

Here we use the oil environment as outlined in `or_suite/envs/oil_discovery/oil_environment.py`.  The package provides default specifications for all of the environments in the file `or_suite/envs/env_configs.py`, and so we use one of the defaults.

In addition, we need to specify the number of episodes for learning, and the number of iterations (in order to plot average results with confidence intervals).
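
As a rough illustration of why several iterations are needed, the sketch below averages per-episode rewards across independent iterations and forms a normal-approximation 95% confidence interval. The reward data here is synthetic, and this is not necessarily how `or_suite.plots` computes its intervals.

```python
import numpy as np

rng = np.random.default_rng(0)
num_iters, n_eps = 30, 300

# Synthetic data: one row of per-episode rewards for each independent iteration.
rewards = rng.normal(loc=4.0, scale=0.5, size=(num_iters, n_eps))

# Average over iterations to get one learning curve.
mean = rewards.mean(axis=0)
# Standard error of the mean across iterations (sample standard deviation).
sem = rewards.std(axis=0, ddof=1) / np.sqrt(num_iters)
# Normal-approximation 95% interval around the mean curve.
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem

print(mean[:3], lower[:3], upper[:3])
```

More iterations shrink the standard error at a rate of one over the square root of the iteration count, which narrows the plotted band.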

In [2]:
CONFIG =  or_suite.envs.env_configs.oil_environment_default_config

epLen = CONFIG['epLen']
nEps = 300
numIters = 30

epsilon = (nEps * epLen)**(-1 / 4)
action_net = np.arange(start=0, stop=1, step=epsilon)
state_net = np.arange(start=0, stop=1, step=epsilon)

scaling_list = [0.1, 0.3, 1, 5]

DEFAULT_SETTINGS = {'seed': 1, 
                    'recFreq': 1, 
                    'dirPath': '../data/oil/', 
                    'deBug': False, 
                    'nEps': nEps, 
                    'numIters': numIters, 
                    'saveTrajectory': True, 
                    'epLen': epLen,
                    'render': False,
                    'pickle': False
                    }
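
To make the discretization concrete, the sketch below recomputes `epsilon` and the resulting net for the values used here (`nEps = 300` and an episode length of 5). The `nearest_net_point` helper is hypothetical and only illustrates how a continuous location could be snapped to a net; it is not the lookup used inside the uniform discretization agents.

```python
import numpy as np

n_eps, ep_len = 300, 5  # values used in this notebook
epsilon = (n_eps * ep_len) ** (-1 / 4)
net = np.arange(start=0, stop=1, step=epsilon)

def nearest_net_point(x, net):
    """Return the index and value of the net point closest to x (illustrative only)."""
    idx = int(np.argmin(np.abs(net - x)))
    return idx, float(net[idx])

print(round(epsilon, 4), len(net))
print(nearest_net_point(0.5, net))
```

Because `epsilon` shrinks only as the fourth root of the total number of steps, the net stays coarse: roughly 0.16 spacing and 7 points per dimension in this setting.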



### Specifying Agent

We specify 6 different agents to compare the effectiveness of each.

* `SB PPO` is Proximal Policy Optimization. When the policy is updated, a clipping parameter limits each update so that the new policy does not move too far from the old one.
* `Random` is a baseline agent that selects an action uniformly at random from the action space. In particular, it stores an internal copy of the environment’s action space and samples uniformly at random from it.
* `AdaQL` is an Adaptive Discretization Model-Free Agent, implemented for environments with continuous states and actions using the metric induced by the $\ell_\infty$ norm.
* `AdaMB` is an Adaptive Discretization Model-Based Agent, implemented for environments with continuous states and actions using the metric induced by the $\ell_\infty$ norm.
* `Unif QL` is an eNet Model-Free Agent, implemented for environments with continuous states and actions using the metric induced by the $\ell_\infty$ norm.
* `Unif MB` is an eNet Model-Based Agent, implemented for environments with continuous states and actions using the metric induced by the $\ell_\infty$ norm.
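
The clipping mentioned for `SB PPO` can be sketched in a few lines of NumPy. This is the standard clipped surrogate objective from the PPO literature written out for illustration, not code taken from Stable-Baselines3; the function name and the default `clip_range=0.2` are assumptions made for this example.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # Taking the minimum removes any incentive to push the ratio outside the clip range.
    return np.minimum(unclipped, clipped)

# ratio = new_policy_prob / old_policy_prob for the sampled actions
print(clipped_surrogate(ratio=[1.5, 0.5, 1.0], advantage=[1.0, -1.0, 2.0]))
```

When the probability ratio leaves the interval $[1 - \epsilon, 1 + \epsilon]$ in the direction that would increase the objective, the objective stops growing, so a single update cannot move the policy arbitrarily far.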

In [3]:
oil_env = gym.make('Oil-v0', config=CONFIG)
mon_env = Monitor(oil_env)
dim = CONFIG['dim']
cost_param = CONFIG['cost_param']
prob = CONFIG['oil_prob']

agents = {
    'SB PPO': PPO(MlpPolicy, mon_env, gamma=1, verbose=0, n_steps=epLen),
    'Random': or_suite.agents.rl.random.randomAgent(),
    'AdaQL': or_suite.agents.rl.ada_ql.AdaptiveDiscretizationQL(epLen, scaling_list[0], True, dim*2),
    'AdaMB': or_suite.agents.rl.ada_mb.AdaptiveDiscretizationMB(epLen, scaling_list[0], 0, 2, True, True, dim, dim),
    'Unif QL': or_suite.agents.rl.enet_ql.eNetQL(action_net, state_net, epLen, scaling_list[0], (dim,dim)),
    'Unif MB': or_suite.agents.rl.enet_mb.eNetMB(action_net, state_net, epLen, scaling_list[0], (dim,dim), 0, False),
}

We recommend using a `batch_size` that is a multiple of `n_steps * n_envs`.
Info: (n_steps=5 and n_envs=1)


### Running Algorithm

In [4]:
path_list_line = []
algo_list_line = []
path_list_radar = []
algo_list_radar= []

for agent in agents:
    print(agent)
    DEFAULT_SETTINGS['dirPath'] = '../data/oil_metric_'+str(agent)+'_'+str(dim)+'_'+str(cost_param)+'_'+str(prob.__name__)+'/'
    if agent == 'SB PPO':
        or_suite.utils.run_single_sb_algo(mon_env, agents[agent], DEFAULT_SETTINGS)
    elif agent in ('AdaQL', 'Unif QL', 'AdaMB', 'Unif MB'):
        or_suite.utils.run_single_algo_tune(oil_env, agents[agent], scaling_list, DEFAULT_SETTINGS)
    else:
        or_suite.utils.run_single_algo(oil_env, agents[agent], DEFAULT_SETTINGS)

    path_list_line.append('../data/oil_metric_'+str(agent)+'_'+str(dim)+'_'+str(cost_param)+'_'+str(prob.__name__))
    algo_list_line.append(str(agent))
    if agent != 'SB PPO':
        path_list_radar.append('../data/oil_metric_'+str(agent)+'_'+str(dim)+'_'+str(cost_param)+'_'+str(prob.__name__))
        algo_list_radar.append(str(agent))

SB PPO
New Experiment Run
Iteration: 0
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 6
Iteration: 7
Iteration: 8
Iteration: 9
Iteration: 10
Iteration: 11
Iteration: 12
Iteration: 13
Iteration: 14
Iteration: 15
Iteration: 16
Iteration: 17
Iteration: 18
Iteration: 19
Iteration: 20
Iteration: 21
Iteration: 22
Iteration: 23
Iteration: 24
Iteration: 25
Iteration: 26
Iteration: 27
Iteration: 28
Iteration: 29
[3.730179041675991, 4.232383375720717, 3.894319968816894, 4.0295688825610165, 3.6174418137157875, 4.0375409771444595, 4.092747932608935, 3.8854259932758883, 3.456887291875731, 3.9806669113812263, 4.184683082245084, 3.4864790401640966, 4.087131759092259, 4.1316540460690545, 3.407259457235617, 3.1791303327801352, 3.9403424597083005, 3.7112092283939044, 3.6075100140631826, 3.697876046157389, 3.914475000965339, 3.9536234062796622, 3.793679610114044, 3.7838666385704984, 4.044116723476524, 4.457340173227423, 2.9674245656476255, 4.053288418734922, 3.850174227193724

Writing to file data.csv
AdaQL
Chosen parameters: 0.3
Writing to file data.csv
0.3
AdaMB
Chosen parameters: 0.3
Writing to file data.csv
0.3
Unif QL
Chosen parameters: 0.1
Writing to file data.csv
0.1
Unif MB
Chosen parameters: 0.1
Writing to file data.csv
0.1
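
The “Chosen parameters” lines report which value from `scaling_list` was selected for each tunable agent. How `run_single_algo_tune` makes that choice is not shown in this notebook; the sketch below assumes a simple rule that keeps the scaling with the highest average reward, and the `evaluate` callback is a hypothetical stand-in for running a full experiment.

```python
def tune_scaling(evaluate, scaling_list):
    """Return the scaling with the highest score, together with that score."""
    best_scaling, best_reward = None, float("-inf")
    for scaling in scaling_list:
        reward = evaluate(scaling)  # hypothetical: average reward for this scaling
        if reward > best_reward:
            best_scaling, best_reward = scaling, reward
    return best_scaling, best_reward

def toy_evaluate(scaling):
    """Toy stand-in for an experiment; its score peaks at a scaling of 0.3."""
    return -(scaling - 0.3) ** 2

chosen, score = tune_scaling(toy_evaluate, [0.1, 0.3, 1, 5])
print("Chosen parameters:", chosen)
```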


In [5]:
fig_path = '../figures/'
fig_name = 'oil_metric'+'_'+str(dim)+'_'+str(cost_param)+'_'+str(prob.__name__)+'_line_plot'+'.pdf'
or_suite.plots.plot_line_plots(path_list_line, algo_list_line, fig_path, fig_name, int(nEps / 40)+1)

additional_metric = {}
fig_name = 'oil_metric'+'_'+str(dim)+'_'+str(cost_param)+'_'+str(prob.__name__)+'_radar_plot'+'.pdf'
or_suite.plots.plot_radar_plots(path_list_radar, algo_list_radar,
fig_path, fig_name,
additional_metric
)

  Algorithm    Reward      Time        Space
0    Random  3.755120  6.519665 -3614.000000
1     AdaQL  4.905586  5.512802 -3600.800000
2     AdaMB  4.670059  5.737919 -3684.266667
3   Unif QL  4.613791  6.134131 -3567.466667
4   Unif MB  4.843150  5.119956 -3660.000000
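
To compare the agents directly, the printed table can be loaded into a pandas DataFrame and ranked. The values below are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

results = pd.DataFrame({
    "Algorithm": ["Random", "AdaQL", "AdaMB", "Unif QL", "Unif MB"],
    "Reward": [3.755120, 4.905586, 4.670059, 4.613791, 4.843150],
    "Time": [6.519665, 5.512802, 5.737919, 6.134131, 5.119956],
    "Space": [-3614.000000, -3600.800000, -3684.266667, -3567.466667, -3660.000000],
})

# Rank agents from highest to lowest average reward.
ranked = results.sort_values("Reward", ascending=False).reset_index(drop=True)
print(ranked)

best_reward = ranked.loc[0, "Algorithm"]
smallest_time = results.loc[results["Time"].idxmin(), "Algorithm"]
print(best_reward, smallest_time)
```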


In [6]:
from IPython.display import IFrame
IFrame("../figures/oil_metric_1_0_<lambda>_line_plot.pdf", width=600, height=280)

In [7]:
IFrame("../figures/oil_metric_1_0_<lambda>_radar_plot.pdf", width=600, height=450)

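In the table above, AdaQL attains the highest average reward (4.91), with the uniform discretization model-based algorithm (Unif MB) close behind (4.84). Unif MB also reports the smallest Time value of the agents compared, so it offers competitive reward at the lowest evaluation time.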