# Preview

This example uses our toolbox to train and evaluate the [Trajectory-based Dynamics Model](https://arxiv.org/abs/2012.09156) in the reacher environment.

In [1]:
from IPython import display
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import torch
import omegaconf

import mbrl.env.pets_reacher as reacher_env
import mbrl.env.cartpole_continuous as cartpole_env
import mbrl.env.reward_fns as reward_fns
import mbrl.env.termination_fns as termination_fns
import mbrl.models as models
import mbrl.planning as planning
import mbrl.util.common as common_util


%load_ext autoreload
%autoreload 2

mpl.rcParams.update({"font.size": 16})

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'


# Creating the environment

First we instantiate the environment.

In [3]:
seed = 0
env = reacher_env.Reacher3DEnv()
env.seed(seed)
rng = np.random.default_rng(seed=seed)
generator = torch.Generator(device=device)
generator.manual_seed(seed)
obs_shape = env.observation_space.shape
act_shape = env.action_space.shape


# Hydra configuration

MBRL-Lib uses [Hydra](https://github.com/facebookresearch/hydra) to manage configurations. For the purpose of this example, you can think of the configuration object as a dictionary of key/value pairs (also accessible as attributes) that specify the model and algorithmic options. Our toolbox expects the configuration object to be organized as follows:

In [4]:
trial_length = 200
num_trials = 10
ensemble_size = 5

# Everything with "???" indicates an option with a missing value.
# Our utility functions will fill in these details using the 
# environment information
cfg_dict = {
    # dynamics model configuration
    "dynamics_model": {
        "_target_": "mbrl.models.TrajBasedMLP",
        "device": device,
        "num_layers": 3,
        "ensemble_size": ensemble_size,
        "hid_size": 200,
        "in_size": "???",
        "out_size": "???",
        "deterministic": False,
        "propagation_method": "fixed_model",
        # can also configure activation function for GaussianMLP
        "activation_fn_cfg": {
            "_target_": "torch.nn.LeakyReLU",
            "negative_slope": 0.01
        }
    },
    # options for training the dynamics model
    "algorithm": {
        "learned_rewards": False,
        "target_is_delta": False, # trajectory based model predicts states directly
        "normalize": True,
    },
    # these are experiment specific options
    "overrides": {
        "trial_length": trial_length,
        "num_steps": num_trials * trial_length,
        "model_batch_size": 32,
        "validation_ratio": 0.05
    }
}
cfg = omegaconf.OmegaConf.create(cfg_dict)
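
To make the role of the `"???"` placeholders concrete, here is a minimal sketch of how such values could be filled in from the environment shapes. It works on a plain dictionary; `fill_missing_sizes` is a hypothetical helper written for illustration, not the toolbox's own utility, and the sizing rule (observation plus action dimensions in, observation dimensions plus an optional reward out) and the shapes used are assumptions.

```python
import copy

MISSING = "???"  # same placeholder string used in cfg_dict above


def fill_missing_sizes(cfg_dict, obs_shape, act_shape):
    """Hypothetical helper: return a copy of cfg_dict with in_size/out_size set.

    Assumed rule: the model input is an observation concatenated with an
    action-like vector, and the output is an observation plus (optionally)
    one reward value.
    """
    filled = copy.deepcopy(cfg_dict)  # leave the caller's dictionary untouched
    model_cfg = filled["dynamics_model"]
    learned_rewards = filled["algorithm"].get("learned_rewards", False)
    if model_cfg.get("in_size") == MISSING:
        model_cfg["in_size"] = obs_shape[0] + act_shape[0]
    if model_cfg.get("out_size") == MISSING:
        model_cfg["out_size"] = obs_shape[0] + int(learned_rewards)
    return filled


# Illustrative shapes only; in the notebook they come from the environment.
example_cfg = {
    "dynamics_model": {"in_size": MISSING, "out_size": MISSING},
    "algorithm": {"learned_rewards": False},
}
filled_cfg = fill_missing_sizes(example_cfg, obs_shape=(17,), act_shape=(7,))
print(filled_cfg["dynamics_model"])
```

Values that are already set are left alone, so explicit sizes in the configuration take precedence over the inferred ones.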

# Creating a dynamics model

Given the configuration above, the following two lines of code (left commented out in this preview) create a wrapper for 1-D transition reward models and a gym-like environment that wraps it, which we can use to simulate the real environment. The 1-D model wrapper takes care of creating the input/output data tensors for the underlying NN model (by concatenating observations, actions and rewards appropriately), normalizing the input data to the model, and other data processing tasks (e.g., converting observation targets to deltas with respect to the input observation). Note that `term_fn` and `reward_fn` must be defined before the second line can be run.

In [5]:
# Create a 1-D dynamics model for this environment
# dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)

# Create a gym-like environment to encapsulate the model
# model_env = models.ModelEnv(env, dynamics_model, term_fn, reward_fn, generator=generator)
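
The data processing described above can be sketched with plain Python lists. This only illustrates the three steps (concatenation, input normalization, optional delta targets); it is not the wrapper's implementation, and all function names here are hypothetical.

```python
import statistics


def make_input(obs, action):
    """Concatenate an observation and an action into one model input row."""
    return list(obs) + list(action)


def fit_normalizer(rows):
    """Compute per-column mean and standard deviation over a batch of rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns have zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(row, means, stds):
    """Standardize one input row with the fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def make_target(obs, next_obs, target_is_delta):
    """Return either the next observation or its difference from the input."""
    if target_is_delta:
        return [n - o for o, n in zip(obs, next_obs)]
    return list(next_obs)


obs_batch = [[0.0, 1.0], [2.0, 3.0]]
act_batch = [[1.0], [1.0]]
inputs = [make_input(o, a) for o, a in zip(obs_batch, act_batch)]
means, stds = fit_normalizer(inputs)
normalized_first = normalize(inputs[0], means, stds)
delta_target = make_target([0.0, 1.0], [0.5, 1.5], target_is_delta=True)
direct_target = make_target([0.0, 1.0], [0.5, 1.5], target_is_delta=False)
```

With `"target_is_delta": False`, as in the configuration above, the target corresponds to the `direct_target` case.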

# PID Agent

The following function creates a PID agent with randomly sampled parameters, one set per action dimension: proportional gains are drawn uniformly from [0, 5), integral gains are fixed at zero, derivative gains are drawn uniformly from [0, 1), and targets are drawn uniformly from [-1, 1). These parameters later serve as part of the input to the trajectory-based model.

In [26]:
def create_pid_agent(action_dim):
    P = np.random.rand(action_dim) * 5
    I = np.zeros(action_dim)
    D = np.random.rand(action_dim)
    target = np.random.rand(action_dim) * 2 - 1

    agent = planning.PIDAgent(dim=action_dim, Kp=P, Ki=I, Kd=D, target=target)
    return agent

In [36]:
agent = create_pid_agent(env.action_space.shape[0])
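
For readers unfamiliar with PID control, the sketch below shows the textbook update for a single scalar dimension. `SimplePID` is a hypothetical class for illustration and makes no claim about how `planning.PIDAgent` is implemented internally.

```python
class SimplePID:
    """Hypothetical scalar PID controller (not the toolbox's PIDAgent)."""

    def __init__(self, kp, ki, kd, target):
        self.kp, self.ki, self.kd, self.target = kp, ki, kd, target
        self.reset()

    def reset(self):
        """Clear the accumulated integral and the remembered error."""
        self.integral = 0.0
        self.prev_error = 0.0

    def act(self, measurement, dt=1.0):
        """Return the control signal for the current measurement."""
        error = self.target - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = SimplePID(kp=2.0, ki=0.0, kd=0.5, target=1.0)
first = pid.act(0.0)   # error 1.0, derivative 1.0
second = pid.act(0.5)  # error 0.5, derivative -0.5
print(first, second)
```

Setting `ki=0.0` mirrors the zero integral gains sampled in `create_pid_agent`, which reduces the controller to a PD law.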

# Create a replay buffer

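We can create a replay buffer for this environment and configuration using the following method, with `collect_trajectories` enabled for easier plotting of results. The buffer's action slot stores the agent's parameters plus one extra entry for the prediction horizon, which is why `param_shape` has one more element than the parameter vector.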

In [61]:
param_shape = (len(agent.get_parameters())+1,)
collect_full_trajectories = True
replay_buffer = common_util.create_replay_buffer(cfg, 
                                                 obs_shape, 
                                                 param_shape, 
                                                 rng=rng, 
                                                 collect_trajectories=collect_full_trajectories)



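We can now populate the replay buffer with trajectories of a desired length, using a modified function based on `util.rollout_agent_trajectories`. The modification is that, at every step, one tuple is added for each earlier observation in the current trajectory, so that every sub-trajectory becomes a supervised learning example. Each tuple stores the earlier observation, the agent parameters concatenated with the horizon (the number of steps between the two observations), and the latest observation.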
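
Before the full rollout loop, the sub-trajectory indexing can be sketched in isolation. This is a simplified illustration with placeholder state labels; `sub_trajectory_tuples` is a hypothetical function, not part of the toolbox.

```python
def sub_trajectory_tuples(states):
    """For every pair start < end, build (state_start, horizon, state_end)."""
    tuples = []
    for end in range(1, len(states)):
        for start in range(end):
            tuples.append((states[start], end - start, states[end]))
    return tuples


states = ["s0", "s1", "s2", "s3"]
pairs = sub_trajectory_tuples(states)
for start_state, horizon, end_state in pairs:
    print(f"{start_state} -> {end_state}, horizon {horizon}")
```

A trajectory of T states yields T(T-1)/2 tuples, so the number of training examples grows quadratically with trajectory length.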

In [69]:
num_trials = 10

step = 0
trial = 0
total_rewards = []
callback = None
while trial < num_trials:
    traj = []
    obs = env.reset()
    agent.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        if callback:
            callback((obs, action, next_obs, reward, done))

        obs = next_obs
        traj.append((obs, step))
        
        # if not the first step, iterate through memory and append each sub-trajectory
        if len(traj[:-1]) > 0:
            for obs_t, t in traj[:-1]:
                print(f"adding: mem time {t}, current time {step}, horizon {step-t}")
                replay_buffer.add(obs_t, np.concatenate((agent.get_parameters(), np.array([step-t,]))), next_obs, reward, done)

        total_reward += reward
        step += 1
                
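        # `steps_or_trials_to_collect` is not defined in this notebook; the check below is
        # only safe because `collect_full_trajectories` is True and the `and` short-circuits.
        if not collect_full_trajectories and step == steps_or_trials_to_collect: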
            total_rewards.append(total_reward)
            break
        if trial_length and step % trial_length == 0:
            if collect_full_trajectories and not done and replay_buffer is not None:
                replay_buffer.close_trajectory()
            break
    trial += 1
    total_rewards.append(total_reward)
    


adding: mem time0, current time 1, horizon 1
adding: mem time0, current time 2, horizon 2
adding: mem time1, current time 2, horizon 1
adding: mem time0, current time 3, horizon 3
adding: mem time1, current time 3, horizon 2
adding: mem time2, current time 3, horizon 1
adding: mem time0, current time 4, horizon 4
adding: mem time1, current time 4, horizon 3
adding: mem time2, current time 4, horizon 2
adding: mem time3, current time 4, horizon 1
adding: mem time0, current time 5, horizon 5
adding: mem time1, current time 5, horizon 4
adding: mem time2, current time 5, horizon 3
adding: mem time3, current time 5, horizon 2
adding: mem time4, current time 5, horizon 1
adding: mem time0, current time 6, horizon 6
adding: mem time1, current time 6, horizon 5
adding: mem time2, current time 6, horizon 4
adding: mem time3, current time 6, horizon 3
adding: mem time4, current time 6, horizon 2
adding: mem time5, current time 6, horizon 1
adding: mem time0, current time 7, horizon 7
...

adding: mem time230, current time 262, horizon 32
adding: mem time231, current time 262, horizon 31
adding: mem time232, current time 262, horizon 30
adding: mem time233, current time 262, horizon 29
adding: mem time234, current time 262, horizon 28
adding: mem time235, current time 262, horizon 27
adding: mem time236, current time 262, horizon 26
adding: mem time237, current time 262, horizon 25
adding: mem time238, current time 262, horizon 24
adding: mem time239, current time 262, horizon 23
adding: mem time240, current time 262, horizon 22
adding: mem time241, current time 262, horizon 21
adding: mem time242, current time 262, horizon 20
adding: mem time243, current time 262, horizon 19
adding: mem time244, current time 262, horizon 18
adding: mem time245, current time 262, horizon 17
adding: mem time246, current time 262, horizon 16
adding: mem time247, current time 262, horizon 15
adding: mem time248, current time 262, horizon 14
adding: mem time249, current time 262, horizon 13





In [65]:
len(replay_buffer)

2200

# Training the Trajectory Based Model

In [59]:
agent.get_parameters().flatten()

array([ 4.32284387,  4.42208292,  3.3031029 ,  4.17837783,  0.24627421,
        1.60552719,  1.59221673,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.59455814,
        0.84819036,  0.91022234,  0.83308917,  0.30861847,  0.7820293 ,
        0.89154524,  0.40239062, -0.2175147 ,  0.8519179 , -0.5969506 ,
       -0.89464083,  0.32461422, -0.24695989])
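
The flat vector above has 28 entries, which is consistent with four blocks of seven values; the block of zeros matches the zero integral gains set in `create_pid_agent`. Assuming the layout is `Kp`, `Ki`, `Kd`, `target` (an inference from the printed values, not a documented guarantee), the vector can be unpacked as in this sketch. `split_pid_parameters` is a hypothetical helper, and the example values are made up.

```python
def split_pid_parameters(flat, action_dim):
    """Split a flat parameter list into Kp, Ki, Kd and target blocks.

    Assumes the layout [Kp..., Ki..., Kd..., target...], each of length action_dim.
    """
    if len(flat) != 4 * action_dim:
        raise ValueError("expected 4 * action_dim parameters")
    names = ("Kp", "Ki", "Kd", "target")
    return {
        name: list(flat[i * action_dim:(i + 1) * action_dim])
        for i, name in enumerate(names)
    }


# Illustrative values for a two-dimensional action space.
flat_params = [4.0, 3.0, 0.0, 0.0, 0.5, 0.8, 0.4, -0.2]
blocks = split_pid_parameters(flat_params, action_dim=2)

# The replay buffer stores the parameters with the horizon appended as the last entry.
horizon = 3
conditioning = flat_params + [horizon]
print(blocks)
print(len(conditioning))
```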

# Plotting Results