# Plots of Trade-offs

In this notebook we compare different time discretization methods. First, we collect
trajectory data from the environment at a fine discretization level (for now, this is
also the discretization level at which the policy runs). Then we compare:

1. Using uniform discretization at different granularities, e.g. updating on every
    1st, 10th, 100th, ... interaction.
2. Using the adaptive method with different tolerances.
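
To make the two options above concrete, here is a minimal sketch. It is not the notebook's implementation: `uniform_indices` and `adaptive_indices` are hypothetical helpers, and the adaptive rule shown (recursive bisection that compares a one-panel trapezoid estimate with a two-panel one) is only an assumed stand-in for whatever `adaptive_time.samplers` actually does.

```python
import numpy as np


def uniform_indices(n, every):
    """Indices of every `every`-th interaction, always including the last one."""
    idx = list(range(0, n, every))
    if idx[-1] != n - 1:
        idx.append(n - 1)
    return idx


def adaptive_indices(rewards, tolerance):
    """Recursive bisection: split an interval only where a coarse trapezoid
    estimate of the reward sum differs from a finer one by more than `tolerance`."""
    chosen = set()

    def recurse(lo, hi):
        chosen.update((lo, hi))
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        coarse = 0.5 * (rewards[lo] + rewards[hi]) * (hi - lo)
        fine = (0.5 * (rewards[lo] + rewards[mid]) * (mid - lo)
                + 0.5 * (rewards[mid] + rewards[hi]) * (hi - mid))
        if abs(coarse - fine) > tolerance:
            recurse(lo, mid)
            recurse(mid, hi)

    recurse(0, len(rewards) - 1)
    return sorted(chosen)


rewards = np.cos(np.linspace(0.0, 3.0, 501))  # a made-up smooth reward signal
uniform = uniform_indices(len(rewards), every=10)
adaptive = adaptive_indices(rewards, tolerance=0.05)
print(len(uniform), len(adaptive))
```

In this sketch a smaller tolerance can only add sample points, which is the trade-off the comparison is meant to expose: fewer updates against a coarser approximation of the return.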

In order to average out randomness, we'll repeat each setting 3 times for now.

In [1]:
import gymnasium as gym
from adaptive_time.features import Fourier_Features
import numpy as np
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import random

import adaptive_time.utils
from adaptive_time import environments
from adaptive_time import mc2
from adaptive_time import samplers

seed = 13

In [2]:
gym.register(
    id="CartPole-OURS-v0",
    entry_point="adaptive_time.environments.cartpole:CartPoleEnv",
    vector_entry_point="adaptive_time.environments.cartpole:CartPoleVectorEnv",
    max_episode_steps=500,
    reward_threshold=475.0,
)

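def reset_randomness(seed, env):
    """Seed Python's and NumPy's global RNGs and the env's action space."""
    random.seed(seed)
    np.random.seed(seed)
    # gymnasium's Env API has no env.seed(); environments are seeded
    # through env.reset(seed=seed) instead.
    env.action_space.seed(seed)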

In [3]:
# Sample usage of the environment.
print(
    "We run the same environment and simple policy twice,\n"
    "with different time discretizations. The policy we use\n"
    "will always go left, so the time discretization does not\n"
    "make a difference to the behaviour, and the total return\n"
    "will be the same.")
print()

policy = lambda obs: 0

env = gym.make('CartPole-OURS-v0')
tau = 0.02
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_1 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_1)

env = gym.make('CartPole-OURS-v0')
tau = 0.002
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_2 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_2)

np.testing.assert_almost_equal(total_return_1, total_return_2, decimal=0)

print()
print(
    "We can expect some difference because we may get extra\n"
    "timesteps in the more fine-grained discretization, but the\n"
    "difference should be smallish.")

We run the same environment and simple policy twice,
with different time discretizations. The policy we use
will always go left, so the time discretization does not
make a difference to the behaviour, and the total return
will be the same.

Total undiscounted return:  10.589912009424973
Total undiscounted return:  10.017508472458736

We can expect some difference because we may get extra
timesteps in the more fine-grained discretization, but the
difference should be smallish.



**NOTE** you must adjust the discount factor if changing time-scales!
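
One common way to make that adjustment is sketched below, assuming `gamma` is the per-step discount for a step of length `tau`: keep the discount per unit of simulated time constant, so that `gamma_new = gamma_old ** (tau_new / tau_old)`. The helper `rescale_gamma` is hypothetical and not part of `adaptive_time`, and the numbers are example values only.

```python
def rescale_gamma(gamma_old, tau_old, tau_new):
    """Per-step discount for step length `tau_new` that matches `gamma_old`
    at step length `tau_old` over the same amount of simulated time."""
    return gamma_old ** (tau_new / tau_old)


gamma_coarse = 0.99  # discount per 0.02 s step (example value)
gamma_fine = rescale_gamma(gamma_coarse, tau_old=0.02, tau_new=0.002)

# Ten fine steps cover the same time as one coarse step, so together they
# should discount by the same total factor.
print(gamma_fine, gamma_fine ** 10)
```

Without a rescaling of this kind, reusing the same `gamma` at a finer `tau` shortens the effective horizon when measured in simulated time.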

In [4]:
phi = Fourier_Features()
phi.init_fourier_features(4, 4)
x_thres = 4.8
theta_thres = 0.418
phi.init_state_normalizers(
    np.array([x_thres, 2.0, theta_thres, 1]),
    np.array([-x_thres, -2.0, -theta_thres, -1]),
)
phi.num_parameters

625
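
The count 625 is consistent with a full Fourier basis of order 4 over the 4-dimensional CartPole state, which has (4 + 1)^4 coefficient vectors. The sketch below shows that standard construction. It is an assumption that `Fourier_Features` is built this way and that the two arguments to `init_fourier_features` are the state dimension and the order; `fourier_basis` is a hypothetical stand-in, not the library's code.

```python
import itertools

import numpy as np


def fourier_basis(state, low, high, order):
    """Full Fourier cosine basis: one feature cos(pi * c . s) for every
    integer coefficient vector c in {0, ..., order}^dims."""
    state, low, high = (np.asarray(v, dtype=float) for v in (state, low, high))
    scaled = (state - low) / (high - low)  # normalize each dimension to [0, 1]
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ scaled)


high = np.array([4.8, 2.0, 0.418, 1.0])
example_features = fourier_basis(np.zeros(4), -high, high, order=4)
print(example_features.shape)  # (625,)
```

The feature count grows exponentially with the state dimension, which is why the per-action weight vector in the next cell already has 2 * 625 entries.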

In [5]:
print_trajectory = False
gamma = 0.999

num_episodes = 500
epsilon = 0.1

tau = 0.002
env.stepTime(tau)

# Note: the second assignment overrides the first, so only tolerance=0.0
# is in effect below.
sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.1)
sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.0)


# We record:
returns_per_episode_q = np.zeros((2, num_episodes))
average_returns_q = np.zeros((2, num_episodes))  # the cumulative average of the above
predicted_returns_q = np.zeros((2, num_episodes))

reset_randomness(seed, env)

observation, _ = env.reset(seed=seed)
d = len(phi.get_fourier_feature(observation))
assert d == phi.num_parameters
features = np.identity(2 * d)   # An estimate of A = xx^T
targets = np.zeros(2 * d)  # An estimate of b = xG
weights = np.zeros(2 * d)   # The weights that approximate A^{-1} b

x_0 = phi.get_fourier_feature([0,0,0,0])  # the initial state
x_sa0 = mc2.phi_sa(x_0, 0)
x_sa1 = mc2.phi_sa(x_0, 1)


for episode in range(num_episodes):
    def policy(state):
        if random.random() < epsilon:
            return env.action_space.sample()
        # Otherwise calculate the best action.
        x = phi.get_fourier_feature(state)
        qs = np.zeros(2)
        for action in [0, 1]:
            x_sa = mc2.phi_sa(x, action)
            qs[action] = np.inner(x_sa.flatten(), weights)
        # adaptive_time.utils.softmax(qs, 1)
        return adaptive_time.utils.argmax(qs)

    trajectory = environments.generate_trajectory(env, policy=policy)

    if print_trajectory:
        print("trajectory-len: ", len(trajectory), "; trajectory:")
        for idx, (o, a, r, o_) in enumerate(trajectory):
            # * ignore reward, as it is always the same here.
            # * o_ is the same as the next o.
            print(f"* {idx:4d}: o: {o}\n\t --> action: {a}")

    weights, targets, features, cur_avr_returns = mc2.ols_monte_carlo(
        trajectory, sampler, tqdm, phi, weights, targets, features, x_0, gamma)
    
    # Store the empirical and predicted returns. For any episode, we may
    # or may not have empirical returns for both actions. When we don't have an
    # estimate, `nan` is returned.
    returns_per_episode_q[:, episode] = cur_avr_returns
    average_returns_q[:, episode] = np.nanmean(returns_per_episode_q[:, :episode+1], axis=1)

    predicted_returns_q[0, episode] = np.inner(x_sa0.flatten(), weights)
    predicted_returns_q[1, episode] = np.inner(x_sa1.flatten(), weights)
    print(
        'episode:', episode,
        ' empirical returns:' , returns_per_episode_q[:, episode],
        ' predicted returns:' , predicted_returns_q[:, episode])

Using 41/548 samples.



episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [38.54726255 12.07184041]
Using 18/123 samples.


episode: 1  empirical returns: [9.52965598 0.        ]  predicted returns: [24.23798007 12.14794427]
Using 19/159 samples.


episode: 2  empirical returns: [10.95753037  0.        ]  predicted returns: [19.85380852 12.75762596]
Using 23/179 samples.


episode: 3  empirical returns: [12.16771336  0.        ]  predicted returns: [17.97095232 12.70873222]
Using 23/177 samples.


episode: 4  empirical returns: [12.08726013  0.        ]  predicted returns: [16.80486932 12.95675372]
Using 21/173 samples.


episode: 5  empirical returns: [11.8715752  0.       ]  predicted returns: [15.98069162 13.08495193]
Using 23/172 samples.


episode: 6  empirical returns: [11.71337222  0.        ]  predicted returns: [15.37294572 13.32588551]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Did 5000 steps! 105000
Did 5000 steps! 110000
Did 5000 steps! 115000
Did 5000 steps! 120000
Did 5000 steps! 125000
Did 5000 steps! 130000
Did 5000 steps! 135000
Did 5000 steps! 140000
Did 5000 steps! 145000
Did 5000 steps! 150000
Did 5000 steps! 155000
Did 5000 steps! 160000
Did 5000 steps! 165000
Did 5000 steps! 170000
Did 5000 steps! 175000
Did 5000 steps! 180000
Did 5000 steps! 185000
Did 5000 steps! 190000
Did 5000 steps! 195000
Did 5000 steps! 2000

KeyboardInterrupt: 
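
The comments in cell 5 describe `features`, `targets`, and `weights` as estimates of A = xx^T, b = xG, and A^{-1} b. A minimal sketch of that kind of least-squares Monte Carlo update, on made-up data, is given below. `ols_mc_update` is hypothetical and is not the body of `mc2.ols_monte_carlo`, whose internals (including how the sampler picks which steps to use) are not shown in this notebook.

```python
import numpy as np


def ols_mc_update(A, b, xs, rewards, gamma):
    """Accumulate A += x x^T and b += x G over one episode, where G is the
    discounted return from each step onward, then solve A w = b."""
    G = 0.0
    for x, r in zip(reversed(xs), reversed(rewards)):
        G = r + gamma * G  # return from this step onward
        A = A + np.outer(x, x)
        b = b + x * G
    return A, b, np.linalg.solve(A, b)


rng = np.random.default_rng(0)
d = 3
A = np.identity(d)  # starting from the identity keeps A invertible
b = np.zeros(d)
xs = list(rng.normal(size=(50, d)))  # made-up feature vectors
rewards = [1.0] * 50
A, b, w = ols_mc_update(A, b, xs, rewards, gamma=0.9)
print(w)
```

Starting `A` at the identity, as the notebook does with `np.identity(2 * d)`, acts like a small ridge penalty and avoids a singular system in the first episodes.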