# Plots of Trade-offs

In this notebook we compare different time discretization methods. First, we collect
trajectory data from the environment at a fine discretization level (for now, this is
also the discretization level at which we run the policy). Then we compare:

1. Uniform discretization at different granularities, e.g. updating with every
    1st, 10th, 100th, ... interaction.
2. The adaptive method with different tolerances.

In order to average out randomness, we'll repeat each setting 3 times for now.
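
To make the two options concrete, here is a minimal sketch of how each one could pick the interactions to update with. It is not the `adaptive_time` implementation: it assumes the trajectory has been reduced to a sequence of per-step rewards, and `uniform_pivots`, `approx_sum`, and `adaptive_pivots` are hypothetical helpers. The adaptive version follows the usual adaptive-quadrature recipe: estimate the reward sum over a segment from its endpoints, compare that with the estimate obtained after bisecting the segment, and keep splitting only where the two disagree by more than the tolerance.

```python
import numpy as np

def uniform_pivots(n, every):
    """Indices of every `every`-th interaction out of n."""
    return list(range(0, n, every))

def approx_sum(rewards, lo, hi):
    """Endpoint (trapezoid-style) estimate of sum(rewards[lo:hi + 1])."""
    return (hi - lo + 1) * (rewards[lo] + rewards[hi]) / 2.0

def adaptive_pivots(rewards, tolerance):
    """Bisect segments until the coarse and refined estimates agree."""
    pivots = set()
    stack = [(0, len(rewards) - 1)]
    while stack:
        lo, hi = stack.pop()
        pivots.update((lo, hi))
        if hi - lo < 2:
            continue  # Too short to bisect further.
        mid = (lo + hi) // 2
        coarse = approx_sum(rewards, lo, hi)
        fine = approx_sum(rewards, lo, mid) + approx_sum(rewards, mid + 1, hi)
        if abs(coarse - fine) > tolerance:
            stack.append((lo, mid))
            stack.append((mid + 1, hi))
    return sorted(pivots)

rewards = np.cos(np.linspace(0.0, 3.0, 200))  # A made-up reward signal.
print(len(uniform_pivots(len(rewards), every=10)))  # 20
print(len(adaptive_pivots(rewards, tolerance=0.1)))
```

With a constant reward signal the adaptive sketch keeps only the two endpoints, whereas the uniform scheme spends the same number of samples regardless of how the signal behaves.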

In [1]:
import gymnasium as gym
from adaptive_time.features import Fourier_Features
import numpy as np
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import random

import adaptive_time.utils
from adaptive_time import environments
from adaptive_time import mc2
from adaptive_time import samplers

seed = 13

In [2]:
gym.register(
    id="CartPole-OURS-v0",
    entry_point="adaptive_time.environments.cartpole:CartPoleEnv",
    vector_entry_point="adaptive_time.environments.cartpole:CartPoleVectorEnv",
    max_episode_steps=500,
    reward_threshold=475.0,
)

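def reset_randomness(seed, env):
    """Seed Python's RNG, NumPy's global RNG, and the env's action space."""
    random.seed(seed)
    np.random.seed(seed)
    # Gymnasium has no `env.seed()`; the environment itself is seeded
    # through `env.reset(seed=seed)`.
    env.action_space.seed(seed)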

In [3]:
# Sample usage of the environment.
print(
    "We run the same environment and simple policy twice,\n"
    "with different time discretizations. The policy we use\n"
    "will always go left, so the time discretization does not\n"
    "make a difference to the behaviour, and the total return\n"
    "will be the same.")
print()

def policy(obs):
    return 0  # Always go left.

env = gym.make('CartPole-OURS-v0')
tau = 0.02
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_1 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_1)

env = gym.make('CartPole-OURS-v0')
tau = 0.002
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_2 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_2)

np.testing.assert_almost_equal(total_return_1, total_return_2, decimal=0)

print()
print(
    "We can expect some difference because we may get extra\n"
    "timesteps in the more fine-grained discretization, but the\n"
    "difference should be smallish.")

We run the same environment and simple policy twice,
with different time discretizations. The policy we use
will always go left, so the time discretization does not
make a difference to the behaviour, and the total return
will be the same.

Total undiscounted return:  10.589912009424973
Total undiscounted return:  10.017508472458736

We can expect some difference because we may get extra
timesteps in the more fine-grained discretization, but the
difference should be smallish.


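**NOTE:** You must adjust the discount factor when changing time-scales! With a fixed per-step `gamma`, a smaller `tau` means more steps, and hence more discounting, over the same stretch of simulated time.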
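
One common way to make this adjustment, assuming the return is meant to be discounted per unit of simulated time rather than per step, is to keep the discount over any fixed time interval constant: a discount of `gamma` per step of length `tau` corresponds to `gamma ** (tau_new / tau)` per step of length `tau_new`. The helper `rescale_gamma` below is hypothetical and not part of `adaptive_time`.

```python
def rescale_gamma(gamma, tau, tau_new):
    """Per-step discount at step length tau_new matching `gamma` at step length tau."""
    return gamma ** (tau_new / tau)

# Ten steps of length 0.002 cover one step of length 0.02, so the coarse
# discount should equal the fine discount applied ten times.
gamma_fine = 0.999
gamma_coarse = rescale_gamma(gamma_fine, tau=0.002, tau_new=0.02)
print(gamma_coarse, gamma_fine ** 10)
```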

In [4]:
phi = Fourier_Features()
phi.init_fourier_features(4, 4)
x_thres = 4.8
theta_thres = 0.418
phi.init_state_normalizers(
    np.array([x_thres, 2.0, theta_thres, 1]),
    np.array([-x_thres, -2.0, -theta_thres, -1]))
phi.num_parameters

625
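
The count of 625 is consistent with a full Fourier basis of order 4 over a 4-dimensional state: one feature per integer coefficient vector in {0, ..., 4}^4, which gives (4 + 1)^4 = 625. The sketch below shows that construction under the assumption that `Fourier_Features` follows the standard Fourier-basis recipe (normalize the state to [0, 1], then take cos(pi * c . s) for every coefficient vector c). The function `fourier_features` and its arguments are illustrative names, not the library's API.

```python
import itertools
import numpy as np

def fourier_features(state, low, high, order):
    """Full Fourier basis of the given order over a box-normalized state."""
    state, low, high = (np.asarray(a, dtype=float) for a in (state, low, high))
    s = (state - low) / (high - low)  # Normalize each dimension to [0, 1].
    # All integer coefficient vectors with entries in {0, ..., order}.
    coeffs = np.array(list(itertools.product(range(order + 1), repeat=len(s))))
    return np.cos(np.pi * coeffs @ s)

high = np.array([4.8, 2.0, 0.418, 1.0])
x = fourier_features([0, 0, 0, 0], -high, high, order=4)
print(x.shape)  # (625,)
```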

In [5]:
print_trajectory = False
gamma = 0.999

num_episodes = 500
epsilon = 0.1

tau = 0.002
env.stepTime(tau)

sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.1)
# sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.0)


# We record:
returns_per_episode_q = np.zeros((2, num_episodes))
average_returns_q = np.zeros((2, num_episodes))  # the cumulative average of the above
predicted_returns_q = np.zeros((2, num_episodes))

reset_randomness(seed, env)

observation, _ = env.reset(seed=seed)
d = len(phi.get_fourier_feature(observation))
assert d == phi.num_parameters
features = np.identity(2 * d)   # An estimate of A = xx^T
targets = np.zeros(2 * d)  # An estimate of b = xG
weights = np.zeros(2 * d)   # The weights that approximate A^{-1} b

x_0 = phi.get_fourier_feature([0, 0, 0, 0])  # the initial state
x_sa0 = mc2.phi_sa(x_0, 0)
x_sa1 = mc2.phi_sa(x_0, 1)


for episode in range(num_episodes):
    def policy(state):
        if random.random() < epsilon:
            return env.action_space.sample()
        # Otherwise calculate the best action.
        x = phi.get_fourier_feature(state)
        qs = np.zeros(2)
        for action in [0, 1]:
            x_sa = mc2.phi_sa(x, action)
            qs[action] = np.inner(x_sa.flatten(), weights)
        # adaptive_time.utils.softmax(qs, 1)
        return adaptive_time.utils.argmax(qs)

    trajectory = environments.generate_trajectory(env, policy=policy)

    if print_trajectory:
        print("trajectory-len: ", len(trajectory), "; trajectory:")
        for idx, (o, a, r, o_) in enumerate(trajectory):
            # * ignore reward, as it is always the same here.
            # * o_ is the same as the next o.
            print(f"* {idx:4d}: o: {o}\n\t --> action: {a}")

    weights, targets, features, cur_avr_returns = mc2.ols_monte_carlo(
        trajectory, sampler, tqdm, phi, weights, targets, features, x_0, gamma)
    
    # Store the empirical and predicted returns. For any episode, we may
    # or may not have empirical returns for both actions. When we don't have an
    # estimate, `nan` is returned.
    returns_per_episode_q[:, episode] = cur_avr_returns
    average_returns_q[:, episode] = np.nanmean(returns_per_episode_q[:, :episode+1], axis=1)

    predicted_returns_q[0, episode] = np.inner(x_sa0.flatten(), weights)
    predicted_returns_q[1, episode] = np.inner(x_sa1.flatten(), weights)
    print(
        'episode:', episode,
        ' empirical returns:' , returns_per_episode_q[:, episode],
        ' predicted returns:' , predicted_returns_q[:, episode])

Using 548/548 samples.
episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [35.91830993 30.24942918]
Using 361/361 samples.
episode: 1  empirical returns: [28.45849011  0.        ]  predicted returns: [35.36472805 35.71637349]
Using 1863/1863 samples.
episode: 2  empirical returns: [ 0.         82.02244811]  predicted returns: [54.30500398 66.10999344]
Using 649/649 samples.
episode: 3  empirical returns: [ 0.         44.88951313]  predicted returns: [41.19728746 47.10703236]
Using 255/255 samples.
episode: 4  empirical returns: [ 0.         19.43532548]  predicted returns: [46.67649866 50.50184754]
Using 233/233 samples.
episode: 5  empirical returns: [ 0.         18.79206236]  predicted returns: [32.47099161 37.37620204]
Using 455/455 samples.
episode: 6  empirical returns: [ 0.         33.11931168]  predicted returns: [32.07258811 34.39966632]
Using 340/340 samples.
episode: 7  empirical returns: [ 0.         26.68285605]  predicted returns: [30.02010511 32.90284232]
Using 403/403 samples.
episode: 8  empirical returns: [ 0.         31.01961097]  predicted returns: [31.835958   34.61759991]
Using 675/675 samples.
episode: 9  empirical returns: [ 0.         46.77211508]  predicted returns: [31.60788984 35.15722459]
Using 534/534 samples.
episode: 10  empirical returns: [ 0.        37.8610709]  predicted returns: [32.52375828 35.68179701]
Using 611/611 samples.
episode: 11  empirical returns: [ 0.         41.61848623]  predicted returns: [32.58748281 36.25691349]
Using 611/611 samples.
episode: 12  empirical returns: [ 0.         41.81897408]  predicted returns: [33.02158675 36.83651904]
Using 609/609 samples.
episode: 13  empirical returns: [ 0.         41.43381738]  predicted returns: [33.18877555 37.00971386]
Using 615/615 samples.
episode: 14  empirical returns: [ 0.         41.70953441]  predicted returns: [33.35995812 37.10163755]
Using 620/620 samples.
episode: 15  empirical returns: [ 0.         41.84819395]  predicted returns: [33.60520272 37.21336416]
Using 627/629 samples.
episode: 16  empirical returns: [ 0.         42.15698783]  predicted returns: [33.75531849 37.32372441]
Using 782/782 samples.
episode: 17  empirical returns: [ 0.         51.24447541]  predicted returns: [34.32819605 38.53884562]
Using 738/738 samples.
episode: 18  empirical returns: [ 0.         48.89151119]  predicted returns: [35.03648163 39.41646117]
Using 687/687 samples.
episode: 19  empirical returns: [ 0.         46.17854184]  predicted returns: [35.35472344 39.830005  ]
Using 702/702 samples.
episode: 20  empirical returns: [ 0.         46.83524623]  predicted returns: [35.6786645  40.20209902]
Using 729/729 samples.
episode: 21  empirical returns: [48.16442451  0.        ]  predicted returns: [36.19727177 40.67224795]
Using 750/750 samples.
episode: 22  empirical returns: [ 0.         49.14948636]  predicted returns: [36.60079675 41.09670042]
Using 763/763 samples.
episode: 23  empirical returns: [ 0.         50.06394522]  predicted returns: [37.0519421  41.56123152]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Did 5000 steps! 105000
Did 5000 steps! 110000
Did 5000 steps! 115000
Did 5000 steps! 120000


KeyboardInterrupt: 
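
For reference, the least-squares Monte Carlo update that the `features`, `targets`, and `weights` arrays in cell `In [5]` are set up for can be sketched as follows. This is an assumption based on the comments in that cell (A accumulates x x^T, b accumulates x G, and the weights approximate A^{-1} b), not a copy of `mc2.ols_monte_carlo`; `ols_mc_update` is a hypothetical name, and the sketch uses every step rather than a sampler's selection.

```python
import numpy as np

def ols_mc_update(A, b, xs, rewards, gamma):
    """Add one episode to A (sum of x x^T) and b (sum of x G), then solve A w = b."""
    G = 0.0
    # Walk the episode backwards so each return G is built incrementally.
    for x, r in zip(reversed(xs), reversed(rewards)):
        G = r + gamma * G
        A += np.outer(x, x)
        b += x * G
    w = np.linalg.solve(A, b)
    return A, b, w

# Toy two-step episode with one-hot features and a reward of 1 per step.
A, b = np.identity(2), np.zeros(2)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
A, b, w = ols_mc_update(A, b, xs, rewards=[1.0, 1.0], gamma=0.5)
print(w)
```

Starting `A` at the identity, as the notebook does with `features`, keeps the linear system solvable before every feature has been visited, at the cost of shrinking early estimates toward zero.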