# Plots of Trade-offs

In this notebook we compare different time discretization methods. First, we collect
trajectory data from the environment at a fine discretization level (for now, this is
also the discretization level at which we run the policy). Then we compare:

1. Uniform discretization at different granularities, e.g. updating with every
    1st, 10th, 100th, ... interaction.
2. The adaptive method with different tolerances.

In order to average out randomness, we'll repeat each setting 3 times for now.
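To make the two options above concrete, here is a minimal sketch of how each could choose which interactions of a finely sampled trajectory to update with. It is an illustration under assumptions, not the code in `adaptive_time.samplers`: the names `uniform_pivots` and `adaptive_pivots` are hypothetical, and the adaptive rule shown is a plain trapezoid-based refinement that splits an interval whenever the one-piece and two-piece estimates of the reward integral disagree by at least `tolerance`.

```python
import numpy as np


def uniform_pivots(num_steps, every):
    """Indices kept by uniform subsampling: every `every`-th interaction."""
    return list(range(0, num_steps, every))


def adaptive_pivots(rewards, tolerance):
    """Indices kept by a recursive, quadrature-style refinement (hypothetical sketch)."""
    rewards = np.asarray(rewards, dtype=float)
    kept = set()

    def trapezoid(a, b):
        # Trapezoid estimate of the summed reward between indices a and b.
        return 0.5 * (rewards[a] + rewards[b]) * (b - a)

    def refine(lo, hi):
        kept.update((lo, hi))
        if hi - lo <= 1:
            return  # nothing left to split
        mid = (lo + hi) // 2
        whole = trapezoid(lo, hi)
        halves = trapezoid(lo, mid) + trapezoid(mid, hi)
        # `>=` means that tolerance 0 always refines down to every step.
        if abs(whole - halves) >= tolerance:
            refine(lo, mid)
            refine(mid, hi)

    refine(0, len(rewards) - 1)
    return sorted(kept)


spiky = np.zeros(9)
spiky[4] = 10.0
print(uniform_pivots(9, 4))
print(adaptive_pivots(np.ones(9), tolerance=0.1))
print(adaptive_pivots(spiky, tolerance=1.0))
```

In this sketch a constant reward signal needs only its two endpoints, while a spike pulls extra samples towards the step where the signal changes; the uniform rule spends the same budget everywhere.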

In [12]:
import gymnasium as gym
from adaptive_time.features import Fourier_Features
import numpy as np
from tqdm.notebook import tqdm

import matplotlib.pyplot as plt
import random

import adaptive_time.utils
from adaptive_time import environments
from adaptive_time import mc2
from adaptive_time import samplers

seed = 13

In [2]:
gym.register(
    id="CartPole-OURS-v0",
    entry_point="adaptive_time.environments.cartpole:CartPoleEnv",
    vector_entry_point="adaptive_time.environments.cartpole:CartPoleVectorEnv",
    max_episode_steps=500,
    reward_threshold=475.0,
)

def reset_randomness(seed, env):
    random.seed(seed)
    np.random.seed(seed)
    # env.seed(seed)
    env.action_space.seed(seed)

In [3]:
# Sample usage of the environment.
print(
    "We run the same environment and simple policy twice,\n"
    "with different time discretizations. The policy we use\n"
    "will always go left, so the time discretization does not\n"
    "make a difference to the behaviour, and the total return\n"
    "will be the same.")
print()

policy = lambda obs: 0

env = gym.make('CartPole-OURS-v0')
tau = 0.02
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_1 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_1)

env = gym.make('CartPole-OURS-v0')
tau = 0.002
env.stepTime(tau)

reset_randomness(seed, env)
traj = environments.generate_trajectory(env, seed, policy)
total_return_2 = sum(ts[2] for ts in traj)
print("Total undiscounted return: ", total_return_2)

np.testing.assert_almost_equal(total_return_1, total_return_2, decimal=0)

print()
print(
    "We can expect some difference because we may get extra\n"
    "timesteps in the more fine-grained discretization, but the\n"
    "difference should be smallish.")

We run the same environment and simple policy twice,
with different time discretizations. The policy we use
will always go left, so the time discretization does not
make a difference to the behaviour, and the total return
will be the same.

Total undiscounted return:  10.589912009424973
Total undiscounted return:  10.017508472458736

We can expect some difference because we may get extra
timesteps in the more fine-grained discretization, but the
difference should be smallish.



**NOTE** you must adjust the discount factor if changing time-scales!
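One common way to make that adjustment, shown here as a sketch rather than as something this notebook already does, is to keep the discount per unit of simulated time fixed: if `gamma` applies to one step of length `tau_old`, a step of length `tau_new` gets `gamma ** (tau_new / tau_old)`. The helper name `rescale_discount` is hypothetical.

```python
def rescale_discount(gamma, tau_old, tau_new):
    """Per-step discount for steps of length `tau_new` that matches
    `gamma` per step of length `tau_old` (same discount per unit time)."""
    return gamma ** (tau_new / tau_old)


gamma_coarse = 0.99
gamma_fine = rescale_discount(gamma_coarse, tau_old=0.02, tau_new=0.002)

# Ten fine steps cover the same simulated time as one coarse step,
# so they should discount by the same overall factor.
print(gamma_fine, gamma_fine ** 10)
```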

In [4]:
phi = Fourier_Features()
phi.init_fourier_features(4, 4)
x_thres = 4.8
theta_thres = 0.418
phi.init_state_normalizers(
    np.array([x_thres, 2.0, theta_thres, 1]),
    np.array([-x_thres, -2.0, -theta_thres, -1]))
phi.num_parameters

625
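The count of 625 is consistent with a full Fourier cosine basis of order 4 over the 4-dimensional CartPole state, which has (4 + 1)^4 = 625 terms. The sketch below builds such a basis from scratch to show where the number comes from. It assumes the standard construction (normalize the state to [0, 1], then take `cos(pi * c . s)` for every integer coefficient vector `c`); whether `Fourier_Features` does exactly this is an assumption, and `fourier_features_sketch` is a hypothetical name.

```python
import itertools

import numpy as np


def fourier_features_sketch(state, low, high, order=4):
    """Full Fourier cosine basis of a bounded state (hypothetical sketch)."""
    state = np.asarray(state, dtype=float)
    scaled = (state - low) / (high - low)  # map each dimension to [0, 1]
    # Every coefficient vector in {0, ..., order}^dim, one per feature.
    coeffs = np.array(
        list(itertools.product(range(order + 1), repeat=len(state))))
    return np.cos(np.pi * coeffs @ scaled)


high = np.array([4.8, 2.0, 0.418, 1.0])
low = -high
features = fourier_features_sketch(np.zeros(4), low, high)
print(features.shape)
```

The feature count grows as `(order + 1) ** dim`, which is why a 4-dimensional state is still cheap here but higher-dimensional states would not be.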

In [9]:

def run_experiment(
        seed, env, sampler, epsilon, num_episodes, gamma, print_trajectory=False):
    """Returns the index of the first episode in which the policy balanced the
    pole for `max_steps` steps, or -1 if that never happened in `num_episodes`."""

    # We record:
    returns_per_episode_q = np.zeros((2, num_episodes))
    average_returns_q = np.zeros((2, num_episodes))  # the cumulative average of the above
    predicted_returns_q = np.zeros((2, num_episodes))

    reset_randomness(seed, env)

    observation, _ = env.reset(seed=seed)
    d = len(phi.get_fourier_feature(observation))
    assert d == phi.num_parameters
    features = np.identity(2 * d)   # An estimate of A = xx^T
    targets = np.zeros(2 * d)  # An estimate of b = xG
    weights = np.zeros(2 * d)   # The weights that approximate A^{-1} b

    x_0 = phi.get_fourier_feature([0,0,0,0])  # the initial state
    x_sa0 = mc2.phi_sa(x_0, 0)
    x_sa1 = mc2.phi_sa(x_0, 1)

    for episode in range(num_episodes):
        def policy(state):
            if random.random() < epsilon:
                return env.action_space.sample()
            # Otherwise calculate the best action.
            x = phi.get_fourier_feature(state)
            qs = np.zeros(2)
            for action in [0, 1]:
                x_sa = mc2.phi_sa(x, action)
                qs[action] = np.inner(x_sa.flatten(), weights)
            # adaptive_time.utils.softmax(qs, 1)
            return adaptive_time.utils.argmax(qs)

        trajectory = environments.generate_trajectory(env, policy=policy, max_steps=100_000)
        if trajectory is None:
            print("episode:", episode)
            print("Did not drop it for a long time, returning!")
            return episode

        if print_trajectory:
            print("trajectory-len: ", len(trajectory), "; trajectory:")
            for idx, (o, a, r, o_) in enumerate(trajectory):
                # * ignore reward, as it is always the same here.
                # * o_ is the same as the next o.
                print(f"* {idx:4d}: o: {o}\n\t --> action: {a}")

        weights, targets, features, cur_avr_returns = mc2.ols_monte_carlo(
            trajectory, sampler, tqdm, phi, weights, targets, features, x_0, gamma)
        
        # Store the empirical and predicted returns. For any episode, we may
        # or may not have empirical returns for both actions. When we don't have an
        # estimate, `nan` is returned.
        returns_per_episode_q[:, episode] = cur_avr_returns
        average_returns_q[:, episode] = np.nanmean(returns_per_episode_q[:, :episode+1], axis=1)

        predicted_returns_q[0, episode] = np.inner(x_sa0.flatten(), weights)
        predicted_returns_q[1, episode] = np.inner(x_sa1.flatten(), weights)
        print(
            'episode:', episode,
            ' empirical returns:' , returns_per_episode_q[:, episode],
            ' predicted returns:' , predicted_returns_q[:, episode])
    
    return -1
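The comments in `run_experiment` describe the learner as a least-squares fit: `features` accumulates A = sum of x x^T, `targets` accumulates b = sum of x G, and `weights` approximates A^{-1} b. The sketch below spells out that update on a toy trajectory. It is not the code of `mc2.ols_monte_carlo` or `mc2.phi_sa`; the helper names are hypothetical, the block-stacked state-action features are an assumption suggested by the `2 * d` sizes above, and every step is used (no sampler).

```python
import numpy as np


def phi_sa_sketch(x, action, num_actions=2):
    """Place state features `x` in the block belonging to `action`."""
    d = len(x)
    out = np.zeros(num_actions * d)
    out[action * d:(action + 1) * d] = x
    return out


def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed backwards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def ols_update(A, b, feats, returns):
    """Accumulate A += x x^T and b += x G, then solve A w = b."""
    for x, g in zip(feats, returns):
        A += np.outer(x, x)
        b += x * g
    return A, b, np.linalg.solve(A, b)


d = 2
A = np.identity(2 * d)  # identity start keeps A invertible, as in the cell above
b = np.zeros(2 * d)
x = np.array([1.0, 0.5])
actions = [0, 0, 1]
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.9)
feats = [phi_sa_sketch(x, a) for a in actions]
A, b, weights = ols_update(A, b, feats, returns)
print(returns, weights)
```

Because A and b are running sums, carrying them across episodes (as `run_experiment` does with `features` and `targets`) fits the weights to all data seen so far rather than to the latest episode only.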

In [13]:
num_episodes = 100
epsilon = 0.1

tau = 0.002
env.stepTime(tau)

# sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.1)
# sampler = samplers.AdaptiveQuadratureSampler2(tolerance=0.0)

samplers_tried = dict(
    q0_3=samplers.AdaptiveQuadratureSampler2(tolerance=0.3),
    q0_1=samplers.AdaptiveQuadratureSampler2(tolerance=0.1),
    q0_03=samplers.AdaptiveQuadratureSampler2(tolerance=0.03),
    q0_0=samplers.AdaptiveQuadratureSampler2(tolerance=0.0),
    u1=samplers.UniformSampler2(1),
    u5=samplers.UniformSampler2(5),
    u10=samplers.UniformSampler2(10),
    u20=samplers.UniformSampler2(20),
)

results = {}
for name, sampler in samplers_tried.items():
    results[name] = run_experiment(seed, env, sampler, epsilon, num_episodes, gamma=0.999)

print()
print("Results:")
# Use a separate name so the loop does not overwrite `num_episodes` above.
for name, episodes_to_solve in results.items():
    print(f"{name}: {episodes_to_solve}")



Using 29/548 samples.



  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [38.64580775 12.42227583]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 1
Did not drop it for a long time, returning!
Using 41/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [38.54726255 12.07184041]
Using 18/123 samples.


  0%|          | 0/123 [00:00<?, ?it/s]

episode: 1  empirical returns: [9.52965598 0.        ]  predicted returns: [24.23798007 12.14794427]
Using 19/159 samples.


  0%|          | 0/159 [00:00<?, ?it/s]

episode: 2  empirical returns: [10.95753037  0.        ]  predicted returns: [19.85380852 12.75762596]
Using 23/179 samples.


  0%|          | 0/179 [00:00<?, ?it/s]

episode: 3  empirical returns: [12.16771336  0.        ]  predicted returns: [17.97095232 12.70873222]
Using 23/177 samples.


  0%|          | 0/177 [00:00<?, ?it/s]

episode: 4  empirical returns: [12.08726013  0.        ]  predicted returns: [16.80486932 12.95675372]
Using 21/173 samples.


  0%|          | 0/173 [00:00<?, ?it/s]

episode: 5  empirical returns: [11.8715752  0.       ]  predicted returns: [15.98069162 13.08495193]
Using 23/172 samples.


  0%|          | 0/172 [00:00<?, ?it/s]

episode: 6  empirical returns: [11.71337222  0.        ]  predicted returns: [15.37294572 13.32588551]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 7
Did not drop it for a long time, returning!
Using 79/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [36.52305997 29.27377728]
Using 68/459 samples.


  0%|          | 0/459 [00:00<?, ?it/s]

episode: 1  empirical returns: [32.82077423  0.        ]  predicted returns: [34.76559224 35.31191037]
Using 26/123 samples.


  0%|          | 0/123 [00:00<?, ?it/s]

episode: 2  empirical returns: [0.         9.51430124]  predicted returns: [34.76559224 24.23858981]
Using 79/574 samples.


  0%|          | 0/574 [00:00<?, ?it/s]

episode: 3  empirical returns: [40.19305626  0.        ]  predicted returns: [37.54254078 24.64539809]
Using 71/568 samples.


  0%|          | 0/568 [00:00<?, ?it/s]

episode: 4  empirical returns: [39.74854335  0.        ]  predicted returns: [37.94683352 24.49000516]
Using 69/565 samples.


  0%|          | 0/565 [00:00<?, ?it/s]

episode: 5  empirical returns: [39.59192017  0.        ]  predicted returns: [38.22479588 24.49955773]
Using 71/569 samples.


  0%|          | 0/569 [00:00<?, ?it/s]

episode: 6  empirical returns: [39.83398062  0.        ]  predicted returns: [38.44007998 24.48550829]
Using 63/557 samples.


  0%|          | 0/557 [00:00<?, ?it/s]

episode: 7  empirical returns: [39.99506272  0.        ]  predicted returns: [38.60854994 24.68747003]
Using 74/583 samples.


  0%|          | 0/583 [00:00<?, ?it/s]

episode: 8  empirical returns: [40.62346967  0.        ]  predicted returns: [38.80446948 24.63569957]
Using 75/590 samples.


  0%|          | 0/590 [00:00<?, ?it/s]

episode: 9  empirical returns: [41.00390111  0.        ]  predicted returns: [38.99855562 24.55550364]
Using 88/597 samples.


  0%|          | 0/597 [00:00<?, ?it/s]

episode: 10  empirical returns: [41.4284572  0.       ]  predicted returns: [39.27344508 25.10097585]
Using 77/605 samples.


  0%|          | 0/605 [00:00<?, ?it/s]

episode: 11  empirical returns: [41.85329026  0.        ]  predicted returns: [39.47843645 25.03276821]
Using 77/598 samples.


  0%|          | 0/598 [00:00<?, ?it/s]

episode: 12  empirical returns: [41.45823445  0.        ]  predicted returns: [39.62252471 25.03739022]
Using 84/601 samples.


  0%|          | 0/601 [00:00<?, ?it/s]

episode: 13  empirical returns: [41.58233249  0.        ]  predicted returns: [39.80084182 25.23088053]
Using 81/612 samples.


  0%|          | 0/612 [00:00<?, ?it/s]

episode: 14  empirical returns: [42.22750387  0.        ]  predicted returns: [39.91632387 25.11781384]
Using 84/623 samples.


  0%|          | 0/623 [00:00<?, ?it/s]

episode: 15  empirical returns: [42.81926876  0.        ]  predicted returns: [40.00573515 25.24334909]
Using 81/634 samples.


  0%|          | 0/634 [00:00<?, ?it/s]

episode: 16  empirical returns: [43.42942663  0.        ]  predicted returns: [40.20322716 25.66946793]
Using 81/641 samples.


  0%|          | 0/641 [00:00<?, ?it/s]

episode: 17  empirical returns: [43.77805142  0.        ]  predicted returns: [40.39261955 25.61366482]
Using 86/670 samples.


  0%|          | 0/670 [00:00<?, ?it/s]

episode: 18  empirical returns: [45.35111703  0.        ]  predicted returns: [40.6683109  25.89699235]
Using 80/710 samples.


  0%|          | 0/710 [00:00<?, ?it/s]

episode: 19  empirical returns: [47.37196642  0.        ]  predicted returns: [40.93867847 25.77489415]
Using 72/692 samples.


  0%|          | 0/692 [00:00<?, ?it/s]

episode: 20  empirical returns: [47.57866836  0.        ]  predicted returns: [41.20642757 25.84432023]
Using 72/705 samples.


  0%|          | 0/705 [00:00<?, ?it/s]

episode: 21  empirical returns: [48.03238036  0.        ]  predicted returns: [41.50174593 25.94011382]
Using 75/701 samples.


  0%|          | 0/701 [00:00<?, ?it/s]

episode: 22  empirical returns: [47.83605884  0.        ]  predicted returns: [41.840375 26.495752]
Using 209/1283 samples.


  0%|          | 0/1283 [00:00<?, ?it/s]

episode: 23  empirical returns: [66.51893164  0.        ]  predicted returns: [43.10398498 27.44430113]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 24
Did not drop it for a long time, returning!
Using 548/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [35.91830993 30.24942918]
Using 361/361 samples.


  0%|          | 0/361 [00:00<?, ?it/s]

episode: 1  empirical returns: [28.45849011  0.        ]  predicted returns: [35.36472805 35.71637349]
Using 1863/1863 samples.


  0%|          | 0/1863 [00:00<?, ?it/s]

episode: 2  empirical returns: [ 0.         82.02244811]  predicted returns: [54.30500398 66.10999344]
Using 649/649 samples.


  0%|          | 0/649 [00:00<?, ?it/s]

episode: 3  empirical returns: [ 0.         44.88951313]  predicted returns: [41.19728746 47.10703236]
Using 255/255 samples.


  0%|          | 0/255 [00:00<?, ?it/s]

episode: 4  empirical returns: [ 0.         19.43532548]  predicted returns: [46.67649866 50.50184754]
Using 233/233 samples.


  0%|          | 0/233 [00:00<?, ?it/s]

episode: 5  empirical returns: [ 0.         18.79206236]  predicted returns: [32.47099161 37.37620204]
Using 455/455 samples.


  0%|          | 0/455 [00:00<?, ?it/s]

episode: 6  empirical returns: [ 0.         33.11931168]  predicted returns: [32.07258811 34.39966632]
Using 340/340 samples.


  0%|          | 0/340 [00:00<?, ?it/s]

episode: 7  empirical returns: [ 0.         26.68285605]  predicted returns: [30.02010511 32.90284232]
Using 403/403 samples.


  0%|          | 0/403 [00:00<?, ?it/s]

episode: 8  empirical returns: [ 0.         31.01961097]  predicted returns: [31.835958   34.61759991]
Using 675/675 samples.


  0%|          | 0/675 [00:00<?, ?it/s]

episode: 9  empirical returns: [ 0.         46.77211508]  predicted returns: [31.60788984 35.15722459]
Using 534/534 samples.


  0%|          | 0/534 [00:00<?, ?it/s]

episode: 10  empirical returns: [ 0.        37.8610709]  predicted returns: [32.52375828 35.68179701]
Using 611/611 samples.


  0%|          | 0/611 [00:00<?, ?it/s]

episode: 11  empirical returns: [ 0.         41.61848623]  predicted returns: [32.58748281 36.25691349]
Using 611/611 samples.


  0%|          | 0/611 [00:00<?, ?it/s]

episode: 12  empirical returns: [ 0.         41.81897408]  predicted returns: [33.02158675 36.83651904]
Using 609/609 samples.


  0%|          | 0/609 [00:00<?, ?it/s]

episode: 13  empirical returns: [ 0.         41.43381738]  predicted returns: [33.18877555 37.00971386]
Using 615/615 samples.


  0%|          | 0/615 [00:00<?, ?it/s]

episode: 14  empirical returns: [ 0.         41.70953441]  predicted returns: [33.35995812 37.10163755]
Using 620/620 samples.


  0%|          | 0/620 [00:00<?, ?it/s]

episode: 15  empirical returns: [ 0.         41.84819395]  predicted returns: [33.60520272 37.21336416]
Using 627/629 samples.


  0%|          | 0/629 [00:00<?, ?it/s]

episode: 16  empirical returns: [ 0.         42.15698783]  predicted returns: [33.75531849 37.32372441]
Using 782/782 samples.


  0%|          | 0/782 [00:00<?, ?it/s]

episode: 17  empirical returns: [ 0.         51.24447541]  predicted returns: [34.32819605 38.53884562]
Using 738/738 samples.


  0%|          | 0/738 [00:00<?, ?it/s]

episode: 18  empirical returns: [ 0.         48.89151119]  predicted returns: [35.03648163 39.41646117]
Using 687/687 samples.


  0%|          | 0/687 [00:00<?, ?it/s]

episode: 19  empirical returns: [ 0.         46.17854184]  predicted returns: [35.35472344 39.830005  ]
Using 702/702 samples.


  0%|          | 0/702 [00:00<?, ?it/s]

episode: 20  empirical returns: [ 0.         46.83524623]  predicted returns: [35.6786645  40.20209902]
Using 729/729 samples.


  0%|          | 0/729 [00:00<?, ?it/s]

episode: 21  empirical returns: [48.16442451  0.        ]  predicted returns: [36.19727177 40.67224795]
Using 750/750 samples.


  0%|          | 0/750 [00:00<?, ?it/s]

episode: 22  empirical returns: [ 0.         49.14948636]  predicted returns: [36.60079675 41.09670042]
Using 763/763 samples.


  0%|          | 0/763 [00:00<?, ?it/s]

episode: 23  empirical returns: [ 0.         50.06394522]  predicted returns: [37.0519421  41.56123152]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 24
Did not drop it for a long time, returning!
Using 548/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [35.91830993 30.24942918]
Using 361/361 samples.


  0%|          | 0/361 [00:00<?, ?it/s]

episode: 1  empirical returns: [28.45849011  0.        ]  predicted returns: [35.36472805 35.71637349]
Using 1863/1863 samples.


  0%|          | 0/1863 [00:00<?, ?it/s]

episode: 2  empirical returns: [ 0.         82.02244811]  predicted returns: [54.30500398 66.10999344]
Using 649/649 samples.


  0%|          | 0/649 [00:00<?, ?it/s]

episode: 3  empirical returns: [ 0.         44.88951313]  predicted returns: [41.19728746 47.10703236]
Using 255/255 samples.


  0%|          | 0/255 [00:00<?, ?it/s]

episode: 4  empirical returns: [ 0.         19.43532548]  predicted returns: [46.67649866 50.50184754]
Using 233/233 samples.


  0%|          | 0/233 [00:00<?, ?it/s]

episode: 5  empirical returns: [ 0.         18.79206236]  predicted returns: [32.47099161 37.37620204]
Using 455/455 samples.


  0%|          | 0/455 [00:00<?, ?it/s]

episode: 6  empirical returns: [ 0.         33.11931168]  predicted returns: [32.07258811 34.39966632]
Using 340/340 samples.


  0%|          | 0/340 [00:00<?, ?it/s]

episode: 7  empirical returns: [ 0.         26.68285605]  predicted returns: [30.02010511 32.90284232]
Using 403/403 samples.


  0%|          | 0/403 [00:00<?, ?it/s]

episode: 8  empirical returns: [ 0.         31.01961097]  predicted returns: [31.835958   34.61759991]
Using 675/675 samples.


  0%|          | 0/675 [00:00<?, ?it/s]

episode: 9  empirical returns: [ 0.         46.77211508]  predicted returns: [31.60788984 35.15722459]
Using 534/534 samples.


  0%|          | 0/534 [00:00<?, ?it/s]

episode: 10  empirical returns: [ 0.        37.8610709]  predicted returns: [32.52375828 35.68179701]
Using 611/611 samples.


  0%|          | 0/611 [00:00<?, ?it/s]

episode: 11  empirical returns: [ 0.         41.61848623]  predicted returns: [32.58748281 36.25691349]
Using 611/611 samples.


  0%|          | 0/611 [00:00<?, ?it/s]

episode: 12  empirical returns: [ 0.         41.81897408]  predicted returns: [33.02158675 36.83651904]
Using 609/609 samples.


  0%|          | 0/609 [00:00<?, ?it/s]

episode: 13  empirical returns: [ 0.         41.43381738]  predicted returns: [33.18877555 37.00971386]
Using 615/615 samples.


  0%|          | 0/615 [00:00<?, ?it/s]

episode: 14  empirical returns: [ 0.         41.70953441]  predicted returns: [33.35995812 37.10163755]
Using 620/620 samples.


  0%|          | 0/620 [00:00<?, ?it/s]

episode: 15  empirical returns: [ 0.         41.84819395]  predicted returns: [33.60520272 37.21336416]
Using 629/629 samples.


  0%|          | 0/629 [00:00<?, ?it/s]

episode: 16  empirical returns: [ 0.         42.15698783]  predicted returns: [33.78158974 37.34025895]
Using 781/781 samples.


  0%|          | 0/781 [00:00<?, ?it/s]

episode: 17  empirical returns: [ 0.         51.17061056]  predicted returns: [34.30951272 38.51425187]
Using 726/726 samples.


  0%|          | 0/726 [00:00<?, ?it/s]

episode: 18  empirical returns: [ 0.         48.26023685]  predicted returns: [34.97897008 39.34889788]
Using 680/680 samples.


  0%|          | 0/680 [00:00<?, ?it/s]

episode: 19  empirical returns: [ 0.         45.75459854]  predicted returns: [35.26589471 39.70680677]
Using 705/705 samples.


  0%|          | 0/705 [00:00<?, ?it/s]

episode: 20  empirical returns: [ 0.         47.03055054]  predicted returns: [35.62466584 40.10769406]
Using 728/728 samples.


  0%|          | 0/728 [00:00<?, ?it/s]

episode: 21  empirical returns: [48.07806195  0.        ]  predicted returns: [36.07668387 40.56516466]
Using 754/754 samples.


  0%|          | 0/754 [00:00<?, ?it/s]

episode: 22  empirical returns: [ 0.         49.30320604]  predicted returns: [36.53637443 41.01036993]
Using 789/789 samples.


  0%|          | 0/789 [00:00<?, ?it/s]

episode: 23  empirical returns: [ 0.         51.29941265]  predicted returns: [37.12224109 41.55397685]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 24
Did not drop it for a long time, returning!
Using 110/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [35.38091958 30.08489224]
Using 149/745 samples.


  0%|          | 0/745 [00:00<?, ?it/s]

episode: 1  empirical returns: [44.34234383  0.        ]  predicted returns: [37.69094434 24.11495142]
Using 91/451 samples.


  0%|          | 0/451 [00:00<?, ?it/s]

episode: 2  empirical returns: [30.53485378  0.        ]  predicted returns: [36.63446409 10.49957174]
Using 95/471 samples.


  0%|          | 0/471 [00:00<?, ?it/s]

episode: 3  empirical returns: [ 0.         32.06530721]  predicted returns: [35.24980375 18.59643865]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 4
Did not drop it for a long time, returning!
Using 55/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [35.86225645 29.90625804]
Using 30/291 samples.


  0%|          | 0/291 [00:00<?, ?it/s]

episode: 1  empirical returns: [23.31033246  0.        ]  predicted returns: [31.52833501 48.41076766]
Using 13/125 samples.


  0%|          | 0/125 [00:00<?, ?it/s]

episode: 2  empirical returns: [0.         9.56224242]  predicted returns: [31.53074612 20.51453617]
Using 26/260 samples.


  0%|          | 0/260 [00:00<?, ?it/s]

episode: 3  empirical returns: [20.75439178  0.        ]  predicted returns: [28.65299936 22.61754423]
Using 13/125 samples.


  0%|          | 0/125 [00:00<?, ?it/s]

episode: 4  empirical returns: [9.68746365 0.        ]  predicted returns: [25.17131345 22.61754423]
Using 77/769 samples.


  0%|          | 0/769 [00:00<?, ?it/s]

episode: 5  empirical returns: [52.30468292  0.        ]  predicted returns: [39.41996295 33.63411998]
Using 21/207 samples.


  0%|          | 0/207 [00:00<?, ?it/s]

episode: 6  empirical returns: [16.75329653  0.        ]  predicted returns: [36.46069495 31.98023989]
Using 34/337 samples.


  0%|          | 0/337 [00:00<?, ?it/s]

episode: 7  empirical returns: [26.10825333  0.        ]  predicted returns: [36.23161128 31.72077011]
Using 37/366 samples.


  0%|          | 0/366 [00:00<?, ?it/s]

episode: 8  empirical returns: [27.74415369  0.        ]  predicted returns: [35.74538511 32.12556655]
Using 37/364 samples.


  0%|          | 0/364 [00:00<?, ?it/s]

episode: 9  empirical returns: [27.88810434  0.        ]  predicted returns: [35.35227837 32.02872365]
Using 37/365 samples.


  0%|          | 0/365 [00:00<?, ?it/s]

episode: 10  empirical returns: [27.93039743  0.        ]  predicted returns: [35.01881516 31.93082889]
Using 38/374 samples.


  0%|          | 0/374 [00:00<?, ?it/s]

episode: 11  empirical returns: [28.58958082  0.        ]  predicted returns: [34.58199086 31.94562354]
Using 39/390 samples.


  0%|          | 0/390 [00:00<?, ?it/s]

episode: 12  empirical returns: [ 0.         29.68860535]  predicted returns: [34.48568547 31.51131514]
Using 40/399 samples.


  0%|          | 0/399 [00:00<?, ?it/s]

episode: 13  empirical returns: [30.30156592  0.        ]  predicted returns: [34.1173461 31.5366202]
Using 39/389 samples.


  0%|          | 0/389 [00:00<?, ?it/s]

episode: 14  empirical returns: [29.59994143  0.        ]  predicted returns: [33.82553159 31.34640131]
Using 41/402 samples.


  0%|          | 0/402 [00:00<?, ?it/s]

episode: 15  empirical returns: [30.60002934  0.        ]  predicted returns: [33.6631352  31.11275155]
Using 39/390 samples.


  0%|          | 0/390 [00:00<?, ?it/s]

episode: 16  empirical returns: [29.7687334  0.       ]  predicted returns: [33.41462633 31.19025215]
Using 40/398 samples.


  0%|          | 0/398 [00:00<?, ?it/s]

episode: 17  empirical returns: [30.34187015  0.        ]  predicted returns: [33.27181019 31.04014074]
Using 40/395 samples.


  0%|          | 0/395 [00:00<?, ?it/s]

episode: 18  empirical returns: [30.1066019  0.       ]  predicted returns: [33.08577397 31.07604688]
Using 40/394 samples.


  0%|          | 0/394 [00:00<?, ?it/s]

episode: 19  empirical returns: [30.06938068  0.        ]  predicted returns: [32.9126194  31.09947558]
Using 40/399 samples.


  0%|          | 0/399 [00:00<?, ?it/s]

episode: 20  empirical returns: [30.34762884  0.        ]  predicted returns: [32.80108986 31.06231697]
Using 41/403 samples.


  0%|          | 0/403 [00:00<?, ?it/s]

episode: 21  empirical returns: [30.65653976  0.        ]  predicted returns: [32.69104144 31.05181656]
Using 40/397 samples.


  0%|          | 0/397 [00:00<?, ?it/s]

episode: 22  empirical returns: [30.27635189  0.        ]  predicted returns: [32.54408918 31.0610935 ]
Using 41/410 samples.


  0%|          | 0/410 [00:00<?, ?it/s]

episode: 23  empirical returns: [31.14537398  0.        ]  predicted returns: [32.47945599 31.04443023]
Using 41/405 samples.


  0%|          | 0/405 [00:00<?, ?it/s]

episode: 24  empirical returns: [30.81582914  0.        ]  predicted returns: [32.40077143 31.02175778]
Using 41/408 samples.


  0%|          | 0/408 [00:00<?, ?it/s]

episode: 25  empirical returns: [31.03778159  0.        ]  predicted returns: [32.32821059 31.03741174]
Using 41/410 samples.


  0%|          | 0/410 [00:00<?, ?it/s]

episode: 26  empirical returns: [31.18211311  0.        ]  predicted returns: [32.2557551  31.12348686]
Using 42/411 samples.


  0%|          | 0/411 [00:00<?, ?it/s]

episode: 27  empirical returns: [31.29127215  0.        ]  predicted returns: [32.18700905 31.11744094]
Using 41/409 samples.


  0%|          | 0/409 [00:00<?, ?it/s]

episode: 28  empirical returns: [31.1661362  0.       ]  predicted returns: [32.11134828 31.19277637]
Using 117/1162 samples.


  0%|          | 0/1162 [00:00<?, ?it/s]

episode: 29  empirical returns: [58.6453274  0.       ]  predicted returns: [33.98404803 34.50972354]
Using 116/1160 samples.


  0%|          | 0/1160 [00:00<?, ?it/s]

episode: 30  empirical returns: [ 0.         62.19302867]  predicted returns: [36.10573214 43.65070102]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 31
Did not drop it for a long time, returning!
Using 28/548 samples.


  0%|          | 0/548 [00:00<?, ?it/s]

episode: 0  empirical returns: [39.07399847  0.        ]  predicted returns: [36.74912534 27.6147081 ]
Using 7/123 samples.


  0%|          | 0/123 [00:00<?, ?it/s]

episode: 1  empirical returns: [9.53149333 0.        ]  predicted returns: [25.35204375 27.65554112]
Using 8/147 samples.


  0%|          | 0/147 [00:00<?, ?it/s]

episode: 2  empirical returns: [ 0.         11.63997281]  predicted returns: [25.20749215 14.29399636]
Using 12/234 samples.


  0%|          | 0/234 [00:00<?, ?it/s]

episode: 3  empirical returns: [18.86039022  0.        ]  predicted returns: [23.56246016 19.20513889]
Using 8/154 samples.


  0%|          | 0/154 [00:00<?, ?it/s]

episode: 4  empirical returns: [12.18376002  0.        ]  predicted returns: [21.28701667 19.05210994]
Using 11/205 samples.


  0%|          | 0/205 [00:00<?, ?it/s]

episode: 5  empirical returns: [15.71362585  0.        ]  predicted returns: [20.45318231 18.92653417]
Using 12/221 samples.


  0%|          | 0/221 [00:00<?, ?it/s]

episode: 6  empirical returns: [16.25117544  0.        ]  predicted returns: [19.87972283 19.10707381]
Using 11/209 samples.


  0%|          | 0/209 [00:00<?, ?it/s]

episode: 7  empirical returns: [15.98285649  0.        ]  predicted returns: [19.52990209 19.12444562]
Using 11/217 samples.


  0%|          | 0/217 [00:00<?, ?it/s]

episode: 8  empirical returns: [16.19190148  0.        ]  predicted returns: [19.23297865 19.2462129 ]
Using 8/155 samples.


  0%|          | 0/155 [00:00<?, ?it/s]

episode: 9  empirical returns: [ 0.         12.34324764]  predicted returns: [19.23286653 16.66143863]
Using 12/224 samples.


  0%|          | 0/224 [00:00<?, ?it/s]

episode: 10  empirical returns: [16.50201695  0.        ]  predicted returns: [18.99053243 16.82721556]
Using 12/222 samples.


  0%|          | 0/222 [00:00<?, ?it/s]

episode: 11  empirical returns: [16.3804122  0.       ]  predicted returns: [18.81726848 16.91198054]
Using 11/213 samples.


episode: 12  empirical returns: [16.21114301  0.        ]  predicted returns: [18.64938006 16.95322146]
Using 12/227 samples.


episode: 13  empirical returns: [16.83635697  0.        ]  predicted returns: [18.5517028  17.02535501]
Using 12/225 samples.


episode: 14  empirical returns: [16.66964845  0.        ]  predicted returns: [18.43743728 17.07342272]
Using 11/216 samples.


episode: 15  empirical returns: [16.46008725  0.        ]  predicted returns: [18.32373543 17.11695429]
Using 12/226 samples.


episode: 16  empirical returns: [16.73840781  0.        ]  predicted returns: [18.23520439 17.1390869 ]
Using 11/217 samples.


episode: 17  empirical returns: [16.54219994  0.        ]  predicted returns: [18.14496122 17.16579177]
Using 12/227 samples.


episode: 18  empirical returns: [16.80967397  0.        ]  predicted returns: [18.07730909 17.17870164]
Using 11/218 samples.


episode: 19  empirical returns: [16.57619974  0.        ]  predicted returns: [18.02113292 17.19353059]
Using 11/219 samples.


episode: 20  empirical returns: [16.65623554  0.        ]  predicted returns: [17.95675321 17.21563552]
Using 12/231 samples.


episode: 21  empirical returns: [17.09200908  0.        ]  predicted returns: [17.93162566 17.2223628 ]
Using 11/218 samples.


episode: 22  empirical returns: [16.67243373  0.        ]  predicted returns: [17.88270746 17.28121973]
Using 12/221 samples.


episode: 23  empirical returns: [16.80823749  0.        ]  predicted returns: [17.84272088 17.32734256]
Using 12/222 samples.


episode: 24  empirical returns: [16.89879446  0.        ]  predicted returns: [17.81149314 17.34122168]
Using 11/220 samples.


episode: 25  empirical returns: [16.80464428  0.        ]  predicted returns: [17.77512782 17.37906137]
Using 12/234 samples.


episode: 26  empirical returns: [17.22912613  0.        ]  predicted returns: [17.76072795 17.4033113 ]
Using 12/223 samples.


episode: 27  empirical returns: [16.94482749  0.        ]  predicted returns: [17.73611908 17.42982527]
Using 12/224 samples.


episode: 28  empirical returns: [17.06628427  0.        ]  predicted returns: [17.71795776 17.45751285]
Using 12/237 samples.


episode: 29  empirical returns: [17.42428212  0.        ]  predicted returns: [17.71149618 17.47639305]
Using 12/238 samples.


episode: 30  empirical returns: [17.50551716  0.        ]  predicted returns: [17.70715078 17.49632501]
Using 12/238 samples.


episode: 31  empirical returns: [17.46175554  0.        ]  predicted returns: [17.70006161 17.52668635]
Using 13/241 samples.


episode: 32  empirical returns: [17.67020996  0.        ]  predicted returns: [17.7002488  17.58547767]
Using 12/240 samples.


episode: 33  empirical returns: [17.54352008  0.        ]  predicted returns: [17.69612169 17.61098129]
Using 13/243 samples.


episode: 34  empirical returns: [17.80837198  0.        ]  predicted returns: [17.69891219 17.6328465 ]
Using 13/242 samples.


episode: 35  empirical returns: [17.66803576  0.        ]  predicted returns: [17.70007679 17.66078595]
Using 13/243 samples.


episode: 36  empirical returns: [17.72441869  0.        ]  predicted returns: [17.70094207 17.67919292]
Using 13/244 samples.


episode: 37  empirical returns: [17.77186885  0.        ]  predicted returns: [17.70295111 17.68813393]
Using 13/244 samples.


episode: 38  empirical returns: [17.75041939  0.        ]  predicted returns: [17.70410777 17.6958187 ]
Using 13/245 samples.


episode: 39  empirical returns: [17.90087584  0.        ]  predicted returns: [17.7097592  17.72105593]
Using 7/140 samples.


episode: 40  empirical returns: [ 0.         11.07803233]  predicted returns: [17.7097592  16.05632938]
Using 13/244 samples.


episode: 41  empirical returns: [17.76399  0.     ]  predicted returns: [17.70977941 16.06373428]
Using 13/242 samples.


episode: 42  empirical returns: [17.66079013  0.        ]  predicted returns: [17.708438  16.0642474]
Using 13/246 samples.


episode: 43  empirical returns: [17.90310945  0.        ]  predicted returns: [17.71307699 16.07208806]
Using 13/244 samples.


episode: 44  empirical returns: [17.7784614  0.       ]  predicted returns: [17.71447528 16.07949094]
Using 13/245 samples.


episode: 45  empirical returns: [17.84338259  0.        ]  predicted returns: [17.71751706 16.08509098]
Using 13/248 samples.


episode: 46  empirical returns: [18.08293091  0.        ]  predicted returns: [17.72631032 16.09951984]
Using 13/244 samples.


episode: 47  empirical returns: [17.80236603  0.        ]  predicted returns: [17.72787976 16.10579653]
Using 13/245 samples.


episode: 48  empirical returns: [17.86113021  0.        ]  predicted returns: [17.73060258 16.11374351]
Using 13/249 samples.


episode: 49  empirical returns: [18.18586632  0.        ]  predicted returns: [17.73902502 16.12101765]
Using 13/245 samples.


episode: 50  empirical returns: [17.86950339  0.        ]  predicted returns: [17.74321031 16.12323489]
Using 13/246 samples.


episode: 51  empirical returns: [17.93085312  0.        ]  predicted returns: [17.7467799  16.13324444]
Using 13/246 samples.


episode: 52  empirical returns: [18.00482095  0.        ]  predicted returns: [17.75198422 16.14386973]
Using 13/246 samples.


episode: 53  empirical returns: [17.98756888  0.        ]  predicted returns: [17.7567085  16.15085139]
Using 13/246 samples.


episode: 54  empirical returns: [17.93569963  0.        ]  predicted returns: [17.75996965 16.15628231]
Using 13/245 samples.


episode: 55  empirical returns: [17.94725662  0.        ]  predicted returns: [17.76357679 16.16661312]
Using 13/247 samples.


episode: 56  empirical returns: [18.03862829  0.        ]  predicted returns: [17.76875565 16.18024193]
Using 13/253 samples.


episode: 57  empirical returns: [ 0.         18.51129361]  predicted returns: [17.7670731  16.65830688]
Using 13/249 samples.


episode: 58  empirical returns: [18.12564858  0.        ]  predicted returns: [17.77341742 16.66853601]
Using 13/249 samples.


episode: 59  empirical returns: [18.15332654  0.        ]  predicted returns: [17.77995035 16.68345959]
Using 13/249 samples.


episode: 60  empirical returns: [18.17658212  0.        ]  predicted returns: [17.78681638 16.69193809]
Using 13/251 samples.


episode: 61  empirical returns: [18.28067561  0.        ]  predicted returns: [17.79508125 16.70596303]
Using 13/250 samples.


episode: 62  empirical returns: [18.23843377  0.        ]  predicted returns: [17.80314975 16.71483076]
Using 13/253 samples.


episode: 63  empirical returns: [18.47281975  0.        ]  predicted returns: [17.81406088 16.72096901]
Using 13/254 samples.


episode: 64  empirical returns: [18.53809636  0.        ]  predicted returns: [17.82634199 16.73325497]
Using 14/264 samples.


episode: 65  empirical returns: [19.16944421  0.        ]  predicted returns: [17.84823634 16.75543465]
Did 5000 steps! 5000
Did 5000 steps! 10000
Did 5000 steps! 15000
Did 5000 steps! 20000
Did 5000 steps! 25000
Did 5000 steps! 30000
Did 5000 steps! 35000
Did 5000 steps! 40000
Did 5000 steps! 45000
Did 5000 steps! 50000
Did 5000 steps! 55000
Did 5000 steps! 60000
Did 5000 steps! 65000
Did 5000 steps! 70000
Did 5000 steps! 75000
Did 5000 steps! 80000
Did 5000 steps! 85000
Did 5000 steps! 90000
Did 5000 steps! 95000
Did 5000 steps! 100000
Max steps reached! 100001
episode: 66
Did not drop it for a long time, returning!

Results:
q0_3: 1
q0_1: 7
q0_03: 24
q0_0: 24
u1: 24
u5: 4
u10: 31
u20: 66


In [14]:
# Report the episode count recorded for each setting in `results`.
print()
print("Results:")
for name, num_episodes in results.items():
    print(f"{name}: {num_episodes}")



Results:
q0_3: 1
q0_1: 7
q0_03: 24
q0_0: 24
u1: 24
u5: 4
u10: 31
u20: 66
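
To compare the settings at a glance, the counts can also be listed in ascending order. This is only a minimal sketch: it assumes `results` is a plain dict mapping setting names to episode counts, with the values copied from the output above, and `sorted_results` is a hypothetical helper that is not part of the notebook.

```python
# Episode counts copied from the printed results above.
results = {
    "q0_3": 1, "q0_1": 7, "q0_03": 24, "q0_0": 24,
    "u1": 24, "u5": 4, "u10": 31, "u20": 66,
}


def sorted_results(results):
    """Return (name, num_episodes) pairs ordered by episode count.

    Hypothetical helper. Python's sort is stable, so settings with equal
    counts keep their original relative order.
    """
    return sorted(results.items(), key=lambda item: item[1])


for name, num_episodes in sorted_results(results):
    print(f"{name:>6}: {num_episodes}")
```

Sorting by count places the tied settings (`q0_03`, `q0_0`, `u1`) next to each other, which makes ties easier to spot than in insertion order.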
