In [4]:
import gymnasium as gym
import numpy as np

## Gymnasium API Notes

In [20]:
env = gym.make("BipedalWalker-v3", render_mode=None)

env.action_space.shape[0]

4

In [21]:
env2 = gym.make("LunarLander-v3", render_mode=None)

env2.action_space.n

np.int64(4)

#### Action space dimension for continuous parallel environments

In [22]:
envs = gym.make_vec("Pendulum-v1", num_envs=2, vectorization_mode="sync")

envs.single_action_space.shape[0]
envs.action_space.shape[-1]

1

#### Action space dimension for discrete parallel environments

In [23]:
envs = gym.make_vec("LunarLander-v3", num_envs=2, vectorization_mode="sync")

envs.single_action_space.n
envs.action_space.shape # (num_envs,): one discrete action per parallel environment
envs.action_space[0].n

np.int64(4)
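The four cells above can be folded into one small helper. This is a minimal sketch, not part of Gymnasium: `action_dim` is a hypothetical name, and it relies only on the `n` and `shape` attributes already used above. For a vectorized environment it assumes you pass `envs.single_action_space`, not the batched `envs.action_space`. The objects in the demo are stand-ins, so the snippet runs without creating any environment.

```python
from types import SimpleNamespace


def action_dim(space):
    """Return the policy output size for a single-environment action space.

    Duck-typed: discrete spaces expose `n` (number of actions), while
    continuous Box-like spaces expose `shape` (last axis = action dimension).
    """
    if hasattr(space, "n"):
        return int(space.n)
    return int(space.shape[-1])


# Stand-ins that mimic only the attributes used above (not real Gymnasium spaces).
discrete_like = SimpleNamespace(n=4, shape=())   # like LunarLander-v3
continuous_like = SimpleNamespace(shape=(4,))    # like BipedalWalker-v3

print(action_dim(discrete_like))
print(action_dim(continuous_like))
```

With real environments the calls would be `action_dim(env.action_space)` or `action_dim(envs.single_action_space)`.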

## Tensor Manipulation

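For A2C, we have several tensors with the following shapes:

- states: (N + 1,)
- actions: (N, )
- rewards: (N, )
- log_probs: (N, )
- dones: (N, )

where N is the number of rewards received, i.e. the number of environment steps taken. `states` has one extra entry because it holds the initial state as well as the state reached after each step.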

Initially, each of these was stored in its own Python list, which forces the rollout code to be sequential and inefficient. To introduce parallelism, we need to generalize this by using tensors and NumPy arrays.
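As a rough illustration of the list-to-array step for a single environment, the sketch below stacks per-step lists into NumPy arrays once the rollout is finished. The rollout length, observation size, and the data itself are made up for the example. Each state is stored as a full observation vector here, so `states_arr` has a trailing observation axis.

```python
import numpy as np

N, obs_dim = 5, 8  # hypothetical rollout length and observation size

# Per-step Python lists, as a sequential rollout loop would fill them.
states = [np.zeros(obs_dim, dtype=np.float32) for _ in range(N + 1)]
rewards = [float(t) for t in range(N)]
dones = [False] * (N - 1) + [True]

# Stack once at the end of the rollout.
states_arr = np.stack(states)                         # (N + 1, obs_dim)
rewards_arr = np.asarray(rewards, dtype=np.float32)   # (N,)
dones_arr = np.asarray(dones, dtype=bool)             # (N,)

print(states_arr.shape, rewards_arr.shape, dones_arr.shape)
```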

In [8]:
# LunarLander has a discrete action space with 4 possible actions
envs = gym.make_vec("LunarLander-v3", num_envs=2, vectorization_mode="sync")
envs.reset(seed=42)

actions = envs.action_space.sample()

obs, rewards, terminates, truncates, infos = envs.step(actions)

print(actions)

print("Actions:")
for a in actions:
    print(a)

print("Observations:")
for ob in obs:
    print(ob)

[3 1]
Actions:
3
1
Observations:
[ 0.00465546  1.4247642   0.24004106  0.29480776 -0.00680472 -0.08300382
  0.          0.        ]
[-0.00712929  1.3984202  -0.3651231  -0.29073122  0.0096392   0.11075787
  0.          0.        ]


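To parallelize over environments, we can replace the lists with the following NumPy arrays:

- states: (T + 1, state_space_size, E)
- rewards: (T, E)
- log_probs: (T, E)

where T is the number of timesteps and E is the number of environments. Note that `envs.step` returns observations with shape (E, state_space_size), as the output above shows, so each batch has to be transposed before it is written into `states`. With this layout we can keep our original code and apply each operation across all environments at once. The loop over timesteps still has to run sequentially, because each step depends on the result of the previous one, so vectorizing over the environment axis is probably the best we can do for these algorithms.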
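To make the data dependency concrete, here is a minimal sketch that assumes the per-environment operation is the bootstrapped discounted return commonly used with A2C. The function name `discounted_returns`, the `last_values` argument, and the toy data are all illustrative and do not come from the original code. The loop over T is sequential, but every line inside it acts on all E environments at once.

```python
import numpy as np


def discounted_returns(rewards, dones, last_values, gamma=0.99):
    """Bootstrapped discounted returns for rewards of shape (T, E).

    The loop over T stays sequential (each return depends on the next one),
    while each statement inside it operates on all E environments at once.
    """
    T = rewards.shape[0]
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = np.asarray(last_values, dtype=np.float64)  # (E,)
    for t in reversed(range(T)):
        # Drop the carried-over return wherever an episode ended at step t.
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns


rewards = np.array([[1.0, 0.0],
                    [1.0, 2.0],
                    [1.0, 0.0]])   # (T=3, E=2)
dones = np.array([[0.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])     # environment 1 ends an episode at t=1
last_values = np.zeros(2)          # bootstrap values, zero for simplicity

returns = discounted_returns(rewards, dones, last_values, gamma=0.5)
print(returns)
```

The same function also accepts single-environment arrays of shape (T,) with a scalar `last_values`, which is the sense in which the sequential code carries over unchanged.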