In [47]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn 
import torch.nn.functional as F
for i in gym.envs.registry.keys():
    print(i)
    

CartPole-v0
CartPole-v1
MountainCar-v0
MountainCarContinuous-v0
Pendulum-v1
Acrobot-v1
phys2d/CartPole-v0
phys2d/CartPole-v1
phys2d/Pendulum-v0
LunarLander-v2
LunarLanderContinuous-v2
BipedalWalker-v3
BipedalWalkerHardcore-v3
CarRacing-v2
Blackjack-v1
FrozenLake-v1
FrozenLake8x8-v1
CliffWalking-v0
Taxi-v3
tabular/Blackjack-v0
tabular/CliffWalking-v0
Reacher-v2
Reacher-v4
Pusher-v2
Pusher-v4
InvertedPendulum-v2
InvertedPendulum-v4
InvertedDoublePendulum-v2
InvertedDoublePendulum-v4
HalfCheetah-v2
HalfCheetah-v3
HalfCheetah-v4
Hopper-v2
Hopper-v3
Hopper-v4
Swimmer-v2
Swimmer-v3
Swimmer-v4
Walker2d-v2
Walker2d-v3
Walker2d-v4
Ant-v2
Ant-v3
Ant-v4
Humanoid-v2
Humanoid-v3
Humanoid-v4
HumanoidStandup-v2
HumanoidStandup-v4
GymV26Environment-v0
GymV21Environment-v0
Adventure-v0
AdventureDeterministic-v0
AdventureNoFrameskip-v0
Adventure-v4
AdventureDeterministic-v4
AdventureNoFrameskip-v4
Adventure-ram-v0
Adventure-ramDeterministic-v0
Adventure-ramNoFrameskip-v0
Adventure-ram-v4
Adventure-ramDe

Classic Control: These are canonical environments used in RL development; they form the basis of many textbook examples. They give the right mix of complexity and simplicity to test and benchmark new RL algorithms. Classic control environments in Gymnasium include: 
Acrobot
Cart Pole
Mountain Car Discrete
Mountain Car Continuous
Pendulum
Box2D: Box2D is a 2D Physics Engine for Games. Environments based on this engine include simple games like:
Lunar Lander
Car Racing
ToyText: These are small and simple environments often used to debug RL algorithms. Many of these environments are based on the small grid world model and simple card games. Examples include: 
Blackjack
Taxi
Frozen Lake
MuJoCo: Multi-Joint dynamics with Contact (MuJoCo) is an open-source physics engine that simulates environments for applications like robotics, biomechanics, ML, etc. MuJoCo environments in Gymnasium include:
Ant
Hopper
Humanoid
Swimmer
And more
In addition to the built-in environments, Gymnasium can be used with many external environments using the same API. 

We’ll use one of the canonical Classic Control environments in this tutorial. To import a specific environment, use the .make() command and pass the name of the environment as an argument. For example, to create a new environment based on CartPole (version 1), use the command below: 

In [3]:
import gymnasium as gym
env = gym.make("CartPole-v1")

Observation space 
The observation space is the space that includes all possible observations. It also defines the format in which observations are stored. The observation space is typically represented as an object of datatype Box. This is an ndarray which describes the parameters of the observations. The box specifies the bounds of each dimension. You can view the observation space for an environment using the observation_space method:

In [4]:
print("observation space: ", env.observation_space)

observation space:  Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)


In this example, the CartPole-v1 observation space has 4 dimensions. The 4 elements of the observation array are:

Cart position - varies between -4.8 and +4.8
Cart velocity - ranges between - to +
Pole angle - varies between -0.4189 and +0.4189
Pole angular velocity - ranges between  - to +

In [5]:
#reset method to see an individual obs

observation, info = env.reset()
print("observation: ", observation)

observation:  [ 0.00616072  0.02858067 -0.03229749 -0.0110935 ]


Action space
The action space includes all possible actions that the agent can take. The action space also defines the format in which actions are represented. You can view the action space for an environment using the action_space method:

In [6]:
print("action space: ", env.action_space)

action space:  Discrete(2)


## Making custom agent

In [10]:
#creating and resetting the environment

env = gym.make('CartPole-v1')

SEED = 42
env.reset(seed=SEED)

np.random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x12e8169b0>

Random versus intelligent actions
In each step in a Markov process, the agent can randomly choose an action and explore the environment until it arrives at a terminal state. By choosing actions at random: 

It can take a long time to reach the terminal state.
The cumulative rewards are much lower than what they could have been.
Training the agent to optimize the selection of actions based on previous experiences (of interacting with the environment) is more efficient to maximize long-term rewards. 

The untrained agent starts with random actions based on a randomly initialized policy. This policy is typically represented as a neural network. During training, the agent learns the optimal policy that maximizes the rewards. In RL, the training process is also called policy optimization. 

There are various methods of policy optimization. The Bellman equations describe how to calculate the value of RL policies and determine the optimal policy. In this tutorial, we’ll use a simple technique called policy gradients. Other methods exist, such as Proximal Policy Optimization (PPO). 

In [54]:
nn = torch.nn
optim = torch.optim
relu = nn.ReLU()

In [72]:
class PolicyNetwork(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, dropout):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)
        self.layer2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
    def forward(self, x):
        x = self.layer1(x)
        x = self.dropout(x)
        x = relu(x)
        x = self.layer2(x)
        return x

In [73]:
def calculate_stepwise_returns(rewards, discount_factor):
    returns = []
    R = 0
    for r in reversed(rewards):
        R = r + R * discount_factor
        returns.insert(0, R)
    returns = torch.tensor(returns)
    normalized_returns = (returns - returns.mean()) / returns.std()
    return normalized_returns

In [74]:
def forward_pass(env, policy, discount_factor):
    log_prob_actions = []
    rewards = []
    done = False
    episode_return = 0
    policy.train()
    observation, info = env.reset()
    while not done:
        observation = torch.FloatTensor(observation).unsqueeze(0)
        action_pred = policy(observation)
        action_prob = F.softmax(action_pred, dim = -1)
        dist = torch.distributions.categorical.Categorical(logits=action_prob)
        action = dist.sample()
        log_prob_action = dist.log_prob(action)
        observation, reward, terminated, truncated, info = env.step(action.item())
        done = terminated or truncated
        log_prob_actions.append(log_prob_action)
        rewards.append(reward)
        episode_return += reward
    log_prob_actions = torch.cat(log_prob_actions)
    stepwise_returns = calculate_stepwise_returns(rewards, discount_factor)
    return episode_return, stepwise_returns, log_prob_actions
    

In [75]:
def calculate_loss(stepwise_returns, log_prob_actions):
    loss = -(stepwise_returns * log_prob_actions).sum()
    return loss

In [78]:
def update_policy(stepwise_returns, log_prob_actions, optimizer):
    stepwise_returns = stepwise_returns.detach()
    loss = calculate_loss(stepwise_returns, log_prob_actions)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In [79]:
def main(): 
    MAX_EPOCHS = 500
    DISCOUNT_FACTOR = 0.99
    N_TRIALS = 25
    REWARD_THRESHOLD = 475
    PRINT_INTERVAL = 10
    INPUT_DIM = env.observation_space.shape[0]
    HIDDEN_DIM = 128
    OUTPUT_DIM = env.action_space.n
    DROPOUT = 0.5
    episode_returns = []
    policy = PolicyNetwork(INPUT_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT)
    LEARNING_RATE = 0.01
    optimizer = optim.Adam(policy.parameters(), lr = LEARNING_RATE)
    for episode in range(1, MAX_EPOCHS+1):
        episode_return, stepwise_returns, log_prob_actions = forward_pass(env, policy, DISCOUNT_FACTOR)
        _ = update_policy(stepwise_returns, log_prob_actions, optimizer)
        episode_returns.append(episode_return)
        mean_episode_return = np.mean(episode_returns[-N_TRIALS:])
        if episode % PRINT_INTERVAL == 0:
            print(f'| Episode: {episode:3} | Mean Rewards: {mean_episode_return:5.1f} |')
        if mean_episode_return >= REWARD_THRESHOLD:
            print(f'Reached reward threshold in {episode} episodes')
            break

In [80]:
main()

| Episode:  10 | Mean Rewards:  25.3 |
| Episode:  20 | Mean Rewards:  27.4 |
| Episode:  30 | Mean Rewards:  38.2 |
| Episode:  40 | Mean Rewards:  43.3 |
| Episode:  50 | Mean Rewards:  49.5 |
| Episode:  60 | Mean Rewards:  48.3 |
| Episode:  70 | Mean Rewards:  69.2 |
| Episode:  80 | Mean Rewards:  98.8 |
| Episode:  90 | Mean Rewards:  81.3 |
| Episode: 100 | Mean Rewards:  81.5 |
| Episode: 110 | Mean Rewards:  68.2 |
| Episode: 120 | Mean Rewards:  70.9 |
| Episode: 130 | Mean Rewards:  77.5 |
| Episode: 140 | Mean Rewards: 102.1 |
| Episode: 150 | Mean Rewards: 119.6 |
| Episode: 160 | Mean Rewards: 104.0 |
| Episode: 170 | Mean Rewards: 102.1 |
| Episode: 180 | Mean Rewards:  94.9 |
| Episode: 190 | Mean Rewards: 108.7 |
| Episode: 200 | Mean Rewards: 100.7 |
| Episode: 210 | Mean Rewards:  83.6 |
| Episode: 220 | Mean Rewards:  82.7 |
| Episode: 230 | Mean Rewards:  80.9 |
| Episode: 240 | Mean Rewards: 109.7 |
| Episode: 250 | Mean Rewards: 121.7 |
| Episode: 260 | Mean Rew

In [None]:
env = gym.make(‘CartPole-v1’, render_mode=’human’)

In [None]:
while not done:
    …
   step, reward, terminated, truncated, info = env.step(action.item())
   env.render()
    …

source1: https://app.datacamp.com/learn/tutorials/reinforcement-learning-with-gymnasium?registration_source=google_onetap