In [24]:
%load_ext autoreload
%autoreload 2

import numpy as np
import torch
import torch.optim as optim

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Teaching a quadruped to walk

Time to try out the learning algorithms that you just implemented on a more difficult problem. The `WalkerEnv` environment implements a simple quadruped robot; run it and see for yourself. The goal is to move in the $x$ direction as fast and as far as possible.

Your goal is to implement a class `WalkerPolicy` with a `determine_actions()` method, just like the `StochasticPolicy` we used earlier to control the pendulum. Below is a template of this class, but feel free to alter it however you want. The only important thing is the `determine_actions()` function!

After you implement it, copy `WalkerPolicy` into a separate file `WalkerPolicy.py` that you will upload to BRUTE together with the (optional) learned weights in a zip file. How the policy is implemented is up to you! You are, however, constrained to the libraries we have used so far, such as torch and numpy.

You will get some free points just for uploading a working policy (regardless of its performance). A further 2 points will be awarded for successfully traversing a small distance in the $x$ direction.


# Hints

There is no single easy way of doing this, but here are some suggestions on what you could try to improve your policy:

1. This problem is much more difficult than balancing a pendulum. It is a good idea to use a somewhat larger network than for the pendulum policy.

2. You can also try a different optimizer, such as Adam, and tune its hyperparameters.

3. Using a neural network to compute the normal distribution scale $\sigma$ can lead to too much randomness in the actions (i.e. too much exploration). You can use a fixed $\sigma$ instead, or replace it with a learnable `torch.nn.Parameter` initialized to some small constant. Make sure you run it through an exponential or softplus function to ensure $\sigma$ is positive.

4. The exploration can also be reduced by penalizing the variance of the action distribution in an additional loss term.

5. If you see some undesirable behaviour, you can tweak the reward function to penalize it. Even though the $x$ distance is all we care about, adding extra terms to the reward can help guide the learning process (this is known as reward shaping). Simply define a reward function mapping the state $s_{t+1}$ and action $a_t$ to a scalar reward $r_t$ and put it in the config dictionary under the key `'reward_fcn'`. See the `WalkerEnv` class for the implementation of the default reward.

6. Using the normal distribution on a bounded action space can lead to certain problems caused by action clipping. This can be mitigated by using a different distribution, such as the Beta distribution. See the `torch.distributions.beta` module for more information. (Note that the Beta distribution is defined on the interval $[0, 1]$ and works better with parameters $\alpha, \beta \geq 1$.)
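As an illustration of hint 6, here is a minimal sketch of a policy head that outputs Beta distributions. The class name `BetaPolicyHead`, the hidden size, and the action bounds `low=-1.0`, `high=1.0` are assumptions made for this example; check `WalkerEnv` for the actual action range before using anything like it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicyHead(nn.Module):
    """Hypothetical head mapping states to Beta distributions over bounded actions."""

    def __init__(self, state_dim=29, action_dim=8, hidden=64, low=-1.0, high=1.0):
        super().__init__()
        self.low, self.high = low, high  # ASSUMED action bounds, not taken from WalkerEnv
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.alpha_head = nn.Linear(hidden, action_dim)
        self.beta_head = nn.Linear(hidden, action_dim)

    def distribution(self, states):
        h = self.body(states)
        # softplus is non-negative, so adding 1 keeps alpha and beta at least 1
        alpha = F.softplus(self.alpha_head(h)) + 1.0
        beta = F.softplus(self.beta_head(h)) + 1.0
        return Beta(alpha, beta)

    def sample(self, states):
        dist = self.distribution(states)
        u = dist.sample()                     # values in [0, 1], shape (N, action_dim)
        log_prob = dist.log_prob(u).sum(-1)   # joint log-probability, shape (N,)
        actions = self.low + (self.high - self.low) * u  # rescale to [low, high]
        return actions, log_prob


head = BetaPolicyHead()
actions, log_prob = head.sample(torch.zeros(4, 29))
print(actions.shape, log_prob.shape)
```

The log-probability is evaluated on the unscaled sample `u`. Rescaling to `[low, high]` only shifts it by a constant, which does not change the policy gradient.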


In [25]:
# If you cannot run with the visualization, you can set this to False
VISUALIZE = True

In [26]:
from environment.WalkerEnv import WalkerEnv
from WalkerPolicy import WalkerPolicy
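
The contents of `WalkerPolicy.py` are not shown in this notebook. The following is only a sketch of what a compatible policy might look like, following hints 1 and 3: a somewhat larger mean network and a learnable, state-independent $\sigma$ passed through softplus. The constructor arguments and the `sample_actions_and_log_prob()` method match how the training cell calls the policy; the signature of `determine_actions()`, the hidden size, `init_sigma`, and the final `Tanh` (which assumes actions in $[-1, 1]$) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal


class WalkerPolicy(nn.Module):
    """Sketch of a Gaussian policy with a learnable, state-independent sigma."""

    def __init__(self, state_dim=29, action_dim=8, hidden=128, init_sigma=0.1):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # ASSUMES actions lie in [-1, 1]
        )
        # Inverse softplus, so that softplus(raw_sigma) == init_sigma at initialization
        raw = torch.log(torch.expm1(torch.tensor(init_sigma)))
        self.raw_sigma = nn.Parameter(raw * torch.ones(action_dim))

    def _dist(self, states):
        sigma = F.softplus(self.raw_sigma)  # always positive
        return Normal(self.mean_net(states), sigma)

    def sample_actions_and_log_prob(self, states):
        dist = self._dist(states)
        actions = dist.sample()  # sampling itself is not differentiated
        log_prob = dist.log_prob(actions).sum(dim=-1, keepdim=True)  # shape (N, 1)
        return actions, log_prob

    def determine_actions(self, states):
        # ASSUMED signature: tensor of states in, tensor of actions out
        with torch.no_grad():
            return self._dist(states).sample()
```

Keeping $\sigma$ as a single parameter vector rather than a network output means exploration noise changes slowly and is the same in every state, which is the behaviour hint 3 is after.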

In [27]:
# def walker_reward(state, action):
#     """reward function for the walker environment, state is [29] vector, action is [8] vector"""
#     pos = state[:15]  # first 15 elements of state vector are generalized coordinates [xyz, quat, joint_angles]
#     vel = state[15:]  # last 14 elements of state vector are generalized velocities [xyz_vel, omega, joint_velocities]
#     return vel[0]  # return the x velocity as the reward by default
# def walker_reward(state, action):
#     # Reward = x velocity
#     vel = state[0, 15:]  # velocity is last part of state
#     return vel[0]  
def walker_reward(state, action):
    pos = state[:, :15]  # first 15 elements => generalized coordinates
    vel = state[:, 15:]  # last 14 elements => generalized velocities
    return vel[:, 0] 
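
Hint 5 suggests shaping the reward. A possible shaped variant is sketched below, assuming the state layout described in the comments above (15 generalized coordinates followed by 14 generalized velocities) and assuming the environment passes NumPy arrays. The function name `shaped_walker_reward` and the weights `w_side`, `w_spin`, `w_ctrl` are made up for illustration and would need tuning.

```python
import numpy as np


def shaped_walker_reward(state, action, w_side=0.5, w_spin=0.1, w_ctrl=0.01):
    """Forward velocity minus penalties; works for a (29,) state or an (N, 29) batch."""
    state = np.asarray(state, dtype=np.float64)
    action = np.asarray(action, dtype=np.float64)
    vel = state[..., 15:]                        # [xyz_vel (3), omega (3), joint_velocities (8)]
    forward = vel[..., 0]                        # x velocity, the quantity we care about
    sideways = np.abs(vel[..., 1])               # drifting in y
    spin = np.sum(vel[..., 3:6] ** 2, axis=-1)   # body angular velocity
    effort = np.sum(action ** 2, axis=-1)        # large actions
    return forward - w_side * sideways - w_spin * spin - w_ctrl * effort


state = np.zeros(29)
state[15], state[16] = 2.0, 1.0   # x velocity 2, y velocity 1
print(shaped_walker_reward(state, np.ones(8)))
```

Passing it to the environment would then be a matter of setting `'reward_fcn': shaped_walker_reward` in the config dictionary.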

In [28]:
# This is the configuration for the Walker environment
# N is the number of robots controlled in parallel
# vis is a boolean flag to enable visualization
# !! IMPORTANT track is a boolean flag to enable camera tracking of a particular robot, this is useful when evaluating the performance of the policy after training
# reward_fcn is the reward function that the environment will use to calculate the reward
config = {'N': 1, 'vis': VISUALIZE, "track": 0, "reward_fcn": walker_reward}
env = WalkerEnv(config)

Environment ready


In [35]:
import numpy as np
import torch
import torch.optim as optim
from environment.WalkerEnv import WalkerEnv
from WalkerPolicy import WalkerPolicy

###############################################################################
# Simple reward function for the Walker
###############################################################################
def walker_reward(state, action):
    # Here, 'state' is shape (29,) => 1D array
    pos = state[:15]  # first 15 elements => generalized coordinates
    vel = state[15:]  # last 14 elements => generalized velocities
    return vel[0]     # x velocity

###############################################################################
# Environment configuration
###############################################################################
config = {
    'N': 1,
    'vis': VISUALIZE,   # set True if you want to see the robot
    'track': 0,
    'reward_fcn': walker_reward
}

env = WalkerEnv(config)

###############################################################################
# Policy and optimizer
###############################################################################
policy = WalkerPolicy(state_dim=29, action_dim=8)
optimizer = optim.Adam(policy.parameters(), lr=3e-4)

###############################################################################
# Training hyperparameters
###############################################################################
num_episodes = 100
max_episode_steps = 200
gamma = 0.99

###############################################################################
# Training loop
###############################################################################
for episode in range(num_episodes):
    s = env.vector_reset()  # initial state, shape (1, 29) for N = 1
    states, actions, rewards, logps = [], [], [], []

    for t in range(max_episode_steps):
        # Convert state to torch
        s_torch = torch.tensor(s, dtype=torch.float32, requires_grad=False)  # shape (1, 29)

        # Forward pass through policy
        a_torch, log_prob = policy.sample_actions_and_log_prob(s_torch)  # Modify your policy to return both action and log-prob

        # Convert to numpy to step the environment
        a_np = a_torch.detach().cpu().numpy()

        # Step environment
        s_next, r = env.vector_step(a_np)

        # Store experience
        states.append(s_torch)
        actions.append(a_torch)
        rewards.append(torch.tensor(r, dtype=torch.float32))
        logps.append(log_prob)  # Append log-probabilities

        s = s_next

    # Discounted returns
    rews = torch.cat(rewards, dim=0)  # shape (T,1)
    T = rews.shape[0]
    returns = torch.zeros_like(rews)
    running_sum = 0.0

    for i in reversed(range(T)):
        running_sum = rews[i] + gamma * running_sum
        returns[i] = running_sum

    # Convert everything into a single batch
    all_states = torch.cat(states, dim=0)    # (T, 29)
    all_actions = torch.cat(actions, dim=0)  # (T, 8)
    all_logps = torch.cat(logps, dim=0)      # (T, 1)
    all_returns = returns                    # (T, 1)

    # REINFORCE policy-gradient loss: log-probabilities weighted by discounted returns
    pg_loss = - (all_logps * all_returns).mean()

    optimizer.zero_grad()
    pg_loss.backward()
    optimizer.step()

    # Print the discounted return from the first time step of the episode
    ep_return = all_returns[0].item()
    print(f"Episode {episode+1}/{num_episodes}, Return={ep_return:.3f}")

env.close()
print("Done training!")


Environment ready
Episode 1/100, Return=-2.038
Episode 2/100, Return=-1.899
Episode 3/100, Return=1.107
Episode 4/100, Return=-1.725
Episode 5/100, Return=1.546
Episode 6/100, Return=-2.912
Episode 7/100, Return=-1.873
Episode 8/100, Return=-2.408
Episode 9/100, Return=-2.331
Episode 10/100, Return=-0.821
Episode 11/100, Return=-3.984
Episode 12/100, Return=0.689
Episode 13/100, Return=1.126
Episode 14/100, Return=2.691
Episode 15/100, Return=-2.599
Episode 16/100, Return=-3.099
Episode 17/100, Return=-0.099
Episode 18/100, Return=0.587
Episode 19/100, Return=1.409
Episode 20/100, Return=-1.312
Episode 21/100, Return=-4.106
Episode 22/100, Return=4.475
Episode 23/100, Return=4.244
Episode 24/100, Return=0.888
Episode 25/100, Return=-1.132
Episode 26/100, Return=-0.909
Episode 27/100, Return=-2.858
Episode 28/100, Return=0.402
Episode 29/100, Return=-2.178
Episode 30/100, Return=0.184
Episode 31/100, Return=2.474
Episode 32/100, Return=-0.562
Episode 33/100, Return=0.241
Episode 34/100,

c:\Users\bogda\Deep Reinforcement Learning\hw5-rl-main\rl-homework-venv\lib\site-packages\glfw\__init__.py:917: GLFWError: (65537) b'The GLFW library is not initialized'


KeyboardInterrupt: 
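
The discounted-return computation used in the training loop can be checked in isolation. The helper below (the name `discounted_returns` is hypothetical) computes the reward-to-go $G_t = r_t + \gamma G_{t+1}$ backwards in time, with a small worked example.

```python
import torch


def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[0])
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = torch.tensor([1.0, 0.0, 2.0])
# G_2 = 2, G_1 = 0 + 0.5 * 2 = 1, G_0 = 1 + 0.5 * 1 = 1.5
print(discounted_returns(rewards, gamma=0.5))  # tensor([1.5000, 1.0000, 2.0000])
```

When the returns are multiplied with log-probabilities of shape `(T, 1)`, it is worth confirming that the returns have the same shape: a `(T,)` tensor would broadcast against `(T, 1)` to `(T, T)` and silently change the loss.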