## Compare the n-step advantage (n-step return, as covered in class) with the vanilla advantage, GAE, and the MC advantage for the A2C algorithm

Hint: SB3 implements Generalized Advantage Estimation (GAE) for A2C. In particular, you can find the implementation of the advantage in the *compute_returns_and_advantage* method in *buffers.py* (stable-baselines3/stable_baselines3/common/buffers.py) (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/buffers.py). You can also vary the hyperparameter *gae_lambda* to get different advantage estimators without changing the model/algorithm implementation code.

[Requirements]:
- Compare the n-step advantage with the (vanilla) advantage, the MC advantage, and GAE. The MC advantage is optional for this assignment.
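To make the role of *gae_lambda* concrete, here is a minimal pure-Python sketch of the backward GAE recursion. It is a simplified illustration, not SB3's code: the function name `gae_advantages`, the toy rewards and values, and the assumption that no episode ends inside the segment are all choices made for this example.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=1.0):
    """Backward GAE recursion over one rollout segment.

    Assumes (for simplicity) that no episode terminates inside the segment.
    values[t] is the critic's estimate for the state at step t, and
    last_value is its estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at step t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD residuals
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Toy numbers, chosen only for illustration
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
last_value = 0.2
gamma = 0.9

# lam=0: every entry is the one-step TD residual (the "vanilla" advantage)
one_step = gae_advantages(rewards, values, last_value, gamma, lam=0.0)
# lam=1: entry t is the discounted reward sum to the end of the segment,
# plus the bootstrapped last_value, minus values[t]
n_step = gae_advantages(rewards, values, last_value, gamma, lam=1.0)

print("lam=0:", [round(a, 4) for a in one_step])
print("lam=1:", [round(a, 4) for a in n_step])
```

With `lam=0.0` the first entry equals `1 + 0.9 * 0.4 - 0.5`, the TD residual. With `lam=1.0` it equals `1 + 0.9 + 0.81 + 0.729 * 0.2 - 0.5`, the 3-step return minus the baseline. Values of `lam` in between interpolate between the two.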


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd '/content/drive/MyDrive/MIDS/AIPI_531/HW_1/stable-baselines3'
!pip install -e .[docs,tests,extra]
import stable_baselines3
print(f"{stable_baselines3.__version__=}")

/content/drive/MyDrive/MIDS/AIPI_531/HW_1/stable-baselines3
Obtaining file:///content/drive/MyDrive/MIDS/AIPI_531/HW_1/stable-baselines3
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting gymnasium<0.30,>=0.28.1 (from stable-baselines3==2.2.0a3)
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting pytest-cov (from stable-baselines3==2.2.0a3)
  Downloading pytest_cov-4.1.0-py3-none-any.whl (21 kB)
Collecting pytest-env (from stable-baselines3==2.2.0a3)
  Downloading pytest_env-1.0.1-py3-none-any.whl (5.3 kB)
Collecting pytest-xdist (from stable-baselines3==2.2.0a3)
  Downloading pytest_xdist-3.3.1-py3-none-any.whl (41 kB)


stable_baselines3.__version__='2.2.0a3'


In [3]:
import gymnasium as gym
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import DummyVecEnv, VecVideoRecorder
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import numpy as np
import os
os.chdir('/content/drive/MyDrive/MIDS/AIPI_531/HW_1')

In [4]:
# create training and evaluation environment
env = gym.make("CartPole-v1")
eval_env = gym.make("CartPole-v1")

## 1. Vanilla advantage:

In [6]:
# initialize the agent: with n_steps=1 each update uses the one-step TD advantage
vanilla_model = A2C("MlpPolicy", env, verbose=1,
                    gae_lambda=0, n_steps=1,
                    tensorboard_log="./vanilla_CartPole_v1_tensorboard/")


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [7]:
# train the model
vanilla_model.learn(total_timesteps=10000)

Logging to ./vanilla_CartPole_v1_tensorboard/A2C_8
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 13.3     |
|    ep_rew_mean        | 13.3     |
| time/                 |          |
|    fps                | 113      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 100      |
| train/                |          |
|    entropy_loss       | -0.069   |
|    explained_variance | nan      |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 0.0139   |
|    value_loss         | 1.14     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 11       |
|    ep_rew_mean        | 11       |
| time/                 |          |
|    fps                | 116      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    

<stable_baselines3.a2c.a2c.A2C at 0x7fdaa360a4a0>

In [8]:
# evaluate the model
vanilla_mean_reward, vanilla_std_reward = evaluate_policy(vanilla_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{vanilla_mean_reward:.2f} +/- {vanilla_std_reward:.2f}")



mean_reward:9.35 +/- 0.74


## 2. n-step advantage

In [9]:
# initialize the agent: gae_lambda is left at the A2C default of 1.0
n_step_model = A2C("MlpPolicy", env, verbose=1,
                   n_steps=20,
                   tensorboard_log="./n_step_CartPole_v1_tensorboard/")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [10]:
# train the model
n_step_model.learn(total_timesteps=10000)

Logging to ./n_step_CartPole_v1_tensorboard/A2C_15
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 28.3     |
|    ep_rew_mean        | 28.3     |
| time/                 |          |
|    fps                | 909      |
|    iterations         | 100      |
|    time_elapsed       | 2        |
|    total_timesteps    | 2000     |
| train/                |          |
|    entropy_loss       | -0.661   |
|    explained_variance | 0.0943   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 5.63     |
|    value_loss         | 90.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 37.3     |
|    ep_rew_mean        | 37.3     |
| time/                 |          |
|    fps                | 916      |
|    iterations         | 200      |
|    time_elapsed       | 4        |
|    total_timesteps    

<stable_baselines3.a2c.a2c.A2C at 0x7fdaa360bf10>

In [11]:
# evaluate the model
n_step_mean_reward, n_step_std_reward = evaluate_policy(n_step_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{n_step_mean_reward:.2f} +/- {n_step_std_reward:.2f}")

mean_reward:397.01 +/- 107.21


Increasing the number of steps:

In [12]:
# initialize the agent: same setup as above with a longer rollout (n_steps=30)
n_step_model_2 = A2C("MlpPolicy", env, verbose=1,
                     n_steps=30,
                     tensorboard_log="./n_step_CartPole_v1_tensorboard/")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [13]:
# train the model
n_step_model_2.learn(total_timesteps=10000)

Logging to ./n_step_CartPole_v1_tensorboard/A2C_16
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 37.5     |
|    ep_rew_mean        | 37.5     |
| time/                 |          |
|    fps                | 989      |
|    iterations         | 100      |
|    time_elapsed       | 3        |
|    total_timesteps    | 3000     |
| train/                |          |
|    entropy_loss       | -0.615   |
|    explained_variance | 0.0405   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 7.44     |
|    value_loss         | 196      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 62.6     |
|    ep_rew_mean        | 62.6     |
| time/                 |          |
|    fps                | 1001     |
|    iterations         | 200      |
|    time_elapsed       | 5        |
|    total_timesteps    

<stable_baselines3.a2c.a2c.A2C at 0x7fdaa29004c0>

In [14]:
# evaluate the model
n_step_mean_reward_2, n_step_std_reward_2 = evaluate_policy(n_step_model_2, eval_env, n_eval_episodes=100)
print(f"mean_reward:{n_step_mean_reward_2:.2f} +/- {n_step_std_reward_2:.2f}")

mean_reward:460.60 +/- 60.34


## 3. GAE

In [15]:
# initialize the agent: n_steps is left at the A2C default of 5
GAE_model = A2C("MlpPolicy", env, verbose=1,
                gae_lambda=0.9,
                tensorboard_log="./GAE_CartPole_v1_tensorboard/")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [16]:
# train the model
GAE_model.learn(total_timesteps=10000)

Logging to ./GAE_CartPole_v1_tensorboard/A2C_5
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.2     |
|    ep_rew_mean        | 19.2     |
| time/                 |          |
|    fps                | 396      |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.693   |
|    explained_variance | -0.146   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | 1.74     |
|    value_loss         | 7.4      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 19.1     |
|    ep_rew_mean        | 19.1     |
| time/                 |          |
|    fps                | 412      |
|    iterations         | 200      |
|    time_elapsed       | 2        |
|    total_timesteps    | 10

<stable_baselines3.a2c.a2c.A2C at 0x7fdaa2900400>

In [17]:
# evaluate the model
GAE_mean_reward, GAE_std_reward = evaluate_policy(GAE_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{GAE_mean_reward:.2f} +/- {GAE_std_reward:.2f}")

mean_reward:249.10 +/- 52.49


## 4. MC advantage

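According to the [gym documentation](https://www.gymlibrary.dev/environments/classic_control/cart_pole/), an episode of CartPole-v1 is truncated after 500 steps. To let A2C estimate returns over a rollout as long as a full episode, n_steps is set to 500, with gae_lambda left at its default of 1. A 500-step rollout can still start or stop in the middle of an episode, so this only approximates a pure Monte Carlo estimate.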
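The sketch below illustrates how returns behave when one long rollout spans several episodes. It is a simplified pure-Python illustration, not SB3's implementation: the helper name `returns_to_go`, the toy inputs, and the convention that `dones[t]` marks the last step of an episode are assumptions made for this example.

```python
def returns_to_go(rewards, dones, last_value, gamma=0.99):
    """Discounted return for each step of a rollout that may span several episodes.

    dones[t] is True when the episode ended with the transition at step t.
    last_value is the critic's estimate for the state after the final step;
    it only matters if the rollout stops in the middle of an episode.
    """
    returns = [0.0] * len(rewards)
    running = last_value
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Nothing from later steps carries back across an episode boundary
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy rollout: the first episode ends at step 2, the second is cut off by the rollout
example = returns_to_go(
    rewards=[1.0, 1.0, 1.0, 1.0, 1.0],
    dones=[False, False, True, False, False],
    last_value=10.0,
    gamma=0.5,
)
print(example)  # [1.75, 1.5, 1.0, 4.0, 6.0]
```

The first three entries are pure Monte Carlo returns because their episode finishes inside the rollout. The last two still depend on `last_value`, which is why a long rollout alone does not guarantee a bootstrap-free estimate.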

In [22]:
# initialize the agent: a 500-step rollout with gae_lambda at the A2C default of 1.0
MC_model = A2C("MlpPolicy", env, verbose=1,
               n_steps=500,
               tensorboard_log="./MC_CartPole_v1_tensorboard/")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [23]:
# train the model
MC_model.learn(total_timesteps=10000)

Logging to ./MC_CartPole_v1_tensorboard/A2C_6


<stable_baselines3.a2c.a2c.A2C at 0x7fdaa2901f90>

In [24]:
# evaluate the model
MC_mean_reward, MC_std_reward = evaluate_policy(MC_model, eval_env, n_eval_episodes=100)
print(f"mean_reward:{MC_mean_reward:.2f} +/- {MC_std_reward:.2f}")

mean_reward:172.95 +/- 114.37


## Compare the 4 methods

In [21]:
print(f"Mean reward for vanilla advantage is {vanilla_mean_reward:.2f} +/- {vanilla_std_reward:.2f}")
print(f"Mean reward for n-step advantage with n=20 is {n_step_mean_reward:.2f} +/- {n_step_std_reward:.2f}")
print(f"Mean reward for n-step advantage with n=30 is {n_step_mean_reward_2:.2f} +/- {n_step_std_reward_2:.2f}")
print(f"Mean reward for GAE advantage is {GAE_mean_reward:.2f} +/- {GAE_std_reward:.2f}")
print(f"Mean reward for MC advantage is {MC_mean_reward:.2f} +/- {MC_std_reward:.2f}")

Mean reward for vanilla advantage is 9.35 +/- 0.74
Mean reward for n-step advantage with n=20 is 397.01 +/- 107.21
Mean reward for n-step advantage with n=30 is 460.60 +/- 60.34
Mean reward for GAE advantage is 249.10 +/- 52.49
Mean reward for MC advantage is 107.50 +/- 35.51


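* The vanilla (one-step) advantage is biased but has low variance: it uses a single observed reward and bootstraps from the value estimate, so little sampling noise accumulates. In this run it gives the lowest mean reward (9.35) with a small standard deviation (0.74). With n_steps=1 each update is also based on a single transition, which may contribute to the poor result.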

* The n-step return is the discounted sum of rewards over n timesteps from a given state, plus a bootstrapped value estimate for the state reached afterwards. It balances the immediate reward against future rewards. Compared with the vanilla advantage, using n steps leads to a much higher mean reward and a higher standard deviation. Increasing n tends to reduce bias, since more observed rewards are taken into account, but the variance may increase because the update depends on a longer sequence of observed rewards, which can be noisy. In these runs, increasing n from 20 to 30 raises the mean reward from 397.01 to 460.60, while the standard deviation falls from 107.21 to 60.34.

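* GAE reduces the variance of the advantage estimate without introducing too much bias, by taking an exponentially weighted average of multi-step advantage estimates controlled by gae_lambda. With gae_lambda=0.9 (and the default n_steps=5), the mean reward is much higher than with the vanilla advantage (249.10 vs 9.35), while the standard deviation (52.49) stays below that of both n-step runs.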

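* The Monte Carlo method uses sampled returns to estimate expectations. It is unbiased but can have high variance, since it involves sampling returns until the end of an episode and then computing the advantage from those full-episode returns. Consistent with this, the MC evaluation in cell In [24] shows a high standard deviation (172.95 +/- 114.37). The comparison printout above reports 107.50 +/- 35.51 for MC instead, apparently because that cell (In [21]) was executed before the MC cells shown here (In [22] to In [24]).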
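For reference, the four estimators discussed above can be written in the standard textbook form below. The notation is an assumption of this summary rather than something defined in the notebook: $r_t$ is the reward, $V$ the critic's value estimate, $\gamma$ the discount factor, $\lambda$ the *gae_lambda* parameter, and $T$ the time at which the episode ends.

```latex
\begin{align*}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t^{(1)} &= \delta_t
  && \text{(vanilla, one-step)} \\
\hat{A}_t^{(n)} &= \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}) - V(s_t)
  && \text{(n-step)} \\
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &= \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}
  && \text{(GAE)} \\
\hat{A}_t^{\mathrm{MC}} &= \sum_{k=0}^{T-t-1} \gamma^k r_{t+k} - V(s_t)
  && \text{(Monte Carlo)}
\end{align*}
```

Setting $\lambda = 0$ in the GAE sum recovers the one-step estimate. With $\lambda = 1$ the TD residuals telescope into a discounted reward sum minus $V(s_t)$, which is the Monte Carlo estimate when summed to the end of the episode, or an n-step estimate when the sum is cut off by the rollout and bootstrapped from the critic.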

