# **CIS 6200 Spring 2024 Homework 7**


**Coding: Off-Policy Reinforcement Learning**

1. Reference: Stable Baselines3

2. Coding
  * Initialize Environment: Lunar Lander
    * Initialize the environment
    * Implement or import an appropriate evaluation function for a trained policy
    * Implement a function to generate video from the trained policy
  * DDPG
    * Implement the DDPG algorithm using the library
    * Try at least 2 policies (see the documentation), train them, and report differences in performance
  * SAC
    * Implement the SAC algorithm using the library
    * Try at least 2 policies (see the documentation), train them, and report differences in performance

3. Discussion
  * Environment
    * What is the action and observation space? Are they discrete or continuous?
    * Briefly describe how the rewards are assigned.
  * DDPG
    * Which policies did you try for DDPG? What are their differences in performance (mean_rewards, time efficiency, etc.)?
    * What is the hyperparameter action_noise? What value did you use?
    * What are the hyperparameters buffer_size and replay_buffer_class? What values did you use?
  * SAC
    * Which policies did you try for SAC? What are their differences in performance (mean_rewards, time efficiency, etc.)?
    * How do they compare with DDPG?
    * What is the hyperparameter ent_coef? What value did you use?





**Note: Answers to the questions must be submitted in the corresponding PDF submission, along with this coding submission, on Gradescope.**

## Imports and Setups

In [1]:
!apt-get update && apt-get install ffmpeg freeglut3-dev xvfb  # For visualization
!apt-get update && apt-get install swig cmake
!pip install box2d-py
!pip install "stable-baselines3[extra]>=2.0.0a4"
!pip install renderlab


In [2]:
import gymnasium as gym
import sys
import stable_baselines3
from stable_baselines3.common.evaluation import evaluate_policy
import numpy as np

print(f"{stable_baselines3.__version__=}")
print(f"{gym.__version__=}")

stable_baselines3.__version__='2.3.0a4'
gym.__version__='0.29.1'


In [3]:
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
import renderlab as rl




In [29]:
env = gym.make(
    "LunarLander-v2",
    continuous=True,  # default is False; True changes the action space from Discrete to Box
    gravity=-10.0,
    enable_wind=False,
    wind_power=15.0,
    turbulence_power=1.5,
    render_mode="rgb_array",
)

In [30]:
env = rl.RenderFrame(env, "./output")
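
The two noise classes imported above differ in how consecutive samples relate: Gaussian noise is drawn independently at every step, while Ornstein-Uhlenbeck noise drifts smoothly because each sample depends on the previous one. The following is a minimal pure-Python sketch of both ideas, not the Stable Baselines3 implementation. The class names `GaussianNoise` and `OUNoise`, the helper `clip_action`, and the defaults `theta=0.15` and `dt=1e-2` are assumptions chosen for illustration.

```python
import math
import random


class GaussianNoise:
    """Independent N(mean, sigma^2) sample per action dimension at every step."""

    def __init__(self, n_actions, sigma=0.1, mean=0.0, seed=0):
        self.n_actions, self.sigma, self.mean = n_actions, sigma, mean
        self.rng = random.Random(seed)

    def __call__(self):
        return [self.rng.gauss(self.mean, self.sigma) for _ in range(self.n_actions)]


class OUNoise:
    """Temporally correlated noise from a discretized Ornstein-Uhlenbeck process."""

    def __init__(self, n_actions, sigma=0.1, theta=0.15, dt=1e-2, seed=0):
        self.sigma, self.theta, self.dt = sigma, theta, dt
        self.rng = random.Random(seed)
        self.state = [0.0] * n_actions  # the process reverts toward zero

    def __call__(self):
        # x <- x - theta * x * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x - self.theta * x * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def clip_action(action, noise, low=-1.0, high=1.0):
    """Add exploration noise to a deterministic action and clip to the bounds."""
    return [min(high, max(low, a + n)) for a, n in zip(action, noise)]


gaussian = GaussianNoise(n_actions=2, sigma=0.1)
ou = OUNoise(n_actions=2, sigma=0.1)
noisy_action = clip_action([0.95, -0.3], gaussian())
```

The noise is only added while collecting training data; evaluation with `deterministic=True` uses the raw policy output.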

In [31]:
# Wrappers forward action_space, so there is no need to unwrap the env manually
n_actions = env.action_space.shape[-1]
# Gaussian exploration noise added to the deterministic DDPG actions during training
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = DDPG(
    "MlpPolicy",
    env,
    action_noise=action_noise,
    verbose=1,
    buffer_size=200000,
    learning_starts=10000,
    gamma=0.98,
    policy_kwargs={"net_arch": [400, 300]},
)
model.learn(total_timesteps=300000, log_interval=10)
model.save("ddpg_lunar_lander")


Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 120      |
|    ep_rew_mean     | -147     |
| time/              |          |
|    episodes        | 10       |
|    fps             | 200      |
|    time_elapsed    | 6        |
|    total_timesteps | 1204     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 111      |
|    ep_rew_mean     | -168     |
| time/              |          |
|    episodes        | 20       |
|    fps             | 185      |
|    time_elapsed    | 11       |
|    total_timesteps | 2211     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 110      |
|    ep_rew_mean     | -210     |
| time/              |          |
|    episodes        | 30       |
|    fps            
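
The DDPG model above is created with `buffer_size=200000`, which caps how many past transitions are kept for off-policy updates. Below is a rough sketch of that idea as a fixed-capacity buffer with uniform sampling. It is an illustration only: the class `ReplayBuffer` here is a hypothetical stand-in written for this note, not the Stable Baselines3 class of the same name.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, buffer_size, seed=0):
        # deque(maxlen=...) drops the oldest item once capacity is reached
        self.storage = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        self.storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(buffer_size=3)
for t in range(5):
    # Toy transitions: the observation is just the step index
    buffer.add(obs=t, action=0.0, reward=1.0, next_obs=t + 1, done=False)

batch = buffer.sample(batch_size=2)
```

After five insertions into a buffer of size three, only the transitions with `obs` 2, 3, and 4 remain. A larger buffer keeps older experience available for longer at the cost of memory.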

In [34]:
observation, info = env.reset()

# Roll out one episode with the trained policy; RenderFrame records the frames
while True:
    action, _states = model.predict(observation, deterministic=True)
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        break

# Assemble the recorded frames into a video and display it
env.play()

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4





Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


In [35]:
# Separate env for evaluation
eval_env = gym.make("LunarLander-v2",
    continuous= True,
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode = "rgb_array"
)


# Evaluate the trained DDPG agent
mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")




mean_reward=258.65 +/- 24.02276432414263
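
The reported numbers are the mean and standard deviation of the undiscounted return over the evaluation episodes. As a rough sketch of what such an evaluation function does, the loop below runs a policy for a fixed number of episodes and summarizes the returns. The `CountdownEnv` class and the `evaluate` function are hypothetical names introduced here so the example runs without Gymnasium; only the `reset`/`step` return shapes follow the Gymnasium convention used in this notebook.

```python
import statistics


class CountdownEnv:
    """Hypothetical stand-in env: ends after `horizon` steps, reward equals the action."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        # (observation, reward, terminated, truncated, info)
        return float(self.t), float(action), terminated, False, {}


def evaluate(policy, env, n_eval_episodes=10):
    """Return the mean and population std of the per-episode returns."""
    returns = []
    for _ in range(n_eval_episodes):
        obs, _info = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _info = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return statistics.fmean(returns), statistics.pstdev(returns)


# A constant policy that always outputs 1.0 earns a return of 5.0 per episode
mean_reward, std_reward = evaluate(lambda obs: 1.0, CountdownEnv(horizon=5), n_eval_episodes=3)
```

With only 10 evaluation episodes, the standard deviation is worth reporting alongside the mean, since a few crashed landings can shift the average substantially.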


## SAC

In [36]:
from stable_baselines3 import SAC
env = gym.make("LunarLander-v2",
    continuous= True,
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode = "rgb_array"
)

In [37]:
env = rl.RenderFrame(env, "./output")
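
The SAC model in this notebook is created with `ent_coef='auto'`, meaning the entropy coefficient is learned rather than fixed. A simplified sketch of that tuning rule follows: the coefficient is pushed down when the policy is more random than a target entropy and pushed up when it is less random. This sketch assumes a target entropy of minus the action dimension, an initial coefficient of 1.0, and plain gradient descent in place of the Adam optimizer a real implementation would typically use. The function name `update_log_ent_coef` and the `log_probs` batch are made up for illustration.

```python
import math


def update_log_ent_coef(log_ent_coef, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on loss = -log_alpha * mean(log_prob + target_entropy)."""
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    loss = -log_ent_coef * mean_term
    grad = -mean_term  # d(loss) / d(log_alpha)
    return log_ent_coef - lr * grad, loss


n_actions = 2                      # continuous LunarLander has a 2-dimensional action
target_entropy = -float(n_actions)  # assumed heuristic: -dim(action space)
log_ent_coef = math.log(1.0)        # assumed initial coefficient of 1.0

# Hypothetical batch of action log-probabilities: entropy is about 1.0,
# which is above the target of -2.0, so the coefficient should shrink.
log_probs = [-1.0, -0.8, -1.2]
for _ in range(100):
    log_ent_coef, ent_coef_loss = update_log_ent_coef(log_ent_coef, log_probs, target_entropy)

ent_coef = math.exp(log_ent_coef)
```

Optimizing the logarithm of the coefficient keeps the coefficient itself positive. The `ent_coef` column in the SAC training log shows the current value of this learned quantity.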

In [38]:
model = SAC("MlpPolicy", env, verbose=1, ent_coef = 'auto', policy_kwargs={"net_arch": [400, 300]})
model.learn(total_timesteps=100000, log_interval=4)
model.save("sac_lunar_lander")

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 91.5     |
|    ep_rew_mean     | -283     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 59       |
|    time_elapsed    | 6        |
|    total_timesteps | 366      |
| train/             |          |
|    actor_loss      | 1.9      |
|    critic_loss     | 85       |
|    ent_coef        | 0.928    |
|    ent_coef_loss   | -0.206   |
|    learning_rate   | 0.0003   |
|    n_updates       | 265      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 100      |
|    ep_rew_mean     | -261     |
| time/              |          |
|    episodes        | 8        |
|    fps             | 59       |
|    time_elapsed    | 13       |
|    total_timesteps | 801      |
| train/             

In [43]:
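obs, info = env.reset()

# Roll out one episode with the trained policy; RenderFrame records the frames
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

# Assemble the recorded frames into a video and display it
env.play()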

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4





Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


In [40]:
eval_env = gym.make("LunarLander-v2",
    continuous= True,
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode = "rgb_array"
)

mean_reward, std_reward = evaluate_policy(
    model,
    eval_env,
    n_eval_episodes=10,
    deterministic=True,
)

print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")

mean_reward=174.17 +/- 117.07984443747523
