<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_12_4_atari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



```
# This is formatted as code
```

# Deep RL for 2D environments: Q-Learning, DQN, and PPO
* [Eugene Agichtein](https://www.cs.emory.edu/~eugene/) for CS325: Artificial Intelligence
* Adapted from [Jeff Heaton](https://sites.wustl.edu/jeffheaton/)

This is the starting code for training agents for the Box2D environment in Gymnasium:
https://gymnasium.farama.org/environments/box2d/


Lunar Lander example is used in the starter code. You will extend these to Car Racing and Bipedal Worker yourself.



# Google CoLab Setup

The following code setsup gymnasium in Google colab. do not modify these lines, but ok need to add additional dependencies if needed

In [1]:
from google.colab import drive
!pip install stable-baselines3[extra] gymnasium
!pip install gymnasium[accept-rom-license,atari]
!pip install pyvirtualdisplay
!sudo apt-get install -y python-opengl ffmpeg
!sudo apt-get install -y xvfb
!pip install swig
!pip install gymnasium[box2d]
!pip install moviepy

Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m182.3/182.3 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m953.9/953.9 kB[0m [31m32.7 MB/s[0m eta [36m0:00:00[0m
Collecting shimmy[atari]~=1.3.0 (from stable-baselines3[extra])
  Downloading Shimmy-1.3.0-py3-none-any.whl (37 kB)
Collecting autorom[accept-rom-license]~=0.6.1 (from stable-baselines3[extra])
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1->stable-baselines3[extra])
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)
[2K     [90m━━━━━━━━━━━━━━━━━

# Table-based Q-Learning for Box2D
Gymnasium: https://gymnasium.farama.org/ is a more general and realistic virtual universe with many environments, such as robotic control, video games, and 3-d physics.

Out of the box, Q-Learning does not deal with continuous inputs. Additionally, Q-Learning primarily deals with discrete actions, such as pressing a joystick up or down. First step is to adapt the example code from Mountain Car notebook provided to the Lunar Lander in Box2D environment.

## Introducing Box2D/Lunar Lander

This section will demonstrate how Q-Learning can create a solution to the Lunar Lander gym environment. The goal is to land a simple spaceship with 3 engines between 2 flags (landing area).

There are two versions of the environment, one without wind (easy / predictable) and with wind enabled (turbulent/windy environment when control is difficult). Lets suspend disbelief that there is wind on the moon. Our lander should be able to land on Mars too, where winds can be very powerful.

First, it might be helpful to visualize the Lunar Lander environment. The following code shows this environment with the wind enabled.

In [2]:
import base64
from IPython import display as ipythondisplay
from pathlib import Path
from gymnasium.wrappers import RecordVideo
import gymnasium as gym
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
import numpy as np


env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= True,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")

The LunarLander environment observations can be either discrete (simpler version) or continuous. Actions can also be continuous or discrete. See details here:
https://gymnasium.farama.org/environments/box2d/lunar_lander/

The goal is to learn which combination of engines to apply to safely land the spacecraft.


Lets see how the robot behaves without training.

In [4]:
env.metadata['render_fps'] = 30
# Reset the environment
env.reset()

# Setup the wrapper to record the video
video_callable=lambda episode_id: True
env = RecordVideo(env, video_folder='./videos_lander_qlearn', episode_trigger=video_callable)

# Run the environment until done

truncated = False
i=0
while not truncated:
  i+=1
  #action = np.array([np.random.uniform(0,1), np.random.uniform(-1,1)]) #all engines off. crash land/ fall down
  action =  np.random.randint(0, 3)
  state, reward, terminated, truncated , info = env.step(action)
  #uncomment below to see observations
  #print(f"Step {i}: State={state}, Reward={reward}, term={terminated}, trunc={truncated}, info={info}")

env.close()

# Display the video
video = io.open(glob.glob('videos_lander_qlearn/*.mp4')[0], 'r+b').read()
encoded = base64.b64encode(video)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))


  logger.warn(


Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-1.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-1.mp4
Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-0.mp4


#Table-based QLearning parameters
Several hyperparameters are very important for Q-Learning. These parameters will likely need adjustment as you apply Q-Learning to other problems. Because of this, it is crucial to understand the role of each parameter.

* **LEARNING_RATE** The rate at which previous Q-values are updated based on new episodes run during training.
* **DISCOUNT** The amount of significance to give estimates of future rewards when added to the reward for the current action taken. A value of 0.95 would indicate a discount of 5% on the future reward estimates.
* **EPISODES** The number of episodes to train over. Increase this for more complex problems; however, training time also increases.

In [5]:
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 3e4 #set to >=3e4 to ensure training works for this problem

OBSERVATION_DIM = 8
NUM_ACTIONS = 4
NUM_BINS = 4 #8 use 2 or 3 bits for each observation dimension

We lets create the discrete buckets for state and build Q-table.



In [6]:
env.reset()
import math

epsilon = 1
#epsilon_change = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)

print(env.observation_space.high)
print(env.observation_space.low)
# This function converts the floating point state values into
# discrete values. This is often called binning.  We divide
# the range that the state values might occupy and assign
# each region to a bucket.
#then we map the state to a single number between 0:numBins**obs_space
def discretizeLunarState(s, obs_space, numBins=4):
  highs = obs_space.high
  lows = obs_space.low

  discrete_state = []

  normalized = (min(5, max(-5, int((s[0]) / 0.05))), \
            min(5, max(-1, int((s[1]) / 0.1))), \
            min(3, max(-3, int((s[2]) / 0.1))), \
            min(3, max(-3, int((s[3]) / 0.1))), \
            min(3, max(-3, int((s[4]) / 0.1))), \
            min(3, max(-3, int((s[5]) / 0.1))), \
            int(s[6]), \
            int(s[7]))

  for i in [0,1,2,3,4,5]:
    bin = ( highs[i]-lows[i] ) / numBins
    val = int ( ( normalized[i] -  lows[i] ) / bin )
    discrete_state.append( val )

  discrete_state.append(int(s[6])) #boolean leg
  discrete_state.append(int(s[7])) #boolean leg

  shift = int( math.log2(NUM_BINS))

  state_key = 0
  for i in [0,1,2,3,4,5]:
    state_key = state_key << shift
    state_key += discrete_state[i]
  state_key<<1
  state_key+=discrete_state[6]
  state_key<<1
  state_key+=discrete_state[7]

  return state_key


obs = env.reset()
state = discretizeLunarState(obs[0], env.observation_space, NUM_BINS)
print(obs)
#so now the state is a tuple of discrete values, to be used as the key in Q(s,a) table.
print(state)


#set up qtable
#(num_states, num_actions)
q_table = np.zeros((NUM_BINS**8, NUM_ACTIONS)) #number of possible discrete states x number of actions
print(q_table.shape)



[1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ]
[-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ]
(array([-0.00682964,  1.398447  , -0.69243383, -0.55439603,  0.00850735,
        0.16831653,  0.        ,  0.        ], dtype=float32), {})
4106
(65536, 4)


Now lets setup Q-learning!

Q-Learning Implementation: Discretizing input and actions

In [8]:
# Fill in the missing code blocks for Q-Learning

# Inside the run_game function, implement Q-learning steps of epsilon-greedy action selection
# Use the q_table to select the action, either exploiting or exploring based on epsilon
# Update the Q-table according to the Q-learning update rule

def run_game(env, q_table, render, should_update, exploit=False):
    done = False
    discrete_state = discretizeLunarState(env.reset()[0], env.observation_space, NUM_BINS)
    success = False
    total_reward = 0
    while not done:
        if exploit or np.random.rand() > epsilon:
            action = np.argmax(q_table[discrete_state, :])
        else:
            action = np.random.randint(0, NUM_ACTIONS)

        new_state, reward, done, truncated, info = env.step(action)
        total_reward += reward
        new_state_disc = discretizeLunarState(new_state, env.observation_space, NUM_BINS)

        if should_update:
            max_future_q = np.max(q_table[new_state_disc])
            current_q = q_table[discrete_state, action]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state, action] = new_q

        discrete_state = new_state_disc

        if truncated:
            break

    return total_reward

# Run the training loop to train the Q-Learning agent
episode = 0
success_count = 0

train_env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #set to False for simpler /calm environment
    wind_power= 15.0,
    turbulence_power= 1.5)

eval_env = gym.make(
    "LunarLander-v2",
    continuous=False,  # set to False for simpler discrete version
    gravity=-10.0,
    enable_wind=False,  # set to False for simpler /calm environment
    wind_power=15.0,
    turbulence_power=1.5,
    render_mode="rgb_array")


while episode < EPISODES:
    episode += 1
    done = False
    reward = run_game(train_env, q_table, False, True)
    if reward > 0:
        success_count += 1
    epsilon = epsilon / math.log(EPISODES)

print("Success Count:", success_count)

# Evaluate the trained Q-Learning agent
'''
eval_env.reset()
mean_reward, std_reward = evaluate_policy(q_table, eval_env, n_eval_episodes=10)
print("Mean Reward:", mean_reward)
'''


Success Count: 25280


'\neval_env.reset()\nmean_reward, std_reward = evaluate_policy(q_table, eval_env, n_eval_episodes=10)\nprint("Mean Reward:", mean_reward)\n'

Run the training! Note: this can take a *long* time - Q-learning is slow since separately learns each Q(S,A) value for a pretty large state space for this problem.


Now lets test the trained agent. What you should see that after about 10000 episodes, with wind=False, the lander can successfull land about half of the time. However, no reasonable amount of training discrete Q-Table can prepare the lander for behaving well in a windy/turbulent environment.

In [9]:
# HIDE OUTPUT

# Setup the wrapper to record the video
#eval environment, with graphics
''' DEFINED ABOVE
eval_env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #must be same as train environment
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")
'''

eval_env.reset()
video_callable=lambda episode_id: True
eval_env = RecordVideo(eval_env, video_folder='./videos_lander_qlearn', episode_trigger=video_callable)
mean_reward =0
reward = run_game(eval_env, q_table, True, False, exploit=True)
mean_reward+=reward
reward = run_game(eval_env, q_table, True, False, exploit=True)
mean_reward+=reward
reward = run_game(eval_env, q_table, True, False, exploit=True)
mean_reward+=reward

print ("mean reward: ", reward/3)

# Display the video
video0 = io.open(glob.glob('videos_lander_qlearn/rl-video-episode-0.mp4')[0], 'r+b').read()
encoded0 = base64.b64encode(video0)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video0/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded0.decode('ascii'))))

video1 = io.open(glob.glob('videos_lander_qlearn/rl-video-episode-1.mp4')[0], 'r+b').read()
encoded1 = base64.b64encode(video1)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video1/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded1.decode('ascii'))))

video2 = io.open(glob.glob('videos_lander_qlearn/rl-video-episode-2.mp4')[0], 'r+b').read()
encoded2 = base64.b64encode(video2)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video2/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded2.decode('ascii'))))


Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-0.mp4
Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-1.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-1.mp4
Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-2.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-2.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-2.mp4
mean reward:  66.73366207437525


## Inspecting the Q-Table

We can also display the Q-table. The following code shows the agent's action for each environment state. As the weights of a neural network, this table is not straightforward to interpret. Some patterns do emerge in that direction, as seen by calculating the means of rows and columns. The actions seem consistent at both velocity and position's upper and lower halves.

In [10]:
import pandas as pd

df = pd.DataFrame(q_table)

#df.columns = [f'v-{x}' for x in range(DISCRETE_GRID_SIZE[0])]
#df.index = [f'p-{x}' for x in range(DISCRETE_GRID_SIZE[1])]
df

Unnamed: 0,0,1,2,3
0,-0.294459,0.000000,0.000000,0.0
1,-0.219471,-0.280044,0.000000,0.0
2,0.000000,0.000000,0.000000,0.0
3,0.000000,0.000000,0.000000,0.0
4,-0.306443,-0.309958,0.866547,0.0
...,...,...,...,...
65531,0.000000,0.000000,0.000000,0.0
65532,0.000000,0.000000,0.000000,0.0
65533,0.000000,0.000000,0.000000,0.0
65534,0.000000,0.000000,0.000000,0.0


## Training the DQN Agent for Lunar Lander

#Todo: implement the DQN code for vectorized lunar lander environment above
Follow the DQN example in the provided notebook.



https://colab.research.google.com/drive/1f3cwSAvpDe23Xfkn_tXNj7dGkWlusJYN#scrollTo=mJb8fU8wIenZ



To implement DQN and other algorithms, we will use the Stable Baselines library. It is designed for ease of use, offering a straightforward API to implement, experiment with, and extend upon cutting-edge RL methods.

https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html


In [11]:
import gymnasium as gym
from stable_baselines3 import DQN
import torch as th
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.evaluation import evaluate_policy

# Create and initialize fresh Lunar Lander environment
train_env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #Should also learn even with wind enabled
    wind_power= 15.0,
    turbulence_power= 1.5)

time_step = train_env.reset()

# Instantiate the agent
#specify network architecture for policy and value networks
#***Todo: define dqn model. Provide policy network architecture as shown in
#MountainCar example.
# Define the neural network architecture for DQN
policy_kwargs = dict(
    net_arch=[256, 256]  # Adjust the network architecture as needed
)

# Initialize the DQN agent
dqn = DQN('MlpPolicy', train_env, policy_kwargs=policy_kwargs, learning_rate=0.001, buffer_size=10000, batch_size=64, learning_starts=1000,
          target_update_interval=500, exploration_fraction=0.1, exploration_final_eps=0.02, tensorboard_log="./dqn_lunar_tensorboard/")
#provide appropriate parameters, net arch requires experimentation



In [12]:
# Train the agent

#TODO: experiment with appropriate time steps for this problem
Timesteps = 1e5 #set to >=100000 to converge
dqn.learn(total_timesteps=Timesteps)

# Save the agent
dqn.save("dqn_lander")

In [13]:
# Create a fresh environment for evaluation
eval_env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False, #must be same as train environment
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")


# Evaluate the agent
mean_reward, std_reward = evaluate_policy(dqn, eval_env, n_eval_episodes=10)

print(f"Mean reward: {mean_reward} +/- {std_reward}")



Mean reward: -12.96816218548338 +/- 25.11949729779816


## Visualize actions

Visualize the lander for 3 episodes and save in a video.

In [14]:
# Setup the wrapper to record the video
from gymnasium.wrappers import RecordVideo
video_callable=lambda episode_id: True
eval_env = RecordVideo(eval_env, video_folder='./videos_lander_dqn', episode_trigger=video_callable)

mean_reward, std_reward = evaluate_policy(dqn, eval_env, n_eval_episodes=3)


# Display the video
video0 = io.open(glob.glob('videos_lander_dqn/rl-video-episode-0.mp4')[0], 'r+b').read()
encoded0 = base64.b64encode(video0)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video0/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded0.decode('ascii'))))

video1 = io.open(glob.glob('videos_lander_dqn/rl-video-episode-1.mp4')[0], 'r+b').read()
encoded1 = base64.b64encode(video1)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video1/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded1.decode('ascii'))))

video2 = io.open(glob.glob('videos_lander_dqn/rl-video-episode-2.mp4')[0], 'r+b').read()
encoded2 = base64.b64encode(video2)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video2/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded2.decode('ascii'))))

Moviepy - Building video /content/videos_lander_dqn/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos_lander_dqn/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_dqn/rl-video-episode-0.mp4
Moviepy - Building video /content/videos_lander_dqn/rl-video-episode-1.mp4.
Moviepy - Writing video /content/videos_lander_dqn/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_dqn/rl-video-episode-1.mp4
Moviepy - Building video /content/videos_lander_dqn/rl-video-episode-2.mp4.
Moviepy - Writing video /content/videos_lander_dqn/rl-video-episode-2.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_dqn/rl-video-episode-2.mp4


Acknowledgements: adapted from Official Example:

https://stable-baselines3.readthedocs.io/en/master/guide/examples.html

## PPO Policy

#Now lets use PPO
This is the starting code you have to complete.

https://gymnasium.farama.org/environments/box2d/lunar_lander/

In [15]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack
import torch as th


In [16]:
# Train the agent
TIMESTEPS = 3e5
#experiment with number of steps
#setup training environment without video for speed

env_train = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5)

env_train.reset()

# Initialize the agent, use Proximal Policy Optimization (PPO)

lander_ppo = PPO('MlpPolicy', env_train, verbose=1, tensorboard_log="./ppo_lunar_tensorboard/")



Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


Train your PPO agent

In [17]:
#todo: experiment with number of steps
lander_ppo.learn(total_timesteps=TIMESTEPS)

# Save the model
lander_ppo.save(f"lander_ppo_model")
env.close()


Logging to ./ppo_lunar_tensorboard/PPO_1
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 87.2     |
|    ep_rew_mean     | -177     |
| time/              |          |
|    fps             | 1014     |
|    iterations      | 1        |
|    time_elapsed    | 2        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 90.6        |
|    ep_rew_mean          | -181        |
| time/                   |             |
|    fps                  | 798         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007039183 |
|    clip_fraction        | 0.0151      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.0



Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-2.mp4
Moviepy - Building video /content/videos_lander_qlearn/rl-video-episode-1.mp4.
Moviepy - Writing video /content/videos_lander_qlearn/rl-video-episode-1.mp4



                                                  

Moviepy - Done !
Moviepy - video ready /content/videos_lander_qlearn/rl-video-episode-1.mp4




In [18]:
# Evaluate the trained agent
env_train.reset()
mean_reward, std_reward = evaluate_policy(lander_ppo, env_train, n_eval_episodes=10)

print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Don't forget to close the environment when you are done
env.close()

Mean reward: -26.58168616346891 +/- 82.43187995022399


Now lets see how it lands!


In [19]:
# Setup the wrapper to record the video
import base64
from IPython import display as ipythondisplay
from pathlib import Path
from gymnasium.wrappers import RecordVideo
import gymnasium as gym
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay
from pyvirtualdisplay import Display
import numpy as np

from gymnasium.wrappers import RecordVideo
video_callable=lambda episode_id: True


eval_env = gym.make(
    "LunarLander-v2",
    continuous= False, #set to False for simpler discrete version
    gravity= -10.0,
    enable_wind= False,
    wind_power= 15.0,
    turbulence_power= 1.5,
    render_mode="rgb_array")

obs = eval_env.reset()
video_folder = '/content/videos_lander_ppo'
# Record the environment
eval_env = RecordVideo(eval_env, video_folder='./videos_lander_ppo', episode_trigger=video_callable)

# Load the trained agent
# NOTE: if you have loading issue, you can pass `print_system_info=True`
# to compare the system on which the model was trained vs the current one
# model = DQN.load("dqn_lunar", env=env, print_system_info=True)
lander_ppo= PPO.load(f"lander_ppo_model", env=eval_env)

# Evaluate agent
mean_reward, std_reward = evaluate_policy(lander_ppo, eval_env, n_eval_episodes=3)
print("average reward: ", mean_reward)

eval_env.close()




# Display the video
video0 = io.open(glob.glob('videos_lander_ppo/rl-video-episode-0.mp4')[0], 'r+b').read()
encoded0 = base64.b64encode(video0)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video0/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded0.decode('ascii'))))

video1 = io.open(glob.glob('videos_lander_ppo/rl-video-episode-1.mp4')[0], 'r+b').read()
encoded1 = base64.b64encode(video1)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video1/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded1.decode('ascii'))))

video2 = io.open(glob.glob('videos_lander_ppo/rl-video-episode-2.mp4')[0], 'r+b').read()
encoded2 = base64.b64encode(video2)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video2/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded2.decode('ascii'))))


# Close the environment which should also save the video
env.close()

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


  logger.warn("Unable to save last video! Did you call close()?")


Moviepy - Building video /content/videos_lander_ppo/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos_lander_ppo/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_ppo/rl-video-episode-0.mp4
Moviepy - Building video /content/videos_lander_ppo/rl-video-episode-1.mp4.
Moviepy - Writing video /content/videos_lander_ppo/rl-video-episode-1.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_ppo/rl-video-episode-1.mp4
Moviepy - Building video /content/videos_lander_ppo/rl-video-episode-2.mp4.
Moviepy - Writing video /content/videos_lander_ppo/rl-video-episode-2.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_ppo/rl-video-episode-2.mp4
average reward:  -76.87118755098588
Moviepy - Building video /content/videos_lander_ppo/rl-video-episode-3.mp4.
Moviepy - Writing video /content/videos_lander_ppo/rl-video-episode-3.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_lander_ppo/rl-video-episode-3.mp4
