<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_12_4_atari.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Deep RL for 2D environments: Q-Learning, DQN, and PPO
* [Eugene Agichtein](https://www.cs.emory.edu/~eugene/) for CS325: Artificial Intelligence
* Adapted from [Jeff Heaton](https://sites.wustl.edu/jeffheaton/)

This is the starter code for training agents in the Box2D environments in Gymnasium:
https://gymnasium.farama.org/environments/box2d/


The starter code uses Lunar Lander as its example. You will extend it to Car Racing and Bipedal Walker yourself.



# Google CoLab Setup

The following code sets up Gymnasium in Google Colab. Do not modify these lines, but you may add additional dependencies if needed.

In [1]:
from google.colab import drive
!pip install stable-baselines3[extra] gymnasium
!pip install gymnasium[accept-rom-license,atari]
!pip install pyvirtualdisplay
!sudo apt-get install -y python-opengl ffmpeg
!sudo apt-get install -y xvfb
!pip install swig
!pip install gymnasium[box2d]
!pip install gym-notebook-wrapper

Collecting stable-baselines3[extra]
  Downloading stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
Collecting gymnasium
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting shimmy[atari]~=1.3.0 (from stable-baselines3[extra])
  Downloading Shimmy-1.3.0-py3-none-any.whl (37 kB)
Collecting autorom[accept-rom-license]~=0.6.1 (from stable-baselines3[extra])
  Downloading AutoROM-0.6.1-py3-none-any.whl (9.4 kB)
Collecting farama-notifications>=0.0.1 (from gymnasium)
  Downloading Farama_Notifications-0.0.4-py3-none-any.whl (2.5 kB)
Collecting AutoROM.accept-rom-license (from autorom[accept-rom-license]~=0.6.1->stable-baselines3[extra])
  Downloading AutoROM.accept-rom-license-0.6.1.tar.gz (434 kB)

## Introducing Box2D/Car Racer

https://gymnasium.farama.org/environments/box2d/car_racing/

This is a self-driving control task learned from pixels in a top-down racing environment. A new random track is generated every episode.

For simplicity, we will use discrete actions, which makes the problem much simpler and amenable to Q-learning. However, because the input is now visual (pixels), we cannot use a tabular Q-function: the state space is far too large. Instead, we go directly to DQN, where a CNN processes the input to extract features for learning.
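
To make the dimensionality argument concrete, here is a rough back-of-the-envelope calculation. It assumes the standard CarRacing observation of 96x96 RGB pixels with 256 intensity levels per channel and 5 discrete actions; the helper name `q_table_digits` is made up for this illustration.

```python
import math

def q_table_digits(height, width, channels, levels, n_actions):
    """Number of decimal digits in the entry count of a tabular Q-function."""
    n_pixels = height * width * channels
    # states = levels ** n_pixels and entries = states * n_actions.
    # Work in log10 because the count itself is astronomically large.
    log10_entries = n_pixels * math.log10(levels) + math.log10(n_actions)
    return math.floor(log10_entries) + 1

digits = q_table_digits(96, 96, 3, 256, 5)
print(f"A Q-table over raw pixels would need an entry count with {digits} digits")
```

Even storing one entry per atom in the observable universe would not come close, which is why a function approximator such as a CNN is used instead.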

P.S. We use the discrete-action version of the racer because Q-learning and DQN do not support continuous action spaces: we cannot compute argmax(q) directly over continuous action outputs.
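
As a small illustration of why a finite action set matters, the sketch below picks an action by taking the argmax over a list of Q-values, with optional epsilon-greedy exploration. The Q-values are hypothetical numbers, not outputs of a trained network, and the function names are illustrative only.

```python
import random

def greedy_action(q_values):
    """Index of the largest Q-value: the argmax over a finite action set."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

# Hypothetical Q-values, one per discrete action (five in discrete CarRacing)
q_values = [0.1, -0.3, 0.2, 1.5, -0.8]
print(greedy_action(q_values))
print(epsilon_greedy_action(q_values, epsilon=0.1))
```

With a continuous action space there is no finite list to scan, so the same argmax would require solving an optimization problem at every step.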


In [2]:
import base64
import glob
import io
import math
import os
import random
from collections import namedtuple
from itertools import count
from pathlib import Path

import gymnasium as gym
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import torch
from gymnasium.wrappers import RecordVideo
from IPython import display as ipythondisplay
from IPython.display import HTML
from PIL import Image
from pyvirtualdisplay import Display

# Discrete actions; domain_randomize=False keeps the colors consistent
env = gym.make("CarRacing-v2", domain_randomize=False, continuous=False, render_mode="rgb_array")





Before setting up a DQN network for this problem, let's see how the car behaves without training.

In [3]:
env.metadata['render_fps'] = 30

# Set up the wrapper to record a video of every episode
video_callable = lambda episode_id: True
video_env = RecordVideo(env, video_folder='./videos_racer_qlearn', episode_trigger=video_callable)

# Reset through the wrapper so that recording starts with the episode
obs, info = video_env.reset()

# Run the environment with random actions until the episode ends
truncated = False
terminated = False
while not truncated and not terminated:
    # Sample uniformly from all discrete actions
    # (np.random.randint(0, 4) would never pick action 4)
    action = video_env.action_space.sample()
    obs, reward, terminated, truncated, info = video_env.step(action)
    # Uncomment below to see observations
    # print(obs, reward, terminated, truncated, info)
video_env.close()  # closing the wrapper finalizes the video file

# Display the video
video = open(glob.glob('videos_racer_qlearn/*.mp4')[0], 'rb').read()
encoded = base64.b64encode(video)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))


Moviepy - Building video /content/videos_racer_qlearn/rl-video-episode-0.mp4.
Moviepy - Writing video /content/videos_racer_qlearn/rl-video-episode-0.mp4





Moviepy - Done !
Moviepy - video ready /content/videos_racer_qlearn/rl-video-episode-0.mp4


# Training the DQN Agent

TODO: Implement the DQN code for this environment. Follow the examples provided and feel free to adapt the LunarLander code to this problem.


https://colab.research.google.com/drive/1f3cwSAvpDe23Xfkn_tXNj7dGkWlusJYN#scrollTo=mJb8fU8wIenZ



To implement DQN and other algorithms, we will use the Stable-Baselines3 (SB3) library. It is designed for ease of use and offers a straightforward API to implement, experiment with, and extend cutting-edge RL methods.

https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html


In [4]:
import gymnasium as gym
from stable_baselines3 import DQN
import torch as th
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy

# Create and initialize a fresh Car Racing environment
train_env = gym.make("CarRacing-v2", domain_randomize=False, continuous=False, render_mode="rgb_array")  # keep consistent colors; discrete actions

# TODO:
# Implement the DQN network for this problem
# Experiment with different network architectures and configurations to make it work
# Normally we can also stack the frames, but that breaks visualization/videos
# CnnPolicy matches the pixel observations; train on train_env, since env was closed above
dqn_racer = DQN("CnnPolicy", train_env, verbose=1)


Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Wrapping the env in a VecTransposeImage.
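
The earlier description says the pixel input is processed by a CNN. As a sketch of what that means for tensor shapes, the calculation below assumes the three-convolution "NatureCNN" layout that SB3 uses by default for `CnnPolicy` (8x8 stride 4, 4x4 stride 2, 3x3 stride 1, with 32, 64, and 64 filters, no padding) and a 96x96 input. The helper `conv_out` is a hypothetical name for the standard output-size formula.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# (kernel, stride, filters) for each assumed convolution layer
layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 96      # CarRacing frames are 96x96
channels = 3   # RGB, moved to channels-first by VecTransposeImage
for kernel, stride, filters in layers:
    size = conv_out(size, kernel, stride)
    channels = filters
    print(f"kernel {kernel}, stride {stride}: {channels} x {size} x {size}")

flat_features = channels * size * size
print("flattened features:", flat_features)
```

Stacking frames or converting to grayscale changes only the input channel count, not these spatial sizes.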




In [5]:
# TODO: train the DQN agent for an appropriate number of steps
Timesteps = 1e5  # set to >= 100000 to converge

# Train the agent
dqn_racer.learn(total_timesteps=int(Timesteps))


# Save the agent
dqn_racer.save("dqn_racer")

----------------------------------
| rollout/            |          |
|    ep_len_mean      | 1e+03    |
|    ep_rew_mean      | -55.8    |
|    exploration_rate | 0.62     |
| time/               |          |
|    episodes         | 4        |
|    fps              | 40       |
|    time_elapsed     | 98       |
|    total_timesteps  | 4000     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 5.53e-05 |
|    n_updates        | 974      |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 1e+03    |
|    ep_rew_mean      | -52.9    |
|    exploration_rate | 0.24     |
| time/               |          |
|    episodes         | 8        |
|    fps              | 37       |
|    time_elapsed     | 213      |
|    total_timesteps  | 8000     |
| train/              |          |
|    learning_rate    | 0.0001   |
|    loss             | 0.0983   |
|    n_updates      

KeyboardInterrupt: 
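
The `exploration_rate` values in the log follow from DQN's linear epsilon schedule. Here is a minimal sketch, assuming SB3's default settings (`exploration_initial_eps=1.0`, `exploration_final_eps=0.05`, `exploration_fraction=0.1`) and the 100,000 total timesteps set in the training cell; the function name `linear_epsilon` is illustrative and not part of SB3.

```python
def linear_epsilon(step, total_timesteps=100_000, fraction=0.1,
                   eps_start=1.0, eps_end=0.05):
    """Linearly anneal epsilon over the first `fraction` of training."""
    decay_steps = fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)  # clamp once the decay is finished
    return eps_start + progress * (eps_end - eps_start)

for step in (4000, 8000, 20000):
    print(step, round(linear_epsilon(step), 2))
```

Under these assumptions the schedule gives 0.62 at 4,000 steps and 0.24 at 8,000 steps, matching the log, and stays at 0.05 after the first 10,000 steps.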

In [None]:
train_env.reset()


# Evaluate the agent
mean_reward, std_reward = evaluate_policy(dqn_racer, train_env, n_eval_episodes=5, deterministic=True)

print(f"Mean reward: {mean_reward} +/- {std_reward}")
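
For intuition about the two numbers printed above: `evaluate_policy` runs the requested number of episodes and reports the mean and standard deviation of the per-episode returns. The sketch below reproduces that summary on made-up returns; the values in `episode_returns` are hypothetical and are not results from this notebook.

```python
import statistics

# Hypothetical total rewards from five evaluation episodes
episode_returns = [-55.0, -60.5, -48.2, -52.9, -57.4]

mean_reward = statistics.fmean(episode_returns)
# Population standard deviation, matching NumPy's default np.std
std_reward = statistics.pstdev(episode_returns)

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

With only five episodes the standard deviation is a noisy estimate, so increasing `n_eval_episodes` gives a more reliable comparison between agents.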

## Visualize actions

Race the trained agent

In [None]:
train_env.metadata['render_fps'] = 30

# Set up the wrapper to record a video of every episode
video_callable = lambda episode_id: True
video_env = RecordVideo(train_env, video_folder='./videos_racer_dqn', episode_trigger=video_callable)

# Run the trained agent until the episode ends
truncated = False
terminated = False
obs, info = video_env.reset()
while not truncated and not terminated:
    action, _ = dqn_racer.predict(obs, deterministic=True)
    # predict returns a NumPy value; convert it to a plain int for the discrete env
    obs, reward, terminated, truncated, info = video_env.step(int(action))
video_env.close()

# Display the video
video = open(glob.glob('videos_racer_dqn/*.mp4')[0], 'rb').read()
encoded = base64.b64encode(video)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))


You can also make and display a GIF of the agent.
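
One possible way to build such a GIF is with Pillow, which the notebook already imports. This is only a sketch: `save_gif` is a hypothetical helper, and the solid-color frames stand in for frames that would be collected during a rollout (for example with `Image.fromarray(env.render())` when `render_mode="rgb_array"`).

```python
import os
import tempfile

from PIL import Image

def save_gif(frames, path, fps=30):
    """Write a list of PIL images to an animated GIF that loops forever."""
    first, *rest = frames
    first.save(path, save_all=True, append_images=rest,
               duration=int(1000 / fps), loop=0)
    return path

# Stand-in frames; replace with frames rendered from the environment
frames = [Image.new("RGB", (96, 96), color) for color in ("red", "green", "blue")]
gif_path = save_gif(frames, os.path.join(tempfile.gettempdir(), "racer_demo.gif"))
print("wrote", gif_path)
```

In a notebook, the resulting file can then be shown with `ipythondisplay.Image(filename=gif_path)`.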

## PPO Policy

Now let's use PPO.


In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import VecFrameStack
import torch as th

env = gym.make("CarRacing-v2", domain_randomize=False, continuous=False, render_mode="rgb_array") #keep consistent colors; discrete actions
obs, info= env.reset()




In [None]:
# Define a PPO (Proximal Policy Optimization) agent
# Follow the examples for Mountain Car / Lunar Lander
# Experiment with different CNN architectures and numbers of layers
# Hint: this might take a lot of computation!
# Another hint: you can convert the input to grayscale and stack frames for better learning
# See examples here: https://stable-baselines.readthedocs.io/en/v2.3.0/guide/vec_envs.html
racer_ppo = PPO("CnnPolicy", env, verbose=1)
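
The grayscale and frame-stacking hint can be illustrated independently of any RL library. The sketch below uses plain NumPy; `to_grayscale` and `FrameStacker` are hypothetical names, and the luma weights are the common ITU-R BT.601 coefficients, chosen here as an assumption. In practice you would more likely use existing wrappers (such as the `VecFrameStack` imported above) than hand-written code.

```python
from collections import deque

import numpy as np

def to_grayscale(frame):
    """Convert an (H, W, 3) uint8 RGB frame to an (H, W) uint8 grayscale frame."""
    weights = np.array([0.299, 0.587, 0.114])
    return (frame @ weights).astype(np.uint8)

class FrameStacker:
    """Keep the last k grayscale frames and expose them as a (k, H, W) array."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, frame):
        gray = to_grayscale(frame)
        # Fill the whole stack with the first frame of the episode
        self.frames.clear()
        self.frames.extend([gray] * self.frames.maxlen)
        return np.stack(list(self.frames), axis=0)

    def push(self, frame):
        # The deque drops the oldest frame automatically
        self.frames.append(to_grayscale(frame))
        return np.stack(list(self.frames), axis=0)

stacker = FrameStacker(k=4)
first = np.zeros((96, 96, 3), dtype=np.uint8)
stacked = stacker.reset(first)
print(stacked.shape)
```

Stacking consecutive frames lets the network infer velocity and turning direction, which a single still image cannot convey.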


In [None]:
# Train the agent
# Experiment with appropriate time steps and learning rates
TIMESTEPS = 2e5

# Cast to int, since 2e5 is a float
racer_ppo.learn(total_timesteps=int(TIMESTEPS))

# Save the model
racer_ppo.save("racer_ppo_model")



In [None]:
# Evaluate the trained agent
obs, info = env.reset()
mean_reward, std_reward = evaluate_policy(racer_ppo, env, n_eval_episodes=5)

print(f"Mean reward: {mean_reward} +/- {std_reward}")

# Keep env open here: the next cell records a video with it and closes it afterwards

Now let's watch it race!


In [None]:
env.metadata['render_fps'] = 30

# Set up the wrapper to record a video of every episode
video_callable = lambda episode_id: True
video_env = RecordVideo(env, video_folder='./videos_racer_ppo', episode_trigger=video_callable)

# Run the trained agent until the episode ends
truncated = False
terminated = False
obs, info = video_env.reset()
while not truncated and not terminated:
    action, _ = racer_ppo.predict(obs, deterministic=True)
    # predict returns a NumPy value; convert it to a plain int for the discrete env
    obs, reward, terminated, truncated, info = video_env.step(int(action))
video_env.close()

print("finished eval")

# Display the video
video = open(glob.glob('videos_racer_ppo/*.mp4')[0], 'rb').read()
encoded = base64.b64encode(video)
ipythondisplay.display(HTML(data='''
    <video width="640" height="480" controls>
        <source src="data:video/mp4;base64,{0}" type="video/mp4" />
    </video>
'''.format(encoded.decode('ascii'))))

The end!