# Working with Stable-Baselines3 and the Gym Library

In this notebook we use the `stable_baselines3` and `gym` libraries to train an agent that lands safely on the landing pad in the `LunarLander-v2` environment, using the `PPO` algorithm.

- This mini project is for the Hugging Face [Deep RL Course](https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt).

## Installing and Importing libraries

In [1]:
! pip install -q stable-baselines3
! pip install -q box2d

In [2]:
import gym

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

## Create Environment

We stack 16 Lunar Lander environments into a single vectorized environment, so the neural network is trained on observations collected from all of them.

In [3]:
# Create the environment
env = make_vec_env('LunarLander-v2', n_envs=16)
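To make the idea of stacked environments concrete, here is a minimal pure-Python sketch of what a vectorized environment does conceptually: it steps several independent environments in lockstep, batches their outputs, and restarts any episode that has finished. `ToyEnv` and `ToyVecEnv` are hypothetical names invented for this illustration and do not follow the gym or Stable-Baselines3 API; the observation size of 8 mirrors LunarLander's observation vector, and the random values are placeholders.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a single environment (not the gym API)."""

    def __init__(self, seed, episode_len=5):
        self.rng = random.Random(seed)
        self.episode_len = episode_len
        self.t = 0

    def _observe(self):
        # LunarLander observations have 8 components; random values stand in here.
        return [self.rng.random() for _ in range(8)]

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return self._observe(), 1.0, done


class ToyVecEnv:
    """Steps several environments in lockstep and batches their outputs."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done = env.step(action)
            if done:
                obs = env.reset()  # restart finished episodes automatically
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
        return obs_batch, rewards, dones


vec_env = ToyVecEnv([ToyEnv(seed=i) for i in range(16)])
observations = vec_env.reset()
print(len(observations), len(observations[0]))  # prints: 16 8
```

Each call to `step` therefore yields 16 transitions at once, which is why a vectorized environment fills the training buffer faster than a single environment would.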

## Create Model

We train the agent with the PPO algorithm [1]. Because the network's input is the environment's observation vector rather than an image, we set the `policy` parameter to `MlpPolicy`.

In [4]:
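model = PPO(
    policy='MlpPolicy',  # fully connected network for vector observations
    env=env,
    n_steps=1024,        # steps collected per environment before each update
    batch_size=64,       # minibatch size for each gradient step
    n_epochs=4,          # optimization passes over each collected rollout
    gamma=0.999,         # discount factor
    gae_lambda=0.98,     # GAE bias/variance trade-off
    ent_coef=0.01,       # entropy bonus weight, encourages exploration
    verbose=1,
)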

Using cuda device
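The `gamma` and `gae_lambda` values above control how advantages are estimated with Generalized Advantage Estimation (GAE). The following is a simplified textbook-style sketch for a single environment, not the Stable-Baselines3 implementation; the function name `compute_gae` and the sample trajectory are assumptions made for illustration.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.999, gae_lambda=0.98):
    """Return (advantages, returns) for one trajectory.

    rewards[t], values[t]: reward received and value estimate at step t.
    dones[t]: True if the episode ended at step t.
    last_value: value estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        non_terminal = 0.0 if dones[t] else 1.0  # do not bootstrap across episode ends
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        gae = delta + gamma * gae_lambda * non_terminal * gae
        advantages[t] = gae
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


# Hypothetical three-step trajectory that ends on the last step.
advantages, returns = compute_gae(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.6, 0.7],
    last_value=0.0,
    dones=[False, False, True],
)
print(advantages)
```

With `gamma` close to 1, rewards far in the future still count almost fully, which suits an episode whose main reward arrives only at touchdown. A `gae_lambda` close to 1 leans on observed rewards more than on the value estimates.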


## Train Agent with PPO Algorithm

We train the agent for 1 million timesteps.

In [5]:
model.learn(total_timesteps=1000000)
# Save the model
model_name = "ppo-LunarLander-v2"
model.save(model_name)

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 90.8     |
|    ep_rew_mean     | -193     |
| time/              |          |
|    fps             | 1773     |
|    iterations      | 1        |
|    time_elapsed    | 9        |
|    total_timesteps | 16384    |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 104          |
|    ep_rew_mean          | -141         |
| time/                   |              |
|    fps                  | 1682         |
|    iterations           | 2            |
|    time_elapsed         | 19           |
|    total_timesteps      | 32768        |
| train/                  |              |
|    approx_kl            | 0.0068200324 |
|    clip_fraction        | 0.0493       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.38        |
|    explained_variance   | -0.00182     |
|    learning_r
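The `total_timesteps` values in the log (16384, then 32768) follow directly from the settings used earlier: each iteration collects `n_steps` transitions from each of the 16 environments. The arithmetic below is a small sketch of that bookkeeping. The iteration count assumes that training keeps collecting full rollouts until the requested budget is reached, so the final total can overshoot 1,000,000.

```python
import math

# Values taken from the notebook cells above.
n_envs = 16
n_steps = 1024
batch_size = 64
n_epochs = 4
total_timesteps = 1_000_000

# Transitions gathered per iteration across all environments.
rollout_size = n_envs * n_steps

# Minibatches per pass over the rollout, and gradient steps per iteration.
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_iteration = minibatches_per_epoch * n_epochs

# Assumption: only whole rollouts are collected, so round up.
iterations = math.ceil(total_timesteps / rollout_size)
timesteps_collected = iterations * rollout_size

print(rollout_size)                  # 16384
print(gradient_steps_per_iteration)  # 1024
print(iterations)                    # 62
print(timesteps_collected)           # 1015808
```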

## Evaluate the Agent

For evaluation, we test the trained agent on a fresh environment.

In [6]:
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward}")



mean_reward=259.40 +/- 22.087186172676166
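The two numbers reported above are the mean and standard deviation of the total reward per evaluation episode. As a rough sketch of that summary step, the snippet below computes both from a list of per-episode rewards. The `summarize` helper and the reward values are hypothetical, and the population standard deviation is used on the assumption that it matches NumPy's default behaviour.

```python
import statistics


def summarize(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    mean_reward = statistics.fmean(episode_rewards)
    std_reward = statistics.pstdev(episode_rewards)
    return mean_reward, std_reward


# Hypothetical rewards from four evaluation episodes (not results from this notebook).
episode_rewards = [250.0, 270.0, 230.0, 290.0]
mean_reward, std_reward = summarize(episode_rewards)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```

A small standard deviation relative to the mean suggests the agent lands consistently rather than scoring well on only a few lucky episodes.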


## Resources

[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” arXiv [cs.LG], 2017.