## RLlib

In [1]:
# HIDDEN
import gym
import numpy as np

#### What about the learning?

Let's return to the "API" of RL:

![](img/RL-API.png)

- We've talked about the input (environment) and output (policy)
- Now let's talk about the learning itself!

#### What we'll cover

- Many, many supervised learning algorithms exist... random forests, logistic regression, neural networks, etc.
- Likewise, there are many RL algorithms.
- This is not a course on RL algorithms, though many good courses on that topic exist!
- This course is about _applying_ RL.

#### Introducing Ray RLlib

![](img/rllib-logo.png)

- In this course we'll use Ray RLlib as our "scikit-learn of reinforcement learning"
- We will look under the hood only as needed, and focus on the inputs and outputs.

#### Our first RLlib code

First, we import RLlib, which is part of the Ray project:

In [2]:
from ray import rllib

Next, we create a trainer object. 

In [41]:
# HIDDEN
trainer_config = {
    "framework"  : "torch",
    "seed"       : 0,
    "env_config" : {"is_slippery" : False}}

In [42]:
trainer = rllib.agents.ppo.PPOTrainer(env="FrozenLake-v1", config=trainer_config)

For clarity we've hidden the config for now, but we'll get back to it soon.

#### Using the policy

- We haven't trained the agent yet, but we can still see what it does.
- This is like calling `predict` before running `fit` with supervised learning.

In [43]:
env = gym.make("FrozenLake-v1", is_slippery=False)
obs = env.reset()
obs

0

In [44]:
# HIDDEN
env.seed(3);

In [45]:
action = trainer.compute_single_action(obs, explore=False)
action

0

- We gave the trainer our initial observation, 0, and it recommended action 0 (left).
- This action came from the initialized **policy**.
- Remember, the policy maps observations to actions.
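To make "the policy maps observations to actions" concrete, here is a minimal sketch of a hand-written lookup-table policy for the 4x4 Frozen Lake. The table entries are hypothetical and chosen by hand; RLlib's PPO policy is a neural network rather than a table, but it plays the same role. The action codes (0 = left, 1 = down, 2 = right, 3 = up) follow Gym's FrozenLake convention.

```python
# Action codes used by Gym's FrozenLake.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

# Hypothetical hand-written policy: one action per observation (cell index 0..15).
# Observations that are not listed fall back to LEFT, an arbitrary default.
policy_table = {0: DOWN, 4: DOWN, 8: RIGHT, 9: DOWN, 13: RIGHT, 14: RIGHT}


def policy(obs):
    """Map an observation (grid cell index) to an action."""
    return policy_table.get(obs, LEFT)


print(policy(0))   # action chosen at the start cell
print(policy(13))  # action chosen in the bottom row
```

An untrained policy is like a table filled with arbitrary values; training is what replaces those values with useful ones.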

#### Using the policy

We can see what happened after taking that action:

In [46]:
obs, reward, done, _ = env.step(action)
env.render()

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG


In [47]:
# HIDDEN
# We attempted to move left, but the agent is already at the left edge of the non-slippery grid, so it stays on the start square.

#### Training

- So far our policy was just a random/arbitrary initialization.
- What we want is to train it _based on experience interacting with the environment_.
- In order to do this, RLlib will _play through many episodes_ and learn as it goes.

In [51]:
train_info = trainer.train()

- Note that, unlike sklearn's `fit()`, here we don't provide the dataset to `train()`.
- We gave it the environment during initialization, and it uses the environment to generate data.

#### Training iterations

- In fact, what we just did was one _iteration_ of training.
- RLlib will play through a bunch of episodes per iteration, depending on its hyperparameters.

In [52]:
len(train_info["hist_stats"]["episode_lengths"])

516

Looks like it ran ~500 episodes in that one iteration.

#### RL mindset: data generation

- This is a key departure from the supervised learning mindset
- In SL, we take a fixed amount of data and train for some number of iterations
- In RL, more iterations means more training _on more data_ because we learn from the environment as we interact with it
- If you only play one episode, you might never see observation 10, so how can you learn what to do given observation 10?
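The point about unseen observations can be illustrated with a small simulation. The sketch below is a hand-written stand-in for the non-slippery 4x4 Frozen Lake (it is not Gym's implementation, and the helper names are made up for this example). It plays random episodes and counts how many distinct observations were encountered.

```python
import random

# Layout from the slides: holes at cells 5, 7, 11, 12 and the goal at cell 15.
HOLES, GOAL, SIZE = {5, 7, 11, 12}, 15, 4


def step(obs, action):
    """Deterministic move on the 4x4 grid; walls keep the agent in place."""
    row, col = divmod(obs, SIZE)
    if action == 0:    # left
        col = max(col - 1, 0)
    elif action == 1:  # down
        row = min(row + 1, SIZE - 1)
    elif action == 2:  # right
        col = min(col + 1, SIZE - 1)
    else:              # up
        row = max(row - 1, 0)
    new_obs = row * SIZE + col
    done = new_obs in HOLES or new_obs == GOAL
    return new_obs, done


def observations_seen(num_episodes, rng, max_steps=100):
    """Play random episodes and collect every observation encountered."""
    seen = set()
    for _ in range(num_episodes):
        obs = 0
        seen.add(obs)
        for _ in range(max_steps):
            obs, done = step(obs, rng.randrange(4))
            seen.add(obs)
            if done:
                break
    return seen


rng = random.Random(0)
print(len(observations_seen(1, rng)))    # distinct observations after one episode
print(len(observations_seen(500, rng)))  # distinct observations after many episodes
```

A single episode typically touches only a handful of cells, while many episodes cover most of the grid. That is why more interaction gives the learner more to learn from.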

#### Training info: episode lengths

Let's look at the lengths of the last 100 episodes we played:

In [53]:
print(train_info["hist_stats"]["episode_lengths"][-100:])

[4, 6, 3, 4, 9, 13, 8, 8, 4, 7, 5, 5, 7, 16, 16, 7, 2, 5, 4, 14, 3, 7, 4, 14, 3, 5, 7, 10, 2, 5, 17, 9, 27, 6, 7, 7, 19, 5, 18, 8, 4, 5, 3, 2, 11, 11, 18, 4, 15, 9, 6, 2, 8, 13, 7, 6, 5, 25, 2, 2, 6, 2, 2, 10, 8, 18, 9, 4, 6, 21, 3, 4, 14, 12, 4, 12, 7, 16, 11, 18, 7, 3, 8, 5, 8, 5, 6, 6, 5, 7, 5, 9, 3, 4, 9, 9, 8, 17, 14, 2]


- Remember that an episode ends when `.step()` returns `True` for the `done` flag.
- We see some very short episodes, where the agent fell into a hole right away.

#### Training info: episode rewards

- For those longer episodes, did the agent reach the goal?
- To assess this, we can print out the last 100 _episode rewards_:

In [54]:
print(train_info["hist_stats"]["episode_reward"][-100:])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [55]:
print(sum(train_info["hist_stats"]["episode_reward"][-100:]))

0.0


- This is not very impressive. Let's keep training.
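Frozen Lake gives a reward of 1 only when the agent reaches the goal, so the sum of episode rewards is the number of successful episodes. A small helper can turn the two lists we just inspected into a success rate and a mean episode length. This is a sketch that assumes `hist_stats` contains the `"episode_reward"` and `"episode_lengths"` keys used above; the helper name and the example numbers are made up for illustration.

```python
def summarize_episodes(hist_stats, last_n=100):
    """Return (success_rate, mean_length) over the most recent episodes."""
    rewards = hist_stats["episode_reward"][-last_n:]
    lengths = hist_stats["episode_lengths"][-last_n:]
    success_rate = sum(rewards) / len(rewards)
    mean_length = sum(lengths) / len(lengths)
    return success_rate, mean_length


# Made-up numbers for illustration only (not real training output).
example_stats = {
    "episode_reward": [0.0, 0.0, 1.0, 0.0],
    "episode_lengths": [4, 6, 6, 2],
}
print(summarize_episodes(example_stats))  # (0.25, 4.5)
```

With a real training result, the call would be `summarize_episodes(train_info["hist_stats"])`.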

#### More training

In [56]:
for i in range(10):
    train_info = trainer.train()

In [57]:
print(sum(train_info["hist_stats"]["episode_reward"][-100:]))

99.0


- Nice! Now we're reaching the goal almost every time!
- This non-slippery Frozen Lake is a very easy environment. 
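Instead of guessing the number of iterations up front, we could keep training until the recent success rate passes a target. The sketch below assumes only that the trainer has a `train()` method returning the `hist_stats` structure seen above. `train_until` and `FakeTrainer` are hypothetical names; the fake trainer exists only so the example runs without Ray installed.

```python
def train_until(trainer, target_rate, max_iters=50, last_n=100):
    """Call trainer.train() until the recent success rate reaches target_rate.

    Returns (iterations_run, success_rate).
    """
    rate = 0.0
    for i in range(1, max_iters + 1):
        info = trainer.train()
        rewards = info["hist_stats"]["episode_reward"][-last_n:]
        rate = sum(rewards) / len(rewards) if rewards else 0.0
        if rate >= target_rate:
            return i, rate
    return max_iters, rate


class FakeTrainer:
    """Hypothetical stand-in for a real trainer (not part of RLlib)."""

    def __init__(self):
        self.calls = 0

    def train(self):
        self.calls += 1
        # Pretend that 10 more episodes out of 100 succeed after each iteration.
        wins = min(self.calls * 10, 100)
        rewards = [1.0] * wins + [0.0] * (100 - wins)
        return {"hist_stats": {"episode_reward": rewards}}


print(train_until(FakeTrainer(), target_rate=0.9))  # (9, 0.9)
```

The `max_iters` cap matters: on harder environments the target may never be reached, and the loop should still stop.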

#### Using the policy

We can run the **observation-policy-action loop** for multiple time steps to watch the policy in action:

In [59]:
obs = env.reset()

for i in range(3):
    action = trainer.compute_single_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    env.render()

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG


#### Using the policy

In [60]:
for i in range(3):
    action = trainer.compute_single_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    env.render()

  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m


Using this policy we reliably reach the goal every time because the non-slippery Frozen Lake environment is deterministic.
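The observation-policy-action loop can be wrapped in a function that plays one full episode and returns the total reward. This is a sketch assuming the older Gym API used in this section, where `reset()` returns the observation and `step()` returns `(obs, reward, done, info)`. `run_episode` and `CorridorEnv` are hypothetical names; the toy environment is there only to keep the example self-contained.

```python
def run_episode(env, choose_action, max_steps=100):
    """Run the observation-policy-action loop for one episode."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


class CorridorEnv:
    """Hypothetical toy env: walk right from cell 0 to cell 3 to earn reward 1."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves right; any other action moves left (bounded at 0).
        self.pos = min(self.pos + 1, 3) if action == 1 else max(self.pos - 1, 0)
        done = self.pos == 3
        return self.pos, float(done), done, {}


print(run_episode(CorridorEnv(), choose_action=lambda obs: 1))  # 1.0
```

With the objects from this section, the same function could be called as `run_episode(env, lambda obs: trainer.compute_single_action(obs, explore=False))`.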

#### Evaluation

- Talk about train/test/etc.

#### Configuring the trainer

- `PPOTrainer`: we're using the PPO algorithm
- `env="FrozenLake-v1"`: RLlib knows about OpenAI Gym environments
  - In the next module we'll learn how to make our own environments!
- `config={"framework" : "torch"}`: RLlib works with both TensorFlow and PyTorch
  - Here we can include additional hyperparameters like we would in sklearn
- `"env_config" : {"is_slippery" : False}`: this selects the non-slippery Frozen Lake
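Since the config is a plain Python dictionary, variations are easy to build from a base config. The sketch below uses a hypothetical helper, `with_overrides`, to derive a slippery-lake config without mutating the original. The `"lr"` (learning rate) key is shown as an example of an extra hyperparameter; treat the exact set of accepted keys as an assumption and check the RLlib documentation for your version.

```python
import copy

# The config used in this section.
base_config = {
    "framework": "torch",
    "seed": 0,
    "env_config": {"is_slippery": False},
}


def with_overrides(config, **overrides):
    """Return a deep copy of config with the given top-level keys replaced."""
    new_config = copy.deepcopy(config)
    new_config.update(overrides)
    return new_config


# Assumed example: switch to the slippery lake and set a learning rate.
slippery_config = with_overrides(
    base_config, env_config={"is_slippery": True}, lr=1e-4
)
print(slippery_config)
print(base_config["env_config"])  # the base config is left unchanged
```

Copying before updating avoids a common pitfall where two trainers silently share, and overwrite, the same nested dictionary.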

## Ex 1

In [None]:
# HIDDEN
from ipywidgets import Output
from IPython import display
import time

## Slippery Frozen Lake

In the slides we trained an agent to reliably reach the goal in the non-slippery Frozen Lake environment. Here, try the same thing with the slippery Frozen Lake. Train your agent until it reaches the goal at least 25% of the time.

In [None]:
from ray import rllib

# BEGIN SOLUTION
trainer_config = {
    "framework"  : "torch",
    "seed"       : 0,
    "env_config" : {"is_slippery" : True}}

trainer = rllib.agents.ppo.PPOTrainer(env="FrozenLake-v1", config=trainer_config)

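for i in range(10):
    train_info = trainer.train()

# Fraction of the last 100 episodes that reached the goal; train longer if below 0.25
rewards = train_info["hist_stats"]["episode_reward"][-100:]
print(sum(rewards) / len(rewards))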

# END SOLUTION

## Ex 3