## RLlib

In [18]:
# HIDDEN
import gym
import numpy as np

#### What about the learning?

Let's return to the "API" of RL:

![](img/RL-API.png)

- We've talked about the input (environment) and output (policy)
- Now let's talk about the learning itself!

#### What we'll cover

- Many, many supervised learning algorithms exist... random forests, logistic regression, neural networks, etc.
- Likewise, there are many RL algorithms.
- This is not a course on RL algorithms, though many good courses on that topic exist!
- This course is about _applying_ RL.

#### Introducing Ray RLlib

![](img/rllib-logo.png)

- In this course we'll use Ray RLlib as our "scikit-learn of reinforcement learning"
- We will look under the hood only as needed, and focus on the inputs and outputs.

#### Our first RLlib code

First, we import RLlib, which is part of the Ray project:

In [1]:
from ray import rllib

Next, we create a trainer object. 

In [2]:
trainer = rllib.agents.ppo.PPOTrainer(env="FrozenLake-v1", config={"framework" : "torch"})



#### The trainer object

In [3]:
trainer = rllib.agents.ppo.PPOTrainer(env="FrozenLake-v1", config={"framework" : "torch"})



- `PPOTrainer`: we're using the PPO algorithm
- `env="FrozenLake-v1"`: RLlib knows about OpenAI Gym environments
  - In the next module we'll learn how to make our own environments!
- `config={"framework" : "torch"}`: RLlib works with both TensorFlow and PyTorch
  - Here we can include additional hyperparameters like we would in sklearn
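
As a rough illustration of passing extra hyperparameters, a config might look like the sketch below. The key names `lr`, `gamma`, `train_batch_size` and `seed` are assumed from RLlib's common configuration, and the values are arbitrary placeholders rather than recommendations; check both against the version you have installed.

```python
# A sketch of a config with a few extra hyperparameters; values are illustrative only
config = {
    "framework": "torch",
    "lr": 1e-4,                # learning rate
    "gamma": 0.99,             # discount factor for future rewards
    "train_batch_size": 4000,  # amount of experience used per training iteration
    "seed": 0,                 # for reproducibility
}

# It would be passed the same way as the one-key config above:
# trainer = rllib.agents.ppo.PPOTrainer(env="FrozenLake-v1", config=config)
print(sorted(config))
```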

#### Using the policy

- We haven't trained the agent yet, but we can still see what it does.
- This is like calling `predict` before running `fit` with supervised learning.

In [5]:
env = gym.make("FrozenLake-v1")
obs = env.reset()
obs

0

In [6]:
# HIDDEN
env.seed(3);

In [7]:
action = trainer.compute_single_action(obs, explore=False)
action

0

- We gave the trainer our initial observation, 0, and it recommended action 0 (left).
- This action came from the initialized **policy**.
- Remember, the policy maps observations to actions.

#### Using the policy

We can see what happened after taking that action:

In [8]:
obs, reward, done, _ = env.step(action)
env.render()

  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG


Apparently, we attempted to move left but actually moved down, because we're in the slippery environment.
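
The slip can be sketched in a few lines. This is a minimal stand-in, not Gym's actual code: it assumes the documented slippery behaviour, where the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The helper names `slippery_outcomes`, `move` and `slippery_step` are hypothetical.

```python
import random

# Action codes used by FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
NAMES = {LEFT: "Left", DOWN: "Down", RIGHT: "Right", UP: "Up"}
N = 4  # 4x4 grid, observations 0..15 numbered row by row


def slippery_outcomes(action):
    """Directions the agent may actually move: intended plus the two perpendicular ones."""
    return [(action - 1) % 4, action, (action + 1) % 4]


def move(obs, direction):
    """Apply a direction to an observation, staying in place at the grid edge."""
    row, col = divmod(obs, N)
    if direction == LEFT:
        col = max(col - 1, 0)
    elif direction == DOWN:
        row = min(row + 1, N - 1)
    elif direction == RIGHT:
        col = min(col + 1, N - 1)
    elif direction == UP:
        row = max(row - 1, 0)
    return row * N + col


def slippery_step(obs, action, rng=random):
    """Pick one of the three possible directions uniformly at random and move."""
    return move(obs, rng.choice(slippery_outcomes(action)))


# From the start (observation 0), attempting Left can end in observation 0 or 4
possible = sorted({move(0, d) for d in slippery_outcomes(LEFT)})
print([NAMES[d] for d in slippery_outcomes(LEFT)], possible)
```

Under these assumptions, attempting Left from the start either leaves the agent where it is (the Left and Up outcomes hit the edge of the grid) or slides it down to observation 4, which matches the render above.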

#### Using the policy

We can run the **observation-policy-action loop** for multiple time steps:

In [9]:
for i in range(3):
    action = trainer.compute_single_action(obs, explore=False)
    obs, reward, done, _ = env.step(action)
    env.render()

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG


This is similar to what you saw in the previous section, except the policy is coming from RLlib instead of a fixed Python dictionary.
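
The loop has the same shape regardless of where the policy comes from. Here is a minimal sketch, assuming only that the environment offers `reset()` and a `step(action)` that returns `(obs, reward, done, info)` as in the cells above. `TinyLake` is a hypothetical, non-slippery stand-in so the snippet runs without Gym or RLlib, and `run_episode` is an illustrative helper, not an RLlib function.

```python
class TinyLake:
    """Hypothetical non-slippery 4x4 lake using the same map as FrozenLake-v1."""

    MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
    MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # Left, Down, Right, Up

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        row, col = divmod(self.pos, 4)
        d_row, d_col = self.MOVES[action]
        row = min(max(row + d_row, 0), 3)  # clamp to the grid
        col = min(max(col + d_col, 0), 3)
        self.pos = row * 4 + col
        tile = self.MAP[row][col]
        reward = 1.0 if tile == "G" else 0.0
        done = tile in "HG"  # holes and the goal end the episode
        return self.pos, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Observation-policy-action loop: returns (total reward, number of steps)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)  # the policy maps an observation to an action
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# A fixed dictionary policy that walks down, then right, avoiding the holes
path_policy = {0: 1, 4: 1, 8: 2, 9: 1, 13: 2, 14: 2}
print(run_episode(TinyLake(), path_policy.get))
```

With RLlib, the `policy` argument could instead be `lambda obs: trainer.compute_single_action(obs, explore=False)`.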

#### Training

- So far our policy was just a random/arbitrary initialization.
- What we want is to train it _based on experience interacting with the environment_.
- In order to do this, RLlib will _play through many episodes_ and learn as it goes.

In [10]:
train_info = trainer.train()

- Note that, unlike sklearn, here we don't provide the dataset.
- We already gave it the environment during initialization, and it uses the environment to generate data.

#### Training iterations

- In fact, what we just did was one _iteration_ of training.
- RLlib will play through a bunch of episodes per iteration, depending on its hyperparameters.

In [11]:
len(train_info["hist_stats"]["episode_lengths"])

527

Looks like it ran 527 episodes in that one iteration.

#### RL mindset: data generation

- This is a key departure from the supervised learning mindset
- In SL, we take a fixed amount of data and train for some number of iterations
- In RL, more iterations means more training _on more data_ because we learn from the environment as we interact with it
- If you only play one episode, you might never see observation 10, so how can you learn what to do given observation 10?
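
To make the last point concrete, the sketch below counts how many distinct observations are encountered as more episodes are played. It rests on simplifying assumptions: a uniformly random policy stands in for the untrained agent, and the hypothetical helpers `random_episode` and `coverage` use a non-slippery copy of the 4x4 map rather than the real Gym environment.

```python
import random

MAP = "SFFFFHFHFFFHHFFG"  # FrozenLake-v1 4x4 map, flattened row by row
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # Left, Down, Right, Up


def random_episode(rng, max_steps=100):
    """Play one episode with uniformly random actions; return the observations seen."""
    pos, seen = 0, {0}
    for _ in range(max_steps):
        d_row, d_col = MOVES[rng.randrange(4)]
        row, col = divmod(pos, 4)
        row = min(max(row + d_row, 0), 3)  # clamp to the grid
        col = min(max(col + d_col, 0), 3)
        pos = row * 4 + col
        seen.add(pos)
        if MAP[pos] in "HG":  # a hole or the goal ends the episode
            break
    return seen


def coverage(n_episodes, seed=0):
    """Set of distinct observations encountered across n_episodes."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_episodes):
        seen |= random_episode(rng)
    return seen


for n in (1, 10, 100, 1000):
    print(n, "episodes ->", len(coverage(n)), "of 16 observations seen")
```

Because the same seed is reused, each larger run extends the smaller one, so the number of observations seen can only stay the same or grow as episodes are added.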

#### Training info: episode lengths

Let's look at the lengths of the last 100 episodes we played:

In [12]:
print(train_info["hist_stats"]["episode_lengths"][-100:])

[2, 5, 12, 3, 6, 7, 4, 3, 21, 20, 10, 11, 2, 2, 20, 12, 4, 12, 3, 4, 4, 4, 10, 4, 2, 7, 8, 6, 23, 9, 8, 11, 32, 3, 2, 8, 8, 5, 4, 3, 2, 10, 5, 6, 6, 10, 5, 14, 5, 13, 13, 12, 9, 8, 6, 13, 7, 16, 5, 3, 11, 18, 2, 4, 6, 3, 17, 18, 4, 5, 7, 15, 10, 5, 5, 5, 5, 12, 18, 4, 10, 7, 9, 4, 4, 4, 7, 2, 7, 3, 8, 18, 12, 9, 2, 11, 5, 11, 15, 5]


- Remember that an episode ends when `.step()` returns `True` for the `done` flag.
- We see some very short episodes, where the agent fell into a hole right away.

#### Training info: episode rewards

- For those longer episodes, did the agent reach the goal?
- To assess this, we can print out the last 100 _episode rewards_:

In [13]:
print(train_info["hist_stats"]["episode_reward"][-100:])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


In [14]:
print(sum(train_info["hist_stats"]["episode_reward"][-100:]))

0.0


- This is not very impressive. Let's keep training.
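
Rather than reading the raw lists, the two statistics can be condensed into a short summary. This is a sketch under stated assumptions: `summarize` is a hypothetical helper, the input only needs to be shaped like `train_info["hist_stats"]`, and the example values are made up for illustration rather than taken from a training run. Since FrozenLake gives a reward of 1 for reaching the goal and 0 otherwise, the fraction of episodes with positive reward is the success rate.

```python
def summarize(hist_stats, last_n=100):
    """Summarize the most recent episodes from a dict shaped like train_info["hist_stats"]."""
    lengths = hist_stats["episode_lengths"][-last_n:]
    rewards = hist_stats["episode_reward"][-last_n:]
    return {
        "episodes": len(rewards),
        "mean_length": sum(lengths) / len(lengths),
        "success_rate": sum(r > 0 for r in rewards) / len(rewards),
    }


# Hypothetical values for illustration, not output from a real training run
example = {
    "episode_lengths": [2, 5, 12, 3, 6],
    "episode_reward": [0.0, 0.0, 1.0, 0.0, 0.0],
}
print(summarize(example))
```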

#### More training

In [15]:
for i in range(10):
    train_info = trainer.train()

In [16]:
print(sum(train_info["hist_stats"]["episode_reward"][-100:]))

5.0


Nice! We're improving!
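
To watch the improvement happen instead of checking only at the end, the metric can be recorded after every iteration. The sketch below assumes only that the trainer has a `train()` method returning a dict with `hist_stats` as above. `track_progress` is a hypothetical helper, and `FakeTrainer` is a stand-in that returns made-up numbers so the snippet runs without RLlib.

```python
def track_progress(trainer, n_iters, last_n=100):
    """Call trainer.train() n_iters times, recording total reward over the last_n episodes."""
    history = []
    for _ in range(n_iters):
        info = trainer.train()
        history.append(sum(info["hist_stats"]["episode_reward"][-last_n:]))
    return history


class FakeTrainer:
    """Hypothetical stand-in returning made-up results; a real PPOTrainer would go here."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        self.iteration += 1
        # Made-up rewards: pretend one more of the last 100 episodes succeeds each iteration
        rewards = [0.0] * (100 - self.iteration) + [1.0] * self.iteration
        return {"hist_stats": {"episode_reward": rewards}}


print(track_progress(FakeTrainer(), 3))
```

With the real trainer, `track_progress(trainer, 10)` would replace the plain loop above and return one number per iteration.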

In [19]:
# TODO: start with non-slippery I think, then move to slippery? That'll get us better results.


#### Evaluation

- Talk about train/test/etc.

In [17]:
# HIDDEN
from ipywidgets import Output
from IPython import display
import time

## Ex 1

## Ex 2

## Ex 3