# RLlib

> __Reinforcement Learning__ is a family of techniques that train *agents* to act in an *environment* to maximize *reward*. Famous examples include agents that can play chess, Go, or Atari games, but the field is hot because those agents can also be robots learning to do work, autonomous vehicles driving, or even virtual salesmen learning to get the best price possible from a customer.

Ray provides RLlib as a high-level library that encapsulates both distributed (clustered) training and many popular reinforcement learning algorithms (the code that turns interactions and rewards into models and policies).
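As a rough illustration of what "turning interactions and rewards into a policy" means, here is a minimal, library-free sketch. It is neither RLlib code nor Deep Q-Learning: the two-armed bandit environment, its payout probabilities, and the names `pull` and `run_bandit` are hypothetical, chosen only to show the act / observe / update loop.

```python
import random

def pull(arm, rng, payout_probs=(0.2, 0.8)):
    """Hypothetical environment: the chosen arm pays 1.0 with its probability, else 0.0."""
    return 1.0 if rng.random() < payout_probs[arm] else 0.0

def run_bandit(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that estimates each arm's value from observed rewards."""
    rng = random.Random(seed)
    values = [0.0, 0.0]  # running average reward per arm (the learned "model")
    counts = [0, 0]      # how often each arm has been tried
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            arm = rng.randrange(2)
        else:
            arm = max(range(2), key=lambda a: values[a])
        reward = pull(arm, rng)
        counts[arm] += 1
        # Incremental mean: move the estimate toward the observed reward.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = run_bandit()
best_arm = max(range(2), key=lambda a: values[a])  # the resulting greedy "policy"
print(best_arm, counts)
```

RLlib's algorithms follow the same broad pattern, but replace the two-entry value table with neural networks and spread the interaction loop across worker processes.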

Here, to create a simple example, we'll use __Deep Q-Learning__ (a foundational deep RL algorithm) to learn OpenAI's "cart-pole" (https://gym.openai.com/envs/CartPole-v1/) environment, which you can visualize like this:

<video src='images/cpv1.mp4' controls='true'>

This example, and the lab, are based on the demo in Dean Wampler's excellent intro paper, "What is Ray?" on O'Reilly Safari Online: https://learning.oreilly.com/library/view/what-is-ray/9781492085768/

In [None]:
from ray.rllib.algorithms.dqn import DQNConfig

config = (  # 1. Configure the algorithm,
    DQNConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
    .framework("torch")
    .training(model={"fcnet_hiddens": [64, 64]})
    .evaluation(evaluation_num_workers=1)
)

algo = config.build()  # 2. build the algorithm,

#for _ in range(5):
print(algo.train())  # 3. train it,

algo.evaluate()  # 4. and evaluate it.

In [None]:
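N_ITER = 5                       # assumed value: number of training iterations
checkpoint_dir = 'tmp/dqn/cart'  # assumed value: directory for saved checkpoints
fmt = '{:3d},{:8.4f},{:8.4f},{:8.4f}'
last_checkpoint = ''
for n in range(N_ITER):
    result = algo.train()  # one training iteration of the DQN built above
    # Use distinct names so the built-ins min() and max() are not shadowed.
    reward_min  = result['episode_reward_min']
    reward_mean = result['episode_reward_mean']
    reward_max  = result['episode_reward_max']
    last_checkpoint = algo.save(checkpoint_dir)
    print(fmt.format(n, reward_min, reward_mean, reward_max))
print(f'last checkpoint: {last_checkpoint}')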

__Note__: If you've worked with RL and OpenAI gym before, you may realize these are not particularly impressive numbers, and not a particularly impressive algorithm.

Don't worry: __Ray RLlib__ includes a variety of more powerful algorithms which achieve better results. We'll try one of them, Proximal Policy Optimization (PPO), in the lab exercise.

## RLlib Usage

How you approach RLlib depends on whether your current work leans toward research or toward applied use.

Today, we're focused on getting started with applying RLlib using its built-in architectural patterns and algorithms, but customizing
* the environment (environment representation, action choices, rewards)
* (optionally) the model

Customization options are documented at https://docs.ray.io/en/master/rllib-env.html. We'll start with the simplest custom environment integration, which is based on one of the examples in the Ray source.

In this scenario, we have
* a simple, linear corridor of locations (states) where the agent starts at one end and gets a reward if/when it reaches the other end
* two possible actions: forward (away from the starting state) or backward
* a random positive reward upon reaching the goal, and a small loss (negative reward) for actions that do not reach the goal
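
The corridor dynamics listed above can be sketched in plain Python before bringing in any RL library. This is only a sketch under stated assumptions: the helper names `corridor_step`, `rollout`, `always_forward`, and `random_walk` are hypothetical, and the reward values (a goal reward drawn uniformly from [0, 2) and a -0.1 penalty for every other step) follow the `SimpleCorridor` class in this section.

```python
import random

STEP_PENALTY = -0.1  # small loss for any action that does not reach the goal

def corridor_step(pos, action, end_pos, rng):
    """Apply one action (0 = backward, 1 = forward); return (new_pos, reward, done)."""
    if action == 0 and pos > 0:
        pos -= 1
    elif action == 1:
        pos += 1
    done = pos >= end_pos
    # Random positive reward at the goal, small penalty otherwise.
    reward = rng.random() * 2 if done else STEP_PENALTY
    return pos, reward, done

def rollout(policy, end_pos=5, seed=0, max_steps=1000):
    """Run one episode from position 0; return (total_reward, steps_taken)."""
    rng = random.Random(seed)
    pos, total, steps, done = 0, 0.0, 0, False
    while not done and steps < max_steps:
        pos, reward, done = corridor_step(pos, policy(pos, rng), end_pos, rng)
        total += reward
        steps += 1
    return total, steps

def always_forward(pos, rng):
    """Hand-written baseline policy: always move toward the goal."""
    return 1

def random_walk(pos, rng):
    """Untrained policy: pick forward or backward uniformly at random."""
    return rng.randrange(2)

fwd_total, fwd_steps = rollout(always_forward)
rnd_total, rnd_steps = rollout(random_walk)
print(fwd_steps, round(fwd_total, 3))
print(rnd_steps, round(rnd_total, 3))
```

With a corridor of length 5, the always-forward policy finishes in exactly 5 steps, so its return is four step penalties plus the random goal reward. That gives a simple baseline against which a trained agent's episode rewards can be compared.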

The simplest way to start with a custom environment is to subclass the OpenAI `gym.Env` class, as this protocol is natively supported by Ray. The base class documentation is at https://github.com/openai/gym/blob/master/gym/core.py and there is some intro documentation at https://gym.openai.com/docs/#getting-started-with-gym

In [None]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import os
import random

from ray.rllib.env.env_context import EnvContext
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray.rllib.utils.framework import try_import_torch
from ray.tune.logger import pretty_print

torch, nn = try_import_torch()

class SimpleCorridor(gym.Env):
    """Example of a custom env in which you have to walk down a corridor.
    You can configure the length of the corridor via the env config."""

    def __init__(self, config: EnvContext):
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(
            0.0, self.end_pos, shape=(1, ), dtype=np.float64)
        # Set the seed. This is only used for the final (reach goal) reward.
        self.seed(config.worker_index * config.num_workers)

    def reset(self):
        self.cur_pos = 0
        return [self.cur_pos]

    def step(self, action):
        assert action in [0, 1], action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = self.cur_pos >= self.end_pos
        # Produce a random reward when we reach the goal.
        return [self.cur_pos], \
            random.random() * 2 if done else -0.1, done, {}

    def seed(self, seed=None):
        random.seed(seed)

In our previous example, we used a shorthand configuration for the neural net model, specifying the dimensions of a fully-connected (multilayer perceptron style) network.

Here, we'll show how to provide a custom model. To keep things simple, this custom model actually just delegates to the same fully-connected network helper code, but the important thing is it shows the "integration glue" for hooking in any custom model, based on the standard PyTorch `nn.Module` pattern.

In [None]:
class TorchCustomModel(TorchModelV2, nn.Module):
    """Example of a PyTorch custom model that just delegates to a fc-net."""

    def __init__(self, obs_space, action_space, num_outputs, model_config,
                 name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)

        self.torch_sub_model = TorchFC(obs_space, action_space, num_outputs,
                                       model_config, name)

    def forward(self, input_dict, state, seq_lens):
        input_dict["obs"] = input_dict["obs"].float()
        fc_out, _ = self.torch_sub_model(input_dict, state, seq_lens)
        return fc_out, []

    def value_function(self):
        return torch.reshape(self.torch_sub_model.value_function(), [-1])

Now that we have an environment and a model, we're ready to have RLlib train an agent.

In [None]:
from ray.rllib.algorithms.dqn import DQN

ModelCatalog.register_custom_model("my_model", TorchCustomModel)

algo = DQN.get_default_config().environment(SimpleCorridor, env_config={'corridor_length':5}) \
                               .framework('torch') \
                               .training(model={"custom_model": "my_model", "vf_share_layers": True,}) \
                               .build()

'''
# Disabled by default: remove the surrounding triple quotes to run a longer training loop.
fmt = '{:3d},{:8.4f},{:8.4f},{:8.4f}'
for n in range(20):  # training iterations
    result = algo.train()
    reward_min  = result['episode_reward_min']
    reward_mean = result['episode_reward_mean']
    reward_max  = result['episode_reward_max']
    print(fmt.format(n, reward_min, reward_mean, reward_max))

print(pretty_print(result))
'''

In [None]:
algo.train()

## Lab: Ray RLlib and PPO

PPO, or Proximal Policy Optimization, is a more powerful (and more complicated) algorithm than the DQN we've looked at.

But thanks to Ray's implementations, you can swap it in easily.

__We'll redo the earlier OpenAI Gym Cart-Pole problem, but swap in PPO for the algorithm__

Note that we import `ppo` from `ray.rllib.algorithms`, the same package that provided DQN above (older Ray releases exposed it under `ray.rllib.agents`).

By replacing "DQN" with "PPO" you can quickly get better results.

> Interested in PPO details? Check out this writeup: https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12

In [None]:
import ray
import ray.rllib.algorithms.ppo as ppo

In [None]:
# Copy the code from the OpenAI gym example, but replace references to DQN with references to PPO

# HINT: try 10 iterations -- that will be plenty for PPO to solve the problem

## Takeaways

Of course, getting an agent to walk down a 5-step corridor is not the same as solving a complex business or research problem.

The goal today is to
* explain how RLlib integrates the various ingredients of a distributed deep reinforcement learning system
* enable you to take the first steps toward specifying your own environment(s) and seeing how well the agent can (or can't) learn

__Next steps__

There is a good walkthrough of RLlib environment integration starting with the basics at https://medium.com/distributed-computing-with-ray/anatomy-of-a-custom-environment-for-rllib-327157f269e5