# Introduction to Reinforcement Learning

_Reinforcement Learning_ is the category of machine learning that focuses on training one or more _agents_ to achieve maximal _rewards_ while operating in an environment. This lesson discusses the core concepts of RL, while subsequent lessons explore RLlib in depth. We'll use two examples with exercises to give you a taste of RL. If you already understand RL concepts, you can either skim this lesson or skip to the [next lesson](02-About-RLlib.ipynb).

RL is a deep topic and a focus of intense research. We can only scratch the surface here, so let's begin with some RL references for further information:

## Books, Videos, etc.

The RISE Lab and U.C. Berkeley have many useful tutorials, videos, etc.:

* [RISE Lab YouTube channel](https://www.youtube.com/channel/UCP2-wiA964pif0secCpPbfw/videos)
* [RISE Camp 2019](https://risecamp.berkeley.edu/)

Several blog posts and series provide concise introductions to RL:

* [Reinforcement Learning Explained](https://www.oreilly.com/radar/reinforcement-learning-explained/). A gentle introduction to the ideas of RL.
* [A Beginner's Guide to Deep Reinforcement Learning](https://pathmind.com/wiki/deep-reinforcement-learning). From Pathmind, which uses RLlib for its products and services. Lots of good references at the end of this post.
* [An Outsider's Tour of Reinforcement Learning](http://www.argmin.net/2018/06/25/outsider-rl/). A series of posts on technical aspects of RL.

Several books are available on RL:

* [*Practical Reinforcement Learning*](https://www.endtoend.ai/practical-rl/), by Seungjae Ryan Lee.
* [*Hands-On Reinforcement Learning with Python*](https://learning.oreilly.com/library/view/hands-on-reinforcement-learning/9781788836524/), by Sudharsan Ravichandiran, Packt (2018-06-01)
* [*Hands-On Reinforcement Learning for Games*](https://www.packtpub.com/game-development/hands-on-game-ai-with-python), by Micheal Lanham, Packt (2020-01-03)
* [*Grokking Deep Reinforcement Learning*](https://www.manning.com/books/grokking-deep-reinforcement-learning), by Miguel Morales, Manning (Summer 2020 - previews available). Deep RL means using deep learning as part of the training system.
* [*Reinforcement Learning: An Introduction*](http://incompleteideas.net/book/bookdraft2018jan1.pdf), by Richard S. Sutton and Andrew G. Barto, MIT Press (2018-01-01). This is the definitive textbook. Deep, but highly recommended. See this independent [repo of Python code](https://github.com/Pulkit-Khandelwal/Reinforcement-Learning-Notebooks).

Other video tutorials and academic courses with materials available online:

* [University College London COMPM050/COMPGI13](https://www.davidsilver.uk/teaching/)
* [UC Berkeley CS 285](http://rail.eecs.berkeley.edu/deeprlcourse/)
* [CS 294 Deep Reinforcement Learning, Spring 2017](http://rll.berkeley.edu/deeprlcourse/)
* [A Tutorial on Reinforcement Learning I - YouTube](https://www.youtube.com/watch?v=fIKkhoI1kF4)
* [A Tutorial on Reinforcement Learning II - YouTube](https://www.youtube.com/watch?v=8hK0NnG_DhY)
* [ICML 2017 Tutorial](https://sites.google.com/view/icml17deeprl)

## What is Reinforcement Learning?

> **GOAL:** The goal of this section is to introduce the basic concepts of RL, specifically the _Markov Decision Process_ abstraction, and to show its use in Python.

Consider the following image:

![RL Concepts](../images/RL-concepts.png)

In RL, an **agent** interacts with an **environment** to maximize a **reward**. The agent makes **observations** about the **state** of the environment and takes **actions** that it believes will maximize the long-term reward. However, at any given moment, the agent can only observe the immediate reward, although it can remember past rewards. So, the training process usually involves lots and lots of replays of the game, of the robot simulator traversing a virtual space, etc., so the agent can learn from repeated trials which decisions/actions work best to maximize the long-term reward.

RL has many applications, most famously these:

![Alpha Go](../images/alpha-go.png)
![Game](../images/game.png)
![Robot Arm](../images/robot-arm.gif)
![Walking Man](../images/walking-man.gif)
![Autonomous Vehicle](../images/autonomous-vehicle.jpeg)
![Two-legged Robot](../images/two-legged-robot.jpeg)


More general industry applications that are emerging include the following:

* **Process optimization:** industrial processes (factories, pipelines) and other business processes, routing problems, cluster optimization.
* **Ad serving and recommendations:** Some of the traditional methods, including _collaborative filtering_, are hard to scale to very large data sets. Can RL train an agent to do an effective job more efficiently than traditional methods?
* **Finance:** Markets are time-oriented _environments_ and automated trading systems are the _agents_.

### Markov Decision Processes

Let's understand RL in more technical terms.

> **The key abstraction in reinforcement learning is the Markov Decision Process (MDP).**

An MDP models sequential interactions with an external environment. It consists of the following:

- a **state space**
- a set of **actions**
- a **transition function** that describes the probability of being in a state $s'$ at time $t+1$ given that the MDP was in state $s$ at time $t$ and action $a$ was taken.
- a **reward function**, which determines the reward received at time $t$.
- a **discount factor** $\gamma$, which is used when calculating the cumulative reward from the rewards received after each action is taken. It is used to "discount" rewards received further in the future relative to more immediate rewards. The value is between 0 and 1.

More details about MDP are available [here](https://en.wikipedia.org/wiki/Markov_decision_process). Note what we said in the third bullet, that the new state only depends on the previous state and the action taken. The assumption is that we can simplify our effort by ignoring all the previous states except the last one and still achieve good results. This is known as the [Markov property](https://en.wikipedia.org/wiki/Markov_property).
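To make the components listed above concrete, here is a minimal sketch of a two-state MDP in plain Python. The state names, action names, probabilities, and rewards are all invented for illustration and do not come from any library. Note that `sample_next_state` needs only the current state and action, which is the Markov property in code form.

```python
import random

# Hypothetical two-state MDP, invented purely for illustration.
# TRANSITIONS[state][action] is a list of (next_state, probability) pairs.
TRANSITIONS = {
    'cool': {'slow': [('cool', 1.0)],
             'fast': [('cool', 0.5), ('hot', 0.5)]},
    'hot':  {'slow': [('cool', 0.5), ('hot', 0.5)],
             'fast': [('hot', 1.0)]},
}

# REWARDS[state][action] is the reward for taking the action in the state.
REWARDS = {
    'cool': {'slow': 1.0, 'fast': 2.0},
    'hot':  {'slow': 1.0, 'fast': -10.0},
}

def sample_next_state(state, action, rng=random):
    """Sample s' from the transition function, given only (s, a)."""
    next_states, probs = zip(*TRANSITIONS[state][action])
    return rng.choices(next_states, weights=probs, k=1)[0]

# Walk through a few steps, always choosing the 'fast' action.
mdp_state = 'cool'
for t in range(5):
    mdp_action = 'fast'
    mdp_reward = REWARDS[mdp_state][mdp_action]
    next_state = sample_next_state(mdp_state, mdp_action)
    print(t, mdp_state, mdp_action, mdp_reward, next_state)
    mdp_state = next_state
```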

**NOTE:** Reinforcement learning algorithms are often applied to problems that don't strictly fit into the MDP framework. In particular, situations in which the state of the environment is not fully observed lead to violations of the MDP assumption. Nevertheless, RL algorithms can be applied anyway.

### Policies

A **policy** is a function that takes in a **state** and returns an **action**. A policy may be stochastic (i.e., it may sample from a probability distribution) or it can be deterministic.
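As a small illustration of the two kinds of policy, the sketch below defines one deterministic and one stochastic policy as ordinary Python functions. The decision rule and the 0.8/0.2 probabilities are arbitrary assumptions chosen only to show the shape of the functions.

```python
import random

def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return 0 if state[0] < 0 else 1

def stochastic_policy(state, rng=random):
    """Samples an action from a state-dependent probability distribution."""
    # Assumed rule for illustration: prefer action 1 when state[0] >= 0.
    p_action_1 = 0.8 if state[0] >= 0 else 0.2
    return 1 if rng.random() < p_action_1 else 0

example_state = [0.5, 0.0]
print(deterministic_policy(example_state))
print([stochastic_policy(example_state) for _ in range(10)])
```

Calling the deterministic policy twice on the same state always gives the same action, while the stochastic policy may return different actions on repeated calls.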

The **goal of reinforcement learning** is to learn a **policy** for maximizing the cumulative reward in an MDP. That is, we wish to find a policy $\pi$ which solves the following optimization problem

\begin{equation}
\arg\max_{\pi} \sum_{t=1}^T \gamma^t R_t(\pi),
\end{equation}

where $T$ is the number of steps taken in the MDP (this is a random variable and may depend on $\pi$) and $R_t$ is the reward received at time $t$ (also a random variable which depends on $\pi$). Note the use of $\gamma$, the discount factor.
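For a single finished episode, the sum inside the objective can be computed directly. The helper below is a minimal sketch (the name `discounted_return` is our own) that follows the indexing of the equation above, with $t$ starting at 1.

```python
def discounted_return(rewards, gamma):
    """Compute the sum over t = 1..T of gamma**t * R_t for a list of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards, start=1))

episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, gamma=1.0))  # 3.0
print(discounted_return(episode_rewards, gamma=0.5))  # 0.875
```

With $\gamma = 1$ every reward counts equally, while with $\gamma = 0.5$ each later reward contributes half as much as the one before it.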

A number of algorithms are available for solving reinforcement learning problems. Several of the most widely known are [value iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Value_iteration), [policy iteration](https://en.wikipedia.org/wiki/Markov_decision_process#Policy_iteration), and [Q learning](https://en.wikipedia.org/wiki/Q-learning), which we'll explore.
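To give a feel for one of these algorithms, here is a minimal sketch of value iteration on a tiny, made-up MDP. It assumes an infinite-horizon discounted setting with known transition probabilities and rewards; the two states, two actions, and all numbers are hypothetical. Each sweep applies the update $V(s) \leftarrow \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]$ until the values stop changing.

```python
# Hypothetical two-state MDP, invented purely for illustration.
STATES = ['cool', 'hot']
ACTIONS = ['slow', 'fast']

# P[(s, a)] maps each possible next state to its probability.
P = {
    ('cool', 'slow'): {'cool': 1.0},
    ('cool', 'fast'): {'cool': 0.5, 'hot': 0.5},
    ('hot', 'slow'): {'cool': 0.5, 'hot': 0.5},
    ('hot', 'fast'): {'hot': 1.0},
}

# R[(s, a)] is the reward for taking action a in state s.
R = {
    ('cool', 'slow'): 1.0,
    ('cool', 'fast'): 2.0,
    ('hot', 'slow'): 1.0,
    ('hot', 'fast'): -10.0,
}

def q_value(V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

def value_iteration(gamma=0.9, tol=1e-8, max_iters=10000):
    """Return (state values, greedy policy) for the MDP defined above."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_V = {s: max(q_value(V, s, a, gamma) for a in ACTIONS)
                 for s in STATES}
        delta = max(abs(new_V[s] - V[s]) for s in STATES)
        V = new_V
        if delta < tol:
            break
    # Read off the greedy policy from the converged values.
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a, gamma))
              for s in STATES}
    return V, policy

values, greedy_policy = value_iteration()
print(values)
print(greedy_policy)
```

For this toy problem the greedy policy goes `fast` while `cool` and `slow` while `hot`. Real problems such as CartPole have continuous states and no known transition table, which is why sampling-based methods are used instead.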

### RL in Python

The `gym` Python module provides MDP interfaces to a variety of simulators. For example, the CartPole environment interfaces with a simple simulator that simulates the physics of balancing a pole on a cart. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0. Here is an image from that website:

![Cart Pole](../images/Cart-Pole.png)

This example fits into the MDP framework as follows.
- The **state** consists of the position and velocity of the cart (moving in one dimension from left to right) as well as the angle and angular velocity of the pole that is balancing on the cart.
- The **actions** are to push the cart to the left or to the right with a fixed force.
- The **transition function** is deterministic and is determined by simulating physical laws.
- The **reward function** is a constant 1 as long as the pole is upright, and 0 once the pole has fallen over. Therefore, maximizing the reward means balancing the pole for as long as possible.
- The **discount factor** in this case can be taken to be 1.

More information about the `gym` Python module is available at https://gym.openai.com/.

In [2]:
import gym
import numpy as np

The code below illustrates how to create and manipulate MDPs in Python. An MDP can be created by calling `gym.make`. Gym environments are identified by names like `CartPole-v0`. A **catalog of built-in environments** can be found at https://gym.openai.com/envs.

In [3]:
env = gym.make('CartPole-v0')
print('Created env:', env)

Created env: <TimeLimit<CartPoleEnv<CartPole-v0>>>


Reset the state of the MDP by calling `env.reset()`. This call returns the initial state of the MDP.

In [4]:
state = env.reset()
print('The starting state is:', state)

The starting state is: [ 0.01418265 -0.02770248 -0.00307166 -0.0419574 ]


Recall that the state is the position of the cart, its velocity, the angle of the pole, and the angular velocity of the pole.

The `env.step` method takes an action (in the case of the CartPole environment, the appropriate actions are 0 or 1, for moving left or right). It returns a tuple of four things:
1. the new state of the environment
2. a reward
3. a boolean indicating whether the simulation has finished
4. a dictionary of miscellaneous extra information

In [5]:
# Simulate taking an action in the environment. Appropriate actions for
# the CartPole environment are 0 and 1 (for moving left and right).
action = 0
state, reward, done, info = env.step(action)
print(state, reward, done, info)

[ 0.01510358 -0.38331264 -0.02109857  0.58458855] 1.0 False {}
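The `reset`/`step` contract described above can be mimicked by a small hand-written class. The sketch below is a hypothetical toy environment (it is not part of `gym` and does not subclass `gym.Env`); it only illustrates the shape of the four-element tuple that `step` returns.

```python
class CountdownEnv:
    """Hypothetical toy environment mimicking the reset/step interface."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self):
        """Restore the initial state and return it."""
        self.remaining = self.start
        return self.remaining

    def step(self, action):
        """Apply an action and return (state, reward, done, info)."""
        # Action 1 decrements the counter; any other action does nothing.
        if action == 1:
            self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.remaining <= 0
        info = {}
        return self.remaining, reward, done, info

toy_env = CountdownEnv(start=3)
toy_state = toy_env.reset()
toy_done = False
while not toy_done:
    toy_state, toy_reward, toy_done, toy_info = toy_env.step(1)
    print(toy_state, toy_reward, toy_done, toy_info)
```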


A **rollout** is a simulation of a policy in an environment. It alternates between choosing actions using some policy and taking those actions in the environment.

The code below performs a rollout in a given environment. It takes **random actions** until the simulation has finished and returns the cumulative reward.

In [7]:
def random_rollout(env):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # Keep looping as long as the simulation has not finished.
    while not done:
        # Choose a random action (either 0 or 1).
        action = np.random.choice([0, 1])
        
        # Take the action in the environment.
        state, reward, done, _ = env.step(action)
        
        # Update the cumulative reward.
        cumulative_reward += reward
    
    # Return the cumulative reward.
    return cumulative_reward
    
reward = random_rollout(env)
print(reward)
reward = random_rollout(env)
print(reward)

14.0
24.0


### Exercise 1

Finish implementing the `rollout_policy` function below, which should take an environment *and* a policy. The *policy* is a function that takes in a *state* and returns an *action*. The main difference is that instead of choosing a **random action**, the action should be chosen **with the policy** (as a function of the state).

> **Note:** Exercise solutions for this tutorial can be found [here](solutions/solutions.ipynb).

In [8]:
def rollout_policy(env, policy):
    state = env.reset()
    
    done = False
    cumulative_reward = 0

    # EXERCISE: Fill out this function by copying the 'random_rollout' function
    # and then modifying it to choose the action using the policy.
    raise NotImplementedError

    # Return the cumulative reward.
    return cumulative_reward

def sample_policy1(state):
    return 0 if state[0] < 0 else 1

def sample_policy2(state):
    return 1 if state[0] < 0 else 0

reward1 = np.mean([rollout_policy(env, sample_policy1) for _ in range(100)])
reward2 = np.mean([rollout_policy(env, sample_policy2) for _ in range(100)])

print('The first sample policy got an average reward of {}.'.format(reward1))
print('The second sample policy got an average reward of {}.'.format(reward2))

assert 5 < reward1 < 15, ('Make sure that rollout_policy computes the action '
                          'by applying the policy to the state.')
assert 25 < reward2 < 35, ('Make sure that rollout_policy computes the action '
                           'by applying the policy to the state.')

NotImplementedError: 

## Proximal Policy Optimization

> **GOAL:** The goal of this section is to demonstrate how to use the proximal policy optimization (PPO) algorithm, a popular way to develop a policy. 

We'll use **RLlib** this time with relatively little explanation for now, but explore it in greater depth in subsequent lessons. For more on RLlib, see the documentation at http://rllib.io.

PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO) described in https://arxiv.org/abs/1502.05477. [This OpenAI post](https://openai.com/blog/openai-baselines-ppo/) provides a more accessible introduction to PPO.

PPO works in two phases. In the first phase, a large number of rollouts are performed (in parallel). The rollouts are then aggregated on the driver and a surrogate optimization objective is defined based on those rollouts. In the second phase, SGD (_stochastic gradient descent_) is used to find the policy that maximizes that objective, with a penalty term for diverging too much from the current policy.

![PPO](../images/ppo.png)

([source](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/))
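As a rough illustration of the objective described above, the sketch below computes a KL-penalized surrogate for a discrete action space in plain Python. It is not RLlib's implementation. The function names, the `kl_coeff` argument, and the sample numbers are assumptions made for illustration, and the advantages are taken as given rather than estimated.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_surrogate(old_probs, new_probs, actions, advantages, kl_coeff):
    """Mean of ratio * advantage, minus kl_coeff * mean KL(old || new).

    old_probs[i] and new_probs[i] are the action distributions that the
    current and candidate policies assign at the i-th sampled state.
    """
    n = len(actions)
    gain = sum(
        new_probs[i][a] / old_probs[i][a] * adv
        for i, (a, adv) in enumerate(zip(actions, advantages))
    ) / n
    penalty = sum(
        kl_divergence(old_probs[i], new_probs[i]) for i in range(n)
    ) / n
    return gain - kl_coeff * penalty

# Two sampled states; action 0 was taken in both (made-up numbers).
old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.4, 0.6]]
print(penalized_surrogate(old, new, actions=[0, 0],
                          advantages=[1.0, -1.0], kl_coeff=0.2))
```

The candidate policy gains by making the good action more likely and the bad action less likely, but the KL term subtracts a small amount because it has moved away from the current policy.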

> **NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `num_gpus` field of the `config` dictionary. Hence, for this to work, you must be using a machine that has one or more GPUs.

In [1]:
# import gym  # imported above already, but listed here for completeness
import ray
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

In [5]:
# Start up Ray. This must be done before we instantiate any RL agents.
ray.init(num_cpus=3, ignore_reinit_error=True, log_to_driver=False)

2020-05-04 16:03:50,980	INFO resource_spec.py:212 -- Starting Ray with 3.66 GiB memory available for workers and up to 1.85 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-04 16:03:51,329	INFO services.py:1148 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:39687',
 'object_store_address': '/tmp/ray/session_2020-05-04_16-03-50_968099_55589/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-04_16-03-50_968099_55589/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-05-04_16-03-50_968099_55589'}

The Ray Dashboard is useful for monitoring Ray:

In [6]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8266


Instantiate a PPOTrainer object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `num_sgd_iter` is the number of epochs of SGD (passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO.
- `sgd_minibatch_size` is the SGD batch size that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers.

In [7]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 1
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

agent = PPOTrainer(config, 'CartPole-v0')

2020-05-04 16:04:39,321	INFO trainer.py:428 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-05-04 16:04:39,349	INFO trainer.py:585 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-05-04 16:04:41,272	INFO trainable.py:217 -- Getting current IP.


Train the policy on the `CartPole-v0` environment for 2 training iterations. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0.

### Exercise 2

Inspect how well the policy is doing by looking for the lines that say something like the following:

```
episode_len_mean: 22.262569832402235
episode_reward_mean: 22.262569832402235
```

This output indicates how much reward the policy is receiving and how many time steps of the environment the policy ran. The maximum possible reward for this problem is 200. The reward and trajectory length are very close because the agent receives a reward of one for every time step that it survives. However, this is specific to this environment.

TODO: questions to answer.

In [11]:
for i in range(2):
    result = agent.train()
    print(pretty_print(result))

custom_metrics: {}
date: 2020-05-01_10-29-38
done: false
episode_len_mean: 22.844827586206897
episode_reward_max: 90.0
episode_reward_mean: 22.844827586206897
episode_reward_min: 9.0
episodes_this_iter: 174
episodes_total: 174
experiment_id: cfeec7c4aa004861b5d15ef47eb38fd7
hostname: DWAnyscaleMBP.local
info:
  grad_time_ms: 1353.552
  learner:
    default_policy:
      cur_kl_coeff: 0.20000000298023224
      cur_lr: 4.999999873689376e-05
      entropy: 0.6649100184440613
      entropy_coeff: 0.0
      kl: 0.02841697819530964
      model: {}
      policy_loss: -0.03499114140868187
      total_loss: 159.3486785888672
      vf_explained_var: 0.03966745361685753
      vf_loss: 159.37799072265625
  load_time_ms: 49.631
  num_steps_sampled: 4000
  num_steps_trained: 3968
  sample_time_ms: 2405.276
  update_time_ms: 444.975
iterations_since_restore: 1
node_ip: 192.168.1.149
num_healthy_workers: 1
off_policy_estimator: {}
perf:
  cpu_util_percent: 11.414583333333335
  ram_util_percent: 65.433

In [8]:
result = agent.train()
result

{'episode_reward_max': 79.0,
 'episode_reward_min': 9.0,
 'episode_reward_mean': 22.46067415730337,
 'episode_len_mean': 22.46067415730337,
 'episodes_this_iter': 178,
 'policy_reward_min': {},
 'policy_reward_max': {},
 'policy_reward_mean': {},
 'custom_metrics': {},
 'hist_stats': {'episode_reward': [27.0,
   10.0,
   28.0,
   23.0,
   12.0,
   41.0,
   21.0,
   17.0,
   17.0,
   13.0,
   14.0,
   11.0,
   20.0,
   27.0,
   40.0,
   15.0,
   21.0,
   43.0,
   11.0,
   18.0,
   14.0,
   29.0,
   14.0,
   38.0,
   13.0,
   13.0,
   15.0,
   39.0,
   12.0,
   30.0,
   34.0,
   12.0,
   21.0,
   29.0,
   18.0,
   18.0,
   32.0,
   12.0,
   14.0,
   11.0,
   20.0,
   30.0,
   28.0,
   9.0,
   18.0,
   35.0,
   10.0,
   38.0,
   13.0,
   22.0,
   20.0,
   13.0,
   30.0,
   13.0,
   18.0,
   14.0,
   13.0,
   14.0,
   14.0,
   16.0,
   16.0,
   30.0,
   13.0,
   19.0,
   26.0,
   14.0,
   18.0,
   15.0,
   19.0,
   11.0,
   58.0,
   11.0,
   44.0,
   12.0,
   19.0,
   14.0,
   12.0,
   36.

### Exercise 3

The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective. (Fewer SGD iterations and a larger batch size should help.)

In [13]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0

agent = PPOTrainer(config, 'CartPole-v0')

2020-05-01 14:15:04,808	INFO trainable.py:217 -- Getting current IP.


Train the agent and try to get a reward of 200. If it's training too slowly you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around 20 or 30 training iterations.
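One way to avoid guessing the number of iterations is to loop until the mean reward reaches a target. The helper below is a sketch under that assumption; `train_until` is our own name, and it accepts any zero-argument callable that returns a dict with an `episode_reward_mean` key, as the training results shown in this lesson do. A stand-in callable is used here so the example runs without Ray.

```python
def train_until(train_step, target_reward, max_iterations):
    """Call train_step() until the mean reward reaches target_reward.

    Returns the list of mean rewards observed, one per iteration.
    """
    history = []
    for _ in range(max_iterations):
        result = train_step()
        mean_reward = result['episode_reward_mean']
        history.append(mean_reward)
        if mean_reward >= target_reward:
            break
    return history

# Stand-in for a real training step, with made-up reward values.
fake_rewards = iter([22.0, 60.0, 150.0, 200.0, 200.0])
history = train_until(lambda: {'episode_reward_mean': next(fake_rewards)},
                      target_reward=200, max_iterations=30)
print(history)  # [22.0, 60.0, 150.0, 200.0]
```

With the trainer created above, the equivalent call would be `train_until(agent.train, target_reward=200, max_iterations=30)`.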

In [14]:
for i in range(2):
    result = agent.train()
    print(pretty_print(result))

custom_metrics: {}
date: 2020-05-01_14-15-14
done: false
episode_len_mean: 23.204678362573098
episode_reward_max: 73.0
episode_reward_mean: 23.204678362573098
episode_reward_min: 9.0
episodes_this_iter: 171
episodes_total: 171
experiment_id: b87ee6a868904eb3b2339b413a6b33e6
hostname: DWAnyscaleMBP.local
info:
  grad_time_ms: 2558.961
  learner:
    default_policy:
      cur_kl_coeff: 0.20000000298023224
      cur_lr: 4.999999873689376e-05
      entropy: 0.663716733455658
      entropy_coeff: 0.0
      kl: 0.029719049111008644
      model: {}
      policy_loss: -0.04187756031751633
      total_loss: 198.08648681640625
      vf_explained_var: 0.008818145841360092
      vf_loss: 198.1223907470703
  load_time_ms: 93.352
  num_steps_sampled: 4000
  num_steps_trained: 3968
  sample_time_ms: 5255.472
  update_time_ms: 1343.73
iterations_since_restore: 1
node_ip: 192.168.1.149
num_healthy_workers: 3
off_policy_estimator: {}
perf:
  cpu_util_percent: 64.64999999999999
  ram_util_percent: 64.357

Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model and can be used later to restore the model.

In [15]:
checkpoint_path = agent.save()
print(checkpoint_path)

/Users/deanwampler/ray_results/PPO_CartPole-v0_2020-05-01_14-15-00c1v_vfaw/checkpoint_2/checkpoint-2


Now let's use the trained policy to make predictions.

> **Note:** Here we are loading the trained policy in the same process, but in practice, this would often be done in a different process (probably on a different machine).

In [16]:
trained_config = config.copy()

test_agent = PPOTrainer(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

2020-05-01 14:15:21,909	INFO trainable.py:217 -- Getting current IP.
2020-05-01 14:15:22,010	INFO trainable.py:217 -- Getting current IP.
2020-05-01 14:15:22,012	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/PPO_CartPole-v0_2020-05-01_14-15-00c1v_vfaw/checkpoint_2/checkpoint-2
2020-05-01 14:15:22,013	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 2, '_timesteps_total': 8000, '_time_total': 12.834509372711182, '_episodes_total': 274}


Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)`, which uses the trained policy to choose an action.

### Exercise 4

Verify that the reward received roughly matches up with the reward printed in the training logs.

In [17]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

31.0


The next lesson, [02: About RLlib](02-About-RLlib.ipynb), steps back to introduce RLlib, its goals, and the capabilities it provides.