__[A Baby Robot's Guide To Reinforcement Learning](https://towardsdatascience.com/tagged/baby-robot-guide)__

# State Values and Policy Evaluation in 5 minutes

## An Introduction to Reinforcement Learning


<center><img src="images/state_values_and_policy_evaluation_5mins/State_Values_and_Policy_Evaluation_Cover_Opt.gif"/></center>

<br/>
<br/>
<br/>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/WhatIThinkAbout/BabyRobot/HEAD?labpath=Reinforcement_Learning%2FPart%201%20Summary%20-%20State%20Values%20and%20Policy%20Evaluation%20in%205%20Minutes.ipynb)

<br/>
<br/>

This is a summary of the article <b>[State Values and Policy Evaluation](https://medium.com/towards-data-science/state-values-and-policy-evaluation-ceefdd8c2369)</b>. It distils all of the key terms and theory from that article down into a single cheat-sheet that can be read in 5 minutes or less. With that in mind, we'd better get started…


<br/>
<br/>
<center><img src="images/green_babyrobot_small.gif"/></center>

<br/>
<br/>

## Reinforcement Learning
A Reinforcement Learning problem takes place in an environment that consists of multiple, independent states.
A simple example of this would be a grid world, where each square in the grid represents a state:

<br/>
<br/>
<center><img src="images/state_values_and_policy_evaluation_5mins/basic_grid.png"/></center>

### State
A unique, self-contained stage in the environment that defines the current situation. Each state is independent of previous states, so you don't need to know or remember what has happened before.

### Episode
One complete execution of the environment.

### Terminal State
The final state in an environment where the episode ends.



***

### Agent
The agent is the entity that interacts with the environment and decides how to move through it. In our case, Baby Robot is the agent.

<center><img src="images/state_values_and_policy_evaluation_5mins/agent.png"/></center>

To move from one state **_s_** to the next state **_s'_** (read as s-prime), the agent takes an action **_a_**. As a result of taking this action, it receives a reward **_r_**.

### Reward
The numerical value used to measure performance. It can be expressed as a penalty by making it negative. The more negative the reward, the worse the performance.

### Expected reward for a state-action pair
The expected reward can be thought of as the average reward and is used if the reward given for a particular action can vary.

<center><img src="images/state_values_and_policy_evaluation_5mins/expected_reward.png" style="background-color:#FFFFFF"/></center>


At time '_t-1_', starting in a particular state **_s_** and taking action **_a_**, the expected reward that's received at the next time step is a function of the current state and action.
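
As a rough illustration of this averaging, the sketch below works out the expected reward for a single state-action pair whose reward can vary. The probabilities and rewards are made-up numbers, not values taken from the Baby Robot environment.

```python
# Hypothetical outcomes for one state-action pair: (probability, reward).
# These numbers are invented purely for illustration.
outcomes = [(0.8, -1.0), (0.2, -3.0)]

# The expected reward is the probability-weighted average of the rewards.
expected_reward = sum(p * r for p, r in outcomes)
print(expected_reward)
```

Most of the time the action costs 1, but occasionally it costs 3, so the expected reward sits between the two, closer to the more likely outcome.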

### Return 'Gₜ'
The total reward accumulated over an episode from time '_t_' onwards, equal to the sum of the future rewards:

<center><img src="images/state_values_and_policy_evaluation_5mins/return.png" style="background-color:#FFFFFF"/></center>
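
A minimal sketch of this sum, assuming the rewards for one finished episode have been stored in a list. The helper name `episode_return` and the reward values are illustrative only.

```python
def episode_return(rewards, t=0):
    """Sum of the rewards received after time t.

    rewards[k] holds the reward received for the action taken at time k.
    """
    return sum(rewards[t:])

# made-up episode: four moves, each with a reward of -1
rewards = [-1, -1, -1, -1]
print(episode_return(rewards))       # return from the start of the episode
print(episode_return(rewards, t=2))  # return from time step 2 onwards
```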

***

## Policy
The strategy that is used by the agent to choose an action in a state.
In equations the policy is typically represented by '_π_'.

### Greedy Policy
Selects the action with the highest immediate reward.

---

Reinforcement Learning can be split into two distinct parts, the Prediction Problem and the Control Problem.

### Prediction Problem
Evaluate the performance of an agent.

### State Value
A measure of how good it is to be in a state, given by the expected return that can be obtained by starting in that state and then following the current policy in all future states.

The value of state **_s_** under policy **_π_** is the expected return:

<center><img src="images/state_values_and_policy_evaluation_5mins/state_value.png" style="background-color:#FFFFFF"/></center>


For a **_deterministic policy_**, where a single action is always selected in each state and you're guaranteed to get the same reward and end up in the same next state, the value of a state is simply the immediate reward plus the value of the next state:

<center><img src="images/state_values_and_policy_evaluation_5mins/deterministic_policy.png" style="background-color:#FFFFFF"/></center>
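
To make this concrete, here is a small sketch on a hypothetical corridor of four states, where the policy always moves right and every move gives a reward of -1. The layout and the reward are assumptions chosen for illustration.

```python
# Hypothetical corridor: states 0..3, where state 3 is terminal (value 0).
# The deterministic policy always moves one state to the right.
next_state = {0: 1, 1: 2, 2: 3}
reward = -1  # assumed reward for every move

values = {3: 0.0}
for s in (2, 1, 0):  # work backwards from the terminal state
    # value of a state = immediate reward + value of the next state
    values[s] = reward + values[next_state[s]]

print(values)
```

Each state's value ends up as minus the number of moves needed to reach the exit, which is exactly the return Baby Robot would collect from that state.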


For a **_stochastic policy_**, where multiple actions are possible, π(a|s) represents the probability of taking action a from state s under policy π.
In this case the state value is given by the sum of each action's reward multiplied by the probability of taking that action:

<center><img src="images/state_values_and_policy_evaluation_5mins/stochastic_policy.png" style="background-color:#FFFFFF"/></center>
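
The sketch below shows one way of reading this weighted sum. It assumes that each action leads to a single, known next state and that an action's contribution is its immediate reward plus the value of the state it leads to, as in the deterministic case. The function name and the numbers are illustrative.

```python
def state_value(action_probs, rewards, next_values):
    """Weighted sum over actions of (immediate reward + value of next state).

    action_probs[i] is the probability of taking action i in this state,
    rewards[i] its immediate reward and next_values[i] the value of the
    state that action i leads to.
    """
    return sum(p * (r + v)
               for p, r, v in zip(action_probs, rewards, next_values))

# two equally likely actions, each with reward -1, leading to
# next states with (made-up) values of -2 and -4
v = state_value([0.5, 0.5], [-1, -1], [-2.0, -4.0])
print(v)
```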

---

## Dynamic Programming
Splits the problem into simpler sub-problems. In this case the state value is calculated by splitting the problem into 2 parts: the immediate reward and the value of the next state.

State values are calculated using the values of other states. As a result you only need to look one step ahead and don't need to know all of the rewards accumulated during the episode.


---

## Iterative Policy Evaluation
Repeatedly applying the above equation for the state value to every state moves the values closer and closer to their true values under the policy, until they stop changing (convergence).

The algorithm for this is:
* Start with all state values set to zero.
* Do a sweep of all states to calculate the state values (using the above equation).
* Repeat until convergence.
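
Independently of any particular library, these steps can be sketched with plain NumPy. The example assumes a small open grid with a reward of -1 for every move, an exit in the bottom-right square, and a policy that picks each available move with equal probability. None of these settings, nor the function name, come from the BabyRobot package.

```python
import numpy as np

def evaluate_random_policy(size=3, threshold=1e-3, max_iterations=1000):
    """Iterative policy evaluation for a uniform random policy on a grid.

    Assumptions (illustrative only): every move gives reward -1, the
    bottom-right square is terminal, and only moves that stay on the
    grid are available, each chosen with equal probability.
    """
    values = np.zeros((size, size))  # start with all state values at zero
    terminal = (size - 1, size - 1)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    for iteration in range(1, max_iterations + 1):
        new_values = np.zeros_like(values)
        for row in range(size):
            for col in range(size):
                if (row, col) == terminal:
                    continue  # the terminal state keeps a value of zero
                targets = [(row + dr, col + dc) for dr, dc in moves
                           if 0 <= row + dr < size and 0 <= col + dc < size]
                # average of (reward + next state value) over available moves
                new_values[row, col] = np.mean([-1 + values[t] for t in targets])
        # largest change in any state value during this sweep
        delta = np.max(np.abs(new_values - values))
        values = new_values
        if delta < threshold:
            break
    return values, iteration

values, iterations = evaluate_random_policy()
print(np.round(values, 1))
print(f"Convergence after {iterations} iterations")
```

Each sweep builds a fresh array from the previous sweep's values, so the order in which states are visited doesn't affect the result.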

The algorithm is shown in the code below, using the BabyRobot custom Gym environment:

```python
# get the latest version of the babyrobot custom gym environment
%pip install babyrobot --upgrade -q
```

Note: you may need to restart the kernel to use updated packages.


```python
import numpy as np
import babyrobot
from babyrobot.lib import Policy
from babyrobot.lib import PolicyEvaluation

# create a blank, default environment
setup = {'show_start_text':False,'show_end_text':False,'robot':{ 'show': False}}
env = babyrobot.make("BabyRobot-v0",**setup)

# create a stochastic policy and a policy evaluation object for this policy
policy = Policy(env)
policy_evaluation = PolicyEvaluation( env, policy )

# show the initial state values
info = {'text': policy_evaluation.end_values}
env.show_info(info)
env.render()
```


```python
# iterate until convergence
iterations = 0
convergence = False
while not convergence:
  info = {'text': policy_evaluation.end_values}
  env.show_info(info)
  policy_evaluation.do_iteration()

  # calculate the largest difference in the state values from the start to end of the iteration
  delta = np.max(np.abs(policy_evaluation.end_values - policy_evaluation.start_values))

  # test if the difference is less than the defined convergence threshold
  if delta < policy_evaluation.threshold:
    convergence = True

  iterations += 1

print(f"Convergence after {iterations} iterations")
```

Convergence after 105 iterations


And here's one I made earlier, showing the state values after every 5 iterations of Policy Evaluation and rounded to 1 decimal place:

<center><img src="images/state_values_and_policy_evaluation_5mins/Iterative_Policy_Evaluation.gif" style="background-color:#FFFFFF"/></center>

This was created using the BabyRobot Animate class, as shown below:

```python
from babyrobot.lib import Animate

setup = {'show_start_text':False,'show_end_text':False,'robot':{ 'show': False}}
setup['side_panel'] = {'width':150}
env = babyrobot.make("BabyRobot-v0",**setup)

# create a stochastic policy and a policy evaluation object for this policy
policy = Policy(env)
policy_evaluation = PolicyEvaluation( env, policy )

animate = Animate(env)
args = { 'max_steps': 105, 'save_interval': 5, 'show_directions': False }
animate.show_policy_evaluation(policy_evaluation,**args)
```


---

## Control Problem
Modify the policy to improve performance.

## Policy Improvement
After the state values have been calculated, act greedily with respect to these values to form a new policy (i.e. choose the action that takes you to the next state with the greatest state value).

<center><img src="images/state_values_and_policy_evaluation_5mins/Policy_Improvement.png" style="background-color:#FFFFFF"/></center>
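
As a library-free sketch of this greedy step, the function below picks, for each square of a grid, the move that leads to the neighbouring square with the highest state value. The function name, the compass labels and the state values are made up for illustration and are not part of the BabyRobot API.

```python
import numpy as np

def greedy_directions(values):
    """For each square, choose the move to the neighbour with the highest value."""
    moves = {'N': (-1, 0), 'S': (1, 0), 'W': (0, -1), 'E': (0, 1)}
    rows, cols = values.shape
    policy = {}
    for row in range(rows):
        for col in range(cols):
            # values of the squares reachable without leaving the grid
            options = {name: values[row + dr, col + dc]
                       for name, (dr, dc) in moves.items()
                       if 0 <= row + dr < rows and 0 <= col + dc < cols}
            policy[(row, col)] = max(options, key=options.get)
    return policy

# made-up state values for a 2x2 grid whose exit is the bottom-right square
values = np.array([[-3.0, -2.0],
                   [-2.5,  0.0]])
policy = greedy_directions(values)
print(policy[(0, 0)], policy[(0, 1)], policy[(1, 0)])  # E S E
```

The sketch also returns an entry for the exit square, which can simply be ignored since no action is taken there.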

```python
# create a blank, default environment
setup = {'show_start_text':False,'show_end_text':False,'robot':{ 'show': False}}
env = babyrobot.make("BabyRobot-v0",**setup)

# create a stochastic policy and a policy evaluation object for this policy
policy = Policy(env)
policy_evaluation = PolicyEvaluation( env, policy )

# run policy evaluation to convergence
steps_to_convergence = policy_evaluation.run_to_convergence(max_iterations = 300)
print(f"Convergence in {steps_to_convergence} iterations")

# show the final state values after convergence
env = babyrobot.make("BabyRobot-v0",**setup)
directions = policy.get_directions(values=policy_evaluation.end_values)
info = {'text': policy_evaluation.end_values, 'precision': 1,
        'directions': {'arrows':directions}}
env.show_info(info)
env.render()
```

Convergence in 104 iterations



---

## Discounted Rewards
Progressively reduce the contribution of rewards from future time steps by using a discount factor **_γ_** (gamma), where 0 ≤ γ ≤ 1.

<center><img src="images/state_values_and_policy_evaluation_5mins/Discounted_Rewards.png" style="background-color:#FFFFFF"/></center>

This allows the calculation of state values for deterministic policies that may not terminate. A value of 0.9 is commonly used as the discount factor.

The value of state **_s_** under policy <span style="font-family: arial;">**_π_**</span> with discounted future rewards is given by:

<center><img src="images/state_values_and_policy_evaluation_5mins/State_Value_Discounted_Rewards.png" style="background-color:#FFFFFF"/></center>
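
To see why discounting helps when a policy never terminates, consider a hypothetical pair of states where the policy just bounces back and forth forever, with a reward of -1 per move. The sketch below applies the discounted update repeatedly. The two-state loop, the reward and the stopping threshold are all assumptions for illustration.

```python
gamma = 0.9    # discount factor
reward = -1.0  # assumed reward for every move

# Hypothetical two-state loop: the policy moves 0 -> 1 -> 0 -> ... forever.
next_state = {0: 1, 1: 0}
values = {0: 0.0, 1: 0.0}

for sweep in range(1000):
    # discounted update: immediate reward + gamma * value of the next state
    new_values = {s: reward + gamma * values[next_state[s]] for s in values}
    delta = max(abs(new_values[s] - values[s]) for s in values)
    values = new_values
    if delta < 1e-6:
        break

print(values)
```

With γ = 0.9 the values settle close to -1 / (1 - 0.9) = -10. With γ = 1 each sweep would subtract another 1 and the values would never converge.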

# Summary
We've now covered all of the basics of Reinforcement Learning (RL) and all in record time!

We concluded our introduction with an equation that's a cut-down version of the **Bellman Equation**, the key equation in Reinforcement Learning. In _[Part 2](https://medium.com/towards-data-science/markov-decision-processes-and-bellman-equations-45234cce9d25)_ of this series we look at the full Bellman Equation and examine **Markov Decision Processes**, which are another core concept of RL.

___

<h4 style='color:green'>
If you found this notebook interesting and/or useful please give the accompanying Medium article a like.<br/>
Thanks!
</h4>