# Tinkering notebook 2: Markov Decision Processes and Dynamic Programming

In this notebook we will see how some of the content of Lectures 2 and 3 works in practice.

We will start with a review of Markov Decision Processes (MDPs) and value functions. After this we will study two GridWorld examples (Example 4.1 in the textbook and FrozenLake). Here we will use dynamic programming to compute both value functions and optimal policies.

# Table of contents
* ### [1. Imports](#sec1)
* ### [2. Markov Decision Processes and Value Functions](#sec2)
 * #### [2.1 The Bellman equations](#sec2_1)
 * #### [2.2 Example: Study or Facebook?](#sec2_2)
 * #### [2.3 *Analytical solution to the Bellman equation](#sec2_3)
* ### [3. Helper functions](#sec3)
* ### [4. The environments](#sec4)
 * #### [4.1 Example 4.1: GridWorld-v0](#sec4_1)
 * #### [4.2 The Frozen Lake](#sec4_2)
* ### [5. MDPs and the Bellman equations](#sec5)
 * #### [5.1 Test your code on Example 4.1 (GridWorld)](#sec5_1)
* ### [6. Policy Evaluation](#sec6)
 * #### [6.1 In place updates](#sec6_1)
* ### [7. Policy Iteration](#sec7)
* ### [8. Value iteration](#sec8)


# 1. Imports <a id="sec1">

The packages needed in this notebook are:

In [5]:
import sys
sys.path.append("gym-gridworld")

In [6]:
# Packages needed for this notebook
import gym
import gym_gridworld
import numpy as np
import time
import random
import matplotlib.pyplot as plt
from IPython.display import clear_output # Used to clear the output of a Jupyter cell.

# 2. Markov Decision Processes and Value Functions <a id="sec2">

An MDP consists of a state space $\mathcal{S}$, an action space $\mathcal{A}$, a reward set $\mathcal{R}$ and a transition function $p(s', r | s, a)$. We define the return as
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots
$$
where $\gamma$ is the discount rate.
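
As a small numerical illustration of the return, the sketch below sums a finite reward sequence with discounting. The function name `discounted_return` and the reward values are made up for this example.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1.0, -1.0, 10.0]  # hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.9))
```

Working backwards avoids computing powers of $\gamma$ explicitly and mirrors the recursive structure that the Bellman equation exploits.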

**State-value function:**
$$
v_{\pi}(s) = \mathbb{E}_{\pi}[ G_t | S_t = s]
$$
The expected return starting from state $s \in \mathcal{S}$ and following policy $\pi$.

**Action-value function ($Q$-value)**:
$$
q_{\pi}(s,a) = \mathbb{E}_{\pi}[ G_t | S_t = s, A_t = a]
$$
The expected return when starting from state $s \in \mathcal{S}$, taking action $a \in \mathcal{A}$, and then following policy $\pi$.

## 2.1 The Bellman equations <a id="sec2_1">

We have seen that $v_{\pi}(s)$ is the solution to the Bellman equation
$$
\begin{align}
v_{\pi}(s) 
&= \mathbb{E}_{\pi}[ R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s] \\
&= \sum_{a} \pi(a|s) \sum_{r} \sum_{s'} p(s',r|s,a) [r + \gamma v_\pi(s')] \\
&= \sum_{a} \pi(a|s) q_{\pi}(s,a)
\end{align}
$$
where
$$
\begin{align}
q_{\pi}(s,a) 
&= \mathbb{E}_{\pi}[G_t|S_t=s, A_t=a] \\
&= \sum_{r}\sum_{s'} p(s',r|s,a) [r + \gamma v_{\pi}(s')]
\end{align}
$$

## 2.2 Example: Study or Facebook? <a id="sec2_2">

In Lecture 2 we looked at the example MDP below.
<img src = "example.png">

It has four states, $\mathcal{S} = \{ K_0, K_1, K_2, Pass \}$. The state $Pass$ is a terminal state, so the episode ends when this state is reached (alternatively, if we reach $Pass$ we stay there forever and receive zero future reward).

In each non-terminal state we can choose between `Study` or `Facebook`, $\mathcal{A} = \{ Study, Facebook\}$. The red nodes correspond to actions, and the labels on the edges give immediate rewards (green) and transition probabilities (black).

**Task:** 
Assume that we use the policy $\pi(a|s) = 0.5$ for all states and actions, and the discount factor $\gamma = 0.9$.

In Lecture 2 we saw that the state-value function (rounded to two decimals) is
$$
v_{\pi}(K_0) = 3.00, \quad v_{\pi}(K_1) = 4.78, \quad v_{\pi}(K_2) = 7.84.
$$
1. Verify that this satisfies the Bellman equation for all states! (At least approximately, since we have rounded everything to two decimals.)

You can use the code block below to carry out your computations.

In [19]:
gamma = 0.9
K0 = 3.00
K1 = 4.78
K2 = 7.84
P = 0
pi_FB = 0.5
pi_S = 0.5

vK0 = pi_FB * (0 + gamma*K0) + pi_S * (-1 + gamma*K1)
vK1 = pi_FB * (0 + gamma*(0.5*K0 + 0.5*K1)) + pi_S * (-1 + gamma*K2)
vK2 = pi_FB * (0 + gamma*(0.5*K1 + 0.5*K2)) + pi_S * (10 + gamma*P)
print("v(K0) =", round(vK0,2), "v(K1) =", round(vK1,2), "v(K2) =", round(vK2,2))

v(K0) = 3.0 v(K1) = 4.78 v(K2) = 7.84


## 2.3 *Analytical solution to the Bellman equation <a id="sec2_3">

Note that the Bellman equation can be written as\
请注意，Bellman方程可以写成
$$
v_{\pi}(s) = \mathbb{E}_{\pi}[ R_{t+1} | S_{t} = s] + \gamma\sum_{a, s'} \pi(a|s)p(s' | s, a) v_{\pi}(s').
$$
So the value of state $s$ is the average immediate reward plus the discounted average value of the next state.

**Example: From the state $K_1$**:
    
* In $s = K_1$ the expected immediate reward is $-0.5$, since we choose `Facebook` with probability 0.5 (reward 0) and `Study` with probability 0.5 (reward -1). 
* There is a 0.5 probability that the action is `Facebook`, and then there is a 50/50 chance of going either to $K_0$ or $K_1$. Hence the total probabilities of going to $K_0$ and to $K_1$ are both 0.25 ($0.5 \times 0.5$). There is also a 0.5 probability for the action `Study` which will move us to $K_2$. Finally there is 0 probability of reaching $Pass$.

Summarizing this we get
$$
v_{\pi}(K_1) = -0.5 + \gamma [ 0.25 v_{\pi}(K_0) + 0.25 v_{\pi}(K_1) + 0.5 v_{\pi}(K_2) + 0 v_{\pi}(Pass)]
$$
If we define a vector with all state-values
$$
V_{\pi} = \begin{bmatrix} v_{\pi}(K_0) \\ v_{\pi}(K_1) \\ v_{\pi}(K_2) \\ v_{\pi}(Pass) \end{bmatrix}
$$
we can write this as
$$
v_{\pi}(K_1) = \underbrace{-0.5}_{r_1} + \gamma \underbrace{\begin{bmatrix} 0.25 & 0.25 & 0.5 & 0 \end{bmatrix}}_{p_1} V_\pi
$$
Note that the elements of $p_1$ are the probability of going from $K_1$ to each of the other states when we follow the policy.

**Putting it all together:**

If we do the same for all states, we get an equation of the form
$$
V_{\pi} = R + \gamma P V_{\pi}
$$
where $R$ is the vector of expected immediate rewards for each state, and $P \in \mathbb{R}^{4 \times 4}$ where element $P_{i,j}$ is the probability of moving from the $i$th state to the $j$th state when we follow the policy.

Assuming that $\gamma < 1$ there is always a unique solution given by 
$$
V_{\pi} = (I - \gamma P)^{-1} R.
$$

**Task**: Fill in the correct values of $R$ and $P$ in the code below, and see if you find the same solution as in the slides of Lecture 2. You can also play around with the discount rate, and/or try to compute $R$, $P$ and then $V_{\pi}$ for another policy.

**Note**: The state $Pass$ is a bit special, since it is a terminal state. Hence, when we reach "Pass the exam" we will not receive any more rewards, and we will stay in this state forever (the probability of going to any other state is 0). Therefore
$$
v_{\pi}(Pass) = 0 + \gamma \begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix} V_{\pi}
$$

In [None]:
discount = 0.9 # gamma
R = np.zeros((4,1))
P = np.zeros((4,4))

# Enter the expected immediate reward for each state
# Those computed in the text above are already filled in.
R[0] = ? # For K_0
R[1] = -.5 # For K_1
R[2] = ? # For K_2
R[3] = 0 # For "Pass exam"

# Enter the probabilities going from state i to state j
P[0] = [?, ?, ?, ?] # for i=0 (K_0)
P[1] = [0.25, 0.25, 0.5, 0] # for i=1 (K_1)
P[2] = [?, ?, ?, ?] # for i=2 (K_2)
P[3] = [0, 0, 0, 1] # for i=3 (Pass the exam) 

# Solve the Bellman equation 
V = np.linalg.inv(np.eye(4) - discount*P)@R # V = (I - discount*P)^-1 * R
print(V)
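
One way to check a candidate $R$ and $P$ is to substitute the rounded values from Lecture 2 into $V_\pi = R + \gamma P V_\pi$. The sketch below does this with plain Python lists. The entries for $K_0$ and $K_2$ are an assumption read off from the transitions used in the verification cell of Section 2.2, not values quoted from the lecture slides.

```python
gamma = 0.9

# Expected immediate rewards under the 50/50 policy (order: K0, K1, K2, Pass).
# K0: 0.5*0 + 0.5*(-1); K2: 0.5*0 + 0.5*10 (assumed from Section 2.2).
R = [-0.5, -0.5, 5.0, 0.0]

# P[i][j] = probability of moving from state i to state j under the policy.
P = [
    [0.50, 0.50, 0.00, 0.00],  # K0: Facebook stays in K0, Study moves to K1
    [0.25, 0.25, 0.50, 0.00],  # K1: as derived in the text
    [0.00, 0.25, 0.25, 0.50],  # K2: Facebook -> K1 or K2, Study -> Pass
    [0.00, 0.00, 0.00, 1.00],  # Pass: absorbing
]

V = [3.00, 4.78, 7.84, 0.0]  # rounded values from Lecture 2

# Right-hand side R + gamma * P V, computed row by row.
rhs = [R[i] + gamma * sum(P[i][j] * V[j] for j in range(4)) for i in range(4)]
residual = max(abs(rhs[i] - V[i]) for i in range(4))
print("R + gamma*P*V =", [round(x, 2) for x in rhs])
print("largest mismatch:", residual)
```

Any mismatch should be at the level of the rounding in `V`; the exact solution still comes from solving the linear system as in the cell above.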

# 3. Helper functions <a id="sec3">

We will now start to look at two GridWorld examples, but first we define some functions that will be useful in the rest of the notebook.

The class `RandomAgent` implements an agent with a policy $\pi(a|s)$. The method `act` takes the state $s$ as input, and then samples an action according to the probabilities $\pi(a|s)$.

To implement this we use a table `probs` of size $|\mathcal{S}| \times |\mathcal{A}|$, so `probs[s][a]` $= \pi(a|s)$. Here we implement an agent that (initially) chooses the action with a uniform probability, so all actions are equally likely in each state.
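
To make the table lookup concrete, the following sketch samples actions from one row of a policy table using only the standard library. The table `probs` here is a made-up two-state example, and `random.choices` stands in for the NumPy sampling call used in the notebook's agent.

```python
import random

# Hypothetical policy table: probs[s][a] = pi(a|s) for 2 states and 4 actions.
probs = [
    [0.25, 0.25, 0.25, 0.25],  # state 0: uniform over the four actions
    [0.0, 1.0, 0.0, 0.0],      # state 1: always action 1
]

def sample_action(probs, state):
    """Draw one action index according to the row probs[state]."""
    actions = list(range(len(probs[state])))
    return random.choices(actions, weights=probs[state], k=1)[0]

print([sample_action(probs, 0) for _ in range(10)])  # varies between runs
print([sample_action(probs, 1) for _ in range(10)])  # always action 1
```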

In [20]:
class RandomAgent():
    
    def __init__(self, nA = 4, nS = 16):
        self.nA = nA # Number of actions
        self.nS = nS # Number of states
        
        # Uniform probabilities in each state.
        # That is, in each of the `nS` states
        # each of the `nA` actions has probability 1/nA.
        self.probs = np.ones((nS,nA))/nA 

    def act(self, state, done):
        action = np.random.choice(self.nA, p = self.probs[state]) 
        return action # a random policy

We also implement a function that will let the agent run on the environment:

In [21]:
def run_agent(env, agent, tsleep = 0.05):
    state = env.reset()
    time_step = 0
    total_reward = 0
    reward = 0
    done = False
    while not done:
        action = agent.act(state, done)
        state, reward, done, info = env.step(action)
        total_reward += reward
        time_step += 1
        clear_output(wait=True)
        env.render()
        print("Time step:", time_step)
        print("State:", state)
        print("Action:", action)
        print("Total reward:", total_reward)
        time.sleep(tsleep)

# 4. The environments <a id="sec4">

In this notebook we will try the methods on two environments. Both environments are $4 \times 4$ gridworlds, see figure below.

<img src="grid.png" width=200>

The agent can be in one of the 16 grid cells, so the state space is
$$
\mathcal{S} = \{0, 1, 2, \ldots, 15\}.
$$
In each state the agent can take one out of four actions:

In [22]:
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

So the action space is
$$
\mathcal{A} = \{ 0, 1, 2, 3\}
$$
Let's now look at the two different environments.

## 4.1 Example 4.1: GridWorld-v0 <a id="sec4_1">

The first environment is called `GridWorld-v0`, and is described in Example 4.1 of the textbook. To create the environment and study the state and action spaces, execute:

In [23]:
env = gym.make('GridWorld-v0')
state = env.reset()
print("State space:", env.observation_space)
print("Action space:", env.action_space)

State space: Discrete(16)
Action space: Discrete(4)


We can see that the state space has 16 states, and there are 4 actions (as mentioned above). Let us next visualize the environment:

In [24]:
env.render()


GFFF
FFFF
FSFF
FFFG


`S` is the starting state.

`F` is a state where the agent can walk.

`G` marks the two goal states. (In Example 4.1 these two states are considered to be one state.)

**Reward:** The agent receives the reward -1 for each action taken, and the episode ends when a goal state is reached. Hence, the agent should reach a goal state with as few actions as possible in order to maximize the total reward.

**Dynamics:** This is a deterministic environment. So if the agent chooses the action `LEFT = 0` then it will move one step to the left if possible. If it hits a wall it will just stay in the same place.

We can try to move the agent one step to the left with the following code:

In [25]:
new_state, reward, done, _ = env.step(LEFT) # Take action LEFT = 0
env.render()
print("New state:", new_state)
print("Reward:", reward)

  (Left)
GFFF
FFFF
FSFF
FFFG
New state: 8
Reward: -1.0


**More about dynamics:** 
The GridWorld environment has the variable `env.P` that encodes the dynamics $p(s',r|s,a)$ of the environment.

`env.P[s][a]` returns a list where each element is of the form `(probability, next_state, reward, terminating)`. So, if you take action `a` in state `s`, then with probability `probability` you will move to `next_state` and get the reward `reward`. `terminating` tells us if `next_state` will terminate the episode or not.
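
The structure of `env.P` can be mimicked with a nested dictionary. The sketch below builds a hypothetical two-state table in the same `(probability, next_state, reward, terminating)` format (it is not the GridWorld dynamics) and checks that the outgoing probabilities of every state-action pair sum to one.

```python
# Hypothetical dynamics table in the same layout as env.P:
# P[s][a] is a list of (probability, next_state, reward, terminating) tuples.
P = {
    0: {
        0: [(1.0, 1, -1.0, True)],                         # action 0: go to state 1
        1: [(0.5, 0, -1.0, False), (0.5, 1, -1.0, True)],  # action 1: may stay in 0
    },
    1: {
        0: [(1.0, 1, 0.0, True)],  # state 1 is terminal and absorbing
        1: [(1.0, 1, 0.0, True)],
    },
}

for s, actions in P.items():
    for a, transitions in actions.items():
        total = sum(p for p, _, _, _ in transitions)
        # p(s', r | s, a) must be a probability distribution for every (s, a).
        assert abs(total - 1.0) < 1e-12
        print(f"s={s}, a={a}: {len(transitions)} outcome(s), total probability {total}")
```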

In [26]:
s = 9
a = LEFT
print(env.P[s][a])
for p, next_s, reward, _ in env.P[s][a]: # Go through all possible transitions
    print("With probability", p, 
          "you will move to state", next_s, 
          "and get the reward", reward)

[(1.0, 8, -1.0, False)]
With probability 1.0 you will move to state 8 and get the reward -1.0


Since this is a deterministic environment, the list only contains one next state that is reached with probability 1.0.

Finally we can try to run the agent that chooses between all actions with equal probability in all states. Try to run it a few times, and note that you will get different total rewards every time since the agent uses a random policy.

In [28]:
agent = RandomAgent(env.action_space.n, env.observation_space.n)
run_agent(env, agent)

  (Right)
GFFF
FFFF
FSFF
FFFG
Time step: 9
State: 15
Action: 2
Total reward: -9.0


### Task: Try to implement an optimal policy (see Figure 4.1 in the textbook).
By changing `agent.probs` we can change the policy of the agent. Try to implement an optimal policy (see Figure 4.1 in the textbook).

Note that when two possible directions are shown in the optimal policy, you can choose between them any way you want (either pick one with probability 1.0, or pick between them with equal probability).

In [40]:
agent.probs = np.zeros((16, 4))

# Note that in each non-terminal state the total
# probability must add up to 1.0.

# Row 1
agent.probs[1][LEFT] = 1.0
agent.probs[2][LEFT] = 1.0
agent.probs[3][[LEFT, DOWN]] = 0.5 # Pick between them with equal probability

# Row 2
agent.probs[4][UP] = 1.0
agent.probs[5][[LEFT, UP]] = 0.5
agent.probs[6][[LEFT, DOWN]] = 0.5
agent.probs[7][DOWN] = 1.0

# Row 3
agent.probs[8][UP] = 1.0
agent.probs[9][[RIGHT, UP]] = 0.5
agent.probs[10][[RIGHT, DOWN]] = 0.5
agent.probs[11][DOWN] = 1.0


# Row 4 
agent.probs[12][[RIGHT, UP]] = 0.5
agent.probs[13][RIGHT] = 1.0
agent.probs[14][RIGHT] = 1.0

print(agent.probs)

[[0.  0.  0.  0. ]
 [1.  0.  0.  0. ]
 [1.  0.  0.  0. ]
 [0.5 0.5 0.  0. ]
 [0.  0.  0.  1. ]
 [0.5 0.  0.  0.5]
 [0.5 0.5 0.  0. ]
 [0.  1.  0.  0. ]
 [0.  0.  0.  1. ]
 [0.  0.  0.5 0.5]
 [0.  0.5 0.5 0. ]
 [0.  1.  0.  0. ]
 [0.  0.  0.5 0.5]
 [0.  0.  1.  0. ]
 [0.  0.  1.  0. ]
 [0.  0.  0.  0. ]]


In [41]:
run_agent(env, agent)

  (Up)
GFFF
FFFF
FSFF
FFFG
Time step: 3
State: 0
Action: 3
Total reward: -3.0


## 4.2 The Frozen Lake <a id="sec4_2">

The environment [FrozenLake-v0](https://gym.openai.com/envs/FrozenLake-v0/) is similar to `GridWorld-v0`, but it is stochastic.

In [42]:
env = gym.make('FrozenLake-v0')
state = env.reset()
print("State space:", env.observation_space)
print("Action space:", env.action_space)

State space: Discrete(16)
Action space: Discrete(4)


In [43]:
env.render()


SFFF
FHFH
FFFH
HFFG


`S` is the starting state.

`F` is a frozen surface that the agent can walk on, but it is slippery, so the movement direction only partially depends on the action.

`H` is a hole. If the agent steps here, it will fall in. (The episode terminates with $0$ reward.)

`G` is the goal. If the agent steps here, the episode terminates with reward $+1$.

**Reward**: All actions not leading to the goal state give a reward of 0. Hence, to maximize the reward the agent must reach the goal state without falling into a hole.

**Dynamics:** Here we also have `env.P[s][a]` to see the dynamics of the environment.

In [45]:
s = 9
a = LEFT 
print(env.P[s][a])
for p, next_s, reward, _ in env.P[s][a]:
    print("With probability", p, 
          "you will move to state", next_s, 
          "and get the reward", reward)

[(0.3333333333333333, 5, 0.0, True), (0.3333333333333333, 8, 0.0, False), (0.3333333333333333, 13, 0.0, False)]
With probability 0.3333333333333333 you will move to state 5 and get the reward 0.0
With probability 0.3333333333333333 you will move to state 8 and get the reward 0.0
With probability 0.3333333333333333 you will move to state 13 and get the reward 0.0


We see that the environment is stochastic in this case, since the action `LEFT` in state 9 may take the agent either to state 5 (up into a hole!), state 8 (left) or state 13 (down), due to the slippery surface.
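
A stochastic step can be simulated directly from such a transition list. The sketch below copies the list printed above and draws outcomes with `random.choices`. The helper name `sample_transition` is made up here, and the code is only meant to illustrate the probabilities.

```python
import random

# Transition list copied from the printed env.P[9][LEFT] above.
transitions = [
    (1 / 3, 5, 0.0, True),    # slips up into the hole at state 5
    (1 / 3, 8, 0.0, False),   # moves left as intended
    (1 / 3, 13, 0.0, False),  # slips down
]

def sample_transition(transitions):
    """Draw one (next_state, reward, terminating) outcome by its probability."""
    weights = [p for p, _, _, _ in transitions]
    _, next_s, reward, done = random.choices(transitions, weights=weights, k=1)[0]
    return next_s, reward, done

counts = {5: 0, 8: 0, 13: 0}
for _ in range(3000):
    next_s, _, _ = sample_transition(transitions)
    counts[next_s] += 1
print(counts)  # each count should be close to 1000
```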

We can also try to run this environment using the random policy:

In [46]:
agent = RandomAgent(env.action_space.n, env.observation_space.n)
run_agent(env, agent)

  (Down)
SFFF
FHFH
FFFH
HFFG
Time step: 2
State: 5
Action: 1
Total reward: 0.0


We can see that typically the agent ends up in one of the holes, and thus the total reward is typically 0 (but the expected total reward is positive, since there is a non-zero probability that we reach the goal).

# 5. MDPs and the Bellman equations <a id="sec5">

In this section we will study the Bellman equations for the value function a bit closer. Remember that we defined the return as
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots
$$
and the state-value function as
$$
v_\pi(s) = \mathbb{E}_{\pi}[G_t | S_{t} = s].
$$
The Bellman equation for the state-value function is then
$$
v_\pi(s) 
= \sum_{a} \pi(a|s) \sum_{r} \sum_{s'} p(s',r|s,a) [r + \gamma v_\pi(s')] 
= \sum_{a} \pi(a|s) q_{\pi}(s,a)
$$
where
$$
q_{\pi}(s,a) 
= \mathbb{E}_{\pi} [G_t | S_t=s, A_t=a] 
= \sum_{r}\sum_{s'} p(s',r|s,a) [r + \gamma v_{\pi}(s')]
$$
is the action-value function.

**Implementation:**
In the code we will represent the state-value function $v_\pi(s)$ as a vector $v$ with one element for each state in $\mathcal{S}$.

We will now implement functions for computing the right-hand side of the Bellman equation. 

**Task:**
Complete `compute_action_value` and `Bellman_RHS`. Make sure that you understand the code.

We start with a function that computes the action values $q_{\pi}(s,a)$ given the state-value function $v_\pi(s)$.

In [None]:
def compute_action_value(env, discount, s, a, v):
    
    action_value = 0
    
    for p, next_s, reward, _ in env.P[s][a]:
        # Loop through all possible (s', r) pairs
        action_value += ?
    
    return action_value

With the action values, we can now compute $\sum_{a} \pi(a|s) q_{\pi}(s,a)$ (the expected action value) to get the right-hand side (RHS) of the Bellman equation.

For this we use `agent.probs[s][a]` $=\pi(a|s)$, see discussion in "Helper Functions".

In [None]:
def Bellman_RHS(env, discount, agent, s, v):
    
    state_value = 0
    
    for a in range(env.action_space.n):
        # Loop through all possible actions
        state_value += ?
    
    return state_value
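
For reference, here is one possible way to fill in the two placeholders, tested on a tiny hand-built MDP rather than on the Gym environments. The stand-in objects built with `SimpleNamespace` are hypothetical and only mimic the attributes the functions read (`env.P`, `env.action_space.n`, `agent.probs`).

```python
from types import SimpleNamespace

def compute_action_value(env, discount, s, a, v):
    """q(s,a) = sum over (s', r) of p(s',r|s,a) * (r + discount * v[s'])."""
    action_value = 0.0
    for p, next_s, reward, _ in env.P[s][a]:
        action_value += p * (reward + discount * v[next_s])
    return action_value

def Bellman_RHS(env, discount, agent, s, v):
    """Sum over a of pi(a|s) * q(s,a)."""
    state_value = 0.0
    for a in range(env.action_space.n):
        state_value += agent.probs[s][a] * compute_action_value(env, discount, s, a, v)
    return state_value

# Hypothetical 2-state MDP: state 1 is terminal, every move from state 0 costs -1.
env = SimpleNamespace(
    P={
        0: {0: [(1.0, 1, -1.0, True)],
            1: [(0.5, 0, -1.0, False), (0.5, 1, -1.0, True)]},
        1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
    },
    action_space=SimpleNamespace(n=2),
)
agent = SimpleNamespace(probs=[[0.5, 0.5], [0.5, 0.5]])  # uniform policy

v = [-2.0, 0.0]  # an arbitrary guess, not the true value function
print(compute_action_value(env, 1.0, 0, 0, v))  # -1 + 1.0 * v[1]
print(compute_action_value(env, 1.0, 0, 1, v))  # 0.5*(-1 + v[0]) + 0.5*(-1 + v[1])
print(Bellman_RHS(env, 1.0, agent, 0, v))
```

The `terminating` flag is ignored here, as in the notebook's template; this relies on terminal states having value 0 in `v`.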

Finally we implement a function that, given a value function, computes the right-hand side of the Bellman equation for all states.

In [None]:
def Bellman_RHS_all(env, discount, agent, v0):
    # v0 is the given value function
    # v will be the right-hand side of the Bellman equation
    # If v0 is indeed the value function, then we should get v = v0.
    
    v = np.zeros(env.observation_space.n)
    
    for s in range(env.observation_space.n):
        v[s] = Bellman_RHS(env, discount, agent, s, v0)
    
    return v

## 5.1 Test your code on Example 4.1 (GridWorld) <a id="sec5_1">

For Example 4.1 we will consider the state-value $v_{\pi}(s)$ for the policy when each action is taken with equal probability. The discount rate is
$$
\gamma = 1
$$
The value function for this policy is given in Figure 4.1 in the textbook (the lower left), and it is

In [None]:
v = np.array([[0, -14, -20, -22], 
             [-14, -18, -20, -20],
             [-20, -20, -18, -14], 
             [-22, -20, -14, 0]]).ravel()

print("v as vector:", v)
print("v as matrix:\n", v.reshape(4,4))

# ravel turns the matrix into an array,
# and with reshape we print it as a matrix again so that it is easier to read.

We can now use our code to see if this value function really satisfies the Bellman equation:

In [None]:
env = gym.make('GridWorld-v0')
agent = RandomAgent()
discount = 1
v_new = Bellman_RHS_all(env, discount, agent, v)
print('Right-hand side of Bellman equation:\n', v_new.reshape(4,4))

**Task:** `v` is the true value-function for the policy $\pi$ implemented in `agent`. Hence, `v_new` (the right-hand side of the Bellman equation) should be equal to `v`. 

If `v_new` is not equal to `v`, go back and fix your code for `compute_action_value` and `Bellman_RHS`. Remember to re-run the code cells after you have changed the code!

# 6. Policy Evaluation <a id="sec6">

In Lecture 3 we learned that one way of solving the Bellman equation, is to start from an initial guess and then repeatedly update the value function by applying the right-hand side of the Bellman equation. 

Below is one way to implement this. The iteration will stop when the maximum change in $v$ is less than `tol` (tolerance) or when the number of iterations reaches `max_iter`.

***Note:*** For this code to work properly, your implementation of `compute_action_value` and `Bellman_RHS` must be correct. So make sure that you have tested your code first!

**Task:** Make sure that you understand the code!

In [None]:
def policy_evaluation(env, discount, agent, v0, max_iter=1000, tol=1e-6):
    
    v_old = v0
    
    for i in range(max_iter):
        v_new = Bellman_RHS_all(env, discount, agent, v_old)
        
        if np.max(np.abs(v_new-v_old)) < tol:
            break
            
        v_old = v_new
        
    return v_new

Let's try this on the `GridWorld-v0` example, with the uniformly random policy. We start with an initial guess $v_{\pi}(s) = 0$ for all $s$.

In [None]:
env = gym.make('GridWorld-v0')
agent = RandomAgent()
discount = 1
v0 = np.zeros((env.observation_space.n))

v = policy_evaluation(env, discount, agent, v0)
print(v.reshape(4,4))

Is this (approximately) the same as the true value function in Figure 4.1? $(k=\infty)$
If you do not find the correct value function, make sure that your code in `compute_action_value` and `Bellman_RHS` is correct!

To replicate the other parts of Figure 4.1, you can set `max_iter` in order to see how the value function looks after a few iterations.

In [None]:
v = policy_evaluation(env, discount, agent, v0, max_iter=1)
print(v.reshape(4,4))

**Task:** Use `policy_evaluation` to compute the value function for `FrozenLake-v0` when the uniformly random policy is used. Use $\gamma = 1$.

In [None]:
env = gym.make('FrozenLake-v0')
env.reset()
env.render()
agent = RandomAgent()
discount = 1

# Write code for computing the state-value function

print(v.reshape(4,4))

## 6.1 In place updates <a id="sec6_1">

As mentioned in Lecture 3 and in the textbook, the policy evaluation is often implemented using in place updates.

This can both simplify implementation, since we do not keep two separate arrays, and it can also speed up convergence.

**Task:** Complete the code in `policy_evaluation_ip`. Pseudo-code can be found in the textbook in the box on page 75. Then test your code on `GridWorld-v0` with the uniform policy to see that you still get the correct value function.

***Note:*** You have already written a function that computes the right-hand side of the Bellman equation!

In [None]:
def policy_evaluation_ip(env, discount, agent, v0, max_iter=1000, tol=1e-6):
    
    v = np.array(v0, dtype=float)  # Copy so the caller's v0 is not overwritten in place
    
    for i in range(max_iter): # Loop
        delta = 0
        for s in range(env.observation_space.n):
            vs = v[s]
            
            # Code for updating v[s]
            
            delta = np.max([delta, np.abs(vs-v[s])])
            
        if (delta < tol): # Until delta < tol
            break
            
    return v    

In [None]:
env = gym.make('GridWorld-v0')
agent = RandomAgent()
discount = 1

v0 = np.zeros(16)
v = policy_evaluation_ip(env, discount, agent, v0)
print(v)
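
A possible shape for the in-place update, sketched on a hypothetical two-state MDP so that it runs without Gym. The function `evaluate_in_place` and its arguments (`P`, `probs`) are names chosen for this illustration; in the notebook the update would instead call the `Bellman_RHS` function written earlier.

```python
def evaluate_in_place(P, probs, discount, n_states, n_actions,
                      max_iter=1000, tol=1e-6):
    """Iterative policy evaluation that overwrites v[s] as soon as it is updated."""
    v = [0.0] * n_states
    for _ in range(max_iter):
        delta = 0.0
        for s in range(n_states):
            old = v[s]
            # Bellman right-hand side, using the freshest values already in v.
            v[s] = sum(
                probs[s][a] * sum(p * (r + discount * v[s2])
                                  for p, s2, r, _ in P[s][a])
                for a in range(n_actions)
            )
            delta = max(delta, abs(old - v[s]))
        if delta < tol:
            break
    return v

# Hypothetical MDP: state 1 is terminal; from state 0 every action costs -1.
P = {
    0: {0: [(1.0, 1, -1.0, True)],
        1: [(0.5, 0, -1.0, False), (0.5, 1, -1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
probs = [[0.5, 0.5], [0.5, 0.5]]  # uniform random policy

v = evaluate_in_place(P, probs, discount=1.0, n_states=2, n_actions=2)
print(v)  # v[0] should approach -4/3, the fixed point of v0 = -1 + 0.25*v0
```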

# 7. Policy Iteration <a id="sec7">

Now that we have code for evaluating a policy, it is time to see how the policy can be improved. Remember that the idea is to act greedily with respect to $v_{\pi}(s)$. That is, given $v_{\pi}(s)$ we can compute $q_{\pi}(s,a)$, and then the greedy (improved) policy is
$$
\pi'(s) = \text{argmax}_{a} q_{\pi}(s,a)
$$
We have already written code for computing $q_{\pi}(s,a)$ for a given $v_{\pi}(s)$, so the only thing we have to do now is to implement the maximization.

`greedy_policy` will return `a_probs` which encode a policy that is greedy with respect to `v`. That is `a_probs[s][a]` $= \pi'(a|s)$. Make sure that you understand the code.

In [None]:
def greedy_policy(env, discount, agent, v):
    
    # The new policy will be a_probs
    # We start by setting all probabilities to 0
    # Then when we have found the greedy action in a state, 
    # we change the probability for that action to 1.0.
    
    a_probs = np.zeros((env.observation_space.n, env.action_space.n)) 
    
    for s in range(env.observation_space.n):
        
        action_values = np.zeros(env.action_space.n)
        
        for a in range(env.action_space.n):
            # Compute action value for all actions
            action_values[a] = compute_action_value(env, discount, s, a, v)
            
        a_max = np.argmax(action_values) # A greedy action
        a_probs[s][a_max] = 1.0 # Always choose the greedy action!
        
    return a_probs
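
`np.argmax` returns the index of the first maximum, so `greedy_policy` always commits to a single action even when several are equally good. If a policy that splits probability evenly over tied actions is wanted (as in Figure 4.1), one option is sketched below; `greedy_row` and its `atol` tolerance are hypothetical names, not part of the notebook.

```python
def greedy_row(action_values, atol=1e-9):
    """Return a probability row that is uniform over all maximizing actions."""
    best = max(action_values)
    # Treat values within atol of the maximum as ties (guards against round-off).
    is_best = [abs(q - best) <= atol for q in action_values]
    n_best = sum(is_best)
    return [1.0 / n_best if flag else 0.0 for flag in is_best]

print(greedy_row([-1.0, -1.0, -3.0, -2.0]))  # two tied actions share the probability
print(greedy_row([-2.0, -1.0, -3.0, -2.0]))  # a unique best action gets probability 1
```

Both variants are greedy with respect to the same value function, so they have the same value; they differ only in how ties are displayed and sampled.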

Let's try to improve the policy on `GridWorld-v0`.

In [None]:
env = gym.make('GridWorld-v0')
agent = RandomAgent()
discount = 1

# We first evaluate the policy
v = np.zeros(env.observation_space.n)
v = policy_evaluation(env, discount, agent, v)

In [None]:
v_old = v

# And then we improve the policy (act greedy w.r.t v)
agent.probs = greedy_policy(env, discount, agent, v)

# We can also evaluate the new policy 
v = policy_evaluation(env, discount, agent, v)

print("Value of initial policy:")
print(v_old.reshape(4,4))
print("\nValue of improved policy:")
print(v.reshape(4,4))

Assuming that your implementation of `compute_action_value` is correct, we can clearly see that the improved policy has a higher value in every non-terminal state. In fact, the policy is now an optimal policy. To see this, you can try to rerun the second cell above and note that the policy does not improve anymore.

**Policy iteration:** However, not every environment lets the policy converge after a single improvement step. In general we may have to alternate evaluation and improvement several times until the policy finally converges to an optimal policy.
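
The repeated evaluate-and-improve loop can be sketched as follows. This is a minimal version on a hypothetical three-state chain with plain Python lists, assuming $\gamma = 0.9$; the helper names (`q_value`, `evaluate`, `improve`, `policy_iteration`) are made up for this example and do not appear in the notebook.

```python
def q_value(P, discount, s, a, v):
    """Action value q(s,a) computed from the state values v."""
    return sum(p * (r + discount * v[s2]) for p, s2, r, _ in P[s][a])

def evaluate(P, probs, discount, tol=1e-10, max_iter=10_000):
    """Iterative policy evaluation with in-place updates."""
    n_s, n_a = len(probs), len(probs[0])
    v = [0.0] * n_s
    for _ in range(max_iter):
        delta = 0.0
        for s in range(n_s):
            old = v[s]
            v[s] = sum(probs[s][a] * q_value(P, discount, s, a, v) for a in range(n_a))
            delta = max(delta, abs(old - v[s]))
        if delta < tol:
            break
    return v

def improve(P, discount, v, n_a):
    """Greedy policy w.r.t. v (the first maximizing action wins ties)."""
    probs = []
    for s in range(len(v)):
        q = [q_value(P, discount, s, a, v) for a in range(n_a)]
        best = q.index(max(q))
        probs.append([1.0 if a == best else 0.0 for a in range(n_a)])
    return probs

def policy_iteration(P, probs, discount, max_rounds=100):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    for _ in range(max_rounds):
        v = evaluate(P, probs, discount)
        new_probs = improve(P, discount, v, len(probs[0]))
        if new_probs == probs:  # policy did not change: stop
            break
        probs = new_probs
    return probs, v

# Hypothetical chain 0 -> 1 -> 2 (terminal). Action 0 moves forward,
# action 1 moves back to state 0; every move costs -1.
P = {
    0: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 0, -1.0, False)]},
    1: {0: [(1.0, 2, -1.0, True)], 1: [(1.0, 0, -1.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
uniform = [[0.5, 0.5] for _ in range(3)]
probs, v = policy_iteration(P, uniform, discount=0.9)
print(probs)
print(v)
```

Stopping when the greedy policy no longer changes is one common criterion; with a fixed tie-breaking rule, as here, it avoids oscillating between equally good policies.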

Finally, we can try to run the agent with the improved policy.

In [None]:
run_agent(env, agent)

**Task:** Find an optimal policy for `FrozenLake-v0`. (Note again that you may have to improve several times to reach an optimal policy!)

In [None]:
env = gym.make('FrozenLake-v0')
agent = RandomAgent()
discount = 1

# Enter code here

# 8. Value iteration <a id="sec8">

In value iteration we instead start from the Bellman optimality equation

$$
v_{*}(s) = \max_{a} q_{\pi_*}(s,a) = \max_a \sum_{s', r} p(s', r | s, a) [r + \gamma v_{*}(s')]
$$

We start with an initial guess $v_0$ and then we repeatedly compute the right-hand side of this equation, until we converge to the optimal state-value function. When we have the optimal state-value function $v_*$, we can take any policy that is greedy w.r.t $v_*$ and this will give us an optimal policy. 

**Task 1:** Complete the code below. Pseudo-code for the algorithm can be found on page 83 in the textbook. Note that the code for computing the action-value given $v_{\pi}$ has already been implemented above.

The `value_iteration` function will (if implemented correctly) give back the optimal value function. 

**Task 2:** Also add some code for computing the optimal policy given this, and try it on `FrozenLake-v0` and/or `GridWorld-v0`.

In [None]:
def value_iteration(env, discount, agent, v0, max_iter=1000, tol=1e-6):
    
    v = np.array(v0, dtype=float)  # Copy so the caller's v0 is not overwritten in place
    
    for i in range(max_iter): # Loop
        delta = 0
        for s in range(env.observation_space.n):
            vs = v[s]
            
            
            
            # Code for updating v[s]
                
            
            ##
            
            delta = np.max([delta, np.abs(vs-v[s])])
            
        if (delta < tol): # Until delta < tol
            break
            
    return v    
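
One possible form of the update, sketched on a small hand-built dynamics table so that it runs without Gym. The function name `value_iteration_sketch`, the three-state chain, and $\gamma = 0.9$ are assumptions for this illustration; in the notebook the inner maximization would call `compute_action_value`.

```python
def action_value(P, discount, s, a, v):
    """q(s,a) = sum over outcomes of p * (r + discount * v[s'])."""
    return sum(p * (r + discount * v[s2]) for p, s2, r, _ in P[s][a])

def value_iteration_sketch(P, discount, n_states, n_actions,
                           max_iter=1000, tol=1e-6):
    """In-place value iteration: v[s] <- max over a of q(s,a)."""
    v = [0.0] * n_states
    for _ in range(max_iter):
        delta = 0.0
        for s in range(n_states):
            old = v[s]
            v[s] = max(action_value(P, discount, s, a, v) for a in range(n_actions))
            delta = max(delta, abs(old - v[s]))
        if delta < tol:
            break
    return v

# Hypothetical chain 0 -> 1 -> 2 (terminal). Action 0 moves forward,
# action 1 moves back to state 0; every move costs -1.
P = {
    0: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 0, -1.0, False)]},
    1: {0: [(1.0, 2, -1.0, True)], 1: [(1.0, 0, -1.0, False)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

v = value_iteration_sketch(P, discount=0.9, n_states=3, n_actions=2)
print(v)

# A greedy policy w.r.t. v: the first maximizing action in each state.
policy = [max(range(2), key=lambda a, s=s: action_value(P, 0.9, s, a, v))
          for s in range(3)]
print(policy)
```

Unlike policy iteration, no explicit policy is kept during the loop; the policy is only read off from the converged values at the end.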

In [None]:
env = gym.make('FrozenLake-v0')
agent = RandomAgent()
discount = 1

v0 = np.zeros(env.observation_space.n)
v = value_iteration(env, discount, agent, v0)
print(v.reshape(4,4))

In [None]:
# Code for finding the greedy policy w.r.t v, and to run it on FrozenLake-v0.