# RL BA2 - Model-free Prediction and Control

## Monte Carlo Prediction

We study an environment with three states, $\mathcal{S} = \{A,B,C\}$, where $C$ is a terminating state. The discount rate is $\gamma = 1$. Following a policy $\pi$, we observe the following two episodes (states and rewards):

Episode 1: $A, 2, A, 4, B, -4, A, 4, B, -2, C$ (terminate)\
Episode 2: $B, -2, A, 4, B, -4,C$ (terminate)

From this we want to estimate $v_\pi(A)$ and $v_\pi(B)$ ($v_\pi(C) = 0$ since it is a terminating state). What will $V(A)$ and $V(B)$ be after the two episodes

(a) if we use first-visit Monte-Carlo?\
(b) if we use every-visit Monte-Carlo?

__Answer__

Since $\gamma = 1$, the return $G_t$ is simply the sum of all rewards received after time $t$. For episode 1, we have:
$$
\begin{align}
A: G_0 &= 4 \\
A: G_1 &= 2 \\
B: G_2 &= -2 \\
A: G_3 &= 2 \\
B: G_4 &= -2
\end{align}
$$

For episode 2, we have:
$$
\begin{align}
B: G_0 &= -2 \\
A: G_1 &= 0 \\
B: G_2 &= -4
\end{align}
$$

__(a) if we use first-visit Monte-Carlo?__

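First-visit Monte Carlo keeps only the return that follows the first visit to each state in each episode. From
$$
V(S_t) \leftarrow \text{average}(\text{Returns}(S_t))
$$
we obtain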
$$
\begin{align}
V(A) &= \frac{1}{2} (4 + 0) = 2 \\
V(B) &= \frac{1}{2} (-2 + (-2)) = -2
\end{align}
$$

__(b) if we use every-visit Monte-Carlo?__

Every-visit Monte Carlo averages the returns that follow every visit to each state:

$$
\begin{align}
V(A) &= \frac{1}{4} (4 + 2 + 2 + 0) = 2 \\
V(B) &= \frac{1}{4} (-2 + (-2) + (-2) + (-4)) = -2.5
\end{align}
$$
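
The hand calculation above can be checked with a short script. This is a minimal sketch, not part of the original exercise: the episode encoding (a list of `(state, reward)` pairs with the terminal state omitted) and the function name `mc_prediction` are assumptions made for illustration.

```python
from collections import defaultdict

# Each episode is a list of (state, reward) pairs, where the reward is the one
# received on leaving that state. The terminal state C is omitted.
EPISODES = [
    [("A", 2), ("A", 4), ("B", -4), ("A", 4), ("B", -2)],
    [("B", -2), ("A", 4), ("B", -4)],
]


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Return a dict mapping each state to the average of its sampled returns."""
    returns = defaultdict(list)
    for episode in episodes:
        # Accumulate G_t backwards using G_t = R_{t+1} + gamma * G_{t+1}.
        g = 0.0
        per_step = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            per_step.append((state, g))
        per_step.reverse()  # back to chronological order

        seen = set()
        for state, g in per_step:
            if first_visit and state in seen:
                continue  # first-visit: ignore later visits in this episode
            seen.add(state)
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


v_first = mc_prediction(EPISODES, first_visit=True)
v_every = mc_prediction(EPISODES, first_visit=False)
print("first-visit:", v_first)
print("every-visit:", v_every)
```

The two calls differ only in the `first_visit` flag, which mirrors the fact that the two methods differ only in which returns are added to the average.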

## TD Prediction

We study an environment with three states, $\mathcal{S} = \{A,B,C\}$, where $C$ is a terminating state. A policy $\pi$ is used to observe the following:

$$
S_0 = B, R_1 = -2, S_1 = A, R_2 = 4, S_2 = B
$$

The discount rate is $\gamma = 1$.
Initialization: $V(A) = V(B) = V(C) = 0$.
We use TD(0) with constant step size $\alpha = 1$. What will $V(A)$ and $V(B)$ be after the updates?

__Answer__

From 
$$
V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]
$$
we obtain
$$
\begin{align}
V(S_0) &\leftarrow V(S_0) + \alpha[R_1 + \gamma V(S_1) - V(S_0)] \\
V(B)   &\leftarrow V(B) + \alpha[-2 + \gamma V(A) - V(B)] \\
V(B)   &\leftarrow 0 + \alpha[-2 + \gamma 0 - 0] = -2
\end{align}
$$
then we get
$$
\begin{align}
V(S_1) &\leftarrow V(S_1) + \alpha[R_2 + \gamma V(S_2) - V(S_1)] \\
V(A) &\leftarrow V(A) + \alpha[4 + \gamma V(B) - V(A)] \\
V(A) &\leftarrow 0 + \alpha[4 + \gamma \cdot (-2) - 0] = 2
\end{align}
$$
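
The two updates can also be replayed in code. The following is a small sketch under the assumptions of this exercise ($\alpha = 1$, $\gamma = 1$, all values initialized to zero); the helper name `td0_update` and the transition encoding are illustrative choices, not something given in the exercise.

```python
def td0_update(V, state, reward, next_state, alpha=1.0, gamma=1.0):
    """Apply one TD(0) update to V in place and return the TD error."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 0.0, "C": 0.0}

# Observed transitions as (S_t, R_{t+1}, S_{t+1}).
transitions = [("B", -2, "A"), ("A", 4, "B")]
for s, r, s_next in transitions:
    td0_update(V, s, r, s_next)

print(V)
```

Note that the second update already uses the new value $V(B) = -2$, which is what distinguishes TD(0) bootstrapping from waiting for the end of the episode.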

## Q-learning

The environment consists of three states $\mathcal{S} = \{Room 0, Room 1, Room 2\}$. Room 2 is a terminal state. The three rooms are in a corridor and the agent can take the actions $\mathcal{A} = \{Left, Right\}$. Consider trying to learn a policy for this environment using Q-learning. We use the step size $\alpha = 1$ and the discount rate $\gamma = 1$.

Initialization: $Q(s, a) = 0$ for all $s$ and $a$ except $Q(Room 1, Right) = 10$.

(a) We start in $S = Room 0$ and choose action $A = Right$. The agent moves to $S' = Room 1$ and gets reward $R = -1$. The Q-values are updated. What is $Q(s, a)$ for all pairs now?

(b) We continue from part (a). We are now in $S = Room 1$ and take action $A = Left$. The agent moves to $S' = Room 0$ and gets reward $R = -1$. The Q-values are updated. What is $Q(s, a)$ for all pairs now?

(c) After the two steps above, what is the greedy policy with respect to $Q$?

__Answer__

__(a) We start in $S = Room 0$ and choose action $A = Right$. The agent moves to $S' = Room 1$ and gets reward $R = -1$. The Q-values are updated. What is $Q(s, a)$ for all pairs now?__

From
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \underset{a}{\max}Q(S_{t+1}, a) - Q(S_t, A_t)]
$$

we obtain
$$
\begin{align}
Q(Room 0, Right) 
&\leftarrow Q(Room 0, Right) + \alpha [-1 + \gamma \underset{a}{\max}Q(Room 1, a) - Q(Room 0, Right)] \\
&\leftarrow 0 + \alpha [-1 + 10 - 0] = 9
\end{align}
$$

noting that
$$
\underset{a}{\max}Q(Room 1, a) = Q(Room 1, Right) = 10
$$
All other Q-values are unchanged: $Q(Room 1, Right) = 10$ and the remaining entries are $0$.

__(b) We continue from part (a). We are now in $S = Room 1$ and take action $A = Left$. The agent moves to $S' = Room 0$ and gets reward $R = -1$. The Q-values are updated. What is $Q(s, a)$ for all pairs now?__

$$
\begin{align}
Q(Room 1, Left) 
&\leftarrow Q(Room 1, Left) + \alpha [-1 + \gamma \underset{a}{\max}Q(Room 0, a) - Q(Room 1, Left)] \\
&\leftarrow 0 + \alpha [-1 + 9 - 0] = 8
\end{align}
$$

noting that
$$
\underset{a}{\max}Q(Room 0, a) = Q(Room 0, Right) = 9
$$

__(c) After the two steps above, what is the greedy policy with respect to $Q$?__

We now have
$$
\begin{align}
Q(Room 0, Left) &= 0 \\
Q(Room 0, Right) &= 9 \\
Q(Room 1, Left) &= 8 \\
Q(Room 1, Right) &= 10
\end{align}
$$

so
$$
\begin{align}
\pi(Room 0) &= Right \\
\pi(Room 1) &= Right
\end{align}
$$
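
As a cross-check, the two Q-learning updates and the resulting greedy policy can be reproduced in a few lines. This is a sketch under the exercise's settings ($\alpha = 1$, $\gamma = 1$); the dictionary layout and the helper name `q_update` are assumptions made for illustration.

```python
ACTIONS = ("Left", "Right")
NON_TERMINAL = ("Room 0", "Room 1")
TERMINAL = "Room 2"


def q_update(Q, s, a, r, s_next, alpha=1.0, gamma=1.0):
    """Apply one Q-learning update to Q in place."""
    # The value of the terminal state is zero by definition.
    if s_next == TERMINAL:
        best_next = 0.0
    else:
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


# Initialization: all zeros except Q(Room 1, Right) = 10.
Q = {(s, a): 0.0 for s in NON_TERMINAL for a in ACTIONS}
Q[("Room 1", "Right")] = 10.0

q_update(Q, "Room 0", "Right", -1, "Room 1")  # part (a)
q_update(Q, "Room 1", "Left", -1, "Room 0")   # part (b)

# Part (c): the greedy policy picks the action with the largest Q-value.
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in NON_TERMINAL}

print(Q)
print(greedy)
```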

## Q-learning, optimal policy, `Taxi-v3`

Use Q-learning to find the optimal policy for the `Taxi-v3` environment. This is an undiscounted problem, i.e. $\gamma = 1$. You can use $\alpha = 0.1$, $\varepsilon = 0.1$ and train for at least 10 000 episodes.

Doing this, you should get an estimated Q-function such that (at least in most states) the greedy policy w.r.t. $Q$ is optimal. In the quiz, the question will be e.g.:

"Give the optimal action in state $s = 410$" for a few different states. So be sure that you have code ready to answer these types of questions. Since there is a risk that Q-learning will find a policy that is not optimal in every possible state, you pass this part even if you only give the correct answer in 80% of the states asked for.

__Remember:__ When you have finished training your policy, you should use the greedy policy to answer the questions. If you use an $\varepsilon$-greedy policy with $\varepsilon > 0$, there is a chance that you return an action that is not greedy w.r.t. $Q$.
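
The difference between the two ways of picking an action can be sketched as follows. The function names and the example Q-values are made up for illustration; only the rule itself (explore with probability $\varepsilon$, otherwise take the argmax) comes from the text above.

```python
import random


def greedy_action(q_row):
    """Index of the largest Q-value; ties go to the lowest index."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return greedy_action(q_row)


# Hypothetical Q-values for one state with six actions.
q_row = [-3.0, -1.5, -2.0, -4.0, -9.0, -9.0]

answer = greedy_action(q_row)                      # use this for the quiz
training_action = epsilon_greedy_action(q_row, 0.1)  # use this while training
print(answer, training_action)
```

Only `greedy_action` is deterministic given $Q$; the $\varepsilon$-greedy choice can differ from it whenever the exploration branch is taken.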

__Tips:__ When you are done training your agent, it is also fun to use the test policy from Tinkering Notebook 3 to see your agent in action. This can also give you a feeling for whether the agent seems to behave optimally.

__Note:__ The training of the agent is random due to the $\varepsilon$-greedy exploration. So even if you train everything correctly, you may be unlucky and end up with an agent that has not learned the optimal action in the specific state we ask for. You only need to get 80% (4) of the actions correct. If you only get e.g. 3 actions correct, you can just retake the quiz (you will then be asked for actions in different states) to see if it was just bad luck or if you have to look over your code one more time.
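
A training loop for this task could look roughly like the sketch below. It is not the course's reference solution. To keep the block runnable without extra packages, it trains on `CorridorEnv`, a hypothetical four-state stand-in whose `reset` and `step` methods are assumed to mimic the Gymnasium-style return values; the function name `q_learning` and its parameters are also illustrative.

```python
import random


class CorridorEnv:
    """Hypothetical stand-in with a Gymnasium-style reset/step interface.

    States 0..3 lie in a row and state 3 is terminal.
    Action 0 moves left, action 1 moves right; every step gives reward -1.
    """

    n_states, n_actions = 4, 2

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        terminated = self.state == self.n_states - 1
        return self.state, -1.0, terminated, False, {}


def q_learning(env, n_states, n_actions, episodes,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, _info = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy.
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[state][a])
            next_state, reward, terminated, truncated, _info = env.step(action)
            # Do not bootstrap from a terminal state.
            if terminated:
                target = reward
            else:
                target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            done = terminated or truncated
    return Q


env = CorridorEnv()
Q = q_learning(env, env.n_states, env.n_actions, episodes=2000)

# Greedy policy w.r.t. the learned Q (the entry for the terminal state is arbitrary).
policy = [max(range(env.n_actions), key=lambda a: Q[s][a])
          for s in range(env.n_states)]
print(policy)
```

For `Taxi-v3`, assuming the Gymnasium package is installed, one would pass `gymnasium.make("Taxi-v3")` with `n_states=500` and `n_actions=6` instead of the stand-in, and answer a quiz question for state 410 by taking the argmax of `Q[410]`. Older `gym` releases return fewer values from `reset` and `step`, so the unpacking would need to be adjusted there.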