# Q-learning, SARSA and Temporal Difference methods
Q-learning is an algorithm for approximating $q_*$, the *optimal action-value function*. The learned function $Q$ approximates $q_*$ regardless of the policy being followed, and it converges to $q_*$ with probability 1 provided that all **state-action pairs** continue to be updated and the step sizes satisfy the usual stochastic approximation conditions. The update rule is:

\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right].
\end{equation}
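To make the arithmetic concrete, here is a minimal sketch of a single application of the update above. The function name `q_learning_update`, the list-of-lists table, and all numeric values are illustrative assumptions, not part of the original notes.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the TD error."""
    # Bootstrap from the largest action value in the next state.
    td_target = r + gamma * max(Q[s_next])
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Two states with two actions each; the values are made up for illustration.
Q = [[0.0, 1.0], [2.0, 4.0]]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Target: 1 + 0.9 * 4 = 4.6, TD error: 4.6 - 1 = 3.6, new Q[0][1]: 1 + 0.5 * 3.6 = 2.8
```

Note that the action actually taken in the next state never enters the computation; only the maximum over $Q(S_{t+1}, \cdot)$ does.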

In procedural form, the algorithm is:

\begin{equation}
\begin{split}
& \text{Initialize } Q(s, a), \text{ for all } s \in \mathcal{S}^+, a \in \mathcal{A}(s), \text{ arbitrarily except that } Q(\texttt{terminal}, \cdot) = 0 \\
& \text{Repeat (for each episode):} \\
& \quad \text{Initialize } S \\
& \quad \text{Repeat (for each step of episode):} \\
& \quad \quad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g., } \epsilon\text{-greedy)} \\
& \quad \quad \text{Take action } A, \text{ observe } R, S' \\
& \quad \quad Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right] \\
& \quad \quad S \leftarrow S' \\
& \quad \text{until } S \text{ is terminal}
\end{split}
\end{equation}
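The procedure above can be sketched in plain Python. The environment here is a hypothetical five-state corridor (not from the book), chosen only because its optimal values are easy to check by hand: every step costs $-1$, so with $\gamma = 1$ we have $q_*(s, \text{right}) = -(4 - s)$. The helper names `step` and `epsilon_greedy` and all hyperparameters are assumptions.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical corridor)
ACTIONS = (0, 1)      # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics: reward -1 per step until the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return -1.0, next_state, next_state == N_STATES - 1

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q(terminal, .) stays 0 because the terminal state is never updated.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            reward, next_state, done = step(state, action)
            # Off-policy target: bootstrap from the greedy value of S'.
            target = reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q

Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

In this sketch the behavior policy is $\epsilon$-greedy, while the target always uses the greedy value, which is exactly the separation between behavior and target that the update rule relies on.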

Q-learning is called an **off-policy** method because its target uses $\max_a Q(S_{t+1}, a)$, so $Q$ directly approximates $q_*$, the optimal action-value function, independent of the policy being followed.

## SARSA
SARSA is an **on-policy** TD control algorithm. It is very similar to Q-learning, but instead of bootstrapping from the greedy action value $\max_a Q(S_{t+1}, a)$, it bootstraps from $Q(S_{t+1}, A_{t+1})$, where $A_{t+1}$ is the action actually selected by the policy being followed and learned about. The update rule is:

\begin{equation}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right].
\end{equation}

In procedural form, the algorithm is:
\begin{equation}
\begin{split}
& \text{Initialize } Q(s, a), \text{ for all } s \in \mathcal{S}^+, a \in \mathcal{A}(s), \text{ arbitrarily except that } Q(\texttt{terminal}, \cdot) = 0 \\
& \text{Repeat (for each episode):} \\
& \quad \text{Initialize } S \\
& \quad \text{Choose } A \text{ from } S \text{ using policy derived from } Q \text{ (e.g., } \epsilon\text{-greedy)} \\
& \quad \text{Repeat (for each step of episode):} \\
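& \quad \quad \text{Take action } A, \text{ observe } R, S' \\
& \quad \quad \text{Choose } A' \text{ from } S' \text{ using policy derived from } Q \text{ (e.g., } \epsilon\text{-greedy)} \\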
& \quad \quad Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right] \\
& \quad \quad S \leftarrow S'; A \leftarrow A' \\
& \quad \text{until } S \text{ is terminal}
\end{split}
\end{equation}
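A minimal Python sketch of the SARSA procedure follows, again on a hypothetical five-state corridor with a reward of $-1$ per step (an assumption for illustration, not an example from the book). The point to notice is that the next action is chosen before the update and is then the action actually executed. The smaller step size is also an assumption, picked to keep the on-policy estimates less noisy.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical corridor)
ACTIONS = (0, 1)      # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics: reward -1 per step until the goal."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    return -1.0, next_state, next_state == N_STATES - 1

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def sarsa(episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        action = epsilon_greedy(Q, state, epsilon, rng)
        while not done:
            reward, next_state, done = step(state, action)
            # On-policy: choose A' now; it is both the bootstrap action
            # and the action taken on the next step.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            target = reward + gamma * Q[next_state][next_action]
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action
    return Q

Q = sarsa()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

Because the target includes the occasional exploratory action, the learned values estimate the return of the $\epsilon$-greedy policy itself rather than $q_*$.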
The main difference between SARSA and Q-learning is that SARSA's target uses $Q(S', A')$, where the action $A'$ is chosen by the policy $\pi$ being followed, instead of the greedy value $\max_a Q(S', a)$.

**Example 6.6: Cliff Walking**
Run this example described on page 132 of the book.
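One possible way to set this example up is sketched below. It assumes the book's description of the task: a $4 \times 12$ grid, start in the bottom-left corner, goal in the bottom-right corner, a reward of $-1$ per step, and a reward of $-100$ plus a reset to the start for stepping into the cliff. The function names, the `method` flag, and the hyperparameters are assumptions of this sketch.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Return (reward, next_state, done) for the cliff-walking gridworld."""
    row = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if row == ROWS - 1 and 1 <= col <= COLS - 2:   # stepped into the cliff
        return -100.0, START, False
    return -1.0, (row, col), (row, col) == GOAL

def choose(Q, state, epsilon, rng):
    """Epsilon-greedy action selection (ties go to the first maximum)."""
    if rng.random() < epsilon:
        return rng.randrange(len(MOVES))
    values = Q[state]
    return values.index(max(values))

def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Train with method 'sarsa' or 'q_learning'; return (Q, episode returns)."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * len(MOVES) for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        state, done, total = START, False, 0.0
        action = choose(Q, state, epsilon, rng)
        while not done:
            reward, next_state, done = step(state, action)
            if method == "sarsa":
                # On-policy: bootstrap from the action the policy takes next.
                next_action = choose(Q, next_state, epsilon, rng)
                bootstrap = Q[next_state][next_action]
            else:
                # Off-policy: bootstrap from the greedy value of S'.
                bootstrap = max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * bootstrap - Q[state][action])
            if method != "sarsa":
                # Q-learning selects the next action after the update.
                next_action = choose(Q, next_state, epsilon, rng)
            state, action, total = next_state, next_action, total + reward
        returns.append(total)
    return Q, returns

for method in ("sarsa", "q_learning"):
    _, returns = train(method)
    print(method, "mean return over last 100 episodes:", sum(returns[-100:]) / 100)
```

The book reports that, with $\epsilon = 0.1$, SARSA learns a safer path away from the cliff and obtains better online returns, while Q-learning learns values for the optimal path along the edge but occasionally falls off because of exploration. The printed averages from this sketch depend on the seed and are only a rough check of that pattern.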


## Expected SARSA
Expected SARSA is an alternative to Q-learning that replaces the maximum over next actions with an expectation under the policy $\pi$, which makes it more robust to noise and variance in the estimates of the action values. It is defined as:

\begin{align}
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \mathbb{E}_\pi \left[ Q(S_{t+1}, A_{t+1}) \mid S_{t+1} \right] - Q(S_t, A_t) \right]\\
&\leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right].
\end{align}

The above update rule uses the expected value, rather than the maximum, over the next state-action pairs. This is useful when the action values are noisy and the maximum may not be representative of the true value of the next state. Because an expectation under $\pi$ never exceeds the maximum, the target is less prone to overestimation, and compared with SARSA it removes the variance caused by the random selection of $A_{t+1}$. Otherwise this rule is very similar to Q-learning, updating at every time step; when $\pi$ is the greedy policy, the two updates coincide.
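As a small worked example, the Expected SARSA target can be computed for an $\epsilon$-greedy policy as follows. The function names and the numbers are illustrative assumptions; ties between maximal actions are broken in favor of the first one, which is a simplification.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first maximum)."""
    n = len(q_values)
    probs = [epsilon / n] * n
    probs[q_values.index(max(q_values))] += 1.0 - epsilon
    return probs

def expected_sarsa_target(reward, q_next, epsilon, gamma=1.0):
    """R + gamma * sum_a pi(a | S') Q(S', a) for an epsilon-greedy pi."""
    probs = epsilon_greedy_probs(q_next, epsilon)
    expected_value = sum(p * q for p, q in zip(probs, q_next))
    return reward + gamma * expected_value

q_next = [1.0, 3.0]                                               # Q(S', .) for two actions
target_soft = expected_sarsa_target(0.0, q_next, epsilon=0.2)     # 0.1 * 1 + 0.9 * 3
target_greedy = expected_sarsa_target(0.0, q_next, epsilon=0.0)   # same as max(q_next)
```

With $\epsilon = 0$ the policy is greedy and the target equals the Q-learning target, while any $\epsilon > 0$ pulls the target below the maximum.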