## Temporal-Difference Learning
Dynamic programming (DP) solutions to RL problems require a perfect model of the environment. Monte Carlo (MC) methods do not require a model; instead, they learn from experience, that is, from sample sequences of states, actions, and rewards obtained from actual or simulated interaction with an environment. Temporal-difference (TD) learning combines both ideas: like MC, it learns directly from raw experience without a model, and like DP, it updates its estimates based in part on other estimates (bootstrapping), without waiting for a final outcome.

### 1. SARSA
Here, we are interested in learning the value of state-action pairs. Starting with an arbitrary policy, we update our estimates of the values using the following equation:
$$Q(S_t, A_t) \gets Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] $$

where $Q(S_t, A_t)$ is the estimated value of taking action $A_t$ in state $S_t$, $\alpha$ is the step size, and $\gamma$ is the discount factor.
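
As a minimal sketch, a single application of this update might look like the following in Python. The function name `sarsa_update`, the dictionary-based table, and the numeric values are illustrative assumptions, not part of the original text.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma):
    """Apply one SARSA update to q in place and return the TD error."""
    # TD error: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
    td_error = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    # Move the current estimate a fraction alpha toward the TD target.
    q[(s, a)] += alpha * td_error
    return td_error


# Made-up estimates for two state-action pairs (assumed for illustration).
q = {("s0", "right"): 0.5, ("s1", "right"): 2.0}
delta = sarsa_update(q, "s0", "right", r=1.0,
                     s_next="s1", a_next="right", alpha=0.1, gamma=0.9)
```

With these numbers the TD error is $1 + 0.9 \cdot 2.0 - 0.5 = 2.3$, so the estimate for `("s0", "right")` moves from $0.5$ to approximately $0.5 + 0.1 \cdot 2.3 = 0.73$.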

An update is performed after every step from a nonterminal state. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. The update rule uses every element of the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which makes up a transition from one state-action pair to the next; this is what gives the algorithm its name, SARSA. Here's the SARSA algorithm:

$\text {Initialize } Q(s,a), \forall s \in S, a \in A(s), \text {arbitrarily and Q(terminal state, $\cdot$) = 0}\\
\text {Repeat (for each episode):}\\
\quad \text {Initialize S}\\
\quad \text {Choose A from S using policy derived from Q (e.g. $\epsilon$-greedy)}\\
\quad \text {Repeat (for each step of episode):}\\
\quad \quad \text {Take action A, observe R, S'}\\
\quad \quad \text {Choose A' from S' using policy derived from Q (e.g. $\epsilon$-greedy)}\\
\quad \quad Q(S, A) \gets Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]\\
\quad \quad S \gets S'; A \gets A';\\
\quad \text {until S is terminal}\\
$
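
The algorithm above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the corridor environment in `step`, the helper `epsilon_greedy`, and all hyperparameter values are hypothetical and chosen only to keep the example small.

```python
import random


def step(state, action, n_states):
    """Hypothetical corridor: action 0 moves left, action 1 moves right.

    Every step costs -1; the rightmost state is terminal.
    """
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == n_states - 1
    return -1.0, next_state, done


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    best = max(q[state])
    # Break ties between equally valued actions at random.
    return rng.choice([a for a in range(n_actions) if q[state][a] == best])


def sarsa(n_states=5, n_actions=2, episodes=2000,
          alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Q(s, a) initialized to zero; the terminal row is never updated,
    # so Q(terminal, .) stays 0 as the algorithm requires.
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0                                              # Initialize S
        a = epsilon_greedy(q, s, n_actions, epsilon, rng)  # Choose A from S
        done = False
        while not done:
            r, s_next, done = step(s, a, n_states)         # Take A, observe R, S'
            a_next = epsilon_greedy(q, s_next, n_actions, epsilon, rng)
            q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])
            s, a = s_next, a_next
    return q


q_table = sarsa()
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
```

In this toy corridor, the learned greedy policy should favor moving right (action 1) in every nonterminal state, since that is the shortest route to the terminal state. Because the next action `a_next` is drawn from the same $\epsilon$-greedy policy that generates behavior, the values learned are those of that policy, exploration included.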
