## Temporal Difference learning: TD(0)

Temporal difference (TD) learning combines the bootstrapping used in dynamic programming with the sampling used in Monte Carlo, giving us another model-free policy evaluation algorithm.
TD learning starts from the incremental Monte Carlo update and replaces the return with a bootstrapped estimate, so that a single update both samples and bootstraps: $ V_\pi(s_t)\ \gets\ V_\pi(s_t)\ +\ \alpha(R\ +\ \gamma V_\pi(s_{t+1})\ -\ V_\pi(s_t))$.
The difference is that instead of waiting until the end of the episode to calculate the return $G_t$, we calculate the TD target $ R\ +\ \gamma V_\pi(s_{t+1})$ at every step of the episode.
This way, we bootstrap from existing estimates while sampling.
Like most reinforcement learning algorithms, Temporal Difference learning assumes an MDP setting.
Within that setting, it is a better choice than Monte Carlo methods when episodes are very long or the domain is non-episodic.
There are many versions of Temporal Difference learning; here we show the TD(0) version.
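
A minimal sketch of that difference, assuming a recorded episode given as `(state, reward, next_state)` triples where the reward belongs to the transition. The function names `mc_update` and `td0_update` and the two-step episode are illustrative and not part of the agent defined later in this notebook:

```python
from collections import defaultdict


def mc_update(values, episode, alpha, gamma):
    """Incremental (every-visit) Monte Carlo: needs the whole episode to build G_t."""
    g = 0.0
    for state, reward, _next_state in reversed(episode):
        g = reward + gamma * g  # return from this step to the end of the episode
        values[state] += alpha * (g - values[state])


def td0_update(values, transition, alpha, gamma):
    """TD(0): needs a single transition and the current estimate of the next state."""
    state, reward, next_state = transition
    td_target = reward + gamma * values[next_state]
    values[state] += alpha * (td_target - values[state])


# Hypothetical two-step episode: A -> B -> END, with a reward of 1 on the last step.
episode = [("A", 0.0, "B"), ("B", 1.0, "END")]

mc_values = defaultdict(float)
mc_update(mc_values, episode, alpha=0.5, gamma=1.0)  # one update pass after the episode

td_values = defaultdict(float)
for transition in episode:  # one update per step, while the episode unfolds
    td0_update(td_values, transition, alpha=0.5, gamma=1.0)
```

After this single episode the Monte Carlo estimate of `A` has already moved towards the final reward, whereas the TD(0) estimate of `A` is unchanged because it bootstrapped from the still-zero estimate of `B`.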

##### TD methods

There is actually an entire spectrum of ways we can blend Monte Carlo and dynamic programming using a method called TD(λ).
 - when $λ = 0$, we get the TD-learning formulation above, hence giving us the alias TD(0).
 - when $λ = 1$, we recover Monte Carlo policy evaluation, depending on the formulation used.
 - when $0 < λ < 1$, we get a blend of these two methods.

For a more thorough treatment of TD(λ), please see Sections 7.1 and 12.1-12.5 of the book by Sutton and Barto.
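
For orientation, one common way to write the target that TD(λ) aims at is the λ-return, which averages $n$-step returns with geometrically decaying weights. The notation $G_{t:t+n}$ for the $n$-step return and the time-indexed rewards $R_{t+1}, R_{t+2}, \dots$ follow Sutton and Barto and are not used elsewhere in this notebook:

$$ G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n}, \qquad G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_\pi(s_{t+n}) $$

With $\lambda = 0$ only the $n = 1$ term survives, which is the one-step TD target; as $\lambda \to 1$ in an episodic task, the weight shifts to the full return $G_t$.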

### Characteristics of Temporal Difference learning:

##### Model free
Temporal Difference learning methods are model free, i.e. they do not require full knowledge of all states or transition dynamics.

##### Convergence
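For a fixed policy, tabular TD(0) converges to the true value function $V^{\pi}$, provided every state keeps being visited and the learning rate decays appropriately. With function approximation the guarantees are weaker.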

##### Bootstrapping & Sample based
Temporal Difference learning methods combine bootstrapping with sampling.
 - Bootstrapping is shared with the dynamic programming methods seen previously.
 The update formula is a Bellman backup over just one transition. It is executed at every transition, and the value of the current state is bootstrapped from the current estimate of the next state's value.
 The reason we are able to back up over just one transition is that we leverage the **Markovian assumption** of the domain.
 - Sampling is shared with Monte Carlo methods and is what makes the algorithm model free.

##### Biased estimation of state values
In TD learning, we bootstrap from the next state's value estimate to get the current state's value estimate, so the estimate of the current state is biased by any error in the estimated value of the next state.

##### Finite & Infinite Horizon
Temporal Difference learning methods can be used in both finite and infinite horizon settings, i.e. they work with both episodic and non-episodic domains.
Infinite horizon settings are possible because the Temporal Difference update of the state value function happens at each step, not after the end of an episode as in Monte Carlo.

##### Low variance
The variance of Monte Carlo evaluation is relatively higher than TD learning because in Monte Carlo evaluation, we consider many transitions in each episode with each transition contributing variance to our estimate.
On the other hand, TD learning only considers one transition per update, so we do not accumulate variance as quickly.
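
The variance argument can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions that do not come from the notebook: every step yields an independent zero-mean, unit-variance Gaussian reward, the discount factor is 1, and the TD target bootstraps from the exact next-state value (which is 0 here). The helper name `sample_targets` is made up for this example:

```python
import random
import statistics


def sample_targets(num_steps, num_samples, seed=0):
    """Draw Monte Carlo returns and one-step TD targets for the same start state."""
    rng = random.Random(seed)
    mc_targets, td_targets = [], []
    for _ in range(num_samples):
        rewards = [rng.gauss(0.0, 1.0) for _ in range(num_steps)]
        # Monte Carlo target: the sum of all rewards in the episode (gamma = 1).
        mc_targets.append(sum(rewards))
        # TD target: first reward plus the (assumed exact) next-state value of 0.
        td_targets.append(rewards[0] + 0.0)
    return mc_targets, td_targets


mc_targets, td_targets = sample_targets(num_steps=20, num_samples=5000)
mc_variance = statistics.variance(mc_targets)
td_variance = statistics.variance(td_targets)
print("MC target variance:", round(mc_variance, 2))
print("TD target variance:", round(td_variance, 2))
```

Under these assumptions the Monte Carlo target accumulates the noise of all 20 rewards, while the TD target carries the noise of only one. In practice the TD target also inherits whatever error the next-state estimate has, which is the bias discussed above.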

##### Low data efficiency
Monte Carlo is generally more data efficient than TD(0).
In Monte Carlo, we update the value of a state based on the return of the entire episode, so if there are highly positive or negative rewards many steps in the future, these rewards are immediately incorporated into the update of *every state* visited before them in the episode.

On the other hand, in TD(0) we update the value of a state using only the reward at the current step and the current estimate of the value of the next state.
This means that a highly positive or negative reward many steps in the future is at first only incorporated into the value of the state where it is received.
If a highly rewarding episode has length L, we may need to experience that episode *L times* for the information to travel all the way back to the starting state.
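
The "L times" claim can be checked on a toy problem. The sketch below assumes a deterministic chain of `length` states in which only the last transition pays a reward of 1, with all values starting at 0. The function name and the chain itself are illustrative, not part of the Grid World agent:

```python
def episodes_until_start_learns(length, alpha=1.0, gamma=1.0):
    """Count TD(0) episodes needed before the start state's value becomes non-zero."""
    # values[length] is the terminal state; its value stays 0.
    values = [0.0] * (length + 1)
    episodes = 0
    while values[0] == 0.0:
        episodes += 1
        # Walk the chain from start to end, updating after every step.
        for state in range(length):
            reward = 1.0 if state == length - 1 else 0.0
            td_target = reward + gamma * values[state + 1]
            values[state] += alpha * (td_target - values[state])
    return episodes


episodes_needed = episodes_until_start_learns(length=5)
print(episodes_needed)
```

Each episode pushes the reward information back by exactly one state, so the count equals the chain length. A Monte Carlo update would have moved the start state's value after the first episode.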

##### Epsilon greedy policy
Epsilon greedy policies are a simple way to balance exploration and exploitation, and they combine well with Temporal Difference learning methods.
We use an epsilon value that decays with each episode and denotes the probability that the next action will be random.
This way, the agent explores more in the early episodes, where epsilon is large, and exploits what it has learned in the later episodes, where epsilon is near 0.

In our case, epsilon is recomputed as follows whenever an action is selected:
$ \epsilon = 1 / ( C_{\epsilon} * F_{\epsilon})$, where $ C_{\epsilon} $ is a counter that starts at $1$ and increments after each episode has ended, whereas $ F_{\epsilon} $ is a constant factor, here $0.2$.
With these values $\epsilon \geq 1$ during the first five episodes, so every action in those episodes is random.
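
To get a feel for this schedule, the following sketch evaluates the formula for a few counter values. The function name `epsilon_for` is made up for this example, and the default factor of `0.2` is taken from the agent's constructor further below:

```python
def epsilon_for(counter, factor=0.2):
    """Epsilon used once the episode counter has reached `counter` (starting at 1)."""
    return 1 / (counter * factor)


schedule = {counter: epsilon_for(counter) for counter in (1, 5, 10, 50)}
for counter, epsilon in schedule.items():
    # Values of 1 or more mean that every action is random.
    print(counter, round(epsilon, 2), "always random" if epsilon >= 1 else "")
```

The value is not clipped to 1, which is harmless here because it is only compared against a random number drawn from $[0, 1)$.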

##### Discount factor
The discount factor must take a value in the range $[0, 1]$ and in our case: `self.discount_factor = 1`.
Setting it to $1$ means that future rewards count as much as immediate rewards.

##### Learning rate
The learning rate *usually* takes a value in the range $[0, 1]$.
 - values closer to $0$ give a higher weight to older data, since each update changes the estimate only slightly.
 - values closer to $1$ give a higher weight to newer data, which can help learning in non-stationary domains; with a value of exactly $1$ the old estimate is replaced by the new target.
 - values bigger than $1$ overshoot the target and are generally avoided.

The learning rate can also be a **decaying learning rate**, like we did in Incremental Monte Carlo.
Because of the similarities between Incremental Monte Carlo and TD(0), we use a decaying learning rate for TD(0) too.
It is initialized to $1$ and decays with increasing number of episodes.
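
The effect of the learning rate can be seen by feeding the same sequence of targets through the update with different step sizes. This is a sketch with made-up targets, and the helper `run_updates` is not part of the agent. Note that the decaying schedule here is $1/n$ per update, which is simpler than the per-episode decay used in this notebook:

```python
def run_updates(targets, step_size):
    """Apply estimate += alpha * (target - estimate) for each target in order."""
    estimate = 0.0
    for n, target in enumerate(targets, start=1):
        estimate += step_size(n) * (target - estimate)
    return estimate


# Three old observations of 0 followed by one new observation of 10.
targets = [0.0, 0.0, 0.0, 10.0]

decaying = run_updates(targets, lambda n: 1.0 / n)  # 1/n reproduces the sample mean
near_one = run_updates(targets, lambda n: 0.9)      # dominated by the newest target
near_zero = run_updates(targets, lambda n: 0.1)     # dominated by the older targets
print(decaying, near_one, near_zero)
```

With a step size of $1/n$ the estimate equals the plain average of all targets, while a constant step size weights recent targets more heavily the closer it is to 1.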

In [1]:
import numpy as np
import random
from collections import defaultdict
from environment import Env


class Tuple:
    def __init__(self, state, action, reward, next_state, next_action, done):
        self.state = state
        self.action = action
        self.reward = reward
        self.next_state = next_state
        self.next_action = next_action
        self.done = done

In [2]:
# Temporal Difference Agent which learns from each tuple during an episode
# render sleep time updated to 0.01
class TDAgent:
    def __init__(self, actions):
        self.width = 5
        self.height = 5
        self.actions = actions
        self.discount_factor = 1
        self.decaying_epsilon_counter = 1
        self.decaying_epsilon_mul_factor = 0.2
        self.epsilon = None
        self.tuple = None
        self.learning_rate = 1
        self.value_table = defaultdict(float)

##### Tuple class

We define a class Tuple that will help us save tuples of trajectories in the following fashion:

$(S, A, R, S_{next}, A_{next}, D)$, where:
 - $S$ - current state
 - $A$ - current action
 - $R$ - reward
 - $S_{next}$ - next state
 - $A_{next}$ - next action
 - $D$ - boolean denoting whether the current Tuple is the last one in the episode.

Notice that:
 - we save $S_{next}$ in the `Tuple` class because its value estimate is needed for bootstrapping when updating the value of the current state. $A_{next}$ is stored alongside it, although the TD(0) update itself does not read it.
 - the `self.tuple` variable of the class `TDAgent` is not a list of tuples; it only contains the last tuple sampled in the episode.
 This is why $S_{next}$, $A_{next}$ and $D$ are saved in the current tuple: otherwise there would be nothing to bootstrap from.

##### Initialization of TDAgent

For Temporal Difference learning we keep track of the following:
 - `self.value_table` holds the state values, initially set to $0$.
 - `self.tuple` is not a list of tuples; it holds only the latest sampled tuple.
 It is initially set to `None` and is overwritten at each step.
 - `self.learning_rate` is initialized to $1$ and decays with the increasing number of episodes.
 - `self.discount_factor` is set to $1$.
 - `self.decaying_epsilon_counter` and `self.decaying_epsilon_mul_factor` are the counter and constant factor used to compute `self.epsilon`.

### Temporal Difference learning

Temporal Difference learning combines sampling with bootstrapping.
Recall that the update rule for Incremental Monte Carlo was the following:

$ V^{\pi}(S) = V^{\pi}(S) + \alpha * [ G(S) - V^{\pi}(S) ] $

Recall that $G(S)$ is the return obtained by rolling out the policy from time step $t$, starting at state $s_t$, until termination.
Let's now replace $G(S)$ with a Bellman backup like in dynamic programming.
That is, let's replace $G(S)$ with: $R(S) + \gamma * V^{\pi}(S_{next})$, where $R(S)$ is a sample of the reward at the current time step and $V^{\pi}(S_{next})$ is our current estimate of the value at the next state.
Making this substitution gives us the TD-learning update:

$ V^{\pi}(S) = V^{\pi}(S) + \alpha [R(S) + \gamma V^{\pi}(S_{next}) - V^{\pi}(S)] $ where:
 - $V^{\pi}(S)$ - state value of current state following the policy $\pi$
 - $V^{\pi}(S_{next})$ - the current estimate of the state value of the next state, following the policy $\pi$.
 - $\alpha$ - the **learning rate**.
 The learning rate can take any value in the range $[0, 1]$.
 Values closer to 0 mean that we put more weight on older experiences, whereas values closer to 1 mean that we put more weight on the latest experiences.
 In our case, the learning rate starts at $1$ and is set to $1 / (\text{episode} + 2)$ at the end of each episode.
 - $R(S)$ - the reward at state $S$.
 - $\gamma$ - the **discount factor**.
 Traditionally used when calculating returns, it is now used when calculating **expected returns**, i.e. state values.

The difference $R(S) + \gamma V^{\pi}(S_{next}) - V^{\pi}(S)$ is commonly referred to as the **TD error**.

The sum $R(S) + \gamma V^{\pi}(S_{next})$ is referred to as the **TD target**.
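
As a quick numeric check of these definitions, take the purely illustrative values $V^{\pi}(S) = 0.5$, $V^{\pi}(S_{next}) = 1$, $R(S) = -1$, $\gamma = 1$ and $\alpha = 0.5$ (none of them come from the Grid World runs):

$$ \text{TD target} = -1 + 1 \cdot 1 = 0, \qquad \text{TD error} = 0 - 0.5 = -0.5, \qquad V^{\pi}(S) \gets 0.5 + 0.5 \cdot (-0.5) = 0.25 $$

The estimate moves halfway from its old value towards the TD target, as expected for $\alpha = 0.5$.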


In [3]:
class TDAgent(TDAgent):
    # for every tuple, agent updates v function of visited states
    def update(self):
        state_name = str(self.tuple.state)
        next_state_name = str(self.tuple.next_state)

        V = self.value_table[state_name]
        next_V = self.value_table[next_state_name]
        reward = self.tuple.reward

        TD_Target = reward + self.discount_factor * next_V
        TD_Error = TD_Target - V
        V = V + self.learning_rate * TD_Error

        self.value_table[state_name] = V

        if self.tuple.done:
            self.value_table[next_state_name] = reward

### Other methods

##### Helper methods

In [4]:
class TDAgent(TDAgent):
    # get action for the state according to the v function table
    # agent pick action of epsilon-greedy policy
    def get_action(self, state):
        self.epsilon = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)
        if np.random.rand() < self.epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the v function table
            next_state = self.possible_next_state(state)
            action = self.arg_max(next_state)
        return int(action)

In [5]:
class TDAgent(TDAgent):
    # store the latest sampled tuple (overwrites the previous one)
    def save_tuple(self, sample):
        self.tuple = sample

    # compute arg_max; if multiple candidates exist, pick one randomly
    @staticmethod
    def arg_max(next_state):
        max_index_list = []
        max_value = next_state[0]
        for index, value in enumerate(next_state):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)

In [6]:
class TDAgent(TDAgent):
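    # get the value estimates of the four neighbouring states, one per action
    # (the state's own value is used when a move would leave the grid)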
    def possible_next_state(self, state):
        col, row = state
        next_state = [0.0] * 4

        if row != 0:
            next_state[0] = self.value_table[str([col, row - 1])]
        else:
            next_state[0] = self.value_table[str(state)]
        if row != self.height - 1:
            next_state[1] = self.value_table[str([col, row + 1])]
        else:
            next_state[1] = self.value_table[str(state)]
        if col != 0:
            next_state[2] = self.value_table[str([col - 1, row])]
        else:
            next_state[2] = self.value_table[str(state)]
        if col != self.width - 1:
            next_state[3] = self.value_table[str([col + 1, row])]
        else:
            next_state[3] = self.value_table[str(state)]

        return next_state

##### Main loop

In [7]:
class TDAgent(TDAgent):
    # main loop: run 1000 episodes, updating the value table after every step
    def mainloop(self, env, verbose=False):
        for episode in range(1000):
            state = env.reset()
            action = self.get_action(state)
            # reward associated with `state`; 0 for the start state
            reward = 0

            while True:
                env.render()

                # forward to next state. reward is number and done is boolean
                next_state, next_reward, done = env.step(action)

                # get next action
                next_action = self.get_action(next_state)

                # save only the latest tuple
                self.save_tuple(Tuple(state, action, reward, next_state, next_action, False))
                # update v values immediately
                self.update()
                # clear tuple
                self.tuple = None

                state = next_state
                action = next_action
                reward = next_reward

                # at the end of each episode, update the terminal state and print episode info
                if done:
                    # ---- Terminal State
                    # save only the latest tuple
                    self.save_tuple(Tuple(state, action, reward, state, action, True))
                    # update v values immediately
                    self.update()
                    # clear tuple
                    self.tuple = None
                    # ----

                    # decaying epsilon: a larger counter gives a smaller epsilon
                    self.decaying_epsilon_counter = self.decaying_epsilon_counter + 1
                    # decaying learning rate
                    self.learning_rate = 1 / (episode + 2)

                    if verbose:
                        print("episode : ", episode, "\t[3, 2]: ", round(self.value_table["[3, 2]"], 2),
                              " [2, 3]:", round(self.value_table["[2, 3]"], 2), " [2, 2]:", round(self.value_table["[2, 2]"], 2),
                              "\tepsilon: ", round(self.epsilon, 2))
                    break

In [None]:
# main
if __name__ == "__main__":
    env = Env()
    agent = TDAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env)
    except:
        pass

### Results

TD(0) converges to a solution in fewer than 70 episodes.

Crucial to making TD(0) find a solution in Grid World is the **decaying learning rate**, which decays with the increasing number of episodes.
The reason is that TD(0) learns state values, not action values: even after the agent has found a good policy, a small penalty incurred while exploring in the late episodes propagates back to the neighbouring states regardless of which action caused it.
With a large learning rate, the agent would lower the values of the states around the penalty and do everything to avoid them, even though some of those states lead to a solution whenever the correct action is chosen, i.e. most of the time.