## SARSA
SARSA is another Temporal Difference (TD) learning algorithm that bootstraps while sampling.
It follows the same principles as TD learning, but it estimates and updates Q values rather than V values.
The update formula is similar to that of TD learning: $ Q^{\pi}(s_t, a_t) \gets Q^{\pi}(s_t, a_t) + \alpha[r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t)] $.
SARSA needs more memory: where TD learning stores $n$ V values, SARSA stores $n \times m$ Q values, where $n$ is the number of states and $m$ is the number of actions.
Nevertheless, SARSA is the better choice for more complex MDP problems, because Q values capture how the expected return depends on the chosen action and not only on the state.

SARSA gets its name from the parts of the trajectory used in the update equation: $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$, i.e. State, Action, Reward, State, Action.

### Characteristics of SARSA:

##### Model free
SARSA is model free, i.e. it does not require full knowledge of all states or transition dynamics.

##### On policy
 - On policy methods attempt to evaluate or improve the policy that is used to make decisions.
 - Off policy methods evaluate or improve a policy different from that used to generate the data.
 A major advantage of off policy methods is that they can learn the optimal policy **regardless** of the policy the agent actually follows while collecting data.

SARSA in particular is an on policy algorithm.
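
To make the distinction concrete, the sketch below compares the bootstrapped target of an on policy update (SARSA) with that of an off policy update that bootstraps from the greedy action, as Q-learning does. The Q values, reward and chosen action are made-up numbers used only for illustration.

```python
# Hypothetical Q values of the next state, one entry per action.
next_q = [0.0, 0.5, 2.0, 1.0]
reward, gamma = -1.0, 0.9

# On policy (SARSA): bootstrap from the action the agent's own policy chose,
# here assumed to be an exploratory action picked by epsilon-greedy.
next_action = 1
sarsa_target = reward + gamma * next_q[next_action]

# Off policy (e.g. Q-learning): bootstrap from the greedy action,
# no matter which action the agent actually takes next.
greedy_target = reward + gamma * max(next_q)

print("on policy target: ", sarsa_target)
print("off policy target:", greedy_target)
```

The two targets coincide only when the action chosen by the behaviour policy happens to be the greedy one.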

##### Q values: state-action pair values
SARSA evaluates and updates Q values, i.e. state-action pair values, instead of V values, i.e. state values.
This means that the expected future sum of rewards is tied to state-action pairs rather than to states.
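
As a rough illustration of the difference in representation, the sketch below builds a V table and a Q table for a hypothetical 5x5 grid with 4 actions. The sizes and names are assumptions chosen for the example, not values taken from the environment used later.

```python
n_states, n_actions = 25, 4  # hypothetical 5x5 grid with 4 moves

# V table: one estimate per state.
v_table = {s: 0.0 for s in range(n_states)}

# Q table: one estimate per (state, action) pair.
q_table = {s: [0.0] * n_actions for s in range(n_states)}

n_v = len(v_table)
n_q = sum(len(row) for row in q_table.values())
print(n_v, n_q)  # n versus n * m stored values

# With Q values, a greedy action can be read off the table directly.
q_table[0][2] = 1.0
greedy_action = max(range(n_actions), key=lambda a: q_table[0][a])
```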

##### Convergence
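SARSA converges to the optimal action-value function if every state-action pair is visited infinitely often, the policy is GLIE, and the learning rate satisfies the Robbins-Monro conditions stated in the theorem further down.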

##### Bootstrapping & Sample based
SARSA combines bootstrapping with sampling, just like TD(0).
 - The bootstrapping aspect is shared with the dynamic programming methods seen previously.
 The update is a Bellman backup over a single transition. It is executed at every transition, and the value of the current state-action pair is bootstrapped from the old estimate for the sampled next state-action pair.
 We can back up over just one transition because we leverage the **Markovian assumption** of the domain.
 - Sampling is shared with Monte Carlo methods and is what makes the algorithm model free.

##### Finite & Infinite Horizon
SARSA can be used in both finite and infinite horizon settings, i.e. it works with both episodic and non-episodic domains.
Infinite horizon settings are possible because SARSA updates the Q value function at every step, not at the end of an episode as in Monte Carlo.

##### Lower variance and lower data efficiency than Monte Carlo
SARSA has lower variance and lower data efficiency when compared to Monte Carlo methods.
This is a property inherited from Temporal Difference learning.
For more information, have a look at [Temporal Difference learning](../4-temporal-difference/td_agent.ipynb).

##### Lower data efficiency when using Q values than using V values
Agents that work with state-action pair values usually have lower data efficiency than their counterparts working with state values while exploring.
 - more memory is used to represent state-action pair values than state values
 - more data is needed to train the agent, i.e. the agent needs to spend more time interacting with the environment to learn a good policy while exploring

##### Epsilon greedy policy
Epsilon greedy policies determine how often the agent explores and how often it exploits.

Furthermore, we want the epsilon greedy policy to be **greedy in the limit with infinite exploration (GLIE)**, which requires that:
 - all state-action pairs are visited an infinite number of times
 - $\epsilon_{t} \to 0$ as $ t \to \infty $, i.e. the policy becomes greedy in the limit

In our case, epsilon is recomputed at each step with the following rule:
$ \epsilon \gets 1 / ( c_{\epsilon} \times f_{\epsilon})$, where $ c_{\epsilon} $ is a counter that is incremented at the end of each episode and $ f_{\epsilon} $ is a constant factor.
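
The schedule can be tabulated with a few lines of code. This is a minimal sketch that assumes $f_{\epsilon} = 0.1$, the value used for the agent in this notebook; the function name `epsilon_for_episode` is introduced only for this example.

```python
def epsilon_for_episode(counter, factor=0.1):
    """Epsilon after `counter` episodes, following 1 / (counter * factor)."""
    return 1.0 / (counter * factor)

counters = (1, 5, 10, 20, 100, 1000)
schedule = [epsilon_for_episode(c) for c in counters]

# A probability cannot exceed 1, so while epsilon >= 1 every action is random.
explore_prob = [min(1.0, eps) for eps in schedule]

for c, eps, p in zip(counters, schedule, explore_prob):
    print(f"episode counter {c:>4}: epsilon = {eps:.3f}, P(explore) = {p:.3f}")
```

With this factor the agent acts purely at random until the counter reaches 10, and exploration then shrinks towards 0 as required by GLIE.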

##### Discount factor
The discount factor must take a value in the range $[0, 1]$.
 - setting it to $1$ means that we value future rewards as much as the immediate reward.
 - setting it to $0$ means that we do not value future rewards at all, only the immediate reward.
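
The effect of the discount factor is easiest to see on a return computed from a fixed reward sequence. The sketch below uses a made-up sequence in which the only non-zero reward arrives after three steps; `discounted_return` is a helper written for this example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 100.0]  # hypothetical episode: reward only at the end

for gamma in (0.0, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
```

With $\gamma = 0$ the delayed reward is invisible from the first step, while with $\gamma = 1$ it counts in full.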

##### Learning rate
The learning rate *usually* takes a value in the range $[0, 1]$.
 - values closer to $1$ give a higher weight to newer data, which can help learning in non-stationary domains.
 - values closer to $0$ give a higher weight to older data.
 - a value of exactly $1$ overwrites the old estimate with the new target, while a value of $0$ means nothing is learned.
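
A single update with different learning rates shows how the old estimate and the new target are blended. The numbers are arbitrary and the helper `update` is defined only for this illustration.

```python
def update(estimate, target, alpha):
    """Move `estimate` towards `target` by a fraction `alpha` of the gap."""
    return estimate + alpha * (target - estimate)

old_estimate, target = 2.0, 10.0

for alpha in (0.0, 0.1, 0.5, 1.0):
    print(alpha, update(old_estimate, target, alpha))
```

At `alpha = 0.0` the estimate does not move, at `alpha = 0.5` old and new information are weighted equally, and at `alpha = 1.0` the old estimate is replaced by the target.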

##### Theorem: Robbins-Monro sequence for Learning rate
For finite-state, finite-action MDPs, SARSA converges to the optimal action-value function, i.e. $Q(s, a) \to q_*(s, a)$, if the following two conditions hold:
 1. The sequence of policies $\pi_t$ is GLIE
 2. The step-sizes $\alpha_t$ form a Robbins-Monro sequence, such that:
  - $ \sum^{\infty}_{t=1} \alpha_t = \infty $
  - $ \sum^{\infty}_{t=1} \alpha^2_t < \infty $

That is why we use a **decaying learning rate** that satisfies the above conditions, as we did in Incremental Monte Carlo.
A learning rate of the form $ k \times 1/c_{\epsilon}$, similar to the one used in Incremental Monte Carlo, satisfies both conditions, since $\sum 1/t$ diverges while $\sum 1/t^2$ converges.
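
The two conditions can be illustrated numerically for the simplest such schedule, $\alpha_t = 1/t$. This is only a sanity check on partial sums, not a proof, and it assumes $k = 1$ with one decay step per $t$.

```python
import math

def partial_sums(n_steps):
    """Return (sum of alpha_t, sum of alpha_t**2) for alpha_t = 1/t, t = 1..n_steps."""
    alphas = [1.0 / t for t in range(1, n_steps + 1)]
    return sum(alphas), sum(a * a for a in alphas)

for n in (10, 1_000, 100_000):
    s1, s2 = partial_sums(n)
    print(f"n = {n:>7}: sum alpha = {s1:.3f}, sum alpha^2 = {s2:.5f}")

# The squared sum stays below pi^2 / 6, its known limit.
squared_sum_limit = math.pi ** 2 / 6
```

The first sum keeps growing, roughly like $\ln n$, whereas the second levels off below $\pi^2/6$.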


##### Initialization
For SARSA we keep track of the following:
 - state-action value functions, initially set to $0$
 - `self.learning_rate` is initialized to $1$ and decays at the same rate as `self.epsilon` as the number of episodes grows.
 At the end of each episode it is set to the same value as epsilon, capped at $1$.
 - `self.discount_factor` is set to $0.9$.
 - we set `self.decaying_epsilon_mul_factor` to a value of $0.1$, whereas for TD(0) learning the value was set to $0.2$.
 This lets the agent explore longer because, as noted above, algorithms that work with Q values are less data efficient than their V value counterparts.
 - `self.epsilon` starts from $10$ and decreases with each episode.

We also choose to call the `self.learn()` method directly with the current sampled tuple, instead of first saving the tuple in the `self.tuples` variable and learning afterwards.
This is just a matter of preference.
We pass the tuple in the form $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ to the `learn()` method.



In [1]:
import numpy as np
import random
from collections import defaultdict
from environment import Env


# SARSA agent learns every time step from the sample <s, a, r, s', a'>
# render sleep time updated to 0.005
class SARSAgent:
    def __init__(self, actions):
        self.actions = actions
        self.learning_rate = 1
        self.discount_factor = 0.9
        self.decaying_epsilon_counter = 1
        self.decaying_epsilon_mul_factor = 0.1
        self.epsilon = None
        self.q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])


### SARSA

SARSA combines sampling with bootstrapping in an on policy algorithm, just like TD(0).
Unlike TD(0), however, SARSA evaluates and updates Q values.

The update rule for Q values in SARSA is the following:

$ Q^{\pi}(s_t, a_t) \gets Q^{\pi}(s_t, a_t) + \alpha [r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t)] $ where:
 - $Q^{\pi}(s_t, a_t)$ - Q value of the current state-action pair following the policy $\pi$
 - $Q^{\pi}(s_{t+1}, a_{t+1})$ - the current estimate, following the policy $\pi$, of the Q value of the next state-action pair.
 - $\alpha$ - the **learning rate**.
 The learning rate can take any value in the range $[0, 1]$.
 Values closer to 0 mean that we put more weight on older experiences, whereas values closer to 1 mean that we put more weight on the latest experiences.
 In our case, the learning rate starts at $1$ and decays with the number of episodes.
 - $r_t$ - the reward at time-step $t$.
 - $\gamma$ - the **discount factor**.
 Traditionally used when calculating returns, it is now used when calculating **expectancies of returns**, i.e. state-action values.

The difference $r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t)$ is commonly referred to as the **TD error**.

The sum $r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1})$ is referred to as the **TD target**.
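
A single update can be worked through by hand. The sketch below uses a small table shaped like the agent's `q_table`; the state keys, reward, and the values of `gamma` and `alpha` are made-up numbers for illustration.

```python
from collections import defaultdict

# Q table with four actions per state, every entry starting at 0.
q_table = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])

# Hypothetical sample <s, a, r, s', a'>.
state, action, reward = "[0, 0]", 1, -1.0
next_state, next_action = "[0, 1]", 2
q_table[next_state][next_action] = 5.0  # pretend this was learned earlier

gamma, alpha = 0.9, 0.5

td_target = reward + gamma * q_table[next_state][next_action]  # r + gamma * Q(s', a')
td_error = td_target - q_table[state][action]                  # target minus old estimate
q_table[state][action] += alpha * td_error                     # move part of the way

print(td_target, td_error, q_table[state][action])
```

Here the target is $-1 + 0.9 \times 5 = 3.5$, the error relative to the old estimate of $0$ is also $3.5$, and the updated Q value is $0.5 \times 3.5 = 1.75$.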


In [2]:
class SARSAgent(SARSAgent):
    # with the sample <s, a, r, s', a'>, update the Q value of (s, a)
    def learn(self, state, action, reward, next_state, next_action):
        current_q = self.q_table[state][action]
        next_state_q = self.q_table[next_state][next_action]
        # TD target: reward plus discounted estimate for the next state-action pair
        td_target = reward + self.discount_factor * next_state_q
        # TD error: gap between the target and the current estimate
        td_error = td_target - current_q
        new_q = current_q + self.learning_rate * td_error
        self.q_table[state][action] = new_q

### Other methods

##### Update Epsilon and Learning rate

In [None]:
class SARSAgent(SARSAgent):
    # decaying epsilon for the epsilon-greedy policy
    def update_epsilon(self):
        self.epsilon = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)

    # decaying learning rate satisfying the Robbins-Monro sequence, capped at 1
    def update_learning_rate(self):
        self.learning_rate = 1 / (self.decaying_epsilon_counter * self.decaying_epsilon_mul_factor)
        if self.learning_rate > 1:
            self.learning_rate = 1

##### Helper methods

In [3]:
class SARSAgent(SARSAgent):
    # pick an action for the state with the epsilon-greedy policy:
    # explore with probability epsilon, otherwise act greedily on the Q table
    def get_action(self, state):
        self.update_epsilon()
        if np.random.rand() < self.epsilon:
            # take random action
            action = np.random.choice(self.actions)
        else:
            # take action according to the q function table
            state_action = self.q_table[state]
            action = self.arg_max(state_action)
        return action


In [5]:
class SARSAgent(SARSAgent):
    @staticmethod
    def arg_max(state_action):
        max_index_list = []
        max_value = state_action[0]
        for index, value in enumerate(state_action):
            if value > max_value:
                max_index_list.clear()
                max_value = value
                max_index_list.append(index)
            elif value == max_value:
                max_index_list.append(index)
        return random.choice(max_index_list)
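
For comparison, the same tie-breaking argmax can be written with NumPy. This is an alternative sketch, not the method used by the agent; the name `arg_max_np` and the optional `rng` parameter are introduced here for illustration.

```python
import numpy as np

def arg_max_np(values, rng=None):
    """Index of the largest value, with ties broken uniformly at random."""
    if rng is None:
        rng = np.random.default_rng()
    values = np.asarray(values, dtype=float)
    # Indices of every entry equal to the maximum.
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))

print(arg_max_np([1.0, 5.0, 2.0]))       # unique maximum
print(arg_max_np([0.0, 3.0, 1.0, 3.0]))  # tie between two indices
```

Random tie-breaking matters early in training, when all Q values of a state are still 0 and a deterministic argmax would always pick the first action.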

##### Main loop

In [6]:
class SARSAgent(SARSAgent):
    # main loop
    def mainloop(self, env, verbose = False):
        for episode in range(1000):
            # reset environment and initialize state
            state = env.reset()

            # update epsilon and get action of state from agent
            action = self.get_action(str(state))

            while True:
                env.render()

                # take action and proceed one step in the environment
                next_state, reward, done = env.step(action)

                # update epsilon and get next action
                next_action = self.get_action(str(next_state))

                # with sample <s,a,r,s',a'>, agent learns new q function
                self.learn(str(state), action, reward, str(next_state), next_action)

                state = next_state
                action = next_action

                # print q function of all states at screen
                env.print_value_all(self.q_table)

                # if episode ends, then break
                if done:
                    self.decaying_epsilon_counter = self.decaying_epsilon_counter + 1
                    # decaying learning rate satisfying the Robbins-Monro sequence
                    self.update_learning_rate()

                    if verbose:
                        print("episode: ", episode,
                              "\tepsilon: ", round(self.epsilon, 2),
                              "\tlearning rate: ", round(self.learning_rate, 2)
                              )
                    break

In [7]:
if __name__ == "__main__":
    env = Env()
    agent = SARSAgent(actions=list(range(env.n_actions)))
    try:
        agent.mainloop(env, verbose=False)
    except Exception:
        # end the run quietly on an error, e.g. when the render window is closed
        pass

### Results

In this Grid World, SARSA converges to an optimal policy within 60 episodes.

The **decaying learning rate**, which satisfies the **Robbins-Monro sequence**, is very important for making SARSA converge to an optimal policy in Grid World.