## Deep SARSA
We have already seen [SARSA](../6-sarsa/sarsa_agent.ipynb) in the previous examples.
SARSA is an on-policy Temporal Difference control method that learns state-action values (Q values).
Deep SARSA extends SARSA by using a neural network, i.e. a *non-linear* function, as the state-action value function approximator.
This lets the method scale to decision making in domains with very large state spaces.

In our application, deep SARSA performs well empirically (see the results at the end).

### Characteristics of deep SARSA:
Deep SARSA inherits all the characteristics of normal [SARSA](../6-sarsa/sarsa_agent.ipynb).

##### Neural Network as value function approximator
This can be any neural network architecture that we see fit.
In this example, we are using a very simple Dense Neural Network architecture.
Dense Neural Networks are simply a *stack of matrices with activation functions in between*.

##### Continuous or very large state and action spaces
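Using a neural network as a value function approximator has its benefits.
It allows us to handle continuous or very large state spaces (the network in this example still outputs one Q value per discrete action).
Such problems are almost impossible to solve with tabular methods such as normal SARSA or Q learning, because the table needs an entry for every state-action pair and quickly becomes too large to store or fill.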

##### Initialization
For SARSA, we keep track of the following:
 - `self.learning_rate` is kept constant at $0.001$
 - `self.epsilon` is initially set to $1.0$ and is multiplied by `self.epsilon_decay = 0.999` on every training step, as long as it is above `self.epsilon_min = 0.01`
 - `self.discount_factor` is set to $0.99$
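
To get a feel for the exploration schedule described above, the following minimal sketch counts how many training steps the multiplicative decay needs before epsilon stops shrinking. The function name `steps_until_min_epsilon` is hypothetical and not part of the agent; the constants are the ones listed above.

```python
import math

EPSILON_START = 1.0    # initial exploration rate
EPSILON_DECAY = 0.999  # multiplicative decay applied per training step
EPSILON_MIN = 0.01     # decay stops once epsilon is no longer above this


def steps_until_min_epsilon(epsilon=EPSILON_START,
                            decay=EPSILON_DECAY,
                            minimum=EPSILON_MIN):
    """Count decay steps applied while epsilon is above the minimum."""
    steps = 0
    while epsilon > minimum:
        epsilon *= decay
        steps += 1
    return steps, epsilon


steps, final_epsilon = steps_until_min_epsilon()
# closed form: smallest n with decay**n <= minimum
closed_form = math.ceil(math.log(EPSILON_MIN) / math.log(EPSILON_DECAY))
print(steps, closed_form, final_epsilon)
```

Under these settings the agent acts almost greedily after roughly 4,600 training steps, counted over all episodes rather than per episode.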

For the neural network we keep track of the following:
 - the model itself, built via `self.model = self.build_model()`

In [None]:
import copy
import pylab
import random
import numpy as np
from environment import Env
from keras.layers import Dense
from keras.optimizers import Adam
from keras.models import Sequential

EPISODES = 1000


# this is DeepSARSA Agent for the GridWorld
# Utilize Neural Network as q function approximator
class DeepSARSAgent:
    def __init__(self):
        self.load_model = False
        # actions which agent can do
        self.action_space = [0, 1, 2, 3, 4]
        # get size of state and action
        self.action_size = len(self.action_space)
        self.state_size = 15
        self.discount_factor = 0.99
        self.learning_rate = 0.001

        self.epsilon = 1.  # exploration
        self.epsilon_decay = .999
        self.epsilon_min = 0.01
        self.model = self.build_model()

        if self.load_model:
            self.epsilon = 0.05
            self.model.load_weights('./save_model/deep_sarsa.h5')

The update rule for Q values in deep SARSA is the following:

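$\hat{Q}^\pi(s_t, a_t) \gets^{train} r_t + \gamma \hat{Q}^\pi(s_{t+1}, a_{t+1})$
 - $\hat{Q}^\pi(s_t, a_t)$ - *predicted* Q value of the current state-action pair following the policy $\pi$
 - $\hat{Q}^\pi(s_{t+1}, a_{t+1})$ - *predicted* Q value of the next state-action pair, where $a_{t+1}$ is the action the policy $\pi$ selects in $s_{t+1}$
 - $r_t$ - the reward received after taking $a_t$ in $s_t$
 - $\gets^{train}$ - this means train the neural network towards this value, instead of *assigning* it
 - $\gamma$ - the **discount factor**.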
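
As a minimal numeric sketch of how such a training target can be formed, the snippet below mirrors what `train_model` does with the network outputs. The helper name `sarsa_target` and the Q values are made up for illustration; only the entry of the action actually taken is changed, so the other outputs contribute no error.

```python
import numpy as np

GAMMA = 0.99  # discount factor, as in the initialization above


def sarsa_target(q_current, q_next, action, next_action, reward, done,
                 gamma=GAMMA):
    """Copy q_current and overwrite the entry of `action` with the target."""
    target = np.array(q_current, dtype=np.float32)
    if done:
        # terminal transition: there is no next state to bootstrap from
        target[action] = reward
    else:
        # bootstrap from the Q value of the action chosen in the next state
        target[action] = reward + gamma * q_next[next_action]
    return target


# hypothetical network outputs for s_t and s_{t+1} (one value per action)
q_s = np.array([0.1, 0.5, -0.2, 0.0, 0.3], dtype=np.float32)
q_s_next = np.array([0.4, 0.2, 1.0, -0.1, 0.0], dtype=np.float32)

target = sarsa_target(q_s, q_s_next, action=1, next_action=2,
                      reward=-1.0, done=False)
terminal_target = sarsa_target(q_s, q_s_next, action=1, next_action=2,
                               reward=-1.0, done=True)
print(target, terminal_target)
```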

First things first, the update formula is very similar to the update rule we saw in SARSA, with the following differences:
 - instead of $Q^\pi$ we now deal with $\hat{Q}^\pi$, which is an approximation: the output of a neural network.
 - after calculating the result, we do not *assign* the value to $\hat{Q}^\pi$, but rather *train* the neural network with *gradient descent* so that its weights reflect the latest state-action value.

Moreover, this looks like a simplified version of the update rule we saw for SARSA.
Notice that the right-hand side of the formula is simply the **TD target**: a one-step, bootstrapped estimate of the return.
We train the network towards this value; the temporal difference between the target and the current prediction does not appear explicitly.

The reason is that in deep SARSA, training the neural network reproduces the temporal difference formula we saw in SARSA, including its *learning rate*.
In a neural network we update the weights by *gradient descent*, with the gradients computed via *backpropagation*.
With the mean squared error loss used here, the gradient is proportional to the difference between the target and the prediction, i.e. the temporal difference error, and the step is scaled by the optimizer's **learning rate**.
That is why we do not need to write out the temporal difference explicitly: training the neural network on the new target applies it implicitly.
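
To make this correspondence concrete, here is a small sketch for the simplest possible "network": a single trainable number standing in for one Q value. It assumes plain gradient descent on the loss $\frac{1}{2}(\text{target} - q)^2$ with the target held fixed. Both function names and all numbers are illustrative and not part of the agent.

```python
def sarsa_update(q, target, alpha):
    """Tabular SARSA: move q towards the target by a fraction alpha."""
    return q + alpha * (target - q)


def gradient_step(q, target, alpha):
    """One gradient descent step on loss = 0.5 * (target - q) ** 2."""
    grad = -(target - q)  # d(loss)/dq, with the target treated as a constant
    return q - alpha * grad


q_value = 0.5      # hypothetical current estimate of Q(s_t, a_t)
td_target = -0.01  # hypothetical r_t + gamma * Q(s_{t+1}, a_{t+1})
alpha = 0.001      # same value as self.learning_rate

print(sarsa_update(q_value, td_target, alpha))
print(gradient_step(q_value, td_target, alpha))
```

Both calls give the same result, which shows that the TD error enters through the gradient. In the real agent the correspondence is only approximate: the weights are shared across all states, and Adam rescales each step.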

In [None]:
class DeepSARSAgent(DeepSARSAgent):
    def train_model(self, state, action, reward, next_state, next_action, done):
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

        state = np.float32(state)
        next_state = np.float32(next_state)
        target = self.model.predict(state)[0]
        # unlike Q Learning, SARSA does not take the maximum Q value at s';
        # it bootstraps from the Q value of the action chosen there
        if done:
            target[action] = reward
        else:
            target[action] = (reward + self.discount_factor *
                              self.model.predict(next_state)[0][next_action])

        target = np.reshape(target, [1, self.action_size])
        # fit the network on this single (state, target) pair; only the
        # entry of the taken action differs from the current prediction
        self.model.fit(state, target, epochs=1, verbose=0)

##### The model: neural network
Neural networks are built one layer after the other.
Our model is a dense neural network, i.e. a neural network composed only of dense layers.
Each dense layer is simply a *weight matrix* (plus a bias vector) followed by an *activation function*.
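
To illustrate what "a matrix followed by an activation function" means, the following NumPy sketch runs a forward pass through three dense layers. The layer sizes (15, 30, 30, 5) mirror the Keras model defined in `build_model`. The random weights and the helper names `relu`, `dense` and `forward` are assumptions made for this illustration; Keras manages its own weights internally.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    """Rectified linear unit: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


def dense(x, weights, bias, activation=None):
    """One dense layer: matrix product plus bias, then optional activation."""
    z = x @ weights + bias
    return activation(z) if activation is not None else z


# illustrative random parameters for a 15 -> 30 -> 30 -> 5 network
layer_sizes = [15, 30, 30, 5]
params = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def forward(state):
    """Map a batch of states to one Q value per action."""
    hidden = dense(state, *params[0], activation=relu)
    hidden = dense(hidden, *params[1], activation=relu)
    return dense(hidden, *params[2])  # linear output layer


state = rng.normal(size=(1, 15))  # one made-up state with 15 features
q_values = forward(state)
print(q_values.shape)  # (1, 5)
```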

In [None]:
class DeepSARSAgent(DeepSARSAgent):
    # approximate Q function using Neural Network
    # state is input and Q Value of each action is output of network
    def build_model(self):
        model = Sequential()
        model.add(Dense(30, input_dim=self.state_size, activation='relu'))
        model.add(Dense(30, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.summary()
        # recent Keras versions expect `learning_rate`; older ones used `lr`
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model

##### Helper methods

In [None]:
class DeepSARSAgent(DeepSARSAgent):
    # get action from model using epsilon-greedy policy
    def get_action(self, state):
        if np.random.rand() <= self.epsilon:
            # The agent acts randomly
            return random.randrange(self.action_size)
        else:
            # Predict the Q values of all actions in the given state
            state = np.float32(state)
            q_values = self.model.predict(state)
            return np.argmax(q_values[0])

##### Main loop

In [None]:
if __name__ == "__main__":
    env = Env()
    agent = DeepSARSAgent()

    global_step = 0
    scores, episodes = [], []

    for e in range(EPISODES):
        done = False
        score = 0
        state = env.reset()
        state = np.reshape(state, [1, 15])

        while not done:
            global_step += 1

            # get action for the current state and go one step in environment
            action = agent.get_action(state)
            next_state, reward, done = env.step(action)
            next_state = np.reshape(next_state, [1, 15])
            # choose the next action with the same epsilon-greedy policy;
            # its Q value is the bootstrap term of the SARSA target
            next_action = agent.get_action(next_state)
            # every time step we do training
            agent.train_model(state, action, reward, next_state, next_action,
                              done)
            score += reward

            # note: the action is sampled again at the top of the next
            # iteration, so it may differ from next_action used above
            state = copy.deepcopy(next_state)

            if done:
                scores.append(score)
                episodes.append(e)
                pylab.plot(episodes, scores, 'b')
                pylab.savefig("./save_graph/deep_sarsa2.png")
                print("episode:", e, "  score:", score, "global_step",
                      global_step, "  epsilon:", agent.epsilon)

        if e % 100 == 0:
            agent.model.save_weights("./save_model/deep_sarsa2.h5")

<br/>
<h3 style="text-align:center">Results</h3>
<img src="./save_graph/deep_sarsa.png" alt="deep_sarsa.png" width="70%" />

From the graph above we can see the following:
 - the agent performs very well after just 20 episodes.
 - such results may not always be reproducible, since we use a constant learning rate rather than a decaying one that satisfies the Robbins-Monro conditions.
 Nevertheless, we can expect that in most runs our deep SARSA implementation will find a good policy, since SARSA is a well-established Temporal Difference learning algorithm.
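
For reference, the Robbins-Monro conditions require step sizes $\alpha_t$ with $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. The sketch below shows one schedule of that kind, a harmonic decay. It is not used in this notebook, which keeps the learning rate constant. The function name `decayed_learning_rate` and the `decay` value are hypothetical.

```python
def decayed_learning_rate(step, initial=0.001, decay=0.001):
    """Harmonic decay: alpha_t = initial / (1 + decay * step)."""
    return initial / (1.0 + decay * step)


# the step size shrinks slowly enough that its sum keeps growing,
# while the sum of its squares stays bounded
rates = [decayed_learning_rate(t) for t in range(100000)]
sum_of_rates = sum(rates)
sum_of_squares = sum(r * r for r in rates)
print(rates[0], rates[1000], sum_of_rates, sum_of_squares)
```

Such a schedule could be passed to the optimizer step by step, but whether it helps in this environment would have to be checked empirically.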

We can conclude that deep SARSA performs very well in our extended Grid World environment.
