# 2. SARSA

![sarsa](https://miro.medium.com/max/226/1*B4HaWwCxv4af2D6CEIGxyA.png)

SARSA is an on-policy temporal-difference (TD) control algorithm in which an action A taken in the current state S results in a reward R. As a result, the agent ends up in the next state S1, where it takes action A1. The name SARSA comes from this sequence: S, A, R, S1, A1. As an on-policy algorithm, SARSA updates its Q-values based on the actions the agent actually takes.

## SARSA vs Q-Learning

From the earlier explanation it may seem that SARSA is very similar to Q-Learning, and that is partly right. The idea of storing Q-values in a tabular format remains the same; however, there is one significant difference: SARSA is an on-policy algorithm, while Q-Learning is an off-policy one.

This will be explained in more depth in the following mathematics section, but in short, SARSA uses the same current policy both to choose A1 and to update its Q-values, while Q-Learning updates its Q-values using the greedy (highest-valued) action in the next state, regardless of which action the agent actually takes next.

## Mathematics

To better understand the difference between SARSA and Q-learning, let's look at SARSA's update function.

$$
Q(s_t, a_t) = Q(s_t, a_t) + \alpha [r(s_t, a_t) + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
$$
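
To make the update rule concrete, here is a minimal sketch of a single SARSA update. The helper name `sarsa_update` and all numbers are hypothetical and chosen only for illustration; they do not come from the FrozenLake exercise.

```python
def sarsa_update(q_sa, reward, q_next, alpha, gamma):
    """Return the new Q(s_t, a_t) after one SARSA update."""
    # TD target: immediate reward plus the discounted value of the
    # next state-action pair actually chosen by the policy
    td_target = reward + gamma * q_next
    # Move the old estimate a fraction alpha towards the target
    return q_sa + alpha * (td_target - q_sa)

# Hypothetical values, for illustration only
new_q = sarsa_update(q_sa=1.0, reward=0.0, q_next=2.0, alpha=0.5, gamma=0.9)
print(round(new_q, 3))  # 1.4
```

Here the target is 0 + 0.9 * 2.0 = 1.8, so the estimate moves halfway from 1.0 towards 1.8.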

Now, let's look at the update rule we used for Q-learning.

$$
Q(s_t, a_t) = Q(s_t, a_t) + \alpha [r(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]
$$

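As we can see, the only difference is the target Q-value. SARSA uses $Q(s_{t+1}, a_{t+1})$, where $a_{t+1}$ is the action selected by the **same behaviour policy** that selected $a_t$.

Q-Learning, on the other hand, uses the greedy policy to determine the target Q-value: it takes the maximum of $Q(s_{t+1}, a_{t+1})$ over all possible next actions, whether or not the agent actually takes that action.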
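
The difference between the two targets can be sketched with a single row of a Q-table. The function names and the values in `q_next_row` are hypothetical; the point is only that the SARSA target depends on which next action was selected, while the Q-learning target does not.

```python
import numpy as np

def sarsa_target(reward, gamma, q_next_row, next_action):
    """On-policy target: uses the action actually selected in s_{t+1}."""
    return reward + gamma * q_next_row[next_action]

def q_learning_target(reward, gamma, q_next_row):
    """Off-policy target: uses the best action in s_{t+1}."""
    return reward + gamma * np.max(q_next_row)

# Hypothetical Q(s_{t+1}, .) for four actions
q_next_row = np.array([0.2, 0.7, 0.1, 0.4])

# The behaviour policy explored and picked action 3 (not the best one)
explored = sarsa_target(0.0, 0.9, q_next_row, next_action=3)   # about 0.36
# The behaviour policy happened to pick the greedy action 1
greedy = sarsa_target(0.0, 0.9, q_next_row, next_action=1)     # about 0.63
# Q-learning ignores the selected action entirely
off_policy = q_learning_target(0.0, 0.9, q_next_row)           # about 0.63
```

The two targets coincide only when the behaviour policy selects the greedy action; whenever it explores, SARSA's target reflects that exploratory choice.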

## Implementation

As SARSA is quite similar to Q-learning, it might be useful to implement SARSA as well, even though it is not going to be used for the final challenge.

#### 1. Environment

In [2]:
import numpy as np
import gym

#Set up 'FrozenLake-v1' environment
env = ___

#### 2. Initializing parameters

In [11]:
#Defining parameters
epsilon = 0.9
total_episodes = 10000
max_steps = 100
alpha = 0.85
gamma = ___
 
#Initializing Q-matrix
Q = np.zeros((___, ___))

#### 3. Defining functions

When it comes to functions, we will need one function for **choosing an action** and one for **updating the Q-table**.

In [4]:
#Function to choose the next action
def choose_action(state):
    action=0
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])
    return action
 
#Function to learn the Q-value
def update(state, state2, reward, action, action2):
    predict = Q[state, action]
    target = reward + gamma * Q[state2, action2]
    Q[state, action] = Q[state, action] + alpha * (target - predict)
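
In `choose_action`, a random action is taken whenever the uniform draw falls below `epsilon`, so `epsilon` is the probability of exploring. The sketch below checks this empirically; the helper `exploration_fraction`, its sample size, and its seed are hypothetical and only serve the illustration.

```python
import numpy as np

def exploration_fraction(epsilon, n_draws=10_000, seed=0):
    """Estimate how often the test `uniform(0, 1) < epsilon` succeeds."""
    rng = np.random.default_rng(seed)
    draws = rng.uniform(0, 1, size=n_draws)
    # Fraction of draws that would trigger a random (exploratory) action
    return float(np.mean(draws < epsilon))

fraction = exploration_fraction(0.9)
print(f"explored on roughly {fraction:.0%} of the draws")
```

With `epsilon = 0.9`, as set in the parameters above, the agent therefore explores on roughly nine out of ten steps and follows its current Q-values on the rest.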

#### 4. Training

The training process itself involves taking an action in the environment, choosing the next action using our previously defined function, and updating the Q-value table.

In [15]:
#Initializing the reward
reward=0
 
# Starting the SARSA learning
for episode in range(total_episodes):
    t = 0
    state1 = env.reset()
    action1 = choose_action(state1)
 
    while t < max_steps:
         
        #Getting the next state after taking action1
        state2, reward, done, info = ___
 
        #Choosing the next action
        action2 = choose_action(state2)
         
        #Learning the Q-value using our defined function
        ___
 

        state1 = state2
        action1 = action2
         
        #Updating the step counter
        t += 1
         
        #Ending the episode once a terminal state is reached
        if done:
            break
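
To see the whole loop run end to end without `gym`, here is a minimal sketch of tabular SARSA on a toy problem. Everything in it is an assumption made for illustration and not part of the exercise above: the five-state corridor, the `step` function (which is not the gym API), the hyperparameters, and the random tie-breaking in `epsilon_greedy`.

```python
import numpy as np

N_STATES, N_ACTIONS, GOAL = 5, 2, 4  # hypothetical corridor: states 0..4

def step(state, action):
    """Move left (0) or right (1); reward 1.0 only when GOAL is reached."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, else a best action."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    # Break ties at random so an all-zero table does not always pick action 0
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))

def train_sarsa(episodes=500, max_steps=100, alpha=0.1, gamma=0.9,
                epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)
        for _ in range(max_steps):
            next_state, reward, done = step(state, action)
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            # SARSA target: value of the next action actually selected.
            # The GOAL row is never updated, so it stays at zero.
            target = reward + gamma * Q[next_state, next_action]
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
            if done:
                break
    return Q

Q = train_sarsa()
print(np.round(Q, 2))
```

In this sketch the state next to the goal should end up preferring the "right" action, since that entry is updated towards a target of 1.0 while every other target is at most `gamma`. The same action and update order (choose, step, choose next, update, shift) is what the training cell above follows.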