> ***This notebook illustrates the implementation of the popular model-free, on-policy [SARSA](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action) algorithm, commonly used for reinforcement learning (RL) tasks.***

## Introduction
- ***Recalling SARSA***

Reinforcement learning is a subfield of machine learning in which an agent learns by trial and error, using feedback from the environment to decide its next action. It can be implemented with various methods; Q-learning and State-Action-Reward-State-Action (SARSA) are two of the most common. The two methods are very similar, except that Q-learning is an off-policy algorithm and SARSA is an on-policy algorithm.

- ***Main features of SARSA***
  - SARSA (state-action-reward-state-action) is an on-policy reinforcement learning algorithm that estimates the value of the policy being followed. The agent learns about the same policy that it uses to act: the policy used for updating and the policy used for acting are the same, unlike in Q-learning.
  - It is a "model-free" algorithm. Algorithms that learn purely from sampled experience, such as Monte Carlo Control, SARSA, Q-learning, and Actor-Critic, are "model-free" RL algorithms.
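  - An experience in SARSA is of the form ⟨S, A, R, S', A'⟩, which consists of

    * the current state S,
    * the current action A,
    * the reward R,
    * the new state S', and
    * the next action A'.

    This provides a new experience with which to update $Q(S,A)$ toward the target $R + \gamma Q(S',A')$:

    $Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S',A') - Q(S,A) \right]$
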
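
As a concrete illustration of this update, the sketch below applies one SARSA step to a single experience. All numbers, including the step size `alpha`, are made up for the example; the step-size form of the update is the one used later in this notebook.

```python
# One SARSA update on a single experience <S, A, R, S', A'>.
# All numbers below are illustrative assumptions, not results from the notebook.
alpha = 0.5   # step size (learning rate)
gamma = 0.9   # discount factor

q_sa = 0.2    # current estimate Q(S, A)
reward = 1.0  # reward R
q_next = 0.4  # current estimate Q(S', A')

target = reward + gamma * q_next            # R + gamma * Q(S', A')
q_sa_new = q_sa + alpha * (target - q_sa)   # move Q(S, A) part of the way to the target

print(round(target, 4), round(q_sa_new, 4))
```

With `alpha = 1` the estimate would jump all the way to the target; smaller values average over many experiences.
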
## Drawing out the difference between [Q-learning and SARSA](https://stackoverflow.com/questions/6848828/what-is-the-difference-between-q-learning-and-sarsa?width=100%)

  ![](https://www.researchgate.net/profile/Thanh-Nguyen-267/publication/321328825/figure/fig6/AS:631634760069178@1527604870465/An-example-of-using-Sarsa-versus-Q-learning-in-TD-learning.png)

  *Image Credits: [Research Gate](https://www.researchgate.net/figure/An-example-of-using-Sarsa-versus-Q-learning-in-TD-learning_fig6_321328825)*
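
The difference can also be seen in how the two methods build their update targets. The following is a minimal sketch with hypothetical Q-values for the next state; the variable names are chosen for illustration only.

```python
import numpy as np

gamma = 0.9
reward = 0.0
# Hypothetical Q-values for the next state S' (one entry per action).
q_next_state = np.array([0.1, 0.5, 0.3, 0.0])
next_action = 2  # action A' actually selected in S' by the behaviour policy

# SARSA (on-policy): bootstrap from the action that will actually be taken.
sarsa_target = reward + gamma * q_next_state[next_action]

# Q-learning (off-policy): bootstrap from the greedy action,
# regardless of which action is taken next.
q_learning_target = reward + gamma * q_next_state.max()

print(sarsa_target, q_learning_target)
```

The two targets coincide whenever the selected action A' happens to be the greedy one, so they differ only on exploratory steps.
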

## Implementing SARSA
To implement `SARSA`, the steps are as follows:
- Initialize alpha (the learning rate), which controls how strongly each update adjusts the Q function, and gamma (the discount factor)
- Initialize the Q function, which is the agent's estimate of its discounted future rewards when starting from a given state and taking a given action; for SARSA, this estimate assumes the agent keeps following its current policy
- Loop over the episodes
  - Initialize the state S
  - Choose an initial action A for that state using an epsilon-greedy strategy on the Q function
  - For each step of the episode
    - Take the action A, and observe the reward R and the new state S'
    - Choose an action A' for S' using epsilon-greedy on the Q function
    - Update the Q function using the update rule

      `Q(s,a) <- Q(s,a) + alpha * (R + gamma*Q(s',a') - Q(s,a))`
    - Set S <- S' and A <- A'
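
The steps above can be sketched end to end without any external environment. The example below is an assumption-laden toy: the five-cell corridor, the `step` and `epsilon_greedy` helpers, and all hyperparameter values are hypothetical and are not part of this notebook's FrozenLake setup.

```python
import numpy as np

# Hypothetical toy environment: a corridor of 5 cells, start at cell 0,
# reward 1 for reaching cell 4. Actions: 0 = left, 1 = right.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Move one cell left or right; the walls keep the agent in place."""
    nxt = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))  # break ties at random

def sarsa(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _episode in range(episodes):
        s = 0                                     # initialize the state S
        a = epsilon_greedy(Q, s, epsilon, rng)    # choose the initial action A
        for _t in range(max_steps):
            s2, r, done = step(s, a)              # take A, observe R and S'
            a2 = epsilon_greedy(Q, s2, epsilon, rng)  # choose A' for S'
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2                         # S <- S', A <- A'
            if done:
                break
    return Q

Q_toy = sarsa()
print(np.argmax(Q_toy[:GOAL], axis=1))  # greedy action in each non-terminal cell
```

The row for the goal cell is never updated and stays at zero, so the bootstrap term vanishes on the final step of an episode.
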

## Importing the required libraries and setting up the [‘FrozenLake-v0’](https://gym.openai.com/envs/FrozenLake-v0/) environment


In [None]:
import gym # toolkit for reinforcement learning, OpenAI’s gym module to load the environment
import numpy as np

In [None]:
env = gym.make('FrozenLake-v0')  # creating the environment

## Initializing different parameters and the Q-Table


In [None]:
# defining different parameters
epsilon = 0.9          # probability of taking a random (exploratory) action
total_episodes = 1000  # number of episodes, i.e. training cycles
max_steps = 100        # maximum number of steps per episode
alpha = 0.85           # learning rate
gamma = 0.95           # discount factor influencing the importance of future rewards

# initializing the Q-table
Q = np.zeros((env.observation_space.n, env.action_space.n))         

In [None]:
Q  # 16 by 4 Q-table

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
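
Each row of the Q-table corresponds to one cell of the 4x4 grid, numbered row by row, and each column to one action. The short sketch below shows how a state index maps back to a grid position; `Q_example` and the chosen state index are hypothetical stand-ins used only for illustration.

```python
import numpy as np

n_rows, n_cols = 4, 4                       # the default FrozenLake map is 4x4
Q_example = np.zeros((n_rows * n_cols, 4))  # same shape as the Q-table above

state = 6                                   # a hypothetical state index
row, col = np.unravel_index(state, (n_rows, n_cols))
print(int(row), int(col))                   # grid position of that state
print(Q_example[state])                     # the four action values for that cell
```
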

In [None]:
# function to choose the next action (epsilon-greedy)
def choose_action(state):
  if np.random.uniform(0, 1) < epsilon:
    # explore: sample a random action with probability epsilon
    action = env.action_space.sample()
  else:
    # exploit: pick the action with the highest Q-value in this state
    # (np.argmax returns the first such action when several are tied)
    action = np.argmax(Q[state, :])
  return action

# function to learn the Q-value
def update(state, state2, reward, action, action2):
  # The SARSA update depends on the current state, current action,
  # reward obtained, next state and next action.
  predict = Q[state, action]
  target = reward + gamma*Q[state2, action2]
  Q[state, action] = Q[state, action] + alpha*(target - predict)
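
Because `np.argmax` returns the first maximal index, the greedy branch always picks action 0 while a row of the Q-table is still all zeros. One possible alternative is to break ties at random; the helper `argmax_random_tie` below is a hypothetical name introduced for this sketch and is not used elsewhere in the notebook.

```python
import numpy as np

def argmax_random_tie(values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    values = np.asarray(values)
    best = np.flatnonzero(values == values.max())
    return int(rng.choice(best))

rng = np.random.default_rng(0)
q_row = np.zeros(4)                  # an untrained row of the Q-table
print(int(np.argmax(q_row)))         # np.argmax picks the first maximum
picks = {argmax_random_tie(q_row, rng) for _ in range(200)}
print(sorted(picks))                 # random tie-breaking spreads over tied actions
```

When one action is strictly best, both approaches return the same index.
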

## Training the agent

In [None]:
# initializing the cumulative reward over all episodes
reward = 0

# starting the SARSA learning
for episode in range(total_episodes):
  t = 0
  # initial observation
  state1 = env.reset()

  # choose the first action based on the initial observation
  action1 = choose_action(state1)

  while t < max_steps:
    # render the current state of the environment
    env.render()

    # take the action and get the next observation and reward
    state2, step_reward, done, info = env.step(action1)

    # choose the next action based on the next observation
    action2 = choose_action(state2)

    # learn the Q-value from the experience (S, A, R, S', A')
    update(state1, state2, step_reward, action1, action2)

    # move on to the next observation and action
    state1 = state2
    action1 = action2

    # update the step counter and the cumulative reward
    t += 1
    reward += step_reward

    # break the while loop at the end of this episode
    if done:
      break

  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FH

The output above traces the agent's moves during training. The highlighted cell (rendered with a red background in the notebook) marks the agent's current position in the environment, while the direction in parentheses is the action the agent just took to arrive there. Note that the agent stays in its position if a move would take it out of bounds. Because `FrozenLake-v0` is slippery by default, the agent does not always move in the chosen direction.

## Evaluating the agent's performance


In [None]:
# evaluating the agent's performance
print('Performance : ', reward/total_episodes)

# Visualizing the Q-table
print(Q)

Performance :  0.001
[[3.93733102e-04 1.41984851e-03 1.55198166e-03 6.12306778e-04]
 [1.47719338e-04 5.83464052e-05 2.11345649e-04 1.83062965e-03]
 [3.16573153e-04 1.23743397e-04 7.01921006e-05 4.42754356e-04]
 [2.09527094e-05 6.49180700e-05 1.92575546e-05 2.82355914e-04]
 [1.74111822e-03 7.23986566e-02 2.21232951e-04 7.13353779e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.35269898e-03 3.85028417e-10 1.00522763e-01 2.69718892e-06]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [4.82475662e-03 3.93284559e-02 3.54642101e-03 3.45553247e-02]
 [3.18405862e-02 1.15039973e-01 3.34003084e-02 6.61898697e-02]
 [6.64188563e-01 1.44181259e-02 4.03976346e-05 1.16580609e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.77453135e-07 8.74085054e-03 1.68668293e-01 2.74975914e-02]
 [1.76216012e-02 1.46157216e-01 8.67757068e-01 6.81846164e-02]
 [0.00000000e+00 0.00000000e+00 0.

From the result of the above cell, we can see how the cells of the Q-table are updated by on-policy SARSA learning. The reported performance value is `0.001`, which may be improved by tuning the parameters of the model.
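
A Q-table is easier to read as a policy. The sketch below shows one way to turn a 16 by 4 table into the preferred action per grid cell; `greedy_policy` and `Q_demo` are hypothetical names, and the random table is only a stand-in for the trained `Q`.

```python
import numpy as np

# FrozenLake encodes its actions as 0 = Left, 1 = Down, 2 = Right, 3 = Up.
ACTION_NAMES = np.array(['Left', 'Down', 'Right', 'Up'])

def greedy_policy(Q, shape=(4, 4)):
    """Return the name of the highest-valued action for every grid cell."""
    return ACTION_NAMES[np.argmax(Q, axis=1)].reshape(shape)

# Hypothetical Q-table for demonstration; in the notebook, pass the trained Q.
rng = np.random.default_rng(0)
Q_demo = rng.random((16, 4))
print(greedy_policy(Q_demo))
```

Rows that are all zeros, such as holes and the goal, show up as `Left` only because `np.argmax` returns the first index on ties.
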

Refer to this [notebook](https://www.kaggle.com/just4jcgeorge/3-maze-problem-with-sarsa-solution) by George Ng for a SARSA implementation illustrating the solution to the 3-maze problem.