## SARSA

### Definition

State–action–reward–state–action (SARSA) is a reinforcement learning algorithm for learning a policy in a Markov decision process.

The name reflects the fact that the main function for updating the Q-value depends on the agent's current state "S1", the action "A1" it chooses, the reward "R" it receives for taking that action, the state "S2" it enters after that action, and finally the next action "A2" it chooses in its new state.

The main advantage of SARSA is that the value estimate of an action can be updated after every step, so the agent need not wait until the end of the episode.

### SARSA vs. Q-learning

The SARSA algorithm is a slightly modified version of the well-known Q-learning algorithm. A reinforcement learning agent can learn in one of two ways:

1. On-policy: the learning agent learns the value function of the policy it is currently using to choose actions.
2. Off-policy: the learning agent learns the value function of a policy other than the one it uses to choose actions.

The difference between Q-learning and SARSA lies in the action used to form the update target. SARSA selects the next action with the same policy it is currently following and uses that action's Q-value in the update. Q-learning instead uses the greedy action, that is, the action with the maximum Q-value in the next state, so its target assumes that an optimal policy is followed from then on, regardless of what the agent actually does next.

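The difference can also be seen in the update rule of each technique:

1. Q-learning:

   $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

2. SARSA:

   $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]$$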

Here, the update equation for SARSA depends on the current state, current action, reward obtained, next state, and next action. This observation led to the name SARSA, which stands for State Action Reward State Action and symbolizes the tuple (s, a, r, s’, a’).
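
To make the difference concrete, the short sketch below computes both update targets from the same transition. The Q-table values, the transition, and the helper names `sarsa_target` and `q_learning_target` are illustrative assumptions and do not come from the implementation later in this article.

```python
import numpy as np

# Hypothetical Q-table with 2 states and 3 actions (values chosen for illustration).
Q = np.array([[0.0, 0.0, 0.0],
              [1.0, 5.0, 2.0]])

gamma = 0.9        # discount factor
reward = 1.0       # reward r observed after taking action a in state s
next_state = 1     # state s' reached after the action
next_action = 2    # action a' actually chosen in s' by the behaviour policy


def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action the agent will really take."""
    return reward + gamma * Q[next_state, next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the greedy action in the next state."""
    return reward + gamma * np.max(Q[next_state])


sarsa = sarsa_target(Q, reward, next_state, next_action, gamma)   # 1 + 0.9 * 2
q_learning = q_learning_target(Q, reward, next_state, gamma)      # 1 + 0.9 * 5
print(sarsa, q_learning)
```

The two targets coincide whenever the action chosen in the next state happens to be the greedy one.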

SARSA takes into account the current exploration policy, which may, for example, be greedy with occasional random steps. It can therefore find a different policy from Q-learning in situations where exploring may incur large penalties. For example, when a robot moves near the top of a staircase, a random exploratory step may send it down the stairs, even if passing close to the stairs is part of the optimal path. SARSA will discover this and adopt a policy that keeps the robot away from the stairs. The policy it finds is optimal given the exploration inherent in the policy.
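
A rough numerical sketch of this effect follows. It assumes an ε-greedy policy that, when exploring, picks uniformly among all actions, and it uses made-up Q-values for a state at the top of the stairs. The helper name `expected_next_value` is hypothetical. The sketch compares the value Q-learning bootstraps from with the average value that SARSA's sampled targets bootstrap from.

```python
import numpy as np


def expected_next_value(q_row, epsilon):
    """Average of Q(s', a') when a' is drawn from an epsilon-greedy policy.

    With probability 1 - epsilon the greedy action is taken; with probability
    epsilon an action is drawn uniformly from all actions.
    """
    q_row = np.asarray(q_row, dtype=float)
    return (1.0 - epsilon) * q_row.max() + epsilon * q_row.mean()


# Made-up values for a state at the top of the stairs:
# action 0 = keep walking along the edge, action 1 = step off the edge.
q_near_stairs = [10.0, -100.0]
epsilon = 0.1

greedy_value = max(q_near_stairs)                               # what Q-learning bootstraps from
on_policy_value = expected_next_value(q_near_stairs, epsilon)   # what SARSA sees on average
print(greedy_value, on_policy_value)
```

Under these assumptions, the state next to the stairs looks noticeably less attractive to SARSA than to Q-learning, which is why SARSA tends to steer away from it while exploration is still active.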



### Algorithm for SARSA

**controller** SARSA(S,A,γ,α)

**inputs:**

    S is a set of states
    A is a set of actions
    γ is the discount factor
    α is the step size

**internal state:**

    real array Q[S,A]
    previous state s
    previous action a

**begin**

    initialize Q[S,A] arbitrarily
    observe current state s
    select action a using a policy based on Q
    repeat forever:
        carry out action a
        observe reward r and state s'
        select action a' using a policy based on Q
        Q[s,a] ← Q[s,a] + α(r + γQ[s',a'] - Q[s,a])
        s ← s'
        a ← a'
    end-repeat
  
**end**
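
The controller above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the `Corridor` environment, the ε-greedy `select_action` helper with random tie-breaking, the episodic loop (instead of "repeat forever"), and all hyperparameter values are assumptions made for the example.

```python
import numpy as np


class Corridor:
    """Toy environment (an assumption for this sketch): states 0..4 in a row.

    Action 0 moves left, action 1 moves right. State 4 is terminal and
    entering it yields a reward of 1; every other step yields 0.
    """
    n_states, n_actions = 5, 2

    def reset(self):
        self.state = 2
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def select_action(Q, state, epsilon, rng):
    """Epsilon-greedy policy based on Q, breaking ties at random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    best = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best))


def sarsa(env, gamma=0.9, alpha=0.5, epsilon=0.1,
          episodes=300, max_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))       # initialize Q[S, A]
    for _ in range(episodes):
        s = env.reset()                               # observe current state s
        a = select_action(Q, s, epsilon, rng)         # select action a
        for _ in range(max_steps):
            s_next, r, done = env.step(a)             # carry out a, observe r and s'
            a_next = select_action(Q, s_next, epsilon, rng)
            # A terminal state has no future value, so bootstrap from 0 there.
            bootstrap = 0.0 if done else Q[s_next, a_next]
            Q[s, a] += alpha * (r + gamma * bootstrap - Q[s, a])
            s, a = s_next, a_next
            if done:
                break
    return Q


Q = sarsa(Corridor())
greedy_policy = Q.argmax(axis=1)    # greedy action per state after training
print(Q.round(3))
print(greedy_policy)
```

Splitting the loop into episodes and treating the terminal state as having zero value are common practical choices for episodic tasks; the pseudocode itself describes a single continuing interaction.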

### Implementation

The following Python code demonstrates how to implement the SARSA algorithm, using OpenAI's gym module to load the environment.

**Code**

    # Import libraries
    import numpy as np
    import gym

**Chosen Environment:**
Here, we will be using the ‘FrozenLake-v0’ environment which is preloaded into gym.

**Description of Environment:**

**FrozenLake-v0:**

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

The surface is described using a grid like the following:

    SFFF       (S: starting point, safe)
    FHFH       (F: frozen surface, safe)
    FFFH       (H: hole, fall to your doom)
    HFFG       (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
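
The sketch below shows one way to relate the grid above to the integer states the agent observes. It assumes the row-major numbering (state = row × 4 + column) and the action order Left, Down, Right, Up that gym's FrozenLake uses. The helper names `to_state` and `tile` are hypothetical.

```python
# The 4x4 map from the description above, one string per row.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = len(GRID[0])

# Assumed action encoding (gym's FrozenLake: 0=Left, 1=Down, 2=Right, 3=Up).
ACTIONS = ["Left", "Down", "Right", "Up"]


def to_state(row, col):
    """Convert grid coordinates to the integer state index (row-major order)."""
    return row * N_COLS + col


def tile(state):
    """Return the map letter (S, F, H or G) for an integer state index."""
    row, col = divmod(state, N_COLS)
    return GRID[row][col]


holes = [s for s in range(16) if tile(s) == "H"]
goal = next(s for s in range(16) if tile(s) == "G")
print("holes:", holes, "goal:", goal)   # prints holes: [5, 7, 11, 12] goal: 15
```

With 16 states and 4 actions, a tabular Q-function for this map has 16 rows and 4 columns.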

    # Build the environment
    env = gym.make('FrozenLake-v0')

    # Initialize the hyperparameters and the Q-table.
    # The number of episodes is 100 and the maximum number of steps per episode is 10.
    epsilon = 0.9          # exploration rate: probability of taking a random action
    total_episodes = 100
    max_steps = 10
    alpha = 0.85           # step size (learning rate)
    gamma = 0.95           # discount factor
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    # Functions to choose the next action and to learn the Q-value
    def choose_action(state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if np.random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])
        return action

    def update(state, state2, reward, action, action2):
        # SARSA update: move Q[state, action] towards reward + gamma * Q[state2, action2]
        predict = Q[state, action]
        target = reward + gamma * Q[state2, action2]
        Q[state, action] = Q[state, action] + alpha * (target - predict)
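
As a side calculation, the sketch below works out the action probabilities implied by an ε-greedy rule like `choose_action`. It assumes that the random branch samples uniformly over all actions and that the greedy action is unique. The helper name `action_probabilities` is hypothetical.

```python
import numpy as np


def action_probabilities(q_row, epsilon):
    """Probability of each action under epsilon-greedy selection.

    Every action receives epsilon / n from the random branch; the greedy
    action (the argmax of q_row) additionally receives 1 - epsilon.
    """
    q_row = np.asarray(q_row, dtype=float)
    probs = np.full(q_row.shape, epsilon / q_row.size)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs


# With epsilon = 0.9 and four actions, the greedy action is chosen
# only 0.9 / 4 + 0.1 = 32.5% of the time.
probs = action_probabilities([0.0, 0.0, 0.0, 1.0], epsilon=0.9)
print(probs)
```

An exploration rate this high keeps the agent acting randomly most of the time; lowering ε brings its behaviour closer to the greedy policy.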

    # Train the agent with SARSA by iterating over all the episodes.
    # `reward` accumulates the total reward collected across all episodes.
    reward = 0

    # Start the SARSA learning
    for episode in range(total_episodes):
        t = 0
        state1 = env.reset()
        action1 = choose_action(state1)

        while t < max_steps:
            # Visualize the training
            env.render()

            # Take the action and observe the step reward and the next state
            # (classic gym API, where step() returns four values)
            state2, step_reward, done, info = env.step(action1)

            # Choose the next action with the same policy
            action2 = choose_action(state2)

            # Learn the Q-value
            update(state1, state2, step_reward, action1, action2)
            state1 = state2
            action1 = action2

            # Update the step counter and the running total reward
            t += 1
            reward += step_reward

            # Stop the episode once a terminal state is reached
            if done:
                break

An excerpt of the rendered output is shown below. The agent's tile, which is highlighted in red in a terminal, is marked here with square brackets.

    [S]FFF
    FHFH
    FFFH
    HFFG
      (Up)
    [S]FFF
    FHFH
    FFFH
    HFFG
      (Left)
    [S]FFF
    FHFH
    FFFH
    HFFG
      (Left)
    [S]FFF
    FHFH
    FFFH
    HFFG
      (Up)
    [S]FFF
    FHFH
    FFFH
    HFFG
      (Down)
    S[F]FF
    FHFH
    FFFH
    HFFG
      (Right)
    S[F]FF
    FHFH
    FFFH
    HFFG
    ...

In the above output, the highlighted tile (red in a terminal) marks the current position of the agent in the environment, while the direction in parentheses is the movement that the agent will attempt next. Note that the agent stays at its position if the move would take it out of bounds.

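The performance of the algorithm is calculated by dividing the total reward collected during training by the total number of episodes. Because the only non-zero reward is the 1 received at the goal, this is the fraction of episodes in which the agent reached the goal.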

    # Evaluate the performance
    print("Performance : ", reward / total_episodes)
    # Visualize the Q-matrix
    print(Q)

**Output**

    Performance :  0.01
    [[0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.10295625 0.         0.686375  ]
     [0.         0.         0.         0.019125  ]
     [0.         0.         0.         0.        ]]
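
As a sanity check on the Q-matrix above, the sketch below replays a few hand-picked updates with α = 0.85 and γ = 0.95. It is one sequence of updates that reproduces the three non-zero entries, not a record of what actually happened during the run, and the helper `sarsa_update` is hypothetical.

```python
alpha, gamma = 0.85, 0.95


def sarsa_update(q_sa, reward, q_next):
    """One SARSA update of a single entry: Q + alpha * (r + gamma * Q' - Q)."""
    return q_sa + alpha * (reward + gamma * q_next - q_sa)


# Entries are named after their [state, action] position in the Q-matrix.
q_14_3 = sarsa_update(0.0, 1.0, 0.0)       # reaching the goal from state 14
q_13_3 = sarsa_update(0.0, 0.0, q_14_3)    # state 13 bootstraps from that entry
q_14_3 = sarsa_update(q_14_3, 0.0, 0.0)    # a later unrewarded step shrinks it
q_13_1 = sarsa_update(0.0, 0.0, q_14_3)    # another entry of state 13 bootstraps from it
q_14_3 = sarsa_update(q_14_3, 0.0, 0.0)    # one more unrewarded step

print(q_13_1, q_13_3, q_14_3)
```

Up to floating-point rounding, the three results match the entries 0.10295625, 0.686375, and 0.019125 in rows 13 and 14 of the matrix.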


### Comparison 

Q-learning and SARSA are both excellent initial approaches for reinforcement learning problems. A few key notes on when to use Q-learning or SARSA:
1. Both approaches work in a finite environment (or a discretized continuous environment).
2. Q-learning directly learns the optimal policy, while SARSA learns a “near-optimal” policy that accounts for exploration. Q-learning is the more aggressive agent, while SARSA is more conservative. An example is walking near a cliff: Q-learning will take the shortest path because it is optimal (at the risk of falling), while SARSA will take the longer, safer route (to avoid an unexpected fall).
3. In practice, if you want to learn quickly in a fast-iterating environment, Q-learning should be your choice. However, if mistakes are costly (for example, unexpected failures in a robot), SARSA is the better option.
4. If your state space is too large, consider a deep Q-network (DQN).
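
The cliff example in point 2 can be tried out with the self-contained sketch below. The 4×12 grid, the rewards (-1 per step, -100 for falling off the cliff), and all hyperparameters are assumptions modelled on the classic cliff-walking task, and `train` is a hypothetical helper that runs either update rule with the same ε-greedy behaviour.

```python
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]    # up, right, down, left


def step(pos, action):
    """Move on the grid; falling off the cliff costs -100 and resets to START."""
    row = min(max(pos[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(pos[1] + MOVES[action][1], 0), COLS - 1)
    if row == ROWS - 1 and 0 < col < COLS - 1:
        return START, -100.0, False
    return (row, col), -1.0, (row, col) == GOAL


def epsilon_greedy(Q, pos, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(MOVES)))
    return int(np.argmax(Q[pos]))


def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Run tabular SARSA (method="sarsa") or Q-learning (any other value)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(MOVES)))
    returns = []
    for _ in range(episodes):
        pos, done, total = START, False, 0.0
        action = epsilon_greedy(Q, pos, epsilon, rng)
        while not done:
            nxt, reward, done = step(pos, action)
            next_action = epsilon_greedy(Q, nxt, epsilon, rng)
            if done:
                bootstrap = 0.0                     # no future value at the goal
            elif method == "sarsa":
                bootstrap = Q[nxt][next_action]     # action actually taken next
            else:
                bootstrap = Q[nxt].max()            # greedy action
            Q[pos][action] += alpha * (reward + gamma * bootstrap - Q[pos][action])
            pos, action, total = nxt, next_action, total + reward
        returns.append(total)
    return Q, returns


Q_sarsa, returns_sarsa = train("sarsa")
Q_qlearning, returns_qlearning = train("q_learning")
print("mean return over the last 100 episodes")
print("SARSA:     ", np.mean(returns_sarsa[-100:]))
print("Q-learning:", np.mean(returns_qlearning[-100:]))
```

On typical runs of this kind of experiment, SARSA's average return during training tends to be higher because it learns to keep away from the cliff edge, whereas Q-learning's greedy path runs right along the edge and exploration occasionally pushes the agent over.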

### Applications

1. **SARSA learning algorithm for reactive power control in power system:** SARSA, an on-policy reinforcement learning algorithm, is applied to the IEEE 39-bus New England power system. Results show that the SARSA learning algorithm is able to provide optimal or near-optimal control settings for the power system under varying system conditions.

2. **A reinforcement learning approach to the shepherding task using SARSA:** In this study, a reinforcement learning model of a dog shepherding a flock of sheep is used. The shepherding task is based on a heuristic model originally proposed by Strombom et al., which describes the dynamics of the sheep while being herded by a dog to a predefined target. The study recreates the proposed model using SARSA, an algorithm for learning the optimal policy in reinforcement learning. Results show that, with a discretized state and action space, the dog is able to successfully herd a flock of sheep to the target position by first learning to reach a subgoal. A reward is given when the dog reaches the neighbourhood of a subgoal, while a penalty is incurred each time the shepherding task is not completed. The stochasticity of the interaction between the sheep and the dog, together with the existence of multiple subgoals, affects the learning time of the agent.

Other applications are as follows:
1. Robotics for industrial automation
2. Business strategy planning
3. Machine learning and data processing
4. Training systems that provide custom instruction and materials according to the requirements of students
5. Aircraft control and robot motion control

### Conclusion

Reinforcement learning is an important concept with great potential, and it plays a critical role in the technology industry. With proper training and testing, it can be used to arrive at solutions for complex problems, and the SARSA algorithm can serve this purpose efficiently and successfully.

