# SARSA (Lambda) [In-Progress]
We take the Temporal Difference Algorithm SARSA (state, action, reward, state', action') and introduce eligibility traces to it. One of the pitfalls of Temporal Difference learning methods for learning is that since they back up step by step (online), a reward is only associated to the last step that led to it. In Monte Carlo methods, we backup the whole sequence of actions, which has its own pitfalls (explored in /MonteCarloMethods/vanilla_monte_carlo.ipynb). Eligibility traces provide some sense of memory to our algorithm. We use this method to update state-action pairs based on their "relevance" to the reward. For example, if a state action pair occurs right before the terminal state, it might be considered more relevant than the first pair in the sequence. However, since the first state-action pair was part of our sequence, we assign some non-zero relevance to this pair that decays over time. This allows us to learn in a more general manner. 

Another method to bridge the gap between the pros and cons of Monte Carlo Methods and Temporal Difference methods is to backup based on some fixed n-step return. In this case, we have a fixed time horizon, which may not be the most suitable if we have an environment in which our episode lengths may differ drastically. Additionally, as n approaches the length of the episode, we effectively have Monte Carlo Control 

In [43]:
import gym 
import numpy as np 
from datetime import datetime

In [44]:
env = gym.make('Taxi-v3', render_mode = 'human')
env.reset()
env.render()

In [45]:
state_space = env.observation_space
action_space = env.action_space

#print("We have {} action space and {} state space".format(action_space, state_space))

In [46]:
#This function returns an array of action probabilities for a given state (a polic) 
#this policy is designed to be epsilon-greedy in relation to the state action value function Q 
def policy_fn(Q, num_actions, e, state):
    action_probabilities = np.ones(num_actions) * (e/num_actions)
    highest_action_value = np.argmax(Q[state])
    action_probabilities[highest_action_value] += (1 - e)
    
    return action_probabilities

In [47]:
#Initializations 
Q = np.zeros((state_space.n, action_space.n))
returns = [[[]]*action_space.n for i in range(state_space.n)]
pi = np.zeros((state_space.n, action_space.n))
epochs_per_episode = []
action_prev = 0 
state_prev = 0 

#hyperparams 
num_episodes = 500
epsilon = 0.2
gamma = 1
alpha = 0.1
lamda = 0.4

## Prediction and Control with SARSA (lambda) using Dutch Eligibility Traces 

In [48]:
E = np.zeros((state_space.n, action_space.n))
time_start = datetime.now().time()
for i in range(num_episodes):
    
    cumulative_reward = 0 
    epoch = 0
    
    state = env.reset()
    if type(state)==tuple: 
        state = state[0]
    pi[state] = policy_fn(Q, action_space.n, epsilon, state)
    action = np.random.choice(np.arange(action_space.n), p = pi[state])
    
    terminated = False 
    print("Episode: {} Epoch: {}".format(i, epoch))
    
    while not terminated: 

        epoch+=1
        if type(state)==tuple: 
            state = state[0]
        
        next_state, reward, terminated, truncated, step_dict = env.step(action)
        
        pi[state] = policy_fn(Q, action_space.n, epsilon, state)
        next_action = np.random.choice(np.arange(action_space.n), p = pi[state])
        
        if type(next_state)==tuple: 
            next_state = next_state[0]
        
        #calculate the temporal difference error for the predicted value of the state-action 
        #pair 
        td_error = reward + (gamma * Q[next_state][next_action]) - Q[state][action]
        
        # we use accumulating eligibility traces 
        E[state][action] = (E[state][action]) + 1
        
        #Backup state-action pair based on its TD error and eligibility trace
        Q[state][action] = Q[state][action] + (alpha * td_error * E[state][action])
        
        #Decay all Eligibility traces
        E = lamda * gamma * E
        
        state = next_state
        action = next_action
        
        if terminated: 
            print("Episode: {} Epoch: {}".format(i, epoch))
            epochs_per_episode.append(epoch)

time_end = datetime.now().time()

Episode: 0 Epoch: 0
Episode: 0 Epoch: 661
Episode: 1 Epoch: 0
Episode: 1 Epoch: 1283
Episode: 2 Epoch: 0
Episode: 2 Epoch: 1940
Episode: 3 Epoch: 0
Episode: 3 Epoch: 931
Episode: 4 Epoch: 0
Episode: 4 Epoch: 343
Episode: 5 Epoch: 0
Episode: 5 Epoch: 595
Episode: 6 Epoch: 0
Episode: 6 Epoch: 2252
Episode: 7 Epoch: 0
Episode: 7 Epoch: 884
Episode: 8 Epoch: 0
Episode: 8 Epoch: 2965
Episode: 9 Epoch: 0
Episode: 9 Epoch: 1019
Episode: 10 Epoch: 0
Episode: 10 Epoch: 4331
Episode: 11 Epoch: 0
Episode: 11 Epoch: 1647
Episode: 12 Epoch: 0
Episode: 12 Epoch: 676
Episode: 13 Epoch: 0
Episode: 13 Epoch: 3649
Episode: 14 Epoch: 0
Episode: 14 Epoch: 1628
Episode: 15 Epoch: 0
Episode: 15 Epoch: 890
Episode: 16 Epoch: 0
Episode: 16 Epoch: 143
Episode: 17 Epoch: 0
Episode: 17 Epoch: 852
Episode: 18 Epoch: 0
Episode: 18 Epoch: 1387
Episode: 19 Epoch: 0
Episode: 19 Epoch: 3487
Episode: 20 Epoch: 0
Episode: 20 Epoch: 112
Episode: 21 Epoch: 0
Episode: 21 Epoch: 220
Episode: 22 Epoch: 0
Episode: 22 Epoch: 2

In [None]:
datetime1 = datetime.combine(datetime.today(), time_start)
datetime2 = datetime.combine(datetime.today(), time_end)

# calculate the difference between the datetime objects
time_diff = (datetime2 - datetime1).total_seconds()

## Visualizing the learning process

In [None]:
import matplotlib.pyplot as plt 

num_episode = list(range(1, len(epochs_per_episode) + 1))
plt.bar(num_episode, epochs_per_episode)
plt.xlabel('Episode')
plt.ylabel('Epoch')
plt.show()

From observing the learning process, it may seem that sarsa lambda is not the ideal choice in this scenario. The rate of convergence is far inferior to that of the off policy Q-learning method evaluated previously. Q-learning learns Q-values based on the optimal action, not the action sampled by the epsilon greedy policy. 

The agent is definitely learning, but random spikes in number of epochs for an episode are indicative of some error. It could be that the agent is learning to favor certain action sequences due to the existence of traces, but the task itself is not the most sequential in nature. 

On the other hand, this method seems to be outperforming the regular SARSA algorithm on this task. 

Half way through training (250 episodes and 10 hours in), my laptop crashed which caused me to lose the stored progress. The gist of the plot that would have been was a gradually decaying curve that every few (10-15 episodes) would spike up to 1000+ steps for the episode