# SARSA (Lambda) [In-Progress]
We take the temporal-difference (TD) algorithm SARSA (state, action, reward, state', action') and add eligibility traces to it. One pitfall of one-step TD methods is that, because they back up one step at a time (online), a reward is credited only to the last step that led to it. Monte Carlo methods instead back up the whole sequence of actions, which has its own pitfalls (explored in /MonteCarloMethods/vanilla_monte_carlo.ipynb). Eligibility traces give the algorithm a form of memory: each state-action pair is updated in proportion to its "relevance" to the reward. For example, a state-action pair that occurs right before the terminal state is considered more relevant than the first pair in the sequence. The first pair was still part of the sequence, however, so it keeps a non-zero relevance that decays over time. This lets a single reward update every pair that contributed to it, not just the most recent one.
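
To make the decaying "relevance" concrete, here is a minimal sketch of an accumulating eligibility trace over a four-step sequence. The helper name `trace_after_sequence` and the toy indices are illustrative assumptions, not part of this notebook; the notebook itself uses a Dutch trace further down.

```python
import numpy as np

def trace_after_sequence(visits, num_pairs, gamma=1.0, lam=0.6):
    """Hypothetical helper: accumulating trace after visiting `visits` in order."""
    E = np.zeros(num_pairs)
    for pair in visits:
        E *= gamma * lam   # every trace decays by gamma * lambda at each step
        E[pair] += 1.0     # the pair that was just visited is bumped
    return E

# Four distinct state-action pairs (flattened to indices 0..3), visited once each.
traces = trace_after_sequence(visits=[0, 1, 2, 3], num_pairs=4)
print(traces)  # the most recent pair has the largest trace, the first pair the smallest
```

With `lam = 0.6`, each step back in time scales a pair's share of the TD error by another factor of 0.6, so earlier pairs still learn from the reward, only less.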

Another way to bridge the gap between Monte Carlo and temporal-difference methods is to back up using a fixed n-step return. This fixes the time horizon, which may be unsuitable for environments where episode lengths differ drastically. Additionally, as n approaches the length of the episode, we effectively recover Monte Carlo control.
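
As a rough illustration of the n-step return, the sketch below sums up to n rewards and bootstraps from a value estimate only if the episode has not ended by then. The function `n_step_return`, the toy rewards, and the toy value estimates are all assumptions made for this example.

```python
def n_step_return(rewards, q_estimates, t, n, gamma=1.0):
    """Hypothetical helper: n-step return starting at time step t.

    rewards[k]     : reward received after acting at step k
    q_estimates[k] : current estimate Q(S_k, A_k) for the pair visited at step k
    """
    T = len(rewards)                 # episode length
    end = min(t + n, T)              # the window cannot run past the end of the episode
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:                      # episode still running after n steps: bootstrap
        G += gamma ** n * q_estimates[end]
    return G

rewards = [-1, -1, -1, 20]           # toy episode with 4 steps
q_estimates = [0.0, 0.5, 1.0, 2.0]   # toy value estimates

one_step = n_step_return(rewards, q_estimates, t=0, n=1)       # SARSA-style target
full = n_step_return(rewards, q_estimates, t=0, n=4)           # n equals episode length
beyond = n_step_return(rewards, q_estimates, t=0, n=10)        # n larger than episode
print(one_step, full, beyond)
```

Once n reaches the episode length, no bootstrapping happens and the result is the plain Monte Carlo return, which is the limiting case described above.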

In [46]:
import gym 
import numpy as np 
from datetime import datetime

In [47]:
# Note: render_mode="human" draws every step in a window, which slows training considerably
env = gym.make("Taxi-v3", render_mode = "human")
env.reset()
env.render()

In [48]:
state_space = env.observation_space
action_space = env.action_space

#print("We have {} action space and {} state space".format(action_space, state_space))

In [49]:
#This function returns an array of action probabilities for a given state (a policy).
#The policy is epsilon-greedy with respect to the state-action value function Q.
def policy_fn(Q, num_actions, e, state):
    action_probabilities = np.ones(num_actions) * (e/num_actions)
    highest_action_value = np.argmax(Q[state])
    action_probabilities[highest_action_value] += (1 - e)
    
    return action_probabilities
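
A quick sanity check of the epsilon-greedy policy, as a sketch: the function is repeated here only so the snippet runs on its own, and the tiny `Q_demo` and `Q_tie` tables are made-up values.

```python
import numpy as np

def policy_fn(Q, num_actions, e, state):
    # Same logic as the cell above, copied so this example is self-contained
    action_probabilities = np.ones(num_actions) * (e / num_actions)
    highest_action_value = np.argmax(Q[state])
    action_probabilities[highest_action_value] += (1 - e)
    return action_probabilities

# One state, four actions; action 1 has the highest value.
Q_demo = np.array([[0.0, 2.0, 1.0, 0.0]])
probs = policy_fn(Q_demo, num_actions=4, e=0.1, state=0)
print(probs)

# With all-zero values, np.argmax returns the first index,
# so action 0 receives the greedy share of the probability mass.
Q_tie = np.zeros((1, 4))
tie_probs = policy_fn(Q_tie, num_actions=4, e=0.1, state=0)
print(tie_probs)
```

Because `Q` starts at zero in this notebook, the tie-breaking behaviour of `np.argmax` means early episodes favour action 0 until the values separate.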

In [50]:
#Initializations 
Q = np.zeros((state_space.n, action_space.n))
returns = [[[] for _ in range(action_space.n)] for _ in range(state_space.n)]  # separate list per pair (not used by SARSA(lambda) below)
pi = np.zeros((state_space.n, action_space.n))
epochs_per_episode = []
action_prev = 0 
state_prev = 0 

#hyperparams 
num_episodes = 500
epsilon = 0.1
gamma = 1
alpha = 0.1
lamda = 0.6 

## Prediction and Control with SARSA (lambda) using Dutch Eligibility Traces 
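
For context on the trace used in this section, the sketch below compares three common ways of bumping the trace of a visited pair: accumulating, replacing, and Dutch. The helpers `bump` and `repeated_visits` are hypothetical names for this example; only the Dutch form, `(1 - alpha) * E + 1`, appears in the notebook's code.

```python
def bump(trace, kind, alpha=0.1):
    """Hypothetical helper: update the trace of the pair that was just visited."""
    if kind == "accumulating":
        return trace + 1.0                    # traces can grow without bound on revisits
    if kind == "replacing":
        return 1.0                            # trace is reset to 1 on every visit
    if kind == "dutch":
        return (1.0 - alpha) * trace + 1.0    # in between the two, controlled by alpha
    raise ValueError("unknown trace kind: {}".format(kind))

def repeated_visits(kind, visits=3, gamma=1.0, lam=0.6, alpha=0.1):
    """Trace of a single pair that is visited on `visits` consecutive steps."""
    trace = 0.0
    for _ in range(visits):
        trace *= gamma * lam                  # decay since the previous step
        trace = bump(trace, kind, alpha)
    return trace

for kind in ("accumulating", "dutch", "replacing"):
    print(kind, repeated_visits(kind))
```

With `alpha = 0` the Dutch update matches the accumulating one, and with `alpha = 1` it matches the replacing one, which is why it is often described as an interpolation between them.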

In [51]:
time_start = datetime.now().time()
for i in range(num_episodes):
    # Eligibility traces are reset at the start of every episode
    E = np.zeros((state_space.n, action_space.n))
    state = env.reset()
    if isinstance(state, tuple):  # newer gym versions return (observation, info)
        state = state[0]

    # Choose the first action from the epsilon-greedy policy
    pi[state] = policy_fn(Q, action_space.n, epsilon, state)
    action = np.random.choice(np.arange(action_space.n), p = pi[state])
    epoch = 0
    done = False
    print("Episode: {} Epoch: {}".format(i, epoch))

    while not done:
        epoch += 1
        next_state, reward, terminated, truncated, step_dict = env.step(action)
        # End the episode on truncation (time limit) as well as on termination
        done = terminated or truncated

        # SARSA is on-policy: sample the next action from the policy at next_state
        pi[next_state] = policy_fn(Q, action_space.n, epsilon, next_state)
        next_action = np.random.choice(np.arange(action_space.n), p = pi[next_state])

        # Temporal difference error for the current state-action pair;
        # do not bootstrap from a terminal state
        target = reward + gamma * Q[next_state][next_action] * (not terminated)
        td_error = target - Q[state][action]
        E[state][action] = (1 - alpha)*(E[state][action]) + 1 # we use dutch eligibility traces

        # Back up every state-action pair in proportion to its eligibility trace,
        # then decay all traces (vectorized over the whole table)
        Q += alpha * td_error * E
        E *= lamda * gamma

        state = next_state
        action = next_action

    print("Episode: {} Epoch: {}".format(i, epoch))
    epochs_per_episode.append(epoch)

time_end = datetime.now().time()

Episode: 0 Epoch: 0


In [None]:
datetime1 = datetime.combine(datetime.today(), time_start)
datetime2 = datetime.combine(datetime.today(), time_end)

# calculate the difference between the datetime objects
time_diff = (datetime2 - datetime1).total_seconds()
print("Total time taken for {} episodes: {} seconds".format(num_episodes, time_diff))


## Visualizing the learning process

In [None]:
import matplotlib.pyplot as plt 

num_episode = list(range(1, len(epochs_per_episode) + 1))
plt.bar(num_episode, epochs_per_episode)
plt.xlabel('Episode')
plt.ylabel('Epochs (steps) per episode')
plt.show()