# Temporal Difference Learning 

In [65]:
import gym 
import numpy as np 
import time

### Environment: OpenAI Taxi-v3

For the purposes of this assignment (as in the previous notebook), we use Taxi-v3 from OpenAI's Gym repository. This environment, its states, actions and goals are detailed on the following web page: https://www.gymlibrary.dev/environments/toy_text/taxi/. The environment is kept the same as the vanilla monte-carlo algorithm to be able to compare efficiency and speed.  

In [66]:
env = gym.make("Taxi-v3", render_mode = "human")
env.reset()
env.render()

In [67]:
state_space = env.observation_space
action_space = env.action_space

print("We have {} action space and {} state space".format(action_space, state_space))

We have Discrete(6) action space and Discrete(500) state space


### On-policy Monte Carlo Control 

Instead of initializing state and action previous as zero for the SARSA update step, we could sample the next action during training and calculate the difference, but instead we choose to do it this way so that the actions used in the update step are those that have been selected in the episode. If there's something wrong with this method don't be afraid to let me know. 

In [68]:
#This function returns an array of action probabilities for a given state (a polic) 
#this policy is designed to be epsilon-greedy in relation to the state action value function Q 
def policy_fn(Q, num_actions, e, state):
    action_probabilities = np.ones(num_actions) * (e/num_actions)
    highest_action_value = np.argmax(Q[state])
    action_probabilities[highest_action_value] += (1 - e)
    
    return action_probabilities

def update_params(Q, state, action, reward, s_next, alpha, gamma) -> None: 
    old_val = Q[state, action]
    a_next = np.max(Q[s_next])
    new_val = old_val + alpha * (reward + gamma * a_next - old_val)
    Q[state, action] = new_val

In [69]:
#Initializations 
Q = np.zeros((state_space.n, action_space.n))
returns = [[[]]*action_space.n for i in range(state_space.n)]
pi = np.zeros((state_space.n, action_space.n))
state_prev = 0 # dummy initialization of previous state and action for step 0 
action_prev = 0 
epochs_per_episode = []

#hyperparams 
num_episodes = 1000
e = 0.1
gamma = 0.6
alpha = 0.1

In [70]:
for i in range(num_episodes):
    
    episode = []
    state = env.reset()
    cumulative_reward = 0 
    epoch = 0
    terminated = False 
    print("Episode: {} Epoch: {}".format(i, epoch))
    
    while not terminated: 
        
        epoch+=1
        if type(state)==tuple: 
            state = state[0]
        
        pi[state] = policy_fn(Q, action_space.n, e, state)
        action = np.random.choice(np.arange(action_space.n), p = pi[state])
        
        next_state, reward, terminated, truncated, step_dict = env.step(action)
        episode.append((state, action, reward))
        
        update_params(Q, state, action, reward, next_state, alpha, gamma)
        
        if terminated: 
            print("Episode: {} Epoch: {}".format(i, epoch))
            epochs_per_episode.append(epoch)
            break 
        
        state = next_state

Episode: 0 Epoch: 0


## Visualizing the learned policy

### Results 
