# Challenge Assignment
## Cliff Walking with Reinforcement Learning

## CSCI E-82A
 
# STUDENT: CATHAL (CHARLIE) FLANAGAN

## TEAM HAWK: CATHAL (CHARLIE) FLANAGAN, SACHIN MATHUR, JEFF WINCHELL

## Introduction

In this challenge you will apply several reinforcement learning algorithms to a classic problem in reinforcement learning, known as the cliff walking problem. The cliff walking problem is basically a game. The goal is for the agent to find the highest reward (lowest cost) path from a starting state to the goal. 

There are a number of versions of the cliff walking problems which have been used as research benchmarks over the years. A typical cliff walking problem might use a grid of 4x12. For this challenge you will work with a reduced size grid world of 4x4 illustrated below to reduce training time for your models.   

<img src="CliffWalking.JPG" alt="Drawing" style="width:200px; height:200px"/>
<center> **Grid World for similified cliff walking problem** </center>

The goal is to find the highest reward path from the **starting state**, 12, to the **terminal state**, 15, making this an **episodic task**. The rewards for this task are:
1. A reward of -1 for most state transitions. The -1 reward apples to state to state transitions and to transitions toward the boundary of the grid transitioning to the same state.    
2. A reward of -100 for 'falling off the cliff'. Falling off the cliff occurs when entering states 13 or 14. The only possible transition out of the cliff states is back to the origin state, 12. There are no possible transitions toward the boundary from the cliff state. 

Intuitively, we can see that the optimal solution follows the dotted line path shown in the diagram above. The challenge is to find a path that is as close to this optimal as possible.   

You can find a short discussion of the cliff walking problem on page 132 of Sutton and Barto, second edition. 

## Challenge

For this challenge you will do the following:

1. Create a simulator for the grid world environment. All interactions between your agents and the environment must be though calls to the function you create.  
2. Create and apply state value estimation agents using the RL algorithms.
3. Use the general policy improvement (GPI) algorithm with the appropriate control algorithm to improve policy. 
4. Use the state value estimation agent to evaluate the improved policy. 
5. Compare the results for the various control algorithms you try. 

Methods to use to solve this problem:

1. The Monte Carlo method for value estimation and (action value) control. The action value method for Monte Carlo has not been explicitly addressed in this course. You can find the pseudo code for Monte Carlo control on page 101 of Sutton and Barto, second edition.   
2. Create and execute agents using the n-step TD method for value estimation and n-step SARSA (action value) for control.
3. Create and execute agents using TD(0) for value estimation and SARSA(0) or Double Q-Learning (action value) control. You are welcome to try both algorithms if you have the time. 
4. For additional, but optional, challenge you may wish to try a dynamic programming algorithm. Does DP work for this problem or not, and why? 

> **Hints**
> - For TD(0), n-step TD, n-step SARSA, SARSA(0) and Double Q-learning, you may need to change the reward to -10 for state transitions toward the boundary of the grid world.  
> - For the n-step algorithms keep in mind that the grid world is rather small. 
> - Make sure you are not accidentally using two epsilon greedy steps in your GPI process.  

## 1.The Monte Carlo method for value estimation and (action value) control. The action value method for Monte Carlo has not been explicitly addressed in this course. You can find the pseudo code for Monte Carlo control on page 101 of Sutton and Barto, second edition.

In [65]:
import numpy as np
import numpy.random as nr

## Define the transition dictonary of dictionaries:
neighbors={0:{'u':0, 'd':4, 'l':0, 'r':1},
          1:{'u':1, 'd':5, 'l':0, 'r':2},
          2:{'u':2, 'd':6, 'l':1, 'r':3},
          3:{'u':3, 'd':7, 'l':2, 'r':3},
          4:{'u':0, 'd':8, 'l':4, 'r':5},
          5:{'u':1, 'd':9, 'l':4, 'r':6},
          6:{'u':2, 'd':10, 'l':5, 'r':7},
          7:{'u':3, 'd':11, 'l':6, 'r':7},
          8:{'u':4, 'd':12, 'l':8, 'r':9},
          9:{'u':5, 'd':13, 'l':8, 'r':10},
          10:{'u':6, 'd':14, 'l':9, 'r':11},
          11:{'u':7, 'd':15, 'l':10, 'r':11},
          12:{'u':8, 'd':12, 'l':12, 'r':13},
          13:{'u':12, 'd':12, 'l':12, 'r':12},
          14:{'u':12, 'd':12, 'l':12, 'r':12},
          15:{'u':15, 'd':15, 'l':15, 'r':15}}

In [66]:
policy =               {0:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        1:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25}, 
                        2:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        3:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        4:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        5:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        6:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        7:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        8:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        9:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        10:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        11:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        12:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        13:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        14:{'u':0.25, 'd':0.25, 'l':0.25, 'r':0.25},
                        15:{'u':0.00, 'd':0.00, 'l':0.00, 'r':0.00}}


In [67]:
rewards ={0:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          1:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          2:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          3:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          4:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          5:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          6:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          7:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          8:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          9:{'u':-1.0, 'd':-100.0, 'l':-1.0, 'r':-1.0},
          10:{'u':-1.0, 'd':-100.0, 'l':-1.0, 'r':-1.0},
          11:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          12:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-100.0},
          13:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          14:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          15:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}}

In [68]:
#Rewards with a -10.0 for hitting the wall
rew_wall={0:{'u':-10.0, 'd':-1.0, 'l':-10.0, 'r':-1.0},
          1:{'u':-10.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          2:{'u':-10.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          3:{'u':-10.0, 'd':-1.0, 'l':-1.0, 'r':-10.0},
          4:{'u':-1.0, 'd':-1.0, 'l':-10.0, 'r':-1.0},
          5:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          6:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          7:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-10.0},
          8:{'u':-1.0, 'd':-1.0, 'l':-10.0, 'r':-1.0},
          9:{'u':-1.0, 'd':-100.0, 'l':-1.0, 'r':-1.0},
          10:{'u':-1.0, 'd':-100.0, 'l':-1.0, 'r':-1.0},
          11:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-10.0},
          12:{'u':-1.0, 'd':-10.0, 'l':-10.0, 'r':-100.0},
          13:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          14:{'u':-1.0, 'd':-1.0, 'l':-1.0, 'r':-1.0},
          15:{'u':0.0, 'd':0.0, 'l':0.0, 'r':0.0}}

In [69]:
start_state = 12

In [70]:
def MC_generate_episode(start_state, policy, neighbors, terminal):
    ## List of states which might be visited in episode
    n_states = len(policy)
#    visited_state = [0] * n_states
    states = list(neighbors.keys())
    current_state = start_state
    #while(current_state == terminal): # Keep trying to not use terminal state to start
    #    current_state = nr.choice(states, size = 1)[0]
            
    ## Take a random walk trough the states until we get to the terminal state
    ## We do some bookkeeping to ensure we only visit states once.
    visited = [] # List of states visited on random walk
    while(current_state != terminal): # Stop when at terminal state
        ## Probability of state transition given policy
        probs = list(policy[current_state].values())
        ## Find next state to transition to
        next_state = nr.choice(list(neighbors[current_state].values()), size = 1, p = probs)[0]
        visited.append(next_state)
        current_state = next_state  
    return(visited)    
    
    
MC_generate_episode(start_state, policy, neighbors, 15)

[12,
 8,
 8,
 9,
 13,
 12,
 12,
 12,
 12,
 12,
 13,
 12,
 8,
 8,
 12,
 12,
 12,
 13,
 12,
 12,
 12,
 12,
 8,
 12,
 12,
 13,
 12,
 12,
 12,
 12,
 12,
 12,
 13,
 12,
 13,
 12,
 12,
 13,
 12,
 12,
 8,
 9,
 5,
 4,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 2,
 2,
 1,
 1,
 0,
 0,
 1,
 2,
 6,
 10,
 6,
 10,
 9,
 10,
 14,
 12,
 8,
 4,
 4,
 5,
 6,
 2,
 3,
 3,
 7,
 6,
 10,
 14,
 12,
 8,
 8,
 9,
 8,
 12,
 12,
 13,
 12,
 12,
 12,
 12,
 13,
 12,
 12,
 13,
 12,
 12,
 8,
 9,
 5,
 9,
 10,
 6,
 10,
 6,
 2,
 1,
 2,
 2,
 2,
 2,
 6,
 2,
 1,
 0,
 1,
 1,
 2,
 6,
 7,
 3,
 7,
 11,
 10,
 9,
 13,
 12,
 12,
 13,
 12,
 12,
 12,
 12,
 13,
 12,
 13,
 12,
 12,
 12,
 8,
 8,
 8,
 9,
 5,
 4,
 8,
 4,
 4,
 0,
 4,
 8,
 8,
 12,
 13,
 12,
 12,
 12,
 8,
 8,
 9,
 5,
 1,
 2,
 6,
 2,
 1,
 0,
 1,
 5,
 6,
 7,
 6,
 7,
 11,
 10,
 6,
 7,
 11,
 10,
 9,
 8,
 12,
 12,
 12,
 12,
 8,
 12,
 8,
 8,
 4,
 5,
 4,
 8,
 8,
 8,
 8,
 4,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 4,
 0,
 0,
 1,
 2,
 3,
 3,
 2,
 2,
 6,
 10,
 14,
 12,
 12,
 8,
 9,
 5,
 4,
 4,
 0,
 

In [72]:
def MC_state_values(start_state, policy, neighbors, rewards, terminal, episodes = 1):
    '''Function for first visit Monte Carlo on GridWorld.'''
    ## Create list of states 
    states = list(policy.keys())
    n_states = len(states)
    
    ## An array to hold the accumulated returns as we visit states
    G = np.zeros((episodes,n_states))
    
    ## An array to keep track of how many times we visit each state so we can 
    ## compute the mean
    n_visits = np.zeros((n_states))
    
    ## Iterate over the episodes
    for i in range(episodes):
        ## For each episode we use a list to keep track of states we have visited.
        ## Once we visit a state we need to accumulate values to get the returns
        states_visited = []
   
        ## Get a path for this episode
        visit_list = MC_generate_episode(start_state, policy, neighbors, terminal)
        current_state = visit_list[0]
        for state in visit_list[0:]: 
            ## list of states we can transition to from current state
            transition_list = list(neighbors[current_state].values())
            
            if(state in transition_list): # Make sure the transistion is allowed
                transition_index = transition_list.index(state)   
  
                ## find the action value for the state transition
                v_s = list(rewards[current_state].values())[transition_index]
   
                ## Mark that the current state has been visited 
                if(state not in states_visited): states_visited.append(current_state)  
                ## Loop over the states already visited to add the value to the return
                for visited in states_visited:
                    G[i,visited] = G[i,visited] + v_s
                    n_visits[visited] = n_visits[visited] + 1.0
            ## Update the current state for next transition
            current_state = state   
    
    ## Compute the average of G over the episodes are return
    n_visits = [nv if nv != 0.0 else 1.0 for nv in n_visits]
    returns = np.divide(np.sum(G, axis = 0), n_visits)   
    return(returns)              
    
returns = MC_state_values(start_state, policy, neighbors, rewards, terminal = 15, episodes = 100)
np.array(returns).reshape((4,4))

array([[-8.39044944, -8.22283902, -7.67400657, -8.05426246],
       [-8.0979013 , -8.01480963, -8.66866182, -8.41305178],
       [-8.79894064, -9.31471463, -9.17133492, -8.65463918],
       [-9.74173525, -8.15057712, -9.35509554,  0.        ]])

In [73]:
import copy
def MC_optimal_policy(start_state, policy, neighbors, rewards, terminal, episodes = 10, cycles = 10, epsilon = 0.05):
    ## Create a working cooy of the initial policy
    current_policy = copy.deepcopy(policy)
    
    ## Loop over a number of cycles of GPI
    for _ in range(cycles):
        ## First compute the average returns for each of the states. 
        ## This is the policy evaluation phase
        returns = MC_state_values(start_state, current_policy, neighbors, rewards, terminal = terminal,\
                                  episodes = episodes)
        
        ## We want max Q for each state, where Q is just the difference 
        ## in the values of the possible state transition
        ## This is the policy evaluation phase
        for s in current_policy.keys(): # iterate over all states
            ## Compute Q for each possible state transistion
            ## Start by creating a list of the adjacent states.
            possible_s_prime = neighbors[s]
            neighbor_states = list(possible_s_prime.values())
            ## Check if terminal state is neighbor, but state is not terminal.
            if(terminal in neighbor_states and s != terminal):
                ## account for the special case adjacent to goal
                neighbor_Q = []
                for s_prime in possible_s_prime.keys(): # Iterate over adjacent states
                    if(neighbors[s][s_prime] == terminal):  
                         neighbor_Q.append(returns[s])
                    else: neighbor_Q.append(0.0) ## Other transisions have 0 value.   
            else: 
                 ## The other case is rather easy. Compute Q for the transistion to each neighbor           
                 neighbor_values = returns[neighbor_states]
                 neighbor_Q = [n_val - returns[s] for n_val in neighbor_values]
                
            ## Find the index for the state transistions with the largest values 
            ## May be more than one. 
            max_index = np.where(np.array(neighbor_Q) == max(neighbor_Q))[0]  
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(len(np.array(neighbor_Q)))
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder          
                   
    return current_policy
 
 
nr.seed(9876)
MC_policy = MC_optimal_policy(start_state, policy, neighbors, rew_wall, terminal = 15, episodes = 50, cycles = 10, 
                              epsilon = 0.1)  
MC_policy

{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 2: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 3: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 4: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 5: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 6: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 7: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 8: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 9: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},


## 1.Create a simulator for the grid world environment. All interactions between your agents and the environment must be though calls to the function you create

In [74]:
def next_state(start_state, policy, neighbors, rewards, terminal):
    s = list(policy[start_state].keys())
    k = nr.choice(s, size = 1)[0]
    r = rew_wall[start_state][k]
    s_next = neighbors[start_state][k]
    done = False
    if s_next == terminal:
        done = True
    return (start_state,s_next,k,r,done)

In [75]:
done = False
start_state = 12
while done == False:
    (start_state,s_next,action,r,done) = next_state(start_state, policy, neighbors, rew_wall, 15)
    print("Current State = ",start_state,"Next State = ",s_next,"Action = ",action,"Reward = ",r,"Done = ",done)
    start_state = s_next

Current State =  12 Next State =  13 Action =  r Reward =  -100.0 Done =  False
Current State =  13 Next State =  12 Action =  d Reward =  -1.0 Done =  False
Current State =  12 Next State =  12 Action =  l Reward =  -10.0 Done =  False
Current State =  12 Next State =  12 Action =  d Reward =  -10.0 Done =  False
Current State =  12 Next State =  12 Action =  l Reward =  -10.0 Done =  False
Current State =  12 Next State =  8 Action =  u Reward =  -1.0 Done =  False
Current State =  8 Next State =  12 Action =  d Reward =  -1.0 Done =  False
Current State =  12 Next State =  13 Action =  r Reward =  -100.0 Done =  False
Current State =  13 Next State =  12 Action =  u Reward =  -1.0 Done =  False
Current State =  12 Next State =  8 Action =  u Reward =  -1.0 Done =  False
Current State =  8 Next State =  4 Action =  u Reward =  -1.0 Done =  False
Current State =  4 Next State =  8 Action =  d Reward =  -1.0 Done =  False
Current State =  8 Next State =  8 Action =  l Reward =  -10.0 D

### Markov - Control

In [76]:
import pandas as pd
from collections import OrderedDict
def mc_control(start_state, policy, neighbors, rewards, terminal, cycles, episodes,epsilon):
    cp = copy.deepcopy(policy)
    
    for j in range(0,cycles):
        nx = []
        act = []
        rew = []
        done = False
    
        for z in range(0,episodes): 
            done = False
            start_state = 12
            while done == False:
                (current_state,s_next,action,r,done) = next_state(start_state, policy, neighbors, rew_wall, 15)
                start_state = s_next
                nx.append(current_state)
                act.append(action)
                rew.append(r)
        df = pd.DataFrame(OrderedDict({'state':nx, 'action':act, 'reward':rew}))
        f1 = df.groupby(['state','action']).mean()
        f1.reset_index(inplace=True)
        max_df = f1.groupby(['state','action']).max()
        #f2 = max_df.reset_index()
        to_be_merged = max_df.groupby(['state']).max()
        to_be_merged.reset_index(inplace=True)
        f2 = f1.merge(to_be_merged, on=['state'])
        f3 = f2[f2.reward_x == f2.reward_y]
        #df = f3
        p = random.random()
        if p >0.5:
            df = f3.groupby('state').last().reset_index()
        else:
            df = f3.groupby('state').first().reset_index()
        #df = f3.groupby('state').sample(n=1).reset_index()
        #return(df)
        
        #Update Policy
        st = df.index.tolist()
        action = df.action.tolist()

        prob_for_policy = 1- epsilon
        remainder = epsilon
        for k in range(0,len(st)):
            for i, key in enumerate(cp[st[k]]): ## Update policy
                #print(i,key)
                if key == action[k]:
                    #cp[st[k]][key] = 1 - epsilon + (epsilon/np.absolute(r[k]))
                    cp[st[k]][key] = 0.7
                else:
                    #cp[st[k]][key] = epsilon
                    cp[st[k]][key] = 0.1
    
    
    return(cp)
        

In [77]:
df = mc_control(12, policy, neighbors, rewards, 15, 20,1000,0.1)
df

{0: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 1: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 2: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 3: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 4: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 5: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 6: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 7: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 8: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 9: {'d': 0.1, 'l': 0.7, 'r': 0.1, 'u': 0.1},
 10: {'d': 0.1, 'l': 0.7, 'r': 0.1, 'u': 0.1},
 11: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 12: {'d': 0.1, 'l': 0.1, 'r': 0.1, 'u': 0.7},
 13: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 14: {'d': 0.7, 'l': 0.1, 'r': 0.1, 'u': 0.1},
 15: {'d': 0.0, 'l': 0.0, 'r': 0.0, 'u': 0.0}}

# 2. Create and execute agents using the n-step TD method for value estimation and n-step SARSA (action value) for control.


# TD_N


In [24]:
def TD_n(policy, episodes, n, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform TD(N) policy evaluation.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    v = [0]*len(list(policy.keys()))
    
    current_policy = copy.deepcopy(policy)
    
    
    ## sample an initial state at random and make sure is not terminal state
    s = nr.choice(states, size = 1)[0]
    while(s == goal):
        s = nr.choice(states, size = 1)[0]  
        
    for _ in range(episodes): # Loop over the episodes
        T = float("inf")
        tau = 0
        reward_list = []
        t = 0
        
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## Choose action given policy
                probs = list(policy[s].values())
                a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                
            tau = t - n + 1 ## update the time step being updated

            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n - 1, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * v[s_prime]
                ## Update v
                v[s] = v[s] + alpha * (G - v[s])
            
            ## Set state for next iteration
            if(s_prime != goal):
                s = s_prime
            t = t +1
    return(v)

np.round(np.array(TD_n(policy, episodes = 1000, n = 2, goal = 15, alpha = 0.2, gamma = 0.9)).reshape((4,4)), 4)

array([[ -26.8557,  -23.7034,  -23.3309,  -27.3336],
       [ -26.9808,  -26.8121,  -25.5878,  -20.0225],
       [ -50.3441,  -57.7306,  -72.3257,  -10.9738],
       [ -66.326 , -173.3437, -167.282 ,    0.    ]])

In [25]:
import copy

def select_a_prime(s_prime, policy, action_index, greedy, goal):
    ## Randomly select an action prime 
    ## Make sure to handle the terminal state
    if(s_prime != goal and greedy): 
        probs = list(policy[s_prime].values())
        a_prime_index = nr.choice(action_index, size = 1, p = probs)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]
    else: ## Don't probability weight for terminal state or non-greedy selecttion
        a_prime_index = nr.choice(action_index, size = 1)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]   
    return(a_prime_index, a_prime)


In [26]:
def SARSA_n(policy, episodes, n, start, goal, alpha = 0.1, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(N) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random and make sure is not terminal state
        s = start
        
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        t = 0 # Initialize the time step count
        T = float("inf")
        tau = 0
        reward_list = []
        while(tau != T - 1): # Episode ends where get to terminal state 
            if(t < T):
                ## The next state given the action
                s_prime, reward = state_values(s, a)
                reward_list.append(reward)  # append the reward to the list
                if(s_prime == goal): T = t + 1  # We reached the terminal state
                else:
                    # Select and store the next action using the policy
                    a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
                
                
            tau = t - n + 1 ## update the time step being updated
  
            if(tau >= 0): # Check if enough time steps to compute return
                ## Compute the return
                ## The formula for the first index in the loop is different from Sutton and Barto
                ## but seems to be correct at least for Python.
                G = 0.0 
                for i in range(tau, min(tau + n, T)):
                    G = G + gamma**(i-tau) * reward_list[i]   
                ## Deal with case of where we are not at the terminal state
                if(tau + n < T): G = G + gamma**n * Q[a_prime_index,s_prime]
                ## Finally, update Q
                Q[a_index,s] = Q[a_index,s] + alpha * (G - Q[a_index,s])
            
            ## Set action and state for next iteration
            if(s_prime != goal):
                s = s_prime   
                a = a_prime 
                a_index = a_prime_index
                
            
            ## increment t
            t = t + 1
    return(Q)

Q = SARSA_n(policy, episodes = 100, n = 4, start=12, goal = 15, alpha = 0.2, gamma = 0.9)

for i in range(4):
    print(np.round(Q[i,:].reshape((4,4)), 4))

[[ -52.3939  -45.807   -43.0941  -55.7021]
 [ -39.0505  -37.1128  -29.6329  -40.1967]
 [ -76.4233  -34.4339  -31.6016  -33.388 ]
 [-127.3763 -248.4214 -220.4813    0.    ]]
[[ -50.2946  -39.0144  -35.146   -37.3671]
 [ -72.2814  -60.5326  -49.4221  -22.5104]
 [-175.0387 -253.3013 -227.93     -3.4664]
 [-241.8034 -287.4953 -239.1136    0.    ]]
[[ -54.4098  -45.1193  -33.5755  -45.1677]
 [ -47.1009  -41.7     -37.7807  -45.8532]
 [-111.5445  -96.2269  -61.5217  -39.7004]
 [-199.2318 -280.7713 -212.0972    0.    ]]
[[ -39.4485  -32.244   -45.7881  -50.2894]
 [ -41.5811  -43.1499  -30.0908  -40.8857]
 [ -76.975   -52.5201  -32.968   -26.7341]
 [-310.719  -272.5116 -209.7483    0.    ]]


In [9]:
def SARSA_n_GPI(policy, n, cycles, episodes, start, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_n(policy, episodes, n, start, goal = goal, alpha = alpha, gamma = gamma, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

SARSA_N_Policy = SARSA_n_GPI(policy, n = 4, cycles = 5, episodes = 100, start=12, goal = 15, alpha = 0.2, epsilon = 0.1)
SARSA_N_Policy

{0: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 1: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 2: {'d': 0.7,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 3: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 4: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 5: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 6: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.7,
  'u': 0.10000000000000002},
 7: {'d': 0.10000000000000002,
  'l': 0.7,
  'r': 0.10000000000000002,
  'u': 0.10000000000000002},
 8: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},
 9: {'d': 0.10000000000000002,
  'l': 0.10000000000000002,
  'r': 0.10000000000000002,
  'u': 0.7},


In [28]:
np.array(TD_n(SARSA_N_Policy, episodes = 100, n = 2, goal = 15, alpha = 0.2, gamma = 0.9)).reshape((4,4))

array([[ -2.90262418,  -7.92573065,  -9.47528788, -11.44606244],
       [ -6.21809122,  -6.86119618,  -6.73531456,  -6.17882546],
       [ -4.94369556,  -5.85117879,  -6.29045708,  -1.0683498 ],
       [-25.34439358, -84.62539258, -73.020637  ,   0.        ]])

# 3. Create and execute agents using TD(0) for value estimation and SARSA(0) or Double Q-Learning (action value) control. You are welcome to try both algorithms if you have the time. 

# TD_0

In [46]:
import copy

def select_a_prime(s_prime, policy, action_index, greedy, goal):
    ## Randomly select an action prime 
    ## Make sure to handle the terminal state
    if(s_prime != goal and greedy): 
        probs = list(policy[s_prime].values())
        a_prime_index = nr.choice(action_index, size = 1, p = probs)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]
    else: ## Don't probability weight for terminal state or non-greedy selecttion
        a_prime_index = nr.choice(action_index, size = 1)[0]
        a_prime = list(policy[s_prime].keys())[a_prime_index]   
    return(a_prime_index, a_prime)


def SARSA_0(policy, episodes, start = 12, goal=15, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    """
    Function to perform SARSA(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(action_index),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample a state at random ensuring it is not terminal state
        s = start
        while(s == goal): s = nr.choice(states, size = 1)[0]
        ## Now choose action given policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = float('inf') # Value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state 
            ## The next state given the action
            s_prime, reward = state_values(s, a)
            a_prime_index, a_prime = select_a_prime(s_prime, current_policy, action_index, True, goal)
     
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)

Q = SARSA_0(policy, 1000, start=12, goal = 15, alpha = 0.2, epsilon = 0.1)

for i in range(4):
    print(np.round(Q[i,:].reshape((4,4)), 4))

[[ -68.0004  -72.7576  -72.6212  -61.3741]
 [ -61.2504  -61.9922  -61.6959  -54.9816]
 [ -66.4548  -74.2551  -90.6158  -53.6157]
 [ -85.9109 -168.0588 -147.6472    0.    ]]
[[ -69.7344  -66.7352  -80.8066  -54.9112]
 [-107.4486 -116.0852  -99.9945  -48.1122]
 [-146.1167 -231.7657 -234.7439   -1.    ]
 [-165.3534 -152.8106 -147.7773    0.    ]]
[[ -69.3782  -60.7914  -68.0871  -61.472 ]
 [ -72.98    -75.5752  -59.3754  -63.4789]
 [-109.6557 -113.8241 -144.3841 -108.1047]
 [-144.7914 -145.5922 -146.1138    0.    ]]
[[ -60.7668  -65.6096  -55.4125  -62.5771]
 [ -98.14    -65.6574  -54.1154  -59.6384]
 [-127.2995 -147.901   -51.9904  -42.7198]
 [-236.6698 -132.5086 -141.2563    0.    ]]


In [47]:
def SARSA_0_GPI(policy, cycles, episodes, start, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = SARSA_0(policy, episodes = episodes, goal = goal, alpha = alpha, epsilon = epsilon)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

SARSA_0_Policy = SARSA_0_GPI(policy, cycles = 10, episodes = 100, start = 12, goal = 15, alpha = 0.2, epsilon = 0.01)
SARSA_0_Policy

{0: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 1: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 2: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 3: {'d': 0.97,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.010000000000000009},
 4: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.97},
 5: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.97},
 6: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 7: {'d': 0.97,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.010000000000000009},
 8: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.97},
 9: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  '

In [48]:
def td_0_state_values(policy, n_samps, goal, alpha = 0.2, gamma = 0.9):
    """
    Function for TD(0) policy 
    """
    ## Initialize the state list and state values
    states = list(policy.keys())
    v = [0]*len(list(policy.keys()))
    action_index = list(range(len(list(policy[0].keys()))))
    for _ in range(n_samps):
        s = nr.choice(states, size =1)[0]
        probs = list(policy[s].values())
        if(s != goal):
            a = list(policy[s].keys())[nr.choice(action_index, size = 1, p = probs)[0]]
        else:
            a = list(policy[s].keys())[nr.choice(action_index, size = 1)[0]]
        transistion = state_values(s, a)
        v[s] = v[s] + alpha * (transistion[1] +  gamma * v[transistion[0]] - v[s])
    return(v)
    
nr.seed(345)    
np.round(np.array(td_0_state_values(policy, n_samps = 1000, goal = 15)).reshape((4,4)), 4)

array([[ -49.3663,  -37.2095,  -35.1425,  -43.6832],
       [ -58.4049,  -46.7909,  -37.9115,  -37.4714],
       [ -76.4688,  -89.4535, -101.7556,  -48.2905],
       [-182.4409, -139.6356, -135.9916,    0.    ]])

In [49]:
np.round(np.array(td_0_state_values(SARSA_0_Policy, n_samps = 10000, goal = 15)).reshape((4,4)),4)

array([[ -5.1416,  -4.7181,  -4.3166,  -3.5943],
       [ -5.6214,  -5.268 ,  -4.1534,  -2.2998],
       [ -6.1124,  -5.605 , -10.9227,  -1.008 ],
       [ -6.5549,  -7.0519,  -7.1386,   0.    ]])


# Q-Learning

In [29]:
def Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an intial state at random but make sure it is not goal
        s = nr.choice(states, size = 1)[0]
        while(s == goal): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Find the index or indices of maximum action values for s_prime
            ## Break any tie with multiple max values by random selection
            action_values = Q[:,s_prime]
            a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
            a_prime = possible_actions[a_prime_index]
            
            ## Lookup the reward 
            reward = rewards[s][a]
            
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)

Q = Q_learning_0(policy, neighbors, rewards, 1000, goal = 15)

for i in range(4):
    print(np.round(Q[i,:].reshape((4,4)), 4))

[[-14.1319 -13.4494 -13.0569 -12.3394]
 [ -5.1128  -4.6347  -4.0692  -3.4051]
 [ -4.6856  -4.0951  -3.439   -2.5382]
 [ -5.217   -5.6779  -5.6766   0.    ]]
[[  -4.6856   -4.0951   -3.439    -2.71  ]
 [  -5.1919   -4.6696   -4.0365   -1.9   ]
 [  -5.6662 -102.9805 -102.2683   -1.    ]
 [ -14.4412  -14.464   -14.6368    0.    ]]
[[-13.6146  -5.1042  -4.6748  -4.0485]
 [-13.536   -4.6066  -4.0885  -3.4024]
 [-13.9567  -5.1104  -4.6099  -3.8084]
 [-14.2643  -5.669   -5.6545   0.    ]]
[[ -4.6856  -4.0951  -3.439  -12.3394]
 [ -4.0951  -3.439   -2.71   -11.5275]
 [ -4.6856  -4.0951 -10.797   -1.8129]
 [-97.1075  -5.6757  -5.6881   0.    ]]


In [31]:
def Q_learning_0(policy, neighbors, rewards, episodes, goal, alpha = 0.2, gamma = 0.9):
    """
    Function to perform Q-learning(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an intial state at random but make sure it is not goal
        s = nr.choice(states, size = 1)[0]
        while(s == goal): s = nr.choice(states, size = 1)[0]
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Find the index or indices of maximum action values for s_prime
            ## Break any tie with multiple max values by random selection
            action_values = Q[:,s_prime]
            a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
            a_prime = possible_actions[a_prime_index]
            
            ## Lookup the reward 
            reward = rewards[s][a]
            
            ## Update the action values
            Q[a_index,s] = Q[a_index,s] + alpha * (reward + gamma * Q[a_prime_index,s_prime] - Q[a_index,s])
            
            ## Set action and state for next iteration
            a = a_prime
            a_index = a_prime_index
            s = s_prime

    return(Q)

Q = Q_learning_0(policy, neighbors, rewards, 1000, goal = 15)

for i in range(4):
    print(np.round(Q[i,:].reshape((4,4)), 4))

[[-13.7935 -12.8261 -13.0482 -12.2368]
 [ -5.2039  -4.6314  -4.0554  -3.4308]
 [ -4.6856  -4.0951  -3.439   -2.6909]
 [ -5.217   -5.6875  -5.6426   0.    ]]
[[  -4.6855   -4.0951   -3.439    -2.71  ]
 [  -5.1433   -4.6371   -4.0306   -1.9   ]
 [  -5.6735 -104.8564 -102.8856   -1.    ]
 [ -14.5281  -14.5337  -14.4174    0.    ]]
[[-13.9325  -5.1196  -4.6142  -4.0235]
 [-13.1094  -4.6432  -4.0662  -3.2863]
 [-14.1123  -5.0815  -4.6671  -3.9344]
 [-14.4771  -5.6826  -5.6502   0.    ]]
[[  -4.6854   -4.0951   -3.439   -12.2797]
 [  -4.0951   -3.439    -2.71    -11.4254]
 [  -4.6856   -4.0951  -10.6985   -1.8438]
 [-100.6702   -5.6842   -5.6562    0.    ]]


In [53]:
def double_Q_learning_0(policy, neighbors, rewards, episodes, start=12, goal=15, alpha = 0.2, gamma = 0.9):
    """
    Function to perform SARSA(0) control policy improvement.
    """
    ## Initialize the state list and action values
    states = list(policy.keys())
    n_states = len(states)
    
    ## Initialize possible actions and the action values
    possible_actions = list(rewards[0].keys())
    action_index = list(range(len(list(policy[0].keys()))))
    Q1 = np.zeros((len(possible_actions),len(states)))
    Q2 = np.zeros((len(possible_actions),len(states)))
    
    current_policy = copy.deepcopy(policy)
    
    for _ in range(episodes): # Loop over the episodes
        ## sample an intial state at random but make sure it is not goal
        s = start
        
        ## Now choose action following policy
        a_index, a = select_a_prime(s, current_policy, action_index, True, goal)
        
        s_prime = n_states + 1 # Dummy value of s_prime to start loop
        while(s_prime != goal): # Episode ends where get to terminal state   
            ## Get s_prime given s and a
            s_prime = neighbors[s][a]
            
            ## Update one or the other action values at random
            if(nr.uniform() <= 0.5):
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q1[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q1 
                Q1[a_index,s] = Q1[a_index,s] + alpha * (reward + gamma * Q2[a_prime_index,s_prime] - Q1[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime
            
            else:
                ## Find the index or indices of maximum action values for s_prime
                ## Break any tie with multiple max values by random selection
                action_values = Q2[:,s_prime]
                a_prime_index = nr.choice(np.where(action_values == max(action_values))[0], size = 1)[0]
                a_prime = possible_actions[a_prime_index]
                ## Lookup the reward 
                reward = rewards[s][a]
                ## Update Q2
                Q2[a_index,s] = Q2[a_index,s] + alpha * (reward + gamma * Q1[a_prime_index,s_prime] - Q2[a_index,s])
            
                ## Set action and state for next iteration
                a = a_prime
                a_index = a_prime_index
                s = s_prime

    return(Q1)

Q = double_Q_learning_0(policy, neighbors, rewards, 2000, start = 12, goal = 15)

for i in range(4):
    print(np.round(Q[i,:].reshape((4,4)), 4))

[[-7.9268 -5.4384 -6.3841 -4.27  ]
 [-5.5029 -4.3729 -3.6576 -2.9743]
 [-4.6856 -4.0951 -3.439  -1.0422]
 [-5.217  -5.6953  0.      0.    ]]
[[ -5.111   -3.7793  -4.2037  -3.3857]
 [ -4.2402  -6.2824  -4.376   -1.9   ]
 [ -6.4442 -20.3676 -20.      -1.    ]
 [-14.6953  -7.6117  -2.5284   0.    ]]
[[ -8.0466  -3.9433  -4.0889  -3.8386]
 [ -4.2885  -4.8989  -3.2606  -3.0143]
 [ -6.0157  -4.5969  -8.2419  -1.0679]
 [-14.6953  -5.6953  -0.4575   0.    ]]
[[  -5.0223   -3.7182   -3.5561   -4.4784]
 [  -4.0951   -3.439    -2.71     -4.95  ]
 [  -4.6856   -4.0951   -3.8016   -2.0732]
 [-105.1258   -5.6954    0.        0.    ]]


In [54]:
def double_Q_learning_0_GPI(policy, neighbors, reward, cycles, episodes, start, goal, alpha = 0.2, gamma = 0.9, epsilon = 0.1):
    ## iterate over GPI cycles
    current_policy = copy.deepcopy(policy)
    for _ in range(cycles):
        ## Evaluate policy with SARSA
        Q = double_Q_learning_0(policy, neighbors, rewards, episodes = episodes, start = start, goal = goal)
        
        for s in list(current_policy.keys()): # iterate over all states
            ## Find the index action with the largest Q values 
            ## May be more than one. 
            max_index = np.where(Q[:,s] == max(Q[:,s]))[0]
            
            ## Probabilities of transition
            ## Need to allow for further exploration so don't let any 
            ## transition probability be 0.
            ## Some gymnastics are required to ensure that the probabilities 
            ## over the transistions actual add to exactly 1.0
            neighbors_len = float(Q.shape[0])
            max_len = float(len(max_index))
            diff = round(neighbors_len - max_len,3)
            prob_for_policy = round(1.0/max_len,3)
            adjust = round((epsilon * (diff)), 3)
            prob_for_policy = prob_for_policy - adjust
            if(diff != 0.0):
                remainder = (1.0 - max_len * prob_for_policy)/diff
            else:
                remainder = epsilon
                                                 
            for i, key in enumerate(current_policy[s]): ## Update policy
                if(i in max_index): current_policy[s][key] = prob_for_policy
                else: current_policy[s][key] = remainder   
                    
    return(current_policy)                    
 

Double_Q_0_Policy = double_Q_learning_0_GPI(policy, neighbors, rewards, cycles = 10, episodes = 500, start = 12, goal = 15, alpha = 0.2, epsilon = 0.01)
Double_Q_0_Policy

{0: {'d': 0.97,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.010000000000000009},
 1: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 2: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 3: {'d': 0.97,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.010000000000000009},
 4: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 5: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 6: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.97,
  'u': 0.010000000000000009},
 7: {'d': 0.97,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.010000000000000009},
 8: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  'r': 0.010000000000000009,
  'u': 0.97},
 9: {'d': 0.010000000000000009,
  'l': 0.010000000000000009,
  '

In [55]:
np.round(np.array(td_0_state_values(Double_Q_0_Policy, n_samps = 10000, goal = 15)).reshape((4,4)), 4)

array([[-4.7606, -4.1778, -3.4851, -2.7388],
       [-4.1636, -3.4925, -2.7515, -1.9174],
       [-5.5394, -4.166 , -3.9765, -1.7194],
       [-9.5199, -8.2584, -7.9638,  0.    ]])