# Temporal difference prediction and control

In this notebook, you will implement the temporal difference approaches to prediction and control described in [Sutton and Barto's book, Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html). We will use the grid ```World``` class from the previous lectures.

### Install dependencies

In [8]:
! pip install numpy pandas



### Imports

In [9]:
import numpy as np
import random
import sys          # We use sys to get the max value of a float
import pandas as pd # We only use pandas for displaying tables nicely
pd.options.display.float_format = '{:,.3f}'.format

### ```World``` class and globals

The ```World``` is a grid represented as a two-dimensional array of characters where each character can represent free space, an obstacle, or a terminal. Each non-obstacle cell is associated with a reward that an agent gets for moving to that cell (can be 0). The size of the world is _width_ $\times$ _height_ characters.

A _state_ is a tuple $(x,y)$.

An empty world is created in the ```__init__``` method. Obstacles, rewards and terminals can then be added with ```add_obstacle```, ```add_reward``` and ```add_terminal```.

To calculate the next state of an agent (that is, an agent is in some state $s = (x,y)$ and performs an action $a$), call ```get_next_state()```.

In [10]:
# Globals:
ACTIONS = ("up", "down", "left", "right") 

# Rewards, terminals and obstacles are characters:
REWARDS = {" ": 0, ".": 0.1, "+": 10, "-": -10}
TERMINALS = ("+", "-") # Note a terminal should also have a reward assigned
OBSTACLES = ("#",) # The trailing comma makes this a tuple rather than a plain string

# Discount factor
gamma = 1

# The probability of a random move:
rand_move_probability = 0

class World:  
  def __init__(self, width, height):
    self.width = width
    self.height = height
    # Create an empty world where the agent can move to all cells
    self.grid = np.full((width, height), ' ', dtype='U1')
  
  def add_obstacle(self, start_x, start_y, end_x=None, end_y=None):
    """
    Create an obstacle in either a single cell or rectangle.
    """
    if end_x is None: end_x = start_x
    if end_y is None: end_y = start_y
    
    self.grid[start_x:end_x + 1, start_y:end_y + 1] = OBSTACLES[0]

  def add_reward(self, x, y, reward):
    assert reward in REWARDS, f"{reward} not in {REWARDS}"
    self.grid[x, y] = reward

  def add_terminal(self, x, y, terminal):
    assert terminal in TERMINALS, f"{terminal} not in {TERMINALS}"
    self.grid[x, y] = terminal

  def is_obstacle(self, x, y):
    if x < 0 or x >= self.width or y < 0 or y >= self.height:
      return True
    else:
      return self.grid[x ,y] in OBSTACLES 

  def is_terminal(self, x, y):
    return self.grid[x ,y] in TERMINALS

  def get_reward(self, x, y):
    """ 
    Return the reward associated with a given location
    """ 
    return REWARDS[self.grid[x, y]]

  def get_next_state(self, current_state, action):
    """
    Get the next state given a current state and an action. The outcome can be
    stochastic, where rand_move_probability determines the probability of
    ignoring the action and performing a random move.
    """    
    assert action in ACTIONS, f"Unknown action {action}, must be one of {ACTIONS}"
    
    x, y = current_state 
    
    # If our current state is a terminal, there is no next state
    if self.grid[x, y] in TERMINALS:
      return None

    # Check if a random action should be performed:
    if np.random.rand() < rand_move_probability:
      action = np.random.choice(ACTIONS)

    if action == "up":      y -= 1
    elif action == "down":  y += 1
    elif action == "left":  x -= 1
    elif action == "right": x += 1

    elif action == "rightd": x += 1; y += 1
    elif action == "rightu": x += 1; y -= 1
    elif action == "leftd": x -= 1; y += 1
    elif action == "leftu": x -= 1; y -= 1

    # If the next state is an obstacle, stay in the current state
    return (x, y) if not self.is_obstacle(x, y) else current_state


## A simple world and a simple policy

In [11]:
world = World(2, 3)

# Since we only focus on episodic tasks, we must have a terminal state that the 
# agent eventually reaches
world.add_terminal(1, 2, "+")

def equiprobable_random_policy(x, y):
  return { k:1/len(ACTIONS) for k in ACTIONS }

print(world.grid.T)

[[' ' ' ']
 [' ' ' ']
 [' ' '+']]


## Exercise: TD prediction

You should implement TD prediction for estimating $V \approx v_\pi$. See page 120 of [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html).


To implement TD prediction, the agent has to interact with the world for a certain number of episodes. However, unlike in the Monte Carlo case, we do not rely on complete sample runs, but instead update estimates (for prediction and control) and the policy (for control only) each time step in an episode.
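As a worked illustration of the per-step update described above, here is a minimal sketch of a single TD(0) step, $V(S) \leftarrow V(S) + \alpha [R + \gamma V(S') - V(S)]$. The helper name `td0_update` and the demo table `V_demo` are hypothetical and not part of the notebook; the numbers assume a move from $(1,1)$ into the terminal at $(1,2)$ with reward 10.

```python
import numpy as np

def td0_update(V, state, next_state, reward, alpha, gamma):
    """Apply one TD(0) update to V in place and return the TD error."""
    # TD error: bootstrapped target minus the current estimate
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

# Hypothetical 2x3 value table, indexed [x, y] like the notebook's V
V_demo = np.zeros((2, 3))
delta = td0_update(V_demo, state=(1, 1), next_state=(1, 2),
                   reward=10.0, alpha=0.05, gamma=0.9)
print(delta, V_demo[1, 1])
```

With all estimates initialised to zero, the TD error equals the reward, so the estimate for $(1,1)$ moves a fraction $\alpha$ of the way towards 10.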


Below, you can see the code for running an episode, with a TODO where you have to add your code for prediction. Also experiment with the parameters ```alpha``` and ```EPISODES```: you will typically need many more than 10 episodes for an agent to learn anything.

In [12]:
# Global variable to keep track of current estimates
V = np.full((world.width, world.height), 0.0) # TODO

# Our step size / learning rate 
alpha = 0.05 # Lower this to get closer to the true value

# Discount factor
gamma = 0.9

# Episodes to run 
EPISODES = 1000

def TD_prediction_run_episode(world, policy, start_state):
    current_state = start_state
    while not world.is_terminal(*current_state):
        # Get the possible actions and their probabilities that our policy says 
        # that the agent should perform in the current state: 
        possible_actions = policy(*current_state)

        # Pick a weighted random action: 
        action = random.choices(population=list(possible_actions.keys()), 
                                weights=possible_actions.values(), k=1)  
        
        # Get the next state from the world
        next_state = world.get_next_state(current_state, action[0])
        
        # Get the reward for performing the action
        reward = world.get_reward(*next_state)

        ###TODO: =============================================================
        ###TODO: Substitute the next line of code with your own
        ###TODO: =============================================================
        V[current_state] = V[current_state] + alpha * (reward + gamma * V[next_state] - V[current_state])

        # print(f"Current state (S) = {current_state}, next_state S' = {next_state}, reward = {reward}")

        # Move the agent to the new state
        current_state = next_state


for episode in range(EPISODES):
    # print(f"Episode {episode + 1 }/{EPISODES}:")
    TD_prediction_run_episode(world, equiprobable_random_policy, (0, 0))


display(pd.DataFrame(V.T))

      0     1
0 3.400 3.820
1 4.481 5.384
2 6.635 0.000
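
To judge how close a TD estimate is to the true $v_\pi$, one option is to solve the Bellman equations for the equiprobable random policy directly. The sketch below is not part of the exercise and rebuilds the dynamics independently of the ```World``` class, assuming a 2 $\times$ 3 grid, a terminal worth 10 at $(1,2)$, zero reward elsewhere, deterministic moves, bumping into a wall leaving the agent in place, and $\gamma = 0.9$. All names (`step`, `exact_values`, `v_pi`) are hypothetical.

```python
import numpy as np

# Assumed setup mirroring the simple world used above
WIDTH, HEIGHT = 2, 3
TERMINAL = (1, 2)
GAMMA = 0.9
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, move):
    """Deterministic move; leaving the grid keeps the agent in place."""
    x, y = state[0] + move[0], state[1] + move[1]
    inside = 0 <= x < WIDTH and 0 <= y < HEIGHT
    return (x, y) if inside else state

def exact_values():
    """Solve v = r + gamma * P v for the equiprobable random policy."""
    states = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)]
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))  # state-to-state transition probabilities
    r = np.zeros(n)       # expected immediate reward per state
    for s in states:
        if s == TERMINAL:
            continue  # no outgoing transitions, so its value stays 0
        for move in MOVES.values():
            s_next = step(s, move)
            P[index[s], index[s_next]] += 1 / len(MOVES)
            r[index[s]] += (10.0 if s_next == TERMINAL else 0.0) / len(MOVES)
    v = np.linalg.solve(np.eye(n) - GAMMA * P, r)
    return v.reshape(WIDTH, HEIGHT)  # indexed [x, y]

v_pi = exact_values()
print(np.round(v_pi.T, 3))  # transposed so rows are y, like the table above
```

Comparing this array with the learned table for different values of ```alpha``` and ```EPISODES``` gives a rough sense of how the step size trades off speed against accuracy.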


## Exercise: SARSA

Implement SARSA with an $\epsilon$-greedy policy and test it on different worlds. See page 130 of [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/the-book-2nd.html). Make sure that it is easy to show a learnt policy (most probable action in each state).
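
One possible way to show a learnt policy is to take the highest-valued action in every state of the Q-table. The sketch below assumes a Q-table of shape (width, height, number of actions), indexed like the grid; the helper `greedy_policy_grid` and the demo values are hypothetical.

```python
import numpy as np

ACTIONS = ("up", "down", "left", "right")

def greedy_policy_grid(Q, actions=ACTIONS):
    """Return an array of action names, indexed [x, y] like Q."""
    best = np.argmax(Q, axis=-1)    # index of the best action per state
    return np.array(actions)[best]  # map indices to action names

# Hypothetical Q-table for a 2x3 world with two non-zero entries
Q_demo = np.zeros((2, 3, len(ACTIONS)))
Q_demo[0, 0, ACTIONS.index("right")] = 1.0
Q_demo[1, 1, ACTIONS.index("down")] = 2.0

policy = greedy_policy_grid(Q_demo)
print(policy.T)  # transposed so rows are y and columns are x
```

Note that `np.argmax` breaks ties by picking the first action, so states that were never visited show up as `up`.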


In [13]:
# TODO: Implement your code here -- you need a Q-table to keep track of action 
#       value estimates and a policy-function that returns an epsilon greedy 
#       policy based on your estimates. 

epsilon = 0.1
Q = np.full((world.width, world.height, 4), 0.0)
alpha = 0.05


def e_greedy(x,y):
    if np.random.rand() < epsilon:
        return np.random.choice(ACTIONS)
        #return { k:1/len(ACTIONS) for k in ACTIONS }
    else:
        return ACTIONS[np.argmax(Q[(x,y)])]

def sarsa(world, start_state):
    current_state = start_state
    current_action = e_greedy(*current_state)


    while not world.is_terminal(*current_state):

        # Get the next state from the world
        next_state = world.get_next_state(current_state, current_action)
        
        # Get the reward for performing the action
        reward = world.get_reward(*next_state)
        
        next_action = e_greedy(*next_state)

        x, y = current_state
        nx, ny = next_state

        action_index = ACTIONS.index(current_action)
        next_action_index = ACTIONS.index(next_action)
        Q[x,y, action_index] = Q[x,y, action_index] + alpha * (reward + gamma * Q[nx,ny, next_action_index] - Q[x,y, action_index])

        current_state = next_state
        current_action = next_action


for episode in range(EPISODES):
    # print(f"Episode {episode + 1 }/{EPISODES}:")
    sarsa(world, (0, 0))

# Print the learned Q-values for every state
print("Q-values:")
for x in range(world.width):
    for y in range(world.height):
        print("State:", f"({x},{y})")
        for i in range(len(ACTIONS)):
            print("        ",ACTIONS[i],":",round(Q[(x,y)][i],2))
        print("\n")



Q-values:
State: (0,0)
         up : 4.25
         down : 2.2
         left : 4.63
         right : 7.47


State: (0,1)
         up : 5.55
         down : 0.01
         left : 0.2
         right : 0.15


State: (0,2)
         up : 0.4
         down : 0.0
         left : 0.0
         right : 0.0


State: (1,0)
         up : 6.2
         down : 8.87
         left : 5.17
         right : 4.98


State: (1,1)
         up : 5.67
         down : 10.0
         left : 2.19
         right : 6.05


State: (1,2)
         up : 0.0
         down : 0.0
         left : 0.0
         right : 0.0




## Exercise: Windy Gridworld

Implement the Windy Gridworld (Example 6.5 on page 130 in the book) and test your SARSA implementation on the Windy Gridworld, first with the four actions (```up, down, left, right```) that move the agent in the cardinal directions, and then with King's moves as described in Exercise 6.9. How long does it take to learn a good policy for different values of $\alpha$ and $\epsilon$?
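
The wind dynamics can be sketched as a single transition function. This is a minimal illustration rather than the reference solution: it assumes the notebook's convention that "up" decreases $y$, applies the wind of the column the agent is leaving, and clamps each coordinate to the grid (which differs slightly from ```get_next_state```, where a blocked move leaves the agent in place). The names `WIND`, `MOVES` and `windy_step` are hypothetical.

```python
# Upward wind strength per column, as in Example 6.5
WIND = (0, 0, 0, 1, 1, 1, 2, 2, 1, 0)
WIDTH, HEIGHT = 10, 7

MOVES = {
    "up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0),
    # King's moves (Exercise 6.9)
    "rightu": (1, -1), "rightd": (1, 1), "leftu": (-1, -1), "leftd": (-1, 1),
}

def windy_step(state, action):
    """Move, push the agent up by the wind of its current column, then clamp."""
    x, y = state
    dx, dy = MOVES[action]
    new_x = min(max(x + dx, 0), WIDTH - 1)
    new_y = min(max(y + dy - WIND[x], 0), HEIGHT - 1)
    return new_x, new_y

print(windy_step((0, 3), "right"))  # no wind in column 0
print(windy_step((6, 3), "right"))  # wind of strength 2 in column 6
```

Keeping the wind inside the transition function means the same SARSA loop can be reused unchanged for both the four-action and the King's-moves variants.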

In [14]:
### TODO: Implement and test SARSA, first on Windy Gridworld with four actions 
###       and then with King's moves


wind_strength = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
wind_world = World(10, 7)

# Since we only focus on episodic tasks, we must have a terminal state that the 
# agent eventually reaches
wind_world.add_terminal(7, 3, "+")

# Every step onto a free cell now costs -1, so shorter paths are preferred
REWARDS = {" ": -1, ".": 0.1, "+": 10, "-": -10}

# print(wind_world.grid.T)

epsilon = 0.1
alpha = 0.5
Q = np.full((wind_world.width, wind_world.height, 4), 0.0)
ACTIONS = ("up", "down", "left", "right")


def windsarsa(world, start_state):
    current_state = start_state
    current_action = e_greedy(*current_state)


    while not world.is_terminal(*current_state):

        # Get the next state from the world
        next_state = world.get_next_state(current_state, current_action)

        # Apply the upward wind of the column the agent is leaving ("up" means
        # decreasing y), then clamp y so the agent stays inside the grid
        shifted_y = next_state[1] - wind_strength[current_state[0]]
        next_state = (next_state[0], min(max(shifted_y, 0), world.height - 1))
        
        # Get the reward for performing the action
        reward = world.get_reward(*next_state)
        
        next_action = e_greedy(*next_state)



        x, y = current_state
        nx, ny = next_state

        action_index = ACTIONS.index(current_action)
        next_action_index = ACTIONS.index(next_action)
        Q[x,y, action_index] = Q[x,y, action_index] + alpha * (reward + gamma * Q[nx,ny, next_action_index] - Q[x,y, action_index])

        current_state = next_state
        current_action = next_action

EPISODES = 10000

for episode in range(EPISODES):
    # print(f"Episode {episode + 1 }/{EPISODES}:")
    windsarsa(wind_world, (0, 3))

# print(Q)

highest_action = np.full((wind_world.width, wind_world.height), '     ')
for x in range(wind_world.width):
    for y in range(wind_world.height):
        highest_action[x, y] = ACTIONS[np.argmax(Q[x, y])]
highest_action[7, 3] = "goal"
print(pd.DataFrame(highest_action.T))
print("|||||||||||||||||||||||KINGS MOVES|||||||||||||||||||")
###### KINGS MOVES
epsilon = 0.5
alpha = 0.05
Q = np.full((wind_world.width, wind_world.height, 8), 0.0)
ACTIONS = ("up", "down", "left", "right", "rightu", "rightd", "leftu", "leftd") 
EPISODES = 10000

for episode in range(EPISODES):
    # print(f"Episode {episode + 1 }/{EPISODES}:")
    windsarsa(wind_world, (0, 3))

highest_action_king = np.full((wind_world.width, wind_world.height), '      ')
for x in range(wind_world.width):
    for y in range(wind_world.height):
        highest_action_king[x, y] = ACTIONS[np.argmax(Q[x, y])]
highest_action_king[7, 3] = "goal"
print(pd.DataFrame(highest_action_king.T))


       0      1      2      3      4      5      6      7      8     9
0  right  right  right  right  right  right  right  right  right  down
1  right     up  right  right  right     up  right  right  right  down
2     up  right  right  right     up     up  right  right   down  down
3  right  right  right  right  right  right  right   goal  right  down
4  right   down  right  right  right  right     up   down   left  left
5  right  right  right  right  right     up     up   down  right  down
6  right     up  right  right     up     up     up     up     up  left
|||||||||||||||||||||||KINGS MOVES|||||||||||||||||||
        0       1       2       3       4       5       6       7       8  \
0  rightd      up   leftd   right   right  rightd   right  rightd  rightd   
1   leftu  rightd  rightd  rightd   right   right   right   right  rightd   
2  rightd  rightd   leftd  rightd   right  rightu  rightd  rightd  rightd   
3  rightd  rightd  rightd   leftd  rightd  rightd  rightd    goal   le