
In [1]:
! pip install jdc

Collecting jdc
  Downloading jdc-0.0.9-py2.py3-none-any.whl (2.1 kB)
Installing collected packages: jdc
Successfully installed jdc-0.0.9


In [2]:
import jdc
# --
import numpy as np
# --
from rl_glue import RLGlue
# --
from Agent import BaseAgent 
from Environment import BaseEnvironment  
# --
from manager import Manager
# --
from itertools import product
# --
from tqdm import tqdm

# Environment

In [7]:
# Create empty EscapeRoomEnvironment class.

class EscapeRoomEnvironment(BaseEnvironment):
    def env_init(self, env_info={}):
        raise NotImplementedError

    def env_start(self):
        raise NotImplementedError

    def env_step(self, action):
        raise NotImplementedError

    def env_end(self, reward):
        raise NotImplementedError

    def env_cleanup(self):
        raise NotImplementedError


## env_init()

The first function we add to the environment is the initialization function, which is called once when an environment object is created. It stores the grid dimensions and the special locations (start, goal, obstacle and key) for easy use later, and records that the agent does not yet hold the key.

In [8]:
%%add_to EscapeRoomEnvironment

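def env_init(self, env_info={}):

    reward = None
    state = None
    termination = None
    self.reward_state_term = (reward, state, termination)

    self.grid_h = env_info.get("grid_height", 5)
    self.grid_w = env_info.get("grid_width", 5)

    # Locations are (row, column) pairs; row 0 is the top line of the room.
    # The agent starts in the middle of the bottom line.
    self.start_loc = (self.grid_h - 1, self.grid_w // 2)
    # The door (goal) is in the middle of the top line of the room.
    self.goal_loc = (0, self.grid_w // 2)
    # There is an obstacle directly in front of the door.
    self.obstacle_loc = (self.goal_loc[0] + 1, self.goal_loc[1])
    # The key is in the bottom-right corner.
    self.key_loc = (self.grid_h - 1, self.grid_w - 1)
    # The player does not have the key in the beginning.
    self.got_key = False

    # Cells the agent may not enter, used by env_step().
    # Assumption: these are the obstacle plus the ring of cells just
    # outside the grid (the walls).
    walls = {(r, c) for r in (-1, self.grid_h) for c in range(self.grid_w)}
    walls |= {(r, c) for r in range(self.grid_h) for c in (-1, self.grid_w)}
    self.forbidden_locs = walls | {self.obstacle_loc}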
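
To make the layout concrete, here is a minimal sketch that draws the room as text using the same location formulas as env_init(). The function name `room_layout` and the symbols (S start, G goal, X obstacle, K key) are illustrative choices, not part of the notebook.

```python
def room_layout(grid_h=5, grid_w=5):
    """Return the room as a list of strings, one per row (row 0 is the top)."""
    # Same formulas as env_init(); locations are (row, column) pairs.
    start = (grid_h - 1, grid_w // 2)
    goal = (0, grid_w // 2)
    obstacle = (goal[0] + 1, goal[1])
    key = (grid_h - 1, grid_w - 1)
    symbols = {start: "S", goal: "G", obstacle: "X", key: "K"}
    return ["".join(symbols.get((r, c), ".") for c in range(grid_w))
            for r in range(grid_h)]


layout = room_layout()
print("\n".join(layout))
```

With the default 5x5 grid, the door sits in the middle of the top row with the obstacle right below it, so the agent has to fetch the key in the bottom-right corner and then reach the door from the side.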

## env_start()

In env_start(), we set the agent location to the start location and return the corresponding state, the tuple (row, column, got_key), as the first state for the agent to act upon. We also set the reward and termination terms to 0 and False respectively, consistent with the notion that there is neither reward nor termination before the first action is taken.

In [9]:
%%add_to EscapeRoomEnvironment

def env_start(self):
    """The first method called when the episode starts, called before the
    agent starts.

    Returns:
        The first state from the environment.
    """
    reward = 0
    # agent_loc will hold the current location of the agent
    self.agent_loc = self.start_loc
    # Every episode starts without the key, even if env_cleanup() was not
    # called after the previous episode.
    self.got_key = False
    # state is the tuple (row, column, got_key).
    state = (*self.agent_loc, self.got_key)
    termination = False
    self.reward_state_term = (reward, state, termination)

    return self.reward_state_term[1]
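
The TD agent later in this notebook stores its policy and values in arrays whose first axis has length # States, while the environment emits states as (row, column, got_key) tuples. One possible way to bridge the two is to flatten the tuple into a single integer. The helpers `state_to_index` and `index_to_state` below are hypothetical and do not appear in the notebook; the ordering (key flag first, then row, then column) is an arbitrary assumption.

```python
def state_to_index(state, grid_h=5, grid_w=5):
    """Map a (row, col, got_key) tuple to an integer in [0, 2 * grid_h * grid_w)."""
    row, col, got_key = state
    # States without the key come first, states with the key second.
    return int(got_key) * grid_h * grid_w + row * grid_w + col


def index_to_state(index, grid_h=5, grid_w=5):
    """Inverse of state_to_index()."""
    got_key, cell = divmod(index, grid_h * grid_w)
    row, col = divmod(cell, grid_w)
    return (row, col, bool(got_key))


start_index = state_to_index((4, 2, False))
key_index = state_to_index((4, 4, True))
```

Under this encoding a 5x5 room has 50 states: 25 cells without the key and 25 with it.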

## env_step()


In [11]:
%%add_to EscapeRoomEnvironment


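def env_step(self, action):
    """A step taken by the environment.

    Args:
        action: The action taken by the agent (0: Up, 1: Left, 2: Down, 3: Right).

    Returns:
        (reward, state, terminal): the reward, the next state
            (row, column, got_key) and whether the episode ended.
    """
    if action == 0: # UP
        possible_next_loc = (self.agent_loc[0] - 1, self.agent_loc[1])
    elif action == 1: # LEFT
        possible_next_loc = (self.agent_loc[0], self.agent_loc[1] - 1)
    elif action == 2: # DOWN
        possible_next_loc = (self.agent_loc[0] + 1, self.agent_loc[1])
    elif action == 3: # RIGHT
        possible_next_loc = (self.agent_loc[0], self.agent_loc[1] + 1)
    else:
        raise Exception(str(action) + " not in recognized actions [0: Up, 1: Left, 2: Down, 3: Right]!")

    # Every step costs -1 unless it picks up the key or opens the door.
    reward = -1
    terminal = False

    # A move into a forbidden location leaves the agent where it is.
    if possible_next_loc not in self.forbidden_locs:
        self.agent_loc = possible_next_loc
        if self.agent_loc == self.goal_loc and self.got_key:
            # Reaching the door with the key ends the episode.
            reward = 10
            terminal = True
        elif self.agent_loc == self.key_loc and not self.got_key:
            # Picking up the key for the first time.
            self.got_key = True
            reward = 1

    state = (*self.agent_loc, self.got_key)
    self.reward_state_term = (reward, state, terminal)
    return self.reward_state_term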
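
The reward scheme is easier to check on a concrete episode. The sketch below re-implements the transition rule as a standalone function so it runs without rl_glue. The name `step` is hypothetical, and the sketch assumes that the forbidden locations are exactly the obstacle and the cells outside the default 5x5 grid.

```python
# Row/column offsets for 0: Up, 1: Left, 2: Down, 3: Right.
ACTIONS = {0: (-1, 0), 1: (0, -1), 2: (1, 0), 3: (0, 1)}


def step(loc, got_key, action, grid_h=5, grid_w=5):
    """Return (loc, got_key, reward, terminal) after one action."""
    goal = (0, grid_w // 2)
    obstacle = (1, grid_w // 2)
    key = (grid_h - 1, grid_w - 1)
    d_row, d_col = ACTIONS[action]
    nxt = (loc[0] + d_row, loc[1] + d_col)
    reward, terminal = -1, False
    in_grid = 0 <= nxt[0] < grid_h and 0 <= nxt[1] < grid_w
    if in_grid and nxt != obstacle:  # assumed meaning of "not forbidden"
        loc = nxt
        if loc == goal and got_key:
            reward, terminal = 10, True
        elif loc == key and not got_key:
            got_key, reward = True, 1
    return loc, got_key, reward, terminal


# Fetch the key (right, right), then walk around the obstacle to the door.
loc, got_key, total = (4, 2), False, 0
for action in [3, 3, 0, 0, 0, 0, 1, 1]:
    loc, got_key, reward, terminal = step(loc, got_key, action)
    total += reward
```

This eight-step path collects six step penalties of -1, the key bonus of +1 and the door reward of +10, for a return of 5.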

## env_cleanup()


In [12]:
%%add_to EscapeRoomEnvironment

def env_cleanup(self):
    """Cleanup done after the environment ends"""
    self.agent_loc = self.start_loc
    self.got_key = False

# Agent: TD learning


In [14]:
# Create empty TDAgent class.

class TDAgent(BaseAgent):
    def agent_init(self, agent_info={}):
        raise NotImplementedError
        
    def agent_start(self, state):
        raise NotImplementedError

    def agent_step(self, reward, state):
        raise NotImplementedError

    def agent_end(self, reward):
        raise NotImplementedError

    def agent_cleanup(self):        
        raise NotImplementedError
        
    def agent_message(self, message):
        raise NotImplementedError

## agent_init()

As we did with the environment, we first initialize the agent once when a TDAgent object is created. In this function, we create a random number generator, seeded with the seed provided in the agent_info dictionary to get reproducible results. We also set the policy, discount and step size from the agent_info dictionary. Finally, following the convention that the policy is an array of shape (# States, # Actions) holding the probability of each action in each state, we initialize a values array of shape (# States,) to zeros.

In [15]:
%%add_to TDAgent

def agent_init(self, agent_info={}):

    self.rand_generator = np.random.RandomState(agent_info.get("seed"))

    # The policy is given; recall that the goal is to accurately estimate its corresponding value function.
    self.policy = agent_info.get("policy")
    # Discount factor (gamma) to use in the updates.
    self.discount = agent_info.get("discount")
    # The learning rate or step size parameter (alpha) to use in updates.
    self.step_size = agent_info.get("step_size")

    self.values = np.zeros((self.policy.shape[0],))
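
As an illustration of the (# States, # Actions) convention, the sketch below builds a uniform random policy and a matching agent_info dictionary. The state count of 2 * grid_h * grid_w (every cell, with and without the key) and the discount and step-size values are assumptions chosen for the example, not settings taken from the notebook.

```python
import numpy as np

grid_h, grid_w, num_actions = 5, 5, 4
# Assumption: one state per cell and per value of got_key.
num_states = grid_h * grid_w * 2

# Each row is a probability distribution over the four actions.
policy = np.ones((num_states, num_actions)) / num_actions

# Illustrative hyperparameters; the keys match those read by agent_init().
agent_info = {"policy": policy, "discount": 0.9, "step_size": 0.1, "seed": 0}

# agent_init() creates one value estimate per policy row.
values = np.zeros((agent_info["policy"].shape[0],))
```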

## agent_start()

In agent_start(), we choose an action based on the initial state and policy we are evaluating. We also cache the state so that we can later update its value when we perform a Temporal Difference update. Finally, we return the action chosen so that the RL loop can continue and the environment can execute this action.

In [16]:
%%add_to TDAgent

def agent_start(self, state):

    action = self.rand_generator.choice(range(self.policy.shape[1]), p=self.policy[state])
    self.last_state = state
    return action
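
The sampling call can be tried in isolation. This small sketch uses a made-up two-state policy: the first row is uniform, the second always chooses action 2 (Down).

```python
import numpy as np

rand_generator = np.random.RandomState(0)

# Toy policy for illustration only: one uniform row, one deterministic row.
policy = np.array([[0.25, 0.25, 0.25, 0.25],
                   [0.0, 0.0, 1.0, 0.0]])

# Same call as in agent_start(), with an integer state index.
action_random = rand_generator.choice(range(policy.shape[1]), p=policy[0])
action_fixed = rand_generator.choice(range(policy.shape[1]), p=policy[1])
```

Note that the state used here is an integer row index; the call relies on `policy[state]` returning a single row of action probabilities.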

## agent_step()

In agent_step(), the agent must:

- Perform an update to improve the value estimate of the previously visited state, and
- Act based on the state provided by the environment.


In [None]:
%%add_to TDAgent

def agent_step(self, reward, state):

    target = reward + self.discount * self.values[state]
    self.values[self.last_state] = self.values[self.last_state] + self.step_size * (target - self.values[self.last_state])

    action = self.rand_generator.choice(range(self.policy.shape[1]), p=self.policy[state])
    self.last_state = state

    return action
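
A single update worked through by hand may help. The numbers below are made up for illustration; `td_update` is a hypothetical helper that applies the same formula as agent_step().

```python
def td_update(value, target, step_size):
    """Move the value estimate a fraction step_size towards the target."""
    return value + step_size * (target - value)


# Illustrative numbers: V(s) = 0, V(s') = 2, reward -1.
value_s, value_next = 0.0, 2.0
reward, discount, step_size = -1, 0.9, 0.1

target = reward + discount * value_next            # -1 + 0.9 * 2, about 0.8
new_value_s = td_update(value_s, target, step_size)  # 0 + 0.1 * 0.8, about 0.08
```

The estimate moves only a tenth of the way towards the target because the step size is 0.1.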

## agent_end() 

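agent_end() performs the TD update for the case where an action leads to a terminal state. The value of a terminal state is zero, so the target reduces to the final reward.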

In [None]:
%%add_to TDAgent

def agent_end(self, reward):

    target = reward
    self.values[self.last_state] = self.values[self.last_state] + self.step_size * (target - self.values[self.last_state])
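
Written out, with $\alpha$ the step size and $\gamma$ the discount, the update in agent_step() and its terminal special case in agent_end() are as follows. The notation ($V$, $S_t$, $R_{t+1}$) is the usual textbook one and is assumed here rather than taken from the notebook.

```latex
% Non-terminal transition (agent_step):
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]

% Terminal transition (agent_end), where V(S_{t+1}) = 0:
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} - V(S_t) \right]
```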


## agent_cleanup()



In [None]:
%%add_to TDAgent

def agent_cleanup(self):
    self.last_state = None

## agent_message()


In [None]:
%%add_to TDAgent

def agent_message(self, message):
    """A function used to pass information from the agent to the experiment.
    Args:
        message: The message passed to the agent.
    Returns:
        The response (or answer) to the message.
    """
    if message == "get_values":
        return self.values
    else:
        raise Exception("TDAgent.agent_message(): Message not understood!")
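
Once an experiment has run, the array returned for the "get_values" message can be reshaped for inspection. The sketch below uses a stand-in array instead of a trained agent and assumes a flat state index of got_key * grid_h * grid_w + row * grid_w + col, which is one possible encoding rather than something the notebook defines.

```python
import numpy as np

grid_h, grid_w = 5, 5

# Stand-in for agent.agent_message("get_values"); here value i is just i.
values = np.arange(2 * grid_h * grid_w, dtype=float)

# Under the assumed encoding, the first half holds the states without
# the key and the second half the states with it.
without_key, with_key = values.reshape(2, grid_h, grid_w)

print(without_key)
print(with_key)
```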