
## Creating a basic Environment and Agent

In [6]:
import random

In [1]:
class Environment:
  """
  An environment that gives the agent random rewards for a limited number of timesteps.
  """
  def __init__(self):
    """
    Initializes the internal state. In this case, the state is just a counter that limits the number
    of timesteps the agent is allowed to take to interact with the environment.
    """
    self.steps_left = 10
  def get_observation(self):
    """
    Returns the current environment's observation to the agent. This is usually implemented as some function
    of the internal state of the environment
    """
    return [0.0, 0.0, 0.0]
  def get_actions(self):
    """
    Allows the agent to query the set of actions it can execute. Normally, the set of actions the agent
    can take does not change over time, but some actions can be impossible in certain states
    (for example, not every move is possible in every position of tic-tac-toe).
    """
    return [0, 1]
  def is_done(self):
    """
    Signals the end of the episode to the agent. 
    """
    return self.steps_left == 0
  def action(self, action):
    """
    Central piece of the environment's functionality. It does two things: handles the agent's action and
    returns the reward for that action. In this example, the reward is random and the action is discarded.
    We also update the count of steps left and refuse to continue an episode that is already over.
    """
    if self.is_done():
      raise Exception("Game is Over")
    self.steps_left -= 1
    return random.random()


In [3]:
class Agent:
  def __init__(self):
    """
    Will keep the total reward accumulated by the agent during the episode
    """
    self.total_reward = 0.0
  def step(self, env):
    """
    Accepts the environment instance as an argument and allows the agent to
     - observe the environment
     - decide which action to take based on the observation
     - submit the action to the environment
     - get the reward for the action taken
    """
    current_obs = env.get_observation()
    actions = env.get_actions()
    reward = env.action(random.choice(actions))
    self.total_reward += reward
  

In [11]:
if __name__ == "__main__":
  env = Environment()
  agent = Agent()
  while not env.is_done():
    agent.step(env)
  print(f"Total reward: {agent.total_reward:.4f}")

Total reward: 5.7481
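
The total above changes from run to run because every reward is drawn uniformly from [0, 1). Over 10 steps the expected total is 10 × 0.5 = 5. The following is a minimal sketch that checks this empirically. The helper `run_episode` is hypothetical: it compresses the Environment/Agent loop above into one function, and the seeded `random.Random` instance is an assumption added only to make the run repeatable.

```python
import random


def run_episode(rng, steps=10):
    """Hypothetical stand-in for the Environment/Agent loop above:
    sums one uniform random reward per timestep."""
    total_reward = 0.0
    for _ in range(steps):
        total_reward += rng.random()  # reward in [0.0, 1.0); the action is ignored
    return total_reward


rng = random.Random(0)  # fixed seed so the experiment is repeatable
totals = [run_episode(rng) for _ in range(10_000)]
mean_total = sum(totals) / len(totals)
print(f"Mean total reward over {len(totals)} episodes: {mean_total:.2f}")
```

The printed mean should land close to 5, while individual episodes such as the one above can be noticeably higher or lower.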


## OpenAI Gym API

Gym provides a rich collection of environments for RL experiments.

The central class in the library is called Env (environment). Instances of this class expose several methods and fields that describe the environment.

At a high level, every environment provides the following:

- The set of actions allowed in the environment. Gym supports both discrete and continuous actions.
- The shape and boundaries of the observations that the environment provides.
- A method called "step" that executes an action and returns the current observation, the reward, and an indication of whether the episode is over.
- A method called "reset" that returns the environment to its initial state and obtains the first observation.
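
The four items above can be sketched without Gym itself. The class below, `CountdownEnv`, is a hypothetical toy and not part of the library: it recasts the basic Environment from the first section into the reset/step shape, with a plain list standing in for the action set and a four-value return from `step` (observation, reward, done flag, extra info) matching what the Gym version used in this notebook returns.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment (not part of Gym) with a reset/step interface."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.actions = [0, 1]          # set of allowed (discrete) actions
        self.observation_shape = (3,)  # every observation is a list of 3 floats
        self.steps_left = 0            # no episode is running until reset() is called

    def reset(self):
        """Return the environment to its initial state and give the first observation."""
        self.steps_left = self.max_steps
        return [0.0, 0.0, 0.0]

    def step(self, action):
        """Execute an action; return (observation, reward, done, info)."""
        if action not in self.actions:
            raise ValueError(f"invalid action: {action!r}")
        if self.steps_left <= 0:
            raise RuntimeError("episode is over; call reset() first")
        self.steps_left -= 1
        observation = [0.0, 0.0, 0.0]
        reward = random.random()       # random reward, as in the basic example
        done = self.steps_left == 0
        return observation, reward, done, {}


env = CountdownEnv()
obs = env.reset()
done = False
total_reward = 0.0
steps = 0
while not done:
    obs, reward, done, info = env.step(random.choice(env.actions))
    total_reward += reward
    steps += 1
print(f"{steps} steps, total reward {total_reward:.4f}")
```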

### Action space

An environment is not limited to a single action. It can accept several actions at once, such as pushing multiple buttons simultaneously or modifying both the heating and cooling setpoints. In such cases, Gym defines a special container class that allows several action spaces to be nested into one unified action.

### Observation space

Observations are provided to the agent at every timestep, in addition to the reward. They can be as simple as a few numbers or as complex as several multidimensional tensors holding images from multiple cameras.


The basic abstract class Space includes two useful methods. 

- sample() -> returns a random sample from the space
- contains(x) -> checks whether the argument x belongs to the space's domain

Both of these methods are abstract and are reimplemented in each of the Space subclasses.
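
The pattern of an abstract base with per-subclass implementations can be sketched in plain Python. The classes below are hypothetical stand-ins written for illustration, not Gym's own source; the `Discrete` stand-in mirrors the subclass of the same name described next.

```python
import random
from abc import ABC, abstractmethod


class Space(ABC):
    """Hypothetical stand-in for an abstract space with the two methods above."""

    @abstractmethod
    def sample(self):
        """Return a random element of the space."""

    @abstractmethod
    def contains(self, x):
        """Return True if x belongs to the space's domain."""


class Discrete(Space):
    """Stand-in for a discrete space: the integers 0 .. n-1."""

    def __init__(self, n):
        self.n = n

    def sample(self):
        return random.randrange(self.n)

    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


space = Discrete(3)
action = space.sample()
print(action, space.contains(action), space.contains(3))
```

Because `Space` declares both methods as abstract, instantiating it directly raises a `TypeError`; only subclasses that implement `sample()` and `contains()` can be created.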

- 'Discrete' subclass represents a mutually exclusive set of items numbered from 0 to n-1. Its only field is the number n, the count of items it describes.
- 'Box' subclass represents an n-dimensional tensor of rational numbers within the interval [low, high]. For example, it could describe an accelerator pedal with a single value between low = 0.0 and high = 1.0:

    Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)

    shape = (1,) denotes a one-dimensional tensor that holds a single value.

- 'Tuple' subclass allows us to combine several Space class instances. This enables us to create action and observation spaces of any complexity. For example, a car could have several controls that can be changed at every timestep. In this case, we can combine the action spaces as follows:

    Tuple(spaces=(Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32), Discrete(n=3),Discrete(n=2)))
    
    The Tuple space is rarely used.

There are other subclasses, but the preceding three are the most important. All the subclasses implement the sample() and contains() methods. The sample() method draws a random sample according to the Space class and its parameters. The contains() method verifies that the given argument complies with the Space parameters and is used to check an action's sanity.
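
How a composite space samples and validates an action can be sketched with simplified stand-ins. The classes below are hypothetical and written in plain Python; they only borrow the names Discrete, Box and Tuple, and the `Box` here is restricted to a flat list of floats (its `size` parameter is an assumption of this sketch, not Gym's signature).

```python
import random


class Discrete:
    """Stand-in: the integers 0 .. n-1."""
    def __init__(self, n):
        self.n = n
    def sample(self):
        return random.randrange(self.n)
    def contains(self, x):
        return isinstance(x, int) and 0 <= x < self.n


class Box:
    """Stand-in, 1-D only: `size` floats, each within [low, high]."""
    def __init__(self, low, high, size):
        self.low, self.high, self.size = low, high, size
    def sample(self):
        return [random.uniform(self.low, self.high) for _ in range(self.size)]
    def contains(self, x):
        return len(x) == self.size and all(self.low <= v <= self.high for v in x)


class Tuple:
    """Stand-in: combines several spaces into one."""
    def __init__(self, spaces):
        self.spaces = tuple(spaces)
    def sample(self):
        # One sample from each sub-space, in order
        return tuple(space.sample() for space in self.spaces)
    def contains(self, x):
        # Valid only if every part belongs to its matching sub-space
        return len(x) == len(self.spaces) and all(
            space.contains(part) for space, part in zip(self.spaces, x)
        )


# Three continuous controls plus two discrete switches, as in the car example
car_controls = Tuple((Box(-1.0, 1.0, 3), Discrete(3), Discrete(2)))
action = car_controls.sample()
print(action)
print(car_controls.contains(action))                   # True
print(car_controls.contains(([0.0, 0.0, 2.5], 1, 0)))  # False: 2.5 is out of range
```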

Every environment has two members of type Space: action_space and observation_space. This allows us to create generic code that could work with any environment. 
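
To illustrate what "generic code" means here, the sketch below defines a rollout function that touches only `reset()`, `step()` and `action_space.sample()`, then runs it on two different toy environments. `ChoiceSpace`, `FixedLengthEnv` and `random_rollout` are hypothetical names invented for this example; they are not part of Gym.

```python
import random


class ChoiceSpace:
    """Hypothetical minimal space: samples uniformly from a fixed list of items."""
    def __init__(self, items):
        self.items = list(items)
    def sample(self):
        return random.choice(self.items)
    def contains(self, x):
        return x in self.items


class FixedLengthEnv:
    """Hypothetical toy environment: pays 1.0 per step and ends after `length` steps."""
    def __init__(self, actions, length):
        self.action_space = ChoiceSpace(actions)
        self.observation_space = ChoiceSpace([0])
        self.length = length
        self.t = 0
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        if not self.action_space.contains(action):
            raise ValueError(f"invalid action: {action!r}")
        self.t += 1
        return 0, 1.0, self.t >= self.length, {}


def random_rollout(env):
    """Run one episode with random actions, relying only on the shared interface."""
    env.reset()
    total_reward, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        total_reward += reward
    return total_reward


# The same function works even though the two action sets differ in type and size
short_total = random_rollout(FixedLengthEnv(actions=[0, 1], length=5))
long_total = random_rollout(FixedLengthEnv(actions=["left", "right", "stay"], length=8))
print(short_total, long_total)  # 5.0 8.0
```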

### The environment

The environment is represented in Gym by the Env class. It has the following members. 

- action_space: A field of the Space class that specifies the actions allowed in the environment.
- observation_space: A field of the same Space class as action_space that specifies the observations provided by the environment.
- reset(): Resets the environment to its initial state and returns the initial observation vector.
- step(): Allows the agent to take an action and returns information about the outcome: the next observation, the reward, and the end-of-episode flag.

### Creating an environment

In [12]:
import gym 
e = gym.make('CartPole-v0')

The observations in this environment are 4 floating point numbers: the position of the stick's center of mass, its speed, its angle to the platform, and its angular speed.

The episode continues until the stick falls, so the goal is to move the platform in a way that keeps the stick from falling.

In [13]:
obs = e.reset()
obs

array([ 0.03792635, -0.04452565,  0.03030676, -0.02337004])

In [15]:
e.action_space, e.observation_space

(Discrete(2),
 Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32))

In [17]:
e.step(0)
# returns a tuple: (new_observation, reward, done_flag, additional_information)

(array([ 0.03223446, -0.43560346,  0.03541374,  0.58066172]), 1.0, False, {})
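
The four-value tuple above matches the Gym version used in this notebook. Gym 0.26 and later, and its successor Gymnasium, return five values from step(), splitting the done flag into separate `terminated` and `truncated` flags. The helper below, `unpack_step`, is a hypothetical sketch for code that has to cope with either form; it is demonstrated on hand-written tuples rather than a live environment.

```python
def unpack_step(result):
    """Hypothetical helper: normalize a step() result to (obs, reward, done, info)."""
    if len(result) == 5:
        # Newer API: the episode is over if it terminated OR was truncated
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    if len(result) == 4:
        # Older API: already in the (obs, reward, done, info) form
        return tuple(result)
    raise ValueError(f"unexpected step() result of length {len(result)}")


old_style = ([0.03, -0.43, 0.03, 0.58], 1.0, False, {})
new_style = ([0.03, -0.43, 0.03, 0.58], 1.0, False, True, {})
print(unpack_step(old_style)[2])  # False
print(unpack_step(new_style)[2])  # True: truncated counts as done
```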

In [21]:
e.action_space.sample()
# This could be useful when we are not sure which action to take

1

In [22]:
e.observation_space.sample()
# Not very useful

array([-2.5456860e+00,  1.8878092e+38, -3.9252144e-01, -2.3982668e+38],
      dtype=float32)

In [27]:
if __name__ == "__main__":
  env = gym.make("CartPole-v0")
  total_reward = 0.0
  total_steps = 0  # start at zero so the count matches the number of step() calls
  obs = env.reset()
  while True:
    # Sample a random action, ask the environment to execute it, and receive
    # the next observation, the reward, and the done flag. When the episode
    # is over, stop the loop and report the step count and accumulated reward.
    action = env.action_space.sample()
    obs, reward, done, _ = env.step(action)
    total_reward += reward
    total_steps += 1
    if done:
      break
  print(f"Episode done in {total_steps} steps. Total reward is {total_reward:.2f}")

Episode done in 16 steps. Total reward is 16.00
