# How to Create a Custom Environment
In this Jupyter Notebook, we will implement a simple game called `GridWorldEnv`, which consists of a two-dimensional square grid of fixed size. In each timestep, the agent can move vertically or horizontally between grid cells.

### Goal & Basic Information
The goal of the agent is to navigate to a target that is placed randomly on the grid at the beginning of the episode. Here is some basic information about the game:
-  Observations provide the location of the target and agent
-  There are 4 discrete actions in our environment, corresponding to the movements "right", "up", "left" and "down"
-  The episode ends (terminates) when the agent has navigated to the grid cell where the target is located
-  The agent is only rewarded when it reaches the target, i.e., the reward is one when the agent reaches the target and zero otherwise

### Environment `__init__`
The custom environment will inherit from `gymnasium.Env`, which defines the structure of the environment.

- One of the requirements for an environment is defining the observation and action spaces, which declare the general set of possible inputs (actions) and outputs (observations) of the environment

As outlined previously, our agent has 4 discrete actions, so we use a `Discrete(4)` space. An observation looks like `{"agent": array([1, 0]), "target": array([0, 3])}`.

Here the array elements represent the x and y positions of the agent or target. Alternative ways to represent the observation are a 2D grid with values marking the agent and target, or a 3D grid in which each "layer" contains only the agent or only the target information.

Hence, we declare the observation space as a `Dict` whose agent and target entries are each a `Box` holding an integer array of shape `(2,)`.
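To make the alternative 2D-grid representation mentioned above concrete, here is a minimal sketch that converts a dict observation into a grid. The helper name `obs_to_grid` and the cell values (1 for the agent, 2 for the target) are assumptions made for illustration; they are not part of the environment built in this notebook.

```python
import numpy as np


def obs_to_grid(obs: dict, size: int) -> np.ndarray:
    """Render a dict observation as a size x size integer grid (hypothetical helper)."""
    grid = np.zeros((size, size), dtype=np.int64)
    ax, ay = obs["agent"]
    tx, ty = obs["target"]
    # Index as grid[y, x] so that rows correspond to y and columns to x.
    grid[ty, tx] = 2  # target
    grid[ay, ax] = 1  # agent (written last, so it wins if both share a cell)
    return grid


obs = {"agent": np.array([1, 0]), "target": np.array([0, 3])}
grid = obs_to_grid(obs, size=5)
print(grid)
```

The dict form used in this notebook is more compact, since it stores four integers regardless of the grid size.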

In [20]:
# all the necessary imports
from typing import Optional
import numpy as np
import gymnasium as gym

In [7]:
class GridWorldEnv(gym.Env):
    def __init__(self, size: int = 5):

        # size of the square grid
        self.size = size

        # define the agent and target location; randomly chosen in 'reset' and updated in 'step'
        self._agent_location = np.array([-1, -1], dtype=np.int32)
        self._target_location = np.array([-1, -1], dtype=np.int32)

        # observations are dictionaries with the agent's and the target's location
        # each location is encoded as an element of {0, ..., `size` -1}^2
        self.observation_space = gym.spaces.Dict(
            {
                "agent": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
                "target": gym.spaces.Box(0, size-1, shape=(2,), dtype=int)
            }
        )

        # we have 4 actions, corresponding to "right", "up", "left" and "down"
        self.action_space = gym.spaces.Discrete(4)

        # dictionary maps the abstract actions to the directions on the grid
        self._action_to_direction = {
            0: np.array([1, 0]), # right
            1: np.array([0, 1]), # up
            2: np.array([-1, 0]), # left
            3: np.array([0, -1]), # down
        }

```
### Initializing Grid & Environment
self.size = size
.  .  .  .  .
.  .  .  .  .
.  .  .  .  .
.  .  .  .  .
.  .  .  .  .


### Agent & Target Initialization
self._agent_location = np.array([-1, -1], dtype=np.int32)
self._target_location = np.array([-1, -1], dtype=np.int32)
Grid is initialized, but the agent and target are not yet placed.


### Observation Space
self.observation_space = gym.spaces.Dict(
    {
        "agent": gym.spaces.Box(0, size-1, shape=(2,), dtype=int),
        "target": gym.spaces.Box(0, size-1, shape=(2,), dtype=int)
    }
)
The agent observes both its own position and the target's position as [x, y] coordinates in the grid.
{
    "agent": [1, 2],
    "target": [3, 3]
}

### Action Space
self.action_space = gym.spaces.Discrete(4)
The four actions are the integers 0 to 3. With the agent starting at [x, y] = [2, 2]:
0      # Action 0 (Right): Agent moves to [3, 2].
1      # Action 1 (Up):    Agent moves to [2, 3].
2      # Action 2 (Left):  Agent moves to [1, 2].
3      # Action 3 (Down):  Agent moves to [2, 1].

### Action to Direction Mapping
self._action_to_direction = {
    0: np.array([1, 0]),  # right
    1: np.array([0, 1]),  # up
    2: np.array([-1, 0]), # left
    3: np.array([0, -1]), # down
}

Action 0 (Right): Add [1, 0] to the current position.
[2, 2] + [1, 0] = [3, 2]
Action 1 (Up): Add [0, 1] to the current position.
[2, 2] + [0, 1] = [2, 3]
Action 2 (Left): Add [-1, 0] to the current position.
[2, 2] + [-1, 0] = [1, 2]
Action 3 (Down): Add [0, -1] to the current position.
[2, 2] + [0, -1] = [2, 1]

    Up (1)
      ↑
Left ← A → Right
      ↓
   Down (3)
```
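The direction arithmetic summarized above can be checked directly with NumPy. This is a small standalone sketch: the mapping is copied from `__init__`, and the starting position `[2, 2]` is just the example used in the summary.

```python
import numpy as np

# Same mapping as self._action_to_direction in __init__.
action_to_direction = {
    0: np.array([1, 0]),   # right
    1: np.array([0, 1]),   # up
    2: np.array([-1, 0]),  # left
    3: np.array([0, -1]),  # down
}

start = np.array([2, 2])

# Apply each action once from the same starting cell.
moves = {
    action: (start + delta).tolist()
    for action, delta in action_to_direction.items()
}
print(moves)  # {0: [3, 2], 1: [2, 3], 2: [1, 2], 3: [2, 1]}
```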

### Constructing Observations
Since we need to compute observations in both `Env.reset()` and `Env.step()`, it is often convenient to have a method `_get_obs` that translates the environment's state into an observation. This isn't mandatory and you can compute the observations in `Env.reset()` and `Env.step()` separately.

In [12]:
# finding the location of both the target and the agent
def _get_obs(self):
    return {"agent": self._agent_location, "target": self._target_location}

We also implement a similar method for the auxiliary information returned by `Env.reset()` and `Env.step()`. In this case, we would like to report the Manhattan distance between the agent and the target.

In [14]:
# finding the Manhattan distance between the agent and the target
def _get_info(self):
    return {
        "distance": np.linalg.norm(
            self._agent_location - self._target_location, ord=1
        )
    }
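As a quick sanity check, the L1 norm used in `_get_info` is the same as the sum of the absolute coordinate differences. The positions below are the example values used earlier in this notebook.

```python
import numpy as np

agent = np.array([1, 0])
target = np.array([0, 3])

# ord=1 on a 1-D vector gives the sum of absolute values (Manhattan distance).
l1_norm = np.linalg.norm(agent - target, ord=1)

# The same quantity computed by hand: |1 - 0| + |0 - 3|.
abs_sum = np.abs(agent - target).sum()

print(l1_norm, abs_sum)  # 4.0 4
```

Note that `np.linalg.norm` returns a float even for integer input, so the `"distance"` entry in `info` is a float.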

Oftentimes, info will also contain some data that is only available inside the `Env.step()` method. In that case, we would need to update the dictionary that is returned by `_get_info` in `Env.step()`.

### Reset Function
The purpose of `reset()` is to start a new episode. It takes two parameters, `seed` and `options`:
- The seed is used to initialize the random number generator to a deterministic state
- Options can be used to specify values used within reset

On the first line of `reset()`, you need to call `super().reset(seed=seed)`, which initializes the random number generator `np_random` used through the rest of `reset()`.

Within the custom environment, `reset()` needs to randomly choose the agent's and target's positions. The return type of `reset()` is a tuple of the initial observation and auxiliary information.

In [16]:
def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):

    # seeding self.np_random
    super().reset(seed=seed)

    # choosing the agent's location uniformly at random
    self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

    # we sample the target's location randomly until it doesn't coincide with agent's location
    self._target_location = self._agent_location
    while np.array_equal(self._target_location, self._agent_location):
        self._target_location = self.np_random.integers(0, self.size, size=2, dtype=int)

    # retrieving the observation and information
    observation = self._get_obs()
    info = self._get_info()

    # returning the observation and information
    return observation, info 
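The effect of seeding can be illustrated without the environment. The sketch below uses `np.random.default_rng` as a stand-in for `self.np_random` (both are NumPy `Generator` objects, but this is not claimed to reproduce the exact values Gymnasium would draw); the helper name `sample_positions` is hypothetical.

```python
import numpy as np


def sample_positions(seed: int, size: int = 5):
    """Hypothetical helper mirroring the sampling logic in reset()."""
    rng = np.random.default_rng(seed)  # stand-in for self.np_random
    agent = rng.integers(0, size, size=2, dtype=int)
    target = agent
    # Resample until the target lands on a different cell than the agent.
    while np.array_equal(target, agent):
        target = rng.integers(0, size, size=2, dtype=int)
    return agent, target


a1, t1 = sample_positions(seed=42)
a2, t2 = sample_positions(seed=42)

# The same seed yields the same agent and target positions.
print(np.array_equal(a1, a2), np.array_equal(t1, t2))  # True True
# The resampling loop guarantees the two positions differ.
print(np.array_equal(a1, t1))  # False
```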

### Step Function
The `step()` method usually contains most of the logic for the environment. It accepts an `action`, computes the state of the environment after applying that action, and returns a tuple of five values:
- the next observation
- the resulting reward
- whether the environment has terminated
- whether the environment has truncated
- auxiliary information

In [17]:
def step(self, action):

    # map the action (an element of {0, 1, 2, 3}) to the direction the agent walks
    direction = self._action_to_direction[action]

    # use `np.clip` to make sure agent doesn't leave the grid bounds
    self._agent_location = np.clip(self._agent_location + direction, 0, self.size - 1)

    # an episode is completed if and only if the agent has reached the target
    terminated = np.array_equal(self._agent_location, self._target_location)
    truncated = False
    reward = 1 if terminated else 0
    observation = self._get_obs()
    info = self._get_info()

    # returning all the details
    return observation, reward, terminated, truncated, info
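The role of `np.clip` in `step()` is easiest to see at the edge of the grid. In this standalone sketch, the grid size of 5 and the starting cell `[4, 0]` (right edge, bottom row) are example values chosen for illustration.

```python
import numpy as np

size = 5
position = np.array([4, 0])  # x = 4 is the right edge, y = 0 is the bottom row

directions = {
    "right": np.array([1, 0]),
    "up": np.array([0, 1]),
    "down": np.array([0, -1]),
}

results = {}
for name, delta in directions.items():
    # Moves that would leave the grid are clipped back onto the boundary,
    # so the agent simply stays where it is.
    results[name] = np.clip(position + delta, 0, size - 1).tolist()

print(results)  # {'right': [4, 0], 'up': [4, 1], 'down': [4, 0]}
```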

### Registering & Making the Environment
Environments are more commonly created with `gymnasium.make()` than instantiated directly, which requires registering them first. The environment ID consists of three components, two of which are optional:
- `gymnasium_env`, an optional namespace
- `GridWorld`, the mandatory name
- `v0`, an optional but recommended version

In [18]:
gym.register(
    id="gymnasium_env/GridWorld-v0",
    entry_point=GridWorldEnv,
)

In [22]:
env = gym.make("gymnasium_env/GridWorld-v0")
print(env)

<OrderEnforcing<PassiveEnvChecker<GridWorldEnv<gymnasium_env/GridWorld-v0>>>>


In [23]:
gym.make("gymnasium_env/GridWorld-v0", max_episode_steps=100)

<TimeLimit<OrderEnforcing<PassiveEnvChecker<GridWorldEnv<gymnasium_env/GridWorld-v0>>>>>

In [24]:
env.observation_space

Dict('agent': Box(0, 4, (2,), int64), 'target': Box(0, 4, (2,), int64))