# Task Overview

In this exercise, we will create a basic "Grid World" environment that mimics the interface of OpenAI Gym environments. Implementing this simplified environment from scratch will help you understand the dynamics and challenges of reinforcement learning (RL).

## Environment Attributes and Methods

**Environment Attributes**
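- **size**: The side length of the square grid (default 4, giving a 4x4 grid).
- **start_position**: The starting position of the agent, `(0, 0)`. Positions are `(row, column)` pairs, so `render` draws it in the top-left corner.
- **goal_position**: The goal position the agent aims to reach, `(size - 1, size - 1)`, which `render` draws in the bottom-right corner.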
- **state**: The current position of the agent on the grid.
- **max_episode_length**: The maximum number of steps the agent can take in an episode.
- **current_step**: The current step number in the episode.

**Existing Methods**

- **`__init__`**: Initializes the environment with the given size and maximum episode length.
  - Arguments:
    - size: the size of the grid.
    - max_episode_length: the maximum number of steps the agent can take in an episode.

- **Render**: Displays the current state of the environment as a grid, marking the agent with `A` and the goal with `G`.

**Methods to Implement**

- **Reset**: Reset the environment to the initial state.  
  - Returns: The initial state of the environment.

- **Step**: Take an action in the environment.  
  - Arguments:  
    - action: The action to take in the environment, one of `"up"`, `"down"`, `"left"`, or `"right"`.
  - Features:
    - If the new state is equal to the goal position, the reward is 10.
    - Otherwise, if the number of steps has reached the maximum episode length, the reward is -10.
    - In all other cases, the reward is -1.
  - Returns:  
    - state: The new state of the environment.
    - reward: The reward for the action taken.
    - done: A boolean indicating if the episode has ended.
    - info: Additional information.
- **Sample Action**: Samples a random action from the action space.  
  - Returns: A random action.  

In [None]:
import numpy as np

class SimpleGridWorld:
    def __init__(self, size=4, max_episode_length=10):
        self.size = size
        self.start_position = (0, 0)
        self.goal_position = (size - 1, size - 1)
        self.state = self.start_position
        self.max_episode_length = max_episode_length
        self.current_step = 0
    
    def reset(self):
        self.state = self.start_position
        self.current_step = 0
        return self.state
    
    def render(self):
        for i in range(self.size):
            for j in range(self.size):
                if (i, j) == self.state:
                    print("A ", end="")
                elif (i, j) == self.goal_position:
                    print("G ", end="")
                else:
                    print(". ", end="")
            print()
        print()
    
    def step(self, action):
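        # Guard: if step() is called after the episode has already timed out,
        # keep returning the terminal penalty without moving the agent.
        if self.current_step >= self.max_episode_length:
            return self.state, -10, True, {}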
        
        self.current_step += 1
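        # The state is (row, column): "up"/"down" change the row, "left"/"right" the column.
        # A move that would leave the grid leaves the state unchanged.
        x, y = self.state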
        if action == "up" and x > 0:
            self.state = (x - 1, y)
        elif action == "down" and x < self.size - 1:
            self.state = (x + 1, y)
        elif action == "left" and y > 0:
            self.state = (x, y - 1)
        elif action == "right" and y < self.size - 1:
            self.state = (x, y + 1)
        
        if self.state == self.goal_position:
            reward = 10
        elif self.current_step >= self.max_episode_length:
            reward = -10
        else:
            reward = -1
        
        done = self.state == self.goal_position or self.current_step >= self.max_episode_length
        return self.state, reward, done, {}
    
    def sample_action(self):
        return ["up", "down", "left", "right"][np.random.randint(4)]

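To check the reward rules by hand, it can help to walk the agent along a fixed shortest path rather than sampling random actions. The sketch below assumes a condensed stand-in for `SimpleGridWorld`, called `MiniGridWorld` here (a hypothetical name, introduced only so the snippet runs on its own); its movement and reward logic is meant to mirror the class above.

```python
class MiniGridWorld:
    """Condensed stand-in for SimpleGridWorld (hypothetical name, illustration only)."""

    # (row, column) offsets for each action
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4, max_episode_length=10):
        self.size = size
        self.goal_position = (size - 1, size - 1)
        self.max_episode_length = max_episode_length
        self.reset()

    def reset(self):
        self.state = (0, 0)
        self.current_step = 0
        return self.state

    def step(self, action):
        if self.current_step >= self.max_episode_length:
            return self.state, -10, True, {}
        self.current_step += 1
        # Unknown actions map to a zero offset, so they leave the state unchanged.
        d_row, d_col = self.MOVES.get(action, (0, 0))
        row, col = self.state[0] + d_row, self.state[1] + d_col
        # Moves that would leave the grid keep the agent in place.
        if 0 <= row < self.size and 0 <= col < self.size:
            self.state = (row, col)
        reached_goal = self.state == self.goal_position
        timed_out = self.current_step >= self.max_episode_length
        reward = 10 if reached_goal else (-10 if timed_out else -1)
        return self.state, reward, reached_goal or timed_out, {}


env = MiniGridWorld(size=4, max_episode_length=10)
env.reset()
rewards = []
# Three moves right along the top row, then three moves down the last column.
for action in ["right", "right", "right", "down", "down", "down"]:
    state, reward, done, _ = env.step(action)
    rewards.append(reward)

print(rewards)       # [-1, -1, -1, -1, -1, 10]
print(sum(rewards))  # 5
print(state, done)   # (3, 3) True
```

On a 4x4 grid the shortest path takes six steps, so five steps cost -1 each and the last one earns +10. With `max_episode_length=10`, a random policy has little slack to find the goal before the -10 timeout penalty applies.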

## Episode and Sampling

An episode in RL is a sequence of steps that starts from the initial state and ends when the agent reaches the goal state or the maximum episode length is reached. In each step, the agent takes an action in the environment, receives a reward, and transitions to a new state. The agent continues this process until the episode ends.

Here, we instantiate an environment as defined above and sample a trajectory by taking random actions at each step. We will visualize the agent's movement in the grid world and observe the rewards received during the episode.

In [None]:
env = SimpleGridWorld(size=4, max_episode_length=10)
state = env.reset()
done = False
trajectory = []  # Initialize an empty list to store the trajectory

while not done:
    action = env.sample_action()
    next_state, reward, done, info = env.step(action)
    
    # Save the step information in a dictionary format
    step_data = {
        "state": state,
        "action": action,
        "reward": reward,
        "next_state": next_state,
        "done": done
    }
    trajectory.append(step_data)  # Append the step data to the trajectory list
    
    print(f"Action: {action}, Reward: {reward}, Next State: {next_state}")
    env.render()
    
    state = next_state  # Update the current state to the next state
    
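    if done:
        # Report which of the two termination conditions ended the episode
        if next_state == env.goal_position:
            print("Goal reached.")
        else:
            print("Maximum episode length reached.")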

The sampled trajectory is stored as a list of dictionaries, one per step:

In [None]:
trajectory
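
Because every step is a dictionary with the same keys, an episode can be summarized with a few lines of code. The sketch below assumes a hypothetical helper named `summarize` and a short hand-written trajectory for a 2x2 grid; the values are illustrative only and are not output from the run above.

```python
def summarize(trajectory, goal_position):
    """Return the step count, total reward, and whether the episode ended at the goal."""
    total_reward = sum(step["reward"] for step in trajectory)
    # An empty trajectory cannot have reached the goal.
    reached_goal = bool(trajectory) and trajectory[-1]["next_state"] == goal_position
    return {
        "steps": len(trajectory),
        "total_reward": total_reward,
        "reached_goal": reached_goal,
    }


# Hand-written example on a 2x2 grid (goal at (1, 1)); illustrative values only.
example = [
    {"state": (0, 0), "action": "up", "reward": -1, "next_state": (0, 0), "done": False},
    {"state": (0, 0), "action": "right", "reward": -1, "next_state": (0, 1), "done": False},
    {"state": (0, 1), "action": "down", "reward": 10, "next_state": (1, 1), "done": True},
]

print(summarize(example, goal_position=(1, 1)))
# {'steps': 3, 'total_reward': 8, 'reached_goal': True}
```

In the notebook, the same helper could be applied to the sampled episode with `summarize(trajectory, env.goal_position)`.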