## Random Maze Environment

#### Learning the Maze

- Previously we created `FrozenPond` and then a `Maze` variant.
- Let's train an agent to complete the Maze using RLlib.

In [8]:
from ray.rllib.agents.ppo import PPOTrainer
from envs import Maze # Maze defined in previous slides

In [16]:
trainer = PPOTrainer({"framework": "torch", "create_env_on_driver": True},
                     env=Maze)

In [17]:
train_info = trainer.train()

#### Learning the Maze

We see that the agent always receives the reward for finishing the maze:

In [37]:
print(trainer.evaluate()['evaluation']['hist_stats']['episode_reward'])

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


But it often takes much longer than the optimal number of steps, which is 6:

In [35]:
print(trainer.evaluate()['evaluation']['hist_stats']["episode_lengths"])

[74, 17, 37, 54, 39, 15, 29, 65, 73, 29, 39, 25, 45, 54, 14, 20, 51, 48, 20, 21, 32, 55, 62, 38, 11, 12, 45, 58, 58, 24, 65, 18, 51, 95, 83, 32, 20, 28, 22, 26, 12, 14, 60, 18, 41, 44, 7, 33, 30, 19, 33, 21, 68, 18]


We can improve the agent by training for more iterations.

#### Learning the Maze

In [38]:
for i in range(3):
    train_info = trainer.train()

In [39]:
print(trainer.evaluate()['evaluation']['hist_stats']["episode_lengths"])

[52, 11, 13, 14, 11, 14, 8, 12, 9, 7, 13, 14, 7, 11, 14, 10, 9, 11, 12, 10, 9, 8, 17, 15, 10, 10, 8, 10, 9, 8, 9, 7, 8, 7, 9, 9, 16, 13, 6, 14, 9, 6, 10, 8, 10, 6, 9, 10, 18, 11, 14, 9, 11, 10, 11, 10, 8, 13, 14, 7, 11, 9, 15, 6, 6, 6, 9, 10, 6, 10, 6, 10, 13, 8, 8, 9, 14, 15, 7, 10, 7, 7, 14, 11, 12, 6, 13, 10, 8, 6, 12, 10, 6, 7, 8, 10, 6, 12, 14, 11, 6, 7, 9, 13, 6, 8, 9, 15, 9, 15, 6, 8, 6, 7, 8, 8, 12, 8, 13, 11, 7, 11, 10, 12, 7, 18, 10, 7, 14, 6, 6, 7, 8, 9, 6, 8, 9, 10, 10, 8, 7, 6, 9, 10, 10, 8, 14, 10, 9, 8, 9, 8, 7, 8, 13, 17, 8, 7, 10, 6, 7, 11, 9, 11, 8, 9, 13, 8, 7, 12, 16, 12, 8, 6, 11, 6, 9, 9, 18, 6, 8, 11, 17, 10, 7, 9, 12, 8, 6, 7, 15, 12, 6, 12, 8, 11, 8, 8, 8, 7, 8, 8, 8, 6, 8, 8, 6]


After three more training iterations, the episode length is often under 10 steps. Progress.
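To put a number on "often", one could summarize a list of episode lengths. The snippet below is a minimal sketch: `summarize_lengths` and its `threshold` parameter are hypothetical helpers introduced here for illustration, not part of RLlib, and the sample is just the first ten values from the output above.

```python
def summarize_lengths(lengths, optimal=6, threshold=10):
    """Summarize episode lengths (hypothetical helper, not an RLlib function)."""
    n = len(lengths)
    return {
        "mean": sum(lengths) / n,
        # fraction of episodes solved in exactly the optimal number of steps
        "frac_optimal": sum(length == optimal for length in lengths) / n,
        # fraction of episodes solved in fewer than `threshold` steps
        "frac_below_threshold": sum(length < threshold for length in lengths) / n,
    }

# First ten episode lengths from the evaluation output above
sample = [52, 11, 13, 14, 11, 14, 8, 12, 9, 7]
stats = summarize_lengths(sample)
print(stats)
```

Passing the full list returned by `trainer.evaluate()` instead of `sample` would give the statistics for the whole evaluation run.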

#### Beyond the simple maze

We can train an agent to learn this fixed maze:

In [41]:
maze = Maze()
maze.reset()
maze.render()

P...
.X.X
...X
X..G


But this is quite an easy problem:

- Small state space
- Small action space
- No stochasticity

#### Random maze

- Let's make the problem harder by looking at a _random_ maze.
- That is, the wall locations change every episode.
- We'll do this by reimplementing the `reset` method:

In [42]:
import numpy as np

class RandomMaze(Maze):
    def reset(self):
        self.player = (0, 0) # the player starts at the top-left
        self.exit = (3, 3)   # exit is at the bottom-right

        # each square is a wall with probability 0.2
        self.walls = np.random.rand(4, 4) < 0.2
        # the start and exit squares are never walls
        self.walls[self.player] = False
        self.walls[self.exit] = False

        return 0 # the observation corresponding to (0,0)

Now, each square (except the start and end locations) is a wall with probability 20%.
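As a rough sanity check on the 20% figure: 14 of the 16 squares can become walls, so the expected number of walls per maze is 14 × 0.2 = 2.8. The sketch below samples layouts the same way `RandomMaze.reset` does and averages the wall count. The `sample_walls` helper is hypothetical, and it uses a seeded `np.random.default_rng` rather than `np.random.rand` purely so the run is reproducible.

```python
import numpy as np

def sample_walls(rng, size=4, p_wall=0.2):
    """Sample a wall layout like RandomMaze.reset (hypothetical helper)."""
    walls = rng.random((size, size)) < p_wall
    walls[0, 0] = False                # start square is never a wall
    walls[size - 1, size - 1] = False  # exit square is never a wall
    return walls

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
layouts = [sample_walls(rng) for _ in range(10_000)]
mean_walls = float(np.mean([layout.sum() for layout in layouts]))

# In theory the average should be close to 14 * 0.2 = 2.8
print("average number of walls:", mean_walls)
```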

#### Impossible mazes

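- In this new setup, the random walls can block every path from the start to the exit, making the maze impossible to finish.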
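Whether a given wall layout can be solved at all can be checked with a breadth-first search. The sketch below assumes the player moves one square up, down, left, or right per step (the action set of `Maze` is not shown here, so this is an assumption). `shortest_path_length` and `parse` are hypothetical helpers, and the `blocked` layout is made up for illustration.

```python
from collections import deque

def shortest_path_length(walls, start=(0, 0), goal=(3, 3)):
    """Fewest moves from start to goal, or None if the goal is unreachable.

    `walls` is a square grid (nested lists or a NumPy array) where a truthy
    entry marks a wall. Assumes 4-way movement, one square per step.
    """
    n = len(walls)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < n and 0 <= nc < n
            if inside and not walls[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None

def parse(rows):
    """Turn rendered rows such as 'P...' into a boolean wall grid."""
    return [[ch == "X" for ch in row] for row in rows]

fixed = parse(["P...", ".X.X", "...X", "X..G"])    # the fixed maze rendered earlier
blocked = parse(["PX..", "X...", "....", "...G"])  # made-up layout: start walled in

print(shortest_path_length(fixed))    # 6, matching the optimal step count quoted earlier
print(shortest_path_length(blocked))  # None: no path exists
```

A check like this could be used to resample the walls in `reset` until the maze is solvable, though that is only one possible design.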

# IDEA

- Should we do a random pond instead of a random maze?
- Then we wouldn't need to deal with impossible cases, because the episode would always end?
- OK, wait a minute.
- Actually, the agent could still go back and forth forever if `explore=False`.
- I guess this isn't a concern for training.
- OK, but we do need some possibility of the `done` flag being set. I see.

In [33]:
# from ipywidgets import Output
# from IPython import display
# import time

# env = Maze()

# out = Output();
# display.display(out);
# with out:
#     obs = env.reset()
#     env.seed(1)
#     done = False
#     episode_length = 0
#     while not done:
#         action = trainer.compute_single_action(obs)
#         obs, rewards, done, _ = env.step(action)
        
#         out.clear_output(wait=True)
#         print("action:", action)
#         env.render()
#         time.sleep(0.5)
#         episode_length += 1
# print("episode length:", episode_length)

action: 2
....
.X.X
...X
X..G
episode length: 10


Then, introduce the random maze.