2ND QUESTION: The Frozen Lake: Building a Custom Environment

In this level we extend the Frozen Lake problem to an 8x8 grid in which the number of holes is increased to 8. The holes are placed randomly, but the start and goal tiles stay in the same positions as in the level 1 Frozen Lake. Students should build this environment themselves and implement an agent that learns to navigate it in as few training episodes as possible.
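Sampling hole positions naively can put a hole on the start or goal tile, or wall off the goal so that no episode can ever succeed. The following is a minimal sketch of one way to avoid both problems. It assumes the start is (0, 0) and the goal is (size - 1, size - 1), as in the environment code below, and that an unsolvable layout should simply be resampled. The helpers random_holes and is_reachable are hypothetical names, and rejection sampling with a breadth-first search is an illustrative choice, not a requirement of the assignment.

```python
import random
from collections import deque

def is_reachable(size, holes, start, goal):
    """Breadth-first search over the 4-connected grid, never stepping on a hole."""
    frontier, seen = deque([start]), {start}
    while frontier:
        row, col = frontier.popleft()
        if (row, col) == goal:
            return True
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (row + dr, col + dc)
            inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if inside and nxt not in holes and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def random_holes(size=8, num_holes=8, seed=None):
    """Hypothetical helper: sample distinct hole cells (never start or goal),
    resampling until a hole-free path from start to goal exists."""
    rng = random.Random(seed)
    start, goal = (0, 0), (size - 1, size - 1)  # assumed to match the notebook
    cells = [(r, c) for r in range(size) for c in range(size)
             if (r, c) not in (start, goal)]
    while True:
        holes = set(rng.sample(cells, num_holes))
        if is_reachable(size, holes, start, goal):
            return holes

holes = random_holes(seed=42)
print(sorted(holes))
```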

The agent navigates a grid-shaped frozen lake and tries to reach the goal tile without falling into a hole. The observation space is discrete: each observation is a single integer identifying the tile the agent currently occupies (row * 8 + col, so 0 to 63 on the 8x8 grid). Each tile is one of the following types:

1. Frozen (F): The tile is frozen and safe to step on.
2. Hole (H): The tile is a hole, and if the agent steps on it, it falls into the hole and fails.
3. Start (S): The tile is the starting point for the agent.
4. Goal (G): The tile is the goal, and if the agent reaches it, it succeeds.
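Because the observation is just an integer, it has to be converted back to a grid position and a tile type to be interpreted. The following is a minimal sketch, assuming the row * 8 + col encoding used by get_state in the code below, the start at (0, 0), the goal at (7, 7), and holes given as a set of (row, col) pairs. The functions encode, decode, tile_type and one_hot are hypothetical helpers. one_hot illustrates a common way to feed a discrete state to a neural network and is an assumption, not part of the assignment.

```python
import numpy as np

SIZE = 8

def encode(row, col, size=SIZE):
    """Map a grid position to the integer observation (tile index)."""
    return row * size + col

def decode(state, size=SIZE):
    """Recover (row, col) from a tile index."""
    return divmod(state, size)

def tile_type(state, holes, size=SIZE):
    """Return 'S', 'G', 'H' or 'F' for the tile with this index."""
    pos = decode(state, size)
    if pos == (0, 0):
        return "S"
    if pos == (size - 1, size - 1):
        return "G"
    return "H" if pos in holes else "F"

def one_hot(state, size=SIZE):
    """One-hot vector so a network sees each tile as a separate input unit."""
    vec = np.zeros(size * size, dtype=np.float32)
    vec[state] = 1.0
    return vec

holes = {(2, 3), (5, 1)}  # hypothetical hole positions for illustration
state = encode(2, 3)
print(state, decode(state), tile_type(state, holes))  # prints: 19 (2, 3) H
```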

At each time step the agent receives the index of the tile it occupies. From this it can work out its position on the grid and decide which action to take next.

Expected Outcome: the path traversed to reach the goal, whether the game was completed, the rewards achieved, and the number of episodes trained for.
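With only 64 states, a lookup table can hold one Q-value per (tile, action) pair. Tabular Q-learning is therefore a simple baseline that reports each item in the expected outcome: episodes trained, whether the goal was reached, the reward, and the path traversed. This is a swapped-in technique, not the Q-network approach used in the notebook. The fixed hole layout, the 200-step cap, alpha = 0.1 and the epsilon schedule are illustrative assumptions. The sketch mirrors the notebook's reward scheme (+1 goal, -1 hole, 0 otherwise), its action order (Up, Down, Left, Right), and its wall clamping.

```python
import numpy as np

SIZE = 8
GOAL = SIZE * SIZE - 1
# Same action order as the notebook's environment: Up, Down, Left, Right
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(holes, state, action):
    """Deterministic move clamped to the grid; +1 at the goal, -1 in a hole, else 0."""
    row, col = divmod(state, SIZE)
    dr, dc = MOVES[action]
    row = min(max(row + dr, 0), SIZE - 1)
    col = min(max(col + dc, 0), SIZE - 1)
    nxt = row * SIZE + col
    if nxt in holes:
        return nxt, -1.0, True
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False

def train(holes, episodes, alpha=0.1, gamma=0.99, max_steps=200, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy and per-episode epsilon decay."""
    rng = np.random.default_rng(seed)
    q = np.zeros((SIZE * SIZE, len(MOVES)))
    epsilon = 1.0
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < max_steps:
            if rng.random() < epsilon:
                action = int(rng.integers(len(MOVES)))
            else:
                action = int(np.argmax(q[state]))
            nxt, reward, done = step(holes, state, action)
            # Q-learning update; terminal transitions do not bootstrap
            target = reward if done else reward + gamma * q[nxt].max()
            q[state, action] += alpha * (target - q[state, action])
            state, steps = nxt, steps + 1
        epsilon = max(0.05, epsilon * 0.995)
    return q

def greedy_path(holes, q, max_steps=100):
    """Follow argmax actions from the start and record every (row, col) visited."""
    state, reward, done = 0, 0.0, False
    path = [(0, 0)]
    while not done and len(path) <= max_steps:
        state, reward, done = step(holes, state, int(np.argmax(q[state])))
        path.append(divmod(state, SIZE))
    return path, reward

holes = {10, 19, 29, 35, 42, 46, 52, 59}  # hypothetical fixed layout (tile indices)
episodes = 2000
q = train(holes, episodes)
path, reward = greedy_path(holes, q)
print("Episodes trained:", episodes)
print("Reached goal:", reward == 1.0)
print("Reward:", reward)
print("Path traversed:", path)
```

Whether the greedy path reaches the goal depends on the layout and the number of episodes. The step caps keep both training and evaluation finite even when the policy loops.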

import numpy as np
import tensorflow as tf
import gym

# Custom 8x8 Frozen Lake Environment
class CustomFrozenLakeEnv(gym.Env):
    def __init__(self, size=8, num_holes=8):
        self.size = size
        self.num_holes = num_holes
        self.state_space = size * size  # One observation per tile
        self.action_space = 4  # Up, Down, Left, Right

        self.grid = np.zeros((size, size), dtype=int)  # 0 = frozen, 1 = hole
        self.start_pos = (0, 0)
        self.goal_pos = (size - 1, size - 1)

        # Place exactly num_holes holes on distinct random tiles, never on the start or goal.
        # Note: this does not check that a hole-free path from start to goal exists.
        candidates = [(r, c) for r in range(size) for c in range(size)
                      if (r, c) not in (self.start_pos, self.goal_pos)]
        for idx in np.random.choice(len(candidates), size=num_holes, replace=False):
            self.grid[candidates[idx]] = 1  # Mark as hole

        self.reset()

    def reset(self):
        self.agent_pos = self.start_pos
        self.done = False
        return self.get_state()

    def get_state(self):
        return self.agent_pos[0] * self.size + self.agent_pos[1]

    def step(self, action):
        if self.done:
            raise ValueError("Episode is done. Call reset() to start a new episode.")

        row, col = self.agent_pos

        if action == 0:  # Up
            row = max(0, row - 1)
        elif action == 1:  # Down
            row = min(self.size - 1, row + 1)
        elif action == 2:  # Left
            col = max(0, col - 1)
        elif action == 3:  # Right
            col = min(self.size - 1, col + 1)

        self.agent_pos = (row, col)

        if self.grid[self.agent_pos] == 1:  # Agent fell into a hole
            reward = -1
            self.done = True
        elif self.agent_pos == self.goal_pos:  # Agent reached the goal
            reward = 1
            self.done = True
        else:
            reward = 0

        return self.get_state(), reward, self.done, {}

    def render(self):
        # Start from all-frozen tiles, then overlay holes, start, goal and the agent
        render_grid = np.full(self.grid.shape, 'F', dtype='<U1')
        render_grid[self.grid == 1] = 'H'
        render_grid[self.start_pos] = 'S'
        render_grid[self.goal_pos] = 'G'
        render_grid[self.agent_pos] = 'A'

        print(render_grid)

# Define Q-network
class QNetwork(tf.keras.Model):
    def __init__(self, num_actions):
        super(QNetwork, self).__init__()
        self.layer1 = tf.keras.layers.Dense(64, activation='relu')
        self.layer2 = tf.keras.layers.Dense(64, activation='relu')
        self.output_layer = tf.keras.layers.Dense(num_actions, activation='linear')

    def call(self, state):
        x = self.layer1(state)
        x = self.layer2(x)
        return self.output_layer(x)

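# Initialize Q-network
num_actions = 4
q_network = QNetwork(num_actions)

# Q-learning parameters
learning_rate = 0.001
gamma = 0.99
optimizer = tf.keras.optimizers.Adam(learning_rate)
huber_loss = tf.keras.losses.Huber()

# Training loop
num_episodes = 1000
max_steps = 200        # Step cap so an agent bumping into walls cannot loop forever
epsilon = 1.0          # Start fully exploratory...
epsilon_min = 0.05     # ...and keep a little exploration at the end
epsilon_decay = 0.995  # Multiplicative decay applied after every episode

env = CustomFrozenLakeEnv(size=8, num_holes=8)
num_states = env.size * env.size

def one_hot_state(state_index):
    # One-hot encode the tile index: shape (1, num_states), dtype float32.
    # Feeding the raw index as a single number makes neighbouring tiles look almost identical to the network.
    return tf.one_hot([state_index], num_states)

for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    episode_path = [env.agent_pos]  # Store the initial position
    done = False

    for _ in range(max_steps):
        # Choose action using epsilon-greedy policy
        if np.random.rand() < epsilon:
            action = np.random.randint(env.action_space)
        else:
            action = int(tf.argmax(q_network(one_hot_state(state)), axis=1)[0])

        # Take the chosen action
        next_state, reward, done, _ = env.step(action)
        episode_path.append(env.agent_pos)  # Store the current position
        total_reward += reward

        # Bellman target r + gamma * max_a' Q(s', a'), computed outside the tape so it is
        # treated as a constant; terminal transitions do not bootstrap
        if done:
            target = float(reward)
        else:
            target = reward + gamma * float(tf.reduce_max(q_network(one_hot_state(next_state))))

        # One gradient step per transition, on the Q-value of the action actually taken
        with tf.GradientTape() as tape:
            q_values = q_network(one_hot_state(state))
            q_taken = tf.reduce_sum(q_values * tf.one_hot([action], num_actions), axis=1)
            loss = huber_loss(tf.constant([target], dtype=tf.float32), q_taken)
        grads = tape.gradient(loss, q_network.trainable_variables)
        optimizer.apply_gradients(zip(grads, q_network.trainable_variables))

        state = next_state
        if done:
            break

    # Explore less as training progresses
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    print("Episode {}: Total Reward: {}".format(episode, total_reward))

    if done:
        if reward == 1:
            print("Game completed successfully! Agent reached the goal.")
            print("Path Traversed:", episode_path)
        else:
            print("Agent fell into a hole. Game over.")
    else:
        print("Step limit reached without finishing.")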


Episode 0: Total Reward: -1
Agent fell into a hole. Game over.
Episode 1: Total Reward: -1
Agent fell into a hole. Game over.
Episode 2: Total Reward: -1
Agent fell into a hole. Game over.
Episode 3: Total Reward: -1
Agent fell into a hole. Game over.
Episode 4: Total Reward: -1
Agent fell into a hole. Game over.
Episode 5: Total Reward: -1
Agent fell into a hole. Game over.
Episode 6: Total Reward: -1
Agent fell into a hole. Game over.
Episode 7: Total Reward: -1
Agent fell into a hole. Game over.
Episode 8: Total Reward: -1
Agent fell into a hole. Game over.
Episode 9: Total Reward: -1
Agent fell into a hole. Game over.
Episode 10: Total Reward: -1
Agent fell into a hole. Game over.
Episode 11: Total Reward: -1
Agent fell into a hole. Game over.
Episode 12: Total Reward: -1
Agent fell into a hole. Game over.
Episode 13: Total Reward: -1
Agent fell into a hole. Game over.
Episode 14: Total Reward: -1
Agent fell into a hole. Game over.
Episode 15: Total Reward: -1
Agent fell into a hol