## Frozen Lake Domain Description

Frozen Lake is a simple grid-world environment where an agent navigates a frozen lake to reach a goal while avoiding falling into holes. The environment is represented as a grid, with each cell being one of the following:

* **S**: Starting position of the agent
* **F**: Frozen surface, safe to walk on
* **H**: Hole; falling into one ends the episode with a reward of 0
* **G**: Goal; reaching it ends the episode with a reward of 1

The agent can take four actions:

* **0: Left**
* **1: Down**
* **2: Right**
* **3: Up**

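However, because the ice is slippery, the agent does not always move in the intended direction. In the default slippery configuration of `FrozenLake-v1`, it moves in the intended direction with probability 1/3 and slides in each of the two perpendicular directions with probability 1/3.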




In [None]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # The episode ended, so start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second element
            print(rendered[1])  # Print the second element

# Close the environment
env.close()

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG





The transition model for the Frozen Lake world describes how the agent's actions affect its movement and the resulting state transitions. Here's a breakdown of the key components:

**Actions:**

* The agent can choose from four actions:
    * 0: Left
    * 1: Down
    * 2: Right
    * 3: Up

**State Transitions:**

* **Intended Movement:** Without slipping (`is_slippery=False`), the agent moves one cell in the chosen direction.
* **Slippery Ice:** With the default slippery setting, the agent may instead slide perpendicular to the chosen direction. In the default `FrozenLake-v1` configuration the three outcomes are equally likely, as the transition table printed for state 14 below confirms:
    * **Successful Move:** The agent moves in the intended direction with probability 1/3.
    * **Perpendicular Move:** The agent moves 90 degrees to the left or to the right of the intended direction, each with probability 1/3.
* **Boundaries:** If the intended movement would take the agent outside the grid boundaries, it remains in its current position.
* **Holes:** If the agent lands on a hole ("H"), the episode ends, and it receives a reward of 0.
* **Goal:** If the agent reaches the goal ("G"), the episode ends, and it receives a reward of 1.
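
The rules above can be sketched in plain Python without Gym. This is a minimal illustration, assuming the default 4x4 map and the equal 1/3 slip probabilities. The names `MAP`, `move`, and `slippery_transitions` are hypothetical helpers introduced here, not part of the Gym API, and the sketch does not special-case transitions that start from a hole or the goal.

```python
# Sketch of the slippery transition rule on the default 4x4 map.
# All names here are illustrative helpers, not Gym functions.
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_ROWS, N_COLS = len(MAP), len(MAP[0])
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}


def move(state, direction):
    """Apply one deterministic move, staying in place at a grid boundary."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[direction]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return row * N_COLS + col


def slippery_transitions(state, action):
    """Return [(probability, next_state, reward, done), ...] for one state-action pair."""
    outcomes = []
    # The actual move is the intended direction or one of the two
    # perpendicular directions, each with probability 1/3.
    for direction in [(action - 1) % 4, action, (action + 1) % 4]:
        next_state = move(state, direction)
        cell = MAP[next_state // N_COLS][next_state % N_COLS]
        reward = 1.0 if cell == "G" else 0.0
        done = cell in "GH"  # holes and the goal end the episode
        outcomes.append((1 / 3, next_state, reward, done))
    return outcomes


# State 14 is row 3, column 2 (state = row * 4 + column).
print(slippery_transitions(14, RIGHT))
```

States are numbered row by row, so state 14 sits directly left of the goal. Going Right from there can slide Down (blocked by the boundary, stays at 14), go Right (reaches the goal at 15), or slide Up (lands on 10).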




In [None]:
import gym

# Create the environment
env = gym.make('FrozenLake-v1', render_mode='ansi')  # 'ansi' mode for text-based rendering

# Reset the environment to the initial state
observation = env.reset()

# Take a few actions and observe the results
for _ in range(5):
    action = env.action_space.sample()  # Choose a random action
    observation, reward, done, info = env.step(action)
    if done:
        # The episode ended, so start a new one
        observation = env.reset()
    else:
        # Render the environment to see the agent's movement (text-based)
        rendered = env.render()
        if len(rendered) > 1:  # Check if there's a second element
            print(rendered[1])  # Print the second element

# Close the environment
env.close()
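# Each entry of the transition table is (probability, next_state, reward, done)
print("State 14 Going Right: (probability, next_state, reward, done)", env.unwrapped.P[14][2])

  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG

State 14 Going Right: (probability, next_state, reward, done) [(0.3333333333333333, 14, 0.0, False), (0.3333333333333333, 15, 1.0, True), (0.3333333333333333, 10, 0.0, False)]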


In [None]:
import gym
import numpy as np


# Create FrozenLake environment
env = gym.make("FrozenLake-v1")

# Starter code for students (modified for number of iterations)
def value_iteration(env, gamma=0.9, num_iterations=1000):
    """
    Implements the Value Iteration algorithm.

    Args:
        env: The OpenAI Gym environment.
        gamma: Discount factor.
        num_iterations: Number of iterations to run.

    Returns:
        The optimal value function and policy.
    """

    n_states = env.observation_space.n
    n_actions = env.action_space.n

    # Initialize value function and policy (all states start at 0)
    V = np.zeros(n_states)
    policy = np.zeros(n_states)

    for _ in range(num_iterations):
        # Snapshot of the previous sweep's values, so every state is
        # updated from the same estimates (synchronous update)
        V_prev = np.copy(V)
        for state in range(n_states):
            # Q[a] is the expected return of taking action a in this state
            Q = np.zeros(n_actions)
            for action in range(n_actions):
                # Each transition is (probability, next_state, reward, done)
                for prob, next_state, reward, _ in env.unwrapped.P[state][action]:
                    Q[action] += prob * (reward + gamma * V_prev[next_state])
            # Update V[s] with the max Q value
            V[state] = np.max(Q)
            # Update policy[s] with the action that maximizes the Q value
            policy[state] = np.argmax(Q)

    return V, policy

# Apply student's Value Iteration
optimal_V, optimal_policy = value_iteration(env)

# Evaluate student's solution (Optional)
def evaluate_policy(env, policy, num_episodes=100):
    total_reward = 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # policy is stored as a float array, so cast to a valid integer action
            action = int(policy[state])
            state, reward, done, _ = env.step(action)
            total_reward += reward
    return total_reward / num_episodes

average_reward = evaluate_policy(env, optimal_policy)
print("Optimal Policy:")
print(optimal_policy.reshape(4, 4))
print("Optimal Values:")
print(optimal_V.reshape(4, 4))
print("Average Reward:", average_reward)

Optimal Policy:
[[0. 3. 0. 3.]
 [0. 0. 0. 0.]
 [3. 1. 0. 0.]
 [0. 2. 1. 0.]]
Optimal Values:
[[0.0688909  0.06141457 0.07440976 0.05580732]
 [0.09185454 0.         0.11220821 0.        ]
 [0.14543635 0.24749695 0.29961759 0.        ]
 [0.         0.3799359  0.63902015 0.        ]]
Average Reward: 0.72


## Results

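The output above shows the optimal policy, the optimal values, and the average reward.

*   The optimal policy matrix gives the best action to take in each state, with 0 being Left, 1 being Down, 2 being Right, and 3 being Up. The entries for holes and the goal are placeholders, since the episode ends in those states.
*   The optimal values matrix shows the expected discounted return (with `gamma=0.9`) achievable from each state. Holes and the goal have value 0 because no further reward is collected once the episode ends.
*   The average reward of 0.72 is the fraction of the 100 evaluation episodes that reached the goal, because the only nonzero reward is the 1 given at the goal.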
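
As a quick sanity check on these results, the sketch below redraws the printed policy as arrows and verifies one Bellman backup by hand. It assumes the default 4x4 map and the equal 1/3 slip probabilities. The numbers are copied from the output above, and `policy_as_arrows` is a hypothetical helper introduced only for this illustration.

```python
# Values copied from the printed output above.
POLICY = [[0, 3, 0, 3], [0, 0, 0, 0], [3, 1, 0, 0], [0, 2, 1, 0]]
VALUES = [
    [0.0688909, 0.06141457, 0.07440976, 0.05580732],
    [0.09185454, 0.0, 0.11220821, 0.0],
    [0.14543635, 0.24749695, 0.29961759, 0.0],
    [0.0, 0.3799359, 0.63902015, 0.0],
]
MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]  # assumed default 4x4 layout
ARROWS = {0: "<", 1: "v", 2: ">", 3: "^"}  # Left, Down, Right, Up


def policy_as_arrows(policy, grid):
    """Render the policy as arrows, showing H and G for terminal cells."""
    rows = []
    for policy_row, map_row in zip(policy, grid):
        symbols = [
            cell if cell in "HG" else ARROWS[action]
            for action, cell in zip(policy_row, map_row)
        ]
        rows.append(" ".join(symbols))
    return rows


for row in policy_as_arrows(POLICY, MAP):
    print(row)

# Bellman check for state 14 (row 3, column 2), whose greedy action is Down.
# With slipping, Down ends in state 13 (slid Left), state 14 (Down, blocked
# by the boundary), or state 15 (slid Right, the goal with reward 1).
gamma = 0.9
q_down = (
    (1 / 3) * (0.0 + gamma * VALUES[3][1])
    + (1 / 3) * (0.0 + gamma * VALUES[3][2])
    + (1 / 3) * (1.0 + gamma * VALUES[3][3])
)
print("Q(14, Down):", q_down, "printed V(14):", VALUES[3][2])
```

The backup for state 14 agrees with the printed value to within rounding, which is what a converged value function should satisfy.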









In [None]:
import gym
import numpy as np

# Q Learning Algorithm

# create a frozen lake environment
env = gym.make("FrozenLake-v1")
n_observations = env.observation_space.n
n_actions = env.action_space.n

# creating and initializing the q-table to 0
Q_table = np.zeros((n_observations, n_actions))

# number of episodes we will run
num_episodes = 10000
# max iterations per episode
max_itrs_per_episode = 100
# initial exploration probability (epsilon): start with fully random actions
exploration_prob = 1
# decay rate of the exponential exploration schedule
exploration_dec_decay = 0.001
# lower bound on the exploration probability
min_exploration_prob = 0.01
# gamma discount
gamma = 0.9
# learning rate
alpha = 0.1

# storing the total rewards the agent gets in the environment after each episode in a list
total_rewards_episode = []

# iterating over episodes
for e in range(num_episodes):
    # initialize the first state of the episode
    current_state = env.reset()
    done = False

    # sum the rewards that the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_itrs_per_episode):
        # exploration
        if np.random.uniform(0, 1) < exploration_prob:
            action = env.action_space.sample()
        # exploitation
        else:
            action = np.argmax(Q_table[current_state, :])

        # the environment runs the chosen action and returns the next state, reward, and done
        next_state, reward, done, _ = env.step(action)

        # Q-learning update: move Q(s, a) toward reward + gamma * max over a' of Q(s', a')
        Q_table[current_state, action] = (1 - alpha) * Q_table[current_state, action] + alpha * (reward + gamma * np.max(Q_table[next_state, :]))

        # update total reward
        total_episode_reward += reward

        # if the episode is done, exit the loop
        if done:
            break

        # update current state
        current_state = next_state

    # append the total reward for the episode after it finishes
    total_rewards_episode.append(total_episode_reward)

    # update the exploration probability using exponential decay
    exploration_prob = max(min_exploration_prob, np.exp(-exploration_dec_decay * e))

# print average reward per thousand episodes
print("Average reward per thousand episodes:")
for i in range(10):
    print("(", (i + 1) * 1000, "): Average reward: ", np.mean(total_rewards_episode[1000 * i:1000 * (i + 1)]))


Average reward per thousand episodes:
( 1000 ): Average reward:  0.04
( 2000 ): Average reward:  0.173
( 3000 ): Average reward:  0.323
( 4000 ): Average reward:  0.336
( 5000 ): Average reward:  0.524
( 6000 ): Average reward:  0.515
( 7000 ): Average reward:  0.497
( 8000 ): Average reward:  0.425
( 9000 ): Average reward:  0.494
( 10000 ): Average reward:  0.406
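
To help read the averages above, the following sketch reproduces only the exploration schedule from the Q-learning cell, using the same `exploration_dec_decay` and `min_exploration_prob` values. The helper `exploration_after` is a hypothetical name introduced here, and no training is run.

```python
import math

exploration_dec_decay = 0.001
min_exploration_prob = 0.01


def exploration_after(episode):
    """Exploration probability set at the end of episode `episode` (0-based)."""
    return max(min_exploration_prob, math.exp(-exploration_dec_decay * episode))


# First episode index at which the schedule reaches its lower bound
first_floor_episode = next(
    e for e in range(10000) if exploration_after(e) == min_exploration_prob
)
print("Floor reached at episode index:", first_floor_episode)

# Exploration probability at a few checkpoints
for e in (0, 1000, 2000, 3000, 4000, 5000):
    print(e, round(exploration_after(e), 4))
```

Under this schedule the exploration probability reaches its floor of 0.01 a little after episode 4,600. From then on the agent acts greedily about 99% of the time, so the later averages mostly reflect the learned Q-table rather than random exploration.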
