## Homework 3

In this assignment, we will practice implementing (1) value iteration in the Frozen Lake environment and (2) a policy gradient method in the CartPole environment, both from OpenAI Gym. To begin with, for value iteration, there are two functions that you need to implement: (1) `policy_improvement`, which returns the greedy policy with respect to a given value function and the environment dynamics, and (2) `value_iteration`, which implements value iteration.

* You should run this on Google Colab




In [1]:
# IMPORTANT: Always run this cell before anything else to ensure that you are able to access the Frozen Lake environment
!pip install gymnasium==0.29.1
import gymnasium as gym
import argparse
import numpy as np
import time
from gymnasium.envs.registration import register

# De-register environments if there is a collision
env_dict = gym.envs.registration.registry.copy()
for env in env_dict:
    if "Deterministic-4x4-FrozenLake-v0" in env:
        del gym.envs.registration.registry[env]
    elif "Stochastic-4x4-FrozenLake-v0" in env:
        del gym.envs.registration.registry[env]


register(
    id="Deterministic-4x4-FrozenLake-v0",
    entry_point="gymnasium.envs.toy_text.frozen_lake:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": False},
)

register(
    id="Stochastic-4x4-FrozenLake-v0",
    entry_point="gymnasium.envs.toy_text.frozen_lake:FrozenLakeEnv",
    kwargs={"map_name": "4x4", "is_slippery": True},
)







The parameters P, nS, nA, gamma are defined as follows:

	P: nested dictionary of lists
		Taken from the environment's P attribute
		For each state in [0, nS - 1] and action in [0, nA - 1], P[state][action] is a
		list of tuples of the form (probability, nextstate, reward, terminal), one
		tuple per possible outcome, where
			- probability: float
				the probability of transitioning from "state" to "nextstate" with "action"
			- nextstate: int
				denotes the state we transition to (in range [0, nS - 1])
			- reward: int
				either 0 or 1, the reward for transitioning from "state" to
				"nextstate" with "action"
			- terminal: bool
				True when "nextstate" is a terminal state (hole or goal), False otherwise
	nS: int
		number of states in the environment
	nA: int
		number of actions in the environment
	gamma: float
		Discount factor. Number in range [0, 1)
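
To make the layout of `P` concrete, here is a minimal sketch on a hand-written two-state MDP. The dictionary `toy_P` and the helpers `expected_reward` and `check_probabilities` are hypothetical names used only for illustration; they are not part of the assignment or of Gymnasium.

```python
# Hypothetical two-state, two-action MDP written in the same layout as env.P:
# P[state][action] is a list of (probability, nextstate, reward, terminal) tuples.
toy_P = {
    0: {
        0: [(1.0, 0, 0.0, False)],                       # action 0: stay in state 0
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1: usually reach the goal
    },
    1: {
        0: [(1.0, 1, 0.0, True)],                        # state 1 is terminal (absorbing)
        1: [(1.0, 1, 0.0, True)],
    },
}


def expected_reward(P, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(prob * reward for prob, _next_state, reward, _terminal in P[state][action])


def check_probabilities(P):
    """Return True if the outcome probabilities of every (state, action) pair sum to 1."""
    return all(
        abs(sum(prob for prob, _, _, _ in outcomes) - 1.0) < 1e-9
        for actions in P.values()
        for outcomes in actions.values()
    )


print(expected_reward(toy_P, 0, 1))
print(check_probabilities(toy_P))
```

The same loops should work on `env.P` from Frozen Lake, since it follows the tuple layout described above.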

## Value Iteration
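
As a reminder, value iteration repeatedly applies the Bellman optimality backup. Written in terms of the transition tuples stored in `P[s][a]`, a standard form of the update is sketched below. The iteration index $k$ and the tuple notation $(p, s', r)$ are introduced here for illustration only, and the form assumes that terminal states are absorbing with zero reward, so that their value stays at 0.

$$ V_{k+1}(s) = \max_{a} \sum_{(p, s', r) \in P[s][a]} p \cdot \big( r + \gamma V_k(s') \big). $$

The iteration stops once $\max_s |V_{k+1}(s) - V_k(s)| < \text{tol}$, and a greedy policy can then be read off as

$$ \pi(s) \in \arg\max_{a} \sum_{(p, s', r) \in P[s][a]} p \cdot \big( r + \gamma V(s') \big). $$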

In [None]:
def policy_improvement(P, nS, nA, value_from_policy, policy, gamma=0.9):
    """Given the value function of a policy, improve the policy.

	Parameters
	----------
	P, nS, nA, gamma:
		defined at beginning of file
	value_from_policy: np.ndarray
		The value calculated from the policy
	policy: np.array
		The previous policy.

	Returns
	-------
	new_policy: np.ndarray[nS]
		An array of integers. Each integer is the optimal action to take
		in that state according to the environment dynamics and the
		given value function.
	"""

    new_policy = np.zeros(nS, dtype="int")

    ############################
    # YOUR IMPLEMENTATION HERE #

    ############################
    return new_policy


In [None]:
def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """
	Learn value function and policy by using value iteration method for a given
	gamma and environment.

	Parameters:
	----------
	P, nS, nA, gamma:
		defined at beginning of file
	tol: float
		Terminate value iteration when
			max |value_function(s) - prev_value_function(s)| < tol
	Returns:
	----------
	value_function: np.ndarray[nS]
	policy: np.ndarray[nS]
	"""

    value_function = np.zeros(nS)
    policy = np.zeros(nS, dtype=int)
    ############################
    # YOUR IMPLEMENTATION HERE #

    ############################
    return value_function, policy

We provide you with the following function to evaluate how good your policy is by interacting with the environment!

In [None]:
def evaluate(env, policy, max_steps=100):
    """
    This function does not need to be modified.
    Watch your agent play!

    Parameters
    ----------
    env: gym.core.Environment
        Environment to play on. Must have nS, nA, and P as
        attributes.
    policy: np.array of shape [env.nS]
        The action to take at a given state
    """

    episode_reward = 0
    done = False
    ob, _ = env.reset()
    for t in range(max_steps):
        a = policy[ob]
        # Gymnasium's step returns (obs, reward, terminated, truncated, info)
        ob, rew, terminated, truncated, _ = env.step(a)
        episode_reward += rew
        done = terminated or truncated
        if done:
            break
    if not done:
        print(
            "The agent didn't reach a terminal state in {} steps.".format(
                max_steps
            )
        )
    else:
        print("Episode reward: %f" % episode_reward)


In [None]:
# Run the code below to test your value iteration implementation on Frozen Lake!
# You may change the parameters passed to the functions below
np.set_printoptions(precision=3)

# Make gym environment
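# .unwrapped exposes the raw FrozenLakeEnv, so nrow, ncol and P can be read directly
# instead of through the wrappers added by gym.make (which emits deprecation warnings)
env = gym.make('Deterministic-4x4-FrozenLake-v0').unwrapped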

env.nS = env.nrow * env.ncol
env.nA = 4

print("\n" + "-" * 25 + "\nBeginning Value Iteration\n" + "-" * 25)

V_vi, p_vi = value_iteration(env.P, env.nS, env.nA, gamma=0.9, tol=1e-3)
evaluate(env, p_vi, 100)

## Policy Gradient

In this problem, we implement the `REINFORCE` algorithm to learn the optimal policy in the `CartPole-v1` environment of OpenAI Gym. We will use a neural network to parametrize the policy, and then run policy gradient on it. We provide you with the following evaluation function, which estimates the expected cumulative reward of your current policy (and its standard deviation) by Monte Carlo rollouts. You need to (1) implement REINFORCE with a neural network, and (2) plot the expected reward against the number of training episodes.

In [None]:
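def evaluate_neural(env, policy, max_steps=1000, trials=100):
    """
    This function does not need to be modified.
    Estimates the expected cumulative reward of a policy by Monte Carlo rollouts.

    Parameters
    ----------
    env: gym.core.Environment
    policy: callable (e.g. your PolicyNetwork trained with REINFORCE) that maps
        a state tensor to a vector of action probabilities
    max_steps: int, maximum number of steps per episode, default = 1000
    trials: int, number of trials for Monte Carlo, default = 100

    Returns
    -------
    cum_reward_mean: estimate of the expected cumulative reward
    cum_reward_std: standard deviation of the cumulative reward across trials
    """
    cum_reward = []
    for i in range(trials):
        episode_reward = 0
        ob, _ = env.reset()
        for t in range(max_steps):
            ob = torch.tensor(ob, dtype=torch.float32)
            a = Categorical(policy(ob)).sample().item()
            # Gymnasium's step returns (obs, reward, terminated, truncated, info)
            ob, rew, terminated, truncated, _ = env.step(a)
            episode_reward += rew
            # Stop on failure (terminated) or on the environment's time limit (truncated)
            if terminated or truncated:
                break
        cum_reward.append(episode_reward)
    cum_reward = np.array(cum_reward)
    return cum_reward.mean(), cum_reward.std()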

Now you are ready to implement the algorithm! In the following we define a class `PolicyNetwork` for the agent's policy, which you should implement by a neural network. Recall that the (variance reduced) gradient estimate in `REINFORCE` can be defined by
$$ \sum_{h\geq 0} \nabla_\theta\log \pi_\theta(A_h|S_h) \cdot \sum_{t\geq h}\gamma^t r(S_t, A_t).  $$
You are free to choose the architecture and the training procedure of the neural network, but make sure that it takes the state as input and outputs a distribution over the actions! Once it is implemented and trained, please plot the cumulative reward (with error bars of 2 × standard deviation) against the number of training episodes with `matplotlib.pyplot`, and submit it in the .tex file!
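
To illustrate the inner sum in the gradient estimate above, the sketch below computes $\sum_{t\geq h}\gamma^t r_t$ for every step $h$ of a toy reward sequence with a single backward pass. The function name `discounted_rewards_to_go` and the toy inputs are hypothetical and chosen only for illustration. Note that it follows the $\gamma^t$ convention of the formula above; many implementations instead discount relative to the current step, using $\gamma^{t-h}$.

```python
def discounted_rewards_to_go(rewards, gamma):
    """Return G_h = sum_{t >= h} gamma**t * rewards[t] for every step h."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each G_h reuses the already accumulated G_{h+1}.
    for t in reversed(range(len(rewards))):
        running += (gamma ** t) * rewards[t]
        returns[t] = running
    return returns


toy_rewards = [1.0, 1.0, 1.0]  # CartPole-style reward of 1 per step
print(discounted_rewards_to_go(toy_rewards, gamma=0.5))  # [1.75, 0.75, 0.25]
```

The backward pass keeps the cost linear in the episode length, whereas recomputing each sum from scratch would be quadratic.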

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

# Define the policy network
class PolicyNetwork(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PolicyNetwork, self).__init__()
        ############################
        # YOUR IMPLEMENTATION HERE #
        pass
        ############################

    def forward(self, x):
        ############################
        # YOUR IMPLEMENTATION HERE #
        pass
        ############################

# Initialize environment and optimizer. Feel free to change!
env = gym.make('CartPole-v1')
input_size = 4
output_size = env.action_space.n
hidden_size = 64
policy_net = PolicyNetwork(input_size, hidden_size, output_size)
optimizer = optim.Adam(policy_net.parameters(), lr=0.001)

avg_rewards = []
std_rewards = []

# Training hyperparameters. Feel free to change!
num_episodes = 3000
gamma = 0.99

# Training loop
for episode in range(num_episodes):
    episode_rewards = []
    log_probs = []
    rewards = []

    state,_ = env.reset()
    done = False

    while not done:
      # Sample action from the policy network. For simplicity we only use single trajectory
      # to update. But feel free to change this into batch learning!
      state_tensor = torch.from_numpy(state).float().unsqueeze(0)
      action_probs = policy_net(state_tensor)
      action_dist = Categorical(action_probs)
      action = action_dist.sample()
      log_prob = action_dist.log_prob(action)
      log_probs.append(log_prob)

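      # Take action and observe next state and reward. Gymnasium's step returns
      # (obs, reward, terminated, truncated, info); the episode also ends when
      # CartPole-v1 hits its time limit (truncated), not only on failure.
      next_state, reward, terminated, truncated, _ = env.step(action.item())
      done = terminated or truncated
      episode_rewards.append(reward)
      state = next_state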


    # Now use the episode_rewards list and the log_probs list to construct a loss function, on which you can
    # backpropagate and optimize with optimizer. First, compute the discounted cumulative reward
    # for every step in the trajectory.
    discounted_reward = []
    ############################
    # YOUR IMPLEMENTATION HERE #
    pass
    ############################
    discounted_rewards = torch.tensor(discounted_reward, dtype = torch.float32)

    # Now compute the policy loss from discounted_rewards and log_probs. Use this loss to run policy gradient.
    ############################
    # YOUR IMPLEMENTATION HERE #
    pass
    ############################

    # Record the cumulative reward and its deviation once every 100 episodes
    if (episode + 1) % 100 == 0:
      avg_reward, std_reward = evaluate_neural(env, policy_net)
      print(f'Episode [{episode + 1}/{num_episodes}], Cumulative Reward: {avg_reward}')
      avg_rewards.append(avg_reward)
      std_rewards.append(std_reward)

In [None]:
from matplotlib import pyplot as plt
# Plot the curve of cumulative reward vs. number of episodes, with error bars = 2 * std_reward
############################
# YOUR IMPLEMENTATION HERE #
pass
############################


## Optional: Visualizing Your Learned Policy Within the Game
Want to see how well your agent performs in Frozen Lake / CartPole-v1? Try to visualize the policies you learned with both value iteration and policy gradient! If you are not familiar with environment visualization, we recommend that you check the codebase in HW2!

In [None]:
############################
# YOUR IMPLEMENTATION HERE #

############################

