# Lab 1: Problem 2 (TD-learning with policy improvement)

*OpenAI gym FrozenLake environment*

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend. The surface is described using a grid like the following:

        SFFF
        FHFH
        FFFH
        HFFG

    S : starting point, safe
    F : frozen surface, safe
    H : hole, fall to your doom
    G : goal, where the frisbee is located

    The episode ends when you reach the goal or fall in a hole.
    You receive a reward of 1 if you reach the goal, and zero otherwise.
    
    
FrozenLake-v1 defines "solving" as achieving an average reward of 0.78 over 100 consecutive trials.
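
The "solving" criterion can be checked offline from a list of per-episode rewards. The following is a minimal sketch, assuming the rewards are collected in a plain Python list; `best_window_mean` and `is_solved` are hypothetical helper names and are not part of Gymnasium or the lab's code.

```python
import numpy as np


def best_window_mean(rewards, window=100):
    """Return the highest mean reward over any `window` consecutive episodes.

    Returns None when fewer than `window` episodes are available.
    """
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        return None
    # Cumulative sums give every window sum in O(n) time.
    csum = np.concatenate(([0.0], np.cumsum(rewards)))
    window_means = (csum[window:] - csum[:-window]) / window
    return float(window_means.max())


def is_solved(rewards, threshold=0.78, window=100):
    """True if some run of `window` consecutive episodes averages >= threshold."""
    best = best_window_mean(rewards, window)
    return best is not None and best >= threshold
```

For example, `is_solved([1] * 80 + [0] * 20)` is true because the single 100-episode window averages 0.80, while 70 successes out of 100 would fall short.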

More documentation: https://www.gymlibrary.dev/environments/toy_text/frozen_lake/


In [1]:
%reload_ext autoreload
%autoreload 2

In [2]:
from kret_studies import *
from kret_studies.notebook import *
from kret_studies.complex import *

logger = get_notebook_logger()

Loaded environment variables from /Users/Akseldkw/Desktop/Columbia/ORCS4529/.env.
/Users/Akseldkw/coding/kretsinger/data/nb_log.log


In [3]:
env_vars = os.environ.copy()
wand_db_dir = env_vars["OUTPUT_DIR"] + "/wandb"

In [4]:
# wandb setup for logging runs online and moving them to the leaderboard
# create a wandb account when prompted, or simply sign in if you already have one
# !pip install wandb -qqq
import wandb

wandb.login()
run = wandb.init(dir=wand_db_dir)

[34m[1mwandb[0m: Currently logged in as: [33makseldkw[0m ([33makseldkw07[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [5]:
## DO NOT CHANGE THIS CELL
import numpy as np
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv

env = typing.cast(
    FrozenLakeEnv, gym.make("FrozenLake-v1", is_slippery=True, success_rate=0.85)
)
# env.seed(0)

To account for rewards properly while you learn, we build a wrapper around env.step() and env.reset(). Within an episode, every time you take an action, its reward is added to the episode's running reward; whenever the environment is reset (at the end of an episode), the episode reward is recorded and reset to 0.
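
The same bookkeeping can also be kept in a small object instead of module-level globals, which avoids the need for `global` declarations. The following is a minimal sketch and not the lab's provided wrapper; `RewardTracker` and its method names are hypothetical.

```python
from collections import deque


class RewardTracker:
    """Accumulate per-episode reward and a moving average over recent episodes."""

    def __init__(self, window=100):
        self.window = window
        self.episode_reward = 0.0
        self.recent = deque(maxlen=window)  # rewards of the last `window` episodes
        self.moving_average = 0.0

    def on_step(self, reward):
        # Called after every env.step(): add the step reward to the running total.
        self.episode_reward += reward

    def on_reset(self):
        # Called on every env.reset(): record the finished episode, then start over.
        self.recent.append(self.episode_reward)
        if len(self.recent) == self.window:
            self.moving_average = sum(self.recent) / self.window
        self.episode_reward = 0.0
```

Note that this sketch averages over exactly `window` episodes, so its numbers can differ slightly from those produced by the wrapper cell that follows.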

In [6]:
from functools import wraps

## DO NOT CHANGE THIS CELL
# wrapper for accounting rewards
rEpisode = 0
rList = []
fixedWindow = 100
movingAverage = 0


def reset_decorate(func):
    @wraps(func)
    def func_wrapper(*args, **kwargs):
        global rList
        global movingAverage
        global rEpisode
        global fixedWindow
        rList.append(rEpisode)
        if len(rList) >= fixedWindow:
            movingAverage = np.mean(rList[len(rList) - fixedWindow : len(rList) - 1])
        rEpisode = 0
        return func(*args, **kwargs)

    return func_wrapper


env.reset = reset_decorate(env.reset)


def step_decorate(func):
    @wraps(func)
    def func_wrapper(*args, **kwargs):
        global rEpisode
        # Call the original step function and unpack the result
        result = func(*args, **kwargs)
        # Gymnasium's step() returns a 5-tuple; anything else is rejected below
        if len(result) == 5:
            s1, r, d, other, info = result
            rEpisode += r
            return (s1, r, d, other, info)
        else:
            raise ValueError("Unexpected number of return values from env.step")

    return func_wrapper


env.step = step_decorate(env.step)


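def init():
    # Reset the module-level accounting state. Without the `global`
    # declaration, the assignments below would only create unused locals.
    global rEpisode, rList, movingAverage
    rEpisode = 0
    rList = []
    movingAverage = 0
    return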

Below we illustrate running the OpenAI Gym environment under a policy that chooses a random action in every state. Every time an action is taken, the environment returns a tuple containing the next state, the reward, the termination and truncation flags (whether the episode has ended), and an info dictionary.

In [7]:
### RANDOM SAMPLING EXAMPLE
# number of episodes you want to try
num_episodes = 1000
# you can explicitly end the episode before a terminal state is reached
episode_max_length = 100

env.reset()
# env.render()
# execute in episodes
for i in range(num_episodes):

    # reset the environment at the beginning of an episode
    s, _ = env.reset()  # reset() returns (observation, info)
    d = False  # not done

    for t in range(episode_max_length):

        ################ Random action policy ###########################
        # play random action
        a = env.action_space.sample()
        # get new state, reward, done
        s, r, d, _, info = env.step(a)
        #################################################################

        # break if done, reached terminal state
        if d == True:
            break

    # log per-episode reward and moving average over 100 episodes
    wandb.log(
        {
            "random reward": rEpisode,
            "random reward moving average": movingAverage,
            "random episode": i,
        }
    )

Implement tabular TD-learning with policy improvement (*YOU SHOULD ONLY CHANGE THE CELL BELOW*)
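
For reference, the Q update in the training cell below corresponds to the standard one-step tabular TD (Q-learning) rule. The notation here is introduced for illustration: $\alpha$ is the learning rate (`learnRate_decay(i)` in the code), $\gamma$ is the discount factor (`gamma`), and $(s, a, r, s')$ is one observed transition.

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
$$

One common choice for the policy improvement step, assumed here rather than prescribed by the lab, is to act greedily with respect to the current Q values:

$$
\pi(s) \leftarrow \arg\max_{a} Q(s,a)
$$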

In [8]:
grid_env = typing.cast(FrozenLakeEnv, env.unwrapped)
env.observation_space = typing.cast(gym.spaces.Discrete, env.observation_space)
env.action_space = typing.cast(gym.spaces.Discrete, env.action_space)


def check_valid(state: int, action: int) -> bool:
    """
    Check if the action is valid for the given state.
    """
    state_shift = {0: -1, 1: 4, 2: 1, 3: -4}
    next_state = state + state_shift[action]
    # Check for invalid moves: out of bounds or wrapping around edges
    ncols = grid_env.ncol
    nrows = grid_env.nrow

    # Compute current row and col
    row, col = divmod(state, ncols)

    # For left, can't move if col == 0
    if action == 0 and col == 0:
        return False
    # For right, can't move if col == ncols - 1
    elif action == 2 and col == ncols - 1:
        return False
    # For up, can't move if row == 0
    elif action == 3 and row == 0:
        return False
    # For down, can't move if row == nrows - 1
    elif action == 1 and row == nrows - 1:
        return False
    # Out of bounds
    elif next_state < 0 or next_state >= env.observation_space.n:
        return False
    return True
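
One way a validity check like this can be used is to restrict action selection to moves that stay on the grid. The following is a minimal self-contained sketch, assuming a 4x4 grid and the action encoding 0=left, 1=down, 2=right, 3=up used above; `valid_mask` and `masked_epsilon_greedy` are hypothetical helpers, not part of the lab's code.

```python
import numpy as np

NROWS, NCOLS = 4, 4  # assumed grid size (the 4x4 map shown at the top)


def valid_mask(state):
    """Boolean mask over actions [left, down, right, up] that stay on the grid."""
    row, col = divmod(state, NCOLS)
    return np.array([col > 0, row < NROWS - 1, col < NCOLS - 1, row > 0])


def masked_epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random valid action with probability epsilon, else the best valid one."""
    mask = valid_mask(state)
    if rng.random() < epsilon:
        return int(rng.choice(np.flatnonzero(mask)))
    # Invalid actions get -inf so argmax can never select them.
    action_values = np.where(mask, Q[state], -np.inf)
    return int(np.argmax(action_values))


rng = np.random.default_rng(0)
Q = np.zeros((NROWS * NCOLS, 4))
Q[0] = [5.0, 1.0, 2.0, 9.0]  # left and up look best but are invalid in state 0
print(masked_epsilon_greedy(Q, 0, epsilon=0.0, rng=rng))  # → 2
```

Masking is optional in FrozenLake, since a move into the boundary leaves the agent in place; it only removes actions that can never make progress in the intended direction.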

In [9]:
"""I'm moving this into its own cell so as to avoid resetting Q and n every time I re-run the training cell."""

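# create Q table

Q = np.zeros([env.observation_space.n, env.action_space.n])  # matrix Q[s,a]
# create visit-count table, incremented in the training loop
n = np.zeros_like(Q, dtype=int)  # matrix n[s,a]
# create policy
pi = np.random.randint(
    low=env.action_space.n, size=env.observation_space.n
)  # array pi[s]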

In [10]:
dtt(Q, pi, names=["Q", "pi"])

state,0,1,2,3
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0

state,pi
0,3
1,0
2,0
3,0
4,1
5,0
6,2
7,1
8,1
9,2


In [None]:
# initialize episodic structure
init()

# initialize number of episodes and maximum episode length
num_episodes = 1000
episode_max_length = 100

# initialize discount factor, learning rate
gamma = 0.95
learnRate = 0.80


def print_condition(s: int):
    return False


def learnRate_decay(
    episode: int, initial_lr: float = 0.8, decay_rate: float = 0.99, min_lr: float = 0.1
) -> float:
    """
    Decay the learning rate over episodes.
    """
    return max(min_lr, initial_lr * (decay_rate**episode))


# SET POLICY ITERATION k=1,2...

# RESET Q values

# execute in episodes
for i in tqdm.tqdm(range(num_episodes)):

    # reset the environment at the beginning of an episode
    reset_state = env.reset()
    s = reset_state[0]
    terminated = False  # not done

    for t in range(episode_max_length):

        ########### SELECT ACTION a for state s using current policy ##################
        # example
        # a = int(pi[s]) #slightly different for randomized policy
        a = (
            env.action_space.sample()
        )  # currently selecting a random action, change this

        if print_condition(s):
            print(f"State {s} {a=}")

        # get new state, reward, done
        state, reward, terminated, truncated, info = env.step(a)

        ##### update Q(s,a) ############
        next_max = float(np.max(Q[state, :]))
        td = float(reward) + gamma * next_max - Q[s, a]
        if print_condition(s):
            print(f"{s=} {a=} {state=} {reward=} {next_max=} {td=}")
        Q[s, a] = Q[s, a] + learnRate_decay(i) * td
        n[s, a] += 1

        # break if done: terminal state reached or time limit hit
        if terminated or truncated:
            # print(f"Terminated {i=} {t=}")
            break

        if print_condition(s):
            print("\n")

        s = state

    # log per-episode reward and moving average over 100 episodes
    wandb.log(
        {
            "training reward": rEpisode,
            "training reward moving average": movingAverage,
            "training episode": i,
        }
    )

wandb.run = typing.cast(wandb.Run, wandb.run)
wandb.run.summary["number of training episodes"] = num_episodes

#### improve policy pi
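
The `#### improve policy pi` placeholder can be filled in several ways. The following is a minimal sketch of greedy improvement, assuming `Q` is a (states x actions) NumPy array like the table above; `greedy_policy` is a hypothetical helper, and the random tie-breaking is a design choice of this sketch rather than something the lab prescribes.

```python
import numpy as np


def greedy_policy(Q, rng=None):
    """Return pi[s] = argmax_a Q[s, a], breaking ties uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    pi = np.empty(Q.shape[0], dtype=int)
    for s, row in enumerate(Q):
        best = np.flatnonzero(row == row.max())  # all actions tied for the max
        pi[s] = rng.choice(best)
    return pi


# Toy example with 3 states and 2 actions.
Q_demo = np.array([[0.1, 0.7], [0.4, 0.2], [0.0, 0.0]])
pi_demo = greedy_policy(Q_demo, np.random.default_rng(0))
print(pi_demo[:2])  # → [1 0]; the third state is a tie, so its action is random
```

In the notebook this would amount to something like `pi = greedy_policy(Q)` after the training loop. Random tie-breaking matters for states that were never updated: their rows are all zeros, and a plain `np.argmax` would always return action 0 there.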

In [None]:
%%wandb
## DO NOT CHANGE THIS CELL. CHANGING ANY PART OF THIS CELL CAN DISQUALIFY THE SUBMISSION
#Evaluation of trained policy
init()
num_episodes=1000; #number of episodes for evaluation
episode_max_length=100
movingAverageArray=[]
score=0
env.reset()
for i in range(num_episodes):
    s = env.reset()
    d = False #not done
    for t in range(episode_max_length):
        a = int(pi[s])
        s, r, d, _ = env.step(a)
        if d == True:
            break
    #log per-episode reward and moving average over 100 episodes
    wandb.log({ "evaluation reward" : rEpisode, "evaluation reward moving average" : movingAverage, "evaluation episode" : i})
    movingAverageArray.append(movingAverage)
    #score is x if there is a window of 100 consecutive episodes where moving average was at least x
    if i>100:
        score=max(score,min(movingAverageArray[i-100:i-1]))

wandb.run.summary["score"]=score

In [None]:
run.finish()


metric,history
evaluation episode,▁▁▁▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▄▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇██
evaluation reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
evaluation reward moving average,▃▃▃▃▆▆▆▆▆▃▁▁▁▃▃▁▃▆▆▆▆▆▆▁▁▁▁▁▃▃▃▃▃▃▃▃▆██▆
random episode,▁▁▁▁▂▂▂▂▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇████
random reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁
random reward moving average,▁▁▁▁▄▄▄▄▄▄▄▄▄▄▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▇▇▇▇█▇▇▅▅▁
training episode,▁▁▁▁▁▂▃▃▄▄▄▄▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇███
training reward,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
training reward moving average,▁▁▃▃▃▃▆▃▃▃█▆▆▆▆▁▁▁▁▃▃▃▃▁▃▆▆▆▆█▆▆▆█▆▆▆▆▆▃

metric,value
evaluation episode,999.0
evaluation reward,1.0
evaluation reward moving average,0.0101
number of training episodes,1000.0
random episode,999.0
random reward,0.0
random reward moving average,0.0
score,0.0202
training episode,999.0
training reward,0.0
