# Reinforcement Learning - Monte Carlo and TD learning

> In this notebook, we implement Monte Carlo control and two temporal-difference learning algorithms, Q-learning and SARSA. We test our implementations on the FrozenLake environment and on some custom variants we made. Finally, we compare the results and try to draw conclusions.

## Importing libs

In [108]:
import gymnasium as gym
from gymnasium.envs.registration import register
from typing import Tuple, List
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv
from gymnasium import spaces
from tqdm import tqdm
import numpy as np
import time
import random
import pprint

## Timer decorator

In [109]:
import functools

def timer(func):
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        duration = end - start
        if args and hasattr(args[0], '__dict__'):
            setattr(args[0], f'{func.__name__}_time', duration)
        print(f"Function '{func.__name__}' took {duration:.4f} seconds")
        return result
    return wrapper

## Cliff Walking

In [110]:
class CliffWalking:
    def __init__(self, rows: int = 4, cols: int = 12):
        self.rows = rows
        self.cols = cols
        self.start = (rows - 1, 0)
        self.goal = (rows - 1, cols - 1)
        self.cliff = {(rows-1,c) for c in range(1,cols-1)}

        self.actions = {
            0: (-1, 0),  # up
            1: ( 0, 1),  # right
            2: ( 1, 0),  # down
            3: ( 0,-1),  # left
        }

        self.n_actions = len(self.actions)
        self.n_states = rows * cols
        self.reset()

    def state_to_index(self, pos: Tuple[int,int]) -> int:
        r, c = pos
        return r * self.cols + c

    def index_to_state(self, idx: int) -> Tuple[int,int]:
        return divmod(idx, self.cols)

    def reset(self):
        self.agent_pos = self.start
        return self.state_to_index(self.agent_pos), {}

    def step(self, action: int) -> Tuple[int, float, bool, bool, dict]:
        if action not in self.actions:
            raise ValueError(f"Invalid action {action}")

        truncated = False

        dr, dc = self.actions[action]
        r, c = self.agent_pos
        new_r = min(max(r + dr, 0), self.rows - 1)
        new_c = min(max(c + dc, 0), self.cols - 1)
        new_pos = (new_r, new_c)

        if new_pos in self.cliff:
            reward = -100.0
            self.agent_pos = self.start
            done = False
        elif new_pos == self.goal:
            reward = -1.0
            self.agent_pos = new_pos
            done = True
        else:
            reward = -1.0
            self.agent_pos = new_pos
            done = False

        next_state = self.state_to_index(self.agent_pos)
        return next_state, reward, done, truncated, {}
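
The state encoding used by `CliffWalking` is a row-major flattening of the grid. As a minimal standalone sketch (the free functions below are illustrative stand-ins for the class methods, and the default width `cols=12` is assumed), the two conversions are inverses of each other:

```python
from typing import Tuple

COLS = 12  # assumed: default grid width of CliffWalking

def state_to_index(pos: Tuple[int, int], cols: int = COLS) -> int:
    """Flatten a (row, col) position into a single integer, row by row."""
    r, c = pos
    return r * cols + c

def index_to_state(idx: int, cols: int = COLS) -> Tuple[int, int]:
    """Recover (row, col) from a flat index; divmod gives (quotient, remainder)."""
    return divmod(idx, cols)

start = (3, 0)   # bottom-left corner of the default 4x12 grid
goal = (3, 11)   # bottom-right corner
print(state_to_index(start), state_to_index(goal))  # 36 47
print(index_to_state(36))                           # (3, 0)
```

With this encoding, each row of the Q-table corresponds to one grid cell, so `Q[36]` holds the action values of the start cell.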

## Monte Carlo Implementation

In [111]:
@timer
def monte_carlo(env, episodes=10000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    Monte Carlo control using first-visit method and epsilon-greedy policy.
    Returns Q table of state-action values.
    """
    n_actions = env.n_actions
    n_states = env.n_states


    Q = np.zeros((n_states, n_actions)) #q value function initialization

    for ep in range(episodes):
        state, _ = env.reset()
        done = False
        episode = []

        # generating an episode
        while not done:
            #exploration
            if random.random() < epsilon:
                action = np.random.choice(n_actions) # choose any action randomly with a probability of epsilon
            else:
                best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten() # list of all actions with the max Q-value
                action = int(np.random.choice(best_actions)) # break ties by choosing one of them at random

            # perform the chosen action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state,action,reward))
            state = next_state

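        # Record the first time step at which each (state, action) pair occurs
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in range(len(episode)-1,-1,-1): # compute return backwards
            s, a, r = episode[t]
            G = discount*G + r # since we are traversing backwards we can discount without keeping track of the number of terms
            if first_visit[(s, a)] == t: # first-visit: update only at the earliest occurrence of (s, a)
                # Incremental update
                Q[s, a] += alpha * (G - Q[s, a])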


    return Q
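
The backward pass in `monte_carlo` relies on the recursion $G_t = r_{t+1} + \gamma G_{t+1}$. A small sketch, using a made-up three-step reward sequence and helper names introduced only for illustration, suggests how one backward sweep yields the same returns as summing discounted rewards forward from each step:

```python
def returns_backward(rewards, discount):
    """Compute the return from every time step with one backward sweep."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in range(len(rewards) - 1, -1, -1):
        G = discount * G + rewards[t]
        out[t] = G
    return out

def returns_forward(rewards, discount):
    """Reference version: explicitly sum discounted rewards from each step."""
    return [
        sum(discount ** k * r for k, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ]

rewards = [-1.0, -1.0, -1.0]  # hypothetical episode: three steps costing -1 each
print(returns_backward(rewards, 0.5))  # [-1.75, -1.5, -1.0]
print(returns_forward(rewards, 0.5))   # [-1.75, -1.5, -1.0]
```

The backward version needs a single pass over the episode, whereas the forward version recomputes a sum for every time step.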



## Temporal Difference Implementation

A generic TD update for $Q(s_t, a_t)$ takes the form:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\text{Target} - Q(s_t, a_t)\right],
$$

where $\alpha$ is the step size (learning rate) and Target is a one-step estimate of the return: the immediate reward plus a discounted estimate of the value from the next state onward.

- In **SARSA**, the target is:

$$
r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}),
$$

using the next action $a_{t+1}$ actually chosen by the current policy (**on-policy**).

- In **Q-learning**, the target is:

$$
r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'),
$$

using the best possible next action according to the current $Q$ (**off-policy**, because it assumes the greedy policy is followed from the next state even if the behavior policy actually explores).

Summing up:

- **SARSA update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right].
$$

- **Q-learning update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].
$$
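
To make the difference between the two targets concrete, here is a small sketch on a single made-up transition. The Q-values, the reward, and the chosen next action are illustrative assumptions, not values taken from the experiments in this notebook:

```python
import numpy as np

alpha, gamma = 0.1, 0.99

# Hypothetical Q-table with 2 states and 2 actions
Q = np.array([[0.0, 0.0],
              [1.0, 5.0]])

s, a, r, s_next = 0, 1, -1.0, 1
a_next = 0  # suppose the epsilon-greedy policy explored and picked action 0

sarsa_target = r + gamma * Q[s_next, a_next]        # uses the action actually chosen
q_learning_target = r + gamma * np.max(Q[s_next])   # uses the greedy action

sarsa_new = Q[s, a] + alpha * (sarsa_target - Q[s, a])
q_learning_new = Q[s, a] + alpha * (q_learning_target - Q[s, a])

print("SARSA target:", sarsa_target, "-> new Q:", sarsa_new)
print("Q-learning target:", q_learning_target, "-> new Q:", q_learning_new)
```

Because the exploratory next action has a lower value than the greedy one here, the SARSA target is smaller than the Q-learning target. The two targets coincide whenever the next action happens to be greedy.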

In both cases, during learning we select actions via an $\epsilon$-greedy policy over the current $Q$: with probability $\epsilon$ choose a uniformly random action, otherwise choose:

$$
\arg\max_a Q(s, a).
$$

This ensures exploration.
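
The selection rule can be sketched as a standalone helper. The function name `epsilon_greedy` and the seeded `np.random.default_rng` generator are assumptions made for this illustration; the cells in this notebook inline the same logic using `random` and `np.random.choice` instead:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of a Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform random action
    best_actions = np.flatnonzero(q_row == np.max(q_row))
    return int(rng.choice(best_actions))       # exploit: random among tied maxima

rng = np.random.default_rng(0)  # seeded generator, an assumption for reproducibility
q_row = np.array([0.0, 2.0, 2.0, -1.0])

greedy_picks = {epsilon_greedy(q_row, 0.0, rng) for _ in range(100)}
print(greedy_picks)  # only actions 1 and 2 can appear when epsilon is 0
```

Breaking ties at random matters early in training, when the Q-table is all zeros and a plain `argmax` would always return action 0.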


In [112]:
# -------------------------------------------------
# Q-LEARNING (Off-Policy Temporal-Difference)
# -------------------------------------------------
@timer
def q_learning(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    Q-Learning algorithm with epsilon-greedy exploration.
    Returns Q value function
    """
    n_actions = env.n_actions
    n_states = env.n_states
    Q = np.zeros((n_states, n_actions)) #init q value functions for all pairs

    for ep in range(episodes):
        state, _ = env.reset()
        done = False

        
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = np.random.choice(n_actions)
            else:
                best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten()
                action = int(np.random.choice(best_actions))
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-Learning update (off-policy)
            # Bootstrap from the best Q-value of the next state, irrespective of the action the behavior policy takes next (off-policy)
            best_next = 0 if done else np.max(Q[next_state])
            Q[state, action] += alpha * (reward + discount * best_next - Q[state, action])
            
            state = next_state
    return Q
    

In [113]:
# -------------------------------------------------
# SARSA (On-Policy Temporal-Difference)
# -------------------------------------------------
@timer
def sarsa(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    SARSA algorithm (on-policy TD control) with epsilon-greedy policy.
    Returns Q table of state-action values.
    """
    n_actions = env.n_actions
    n_states = env.n_states
    Q = np.zeros((n_states, n_actions)) #init q values for all state action pairs

    for ep in range(episodes):
        state, _ = env.reset()

        # Choose initial action (epsilon strategy)
        if random.random() < epsilon:
            action = np.random.choice(n_actions)
        else:
            best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten()
            action = int(np.random.choice(best_actions))

        
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Choose next action (epsilon-greedy)
            if random.random() < epsilon:
                next_action = np.random.choice(n_actions)
            else:
                best_actions = np.argwhere(Q[next_state] == np.max(Q[next_state])).flatten()
                next_action = int(np.random.choice(best_actions))
            
            # SARSA update (on-policy)
            Q[state, action] += alpha * (reward + discount * Q[next_state, next_action] * (not done) - Q[state, action])
            state, action = next_state, next_action
            
    return Q

## Policy Evaluation

We test the greedy policy derived from each Q-table on the environment, printing the running time of each algorithm and the average return obtained.

In [114]:
@timer
def evaluate_policy(env, Q, episodes=100, discount=1.0):
    """
    Evaluate a given policy derived from Q (greedy) by running episodes.
    Returns the average total (discounted) return.
    """
    total_return = 0.0
    for ep in range(episodes):
        state, _ = env.reset()
        done = False
        G = 0.0
        t = 0
        while not done:
            # Greedy action
            best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten()
            action = int(np.random.choice(best_actions))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            G += (discount**t) * reward
            t += 1 # keeping track of the power to raise discount with
            state = next_state
        total_return += G
    avg_return = total_return / episodes
    return avg_return


## Results

In [115]:
env = CliffWalking()
# Train and evaluate each algorithm
mc_Q = monte_carlo(env, episodes=50000, alpha=0.03, discount=0.99, epsilon=0.1)
ql_Q = q_learning(env, episodes=50000, alpha=0.1, discount=0.99, epsilon=0.1)
sa_Q = sarsa(env, episodes=50000, alpha=0.1, discount=0.99, epsilon=0.1)

print("Average Return (MC):", evaluate_policy(env, mc_Q, episodes=1000, discount=0.99))
print("Average Return (Q-Learning):", evaluate_policy(env, ql_Q, episodes=1000, discount=0.99))
print("Average Return (SARSA):", evaluate_policy(env, sa_Q, episodes=1000, discount=0.99))
print()


Function 'monte_carlo' took 31.3385 seconds
Function 'q_learning' took 19.7461 seconds
Function 'sarsa' took 19.9301 seconds
Function 'evaluate_policy' took 0.3606 seconds
Average Return (MC): -15.705680661607396
Function 'evaluate_policy' took 0.2721 seconds
Average Return (Q-Learning): -12.247897700102984
Function 'evaluate_policy' took 0.3668 seconds
Average Return (SARSA): -15.705680661607396
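
These averages can be sanity-checked by hand. Every step that avoids the cliff costs -1, so a greedy policy that reaches the goal in $n$ steps without falling earns $-\sum_{t=0}^{n-1} \gamma^t$. The sketch below (the helper `path_return` is introduced here for illustration) suggests that the Q-learning figure corresponds to a 13-step route, the shortest one, which runs along the cliff edge, while the MC and SARSA figures correspond to a 17-step route, such as a detour along the top row:

```python
def path_return(n_steps, discount=0.99, step_reward=-1.0):
    """Discounted return of a path with a constant reward on every step."""
    return sum(discount ** t * step_reward for t in range(n_steps))

print(round(path_return(13), 4))  # -12.2479
print(round(path_return(17), 4))  # -15.7057
```

This matches the usual picture for cliff walking: Q-learning learns the values of the greedy policy and so prefers the shortest route, while an on-policy method that accounts for $\epsilon$-greedy exploration tends to settle on a route farther from the cliff.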

