# Reinforcement Learning - Monte Carlo and TD learning

> In this notebook, we implement Monte Carlo control and two temporal-difference (TD) learning algorithms, Q-learning and SARSA. We test our implementations on the FrozenLake environment and on some custom variants of it, then compare the results and try to draw conclusions.

## Importing libs

In [2]:
import gymnasium as gym
from gymnasium.envs.registration import register
from gymnasium.envs.toy_text.frozen_lake import FrozenLakeEnv
from gymnasium import spaces
from tqdm import tqdm
import numpy as np
import time
import random
import matplotlib.pyplot as plt
import pprint

## Timer decorator

In [3]:
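import functools

def timer(func):
    @functools.wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        duration = end - start
        # If the first positional argument is an object, also store the duration
        # on it as an attribute named '<function name>_time'
        if args and hasattr(args[0], '__dict__'):
            setattr(args[0], f'{func.__name__}_time', duration)
        print(f"Function '{func.__name__}' took {duration:.4f} seconds")
        return result
    return wrapper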

## Custom Environment



> We create new custom environments based on FrozenLake, with some minor tweaks, and test our algorithms on them. These classes have methods similar to those of the standard environments in the gymnasium package.

To use a large custom map in a FrozenLake-style environment, we need to generate a map with a guaranteed path from the start position to the goal, so that there is something for the agent to learn.
At each cell we list the possible moves (down and right, where the grid allows) and choose one at random. This generates a monotonic path. The cells on the path and their immediate neighbours form a safe corridor, and every other cell is later turned into a hole with the given probability.

We could have considered all four directions when listing the choices, which would make the path non-monotonic. This would still work, because a random walk on a finite grid eventually reaches every cell. However, it is less desirable: the path would be much longer, and there might be so many frozen cells that the task becomes too easy for the agent.

In [4]:
def generate_frozenlake_desc(size, hole_prob=0.2, seed=None):
    rng = random.Random(seed)
    desc = [['F'] * size for _ in range(size)]
    desc[0][0] = 'S'
    desc[size-1][size-1] = 'G'
    
    # Generate path with 3-cell wide corridor
    path = set()
    r, c = 0, 0
    while (r, c) != (size-1, size-1):
        path.add((r, c))
        # Add neighboring cells to create width
        for dr, dc in [(0,1), (1,0), (0,-1), (-1,0)]:
            nr, nc = r+dr, c+dc
            if 0 <= nr < size and 0 <= nc < size:
                path.add((nr, nc))
                
        choices = []
        if r < size-1:
            choices.append((r+1, c))
        if c < size-1:
            choices.append((r, c+1))
        r, c = rng.choice(choices)
    
    # Set holes only outside corridor
    hole_pos = []
    for r in range(size):
        for c in range(size):
            if (r, c) not in path and rng.random() < hole_prob:
                desc[r][c] = 'H'
                hole_pos.append((r,c))
    return desc, hole_pos
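
A generated map can be sanity-checked with a breadth-first search from the start to the goal. The sketch below is not part of the original notebook: `has_path` is a hypothetical helper, and it assumes the same cell letters as above ('S', 'F', 'H', 'G') and four-directional moves.

```python
from collections import deque

def has_path(desc):
    """Return True if 'G' is reachable from 'S' without stepping on 'H'."""
    n_rows, n_cols = len(desc), len(desc[0])
    # Locate the start cell
    start = next((r, c) for r in range(n_rows) for c in range(n_cols)
                 if desc[r][c] == 'S')
    queue = deque([start])
    seen = {start}
    while queue:
        r, c = queue.popleft()
        if desc[r][c] == 'G':
            return True
        # Left, down, right, up: the same four moves the environments use
        for dr, dc in [(0, -1), (1, 0), (0, 1), (-1, 0)]:
            nr, nc = r + dr, c + dc
            if (0 <= nr < n_rows and 0 <= nc < n_cols
                    and (nr, nc) not in seen and desc[nr][nc] != 'H'):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

open_map = ["SFF", "FHF", "FFG"]
blocked_map = ["SHF", "HHF", "FFG"]
print(has_path(open_map))     # True
print(has_path(blocked_map))  # False
```

The function accepts either a list of strings or the list of lists returned by generate_frozenlake_desc, since both support desc[r][c] indexing.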

Now that we can easily generate the desc of a map, we can proceed with our custom environment.

A typical environment has some essential features:
- An action space and an observation space: usually defined using the spaces module from gymnasium
- reset and step methods: for returning observations, reporting rewards, and applying actions
- An attribute P giving the transition probabilities of the model (not required here, since we are doing model-free learning, but it is generally included)

In [5]:
class CustomFrozenLakeEnv(gym.Env):
    def __init__(self, slippery=False):

        # essentials
        self.size = 50
        self.n_states = self.size * self.size # number of cells
        self.n_actions = 4
        self.slippery = slippery
        self.hole_pos = []
        
        # Define a custom 50x50 map
        dsc, hp = generate_frozenlake_desc(self.size, hole_prob=0.1, seed=42)
        self.desc = np.array(dsc, dtype='<U1')
        self.hole_pos = hp
        self.hole_pos = set(self.hole_pos)
        # Calculate positions
        self.start_pos = (0,0)
        self.goal_pos = (self.size-1, self.size-1)
        # each value is (row delta, column delta)
        self.actions = {
            0: (0, -1),    # Left
            1: (1, 0),     # Down
            2: (0, 1),     # Right
            3: (-1, 0)     # Up
        }
        
        # Define spaces and state
        self.action_space = spaces.Discrete(self.n_actions)
        self.observation_space = spaces.Discrete(self.n_states)
        self.state = None # this will be storing the current state of the agent

    def reset(self, seed=None, **kwargs):
        super().reset(seed=seed, **kwargs)
        self.state = self.pos_to_state(self.start_pos) #what's the state of the start position
        return self.state, {} # we ain't sending any info

    def step(self, action):
        current_pos = self.state_to_pos(self.state) # we are only storing state not position so this roundabout
        row, col = current_pos
        
        # if it is slippery then some stochastic flavour
        if self.slippery:
            action = self.np_random.choice([action, (action + 1) % 4, (action - 1) % 4])
        

        # calculus :)
        dr, dc = self.actions[action]
        new_row, new_col = row + dr, col + dc
        
        # Ensure within bounds
        new_row = np.clip(new_row, 0, self.size - 1)
        new_col = np.clip(new_col, 0, self.size - 1)
        new_pos = (new_row, new_col)
        new_state = self.pos_to_state(new_pos)

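        # Manhattan distances to the goal before and after the move.
        # Currently unused: left over from the distance-based reward experiment.
        old_dist = abs(current_pos[0]-self.size+1) + abs(current_pos[1]-self.size+1)
        new_dist = abs(new_row-self.size+1) + abs(new_col-self.size+1)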
        
        # Check for hole or goal
        terminated = False
        reward = 0.0
        if new_pos in self.hole_pos:
            terminated = True
            reward = -1.0
        elif new_pos == self.goal_pos:
            terminated = True
            reward = 1.0
        
        self.state = new_state #change state
        return new_state, reward, terminated, False, {}

    # easy peasy
    def pos_to_state(self, pos):
        row, col = pos
        return row * self.size + col

    def state_to_pos(self, state):
        row = state // self.size
        col = state % self.size
        return (row, col)
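
The two helpers above flatten a (row, column) position into a single integer in row-major order and back. A minimal standalone sketch of the same mapping, assuming a 50×50 grid (the module-level functions and the SIZE constant are illustrative, not part of the class):

```python
SIZE = 50  # grid side length, matching self.size in the class above

def pos_to_state(pos, size=SIZE):
    """Row-major index of a (row, col) position."""
    row, col = pos
    return row * size + col

def state_to_pos(state, size=SIZE):
    """Inverse mapping: quotient is the row, remainder is the column."""
    return divmod(state, size)

print(pos_to_state((0, 0)))    # 0
print(pos_to_state((49, 49)))  # 2499
print(state_to_pos(2499))      # (49, 49)
```

Every position maps to a distinct state in the range 0 to 2499, so the observation space can be a single Discrete(2500).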

This is another custom environment similar to FrozenLake, with a 4×4 grid and a twist: the agent is rewarded only if it collects a key on its way to the goal. This forces the agent to adopt a particular route, a bit like driving through traffic when you are required to make a stop on the way (say, at a petrol pump).

In [6]:
class ExpandedFrozenLakeEnv(gym.Env):
    
    def __init__(self, slippery=False):
        self.size = 4
        self.n_states_base = self.size * self.size
        self.n_states = self.n_states_base * 2  # Double for key status
        self.n_actions = 4
        self.slippery = slippery
        
        # Define 4x4 map with key
        self.desc = np.array([
            ['S', 'F', 'F', 'F'],
            ['F', 'H', 'F', 'K'],
            ['F', 'F', 'F', 'F'],
            ['H', 'F', 'F', 'G']
        ], dtype='<U1')
        
        # Identify key, start, goal, and hole positions
        self.start_pos = None
        self.goal_pos = []
        self.hole_pos = []
        self.key_pos = None
        for row in range(self.size): # same idea as for the above class
            for col in range(self.size):
                if self.desc[row, col] == 'S':
                    self.start_pos = (row, col)
                elif self.desc[row, col] == 'G':
                    self.goal_pos.append((row, col))
                elif self.desc[row, col] == 'H':
                    self.hole_pos.append((row, col))
                elif self.desc[row, col] == 'K':
                    self.key_pos = (row, col)
        
        self.actions = {
            0: (0, -1),   # Left
            1: (1, 0),    # Down
            2: (0, 1),    # Right
            3: (-1, 0)    # Up
        }

        # self.P = self._make_transition_model()

        self.action_space = spaces.Discrete(self.n_actions)
        self.observation_space = spaces.Discrete(self.n_states)
        # Initialize state
        self.state = None
        self.has_key = None

    def reset(self, seed=None, **kwargs):
        super().reset(seed=seed, **kwargs)
        self.has_key = False
        self.state = self.pos_to_state(self.start_pos)
        return self.get_full_state(), {}

    def step(self, action):
        current_pos = self.state_to_pos(self.state)
        row, col = current_pos
        
        # act 
        if self.slippery:
            action = self.np_random.choice([action, (action + 1) % 4, (action - 1) % 4])
        
        dr, dc = self.actions[action]
        new_row, new_col = row + dr, col + dc
        
        # Ensure within bounds
        new_row = np.clip(new_row, 0, self.size - 1)
        new_col = np.clip(new_col, 0, self.size - 1)
        new_pos = (new_row, new_col)
        new_state = self.pos_to_state(new_pos)
        
        # Check if key is collected
        if new_pos == self.key_pos:
            self.has_key = True
        
        # Check for hole or goal
        terminated = False
        reward = 0.0
        if new_pos in self.hole_pos:
            terminated = True
        elif new_pos in self.goal_pos and self.has_key:
            terminated = True
            reward = 1.0
        
        self.state = new_state
        full_state = self.get_full_state()
        return full_state, reward, terminated, False, {}

    #trivial stuff
    def pos_to_state(self, pos):
        row, col = pos
        return row * self.size + col

    def state_to_pos(self, state):
        row = state // self.size
        col = state % self.size
        return (row, col)

    def get_full_state(self):
        return self.state + (self.n_states_base * int(self.has_key))
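
The observation returned by get_full_state packs the cell index and the key flag into one integer: states 0 to 15 mean "no key yet" and states 16 to 31 mean "key collected". A small sketch of the encoding and its inverse (the `encode` and `decode` helpers are hypothetical names used only for illustration):

```python
N_BASE = 16  # 4 x 4 cells, the same value as n_states_base above

def encode(cell, has_key):
    """Combine a cell index (0..15) and the key flag into one state."""
    return cell + N_BASE * int(has_key)

def decode(full_state):
    """Split a full state back into (cell index, key flag)."""
    has_key, cell = divmod(full_state, N_BASE)
    return cell, bool(has_key)

# Cell 7 is row 1, column 3, which is where the key sits on the map above
print(encode(7, False))  # 7
print(encode(7, True))   # 23
print(decode(23))        # (7, True)
```

Doubling the state space this way lets a tabular Q function learn different actions for the same cell before and after the key is collected.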

### Registering these custom environments

In [7]:
# Register environments
gym.register(
    id="CustomFrozenLake-v1",
    entry_point=CustomFrozenLakeEnv,
    kwargs={'slippery': False},
)

gym.register(
    id="ExpandedFrozenLake-v1",
    entry_point=ExpandedFrozenLakeEnv,
    kwargs={'slippery': False},
)

We will iterate over these environment ids to test.

In [8]:
# env ids to test
env_ids = ['CustomFrozenLake-v1','FrozenLake-v1','ExpandedFrozenLake-v1']  # frozenlake-v1 is for the default 

## Monte Carlo Implementation

We implement Monte Carlo control. It is an on-policy algorithm without exploring starts, using an epsilon-soft policy.

Blackjack requires exploration, as there are certain high-reward moves. We thus use a slowly, exponentially decaying epsilon.
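
The training functions in this notebook all use the same floored exponential schedule, max(epsilon_min, epsilon_start * exp(-decay_rate * episode)), but with different decay rates. The sketch below (pure Python; `epsilon_at` and `first_floor_episode` are hypothetical helper names) shows how quickly each rate reaches the floor of 0.05.

```python
import math

def epsilon_at(ep, decay_rate, eps_start=1.0, eps_min=0.05):
    """Epsilon for episode `ep` under a floored exponential decay."""
    return max(eps_min, eps_start * math.exp(-decay_rate * ep))

def first_floor_episode(decay_rate, eps_start=1.0, eps_min=0.05):
    """First episode index at which epsilon has decayed to eps_min."""
    return math.ceil(math.log(eps_start / eps_min) / decay_rate)

# Decay rates as set inside monte_carlo, q_learning and sarsa
for name, rate in [("monte_carlo", 0.01), ("q_learning", 0.9), ("sarsa", 0.001)]:
    print(name, first_floor_episode(rate))
# monte_carlo 300
# q_learning 4
# sarsa 2996
```

With decay_rate=0.9 the Q-learning agent is almost purely greedy after a handful of episodes, so most of its exploration has to come from the optimistic initial Q values rather than from epsilon.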

In [9]:
@timer
def monte_carlo(env, episodes=10000, alpha=0.1, discount=0.99, epsilon=0.1):
    """
    Monte Carlo control using the every-visit method and an epsilon-greedy policy.
    Returns Q table of state-action values.
    Note: the `epsilon` argument is unused; epsilon follows the decay schedule below.
    """
    n_actions = env.action_space.n
    n_states = env.observation_space.n
    epsilon_start = 1.0
    epsilon_min = 0.05
    decay_rate = 0.01
    max_steps = 1000 

    Q = np.ones((n_states, n_actions)) #q value function initialization
    epsilons = np.maximum(epsilon_min, epsilon_start * np.exp(-decay_rate * np.arange(episodes))) #precompute epsilon

    for ep in tqdm(range(episodes), desc="MC epsiodes progress"):
        state, _ = env.reset()
        done = False
        episode = []
        epsilon = epsilons[ep]  # slow exponential decay of epsilon
        steps = 0
        # generating an episode
        while not done and steps < max_steps:
            #exploration
            if random.random() < epsilon:
                action = env.action_space.sample() # choose any action randomly with a probability of epsilon
            else :
                # policy improvement
                best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten() # list of all actions with the max q return
                action = int(np.random.choice(best_actions)) #choose any one from them randomly

            # perform the chosen action
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode.append((state,action,reward))
            state = next_state
            steps+=1
            
        G = 0
        for t in range(len(episode)-1,-1,-1): # compute return backwards
            s, a, r = episode[t]
            G = discount*G + r # since we are traversing backwards we can discount without keeping track of the number of terms
            Q[s, a] += alpha * (G - Q[s, a])
                

    return Q
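
The backward pass at the end of monte_carlo relies on the recursion G_t = r_{t+1} + discount * G_{t+1}. A minimal sketch of that recursion on its own, using a made-up three-step reward sequence (`discounted_returns` is a hypothetical helper):

```python
def discounted_returns(rewards, discount):
    """Return G_t for every step t, computed in one backward sweep."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        # Each step adds its own reward to the discounted return of the next step
        G = discount * G + rewards[t]
        returns[t] = G
    return returns

rewards = [0.0, 0.0, 1.0]  # reward only on the final step
print([round(g, 4) for g in discounted_returns(rewards, 0.9)])  # [0.81, 0.9, 1.0]
```

Sweeping backwards means each return costs one multiplication and one addition, with no need to track the power of the discount explicitly.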

## Temporal Difference Implementation

A general TD update for $Q(s_t, a_t)$ is of this form:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\text{Target} - Q(s_t, a_t)\right],
$$

where $\alpha$ is a step size (learning rate), and the target is an estimate of the return: the reward one step ahead plus the discounted estimate of future value, with $\gamma$ the discount factor.

- In **SARSA**, the target is:

$$
r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}),
$$

using the next action $a_{t+1}$ actually chosen by the current policy (**on-policy**).

- In **Q-learning**, the target is:

$$
r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'),
$$

using the best possible next action according to current $Q$ (**off-policy**, because it imagines following the greedy policy from the next state even if the behavior policy actually explores).

Summing up:

- **SARSA update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right].
$$

- **Q-learning update**:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right].
$$

In both cases, during learning we select actions via an $\epsilon$-greedy policy over current $Q$: with probability $\epsilon$ choose a random action, else choose:

$$
\arg\max_a Q(s, a).
$$

This ensures exploration.
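
To make the difference between the two targets concrete, here is a single update worked through on a made-up two-state Q table. All numbers are illustrative assumptions; the point is only that the targets differ when the next action is exploratory.

```python
alpha, gamma = 0.5, 0.9

# Toy Q table: Q[state] is a list of action values
Q = {0: [0.0, 0.0], 1: [1.0, 2.0]}

s, a, r, s_next = 0, 0, 1.0, 1
a_next = 0  # an exploratory next action (the greedy one in state 1 is action 1)

# SARSA bootstraps from the action actually taken next
sarsa_target = r + gamma * Q[s_next][a_next]
# Q-learning bootstraps from the best action in the next state
q_target = r + gamma * max(Q[s_next])

sarsa_new = Q[s][a] + alpha * (sarsa_target - Q[s][a])
q_new = Q[s][a] + alpha * (q_target - Q[s][a])

print(round(sarsa_target, 2), round(q_target, 2))  # 1.9 2.8
print(round(sarsa_new, 2), round(q_new, 2))        # 0.95 1.4
```

When the next action happens to be greedy, the two targets coincide, so the algorithms only diverge on exploratory steps.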


In [14]:
# -------------------------------------------------
# Q_LEARNING
# -------------------------------------------------
@timer
def q_learning(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1, alpha_decay=0.001):
    """
    Q-Learning algorithm with epsilon-greedy exploration.
    Returns Q value function
    """
    n_actions = env.action_space.n
    n_states = env.observation_space.n
    epsilon_start = 1.0
    epsilon_min = 0.05
    decay_rate = 0.9
    max_steps = 1000

    # Precompute epsilons
    epsilons = np.maximum(epsilon_min, epsilon_start * np.exp(-decay_rate * np.arange(episodes)))
    
    Q = np.ones((n_states, n_actions)) * 5.0 #init q value functions for all pairs

    for ep in tqdm(range(episodes), desc="Q learning epsiodes progress"):
        state, _ = env.reset()
        done = False
        epsilon = epsilons[ep]  # exponentially decaying epsilon; with decay_rate=0.9 it hits epsilon_min after about 4 episodes
        steps = 0

        # alpha decay
        alpha_curr = alpha/(1+(alpha_decay*ep))

        while not done and steps < max_steps:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                max_val = np.max(Q[state])
                action = np.random.choice(np.flatnonzero(Q[state] == max_val))
            
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-Learning update
            # Use the best Q value of the next state, irrespective of the action
            # the behaviour policy will actually take (off-policy)
            best_next = 0 if done else np.max(Q[next_state])
            Q[state, action] += alpha_curr * (reward + discount * best_next - Q[state, action])
            
            state = next_state
            steps+=1
    return Q
    

In [None]:
# -------------------------------------------------
# SARSA
# -------------------------------------------------
@timer
def sarsa(env, episodes=1000, alpha=0.1, discount=0.99, epsilon=0.1, alpha_decay=0.001):
    """
    SARSA algorithm (on-policy TD control) with epsilon-greedy policy.
    Returns Q table of state-action values.
    """
    n_actions = env.action_space.n
    n_states = env.observation_space.n
    epsilon_start = 1.0
    epsilon_min = 0.05
    decay_rate = 0.001 
    max_steps = 1000
    
    # Precompute epsilons
    epsilons = np.maximum(epsilon_min, epsilon_start * np.exp(-decay_rate * np.arange(episodes)))
    
    Q = np.ones((n_states, n_actions)) * 5.0 #init q values for all state action pairs

    for ep in tqdm(range(episodes), desc="SARSA epsiodes progress"):
        state, _ = env.reset()
        epsilon = epsilons[ep]  # slow exponential decay of epsilon

        # Choose initial action (epsilon strategy)
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            best_actions = np.argwhere(Q[state] == np.max(Q[state])).flatten()
            action = int(np.random.choice(best_actions))

        # alpha decay
        alpha_curr = alpha/(1+(alpha_decay*ep))

        steps = 0
        done = False
        while not done and steps < max_steps:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Choose next action (epsilon-greedy)
            if random.random() < epsilon:
                next_action = env.action_space.sample()
            else:
                best_actions = np.argwhere(Q[next_state] == np.max(Q[next_state])).flatten()
                next_action = int(np.random.choice(best_actions))
            
            # SARSA update (on-policy)
            Q[state, action] += alpha_curr * (reward + discount * Q[next_state, next_action] * (not done) - Q[state, action])
            state, action = next_state, next_action
            steps += 1
            
    return Q

## Policy Evaluation

We test the learned policies on the environments, printing the training time for all the algorithms and the average return obtained.

In [12]:
def evaluate_policy(env, Q, episodes=100, discount=1.0):
    """
    Evaluate a given policy derived from Q (greedy) by running episodes.
    Returns the average total (discounted) return.
    """
    returns = []
    best_actions_list = [np.argwhere(Q[s] == np.max(Q[s])).flatten() for s in range(Q.shape[0])]   
    max_steps = 1000
    for ep in tqdm(range(episodes), desc="Evaluating"):
        state, _ = env.reset()
        done = False
        G = 0.0
        t = 0
        steps = 0
        while not done and steps < max_steps:
            # Greedy action
            actions = best_actions_list[state]
            action = int(np.random.choice(actions))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            G += (discount**t) * reward
            t += 1 # keeping track of the power to raise discount with
            state = next_state
            steps += 1
        returns.append(G)

    returns_arr = np.array(returns, dtype=float)

    # plt.ion()  # interactive mode
    # fig, ax = plt.subplots(figsize=(8,6))
    # ax.plot(cum_avg, marker='s', markersize=4, markevery=50)
    # ax.set(xlabel='Episode', ylabel='Average return till now', title='Convergence of Average Return')
    # ax.grid(True)
    # plt.draw()
    # plt.pause(0.001) 

    avg_return = returns_arr.mean()
    return avg_return


## Results

In [401]:
for eid in env_ids:
    print(eid)
    EPISODES = 5000
    # if eid == "ExpandedFrozenLake-v1" or eid == "ExpandedFrozenLake-v1-slip":
    #     EPISODES = 500000
    env = gym.make(eid)
    
    # Train and evaluate each algorithm
    mc_Q = monte_carlo(env, episodes=10000, alpha=0.01, discount=0.99)
    print("Average Return (MC):", evaluate_policy(env, mc_Q, episodes=1000, discount=0.99))

    print("moving to Q")

    ql_Q = q_learning(env, episodes=EPISODES, alpha=0.1, discount=0.99)
    print("Average Return (Q-Learning):", evaluate_policy(env, ql_Q, episodes=1000, discount=0.99))

    print("moving to SARSA")

    sa_Q = sarsa(env, episodes=EPISODES, alpha=0.1, discount=0.99)
    print("Average Return (SARSA):", evaluate_policy(env, sa_Q, episodes=1000, discount=0.99))

    print()


CustomFrozenLake-v1


MC epsiodes progress: 100%|██████████| 10000/10000 [00:04<00:00, 2025.62it/s]


Function 'monte_carlo' took 4.9393 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 3797.50it/s]


Average Return (MC): 0.8429431933839271
moving to Q


Q learning epsiodes progress: 100%|██████████| 5000/5000 [00:13<00:00, 363.81it/s] 


Function 'q_learning' took 13.7458 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 3800.81it/s]


Average Return (Q-Learning): 0.8429431933839271
moving to SARSA


SARSA epsiodes progress: 100%|██████████| 5000/5000 [00:08<00:00, 581.84it/s] 


Function 'sarsa' took 8.5974 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 3837.57it/s]


Average Return (SARSA): 0.8429431933839271

FrozenLake-v1


MC epsiodes progress: 100%|██████████| 10000/10000 [00:02<00:00, 4068.23it/s]


Function 'monte_carlo' took 2.4597 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 2991.21it/s]


Average Return (MC): 0.38816767607190616
moving to Q


Q learning epsiodes progress: 100%|██████████| 5000/5000 [00:03<00:00, 1601.14it/s]


Function 'q_learning' took 3.1243 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 1981.41it/s]


Average Return (Q-Learning): 0.5054036686203724
moving to SARSA


SARSA epsiodes progress: 100%|██████████| 5000/5000 [00:02<00:00, 2135.19it/s]


Function 'sarsa' took 2.3434 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 2412.95it/s]


Average Return (SARSA): 0.5188396662855793

ExpandedFrozenLake-v1


MC epsiodes progress: 100%|██████████| 10000/10000 [00:01<00:00, 7140.71it/s]


Function 'monte_carlo' took 1.4024 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 6879.01it/s]


Average Return (MC): 0.9509900498999998
moving to Q


Q learning epsiodes progress: 100%|██████████| 5000/5000 [00:03<00:00, 1386.62it/s]


Function 'q_learning' took 3.6078 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 11519.59it/s]


Average Return (Q-Learning): 0.9509900498999998
moving to SARSA


SARSA epsiodes progress: 100%|██████████| 5000/5000 [00:01<00:00, 3043.41it/s]


Function 'sarsa' took 1.6445 seconds


Evaluating: 100%|██████████| 1000/1000 [00:00<00:00, 11258.52it/s]

Average Return (SARSA): 0.9509900498999998






In [13]:
env = gym.make("CustomFrozenLake-v1")
EPISODES = 500000
# Train and evaluate each algorithm
# mc_Q = monte_carlo(env, episodes=50000, alpha=0.03, discount=0.99)
# print("Average Return (MC):", evaluate_policy(env, mc_Q, episodes=1000, discount=0.99))

print("moving to Q")

ql_Q = q_learning(env, episodes=EPISODES, alpha=0.1, discount=0.99)
print("Average Return (Q-Learning):", evaluate_policy(env, ql_Q, episodes=1000, discount=0.99))

print("moving to SARSA")

sa_Q = sarsa(env, episodes=EPISODES, alpha=0.1, discount=0.99)
print("Average Return (SARSA):", evaluate_policy(env, sa_Q, episodes=1000, discount=0.99))

print()

moving to Q


Q learning epsiodes progress: 100%|██████████| 500000/500000 [18:28<00:00, 451.15it/s] 


Function 'q_learning' took 1108.2927 seconds


Evaluating: 100%|██████████| 1000/1000 [00:01<00:00, 766.36it/s]


Average Return (Q-Learning): 0.37723664692350406
moving to SARSA


SARSA epsiodes progress: 100%|██████████| 500000/500000 [20:33<00:00, 405.47it/s] 


Function 'sarsa' took 1233.1371 seconds


Evaluating: 100%|██████████| 1000/1000 [00:01<00:00, 785.09it/s]

Average Return (SARSA): 0.37723664692350406






Now trying with alpha decay
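
The q_learning and sarsa functions above shrink the step size once per episode as alpha / (1 + alpha_decay * ep). A quick sketch of what that schedule looks like over a 500,000-episode run, using the initial alpha of 0.5 from the cell below and the default alpha_decay of 0.001 (`alpha_at` is a hypothetical helper name):

```python
def alpha_at(ep, alpha=0.5, alpha_decay=0.001):
    """Step size used during episode `ep` under inverse-time decay."""
    return alpha / (1 + alpha_decay * ep)

for ep in (0, 1_000, 100_000, 499_999):
    print(ep, round(alpha_at(ep), 5))
```

The step size halves by episode 1,000 and ends about 500 times smaller than it started, so late episodes change the Q table only slightly.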

In [None]:
env = gym.make("CustomFrozenLake-v1")
EPISODES = 500000
# Train and evaluate each algorithm
# mc_Q = monte_carlo(env, episodes=50000, alpha=0.03, discount=0.99)
# print("Average Return (MC):", evaluate_policy(env, mc_Q, episodes=1000, discount=0.99))

print("moving to Q")

ql_Q = q_learning(env, episodes=EPISODES, alpha=0.5, discount=0.99)
print("Average Return (Q-Learning):", evaluate_policy(env, ql_Q, episodes=1000, discount=0.99))

print("moving to SARSA")

sa_Q = sarsa(env, episodes=EPISODES, alpha=0.5, discount=0.99)
print("Average Return (SARSA):", evaluate_policy(env, sa_Q, episodes=1000, discount=0.99))

print()

It is still too costly to run.

## Results

We tested the following algorithms on our environments:

- Monte Carlo control (on-policy, every-visit, epsilon-greedy policy)
- Q-learning control
- SARSA

Here are the results and statistics, summarized:

| Environment            | Algorithm     | Time (seconds) | Average Return |
|------------------------|---------------|-----------------|-----------------|
| CustomFrozenLake-v1    | Monte Carlo   | 4.9393          | 0.8429          |
| CustomFrozenLake-v1    | Q-Learning    | 13.7458         | 0.8429          |
| CustomFrozenLake-v1    | SARSA         | 8.5974          | 0.8429          |
| FrozenLake-v1          | Monte Carlo   | 2.4597          | 0.3882          |
| FrozenLake-v1          | Q-Learning    | 3.1243          | 0.5054          |
| FrozenLake-v1          | SARSA         | 2.3434          | 0.5188          |
| ExpandedFrozenLake-v1  | Monte Carlo   | 1.4024          | 0.9510          |
| ExpandedFrozenLake-v1  | Q-Learning    | 3.6078          | 0.9510          |
| ExpandedFrozenLake-v1  | SARSA         | 1.6445          | 0.9510          |


I did try building a larger frozen lake environment, but with a low number of episodes and a small maximum step count the average return was 0. Increasing the number of episodes and the maximum steps could have fixed the issue, but the computation was too slow (about half an hour for a single algorithm). I tried to use as many numpy operations as I could, but the time gained was not sufficient.

I also tried a distance-based reward: if a move reduced the distance between the agent and the goal, the agent received a small reward, and it was penalized for moving away. Unfortunately this did not work, because the agent kept moving back and forth between two cells close to the goal, collecting reward without actually reaching the goal.

I have also ensured exploration using an exponentially decaying epsilon. Every-visit MC turned out to be faster than first-visit MC, converging in fewer episodes; however, the compute time per episode also increased.

For a 50×50 frozen lake environment:

| Algorithm   | Time (seconds) | Average Return |
|-------------|----------------|----------------|
| Q-Learning  | 1108.3         | 0.38           |
| SARSA       | 1233.14        | 0.38           |

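Monte Carlo is too slow here. Even with 500,000 episodes, the average return is low compared to the other environments. Much of that gap comes from discounting rather than from a failure to learn: a shortest path across a 50×50 grid takes 98 moves, and $0.99^{97} \approx 0.377$, so 0.38 is what a shortest-path policy earns under this discount. The practical obstacle is compute time, at roughly 18 to 21 minutes per algorithm for a single run.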
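
As a rough cross-check of the averages reported in this section, the sketch below computes the discounted return of a deterministic episode that earns reward 1 only on its final move, which is discount ** (moves - 1) under the convention used in evaluate_policy. The helper `corner_to_corner_return` is a hypothetical name, and it assumes a shortest path of 2 * (size - 1) moves between opposite corners.

```python
def corner_to_corner_return(size, discount=0.99):
    """Discounted return of a shortest corner-to-corner path on a size x size grid,
    with reward 1 on the final move and 0 elsewhere."""
    moves = 2 * (size - 1)
    # The first reward is weighted by discount**0, so the last is discount**(moves - 1)
    return discount ** (moves - 1)

print(round(corner_to_corner_return(50), 4))  # 0.3772
print(round(corner_to_corner_return(10), 4))  # 0.8429
# ExpandedFrozenLake: 4 moves to the key plus 2 to the goal = 6 moves
print(round(0.99 ** 5, 4))                    # 0.951
```

The 98-move value matches the 50×50 figure, and the 6-move value matches the ExpandedFrozenLake-v1 rows. The 0.8429 reported for CustomFrozenLake-v1 in the summary table matches an 18-move path, which is what a 10×10 grid would give; that the earlier run used a smaller map is an inference from the arithmetic, not something the notebook states.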

Our algorithms also perform much worse on the original FrozenLake. This is due to the stochastic nature of that environment, caused by its slippery surface.