# Monte Carlo First-Visit Method for a Simple GridWorld Problem

Analyzing State-Value Convergence with First-Visit Monte Carlo
<br>In this case, the agent follows a uniform random policy: at every step, each action is chosen with equal probability.
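
For reference, the quantities estimated in this notebook can be written as follows. These are the standard textbook definitions; the symbols $N(s)$ and $G^{(i)}(s)$ are notation introduced here for illustration and do not appear in the code.

$$
G_t = \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1},
\qquad
V(s) \approx \frac{1}{N(s)} \sum_{i=1}^{N(s)} G^{(i)}(s)
$$

Here $\gamma$ is the discount factor (`GAMMA` in the code), $T$ is the time step at which the episode terminates, $N(s)$ is the number of episodes in which state $s$ is visited at least once, and $G^{(i)}(s)$ is the return following the first visit to $s$ in the $i$-th of those episodes.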


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import sys

# where gridworld.py is located
# modify path accordingly
sys.path.append('/content/drive/MyDrive')


In [None]:
import numpy as np
import gridworld

**Modify the code so that the MC simulation runs for 100, 500, 1000, 5000, and 10000 episodes. Run `gridworld.print_values(state_values, env)` to show the results for each case.**

**Standard Grid:**<br>
x means you can't go there<br>
s means start position<br>
number means reward at that state<br>
**.  .  .  1**<br>
**.  x  . -1**<br>
**s  .  .  .**<br>





In [None]:
# define a grid that describes the reward for arriving at each state and possible actions at each state
# the grid looks like this:
#
# x means you can't go there
# s means start position
# number means reward at that state
#   0  1  2  3
# 0 .  .  .  1
# 1 .  x  . -1
# 2 s  .  .  .
#
# We go by (row, col)
#
EPISODE_COUNT = [100,500,1000,5000,10000]
GAMMA = 0.9

In [None]:
def run_mc_simulation(num_episodes):
    """
    Runs a Monte Carlo (MC) simulation to estimate state values in the GridWorld environment.
    """
    env = gridworld.standard_grid() # Initialize the 3x4 grid environment
    returns = {}  # Dictionary to store returns for each state, where returns[s] is a list
    state_values = {}  # Dictionary to store the estimated value function for each state

    for ep in range(num_episodes):
        state = env.reset()  # Reset environment to the start state
        trajectory = []  # List to store the state-reward trajectory of one episode

        while True:
            action = np.random.choice(['U', 'D', 'L', 'R'])  # Choose a random action
            next_state, reward, done, invalid = env.step(action)  # Take action and observe outcome
            if invalid:
                continue  # Skip invalid moves and try again

            trajectory.append((state, reward))  # Store the state and received reward: St, Rt+1
            state = next_state  # Move to the next state
            if done:
                break  # End episode if a terminal state is reached

        reverse_trajectory = trajectory[::-1]  # Reverse the trajectory to process it backward
        reverse_trajectory_states = list(map(lambda x: x[0], reverse_trajectory))  # Extract states
        G = 0  # Initialize the return (discounted cumulative reward)

        for idx, step in enumerate(reverse_trajectory):
            s, r = step  # Extract state and reward
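            G = GAMMA * G + r  # Backward recursion: G_t = R_{t+1} + GAMMA * G_{t+1}
            # Example after processing three steps (rewards r1, r2, r3 in time order),
            # where G0 is the value G had before the loop started:
            #   G = GAMMA*(GAMMA*(GAMMA*G0 + r3) + r2) + r1
            #     = GAMMA^3*G0 + GAMMA^2*r3 + GAMMA*r2 + r1
            #     = GAMMA^2*r3 + GAMMA*r2 + r1   (because G0 = 0)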

            # First-visit MC: update only at the first occurrence of the state in the episode.
            # In the reversed list, the slice [idx+1:] holds the states visited earlier in time.
            if s not in reverse_trajectory_states[idx+1:]:
                if s in returns:
                    returns[s].append(G)  # Append return to the list of returns for this state
                else:
                    returns[s] = [G]  # Create a new list for this state

                state_values[s] = np.mean(returns[s])  # Compute the average return for this state

    # Terminal states never enter a trajectory, so assign their values by hand
    # (set here to the reward received on arrival) once all episodes are done
    state_values[(0,3)] = 1  # Goal state with positive reward
    state_values[(1,3)] = -1  # Penalty state with negative reward

    return state_values  # Return the estimated state values
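
The update step above scans a slice of the trajectory for every step and recomputes `np.mean` over the full list of returns. As a minimal sketch of a leaner alternative, the two helpers below find the first-visit returns in a single backward pass and fold each one into a running average. The names `first_visit_returns` and `update_running_mean` are hypothetical and not part of `gridworld.py`, and the toy episode is hand-written rather than produced by the environment.

```python
def first_visit_returns(trajectory, gamma):
    """Map each state to the return that follows its first visit in one episode.

    `trajectory` is a list of (state, reward) pairs in time order.
    """
    first_visit = {}
    G = 0.0
    # Walk backward so that G accumulates as G_t = r_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(trajectory):
        G = gamma * G + reward
        # Earlier visits are processed later in this backward pass, so the
        # last write for a state corresponds to its first visit in time.
        first_visit[state] = G
    return first_visit


def update_running_mean(values, counts, state, G):
    """Fold one new return into the running average for `state`."""
    counts[state] = counts.get(state, 0) + 1
    old = values.get(state, 0.0)
    # Incremental mean: V <- V + (G - V) / N
    values[state] = old + (G - old) / counts[state]


# Hand-written toy episode (an assumption for illustration only):
# (0,1) -> (0,2) -> (0,1) -> (0,2) -> goal, with reward 1 on the last step.
episode = [((0, 1), 0), ((0, 2), 0), ((0, 1), 0), ((0, 2), 1)]

values, counts = {}, {}
for state, G in first_visit_returns(episode, gamma=0.9).items():
    update_running_mean(values, counts, state, G)

print(values)
```

With `gamma=0.9`, the first visit to `(0, 1)` is followed by a return of about 0.729 and the first visit to `(0, 2)` by about 0.81. The running average gives the same estimate as averaging a stored list, but it keeps only one number and one counter per state.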

In [None]:
# Loop through the episodes 100,500,1000,5000,10000 and print out the results.
for episodes in EPISODE_COUNT:
    print(f"Running simulation for {episodes} episodes...")
    state_values = run_mc_simulation(episodes)
    gridworld.print_values(state_values, gridworld.standard_grid())
    print("\n")

print("Training complete.")

Running simulation for 100 episodes...
---------------------------
-0.01| 0.09| 0.27| 1.00|
---------------------------
-0.10| 0.00|-0.45|-1.00|
---------------------------
-0.15|-0.27|-0.44|-0.69|


Running simulation for 500 episodes...
---------------------------
 0.07| 0.15| 0.30| 1.00|
---------------------------
 0.00| 0.00|-0.32|-1.00|
---------------------------
-0.10|-0.21|-0.36|-0.65|


Running simulation for 1000 episodes...
---------------------------
 0.04| 0.11| 0.24| 1.00|
---------------------------
-0.04| 0.00|-0.40|-1.00|
---------------------------
-0.12|-0.24|-0.40|-0.66|


Running simulation for 5000 episodes...
---------------------------
 0.04| 0.14| 0.26| 1.00|
---------------------------
-0.03| 0.00|-0.36|-1.00|
---------------------------
-0.11|-0.21|-0.36|-0.68|


Running simulation for 10000 episodes...
---------------------------
 0.05| 0.14| 0.26| 1.00|
---------------------------
-0.03| 0.00|-0.37|-1.00|
---------------------------
-0.11|-0.22|-0.38|-0.68|
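
Each call to `run_mc_simulation` draws fresh random actions, so these tables will differ from one execution of the notebook to the next. If reproducible numbers are wanted, one option is to seed NumPy's global random generator before running the simulations. The sketch below only demonstrates the seeding behavior on the action draw itself; the seed value `0` is arbitrary, and it assumes the environment adds no randomness of its own.

```python
import numpy as np

ACTIONS = ['U', 'D', 'L', 'R']

np.random.seed(0)  # arbitrary seed, chosen only for illustration
first = np.random.choice(ACTIONS, size=5)

np.random.seed(0)  # re-seeding restarts the same random stream
second = np.random.choice(ACTIONS, size=5)

# The same seed yields the same sequence of actions.
print(list(first) == list(second))
```

Placing a single `np.random.seed(...)` call before the loop over `EPISODE_COUNT` would make the whole set of tables repeatable under the same assumption.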

# Conclusions from Running the Monte Carlo Simulation for 5 Cases

1. **Early Runs (100 and 500 Episodes) Show High Variance**

  * The state values differ noticeably between the 100- and 500-episode runs, because each estimate averages only a small number of sampled returns.
  * Notable shift: (1,2) moves from -0.45 (100 episodes) to -0.32 (500 episodes), a sign of insufficient sampling.
  * Each run is an independent simulation, so with this few episodes the estimates are unstable and can change substantially from one run to the next.

2. **Convergence Begins (1000 Episodes)**

  * The estimates vary less and start to stabilize.
  * Example: (1,2) and (2,2) are both estimated at -0.40, within 0.04 of their values in the 5000- and 10,000-episode runs.

3. **Stabilization (5000+ Episodes)**
  * Values become more consistent with minimal changes.
    * Example: (1,2) = -0.36 (5000 episodes) → -0.37 (10,000 episodes).
  * Comparing 5000 and 10,000 episodes, we see only minor changes in the estimated values. This suggests that after around 5000 episodes, the state values have mostly converged.

4. **Final Observation**
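  * Positive values appear near the goal state (+1 at (0,3)), indicating higher expected returns.
  * Negative values appear near the penalty state (-1 at (1,3)), meaning that under the random policy these states more often lead to the penalty.
  * The agent does not learn a better policy here: the values evaluate the uniform random policy. A policy that acts greedily with respect to these values would tend to move toward the goal and away from the penalty.
  * The final state values align with expected behavior:
    * Higher values near the goal (0.14 at (0,1), 0.26 at (0,2)).
    * Lower values near the penalty (-0.37 at (1,2), -0.38 at (2,2), -0.68 at (2,3)).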
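
To put a number on the stabilization described above, one rough check is the largest absolute change in any cell between consecutive runs. The sketch below copies the values from the printed tables, row by row; the helper name `max_abs_change` is hypothetical. Because each table comes from an independent run and is rounded to two decimals, this is only an approximate indicator of convergence.

```python
# State values copied from the printed tables, row by row (12 cells each).
tables = {
    100:   [-0.01, 0.09, 0.27, 1.00, -0.10, 0.00, -0.45, -1.00, -0.15, -0.27, -0.44, -0.69],
    500:   [0.07, 0.15, 0.30, 1.00, 0.00, 0.00, -0.32, -1.00, -0.10, -0.21, -0.36, -0.65],
    1000:  [0.04, 0.11, 0.24, 1.00, -0.04, 0.00, -0.40, -1.00, -0.12, -0.24, -0.40, -0.66],
    5000:  [0.04, 0.14, 0.26, 1.00, -0.03, 0.00, -0.36, -1.00, -0.11, -0.21, -0.36, -0.68],
    10000: [0.05, 0.14, 0.26, 1.00, -0.03, 0.00, -0.37, -1.00, -0.11, -0.22, -0.38, -0.68],
}


def max_abs_change(a, b):
    """Largest absolute difference between two equally long value lists."""
    return max(abs(x - y) for x, y in zip(a, b))


episode_counts = sorted(tables)
changes = {}
for prev, curr in zip(episode_counts, episode_counts[1:]):
    changes[(prev, curr)] = max_abs_change(tables[prev], tables[curr])
    print(f"{prev:>5} -> {curr:>5} episodes: max |change| = {changes[(prev, curr)]:.2f}")
```

For these tables the largest change shrinks from 0.13 (100 to 500 episodes) to 0.08, 0.04, and finally 0.02 (5000 to 10,000 episodes), which is consistent with the estimates settling as the number of episodes grows.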
