# 43008: Reinforcement Learning

## Week 6: Monte Carlo Methods - Prediction:
* First-Visit MC Algorithm for Policy Evaluation
* Every-Visit MC Algorithm for Policy Evaluation

### What will you learn?
* Implement the First-Visit MC algorithm for Gym-based environments
* Implement the Every-Visit MC algorithm for Gym-based environments

### Let's install some important helper packages and libraries

In [None]:
!pip install gymnasium pyvirtualdisplay
!apt-get install -y xvfb ffmpeg

### Import Libraries

In [None]:
import gymnasium as gym
import numpy as np
from collections import defaultdict
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display

from matplotlib import pyplot
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from functools import partial
%matplotlib inline
plt.style.use('ggplot')

### Helper function for plotting

In [None]:
def plot_blackjack(V, ax1, ax2):
    # Define the range of each state component: player sum, dealer's showing card, usable ace.
    player_sum = np.arange(12, 21 + 1)
    dealer_show = np.arange(1, 10 + 1)
    usable_ace = np.array([False, True])
    state_values = np.zeros((len(player_sum), len(dealer_show), len(usable_ace)))

    # Copy the state values from V into an array indexed as [player, dealer, usable_ace] for plotting
    for i, player in enumerate(player_sum):
        for j, dealer in enumerate(dealer_show):
            for k, ace in enumerate(usable_ace):
                state_values[i, j, k] = V[player, dealer, ace]

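    # indexing='ij' makes X[i, j] = player_sum[i] and Y[i, j] = dealer_show[j],
    # matching the [player, dealer] layout of state_values.
    X, Y = np.meshgrid(player_sum, dealer_show, indexing='ij')

    ax1.plot_wireframe(X, Y, state_values[:, :, 0])
    ax2.plot_wireframe(X, Y, state_values[:, :, 1])

    # Axes settings
    for ax in ax1, ax2:
        ax.set_zlim(-1, 1)
        ax.set_xlabel('Player Sum')
        ax.set_ylabel('Dealer Showing')
        ax.set_zlabel('State-Value')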


### Helper functions

#### Define a sample policy

In [None]:
def sample_policy(observation):
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1
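
As a quick sanity check, the policy can be called on a few hand-written observations. The tuples below are made-up examples in the `(player sum, dealer's showing card, usable ace)` format and are not taken from the environment. In the Blackjack environment, action `0` means stick and `1` means hit.

```python
def sample_policy(observation):
    """Stick (0) on a player sum of 20 or 21, otherwise hit (1)."""
    score, dealer_score, usable_ace = observation
    return 0 if score >= 20 else 1

# Hand-written observations, for illustration only.
examples = [(12, 10, False), (19, 5, True), (20, 1, False), (21, 7, True)]
actions = [sample_policy(obs) for obs in examples]
print(actions)  # [1, 1, 0, 0]
```

The policy ignores the dealer's card and the usable ace, so the plotted value function should change sharply between player sums of 19 and 20.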

#### Initialize the BlackJack Gym environment

**Reference**: https://www.gymlibrary.dev/environments/toy_text/blackjack/

In [None]:
## WRITE YOUR CODE HERE ##
env = # Hint: Check the reference link provided

#### Function to generate random episodes

In [None]:
def generate_episode(env, given_policy):
    """Generate an episode using a given policy and the environment"""

    # Initialize three empty lists to store the episode's states, actions, and rewards.
    states, actions, rewards = [], [], []

    # Reset the environment to get the initial state.
    # Note: in Gymnasium, env.reset() returns a tuple (observation, info).
    current_state = ## WRITE YOUR CODE HERE ##

    # Initialize a flag to check if the episode is finished.
    done = ## WRITE YOUR CODE HERE ##

    # Continue until the episode is done/terminating state not reached.
    while not done:

        states.append(current_state)

        # Get the action for the current state from the policy.
        action = ## WRITE YOUR CODE HERE ##
        ## WRITE YOUR CODE HERE ## Append action to the actions list


        # Take the chosen action in the environment.
        # Get the resulting next state, reward, whether the episode is done, and other info.
        next_state_tuple = ## WRITE YOUR CODE HERE ##
        next_state, reward, terminated, truncated, info = ## WRITE YOUR CODE HERE ##
        done = terminated or truncated

        print(f"  Action: {action}, Next state: {next_state}, Reward: {reward}, Done: {done}, Info: {info}")

        # Append the reward to the rewards list.
        rewards. ## WRITE YOUR CODE HERE ##

        # Update the current state to the next state.
        current_state = ## WRITE YOUR CODE HERE ##
    # Return the generated episode.

    return states, actions, rewards
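
The loop above follows the usual Gymnasium interaction pattern. The sketch below shows the same pattern on a tiny stand-in environment. `ToyCountdownEnv` and `rollout` are hypothetical names invented for this illustration and are not part of Gymnasium; the class only mimics the shape of `reset()` and `step()`, so this is not a solution for the Blackjack environment itself.

```python
class ToyCountdownEnv:
    """Hypothetical stand-in environment, NOT part of Gymnasium.

    It only mimics the shape of the Gymnasium API:
    reset() -> (observation, info)
    step(action) -> (observation, reward, terminated, truncated, info)
    """

    def __init__(self, start=3):
        self.start = start
        self.state = start

    def reset(self, seed=None):
        self.state = self.start
        return self.state, {}

    def step(self, action):
        # Action 1 moves one step closer to the terminal state 0; action 0 stays put.
        self.state -= action
        terminated = self.state == 0
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}


def rollout(env, policy):
    """Collect one episode as parallel lists of states, actions, and rewards."""
    states, actions, rewards = [], [], []
    state, _info = env.reset()  # reset returns (observation, info)
    done = False
    while not done:
        states.append(state)
        action = policy(state)
        actions.append(action)
        next_state, reward, terminated, truncated, _info = env.step(action)
        done = terminated or truncated
        rewards.append(reward)
        state = next_state
    return states, actions, rewards


# Always choose action 1, so the episode ends after three steps.
states, actions, rewards = rollout(ToyCountdownEnv(start=3), lambda s: 1)
print(states, actions, rewards)  # [3, 2, 1] [1, 1, 1] [0.0, 0.0, 1.0]
```

Note that `states[t]` is the state in which `actions[t]` was taken and `rewards[t]` is the reward received afterwards, so `rewards[t]` corresponds to $R_{t+1}$ in the algorithm notation. The terminal state is never appended to `states`.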


## Monte Carlo Prediction: First-Visit and Every-Visit

```plaintext
Algorithm: Monte Carlo Prediction
Input: π (policy to be evaluated)
Initialize:
    Q(s, a) arbitrarily, for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
Repeat forever (for each episode):
    Generate an episode following π: S0, A0, R1, ..., ST-1, AT-1, RT
    G ← 0
    For t = T-1 down to 0:
        G ← γG + Rt+1
        If the pair St, At does not appear in S0, A0, ..., St-1, At-1 (for "first-visit"),
        or always (for "every-visit"):
            Append G to Returns(St, At)
            Q(St, At) ← average(Returns(St, At))
```

In the algorithm above:
- The only difference between first-visit and every-visit is the conditional statement.
- For first-visit, the update is applied only if the state-action pair has not been visited in the episode before time $t$.
- For every-visit, the state-action pair is updated every time it is encountered.
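
The backward pass and the first-visit check can be sketched on a single hand-made episode. The helper `returns_with_visit_flags` and the episode data are assumptions made for this illustration. The sketch keys on states only (as this notebook does when estimating $V$) rather than on state-action pairs.

```python
def returns_with_visit_flags(states, rewards, gamma=1.0):
    """Return (t, state, G_t, is_first_visit) for every time step of one episode."""
    G = 0.0
    rows = []
    for t in range(len(states) - 1, -1, -1):
        G = gamma * G + rewards[t]  # rewards[t] plays the role of R_{t+1}
        is_first_visit = states[t] not in states[:t]
        rows.append((t, states[t], G, is_first_visit))
    rows.reverse()  # report the steps in chronological order
    return rows

# Made-up episode: state "A" is visited at t=0 and again at t=2.
episode_states = ["A", "B", "A"]
episode_rewards = [1.0, 0.0, 2.0]
rows = returns_with_visit_flags(episode_states, episode_rewards, gamma=0.5)
for row in rows:
    print(row)
# (0, 'A', 1.5, True)
# (1, 'B', 1.0, True)
# (2, 'A', 2.0, False)
```

First-visit MC would use only the return 1.5 for state `"A"` in this episode, while every-visit MC would use both 1.5 and 2.0.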

<img src='https://drive.google.com/uc?id=18EN5SYHlr0yvTcFYKxIRWrIJsglpbsXn' height=300>



### 1. First Visit MC Prediction



In [None]:
# First Visit Monte Carlo Prediction/Policy Evaluation
def first_visit_mc_prediction(given_policy, env, n_episodes, gamma=1):

    # First, we initialize the empty value table as a dictionary for storing the values of each state
    V = ## WRITE YOUR CODE HERE ##
    N = ## WRITE YOUR CODE HERE ##


    for nE in range(n_episodes):

        # Next, we generate the episode and store the states and rewards

        states, _, rewards = ## WRITE YOUR CODE HERE ##
        G = ## WRITE YOUR CODE HERE ##

        # Next, for each step (going backwards), store the reward in a variable R and the state in S,
        # and calculate the return as a (discounted) sum of rewards

        for stepT in range(len(states) - 1, -1, -1): # Syntax: range(start, stop, step)
            R = ## WRITE YOUR CODE HERE ##
            S = ## WRITE YOUR CODE HERE ##

            # Update the return incrementally, as in the algorithm above: G <- gamma * G + R
            G = ## WRITE YOUR CODE HERE ##

            # Perform first-visit MC: check that the state does not occur earlier in the current episode.
            # If so, increment the counter and move the State-Value towards the return G.

            if S not in states[:stepT]:
                ## Increment the First Visit Counter..  WRITE YOUR CODE HERE ##

                # Incremental mean: V[S] <- V[S] + (G - V[S]) / N[S]
                V[S] += ## WRITE YOUR CODE HERE ##

    # Return the State-Value
    return V
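
For reference, here is a minimal sketch of first-visit MC prediction that runs on pre-recorded episodes instead of a Gym environment. The function name `first_visit_values` and the two toy episodes are assumptions made for this illustration. It uses the incremental-mean update, which is one common way to average the returns and not necessarily the form expected in the exercise.

```python
from collections import defaultdict

def first_visit_values(episodes, gamma=1.0):
    """First-visit MC estimate of V from pre-recorded (states, rewards) episodes."""
    V = defaultdict(float)  # running mean of first-visit returns per state
    N = defaultdict(int)    # number of first visits per state
    for states, rewards in episodes:
        G = 0.0
        for t in range(len(states) - 1, -1, -1):
            G = gamma * G + rewards[t]
            S = states[t]
            if S not in states[:t]:        # first visit to S in this episode
                N[S] += 1
                V[S] += (G - V[S]) / N[S]  # incremental mean
    return V

# Two made-up episodes over the states "A" and "B".
episodes = [
    (["A", "B", "A"], [1.0, 0.0, 2.0]),
    (["B", "A"], [0.0, 4.0]),
]
V = first_visit_values(episodes)
print(V["A"], V["B"])  # 3.5 3.0
```

In the first episode only the return from the first occurrence of `"A"` (3.0) is counted; the second episode contributes 4.0, giving the average 3.5.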


In [None]:
# Execute the First-Visit MC algorithm on the Blackjack environment and find the State-Value over multiple episodes
value_first = ## WRITE YOUR CODE HERE ##

In [None]:
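# Just for checking: print the first 10 entries without removing them from the dictionary
# (popitem() would delete them before plotting)
for state, value in list(value_first.items())[:10]:
  print(state, value)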

In [None]:
# Plot the Values and correlate with the Policy Visually
fig, axes = pyplot.subplots(nrows=2, figsize=(8, 16),
                            subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value_first, axes[0], axes[1])

### 2. Every Visit MC Prediction


<img src='https://drive.google.com/uc?id=1hKppuG-mSY32zW8xJN96DdMhuYk5764n' height=330>

In [None]:
# Every-Visit Monte Carlo
def every_visit_mc_prediction(given_policy, env, n_episodes, gamma=1):

    # First, we initialize the empty value table as a dictionary for storing the values of each state
    V = ## WRITE YOUR CODE HERE ##
    N = ## WRITE YOUR CODE HERE ##

    # Run through all episodes
    for nE in range(n_episodes):

        # Next, we generate the episode and store the states and rewards
        states, _, rewards = ## WRITE YOUR CODE HERE ##
        G = ## WRITE YOUR CODE HERE ##

        # Next for each step, store the rewards to a variable R and states to S, and calculate
        # returns as a sum of rewards
        for stepT in range(len(states) - 1, -1, -1): # Syntax: range(start, stop, step)
            ## WRITE YOUR CODE HERE ## For R and S

            # Perform every-visit MC: no first-visit check is needed.
            # Increment the counter on every visit, update the return, and move the State-Value towards it

            N[S] += ## WRITE YOUR CODE HERE ##

            G += ## WRITE YOUR CODE HERE ##

            V[S] += ## WRITE YOUR CODE HERE ##

    return V
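
A minimal sketch of every-visit MC on pre-recorded episodes is shown below, together with a check that the incremental update gives the same result as averaging the stored returns. The function name `every_visit_values` and the toy episodes are assumptions made for this illustration, not part of the exercise.

```python
from collections import defaultdict

def every_visit_values(episodes, gamma=1.0):
    """Every-visit MC: return (V, returns), where V is the running mean of the
    returns and returns[s] lists every return observed from state s."""
    V = defaultdict(float)
    N = defaultdict(int)
    returns = defaultdict(list)
    for states, rewards in episodes:
        G = 0.0
        for t in range(len(states) - 1, -1, -1):
            G = gamma * G + rewards[t]
            S = states[t]
            N[S] += 1                  # count every visit, no first-visit check
            V[S] += (G - V[S]) / N[S]  # incremental mean
            returns[S].append(G)       # kept only to verify the mean below
    return V, returns

# Two made-up episodes over the states "A" and "B".
episodes = [
    (["A", "B", "A"], [1.0, 0.0, 2.0]),
    (["B", "A"], [0.0, 4.0]),
]
V, returns = every_visit_values(episodes)

# The incremental mean should match the plain average of the stored returns.
for s in returns:
    batch_mean = sum(returns[s]) / len(returns[s])
    assert abs(V[s] - batch_mean) < 1e-12

print(V["A"], V["B"])  # 3.0 3.0
```

State `"A"` collects the returns 2.0, 3.0, and 4.0 here, so its every-visit estimate is 3.0. In Blackjack the player sum changes after every hit, so a state rarely repeats within an episode and the two estimators are expected to give very similar results.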


In [None]:
# Execute the Every-Visit MC algorithm on the Blackjack environment and find the State-Value over multiple episodes
value_every = ## WRITE YOUR CODE HERE ##

In [None]:
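# Just for checking: print the first 10 entries without removing them from the dictionary
# (popitem() would delete them before plotting)
for state, value in list(value_every.items())[:10]:
  print(state, value)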

In [None]:
# Plot the Values and correlate with the Policy Visually
fig, axes = pyplot.subplots(nrows=2, figsize=(8, 16),
                            subplot_kw={'projection': '3d'})
axes[0].set_title('value function without usable ace')
axes[1].set_title('value function with usable ace')
plot_blackjack(value_every, axes[0], axes[1])