# Reinforcement Learning
Prof. Milica Gašić

### Monte Carlo prediction

The idea of Monte Carlo prediction is very simple: estimate the value (or the action value) by averaging the observed returns from collected episodes. In this notebook we apply Monte Carlo prediction to the game of tic-tac-toe.
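
The averaging can be done incrementally, without storing all returns. The following is a minimal sketch of that running-average update; the function name `incremental_mean` and the sample returns are made up for illustration and are not part of `rl_env`.

```python
import numpy as np


def incremental_mean(returns):
    """Running average via the update V <- V + (G - V) / N."""
    value, count = 0.0, 0
    for g in returns:
        count += 1
        value += (g - value) / count
    return value


# Hypothetical returns from eight episodes (1 = win, 0 = otherwise)
sample_returns = [1, 0, 1, 1, 0, 1, 0, 1]
estimate = incremental_mean(sample_returns)
batch_average = float(np.mean(sample_returns))
print(estimate, batch_average)
```

Both numbers agree, since the incremental update is just a rearrangement of the ordinary sample mean.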

#### Implementation

Make sure that the file `rl_env.py` is in the same folder as the notebook.

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import rl_env

We already implemented tic-tac-toe in `TicTacToeEnv`:
- The environment has $3^9 = 19683$ states (9 fields with 3 values: empty, player 1, player 2).
- There are $9$ possible actions, which determine the next move of the current player (i.e. the actions control both players).
- The final reward is $1$ if player 1 wins, and $0$ if player 2 wins or when there is a draw. The reward is $0$ in all other time steps.
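
To see where the count $3^9$ comes from, a board can be read as a 9-digit number in base 3. The sketch below is only an illustration: how `TicTacToeEnv` actually encodes its states is not shown here, and the helpers `encode_board` and `decode_board` are hypothetical. Note that the count also includes boards that can never occur in a legal game.

```python
def encode_board(board):
    """Map 9 fields (0 = empty, 1 = player 1, 2 = player 2) to an int in [0, 3**9)."""
    index = 0
    for field in board:
        index = index * 3 + field
    return index


def decode_board(index):
    """Inverse of encode_board: recover the 9 fields from the integer index."""
    board = []
    for _ in range(9):
        index, field = divmod(index, 3)
        board.append(field)
    return board[::-1]  # digits were extracted last field first


example = [1, 0, 2, 0, 1, 0, 2, 0, 1]
print(encode_board(example), decode_board(encode_board(example)))
```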

In [2]:
# Create an instance of the tic-tac-toe environment
env = rl_env.TicTacToeEnv()

We already implemented the random policy for the tic-tac-toe environment:

In [3]:
def random_policy(state):
    # Obtain the list of empty fields
    valid_actions = rl_env.TicTacToeEnv.get_valid_actions(state)
    # Select one of the empty fields randomly
    # For non-empty fields, the action does not have an effect
    action = np.random.choice(valid_actions)
    return action

Your task is to implement Monte Carlo prediction of the action value for the **initial state**, i.e. you don't need to compute the action values for all states, but only the $9$ action values for the initial state.  
We don't need a discount factor, so the initial return is equal to the final reward.
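
As a quick check of this statement, here is a small sketch that computes the return from a reward sequence. The helper `discounted_return` and the reward list are assumptions made for illustration. With $\gamma = 1$ and all intermediate rewards equal to $0$, the return from the initial state is just the final reward.

```python
def discounted_return(rewards, gamma=1.0):
    """Return G_0 = r_1 + gamma * r_2 + gamma**2 * r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical five-move game that player 1 wins on the last move
rewards = [0, 0, 0, 0, 1]
undiscounted = discounted_return(rewards)           # equals the final reward
discounted = discounted_return(rewards, gamma=0.9)  # equals 0.9 ** 4
print(undiscounted, discounted)
```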

You don't need an `Agent` object for this implementation, just generate episodes and estimate the action values.

In [4]:
#######################################################################
# TODO: Implement Monte Carlo prediction of the action value function #
# for the initial state as described above. Generate at least 10000   #
# episodes to estimate the action values.                             #
#######################################################################

def generate_episode():
    # Play one game with the random policy.
    # Returns [s0, (a0, r1, s1), (a1, r2, s2), ...].
    s, _ = env.reset()
    episode = [s]
    terminated = False

    while not terminated:
        a = random_policy(s)
        s, r, terminated, *_ = env.step(a)
        episode.append((a, r, s))

    return episode


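# v[a] is the running estimate of the action value of action a in the
# initial state; n[a] counts how often a was chosen as the first move.
v = np.zeros(9)
n = np.zeros(9)
for _ in range(10000):
    episode = generate_episode()
    a0 = episode[1][0]  # first action of the episode
    g = episode[-1][1]  # final reward = return (no discounting)
    n[a0] += 1
    v[a0] += (g - v[a0]) / n[a0]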

v.reshape((3, 3))

#######################################################################
# End of your code.                                                   #
#######################################################################

array([[0.6460018 , 0.54954955, 0.59326661],
       [0.52037037, 0.67857143, 0.51212938],
       [0.59192825, 0.53027682, 0.64351005]])
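
These numbers are sample averages, so they vary from run to run. Since each return is either $0$ or $1$, a rough feeling for the noise comes from the standard error of a Bernoulli mean. The sketch below assumes that the 10000 episodes are spread evenly over the 9 opening moves, which holds only approximately; `bernoulli_standard_error` is a hypothetical helper.

```python
import math


def bernoulli_standard_error(p_hat, n):
    """Standard error of the mean of n independent 0/1 samples."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# Assumption: each opening move is sampled about 10000 / 9 times
episodes_per_action = 10000 / 9
se_center = bernoulli_standard_error(0.6786, episodes_per_action)
print(se_center)
```

Under these assumptions the standard error is on the order of $0.01$, so small differences between neighbouring entries should not be over-interpreted; generating more episodes reduces the noise.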

Since the reward is $1$ only if player 1 wins, each action value of the initial state is equal to the probability that player 1 wins after that opening move.  
Use this to answer the following questions:
- What is the probability that the first player wins?
- Which initial action has the highest chance of winning?

In [5]:
#######################################################################
# TODO: Answer the questions by using the computed action values.     #
#######################################################################
v.max(), v.argmax()
#######################################################################
# End of your code.                                                   #
#######################################################################

(0.6785714285714286, 4)

About 67.9% (with the best opening move), and the best opening move is the center (action 4).
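
Under the random policy the opening move itself is chosen uniformly from the nine fields, so the value of the initial state is the mean of the nine action values. The sketch below uses the estimates printed above (from one run, so the digits will differ between runs) and assumes the fields are indexed row by row, as the `reshape((3, 3))` call suggests.

```python
import numpy as np

# Action-value estimates copied from the output above (one run, 10000 episodes)
q0 = np.array([
    0.6460018, 0.54954955, 0.59326661,
    0.52037037, 0.67857143, 0.51212938,
    0.59192825, 0.53027682, 0.64351005,
])

state_value = q0.mean()           # winning probability with a random opening move
best_action = int(q0.argmax())    # opening move with the highest estimate
row, col = divmod(best_action, 3) # assumption: row-major field indexing
print(state_value, best_action, (row, col))
```

With these estimates, player 1 wins about 58.5% of fully random games, compared to about 67.9% when opening in the center.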