Copyright **`(c)`** 2023 Giovanni Squillero `<giovanni.squillero@polito.it>`  
[`https://github.com/squillero/computational-intelligence`](https://github.com/squillero/computational-intelligence)  
Free for personal or classroom use; see [`LICENSE.md`](https://github.com/squillero/computational-intelligence/blob/master/LICENSE.md) for details.  

# LAB10

Use reinforcement learning to devise a tic-tac-toe player.

### Deadlines:

* Submission: [Dies Natalis Solis Invicti](https://en.wikipedia.org/wiki/Sol_Invictus)
* Reviews: [Befana](https://en.wikipedia.org/wiki/Befana)

Notes:

* Reviews will be assigned on Monday, December 4
* You need to commit in order to be selected as a reviewer (i.e., better to commit an empty work than not to commit)

In [3]:
import stable_baselines3 as sb3
import numpy as np
import gymnasium
import random
from sb3_contrib import ARS
from stable_baselines3 import PPO, SAC, A2C, DQN
from stable_baselines3.common.monitor import Monitor

### Model-Free vs Model-Based RL

One of the most important branching points in an RL algorithm is the question of **whether the agent has access to (or learns) a model of the environment**. By a model of the environment, we mean a function which predicts state transitions and rewards.

The main upside to having a model is that **it allows the agent to plan** by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. A particularly famous example of this approach is [AlphaZero](https://arxiv.org/abs/1712.01815). When this works, it can result in a substantial improvement in sample efficiency over methods that don’t have a model.

Algorithms which use a model are called model-based methods, and those that don’t are called model-free. While model-free methods forego the potential gains in sample efficiency from using a model, they tend to be easier to implement and tune.
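To make the notion of a model concrete, here is a minimal sketch for tic-tac-toe, assuming a flat 9-cell board encoding (0 = empty, 1 = X, 2 = O). The names `model` and `plan_one_step` are hypothetical and not part of the lab code; the "planning" is a one-step lookahead only, which is far weaker than what AlphaZero does.

```python
from typing import Optional, Tuple

Board = Tuple[int, ...]  # 9 cells: 0 = empty, 1 = X, 2 = O (assumed encoding)

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals


def model(board: Board, action: int, player: int) -> Tuple[Board, float, bool]:
    """Predict (next_board, reward, done) when `player` marks cell `action`."""
    if board[action] != 0:
        raise ValueError("cell already taken")
    nxt = board[:action] + (player,) + board[action + 1:]
    won = any(all(nxt[i] == player for i in line) for line in LINES)
    done = won or all(cell != 0 for cell in nxt)
    return nxt, (1.0 if won else 0.0), done


def plan_one_step(board: Board, player: int) -> Optional[int]:
    """Try every legal action in the model and keep the best predicted reward."""
    legal = [a for a in range(9) if board[a] == 0]
    if not legal:
        return None
    return max(legal, key=lambda a: model(board, a, player)[1])


# X holds cells 0 and 1, so the model predicts that cell 2 wins immediately.
print(plan_one_step((1, 1, 0, 2, 2, 0, 0, 0, 0), player=1))  # 2
```

A model-free agent never calls anything like `model`: it only sees the states and rewards returned by the real environment.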

#### A. What to Learn in Model-Free RL
There are two main approaches to representing and training agents with model-free RL:

1. **Policy Optimization**: Methods in this family represent a policy explicitly as $\pi_{\theta}(a|s)$. They optimize the parameters $\theta$ either directly by gradient ascent on the performance objective $J(\pi_{\theta})$, or indirectly, by maximizing local approximations of $J(\pi_{\theta})$. This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy. Policy optimization also usually involves learning an approximator $V_{\phi}(s)$ for the on-policy value function $V^{\pi}(s)$, which gets used in figuring out how to update the policy.

A couple of examples of policy optimization methods are:

- [A2C / A3C](https://arxiv.org/abs/1602.01783), which performs gradient ascent to directly maximize performance,
- [PPO](https://arxiv.org/abs/1707.06347), whose updates indirectly maximize performance, by instead maximizing a surrogate objective function which gives a conservative estimate for how much $J(\pi_{\theta})$ will change as a result of the update.
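The idea of gradient ascent on $J(\pi_{\theta})$ can be sketched on a toy, stateless problem with three actions. This is plain REINFORCE with a softmax policy, not A2C or PPO; the reward values and the names `softmax_policy` and `reinforce_step` are assumptions made for illustration.

```python
import numpy as np


def softmax_policy(theta: np.ndarray) -> np.ndarray:
    """pi_theta(a): softmax over one logit per action (stateless toy problem)."""
    z = np.exp(theta - theta.max())
    return z / z.sum()


def reinforce_step(theta, action, reward, lr=0.1):
    """One stochastic ascent step: theta += lr * R * grad_theta log pi(action)."""
    grad_log_pi = -softmax_policy(theta)
    grad_log_pi[action] += 1.0  # gradient of log-softmax = onehot(action) - pi
    return theta + lr * reward * grad_log_pi


rng = np.random.default_rng(0)
mean_reward = np.array([0.1, 0.5, 0.9])  # hypothetical expected reward per action
theta = np.zeros(3)
for _ in range(2000):
    # On-policy: every sample is drawn from the current version of the policy
    a = rng.choice(3, p=softmax_policy(theta))
    r = mean_reward[a] + 0.1 * rng.normal()
    theta = reinforce_step(theta, a, r)

print(softmax_policy(theta))  # action probabilities after training
```

Each update raises the probability of the sampled action in proportion to the reward it obtained, which is the sense in which the policy is optimized "directly".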

2. **Q-Learning**. Methods in this family learn an approximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$. Typically they use an objective function based on the [Bellman equation](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#bellman-equations). This optimization is almost always performed *off-policy*, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained. The corresponding policy is obtained via the connection between $Q^*$ and $\pi^*$: the actions taken by the Q-learning agent are given by:
$$a(s) = \arg \max_a Q_{\theta}(s,a)$$

Examples of Q-learning methods include:
- [DQN](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), a classic which substantially launched the field of deep RL,
- [C51](https://arxiv.org/abs/1707.06887), a variant that learns a distribution over return whose expectation is $Q^*$.
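As a rough illustration of the Bellman-based update and of off-policy learning, the sketch below runs tabular Q-learning (not DQN) on a hypothetical 4-state corridor. The environment, the constants, and the function names are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (arbitrary choices)


def q_update(Q, s, a, r, s_next, done, n_actions):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in range(n_actions))
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])


def greedy_action(Q, s, n_actions):
    """a(s) = argmax_a Q(s, a)"""
    return max(range(n_actions), key=lambda a: Q[(s, a)])


def corridor_step(s, a):
    """Hypothetical corridor 0..3: action 1 moves right, action 0 moves left.
    Reaching state 3 yields reward 1 and ends the episode."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 3 else 0.0), s_next == 3


random.seed(0)
Q = defaultdict(float)
for _ in range(500):
    s, done = 0, False
    while not done:
        a = random.randrange(2)  # purely random behaviour policy
        s_next, r, done = corridor_step(s, a)
        q_update(Q, s, a, r, s_next, done, n_actions=2)
        s = s_next

print([greedy_action(Q, s, 2) for s in range(3)])  # greedy policy per state
```

The data is collected by a random policy, yet the greedy policy extracted from `Q` heads straight for the goal: the update does not care how the transitions were generated.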

**Trade-offs Between Policy Optimization and Q-Learning**. The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable. By contrast, Q-learning methods only indirectly optimize for agent performance, by training $Q_{\theta}$ to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable. But, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques.

**Interpolating Between Policy Optimization and Q-Learning**. Serendipitously, policy optimization and Q-learning are not incompatible (and under some circumstances, it turns out, equivalent), and there exists a range of algorithms that live in between the two extremes. Algorithms that live on this spectrum are able to carefully trade off the strengths and weaknesses of either side. Examples include:

- [DDPG](https://arxiv.org/abs/1509.02971), an algorithm which concurrently learns a deterministic policy and a Q-function by using each to improve the other,
- [SAC](https://arxiv.org/abs/1801.01290), a variant which uses stochastic policies, entropy regularization, and a few other tricks to stabilize learning and score higher than DDPG on standard benchmarks.

#### B. What to Learn in Model-Based RL

Unlike **model-free RL**, there aren’t a small number of easy-to-define clusters of methods for model-based RL: there are many orthogonal ways of using models. We’ll give a few examples, but the list is far from exhaustive. In each case, the model may either be given or learned.

**Background: Pure Planning**. The most basic approach never explicitly represents the policy, and instead, uses pure planning techniques like [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control) (MPC) to select actions. In MPC, each time the agent observes the environment, it computes a plan which is optimal with respect to the model, where the plan describes all actions to take over some fixed window of time after the present. (Future rewards beyond the horizon may be considered by the planning algorithm through the use of a learned value function.) The agent then executes the first action of the plan, and immediately discards the rest of it. It computes a new plan each time it prepares to interact with the environment, to avoid using an action from a plan with a shorter-than-desired planning horizon.

The [MBMF](https://sites.google.com/view/mbmf) work explores MPC with learned environment models on some standard benchmark tasks for deep RL.
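The plan, execute-first-action, re-plan loop of MPC can be sketched on a toy problem. The dynamics, the cost, and the exhaustive search over plans are all assumptions made for illustration (real MPC uses far more efficient optimizers, and MBMF uses a learned model).

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical controls: step left, stay, step right


def model(x, u):
    """Assumed known dynamics of the toy problem: the control shifts the position."""
    return x + u


def plan_cost(x, plan):
    """Simulate `plan` in the model and add up the squared distance from x = 0."""
    total = 0
    for u in plan:
        x = model(x, u)
        total += x * x
    return total


def mpc_action(x, horizon=4):
    """Find the best plan over the horizon and return only its first action."""
    best_plan = min(product(ACTIONS, repeat=horizon),
                    key=lambda plan: plan_cost(x, plan))
    return best_plan[0]  # the rest of the plan is discarded


x = 4
trajectory = [x]
for _ in range(6):
    x = model(x, mpc_action(x))  # re-plan from scratch at every step
    trajectory.append(x)
print(trajectory)  # [4, 3, 2, 1, 0, 0, 0]
```

No policy is ever stored: the action is recomputed from the model each time the agent acts.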

**Expert Iteration**. A straightforward follow-on to pure planning involves using and learning an explicit representation of the policy, $\pi_{\theta}(a|s)$. The agent uses a planning algorithm (like Monte Carlo Tree Search) in the model, generating candidate actions for the plan by sampling from its current policy. The planning algorithm produces an action which is better than what the policy alone would have produced, hence it is an “expert” relative to the policy. The policy is afterwards updated to produce an action more like the planning algorithm’s output.

- The [ExIt](https://arxiv.org/abs/1705.08439) algorithm uses this approach to train deep neural networks to play Hex.
- [AlphaZero](https://arxiv.org/abs/1712.01815) is another example of this approach.

**Data Augmentation for Model-Free Methods**. Use a model-free RL algorithm to train a policy or Q-function, but either 1) augment real experiences with fictitious ones in updating the agent, or 2) use only fictitious experience for updating the agent.

- See [MBVE](https://arxiv.org/abs/1803.00101) for an example of augmenting real experiences with fictitious ones.

![Image](https://spinningup.openai.com/en/latest/_images/rl_algorithms_9_15.svg) 


REFERENCES: \
[1] [OPENAI](https://spinningup.openai.com/en/latest/index.html) \
[2] [HUGGING FACE](https://huggingface.co/learn/deep-rl-course/unit0/introduction)


__________

**LET'S START WORKING!**

After this introduction, based on the references mentioned above, our goal is to implement the tic-tac-toe game with several techniques/algorithms and then benchmark them according to some metrics.

In this work we present two variants of the `step` function: in the first, the agent is completely free to choose any action (including invalid ones), while in the second it is restricted to the actions that are actually allowed, _aka_ action masking.
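One common way to apply action masking at decision time is to discard the scores of illegal actions before picking the best one. The sketch below assumes the agent exposes one score per cell; `valid_actions`, `masked_argmax`, and the score values are hypothetical. Note that the environment in this notebook takes a different route and replaces an invalid action with a random valid one.

```python
import numpy as np


def valid_actions(board: np.ndarray) -> np.ndarray:
    """Boolean mask over the 9 actions: True where the cell is empty (0)."""
    return board.reshape(-1) == 0


def masked_argmax(scores: np.ndarray, mask: np.ndarray) -> int:
    """Greedy action among the valid ones; invalid scores are set to -inf."""
    if not mask.any():
        raise ValueError("no valid action left")
    return int(np.argmax(np.where(mask, scores, -np.inf)))


board = np.array([[1, 2, 0],
                  [0, 1, 0],
                  [2, 0, 0]])
# Hypothetical agent preferences, one score per cell 0..8
scores = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.3, 0.6, 0.5, 0.4])

# Cells 0, 1, 4 and 6 score highest but are occupied, so cell 7 is chosen.
print(masked_argmax(scores, valid_actions(board)))  # 7
```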

Our agents are going to learn against a random player, while at the end they will be tested against the TTTBOT, our bot that uses the 15-sum method for winning.

Finally, the agents are compared to select the two best ones, and the user is allowed to play against them.

```text

 Board Position:    Values:
  0 | 1 | 2          2 | 9 | 4
 -----------       -----------
  3 | 4 | 5          7 | 5 | 3
 -----------       -----------
  6 | 7 | 8          6 | 1 | 8

```
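The values above form a 3x3 magic square: every row, column, and diagonal sums to 15, and no other triple of distinct cells does. A win check can therefore be written as a search for three owned cells whose values add up to 15. The sketch below is only an assumption about how a 15-sum bot could work; `has_won` and `winning_move` are hypothetical names, not the TTTBOT implementation.

```python
from itertools import combinations

# Magic-square value of each board position 0..8 (from the layout above)
MAGIC = [2, 9, 4, 7, 5, 3, 6, 1, 8]


def has_won(positions):
    """True if any three of the player's cells have values summing to 15."""
    values = [MAGIC[p] for p in positions]
    return any(sum(trio) == 15 for trio in combinations(values, 3))


def winning_move(mine, free):
    """Return a free position that completes a 15-sum for `mine`, else None."""
    for p in free:
        if has_won(list(mine) + [p]):
            return p
    return None


print(has_won([0, 4, 8]))                # True: 2 + 5 + 8 = 15 (main diagonal)
print(has_won([0, 1, 3]))                # False: 2 + 9 + 7 = 18
print(winning_move([0, 1], [2, 3, 4]))   # 2: completes the top row
```

The same `winning_move` call with the opponent's cells as `mine` gives the move to block.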

#### Preparatory Step: Build the environment

Below you can find the implementation of the environment.
The ```TicTacToeEnv``` is made of:\
a. Initializer: initializes all required attributes\
b. A step function (1): masks disallowed moves; if the agent selects an invalid action, a random valid action is performed instead\
c. A step function (2): if the action is not allowed, a penalty is given and the episode is marked as terminated\
d. Render function: for logging during training and testing\
e. Reset function: resets the environment after an episode

**NOTE**: To ensure randomness, our agent is assigned 'X' or 'O' at random on every reset, which also determines whether it moves first or second.

In [4]:
import logging
logging.basicConfig(level=logging.INFO)

In [5]:
class TicTacToeEnv(gymnasium.Env):
    def __init__(self, player='X', step_variant='without_masking', render_mode='human'):
        super(TicTacToeEnv, self).__init__()

        # Define the observation space and action space
        self.observation_space = gymnasium.spaces.Box(
            low=0, high=2, shape=(3, 3), dtype=int)
        self.action_space = gymnasium.spaces.Discrete(9)
        self.state = np.zeros((3, 3), dtype=int)

        self.player = player  # X or O
        self.player_index = 1 if player == 'X' else 2  # 1 or 2 (X or O)
        self.opponent = 'O' if player == 'X' else 'X'
        self.opponent_index = 2 if player == 'X' else 1  # 2 or 1 (O or X)

        self.current_player = 1

        self.initial_step_variant = step_variant
        self.step_variant = step_variant

        self.reset()

    def switch_player(self):
        self.current_player = 1 if self.current_player == 2 else 2

    def action_conversion(self, action):
        # Convert the action from the agent to the state
        row, col = action // 3, action % 3
        return row, col

    def random_valid_move(self):
        # Used by opponent to perform a random move
        valid = False
        while not valid:
            valid_move = random.choice(range(9))
            if self.state[valid_move // 3][valid_move % 3] == 0:
                valid = True
        return valid_move

    def action_masking(self):
        # Used by the agent to mask invalid actions
        valid_moves = []
        for i in range(9):
            if self.state[i // 3][i % 3] == 0:
                valid_moves.append(i)
        return valid_moves

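    def check_draw(self):
        # A draw is a full board on which neither player has three in a row
        # (step() calls this before the win checks, so a winning last move must not count as a draw)
        return bool(np.all(self.state != 0)
                    and not self.check_player_win(1)
                    and not self.check_player_win(2))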

    def check_player_win(self, player_index):
        # True if `player_index` fills any row, any column, or either diagonal.
        # (The previous loop overwrote its flag, so only the last row/column was checked.)
        marks = self.state == player_index
        return bool(
            marks.all(axis=1).any()  # any complete row
            or marks.all(axis=0).any()  # any complete column
            or np.diag(marks).all()  # main diagonal
            or np.diag(np.fliplr(marks)).all()  # anti-diagonal
        )

    # The episode ends when the game ends or when the agent performs an invalid move
    def step_without_masking(self, action):
        reward = 0
        done = False
        truncated = False

        # Action conversion
        row, col = self.action_conversion(action)

        # Player 1 turn
        if self.current_player == 1:
            if self.current_player == self.player_index:  # The Agent plays as X
                if self.state[row, col] == 0:  # Action is valid
                    self.state[row, col] = 1
                else:
                    reward = -7
                    done = True
                    logging.info(f"Invalid move by the agent")
                    return self.state, reward, done, truncated, {}

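            else:  # The opponent plays as X
                # Use separate names so the agent's (row, col) survives for the Player 2 turn below
                move = self.random_valid_move()
                opp_row, opp_col = self.action_conversion(move)
                self.state[opp_row, opp_col] = 1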

        if self.check_draw():
            reward = 0
            done = True
            logging.info(f"Draw")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.player_index):
            reward = 10
            done = True
            logging.info(f"Agent won")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.opponent_index):
            reward = -10
            done = True
            logging.info(f"Opponent won")
            return self.state, reward, done, truncated, {}

        self.switch_player()

        # Player 2 turn
        if self.current_player == 2:
            if self.current_player == self.player_index:  # The Agent plays as O
                if self.state[row, col] == 0:  # Action is valid
                    self.state[row, col] = 2
                else:
                    reward = -7
                    done = True
                    logging.info(f"Invalid move by the agent")
                    return self.state, reward, done, truncated, {}

            else:  # The opponent plays as O
                move = self.random_valid_move()
                row, col = self.action_conversion(move)
                self.state[row, col] = 2

        if self.check_draw():
            reward = 0
            done = True
            logging.info(f"Draw")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.player_index):
            reward = 10
            done = True
            logging.info(f"Agent won")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.opponent_index):
            reward = -10
            done = True
            logging.info(f"Opponent won")
            return self.state, reward, done, truncated, {}

        self.switch_player()

        return self.state, reward, done, truncated, {}

    # Invalid actions are replaced by a random valid move before being applied
    def step_with_masking(self, action):
        reward = 0
        done = False
        truncated = False

        valid_moves = self.action_masking()
        if action not in valid_moves:
            action = random.choice(valid_moves)

        # Action conversion
        row, col = self.action_conversion(action)

        # Player 1 turn
        if self.current_player == 1:
            if self.current_player == self.player_index:  # The Agent plays as X
                if self.state[row, col] == 0:  # Action is valid
                    self.state[row, col] = 1
                else:
                    reward = -7
                    done = True
                    logging.info(f"Invalid move by the agent")
                    return self.state, reward, done, truncated, {}

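            else:  # The opponent plays as X
                # Use separate names so the agent's (row, col) survives for the Player 2 turn below
                move = self.random_valid_move()
                opp_row, opp_col = self.action_conversion(move)
                self.state[opp_row, opp_col] = 1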

        if self.check_draw():
            reward = 0
            done = True
            logging.info(f"Draw")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.player_index):
            reward = 10
            done = True
            logging.info(f"Agent won")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.opponent_index):
            reward = -10
            done = True
            logging.info(f"Opponent won")
            return self.state, reward, done, truncated, {}

        self.switch_player()

        # Player 2 turn
        if self.current_player == 2:
            if self.current_player == self.player_index:  # The Agent plays as O
                if self.state[row, col] == 0:  # Action is valid
                    self.state[row, col] = 2
                else:
                    reward = -7
                    done = True
                    logging.info(f"Invalid move by the agent")
                    return self.state, reward, done, truncated, {}

            else:  # The opponent plays as O
                move = self.random_valid_move()
                row, col = self.action_conversion(move)
                self.state[row, col] = 2

        if self.check_draw():
            reward = 0
            done = True
            logging.info(f"Draw")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.player_index):
            reward = 10
            done = True
            logging.info(f"Agent won")
            return self.state, reward, done, truncated, {}

        if self.check_player_win(self.opponent_index):
            reward = -10
            done = True
            logging.info(f"Opponent won")
            return self.state, reward, done, truncated, {}

        self.switch_player()

        return self.state, reward, done, truncated, {}

    def step(self, action):
        if self.step_variant == 'without_masking':
            return self.step_without_masking(action)
        elif self.step_variant == 'with_masking':
            return self.step_with_masking(action)
        else:
            raise ValueError("Invalid step variant")

    def reset(self, seed=None, options=None):
        # Pass the caller's seed through instead of overwriting it with a random one
        super().reset(seed=seed)
        # Reset the state of the environment to an initial state
        self.state = np.zeros((3, 3), dtype=int)

        symbol = random.choice(['X', 'O'])

        self.player = 'X' if symbol == 'X' else 'O'  # X or O
        self.player_index = 1 if symbol == 'X' else 2  # 1 or 2 (X or O)

        self.opponent = 'O' if symbol == 'X' else 'X'
        self.opponent_index = 2 if symbol == 'X' else 1  # 2 or 1 (O or X)
        self.current_player = 1

        self.step_variant = self.initial_step_variant

        info = {}
        return self.state, info

    def render(self, render_mode='human'):
        symbol_map = {0: ' ', 1: 'X', 2: 'O'}
        print("\n")
        print(
            f"----- Agent Plays as {self.player} - {self.player_index} ------\n")
        for row in range(3):
            print(" ", end="")
            for col in range(3):
                # Print the symbol for each cell
                symbol = symbol_map[self.state[row, col]]
                print(symbol, end="")
                if col < 2:
                    print(" | ", end="")
            print("\n")
            if row < 2:
                print("-----------")
        print("\n")

#### 1. Model-Free RL

#### Part 1: With or Without Action Masking (AM): Which Is Better?
Since training and testing take a long time on my PC, we make a small comparison between two RL algorithms, each trained and tested with and without the action-masking step function (same number of timesteps), and then see which variant survives to be used with the other techniques.

1. Deep Q-Learning (DQN) without Action Masking
2. Deep Q-Learning (DQN) with Action Masking
3. Advantage-Actor Critic (A2C) without Action Masking
4. Advantage-Actor Critic (A2C) with Action Masking

The winner will then be trained also on: 

5. Augmented Random Search (ARS) 
6. Proximal Policy Optimization (PPO)
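One possible way to compare the variants is to turn episode returns into outcome rates. In this environment every non-terminal step gives reward 0, so the return of an episode equals its terminal reward (10 win, 0 draw, -10 loss, -7 invalid move). The helper below is a hypothetical sketch built on that observation; `summarize` is not part of the notebook, and the sample returns are made up.

```python
from collections import Counter

# Terminal rewards used by TicTacToeEnv, mapped to a readable outcome
OUTCOME_BY_RETURN = {10: "win", 0: "draw", -10: "loss", -7: "invalid"}


def summarize(episode_returns):
    """Return the fraction of episodes ending in each outcome."""
    if not episode_returns:
        raise ValueError("no episodes to summarize")
    counts = Counter(OUTCOME_BY_RETURN[int(r)] for r in episode_returns)
    n = len(episode_returns)
    return {name: counts[name] / n for name in OUTCOME_BY_RETURN.values()}


# Made-up returns for four episodes, only to show the output format
print(summarize([10, 10, 0, -7]))
# {'win': 0.5, 'draw': 0.25, 'loss': 0.0, 'invalid': 0.25}
```

The invalid-move rate is the most direct indicator of what action masking changes, so it is worth tracking separately from losses.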

In [26]:
env_noMasking = TicTacToeEnv(player='X', step_variant='without_masking')
env_Masking = TicTacToeEnv(player='X', step_variant='with_masking')

DQN_monitored_env_1 = Monitor(env_noMasking, "D:\\Github\\computational-intelligence\\2023-24\\DQN_monitored_env_1.csv")
DQN_monitored_env_2 = Monitor(env_Masking,"D:\\Github\\computational-intelligence\\2023-24\\DQN_monitored_env_2.csv")

A2C_monitored_env_1 = Monitor(env_noMasking, "D:\\Github\\computational-intelligence\\2023-24\\A2C_monitored_env_1.csv")
A2C_monitored_env_2 = Monitor(env_Masking, "D:\\Github\\computational-intelligence\\2023-24\\A2C_monitored_env_2.csv")

In [5]:
################################# Training #####################################

In [27]:
timesteps_per_epoch = 150_000

model_DQN_1 = DQN('MlpPolicy', DQN_monitored_env_1,
                 tensorboard_log="./logs/", verbose=0)

model_DQN_2 = DQN('MlpPolicy', DQN_monitored_env_2,
                 tensorboard_log="./logs/", verbose=0)

model_A2C_1 = A2C('MlpPolicy', A2C_monitored_env_1,
                 tensorboard_log="./logs/", verbose=0)

model_A2C_2 = A2C('MlpPolicy', A2C_monitored_env_2,
                 tensorboard_log="./logs/", verbose=0)

model_DQN_1.learn(total_timesteps=timesteps_per_epoch, tb_log_name="DQN_without_masking")
model_DQN_1.save("D:\\Github\\computational-intelligence\\2023-24\\DQN_1")

model_DQN_2.learn(total_timesteps=timesteps_per_epoch, tb_log_name="DQN_with_masking")
model_DQN_2.save("D:\\Github\\computational-intelligence\\2023-24\\DQN_2")

model_A2C_1.learn(total_timesteps=timesteps_per_epoch, tb_log_name="A2C_without_masking")
model_A2C_1.save("D:\\Github\\computational-intelligence\\2023-24\\A2C_1")

model_A2C_2.learn(total_timesteps=timesteps_per_epoch, tb_log_name="A2C_with_masking")
model_A2C_2.save("D:\\Github\\computational-intelligence\\2023-24\\A2C_2")

INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Opponent won
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO:root:Invalid move by the agent
INFO: