## **AlphaZero Algorithm**

#### Imports

In [None]:
import numpy as np

import math

import torch

import torch.nn as nn

import torch.nn.functional as F

torch.manual_seed(0) # set the seed as 0

from tqdm.notebook import trange

import random

import matplotlib.pyplot as plt

from torch.utils.tensorboard import SummaryWriter


In [None]:
%run MCTS_Go.ipynb
%run NeuralNetwork_Go.ipynb    # CNN model for the Go game

### AlphaZero Class

The code below implements the AlphaZero algorithm for Go and evaluates both how well it plays and how efficiently it learns. The primary evaluation metrics are the loss metrics, which are monitored during training to gauge how well the model learns from self-play data. TensorBoard is used to log and visualize the results.

The Loss Metrics include:

**Policy Loss:**
- This loss measures how well the policy head of the neural network predicts which move to play at each step. In games, the policy is a probability distribution over the possible moves.
- The policy loss is calculated as the cross-entropy between the predicted probabilities and the move distribution that MCTS produced during self-play.
- In AlphaZero, the policy network guides the search by providing a prior probability to the Monte Carlo Tree Search (MCTS) algorithm.

**Value Loss:**
- The value loss measures how accurately the value head of the neural network estimates the expected outcome (win, loss, or draw) from a given board state.
- It is usually calculated using mean squared error (MSE) loss, which compares the predicted value to the actual game outcome.
- In AlphaZero, the value estimate helps the MCTS evaluate board states without having to simulate all the way to the end of the game.

**Total Loss:**
- The total loss is the sum of the policy loss and the value loss. It represents the overall performance of the neural network in both predicting the next best move (policy) and estimating the game's outcome from the current position (value).
- Minimizing the total loss is the goal of the training process, as it leads to improvements in the model's policy and value predictions, which should translate into stronger gameplay performance.
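
As a rough illustration of the three quantities above, the sketch below computes them by hand for a single position with three possible moves. It uses only the standard library. The function names (`softmax`, `policy_cross_entropy`, `value_mse`) and all numbers are made up for this example; the notebook itself relies on `F.cross_entropy` and `F.mse_loss`.

```python
import math

def softmax(logits):
    """Turn raw policy-head scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_cross_entropy(logits, target_probs):
    """Cross-entropy between a target move distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

def value_mse(predicted_value, outcome):
    """Squared error between the predicted value and the game outcome."""
    return (predicted_value - outcome) ** 2

# Hypothetical numbers for one position with three possible moves.
logits = [2.0, 0.5, -1.0]          # raw policy-head output
target_probs = [0.7, 0.2, 0.1]     # move distribution recorded during self-play
predicted_value, outcome = 0.3, 1  # value-head output and final result (+1 = win)

policy_loss = policy_cross_entropy(logits, target_probs)
value_loss = value_mse(predicted_value, outcome)
total_loss = policy_loss + value_loss
print(f"policy={policy_loss:.4f} value={value_loss:.4f} total={total_loss:.4f}")
```

With a batch of positions, each loss would be averaged over the batch before the two are added.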


The code processes data as follows:

- Model predictions are obtained by passing the batch of encoded game states through the model (`self.model(state)`).
- `policy_loss` is calculated by comparing the model's policy output to the target policy probabilities using cross-entropy loss.
- `value_loss` is computed by comparing the model's value output to the actual game outcome using mean squared error loss.
- `total_loss` is the sum of `policy_loss` and `value_loss`; it is the quantity the optimizer minimizes.
- `zero_grad()`, `backward()`, and `step()` clear old gradients, compute new gradients, and update the model parameters.
- The lists `policy_losses`, `value_losses`, and `total_losses` store the per-batch losses for visualization during training.

This code is encapsulated within a Python class called AlphaZero, which contains methods for self-play, training, visualization of losses, and learning.
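
One detail of the training step that is easy to get wrong is how the shuffled memory is cut into mini-batches. The following is a minimal sketch rather than the notebook's code: `iter_batches` is a hypothetical helper, and the integers stand in for the (state, policy, outcome) samples.

```python
import random

def iter_batches(memory, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(memory), batch_size):
        # Slicing past the end of a list is safe in Python, so no clamping is needed.
        yield memory[start:start + batch_size]

random.seed(0)             # fixed seed so the shuffle is repeatable
memory = list(range(10))   # stand-in for 10 training samples
random.shuffle(memory)     # shuffle once per epoch

batches = list(iter_batches(memory, batch_size=4))
print([len(batch) for batch in batches])  # batch sizes: [4, 4, 2]
```

Every sample ends up in exactly one batch, so one pass over the batches corresponds to one epoch.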

In [None]:

class AlphaZero:

    def __init__(self, model, optimizer, game, args):
        self.model = model
        self.game = game 
        self.optimizer = optimizer
        self.args = args
        self.mcts = MCTS(game, args, 1, model)

        # To start TensorBoard, run this command in a terminal: tensorboard --logdir=runs
        self.writer = SummaryWriter('runs/alphazero_experiment')      

    # def create_mov(self, action, player):
    #     # Extracts xi, yi, xf, yf from action
    #     xi = int(action[0])
    #     yi = int(action[1])
    #     xf = int(action[3])
    #     yf = int(action[4])
    
    #     mov = game.movement(xi, yi, xf, yf, player)
    #     return mov
    
    # Function to play a game of self-play
    def selfPlay(self):
        self.game = GoGame(9)  # Reset the game state (9x9 board)
        player = 1  # Player to start the game
        self.mcts = MCTS(self.game, self.args, player, self.model)  # Reset MCTS

        memory = []  # List to store the game memory
        state = self.game.get_initial_state()  # Get the initial state of the game
        last_move = None  # Variable to store the last move

        move_count = 0  # Track the number of moves
        top_move_count = 0  # Track how often the top move is chosen

        while True:
            neutral_state = self.game.change_perspective(state, player)
            action_probs = self.mcts.search(neutral_state)

            valid_moves = self.game.get_valid_moves(state, player)
            action_probs = [prob if index in valid_moves else 0 for index, prob in enumerate(action_probs)]
            total_prob = sum(action_probs)
            if total_prob > 0:
                action_probs = [prob / total_prob for prob in action_probs]

            action = np.random.choice(self.game.action_size + 1, p=action_probs)
            print(action_probs)

            # Check if the current move is the same as the last move
            if action == 81 and last_move == 81:  # Assuming 81 is the 'pass' move
                break  # Break out of the loop if two consecutive passes occur

            move_count += 1
            if action == np.argmax(action_probs):
                top_move_count += 1

            state = self.game.get_next_state(state, action, player)
            value, _, is_terminal = self.game.get_value_and_terminated(state, action, player)

            memory.append((neutral_state, action_probs, player))

            if is_terminal:
                break  # Break out of the loop if the game reaches a terminal state

            last_move = action  # Update the last move
            player = -player  # Switch player

        game_length = move_count
        top_move_ratio = top_move_count / move_count if move_count > 0 else 0

        returnMemory = []
        for hist_neutral_state, hist_action_probs, hist_player in memory:
            hist_outcome = value if hist_player == player else -value  # Negate value for opponent
            returnMemory.append((
                self.game.get_encoded_state(hist_neutral_state),
                hist_action_probs,
                hist_outcome
            ))

        return returnMemory, value, game_length, top_move_ratio




    # Function to train the model
    def train(self, memory):
        random.shuffle(memory) # Shuffle the training data

        # Initialize lists to store losses for visualization
        policy_losses = [] 
        value_losses = []
        total_losses = []

        # Loop through the memory in batches 
        for batchIdx in range(0, len(memory), self.args['batch_size']): 
            sample = memory[batchIdx:batchIdx + self.args['batch_size']] # Get the batch of data (slicing stops at the end of the list, so the last sample is included)
            state, policy_targets, value_targets = zip(*sample) # Transpose the data

            state, policy_targets, value_targets = np.array(state), np.array(policy_targets), np.array(value_targets).reshape(-1, 1) # Convert the data to numpy arrays

            state = torch.tensor(state, dtype=torch.float32) # Convert the state to a tensor
            policy_targets = torch.tensor(policy_targets, dtype=torch.float32) # Convert the policy targets to a tensor
            value_targets = torch.tensor(value_targets, dtype=torch.float32) # Convert the value targets to a tensor

            out_policy, out_value = self.model(state) # Get the output policy and value from the model
 
            # Calculate losses
            policy_loss = F.cross_entropy(out_policy, policy_targets) 
            value_loss = F.mse_loss(out_value, value_targets)
            total_loss = policy_loss + value_loss


            # Log losses
            policy_losses.append(policy_loss.item())
            value_losses.append(value_loss.item())
            total_losses.append(total_loss.item())

            # Backpropagate and optimize the model
            self.optimizer.zero_grad() 
            total_loss.backward() 
            self.optimizer.step()

        return policy_losses, value_losses, total_losses # Return the losses

    # Function to visualize the losses
    def visualize_losses(self, policy_losses, value_losses, total_losses):
        plt.figure(figsize=(10, 5)) # Set the figure size
        plt.plot(policy_losses, label='Policy Loss') # Plot the policy losses
        plt.plot(value_losses, label='Value Loss') # Plot the value losses
        plt.plot(total_losses, label='Total Loss')  # Plot the total losses
        plt.xlabel('Training Steps') # Set the x label
        plt.ylabel('Loss') # Set the y label
        plt.title('Training Loss Over Time') # Set the title
        plt.legend() # Show the legend
        plt.show() # Show the plot

    # Function to visualize the performance
    def visualize_performance(self, win_rates):
        plt.figure(figsize=(10, 5)) # Set the figure size
        plt.plot(win_rates, label='Win Rate') # Plot the win rates
        plt.xlabel('Iterations') # Set the x label
        plt.ylabel('Win Rate') # Set the y label
        plt.title('Win Rate Over Iterations') # Set the title
        plt.legend() # Show the legend
        plt.show() # Show the plot


    # Function to learn the model
    def learn(self):
        all_win_rates = []  # Win rate of each iteration, kept across iterations for plotting

        # For all iterations of learning
        for iterations in range(self.args['num_iterations']):

            memory = []  # List to store the game memory
            outcomes = []  # List to store the outcomes of each game
            game_lengths = []  # Store the length of each game
            top_move_ratios = []  # Store the move quality metric for each game

            self.model.eval()  # Set the model to evaluation mode
            for _ in trange(self.args['num_selfPlay_iterations']):  # For each iteration of self-play

                game_memory, game_outcome, game_length, top_move_ratio = self.selfPlay()  # Play a game of self-play
                memory += game_memory  # Append the game memory to the memory list
                outcomes.append(game_outcome)  # Append the outcome of the game
                game_lengths.append(game_length)  # Append the length of the game
                top_move_ratios.append(top_move_ratio)  # Append the move quality metric

            self.model.train()  # Set the model to training mode

            # Calculate the averages
            avg_game_length = sum(game_lengths) / len(game_lengths)
            avg_top_move_ratio = sum(top_move_ratios) / len(top_move_ratios)

            # Calculate the win rate
            if len(outcomes) > 0:  # If there are outcomes
                wins = outcomes.count(1)  # Count the number of wins
                losses = outcomes.count(-1)  # Count the number of losses
                draws = outcomes.count(0)  # Count the number of draws
                win_rate = wins / len(outcomes)  # Calculate the win rate
            else:  # If there are no outcomes
                win_rate = 0  # Set the win rate to 0
            all_win_rates.append(win_rate)  # One data point per learning iteration

            print(f"Iteration {iterations}: Win rate: {win_rate*100:.2f}%")  # Print the win rate

            # Use TensorBoard to log these metrics
            self.writer.add_scalar('Performance/Average_Game_Length', avg_game_length, iterations)
            self.writer.add_scalar('Performance/Average_Top_Move_Ratio', avg_top_move_ratio, iterations)
            self.writer.add_scalar('Performance/Win_Rate', win_rate, iterations)

            # Collect all loss metrics
            all_policy_losses = []
            all_value_losses = []
            all_total_losses = []

            # For each epoch of training
            for _ in trange(self.args['num_epochs']):
                policy_losses, value_losses, total_losses = self.train(memory)  # Train the model
                all_policy_losses.extend(policy_losses)  # Append the policy losses
                all_value_losses.extend(value_losses)  # Append the value losses
                all_total_losses.extend(total_losses)  # Append the total losses

            # At the end of all epochs, visualize the losses and the win rates so far
            self.visualize_losses(all_policy_losses, all_value_losses, all_total_losses)
            self.visualize_performance(all_win_rates)

            # Save the model after each iteration of learning
            torch.save(self.model.state_dict(), f"model_{iterations}.pt")
            torch.save(self.optimizer.state_dict(), f"optimizer_{iterations}.pt")

        self.writer.close()  # Close the TensorBoard writer

                

# Previous version of learn()
    # def learn(self):
    #     for iterations in range(self.args['num_iterations']):
    #         memory = []

    #         self.model.eval()

    #         for selfPlay_iteration in trange(self.args['num_selfPlay_iterations']): #trange was used so that we can visualise the progress bars
    #             memory += self.selfPlay()

    #         self.model.train() 
    #         for epoch in trange(self.args['num_epochs']):
    #             self.train(memory)

    #         #store the weights of the model
    #         torch.save(self.model.state_dict(), f"model_{iterations}.pt")
    #         torch.save(self.optimizer.state_dict(), f"optimizer_{iterations}.pt")
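
The move-selection step in `selfPlay` masks invalid moves, renormalizes the remaining probabilities, and samples an action. A hedged sketch of that step in plain Python follows. `mask_and_normalize` and `sample_action` are hypothetical helpers, `valid_moves` is assumed to be a collection of legal action indices, and the uniform fallback for an all-zero distribution is an assumption added here, not something the notebook does.

```python
import random

def mask_and_normalize(action_probs, valid_moves):
    """Zero out invalid moves and rescale the rest to sum to 1."""
    valid = set(valid_moves)
    masked = [p if i in valid else 0.0 for i, p in enumerate(action_probs)]
    total = sum(masked)
    if total > 0:
        return [p / total for p in masked]
    if not valid:
        raise ValueError("no valid moves to choose from")
    # Assumed fallback: spread the probability evenly over the valid moves.
    return [1.0 / len(valid) if i in valid else 0.0 for i in range(len(action_probs))]

def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy example with four actions, of which only 0 and 2 are legal.
probs = mask_and_normalize([0.1, 0.6, 0.3, 0.0], valid_moves=[0, 2])
action = sample_action(probs, random.Random(0))
print(probs, action)
```

`np.random.choice` raises an error when the probabilities passed as `p` do not sum to 1, so some handling of the all-zero case is worth considering before sampling.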
                

##### Old tensorboard code

In [None]:
# To use TensorBoard with your AlphaZero implementation, you'll need to make use of the TensorBoard logging through `SummaryWriter` from the `torch.utils.tensorboard` module. Below is a step-by-step guide to integrate TensorBoard into your existing AlphaZero code.

# 1. **Install TensorBoard**:
#    Ensure you have TensorBoard installed in your environment. If not, you can install it via pip:

#     ```bash
#     pip install tensorboard
#     ```

# 2. **Import SummaryWriter**:
#    At the beginning of your script, import the necessary TensorBoard class:

#     ```python
#     from torch.utils.tensorboard import SummaryWriter
#     ```

# 3. **Initialize SummaryWriter**:
#    Create a `SummaryWriter` instance at the beginning of your training script. This object will be used to write logs into a directory that TensorBoard will later read from.

#     ```python
#     writer = SummaryWriter('runs/alphazero_experiment_1')
#     ```

# 4. **Log Data**:
#    Throughout your training loop and other functions, use the `writer` to log data, such as loss, accuracy, or custom metrics. Here's how you might incorporate it into your AlphaZero training loop:

#     ```python
#     def train(self, memory):
#         # ... [your existing code] ...
        
#         # Loop over batches
#         for batchIdx in range(0, len(memory), self.args['batch_size']):
#             # ... [your existing code] ...
            
#             # After optimizer step, log the losses
#             writer.add_scalar('Loss/policy', policy_loss.item(), global_step)
#             writer.add_scalar('Loss/value', value_loss.item(), global_step)
#             writer.add_scalar('Loss/total', total_loss.item(), global_step)
            
#             # Increment your global step counter
#             global_step += 1
#     ```

# 5. **Log Custom Metrics and Visualizations**:
#    Besides scalar values, you might want to log histograms of parameters, images of the game board, or distributions of move probabilities:

#     ```python
#     # Log parameters (histograms)
#     for name, param in self.model.named_parameters():
#         writer.add_histogram(name, param, global_step)

#     # Log example game states as images
#     # Convert your game state to an image (assuming you have a function for this)
#     img = game_state_to_image(state)  
#     writer.add_image('Game/Board', img, global_step)
#     ```

# 6. **Start TensorBoard**:
#    Once your script is running and logging data, start TensorBoard in a terminal pointing it to the directory where the logs are being written:

#     ```bash
#     tensorboard --logdir=runs
#     ```

# 7. **View Your Logs**:
#    Open your browser and go to `localhost:6006` (or the URL provided in the terminal when you start TensorBoard) to view the logs and visualizations.

# 8. **Close SummaryWriter**:
#    At the end of training, or when you're done logging data, ensure to close the SummaryWriter to flush any remaining outputs to disk:

#     ```python
#     writer.close()
#     ```

# By following these steps, you'll be able to integrate TensorBoard into your AlphaZero model for rich logging and visualization capabilities. This will help you monitor your training process, understand your model's behavior, and make informed decisions to improve its performance. Remember to customize the logging according to what's most relevant for your specific scenario and model.