# Settlers of Catan inspired by the AlphaZero approach
Aviv Cohen; avivcohen@campus.technion.ac.il

Dan Navon; danavon@campus.technion.ac.il

### An informal description of the problem you are trying to solve (examples are best)
Catan is a resource-driven game in which players roll dice to collect resources, trade materials with other players, and
compete to be the first to reach 10 points.
Our goal is to create an agent that improves its strategy for this game by playing against itself.
A quick explanation of Catan can be found in [YouTube: How to Play Catan in 4 Minutes - Rules Girl](https://www.youtube.com/watch?v=4fUa_ZJ7beM)



### Motivation for solving this problem and why it is hard
While general techniques deal well with traditional games, they are often unsatisfactory for modern strategic games,
commonly called Eurogames, because of the greater complexity of these games when compared to traditional board games [1].
Catan is an archetypal Eurogame, with gameplay elements that make it challenging for traditional tree search algorithms such as
Minimax: imperfect information, randomly determined moves, more than 2 players, and negotiation between players. Most
autonomous agents available for this game rely on game-specific heuristics and have a low win rate against human players.

[1] - D. Robilliard and C. Fonlupt, Towards Human-Competitive
Game Playing for Complex Board Games with Genetic
Programming. Cham: Springer International Publishing,
2016, pp. 123–135.

### Previous methods for solving the problem and their strengths and weaknesses:

[JSettlers](https://github.com/jsettlers/settlers-remake) is an open-source Java implementation of Settlers of Catan
that [includes implementations of AI agents](https://www.semanticscholar.org/paper/Real-time-decision-making-for-adversarial-using-a-Hammond-Thomas/bee030f91fe1074548e58fbebc92d2b10c90bc1d) that are frequently used as a benchmark for new game playing strategies.

[QSettlers, by Peter McAughan in 2019](https://akrishna77.github.io/QSettlers/) - in this work, the authors attempted to apply the DQN paradigm to develop an AI model that plays and wins Settlers of Catan.
Although the team was able to train a working DQN model specifically for the player trading mechanism of the game, they couldn't implement a working DQN model for general gameplay.<br>
<img src="docs/images/DQN_trades.jpg" width="600" align="center"/>

In [Re-L Catan: Evaluation of Deep Reinforcement Learning for Resource Management Under Competitive and Uncertain Environments](https://cs230.stanford.edu/projects_fall_2021/reports/103176936.pdf) the authors tried to extend the QSettlers method to general gameplay by creating a different DQN for each part of the game.<br>
The paper does not explain how these NNs were connected, but the authors claim to perform better than the heuristic-based agent on a game server named www.colonist.io.
<img src="docs/images/RE-L.jpg" width="600" align="center"/>

These last 2 papers led us to abandon the DQN approach because of its incomplete view of the game.

[Optimizing UCT for Settlers of Catan](https://www.sbgames.org/sbgames2017/papers/ComputacaoFull/175405.pdf) extends the simplified rules assumed in [Monte-Carlo Tree Search in Settlers of Catan](https://www.researchgate.net/publication/220716999_Monte-Carlo_Tree_Search_in_Settlers_of_Catan). The former paper combines a pruning strategy, which uses domain knowledge to reduce the algorithm’s search space, with a trade-optimistic search heuristic.
<img src="docs/images/MovePruning.jpg" width="600" align="center"/>

The problem with this approach is that it is computationally demanding: the cost is the length of each rollout until the end of the game (the depth of the tree), times the number of iterations, times the length of the actual game.<br>
Secondly, it has no learning process between games, unlike AlphaZero.

[Mastering the game of Go with deep neural networks and tree search](https://www.nature.com/articles/nature16961.pdf) presents a well-known algorithm that relies on MCTS but replaces the rollouts with a CNN in order to tackle these last 2 weaknesses. <br>
The AlphaZero algorithm assumes full observability and only 2 agents, unlike Catan, which can be played by up to 4 players. <br>

It is worth mentioning [Game strategies for The Settlers of Catan](https://ieeexplore.ieee.org/document/6932884), which surveys different game strategies. This paper gave us another point of view, although we didn't actually use it.


### A description of your solution to the problem, how it overcomes the issues in previous methods, and what new issues arise.
Ultimately, our approach is a combination of a simplified AlphaZero and the heuristic+pruning method shown in "Optimizing UCT for Settlers of Catan".
We want to take advantage of the great success of AlphaZero in other games such as Go, with its ability to save execution
time by using a DNN prediction instead of rollouts, and we want to use MCTS with heuristics+pruning in order to shrink the search space.

2 problems arise with this approach:
1. We need to change some of the game's rules in order to be compatible with the AlphaZero algorithm.
2. We don't have the computational power that DeepMind has, and therefore we won't be able to produce results as good as
AlphaGo's. Some projects, such as [Leela Zero](https://github.com/leela-zero/leela-zero), have tried to duplicate the AlphaGo results
without success due to the amount of computational resources required.

We see this project as a way to learn and interact with these new algorithms, not as an attempt to achieve the most successful algorithm.


### A short description on how you intend to evaluate your solution.
Since we slightly changed the rules of the game, we cannot compare our algorithm against other AI agents such as [JSettlers](https://github.com/jsettlers/settlers-remake)
 or [colonist.io](www.colonist.io) and other game servers with bots, as other papers did.

Therefore, we will try to examine our assumptions:
1. Using a trained NN should perform better than using no NN (and no rollouts) with the MCTS variant that AlphaZero proposed.
2. Using pruning and/or a heuristic should perform better than not using them. This is not a trivial assumption, since we might trim actions that would give a better result.
 We will test these assumptions by playing different types of agents against each other.

## Domain - Catan

We forked our project from [PyCatan2](https://github.com/josefwaller/PyCatan2), which gives a raw implementation of the
 game so that developers can implement the rules they intend to use.
In our case, we implemented the very basic game elements: building cities, settlements, and roads, and trading (with no trading between players).
The reason is to keep the game state fully visible, as in other AlphaZero game implementations (Go, Chess, 4-in-a-row, tic-tac-toe, etc.), so that we stay compatible with the original AlphaZero algorithm.

A full explanation of the game can be found [here](https://en.wikipedia.org/wiki/Catan) or under the docs directory (the video at the beginning might help as well).

In [1]:
import os.path
import random
import torch
import matplotlib.pyplot as plt
import numpy

from src.mcts import mcts_get_best_action
from src.mlp import MLP
from src.dataset import Dataset
from src.training import MLPTrainer
from src.plot import plot_fit


We wrapped PyCatan2 so that the agents can interact with the game.

One can edit <code>catan_wrp.py</code> to add or remove rules of the game, change the initial board, etc.

In [2]:
from src.catan_wrp import Catan
num_players = 4
catan_game = Catan()
print("Board:")
print(catan_game.game.board)

Board:
                                                       
                                                       
                                                       
                                                       
                 3:1         2:1                       
                  .--'--.--'--.--'--.                  
                  | 10  |  2  |  9  | 2:1              
               .--'--.--'--.--'--.--'--.               
           2:1 | 12  |  6  |  4  | 10  |               
            .--'--.--'--.--'--.--'--.--'--.            
            |  9  | 11  |   R |  3  |  8  | 3:1        
            '--.--'--.--'--.--'--.--'--.--'            
           2:1 |  8  |  3  |  4  |  5  |               
               '--.--'--.--'--.--'--.--'               
                  |  5  |  6  | 11  | 2:1              
                  '--.--'--.--'--.--'                  
                 3:1         3:1                       
                                         

The coordinate system is a skewed 2D grid, as the left image below demonstrates. More details can be found under <code>/docs/Working-with-Board.srt</code>


<img src="docs/images/catangrid_withpieces.png" align="left"/>
<img src="docs/images/BeginnerBoard.jpg" align="right"/>

### Observations
The observations are returned as a serialized vector.
The reason for this serialization is that we would like to predict how good a state is using a DNN, which has to receive a constant-size input vector.

An observation is laid out as <code>[current_player, dice, longest_road, initialization_stage, intersection_buildings, roads, resources, harbors] </code>.
1. <code>current_player</code> - the current player id, in the range 0-3.
2. <code>dice</code> - the dice value, in the range 2-12.
3. <code>longest_road</code> - the id of the player who owns the longest road (2 victory points).
4. <code>initialization_stage</code> - boolean value that indicates whether we are at the initialization stage.
5. <code>intersection_buildings</code> - vector with 55 elements, one per intersection on the board. Values are encoded as $player\_id*10 + building\_type$, where $building\_type$ is 1 for a settlement and 2 for a city. If there is no building, the value is 0.
For example, the value "12" means that the intersection belongs to player_id==int(12/10)==1 with a city (12%10==2) on it.
6. <code>roads</code> - vector with 71 elements, one per possible road location. A value of 0 means no road; otherwise the value is player_id+1, where player_id is the owner of the road.
7. <code>resources</code> - there are 5 different types of resources (lumber, brick, ore, grain, wool). Assuming 4
players in the game, this vector has 20 elements, where each element is the amount of one resource type held by one player.
8. <code>harbors</code> - 9 elements, with values encoded as in intersection_buildings. Harbors change the trading rate from 4:1 to 3:1 or 2:1.
 
Overall, the state is represented in a compact way with 160 essential elements.
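The intersection and road encodings described above can be unpacked with a few lines of Python. This is a minimal sketch: the helper names `encode_intersection`, `decode_intersection`, and `decode_road` are hypothetical and are not part of `catan_wrp.py`.

```python
from typing import Optional, Tuple

# building_type codes as described in the observation layout
BUILDING_NAMES = {1: "settlement", 2: "city"}


def encode_intersection(player_id: int, building_type: int) -> int:
    """Pack an owner and a building type as player_id * 10 + building_type."""
    return player_id * 10 + building_type


def decode_intersection(value) -> Optional[Tuple[int, str]]:
    """Return (player_id, building name), or None for an empty intersection."""
    if value == 0:
        return None
    player_id, building_type = divmod(int(value), 10)
    return player_id, BUILDING_NAMES[building_type]


def decode_road(value) -> Optional[int]:
    """Return the owning player_id, or None when no road is placed."""
    return None if value == 0 else int(value) - 1


print(decode_intersection(12))  # (1, 'city')
print(decode_intersection(31))  # (3, 'settlement')
print(decode_road(4))           # 3
```

The `int(...)` conversions are there because the state is returned as a float tensor, so values arrive as `12.` rather than `12`.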

In this image, the green dots represent the intersections; a road can be placed between each 2 adjacent dots.

<img src="docs/images/intersections.png"/>

Image taken from ["RE-L Catan"](https://cs230.stanford.edu/projects_fall_2021/reports/103176936.pdf)


In [18]:
state = catan_game.get_state()
print("state size:" + str(len(state)))
print(state)

state size:160
tensor([ 3.,  8.,  0.,  0.,  0.,  0.,  0.,  0., 21.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0., 31.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  1.,  0., 21.,  0.,  0.,  0.,  0., 11.,  0., 11.,  0.,  0.,
         0.,  0.,  0.,  0., 31.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  4.,  0.,  0.,  3.,  0.,  0.,
         0.,  2.,  0.,  0.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  4.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  3.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
         1.,  0.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,  2.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  4.,  0.,  0.,  0.,  3.,  4.,  1.,  3.,  0.,
         0.,  0.,  0.,  2.,  0.,  3.,  1.,  0.,  0.,  1.,  2.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.])


### Actions
The action space contains building (city / settlement / road), trading, and ending the turn (followed by a dice roll).
Not all actions are always available: building and trading availability depends on the resources each player holds.
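To illustrate how resource counts gate the available actions, here is a small sketch using the standard base-game building costs. The names `BUILD_COSTS`, `affordable_builds`, and `bank_trades` are hypothetical; the real wrapper also has to check placement legality, which this sketch ignores.

```python
from collections import Counter

# Standard Catan building costs (base game rules).
BUILD_COSTS = {
    "road": Counter(lumber=1, brick=1),
    "settlement": Counter(lumber=1, brick=1, wool=1, grain=1),
    "city": Counter(grain=2, ore=3),
}


def can_afford(hand: Counter, cost: Counter) -> bool:
    """True when the hand holds at least the required amount of every resource."""
    return all(hand[res] >= amount for res, amount in cost.items())


def affordable_builds(hand: Counter) -> list:
    """Names of the buildings the hand can pay for, ignoring board placement."""
    return [name for name, cost in BUILD_COSTS.items() if can_afford(hand, cost)]


def bank_trades(hand: Counter, rate: int = 4) -> list:
    """Resources the player holds enough of to trade away at rate:1."""
    return [res for res, amount in hand.items() if amount >= rate]


hand = Counter(lumber=4, brick=1, grain=2, ore=3)
print(affordable_builds(hand))  # ['road', 'city']
print(bank_trades(hand))        # ['lumber']
```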

The first 8 turns (assuming 4 players) are played in a different way.
The order of the players is 1-2-3-4-4-3-2-1.
This is the picking stage, where each player picks a settlement and a road. In the reverse-order half (4-3-2-1), the
 players also receive resources according to the surrounding hexes.
Because each player can pick any legal settlement at the beginning, the number of actions can be up to 55.
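The picking order above is a "snake" order: forward once, then backward once. A minimal sketch, with the hypothetical helper name `placement_order`:

```python
def placement_order(num_players: int) -> list:
    """Player numbers for the picking stage: 1..n followed by n..1."""
    forward = list(range(1, num_players + 1))
    return forward + forward[::-1]


print(placement_order(4))  # [1, 2, 3, 4, 4, 3, 2, 1]
```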

Each action is represented as a tuple, where the first element is the type of the action and the second is its coordinates.

0. build a road
1. build a settlement
2. build a city
3. trade resources (second value is the type of the trade)
4. end turn
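The action codes listed above can be given readable names. The sketch below uses a hypothetical `ActionType` enum and plain integer codes; in the project itself the first element of a build action is a PyCatan2 `BuildingType` value, so this is an illustration of the tuple layout rather than the wrapper's own code.

```python
from enum import IntEnum


class ActionType(IntEnum):
    BUILD_ROAD = 0
    BUILD_SETTLEMENT = 1
    BUILD_CITY = 2
    TRADE = 3
    END_TURN = 4


def describe(action: tuple) -> str:
    """Turn an (action_type, payload) tuple into a short description."""
    kind = ActionType(action[0])
    if kind is ActionType.END_TURN:
        return "end turn"
    if kind is ActionType.TRADE:
        return f"trade {action[1]}"
    # build actions: the payload holds the coordinates
    return f"{kind.name.lower().replace('_', ' ')} at {action[1]}"


print(describe((4,)))           # end turn
print(describe((1, (0, 0))))    # build settlement at (0, 0)
```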

In [20]:
actions = catan_game.get_actions(prune=False)
for a in actions:
    print(a)
best_action = random.choice(actions)
catan_game.make_action(best_action)
print(catan_game.game.board)

(3, ((<Resource.LUMBER: 0>, -4), (<Resource.GRAIN: 3>, 1)))
(3, ((<Resource.LUMBER: 0>, -4), (<Resource.ORE: 4>, 1)))
(3, ((<Resource.LUMBER: 0>, -4), (<Resource.BRICK: 1>, 1)))
(3, ((<Resource.LUMBER: 0>, -4), (<Resource.WOOL: 2>, 1)))
(4,)
                                                       
                                                       
                                                       
                                                       
                 3:1         2:1                       
                  .--'--s--'--.--'--.                  
                  | 10  |  2  |  9  | 2:1              
               .--'--s--'--.--s--.--'--.               
           2:1 | 12  |  6  |  4  | 10  |               
            .--'--.--'--s--'--s--'--.--'--.            
            |  9  | 11  |   R |  3  |  8  | 3:1        
            '--.--'--s--'--.--'--.--'--.--'            
           2:1 |  8  |  3  |  4  |  5  |               
               '--s--'--.--'--

### Game simulation
Here are a few moves that show how the game progresses.
One can use <code>src/text_game.py</code> to play the game manually.

In [21]:
for i in range(20):
    actions = catan_game.get_actions(prune=False)
    action = random.choice(actions)
    catan_game.make_action(action)
    print("Player " + str(catan_game.get_turn() + 1) + ", action:" + str(action))

print(catan_game.game.board)

Player 2, action:(4,)
Player 2, action:(3, ((<Resource.LUMBER: 0>, -4), (<Resource.BRICK: 1>, 1)))
Player 2, action:(3, ((<Resource.BRICK: 1>, -4), (<Resource.WOOL: 2>, 1)))
Player 2, action:(3, ((<Resource.WOOL: 2>, -4), (<Resource.GRAIN: 3>, 1)))
Player 3, action:(4,)
Player 4, action:(4,)
Player 4, action:(3, ((<Resource.WOOL: 2>, -2), (<Resource.LUMBER: 0>, 1)))
Player 4, action:(3, ((<Resource.WOOL: 2>, -2), (<Resource.BRICK: 1>, 1)))
Player 4, action:(<BuildingType.ROAD: 0>, frozenset({(q: -4, r:4), (q: -3, r:4)}))
Player 1, action:(4,)
Player 2, action:(4,)
Player 3, action:(4,)
Player 4, action:(4,)
Player 1, action:(4,)
Player 2, action:(4,)
Player 3, action:(4,)
Player 4, action:(4,)
Player 1, action:(4,)
Player 2, action:(4,)
Player 2, action:(3, ((<Resource.LUMBER: 0>, -4), (<Resource.GRAIN: 3>, 1)))
                                                       
                                                       
                                                       
        

## Model
We model the environment as an MDP.

$\mathfrak{X} = \{[current\_player, dice, longest\_road, initialization\_stage, intersection\_buildings, roads, resources, harbors]\}$
The state space details can be found under the Observations section.

$\mathfrak{A} = \{\mathcal{A}^i\}_{i=1}^{4}$, where $\mathcal{A}^i(s)$ = \{"possible trades", "possible buildings (cities/settlements/roads)", "end turn"\}
<br>The action space details can be found under the Actions section. The trading and building possibilities depend on the availability of resources.

$\mathfrak{P} (s'|s,a)$ - the probability of reaching state $s'$ when we are in state $s$ and take action $a$.
A stochastic transition happens only when $s.phase = \text{"dice"}$.
The stochasticity is over $s'.resources$ for all players, due to the new allocation of resources (see the [rules](https://en.wikipedia.org/wiki/Catan) of the game).

$\mathfrak{R} = \{\mathcal{R}^i\}_{i=1}^{4}$ - reward functions, where $\mathcal{R}^i=\frac{agent^i\_score}{\sum_{j=1}^{4} agent^j\_score}$

Our objective is to find an optimal policy for our agent that maximizes its utility function.
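As a small worked example of the reward definition, each player's reward is their score divided by the sum of all scores, so the rewards sum to 1. The function name `normalized_rewards` is hypothetical, and returning zeros when every score is 0 is an assumption made here to avoid dividing by zero.

```python
def normalized_rewards(scores: list) -> list:
    """Reward of player i = score_i / sum of all scores."""
    total = sum(scores)
    if total == 0:
        # assumption: no points scored yet, so nobody gets any reward
        return [0.0] * len(scores)
    return [s / total for s in scores]


print(normalized_rewards([5, 2, 2, 1]))  # [0.5, 0.2, 0.2, 0.1]
```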


## Solution

### The planning diagram
<img src="images/plan.png" width="900" />


### How we solved the problem:
Our solution is a combination of a variation of MCTS and a DNN.

#### The DNN role:
The role of the DNN is to estimate the value function (the result at the end of the game).
It takes a vectorized representation of a game state (described under the Observations section) as input, and returns a prediction of all the players' results at the end of the game when the game starts from the given state.
In each training game, we create a dataset of all the states we visited along the game, with the result of the game as the label of every sample.
At the end of each game, we train the DNN to improve its predictions for the next game.
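The labelling scheme described above (every visited state gets the same final result as its label) can be sketched in plain Python. The function `label_game` and the toy 3-element states are hypothetical; the project itself uses the `Dataset` class from `src.dataset` with 160-element states.

```python
def label_game(visited_states: list, final_result: list) -> list:
    """Pair every visited state with a copy of the final game result."""
    return [(state, list(final_result)) for state in visited_states]


# toy states, far shorter than the real 160-element vectors
states = [[0, 8, 0], [1, 6, 0], [2, 5, 1]]
result = [0.5, 0.2, 0.2, 0.1]  # one value per player

dataset = label_game(states, result)
for state, label in dataset:
    print(state, "->", label)
```

Because the label is the same for every state of a game, early states carry a noisy signal; the network has to average over many games to learn anything useful about them.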

#### The MCTS role:
At each turn of a training game, we use a variation of MCTS to choose the action of the current player.

#### The variation of the MCTS:
Contrary to the regular MCTS, our MCTS involves more than one player, as shown in the diagram.
Each layer of the MCTS belongs to a different player, and the order of the players follows their order in the game.
In addition, instead of state nodes only, our MCTS contains action nodes too.

#### The selection phase:
The selection phase in our MCTS chooses the next node according to the UCT criterion, but from the current player's
 perspective. That is, different players will get different UCT values for the same node.
In our MCTS, the UCT value is calculated from the sum of the rewards of the CURRENT PLAYER.
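A per-player UCT value can be sketched as follows. This uses the standard UCT form (mean reward plus an exploration bonus); the function name `uct_value` and its arguments are hypothetical, and the exact formula in `src/mcts.py` may differ.

```python
import math


def uct_value(reward_sums: list, visits: int, parent_visits: int,
              player: int, c: float = 1.0) -> float:
    """UCT score of a node as seen by `player`.

    reward_sums holds one accumulated reward per player, so the same node
    scores differently depending on whose turn it is.
    """
    if visits == 0:
        return math.inf  # unvisited nodes get the highest priority
    exploitation = reward_sums[player] / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


node_rewards = [3.0, 1.0, 0.5, 0.5]  # accumulated rewards for players 0..3
for player in range(4):
    print(player, round(uct_value(node_rewards, 5, 10, player), 3))
```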

#### The insertion phase:
Contrary to the regular MCTS, our version inserts not only the new state, but also the possible actions from it.

#### The simulation phase:
In our version, instead of the simulation phase, we use the DNN to predict the result of the game from the current state.

#### The propagation phase:
The propagation phase of our version looks the same as the original one, but also updates the action nodes.
Each action node gets the same value as its parent state node.
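One way to picture the propagation phase is a walk from the leaf back to the root that adds the same per-player reward vector to every node on the path, state and action nodes alike. The `Node` class and `back_propagate` function below are a hypothetical sketch, not the implementation in `src/mcts.py`.

```python
class Node:
    """A tree node (state or action) with per-player reward sums."""

    def __init__(self, parent=None, num_players=4):
        self.parent = parent
        self.visits = 0
        self.reward_sums = [0.0] * num_players


def back_propagate(leaf: Node, reward: list) -> None:
    """Add the reward vector to every node from the leaf up to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.reward_sums = [acc + r for acc, r in zip(node.reward_sums, reward)]
        node = node.parent


root = Node()                # state node
action = Node(parent=root)   # action node under the root
leaf = Node(parent=action)   # state node reached by that action
back_propagate(leaf, [0.4, 0.3, 0.2, 0.1])
print(root.visits, action.reward_sums)
```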

### The guarantees of our model:
Because we rely on the DNN to predict the result of the game, the results of our algorithm depend on the generalization
 ability of the DNN, so there are no guarantees about the performance of our algorithm.


### How our method solves the problem in practice:
As mentioned above, the role of the DNN is to learn from sets of states and their end results.
If the generalization ability of the DNN is satisfactory, its predictions will improve every game, and the MCTS will take better actions.


### Implementation challenges:
* In the original version of UCT, the highest priority is given to unvisited nodes. Catan has a large branching factor, and it would take a lot of time to explore all the unvisited nodes in the tree. We addressed this issue with pruning (as in "Optimizing UCT for Settlers of Catan"), which shrinks the exploration part of the UCT and makes the search feasible within our hardware limitations.

* At the beginning of training, the DNN is initialized with random weights, so all the players take random actions
during the first game, and close-to-random actions in the next few games. We found that there is a high chance that
the game runs forever when all the players play randomly. We addressed this issue by adding a heuristic function to the DNN
result, which makes the moves a little more sophisticated so that the first games finish.

* When the selection phase in the MCTS chooses an action, we have to know which state results from this action.
In Catan, the resulting state depends on the dice roll after the action, so we cannot know the resulting state in advance.
We couldn't choose one of the possible states uniformly at random, because their distribution isn't uniform.
We addressed this issue by simulating the chosen action and taking the resulting state from the simulator. That way, the sampled states converge to the right distribution of the possible states.
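The last point can be checked numerically: the sum of two dice is not uniform, and repeatedly simulating the roll approaches the exact distribution. The function names below are hypothetical, and the sketch rolls two plain dice instead of calling the game simulator.

```python
import random
from collections import Counter
from fractions import Fraction


def exact_dice_distribution() -> dict:
    """Exact probability of each two-dice sum (2..12)."""
    counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


def sampled_dice_distribution(num_rolls: int, seed: int = 0) -> dict:
    """Empirical frequency of each sum after simulating num_rolls rolls."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(num_rolls))
    return {total: counts[total] / num_rolls for total in range(2, 13)}


exact = exact_dice_distribution()
sampled = sampled_dice_distribution(100_000)
print(exact[7], exact[2])  # 1/6 1/36
print(round(sampled[7], 3), round(sampled[2], 3))
```

A 7 is six times as likely as a 2, which is why picking a successor state uniformly at random would bias the search.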


def iteration(root, game, agent, c, d):
    """
    Run one iteration of the MCTS.
    :param root: the root node of the search tree
    :param game: the current game
    :param agent: the agent who activates the method
    :param c: the weight of the exploration part in the UCT
    :param d: the weight of the heuristic
    :return: None
    """

    original_state = game.get_state()

    # returns the reward if it's the end of the game, the selected action, and the given state after playing this action
    reward, action_leaf, new_state = selection(root, game, c)

    if not game.is_over():
        action_leaf = expansion(action_leaf, new_state, game, agent.prune)

        # adding the weighted heuristic value to the predicted reward from the DNN
        reward = d * game.heuristic(new_state) + agent.model.forward(new_state)

    back_propagation(action_leaf, reward)

    # back to the original state of the game
    game.set_state(original_state)

In [3]:
"""
Parameters definition
"""
# training params
hp_model_training = dict(loss_fn=torch.nn.MSELoss(),
                         batch_size=100,
                         num_epochs=100,
                         test_ratio=0.2,
                         valid_ratio=0.2,
                         early_stopping=100)
# optimizer params
hp_optimizer = dict(lr=0.001,
                    weight_decay=0.01,
                    momentum=0.99)
# NN structure params
hp_model = dict(hidden_layers_num=1,
                hidden_layers_size=20,
                activation='relu')

# MCTS params: c - UCT exploration/exploitation weight, d - weight of the heuristic relative to the model (NN)
hp_mcts = dict(c=1,
               d=3,
               iterations_num=200)


In [4]:
# NN creation
def create_model(in_dim, out_dim, model_file):
    # If a model already exists, load it. Otherwise, create a new model.
    if os.path.isfile(model_file):
        print(f'loading model from "{model_file}"...')
        mlp = torch.load(model_file)
        print(mlp)
        return mlp

    mlp = MLP(
        in_dim=in_dim,
        dims=[hp_model['hidden_layers_size']] * hp_model['hidden_layers_num'] + [out_dim],
        nonlins=[hp_model['activation']] * hp_model['hidden_layers_num'] + ['none']
    )

    print('creating model...')
    print(mlp)
    return mlp


# training function
def train(dl_train, dl_valid, dl_test, model):
    loss_fn = hp_model_training['loss_fn']
    optimizer = torch.optim.SGD(params=model.parameters(), **hp_optimizer)
    trainer = MLPTrainer(model, loss_fn, optimizer)

    return trainer.fit(dl_train,
                       dl_valid,
                       num_epochs=hp_model_training['num_epochs'],
                       print_every=10,
                       early_stopping=hp_model_training['early_stopping'])

## Training
To train the existing model named "model2", set <code>train = True</code>.
If a new model is needed, change <code>model_path</code>.

In [5]:
train = False
model_path = "src/model2"

In [6]:
"""
Model training function; all the agents use the same model.
At the end of the game, all the visited states are collected as data inputs, and the end-game results are the data labels.
The model is then trained on this dataset.
"""
def train_agent(games_num, model_path):
    #playing with the same model for all agents.
    model = create_model(Catan.get_state_size(), Catan.get_players_num(), model_path)
    agents = Catan.get_players_num()*[Agent(model)]

    for i in range(1, games_num + 1):
        print(f'_________________game {i}/{games_num}________________')

        #Create a new game and dataset each iteration in order to train the model.
        catan_game = Catan()
        ds = Dataset(hp_model_training['batch_size'], hp_model_training['valid_ratio'], hp_model_training['test_ratio'])

        #statistics
        turns_num = 0
        actions_num = 0

        while True:
            actions_num += 1

            # get the best possible action by the MCTS+Model
            best_action = mcts_get_best_action(catan_game, agents, hp_mcts['c'], hp_mcts['d'], hp_mcts['iterations_num'])
            print("Player " + str(catan_game.get_turn() + 1) + ", action:" + str(best_action))

            # Execute the action
            reward = catan_game.make_action(best_action)
            if best_action[0] == 4: #end turn
                turns_num += 1

                print("Player " + str(catan_game.get_turn() + 1) + " turn!, dice: " + str(catan_game.dice))
                # print(catan_game.game.board)

            ds.add_sample(catan_game.get_state()) # add new input sample.

            #stop the game if the number of actions exceeds 600
            if catan_game.is_over() or actions_num > 600:
                if actions_num > 600:
                  print("No winner, Final board:")
                  print(catan_game.game.board)
                  reward = [catan_game.game.get_victory_points(catan_game.game.players[i]) for i in range (Catan.get_players_num())]
                else:
                  print("Congratulations! Player %d wins!" % (catan_game.cur_id_player + 1))
                  print("Final board:")
                  print(catan_game.game.board)


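                #set label and train
                ds.set_label(reward)

                dl_train, dl_valid, dl_test = ds.get_data_loaders()
                # Build the trainer here instead of calling train(): the notebook flag `train` shadows that function.
                optimizer = torch.optim.SGD(params=model.parameters(), **hp_optimizer)
                trainer = MLPTrainer(model, hp_model_training['loss_fn'], optimizer)
                fit_res = trainer.fit(dl_train, dl_valid,
                                      num_epochs=hp_model_training['num_epochs'],
                                      print_every=10,
                                      early_stopping=hp_model_training['early_stopping'])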
                plot_fit(fit_res, log_loss=False, train_test_overlay=True)
                plt.show()
                print(ds)

                print(f'saving model in "{model_path}"')
                torch.save(model, model_path)
                break


In [7]:
"""
Agent holds the model that is used to make decisions under the MCTS.
Each agent can use a different model to predict the value function of a given state.
"""
class Agent:
    def __init__(self, model, prune=True, rand=False):
        self.model=model
        self.prune=prune #boolean parameter, whether to prune or not
        self.rand=rand #boolean parameter, whether to take a random action or not

In [8]:
if train == True:
    train_agent(2,model_path )

## Evaluation

1. In the previous section we played a number of games and trained our neural network. Now we run multiple games in order to collect statistics on the agents' win rates.
2. The evaluation is done by comparing different types of agents:<br>
     a. $Agent 1$ - Trained NN with pruning.<br>
     b. $Agent 2$ - Trained NN without pruning.<br>
     c. $Agent 3$ - No NN with pruning.<br>
     d. $Agent 4$ - No NN without pruning.<br>
     e. $Agent 5$ - Random actions.<br>

* Agents 1-4 are based on the MCTS, with 300 iterations, using the heuristic.


In [9]:
def test_agent(games_num,agents):
    stats = {k: [] for k in range(Catan.get_players_num())}
    
    hp_mcts['iterations_num']=1000
    for i in range(1, games_num + 1):
        print(f'_________________game {i}/{games_num}________________')
        catan_game = Catan()
        actions_num = 0 
        turns_num = 0
        while True:
            actions_num += 1
            if agents[catan_game.get_turn()].rand == True:
                actions = catan_game.get_actions(prune=False)
                best_action = random.choice(actions)
                # reward = catan_game.make_action(tuple(best_action))
            else:
                best_action = mcts_get_best_action(catan_game, agents, hp_mcts['c'], hp_mcts['d'], hp_mcts['iterations_num'])
            
            if actions_num == 8:
                hp_mcts['iterations_num']=200

            print("Player " + str(catan_game.get_turn() + 1) + ", action:" + str(best_action))
            reward = catan_game.make_action(best_action)
            if best_action[0] == 4:
                turns_num += 1
                print("Player " + str(catan_game.get_turn() + 1) + " turn!, dice: " + str(catan_game.dice))
                # print(catan_game.game.board)

            if catan_game.is_over() or actions_num > 600:
                if actions_num > 600:
                  print("No winner, Final board:")
                  print(catan_game.game.board)
                else:
                  stats[catan_game.cur_id_player].append([actions_num, int(turns_num/4)])
                  print("Congratulations! Player %d wins!" % (catan_game.cur_id_player + 1))
                  print("Final board:")
                  print(catan_game.game.board)
                break
    return stats

In [10]:
def set_seed(seed):
    torch.manual_seed(seed)
    random.seed(seed)
    numpy.random.seed(seed)

## Results

As explained before, since we slightly changed the rules of the game, we cannot compare our algorithm against other AI agents such as [JSettlers](https://github.com/jsettlers/settlers-remake) or [colonist.io](www.colonist.io) and other game servers with bots, as other papers did.

Therefore, we examine our assumptions:

1. We expected the win rate to be higher when using a trained NN than when using no NN, and action pruning to be better than no pruning.
2. We also expected the NN to have a greater impact on the win rate than the pruning.
3. The random agent should obviously perform the worst, which was correctly anticipated.


There are 2 different evaluations: the first one involves agents 1-4 as described above, and the second tests agents 1, 2, 4, and 5.
The evaluations are executed by playing multiple games and collecting statistics on the win rate and on the number of actions and turns per win.
To run the evaluation yourself, set <code>evaluate = True</code>.


In [16]:
evaluate = False

num_of_tests = 15 # number of games to play and collect statistics from.

un_trained_model = create_model(Catan.get_state_size(), Catan.get_players_num(), 'model')
trained_model = create_model(Catan.get_state_size(), Catan.get_players_num(), 'src/model2')
agent1 = Agent(trained_model)
agent2 = Agent(trained_model,prune=False)
agent3 = Agent(un_trained_model)
agent4 = Agent(un_trained_model,prune=False)
agent5 = Agent(un_trained_model,prune=False,rand=True)

agents1 = [agent1, agent2, agent3, agent4]
agents2 = [agent1, agent2, agent4, agent5]


creating model...
MLP(
  (mlp_layers): Sequential(
    (0): Linear(in_features=160, out_features=20, bias=True)
    (1): ReLU()
    (2): Linear(in_features=20, out_features=4, bias=True)
    (3): Identity()
  )
)
loading model from "src/model2"...
MLP(
  (mlp_layers): Sequential(
    (0): Linear(in_features=160, out_features=20, bias=True)
    (1): ReLU()
    (2): Linear(in_features=20, out_features=4, bias=True)
    (3): Identity()
  )
)


In [12]:
def get_stats(winning_records):
    avg_actions= [0,0,0,0]
    avg_turns= [0,0,0,0]
    winnig_rate = [(100*len(v)/num_of_tests) for k,v in winning_records.items()]

    for k,v in winning_records.items():
        for a,t in v:
            avg_actions[k] += a/len(v)
            avg_turns[k] += t/len(v)

    return (winnig_rate, avg_actions, avg_turns)


In [17]:
seed = 11
set_seed(seed)
if evaluate == True:
    
    winning_records = test_agent(num_of_tests, agents1)
                                                    

_________________game 1/15________________
Player 1, action:(<BuildingType.SETTLEMENT: 1>, (q: 3, r:-2))
Player 1, action:(<BuildingType.ROAD: 0>, frozenset({(q: 2, r:-2), (q: 3, r:-2)}))
Player 2, action:(<BuildingType.SETTLEMENT: 1>, (q: 3, r:1))
Player 2, action:(<BuildingType.ROAD: 0>, frozenset({(q: 3, r:1), (q: 4, r:0)}))
Player 3, action:(<BuildingType.SETTLEMENT: 1>, (q: 1, r:2))
Player 3, action:(<BuildingType.ROAD: 0>, frozenset({(q: 0, r:2), (q: 1, r:2)}))
Player 4, action:(<BuildingType.SETTLEMENT: 1>, (q: 2, r:3))
Player 4, action:(<BuildingType.ROAD: 0>, frozenset({(q: 2, r:3), (q: 3, r:2)}))
Player 4, action:(<BuildingType.SETTLEMENT: 1>, (q: -2, r:-1))
Player 4, action:(<BuildingType.ROAD: 0>, frozenset({(q: -3, r:-1), (q: -2, r:-1)}))
Player 3, action:(<BuildingType.SETTLEMENT: 1>, (q: -3, r:1))
Player 3, action:(<BuildingType.ROAD: 0>, frozenset({(q: -4, r:1), (q: -3, r:1)}))
Player 2, action:(<BuildingType.SETTLEMENT: 1>, (q: 1, r:-3))
Player 2, action:(<BuildingType

In [18]:
if evaluate:
    winning_rate, avg_actions, avg_turns = get_stats(winning_records)

    print("winning rate (percentage), number of actions, number of turns:")
    print(f"Agent 1 - Trained NN, with pruning:{winning_rate[0]}, {avg_actions[0]}, {avg_turns[0]}")
    print(f"Agent 2 - Trained NN, without pruning:{winning_rate[1]}, {avg_actions[1]}, {avg_turns[1]}")
    print(f"Agent 3 - No NN, with pruning:{winning_rate[2]}, {avg_actions[2]}, {avg_turns[2]}")
    print(f"Agent 4 - No NN, without pruning:{winning_rate[3]}, {avg_actions[3]}, {avg_turns[3]}")

winning rate (percentage), number of actions, number of turns:
Agent 1 - Trained NN, with pruning:40.0, 142.83333333333334, 21.666666666666664
Agent 2 - Trained NN, without pruning:6.666666666666667, 177.0, 25.0
Agent 3 - No NN, with pruning:33.333333333333336, 168.2, 25.0
Agent 4 - No NN, without pruning:13.333333333333334, 143.0, 17.5


### Test 1:
Action and turn means are taken over the games each agent won.

| Agent | Win rate (%) | Mean actions | Mean turns |
|---|---|---|---|
| Agent 1 - Trained NN, with pruning | 40.0 | 142.83 | 21.67 |
| Agent 2 - Trained NN, without pruning | 6.67 | 177.0 | 25.0 |
| Agent 3 - No NN, with pruning | 33.33 | 168.2 | 25.0 |
| Agent 4 - No NN, without pruning | 13.33 | 143.0 | 17.5 |

In [19]:
# Test 2: same line-up, but agent 3 is replaced by the random-action agent (agent 5)
set_seed(seed)
if evaluate:
    winning_records2 = test_agent(num_of_tests, agents2)

_________________game 1/15________________
Player 1, action:(<BuildingType.SETTLEMENT: 1>, (q: 3, r:-1))
Player 1, action:(<BuildingType.ROAD: 0>, frozenset({(q: 4, r:-1), (q: 3, r:-1)}))
Player 2, action:(<BuildingType.SETTLEMENT: 1>, (q: 4, r:-3))
Player 2, action:(<BuildingType.ROAD: 0>, frozenset({(q: 5, r:-3), (q: 4, r:-3)}))
Player 3, action:(<BuildingType.SETTLEMENT: 1>, (q: 3, r:1))
Player 3, action:(<BuildingType.ROAD: 0>, frozenset({(q: 3, r:1), (q: 3, r:2)}))
Player 4, action:(<BuildingType.SETTLEMENT: 1>, (q: 3, r:-4))
Player 4, action:(<BuildingType.ROAD: 0>, frozenset({(q: 3, r:-5), (q: 3, r:-4)}))
Player 4, action:(<BuildingType.SETTLEMENT: 1>, (q: 0, r:2))
Player 4, action:(<BuildingType.ROAD: 0>, frozenset({(q: 0, r:2), (q: 1, r:2)}))
Player 3, action:(<BuildingType.SETTLEMENT: 1>, (q: -2, r:3))
Player 3, action:(<BuildingType.ROAD: 0>, frozenset({(q: -3, r:4), (q: -2, r:3)}))
Player 2, action:(<BuildingType.SETTLEMENT: 1>, (q: 1, r:-3))
Player 2, action:(<BuildingType

In [20]:
# Test random actions agent
if evaluate:
    winning_rate, avg_actions, avg_turns = get_stats(winning_records2)

    print("winning rate (percentage), number of actions, number of turns:")
    print(f"Agent 1 - Trained NN, with pruning:{winning_rate[0]}, {avg_actions[0]}, {avg_turns[0]}")
    print(f"Agent 2 - Trained NN, without pruning:{winning_rate[1]}, {avg_actions[1]}, {avg_turns[1]}")
    print(f"Agent 4 - No NN, without pruning:{winning_rate[2]}, {avg_actions[2]}, {avg_turns[2]}")
    print(f"Agent 5 - Random actions:{winning_rate[3]}, {avg_actions[3]}, {avg_turns[3]}")

winning rate (percentage), number of actions, number of turns:
Agent 1 - Trained NN, with pruning:53.333333333333336, 147.0, 20.875
Agent 2 - Trained NN, without pruning:20.0, 227.0, 29.333333333333336
Agent 4 - No NN, without pruning:26.666666666666668, 167.25, 20.75
Agent 5 - Random actions:0.0, 0, 0


### Test 2:
Action and turn means are taken over the games each agent won.

| Agent | Win rate (%) | Mean actions | Mean turns |
|---|---|---|---|
| Agent 1 - Trained NN, with pruning | 53.33 | 147.0 | 20.88 |
| Agent 2 - Trained NN, without pruning | 20.0 | 227.0 | 29.33 |
| Agent 4 - No NN, without pruning | 26.67 | 167.25 | 20.75 |
| Agent 5 - Random actions | 0.0 | 0 (no wins) | 0 (no wins) |

As expected, agent 1 achieved the best results by combining MCTS with pruning, the heuristic, and the trained NN.
In Test 1, agent 3 came second (33.3%), while agents 2 and 4 scored considerably lower (6.7% and 13.3%).
We attribute agent 3's advantage to a more efficient use of the MCTS budget: with pruning, the iterations reach deeper into the tree, whereas agents 2 and 4 search more widely and spend most iterations on worse options because of the large search space.
For that reason, we conclude that in these tests pruning has a higher impact on the winning rate than the NN.
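
The depth-versus-width argument can be illustrated with a crude back-of-the-envelope model. The sketch below assumes a uniformly expanded tree, which real MCTS is not (it grows asymmetrically), and the branching factors 40 and 8 are hypothetical values chosen only to show the trend. The budget of 300 iterations is the one used in this project.

```python
import math


def approx_uniform_depth(iterations, branching_factor):
    """Depth d at which a uniform tree has about `iterations` nodes (b**d = iterations)."""
    return math.log(iterations) / math.log(branching_factor)


budget = 300  # MCTS iterations per move used in this project
for label, b in [("without pruning (hypothetical b=40)", 40),
                 ("with pruning (hypothetical b=8)", 8)]:
    print(f"{label}: depth ~ {approx_uniform_depth(budget, b):.2f}")
```

Under this simplified model, shrinking the branching factor lets the same budget look noticeably further ahead, which is consistent with the explanation given for agent 3.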

Moreover, the NN itself appears to be weak (either under-trained or limited by its architecture), since the difference between agents 2 and 4 is small.
Agent 4's slight lead over agent 2 may simply reflect the small number of games (15 per test), which leads to high variance.
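
To get a feel for how much variance 15 games leave, one option is to put a confidence interval around each win rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not part of the original evaluation. The win counts are derived from the Test 1 win rates (for example, 40% of 15 games is 6 wins).

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win probability."""
    p = wins / games
    denom = 1 + z ** 2 / games
    center = (p + z ** 2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z ** 2 / (4 * games ** 2)) / denom
    return center - half, center + half


# Win counts out of 15 games, derived from the Test 1 win rates.
test1_wins = {"agent 1": 6, "agent 2": 1, "agent 3": 5, "agent 4": 2}
intervals = {name: wilson_interval(w, 15) for name, w in test1_wins.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {100 * low:.0f}% to {100 * high:.0f}%")
```

With only 15 games the intervals are wide and overlap: the interval for agent 1's 6 wins spans roughly 20% to 64%. Differences of one or two wins, such as the one between agents 2 and 4, are therefore well within the noise.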

In addition, the random agent (agent 5) did not win a single game: even agent 2, with its weak NN, still uses MCTS with a heuristic to estimate the reward, which is a much better estimate than a random choice.



## Method Limitations / Possible Future Extensions

### Limitations
1. Our method relies on full state visibility (as in AlphaZero), because the DNN input must be a game state. For that reason, we simplified the game so that every state is fully visible to all players. The real game has more rules that impair state visibility (such as development cards, the robber, map randomness, etc.), which we omitted.

2. At the beginning, the DNN has random weights. Without any domain knowledge, a game episode may never terminate.

3. Our algorithm demands more computational power than we could supply for both the MCTS and the DNN architecture. For that reason, we used fewer iterations (300) than [Optimizing UCT for Settlers of Catan](https://www.sbgames.org/sbgames2017/papers/ComputacaoFull/175405.pdf) (10K), and a weaker NN architecture than the AlphaZero architecture.
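
To make the role of the iteration budget in limitation 3 more concrete, the sketch below shows the standard UCB1 rule used by UCT to pick which child to explore. It is a generic illustration and not the project's implementation: we assume that `hp_mcts['c']` plays the role of the exploration constant `c`, and the child dictionaries are hypothetical.

```python
import math


def uct_score(total_value, visits, parent_visits, c):
    """UCB1 score: mean value (exploitation) plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, c):
    """Pick the child with the highest UCB1 score."""
    parent_visits = sum(ch["visits"] for ch in children)
    return max(children, key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits, c))


# Hypothetical statistics: "a" looks good after 10 visits, "b" was tried once.
children = [
    {"name": "a", "value": 9.0, "visits": 10},
    {"name": "b", "value": 0.5, "visits": 1},
]
print(select_child(children, c=0.0)["name"])  # a (pure exploitation)
print(select_child(children, c=1.4)["name"])  # b (exploration bonus dominates)
```

With a small budget, most children have very few visits, so their mean values are noisy and the exploration bonus dominates the choice. This is one reason a budget of 300 iterations is expected to play weaker than 10K.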

### Possible Future Extensions
1. Take inspiration from ["RE-L"](https://cs230.stanford.edu/projects_fall_2021/reports/103176936.pdf) and use DQN in some parts of our algorithm, specifically in the initial placement stage, where the picks have a great impact on the entire game.
2. Adapt our algorithm to a POMDP setting (as in ["blind-chess"](https://towardsdatascience.com/blind-chess-log-0-d6b05c6cf90c)) and use the original game rules.

In [26]:
# Use agents without heuristics
seed = 4
set_seed(seed)
hp_mcts['d'] = 0  # ignore the heuristic
hp_mcts['iterations_num'] = 100
max_actions = 450  # cap on the number of actions, so the cell always terminates

model = create_model(Catan.get_state_size(), Catan.get_players_num(), "new_model")
catan_game = Catan()
agents = Catan.get_players_num() * [agent5]

# stats = test_agent(1,agents)
actions_played = 0
for _ in range(max_actions):
    best_action = mcts_get_best_action(catan_game, agents, hp_mcts['c'], hp_mcts['d'], hp_mcts['iterations_num'])
    catan_game.make_action(best_action)
    actions_played += 1  # count played actions (the loop index alone is off by one)

    if catan_game.is_over():
        break

if catan_game.is_over():
    print("\n\nGame is over after: " + str(actions_played) + " actions")
else:
    print("\n\nGame is going forever!")

creating model...
MLP(
  (mlp_layers): Sequential(
    (0): Linear(in_features=160, out_features=20, bias=True)
    (1): ReLU()
    (2): Linear(in_features=20, out_features=4, bias=True)
    (3): Identity()
  )
)


Game is going forever!


This example illustrates the second limitation: when MCTS is used without domain knowledge (here, with the heuristic disabled), the game did not finish within the 450-action cap and may effectively never end.