# Reinforcement-based Search Tree Pruning
## Abraham Oliver, Brown County High School

In [2]:
# Dependencies
import random, sys
import numpy as np
from importlib import reload
from copy import copy

# Deepnotakto Project
from deepnotakto import util, train
from deepnotakto.games import notakto


In [3]:
reload(notakto)

<module 'deepnotakto.games.notakto' from '/home/abe/git/Deep-Notakto/deepnotakto/games/notakto/__init__.py'>

## The Game
#### 3 x 3
**Player 1 Winning Strategy** Play in the center on the first move. Play a knight's move (from chess) from the opponent's move.
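
The "knight's move" reply in the strategy above can be enumerated mechanically. The following is a minimal sketch, not part of the deepnotakto package: the helper name `knight_moves` and the zero-based `(row, col)` indexing are assumptions made for illustration.

```python
def knight_moves(size, row, col):
    """Return the cells a chess knight's move away from (row, col)
    that lie on a size x size board. Hypothetical helper for illustration."""
    offsets = [(1, 2), (2, 1), (-1, 2), (-2, 1),
               (1, -2), (2, -1), (-1, -2), (-2, -1)]
    return sorted(
        (row + dr, col + dc)
        for dr, dc in offsets
        if 0 <= row + dr < size and 0 <= col + dc < size
    )


print(knight_moves(3, 0, 0))  # corner cell: [(1, 2), (2, 1)]
print(knight_moves(3, 0, 1))  # edge cell: [(2, 0), (2, 2)]
print(knight_moves(3, 1, 1))  # center cell: []
```

On the 3 x 3 board every non-center cell has exactly two knight's-move replies, and the center has none, which is consistent with the first player taking the center first.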

#### 4 x 4
**Player 2 Winning Strategy** Draw an imaginary line either horizontally or vertically, splitting the board in half. Play a knight's move from the opponent's move, staying on the same side of the imaginary line as the opponent's move.

#### 5 x 5, 6 x 6, and 7 x 7
**Player 1 Winning Strategy** Not yet known

#### 8 x 8 and larger
**Winner Not Known**

In [4]:
BOARD_SIZE = 3
# Create a random agent and a human player
r = notakto.RandomAgent()
h = notakto.Human()
# Create a 3x3 game environment
e = notakto.Env(BOARD_SIZE)
# Play the random agent against the human on the 3x3 board
gui = notakto.VisualNotaktoGame(e, r, h, -1, show_confidences = False)

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


### Q-Learning Agents
#### Definitions
##### Markov Decision Process (MDP)
A Markov decision process is a decision process defined by the tuple $(S, A, R_p(\cdot), \gamma)$, where $S$ is the state space (the possible board positions), $A$ is the action space (the possible actions for an agent), $R_p(s)$ is the immediate reward for a state $s \in S$ and a given player $p$, and $\gamma$ is the discount factor (the balance between future and immediate rewards). A time step of a deterministic MDP at time $t$ is $(s_t,a_t,r_t)$, where $r_t = R_p(s_t)$. An agent in an MDP is optimized to maximize the expected discounted reward from a given time-step $t$ until a terminal state at time-step $T$, $R_T=\mathbb{E}[\sum_{n=t}^{T} \gamma^{n-t} r_n]$.
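
As a small worked example of the discounted reward defined above, the sketch below sums a finite list of rewards, weighting the reward that arrives $k$ steps in the future by $\gamma^k$. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the project code.

```python
def discounted_return(rewards, gamma):
    """Sum rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ...

    `rewards` holds the rewards observed from the current time step up to
    the terminal state. Hypothetical helper for illustration.
    """
    total = 0.0
    # Work backwards so each step applies one more factor of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A single reward of 1 arriving two steps in the future, with gamma = 0.5
print(discounted_return([0, 0, 1], 0.5))  # 0.25
```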
##### Q-Learning
In this environment, there exists a function $Q^*: S \to A$ that produces the action $a$ that maximizes $R_T$ for a given state. Because it is often impossible to find the true $Q^*$, we approximate $Q^*$ with a function $Q_\pi: S \to A$ that produces an action based on a given policy $\pi$ (note that $Q_\pi=Q^*$ when $\pi=\pi^*$, the optimal policy). We define $Q_\pi$ by $Q_\pi(s)=\mathrm{argmax}_a\mathbb{E}[R_T|s, a, \pi]$.
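
The argmax in this definition can be illustrated with a short sketch. It assumes the expected returns for each action are already available as an array and that illegal moves (occupied squares) are excluded with a boolean mask. Both the name `greedy_action` and the masking step are assumptions for illustration, not the project's implementation.

```python
import numpy as np


def greedy_action(action_values, legal_mask):
    """Return the index of the legal action with the highest estimated value.

    action_values: one estimated return per action.
    legal_mask: True where the action is allowed.
    Hypothetical helper for illustration.
    """
    values = np.asarray(action_values, dtype=float)
    mask = np.asarray(legal_mask, dtype=bool)
    # Illegal actions are given -inf so they can never be selected.
    masked = np.where(mask, values, -np.inf)
    return int(np.argmax(masked))


# The highest value (index 1) is illegal, so the next best (index 2) is chosen.
print(greedy_action([0.1, 0.9, 0.3], [True, False, True]))  # 2
```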
##### Q-Agent
For a computer agent, $Q$ is defined by a neural network that accepts a given board state and returns a probability distribution over the action space. After a game rollout is completed, we train the neural network by calculating target Q-values using the Bellman equation $Q_{\mathrm{target}}(s_t, a_t)=r_t+\gamma \max_{a'}Q(s_{t+1},a')$, where $Q(s, a)$ denotes the network's output for action $a$ in state $s$ and $s_{t+1}$ is the state that follows $s_t$.
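
The Bellman target above can be computed for a single transition as in the following sketch. It is a generic illustration under stated assumptions, not the code in `QTree`: the function name `bellman_target` is hypothetical, the target at a terminal state is assumed to be the reward alone, and the sketch does not address how the project accounts for the opponent's move between two of the agent's turns.

```python
import numpy as np


def bellman_target(reward, next_action_values, gamma, terminal):
    """Return r + gamma * max_a' Q(s', a') for one transition.

    next_action_values: the network's outputs for the successor state.
    terminal: True if the successor state ends the game (assumed to
    contribute no future value). Hypothetical helper for illustration.
    """
    if terminal:
        return float(reward)
    return float(reward + gamma * np.max(next_action_values))


# Non-terminal step: 0 + 0.9 * max(0.2, 0.5)
print(bellman_target(0.0, np.array([0.2, 0.5]), 0.9, terminal=False))
# Terminal step: the target is just the final reward
print(bellman_target(-1.0, np.array([0.2, 0.5]), 0.9, terminal=True))
```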

In [5]:
# Create a Q-Agent
training_parameters = {"mode": "none", "learn_rate": .01, "rotate": True, "epochs": 1, "batch_size": 2, "replay_size": 10}
a1 = notakto.QTree(game_size = 3, layers = [9, 9, 9], params = training_parameters)
# Create a 3x3 environment
e = notakto.Env(3)
# Create a random player
r = notakto.RandomAgent()
# Play the Q-Agent against the random player
a1.deterministic = True
gui = notakto.VisualNotaktoGame(e, a1, r, -1, show_confidences = True)

SystemExit: 

  warn("To exit: use 'exit', 'quit', or Ctrl-D.", stacklevel=1)


In [8]:
# Load a Q-Agent
a1 = util.load_agent("agent-saves/notakto/p2_best.npz", notakto.QTree)
a1.change_param("mode", "none")
a1.deterministic = True
# Play the Q-Agent against the human player
gui = notakto.VisualNotaktoGame(e, a1, h, -1, show_confidences = True)

TypeError: __init__() missing 1 required positional argument: 'layers'

### Training
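
The training parameters in this notebook include a `"replay"` mode together with `"replay_size"` and `"batch_size"`. Assuming these refer to standard experience replay (the notebook does not show how `QTree` implements it), the idea can be sketched as a fixed-capacity memory from which small random batches are drawn. The class `ReplayBuffer` and its methods are hypothetical names for illustration only.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of past transitions (hypothetical sketch)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Never request more items than are currently stored.
        k = min(batch_size, len(self.memory))
        return random.sample(list(self.memory), k)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=10)
for step in range(15):
    # Placeholder transitions of the form (state, action, reward)
    buffer.add((step, step % 9, 0.0))

batch = buffer.sample(batch_size=2)
print(len(buffer), len(batch))  # 10 2
```

Sampling at random from stored transitions, rather than training only on the most recent game, reduces the correlation between consecutive training examples.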

In [6]:
# Train the agent
a1.change_param("mode", "replay")
a1.deterministic = False
train.train_model_with_tournament_evaluation(a1, r, e, sims = 70, games = 30, save_every = 10,
                                            model_path = "agent-saves/notakto/p2.npz",
                                            stats_path = "agent-saves/notakto/p2.stats")



-------- QTree(b958b) --------
Saved as 'agent-saves/notakto/p2.npz'
Started at 5 : 13 PM

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 06 (at 5 : 13 PM)
Iteration             1
Q Evaluation          20%
BEST MODEL

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 12 (at 5 : 13 PM)
Iteration             2
Q Evaluation          3%

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 17 (at 5 : 13 PM)
Iteration             3
Q Evaluation          0%

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 24 (at 5 : 13 PM)
Iteration             4
Q Evaluation          3%

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 32 (at 5 : 13 PM)
Iteration             5
Q Evaluation          0%

Self play... Completed
Q-based evaluation... Complete
Time                  0 : 00 : 38 (at 5 : 14 PM)
Iteration             6
Q E

KeyboardInterrupt: 

In [7]:
# Play the newly trained agent
a1.change_param("mode", "none")
a1.deterministic = False
gui = notakto.VisualNotaktoGame(e, a1, h, -1, show_confidences = True)

SystemExit: 