# Mock AlphaGo (3) - Reinforced Learning
In this notebook, we will train the policy network by letting them compete each other according to DeepMind:

> We further trained the policy network by policy gradient reinforcement learning.
Each iteration consisted of a mini-batch of n games played in parallel, between
the current policy network $p_\rho$ that is being trained, and an opponent $p_\rho-$
that uses parameters $\rho^-$ from a previous iteration, randomly sampled from
a pool $O$ of opponents, so as to increase the stability of training. Weights were
initialized to $\rho = \rho^- = \sigma$. Every 500 iterations, we added the current
parameters $\rho$ to the opponent pool. Each game $i$ in the mini-batch was played
out until termination at step $T^i$, and then scored to determine the outcome
$z^i_t = \pm r(s_{T^i})$ from each player’s perspective. The games were then replayed
to determine the policy gradient update, $\Delta\rho = \frac{a}{n}\Sigma^n_{i=1}
\Sigma^{T^i}_{t=1}\frac{\partial\log p_\rho(a^i_t|s^i_t)}{\partial_\rho}(z^i_t-v(s^i_t))$, using the REINFORCE 
algorithm with baseline $v(s^i_t)$ for variance reduction. On the first pass 
through the training pipeline, the baseline was set to zero; on the second pass
we used the value network $v_\theta(s)$ as a baseline; this provided a small
performance boost. The policy network was trained in this way for 10,000 
mini-batches of 128 games, using 50 GPUs, for one day.

In [1]:
import os, numpy as np
from caffe2.python import core, model_helper, workspace, brew, utils
from caffe2.proto import caffe2_pb2
from sgfutil import BOARD_POSITION

%matplotlib inline
from matplotlib import pyplot

# how many games will be run in one minibatch
GAMES_BATCHES = 3 # [1,infinity) depends on your hardware
# how many iterations for this tournament
TOURNAMENT_ITERS = 500 # [1,infinity)

if workspace.has_gpu_support:
    device_opts = core.DeviceOption(caffe2_pb2.CUDA, workspace.GetDefaultGPUID())
    print('Running in GPU mode on default device {}'.format(workspace.GetDefaultGPUID()))
else :
    device_opts = core.DeviceOption(caffe2_pb2.CPU, 0)
    print('Running in CPU mode')

arg_scope = {"order": "NCHW"}

ROOT_FOLDER = os.path.join(os.path.expanduser('~'), 'python', 'tutorial_data','go','param') # folder stores the loss/accuracy log



Running in CPU mode


We need to differentiate primary player and sparring partner. Primary player will learn from the game result

In [2]:
# who is primary and who is sparring partner? 
PRIMARY_PLAYER = "black" # or white
SPARRING_PLAYER = "white"

### Config for primary player
PRIMARY_WORKSPACE = os.path.join(ROOT_FOLDER, PRIMARY_PLAYER)
PRIMARY_CONV_LEVEL = 4
PRIMARY_FILTERS = 128
PRIMARY_PRE_TRAINED_ITERS = 1
# before traning, where to load the params
PRIMARY_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS))

### Config for sparring partner
SPARR_WORKSPACE = os.path.join(ROOT_FOLDER, SPARRING_PLAYER)
SPARR_CONV_LEVEL = 13
SPARR_FILTERS = 192
SPARR_PRE_TRAINED_ITERS = 1
# before traning, where to load the params
SPARR_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(SPARR_CONV_LEVEL,SPARR_FILTERS,SPARR_PRE_TRAINED_ITERS))

print('{}-{}-{}({}) vs. {}-{}-{}({})'.format(
    PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS, PRIMARY_PLAYER,
    SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, SPARRING_PLAYER))

4-128-1(black) vs. 13-192-1(white)


Following training parameters are only for primary player.

In [3]:
BASE_LR = -0.00005 # (-0.003,0) The base Learning Rate

TRAIN_BATCHES = 64 # how many samples will be trained within one mini-batch, depends on your hardware

# after training, where to store the params
SAVE_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".
                           format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS+TOURNAMENT_ITERS))
print('After training, result will be saved to {}'.format(SAVE_FOLDER))

After training, result will be saved to /home/wangd/python/tutorial_data/go/param/RL-conv=4-k=128-iter=501


## AlphaGo Neural Network Architecture

In [4]:
from modeling import AddConvModel, AddTrainingOperators

## Build the actual network

In [5]:
import caffe2.python.predictor.predictor_exporter as pe

data = np.empty(shape=(TRAIN_BATCHES,48,19,19), dtype=np.float32)
label = np.empty(shape=(TRAIN_BATCHES,1), dtype=np.int32)

### Primary player
>Train Net: Blob('data','label') ==> Predict Net ==> Loss ==> Backward Propergation

In [6]:
workspace.SwitchWorkspace(PRIMARY_WORKSPACE, True)
with core.DeviceScope(device_opts):
    primary_train_model = model_helper.ModelHelper(name="primary_train_model", arg_scope=arg_scope, init_params=True)
    workspace.FeedBlob("data", data, device_option=device_opts)
    predict = AddConvModel(primary_train_model, "data", conv_level=PRIMARY_CONV_LEVEL, filters=PRIMARY_FILTERS)
    workspace.FeedBlob("label", data, device_option=device_opts)
    AddTrainingOperators(primary_train_model, predict, "label", base_lr=BASE_LR)
    workspace.RunNetOnce(primary_train_model.param_init_net)
    workspace.CreateNet(primary_train_model.net, overwrite=True)
    
primary_predict_net = pe.prepare_prediction_net(os.path.join(PRIMARY_LOAD_FOLDER, "policy_model.minidb"),
                                               "minidb", device_option=device_opts)



### Sparring partner
>Predict Net: Blob('data') ==> Predict Net ==> Blob('predict')

In [7]:
# Initialize sparring partner
workspace.SwitchWorkspace(SPARR_WORKSPACE, True)
sparring_predict_net = pe.prepare_prediction_net(os.path.join(SPARR_LOAD_FOLDER, "policy_model.minidb"),
                                                 "minidb", device_option=device_opts)

## Run the tournament and training

### Compete

In [8]:
from go import GameState, BLACK, WHITE, EMPTY, PASS
from preprocessing import Preprocess
from game import DEFAULT_FEATURES

game_state = [GameState() for i in range(GAMES_BATCHES)]
game_result = [0] * GAMES_BATCHES # 0 - Not Ended; BLACK - Black Wins; WHITE - White Wins
p = [Preprocess(DEFAULT_FEATURES)] * GAMES_BATCHES
history = [[] for i in range(GAMES_BATCHES)]
board = None

# for each step in all games
for step in range(0,500):
    
    board = np.concatenate([p[i].state_to_tensor(game_state[i]).astype(np.float32) for i in range(GAMES_BATCHES)])
    
    if step % 2 == 0:
        current_player = BLACK
        current_color = 'black'
    else:
        current_player = WHITE
        current_color = 'white'

    if step % 2 == (PRIMARY_PLAYER == 'white'):
        # primary player move
        workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
        workspace.FeedBlob('data', board, device_option=device_opts)
        workspace.RunNet(primary_predict_net)
    else:
        # sparring partner move
        workspace.SwitchWorkspace(SPARR_WORKSPACE)
        workspace.FeedBlob('data', board, device_option=device_opts)
        workspace.RunNet(sparring_predict_net)

    predict = workspace.FetchBlob('predict') # [0.01, 0.02, ...] in shape (N,361)
    
    for i in range(GAMES_BATCHES):
        if game_result[i]: # game end
            continue
        else: # game not end
            legal_moves = [ x*19+y for (x,y) in game_state[i].get_legal_moves(include_eyes=False)] # [59, 72, ...] in 1D
            if len(legal_moves) > 0: # at least 1 legal move
                #mask = np.in1d(sorted_move[i], legal_moves) # [True, False, True, ...]
                #current_choice = sorted_move[i][mask][0] # The top legal move
                probabilities = predict[i][legal_moves] # [0.02, 0.01, ...]
                current_choice = np.random.choice(legal_moves, 1, p=probabilities/np.sum(probabilities))
                (x, y) = (current_choice/19, current_choice%19)
                history[i].append(('B', x, y, board[i]))
                game_state[i].do_move(action = (x, y), color = current_player) # End of Game?
                print('game({}) step({}) {} move({},{})'.format(i, step, current_color, x, y))
            else:
                game_state[i].do_move(action = PASS, color = current_player)
                print('game({}) step({}) {} PASS'.format(i, step, current_color))
                game_result[i] = game_state[i].is_end_of_game

    if np.all(game_result):
        break

game(0) step(0) black move([3],[15])
game(1) step(0) black move([16],[9])
game(2) step(0) black move([2],[3])
game(0) step(1) white move([14],[16])
game(1) step(1) white move([3],[3])
game(2) step(1) white move([16],[13])
game(0) step(2) black move([14],[15])
game(1) step(2) black move([2],[11])
game(2) step(2) black move([12],[2])
game(0) step(3) white move([13],[15])
game(1) step(3) white move([15],[3])
game(2) step(3) white move([16],[6])
game(0) step(4) black move([15],[16])
game(1) step(4) black move([13],[2])
game(2) step(4) black move([16],[5])
game(0) step(5) white move([15],[15])
game(1) step(5) white move([16],[5])
game(2) step(5) white move([15],[6])
game(0) step(6) black move([14],[14])
game(1) step(6) black move([16],[6])
game(2) step(6) black move([15],[5])
game(0) step(7) white move([15],[14])
game(1) step(7) white move([15],[5])
game(2) step(7) white move([14],[6])
game(0) step(8) black move([14],[13])
game(1) step(8) black move([16],[2])
game(2) step(8) black move([3],

game(0) step(73) white move([17],[3])
game(1) step(73) white move([2],[1])
game(2) step(73) white move([13],[16])
game(0) step(74) black move([16],[1])
game(1) step(74) black move([4],[3])
game(2) step(74) black move([11],[16])
game(0) step(75) white move([15],[3])
game(1) step(75) white move([3],[4])
game(2) step(75) white move([12],[13])
game(0) step(76) black move([15],[0])
game(1) step(76) black move([1],[3])
game(2) step(76) black move([14],[11])
game(0) step(77) white move([6],[5])
game(1) step(77) white move([2],[3])
game(2) step(77) white move([14],[16])
game(0) step(78) black move([12],[12])
game(1) step(78) black move([1],[4])
game(2) step(78) black move([15],[17])
game(0) step(79) white move([4],[13])
game(1) step(79) white move([4],[4])
game(2) step(79) white move([13],[11])
game(0) step(80) black move([5],[13])
game(1) step(80) black move([1],[5])
game(2) step(80) black move([15],[10])
game(0) step(81) white move([3],[12])
game(1) step(81) white move([6],[3])
game(2) step(

game(0) step(145) white move([3],[13])
game(1) step(145) white move([1],[2])
game(2) step(145) white move([6],[4])
game(0) step(146) black move([6],[13])
game(1) step(146) black move([13],[1])
game(2) step(146) black move([6],[0])
game(0) step(147) white move([7],[13])
game(1) step(147) white move([12],[2])
game(2) step(147) white move([3],[1])
game(0) step(148) black move([7],[14])
game(1) step(148) black move([12],[1])
game(2) step(148) black move([1],[1])
game(0) step(149) white move([8],[13])
game(1) step(149) white move([12],[3])
game(2) step(149) white move([0],[1])
game(0) step(150) black move([8],[14])
game(1) step(150) black move([12],[5])
game(2) step(150) black move([0],[0])
game(0) step(151) white move([9],[13])
game(1) step(151) white move([10],[8])
game(2) step(151) white move([1],[0])
game(0) step(152) black move([11],[13])
game(1) step(152) black move([4],[18])
game(2) step(152) black move([18],[16])
game(0) step(153) white move([9],[14])
game(1) step(153) white move([6

game(0) step(217) white move([11],[14])
game(1) step(217) white move([8],[11])
game(2) step(217) white move([12],[16])
game(0) step(218) black move([11],[4])
game(1) step(218) black move([7],[11])
game(2) step(218) black move([14],[18])
game(0) step(219) white move([13],[5])
game(1) step(219) white move([8],[10])
game(2) step(219) white move([12],[18])
game(0) step(220) black move([11],[13])
game(1) step(220) black move([7],[8])
game(2) step(220) black move([6],[6])
game(0) step(221) white move([1],[10])
game(1) step(221) white move([6],[9])
game(2) step(221) white move([7],[6])
game(0) step(222) black move([1],[9])
game(1) step(222) black move([5],[10])
game(2) step(222) black move([8],[6])
game(0) step(223) white move([11],[14])
game(1) step(223) white move([6],[10])
game(2) step(223) white move([5],[7])
game(0) step(224) black move([13],[4])
game(1) step(224) black move([5],[11])
game(2) step(224) black move([5],[6])
game(0) step(225) white move([10],[15])
game(1) step(225) white mo

game(0) step(289) white move([15],[18])
game(1) step(289) white move([4],[5])
game(2) step(289) white move([12],[7])
game(0) step(290) black move([16],[18])
game(1) step(290) black move([5],[6])
game(2) step(290) black move([6],[9])
game(0) step(291) white move([6],[15])
game(1) step(291) white move([3],[6])
game(2) step(291) white move([6],[8])
game(0) step(292) black move([13],[9])
game(1) step(292) black move([8],[17])
game(2) step(292) black move([12],[6])
game(0) step(293) white move([16],[8])
game(1) step(293) white move([7],[17])
game(2) step(293) white move([5],[0])
game(0) step(294) black move([13],[8])
game(1) step(294) black move([8],[18])
game(2) step(294) black move([8],[9])
game(0) step(295) white move([16],[9])
game(1) step(295) white move([9],[18])
game(2) step(295) white move([6],[1])
game(0) step(296) black move([18],[9])
game(1) step(296) black move([10],[18])
game(2) step(296) black move([7],[0])
game(0) step(297) white move([12],[11])
game(1) step(297) white move([

game(0) step(361) white move([5],[18])
game(1) step(361) white move([11],[2])
game(2) step(361) white move([18],[0])
game(0) step(362) black move([4],[18])
game(1) step(362) black move([14],[13])
game(2) step(362) black move([4],[0])
game(0) step(363) white move([6],[18])
game(1) step(363) white move([3],[9])
game(2) step(363) white move([10],[13])
game(0) step(364) black move([8],[8])
game(1) step(364) black move([4],[9])
game(2) step(364) black move([10],[11])
game(0) step(365) white move([9],[8])
game(1) step(365) white move([10],[1])
game(2) step(365) white move([5],[0])
game(0) step(366) black move([9],[9])
game(1) step(366) black move([5],[7])
game(2) step(366) black move([10],[14])
game(0) step(367) white move([8],[7])
game(1) step(367) white move([5],[6])
game(2) step(367) white move([6],[1])
game(0) step(368) black move([2],[18])
game(1) step(368) black move([5],[1])
game(2) step(368) black move([17],[0])
game(0) step(369) white move([10],[8])
game(1) step(369) white move([4],

game(0) step(433) white move([8],[12])
game(2) step(433) white move([7],[17])
game(0) step(434) black move([7],[10])
game(2) step(434) black move([5],[17])
game(0) step(435) white move([8],[9])
game(2) step(435) white move([6],[18])
game(0) step(436) black move([7],[10])
game(2) step(436) black move([8],[18])
game(0) step(437) white move([8],[10])
game(2) step(437) white move([7],[18])
game(0) step(438) black move([9],[3])
game(2) step(438) black move([0],[15])
game(0) step(439) white move([11],[3])
game(2) step(439) white move([0],[13])
game(0) step(440) black move([9],[5])
game(2) step(440) black move([0],[12])
game(0) step(441) white move([17],[5])
game(2) step(441) white move([3],[2])
game(0) step(442) black move([18],[7])
game(2) step(442) black move([0],[14])
game(0) step(443) white move([17],[7])
game(2) step(443) white move([11],[4])
game(0) step(444) black move([12],[12])
game(2) step(444) black move([15],[17])
game(0) step(445) white move([11],[13])
game(2) step(445) white mo

### Record the game in SGF format

In [9]:
from sgfutil import GetWinner, WriteBackSGF
from datetime import datetime
import sgf

#comment out for better performance
for i in range(GAMES_BATCHES):
    winner = GetWinner(game_state[i]) # B+, W+, T
    filename = os.path.join(
        os.path.expanduser('~'), 'python', 'tutorial_files','selfplay',
        '({}_{}_{})vs({}_{}_{})_{}_{}_{}'.format(PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS,
                                        SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, i, winner,
                                        datetime.now().strftime("%Y-%m-%d")))
    print(filename)
    WriteBackSGF(winner, history[i], filename)

/home/wangd/python/tutorial_files/selfplay/(4_128_1)vs(13_192_1)_0_W+_2017-09-22
('B', array([3]), array([15]), array([[[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]],

       [[ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        ..., 
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ...,  0.,  0.,  0.]],

       [[ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        ..., 
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.],
        [ 1.,  1.,  1., ...,  1.,  1.,  1.]],

       ..., 
       [[ 0.,  0.

TypeError: only integer scalar arrays can be converted to a scalar index

## Learn from the winning games

>We use a reward function $r(s)$ that is zero for all non-terminal time-steps $t < T$.
The outcome $z_t = \pm r(s_T)$ is the terminal reward at the end of the game from the perspective of the
current player at time-step $t$: $+1$ for winning and $-1$ for losing. Weights are then updated at each
time-step $t$ by stochastic gradient ascent in the direction that maximizes expected outcome.

In [None]:
iter = 0
k = 0
for i in range(GAMES_BATCHES):
    print('Learning {} steps in {} of {} games.'.format(iter * 32, i, GAMES_BATCHES))
    for step in history[i]:
        if (step[0] == 'B' and winner == BLACK) or (step[0] == 'W' and winner == WHITE):
            data[k] = step[3]
            label[k] = step[1]*19+step[2]
            k += 1
            if k == TRAIN_BATCHES:
                iter += 1
                k = 0
                workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
                workspace.FeedBlob("data", data, device_option=device_opts)
                workspace.FeedBlob("label", label, device_option=device_opts)
                workspace.RunNet(primay_train_model.net)
print('Finished')

### Save the RL model of primary player
and also make a copy to opponent folder

In [None]:
if not os.path.exists(SAVE_FOLDER):
    os.makedirs(SAVE_FOLDER)
# construct the model to be exported
pe_meta = pe.PredictorExportMeta(
    predict_net=primary_deploy_model.net.Proto(),
    parameters=[str(b) for b in primary_deploy_model.params], 
    inputs=["data"],
    outputs=["predict"],
)
pe.save_to_db("minidb", os.path.join(SAVE_FOLDER, "policy_model.minidb"), pe_meta)
#pe.save_to_db("minidb", os.path.join(SPARR_FOLDER, "policy_model.minidb"), pe_meta)
print('Params saved to {}'.format(SAVE_FOLDER))