# Mock AlphaGo (3) - Reinforced Learning
In this notebook, we will train the policy network by letting them compete each other according to DeepMind:

> We further trained the policy network by policy gradient reinforcement learning.
Each iteration consisted of a mini-batch of n games played in parallel, between
the current policy network $p_\rho$ that is being trained, and an opponent $p_\rho-$
that uses parameters $\rho^-$ from a previous iteration, randomly sampled from
a pool $O$ of opponents, so as to increase the stability of training. Weights were
initialized to $\rho = \rho^- = \sigma$. Every 500 iterations, we added the current
parameters $\rho$ to the opponent pool. Each game $i$ in the mini-batch was played
out until termination at step $T^i$, and then scored to determine the outcome
$z^i_t = \pm r(s_{T^i})$ from each player’s perspective. The games were then replayed
to determine the policy gradient update, $\Delta\rho = \frac{a}{n}\Sigma^n_{i=1}
\Sigma^{T^i}_{t=1}\frac{\partial\log p_\rho(a^i_t|s^i_t)}{\partial_\rho}(z^i_t-v(s^i_t))$, using the REINFORCE 
algorithm with baseline $v(s^i_t)$ for variance reduction. On the first pass 
through the training pipeline, the baseline was set to zero; on the second pass
we used the value network $v_\theta(s)$ as a baseline; this provided a small
performance boost. The policy network was trained in this way for 10,000 
mini-batches of 128 games, using 50 GPUs, for one day.

In [1]:
import os, numpy as np
from caffe2.python import core, model_helper, workspace, brew, utils
from caffe2.proto import caffe2_pb2
from sgfutil import BOARD_POSITION

%matplotlib inline
from matplotlib import pyplot

# how many games will be run in one minibatch
GAMES_BATCHES = 128 # [1,infinity) depends on your hardware
# how many iterations for this tournament
TOURNAMENT_ITERS = 250000 # [1,infinity)

if workspace.has_gpu_support:
    device_opts = core.DeviceOption(caffe2_pb2.CUDA, workspace.GetDefaultGPUID())
    print('Running in GPU mode on default device {}'.format(workspace.GetDefaultGPUID()))
else :
    device_opts = core.DeviceOption(caffe2_pb2.CPU, 0)
    print('Running in CPU mode')

arg_scope = {"order": "NCHW"}

ROOT_FOLDER = os.path.join(os.path.expanduser('~'), 'python', 'tutorial_data','go','param') # folder stores the loss/accuracy log



Running in CPU mode


We need to differentiate primary player and sparring partner. Primary player will learn from the game result

In [2]:
# who is primary and who is sparring partner? 
PRIMARY_PLAYER = "black" # or white
SPARRING_PLAYER = "white"

### Config for primary player
PRIMARY_WORKSPACE = os.path.join(ROOT_FOLDER, PRIMARY_PLAYER)
PRIMARY_CONV_LEVEL = 4
PRIMARY_FILTERS = 128
PRIMARY_PRE_TRAINED_ITERS = 1
# before traning, where to load the params
PRIMARY_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS))

### Config for sparring partner
SPARR_WORKSPACE = os.path.join(ROOT_FOLDER, SPARRING_PLAYER)
SPARR_CONV_LEVEL = 13
SPARR_FILTERS = 192
SPARR_PRE_TRAINED_ITERS = 1
# before traning, where to load the params
SPARR_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(SPARR_CONV_LEVEL,SPARR_FILTERS,SPARR_PRE_TRAINED_ITERS))

print('{}-{}-{}({}) vs. {}-{}-{}({})'.format(
    PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS, PRIMARY_PLAYER,
    SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, SPARRING_PLAYER))

4-128-1(black) vs. 13-192-1(white)


Following training parameters are only for primary player.

In [3]:
BASE_LR = -0.003 # (-0.003,0) The base Learning Rate

TRAIN_BATCHES = 64 # how many samples will be trained within one mini-batch, depends on your hardware

# after training, where to store the params
SAVE_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".
                           format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS+TOURNAMENT_ITERS))
print('After training, result will be saved to {}'.format(SAVE_FOLDER))

After training, result will be saved to /home/wangd/python/tutorial_data/go/param/RL-conv=4-k=128-iter=501


## AlphaGo Neural Network Architecture

In [4]:
from modeling import AddConvModel, AddTrainingOperators

## Build the actual network

In [5]:
import caffe2.python.predictor.predictor_exporter as pe

data = np.empty(shape=(TRAIN_BATCHES,48,19,19), dtype=np.float32)
label = np.empty(shape=(TRAIN_BATCHES,), dtype=np.int32)

### Primary player
>Train Net: Blob('data','label') ==> Predict Net ==> Loss ==> Backward Propergation

In [6]:
workspace.SwitchWorkspace(PRIMARY_WORKSPACE, True)
# for learning from winner
with core.DeviceScope(device_opts):
    primary_train_model = model_helper.ModelHelper(name="primary_train_model", arg_scope=arg_scope, init_params=True)
    workspace.FeedBlob("data", data, device_option=device_opts)
    predict = AddConvModel(primary_train_model, "data", conv_level=PRIMARY_CONV_LEVEL, filters=PRIMARY_FILTERS)
    workspace.FeedBlob("label", data, device_option=device_opts)
    AddTrainingOperators(primary_train_model, predict, "label", base_lr=BASE_LR)
    workspace.RunNetOnce(primary_train_model.param_init_net)
    workspace.CreateNet(primary_train_model.net, overwrite=True)
# for learning from negative examples
with core.DeviceScope(device_opts):
    primary_train_neg_model = model_helper.ModelHelper(name="primary_train_neg_model", arg_scope=arg_scope, init_params=True)
    #workspace.FeedBlob("data", data, device_option=device_opts)
    predict = AddConvModel(primary_train_neg_model, "data", conv_level=PRIMARY_CONV_LEVEL, filters=PRIMARY_FILTERS)
    #workspace.FeedBlob("label", data, device_option=device_opts)
    AddTrainingOperators(primary_train_neg_model, predict, "label", base_lr=BASE_LR, learn_neg=True)
    workspace.RunNetOnce(primary_train_neg_model.param_init_net)
    workspace.CreateNet(primary_train_neg_model.net, overwrite=True)
    
primary_predict_net = pe.prepare_prediction_net(os.path.join(PRIMARY_LOAD_FOLDER, "policy_model.minidb"),
                                               "minidb", device_option=device_opts)



### Sparring partner
>Predict Net: Blob('data') ==> Predict Net ==> Blob('predict')

In [7]:
# Initialize sparring partner
workspace.SwitchWorkspace(SPARR_WORKSPACE, True)
sparring_predict_net = pe.prepare_prediction_net(os.path.join(SPARR_LOAD_FOLDER, "policy_model.minidb"),
                                                 "minidb", device_option=device_opts)

## Run the tournament and training

### Compete

In [12]:
from go import GameState, BLACK, WHITE, EMPTY, PASS
from preprocessing import Preprocess
from game import DEFAULT_FEATURES
from datetime import datetime

np.random.seed(datetime.now().microsecond)

game_state = [ GameState() for i in range(GAMES_BATCHES) ]
game_result = [0] * GAMES_BATCHES # 0 - Not Ended; BLACK - Black Wins; WHITE - White Wins
p = [ Preprocess(DEFAULT_FEATURES) ] * GAMES_BATCHES
history = [ [] for i in range(GAMES_BATCHES) ]
board = None

# for each step in all games
for step in range(0,500):
    
    board = np.concatenate([p[i].state_to_tensor(game_state[i]).astype(np.float32) for i in range(GAMES_BATCHES)])
    
    if step % 2 == 0:
        current_player = BLACK
        current_color = 'B'
    else:
        current_player = WHITE
        current_color = 'W'

    if step % 2 == (PRIMARY_PLAYER == 'white'):
        # primary player move
        workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
        workspace.FeedBlob('data', board, device_option=device_opts)
        workspace.RunNet(primary_predict_net)
    else:
        # sparring partner move
        workspace.SwitchWorkspace(SPARR_WORKSPACE)
        workspace.FeedBlob('data', board, device_option=device_opts)
        workspace.RunNet(sparring_predict_net)

    predict = workspace.FetchBlob('predict') # [0.01, 0.02, ...] in shape (N,361)
    
    for i in range(GAMES_BATCHES):
        if game_result[i]: # game end
            continue
        else: # game not end
            legal_moves = [ x*19+y for (x,y) in game_state[i].get_legal_moves(include_eyes=False)] # [59, 72, ...] in 1D
            if len(legal_moves) > 0: # at least 1 legal move
                probabilities = predict[i][legal_moves] # [0.02, 0.01, ...]
                # use numpy.random.choice to randomize the step,
                # otherwise use np.argmax to get best choice
                # current_choice = legal_moves[np.argmax(probabilities)]
                current_choice = np.random.choice(legal_moves, 1, p=probabilities/np.sum(probabilities))[0]
                (x, y) = (current_choice/19, current_choice%19)
                history[i].append((current_color, x, y, board[i]))
                game_state[i].do_move(action = (x, y), color = current_player) # End of Game?
                print('game({}) step({}) {} move({},{})'.format(i, step, current_color, x, y))
            else:
                game_state[i].do_move(action = PASS, color = current_player)
                print('game({}) step({}) {} PASS'.format(i, step, current_color))
                game_result[i] = game_state[i].is_end_of_game

    if np.all(game_result):
        break

game(0) step(0) B move(3,3)
game(1) step(0) B move(16,15)
game(2) step(0) B move(16,13)
game(0) step(1) W move(15,3)
game(1) step(1) W move(3,15)
game(2) step(1) W move(13,16)
game(0) step(2) B move(4,15)
game(1) step(2) B move(15,3)
game(2) step(2) B move(13,15)
game(0) step(3) W move(16,15)
game(1) step(3) W move(3,2)
game(2) step(3) W move(12,15)
game(0) step(4) B move(3,12)
game(1) step(4) B move(5,16)
game(2) step(4) B move(14,15)
game(0) step(5) W move(14,16)
game(1) step(5) W move(2,13)
game(2) step(5) W move(14,16)
game(0) step(6) B move(16,5)
game(1) step(6) B move(14,16)
game(2) step(6) B move(15,16)
game(0) step(7) W move(13,2)
game(1) step(7) W move(2,4)
game(2) step(7) W move(15,17)
game(0) step(8) B move(16,2)
game(1) step(8) B move(9,15)
game(2) step(8) B move(12,16)
game(0) step(9) W move(16,3)
game(1) step(9) W move(16,5)
game(2) step(9) W move(16,16)
game(0) step(10) B move(17,3)
game(1) step(10) B move(3,12)
game(2) step(10) B move(15,15)
game(0) step(11) W move(17,4

game(0) step(91) W move(4,16)
game(1) step(91) W move(12,4)
game(2) step(91) W move(1,3)
game(0) step(92) B move(5,15)
game(1) step(92) B move(17,5)
game(2) step(92) B move(11,8)
game(0) step(93) W move(6,17)
game(1) step(93) W move(17,6)
game(2) step(93) W move(9,4)
game(0) step(94) B move(7,17)
game(1) step(94) B move(18,6)
game(2) step(94) B move(10,4)
game(0) step(95) W move(6,16)
game(1) step(95) W move(17,4)
game(2) step(95) W move(10,5)
game(0) step(96) B move(7,16)
game(1) step(96) B move(17,3)
game(2) step(96) B move(11,4)
game(0) step(97) W move(0,17)
game(1) step(97) W move(18,5)
game(2) step(97) W move(11,5)
game(0) step(98) B move(8,11)
game(1) step(98) B move(16,2)
game(2) step(98) B move(12,5)
game(0) step(99) W move(8,9)
game(1) step(99) W move(11,1)
game(2) step(99) W move(12,6)
game(0) step(100) B move(6,2)
game(1) step(100) B move(14,1)
game(2) step(100) B move(11,6)
game(0) step(101) W move(6,15)
game(1) step(101) W move(5,6)
game(2) step(101) W move(9,5)
game(0) st

game(0) step(181) W move(1,2)
game(1) step(181) W move(7,15)
game(2) step(181) W move(0,6)
game(0) step(182) B move(1,3)
game(1) step(182) B move(7,14)
game(2) step(182) B move(0,3)
game(0) step(183) W move(2,4)
game(1) step(183) W move(8,14)
game(2) step(183) W move(0,4)
game(0) step(184) B move(1,1)
game(1) step(184) B move(7,13)
game(2) step(184) B move(0,5)
game(0) step(185) W move(3,2)
game(1) step(185) W move(10,13)
game(2) step(185) W move(1,8)
game(0) step(186) B move(0,2)
game(1) step(186) B move(9,13)
game(2) step(186) B move(1,6)
game(0) step(187) W move(3,4)
game(1) step(187) W move(9,16)
game(2) step(187) W move(1,2)
game(0) step(188) B move(4,3)
game(1) step(188) B move(11,13)
game(2) step(188) B move(7,12)
game(0) step(189) W move(4,4)
game(1) step(189) W move(14,0)
game(2) step(189) W move(13,1)
game(0) step(190) B move(5,3)
game(1) step(190) B move(13,0)
game(2) step(190) B move(12,1)
game(0) step(191) W move(5,4)
game(1) step(191) W move(15,0)
game(2) step(191) W move

game(0) step(271) W move(13,12)
game(1) step(271) W move(4,9)
game(2) step(271) W move(18,9)
game(0) step(272) B move(12,13)
game(1) step(272) B move(1,5)
game(2) step(272) B move(10,0)
game(0) step(273) W move(6,9)
game(1) step(273) W move(5,1)
game(2) step(273) W move(10,2)
game(0) step(274) B move(10,0)
game(1) step(274) B move(2,10)
game(2) step(274) B move(8,0)
game(0) step(275) W move(15,13)
game(1) step(275) W move(0,12)
game(2) step(275) W move(13,0)
game(0) step(276) B move(5,9)
game(1) step(276) B move(0,9)
game(2) step(276) B move(11,2)
game(0) step(277) W move(11,6)
game(1) step(277) W move(16,10)
game(2) step(277) W move(9,16)
game(0) step(278) B move(12,6)
game(1) step(278) B move(2,8)
game(2) step(278) B move(9,17)
game(0) step(279) W move(11,7)
game(1) step(279) W move(17,11)
game(2) step(279) W move(3,7)
game(0) step(280) B move(11,5)
game(1) step(280) B move(17,13)
game(2) step(280) B move(10,16)
game(0) step(281) W move(10,5)
game(1) step(281) W move(2,9)
game(2) ste

game(0) step(361) W move(0,15)
game(1) step(361) W move(0,10)
game(2) step(361) W move(7,4)
game(0) step(362) B move(4,2)
game(1) step(362) B move(1,15)
game(2) step(362) B move(0,7)
game(0) step(363) W move(4,17)
game(1) step(363) W move(1,14)
game(2) step(363) W move(15,4)
game(0) step(364) B move(8,18)
game(1) step(364) B move(0,0)
game(2) step(364) B move(3,11)
game(0) step(365) W move(0,7)
game(1) step(365) W move(10,2)
game(2) step(365) W move(15,10)
game(0) step(366) B move(5,15)
game(1) step(366) B move(10,18)
game(2) step(366) B move(17,12)
game(0) step(367) W move(0,18)
game(1) step(367) W move(9,18)
game(2) step(367) W move(0,0)
game(0) step(368) B move(18,10)
game(1) step(368) B move(11,11)
game(2) step(368) B move(1,16)
game(0) step(369) W move(18,9)
game(1) step(369) W move(12,13)
game(2) step(369) W move(0,16)
game(0) step(370) B move(17,0)
game(1) step(370) B move(12,12)
game(2) step(370) B move(0,1)
game(0) step(371) W move(17,10)
game(1) step(371) W move(6,0)
game(2) 

game(0) step(473) W move(16,2)
game(0) step(474) B move(16,6)
game(0) step(475) W move(2,10)
game(0) step(476) B move(16,8)
game(0) step(477) W move(2,13)
game(0) step(478) B move(3,14)
game(0) step(479) W move(14,9)
game(0) step(480) B PASS


### Record the game in SGF format

In [13]:
from sgfutil import GetWinner, WriteBackSGF
from datetime import datetime
import sgf

winner = [ GetWinner(game_state[i]) for i in range(GAMES_BATCHES) ] # B+, W+, T

#comment out for better performance
for i in range(GAMES_BATCHES):
    filename = os.path.join(
        os.path.expanduser('~'), 'python', 'tutorial_files','selfplay',
        '({}_{}_{})vs({}_{}_{})_{}_{}_{}'.format(PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS,
                                        SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, i, winner[i],
                                        datetime.now().strftime("%Y-%m-%d")))
    print(filename)
    WriteBackSGF(winner, history[i], filename)

/home/wangd/python/tutorial_files/selfplay/(4_128_1)vs(13_192_1)_0_W+_2017-09-25
/home/wangd/python/tutorial_files/selfplay/(4_128_1)vs(13_192_1)_1_W+_2017-09-25
/home/wangd/python/tutorial_files/selfplay/(4_128_1)vs(13_192_1)_2_W+_2017-09-25


## Learn from the winning games

>We use a reward function $r(s)$ that is zero for all non-terminal time-steps $t < T$.
The outcome $z_t = \pm r(s_T)$ is the terminal reward at the end of the game from the perspective of the
current player at time-step $t$: $+1$ for winning and $-1$ for losing. Weights are then updated at each
time-step $t$ by stochastic gradient ascent in the direction that maximizes expected outcome.

In [14]:
iter = 0
k = 0
for i in range(GAMES_BATCHES):
    print('Learning {} steps in {} of {} games.'.format(iter * 32, i, GAMES_BATCHES))
    for step in history[i]:
        if (step[0] == 'B' and winner[i] == 'B+') or (step[0] == 'W' and winner[i] == 'W+'):
            data[k] = step[3]
            label[k] = step[1]*19+step[2]
            k += 1
            if k == TRAIN_BATCHES:
                iter += 1
                k = 0
                workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
                workspace.FeedBlob("data", data, device_option=device_opts)
                workspace.FeedBlob("label", label, device_option=device_opts)
                workspace.RunNet(primary_train_model.net)

Learning 0 steps in 0 of 3 games.
Learning 96 steps in 1 of 3 games.
Learning 224 steps in 2 of 3 games.


Now learning from negative examples.

In [15]:
iter = 0
k = 0
for i in range(GAMES_BATCHES):
    print('Learning negative examples {} steps in {} of {} games.'.format(iter * 32, i, GAMES_BATCHES))
    for step in history[i]:
        if (step[0] == 'B' and winner[i] == 'W+') or (step[0] == 'W' and winner[i] == 'B+'):
            data[k] = step[3]
            label[k] = step[1]*19+step[2]
            k += 1
            if k == TRAIN_BATCHES:
                iter += 1
                k = 0
                workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
                workspace.FeedBlob("data", data, device_option=device_opts)
                workspace.FeedBlob("label", label, device_option=device_opts)
                workspace.RunNet(primary_train_neg_model.net)

print('Finished')

Learning negative examples 0 steps in 0 of 3 games.
Learning negative examples 96 steps in 1 of 3 games.
Learning negative examples 224 steps in 2 of 3 games.
Finished


### Save the RL model of primary player
and also make a copy to opponent folder

In [None]:
if not os.path.exists(SAVE_FOLDER):
    os.makedirs(SAVE_FOLDER)
# construct the model to be exported
pe_meta = pe.PredictorExportMeta(
    predict_net=primary_deploy_model.net.Proto(),
    parameters=[str(b) for b in primary_deploy_model.params], 
    inputs=["data"],
    outputs=["predict"],
)
pe.save_to_db("minidb", os.path.join(SAVE_FOLDER, "policy_model.minidb"), pe_meta)
#pe.save_to_db("minidb", os.path.join(SPARR_FOLDER, "policy_model.minidb"), pe_meta)
print('Params saved to {}'.format(SAVE_FOLDER))