# Mock AlphaGo (3B) Policy Network - Reinforcement Learning in mass production
In this notebook, we train the policy network through self-play, letting it compete against a sparring partner, as DeepMind describes:

> We further trained the policy network by policy gradient reinforcement learning.
Each iteration consisted of a mini-batch of n games played in parallel, between
the current policy network $p_\rho$ that is being trained, and an opponent $p_{\rho^-}$
that uses parameters $\rho^-$ from a previous iteration, randomly sampled from
a pool $O$ of opponents, so as to increase the stability of training. Weights were
initialized to $\rho = \rho^- = \sigma$. Every 500 iterations, we added the current
parameters $\rho$ to the opponent pool. Each game $i$ in the mini-batch was played
out until termination at step $T^i$, and then scored to determine the outcome
$z^i_t = \pm r(s_{T^i})$ from each player’s perspective. The games were then replayed
to determine the policy gradient update, $\Delta\rho = \frac{\alpha}{n}\sum^n_{i=1}\sum^{T^i}_{t=1}\frac{\partial\log p_\rho(a^i_t|s^i_t)}{\partial\rho}(z^i_t-v(s^i_t))$, using the REINFORCE
algorithm with baseline $v(s^i_t)$ for variance reduction. On the first pass 
through the training pipeline, the baseline was set to zero; on the second pass
we used the value network $v_\theta(s)$ as a baseline; this provided a small
performance boost. The policy network was trained in this way for 10,000 
mini-batches of 128 games, using 50 GPUs, for one day.

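The update rule quoted above can be illustrated with a minimal sketch. It assumes a toy, state-independent softmax policy with one logit per action, which is not the convolutional policy network used in this notebook; the names `softmax`, `reinforce_update`, `alpha` and `baseline` are hypothetical and chosen only for illustration. With the default zero baseline and outcomes of $\pm 1$, the update raises the probability of moves from won games and lowers it for moves from lost games, which is roughly what the positive and negative training nets later in this notebook aim for.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, games, alpha=0.1, baseline=lambda state: 0.0):
    """One REINFORCE step with a baseline for a toy softmax policy.

    games: list of trajectories, each a list of (state, action, z) tuples,
           where z is the game outcome (+1 or -1) seen by the player to move.
    Returns the updated logits (gradient ascent, averaged over the n games).
    """
    n = len(games)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for trajectory in games:
        for state, action, z in trajectory:
            advantage = z - baseline(state)
            for k in range(len(logits)):
                # d log p(action) / d logit_k = 1[k == action] - p_k
                indicator = 1.0 if k == action else 0.0
                grad[k] += (indicator - probs[k]) * advantage
    return [w + alpha / n * g for w, g in zip(logits, grad)]

# One game in which action 1 was played once and the game was won.
updated = reinforce_update([0.0, 0.0, 0.0], [[(None, 1, 1.0)]])
print(softmax(updated))
```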
In [1]:
import os, numpy as np
from caffe2.python import core, model_helper, workspace, brew, utils
from caffe2.proto import caffe2_pb2
from sgfutil import BOARD_POSITION

%matplotlib inline
from matplotlib import pyplot

# how many games will be run in one minibatch
GAMES_BATCHES = 16 # [1,infinity) depends on your hardware
# how many iterations for this tournament
TOURNAMENT_ITERS = 1000 # [1,infinity)

if workspace.has_gpu_support:
    device_opts = core.DeviceOption(caffe2_pb2.CUDA, workspace.GetDefaultGPUID())
    print('Running in GPU mode on default device {}'.format(workspace.GetDefaultGPUID()))
else:
    device_opts = core.DeviceOption(caffe2_pb2.CPU, 0)
    print('Running in CPU mode')

arg_scope = {"order": "NCHW"}

ROOT_FOLDER = os.path.join(os.path.expanduser('~'), 'python', 'tutorial_data','go','param') # folder that stores the model parameters



Running in CPU mode


We need to distinguish the primary player from the sparring partner. Only the primary player learns from the game results.

In [2]:
# who is primary and who is sparring partner? 
PRIMARY_PLAYER = "black" # or white
SPARRING_PLAYER = "white"

### Config for primary player
PRIMARY_WORKSPACE = os.path.join(ROOT_FOLDER, PRIMARY_PLAYER)
PRIMARY_CONV_LEVEL = 4
PRIMARY_FILTERS = 128
PRIMARY_PRE_TRAINED_ITERS = 1
# before training, where to load the params
PRIMARY_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS))

### Config for sparring partner
SPARR_WORKSPACE = os.path.join(ROOT_FOLDER, SPARRING_PLAYER)
SPARR_CONV_LEVEL = 4
SPARR_FILTERS = 128
SPARR_PRE_TRAINED_ITERS = 1
# before training, where to load the params
SPARR_LOAD_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".format(SPARR_CONV_LEVEL,SPARR_FILTERS,SPARR_PRE_TRAINED_ITERS))

print('{}-{}-{}({}) vs. {}-{}-{}({})'.format(
    PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS, PRIMARY_PLAYER,
    SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, SPARRING_PLAYER))

4-128-1(black) vs. 4-128-1(white)


The following training parameters apply only to the primary player.

In [3]:
BASE_LR = -0.003 # (-0.003,0) The base Learning Rate

TRAIN_BATCHES = 16 # how many samples will be trained within one mini-batch, depends on your hardware

# after training, where to store the params
SAVE_FOLDER = os.path.join(ROOT_FOLDER, "RL-conv={}-k={}-iter={}".
                           format(PRIMARY_CONV_LEVEL,PRIMARY_FILTERS,PRIMARY_PRE_TRAINED_ITERS+TOURNAMENT_ITERS))
print('After training, result will be saved to {}'.format(SAVE_FOLDER))

After training, result will be saved to /home/wangd/python/tutorial_data/go/param/RL-conv=4-k=128-iter=1001


## AlphaGo Neural Network Architecture

In [4]:
from modeling import AddConvModel, AddTrainingOperators

## Build the actual network

In [5]:
import caffe2.python.predictor.predictor_exporter as pe

data = np.empty(shape=(TRAIN_BATCHES,48,19,19), dtype=np.float32)
label = np.empty(shape=(TRAIN_BATCHES,), dtype=np.int32)

### Primary player
>Train Net: Blob('data','label') ==> Predict Net ==> Loss ==> Backward Propagation

In [6]:
workspace.SwitchWorkspace(PRIMARY_WORKSPACE, True)
# for learning from winner
with core.DeviceScope(device_opts):
    primary_train_model = model_helper.ModelHelper(name="primary_train_model", arg_scope=arg_scope, init_params=True)
    workspace.FeedBlob("data", data, device_option=device_opts)
    predict = AddConvModel(primary_train_model, "data", conv_level=PRIMARY_CONV_LEVEL, filters=PRIMARY_FILTERS)
    workspace.FeedBlob("label", label, device_option=device_opts)
    AddTrainingOperators(primary_train_model, predict, "label", base_lr=BASE_LR)
    workspace.RunNetOnce(primary_train_model.param_init_net)
    workspace.CreateNet(primary_train_model.net, overwrite=True)
# for learning from negative examples
with core.DeviceScope(device_opts):
    primary_train_neg_model = model_helper.ModelHelper(name="primary_train_neg_model", arg_scope=arg_scope, init_params=True)
    #workspace.FeedBlob("data", data, device_option=device_opts)
    predict = AddConvModel(primary_train_neg_model, "data", conv_level=PRIMARY_CONV_LEVEL, filters=PRIMARY_FILTERS)
    #workspace.FeedBlob("label", data, device_option=device_opts)
    AddTrainingOperators(primary_train_neg_model, predict, "label", base_lr=BASE_LR, learn_neg=True)
    workspace.RunNetOnce(primary_train_neg_model.param_init_net)
    workspace.CreateNet(primary_train_neg_model.net, overwrite=True)
    
primary_predict_net = pe.prepare_prediction_net(os.path.join(PRIMARY_LOAD_FOLDER, "policy_model.minidb"),
                                               "minidb", device_option=device_opts)



### Sparring partner
>Predict Net: Blob('data') ==> Predict Net ==> Blob('predict')

In [None]:
# Initialize sparring partner
workspace.SwitchWorkspace(SPARR_WORKSPACE, True)
sparring_predict_net = pe.prepare_prediction_net(os.path.join(SPARR_LOAD_FOLDER, "policy_model.minidb"),
                                                 "minidb", device_option=device_opts)

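The paper samples each opponent at random from a pool of earlier checkpoints, whereas this notebook loads a single fixed sparring partner. A minimal sketch of such a pool follows. The class `OpponentPool`, its methods and the checkpoint labels are hypothetical and not part of this notebook or of Caffe2; the checkpoints are stood in for by plain strings.

```python
import random

class OpponentPool:
    """Keeps earlier checkpoints and samples one uniformly at random."""

    def __init__(self, initial_checkpoint, add_every=500, seed=None):
        self.checkpoints = [initial_checkpoint]
        self.add_every = add_every
        self.rng = random.Random(seed)

    def maybe_add(self, iteration, checkpoint):
        """Add the current parameters every `add_every` iterations."""
        if iteration > 0 and iteration % self.add_every == 0:
            self.checkpoints.append(checkpoint)
            return True
        return False

    def sample(self):
        """Pick the opponent for the next mini-batch of games."""
        return self.rng.choice(self.checkpoints)

# Hypothetical checkpoint labels; a real pool would hold parameter files.
pool = OpponentPool("initial", add_every=500, seed=0)
for iteration in range(1001):
    pool.maybe_add(iteration, "iteration-{}".format(iteration))
print(pool.checkpoints)
```

In this notebook, the sampled entry would decide which saved model to load as the sparring partner before each tournament.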
## Run the tournament and training
>We use a reward function $r(s)$ that is zero for all non-terminal time-steps $t < T$.
The outcome $z_t = \pm r(s_T)$ is the terminal reward at the end of the game from the perspective of the
current player at time-step $t$: $+1$ for winning and $-1$ for losing. Weights are then updated at each
time-step $t$ by stochastic gradient ascent in the direction that maximizes expected outcome.

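As a small illustration of the outcome $z_t$ described above, the sketch below labels every move of a finished game from the perspective of the player who made it. The function name `outcomes_for_game` is hypothetical. The winner strings follow the `B+`, `W+`, `T` convention used in this notebook, and treating a tie as an outcome of zero is an assumption made here, not something the paper specifies.

```python
def outcomes_for_game(move_colors, winner):
    """Return z_t for each move of one game.

    move_colors: list of 'B' or 'W', the color that played each move.
    winner: 'B+' if Black won, 'W+' if White won, anything else is a tie.
    """
    if winner not in ('B+', 'W+'):
        # Assumption: a tie gives no learning signal to either side.
        return [0.0] * len(move_colors)
    winning_color = winner[0]
    return [1.0 if color == winning_color else -1.0 for color in move_colors]

# Black wins: Black's moves get +1, White's moves get -1.
print(outcomes_for_game(['B', 'W', 'B', 'W'], 'B+'))
```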
In [None]:
from go import GameState, BLACK, WHITE, EMPTY, PASS
from preprocessing import Preprocess
from game import DEFAULT_FEATURES
from datetime import datetime
from sgfutil import GetWinner, WriteBackSGF
import sgf

np.random.seed(datetime.now().microsecond)

# construct the model to be exported
pe_meta = pe.PredictorExportMeta(
    predict_net=primary_train_model.net.Proto(),
    parameters=[str(b) for b in primary_train_model.params], 
    inputs=["data"],
    outputs=["predict"],
)

for tournament in range(TOURNAMENT_ITERS):
    # Every 50 tournaments, save a checkpoint of the current player (the paper adds it to the opponent pool every 500 iterations)
    if tournament > 0 and tournament % 50 == 0:
        pe.save_to_db("minidb", os.path.join(SAVE_FOLDER, "policy_model_{}.minidb".format(PRIMARY_PRE_TRAINED_ITERS+tournament)), pe_meta)
        print('Checkpoint saved to {}'.format(SAVE_FOLDER))
        
    # TODO: randomly pick an opponent
    # TODO: randomly change color of player
    game_state = [ GameState() for i in range(GAMES_BATCHES) ]
    game_result = [0] * GAMES_BATCHES # 0 - Not Ended; BLACK - Black Wins; WHITE - White Wins
    p = [ Preprocess(DEFAULT_FEATURES) ] * GAMES_BATCHES
    history = [ [] for i in range(GAMES_BATCHES) ]
    board = None
    
    if PRIMARY_PLAYER == 'black':
        BLACK_PLAYER = 'Primary'
        WHITE_PLAYER = 'Sparring'
    else:
        BLACK_PLAYER = 'Sparring'
        WHITE_PLAYER = 'Primary'

    # for each step in all games
    for step in range(0,500):
        board = np.concatenate([p[i].state_to_tensor(game_state[i]).astype(np.float32) for i in range(GAMES_BATCHES)])

        if step % 2 == 0:
            current_player = BLACK
            current_color = 'B'
        else:
            current_player = WHITE
            current_color = 'W'

        if step % 2 == (PRIMARY_PLAYER == 'white'):
            # primary player move
            workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
            workspace.FeedBlob('data', board, device_option=device_opts)
            workspace.RunNet(primary_predict_net)
        else:
            # sparring partner move
            workspace.SwitchWorkspace(SPARR_WORKSPACE)
            workspace.FeedBlob('data', board, device_option=device_opts)
            workspace.RunNet(sparring_predict_net)

        predict = workspace.FetchBlob('predict') # [0.01, 0.02, ...] in shape (N,361)

        for i in range(GAMES_BATCHES):
            if game_result[i]: # game end
                continue
            else: # game not end
                legal_moves = [ x*19+y for (x,y) in game_state[i].get_legal_moves(include_eyes=False)] # [59, 72, ...] in 1D
                if len(legal_moves) > 0: # at least 1 legal move
                    probabilities = predict[i][legal_moves] # [0.02, 0.01, ...]
                    # use numpy.random.choice to randomize the step,
                    # otherwise use np.argmax to get best choice
                    # current_choice = legal_moves[np.argmax(probabilities)]
                    if np.sum(probabilities) > 0:
                        current_choice = np.random.choice(legal_moves, 1, p=probabilities/np.sum(probabilities))[0]
                    else:
                        current_choice = np.random.choice(legal_moves, 1)[0]
                    (x, y) = (current_choice//19, current_choice%19) # integer division keeps x an int on Python 3 as well
                    history[i].append((current_color, x, y, board[i]))
                    game_state[i].do_move(action = (x, y), color = current_player) # End of Game?
                    #print('game({}) step({}) {} move({},{})'.format(i, step, current_color, x, y))
                else:
                    game_state[i].do_move(action = PASS, color = current_player)
                    #print('game({}) step({}) {} PASS'.format(i, step, current_color))
                    game_result[i] = game_state[i].is_end_of_game

        if np.all(game_result):
            break
    
    # Get the winner
    winner = [ GetWinner(game_state[i]) for i in range(GAMES_BATCHES) ] # B+, W+, T
    print('Tournament {} Finished with Black({}) {}:{} White({})'.
          format(tournament, BLACK_PLAYER, sum(np.char.count(winner, 'B+')),
                 sum(np.char.count(winner, 'W+')), WHITE_PLAYER)) 
    
    # Save the games
    for i in range(GAMES_BATCHES):
        filename = os.path.join(
            os.path.expanduser('~'), 'python', 'tutorial_files','selfplay',
            '({}_{}_{})vs({}_{}_{})_{}_{}_{}'.format(PRIMARY_CONV_LEVEL, PRIMARY_FILTERS, PRIMARY_PRE_TRAINED_ITERS+tournament,
                                            SPARR_CONV_LEVEL, SPARR_FILTERS, SPARR_PRE_TRAINED_ITERS, i, winner[i],
                                            datetime.now().strftime("%Y-%m-%dT%H:%M:%S%Z")))
        #print(filename)
        WriteBackSGF(winner, history[i], filename)
    
    # After each tournament, learn from the winner
    #iter = 0
    k = 0
    for i in range(GAMES_BATCHES):
        #print('Learning {} steps in {} of {} games'.format(iter * TRAIN_BATCHES, i, GAMES_BATCHES))
        for step in history[i]:
            if (step[0] == 'B' and winner[i] == 'B+') or (step[0] == 'W' and winner[i] == 'W+'):
                data[k] = step[3]
                label[k] = step[1]*19+step[2]
                k += 1
                if k == TRAIN_BATCHES:
                    #iter += 1
                    k = 0
                    workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
                    workspace.FeedBlob("data", data, device_option=device_opts)
                    workspace.FeedBlob("label", label, device_option=device_opts)
                    workspace.RunNet(primary_train_model.net)
    
    # And learn from negative examples
    #iter = 0
    k = 0
    for i in range(GAMES_BATCHES):
        #print('Learning negative examples {} steps in {} of {} games'.format(iter * TRAIN_BATCHES, i, GAMES_BATCHES))
        for step in history[i]:
            if (step[0] == 'B' and winner[i] == 'W+') or (step[0] == 'W' and winner[i] == 'B+'):
                data[k] = step[3]
                label[k] = step[1]*19+step[2]
                k += 1
                if k == TRAIN_BATCHES:
                    #iter += 1
                    k = 0
                    workspace.SwitchWorkspace(PRIMARY_WORKSPACE)
                    workspace.FeedBlob("data", data, device_option=device_opts)
                    workspace.FeedBlob("label", label, device_option=device_opts)
                    workspace.RunNet(primary_train_neg_model.net)

Tournament 0 Finished with Black(Primary) 11:5 White(Sparring)
Tournament 1 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 2 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 3 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 4 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 5 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 6 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 7 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 8 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 9 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 10 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 11 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 12 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 13 Finished with Black(Primary) 0:16 White(Sparring)
Tournament 14 Finished with Black(Primary) 0:16 White(Sparring)

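The move-selection step inside the tournament loop restricts the policy output to the legal moves, renormalizes the remaining probabilities, and falls back to a uniform choice when they sum to zero. The sketch below isolates that logic using only the standard library. The helper names `to_flat`, `to_xy` and `sample_legal_move` are hypothetical, and the policy is assumed to be a flat sequence of 361 probabilities indexed as `x*19+y`, as in the loop above.

```python
import random

BOARD_SIZE = 19

def to_flat(x, y, size=BOARD_SIZE):
    """Board coordinate (x, y) -> index into the flat policy vector."""
    return x * size + y

def to_xy(index, size=BOARD_SIZE):
    """Index into the flat policy vector -> board coordinate (x, y)."""
    return divmod(index, size)

def sample_legal_move(policy, legal_moves, rng=random):
    """Sample one legal move in proportion to its policy probability.

    policy: sequence of BOARD_SIZE**2 probabilities.
    legal_moves: list of (x, y) tuples.
    Returns None when there is no legal move, so the caller can pass.
    """
    if not legal_moves:
        return None
    weights = [policy[to_flat(x, y)] for x, y in legal_moves]
    if sum(weights) <= 0:
        # The policy puts no mass on any legal move: choose uniformly.
        return rng.choice(legal_moves)
    # random.choices normalizes the weights internally.
    return rng.choices(legal_moves, weights=weights, k=1)[0]

policy = [0.0] * (BOARD_SIZE * BOARD_SIZE)
policy[to_flat(3, 4)] = 1.0
print(sample_legal_move(policy, [(3, 4), (0, 0)]))
```

Using `divmod` for the inverse mapping keeps both coordinates as integers regardless of the Python version.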