<h1 style="color:darkred"> Training a Neural Network Heads Up No Limit Texas Holdem with probabilistic decisions</h1>
In this notebook we are training an agent with self-play reinforcement learning to play heads-up (1 vs.1) Poker. Deterministic policies are not suited for non-deterministic games.

<h2>Simple Test Game </h2>
Before we train a model to play poker, we need a very simple test game, that has similar properties, but is so simple that the training algorithm can be verified with reasonable training time. We don't want to try training a model to play poker, before we know that the algorithm in principle can beat similar games. These are the criteria the test-game has to meet.
<ul>
    <li>Two players </li>
    <li>Imperfect Information </li>
    <li>Round based. At least two decisions have to be made per round.</li>
    <li>Solution known. We need to be able to compute the Nash-Equilibrium for the game or the game has to have been solved already.<br>
    we need to be able to verify if the our model learns the same solution as the nash equilibrium</li>
</ul>
We are going to invent our own very simple no limit holdem poker game. We are going to limit the number of cards to the range of A to T and have with two sets, so there are only 10 cards total. A players receives one hole-card. And there will be one community card. There will be two betting rounds. Preflop the player can decide to fold, call or go all in. Postflop the player has the same decisions. Both players have 5 BB. The winning hand is a pair or the high card. There are no suits<br>
There is a very limited number of possible actions, so we can use simple q-learning to learn the best strategy.
<ol>
    <li>sb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb checks</li>
    <li>sb calls, bb checks, (flop), bb checks, sb allin, bb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb allin, bb calls </li>
    <li>sb calls, bb checks, (flop), bb allin, sb folds </li>
    <li>sb calls, bb checks, (flop), bb allin, sb calls </li>
    <li>sb calls, bb allin,  sb folds </li>
    <li>sb calls, bb allin, sb calls </li>
    <li>sb allin, bb folds </li>
    <li>sb allin, bb calls </li>
</ol>

The file can be found in game.PokerSimple


In [1]:
from game.PokerSimple import PokerSimple

<h3>Testing the game</h3>

The file 'Test_PokerSimple.py' contains unittests for the game. In addition let's do some sanity checks. We are going to make the agents do random moves and record how much which hand wins. We are also going to check if both players have the same stats.

In [None]:
import random
hero = 0
villain = 1
game = PokerSimple(hero, villain)

n_games = 1000000
results= [{1:0, 2: 0, 3: 0, 4: 0, 5: 0},{1:0, 2: 0, 3: 0, 4: 0, 5: 0}]

for i in range(n_games):
    if i %10000 == 0:
        print(i)
    done, next_to_act, observation = game.reset()
    while game.done == False:
        action = random.randint(0,2)
        game.implement_action(game.next_to_act[0], action)
    game.create_observation(0)
    game.create_observation(1)
    stack_change_0 = game.observations[0][1] -5
    stack_change_1 = game.observations[1][1] -5
    hand_0 = game.observations[0][4]
    hand_1 = game.observations[1][4]
    
    results[0][hand_0] += stack_change_0
    results[1][hand_1] += stack_change_1

In [None]:
results

Sanity check succeeded. The lower hands loose and the higher hands win.

<h3> Learning Optimal Game </h3>
Below is an attempt to solve for the nash equilibrium for the game. Due to complexity, I have not succeeded in this. A paper presenting a potential solution can be found <a href="http://proceedings.mlr.press/v119/munos20a/munos20a.pdf">here</a>.

In [None]:
probs_sb = {1:{'preflop':{'preflop_fi':[0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]}, 
            'postflop':{1:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},2:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        3:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},4:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        5:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]}}},
            2:{'preflop':{'preflop_fi':[0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]}, 
            'postflop':{1:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},2:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        3:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},4:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        5:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]}}},
            3:{'preflop':{'preflop_fi':[0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]}, 
            'postflop':{1:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},2:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        3:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},4:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        5:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]}}},
            4:{'preflop':{'preflop_fi':[0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]}, 
            'postflop':{1:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},2:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        3:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},4:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        5:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]}}},
            5:{'preflop':{'preflop_fi':[0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]}, 
            'postflop':{1:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},2:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        3:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},4:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]},
                        5:{'to_check': [0.5, 0.5], 'to_push':[0.5, 0.5]}}}}

In [None]:
probs_bb = {1:{'preflop':{'to_call':[0.5,0.5], 'to_push':[0.5,0.5]},
               'postflop':{1:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           2:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           3:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           4:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           5:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]}}},
            2:{'preflop':{'to_call':[0.5,0.5], 'to_push':[0.5,0.5]},
                           'postflop':{1:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                                       2:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                                       3:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                                       4:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                                       5:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]}}},
            3:{'preflop':{'to_call':[0.5,0.5], 'to_push':[0.5,0.5]},
               'postflop':{1:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           2:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           3:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           4:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           5:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]}}},
            4:{'preflop':{'to_call':[0.5,0.5], 'to_push':[0.5,0.5]},
               'postflop':{1:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           2:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           3:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           4:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5, 0.5]},
                           5:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]}}},
            5:{'preflop':{'to_call':[0.5,0.5], 'to_push':[0.5,0.5]},
               'postflop':{1:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           2:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           3:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           4:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]},
                           5:{'postflop_fi':[0.5,0.5], 'postflop_to_push':[0.5,0.5]}}}}
            


In [None]:
#dictionary to store proportions of win losse and split a player wins at showdown
pot_props_pre = {key: {key: 0 for key in range(1,6)} for key in range(1,6)}

deck = list(range(1,6))
deck.extend(deck)
for hole_0 in range(1,6):
    deck_rest = deck[:]
    deck_rest.remove(hole_0)
    for hole_1 in range (1,6):
        deck_rest_ = deck_rest[:]
        deck_rest_.remove(hole_1)
        win_0 = 0
        win_1 = 0
        split = 0
        for card in deck_rest_:
            if (hole_0 == card and hole_1 != card) or (hole_0 > hole_1 and hole_1 != card):
                win_0 += 1
            elif (hole_1 == card and hole_0 != card) or (hole_1 > hole_0 and hole_0 != card):
                win_1 += 1
            else:
                split +=1
        proportion = (win_0/8, win_1/8, split/8)
        pot_props_pre[hole_0][hole_1] = proportion 
        
#lookup table for postflop. -1 means loose, 1 means win and 0 means split
lookup_post = {key: {key: {key: 0 for key in range(1,6)} for key in range(1,6)} for key in range(1,6)}
for hole_0 in range(1,6):
    deck_rest = deck[:]
    deck_rest.remove(hole_0)
    for hole_1 in range (1,6):
        deck_rest_ = deck_rest[:]
        deck_rest_.remove(hole_1)
        
        for card in deck_rest_:
            if (hole_0 == card and hole_1 != card) or (hole_0 > hole_1 and hole_1 != card):
                lookup_post[hole_0][hole_1][card] = 1
            elif (hole_1 == card and hole_0 != card) or (hole_1 > hole_0 and hole_0 != card):
                lookup_post[hole_0][hole_1][card] = -1
            else:
                lookup_post[hole_0][hole_1][card] = 0


In [None]:
import numpy as np
def cal_ev_sb_hole(probs_sb, probs_bb, holecard):
    #postflop utils must be calculated for each flop. We need the probability of villain having a particular hand.
    util_1111 =  []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]
            util += p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_1111.append(util/probs_total)
    
    util_11120 = 1
    util_11121 = []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_to_push'][1]
            util += 5*p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_11121.append(util/probs_total)
    
    util_1120 = -1
    util_1121 = []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][1]
            util += 5*p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_1121.append(util/probs_total)
    
    util_120 = -1
    util_121 = 0
    probs_total = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][1]
        probs_total += pvil_playhand*p_deal_holevil
        util_121 += 5*p_deal_holevil*pvil_playhand*pot_props[holecard][hole_vil][0]-5*p_deal_holevil*pvil_playhand*pot_props[holecard][hole_vil][1]
    util_121 /= probs_total
    
    util_20 = 1
    util_21 = 0
    probs_total = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_push'][1]
        probs_total += pvil_playhand*p_deal_holevil
        util_21 += 5*p_deal_holevil*pvil_playhand*pot_props[holecard][hole_vil][0]-5*p_deal_holevil*pvil_playhand*pot_props[holecard][hole_vil][1]
    util_21 /= probs_total
    p0 = probs_sb[holecard]['preflop']['preflop_fi'][0]
    p1 = probs_sb[holecard]['preflop']['preflop_fi'][1]
    p2 = probs_sb[holecard]['preflop']['preflop_fi'][2]
    p11 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]
        p11 += pvil_playhand* p_deal_holevil
        
        
    p12 = 0    
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]
        p12 += pvil_playhand* p_deal_holevil
    
        
    p111 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        for board in range(1,6):
            if board == holecard and board == hole_vil:
                p_board = 0
            elif board ==holecard or board == hole_vil:
                p_board = 1/8
            else:
                p_board = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]
            p111 += pvil_playhand*p_board*p_deal_holevil
    p111 /= p11
            
    p112 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        for board in range(1,6):
            if board == holecard and board == hole_vil:
                p_board = 0
            elif board ==holecard or board == hole_vil:
                p_board = 1/8
            else:
                p_board = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][1]
            p112 += pvil_playhand*p_board*p_deal_holevil
    p112 /= p11
    
    #ev = p0*-0.5+ p1['preflop_fi'][1]*util_1 + probs_sb[holecard]['preflop']['preflop_fi'][2]*util_2

<h3>Q-Learning Simple Game </h3>
Through Q-Learning, we are going to learn the best deterministic strategy for the simple game. This will serve as a benchmark for the actual deep-learning algorithm. Because Q_Learning solves for a deterministic solution, a probabilistic solution will have to beat it, or if the deterministic solution is optimal, it should break even. In Q_learning, we need all possible states and action pairs. <br>
We first need a q-table. A table of all possible states and actions.

In [None]:
q= {'sb':{1:{'preflop':{'preflop_fi':[0,0,0], 'preflop_to_push': [0,0]}, 
            'postflop':{1:{'to_check': [0,0], 'to_push':[0,0]},2:{'to_check': [0,0], 'to_push':[0,0]},
                        3:{'to_check': [0,0], 'to_push':[0,0]},4:{'to_check': [0,0], 'to_push':[0,0]},
                        5:{'to_check': [0,0], 'to_push':[0,0]}}},
            2:{'preflop':{'preflop_fi':[0,0,0], 'preflop_to_push': [0,0]}, 
            'postflop':{1:{'to_check': [0,0], 'to_push':[0,0]},2:{'to_check': [0,0], 'to_push':[0,0]},
                        3:{'to_check': [0,0], 'to_push':[0,0]},4:{'to_check': [0,0], 'to_push':[0,0]},
                        5:{'to_check': [0,0], 'to_push':[0,0]}}},
            3:{'preflop':{'preflop_fi':[0,0,0], 'preflop_to_push': [0,0]}, 
            'postflop':{1:{'to_check': [0,0], 'to_push':[0,0]},2:{'to_check': [0,0], 'to_push':[0,0]},
                        3:{'to_check': [0,0], 'to_push':[0,0]},4:{'to_check': [0,0], 'to_push':[0,0]},
                        5:{'to_check': [0,0], 'to_push':[0,0]}}},
            4:{'preflop':{'preflop_fi':[0,0,0], 'preflop_to_push': [0,0]}, 
            'postflop':{1:{'to_check': [0,0], 'to_push':[0,0]},2:{'to_check': [0,0], 'to_push':[0,0]},
                        3:{'to_check': [0,0], 'to_push':[0,0]},4:{'to_check': [0,0], 'to_push':[0,0]},
                        5:{'to_check': [0,0], 'to_push':[0,0]}}},
            5:{'preflop':{'preflop_fi':[0,0,0], 'preflop_to_push': [0,0]}, 
            'postflop':{1:{'to_check': [0,0], 'to_push':[0,0]},2:{'to_check': [0,0], 'to_push':[0,0]},
                        3:{'to_check': [0,0], 'to_push':[0,0]},4:{'to_check': [0,0], 'to_push':[0,0]},
                        5:{'to_check': [0,0], 'to_push':[0,0]}}}},
    'bb': {1:{'preflop':{'to_call':[0,0], 'to_push':[0,0]},
               'postflop':{1:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           2:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           3:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           4:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           5:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]}}},
            2:{'preflop':{'to_call':[0,0], 'to_push':[0,0]},
               'postflop':{1:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           2:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           3:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           4:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           5:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]}}},
            3:{'preflop':{'to_call':[0,0], 'to_push':[0,0]},
               'postflop':{1:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           2:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           3:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           4:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           5:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]}}},
            4:{'preflop':{'to_call':[0,0], 'to_push':[0,0]},
               'postflop':{1:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           2:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           3:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           4:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           5:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]}}},
            5:{'preflop':{'to_call':[0,0], 'to_push':[0,0]},
               'postflop':{1:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           2:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           3:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           4:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]},
                           5:{'postflop_fi':[0,0], 'postflop_to_push':[0,0]}}}}}

In [None]:
import numpy as np
import random
epsilon = 0.2
lr = 0.01
hero = 0
villain = 1
sb_score = 0
bb_score = 0
game = PokerSimple(hero, villain)
#record state - action pair
n_games = 1000000
for i in range(n_games):
    if i %10000 == 0:
        print("episode",i)
    last_state_action = {0:{'state':None, 'action': None}, 1:{'state':None, 'action': None}}

    done, next_to_act, observation = game.reset()
    hands = [game.hole_0,game.hole_1]

    sb = next_to_act
    bb = 0 if sb==1 else 1
    if random.uniform(0,1) < epsilon:
        action = random.randint(0,2)

    else:
        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['preflop']['preflop_fi'])

    last_state_action[next_to_act]['state'] = ['sb',hands[next_to_act], 'preflop', 0, 'preflop_fi'] #these information provies all necessary infos to find q in table.
    last_state_action[next_to_act]['action'] = action

    game.implement_action(next_to_act,action)
    done = game.done

    while not done:
        if random.uniform(0,1) < epsilon:
            explore = True
            action = random.randint(0,1)
        else:
            explore = False

        hero = game.next_to_act[0]
        game.create_observation(hero)

        if game.next_to_act[0] == sb:
            if game.street == 0: #if game is preflop and player is sb, then it must be that bb pushed, otherwise there is no more action for sb              
                if not explore:
                    action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['preflop']['preflop_to_push'])
                last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'preflop', 0, 'preflop_to_push']
                last_state_action[game.next_to_act[0]]['action'] = action

            else:
                if game.stacks[bb] != 0: #bb checked
                    if not explore:
                        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['postflop'][game.board]['to_check'])
                    last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'postflop', game.board, 'to_check']
                    last_state_action[game.next_to_act[0]]['action'] = action

                    if action == 0:
                        action =1

                    else:
                        action = 2


                else: ##bb pushed
                    if not explore:
                        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['postflop'][game.board]['to_push'])

                    last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'postflop', game.board, 'to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action               
        else:
            if game.street == 0:
                if game.stacks[sb] != 0: #sb called                
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['preflop']['to_call'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'preflop', 0, 'to_call']
                    last_state_action[game.next_to_act[0]]['action'] = action 

                    if action == 0:
                        action =1

                    else:
                        action = 2

                else:
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['preflop']['to_push'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'preflop', 0, 'to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action 

            else:
                game.create_observation(game.next_to_act[0])
                if game.observations[game.next_to_act[0]][9] == -1: #postflop first in can only happen if preflop went call check           
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['postflop'][game.board]['postflop_fi'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'postflop', game.board, 'postflop_fi']
                    last_state_action[game.next_to_act[0]]['action'] = action 

                    if action == 0:
                        action = 1

                    else:
                        action = 2

                else: #if there is any action left it must be that smallblind pushed
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['postflop'][game.board]['postflop_to_push'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'postflop', game.board, 'postflop_to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action


        game.implement_action(game.next_to_act[0],action)
        done = game.done

        if not done:
            position = last_state_action[game.next_to_act[0]]['state'][0]
            hand = last_state_action[game.next_to_act[0]]['state'][1]
            street = last_state_action[game.next_to_act[0]]['state'][2]
            board = last_state_action[game.next_to_act[0]]['state'][3]
            situation = last_state_action[game.next_to_act[0]]['state'][4]
            chosen_action = last_state_action[game.next_to_act[0]]['action']
            if board:
                q_current = q[position][hand][street][board][situation][chosen_action]
                q[position][hand][street][board][situation][chosen_action]+= lr * (np.max(q[position][hand][street][board][situation]) - q_current)

            else:

                q_current = q[position][hand][street][situation][chosen_action]
                q[position][hand][street][situation][chosen_action]+= lr * (np.max(q[position][hand][street][situation]) - q_current)

    if sb == 0:
        reward_0 = game.stacks[0]-4.5 
        reward_1 = game.stacks[1]-4
        sb_score += reward_0 -0.5
        bb_score += reward_1 - 1
    else:
        reward_1 = game.stacks[1]-4.5 
        reward_0 = game.stacks[0]-4
        sb_score += reward_1 -0.5
        bb_score += reward_0 -1
    chosen_action_0 = last_state_action[0]['action']

    if chosen_action_0 is not None:

        position_0 = last_state_action[0]['state'][0]
        hand_0 = last_state_action[0]['state'][1]
        street_0 = last_state_action[0]['state'][2]
        board_0 = last_state_action[0]['state'][3]
        situation_0 = last_state_action[0]['state'][4]

        if board_0:
            q_current = q[position_0][hand_0][street_0][board_0][situation_0][chosen_action_0]
            q[position_0][hand_0][street_0][board_0][situation_0][chosen_action_0]+= lr * (reward_0 - q_current)
        else:
            q_current = q[position_0][hand_0][street_0][situation_0][chosen_action_0]
            q[position_0][hand_0][street_0][situation_0][chosen_action_0]+= lr * (reward_0 - q_current)
        
        
    chosen_action_1 = last_state_action[1]['action']
    if chosen_action_1 is not None:
        position_1 = last_state_action[1]['state'][0]
        hand_1 = last_state_action[1]['state'][1]
        street_1 = last_state_action[1]['state'][2]
        board_1 = last_state_action[1]['state'][3]
        situation_1 = last_state_action[1]['state'][4]

        if board_1:
            q_current = q[position_1][hand_1][street_1][board_1][situation_1][chosen_action_1]
            q[position_1][hand_1][street_1][board_1][situation_1][chosen_action_1]+= lr * (reward_1 - q_current)
        else:
            q_current = q[position_1][hand_1][street_1][situation_1][chosen_action_1]
            q[position_1][hand_1][street_1][situation_1][chosen_action_1]+= lr * (reward_1 - q_current)

Let's check the average score for sb and bb per hand and the q-table. We can do a few sanity checks. 
<ul>
    <li>The average score for sb must be higher than or equal to bb.</li>
    <li>The average score must be well above -1 for bb. -1 would be average score if bb always folds.</li>
    <li>Folding preflop should always result in a q of 0. </li>
    <li>Higher hans must have higher qs</li>
    <li>If hand matches the board, hands must have highest scores</li>
</ul>

In [None]:
print("sb",sb_score/n_games)
print("bb",bb_score/n_games)
q

All sanity checks were successfull.

In [None]:
# Save q table
import pickle
with open('q_simple', 'wb') as q_simple_file:
    pickle.dump(q, q_simple_file)

<h3>Training a deep neural network to beat the Q-table</h3>
Now that we have a benchmark to test our neural network against, we can start creating the neural network and training it through self play. Let's first try to implement the actor-critic. The code is inspired by Phil Tabor. His own implementation for a different purpose can be found on his <a href = https://github.com/philtabor/Youtube-Code-Repository/tree/master/ReinforcementLearning/PolicyGradient/actor_critic/tensorflow2 >GitHub Page</a>.

<h4>Training directly against the Q-Table</h4>
To see if the algorithm can learn to beat the q-table, we first let it play against an opponent who imploys the q-table strategy. Later we will see if the algorithm can learn that through self-play.

In [2]:
from AgentActorCritic_SimpleGame import Agent

2022-05-02 13:21:50.078202: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-02 13:21:50.078233: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
agent = Agent(alpha_actor=2e-6, alpha_critic=1e-4)

2022-05-02 13:21:52.248941: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-02 13:21:52.248986: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-02 13:21:52.249018: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (cromi-Lenovo-V15-ADA): /proc/driver/nvidia/version does not exist
2022-05-02 13:21:52.249438: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
#first we need a function that chooses actions for the q-player based on the observation
import pickle
import numpy as np

with open('q_simple', 'rb') as q_simple_file:
    q = pickle.load(q_simple_file)
def choose_action_q(observation):
    
    hole = observation[4]
    #sb:
    if observation[0] == 0:
        if observation[3] == 0: #preflop
            if observation[-10] ==-1: #first in
                action = np.argmax(q['sb'][hole]['preflop']['preflop_fi'])

            else: # bb must have gone allin, if sb has another action to perform
                action = np.argmax(q['sb'][hole]['preflop']['preflop_to_push'])
                
        
        else: #postflop
            board = observation[6]
            if observation[2] != 0: #bb did not push
                action = np.argmax(q['sb'][hole]['postflop'][board]['to_check'])
                if action == 0:
                    action =1
                else:
                    action =2
            else:
                action = np.argmax(q['sb'][hole]['postflop'][board]['to_push'])
                            
    
    #bb 
    else:
        if observation[3] == 0: #preflop
            if observation[-10] == 0.5: #sb completed
                action = np.argmax(q['bb'][hole]['preflop']['to_call'])
                if action == 0:
                    action =1
                else:
                    action =2
            else: #sb pushed
                action = np.argmax(q['bb'][hole]['preflop']['to_push'])
        
        else:
            board = observation[6]
            if observation[2]!= 0: #bb is first in
                action = np.argmax(q['bb'][hole]['postflop'][board]['postflop_fi'])
                if action == 0:
                    action =1
                else:
                    action =2
            else: #bb checked and sb pushed
                action = np.argmax(q['bb'][hole]['postflop'][board]['postflop_to_push'])

    return action
                
                
            
                


Let's import the actor-critic agent. It consists of three neural networks - two actor networks and one critic network. The actor networks model the policy (fold, call, allin). There are two because preflop and postflop games differ quite substantially. Prior testing has shown that postflop actions get rarer and rarer and if there were only one actor network, it would be dominated by training preflop. The third network is the critic network. It models the value of a given state of the game. In trainig, the actor networks use gradient ascent to increase the porbability of using actions that lead to better outcomes than the value predicted by the critic. The critic is trained with replay memory and batch learning. There are different ways of adapting the training strategy. We will start with a simple one and later refine. The actor must learn slower than the critic, because its learning depends on decent value approximations by the critic.

In [5]:
agent.load_models()
game = PokerSimple(0,1)
n_games = 200_000
total_score = 0
score = 0
score_history = []
for i in range(n_games):
    if i%100 == 0:
        print("Episode", i)
    done, next_to_act, observation = game.reset()
    action_hero = -1
    while not game.done:
        next_to_act = game.next_to_act[0]
        
        if next_to_act == 0:
            if action_hero != -1:
                #hero performed an action. Create resulting state and learn.
                game.create_observation(0)
                observation_ = game.observations[0]
                
                agent.memory.store_transition(observation, action_hero, 0, observation_, False)
                agent.learn(observation, action_hero, 0, observation_, False)
                action_hero = -1
            else:
                #hero has already learned the last state action. Implement new action
                game.create_observation(0)
                observation = game.observations[0]
                action_hero = agent.choose_action(observation)
                game.implement_action(0, action_hero)
           
        else:
            game.create_observation(1)
            action = choose_action_q(game.observations[1])           
            game.implement_action(1, action)
            game.create_observation(1)
            
            

    game.create_observation(0)
    stack = game.observations[0][1]
    reward = stack -4.5 if game.observations[0][0] == 0 else stack - 4
    score = stack -5
    score_history.append(score)
    avg_score = np.mean(score_history[-1000:])
    agent.memory.store_transition(observation, action_hero,reward, game.observations[0], True)
    agent.learn(observation, action_hero,reward, game.observations[0], True)
    if i %100 == 0:
        print("episode", i, "total score", total_score, "avg score of last 1000 hands", avg_score)
        print(agent.print_strategy())
    total_score += score

actor loss tf.Tensor([0.01433827], shape=(1,), dtype=float32)
[0, 4.5, 4, 0, 3, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0]
[0, 4, 4, 1, 3, 2, 1, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1, 0]
preflop training
expected reward tf.Tensor(-0.29184836, shape=(), dtype=float32)
action 1
true next state value or reward tf.Tensor(-0.30629444, shape=(), dtype=float32) reward tf.Tensor(0.0, shape=(), dtype=float32) target tf.Tensor([[-0.30629444]], shape=(1, 1), dtype=float32)
actor loss tf.Tensor([-0.00015962], shape=(1,), dtype=float32)
[0, 4, 4, 1, 3, 2, 1, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1, 0]
[0, 4.0, 6.0, 1, 3, 2, 1, 0.5, 0, 0, 0, -1, -1, -1, -1, -1, -1, 0]
postflop training
expected reward tf.Tensor(-0.296166, shape=(), dtype=float32)
action 1
true next state value or reward tf.Tensor(-0.08935358, shape=(), dtype=float32) reward tf.Tensor(-0.5, shape=(), dtype=float32) target tf.Tensor([[-0.5]], shape=(1, 1), dtype=float32)
actor loss tf.Tensor([-0.01541565], shape=(1,), dtype=floa

2022-05-02 13:28:35.124691: W tensorflow/core/data/root_dataset.cc:200] Optimization loop failed: CANCELLED: Operation was cancelled


actor loss tf.Tensor([-0.00411766], shape=(1,), dtype=float32)
[0, 4.5, 4, 0, 5, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0]
[0, 6.0, 4, 0, 5, 6.0, 0, 4.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, 0]
preflop training
expected reward tf.Tensor(2.0562174, shape=(), dtype=float32)
action 2
true next state value or reward tf.Tensor(2.498096, shape=(), dtype=float32) reward tf.Tensor(1.5, shape=(), dtype=float32) target tf.Tensor([[1.5]], shape=(1, 1), dtype=float32)
actor loss tf.Tensor([-0.00390318], shape=(1,), dtype=float32)
[0, 4.5, 4, 0, 5, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0]
[0, 8.75, 1.25, 0, 5, 10.0, 0, 4.5, 4, -1, -1, -1, -1, -1, -1, -1, -1, 0]
preflop training
expected reward tf.Tensor(2.0435202, shape=(), dtype=float32)
action 2
true next state value or reward tf.Tensor(-8.396806, shape=(), dtype=float32) reward tf.Tensor(4.25, shape=(), dtype=float32) target tf.Tensor([[4.25]], shape=(1, 1), dtype=float32)
actor loss tf.Tensor([0.01548108], shape=(1,), dtype=flo

KeyboardInterrupt: 

In [16]:
states = agent.memory.state_memory
rewards = agent.memory.reward_memory
states_ = agent.memory.new_state_memory
actions = agent.memory.action_memory
dones = agent.memory.terminal_memory

In [None]:
states[states[:,2] == 0]

In [None]:
agent.actor_postflop(states[states[:,2] == 0])

In [30]:
self.position_0, self.stacks[0], self.stacks[1], self.street, self.hole_0, self.pot, self.board]

SyntaxError: unmatched ']' (979525032.py, line 1)

Some strategies appear incorrect. When we look at the q-table startegies, which are deterministic, we see that the incorrect strategies here are in situations that never happen, and therefore are not trained on. This should not be a problem in self-play learning, because exploitative policies will be detected by the facat that actions that exploit such startegies will be reinforced and will therefore appear in the learning data.

Let's check how well our trained model does against the q-table.

In [None]:
game = PokerSimple(0,1)
n_games = 100_000
total_score = 0
score = 0
situations_q = {'pre_sb_first_in':0, 'pre_sb_complete_to_push':0,'pre_bb_to_call': 0, 'pre_bb_to_push':0,
               'post_sb_to_check':0, 'post_sb_to_push':0, 'post_bb_first_in':0, 'post_bb_check_pushsb':0}

for i in range(1,n_games):
    if i%1000 == 0:
        print("Episode", i)
    done, next_to_act, observation = game.reset()
    action_hero = -1
    situations = []
    while not game.done:
        next_to_act = game.next_to_act[0]
        
        if next_to_act == 0:

            #hero has already learned the last state action. Implement new action
            game.create_observation(0)
            observation = game.observations[0]
            action_hero = agent.choose_action(observation)
            game.implement_action(0, action_hero)
           
        else:
            game.create_observation(1)
            if game.observations[1][0]==0: #sb
                if game.observations[1][3]== 0: #preflop
                    if game.observations[1][1] == 4.5: #sb first in
                        situation = 'pre_sb_first_in'
                    else: #sb completed and bb pushed
                        situation = 'pre_sb_complete_to_push'
                else:#postflop
                    if game.observations[1][2] == 0: #bb pushed
                        situation = 'post_sb_to_push'
                    else:
                        situation = 'post_sb_to_check'
            
            else: #bb
                if game.observations[1][3]== 0: #preflop
                    if game.observations[1][1] == 4:
                        situation = 'pre_bb_to_call'
                    else:
                        situation = 'pre_bb_to_push'
                else:#postflop
                    if game.observations[1][2] != 0:#bb first in
                        situation = 'post_bb_first_in'
                    else:
                        situation =  'post_bb_check_pushsb'
                
                    
                        
            action = choose_action_q(game.observations[1])        
            game.implement_action(1, action)
                                    
        situations.append(situation)

    game.create_observation(0)
    stack = game.observations[0][1]
    score = stack -5
    total_score += score
    for move in situations:
        situations_q[move] -= score

    if i %100 == 0:
        print("episode", i, "total score", total_score, "avg score", total_score/i)

In [51]:
situations_q

{'pre_sb_first_in': 110.5,
 'pre_sb_complete_to_push': -429.25,
 'pre_bb_to_call': -99.25,
 'pre_bb_to_push': 0,
 'post_sb_to_check': 457.0,
 'post_sb_to_push': 87.25,
 'post_bb_first_in': 273.5,
 'post_bb_check_pushsb': 19.0}

In [54]:
agent.print_strategy()

--------------------------------------
preflop

sb
 first in

holecard 1 value -0.29984355 probs [0.007, 0.992, 0.0]
holecard 2 value -0.059440576 probs [0.003, 0.997, 0.0]
holecard 3 value 0.2521708 probs [0.001, 0.995, 0.004]
holecard 4 value 0.6650563 probs [0.001, 0.897, 0.101]
holecard 5 value 2.1936593 probs [0.003, 0.014, 0.983]
preflop

sb
 complete - push

holecard 1 value -4.5937266 probs [0.019, 0.954, 0.027]
holecard 2 value -4.4570303 probs [0.014, 0.752, 0.233]
holecard 3 value -4.3541703 probs [0.022, 0.077, 0.901]
holecard 4 value -4.3494754 probs [0.014, 0.011, 0.975]
holecard 5 value -4.340943 probs [0.009, 0.003, 0.989]

bb
  to complete

holecard 1 value 0.23083474 probs [0.01, 0.99, 0.0]
holecard 2 value 0.43086874 probs [0.002, 0.997, 0.002]
holecard 3 value 0.6853593 probs [0.001, 0.973, 0.026]
holecard 4 value 1.3139273 probs [0.007, 0.185, 0.808]
holecard 5 value 2.874401 probs [0.003, 0.003, 0.994]

bb
  to push

holecard 1 value 0.79313225 probs [0.007, 0.971

In [43]:
def print_strategy_q():
    print("--------------------------------------")
    print("preflop\n\nsb\n first in\n")

    obs=[0, 4.5, 4, 0, 3, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    for i in range(1,6):
        obs[4] =i
        action = choose_action_q(obs)
        qs = q['sb'][i]['preflop']['preflop_fi']
        print("holecard", i, "qs", qs, "action", action)

    print("preflop\n\nsb\n complete - push\n")

    obs=[0, 4, 0, 0, 3, 6, 0, 0.5, 4, -1, -1, -1, -1, -1, -1, -1, -1]
    for i in range(1,6):
        obs[4] =i
        action = choose_action_q(obs)
        qs = q['sb'][i]['preflop']['preflop_to_push']
        print("holecard", i, "qs", qs, "action", action)

    print("\nbb\n  to complete\n")
    obs=[1, 4, 4, 0, 3, 2, 0, 0.5, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    obs.append(0) if obs[4] != obs[6] else obs.append(1)
    for i in range(1,6):
        obs[4] =i
        action = choose_action_q(obs)
        qs = q['bb'][i]['preflop']['to_call']
        print("holecard", i, "qs", qs, "action", action)

    print("\nbb\n  to push\n")
    obs=[1, 4, 0, 0, 3, 6, 0, 4, -1, -1, -1, -1, -1, -1, -1, -1, -1]
    obs.append(0) if obs[4] != obs[6] else obs.append(1)
    for i in range(1,6):
        obs[4] =i
        action = choose_action_q(obs)
        qs = q['bb'][i]['preflop']['to_push']
        print("holecard", i, "qs", qs, "action", action)


    print("--------------------------------------")
    print("postflop\n\nsb\n  to check\n")
    for board in range(1,6):#all boards
        print("board", board)
        for hole in range(1,6):
            obs=[0, 4, 4, 1, hole, 2, board, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1]
            action = choose_action_q(obs)
            qs = q['sb'][hole]['postflop'][board]['to_check']
            print("holecard", i, "qs", qs, "action", action)

    print("\n  to push\n")
    for board in range(1,6):#all boards
        print("board", board)
        for hole in range(1,6):
            obs=[0, 4, 4, 1, hole, 2, board, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1]
            action = choose_action_q(obs)
            qs = q['sb'][hole]['postflop'][board]['to_push']
            print("holecard", i, "qs", qs, "action", action)

    print("--------------------------------------")
    print("postflop\n\nbb\n  first in\n")
    for board in range(1,6):#all boards
        print("board", board)
        for hole in range(1,6):
            obs=[0, 4, 4, 1, hole, 2, board, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1]
            action = choose_action_q(obs)
            qs = q['bb'][hole]['postflop'][board]['postflop_fi']
            print("holecard", i, "qs", qs, "action", action)

    print("\n  to push after check\n")
    for board in range(1,6):#all boards
        print("board", board)
        for hole in range(1,6):
            obs=[0, 4, 4, 1, hole, 2, board, 0.5, 0, 0, -1, -1, -1, -1, -1, -1, -1]
            action = choose_action_q(obs)
            qs = q['bb'][hole]['postflop'][board]['postflop_to_push']
            print("holecard", i, "qs", qs, "action", action)

In [None]:
print_strategy_q()

In [None]:
#agent.load_models()
#agent.critic.load_weights(agent.checkpoint_file_critic)
import random
game = PokerSimple(0,1)
n_games = 2_000_000
eps = 0.5
for i in range(n_games):
    if i %100 == 0:
        loss = 0
        loss_count = 0
        print("Episode", i)
        agent.print_strategy()
    done, next_to_act, observation = game.reset()
    score = 0
    actions = {}
    observations={}
    observations_= {}
    #we can reduce the learning-time by 50% because we have two players. We always need pairs of start and terminal states. Once we have those we can perform a learning step
    # We need to play the game until the terminal state is known.
    while not game.done:
        next_to_act = game.next_to_act[0]
        #check if there is already an action for the player, so we do a learning step before taking a new action.
        if next_to_act in actions.keys():
            game.create_observation(next_to_act)
            agent.memory.store_transition(observations[next_to_act], actions[next_to_act],rewards[next_to_act], game.observations[next_to_act], False)
            agent.learn(observations[next_to_act], actions[next_to_act],rewards[next_to_act], game.observations[next_to_act], False)
            del(observations[next_to_act], actions[next_to_act])
        game.create_observation(next_to_act)
        observations[next_to_act]=game.observations[next_to_act]
        if random.random() > eps and game.street == 0:
            action = agent.choose_action(observations[next_to_act])
        else:
            action = random.randint(0,2)
        actions[next_to_act]=action
        
        game.implement_action(next_to_act,action) # after action is implemented, we wait for the action to be on the same player. He then has the terminal state of his action.
        rewards = simulate(game, 10)

        if not game.done: #game can be done, if action was fold or the game ended because it was the last action.

            if next_to_act == game.next_to_act[0]: #if it's the same players turn, the terminal state of his action has been achieved. We can do a training step.,
                game.create_observation(next_to_act)
                observations_[next_to_act] = game.observations[next_to_act]
                agent.memory.store_transition(observations[next_to_act],actions[next_to_act],rewards[next_to_act], observations_[next_to_act],False)
                agent.learn(observations[next_to_act],actions[next_to_act],rewards[next_to_act], observations_[next_to_act],False)
                del(observations[next_to_act], observations_[next_to_act], actions[next_to_act]) #clear the states for the learning player
            else:#if it's the other players turn, he must implement his action, so we receive the terminal state of the first player.
                next_to_act = game.next_to_act[0]
                if next_to_act in actions.keys():
                    game.create_observation(next_to_act)
                    agent.memory.store_transition(observations[next_to_act], actions[next_to_act],rewards[next_to_act], game.observations[next_to_act], False)
                    agent.learn(observations[next_to_act], actions[next_to_act],rewards[next_to_act], game.observations[next_to_act], False)
                    del(observations[next_to_act], actions[next_to_act])
                game.create_observation(next_to_act)
                observations[next_to_act] = game.observations[next_to_act]
                if random.random() > eps and game.street == 0:
                    action = agent.choose_action(observations[next_to_act])
                else:
                    action = random.randint(0,2)
                actions[next_to_act]=action
                game.implement_action(next_to_act,action)
                rewards = simulate(game, 10)
                if next_to_act == 0:
                    agent_learn = 1 #the other player learns
                else:
                    agent_learn = 0

                if not game.done and game.next_to_act[0] == agent_learn:
                    game.create_observation(agent_learn) #create the current state - which is the terminal state of the initial action in the while loop
                    observations_[agent_learn] = game.observations[agent_learn]
                    agent.memory.store_transition(observations[agent_learn], actions[agent_learn],rewards[agent_learn], observations_[agent_learn], False)
                    agent.learn(observations[agent_learn], actions[agent_learn],rewards[agent_learn], observations_[agent_learn], False)
                    del(observations[agent_learn], observations_[agent_learn], actions[agent_learn])






    #after the game is done, we learn on the rewards. The starting state is in the dictonary if learning has not occured yet.
    #the terminal states are the current observations
    if 0 in actions.keys():
        game.create_observation(0)
        reward_0 = game.stacks[0]- 4.5 if game.position_0 ==0 else game.stacks[0]-4 #blinds are payed regardless of the decision. So fold recieves a reward of 0
        agent.memory.store_transition(observations[0], actions[0],rewards[0], game.observations[0], True)
        loss+=  agent.learn(observations[0], actions[0],rewards[0], game.observations[0], True)
        loss_count+= 1
    if 1 in actions.keys():
        game.create_observation(1)
        reward_1 = game.stacks[1]- 4.5 if game.position_1 ==0 else game.stacks[1]-4 #blinds are payed regardless of the decision. So fold recieves a reward of 0
        agent.memory.store_transition(observations[1], actions[1],rewards[1], game.observations[1], True)
        loss+=  agent.learn(observations[1], actions[1],rewards[1], game.observations[1], True)
        loss_count+= 1
    if i%100==0:
        print("loss", loss/loss_count)



#agent.save_models()

In [8]:
import copy
def simulate(current_game, n_simulations): #use after implementing action
    
    total_reward_0, total_reward_1 = 0, 0
    
    for k in range(n_simulations):
        game = copy.deepcopy(current_game)
        while not game.done:
            next_to_act = game.next_to_act[0]
            #check if there is already an action for the player, so we do a learning step before taking a new action.
            game.create_observation(next_to_act)
            observations[next_to_act]=game.observations[next_to_act]
            action = agent.choose_action(observations[next_to_act])
            actions[next_to_act]=action
            game.implement_action(next_to_act,action) # after action is implemented, we wait for the action to be on the same player. He then has the terminal state of his action.


            if not game.done: #game can be done, if action was fold or the game ended because it was the last action.


                next_to_act = game.next_to_act[0]
                game.create_observation(next_to_act)
                observations[next_to_act] = game.observations[next_to_act]
                action = agent.choose_action(observations[next_to_act])
                game.implement_action(next_to_act,action)

        total_reward_0 += game.stacks[0]- 4.5 if game.position_0 ==0 else game.stacks[0]-4
        total_reward_1 += game.stacks[1]- 4.5 if game.position_1 ==0 else game.stacks[1]-4

    
    avg_reward_0 = total_reward_0 / n_simulations
    avg_reward_1 = total_reward_1 / n_simulations

    return avg_reward_0, avg_reward_1

In [80]:
game_new=copy.deepcopy(game)
total_reward_0, total_reward_1 = 0, 0 

In [82]:
next_to_act = game_new.next_to_act[0]
next_to_act

1

In [84]:
game_new.create_observation(next_to_act)
observations[next_to_act]=game_new.observations[next_to_act]
observations

{0: [0, 4.5, 4, 0, 3, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
 1: [1, 4, 0, 0, 4, 6.0, 0, 4.5, -1, -1, -1, -1, -1, -1, -1, -1, -1]}

In [92]:
action = agent.choose_action(observations[next_to_act])

In [93]:
action

0

In [87]:
agent.print_strategy()

--------------------------------------
preflop

sb

holecard 1 value -0.12683435 probs [0.539, 0.284, 0.178]
holecard 2 value -0.22990859 probs [0.558, 0.274, 0.168]
holecard 3 value -0.3504145 probs [0.579, 0.265, 0.156]
holecard 4 value -0.46691316 probs [0.599, 0.255, 0.146]
holecard 5 value -0.57830334 probs [0.617, 0.247, 0.136]

bb
  to complete

holecard 1 value -0.13642003 probs [0.541, 0.286, 0.173]
holecard 2 value -0.257392 probs [0.562, 0.276, 0.162]
holecard 3 value -0.3720688 probs [0.585, 0.265, 0.15]
holecard 4 value -0.49449757 probs [0.605, 0.255, 0.14]
holecard 5 value -0.6093389 probs [0.622, 0.246, 0.132]

bb
  to push

holecard 1 value 0.097202 probs [0.552, 0.288, 0.16]
holecard 2 value 0.07157244 probs [0.557, 0.284, 0.158]
holecard 3 value -0.013402835 probs [0.562, 0.281, 0.157]
holecard 4 value -0.183801 probs [0.571, 0.275, 0.154]
holecard 5 value -0.28443256 probs [0.584, 0.267, 0.149]
--------------------------------------
postflop

sb
  to check

board 1
