<h1 style="color:darkred">Training a Neural Network to Play Heads-Up No-Limit Texas Hold'em with Probabilistic Decisions</h1>
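In this notebook we train an agent through self-play reinforcement learning to play heads-up (1 vs. 1) poker. Poker is a game of imperfect information with chance elements; a deterministic policy is predictable in such a game and can therefore be exploited, so the agent has to make probabilistic decisions.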
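
To make the difference between a deterministic and a probabilistic decision concrete, here is a minimal sketch. The probability vector `policy` and both helper names are hypothetical and only serve as an illustration; they are not part of the notebook's model.

```python
import random


def greedy_action(probs):
    """Deterministic choice: always pick the most probable action."""
    return max(range(len(probs)), key=lambda a: probs[a])


def sample_action(probs, rng=random):
    """Probabilistic choice: draw an action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical policy output for the actions fold / call / all-in.
policy = [0.2, 0.5, 0.3]
rng = random.Random(0)

counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_action(policy, rng)] += 1

print(greedy_action(policy))          # the same action every time
print([c / 10_000 for c in counts])   # frequencies close to the policy
```

The greedy player reveals its hand strength through its action, while the sampling player mixes its actions and is harder to read.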

<h2>Simple Test Game </h2>
Before we train a model to play poker, we need a very simple test game that has similar properties but is small enough that the training algorithm can be verified in a reasonable training time. There is no point in training a model on poker before we know that the algorithm can, in principle, solve similar games. The test game has to meet these criteria:
<ul>
    <li>Two players </li>
    <li>Imperfect Information </li>
    <li>Round based. At least two decisions have to be made per round.</li>
    <li>Solution known. We need to be able to compute the Nash equilibrium for the game, or the game has to have been solved already,<br>
    so that we can verify whether our model learns the same strategy as the Nash equilibrium.</li>
</ul>
We are going to invent our own very simple no-limit hold'em poker game. The deck contains only the ranks T to A, each of them twice, so there are just 10 cards in total. Each player receives one hole card, and there is one community card. There are two betting rounds. Preflop, a player can fold, call (or check) or go all-in; postflop, the same decisions are available. Both players start with 5 BB. The winning hand is a pair or, if nobody has a pair, the higher card. There are no suits.<br>
The number of possible action sequences is very limited, so we can use simple Q-learning to learn the best strategy. The complete list of sequences is:
<ol>
    <li>sb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb checks</li>
    <li>sb calls, bb checks, (flop), bb checks, sb allin, bb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb allin, bb calls </li>
    <li>sb calls, bb checks, (flop), bb allin, sb folds </li>
    <li>sb calls, bb checks, (flop), bb allin, sb calls </li>
    <li>sb calls, bb allin,  sb folds </li>
    <li>sb calls, bb allin, sb calls </li>
    <li>sb allin, bb folds </li>
    <li>sb allin, bb calls </li>
</ol>
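
The showdown rule described above (a pair with the community card beats any high card, otherwise the higher hole card wins) can be sketched as follows. This is an illustration, not the implementation in game.PokerSimple. It assumes that ranks are encoded as the integers 1 (lowest) to 5 (highest), as in the later cells of this notebook; the function name `showdown` is hypothetical.

```python
def showdown(hole_0, hole_1, board):
    """Return 1 if player 0 wins, -1 if player 1 wins and 0 for a split pot.

    Assumption: ranks are integers from 1 (lowest) to 5 (highest).
    """
    pair_0 = hole_0 == board
    pair_1 = hole_1 == board
    if pair_0 != pair_1:
        # Exactly one player paired the community card and wins.
        return 1 if pair_0 else -1
    if hole_0 == hole_1:
        # Same rank and nobody (or both) paired: split pot.
        return 0
    # Nobody paired: the higher hole card wins.
    return 1 if hole_0 > hole_1 else -1


print(showdown(1, 5, 1))  # lowest card pairs the board and wins
print(showdown(5, 1, 3))  # no pair, higher card wins
print(showdown(3, 3, 2))  # same rank, split pot
```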

The game is implemented in the module game.PokerSimple.


In [4]:
from game.PokerSimple import PokerSimple

<h3>Testing the game</h3>

The file 'Test_PokerSimple.py' contains unit tests for the game. In addition, let's do some sanity checks. We let both agents make random moves and record how much each hand wins or loses in total. We also check that both players end up with similar statistics.

In [None]:
import random

hero = 0
villain = 1
game = PokerSimple(hero, villain)

n_games = 1000000
# Accumulated stack change per hole card (1-5), one dict per player.
results = [{hand: 0 for hand in range(1, 6)} for _ in range(2)]

for i in range(n_games):
    if i % 10000 == 0:
        print(i)
    done, next_to_act, observation = game.reset()
    # Both players pick one of the three actions uniformly at random.
    while not game.done:
        action = random.randint(0, 2)
        game.implement_action(game.next_to_act[0], action)
    game.create_observation(0)
    game.create_observation(1)
    # Observation index 1 is the final stack; both players started with 5 BB.
    stack_change_0 = game.observations[0][1] - 5
    stack_change_1 = game.observations[1][1] - 5
    # Observation index 4 is the player's hole card.
    hand_0 = game.observations[0][4]
    hand_1 = game.observations[1][4]

    results[0][hand_0] += stack_change_0
    results[1][hand_1] += stack_change_1

In [10]:
results

[{1: -265279.25, 2: -131995.0, 3: -420.25, 4: 133693.25, 5: 266358.25},
 {1: -267017.0, 2: -132978.75, 3: 217.5, 4: 131922.75, 5: 265498.5}]

The sanity check succeeded. The lower hands lose, the higher hands win, and both players have nearly identical statistics.
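
The two properties can also be checked programmatically. The sketch below uses the totals printed above; the helper name `check_results` and the tolerance of 0.01 BB per game are my own assumptions.

```python
n_games = 1_000_000
# Totals copied from the output above (player 0, player 1).
results = [
    {1: -265279.25, 2: -131995.0, 3: -420.25, 4: 133693.25, 5: 266358.25},
    {1: -267017.0, 2: -132978.75, 3: 217.5, 4: 131922.75, 5: 265498.5},
]


def check_results(results, n_games):
    """Return (hands ordered correctly?, largest gap between the two players)."""
    # Total winnings of each hand divided by the number of games played.
    avg = [{hand: total / n_games for hand, total in player.items()}
           for player in results]
    # Higher hole cards should win more than lower ones for both players.
    monotone = all(a[hand] < a[hand + 1] for a in avg for hand in range(1, 5))
    # Largest difference between the two players for the same hand.
    max_gap = max(abs(avg[0][hand] - avg[1][hand]) for hand in range(1, 6))
    return monotone, max_gap


monotone, max_gap = check_results(results, n_games)
print(monotone, max_gap < 0.01)
```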

<h3> Learning Optimal Game </h3>
Below is an attempt to solve for the Nash equilibrium of the game. Because of the complexity of the calculation, I have not succeeded in this so far. A paper presenting a potential solution can be found <a href="http://proceedings.mlr.press/v119/munos20a/munos20a.pdf">here</a>.

In [2]:
# Uniform starting strategy for the small blind, indexed by hole card (1-5).
# 'preflop_fi' (first in) has three actions; all other decisions are binary.
probs_sb = {
    hole: {
        'preflop': {'preflop_fi': [0.33, 0.33, 0.34], 'preflop_to_push': [0.5, 0.5]},
        'postflop': {
            board: {'to_check': [0.5, 0.5], 'to_push': [0.5, 0.5]}
            for board in range(1, 6)
        },
    }
    for hole in range(1, 6)
}

In [69]:
# Uniform starting strategy for the big blind, indexed by hole card (1-5).
probs_bb = {
    hole: {
        'preflop': {'to_call': [0.5, 0.5], 'to_push': [0.5, 0.5]},
        'postflop': {
            board: {'postflop_fi': [0.5, 0.5], 'postflop_to_push': [0.5, 0.5]}
            for board in range(1, 6)
        },
    }
    for hole in range(1, 6)
}


In [15]:
# Dictionary storing, for each pair of hole cards, the proportions of boards on which player 0 wins, loses or splits at showdown
pot_props_pre = {key: {key: 0 for key in range(1,6)} for key in range(1,6)}

deck = list(range(1,6))
deck.extend(deck)
for hole_0 in range(1,6):
    deck_rest = deck[:]
    deck_rest.remove(hole_0)
    for hole_1 in range (1,6):
        deck_rest_ = deck_rest[:]
        deck_rest_.remove(hole_1)
        win_0 = 0
        win_1 = 0
        split = 0
        for card in deck_rest_:
            if (hole_0 == card and hole_1 != card) or (hole_0 > hole_1 and hole_1 != card):
                win_0 += 1
            elif (hole_1 == card and hole_0 != card) or (hole_1 > hole_0 and hole_0 != card):
                win_1 += 1
            else:
                split +=1
        proportion = (win_0/8, win_1/8, split/8)
        pot_props_pre[hole_0][hole_1] = proportion 
        
# Lookup table for postflop: -1 means player 0 loses, 1 means player 0 wins and 0 means split
lookup_post = {key: {key: {key: 0 for key in range(1,6)} for key in range(1,6)} for key in range(1,6)}
for hole_0 in range(1,6):
    deck_rest = deck[:]
    deck_rest.remove(hole_0)
    for hole_1 in range (1,6):
        deck_rest_ = deck_rest[:]
        deck_rest_.remove(hole_1)
        
        for card in deck_rest_:
            if (hole_0 == card and hole_1 != card) or (hole_0 > hole_1 and hole_1 != card):
                lookup_post[hole_0][hole_1][card] = 1
            elif (hole_1 == card and hole_0 != card) or (hole_1 > hole_0 and hole_0 != card):
                lookup_post[hole_0][hole_1][card] = -1
            else:
                lookup_post[hole_0][hole_1][card] = 0
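
As a cross-check for the preflop table, the same proportions can be obtained by brute force: enumerate every ordered deal of three distinct cards from the 10-card deck and count the showdown results. This is a self-contained sketch; the helper names `showdown` and `preflop_equity` are hypothetical and not part of the notebook's code.

```python
from collections import Counter
from itertools import permutations

# Two copies of each rank 1-5.
deck = [rank for rank in range(1, 6) for _ in range(2)]


def showdown(hole_0, hole_1, board):
    """1 if player 0 wins, -1 if player 1 wins, 0 for a split."""
    pair_0, pair_1 = hole_0 == board, hole_1 == board
    if pair_0 != pair_1:
        return 1 if pair_0 else -1
    if hole_0 == hole_1:
        return 0
    return 1 if hole_0 > hole_1 else -1


def preflop_equity():
    """Map (hole_0, hole_1) to the proportions (win, lose, split) for player 0."""
    counts = {}
    # Each permutation is one deal: card for player 0, player 1, and the board.
    for c0, c1, cb in permutations(range(len(deck)), 3):
        key = (deck[c0], deck[c1])
        outcome = showdown(deck[c0], deck[c1], deck[cb])
        counts.setdefault(key, Counter())[outcome] += 1
    return {
        key: tuple(c[outcome] / sum(c.values()) for outcome in (1, -1, 0))
        for key, c in counts.items()
    }


equity = preflop_equity()
print(equity[(5, 1)])  # highest card against lowest card
print(equity[(3, 3)])  # identical hole cards
```

If the table built above is correct, `equity[(a, b)]` should agree with `pot_props_pre[a][b]` for every pair of hole cards.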


In [160]:
import numpy as np
def cal_ev_sb_hole(probs_sb, probs_bb, holecard):
    #postflop utils must be calculated for each flop. We need the probability of villain having a particular hand.
    util_1111 =  []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]
            util += p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_1111.append(util/probs_total)
    
    util_11120 = 1
    util_11121 = []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_to_push'][1]
            util += 5*p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_11121.append(util/probs_total)
    
    util_1120 = -1
    util_1121 = []
    for board in range(1,6):
        util = 0
        probs_total = 0
        for hole_vil in range(1,6):
            if holecard == board and holecard == hole_vil:
                p_deal_holevil = 0
            elif hole_vil == board or hole_vil ==holecard:
                p_deal_holevil = 1/8
            else:
                p_deal_holevil = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][1]
            util += 5*p_deal_holevil*pvil_playhand*lookup_post[holecard][hole_vil][board]
            probs_total += p_deal_holevil*pvil_playhand
        util_1121.append(util/probs_total)
    
    util_120 = -1
    util_121 = 0
    probs_total = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][1]
        probs_total += pvil_playhand*p_deal_holevil
        util_121 += 5*p_deal_holevil*pvil_playhand*pot_props_pre[holecard][hole_vil][0]-5*p_deal_holevil*pvil_playhand*pot_props_pre[holecard][hole_vil][1]
    util_121 /= probs_total
    
    util_20 = 1
    util_21 = 0
    probs_total = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_push'][1]
        probs_total += pvil_playhand*p_deal_holevil
        util_21 += 5*p_deal_holevil*pvil_playhand*pot_props_pre[holecard][hole_vil][0]-5*p_deal_holevil*pvil_playhand*pot_props_pre[holecard][hole_vil][1]
    util_21 /= probs_total
    p0 = probs_sb[holecard]['preflop']['preflop_fi'][0]
    p1 = probs_sb[holecard]['preflop']['preflop_fi'][1]
    p2 = probs_sb[holecard]['preflop']['preflop_fi'][2]
    p11 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]
        p11 += pvil_playhand* p_deal_holevil
        
        
    p12 = 0    
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][1]  # bb pushes after sb calls
        p12 += pvil_playhand* p_deal_holevil
    
        
    p111 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        for board in range(1,6):
            if board == holecard and board == hole_vil:
                p_board = 0
            elif board ==holecard or board == hole_vil:
                p_board = 1/8
            else:
                p_board = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][0]
            p111 += pvil_playhand*p_board*p_deal_holevil
    p111 /= p11
            
    p112 = 0
    for hole_vil in range(1,6):
        if holecard == hole_vil:
            p_deal_holevil = 1/9
        else:
            p_deal_holevil = 2/9
        for board in range(1,6):
            if board == holecard and board == hole_vil:
                p_board = 0
            elif board ==holecard or board == hole_vil:
                p_board = 1/8
            else:
                p_board = 2/8
            pvil_playhand = probs_bb[hole_vil]['preflop']['to_call'][0]*probs_bb[hole_vil]['postflop'][board]['postflop_fi'][1]
            p112 += pvil_playhand*p_board*p_deal_holevil
    p112 /= p11
    
    #ev = p0*-0.5+ p1['preflop_fi'][1]*util_1 + probs_sb[holecard]['preflop']['preflop_fi'][2]*util_2

<h3>Q-Learning Simple Game </h3>
Through Q-learning, we are going to learn the best deterministic strategy for the simple game. This will serve as a benchmark for the actual deep-learning algorithm. Because Q-learning solves for a deterministic solution, a probabilistic solution has to beat it or, if the deterministic solution is already optimal, at least break even. Q-learning requires an entry for every possible state-action pair.<br>
We therefore first need a Q-table: a table of all possible states and actions.

In [120]:
# Q-table, indexed by position, hole card (1-5), street, board card (postflop only) and situation.
# Each leaf is a list with one Q-value per available action, initialised to 0.
q = {
    'sb': {
        hole: {
            'preflop': {'preflop_fi': [0, 0, 0], 'preflop_to_push': [0, 0]},
            'postflop': {
                board: {'to_check': [0, 0], 'to_push': [0, 0]}
                for board in range(1, 6)
            },
        }
        for hole in range(1, 6)
    },
    'bb': {
        hole: {
            'preflop': {'to_call': [0, 0], 'to_push': [0, 0]},
            'postflop': {
                board: {'postflop_fi': [0, 0], 'postflop_to_push': [0, 0]}
                for board in range(1, 6)
            },
        }
        for hole in range(1, 6)
    },
}

In [None]:
import numpy as np
import random
epsilon = 0.2
lr = 0.01
hero = 0
villain = 1
sb_score = 0
bb_score = 0
game = PokerSimple(hero, villain)
#record state - action pair
n_games = 1000000
for i in range(n_games):
    if i %10000 == 0:
        print("episode",i)
    last_state_action = {0:{'state':None, 'action': None}, 1:{'state':None, 'action': None}}

    done, next_to_act, observation = game.reset()
    hands = [game.hole_0,game.hole_1]

    sb = next_to_act
    bb = 0 if sb==1 else 1
    if random.uniform(0,1) < epsilon:
        action = random.randint(0,2)

    else:
        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['preflop']['preflop_fi'])

    last_state_action[next_to_act]['state'] = ['sb',hands[next_to_act], 'preflop', 0, 'preflop_fi'] # this information is all that is needed to locate the entry in the q-table
    last_state_action[next_to_act]['action'] = action

    game.implement_action(next_to_act,action)
    done = game.done

    while not done:
        if random.uniform(0,1) < epsilon:
            explore = True
            action = random.randint(0,1)
        else:
            explore = False

        hero = game.next_to_act[0]
        game.create_observation(hero)

        if game.next_to_act[0] == sb:
            if game.street == 0: #if game is preflop and player is sb, then it must be that bb pushed, otherwise there is no more action for sb              
                if not explore:
                    action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['preflop']['preflop_to_push'])
                last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'preflop', 0, 'preflop_to_push']
                last_state_action[game.next_to_act[0]]['action'] = action

            else:
                if game.stacks[bb] != 0: #bb checked
                    if not explore:
                        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['postflop'][game.board]['to_check'])
                    last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'postflop', game.board, 'to_check']
                    last_state_action[game.next_to_act[0]]['action'] = action

                    if action == 0:
                        action =1

                    else:
                        action = 2


                else: ##bb pushed
                    if not explore:
                        action = np.argmax(q['sb'][hands[game.next_to_act[0]]]['postflop'][game.board]['to_push'])

                    last_state_action[game.next_to_act[0]]['state'] = ['sb',hands[game.next_to_act[0]], 'postflop', game.board, 'to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action               
        else:
            if game.street == 0:
                if game.stacks[sb] != 0: #sb called                
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['preflop']['to_call'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'preflop', 0, 'to_call']
                    last_state_action[game.next_to_act[0]]['action'] = action 

                    if action == 0:
                        action =1

                    else:
                        action = 2

                else:
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['preflop']['to_push'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'preflop', 0, 'to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action 

            else:
                game.create_observation(game.next_to_act[0])
                if game.observations[game.next_to_act[0]][9] == -1: #postflop first in can only happen if preflop went call check           
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['postflop'][game.board]['postflop_fi'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'postflop', game.board, 'postflop_fi']
                    last_state_action[game.next_to_act[0]]['action'] = action 

                    if action == 0:
                        action = 1

                    else:
                        action = 2

                else: #if there is any action left it must be that smallblind pushed
                    if not explore:
                        action = np.argmax(q['bb'][hands[game.next_to_act[0]]]['postflop'][game.board]['postflop_to_push'])
                    last_state_action[game.next_to_act[0]]['state'] = ['bb',hands[game.next_to_act[0]], 'postflop', game.board, 'postflop_to_push']
                    last_state_action[game.next_to_act[0]]['action'] = action


        game.implement_action(game.next_to_act[0],action)
        done = game.done

        if not done:
            position = last_state_action[game.next_to_act[0]]['state'][0]
            hand = last_state_action[game.next_to_act[0]]['state'][1]
            street = last_state_action[game.next_to_act[0]]['state'][2]
            board = last_state_action[game.next_to_act[0]]['state'][3]
            situation = last_state_action[game.next_to_act[0]]['state'][4]
            chosen_action = last_state_action[game.next_to_act[0]]['action']
            if board:
                q_current = q[position][hand][street][board][situation][chosen_action]
                q[position][hand][street][board][situation][chosen_action]+= lr * (np.max(q[position][hand][street][board][situation]) - q_current)

            else:

                q_current = q[position][hand][street][situation][chosen_action]
                q[position][hand][street][situation][chosen_action]+= lr * (np.max(q[position][hand][street][situation]) - q_current)

    if sb == 0:
        reward_0 = game.stacks[0]-4.5 
        reward_1 = game.stacks[1]-4
        sb_score += reward_0 -0.5
        bb_score += reward_1 - 1
    else:
        reward_1 = game.stacks[1]-4.5 
        reward_0 = game.stacks[0]-4
        sb_score += reward_1 -0.5
        bb_score += reward_0 -1
    chosen_action_0 = last_state_action[0]['action']

    if chosen_action_0 is not None:

        position_0 = last_state_action[0]['state'][0]
        hand_0 = last_state_action[0]['state'][1]
        street_0 = last_state_action[0]['state'][2]
        board_0 = last_state_action[0]['state'][3]
        situation_0 = last_state_action[0]['state'][4]

        if board_0:
            q_current = q[position_0][hand_0][street_0][board_0][situation_0][chosen_action_0]
            q[position_0][hand_0][street_0][board_0][situation_0][chosen_action_0]+= lr * (reward_0 - q_current)
        else:
            q_current = q[position_0][hand_0][street_0][situation_0][chosen_action_0]
            q[position_0][hand_0][street_0][situation_0][chosen_action_0]+= lr * (reward_0 - q_current)
        
        
    chosen_action_1 = last_state_action[1]['action']
    if chosen_action_1 is not None:
        position_1 = last_state_action[1]['state'][0]
        hand_1 = last_state_action[1]['state'][1]
        street_1 = last_state_action[1]['state'][2]
        board_1 = last_state_action[1]['state'][3]
        situation_1 = last_state_action[1]['state'][4]

        if board_1:
            q_current = q[position_1][hand_1][street_1][board_1][situation_1][chosen_action_1]
            q[position_1][hand_1][street_1][board_1][situation_1][chosen_action_1]+= lr * (reward_1 - q_current)
        else:
            q_current = q[position_1][hand_1][street_1][situation_1][chosen_action_1]
            q[position_1][hand_1][street_1][situation_1][chosen_action_1]+= lr * (reward_1 - q_current)
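
The loop above combines two ingredients that are easier to see in isolation: epsilon-greedy action selection and an incremental update that moves a Q-value a small step towards a target. The following is a minimal sketch with hypothetical helper names (`epsilon_greedy`, `update_towards`) and a made-up constant target; it is not a replacement for the training loop.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore randomly, otherwise pick the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def update_towards(q_values, action, target, lr):
    """Move the Q-value of the chosen action a fraction lr towards the target."""
    q_values[action] += lr * (target - q_values[action])


# Repeatedly observing the same (hypothetical) reward of 1.5 for action 1
# makes its Q-value converge to 1.5, while action 0 stays untouched.
q_values = [0.0, 0.0]
for _ in range(2000):
    update_towards(q_values, action=1, target=1.5, lr=0.01)

print(q_values)
print(epsilon_greedy(q_values, epsilon=0.0))  # no exploration: greedy action
```

With a small learning rate such as 0.01, each Q-value becomes an exponentially weighted average of the targets seen for that state-action pair.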

Let's check the average score for sb and bb per hand and the q-table. We can do a few sanity checks. 
<ul>
    <li>The average score for sb must be higher than or equal to that of bb.</li>
    <li>The average score for bb must be well above -1, which would be the average score if bb always folded.</li>
    <li>Folding preflop should always result in a q of 0.</li>
    <li>Higher hands must have higher q-values.</li>
    <li>If the hand matches the board, it must have the highest q-values.</li>
</ul>

In [125]:
print("sb",sb_score/n_games)
print("bb",bb_score/n_games)
q

sb 0.01673825
bb -0.01673825


{'sb': {1: {'preflop': {'preflop_fi': [0.0,
     0.07433285846717742,
     -1.28261613598956],
    'preflop_to_push': [-0.4999999999999972, -2.8017342096198115]},
   'postflop': {1: {'to_check': [1.4999999996713171, 2.585622015331433],
     'to_push': [-0.4440960587882114, 5.49999999404266]},
    2: {'to_check': [-0.35402523555916954, -0.0408504120698869],
     'to_push': [-0.4999999999999972, -4.008184759955431]},
    3: {'to_check': [-0.3663365401769724, -0.37921351763447475],
     'to_push': [-0.4999999999999972, -3.9588870390636575]},
    4: {'to_check': [-0.3678676580248863, -0.6971811046538243],
     'to_push': [-0.4999999999999972, -3.9676150942308452]},
    5: {'to_check': [-0.41225439333458086, -0.6883320653538317],
     'to_push': [-0.4999999999999972, -4.1418922399568245]}}},
  2: {'preflop': {'preflop_fi': [0.0, 0.08095881565905509, -1.349101118903017],
    'preflop_to_push': [-0.4999999999999972, -1.340201862415521]},
   'postflop': {1: {'to_check': [-0.3418291274478324, -

All sanity checks were successful.
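
Some of these checks can be written down as code. The sketch below uses only numbers copied from the output above (the average scores and the sb preflop entries for hands 1 and 2). The zero-sum check is an extra check of my own, and the function name `run_checks` is hypothetical.

```python
# Values copied from the printed output above.
sb_avg, bb_avg = 0.01673825, -0.01673825
preflop_fi = {
    1: [0.0, 0.07433285846717742, -1.28261613598956],
    2: [0.0, 0.08095881565905509, -1.349101118903017],
}


def run_checks(sb_avg, bb_avg, preflop_fi):
    """Return a dict mapping each check to True (passed) or False (failed)."""
    return {
        # sb acts with more information, so it should not do worse than bb.
        'sb_not_worse_than_bb': sb_avg >= bb_avg,
        # Always folding would cost bb its blind of 1 BB every hand.
        'bb_above_always_fold': bb_avg > -1,
        # Extra check: whatever one player wins, the other loses.
        'zero_sum': abs(sb_avg + bb_avg) < 1e-9,
        # Index 0 of 'preflop_fi' is folding, which must have a q of 0.
        'fold_is_zero': all(abs(values[0]) < 1e-9 for values in preflop_fi.values()),
    }


checks = run_checks(sb_avg, bb_avg, preflop_fi)
print(checks)
```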

In [126]:
# Save q table
import pickle
with open('q_simple', 'wb') as q_simple_file:
    pickle.dump(q, q_simple_file)

In [None]:
with open('q_simple', 'rb') as q_simple_file:
    q_test = pickle.load(q_simple_file)
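
To use the loaded table as a deterministic benchmark, each list of Q-values can be reduced to the index of its best action. A minimal sketch follows; the function name `greedy_policy` is hypothetical, and the small example table holds rounded values from the output above rather than the full Q-table.

```python
def greedy_policy(table):
    """Recursively replace every list of Q-values by the index of its maximum."""
    if isinstance(table, dict):
        return {key: greedy_policy(value) for key, value in table.items()}
    # Leaf: a list of Q-values. Ties go to the lowest index, like np.argmax.
    return max(range(len(table)), key=lambda a: table[a])


# Excerpt of the Q-table with rounded values (sb holding hand 1, preflop).
q_example = {
    'sb': {
        1: {
            'preflop': {
                'preflop_fi': [0.0, 0.0743, -1.2826],
                'preflop_to_push': [-0.5, -2.8017],
            }
        }
    }
}

policy = greedy_policy(q_example)
print(policy)
```

Applied to the full table, `greedy_policy(q_test)` would give the benchmark strategy in the same nested layout as the Q-table.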

<h3>Training a deep neural network to beat the Q-table</h3>
Now that we have a benchmark to test our neural network against, we can start creating the neural network and training it through self play. Let's first try to implement the actor-critic. The code is inspired by 