<h1 style="color:darkred"> Training a Neural Network to Play Heads-Up No-Limit Texas Hold'em with Probabilistic Decisions</h1>
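In this notebook we train an agent to play heads-up (1 vs. 1) poker using self-play reinforcement learning. A deterministic policy is a poor fit for a game with hidden information and chance: an opponent can exploit a player who always acts the same way in the same situation, so the agent has to make probabilistic decisions.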
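
The difference between a deterministic and a probabilistic decision can be sketched in a few lines. The snippet below is only an illustration: the softmax rule, the `temperature` parameter, the function names and the three action values are assumptions made for this example, not something the notebook prescribes.

```python
import math
import random

def softmax(values, temperature=1.0):
    """Turn raw action values into a probability distribution."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_action(values):
    """Deterministic policy: always pick the highest-valued action."""
    return max(range(len(values)), key=lambda a: values[a])

def sample_action(values, rng, temperature=1.0):
    """Probabilistic policy: draw an action according to its probability."""
    probs = softmax(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]

# Hypothetical action values for fold, call, all in (illustration only).
action_values = [0.1, 0.4, 0.3]
rng = random.Random(0)
probs = softmax(action_values)
samples = [sample_action(action_values, rng) for _ in range(1000)]
```

`greedy_action` returns the same action every time it sees these values, whereas `sample_action` mixes between all three, which is the behaviour an opponent cannot simply read off.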

<h2>Simple Test Game </h2>
Before we train a model to play poker, we need a much simpler test game that shares poker's key properties but is small enough that the training algorithm can be verified in a reasonable training time. There is no point in training a model on poker before we know that the algorithm can, in principle, solve similar games. The test game has to meet these criteria:
<ul>
    <li>Two players </li>
    <li>Imperfect Information </li>
    <li>Round based. At least two decisions have to be made per round.</li>
    <li>Known solution. We need to be able to compute the Nash equilibrium of the game, or the game must already have been solved.<br>
    We need to be able to verify whether our model learns the same solution as the Nash equilibrium.</li>
</ul>
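
One way the last criterion could be checked is to compare the learned action probabilities with a reference solution at every decision point. The sketch below assumes both strategies are stored as dictionaries mapping a decision point to a list of action probabilities. The function name, the keys and all numbers are placeholders for illustration, not a solved strategy for this game. Equilibria are not necessarily unique, so a small gap is supporting evidence rather than a full proof.

```python
def max_strategy_gap(learned, reference):
    """Largest absolute probability difference over all decision points."""
    gap = 0.0
    for info_set, ref_probs in reference.items():
        for p_learned, p_ref in zip(learned[info_set], ref_probs):
            gap = max(gap, abs(p_learned - p_ref))
    return gap

# Placeholder numbers (NOT a real equilibrium): probabilities of
# fold, call, all in at two hypothetical decision points.
reference = {"sb_holds_A": [0.0, 0.2, 0.8], "sb_holds_T": [0.7, 0.3, 0.0]}
learned = {"sb_holds_A": [0.05, 0.2, 0.75], "sb_holds_T": [0.65, 0.3, 0.05]}

gap = max_strategy_gap(learned, reference)
```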
We are going to invent our own very simple no-limit hold'em poker game. The deck is limited to the ranks A to T with no suits, so there are only 5 cards in total. Each player receives one hole card, and there is one community card. There are two betting rounds. Preflop, a player can fold, call, or go all in; postflop, the player has the same options. Both players start with 5 BB. The winning hand is a pair or, failing that, the high card.<br>
The number of possible action sequences is very limited, so we can use simple Q-learning to learn the best strategy. These are the ten possible lines:
<ol>
    <li>sb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb checks</li>
    <li>sb calls, bb checks, (flop), bb checks, sb goes all in, bb folds</li>
    <li>sb calls, bb checks, (flop), bb checks, sb goes all in, bb calls</li>
    <li>sb calls, bb checks, (flop), bb goes all in, sb folds</li>
    <li>sb calls, bb checks, (flop), bb goes all in, sb calls</li>
    <li>sb calls, bb goes all in, sb folds</li>
    <li>sb calls, bb goes all in, sb calls</li>
    <li>sb goes all in, bb folds</li>
    <li>sb goes all in, bb calls</li>
</ol>
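
Because the list above is so short, a plain lookup table is enough for Q-learning. The following is a minimal sketch of a tabular update, not the implementation used later in this notebook: the class name `TabularQ`, the state key `(hole card, action history)`, the hyperparameters `alpha`, `gamma`, `epsilon` and the example reward are all assumptions made for illustration. The action encoding follows the comments in the game class (0 = fold/check, 1 = check/call, 2 = all in/call).

```python
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)  # 0 = fold/check, 1 = check/call, 2 = all in/call

class TabularQ:
    """Hypothetical lookup-table Q-learner for a tiny game."""

    def __init__(self, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
        # Unseen states start with a value of 0.0 for every action.
        self.q = defaultdict(lambda: [0.0] * len(ACTIONS))
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy action selection."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        values = self.q[state]
        return max(ACTIONS, key=lambda a: values[a])

    def update(self, state, action, reward, next_state=None):
        """One Q-learning step; next_state=None marks the end of a hand."""
        target = reward
        if next_state is not None:
            target += self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])

agent = TabularQ()
state = (5, ())  # assumed state key: (hole card, action history so far)
agent.update(state, 2, reward=1.5)  # hypothetical: all in wins the 1.5 BB pot
```

Keying the table on the hole card and the action history keeps the state restricted to what a single player can actually observe.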



In [116]:
import random
class PokerSimple:
    def __init__(self,agent_0, agent_1):
        self.agent_0 = agent_0
        self.agent_1 = agent_1
        self.deck = list(range(1,6))
        
    
    def reset(self):
        self.done = False
        self.position_0 = random.randint(0,1)
        self.position_1 = 0 if self.position_0 ==1 else 1
        self.street = 0
        self.pot = 1.5

        self.history = [] #overall 10 actions are possible
        if self.position_0 == 0:
            self.stack_0 = 4.5
            self.stack_1 = 4

        else:
            self.stack_0 = 4
            self.stack_1 = 4.5
        random.shuffle(self.deck)
        self.hole_0 = self.deck[0]
        self.hole_1 = self.deck[1]
        self.board = 0 #no card dealt yet
        self.next_to_act = [0, 1] if self.position_0 == 0 else [1, 0]
        self.observations= [[],[]]
        self.create_observation(self.next_to_act[0])
        return self.done, self.next_to_act[0], self.observations[self.next_to_act[0]]
    
    def create_observation(self, player=0):
        # Pad a copy of the history to length 10. Appending to self.history
        # directly would leave the -1 fillers in the real action history.
        history_obs = self.history + [-1] * (10 - len(self.history))
        if player == 0:
            self.observations[0] = [self.position_0, self.stack_0, self.stack_1, self.street, self.hole_0, self.pot, self.board]
            self.observations[0].extend(history_obs)
        else:
            self.observations[1] = [self.position_1, self.stack_1, self.stack_0, self.street, self.hole_1, self.pot, self.board]
            self.observations[1].extend(history_obs)
    
    def showdown(self):
        # A pair with the board (strength 10) beats any high card (1 to 5).
        if self.hole_0 == self.board:
            strength_0 = 10
        else:
            strength_0 = self.hole_0
            
        if self.hole_1 == self.board:
            strength_1 = 10
        else:
            strength_1 = self.hole_1
        
        if strength_0 > strength_1:
            return 0
        elif strength_1 > strength_0:
            return 1
        else:
            return 2 #split
        
    
    def implement_action(self, player, action):
        #actions: 0 = fold or check, depending on whether pot is balanced or not
        #         1 = check or call, depending on whether pot is balanced or not
        #         2 = allin or call, depending on whether pot is balanced or not
        if self.done:
            raise RuntimeError('Poker Round is finished. Cannot implement any more actions. Start new hand!')
        if player != self.next_to_act[0]:
            raise RuntimeError('{} is not next to act. Cannot implement action for this player'.format(player))
       
        if self.stack_0 == self.stack_1:
            if action == 0: #transform fold to check
                action = 1
        
        if action == 0:
            if player == 0:
                self.stack_1 += self.pot
            else:
                self.stack_0 += self.pot
            self.history.append(0)
            self.done=True
        
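        elif action == 1: #call or check
            if len(self.next_to_act)==1: #last to act
                # The caller has the larger stack; the difference goes into the pot.
                if player == 0:
                    self.pot += self.stack_0 - self.stack_1
                    self.stack_0 = self.stack_1
                else:
                    self.pot += self.stack_1 - self.stack_0
                    self.stack_1 = self.stack_0
                
                if self.street == 0:
                    self.street = 1
                    # bb acts first postflop; player 1 is bb when player 0 is sb
                    self.next_to_act = [1, 0] if self.position_0 == 0 else [0, 1]
                
                else:
                    self.done = True
                    result = self.showdown()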
                    
            
                
                    
        
    

In [117]:
hero = 0
villain = 1
game = PokerSimple(hero, villain)

In [120]:
for i in range(100):
    print(game.reset())

(False, 1, [0, 4.5, 4, 0, 4, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 4, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 1, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 0, [0, 4.5, 4, 0, 1, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 2, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 0, [0, 4.5, 4, 0, 5, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 5, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 0, [0, 4.5, 4, 0, 5, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 1, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 3, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 4, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 0, [0, 4.5, 4, 0, 1, 1.5, 0, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
(False, 1, [0, 4.5, 4, 0, 1, 1.5, 0, -1, -1, -1, -1, -1, -1, -1,

In [92]:
game.implement_action(0)

RuntimeError: 0 is not next to act. Cannot implement action for this player