### Counterfactual Regret Minimization (CFR) and its application to Kuhn Poker


Source consulted: https://modelai.gettysburg.edu/2013/cfr/cfr.pdf


Kuhn Poker is a simple three-card poker game devised by Harold W. Kuhn. Two players each put 1 chip blind into the pot (the ante) before the deal. Three cards (usually K, Q, and J) are shuffled, and one card is dealt to each player; in the original game each player holds this card as private information. We implement the game as described in the paper above, with one additional change inspired by the [ReBeL](https://arxiv.org/abs/2007.13544) paper: players do not hold any private information. Instead, a referee observes the players' cards and makes decisions based on the strategies that the players announce at the beginning of the game. The players then update their beliefs based on how the other player plays and simultaneously update their strategies. Further information can be found in the paper.  
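
To make the referee idea concrete, here is a minimal sketch, assuming each player announces a policy as a mapping from card to action probabilities. The names `referee_act` and `announced` are hypothetical and appear neither in the notebook code nor in the ReBeL paper; only the card and action labels follow the notebook. The announced probabilities are made up for illustration.

```python
import random

CARDS = ["J", "Q", "K"]
ACTIONS = ["B", "C"]  # bet/call vs. check/fold, same labels as the notebook


def referee_act(policy, private_card, rng):
    """Sample an action on a player's behalf from the policy they announced.

    `policy` maps each card to a probability distribution over ACTIONS.
    Only the referee looks at `private_card`; the players never see it.
    """
    probs = policy[private_card]
    return rng.choices(ACTIONS, weights=probs, k=1)[0]


# Hypothetical announced policy (illustrative numbers, not a trained result):
# always bet with K, never bet with Q, bet one third of the time with J.
announced = {"J": [1 / 3, 2 / 3], "Q": [0.0, 1.0], "K": [1.0, 0.0]}

rng = random.Random(0)
deal = rng.sample(CARDS, 2)          # one card per player, no repeats
action = referee_act(announced, deal[0], rng)
print(f"referee saw {deal[0]} for player 1 and played {action}")
```

Because the opponent only observes the action and knows the announced policy, the action carries information about the hidden card, which is what the belief update later in the notebook tries to exploit.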

In [71]:
## source: https://ai.plainenglish.io/building-a-poker-ai-part-6-beating-kuhn-poker-with-cfr-using-python-1b4172a6ab2d

from typing import List, Dict
import random
import numpy as np
import sys

Actions = ['B', 'C']  # bet/call vs check/fold

class InformationSet():
    def __init__(self):
        self.cumulative_regrets = np.zeros(shape=len(Actions))
        self.strategy_sum = np.zeros(shape=len(Actions))
        self.num_actions = len(Actions)

    def normalize(self, strategy: np.array) -> np.array:
        """Normalize a strategy. If there are no positive regrets,
        use a uniform random strategy"""
        if sum(strategy) > 0:
            strategy /= sum(strategy)
        else:
            strategy = np.array([1.0 / self.num_actions] * self.num_actions)
        return strategy

    def get_strategy(self, reach_probability: float) -> np.array:
        """Return regret-matching strategy"""
        strategy = np.maximum(0, self.cumulative_regrets)
        strategy = self.normalize(strategy)

        self.strategy_sum += reach_probability * strategy
        return strategy

    def get_average_strategy(self) -> np.array:
        return self.normalize(self.strategy_sum.copy())


class KuhnPoker():
    @staticmethod
    def is_terminal(history: str) -> bool:
        return history in ['BC', 'BB', 'CC', 'CBB', 'CBC']  # all betting sequences that end the hand

    @staticmethod
    def get_payoff(history: str, cards: List[str]) -> int:
        """get payoff for 'active' player in terminal history"""
        if history in ['BC', 'CBC']:
            return +1
        else:  # CC or BB or CBB
            payoff = 2 if 'B' in history else 1
            active_player = len(history) % 2
            player_card = cards[active_player]
            opponent_card = cards[(active_player + 1) % 2]
            if player_card == 'K' or opponent_card == 'J':
                return payoff
            else:
                return -payoff


class KuhnCFRTrainer_NoRebel():
    def __init__(self):
        self.infoset_map: Dict[str, InformationSet] = {}
        self.current_average_strategy = np.array([0.5, 0.5]) # keep track of strategies, initialize with 50% bet, 50% pass

    def get_information_set(self, card_and_history: str) -> InformationSet:
        """add if needed and return"""
        if card_and_history not in self.infoset_map:
            self.infoset_map[card_and_history] = InformationSet()
        return self.infoset_map[card_and_history]

    def cfr(self, cards: List[str], history: str, reach_probabilities: np.array, active_player: int):
        if KuhnPoker.is_terminal(history):
            return KuhnPoker.get_payoff(history, cards)

        my_card = cards[active_player]
        info_set = self.get_information_set(my_card + history)

        strategy = info_set.get_strategy(reach_probabilities[active_player])
        
#         ####### CFR-AVG modification as per Rebel #############
#         strategy = (self.current_average_strategy + strategy)/2   # current strategy is not the last strategy, instead its 
#                                                                   # the current average strategy.
#         self.current_average_strategy = strategy
#         ########################################################
        opponent = (active_player + 1) % 2
        counterfactual_values = np.zeros(len(Actions))

        for ix, action in enumerate(Actions):
            action_probability = strategy[ix]

            # compute new reach probabilities after this action
            new_reach_probabilities = reach_probabilities.copy()
            new_reach_probabilities[active_player] *= action_probability

            # recursively call cfr method, next player to act is the opponent
            counterfactual_values[ix] = -self.cfr(cards, history + action, new_reach_probabilities, opponent)

        # Value of the current game state is just counterfactual values weighted by action probabilities
        node_value = counterfactual_values.dot(strategy)
        for ix, action in enumerate(Actions):
            
            info_set.cumulative_regrets[ix] += reach_probabilities[opponent] * (counterfactual_values[ix] - node_value)

        return node_value

    def train(self, num_iterations: int) -> int:
        util = 0
        kuhn_cards = ['J', 'Q', 'K']
        for _ in range(num_iterations):
            cards = random.sample(kuhn_cards, 2)
            history = ''
            reach_probabilities = np.ones(2)
            util += self.cfr(cards, history, reach_probabilities, 0)
        return util

### Thinking in terms of ReBeL

1. Players only have some belief over their current card: 

<!-- C(P_1) = [1/3, 1/3, 1/3]  -- Player 1's probability of getting each card | could be K, Q, or J
C(P_2) = [1/2, 1/2]       -- Player 2's probability of getting remaining card | conditioned over player 1's action -->

2. Players share the same belief, since it is assumed that the players actually "know the best policy":

    $$
    \begin{aligned}
        \text{belief} &= \begin{pmatrix} \operatorname{argmax}\big(P(K), P(Q), P(J)\big) \\
                    \operatorname{argmax}\big(P(K), P(Q), P(J)\big) \end{pmatrix}
    \end{aligned}
    $$

3. The infostate (here, infoset) then becomes strings of belief + action 


4. The history is just a string of cards (we omit the legal actions because a player can always pick either pass (check/fold) or bet (bet/call)).
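
The list above can be sketched as follows, assuming the public belief is stored as a 2x3 matrix (one row per player, one column per card in the order J, Q, K). The helper `belief_to_key` and the `|` separator are hypothetical choices made for this illustration only.

```python
import numpy as np

KUHN_CARDS = ["J", "Q", "K"]


def belief_to_key(belief, history):
    """Collapse a 2x3 belief matrix to the most likely card per player,
    then append the action history to form an infoset key."""
    most_likely = np.argmax(belief, axis=1)  # ties resolve to the lowest index
    cards = "".join(KUHN_CARDS[i] for i in most_likely)
    return cards + "|" + history


uniform = np.ones((2, 3)) / 3
print(belief_to_key(uniform, ""))   # JJ|

skewed = np.array([[0.1, 0.3, 0.6],
                   [0.6, 0.3, 0.1]])
print(belief_to_key(skewed, "B"))   # KJ|B
```

Note that a uniform belief collapses to `JJ` only because `np.argmax` returns the first maximal index, so taking the argmax discards most of the information in the belief. Keeping the full distribution in the key, or sampling from it, may be a better fit.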



### How to condition the belief on the history?

- If the history is not empty:
    - check the last action: if it was a bet, the opponent likely had a better card, so update the belief.
- If the history is empty:
    - use the current belief (initial random?)
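
One way to make "update belief" precise is Bayes' rule applied to the opponent's announced policy: P(card | action) is proportional to P(action | card) times P(card). The sketch below assumes the policy is a 3x2 array (rows J, Q, K; columns bet, pass). The function `update_belief`, the example policy, and the fallback for zero-probability actions are all assumptions made for illustration, not part of the notebook.

```python
import numpy as np

KUHN_CARDS = ["J", "Q", "K"]
ACTIONS = ["B", "C"]  # bet/call vs. check/fold


def update_belief(prior, policy, action):
    """Posterior over the acting player's card after observing `action`."""
    likelihood = policy[:, ACTIONS.index(action)]  # P(action | card) for J, Q, K
    posterior = prior * likelihood
    total = posterior.sum()
    if total == 0:
        # The action had zero probability under the policy; keeping the
        # prior unchanged is an arbitrary fallback chosen for this sketch.
        return prior.copy()
    return posterior / total


prior = np.ones(3) / 3                      # empty history: uniform belief
policy = np.array([[1 / 3, 2 / 3],          # J: bet one third of the time
                   [0.0, 1.0],              # Q: never bet
                   [1.0, 0.0]])             # K: always bet

print(update_belief(prior, policy, "B"))    # roughly [0.25, 0, 0.75]
print(update_belief(prior, policy, "C"))    # roughly [0.4, 0.6, 0]
```

With this policy a bet shifts the belief toward K while leaving some weight on a bluffing J, which matches the intuition in the list above without hand-picking the distribution.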
                

-------------------------------

**Try Implementing these below**

_______________________________


In [7]:
kuhn_cards = ['J', 'Q', 'K']
random.sample(kuhn_cards, 2)

probs = np.ones((2,3))/3

tostring = lambda x: kuhn_cards[x[0]] + kuhn_cards[x[1]]

cards = np.argmax(probs, axis = 1)
cards = tostring(cards)


## update belief



In [46]:
cards

'JJ'

In [67]:
## source: https://ai.plainenglish.io/building-a-poker-ai-part-6-beating-kuhn-poker-with-cfr-using-python-1b4172a6ab2d

from typing import List, Dict
import random
import numpy as np
import sys

Actions = ['B', 'C']  # bet/call vs check/fold

class InformationSet():
    def __init__(self):
        self.cumulative_regrets = np.zeros(shape=len(Actions))
        self.strategy_sum = np.zeros(shape=len(Actions))
        self.num_actions = len(Actions)

    def normalize(self, strategy: np.array) -> np.array:
        """Normalize a strategy. If there are no positive regrets,
        use a uniform random strategy"""
        if sum(strategy) > 0:
            strategy /= sum(strategy)
        else:
            strategy = np.array([1.0 / self.num_actions] * self.num_actions)
        return strategy

    def get_strategy(self, reach_probability: float) -> np.array:
        """Return regret-matching strategy"""
        strategy = np.maximum(0, self.cumulative_regrets)
        strategy = self.normalize(strategy)

        self.strategy_sum += reach_probability * strategy
        return strategy

    def get_average_strategy(self) -> np.array:
        return self.normalize(self.strategy_sum.copy())


class KuhnPoker():
    @staticmethod
    def is_terminal(history: str) -> bool:
        return history in ['BC', 'BB', 'CC', 'CBB', 'CBC']  # all betting sequences that end the hand

    @staticmethod
    def get_payoff(history: str, cards: List[str]) -> int:
        """get payoff for 'active' player in terminal history"""
        if history in ['BC', 'CBC']:
            return +1
        else:  # CC or BB or CBB
            payoff = 2 if 'B' in history else 1
            active_player = len(history) % 2
            player_card = cards[active_player]
            opponent_card = cards[(active_player + 1) % 2]
            if player_card == 'K' or opponent_card == 'J':
                return payoff
            else:
                return -payoff


class KuhnCFRTrainer():
    def __init__(self):
        self.infoset_map: Dict[str, InformationSet] = {}
        self.current_average_strategy = np.array([0.5, 0.5]) # keep track of strategies, initialize with 50% bet, 50% pass

    def get_information_set(self, card_and_history: str) -> InformationSet:
        """add if needed and return"""
        if card_and_history not in self.infoset_map:
            self.infoset_map[card_and_history] = InformationSet()
        return self.infoset_map[card_and_history]

    def cfr(self, cards: List[str], history: str, reach_probabilities: np.array, active_player: int):
        if KuhnPoker.is_terminal(history):
            return KuhnPoker.get_payoff(history, cards)
        
        opponent = (active_player + 1) % 2
        
        
        
        ### BELIEF UPDATE --- SIMPLE IMPLEMENTATION FOR CONCEPT DEMONSTRATION
        kuhn_cards = ['J', 'Q', 'K']  # defined locally so this method does not rely on a global from an earlier cell

        if history != '':
            last_action = history[-1]  # the action the opponent just took

            # hand-picked (not learned) belief over the opponent's card, ordered J, Q, K
            if last_action == 'B':  # opponent likely has a better card
                dist = [0.1, 0.3, 0.6]
            else:  # last_action == 'C': opponent likely has a worse card
                dist = [0.6, 0.3, 0.1]
            # resample the believed cards in place; the two cards are forced to differ
            cards[opponent] = np.random.choice(kuhn_cards, p=dist)
            cards[active_player] = np.random.choice([c for c in kuhn_cards if c != cards[opponent]])

        my_card = cards[active_player]
        info_set = self.get_information_set(my_card + history)

        strategy = info_set.get_strategy(reach_probabilities[active_player])
        
        ####### CFR-AVG modification as per ReBeL #############
        # The strategy played is not the latest regret-matching strategy; instead it is
        # the current average strategy (a single vector shared across all information sets).
        strategy = (self.current_average_strategy + strategy)/2
        self.current_average_strategy = strategy
        ########################################################
        counterfactual_values = np.zeros(len(Actions))

        for ix, action in enumerate(Actions):
            action_probability = strategy[ix]

            # compute new reach probabilities after this action
            new_reach_probabilities = reach_probabilities.copy()
            new_reach_probabilities[active_player] *= action_probability

            # recursively call cfr method, next player to act is the opponent
            counterfactual_values[ix] = -self.cfr(cards, history + action, new_reach_probabilities, opponent)

        # Value of the current game state is just counterfactual values weighted by action probabilities
        node_value = counterfactual_values.dot(strategy)
        for ix, action in enumerate(Actions):
            
            info_set.cumulative_regrets[ix] += reach_probabilities[opponent] * (counterfactual_values[ix] - node_value)

        return node_value

    
    
    def train(self, num_iterations: int) -> int:
        util = 0
        kuhn_cards = ['J', 'Q', 'K']
        actual = []
        beliefs = []
        
        for _ in range(num_iterations):
            actual_cards = random.sample(kuhn_cards, 2)
            actual.append(actual_cards) # keep track of cards
            
#             probs = np.random.rand(2,3)  # initial belief -- not quite correct - need to come up with something robust
#             belief = np.argmax(probs, axis = 1)
#             belief = tostring(belief)
#             beliefs.append(belief)
            
            i_dist = np.ones(len(kuhn_cards))/len(kuhn_cards)

            belief = np.random.choice(kuhn_cards, size = 2, p=i_dist)
            beliefs.append(belief)
            
            history = ''
            reach_probabilities = np.ones(2)
            util += self.cfr(belief, history, reach_probabilities, 0)  # cfr w/belief
        
        return util

In [77]:
num_iterations = 1000
cfr_trainer_rebel = KuhnCFRTrainer()
cfr_trainer = KuhnCFRTrainer_NoRebel()
util_Rebel = cfr_trainer_rebel.train(num_iterations)
util_NoRebel = cfr_trainer.train(num_iterations)


print('------------CFR w/o ReBeL Implementation---------------------------------')
print(f"\nRunning Kuhn Poker chance sampling CFR for {num_iterations} iterations")
print(f"\nExpected average game value (for player 1): {(-1./18):.3f}")
print(f"Computed average game value               : {(util_NoRebel / num_iterations):.3f}\n")

print("We expect the bet frequency for a Jack to be between 0 and 1/3")
print("The bet frequency of a King should be three times the one for a Jack\n")

print(f"History  Bet  Pass")
for name, info_set in sorted(cfr_trainer.infoset_map.items(), key=lambda s: len(s[0])):
    print(f"{name:3}:    {info_set.get_average_strategy()}")
    
print("\n\n\n")
    
print('------------CFR w ReBeL Implementation---------------------------------')
print(f"\nComputed average game value               : {(util_Rebel / num_iterations):.3f}\n")

print("We expect the bet frequency for a Jack to be between 0 and 1/3")
print("The bet frequency of a King should be three times the one for a Jack\n")

print(f"History  Bet  Pass")
for name, info_set in sorted(cfr_trainer_rebel.infoset_map.items(), key=lambda s: len(s[0])):
    print(f"{name:3}:    {info_set.get_average_strategy()}")

------------CFR w/o ReBeL Implementation---------------------------------

Running Kuhn Poker chance sampling CFR for 1000 iterations

Expected average game value (for player 1): -0.056
Computed average game value               : -0.015

We expect the bet frequency for a Jack to be between 0 and 1/3
The bet frequency of a King should be three times the one for a Jack

History  Bet  Pass
K  :    [0.76897653 0.23102347]
Q  :    [0.08165009 0.91834991]
J  :    [0.13465637 0.86534363]
QB :    [0.37723185 0.62276815]
QC :    [0.00901548 0.99098452]
KB :    [0.99837134 0.00162866]
KC :    [0.99837134 0.00162866]
JB :    [0.00144092 0.99855908]
JC :    [0.31021867 0.68978133]
KCB:    [0.99686336 0.00313664]
QCB:    [0.66699756 0.33300244]
JCB:    [9.11364584e-04 9.99088635e-01]




------------CFR w ReBeL Implementation---------------------------------

Computed average game value               : 0.393

We expect the bet frequency for a Jack to be between 0 and 1/3
The bet frequency of a King

'J'