# Brief

This notebook shows a simple implementation of an on-policy, multi-visit (every-visit) Monte Carlo decision maker.

## Action Table
`ActionTable` is a self-defined action space with four discrete actions.

In [1]:
class ActionTable():
    CHECK = 1
    CALL = 2
    RAISE = 3
    FOLD = 4

## Monte Carlo Model
`MCModel` uses Monte Carlo to update its action policy when an episode terminates. An episode is a series of (state, action) pairs. A state describes the observed environment after an action has been made. For example, we can use the combination of the player's cards and the public cards to represent a state.

In [2]:
import operator
import random

import numpy as np


class MCModel(object):
    def __init__(self):
        self.episode = []      # a series of (state, action) pairs
        self.pi = {}           # action policy: state -> action
        self.Q = {}            # expected return of each (state, action) pair
        self.Returns = {}      # observed returns of each (state, action) pair
        self.initial_stack = 0
        self.final_stack = 0
        self.epsilon = 0.3     # probability of exploring instead of taking the best action

Now we describe each member function under `MCModel`

The `get_state` function converts an observation into a state.

In [3]:
    def get_state(self, observation):
        my_card = observation.player_states.hand    # a list of 2
        community_card = observation.community_card # a list of 5
        return my_card + community_card

`record_episode` is called from outside each time we want to record an observation. For example, we might call `record_episode` whenever we receive an `__action` or `__bet` message.

In [4]:
    def record_episode(self, observation, action):
        state = self.get_state(observation)
        self.episode.append([','.join(map(str, state)), action.action])  # use string to represent state
                                                                         # action is an instance of ActionTable
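
To make the state encoding concrete, here is a minimal standalone sketch of how an observation becomes a string key and how one (state, action) pair lands in the episode. `PlayerState`, `Observation`, the integer card codes, and the `-1` placeholder for undealt community cards are all assumptions made for illustration; the real observation objects may look different.

```python
from collections import namedtuple

# Hypothetical stand-ins for the real observation objects (assumed shapes).
PlayerState = namedtuple("PlayerState", ["hand"])
Observation = namedtuple("Observation", ["player_states", "community_card"])


def state_key(observation):
    """Join hole cards and community cards into a comma-separated string key."""
    cards = list(observation.player_states.hand) + list(observation.community_card)
    return ",".join(map(str, cards))


# Assumed encoding: cards are integers, -1 marks a community card not dealt yet.
obs = Observation(PlayerState(hand=[12, 25]), community_card=[3, 40, 51, -1, -1])

episode = []
episode.append([state_key(obs), 2])  # 2 is ActionTable.CALL
print(episode)
```

A string key is hashable, so it can index the `pi`, `Q`, and `Returns` dictionaries directly, which a list of cards could not.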

`set_initial_stack` and `set_final_stack` are called from outside when a round starts and ends, respectively

In [5]:
    def set_initial_stack(self, stack):
        self.initial_stack = stack
        
    def set_final_stack(self, stack):
        self.final_stack = stack

`on_policy_mc` is called each time an episode ends, that is, when a round ends. Here we use the stack difference as the return of the round, i.e., a measure of how well the bot played.

There are three `for` loops inside the `on_policy_mc` function. The first and second loops estimate the expected return of each (state, action) pair by sampling: the first appends the return to every pair visited in the episode, and the second averages the recorded returns into `Q`. The third loop updates the model's action policy `pi`. Note that the action policy does not always choose the best action for a state; with a small probability it picks a random one in order to explore.

In [6]:
    def on_policy_mc(self):
        G = self.final_stack - self.initial_stack  # stack difference as return
        for sa in self.episode:
            s, a = sa
            # no discount, so G for all s,a are same, regardless of order
            if s not in self.Returns:
                self.Returns[s] = {}
            if a not in self.Returns[s]:
                self.Returns[s][a] = []
            self.Returns[s][a].append(G)
        
        # update Q
        for s in self.Returns.keys():
            for a in self.Returns[s].keys():
                if s not in self.Q:
                    self.Q[s] = {}
                self.Q[s][a] = np.mean(self.Returns[s][a])
        
        # update action policy pi
        for s in self.Q.keys():
            # best action by largest average return; compare the means in Q,
            # not the raw lists of returns
            A_star = max(self.Q[s].items(), key=operator.itemgetter(1))[0]
            # epsilon-greedy over four actions: exploit with probability 1 - epsilon,
            # otherwise draw uniformly, so A_star is picked with total probability
            # 1 - epsilon + epsilon / 4 and each other action with epsilon / 4
            exploit_prob = 1 - self.epsilon
            explore_prob = self.epsilon
            choice = np.random.choice(['exploit', 'explore'], p=[exploit_prob, explore_prob])
            if choice == 'exploit':
                self.pi[s] = A_star
            else:
                self.pi[s] = random.choice([ActionTable.FOLD,
                                            ActionTable.CALL,
                                            ActionTable.RAISE,
                                            ActionTable.CHECK])
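
The update above can also be read as two small pure functions. The following is a sketch under assumptions, not the notebook's own code: `mc_update`, `epsilon_greedy`, and the toy state names `s0` and `s1` are hypothetical, and the standard library's `statistics.mean` stands in for `np.mean` so the example has no dependencies.

```python
import random
from statistics import mean

ACTIONS = [1, 2, 3, 4]  # CHECK, CALL, RAISE, FOLD, as in ActionTable


def mc_update(episode, G, returns, Q):
    """Every-visit update: credit G to each (state, action) pair in the episode."""
    for s, a in episode:
        returns.setdefault(s, {}).setdefault(a, []).append(G)
    # Q is the sample mean of all returns recorded so far for each pair.
    for s, per_action in returns.items():
        Q[s] = {a: mean(gs) for a, gs in per_action.items()}


def epsilon_greedy(q_s, epsilon, rng=random):
    """Return the best-known action with probability 1 - epsilon, else a uniform one."""
    if rng.random() < 1 - epsilon:
        return max(q_s, key=q_s.get)
    return rng.choice(ACTIONS)


returns, Q = {}, {}
mc_update([("s0", 2), ("s1", 3)], G=50, returns=returns, Q=Q)   # won 50 chips
mc_update([("s0", 2), ("s1", 4)], G=-30, returns=returns, Q=Q)  # lost 30 chips
mc_update([("s0", 4)], G=-10, returns=returns, Q=Q)             # lost 10 chips

print(Q["s0"])                       # mean return per action tried in s0
print(epsilon_greedy(Q["s0"], 0.3))  # usually 2 (CALL), sometimes a random action
```

Because the uniform draw can also land on the best action, that action is selected with total probability `1 - epsilon + epsilon / 4`, which is 0.775 for `epsilon = 0.3`.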

Finally, `take_action` is called each time an action is needed. It returns the action suggested by the action policy `pi`, or a randomly chosen action if the state has not been seen before.

In [7]:
    def take_action(self, observation):
        # take action under pi
        state = self.get_state(observation)
        state_string = ','.join(map(str, state))
        if state_string not in self.pi:
            action = random.choice([ActionTable.FOLD,
                                    ActionTable.CALL,
                                    ActionTable.RAISE,
                                    ActionTable.CHECK])
            self.pi[state_string] = action
        else:
            action = self.pi[state_string]
        return action
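
One consequence of this lookup is worth spelling out: the random action chosen for an unseen state is stored in `pi`, so the same state keeps returning that action until a later policy update overwrites it. The sketch below illustrates this with a hypothetical free function and made-up state keys; it mirrors the method above but is not part of the notebook.

```python
import random

ACTIONS = [1, 2, 3, 4]  # CHECK, CALL, RAISE, FOLD, as in ActionTable


def take_action(pi, state_key, rng=random):
    """Return pi[state_key], first filling it with a random action if unseen."""
    if state_key not in pi:
        pi[state_key] = rng.choice(ACTIONS)
    return pi[state_key]


pi = {"12,25": 3}                      # one state already has a policy entry
rng = random.Random(0)                 # seeded only to make the demo repeatable

known = take_action(pi, "12,25", rng)  # policy hit: returns the stored action
first = take_action(pi, "7,8", rng)    # unseen state: random action, now cached
second = take_action(pi, "7,8", rng)   # cached: same action as `first`
print(known, first == second)
```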