# Playing Punish with Reinforcement Learning
### Jim Shepich III
### Updated: 2 October 2022
### Time Required: 13 hours 26 minutes

# Contents <a id="contents"></a>

- [Import Packages](#import-packages)
- [Notebook Settings](#config)
- [Representing Game States](#states)
- [Representing Actions](#actions)
- [Simulating the Game Environment](#environment)
- [Enumerating State Space](#state-space)
- [Modeling Enemy Strategies](#enemy-strategies)
    - [Possible Actions Reweighted by Limited Empirical Strategic Samples (PARLESS)](#parless)
    - [PARLESS Augmented With Neural Networks (PAWNN)](#pawnn)
- [Enumerating State-Action Transition Probabilities](#transitions)
- [State-Action Reward Functions](#rewards)

# Import Packages <a name="import-packages"></a>
### [↑ Contents](#contents)

In [47]:
using Combinatorics
using StatsBase
using JSON
using BenchmarkTools

# Notebook Settings <a name="config"></a>
### [↑ Contents](#contents)

This section allows a user to configure the settings of this notebook. For the most part, it lets you toggle between generating results and loading them from a file on disk, as well as set model hyperparameters.

In [2]:
CONFIG = Dict(
    "state-space" => Dict(
        "generate" => false,
        "filepath" => "statespace.json"
    ),
    "transition-probabilities" => Dict(
        "naive" => Dict("generate"=>true,"filepath"=>"naive_transitions.json"),
        "parless" => Dict("generate"=>true,"filepath"=>"parless_transitions.json"),
        "pawnn" => Dict("generate"=>true,"filepath"=>"pawnn_transitions.json")
    )
)

Dict{String, Dict{String}} with 2 entries:
  "transition-probabilities" => Dict{String, Dict{String, Any}}("naive"=>Dict("…
  "state-space"              => Dict{String, Any}("generate"=>false, "filepath"…

# Representing Game States <a id="states"></a>
### [↑ Contents](#contents)

In a standard, two-player game of Punish, the game state consists of the following attributes:

- The cards in the two players' hands
- The cards that are showing (cards that have been played and the two cards that have been revealed)
- The cards that are face-down in the deck (which can potentially be played via the Feint action)
- Whether or not each player has already feinted
- Whether or not each player is exhausted from using the Punish action
- Both players' HP totals
- The breath number 

The measure number is also an aspect of the game state, but we will exclude it from our modeling: although it may influence how a player's attitude toward taking risks evolves, it does not change the mechanics of the game. Because a game of Punish can theoretically last for an indefinite number of measures (much like Rock-Paper-Scissors), excluding the measure number is necessary in order to have a finite state space.

We will design our model to support an agent with the limited perspective of a single player. This means that instead of seeing the specific cards in the opponent's hand and the deck, the agent will only be able to see the number of cards in each. Additionally, since the number of face-down cards in the deck can be determined based on feinting status, we don't need to include it in the state. 

In the next cell, we create a custom structure in which to store a game state. For convenience, we will use `NamedTuple` structures to store the number of each card type in the player's hand, the player's status conditions, enemy's status conditions and hand size, and number of each card type in the discard pile. 

Victory and loss states will be singleton structures.

In [3]:
abstract type PunishState end

struct DuelingState <: PunishState
    breath::Int64
    hand::NamedTuple{(:guard, :rush, :dodge, :strike, :punish), NTuple{5, Int64}}
    status::NamedTuple{(:hp, :exhausted, :feinted), Tuple{Int64, Bool, Bool}}
    enemy::NamedTuple{(:hp, :hand_size, :exhausted, :feinted), Tuple{Int64, Int64, Bool, Bool}}
    discard::NamedTuple{(:guard, :rush, :dodge, :strike, :punish), NTuple{5, Int64}}
end 

struct WinState <: PunishState end
struct LossState <: PunishState end

Although a struct with `NamedTuple` fields is convenient to work with, it is not a memory-efficient data structure to use in the state-action value function, which we will represent with a lookup table. So, in the next cell, we will implement functions to encode the game state as an integer and decode a game state from such an integer. We will encode a win as -1 and a loss as -2; all dueling states will be encoded as positive integers.

In [4]:
function encode_state(state::DuelingState)
    """This function encodes a `DuelingState` structure as an integer in which each digit
    corresponds to an attribute of the game state."""
    digit_list = vcat(
        [state.breath],
        [card for card in state.hand],
        [Int(stat) for stat in state.status],
        [Int(stat) for stat in state.enemy],
        [card for card in state.discard]
    )
    #Sequentially add each attribute of the game state to a list of digits.
    return sum([digit*10^(i-1) for (i, digit) in enumerate(reverse(digit_list))])
    #Convert the list of digits into an integer by multiplying each digit by the 
    #next power of 10 and adding them all together. Reverse the array so that
    #the first digit in the array gets multiplied by the largest power of 10
    #and becomes the leftmost digit in the resulting integer.
end

function encode_state(state::WinState)
     return -1
end

function encode_state(state::LossState)
     return -2
end


function decode_state(coding::Int64)
    """This function decodes a game state that has been
    encoded as an integer and returns a `DuelingState` structure."""
    if coding < 0
        return [WinState(),LossState()][abs(coding)]
    end
    digit_list = reverse(digits(coding))
    #Convert the encoding into an array of digits, and reverse it so
    #that the leftmost digit is the first entry in the array.
    state = DuelingState(
        digit_list[1],
        (
            guard=digit_list[2], 
            rush=digit_list[3], 
            dodge=digit_list[4], 
            strike=digit_list[5], 
            punish=digit_list[6]
        ),
        (
            hp=digit_list[7],
            exhausted=Bool(digit_list[8]),
            feinted=Bool(digit_list[9])
        ),
        (
            hp=digit_list[10], 
            hand_size=digit_list[11],
            exhausted=Bool(digit_list[12]),
            feinted=Bool(digit_list[13])
        ),
        (
            guard=digit_list[14], 
            rush=digit_list[15], 
            dodge=digit_list[16], 
            strike=digit_list[17], 
            punish=digit_list[18]
        )
    )
end

decode_state (generic function with 1 method)
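
The packing scheme is easier to see on a bare vector of digits. The following is a minimal sketch rather than the notebook's implementation: `pack` and `unpack` are hypothetical helper names, and the sample field values are made up for illustration. It also checks two conditions that the scheme appears to rely on: every field must be a single decimal digit, and the leading digit (the breath number) must be nonzero so that no digits are lost on decoding.

```julia
# Hypothetical helpers illustrating the digit-packing idea on a plain vector.
pack(fields) = foldl((acc, d) -> 10 * acc + d, fields)
unpack(code) = reverse(digits(code))

# Illustrative field values in the same order as `encode_state`:
# breath, hand (5), player status (3), enemy status (4), discard (5).
fields = [1, 2, 1, 1, 1, 0, 3, 0, 0, 3, 5, 0, 0, 1, 1, 0, 1, 0]

@assert all(0 .<= fields .<= 9)   # each field must fit in one decimal digit
@assert fields[1] != 0            # a leading zero would vanish on decoding

code = pack(fields)
println(code)                     # 121110300350011010
@assert unpack(code) == fields    # the round trip recovers every field
```

Eighteen decimal digits fit in an `Int64`, whose maximum is roughly 9.2 × 10^18, so an encoding of this width cannot overflow.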

Just in case, we will define a method of the `==` operator to compare a pair of `PunishState`s. It turns out that we can easily compare a pair of `PunishState` objects by comparing their integer encodings.

In [5]:
Base.:(==)(x::PunishState, y::PunishState) = encode_state(x) == encode_state(y)

# Representing Actions <a id="actions"></a>
### [↑ Contents](#contents)

Now, we need to decide how we will represent actions. There are no more than 10 unique actions that can be taken in the game of Punish: play one of any of the 5 card types, or feint with one of those cards. I think that it will be most convenient to work with actions that are represented as `(card::Symbol, feint::Bool)` pairs. However, for memory efficiency, we will encode each pair as a two-digit integer in which the first digit identifies the card and the second digit indicates whether or not that card is used for a feint.

Importantly, if a player is exhausted, then they may not take any actions. We will use the `(:rest, false)` tuple for this "action", which we will encode numerically as `90`.

In [6]:
function encode_action(action::Tuple{Symbol,Bool})
    """This function takes an action represented as a (card, is_feint) tuple and 
    encodes it as an integer in which the first digit represents the card used and
    the second represents whether or not the action is a feint.
    """
    card_identifiers = (guard=1,rush=2,dodge=3,strike=4,punish=5,rest=9)
    return card_identifiers[action[1]]*10+Int(action[2])
end

function decode_action(coding::Int)
    """This function decodes an action encoded as an integer back into a (card,is_feint) tuple."""
    card_identifiers = Dict(1=>:guard,2=>:rush,3=>:dodge,4=>:strike,5=>:punish,9=>:rest)
    card = card_identifiers[digits(coding)[2]]
    feint = Bool(digits(coding)[1])
    return (card,feint)
end

decode_action (generic function with 1 method)
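
As a quick illustration of the two-digit scheme, here is a self-contained sketch. The names `CARD_IDS`, `ID_CARDS`, `encode`, and `decode` are hypothetical stand-ins, and the decoder uses integer division and remainder instead of `digits`, which is an alternative way to split the code rather than the approach used above.

```julia
# Hypothetical stand-ins for `encode_action` and `decode_action`.
const CARD_IDS = (guard=1, rush=2, dodge=3, strike=4, punish=5, rest=9)
const ID_CARDS = Dict(id => card for (card, id) in pairs(CARD_IDS))

# Tens digit identifies the card; ones digit is the feint flag.
encode(card::Symbol, feint::Bool) = CARD_IDS[card] * 10 + Int(feint)
decode(code::Int) = (ID_CARDS[code ÷ 10], Bool(code % 10))

println(encode(:strike, true))   # 41: card 4, feint flag 1
println(encode(:rest, false))    # 90: the forced "rest" action
println(decode(41))              # (:strike, true)
```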

In [7]:
function possible_actions(state::DuelingState)
    """This function returns an encoded list of the possible actions
    that can be taken by the player from a given state."""
    actions = []
    if state.status.exhausted
        return [encode_action((:rest,false))]
        #When the player is exhausted, the only action they can take
        #is rest (which by definition does not involve a feint).
    end
    
    for (card,count) in pairs(state.hand)
        if count == 0 
            continue
        else
            push!(actions,encode_action((card,false)))
            #The player can play any card that they hold at least 1 of.
            if !state.status.feinted
                push!(actions,encode_action((card,true)))
                #If they have not already feinted this breath, they also 
                #have the option of discarding that card to perform a feint.
            end
        end
    end
    return actions
end

function possible_actions(state::Int)
     return possible_actions(decode_state(state))
end

function possible_actions(state::WinState)
     return []
end

function possible_actions(state::LossState)
     return []
end

possible_actions (generic function with 4 methods)

# Simulating the Game Environment <a id="environment"></a>
### [↑ Contents](#contents)

In this section, we will create the functions that implement our representation of the overall game environment. In RL problems, the environment is typically modeled as a Markov Decision Process (MDP) — a model in which at any state, an agent can choose from some set of actions that will influence the probabilities of transitioning to different states, and the set of possible actions and transition probabilities from any given state are fixed. In other words, the actions and transition probabilities are a state function rather than a path function.

Treating PUNISH as an MDP is a bit of an oversimplification because throughout the course of a Measure, you gain information about what is in the opponent's hand. The clearest example is that if you play a Punish against a Dodge and the enemy does not follow up with a Punish, you know that they do not have a Punish in hand. Additionally, if both players feint and do not use Punish until the last measure, you know with certainty the single card that is in the enemy's hand.

Now, if I were to include this information in the game state, there would be less bias in treating the game as an MDP. However, I made the command decision that the benefit of including this information is outweighed by the increase in the size of the state space that would result from it. 

In [8]:
function ΔHP(cards::NamedTuple{(:player,:enemy),Tuple{Symbol,Symbol}})
    """This function computes the change in HP incurred by each player
    when a pair of cards is played. Priority is not taken into account here."""
    if cards.player == cards.enemy
        return (player=0, enemy=0)
        #No damage is dealt in a clash. 
    end
    
    damage_map = Dict(:rush=>1,:strike=>2,:punish=>3)
    player_Δ = -get(damage_map,cards.enemy,0) *
        #Base damage.
        (cards.player==:dodge ? cards.enemy==:rush : 1) +
        #Dodging reduces base damage from non-Rush attacks to 0.
        (cards.player==:guard ? Int(cards.enemy in keys(damage_map)) : 0)
        #Guarding reduces damage from any attacks by 1.
    enemy_Δ = -get(damage_map,cards.player,0) *
        #Base damage.
        (cards.enemy==:dodge ? cards.player==:rush : 1) +
        #Dodging reduces base damage from non-Rush attacks to 0.
        (cards.enemy==:guard ? Int(cards.player in keys(damage_map)) : 0)
        #Guarding reduces damage from any attacks by 1.
    
    return (player=player_Δ, enemy=enemy_Δ)
    
end

ΔHP (generic function with 1 method)
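
The damage rules can also be read one case at a time. The sketch below restates them for a single defender; `damage_taken` and `DAMAGE` are hypothetical names, the function returns damage as a nonnegative number rather than a signed change in HP, and it is intended to agree with `ΔHP` above rather than to replace it.

```julia
# Hypothetical case-by-case restatement of the damage rules for one defender.
const DAMAGE = Dict(:rush => 1, :strike => 2, :punish => 3)

function damage_taken(defense::Symbol, attack::Symbol)
    defense == attack && return 0                       # a clash deals no damage
    base = get(DAMAGE, attack, 0)                       # non-attacks deal 0
    defense == :dodge && attack != :rush && return 0    # Dodge avoids all but Rush
    defense == :guard && return max(base - 1, 0)        # Guard absorbs 1 damage
    return base
end

println(damage_taken(:guard, :strike))   # 1
println(damage_taken(:dodge, :punish))   # 0
println(damage_taken(:dodge, :rush))     # 1
println(damage_taken(:strike, :punish))  # 3
```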

In [9]:
priority = Dict(
    :rest => -1,
    :guard => 0,
    :rush => 1,
    :dodge => 2,
    :strike => 3,
    :punish => 4
)

Dict{Symbol, Int64} with 6 entries:
  :strike => 3
  :dodge  => 2
  :guard  => 0
  :punish => 4
  :rest   => -1
  :rush   => 1

In [10]:
function breath(
    state::DuelingState,
    picks::NamedTuple{(:player,:enemy),Tuple{Symbol,Symbol}},
    feints::NamedTuple,
    )
    """This function takes complete information about one Breath (i.e. the input state,
    as well as both players' picked cards and the results of any feints) and returns the
    successor state. 
    
    This function is designed to be deterministic and single-valued. To that end, successors
    of a fourth breath are returned as a single placeholder whose `breath` field is set to 5;
    we will use a different function to compute all possible re-deals for the start of a new
    measure.
    """
    plays = (
        player=isnothing(feints.player) ? picks.player : feints.player, 
        enemy=isnothing(feints.enemy) ? picks.enemy : feints.enemy
    )
    #Track the actual cards that are played 
    #(i.e. if feinting, the drawn card; otherwise, the picked card).
    
        
    successor = DuelingState(
        state.breath + 1,
        (; [(card,count-Int(card==picks.player)) for (card,count) in pairs(state.hand)]...),
        #Whichever card was picked is removed from the hand.
        (
            hp = (state.status.hp+ΔHP(plays).player),
            #Compute the change in HP and apply it to the HP total.
            exhausted = (plays.player==:punish),
            #If the player played Punish this breath they become exhausted.
            feinted = (!isnothing(feints.player))||state.status.feinted),
            #The player has feinted if they feinted this breath or in a previous breath.
        (
            hp = (state.enemy.hp+ΔHP(plays).enemy), 
            hand_size = (state.enemy.hand_size-Int(plays.enemy!=:rest)),
            #The enemy's hand size decreases by 1 unless they rested after a Punish.
            exhausted= (plays.enemy==:punish),
            feinted = (!isnothing(feints.enemy)||state.enemy.feinted)
        ),
        (; [(card,count+sum([picks...,feints...].==card)) 
            for (card,count) in pairs(state.discard)]...),
        #For each card type in the discard pile, increment the count for every
        #pick and every feint that matched that type this breath.
            
    )
    
    if successor.status.hp <= 0
        if successor.enemy.hp <= 0
            return priority[plays.player] < priority[plays.enemy] ? WinState() : LossState()
            #If both players would be reduced to nonpositive HP this breath, the winner
            #is determined by card priority.
        else
            return LossState()
            #If just the player would be reduced to nonpositive HP, it is a loss.
        end
    elseif successor.enemy.hp <= 0
        return WinState()
        #If just the enemy would be reduced to nonpositive HP, it is a win.
    else
        return successor
        #If both the player and the enemy have positive HP, then the duel continues.
    end
end

breath (generic function with 1 method)

In [11]:
function redeal(state::DuelingState)
    """This function takes an end-of-measure state and returns a
    dict mapping each possible beginning-of-new-measure state to its
    probability of occurring based on a random redealing of cards."""
    successors = []
    for draws in combinations(
        vcat([repeat([card],count) for (card, count) in pairs(state.discard)]...),
        5-sum(state.hand)
    )
        #Randomly draw a number of cards equal to the difference between 5 and your 
        #ending hand size from the cards visible in the discard pile.
        
        hand = (; [(card,count+sum(draws.==card)) for (card,count) in pairs(state.hand)]...)
        #Add the drawn cards to the cards remaining in your hand at the end of the Breath.
        
        for discards in combinations(
            vcat([repeat([card],count-sum(draws.==card)) for (card, count) in pairs(state.discard)]...),
            3
        )
            #Randomly choose 3 cards to go face-up into the discard pile from the cards 
            #that remain after you've made your draws.
            discard = (; [(card,sum(discards.==card)) for (card,count) in pairs(state.discard)]...)
            #These three cards replace the old discard pile. The rest of the cards fill the 
            #enemy's hand and the face-down deck.
            
            successor = DuelingState(
                1, #The first breath of a new measure.
                hand,
                (hp=min(3,state.status.hp+1),exhausted=false,feinted=false),
                #Between measures, each player heals 1HP up to a maximum of 3,
                #and status effects are removed.
                (hp=min(3,state.enemy.hp+1),hand_size=5,exhausted=false,feinted=false),
                discard
            )
            
            push!(successors,encode_state(successor))
        end
    end
    return proportionmap(successors)
end


function redeal(state::Int)
    return redeal(decode_state(state))
end

redeal (generic function with 2 methods)
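
Note that `redeal` relies on `combinations` treating identical cards as distinct: a draw that can be made in more ways shows up more often in the list, and `proportionmap` turns those repeats into probabilities. The following is a small sketch of that idea using only Base Julia. The index pairs stand in for `combinations(pile, 2)` and the manual tally stands in for `proportionmap`; the three-card pile is made up for illustration.

```julia
# Three cards, two of which are identical.
pile = [:guard, :guard, :rush]

# Every 2-card draw by position, standing in for `combinations(pile, 2)`.
draws = [sort([pile[i], pile[j]]) for i in 1:length(pile) for j in (i + 1):length(pile)]

# Tally identical draws and normalize, standing in for `proportionmap(draws)`.
tally = Dict{Vector{Symbol},Int}()
for draw in draws
    tally[draw] = get(tally, draw, 0) + 1
end
probabilities = Dict(draw => n / length(draws) for (draw, n) in tally)

println(probabilities[[:guard, :rush]])    # 2/3: two of the three positional draws
println(probabilities[[:guard, :guard]])   # 1/3: only one positional draw
```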

In [12]:
function enemy_states(state::DuelingState)
    """This function takes a DuelingState based on the player's incomplete
    information and returns a dict mapping each possible state of the enemy's
    incomplete information to a probability based on card counts."""
    
    enemy_states = []
    
    for enemy_hand in combinations(
        vcat([repeat([card],3-state.hand[card]-state.discard[card]) for card in keys(state.discard)]...),
        state.enemy.hand_size
    )
    #The enemy's hand could be any member of the set of all possible combinations of n cards chosen
    #from whichever cards the player cannot see, where n is the enemy's hand size. 
        enemy_state = DuelingState(
            state.breath,
            (; [(card,sum(enemy_hand.==card)) for card in [:guard,:rush,:dodge,:strike,:punish]]...),
            (hp=state.enemy.hp, exhausted=state.enemy.exhausted, feinted=state.enemy.feinted),
            (
                hp=state.status.hp,
                hand_size=sum(state.hand),
                exhausted=state.status.exhausted,
                feinted=state.status.feinted
            ),
            state.discard
        )
        #Breath number, discard, and statuses are all common information. 
        push!(enemy_states,encode_state(enemy_state))
    end
    return proportionmap(enemy_states)
    #Return normalized value counts of the possible enemy states.
end

enemy_states (generic function with 1 method)

In [43]:
function transitionmap(state::DuelingState,action::Tuple{Symbol,Bool}; empirical_strategies=Dict())
    """This function takes a `DuelingState` and an action tuple and
    returns a dict mapping all possible successor states to the probability of 
    that successor resulting from taking the given action from the given state, i.e.
    a dict of (successor, transition probability) pairs.
    
    The `empirical_strategies` keyword argument allows you to pass in a dictionary of 
    (state, action probability map) pairs corresponding to the enemy's empirically observed
    mixed strategies (probability distribution over possible actions). For any states
    not in the dictionary, a uniform distribution over all possible actions will be assumed.
    """
    transitions = Dict{Int64,Float64}()
    for (enemy_state,p_enemy_state) in enemy_states(state)
        #Loop over all possible enemy hands. 
        enemy_actions = get(
            empirical_strategies,
            enemy_state,
            Dict(enemy_action=>1/length(possible_actions(enemy_state)) 
                for enemy_action in possible_actions(enemy_state) )
        )
        #Look up the enemy's state in the empirical strategies dict. If the entry is missing,
        #assume a uniform mixed strategy (equal probability of all possible actions).
        
        decoded_enemy_state = decode_state(enemy_state)
        
        for (enemy_action,p_enemy_action) in enemy_actions
            #Loop over all possible enemy actions.
            
            deck = vcat([repeat([card],3-(
                        state.hand[card]+
                        decoded_enemy_state.hand[card]+
                        state.discard[card]
                        ))
                    for card in keys(state.discard)]...)
            #With a fixed enemy hand, we have certainty about which cards are in the deck.
             
            feints = [(player=pf,enemy=ef) for (pf,ef) in zip(
                (action[2] ? deck : repeat([nothing],length(deck))),
                (decode_action(enemy_action)[2] ? reverse(deck) : repeat([nothing],length(deck))),
            )]
            #Generate the set of possible outcomes of the feints taken.
            if length(feints) == 0
                feints = [(player=nothing,enemy=nothing)]
            end
            
            for (feint, p_feint) in proportionmap(feints)
                #Loop over the distinct feint outcomes. 
                picks = (player=action[1],enemy=decode_action(enemy_action)[1])
                result = breath(state,picks,feint)
                #Simulate a Breath using the fixed player and enemy actions and feint outcomes.
                successors = (isdefined(result,:breath)&&result.breath==5) ? redeal(result) : Dict(
                    encode_state(result)=>1
                )
                #If the result is a Breath 5 state, then generate the possible redeals
                #and their probabilities. Otherwise, there is a single successor that
                #occurs with probability 1. 
                for (successor,p_redealt) in successors
                    p = p_enemy_state * p_enemy_action * p_feint * p_redealt
                    #Multiply all the conditional probabilities to get the overall
                    #probability of the successor resulting from the player taking
                    #the given action from the given state.
                    if successor in keys(transitions)
                        transitions[successor] += p
                    else
                        transitions[successor] = p
                    end
                    #If this successor has already been mapped out, add the 
                    #probability from this path to it.
                end
            end
        end
    end
    return transitions
end

function transitionmap(state::Int,action::Int; kwargs...)
     return transitionmap(decode_state(state),decode_action(action);kwargs...)
end

transitionmap (generic function with 2 methods)
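
The product `p_enemy_state * p_enemy_action * p_feint * p_redealt` inside `transitionmap` can be summarized as a single sum over paths. The notation below is my own shorthand and not part of the notebook: $e$ is a candidate enemy state, $a_e$ an enemy action, $\pi_e$ the enemy's mixed strategy, $f$ a feint outcome, and $r$ the placeholder result of the Breath (which is either the successor itself or a Breath 5 state to be redealt).

```latex
P(s' \mid s, a) =
    \sum_{e} P(e \mid s)
    \sum_{a_e} \pi_e(a_e \mid e)
    \sum_{f} P(f \mid s, e, a, a_e)\,
    P\bigl(s' \mid r(s, a, a_e, f)\bigr)
```

When the result is not a Breath 5 state, the last factor is 1 for the single successor and 0 otherwise.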

# Enumerating State Space <a id="state-space"></a>
### [↑ Contents](#contents)

The next thing we have to do as we lay the groundwork is to generate the set of all possible game states. The following facts restrict the state space from the simple Cartesian product of all possible values of each aspect of the game state:
- Neither player's HP can be 0 in a non-end state
- Neither player's HP can be less than 2 in a Breath 1 state
- Neither player can already be exhausted in a Breath 1 state
- Neither player can have already feinted in a Breath 1 state
- A player's hand size must be 5-(breath#-1) or 5-(breath#-2); the latter can only be true if there is a Punish in the discard pile for that player
- Likewise, the number of cards in the discard pile must be 2*(breath#)+(# of feints)-(# of punishes played), where the # of Punishes played is a variable that cannot exceed 2, and must be at least 1 if there are three Punish cards in the discard
- For each card type, the sum of the number in hand and the number showing in the discard pile cannot exceed 3.
- A player cannot have an HP of 1 unless one of the following occurred:
    - A Strike was used against them (guarded if they started with 2 HP and unguarded if they started with 1)
    - A Rush was used against them and not guarded
    - A Punish was used against them and guarded 

Clearly, the state space's intension cannot be stated simply. The last rule in particular shows that once we look into how the cards played affect the possible status conditions, things get complicated quickly.

There might be an easier way to generate the set of possible states. The intension of the set of possible Breath 1 states is much simpler than that of the set of all possible states. If we enumerate all possible Breath 1 states, we can simply use the functions we defined in the previous section to generate the set of all possible successor states, i.e. the set of all possible Breath 2 states. Then, we rinse and repeat to get the Breath 3 and Breath 4 states. This is certainly not the most efficient way to go about this, but we only have to do it once.

We'll start out by generating the possible starting (Measure 1 Breath 1) states. Since both players start with 3 HP, 5 cards in hand, and no status conditions, this boils down to the set of possible combinations of hands and discard piles.

In [14]:
if CONFIG["state-space"]["generate"]
    
STARTING_STATES = Set{Int64}()
hands = Set{NamedTuple}()

for draws in (combinations(repeat([:guard,:rush,:dodge,:strike,:punish],3),5))
    hand = (; [(card,sum(draws.==card)) for card in [:guard,:rush,:dodge,:strike,:punish]]...)
    #Loop over all possible starting hands.
    if hand in hands
        continue
        #The `combinations` function does not account for identical elements, so
        #skip any combinations that have already been encountered.
    else
        push!(hands,hand)
        #Track the combinations that have been encountered.
    end
    
    discards = []
    for discard_draws in combinations(
            vcat([repeat([card],3-count) for (card, count) in pairs(hand)]...),
            3
        )
        #Loop over all possible 3-card discard piles taken from the cards that
        #remain after the player's hand has been drawn.
        discard = (; [(card,sum(discard_draws.==card)) for card in [:guard,:rush,:dodge,:strike,:punish]]...)
        if discard in discards
            continue
        else
            push!(discards,discard)
        end
        #Likewise, skip discard combinations that have already been encountered.
        
        starting_state = DuelingState(
            1,
            hand,
            (hp=3, exhausted=false, feinted=false),
            (hp=3, hand_size=5, exhausted=false, feinted=false),
            discard
        )
        push!(STARTING_STATES,encode_state(starting_state))
         
    end
    
end

println("$(length(STARTING_STATES)) distinct starting states")
end

    2150 distinct starting states

Now, we'll generate all possible Breath 1 states. Still no status effects and 5 cards in both players' hands, but either or both players can have 2 HP on Breath 1 after the first Measure.

In [15]:
if CONFIG["state-space"]["generate"]

BREATH1_STATES = union(
    STARTING_STATES,
    STARTING_STATES .- 100000000000,
    #Player starts with 2HP
    STARTING_STATES .- 100000000,
    #Enemy starts with 2HP
    STARTING_STATES .- 100100000000
    #Both start with 2HP
)
println("Breath 1: $(length(BREATH1_STATES)) distinct states")
    
end

    Breath 1: 8600 distinct states
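
The constants subtracted in the previous cell are the place values of the two HP digits in the 18-digit encoding (the 7th and 10th digits from the left). Here is a short sketch of that arithmetic; `place_value` is a hypothetical helper and the encoded state is a made-up example.

```julia
# Place value of the digit at `position` (counted from the left) in a code of `width` digits.
place_value(position; width=18) = 10^(width - position)

println(place_value(7))    # 100000000000: the player's HP digit
println(place_value(10))   # 100000000: the enemy's HP digit

# Illustrative encoded starting state (made-up hand and discard, both players at 3 HP).
code = 121110300350011010
lowered = code - place_value(7)
println(reverse(digits(lowered))[7])   # 2: the player now starts with 2 HP
```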

Now, we have to find the Breath 2-4 states. We will do this by using the `transitionmap` function to find all the possible successors from each state in the previous breath. This will inevitably be very inefficient, but it will get the job done.

In [16]:
if CONFIG["state-space"]["generate"]

STATE_SPACE = Set{Int64}([STARTING_STATES...])
new_states = Set{Int64}([STARTING_STATES...])

for breath in 2:4
    successors = Set{Int64}()
    for state in new_states
        for action in possible_actions(state)
            union!(successors, keys(transitionmap(state,action)))
            #Loop over all distinct states of the previous breath and 
            #construct a set of all distinct successors from any state
            #in that breath.
        end
    end
    
    new_states = successors
    println("Breath $(breath): $(length(new_states)) distinct states")
    #Print the number of distinct states for each breath number.
    
    if length(new_states)==0
        break
    end
    
    union!(STATE_SPACE,successors)
end
    
end

    Breath 2: 65592 distinct states
    Breath 3: 147263 distinct states
    Breath 4: 153330 distinct states

In [17]:
if CONFIG["state-space"]["generate"]
    open(CONFIG["state-space"]["filepath"],"w") do f
        JSON.print(f,STATE_SPACE)
    end
    #If the state space is to be generated, then save the results.
else 
    STATE_SPACE = Set(JSON.parsefile(CONFIG["state-space"]["filepath"],use_mmap=false))
    #Otherwise, load from disk. The use_mmap keyword must be set to false, otherwise
    #the file will remain open in Julia.
end
println("Total: $(length(STATE_SPACE)) distinct states")

Total: 368331 distinct states


# Modeling Enemy Strategies <a id="enemy-strategies"></a>
### [↑ Contents](#contents)

One of the biases of our current representation of PUNISH is the assumption that at any given state, the enemy will weight each of their possible actions equally. In actuality, we expect a human player to (on average) avoid taking actions that will lead to imminent defeat. However, human players lack the computational power to estimate expected rewards by considering anything deeper than the immediate Breath. Attitudes toward risk-taking will also inform a human player's behavior. This is beyond the scope of what we are able to derive from a strictly logical premise. So, we turn to data.

## Possible Actions Reweighted by Limited Empirical Strategic Samples (PARLESS) <a id="parless"></a>
### [↑ Contents](#contents)

In this section, I present a method for using real game data to reweight the enemy's mixed strategies (i.e. probability distributions over possible actions), which I call **"Possible Actions Reweighted by Limited Empirical Strategic Samples" (PARLESS)**. In this method, the empirical counts of actions taken from a given state by real players serve as the observed data in Bayes' rule, which we use to update the parameters of a multinomial probability distribution representing the mixed strategy. We will use a symmetric Dirichlet prior (one that weights every possible action equally), as the Dirichlet distribution is the conjugate prior of the multinomial distribution.

https://stats.stackexchange.com/questions/44494/why-is-the-dirichlet-distribution-the-prior-for-the-multinomial-distribution
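
To make the update concrete, here is a minimal sketch of a Dirichlet-multinomial reweighting. It is not the notebook's implementation: the function name `reweight`, the pseudo-count `α`, and the choice of the posterior mean as the point estimate of the mixed strategy are all assumptions made for illustration.

```julia
# Hypothetical sketch of a Dirichlet-multinomial update; not the notebook's implementation.
function reweight(observed::Dict{Int,Int}, actions::Vector{Int}; α::Float64=1.0)
    # Posterior Dirichlet parameters are the prior pseudo-count plus the observed count.
    posterior = Dict(a => α + get(observed, a, 0) for a in actions)
    total = sum(values(posterior))
    # Use the posterior mean as the point estimate of the mixed strategy.
    return Dict(a => w / total for (a, w) in posterior)
end

actions = [10, 11, 20]                       # encoded actions available in some state
strategy = reweight(Dict(10 => 3), actions)  # action 10 was observed three times
println(strategy[10])                        # 4/6, about 0.667
println(strategy[11])                        # 1/6, about 0.167

# With no observations, the estimate falls back to the uniform strategy.
println(reweight(Dict{Int,Int}(), actions)[10])   # 1/3
```

With this choice, a state with few samples stays close to the uniform strategy, and the empirical frequencies dominate only as observations accumulate.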

## PARLESS Augmented With Neural Networks (PAWNN) <a id="pawnn"></a>
### [↑ Contents](#contents)

PUNISH has a small yet dedicated player base. With hundreds of thousands of game states, though, we simply do not have empirical strategy data for every possible state. We may nevertheless be able to extrapolate from the states that we do have using a deep neural network model. Here, we will experiment with using neural networks trained on the (state ↦ mixed strategy) set generated by the basic PARLESS method to extrapolate mixed strategies for states for which we lack data.

# Enumerating State-Action Transition Probabilities <a id="transitions"></a>
### [↑ Contents](#contents)

Although we have the `transitionmap` function to generate the transition probabilities for any given state-action pair, it will significantly speed up the value iteration computations if we precompute the transition probabilities.

Note: we will need to generate a separate transition probability map for all three models of enemy strategies (Naive, PARLESS, and PAWNN).

## Naive

In [None]:
if CONFIG["transition-probabilities"]["naive"]["generate"]
    NAIVE_TRANSITIONS = Dict( 
        state => Dict(
            action => transitionmap(state,action) 
            for action in possible_actions(state)
        )
        for state in STATE_SPACE
    )
    open(CONFIG["transition-probabilities"]["naive"]["filepath"],"w") do f
        JSON.print(f,NAIVE_TRANSITIONS)
    end
    #Generate transition map and save results.
else 
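    NAIVE_TRANSITIONS = JSON.parsefile(
        CONFIG["transition-probabilities"]["naive"]["filepath"],use_mmap=false
    )
    #Load results from disk. The table is a nested dictionary, so it is not wrapped in a Set.
    #Note that JSON object keys are parsed as strings rather than integers.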
end
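
One caveat when loading the table from disk: JSON object keys are always strings, so the encoded states, actions, and successors come back as strings rather than integers. Below is a minimal sketch of converting them back. `int_keys` is a hypothetical helper, and the miniature table is made up to mimic the shape of the parsed file.

```julia
# Hypothetical helper: convert the string keys of a parsed three-level table back to integers.
function int_keys(raw)
    return Dict(
        parse(Int, state) => Dict(
            parse(Int, action) => Dict(
                parse(Int, successor) => Float64(p) for (successor, p) in successors
            ) for (action, successors) in actions
        ) for (state, actions) in raw
    )
end

# Made-up miniature table in the shape that a JSON parser would return.
raw = Dict("12" => Dict("10" => Dict("-1" => 0.25, "13" => 0.75)))
table = int_keys(raw)
println(table[12][10][-1])   # 0.25
```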

## PARLESS

## PAWNN

# State-Action Reward Functions <a id="rewards"></a>
### [↑ Contents](#contents)

Biases:
- Equiprobable transitions
- Independence of successor states (i.e. ignoring information about what an opponent has)