A betting man has the opportunity to make bets on the outcomes of a sporting event. If the team he bets on wins, he wins as many dollars as he has staked on that result; if his team loses, he loses his stake. The game ends when the betting man wins by reaching his goal of $100, or loses by running out of money. On each bet, he must decide what portion of his capital to stake, in integer numbers of dollars. 

The problem can be formulated as an undiscounted, episodic, finite MDP. The state is the betting man’s capital, s ∈ {1, 2, ..., 99}, and the actions are stakes, a ∈ {0, 1, ..., min(s, 100 − s)}. The reward is zero on all transitions except those on which the betting man reaches his goal, when it is +1. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal.
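
As a sketch of what this formulation implies, the value of a state can be written as a Bellman optimality update. The notation is an assumption made here for illustration: p is the probability that a single bet is won, r(s') is the reward for arriving at capital s' (1 for s' = 100, 0 otherwise), γ = 1, and the terminal values v(0) and v(100) are held at 0.

$$
v(s) \leftarrow \max_{a \in \{0, 1, \ldots, \min(s,\, 100 - s)\}} \Big[\, p \,\big( r(s + a) + \gamma\, v(s + a) \big) + (1 - p)\,\big( r(s - a) + \gamma\, v(s - a) \big) \Big]
$$

Repeating this update over all states until the values stop changing is what value iteration does.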

We will use the value iteration method (dynamic programming) to solve this problem, and hence assume that the agent knows the environment dynamics, such as the reward function, the win probability, and the number of states.

In [9]:
# Discount factor
gamma = 1

# Probability that the team he bets on wins
p = 0.4

# Number of capital levels; the non-terminal states are 1 to numStates - 1
numStates = 100

# Reward for arriving at each capital level (index 0 to 100)
reward = [0 for _ in range(101)]
reward[100] = 1  # Only reaching the goal of $100 is rewarded

# Convergence threshold: stop once the largest value change in a sweep is at most theta
theta = 0.00000001

# Value function for capital 0 to 100 (states 1 to 99 are updated; 0 and 100 stay at 0)
value = [0 for _ in range(101)]

# Stake chosen for each level of capital
policy = [0 for _ in range(101)]

In [2]:
def reinforcement_learning():
    # Value iteration: keep sweeping until the largest change is at most theta
    delta = 1
    while delta > theta:
        delta = 0
        # Loop over all non-terminal states, i.e. the money in hand (1 to 99)
        for i in range(1, numStates):
            oldvalue = value[i]
            bellmanequation(i)
            diff = abs(oldvalue - value[i])
            delta = max(delta, diff)
    return value, policy

In [3]:
def bellmanequation(num):
    # Best expected return found so far for this state
    optimalvalue = 0

    # Try every allowed stake, from 0 up to min(num, 100 - num)
    for bet in range(0, min(num, 100 - num) + 1):
        # Capital after winning and after losing the bet
        win = num + bet
        loss = num - bet
        # Expected return of this stake, averaged over the two outcomes:
        # the bet is won with probability p and lost with probability 1 - p
        expected = p * (reward[win] + gamma * value[win]) + (1 - p) * (reward[loss] + gamma * value[loss])

        # Keep the stake with the highest expected return and update
        # the value and policy for this state
        if expected > optimalvalue:
            optimalvalue = expected
            value[num] = expected
            policy[num] = bet

In [5]:
value, policy = reinforcement_learning()
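
One way to sanity-check a run like this is to compare a few states against values that can be worked out by hand. With p = 0.4, staking everything from $50 wins with probability p, so the value at 50 should be close to 0.4; by the same reasoning the values at 25 and 75 should be close to p² = 0.16 and p + (1 − p)p = 0.64. The block below is a minimal, self-contained sketch of that check. The function name `solve_gambler` and its parameters are hypothetical and not part of the notebook, and unlike the cells above it skips the stake of 0, which never changes the capital.

```python
def solve_gambler(p=0.4, goal=100, theta=1e-8):
    """Value iteration sketch for the betting problem (hypothetical helper)."""
    # v[0] and v[goal] are terminal and stay at 0; the +1 reward is paid
    # on the transition into the goal state.
    v = [0.0] * (goal + 1)
    while True:
        delta = 0.0
        for s in range(1, goal):
            best = 0.0
            # Assumption: only stakes of at least 1 are considered here.
            for a in range(1, min(s, goal - s) + 1):
                win, loss = s + a, s - a
                r_win = 1.0 if win == goal else 0.0
                q = p * (r_win + v[win]) + (1 - p) * v[loss]
                best = max(best, q)
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place update, as in the cells above
        if delta <= theta:
            return v


v = solve_gambler()
for s in (25, 50, 75):
    print(s, round(v[s], 4))
```

If the three printed values are not close to 0.16, 0.4 and 0.64, the update or the stopping rule is worth a second look.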

In [7]:
import numpy as np
value = np.array(value)
value

array([0.        , 0.00206562, 0.00516406, 0.00922547, 0.01291015,
       0.0173854 , 0.02306368, 0.02781411, 0.03227539, 0.03768507,
       0.0434635 , 0.05035447, 0.05765919, 0.06523937, 0.06953528,
       0.07443124, 0.08068847, 0.08661104, 0.09421268, 0.10314362,
       0.10865874, 0.11596663, 0.12588617, 0.13357998, 0.14414799,
       0.16      , 0.16309844, 0.16774609, 0.17383821, 0.17936523,
       0.1860781 , 0.19459552, 0.20172117, 0.20841308, 0.21652761,
       0.22519525, 0.2355317 , 0.24648879, 0.25785906, 0.26430292,
       0.27164686, 0.2810327 , 0.28991657, 0.30131902, 0.31471544,
       0.32298812, 0.33394994, 0.34882926, 0.36036996, 0.37622198,
       0.4       , 0.40309844, 0.40774609, 0.41383821, 0.41936523,
       0.4260781 , 0.43459552, 0.44172117, 0.44841308, 0.45652761,
       0.46519525, 0.4755317 , 0.48648879, 0.49785906, 0.50430292,
       0.51164686, 0.5210327 , 0.52991657, 0.54131902, 0.55471544,
       0.56298812, 0.57394994, 0.58882926, 0.60036996, 0.61622

In [8]:
np.array(policy)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  0,  0, 10,  0, 12, 12, 14, 10, 16,
        0,  0,  6, 20,  0, 22,  0,  0,  0,  1,  2,  3,  0,  5,  6,  7,  8,
        0, 10, 11,  0, 38, 11, 10,  9,  0,  7,  6,  5,  0,  3,  0,  0,  0,
        1,  2,  3,  0,  5, 44,  7, 42,  0, 10, 39, 12, 13, 11, 10,  9,  0,
        7,  6,  5,  0,  3,  0,  0,  0,  1,  0,  3,  4,  5,  0,  7, 17,  9,
       10, 11,  0, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0])
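
The printed policy contains many stakes of 0 and looks irregular. A likely reason is that several stakes give almost the same value in a given state, so the strict comparison in `bellmanequation` is decided by floating-point noise, and the stake of 0, which is tried first, stays selected whenever no later stake is strictly better. The sketch below shows one possible way to read a cleaner greedy policy from converged values: skip the stake of 0, round the expected returns, and keep the smallest tied stake. The helper `greedy_policy` and its `decimals` parameter are hypothetical, and the tiny value list is a hand-computed example for a goal of $4, not output from this notebook.

```python
def greedy_policy(v, p, goal=100, decimals=5):
    """Hypothetical helper: greedy stakes from a value list v[0..goal]."""
    policy = [0] * (goal + 1)
    for s in range(1, goal):
        best_q, best_a = -1.0, 0
        # Start at 1: staking 0 leaves the capital unchanged.
        for a in range(1, min(s, goal - s) + 1):
            r_win = 1.0 if s + a == goal else 0.0
            q = p * (r_win + v[s + a]) + (1 - p) * v[s - a]
            # Round so that stakes tied up to float noise compare equal;
            # the strict '>' then keeps the smallest tied stake.
            q = round(q, decimals)
            if q > best_q:
                best_q, best_a = q, a
        policy[s] = best_a
    return policy


# Tiny worked case: goal of $4, p = 0.4, values computed by hand.
tiny_v = [0.0, 0.16, 0.4, 0.64, 0.0]
print(greedy_policy(tiny_v, p=0.4, goal=4))  # prints [0, 1, 2, 1, 0]
```

The same function could be called on the notebook's own results, for example `greedy_policy(value, p)`, since `value` holds 101 entries with the terminal entries at 0.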