<h1>Monte Carlo Methods</h1>

# 0. Monte Carlo Control, output $\pi \approx \pi_{*}$

## Card Game

### Environment

+ Two players: me and the dealer.

+ Task: Draw cards from an infinite deck. A player may stick (stop) or hit (ask for another card). If the cumulative sum of the card values exceeds 21, the player goes bust.

+ Reward: The winner gets +1, the loser gets -1, and both get 0 on a draw.

### Policy of The Dealer

If the cumulative sum is greater than or equal to 17, the dealer sticks; otherwise the dealer hits.

### Policy of Me

If the cumulative sum is greater than or equal to 18, I stick; otherwise I hit.

### State

+ My current sum, from 12 to 21

### Action

Stick or Hit


### Monte Carlo ES (Exploring Starts), output $\pi \approx \pi_{*}$

+ Initialize, $\forall s \in \mathcal{S}, a \in \mathcal{A}$

> $Q(s, a) \leftarrow$ arbitrary

> $\pi(s) \leftarrow$ arbitrary

> $Returns(s, a) \leftarrow$ empty list

+ Repeat forever:

> Choose $S_0 \in \mathcal{S}$ and $A_0 \in \mathcal{A}(S_0)$ s.t. all pairs have probability $> 0$

> Generate an episode starting from $S_0, A_0$, following $\pi$

> For each pair $(s, a)$ appearing in the episode:

>> $G \leftarrow$ the return that follows the first occurrence of $(s, a)$

>> Append $G$ to $Returns(s, a)$

>> $Q(s, a) \leftarrow$ average($Returns(s, a)$)

> For each $s$ in the episode: 

>> $\pi(s) \leftarrow \arg\max_{a}Q(s, a)$
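The pseudocode above could be sketched for this card game roughly as follows. This is a minimal illustration, not the notebook's implementation: the names `draw_card`, `dealer_total`, `run_episode`, and `mc_es` are hypothetical, a bust is scored as a sum of 0 (an assumption mirroring the notebook's code below), ties in the argmax are broken toward stick, and the `Returns` list is replaced by an equivalent incremental average. Because the sum strictly increases with every hit, each pair appears at most once per episode, so every visit is a first visit.

```python
import random

STICK, HIT = 0, 1
STATES = list(range(12, 22))  # player's sum, 12 to 21


def draw_card(rng):
    # Infinite deck: 13 equally likely ranks; 10 and the face cards all count as 10.
    return min(rng.randint(1, 13), 10)


def dealer_total(rng):
    # Dealer hits below 17; a bust is scored as 0 (assumption mirroring the notebook).
    total = 0
    while total < 17:
        total += draw_card(rng)
    return total if total <= 21 else 0


def run_episode(s0, a0, pi, rng):
    # Start from the pair (s0, a0), then follow pi until sticking or going bust.
    visited, s, a = [], s0, a0
    while True:
        visited.append((s, a))
        if a == STICK:
            me = s
            break
        s += draw_card(rng)
        if s > 21:
            me = 0  # bust
            break
        a = pi[s]
    dealer = dealer_total(rng)
    g = (me > dealer) - (me < dealer)  # +1 win, -1 loss, 0 draw
    return visited, g


def mc_es(n_episodes=50_000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in (STICK, HIT)}
    N = {pair: 0 for pair in Q}
    pi = {s: rng.choice((STICK, HIT)) for s in STATES}  # arbitrary initial policy
    for _ in range(n_episodes):
        # Exploring start: every (state, action) pair has probability > 0.
        s0, a0 = rng.choice(STATES), rng.choice((STICK, HIT))
        visited, g = run_episode(s0, a0, pi, rng)
        for pair in visited:
            # Incremental mean, equivalent to averaging the list of returns.
            N[pair] += 1
            Q[pair] += (g - Q[pair]) / N[pair]
        for s, _ in visited:
            # Greedy policy improvement for the states seen in this episode.
            pi[s] = STICK if Q[(s, STICK)] >= Q[(s, HIT)] else HIT
    return Q, pi
```

Calling `Q, pi = mc_es()` returns the action-value estimates and the greedy policy as dictionaries keyed by the player's sum. The only reward arrives at the end of the episode and there is no discounting, so every visited pair receives the same return `g`.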

In [1]:
# Monte Carlo Exploring Starts
import numpy as np

n_s = 10 # my sum: 12~21 (10 states in total)
n_a = 2 # 0: stick, 1: hit

# Initialize
Q = np.zeros((n_s, n_a))
pi = np.zeros(n_s)

states = np.arange(12, 22)
actions = np.arange(n_a)

# 52 card IDs; get_card maps each ID to a card value
cards = np.arange(1, 53)

In [2]:
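def get_card(cards):
    # Draw uniformly with replacement (infinite deck) and reduce the card ID to a rank 0..12
    card = np.random.choice(cards) % 13
    # Ranks 11, 12, and 0 count as 10 (face cards); rank 1 (ace) counts as 1
    if card > 10 or card == 0:
        card = 10

    return card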
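As a quick check of the mapping in `get_card`, the sketch below applies the same rule to every one of the 52 card IDs instead of a random draw. The helper name `card_value` is hypothetical and only used for this illustration.

```python
from collections import Counter


def card_value(card):
    # Same mapping as get_card, applied to a given card ID instead of a random draw.
    rank = card % 13
    return 10 if (rank > 10 or rank == 0) else rank


# Count how many of the 52 card IDs map to each value.
counts = Counter(card_value(c) for c in range(1, 53))
print(sorted(counts.items()))
# [(1, 4), (2, 4), (3, 4), (4, 4), (5, 4), (6, 4), (7, 4), (8, 4), (9, 4), (10, 16)]
```

Each value from 1 to 9 is drawn with probability 4/52 = 1/13, and the value 10 with probability 16/52 = 4/13.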

In [3]:
def dealer_action(cards):
    # The dealer hits until the sum reaches 17 or more
    dealer_sum = 0
    while dealer_sum < 17:
        dealer_sum += get_card(cards)

    # A bust is scored as a sum of 0
    if dealer_sum > 21:
        dealer_sum = 0

    return dealer_sum

In [10]:
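def me_action(cards):
    # Maps the state index (sum - 12) to the action taken in that state
    episode = {}

    me_sum = 0
    stop = False
    while not stop:
        me_sum += get_card(cards)
        s = me_sum
        a = 1 # 0: stick, 1: hit

        # Fixed threshold policy: stick at 18 or more
        if me_sum >= 18:
            stop = True
            a = 0

        # Sums outside 12~21 give indices outside 0~9, which one_episode ignores
        episode[s-12] = a

    # A bust is scored as a sum of 0
    if me_sum > 21:
        me_sum = 0

    return me_sum, episode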

In [11]:
def one_episode(states, actions, cards):

    # Me and the dealer take actions independently

    # Dealer action
    dealer_sum = dealer_action(cards)

    # My action
    me_sum, episode = me_action(cards)

    # Compute reward: +1 win, -1 loss, 0 draw
    if me_sum > dealer_sum:
        r = 1
    elif me_sum < dealer_sum:
        r = -1
    else:
        r = 0

    # Per-episode return sums and visit counts for each (state, action) pair
    Returns = np.zeros((n_s, n_a))
    Counts = np.zeros((n_s, n_a))

    # Every visited pair receives the final reward as its return
    for s in range(n_s):
        if s in episode:
            a = episode[s]
            Returns[s][a] += r
            Counts[s][a] += 1

    return Returns, Counts

In [12]:
n_episode = 10000

# Accumulate return sums in Q and visit counts in total_counts
Q = np.zeros((n_s, n_a))
total_counts = np.zeros((n_s, n_a))

for _ in range(n_episode):
    Returns, Counts = one_episode(states, actions, cards)
    Q += Returns
    total_counts += Counts

# Turn the sums into averages; pairs that were never visited keep Q = 0
for s in range(n_s):
    for a in range(n_a):
        if total_counts[s][a] > 0:
            Q[s][a] /= total_counts[s][a]

# Greedy policy with respect to Q
for s in range(n_s):
    pi[s] = np.argmax(Q[s])
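The averaging loop above could also be written as one vectorized call. The sketch below uses small made-up arrays (`sums` and `counts` are illustrative names, not the notebook's data) and relies on the `out` and `where` arguments of `np.divide` to leave unvisited entries at 0.

```python
import numpy as np

# Illustrative return sums and visit counts for two states and two actions.
sums = np.array([[2.0, 0.0], [0.0, -3.0]])
counts = np.array([[4.0, 0.0], [0.0, 6.0]])

# Divide only where a pair was visited; other entries keep the value from `out`.
avg = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
print(avg)
```

Entries with a zero count are never divided, so no division-by-zero warning is raised.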

In [13]:
Q

array([[ 0.        , -0.16746032],
       [ 0.        , -0.22720126],
       [ 0.        , -0.25681818],
       [ 0.        , -0.24518888],
       [ 0.        , -0.31121751],
       [ 0.        , -0.34687299],
       [ 0.08053691,  0.        ],
       [ 0.30191354,  0.        ],
       [ 0.63743316,  0.        ],
       [ 0.92460317,  0.        ]])

In [14]:
pi

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [15]:
total_counts

array([[   0., 1260.],
       [   0., 1272.],
       [   0., 1320.],
       [   0., 1403.],
       [   0., 1462.],
       [   0., 1551.],
       [1490.,    0.],
       [1411.,    0.],
       [1870.,    0.],
       [1008.,    0.]])
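Every entry of the printed `pi` is 0 (stick). According to `total_counts`, stick was never tried for sums 12 to 17, so those `Q` entries are still the initial 0, which beats the negative hit estimates in the argmax. One possible way to make this visible is to mask unvisited pairs before taking the argmax. The sketch below copies the printed values by hand; the names `Q_printed`, `counts_printed`, and `masked_pi` are illustrative.

```python
import numpy as np

# Values copied from the printed Q and total_counts above.
Q_printed = np.array([
    [0.0, -0.16746032], [0.0, -0.22720126], [0.0, -0.25681818],
    [0.0, -0.24518888], [0.0, -0.31121751], [0.0, -0.34687299],
    [0.08053691, 0.0], [0.30191354, 0.0], [0.63743316, 0.0],
    [0.92460317, 0.0],
])
counts_printed = np.array([
    [0, 1260], [0, 1272], [0, 1320], [0, 1403], [0, 1462], [0, 1551],
    [1490, 0], [1411, 0], [1870, 0], [1008, 0],
])

# Treat never-visited pairs as -inf so they cannot win the argmax.
masked_Q = np.where(counts_printed > 0, Q_printed, -np.inf)
masked_pi = masked_Q.argmax(axis=1)
print(masked_pi)  # [1 1 1 1 1 1 0 0 0 0]
```

The masked result is just the fixed threshold policy that generated the data (hit below 18, stick from 18). This suggests that both actions need to be sampled in every state, which is the role of exploring starts, before the greedy step says anything about $\pi_{*}$.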