# 1. Describe the environment in the Nim learning model.

The environment of the learning model is the 'board': a list of stacks, each holding some number of objects (3 stacks of at most 10 items each in this case).
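
As a rough illustration of that board, the sketch below represents the state as a plain list of pile sizes. The helper names `new_board`, `apply_move` and `is_terminal` are hypothetical and do not appear in the notebook; only `ITEMS_MX` and the 3-pile layout are taken from it.

```python
from random import randint

ITEMS_MX = 10  # max items per pile, as in the notebook
N_PILES = 3    # assumed constant; the notebook hard-codes 3 piles

def new_board():
    """Random starting position with 1..ITEMS_MX items in each pile."""
    return [randint(1, ITEMS_MX) for _ in range(N_PILES)]

def apply_move(board, move, pile):
    """Return a new board after taking `move` items from `pile`."""
    if not 1 <= move <= board[pile]:
        raise ValueError("illegal move")
    nxt = list(board)
    nxt[pile] -= move
    return nxt

def is_terminal(board):
    """The game is over once every pile is empty."""
    return all(n == 0 for n in board)

print(apply_move([3, 4, 5], 2, 1))  # [3, 2, 5]
```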

# 2. Describe the agent in the Nim learning model.

Our agent is a Q-learning agent. It uses a table (*qtable*) of expected payoffs to make "educated" guesses about which move to make: for the current state of the environment, it looks up the index (*a*) of the action with the highest expected payoff. That index encodes both parts of the move. The number of items pulled (*move*) is *a* modulo the maximum number of items in a stack, plus one. The stack pulled from (*pile*) is the integer division of *a* by the maximum number of items in a stack.

To be honest, I'm a little confused as to why/how the stack selection works, as it seems more or less random which stack the agent tries to pull objects from (it is driven by the reward table rather than truly random, but still).

If this procedure produces an illegal move, the agent instead makes a random legal move.
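
The index arithmetic can be checked in isolation. The sketch below wraps the notebook's two formulas in hypothetical helpers, `decode_action` and `encode_action` (these names are not in the notebook). It suggests that each block of 10 consecutive indices belongs to one pile, so the pile is decided by wherever the largest table entry happens to sit.

```python
ITEMS_MX = 10  # max items per pile, as in the notebook

def decode_action(a):
    """Map a flat action index (0..29) to (move, pile), as default_nim_qlearner does."""
    return a % ITEMS_MX + 1, a // ITEMS_MX

def encode_action(move, pile):
    """Inverse mapping, the same formula qtable_update uses."""
    return pile * ITEMS_MX + move - 1

print(decode_action(0))   # (1, 0): take 1 item from pile 0
print(decode_action(17))  # (8, 1): take 8 items from pile 1
print(decode_action(29))  # (10, 2): take 10 items from pile 2
```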

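# 3. Describe the reward and penalty in the Nim learning model.

In the provided code, the only reward is *Reward* (100), credited to the move that takes the last object and so wins the game. Every other move gets an immediate reward of 0 and only inherits value from the best entry of the following state, discounted by *Gamma* (0.8). There is no explicit penalty: losing moves are never given a negative value.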

# 4. How many possible states could there be in the Nim game with a maximum of 10 items per pile and 3 piles total?

The number of states is the product of the number of possible states of each stack. The number of possible states of a stack is its maximum occupancy plus one (for the empty state).

So in this case, the number of possible states for 3 stacks with maximum occupancies $st_1$, $st_2$, $st_3$ of $10$ items each is $(st_1 + 1)(st_2 + 1)(st_3 + 1) = 11^3 = 1331$.
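
As a quick check of that count, the sketch below enumerates every combination of pile sizes. `N_PILES` is an assumed name (the notebook hard-codes 3), and the enumeration includes the terminal state `(0, 0, 0)`.

```python
from itertools import product

ITEMS_MX = 10  # max items per pile, as in the notebook
N_PILES = 3    # assumed constant for the number of piles

# Each pile can hold 0..ITEMS_MX items, so enumerate all combinations.
states = list(product(range(ITEMS_MX + 1), repeat=N_PILES))
print(len(states))  # 1331
```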

# 5. How many possible actions could there be in the Nim game with 10 items per pile and 3 piles total?

The only action an agent may take is to take objects from a stack. The number of possible actions for each stack is then just the number of items in that stack. Therefore, the total number of actions available to the agent is the sum of the stack occupancies. In this case, with 3 stacks of 10 items each, we have $st_1 + st_2 + st_3 = 3 \times 10 = 30$ possible actions.
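
The same count can be reproduced by listing the legal moves for a given board. `legal_actions` is a hypothetical helper, not part of the notebook. It also shows that the number of available actions shrinks as piles empty.

```python
def legal_actions(state):
    """All (move, pile) pairs that take at least one item from a non-empty pile."""
    return [(move, pile)
            for pile, n in enumerate(state)
            for move in range(1, n + 1)]

print(len(legal_actions([10, 10, 10])))  # 30
print(len(legal_actions([3, 0, 5])))     # 8
```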

# 6. Find a way to improve the provided Nim game learning model

Code taken from module 9 notebook.

In [1]:
import numpy as np
from random import randint, choice

# The number of piles is 3

# max number of items per pile
ITEMS_MX = 10

# Initialize starting position
def init_game():
    return [randint(1,ITEMS_MX), randint(1,ITEMS_MX), randint(1,ITEMS_MX)]

# Based on X-oring the item counts in piles - mathematical solution
def nim_guru(st):
    xored = st[0] ^ st[1] ^ st[2]
    if xored == 0:
        return nim_random(st)
    #
    for pile in range(3):
        s = st[pile] ^ xored
        if s <= st[pile]:
            return st[pile]-s, pile

# Random Nim player
def nim_random(_st):
    pile = choice([i for i in range(3) if _st[i]>0])  # find the non-empty piles
    return randint(1, _st[pile]), pile  # random move

In [2]:
def default_nim_qlearner(_st):
    # pick the best rewarding move, equation 1
    a = np.argmax(qtable[_st[0], _st[1], _st[2]])  # exploitation
    # index is based on move, pile
    move, pile = a%ITEMS_MX+1, a//ITEMS_MX
    
    # check if qtable has generated a random but game illegal move - we have not explored there yet
    if move <= 0 or _st[pile] < move:
        move, pile = nim_random(_st)  # exploration
    #
    return move, pile  # action

In [3]:
Engines = {'Random':nim_random, 'Guru':nim_guru, 'DQlearner':default_nim_qlearner}

def game(a, b):
    state, side = init_game(), 'A'
    while True:
        engine = Engines[a] if side == 'A' else Engines[b]
        move, pile = engine(state)
        # print(state, move, pile)  # debug purposes
        state[pile] -= move
        if state == [0, 0, 0]:  # game ends
            return side  # winning side
        #
        side = 'B' if side == 'A' else 'A'  # switch sides

def play_games(_n, a, b):
    from collections import defaultdict
    wins = defaultdict(int)
    for i in range(_n):
        wins[game(a, b)] += 1
    # info
    print(f"{_n} games, {a:>8s} {wins['A']}   {b:>8s} {wins['B']}")
    #
    return wins['A'], wins['B']

In [4]:
qtable, Alpha, Gamma, Reward = None, 1.0, 0.8, 100.0

# learn from _n games, randomly played to explore the possible states
def nim_qlearn(_n):
    global qtable
    # based on max items per pile
    qtable = np.zeros((ITEMS_MX+1, ITEMS_MX+1, ITEMS_MX+1, ITEMS_MX*3), dtype=float)
    # play _n games
    for i in range(_n):
        # first state is starting position
        st1 = init_game()
        while True:  # while game not finished
            # make a random move - exploration
            move, pile = nim_random(st1)
            st2 = list(st1)
            # make the move
            st2[pile] -= move  # --> last move I made
            if st2 == [0, 0, 0]:  # game ends
                qtable_update(Reward, st1, move, pile, 0)  # I won
                break  # new game
            #
            qtable_update(0, st1, move, pile, np.max(qtable[st2[0], st2[1], st2[2]]))
            st1 = st2

# Equation 3 - update the qtable
def qtable_update(r, _st1, move, pile, q_future_best):
    a = pile*ITEMS_MX+move-1
    qtable[_st1[0], _st1[1], _st1[2], a] = Alpha * (r + Gamma * q_future_best)
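
`qtable_update` overwrites the stored value outright. With `Alpha = 1.0` that matches the usual tabular Q-learning update, which in general blends the old estimate with the new target. A minimal sketch of that general form follows, assuming a learning rate below 1. The value 0.5 and the name `q_update_blended` are illustrative and not from the notebook.

```python
import numpy as np

ITEMS_MX = 10
ALPHA, GAMMA = 0.5, 0.8  # ALPHA < 1 chosen only for illustration

# Same table shape as the notebook: one axis per pile, plus the flat action index.
q = np.zeros((ITEMS_MX + 1,) * 3 + (ITEMS_MX * 3,))

def q_update_blended(q, st, move, pile, r, q_future_best, alpha=ALPHA, gamma=GAMMA):
    """Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    a = pile * ITEMS_MX + move - 1
    idx = (st[0], st[1], st[2], a)
    q[idx] = (1 - alpha) * q[idx] + alpha * (r + gamma * q_future_best)
    return q[idx]

print(q_update_blended(q, [1, 0, 0], 1, 0, 100.0, 0.0))  # 50.0
print(q_update_blended(q, [1, 0, 0], 1, 0, 100.0, 0.0))  # 75.0
```

Repeated visits move the entry toward the target instead of replacing it, which matters mostly when rewards or transitions are noisy.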

In [5]:
%%time
nim_qlearn(100)

Wall time: 6.96 ms


In [6]:
%%time
# See the training size effect
n_train = (3, 10, 100, 1000, 10000, 50000, 100000)
wins = []
for n in n_train:
    nim_qlearn(n)
    a, b = play_games(1000, 'DQlearner', 'Random')
    wins += [a/(a+b)]

1000 games, DQlearner 586     Random 414
1000 games, DQlearner 592     Random 408
1000 games, DQlearner 649     Random 351
1000 games, DQlearner 701     Random 299
1000 games, DQlearner 730     Random 270
1000 games, DQlearner 716     Random 284
1000 games, DQlearner 696     Random 304
Wall time: 8.56 s


In [7]:
# Function to print the entire set of states
def qtable_log(_fn):
    with open(_fn, 'w') as fout:
        s = 'state'
        for a in range(ITEMS_MX*3):
            move, pile = a%ITEMS_MX+1, a//ITEMS_MX
            s += ',%02d_%01d' % (move,pile)
        #
        print(s, file=fout)
        for i, j, k in [(i,j,k) for i in range(ITEMS_MX+1) for j in range(ITEMS_MX+1) for k in range(ITEMS_MX+1)]:
            s = '%02d_%02d_%02d' % (i,j,k)
            for a in range(ITEMS_MX*3):
                r = qtable[i, j, k, a]
                s += ',%.1f' % r
            #
            print(s, file=fout)
#
qtable_log('qtable_debug.txt')

### My Agent

In [8]:
def my_nim_qlearner(_st):
    # Placeholder: the improved agent has not been written yet
    raise NotImplementedError("my_nim_qlearner is not implemented yet")
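
One possible direction, offered as a sketch and not necessarily what the assignment intends, starts from the fact that the state after a move belongs to the opponent. The training loop in `nim_qlearn` credits a move with the best value of that next state, which is the opponent's best outcome. The variant below negates that value (a negamax-style target), restricts the maximum to legal moves, and picks only legal moves at play time. The names `legal_mask`, `train_negamax` and `my_nim_qlearner_sketch` are hypothetical, and the play function takes the table as an argument, unlike the notebook's engines.

```python
import numpy as np
from random import randint, choice

GAMMA, REWARD = 0.8, 100.0  # same values as the notebook

def legal_mask(st, items_mx):
    """Boolean mask over the flat action index: True where the move is legal."""
    mask = np.zeros(items_mx * 3, dtype=bool)
    for pile in range(3):
        mask[pile * items_mx: pile * items_mx + st[pile]] = True
    return mask

def train_negamax(n_games, items_mx=10):
    """Random-play training where the successor value is negated (opponent moves next)."""
    q = np.zeros((items_mx + 1,) * 3 + (items_mx * 3,))
    for _ in range(n_games):
        st = [randint(1, items_mx) for _ in range(3)]
        while True:
            pile = choice([i for i in range(3) if st[i] > 0])
            move = randint(1, st[pile])
            nxt = list(st)
            nxt[pile] -= move
            a = pile * items_mx + move - 1
            if nxt == [0, 0, 0]:
                q[st[0], st[1], st[2], a] = REWARD  # the mover wins
                break
            # Best the opponent can do from nxt, over legal moves only
            opp_best = q[nxt[0], nxt[1], nxt[2]][legal_mask(nxt, items_mx)].max()
            q[st[0], st[1], st[2], a] = -GAMMA * opp_best
            st = nxt
    return q

def my_nim_qlearner_sketch(q, st, items_mx=10):
    """Greedy (move, pile) among legal actions; st must have a non-empty pile."""
    values = np.where(legal_mask(st, items_mx), q[st[0], st[1], st[2]], -np.inf)
    a = int(np.argmax(values))
    return a % items_mx + 1, a // items_mx
```

To plug this into `Engines`, the trained table could be bound first, for example with a small wrapper that calls `my_nim_qlearner_sketch(q, _st)`. How much it helps against `Random` or `Guru` would need to be measured with `play_games`.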