In [2]:
%load_ext autoreload
%autoreload 2

# Bellman Equation Q Table Application to Number Guessing Game

This notebook uses the Bellman equation to train a Q table on a number guessing game and then tests the result.
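For reference, the tabular Q-learning update that follows from the Bellman equation can be sketched as below. This is only an illustration, not the `update_q_table` function imported later from `q_learning_utils` (its source is not shown in this notebook). The name `q_update_sketch` and the `alpha` and `gamma` defaults are assumptions.

```python
import numpy as np

def q_update_sketch(q_table, state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (hypothetical helper, updates in place)."""
    # Best value reachable from the next state; zero once the episode ends.
    best_next = 0.0 if done else np.max(q_table[next_state])
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table

q = np.zeros((5, 5))
q_update_sketch(q, state=0, action=1, reward=-1, next_state=0, done=False)
print(q[0, 1])  # -0.1
```

With an all-zero table, a single step with reward -1 moves the entry by `alpha * reward`, which is why the printed value is -0.1.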

The game I (try to) apply it to is this: the numbers 0-4 are shuffled and hidden (i.e. no repeats). As the player, you guess numbers (in the range 0-4) until you get the first number in the hidden sequence. Then you move on to the second number and guess until you get a match. Continue until you've guessed every number in the hidden sequence.

This is a dumb game. Obviously, the only strategies are (1) at a given index, don't repeat guesses, and (2) don't guess numbers that you already got correct earlier in the sequence.
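To make the two rules concrete, here is a small hand-coded player that follows both of them, run over every possible hidden sequence. It does not use the `NumberGuess` environment; the function name `guesses_with_memory` and the fixed lowest-number-first guess order are assumptions made for illustration.

```python
from itertools import permutations

def guesses_with_memory(answer):
    """Count guesses for a player that follows rules (1) and (2)."""
    found = set()  # numbers already matched earlier in the sequence
    n_guesses = 0
    for target in answer:
        # Rule (2): skip numbers already found. Rule (1): each candidate is
        # tried at most once per index because the loop never repeats a value.
        for guess in range(len(answer)):
            if guess in found:
                continue
            n_guesses += 1
            if guess == target:
                found.add(guess)
                break
    return n_guesses

counts = [guesses_with_memory(p) for p in permutations(range(5))]
print(min(counts), max(counts), sum(counts) / len(counts))  # 5 15 10.0
```

So a player with memory needs between 5 and 15 guesses, and 10 on average, which gives a target to compare any learned policy against.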

As you'll see, my first attempt fails completely. The issue is that the state and observation space are useless: both are just the current index in the answer sequence. My initial thought was that one only needs the memory of previous guesses to "beat" this game. Fair enough, but I haven't set the problem up to use that information. I need a different algorithm, more Q-table dimensions, or a way to give the "AI" a memory.

The lesson from this first failure is that what you provide as the observation/state must be a conscious choice. A thoughtless choice will probably not be adequate.

That said, in real-world applications you probably won't be throttling the observation space at all. You want to include anything that could be useful and doesn't increase your training time too badly.

### Explore NumberGuess Environment

In [3]:
from number_guess_environment import NumberGuess

In [4]:
env = NumberGuess()

print("Random valid answer number:", env.observation_space.sample())
print("Random valid guess number:", env.action_space.sample())


print("\nA few random games:")
episodes = 10
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0
    n_guesses = 0
    while not done:
        n_guesses += 1
        # Guess uniformly at random, ignoring everything seen so far
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score += reward
    print(f'Episode:{episode} Score:{score} NGuesses:{n_guesses}')

Random valid answer number: 3
Random valid guess number: 0

A few random games:
Episode:1 Score:-10 NGuesses:20
Episode:2 Score:-20 NGuesses:30
Episode:3 Score:-49 NGuesses:59
Episode:4 Score:-15 NGuesses:25
Episode:5 Score:-10 NGuesses:20
Episode:6 Score:-3 NGuesses:13
Episode:7 Score:-7 NGuesses:17
Episode:8 Score:-27 NGuesses:37
Episode:9 Score:-23 NGuesses:33
Episode:10 Score:-9 NGuesses:19
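A quick baseline check on these numbers: in all ten printed episodes, Score equals 10 - NGuesses. Assuming that relation holds in general (it is inferred from this output only, not from the environment's source), the expected score of a purely random player can be worked out directly.

```python
from fractions import Fraction

N_NUMBERS = 5

# (score, n_guesses) pairs copied from the output above.
printed = [(-10, 20), (-20, 30), (-49, 59), (-15, 25), (-10, 20),
           (-3, 13), (-7, 17), (-27, 37), (-23, 33), (-9, 19)]
relation_holds = all(score == 10 - n for score, n in printed)

# Uniform guessing with repeats: each guess hits with probability 1/5,
# so the number of guesses per position is geometric with mean 5.
expected_guesses = N_NUMBERS * (1 / Fraction(1, N_NUMBERS))
# Assumes the Score = 10 - NGuesses relation generalizes beyond these episodes.
expected_score = 10 - expected_guesses

print(relation_holds, expected_guesses, expected_score)  # True 25 -15
```

Random play should therefore average about 25 guesses and a score of about -15 per game.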


In [5]:
import numpy as np

### Update and training/testing functions

In [6]:
from q_learning_utils import train_test, default_params, update_q_table

### Train and Test

In [7]:
# initial q table full of zeroes
init_q_table = np.zeros([env.observation_space.n, env.action_space.n])

In [8]:
# Train
env = NumberGuess(False)
q_table, avg_reward = train_test(env, init_q_table, n_episodes = 100, do_train = True)
print(f"average reward: {avg_reward}")

average reward: -15.42


In [9]:
print(q_table)

[[-19.52775925 -19.2651705  -19.5375084  -18.72958765 -19.51862032]
 [-20.52425101 -22.15592013 -22.24453604 -21.41609695 -21.25548207]
 [-22.43160358 -22.41701668 -22.26308656 -22.16992186 -22.33196317]
 [-23.25173778 -24.17000836 -24.23712759 -24.25238603 -23.3517368 ]
 [-26.04345078 -26.07544955 -25.16789394 -26.07537687 -26.16682292]]


In [10]:
# Test: run three independent evaluation batches with the trained table
for _ in range(3):
    avg_reward = train_test(env, q_table, n_episodes=100, do_train=False)[1]
    print(f"average reward: {avg_reward}")

average reward: -35.79
average reward: -34.59
average reward: -34.25


Fail! In hindsight, of course, the naive Bellman equation + Q table can't get this right. Because the numbers are random, all it learns is that no number is a good guess at any index. What it *needs* is memory of its previous guesses and knowledge of the previous correct answers.
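Reading the greedy action out of each row of the trained table makes the problem visible. The sketch below copies the printed Q table and takes the argmax per state. How `train_test` actually selects actions during testing is not shown here, so treat this as an illustration of what an index-only greedy policy looks like, not as a description of that function.

```python
import numpy as np

# Q table copied from the printed output above.
q_table = np.array([
    [-19.52775925, -19.2651705, -19.5375084, -18.72958765, -19.51862032],
    [-20.52425101, -22.15592013, -22.24453604, -21.41609695, -21.25548207],
    [-22.43160358, -22.41701668, -22.26308656, -22.16992186, -22.33196317],
    [-23.25173778, -24.17000836, -24.23712759, -24.25238603, -23.3517368],
    [-26.04345078, -26.07544955, -25.16789394, -26.07537687, -26.16682292],
])

# One fixed guess per index: the state carries no other information.
greedy = np.argmax(q_table, axis=1)
print(greedy.tolist())                  # [3, 0, 3, 0, 2]
print(len(set(greedy.tolist())) == 5)   # False
```

The greedy guesses repeat 3 and 0, so they are not even a permutation of 0-4 and cannot match a shuffled answer at every index. Within each row the values also differ by less than two, which fits the point that no guess stands out.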

## Small improvement
In this case, a simple and dumb solution to our problem is to expand the observation space/state. Brainstorming here...

The state could encode:

* knowledge of answer sequence, e.g. [2,4,-1,-1,-1], dim: (5+1)! at most
* guesses in this iteration: [0,0,1,1,0] or some bit rep? 2*5

So this state space dimensionality is: 5! * 10 = 1200
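As a sanity check on the estimate above, here is a brute-force count under one possible encoding: the known part of the answer is an ordered prefix of distinct numbers (like `[2,4,-1,-1,-1]`), and the guesses at the current index are a 5-bit mask. Both choices are assumptions, and the product is only a loose upper bound, since some combinations can never occur (for example, once all five numbers are known there is nothing left to guess).

```python
from math import perm

N = 5

# Ordered prefixes of distinct numbers with length 0..5: sum of P(5, k).
n_prefixes = sum(perm(N, k) for k in range(N + 1))

# One bit per number: has it been guessed at the current index or not?
n_guess_masks = 2 ** N

upper_bound = n_prefixes * n_guess_masks
print(n_prefixes, n_guess_masks, upper_bound)  # 326 32 10432
```

Under this encoding the table has on the order of ten thousand states, still small enough for a plain Q table.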

---------------------------------

OK, not all of this thinking was right, but anyway, I continue in a new notebook, number_guess2, with a new environment, NumberGuess2.