# Simple Reinforcement Learning with Tensorflow: Part 0 - Q-Tables




In this IPython notebook, we implement a Q-table algorithm that solves the FrozenLake problem. To learn more, read the accompanying article: https://medium.com/@awjuliani/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

For more reinforcement learning tutorials, see: https://github.com/awjuliani/DeepRL-Agents


### Step 0: Import the libraries (OpenAI Gym and NumPy)

In [1]:
import gym
import numpy as np

### Step 1: Load the environment

In [2]:
env = gym.make("FrozenLake-v0")

[2018-01-09 14:45:58,399] Making new env: FrozenLake-v0
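
The cells in this notebook use the older Gym interface, where `env.reset()` returns the observation alone and `env.step()` returns four values. Newer releases (Gym 0.26 and later, and Gymnasium) return `(observation, info)` from `reset()` and five values from `step()`. The helpers below are a minimal sketch for handling both shapes; the names `unpack_reset` and `unpack_step` are hypothetical and not part of any library.

```python
def unpack_reset(result):
    """Return only the observation from an env.reset() result (old or new API)."""
    # Newer API: a (observation, info) tuple where info is a dict.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    # Older API: the bare observation.
    return result


def unpack_step(result):
    """Return (observation, reward, done, info) from an env.step() result."""
    if len(result) == 5:
        # Newer API splits the end-of-episode flag into terminated and truncated.
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result
    return obs, reward, done, info


# Results shaped like the two API generations (values are made up).
print(unpack_reset(0), unpack_reset((0, {})))
print(unpack_step((4, 0.0, False, {})))
print(unpack_step((15, 1.0, True, False, {})))
```

With these in place, `state = unpack_reset(env.reset())` and `new_state, reward, done, info = unpack_step(env.step(action))` would let the rest of the notebook stay unchanged, assuming the environment itself is still available under the name being used.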


#### The rules of our game
<ul>
    <li> We are in a grid world </li>
    <li> The agent must reach the goal </li>
        <ul>
            <li> Some tiles are frozen (= safe) </li>
            <li> Others are holes (= dangerous) </li>
        </ul>
    <li> The wind occasionally blows the agent in an uncertain direction </li>
    <li> The agent receives a reward of 1 for reaching the goal tile and 0 for every other step </li>
</ul>
<img src="frozen_lake.png"/>
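
To make the grid concrete, here is a small sketch that maps state numbers to tiles. It assumes the 4x4 layout printed by `env.render()` later in this notebook and row-major state numbering (state = row * 4 + column). `LAKE_MAP`, `state_to_cell`, and `tile_at` are illustrative names, not part of Gym.

```python
# Layout as printed by env.render(): S = start, F = frozen, H = hole, G = goal.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def state_to_cell(state, n_cols=4):
    """Convert a state index into a (row, column) pair, assuming row-major order."""
    return divmod(state, n_cols)


def tile_at(state):
    """Return the tile letter for a state index."""
    row, col = state_to_cell(state)
    return LAKE_MAP[row][col]


holes = [s for s in range(16) if tile_at(s) == "H"]
goals = [s for s in range(16) if tile_at(s) == "G"]
print("holes:", holes)  # holes: [5, 7, 11, 12]
print("goal:", goals)   # goal: [15]
```

Holes and the goal end the episode, so no action is ever taken from them and their rows in the Q-table stay at zero.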


### Step 2: Set the hyperparameters

In [4]:
total_episodes = 2000 # Number of training episodes
max_steps = 99 # Max steps per episode
learning_rate = 0.8 # Step size of each Q-value update
gamma = 0.95 # Discount rate
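
To get a feel for the discount rate, the sketch below computes how much a reward of 1 is worth when it arrives a given number of steps in the future. The helper name `discounted_value` is ours, and the step counts are only illustrative (6 is the length of the shortest path from start to goal on the 4x4 grid).

```python
def discounted_value(reward, steps_away, gamma=0.95):
    """Present value of a reward received `steps_away` steps in the future."""
    return reward * gamma ** steps_away


for steps in (0, 6, 20, 99):
    print(steps, round(discounted_value(1.0, steps), 4))
```

With `gamma = 0.95`, the goal reward still carries most of its value over a short path but shrinks steadily on long detours, which nudges the agent toward shorter routes.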

### Step 3: Build our Q-table

<ul>
    <li> 16x4 Q-table </li>
    <li> 16 possible states (one for each tile) </li>
    <li> 4 possible actions </li>
</ul>

In [5]:
# Initialize the Q-table (16 states x 4 actions) with zeros
qtable = np.zeros([env.observation_space.n, env.action_space.n])
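
As a quick illustration of how the table is read, the sketch below builds a stand-alone 16x4 array (so it runs without Gym) and looks up the best action for a state. The array name `qtable_demo` and the value written into it are made up for the example.

```python
import numpy as np

n_states, n_actions = 16, 4
qtable_demo = np.zeros((n_states, n_actions))  # rows = states, columns = actions

# Hypothetical learned value: action 2 looks good in state 14.
qtable_demo[14, 2] = 0.9

best_action = int(np.argmax(qtable_demo[14]))  # column with the highest value
best_value = float(np.max(qtable_demo[14]))

# In an all-zero row every action ties, and argmax returns the first index.
tie_action = int(np.argmax(qtable_demo[0]))

print(best_action, best_value, tie_action)  # 2 0.9 0
```

The tie-breaking behaviour is one reason the training loop adds noise to the Q-values: without it, an untrained agent would always pick action 0.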

### Step 4: Implement Q-Table learning algorithm

<img src="q-learning-diagram.png">
<img src="q-learning.png">
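
Written out, the update that this step applies after every action is the following, where $\alpha$ is `learning_rate`, $\gamma$ is `gamma`, $s$ and $a$ are the current state and action, $r$ is the reward, and $s'$ is `new_state`:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The bracketed term is the gap between the new estimate $r + \gamma \max_{a'} Q(s', a')$ and the current value $Q(s, a)$; the learning rate controls how much of that gap is closed on each update.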

In [6]:
# Create a list to hold the total reward of each episode
rewardList = []

# 2. Run the training episodes
for episode in range(total_episodes):
    state = env.reset() # Restart our game from the beginning
    done = False
    rewardAll = 0

    for step in range(max_steps):
        # 3. Choose an action greedily from the Q-table, with noise that shrinks as episodes go by
        action = np.argmax(qtable[state, :] + np.random.randn(1, env.action_space.n) * (1. / (episode + 1)))

        # 4. Perform the action and get the new state and reward
        new_state, reward, done, info = env.step(action)

        # 5. Update the Q-table (Bellman equation)
        # qtable[new_state, :] : values of all the actions we can take from the new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])

        rewardAll += reward

        # The new state becomes the current state
        state = new_state

        # If done: finish the episode
        if done:
            break
    rewardList.append(rewardAll)
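
To see the two key lines of the loop in isolation, here is a small sketch with one function for the update and one for the noisy action choice. The function names are ours, and the example uses `np.random.default_rng` with a fixed seed for reproducibility, whereas the cell above uses `np.random.randn`.

```python
import numpy as np


def q_update(q_sa, reward, max_q_next, learning_rate=0.8, gamma=0.95):
    """Return the new Q(s, a) after one observed transition."""
    td_target = reward + gamma * max_q_next
    return q_sa + learning_rate * (td_target - q_sa)


def noisy_greedy_action(q_row, episode, rng):
    """Pick the argmax of the Q-values plus Gaussian noise scaled by 1/(episode+1)."""
    noise = rng.standard_normal(len(q_row)) / (episode + 1)
    return int(np.argmax(np.asarray(q_row) + noise))


# Stepping onto the goal from a state whose value is still 0.
first_success = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)

# A step with no reward into a state whose best value is 1.0.
propagated = q_update(q_sa=0.5, reward=0.0, max_q_next=1.0)

# Late in training the noise is tiny, so the choice is effectively greedy.
rng = np.random.default_rng(0)
late_action = noisy_greedy_action([0.0, 0.0, 1.0, 0.0], episode=10**6, rng=rng)

print(round(first_success, 4), round(propagated, 4), late_action)  # 0.8 0.86 2
```

The second call shows how value flows backward: a state that never receives a reward itself still gains value because its successor is valuable.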

### Step 5: Output the score

In [7]:
print("Score over time: " + str(sum(rewardList) / total_episodes))
print(qtable)

Score over time: 0.425
[[  1.58597959e-01   4.57806789e-03   4.26348990e-03   2.49390371e-03]
 [  6.46068360e-04   8.55208620e-04   6.06632965e-05   2.30436510e-01]
 [  5.77317238e-04   1.33570654e-01   3.48207649e-04   1.03243818e-03]
 [  1.43559139e-04   1.84821849e-04   2.09462340e-04   9.18084659e-02]
 [  2.01349198e-01   4.93233857e-04   2.60805454e-03   3.56516733e-04]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  5.64277002e-04   3.80742751e-06   1.15061946e-01   5.64540614e-07]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  8.41745850e-04   2.99277942e-04   3.69974640e-04   3.83949434e-01]
 [  3.40682342e-03   3.90227934e-01   0.00000000e+00   0.00000000e+00]
 [  6.27151568e-01   6.90903319e-04   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   4.79265833e-01  
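
The score above averages over every training episode, including the early ones where the agent acts almost randomly. A windowed average shows how performance changes over time. The sketch below is one way to compute it; `windowed_success_rate` is a hypothetical helper and the reward history is made up for the example.

```python
def windowed_success_rate(rewards, window=100):
    """Average reward in consecutive, non-overlapping windows of episodes."""
    rates = []
    for start in range(0, len(rewards), window):
        chunk = rewards[start:start + window]
        rates.append(sum(chunk) / len(chunk))
    return rates


# Made-up history: 100 failed episodes, then success every other episode.
toy_rewards = [0.0] * 100 + [1.0, 0.0] * 50
print(windowed_success_rate(toy_rewards))  # [0.0, 0.5]
```

Calling `windowed_success_rate(rewardList)` after training would give one value per 100 episodes.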

### Step 6: Use our Q-table to play FrozenLake!
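
Before watching the agent play, it can help to read the greedy policy straight off the table. The sketch below does this for the first four states, using values rounded from the Q-table printed in Step 5. The action order (0 = Left, 1 = Down, 2 = Right, 3 = Up) is FrozenLake's convention; `ACTION_NAMES`, `greedy_policy`, and `top_row_q` are names chosen for this example.

```python
import numpy as np

ACTION_NAMES = ["Left", "Down", "Right", "Up"]


def greedy_policy(qtable):
    """Return the index of the highest-valued action for every state (row)."""
    return np.argmax(qtable, axis=1)


# Q-values for states 0-3 (top row of the grid), rounded from the printed table.
top_row_q = np.array([
    [1.586e-01, 4.578e-03, 4.263e-03, 2.494e-03],
    [6.461e-04, 8.552e-04, 6.066e-05, 2.304e-01],
    [5.773e-04, 1.336e-01, 3.482e-04, 1.032e-03],
    [1.436e-04, 1.848e-04, 2.095e-04, 9.181e-02],
])

names = [ACTION_NAMES[a] for a in greedy_policy(top_row_q)]
print(names)  # ['Left', 'Up', 'Down', 'Up']
```

The choice of Left in the start state matches the action shown in the rendered frames of the cell that follows.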

In [10]:
env.reset()

for episode in range(200):
    state = env.reset()
    done = False

    for step in range(max_steps):
        env.render()
        # Take the action (index) that has the maximum expected future reward given that state
        action = np.argmax(qtable[state,:])
        
        new_state, reward, done, info = env.step(action)
    
        if done:
            break
        else:
            state = new_state
env.close()
    


[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
[41mF[0mFFH
HFFG
  (Up)
SFFF
FHFH
F[41mF[0mFH
HFFG
  (Down)
SFFF
FHFH
FF[41mF[0mH
HFFG
  (Left)
SFFF
FH[41mF[0mH
FFFH
HFFG

[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
  (Left)
[41mS[0mFFF
FHFH
FFFH
H