# Introduction
This example implements a tabular Q-learning agent for FrozenLake in gym.

The final score is around 0.7 to 0.8; the threshold for solving the environment is 0.78.

It is l

In [1]:
import gym
import numpy as np

# Game
The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

FrozenLake-v0 defines "solving" as getting an average reward of 0.78 over 100 consecutive trials.

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

**GAME SURFACE**

SFFF<br />
FHFH<br />
FFFH<br />
HFFG

**NOTES**

S: starting point, safe<br />
F: frozen surface, safe<br />
H: hole, fall to your doom<br />
G: goal, where the frisbee is located

**ACTIONS**

0: Left<br />
1: Down<br />
2: Right<br />
3: Up
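
The map and action codes above can be sketched as a small standalone model. This is only an illustration, not gym's implementation: the helper names (`to_state`, `tile`, `intended_next_state`) are made up here, the row-major state numbering is an assumption, and the slippery-ice randomness is deliberately left out.

```python
# Sketch only: a hand-written model of the 4x4 map, not gym's implementation.
LAKE = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_ROWS, N_COLS = len(LAKE), len(LAKE[0])

# Action codes from the ACTIONS list: (row delta, column delta)
MOVES = {0: (0, -1),  # Left
         1: (1, 0),   # Down
         2: (0, 1),   # Right
         3: (-1, 0)}  # Up

def to_state(row, col):
    """Row-major state index (assumed numbering: 0 is top-left, 15 is bottom-right)."""
    return row * N_COLS + col

def tile(state):
    """Letter of the tile (S, F, H or G) at a given state index."""
    row, col = divmod(state, N_COLS)
    return LAKE[row][col]

def intended_next_state(state, action):
    """Where the agent would land if the ice were not slippery."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    # Moves off the edge of the grid leave the agent in place.
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    return to_state(row, col)

print(tile(0), tile(5), tile(15))                  # S H G
print(intended_next_state(0, 2))                   # 1
print([s for s in range(16) if tile(s) == "H"])    # [5, 7, 11, 12]
```

Under this numbering the holes sit at states 5, 7, 11 and 12, which is useful to keep in mind when reading a 16-row Q-table.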

In [70]:
env = gym.make('FrozenLake-v0')

In [71]:
env.render()


SFFF
FHFH
FFFH
HFFG


# Q-Table

In [131]:
n_o = env.observation_space.n # Observation space = 16, number of different states
n_a = env.action_space.n # Action space = 4
Q = np.zeros([n_o, n_a])
Q # Q-Table is a 16 * 4 matrix; each value estimates the discounted future reward of taking that action in that state.

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

# Model

In [132]:
learning_rate = .8
gamma = .95 # Future reward is discounted accordingly
n_epochs = 2000 # Number of games for training
n_steps = 199 # Maximum number of steps per game
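
To see what the learning rate and discount factor do, here is a minimal worked example of a single tabular Q-learning update, applied twice by hand. The helper `q_update` and the two transitions (state 14 to the goal, then state 13 to state 14) are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

learning_rate = 0.8
gamma = 0.95

def q_update(Q, s, a, r, s1):
    """One Q-learning step: move Q[s, a] toward r + gamma * max(Q[s1, :])."""
    target = r + gamma * np.max(Q[s1, :])
    Q[s, a] = Q[s, a] + learning_rate * (target - Q[s, a])

Q = np.zeros((16, 4))

# Hypothetical transition: from state 14, Right (2) reaches the goal (15), reward 1.
q_update(Q, s=14, a=2, r=1.0, s1=15)
print(round(float(Q[14, 2]), 3))   # 0.8

# Hypothetical transition: from state 13, Right (2) reaches state 14, reward 0.
# The value learned for state 14 is propagated back, discounted by gamma.
q_update(Q, s=13, a=2, r=0.0, s1=14)
print(round(float(Q[13, 2]), 3))   # 0.608
```

The second update earns no reward itself; its value comes entirely from the discounted estimate of the next state, which is how reward information spreads backward from the goal.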

In [133]:
rList = [] # Total reward collected in each training game
for i in range(n_epochs):
    # Reset environment and get first new observation
    s = env.reset() # s is reset to 0, the starting point
    rAll = 0
    # The Q-Table learning algorithm
    for j in range(n_steps):
        # Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        # Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        # Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + learning_rate*(r + gamma*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d:
            break
    rList.append(rAll)

In [134]:
print("Score over time: " +  str(sum(rList)/n_epochs))

Score over time: 0.472
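
The training loop explores by adding Gaussian noise, scaled by `1/(i+1)`, to the Q-values before taking the argmax. The sketch below isolates that selection rule to show how quickly the noise fades. The function name `noisy_greedy_action`, the seeded generator, and the example Q-row are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def noisy_greedy_action(q_row, episode, rng):
    """Argmax of the Q-values plus Gaussian noise scaled by 1 / (episode + 1)."""
    noise = rng.standard_normal(q_row.shape) * (1.0 / (episode + 1))
    return int(np.argmax(q_row + noise))

# Made-up Q-values for one state; action 1 is clearly the best.
q_row = np.array([0.1, 0.5, 0.2, 0.0])

early = [noisy_greedy_action(q_row, 0, rng) for _ in range(1000)]
late = [noisy_greedy_action(q_row, 1999, rng) for _ in range(1000)]

print("distinct actions in episode 0:   ", sorted(set(early)))
print("distinct actions in episode 1999:", sorted(set(late)))
```

In the first episode the noise has unit standard deviation and easily overrides the Q-values, so several actions get picked. By the last episode the noise scale is 1/2000, far smaller than the gaps in this row, so the choice is effectively greedy.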


In [135]:
print("Final Q-Table Values")
print(Q)

Final Q-Table Values
[[  2.14845199e-01   3.97555433e-03   3.19736198e-03   1.62249310e-03]
 [  1.57151716e-04   6.97870516e-05   1.78712050e-04   4.65635054e-01]
 [  2.32241127e-04   1.84659132e-01   6.32643971e-04   2.99464033e-03]
 [  2.95287477e-04   6.17121299e-04   6.20370903e-05   8.90514420e-02]
 [  3.64614858e-01   3.11022462e-04   2.90867757e-04   5.07104797e-04]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  1.21669362e-03   4.79870701e-05   4.97399573e-02   1.71792431e-04]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  4.19117818e-04   1.23306929e-03   1.23309570e-03   5.97388958e-01]
 [  2.83277841e-05   5.90799416e-01   3.55913137e-04   2.09160613e-04]
 [  8.03490695e-01   3.32934009e-04   1.08388031e-04   7.93252601e-05]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  5.51702651e-03   1.94147283e-03   4.51633226e-01   1

# Score

In [171]:
n_trials = 100
reward_counter = 0
for i in range(n_trials):
    state = env.reset() # s is reset to 0, the starting point
    done = False
    for j in range(n_steps):
        reward_prediction = Q[state,:]
        action = np.argmax(reward_prediction)
        
        state_next, reward, done, info = env.step(action)
                
        # Accumulate the reward and prepare for the next state
        reward_counter = reward_counter + reward
        state = state_next   
        if done:
            break
score = float(reward_counter) / n_trials
print(score)

0.73
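
The score above averages a single batch of 100 trials. The "solved" criterion quoted earlier refers to 100 consecutive trials, so one possible reading is to check every window of 100 in a longer reward history. The helper `best_window_average` below is a hypothetical sketch of that check, and the reward history is made up rather than taken from a real run.

```python
import numpy as np

def best_window_average(rewards, window=100):
    """Highest mean reward over any run of `window` consecutive episodes."""
    rewards = np.asarray(rewards, dtype=float)
    if len(rewards) < window:
        raise ValueError("need at least `window` episodes")
    # Cumulative sums give every windowed sum in a single pass.
    csum = np.concatenate(([0.0], np.cumsum(rewards)))
    window_means = (csum[window:] - csum[:-window]) / window
    return float(window_means.max())

# Made-up history for illustration: 100 failures, then 80 wins, then 20 failures.
history = [0] * 100 + [1] * 80 + [0] * 20
print(best_window_average(history))           # 0.8
print(best_window_average(history) >= 0.78)   # True
```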


In [137]:
np.argmax(Q[0,:])

0
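
The same argmax can be taken for every state at once to read off the whole greedy policy. The sketch below prints it as a 4x4 grid of arrows, assuming row-major state numbering. The function `greedy_policy_grid` and the stand-in table `demo_Q` are illustrative only; passing the trained `Q` instead would show the learned policy.

```python
import numpy as np

ARROWS = np.array(["<", "v", ">", "^"])  # 0: Left, 1: Down, 2: Right, 3: Up

def greedy_policy_grid(Q, n_rows=4, n_cols=4):
    """Render the greedy action of each state as one arrow per tile."""
    best = np.argmax(Q, axis=1)
    # States whose row is all zeros have learned nothing; show them as '.'
    symbols = np.where(Q.any(axis=1), ARROWS[best], ".")
    return ["".join(row) for row in symbols.reshape(n_rows, n_cols)]

# Stand-in table, not the trained one.
demo_Q = np.zeros((16, 4))
demo_Q[0, 0] = 0.2    # state 0 prefers Left, matching the argmax shown above
demo_Q[14, 2] = 0.9   # state 14 prefers Right, toward the goal

for line in greedy_policy_grid(demo_Q):
    print(line)
# <...
# ....
# ....
# ..>.
```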