# Q-Learning with Frozen Lake V0

### Environment
* Grid 4 x 4 = 16 Squares

**Discrete State Space = 16**

This is the number of rows in the Q-table.
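
As a small illustration of why there are 16 rows, each square of the grid corresponds to one state index. The sketch below assumes the squares are numbered row by row (row-major order); the helper names `state_to_cell` and `cell_to_state` are hypothetical and not part of gym.

```python
N_ROWS, N_COLS = 4, 4  # grid size of the environment described above

def state_to_cell(state):
    """Convert a state index (0..15) to (row, col), assuming row-major numbering."""
    return divmod(state, N_COLS)

def cell_to_state(row, col):
    """Convert (row, col) back to a state index under the same assumption."""
    return row * N_COLS + col

print(state_to_cell(0))     # top-left square
print(state_to_cell(15))    # bottom-right square
print(cell_to_state(1, 1))  # second row, second column
```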

### Goal
Reach the goal square (G), where the frisbee is located, without falling into a hole.

### Action Space
* 4 Directions (Up, Down, Left, Right)

**Discrete Action Space = 4**

This is the number of columns in the Q-table.

### Rewards
* S: starting point, safe
* F: frozen surface, safe
* H: hole, fall to your doom
* G: goal, where the frisbee is located

*The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.*

Additionally, the ice is slippery: the movement direction of the agent is uncertain and only partially depends on the chosen direction.
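
To make the uncertainty concrete, the sketch below simulates slippery moves without using gym. It assumes the standard slippery dynamics of Frozen Lake, where the agent moves in the chosen direction or in one of the two perpendicular directions, each with probability 1/3. The function name `slippery_direction` is hypothetical, and the action numbering is an assumption for illustration.

```python
import random
from collections import Counter

LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed action numbering
NAMES = {LEFT: "Left", DOWN: "Down", RIGHT: "Right", UP: "Up"}

def slippery_direction(action, rng):
    """Return the direction actually taken: the chosen one or a perpendicular one."""
    return rng.choice([(action - 1) % 4, action, (action + 1) % 4])

rng = random.Random(0)
counts = Counter(NAMES[slippery_direction(RIGHT, rng)] for _ in range(3000))
print(counts)  # Down, Right and Up appear about equally often; Left never appears
```

Under this assumption the agent never moves opposite to the chosen direction, but it follows the chosen direction only about a third of the time.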

In [1]:
import numpy as np
import gym
import random

## Step 1: Create the Environment

In [2]:
env = gym.make("FrozenLake-v0")
env.render()


[S]FFF
FHFH
FFFH
HFFG


## Step 2: Create the Q-Table

In [3]:
state_space = env.observation_space.n
action_space = env.action_space.n

Q = np.zeros((state_space, action_space))

print(Q)
print(Q.shape)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
(16, 4)


## Step 3: Define the Hyperparameters

In [4]:
total_episodes = 50000        # Total number of training episodes
total_test_episodes = 100     # Total number of test episodes
max_steps = 99                # Max steps per episode

learning_rate = 0.8           # Learning rate
gamma = 0.95                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005            # Exponential decay rate for exploration prob
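
The exploration parameters feed the exponential decay schedule that Step 5 applies once per episode. A minimal sketch of how quickly epsilon falls with these values (the function name `epsilon_at` is hypothetical):

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.005  # values from the cell above

def epsilon_at(episode):
    """Exploration probability used during the given training episode."""
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

for episode in (0, 100, 500, 1000, 5000):
    print(episode, round(float(epsilon_at(episode)), 4))
```

With these values epsilon is within 0.01 of its floor after roughly a thousand episodes, so most of the 50,000 training episodes are run almost greedily.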

## Step 4: Define the Epsilon-Greedy Policy
This handles the exploration/exploitation trade-off.

* If a random number > epsilon -> exploitation (the agent selects the action with the highest Q-value for the current state)
* Otherwise -> exploration (the agent takes a random action)
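
The rule above can be sketched without an environment. This version takes epsilon and a NumPy random generator as explicit arguments, which is an assumption made here so the example runs on its own; the function name `epsilon_greedy` and the table `Q_demo` are hypothetical.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick the greedy action with probability 1 - epsilon, else a random one."""
    if rng.uniform(0, 1) > epsilon:
        return int(np.argmax(Q[state]))       # exploitation
    return int(rng.integers(Q.shape[1]))      # exploration

rng = np.random.default_rng(0)
Q_demo = np.array([[0.1, 0.5, 0.2, 0.0]])     # one state, four actions (made-up values)

print(epsilon_greedy(Q_demo, 0, epsilon=0.0, rng=rng))  # greedy choice
print(epsilon_greedy(Q_demo, 0, epsilon=1.0, rng=rng))  # uniformly random choice
```

Note that `np.argmax` returns the first index when several actions tie, so an all-zero row always yields action 0 under exploitation.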

In [5]:
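def epsilon_greedy_policy(Q, state):
    # Uses the global epsilon, which the training loop updates every episode
    if random.uniform(0, 1) > epsilon:
        # Exploitation: pick the action with the highest Q-value for this state
        action = np.argmax(Q[state])
    else:
        # Exploration: pick a random action
        action = env.action_space.sample()

    return action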

## Step 5: Train the Q-Learning Algorithm
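
Each training step applies the tabular update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]. As a worked example of a single update, the sketch below uses made-up numbers: the table `Q_demo`, the transition, and the reward are illustrative and not taken from a training run.

```python
import numpy as np

learning_rate, gamma = 0.8, 0.95     # values from Step 3

Q_demo = np.zeros((16, 4))
Q_demo[14, 2] = 0.5                  # hypothetical current estimate for state 14, action 2

# Hypothetical transition: from state 14, action 2 reaches state 15 with reward 1
state, action, reward, new_state = 14, 2, 1.0, 15

td_target = reward + gamma * np.max(Q_demo[new_state])   # 1.0 + 0.95 * 0.0
td_error = td_target - Q_demo[state, action]             # 1.0 - 0.5
Q_demo[state, action] += learning_rate * td_error        # 0.5 + 0.8 * 0.5

print(Q_demo[state, action])
```

Because the learning rate is 0.8, one update moves the estimate 80% of the way from its old value toward the target.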

In [6]:
rewards = []

for episode in range(total_episodes):
  # Reset the environment
  state = env.reset()
  step = 0
  done = False
  total_rewards = 0

  # Reduce Epsilon to Reduce Exploration Probability
  epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(- decay_rate * episode)

  for step in range(max_steps):
    action = epsilon_greedy_policy(Q, state)

    # Take the Action & Observe the Rewards (r) & Outcome State (s)
    new_state, reward, done, info = env.step(action)

    # Update the Q-Table Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
    Q[state][action] = Q[state][action] + learning_rate * (reward + gamma * np.max(Q[new_state]) - Q[state][action])

    total_rewards += reward

    # Set New State as State
    state = new_state

    # If the episode is done, stop stepping
    if done:
      break

  rewards.append(total_rewards)

print("Train Score: " +  str(sum(rewards)/total_episodes))

print(Q)

Train Score: 0.48948
[[1.32334747e-01 4.37050976e-02 1.09608501e-01 2.92408900e-02]
 [2.62331002e-06 3.33393313e-04 9.33192413e-03 4.99944346e-02]
 [1.23920029e-02 2.57643123e-02 1.25279422e-02 1.33921762e-02]
 [2.64929245e-03 2.80251377e-04 4.63566365e-03 4.16177078e-02]
 [2.03396127e-01 6.44781593e-04 9.93378610e-05 2.74512721e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [7.62077046e-02 2.65978093e-05 2.16518471e-05 1.78090982e-06]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.46934775e-02 4.31156841e-02 9.33753270e-04 2.82089420e-01]
 [2.78386056e-02 1.41571540e-01 5.00586089e-02 3.79282230e-05]
 [3.46258385e-01 7.32909521e-04 3.99507263e-05 4.16657276e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.73201504e-02 1.33949882e-01 6.32589740e-01 1.42803538e-02]
 [2.37217764e-01 2.87328556e-01 8.86047460e-01 3.92122066e-02]
 [0.00000000e+00 0.00000000e+00 0.
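
One way to read a learned table is to take the best action in each row. The sketch below does this for a small made-up table rather than for the output above. It assumes gym's action numbering for Frozen Lake (0 = Left, 1 = Down, 2 = Right, 3 = Up) and row-major state numbering; the function name `greedy_policy_grid` is hypothetical.

```python
import numpy as np

ARROWS = np.array(["<", "v", ">", "^"])   # assumed order: Left, Down, Right, Up

def greedy_policy_grid(Q, n_rows=4, n_cols=4):
    """Return a grid of arrows, one per state; '.' marks rows that are all zero."""
    best = np.argmax(Q, axis=1)
    symbols = np.where(Q.any(axis=1), ARROWS[best], ".")
    return symbols.reshape(n_rows, n_cols)

Q_demo = np.zeros((16, 4))
Q_demo[0, 1] = 0.2     # hypothetical values, not from a real training run
Q_demo[14, 2] = 0.9

for row in greedy_policy_grid(Q_demo):
    print("".join(row))
```

All-zero rows are shown as `.` because `np.argmax` would otherwise report action 0 for them, which says nothing about what was learned.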

## Step 6: Test

In [7]:
rewards = []

for episode in range(total_test_episodes):
  # Reset the environment
  state = env.reset()
  step = 0
  done = False
  total_rewards = 0

  print("******************** EPISODE ", episode, " ********************")

  for step in range(max_steps):
    # Take the action that has the Maximum Expected Future Reward for that State
    action = np.argmax(Q[state][:])
    new_state, reward, done, info = env.step(action)

    total_rewards += reward

    if done:
      env.render()
      rewards.append(total_rewards)
      print(step)
      break
    
    state = new_state

env.close()

print('Test Score: ', str(sum(rewards)/total_test_episodes))

******************** EPISODE  0  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
65
******************** EPISODE  1  ********************
  (Left)
SFFF
F[H]FH
FFFH
HFFG
18
******************** EPISODE  2  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
16
******************** EPISODE  3  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
70
******************** EPISODE  4  ********************
  (Left)
SFFF
F[H]FH
FFFH
HFFG
15
******************** EPISODE  5  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
49
******************** EPISODE  6  ********************
  (Left)
SFFF
F[H]FH
FFFH
HFFG
28
******************** EPISODE  7  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
40
******************** EPISODE  8  ********************
  (Right)
SFFF
FHFH
FFFH
HFF[G]
22
******************** EPISODE  9  ********************
  (Left)
SFFF
F[H]FH
FFFH
HFFG
59
******************** EPISODE  10  **********