# Frozen-Lake with Q-Learning (table based)

## The environment

(From https://www.gymlibrary.ml/environments/toy_text/frozen_lake/ ):

Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H) by walking over the Frozen(F) lake. The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

### Action Space

The agent takes a 1-element vector for actions. The action space is ```(dir)```, where ```dir``` decides direction to move in which can be:

    0: LEFT

    1: DOWN

    2: RIGHT

    3: UP
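
The slippery behaviour mentioned in the environment description can be sketched in a few lines. This is a simplified illustration, not gym's actual code. It assumes, as the gym documentation describes for `is_slippery=True`, that the agent moves in the intended direction with probability 1/3 and in each of the two perpendicular directions with probability 1/3. The helper names `slip_outcomes` and `sample_actual_direction` are hypothetical.

```python
import random

ACTION_NAMES = {0: "LEFT", 1: "DOWN", 2: "RIGHT", 3: "UP"}


def slip_outcomes(action: int) -> list[int]:
    """Directions the agent may actually move in (assumed equally likely)."""
    # With this numbering, (action - 1) and (action + 1) modulo 4 are exactly
    # the two directions perpendicular to the intended one.
    return [(action - 1) % 4, action, (action + 1) % 4]


def sample_actual_direction(action: int, rng: random.Random) -> int:
    """Pick one of the three possible directions uniformly at random."""
    return rng.choice(slip_outcomes(action))


for intended in range(4):
    possible = [ACTION_NAMES[a] for a in slip_outcomes(intended)]
    print(f"{ACTION_NAMES[intended]:>5} -> {possible}")
```

Under this assumption, asking for LEFT can also result in UP or DOWN, but never in RIGHT.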

### Observation Space / State Space

The observation/state is a value representing the agent’s current position as current_row * nrows + current_col (where both the row and col start at 0). For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. For example, the 4x4 map has 16 possible observations.
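
The index arithmetic above can be checked with a small sketch. The helper names are hypothetical, and the number of columns is used as the multiplier, which is the same as the number of rows on the square 4x4 map.

```python
def position_to_state(row: int, col: int, ncols: int = 4) -> int:
    """Flatten a (row, col) position into a single state index."""
    return row * ncols + col


def state_to_position(state: int, ncols: int = 4) -> tuple[int, int]:
    """Recover (row, col) from a state index."""
    return divmod(state, ncols)


print(position_to_state(3, 3))  # goal of the 4x4 map: 15
print(state_to_position(9))     # (2, 1)
```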

### Rewards

Reward schedule:

    Reach goal(G): +1

    Reach hole(H): 0

    Reach frozen(F): 0


## Interacting with OpenAI gym and the environment

### Initialization

We can create and render the environment like this:

In [48]:
import gym

environment = gym.make('FrozenLake-v1', desc=None, map_name="4x4", is_slippery=True)

def init_environment(env: gym.Env):
    state, _ = env.reset(return_info=True)  # Restart/initialize the environment
    print(env.render(mode="ansi"))

    # The state returned from environment.reset() is our initial state:
    print(state)

init_environment(environment)


[41mS[0mFFF
FHFH
FFFH
HFFG

0


### Moving / Taking an action


In [49]:
# We can get the action-space with environment.action_space
print(environment.action_space)

Discrete(4)


In [50]:
# To move around (taking a series of actions), we use the ```environment.step()``` function:
def move(env: gym.Env):
    new_state, reward, done, _ = env.step(0)  # 0 is left, remember that we can use all 4 actions from the action-space, and that there is a chance of slipping

    # We got three new variables:
    print(f"New state : {new_state}")  # Updated state after moving
    print(f"Reward: {reward}")  # The reward we got from the environment
    print(f"Done: {done}")  # If the game is finished or not
    print("")
    print(env.render(mode="ansi"))  # Re-render environment after moving

move(environment)

New state : 0
Reward: 0.0
Done: False

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG



## The Q-Table

We use a q-table to help guide us to the best action to take at each timestep.
It is just a simple lookup table containing the maximum expected future rewards for all actions in all states.


In [51]:
import numpy as np

# Create an empty q-table of size state-space x action-space
def initialize_q_table(env: gym.Env) -> np.array:
    return np.zeros((env.observation_space.n, env.action_space.n), dtype=np.float32)  # One row per state, one column per action

q_table = initialize_q_table(environment)
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


## Policies

A policy is a function that maps a state to an action.
It decides how we play the game.

In [52]:
# We first define the optimal (greedy) policy: it always picks what it believes is the "best" action in the current state
def optimal_policy(env: gym.Env, q_sa: np.array, s: int) -> int:
    """RL-policy for optimal play.

    Args:
        env: Frozen-lake Environment
        q_sa: q-table
        s: state

    Returns:
        optimal action for given state and q-table.
    """
    if np.all(q_sa[s] == q_sa[s][0]):  # If all q-values are equal (e.g. all 0), we cannot differentiate
        return env.action_space.sample()  # Pick a random action
    return int(np.argmax(q_sa[s]))  # Return the argument (element number) with the highest q-value

In [53]:
# When we are training, we also want some exploration (see exploration/exploitation tradeoff)
def epsilon_greedy_policy(env: gym.Env, q_sa: np.array, s: int, eps: float = 0.15) -> int:
    """RL-policy for exploration/exploitation play.

    Args:
        env: Frozen-lake Environment
        q_sa: q-table
        s: state
        eps: exploration chance

    Returns:
        either random action, or optimal action for given state and q-table.
    """
    if np.random.rand() < eps:  # If a random number n is lower than eps:
        return env.action_space.sample()  # Pick a random action
    return optimal_policy(env, q_sa, s)  # Otherwise, play optimally

In [54]:
# We can also have a decaying explore strategy, to promote exploration early on and exploitation later
def decaying_epsilon_greedy_policy(env: gym.Env, q_sa: np.array, s: int, episode: int, max_episodes: int, max_eps: float = 0.8, min_eps: float = 0.02) -> int:
    """RL-policy for exploration/exploitation play.

    Args:
        env: Frozen-lake Environment
        q_sa: q-table
        s: state
        episode: current episode number
        max_episodes: total number of episodes
        max_eps: max exploration chance
        min_eps: min exploration chance

    Returns:
        either random action, or optimal action for given state and q-table.
    """
    eps = min_eps + (max_eps - min_eps) * ((max_episodes - episode) / max_episodes)
    if np.random.rand() < eps:  # If a random number n is lower than eps:
        return env.action_space.sample()  # Pick a random action
    return optimal_policy(env, q_sa, s)  # Otherwise, play optimally
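
To see how the exploration chance evolves, the linear schedule used in `decaying_epsilon_greedy_policy` can be evaluated on its own. This is a minimal sketch: the function name `linear_epsilon` and the choice of 1000 episodes are only for illustration.

```python
def linear_epsilon(episode: int, max_episodes: int,
                   max_eps: float = 0.8, min_eps: float = 0.02) -> float:
    """Exploration chance that falls linearly from max_eps to min_eps."""
    remaining = (max_episodes - episode) / max_episodes  # 1.0 at the start, 0.0 at the end
    return min_eps + (max_eps - min_eps) * remaining


for ep in (0, 250, 500, 750, 1000):
    print(f"episode {ep:>4}: eps = {linear_epsilon(ep, 1000):.3f}")
```

With the default values, the agent starts out exploring 80% of the time and ends at 2%.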

## Playing FrozenLake with a policy

To play FrozenLake with a q-table and a policy, we just replace the action in the earlier step with the action from the policy:

In [55]:
def play():
    max_steps = 20  # Maximum steps to play
    state, _ = environment.reset(return_info=True)  # Restart/initialize the environment
    print(environment.render(mode="ansi"))
    for _ in range(max_steps):
        action = optimal_policy(environment, q_table, state)  # Choose the optimal action based on values from the q-table
        new_state, reward, done, _ = environment.step(action)  # Play using that action
        print(environment.render(mode="ansi"))

        # We stop the game if we are finished
        if done:
            break

        state = new_state  # If not, replace the state with the new state before next step

play()


[41mS[0mFFF
FHFH
FFFH
HFFG

  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG



## Learning from experience / Updating the q-table

Right now the q-table is filled with zeroes and never changes. We want to update it after every step we take, based on the td-error (temporal-difference error).
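
For reference, the standard tabular Q-learning update can be written as follows, where $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $(s_t, a_t, r_t, s_{t+1})$ is a single experience:

$$
\delta_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)
$$

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, \delta_t
$$

Here $\delta_t$ is the td-error: the difference between the new estimate of the return and the value currently stored in the table.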

In [56]:
# The td-error is based on a single experience, i.e. a e_t = (s_t, a_t, r_t, s_(t+1)) tuple, for the experience at timestep t
# We can save this in a dataclass:

from dataclasses import dataclass

@dataclass
class Experience:
    __slots__ = ("state", "action", "reward", "new_state", "done")  # Optimization so that we can save thousands of these objects later
    state: int
    action: int
    reward: float
    new_state: int
    done: bool

In [57]:
# We can then calculate the td-error and q-update of a single experience
def q_temporal_difference(q_sa: np.array, experience: Experience, alpha: float = 0.85, gamma: float = 0.98) -> float:
    """Calculates the q-update.

	Args:
		q_sa: q-table
		experience: a single experience
		alpha: learning-rate
		gamma: discount

	Returns:
		q-td update value
    """
    td_error = experience.reward + gamma * np.max(q_sa[experience.new_state]) - q_sa[experience.state][experience.action]
    return q_sa[experience.state][experience.action] + alpha * td_error
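
A worked example with concrete numbers may make the update easier to follow. The sketch below repeats the same arithmetic on plain floats, using the default `alpha` and `gamma` from above. The function name `q_update` and the input values are made up for illustration.

```python
def q_update(q_old: float, reward: float, max_q_next: float,
             alpha: float = 0.85, gamma: float = 0.98) -> float:
    """Return the new q-value for one (state, action) pair."""
    td_error = reward + gamma * max_q_next - q_old
    return q_old + alpha * td_error


# Stepping onto the goal: reward 1, and the terminal state has no future value
print(round(q_update(q_old=0.0, reward=1.0, max_q_next=0.0), 4))  # 0.85

# An ordinary step: no reward, but the next state already looks promising
print(round(q_update(q_old=0.5, reward=0.0, max_q_next=1.0), 4))  # 0.908
```

The second case shows how value propagates backwards from the goal even though the step itself gives no reward.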

In [58]:
# Using the q-update, we can update our q-table over multiple games
def q_learning(env: gym.Env, q_sa: np.array, n_episodes: int = 100000, m_ep_length: int = 200) -> np.array:
    """q-learning implementation to update a q-table.

	Args:
		env: gym environment
		q_sa: initial q-table
		n_episodes: number of episodes to train on
		m_ep_length: maximum episode length

	Returns:
		updated q-table
    """
    for episode in range(n_episodes):
        s, _ = env.reset(return_info=True)  # Restart/initialize the environment
        for _ in range(m_ep_length):
            a = decaying_epsilon_greedy_policy(env, q_sa, s, episode, n_episodes)  # Exploration strategy
            s_new, r, d, _ = env.step(a)  # Play using that action

            exp = Experience(s, a, r, s_new, d)  # We create an experience from this transition
            q_td = q_temporal_difference(q_sa, exp)  # We calculate the q-update
            q_sa[s][a] = q_td  # We update the q-table

            # We stop the game if we are finished
            if d:
                break

            s = s_new  # If not, replace the state with the new state before next step
    return q_sa

In [59]:
# Now we can try:
q_table = initialize_q_table(environment)
print(q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


In [73]:
q_table = q_learning(environment, q_table, n_episodes=300000)  # 300k episodes will take some time; lower n_episodes to see how it works (though it might not converge)
print(np.around(q_table, 2))  # We round to two decimal places for readability

[[0.54 0.39 0.24 0.29]
 [0.   0.06 0.06 0.36]
 [0.18 0.15 0.04 0.15]
 [0.   0.01 0.01 0.15]
 [0.64 0.01 0.01 0.38]
 [0.   0.   0.   0.  ]
 [0.13 0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.12 0.1  0.02 0.59]
 [0.11 0.73 0.11 0.09]
 [0.94 0.   0.   0.06]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.63 0.7  0.83 0.11]
 [0.29 1.   0.2  0.35]
 [0.   0.   0.   0.  ]]


You can now redo the "Playing FrozenLake with a policy" step, and you have a fully working q-learning agent.
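
To read the learned table at a glance, the greedy action for every state can be laid out on the map. The sketch below copies the rounded values from the printed q-table above, so it only reflects that particular training run. The helper names are hypothetical.

```python
# Rounded q-values copied from the printed q-table (one row per state)
Q_TABLE = [
    [0.54, 0.39, 0.24, 0.29], [0.00, 0.06, 0.06, 0.36],
    [0.18, 0.15, 0.04, 0.15], [0.00, 0.01, 0.01, 0.15],
    [0.64, 0.01, 0.01, 0.38], [0.00, 0.00, 0.00, 0.00],
    [0.13, 0.00, 0.00, 0.00], [0.00, 0.00, 0.00, 0.00],
    [0.12, 0.10, 0.02, 0.59], [0.11, 0.73, 0.11, 0.09],
    [0.94, 0.00, 0.00, 0.06], [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00], [0.63, 0.70, 0.83, 0.11],
    [0.29, 1.00, 0.20, 0.35], [0.00, 0.00, 0.00, 0.00],
]
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
ACTION_SYMBOLS = "LDRU"  # index = action number: 0 LEFT, 1 DOWN, 2 RIGHT, 3 UP


def greedy_action(q_row: list[float]) -> int:
    """Index of the highest q-value in one row of the table."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def policy_grid(q_table: list[list[float]], lake_map: list[str]) -> list[str]:
    """One string per map row; holes (H) and the goal (G) are kept as-is."""
    ncols = len(lake_map[0])
    rows = []
    for r, map_row in enumerate(lake_map):
        cells = []
        for c, tile in enumerate(map_row):
            if tile in "HG":
                cells.append(tile)  # Terminal tiles: no action is ever taken here
            else:
                cells.append(ACTION_SYMBOLS[greedy_action(q_table[r * ncols + c])])
        rows.append("".join(cells))
    return rows


for line in policy_grid(Q_TABLE, LAKE_MAP):
    print(line)
```

Several of the greedy actions point away from the goal. On a slippery lake this can be sensible, since an action never moves the agent in the opposite direction, which lets it steer clear of holes.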

# Frozen-Lake with a DQN (Deep Q-Network) agent

## Neural Network predictions

In the previous part we used a q-table as the backend, but the problem can also be solved by function approximation with a neural network.

### Building the model

We can start by building the network model:

In [61]:
# Imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential  # Use the public Keras API rather than the private tensorflow.python package
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_dqn_model(alpha: float = 0.001) -> Sequential:
    """Builds a deep neural net which predicts the Q values for all possible
    actions given a state.

    The input should have the shape of the state, and the output should have the same shape as
    the action space since we want 1 Q value per possible action.

    Args:
		alpha: learning-rate

	Returns:
		q-net model
    """
    x_data = np.linspace(0, 15, 16)
    normalizer = keras.layers.Normalization(input_shape=[1, ], axis=None)
    normalizer.adapt(np.array(x_data))

    q_net = Sequential()
    # We start with the normalizer, input shape is of size 1 (state)
    q_net.add(normalizer)
    # First hidden layer has 64 neurons
    q_net.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
    # The second hidden layer also has 64 neurons
    q_net.add(Dense(64, activation='relu', kernel_initializer='he_uniform'))
    # Since we have 4 possible actions, the output layer should be of size 4
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    q_net.compile(optimizer=Adam(learning_rate=alpha), loss='mse')
    return q_net
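
The normalizer in `build_dqn_model` is adapted on the 16 state indices 0 to 15. As a rough sketch of what it then computes, assuming the layer standardizes its input with the mean and population variance of the adapted data (and ignoring the small epsilon Keras uses for numerical stability), the same transformation in plain Python looks like this. The name `normalize_state` is hypothetical.

```python
import statistics

states = list(range(16))                 # The same values the normalizer is adapted on
mean = statistics.fmean(states)          # 7.5
std = statistics.pstdev(states)          # Population standard deviation, sqrt(21.25)


def normalize_state(s: int) -> float:
    """Map a raw state index to a zero-mean, unit-variance input."""
    return (s - mean) / std


for s in (0, 7, 15):
    print(f"state {s:>2} -> {normalize_state(s):+.3f}")
```

The network therefore sees the state as a single number centred on zero rather than as a raw index between 0 and 15.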

In [62]:
# We can then create a dqn-model (it will be initialized with random weights)
q_net_model = build_dqn_model()

# And then we can "predict" the q-value outputs from a state s (in this case 1)
state_input = tf.convert_to_tensor([1], dtype=tf.float32)
pred = q_net_model.predict(state_input)
print(f"Q-values for state 1: {pred}")

# To get the q value for a specific action a (in this case action 1);
q = pred[0][1]
print(f"Q-value of state 1, action 1: {q}")

Q-values for state 1: [[-1.4472475   2.6199214  -0.1604914  -0.41213113]]
Q-value of state 1, action 1: 2.6199214458465576


### Modifying q-learning functions

We can reuse policy functions and playing functions from the q-learning agent (with a q-table backend), but we will need to modify them to take in the neural network instead:

In [63]:
def dqn_optimal_policy(env: gym.Env, q_net: Sequential, s: int) -> int:
    """RL-policy for optimal play.

    Args:
        env: Frozen-lake Environment
        q_net: q-network
        s: state

    Returns:
        optimal action for given state and q-net.
    """
    s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
    q_values = q_net.predict(s_tensor)[0]
    # print(f"Q-values: {q_values}")
    return int(np.argmax(q_values))  # Return the argument (element number) with the highest q-value

def dqn_epsilon_greedy_policy(env: gym.Env, q_net: Sequential, s: int, eps: float = 0.15) -> int:
    """RL-policy for exploration/exploitation play.

    Args:
        env: Frozen-lake Environment
        q_net: q-network
        s: state
        eps: exploration chance

    Returns:
        either random action, or optimal action for given state and q-net.
    """
    if np.random.rand() < eps:  # If a random number n is lower than eps:
        return env.action_space.sample()  # Pick a random action
    return dqn_optimal_policy(env, q_net, s)  # Otherwise, play optimally

def dqn_decaying_epsilon_greedy_policy(env: gym.Env, q_net: Sequential, s: int, episode: int, max_episodes: int, max_eps: float = 0.95, min_eps: float = 0.01) -> int:
    """RL-policy for exploration/exploitation play.

    Args:
        env: Frozen-lake Environment
        q_net: q-network
        s: state
        episode: current episode number
        max_episodes: total number of episodes
        max_eps: max exploration chance
        min_eps: min exploration chance

    Returns:
        either random action, or optimal action for given state and q-net.
    """
    max_episodes = int(max_episodes * 0.9)  # Testing with "optimal play" for last 10% of episodes
    episode = min(episode, max_episodes)
    eps = min_eps + (max_eps - min_eps) * ((max_episodes - episode) / max_episodes)
    if np.random.rand() < eps:  # If a random number n is lower than eps:
        return env.action_space.sample()  # Pick a random action
    return dqn_optimal_policy(env, q_net, s)  # Otherwise, play optimally

In [64]:
# We can test the optimal-policy:
print(f"Optimal action: {dqn_optimal_policy(environment, q_net_model, 1)}") # Optimal action

Optimal action: 1


### Playing with a DQN agent

In [65]:
# To play the game with a DQN-agent, we modify the function from "Playing FrozenLake with a policy", replacing the policy with a DQN-policy:

def dqn_play(max_steps: int = 20):
    state, _ = environment.reset(return_info=True)  # Restart/initialize the environment
    print(environment.render(mode="ansi"))
    for _ in range(max_steps):
        action = dqn_optimal_policy(environment, q_net_model, state)  # Choose the optimal action based on values from the q-net
        # print(f"Action: {action}")
        new_state, reward, done, _ = environment.step(action)  # Play using that action
        print(environment.render(mode="ansi"))

        # We stop the game if we are finished
        if done:
            break

        state = new_state  # If not, replace the state with the new state before next step

dqn_play()


[41mS[0mFFF
FHFH
FFFH
HFFG

  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Down)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Down)
SFFF
F[41mH[0mFH
FFFH
HFFG



## Experience Replay

To train our network, we generally want to use batches sampled from a larger buffer of experiences.

### Replay buffer
We can implement a buffer with the Experience class we implemented earlier:

In [66]:
from collections import deque
from random import sample

class ReplayBuffer:
    """Replay buffer.

    Stores and samples gameplay experiences
    """

    def __init__(self, max_size: int = 2000) -> None:
        self.buffer = deque(maxlen=max_size)

    def store(self, experience: Experience) -> None:
        """Store a gameplay experience in the buffer.

        Args:
            experience: gameplay experience to store

        Returns:
            None
        """
        self.buffer.append(experience)

    def sample(self, batch_size: int = 32) -> list[Experience]:
        """Samples a list of gameplay experiences of (max) size batch_size.

        Args:
            batch_size: maximum size of the batch to sample

        Returns:
            Sampled batch of gameplay experiences
        """
        batch_size = min(batch_size, len(self.buffer))
        return sample(self.buffer, batch_size)
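
The buffer relies on two standard-library behaviours: a `deque` with `maxlen` silently drops its oldest entries when full, and `random.sample` draws without replacement. A small stand-alone sketch, using integers in place of `Experience` objects and a fixed seed purely for reproducibility:

```python
from collections import deque
from random import Random

buffer = deque(maxlen=3)        # Tiny capacity so the eviction is easy to see
for item in range(5):
    buffer.append(item)         # Once full, each append drops the oldest item
print(list(buffer))             # [2, 3, 4]

rng = Random(0)
batch_size = min(32, len(buffer))            # Never ask for more than is stored
batch = rng.sample(list(buffer), batch_size)
print(sorted(batch))                         # [2, 3, 4]
```

Capping the batch size matters because `random.sample` raises a `ValueError` when asked for more items than the sequence contains.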

### Storing experiences
We can store experiences in the buffer simply by playing the game, as we did in the "Playing with a DQN agent" step:

In [67]:
def collect_experiences(env: gym.Env, q_net: Sequential, buffer: ReplayBuffer, episode: int, max_episode: int, max_steps: int = 200) -> None:
    """Plays a single game/episode of the environment env, and stores all the transitions as experiences in the buffer.

    Args:
        env: OpenAI gym environment
        q_net: Q-network
        buffer: replay buffer
        episode: current episode number (for decaying eps-greedy)
        max_episode: max episode number (for decaying eps-greedy)
        max_steps: max steps to play for in the environment

    Returns:
        None
    """
    s, _ = env.reset(return_info=True)  # Restart/initialize the environment
    for _ in range(max_steps):
        a = dqn_decaying_epsilon_greedy_policy(env, q_net, s, episode, max_episode)  # Choose an action with the decaying eps-greedy policy
        s_new, r, d, _ = env.step(a)  # Play using that action
        if d and r == 0:
            r = -1  # Penalize episodes that end without reaching the goal
        experience = Experience(s, a, r, s_new, d)
        buffer.store(experience)

        # We stop the game if we are finished
        if d:
            break

        s = s_new  # If not, replace the state with the new state before next step

## Training the q-net

Now we need to be able to update the q-net, as we did with the q-table earlier in the notebook.
(NB: This is not part of the curriculum, but is included for completeness.)

### Evaluating the agent/q-net
We should also be able to evaluate the q-net, so that we can tell whether it is doing well during training
and compare different models.

In [68]:
def evaluate_q_net(env: gym.Env, q_net: Sequential, episodes: int = 10, max_steps: int = 200) -> float:
    """Evaluates the performance of the given q-net.

    Plays n games/episodes of the given environment and calculates the average reward.
    Args:
        env: the game environment
        q_net: the q-net / agent
        episodes: number of episodes to play
        max_steps: max steps to play for in the environment

    Returns:
        average reward
    """
    t_reward = 0.0
    for _ in range(episodes):
        s, _ = env.reset(return_info=True)  # Restart/initialize the environment
        ep_reward = 0.0
        for _ in range(max_steps):
            a = dqn_optimal_policy(env, q_net, s)  # Choose the optimal action
            s_new, r, d, _ = env.step(a)  # Play using that action
            ep_reward += r
            # We stop the game if we are finished
            if d:
                break

            s = s_new  # If not, replace the state with the new state before next step
        t_reward += ep_reward
    return t_reward/episodes

### Q-Net Learning
Finally, the replacement for the q-learning method:

In [69]:
def dqn_utility(q_net: Sequential, s: int) -> float:
    """Utility function.

    Args:
        q_net: q-network
        s: state

    Returns:
        q-value of optimal action for given state and q-net.
    """
    s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
    q_values = q_net.predict(s_tensor)[0]
    return float(np.amax(q_values))  # Return the highest q-value itself (not its index, and not truncated to an int)

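def train(q_net: Sequential, batch: list[Experience], gamma: float = 0.98) -> list[float]:
    """Trains the q-net on a single batch of experiences.

    Args:
        q_net: q-net
        batch: the batch to train on
        gamma: discount-value

    Returns:
        list with the training loss for each epoch (a single value here)
    """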
    # We first create a list of all current q-values in the batch:
    batch_states = [experience.state for experience in batch]
    s_tensor = tf.convert_to_tensor(batch_states, dtype=tf.float32)
    q_values = q_net.predict(s_tensor)

    # We want to calculate the error over the q-values, so we make a copy to use as a target
    target_q = np.copy(q_values)

    # We then repeat for all utilities of the next states in the batch:
    batch_ns = [experience.new_state for experience in batch]
    ns_tensor = tf.convert_to_tensor(batch_ns, dtype=tf.float32)
    utilities = q_net.predict(ns_tensor)
    utilities = [np.amax(utility) for utility in utilities]

    for i in range(len(batch)):
        experience = batch[i]
        target = experience.reward
        if not experience.done:
            # The target is computed as in q-learning
            target = experience.reward + gamma * utilities[i]

        # We overwrite the predicted q-value of the action taken with the target,
        # so that only this action contributes to the error
        target_q[i][experience.action] = target
    # Now we update the network, the fit function will take care of the rest of the update algorithm (learning-rate, error and gradient)
    target_q = tf.convert_to_tensor(target_q, dtype=tf.float32)
    training_history = q_net.fit(x=s_tensor, y=target_q, verbose=0)
    loss = training_history.history['loss']
    return loss

def dqn_learning(env: gym.Env, q_net: Sequential, buffer: ReplayBuffer, min_buffer: int = 100, n_episodes: int = 10000, max_steps: int = 200) -> Sequential:
    """dqn implementation to update a q-net.

    Args:
        env: gym environment
        q_net: agent/q-net
        buffer: the replay-buffer we will use
        min_buffer: number of episodes to play before we start training
        n_episodes: number of episodes to train on
        max_steps: maximum episode length

    Returns:
        updated q-net
    """
    # We first start by playing a few episodes so that we have some samples in our buffer
    for episode in range(n_episodes):
        collect_experiences(env, q_net, buffer, episode, n_episodes, max_steps=max_steps)  # Plays one episode and adds to buffer

        if episode >= min_buffer:  # We only start updating the q-net after min_buffer episodes, so that the buffer has experiences to sample from
            experience_batch = buffer.sample(256)
            loss = train(q_net, experience_batch)
            performance = evaluate_q_net(env, q_net)
            print(f"Episode: {episode}/{n_episodes}, the performance of the q-net is: {performance}, the loss is: {loss[0]}")
    return q_net
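
The target that `train` builds for each experience can be illustrated in isolation. This is a minimal sketch on plain Python lists rather than tensors. The function name `td_target` and the example q-values are made up.

```python
def td_target(reward: float, done: bool, next_q_values: list[float],
              gamma: float = 0.98) -> float:
    """Regression target for the q-value of the action that was taken."""
    if done:
        return reward  # A terminal transition has no future value to bootstrap from
    return reward + gamma * max(next_q_values)


# Reaching the goal: the target is just the reward
print(td_target(1.0, True, [0.3, 0.1, 0.0, 0.2]))             # 1.0

# Ending without reaching the goal (reward reshaped to -1 when collecting experiences)
print(td_target(-1.0, True, [0.3, 0.1, 0.0, 0.2]))            # -1.0

# A non-terminal step: discounted best q-value of the next state
print(round(td_target(0.0, False, [0.1, 0.5, 0.2, 0.0]), 2))  # 0.49
```

Only the entry for the action actually taken is replaced by this target. The other entries keep the network's own predictions, so they contribute no error.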

In [70]:
# Now to train:
replay_buffer = ReplayBuffer(max_size=512)
q_net_model = dqn_learning(environment, q_net_model, replay_buffer, n_episodes=5000)


Episode: 100/5000, the performance of the q-net is: 0.0, the loss is: 1.3142683506011963
Episode: 101/5000, the performance of the q-net is: 0.0, the loss is: 0.6105836033821106
Episode: 102/5000, the performance of the q-net is: 0.0, the loss is: 0.27251121401786804
Episode: 103/5000, the performance of the q-net is: 0.0, the loss is: 0.19604827463626862
Episode: 104/5000, the performance of the q-net is: 0.0, the loss is: 0.14146272838115692
Episode: 105/5000, the performance of the q-net is: 0.0, the loss is: 0.12943997979164124
Episode: 106/5000, the performance of the q-net is: 0.0, the loss is: 0.10835181176662445
Episode: 107/5000, the performance of the q-net is: 0.0, the loss is: 0.11646077781915665
Episode: 108/5000, the performance of the q-net is: 0.0, the loss is: 0.10373982787132263
Episode: 109/5000, the performance of the q-net is: 0.0, the loss is: 0.10603200644254684
Episode: 110/5000, the performance of the q-net is: 0.0, the loss is: 0.11984729766845703
Episode: 111

We can now compare the q-table and the q-net:

In [74]:
def compare_q(q_net: Sequential, q_sa: np.array):
    for s in range(16):
        s_tensor = tf.convert_to_tensor([s], dtype=tf.float32)
        q_values = q_net.predict(s_tensor)
        print(f"State {s}: \n    q-table: {np.round(q_sa[s],2)} \n    q-net: {np.round(q_values, 2)}")
compare_q(q_net_model, q_table)

State 0: 
    q-table: [0.54 0.39 0.24 0.29] 
    q-net: [[ 0.19 -0.07 -0.17  0.01]]
State 1: 
    q-table: [0.   0.06 0.06 0.36] 
    q-net: [[ 0.1  -0.11 -0.19  0.01]]
State 2: 
    q-table: [0.18 0.15 0.04 0.15] 
    q-net: [[-0.04 -0.18 -0.23 -0.01]]
State 3: 
    q-table: [0.   0.01 0.01 0.15] 
    q-net: [[-0.15 -0.2  -0.24 -0.07]]
State 4: 
    q-table: [0.64 0.01 0.01 0.38] 
    q-net: [[ 0.23 -0.17 -0.11 -0.4 ]]
State 5: 
    q-table: [0. 0. 0. 0.] 
    q-net: [[ 0.06 -0.3  -0.15 -0.55]]
State 6: 
    q-table: [0.13 0.   0.   0.  ] 
    q-net: [[-0.15 -0.61 -0.15 -0.84]]
State 7: 
    q-table: [0. 0. 0. 0.] 
    q-net: [[-0.43 -1.24 -0.27  0.03]]
State 8: 
    q-table: [0.12 0.1  0.02 0.59] 
    q-net: [[ 0.07 -0.17 -0.05  0.31]]
State 9: 
    q-table: [0.11 0.73 0.11 0.09] 
    q-net: [[ 0.1   0.36  0.08 -0.07]]
State 10: 
    q-table: [0.94 0.   0.   0.06] 
    q-net: [[ 0.32  0.18  0.19 -0.12]]
State 11: 
    q-table: [0. 0. 0. 0.] 
    q-net: [[ 0.32  0.09  0.28 -0.11]]
St