In [1]:
import gym
import tensorflow as tf
import tensorflow.keras as keras

Choose an action using the Epsilon-Greedy Exploration Strategy
In the Epsilon-Greedy Exploration strategy, the agent chooses a random action with probability epsilon and exploits the best known action with probability 1 - epsilon.

Both the Main model and the Target model take a state as input and output one predicted Q-value per available action. The action with the largest predicted Q-value is the best known action at that state.
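The selection rule described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's own implementation: the helper name `choose_action` is hypothetical, and it assumes the Q-values for the current state are already available as a plain list with one entry per action.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy selection over a list of predicted Q-values (one per action).

    `choose_action` is a hypothetical helper, not part of gym or Keras.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the largest predicted Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the agent always exploits
greedy_action = choose_action([0.1, 0.9], epsilon=0.0)
```

With a Keras model, the list of Q-values would typically come from a single-state prediction such as `model.predict(state.reshape(1, -1))[0]`.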

After choosing an action, it’s time for the agent to perform the action and update the Main and Target networks according to the Bellman equation. Deep Q-Learning agents use Experience Replay to learn about their environment and update the Main and Target networks.
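As a rough sketch of the Bellman update mentioned above, the one-step Q-learning target for a single stored experience could be computed as follows. The function name `bellman_target` and the discount factor `gamma=0.99` are assumptions made for illustration; the text does not specify them. In the standard DQN formulation, the next-state Q-values come from the Target network.

```python
def bellman_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a').

    next_q_values: Q-values predicted for the next state (assumed here to
    come from the Target network). gamma is an assumed discount factor.
    """
    if done:
        # Terminal state: there is no future reward to bootstrap from
        return reward
    return reward + gamma * max(next_q_values)

# Example: reward 1.0, best next-state Q-value 2.0, discount 0.5
target = bellman_target(1.0, [0.5, 2.0], done=False, gamma=0.5)
```

The Main network is then trained so that its predicted Q-value for the action actually taken moves toward this target.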

To summarize, the Main network samples a batch of past experiences and trains on it every 4 steps. The Main network's weights are then copied to the Target network every 100 steps.
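The schedule summarized above might be wired together roughly like this. It is a sketch under assumptions: `run_schedule`, `train_fn`, and `sync_fn` are hypothetical names, the batch size and buffer size are not given in the text, and placeholder tuples stand in for real transitions.

```python
import random
from collections import deque

TRAIN_EVERY = 4    # from the text: train the Main network every 4 steps
SYNC_EVERY = 100   # from the text: copy Main weights to Target every 100 steps
BATCH_SIZE = 32    # assumption: the text does not give a batch size

def run_schedule(total_steps, train_fn, sync_fn, buffer_size=10_000):
    """Drive the replay / train / sync schedule using placeholder transitions."""
    replay = deque(maxlen=buffer_size)  # oldest experiences are dropped first
    for step in range(1, total_steps + 1):
        # Placeholder for a real (state, action, reward, next_state, done) tuple
        replay.append(("state", 0, 0.0, "next_state", False))
        if step % TRAIN_EVERY == 0 and len(replay) >= BATCH_SIZE:
            # Train the Main network on a random batch of past experiences
            train_fn(random.sample(list(replay), BATCH_SIZE))
        if step % SYNC_EVERY == 0:
            # Copy the Main network's weights to the Target network
            sync_fn()
    return replay
```

With Keras models, `sync_fn` could be as simple as `lambda: target_model.set_weights(model.get_weights())`, where `target_model` is a second network built with the same architecture.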

In [None]:
def agent(state_shape, action_shape):
    # state_shape: shape tuple of one observation; action_shape: number of discrete actions
    alpha = 0.001  # learning rate
    init = tf.keras.initializers.HeUniform()

    model = keras.Sequential()
    model.add(keras.layers.Dense(24, input_shape=state_shape, activation='relu', kernel_initializer=init))
    model.add(keras.layers.Dense(12, activation='relu', kernel_initializer=init))
    # Linear output layer: one predicted Q-value per action
    model.add(keras.layers.Dense(action_shape, activation='linear', kernel_initializer=init))
    # `learning_rate` replaces the deprecated `lr` argument
    model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(learning_rate=alpha), metrics=['accuracy'])

    return model

In [None]:
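import numpy as np

env = gym.make('CartPole-v0')
# Build the Q-network from the environment's observation shape and action count
model = agent(env.observation_space.shape, env.action_space.n)

for i_episode in range(20):
    obs = env.reset()
    for t in range(1000):
        # Keras expects a batch dimension; take the action with the largest predicted Q-value
        q_values = model.predict(obs.reshape(1, -1), verbose=0)
        action = int(np.argmax(q_values[0]))
        obs, reward, done, info = env.step(action)
        env.render()

        if done:
            print(f"Episode {i_episode} finished after {t+1} timesteps")
            break
env.close()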