# Deep Reinforcement Learning Agent

Helpful explanations using the CartPole example:
- https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
- https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial

## Installing dependencies

Note: tf-agents 0.19.0 requires typing-extensions==4.5.0, whereas ipython requires version 4.6.0. This can cause error messages on `Restart`.  
`Run All` still works.

In [23]:
%pip install tf_keras==2.15.0
%pip install tf-agents[reverb]==0.19.0
%pip install matplotlib

In [24]:
import os
# Keep using keras-2 (tf-keras) rather than keras-3 (keras).
os.environ['TF_USE_LEGACY_KERAS'] = '1'
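The flag is generally expected to be set before TensorFlow is imported for the first time, which is why this cell comes before the imports. A minimal sketch of a guard for that ordering, using a hypothetical helper `use_legacy_keras` that is not part of TensorFlow or TF-Agents:

```python
import os
import sys


def use_legacy_keras():
    """Set TF_USE_LEGACY_KERAS; return False if TensorFlow was already imported."""
    already_imported = "tensorflow" in sys.modules
    os.environ["TF_USE_LEGACY_KERAS"] = "1"
    return not already_imported


if not use_legacy_keras():
    # Assumption: a kernel restart is the simplest way to get a clean import order.
    print("TensorFlow was already imported; restart the kernel and run this cell first.")
```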

In [25]:
import reverb
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tf_agents import trajectories as ts
from tf_agents.environments import py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.networks import q_network
from tf_agents.agents.dqn import dqn_agent
from tf_agents.utils import common
from tf_agents.policies import boltzmann_policy
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import epsilon_greedy_policy
from tf_agents.policies import q_policy
from tf_agents.replay_buffers import reverb_replay_buffer
from tf_agents.replay_buffers import reverb_utils
from tf_agents.drivers import py_driver


## Defining the environment

In [26]:
class LoginEnv(py_environment.PyEnvironment):
    def __init__(self):
        
        # State features: correct password (0 or 1), time between login attempts (int), failed-password counter (int), last action (int)
        self._observation_spec = array_spec.BoundedArraySpec(
                                shape=(4,), dtype=np.int32, minimum=0, name='observation')
        
        # Actions: 0 = do not lock, 1 = lock for 30 s, 2 = lock for 1 min, 3 = lock for 3 min, 4 = lock permanently
        self._action_spec = array_spec.BoundedArraySpec(
                                shape=(), dtype=np.int32, minimum=0, maximum=4, name='action')
        
        # Initialize internal state variables
        self._state = np.array([0, 0, 0, 0], dtype=np.int32)
        self._episode_ended = False
    
    def _reset(self):
        self._state[0] = np.random.choice([0, 1])  # Correct password: 0 or 1
        self._state[1] = np.random.randint(0, 3600)  # Time between login attempts
        self._state[2] = np.random.randint(0, 11)  # Failed-password counter
        self._state[3] = np.random.randint(1, 3) if self._state[2] > 0 else 0  # Last action
        self._episode_ended = False
        return ts.restart(np.array(self._state, dtype=np.int32))

    def _step(self, action):
        if self._episode_ended:
            return self.reset()
        
        reward = 0
        if action == 0:  # Do not lock
            if self._state[0] == 0:
                reward = 1
                self._episode_ended = True
            elif self._state[1] <= 3 or self._state[2] >= 10:
                reward = -1
                self._episode_ended = True
        else:
            reward = 1
            self._episode_ended = True
            
            """ elif action == 1:  # Lock for 30 s
            if self._state[0] == 0:
                reward = -1
                self._episode_ended = True
            elif self._state[1] <= 3 or (3 < self._state[2] <= 6):
                reward = 1
                self._episode_ended = True
        elif action == 2:  # Lock for 1 min
            if self._state[0] == 0:
                reward = -1
                self._episode_ended = True
            elif 6 < self._state[2] <= 9:
                reward = 1
                self._episode_ended = True
        elif action == 3:  # Lock for 3 min
            if self._state[0] == 0:
                reward = -1
                self._episode_ended = True
            elif 9 < self._state[2] < 10:
                reward = 1
                self._episode_ended = True
        elif action == 4:  # Lock permanently
            if self._state[2] >= 10:
                reward = 1
                self._episode_ended = True
            else:
                reward = -1
                self._episode_ended = True """
    
        if self._episode_ended:
            return ts.termination(np.array(self._state, dtype=np.int32), reward)
        else:
            return ts.transition(np.array(self._state, dtype=np.int32), reward=0.0, discount=1.0)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec
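
The reward rule that is currently active in `_step` can be checked in isolation. The following is a minimal sketch in plain Python; `login_reward` is a hypothetical helper that mirrors the active branches only (the disabled per-action rules in the string literal are ignored) and is not part of the notebook:

```python
def login_reward(state, action):
    """Return (reward, episode_ended) for one step, mirroring the active rules.

    state = (password_flag, time_between_attempts, failed_attempts, last_action)
    """
    password_flag, time_between_attempts, failed_attempts, _last_action = state
    if action != 0:
        # Every lock action currently ends the episode with reward 1.
        return 1, True
    if password_flag == 0:
        return 1, True
    if time_between_attempts <= 3 or failed_attempts >= 10:
        return -1, True
    # No rule matched: reward 0 and the episode continues.
    return 0, False


print(login_reward((0, 120, 0, 0), 0))  # (1, True)
print(login_reward((1, 2, 4, 1), 0))    # (-1, True)
print(login_reward((1, 120, 4, 1), 0))  # (0, False)
print(login_reward((1, 120, 4, 1), 3))  # (1, True)
```

In the third case the environment state does not change between steps, so a policy that keeps choosing action 0 in such a state never reaches the end of the episode.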


In [27]:
py_env = LoginEnv()
train_env = tf_py_environment.TFPyEnvironment(py_env)
# Use a separate instance for evaluation so that it does not share state with training.
eval_env = tf_py_environment.TFPyEnvironment(LoginEnv())

## Defining the deep reinforcement learning agent (DQN agent)

In [28]:
# Create the Q-network
fc_layer_params = (100,)
q_net = q_network.QNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=fc_layer_params)

# Configure the DQN agent
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
train_step_counter = tf.Variable(0)
agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

# Initialize the agent
agent.initialize()
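
With `element_wise_squared_loss`, the agent minimizes the squared temporal-difference error of its Q-values. A minimal sketch of that quantity for a single transition in plain Python; the helper names are hypothetical, the target network and batching are left out, and `gamma=1.0` is assumed here:

```python
def td_target(reward, discount, next_q_values, gamma=1.0):
    """One-step target: r + gamma * discount * max over a' of Q(s', a')."""
    return reward + gamma * discount * max(next_q_values)


def squared_td_loss(q_values, action, target):
    """Squared difference between Q(s, a) and the target."""
    td_error = target - q_values[action]
    return td_error ** 2


# Terminal transition: the discount is 0, so the target equals the reward.
target = td_target(reward=1.0, discount=0.0,
                   next_q_values=[0.2, 0.5, 0.1, 0.0, 0.3])
loss = squared_td_loss(q_values=[0.4, 0.0, 0.0, 0.0, 0.0], action=0, target=target)
print(target, loss)
```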


## Defining the policy

In [29]:
#policy = boltzmann_policy.BoltzmannPolicy(agent.policy)
#policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())
policy = epsilon_greedy_policy.EpsilonGreedyPolicy(agent.policy, epsilon=0.1)
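
The idea behind an epsilon-greedy policy can be sketched without TF-Agents: with probability epsilon a random action is chosen, otherwise the action with the highest Q-value. The function below is a hypothetical illustration and not the `EpsilonGreedyPolicy` implementation:

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0, 0.4]  # one value per lock action 0..4
actions = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
# Expected share of the greedy action 1: 0.9 + 0.1 / 5 = 0.92
print(actions.count(1) / len(actions))
```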

## Metrics and evaluation

In [30]:
def compute_average_return(environment, policy, num_episodes=10):
    print(f"Computing the average return over {num_episodes} episodes.")
    total_return = 0.0
    for episode in range(num_episodes):
        print(f"Episode {episode+1}/{num_episodes} starting.")
        time_step = environment.reset()
        episode_return = 0.0
        step_count = 0
        while not time_step.is_last():
            step_count += 1
            action_step = policy.action(time_step)
            time_step = environment.step(action_step.action)
            episode_return += time_step.reward
            print(f"  Step {step_count}, running return: {episode_return}")
        total_return += episode_return
        print(f"Episode {episode+1} finished. Return: {episode_return}")
    average_return = total_return / num_episodes
    print(f"Average return after {num_episodes} episodes: {average_return}")
    # The TF environment has batch size 1, so the result is a tensor of shape (1,);
    # return it as a plain Python float.
    return float(average_return.numpy()[0])

## Replay buffer

In [31]:
# Create a replay buffer table
table_name = 'replay_buffer'
replay_buffer_signature = tensor_spec.from_spec(
    agent.collect_data_spec)
replay_buffer_signature = tensor_spec.add_outer_dim(
    replay_buffer_signature)

table = reverb.Table(
    table_name,
    max_size=100000,
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=replay_buffer_signature)

# Create a replay buffer server
reverb_server = reverb.Server([table])

replay_buffer = reverb_replay_buffer.ReverbReplayBuffer(
    agent.collect_data_spec,
    table_name=table_name,
    sequence_length=2,
    local_server=reverb_server)

rb_observer = reverb_utils.ReverbAddTrajectoryObserver(
    replay_buffer.py_client,
    table_name,
    sequence_length=2)

[reverb/cc/platform/tfrecord_checkpointer.cc:162]  Initializing TFRecordCheckpointer in /tmp/tmpudh_x7yy.
[reverb/cc/platform/tfrecord_checkpointer.cc:565] Loading latest checkpoint from /tmp/tmpudh_x7yy
[reverb/cc/platform/default/server.cc:71] Started replay server on port 42887
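
The table above combines a size limit, uniform sampling and first-in-first-out removal. A toy version of that behaviour in plain Python, as a sketch only; `UniformFifoBuffer` is a hypothetical class and says nothing about how Reverb is implemented:

```python
import random
from collections import deque


class UniformFifoBuffer:
    """Toy replay buffer: uniform sampling, oldest item removed first."""

    def __init__(self, max_size):
        # A bounded deque drops its oldest item once it is full.
        self._items = deque(maxlen=max_size)

    def add(self, item):
        self._items.append(item)

    def sample(self, batch_size, rng=random):
        if not self._items:
            # Simplification: the toy buffer fails instead of waiting for data.
            raise ValueError("buffer is empty")
        # Sampling with replacement, so an item may appear more than once.
        return [rng.choice(self._items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)


buffer = UniformFifoBuffer(max_size=3)
for step in range(5):
    buffer.add(step)
print(len(buffer), sorted(set(buffer.sample(64))))
```

After five insertions into a buffer of size 3, only the three newest items can be sampled.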


## Data collection

In [32]:
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=64,
    num_steps=2).prefetch(3)

iterator = iter(dataset)
print(iterator)

<tensorflow.python.data.ops.iterator_ops.OwnedIterator object at 0x7ab85ecd7bb0>


[reverb/cc/platform/default/server.cc:84] Shutting down replay server
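
With `num_steps=2`, every sampled item contains two adjacent steps, which gives the agent both the current and the next observation for its update. A plain-Python sketch of that pairing; `to_transitions` and the dictionary layout are hypothetical and only illustrate the idea:

```python
def to_transitions(steps):
    """Pair adjacent steps into (obs, action, reward, next_obs) tuples."""
    return [
        (cur["obs"], cur["action"], cur["reward"], nxt["obs"])
        for cur, nxt in zip(steps, steps[1:])
    ]


steps = [
    {"obs": (1, 120, 4, 1), "action": 0, "reward": 0.0},
    {"obs": (1, 120, 4, 1), "action": 2, "reward": 1.0},
    {"obs": (0, 30, 0, 0), "action": 0, "reward": 1.0},
]
for transition in to_transitions(steps):
    print(transition)
```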


## Training the agent

In [33]:

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)
print("Optimized training.")

# Reset the train step.
agent.train_step_counter.assign(0)
print("reset")

# Evaluate the agent's policy once before training.
avg_return = compute_average_return(eval_env, agent.policy, 10)
returns = [avg_return]
print("evaluated")
      
# PyDriver runs a Python policy in a Python environment, so collection uses
# its own unbatched LoginEnv instance rather than the batched train_env.
collect_env = LoginEnv()
time_step = collect_env.reset()
print("reset")

# Create a driver to collect experience.
collect_policy = py_tf_eager_policy.PyTFEagerPolicy(
    agent.collect_policy,
    use_tf_function=True)
collect_driver = py_driver.PyDriver(
    collect_env,
    collect_policy,
    [rb_observer],
    max_steps=1)

# Fill the replay buffer before the first sample; sampling from an empty
# table waits until data arrives. 100 steps is an assumed value.
initial_collect_driver = py_driver.PyDriver(
    collect_env,
    collect_policy,
    [rb_observer],
    max_steps=100)
time_step, _ = initial_collect_driver.run(time_step)

import time

for _ in range(20):
  # Collect a step and save it to the replay buffer.
  time_step, _ = collect_driver.run(time_step)

  print("Sampling a batch...")
  start_time = time.time()  # Start time of the sampling
  experience, unused_info = next(iterator)
  duration = time.time() - start_time  # Duration of the sampling
  print(f"Batch sampled in {duration:.2f} seconds")

  # Batch size of the sample; len(experience) would count the fields of the
  # Trajectory tuple instead.
  num_experiences = experience.reward.shape[0]
  print(f"Number of sampled experiences: {num_experiences}")

  print("Updating network...")
  train_loss = agent.train(experience).loss
  print(f"Training loss: {train_loss}")

  step = agent.train_step_counter.numpy()

  if step % 200 == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % 1000 == 0:
    avg_return = compute_average_return(eval_env, agent.policy, 10)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

Optimized training.
reset
Computing the average return over 10 episodes.
Episode 1/10 starting.
  Step 1, running return: [1.]
Episode 1 finished. Return: [1.]
Episode 2/10 starting.
  Step 1, running return: [1.]
Episode 2 finished. Return: [1.]
Episode 3/10 starting.
  Step 1, running return: [1.]
Episode 3 finished. Return: [1.]
Episode 4/10 starting.
  Step 1, running return: [1.]
Episode 4 finished. Return: [1.]
Episode 5/10 starting.
  Step 1, running return: [1.]
Episode 5 finished. Return: [1.]
Episode 6/10 starting.
  Step 1, running return: [1.]
Episode 6 finished. Return: [1.]
Episode 7/10 starting.
  Step 1, running return: [1.]
Episode 7 finished. Return: [1.]
Episode 8/10 starting.
  Step 1, running return: [1.]
Episode 8 finished. Return: [1.]
Episode 9/10 starting.
  Step 1, runn

[reverb/cc/client.cc:165] Sampler and server are owned by the same process (15677) so Table replay_buffer is accessed directly without gRPC.
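
The logging and evaluation conditions in the loop depend on the train step counter. Assuming each loop iteration performs exactly one training step, the following sketch lists the steps at which a condition `step % interval == 0` fires; `triggered_steps` is a hypothetical helper:

```python
def triggered_steps(num_iterations, interval):
    """Training steps (counted from 1) at which step % interval == 0."""
    return [step for step in range(1, num_iterations + 1) if step % interval == 0]


print(triggered_steps(20, 200))    # []
print(triggered_steps(20, 1000))   # []
evaluations = triggered_steps(20000, 1000)
print(evaluations[:3], len(evaluations))  # [1000, 2000, 3000] 20
```

With 20 iterations neither condition is ever met, so `returns` keeps only the value from the evaluation before training. With 20000 iterations there would be 20 further evaluations, 21 values in total.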


## Visualization

In [None]:
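# One x value per entry in `returns`: the initial evaluation plus one every 1000 training steps.
iterations = [i * 1000 for i in range(len(returns))]
plt.plot(iterations, returns, marker='o')
plt.ylabel('Average Return')
plt.xlabel('Iterations')
# Episode returns in LoginEnv lie between -1 and 1.
plt.ylim(-1.1, 1.1)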
