# **Gridworld + Q-Learning**

Imagine learning to ride a bike. At first you try things out wildly, fall over, gather experience, and get better bit by bit.

That is exactly the principle behind **reinforcement learning (RL)**, which this notebook takes a closer look at. An AI **agent** independently performs actions in a dynamic **environment** and learns by trial and error a **policy** that maximizes the sum of the **rewards** it receives.

### Key terms at a glance

| Term | Explanation |
|------|-------------|
| **Agent** | makes decisions |
| **Environment** | responds to the agent's actions |
| **State** | observable description of the situation at one point in time |
| **Action** | the agent's decision, intended to change the state |
| **Reward** | numerical feedback after each action |
| **Episode** | sequence of states up to a terminal state |
| **Policy π** | rule by which the agent chooses its actions |
| **Q-value Q(s,a)** | expected return when taking action *a* in state *s* and following the policy afterwards |
| **Exploration ↔ Exploitation** | trying out new actions vs. exploiting known good actions |
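
To make these terms concrete, here is a minimal sketch of the agent-environment loop. The `CorridorEnv` class, its reward values, and the random policy are hypothetical illustrations and not part of this notebook's environments.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: states 0..4, start at 0, goal at the last state."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right; moves are clamped to the corridor
        delta = 1 if action == 1 else -1
        self.state = min(max(self.state + delta, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1   # small step cost, bonus at the goal
        return self.state, reward, done

def run_episode(env, policy, max_steps=1000):
    """Run one episode and return (sum of rewards, number of steps)."""
    state = env.reset()
    total, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = policy(state)                   # the agent decides
        state, reward, done = env.step(action)   # the environment responds
        total += reward
        steps += 1
    return total, steps

rng = random.Random(0)
random_policy = lambda state: rng.choice([0, 1])   # pure exploration
episode_return, episode_steps = run_episode(CorridorEnv(), random_policy)
print(f"return={episode_return:.1f} after {episode_steps} steps")
```

A policy that always moves right reaches the goal in four steps here, so any learned policy should approach that behaviour.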

### Structure of the notebook

1. **Small interactive environment**: You steer an agent yourself through a 5 × 5 grid using buttons and experience how limited information makes planning harder.  
2. **Q-learning in a larger environment**: We let an agent explore a 6 × 6 world with different tile types (ice ❄️, bumpers 🔴, pits 🕳️) using tabular Q-learning and find an optimal policy.  
3. **Extended environment & your own experiments**: Here you can change all environment and algorithm parameters, design your own layouts, and observe how the learning behaviour changes.


In [1]:
# Install the packages into the current Jupyter kernel
import sys
!{sys.executable} -m pip install -q numpy
!{sys.executable} -m pip install -q moviepy
!{sys.executable} -m pip install -q "ipywidgets==8.1.6"
!{sys.executable} -m pip install -q "jupyterlab_widgets==3.*"
!{sys.executable} -m pip install -q pillow


[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip



In [2]:
# Check the loaded versions
import importlib, sys
print("Python", sys.version.split()[0])

for pkg in ("ipywidgets", "jupyterlab_widgets", "moviepy"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg:<18} {mod.__version__}")
    except ModuleNotFoundError:
        print(f"{pkg:<18} -- not installed --")

Python 3.13.1
ipywidgets         8.1.6
jupyterlab_widgets 3.0.14
moviepy            2.1.1


In [3]:
# Libraries for display and user input
import ipywidgets as widgets
from IPython.display import display, clear_output, Video
from PIL import Image, ImageDraw, ImageFont
from moviepy import ImageSequenceClip

# Libraries for reinforcement learning
import numpy as np

# Helper libraries
import random
import time
import copy
import os
import tempfile
import io

In [4]:
# Symbol table for the tiles
SYMBOLS = {
    "agent"      : "🤖",
    "start"      : "🔰",
    "goal"       : "🚩",
    "empty"      : "",
    "wall"       : "🧱",  #"⬛️",
    "pit"        : "🕳️",
    "ice"        : "❄️",  #"🧊",
    "bumper"     : "🪀",  #"🔴",
    "sticky"     : "🟫",
    "wind"       : "💨",
    "conveyor_U" : "⬆️",
    "conveyor_D" : "⬇️",
    "conveyor_L" : "⬅️",
    "conveyor_R" : "➡️",
    "trampoline" : "🦘",  #"↕️",
    "portal"     : "🌀",
    "collapse"   : "⚠️",
    "toll"       : "💰",
    "battery"    : "🔋",
    "gem"        : "💎",
}

def _symbol_for(env, r, c):
    """Return the SYMBOLS key for position (r, c) in *env*."""
    pos = (r, c)
    if pos == env.start_pos: return "start"
    if pos == env.goal_pos: return "goal"
    if pos in getattr(env, "wall_positions",       set()): return "wall"
    if pos in getattr(env, "pit_positions",        set()): return "pit"
    if pos in getattr(env, "ice_positions",        set()): return "ice"
    if pos in getattr(env, "bumper_positions",     set()): return "bumper"
    if pos in getattr(env, "sticky_positions",     set()): return "sticky"
    if pos in getattr(env, "wind_positions",       set()): return "wind"
    if pos in getattr(env, "trampoline_positions", set()): return "trampoline"
    if pos in getattr(env, "portal_lookup",       dict()): return "portal"
    if pos in getattr(env, "collapse_positions",   set()):  return "collapse"
    if pos in getattr(env, "already_collapsed",    set()): return "pit"
    if pos in getattr(env, "toll_positions",       set()):  return "toll"
    if pos in getattr(env, "battery_positions",    set()):  return "battery"
    if pos in getattr(env, "gem_positions",        set()):  return "gem"

    # conveyor
    conv_dir = getattr(env, "conveyor_map", {}).get(pos)
    if conv_dir:
        return f"conveyor_{conv_dir}"

    # portal
    for a,b in getattr(env, "portal_pairs", []):
        if pos in (a,b): return "portal"

    return "empty"

def _glyph_for(env, r, c):
    key = _symbol_for(env, r, c)
    return SYMBOLS.get(key)

In [5]:
# Helper functions for video frames
from pathlib import Path

# OS-specific colour emoji font
def _default_emoji_font(px):
    if sys.platform.startswith("win"):
        fp = Path(r"C:\Windows\Fonts\seguiemj.ttf")
    elif sys.platform.startswith("linux"):
        fp = Path("/usr/share/fonts/truetype/noto/NotoColorEmoji.ttf")
    else:
        raise OSError("Add a colour-emoji font path for your OS")
    return ImageFont.truetype(str(fp), px)

def _emoji_frame(rows, cols, cell_px, border, symbols, agent_pos, font_path=None, agent_glyph="🤖"):
    """
    Draw the grid environment from the given arguments.

    symbols:     dict {(row, col): glyph}
    agent_pos:   (row, col)
    """
    W, H = cols*cell_px + 2*border, rows*cell_px + 2*border
    img  = Image.new("RGBA", (W, H), "white")
    draw = ImageDraw.Draw(img)

    # grid lines
    for r in range(rows + 1):
        y = border + r*cell_px
        draw.line([(border, y), (W-border, y)], fill="grey")
    for c in range(cols + 1):
        x = border + c*cell_px
        draw.line([(x, border), (x, H-border)], fill="grey")

    # choose font
    font = ImageFont.truetype(str(font_path), int(cell_px*0.8)) if font_path else _default_emoji_font(int(cell_px*0.8))

    # draw emojis + agent
    if agent_pos:
        symbols = dict(symbols)
        symbols[agent_pos] = agent_glyph

    for (r, c), glyph in symbols.items():
        gx, gy = border + c*cell_px + cell_px//2, \
                 border + r*cell_px + cell_px//2
        draw.text((gx, gy),
                  glyph.replace("\uFE0F", ""),    # drop VS-16
                  font=font,
                  anchor="mm",
                  embedded_color=True)
    return img

In [6]:
# Helper functions for the resolution
_PRESET_RES = {"720p": (1280, 720), "1080p": (1920, 1080)}

def _parse_resolution(res):
    """
    Accepts '720p', '1080p', a (width, height) tuple, or None.
    """
    if res is None:
        return None
    if isinstance(res, str):
        try:
            return _PRESET_RES[res.lower()]
        except KeyError:
            raise ValueError(f"Unknown preset '{res}'. Use one of {list(_PRESET_RES)} or pass a (width, height) tuple.")
    if len(res) == 2:
        return tuple(map(int, res))
    raise ValueError("Resolution must be '720p' / '1080p' / (width, height) / None")

## Part 1: Small interactive environment - explore the world yourself 👣

You now play the **agent** in a 5 × 5 world.  
The catch: you **only see where you have already been**, just as a real RL agent initially has to feel its way through its world "in the dark".

### Tile types & rewards
| Tile | Symbol | Effect | Reward |
|------|--------|--------|--------|
| Ice | 🧊 | 50 % chance of slipping in a random direction | 0 |
| Bumper | 🔴 | Throws you 3 tiles back in the direction you came from | 0 |
| Pit | 🕳️ | Episode ends immediately | −1 |
| Goal | 🚩 | Episode ends | +1 |

> 🔎 **Think about it for a moment:**  
> How would *you* decide if you did not know the rewards?  
> What dilemma does the agent face between **exploring** (risking the ice) and **exploiting** (taking the safe path)?

### Rules
- You only ever see tiles that you have already *visited*.
- Move with the buttons `Oben` (up), `Unten` (down), `Links` (left), `Rechts` (right).
- As soon as you reach the goal tile, you have won!
- The "Zurücksetzen" (reset) button resets the agent and generates a new random layout, i.e. you start a new episode.
- The "Umgebung aufdecken" (reveal environment) button shows the complete environment, e.g. if you fall into pits too often.

In [7]:
class InteractiveGridEnv:
    """
    Interactive, partially observable grid environment
    """
    def __init__(self, rows=6, cols=6, pit_frac=0.10, ice_frac=0.15, bumper_frac=0.10, reveal_full=False, seed=None):
        if seed is not None:
            random.seed(seed)

        self.rows = rows
        self.cols = cols

        self.start_pos = (0, 0)
        # choose the goal position at random
        cells = [(r, c) for r in range(self.rows) for c in range(self.cols) if (r, c) != self.start_pos]
        self.goal_pos = random.choice(cells)

        self.agent_pos = self.start_pos
        self.visited = set([self.start_pos])

        # canvas object for the visualization
        self.cell_size = 25
        self.canvas = widgets.Image(format='png', layout={'width': '25%'})

        self.pit_frac = pit_frac
        self.ice_frac = ice_frac
        self.bumper_frac = bumper_frac
        
        self.reveal_full = reveal_full
        self.done = False
        self.last_event = None   # "goal" | "pit" | None

        # ----- distribute the tile variants -------------------------------------------------
        self.tile_map = {}  # (r,c) -> {"ice","bumper","pit"}
        pool = [p for p in cells if p != self.goal_pos]
        random.shuffle(pool)

        def take(frac):
            n = int(frac * len(pool))
            picked, rest = pool[:n], pool[n:]
            return picked, rest

        pits,    pool = take(pit_frac)
        ice,     pool = take(ice_frac)
        bumpers, pool = take(bumper_frac)

        self.tile_map.update({p: "pit"    for p in pits})
        self.tile_map.update({p: "ice"    for p in ice})
        self.tile_map.update({p: "bumper" for p in bumpers})

    # ---------------------------------------------------------------------------
    # Helper functions
    # ---------------------------------------------------------------------------
    def _move(self, direction):
        r, c = self.agent_pos
        if direction == "Oben":
            r = max(r - 1, 0)
        elif direction == "Unten":
            r = min(r + 1, self.rows - 1)
        elif direction == "Links":
            c = max(c - 1, 0)
        elif direction == "Rechts":
            c = min(c + 1, self.cols - 1)
        self.agent_pos = (r, c)
        self.visited.add(self.agent_pos)

    def _propagate_effects(self, incoming_dir):
        """
        Apply ice/bumper effects until no new effect occurs or the agent reaches the goal or a pit; moves are clamped at the edge of the board.
        """
        while not self.done:
            # first check whether the goal has been reached
            if self.agent_pos == self.goal_pos:
                self.done = True
                self.last_event = "goal"
                return

            tile = self.tile_map.get(self.agent_pos)

            if tile == "pit":
                self.done = True
                self.last_event = "pit"
                return

            if tile == "ice" and random.random() < 0.5:
                slip_dir = random.choice(["Oben", "Unten", "Links", "Rechts"])
                self._move(slip_dir)
                #self.visited.add(self.agent_pos)
                incoming_dir = slip_dir
                continue   # keep looping to evaluate a possible new tile effect

            if tile == "bumper":
                opposite = {"Oben": "Unten", "Unten": "Oben", "Links": "Rechts", "Rechts": "Links"}[incoming_dir]
                for _ in range(3):
                    self._move(opposite)
                    #self.visited.add(self.agent_pos)

                    # early termination check after each bounce step
                    if self.agent_pos == self.goal_pos:
                        self.done = True
                        self.last_event = "goal"
                        return
                    if self.tile_map.get(self.agent_pos) == "pit":
                        self.done = True
                        self.last_event = "pit"
                        return
                incoming_dir = opposite
                continue   # keep looping to evaluate a possible new tile effect

            return

    # ---------------------------------------------------------------------------
    # RL functions
    # ---------------------------------------------------------------------------
    def reset(self):
        self.agent_pos = self.start_pos
        self.visited = set([self.start_pos])
        
        cells = [(r, c) for r in range(self.rows) for c in range(self.cols) if (r, c) != self.start_pos]
        self.goal_pos = random.choice(cells)

        self.tile_map = {}  # (r,c) -> {"ice","bumper","pit"}
        pool = [p for p in cells if p != self.goal_pos]
        random.shuffle(pool)

        def take(frac):
            n = int(frac * len(pool))
            picked, rest = pool[:n], pool[n:]
            return picked, rest

        pits,    pool = take(self.pit_frac)
        ice,     pool = take(self.ice_frac)
        bumpers, pool = take(self.bumper_frac)

        self.tile_map.update({p: "pit"    for p in pits})
        self.tile_map.update({p: "ice"    for p in ice})
        self.tile_map.update({p: "bumper" for p in bumpers})
        
        self.done = False
        self.last_event = None
        self.render()

    def step(self, action):
        if self.done:
            return self.agent_pos, True

        # main movement step
        self._move(action)
        #self.visited.add(self.agent_pos)

        # cascade logic
        self._propagate_effects(action)

        self.render()
        return self.agent_pos, self.done

    # ---------------------------------------------------------------------------
    # Render functions
    # ---------------------------------------------------------------------------
    def render(self):
        cs, R, C = self.cell_size, self.rows, self.cols
        W, H     = C*cs, R*cs
        img  = Image.new("RGBA", (W, H), "white")
        draw = ImageDraw.Draw(img)
        
        # choose a colour-emoji font based on the operating system
        if sys.platform.startswith("win"):
            font_path = Path(r"C:\Windows\Fonts\seguiemj.ttf")
        elif sys.platform.startswith("linux"):
            font_path = Path("/usr/share/fonts/truetype/noto/NotoColorEmoji.ttf")
        else:
            raise OSError("add a colour-emoji font path for your OS")
        font = ImageFont.truetype(str(font_path), int(cs*0.8))

        # helper to draw emojis centred in the tiles
        def _draw_centered(glyph, cx, cy):
            g = glyph.replace("\uFE0F", "")
            l, t, rbb, bbb = draw.textbbox((0, 0), g, font=font, embedded_color=True)
            w, h = rbb - l, bbb - t
            draw.text((cx - w/2 - l, cy - h/2 - t), g, font=font, embedded_color=True)

        # draw the tile background colours
        for r in range(R):
            for c in range(C):
                x0, y0, x1, y1 = c*cs, r*cs, (c+1)*cs-1, (r+1)*cs-1
                if (not self.reveal_full) and ((r, c) not in self.visited):
                    draw.rectangle([x0, y0, x1, y1], fill="#D3D3D3")
                else:
                    draw.rectangle([x0, y0, x1, y1], fill="white")

        # draw emojis
        for r in range(R):
            for c in range(C):
                key = ("goal" if (r, c) == self.goal_pos else self.tile_map.get((r, c)))
                g = SYMBOLS.get(key)
                if g and ((self.reveal_full) or ((r, c) in self.visited)):
                    _draw_centered(g, c*cs + cs/2, r*cs + cs/2)
        ar, ac = self.agent_pos
        _draw_centered(SYMBOLS["agent"], ac*cs + cs/2, ar*cs + cs/2)

        # draw the tile borders
        for r in range(R + 1):
            y = min(r*cs, H-1)
            draw.line([(0, y), (W-1, y)], fill="grey")
        for c in range(C + 1):
            x = min(c*cs, W-1)
            draw.line([(x, 0), (x, H-1)], fill="grey")

        # flush to widgets.Image
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        self.canvas.value = buf.getvalue()


# ══════════════════════════  User Interface WIDGETS  ══════════════════════════ #

env = InteractiveGridEnv(rows=5, cols=5)

# buttons
up_btn    = widgets.Button(description="Oben",    layout={'width': '60px'})
down_btn  = widgets.Button(description="Unten",  layout={'width': '60px'})
left_btn  = widgets.Button(description="Links",  layout={'width': '60px'})
right_btn = widgets.Button(description="Rechts", layout={'width': '60px'})
reset_btn = widgets.Button(description="Zurücksetzen", button_style="warning")
reveal_btn = widgets.ToggleButton(value=False, description="Umgebung aufdecken")

# status message
status = widgets.HTML(value="", layout={'height':'30px', 'margin':'4px 0 0 0'})

def _toggle_moves(disable=True):
    """Enable or disable the movement buttons."""
    for b in (up_btn, down_btn, left_btn, right_btn):
        b.disabled = disable

output = widgets.Output()

@output.capture(clear_output=True)
def on_move(btn):
    if btn is reset_btn:
        env.reset()
        status.value = ""
        _toggle_moves(False)
        display(ui)
        return

    action = btn.description
    _, done = env.step(action)
    
    if done:
        _toggle_moves(True)
        if env.last_event == "goal":
            status.value = "<b>You reached the goal! Press \"Zurücksetzen\" to start again.</b>"
        elif env.last_event == "pit":
            status.value = "<b>You fell into a pit! Press \"Zurücksetzen\" to start again.</b>"
    display(ui)

def on_reveal(change):
    env.reveal_full = change["new"]
    env.render()

# connect the button callbacks
for b in (up_btn, down_btn, left_btn, right_btn, reset_btn):
    b.on_click(on_move)
reveal_btn.observe(on_reveal, "value")

# layout
btn_row = widgets.HBox([left_btn, up_btn, down_btn, right_btn, reset_btn, reveal_btn])
ui = widgets.VBox([btn_row, env.canvas, status])

env.render()
display(ui)


# Part 2: Larger environment and Q-learning - let the computer learn 🤖

Now it gets exciting: we build a **6 × 6 gridworld** in which the agent has to find the goal tile 🚩 without falling into pits 🕳️ or spending too much time in risky areas. One possible layout (you can change it later) could look like this:

<table style="border-collapse: collapse;">
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🔰</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🔴</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🕳️</td>
</tr>
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">❄️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">❄️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
</tr>
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🔴</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🕳️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
</tr>
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">❄️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
</tr>
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🕳️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">❄️</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🔴</td>
</tr>
<tr>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">.</td>
<td style="border: 1px solid #ccc; width: 34px; height: 34px; text-align: center;">🚩</td>
</tr>
</table>

* 🔰 = start, 🚩 = goal  
* Symbols as described above (bumper 🔴, ice ❄️, …).

---

## Q-Learning

In **Q-learning** we approximate the expected return $Q(s,a)$ for every state-action pair $(s,a)$.

### Q-learning in 3 steps
1. **Prediction:** Look up Q(*s,a*) in the table: "What do I expect to get?"  
2. **Action & feedback:** Take action *a*, receive reward *r* and the new state *s′*  
3. **Correction:** Update the table using the following formula:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)$$
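
The update rule can be sketched as a small function. This is a minimal illustration, not the notebook's implementation: the name `q_update` and the `done` flag are assumptions. The flag reflects the common convention that a terminal state has no future value, which the formula above does not spell out.

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update in place and return the TD error.

    Q is indexed as Q[state][action]; nested lists and 2-D NumPy arrays both work.
    """
    # Assumption: no bootstrapping from terminal states, their future value is zero.
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]   # target minus prediction
    Q[s][a] += alpha * td_error                  # move the estimate towards the target
    return td_error

Q = [[0.0] * 4 for _ in range(6 * 6)]   # 36 states, 4 actions, initialized with zeros
q_update(Q, s=0, a=3, r=-0.1, s_next=1, done=False)
print(Q[0][3])
```

Because the table starts at zero, the first update moves the entry by `alpha` times the reward only; later updates also pull in the estimate of the next state.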


### Hyperparameter cheat sheet
| Name | Meaning | Typical range | Effect |
|------|---------|---------------|--------|
| **α** (*learning rate*) | Weight of new knowledge relative to old | 0.1 – 0.5 | a high α learns fast but can "overshoot" |
| **γ** (*discount factor*) | How far does the agent look into the future? | 0.9 – 0.99 | a higher γ puts more weight on long-term rewards |
| **ε** (*exploration rate*) | Probability of a random action (ε-greedy) | 0.1 – 0.3 | a higher ε means more exploring, which helps early in training |

> **Remember:**  
> • High α → fast but risky updates  
> • High γ → planning far into the future  
> • High ε → lots of exploring, low ε → exploiting what is known
> 
> 💡 *Tip:* Reduce ε with every episode (*annealing*): `ε = max(ε_min, ε_start · decay^episode)`, so that curiosity gradually turns into exploitation.
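
The annealing formula and ε-greedy action selection can be sketched as follows. The function names and the default values (`eps_start=0.3`, `eps_min=0.05`, `decay=0.99`) are assumptions chosen for illustration, and the random tie-breaking is one possible design choice, not something the notebook prescribes.

```python
import random

def epsilon_for(episode, eps_start=0.3, eps_min=0.05, decay=0.99):
    """Exponentially annealed exploration rate, floored at eps_min."""
    return max(eps_min, eps_start * decay ** episode)

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))          # explore
    best = max(q_row)
    # Break ties at random so an all-zero row does not always yield action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

for episode in (0, 100, 500):
    print(episode, round(epsilon_for(episode), 4))

action = epsilon_greedy([0.0, 1.0, 0.5, 0.2], epsilon_for(500))
print("chosen action:", action)
```

With these assumed defaults the exploration rate starts at 0.3 and eventually settles at the floor of 0.05, so the agent never stops exploring entirely.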


In [8]:
class LargeGridEnv:
    """
    A grid environment with special tile variants:
    - Ice tile 🧊: 50 % chance of slipping in a random direction after the actual move.
    - Bumper 🔴: when the agent enters the tile, it is thrown three tiles back in the direction it came from.
    - Pit 🕳️: the episode ends immediately with a penalty.
    - Goal 🚩: the episode ends with a reward.
    """
    def __init__(self, rows=6, cols=6, ice_positions=((2,2), (3,4)), bumper_positions=((1,4),), pit_positions=((4,1),), goal_position=(5,5),
                 reward_goal=10.0, reward_pit=-10.0, reward_step=-0.1, resolution="720p"):
        self.rows = rows
        self.cols = cols
        self.ice_positions = set(ice_positions)
        self.bumper_positions = set(bumper_positions)
        self.pit_positions = set(pit_positions)

        self.start_pos = (0, 0)
        self.goal_pos = goal_position

        self.reward_goal = reward_goal
        self.reward_pit = reward_pit
        self.reward_step = reward_step

        #self.action_map = {
        #    0: (-1, 0), # up
        #    1: (1, 0),  # down
        #    2: (0, -1), # left
        #    3: (0, 1)   # right
        #}
        self.action_map = {
            0: "up",
            1: "down",
            2: "left",
            3: "right"
        }

        self.rng = np.random.default_rng()

        self.agent_pos = (0, 0)
        self.visited = set([self.agent_pos])
        self.done = False

        self.tile_map = {}  # (r,c) -> {"ice","bumper","pit"}
        self.tile_map.update({p: "pit"    for p in pit_positions})
        self.tile_map.update({p: "ice"    for p in ice_positions})
        self.tile_map.update({p: "bumper" for p in bumper_positions})

        self.cell_size = 25
        # ───────── determine frame, cell, and border sizes ───────────────
        target = _parse_resolution(resolution)

        if target is None:
            self.cell_size = 25
            self.frame_w = cols * self.cell_size
            self.frame_h = rows * self.cell_size
        else:
            self.frame_w, self.frame_h = target   # e.g. (1280, 720)
            self.cell_size   = min(self.frame_w // cols, self.frame_h // rows)
            if self.cell_size == 0:
                raise ValueError("Grid is too large for the selected resolution.")

        # derived geometry
        grid_w, grid_h   = cols*self.cell_size, rows*self.cell_size
        self.offset_x    = (self.frame_w - grid_w) // 2     # ≥ 0 (letter-box)
        self.offset_y    = (self.frame_h - grid_h) // 2

    # ---------------------------------------------------------------------------
    # Helper functions
    # ---------------------------------------------------------------------------
    def to_index(self, row, col):
        """
        Convert (row, col) into a single integer index.
        """
        return row * self.cols + col

    def from_index(self, index):
        """
        Inverse of to_index: given an integer index, return (row, col).
        """
        return (index // self.cols, index % self.cols)
    
    def _move(self, direction):
        r, c = self.agent_pos
        if direction == "up":
            r = max(r - 1, 0)
        elif direction == "down":
            r = min(r + 1, self.rows - 1)
        elif direction == "left":
            c = max(c - 1, 0)
        elif direction == "right":
            c = min(c + 1, self.cols - 1)
        self.agent_pos = (r, c)
        self.visited.add(self.agent_pos)

    def _propagate_effects(self, incoming_dir):
        """
        Apply ice/bumper effects until no new effect occurs or the agent reaches the goal or a pit; moves are clamped at the edge of the board.
        """
        while not self.done:
            # first check whether the goal has been reached
            if self.agent_pos == self.goal_pos:
                self.done = True
                #self.last_event = "goal"
                return

            tile = self.tile_map.get(self.agent_pos)

            if tile == "pit":
                self.done = True
                #self.last_event = "pit"
                return

            if tile == "ice" and random.random() < 0.5:
                slip_dir = random.choice(["up", "down", "left", "right"])
                self._move(slip_dir)
                #self.visited.add(self.agent_pos)
                incoming_dir = slip_dir
                continue

            if tile == "bumper":
                opposite = {"up": "down", "down": "up", "left": "right", "right": "left"}[incoming_dir]
                for _ in range(3):
                    self._move(opposite)
                    #self.visited.add(self.agent_pos)

                    # early termination check after each bounce step
                    if self.agent_pos == self.goal_pos:
                        self.done = True
                        #self.last_event = "goal"
                        return
                    if self.tile_map.get(self.agent_pos) == "pit":
                        self.done = True
                        #self.last_event = "pit"
                        return
                incoming_dir = opposite
                continue

            return

    # ---------------------------------------------------------------------------
    # Reinforcement learning functions
    # ---------------------------------------------------------------------------
    def reset(self):
        """
        Always start in the top-left corner.
        """
        self.agent_pos = (0, 0)
        self.visited = set([self.agent_pos])
        self.done = False
        return self.to_index(*self.agent_pos)

    def step(self, action):
        """
        Take one step in the environment using the given action (0..3).
        """
        if self.done:
            return self.to_index(*self.agent_pos), 0.0, True

        # main movement step
        self._move(self.action_map[action])
        #self.visited.add(self.agent_pos)

        # cascade logic
        self._propagate_effects(self.action_map[action])

        # compute the reward
        tile = self.tile_map.get(self.agent_pos)
        if self.agent_pos == self.goal_pos:
            reward = self.reward_goal
        elif tile == "pit":
            reward = self.reward_pit
        else:
            reward = self.reward_step
        
        #self.render()
        return self.to_index(*self.agent_pos), reward, self.done

    # ---------------------------------------------------------------------------
    # Render functions
    # ---------------------------------------------------------------------------
    def _make_frame(self, *, with_agent=True):
        """
        Pillow image of the current board.
        `with_agent=False` lets you render a background-only frame.
        """
        padding_size = 2   # for padding at borders
        symbols = {
            (r, c): _glyph_for(self, r, c)          
            for r in range(self.rows)
            for c in range(self.cols)
            if _glyph_for(self, r, c) is not None
        }
        agent = self.agent_pos if with_agent else None
        return _emoji_frame(self.rows,
                            self.cols,
                            self.cell_size,
                            padding_size,
                            symbols,
                            agent)

    def render(self, pos):
        """
        Render the environment with the agent placed at `pos`.
        Note: this overwrites `self.agent_pos`.
        """
        self.agent_pos = pos
        return np.asarray(self._make_frame())[:, :, :3]

## Q-Learning Implementation

We create a Q-table of shape `(rows * cols, num_actions)`. The basic steps:

1. **Initialize Q** with zeros or small random values.  
2. **For each episode**:  
    - Reset the environment.  
    - While the episode has not ended:  
        - Choose `action` ε-greedily from Q(*s*, ·).  
        - Observe `next_state` *s′*, `reward` *r*, and `done`.  
        - Update Q:  
          $$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
        - `state = next_state` (*s ← s′*)
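
The ε-greedy choice in the loop above can be sketched in isolation. This is only a minimal illustration: the helper name `epsilon_greedy`, the seeded `np.random.default_rng` generator, and the sample Q-values are assumptions and not part of the notebook.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    q_row   : 1-D array of Q-values for the current state (hypothetical input)
    epsilon : exploration probability in [0, 1]
    rng     : numpy random Generator
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform random action
    return int(np.argmax(q_row))               # exploit: best known action

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.7, 0.2, 0.0])         # made-up Q-values for one state

greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)   # never explores
picks = [epsilon_greedy(q_row, epsilon=1.0, rng=rng) for _ in range(200)]
print(greedy_action, sorted(set(picks)))
```

With ε = 0 the helper always returns the argmax; with ε = 1 it ignores the Q-values entirely. Values in between trade off the two.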


### How does Q adapt? 🔄

Suppose that in *state s* you expect action ↓ to yield **0.4 points**.  
You try it, receive **r = 0.6**, and land in *s′*, whose best estimated value is **0.5**. With γ = 0.95 and α = 0.3:


$$\text{Target} = 0.6 + 0.95 \cdot 0.5 = 1.075$$
$$\Delta = 1.075 - 0.4 = 0.675$$
$$Q_\text{new} = 0.4 + 0.3 \cdot 0.675 = 0.6025$$


> • **Prediction** = 0.4, **target** = 1.075 (with γ = 0.95)  
> • Only a fraction α = 0.3 of the **error** (0.675) is applied  
> • The value rises moderately: *learning, but not over-reacting*
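
The same arithmetic can be checked in a few lines of Python. This is a minimal sketch; the helper name `td_update` is hypothetical, and the numbers are the ones from the example above.

```python
def td_update(q_sa, reward, best_next_q, alpha, gamma):
    """Return (target, td_error, new_q) for one tabular Q-learning update."""
    target = reward + gamma * best_next_q   # what the estimate should move toward
    td_error = target - q_sa                # how far off the current estimate is
    new_q = q_sa + alpha * td_error         # move a fraction alpha toward the target
    return target, td_error, new_q

target, td_error, new_q = td_update(q_sa=0.4, reward=0.6, best_next_q=0.5,
                                    alpha=0.3, gamma=0.95)
print(f"target={target:.4f}, error={td_error:.4f}, new Q={new_q:.4f}")
# prints: target=1.0750, error=0.6750, new Q=0.6025
```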

In [9]:
def q_learning(
        env,
        num_episodes=200,
        alpha=0.1,           # learning rate
        gamma=0.95,          # discount factor
        epsilon=0.1,         # exploration rate (epsilon-greedy strategy)
        max_steps=100,       # limit on steps per episode to prevent endless wandering
        record_interval=20,  # store the trajectory every `record_interval` episodes
        report_interval=20   # print the average episode reward every `report_interval` episodes
):
    """
    Train a tabular Q-learning agent in the given environment.

    Returns the Q-table, the stored trajectories, and the per-episode rewards.
    """
    num_states = env.rows * env.cols
    num_actions = 4  # up/down/left/right

    # Initialize the Q-table (num_states x num_actions)
    Q = np.zeros((num_states, num_actions), dtype=np.float32)

    # Containers for the training data
    stored_trajectories = {}  # key=episode, value=list of visited agent positions
    rewards_history = []

    for episode in range(num_episodes):
        state = env.reset()
        episode_reward = 0.0
        trajectory = [env.agent_pos]

        for t in range(max_steps):
            # Epsilon-greedy action selection
            if np.random.rand() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = np.argmax(Q[state])

            next_state, reward, done = env.step(action)

            episode_reward += reward
            trajectory.append(env.agent_pos)

            # Q-learning update
            best_next_q = np.max(Q[next_state])
            Q[state, action] += alpha * (reward + gamma * best_next_q - Q[state, action])

            state = next_state

            if done:
                break

        rewards_history.append(episode_reward)

        # End of the episode
        # Store the trajectory every 'record_interval' episodes
        if episode % record_interval == 0:
            stored_trajectories[episode] = trajectory[:]

        # Report reward progress
        if episode % report_interval == 0:
            avg_r = np.mean(rewards_history[-report_interval:])
            print(f"Episode {episode}, Avg reward (last {report_interval} ep.): {avg_r:.2f}")

    return Q, stored_trajectories, rewards_history

In [10]:
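# Create the environment
# Two layouts are defined below; the second assignment overwrites the first,
# so comment out the layout you do not want to train on.
# ---------- Hidden cliff ---------------------------------------------------------- #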
large_env = LargeGridEnv(
    rows=6,
    cols=6,
    ice_positions=[(1,0), (1,1), (1,2), (1,3), (1,4), (4,1), (4,2), (4,3), (4,4)],
    bumper_positions=[(3,3)],
    pit_positions=[(2,0), (2,1), (2,2), (2,3), (2,4), (4,5)],
    goal_position=(5,5),
    reward_goal=1.0,
    reward_pit=-1.0,
    reward_step=-0.1,
    resolution="720p"
)
# ---------- Ice corridor ---------------------------------------------------------- #
large_env = LargeGridEnv(
    rows=6,
    cols=6,
    ice_positions=[(0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3), (1,4), (1,5), (3,0), (3,1), (3,2), (3,3), (3,4), (3,5)],  # (1,6) dropped: outside a 6-column grid
    bumper_positions=[(2,2), (2,4)],
    pit_positions=[(0,4), (0,5), (2,0), (2,1), (2,5), (4,0), (4,2), (4,4), (5,0)],
    goal_position=(5,5),
    reward_goal=1.0,
    reward_pit=-1.0,
    reward_step=-0.1,
    resolution="720p"
)

# Train tabular Q-learning
Q, stored_trajectories, rewards_history = q_learning(
    large_env,
    num_episodes=1000,
    alpha=0.1,
    gamma=0.95,
    epsilon=0.1,
    max_steps=100,
    record_interval=200,
    report_interval=50
)

Episode 0, Avg reward (last 50 ep.): -2.00
Episode 50, Avg reward (last 50 ep.): -1.98
Episode 100, Avg reward (last 50 ep.): -1.63
Episode 150, Avg reward (last 50 ep.): -1.30
Episode 200, Avg reward (last 50 ep.): -1.68
Episode 250, Avg reward (last 50 ep.): -1.43
Episode 300, Avg reward (last 50 ep.): -1.06
Episode 350, Avg reward (last 50 ep.): -0.79
Episode 400, Avg reward (last 50 ep.): -0.57
Episode 450, Avg reward (last 50 ep.): -0.87
Episode 500, Avg reward (last 50 ep.): -0.81
Episode 550, Avg reward (last 50 ep.): -0.58
Episode 600, Avg reward (last 50 ep.): -0.47
Episode 650, Avg reward (last 50 ep.): -0.80
Episode 700, Avg reward (last 50 ep.): -0.47
Episode 750, Avg reward (last 50 ep.): -0.76
Episode 800, Avg reward (last 50 ep.): -0.68
Episode 850, Avg reward (last 50 ep.): -0.58
Episode 900, Avg reward (last 50 ep.): -0.69
Episode 950, Avg reward (last 50 ep.): -0.58


## Learning Progress in Time-Lapse 🎞️

The agent's route was stored every **N-th** episode (`record_interval`).  
Watch the videos and look for:

1. **Exploration → exploitation:** Early on the agent wanders aimlessly; later it heads straight for the goal.  
2. **Pit hits:** These should become rarer over time.  
3. **Episode length:** Episodes end sooner as the Q-values converge.
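
These trends can also be checked numerically by smoothing the per-episode returns. The following is a minimal sketch under stated assumptions: the helper `moving_average` is hypothetical, and `demo_rewards` holds made-up numbers standing in for the `rewards_history` list returned by `q_learning`.

```python
import numpy as np

def moving_average(values, window):
    """Mean over a sliding window; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Made-up episode returns (not actual training results)
demo_rewards = [-2.0, -1.8, -1.9, -1.2, -0.9, -1.0, -0.5, -0.4]
smoothed = moving_average(demo_rewards, window=4)
print(np.round(smoothed, 3))
```

A rising smoothed curve suggests the agent reaches the goal more often or with fewer steps. Because ε stays fixed during training, some noise from exploratory moves remains even late in training.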

In [11]:
def make_video_from_frames(frames, filename=None, fps=25):
    """
    Turn a list of frames into a video that can be displayed in the browser.
    """
    clip = ImageSequenceClip(frames, fps=fps)

    use_tmpfile = filename is None

    if use_tmpfile:
        with tempfile.NamedTemporaryFile(mode='w+b', suffix='.mp4', delete=False) as f:
            filename = f.name
    else:
        folder = os.path.dirname(filename)
        if folder:  # dirname is '' for a bare file name, and os.makedirs('') raises
            os.makedirs(folder, exist_ok=True)

    clip.write_videofile(filename, logger=None, preset='ultrafast', threads=1)
    with open(filename, mode='rb') as f:
        video_embd = Video(f.read(), html_attributes='controls autoplay', mimetype='video/mp4', embed=use_tmpfile)

    if use_tmpfile:
        os.unlink(filename)

    video_embd.reload()

    return video_embd

In [12]:
episodes = {}
for ep, traj in stored_trajectories.items():
    large_env.reset()
    ep_frames = [large_env.render(pos) for pos in traj]
    episodes[ep] = ep_frames
    time.sleep(0.1)

for ep, frames in episodes.items():
    print(f"Number of Frames for episode {ep}: {len(frames)}")

Number of Frames for episode 0: 12
Number of Frames for episode 200: 8
Number of Frames for episode 400: 12
Number of Frames for episode 600: 12
Number of Frames for episode 800: 13


In [13]:
training_output = widgets.Output(layout={'border': '1px solid black', 'height': '800px', 'overflow': 'scroll'})
with training_output:
    clear_output(wait=True)
    for ep, frames in episodes.items():
        print(f"=== Training Episode {ep} ===")
        video = make_video_from_frames(frames, fps=2)
        display(video)

display(training_output)


## Testing the Learned Agent 🚦

We now run the agent **greedily** (ε = 0) to test the learned policy: it purely exploits the learned table without any exploration.
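
The greedy policy can also be read directly off the Q-table before replaying an episode. This is a minimal sketch under two assumptions: states are indexed row-major (`state = row * cols + col`, which is what `to_index` is presumed to do), and actions are ordered up/down/left/right as in `q_learning`. The helper `greedy_policy_grid` and the tiny `Q_demo` table are hypothetical.

```python
import numpy as np

ARROWS = ["↑", "↓", "←", "→"]  # assumed action order: 0=up, 1=down, 2=left, 3=right

def greedy_policy_grid(Q, rows, cols):
    """Return a rows x cols list of arrows showing the argmax action per state."""
    best = np.argmax(Q, axis=1).reshape(rows, cols)   # assumes row-major state indices
    return [[ARROWS[a] for a in row] for row in best]

# Tiny made-up 2x2 example
Q_demo = np.zeros((4, 4))
Q_demo[0, 3] = 1.0   # state (0,0): right is best
Q_demo[1, 1] = 1.0   # state (0,1): down is best
Q_demo[2, 3] = 1.0   # state (1,0): right is best
policy = greedy_policy_grid(Q_demo, rows=2, cols=2)
for row in policy:
    print(" ".join(row))
```

States whose Q-values are all equal, such as never-visited or terminal states, show ↑ here only because `np.argmax` returns the first index on ties.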

In [14]:
# Record the trajectory of the test episode
state = large_env.reset()
trajectory = [large_env.agent_pos]
done = False
max_test_steps = 100  # guard: a greedy policy that has not converged can loop forever
for _ in range(max_test_steps):
    action = np.argmax(Q[state])
    next_state, _, done = large_env.step(action)
    trajectory.append(large_env.agent_pos)
    state = next_state
    if done:
        break


large_env.reset()
test_ep_frames = [large_env.render(pos) for pos in trajectory]

print(f"Number of Frames for test episode: {len(test_ep_frames)}")

Number of Frames for test episode: 10


In [15]:
test_output = widgets.Output(layout={'border': '1px solid black', 'height': '800px', 'overflow': 'scroll'})
with test_output:
    clear_output(wait=True)
    print(f"=== Test Episode ===")
    test_video = make_video_from_frames(test_ep_frames, fps=2)
    display(test_video)

display(test_output)


# Part 3: Level Design & Experiments 🛠️

In this section you can experiment with an extended version of the environment that offers many new tile variants:

## Original Tiles
| Tile | Symbol | Effect | Reward |
|------|--------|--------|--------|
| Ice | 🧊 | 50% chance of slipping in a random direction | 0 |
| Bumper | 🔴 | Pushes you 3 tiles back in the direction you came from | 0 |
| Pit | 🕳️ | Episode ends immediately | −1 |
| Goal | 🚩 | Episode ends | +1 |

## New Tiles
| Tile | Symbol | Effect | Reward | Intended learning effect |
|------|--------|--------|--------|--------------------------|
| **Wall** | 🧱 | Moving into a wall leaves the agent where it is |  |  |
| **Sticky mud** | 🟫 | Leaving the mud costs one extra turn (the agent stays put for one step) |  |  |
| **Trampoline** | 🦘 | Makes the agent jump two tiles forward immediately |  |  |
| **Conveyor belt** | ⬆️, ⬇️, ⬅️, ➡️ | After stepping on it, the agent is moved one tile in the belt direction before it may act again |  |  |
| **Wind** | 💨 | Pushes the agent one tile in the current wind direction (which changes randomly every episode) |  |  |
| **Portal** | 🌀 | Portal A teleports the agent immediately to its partner portal B (always requires an even number of portal tiles) |  |  |
| **Collapsing floor** | ⚠️ | After the first visit, the tile turns into a pit for the rest of the episode |  |  |
| **Toll gate** | 💰 | Entering costs a fee, but may open up a shorter path |  |  |
| **Battery** | 🔋 | Grants an additional reward and disappears once collected. If you collect *no* battery, there is a high penalty at the end |  |  |
| **Time gem** | 💎 | Gives a positive reward within the first *N* steps and a negative one afterwards |  |  |
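
The portal rule explains why portal tiles must come in pairs: each tile needs exactly one partner. A minimal sketch of such a two-way lookup follows; the helper name `build_portal_lookup` and the coordinates are illustrative assumptions.

```python
def build_portal_lookup(portal_pairs):
    """Map each portal tile to its partner, in both directions."""
    lookup = {}
    for a, b in portal_pairs:
        lookup[tuple(a)] = tuple(b)   # entering A sends the agent to B
        lookup[tuple(b)] = tuple(a)   # entering B sends the agent to A
    return lookup

# One made-up portal pair given as (row, col) coordinates
lookup = build_portal_lookup([((0, 2), (4, 0))])
print(lookup[(0, 2)], lookup[(4, 0)])   # prints: (4, 0) (0, 2)
```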


## Adjusting the Environment and Training Parameters

By running the following cell you can:
- Adjust environment parameters, tiles, and Q-learning parameters
- Train a Q-learning agent with your settings
- Test the trained agent afterwards

### Rules of thumb:
> • Keep rewards within −1 … +1; otherwise Q-values "explode".  
> • More complex physics ⇒ more episodes and possibly a smaller α.  
> • Unfair penalties (|r| ≫ 1) can hinder learning.  
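
The first rule of thumb can be made concrete. If every reward satisfies |r| ≤ r_max and γ < 1, the discounted return, and therefore every true Q-value, is bounded in magnitude by r_max / (1 − γ). A small sketch of that bound follows; the helper name `q_value_bound` is hypothetical.

```python
def q_value_bound(r_max, gamma):
    """Upper bound on |Q| when every reward satisfies |r| <= r_max and 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return r_max / (1.0 - gamma)

bound_small = q_value_bound(r_max=1.0, gamma=0.9)    # roughly 10
bound_large = q_value_bound(r_max=50.0, gamma=0.9)   # roughly 500
print(round(bound_small, 2), round(bound_large, 2))
```

Larger reward magnitudes widen the range the Q-values can occupy, so with the same α a single update can shift an estimate much further.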

In [16]:
class ExtendedGridEnv(LargeGridEnv):
    """
    Extension of the existing LargeGridEnv that adds further tile variants.
    """
    # Direction lookup
    DIRS = {
        "U": (-1, 0),
        "D": ( 1, 0),
        "L": ( 0,-1),
        "R": ( 0, 1)
    }

    def __init__(self,
                 rows=6, cols=6,
                 # original tiles
                 ice_positions=None,
                 bumper_positions=None,
                 pit_positions=None,
                 # new tiles
                 wall_positions=None,
                 sticky_positions=None,
                 conveyor_map=None,   # dict (row,col)->"U/D/L/R"
                 trampoline_positions=None,
                 wind_positions=None,
                 portal_pairs=None,   # list[((r1,c1),(r2,c2))]
                 collapse_positions=None,
                 toll_positions=None,
                 battery_positions=None,
                 gem_positions=None,
                 # rewards
                 reward_goal=10.0,
                 reward_pit=-10.0,
                 reward_step=-0.1,
                 reward_wall=-0.5,
                 reward_sticky=-1.0,
                 reward_trampoline=1.0,
                 reward_toll=-3.0,
                 battery_required=False,
                 goal_position=None,
                 rng_seed=None):

        # Call the parent class with the corresponding arguments
        super().__init__(
            rows=rows, cols=cols,
            ice_positions=ice_positions or [],
            bumper_positions=bumper_positions or [],
            pit_positions=pit_positions or [],
            goal_position=goal_position or (rows-1, cols-1),
            reward_goal=reward_goal,
            reward_pit=reward_pit,
            reward_step=reward_step
        )

        self.action_map = {
            0: (-1, 0), # up
            1: (1, 0),  # down
            2: (0, -1), # left
            3: (0, 1)   # right
        }

        # Store the new tiles
        self.wall_positions      = set(wall_positions or [])
        self.sticky_positions    = set(sticky_positions or [])
        self.conveyor_map        = {tuple(k):v for k,v in (conveyor_map or {}).items()}
        self.trampoline_positions= set(trampoline_positions or [])
        self.wind_positions      = set(wind_positions or [])
        self.portal_lookup       = {}
        if portal_pairs:
            for a,b in portal_pairs:
                self.portal_lookup[tuple(a)] = tuple(b)
                self.portal_lookup[tuple(b)] = tuple(a)
        self.collapse_positions  = set(collapse_positions or [])
        self.already_collapsed   = set()
        self.toll_positions      = set(toll_positions or [])
        self.battery_positions   = set(battery_positions or [])
        self.gem_positions       = set(gem_positions or [])

        self.reward_wall         = reward_wall
        self.reward_sticky       = reward_sticky
        self.reward_trampoline   = reward_trampoline
        self.reward_toll         = reward_toll

        self.battery_required    = battery_required
        self.has_battery         = False

        # The wind direction is re-drawn every episode
        self.rng = np.random.default_rng(rng_seed)
        self.wind_dir_idx = self.rng.integers(0,4)
        self.skip_turns = 0
        self.step_count = 0

        # Keep the original positions of batteries, gems, and collapsing floors for reset()
        self.battery_positions_original = self.battery_positions.copy()
        self.gem_positions_original = self.gem_positions.copy()
        self.collapse_positions_original = self.collapse_positions.copy()

    # ──────────────────────────────────────────────────────────────── #
    # Helper function
    def _in_bounds(self, r, c):
        return 0 <= r < self.rows and 0 <= c < self.cols

    # ──────────────────────────────────────────────────────────────── #
    def reset(self, display_canvas=False):
        # Restore batteries, gems, and collapsing floors, then delegate to the parent reset
        self.has_battery = False
        self.battery_positions = self.battery_positions_original.copy()
        self.gem_positions = self.gem_positions_original.copy()
        self.already_collapsed.clear()
        self.collapse_positions = self.collapse_positions_original.copy()
        self.skip_turns = 0
        self.step_count = 0   # restart the per-episode step counter used by the time gem
        self.wind_dir_idx = self.rng.integers(0,4)
        return super().reset()

    # ──────────────────────────────────────────────────────────────── #
    def check_proposed_pos(self, pos, dr=None, dc=None):
        """
        Resolve chains of forced movements (conveyor belts, wind, trampolines, ...) and return the final tile.
        The loop ends when a rule keeps the agent on the same tile or when it lands on a tile that forces no further movement.
        """
        cur_r, cur_c = pos
        cur_dr, cur_dc = dr, dc   # last movement direction, needed by bumper / trampoline
        visited = set()           # loop detection, e.g. for conveyor belts that form a cycle

        while True:
            cur_pos = (cur_r, cur_c)

            # 1. Tiles that end the movement immediately
            if cur_pos in self.wall_positions:
                return self.agent_pos            # undo the original move
            if cur_pos in self.portal_lookup:
                return self.portal_lookup[cur_pos]
            if cur_pos not in (
                self.ice_positions
                | self.bumper_positions
                | set(self.conveyor_map)
                | self.trampoline_positions
                | self.wind_positions
            ):
                return cur_pos                   # no other tile type changes the position

            # 2. Detect loops (wind or a conveyor belt pushing into the grid border, cycles, ...)
            if cur_pos in visited:
                return cur_pos                   # already visited -> stop
            visited.add(cur_pos)

            # 3. Apply exactly *one* movement rule
            if cur_pos in self.ice_positions:
                if self.rng.random() < 0.5:
                    slip_action = self.rng.integers(0, 4)
                    cur_dr, cur_dc = self.action_map[slip_action]
                else:
                    return cur_pos               # no slip
            elif cur_pos in self.bumper_positions and cur_dr is not None:
                cur_dr, cur_dc = -cur_dr, -cur_dc
                cur_dr *= 3
                cur_dc *= 3
            elif cur_pos in self.conveyor_map:
                cur_dr, cur_dc = self.DIRS[self.conveyor_map[cur_pos]]
            elif cur_pos in self.trampoline_positions and cur_dr is not None:
                cur_dr *= 2
                cur_dc *= 2
            elif cur_pos in self.wind_positions:
                cur_dr, cur_dc = list(self.DIRS.values())[self.wind_dir_idx]
            else:
                return cur_pos                   # defensive fallback

            # 4. Move (clipped to the grid border)
            cur_r = np.clip(cur_r + cur_dr, 0, self.rows - 1)
            cur_c = np.clip(cur_c + cur_dc, 0, self.cols - 1)

    # ──────────────────────────────────────────────────────────────── #
    def step(self, action):
        self.step_count += 1

        # Handle sticky mud
        if self.skip_turns>0:
            self.skip_turns -= 1
            # do not move this turn
            reward = self.reward_sticky
            done = False
            return self.to_index(*self.agent_pos), reward, done

        if self.done:
            return self.to_index(*self.agent_pos), 0.0, True

        dr, dc = self.action_map[action]
        next_r = np.clip(self.agent_pos[0]+dr, 0, self.rows-1)
        next_c = np.clip(self.agent_pos[1]+dc, 0, self.cols-1)
        proposed_pos = (next_r, next_c)
        final_pos = self.check_proposed_pos(proposed_pos, dr=dr, dc=dc)

        # If the agent lands on sticky mud
        if final_pos in self.sticky_positions:
            self.skip_turns = 1

        # Update the agent position
        self.agent_pos = final_pos
        self.visited.add(final_pos)

        # Reward / termination flag
        reward = self.reward_step
        done = False

        # Pit / collapsing floor
        if self.agent_pos in self.pit_positions or self.agent_pos in self.already_collapsed:
            reward = self.reward_pit
            done = True
        elif self.agent_pos in self.collapse_positions:
            self.already_collapsed.add(self.agent_pos)
            self.collapse_positions.remove(self.agent_pos)

        # Toll gate
        if self.agent_pos in self.toll_positions:
            reward += self.reward_toll

        # Trampoline reward bonus
        if proposed_pos in self.trampoline_positions:
            reward += self.reward_trampoline

        # Battery is collected
        if self.agent_pos in self.battery_positions:
            self.has_battery = True
            self.battery_positions.remove(self.agent_pos)

        # Time-dependent gem
        if self.agent_pos in self.gem_positions:
            bonus = 5 if self.step_count < 20 else -5
            reward += bonus
            self.gem_positions.remove(self.agent_pos)

        # Goal
        if self.agent_pos == self.goal_pos:
            if not self.battery_required or self.has_battery:
                reward = self.reward_goal
                done = True

        self.done = done
        return self.to_index(*self.agent_pos), reward, done

    # ──────────────────────────────────────────────────────────────── #
    def render(self, pos):
        """Return numpy RGB array - same shape as Canvas.get_image_data() gave."""
        self.agent_pos = pos

        # Collapsing floor -> the symbol has to change
        if self.agent_pos in self.collapse_positions:
            self.already_collapsed.add(self.agent_pos)
            self.collapse_positions.remove(self.agent_pos)
        # Battery picked up -> stop showing it
        if self.agent_pos in self.battery_positions:
            self.battery_positions.remove(self.agent_pos)
        # Time gem picked up -> stop showing it
        if self.agent_pos in self.gem_positions:
            self.gem_positions.remove(self.agent_pos)
        
        return np.asarray(self._make_frame())[:, :, :3]
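
The chain resolution in `check_proposed_pos` relies on a `visited` set so that forced movement always terminates. The following is a simplified, self-contained sketch of that idea for conveyor belts only. The function `resolve_conveyors` and the sample layouts are hypothetical and leave out ice, bumpers, wind, walls, and portals.

```python
def resolve_conveyors(pos, conveyor_map, rows, cols):
    """Follow conveyor belts from `pos` until a non-conveyor tile, a border stop, or a cycle."""
    dirs = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    visited = set()
    r, c = pos
    while (r, c) in conveyor_map and (r, c) not in visited:
        visited.add((r, c))
        dr, dc = dirs[conveyor_map[(r, c)]]
        # Clamp to the grid, comparable to the np.clip calls in the class
        r = min(max(r + dr, 0), rows - 1)
        c = min(max(c + dc, 0), cols - 1)
    return (r, c)

# A belt that carries the agent right twice and then down once
belts = {(0, 0): "R", (0, 1): "R", (0, 2): "D"}
end_tile = resolve_conveyors((0, 0), belts, rows=3, cols=3)

# Two belts pointing at each other would loop forever without the visited set
loop = {(0, 0): "R", (0, 1): "L"}
loop_tile = resolve_conveyors((0, 0), loop, rows=3, cols=3)
print(end_tile, loop_tile)   # prints: (1, 2) (0, 0)
```

A belt that pushes into the grid border is handled by the same check: clamping returns the agent to a tile it has already visited, so the loop stops there.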

In [17]:
# -------------------------------------
# Sliders for environment parameters
# -------------------------------------
# Uniform style so that all sliders look consistent and line up with each other
slider_style = {'description_width': '240px'}  # label column width
slider_layout = widgets.Layout(width='540px')  # longer slider track
box_layout = widgets.Layout(margin='5px 0')

# Helper: slider and number field side by side, kept in sync
def _make_slider_with_text(*, is_int, **kwargs):
    Sldr = widgets.IntSlider if is_int else widgets.FloatSlider
    Txt  = widgets.IntText   if is_int else widgets.FloatText
    s = Sldr(readout=False, continuous_update=True, style=slider_style, layout=slider_layout, **kwargs)
    t = Txt(step=kwargs.get("step",1), layout=widgets.Layout(width='100px'))
    widgets.link((s,'value'), (t,'value'))  # two-way binding
    return s, widgets.HBox([s,t])


# Define widgets for customizing the environment
# ---- basic grid size ------------------------------------------------
rows_widget, rows_row   = _make_slider_with_text(is_int=True,  value=5, min=4,  max=10, step=1,  description='Grid rows')
cols_widget, cols_row   = _make_slider_with_text(is_int=True,  value=5, min=4,  max=10, step=1,  description='Grid columns')
# ---- goal / pit / step rewards --------------------------------------
reward_goal_widget, reward_goal_row = _make_slider_with_text(is_int=False, value=10.0, min=0.0,  max=50.0,  step=1.0, description='Goal reward')
reward_pit_widget, reward_pit_row  = _make_slider_with_text(is_int=False, value=-10.0,min=-50.0, max=0.0,  step=1.0, description='Pit reward')
reward_step_widget, reward_step_row = _make_slider_with_text(is_int=False, value=-0.1, min=-10.0, max=10.0, step=0.1, description='Step reward')
# ---- tile-specific rewards ------------------------------------------
reward_wall_widget, reward_wall_row  = _make_slider_with_text(is_int=False, value=-0.5, min=-5.0,  max=0.0,  step=0.1, description='Wall penalty')
reward_sticky_widget, reward_sticky_row= _make_slider_with_text(is_int=False, value=-1.0, min=-10.0, max=0.0,  step=0.1, description='Sticky mud penalty')
reward_trampoline_widget, reward_trampoline_row = _make_slider_with_text(is_int=False, value=1.0,  min=0.0,  max=10.0,  step=0.1, description='Trampoline reward')
reward_toll_widget, reward_toll_row  = _make_slider_with_text(is_int=False, value=-3.0, min=-10.0, max=0.0,  step=0.1, description='Toll gate penalty')
# ---- simple flags / seed --------------------------------------------
battery_required_widget = widgets.Checkbox(value=False, description='Agent needs a battery to finish', indent=False)
rng_seed_widget, rng_seed_row = _make_slider_with_text(is_int=True, value=0, min=0, max=65535, step=1, description='RNG Seed (0=random)')
# ---- collect all environment widgets --------------------------------
env_parameters_box = widgets.VBox([
    rows_row, cols_row,
    reward_goal_row, reward_pit_row, reward_step_row,
    reward_wall_row, reward_sticky_row,
    reward_trampoline_row, reward_toll_row,
    battery_required_widget,
    rng_seed_row
], layout=box_layout)

# -------------------------------------
# Sliders for agent parameters
# -------------------------------------
alpha_widget, alpha_row  = _make_slider_with_text(is_int=False, value=0.1, min=0.0, max=1.0, step=0.01, description=r'Learning rate α')
gamma_widget, gamma_row  = _make_slider_with_text(is_int=False, value=0.9, min=0.0, max=1.0, step=0.01, description=r'Discount factor γ')
epsilon_widget, epsilon_row= _make_slider_with_text(is_int=False, value=0.1, min=0.0, max=1.0, step=0.01, description=r'Exploration rate ε')
num_episodes_widget,num_episodes_row=_make_slider_with_text(is_int=True, value=100, min=1, max=1000, step=10, description='Episodes')
max_steps_widget, max_steps_row = _make_slider_with_text(is_int=True, value=100, min=10, max=1000, step=10, description='Max steps/ep')
record_interval_widget, record_interval_row = _make_slider_with_text(is_int=True, value=100, min=1, max=1000, step=10, description='Store episode interval')
report_interval_widget, report_interval_row = _make_slider_with_text(is_int=True, value=20, min=10, max=100, step=10, description='Report reward interval')

agent_parameters_box = widgets.VBox([alpha_row, gamma_row, epsilon_row, num_episodes_row, max_steps_row, record_interval_row, report_interval_row], layout=box_layout)

# -------------------------------------
# Tile selection + presets + reset
# -------------------------------------
tile_types = ["Empty", "Ice", "Bumper", "Pit", "Wall", "Sticky Mud", "Conveyor Belt (U)", "Conveyor Belt (D)", "Conveyor Belt (L)", "Conveyor Belt (R)", "Trampoline", "Wind", "Portal", "Collapsing Floor", "Toll Gate", "Battery", "Gem", "Goal"]

dd_layout = widgets.Layout(width='130px')   # wider so that long names are not cut off

def _get_current_tile_grid():
    """Return a copy of the tile names currently selected in the UI."""
    try:
        return copy.deepcopy([[dd.value for dd in row] for row in tile_selectors])
    except NameError:
        return [["Empty"] * cols_widget.value for _ in range(rows_widget.value)]

_widget_vars = {name: obj for name, obj in globals().items() if name.endswith("_widget") and hasattr(obj, "value")}
_default_values = {name: w.value for name, w in _widget_vars.items()}
#_default_tile_grid = _get_current_tile_grid()
_default_tile_grid = [["Empty"] * cols_widget.value for _ in range(rows_widget.value)]


preset_configs = {
    "Default": {
        **_default_values,
        "tile_grid": _default_tile_grid,
    },
    "Warm-Up Playground (5×5) - gentle start with a few obstacles": {
        **_default_values,
        "rows_widget": 5, "cols_widget": 5,
        "num_episodes_widget": 400, "max_steps_widget": 120,
        "alpha_widget": 0.12, "epsilon_widget": 0.18,
        "reward_step_widget": -0.10, "reward_pit_widget": -12.0, "reward_wall_widget": -0.6,
        "tile_grid": [
            ["Empty", "Empty", "Wall",  "Empty", "Empty"],
            ["Ice",   "Empty", "Wall",  "Pit",   "Empty"],
            ["Empty", "Empty", "Empty", "Empty", "Empty"],
            ["Empty", "Pit",   "Empty", "Empty", "Empty"],
            ["Empty", "Empty", "Empty", "Empty", "Goal" ],
        ],
    },
    "Conveyor Workshop (6×6) - conveyor belts, mud and wind teach forced movement": {
        **_default_values,
        "rows_widget": 6, "cols_widget": 6,
        "num_episodes_widget": 800, "max_steps_widget": 160,
        "alpha_widget": 0.15, "epsilon_widget": 0.30,
        "reward_step_widget": -0.05, "reward_sticky_widget": -1.0, "reward_trampoline_widget": 2.0, "reward_wall_widget": -1.0,
        "tile_grid": [
            ["Empty",      "Conveyor Belt (R)", "Conveyor Belt (R)", "Conveyor Belt (R)", "Conveyor Belt (D)", "Empty"],
            ["Empty",      "Sticky Mud",        "Empty",             "Empty",             "Conveyor Belt (D)", "Empty"],
            ["Empty",      "Empty",             "Empty",             "Empty",             "Conveyor Belt (D)", "Empty"],
            ["Trampoline", "Empty",             "Wall",              "Wall",              "Conveyor Belt (D)", "Empty"],
            ["Empty",      "Wind",              "Empty",             "Empty",             "Conveyor Belt (D)", "Empty"],
            ["Empty",      "Empty",             "Empty",             "Empty",             "Empty",             "Goal" ],
        ],
    },
    "Battery-Portal Run (7×5) - collect the battery, pay the toll, use the portals": {
        **_default_values,
        "rows_widget": 7, "cols_widget": 5,
        "battery_required_widget": True,
        "num_episodes_widget": 1100, "max_steps_widget": 180,
        "alpha_widget": 0.10, "epsilon_widget": 0.25,
        "reward_step_widget": -0.20, "reward_toll_widget": -3.0, "reward_wall_widget": -1.5,
        "tile_grid": [
            ["Empty",             "Empty", "Portal", "Wall",       "Empty"],
            ["Empty",             "Wall",  "Empty",  "Wall",       "Empty"],
            ["Empty",             "Wall",  "Empty",  "Toll Gate",  "Empty"],
            ["Conveyor Belt (U)", "Empty", "Empty",  "Wall",       "Empty"],
            ["Portal",            "Wall",  "Empty",  "Wind",       "Empty"],
            ["Battery",           "Empty", "Empty",  "Wall",       "Empty"],
            ["Empty",             "Empty", "Empty",  "Empty",      "Goal" ],
        ],
    },
    "Collapsing Canyon (6×7) - vanishing floors and stormy gorges": {
        **_default_values,
        "rows_widget": 6, "cols_widget": 7,
        "num_episodes_widget": 1400, "max_steps_widget": 200,
        "alpha_widget": 0.12, "epsilon_widget": 0.35,
        "reward_step_widget": -0.15, "reward_pit_widget": -15.0, "reward_wall_widget": -1.0,
        "tile_grid": [
            ["Empty",      "Wind",             "Empty", "Empty", "Empty", "Wind",             "Empty"],
            ["Empty",      "Collapsing Floor", "Empty", "Pit",   "Empty", "Collapsing Floor", "Empty"],
            ["Empty",      "Collapsing Floor", "Empty", "Empty", "Empty", "Collapsing Floor", "Empty"],
            ["Empty",      "Wind",             "Empty", "Pit",   "Empty", "Wind",             "Empty"],
            ["Trampoline", "Empty",            "Empty", "Empty", "Empty", "Empty",            "Empty"],
            ["Empty",      "Empty",            "Empty", "Empty", "Empty", "Empty",            "Goal" ],
        ],
    },
    "Gem-Bumper Maze (6×6) – collect gems, avoid bumpers and walls": {
        **_default_values,
        "rows_widget": 6, "cols_widget": 6,
        "num_episodes_widget": 650, "max_steps_widget": 140,
        "alpha_widget": 0.11, "epsilon_widget": 0.22,
        "reward_step_widget": -0.08, "reward_wall_widget": -1.2,
        "tile_grid": [
            ["Empty", "Wall",  "Empty", "Bumper",            "Empty",             "Empty"],
            ["Empty", "Wall",  "Empty", "Sticky Mud",        "Empty",             "Empty"],
            ["Gem",   "Empty", "Empty", "Conveyor Belt (L)", "Conveyor Belt (L)", "Empty"],
            ["Empty", "Wall",  "Wall",  "Wall",              "Empty",             "Empty"],
            ["Empty", "Empty", "Empty", "Empty",             "Empty",             "Empty"],
            ["Empty", "Empty", "Empty", "Empty",             "Empty",             "Goal" ],
        ],
    },
}


# Helper that applies a preset dictionary to the widgets
def _apply_preset(cfg_name):
    cfg = preset_configs[cfg_name]
    # Update all sliders / scalar widgets
    for k, v in cfg.items():
        if k in _widget_vars:
            _widget_vars[k].value = v
    # Keep the dropdown in sync (matters for Reset) and rebuild the tile grid from the preset
    preset_dropdown.value = cfg_name
    update_tile_grid(None)


preset_dropdown = widgets.Dropdown(options=list(preset_configs.keys()), value="Default", description="Presets:")
apply_btn = widgets.Button(description="Apply", button_style="success", tooltip="Apply the selected preset")
reset_btn = widgets.Button(description="Reset", button_style="warning", tooltip="Reset to the default preset")

apply_btn.on_click(lambda b: _apply_preset(preset_dropdown.value))
reset_btn.on_click(lambda b: _apply_preset("Default"))

preset_box = widgets.HBox([preset_dropdown, apply_btn, reset_btn])


tile_selectors = []  # 2D list (rows x cols) of Dropdown widgets
tile_grid_container = widgets.VBox()

def create_tile_selectors(rows, cols, preset):
    """
    Builds a 2D grid of Dropdown widgets (one per cell).
    Returns it as a list of lists, together with a VBox that lays the dropdowns out visually.
    """
    # get a safe tile_grid (a preset without a "tile_grid" entry yields None)
    tg = preset_configs[preset].get("tile_grid")
    if not tg:  # None or empty list
        tg = [["Empty"] * cols for _ in range(rows)]
    else:  # pad / crop if size differs
        tg = [
            [
                tg[r][c] if r < len(tg) and c < len(tg[r]) else "Empty"
                for c in range(cols)
            ]
            for r in range(rows)
        ]

    grid_rows, selectors_2d = [], []
    for r in range(rows):
        row_selectors = [
            widgets.Dropdown(options=tile_types, value=tg[r][c], description='', layout=dd_layout)
            for c in range(cols)
        ]
        selectors_2d.append(row_selectors)
        grid_rows.append(widgets.HBox(row_selectors))
    return selectors_2d, widgets.VBox(grid_rows)

def update_tile_grid(_):
    """
    Rebuilds the 2D grid of tile selectors whenever rows, columns, or the preset change.
    """
    rows = rows_widget.value
    cols = cols_widget.value
    preset = preset_dropdown.value

    # Rebuild the tile dropdowns
    global tile_selectors
    tile_selectors, grid_vbox = create_tile_selectors(rows, cols, preset)

    tile_grid_container.children = [grid_vbox]

# Observe changes to rows/columns/preset so the tile grid gets rebuilt
rows_widget.observe(update_tile_grid, names='value')
cols_widget.observe(update_tile_grid, names='value')
preset_dropdown.observe(update_tile_grid, names='value')

# Initialize the tile grid once at startup
update_tile_grid(None)

# -------------------------------------
# Buttons + output
# -------------------------------------
# Button that starts training with the selected parameters
train_button = widgets.Button(description='Train Q-learning agent with the selected parameter values', layout=widgets.Layout(width='400px'))

# Button that replays recorded training episodes
replay_button = widgets.Button(description='Replay training episodes', layout=widgets.Layout(width='400px'))

# Button that runs a test episode
test_button = widgets.Button(description='Test episode with the trained agent', layout=widgets.Layout(width='400px'))

# Display area for training output
output1 = widgets.Output(
    layout={
        'border': '1px solid black',
        #'width': '800px',
        'height': '400px',
        'overflow': 'scroll',
    }
)

# Display area for training replays
output2 = widgets.Output(
    layout={
        'border': '1px solid black',
        #'width': '200px',
        'height': '600px',
        'overflow': 'scroll',
    }
)

# Display area for the test episode
output3 = widgets.Output(
    layout={
        'border': '1px solid black',
        #'width': '200px',
        'height': '600px',
        'overflow': 'scroll',
    }
)

# Dictionary shared between the button callbacks
training_data = {
    'env': None,
    'Q': None,
    'trajectories': None,
    'rewards_history': None
}

def train_agent(env_params, agent_params, training_data):
    """
    Creates the environment and the Q-learning agent from the given dictionaries,
    then runs the training.
    """

    # The first "Goal" tile chosen by the user is used as the goal
    goal_positions = env_params['goal_positions']
    if len(goal_positions) == 0:
        # No goal selected: default to the bottom-right cell (indices are 0-based)
        goal_pos = (env_params['rows'] - 1, env_params['cols'] - 1)
    else:
        goal_pos = goal_positions[0]

    # Initialize the environment
    large_env = ExtendedGridEnv(
        rows                =env_params['rows'],
        cols                =env_params['cols'],
        # tiles
        ice_positions       =env_params['ice_positions'],
        bumper_positions    =env_params['bumper_positions'],
        pit_positions       =env_params['pit_positions'],
        wall_positions      =env_params['wall_positions'],
        sticky_positions    =env_params['sticky_positions'],
        conveyor_map        =env_params['conveyor_map'],   # dict (row,col)->"U/D/L/R"
        trampoline_positions=env_params['trampoline_positions'],
        wind_positions      =env_params['wind_positions'],
        portal_pairs        =env_params['portal_pairs'],   # list[((r1,c1),(r2,c2))]
        collapse_positions  =env_params['collapse_positions'],
        toll_positions      =env_params['toll_positions'],
        battery_positions   =env_params['battery_positions'],
        gem_positions       =env_params['gem_positions'],
        # rewards
        reward_goal         =env_params['reward_goal'],
        reward_pit          =env_params['reward_pit'],
        reward_step         =env_params['reward_step'],
        reward_wall         =env_params['reward_wall'],
        reward_sticky       =env_params['reward_sticky'],
        reward_trampoline   =env_params['reward_trampoline'],
        reward_toll         =env_params['reward_toll'],
        battery_required    =env_params['battery_required'],
        goal_position       =goal_pos,
        rng_seed            =env_params['rng_seed'],
    )

    # Train tabular Q-learning
    Q, stored_trajectories, rewards_history = q_learning(
        large_env,
        num_episodes=agent_params['num_episodes'],
        alpha=agent_params['alpha'],
        gamma=agent_params['gamma'],
        epsilon=agent_params['epsilon'],
        max_steps=agent_params['max_steps'],
        record_interval=agent_params['record_interval'],
        report_interval=agent_params['report_interval']
    )

    # Keep references for later replay
    training_data['env'] = large_env
    training_data['Q'] = Q
    training_data['trajectories'] = stored_trajectories
    training_data['rewards_history'] = rewards_history


@output1.capture(clear_output=True)
def on_train_button_clicked(_):

    rows = rows_widget.value
    cols = cols_widget.value

    # Convert tile_selectors (the 2D dropdowns) into position lists
    ice_positions = []
    bumper_positions = []
    wall_positions = []
    sticky_positions = []
    conveyor_map = {}
    trampoline_positions = []
    wind_positions = []
    portal_positions = []
    collapse_positions = []
    toll_positions = []
    battery_positions = []
    gem_positions = []
    pit_positions = []
    goal_positions = []

    for r in range(rows):
        for c in range(cols):
            tile_choice = tile_selectors[r][c].value
            if tile_choice == "Ice":
                ice_positions.append((r, c))
            elif tile_choice == "Bumper":
                bumper_positions.append((r, c))
            elif tile_choice == "Wall":
                wall_positions.append((r, c))
            elif tile_choice == "Sticky Mud":
                sticky_positions.append((r, c))
            elif tile_choice == "Conveyor Belt (U)":
                conveyor_map[(r, c)] = "U"
            elif tile_choice == "Conveyor Belt (D)":
                conveyor_map[(r, c)] = "D"
            elif tile_choice == "Conveyor Belt (L)":
                conveyor_map[(r, c)] = "L"
            elif tile_choice == "Conveyor Belt (R)":
                conveyor_map[(r, c)] = "R"
            elif tile_choice == "Trampoline":
                trampoline_positions.append((r, c))
            elif tile_choice == "Wind":
                wind_positions.append((r, c))
            elif tile_choice == "Portal":
                portal_positions.append((r, c))
            elif tile_choice == "Collapsing Floor":
                collapse_positions.append((r, c))
            elif tile_choice == "Toll Gate":
                toll_positions.append((r, c))
            elif tile_choice == "Battery":
                battery_positions.append((r, c))
            elif tile_choice == "Gem":
                gem_positions.append((r, c))
            elif tile_choice == "Pit":
                pit_positions.append((r, c))
            elif tile_choice == "Goal":
                goal_positions.append((r, c))

    if len(goal_positions) == 0:
        goal_positions.append((rows-1, cols-1))
    
    portal_pairs = []
    if len(portal_positions) >= 2:
        portal_pairs.append((portal_positions[0], portal_positions[1]))

    # Read settings from the environment widgets
    env_params = {
        'rows'                : rows_widget.value,
        'cols'                : cols_widget.value,
        # Rewards
        'reward_goal'         : reward_goal_widget.value,
        'reward_pit'          : reward_pit_widget.value,
        'reward_step'         : reward_step_widget.value,
        'reward_wall'         : reward_wall_widget.value,
        'reward_sticky'       : reward_sticky_widget.value,
        'reward_trampoline'   : reward_trampoline_widget.value,
        'reward_toll'         : reward_toll_widget.value,
        'battery_required'    : battery_required_widget.value,
        'rng_seed'            : (None if rng_seed_widget.value == 0 else rng_seed_widget.value),
        # Tile positions
        'ice_positions'       : ice_positions,
        'bumper_positions'    : bumper_positions,
        'wall_positions'      : wall_positions,
        'sticky_positions'    : sticky_positions,
        'conveyor_map'        : conveyor_map,
        'trampoline_positions': trampoline_positions,
        'wind_positions'      : wind_positions,
        'portal_pairs'        : portal_pairs,
        'collapse_positions'  : collapse_positions,
        'toll_positions'      : toll_positions,
        'battery_positions'   : battery_positions,
        'gem_positions'       : gem_positions,
        'pit_positions'       : pit_positions,
        'goal_positions'      : goal_positions
    }

    # Read settings from the agent widgets
    agent_params = {
        'alpha'               : alpha_widget.value,
        'gamma'               : gamma_widget.value,
        'epsilon'             : epsilon_widget.value,
        'num_episodes'        : num_episodes_widget.value,
        'max_steps'           : max_steps_widget.value,
        'record_interval'     : record_interval_widget.value,
        'report_interval'     : report_interval_widget.value
    }

    # Pass everything to the training function
    train_agent(env_params, agent_params, training_data)
    print("Training completed.")


@output2.capture(clear_output=True)
def on_replay_button_clicked(_):
    env = training_data['env']
    trajectories = training_data['trajectories']

    global replay_training_canvases
    replay_training_canvases = {}

    if env is None or trajectories is None:
        print("No stored training data found. Please train first.")
        return

    episodes = {}
    for ep, traj in trajectories.items():
        env.reset()
        ep_frames = [env.render(pos) for pos in traj]
        episodes[ep] = ep_frames

    for ep, frames in episodes.items():
        print(f"=== Training Episode {ep} ===")
        video = make_video_from_frames(frames, fps=2)
        display(video)


@output3.capture(clear_output=True)
def on_test_button_clicked(_):
    env = training_data['env']
    agent = training_data['Q']

    if env is None or agent is None:
        print("No stored training data found. Please train first.")
        return

    # Record the test-episode trajectory
    state = env.reset()
    trajectory = [env.agent_pos]
    # Cap the rollout: a greedy policy that never reaches the goal would otherwise loop forever
    for _ in range(max_steps_widget.value):
        action = np.argmax(agent[state])
        next_state, _, done = env.step(action)
        trajectory.append(env.agent_pos)
        state = next_state
        if done:
            break
    
    env.reset()
    test_ep_frames = [env.render(pos) for pos in trajectory]

    print("=== Test Episode ===")
    test_video = make_video_from_frames(test_ep_frames, fps=2)
    display(test_video)


# Attach callbacks to the buttons
train_button.on_click(on_train_button_clicked)
replay_button.on_click(on_replay_button_clicked)
test_button.on_click(on_test_button_clicked)


# Put the boxes into a Tab widget
tab = widgets.Tab(children=[env_parameters_box, tile_grid_container, agent_parameters_box])
tab.set_title(0, "Environment")
tab.set_title(1, "Tiles")
tab.set_title(2, "Agent")


# Stack tabs, buttons, and output areas
agent_box = widgets.VBox([
    tab,
    preset_box,
    train_button,
    output1,
    replay_button,
    output2,
    test_button,
    output3,
])
display(agent_box)
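
The tile handling in the cell above (pad or crop a preset layout to the chosen grid size, then turn the grid into position lists) can be tried out without any widgets. The following is only a minimal sketch: the helper names `fit_tile_grid` and `positions_by_tile` are hypothetical and do not exist in the notebook, and the grouping into a single dictionary is a simplification of the per-tile lists built in `on_train_button_clicked`.

```python
def fit_tile_grid(tile_grid, rows, cols, fill="Empty"):
    """Pad or crop a preset layout so it has exactly rows x cols entries."""
    if not tile_grid:  # None or empty list: start from a blank grid
        return [[fill] * cols for _ in range(rows)]
    return [
        [
            tile_grid[r][c] if r < len(tile_grid) and c < len(tile_grid[r]) else fill
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def positions_by_tile(tile_grid):
    """Group (row, col) coordinates by tile name, skipping "Empty" cells."""
    positions = {}
    for r, row in enumerate(tile_grid):
        for c, tile in enumerate(row):
            if tile != "Empty":
                positions.setdefault(tile, []).append((r, c))
    return positions


# Hypothetical 2x2 layout that is padded to 3x3
layout = [
    ["Empty", "Wall"],
    ["Portal", "Goal"],
]
grid = fit_tile_grid(layout, rows=3, cols=3)
tiles = positions_by_tile(grid)
print(grid[2])        # the padded third row
print(tiles["Goal"])  # coordinates of the goal tile
```

Cells outside the preset layout are filled with `"Empty"`, so enlarging the grid never raises an index error, and shrinking it simply drops the tiles that no longer fit.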

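
The test button in the cell above plays one episode by always taking the action with the highest Q-value. A purely greedy policy can get stuck if training did not converge, so a step limit is a sensible safeguard. The sketch below illustrates this on a tiny stand-in environment: `LineEnv`, `greedy_rollout`, and the pure-Python `argmax` are hypothetical and not part of the notebook. Only the `(next_state, reward, done)` return shape of `step` and the `agent_pos` attribute are taken from the code above.

```python
class LineEnv:
    """Hypothetical stand-in environment: a 1-D corridor with the goal at the right end."""

    def __init__(self, length=4):
        self.length = length
        self.agent_pos = 0

    def reset(self):
        self.agent_pos = 0
        return self.agent_pos

    def step(self, action):
        # action 0 = left, action 1 = right; positions are clipped to the corridor
        move = 1 if action == 1 else -1
        self.agent_pos = min(max(self.agent_pos + move, 0), self.length - 1)
        done = self.agent_pos == self.length - 1
        reward = 1.0 if done else -0.1
        return self.agent_pos, reward, done


def argmax(values):
    """Index of the largest entry (first one wins on ties), like np.argmax for 1-D input."""
    return max(range(len(values)), key=lambda i: values[i])


def greedy_rollout(env, Q, max_steps):
    """Follow argmax(Q[state]) until the episode ends or max_steps is reached."""
    state = env.reset()
    trajectory = [env.agent_pos]
    for _ in range(max_steps):
        state, _, done = env.step(argmax(Q[state]))
        trajectory.append(env.agent_pos)
        if done:
            break
    return trajectory


env = LineEnv(length=4)
good_Q = [[0.0, 1.0] for _ in range(4)]  # always prefers "right"
bad_Q = [[1.0, 0.0] for _ in range(4)]   # always prefers "left", never reaches the goal

print(greedy_rollout(env, good_Q, max_steps=10))      # [0, 1, 2, 3]
print(len(greedy_rollout(env, bad_Q, max_steps=10)))  # 11 (start position + 10 capped steps)
```

With `bad_Q` the agent keeps walking into the left wall; without the `max_steps` cap the rollout would never terminate.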

# Further Reading & Resources 📚

| Topic | Resource | Type |
|-------|----------|------|
| Foundations of RL | [Sutton & Barto – *Reinforcement Learning: An Introduction* (2nd Ed., PDF)](https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf) | Book (Open Access) |
| Gridworld Tutorials | [David Silver – RL Course, Lecture 1: *Introduction to Reinforcement Learning*](https://www.youtube.com/watch?v=2pWv7GOvuf0) | Video |
| Q-Learning Demo | [OpenAI Gym – *FrozenLake* Environment Docs & Example Notebook](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) | Jupyter/Docs |
| Parameter Tuning | [Blog post – *Hyperparameter Tuning in Reinforcement Learning is Easy, Actually*](https://www.automl.org/hyperparameter-tuning-in-reinforcement-learning-is-easy-actually/) | Article |
