
# Neuro‑Symbolic AI Tutorial: Echo State Networks + Symbolic Rules (CartPole)

This is a **tutorial** that shows how to combine:

- **Neural computation** (an **Echo State Network**, ESN)  
- with **symbolic reasoning** (human‑interpretable **if/else rules**)  

…to build a simple **neuro‑symbolic policy** for the classic control task **CartPole**.

---

## Learning goals

By the end, students should be able to:

1. Explain what a **reservoir / ESN** is and why we often **freeze** its recurrent weights.
2. Implement a **symbolic module** using rules.
3. Combine neural and symbolic features into a single policy.
4. Train a policy using **REINFORCE** (policy‑gradient).
5. Understand limitations and how to improve the approach.

---

## What we will build

We will build a policy:

\[
\pi(a \mid s) = \text{Softmax}(\text{Readout}([\text{ESN}(s), \ \text{Rules}(s)]))
\]

- **ESN(s)**: high‑dimensional reservoir state  
- **Rules(s)**: small interpretable vector derived from symbolic rules  
- **Readout**: a trainable linear layer mapping combined features to action logits

---

## Note about Gym versions

Colab sometimes ships different versions of `gym` / `gymnasium`.
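This notebook includes **compatibility code** that accepts both the older 4‑tuple `step` output (`obs, reward, done, info`) and the newer 5‑tuple output (`obs, reward, terminated, truncated, info`).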



## 1) Setup

This cell:
- applies a small compatibility patch for `gym`,
- enables TF32 (optional, for speed on newer GPUs),
- prints the library versions and whether CUDA is available.


In [None]:
# --- Basic imports ---
import numpy as np
import warnings

warnings.filterwarnings("ignore", category=DeprecationWarning)

# Gym checker compatibility: some versions still reference np.bool8
# (np.bool8 was deprecated; aliasing avoids noisy errors in older Gym checks)
np.bool8 = np.bool_

import torch
import torch.nn as nn
import gym

# --- Optional GPU performance knobs (safe to leave on/off) ---
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Free cached GPU memory (optional)
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("PyTorch:", torch.__version__)
print("Gym:", gym.__version__)
print("CUDA available:", torch.cuda.is_available())


PyTorch: 2.9.0+cu128
Gym: 0.25.2
CUDA available: True


Gym has been unmaintained since 2022 and does not support NumPy 2.0 amongst other critical functionality.
Please upgrade to Gymnasium, the maintained drop-in replacement of Gym, or contact the authors of your software and request that they upgrade.
See the migration guide at https://gymnasium.farama.org/introduction/migration_guide/ for additional information.



## 2) Background: Echo State Networks (ESN)

An **Echo State Network** is a type of **reservoir computing** model:

- It has a recurrent “reservoir” with weights **W**.
- The reservoir weights are typically **fixed** (not trained), but chosen so that the system is stable.
- Only a **readout layer** is trained (e.g., linear classifier/regressor).

### The "echo state property" (intuition)

We want the reservoir state to depend on the **recent history** of inputs, but not explode.
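A common heuristic is to rescale the recurrent matrix so that its **spectral radius** (the largest absolute eigenvalue) is below 1. This usually works well in practice, but it is a rule of thumb rather than a strict guarantee of stability for every input sequence.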
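
To make the scaling step concrete, here is a minimal NumPy sketch. The helper name `scale_to_spectral_radius` and the matrix size are illustrative assumptions, not part of the notebook's own code.

```python
import numpy as np

def scale_to_spectral_radius(W, target=0.9):
    """Rescale W so that its largest absolute eigenvalue equals `target`."""
    rho = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (target / (rho + 1e-12))

rng = np.random.default_rng(0)
n = 50                                   # illustrative reservoir size
W = rng.standard_normal((n, n))
W *= rng.random((n, n)) < 0.1            # keep roughly 10% of the entries

W_scaled = scale_to_spectral_radius(W, target=0.9)
rho_after = np.max(np.abs(np.linalg.eigvals(W_scaled)))
print(f"spectral radius after scaling: {rho_after:.3f}")
```

Scaling a matrix by a constant scales every eigenvalue by the same constant, which is why a single division is enough.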

### In this tutorial

We will:
- Initialize a sparse random reservoir matrix **W**,
- scale it to a chosen spectral radius,
- compute reservoir states with:

\[
h_t = \tanh(W_{\text{in}} x_t + W h_{t-1})
\]

Then we concatenate **h_t** with a symbolic vector and pass it to a trainable readout.
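
The update equation can be tried in isolation before wiring it into a policy. The sketch below is a rough illustration under assumed sizes and random stand-in inputs (not CartPole observations): it drives the same reservoir with the same input sequence from two different initial states and prints how far apart the final states are. With fading memory, that distance should be small.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, n_in, T = 50, 4, 200              # illustrative sizes

W_in = 0.1 * rng.standard_normal((n_res, n_in))
W = rng.standard_normal((n_res, n_res)) * (rng.random((n_res, n_res)) < 0.1)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9

def run_reservoir(h0, inputs):
    """Apply h_t = tanh(W_in x_t + W h_{t-1}) over a sequence; return the final state."""
    h = h0
    for x in inputs:
        h = np.tanh(W_in @ x + W @ h)
    return h

inputs = rng.uniform(-1.0, 1.0, size=(T, n_in))   # stand-in for observations
h_a = run_reservoir(np.zeros(n_res), inputs)
h_b = run_reservoir(rng.uniform(-1.0, 1.0, n_res), inputs)

# Same inputs, different starting states: compare where the two runs end up
print("distance between final states:", np.linalg.norm(h_a - h_b))
```

Because of the `tanh`, every component of the state stays in \([-1, 1]\) regardless of the input scale.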



## 3) Symbolic reasoning module (rules)

A symbolic component is usually:
- **interpretable**
- **editable**
- potentially **verifiable**

Here we define rules based on two CartPole observations:

- **cart position** \(x\)
- **pole angle** \(\theta\)

The output is a vector of size 2, `[pref_action_0, pref_action_1]`, which you can read as the rules' preference for pushing the cart left (action 0) versus right (action 1).

> This is not the only way to encode rules — it’s just a simple, teachable example.
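
As a worked example, the sketch below re-implements the two rules as a plain function and evaluates two states. The name `rule_vector` is illustrative; the thresholds and preference values are the ones used in the module that follows.

```python
def rule_vector(x, theta):
    """Average of the angle rule and the position rule, as [pref_0, pref_1]."""
    if theta < -0.1:
        angle = [0.9, 0.1]
    elif theta > 0.1:
        angle = [0.1, 0.9]
    else:
        angle = [0.5, 0.5]

    if x < -1.0:
        position = [0.8, 0.2]
    elif x > 1.0:
        position = [0.2, 0.8]
    else:
        position = [0.5, 0.5]

    return [(a + p) / 2.0 for a, p in zip(angle, position)]

tilted_right = rule_vector(x=0.0, theta=0.15)   # angle rule fires, position rule neutral
conflict = rule_vector(x=-1.5, theta=0.15)      # the two rules disagree
print(tilted_right, conflict)
```

The first state yields approximately `[0.3, 0.7]`: the angle rule fires and the position rule stays neutral. In the second state the rules pull in opposite directions, and the average lands near `[0.45, 0.55]`.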


In [None]:
class SymbolicReasoningModule:
    '''
    A tiny rule-based module.

    Input:  CartPole state vector: [x, x_dot, theta, theta_dot]
    Output: A 2D vector (symbolic preference for action 0 vs 1)

    Rules are intentionally simple so students can:
    - read them
    - modify them
    - see how behavior changes
    '''
    def __init__(self, device):
        self.device = device

        # Rule functions return [pref_action_0, pref_action_1]
        self.rules = {
            "pole_angle": lambda angle: (
                [0.9, 0.1] if angle < -0.1 else
                ([0.1, 0.9] if angle > 0.1 else [0.5, 0.5])
            ),
            "cart_position": lambda pos: (
                [0.8, 0.2] if pos < -1.0 else
                ([0.2, 0.8] if pos > 1.0 else [0.5, 0.5])
            )
        }

    def forward(self, state):
        # CartPole: state = [x, x_dot, theta, theta_dot]
        pole_angle = float(state[2].item())
        cart_position = float(state[0].item())

        angle_output = self.rules["pole_angle"](pole_angle)
        position_output = self.rules["cart_position"](cart_position)

        # Combine rule outputs (simple average)
        symbolic_output = [(a + b) / 2.0 for a, b in zip(angle_output, position_output)]
        return torch.tensor(symbolic_output, dtype=torch.float32, device=self.device)

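    def refine_rules(self, feedback):
        '''
        Very simple rule adaptation:
        - if a rule family gets negative feedback, soften it.

        This is *not* a principled approach; it's a teaching example.
        '''
        # Each rule keeps its own threshold; only the preference strength is softened.
        thresholds = {"pole_angle": 0.1, "cart_position": 1.0}
        for key in self.rules:
            if feedback.get(key, 0) < 0:
                # Bind the threshold as a default argument so each lambda keeps its own value
                self.rules[key] = lambda x, t=thresholds[key]: (
                    [0.6, 0.4] if x < -t else
                    ([0.4, 0.6] if x > t else [0.5, 0.5])
                )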



## 4) Neuro‑symbolic ESN module

We combine:
- reservoir state \(h_t\) of size `reservoir_dim`
- symbolic output of size `symbolic_dim=2`

Then the **readout** maps the concatenated vector to action logits.

### Why this is "neuro‑symbolic"
- The reservoir provides a rich nonlinear representation of the recent state history (its weights stay fixed; only the readout learns).
- The symbolic rules encode human intuition.
- The final decision uses **both**.
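
The shapes involved are easy to check with a few lines of NumPy. This is a sketch with stand-in values, not the notebook's module: `h_t` and `rules` are made-up vectors, and `W_out` / `b_out` are hypothetical names for the readout parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
reservoir_dim, symbolic_dim, n_actions = 150, 2, 2

h_t = np.tanh(rng.standard_normal(reservoir_dim))   # stand-in reservoir state
rules = np.array([0.3, 0.7])                        # stand-in symbolic vector

features = np.concatenate([h_t, rules])             # shape: (152,)

# Linear readout: the only trainable parameters in the model
W_out = 0.01 * rng.standard_normal((n_actions, reservoir_dim + symbolic_dim))
b_out = np.zeros(n_actions)
logits = W_out @ features + b_out                   # shape: (2,)

print(features.shape, logits.shape)
```

Note that the symbolic vector contributes only 2 of the 152 input features, so how much the rules influence the action depends on the readout weights learned for those two columns.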


In [None]:
class NeuroSymbolicEchoStateNetwork(nn.Module):
    def __init__(self, input_dim, reservoir_dim, output_dim, device,
                 symbolic_dim=2, spectral_radius=0.9, sparsity=0.1):
        super().__init__()
        self.device = device
        self.reservoir_dim = reservoir_dim
        self.symbolic_dim = symbolic_dim

        # Input-to-reservoir weights (fixed)
        self.Win = torch.randn(reservoir_dim, input_dim, device=device) * 0.1

        # Reservoir recurrent weights: sparse + scaled by spectral radius
        W = np.random.randn(reservoir_dim, reservoir_dim)
        W *= (np.random.rand(reservoir_dim, reservoir_dim) < sparsity)

        # Scale to the target spectral radius (largest absolute eigenvalue)
        rho = np.max(np.abs(np.linalg.eigvals(W)))
        W = (W / (rho + 1e-12)) * spectral_radius

        # Store as torch tensor
        self.W = torch.from_numpy(W.astype(np.float32)).to(device)

        # Reservoir state (updated every forward pass)
        self.state = torch.zeros(reservoir_dim, device=device, dtype=torch.float32)

        # Symbolic module + trainable readout
        self.symbolic_module = SymbolicReasoningModule(device=device)
        self.readout = nn.Linear(reservoir_dim + symbolic_dim, output_dim).to(device)

    def forward(self, x):
        '''
        x: state vector (shape: [input_dim])
        returns: action logits (shape: [output_dim])
        '''
        self.state = torch.tanh(self.Win @ x + self.W @ self.state)
        symbolic_output = self.symbolic_module.forward(x)
        combined = torch.cat((self.state, symbolic_output))
        return self.readout(combined)

    def refine_symbolic_rules(self, feedback):
        self.symbolic_module.refine_rules(feedback)



## 5) Policy network

We convert ESN logits into a probability distribution using Softmax:

\[
\pi(a|s) = \text{Softmax}(\text{logits})
\]

Then we sample actions using a categorical distribution.
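
The two steps above (softmax, then categorical sampling) can be sketched with the standard library alone. The function names are illustrative; the notebook itself uses `nn.Softmax` and `torch.distributions.Categorical`.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw an action index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
probs = softmax([2.0, 0.0])            # roughly [0.88, 0.12]
action = sample_action(probs, rng)
log_prob = math.log(probs[action])     # the quantity REINFORCE stores per step
print(probs, action, log_prob)
```

Subtracting the maximum logit before exponentiating does not change the result but avoids overflow for large logits.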


In [None]:
class PolicyNetwork(nn.Module):
    def __init__(self, esn):
        super().__init__()
        self.esn = esn
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        logits = self.esn(x)
        return self.softmax(logits)



## 6) Training with REINFORCE (policy gradient)

We use the classic REINFORCE update:

\[
\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t
\]

where \(G_t\) is the discounted return from time \(t\).

### In code

- collect `log_probs` during the episode,
- compute discounted returns,
- normalize returns (helps stability),
- compute the loss as the sum over time steps of `-log_prob * return`,
- update via Adam.
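
The return computation and the loss can be checked by hand on a tiny episode. The following is a plain-Python sketch with illustrative helper names and made-up log-probabilities; the training cell does the same arithmetic with tensors.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backwards from the last step."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns

def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance scaling using the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return [(v - mean) / (var ** 0.5 + eps) for v in values]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print(returns)   # approximately [2.9701, 1.99, 1.0]
normalized = normalize(returns)

# REINFORCE loss: sum over steps of -log_prob * normalized return
log_probs = [-0.7, -0.6, -0.8]           # made-up values for illustration
loss = sum(-lp * g for lp, g in zip(log_probs, normalized))
print(loss)
```

After normalization, early steps of a short episode get positive weights and late steps get negative ones, so the update raises the probability of early actions relative to late ones.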

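We also compute a simple **feedback signal** for rule refinement, accumulated separately for each rule family over the episode:
- `pole_angle` gains the step reward when the pole is near upright (\(|\theta| < 0.1\)) and loses 1 otherwise,
- `cart_position` gains the step reward when the cart is near the center (\(|x| < 1.0\)) and loses 1 otherwise.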
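
To see how this signal behaves, here is a small standalone sketch that accumulates the feedback for a made-up three-step trajectory. The function name `episode_feedback` and the states are illustrative assumptions; the thresholds match the training loop.

```python
def episode_feedback(next_states, rewards):
    """Accumulate per-rule feedback over one episode.

    next_states: list of (x, x_dot, theta, theta_dot) tuples
    rewards:     list of per-step rewards
    """
    feedback = {"pole_angle": 0.0, "cart_position": 0.0}
    for (x, _x_dot, theta, _theta_dot), r in zip(next_states, rewards):
        feedback["pole_angle"] += r if abs(theta) < 0.1 else -1.0
        feedback["cart_position"] += r if abs(x) < 1.0 else -1.0
    return feedback

states = [
    (0.0, 0.0, 0.02, 0.0),    # upright, centered
    (0.1, 0.0, 0.12, 0.0),    # tilted beyond 0.1 rad
    (0.2, 0.0, 0.15, 0.0),    # tilted beyond 0.1 rad
]
fb = episode_feedback(states, [1.0, 1.0, 1.0])
print(fb)   # {'pole_angle': -1.0, 'cart_position': 3.0}
```

Here the angle rule ends the episode with a negative total, which is the condition under which `refine_rules` softens that rule, while the position rule is left unchanged.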


In [None]:
def make_env():
    '''
    Create CartPole with compatibility across Gym versions.

    Some versions accept new_step_api=True, others don't.
    We'll try it and fall back if needed.
    '''
    try:
        return gym.make("CartPole-v1", new_step_api=True)
    except TypeError:
        return gym.make("CartPole-v1")


def reset_env(env):
    '''Gym reset compatibility: returns obs or (obs, info).'''
    out = env.reset()
    obs = out[0] if isinstance(out, tuple) else out
    return obs


def step_env(env, action):
    '''
    Gym step compatibility:
    - new API: obs, reward, terminated, truncated, info
    - old API: obs, reward, done, info
    '''
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        done = terminated or truncated
        return obs, reward, done, info
    else:
        obs, reward, done, info = out
        return obs, reward, done, info


def train(
    episodes=300,
    reservoir_dim=150,
    lr=1e-2,
    gamma=0.99,
    print_every=10,
    seed=0
):
    # Reproducibility (partial; environments can still add randomness)
    np.random.seed(seed)
    torch.manual_seed(seed)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    env = make_env()

    input_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    esn = NeuroSymbolicEchoStateNetwork(
        input_dim=input_dim,
        reservoir_dim=reservoir_dim,
        output_dim=action_dim,
        device=device
    )
    policy = PolicyNetwork(esn).to(device)

    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

    reward_history = []

    for episode in range(episodes):
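        obs = reset_env(env)
        state = torch.tensor(obs, dtype=torch.float32, device=device)

        # Start each episode from a zero reservoir state so that the previous
        # episode's history does not leak into this one
        esn.state = torch.zeros_like(esn.state)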

        rewards = []
        log_probs = []

        # feedback for symbolic rule refinement
        feedback = {"pole_angle": 0.0, "cart_position": 0.0}

        done = False
        while not done:
            probs = policy(state)
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()

            log_probs.append(dist.log_prob(action))

            obs, reward, done, _info = step_env(env, int(action.item()))
            next_state = torch.tensor(obs, dtype=torch.float32, device=device)
            rewards.append(float(reward))

            # Simple heuristic feedback signal:
            # Encourage upright pole (small angle) and centered cart (small position).
            feedback["pole_angle"] += reward if abs(float(next_state[2])) < 0.1 else -1.0
            feedback["cart_position"] += reward if abs(float(next_state[0])) < 1.0 else -1.0

            state = next_state

        # refine rules after each episode
        esn.refine_symbolic_rules(feedback)

        # discounted returns
        discounted = []
        R = 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            discounted.insert(0, R)

        returns = torch.tensor(discounted, dtype=torch.float32, device=device)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        # policy gradient loss
        loss = torch.stack([-lp * Rt for lp, Rt in zip(log_probs, returns)]).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_reward = sum(rewards)
        reward_history.append(total_reward)

        if episode % print_every == 0:
            print(f"Episode {episode:03d} | Total Reward: {total_reward:.1f}")

    env.close()
    return reward_history


# Run training (feel free to change episodes)
reward_history = train(episodes=1500, print_every=10)


Episode 000 | Total Reward: 21.0
Episode 010 | Total Reward: 9.0
Episode 020 | Total Reward: 12.0
Episode 030 | Total Reward: 8.0
Episode 040 | Total Reward: 28.0
Episode 050 | Total Reward: 10.0
Episode 060 | Total Reward: 47.0
Episode 070 | Total Reward: 19.0
Episode 080 | Total Reward: 85.0
Episode 090 | Total Reward: 60.0
Episode 100 | Total Reward: 48.0
Episode 110 | Total Reward: 54.0
Episode 120 | Total Reward: 55.0
Episode 130 | Total Reward: 95.0
Episode 140 | Total Reward: 65.0
Episode 150 | Total Reward: 86.0
Episode 160 | Total Reward: 143.0
Episode 170 | Total Reward: 93.0
Episode 180 | Total Reward: 123.0
Episode 190 | Total Reward: 128.0
Episode 200 | Total Reward: 173.0
Episode 210 | Total Reward: 195.0
Episode 220 | Total Reward: 70.0
Episode 230 | Total Reward: 123.0
Episode 240 | Total Reward: 175.0
Episode 250 | Total Reward: 413.0
Episode 260 | Total Reward: 148.0
Episode 270 | Total Reward: 118.0
Episode 280 | Total Reward: 88.0
Episode 290 | Total Reward: 133.0
Episode 300 | Total Reward: 45.0


## 7) Plot learning curve

We can plot total episode reward over time.


In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.plot(reward_history)
plt.xlabel("Episode")
plt.ylabel("Total Reward")
plt.title("Neuro-Symbolic ESN on CartPole (REINFORCE)")
plt.show()



## 8) Discussion: what worked, what didn’t, and why

### What’s nice here
- **Interpretability:** the symbolic module is readable and editable.
- **Modularity:** you can replace rules or reservoir without changing the whole system.
- **Fast iteration:** ESN training is light because only the readout is trained.

### Limitations of this tutorial design
- The ESN reservoir weights are random and fixed: performance can vary by seed.
- The rule refinement method here is **heuristic** and not guaranteed to improve.
- REINFORCE has high variance; it may need tuning.
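
Because episode rewards swing widely under REINFORCE (in the log above, 413 at episode 250 and 45 at episode 300), the raw learning curve can be hard to read. One option is to overlay a moving average. The helper below is a hedged sketch, with `moving_average` and its default window size as illustrative choices.

```python
import numpy as np

def moving_average(values, window=50):
    """Mean over a sliding window; returns len(values) - window + 1 points."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values.copy()          # too short to smooth; return unchanged
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

smoothed = moving_average([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

Applied to the training output, `plt.plot(moving_average(reward_history))` would show the trend with most of the per-episode noise averaged out.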
