# Notebook 5: Hardware Implementation (FPGA) & Digital Logic

**Topic:** PhD Thesis - Scientific Computing & Geophysical Inversion  
**Description:** This final notebook bridges the gap between software algorithms and hardware reality. We translate the continuous Langevin Dynamics developed in previous notebooks into a **Digital Logic Architecture** suitable for FPGA implementation. This involves Fixed-Point Arithmetic, Pseudo-Random Number Generation (LFSR), and Massively Parallel Mesh design.

## 1. From Continuous to Digital

Standard CPUs use 64-bit Floating Point arithmetic (Float64). FPGAs, however, are optimized for low-precision integers. To map our P-bit algorithm to hardware, we must discretize both **Time** and **Values**.

### 1.1. The Discrete Update Equation

Recall the continuous Langevin equation from Notebook 3:

$$
m(t+\Delta t) = m(t) - \eta \nabla \Phi + \sqrt{2T} \cdot \xi
$$

**Where:**
* $m(t)$: Current model parameter (conductivity) at continuous time $t$.
* $\eta$: Learning rate (step size) determining how fast we move.
* $\nabla \Phi$: Gradient of the energy function (Deterministic Force).
* $T$: Temperature (Noise magnitude).
* $\xi$: Random Gaussian noise source.

In **Digital Logic**, continuous variables become registers, and multiplications by powers of two become bit-shifts. The hardware update rule is:

$$
M_{reg}[k+1] = M_{reg}[k] - (G_{fixed}[k] \gg S) + (\text{LFSR}[k] \times \text{Gain}[k])
$$

**Where:**
* $M_{reg}[k]$: The model value stored in a hardware register at clock cycle $k$ (e.g., 16-bit Signed Integer).
* $G_{fixed}[k]$: The gradient value converted to fixed-point integer format.
* $\gg S$: Arithmetic right shift by $S$ bits. This replaces the multiplication by the learning rate, so $\eta = 2^{-S}$ (e.g., shifting right by 4 divides by 16, rounding toward $-\infty$).
* $\text{LFSR}[k]$: A pseudo-random integer generated by a Linear Feedback Shift Register at cycle $k$.
* $\text{Gain}[k]$: An integer multiplier that sets the noise amplitude, playing the role of the temperature term $\sqrt{2T}$.
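
As a rough illustration of the shift term, the sketch below compares `g >> S` with true division by $2^S$ for a few integer gradients. The input values and the helper name `shift_scale` are arbitrary choices for this example, not part of the thesis design. The negative cases show that an arithmetic right shift rounds toward $-\infty$ rather than toward zero.

```python
# Illustrative only: compare an arithmetic right shift with true division.
S = 4  # shift amount, corresponding to eta = 1 / 2**S = 1/16

def shift_scale(g_fixed: int, s: int) -> int:
    """Scale an integer gradient by 2**-s using an arithmetic right shift."""
    return g_fixed >> s

for g in (160, 100, 7, -7, -100):
    print(f"g={g:5d}  g>>{S}={shift_scale(g, S):4d}  g/{2**S}={g / 2**S:8.4f}")
```

A small negative gradient such as `-7` still yields `-1` after the shift, while `+7` yields `0`, so the shifted update is not symmetric around zero.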

## 2. Fixed-Point Arithmetic (Q-Format)

We represent real numbers using integers by scaling them by a factor of $2^Q$. This is called **Q-format**.

The conversion formula is:
$$
X_{int} = \text{round}(X_{float} \times 2^Q)
$$

**Where:**
* $X_{int}$: The integer representation stored in FPGA registers (e.g., `1024`).
* $X_{float}$: The real-world physical value (e.g., `1.0`).
* $Q$: The number of fractional bits (Precision). If $Q=10$, then $2^{10}=1024$ becomes our scale factor.

For example, if $Q=10$:
* Real value $0.5$ $\rightarrow$ Integer $512$.
* Real value $-2.0$ $\rightarrow$ Integer $-2048$.
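
To make the trade-off between range and precision concrete, the following sketch lists the resolution and representable range of a signed 16-bit word in Q10, and saturates values that do not fit. The 16-bit width follows the example register size mentioned earlier; the helper name `to_q` and the saturation behaviour are assumptions made for illustration.

```python
# Illustrative sketch: properties of a signed 16-bit word in Q10 format.
WORD_BITS = 16   # assumed register width
Q = 10           # fractional bits

scale = 1 << Q                         # 2**Q = 1024
resolution = 1 / scale                 # smallest representable step
int_min = -(1 << (WORD_BITS - 1))      # most negative 16-bit integer
int_max = (1 << (WORD_BITS - 1)) - 1   # most positive 16-bit integer

def to_q(x: float) -> int:
    """Quantize x to Q10 and saturate to the signed 16-bit range."""
    raw = round(x * scale)
    return max(int_min, min(int_max, raw))

print("resolution:", resolution)
print("range:", int_min / scale, "to", int_max / scale)
print(to_q(0.5), to_q(-2.0), to_q(100.0))
```

With these settings the word covers roughly $[-32, 32)$ in steps of $2^{-10}$, and an out-of-range input such as `100.0` is pinned to the largest integer instead of wrapping around.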

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# --- Helper Functions for Hardware Simulation ---

Q_FACTOR = 10  # 10 bits for fractional part
SCALE = 2**Q_FACTOR

def to_fixed(x):
    """Converts Float to Fixed-Point Integer (Q10)"""
    return np.round(x * SCALE).astype(int)

def to_float(x_int):
    """Converts Fixed-Point Integer back to Float"""
    return x_int / SCALE

# Test the quantization
val_real = 0.12345
val_fixed = to_fixed(val_real)
val_recovered = to_float(val_fixed)

print(f"Real: {val_real}")
print(f"Fixed (Integer): {val_fixed} (Binary: {bin(val_fixed)})")
print(f"Recovered: {val_recovered}")
print(f"Quantization Error: {abs(val_real - val_recovered):.6f}")

## 3. Hardware Noise Generation: LFSR

Generating true Gaussian noise on a chip is complex and resource-heavy. Instead, we use a **Linear Feedback Shift Register (LFSR)** to generate pseudo-random integers. The output of a single LFSR is approximately uniformly distributed, but summing multiple LFSRs (or filtering one) roughly approximates a Gaussian distribution via the Central Limit Theorem. The simulation in this notebook uses a single LFSR per P-bit, so its noise is uniform rather than Gaussian.
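
The Central Limit Theorem argument can be checked numerically with a small sketch. It uses the 16-bit feedback taps adopted later in this notebook, advances the register 16 steps between words so that consecutive words share no bits, and sums `K = 4` words per sample. The choice of `K`, the sample count, the seed, and the use of excess kurtosis as a "Gaussianity" measure are all assumptions for illustration.

```python
import numpy as np

def lfsr16_step(state: int) -> int:
    """One step of a right-shifting 16-bit LFSR (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_words(seed: int, n: int) -> np.ndarray:
    """Return n signed words, stepping 16 times between consecutive words."""
    state, out = seed, []
    for _ in range(n):
        for _ in range(16):
            state = lfsr16_step(state)
        out.append(state - 32768)  # re-center around zero
    return np.array(out, dtype=np.int64)

def excess_kurtosis(x: np.ndarray) -> float:
    """Excess kurtosis: about -1.2 for uniform data, 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

K = 4      # words summed per output sample (assumed)
N = 2000   # number of output samples (assumed)
words = lfsr16_words(seed=0xACE1, n=K * N)
summed = words.reshape(N, K).sum(axis=1)

k_single = excess_kurtosis(words[:N])
k_sum = excess_kurtosis(summed)
print(f"excess kurtosis, single word: {k_single:.3f}")
print(f"excess kurtosis, sum of {K}:   {k_sum:.3f}")
```

The summed samples should show an excess kurtosis closer to 0 than the single words do, at the cost of `K` times as many LFSR outputs per noise sample.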

The update logic for a standard Fibonacci LFSR, in which the new bit is the XOR of the tapped bits, is:

$$
b_{new} = b_{tap_1} \oplus b_{tap_2} \oplus \dots \oplus b_{tap_n}
$$

**Where:**
* $\oplus$: XOR operation (Exclusive OR gate in hardware).
* $b_{new}$: The new bit shifted into the register (MSB or LSB).
* $b_{tap}$: The bits at specific positions ('taps') chosen to maximize the period of the random sequence.
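
Whether a tap set really maximizes the period can be checked by brute force. The sketch below steps a 16-bit register with the feedback polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$ until it returns to its seed and counts the steps. The function names and the seed `0xACE1` are illustrative assumptions; a maximal-length 16-bit LFSR visits every non-zero state, giving a period of $2^{16} - 1 = 65535$.

```python
def lfsr16_step(state: int) -> int:
    """One step of a right-shifting 16-bit LFSR (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_period(seed: int) -> int:
    """Count the steps until the register returns to its (non-zero) seed."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

period = lfsr16_period(0xACE1)
print("period:", period)
print("zero state maps to:", lfsr16_step(0))
```

The second line printed shows why the all-zero state must be avoided as a seed: the XOR of zeros is zero, so the register never leaves it.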

In [None]:
class LFSR16:
    """
    16-bit Linear Feedback Shift Register Simulator.
    Polynomial: x^16 + x^14 + x^13 + x^11 + 1
    This logic is implemented using simple XOR gates and Flip-Flops on FPGA.
    """
    def __init__(self, seed=1):
        self.state = seed & 0xFFFF  # Ensure 16-bit
        if self.state == 0:
            # All-zero is a lock-up state (XOR of zeros stays zero)
            self.state = 1

    def next(self):
        # Taps 16, 14, 13, 11 are counted from the MSB end. For this
        # right-shifting register they are bits 0, 2, 3, 5 from the LSB.
        bit = ((self.state >> 0) ^ (self.state >> 2) ^ (self.state >> 3) ^ (self.state >> 5)) & 1
        self.state = (self.state >> 1) | (bit << 15)
        # Convert unsigned to signed (centered around 0)
        # State 0 never occurs, so the range is -32767 to +32767
        return self.state - 32768

# Verify Randomness
lfsr = LFSR16(seed=1234)
noise_samples = [lfsr.next() for _ in range(1000)]

plt.figure(figsize=(10, 3))
plt.plot(noise_samples[:200])
plt.title("First 200 outputs of LFSR (Hardware Noise Source)")
plt.grid(True)
plt.show()

## 4. The P-bit Core Logic (Bit-True Model)

Now we assemble the components into a single **P-bit Core Class**. This class is a bit-true model: it simulates, cycle by cycle, what the corresponding Verilog module would compute.

**Architecture Components:**
1.  **Input:** `grad_input` (From gradient bus).
2.  **Memory:** `m_reg` (Register storing the current conductivity model).
3.  **Noise:** `LFSR` instance.
4.  **Logic:** Adder, Shifter, and Clamper.

In [None]:
class DigitalPbit:
    def __init__(self, init_val, seed):
        # Initialize register with fixed-point value
        self.m_reg = to_fixed(init_val) 
        self.lfsr = LFSR16(seed)
        
    def clock_step(self, grad_input_fixed, shift_amount, noise_gain):
        """
        Executes one clock cycle update.
        grad_input_fixed: Gradient value (Integer)
        shift_amount: Simulates Learning Rate (Bitwise Right Shift)
        noise_gain: Multiplier for noise intensity (Temperature)
        """
        # 1. Generate Noise (Hardware: LFSR module)
        noise_val = self.lfsr.next() >> 4 # Scale down raw 16-bit noise
        
        # 2. Apply Update Rule (Hardware: Adder/Subtractor)
        # Update = - (Grad >> S) + (Noise * Gain)
        deterministic_force = -(grad_input_fixed >> shift_amount)
        stochastic_force = noise_val * noise_gain
        
        self.m_reg += (deterministic_force + stochastic_force)
        
        # 3. Clamping (Hardware: Comparator/Mux)
        # We must keep the model within physical limits [-4.0, 0.0]
        # In Fixed-Point: [-4096, 0]
        limit_min = to_fixed(-4.0)
        limit_max = to_fixed(0.0)
        
        if self.m_reg < limit_min: self.m_reg = limit_min
        if self.m_reg > limit_max: self.m_reg = limit_max
        
        return self.m_reg

    def get_val(self):
        return to_float(self.m_reg)

## 5. Simulating a "Bit-True" Optimization

We will now simulate a simple 1D optimization problem using this bit-true logic. The aim is to demonstrate that the integer approximations still converge to the correct result, up to a small quantization error.

* **Goal:** Find $x$ that minimizes $f(x) = (x - (-2.0))^2$.
* **Gradient:** $\nabla f = 2(x + 2.0)$.
* **Target Solution:** $x = -2.0$.
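
One effect worth checking before running the full simulation is the quantization "dead zone" created by the right shift: when the shifted gradient rounds to zero, the register stops moving. The sketch below works in register units for this quadratic, with Q10 scaling, a shift of 6, and no noise. The helper names `noise_free_step` and `settle` are hypothetical and only model the gradient term of the update.

```python
Q = 10
SCALE = 1 << Q
SHIFT = 6
TARGET = -2.0

def noise_free_step(m_int: int) -> int:
    """One gradient-only update in register units (noise gain = 0)."""
    x = m_int / SCALE
    grad_fixed = round(2 * (x - TARGET) * SCALE)  # gradient 2(x + 2) in Q10
    return m_int - (grad_fixed >> SHIFT)

def settle(m_int: int, steps: int = 500) -> int:
    """Apply the noise-free update repeatedly."""
    for _ in range(steps):
        m_int = noise_free_step(m_int)
    return m_int

target_int = round(TARGET * SCALE)

# Offsets (in LSBs) from the target at which the register no longer moves.
stuck = [d for d in range(-64, 65)
         if noise_free_step(target_int + d) == target_int + d]
print("dead zone (LSBs):", min(stuck), "to", max(stuck))

from_below = settle(round(-4.0 * SCALE)) - target_int
from_above = settle(0) - target_int
print("final offset from below:", from_below, "LSBs")
print("final offset from above:", from_above, "LSBs")
```

Because the shift rounds toward $-\infty$, the dead zone is one-sided: a register approaching from below reaches the target exactly, while one approaching from above stalls 31 LSBs (about 0.03) short. A smaller shift or more fractional bits in the gradient would narrow this gap.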

In [None]:
# Simulation Setup
pbit = DigitalPbit(init_val=-4.0, seed=999)
history = []
noise_profile = []

print("Starting Bit-True Hardware Simulation...")

CYCLES = 1500

for cycle in range(CYCLES):
    # 1. Read current value from register
    x_curr = pbit.get_val()
    
    # 2. Compute Gradient (Simulating the external physics engine)
    grad_float = 2 * (x_curr - (-2.0))
    grad_fixed = to_fixed(grad_float)
    
    # 3. Annealing Schedule (Noise Gain drops over time)
    # int() truncates the linear ramp into a staircase: 3 at cycle 0 only,
    # then 2, 1 and finally 0, each for roughly one third of the run
    current_noise_gain = int(3 * (1 - cycle/CYCLES))
    
    # 4. Clock Step (Hardware Update)
    # Shift 6 -> Division by 64 (Learning Rate = 1/64, approx 0.0156)
    pbit.clock_step(grad_fixed, shift_amount=6, noise_gain=current_noise_gain)
    
    history.append(x_curr)
    noise_profile.append(current_noise_gain)

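# history[-1] holds the value read *before* the last clock step,
# so read the register directly for the final value
print(f"Final Register Value: {pbit.get_val():.4f} (Target: -2.0000)")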

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(history, label='Register Value')
plt.axhline(-2.0, color='r', linestyle='--', label='Target')
plt.title("Optimization Convergence (Hardware Simulation)")
plt.xlabel("Clock Cycles")
plt.ylabel("Model Parameter")
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.plot(noise_profile, color='orange')
plt.title("Annealing Schedule (Noise Gain)")
plt.xlabel("Clock Cycles")
plt.ylabel("Gain (Integer)")
plt.grid(True)

plt.show()

## 6. Massively Parallel Architecture (Mesh)

The real power of this approach comes from scaling. On an FPGA, we instantiate thousands of `DigitalPbit` cores in a 2D Mesh topology.

### 6.1. The Systolic Array Concept

The FPGA design consists of:
1.  **P-bit Core Mesh:** A $100 \times 100$ grid of the registers defined above.
2.  **Gradient Bus:** A high-bandwidth bus (AXI-Stream) that streams the per-core gradient values from the CPU to the FPGA.
3.  **Global Controller:** A simple state machine that manages the annealing schedule (broadcasting `shift_amount` and `noise_gain`).
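
A controller on the FPGA would not have floating-point division available for the annealing schedule. As one possible sketch, the staircase below is computed with integer arithmetic only; the function name `noise_gain_schedule` and its parameters are hypothetical, and a real state machine would more likely compare a cycle counter against precomputed thresholds such as the ones this code prints.

```python
def noise_gain_schedule(cycle: int, total_cycles: int, max_gain: int = 3) -> int:
    """Integer-only staircase from max_gain (at cycle 0) down to 0."""
    return (max_gain * (total_cycles - cycle)) // total_cycles

CYCLES = 1500
gains = [noise_gain_schedule(c, CYCLES) for c in range(CYCLES)]

# Cycle index at which each gain level first appears.
first_seen = {g: gains.index(g) for g in sorted(set(gains), reverse=True)}
print(first_seen)
```

The printed dictionary gives the switching points that a hardware counter comparison would need to reproduce the same schedule.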

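Since each P-bit updates independently based on its local gradient and internal random generator, all cores can update in the same clock cycle: once the gradients are available, updating 10,000 parameters takes the same single cycle as updating one. Throughput therefore scales **linearly** with the number of cores, as long as the gradient bus can deliver the gradients fast enough.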
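
The mesh can be mimicked in software by letting each NumPy array element stand in for one core, so that a single vectorized call plays the role of one clock cycle. This is only a sketch under several assumptions: every core minimizes the same quadratic $(x + 2)^2$, LFSR seeds are drawn with NumPy's random generator, and the two-phase gain schedule and the names `mesh_clock_step` and `lfsr16_step` are illustrative.

```python
import numpy as np

Q = 10
SCALE = 1 << Q
ROWS, COLS = 100, 100
LIMIT_MIN, LIMIT_MAX = -4 * SCALE, 0   # [-4.0, 0.0] in Q10

def lfsr16_step(state):
    """Advance every 16-bit LFSR in the array by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def mesh_clock_step(m_reg, lfsr_state, grad_fixed, shift_amount, noise_gain):
    """One clock cycle for the whole mesh: noise, update, clamp."""
    lfsr_state = lfsr16_step(lfsr_state)
    noise = (lfsr_state - 32768) >> 4
    m_reg = m_reg - (grad_fixed >> shift_amount) + noise * noise_gain
    return np.clip(m_reg, LIMIT_MIN, LIMIT_MAX), lfsr_state

rng = np.random.default_rng(0)
# One non-zero seed per core (assumed seeding strategy).
lfsr_state = rng.integers(1, 1 << 16, size=(ROWS, COLS), dtype=np.int64)
m_reg = np.full((ROWS, COLS), -4 * SCALE, dtype=np.int64)

for cycle in range(500):
    noise_gain = 1 if cycle < 200 else 0    # noisy phase, then pure descent
    grad_fixed = 2 * (m_reg + 2 * SCALE)    # gradient of (x + 2)^2 in Q10
    m_reg, lfsr_state = mesh_clock_step(m_reg, lfsr_state, grad_fixed, 6, noise_gain)

offset = m_reg - (-2 * SCALE)
print("offset from target (LSBs):", offset.min(), "to", offset.max())
```

In this software model the per-cycle cost still grows with the array size; on the FPGA the same element-wise operations would run in parallel, one set of logic per core.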

## 7. Conclusion & Thesis Roadmap

This notebook concludes the methodological development of the thesis. We have successfully traversed the full stack:

1.  **Physics (Notebook 1):** Defined the Forward Problem (Maxwell's Eqs).
2.  **Inverse Theory (Notebook 2):** Derived Gradients via Adjoint State Method.
3.  **Statistical Mechanics (Notebook 3):** Introduced P-bits and Langevin Dynamics.
4.  **Artificial Intelligence (Notebook 4):** Integrated RBMs for SciML inversion.
5.  **Hardware Implementation (Notebook 5):** Mapped the algorithms to Digital Logic.

The final proposed system represents a **Physics-AI-Hardware** co-design aimed at solving large-scale geophysical inverse problems faster and more energy-efficiently than a CPU-based implementation.