# b16 Documentation

## Bernd Paysan

January 17, 2011

# Abstract

This article presents the architecture and implementation of the b16 stack processor. The processor is inspired by Chuck Moore's newest Forth processors. The minimalistic design fits into small FPGAs and ASICs and is well suited for applications that need both control and calculation. The balance is shifted towards control to save space. The synthesizable implementation is written in Verilog.

#### Introduction

Minimalistic CPUs can be used in many designs. A state machine often becomes too complicated and too difficult to develop once it has more than a few states. A program with subroutines can perform far more complex tasks and is easier to develop at the same time. Also, ROM and RAM blocks occupy much less silicon area than "random logic". The same holds for FPGAs, where "block RAM" is plentiful, in contrast to logic elements.

The architecture is inspired by the c18 from Chuck Moore [1]. The exact instruction mix is different; it also differs from the standard b16 core. Also, this architecture is byte-addressed.

A word about Verilog: Verilog is a C-like language, tailored to simulating logic and to writing synthesizable code. Variables are bits and bit vectors, and assignments are typically non-blocking, i.e. all right-hand sides are computed first and the left-hand sides are modified afterwards. Verilog also has events, such as value changes or clock edges, and blocks can wait on them.
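The effect of non-blocking assignment is easiest to see with two registers that exchange their values. The following is a minimal simulation sketch; the module and signal names are made up for illustration and are not part of the b16 source.

```verilog
module nonblocking_demo;
  reg       clk;
  reg [7:0] a, b;

  // Both right-hand sides are sampled before either register changes,
  // so the two registers exchange their values on the clock edge.
  always @(posedge clk) begin
    a <= b;
    b <= a;
  end

  initial begin
    clk = 1'b0; a = 8'd1; b = 8'd2;
    #1 clk = 1'b1;                 // one rising edge
    #1;
    if (a === 8'd2 && b === 8'd1)
      $display("swapped: a=%0d b=%0d", a, b);
    else
      $display("unexpected: a=%0d b=%0d", a, b);
    $finish;
  end
endmodule
```

With blocking assignments (`=`) the second statement would see the already updated `a`, and both registers would end up with the same value.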

# 1 Architectural Overview

The core components are

- An ALU
- A data stack with top and next of stack (T and N) as inputs for the ALU
- A return stack
- An instruction pointer P
- An address mux addr, to address external memory
- An instruction latch I

Figure 1 shows a block diagram.

Figure 1: Block diagram of the b16 small core

# 1.1 Register

In addition to the user-visible latches there are control latches for external RAM (rd and wr), stack pointers (sp and rp), and a carry c.

| Name  | Function             |
|-------|----------------------|
| T     | Top of Stack         |
| I     | Instruction Bundle   |
| P     | Program Counter      |
| R     | Top of Return Stack  |
| state | Processor State      |
| sp    | Stack Pointer        |
| rp    | Return Stack Pointer |
| c     | Carry Flag           |

```
⟨register declarations⟩≡
  reg [sdep-1:0] sp;
  reg [rdep-1:0] rp;
  reg `L T, I, P, R;
  reg [1:0] state;
  reg c;
```

# 2 Instruction Set

There are 32 different instructions. Since several instructions fit into a 16 bit word, we call the bits that hold one packed instruction a "slot", and the instruction word itself a "bundle". The arrangement here is 1,5,5,5, i.e. the first slot is only one bit wide (its more significant bits are filled with 0), and the others are 5 bits each.
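The 1,5,5,5 arrangement can be pictured as a small slot selector. This is only a sketch: the slot order (most significant bit first) and all module and port names are assumptions for illustration, not the instruction selection logic of the b16 source.

```verilog
// Hypothetical slot selector for a 1,5,5,5 bundle.
module slot_select(bundle, slot, inst);
  input  [15:0] bundle;
  input  [1:0]  slot;
  output [4:0]  inst;
  reg    [4:0]  inst;

  always @(bundle or slot)
    case (slot)
      2'b00: inst = { 4'b0000, bundle[15] }; // 1-bit slot, zero-extended
      2'b01: inst = bundle[14:10];
      2'b10: inst = bundle[9:5];
      2'b11: inst = bundle[4:0];
    endcase
endmodule

module slot_select_tb;
  reg  [15:0] bundle;
  reg  [1:0]  slot;
  wire [4:0]  inst;
  integer     errors;

  slot_select dut(.bundle(bundle), .slot(slot), .inst(inst));

  initial begin
    errors = 0;
    // 1-bit slot set, followed by three 5-bit opcodes
    bundle = { 1'b1, 5'b01000, 5'b11011, 5'b00011 };
    slot = 2'b00; #1; if (inst !== 5'b00001) errors = errors + 1;
    slot = 2'b01; #1; if (inst !== 5'b01000) errors = errors + 1;
    slot = 2'b10; #1; if (inst !== 5'b11011) errors = errors + 1;
    slot = 2'b11; #1; if (inst !== 5'b00011) errors = errors + 1;
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

Because the one-bit slot is zero-extended, it can only ever select opcode 0 or opcode 1.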

The operations in one instruction word are executed one after the other. Each instruction takes one cycle; memory operations (including instruction fetch) need another cycle. Which instruction of the bundle is to be executed is stored in the variable state.

The instruction set is divided into four groups: jumps, ALU, memory, and stack. Table 1 gives an overview of the instruction set. Some special characters indicate functions as follows: "!" means "store", "@" means "load", and ">" means "to" if it comes before the name and "from" if it comes after.

Operations will be described using a "stack effect". This is a template for the stack elements before and after the operation, separated by a long dash. The names are listed in the order bottom to top, unchanged stack elements below are not listed.

Jumps use the rest of the instruction word as target address (except ret). The lower bits of the instruction pointer P are replaced; nothing is added. For instructions in the last slot, no address bits remain, so they use T (TOS) as target.
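Replacing the lower bits of P, rather than adding an offset, can be sketched as follows. The field positions, the zero in bit 0 (word-aligned targets), the slot numbering and all names are assumptions made for this illustration; the actual jmp logic of the b16 may differ.

```verilog
// Hypothetical jump target mux: the bits left over after the jump
// opcode replace the low bits of P; the last slot falls back to T.
module jump_target(P, I, T, slot, target);
  input  [15:0] P, I, T;
  input  [1:0]  slot;
  output [15:0] target;
  reg    [15:0] target;

  always @(P or I or T or slot)
    case (slot)
      2'b00: target = { I[14:0], 1'b0 };           // 15 bits remain
      2'b01: target = { P[15:11], I[9:0], 1'b0 };  // 10 bits remain
      2'b10: target = { P[15:6],  I[4:0], 1'b0 };  //  5 bits remain
      2'b11: target = { T[15:1], 1'b0 };           // none remain: use T
    endcase
endmodule

module jump_target_tb;
  reg  [15:0] P, I, T;
  reg  [1:0]  slot;
  wire [15:0] target;
  integer     errors;

  jump_target dut(.P(P), .I(I), .T(T), .slot(slot), .target(target));

  initial begin
    errors = 0;
    P = 16'h1234; I = 16'h03FF; T = 16'hABCD;
    slot = 2'b00; #1; if (target !== 16'h07FE) errors = errors + 1;
    slot = 2'b01; #1; if (target !== 16'h17FE) errors = errors + 1;
    slot = 2'b10; #1; if (target !== 16'h123E) errors = errors + 1;
    slot = 2'b11; #1; if (target !== 16'hABCC) errors = errors + 1;
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

The shorter the remaining field, the smaller the region around P that a jump can reach; only the upper bits of P survive unchanged.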

The instructions themselves are executed depending on inst:

```
⟨instructions⟩≡
  casez(inst)
     ⟨control flow⟩
     ⟨ALU operations⟩
     ⟨load/store⟩
     ⟨stack operations⟩
  endcase // case(inst)
```

#### 2.1 Jumps

In detail, jumps are performed as follows: the target address is stored in the address latch addr, which addresses memory, not in the P register. The register P is set to the incremented value of addr after the instruction fetch cycle. Apart from call, jmp and ret there are conditional jumps, which test for 0 and carry. The lowest bit of the return address on the return stack is used to save the carry flag across calls. Conditional instructions don't consume the tested value, which is different from Forth.
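Since instruction addresses are word-aligned, bit 0 of a return address is free to carry the flag. A minimal sketch of this packing and unpacking, with illustrative signal names that do not appear in the b16 source:

```verilog
module retaddr_demo;
  reg  [15:0] P;   // word-aligned return address (bit 0 is 0)
  reg         c;   // carry flag at the time of the call

  wire [15:0] saved   = { P[15:1], c };         // value pushed by call
  wire [15:0] P_after = { saved[15:1], 1'b0 };  // ret: address part
  wire        c_after = saved[0];               // ret: restored carry

  initial begin
    P = 16'h1234; c = 1'b1;
    #1;
    if (saved === 16'h1235 && P_after === 16'h1234 && c_after === 1'b1)
      $display("PASS");
    else
      $display("FAIL: saved=%h P=%h c=%b", saved, P_after, c_after);
    $finish;
  end
endmodule
```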

To make it easier to understand, I also define the effect of an instruction in a pseudo language:

```
nop  (—)
call (—r:P)   P ← jmp; c ← 0
jmp  (—)      P ← jmp
ret  (r:a—)   P ← a ∧ $FFFE; c ← a ∧ 1
jz   ( n—)    if (n = 0) P ← jmp
jnz  ( n—)    if (n ≠ 0) P ← jmp
jc   (x—)     if (c) P ← jmp
jnc  (x—)     if (c = 0) P ← jmp
```

```
⟨control flow⟩≡
  5'b00001: begin
     rp <= rpdec;
     R <= { state == 2'b00 ? incaddr[15:1] : P[15:1], c };
     P <= jmp;
     c <= 1'b0;
     if(state == 2'b11) `DROP;
  end
  5'b00010: begin
     P <= jmp;
     if(state == 2'b11) `DROP;
  end
  5'b00011: { rp, c, P, R } <=
               { rpinc, R[0], R[l-1:1], 1'b0, toR };
  5'b001??: begin
     if((inst[1] ? c : zero) ^ inst[0])
        P <= jmp;
     `DROP;
  end
```

#### 2.2 ALU Operations

The ALU instructions use the ALU, which computes a result res and a carry bit from T and N. The instruction com is an exception, since it only inverts T—that doesn't require an ALU.

Ordinary ALU instructions just write the result of the ALU into T and c, and reload N.

**xor** ( a b—r )  $r \leftarrow a \oplus b$

**com** ( a—r )  $r \leftarrow a \oplus \$FFFF$, $c \leftarrow 1$

|    | 0   | 1    | 2    | 3   | 4   | 5   | 6  | 7    | Comment    |
|----|-----|------|------|-----|-----|-----|----|------|------------|
| 0  | nop | call | jmp  | ret | jz  | jnz | jc | jnc  |            |
|    |     | exec | goto | ret | gz  | gnz | gc | gnc  | for slot 3 |
| 8  | xor | com  | and  | or  | +   | +c  | *+ | /-   |            |
| 10 | !+  | @+   | @    | lit | c!+ | c@+ | c@ | litc |            |
|    | !.  | @.   | @    | lit | c!. | c@. | c@ | litc | for slot 1 |
| 18 | nip | drop | over | dup | >r  |     | r> |      |            |

Table 1: Instruction Set

```
and ( a b—r )    r ← a ∧ b
or  ( a b—r )    r ← a ∨ b
+   ( a b—r )    c, r ← a + b
+c  ( a b—r )    c, r ← a + b + c
*+  ( a b—a r )  if (c) c_n, r ← a + b else c_n, r ← 0, b;
                 r, R, c ← c_n, r, R
/-  ( a b—a r )  c_n, r_n ← a + b + 1; if (c ∨ c_n) r ← r_n;
                 c, r, R ← r, R, c ∨ c_n
```

```
⟨ALU operations⟩≡
  5'b01001: { c, T } <= { 1'b1, ~T };
  5'b01110: { T, R, c } <=
      { c ? { carry, res } : { 1'b0, T }, R };
  5'b01111: { c, T, R } <=
      { (c | carry) ? res : T, R, (c | carry) };
  `ifndef FPGA
  5'b01???: { sp, c, T } <= { spinc, carry, res };
  `else
  5'b01000: { sp, c, T } <= { spinc, carry, res };
  5'b01010: { sp, c, T } <= { spinc, carry, res };
  5'b01011: { sp, c, T } <= { spinc, carry, res };
  5'b01100: { sp, c, T } <= { spinc, carry, res };
  5'b01101: { sp, c, T } <= { spinc, carry, res };
  `endif
```
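The logical operations and the sums can share hardware because a full-adder cell already contains them: with the carry input forced to 0 the sum output is the xor and the carry output is the and of the two operands, and with the carry input forced to 1 the carry output is their or. The sketch below checks this for a single bit; the module names are illustrative and the cell is not the b16 ALU itself.

```verilog
// One full-adder cell: sum and majority (carry-out) of three inputs.
module fa_cell(a, b, cin, sum, cout);
  input  a, b, cin;
  output sum, cout;
  assign sum  = a ^ b ^ cin;
  assign cout = (a & b) | (a & cin) | (b & cin);
endmodule

module fa_cell_tb;
  reg     a, b, cin;
  wire    sum, cout;
  integer i, errors;

  fa_cell dut(.a(a), .b(b), .cin(cin), .sum(sum), .cout(cout));

  initial begin
    errors = 0;
    for (i = 0; i < 4; i = i + 1) begin
      { a, b } = i;                      // low two bits of the counter
      cin = 1'b0; #1;                    // carry-in 0: xor and and
      if (sum !== (a ^ b) || cout !== (a & b)) errors = errors + 1;
      cin = 1'b1; #1;                    // carry-in 1: or
      if (cout !== (a | b)) errors = errors + 1;
    end
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

Chaining the carry output of each cell into the next one turns the same cells into an adder, so only the source of the carry inputs has to be selected per instruction.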

# 2.3 Memory Instructions

Memory instructions use either T as address and N as data (source or destination), or P as address and T as destination (literals). The address is auto-incremented, except for instructions in the first slot which use T as address; this makes it possible to implement read-modify-write instructions. Non-incrementing accesses are written as @. or !. in the assembler, don't-care accesses as @\* or !\*.

```
!+   ( n A—A' )  mem[A] ← n; A' ← A + 2
@+   ( A—n A' )  n ← mem[A]; A' ← A + 2
@    ( A—n )     n ← mem[A]
lit  (—n )       n ← mem[P]; P ← P + 2
c!+  ( c A—A' )  mem.b[A] ← c; A' ← A + 1
c@+  ( A—c A' )  c ← mem.b[A]; A' ← A + 1
c@   ( A—c )     c ← mem.b[A]
litc (—c)        c ← mem.b[P]; P ← P + 1
```

```
⟨address handling⟩≡
  wire `L incaddr, dataw, datas;
  wire tos2r, tos2n;
  wire incby, bswap, addrsel, access, rd;
  wire [1:0] wr;
  assign incby = (rwinst[4:2] != 3'b101);
  assign access = (rwinst[4:3] == 2'b10);
  assign addrsel = rd ?
        (access & (rwinst[1:0] != 2'b11)) : |wr;
  assign rd = (state==2'b00) ||
               (access && (rwinst[1:0]!=2'b00));
  assign wr = (access && (rwinst[1:0]==2'b00)) ?
               { ~rwinst[2] | ~T[0],
                 ~rwinst[2] | T[0] } : 2'b00;
  mux #(l) addrmux(.out(addr), .sel(addrsel), .atpg(1'b0
  assign incaddr = addr + incby + 1;
  assign tos2n = (!rd | (rwinst[1:0] == 2'b11));
  mux #(l) toNmux(.out(toN), .sel(tos2n), .atpg(atpg), .
  assign bswap = incby ^ addr[0];
  assign datas = bswap ? data : { data[7:0], data[l-1:8]
  assign dataw = incby ? datas : { 8'h00, datas[7:0] };
  assign dataout = { bswap ? N[15:8] : N[7:0],
                      bswap ? N[7:0] : N[15:8] };
```

Memory can be accessed not only word-wise but also byte-wise. Therefore two write lines exist. For a byte-wise store, the lower byte of T is copied to the higher one.
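How the two write lines select a byte can be shown in isolation. The expression below mirrors the wr assignment of the address handling chunk for an active store; the port names is_byte and a0 are illustrative stand-ins for the byte-access bit of the instruction and for address bit 0.

```verilog
// Byte lane enables for a store: wr[1] = upper byte, wr[0] = lower byte.
module byte_lanes(is_byte, a0, wr);
  input        is_byte, a0;
  output [1:0] wr;
  assign wr = { ~is_byte | ~a0, ~is_byte | a0 };
endmodule

module byte_lanes_tb;
  reg        is_byte, a0;
  wire [1:0] wr;
  integer    errors;

  byte_lanes dut(.is_byte(is_byte), .a0(a0), .wr(wr));

  initial begin
    errors = 0;
    is_byte = 1'b0; a0 = 1'b0; #1;   // word store: both lanes
    if (wr !== 2'b11) errors = errors + 1;
    is_byte = 1'b1; a0 = 1'b0; #1;   // byte store, even address
    if (wr !== 2'b10) errors = errors + 1;
    is_byte = 1'b1; a0 = 1'b1; #1;   // byte store, odd address
    if (wr !== 2'b01) errors = errors + 1;
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

A byte at an even address therefore lands in the upper half of the 16 bit word, and a byte at an odd address in the lower half.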

```
⟨load/store⟩≡
  5'b10?0?: begin
     if(nextstate != 2'b10) T <= incaddr;
     sp <= rd ? spdec : spinc;
  end
  5'b10?1?: T <= dataw;
```

Memory accesses need an extra cycle. Here the result of the memory access is handled.

```
⟨load-store⟩≡
  ⟨pointer increment⟩
  if(|state[1:0]) begin
     ⟨store afterwork⟩
  end else begin
     ⟨ifetch⟩
  end
```

After the access is completed, the result for a load has to be pushed on the stack, or into the instruction register; for stores, the TOS is to be dropped.

```
⟨store afterwork⟩≡
  if(rd && { inst[4:3], inst[1:0] } != 4'b1010)
     sp <= spdec;
  if(|wr) sp <= spinc;
```

Furthermore, the incremented address may go back to the pointer.

```
⟨pointer increment⟩≡
  if(~|state[1:0] ||
     ((inst[4:3] == 2'b10) && (inst[1:0] == 2'b11)))
     P <= incaddr;
```

To shortcut a nop in the first instruction, there's some special logic. That's the second part of NEXT.

#### 2.3.1 Peripherals

Peripherals should only use address bits [15:1], read a whole word, and select the bytes written to based on the two write bits (bit 1 for most significant byte, bit 0 for least significant byte).
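A peripheral register following this rule could look like the sketch below. The select input sel stands for a decode of address bits [15:1]; how it is derived, the clocked write timing and all names are assumptions for illustration, not part of the b16 source.

```verilog
// 16-bit peripheral register with one write enable per byte.
module periph_reg(clk, sel, wr, din, q);
  input         clk, sel;
  input  [1:0]  wr;
  input  [15:0] din;
  output [15:0] q;
  reg    [15:0] q;

  always @(posedge clk)
    if (sel) begin
      if (wr[1]) q[15:8] <= din[15:8];  // most significant byte
      if (wr[0]) q[7:0]  <= din[7:0];   // least significant byte
    end
endmodule

module periph_reg_tb;
  reg         clk, sel;
  reg  [1:0]  wr;
  reg  [15:0] din;
  wire [15:0] q;
  integer     errors;

  periph_reg dut(.clk(clk), .sel(sel), .wr(wr), .din(din), .q(q));

  // Apply one write with the given byte lanes and pulse the clock.
  task write;
    input [1:0]  lanes;
    input [15:0] value;
    begin
      wr = lanes; din = value;
      #1 clk = 1'b1;
      #1 clk = 1'b0;
    end
  endtask

  initial begin
    errors = 0; clk = 1'b0; sel = 1'b1;
    write(2'b11, 16'hABCD); if (q !== 16'hABCD) errors = errors + 1;
    write(2'b10, 16'h1234); if (q !== 16'h12CD) errors = errors + 1;
    write(2'b01, 16'h5678); if (q !== 16'h1278) errors = errors + 1;
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

Reads always return the whole word q, so the peripheral never has to look at address bit 0.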

# 2.4 Stack Instructions

Stack instructions change the stack pointer and move values into and out of latches. With the 8 used stack operations, one notes that swap is missing. Instead, there's nip. The reason is a possible implementation option: it's possible to omit N, and fetch this value directly out of the stack RAM. This consumes more time, but saves space.

```
nip ( a b—b )
drop ( a—)
over ( a b—a b a )
dup ( a—a a )
>r ( a—r:a )
r> ( r:a—a )
```

```
⟨stack operations⟩≡
  5'b11000: sp <= spinc;
  5'b11001: `DROP;
  5'b11010: { sp, T } <= { spdec, N };
  5'b11011: sp <= spdec;
  5'b11100: begin
     R <= T; rp <= rpdec; `DROP;
  end // case: 5'b11100
  5'b11110: begin
     { sp, T, R } <= { spdec, R, toR };
     rp <= rpinc;
  end // case: 5'b11110
```

# 3 The Rest of the Implementation

First the implementation file with comment and modules.

```
/*
 * b16 core: 16 bits,
 * inspired by c18 core from Chuck Moore
 *
 ⟨inst-comment⟩
 */
`define L [l-1:0]
`define DROP { sp, T } <= { spinc, N }
`define DEBUGGING
`define FPGA
// `define BUSTRI
`timescale 1ns / 1ns
⟨ALU⟩
⟨Stack⟩
⟨mux⟩
⟨cpu⟩
⟨debugger⟩

⟨inst-comment⟩≡
 * Instruction set:
 * 1, 5, 5, 5 bits
 *     0    1    2    3    4    5    6    7
 *  0: nop  call jmp  ret  jz   jnz  jc   jnc
 *  /3      exec goto ret  gz   gnz  gc   gnc
 *  8: xor  com  and  or   +    +c   *+   /-
 * 10: !+   @+   @    lit  c!+  c@+  c@   litc
 *  /1 !.   @.   @    lit  c!.  c@.  c@   litc
 * 18: nip  drop over dup  >r        r>

⟨mux⟩≡
  module mux(out, sel, atpg, in1, in0);
     parameter l=16;
     input `L in1, in0;
     input sel, atpg;
     output `L out;
     assign out = (sel | atpg) ? in1 : in0;
  endmodule // mux
```

#### 3.1 Top Level

The CPU consists of several parts, which are all implemented in the same Verilog module.

```
⟨cpu⟩≡
  module cpu(clk, run, reset, addr, rd, wr, data,
             dataout, scanning, atpg
  `ifdef DEBUGGING
             , dr, dw, daddr, din, dout, bp
  `endif
             );
     ⟨port declarations⟩
     ⟨register declarations⟩
     ⟨instruction selection⟩
     ⟨ALU instantiation⟩
     ⟨address handling⟩
     ⟨stack pushs⟩
     ⟨stack instantiation⟩
     ⟨state changes⟩
     ⟨debugging read⟩
     always @(posedge clk or negedge reset)
        ⟨register updates⟩
  endmodule // cpu
```

First, Verilog needs port declarations, so that it knows what is input and what is output. The parameters are used to configure other word sizes and stack depths.

```
⟨port declarations⟩≡
  parameter rstaddr=16'h3FFE, show=0,
            l=16, sdep=4, rdep=4;
  input clk, run, reset, scanning, atpg;
  output `L addr;
  output rd;
  output [1:0] wr;
  input `L data;
  output `L dataout;
  ⟨debugging-ports⟩
```

The ALU is instantiated with the configured width, and the necessary wires are declared.

```
⟨ALU instantiation⟩≡
  wire `L res, toN, toR, N;
  wire carry, zero;
  alu #(l) alu16(.res(res), .carry(carry), .zero(zero),
                 .T(T), .N(N), .c(c), .inst(inst[2:0]));
```

Since the stacks work in parallel, we have to calculate when a value is pushed onto the stack (thus only if something is stored there).

```
⟨stack pushs⟩≡
  reg dpush, rpush;
  always @(state or inst or rd or run ⟨dbg senselist⟩)
    begin
        rpush = 1'b0;
        dpush = (|state[1:0] & rd) |
                  (inst[4] && inst[3] && inst[1]);
        casez(inst)
            5'b00001: rpush = |state[1:0] | run;
            5'b11100: rpush = 1'b1;
        endcase // case(inst)
        ⟨stack debugging⟩
    end
```

The stacks consist not only of the two stack modules; they also need an incremented and a decremented stack pointer. The return stack additionally allows the top of the return stack to be written without changing the return stack depth.

```
⟨stack instantiation⟩≡
  wire [sdep-1:0] spdec, spinc;
  wire [rdep-1:0] rpdec, rpinc;
  stack #(sdep,l) dstack(.clk(clk), .sp(sp), .spdec(spde
                         .push(dpush), .in(toN), .out(N)
  stack #(rdep,l) rstack(.clk(clk), .sp(rp), .spdec(rpde
                         .push(rpush), .in(R), .out(toR)
  assign spdec = sp-{{(sdep-1){1'b0}}, 1'b1};
  assign spinc = sp+{{(sdep-1){1'b0}}, 1'b1};
  assign rpdec = rp-{{(rdep-1){1'b0}}, 1'b1};
  assign rpinc = rp+{{(rdep-1){1'b0}}, 1'b1};
```

The basic core is the fully synchronous register update. Each register needs a reset value, and depending on the state transition, the corresponding assignments have to be coded. Most of that is from above; only the instruction fetch and the assignment of the next value of incby have to be done.

```
⟨register updates⟩≡
  if(!reset) begin
     ⟨resets⟩
  end else if(run) begin
  `ifdef REPORT_VERBOSE
     if(show) begin
        ⟨debug⟩
     end
  `endif
     ⟨load-store⟩
     state <= nextstate;
     ⟨instructions⟩
  end else begin // debug
     ⟨debugging⟩
  end // else: !if(reset)
```

As reset value, we initialize the CPU so that it is about to fetch the next instruction from the reset address rstaddr. The stacks are all empty, and the other registers contain all zeros.

```
⟨resets⟩≡
  state <= 2'b11;
  P <= rstaddr;
  T <= 16'h0000;
  I <= 16'h0000;
  R <= 16'h0000;
  c <= 1'b0;
  sp <= 0;
  rp <= 0;
```

The transition to the next state (the NEXT within a bundle) is done separately. That's necessary, since the assignments of the other variables are not just dependent on the current state, but partially also on the next state (e.g. when to fetch the next instruction word).

### 3.2 Debugging

For debugging purposes, all registers are mapped into memory and can be read and written there. This requires an external bus master attached to the debugging interface. The debugging interface is configured with the DEBUGGING flag. It's only active when the processor is stopped, so the processor itself can't access its own registers.

The debugging module offers the following registers as address space:

| Address | read         | write   |
|---------|--------------|---------|
| \$FFE0  | stack[sp++]  | push+T  |
| \$FFE2  | rstack[rp++] | rpush+R |
| \$FFE4  | bp           | bp      |
| \$FFE6  | state+stop   | state   |
| \$FFE8  | P            | P       |
| \$FFEA  | T            | T       |
| \$FFEC  | R            | R       |
| \$FFEE  | I            | I       |

The stacks and the state register change state when being read, so be careful!

```
⟨debugger⟩≡
  `ifdef DEBUGGING
  module debugger(clk, nreset, run,
                  addr, data, r, w,
                  cpu_addr, cpu_r,
                  drun, dr, dw, bp);
     parameter l=16, dbgaddr = 12'hFFE;
     input clk, nreset, run, r, cpu_r;
     input [1:0] w;
     input [l-1:1] addr;
     input `L data, cpu_addr;
     output drun, dr, dw;
     output `L bp;
     reg drun, drun1;
     reg `L bp;

     wire dsel = (addr[l-1:4] == dbgaddr);
     assign dr = dsel & r;
     assign dw = dsel & |w;

     always @(posedge clk or negedge nreset)
        if(!nreset) begin
           drun <= 1;
           drun1 <= 1;
           bp <= 16'hffff;
        end else begin
           if(cpu_addr == bp && cpu_r) { drun, drun1 } <= 0;
           else if(run) drun <= drun1;
           if((dr | dw) && (addr[3:1] == 3'h3)) begin
              drun <= !dr & dw;
              drun1 <= !dr & dw & data[12];
           end
           if(dw && addr[3:1] == 3'h2) bp <= data;
        end
  endmodule
  `endif

⟨debugging⟩≡
  `ifdef DEBUGGING
  if(dw) case(daddr)
     3'h0: { sp, T } <= { spdec, din };
     3'h1: { rp, R } <= { rpdec, din };
     3'h3: { c, state, sp, rp } <=
              { din[10:8],
                din[sdep+3:4], din[rdep-1:0] };
     3'h4: P <= din;
     3'h5: T <= din;
     3'h6: R <= din;
     3'h7: I <= din;
  endcase
  if(dr) case(daddr)
     3'h0: sp <= spinc;
     3'h1: rp <= rpinc;
     default ;
  endcase
  `endif
```

```
⟨debugging read⟩≡
  `ifdef DEBUGGING
  reg `L dout;

  always @(daddr or dr or run or P or T or R or I or
           state or sp or rp or c or N or toR or bp)
  if(!dr || run) dout = 'h0;
  else case(daddr)
     3'h0: dout = N;
     3'h1: dout = toR;
     3'h2: dout = bp;
     3'h3: dout = { run, 4'h0, c, state,
                    {4-sdep{1'b0}}, sp,
                    {4-rdep{1'b0}}, rp };
     3'h4: dout = P;
     3'h5: dout = T;
     3'h6: dout = R;
     3'h7: dout = I;
  endcase
  `endif

⟨debugging-ports⟩≡
  `ifdef DEBUGGING
     input [2:0] daddr;
     input dr, dw;
     input `L din, bp;
     output `L dout;
  `endif

⟨dbg senselist⟩≡
  `ifdef DEBUGGING
  or run or dw or daddr
  `endif

⟨stack debugging⟩≡
  `ifdef DEBUGGING
  if(!run && dw) casez(daddr)
     3'h0: dpush = 1;
     3'h1: rpush = 1;
  endcase
  `endif
```

#### 3.3 ALU

The ALU just computes the sum with possible carry-ins, the logical operations, and a zero flag. It reuses the same logic (essentially what comprises a full adder) to do both sums and logic.

```
⟨ALU⟩≡
  // leda off
  module alu(res, carry, zero, T, N, c, inst);
     ⟨ALU ports⟩
     wire `L r1, r2;
     wire [l:0] carries;

     assign r1 = T ^ N ^ carries;
     assign r2 = (T & N) |
                 (T & carries`L) |
                 (N & carries`L);
  // This generates a carry chain, not a loop!
     assign carries =
            prop ? { r2[l-1:0], (c | selr) & andor }
                 : { c, {(l){andor}}};
     assign res = (selr & ~prop) ? r2 : r1;
     assign carry = carries[l];
     assign zero = ~|T;
  endmodule // alu
  // leda on
```

The ALU has ports T and N, carry in, and the lowest 3 bits of the instruction as input, and a result, carry out, and test for zero as output.

```
⟨ALU ports⟩≡
  parameter l=16;
  input `L T, N;
  input c;
  input [2:0] inst;
  output `L res;
  output carry, zero;
  wire prop, andor, selr;

  assign { prop, selr, andor } = inst;
```

#### 3.4 Stacks

The stacks are modelled as block RAM in the FPGA. In an ASIC, this is implemented with latches. The block RAM (or register file) needs one read and one write port.

```
⟨Stack⟩≡
  module stack(clk, sp, spdec, push, scan, in, out);
     parameter dep=2, l=16;
     input clk, push, scan;
     input [dep-1:0] sp, spdec;
     input `L in;
     output `L out;
     wire write = push & ~clk & ~scan;
     reg `L stackmem[0:(1<<dep)-1];
  `ifndef FPGA
     // ASIC: latch-style write while the clock is low
     always @(write or spdec or in)
        if(write) stackmem[spdec] <= in;
  `else
     // FPGA: clocked write, maps onto block RAM
     always @(posedge clk)
        if(push)
            stackmem[spdec] <= in;
  `endif
     assign out = stackmem[sp];
  endmodule // stack
```

# References

[1] c18 ColorForth Compiler, Chuck Moore, 17th EuroForth Conference Proceedings, 2001