# b16 Documentation

### Bernd Paysan

February 9, 2011

# Abstract

This article presents the architecture and implementation of the b16 stack processor. The processor is inspired by Chuck Moore's newest Forth processors. The minimalistic design fits into small FPGAs and ASICs and is well suited for applications that need both control and calculation. The balance is shifted towards control to save space. The synthesizable implementation is written in Verilog.

## Introduction

Minimalistic CPUs can be used in many designs. A state machine is often too complicated and too difficult to develop once there are more than a few states. A program with subroutines can perform far more complex tasks and is easier to develop at the same time. Also, ROM and RAM blocks occupy much less silicon area than "random logic". The same holds for FPGAs, where "block RAM" is plentiful, in contrast to logic elements.

The architecture is inspired by the c18 from Chuck Moore [1]. The exact instruction mix is different; it also differs from the standard b16 core. Also, this architecture is byte-addressed.

A word about Verilog: Verilog is a C-like language, but tailored to simulating logic and to writing synthesizable code. Variables are bits and bit vectors, and assignments are typically non-blocking, i.e. all right-hand sides are computed first, and the left-hand sides are modified afterwards. Verilog also has events, such as value changes or clock edges, and blocks can wait on them.
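The effect of non-blocking assignment is easiest to see with two registers that exchange their values. The following is a minimal simulation sketch; the module and signal names (`nonblocking_demo`, `a`, `b`) are made up for illustration and are not part of the b16 sources.

```verilog
module nonblocking_demo;
  reg       clk;
  reg [3:0] a, b;

  // Both right-hand sides are sampled before either register changes,
  // so a and b are swapped on every rising clock edge.
  always @(posedge clk) begin
    a <= b;
    b <= a;
  end

  initial begin
    clk = 0; a = 4'd1; b = 4'd2;
    #1 clk = 1;                        // rising edge: the swap happens here
    #1 $display("a=%0d b=%0d", a, b);  // prints a=2 b=1
    $finish;
  end
endmodule
```

With blocking assignments (`=`) in the same order, `a` would be overwritten first and both registers would end up holding the old value of `b`.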

# 1 Architectural Overview

The core components are

- An ALU
- A data stack with top and next of stack (T and N) as inputs for the ALU
- A return stack
- An instruction pointer P
- An address mux addr, to address external memory
- An instruction latch I

Figure 1 shows a block diagram.



Figure 1: Block Diagram

## 1.1 Registers

In addition to the standard Forth machine registers there are control registers for external RAM (rd and wr), stack pointers (sp and rp), and a carry c. For consistency with Chuck Moore's nomenclature, and in violation of most coding style guidelines, the Forth machine registers are single-letter variables in upper case. Since the source code is a LyX document, you can use the "search whole word" mode to find them easily, and they also show up at the top of the signal list during simulation.

| Name  | Function             |
|-------|----------------------|
| T     | Top of Stack         |
| I     | Instruction Bundle   |
| P     | Program Counter      |
| R     | Top of Return Stack  |
| state | Processor State      |
| sp    | Stack Pointer        |
| rp    | Return Stack Pointer |
| c     | Carry Flag           |

```
⟨register declarations⟩≡
  reg [sdep-1:0] sp;
  reg [rdep-1:0] rp;

  reg `L T, I, P, R;
  reg [1:0] state;
  reg c;
```

|    | 0   | 1    | 2    | 3   | 4   | 5   | 6  | 7                    | Comment    |
|----|-----|------|------|-----|-----|-----|----|----------------------|------------|
| 0  | nop | call | jmp  | ret | jz  | jnz | jc | jnc                  |            |
|    |     | exec | goto | ret | gz  | gnz | gc | gnc                  | for slot 3 |
| 8  | xor | com  | and  | or  | +   | +c  | *+ | /-                   |            |
| 10 | !+  | @+   | @    | lit | c!+ | c@+ | c@ | litc                 |            |
|    | !.  | @.   | @    | lit | c!. | c@. | c@ | litc                 | for slot 1 |
| 18 | nip | drop | over | dup | >r  |     | r> |                      |            |

Table 1: Instruction Set

# 2 Instruction Set

There are 32 different instructions. Since several instructions fit into a 16 bit word, we call the bits to store the packed instructions in an instruction word "slot", and the instruction word itself "bundle". The arrangement here is 1,5,5,5, i.e. the first slot is only one bit large (the more significant bits are filled with 0), and the others all 5 bits.
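To make the 1,5,5,5 arrangement concrete, here is a small sketch that splits a bundle into its four slots. It assumes that slot 0 is the most significant bit and that the remaining slots follow in 5-bit groups towards bit 0; the names `bundle` and `slot0` to `slot3` are hypothetical and do not appear in the b16 sources.

```verilog
module bundle_slots_demo;
  reg  [15:0] bundle;

  // Assumed layout: slot 0 is bit 15, slots 1..3 follow in 5-bit groups.
  wire [4:0] slot0 = { 4'b0000, bundle[15] };  // upper bits filled with 0
  wire [4:0] slot1 = bundle[14:10];
  wire [4:0] slot2 = bundle[9:5];
  wire [4:0] slot3 = bundle[4:0];

  initial begin
    // nop, dup, +, ret (opcodes taken from Table 1)
    bundle = { 1'b0, 5'b11011, 5'b01100, 5'b00011 };
    #1 $display("%b %b %b %b", slot0, slot1, slot2, slot3);
    // prints 00000 11011 01100 00011
  end
endmodule
```

Because the first slot has only one bit, it can only encode the instructions with opcode 0 or 1, which Table 1 lists as nop and call.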

The operations in one instruction word are executed one after the other. Each instruction takes one cycle; memory operations (including instruction fetch) need another cycle. The variable state stores which instruction of the bundle is to be executed.

The instruction set is divided into four groups: jumps, ALU, memory, and stack. Table 1 shows an overview over the instruction set. Note: Some special characters indicate functions as follows:

- ! "store"
- @ "load"
- > "to" if before, "from" if afterwards

Operations will be described using a "stack effect". This is a template for the stack elements before and after the operation, separated by a long dash. The names are listed in the order bottom to top, unchanged stack elements below are not listed.

Jumps use the rest of the instruction word as target address (except ret). The lower bits of the instruction pointer P are replaced, there's nothing added. For instructions in the last slot, no address remains, so they use T (TOS) as target.

The instructions themselves are executed depending on inst:

## 2.1 Jumps

In detail, jumps are performed as follows: the target address is stored in the address latch addr, which addresses memory, not in the P register. The register P will be set to the incremented value of addr, after the instruction fetch cycle. Apart from call, jmp and ret there are conditional jumps, which test for 0 and carry. The lowest bit of the return stack is used to save the carry flag across calls. Conditional instructions don't consume the tested value, which is different from Forth.

To make it easier to understand, I also define the effect of an instruction in a pseudo language. Every instruction has a stack effect (before—after) with top of stack on the right, "r:" prefix indicating return stack, and register assignments:

```
nop  ( — )
call ( — r:P )  P ← jmp; c ← 0
jmp  ( — )      P ← jmp
ret  ( r:a — )  P ← a ∧ $FFFE; c ← a ∧ 1
jz   ( n — )    if (n = 0) P ← jmp
jnz  ( n — )    if (n ≠ 0) P ← jmp
jc   ( x — )    if (c) P ← jmp
jnc  ( x — )    if (c = 0) P ← jmp
```

```
⟨control flow⟩≡
  5'b00001: begin // call
     rp <= rpdec;
     R <= { ~|state ? incaddr[15:1] : P[15:1], c };
     P <= jmp;
     c <= 1'b0;
     if(state == 2'b11) `DROP;
  end // case: 5'b00001
  5'b00010: begin // jmp
     P <= jmp;
     if(state == 2'b11) `DROP;
  end
  5'b00011: // ret
     { rp, c, P, R } <=
     { rpinc, R[0], R[l-1:1], 1'b0, toR };
  5'b00100, 5'b00101, 5'b00110, 5'b00111:
  begin // conditional jmps
     if((inst[1] ? c : zero) ^ inst[0])
        P <= jmp;
     `DROP;
  end
```
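The way call and ret carry the flag in bit 0 of the return address can be tried out in isolation. The sketch below is a simplification: it uses plain blocking assignments, leaves out the return stack memory and the incaddr case, and the module name and jump target are made up.

```verilog
module call_ret_carry_demo;
  reg [15:0] P, R;
  reg        c;

  initial begin
    P = 16'h1234; c = 1'b1;

    // call: save the return address with the carry in bit 0, clear the carry
    R = { P[15:1], c };
    P = 16'h0100;                 // arbitrary jump target (assumption)
    c = 1'b0;

    // ret: restore P with bit 0 cleared, and the carry from bit 0
    { c, P } = { R[0], R[15:1], 1'b0 };

    $display("P=%h c=%b", P, c);  // prints P=1234 c=1
  end
endmodule
```

This only works because instruction addresses are even, so bit 0 of a return address is free to hold the flag.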

## 2.2 ALU Operations

The ALU instructions use the ALU, which computes a result res and a carry bit from T and N. The instruction com is an exception, since it only inverts T—that doesn't require an ALU.

Ordinary ALU instructions just write the result of the ALU into T and c, and reload N.

```
xor ( a b — r )    r ← a ⊕ b
com ( a — r )      r ← a ⊕ $FFFF; c ← 1
and ( a b — r )    r ← a ∧ b
or  ( a b — r )    r ← a ∨ b
+   ( a b — r )    c,r ← a + b
+c  ( a b — r )    c,r ← a + b + c
*+  ( a b — a r )  if(c) c_n,r ← a + b else c_n,r ← 0,b; r,R,c ← c_n,r,R
/-  ( a b — a r )  c_n,r_n ← a + b + 1; if(c ∨ c_n) r ← r_n; c,r,R ← r,R,c ∨ c_n
```

```
⟨ALU operations⟩≡
  5'b01001: // com
     { c, T } <= { 1'b1, ~T };
  5'b01110: // *+
     { T, R, c } <=
     { c ? { carry, res } : { 1'b0, T }, R };
  5'b01111: // /-
     { c, T, R } <=
     { (c | carry) ? res : T, R, (c | carry) };
  5'b01000, 5'b01010, 5'b01011, 5'b01100, 5'b01101:
     // xor, and, or, +, +c
     { sp, c, T } <= { spinc, carry, res };
```

## 2.3 Memory Instructions

Memory instructions use either T as address and N as data (source or destination), or P as address and T as destination (literals). The address is auto-incremented, except for instructions in the first slot that use T as address; this allows read-modify-write sequences. In the assembler, the non-incrementing forms are written as @. or !., and the don't-care forms as @\* or !\*.

!+ ( n A—A' ) $mem[A] \leftarrow n$; $A' \leftarrow A + 2$  
@+ ( A—n A' ) $n \leftarrow mem[A]$; $A' \leftarrow A + 2$  
@ ( A—n ) $n \leftarrow mem[A]$  
lit ( —n ) $n \leftarrow mem[P]$; $P \leftarrow P + 2$  
c!+ ( c A—A' ) $mem.b[A] \leftarrow c$; $A' \leftarrow A + 1$  
c@+ ( A—c A' ) $c \leftarrow mem.b[A]$; $A' \leftarrow A + 1$  
c@ ( A—c ) $c \leftarrow mem.b[A]$

```
⟨address handling⟩≡
  wire `L incaddr, dataw, datas;
  wire tos2r, tos2n;
  wire incby, bswap, addrsel, access, rd;
  wire [1:0] wr;
  assign incby = (rwinst[4:2] != 3'b101);
  assign access = (rwinst[4:3] == 2'b10);
  assign addrsel = rd ?
        (access & (rwinst[1:0] != 2'b11)) : |wr;
  assign rd = (state==2'b00) ||
              (access && (rwinst[1:0]!=2'b00));
  assign wr = (access && (rwinst[1:0]==2'b00)) ?
              { ~rwinst[2] | ~T[0],
                ~rwinst[2] | T[0] } : 2'b00;
  assign addr = addrsel ? T : P;
  assign incaddr = addr + incby + 1;
  assign tos2n = (!rd | (rwinst[1:0] == 2'b11));
  assign toN = tos2n ? T : dataw;
  assign bswap = ~incby ^ addr[0];
  assign datas = bswap ? { data[7:0], data[l-1:8] }
                       : data;
  assign dataw = incby ? datas
                       : { 8'h00, datas[7:0] };
  assign dataout = bswap ? { N[7:0], N[l-1:8] }
                         : N;
```

Memory can be accessed not only word-wise but also byte-wise, so there are two write lines. For a byte-wise store, the lower byte of T is copied to the higher one.
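The read side of byte access can be tried out on its own. The sketch below copies the three assignments for bswap, datas and dataw from the address handling chunk, with the word size fixed at 16 bits; the test values and the module name are made up.

```verilog
module byte_load_demo;
  reg  [15:0] data;      // word delivered by memory
  reg  [15:0] addr;
  reg         incby;     // 1: word access, 0: byte access

  wire        bswap = ~incby ^ addr[0];
  wire [15:0] datas = bswap ? { data[7:0], data[15:8] } : data;
  wire [15:0] dataw = incby ? datas : { 8'h00, datas[7:0] };

  initial begin
    data = 16'h12AB; incby = 1'b0;

    addr = 16'h0000;     // even address selects the high byte, 8'h12
    #1 $display("byte at even address: %h", dataw);

    addr = 16'h0001;     // odd address selects the low byte, 8'hAB
    #1 $display("byte at odd address:  %h", dataw);
  end
endmodule
```

In both cases the selected byte ends up zero-extended in the lower half of dataw, with the even address holding the more significant byte.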

Memory accesses need an extra cycle. Here the result of the memory access is handled.

After the access is completed, the result for a load has to be pushed on the stack, or into the instruction register; for stores, the TOS is to be dropped.

```
⟨store afterwork⟩≡
  if(rd && { inst[4:3], inst[1:0] } != 4'b1010)
     sp <= spdec;
  if(|wr) sp <= spinc;
```

Furthermore, the incremented address may go back to the program pointer.

```
⟨pointer increment⟩≡
  if(~|state ||
     ({ inst[4:3], inst[1:0] } == 4'b1011))
     P <= incaddr;
```

To shortcut a **nop** in the first instruction, there's some special logic. That's the second part of NEXT.

```
⟨ifetch⟩≡
  I <= data;
  if(!data[15]) state[1:0] <= 2'b01;
```

### 2.3.1 Peripherals

Peripherals should only use address bits [15:1], read a whole word, and select the bytes written to based on the two write bits (bit 1 for most significant byte, bit 0 for least significant byte).
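A peripheral that follows this rule could look like the small RAM model below. This is only a sketch under assumptions: the module, the eight-word memory and the `store` task are invented for the example, and only the meaning of the two write bits is taken from the text above.

```verilog
module byte_lane_ram_demo;
  reg         clk;
  reg  [1:0]  wr;           // wr[1]: high byte, wr[0]: low byte
  reg  [15:0] addr, dataout;
  reg  [15:0] mem [0:7];    // 8 words, indexed by addr[3:1]

  // Address bit 0 is ignored; the write bits select the byte lanes.
  always @(posedge clk) begin
    if (wr[1]) mem[addr[3:1]][15:8] <= dataout[15:8];
    if (wr[0]) mem[addr[3:1]][7:0]  <= dataout[7:0];
  end

  task store(input [15:0] a, input [15:0] d, input [1:0] w);
    begin
      addr = a; dataout = d; wr = w;
      #1 clk = 1;
      #1 clk = 0;
      wr = 2'b00;
    end
  endtask

  initial begin
    clk = 0; wr = 2'b00;
    store(16'h0002, 16'h1234, 2'b11);  // word store
    store(16'h0003, 16'hABAB, 2'b01);  // byte store, low lane only
    if (mem[1] !== 16'h12AB) $display("FAIL: %h", mem[1]);
    else                     $display("ok");
  end
endmodule
```

The second store leaves the high byte from the first store untouched, so the word ends up as 16'h12AB.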

## 2.4 Stack Instructions

Stack instructions change the stack pointer and move values into and out of latches. Among the six stack operations, swap is notably missing; instead, there's nip. The reason is a possible implementation option: it's possible to omit N and fetch this value directly out of the stack RAM. This consumes more time, but saves space.

```
nip  ( a b — b )
drop ( a — )
over ( a b — a b a )
dup  ( a — a a )
>r   ( a — r:a )
r>   ( r:a — a )

⟨stack operations⟩≡
  5'b11000: sp <= spinc;                 // nip
  5'b11001: `DROP;                       // drop
  5'b11010: { sp, T } <= { spdec, N };   // over
  5'b11011: sp <= spdec;                 // dup
  5'b11100: begin                        // >r
     R <= T; rp <= rpdec; `DROP;
  end // case: 5'b11100
  5'b11110: begin                        // r>
     { sp, T, R } <= { spdec, R, toR };
     rp <= rpinc;
  end // case: 5'b11110
  default ;                              // noop
```
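The four pure data stack operations can be exercised with a reduced model that keeps only T, sp and the stack RAM, with N read straight from the RAM. This is a sketch, not the b16 data path: the 3-bit stack pointer, the module name and the `step` task are assumptions, while the register transfers and the push condition on bit 1 of the opcode follow the code above.

```verilog
module mini_dstack_demo;
  reg         clk;
  reg  [1:0]  op;            // inst[1:0]: 0 nip, 1 drop, 2 over, 3 dup
  reg  [15:0] T;
  reg  [2:0]  sp;
  reg  [15:0] stackmem [0:7];
  wire [15:0] N     = stackmem[sp];
  wire [2:0]  spinc = sp + 3'd1;
  wire [2:0]  spdec = sp - 3'd1;

  always @(posedge clk) begin
    if (op[1]) stackmem[spdec] <= T;      // over and dup push the old T
    case (op)
      2'd0: sp <= spinc;                  // nip
      2'd1: { sp, T } <= { spinc, N };    // drop
      2'd2: { sp, T } <= { spdec, N };    // over
      2'd3: sp <= spdec;                  // dup
    endcase
  end

  task step(input [1:0] o);
    begin
      op = o;
      #1 clk = 1;
      #1 clk = 0;
      $display("op=%0d T=%0d N=%0d", o, T, N);
    end
  endtask

  initial begin
    clk = 0; sp = 3'd0; T = 16'd5; stackmem[0] = 16'd3;  // stack: 3 5
    step(2'd2);   // over: 3 5 3
    step(2'd1);   // drop: 3 5
    step(2'd3);   // dup:  3 5 5
    step(2'd0);   // nip:  3 5
  end
endmodule
```

Note that nip needs no data movement at all here: incrementing sp is enough, because N is whatever the stack RAM delivers at sp.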

# 3 The Rest of the Implementation

First come the implementation file(s) with comments and modules. You can either have everything in one file (b16.v), or each module in a file with the same name as the module; the defines then go to b16-defines.v, so they can be changed in one central place.

```
⟨header⟩≡
  /*
   * b16 core: 16 bits,
   * inspired by c18 core from Chuck Moore
   * (c) 2002-2011 by Bernd Paysan
   ⟨gpl-header⟩
⟨defines⟩≡
  `define L [l-1:0]
  `define DROP { sp, T } <= { spinc, N }
  `define DEBUGGING
  `define FPGA
  // `define BUSTRI
⟨b16.v⟩≡
  ⟨header⟩
  /*
  ⟨inst-comment⟩
  */
  ⟨defines⟩
  ⟨ALU⟩
  ⟨latchen⟩
  ⟨Stack⟩
  ⟨cpu⟩
  ⟨debugger⟩
⟨b16-defines.v⟩≡
  ⟨defines⟩
⟨alu.v⟩≡
  ⟨header⟩
  `include "b16-defines.v"
  ⟨ALU⟩
⟨stack.v⟩≡
  ⟨header⟩
  `include "b16-defines.v"
  ⟨Stack⟩
⟨latchen.v⟩≡
  ⟨header⟩
  `include "b16-defines.v"
  ⟨latchen⟩
```

```
⟨cpu.v⟩≡
  ⟨header⟩
  /*
  ⟨inst-comment⟩
  */
  `include "b16-defines.v"
  ⟨cpu⟩
⟨debugging.v⟩≡
  ⟨header⟩
  `include "b16-defines.v"
  ⟨debugger⟩
⟨gpl-header⟩≡
```

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

This is not the source code of the program, the source literate programming style article.

```
⟨inst-comment⟩≡
   * Instruction set:
   * 1, 5, 5, 5 bits
   *     0    1    2    3    4    5    6    7
   *  0: nop  call jmp  ret  jz   jnz  jc   jnc
   *  /3      exec goto ret  gz   gnz  gc   gnc
   *  8: xor  com  and  or   +    +c   *+   /-
   * 10: !+   @+   @    lit  c!+  c@+  c@   litc
   *  /1 !.   @.   @    lit  c!.  c@.  c@   litc
   * 18: nip  drop over dup  >r        r>
```

## 3.1 Top Level

The CPU consists of several parts, which are all implemented in the same Verilog module.

```
⟨cpu⟩≡
  module cpu(clk, run, nreset, addr, rd, wr, data,
             dataout, scanning, atpg
  `ifdef DEBUGGING
             , dr, dw, daddr, din, dout, bp
  `endif
             );
     ⟨port declarations⟩
     ⟨register declarations⟩
     ⟨instruction selection⟩
     ⟨ALU instantiation⟩
     ⟨address handling⟩
     ⟨stack pushs⟩
     ⟨stack instantiation⟩
     ⟨state changes⟩
     ⟨debugging read⟩
     always @(posedge clk or negedge nreset)
        ⟨register updates⟩
  endmodule // cpu
```

First, Verilog needs port declarations, so that it knows what is input and output. The parameters are used to configure other word sizes and stack depths. The CPU is not fully scalable, e.g. the instruction decoder and the byte swap operation for byte access depend on the 16 bit word size. Those parts of the CPU that are scalable can be scaled by changing the parameters; the others need manual intervention.

Since the stacks work in parallel, we have to calculate when a value is pushed onto the stack (thus **only** if something is stored there).

```
⟨stack pushs⟩≡
  reg dpush, rpush;
  always @(state or inst or rd or run ⟨dbg senselist⟩)
  begin
     rpush = 1'b0;
     dpush = (|state[1:0] & rd) |
             (inst[4] && inst[3] && inst[1]);
     case(inst)
       5'b00001: rpush = |state[1:0] | run;
       5'b11100: rpush = 1'b1;
       default ;
     endcase // case(inst)
     ⟨stack debugging⟩
  end
```

The stacks consist not only of the two stack modules, but also need an incremented and a decremented stack pointer. The return stack also allows writing the top of return stack without changing the return stack depth.

```
⟨stack instantiation⟩≡
  wire [sdep-1:0] spdec, spinc;
  wire [rdep-1:0] rpdec, rpinc;
  stack #(sdep,l) dstack(.clk(clk),
                         .sp(sp),
                         .spdec(spdec),
                         .push(dpush),
                         .in(toN),
                         .out(N),
                         .scan(scanning));
  stack #(rdep,l) rstack(.clk(clk),
                         .sp(rp),
                         .spdec(rpdec),
                         .push(rpush),
                         .in(R),
                         .out(toR),
                         .scan(scanning));
  assign spdec = sp-{{(sdep-1){1'b0}}, 1'b1};
  assign spinc = sp+{{(sdep-1){1'b0}}, 1'b1};
  assign rpdec = rp-{{(rdep-1){1'b0}}, 1'b1};
  assign rpinc = rp+{{(rdep-1){1'b0}}, 1'b1};
```
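The replication expression in the last four assignments builds a constant 1 that is exactly as wide as the stack pointer, so the arithmetic wraps around at the stack depth. A quick way to check this is the following sketch, where the depth of 3 bits and the module name are chosen arbitrarily:

```verilog
module sp_incdec_demo;
  parameter sdep = 3;
  reg  [sdep-1:0] sp;
  wire [sdep-1:0] spdec = sp - {{(sdep-1){1'b0}}, 1'b1};
  wire [sdep-1:0] spinc = sp + {{(sdep-1){1'b0}}, 1'b1};

  initial begin
    sp = 0;
    #1 $display("sp=%0d spdec=%0d spinc=%0d", sp, spdec, spinc);
    // prints sp=0 spdec=7 spinc=1
  end
endmodule
```

Since a push writes to spdec, the first push after reset lands in the highest stack RAM entry; the stack grows downwards and wraps around.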

The basic core is the fully synchronous register update. Each register needs a reset value, and depending on the state transition, the corresponding assignments have to be coded. Most of that has been shown above; only the instruction fetch and the assignment of the next value of incby remain to be done.

```
⟨register updates⟩≡
  if(!nreset) begin
     ⟨resets⟩
  end else if(run) begin
  `ifdef REPORT_VERBOSE
     if(show) begin
        ⟨debug⟩
     end
  `endif
     ⟨load-store⟩
     state <= nextstate;
     ⟨instructions⟩
  end else begin // debug
     ⟨debugging⟩
  end // else: !if(nreset)
```

As reset value, we initialize the CPU so that it is about to fetch the next instruction from address 0. The stacks are all empty, the registers contain all zeros.

```
⟨resets⟩≡
  state <= 2'b11;
  P <= rstaddr;
  T <= 16'h0000;
  I <= 16'h0000;
  R <= 16'h0000;
  c <= 1'b0;
  sp <= 0;
  rp <= 0;
```

The transition to the next state (the NEXT within a bundle) is done separately. That's necessary, since the assignments of the other variables are not just dependent on the current state, but partially also on the next state (e.g. when to fetch the next instruction word).

## 3.2 Debugging

For debugging purposes, all registers can be read and written through memory-mapped addresses. This requires an external bus master attached to the debugging interface. The debugging interface is enabled with the DEBUGGING flag. It's only active when the processor is stopped, so the processor itself can't access its own registers.

The debugging module offers the following registers as address space:

| Address | read         | write   |
|---------|--------------|---------|
| \$FFE0  | stack[sp++]  | push+T  |
| \$FFE2  | rstack[rp++] | rpush+R |
| \$FFE4  | bp           | bp      |
| \$FFE6  | state+stop   | state   |
| \$FFE8  | P            | P       |
| \$FFEA  | T            | T       |
| \$FFEC  | R            | R       |
| \$FFEE  | I            | I       |

The stacks and the state register change state when being read, so be careful!
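How an address from this table maps to a register number can be checked with a few lines. The select line dsel is written the same way as in the debugger module; taking the 3-bit register number daddr from address bits [3:1] is an assumption that matches the table, and the module name is made up.

```verilog
module debug_decode_demo;
  parameter l = 16, dbgaddr = 12'hFFE;
  reg  [l-1:0] addr;

  wire       dsel  = (addr[l-1:4] == dbgaddr);  // any address $FFE0..$FFEF
  wire [2:0] daddr = addr[3:1];                 // assumed register select

  initial begin
    addr = 16'hFFEA;   // T in the table above
    #1 $display("dsel=%b daddr=%0d", dsel, daddr);  // prints dsel=1 daddr=5
    addr = 16'h1000;   // ordinary memory, not a debug register
    #1 $display("dsel=%b daddr=%0d", dsel, daddr);  // prints dsel=0 daddr=0
  end
endmodule
```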

```
⟨debugger⟩≡
  `ifdef DEBUGGING
  module debugger(clk, nreset, run,
                  addr, data, r, w,
                  cpu_addr, cpu_r,
                  drun, dr, dw, bp);
  parameter l=16, dbgaddr = 12'hFFE;
  input clk, nreset, run, r, cpu_r;
  input [1:0] w;
  input [l-1:1] addr;
  input `L data, cpu_addr;
  output drun, dr, dw;
  output `L bp;
  reg drun, drun1;
  reg `L bp;
  wire dsel = (addr[l-1:4] == dbgaddr);
  assign dr = dsel & r;
  assign dw = dsel & |w;

  always @(posedge clk or negedge nreset)
  if(!nreset) begin
     drun <= 1;
     drun1 <= 1;
     bp <= 16'hffff;
  end else begin
     if(cpu_addr == bp && cpu_r)
        { drun, drun1 } <= 0;
     else if(run) drun <= drun1;
     if((dr | dw) && (addr[3:1] == 3'h3)) begin
        drun <= !dr & dw;
        drun1 <= !dr & dw & data[12];
     end
     if(dw && addr[3:1] == 3'h2) bp <= data;
  end
  endmodule
  `endif
⟨debugging⟩≡
  `ifdef DEBUGGING
  if(dw) case(daddr)
     3'h0: { sp, T } <= { spdec, din };
     3'h1: { rp, R } <= { rpdec, din };
     3'h3: { c, state, sp, rp } <=
              { din[10:8],
                din[sdep+3:4], din[rdep-1:0] };
     3'h4: P <= din;
     3'h5: T <= din;
     3'h6: R <= din;
     3'h7: I <= din;
     default ;
  endcase
  if(dr) case(daddr)
     3'h0: sp <= spinc;
     3'h1: rp <= rpinc;
     default ;
  endcase
  `endif
⟨debugging read⟩≡
  `ifdef DEBUGGING
  reg `L dout;
  always @(daddr or dr or run or P or T or R or I or
           state or sp or rp or c or N or toR or bp)
  if(!dr || run) dout = 'h0;
  else case(daddr)
     3'h0: dout = N;
     3'h1: dout = toR;
     3'h2: dout = bp;
     3'h3: dout = { run, 4'h0, c, state,
                    {4-sdep{1'b0}}, sp,
                    {4-rdep{1'b0}}, rp };
     3'h4: dout = P;
     3'h5: dout = T;
     3'h6: dout = R;
     3'h7: dout = I;
  endcase
  `endif
⟨debugging-ports⟩≡
  `ifdef DEBUGGING
     input [2:0] daddr;
     input dr, dw;
     input `L din, bp;
     output `L dout;
  `endif
⟨dbg senselist⟩≡
  `ifdef DEBUGGING
  or run or dw or daddr
  `endif
⟨stack debugging⟩≡
  `ifdef DEBUGGING
  if(!run && dw) case(daddr)
     3'h0: dpush = 1;
     3'h1: rpush = 1;
     default ;
  endcase
  `endif
```



Figure 2: ALU bit slice

## 3.3 ALU

The ALU just computes the sum with possible carry-ins, the logical operations, and a zero flag. It reuses the same logic (essentially what comprises a full adder) to do both sums and logic. Figure 2 illustrates the logic that processes one bit of the ALU operation: Two multiplexers and one full adder (or the equivalent logic) per bit is sufficient to implement an ALU. The carry works as an AND gate if the carry in is 0 (both a and b input must be 1 to create a carry out), an OR gate if the carry in is 1 (both a and b input must be 0 to not create a carry out), and the sum is an XOR of a and b without carry in, and an XNOR with carry in. The XNOR operation of the ALU is not used. When the carry is propagated, a normal sum is generated; in this case, the result r selected is always the sum.
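The claim that a full adder doubles as AND, OR, XOR and XNOR, depending on the carry in, can be checked exhaustively for a single bit. The sketch below is a standalone one-bit test with made-up names, not the ALU module itself:

```verilog
module adder_as_logic_demo;
  reg  a, b, cin;
  wire sum  = a ^ b ^ cin;
  wire cout = (a & b) | (a & cin) | (b & cin);
  integer i, errors;

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      { cin, a, b } = i;   // low three bits of i
      #1;
      // carry in 0: carry out is AND, sum is XOR
      if (!cin && (cout !== (a & b) || sum !== (a ^ b)))
        errors = errors + 1;
      // carry in 1: carry out is OR, sum is XNOR
      if (cin && (cout !== (a | b) || sum !== (a ~^ b)))
        errors = errors + 1;
    end
    $display("errors=%0d", errors);  // prints errors=0
  end
endmodule
```

In the ALU, the logic instructions force the same carry in onto every bit instead of propagating it, which is what turns the adder into a bitwise logic unit.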

```
⟨ALU⟩≡
  module alu(res, carry, zero, T, N, c, inst);
     ⟨ALU ports⟩
     wire `L r1, r2;
     wire [l:0] carries;

     assign r1 = T ^ N ^ carries;
     assign r2 = (T & N) |
                 (T & carries`L) |
                 (N & carries`L);
  // This generates a carry *chain*, not a loop!
     assign carries =
          prop ? { r2[l-1:0], (c | selr) & andor }
               : { c, {(l){andor}}};
     assign res = (selr & ~prop) ? r2 : r1;
     assign carry = carries[l];
     assign zero = ~|T;
  endmodule // alu
```

The ALU has ports T and N, carry in, and the lowest 3 bits of the instruction as input, a result, carry out, and test for zero as output.

```
⟨ALU ports⟩≡
  parameter l=16;
  input `L T, N;
  input c;
  input [2:0] inst;
  output `L res;
  output carry, zero;

  wire prop, andor, selr;
  assign { prop, selr, andor } = inst;
```

## 3.4 Stacks

The stacks are modelled as block RAM in the FPGA. In an ASIC, this is implemented with latches. The block RAM (or register file) needs one read and one write port.

```
⟨Stack⟩≡
  module stack(clk, sp, spdec, push, scan, in, out);
     parameter dep=2, l=16;
     input clk, push, scan;
     input [dep-1:0] sp, spdec;
     input `L in;
     output `L out;
     reg `L stackmem[0:(1<<dep)-1];
  `ifndef FPGA
     wire write;
     latchen genwrite(.clk(clk),
                      .en(push),
                      .scan(scan),
                      .out(write));
     always @(write or spdec or in)
         if(write) stackmem[spdec] <= in;
  `else
     always @(posedge clk)
         if(push)
            stackmem[spdec] <= in;
  `endif
     assign out = stackmem[sp];
  endmodule // stack
⟨latchen⟩≡
  `ifndef FPGA
  module latchen(clk, en, scan, out);
     input clk, en, scan;
     output out;
     assign out = en & ~clk & ~scan;
  endmodule
  `endif
```

# References

[1] c18 ColorForth Compiler, Chuck Moore, 17th EuroForth Conference Proceedings, 2001