§1 MMIX-PIPE INTRODUCTION 1

1. Introduction. This program is the heart of the meta-simulator for the ultra-configurable MMIX pipeline: It defines the MMIX\_run routine, which does most of the work. Another routine, MMIX\_init, is also defined here, and so is a header file called mmix-pipe.h. The header file is used by the main routine and by other routines like MMIX\_config, which are compiled separately.

Readers of this program should be familiar with the explanation of MMIX architecture as presented in the main program module for MMMIX.

A lot of subtle things can happen when instructions are executed in parallel. Therefore this simulator ranks among the most interesting and instructive programs in the author's experience. The author has tried his best to make everything correct ... but the chances for error are great. Anyone who discovers a bug is therefore urged to report it as soon as possible to knuth-bug@cs.stanford.edu; then the program will be as useful as possible. Rewards will be paid to bug-finders! (Except for bugs in version 0.)

It sort of boggles the mind when one realizes that the present program might someday be translated by a C compiler for MMIX and used to simulate *itself*.


2. This high-performance prototype of MMIX achieves its efficiency by means of "pipelining," a technique of overlapping that is explained for the related DLX computer in Chapter 3 of Hennessy & Patterson's book Computer Architecture (second edition). Other techniques such as "dynamic scheduling" and "multiple issue," explained in Chapter 4 of that book, are used too.

One good way to visualize the procedure is to imagine that somebody has organized a high-tech car repair shop according to similar principles. There are eight independent functional units, which we can think of as eight groups of auto mechanics, each specializing in a particular task; each group has its own workspace with room to deal with one car at a time. Group F (the "fetch" group) is in charge of rounding up customers and getting them to enter the assembly-line garage in an orderly fashion. Group D (the "decode and dispatch" group) does the initial vehicle inspection and writes up an order that explains what kind of servicing is required. The vehicles go next to one of the four "execution" groups: Group X handles routine maintenance, while groups XF, XM, and XD are specialists in more complex tasks that tend to take longer. (The XF people are good at floating the points, while the XM and XD groups are experts in multilink suspensions and differentials.) When the relevant X group has finished its work, cars drive to M station, where they send or receive messages and possibly pay money to members of the "memory" group. Finally all necessary parts are installed by members of group W, the "write" group, and the car leaves the shop. Everything is tightly organized so that in most cases the cars move in synchronized fashion from station to station, at regular 100-nanocentury intervals.

In a similar way, most MMIX instructions can be handled in a five-stage pipeline, F–D–X–M–W, with X replaced by XF for floating-point addition or conversion, or by XM for multiplication, or by XD for division or square root. Each stage ideally takes one clock cycle, although XF, XM, and (especially) XD are slower. If the instructions enter in a suitable pattern, we might see one instruction being fetched, another being decoded, and up to four being executed, while another is accessing memory, and yet another is finishing up by writing new information into registers; all this is going on simultaneously during one clock cycle. Pipelining with eight separate stages might therefore make the machine run up to 8 times as fast as it could if each instruction were being dealt with individually and without overlap. (Well, perfect speedup turns out to be impossible, because of the shared M and W stages; the theory of knapsack programming, to be discussed in Section 7.7 of The Art of Computer Programming, tells us that the maximal achievable speedup is at most 8 - 1/p - 1/q - 1/r when XF, XM, and XD have delays bounded by p, q, and r cycles. But we can achieve a factor of more than 7 if we are very lucky.)

Consider, for example, the ADD instruction. This instruction enters the computer's processing unit in F stage, taking only one clock cycle if it is in the cache of instructions recently seen. Then the D stage recognizes the command as an ADD and acquires the current values of \$Y and \$Z; meanwhile, of course, another instruction is being fetched by F. On the next clock cycle, the X stage adds the values together. This prepares the way for the M stage to watch for overflow and to get ready for any exceptional action that might be needed with respect to the settings of special register rA. Finally, on the fifth clock cycle, the sum is either written into \$X or the trip handler for integer overflow is invoked. Although this process has taken five clock cycles (that is, $5\upsilon$), the net increase in running time has been only $1\upsilon$.
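The arithmetic behind "five cycles of latency, one cycle of net cost" can be sketched in C. This is only an idealized model, assuming a fixed number of one-cycle stages and no stalls, cache misses, or slow XF/XM/XD units; the function names are hypothetical and do not appear in MMIX-PIPE.

```c
#include <stdio.h>

/* Cycles needed for n instructions in an ideal k-stage pipeline:
   the first instruction takes k cycles, and each later one adds 1.
   Assumes every stage takes exactly one cycle and nothing stalls. */
unsigned long ideal_pipeline_cycles(unsigned long n, unsigned k)
{
    if (n == 0) return 0;
    return (unsigned long)k + (n - 1);
}

/* Cycles for the same work when instructions do not overlap at all. */
unsigned long unpipelined_cycles(unsigned long n, unsigned k)
{
    return n * (unsigned long)k;
}

/* Print a small comparison for the five-stage F-D-X-M-W case. */
void print_pipeline_table(void)
{
    unsigned long n;
    for (n = 1; n <= 1000; n *= 10)
        printf("%4lu instructions: %5lu cycles pipelined, %5lu unpipelined\n",
               n, ideal_pipeline_cycles(n, 5), unpipelined_cycles(n, 5));
}
```

Under these assumptions a second ADD following the first finishes at cycle 6 rather than cycle 10, which is the sense in which each additional instruction costs only one cycle.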

Of course congestion can occur, inside a computer as in a repair shop. For example, auto parts might not be readily available; or a car might have to sit in D station while waiting to move to XM, thereby blocking somebody else from moving from F to D. Sometimes there won't necessarily be a steady stream of customers. In such cases the employees in some parts of the shop will occasionally be idle. But we assume that they always do their jobs as fast as possible, given the sequence of customers that they encounter. With a clever person setting up appointments—translation: with a clever programmer and/or compiler arranging MMIX instructions—the organization can often be expected to run at nearly peak capacity.

In fact, this program is designed for experiments with many kinds of pipelines, potentially using additional functional units (such as several independent X groups), and potentially fetching, dispatching, and executing several nonconflicting instructions simultaneously. Such complications make this program more difficult than a simple pipeline simulator would be, but they also make it a lot more instructive because we can get a better understanding of the issues involved if we are required to treat them in greater generality.


**3.** Here's the overall structure of the present program module.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "abstime.h"

⟨Header definitions 6⟩
⟨Type definitions 11⟩
⟨Global variables 20⟩
⟨External variables 4⟩
⟨Internal prototypes 13⟩
⟨External prototypes 9⟩
⟨Subroutines 14⟩
⟨External routines 10⟩
```

4. The identifier **Extern** is used in MMIX-PIPE to declare variables that are accessed in other modules. Actually all appearances of '**Extern**' are defined to be blank here, but '**Extern**' will become '**extern**' in the header file.

```
#define Extern /* blank for us, extern for them */
format Extern extern

⟨External variables 4⟩ ≡
Extern int verbose; /* controls the level of diagnostic output */
See also sections 29, 59, 60, 66, 69, 77, 86, 98, 115, 136, 150, 168, 207, 211, 214, 242, 247, 284, and 349.
This code is used in sections 3 and 5.
```
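The effect of the blank **Extern** can be seen in a tiny sketch. It assumes a single translation unit playing the role of the defining module; the helper `diagnostics_enabled` is hypothetical and is not part of MMIX-PIPE.

```c
/* In the defining module, Extern expands to nothing, so the next
   declaration is a definition that allocates storage. */
#define Extern
Extern int verbose;   /* becomes: int verbose; */

/* A client that includes the header would instead see
       #define Extern extern
       Extern int verbose;
   which becomes "extern int verbose;" and only declares the variable. */

/* Hypothetical helper, for illustration only. */
int diagnostics_enabled(void)
{
    return verbose != 0;
}
```

Writing each shared variable once, with the macro deciding between definition and declaration, keeps the module and its header from drifting apart.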

5. The header file repeats the basic definitions and declarations.

```
⟨mmix-pipe.h 5⟩ ≡
#define Extern extern
⟨Header definitions 6⟩
⟨Type definitions 11⟩
⟨External variables 4⟩
⟨External prototypes 9⟩
```

**6.** Subroutines of this program are declared first with a prototype, as in ANSI C, then with an old-style C function definition. The following preprocessor commands make this work correctly with both new-style and old-style compilers.

```
⟨Header definitions 6⟩ ≡
#ifdef __STDC__
#define ARGS(list) list
#else
#define ARGS(list) ()
#endif
See also sections 7, 8, 52, 57, 87, 129, and 166.
```

This code is used in sections 3 and 5.
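A short sketch shows how a prototype written with ARGS expands. The function `sum3` is a hypothetical example, and its definition is written in modern prototype style here (rather than the old style used throughout this program) so that the fragment also compiles with current compilers.

```c
/* Same macro as in section 6. */
#ifdef __STDC__
#define ARGS(list) list
#else
#define ARGS(list) ()
#endif

/* The doubled parentheses make the whole parameter list a single
   macro argument. With a standard compiler this line becomes
       int sum3 (int a, int b, int c);
   and with a pre-ANSI compiler it becomes
       int sum3 ();                                              */
int sum3 ARGS((int a, int b, int c));

/* Hypothetical example function. */
int sum3(int a, int b, int c)
{
    return a + b + c;
}
```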


7. Some of the names that are natural for this program are in conflict with library names on at least one of the host computers in the author's tests. So we bypass the library names here.

```
⟨Header definitions 6⟩ +≡
#define random my_random
#define fsqrt my_fsqrt
#define div my_div
```

**8.** The amount of verbosity depends on the following bit codes.

```
⟨Header definitions 6⟩ +≡
#define issue_bit (1 << 0)  /* show control blocks when issued, deissued, committed */
#define pipe_bit (1 << 1)  /* show the pipeline and locks on every cycle */
#define coroutine_bit (1 << 2)  /* show the coroutines when started on every cycle */
#define schedule_bit (1 << 3)  /* show the coroutines when scheduled */
#define uninit_mem_bit (1 << 4)  /* complain when reading from an uninitialized chunk of memory */
#define interactive_read_bit (1 << 5)  /* prompt user when reading from I/O location */
#define show_spec_bit (1 << 6)  /* display special read/write transactions as they happen */
#define show_pred_bit (1 << 7)  /* display branch prediction details */
#define show_wholecache_bit (1 << 8)  /* display cache blocks even when their key tag is invalid */
```
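Because each code is a distinct power of two, several kinds of output can be requested at once by OR-ing the codes together, and tested with a mask. The sketch below repeats four of the definitions so that it stands alone; `wants_cycle_banner` is a hypothetical helper that applies the same mask MMIX\_run uses before printing its cycle banner.

```c
#define issue_bit      (1 << 0)
#define pipe_bit       (1 << 1)
#define coroutine_bit  (1 << 2)
#define schedule_bit   (1 << 3)

/* Nonzero when any of the four per-cycle diagnostics is switched on.
   Hypothetical helper, for illustration only. */
int wants_cycle_banner(int verbose)
{
    return (verbose & (issue_bit | pipe_bit | coroutine_bit | schedule_bit)) != 0;
}
```

For instance, a caller might set `verbose = pipe_bit | schedule_bit` to watch the pipeline and the scheduler together while leaving the other diagnostics silent.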

**9.** The *MMIX\_init()* routine should be called exactly once, after *MMIX\_config()* has done its work but before the simulator starts to execute any programs. Then *MMIX\_run* can be called as often as the user likes.

```
⟨ External prototypes 9⟩ ≡
Extern void MMIX_init ARGS((void));
Extern void MMIX_run ARGS((int cycs, octa breakpoint));
See also sections 38, 161, 175, 178, 180, 209, 212, and 252.
This code is used in sections 3 and 5.
```

```
10. ⟨External routines 10⟩ ≡
  void MMIX_init()
  {
    register int i, j;
    ⟨Initialize everything 22⟩;
  }

  void MMIX_run(cycs, breakpoint)
    int cycs;
    octa breakpoint;
  {
    ⟨Local variables 12⟩;
    while (cycs) {
      if (verbose & (issue_bit | pipe_bit | coroutine_bit | schedule_bit))
        printf("*** Cycle %d\n", ticks.l);
      ⟨Perform one machine cycle 64⟩;
      if (verbose & pipe_bit) {
        print_pipe(); print_locks();
      }
      if (breakpoint_hit || halted) {
        if (breakpoint_hit)
          printf("Breakpoint instruction fetched at time %d\n", ticks.l - 1);
        if (halted) printf("Halted at time %d\n", ticks.l - 1);
        break;
      }
      cycs--;
    }
  cease:;
  }
See also sections 39, 162, 176, 179, 181, 210, 213, and 253.
This code is used in section 3.

11. ⟨Type definitions 11⟩ ≡
  typedef enum {
    false, true, wow  /* slightly extended booleans */
  } bool;
See also sections 17, 23, 37, 40, 44, 68, 76, 164, 167, 206, 246, and 371.
This code is used in sections 3 and 5.

12. ⟨Local variables 12⟩ ≡
  register int i, j, m;
  bool breakpoint_hit = false;
  bool halted = false;
See also sections 124 and 258.
This code is used in section 10.
```


13. Error messages that abort this program are called panic messages. The macro called *confusion* will never be needed unless this program is internally inconsistent.

```
#define errprint0(f) fprintf(stderr, f)
#define errprint1(f, a) fprintf(stderr, f, a)
#define errprint2(f, a, b) fprintf(stderr, f, a, b)
#define panic(x) { errprint0("Panic: "); x; errprint0("!\n"); expire(); }
#define confusion(m) errprint1("This can't happen: %s", m)

⟨Internal prototypes 13⟩ ≡
  static void expire ARGS((void));
See also sections 18, 24, 27, 30, 32, 34, 42, 45, 55, 62, 72, 90, 92, 94, 96, 156, 158, 169, 171, 173, 182, 184, 186, 188, 190, 192,
     195, 198, 200, 202, 204, 240, 250, 254, and 377.
This code is used in section 3.

14. ⟨Subroutines 14⟩ ≡
  static void expire()  /* the last gasp before dying */
  {
    if (ticks.h) errprint2("(Clock time is %dH+%d.)\n", ticks.h, ticks.l);
    else errprint1("(Clock time is %d.)\n", ticks.l);
    exit(-2);
  }
See also sections 19, 21, 25, 28, 31, 33, 35, 43, 46, 56, 63, 73, 91, 93, 95, 97, 157, 159, 170, 172, 174, 183, 185, 187, 189, 191,
     193, 196, 199, 201, 203, 205, 208, 241, 251, 255, 378, 379, 381, 384, and 387.
This code is used in section 3.
```

15. The data structures of this program are not precisely equivalent to logical gates that could be implemented directly in silicon; we will use data structures and algorithms appropriate to the C programming language. For example, we'll use pointers and arrays, instead of buses and ports and latches. However, the net effect of our data structures and algorithms is intended to be equivalent to the net effect of a silicon implementation. The methods used below are essentially equivalent to those used in real machines today, except that diagnostic facilities are added so that we can readily watch what is happening.

Each functional unit in the MMIX pipeline is programmed here as a coroutine in C. At every clock cycle, we will call on each active coroutine to do one phase of its operation; in terms of the repair-station analogy described in the main program, this corresponds to getting each group of auto mechanics to do one unit of operation on a car. The coroutines are performed sequentially, although a real pipeline would have them act in parallel. We will not "cheat" by letting one coroutine access a value early in its cycle that another one computes late in its cycle, unless computer hardware could "cheat" in an equivalent way.


16. Low-level routines. Where should we begin? It is tempting to start with a global view of the simulator and then to break it down into component parts. But that task is too daunting, because there are so many unknowns about what basic ingredients ought to be combined when we construct the larger components. So let us look first at the primitive operations on which the superstructure will be built. Once we have created some infrastructure, we'll be able to proceed with confidence to the larger tasks ahead.

17. This program for the 64-bit MMIX architecture is based on 32-bit integer arithmetic, because nearly every computer available to the author at the time of writing (1998–1999) was limited in that way. Details of the basic arithmetic appear in a separate program module called MMIX-ARITH, because the same routines are needed also for the assembler and for the non-pipelined simulator. The definition of type **tetra** should be changed, if necessary, to conform with the definitions found there.

```
⟨Type definitions 11⟩ +≡
  typedef unsigned int tetra;  /* for systems conforming to the LP-64 data model */
  typedef struct {
    tetra h, l;
  } octa;  /* two tetrabytes make one octabyte */

18. ⟨Internal prototypes 13⟩ +≡
  static void print_octa ARGS((octa));

19. ⟨Subroutines 14⟩ +≡
  static void print_octa(o)
    octa o;
  {
    if (o.h) printf("%x%08x", o.h, o.l);
    else printf("%x", o.l);
  }

20. ⟨Global variables 20⟩ ≡
  extern octa zero_octa;  /* zero_octa.h = zero_octa.l = 0 */
  extern octa neg_one;  /* neg_one.h = neg_one.l = -1 */
  extern octa aux;  /* auxiliary output of a subroutine */
  extern bool overflow;  /* set by certain subroutines for signed arithmetic */
  extern int exceptions;  /* bits set by floating point operations */
  extern int cur_round;  /* the current rounding mode */
See also sections 36, 41, 48, 50, 51, 53, 54, 65, 70, 78, 83, 88, 99, 107, 127, 148, 154, 194, 230, 235, 238, 248, 285, 303, 305, 315,
    374, 376, and 388.
```

This code is used in section 3.


**21.** Most of the subroutines in MMIX-ARITH return an octabyte as a function of two octabytes; for example, oplus(y, z) returns the sum of octabytes y and z. Multiplication returns the high half of a product in the global variable aux; division returns the remainder in aux.

```
⟨Subroutines 14⟩ +≡
  extern octa oplus ARGS((octa y, octa z));  /* unsigned y + z */
  extern octa ominus ARGS((octa y, octa z));  /* unsigned y - z */
  extern octa incr ARGS((octa y, int delta));  /* unsigned y + δ (δ is signed) */
  extern octa oand ARGS((octa y, octa z));  /* y ∧ z */
  extern octa oandn ARGS((octa y, octa z));  /* y ∧ z̄ */
  extern octa shift_left ARGS((octa y, int s));  /* y << s, 0 ≤ s ≤ 64 */
  extern octa shift_right ARGS((octa y, int s, int uns));  /* y >> s, signed if !uns */
  extern octa omult ARGS((octa y, octa z));  /* unsigned (aux, x) = y × z */
  extern octa signed_omult ARGS((octa y, octa z));  /* signed x = y × z, setting overflow */
  extern octa odiv ARGS((octa x, octa y, octa z));  /* unsigned (x, y)/z; aux = (x, y) mod z */
  extern octa signed_odiv ARGS((octa y, octa z));  /* signed y/z, when z ≠ 0; aux = y mod z */
  extern int count_bits ARGS((tetra z));  /* x = ν(z) */
  extern tetra byte_diff ARGS((tetra y, tetra z));  /* half of BDIF */
  extern tetra wyde_diff ARGS((tetra y, tetra z));  /* half of WDIF */
  extern octa bool_mult ARGS((octa y, octa z, bool xor));  /* MOR or MXOR */
  extern octa load_sf ARGS((tetra z));  /* load short float */
  extern tetra store_sf ARGS((octa x));  /* store short float */
  extern octa fplus ARGS((octa y, octa z));  /* floating point x = y ⊕ z */
  extern octa fmult ARGS((octa y, octa z));  /* floating point x = y ⊗ z */
  extern octa fdivide ARGS((octa y, octa z));  /* floating point x = y ⊘ z */
  extern octa froot ARGS((octa, int));  /* floating point x = √z */
  extern octa fremstep ARGS((octa y, octa z, int delta));  /* floating point x rem z = y rem z */
  extern octa fintegerize ARGS((octa z, int mode));  /* floating point x = round(z) */
  extern int fcomp ARGS((octa y, octa z));  /* -1, 0, 1, or 2 if y < z, y = z, y > z, y ∥ z */
  extern int fepscomp ARGS((octa y, octa z, octa eps, int sim));
    /* x = sim ? [y ∼ z (ε)] : [y ≈ z (ε)] */
  extern octa floatit ARGS((octa z, int mode, int unsgnd, int shrt));  /* fix to float */
  extern octa fixit ARGS((octa z, int mode));  /* float to fix */

22. We had better check that our 32-bit assumption holds.
⟨Initialize everything 22⟩ ≡
  if (shift_left(neg_one, 1).h != 0xffffffff)
    panic(errprint0("Incorrect implementation of type tetra"));
See also sections 26, 61, 71, 79, 89, 116, 128, 153, 231, 236, 249, and 286.
```

This code is used in section 10.
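To make the two-tetra representation concrete, here is a minimal sketch of 64-bit addition built from 32-bit halves. It is not the code of MMIX-ARITH; `sketch_oplus` and `tetra_is_32_bits` are hypothetical names, and the second function is only a simplified stand-in for the idea behind the check in section 22.

```c
typedef unsigned int tetra;           /* assumed to be exactly 32 bits wide */
typedef struct { tetra h, l; } octa;  /* h = high half, l = low half */

/* Hypothetical stand-in for oplus: unsigned sum of y and z, modulo 2^64. */
octa sketch_oplus(octa y, octa z)
{
    octa x;
    x.l = (y.l + z.l) & 0xffffffffu;
    /* the low half wrapped around exactly when the result is below y.l */
    x.h = (y.h + z.h + (x.l < y.l)) & 0xffffffffu;
    return x;
}

/* Returns 1 if tetra holds exactly 32 bits: an all-ones tetra must then
   equal 0xffffffff, which fails for narrower or wider types. */
int tetra_is_32_bits(void)
{
    tetra t = (tetra)-1;
    return t == 0xffffffffu;
}
```

Adding 1 to the octabyte with h = 0 and l = 0xffffffff carries into the high half and yields h = 1, l = 0.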


23. Coroutines. As stated earlier, this program can be regarded as a system of interacting coroutines. Coroutines—sometimes called threads—are more or less independent processes that share and pass data and control back and forth. They correspond to the individual workers in an organization.

We don't need the full power of recursive coroutines, in which new threads are spawned dynamically and have independent stacks for computation; we are, after all, simulating a fixed piece of hardware. The total number of coroutines we deal with is established once and for all by the MMIX\_config routine, and each coroutine has a fixed amount of local data.

The simulation operates one clock tick at a time, by executing all coroutines scheduled for time t before advancing to time t + 1. The coroutines at time t may decide to become dormant or they may reschedule themselves and/or other coroutines for future times.

Each coroutine has a symbolic *name* for diagnostic purposes (e.g., ALU1); a nonnegative *stage* number (e.g., 2 for the second stage of a pipeline); a pointer to the next coroutine scheduled at the same time (or  $\Lambda$  if the coroutine is unscheduled); a pointer to a lock variable (or  $\Lambda$  if no lock is currently relevant); and a reference to a control block containing the data to be processed.

```
⟨Type definitions 11⟩ +≡
  typedef struct coroutine_struct {
    char *name;  /* symbolic identification of a coroutine */
    int stage;  /* its rank */
    struct coroutine_struct *next;  /* its successor */
    struct coroutine_struct **lockloc;  /* what it might be locking */
    struct control_struct *ctl;  /* its data */
  } coroutine;

24. ⟨Internal prototypes 13⟩ +≡
  static void print_coroutine_id ARGS((coroutine *));
  static void errprint_coroutine_id ARGS((coroutine *));

25. ⟨Subroutines 14⟩ +≡
  static void print_coroutine_id(c)
    coroutine *c;
  {
    if (c) printf("%s:%d", c->name, c->stage);
    else printf("??");
  }

  static void errprint_coroutine_id(c)
    coroutine *c;
  {
    if (c) errprint2("%s:%d", c->name, c->stage);
    else errprint0("??");
  }
```


**26.** Coroutine control is masterminded by a ring of queues, one each for times  $t, t+1, \ldots, t+ring\_size-1$ , when t is the current clock time.

All scheduling is first-come-first-served, except that coroutines with higher *stage* numbers have priority. We want to process the later stages of a pipeline first, in this sequential implementation, for the same reason that a car must drive from M station into W station before another car can enter M station.

Each queue is a circular list of **coroutine** nodes, linked together by their *next* fields. A list head h with $stage = max\_stage$ comes at the end and the beginning of the queue. (All stage numbers of legitimate coroutines are less than $max\_stage$.) The queued items are $h \to next$, $h \to next \to next$, etc., from back to front, and we have $c \to stage \le c \to next \to stage$ unless c = h.

Initially all queues are empty.

```
⟨Initialize everything 22⟩ +≡
  {
    register coroutine *p;
    for (p = ring; p < ring + ring_size; p++) p->next = p;
  }
```

**27.** To schedule a coroutine c with positive delay  $d < ring\_size$ , we call schedule(c, d, s). (The s parameter is used only if scheduling is being logged; it does not affect the computation, but we will generally set s to the state at which the scheduled coroutine will begin.)

```
⟨Internal prototypes 13⟩ +≡
  static void schedule ARGS((coroutine *, int, int));
```

```
28. ⟨Subroutines 14⟩ +≡
  static void schedule(c, d, s)
    coroutine *c;
    int d, s;
  {
    register int tt = (cur_time + d) % ring_size;
    register coroutine *p = &ring[tt];  /* start at the list head */
    if (d <= 0 || d >= ring_size)  /* do a sanity check */
      panic(confusion("Scheduling "); errprint_coroutine_id(c);
            errprint1(" with delay %d", d));
    while (p->next->stage < c->stage) p = p->next;
    c->next = p->next;
    p->next = c;
    if (verbose & schedule_bit) {
      printf(" scheduling "); print_coroutine_id(c);
      printf(" at time %d, state %d\n", ticks.l + d, s);
    }
  }
```
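The ordered insertion performed by *schedule* can be tried out in isolation. The following sketch uses a pared-down node type and fixed constants; `co`, `RING_SIZE`, `MAX_STAGE`, `init_ring`, and `sketch_schedule` are hypothetical names chosen for this illustration, and logging and panic handling are omitted.

```c
#include <stddef.h>

#define RING_SIZE 4    /* hypothetical; the real value is set by MMIX_config */
#define MAX_STAGE 99   /* hypothetical; exceeds every real stage number */

typedef struct co {
    const char *name;
    int stage;
    struct co *next;
} co;                  /* pared-down stand-in for coroutine */

static co ring[RING_SIZE];   /* one list head per future time slot */
static int cur_time = 0;

/* Make every queue empty: each list head points to itself. */
void init_ring(void)
{
    int i;
    for (i = 0; i < RING_SIZE; i++) {
        ring[i].name = "head";
        ring[i].stage = MAX_STAGE;
        ring[i].next = &ring[i];
    }
}

/* Insert c into the queue for time cur_time + d so that stage numbers
   never decrease along the next links. Returns 0 if the delay is out
   of range, 1 otherwise. */
int sketch_schedule(co *c, int d)
{
    co *p;
    if (d <= 0 || d >= RING_SIZE) return 0;
    p = &ring[(cur_time + d) % RING_SIZE];
    while (p->next->stage < c->stage) p = p->next;
    c->next = p->next;
    p->next = c;
    return 1;
}
```

Scheduling coroutines with stages 2, 0, and 1 for the same time slot leaves them linked in the order 0, 1, 2 after the list head, so the search loop always stops at the head's large stage number at the latest.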

**29.** ⟨External variables 4⟩ +≡

```
Extern int ring_size; /* set by MMIX_config, must be sufficiently large */
Extern coroutine *ring;
Extern int cur_time;
```

**30.** The all-important *ctl* field of a coroutine, which contains the data being manipulated, will be explained below. One of its key components is the *state* field, which helps to specify the next actions the coroutine will perform. When we schedule a coroutine for a new task, we often want it to begin in state 0.

```
⟨Internal prototypes 13⟩ +≡
  static void startup ARGS((coroutine *, int));
```


```
31. ⟨Subroutines 14⟩ +≡
  static void startup(c, d)
    coroutine *c;
    int d;
  {
    c->ctl->state = 0;
    schedule(c, d, 0);
  }
```

**32.** The following routine removes a coroutine from whatever queue it's in. The case $c \to next = c$ is also permitted; such a self-loop can occur when a coroutine goes to sleep and expects to be awakened (that is, scheduled) by another coroutine. Sleeping coroutines have important data in their ctl field; they are therefore quite different from unscheduled or "unemployed" coroutines, which have $c \to next = \Lambda$. An unemployed coroutine is not assumed to have any valid data in its ctl field.

```
⟨Internal prototypes 13⟩ +≡
static void unschedule ARGS((coroutine *));
```

```
33. ⟨Subroutines 14⟩ +≡
  static void unschedule(c)
    coroutine *c;
  {
    register coroutine *p;
    if (c->next) {
      for (p = c; p->next != c; p = p->next) ;
      p->next = c->next;
      c->next = NULL;
      if (verbose & schedule_bit) {
        printf("unscheduling "); print_coroutine_id(c); printf("\n");
      }
    }
  }
```

**34.** When it is time to process all coroutines that have queued up for a particular time t, we empty the queue called ring[t] and link its items in the opposite order (from front to back). The following subroutine uses the well known algorithm discussed in exercise 2.2.3–7 of The Art of Computer Programming.

```
⟨Internal prototypes 13⟩ +≡
static coroutine *queuelist ARGS((int));
```

```
35. ⟨Subroutines 14⟩ +≡
  static coroutine *queuelist(t)
    int t;
  {
    register coroutine *p, *q = &sentinel, *r;
    for (p = ring[t].next; p != &ring[t]; p = r) {
      r = p->next;
      p->next = q;
      q = p;
    }
    ring[t].next = &ring[t];
    sentinel.next = q;
    return q;
  }
```
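The reversal step can be exercised on its own with a small sketch. It assumes a pared-down node type and takes the list head as a parameter instead of indexing a global ring; `node` and `sketch_queuelist` are hypothetical names used only here.

```c
#include <stddef.h>

typedef struct node {
    int stage;
    struct node *next;
} node;                   /* pared-down stand-in for coroutine */

static node sentinel;     /* marks the end of the reversed list */

/* Detach every item from the circular list headed by h and return the
   items in the opposite order, terminated by &sentinel. */
node *sketch_queuelist(node *h)
{
    node *p, *q = &sentinel, *r;
    for (p = h->next; p != h; p = r) {
        r = p->next;      /* remember the old successor */
        p->next = q;      /* link backward instead */
        q = p;
    }
    h->next = h;          /* the queue is now empty */
    sentinel.next = q;    /* sentinel records where the reversed list starts */
    return q;
}
```

If the queue held items with stages 0, 1, 2 after the head, the returned list begins with the stage-2 item and ends at the sentinel, which matches the rule that later pipeline stages are processed first.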


```
36. ⟨Global variables 20⟩ +≡
  coroutine sentinel;  /* dummy coroutine at origin of circular list */
```

**37.** Coroutines often start working on tasks that are *speculative*, in the sense that we want certain results to be ready if they prove to be useful; we understand that speculative computations might not actually be needed. Therefore a coroutine might need to be aborted before it has finished its work.

All coroutines must be written in such a way that important data structures remain intact even when the coroutine is abruptly terminated. In particular, we need to be sure that "locks" on shared resources are restored to an unlocked state when a coroutine holding the lock is aborted.

A lockvar variable is  $\Lambda$  when it is unlocked; otherwise it points to the coroutine responsible for unlocking it.

```
#define set_lock(c, l) { l = c; (c)->lockloc = &(l); }
#define release_lock(c, l) { l = NULL; (c)->lockloc = NULL; }

⟨Type definitions 11⟩ +≡
  typedef coroutine *lockvar;

38. ⟨External prototypes 9⟩ +≡
  Extern void print_locks ARGS((void));

39. ⟨External routines 10⟩ +≡
  void print_locks()
  {
    print_cache_locks(ITcache);
    print_cache_locks(DTcache);
    print_cache_locks(Icache);
    print_cache_locks(Dcache);
    print_cache_locks(Scache);
    if (mem_lock) printf("mem locked by %s:%d\n", mem_lock->name, mem_lock->stage);
    if (dispatch_lock)
      printf("dispatch locked by %s:%d\n", dispatch_lock->name, dispatch_lock->stage);
    if (wbuf_lock)
      printf("head of write buffer locked by %s:%d\n", wbuf_lock->name, wbuf_lock->stage);
    if (clean_lock) printf("cleaner locked by %s:%d\n", clean_lock->name, clean_lock->stage);
    if (speed_lock)
      printf("write buffer flush locked by %s:%d\n", speed_lock->name, speed_lock->stage);
```
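The point of the *lockloc* back-pointer is that a lock can be freed by someone who knows only the coroutine, not the lock variable. The sketch below illustrates this under simplifying assumptions: `worker` is a pared-down stand-in for **coroutine**, and `sketch_abort` is a hypothetical helper, not a routine of this program.

```c
#include <stddef.h>

typedef struct worker {
    const char *name;
    struct worker **lockloc;  /* address of the lock this worker holds, or NULL */
} worker;                     /* pared-down stand-in for coroutine */

typedef worker *lockvar;

#define set_lock(c, l)     { l = c; (c)->lockloc = &(l); }
#define release_lock(c, l) { l = NULL; (c)->lockloc = NULL; }

/* Hypothetical abort helper: free whatever lock the worker holds,
   without being told which lock variable it was. */
void sketch_abort(worker *c)
{
    if (c->lockloc) {
        *(c->lockloc) = NULL;   /* unlock the shared resource */
        c->lockloc = NULL;      /* the worker no longer holds anything */
    }
}
```

After `set_lock(&w, mem_lock)`, the variable `mem_lock` points to `w` and `w.lockloc` points back to `mem_lock`; calling `sketch_abort(&w)` then leaves both cleared, which is the invariant that abrupt termination has to preserve.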


**40.** Many of the quantities we deal with are speculative values that might not yet have been certified as part of the "real" calculation; in fact, they might not yet have been calculated.

A spec consists of a 64-bit quantity o and a pointer p to a specnode. The value o is meaningful only if the pointer p is  $\Lambda$ ; otherwise p points to a source of further information.

A **specnode** is a 64-bit quantity o together with links to other **specnode**s that are above it or below it in a doubly linked list. An additional known bit tells whether the o field has been calculated. There also is a 64-bit addr field, to identify the list and give further information. A **specnode** list keeps track of speculative values related to a specific register or to all of main memory; we will discuss such lists in detail later.

```
⟨Type definitions 11⟩ +≡
  typedef struct {
    octa o;
    struct specnode_struct *p;
  } spec;

  typedef struct specnode_struct {
    octa o;
    bool known;
    octa addr;
    struct specnode_struct *up, *down;
  } specnode;

41. ⟨Global variables 20⟩ +≡
  spec zero_spec;  /* zero_spec.o.h = zero_spec.o.l = 0 and zero_spec.p = Λ */

42. ⟨Internal prototypes 13⟩ +≡
  static void print_spec ARGS((spec));

43. ⟨Subroutines 14⟩ +≡
  static void print_spec(s)
    spec s;
  {
    if (!s.p) print_octa(s.o);
    else {
      printf(">"); print_specnode_id(s.p->addr);
    }
  }

  static void print_specnode(s)
    specnode s;
  {
    if (s.known) { print_octa(s.o); printf("!"); }
    else if (s.o.h || s.o.l) { print_octa(s.o); printf("?"); }
    else printf("?");
    print_specnode_id(s.addr);
  }
```
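The rule that o is meaningful only when p is Λ suggests how a consumer might read a speculative value. The sketch below is an assumption about such a consumer, not code from this program: the types are repeated in simplified form (with *known* as a plain int), and `sketch_spec_value` is a hypothetical helper.

```c
#include <stddef.h>

typedef unsigned int tetra;
typedef struct { tetra h, l; } octa;

typedef struct specnode_struct {
    octa o;
    int known;                          /* nonzero once o has been computed */
    octa addr;
    struct specnode_struct *up, *down;
} specnode;

typedef struct {
    octa o;
    specnode *p;
} spec;

/* Hypothetical helper: fetch the value of s if it is available now.
   Returns 1 and stores the value in *out, or returns 0 if the value
   is still being computed by whoever owns the specnode. */
int sketch_spec_value(const spec *s, octa *out)
{
    if (s->p == NULL) { *out = s->o; return 1; }     /* o itself is valid */
    if (s->p->known)  { *out = s->p->o; return 1; }  /* producer has finished */
    return 0;                                        /* must wait */
}
```

A caller that receives 0 would simply try again on a later cycle, once the producing instruction has set the *known* bit.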


**44.** The analog of an automobile in our simulator is a block of data called **control**, which represents all the relevant facts about an MMIX instruction. We can think of it as the work order attached to a car's windshield. Each group of employees updates the work order as the car moves through the shop.

A **control** record contains the original location of an instruction, and its four bytes OP X Y Z. An instruction has up to four inputs, which are **spec** records called y, z, b, and ra; it also has up to three outputs, which are **specnode** records called x, a, and rl. (We usually don't mention the special input ra or the special output rl, which refer to MMIX's internal registers rA and rL.) For example, the main inputs to a DIVU command are \$Y, \$Z, and rD; the outputs are the quotient \$X and the remainder rR. The inputs to a STO command are \$Y, \$Z, and \$X; there is one "output," and the field x.addr will be set to the physical address of the memory location corresponding to virtual address \$Y + \$Z.

Each **control** block also points to the coroutine that owns it, if any. And it has various other fields that contain other tidbits of information; for example, we have already mentioned the state field, which often governs a coroutine's actions. The i field, which contains an internal operation code number, is generally used together with state to switch between alternative computational steps. If, for example, the op field is SUB or SUBI or NEG or NEGI, the internal opcode i will be simply sub. We shall define all the fields of **control** records now and discuss them later.

An actual hardware implementation of MMIX wouldn't need all the information we are putting into a **control** block. Some of that information would typically be latched between stages of a pipeline; other portions would probably appear in so-called "rename registers." We simulate rename registers only indirectly, by counting how many registers of that kind would be in use if we were mimicking low-level hardware details more precisely. The *go* field is a **specnode** for convenience in programming, although we use only its *known* and *o* subfields. It generally contains the address of the subsequent instruction.

```
⟨Type definitions 11⟩ +≡
  ⟨Declare mmix_opcode and internal_opcode 47⟩
  typedef struct control_struct {
    octa loc;  /* virtual address where an instruction originated */
    mmix_opcode op; unsigned char xx, yy, zz;  /* the original instruction bytes */
    spec y, z, b, ra;  /* inputs */
    specnode x, a, go, rl;  /* outputs */
    coroutine *owner;  /* a coroutine whose ctl this is */
    internal_opcode i;  /* internal opcode */
    int state;  /* internal mindset */
    bool usage;  /* should rU be increased? */
    bool need_b;  /* should we stall until b.p ≡ Λ? */
    bool need_ra;  /* should we stall until ra.p ≡ Λ? */
    bool ren_x;  /* does x correspond to a rename register? */
    bool mem_x;  /* does x correspond to a memory write? */
    bool ren_a;  /* does a correspond to a rename register? */
    bool set_l;  /* does rl correspond to a new value of rL? */
    bool interim;  /* does this instruction need to be reissued on interrupt? */
    unsigned int arith_exc;  /* arithmetic exceptions for event bits of rA */
    unsigned int hist;  /* history bits for use in branch prediction */
    int denin, denout;  /* execution time penalties for denormal handling */
    octa cur_O, cur_S;  /* speculative rO and rS before this instruction */
    unsigned int interrupt;  /* does this instruction generate an interrupt? */
    void *ptr_a, *ptr_b, *ptr_c;  /* generic pointers for miscellaneous use */
  } control;
```

```
45. ⟨Internal prototypes 13⟩ +≡
  static void print_control_block ARGS((control *));
```


```
46. ⟨Subroutines 14⟩ +≡
  static void print_control_block(c)
    control *c;
  {
    octa default_go;
    if (c->loc.h || c->loc.l || c->op || c->xx || c->yy || c->zz || c->owner) {
      print_octa(c->loc);
      printf(": %02x%02x%02x%02x(%s)", c->op, c->xx, c->yy, c->zz, internal_op_name[c->i]);
    }
    if (c->usage) printf("*");
    if (c->interim) printf("+");
    if (c->y.o.h || c->y.o.l || c->y.p) { printf(" y="); print_spec(c->y); }
    if (c->z.o.h || c->z.o.l || c->z.p) { printf(" z="); print_spec(c->z); }
    if (c->b.o.h || c->b.o.l || c->b.p || c->need_b) {
      printf(" b="); print_spec(c->b);
      if (c->need_b) printf("*");
    }
    if (c->need_ra) { printf(" rA="); print_spec(c->ra); }
    if (c->ren_x || c->mem_x) { printf(" x="); print_specnode(c->x); }
    else if (c->x.o.h || c->x.o.l) {
      printf(" x="); print_octa(c->x.o); printf("%c", c->x.known ? '!' : '?');
    }
    if (c->ren_a) { printf(" a="); print_specnode(c->a); }
    if (c->set_l) { printf(" rL="); print_specnode(c->rl); }
    if (c->interrupt) { printf(" int="); print_bits(c->interrupt); }
    if (c->arith_exc) { printf(" exc="); print_bits(c->arith_exc << 8); }
    default_go = incr(c->loc, 4);
    if (c->go.o.l != default_go.l || c->go.o.h != default_go.h) {
      printf(" ->"); print_octa(c->go.o);
    }
    if (verbose & show_pred_bit) printf(" hist=%x", c->hist);
    if (c->i == pop) {
      printf(" rS="); print_octa(c->cur_S);
      printf(" rO="); print_octa(c->cur_O);
    }
    printf(" state=%d", c->state);
  }
```


**47. Lists.** Here is a (boring) list of all the MMIX opcodes, in order.

⟨Declare **mmix\_opcode** and **internal\_opcode** 47⟩ ≡ typedef enum { TRAP, FCMP, FUN, FEQL, FADD, FIX, FSUB, FIXU, FLOT, FLOTI, FLOTU, FLOTUI, SFLOT, SFLOTI, SFLOTU, SFLOTUI, FMUL, FCMPE, FUNE, FEQLE, FDIV, FSQRT, FREM, FINT, MUL, MULI, MULU, MULUI, DIV, DIVI, DIVU, DIVUI, ADD, ADDI, ADDU, ADDUI, SUB, SUBI, SUBU, SUBUI, IIADDU, IIADDUI, IVADDU, IVADDUI, VIIIADDU, VIIIADDUI, XVIADDU, XVIADDUI, CMP, CMPI, CMPU, CMPUI, NEG, NEGI, NEGU, NEGUI, SL, SLI, SLU, SLUI, SR, SRI, SRU, SRUI, BN, BNB, BZ, BZB, BP, BPB, BOD, BODB, BNN, BNNB, BNZ, BNZB, BNP, BNPB, BEV, BEVB, PBN, PBNB, PBZ, PBZB, PBP, PBPB, PBOD, PBODB, PBNN, PBNNB, PBNZ, PBNZB, PBNP, PBNPB, PBEV, PBEVB, CSN, CSNI, CSZ, CSZI, CSP, CSPI, CSOD, CSODI, CSNN, CSNNI, CSNZ, CSNZI, CSNP, CSNPI, CSEV, CSEVI, ZSN, ZSNI, ZSZ, ZSZI, ZSP, ZSPI, ZSOD, ZSODI, ZSNN, ZSNNI, ZSNZ, ZSNZI, ZSNP, ZSNPI, ZSEV, ZSEVI, LDB, LDBI, LDBU, LDBUI, LDW, LDWI, LDWU, LDWUI, LDT, LDTI, LDTU, LDTUI, LDO, LDOI, LDOU, LDOUI, LDSF, LDSFI, LDHT, LDHTI, CSWAP, CSWAPI, LDUNC, LDUNCI, LDVTS, LDVTSI, PRELD, PRELDI, PREGO, PREGOI, GO, GOI, STB, STBI, STBU, STBUI, STW, STWI, STWU, STWUI, STT, STTI, STTU, STTUI, STO, STOI, STOU, STOUI, STSF, STSFI, STHT, STHTI, STCO, STCOI, STUNC, STUNCI, SYNCD, SYNCDI, PREST, PRESTI, SYNCID, SYNCIDI, PUSHGO, PUSHGOI, OR, ORI, ORN, ORNI, NOR, NORI, XOR, XORI, AND, ANDI, ANDN, ANDNI, NAND, NANDI, NXOR, NXORI, BDIF, BDIFI, WDIF, WDIFI, TDIF, TDIFI, ODIF, ODIFI, MUX, MUXI, SADD, SADDI, MOR, MORI, MXOR, MXORI, SETH, SETMH, SETML, SETL, INCH, INCMH, INCML, INCL, ORH, ORMH, ORML, ORL, ANDNH, ANDNMH, ANDNML, ANDNL, JMP, JMPB, PUSHJ, PUSHJB, GETA, GETAB, PUT, PUTI, POP, RESUME, SAVE, UNSAVE, SYNC, SWYM, GET, TRIP } mmix\_opcode;

See also section 49.

This code is used in section 44.


```
48. ⟨Global variables 20⟩ +≡
  char *opcode_name[] = {"TRAP", "FCMP", "FUN", "FEQL", "FADD", "FIX", "FSUB", "FIXU",
  "FLOT", "FLOTI", "FLOTU", "FLOTUI", "SFLOT", "SFLOTI", "SFLOTU", "SFLOTUI",
  "FMUL", "FCMPE", "FUNE", "FEQLE", "FDIV", "FSQRT", "FREM", "FINT",
  "MUL", "MULI", "MULU", "MULUI", "DIV", "DIVI", "DIVU", "DIVUI",
  "ADD", "ADDI", "ADDU", "ADDUI", "SUB", "SUBI", "SUBU", "SUBUI",
  "2ADDU", "2ADDUI", "4ADDU", "4ADDUI", "8ADDU", "8ADDUI", "16ADDU", "16ADDUI",
  "CMP", "CMPI", "CMPU", "CMPUI", "NEG", "NEGI", "NEGU", "NEGUI",
  "SL", "SLI", "SLU", "SLUI", "SR", "SRI", "SRU", "SRUI",
  "BN", "BNB", "BZ", "BZB", "BP", "BPB", "BOD", "BODB",
  "BNN", "BNNB", "BNZ", "BNZB", "BNP", "BNPB", "BEV", "BEVB",
  "PBN", "PBNB", "PBZ", "PBZB", "PBP", "PBPB", "PBOD", "PBODB",
  "PBNN", "PBNNB", "PBNZ", "PBNZB", "PBNP", "PBNPB", "PBEV", "PBEVB",
  "CSN", "CSNI", "CSZ", "CSZI", "CSP", "CSPI", "CSOD", "CSODI",
  "CSNN", "CSNNI", "CSNZ", "CSNZI", "CSNP", "CSNPI", "CSEV", "CSEVI",
  "ZSN", "ZSNI", "ZSZ", "ZSZI", "ZSP", "ZSPI", "ZSOD", "ZSODI",
  "ZSNN", "ZSNNI", "ZSNZ", "ZSNZI", "ZSNP", "ZSNPI", "ZSEV", "ZSEVI",
  "LDB", "LDBI", "LDBU", "LDBUI", "LDW", "LDWI", "LDWU", "LDWUI",
  "LDT", "LDTI", "LDTU", "LDTUI", "LDO", "LDOI", "LDOU", "LDOUI",
  "LDSF", "LDSFI", "LDHT", "LDHTI", "CSWAP", "CSWAPI", "LDUNC", "LDUNCI",
  "LDVTS", "LDVTSI", "PRELD", "PRELDI", "PREGO", "PREGOI", "GO", "GOI",
  "STB", "STBI", "STBU", "STBUI", "STW", "STWI", "STWU", "STWUI",
  "STT", "STTI", "STTU", "STTUI", "STO", "STOI", "STOU", "STOUI",
  "STSF", "STSFI", "STHT", "STHTI", "STCO", "STCOI", "STUNC", "STUNCI",
  "SYNCD", "SYNCDI", "PREST", "PRESTI", "SYNCID", "SYNCIDI", "PUSHGO", "PUSHGOI",
  "OR", "ORI", "ORN", "ORNI", "NOR", "NORI", "XOR", "XORI",
  "AND", "ANDI", "ANDN", "ANDNI", "NAND", "NANDI", "NXOR", "NXORI",
  "BDIF", "BDIFI", "WDIF", "WDIFI", "TDIF", "TDIFI", "ODIF", "ODIFI",
  "MUX", "MUXI", "SADD", "SADDI", "MOR", "MORI", "MXOR", "MXORI",
  "SETH", "SETMH", "SETML", "SETL", "INCH", "INCMH", "INCML", "INCL",
  "ORH", "ORMH", "ORML", "ORL", "ANDNH", "ANDNMH", "ANDNML", "ANDNL",
  "JMP", "JMPB", "PUSHJ", "PUSHJB", "GETA", "GETAB", "PUT", "PUTI",
  "POP", "RESUME", "SAVE", "UNSAVE", "SYNC", "SWYM", "GET", "TRIP"};
```
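As a hedged illustration of how such a name table is typically indexed, the following sketch splits a 32-bit instruction word into its OP, X, Y, and Z bytes and looks up the name. The names `insn_fields`, `split_insn`, `names_0x20`, and `sketch_opcode_name` are hypothetical, and only the eight names for opcodes #20 through #27 are copied from the table above.

```c
#include <stdint.h>

typedef struct {
  unsigned char op, xx, yy, zz;  /* same field names as in a control record */
} insn_fields;

/* Split an instruction word; OP is the most significant byte. */
static insn_fields split_insn(uint32_t word)
{
  insn_fields f;
  f.op = (unsigned char)(word >> 24);
  f.xx = (unsigned char)((word >> 16) & 0xff);
  f.yy = (unsigned char)((word >> 8) & 0xff);
  f.zz = (unsigned char)(word & 0xff);
  return f;
}

/* Hypothetical excerpt: the names for opcodes 0x20..0x27 only. */
static const char *const names_0x20[8] = {
  "ADD", "ADDI", "ADDU", "ADDUI", "SUB", "SUBI", "SUBU", "SUBUI"
};

/* Returns the symbolic name, or "?" for opcodes outside the excerpt. */
static const char *sketch_opcode_name(unsigned char op)
{
  if (op >= 0x20 && op <= 0x27) return names_0x20[op - 0x20];
  return "?";
}
```

With the full 256-entry table, the range test disappears and the lookup is just `opcode_name[f.op]`.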


**49.** And here is a (likewise boring) list of all the internal opcodes. The smallest numbers, less than or equal to *max\_pipe\_op*, correspond to operations for which arbitrary pipeline delays can be configured with *MMIX\_config*. The largest numbers, greater than *max\_real\_command*, correspond to internally generated operations that have no official OP code; for example, there are internal operations to shift the $\gamma$ pointer in the register stack, and to compute page table entries.

```
⟨Declare mmix_opcode and internal_opcode 47⟩ +≡
#define max_pipe_op feps
#define max_real_command trip
  typedef enum {
    mul0,      /* multiplication by zero */
    mul1,      /* multiplication by 1–8 bits */
    mul2,      /* multiplication by 9–16 bits */
    mul3,      /* multiplication by 17–24 bits */
    mul4,      /* multiplication by 25–32 bits */
    mul5,      /* multiplication by 33–40 bits */
    mul6,      /* multiplication by 41–48 bits */
    mul7,      /* multiplication by 49–56 bits */
    mul8,      /* multiplication by 57–64 bits */
    div,       /* DIV[U][I] */
    sh,        /* S[L,R][U][I] */
    mux,       /* MUX[I] */
    sadd,      /* SADD[I] */
    mor,       /* M[X]OR[I] */
    fadd,      /* FADD, FSUB */
    fmul,      /* FMUL */
    fdiv,      /* FDIV */
    fsqrt,     /* FSQRT */
    fint,      /* FINT */
    fix,       /* FIX[U] */
    flot,      /* [S]FLOT[U][I] */
    feps,      /* FCMPE, FUNE, FEQLE */
    fcmp,      /* FCMP */
    funeq,     /* FUN, FEQL */
    fsub,      /* FSUB */
    frem,      /* FREM */
    mul,       /* MUL[I] */
    mulu,      /* MULU[I] */
    divu,      /* DIVU[I] */
    add,       /* ADD[I] */
    addu,      /* [2,4,8,16,]ADDU[I], INC[M][H,L] */
    sub,       /* SUB[I], NEG[I] */
    subu,      /* SUBU[I], NEGU[I] */
    set,       /* SET[M][H,L], GETA[B] */
    or,        /* OR[I], OR[M][H,L] */
    orn,       /* ORN[I] */
    nor,       /* NOR[I] */
    and,       /* AND[I] */
    andn,      /* ANDN[I], ANDN[M][H,L] */
    nand,      /* NAND[I] */
    xor,       /* XOR[I] */
    nxor,      /* NXOR[I] */
    shlu,      /* SLU[I] */
    shru,      /* SRU[I] */
    shl,       /* SL[I] */
    shr,       /* SR[I] */
    cmp,       /* CMP[I] */
    cmpu,      /* CMPU[I] */
    bdif,      /* BDIF[I] */
    wdif,      /* WDIF[I] */
    tdif,      /* TDIF[I] */
    odif,      /* ODIF[I] */
    zset,      /* ZS[N][N,Z,P][I], ZSEV[I], ZSOD[I] */
    cset,      /* CS[N][N,Z,P][I], CSEV[I], CSOD[I] */
    get,       /* GET */
    put,       /* PUT[I] */
    ld,        /* LD[B,W,T,O][U][I], LDHT[I], LDSF[I] */
    ldptp,     /* load page table pointer */
    ldpte,     /* load page table entry */
    ldunc,     /* LDUNC[I] */
    ldvts,     /* LDVTS[I] */
    preld,     /* PRELD[I] */
    prest,     /* PREST[I] */
    st,        /* STO[U][I], STCO[I], STUNC[I] */
    syncd,     /* SYNCD[I] */
    syncid,    /* SYNCID[I] */
    pst,       /* ST[B,W,T][U][I], STHT[I] */
    stunc,     /* STUNC[I], in write buffer */
    cswap,     /* CSWAP[I] */
    br,        /* B[N][N,Z,P][B] */
    pbr,       /* PB[N][N,Z,P][B] */
    pushj,     /* PUSHJ[B] */
    go,        /* GO[I] */
    prego,     /* PREGO[I] */
    pushgo,    /* PUSHGO[I] */
    pop,       /* POP */
    resume,    /* RESUME */
    save,      /* SAVE */
    unsave,    /* UNSAVE */
    sync,      /* SYNC */
    jmp,       /* JMP[B] */
    noop,      /* SWYM */
    trap,      /* TRAP */
    trip,      /* TRIP */
    incgamma,  /* increase γ pointer */
    decgamma,  /* decrease γ pointer */
    incrl,     /* increase rL and β */
    sav,       /* intermediate stage of SAVE */
    unsav,     /* intermediate stage of UNSAVE */
    resum      /* intermediate stage of RESUME */
  } internal_opcode;
```
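The two thresholds split the internal opcodes into three ranges. The following sketch shows the comparison on a reduced, hypothetical enumeration (`sk_op`, with only five representative members kept in the same relative order as the full list); the names `op_class` and `classify` are likewise assumptions made for illustration.

```c
/* Reduced stand-in for internal_opcode: a few members, same relative order. */
typedef enum {
  sk_mul0,     /* first configurable pipeline operation */
  sk_feps,     /* last configurable pipeline operation */
  sk_add,      /* an ordinary operation */
  sk_trip,     /* last operation that has an official OP code */
  sk_incgamma  /* an internally generated operation */
} sk_op;

#define sk_max_pipe_op sk_feps
#define sk_max_real_command sk_trip

typedef enum { PIPE_CONFIGURABLE, ORDINARY, INTERNAL_ONLY } op_class;

/* Classify an internal opcode by comparing it with the two thresholds. */
static op_class classify(sk_op i)
{
  if (i <= sk_max_pipe_op) return PIPE_CONFIGURABLE;
  if (i > sk_max_real_command) return INTERNAL_ONLY;
  return ORDINARY;
}
```

Because the test is a plain ordering comparison, the position of each name in the enumeration matters; reordering the list would silently change the classification.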


```
50. ⟨Global variables 20⟩ +≡
  char *internal_op_name[] = {"mul0", "mul1", "mul2", "mul3", "mul4", "mul5", "mul6", "mul7", "mul8",
       "div", "sh", "mux", "sadd", "mor", "fadd", "fmul", "fdiv", "fsqrt", "fint", "fix", "flot",
       "feps", "fcmp", "funeq", "fsub", "frem", "mul", "mulu", "divu", "add", "addu", "sub", "subu",
       "set", "or", "orn", "nor", "and", "andn", "nand", "xor", "nxor", "shlu", "shru", "shl", "shr",
       "cmp", "cmpu", "bdif", "wdif", "tdif", "odif", "zset", "cset", "get", "put", "ld", "ldptp",
       "ldpte", "ldunc", "ldvts", "preld", "prest", "st", "syncd", "syncid", "pst", "stunc", "cswap",
       "br", "pbr", "pushj", "go", "prego", "pushgo", "pop", "resume", "save", "unsave", "sync", "jmp",
       "noop", "trap", "trip", "incgamma", "decgamma", "incrl", "sav", "unsav", "resum"};
```

**51.** We need a table to convert the external opcodes to internal ones.

```
⟨Global variables 20⟩ +≡
  internal_opcode internal_op[256] = {
  trap, fcmp, funeq, funeq, fadd, fix, fsub, fix,
  flot, flot, flot, flot, flot, flot, flot, flot,
  fmul, feps, feps, feps, fdiv, fsqrt, frem, fint,
  mul, mul, mulu, mulu, div, div, divu, divu,
  add, add, addu, addu, sub, sub, subu, subu,
  addu, addu, addu, addu, addu, addu, addu, addu,
  cmp, cmp, cmpu, cmpu, sub, sub, subu, subu,
  shl, shl, shlu, shlu, shr, shr, shru, shru,
  br, br, br, br, br, br, br, br,
  br, br, br, br, br, br, br, br,
  pbr, pbr, pbr, pbr, pbr, pbr, pbr, pbr,
  pbr, pbr, pbr, pbr, pbr, pbr, pbr, pbr,
  cset, cset, cset, cset, cset, cset, cset, cset,
  cset, cset, cset, cset, cset, cset, cset, cset,
  zset, zset, zset, zset, zset, zset, zset, zset,
  zset, zset, zset, zset, zset, zset, zset, zset,
  ld, ld, ld, ld, ld, ld, ld, ld,
  ld, ld, ld, ld, ld, ld, ld, ld,
  ld, ld, ld, ld, cswap, cswap, ldunc, ldunc,
  ldvts, ldvts, preld, preld, prego, prego, go, go,
  pst, pst, pst, pst, pst, pst, pst, pst,
  pst, pst, pst, pst, st, st, st, st,
  pst, pst, pst, pst, st, st, st, st,
  syncd, syncd, prest, prest, syncid, syncid, pushgo, pushgo,
  or, or, orn, orn, nor, nor, xor, xor,
  and, and, andn, andn, nand, nand, nxor, nxor,
  bdif, bdif, wdif, wdif, tdif, tdif, odif, odif,
  mux, mux, sadd, sadd, mor, mor, mor, mor,
  set, set, set, set, addu, addu, addu, addu,
  or, or, or, or, andn, andn, andn, andn,
  jmp, jmp, pushj, pushj, set, set, put, put,
  pop, resume, save, unsave, sync, noop, get, trip};
```
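The table is many-to-one: several external opcodes share one internal opcode. A minimal sketch, assuming a reduced enumeration `sk_internal` and hypothetical `SK_`-prefixed constants for six opcode numbers, shows how SUB, SUBI, NEG, and NEGI all map to the same entry.

```c
/* Reduced stand-in for internal_opcode; sk_other is what unlisted entries get. */
typedef enum { sk_other, sk_add, sk_sub } sk_internal;

/* Numeric values of a few external opcodes (their positions in the opcode list). */
enum {
  SK_ADD = 0x20, SK_ADDI = 0x21,
  SK_SUB = 0x24, SK_SUBI = 0x25,
  SK_NEG = 0x34, SK_NEGI = 0x35
};

/* Sparse excerpt of a 256-entry conversion table, using C99 designated
   initializers; every entry not named here is zero, that is, sk_other. */
static const sk_internal sk_internal_op[256] = {
  [SK_ADD] = sk_add, [SK_ADDI] = sk_add,
  [SK_SUB] = sk_sub, [SK_SUBI] = sk_sub,
  [SK_NEG] = sk_sub, [SK_NEGI] = sk_sub
};
```

The real table lists all 256 entries positionally, so it needs no default value; the designated-initializer form is used here only to keep the excerpt short.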


52. While we're into boring lists, we might as well define all the special register numbers, together with an inverse table for use in diagnostic outputs. These codes have been designed so that special registers 0–7 are unencumbered, 8–11 can't be PUT by anybody, 12–18 can't be PUT by the user. Pipeline delays might occur when GET is applied to special registers 21–31 or when PUT is applied to special registers 15–20. The SAVE and UNSAVE commands store and restore special registers 0–6 and 23–27.

```
⟨Header definitions 6⟩ +≡
#define rA 21   /* arithmetic status register */
#define rB 0    /* bootstrap register (trip) */
#define rC 8    /* cycle counter */
#define rD 1    /* dividend register */
#define rE 2    /* epsilon register */
#define rF 22   /* failure location register */
#define rG 19   /* global threshold register */
#define rH 3    /* himult register */
#define rI 12   /* interval counter */
#define rJ 4    /* return-jump register */
#define rK 15   /* interrupt mask register */
#define rL 20   /* local threshold register */
#define rM 5    /* multiplex mask register */
#define rN 9    /* serial number */
#define rO 10   /* register stack offset */
#define rP 23   /* prediction register */
#define rQ 16   /* interrupt request register */
#define rR 6    /* remainder register */
#define rS 11   /* register stack pointer */
#define rT 13   /* trap address register */
#define rU 17   /* usage counter */
#define rV 18   /* virtual translation register */
#define rW 24   /* where-interrupted register (trip) */
#define rX 25   /* execution register (trip) */
#define rY 26   /* Y operand (trip) */
#define rZ 27   /* Z operand (trip) */
#define rBB 7   /* bootstrap register (trap) */
#define rTT 14  /* dynamic trap address register */
#define rWW 28  /* where-interrupted register (trap) */
#define rXX 29  /* execution register (trap) */
#define rYY 30  /* Y operand (trap) */
#define rZZ 31  /* Z operand (trap) */

53. ⟨Global variables 20⟩ +≡
  char *special_name[32] = {"rB", "rD", "rE", "rH", "rJ", "rM", "rR", "rBB", "rC", "rN", "rO", "rS",
      "rI", "rT", "rTT", "rK", "rQ", "rU", "rV", "rG", "rL", "rA", "rF", "rP", "rW", "rX", "rY", "rZ",
      "rWW", "rXX", "rYY", "rZZ"};
```
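The numbering conventions stated at the start of this section reduce to simple range tests. The predicates below are a sketch of those rules only; their names are hypothetical and the simulator itself may express the same checks differently.

```c
#include <stdbool.h>

/* Special registers 8-11 can't be PUT by anybody. */
static bool put_never_allowed(int r) { return r >= 8 && r <= 11; }

/* Special registers 12-18 can't be PUT by the user. */
static bool put_privileged_only(int r) { return r >= 12 && r <= 18; }

/* GET applied to special registers 21-31 might cause a pipeline delay. */
static bool get_may_stall(int r) { return r >= 21 && r <= 31; }

/* PUT applied to special registers 15-20 might cause a pipeline delay. */
static bool put_may_stall(int r) { return r >= 15 && r <= 20; }

/* SAVE and UNSAVE store and restore special registers 0-6 and 23-27. */
static bool saved_by_save(int r)
{
  return (r >= 0 && r <= 6) || (r >= 23 && r <= 27);
}
```

For instance, rC (number 8) satisfies `put_never_allowed`, while rK (number 15) is both privileged and a possible source of PUT delays.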


**54.** Here are the bit codes that affect trips and traps. The first eight cases also apply to the upper half of rQ; the next eight apply to rA.

```
#define P_BIT (1 << 0)   /* instruction in privileged location */
#define S_BIT (1 << 1)   /* security violation */
#define B_BIT (1 << 2)   /* instruction breaks the rules */
#define K_BIT (1 << 3)   /* instruction for kernel only */
#define N_BIT (1 << 4)   /* virtual translation bypassed */
#define PX_BIT (1 << 5)  /* permission lacking to execute from page */
#define PW_BIT (1 << 6)  /* permission lacking to write on page */
#define PR_BIT (1 << 7)  /* permission lacking to read from page */
#define PROT_OFFSET 5    /* distance from PR_BIT to protection code position */
#define X_BIT (1 << 8)   /* floating inexact */
#define Z_BIT (1 << 9)   /* floating division by zero */
#define U_BIT (1 << 10)  /* floating underflow */
#define O_BIT (1 << 11)  /* floating overflow */
#define I_BIT (1 << 12)  /* floating invalid operation */
#define W_BIT (1 << 13)  /* float-to-fix overflow */
#define V_BIT (1 << 14)  /* integer overflow */
#define D_BIT (1 << 15)  /* integer divide check */
#define H_BIT (1 << 16)  /* trip handler bit */
#define F_BIT (1 << 17)  /* forced trap bit */
#define E_BIT (1 << 18)  /* external (dynamic) trap bit */

⟨Global variables 20⟩ +≡
  char bit_code_map[] = "EFHDVWIOUZXrwxnkbsp";

55. ⟨Internal prototypes 13⟩ +≡
  static void print_bits ARGS((int));

56. ⟨Subroutines 14⟩ +≡
  static void print_bits(x)
    int x;
  {
    register int b, j;
    for (j = 0, b = E_BIT; (x & (b + b - 1)) && b; j++, b >>= 1)
      if (x & b) printf("%c", bit_code_map[j]);
  }
```

**57.** The lower half of rQ holds external interrupts of highest priority. Most of them are implementation-dependent, but a few are defined in general.

```
⟨Header definitions 6⟩ +≡
#define POWER_FAILURE (1 << 0)       /* try to shut down calmly and quickly */
#define PARITY_ERROR (1 << 1)        /* try to save the file systems */
#define NONEXISTENT_MEMORY (1 << 2)  /* a memory address can't be used */
#define REBOOT_SIGNAL (1 << 4)       /* it's time to start over */
#define INTERVAL_TIMEOUT (1 << 7)    /* the timer register, rI, has reached zero */
```
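To see how the bit codes of section 54 line up with the letters of the code map, here is a sketch of the same scanning loop as in print_bits, except that it writes into a caller-supplied buffer instead of calling printf. The names `bits_to_string`, `SK_E_BIT`, and `sk_bit_code_map` are hypothetical; the buffer must have room for at least 20 characters.

```c
#include <stddef.h>

#define SK_E_BIT (1 << 18)  /* same value as E_BIT */

static const char sk_bit_code_map[] = "EFHDVWIOUZXrwxnkbsp";

/* Write the code letters for the bits of x into out, most significant
   first, and return the number of letters written. The loop stops as soon
   as no bits at or below position b remain, so bits above E_BIT are ignored. */
static size_t bits_to_string(int x, char *out)
{
  int b, j;
  size_t n = 0;
  for (j = 0, b = SK_E_BIT; (x & (b + b - 1)) && b; j++, b >>= 1)
    if (x & b) out[n++] = sk_bit_code_map[j];
  out[n] = '\0';
  return n;
}
```

Letter j of the map belongs to bit 18 − j, so V_BIT together with X_BIT yields "VX", and D_BIT together with PR_BIT and P_BIT yields "Drp".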


**58. Dynamic speculation.** Now that we understand some basic low-level structures, we're ready to look at the larger picture.

This simulator is based on the idea of "dynamic scheduling with register renaming," as introduced in the 1960s by R. M. Tomasulo [IBM Journal of Research and Development 11 (1967), 25–33]. Moreover, the dynamic scheduling method is extended here to "speculative execution," as implemented in several processors of the 1990s and described in section 4.6 of Hennessy and Patterson's Computer Architecture, second edition (1995). The essential idea is to keep track of the pipeline contents by recording all dependencies between unfinished computations in a queue called the reorder buffer. An entry in the reorder buffer might, for example, correspond to an instruction that adds together two numbers whose values are still being computed; those numbers have been allocated space in earlier positions of the reorder buffer. The addition will take place as soon as both of its operands are known, but the sum won't be written immediately into the destination register. It will stay in the reorder buffer until reaching the hot seat at the front of the queue. Finally, the addition leaves the hot seat and is said to be committed.

Some instructions in the reorder buffer may in fact be executed only on speculation, meaning that they won't really be called for unless a prior branch instruction has the predicted outcome. Indeed, we can say that all instructions not yet in the hot seat are being executed speculatively, because an external interrupt might occur at any time and change the entire course of computation. Organizing the pipeline as a reorder buffer allows us to look ahead and keep busy computing values that have a good chance of being needed later, instead of waiting for slow instructions or slow memory references to be completed.

The reorder buffer is in fact a queue of **control** records, conceptually forming part of a circle of such records inside the simulator, corresponding to all instructions that have been dispatched or *issued* but not yet committed, in strict program order.
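The queue discipline just described can be pictured with a small circular buffer. This is a sketch under simplifying assumptions, not the simulator's data structure: the names `rob`, `rob_entry`, `ROB_SIZE`, and all four functions are hypothetical, and an entry carries only an identifier and a "finished" flag instead of a full control record.

```c
#include <stdbool.h>
#include <stddef.h>

#define ROB_SIZE 4  /* hypothetical capacity */

typedef struct { int id; bool done; } rob_entry;

typedef struct {
  rob_entry slot[ROB_SIZE];
  size_t hot;    /* index of the hot seat (oldest entry) */
  size_t count;  /* number of issued, uncommitted entries */
} rob;

static void rob_init(rob *r) { r->hot = 0; r->count = 0; }

/* Issue: append at the young end, in program order. */
static bool rob_issue(rob *r, int id)
{
  size_t tail;
  if (r->count == ROB_SIZE) return false;  /* buffer full: dispatcher waits */
  tail = (r->hot + r->count) % ROB_SIZE;
  r->slot[tail].id = id;
  r->slot[tail].done = false;
  r->count++;
  return true;
}

/* Mark the entry that is `age` places behind the hot seat as finished. */
static void rob_mark_done(rob *r, size_t age)
{
  r->slot[(r->hot + age) % ROB_SIZE].done = true;
}

/* Commit: only the hot seat may leave, and only when it has finished. */
static bool rob_commit(rob *r, int *id)
{
  if (r->count == 0 || !r->slot[r->hot].done) return false;
  *id = r->slot[r->hot].id;
  r->hot = (r->hot + 1) % ROB_SIZE;
  r->count--;
  return true;
}

/* Deissue: cancel every entry younger than the one at `age`. */
static size_t rob_deissue_younger(rob *r, size_t age)
{
  size_t keep = age + 1, dropped;
  if (keep >= r->count) return 0;
  dropped = r->count - keep;
  r->count = keep;
  return dropped;
}
```

Note that a younger instruction may finish before an older one, yet it cannot be committed until everything ahead of it has left the hot seat.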

The best way to get an understanding of speculative execution is perhaps to imagine that the reorder buffer is large enough to hold hundreds of instructions in various stages of execution, and to think of an implementation of MMIX that has dozens of functional units—more than would ever actually be built into a chip. Then one can readily visualize the kinds of control structures and checks that must be made to ensure correct execution. Without such a broad viewpoint, a programmer or hardware designer will be inclined to think only of the simple cases and to devise algorithms that lack the proper generality. Thus we have a somewhat paradoxical situation in which a difficult general problem turns out to be easier to solve than its simpler special cases, because it enforces clarity of thinking.

Instructions that have completed execution and have not yet been committed are analogous to cars that have gone through our hypothetical repair shop and are waiting for their owners to pick them up. However, all analogies break down, and the world of automobiles does not have a natural counterpart for the notion of speculative execution. That notion corresponds roughly to situations in which people are led to believe that their cars need a new piece of equipment, but they suddenly change their mind once they see the price tag, and they insist on having the equipment removed even after it has been partially or completely installed.

Speculatively executed instructions might make no sense: They might divide by zero or refer to protected memory areas, etc. Such anomalies are not considered catastrophic or even exceptional until the instruction reaches the hot seat.

The person who designs a computer with speculative execution is an optimist, who has faith that the vast majority of the machine's predictions will come true. The person who designs a reliable implementation of such a computer is a pessimist, who understands that all predictions might come to naught. The pessimist does, however, take pains to optimize the cases that do turn out well.

**59.** Let's consider what happens to a single instruction, say ADD \$1,\$2,\$3, as it travels through the pipeline in a normal situation. The first time this instruction is encountered, it is placed into the I-cache (that is, the instruction cache), so that we won't have to access memory when we need to perform it again. We will assume for simplicity in this discussion that each I-cache access takes one clock cycle, although other possibilities are allowed by *MMIX\_config*.

Suppose the simulated machine fetches the example ADD instruction at time 1000. Fetching is done by a coroutine whose *stage* number is 0. A cache block typically contains 8 or 16 instructions. The fetch unit of our machine is able to fetch up to *fetch\_max* instructions on each clock cycle and place them in the fetch buffer, provided that there is room in the buffer and that all the instructions belong to the same cache block.

The dispatch unit of our simulator is able to issue up to *dispatch\_max* instructions on each clock cycle and move them from the fetch buffer to the reorder buffer, provided that functional units are available for those instructions and there is room in the reorder buffer. A functional unit that handles ADD is usually called an ALU (arithmetic logic unit), and our simulated machine might have several of them. If they aren't all stalled in stage 1 of their pipelines, and if the reorder buffer isn't full, and if the machine isn't in the process of deissuing instructions that were mispredicted, and if fewer than *dispatch\_max* instructions are ahead of the ADD in the fetch buffer, and if all such prior instructions can be issued without using up all the free ALUs, our ADD instruction will be issued at time 1001. (In fact, all of these conditions are usually true.)

We assume that L > 3, so that \$1, \$2, and \$3 are local registers. For simplicity we'll assume in fact that the register stack is empty, so that the ADD instruction is supposed to set  $l[1] \leftarrow l[2] + l[3]$ . The operands l[2] and l[3] might not be known at time 1001; they are **spec** values, which might point to **specnode** entries in the reorder buffer for previous instructions whose destinations are l[2] and l[3]. The dispatcher fills the next available control block of the reorder buffer with information for the ADD, containing appropriate **spec** values corresponding to l[2] and l[3] in its y and z fields. The x field of this control block will be inserted into a doubly linked list of **specnode** records, corresponding to l[1] and to all instructions in the reorder buffer that have l[1] as a destination. The boolean value x.known will be set to false, meaning that this speculative value still needs to be computed. Subsequent instructions that need l[1] as a source will point to x, if they are issued before the sum x.o has been computed. Double linking is used in the **specnode** list because the ADD instruction might be cancelled before it is finally committed; thus deletions might occur at either end of the list for l[1].
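The need for double linking can be seen in a short sketch of a circular list with a header node that stands for the register itself. The names `node`, `list_init`, `list_install`, `list_remove`, and `list_newest` are hypothetical, chosen for this illustration; the program defines its own routines for these operations in later sections.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
  uint64_t o;               /* speculative value */
  bool known;               /* has o been computed? */
  struct node *up, *down;   /* neighbors in the circular list */
} node;

/* An empty list consists of the header pointing to itself. */
static void list_init(node *head) { head->up = head->down = head; }

/* Issue: install s as the newest pending value, next to the header. */
static void list_install(node *head, node *s)
{
  s->up = head->up;
  s->up->down = s;
  head->up = s;
  s->down = head;
  s->known = false;  /* the value still needs to be computed */
}

/* Commit or cancel: unlink s, wherever it sits in the list. */
static void list_remove(node *s)
{
  s->up->down = s->down;
  s->down->up = s->up;
}

/* The producer a newly issued consumer should point to, or NULL when
   no instruction in flight has this register as its destination. */
static node *list_newest(node *head)
{
  return head->up == head ? NULL : head->up;
}
```

In this arrangement a committed instruction is removed from the old end (`head->down`) and a cancelled one from the new end (`head->up`); both removals take constant time because each node knows both neighbors.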

At time 1002, the ALU handling the ADD will stall if its inputs y and z are not both known (namely if  $y.p \neq \Lambda$  or  $z.p \neq \Lambda$ ). In fact, it will also stall if its third input rA is not known; the current speculative value of rA, except for its event bits, is represented in the ra field of the control block, and we must have  $ra.p \equiv \Lambda$ . In such a case the ALU will look to see if the **spec** values pointed to by y.p and/or ra.p become defined on this clock cycle, and it will update its own input values accordingly.

But let's assume that y, z, and ra are already known at time 1002. Then x.o will be set to y.o + z.o and x.known will become true. This will make the result destined for l[1] available to be used in other commands at time 1003.

If no overflow occurs when adding y.o to z.o, the interrupt and arith\_exc fields of the control block for ADD are set to zero. But when overflow does occur (shudder), there are two cases, based on the V-enable bit of rA, which is found in field b.o of the control block. If this bit is 0, the V-bit of the arith\_exc field in the control block is set to 1; the arith\_exc field will be ored into rA when the ADD instruction is eventually committed. But if the V-enable bit is 1, the trip handler should be called, interrupting the normal sequence. In such a case, the interrupt field of the control block is set to specify a trip, and the fetcher and dispatcher are told to forget what they have been doing; all instructions following the ADD in the reorder buffer must now be deissued. The virtual starting address of the overflow trip handler, namely location 32, is hastily passed to the fetch routine, and instructions will be fetched from that location as soon as possible. (Of course the overflow and the trip handler are still speculative until the ADD instruction is committed. Other exceptional conditions might cause the ADD itself to be terminated before it gets to the hot seat. But the pipeline keeps charging ahead, always trying to guess the most probable outcome.)
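The two overflow cases can be sketched in a few lines of C. This is only an illustration, not the simulator's code: the type `add_outcome`, the function `speculative_add`, and the macros `RA_V_EVENT` and `RA_V_ENABLE` are hypothetical names, and the bit positions are assumed from the usual layout of rA (event bits in the low byte, enable bits in the byte above).

```c
#include <assert.h>
#include <stdint.h>

#define RA_V_EVENT 0x40u               /* assumed position of the V event bit */
#define RA_V_ENABLE (RA_V_EVENT << 8)  /* assumed position of the V-enable bit */

typedef struct {
  uint64_t x;          /* the sum, modulo 2^64 */
  unsigned arith_exc;  /* event bits to be ORed into rA at commit time */
  int trip;            /* nonzero if the trip handler should be entered */
} add_outcome;

/* Add y and z as signed 64-bit quantities, given the current rA bits. */
static add_outcome speculative_add(uint64_t y, uint64_t z, unsigned ra)
{
  add_outcome r = {0, 0, 0};
  r.x = y + z;
  /* signed overflow: both operands differ in sign from the result */
  if (((y ^ r.x) & (z ^ r.x)) >> 63) {
    if (ra & RA_V_ENABLE) r.trip = 1;    /* interrupt the normal sequence */
    else r.arith_exc |= RA_V_EVENT;      /* just record the event */
  }
  return r;
}
```

Either way the outcome stays speculative: nothing here touches rA itself, which changes only when the instruction is committed.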

The commission unit of this simulator is able to commit and/or deissue up to *commit\_max* instructions on each clock cycle. With luck, fewer than *commit\_max* instructions will be ahead of our ADD instruction at time 1003, and they will all be completed normally. Then l[1] can be set to x.o, and the event bits of rA can be updated from *arith\_exc*, and the ADD command can pass through the hot seat and out of the reorder buffer.

```
⟨ External variables 4⟩ +≡
Extern int fetch_max, dispatch_max, peekahead, commit_max;
/* limits on instructions that can be handled per clock cycle */
```

**60.** The instruction currently occupying the hot seat is the only issued-but-not-yet-committed instruction that is guaranteed to be truly essential to the machine's computation. All other instructions in the reorder buffer are being executed on speculation; if they prove to be needed, well and good, but we might want to jettison them all if, say, an external interrupt occurs.

Thus all instructions that change the global state in complicated ways—like LDVTS, which changes the virtual address translation caches—are performed only when they reach the hot seat. Fortunately the vast majority of instructions are sufficiently simple that we can deal with them more efficiently while other computations are taking place.

In this implementation the reorder buffer is simply housed in an array of control records. The first array element is  $reorder\_bot$ , and the last is  $reorder\_top$ . Variable hot points to the control block in the hot seat, and hot - 1 to its predecessor, etc. Variable cool points to the next control block that will be filled in the reorder buffer. If  $hot \equiv cool$  the reorder buffer is empty; otherwise it contains the control records hot,  $hot - 1, \ldots, cool + 1$ , except of course that we wrap around from  $reorder\_bot$  to  $reorder\_top$  when moving down in the buffer.

```
⟨External variables 4⟩ +≡
Extern control *reorder_bot, *reorder_top;
   /* least and greatest entries in the ring containing the reorder buffer */
Extern control *hot, *cool; /* front and rear of the reorder buffer */
Extern control *old_hot; /* value of hot at beginning of cycle */
Extern int deissues; /* the number of instructions that need to be deissued */

61. ⟨Initialize everything 22⟩ +≡
hot = cool = reorder_top;
deissues = 0;

62. ⟨Internal prototypes 13⟩ +≡
static void print_reorder_buffer ARGS((void));
```
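The ring discipline described in section 60 can be tried out in isolation. The following is a minimal sketch, not the simulator's code: `slot`, `ring_down`, `ring_issue`, and the other names are hypothetical, the capacity is arbitrary, and the "full" test (one slot always stays empty, so that hot ≡ cool can mean "empty") is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4  /* hypothetical capacity; one slot always stays empty */

typedef struct { int id; } slot;  /* stand-in for a control record */

static slot ring[RING_SIZE];
static slot *const ring_bot = &ring[0];
static slot *const ring_top = &ring[RING_SIZE - 1];
static slot *hot_p = &ring[RING_SIZE - 1];   /* front of the buffer */
static slot *cool_p = &ring[RING_SIZE - 1];  /* next slot to be filled */

/* Move one position down the ring, wrapping from bottom to top. */
static slot *ring_down(slot *p) { return p == ring_bot ? ring_top : p - 1; }

static int ring_empty(void) { return hot_p == cool_p; }
static int ring_full(void) { return ring_down(cool_p) == hot_p; }

/* Fill the cool slot and advance cool; returns 0 if there is no room. */
static int ring_issue(int id)
{
  if (ring_full()) return 0;
  cool_p->id = id;
  cool_p = ring_down(cool_p);
  return 1;
}

/* Retire the entry in the hot seat; returns its id, or -1 if empty. */
static int ring_commit(void)
{
  int id;
  if (ring_empty()) return -1;
  id = hot_p->id;
  hot_p = ring_down(hot_p);
  return id;
}

/* Count the occupied slots hot, hot-1, ..., cool+1. */
static int ring_count(void)
{
  int n = 0;
  slot *p;
  for (p = hot_p; p != cool_p; p = ring_down(p)) n++;
  return n;
}
```

Entries leave in the order they entered, because hot and cool both travel downward and wrap at the same place.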


```
63. ⟨Subroutines 14⟩ +≡
static void print_reorder_buffer()
{
  printf("Reorder buffer");
  if (hot ≡ cool) printf(" (empty)\n");
  else { register control *p;
    if (deissues) printf(" (%d to be deissued)", deissues);
    if (doing_interrupt) printf(" (interrupt state %d)", doing_interrupt);
    printf(":\n");
    for (p = hot; p ≠ cool; p = (p ≡ reorder_bot ? reorder_top : p - 1)) {
      print_control_block(p);
      if (p→owner) {
        printf(" "); print_coroutine_id(p→owner);
      }
      printf("\n");
    }
  }
  printf(" %d available rename register%s, %d memory slot%s\n", rename_regs,
      rename_regs ≠ 1 ? "s" : "", mem_slots, mem_slots ≠ 1 ? "s" : "");
}
```

**64.** Here is an overview of what happens on each clock cycle.

```
⟨Perform one machine cycle 64⟩ ≡
{
  ⟨Check for external interrupt 314⟩;
  dispatch_count = 0;
  old_hot = hot; /* remember the hot seat position at beginning of cycle */
  old_tail = tail; /* remember the fetch buffer contents at beginning of cycle */
  suppress_dispatch = (deissues ∨ dispatch_lock);
  if (doing_interrupt) ⟨Perform one cycle of the interrupt preparations 318⟩
  else ⟨Commit and/or deissue up to commit_max instructions 67⟩;
  ⟨Execute all coroutines scheduled for the current time 125⟩;
  if (¬suppress_dispatch) ⟨Dispatch one cycle's worth of instructions 74⟩;
  ticks = incr(ticks, 1); /* and the beat moves on */
  dispatch_stat[dispatch_count]++;
}
```

This code is used in section 10.

```
65. ⟨Global variables 20⟩ +≡
int dispatch_count; /* how many dispatched on this cycle */
bool suppress_dispatch; /* should dispatching be bypassed? */
int doing_interrupt; /* how many cycles of interrupt preparations remain */
lockvar dispatch_lock; /* lock to prevent instruction issues */

66. ⟨External variables 4⟩ +≡
Extern int *dispatch_stat; /* how often did we dispatch 0, 1, ... instructions? */
Extern bool security_disabled; /* omit security checks for testing purposes? */
```


```
67. ⟨Commit and/or deissue up to commit_max instructions 67⟩ ≡
{
    for (m = commit_max; m > 0 ∧ deissues > 0; m--) ⟨Deissue the coolest instruction 145⟩;
    for (; m > 0; m--) {
        if (hot ≡ cool) break; /* reorder buffer is empty */
        if (¬security_disabled) ⟨Check for security violation, break if so 149⟩;
        if (hot→owner) break; /* hot seat instruction isn't finished */
        ⟨Commit the hottest instruction, or break if it's not ready 146⟩;
        i = hot→i;
        if (hot ≡ reorder_bot) hot = reorder_top;
        else hot --;
        if (i ≡ resum) break; /* allow the resumed instruction to see the new rK */
    }
}
```

This code is used in section 64.
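The two loops of section 67 share a single budget of *commit\_max* steps, with deissuing served first. A minimal sketch of that accounting follows; it reduces instructions to counters, and `cycle_work`, `commission_cycle`, and the parameter `finished` (the number of completed instructions waiting at the hot end) are hypothetical names introduced only for illustration.

```c
#include <assert.h>

typedef struct { int deissued, committed; } cycle_work;

/* One cycle of the commission unit, reduced to counting.
   *deissues: instructions still to be deissued (updated in place);
   *finished: completed instructions at the hot end (updated in place). */
static cycle_work commission_cycle(int commit_max, int *deissues, int *finished)
{
  cycle_work w = {0, 0};
  int m;
  assert(commit_max >= 0 && *deissues >= 0 && *finished >= 0);
  for (m = commit_max; m > 0 && *deissues > 0; m--) {
    (*deissues)--; w.deissued++;   /* deissuing uses up the budget first */
  }
  for (; m > 0; m--) {
    if (*finished == 0) break;     /* nothing ready in the hot seat */
    (*finished)--; w.committed++;
  }
  return w;
}
```

With a budget of 3, two pending deissues, and five finished instructions, the first cycle deissues two and commits one; the next cycle is free to commit three.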


**68.** The dispatch stage. It would be nice to present the parts of this simulator by dealing with the fetching, dispatching, executing, and committing stages in that order. After all, instructions are first fetched, then dispatched, then executed, and finally committed. However, the fetch stage depends heavily on difficult questions of memory management that are best deferred until we have looked at the simpler parts of simulation. Therefore we will take our initial plunge into the details of this program by looking first at the dispatch phase, assuming that instructions have somehow appeared magically in the fetch buffer.

The fetch buffer, like the circular priority queue of all coroutines and the circular queue used for the reorder buffer, lives in an array that is best regarded as a ring of elements. The elements are structures of type **fetch**, which have five fields: A 32-bit *inst*, which is an MMIX instruction; a 64-bit *loc*, which is the virtual address of that instruction; an *interrupt* field, which is nonzero if, for example, the protection bits in the relevant page table entry for this address do not permit execution access; a boolean *noted* field, which becomes *true* after the dispatch unit has peeked at the instruction to see whether it is a jump or probable branch; and a *hist* field, which records the recent branch history. (The least significant bits of *hist* correspond to the most recent branches.)

```
⟨Type definitions 11⟩ +≡
typedef struct {
  octa loc; /* virtual address of instruction */
  tetra inst; /* the instruction itself */
  unsigned int interrupt; /* bit codes that might cause interruption */
  bool noted; /* have we peeked at this instruction? */
  unsigned int hist; /* if we peeked, this was the peek_hist */
} fetch;
```

**69.** The oldest and youngest entries in the fetch buffer are pointed to by *head* and *tail*, just as the oldest and youngest entries in the reorder buffer are called *hot* and *cool*. The fetch coroutine will be adding entries at the *tail* position, which starts at  $old\_tail$  when a cycle begins, in parallel with the actions simulated by the dispatcher. Therefore the dispatcher is allowed to look only at instructions in head,  $head - 1, \ldots, old\_tail + 1$ , although a few more recently fetched instructions will usually be present in the fetch buffer by the time this part of the program is executed.

```
⟨External variables 4⟩ +≡
Extern fetch *fetch_bot, *fetch_top;
/* least and greatest entries in the ring containing the fetch buffer */
Extern fetch *head, *tail; /* front and rear of the fetch buffer */

70. ⟨Global variables 20⟩ +≡
fetch *old_tail; /* rear of the fetch buffer available on the current cycle */

71. #define UNKNOWN_SPEC ((specnode *) 1)
⟨Initialize everything 22⟩ +≡
head = tail = fetch_top;
inst_ptr.p = UNKNOWN_SPEC;

72. ⟨Internal prototypes 13⟩ +≡
static void print_fetch_buffer ARGS((void));
```


```
73. ⟨Subroutines 14⟩ +≡
static void print_fetch_buffer()
{
  printf("Fetch buffer");
  if (head ≡ tail) printf(" (empty)\n");
  else { register fetch *p;
    if (resuming) printf(" (resumption state %d)", resuming);
    printf(":\n");
    for (p = head; p ≠ tail; p = (p ≡ fetch_bot ? fetch_top : p - 1)) {
      print_octa(p→loc);
      printf(": %08x(%s)", p→inst, opcode_name[p→inst ≫ 24]);
      if (p→interrupt) print_bits(p→interrupt);
      if (p→noted) printf("*");
      printf("\n");
    }
  }
  printf("Instruction pointer is ");
  if (inst_ptr.p ≡ Λ) print_octa(inst_ptr.o);
  else {
    printf("waiting for ");
    if (inst_ptr.p ≡ UNKNOWN_SPEC) printf("dispatch");
    else if (inst_ptr.p→addr.h ≡ (tetra) -1)
      print_coroutine_id(((control *) inst_ptr.p→up)→owner);
    else print_specnode_id(inst_ptr.p→addr);
  }
  printf("\n");
}
```

**74.** The best way to understand the dispatching process is once again to "think big," by imagining a huge fetch buffer and the potential ability to issue dozens of instructions per cycle, although the actual numbers are typically quite small.

If the fetch buffer is not empty after *dispatch\_max* instructions have been dispatched, the dispatcher also looks at up to *peekahead* further instructions to see if they are jumps or other commands that change the flow of control. Much of this action would happen in parallel on a real machine, but our simulator works sequentially.

In the following program, *true\_head* records the head of the fetch buffer as instructions are actually dispatched, while *head* refers to the position currently being examined (possibly peeking into the future).

If the fetch buffer is empty at the beginning of the current clock cycle, a "dispatch bypass" allows the dispatcher to issue the first instruction that enters the fetch buffer on this cycle. Otherwise the dispatcher is restricted to previously fetched instructions.

```
⟨Dispatch one cycle's worth of instructions 74⟩ ≡
{ register fetch *true_head, *new_head;
  true_head = head;
  if (head ≡ old_tail ∧ head ≠ tail) /* dispatch bypass: the buffer was empty when the cycle began */
    old_tail = (head ≡ fetch_bot ? fetch_top : head - 1);
  peek_hist = cool_hist;
  for (j = 0; j < dispatch_max + peekahead; j++)
    ⟨Look at the head instruction, and try to dispatch it if j < dispatch_max 75⟩;
  head = true_head;
}
```

This code is used in section 64.


```
75. ⟨Look at the head instruction, and try to dispatch it if j < dispatch_max 75⟩ ≡
{
  register mmix_opcode op;
  register int yz, f;
  register bool freeze_dispatch = false;
  register func *u = Λ;
  if (head ≡ old_tail) break; /* fetch buffer empty */
  if (head ≡ fetch_bot) new_head = fetch_top; else new_head = head - 1;
  op = head→inst ≫ 24; yz = head→inst & #ffff;
  ⟨Determine the flags, f, and the internal opcode, i 80⟩;
  ⟨Install default fields in the cool block 100⟩;
  if (f & rel_addr_bit) ⟨Convert relative address to absolute address 84⟩;
  if (head→noted) peek_hist = head→hist;
  else ⟨Redirect the fetch if control changes at this inst 85⟩;
  if (j ≥ dispatch_max ∨ dispatch_lock ∨ nullifying) {
    head = new_head; continue; /* can't dispatch, but can peek ahead */
  }
  if (cool ≡ reorder_bot) new_cool = reorder_top; else new_cool = cool - 1;
  ⟨Dispatch an instruction to the cool block if possible, otherwise goto stall 101⟩;
  ⟨Assign a functional unit if available, otherwise goto stall 82⟩;
  ⟨Check for sufficient rename registers and memory slots, or goto stall 111⟩;
  if ((op & #e0) ≡ #40) ⟨Record the result of branch prediction 152⟩;
  ⟨Issue the cool instruction 81⟩;
  cool = new_cool; cool_O = new_O; cool_S = new_S;
  cool_hist = peek_hist; continue;
stall: ⟨Undo data structures set prematurely in the cool block and break 123⟩;
}
```

This code is used in section 74.

**76.** An instruction can be dispatched only if a functional unit is available to handle it. A functional unit consists of a 256-bit vector that specifies a subset of MMIX's opcodes, and an array of coroutines for the pipeline stages. There are k coroutines in the array, where k is the maximum number of stages needed by any of the opcodes supported.

```
⟨Type definitions 11⟩ +≡
typedef struct func_struct {
  char name[16]; /* symbolic designation */
  tetra ops[8]; /* big-endian bitmap for the opcodes supported */
  int k; /* number of pipeline stages */
  coroutine *co; /* pointer to the first of k consecutive coroutines */
} func;

77. ⟨External variables 4⟩ +≡
Extern func *funit; /* pointer to array of functional units */
Extern int funit_count; /* the number of functional units */
```

**78.** It is convenient to have a 256-bit vector of all the supported opcodes, because we need to shut off a lot of special actions when an opcode is not supported.

```
⟨Global variables 20⟩ +≡
control *new_cool; /* the reorder position following cool */
int resuming; /* set nonzero if resuming an interrupted instruction */
tetra support[8]; /* big-endian bitmap for all opcodes supported */
```
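The big-endian bitmap convention used by *ops* and *support* puts opcode 0 in the most significant bit of word 0. Here is a minimal sketch of setting, testing, and merging such bitmaps; `tetra_t`, `TOP_BIT`, and the three functions are hypothetical names, chosen so as not to collide with the simulator's own `tetra` and `sign_bit`.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t tetra_t;                 /* hypothetical stand-in for tetra */
#define TOP_BIT ((tetra_t)0x80000000u)    /* plays the role of sign_bit */

/* Mark opcode op (0..255) as supported in the 256-bit map ops. */
static void bitmap_add(tetra_t ops[8], unsigned op)
{
  assert(op < 256);
  ops[op >> 5] |= TOP_BIT >> (op & 31);
}

/* Is opcode op present in the map? */
static int bitmap_has(const tetra_t ops[8], unsigned op)
{
  assert(op < 256);
  return (ops[op >> 5] & (TOP_BIT >> (op & 31))) != 0;
}

/* Merge the opcodes of one unit into an overall map. */
static void bitmap_union(tetra_t dst[8], const tetra_t src[8])
{
  int i;
  for (i = 0; i < 8; i++) dst[i] |= src[i];
}
```

For instance, opcode #20 lands in the leading bit of word 1, while opcode #18 lands in bit #80 of word 0.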


```
79. ⟨Initialize everything 22⟩ +≡
{ register func *u;
  for (u = funit; u ≤ funit + funit_count; u++)
    for (i = 0; i < 8; i++) support[i] |= u→ops[i];
}

80. #define sign_bit ((unsigned) #80000000)
⟨Determine the flags, f, and the internal opcode, i 80⟩ ≡
  if (¬(support[op ≫ 5] & (sign_bit ≫ (op & 31)))) {
      /* oops, this opcode isn't supported by any function unit */
    f = flags[TRAP], i = trap;
  } else f = flags[op], i = internal_op[op];
  if (i ≡ trip ∧ (head→loc.h & sign_bit)) f = 0, i = noop;
This code is used in section 75.

81. ⟨Issue the cool instruction 81⟩ ≡
  if (cool→interim) {
    cool→usage = false;
    if (cool→op ≡ SAVE) ⟨Get ready for the next step of SAVE 341⟩
    else if (cool→op ≡ UNSAVE) ⟨Get ready for the next step of UNSAVE 335⟩
    else if (cool→i ≡ preld ∨ cool→i ≡ prest) ⟨Get ready for the next step of PRELD or PREST 228⟩
    else if (cool→i ≡ prego) ⟨Get ready for the next step of PREGO 229⟩
  }
  else if (cool→i < max_real_command) {
    if ((flags[cool→op] & ctl_change_bit) ∨ cool→i ≡ pbr)
      if (inst_ptr.p ≡ Λ ∧ (inst_ptr.o.h & sign_bit) ∧ ¬(cool→loc.h & sign_bit) ∧ cool→i ≠ trap)
        cool→interrupt |= P_BIT; /* jumping from nonnegative to negative */
    true_head = head = new_head; /* delete instruction from fetch buffer */
    resuming = 0;
  }
  if (freeze_dispatch) set_lock(u→co, dispatch_lock);
  cool→owner = u→co; u→co→ctl = cool;
  startup(u→co, 1); /* schedule execution of the new inst */
  if (verbose & issue_bit) {
    printf("Issuing "); print_control_block(cool);
    printf(" "); print_coroutine_id(u→co); printf("\n");
  }
  dispatch_count++;
```

This code is used in section 75.


82. We assign the first functional unit that supports op and is totally unoccupied, if possible; otherwise we assign the first functional unit that supports op and has stage 1 unoccupied.

```
⟨Assign a functional unit if available, otherwise goto stall 82⟩ ≡
{ register int t = op ≫ 5, b = sign_bit ≫ (op & 31);
  if (cool→i ≡ trap ∧ op ≠ TRAP) { /* opcode needs to be emulated */
    u = funit + funit_count; /* this unit supports just TRIP and TRAP */
    goto unit_found;
  }
  for (u = funit; u ≤ funit + funit_count; u++)
    if (u→ops[t] & b) {
      for (i = 0; i < u→k; i++)
        if (u→co[i].next) goto unit_busy;
      goto unit_found;
    unit_busy: ;
    }
  for (u = funit; u < funit + funit_count; u++)
    if ((u→ops[t] & b) ∧ (u→co→next ≡ Λ)) goto unit_found;
  goto stall; /* all units for this op are busy */
unit_found: ;
}
```

This code is used in section 75.

83. The flags table records special properties of each operation code in binary notation: #1 means Z is an immediate value, #2 means rZ is a source operand, #4 means Y is an immediate value, #8 means rY is a source operand, #10 means rX is a source operand, #20 means rX is a destination, #40 means YZ is part of a relative address, #80 means the control changes at this point.

```
#define X_is_dest_bit #20
#define rel_addr_bit #40
#define ctl_change_bit #80
⟨Global variables 20⟩ +≡
unsigned char flags[256] = {
  #8a, #2a, #2a, /* TRAP, ... */
  #26, #25, #26, #25, #26, #25, #26, #25, /* FLOT, ... */
  #2a, #2a, #2a, #2a, #2a, #26, #2a, #26, /* FMUL, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* MUL, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* ADD, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* 2ADDU, ... */
  #2a, #29, #2a, #29, #26, #25, #26, #25, /* CMP, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* SL, ... */
  #50, #50, #50, #50, #50, #50, #50, #50, /* BN, ... */
  #50, #50, #50, #50, #50, #50, #50, #50, /* BNN, ... */
  #50, #50, #50, #50, #50, #50, #50, #50, /* PBN, ... */
  #50, #50, #50, #50, #50, #50, #50, #50, /* PBNN, ... */
  #3a, #39, #3a, #39, #3a, #39, #3a, #39, /* CSN, ... */
  #3a, #39, #3a, #39, #3a, #39, #3a, #39, /* CSNN, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* ZSN, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* ZSNN, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* LDB, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* LDT, ... */
  #2a, #29, #2a, #29, #1a, #19, #2a, #29, /* LDSF, ... */
  #2a, #29, #0a, #09, #0a, #09, #aa, #a9, /* LDVTS, ... */
  #1a, #19, #1a, #19, #1a, #19, #1a, #19, /* STB, ... */
  #1a, #19, #1a, #19, #1a, #19, #1a, #19, /* STT, ... */
  #1a, #19, #1a, #19, #0a, #09, #1a, #19, /* STSF, ... */
  #0a, #09, #0a, #09, #0a, #09, #aa, #a9, /* SYNCD, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* OR, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* AND, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* BDIF, ... */
  #2a, #29, #2a, #29, #2a, #29, #2a, #29, /* MUX, ... */
  #20, #20, #20, #20, #30, #30, #30, #30, /* SETH, ... */
  #30, #30, #30, #30, #30, #30, #30, #30, /* ORH, ... */
  #c0, #c0, #e0, #e0, #60, #60, #02, #01, /* JMP, ... */
  #80, #80, #00, #02, #01, #00, #20, #8a}; /* POP, ... */

84. ⟨Convert relative address to absolute address 84⟩ ≡
{
  if (i ≡ jmp) yz = head→inst & #ffffff;
  if (op & 1) yz -= (i ≡ jmp ? #1000000 : #10000); /* odd opcodes branch backward */
  cool→y.o = incr(head→loc, 4), cool→y.p = Λ; /* fall-through address */
  cool→z.o = incr(head→loc, yz ≪ 2), cool→z.p = Λ; /* target address */
}
```

This code is used in section 75.
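The address arithmetic of section 84 can be checked by hand with a small stand-alone function. This is a sketch under stated assumptions: `branch_target` and its `is_jmp` parameter are hypothetical, a plain 64-bit integer stands in for an octabyte, and the sample instruction words in the note below assume that #42/#43 and #f0/#f1 are forward/backward pairs (only the low bit of the opcode matters to the function).

```c
#include <assert.h>
#include <stdint.h>

/* Target of a relative branch or jump whose instruction word is inst
   and whose own address is loc.  is_jmp selects the 24-bit offset of
   a jump instead of the 16-bit offset of a branch. */
static uint64_t branch_target(uint64_t loc, uint32_t inst, int is_jmp)
{
  int64_t off = is_jmp ? (int64_t)(inst & 0xffffff) : (int64_t)(inst & 0xffff);
  assert(is_jmp == 0 || is_jmp == 1);
  if ((inst >> 24) & 1)                      /* odd opcode: backward */
    off -= is_jmp ? 0x1000000 : 0x10000;
  return loc + (uint64_t)(off * 4);          /* offsets count tetrabytes */
}
```

A branch word #42010003 at location #100 leads to #10c, while #4301ffff at the same location leads back to #fc; the fall-through address is simply the location plus 4.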


**85.** The location of the next instruction to be fetched is in a **spec** variable called *inst\_ptr*. A slightly tricky optimization of the POP instruction is made in the common case that the speculative value of rJ is known.

```
⟨Redirect the fetch if control changes at this inst 85⟩ ≡
{ register int predicted = 0;
  if ((op & #e0) ≡ #40) ⟨Predict a branch outcome 151⟩;
  head→noted = true;
  head→hist = peek_hist;
  if (predicted ∨ (f & ctl_change_bit) ∨ (i ≡ syncid ∧ ¬(cool→loc.h & sign_bit))) {
    old_tail = tail = new_head; /* discard all remaining fetches */
    ⟨Restart the fetch coroutine 287⟩;
    switch (i) {
    case jmp: case br: case pbr: case pushj: inst_ptr = cool→z; break;
    case pop: if (g[rJ].up→known ∧ j < dispatch_max ∧ ¬dispatch_lock ∧ ¬nullifying) {
        inst_ptr.o = incr(g[rJ].up→o, yz ≪ 2), inst_ptr.p = Λ; break;
      } /* otherwise fall through, will wait on cool→go */
    case go: case pushgo: case trap: case resume: case syncid: inst_ptr.p = UNKNOWN_SPEC; break;
    case trip: inst_ptr = zero_spec; break;
    }
  }
}
```

This code is used in section 75.

**86.** At any given time the simulated machine is in two main states, the "hot state" corresponding to instructions that have been committed and the "cool state" corresponding to all the speculative changes currently being considered. The dispatcher works with cool instructions and puts them into the reorder buffer, where they gradually get warmer and warmer. Intermediate instructions, between *hot* and *cool*, have intermediate temperatures.

A machine register like l[101] or g[250] is represented by a specnode whose o field is the current hot value of the register. If the up and down fields of this specnode point to the node itself, the hot and cool values of the register are identical. Otherwise up and down are pointers to the coolest and hottest ends of a doubly linked list of specnodes, representing intermediate speculative values (sometimes called "rename registers"). The rename registers are implemented as the x or a specnodes inside control blocks, for speculative instructions that use this register as a destination. Speculative instructions that use the register as a source operand point to the next-hottest specnode on the list, until the value becomes known. The doubly linked list of specnodes is an input-restricted deque: A node is inserted at the cool end when the dispatcher issues an instruction with this register as destination; a node is removed from the cool end if an instruction needs to be deissued; a node is removed from the hot end when an instruction is committed.

The special registers rA, rB, ... occupy the same array as the global registers g[32], g[33], ... . For example, rB is internally the same as g[0], because rB = 0.

```
⟨External variables 4⟩ +≡
Extern specnode g[256]; /* global registers and special registers */
Extern specnode *l; /* the ring of local registers */
Extern int lring_size; /* the number of on-chip local registers (must be a power of 2) */
Extern int max_rename_regs, max_mem_slots; /* capacity of reorder buffer */
Extern int rename_regs, mem_slots; /* currently unused capacity */

87. ⟨Header definitions 6⟩ +≡
#define ticks g[rC].o /* the internal clock */

88. ⟨Global variables 20⟩ +≡
int lring_mask; /* for calculations modulo lring_size */
```
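The input-restricted deque of section 86 is easy to experiment with on its own. The sketch below is a simplification, not the simulator's code: `node` and `value` are hypothetical stand-ins for **specnode** and **spec** with most fields omitted, and `node_remove` is simply the standard way to unlink an element of a circular doubly linked list, serving here for both deissue and commit.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
  long o;                   /* the value, meaningful once known is set */
  int known;
  struct node *up, *down;   /* for a register: coolest and hottest ends */
} node;

typedef struct { long o; node *p; } value;  /* p == NULL means o is valid */

/* A register with no speculative values points to itself. */
static void reg_init(node *r, long v)
{
  r->o = v; r->known = 1; r->up = r->down = r;
}

/* Insert t at the cool end of register r's list. */
static void node_install(node *r, node *t)
{
  t->up = r->up; t->up->down = t;
  r->up = t; t->down = r;
}

/* Unlink t, from whichever end of the list it occupies. */
static void node_remove(node *t)
{
  assert(t->up != NULL && t->down != NULL);
  t->up->down = t->down;
  t->down->up = t->up;
}

/* The coolest value of r: a number if known, else a pointer to wait on. */
static value coolest(const node *r)
{
  value res = {0, NULL};
  if (r->up->known) res.o = r->up->o;
  else res.p = r->up;
  return res;
}
```

After two instructions with the same destination have been issued, a later reader waits on the second one; deissuing it makes the reader's source revert to the first.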


**89.** The *addr* fields in the specnode lists for registers are used to identify that register in diagnostic messages. Such addresses are negative; memory addresses are positive.

All registers are initially zero except rG, which is initially 255, and rN, which has a constant value identifying the time of compilation. (The macro ABSTIME is defined externally in the file abstime.h, which should have just been created by ABSTIME; ABSTIME is a trivial program that computes the value of the standard library function  $time(\Lambda)$ . We assume that this number, which is the number of seconds in the "UNIX epoch," is less than  $2^{32}$ . Beware: Our assumption will fail in February of 2106.)

```
#define VERSION 1 /* version of the MMIX architecture that we support */
#define SUBVERSION 0 /* secondary byte of version number */
#define SUBSUBVERSION 0 /* further qualification to version number */
⟨Initialize everything 22⟩ +≡
  rename_regs = max_rename_regs;
  mem_slots = max_mem_slots;
  lring_mask = lring_size - 1;
  for (j = 0; j < 256; j++) {
    g[j].addr.h = sign_bit, g[j].addr.l = j, g[j].known = true;
    g[j].up = g[j].down = &g[j];
  }
  g[rG].o.l = 255;
  g[rN].o.h = (VERSION ≪ 24) + (SUBVERSION ≪ 16) + (SUBSUBVERSION ≪ 8);
  g[rN].o.l = ABSTIME; /* see comment and warning above */
  for (j = 0; j < lring_size; j++) {
    l[j].addr.h = sign_bit, l[j].addr.l = 256 + j, l[j].known = true;
    l[j].up = l[j].down = &l[j];
  }

90. ⟨Internal prototypes 13⟩ +≡
static void print_specnode_id ARGS((octa));

91. ⟨Subroutines 14⟩ +≡
static void print_specnode_id(a)
  octa a;
{
  if (a.h ≡ sign_bit) {
    if (a.l < 32) printf(special_name[a.l]);
    else if (a.l < 256) printf("g[%d]", a.l);
    else printf("l[%d]", a.l - 256);
  } else if (a.h ≠ (tetra) -1) {
    printf("m["); print_octa(a); printf("]");
  }
}
```

**92.** The *specval* subroutine produces a **spec** corresponding to the currently coolest value of a given local or global register.

```
⟨Internal prototypes 13⟩ +≡ static spec specval ARGS((specnode *));
```


```
93. ⟨Subroutines 14⟩ +≡
  static spec specval(r)
       specnode *r;
  { spec res;
    if (r→up→known) res.o = r→up→o, res.p = Λ;
    else res.p = r→up;
    return res;
  }
```

**94.** The *spec\_install* subroutine introduces a new speculative value at the cool end of a given doubly linked list.

```
⟨Internal prototypes 13⟩ +≡
  static void spec_install ARGS((specnode *, specnode *));

95. ⟨Subroutines 14⟩ +≡
  static void spec_install(r, t)   /* insert t into list r */
       specnode *r, *t;
  {
    t→up = r→up;
    t→up→down = t;
    r→up = t;
    t→down = r;
    t→addr = r→addr;
  }
```

**96.** Conversely, *spec\_rem* takes such a value out.

```
⟨Internal prototypes 13⟩ +≡
  static void spec_rem ARGS((specnode *));

97. ⟨Subroutines 14⟩ +≡
  static void spec_rem(t)   /* remove t from its list */
       specnode *t;
  { register specnode *u = t→up, *d = t→down;
    u→down = d; d→up = u;
  }
```
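
The three subroutines above interact in a way that is easy to check on a toy list. Here is a minimal stand-alone sketch, assuming a simplified `specnode` whose value is a plain `long` and which has no *addr* field; `list_init` is a hypothetical helper that mimics the initialization in section 89.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct specnode_struct {
  long o;                                 /* the value (an octa in the simulator) */
  bool known;                             /* has the value been computed? */
  struct specnode_struct *up, *down;      /* cooler and hotter neighbors */
} specnode;

typedef struct { long o; specnode *p; } spec;

/* An empty list: the header points to itself and holds a known value. */
static void list_init(specnode *r) { r->up = r->down = r; r->known = true; r->o = 0; }

/* Insert t at the cool end of list r. */
static void spec_install(specnode *r, specnode *t)
{
  assert(r != NULL && t != NULL);
  t->up = r->up; t->up->down = t;
  r->up = t; t->down = r;
}

/* Remove t from whatever list it is in. */
static void spec_rem(specnode *t)
{
  specnode *u = t->up, *d = t->down;
  u->down = d; d->up = u;
}

/* The coolest value: either a known number or a pointer to its producer. */
static spec specval(specnode *r)
{
  spec res = {0, NULL};
  if (r->up->known) res.o = r->up->o;
  else res.p = r->up;
  return res;
}
```

When the list is empty, `r->up` is the header itself, so *specval* returns the committed value without any special case.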

98. Some special registers are so central to MMIX's operation, they are carried along with each control block in the reorder buffer instead of being treated as source and destination registers of each instruction. For example, the register stack pointers rO and rS are treated in this way. The normal specnodes for rO and rS, namely g[rO] and g[rS], are not actually used; the cool values are called  $cool\_O$  and  $cool\_S$ . (Actually  $cool\_O$  and  $cool\_S$  correspond to the register values divided by 8, since rO and rS are always multiples of 8.)

The arithmetic status register, rA, is also treated specially. Its event bits are kept up to date only at the "hot" end, by accumulating values of  $arith\_exc$ ; an instruction to GET the value of rA will be executed only in the hot seat. The other bits of rA, which are needed to control trip handlers and floating point rounding, are treated in the normal way.

```
⟨External variables 4⟩ +≡

Extern octa cool_O, cool_S; /* values of rO, rS before the cool instruction */
```


```
99. ⟨Global variables 20⟩ +≡
  int cool_L, cool_G;   /* values of rL and rG before the cool instruction */
  unsigned int cool_hist, peek_hist;   /* history bits for branch prediction */
  octa new_O, new_S;   /* values of rO, rS after cool */

100. ⟨Install default fields in the cool block 100⟩ ≡
  cool→op = op; cool→i = i;
  cool→xx = (head→inst ≫ 16) & #ff; cool→yy = (head→inst ≫ 8) & #ff; cool→zz = (head→inst) & #ff;
  cool→loc = head→loc;
  cool→y = cool→z = cool→b = cool→ra = zero_spec;
  cool→x.o = cool→a.o = cool→rl.o = zero_octa;
  cool→x.known = false;
  cool→x.up = Λ;
  cool→a.known = false;
  cool→a.up = Λ;
  cool→rl.known = true;
  cool→rl.up = Λ;
  cool→need_b = cool→need_ra = cool→ren_x = cool→mem_x = cool→ren_a = cool→set_l = false;
  cool→arith_exc = cool→denin = cool→denout = 0;
  if ((head→loc.h & sign_bit) ∧ ¬(g[rU].o.h & #8000)) cool→usage = false;
  else cool→usage = ((op & (g[rU].o.h ≫ 16)) ≡ g[rU].o.h ≫ 24 ? true : false);
  new_O = cool→cur_O = cool_O; new_S = cool→cur_S = cool_S;
  cool→interrupt = head→interrupt;
  cool→hist = peek_hist;
  cool→go.o = incr(cool→loc, 4);
  cool→go.known = false, cool→go.addr.h = -1, cool→go.up = (specnode *) cool;
  cool→interim = false;
This code is used in section 75.

101. ⟨Dispatch an instruction to the cool block if possible, otherwise goto stall 101⟩ ≡
  if (new_cool ≡ hot) goto stall;   /* reorder buffer is full */
  ⟨Make sure cool_L and cool_G are up to date 102⟩;
  ⟨Install the operand fields of the cool block 103⟩;
  if (f & X_is_dest_bit) ⟨Install register X as the destination, or insert an internal command and goto dispatch_done if X is marginal 110⟩;
  switch (i) {
    ⟨Special cases of instruction dispatch 117⟩
  default: break;
  }
dispatch_done:
This code is used in section 75.
```

**102.** The UNSAVE operation begins by loading register rG from memory. We don't really need to know the value of rG until twelve other registers have been unsaved, so we aren't fussy about it here.

```
⟨Make sure cool_L and cool_G are up to date 102⟩ ≡
  if (¬g[rL].up→known) goto stall;
  cool_L = g[rL].up→o.l;
  if (¬g[rG].up→known ∧ ¬(op ≡ UNSAVE ∧ cool→xx ≡ 1)) goto stall;
  cool_G = g[rG].up→o.l;
This code is used in section 101.
```
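
The first lines of section 100 split the instruction tetrabyte into its X, Y, and Z bytes and decide whether the instruction should be counted by the usage register rU. The following C sketch restates both computations with plain integers. The `fields` struct and the names `split_inst` and `counts_toward_usage` are hypothetical; the parameter `rU_h` stands for the high tetra of rU, whose top byte is read here as a pattern and whose next byte is read as a mask, exactly as the shifts in section 100 suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint8_t op, xx, yy, zz; } fields;

/* Split a 32-bit instruction into opcode and the X, Y, Z bytes. */
static fields split_inst(uint32_t inst)
{
  fields f;
  f.op = (uint8_t)(inst >> 24);
  f.xx = (uint8_t)((inst >> 16) & 0xff);
  f.yy = (uint8_t)((inst >> 8) & 0xff);
  f.zz = (uint8_t)(inst & 0xff);
  return f;
}

/* Mirror of the cool->usage test: an instruction at a negative location
   is counted only if bit 0x8000 of rU's high tetra is set. */
static bool counts_toward_usage(uint8_t op, uint32_t rU_h, bool negative_loc)
{
  uint32_t mask = (rU_h >> 16) & 0xff, pattern = rU_h >> 24;
  assert(pattern <= 0xff);
  if (negative_loc && !(rU_h & 0x8000)) return false;
  return (op & mask) == pattern;
}
```

Masking with `0xff` after the shift by 16 is redundant in the simulator, because the opcode never exceeds 255; it is spelled out here only to make the two bytes visible.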


```
103. ⟨Install the operand fields of the cool block 103⟩ ≡
  if (resuming) ⟨Insert special operands when resuming an interrupted operation 324⟩
  else {
    if (f & #10) ⟨Set cool→b from register X 106⟩
    if (third_operand[op] ∧ (cool→i ≠ trap)) ⟨Set cool→b and/or cool→ra from special register 108⟩;
    if (f & #1) cool→z.o.l = cool→zz;
    else if (f & #2) ⟨Set cool→z from register Z 104⟩
    else if ((op & #f0) ≡ #e0) ⟨Set cool→z as an immediate wyde 109⟩;
    if (f & #4) cool→y.o.l = cool→yy;
    else if (f & #8) ⟨Set cool→y from register Y 105⟩
  }
This code is used in section 101.

104. ⟨Set cool→z from register Z 104⟩ ≡
  {
    if (cool→zz ≥ cool_G) cool→z = specval(&g[cool→zz]);
    else if (cool→zz < cool_L) cool→z = specval(&l[(cool_O.l + cool→zz) & lring_mask]);
  }
This code is used in section 103.

105. ⟨Set cool→y from register Y 105⟩ ≡
  {
    if (cool→yy ≥ cool_G) cool→y = specval(&g[cool→yy]);
    else if (cool→yy < cool_L) cool→y = specval(&l[(cool_O.l + cool→yy) & lring_mask]);
  }
This code is used in section 103.

106. ⟨Set cool→b from register X 106⟩ ≡
  {
    if (cool→xx ≥ cool_G) cool→b = specval(&g[cool→xx]);
    else if (cool→xx < cool_L) cool→b = specval(&l[(cool_O.l + cool→xx) & lring_mask]);
    if (f & rel_addr_bit) cool→need_b = true;   /* br, pbr */
  }
This code is used in section 103.
```
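
Sections 104 through 106 all follow one pattern: a register number at or above *cool\_G* names a global register, one below *cool\_L* names a local register in the ring, and anything in between is marginal, so the zero operand installed by default in section 100 is left alone. A minimal C sketch of that decision follows; the enumeration and the names `classify_reg` and `ring_slot` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { GLOBAL, LOCAL, MARGINAL } reg_class;

/* Decide how register number r is interpreted, given the current rL and rG. */
static reg_class classify_reg(int r, int L, int G)
{
  if (r >= G) return GLOBAL;
  if (r < L) return LOCAL;
  return MARGINAL;                        /* reads as zero */
}

/* Position of local register r in a ring of lring_size entries,
   where O is the register stack offset (rO divided by 8). */
static uint32_t ring_slot(uint32_t O, uint32_t r, uint32_t lring_size)
{
  uint32_t mask = lring_size - 1;         /* same role as lring_mask */
  assert(lring_size != 0 && (lring_size & mask) == 0);   /* power of 2 */
  return (O + r) & mask;
}
```

The mask works only because the ring size is a power of 2, which is why the simulator can precompute *lring\_mask* once during initialization.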


107. If an operation requires a special register as third operand, that register is listed in the *third\_operand* table.

```
⟨Global variables 20⟩ +≡
  unsigned char third_operand[256] = {
    0, rA, 0, 0, rA, rA, rA, rA,   /* TRAP, ... */
    rA, rA, rA, rA, rA, rA, rA, rA,   /* FLOT, ... */
    rA, rE, rE, rE, rA, rA, rA, rA,   /* FMUL, ... */
    rA, rA, 0, 0, rA, rA, rD, rD,   /* MUL, ... */
    rA, rA, 0, 0, rA, rA, 0, 0,   /* ADD, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* 2ADDU, ... */
    0, 0, 0, 0, rA, rA, 0, 0,   /* CMP, ... */
    rA, rA, 0, 0, 0, 0, 0, 0,   /* SL, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* BN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* BNN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* PBN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* PBNN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* CSN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* CSNN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* ZSN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* ZSNN, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* LDB, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* LDT, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* LDSF, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* LDVTS, ... */
    rA, rA, 0, 0, rA, rA, 0, 0,   /* STB, ... */
    rA, rA, 0, 0, 0, 0, 0, 0,   /* STT, ... */
    rA, rA, 0, 0, 0, 0, 0, 0,   /* STSF, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* SYNCD, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* OR, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* AND, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* BDIF, ... */
    rM, rM, 0, 0, 0, 0, 0, 0,   /* MUX, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* SETH, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* ORH, ... */
    0, 0, 0, 0, 0, 0, 0, 0,   /* JMP, ... */
    rJ, 0, 0, 0, 0, 0, 0, 255};   /* POP, ... */
```

**108.** The $cool \rightarrow b$ field is busy in operations like STB or STSF, which need rA. So we use $cool \rightarrow ra$ instead, when rA is needed.

```
⟨Set cool→b and/or cool→ra from special register 108⟩ ≡
  {
    if (third_operand[op] ≡ rA ∨ third_operand[op] ≡ rE)
      cool→need_ra = true, cool→ra = specval(&g[rA]);
    if (third_operand[op] ≠ rA)
      cool→need_b = true, cool→b = specval(&g[third_operand[op]]);
  }
This code is used in section 103.
```


```
109. ⟨Set cool→z as an immediate wyde 109⟩ ≡
  {
    switch (op & 3) {
    case 0: cool→z.o.h = yz ≪ 16; break;
    case 1: cool→z.o.h = yz; break;
    case 2: cool→z.o.l = yz ≪ 16; break;
    case 3: cool→z.o.l = yz; break;
    }
    if (i ≠ set) {   /* register X should also be the Y operand */
      cool→y = cool→b;
      cool→b = zero_spec;
    }
  }
This code is used in section 103.

110. ⟨Install register X as the destination, or insert an internal command and goto dispatch_done if X is marginal 110⟩ ≡
  if (cool→xx ≥ cool_G) {
    if (i ≠ pushgo ∧ i ≠ pushj) cool→ren_x = true, spec_install(&g[cool→xx], &cool→x);
  } else if (cool→xx < cool_L)
    cool→ren_x = true, spec_install(&l[(cool_O.l + cool→xx) & lring_mask], &cool→x);
  else {   /* we need to increase L before issuing head→inst */
  increase_L: if (((cool_S.l - cool_O.l - cool_L - 1) & lring_mask) ≡ 0)
      ⟨Insert an instruction to advance gamma 113⟩
    else ⟨Insert an instruction to advance beta and L 112⟩;
  }
This code is used in section 101.

111. ⟨Check for sufficient rename registers and memory slots, or goto stall 111⟩ ≡
  if (rename_regs < cool→ren_x + cool→ren_a) goto stall;
  if (cool→mem_x)
    if (mem_slots) mem_slots--; else goto stall;
  rename_regs -= cool→ren_x + cool→ren_a;
This code is used in section 75.
```
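
The switch in section 109 places the 16-bit immediate field YZ into one of the four wyde positions of an octabyte, selected by the two low bits of the opcode. A short C sketch with a single 64-bit integer in place of the simulator's two-tetra `octa` makes the positions explicit; the function name `wyde_operand` is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the Z operand for a wyde-immediate instruction (opcodes #e0-#ef). */
static uint64_t wyde_operand(uint8_t op, uint16_t yz)
{
  assert((op & 0xf0) == 0xe0);
  switch (op & 3) {
  case 0: return (uint64_t)yz << 48;      /* z.o.h = yz << 16: high wyde */
  case 1: return (uint64_t)yz << 32;      /* z.o.h = yz: medium high wyde */
  case 2: return (uint64_t)yz << 16;      /* z.o.l = yz << 16: medium low wyde */
  default: return yz;                     /* z.o.l = yz: low wyde */
  }
}
```

Only the two low opcode bits matter for placement, so the same computation serves every group of four opcodes in the range.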


112. The *incrl* instruction advances  $\beta$  and rL by 1 at a time when we know that  $\beta \neq \gamma$ , in the ring of local registers.

```
⟨Insert an instruction to advance beta and L 112⟩ ≡
  {
    cool→i = incrl;
    spec_install(&l[(cool_O.l + cool_L) & lring_mask], &cool→x);
    cool→need_b = cool→need_ra = false;
    cool→y = cool→z = zero_spec;
    cool→x.known = true;   /* cool→x.o = zero_octa */
    spec_install(&g[rL], &cool→rl);
    cool→rl.o.l = cool_L + 1;
    cool→ren_x = cool→set_l = true;
    op = SETH;   /* this instruction to be handled by the simplest units */
    cool→interim = true;
    goto dispatch_done;
  }
This code is used in section 110.
```

**113.** The *incgamma* instruction advances $\gamma$ and rS by storing an octabyte from the local register ring to virtual memory location $cool\_S \ll 3$.

```
⟨Insert an instruction to advance gamma 113⟩ ≡
  {
    cool→need_b = cool→need_ra = false;
    cool→i = incgamma;
    new_S = incr(cool_S, 1);
    cool→b = specval(&l[cool_S.l & lring_mask]);
    cool→y.p = Λ, cool→y.o = shift_left(cool_S, 3);
    cool→z = zero_spec;
    cool→mem_x = true, spec_install(&mem, &cool→x);
    op = STOU;   /* this instruction needs to be handled by load/store unit */
    cool→interim = true;
    goto dispatch_done;
  }
This code is used in sections 110, 119, and 337.
```

**114.** The *decgamma* instruction decreases $\gamma$ and rS by loading an octabyte from virtual memory location $(cool\_S - 1) \ll 3$ into the local register ring.

```
⟨Insert an instruction to decrease gamma 114⟩ ≡
  {
    cool→i = decgamma;
    new_S = incr(cool_S, -1);
    cool→z = cool→b = zero_spec;
    cool→need_b = false;
    cool→y.p = Λ, cool→y.o = shift_left(new_S, 3);
    cool→ren_x = true, spec_install(&l[new_S.l & lring_mask], &cool→x);
    op = LDOU;   /* this instruction needs to be handled by load/store unit */
    cool→interim = true;
    cool→ptr_a = (void *) mem.up;
    goto dispatch_done;
  }
This code is used in section 120.
```
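
The choice between *incrl* and *incgamma* in section 110 rests on one modular test: if S is congruent to O + L + 1 modulo the ring size, the slot that L would grow into is still occupied by a register that has not been stored, so $\gamma$ must advance first. The sketch below isolates that test with plain unsigned integers; the function name `must_advance_gamma` is hypothetical, and the arguments stand for *cool\_S.l*, *cool\_O.l*, *cool\_L*, and *lring\_size*.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when L cannot be increased until a local register is stored to memory. */
static bool must_advance_gamma(uint32_t S, uint32_t O, uint32_t L, uint32_t lring_size)
{
  uint32_t mask = lring_size - 1;
  assert(lring_size != 0 && (lring_size & mask) == 0);   /* power of 2 */
  return ((S - O - L - 1) & mask) == 0;   /* unsigned wraparound is intended */
}
```

Unsigned subtraction wraps modulo 2^32, and 2^32 is a multiple of every power-of-2 ring size, so the masked result is correct even when S is numerically smaller than O + L + 1.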


115. Storing into memory requires a doubly linked data list of specnodes like the lists we use for local and global registers. In this case the head of the list is called mem, and the addr fields are physical addresses in memory.

```
⟨External variables 4⟩ +≡
  Extern specnode mem;
```

116. The addr field of a memory specnode is all 1s until the physical address has been computed.

```
\langle \text{ Initialize everything } 22 \rangle + \equiv mem.addr.h = mem.addr.l = -1; mem.up = mem.down = \&mem;
```

117. The CSWAP operation is treated as a partial store, with X as a secondary output. Partial store (pst) commands read an octabyte from memory before they write it.

```
⟨Special cases of instruction dispatch 117⟩ ≡
case cswap: cool→ren_a = true;
  spec_install(cool→xx ≥ cool_G ? &g[cool→xx] : &l[(cool_O.l + cool→xx) & lring_mask], &cool→a);
  cool→i = pst;
case st: if ((op & #fe) ≡ STCO) cool→b.o.l = cool→xx;
case pst: cool→mem_x = true, spec_install(&mem, &cool→x); break;
case ld: case ldunc: cool→ptr_a = (void *) mem.up; break;
See also sections 118, 119, 120, 121, 122, 227, 312, 322, 332, 337, 347, and 355.
This code is used in section 101.
```

118. When new data is PUT into special registers 15–20 (namely rK, rQ, rU, rV, rG, or rL) it can affect many things. Therefore we stop issuing further instructions until such PUTs are committed. Moreover, we will see later that such drastic PUTs defer execution until they reach the hot seat.

```
⟨Special cases of instruction dispatch 117⟩ +≡
case put: if (cool→yy ≠ 0 ∨ cool→xx ≥ 32) goto illegal_inst;
  if (cool→xx ≥ 8) {
    if (cool→xx ≤ 11) goto illegal_inst;
    if (cool→xx ≤ 18 ∧ ¬(cool→loc.h & sign_bit)) goto privileged_inst;
  }
  if (cool→xx ≥ 15 ∧ cool→xx ≤ 20) freeze_dispatch = true;
  cool→ren_x = true, spec_install(&g[cool→xx], &cool→x); break;
case get: if (cool→yy ∨ cool→zz ≥ 32) goto illegal_inst;
  if (cool→zz ≡ rO) cool→z.o = shift_left(cool_O, 3);
  else if (cool→zz ≡ rS) cool→z.o = shift_left(cool_S, 3);
  else cool→z = specval(&g[cool→zz]); break;
illegal_inst: cool→interrupt |= B_BIT; goto noop_inst;
case ldvts: if (cool→loc.h & sign_bit) break;
privileged_inst: cool→interrupt |= K_BIT;
noop_inst: cool→i = noop; break;
```


119. A PUSHGO instruction with  $X \ge G$  causes L to increase momentarily by 1, even if L = G. But the value of L will be decreased before the PUSHGO is complete, so it will never actually exceed G. Moreover, we needn't insert an *incrl* command.

```
⟨Special cases of instruction dispatch 117⟩ +≡
case pushgo: inst_ptr.p = &cool→go;
case pushj:
  { register int x = cool→xx;
    if (x ≥ cool_G) {
      if (((cool_S.l - cool_O.l - cool_L - 1) & lring_mask) ≡ 0)
        ⟨Insert an instruction to advance gamma 113⟩
      x = cool_L; cool_L++;
      cool→ren_x = true, spec_install(&l[(cool_O.l + x) & lring_mask], &cool→x);
    }
    cool→x.known = true, cool→x.o.h = 0, cool→x.o.l = x;
    cool→ren_a = true, spec_install(&g[rJ], &cool→a);
    cool→a.known = true, cool→a.o = incr(cool→loc, 4);
    cool→set_l = true, spec_install(&g[rL], &cool→rl);
    cool→rl.o.l = cool_L - x - 1;
    new_O = incr(cool_O, x + 1);
  } break;
case syncid:
  if (cool→loc.h & sign_bit) break;
case go: inst_ptr.p = &cool→go; break;
```

120. We need to know the topmost "hidden" element of the register stack when a POP instruction is dispatched. This element is usually present in the local register ring, unless  $\gamma = \alpha$ .

Once it is known, let x be its least significant byte. We will be decreasing rO by x+1, so we may have to decrease  $\gamma$  repeatedly in order to maintain the condition rS  $\leq$  rO.

```
⟨Special cases of instruction dispatch 117⟩ +≡
case pop: if (cool→xx ∧ cool_L ≥ cool→xx)
    cool→y = specval(&l[(cool_O.l + cool→xx - 1) & lring_mask]);
pop_unsave: if (cool_S.l ≡ cool_O.l) ⟨Insert an instruction to decrease gamma 114⟩;
  { register tetra x;
    register int new_L;
    register specnode *p = l[(cool_O.l - 1) & lring_mask].up;
    if (p→known) x = (p→o.l) & #ff; else goto stall;
    if ((tetra)(cool_O.l - cool_S.l) ≤ x) ⟨Insert an instruction to decrease gamma 114⟩;
    new_O = incr(cool_O, -x - 1);
    if (cool→i ≡ pop) new_L = x + (cool→xx ≤ cool_L ? cool→xx : cool_L + 1);
    else new_L = x;
    if (new_L > cool_G) new_L = cool_G;
    if (x < new_L) cool→ren_x = true, spec_install(&l[(cool_O.l - 1) & lring_mask], &cool→x);
    cool→set_l = true, spec_install(&g[rL], &cool→rl);
    cool→rl.o.l = new_L;
    if (cool→i ≡ pop) {
      cool→z.o.l = yz ≪ 2;
      if (inst_ptr.p ≡ UNKNOWN_SPEC ∧ new_head ≡ tail) inst_ptr.p = &cool→go;
    }
    break;
  }
```
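
The arithmetic of section 120 can be checked by hand with small numbers. In the sketch below, `x` is the low byte of the topmost hidden element, `xx` is the X field of the POP instruction, and `L` and `G` are the current values of rL and rG; the function names `pop_new_L` and `needs_decgamma` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* New value of rL after POP, as computed at dispatch time. */
static int pop_new_L(int x, int xx, int L, int G)
{
  assert(x >= 0 && x < 256);
  int new_L = x + (xx <= L ? xx : L + 1);
  if (new_L > G) new_L = G;               /* L never exceeds G */
  return new_L;
}

/* True when gamma must be decreased before rO can drop by x + 1,
   so that the condition rS <= rO is maintained. */
static bool needs_decgamma(uint32_t O, uint32_t S, uint32_t x)
{
  return (uint32_t)(O - S) <= x;
}
```

With `x = 2`, `xx = 1`, and `L = 3` the new value of rL is 3; a POP whose X field exceeds L contributes `L + 1` registers instead.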


```
121. ⟨Special cases of instruction dispatch 117⟩ +≡
case mulu: cool→ren_a = true, spec_install(&g[rH], &cool→a); break;
case div: case divu: cool→ren_a = true, spec_install(&g[rR], &cool→a); break;
```

122. It's tempting to say that we could avoid taking up space in the reorder buffer when no operation needs to be done. A JMP instruction qualifies as a no-op in this sense, because the change of control occurs before the execution stage. However, even a no-op might have to be counted in the usage register rU, so it might get into the execution stage for that reason. A no-op can also cause a protection interrupt, if it appears in a negative location. Even more importantly, a program might get into a loop that consists entirely of jumps and no-ops; then we wouldn't be able to interrupt it, because the interruption mechanism needs to find the current location in the reorder buffer! At least one functional unit therefore needs to provide explicit support for JMP, JMPB, and SWYM.

The SWYM instruction with F\_BIT set is a special case: This is a request from the fetch coroutine for an update to the IT-cache, when the page table method isn't implemented in hardware.

```
⟨Special cases of instruction dispatch 117⟩ +≡
case noop: if (cool→interrupt & F_BIT) {
    cool→go.o = cool→y.o = cool→loc;
    inst_ptr = specval(&g[rT]);
  }
  break;

123. ⟨Undo data structures set prematurely in the cool block and break 123⟩ ≡
  if (cool→ren_x ∨ cool→mem_x) spec_rem(&cool→x);
  if (cool→ren_a) spec_rem(&cool→a);
  if (cool→set_l) spec_rem(&cool→rl);
  if (inst_ptr.p ≡ &cool→go) inst_ptr.p = UNKNOWN_SPEC;
  break;
```

This code is used in section 75.


**124.** The execution stages. MMIX's raison d'être is its ability to execute instructions. So now we want to simulate the behavior of its functional units.

Each coroutine scheduled for action at the current tick of the clock has a stage number corresponding to a particular subset of the MMIX hardware. For example, the coroutines with stage = 2 are the second stages in the pipelines of the functional units. A coroutine with stage = 0 works in the fetch unit. Several artificially large stage numbers are used to control special coroutines that do things like write data from buffers into memory.

In this program the current coroutine of interest is called *self*; hence $self \rightarrow stage$ is the current stage number of interest. Another key variable, $self \rightarrow ctl$, is called *data*; this is the control block being operated on by the current coroutine. We typically are simulating an operation in which $data \rightarrow x$ is being computed as a function of $data \rightarrow y$ and $data \rightarrow z$. The *data* record has many fields, as described earlier when we defined **control** structures; for example, $data \rightarrow owner$ is the same as *self*, during the execution stage, if it is nonnull.

This part of the simulator is written as if each functional unit is able to handle all 256 operations. In practice, of course, a functional unit tends to be much more specialized; the actual specialization is governed by the dispatcher, which issues an instruction only to a functional unit that supports it. Once an instruction has been dispatched, however, we can simulate it most easily if we imagine that its functional unit is universal.

Coroutines with higher stage numbers are processed first. The three most important variables that govern a coroutine's behavior, once $self \rightarrow stage$ is given, are the external operation code $data \rightarrow op$, the internal operation code $data \rightarrow i$, and the value of $data \rightarrow state$. We typically have $data \rightarrow state = 0$ when a coroutine is first fired up.

```
⟨Local variables 12⟩ +≡
register coroutine *self; /* the current coroutine being executed */
register control *data; /* the control block of the current coroutine */
```

**125.** When a coroutine has done all it wants to on a single cycle, it says **goto** *done*. It will not be scheduled to do any further work unless the *schedule* routine has been called since it began execution. The *wait* macro is a convenient way to say "Please schedule me to resume again at the current $data \rightarrow state$" after a specified time; for example, *wait*(1) will restart a coroutine on the next clock tick.

```
#define wait(t) { schedule(self, t, data→state); goto done; }
#define pass_after(t) schedule(self + 1, t, data→state)
#define sleep { self→next = self; goto done; }   /* wait forever */
#define awaken(c, t) schedule(c, t, c→ctl→state)

⟨Execute all coroutines scheduled for the current time 125⟩ ≡
  cur_time++; if (cur_time ≡ ring_size) cur_time = 0;
  for (self = queuelist(cur_time); self ≠ &sentinel; self = sentinel.next) {
    sentinel.next = self→next; self→next = Λ;   /* unschedule this coroutine */
    data = self→ctl;
    if (verbose & coroutine_bit) {
      printf(" running "); print_coroutine_id(self); printf(" ");
      print_control_block(data); printf("\n");
    }
    switch (self→stage) {
    case 0: ⟨Simulate an action of the fetch coroutine 288⟩;
    case 1: ⟨Simulate the first stage of an execution pipeline 130⟩;
    default: ⟨Simulate later stages of an execution pipeline 135⟩;
      ⟨Cases for control of special coroutines 126⟩;
    }
  terminate: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  done: ;
  }
This code is used in section 64.
```


**126.** A special coroutine whose stage number is *vanish* simply goes away at its scheduled time.

```
⟨Cases for control of special coroutines 126⟩ ≡
case vanish: goto terminate;
See also sections 215, 217, 222, 224, 232, 237, and 257.
This code is used in section 125.

127. ⟨Global variables 20⟩ +≡
  coroutine mem_locker;   /* trivial coroutine that vanishes */
  coroutine Dlocker;   /* another */
  control vanish_ctl;   /* such coroutines share a common control block */

128. ⟨Initialize everything 22⟩ +≡
  mem_locker.name = "Locker";
  mem_locker.ctl = &vanish_ctl;
  mem_locker.stage = vanish;
  Dlocker.name = "Dlocker";
  Dlocker.ctl = &vanish_ctl;
  Dlocker.stage = vanish;
  vanish_ctl.go.o.l = 4;
  for (j = 0; j < DTcache→ports; j++) DTcache→reader[j].ctl = &vanish_ctl;
  if (Dcache)
    for (j = 0; j < Dcache→ports; j++) Dcache→reader[j].ctl = &vanish_ctl;
  for (j = 0; j < ITcache→ports; j++) ITcache→reader[j].ctl = &vanish_ctl;
  if (Icache)
    for (j = 0; j < Icache→ports; j++) Icache→reader[j].ctl = &vanish_ctl;
```

**129.** Here is a list of the stage numbers for special coroutines to be defined below.

```
⟨Header definitions 6⟩ +≡
#define max_stage 99   /* exceeds all stage numbers */
#define vanish 98   /* special coroutine that just goes away */
#define flush_to_mem 97   /* coroutine for flushing from a cache to memory */
#define flush_to_S 96   /* coroutine for flushing from a cache to the S-cache */
#define fill_from_mem 95   /* coroutine for filling a cache from memory */
#define fill_from_S 94   /* coroutine for filling a cache from the S-cache */
#define fill_from_virt 93   /* coroutine for filling a translation cache */
#define write_from_wbuf 92   /* coroutine for emptying the write buffer */
#define cleanup 91   /* coroutine for cleaning the caches */
```

**130.** At the very beginning of stage 1, a functional unit will stall if necessary until its operands are available. As soon as the operands are all present, the state is set nonzero and execution proper begins.

```
⟨Simulate the first stage of an execution pipeline 130⟩ ≡
switch1: switch (data→state) {
  case 0: ⟨Wait for input data if necessary; set state = 1 if it's there 131⟩;
  case 1: ⟨Begin execution of an operation 132⟩;
  case 2: ⟨Pass data to the next stage of the pipeline 134⟩;
  case 3: ⟨Finish execution of an operation 144⟩;
    ⟨Special cases for states in the first stage 266⟩;
  }
This code is used in section 125.
```


131. If some of our input data has been computed by another coroutine on the current cycle, we grab it now but wait for the next cycle. (An actual machine wouldn't have latched the data until then.)

```
⟨Wait for input data if necessary; set state = 1 if it's there 131⟩ ≡
  j = 0;
  if (data→y.p) {
    j++;
    if (data→y.p→known) data→y.o = data→y.p→o, data→y.p = Λ;
    else j += 10;
  }
  if (data→z.p) {
    j++;
    if (data→z.p→known) data→z.o = data→z.p→o, data→z.p = Λ;
    else j += 10;
  }
  if (data→b.p) {
    if (data→need_b) j++;
    if (data→b.p→known) data→b.o = data→b.p→o, data→b.p = Λ;
    else if (data→need_b) j += 10;
  }
  if (data→ra.p) {
    if (data→need_ra) j++;
    if (data→ra.p→known) data→ra.o = data→ra.p→o, data→ra.p = Λ;
    else if (data→need_ra) j += 10;
  }
  if (j < 10) data→state = 1;
  if (j) wait(1);   /* otherwise we fall through to case 1 */
This code is used in section 130.
```
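
The counter *j* in section 131 encodes two facts at once: its units digit counts the operands that had to be fetched from another instruction, and any addition of 10 signals that at least one required operand is still unknown. A stand-alone C sketch of that bookkeeping follows. The `operand` struct, the `verdict` enumeration, and the name `poll_operands` are hypothetical simplifications; in the simulator the corresponding information lives in the **spec** fields of the control block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  bool pending;         /* operand still points to a producing instruction */
  bool producer_known;  /* that producer has computed its value */
  bool needed;          /* always true for y and z; need_b or need_ra otherwise */
} operand;

typedef enum { START_NOW, START_NEXT_CYCLE, STILL_WAITING } verdict;

/* Grab every operand whose producer is finished and report what to do. */
static verdict poll_operands(operand *ops, size_t n)
{
  int j = 0;
  assert(ops != NULL && n <= 4);          /* at most y, z, b, and ra */
  for (size_t k = 0; k < n; k++) {
    if (!ops[k].pending) continue;
    if (ops[k].needed) j++;
    if (ops[k].producer_known) ops[k].pending = false;   /* latch the value */
    else if (ops[k].needed) j += 10;
  }
  if (j == 0) return START_NOW;           /* fall through to state 1 */
  return j < 10 ? START_NEXT_CYCLE : STILL_WAITING;
}
```

Since at most four operands are counted, the units part can never reach 10, which is what makes the single comparison `j < 10` a safe test for "nothing is missing."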

**132.** Simple register-to-register instructions like ADD are assumed to take just one cycle, but others like FADD almost certainly require more time. This simulator can be configured so that FADD might take, say, four pipeline stages of one cycle each (1+1+1+1), or two pipeline stages of two cycles each (2+2), or a single unpipelined stage lasting four cycles (4), etc. In any case the simulator computes the results now, for simplicity, placing them in $data \rightarrow x$ and possibly also in $data \rightarrow a$ and/or $data \rightarrow interrupt$. The results will not be officially made known until the proper time.

```
⟨Begin execution of an operation 132⟩ ≡
  switch (data→i) {
    ⟨Cases to compute the results of register-to-register operation 137⟩;
    ⟨Cases to compute the virtual address of a memory operation 265⟩;
    ⟨Cases for stage 1 execution 155⟩;
  }
  ⟨Set things up so that the results become known when they should 133⟩;
This code is used in section 130.
```

**133.** If the internal opcode $data \rightarrow i$ is $max\_pipe\_op$ or less, a special pipeline sequence like 1 + 1 + 1 + 1 or 2 + 2 or 15 + 10, etc., has been configured. Otherwise we assume that the pipeline sequence is simply 1.

Suppose the pipeline sequence is $t_1 + t_2 + \cdots + t_k$. Each $t_j$ is positive and less than 256, so we represent the sequence as a string $pipe\_seq[data \rightarrow i]$ of unsigned "characters," terminated by 0. Given such a string, we want to do the following: Wait $(t_1 - 1)$ cycles and pass data to stage 2; wait $t_2$ cycles and pass data to stage 3; ...; wait $t_{k-1}$ cycles and pass data to stage k; wait $t_k$ cycles and make the results known.

The value of denin is added to  $t_1$ ; the value of denout is added to  $t_k$ .

```
⟨ Set things up so that the results become known when they should 133 ⟩ ≡
  data->state = 3;
  if (data->i <= max_pipe_op) { register unsigned char *s = pipe_seq[data->i];
    j = s[0] + data->denin;
    if (s[1]) data->state = 2;   /* more than one stage */
    else j += data->denout;
    if (j > 1) wait(j - 1);
  }
  goto switch1;
```

This code is used in section 132.

134. When we're in stage j, the coroutine for stage j + 1 of the same functional unit is self + 1.

```
⟨ Pass data to the next stage of the pipeline 134 ⟩ ≡
pass_data: if ((self + 1)->next) wait(1);   /* stall if the next stage is occupied */
  { register unsigned char *s = pipe_seq[data->i];
    j = s[self->stage];
    if (s[self->stage + 1] == 0) j += data->denout, data->state = 3;   /* the next stage is the last */
    pass_after(j);
  }
passit: (self + 1)->ctl = data;
  data->owner = self + 1;
  goto done;
```

This code is used in section 130.

```
135. ⟨ Simulate later stages of an execution pipeline 135 ⟩ ≡
switch2: if (data->b.p && data->b.p->known) data->b.o = data->b.p->o, data->b.p = NULL;
  switch (data->state) {
  case 0: panic(confusion("switch2"));
  case 1: ⟨ Begin execution of a stage-two operation 351 ⟩;
  case 2: goto pass_data;
  case 3: goto fin_ex;
  ⟨ Special cases for states in later stages 272 ⟩;
  }
```

This code is used in section 125.

136. The default pipeline times use only one stage; they can be overridden by *MMIX\_config*. The total number of stages supported by this simulator is limited to 90, since it must never interfere with the *stage* numbers for special coroutines defined below. (The author doesn't feel guilty about making this restriction.)

```
⟨External variables 4⟩ +≡
#define pipe_limit 90
Extern unsigned char pipe_seq[max_pipe_op + 1][pipe_limit + 1];
```
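
The timing rule of section 133 can be illustrated with a small sketch. It assumes that no stage ever stalls and that the cycle in which execution begins counts as the first of the $t_1$ cycles; the function names `total_latency` and `stage_count` are hypothetical and do not appear in the simulator.

```c
#include <stddef.h>

/* Hypothetical helper: number of cycles from the start of execution until the
   results become known, for a zero-terminated stage-time string such as
   {2, 2, 0}.  Stalls caused by an occupied next stage are ignored. */
static unsigned total_latency(const unsigned char *seq, unsigned denin, unsigned denout)
{
    unsigned cycles = 0;
    size_t k;
    if (seq[0] == 0) return 0;           /* an empty sequence has no stages */
    for (k = 0; seq[k] != 0; k++) cycles += seq[k];
    return cycles + denin + denout;      /* denin joins t_1, denout joins t_k */
}

/* Hypothetical helper: the number of stages, called k in the text. */
static size_t stage_count(const unsigned char *seq)
{
    size_t k = 0;
    while (seq[k] != 0) k++;
    return k;
}
```

Under these assumptions the three FADD configurations mentioned in section 132, namely 1+1+1+1, 2+2, and 4, all deliver their results after four cycles; they differ only in how soon a second FADD can enter the unit.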

137. The simplest of all register-to-register operations is set, which occurs for commands like SETH as well as for commands like GETA. (We might as well start with the easy cases and work our way up.)

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ ≡
case set: data->x.o = data->z.o; break;
```

See also sections 138, 139, 140, 141, 142, 143, 343, 344, 345, 346, 348, and 350.

This code is used in section 132.

138. Here are the basic boolean operations, which account for 24 of MMIX's 256 opcodes.

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case or: data->x.o.h = data->y.o.h | data->z.o.h;
  data->x.o.l = data->y.o.l | data->z.o.l; break;
case orn: data->x.o.h = data->y.o.h | ~data->z.o.h;
  data->x.o.l = data->y.o.l | ~data->z.o.l; break;
case nor: data->x.o.h = ~(data->y.o.h | data->z.o.h);
  data->x.o.l = ~(data->y.o.l | data->z.o.l); break;
case and: data->x.o.h = data->y.o.h & data->z.o.h;
  data->x.o.l = data->y.o.l & data->z.o.l; break;
case andn: data->x.o.h = data->y.o.h & ~data->z.o.h;
  data->x.o.l = data->y.o.l & ~data->z.o.l; break;
case nand: data->x.o.h = ~(data->y.o.h & data->z.o.h);
  data->x.o.l = ~(data->y.o.l & data->z.o.l); break;
case xor: data->x.o.h = data->y.o.h ^ data->z.o.h;
  data->x.o.l = data->y.o.l ^ data->z.o.l; break;
case nxor: data->x.o.h = data->y.o.h ^ ~data->z.o.h;
  data->x.o.l = data->y.o.l ^ ~data->z.o.l; break;
```

139. The implementation of ADDU is only slightly more difficult. It would be trivial except for the fact that internal opcode addu is used not only for the ADDU[I] and INC[M][H,L] operations, in which we simply want to add $data \rightarrow y.o$ to $data \rightarrow z.o$, but also for operations like 4ADDU.

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case addu: data->x.o = oplus((data->op & 0xf8) == 0x28 ?
      shift_left(data->y.o, 1 + ((data->op >> 1) & 0x3)) : data->y.o, data->z.o);
  break;
case subu: data->x.o = ominus(data->y.o, data->z.o); break;
```

140. Signed addition and subtraction produce the same results as their unsigned counterparts, but overflow must also be detected. Overflow occurs when adding y to z if and only if y and z have the same sign but their sum has a different sign. Overflow occurs in the calculation x = y - z if and only if it occurs in the calculation y = x + z.

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case add: data->x.o = oplus(data->y.o, data->z.o);
  if (((data->y.o.h ^ data->z.o.h) & sign_bit) == 0 &&
      ((data->y.o.h ^ data->x.o.h) & sign_bit) != 0)
    data->interrupt |= V_BIT;
  break;
case sub: data->x.o = ominus(data->y.o, data->z.o);
  if (((data->x.o.h ^ data->z.o.h) & sign_bit) == 0 &&
      ((data->y.o.h ^ data->x.o.h) & sign_bit) != 0)
    data->interrupt |= V_BIT;
  break;
```
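
The sign rule of section 140 is easy to try out on ordinary 64-bit integers. The following sketch is only an illustration: it uses `uint64_t` in place of the simulator's two-word octabytes, and the names `add_overflows` and `sub_overflows` are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if the signed 64-bit sum y + z overflows.  The arithmetic is
   done on unsigned values, so wraparound is well defined in C. */
static int add_overflows(uint64_t y, uint64_t z)
{
    const uint64_t sign = UINT64_C(1) << 63;
    uint64_t x = y + z;
    /* same sign going in, different sign coming out */
    return ((y ^ z) & sign) == 0 && ((y ^ x) & sign) != 0;
}

/* x = y - z overflows exactly when y = x + z does. */
static int sub_overflows(uint64_t y, uint64_t z)
{
    const uint64_t sign = UINT64_C(1) << 63;
    uint64_t x = y - z;
    return ((x ^ z) & sign) == 0 && ((y ^ x) & sign) != 0;
}
```

For example, adding 1 to the largest positive value overflows, while adding 1 to the bit pattern of -1 does not.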

141. The shift commands might take more than one cycle, or they might even be pipelined, if the default value of  $pipe\_seq[sh]$  is changed. But we compute shifts all at once here, because other parts of the simulator will take care of the pipeline timing. (Notice that shlu is changed to sh, for this reason. Similar changes to the internal op codes are made for other operators below.)

```
#define shift_amt (data->z.o.h || data->z.o.l >= 64 ? 64 : data->z.o.l)
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case shlu: data->x.o = shift_left(data->y.o, shift_amt); data->i = sh; break;
case shl: data->x.o = shift_left(data->y.o, shift_amt); data->i = sh;
  { octa tmpo;
    tmpo = shift_right(data->x.o, shift_amt, 0);
    if (tmpo.h != data->y.o.h || tmpo.l != data->y.o.l) data->interrupt |= V_BIT;
  }
  break;
case shru: data->x.o = shift_right(data->y.o, shift_amt, 1); data->i = sh; break;
case shr: data->x.o = shift_right(data->y.o, shift_amt, 0); data->i = sh; break;
```

142. The MUX operation has three operands, namely $data \rightarrow y$, $data \rightarrow z$, and $data \rightarrow b$; the third operand is the current (speculative) value of rM, the special mask register. Otherwise MUX is unexceptional.

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case mux: data->x.o.h = (data->y.o.h & data->b.o.h) + (data->z.o.h & ~data->b.o.h);
  data->x.o.l = (data->y.o.l & data->b.o.l) + (data->z.o.l & ~data->b.o.l);
  break;
```
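
A one-word version of the same computation may help to show what MUX does. This is a sketch on `uint64_t` values rather than on the simulator's octabytes, and the function name `mux` is chosen here only for illustration.

```c
#include <stdint.h>

/* Bitwise multiplex: take bits of y where mask is 1, bits of z where it is 0. */
static uint64_t mux(uint64_t y, uint64_t z, uint64_t mask)
{
    /* the two terms have no 1 bits in common, so + never carries and acts like | */
    return (y & mask) + (z & ~mask);
}
```

With mask 0xFF00, for instance, the result takes its second byte from y and everything else from z.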

**143.** Comparisons are a breeze.

```
⟨ Cases to compute the results of register-to-register operation 137 ⟩ +≡
case cmp: if ((data->y.o.h & sign_bit) > (data->z.o.h & sign_bit)) goto cmp_neg;
  if ((data->y.o.h & sign_bit) < (data->z.o.h & sign_bit)) goto cmp_pos;
case cmpu: if (data->y.o.h < data->z.o.h) goto cmp_neg;
  if (data->y.o.h > data->z.o.h) goto cmp_pos;
  if (data->y.o.l < data->z.o.l) goto cmp_neg;
  if (data->y.o.l > data->z.o.l) goto cmp_pos;
cmp_zero: break;   /* data->x.o.h is zero */
cmp_pos: data->x.o.l = 1; break;   /* data->x.o.h is zero */
cmp_neg: data->x.o = neg_one; break;
```
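
The idea behind the comparison code is that a signed comparison needs only one extra look at the sign bits; after that, the unsigned comparison gives the right answer. Here is a self-contained sketch of that idea. The type `octa_sketch` is a hypothetical stand-in for the simulator's two-word octabyte, and the function names are assumptions made for this example.

```c
#include <stdint.h>

typedef struct { uint32_t h, l; } octa_sketch;   /* high and low 32-bit halves */

static octa_sketch from_u64(uint64_t v)
{
    octa_sketch o;
    o.h = (uint32_t)(v >> 32);
    o.l = (uint32_t)(v & 0xffffffffu);
    return o;
}

/* Unsigned comparison: returns -1, 0, or +1. */
static int cmp_unsigned(octa_sketch y, octa_sketch z)
{
    if (y.h < z.h) return -1;
    if (y.h > z.h) return 1;
    if (y.l < z.l) return -1;
    if (y.l > z.l) return 1;
    return 0;
}

/* Signed comparison: settle the case of differing signs first. */
static int cmp_signed(octa_sketch y, octa_sketch z)
{
    const uint32_t sign_bit = 0x80000000u;
    if ((y.h & sign_bit) > (z.h & sign_bit)) return -1;   /* y < 0 <= z */
    if ((y.h & sign_bit) < (z.h & sign_bit)) return 1;    /* z < 0 <= y */
    return cmp_unsigned(y, z);   /* equal signs: two's complement order agrees */
}
```

Thus the bit pattern of -1 compares below 0 as a signed number but above it as an unsigned number.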

144. The other operations will be deferred until later, now that we understand the basic ideas. But one more piece of code ought to be written before we move on, because it completes the execution stage for the simple cases already considered.

The  $ren\_x$  and  $ren\_a$  fields tell us whether the x and/or a fields contain valid information that should become officially known.

```
⟨ Finish execution of an operation 144 ⟩ ≡
fin_ex: if (data->ren_x) data->x.known = true;
  else if (data->mem_x) data->x.known = true, data->x.addr.l &= -8;
  if (data->ren_a) data->a.known = true;
  if (data->loc.h & sign_bit) data->ra.o.l = 0;   /* no trips enabled for the operating system */
  if (data->interrupt & 0xffff) ⟨ Handle interrupt at end of execution stage 307 ⟩;
die: data->owner = NULL; goto terminate;   /* this coroutine now fades away */
```

This code is used in section 130.

145. The commission/deissue stage. Control blocks leave the reorder buffer either at the hot end (when they're committed) or at the cool end (when they're deissued). We hope most of them are committed, but from time to time our speculation is incorrect and we must deissue a sequence of instructions that prove to be unwanted. Deissuing must take priority over committing, because the dispatcher cannot do anything until the machine's cool state has stabilized.

Deissuing changes the cool state by undoing the most recently issued instructions, in reverse order. Committing changes the hot state by doing the least recently issued instructions, in their original order. Both operations are similar, so we assume that they take the same time; at most *commit\_max* instructions are deissued and/or committed on each clock cycle.

This code is used in section 67.

```
146. ⟨ Commit the hottest instruction, or break if it's not ready 146 ⟩ ≡
  if (nullifying) ⟨ Nullify the hottest instruction 147 ⟩
  else {
    if (hot->i == get && hot->zz == rQ) new_Q = oandn(g[rQ].o, hot->x.o);
    else if (hot->i == put && hot->xx == rQ) hot->x.o.h |= new_Q.h, hot->x.o.l |= new_Q.l;
    if (hot->mem_x) ⟨ Commit to memory if possible, otherwise break 256 ⟩;
    if (verbose & issue_bit) {
      printf("Committing "); print_control_block(hot); printf("\n");
    }
    if (hot->ren_x) rename_regs++, hot->x.up->o = hot->x.o, spec_rem(&(hot->x));
    if (hot->ren_a) rename_regs++, hot->a.up->o = hot->a.o, spec_rem(&(hot->a));
    if (hot->set_l) hot->rl.up->o = hot->rl.o, spec_rem(&(hot->rl));
    if (hot->arith_exc) g[rA].o.l |= hot->arith_exc;
    if (hot->usage) {
      g[rU].o.l++; if (g[rU].o.l == 0) {
        g[rU].o.h++; if ((g[rU].o.h & 0x7fff) == 0) g[rU].o.h -= 0x8000;
      }
    }
  }
  if (hot->interrupt >= H_BIT) ⟨ Begin an interruption and break 317 ⟩;
```

This code is used in section 67.

147. A load or store instruction is "nullified" if it is about to be captured by a trap interrupt. In such cases it will be the only item in the reorder buffer; thus nullifying is sort of a cross between deissuing and committing. (It is important to have stopped dispatching when nullification is necessary, because instructions such as *incgamma* and *decgamma* change rS, and we need to change it back when an unexpected interruption occurs.)

```
⟨ Nullify the hottest instruction 147 ⟩ ≡
  {
    if (verbose & issue_bit) {
      printf("Nullifying "); print_control_block(hot); printf("\n");
    }
    if (hot->ren_x) rename_regs++, spec_rem(&hot->x);
    if (hot->ren_a) rename_regs++, spec_rem(&hot->a);
    if (hot->mem_x) mem_slots++, spec_rem(&hot->x);
    if (hot->set_l) spec_rem(&hot->rl);
    cool_O = hot->cur_O, cool_S = hot->cur_S;
    nullifying = false;
  }
```

This code is used in section 146.

148. Interrupt bits in rQ might be lost if they are set between a GET and a PUT. Therefore we don't allow PUT to zero out bits that have become 1 since the most recently committed GET.

```
⟨ Global variables 20 ⟩ +≡
  octa new_Q;   /* when rQ increases in any bit position, so should this */
```
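
The protection that new_Q provides can be seen in a toy model. The sketch below is an assumption-laden simplification of sections 146 and 148: rQ is a single `uint64_t`, the type `q_state` and the three function names are hypothetical, and a committed PUT is assumed simply to store its adjusted operand into rQ.

```c
#include <stdint.h>

typedef struct {
    uint64_t rQ;      /* interrupt request bits */
    uint64_t new_Q;   /* bits of rQ that became 1 since the last committed GET */
} q_state;

/* An interrupt request arrives: both words gain the new bits. */
static void raise_bits(q_state *s, uint64_t bits)
{
    s->rQ |= bits;
    s->new_Q |= bits;
}

/* Commit a GET that obtained the value x: remember the bits GET did not see. */
static void commit_get(q_state *s, uint64_t x)
{
    s->new_Q = s->rQ & ~x;
}

/* Commit a PUT of the value x: bits in new_Q must not be zeroed out. */
static void commit_put(q_state *s, uint64_t x)
{
    s->rQ = x | s->new_Q;
}
```

If a program reads rQ, clears the bit it has just handled, and writes the result back, any request that arrived between the GET and the PUT survives.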

149. An instruction will not be committed immediately if it violates the basic security rule of MMIX: An instruction in a nonnegative location should not be performed unless all eight of the internal interrupts have been enabled in the interrupt mask register rK. Conversely, an instruction in a negative location should not be performed if the $P\_BIT$ is enabled in rK.

Such instructions take one extra cycle before they are committed. The nonnegative-location case turns on the $S\_BIT$ of both rK and rQ, leading to an immediate interrupt (unless the current instruction is trap, put, or resume).

```
⟨ Check for security violation, break if so 149 ⟩ ≡
  if (hot->loc.h & sign_bit) {
    if ((g[rK].o.h & P_BIT) && !(hot->interrupt & P_BIT)) {
      hot->interrupt |= P_BIT;
      g[rQ].o.h |= P_BIT;
      new_Q.h |= P_BIT;
      if (verbose & issue_bit) {
        printf(" setting rQ="); print_octa(g[rQ].o); printf("\n");
      }
      break;
    }
  } else if ((g[rK].o.h & 0xff) != 0xff && !(hot->interrupt & S_BIT)) {
    hot->interrupt |= S_BIT;
    g[rQ].o.h |= S_BIT;
    new_Q.h |= S_BIT;
    g[rK].o.h |= S_BIT;
    if (verbose & issue_bit) {
      printf(" setting rQ="); print_octa(g[rQ].o);
      printf(", rK="); print_octa(g[rK].o); printf("\n");
    }
    break;
  }
```

This code is used in section 67.

150. Branch prediction. An MMIX programmer distinguishes statically between "branches" and "probable branches," but many modern computers attempt to do better by implementing dynamic branch prediction. (See, for example, section 4.3 of Hennessy and Patterson's *Computer Architecture*, second edition.) Experience has shown that dynamic branch prediction can significantly improve the performance of speculative execution, by reducing the number of instructions that need to be deissued.

This simulator has an optional  $bp\_table$  containing  $2^{a+b+c}$  entries of n bits each, where n is between 1 and 8. Usually n is 1 or 2 in practice, but 8 bits are allocated per entry for convenience in this program. The  $bp\_table$  is consulted and updated on every branch instruction (every B or PB instruction, but not JMP), for advice on past history of similar situations. It is indexed by the a least significant bits of the address of the instruction, the b most recent bits of global branch history, and the next c bits of both address and history (exclusive-ored).

A *bp\_table* entry begins at zero and is regarded as a signed *n*-bit number. If it is nonnegative, we will follow the prediction in the instruction, namely to predict a branch taken only in the PB case. If it is negative, we will predict the opposite of the instruction's recommendation. The *n*-bit number is increased (if possible) if the instruction's prediction was correct, decreased (if possible) if the instruction's prediction was incorrect.

(Incidentally, a large value of n is not necessarily a good idea. For example, if n = 8 the machine might need 128 steps to recognize that a branch taken the first 150 times is not taken the next 150 times. And if we modify the update criteria to avoid this problem, we obtain a scheme that is rarely better than a simple scheme with smaller n.)

The values a, b, c, and n in this discussion are called  $bp\_a$ ,  $bp\_b$ ,  $bp\_c$ , and  $bp\_n$  in the program.

```
⟨ External variables 4 ⟩ +≡
Extern int bp_a, bp_b, bp_c, bp_n;   /* parameters for branch prediction */
Extern char *bp_table;   /* either NULL or an array of 2^{a+b+c} items */
```
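
The remark about large n can be checked with a small experiment. The sketch below models one table entry as an n-bit saturating counter held in the low n bits of an `int`, using the same masking rule that the simulator applies in section 152; the function names are hypothetical.

```c
/* Increase the n-bit signed counter h, unless it is already 2^(n-1) - 1. */
static int bp_up(int h, int n)
{
    int nmask = (1 << n) - 1, npower = 1 << (n - 1);
    int up = (h + 1) & nmask;
    return up == npower ? h : up;
}

/* Decrease the n-bit signed counter h, unless it is already -2^(n-1). */
static int bp_down(int h, int n)
{
    int nmask = (1 << n) - 1, npower = 1 << (n - 1);
    return h == npower ? h : ((h - 1) & nmask);
}

/* Interpret the low n bits of h as a signed number. */
static int bp_value(int h, int n)
{
    int npower = 1 << (n - 1);
    return h - ((h & npower) << 1);
}

/* Starting from the largest value, count the consecutive decreases needed
   before the entry becomes negative and the prediction is reversed. */
static int steps_to_flip(int n)
{
    int npower = 1 << (n - 1), h = npower - 1, steps = 0;
    while (!(h & npower)) h = bp_down(h, n), steps++;
    return steps;
}
```

An entry with n = 8 that has climbed to 127 needs 128 consecutive decreases before it goes negative, which agrees with the figure quoted above; with n = 2 only two are needed.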

**151.** Branch prediction is made when we are either about to issue an instruction or peeking ahead. We look at the  $bp\_table$ , but we don't want to update it yet.

```
⟨ Predict a branch outcome 151 ⟩ ≡
  {
    predicted = op & 0x10;   /* start with the instruction's recommendation */
    if (bp_table) { register int h;
      m = ((head->loc.l & bp_cmask) << bp_b) + (head->loc.l & bp_amask);
      m = ((cool_hist & bp_bcmask) << bp_a) ^ (m >> 2);
      h = bp_table[m];
      if (h & bp_npower) predicted ^= 0x10;
    }
    if (predicted) peek_hist = (peek_hist << 1) + 1;
    else peek_hist <<= 1;
  }
```

This code is used in section 85.

152. We update the $bp\_table$ when an instruction is issued. And we store the opposite table value in $cool \rightarrow x.o.l$, just in case our prediction turns out to be wrong.

```
⟨ Record the result of branch prediction 152 ⟩ ≡
  if (bp_table) { register int reversed, h, h_up, h_down;
    reversed = op & 0x10;
    if (peek_hist & 1) reversed ^= 0x10;
    m = ((head->loc.l & bp_cmask) << bp_b) + (head->loc.l & bp_amask);
    m = ((cool_hist & bp_bcmask) << bp_a) ^ (m >> 2);
    h = bp_table[m];
    h_up = (h + 1) & bp_nmask; if (h_up == bp_npower) h_up = h;
    if (h == bp_npower) h_down = h; else h_down = (h - 1) & bp_nmask;
    if (reversed) {
      bp_table[m] = h_down, cool->x.o.l = h_up;
      cool->i = pbr + br - cool->i;   /* reverse the sense */
      bp_rev_stat++;
    } else {
      bp_table[m] = h_up, cool->x.o.l = h_down;   /* go with the flow */
      bp_ok_stat++;
    }
    if (verbose & show_pred_bit) {
      printf(" predicting "); print_octa(cool->loc);
      printf(" %s; bp[%x]=%d\n", reversed ? "NG" : "OK", m,
          bp_table[m] - ((bp_table[m] & bp_npower) << 1));
    }
    cool->x.o.h = m;
  }
```

This code is used in section 75.

153. The calculations in the previous sections need several precomputed constants, depending on the parameters a, b, c, and n.

```
⟨ Initialize everything 22 ⟩ +≡
  bp_amask = ((1 << bp_a) - 1) << 2;   /* least a bits of instruction address */
  bp_cmask = ((1 << bp_c) - 1) << (bp_a + 2);   /* the next c address bits */
  bp_bcmask = (1 << (bp_b + bp_c)) - 1;   /* least b + c bits of history info */
  bp_nmask = (1 << bp_n) - 1;   /* least significant n bits */
  bp_npower = 1 << (bp_n - 1);   /* 2^{n-1}, the sign bit of an n-bit number */
```

```
154. ⟨ Global variables 20 ⟩ +≡
  int bp_amask, bp_cmask, bp_bcmask, bp_nmask, bp_npower;
  int bp_rev_stat, bp_ok_stat;   /* how often we overrode and agreed */
  int bp_bad_stat, bp_good_stat;   /* how often we failed and succeeded */
```
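
To see how these masks combine into a table index, here is a standalone sketch of the computation used in sections 151 and 152. The struct `bp_params` and the functions `bp_setup` and `bp_index` are hypothetical wrappers introduced for this example; the arithmetic itself follows the simulator's formulas, under the assumption that a + b + c is small enough for 32-bit shifts.

```c
#include <stdint.h>

typedef struct {
    int a, b, c;
    uint32_t amask, cmask, bcmask;
} bp_params;

static bp_params bp_setup(int a, int b, int c)
{
    bp_params p;
    p.a = a; p.b = b; p.c = c;
    p.amask = ((UINT32_C(1) << a) - 1) << 2;         /* address bits 2 .. a+1 */
    p.cmask = ((UINT32_C(1) << c) - 1) << (a + 2);   /* the next c address bits */
    p.bcmask = (UINT32_C(1) << (b + c)) - 1;         /* least b + c history bits */
    return p;
}

/* Index layout, low to high: a address bits, b history bits,
   then c bits that are the exclusive-or of address and history. */
static uint32_t bp_index(const bp_params *p, uint32_t loc_l, uint32_t hist)
{
    uint32_t m = ((loc_l & p->cmask) << p->b) + (loc_l & p->amask);
    return ((hist & p->bcmask) << p->a) ^ (m >> 2);
}
```

The two low-order address bits are skipped because instructions are four bytes long, and the result is always less than $2^{a+b+c}$.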

**155.** After a branch or probable branch instruction has been issued and the value of the relevant register has been computed in the reorder buffer as *data*→*b.o*, we're ready to determine if the prediction was correct or not.

```
⟨ Cases for stage 1 execution 155 ⟩ ≡
case br: case pbr: j = register_truth(data->b.o, data->op);
  if (j) data->go.o = data->z.o; else data->go.o = data->y.o;
  if (j == (data->i == pbr)) bp_good_stat++;
  else {   /* oops, misprediction */
    bp_bad_stat++;
    ⟨ Recover from incorrect branch prediction 160 ⟩;
  }
  goto fin_ex;
```

See also sections 313, 325, 327, 328, 329, 331, and 356.

This code is used in section 132.

156. The register_truth subroutine is used by B, PB, CS, and ZS commands to decide whether an octabyte satisfies the conditions of the opcode, $data \rightarrow op$.

```
⟨ Internal prototypes 13 ⟩ +≡
  static int register_truth ARGS((octa, mmix_opcode));
```

```
157. ⟨ Subroutines 14 ⟩ +≡
  static int register_truth(o, op)
       octa o;
       mmix_opcode op;
  { register int b;
    switch ((op >> 1) & 0x3) {
    case 0: b = o.h >> 31; break;   /* negative? */
    case 1: b = (o.h == 0 && o.l == 0); break;   /* zero? */
    case 2: b = (o.h < sign_bit && (o.h || o.l)); break;   /* positive? */
    case 3: b = o.l & 0x1; break;   /* odd? */
    }
    if (op & 0x8) return b ^ 1;
    else return b;
  }
```

158. The issued_between subroutine determines how many speculative instructions were issued between a given control block in the reorder buffer and the current cool pointer, when cc = cool.

```
⟨ Internal prototypes 13 ⟩ +≡
  static int issued_between ARGS((control *, control *));
```

```
159. ⟨ Subroutines 14 ⟩ +≡
  static int issued_between(c, cc)
       control *c, *cc;
  {
    if (c > cc) return c - 1 - cc;
    return (c - reorder_bot) + (reorder_top - cc);
  }
```
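
The wraparound arithmetic of issued_between can be tried on a small array. The sketch below assumes, as the formula suggests, that each newly issued instruction occupies the slot just below the previous one, wrapping from the bottom of the buffer to the top; the type `control_sketch` and the explicit bounds parameters are hypothetical stand-ins for the simulator's control blocks and its reorder_bot and reorder_top variables.

```c
#include <stddef.h>

typedef struct { int unused; } control_sketch;   /* stand-in for a control block */

/* Number of slots strictly between c and cc in a circular buffer
   occupying bot..top, walking downward from c toward cc. */
static ptrdiff_t issued_between_sketch(const control_sketch *c, const control_sketch *cc,
                                       const control_sketch *bot, const control_sketch *top)
{
    if (c > cc) return c - 1 - cc;        /* no wraparound: slots c-1 down to cc+1 */
    return (c - bot) + (top - cc);        /* slots c-1..bot, then top..cc+1 */
}
```

In an eight-slot buffer, a block in slot 5 with cool at slot 2 has two younger instructions (slots 4 and 3); a block in slot 1 with cool at slot 6 also has two (slots 0 and 7).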

160. If more than one functional unit is able to process branch instructions and if two of them simultaneously discover misprediction, or if misprediction is detected by one unit just as another unit is generating an interrupt, we assume that an arbitration takes place so that only the hottest one actually deissues the cooler instructions.

Changes to the  $bp\_table$  aren't undone when they were made on speculation in an instruction being deissued; nor do we worry about cases where the same  $bp\_table$  entry is being updated by two or more active coroutines. After all, the  $bp\_table$  is just a heuristic, not part of the real computation. We correct the  $bp\_table$  only if we discover that a prediction was wrong, so that we will be less likely to make the same mistake later.

```
⟨ Recover from incorrect branch prediction 160 ⟩ ≡
  i = issued_between(data, cool);
  if (i < deissues) goto die;
  deissues = i;
  old_tail = tail = head; resuming = 0;   /* clear the fetch buffer */
  ⟨ Restart the fetch coroutine 287 ⟩;
  inst_ptr.o = data->go.o, inst_ptr.p = NULL;
  if (!(data->loc.h & sign_bit)) {
    if (inst_ptr.o.h & sign_bit) data->interrupt |= P_BIT;
    else data->interrupt &= ~P_BIT;
  }
  if (bp_table) {
    bp_table[data->x.o.h] = data->x.o.l;   /* this is what we should have stored */
    if (verbose & show_pred_bit) {
      printf(" mispredicted "); print_octa(data->loc);
      printf("; bp[%x]=%d\n", data->x.o.h, data->x.o.l - ((data->x.o.l & bp_npower) << 1));
    }
  }
  cool_hist = (j ? (data->hist << 1) + 1 : data->hist << 1);
```

This code is used in section 155.

```
161. ⟨ External prototypes 9 ⟩ +≡
  Extern void print_stats ARGS((void));
```

```
162. ⟨ External routines 10 ⟩ +≡
  void print_stats()
  {
    register int j;
    bp_ok_stat, bp_rev_stat, bp_good_stat, bp_bad_stat);
    else printf("Predictions: %d good, %d bad\n", bp_good_stat, bp_bad_stat);
    printf("Instructions issued per cycle:\n");
```

163. Cache memory. It's time now to consider MMIX's MMU, the memory management unit. This part of the machine deals with the critical problem of getting data to and from the computational units. In a RISC architecture all interaction between main memory and the computer registers is specified by load and store instructions; thus memory accesses are much easier to deal with than they would be on a machine with more complex kinds of interaction. But memory management is still difficult, if we want to do it well, because main memory typically operates at a much slower speed than the registers do. High-speed implementations of MMIX introduce intermediate "caches" of storage in order to keep the most important data accessible, and cache maintenance can be complicated when all the details are taken into account. (See, for example, Chapter 5 of Hennessy and Patterson's Computer Architecture, second edition.)

This simulator can be configured to have up to three auxiliary caches between registers and memory: An I-cache for instructions, a D-cache for data, and an S-cache for both instructions and data. The S-cache, also called a *secondary cache*, is supported only if both I-cache and D-cache are present. Arbitrary access times for each cache can be specified independently; we might assume, for example, that data items in the I-cache or D-cache can be sent to a register in one or two clock cycles, but the access time for the S-cache might be say 5 cycles, and main memory might require 20 cycles or more. Our speculative pipeline can have many functional units handling load and store instructions, but only one load or store instruction can be updating the D-cache or S-cache or main memory at a time. (However, the D-cache can have several read ports; furthermore, data might be passing between the S-cache and memory while other data is passing between the reorder buffer and the D-cache.)

Besides the optional I-cache, D-cache, and S-cache, there are required caches called the IT-cache and DT-cache, for translation of virtual addresses to physical addresses. A translation cache is often called a "translation lookaside buffer" or TLB; but we call it a cache since it is implemented in nearly the same way as an I-cache.

164. Consider a cache that has blocks of  $2^b$  bytes each and associativity  $2^a$ ; here  $b \ge 3$  and  $a \ge 0$ . The I-cache, D-cache, and S-cache are addressed by 48-bit physical addresses, as if they were part of main memory; but the IT and DT caches are addressed by 64-bit keys, obtained from a virtual address by blanking out the lower s bits and inserting the value of n, where the page size s and the process number n are found in rV. We will consider all caches to be addressed by 64-bit keys, so that both cases are handled with the same basic methods.

Given a 64-bit key, we ignore the low-order b bits and use the next c bits to address the *cache set*; then the remaining 64 - b - c bits should match one of  $2^a$  tags in that set. The case a = 0 corresponds to a so-called *direct-mapped* cache; the case c = 0 corresponds to a so-called *fully associative* cache. With  $2^c$  sets of  $2^a$  blocks each, and  $2^b$  bytes per block, the cache contains  $2^{a+b+c}$  bytes of data, in addition to the space needed for tags. Translation caches have b = 3 and they also usually have c = 0.
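
The division of a key into its three parts can be sketched as follows. The struct `key_parts` and the function `split_key` are hypothetical names used only for this illustration, and the sketch assumes b + c < 64 so that every shift is well defined in C.

```c
#include <stdint.h>

typedef struct {
    uint64_t tag;      /* the remaining 64 - b - c bits, matched against the set's tags */
    uint64_t set;      /* the c bits that select the cache set */
    uint64_t offset;   /* the low-order b bits, ignored when looking up the block */
} key_parts;

/* Split a 64-bit key for a cache with 2^b bytes per block and 2^c sets. */
static key_parts split_key(uint64_t key, unsigned b, unsigned c)
{
    key_parts p;
    p.offset = key & ((UINT64_C(1) << b) - 1);
    p.set = (key >> b) & ((UINT64_C(1) << c) - 1);
    p.tag = key >> (b + c);
    return p;
}

/* Data capacity in bytes: 2^c sets of 2^a blocks of 2^b bytes. */
static uint64_t cache_bytes(unsigned a, unsigned b, unsigned c)
{
    return UINT64_C(1) << (a + b + c);
}
```

With c = 0 the set index is always zero, which is the fully associative case: every block of the cache is a candidate for every key.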

If a tag matches the specified bits, we "hit" in the cache and can use and/or update the data found there. Otherwise we "miss," and we probably want to replace one of the cache blocks by the block containing the item sought. The item chosen for replacement is called a *victim*. The choice of victim is forced when the cache is direct-mapped, but four strategies for victim selection are available when we must choose from among  $2^a$  entries for a > 0:

- "Random" selection chooses the victim by extracting the least significant a bits of the clock.
- "Serial" selection chooses $0, 1, \ldots, 2^a - 1, 0, 1, \ldots, 2^a - 1, 0, \ldots$ on successive trials.
- "LRU (Least Recently Used)" selection chooses the victim that ranks last if items are ranked inversely to the time that has elapsed since their previous use.
- "Pseudo-LRU" selection chooses the victim by a rough approximation to LRU that is simpler to implement in hardware. It requires a bit table $r_1 \ldots r_{2^a-1}$. Whenever we use an item with binary address $(i_1 \ldots i_a)_2$ in the set, we adjust the bit table as follows:

$$r_1 \leftarrow 1 - i_1, \quad r_{1i_1} \leftarrow 1 - i_2, \quad \dots, \quad r_{1i_1 \dots i_{a-1}} \leftarrow 1 - i_a;$$

here the subscripts on r are binary numbers. (For example, when a=3, the use of element  $(010)_2$  sets  $r_1 \leftarrow 1$ ,  $r_{10} \leftarrow 0$ ,  $r_{101} \leftarrow 1$ , where  $r_{101}$  means the same as  $r_5$ .) To select a victim, we start with  $l \leftarrow 1$  and then repeatedly set  $l \leftarrow 2l + r_l$ , a times; then we choose element  $l - 2^a$ . When a=1, this scheme is equivalent to LRU. When a=2, this scheme was implemented in the Intel 80486 chip.
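
The pseudo-LRU rule just described translates almost directly into code. The following sketch is not taken from the simulator; the struct `plru` and the functions `plru_use` and `plru_victim` are hypothetical names, and the table is limited to a ≤ 8 for simplicity.

```c
#define PLRU_MAX_A 8

typedef struct {
    unsigned a;                              /* the set holds 2^a items */
    unsigned char r[1u << PLRU_MAX_A];       /* r[1] .. r[2^a - 1] are used */
} plru;

/* Record a use of item i, whose binary address is (i_1 ... i_a)_2. */
static void plru_use(plru *t, unsigned i)
{
    unsigned idx = 1, k;
    for (k = t->a; k-- > 0; ) {              /* i_1 is the most significant bit */
        unsigned bit = (i >> k) & 1;
        t->r[idx] = (unsigned char)(1 - bit);   /* point away from the item just used */
        idx = 2 * idx + bit;                 /* descend: subscript becomes (1 i_1 ... i_j)_2 */
    }
}

/* Choose a victim: set l = 1, then l = 2l + r_l a times, and return l - 2^a. */
static unsigned plru_victim(const plru *t)
{
    unsigned l = 1, k;
    for (k = 0; k < t->a; k++) l = 2 * l + t->r[l];
    return l - (1u << t->a);
}
```

Starting from an all-zero table with a = 3, a use of element $(010)_2$ sets $r_1 = 1$, $r_2 = 0$, $r_5 = 1$ as in the example above, and the next victim is element 4, which lies in the half of the set that was not just touched.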

```
⟨ Type definitions 11 ⟩ +≡
  typedef enum {
    random, serial, pseudo_lru, lru
  } replace_policy;
```

165. A cache might also include a "victim" area, which contains the last $2^v$ victim blocks removed from the main cache area. The victim area can be searched in parallel with the specified cache set, thereby increasing the chance of a hit without making the search go slower. Each of the four replacement policies can be used also in the victim cache.


**166.** A cache also has a granularity  $2^g$ , where  $b \ge g \ge 3$ . This means that we maintain, for each cache block, a set of  $2^{b-g}$  "dirty bits," which identify the  $2^g$ -byte groups that have possibly changed since they were last read from memory. Thus if g = b, an entire cache block is either dirty or clean; if g = 3, the dirtiness of each octabyte is maintained separately.
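As a small illustration of the granularity arithmetic, the following sketch computes how many dirty bits a block needs and which dirty bit covers a given byte address. The helper names are hypothetical; `b` and `g` are the base-2 logarithms of the block size and granule size, as in the text.

```c
/* Hypothetical helper: number of dirty bits kept per cache block, 2^(b-g). */
static int dirty_bits_per_block(int b, int g)
{
    return 1 << (b - g);
}

/* Hypothetical helper: index of the dirty bit covering a byte address. */
static int granule_index(unsigned long addr, int b, int g)
{
    unsigned long offset = addr & ((1ul << b) - 1);  /* byte offset inside the block */
    return (int)(offset >> g);
}
```

For 64-byte blocks (b = 6) with octabyte granularity (g = 3) there are eight dirty bits per block; with g = b there is only one.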

Two policies are available when new data is written into all or part of a cache block. We can write-through, meaning that we send all new data to memory immediately and never mark anything dirty; or we can write-back, meaning that we update the memory from the cache only when absolutely necessary. Furthermore we can write-allocate, meaning that we keep the new data in the cache, even if the cache block being written has to be fetched first because of a miss; or we can write-around, meaning that we keep the new data only if it was part of an existing cache block.

(In this discussion, "memory" is shorthand for "the next level of the memory hierarchy"; if there is an S-cache, the I-cache and D-cache write new data to the S-cache, not directly to memory. The I-cache, IT-cache, and DT-cache are read-only, so they do not need the facilities discussed in this section. Moreover, the D-cache and S-cache can be assumed to have the same granularity.)
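The four combinations of write policy can be summarized by a small decision function. This is only a sketch under stated assumptions: the flag values, the `write_action` structure, and `plan_write` are hypothetical names introduced here, and the real simulator spreads these decisions over several coroutines.

```c
#include <stdbool.h>

/* Flag values are assumptions made for this sketch. */
#define WRITE_BACK  1
#define WRITE_ALLOC 2

typedef struct {
    bool keep_in_cache;   /* the new data is stored in the cache */
    bool mark_dirty;      /* the cache now differs from memory */
    bool send_to_memory;  /* the data goes to the next level right away */
} write_action;

static write_action plan_write(int mode, bool hit)
{
    write_action w;
    /* write-around keeps new data only when the block is already present */
    w.keep_in_cache = hit || (mode & WRITE_ALLOC) != 0;
    /* write-through never marks anything dirty */
    w.mark_dirty = w.keep_in_cache && (mode & WRITE_BACK) != 0;
    /* whatever is not left dirty in the cache must be sent on immediately */
    w.send_to_memory = !w.mark_dirty;
    return w;
}
```

For example, a miss under write-back with write-around stores nothing in the cache, so the data must still be passed to the next level of the hierarchy.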


167. We have seen that many flavors of cache can be simulated. They are represented by **cache** structures, containing arrays of **cacheset** structures that contain arrays of **cacheblock** structures for the individual blocks. We use a full byte to store each *dirty* bit, and we use full integer words to store *rank* fields for LRU processing, etc.; memory economy is less important than simplicity in this simulator.

```
⟨ Type definitions 11 ⟩ +≡
  typedef struct {
    octa tag;      /* bits of key not included in the cache block address */
    char *dirty;   /* array of 2^{b-g} dirty bits, one per granule */
    octa *data;    /* array of 2^{b-3} octabytes, the data in a cache block */
    int rank;      /* auxiliary information for non-random policies */
  } cacheblock;
  typedef cacheblock *cacheset;   /* array of 2^a or 2^v blocks */
  typedef struct {
    int a, b, c, g, v;        /* lg of associativity, blocksize, setsize, granularity, and victimsize */
    int aa, bb, cc, gg, vv;   /* associativity, blocksize, setsize, granularity, and victimsize (all powers of 2) */
    int tagmask;              /* -2^{b+c} */
    replace_policy repl, vrepl;   /* how to choose victims and victim-victims */
    int mode;                 /* optional WRITE_BACK and/or WRITE_ALLOC */
    int access_time;          /* cycles to know if there's a hit */
    int copy_in_time;         /* cycles to copy a new block into the cache */
    int copy_out_time;        /* cycles to copy an old block from the cache */
    cacheset *set;            /* array of 2^c sets of arrays of cache blocks */
    cacheset victim;          /* the victim cache, if present */
    coroutine filler;         /* a coroutine for copying new blocks into the cache */
    control filler_ctl;       /* its control block */
    coroutine flusher;        /* a coroutine for writing dirty old data from the cache */
    control flusher_ctl;      /* its control block */
    cacheblock inbuf;         /* filling comes from here */
    cacheblock outbuf;        /* flushing goes to here */
    lockvar lock;             /* nonzero when the cache is being changed significantly */
    lockvar fill_lock;        /* nonzero when filler should pass data back */
    int ports;                /* how many coroutines can be reading the cache? */
    coroutine *reader;        /* array of coroutines that might be reading simultaneously */
    char *name;               /* "Icache", for example */
  } cache;
```

**168.**

```
⟨ External variables 4 ⟩ +≡
  Extern cache *Icache, *Dcache, *Scache, *ITcache, *DTcache;
```

**169.** Now we are ready to define some basic subroutines for cache maintenance. Let's begin with a trivial routine that tests if a given cache block is dirty.

```
⟨Internal prototypes 13⟩ +≡
static bool is_dirty ARGS((cache *, cacheblock *));
```


**170.**

```
⟨ Subroutines 14 ⟩ +≡
  static bool is_dirty(c, p)
       cache *c;        /* the cache containing it */
       cacheblock *p;   /* a cache block */
  {
    register int j;
    register char *d = p->dirty;
    for (j = 0; j < c->bb; d++, j += c->gg)
      if (*d) return true;
    return false;
  }
```

**171.** For diagnostic purposes we might want to display an entire cache block.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void print_cache_block ARGS((cacheblock, cache *));
```

**172.**

```
⟨ Subroutines 14 ⟩ +≡
  static void print_cache_block(p, c)
       cacheblock p;
       cache *c;
  { register int i, j, b = c->bb >> 3, g = c->gg >> 3;
    printf("%08x%08x: ", p.tag.h, p.tag.l);
    for (i = j = 0; j < b; j++, i += ((j & (g - 1)) ? 0 : 1))
      printf("%08x%08x%c", p.data[j].h, p.data[j].l, p.dirty[i] ? '*' : ' ');
    printf(" (%d)\n", p.rank);
  }
```

**173.**

```
⟨ Internal prototypes 13 ⟩ +≡
  static void print_cache_locks ARGS((cache *));
```

**174.**

```
⟨ Subroutines 14 ⟩ +≡
  static void print_cache_locks(c)
       cache *c;
  {
    if (c) {
      if (c->lock) printf("%s locked by %s:%d\n", c->name, c->lock->name, c->lock->stage);
      if (c->fill_lock) printf("%sfill locked by %s:%d\n", c->name, c->fill_lock->name, c->fill_lock->stage);
    }
  }
```

175. The *print\_cache* routine prints the entire contents of a cache. This can be a huge amount of data, but it can be very useful when debugging. Fortunately, the task of debugging favors the use of small caches, since interesting cases arise more often when a cache is fairly small.

```
⟨External prototypes 9⟩ +≡
Extern void print_cache ARGS((cache *, bool));
```


**176.**

```
⟨ External routines 10 ⟩ +≡
  void print_cache(c, dirty_only)
       cache *c;
       bool dirty_only;
  {
    if (c) { register int i, j;
      printf("%s of %s:", dirty_only ? "Dirty blocks" : "Contents", c->name);
      if (c->filler.next) {
        printf(" (filling ");
        print_octa(c->name[1] == 'T' ? c->filler_ctl.y.o : c->filler_ctl.z.o);
        printf(")");
      }
      if (c->flusher.next) {
        printf(" (flushing ");
        print_octa(c->outbuf.tag);
        printf(")");
      }
      printf("\n");
      ⟨ Print all of c's cache blocks 177 ⟩;
    }
  }
```

**177.** We don't print the cache blocks that have an invalid tag, unless requested to be verbose.

```
⟨ Print all of c's cache blocks 177 ⟩ ≡
  for (i = 0; i < c->cc; i++)
    for (j = 0; j < c->aa; j++)
      if ((!(c->set[i][j].tag.h & sign_bit) || (verbose & show_wholecache_bit)) &&
          (!dirty_only || is_dirty(c, &c->set[i][j]))) {
        printf("[%d][%d] ", i, j);
        print_cache_block(c->set[i][j], c);
      }
  for (j = 0; j < c->vv; j++)
    if ((!(c->victim[j].tag.h & sign_bit) || (verbose & show_wholecache_bit)) &&
        (!dirty_only || is_dirty(c, &c->victim[j]))) {
      printf("V[%d] ", j);
      print_cache_block(c->victim[j], c);
    }
```

This code is used in section 176.

**178.** The *clean\_block* routine simply initializes a given cache block.

```
⟨ External prototypes 9 ⟩ +≡
  Extern void clean_block ARGS((cache *, cacheblock *));
```


**179.**

```
⟨ External routines 10 ⟩ +≡
  void clean_block(c, p)
       cache *c;
       cacheblock *p;
  {
    register int j;
    p->tag.h = sign_bit, p->tag.l = 0;
    for (j = 0; j < c->bb >> 3; j++) p->data[j] = zero_octa;
    for (j = 0; j < c->bb >> c->g; j++) p->dirty[j] = false;
  }
```

**180.** The *zap\_cache* routine invalidates all tags of a given cache, effectively restoring it to its initial condition.

```
⟨External prototypes 9⟩ +≡

Extern void zap_cache ARGS((cache *));
```

**181.** We clear the *dirty* entries here, just to be tidy, although they could actually be left in arbitrary condition when the tags are invalid.
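The code for this section is not reproduced here. A minimal sketch of what such a routine could look like, assuming it simply applies *clean\_block* to every block of the main area and of the victim area, is shown next. The type definitions are simplified stand-ins for the simulator's own (only the fields needed here are kept), and the function bodies are a reconstruction for illustration rather than the program's text.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the simulator's types (assumptions). */
typedef struct { uint32_t h, l; } octa;
typedef struct { octa tag; char *dirty; octa *data; int rank; } cacheblock;
typedef cacheblock *cacheset;
typedef struct {
    int g;               /* lg of granularity */
    int aa, bb, cc, vv;  /* associativity, blocksize, setsize, victimsize */
    cacheset *set;       /* cc sets of aa blocks each */
    cacheset victim;     /* vv blocks, or NULL */
} cache;

#define sign_bit 0x80000000u

static void clean_block(cache *c, cacheblock *p)
{
    int j;
    p->tag.h = sign_bit; p->tag.l = 0;   /* a negative tag is invalid */
    for (j = 0; j < c->bb >> 3; j++) p->data[j].h = p->data[j].l = 0;
    for (j = 0; j < c->bb >> c->g; j++) p->dirty[j] = false;
}

/* Sketch: invalidate every block of the main area and the victim area. */
static void zap_cache(cache *c)
{
    int i, j;
    for (i = 0; i < c->cc; i++)
        for (j = 0; j < c->aa; j++) clean_block(c, &c->set[i][j]);
    for (j = 0; j < c->vv; j++) clean_block(c, &c->victim[j]);
}
```

Because *clean\_block* resets the dirty entries as well as the tag, the tidiness mentioned in the text comes for free in this arrangement.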

**182.** The *get\_reader* subroutine finds the index of an available reader coroutine for a given cache, or returns a negative value if no readers are available.

```
⟨ Internal prototypes 13 ⟩ +≡
  static int get_reader ARGS((cache *));
```

**183.**

```
⟨ Subroutines 14 ⟩ +≡
  static int get_reader(c)
       cache *c;
  { register int j;
    for (j = 0; j < c->ports; j++)
      if (c->reader[j].next == NULL) return j;
    return -1;
  }
```


**184.** The subroutine  $copy\_block(c, p, cc, pp)$  copies the dirty items from block p of cache c into block pp of cache cc, assuming that the destination cache has a sufficiently large block size. (In other words, we assume that  $cc \rightarrow b \ge c \rightarrow b$ .) We also assume that both blocks have compatible tags, and that both caches have the same granularity.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void copy_block ARGS((cache *, cacheblock *, cache *, cacheblock *));
```

**185.**

```
⟨ Subroutines 14 ⟩ +≡
  static void copy_block(c, p, cc, pp)
       cache *c, *cc;
       cacheblock *p, *pp;
  {
    register int j, jj, i, ii, lim;
    register int off = p->tag.l & (cc->bb - 1);
    if (c->g != cc->g || p->tag.h != pp->tag.h || p->tag.l - off != pp->tag.l)
      panic(confusion("copy block"));
    for (j = 0, jj = off >> c->g; j < c->bb >> c->g; j++, jj++)
      if (p->dirty[j]) {
        pp->dirty[jj] = true;
        for (i = j << (c->g - 3), ii = jj << (c->g - 3), lim = (j + 1) << (c->g - 3); i < lim; i++, ii++)
          pp->data[ii] = p->data[i];
      }
  }
```

186. The  $choose\_victim$  subroutine selects the victim to be replaced when we need to change a cache set. We need only one bit of the rank fields to implement the r table when  $policy = pseudo\_lru$ , and we don't need rank at all when policy = random. Of course we use an a-bit counter to implement policy = serial. In the other case, policy = lru, we need an a-bit rank field; the least recently used entry has rank 0, and the most recently used entry has rank  $2^a - 1 = aa - 1$ .

```
⟨ Internal prototypes 13 ⟩ +≡
  static cacheblock *choose_victim ARGS((cacheset, int, replace_policy));
```

**187.**

```
⟨ Subroutines 14 ⟩ +≡
  static cacheblock *choose_victim(s, aa, policy)
       cacheset s;
       int aa;   /* setsize */
       replace_policy policy;
  {
    register cacheblock *p;
    register int l, m;
    switch (policy) {
    case random: return &s[ticks.l & (aa - 1)];
    case serial: l = s[0].rank; s[0].rank = (l + 1) & (aa - 1); return &s[l];
    case lru:
      for (p = s; p < s + aa; p++)
        if (p->rank == 0) return p;
      panic(confusion("lru victim"));   /* what happened? nobody has rank zero */
    case pseudo_lru:
      for (l = 1, m = aa >> 1; m; m >>= 1) l = l + l + s[l].rank;
      return &s[l - aa];
    }
  }
```


**188.** The *note\_usage* subroutine updates the *rank* entries to record the fact that a particular block in a cache set is now being used.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void note_usage ARGS((cacheblock *, cacheset, int, replace_policy));
```

**189.**

```
⟨ Subroutines 14 ⟩ +≡
  static void note_usage(l, s, aa, policy)
       cacheblock *l;   /* a cache block that's probably worth preserving */
       cacheset s;      /* the set that contains l */
       int aa;          /* setsize */
       replace_policy policy;
  {
    register cacheblock *p;
    register int j, m, r;
    if (aa == 1 || policy <= serial) return;
    if (policy == lru) {
      r = l->rank;
      for (p = s; p < s + aa; p++)
        if (p->rank > r) p->rank--;
      l->rank = aa - 1;
    } else {   /* policy == pseudo_lru */
      r = l - s;
      for (j = 1, m = aa >> 1; m; m >>= 1)
        if (r & m) s[j].rank = 0, j = j + j + 1;
        else s[j].rank = 1, j = j + j;
    }
    return;
  }
```

**190.** The *demote\_usage* subroutine is sort of the opposite of *note\_usage*; it changes the rank of a given block to least recently used.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void demote_usage ARGS((cacheblock *, cacheset, int, replace_policy));
```


**191.**

```
⟨ Subroutines 14 ⟩ +≡
  static void demote_usage(l, s, aa, policy)
       cacheblock *l;   /* a cache block we probably don't need */
       cacheset s;      /* the set that contains l */
       int aa;          /* setsize */
       replace_policy policy;
  {
    register cacheblock *p;
    register int j, m, r;
    if (aa == 1 || policy <= serial) return;
    if (policy == lru) {
      r = l->rank;
      for (p = s; p < s + aa; p++)
        if (p->rank < r) p->rank++;
      l->rank = 0;
    } else {   /* policy == pseudo_lru */
      r = l - s;
      for (j = 1, m = aa >> 1; m; m >>= 1)
        if (r & m) s[j].rank = 1, j = j + j + 1;
        else s[j].rank = 0, j = j + j;
    }
    return;
  }
```
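The LRU branches of *note\_usage* and *demote\_usage* both keep the ranks of a set equal to a permutation of 0, 1, ..., aa − 1, so exactly one block always has rank 0. The following sketch shows this on a plain array of ranks; the function names are hypothetical and the array stands in for the *rank* fields of a cache set.

```c
/* Hypothetical array-based version of the LRU branch of note_usage. */
static void lru_note(int *rank, int aa, int used)
{
    int r = rank[used], k;
    for (k = 0; k < aa; k++)
        if (rank[k] > r) rank[k]--;   /* close the gap left by the promoted block */
    rank[used] = aa - 1;              /* most recently used */
}

/* Hypothetical array-based version of the LRU branch of demote_usage. */
static void lru_demote(int *rank, int aa, int unwanted)
{
    int r = rank[unwanted], k;
    for (k = 0; k < aa; k++)
        if (rank[k] < r) rank[k]++;   /* make room at the bottom */
    rank[unwanted] = 0;               /* least recently used: the next victim */
}

/* Returns the index whose rank is 0, or -1 if the ranks are inconsistent. */
static int lru_victim(const int *rank, int aa)
{
    int k;
    for (k = 0; k < aa; k++)
        if (rank[k] == 0) return k;
    return -1;
}
```

Starting from ranks {0, 1, 2, 3}, noting a use of block 1 gives {0, 3, 1, 2}; demoting block 2 afterwards gives {1, 3, 0, 2}, so block 2 becomes the next victim.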

**192.** The *cache\_search* routine looks for a given key  $\alpha$  in a given cache, and returns a cache block if there's a hit; otherwise it returns Λ. If the search hits, the set in which the block was found is stored in global variable *hit\_set*. Notice that we need to check more bits of the tag when we search in the victim area.

```
#define cache_addr(c, alf) c->set[(alf.l & ~(c->tagmask)) >> c->b]
⟨ Internal prototypes 13 ⟩ +≡
  static cacheblock *cache_search ARGS((cache *, octa));
```

**193.**

```
⟨ Subroutines 14 ⟩ +≡
  static cacheblock *cache_search(c, alf)
       cache *c;   /* the cache to be searched */
       octa alf;   /* the key */
  {
    register cacheset s;
    register cacheblock *p;
    s = cache_addr(c, alf);   /* the set corresponding to alf */
    for (p = s; p < s + c->aa; p++)
      if (((p->tag.l ^ alf.l) & c->tagmask) == 0 && p->tag.h == alf.h) goto hit;
    s = c->victim;
    if (!s) return NULL;   /* cache miss, and no victim area */
    for (p = s; p < s + c->vv; p++)
      if (((p->tag.l ^ alf.l) & (-c->bb)) == 0 && p->tag.h == alf.h) goto hit;
    return NULL;   /* double miss */
  hit: hit_set = s; return p;
  }
```
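To make the roles of *tagmask* and *cache\_addr* concrete, here is a minimal sketch that splits the low half of a key into block offset, set index, and tag bits. The structure `key_parts` and the function `split_key` are hypothetical; the sketch uses unsigned 32-bit arithmetic in place of the simulator's `int tagmask`, and it assumes b + c < 32.

```c
#include <stdint.h>

typedef struct {
    uint32_t offset;  /* byte offset inside the block */
    uint32_t set;     /* index into the array of 2^c sets */
    uint32_t tag;     /* bits of the low half that are compared under tagmask */
} key_parts;

static key_parts split_key(uint32_t low, int b, int c)
{
    uint32_t tagmask = ~((UINT32_C(1) << (b + c)) - 1);  /* -2^{b+c}, viewed as unsigned */
    key_parts k;
    k.offset = low & ((UINT32_C(1) << b) - 1);
    k.set = (low & ~tagmask) >> b;   /* same expression as in cache_addr */
    k.tag = low & tagmask;
    return k;
}
```

With 32-byte blocks (b = 5) and 16 sets (c = 4), the low half 0x12345678 has offset 0x18, falls in set 3, and is compared on the bits 0x12345600. A search in the victim area masks with −bb instead, so the set-index bits are compared as well.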


**194.**

```
⟨ Global variables 20 ⟩ +≡
  cacheset hit_set;
```

195. If  $p = cache\_search(c, alf)$  hits and if we call  $use\_and\_fix(c, p)$  immediately afterwards, cache c is updated to record the usage of key alf. A hit in the victim area moves the cache block to the main area, unless the filler routine of cache c is active. A pointer to the (possibly moved) cache block is returned.

```
⟨ Internal prototypes 13 ⟩ +≡
  static cacheblock *use_and_fix ARGS((cache *, cacheblock *));
```

**196.**

```
⟨ Subroutines 14 ⟩ +≡
  static cacheblock *use_and_fix(c, p)
       cache *c;
       cacheblock *p;
  {
    if (hit_set != c->victim) note_usage(p, hit_set, c->aa, c->repl);
    else {
      note_usage(p, hit_set, c->vv, c->vrepl);   /* found in victim cache */
      if (!c->filler.next) {
        register cacheset s = cache_addr(c, p->tag);
        register cacheblock *q = choose_victim(s, c->aa, c->repl);
        note_usage(q, s, c->aa, c->repl);
        ⟨ Swap cache blocks p and q 197 ⟩;
        return q;
      }
    }
    return p;
  }
```

**197.** We can simply permute the pointers inside the cacheblock structures of a cache, instead of copying the data, if we are careful not to let any of those pointers escape into other data structures.

```
⟨ Swap cache blocks p and q 197 ⟩ ≡
  {
    octa t;
    register char *d = p->dirty;
    register octa *dd = p->data;
    t = p->tag; p->tag = q->tag; q->tag = t;
    p->dirty = q->dirty; q->dirty = d;
    p->data = q->data; q->data = dd;
  }
```

This code is used in sections 196 and 205.

**198.** The *demote\_and\_fix* routine is analogous to *use\_and\_fix*, except that we don't want to promote the data we found.

```
⟨Internal prototypes 13⟩ +≡
static cacheblock *demote_and_fix ARGS((cache *, cacheblock *));
```


**199.**

```
⟨ Subroutines 14 ⟩ +≡
  static cacheblock *demote_and_fix(c, p)
       cache *c;
       cacheblock *p;
  {
    if (hit_set != c->victim) demote_usage(p, hit_set, c->aa, c->repl);
    else demote_usage(p, hit_set, c->vv, c->vrepl);
    return p;
  }
```

**200.** The subroutine *load\_cache*(c, p) is called at a moment when c→lock has been set and c→inbuf has been filled with clean data to be placed in the cache block p.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void load_cache ARGS((cache *, cacheblock *));
```

**201.**

```
⟨ Subroutines 14 ⟩ +≡
  static void load_cache(c, p)
       cache *c;
       cacheblock *p;
  {
    register int i;
    register octa *d;
    for (i = 0; i < c->bb >> c->g; i++) p->dirty[i] = false;
    d = p->data; p->data = c->inbuf.data; c->inbuf.data = d;
    p->tag = c->inbuf.tag;
    hit_set = cache_addr(c, p->tag); use_and_fix(c, p);   /* p not moved */
  }
```

**202.** The subroutine *flush\_cache*(c, p, keep) is called at a "quiet" moment when c→flusher.next = Λ. It puts cache block p into c→outbuf and fires up the c→flusher coroutine, which will take care of sending the data to lower levels of the memory hierarchy. Cache block p is also marked clean.

```
⟨ Internal prototypes 13 ⟩ +≡
  static void flush_cache ARGS((cache *, cacheblock *, bool));
```

**203.**

```
⟨ Subroutines 14 ⟩ +≡
  static void flush_cache(c, p, keep)
       cache *c;
       cacheblock *p;   /* a block inside cache c */
       bool keep;       /* should we preserve the data in p? */
  {
    register octa *d;
    register char *dd;
    register int j;
    c->outbuf.tag = p->tag;
    if (keep) for (j = 0; j < c->bb >> 3; j++) c->outbuf.data[j] = p->data[j];
    else d = c->outbuf.data, c->outbuf.data = p->data, p->data = d;
    dd = c->outbuf.dirty, c->outbuf.dirty = p->dirty, p->dirty = dd;
    for (j = 0; j < c->bb >> c->g; j++) p->dirty[j] = false;
    startup(&c->flusher, c->copy_out_time);   /* will not be aborted */
  }
```


**204.** The *alloc\_slot* routine is called when we wish to put new information into a cache after a cache miss. It returns a pointer to a cache block in the main area where the new information should be put. The tag of that cache block is invalidated; the calling routine should take care of filling it and giving it a valid tag in due time. The cache's *filler* routine should not be active when *alloc\_slot* is called.

Inserting new information might also require writing old information into the next level of the memory hierarchy, if the block being replaced is dirty. This routine returns  $\Lambda$  in such cases if the cache is flushing a previously discarded block. Otherwise it schedules the *flusher* coroutine.

This routine returns  $\Lambda$  also if the given key happens to be in the cache. Such cases are rare, but the following scenario shows that they aren't impossible: Suppose the DT-cache access time is 5, the D-cache access time is 1, and two processes simultaneously look for the same physical address. One process hits in DT-cache but misses in D-cache, waiting 5 cycles before trying *alloc\_slot* in the D-cache; meanwhile the other process missed in D-cache but didn't need to use the DT-cache, so it might have updated the D-cache.

A key value is never negative. Therefore we can invalidate the tag in the chosen slot by forcing it to be negative.

```
⟨ Internal prototypes 13 ⟩ +≡
  static cacheblock *alloc_slot ARGS((cache *, octa));
```

**205.**

```
⟨ Subroutines 14 ⟩ +≡
  static cacheblock *alloc_slot(c, alf)
       cache *c;
       octa alf;   /* key that probably isn't in the cache */
  {
    register cacheset s;
    register cacheblock *p, *q;
    if (cache_search(c, alf)) return NULL;
    s = cache_addr(c, alf);   /* the set corresponding to alf */
    if (c->victim) p = choose_victim(c->victim, c->vv, c->vrepl);
    else p = choose_victim(s, c->aa, c->repl);
    if (is_dirty(c, p)) {
      if (c->flusher.next) return NULL;
      flush_cache(c, p, false);
    }
    if (c->victim) {
      q = choose_victim(s, c->aa, c->repl);
      ⟨ Swap cache blocks p and q 197 ⟩;
      q->tag.h |= sign_bit;   /* invalidate the tag */
      return q;
    }
    p->tag.h |= sign_bit; return p;
  }
```


**206.** Simulated memory. How should we deal with the potentially gigantic memory of MMIX? We can't simply declare an array m that has  $2^{48}$  bytes. (Indeed, up to  $2^{63}$  bytes are needed, if we consider also the physical addresses  $\geq 2^{48}$  that are reserved for memory-mapped input/output.)

We could regard memory as a special kind of cache, in which every access is required to hit. For example, such an "M-cache" could be fully associative, with  $2^a$  blocks each having a different tag; simulation could proceed until more than  $2^a - 1$  tags are required. But then the predefined value of a might well be so large that the sequential search of our *cache\_search* routine would be too slow.

Instead, we will allocate memory in chunks of  $2^{16}$  bytes at a time, as needed, and we will use hashing to search for the relevant chunk whenever a physical address is given. If the address is  $2^{48}$  or greater, special routines called  $spec\_read$  and  $spec\_write$ , supplied by the user, will be called upon to do the reading or writing. Otherwise the 48-bit address consists of a 32-bit chunk address and a 16-bit chunk offset.

Chunk addresses that are not used take no space in this simulator. But if, say, 1000 such patterns occur, the simulator will dynamically allocate approximately 65MB for the portions of main memory that are used. Parameter  $mem\_chunks\_max$  specifies the largest number of different chunk addresses that are supported. This parameter does not constrain the range of simulated physical addresses, which cover the entire 256 large-terabyte range permitted by MMIX.

```
⟨ Type definitions 11 ⟩ +≡
typedef struct {
  tetra tag;    /* 32-bit chunk address */
  octa *chunk;    /* either Λ or an array of 2^{13} octabytes */
} chunknode;
```

**207.** The parameter  $hash\_prime$  should be a prime number larger than the parameter  $mem\_chunks\_max$ , preferably more than twice as large but not much bigger than that. The default values  $mem\_chunks\_max = 1000$  and  $hash\_prime = 2003$  are set by  $MMIX\_config$  unless the user specifies otherwise.

```
⟨ External variables 4⟩ +≡
Extern int mem_chunks; /* this many chunks are allocated so far */
Extern int mem_chunks_max; /* up to this many different chunks per run */
Extern int hash_prime; /* larger than mem_chunks_max, but not enormous */
Extern chunknode *mem_hash; /* the simulated main memory */
```

**208.** The separately compiled procedures  $spec\_read()$  and  $spec\_write()$  have the same calling conventions as the general procedures  $mem\_read()$  and  $mem\_write()$ .

```
⟨Subroutines 14⟩ +≡
extern octa spec_read ARGS((octa addr)); /* for memory mapped I/O */
extern void spec_write ARGS((octa addr, octa val)); /* likewise */
```

209. If the program tries to read from a chunk that hasn't been allocated, the value zero is returned, optionally with a comment to the user.

Chunk address 0 is always allocated first. Then we can assume that a matching chunk tag implies a nonnull *chunk* pointer.

This routine sets *last\_h* to the chunk found, so that we can rapidly read other words that we know must belong to the same chunk. For this purpose it is convenient to let *mem\_hash[hash\_prime]* be a chunk full of zeros, representing uninitialized memory.

```
⟨ External prototypes 9 ⟩ +≡
  Extern octa mem_read ARGS((octa addr));
```


**210.**

```
⟨ External routines 10 ⟩ +≡
  octa mem_read(addr)
       octa addr;
  {
    register tetra off, key;
    register int h;
    if (addr.h >= (1 << 16)) return spec_read(addr);
    off = (addr.l & 0xffff) >> 3;
    key = (addr.l & 0xffff0000) + addr.h;
    for (h = key % hash_prime; mem_hash[h].tag != key; h--) {
      if (mem_hash[h].chunk == NULL) {
        if (verbose & uninit_mem_bit)
          errprint2("uninitialized memory read at %08x%08x", addr.h, addr.l);
        h = hash_prime; break;   /* zero will be returned */
      }
      if (h == 0) h = hash_prime;
    }
    last_h = h;
    return mem_hash[h].chunk[off];
  }
```

**211.**

```
⟨ External variables 4 ⟩ +≡
  Extern int last_h;   /* the hash index that was most recently correct */
```

**212.**

```
⟨ External prototypes 9 ⟩ +≡
  Extern void mem_write ARGS((octa addr, octa val));
```

**213.**

```
⟨ External routines 10 ⟩ +≡
  void mem_write(addr, val)
       octa addr, val;
  {
    register tetra off, key;
    register int h;
    if (addr.h >= (1 << 16)) { spec_write(addr, val); return; }
    off = (addr.l & 0xffff) >> 3;
    key = (addr.l & 0xffff0000) + addr.h;
    for (h = key % hash_prime; mem_hash[h].tag != key; h--) {
      if (mem_hash[h].chunk == NULL) {
        if (++mem_chunks > mem_chunks_max)
          panic(errprint1("More than %d memory chunks are needed", mem_chunks_max));
        mem_hash[h].chunk = (octa *) calloc(1 << 13, sizeof(octa));
        if (mem_hash[h].chunk == NULL)
          panic(errprint1("I can't allocate memory chunk number %d", mem_chunks));
        mem_hash[h].tag = key;
        break;
      }
      if (h == 0) h = hash_prime;
    }
    last_h = h;
    mem_hash[h].chunk[off] = val;
  }
```
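The hashing scheme can be tried in isolation with a toy model. The sketch below is not the simulator's code: it uses 64-bit integers in place of `octa`, fixes the default parameters as constants, and reports errors with return codes instead of `panic`. All names beginning with `toy_` are hypothetical. As in the text, chunk address 0 is allocated first, and the extra slot at index `HASH_PRIME` holds a chunk of zeros for reads of unallocated memory.

```c
#include <stdint.h>
#include <stdlib.h>

#define HASH_PRIME  2003         /* default from the text */
#define CHUNKS_MAX  1000         /* default from the text */
#define CHUNK_OCTAS (1 << 13)    /* 2^13 octabytes = 2^16 bytes */

typedef struct { uint32_t tag; uint64_t *chunk; } toy_chunknode;

static toy_chunknode toy_hash[HASH_PRIME + 1];  /* last slot: the all-zero chunk */
static int toy_chunks;

static int toy_mem_init(void)
{
    toy_hash[0].chunk = calloc(CHUNK_OCTAS, sizeof(uint64_t));          /* chunk address 0 */
    toy_hash[HASH_PRIME].chunk = calloc(CHUNK_OCTAS, sizeof(uint64_t)); /* zeros */
    toy_chunks = 1;
    return (toy_hash[0].chunk && toy_hash[HASH_PRIME].chunk) ? 0 : -1;
}

static uint32_t toy_key(uint64_t addr)   /* assumes addr < 2^48 */
{
    return ((uint32_t)addr & 0xffff0000u) + (uint32_t)(addr >> 32);
}

static uint64_t toy_read(uint64_t addr)
{
    uint32_t off = ((uint32_t)addr & 0xffffu) >> 3, key = toy_key(addr);
    int h;
    for (h = (int)(key % HASH_PRIME); toy_hash[h].tag != key; h--) {
        if (toy_hash[h].chunk == NULL) { h = HASH_PRIME; break; }  /* unallocated: read zeros */
        if (h == 0) h = HASH_PRIME;                                /* wrap around */
    }
    return toy_hash[h].chunk[off];
}

static int toy_write(uint64_t addr, uint64_t val)
{
    uint32_t off = ((uint32_t)addr & 0xffffu) >> 3, key = toy_key(addr);
    int h;
    for (h = (int)(key % HASH_PRIME); toy_hash[h].tag != key; h--) {
        if (toy_hash[h].chunk == NULL) {           /* empty slot: allocate the chunk here */
            if (toy_chunks >= CHUNKS_MAX) return -1;
            toy_hash[h].chunk = calloc(CHUNK_OCTAS, sizeof(uint64_t));
            if (toy_hash[h].chunk == NULL) return -1;
            toy_chunks++;
            toy_hash[h].tag = key;
            break;
        }
        if (h == 0) h = HASH_PRIME;
    }
    toy_hash[h].chunk[off] = val;
    return 0;
}
```

After `toy_mem_init()`, a read of an untouched address returns zero without allocating anything; the first write to a new chunk allocates 2^16 bytes, and later reads in the same chunk find it by the same downward linear probe.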


**214.** The memory is characterized by several parameters, depending on the characteristics of the memory bus being simulated. Let  $bus\_words$  be the number of octabytes read or written simultaneously (usually  $bus\_words$  is 1 or 2; it must be a power of 2). The number of clock cycles needed to read or write  $c*bus\_words$  octabytes that all belong to the same cache block is assumed to be  $mem\_addr\_time + c*mem\_read\_time$  or  $mem\_addr\_time + c*mem\_write\_time$ , respectively.
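The timing rule can be written as a one-line formula. The helper below is a hypothetical sketch, not part of the simulator; it assumes at least one octabyte is transferred, that all octabytes are consecutive within one cache block, and that `bus_words` is a power of 2 as required by the text. It counts how many bus-width groups the transfer touches, which plays the role of c in the formula.

```c
/* Hypothetical helper: cycles for one bus transaction that moves n_octas
   consecutive octabytes starting at octabyte offset first_off. */
static int bus_cycles(int addr_time, int xfer_time, int bus_words,
                      int first_off, int n_octas)
{
    int first_group = first_off / bus_words;
    int last_group = (first_off + n_octas - 1) / bus_words;
    int count = last_group - first_group + 1;   /* bus-width transfers needed */
    return addr_time + count * xfer_time;
}
```

For instance, with an address time of 1 cycle, a write time of 4 cycles, and a bus two octabytes wide, writing eight aligned octabytes takes 1 + 4·4 = 17 cycles, while writing two octabytes that straddle a bus-word boundary takes 1 + 2·4 = 9 cycles.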

```
⟨ External variables 4⟩ +≡
Extern int mem_addr_time; /* cycles to transmit an address on memory bus */
Extern int bus_words; /* width of memory bus, in octabytes */
Extern int mem_read_time; /* cycles to read from main memory */
Extern int mem_write_time; /* cycles to write to main memory */
Extern lockvar mem_lock; /* is nonnull when the bus is busy */
```

**215.** One of the principal ways to write memory is to invoke a  $flush\_to\_mem$  coroutine, which is the  $Scache \rightarrow flusher$  if there is an S-cache, or the  $Dcache \rightarrow flusher$  if there is a D-cache but no S-cache.

When such a coroutine is started, its  $data \rightarrow ptr\_a$  will be Scache or Dcache. The data to be written will just have been copied to the cache's outbuf.

```
⟨ Cases for control of special coroutines 126 ⟩ +≡
case flush_to_mem:
{ register cache *c = (cache *) data->ptr_a;
    switch (data->state) {
    case 0: if (mem_lock) wait(1);
        data->state = 1;
    case 1: set_lock(self, mem_lock);
        data->state = 2;
        ⟨ Write the dirty data of c->outbuf and wait for the bus 216 ⟩;
    case 2: goto terminate; /* this frees mem_lock and c->outbuf */
    }
}
```


**216.**

```
⟨ Write the dirty data of c->outbuf and wait for the bus 216 ⟩ ≡
  {
    register int off, last_off, count, first, ii;
    register int del = c->gg >> 3;   /* octabytes per granule */
    octa addr;
    addr = c->outbuf.tag; off = (addr.l & 0xffff) >> 3;
    for (i = j = 0, first = 1, count = 0; j < c->bb >> c->g; j++) {
      ii = i + del;
      if (!c->outbuf.dirty[j]) i = ii, off += del, addr.l += del << 3;
      else while (i < ii) {
        if (first) {
          count++; last_off = off; first = 0;
          mem_write(addr, c->outbuf.data[i]);
        } else {
          if ((off ^ last_off) & (-bus_words)) count++;
          last_off = off;
          mem_hash[last_h].chunk[off] = c->outbuf.data[i];
        }
        i++; off++; addr.l += 8;
      }
    }
    wait(mem_addr_time + count * mem_write_time);
  }
```

This code is used in section 215.


**217.** Cache transfers. We have seen that the *Dcache→flusher* sends data directly to the main memory if there is no S-cache. But if both D-cache and S-cache exist, the *Dcache→flusher* is a more complicated coroutine of type *flush\_to\_S*. In this case we need to deal with the fact that the S-cache blocks might be larger than the D-cache blocks; furthermore, the S-cache might have a write-around and/or write-through policy, etc. But one simplifying fact does help us: We know that the flusher coroutine will not be aborted until it has run to completion.

Some machines, such as the Alpha 21164, have an additional cache between the S-cache and memory, called the B-cache (the "backup cache"). A B-cache could be simulated by extending the logic used here; but such extensions of the present program are left to the interested reader.

```
⟨ Cases for control of special coroutines 126 ⟩ +≡
case flush_to_S:
  { register cache *c = (cache *) data→ptr_a;
    register int block_diff = Scache→bb - c→bb;
    p = (cacheblock *) data→ptr_b;
    switch (data→state) {
    case 0: if (Scache→lock) wait(1);
      data→state = 1;
    case 1: set_lock(self, Scache→lock);
      data→ptr_b = (void *) cache_search(Scache, c→outbuf.tag);
      if (data→ptr_b) data→state = 4;
      else if (Scache→mode & WRITE_ALLOC) data→state = (block_diff ? 2 : 3);
      else data→state = 6;
      wait(Scache→access_time);
    case 2: ⟨ Fill Scache→inbuf with clean memory data 219 ⟩;
    case 3: ⟨ Allocate a slot p in the S-cache 218 ⟩;
      if (block_diff) ⟨ Copy Scache→inbuf to slot p 220 ⟩;
    case 4: copy_block(c, &(c→outbuf), Scache, p);
      hit_set = cache_addr(Scache, c→outbuf.tag); use_and_fix(Scache, p);   /* p not moved */
      data→state = 5; wait(Scache→copy_in_time);
    case 5: if ((Scache→mode & WRITE_BACK) ≡ 0) {   /* write-through */
        if (Scache→flusher.next) wait(1);
        flush_cache(Scache, p, true);
      }
      goto terminate;
    case 6: ⟨ Handle write-around when flushing to the S-cache 221 ⟩;
    }
  }
```

```
218. ⟨ Allocate a slot p in the S-cache 218 ⟩ ≡
  if (Scache→filler.next) wait(1);   /* perhaps an unnecessary precaution? */
  p = alloc_slot(Scache, c→outbuf.tag);
  if (¬p) wait(1);
  data→ptr_b = (void *) p;
  p→tag = c→outbuf.tag; p→tag.l = c→outbuf.tag.l & (-Scache→bb);
```

This code is used in section 217.

**219.** We only need to read *block\_diff* bytes, but it's easier to read them all and to charge only for reading the ones we needed.

```
⟨ Fill Scache→inbuf with clean memory data 219 ⟩ ≡
  { register int count = block_diff ≫ 3;
    register int off, delay;
    octa addr;
    if (mem_lock) wait(1);
    addr.h = c→outbuf.tag.h; addr.l = c→outbuf.tag.l & -Scache→bb;
    off = (addr.l & #ffff) ≫ 3;
    for (j = 0; j < Scache→bb ≫ 3; j++)
      if (j ≡ 0) Scache→inbuf.data[j] = mem_read(addr);
      else Scache→inbuf.data[j] = mem_hash[last_h].chunk[j + off];
    set_lock(&mem_locker, mem_lock);
    delay = mem_addr_time + (int)((count + bus_words - 1)/(bus_words)) * mem_read_time;
    startup(&mem_locker, delay);
    data→state = 3; wait(delay);
  }
```

This code is used in section 217.

```
220. ⟨ Copy Scache→inbuf to slot p 220 ⟩ ≡
  { register octa *d = p→data;
    p→data = Scache→inbuf.data; Scache→inbuf.data = d;
  }
```

This code is used in section 217.

**221.** Here we assume that the granularity is 8.

```
⟨ Handle write-around when flushing to the S-cache 221 ⟩ ≡
  if (Scache→flusher.next) wait(1);
  Scache→outbuf.tag.h = c→outbuf.tag.h;
  Scache→outbuf.tag.l = c→outbuf.tag.l & (-Scache→bb);
  for (j = 0; j < Scache→bb ≫ Scache→g; j++) Scache→outbuf.dirty[j] = false;
  copy_block(c, &(c→outbuf), Scache, &(Scache→outbuf));
  startup(&Scache→flusher, Scache→copy_out_time);
  goto terminate;
```

This code is used in section 217.

**222.** The S-cache gets new data from memory by invoking a *fill\_from\_mem* coroutine; the I-cache or D-cache may also invoke a *fill\_from\_mem* coroutine, if there is no S-cache. When such a coroutine is invoked, it holds *mem\_lock*, and its caller has gone to sleep. A physical memory address is given in *data→z.o*, and *data→ptr\_a* specifies either *Icache* or *Dcache*. Furthermore, *data→ptr\_b* specifies a block within that cache, determined by the *alloc\_slot* routine. The coroutine simulates reading the contents of the specified memory location, places the result in the *x.o* field of its caller's control block, and wakes up the caller. It proceeds to fill the cache's *inbuf* and, ultimately, the specified cache block, before waking the caller again.

Let *c = data→ptr\_a*. The caller is then *c→fill\_lock*, if this variable is nonnull. However, the caller might not wish to be awoken or to receive the data (for example, if it has been aborted). In such cases *c→fill\_lock* will be Λ; the filling action continues without the wakeup calls. If *c = Scache*, the S-cache will be locked and the caller will not have been aborted.

```
⟨ Cases for control of special coroutines 126 ⟩ +≡
case fill_from_mem:
  { register cache *c = (cache *) data→ptr_a;
    register coroutine *cc = c→fill_lock;
    switch (data→state) {
    case 0: data→x.o = mem_read(data→z.o);
      if (cc) {
        cc→ctl→x.o = data→x.o;
        awaken(cc, mem_read_time);
      }
      data→state = 1;
      ⟨ Read data into c→inbuf and wait for the bus 223 ⟩;
    case 1: release_lock(self, mem_lock);
      data→state = 2;
    case 2: if (c ≠ Scache) {
        if (c→lock) wait(1);
        set_lock(self, c→lock);
      }
      if (cc) awaken(cc, c→copy_in_time);   /* the second wakeup call */
      load_cache(c, (cacheblock *) data→ptr_b);
      data→state = 3; wait(c→copy_in_time);
    case 3: goto terminate;
    }
  }
```

**223.** If c's cache size is no larger than the memory bus, we wait an extra cycle, so that there will be two wakeup calls.

```
⟨ Read data into c→inbuf and wait for the bus 223 ⟩ ≡
  { register int count, off;
    c→inbuf.tag = data→z.o; c→inbuf.tag.l &= -c→bb;
    count = c→bb ≫ 3, off = (c→inbuf.tag.l & #ffff) ≫ 3;
    for (i = 0; i < count; i++, off++) c→inbuf.data[i] = mem_hash[last_h].chunk[off];
    if (count ≤ bus_words) wait(1 + mem_read_time);
    else wait((int)(count/bus_words) * mem_read_time);
  }
```

This code is used in section 222.
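
The two delay formulas used so far (the rounded-up bus count of section 219 and the two-case wait of section 223) are easy to check by hand. The sketch below restates them as plain functions; the names `flush_fill_delay` and `fill_wait` are hypothetical, and the timing parameters are passed explicitly instead of being read from the simulator's globals.

```c
/* Section 219: only block_diff bytes are charged for, and the number of
   octabytes is rounded up to a whole number of bus transfers. */
int flush_fill_delay(int block_diff, int bus_words,
                     int mem_addr_time, int mem_read_time)
{
    int count = block_diff >> 3;                 /* octabytes to charge for */
    return mem_addr_time
         + ((count + bus_words - 1) / bus_words) * mem_read_time;
}

/* Section 223: a block that fits in one bus transfer waits one extra
   cycle, so that the caller still receives two separate wakeup calls. */
int fill_wait(int bb, int bus_words, int mem_read_time)
{
    int count = bb >> 3;                         /* octabytes per block */
    if (count <= bus_words) return 1 + mem_read_time;
    return (count / bus_words) * mem_read_time;
}
```

For instance, with a four-octabyte bus and a read time of 10 cycles, a 64-byte block waits 20 cycles, while a 16-byte block waits 11.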

**224.** The *fill\_from\_S* coroutine has the same conventions as *fill\_from\_mem*, except that the data comes directly from the S-cache if it is present there. This is the *filler* coroutine for the I-cache and D-cache if an S-cache is present.

```
⟨ Cases for control of special coroutines 126 ⟩ +≡
case fill_from_S:
  { register cache *c = (cache *) data→ptr_a;
    register coroutine *cc = c→fill_lock;
    p = (cacheblock *) data→ptr_c;
    switch (data→state) {
    case 0: p = cache_search(Scache, data→z.o);
      if (p) goto S_non_miss;
      data→state = 1;
    case 1: ⟨ Start the S-cache filler 225 ⟩;
      data→state = 2; sleep;
    case 2: if (cc) {
        cc→ctl→x.o = data→x.o;   /* this data has been supplied by Scache→filler */
        awaken(cc, Scache→access_time);   /* we propagate it back */
      }
      data→state = 3; sleep;   /* when we awake, the S-cache will have our data */
    S_non_miss: if (cc) {
        cc→ctl→x.o = p→data[(data→z.o.l & (Scache→bb - 1)) ≫ 3];
        awaken(cc, Scache→access_time);
      }
    case 3: ⟨ Copy data from p into c→inbuf 226 ⟩;
      data→state = 4; wait(Scache→access_time);
    case 4: if (c→lock) wait(1);
      set_lock(self, c→lock);
      Scache→lock = Λ;   /* we had been holding that lock */
      load_cache(c, (cacheblock *) data→ptr_b);
      data→state = 5; wait(c→copy_in_time);
    case 5: if (cc) awaken(cc, 1);   /* second wakeup call */
      goto terminate;
    }
  }
```

**225.** We are already holding the *Scache→lock*, but we're about to take on the *Scache→fill\_lock* too (with the understanding that one is "stronger" than the other). For a short time the *Scache→lock* will point to us but we will point to *Scache→fill\_lock*; this will not cause difficulty, because the present coroutine is not abortable.

```
⟨ Start the S-cache filler 225 ⟩ ≡
  if (Scache→filler.next ∨ mem_lock) wait(1);
  p = alloc_slot(Scache, data→z.o);
  if (¬p) wait(1);
  set_lock(&Scache→filler, mem_lock);
  set_lock(self, Scache→fill_lock);
  data→ptr_c = Scache→filler_ctl.ptr_b = (void *) p;
  Scache→filler_ctl.z.o = data→z.o;
  startup(&Scache→filler, mem_addr_time);
```

This code is used in section 224.

**226.** The S-cache blocks might be wider than the blocks of the I-cache or D-cache, so the copying in this step isn't quite trivial.

```
⟨ Copy data from p into c→inbuf 226 ⟩ ≡
  { register int off;
    c→inbuf.tag = data→z.o; c→inbuf.tag.l &= -c→bb;
    for (j = 0, off = (c→inbuf.tag.l & (Scache→bb - 1)) ≫ 3; j < c→bb ≫ 3; j++, off++)
      c→inbuf.data[j] = p→data[off];
    release_lock(self, Scache→fill_lock);
    set_lock(self, Scache→lock);
  }
```

This code is used in section 224.
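
To see why the copy in section 226 "isn't quite trivial," it helps to compute the starting offset for a concrete case. The following sketch is an illustration only: `copy_subblock` is a hypothetical function, blocks are plain arrays of 64-bit words, and both block sizes are assumed to be powers of two with the narrow one no larger than the wide one.

```c
#include <stdint.h>

/* Copy the narrow block (c_bb bytes) that contains byte address addr_l
   out of the wide block (s_bb bytes) that contains the same address.
   Returns the octabyte offset inside the wide block where copying began. */
unsigned copy_subblock(uint64_t *dst, const uint64_t *wide,
                       uint32_t addr_l, uint32_t c_bb, uint32_t s_bb)
{
    uint32_t tag_l = addr_l & ~(c_bb - 1);           /* align to the narrow block */
    unsigned start = (tag_l & (s_bb - 1)) >> 3;      /* octabyte offset in wide block */
    unsigned off = start;
    for (unsigned j = 0; j < (c_bb >> 3); j++, off++)
        dst[j] = wide[off];
    return start;
}
```

With 128-byte S-cache blocks and 32-byte D-cache blocks, address `0x1068` lies in the narrow block starting at `0x1060`, which begins 12 octabytes into its wide block.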

**227.** The instruction PRELD X,\$Y,\$Z generates  $\lfloor X/2^b \rfloor$  commands if there are  $2^b$  bytes per block in the D-cache. These commands will try to preload blocks  $Y + Z, Y + Z + 2^b, \ldots$ , into the cache if it is not too busy.

Similar considerations apply to the instructions PREGO X, \$Y, \$Z and PREST X, \$Y, \$Z.

```
⟨ Special cases of instruction dispatch 117⟩ +≡
case preld: case prest: if (¬Dcache) goto noop_inst;
  if (cool→xx ≥ Dcache→bb) cool→interim = true;
  cool→ptr_a = (void *) mem.up; break;
case prego: if (¬Icache) goto noop_inst;
  if (cool→xx ≥ Icache→bb) cool→interim = true;
  cool→ptr_a = (void *) mem.up; break;
```

**228.** If the block size is 64, a command like PREST 200,\$Y,\$Z is actually issued as four commands PREST 200,\$Y,\$Z; PREST 191,\$Y,\$Z; PREST 127,\$Y,\$Z; PREST 63,\$Y,\$Z. An interruption will then be able to resume properly. In the pipeline, the instruction PREST 200,\$Y,\$Z is considered to affect bytes Y + Z + 192 through Y + Z + 200, or fewer bytes if Y + Z is not a multiple of 64. (Remember that these instructions are only hints; we act on them only if it is reasonably convenient to do so.)

```
⟨ Get ready for the next step of PRELD or PREST 228 ⟩ ≡
  head→inst = (head→inst & ~((Dcache→bb - 1) ≪ 16)) - #10000;
```

This code is used in section 81.

```
229. ⟨ Get ready for the next step of PREGO 229 ⟩ ≡
  head→inst = (head→inst & ~((Icache→bb - 1) ≪ 16)) - #10000;
```

This code is used in section 81.
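
The update in sections 228 and 229 can be verified against the sequence 200, 191, 127, 63 quoted above. The sketch below is a stand-alone restatement with hypothetical names (`next_pre_step`, `x_field`); it assumes a 32-bit instruction word whose X field occupies bits 16 through 23 and a block size that is a power of two.

```c
#include <stdint.h>

/* Round the X field down to a multiple of the block size bb, then
   subtract 1 from it; the other instruction fields are untouched as
   long as the rounded X field is nonzero. */
uint32_t next_pre_step(uint32_t inst, uint32_t bb)
{
    return (inst & ~((bb - 1) << 16)) - 0x10000;
}

/* Extract the X field of an instruction word. */
unsigned x_field(uint32_t inst)
{
    return (inst >> 16) & 0xff;
}
```

Starting from X = 200 with 64-byte blocks, three applications give X = 191, 127, and 63. The dispatch code of section 227 marks a command as interim only while its X field is at least the block size, so the command with X = 63 is the last of the series.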

**230.** Another coroutine, called *cleanup*, is occasionally called into action to remove dirty data from the D-cache and S-cache. If it is invoked by starting in state 0, with its i field set to sync, it will clean everything. It can also be invoked in state 4, with its i field set to syncd and with a physical address in its z.o field; then it simply makes sure that no D-cache or S-cache blocks associated with that address are dirty.

Field x.o.h should be set to zero if items are expected to remain in the cache after being cleaned; otherwise field x.o.h should be set to *sign\_bit*.

The coroutine that invokes *cleanup* should hold *clean\_lock*. If that coroutine dies, because of an interruption, the *cleanup* coroutine will terminate prematurely.

We assume that the D-cache and S-cache have some sort of way to identify their first dirty block, if any, in access\_time cycles.

```
⟨Global variables 20⟩ +≡
coroutine clean_co;
control clean_ctl;
lockvar clean_lock;
```

```
231. ⟨Initialize everything 22⟩ +≡
  clean_co.ctl = & clean_ctl;
  clean_co.name = "Clean";
  clean_co.stage = cleanup;
  clean_ctl.go.o.l = 4;

232. ⟨Cases for control of special coroutines 126⟩ +≡
  case cleanup: p = (cacheblock *) data→ptr_b;
  switch (data→state) {
    ⟨Cases 0 through 4, for the D-cache 233⟩;
    ⟨Cases 5 through 9, for the S-cache 234⟩;
  case 10: goto terminate;
}
```

```
233. ⟨ Cases 0 through 4, for the D-cache 233 ⟩ ≡
case 0: if (Dcache→lock ∨ (j = get_reader(Dcache)) < 0) wait(1);
  startup(&Dcache→reader[j], Dcache→access_time);
  set_lock(self, Dcache→lock);
  i = j = 0;
Dclean_loop: p = (i < Dcache→cc ? &(Dcache→set[i][j]) : &(Dcache→victim[j]));
  if (p→tag.h & sign_bit) goto Dclean_inc;
  if (¬is_dirty(Dcache, p)) {
    p→tag.h |= data→x.o.h; goto Dclean_inc;
  }
  data→y.o.h = i, data→y.o.l = j;
Dclean: data→state = 1; data→ptr_b = (void *) p; wait(Dcache→access_time);
case 1: if (Dcache→flusher.next) wait(1);
  flush_cache(Dcache, p, data→x.o.h ≡ 0);
  p→tag.h |= data→x.o.h;
  release_lock(self, Dcache→lock);
  data→state = 2; wait(Dcache→copy_out_time);
case 2: if (¬clean_lock) goto done;   /* premature termination */
  if (Dcache→flusher.next) wait(1);
  if (data→i ≠ sync) goto Sprep;
  data→state = 3;
case 3: if (Dcache→lock ∨ (j = get_reader(Dcache)) < 0) wait(1);
  startup(&Dcache→reader[j], Dcache→access_time);
  set_lock(self, Dcache→lock);
  i = data→y.o.h, j = data→y.o.l;
Dclean_inc: j++;
  if (i < Dcache→cc ∧ j ≡ Dcache→aa) j = 0, i++;
  if (i ≡ Dcache→cc ∧ j ≡ Dcache→vv) {
    data→state = 5; wait(Dcache→access_time);
  }
  goto Dclean_loop;
case 4: if (Dcache→lock ∨ (j = get_reader(Dcache)) < 0) wait(1);
  startup(&Dcache→reader[j], Dcache→access_time);
  set_lock(self, Dcache→lock);
  p = cache_search(Dcache, data→z.o);
  if (p) {
    demote_and_fix(Dcache, p);
    if (is_dirty(Dcache, p)) goto Dclean;
  }
  data→state = 9; wait(Dcache→access_time);
```

This code is used in section 232.

```
234. ⟨ Cases 5 through 9, for the S-cache 234 ⟩ ≡
case 5: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  if (¬Scache) goto done;
  if (Scache→lock) wait(1);
  set_lock(self, Scache→lock);
  i = j = 0;
Sclean_loop: p = (i < Scache→cc ? &(Scache→set[i][j]) : &(Scache→victim[j]));
  if (p→tag.h & sign_bit) goto Sclean_inc;
  if (¬is_dirty(Scache, p)) {
    p→tag.h |= data→x.o.h; goto Sclean_inc;
  }
  data→y.o.h = i, data→y.o.l = j;
Sclean: data→state = 6; data→ptr_b = (void *) p; wait(Scache→access_time);
case 6: if (Scache→flusher.next) wait(1);
  flush_cache(Scache, p, data→x.o.h ≡ 0);
  p→tag.h |= data→x.o.h;
  release_lock(self, Scache→lock);
  data→state = 7; wait(Scache→copy_out_time);
case 7: if (¬clean_lock) goto done;   /* premature termination */
  if (Scache→flusher.next) wait(1);
  if (data→i ≠ sync) goto done;
  data→state = 8;
case 8: if (Scache→lock) wait(1);
  set_lock(self, Scache→lock);
  i = data→y.o.h, j = data→y.o.l;
Sclean_inc: j++;
  if (i < Scache→cc ∧ j ≡ Scache→aa) j = 0, i++;
  if (i ≡ Scache→cc ∧ j ≡ Scache→vv) {
    data→state = 10; wait(Scache→access_time);
  }
  goto Sclean_loop;
Sprep: data→state = 9;
case 9: if (self→lockloc) release_lock(self, Dcache→lock);
  if (¬Scache) goto done;
  if (Scache→lock) wait(1);
  set_lock(self, Scache→lock);
  p = cache_search(Scache, data→z.o);
  if (p) {
    demote_and_fix(Scache, p);
    if (is_dirty(Scache, p)) goto Sclean;
  }
  data→state = 10; wait(Scache→access_time);
```

This code is used in section 232.
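
The order in which the cleanup loops of sections 233 and 234 visit blocks is determined entirely by the statements at `Dclean_inc` and `Sclean_inc`. The sketch below isolates that stepping rule; `next_block` and `count_blocks` are hypothetical names, and the cache is reduced to its three dimensions: `cc` sets, `aa` blocks per set, and `vv` victim blocks (with `cc` and `aa` assumed to be at least 1).

```c
/* Advance (i, j) to the next block; return 0 when every block has been
   visited.  Sets 0..cc-1 have aa blocks each; the pseudo-set i == cc
   stands for the victim area, which has vv blocks. */
int next_block(int *i, int *j, int cc, int aa, int vv)
{
    (*j)++;
    if (*i < cc && *j == aa) { *j = 0; (*i)++; }
    if (*i == cc && *j == vv) return 0;
    return 1;
}

/* Count the blocks visited when starting from i = j = 0,
   as cases 0 and 5 do. */
int count_blocks(int cc, int aa, int vv)
{
    int i = 0, j = 0, n = 1;
    while (next_block(&i, &j, cc, aa, vv)) n++;
    return n;
}
```

Every ordinary block and every victim block is visited exactly once, so `count_blocks(cc, aa, vv)` equals `cc * aa + vv`.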

**235. Virtual address translation.** Special arrays of coroutines and control blocks come into play when we need to implement MMIX's rather complicated page table mechanism for virtual address translation. In effect, we have up to ten control blocks *outside* of the reorder buffer that are capable of executing instructions just as if they were part of that buffer. The "opcodes" of these non-abortable instructions are special internal operations called *ldptp* and *ldpte*, for loading page table pointers and page table entries.

Suppose, for example, that we need to translate a virtual address for the DT-cache in which the virtual page address $(a_4a_3a_2a_1a_0)_{1024}$ of segment $i$ has $a_4 = a_3 = 0$ and $a_2 \neq 0$. Then the rules say that we should first find a page table pointer $p_2$ in physical location $2^{13}(r+b_i+2)+8a_2$, then another page table pointer $p_1$ in location $p_2 + 8a_1$, and finally the page table entry $p_0$ in location $p_1 + 8a_0$. The simulator achieves this by setting up three coroutines $c_0$, $c_1$, $c_2$ whose control blocks correspond to the pseudo-instructions

```
LDPTP x, [2^{63}+2^{13}(r+b_i+2)], 8a_2
LDPTP x, x, 8a_1
LDPTE x, x, 8a_0
```

where x is a hidden internal register and the other quantities are immediate values. Slight changes to the normal functionality of LDO give us the actions needed to implement LDPTP and LDPTE. Coroutine  $c_j$  corresponds to the instruction that involves  $a_j$  and computes  $p_j$ ; when  $c_0$  has computed its value  $p_0$ , we know how to translate the original virtual address.

The LDPTP and LDPTE commands return zero if their y operand is zero or if the page table does not properly match rV.

```
#define LDPTP PREGO   /* internally this won't cause confusion */
#define LDPTE GO
⟨ Global variables 20 ⟩ +≡
  control IPTctl[5], DPTctl[5];     /* control blocks for I and D page translation */
  coroutine IPTco[10], DPTco[10];   /* each coroutine is a two-stage pipeline */
  char *IPTname[5] = {"IPT0", "IPT1", "IPT2", "IPT3", "IPT4"};
  char *DPTname[5] = {"DPT0", "DPT1", "DPT2", "DPT3", "DPT4"};

236. ⟨ Initialize everything 22 ⟩ +≡
  for (j = 0; j < 5; j++) {
    DPTco[2*j].ctl = &DPTctl[j]; IPTco[2*j].ctl = &IPTctl[j];
    if (j > 0) DPTctl[j].op = IPTctl[j].op = LDPTP, DPTctl[j].i = IPTctl[j].i = ldptp;
    else DPTctl[0].op = IPTctl[0].op = LDPTE, DPTctl[0].i = IPTctl[0].i = ldpte;
    IPTctl[j].loc = DPTctl[j].loc = neg_one;
    IPTctl[j].go.o = DPTctl[j].go.o = incr(neg_one, 4);
    IPTctl[j].ptr_a = DPTctl[j].ptr_a = (void *) &mem;
    IPTctl[j].ren_x = DPTctl[j].ren_x = true;
    IPTctl[j].x.addr.h = DPTctl[j].x.addr.h = -1;
    IPTco[2*j].stage = DPTco[2*j].stage = 1;
    IPTco[2*j+1].stage = DPTco[2*j+1].stage = 2;
    IPTco[2*j].name = IPTco[2*j+1].name = IPTname[j];
    DPTco[2*j].name = DPTco[2*j+1].name = DPTname[j];
  }
  ITcache→filler_ctl.ptr_c = (void *) &IPTco[0]; DTcache→filler_ctl.ptr_c = (void *) &DPTco[0];
```

**237.** Page table calculations are invoked by a coroutine of type *fill\_from\_virt*, which is used to fill the IT-cache or DT-cache. The calling conventions of *fill\_from\_virt* are analogous to those of *fill\_from\_mem* or *fill\_from\_S*: A virtual address is supplied in *data→y.o*, and *data→ptr\_a* points to a cache (*ITcache* or *DTcache*), while *data→ptr\_b* is a block in that cache. We wake up the caller, who holds the cache's *fill\_lock*, as soon as the translation of the given address has been calculated, unless the caller has been aborted. (No second wakeup call is necessary.)

```
⟨ Cases for control of special coroutines 126 ⟩ +≡
case fill_from_virt:
  { register cache *c = (cache *) data→ptr_a;
    register coroutine *cc = c→fill_lock;
    register coroutine *co = (coroutine *) data→ptr_c;   /* &IPTco[0] or &DPTco[0] */
    octa aaaaa;
    switch (data→state) {
    case 0: ⟨ Start up auxiliary coroutines to compute the page table entry 243 ⟩;
      data→state = 1;
    case 1: if (data→b.p) {
        if (data→b.p→known) data→b.o = data→b.p→o, data→b.p = Λ;
        else wait(1);
      }
      ⟨ Compute the new entry for c→inbuf and give the caller a sneak preview 245 ⟩;
      data→state = 2;
    case 2: if (c→lock) wait(1);
      set_lock(self, c→lock);
      load_cache(c, (cacheblock *) data→ptr_b);
      data→state = 3; wait(c→copy_in_time);
    case 3: data→b.o = zero_octa; goto terminate;
    }
  }
```

**238.** The current contents of rV, the special virtual translation register, are kept unpacked in several global variables *page\_r*, *page\_s*, etc., for convenience. Whenever rV changes, we recompute all these variables.

```
⟨ Global variables 20⟩ +≡
int page_n; /* the 10-bit n field of rV, times 8 */
int page_r; /* the 27-bit r field of rV */
int page_s; /* the 8-bit s field of rV */
int page_b[5]; /* the 4-bit b fields of rV; page_b[0] = 0 */
octa page_mask; /* the least significant s bits */
bool page_bad = true; /* does rV violate the rules? */
```

```
239. ⟨ Update the page variables 239 ⟩ ≡
  { octa rv;
    rv = data→z.o;
    page_bad = (rv.l & 7 ? true : false);
    page_n = rv.l & #1ff8;
    rv = shift_right(rv, 13, 1);
    page_r = rv.l & #7ffffff;
    rv = shift_right(rv, 27, 1);
    page_s = rv.l & #ff;
    if (page_s < 13 ∨ page_s > 48) page_bad = true;
    else if (page_s < 32) page_mask.h = 0, page_mask.l = (1 ≪ page_s) - 1;
    else page_mask.h = (1 ≪ (page_s - 32)) - 1, page_mask.l = #ffffffff;
    page_b[4] = (rv.l ≫ 8) & #f;
    page_b[3] = (rv.l ≫ 12) & #f;
    page_b[2] = (rv.l ≫ 16) & #f;
    page_b[1] = (rv.l ≫ 20) & #f;
  }
```

This code is used in section 329.
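
The unpacking in section 239 works on a pair of 32-bit halves and shifts the register in two steps. The same field boundaries can be written against a single 64-bit integer, which makes the layout easier to see. The sketch below is only an illustration: `rv_fields` and `unpack_rv` are hypothetical names, and the bit positions are simply those implied by the shifts and masks of section 239 (three low bits that must be zero, then the 10-bit n, the 27-bit r, the 8-bit s, and four 4-bit b fields).

```c
#include <stdint.h>

typedef struct {
    int n8;          /* the 10-bit n field, times 8 */
    uint32_t r;      /* the 27-bit r field */
    int s;           /* the 8-bit s field */
    int b[5];        /* the 4-bit b fields; b[0] = 0 */
    uint64_t mask;   /* the least significant s bits, or 0 if s is bad */
    int bad;         /* nonzero if the register violates the rules */
} rv_fields;

rv_fields unpack_rv(uint64_t rv)
{
    rv_fields f;
    f.bad = (rv & 7) != 0;                      /* low three bits must be zero */
    f.n8 = (int)(rv & 0x1ff8);                  /* n stays scaled by 8 */
    f.r = (uint32_t)((rv >> 13) & 0x7ffffff);
    f.s = (int)((rv >> 40) & 0xff);
    f.mask = 0;
    if (f.s < 13 || f.s > 48) f.bad = 1;
    else f.mask = (UINT64_C(1) << f.s) - 1;
    f.b[0] = 0;
    f.b[4] = (int)((rv >> 48) & 0xf);
    f.b[3] = (int)((rv >> 52) & 0xf);
    f.b[2] = (int)((rv >> 56) & 0xf);
    f.b[1] = (int)((rv >> 60) & 0xf);
    return f;
}
```

A 64-bit mask replaces the two-case computation of `page_mask.h` and `page_mask.l`, since the shift count never exceeds 48.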

**240.** Here's how we compute a tag of the IT-cache or DT-cache from a virtual address, and how we compute a physical address from a translation found in the cache.

**242.** Cheap (and slow) versions of MMIX leave the page table calculations to software. If the global variable no\_hardware\_PT is set true, fill\_from\_virt begins its actions in state 1, not state 0. (See the RESUME\_TRANS operation.)

⟨ External variables 4 ⟩ +≡ **Extern bool** no\_hardware\_PT;

**243.** Note: The operating system is supposed to ensure that changes to the page table entries do not appear in the pipeline when a translation cache is being updated. The internal LDPTP and LDPTE instructions use only the "hot state" of the memory system.

```
⟨ Start up auxiliary coroutines to compute the page table entry 243 ⟩ ≡
  aaaaa = data→y.o;
  i = aaaaa.h ≫ 29;   /* the segment number */
  aaaaa.h &= #1fffffff;   /* the address within segment i */
  aaaaa = shift_right(aaaaa, page_s, 1);   /* the page address */
  for (j = 0; aaaaa.l ≠ 0 ∨ aaaaa.h ≠ 0; j++) {
    co[2*j].ctl→z.o.h = 0, co[2*j].ctl→z.o.l = (aaaaa.l & #3ff) ≪ 3;
    aaaaa = shift_right(aaaaa, 10, 1);
  }
  if (page_b[i+1] < page_b[i] + j)   /* address too large */
    ;   /* nothing needs to be done, since data→b.o is zero */
  else {
    if (j ≡ 0) j = 1, co[0].ctl→z.o = zero_octa;
    ⟨ Issue j pseudo-instructions to compute a page table entry 244 ⟩;
  }
```

This code is used in section 237.

**244.** The first stage of coroutine $c_j$ is co[2*j]. It will pass the jth control block to the second stage, co[2*j+1], which will load page table information from memory (or hopefully from the D-cache).

```
⟨ Issue j pseudo-instructions to compute a page table entry 244 ⟩ ≡
  j--;
  aaaaa.l = page_r + page_b[i] + j;
  co[2*j].ctl→y.p = Λ;
  co[2*j].ctl→y.o = shift_left(aaaaa, 13);
  co[2*j].ctl→y.o.h += sign_bit;
  for ( ; ; j--) {
    co[2*j].ctl→x.o = zero_octa; co[2*j].ctl→x.known = false;
    co[2*j].ctl→owner = &co[2*j];
    startup(&co[2*j], 1);
    if (j ≡ 0) break;
    co[2*(j-1)].ctl→y.p = &co[2*j].ctl→x;
  }
  data→b.p = &co[0].ctl→x;
```

This code is used in section 243.
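
The digit extraction of section 243 and the base address of section 244 can be checked against the worked example of section 235. The sketch below is a simplified illustration under stated assumptions: `pt_walk`, `split_virtual`, and `first_location` are hypothetical names, the virtual address is assumed nonnegative (so the segment number is at most 3), the "address too large" test against the b fields is omitted, and the sign bit that the simulator adds to the base operand is left out so that the result matches the physical location quoted in the prose.

```c
#include <stdint.h>

typedef struct {
    int segment;         /* i, taken from the top three bits */
    int levels;          /* number of pseudo-instructions needed (at least 1) */
    unsigned digit[6];   /* digit[k] is the radix-1024 digit a_k */
} pt_walk;

/* Split virtual address v into its segment and page-address digits,
   where s is the page-size exponent from rV. */
pt_walk split_virtual(uint64_t v, int s)
{
    pt_walk w = {0, 0, {0}};
    uint64_t page = (v & ((UINT64_C(1) << 61) - 1)) >> s;
    w.segment = (int)(v >> 61);
    while (page != 0) {
        w.digit[w.levels++] = (unsigned)(page & 0x3ff);
        page >>= 10;
    }
    if (w.levels == 0) w.levels = 1;   /* page 0 still needs one LDPTE */
    return w;
}

/* Physical location of the first pointer or entry to be loaded:
   2^13 (r + b_i + top) + 8 a_top, where top = levels - 1. */
uint64_t first_location(const pt_walk *w, uint64_t r, const int b[5])
{
    int top = w->levels - 1;
    return ((r + (uint64_t)b[w->segment] + (uint64_t)top) << 13)
         + 8 * (uint64_t)w->digit[top];
}
```

For a page address with digits a_2 = 3, a_1 = 2, a_0 = 1 in segment 1, three loads are needed, and the first one reads location 2^13 (r + b_1 + 2) + 24.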

**245.** At this point the translation of the given virtual address  $data \neg y.o$  is the octabyte  $data \neg b.o$ . Its least significant three bits are the protection code  $p = p_r p_w p_x$ ; its page address field is scaled by  $2^s$ . It is entirely zero, including the protection bits, if there was a page table failure.

```
⟨ Compute the new entry for c→inbuf and give the caller a sneak preview 245 ⟩ ≡
  c→inbuf.tag = trans_key(data→y.o);
  c→inbuf.data[0] = data→b.o;
  if (cc) {
    cc→ctl→z.o = data→b.o;
    awaken(cc, 1);
  }
```

**246**. The write buffer. The dispatcher has arranged things so that speculative stores into memory are recorded in a doubly linked list leading upward from mem. When such instructions finally are committed, they enter the "write buffer," which holds octabytes that are ready to be written into designated physical memory addresses (or into the D-cache and/or S-cache). The "hot state" of the computation is reflected not only by the registers and caches but also by the instructions that are pending in the write buffer.

```
⟨ Type definitions 11 ⟩ +≡
  typedef struct {
    octa o;              /* data to be stored */
    octa addr;           /* its physical address */
    tetra stamp;         /* when last committed (mod 2^{32}) */
    internal_opcode i;   /* is this write special? */
  } write_node;
```

**247.** We represent the buffer in the usual way as a circular list, with elements *write\_tail* + 1, *write\_tail* + 2, ..., *write\_head*.

The data will sit at least *holding\_time* cycles before it leaves the write buffer. This speeds things up when different fields of the same octabyte are being stored by different instructions.

```
⟨ External variables 4 ⟩ +≡
  Extern write_node *wbuf_bot, *wbuf_top;       /* least and greatest write buffer nodes */
  Extern write_node *write_head, *write_tail;   /* front and rear of the write buffer */
  Extern lockvar wbuf_lock;    /* is the data in write_head being written? */
  Extern int holding_time;     /* minimum holding time */
  Extern lockvar speed_lock;   /* should we ignore holding_time? */

248. ⟨ Global variables 20 ⟩ +≡
  coroutine write_co;   /* coroutine that empties the write buffer */
  control write_ctl;    /* its control block */

249. ⟨ Initialize everything 22 ⟩ +≡
  write_co.ctl = &write_ctl;
  write_co.name = "Write";
  write_co.stage = write_from_wbuf;
  write_ctl.ptr_a = (void *) &mem;
  write_ctl.go.o.l = 4;
  startup(&write_co, 1);
  write_head = write_tail = wbuf_top;

250. ⟨ Internal prototypes 13 ⟩ +≡
  static void print_write_buffer ARGS((void));
```

```
251. ⟨ Subroutines 14 ⟩ +≡
static void print_write_buffer()
{
    printf("Write buffer");
    if (write_head ≡ write_tail) printf(" (empty)\n");
    else { register write_node *p;
        printf(":\n");
        for (p = write_head; p ≠ write_tail; p = (p ≡ wbuf_bot ? wbuf_top : p - 1)) {
            printf("m["); print_octa(p→addr); printf("]="); print_octa(p→o);
            if (p→i ≡ stunc) printf(" unc");
            else if (p→i ≡ sync) printf(" sync");
            printf(" (age %d)\n", ticks.l - p→stamp);
        }
    }
}
```

**252.** The entire present state of the pipeline computation can be visualized by printing first the write buffer, then the reorder buffer, then the fetch buffer. This shows the progression of results from oldest to youngest, from sizzling hot to ice cold.

```
⟨ External prototypes 9⟩ +≡
Extern void print_pipe ARGS((void));

253. ⟨External routines 10⟩ +≡
void print_pipe()
{
    print_write_buffer();
    print_reorder_buffer();
    print_fetch_buffer();
}
```

**254.** The write\_search routine looks to see if any instructions ahead of a given place in the mem list of the reorder buffer are storing into a given physical address, or if there's a pending instruction in the write buffer for that address. If so, it returns a pointer to the value to be written. If not, it returns  $\Lambda$ . If the answer is currently unknown, because at least one possibly relevant physical address has not yet been computed, the subroutine returns the special code value DUNNO.

The search starts at the x.up field of a control block for a store instruction, otherwise at the  $ptr\_a$  field of the control block, unless  $ptr\_a$  points to a committed instruction.

The i field in the write buffer is usually st or pst, inherited from a store or partial store command. It may also be sync (from SYNC 1 or SYNC 3) or stunc (from STUNC).

```
#define DUNNO ((octa *) 1)   /* an impossible non-NULL pointer */
⟨Internal prototypes 13⟩ +≡
static octa *write_search ARGS((control *, octa));
```
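
The three possible answers of *write\_search* (a pointer to the value, Λ, or DUNNO) can be illustrated by a small stand-alone sketch. Only the DUNNO trick itself comes from the program; the type `pending_store`, the function `lookup`, and the flat array that replaces the linked mem list are hypothetical simplifications.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry in a list of pending stores. */
typedef struct {
    uint64_t addr;     /* physical address, meaningful only if addr_known */
    uint64_t value;    /* the octabyte to be stored */
    int addr_known;    /* nonzero once the address has been computed */
} pending_store;

/* Same idea as the program's DUNNO: a non-NULL pointer that can never
   point to a real object, used as a third kind of answer. */
#define DUNNO ((uint64_t *) 1)

/* Search entries n-1, n-2, ..., 0 for a store to the octabyte that
   contains addr; the low three address bits are ignored. */
static uint64_t *lookup(pending_store *s, size_t n, uint64_t addr)
{
    size_t k;
    addr &= ~(uint64_t) 7;
    for (k = n; k > 0; k--) {
        pending_store *p = &s[k - 1];
        if (!p->addr_known) return DUNNO;    /* might match; can't tell yet */
        if ((p->addr & ~(uint64_t) 7) == addr) return &p->value;
    }
    return NULL;                             /* definitely nothing pending */
}
```

A caller must test for DUNNO before dereferencing the result, just as the load/store code does with the real routine.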

```
255. ⟨Subroutines 14⟩ +≡
static octa *write_search(ctl, addr)
    control *ctl;
    octa addr;
{ register specnode *p = (ctl->mem_x ? ctl->x.up : (specnode *) ctl->ptr_a);
    register write_node *q = write_tail;
    addr.l &= -8;
    if (p == &mem) goto qloop;
    if (p > &hot->x && ctl <= hot) goto qloop;   /* already committed */
    if (p < &ctl->x && (ctl <= hot || p > &hot->x)) goto qloop;
    for (; p != &mem; p = p->up) {
        if (p->addr.h == (tetra) -1) return DUNNO;
        if ((p->addr.l & -8) == addr.l && p->addr.h == addr.h)
            return (p->known ? &(p->o) : DUNNO);
    }
qloop: for (;;) {
        if (q == write_head) return NULL;
        if (q == wbuf_top) q = wbuf_bot; else q++;
        if (q->addr.l == addr.l && q->addr.h == addr.h) return &(q->o);
    }
}
```

**256.** When we're committing new data to memory, we can update an existing item in the write buffer if it has the same physical address, unless that item is already in the process of being written out. Increasing the value of *holding\_time* will increase the chance that this economy is possible, but it will also increase the number of buffered items when writes are to different locations.

A store instruction that sets any of the eight interrupt bits rwxnkbsp will not affect memory, even if it doesn't cause an interrupt.

When "store" is followed by "store uncached" at the same address, or vice versa, we believe the most recent hint.

```
⟨Commit to memory if possible, otherwise break 256⟩ ≡
{ register write_node *q = write_tail;
    if (hot->interrupt & (F_BIT + 0xff)) goto done_with_write;
    if (hot->i != sync)
        for (;;) {
            if (q == write_head) break;
            if (q == wbuf_top) q = wbuf_bot; else q++;
            if (q->i == sync) break;
            if (q->addr.l == hot->x.addr.l && q->addr.h == hot->x.addr.h
                    && (q != write_head || !wbuf_lock))
                goto addr_found;
        }
    { register write_node *p = (write_tail == wbuf_bot ? wbuf_top : write_tail - 1);
        if (p == write_head) break;   /* the write buffer is full */
        q = write_tail; write_tail = p;
        q->addr = hot->x.addr;
    }
addr_found: q->o = hot->x.o;
    q->stamp = ticks.l;
    q->i = hot->i;
done_with_write: spec_rem(&(hot->x));
    mem_slots++;
}
```

This code is used in section 146.
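
The buffer discipline used above (items live between *write\_tail* and *write\_head*, the tail moves downward with wraparound, and a store to an address that is already buffered overwrites the pending value) can be tried out in isolation. The following is a simplified sketch under stated assumptions, not the simulator's code: `wbuf`, `wbuf_put`, `wbuf_count`, and the capacity `WBUF_SIZE` are hypothetical names, array indices replace pointers, and the checks for *wbuf\_lock* and for *sync* markers are omitted.

```c
#include <stdint.h>

#define WBUF_SIZE 4   /* hypothetical capacity; one slot always stays free */

typedef struct { uint64_t addr, value; } wnode;

typedef struct {
    wnode slot[WBUF_SIZE];
    int head;   /* index of the oldest item, if any */
    int tail;   /* index of the next free slot; head == tail means empty */
} wbuf;

/* Step an index downward or upward with wraparound. */
static int down(int k) { return k == 0 ? WBUF_SIZE - 1 : k - 1; }
static int up(int k)   { return k == WBUF_SIZE - 1 ? 0 : k + 1; }

/* Returns 1 if the store was absorbed (merged or appended), 0 if the
   buffer is full and the caller must try again later. */
static int wbuf_put(wbuf *w, uint64_t addr, uint64_t value)
{
    int q = w->tail;
    while (q != w->head) {               /* scan from youngest to oldest */
        q = up(q);
        if (w->slot[q].addr == addr) {
            w->slot[q].value = value;    /* coalesce with the pending write */
            return 1;
        }
    }
    if (down(w->tail) == w->head) return 0;   /* the write buffer is full */
    w->slot[w->tail].addr = addr;
    w->slot[w->tail].value = value;
    w->tail = down(w->tail);
    return 1;
}

/* Number of items currently buffered. */
static int wbuf_count(const wbuf *w)
{
    return (w->head - w->tail + WBUF_SIZE) % WBUF_SIZE;
}
```

With this layout a buffer of `WBUF_SIZE` slots holds at most `WBUF_SIZE - 1` items, because `head == tail` is reserved to mean "empty".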

**257.** A special coroutine whose duty is to empty the write buffer is always active. It holds the *wbuf\_lock* while it is writing the contents of *write\_head*. It holds *Dcache→fill\_lock* while waiting for the D-cache to fill a block.

```
⟨Cases for control of special coroutines 126⟩ +≡
case write_from_wbuf: p = (cacheblock *) data->ptr_b;
    switch (data->state) {
    case 4: ⟨Forward the new data past the D-cache if it is write-through 263⟩;
        data->state = 5;
    case 5: if (write_head == wbuf_bot) write_head = wbuf_top; else write_head--;
    write_restart: data->state = 0;
    case 0: if (self->lockloc) *(self->lockloc) = NULL, self->lockloc = NULL;
        if (write_head == write_tail) wait(1);   /* write buffer is empty */
        if (write_head->i == sync) ⟨Ignore the item in write_head 264⟩;
        if (ticks.l - write_head->stamp < holding_time && !speed_lock) wait(1);   /* data too raw */
        if (!Dcache || (write_head->addr.h & 0xffff0000)) goto mem_direct;   /* not cached */
        if (Dcache->lock || (j = get_reader(Dcache)) < 0) wait(1);   /* D-cache busy */
        startup(&Dcache->reader[j], Dcache->access_time);
        ⟨Write the data into the D-cache and set state = 4, if there's a cache hit 262⟩;
        data->state = ((Dcache->mode & WRITE_ALLOC) && write_head->i != stunc ? 1 : 3);
        wait(Dcache->access_time);
    case 1: ⟨Try to put the contents of location write_head->addr into the D-cache 261⟩;
        data->state = 2; sleep;
    case 2: data->state = 0; sleep;   /* wake up when the D-cache has the block */
    case 3: ⟨Handle write-around when writing to the D-cache 259⟩;
    mem_direct: ⟨Write directly from write_head to memory 260⟩;
    }

258. ⟨Local variables 12⟩ +≡
    register cacheblock *p, *q;
```

**259.** The granularity is guaranteed to be 8 in write-around mode (see *MMIX\_config*). Although an uncached store will not be stored in the D-cache (unless it hits in the D-cache), it will go into a secondary cache.

```
⟨Handle write-around when writing to the D-cache 259⟩ ≡
    if (Dcache->flusher.next) wait(1);
    Dcache->outbuf.tag.h = write_head->addr.h;
    Dcache->outbuf.tag.l = write_head->addr.l & (-Dcache->bb);
    for (j = 0; j < Dcache->bb >> Dcache->g; j++) Dcache->outbuf.dirty[j] = false;
    Dcache->outbuf.data[(write_head->addr.l & (Dcache->bb - 1)) >> 3] = write_head->o;
    Dcache->outbuf.dirty[(write_head->addr.l & (Dcache->bb - 1)) >> Dcache->g] = true;
    set_lock(self, wbuf_lock);
    startup(&Dcache->flusher, Dcache->copy_out_time);
    data->state = 5; wait(Dcache->copy_out_time);
This code is used in section 257.
```
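
The index arithmetic in the write-around code is easy to check by hand. Here is a minimal sketch, assuming that *bb* is the block size in bytes (a power of two) and that *g* is the base-2 logarithm of the dirty-bit granularity, which is how the shifts in the code above appear to use them. The function names are hypothetical.

```c
#include <stdint.h>

/* Index of the octabyte within a cache block of bb bytes. */
static unsigned octa_index(uint32_t addr_lo, uint32_t bb)
{
    return (addr_lo & (bb - 1)) >> 3;
}

/* Index of the dirty bit, when one dirty bit covers 2^g bytes. */
static unsigned dirty_index(uint32_t addr_lo, uint32_t bb, unsigned g)
{
    return (addr_lo & (bb - 1)) >> g;
}

/* Low half of the block tag: the address with its offset bits cleared. */
static uint32_t block_tag_lo(uint32_t addr_lo, uint32_t bb)
{
    return addr_lo & (0u - bb);
}
```

When the granularity is 8, as it is guaranteed to be in write-around mode, `g` is 3 and the two indices coincide: each octabyte has its own dirty bit.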

```
260. ⟨Write directly from write_head to memory 260⟩ ≡
    if (mem_lock) wait(1);
    set_lock(self, wbuf_lock);
    set_lock(&mem_locker, mem_lock);   /* a coroutine of type vanish */
    startup(&mem_locker, mem_addr_time + mem_write_time);
    mem_write(write_head->addr, write_head->o);
    data->state = 5; wait(mem_addr_time + mem_write_time);
This code is used in section 257.
```

**261.** A subtlety needs to be mentioned here: While we're trying to update the D-cache, another instruction might be filling the same cache block (although not because of the same physical address). Therefore we **goto**  $write\_restart$  here instead of saying wait(1).

```
⟨Try to put the contents of location write_head->addr into the D-cache 261⟩ ≡
    if (Dcache->filler.next) goto write_restart;
    if ((Scache && Scache->lock) || (!Scache && mem_lock)) goto write_restart;
    p = alloc_slot(Dcache, write_head->addr);
    if (!p) goto write_restart;
    if (Scache) set_lock(&Dcache->filler, Scache->lock)
    else set_lock(&Dcache->filler, mem_lock);
    set_lock(self, Dcache->fill_lock);
    data->ptr_b = Dcache->filler_ctl.ptr_b = (void *) p;
    Dcache->filler_ctl.z.o = write_head->addr;
    startup(&Dcache->filler, Scache ? Scache->access_time : mem_addr_time);
This code is used in section 257.
```

**262.** Here it is assumed that *Dcache→access\_time* is enough to search the D-cache and update one octabyte in case of a hit. The D-cache is not locked, since other coroutines that might be simultaneously reading the D-cache are not going to use the octabyte that changes. Perhaps the simulator is being too lenient here.

```
⟨Write the data into the D-cache and set state = 4, if there's a cache hit 262⟩ ≡
    p = cache_search(Dcache, write_head->addr);
    if (p) {
        p = use_and_fix(Dcache, p);
        set_lock(self, wbuf_lock);
        data->ptr_b = (void *) p;
        p->data[(write_head->addr.l & (Dcache->bb - 1)) >> 3] = write_head->o;
        p->dirty[(write_head->addr.l & (Dcache->bb - 1)) >> Dcache->g] = true;
        data->state = 4; wait(Dcache->access_time);
    }
This code is used in section 257.

263. ⟨Forward the new data past the D-cache if it is write-through 263⟩ ≡
    if ((Dcache->mode & WRITE_BACK) == 0) {   /* write-through */
        if (Dcache->flusher.next) wait(1);
        flush_cache(Dcache, p, true);
    }
This code is used in section 257.
```

```
264. ⟨Ignore the item in write_head 264⟩ ≡
    {
        set_lock(self, wbuf_lock);
        data->state = 5;
        wait(1);
    }
This code is used in section 257.
```

**265.** Loading and storing. A RISC machine is often said to have a "load/store architecture," perhaps because loading and storing are among the most difficult things a RISC machine is called upon to do.

We want memory accesses to be efficient, so we try to access the D-cache at the same time as we are translating a virtual address via the DT-cache. Usually we hit in both caches, but numerous cases must be dealt with when we miss. Is there an elegant way to handle all the contingencies? Alas, the author of this program was unable to think of anything better than to throw lots of code at the problem — knowing full well that such a spaghetti-like approach is fraught with possibilities for error.

Instructions like LDO x, y, z operate in two pipeline stages. The first stage computes the virtual address y + z, waiting if necessary until y and z are both known; then it starts to access the necessary caches. In the second stage we ascertain the corresponding physical address and hopefully find the data in the cache (or in the speculative mem list or the write buffer).

An instruction like STB x, y, z shares some of the computation of LDO x, y, z, because only one byte is being stored but the other seven bytes must be found in the cache. In this case, however, x is treated as an input, and mem is the output. The second stage of a store command can begin even though x is not known during the first stage.

Here's what we do at the beginning of stage 1.

```
#define ld_st_launch 7   /* state when load/store command has its memory address */
⟨Cases to compute the virtual address of a memory operation 265⟩ ≡
case preld: case prest: case prego:
    data->z.o = incr(data->z.o, data->xx & -(data->i == prego ? Icache : Dcache)->bb);
        /* (I hope the adder is fast enough) */
case ld: case ldunc: case ldvts: case st: case syncd: case syncid:
start_ld_st:
    data->y.o = oplus(data->y.o, data->z.o);
    data->state = ld_st_launch; goto switch1;
case ldptp: case ldpte: if (data->y.o.h) goto start_ld_st;
    data->x.o = zero_octa; data->x.known = true; goto die;   /* page table fault */
This code is used in section 132.

266.
#define PRW_BITS (data->i < st ? PR_BIT : data->i == pst ? PR_BIT + PW_BIT : (data->i == syncid && (data->loc.h & sign_bit)) ? 0 : PW_BIT)
⟨Special cases for states in the first stage 266⟩ ≡
case ld_st_launch: if ((self + 1)->next) wait(1);   /* second stage must be clear */
    ⟨Handle special cases for operations like prego and ldvts 289⟩;
    if (data->y.o.h & sign_bit) ⟨Do load/store stage 1 with known physical address 271⟩;
    if (page_bad) {
        if (data->i == st || (data->i < preld && data->i > syncid)) data->interrupt |= PRW_BITS;
        goto fin_ex;
    }
    if (DTcache->lock || (j = get_reader(DTcache)) < 0) wait(1);
    startup(&DTcache->reader[j], DTcache->access_time);
    ⟨Look up the address in the DT-cache, and also in the D-cache if possible 267⟩;
    pass_after(DTcache->access_time); goto passit;
See also sections 310, 326, 360, and 363.
This code is used in section 130.
```

**267.** When stage 2 of a load/store command begins, the state will depend on what transpired in stage 1. For example, *data→state* will be *DT\_miss* if the virtual address key can't be found in the DT-cache; then stage 2 will have to compute the physical address the hard way.

The *data→state* will be *DT\_hit* if the physical address is known via the DT-cache, but the data may or may not be in the D-cache. The *data→state* will be *hit\_and\_miss* if the DT-cache hits and the D-cache doesn't. And *data→state* will be *ld\_ready* if *data→x.o* is the desired octabyte (for example, if both caches hit).

```
#define DT_miss 10        /* second stage state when DT-cache doesn't hold the key */
#define DT_hit 11         /* second stage state when physical address is known */
#define hit_and_miss 12   /* second stage state when D-cache misses */
#define ld_ready 13       /* second stage state when data has been read */
#define st_ready 14       /* second stage state when data needn't be read */
#define prest_win 15      /* second stage state when we can fill a block with zeroes */
⟨Look up the address in the DT-cache, and also in the D-cache if possible 267⟩ ≡
    p = cache_search(DTcache, trans_key(data->y.o));
    if (!Dcache || Dcache->lock || (j = get_reader(Dcache)) < 0 || (data->i >= st && data->i <= syncid))
        ⟨Do load/store stage 1 without D-cache lookup 270⟩;
    startup(&Dcache->reader[j], Dcache->access_time);
    if (p) ⟨Do a simultaneous lookup in the D-cache 268⟩
    else data->state = DT_miss;
```

This code is used in section 266.
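
As a compact summary of the load case, the state handed to stage 2 is determined by three yes/no facts about stage 1. The sketch below is a simplification that ignores stores, pending writes, and the DUNNO case; the function `classify` and its parameters are hypothetical, while the state codes are those defined above.

```c
enum stage2_state { DT_miss = 10, DT_hit = 11, hit_and_miss = 12, ld_ready = 13 };

/* dt_hit:  the DT-cache held the translation;
   d_tried: the D-cache was actually consulted in stage 1;
   d_hit:   the D-cache supplied the desired octabyte. */
static enum stage2_state classify(int dt_hit, int d_tried, int d_hit)
{
    if (!dt_hit) return DT_miss;    /* must translate the hard way */
    if (!d_tried) return DT_hit;    /* address known, data still unknown */
    return d_hit ? ld_ready : hit_and_miss;
}
```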

**268.** We assume that it is possible to look up a virtual address in the DT-cache at the same time as we look for a corresponding physical address in the D-cache, provided that the lower $b+c$ bits of the two addresses are the same. (They will always be the same if $b+c \le page\_s$; otherwise the operating system can try to make them the same by "page coloring" whenever possible.) If both caches hit, the physical address is known in $\max(DTcache{\to}access\_time, Dcache{\to}access\_time)$ cycles.

If the lower $b+c$ bits of the virtual and physical addresses differ, the machine will not know this until the DT-cache has hit. Therefore we simulate the operation of accessing the D-cache, but we go to *DT\_hit* instead of to *hit\_and\_miss* because the D-cache will experience a spurious miss.

```
#define max(x,y) ((x) < (y) ? (y) : (x))
⟨Do a simultaneous lookup in the D-cache 268⟩ ≡
    { octa *m;
        ⟨Update DT-cache usage and check the protection bits 269⟩;
        data->z.o = phys_addr(data->y.o, p->data[0]);
        m = write_search(data, data->z.o);
        if (m == DUNNO) data->state = DT_hit;
        else if (m) data->x.o = *m, data->state = ld_ready;
        else if (Dcache->b + Dcache->c > page_s &&
                ((data->y.o.l ^ data->z.o.l) & ((Dcache->bb << Dcache->c) - (1 << page_s))))
            data->state = DT_hit;   /* spurious D-cache lookup */
        else {
            q = cache_search(Dcache, data->z.o);
            if (q) {
                if (data->i == ldunc) q = demote_and_fix(Dcache, q);
                else q = use_and_fix(Dcache, q);
                data->x.o = q->data[(data->z.o.l & (Dcache->bb - 1)) >> 3];
                data->state = ld_ready;
            } else data->state = hit_and_miss;
        }
        pass_after(max(DTcache->access_time, Dcache->access_time));
        goto passit;
    }
This code is used in section 267.
```
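
The test for a spurious D-cache lookup compares the virtual and physical addresses only in the bit positions that select a cache set but lie above the page offset. A stand-alone sketch of that comparison follows; the function name is hypothetical, and the parameters `b`, `c`, and `page_s` are taken to be base-2 logarithms (of the block size, the number of sets, and the page size), which is an assumption about fields defined elsewhere in the program.

```c
#include <stdint.h>

/* Nonzero if the two addresses disagree in some bit position from
   page_s through b+c-1, the part of the D-cache index that is not
   covered by the page offset. Assumes b + c < 32. */
static int index_bits_differ(uint32_t virt_lo, uint32_t phys_lo,
                             unsigned b, unsigned c, unsigned page_s)
{
    uint32_t mask;
    if (b + c <= page_s) return 0;   /* index lies inside the page offset */
    mask = ((uint32_t) 1 << (b + c)) - ((uint32_t) 1 << page_s);
    return ((virt_lo ^ phys_lo) & mask) != 0;
}
```

With 64-byte blocks (`b = 6`), 256 sets (`c = 8`), and 8K pages (`page_s = 13`), only bit 13 matters, so an operating system that colors pages needs just two colors to avoid every spurious miss.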

**269.** The protection bits  $p_r p_w p_x$  in a translation cache are shifted four positions right from the interrupt codes PR\_BIT, PW\_BIT, PX\_BIT. If the data is protected, we abort the load/store operation immediately; this protects the privacy of other users.

```
⟨Update DT-cache usage and check the protection bits 269⟩ ≡
    p = use_and_fix(DTcache, p);
    j = PRW_BITS;
    if (((p->data[0].l << PROT_OFFSET) & j) != j) {
        if (data->i == syncd || data->i == syncid) goto sync_check;
        if (data->i != preld && data->i != prest) data->interrupt |= j & ~(p->data[0].l << PROT_OFFSET);
        goto fin_ex;
    }
```

This code is used in sections 268, 270, and 272.

```
270. ⟨Do load/store stage 1 without D-cache lookup 270⟩ ≡
    { octa *m;
        if (p) {
            ⟨Update DT-cache usage and check the protection bits 269⟩;
            data->z.o = phys_addr(data->y.o, p->data[0]);
            if (data->i >= st && data->i <= syncid) data->state = st_ready;
            else {
                m = write_search(data, data->z.o);
                if (m && m != DUNNO) data->x.o = *m, data->state = ld_ready;
                else data->state = DT_hit;
            }
        } else data->state = DT_miss;
        pass_after(DTcache->access_time); goto passit;
    }
This code is used in section 267.
```

```
271. ⟨Do load/store stage 1 with known physical address 271⟩ ≡
    { octa *m;
        if (!(data->loc.h & sign_bit)) {
            if (data->i == syncd || data->i == syncid) goto sync_check;
            if (data->i != preld && data->i != prest) data->interrupt |= N_BIT;
            goto fin_ex;
        }
        data->z.o = data->y.o; data->z.o.h -= sign_bit;
        if (data->i >= st && data->i <= syncid) {
            data->state = st_ready; pass_after(1); goto passit;
        }
        m = write_search(data, data->z.o);
        if (m) {
            if (m == DUNNO) data->state = DT_hit;
            else data->x.o = *m, data->state = ld_ready;
        } else if ((data->z.o.h & 0xffff0000) || !Dcache) {
            if (mem_lock) wait(1);
            set_lock(&mem_locker, mem_lock);
            data->x.o = mem_read(data->z.o);
            data->state = ld_ready;
            startup(&mem_locker, mem_addr_time + mem_read_time);
            pass_after(mem_addr_time + mem_read_time); goto passit;
        }
        if (Dcache->lock || (j = get_reader(Dcache)) < 0) {
            data->state = DT_hit; pass_after(1); goto passit;
        }
        startup(&Dcache->reader[j], Dcache->access_time);
        q = cache_search(Dcache, data->z.o);
        if (q) {
            if (data->i == ldunc) q = demote_and_fix(Dcache, q);
            else q = use_and_fix(Dcache, q);
            data->x.o = q->data[(data->z.o.l & (Dcache->bb - 1)) >> 3];
            data->state = ld_ready;
        } else data->state = hit_and_miss;
        pass_after(Dcache->access_time); goto passit;
    }
This code is used in section 266.
```

**272.** The program for the second stage is, likewise, rather long-winded, yet quite similar to the cache manipulations we have already seen several times.

Several instructions might be trying to fill the DT-cache for the same page. (A similar situation faced us in the *write\_from\_wbuf* coroutine.) The second stage therefore needs to do some translation cache searching just as the first stage did. In this stage, however, we don't go all out for speed, because DT-cache misses are rare.

```
#define DT_retry 8   /* second stage state when DT-cache should be searched again */
#define got_DT 9     /* second stage state when DT-cache entry has been computed */
⟨Special cases for states in later stages 272⟩ ≡
square_one: data->state = DT_retry;
case DT_retry: if (DTcache->lock || (j = get_reader(DTcache)) < 0) wait(1);
    startup(&DTcache->reader[j], DTcache->access_time);
    p = cache_search(DTcache, trans_key(data->y.o));
    if (p) {
        ⟨Update DT-cache usage and check the protection bits 269⟩;
        data->z.o = phys_addr(data->y.o, p->data[0]);
        if (data->i >= st && data->i <= syncid) data->state = st_ready;
        else data->state = DT_hit;
    } else data->state = DT_miss;
    wait(DTcache->access_time);
case DT_miss: if (DTcache->filler.next)
        if (data->i == preld || data->i == prest) goto fin_ex; else goto square_one;
    if (no_hardware_PT)
        if (data->i == preld || data->i == prest) goto fin_ex; else goto emulate_virt;
    p = alloc_slot(DTcache, trans_key(data->y.o));
    if (!p) goto square_one;
    data->ptr_b = DTcache->filler_ctl.ptr_b = (void *) p;
    DTcache->filler_ctl.y.o = data->y.o;
    set_lock(self, DTcache->fill_lock);
    startup(&DTcache->filler, 1);
    data->state = got_DT;
    if (data->i == preld || data->i == prest) goto fin_ex; else sleep;
case got_DT: release_lock(self, DTcache->fill_lock);
    j = PRW_BITS;
    if (((data->z.o.l << PROT_OFFSET) & j) != j) {
        if (data->i == syncd || data->i == syncid) goto sync_check;
        data->interrupt |= j & ~(data->z.o.l << PROT_OFFSET);
        goto fin_ex;
    }
    data->z.o = phys_addr(data->y.o, data->z.o);
    if (data->i >= st && data->i <= syncid) goto finish_store;
        /* otherwise we fall through to ld_retry below */
See also sections 273, 276, 279, 280, 299, 311, 354, 364, and 370.
This code is used in section 135.
```

273. The second stage might also want to fill the D-cache (and perhaps the S-cache) as we get the data. Several load instructions might be trying to fill the same cache block. So we should go back and look in the D-cache again if we miss and cannot allocate a slot immediately.

A PRELD or PREST instruction, which is just a "hint," doesn't do anything more if the caches are already busy.

```
⟨Special cases for states in later stages 272⟩ +≡
ld_retry: data->state = DT_hit;
case DT_hit: if (data->i == preld || data->i == prest) goto fin_ex;
    ⟨Check for a hit in pending writes 278⟩;
    if ((data->z.o.h & 0xffff0000) || !Dcache) ⟨Do load/store stage 2 without D-cache lookup 277⟩;
    if (Dcache->lock || (j = get_reader(Dcache)) < 0) wait(1);
    startup(&Dcache->reader[j], Dcache->access_time);
    q = cache_search(Dcache, data->z.o);
    if (q) {
        if (data->i == ldunc) q = demote_and_fix(Dcache, q);
        else q = use_and_fix(Dcache, q);
        data->x.o = q->data[(data->z.o.l & (Dcache->bb - 1)) >> 3];
        data->state = ld_ready;
    } else data->state = hit_and_miss;
    wait(Dcache->access_time);
case hit_and_miss: if (data->i == ldunc) goto avoid_D;
    ⟨Try to get the contents of location data->z.o in the D-cache 274⟩;

274. ⟨Try to get the contents of location data->z.o in the D-cache 274⟩ ≡
    ⟨Check for prest with a fully spanned cache block 275⟩;
    if (Dcache->filler.next) goto ld_retry;
    if ((Scache && Scache->lock) || (!Scache && mem_lock)) goto ld_retry;
    q = alloc_slot(Dcache, data->z.o);
    if (!q) goto ld_retry;
    if (Scache) set_lock(&Dcache->filler, Scache->lock)
    else set_lock(&Dcache->filler, mem_lock);
    set_lock(self, Dcache->fill_lock);
    data->ptr_b = Dcache->filler_ctl.ptr_b = (void *) q;
    Dcache->filler_ctl.z.o = data->z.o;
    startup(&Dcache->filler, Scache ? Scache->access_time : mem_addr_time);
    data->state = ld_ready;
    if (data->i == preld || data->i == prest) goto fin_ex; else sleep;
This code is used in section 273.
```

**275.** If a *prest* instruction makes it to the hot seat, we have been assured by the user of PREST that the current values of bytes in virtual addresses $data{\to}y.o - (data{\to}xx \mathbin{\&} -Dcache{\to}bb)$ through $data{\to}y.o + (data{\to}xx \mathbin{\&} (Dcache{\to}bb - 1))$ are irrelevant. Hence we can pretend that we know they are zero. This is advantageous if it saves us from filling a cache block from the S-cache or from memory.

```
⟨Check for prest with a fully spanned cache block 275⟩ ≡
    if (data->i == prest &&
            (data->xx >= Dcache->bb || ((data->y.o.l & (Dcache->bb - 1)) == 0)) &&
            ((data->y.o.l + (data->xx & (Dcache->bb - 1)) + 1) ^ data->y.o.l) >= Dcache->bb)
        goto prest_span;
This code is used in section 274.
```
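
The spanning test has two parts, and it may help to see them separately. In the sketch below, which is an illustration under stated assumptions rather than part of the simulator, `y` stands for the low half of *data→y.o*, `xx` for the X field of the PREST instruction (one less than the byte count), and `bb` for the block size; the function name and the two local names are hypothetical. As I read the test, the first part says that the range covers the block from its first byte, and the second says that it reaches the last byte.

```c
#include <stdint.h>

/* Nonzero if the bytes y through y + (xx mod bb), together with the
   earlier bytes of the range that precede y when xx >= bb, cover the
   whole cache block containing y. bb must be a power of two. */
static int fully_spanned(uint32_t y, uint32_t xx, uint32_t bb)
{
    int covers_block_start = (xx >= bb) || ((y & (bb - 1)) == 0);
    int reaches_block_end = (((y + (xx & (bb - 1)) + 1) ^ y) >= bb);
    return covers_block_start && reaches_block_end;
}
```

For example, with 64-byte blocks an aligned request for exactly 64 bytes (`xx = 63`) spans its block, but a request for 63 bytes (`xx = 62`) does not, so the block must still be filled in the ordinary way.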

```
276. ⟨Special cases for states in later stages 272⟩ +≡
prest_span: data->state = prest_win;
case prest_win: if (data != old_hot || Dlocker.next) wait(1);
    if (Dcache->lock) goto fin_ex;
    q = alloc_slot(Dcache, data->z.o);   /* OK if Dcache->filler is busy */
    if (q) {
        clean_block(Dcache, q);
        q->tag = data->z.o; q->tag.l &= -Dcache->bb;
        set_lock(&Dlocker, Dcache->lock);
        startup(&Dlocker, Dcache->copy_in_time);
    }
    goto fin_ex;

277. ⟨Do load/store stage 2 without D-cache lookup 277⟩ ≡
    avoid_D: if (mem_lock) wait(1);
        set_lock(&mem_locker, mem_lock);
        startup(&mem_locker, mem_addr_time + mem_read_time);
        data->x.o = mem_read(data->z.o);
        data->state = ld_ready; wait(mem_addr_time + mem_read_time);
This code is used in section 273.

278. ⟨Check for a hit in pending writes 278⟩ ≡
    {
        octa *m = write_search(data, data->z.o);
        if (m == DUNNO) wait(1);
        if (m) {
            data->x.o = *m;
            data->state = ld_ready;
            wait(1);
        }
    }
This code is used in section 273.
```

**279.** The requested octabyte will arrive sooner or later in  $data \rightarrow x.o$ . Then a load instruction is almost done, except that we might need to massage the input a little bit.

```
\langle Special cases for states in later stages 272\rangle + \equiv
case ld\_ready: if (self \neg lockloc) *(self \neg lockloc) = \Lambda, self \neg lockloc = \Lambda;
  if (data \rightarrow i \geq st) goto finish\_store;
   switch (data \neg op \gg 1) {
   case LDB \gg 1: case LDBU \gg 1: j = (data \neg z.o.l \& \#7) \ll 3; i = 56; goto fin\_ld;
   case LDW \gg 1: case LDWU \gg 1: j = (data \neg z.o.l \& \#6) \ll 3; i = 48; goto fin\_ld;
   case LDT \gg 1: case LDTU \gg 1: j = (data \neg z.o.l \& #4) \ll 3; i = 32;
   fin\_ld: data \neg x.o = shift\_right(shift\_left(data \neg x.o, j), i, data \neg op \& #2);
   default: goto fin_ex;
   case LDHT \gg 1: if (data \neg z.o.l \& 4) data \neg x.o.h = data \neg x.o.l;
      data \rightarrow x.o.l = 0; goto fin\_ex;
   case LDSF \gg 1: if (data \neg z.o.l \& 4) data \neg x.o.h = data \neg x.o.l;
     if ((data \rightarrow x.o.h \& \#7f800000) \equiv 0 \land (data \rightarrow x.o.h \& \#7fffff)) {
        data \rightarrow x.o = load\_sf(data \rightarrow x.o.h);
        data \rightarrow state = 3; wait(denin\_penalty);
     else data \neg x.o = load\_sf(data \neg x.o.h); goto fin\_ex;
   case LDPTP \gg 1: if ((data \neg x.o.h \& sign\_bit) \equiv 0 \lor (data \neg x.o.l \& #1ff8) \neq page\_n) data \neg x.o = zero\_octa;
     else data \rightarrow x.o.l \& = -(1 \ll 13);
     goto fin_ex;
   case LDPTE \gg 1: if ((data \neg x.o.l \& #1ff8) \neq page\_n) data \neg x.o = zero\_octa;
     else data \neg x.o = incr(oandn(data \neg x.o, page\_mask), data \neg x.o.l \& #7);
     data \rightarrow x.o.h \&= #ffff; goto fin_ex;
   case UNSAVE \gg 1: (Handle an internal UNSAVE when it's time to load 336);
280.
        \langle Special cases for states in later stages 272\rangle + \equiv
finish\_store: data \neg state = st\_ready;
case st_ready: switch (data-i) {
   case st: case pst: \langle Finish a store command 281\rangle;
   case syncd: data \neg b.o.l = (Dcache ? Dcache \neg bb : 8192); goto do\_syncd;
   case syncid: data \rightarrow b.o.l = (Icache ? Icache \rightarrow bb : 8192);
     if (Dcache \land Dcache \neg bb < data \neg b.o.l) \ data \neg b.o.l = Dcache \neg bb;
     goto do_syncid;
```

104 LOADING AND STORING MMIX-PIPE §281

Store instructions have an extra complication, because some of them need to check for overflow.  $\langle$  Finish a store command 281  $\rangle \equiv$  $data \rightarrow x.addr = data \rightarrow z.o;$ **if**  $(data \rightarrow b.p)$  wait(1); switch  $(data \neg op \gg 1)$  { case STUNC  $\gg 1$ :  $data \rightarrow i = stunc$ ; **default**:  $data \rightarrow x.o = data \rightarrow b.o$ ; **goto**  $fin_ex$ ; case STSF  $\gg 1$ :  $set\_round$ ;  $data \neg b.o.h = store\_sf(data \neg b.o)$ ;  $data \neg interrupt \mid = exceptions;$ if  $((data \rightarrow b.o.h \& #7f800000) \equiv 0 \land (data \rightarrow b.o.h \& #7fffff))$  { if  $(data \neg z.o.l \& 4)$   $data \neg x.o.l = data \neg b.o.h$ ; else  $data \rightarrow x.o.h = data \rightarrow b.o.h$ ;  $data \neg state = 3; wait(denout\_penalty);$ case STHT  $\gg 1$ : if  $(data \neg z.o.l \& 4)$   $data \neg x.o.l = data \neg b.o.h$ ; else  $data \rightarrow x.o.h = data \rightarrow b.o.h$ ; **goto**  $fin_{-}ex$ ; case STB  $\gg 1$ : case STBU  $\gg 1$ :  $j = (data \neg z.o.l \& \#7) \ll 3$ ; i = 56; goto  $fin\_st$ ; case STW  $\gg 1$ : case STWU  $\gg 1$ :  $j = (data \neg z.o.l \& \#6) \ll 3$ ; i = 48; goto  $fin\_st$ ; case STT  $\gg 1$ : case STTU  $\gg 1$ :  $j = (data - z.o.l \& #4) \ll 3$ ; i = 32;  $fin\_st: \langle \text{Insert } data \neg b.o \text{ into the proper field of } data \neg x.o, \text{ checking for arithmetic exceptions if signed } 282 \rangle;$ **goto** fin\_ex; case CSWAP  $\gg 1$ :  $\langle \text{Finish a CSWAP } 283 \rangle$ ; case SAVE  $\gg 1$ : (Handle an internal SAVE when it's time to store 342); This code is used in section 280. 
**282.** ⟨Insert data→b.o into the proper field of data→x.o, checking for arithmetic exceptions if signed 282⟩ ≡

    {
      octa mask;
      if (¬(data→op & 2)) {
        octa before, after;
        before = data→b.o; after = shift_right(shift_left(data→b.o, i), i, 0);
        if (before.l ≠ after.l ∨ before.h ≠ after.h) data→interrupt |= V_BIT;
      }
      mask = shift_right(shift_left(neg_one, i), j, 1);
      data→b.o = shift_right(shift_left(data→b.o, i), j, 1);
      data→x.o.h ⊕= mask.h & (data→x.o.h ⊕ data→b.o.h);
      data→x.o.l ⊕= mask.l & (data→x.o.l ⊕ data→b.o.l);
    }

This code is used in section 281.
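
The field insertion of section 282 can be tried out in isolation. The following is only a sketch under stated assumptions: it uses a flat 64-bit integer in place of the simulator's **octa** pair, and the helper names `insert_field`, `signed_overflow`, and `store_byte` are hypothetical, not part of MMIX-PIPE. As in the text, `i` is 64 minus the field width and `j` is the big-endian bit offset of the field.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: merge the low (64 - i) bits of value into old,
   at big-endian bit offset j. Mirrors x ^= mask & (x ^ b). */
static uint64_t insert_field(uint64_t old, uint64_t value, unsigned i, unsigned j)
{
    uint64_t mask = (~UINT64_C(0) << i) >> j;   /* ones exactly over the field */
    uint64_t placed = (value << i) >> j;        /* value moved into the field */
    return old ^ (mask & (old ^ placed));       /* field bits come from placed */
}

/* True if value does not fit into a signed field of (64 - i) bits:
   shift left, sign-extend back, and compare with the original. */
static bool signed_overflow(uint64_t value, unsigned i)
{
    uint64_t t = value << i;                                /* field at the top */
    uint64_t fill = (t >> 63) ? ~(~UINT64_C(0) >> i) : 0;   /* sign extension */
    return ((t >> i) | fill) != value;
}

/* A byte store: field width 8, so i = 56; offset taken from the address. */
static uint64_t store_byte(uint64_t old, uint64_t addr, uint64_t value)
{
    return insert_field(old, value, 56, (unsigned)(addr & 7) << 3);
}
```

Storing the byte `0xAB` at an address whose low three bits are 2 replaces only the third byte from the left of the octabyte; the value 128 overflows a signed byte, while -128 does not.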


**283.** The CSWAP operation has four inputs (\$X,\$Y,\$Z,rP) as well as three outputs  $(\$X,M_8[A],rP)$ . To keep from exceeding the capacity of the control blocks in our pipeline, we wait until this instruction reaches the hot seat, thereby allowing us non-speculative access to rP.

```
⟨Finish a CSWAP 283⟩ ≡
  if (data ≠ old_hot) wait(1);
  if (data→x.o.h ≡ g[rP].o.h ∧ data→x.o.l ≡ g[rP].o.l) {
    data→a.o.l = 1;   /* data→a.o.h is zero */
    data→x.o = data→b.o;
  } else {
    g[rP].o = data→x.o;   /* data→a.o is zero */
    if (verbose & issue_bit) {
      printf("␣setting␣rP="); print_octa(g[rP].o); printf("\n");
    }
  }
  data→i = cswap;   /* cosmetic change, affects the trace output only */
  goto fin_ex;
```

This code is used in section 281.
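
Stripped of all pipeline bookkeeping, the effect of CSWAP on its three outputs can be modeled in a few lines. This is an illustrative model only, assuming a flat 64-bit representation; the struct `cswap_state` and its field names are hypothetical stand-ins for $M_8[A]$, \$X, and rP.

```c
#include <stdint.h>

/* Hypothetical model: mem stands for the octabyte M8[A], x for register $X,
   and rp for the prediction register rP. */
typedef struct { uint64_t mem, x, rp; } cswap_state;

static void cswap(cswap_state *s)
{
    if (s->mem == s->rp) {   /* prediction correct: store $X, report success */
        s->mem = s->x;
        s->x = 1;
    } else {                 /* prediction wrong: remember what memory held */
        s->rp = s->mem;
        s->x = 0;
    }
}
```

In the simulator's code the octabyte read from memory arrives in *data→x.o*, the old value of \$X in *data→b.o*, and the new value of \$X is formed in *data→a.o*.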


284. The fetch stage. Now that we've mastered the most difficult memory operations, we can relax and apply our knowledge to the slightly simpler task of filling the fetch buffer. Fetching is like loading/storing, except that we use the I-cache instead of the D-cache. It's slightly simpler because the I-cache is read-only. Further simplifications would be possible if there were no PREGO instruction, because there is only one fetch unit. However, we want to implement PREGO with reasonable efficiency, in order to see if that instruction is worthwhile; so we include the complications of simultaneous I-cache and IT-cache readers, which we have already implemented for the D-cache and DT-cache.

The fetch coroutine is always present, as the one and only coroutine with stage number zero.

In normal circumstances, the fetch coroutine accesses a cache block containing the instruction whose virtual address is given by *inst\_ptr* (the instruction pointer), and transfers up to *fetch\_max* instructions from that block to the fetch buffer. Complications arise if the instruction isn't in the cache, or if we can't translate the virtual address because of a miss in the IT-cache. Moreover, *inst\_ptr* is a **spec** variable whose value might not even be known; if *inst\_ptr.p* is nonnull, we don't know what to fetch.

```
⟨External variables 4⟩ +≡

Extern spec inst_ptr; /* the instruction pointer (aka program counter) */

Extern octa *fetched; /* buffer for incoming instructions */
```

**285.** The fetch coroutine usually begins a cycle in state  $fetch\_ready$ , with the most recently fetched octabytes in positions  $fetch\_lo$ ,  $fetch\_lo + 1$ , ...,  $fetch\_hi - 1$  of a buffer called fetched. Once that buffer has been exhausted, the coroutine reverts to state 0; with luck, the buffer might have more data by the time the next cycle rolls around.

```
⟨Global variables 20⟩ +≡
int fetch_lo, fetch_hi; /* the active region of that buffer */
coroutine fetch_co;
control fetch_ctl;
286. ⟨Initialize everything 22⟩ +≡
fetch_co.ctl = &fetch_ctl;
fetch_co.name = "Fetch";
fetch_ctl.go.o.l = 4;
startup(&fetch_co,1);
287. ⟨Restart the fetch coroutine 287⟩ ≡
if (fetch_co.lockloc) *(fetch_co.lockloc) = Λ, fetch_co.lockloc = Λ;
unschedule(&fetch_co);
startup(&fetch_co,1);
This code is used in sections 85, 160, 308, 309, and 316.
```


**288.** Some of the actions here are done not only by the fetcher but also by the first and second stages of a *prego* operation.

```
#define wait_or_pass(t)
          if (data→i ≡ prego) { pass_after(t); goto passit; }
          else wait(t)
⟨Simulate an action of the fetch coroutine 288⟩ ≡
switch0: switch (data→state) {
  new_fetch: data→state = 0;
  case 0: ⟨Wait, if necessary, until the instruction pointer is known 290⟩;
     data→y.o = inst_ptr.o;
     data→state = 1; data→interrupt = 0; data→x.o = data→z.o = zero_octa;
  case 1: start_fetch: if (data→y.o.h & sign_bit) ⟨Begin fetch with known physical address 296⟩;
     if (page_bad) goto bad_fetch;
     if (ITcache→lock ∨ (j = get_reader(ITcache)) < 0) wait(1);
     startup(&ITcache→reader[j], ITcache→access_time);
     ⟨Look up the address in the IT-cache, and also in the I-cache if possible 291⟩;
     wait_or_pass(ITcache→access_time);
     ⟨Other cases for the fetch coroutine 298⟩
  }
This code is used in section 125.
289. ⟨Handle special cases for operations like prego and ldvts 289⟩ ≡
  if (data→i ≡ prego) goto start_fetch;
See also section 352.
This code is used in section 266.
290. ⟨Wait, if necessary, until the instruction pointer is known 290⟩ ≡
  if (inst_ptr.p) {
     if (inst_ptr.p ≠ UNKNOWN_SPEC ∧ inst_ptr.p→known) inst_ptr.o = inst_ptr.p→o, inst_ptr.p = Λ;
     wait(1);
  }
This code is used in section 288.
291. #define got_IT 19          /* state when IT-cache entry has been computed */
#define IT_miss 20              /* state when IT-cache doesn't hold the key */
#define IT_hit 21               /* state when physical instruction address is known */
#define Ihit_and_miss 22        /* state when I-cache misses */
#define fetch_ready 23          /* state when instructions have been read */
#define got_one 24              /* state when a "preview" octabyte is ready */
⟨Look up the address in the IT-cache, and also in the I-cache if possible 291⟩ ≡
  p = cache_search(ITcache, trans_key(data→y.o));
  if (¬Icache ∨ Icache→lock ∨ (j = get_reader(Icache)) < 0) ⟨Begin fetch without I-cache lookup 295⟩;
  startup(&Icache→reader[j], Icache→access_time);
  if (p) ⟨Do a simultaneous lookup in the I-cache 292⟩
  else data→state = IT_miss;
This code is used in section 288.
```


**292.** We assume that it is possible to look up a virtual address in the IT-cache at the same time as we look for a corresponding physical address in the I-cache, provided that the lower b+c bits of the two addresses are the same. (See the remarks about "page coloring," when we made similar assumptions about the DT-cache and D-cache.)

```
⟨Do a simultaneous lookup in the I-cache 292⟩ ≡
     ⟨Update IT-cache usage and check the protection bits 293⟩;
     data→z.o = phys_addr(data→y.o, p→data[0]);
     if (Icache→b + Icache→c > page_s ∧
              ((data→y.o.l ⊕ data→z.o.l) & ((Icache→bb ≪ Icache→c) - (1 ≪ page_s)))) data→state = IT_hit;
           /* spurious I-cache lookup */
     else {
        q = cache_search(Icache, data→z.o);
        if (q) {
           q = use_and_fix(Icache, q);
           ⟨Copy the data from block q to fetched 294⟩;
           data→state = fetch_ready;
        } else data→state = Ihit_and_miss;
     }
     wait_or_pass(max(ITcache→access_time, Icache→access_time));
This code is used in section 291.
293. ⟨Update IT-cache usage and check the protection bits 293⟩ ≡
  p = use_and_fix(ITcache, p);
  if (¬(p→data[0].l & (PX_BIT))) goto bad_fetch;
This code is used in sections 292 and 295.
294. At this point inst_ptr.o equals data→y.o.
⟨Copy the data from block q to fetched 294⟩ ≡
  if (data→i ≠ prego) {
     for (j = 0; j < Icache→bb; j++) fetched[j] = q→data[j];
     fetch_lo = (inst_ptr.o.l & (Icache→bb - 1)) ≫ 3;
     fetch_hi = Icache→bb ≫ 3;
  }
This code is used in sections 292 and 296.
295. ⟨Begin fetch without I-cache lookup 295⟩ ≡
     if (p) {
         ⟨Update IT-cache usage and check the protection bits 293⟩;
         data→z.o = phys_addr(data→y.o, p→data[0]);
         data→state = IT_hit;
      } else data→state = IT_miss;
      wait_or_pass(ITcache→access_time);
This code is used in section 291.
```
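
The test for a "spurious I-cache lookup" in section 292 compares the virtual and physical addresses only in the index bits that lie above the page offset. The sketch below restates that test on its own, assuming that `b + c` is less than 32 and that only the low tetrabyte of each address matters; the function name `lookup_was_spurious` is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* b: log2 of the block size in bytes; b + c: number of low-order address
   bits used to locate a block in the cache; page_s: log2 of the page size.
   Hypothetical helper for illustration only. */
static bool lookup_was_spurious(uint32_t vaddr_lo, uint32_t paddr_lo,
                                unsigned b, unsigned c, unsigned page_s)
{
    if (b + c <= page_s) return false;  /* all index bits lie in the page offset */
    /* bits page_s .. b+c-1: index bits that translation is able to change */
    uint32_t index_bits_above_page =
        (UINT32_C(1) << (b + c)) - (UINT32_C(1) << page_s);
    return ((vaddr_lo ^ paddr_lo) & index_bits_above_page) != 0;
}
```

With 64-byte blocks, `c = 8`, and 8K pages, only bit 13 can differ between the two addresses; if it does, the early lookup consulted the wrong part of the cache and must be repeated with the physical address.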


```
296. ⟨Begin fetch with known physical address 296⟩ ≡
  {
     if (data→i ≡ prego ∧ ¬(data→loc.h & sign_bit)) goto fin_ex;
     data→z.o = data→y.o; data→z.o.h -= sign_bit;
  known_phys: if (data→z.o.h & #ffff0000) goto bad_fetch;
     if (¬Icache) ⟨Read from memory into fetched 297⟩;
     if (Icache→lock ∨ (j = get_reader(Icache)) < 0) {
        data→state = IT_hit; wait_or_pass(1);
     }
     startup(&Icache→reader[j], Icache→access_time);
     q = cache_search(Icache, data→z.o);
     if (q) {
        q = use_and_fix(Icache, q);
        ⟨Copy the data from block q to fetched 294⟩;
        data→state = fetch_ready;
     } else data→state = Ihit_and_miss;
     wait_or_pass(Icache→access_time);
  }
This code is used in section 288.
297. ⟨Read from memory into fetched 297⟩ ≡
  { octa addr;
     addr = data→z.o;
     if (mem_lock) wait(1);
     set_lock(&mem_locker, mem_lock);
     startup(&mem_locker, mem_addr_time + mem_read_time);
     addr.l &= -(bus_words ≪ 3);
     fetched[0] = mem_read(addr);
     for (j = 1; j < bus_words; j++) fetched[j] = mem_hash[last_h].chunk[((addr.l & #ffff) ≫ 3) + j];
     fetch_lo = (data→z.o.l ≫ 3) & (bus_words - 1); fetch_hi = bus_words;
     data→state = fetch_ready;
     wait(mem_addr_time + mem_read_time);
  }
This code is used in section 296.
```


```
298. ⟨Other cases for the fetch coroutine 298⟩ ≡
case IT_miss: if (ITcache→filler.next)
     if (data→i ≡ prego) goto fin_ex; else wait(1);
  if (no_hardware_PT) ⟨Insert dummy instruction for page table emulation 302⟩;
  p = alloc_slot(ITcache, trans_key(data→y.o));
  if (¬p)   /* hey, it was present after all */
     if (data→i ≡ prego) goto fin_ex; else goto new_fetch;
  data→ptr_b = ITcache→filler_ctl.ptr_b = (void *) p;
  ITcache→filler_ctl.y.o = data→y.o;
  set_lock(self, ITcache→fill_lock);
  startup(&ITcache→filler, 1);
  data→state = got_IT;
  if (data→i ≡ prego) goto fin_ex; else sleep;
case got_IT: release_lock(self, ITcache→fill_lock);
  if (¬(data→z.o.l & (PX_BIT ≫ PROT_OFFSET))) goto bad_fetch;
  data→z.o = phys_addr(data→y.o, data→z.o);
fetch_retry: data→state = IT_hit;
case IT_hit: if (data→i ≡ prego) goto fin_ex; else goto known_phys;
case Ihit_and_miss: ⟨Try to get the contents of location data→z.o in the I-cache 300⟩;
See also section 301.
This code is used in section 288.
299. ⟨Special cases for states in later stages 272⟩ +≡
case IT_miss: case Ihit_and_miss: case IT_hit: case fetch_ready: goto switch0;
300. ⟨Try to get the contents of location data→z.o in the I-cache 300⟩ ≡
  if (Icache→filler.next) goto fetch_retry;
  if ((Scache ∧ Scache→lock) ∨ (¬Scache ∧ mem_lock)) goto fetch_retry;
  q = alloc_slot(Icache, data→z.o);
  if (¬q) goto fetch_retry;
  if (Scache) set_lock(&Icache→filler, Scache→lock)
  else set_lock(&Icache→filler, mem_lock);
  set_lock(self, Icache→fill_lock);
  data→ptr_b = Icache→filler_ctl.ptr_b = (void *) q;
  Icache→filler_ctl.z.o = data→z.o;
  startup(&Icache→filler, Scache ? Scache→access_time : mem_addr_time);
  data→state = got_one;
  if (data→i ≡ prego) goto fin_ex; else sleep;
This code is used in section 298.
```


**301.** The I-cache filler will wake us up with the octabyte we want, before it has filled the entire cache block. In that case we can fetch one or two instructions before the rest of the block has been loaded.

```
⟨Other cases for the fetch coroutine 298⟩ +≡
bad_fetch: if (data→i ≡ prego) goto fin_ex;
  data→interrupt |= PX_BIT;
swym_one: fetched[0].h = fetched[0].l = SWYM ≪ 24;
  goto fetch_one;
case got_one: fetched[0] = data→x.o;   /* a "preview" of the new cache data */
fetch_one: fetch_lo = 0; fetch_hi = 1;
  data→state = fetch_ready;
case fetch_ready: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  if (data→i ≡ prego) goto fin_ex;
  for (j = 0; j < fetch_max; j++) {
     register fetch *new_tail;
     if (tail ≡ fetch_bot) new_tail = fetch_top;
     else new_tail = tail - 1;
     if (new_tail ≡ head) break;   /* fetch buffer is full */
     ⟨Install a new instruction into the tail position 304⟩;
     tail = new_tail;
     if (sleepy) {
        sleepy = false; sleep;
     }
     inst_ptr.o = incr(inst_ptr.o, 4);
     if (fetch_lo ≡ fetch_hi) goto new_fetch;
  }
  wait(1);
302. ⟨Insert dummy instruction for page table emulation 302⟩ ≡
     if (cache_search(ITcache, trans_key(inst_ptr.o))) goto new_fetch;
     data→interrupt |= F_BIT;
     sleepy = true;
     goto swym_one;
This code is used in section 298.
303. ⟨Global variables 20⟩ +≡
  bool sleepy;   /* have we just emitted the page table emulation call? */
304. At this point we check for egregiously invalid instructions. (Sometimes the dispatcher will actually
allow such instructions to occupy the fetch buffer, for internally generated commands.)
⟨Install a new instruction into the tail position 304⟩ ≡
  tail→loc = inst_ptr.o;
  if (inst_ptr.o.l & 4) tail→inst = fetched[fetch_lo++].l;
  else tail→inst = fetched[fetch_lo].h;
  tail→interrupt = data→interrupt;
  i = tail→inst ≫ 24;
  if (i ≥ RESUME ∧ i ≤ SYNC ∧ (tail→inst & bad_inst_mask[i - RESUME])) tail→interrupt |= B_BIT;
  tail→noted = false;
  if (inst_ptr.o.l ≡ breakpoint.l ∧ inst_ptr.o.h ≡ breakpoint.h) breakpoint_hit = true;
This code is used in section 301.
```
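
The address arithmetic used by the fetcher is compact enough to check by hand. The following sketch, with hypothetical helper names and a locally declared two-tetrabyte `octa`, shows how the low address bits choose first an octabyte within a block and then one of the two instructions inside that octabyte.

```c
#include <stdint.h>

typedef struct { uint32_t h, l; } octa;   /* high and low tetrabytes */

/* Index of the octabyte that holds addr within a block of block_bytes
   bytes; block_bytes is assumed to be a power of two. */
static unsigned octa_index(uint32_t addr_lo, uint32_t block_bytes)
{
    return (addr_lo & (block_bytes - 1)) >> 3;
}

/* Bit 2 of the address selects the low half of the octabyte. */
static uint32_t pick_tetra(octa o, uint32_t addr_lo)
{
    return (addr_lo & 4) ? o.l : o.h;
}

/* The opcode is the leading byte of the instruction. */
static unsigned opcode_of(uint32_t inst)
{
    return inst >> 24;
}
```

For a 64-byte block, the instruction at an address ending in `#28` is the high tetrabyte of octabyte 5, and its successor at `#2c` is the low tetrabyte of the same octabyte; only after the low half has been consumed does *fetch_lo* advance.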


305. The commands RESUME, SAVE, UNSAVE, and SYNC should not have nonzero bits in the positions defined here.

```
⟨Global variables 20⟩ +≡
  int bad_inst_mask[4] = {#fffffe, #ffff, #ffff00, #fffff8};
```
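
A standalone version of the validity test may make the masks easier to read. It is a sketch that assumes the four table entries shown in the code, one per opcode in the order RESUME, SAVE, UNSAVE, SYNC, together with the standard MMIX opcode values `#f9` through `#fc`; the function name `egregiously_invalid` is hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

enum { RESUME = 0xF9, SAVE = 0xFA, UNSAVE = 0xFB, SYNC = 0xFC };  /* MMIX opcodes */

/* Bits that must be zero in RESUME, SAVE, UNSAVE, and SYNC respectively
   (assumed values, one mask per opcode). */
static const uint32_t bad_inst_mask[4] = {0xfffffe, 0xffff, 0xffff00, 0xfffff8};

static bool egregiously_invalid(uint32_t inst)
{
    unsigned op = inst >> 24;
    return op >= RESUME && op <= SYNC
        && (inst & bad_inst_mask[op - RESUME]) != 0;
}
```

Thus `RESUME 1` and `SYNC 7` pass the test, while `RESUME 2` and `SYNC 8` are flagged; instructions outside the four opcodes are never flagged here.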


**306.** Interrupts. The scariest thing about the design of a pipelined machine is the existence of interrupts, which disrupt the smooth flow of a computation in ways that are difficult to anticipate. Fortunately, however, the discipline of a reorder buffer, which forces instructions to be committed in order, allows us to deal with interrupts in a fairly natural way. Our solution to the problems of dynamic scheduling and speculative execution therefore solves the interrupt problem as well.

MMIX has three kinds of interrupts, which show up as bit codes in the *interrupt* field when an instruction is ready to be committed: H\_BIT invokes a trip handler, for TRIP instructions and arithmetic exceptions; F\_BIT invokes a forced-trap handler, for TRAP instructions and unimplemented instructions that need to be emulated in software; E\_BIT invokes a dynamic-trap handler, for external interrupts like I/O signals or for internal interrupts caused by improper instructions. In all three cases, the pipeline control has already been redirected to fetch new instructions starting at the correct handler address by the time an interrupted instruction is ready to be committed.

**307.** Most instructions come to the following part of the program, if they have finished execution with any 1s among the eight trip bits or the eight trap bits.

If the trip bits aren't all zero, we want to update the event bits of rA, or perform an enabled trip handler, or both. If the trap bits are nonzero, we need to hold onto them until we get to the hot seat, when they will be joined with the bits of rQ and probably cause an interrupt. A load or store instruction with nonzero trap bits will be nullified, not committed.

Underflow that is exact and not enabled is ignored, in accordance with the IEEE standard conventions. (This applies also to underflow triggered by RESUME\_SET.)

```
#define is_load_store(i) (i ≥ ld ∧ i ≤ cswap)
⟨Handle interrupt at end of execution stage 307⟩ ≡
{
  if ((data→interrupt & #ff) ∧ is_load_store(data→i)) goto state_5;
  j = data→interrupt & #ff00;
  data→interrupt -= j;
  if ((j & (U_BIT + X_BIT)) ≡ U_BIT ∧ ¬(data→ra.o.l & U_BIT)) j &= ∼U_BIT;
  data→arith_exc = (j & ∼data→ra.o.l) ≫ 8;
  if (j & data→ra.o.l) ⟨Prepare for exceptional trip handler 308⟩;
  if (data→interrupt & #ff) goto state_5;
}
```

**308.** Since execution is speculative, an exceptional condition might not be part of the "real" computation. Indeed, the present coroutine might have already been deissued.

```
⟨Prepare for exceptional trip handler 308⟩ ≡
{
  i = issued_between(data, cool);
  if (i < deissues) goto die;
  deissues = i;
  old_tail = tail = head; resuming = 0;
  ⟨Restart the fetch coroutine 287⟩;
  cool_hist = data→hist;
  for (i = j & data→ra.o.l, m = 16; ¬(i & D_BIT); i ≪= 1, m += 16);
  data→go.o.h = 0, data→go.o.l = m;
  inst_ptr.o = data→go.o, inst_ptr.p = Λ;
  data→interrupt |= H_BIT;
  goto state_4;
}
```

This code is used in section 144.

This code is used in section 307.
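
The way sections 307 and 308 split the trip bits between rA's event bits and a trip handler can be isolated as follows. This is a sketch under assumptions: the eight trip bits are taken to occupy bits 8 through 15 of the *interrupt* field, in the same positions as the enable bits of rA, with D_BIT highest; the special treatment of exact underflow is omitted; and the names `trip_outcome` and `classify_trips` are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions: trip bits in bits 8..15, D_BIT highest,
   X_BIT lowest, aligned with the enable bits of rA. */
enum { X_BIT = 1 << 8, D_BIT = 1 << 15 };

typedef struct {
    uint32_t event_bits;   /* to be recorded in the low byte of rA */
    uint32_t handler;      /* trip handler address, or 0 if no trip occurs */
} trip_outcome;

static trip_outcome classify_trips(uint32_t interrupt, uint32_t ra)
{
    trip_outcome r = {0, 0};
    uint32_t j = interrupt & 0xff00;     /* the eight trip bits */
    uint32_t enabled = j & ra;
    r.event_bits = (j & ~ra) >> 8;       /* disabled exceptions are only noted */
    if (enabled) {
        uint32_t m = 16;
        /* walk down from D_BIT; each step skips one 16-byte handler slot */
        for (uint32_t i = enabled; !(i & D_BIT); i <<= 1) m += 16;
        r.handler = m;
    }
    return r;
}
```

The loop mirrors the **for** statement of section 308: the highest enabled bit that is set wins, so an enabled D exception trips to location 16 and an enabled X exception to location 128.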


```
309. ⟨Prepare to emulate the page translation 309⟩ ≡
  i = issued_between(data, cool);
  if (i < deissues) goto die;
  deissues = i;
  old_tail = tail = head; resuming = 0;   /* clear the fetch buffer */
  ⟨Restart the fetch coroutine 287⟩;
  cool_hist = data→hist;
  inst_ptr.p = UNKNOWN_SPEC;
  data→interrupt |= F_BIT;
This code is used in section 310.
310. We need to stop dispatching when calling a trip handler from within the reorder buffer, lest we issue
an instruction that uses g[255] or rB as an operand.
⟨Special cases for states in the first stage 266⟩ +≡
emulate_virt: ⟨Prepare to emulate the page translation 309⟩;
state_4: data→state = 4;
case 4: if (dispatch_lock) wait(1);
  set_lock(self, dispatch_lock);
state_5: data→state = 5;
case 5: if (data ≠ old_hot) wait(1);
  if ((data→interrupt & F_BIT) ∧ data→i ≠ trap) {
     inst_ptr.o = g[rT].o, inst_ptr.p = Λ;
     if (is_load_store(data→i)) nullifying = true;
  }
  if (data→interrupt & #ff) {
     g[rQ].o.h |= data→interrupt & #ff;
     new_Q.h |= data→interrupt & #ff;
     if (verbose & issue_bit) {
        printf("␣setting␣rQ="); print_octa(g[rQ].o); printf("\n");
     }
  }
  goto die;
311. The instructions of the previous section appear in the switch for coroutine stage 1 only. We need to
use them also in later stages.
\langle Special cases for states in later stages 272\rangle + \equiv
case 4: goto state_4;
case 5: goto state_5;
312. ⟨Special cases of instruction dispatch 117⟩ +≡
case trap: if ((flags[op] & X_is_dest_bit) ∧ cool→xx < cool_G ∧ cool→xx ≥ cool_L) goto increase_L;
  if (¬g[rT].up→known ∨ ¬g[rJ].up→known) goto stall;
  inst_ptr = specval(&g[rT]);   /* traps and emulated ops */
  cool→need_b = true, cool→b = specval(&g[255]);
case trip:
  if (¬g[rJ].up→known) goto stall;
  cool→ren_x = true, spec_install(&g[255], &cool→x);
  cool→x.known = true, cool→x.o = g[rJ].up→o;
  if (i ≡ trip) cool→go.o = zero_octa;
  cool→ren_a = true, spec_install(&g[i ≡ trap ? rBB : rB], &cool→a); break;
```


```
313. ⟨Cases for stage 1 execution 155⟩ +≡
case trap: data→interrupt |= F_BIT; data→a.o = data→b.o; goto fin_ex;
case trip: data→interrupt |= H_BIT; data→a.o = data→b.o; goto fin_ex;
```

**314.** The following check is performed at the beginning of every cycle. An instruction in the hot seat can be externally interrupted only if it is ready to be committed and not already marked for tripping or trapping.

```
⟨Check for external interrupt 314⟩ ≡
  g[rI].o = incr(g[rI].o, -1);
  if (g[rI].o.l ≡ 0 ∧ g[rI].o.h ≡ 0) {
     g[rQ].o.l |= INTERVAL_TIMEOUT, new_Q.l |= INTERVAL_TIMEOUT;
     if (verbose & issue_bit) {
        printf("␣setting␣rQ="); print_octa(g[rQ].o); printf("\n");
     }
  }
  trying_to_interrupt = false;
  if (((g[rQ].o.h & g[rK].o.h) ∨ (g[rQ].o.l & g[rK].o.l)) ∧ cool ≠ hot ∧
           ¬(hot→interrupt & (E_BIT + F_BIT + H_BIT)) ∧ ¬doing_interrupt ∧
           ¬(hot→i ≡ resum)) {
     if (hot→owner) trying_to_interrupt = true;
     else {
        hot→interrupt |= E_BIT;
        ⟨Deissue all but the hottest command 316⟩;
        inst_ptr.o = g[rTT].o; inst_ptr.p = Λ;
     }
  }
This code is used in section 64.
```
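
The two arithmetic ingredients of the check in section 314 are the interval countdown and the test for a request that is both pending and enabled. A minimal sketch follows, assuming a flat 64-bit counter for rI and a locally declared `octa`; the names `tick` and `interrupt_pending` are hypothetical, and the conditions on *hot* and *cool* are deliberately left out.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t h, l; } octa;   /* high and low tetrabytes */

/* Decrease the interval counter by one; report whether it just hit zero. */
static bool tick(uint64_t *ri)
{
    *ri -= 1;
    return *ri == 0;
}

/* A request is actionable when some bit is set in both rQ and rK. */
static bool interrupt_pending(octa rq, octa rk)
{
    return (rq.h & rk.h) != 0 || (rq.l & rk.l) != 0;
}
```

A counter that starts at zero wraps around instead of firing, so the timeout occurs only on the transition from 1 to 0.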

315. ⟨Global variables 20⟩ +≡
bool trying\_to\_interrupt; /\* encouraging interruptible operations to pause \*/
bool nullifying; /\* stopping dispatch to nullify a load/store command \*/

**316.** It's possible that the command in the hot seat has been deissued, but only if the simulator has done so at the user's request. Otherwise the test ' $i \ge deissues$ ' here will always succeed.

The value of *cool\_hist* becomes flaky here. We could try to keep it strictly up to date, but the unpredictable nature of external interrupts suggests that we are better off leaving it alone. (It's only a heuristic for branch prediction, and a sufficiently strong prediction will survive one-time glitches due to interrupts.)

```
⟨Deissue all but the hottest command 316⟩ ≡
  i = issued_between(hot, cool);
  if (i ≥ deissues) {
    deissues = i;
    tail = head; resuming = 0;   /* clear the fetch buffer */
    ⟨Restart the fetch coroutine 287⟩;
    if (is_load_store(hot→i)) nullifying = true;
  }
```

This code is used in section 314.


317. Even though an interrupted instruction has officially been either "committed" or "nullified," it stays in the hot seat for two or three extra cycles, while we save enough of the machine state to resume the computation later.

```
⟨Begin an interruption and break 317⟩ ≡
     if (¬(hot→interrupt & H_BIT)) g[rK].o = zero_octa;   /* trap */
     if (((hot→interrupt & H_BIT) ∧ hot→i ≠ trip) ∨
              ((hot→interrupt & F_BIT) ∧ hot→i ≠ trap) ∨
              (hot→interrupt & E_BIT)) doing_interrupt = 3, suppress_dispatch = true;
     else doing_interrupt = 2;   /* trip or trap started by dispatcher */
     break;
This code is used in section 146.
318. If a memory failure occurs, we should set rF here, either in case 2 or case 1. The simulator doesn't
do anything with rF at present.
⟨Perform one cycle of the interrupt preparations 318⟩ ≡
  switch (doing_interrupt--) {
  case 3: ⟨Set resumption registers (rB, $255) or (rBB, $255) 319⟩; break;
  case 2: ⟨Set resumption registers (rW, rX) or (rWW, rXX) 320⟩; break;
  case 1: ⟨Set resumption registers (rY, rZ) or (rYY, rZZ) 321⟩;
     if (hot ≡ reorder_bot) hot = reorder_top; else hot--;
     break;
  }
This code is used in section 64.
319. ⟨Set resumption registers (rB, $255) or (rBB, $255) 319⟩ ≡
  j = hot→interrupt & H_BIT;
  g[j ? rB : rBB].o = g[255].o;
  g[255].o = g[rJ].o;
  if (verbose & issue_bit) {
     if (j) {
       printf("␣setting␣rB="); print_octa(g[rB].o);
     } else {
       printf("␣setting␣rBB="); print_octa(g[rBB].o);
     }
     printf(",␣$255="); print_octa(g[255].o); printf("\n");
  }
```

This code is used in section 318.


320. Here's where we manufacture the "ropcodes" for resumption.

```
#define RESUME_AGAIN 0   /* repeat the command in rX as if in location rW - 4 */
#define RESUME_CONT 1    /* same, but substitute rY and rZ for operands */
#define RESUME_SET 2     /* set r[X] to rZ */
#define RESUME_TRANS 3   /* install (rY, rZ) into IT-cache or DT-cache, then RESUME_AGAIN */
#define pack_bytes(a, b, c, d) ((((((unsigned)(a) ≪ 8) + (b)) ≪ 8) + (c)) ≪ 8) + (d)
⟨Set resumption registers (rW, rX) or (rWW, rXX) 320⟩ ≡
  j = pack_bytes(hot→op, hot→xx, hot→yy, hot→zz);
  if (hot→interrupt & H_BIT) {
    g[rW].o = incr(hot→loc, 4);
    g[rX].o.h = sign_bit, g[rX].o.l = j;
    if (verbose & issue_bit) {
       printf("␣setting␣rW="); print_octa(g[rW].o);
       printf(",␣rX="); print_octa(g[rX].o); printf("\n");
    }
  } else {   /* trap */
    g[rWW].o = hot→go.o;
    g[rXX].o.l = j;
    if (hot→interrupt & F_BIT) {   /* forced */
       if (hot→i ≠ trap) j = RESUME_TRANS;   /* emulate page translation */
       else if (hot→op ≡ TRAP) j = #80;   /* TRAP */
       else if (flags[internal_op[hot→op]] & X_is_dest_bit) j = RESUME_SET;   /* emulation */
       else j = #80;   /* emulation when r[X] is not a destination */
    } else {   /* dynamic */
       if (hot→interim)
         j = (hot→i ≡ frem ∨ hot→i ≡ syncd ∨ hot→i ≡ syncid ? RESUME_CONT : RESUME_AGAIN);
       else if (is_load_store(hot→i)) j = RESUME_AGAIN;
       else j = #80;   /* normal external interruption */
    }
    g[rXX].o.h = (j ≪ 24) + (hot→interrupt & #ff);
    if (verbose & issue_bit) {
       printf("␣setting␣rWW="); print_octa(g[rWW].o);
       printf(",␣rXX="); print_octa(g[rXX].o); printf("\n");
    }
  }
```

This code is used in section 318.
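
The layout of rXX produced in section 320 is easy to verify with a small experiment. The sketch below repeats the *pack_bytes* macro, with an extra pair of outer parentheses added for safety, and introduces a hypothetical helper `rxx_high` that forms the high tetrabyte from a ropcode and the trap bits.

```c
#include <stdint.h>

/* Same packing as in the text, fully parenthesized. */
#define pack_bytes(a, b, c, d) \
    (((((((unsigned)(a) << 8) + (b)) << 8) + (c)) << 8) + (d))

enum { RESUME_AGAIN = 0, RESUME_CONT = 1, RESUME_SET = 2, RESUME_TRANS = 3 };

/* Hypothetical helper: high tetrabyte of rXX, with the ropcode in the
   leading byte and the eight trap bits in the trailing byte. */
static uint32_t rxx_high(unsigned ropcode, uint32_t interrupt)
{
    return ((uint32_t)ropcode << 24) + (interrupt & 0xff);
}
```

A ropcode of `#80` sets the sign bit of rXX, which is exactly what the dispatcher of RESUME tests before deciding whether an interrupted operation must be reinserted.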


```
321. ⟨Set resumption registers (rY, rZ) or (rYY, rZZ) 321⟩ ≡
  j = hot→interrupt & H_BIT;
  if ((hot→interrupt & F_BIT) ∧ hot→op ≡ SWYM) g[rYY].o = hot→go.o;
  else g[j ? rY : rYY].o = hot→y.o;
  if (hot→i ≡ st ∨ hot→i ≡ pst) g[j ? rZ : rZZ].o = hot→x.o;
  else g[j ? rZ : rZZ].o = hot→z.o;
  if (verbose & issue_bit) {
    if (j) {
      printf("␣setting␣rY="); print_octa(g[rY].o);
      printf(",␣rZ="); print_octa(g[rZ].o); printf("\n");
    } else {
      printf("␣setting␣rYY="); print_octa(g[rYY].o);
      printf(",␣rZZ="); print_octa(g[rZZ].o); printf("\n");
    }
  }
```

This code is used in section 318.

**322.** Whew; we've successfully interrupted the computation. The remaining task is to restart it again, as transparently as possible.

The RESUME instruction waits for the pipeline to drain, because it has to do such drastic things. For example, an interrupt may be occurring at this very moment, changing the registers needed for resumption.

```
\langle Special cases of instruction dispatch 117\rangle + \equiv
```

```
case resume: if (cool ≠ old_hot) goto stall;
   inst_ptr = specval(&g[cool→zz ? rWW : rW]);
   if (¬(cool→loc.h & sign_bit)) {
     if (cool→zz) cool→interrupt |= K_BIT;
     else if (inst_ptr.o.h & sign_bit) cool→interrupt |= P_BIT;
   }
   if (cool→interrupt) {
     inst_ptr.o = incr(cool→loc, 4); cool→i = noop;
   } else {
     cool→go.o = inst_ptr.o;
     if (cool→zz) {
         ⟨Magically do an I/O operation, if cool→loc is rT 372⟩;
         cool→ren_a = true, spec_install(&g[rK], &cool→a);
         cool→a.known = true, cool→a.o = g[255].o;
         cool→ren_x = true, spec_install(&g[255], &cool→x);
         cool→x.known = true, cool→x.o = g[rBB].o;
     }
     cool→b = specval(&g[cool→zz ? rXX : rX]);
     if (¬(cool→b.o.h & sign_bit)) ⟨Resume an interrupted operation 323⟩;
   } break;
```


**323.** Here we set  $cool \rightarrow i = resum$ , since we want to issue another instruction after the RESUME itself.

The restrictions on inserted instructions are designed to ensure that those instructions will be the very next ones issued. (If, for example, an *incgamma* instruction were necessary, it might cause a page fault and we'd lose the operand values for RESUME\_SET or RESUME\_CONT.)

A subtle point arises here: If RESUME\_TRANS is being used to compute the page translation of virtual address zero, we don't want to execute the dummy SWYM instruction from virtual address -4! So we avoid the SWYM altogether.

```
⟨Resume an interrupted operation 323⟩ ≡
  {
     cool→xx = cool→b.o.h ≫ 24, cool→i = resum;
     head→loc = incr(inst_ptr.o, -4);
     switch (cool→xx) {
     case RESUME_SET: cool→b.o.l = (SETH ≪ 24) + (cool→b.o.l & #ff0000);
        head→interrupt |= cool→b.o.h & #ff00;
        resuming = 2;
     case RESUME_CONT: resuming += 1 + cool→zz;
        if (((cool→b.o.l ≫ 24) & #fa) ≠ #b8) {   /* not syncd or syncid */
          m = cool→b.o.l ≫ 28;
          if ((1 ≪ m) & #8f30) goto bad_resume;
          m = (cool→b.o.l ≫ 16) & #ff;
          if (m ≥ cool_L ∧ m < cool_G) goto bad_resume;
        }
     case RESUME_AGAIN: resume_again: head→inst = cool→b.o.l;
        m = head→inst ≫ 24;
        if (m ≡ RESUME) goto bad_resume;   /* avoid uninterruptible loop */
        if (¬cool→zz ∧ m > RESUME ∧ m ≤ SYNC ∧ (head→inst & bad_inst_mask[m - RESUME]))
           head→interrupt |= B_BIT;
        head→noted = false; break;
     case RESUME_TRANS: if (cool→zz) {
           cool→y = specval(&g[rYY]), cool→z = specval(&g[rZZ]);
           if ((cool→b.o.l ≫ 24) ≠ SWYM) goto resume_again;
           cool→i = resume; break;   /* see "subtle point" above */
        }
     default: bad_resume: cool→interrupt |= B_BIT, cool→i = noop;
        resuming = 0; break;
     }
  }
This code is used in section 322.
```


```
324.
         (Insert special operands when resuming an interrupted operation 324) \equiv
  {
    if (resuming & 1) {
      cool→y = specval(&g[rY]);
      cool→z = specval(&g[rZ]);
    } else {
      cool→y = specval(&g[rYY]);
      cool→z = specval(&g[rZZ]);
    }
    if (resuming ≥ 3) { /* RESUME_SET */
      cool→need_ra = true, cool→ra = specval(&g[rA]);
    }
    cool→usage = false;
  }
This code is used in section 103.
325. #define do_resume_trans 17
                                                         /* state for performing RESUME_TRANS actions */
\langle \text{ Cases for stage 1 execution 155} \rangle + \equiv
case resume: case resum: if (data→xx ≠ RESUME_TRANS) goto fin_ex;
  data→ptr_a = (void *)((data→b.o.l ≫ 24) ≡ SWYM ? ITcache : DTcache);
  data→state = do_resume_trans;
  data→z.o = incr(oandn(data→z.o, page_mask), data→z.o.l & 7);
  data→z.o.h &= #ffff;
  goto resume_trans;
326. \langle Special cases for states in the first stage 266\rangle + \equiv
case do_resume_trans: resume_trans:
  { register cache *c = (cache *) data→ptr_a;
    if (c→lock) wait(1);
    if (c→filler.next) wait(1);
    p = alloc_slot(c, trans_key(data→y.o));
    if (p) {
      c→filler_ctl.ptr_b = (void *) p;
      c→filler_ctl.y.o = data→y.o;
      c→filler_ctl.b.o = data→z.o;
      c→filler_ctl.state = 1;
      schedule(&c→filler, c→access_time, 1);
    }
    goto fin_ex;
  }
```

**327.** Administrative operations. The internal instructions that handle the register stack simply reduce to things we already know how to do. (Well, the internal instructions for saving and unsaving do sometimes lead to special cases, based on $data \rightarrow op$; for the most part, though, the necessary mechanisms are already present.)

```
⟨Cases for stage 1 execution 155⟩ +≡
case noop: if (data→interrupt & F_BIT) goto emulate_virt;
case jmp: case pushj: case incrl: case unsave: goto fin_ex;
case sav: if (¬(data→mem_x)) goto fin_ex;
case incgamma: case save: data→i = st; goto switch1;
case decgamma: case unsav: data→i = ld; goto switch1;
```

**328.** We can GET special registers  $\geq 21$  (that is, rA, rF, rP, rW-rZ, or rWW-rZZ) only in the hot seat, because those registers are implicit outputs of many instructions.

The same applies to rK, since it is changed by TRAP and by emulated instructions.

```
⟨Cases for stage 1 execution 155⟩ +≡
case get: if (data→zz ≥ 21 ∨ data→zz ≡ rK) {
    if (data ≠ old_hot) wait(1);
    data→z.o = g[data→zz].o;
  }
  data→x.o = data→z.o; goto fin_ex;
```

**329.** A PUT is, similarly, delayed in the cases that hold *dispatch\_lock*. This program does not restrict the 1 bits that might be PUT into rQ, although the contents of that register can have drastic implications.

```
⟨Cases for stage 1 execution 155⟩ +≡
case put: if (data→xx ≥ 15 ∧ data→xx ≤ 20) {
    if (data ≠ old_hot) wait(1);
    switch (data→xx) {
    case rV: ⟨Update the page variables 239⟩; break;
    case rQ: new_Q.h |= data→z.o.h & ~g[rQ].o.h;
      new_Q.l |= data→z.o.l & ~g[rQ].o.l;
      data→z.o.l |= new_Q.l;
      data→z.o.h |= new_Q.h; break;
    case rL: if (data→z.o.h ≠ 0) data→z.o.h = 0, data→z.o.l = g[rL].o.l;
      else if (data→z.o.l > g[rL].o.l) data→z.o.l = g[rL].o.l;
    default: break;
    case rG: ⟨Update rG 330⟩; break;
    }
  } else if (data→xx ≡ rA ∧ (data→z.o.h ≠ 0 ∨ data→z.o.l ≥ #40000))
    data→interrupt |= B_BIT;
  data→x.o = data→z.o; goto fin_ex;
```

**330.** When rG decreases, we assume that up to *commit\_max* marginal registers can be zeroed during each clock cycle. (Remember that we're currently in the hot seat, and holding *dispatch\_lock*.)
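
The cycle-by-cycle behavior described here can be imitated outside the simulator. The following is a minimal sketch, not the simulator's own code: the function name, the flat array `g`, and the plain integer for rG are assumptions made for illustration. It assumes the target value is smaller than the current rG, as in the branch of the code that handles a decrease.

```c
#include <stdint.h>
#include <stdbool.h>

/* One simulated clock cycle of lowering rG toward target.
   At most commit_max marginal registers are zeroed per call.
   Returns true when rG has reached target, false if another
   cycle is needed (where the simulator would wait one cycle
   or accept an interrupt). */
static bool lower_rG_one_cycle(uint32_t *rG, uint32_t target,
                               uint64_t g[256], int commit_max)
{
    for (int j = 0; j < commit_max; j++) {
        (*rG)--;             /* the newly global register ... */
        g[*rG] = 0;          /* ... starts out as zero */
        if (*rG == target) return true;
    }
    return false;
}
```

With `commit_max` equal to 2, lowering rG from 40 to 35 takes three such cycles.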

```
⟨Update rG 330⟩ ≡
  if (data→z.o.h ≠ 0 ∨ data→z.o.l ≥ 256 ∨ data→z.o.l < g[rL].o.l ∨ data→z.o.l < 32)
    data→interrupt |= B_BIT;
  else if (data→z.o.l < g[rG].o.l) {
    data→interim = true; /* potentially interruptible */
    for (j = 0; j < commit_max; j++) {
      g[rG].o.l--;
      g[g[rG].o.l].o = zero_octa;
      if (data→z.o.l ≡ g[rG].o.l) break;
    }
    if (j ≡ commit_max) {
      if (¬trying_to_interrupt) wait(1);
    } else data→interim = false;
  }
This code is used in section 329.
331. Computed jumps put the desired destination address into the go field.
\langle \text{ Cases for stage 1 execution 155} \rangle + \equiv
case go: data \rightarrow x.o = data \rightarrow go.o; goto add\_go;
case pop: data \neg x.o = data \neg y.o;
   data \rightarrow y.o = data \rightarrow b.o;
                                  /* move rJ to y field */
case pushgo: add\_go: data \neg go.o = oplus(data \neg y.o, data \neg z.o);
  if ((data \neg go.o.h \& sign\_bit) \land \neg (data \neg loc.h \& sign\_bit)) data \neg interrupt |= P\_BIT;
   data \neg go.known = true; goto fin\_ex;
```

**332.** The instruction UNSAVE z generates a sequence of internal instructions that accomplish the actual unsaving. This sequence is controlled by the instruction currently in the fetch buffer, which changes its X and Y fields until all global registers have been loaded. The first instructions of the sequence are UNSAVE 0,0,z; UNSAVE 1,rZ,z-8; UNSAVE 1,rY,z-16; ...; UNSAVE 1,rB,z-96; UNSAVE 2,255,z-104; UNSAVE 2,254,z-112; etc. If an interrupt occurs before these instructions have all been committed, the execution register will contain enough information to restart the process.

After the global registers have all been loaded, UNSAVE continues by acting rather like POP. An interrupt occurring during this last stage will find rS < rO; a context switch might then take us back to restoring the local registers again. But no information will be lost, even though the register from which we began unsaving has long since been replaced.
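
The way the X and Y fields evolve can be traced with a small stand-alone model of the step rule used when getting ready for the next step of UNSAVE. This is a sketch for illustration only: the struct and function names are hypothetical, and only the register codes rR = 6, rP = 23, and rZ = 27 are taken from the MMIX definition.

```c
enum { rR = 6, rP = 23, rZ = 27 };   /* MMIX special register codes */

typedef struct { int xx, yy; } unsave_step;   /* hypothetical: X and Y fields */

/* Given the fields of the current internal UNSAVE, return the
   fields of the next one; G is the current value of rG. */
static unsave_step next_unsave_step(unsave_step s, int G)
{
    unsave_step n = s;
    switch (s.xx) {
    case 0: n.xx = 1; n.yy = rZ; break;          /* start with rZ */
    case 1:
        if (s.yy == rP) n.yy = rR;               /* skip from rP down to rR */
        else if (s.yy == 0) { n.xx = 2; n.yy = 255; }  /* rB done: go to $255 */
        else n.yy = s.yy - 1;
        break;
    case 2:
        if (s.yy == G) { n.xx = 3; n.yy = 0; }   /* all globals loaded */
        else n.yy = s.yy - 1;
        break;
    default: break;
    }
    return n;
}

/* Count the internal loads issued before the final, POP-like phase. */
static int count_unsave_loads(int G)
{
    unsave_step s = { 0, 0 };
    int loads = 0;
    while (s.xx != 3) { loads++; s = next_unsave_step(s, G); }
    return loads;
}
```

Under this model there is one load for the rG and rA word, twelve for the special registers rZ down to rB (which matches the offset z-96 quoted above), and 256 - G for the global registers.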

```
\langle Special cases of instruction dispatch 117\rangle + \equiv
case unsave: if (cool→interrupt & B_BIT) cool→i = noop;
  else {
    cool→interim = true;
    op = LDOU; /* this instruction needs to be handled by load/store unit */
    cool→i = unsav;
    switch (cool→xx) {
    case 0: if (cool→z.p) goto stall;
      ⟨Set up the first phase of unsaving 334⟩; break;
    case 1: case 2: ⟨Generate an instruction to unsave g[yy] 333⟩; break;
    case 3: cool→i = unsave, cool→interim = false, op = UNSAVE;
      goto pop_unsave;
    default: cool→interim = false, cool→i = noop, cool→interrupt |= B_BIT; break;
    }
  }
  break; /* this takes us to dispatch_done */
333. \langle Generate an instruction to unsave g[yy] 333 \rangle \equiv
   cool \neg ren\_x = true, spec\_install(\&g[cool \neg yy], \&cool \neg x);
   new\_O = new\_S = incr(cool\_O, -1);
   cool \neg z.o = shift\_left(new\_O, 3);
   cool \neg ptr\_a = (\mathbf{void} *) mem.up;
This code is used in section 332.
334. \langle Set up the first phase of unsaving 334\rangle \equiv
  cool→ren_x = true, spec_install(&g[rG], &cool→x);
  cool→ren_a = true, spec_install(&g[rA], &cool→a);
  new_O = new_S = shift_right(cool→z.o, 3, 1);
  cool→set_l = true, spec_install(&g[rL], &cool→rl);
  cool→ptr_a = (void *) mem.up;
This code is used in section 332.
335. ⟨Get ready for the next step of UNSAVE 335⟩ ≡
  switch (cool→xx) {
  case 0: head→inst = pack_bytes(UNSAVE, 1, rZ, 0); break;
  case 1: if (cool→yy ≡ rP) head→inst = pack_bytes(UNSAVE, 1, rR, 0);
    else if (cool→yy ≡ 0) head→inst = pack_bytes(UNSAVE, 2, 255, 0);
    else head→inst = pack_bytes(UNSAVE, 1, cool→yy - 1, 0); break;
  case 2: if (cool→yy ≡ cool_G) head→inst = pack_bytes(UNSAVE, 3, 0, 0);
    else head→inst = pack_bytes(UNSAVE, 2, cool→yy - 1, 0); break;
  }
This code is used in section 81.
```

This code is used in section 337.

```
336.
        \langle Handle an internal UNSAVE when it's time to load 336\rangle \equiv
  if (data→xx ≡ 0) {
    data→a.o = data→x.o; data→a.o.h &= #ffffff; /* unsaved rA */
    data→x.o.l = data→x.o.h ≫ 24; data→x.o.h = 0; /* unsaved rG */
    if (data→a.o.h ∨ (data→a.o.l & #fffc0000)) {
      data→a.o.h = 0, data→a.o.l &= #3ffff; data→interrupt |= B_BIT;
    }
    if (data→x.o.l < 32) {
      data→x.o.l = 32; data→interrupt |= B_BIT;
    }
  }
  goto fin_ex;
This code is used in section 279.
337. Of course SAVE is handled essentially like UNSAVE, but backwards.
\langle Special cases of instruction dispatch 117\rangle + \equiv
case save: if (cool→xx < cool_G) cool→interrupt |= B_BIT;
  if (cool→interrupt & B_BIT) cool→i = noop;
  else if (((cool_S.l - cool_O.l - cool_L - 1) & lring_mask) ≡ 0)
    ⟨Insert an instruction to advance gamma 113⟩
  else {
    cool→interim = true;
    cool→i = sav;
    switch (cool→zz) {
    case 0: ⟨Set up the first phase of saving 338⟩; break;
    case 1: if (cool_O.l ≠ cool_S.l) ⟨Insert an instruction to advance gamma 113⟩
      cool→zz = 2; cool→yy = cool_G;
    case 2: case 3: ⟨Generate an instruction to save g[yy] 339⟩; break;
    default: cool→interim = false, cool→i = noop, cool→interrupt |= B_BIT; break;
    }
  }
  break;
338. If an interrupt occurs during the first phase, say between two incgamma instructions, the value
cool→zz = 1 will get things restarted properly. (Indeed, if context is saved and unsaved during the interrupt,
many incgamma instructions may no longer be necessary.)
⟨Set up the first phase of saving 338⟩ ≡
  cool→zz = 1;
  cool→ren_x = true, spec_install(&l[(cool_O.l + cool_L) & lring_mask], &cool→x);
  cool→x.known = true, cool→x.o.h = 0, cool→x.o.l = cool_L;
  cool→set_l = true, spec_install(&g[rL], &cool→rl);
  new_O = incr(cool_O, cool_L + 1);
This code is used in section 337.
339. \langle Generate an instruction to save g[yy] 339 \rangle \equiv
                   /* this instruction needs to be handled by load/store unit */
  cool \neg mem\_x = true, spec\_install(\&mem, \&cool \neg x);
  cool \neg z.o = shift\_left(cool \_O, 3);
  new\_O = new\_S = incr(cool\_O, 1);
  if (cool \neg zz \equiv 3 \land cool \neg yy > rZ) (Do the final SAVE 340)
  else cool \rightarrow b = specval(\&g[cool \rightarrow yy]);
```

340. The final SAVE instruction not only stores rG and rA, it also places the final address in global register X.

```
\langle \text{ Do the final SAVE } 340 \rangle \equiv
     cool \neg i = save;
     cool \neg interim = false;
     cool \neg ren\_a = true, spec\_install(\&g[cool \neg xx], \&cool \neg a);
This code is used in section 339.
341. \langle Get ready for the next step of SAVE 341\rangle \equiv
  switch (cool \neg zz) {
   case 1: head \neg inst = pack\_bytes(SAVE, cool \neg xx, 0, 1); break;
  case 2: if (cool \neg yy \equiv 255) head \neg inst = pack\_bytes(SAVE, cool \neg xx, 0, 3);
     else head \neg inst = pack\_bytes(SAVE, cool \neg xx, cool \neg yy + 1, 2); break;
  case 3: if (cool \neg yy \equiv rR) head \neg inst = pack\_bytes(SAVE, cool \neg xx, rP, 3);
     else head \neg inst = pack\_bytes(SAVE, cool \neg xx, cool \neg yy + 1, 3); break;
This code is used in section 81.
342. (Handle an internal SAVE when it's time to store 342) \equiv
  if (data→interim) data→x.o = data→b.o;
  else {
    if (data ≠ old_hot) wait(1); /* we need the hottest value of rA */
    data→x.o.h = g[rG].o.l ≪ 24;
    data→x.o.l = g[rA].o.l;
    data→a.o = data→y.o;
  }
  goto fin_ex;
This code is used in section 281.
```

**343.** More register-to-register ops. Now that we've finished most of the hard stuff, we can relax and fill in the holes that we left in the all-register parts of the execution stages.

First let's complete the fixed point arithmetic operations, by dispensing with multiplication and division.
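
The multiplication cases that follow pick one of the internal opcodes mul0 through mul8 by counting how many bytes of the multiplier are significant. A minimal sketch of that count, using a plain 64-bit integer instead of the simulator's octa type (the function name is hypothetical):

```c
#include <stdint.h>

/* Number of bytes needed to hold the unsigned value z:
   0 for zero, 8 when the top byte is nonzero.  The result
   plays the role of the offset from mul0. */
static int multiplier_bytes(uint64_t z)
{
    int n = 0;
    while (z != 0) {
        z >>= 8;   /* unsigned shift, one byte at a time */
        n++;
    }
    return n;
}
```

A short multiplier therefore selects a low-numbered variant, and a multiplier with its top byte set selects mul8.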

```
\langle Cases to compute the results of register-to-register operation 137 \rangle + \equiv
case mulu: data \neg x.o = omult(data \neg y.o, data \neg z.o);
   data \neg a.o = aux;
   goto quantify_mul;
case mul: data→x.o = signed_omult(data→y.o, data→z.o);
  if (overflow) data→interrupt |= V_BIT;
quantify_mul: aux = data→z.o;
  for (j = mul0; aux.l ∨ aux.h; j++) aux = shift_right(aux, 8, 1);
      /* j is mul0 or mul1 or ... or mul8 */
  data→i = j; break;
case divu: data \neg x.o = odiv(data \neg b.o, data \neg y.o, data \neg z.o);
   data \neg a.o = aux; data \neg i = div; break;
case div: if (data \neg z.o.l \equiv 0 \land data \neg z.o.h \equiv 0) {
      data \neg interrupt \mid = D_BIT; data \neg a.o = data \neg y.o;
                             /* divide by zero needn't wait in the pipeline */
      data \rightarrow i = set;
   } else {
      data \neg x.o = signed\_odiv(data \neg y.o, data \neg z.o);
      if (overflow) data \rightarrow interrupt |= V_BIT;
      data \neg a.o = aux;
   } break;
344. Next let's polish off the bitwise and bytewise operations.
\langle Cases to compute the results of register-to-register operation 137\rangle + \equiv
case sadd: data→x.o.l = count_bits(data→y.o.h & ~data→z.o.h) + count_bits(data→y.o.l & ~data→z.o.l);
  break;
case mor: data→x.o = bool_mult(data→y.o, data→z.o, data→op & #2); break;
case bdif: data→x.o.h = byte_diff(data→y.o.h, data→z.o.h);
  data→x.o.l = byte_diff(data→y.o.l, data→z.o.l); break;
case wdif: data→x.o.h = wyde_diff(data→y.o.h, data→z.o.h);
  data→x.o.l = wyde_diff(data→y.o.l, data→z.o.l); break;
case tdif: if (data→y.o.h > data→z.o.h) data→x.o.h = data→y.o.h - data→z.o.h;
tdif_l: if (data→y.o.l > data→z.o.l) data→x.o.l = data→y.o.l - data→z.o.l; break;
case odif: if (data→y.o.h > data→z.o.h) data→x.o = ominus(data→y.o, data→z.o);
  else if (data→y.o.h ≡ data→z.o.h) goto tdif_l;
  break;
```

**345.** The conditional set (CS) instructions are, rather surprisingly, more difficult to implement than the zero set (ZS) instructions, although the ZS instructions do more. The reason is that dynamic instruction dependencies are more complicated with CS. Consider, for example, the instructions

```
LDO x,a,b; FDIV y,c,d; CSZ y,x,0; INCL y,1.
```

If the value of x is zero, the INCL instruction need not wait for the division to be completed. (We do not, however, abort the division in such a case; it might invoke a trip handler, or change the inexact bit, etc. Our policy is to treat common cases efficiently and to treat all cases correctly, but not to treat all cases with maximum efficiency.)

```
⟨Cases to compute the results of register-to-register operation 137⟩ +≡
case zset: if (register_truth(data→y.o, data→op)) data→x.o = data→z.o;
    /* otherwise data→x.o is already zero */
  goto fin_ex;
case cset: if (register_truth(data→y.o, data→op)) data→x.o = data→z.o, data→b.p = Λ;
  else if (data→b.p ≡ Λ) data→x.o = data→b.o;
  else {
    data→state = 0; data→need_b = true; goto switch1;
  }
  break;
```

**346.** Floating point computations are mostly handled by the routines in MMIX-ARITH, which record anomalous events in the global variable *exceptions*. But we consider the operation trivial if an input is infinite or NaN; and we may need to increase the execution time when denormals are present.
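
The two classifications mentioned here depend only on the exponent and fraction fields of an IEEE double. The sketch below restates the macros defined in this section as ordinary C functions; the `octa` struct with two 32-bit halves imitates the simulator's type and is declared here only to keep the example self-contained.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t h, l; } octa;   /* high and low tetrabytes */

/* Exponent field zero but fraction nonzero: a denormal number. */
static bool is_denormal(octa x)
{
    return (x.h & 0x7ff00000u) == 0 && ((x.h & 0xfffffu) != 0 || x.l != 0);
}

/* Exponent field all ones: infinity or NaN, so the operation is trivial. */
static bool is_trivial(octa x)
{
    return (x.h & 0x7ff00000u) == 0x7ff00000u;
}
```

Zero itself is neither denormal nor trivial under these tests, and the sign bit plays no role in either of them.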

```
#define ROUND_OFF 1
#define ROUND_UP 2
#define ROUND_DOWN 3
#define ROUND_NEAR 4
#define is_denormal(x) ((x.h & #7ff00000) ≡ 0 ∧ ((x.h & #fffff) ∨ x.l))
#define is_trivial(x) ((x.h & #7ff00000) ≡ #7ff00000)
#define set_round cur_round = (data→ra.o.l < #10000 ? ROUND_NEAR : data→ra.o.l ≫ 16)
\langle Cases to compute the results of register-to-register operation 137\rangle + \equiv
case fadd: set_round; data→x.o = fplus(data→y.o, data→z.o);
fin_bflot: if (is_denormal(data→y.o)) data→denin = denin_penalty;
fin_uflot: if (is_denormal(data→x.o)) data→denout = denout_penalty;
fin_flot: if (is_denormal(data→z.o)) data→denin = denin_penalty;
  data→interrupt |= exceptions;
  if (is_trivial(data→y.o) ∨ is_trivial(data→z.o)) goto fin_ex;
  if (data→i ≡ fsqrt ∧ (data→z.o.h & sign_bit)) goto fin_ex;
  break;
case fsub: data→a.o = data→z.o;
  if (fcomp(data→z.o, zero_octa) ≠ 2) data→a.o.h ⊕= sign_bit;
  set_round; data→x.o = fplus(data→y.o, data→a.o);
  data→i = fadd; /* use pipeline times for addition */
  goto fin_bflot;
case fmul: set_round; data→x.o = fmult(data→y.o, data→z.o); goto fin_bflot;
case fdiv: set_round; data→x.o = fdivide(data→y.o, data→z.o); goto fin_bflot;
case fsqrt: set_round; data→x.o = froot(data→z.o, data→y.o.l); goto fin_uflot;
case fint: set_round; data→x.o = fintegerize(data→z.o, data→y.o.l); goto fin_uflot;
case fix: set_round; data→x.o = fixit(data→z.o, data→y.o.l);
  if (data→op & #2) exceptions &= ~W_BIT; /* unsigned case doesn't overflow */
  goto fin_flot;
case flot: set_round; data→x.o = floatit(data→z.o, data→y.o.l, data→op & #2, data→op & #4);
  data→interrupt |= exceptions; break;
347. \langle Special cases of instruction dispatch 117\rangle + \equiv
case fsqrt: case fint: case fix: case flot: if (cool→y.o.l > 4) goto illegal_inst;
  break;
```

```
348.
         \langle Cases to compute the results of register-to-register operation 137 \rangle + \equiv
case feps: j = fepscomp(data \neg y.o, data \neg z.o, data \neg b.o, data \neg op \neq FEQLE);
  if (j \equiv 2) data \rightarrow i = fcmp;
  else if (is\_denormal(data\neg y.o) \lor is\_denormal(data\neg z.o)) data\neg denin = denin\_penalty;
  switch (data \neg op) {
  case FUNE: if (j \equiv 2) goto cmp\_pos; else goto cmp\_zero;
  case FEQLE: goto cmp_fin;
  case FCMPE: if (j) goto cmp_zero_or_invalid;
  }
case fcmp: j = fcomp(data \rightarrow y.o, data \rightarrow z.o);
  if (j < 0) goto cmp\_neg;
cmp\_fin: if (j \equiv 1) goto cmp\_pos;
cmp\_zero\_or\_invalid: if (j \equiv 2) data \neg interrupt |= I\_BIT;
  goto cmp_zero;
case funeq: if (fcomp(data \neg y.o, data \neg z.o) \equiv (data \neg op \equiv FUN ? 2:0)) goto cmp\_pos;
  else goto cmp_zero;
349. \langle External variables 4 \rangle + \equiv
  Extern int frem_max;
  Extern int denin_penalty, denout_penalty;
350. The floating point remainder operation is especially interesting because it can be interrupted when
it's in the hot seat.
⟨Cases to compute the results of register-to-register operation 137⟩ +≡
case frem: if (is_trivial(data→y.o) ∨ is_trivial(data→z.o)) {
    data→x.o = fremstep(data→y.o, data→z.o, 2500); goto fin_ex;
  }
  if ((self + 1)→next) wait(1);
  data→interim = true;
  j = 1;
  if (is_denormal(data→y.o) ∨ is_denormal(data→z.o)) j += denin_penalty;
  pass_after(j);
  goto passit;
351. \langle Begin execution of a stage-two operation 351 \rangle \equiv
  j = 1;
  if (data→i ≡ frem) {
    data→x.o = fremstep(data→y.o, data→z.o, frem_max);
    if (exceptions & E_BIT) {
      data→y.o = data→x.o;
      if (trying_to_interrupt ∧ data ≡ old_hot) goto fin_ex;
    } else {
      data→state = 3;
      data→interim = false;
      data→interrupt |= exceptions;
      if (is_denormal(data→x.o)) j += denout_penalty;
    }
    wait(j);
  }
This code is used in section 135.
```


**352.** System operations. Finally we need to implement some operations for the operating system; then the hardware simulation will be done!

A LDVTS instruction is delayed until it reaches the hot seat, because it changes the IT and DT caches. The operating system should use SYNC after LDVTS if the effects are needed immediately; the system is also responsible for ensuring that the page table permission bits agree with the LDVTS permission bits when the latter are nonzero. (Also, if write permission is taken away from a page, the operating system must have previously used SYNCD to write out any dirty bytes that might have been cached from that page; SYNCD will be inoperative after write permission goes away.)

```
\langle Handle special cases for operations like prego and ldvts 289\rangle + \equiv
  if (data→i ≡ ldvts) ⟨Do stage 1 of LDVTS 353⟩;
353. ⟨Do stage 1 of LDVTS 353⟩ ≡
  {
    if (data ≠ old_hot) wait(1);
    if (DTcache→lock ∨ (j = get_reader(DTcache)) < 0) wait(1);
    startup(&DTcache→reader[j], DTcache→access_time);
    data→z.o.h = 0, data→z.o.l = data→y.o.l & #7;
    p = cache_search(DTcache, data→y.o); /* N.B.: Not trans_key(data→y.o) */
    if (p) {
      data→x.o.l = 2;
      if (data→z.o.l) {
        p = use_and_fix(DTcache, p);
        p→data[0].l = (p→data[0].l & -8) + data→z.o.l;
      } else {
        p = demote_and_fix(DTcache, p);
        p→tag.h |= sign_bit; /* invalidate the tag */
      }
    }
    pass_after(DTcache→access_time); goto passit;
  }
This code is used in section 352.
354. \langle Special cases for states in later stages 272\rangle + \equiv
case ld_st_launch: if (ITcache→lock ∨ (j = get_reader(ITcache)) < 0) wait(1);
  startup(&ITcache→reader[j], ITcache→access_time);
  p = cache_search(ITcache, data→y.o); /* N.B.: Not trans_key(data→y.o) */
  if (p) {
    data→x.o.l = 1;
    if (data→z.o.l) {
      p = use_and_fix(ITcache, p);
      p→data[0].l = (p→data[0].l & -8) + data→z.o.l;
    } else {
      p = demote_and_fix(ITcache, p);
      p→tag.h |= sign_bit; /* invalidate the tag */
    }
  }
  data→state = 3; wait(ITcache→access_time);
```


355. The SYNC operation interacts with the pipeline in interesting ways. SYNC 0 and SYNC 4 are the simplest; they just lock the dispatch and wait until they get to the hot seat, after which the pipeline has drained. SYNC 1 and SYNC 3 put a "barrier" into the write buffer so that subsequent store instructions will not merge with previous stores. SYNC 2 and SYNC 3 lock the dispatch until all previous load instructions have left the pipeline. SYNC 5, SYNC 6, and SYNC 7 remove things from caches once they get to the hot seat.
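
The dispatch-time part of this behavior can be summarized as a small decision function. This is only a sketch that mirrors the tests made when a SYNC is dispatched: the struct, its field names, and the function name are hypothetical, and `negative_loc` stands for the condition that the instruction's location has its sign bit set.

```c
#include <stdbool.h>

/* Hypothetical summary of what dispatching SYNC zz sets in motion. */
typedef struct {
    bool privileged_fault;   /* SYNC 4..7 issued from a nonnegative address */
    bool freeze_dispatch;    /* hold up further dispatching */
    bool write_barrier;      /* the SYNC claims a write buffer slot */
} sync_dispatch;

static sync_dispatch classify_sync(int zz, bool negative_loc)
{
    sync_dispatch r = { false, false, false };
    if (zz > 3) {
        if (!negative_loc) { r.privileged_fault = true; return r; }
        if (zz == 4) r.freeze_dispatch = true;
    } else {
        if (zz != 1) r.freeze_dispatch = true;   /* SYNC 0, 2, 3 */
        if (zz & 1) r.write_barrier = true;      /* SYNC 1, 3 */
    }
    return r;
}
```

Note that SYNC 5, 6, and 7 set neither flag at dispatch time in this model; their work happens later, when they reach the hot seat.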

```
\langle Special cases of instruction dispatch 117\rangle + \equiv
case sync: if (cool \neg zz > 3) {
     if (\neg(cool \neg loc.h \& sign\_bit)) goto privileged_inst;
     if (cool \neg zz \equiv 4) freeze_dispatch = true;
  } else {
     if (cool \neg zz \neq 1) freeze_dispatch = true;
     if (cool \neg zz \& 1) cool \neg mem\_x = true, spec\_install(\&mem, \&cool \neg x);
  } break;
356. \langle Cases for stage 1 execution 155\rangle + \equiv
case sync: switch (data \neg zz) {
  case 0: case 4: if (data \neq old\_hot) wait(1);
     halted = (data \neg zz \neq 0);  goto fin\_ex;
  case 2: case 3: (Wait if there's an unfinished load ahead of us 357);
     release_lock(self, dispatch_lock);
  case 1: data \rightarrow x.addr = zero\_octa; goto fin\_ex;
  case 5: if (data \neq old\_hot) wait (1);
     \langle Clean the data caches 361\rangle;
  case 6: if (data \neq old\_hot) wait (1);
     \langle \text{Zap the translation caches } 358 \rangle;
  case 7: if (data \neq old\_hot) wait (1);
     \langle \text{ Zap the instruction and data caches } 359 \rangle;
357. ⟨Wait if there's an unfinished load ahead of us 357⟩ ≡
  {
    register control *cc;
    for (cc = data; cc ≠ hot; ) {
      cc = (cc ≡ reorder_top ? reorder_bot : cc + 1);
      if (cc→owner ∧ (cc→i ≡ ld ∨ cc→i ≡ ldunc ∨ cc→i ≡ pst)) wait(1);
    }
  }
This code is used in section 356.
358. Perhaps the delay should be longer here.
⟨Zap the translation caches 358⟩ ≡
  if (DTcache→lock ∨ (j = get_reader(DTcache)) < 0) wait(1);
  startup(&DTcache→reader[j], DTcache→access_time);
  set_lock(self, DTcache→lock);
  zap_cache(DTcache);
  data→state = 10; wait(DTcache→access_time);
This code is used in section 356.
```


```
359.
        \langle Zap the instruction and data caches 359\rangle \equiv
  if (¬Icache) {
    data→state = 11; goto switch1;
  }
  if (Icache→lock ∨ (j = get_reader(Icache)) < 0) wait(1);
  startup(&Icache→reader[j], Icache→access_time);
  set_lock(self, Icache→lock);
  zap_cache(Icache);
  data→state = 11; wait(Icache→access_time);
This code is used in section 356.
360. \langle Special cases for states in the first stage 266\rangle + \equiv
case 10: if (self \neg lockloc) *(self \neg lockloc) = \Lambda, self \neg lockloc = \Lambda;
   if (ITcache \neg lock \lor (j = get\_reader(ITcache)) < 0) wait (1);
   startup(\&IT cache \neg reader[j], IT cache \neg access\_time);
   set\_lock(self, ITcache \neg lock);
   zap\_cache(ITcache);
   data \neg state = 3; wait(ITcache \neg access\_time);
case 11: if (self \neg lockloc) *(self \neg lockloc) = \Lambda, self \neg lockloc = \Lambda;
  if (wbuf\_lock) wait(1);
   write\_head = write\_tail, write\_ctl.state = 0;
                                                            /* zap the write buffer */
  if (¬Dcache) {
    data→state = 12; goto switch1;
  }
  if (Dcache→lock ∨ (j = get_reader(Dcache)) < 0) wait(1);
  startup(&Dcache→reader[j], Dcache→access_time);
  set_lock(self, Dcache→lock);
  zap_cache(Dcache);
  data→state = 12; wait(Dcache→access_time);
case 12: if (self \neg lockloc) *(self \neg lockloc) = \Lambda, self \neg lockloc = \Lambda;
  if (\neg Scache) goto fin\_ex;
  if (Scache \neg lock) wait(1);
   set\_lock(self, Scache \neg lock);
   zap\_cache(Scache);
   data \neg state = 3; \ wait(Scache \neg access\_time);
361.
         \langle Clean the data caches 361 \rangle \equiv
   if (self \neg lockloc) *(self \neg lockloc) = \Lambda, self \neg lockloc = \Lambda;
   ⟨ Wait till write buffer is empty 362⟩;
  if (clean\_co.next \lor clean\_lock) wait(1);
   set\_lock(self, clean\_lock);
   clean\_ctl.i = sync; \ clean\_ctl.state = 0; \ clean\_ctl.x.o.h = 0;
   startup(\&clean\_co, 1);
   data \rightarrow state = 13;
   data \neg interim = true;
   wait(1);
This code is used in section 356.
```


```
362. ⟨Wait till write buffer is empty 362⟩ ≡
if (write_head ≠ write_tail) {
   if (¬speed_lock) set_lock(self, speed_lock);
      wait(1);
}
This code is used in sections 361 and 364.

363. The cleanup process might take a huge amount of time, so we must allow it to be interrupted.
(Servicing the interruption might, of course, put more stuff into the cache.)
⟨Special cases for states in the first stage 266⟩ +≡
case 13: if (¬clean_co.next) {
    data→interim = false; goto fin_ex; /* it's done! */
  }
  if (trying_to_interrupt) goto fin_ex; /* accept an interruption */
  wait(1);
```


**364.** Now we consider SYNCD and SYNCID. When control comes to this part of the program, $data \rightarrow y.o$ is a virtual address and $data \rightarrow z.o$ is the corresponding physical address; $data \rightarrow xx + 1$ is the number of bytes we are supposed to be syncing; $data \rightarrow b.o.l$ is the number of bytes we can handle at once (either $Icache \rightarrow bb$ or $Dcache \rightarrow bb$ or 8192).

We need a more elaborate scheme to implement SYNCD and SYNCID than we have used for the "hint" instructions PRELD, PREGO, and PREST, because SYNCD and SYNCID are not merely hints. They cannot be converted into a sequence of cache-block-size commands at dispatch time, because we cannot be sure that the starting virtual address will be aligned with the beginning of a cache block. We need to realize that the bytes specified by SYNCD or SYNCID might cross a virtual page boundary—possibly with different protection bits on each page. We need to allow for interrupts. And we also need to keep the fetch buffer empty until a user's SYNCID has completely brought the memory up to date.
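
The alignment issue shows up in the test that decides whether the operation must be marked as interim, that is, whether the requested bytes extend beyond the block containing the starting address. A minimal sketch of that test follows; the function and parameter names are hypothetical, and the block size is assumed to be a power of two.

```c
#include <stdint.h>
#include <stdbool.h>

/* y_low:       low 32 bits of the starting virtual address
   block_bytes: number of bytes handled at once (a power of two)
   xx:          number of bytes to sync, minus one
   Returns true if the request reaches past the current block. */
static bool needs_more_blocks(uint32_t y_low, uint32_t block_bytes, uint32_t xx)
{
    /* bytes that follow y_low within its own block */
    uint32_t left_in_block = (block_bytes - 1) & ~y_low;
    return left_in_block < xx;
}
```

With 64-byte blocks, a 64-byte request starting at offset 0 fits in one block, but the same request starting at offset 1 does not.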

```
⟨Special cases for states in later stages 272⟩ +≡
do_syncid: data→state = 30;
case 30: if (data ≠ old_hot) wait(1);
  if (¬Icache) {
    data→state = (data→loc.h & sign_bit ? 31 : 33); goto switch2;
  }
  ⟨Clean the I-cache block for data→z.o, if any 365⟩;
  data→state = (data→loc.h & sign_bit ? 31 : 33); wait(Icache→access_time);
case 31: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  ⟨Wait till write buffer is empty 362⟩;
  if (((data→b.o.l - 1) & ~data→y.o.l) < data→xx) data→interim = true;
  if (¬Dcache) goto next_sync;
  ⟨Clean the D-cache block for data→z.o, if any 366⟩;
  data→state = 32; wait(Dcache→access_time);
case 32: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  if (¬Scache) goto next_sync;
  ⟨Clean the S-cache block for data→z.o, if any 367⟩;
  data→state = 35; wait(Scache→access_time);
do_syncd: data→state = 33;
case 33: if (data ≠ old_hot) wait(1);
  if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  ⟨Wait till write buffer is empty 362⟩;
  if (((data→b.o.l - 1) & ~data→y.o.l) < data→xx) data→interim = true;
  if (¬Dcache) {
    if (data→i ≡ syncd) goto fin_ex; else goto next_sync;
  }
  ⟨Use cleanup on the cache blocks for data→z.o, if any 368⟩;
  data→state = 34;
case 34: if (¬clean_co.next) goto next_sync;
  if (trying_to_interrupt ∧ data→interim ∧ data ≡ old_hot) {
    data→z.o = zero_octa;   /* anticipate RESUME_CONT */
    goto fin_ex;   /* accept an interruption */
  }
  wait(1);
next_sync: data→state = 35;
case 35: if (self→lockloc) *(self→lockloc) = Λ, self→lockloc = Λ;
  if (data→interim) ⟨Continue this command on the next cache block 369⟩;
  data→go.known = true;
  goto fin_ex;
```


```
365. ⟨Clean the I-cache block for data→z.o, if any 365⟩ ≡
  if (Icache→lock ∨ (j = get_reader(Icache)) < 0) wait(1);
  startup(&Icache→reader[j], Icache→access_time);
  set_lock(self, Icache→lock);
  p = cache_search(Icache, data→z.o);
  if (p) {
    demote_and_fix(Icache, p);
    clean_block(Icache, p);
  }
This code is used in section 364.

366. ⟨Clean the D-cache block for data→z.o, if any 366⟩ ≡
  if (Dcache→lock ∨ (j = get_reader(Dcache)) < 0) wait(1);
  startup(&Dcache→reader[j], Dcache→access_time);
  set_lock(self, Dcache→lock);
  p = cache_search(Dcache, data→z.o);
  if (p) {
    demote_and_fix(Dcache, p);
    clean_block(Dcache, p);
  }
This code is used in section 364.

367. ⟨Clean the S-cache block for data→z.o, if any 367⟩ ≡
  if (Scache→lock) wait(1);
  set_lock(self, Scache→lock);
  p = cache_search(Scache, data→z.o);
  if (p) {
    demote_and_fix(Scache, p);
    clean_block(Scache, p);
  }
This code is used in section 364.

368. ⟨Use cleanup on the cache blocks for data→z.o, if any 368⟩ ≡
  if (clean_co.next ∨ clean_lock) wait(1);
  set_lock(self, clean_lock);
  clean_ctl.i = syncd;
  clean_ctl.state = 4;
  clean_ctl.x.o.h = data→loc.h & sign_bit;
  clean_ctl.z.o = data→z.o;
  schedule(&clean_co, 1, 4);
This code is used in section 364.
```


```
369. We use the fact that cache block sizes are divisors of 8192.
⟨Continue this command on the next cache block 369⟩ ≡
  {
    data→interim = false;
    data→xx -= ((data→b.o.l - 1) & ~data→y.o.l) + 1;
    data→y.o = incr(data→y.o, data→b.o.l);
    data→y.o.l &= -data→b.o.l;
    data→z.o.l = (data→z.o.l & -8192) + (data→y.o.l & 8191);
    if ((data→y.o.l & 8191) ≡ 0) goto square_one;   /* maybe crossed a page boundary */
    if (data→i ≡ syncd) goto do_syncd; else goto do_syncid;
  }
This code is used in section 364.
370. If the first page lacks proper protection, we still must try the second, in the rare case that a page
boundary is spanned.
⟨Special cases for states in later stages 272⟩ +≡
sync_check: if ((data→y.o.l ⊕ (data→y.o.l + data→xx)) ≥ 8192) {
    data→xx -= (8191 & ~data→y.o.l) + 1;
    data→y.o = incr(data→y.o, 8192);
    data→y.o.l &= -8192;
    goto square_one;
  }
  goto fin_ex;
```
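
The page-boundary test in section 370 relies on a small bit trick: two addresses lie in different 8192-byte pages exactly when they differ in some bit at position 13 or above, that is, when their exclusive-or is at least 8192. The following is only a sketch; the helper names are hypothetical, and we assume that `y + xx` does not overflow 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if the xx+1 bytes starting at y run
   past the end of the 8192-byte page that contains y. */
int spans_page(uint32_t y, uint32_t xx)
{
  return (y ^ (y + xx)) >= 8192;
}

/* Hypothetical helper: number of bytes from y to the end of its page,
   counting the byte at y itself; this is the amount by which
   section 370 decreases xx before moving to the second page. */
uint32_t bytes_left_in_page(uint32_t y)
{
  return (8191 & ~y) + 1;
}
```

With `y = 8190`, two bytes remain in the page, so a three-byte request (`xx = 2`) spans the boundary while a two-byte request (`xx = 1`) does not.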


**371. Input and output.** We're done implementing the hardware, but there's still a small matter of software remaining, because we sometimes want to pretend that a real operating system is present without actually having one loaded. This simulator therefore implements a special feature: If RESUME 1 is issued in location rT, the ten special I/O traps of MMIX-SIM are performed instantaneously behind the scenes.

Of course all claims of accurate simulation go out the door when this feature is used.

```
#define max_sys_call Ftell
⟨Type definitions 11⟩ +≡
  typedef enum {
    Halt, Fopen, Fclose, Fread, Fgets, Fgetws, Fwrite, Fputs, Fputws, Fseek, Ftell
  } sys_call;

372. ⟨Magically do an I/O operation, if cool→loc is rT 372⟩ ≡
  if (cool→loc.l ≡ g[rT].o.l ∧ cool→loc.h ≡ g[rT].o.h) {
    register unsigned char yy, zz;
    octa ma, mb;
    if (g[rXX].o.l & #ffff0000) goto magic_done;
    yy = g[rXX].o.l ≫ 8, zz = g[rXX].o.l & #ff;
    if (yy > max_sys_call) goto magic_done;
    ⟨Prepare memory arguments ma = M[a] and mb = M[b] if needed 380⟩;
    switch (yy) {
    case Halt: ⟨Either halt or print warning 373⟩; break;
    case Fopen: g[rBB].o = mmix_fopen(zz, mb, ma); break;
    case Fclose: g[rBB].o = mmix_fclose(zz); break;
    case Fread: g[rBB].o = mmix_fread(zz, mb, ma); break;
    case Fgets: g[rBB].o = mmix_fgets(zz, mb, ma); break;
    case Fgetws: g[rBB].o = mmix_fgetws(zz, mb, ma); break;
    case Fwrite: g[rBB].o = mmix_fwrite(zz, mb, ma); break;
    case Fputs: g[rBB].o = mmix_fputs(zz, g[rBB].o); break;
    case Fputws: g[rBB].o = mmix_fputws(zz, g[rBB].o); break;
    case Fseek: g[rBB].o = mmix_fseek(zz, g[rBB].o); break;
    case Ftell: g[rBB].o = mmix_ftell(zz); break;
    }
  magic_done: g[255].o = neg_one;   /* this will enable interrupts */
  }
This code is used in section 322.

373. ⟨Either halt or print warning 373⟩ ≡
  if (¬zz) halted = true;
  else if (zz ≡ 1) {
    octa trap_loc;
    trap_loc = incr(g[rWW].o, -4);
    if (¬(trap_loc.h ∨ trap_loc.l ≥ #90)) print_trip_warning(trap_loc.l ≫ 4, incr(g[rW].o, -4));
  }
This code is used in section 372.

374. ⟨Global variables 20⟩ +≡
  char arg_count[] = {1, 3, 1, 3, 3, 3, 3, 2, 2, 2, 1};
```
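
To see how the fields are pulled out of rXX in section 372, here is a small self-contained sketch. The helper names `magic_call_number`, `magic_arg_count`, and `magic_zz` are hypothetical; the argument-count table is copied from section 374, and the constant 10 is the position of Ftell in the enumeration of section 371.

```c
#include <stdint.h>

/* argument counts for Halt, Fopen, ..., Ftell (copied from section 374) */
static const char arg_count[] = {1, 3, 1, 3, 3, 3, 3, 2, 2, 2, 1};
#define MAX_SYS_CALL 10   /* Ftell, the last entry of sys_call */

/* Hypothetical helper: the call number yy encoded in word, or -1 in
   the cases where section 372 jumps straight to magic_done. */
int magic_call_number(uint32_t word)
{
  unsigned yy;
  if (word & 0xffff0000u) return -1;   /* the two high bytes must be zero */
  yy = (word >> 8) & 0xff;
  if (yy > MAX_SYS_CALL) return -1;
  return (int)yy;
}

/* Hypothetical helper: number of arguments of that call, or 0. */
int magic_arg_count(uint32_t word)
{
  int yy = magic_call_number(word);
  return yy < 0 ? 0 : arg_count[yy];
}

/* The low byte, which section 372 passes on as zz. */
unsigned char magic_zz(uint32_t word)
{
  return (unsigned char)(word & 0xff);
}
```

Only calls whose argument count is 3 need the two memory operands that section 380 prepares.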

**375.** The input/output operations invoked by TRAPs are done by subroutines in an auxiliary program module called MMIX-IO. Here we need only declare those subroutines, and write three primitive interfaces on which they depend.


```
376. ⟨Global variables 20⟩ +≡
  extern octa mmix_fopen ARGS((unsigned char, octa, octa));
  extern octa mmix_fclose ARGS((unsigned char));
  extern octa mmix_fread ARGS((unsigned char, octa, octa));
  extern octa mmix_fgets ARGS((unsigned char, octa, octa));
  extern octa mmix_fgetws ARGS((unsigned char, octa, octa));
  extern octa mmix_fwrite ARGS((unsigned char, octa, octa));
  extern octa mmix_fputs ARGS((unsigned char, octa));
  extern octa mmix_fputws ARGS((unsigned char, octa));
  extern octa mmix_fseek ARGS((unsigned char, octa));
  extern octa mmix_ftell ARGS((unsigned char));
  extern void print_trip_warning ARGS((int, octa));

377. ⟨Internal prototypes 13⟩ +≡
  int mmgetchars ARGS((char *, int, octa, int));
  void mmputchars ARGS((unsigned char *, int, octa));
  char stdin_chr ARGS((void));
  octa magic_read ARGS((octa));
  void magic_write ARGS((octa, octa));

378. We need to cut through all the complications of buffers and caches in order to do magical I/O. The
magic_read routine finds the current octabyte in a given physical address by looking at the write buffer,
D-cache, S-cache, and memory until finding it.
⟨Subroutines 14⟩ +≡
  octa magic_read(addr)
       octa addr;
  {
    register write_node *q;
    register cacheblock *p;
    for (q = write_tail; ; ) {
      if (q ≡ write_head) break;
      if (q ≡ wbuf_top) q = wbuf_bot; else q++;
      if ((q→addr.l & -8) ≡ (addr.l & -8) ∧ q→addr.h ≡ addr.h) return q→o;
    }
    if (Dcache) {
      p = cache_search(Dcache, addr);
      if (p) return p→data[(addr.l & (Dcache→bb - 1)) ≫ 3];
      if (((Dcache→outbuf.tag.l ⊕ addr.l) & -Dcache→bb) ≡ 0 ∧ Dcache→outbuf.tag.h ≡ addr.h)
        return Dcache→outbuf.data[(addr.l & (Dcache→bb - 1)) ≫ 3];
      if (Scache) {
        p = cache_search(Scache, addr);
        if (p) return p→data[(addr.l & (Scache→bb - 1)) ≫ 3];
        if (((Scache→outbuf.tag.l ⊕ addr.l) & -Scache→bb) ≡ 0 ∧ Scache→outbuf.tag.h ≡ addr.h)
          return Scache→outbuf.data[(addr.l & (Scache→bb - 1)) ≫ 3];
      }
    }
    return mem_read(addr);
  }
```


**379.** The *magic\_write* routine changes the octabyte in a given physical address by changing it wherever it appears in a buffer or cache. Any "dirty" or "least recently used" status remains unchanged. (Yes, this *is* magic.)

```
⟨Subroutines 14⟩ +≡
  void magic_write(addr, val)
       octa addr, val;
  {
    register write_node *q;
    register cacheblock *p;
    for (q = write_tail; ; ) {
      if (q ≡ write_head) break;
      if (q ≡ wbuf_top) q = wbuf_bot; else q++;
      if ((q→addr.l & -8) ≡ (addr.l & -8) ∧ q→addr.h ≡ addr.h) q→o = val;
    }
    if (Dcache) {
      p = cache_search(Dcache, addr);
      if (p) p→data[(addr.l & (Dcache→bb - 1)) ≫ 3] = val;
      if (((Dcache→inbuf.tag.l ⊕ addr.l) & -Dcache→bb) ≡ 0 ∧ Dcache→inbuf.tag.h ≡ addr.h)
        Dcache→inbuf.data[(addr.l & (Dcache→bb - 1)) ≫ 3] = val;
      if (((Dcache→outbuf.tag.l ⊕ addr.l) & -Dcache→bb) ≡ 0 ∧ Dcache→outbuf.tag.h ≡ addr.h)
        Dcache→outbuf.data[(addr.l & (Dcache→bb - 1)) ≫ 3] = val;
      if (Scache) {
        p = cache_search(Scache, addr);
        if (p) p→data[(addr.l & (Scache→bb - 1)) ≫ 3] = val;
        if (((Scache→inbuf.tag.l ⊕ addr.l) & -Scache→bb) ≡ 0 ∧ Scache→inbuf.tag.h ≡ addr.h)
          Scache→inbuf.data[(addr.l & (Scache→bb - 1)) ≫ 3] = val;
        if (((Scache→outbuf.tag.l ⊕ addr.l) & -Scache→bb) ≡ 0 ∧ Scache→outbuf.tag.h ≡ addr.h)
          Scache→outbuf.data[(addr.l & (Scache→bb - 1)) ≫ 3] = val;
      }
    }
    mem_write(addr, val);
  }
```
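
Both magic_read and magic_write locate an octabyte inside a cache block with the same two expressions. The following sketch isolates them for the low 32 bits of an address (the simulator compares the high halves separately). The helper names are hypothetical, and we assume that the block size `bb` is a power of two no smaller than 8.

```c
#include <stdint.h>

/* Hypothetical helper: index of the octabyte that contains byte
   address addr, counted within its cache block of bb bytes. */
uint32_t octa_index(uint32_t addr, uint32_t bb)
{
  return (addr & (bb - 1)) >> 3;
}

/* Hypothetical helper: nonzero if addr lies in the bb-byte block
   whose tag is tag. The mask ~(bb-1) equals -bb in two's complement. */
int same_block(uint32_t tag, uint32_t addr, uint32_t bb)
{
  return ((tag ^ addr) & ~(bb - 1)) == 0;
}
```

With 64-byte blocks, address #1238 is octabyte 7 of the block tagged #1200, whereas #1240 already belongs to the next block.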

**380.** The conventions of our imaginary operating system require us to apply the trivial memory mapping in which segment $i$ appears in a $2^{32}$-byte page of physical addresses starting at $2^{32}i$.

```
⟨Prepare memory arguments ma = M[a] and mb = M[b] if needed 380⟩ ≡
  if (arg_count[yy] ≡ 3) {
    octa arg_loc;
    arg_loc = g[rBB].o;
    if (arg_loc.h & #9fffffff) mb = zero_octa;
    else arg_loc.h ≫= 29, mb = magic_read(arg_loc);
    arg_loc = incr(g[rBB].o, 8);
    if (arg_loc.h & #9fffffff) ma = zero_octa;
    else arg_loc.h ≫= 29, ma = magic_read(arg_loc);
  }
```

This code is used in section 372.
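
The test against #9fffffff and the shift by 29 amount to the following, shown here on a 64-bit integer instead of an octa. This is only a sketch; `is_mappable` and `trivial_map` are hypothetical names. An address passes the test only if it is nonnegative and its offset within the segment is less than $2^{32}$; the two bits that survive the mask are the segment number $i$.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if vaddr can be handled by the trivial
   mapping, i.e. only bits 62 and 61 of its high half may be set. */
int is_mappable(uint64_t vaddr)
{
  uint32_t h = (uint32_t)(vaddr >> 32);
  return (h & 0x9fffffffu) == 0;
}

/* Hypothetical helper: physical address 2^32*i + offset for a
   mappable virtual address in segment i. */
uint64_t trivial_map(uint64_t vaddr)
{
  uint32_t h = (uint32_t)(vaddr >> 32);
  uint32_t l = (uint32_t)vaddr;
  return ((uint64_t)(h >> 29) << 32) | l;   /* h >> 29 is the segment number */
}
```

For instance, the virtual address #2000000000000010 in segment 1 maps to physical address #0000000100000010.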


**381.** The subroutine mmgetchars(buf, size, addr, stop) reads characters starting at address addr in the simulated memory and stores them in buf, continuing until size characters have been read or some other stopping criterion has been met. If stop < 0 there is no other criterion; if stop = 0 a null character will also terminate the process; otherwise addr is even, and two consecutive null bytes starting at an even address will terminate the process. The number of bytes read and stored, exclusive of terminating nulls, is returned.

```
⟨Subroutines 14⟩ +≡
  int mmgetchars(buf, size, addr, stop)
       char *buf;
       int size;
       octa addr;
       int stop;
  {
    register char *p;
    register int m;
    octa a, x;
    if (((addr.h & #9fffffff) ∨ (incr(addr, size - 1).h & #9fffffff)) ∧ size) {
      fprintf(stderr, "Attempt␣to␣get␣characters␣from␣off␣the␣page!\n");
      return 0;
    }
    for (p = buf, m = 0, a = addr, a.h ≫= 29; m < size; ) {
      x = magic_read(a);
      if ((a.l & #7) ∨ m > size - 8) ⟨Read and store one byte; return if done 382⟩
      else ⟨Read and store up to eight bytes; return if done 383⟩
    }
    return size;
  }

382. ⟨Read and store one byte; return if done 382⟩ ≡
  {
    if (a.l & #4) *p = (x.l ≫ (8 * ((~a.l) & #3))) & #ff;
    else *p = (x.h ≫ (8 * ((~a.l) & #3))) & #ff;
    if (¬*p ∧ stop ≥ 0) {
      if (stop ≡ 0) return m;
      if ((a.l & #1) ∧ *(p - 1) ≡ '\0') return m - 1;
    }
    p++, m++, a = incr(a, 1);
  }
```

This code is used in section 381.
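
MMIX is big-endian, so byte 0 of an octabyte is the most significant byte of its high tetrabyte. The one-byte case of section 382 selects a byte accordingly. Here is a sketch in which the hypothetical helper `byte_of` takes the two halves of the octabyte as separate arguments `h` and `l`:

```c
#include <stdint.h>

/* Hypothetical helper: the byte at address a inside the octabyte
   (h, l) that contains it, using big-endian byte order. */
unsigned char byte_of(uint32_t h, uint32_t l, uint32_t a)
{
  unsigned shift = 8 * ((~a) & 0x3);   /* byte 0 of a tetrabyte is its high byte */
  uint32_t t = (a & 0x4) ? l : h;      /* bytes 4 to 7 live in the low half */
  return (unsigned char)((t >> shift) & 0xff);
}
```

Only the three low bits of the address matter here, so `byte_of(h, l, 13)` and `byte_of(h, l, 5)` select the same byte.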


```
383. ⟨Read and store up to eight bytes; return if done 383⟩ ≡
  {
    *p = x.h ≫ 24;
    if (¬*p ∧ (stop ≡ 0 ∨ (stop > 0 ∧ x.h < #10000))) return m;
    *(p + 1) = (x.h ≫ 16) & #ff;
    if (¬*(p + 1) ∧ stop ≡ 0) return m + 1;
    *(p + 2) = (x.h ≫ 8) & #ff;
    if (¬*(p + 2) ∧ (stop ≡ 0 ∨ (stop > 0 ∧ (x.h & #ffff) ≡ 0))) return m + 2;
    *(p + 3) = x.h & #ff;
    if (¬*(p + 3) ∧ stop ≡ 0) return m + 3;
    *(p + 4) = x.l ≫ 24;
    if (¬*(p + 4) ∧ (stop ≡ 0 ∨ (stop > 0 ∧ x.l < #10000))) return m + 4;
    *(p + 5) = (x.l ≫ 16) & #ff;
    if (¬*(p + 5) ∧ stop ≡ 0) return m + 5;
    *(p + 6) = (x.l ≫ 8) & #ff;
    if (¬*(p + 6) ∧ (stop ≡ 0 ∨ (stop > 0 ∧ (x.l & #ffff) ≡ 0))) return m + 6;
    *(p + 7) = x.l & #ff;
    if (¬*(p + 7) ∧ stop ≡ 0) return m + 7;
    p += 8, m += 8, a = incr(a, 8);
  }
This code is used in section 381.

384. The subroutine mmputchars(buf, size, addr) puts size characters into the simulated memory starting
at address addr.
⟨Subroutines 14⟩ +≡
  void mmputchars(buf, size, addr)
       unsigned char *buf;
       int size;
       octa addr;
  {
    register unsigned char *p;
    register int m;
    octa a, x;
    if (((addr.h & #9fffffff) ∨ (incr(addr, size - 1).h & #9fffffff)) ∧ size) {
      fprintf(stderr, "Attempt␣to␣put␣characters␣off␣the␣page!\n");
      return;
    }
    for (p = buf, m = 0, a = addr, a.h ≫= 29; m < size; ) {
      if ((a.l & #7) ∨ m > size - 8) ⟨Load and write one byte 385⟩
      else ⟨Load and write eight bytes 386⟩;
    }
  }
```


```
385. ⟨Load and write one byte 385⟩ ≡
  {
    register int s = 8 * ((~a.l) & #3);
    x = magic_read(a);
    if (a.l & #4) x.l ⊕= (((x.l ≫ s) ⊕ *p) & #ff) ≪ s;
    else x.h ⊕= (((x.h ≫ s) ⊕ *p) & #ff) ≪ s;
    magic_write(a, x);
    p++, m++, a = incr(a, 1);
  }
This code is used in section 384.

386. ⟨Load and write eight bytes 386⟩ ≡
  {
    x.h = (*p ≪ 24) + (*(p + 1) ≪ 16) + (*(p + 2) ≪ 8) + *(p + 3);
    x.l = (*(p + 4) ≪ 24) + (*(p + 5) ≪ 16) + (*(p + 6) ≪ 8) + *(p + 7);
    magic_write(a, x);
    p += 8, m += 8, a = incr(a, 8);
  }
This code is used in section 384.
```
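
The one-byte update in section 385 replaces a byte without first clearing it: exclusive-oring the word with the difference between the old and new byte, shifted into place, leaves every other byte untouched. A sketch on a single tetrabyte, with the hypothetical helper name `put_byte`:

```c
#include <stdint.h>

/* Hypothetical helper: return t with the byte at big-endian
   position k (0 is the most significant byte) replaced by b. */
uint32_t put_byte(uint32_t t, unsigned k, unsigned char b)
{
  unsigned s = 8 * ((~k) & 0x3);        /* shift amount of byte k */
  t ^= (((t >> s) ^ b) & 0xff) << s;    /* xor in (old byte) xor (new byte) */
  return t;
}
```

If the new byte equals the old one, the difference is zero and the word is unchanged, as one would hope.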

**387.** When standard input is being read by the simulated program at the same time as it is being used for interaction, we try to keep the two uses separate by maintaining a private buffer for the simulated program's StdIn. Online input is usually transmitted from the keyboard to a C program a line at a time; therefore an *fgets* operation works much better than *fread* when we prompt for new input. But there is a slight complication, because *fgets* might read a null character before coming to a newline character. We cannot deduce the number of characters read by *fgets* simply by looking at  $strlen(stdin\_buf)$ .

```
⟨Subroutines 14⟩ +≡
  char stdin_chr()
  {
    register char *p;
    while (stdin_buf_start ≡ stdin_buf_end) {
      printf("StdIn>␣"); fflush(stdout);
      fgets(stdin_buf, 256, stdin);
      stdin_buf_start = stdin_buf;
      for (p = stdin_buf; p < stdin_buf + 254; p++)
        if (*p ≡ '\n') break;
      stdin_buf_end = p + 1;
    }
    return *stdin_buf_start++;
  }

388. ⟨Global variables 20⟩ +≡
  char stdin_buf[256];      /* standard input to the simulated program */
  char *stdin_buf_start;    /* current position in that buffer */
  char *stdin_buf_end;      /* current end of that buffer */
```
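
The complication mentioned in section 387 is easy to demonstrate. In the sketch below, the hypothetical helper `line_length` finds the end of a line the way stdin_chr does, by looking for the newline instead of the first null character; `cap` is the buffer capacity, 256 in the simulator.

```c
#include <stddef.h>

/* Hypothetical helper: number of characters in the line stored in buf,
   including its newline, found by scanning for '\n' rather than '\0'.
   Like the loop in section 387, the scan gives up at buf + cap - 2. */
size_t line_length(const char *buf, size_t cap)
{
  size_t k;
  for (k = 0; k + 2 < cap; k++)
    if (buf[k] == '\n') break;
  return k + 1;
}
```

For the six input characters `a`, `b`, null, `c`, `d`, newline, strlen reports 2 but `line_length` reports 6.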

## 389. Index.

??: 25. \_\_STDC\_\_: 6. a: 44, 91, 167, 381, 384. aa: <u>167</u>, 177, 181, 186, <u>187</u>, <u>189</u>, <u>191</u>, 193, 196, 199, 205, 233, 234. aaaaa: <u>237</u>, 243, 244. ABSTIME: 89. access\_time: <u>167</u>, 217, 224, 230, 233, 234, 257, 261, 262, 266, 267, 268, 270, 271, 272, 273, 274, 288, 291, 292, 295, 296, 300, 326, 353, 354, 358, 359, 360, 364, 365, 366. ADD: 47. add: <u>49</u>, 51, 140. add\_go: <u>331</u>. ADDI: 47. addr: 40, 43, 44, 73, 89, 95, 100, 115, 116, 144, <u>208</u>, <u>209</u>, <u>210</u>, <u>212</u>, <u>213</u>, <u>216</u>, <u>219</u>, 236, 240, <u>246</u>, 251, <u>255</u>, 256, 257, 259, 260, 261, 262, 281, <u>297</u>, 356, <u>378</u>, <u>379</u>, <u>381</u>, <u>384</u>. addr\_found: <u>256</u>. ADDU: 47. addu: <u>49</u>, 51, 139. ADDUI: 47. after: 282. alf: 192, <u>193</u>, 195, <u>205</u>. alloc\_slot: 204, 205, 218, 222, 225, 261, 272, 274, 276, 298, 300, 326. and: <u>49</u>, 51, 138. AND: <u>47</u>. ANDI: 47. ANDN: 47. andn: <u>49</u>, 51, 138. ANDNH: 47. ANDNI: <u>47</u>. ANDNL: 47. ANDNMH: 47. ANDNML: 47. arg\_count: 374, 380. arg\_loc: 380. ARGS: <u>6</u>, <u>9</u>, <u>13</u>, <u>18</u>, <u>21</u>, <u>24</u>, <u>27</u>, <u>30</u>, <u>32</u>, <u>34</u>, <u>38</u>, <u>42</u>, <u>45</u>, <u>55</u>, <u>62</u>, <u>72</u>, <u>90</u>, <u>92</u>, <u>94</u>, <u>96</u>, <u>156</u>, <u>158</u>, <u>161</u>, <u>169</u>, <u>171</u>, <u>173</u>, <u>175</u>, <u>178</u>, <u>180</u>, <u>182</u>, <u>184</u>, <u>186</u>, <u>188</u>, 190, 192, 195, 198, 200, 202, 204, 208, 209, 212, 240, 250, 252, 254, 376, 377. arith\_exc: 44, 46, 59, 98, 100, 146, 307. Attempt to get characters...: 381. Attempt to put characters...: 384. aux: 20, 21, 343. avoid\_D: 273, <u>277</u>. awaken: <u>125</u>, 222, 224, 245. bus\_words: 214, 216, 219, 223, 297. b: 44, 56, 82, 157, 167, 172.

B\_BIT: <u>54</u>, 118, 304, 323, 329, 330, 332, 336, 337. bad\_fetch: 288, 293, 296, 298, <u>301</u>. bad\_inst\_mask: 304, 305, 323. bad\_resume: 323. bb: <u>167</u>, 170, 172, 179, 185, 193, 201, 203, 216, 217, 218, 219, 221, 223, 224, 226, 227, 228, 229, 259, 262, 265, 268, 271, 273, 275, 276, 280, 292, 294, 364, 378, 379. BDIF: 47. bdif: 49, 51, 344. BDIFI: 47. before: <u>282</u>. BEV: 47. BEVB: 47. big-endian versus little-endian: 304. bit\_code\_map: <u>54</u>, 56. block\_diff: 217, 219. BN: <u>47</u>. BNB: 47. BNN: 47. BNNB: 47. BNP: 47. BNPB: 47. BNZ: 47. BNZB: 47. BOD: 47. BODB: 47. **bool**: 11, 12, 20, 21, 40, 44, 65, 66, 68, 75, 169, 170, 175, 176, 202, 203, 238, 242, 303, 315. bool\_mult: 21, 344. BP: 47. bp\_a: 150, 151, 152, 153. bp\_amask: 151, 152, 153, <u>154</u>. bp\_b: 150, 151, 152, 153. bp\_bad\_stat: 154, 155, 162. bp\_bcmask: 151, 152, 153, <u>154</u>. bp\_c: 150, 153. bp\_cmask: 151, 152, 153, <u>154</u>. bp\_good\_stat: 154, 155, 162. bp\_n: 150, 153. bp\_nmask: 152, 153, <u>154</u>. bp\_npower: 151, 152, 153, <u>154</u>, 160. bp\_ok\_stat: 152, <u>154</u>, 162. bp\_rev\_stat: 152, <u>154</u>, 162. bp\_table: <u>150</u>, 151, 152, 160, 162. BPB: 47. br: <u>49</u>, 51, 85, 106, 152, 155. breakpoint: <u>9</u>, <u>10</u>, 304. breakpoint\_hit: 10, <u>12</u>, 304. buf: 381, 384.

byte\_diff: <u>21</u>, 344.

BZ: 47. BZB: 47.

**cache**: <u>167</u>, 168, 169, 170, 171, 172, 173, 174, 175, 176, 178, 179, 180, 181, 182, 183, 184, 185, 192, 193, 195, 196, 198, 199, 200, 201, 202, 203, 204, 205, 215, 217, 222, 224, 237, 326.

cache\_addr: 192, 193, 196, 201, 205, 217.

cache\_search: <u>192</u>, <u>193</u>, 195, 205, 206, 217, 224, 233, 234, 262, 267, 268, 271, 272, 273, 291, 292, 296, 302, 353, 354, 365, 366, 367, 378, 379.

**cacheblock**: <u>167</u>, 169, 170, 171, 172, 178, 179, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 195, 196, 198, 199, 200, 201, 202, 203, 204, 205, 217, 222, 224, 232, 237, 257, 258, 378, 379. caches: 163.

**cacheset**: <u>167</u>, 186, 187, 188, 189, 190, 191, 193, 194, 196, 205.

calloc: 213.

cease: <u>10</u>.

choose\_victim: 186, 187, 196, 205.

chunk: 206, 209, 210, 213, 216, 219, 223, 297.

**chunknode**: <u>206</u>, 207.

clean\_block: 178, 179, 181, 276, 365, 366, 367.

clean\_co: 230, 231, 361, 363, 364, 368.

clean\_ctl: 230, 231, 361, 368.

clean\_lock: 39, 230, 233, 234, 361, 368.

cleanup: <u>129</u>, 230, 231, 232.

Clock time is...: 14.

cmp: <u>49</u>, 51, 143.

CMP: <u>47</u>.

cmp\_fin: <u>348</u>.

cmp\_neg: 143, 348.

cmp\_pos: <u>143</u>, 348.

cmp\_zero: 143, 348.

cmp\_zero\_or\_invalid: 348.

CMPI: 47.

cmpu: <u>49</u>, 51, 143.

CMPU: <u>47</u>.

CMPUI: <u>47</u>.

co: <u>76</u>, 81, 82, <u>237</u>, 243, 244. commit\_max: 59, 67, 145, 330.

confusion: <u>13</u>, 28, 135, 185, 187.

**control**: <u>44</u>, 45, 46, 60, 63, 73, 78, 124, 127, 158, 159, 167, 230, 235, 248, 254, 255, 285, 357.

control\_struct: 23, <u>44</u>.

cool: <u>60</u>, 61, 63, 67, 69, 75, 78, 81, 82, 84, 85, 86, 98, 99, 100, 102, 103, 104, 105, 106, 108, 109, 110, 111, 112, 113, 114, 117, 118, 119, 120, 121, 122, 123, 145, 152, 158, 160, 227, 308, 309, 312, 314, 316, 322, 323, 324, 332, 333, 334, 335, 337, 338, 339, 340, 341, 347, 355, 372.

cool\_G: 99, 102, 104, 105, 106, 110, 117, 119, 120, 312, 323, 335, 337.

cool\_hist: 74, 75, 99, 151, 152, 160, 308, 309, 316.
cool\_L: 99, 102, 104, 105, 106, 110, 112, 119, 120, 312, 323, 337, 338.

cool\_S: 75, 98, 100, 110, 113, 114, 118, 119, 120, 145, 147, 337.

copy\_block: 184, 185, 217, 221.

copy\_in\_time: <u>167</u>, 217, 222, 224, 237, 276. copy\_out\_time: <u>167</u>, 203, 221, 233, 234, 259.

**coroutine**: <u>23</u>, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 44, 76, 124, 127, 167, 222, 224, 230, 235, 237, 248, 285.

coroutine\_bit: 8, 10, 125.

coroutine\_struct: 23.

count: 216, 219, 223.

count\_bits: 21, 344.

cset: <u>49</u>, 51, 345.

CSEV: 47.

CSEVI: 47.

CSN: <u>47</u>.

CSNI: 47.

CSNN: <u>47</u>.

CSNNI: <u>47</u>.

CSNP: <u>47</u>.

CSNPI: <u>47</u>.

CSNZ: <u>47</u>.

CSNZI: <u>47</u>.

CSOD: <u>47</u>.

CSODI: 47.

CSP: <u>47</u>.

CSPI: <u>47</u>.

cswap: <u>49</u>, 51, 117, 283, 307.

CSWAP: 47, 281.

CSWAPI: 47.

CSZ: 47.

CSZI: <u>47</u>.

ctl: 23, 30, 31, 32, 44, 81, 124, 125, 128, 134, 222, 224, 231, 236, 243, 244, 245, 249, 255, 286.

ctl\_change\_bit: 81, 83, 85.

cur\_O: 44, 46, 100, 145, 147.

cur\_round: 20, 346.

cur\_S: 44, 46, 100, 145, 147.

cur\_time: 28, <u>29</u>, 125.

cycs:  $\underline{9}$ ,  $\underline{10}$ . Dlocker: <u>127</u>, 128, 276. d: <u>28</u>, <u>31</u>, <u>97</u>, <u>170</u>, <u>197</u>, <u>201</u>, <u>203</u>, <u>220</u>.  $do\_resume\_trans: 325, 326.$ D\_BIT: <u>54</u>, 308, 343. do\_syncd: 280, 364, 369. do\_syncid: 280, 364, 369. data: 124, 125, 130, 131, 132, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 155, 156, 160, doing\_interrupt: 63, 64, <u>65</u>, 314, 317, 318. 167, 172, 179, 185, 197, 201, 203, 215, 216, 217, done: 125, 134, 233, 234. 218, 219, 220, 222, 223, 224, 225, 226, 232, 233,  $done\_with\_write$ : 256. 234, 237, 239, 243, 244, 245, 257, 259, 260, 261, down: 40, 86, 89, 95, 97, 116. 262, 264, 265, 266, 267, 268, 269, 270, 271, 272, DPTco: 235, 236, 237. 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, DPTctl: 235, 236. 283, 288, 289, 291, 292, 293, 294, 295, 296, 297,  $DPTname: \underline{235}, 236.$ 298, 300, 301, 302, 304, 307, 308, 309, 310, 313, DT\_hit: 267, 268, 270, 271, 272, 273. 325, 326, 327, 328, 329, 330, 331, 336, 342, 343,  $DT\_miss: 267, 270, 272.$ 344, 345, 346, 348, 350, 351, 352, 353, 354, 272. DT\_retry: 356, 357, 358, 359, 360, 361, 363, 364, 365, DTcache: 39, 128, 168, 236, 237, 266, 267, 268, 366, 367, 368, 369, 370, 378, 379. 269, 270, 272, 325, 353, 358. Deache: 39, 128, 168, 215, 217, 222, 227, 228, 233, DUNNO: <u>254</u>, 255, 268, 270, 271, 278. 234, 257, 259, 261, 262, 263, 265, 267, 268, 271, E\_BIT: <u>54</u>, 56, 306, 314, 317, 351. 273, 274, 275, 276, 280, 360, 364, 366, 378, 379.  $emulate\_virt: 272, \underline{310}, 327.$ Dclean: 233.eps: 21. $Dclean\_inc: \underline{233}.$  $errprint\_coroutine\_id$ :  $\underline{24}$ ,  $\underline{25}$ , 28.  $Dclean\_loop: 233.$ errprint $\theta$ : 13, 22, 25. dd: 197, 203.errprint1: <u>13</u>, 14, 28, 213.  $decgamma: \underline{49}, 114, 147, 327.$ errprint2: 13, 14, 25, 210.  $default\_go: \underline{46}.$ exceptions: 20, 281, 346, 351. 
deissues: <u>60, 61, 63, 64, 67, 145, 160, 308, 309, 316.</u> exit: 14.del: 216. expire:  $\underline{13}$ ,  $\underline{14}$ . delay: 219.Extern: 4, 5, 9, 29, 38, 59, 60, 66, 69, 77, 86, 98,  $delta: \underline{21}.$ 115, 136, 150, 161, 168, 175, 178, 180, 207, 209, demote\_and\_fix: <u>198</u>, <u>199</u>, 233, 234, 268, 271, 273, 211, 212, 214, 242, 247, 252, 284, 349. 353, 354, 365, 366, 367. f: 75.  $demote\_usage: 190, 191, 199.$ F\_BIT: 54, 122, 256, 302, 306, 309, 310, 313, denin: 44, 100, 133, 346, 348. 314, 317, 320, 321, 327.denin\_penalty: 279, 346, 348, 349, 350. FADD: 47. denout: 44, 100, 133, 134, 346. fadd: 49, 51, 346. denout\_penalty: 281, 346, <u>349</u>, 351. false: 11, 12, 59, 75, 81, 100, 112, 113, 114, die: <u>144</u>, 160, 265, 308, 309, 310. 147, 170, 179, 201, 203, 205, 221, 239, 244, dirty: 167, 170, 172, 179, 181, 185, 197, 201, 259, 301, 304, 314, 323, 324, 330, 332, 337, 203, 216, 221, 259, 262. 340, 351, 363, 369.  $dirty\_only$ : 176, 177. Fclose: 371, 372.  $dispatch\_count$ : 64, 65, 81.  $fcmp: \underline{49}, 51, 348.$ dispatch\_done: 101, 112, 113, 114, 332. FCMP: 47. FCMPE: 47, 348. dispatch\_lock: 39, 64, 65, 75, 81, 85, 310, 329, 330, 356. fcomp: 21, 346, 348. dispatch\_max: 59, 74, 75, 85, 162. fdiv: 49, 51, 346. dispatch\_stat: 64, 66, 162. FDIV: 47. div: 7, 49, 51, 121, 343.  $fdivide: \underline{21}, 346.$ DIV: 47. feps:  $\underline{49}$ , 51, 348. DIVI: 47.  $fepscomp: \underline{21}, 348.$ divu: 49, 51, 121, 343. FEQL:  $\underline{47}$ . DIVU: 47. FEQLE: <u>47</u>, 348. **fetch**: <u>68,</u> 69, 70, 73, 74, 301. DIVUI: 47.


fetch\_bot: 69, 73, 74, 75, 301. flusher: 167, 176, 202, 203, 204, 205, 215, 217,  $fetch\_co: 285, 286, 287.$ 221, 233, 234, 259, 263.  $fetch\_ctl$ : 285, 286.  $flusher\_ctl$ : 167. fetch\_hi: 285, 294, 297, 301.  $fmul: \underline{49}, 51, 346.$ fetch\_lo: 285, 294, 297, 301, 304. FMUL: 47. fetch\_max: 59, 284, 301. fmult: 21, 346. $fetch\_one: 301.$ Fopen: 371, 372. fetch\_ready: 285, 291, 292, 296, 297, 299, 301. *fplus*: 21, 346.  $fetch\_retry: \underline{298}, 300.$ fprintf: 13, 381, 384. fetch\_top: 69, 71, 73, 74, 75, 301. Fputs: 371, 372. fetched: 284, 285, 294, 297, 301, 304. Fputws: 371, 372. fflush: 387. fread: 387.fgets: 387.Fread: <u>371</u>, 372. Fgets: 371, 372.freeze\_dispatch: 75, 81, 118, 355. Fgetws: 371, 372.frem: <u>49</u>, 51, 320, 350, 351. fill\_from\_mem: 129, 222, 224, 237. FREM: 47.  $fill\_from\_S: 129, 224, 237.$  $frem\_max: 349, 351.$ fill\_from\_virt: 129, 237, 242. fremstep: 21, 350, 351.fill\_lock: 167, 174, 222, 224, 225, 226, 237, 257, froot:  $\underline{21}$ , 346. 261, 272, 274, 298, 300. Fseek: <u>371</u>, 372. filler: 167, 176, 195, 196, 204, 218, 224, 225, 261, fsqrt: 7, 49, 51, 346, 347. 272, 274, 276, 298, 300, 326. FSQRT: 47. filler\_ctl: <u>167</u>, 176, 225, 236, 261, 272, 274, fsub: 49, 51, 346. 298, 300, 326. FSUB: 47.  $fin\_bflot: 346.$ Ftell: <u>371</u>, 372. fin\_ex: 135, 144, 155, 266, 269, 271, 272, 273, 274, FUN: 47, 348. 276, 279, 281, 283, 296, 298, 300, 301, 313, 325, **func**: 75, <u>76</u>, 77, 79. 326, 327, 328, 329, 331, 336, 342, 345, 346, func\_struct: 76. 350, 351, 356, 360, 363, 364, 370. FUNE: 47, 348.  $fin\_flot: 346$ . funeq:  $\underline{49}$ , 51, 348.  $fin\_ld: \underline{279}.$ funit: 77, 79, 82.  $fin\_st$ :  $\underline{281}$ .  $funit\_count$ :  $\underline{77}$ , 79, 82.  $fin\_uflot: 346$ . Fwrite: 371, 372. finish\_store: 272, 279, <u>280</u>. g: 86, 167, 172. fint: 49, 51, 346, 347. get: 49, 51, 118, 146, 328. FINT: 47. GET:  $\underline{47}$ . fintegerize: 21, 346. 
get\_reader: 182, 183, 233, 257, 266, 267, 271, first:  $\underline{216}$ . 272, 273, 288, 291, 296, 353, 354, 358, 359, fix: 49, 51, 346, 347. 360, 365, 366. FIX:  $\underline{47}$ . GETA:  $\underline{47}$ . fixit: 21, 346. GETAB: 47. FIXU: 47. gg: 167, 170, 172, 216. flags: 80, 81, 83, 312, 320. go: 44, 46, 49, 51, 85, 100, 119, 120, 122, 123, floatit:  $\underline{21}$ , 346. 128, 155, 160, 231, 236, 249, 286, 308, 312, flot: <u>49</u>, 51, 346, 347. 320, 321, 322, 331, 364. FLOT: 47. GO: 47, 235. FLOTI: 47. FLOTU: 47. GOI: 47.  $got\_DT$ :  $\underline{272}$ FLOTUI: 47.  $flush\_cache: 202, 203, 205, 217, 233, 234, 263.$ got\_IT: 291, 298. got\_one: 291, 300, 301.  $flush\_to\_mem: 129, 215.$ 

h: <u>17</u>, <u>151</u>, <u>152</u>, <u>210</u>, <u>213</u>.

flush\_to\_S: 129, 217.

H\_BIT: 54, 146, 306, 308, 313, 314, 317, 319, 320, 321.

h\_down: 152.

h\_up: 152.

Halt: 371, 372.

halted: 10, 12, 356, 373.

hash\_prime: 207, 209, 210, 213.

head: <u>69</u>, 71, 73, 74, 75, 80, 81, 84, 85, 100, 110, 151, 152, 160, 228, 229, 301, 308, 309, 316, 323, 335, 341.

Hennessy, John LeRoy: 58, 150, 163.

hist: 44, 46, 68, 75, 85, 100, 160, 308, 309.

hit: 193.

hit\_and\_miss: 267, 268, 271, 273.

hit\_set: 192, 193, 194, 196, 199, 201, 217.

holding\_time: 247, 256, 257.

hot: <u>60, 61, 63, 64, 67, 69, 86, 101, 146, 147, 149,</u> 255, 256, 314, 316, 317, 318, 319, 320, 321, 357.

i: 10, 12, 44, 172, 176, 181, 185, 201, 246.

I can't allocate...: 213.

I\_BIT: <u>54</u>, 348.

Icache: 39, 128, <u>168, 222, 227, 229, 265, 280, 291,</u> 292, 294, 296, 300, 359, 364, 365.

Ihit\_and\_miss: 291, 292, 296, 298, 299.

ii: 185, 216.

IIADDU: 47.

IIADDUI: 47.

illegal\_inst: 118, 347.

inbuf: 167, 200, 201, 219, 220, 222, 223, 226, 245, 379.

incgamma: 49, 113, 147, 323, 327, 338.

INCH: 47.

INCL: 47.

INCMH: 47.

INCML: 47.

Incorrect implementation...: 22.

incr: 21, 46, 64, 84, 85, 100, 113, 114, 119, 120, 236, 240, 265, 279, 301, 314, 320, 322, 323, 325, 333, 338, 339, 369, 370, 373, 380, 381, 382, 383, 384, 385, 386.

increase\_L: 110, 312.

incrl: 49, 112, 119, 327.

inst: <u>68</u>, 73, 75, 84, 100, 110, 228, 229, 304, 323, 335, 341.

inst\_ptr: 71, 73, 81, 85, 119, 120, 122, 123, 160, 284, 288, 290, 294, 301, 302, 304, 308, 309, 310, 312, 314, 322, 323.

interactive\_read\_bit: 8.

interim: 44, 46, 81, 100, 112, 113, 114, 227, 320, 330, 332, 337, 340, 342, 350, 351, 361, 363, 364, 369.

internal\_op: 51, 80, 320.

internal\_op\_name: 46, 50.

internal\_opcode: 44, <u>49</u>, 51, 246.

interrupt: 44, 46, 59, 68, 73, 81, 100, 118, 122, 132, 140, 141, 144, 146, 149, 160, 256, 266, 269, 271, 272, 281, 282, 288, 301, 302, 304, 306, 307, 308, 309, 310, 313, 314, 317, 319, 320, 321, 322, 323, 327, 329, 330, 331, 332, 336, 337, 343, 346, 348, 351.

interrupts: 306.

INTERVAL\_TIMEOUT: 57, 314.

IPTco: 235, 236, 237.

IPTctl: <u>235</u>, 236.

IPTname: 235, 236.

is\_denormal: 346, 348, 350, 351.

is\_dirty: 169, 170, 177, 205, 233, 234.

is\_load\_store: <u>307</u>, 310, 316, 320.

is\_trivial: <u>346</u>, 350.

issue\_bit: 8, 10, 81, 145, 146, 147, 149, 283, 310, 314, 319, 320, 321.

issued\_between: <u>158</u>, <u>159</u>, 160, 308, 309, 316.

IT\_hit: 291, 292, 295, 296, 298, 299.

IT\_miss: 291, 295, 298, 299.

ITcache: 39, 128, <u>168</u>, 236, 237, 288, 291, 292, 293, 295, 298, 302, 325, 354, 360.

IVADDU: 47.

IVADDUI: 47.

j: <u>10</u>, <u>12</u>, <u>56</u>, <u>162</u>, <u>170</u>, <u>172</u>, <u>176</u>, <u>179</u>, <u>181</u>, <u>183</u>, <u>185</u>, <u>189</u>, <u>191</u>, <u>203</u>.

jj: <u>185</u>.

jmp: <u>49</u>, 51, 84, 85, 327.

JMP: 47.

JMPB: <u>47</u>.

k: <u>76</u>.

K\_BIT: <u>54</u>, 118, 322.

keep: 202, <u>203</u>.

key: 210, 213.

known: 40, 43, 44, 46, 59, 85, 89, 93, 100, 102, 112, 119, 120, 131, 132, 133, 135, 144, 237, 244, 255, 265, 290, 312, 322, 331, 338, 364.

known\_phys: 296, 298.

l: 17, 86, 187, 189, 191.

last\_h: 209, 210, 211, 213, 216, 219, 223, 297.

last\_off: <u>216</u>.

ld: 49, 51, 117, 265, 307, 327, 357.

ld\_ready: 267, 268, 270, 271, 273, 274, 277, 278, 279.

ld\_retry: 272, <u>273</u>, 274.

ld\_st\_launch: <u>265</u>, 266, 354.

LDB: <u>47</u>, 279.

LDBI: 47.

LDBU: 47, 279.

LDBUI: 47.

LDHT: 47, 279.

LDHTI: 47.

LDO: <u>47</u>.

LDOI: 47.

LDOU: 47, 114, 332.

LDOUI: 47.

ldpte: 49, 235, 236, 265.

LDPTE: 235, 236, 279.

ldptp: <u>49</u>, 235, 236, 265.

LDPTP: 235, 236, 279.

LDSF: 47, 279.

LDSFI: 47.

LDT: <u>47</u>, 279.

LDTI: 47.

LDTU: 47, 279.

LDTUI: 47.

ldunc: <u>49</u>, 51, 117, 265, 268, 271, 273, 357.

LDUNC: <u>47</u>.

LDUNCI: 47.

ldvts: 49, 51, 118, 265, 352.

LDVTS: 47.

LDVTSI: 47.

LDW: 47, 279.

LDWI: 47.

LDWU: 47, 279.

LDWUI: 47.

lim: 185.

list: 6.

little-endian versus big-endian: 304.

load\_cache: 200, 201, 222, 224, 237.

load\_sf: 21, 279.

loc: 44, 46, 68, 73, 80, 81, 84, 85, 100, 118, 119, 122, 144, 149, 151, 152, 160, 236, 266, 271, 296, 304, 320, 322, 323, 331, 355, 364, 368, 372.

lock: 167, 174, 200, 217, 222, 224, 225, 226, 233, 234, 237, 257, 261, 266, 267, 271, 272, 273, 274, 276, 288, 291, 296, 300, 326, 353, 354, 358, 359, 360, 365, 366, 367.

lockloc: 23, 37, 125, 145, 234, 257, 279, 287, 301, 360, 361, 364.

lockvar: 37, 65, 167, 214, 230, 247.

lring\_mask: 88, 89, 104, 105, 106, 110, 112, 113, 114, 117, 119, 120, 337, 338.

lring\_size: 86, 88, 89.

lru: 164, 186, 187, 189, 191.

m: 12, 187, 189, 191, 268, 270, 271, 278, 381, 384.

ma: 372, 380.

magic\_done: 372.

magic\_read: 377, 378, 380, 381, 385.

magic\_write: 377, 379, 385, 386.

mask: <u>282</u>.

max: 268, 292.

max\_mem\_slots: 86, 89.

max\_pipe\_op: <u>49</u>, 133, 136.

max\_real\_command: <u>49</u>, 81.

max\_rename\_regs: 86, 89.

max\_stage: 26, 129.

max\_sys\_call: 371, 372.

mb: 372, 380.

mem: 113, 114, <u>115</u>, 116, 117, 227, 236, 246, 249, 254, 255, 265, 333, 334, 339, 355.

mem\_addr\_time: 214, 216, 219, 225, 260, 261, 271, 274, 277, 297, 300.

mem\_chunks: 207, 213.

mem\_chunks\_max: 206, 207, 213.

mem\_direct: 257.

mem\_hash: 207, 209, 210, 213, 216, 219, 223, 297.

mem\_lock: 39, 214, 215, 219, 222, 225, 260, 261, 271, 274, 277, 297, 300.

mem\_locker: <u>127</u>, 128, 219, 260, 271, 277, 297.

mem\_read: 208, 209, 210, 219, 222, 271, 277, 297, 378.

mem\_read\_time: 214, 219, 222, 223, 271, 277, 297.

mem\_slots: 63, 86, 89, 111, 145, 147, 256.

mem\_write: 208, 212, 213, 216, 260, 379.

mem\_write\_time: 214, 216, 260.

mem\_x: 44, 46, 100, 111, 113, 117, 123, 144, 145, 146, 147, 255, 327, 339, 355.

mmgetchars: 377, 381.

MMIX\_config: 1, 9, 23, 29, 49, 59, 136, 207, 259.

mmix\_fclose: 372, <u>376</u>.

mmix\_fgets: 372, 376.

mmix\_fgetws: 372, <u>376</u>.

mmix\_fopen: 372, 376.

mmix\_fputs: 372, <u>376</u>.

mmix\_fputws: 372, 376.

mmix\_fread: 372, 376.

mmix\_fseek: 372, 376.

mmix\_ftell: 372, 376.

mmix\_fwrite: 372, 376.

MMIX\_init: 1, 9, 10.

mmix\_opcode: 44, <u>47</u>, 75, 156, 157.

MMIX\_run: 1, 9, 10.

mmputchars: 377, 384.

mode: <u>21</u>, <u>167</u>, 217, 257, 263.

mor: 49, 51, 344.

MOR: 47.

More...chunks are needed: 213.

MORI: 47.

mul: <u>49</u>, 51, 343.

MUL: 47.

MULI: <u>47</u>.

mulu: <u>49</u>, 51, 121, 343.

MULU: 47.

MULUI: 47.

mul0: 49, 343.

mul1: <u>49</u>, 343.

mul2: 49.

mul3: <u>49</u>.

mul4: 49.

mul5: 49.

mul6: <u>49</u>.

mul7: <u>49</u>.

mul8: <u>49</u>, 343.

mux: <u>49</u>, 51, 142.

MUX: 47.

MUXI: 47.

MXOR: 47.

MXORI: <u>47</u>.

my\_div: 7.

my\_fsqrt: 7.

my\_random: 7.

N\_BIT: <u>54</u>, 271.

name: <u>23, 25, 39, 76, 128, 167, 174, 176, 231,</u> 236, 249, 286.

nand: <u>49</u>, 51, 138.

NAND: 47.

NANDI: 47.

need\_b: 44, 46, 100, 106, 108, 112, 113, 114, 131, 312, 345.

need\_ra: 44, 46, 100, 108, 112, 113, 131, 324.

NEG: 47.

neg\_one: <u>20</u>, 22, 143, 236, 282, 372.

NEGI: 47.

NEGU: 47.

NEGUI: 47.

new\_cool: 75, <u>78</u>, 101.

new\_fetch: 288, 298, 301, 302.

new\_head: 74, 75, 81, 85, 120.

new\_L: 120.

new\_O: 75, 99, 100, 119, 120, 333, 334, 338, 339.

new\_Q: 146, <u>148</u>, 149, 310, 314, 329.

new\_S: 75, 99, 100, 113, 114, 333, 334, 339.

new\_tail: 301.

next: 23, 26, 28, 32, 33, 35, 82, 125, 134, 145, 176, 183, 196, 202, 205, 217, 218, 221, 225, 233, 234, 259, 261, 263, 266, 272, 274, 276, 298, 300, 326, 350, 361, 363, 364, 368.

next\_sync: <u>364</u>.

no\_hardware\_PT: <u>242</u>, 272, 298.

NONEXISTENT\_MEMORY: <u>57</u>.

noop: 49, 51, 80, 118, 122, 322, 323, 327, 332, 337.

noop\_inst: <u>118</u>, 227.

nor: <u>49</u>, 51, 138.

NOR: 47.

NORI: 47.

note\_usage: 188, 189, 190, 196.

noted: <u>68</u>, 73, 75, 85, 304, 323.

nullifying: 75, 85, 146, 147, 310, <u>315</u>, 316.

nxor: 49, 51, 138.

NXOR: <u>47</u>.

NXORI: 47.

o: <u>19</u>, <u>40</u>, <u>157</u>, <u>246</u>.

O\_BIT: <u>54</u>.

oand: 21, 241.

oandn: 21, 146, 240, 279, 325.

octa: 9, 10, <u>17</u>, 18, 19, 20, 21, 40, 44, 46, 68, 90, 91, 98, 99, 141, 148, 156, 157, 167, 192, 193, 197, 201, 203, 204, 205, 206, 208, 209, 210, 212, 213, 216, 219, 220, 237, 238, 239, 240, 241, 246, 254, 255, 268, 270, 271, 278, 282, 284, 297, 372, 373, 376, 377, 378, 379, 380, 381, 384.

odif: 49, 51, 344.

ODIF: 47.

ODIFI: 47.

odiv: <u>21</u>, 343.

off: 185, 210, 213, 216, 219, 223, 226.

old\_hot: 60, 64, 276, 283, 310, 322, 328, 329, 342, 351, 353, 356, 364.

old\_tail: 64, 69, 70, 74, 75, 85, 160, 308, 309.

ominus: 21, 139, 140, 344.

omult: 21, 343.

op: <u>44</u>, 46, <u>75</u>, 80, 81, 82, 84, 85, 100, 102, 103, 108, 109, 112, 113, 114, 117, 124, 139, 151, 152, 155, 156, <u>157</u>, 236, 279, 281, 282, 312, 320, 321, 327, 332, 339, 344, 345, 346, 348.

opcode\_name: <u>48</u>, 73.

operating system: 243.

oplus: 21, 139, 140, 241, 265, 331.

ops: <u>76</u>, 79, 82.

or: 49, 51, 138.

OR: 47.

ORH: 47.

ORI: 47.

ORL: <u>47</u>.

ORMH: 47.

ORML: 47.

orn: 49, 51, 138.

ORN: 47.

ORNI: 47.

outbuf: 167, 176, 202, 203, 215, 216, 217, 218, 219, 221, 259, 378, 379.

overflow: 20, 21, 343.

owner: 44, 46, 63, 67, 73, 81, 124, 134, 144, 145, 244, 314, 357.

p: 26, 28, 33, 35, 40, 63, 73, 120, 170, 172, 179, <u>185</u>, <u>187</u>, <u>189</u>, <u>191</u>, <u>193</u>, <u>196</u>, <u>199</u>, <u>201</u>, <u>203</u>, <u>205</u>, <u>251</u>, <u>255</u>, <u>256</u>, <u>258</u>, <u>378</u>, <u>379</u>, <u>381</u>, <u>384</u>, <u>387</u>.

P\_BIT: 54, 81, 149, 160, 322, 331.

pack\_bytes: <u>320</u>, 335, 341.

page coloring: 268, 292.

page\_b: 238, 239, 243, 244.

page\_bad: 238, 239, 266, 288.

page\_mask: 238, 239, 240, 241, 279, 325.

page\_n: 238, 239, 240, 279.

page\_r: 238, 239, 244.

page\_s: 238, 239, 243, 268, 292.

panic: 13, 22, 28, 135, 185, 187, 213.

PARITY\_ERROR: 57.

pass\_after: <u>125</u>, 134, 266, 268, 270, 271, 288, 350, 353.

pass\_data: <u>134</u>, 135.

passit: 134, 266, 268, 270, 271, 288, 350, 353.

Patterson, David Andrew: 58, 150, 163.

PBEV: 47.

PBEVB: 47.

PBN: 47.

PBNB: 47.

PBNN: 47.

PBNNB: 47.

PBNP: 47.

PBNPB: 47.

PBNZ: 47.

PBNZB: 47.

PBOD: 47.

PBODB: 47.

PBP: <u>47</u>.

PBPB: 47.

pbr: 49, 51, 81, 85, 106, 152, 155.

PBZ: 47.

PBZB: 47.

peek\_hist: 68, 74, 75, 85, 99, 100, 151, 152.

peekahead: <u>59</u>, 74.

phys\_addr: 240, 241, 268, 270, 272, 292, 295, 298.

pipe\_bit: 8, 10.

pipe\_limit: 136.

pipe\_seq: 133, 134, <u>136</u>, 141.

policy: 186, <u>187</u>, <u>189</u>, <u>191</u>.

pop: 46, 49, 51, 85, 120, 331.

POP: 47.

pop\_unsave: 120, 332.

ports: 128, <u>167</u>, 183.

POWER\_FAILURE: <u>57</u>.

pp: 184, <u>185</u>.

PR\_BIT: <u>54</u>, 266, 269.

predicted: <u>85</u>, 151.

prego: 49, 51, 81, 227, 265, 288, 289, 294, 296, 298, 300, 301.

PREGO: 47, 235.

PREGOI: 47.

preld: 49, 51, 81, 227, 265, 266, 269, 271, 272, 273, 274.

PRELD: 47.

PRELDI: 47.

prest: 49, 51, 81, 227, 265, 269, 271, 272, 273, 274, 275.

PREST: 47, 275.

prest\_span: 275, <u>276</u>.

prest\_win: <u>267</u>, 276.

PRESTI: 47.

print\_bits: 46, 55, 56, 73.

print\_cache: 175, 176.

print\_cache\_block: <u>171</u>, <u>172</u>, 177.

print\_cache\_locks: 39, <u>173</u>, <u>174</u>.

print\_control\_block: 45, 46, 63, 81, 125, 145, 146, 147.

print\_coroutine\_id: 24, 25, 28, 33, 63, 73, 81, 125, 145.

print\_fetch\_buffer: <u>72</u>, <u>73</u>, <u>253</u>.

print\_locks: 10, 38, 39.

print\_octa: 18, 19, 43, 46, 73, 91, 149, 152, 160, 176, 251, 283, 310, 314, 319, 320, 321.

print\_pipe: 10, 252, 253.

print\_reorder\_buffer: <u>62</u>, <u>63</u>, <u>253</u>.

print\_spec: <u>42</u>, <u>43</u>, <u>46</u>.

print\_specnode: <u>43</u>, 46.

print\_specnode\_id: 43, 73, 90, 91.

print\_stats: <u>161</u>, <u>162</u>.

print\_trip\_warning: 373, 376.

print\_write\_buffer: 250, 251, 253.

printf: 10, 19, 25, 28, 33, 39, 43, 46, 56, 63, 73, 81, 91, 125, 145, 146, 147, 149, 152, 160, 162, 172, 174, 176, 177, 251, 283, 310, 314, 319, 320, 321, 387.

privileged\_inst: 118, 355.

program counter: 284.

PROT\_OFFSET: 54, 269, 272, 293, 298.

prototypes for functions: 6.

PRW\_BITS: <u>266</u>, 269, 272.

pseudo\_lru: <u>164</u>, 186, 187, 189, 191.

pst: 49, 51, 117, 254, 265, 266, 280, 321, 357.

ptr\_a: 44, 114, 117, 215, 217, 222, 224, 227, 236, 237, 249, 254, 255, 325, 326, 333, 334.

ptr\_b: 44, 217, 218, 222, 224, 225, 232, 233, 234, 237, 257, 261, 262, 272, 274, 298, 300, 326.

ptr\_c: 44, 224, 225, 236, 237.

PUSHGO: 47.

pushgo: 49, 51, 85, 110, 119, 331.

PUSHGOI: 47.

PUSHJ: 47.

pushj: 49, 51, 85, 110, 119, 327.

PUSHJB: 47.

put: 49, 51, 118, 146, 149, 329.

PUT: 47.

PUTI: 47.

PW\_BIT: <u>54</u>, 266, 269.

PX\_BIT: 54, 269, 293, 298, 301.

q: 35, 196, 205, 255, 256, 258, 378, 379.

qloop: <u>255</u>.

quantify\_mul: <u>343</u>.

queuelist: <u>34</u>, <u>35</u>, 125.

r: 35, 93, 95, 189, 191.

rA: <u>52,</u> 107, 108, 146, 324, 329, 334, 342.

random: 7, 164, 167, 186, 187.

rank: 167, 172, 186, 187, 188, 189, 191.

rB: <u>52</u>, 86, 310, 312, 319.

rBB: <u>52</u>, 312, 319, 322, 372, 380.

rC: 52, 87.

rD: 52, 107.

rE: 52, 107, 108.

reader: 128, <u>167</u>, 183, 233, 257, 266, 267, 271, 272, 273, 288, 291, 296, 353, 354, 358, 359, 360, 365, 366.

REBOOT\_SIGNAL: 57.

register\_truth: 155, <u>156</u>, <u>157</u>, 345.

rel\_addr\_bit: 75, 83, 106.

release\_lock: 37, 222, 226, 233, 234, 272, 298, 356.

ren\_a: <u>44</u>, 46, 100, 111, 117, 119, 121, 123, 144, 145, 146, 147, 312, 322, 334, 340.

ren\_x: 44, 46, 100, 110, 111, 112, 114, 118, 119, 120, 123, 144, 145, 146, 147, 236, 312, 322, 333, 334, 338.

rename registers: 44, 86.

rename\_regs: 63, 86, 89, 111, 145, 146, 147.

reorder\_bot: 60, 63, 67, 75, 145, 159, 318, 357.

reorder\_top: 60, 61, 63, 67, 75, 145, 159, 318, 357.

repl: 167, 196, 199, 205.

**replace\_policy**: <u>164</u>, 167, 186, 187, 188, 189, 190, 191.

res: <u>93</u>.

resum: 49, 67, 314, 323, 325.

resume: 49, 51, 85, 149, 322, 323, 325.

RESUME: 47, 304, 305, 323.

RESUME\_AGAIN: 320, 323.

resume\_again: 323.

RESUME\_CONT: 320, 323, 364.

RESUME\_SET: 307, <u>320</u>, 323, 324.

resume\_trans: 325, 326.

RESUME\_TRANS: 242, 320, 323, 325.

resuming: 73, <u>78</u>, 81, 103, 160, 308, 309, 316, 323, 324.

reversed: 152.

rF: <u>52</u>.

rG: <u>52</u>, 89, 102, 329, 330, 334, 342.

rH: 52, 121.

rI: 52, 314.

ring: 26, 28, 29, 34, 35.

ring\_size: 26, 27, 28, 29, 125.

rJ: <u>52</u>, 85, 107, 119, 312, 319.

rK: <u>52</u>, 149, 314, 317, 322, 328.

rl: 44, 46, 100, 112, 119, 120, 123, 145, 146, 147, 334, 338.

rL: 52, 102, 112, 119, 120, 329, 330, 334, 338.

rM: 52, 107.

rN: 52, 89.

rO: <u>52</u>, 98, 118.

ROUND\_DOWN: 346.

ROUND\_NEAR: 346.

ROUND\_OFF: 346.

ROUND\_UP: <u>346</u>.

rP: <u>52</u>, 283, 335, 341.

rQ: <u>52</u>, 146, 149, 310, 314, 329.

rR: <u>52</u>, 121, 335, 341.

rS: <u>52</u>, 98, 118.

rT: 52, 122, 310, 312, 372.

rTT: 52, 314.

rU: <u>52</u>, 100, 146.

rv: 239.

rV: 52, 329.

rW: <u>52</u>, 320, 322, 373.

rWW: 52, 320, 322, 373.

rX: <u>52</u>, 320, 322.

rXX: <u>52</u>, 320, 322, 372.

rY: <u>52</u>, 321, 324.

rYY: 52, 321, 323, 324.

rZ: 52, 321, 324, 335, 339.

rZZ: 52, 321, 323, 324.

s: <u>21</u>, <u>28</u>, <u>43</u>, <u>133</u>, <u>134</u>, <u>187</u>, <u>189</u>, <u>191</u>, <u>193</u>, <u>196</u>, <u>205</u>, <u>385</u>.

S\_BIT: 54, 149.

S\_non\_miss: <u>224</u>.

sadd: 49, 51, 344.

SADD: 47.

SADDI: 47.

sav: 49, 327, 337.

save: <u>49</u>, 51, 327, 337, 340.

SAVE: 47, 81, 281, 305, 341.

Scache: 39, <u>168</u>, 215, 217, 218, 219, 220, 221, 222, 224, 225, 226, 234, 261, 274, 300, 360, 364, 367, 378, 379.

schedule: 27, 28, 31, 125, 326, 368.

schedule\_bit: 8, 10, 28, 33.

Sclean: <u>234</u>.

Sclean\_inc: <u>234</u>.

Sclean\_loop: 234.

security\_disabled: 66, 67.

self: <u>124</u>, 125, 134, 215, 217, 222, 224, 225, 226, 233, 234, 237, 257, 259, 260, 261, 262, 264, 266, 272, 274, 279, 298, 300, 301, 310, 350, 356, 358, 359, 360, 361, 362, 364, 365, 366, 367, 368.

sentinel: 35, <u>36</u>, 125.

serial: <u>164</u>, 186, 187, 189, 191.

set: 49, 51, 109, 137, 167, 177, 181, 192, 233, 234, 343.

set\_l: 44, 46, 100, 112, 119, 120, 123, 145, 146, 147, 334, 338.

set\_lock: 37, 81, 215, 217, 219, 222, 224, 225, 226, 233, 234, 237, 259, 260, 261, 262, 264, 271, 272, 274, 276, 277, 297, 298, 300, 310, 358, 359, 360, 361, 362, 365, 366, 367, 368.

set\_round: 281, <u>346</u>.

SETH: <u>47</u>, 112, 323.

SETL: <u>47</u>.

SETMH: 47.

SETML: 47.

SFLOT: 47.

SFLOTI: 47.

SFLOTU: 47.

SFLOTUI: 47.

sh: <u>49</u>, 141.

shift\_amt: 141.

shift\_left: 21, 22, 113, 114, 118, 139, 141, 244, 279, 282, 333, 339.

shift\_right: 21, 141, 239, 243, 279, 282, 334, 343.

shl: <u>49</u>, 51, 141.

shlu: <u>49</u>, 51, 141.

show\_pred\_bit: 8, 46, 152, 160.

show\_spec\_bit: 8.

show\_wholecache\_bit: 8, 177.

shr: <u>49</u>, 51, 141.

shrt: 21.

shru: <u>49</u>, 51, 141.

sign\_bit: 80, 81, 82, 85, 89, 91, 100, 118, 119, 140, 143, 144, 149, 157, 160, 177, 179, 205, 230, 233, 234, 244, 266, 271, 279, 288, 296, 320, 322, 331, 346, 353, 354, 355, 364, 368.

signed\_odiv: <u>21</u>, 343.

signed\_omult: 21, 343.

sim: <u>21</u>.

size: 381, 384.

SL: 47.

sleep: <u>125</u>, 224, 257, 272, 274, 298, 300, 301.

sleepy: 301, 302, <u>303</u>.

SLI: <u>47</u>.

SLU: 47.

SLUI: 47.

**spec**: 40, 41, 42, 43, 44, 92, 93, 284.

spec\_install: 94, 95, 110, 112, 113, 114, 117, 118, 119, 120, 121, 312, 322, 333, 334, 338, 339, 340, 355.

spec\_read: 206, 208, 210.

spec\_rem: 96, 97, 123, 145, 146, 147, 256.

spec\_write: 206, 208, 213.

special\_name: 53, 91.

**specnode**: <u>40</u>, 43, 44, 71, 86, 92, 93, 94, 95, 96, 97, 100, 115, 120, 255.

specnode\_struct: 40.

specval: 92, 93, 104, 105, 106, 108, 113, 118, 120, 122, 312, 322, 323, 324, 339.

speed\_lock: 39, <u>247</u>, 257, 362.

Sprep: 233, <u>234</u>.

square\_one: 272, 369, 370.

SR: 47.

SRI: 47.

SRU: 47.

SRUI: 47.

st: 49, 51, 117, 254, 265, 266, 267, 270, 271, 272, 279, 280, 321, 327.

st\_ready: 267, 270, 271, 272, 280.

stage: 23, 25, 26, 28, 39, 59, 124, 125, 126, 128, 129, 134, 136, 174, 231, 236, 249, 284.

stall: 75, 82, 101, 102, 111, 120, 312, 322, 332.

stamp: 246, 251, 256, 257.

start\_fetch: 288, 289.

start\_ld\_st: <u>265</u>.

startup: 30, 31, 81, 203, 219, 221, 225, 233, 244, 249, 257, 259, 260, 261, 266, 267, 271, 272, 273, 274, 276, 277, 286, 287, 288, 291, 296, 297, 298, 300, 353, 354, 358, 359, 360, 361, 365, 366.

state: 30, 31, 44, 46, 124, 125, 130, 131, 133, 134, 135, 215, 217, 219, 222, 224, 232, 233, 234, 237, 257, 259, 260, 262, 264, 265, 267, 268, 270, 271, 272, 273, 274, 276, 277, 278, 279, 280, 281, 288, 291, 292, 295, 296, 297, 298, 300, 301, 310, 325, 326, 345, 351, 354, 358, 359, 360, 361, 364, 368.

state\_4: 308, 310, 311.

state\_5: 307, 310, 311.

STB: <u>47</u>, 281.

STBI: 47.

STBU: 47, 281.

STBUI: 47.

STCO: <u>47</u>, 117.

STCOI: 47.

stderr: 13, 381, 384.

stdin: 387.

StdIn>: 387.

stdin\_buf: 387, 388.

stdin\_buf\_end: 387, 388.

stdin\_buf\_start: 387, 388.

stdin\_chr: 377, 387.

stdout: 387.

STHT: 47, 281.

STHTI: 47.

STO: 47.

STOI: 47.

stop: 381, 382, 383.

store\_sf: <u>21</u>, 281.

STOU: 47, 113, 339.

STOUI: 47.

strlen: 387.

STSF: 47, 281.

STSFI: 47.

STT: 47, 281.

STTI: 47.

STTU: 47, 281.

STTUI: 47.

stunc: <u>49</u>, 251, 254, 257, 281.

STUNC: 47, 281.

STUNCI: 47.

STW: 47, 281.

STWI: 47.

STWU: 47, 281.

STWUI: 47.

sub: 44, 49, 51, 140.

SUB: <u>47</u>.

SUBI: 47.

SUBSUBVERSION: 89.

subu: 49, 51, 139.

SUBU: 47.

SUBUI: 47.

SUBVERSION: 89.

support: <u>78</u>, 79, 80.

suppress\_dispatch: 64, <u>65</u>, 317.

switch0: 288, 299.

switch1: 130, 133, 265, 327, 345, 359, 360.

switch2: 135, 364.

SWYM: 47, 301, 321, 323, 325.

swym\_one: 301, 302.

sync: 49, 51, 230, 233, 234, 251, 254, 256, 257, 355, 356, 361.

SYNC: 47, 304, 305, 323.

sync\_check: 269, 271, 272, <u>370</u>.

syncd: 49, 51, 230, 265, 269, 271, 272, 280, 320, 323, 364, 368, 369.

SYNCD: 47.

SYNCDI: 47.

SYNCID: 47.

syncid: 49, 51, 85, 119, 265, 266, 267, 269, 270, 271, 272, 280, 320, 323.

SYNCIDI: 47.

sys\_call: 371.

system dependencies: 17, 89.

t: <u>35, 82, 95, 97, 197, 241</u>.

tag: 167, 172, 176, 177, 179, 185, 193, 196, 197, 201, 203, 205, 206, 210, 213, 216, 217, 218, 219, 221, 223, 226, 233, 234, 245, 259, 276, 353, 354, 378, 379.

tagmask: <u>167</u>, 192, 193.

tail: 64, 69, 71, 73, 74, 85, 120, 160, 301, 304, 308, 309, 316.

tdif: 49, 51, 344.

TDIF: 47.

tdif\_l: 344.

TDIFI: 47.

terminate: <u>125</u>, 126, 144, 215, 217, 221, 222, 224, 232, 237.

tetra: 17, 21, 68, 73, 76, 78, 91, 120, 206, 210, 213, 246, 255.

thinking big: 58, 74.

third\_operand: 103, <u>107</u>, 108.

This can't happen: 13.

ticks: 10, 14, 28, 64, <u>87</u>, 187, 251, 256, 257.

time: 89.

TLB: 163.

tmpo: 141.

Tomasulo, Robert Marco: 58.

trans: <u>241</u>.

trans\_key: <u>240</u>, 245, 267, 272, 291, 298, 302, 326, 353, 354.

translation caches: 163.

trap: 49, 51, 80, 81, 82, 85, 103, 149, 310, 312, 313, 317, 320.

TRAP: 47, 80, 82, 320.

trap\_loc: 373.

trip: <u>49</u>, 51, 80, 85, 312, 313, 317.

TRIP: <u>47</u>.

true: 11, 59, 68, 85, 89, 100, 106, 108, 110, 112, 113, 114, 117, 118, 119, 120, 121, 144, 170, 185, 217, 227, 236, 238, 239, 259, 262, 263, 265, 302, 304, 310, 312, 314, 316, 317, 322, 324, 330, 331, 332, 333, 334, 337, 338, 339, 340, 345, 350, 355, 361, 364, 373.

true\_head: <u>74</u>, 81.

trying\_to\_interrupt: 314, 315, 330, 351, 363, 364.

tt: 28.

u: <u>75</u>, <u>79</u>, <u>97</u>.

U\_BIT: <u>54</u>, 307.

uninit\_mem\_bit: 8, 210.

uninitialized memory...: 210.

unit\_busy: 82.

unit\_found: 82.

UNKNOWN\_SPEC: <u>71</u>, 73, 85, 120, 123, 290, 309.

uns: 21.

unsav: <u>49</u>, 327, 332.

UNSAVE: 47, 81, 102, 279, 305, 332, 335.

unsave: 49, 51, 327, 332.

unschedule: <u>32</u>, <u>33</u>, 145, 287.

unsqnd: 21.

up: <u>40</u>, 73, 85, 86, 89, 93, 95, 97, 100, 102, 114, 116, 117, 120, 146, 227, 254, 255, 312, 333, 334.

usage: 44, 46, 81, 100, 146, 324.

use\_and\_fix: 195, 196, 198, 201, 217, 262, 268, 269, 271, 273, 292, 293, 296, 353, 354.

v: 167.

V\_BIT: <u>54</u>, 140, 141, 282, 343.

val: 208, 212, 213, 379.

vanish: 126, 128, <u>129</u>, 260.

vanish\_ctl: 127, 128.

verbose: 4, 10, 28, 33, 46, 81, 125, 145, 146, 147, 149, 152, 160, 177, 210, 283, 310, 314, 319, 320, 321.

VERSION: 89.

victim: 167, 177, 181, 193, 196, 199, 205, 233, 234.

VIIIADDU: 47.

VIIIADDUI: 47.

virt: <u>241</u>.

vrepl: <u>167</u>, 196, 199, 205.

vv: 167, 177, 181, 193, 196, 199, 205, 233, 234.

W\_BIT: 54, 346.

wait: 125, 131, 133, 134, 215, 216, 217, 218, 219, 221, 222, 223, 224, 225, 233, 234, 237, 257, 259, 260, 261, 262, 263, 264, 266, 271, 272, 273, 276, 277, 278, 279, 281, 283, 288, 290, 297, 298, 301, 310, 326, 328, 329, 330, 342, 350, 351, 353, 354, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368.

wait\_or\_pass: 288, 292, 295, 296.

wbuf\_bot: 247, 251, 255, 256, 257, 378, 379.

wbuf\_lock: 39, 247, 256, 257, 259, 260, 262, 264, 360.

wbuf\_top: 247, 249, 251, 255, 256, 257, 378, 379.

wdif: <u>49</u>, 51, 344.

WDIF: 47.

WDIFI: 47.

wow: <u>11</u>.

WRITE\_ALLOC: 166, 167, 217, 257.

WRITE\_BACK: <u>166</u>, 167, 217, 263.

write\_co: <u>248</u>, 249.

write\_ctl: 248, 249, 360.

write\_from\_wbuf: 129, 249, 257, 272.

write\_head: 247, 249, 251, 255, 256, 257, 259, 260, 261, 262, 360, 362, 378, 379.

write\_node: 246, 247, 251, 255, 256, 378, 379.

write\_restart: 257, 261.

write\_search: <u>254</u>, <u>255</u>, 268, 270, 271, 278.

write\_tail: 247, 249, 251, 255, 256, 257, 360, 362, 378, 379.

wyde\_diff: 21, 344.

x: 21, 44, 56, 119, 120, 381, 384.

X\_BIT: <u>54</u>, 307.

X\_is\_dest\_bit: 101, 312, 320.

xor: 21, 49, 51, 138.

XOR: <u>47</u>.

XORI: 47.

XVIADDU: 47.

XVIADDUI: 47.

xx: 44, 46, 100, 102, 106, 110, 117, 118, 119, 120, 146, 227, 265, 275, 312, 320, 323, 325, 329, 332, 335, 336, 337, 340, 341, 364, 369, 370.

y: <u>21</u>, <u>44</u>.

yy: <u>44</u>, 46, 100, 103, 105, 118, 320, 333, 335, 337, 339, 341, 372, 380.

yz: 75, 84, 85, 109, 120.

z: 21, 44.

Z\_BIT: 54.

zap\_cache: 180, 181, 358, 359, 360.

zero\_octa: 20, 100, 112, 179, 237, 243, 244, 265, 279, 288, 312, 317, 330, 346, 356, 364, 380.

zero\_spec: 41, 85, 100, 109, 112, 113, 114.

zset: 49, 51, 345.

ZSEV: 47.

ZSEVI: <u>47</u>.

ZSN: <u>47</u>.

ZSNI: 47.

ZSNN: 47.

ZSNNI: 47.

ZSNP: 47.

ZSNPI: 47.

ZSNZ: 47.

ZSNZI: 47.

ZSOD: 47.

ZSODI: 47.

ZSP: <u>47</u>.

ZSPI: 47.

ZSZ: 47.

ZSZI: 47.

zz: <u>44</u>, 46, 100, 103, 104, 118, 146, 320, 322, 323, 328, 337, 338, 339, 341, 355, 356, <u>372</u>, 373.

```
⟨ Allocate a slot p in the S-cache 218 ⟩ Used in section 217.
⟨ Assign a functional unit if available, otherwise goto stall 82 ⟩ Used in section 75.
⟨ Begin an interruption and break 317 ⟩ Used in section 146.
⟨ Begin execution of a stage-two operation 351 ⟩ Used in section 135.
⟨ Begin execution of an operation 132 ⟩ Used in section 130.
⟨ Begin fetch with known physical address 296 ⟩ Used in section 288.
⟨ Begin fetch without I-cache lookup 295 ⟩ Used in section 291.
⟨ Cases 0 through 4, for the D-cache 233 ⟩ Used in section 232.
⟨ Cases 5 through 9, for the S-cache 234 ⟩ Used in section 232.
⟨ Cases for control of special coroutines 126, 215, 217, 222, 224, 232, 237, 257 ⟩ Used in section 125.
⟨ Cases for stage 1 execution 155, 313, 325, 327, 328, 329, 331, 356 ⟩ Used in section 132.
⟨ Cases to compute the results of register-to-register operation 137, 138, 139, 140, 141, 142, 143, 343, 344, 345, 346,
    348, 350 ⟩ Used in section 132.
⟨ Cases to compute the virtual address of a memory operation 265 ⟩ Used in section 132.
⟨ Check for a hit in pending writes 278 ⟩ Used in section 273.
⟨ Check for external interrupt 314 ⟩ Used in section 64.
⟨ Check for security violation, break if so 149 ⟩ Used in section 67.
⟨ Check for sufficient rename registers and memory slots, or goto stall 111 ⟩ Used in section 75.
⟨ Check for prest with a fully spanned cache block 275 ⟩ Used in section 274.
⟨ Clean the D-cache block for data→z.o, if any 366 ⟩ Used in section 364.
⟨ Clean the I-cache block for data→z.o, if any 365 ⟩ Used in section 364.
⟨ Clean the S-cache block for data→z.o, if any 367 ⟩ Used in section 364.
⟨ Clean the data caches 361 ⟩ Used in section 356.
⟨ Commit and/or deissue up to commit_max instructions 67 ⟩ Used in section 64.
⟨ Commit the hottest instruction, or break if it's not ready 146 ⟩ Used in section 67.
⟨ Commit to memory if possible, otherwise break 256 ⟩ Used in section 146.
⟨ Compute the new entry for c→inbuf and give the caller a sneak preview 245 ⟩ Used in section 237.
⟨ Continue this command on the next cache block 369 ⟩ Used in section 364.
⟨ Convert relative address to absolute address 84 ⟩ Used in section 75.
⟨ Copy data from p into c→inbuf 226 ⟩ Used in section 224.
⟨ Copy the data from block q to fetched 294 ⟩ Used in sections 292 and 296.
⟨ Copy Scache→inbuf to slot p 220 ⟩ Used in section 217.
⟨ Declare mmix_opcode and internal_opcode 47, 49 ⟩ Used in section 44.
⟨ Deissue all but the hottest command 316 ⟩ Used in section 314.
⟨ Deissue the coolest instruction 145 ⟩ Used in section 67.
⟨ Determine the flags, f, and the internal opcode, i 80 ⟩ Used in section 75.
⟨ Dispatch an instruction to the cool block if possible, otherwise goto stall 101 ⟩ Used in section 75.
⟨ Dispatch one cycle's worth of instructions 74 ⟩ Used in section 64.
⟨ Do a simultaneous lookup in the D-cache 268 ⟩ Used in section 267.
⟨ Do a simultaneous lookup in the I-cache 292 ⟩ Used in section 291.
⟨ Do load/store stage 1 without D-cache lookup 270 ⟩ Used in section 267.
⟨ Do load/store stage 2 without D-cache lookup 277 ⟩ Used in section 273.
⟨ Do load/store stage 1 with known physical address 271 ⟩ Used in section 266.
⟨ Do stage 1 of LDVTS 353 ⟩ Used in section 352.
⟨ Do the final SAVE 340 ⟩ Used in section 339.
⟨ Either halt or print warning 373 ⟩ Used in section 372.
⟨ Execute all coroutines scheduled for the current time 125 ⟩ Used in section 64.
⟨ External prototypes 9, 38, 161, 175, 178, 180, 209, 212, 252 ⟩ Used in sections 3 and 5.
⟨ External routines 10, 39, 162, 176, 179, 181, 210, 213, 253 ⟩ Used in section 3.
⟨ External variables 4, 29, 59, 60, 66, 69, 77, 86, 98, 115, 136, 150, 168, 207, 211, 214, 242, 247, 284, 349 ⟩ Used in
    sections 3 and 5.
⟨ Fill Scache→inbuf with clean memory data 219 ⟩ Used in section 217.
```

⟨ Finish a CSWAP 283 ⟩ Used in section 281.

⟨ Finish a store command 281 ⟩ Used in section 280.

⟨ Finish execution of an operation 144 ⟩ Used in section 130.

⟨ Forward the new data past the D-cache if it is write-through 263 ⟩ Used in section 257.

⟨ Generate an instruction to save g[yy] 339 ⟩ Used in section 337.

⟨ Generate an instruction to unsave g[yy] 333 ⟩ Used in section 332.

⟨ Get ready for the next step of PREGO 229 ⟩ Used in section 81.

⟨ Get ready for the next step of PRELD or PREST 228 ⟩ Used in section 81.

⟨ Get ready for the next step of SAVE 341 ⟩ Used in section 81.

⟨ Get ready for the next step of UNSAVE 335 ⟩ Used in section 81.

⟨ Global variables 20, 36, 41, 48, 50, 51, 53, 54, 65, 70, 78, 83, 88, 99, 107, 127, 148, 154, 194, 230, 235, 238, 248, 285, 303, 305, 315, 374, 376, 388 ⟩ Used in section 3.

⟨ Handle an internal SAVE when it's time to store 342 ⟩ Used in section 281.

⟨ Handle an internal UNSAVE when it's time to load 336 ⟩ Used in section 279.

⟨ Handle interrupt at end of execution stage 307 ⟩ Used in section 144.

⟨ Handle special cases for operations like prego and ldvts 289, 352 ⟩ Used in section 266.

⟨ Handle write-around when flushing to the S-cache 221 ⟩ Used in section 217.

⟨ Handle write-around when writing to the D-cache 259 ⟩ Used in section 257.

⟨ Header definitions 6, 7, 8, 52, 57, 87, 129, 166 ⟩ Used in sections 3 and 5.

⟨ Ignore the item in write\_head 264 ⟩ Used in section 257.

⟨ Initialize everything 22, 26, 61, 71, 79, 89, 116, 128, 153, 231, 236, 249, 286 ⟩ Used in section 10.

⟨ Insert an instruction to advance beta and L 112 ⟩ Used in section 110.

⟨ Insert an instruction to advance gamma 113 ⟩ Used in sections 110, 119, and 337.

⟨ Insert an instruction to decrease gamma 114 ⟩ Used in section 120.

⟨ Insert dummy instruction for page table emulation 302 ⟩ Used in section 298.

⟨ Insert special operands when resuming an interrupted operation 324 ⟩ Used in section 103.

⟨ Insert data→b.o into the proper field of data→x.o, checking for arithmetic exceptions if signed 282 ⟩ Used in section 281.

⟨ Install a new instruction into the tail position 304 ⟩ Used in section 301.

⟨ Install default fields in the *cool* block 100 ⟩ Used in section 75.

⟨ Install register X as the destination, or insert an internal command and goto dispatch\_done if X is marginal 110 ⟩ Used in section 101.

⟨ Install the operand fields of the *cool* block 103 ⟩ Used in section 101.

⟨ Internal prototypes 13, 18, 24, 27, 30, 32, 34, 42, 45, 55, 62, 72, 90, 92, 94, 96, 156, 158, 169, 171, 173, 182, 184, 186, 188, 190, 192, 195, 198, 200, 202, 204, 240, 250, 254, 377 ⟩ Used in section 3.

⟨ Issue *j* pseudo-instructions to compute a page table entry 244 ⟩ Used in section 243.

⟨ Issue the *cool* instruction 81 ⟩ Used in section 75.

⟨ Load and write eight bytes 386 ⟩ Used in section 384.

⟨ Load and write one byte 385 ⟩ Used in section 384.

⟨ Local variables 12, 124, 258 ⟩ Used in section 10.

⟨ Look at the head instruction, and try to dispatch it if *j* < *dispatch\_max* 75 ⟩ Used in section 74.

⟨ Look up the address in the DT-cache, and also in the D-cache if possible 267 ⟩ Used in section 266.

⟨ Look up the address in the IT-cache, and also in the I-cache if possible 291 ⟩ Used in section 288.

⟨ Magically do an I/O operation, if *cool*→*loc* is rT 372 ⟩ Used in section 322.

⟨ Make sure *cool\_L* and *cool\_G* are up to date 102 ⟩ Used in section 101.

⟨ Nullify the hottest instruction 147 ⟩ Used in section 146.

⟨ Other cases for the fetch coroutine 298, 301 ⟩ Used in section 288.

⟨ Pass *data* to the next stage of the pipeline 134 ⟩ Used in section 130.

⟨ Perform one cycle of the interrupt preparations 318 ⟩ Used in section 64.

⟨ Perform one machine cycle 64 ⟩ Used in section 10.

⟨ Predict a branch outcome 151 ⟩ Used in section 85.

⟨ Prepare for exceptional trip handler 308 ⟩ Used in section 307.

```
⟨Prepare memory arguments ma = M[a] and mb = M[b] if needed 380⟩ Used in section 372.
⟨Prepare to emulate the page translation 309⟩ Used in section 310.
⟨Print all of c's cache blocks 177⟩ Used in section 176.
⟨Read and store one byte; return if done 382⟩ Used in section 381.
⟨Read and store up to eight bytes; return if done 383⟩ Used in section 381.
⟨Read data into c->inbuf and wait for the bus 223⟩ Used in section 222.
⟨Read from memory into fetched 297⟩ Used in section 296.
⟨Record the result of branch prediction 152⟩ Used in section 75.
⟨Recover from incorrect branch prediction 160⟩ Used in section 155.
⟨Redirect the fetch if control changes at this inst 85⟩ Used in section 75.
⟨Restart the fetch coroutine 287⟩ Used in sections 85, 160, 308, 309, and 316.
⟨Resume an interrupted operation 323⟩ Used in section 322.
⟨Set resumption registers (rB, $255) or (rBB, $255) 319⟩ Used in section 318.
⟨Set resumption registers (rW, rX) or (rWW, rXX) 320⟩ Used in section 318.
⟨Set resumption registers (rY, rZ) or (rYY, rZZ) 321⟩ Used in section 318.
⟨Set things up so that the results become known when they should 133⟩ Used in section 132.
⟨Set up the first phase of saving 338⟩ Used in section 337.
⟨Set up the first phase of unsaving 334⟩ Used in section 332.
⟨Set cool->b and/or cool->ra from special register 108⟩ Used in section 103.
⟨Set cool->b from register X 106⟩ Used in section 103.
⟨Set cool->y from register Y 105⟩ Used in section 103.
⟨Set cool->z as an immediate wyde 109⟩ Used in section 103.
⟨Set cool->z from register Z 104⟩ Used in section 103.
⟨Simulate an action of the fetch coroutine 288⟩ Used in section 125.
⟨Simulate later stages of an execution pipeline 135⟩ Used in section 125.
⟨Simulate the first stage of an execution pipeline 130⟩ Used in section 125.
⟨Special cases for states in later stages 272, 273, 276, 279, 280, 299, 311, 354, 364, 370⟩ Used in section 135.
⟨Special cases for states in the first stage 266, 310, 326, 360, 363⟩ Used in section 130.
⟨Special cases of instruction dispatch 117, 118, 119, 120, 121, 122, 227, 312, 322, 332, 337, 347, 355⟩ Used in section 101.
⟨Start the S-cache filler 225⟩ Used in section 224.
⟨Start up auxiliary coroutines to compute the page table entry 243⟩ Used in section 237.
⟨Subroutines 14, 19, 21, 25, 28, 31, 33, 35, 43, 46, 56, 63, 73, 91, 93, 95, 97, 157, 159, 170, 172, 174, 183, 185, 187, 189, 191, 193, 196, 199, 201, 203, 205, 208, 241, 251, 255, 378, 379, 381, 384, 387⟩ Used in section 3.
⟨Swap cache blocks p and q 197⟩ Used in sections 196 and 205.
⟨Try to get the contents of location data->z.o in the D-cache 274⟩ Used in section 273.
⟨Try to get the contents of location data->z.o in the I-cache 300⟩ Used in section 298.
⟨Try to put the contents of location write_head->addr into the D-cache 261⟩ Used in section 257.
⟨Type definitions 11, 17, 23, 37, 40, 44, 68, 76, 164, 167, 206, 246, 371⟩ Used in sections 3 and 5.
⟨Undo data structures set prematurely in the cool block and break 123⟩ Used in section 75.
⟨Update DT-cache usage and check the protection bits 269⟩ Used in sections 268, 270, and 272.
⟨Update IT-cache usage and check the protection bits 293⟩ Used in sections 292 and 295.
⟨Update rG 330⟩ Used in section 329.
⟨Update the page variables 239⟩ Used in section 329.
⟨Use cleanup on the cache blocks for data->z.o, if any 368⟩ Used in section 364.
⟨Wait for input data if necessary; set state = 1 if it's there 131⟩ Used in section 130.
⟨Wait if there's an unfinished load ahead of us 357⟩ Used in section 356.
⟨Wait till write buffer is empty 362⟩ Used in sections 361 and 364.
⟨Wait, if necessary, until the instruction pointer is known 290⟩ Used in section 288.
⟨Write directly from write_head to memory 260⟩ Used in section 257.
⟨Write the data into the D-cache and set state = 4, if there's a cache hit 262⟩ Used in section 257.
⟨Write the dirty data of c->outbuf and wait for the bus 216⟩ Used in section 215.
```

```
⟨Zap the instruction and data caches 359⟩ Used in section 356.
⟨Zap the translation caches 358⟩ Used in section 356.
⟨mmix-pipe.h 5⟩
```

## **MMIX-PIPE**

|                               | Section | Page |
|-------------------------------|---------|------|
| Introduction                  | 1       | 1    |
| Low-level routines            | 16      | 7    |
| Coroutines                    |         |      |
| Lists                         | 47      | 16   |
| Dynamic speculation           |         |      |
| The dispatch stage            | 68      | 28   |
| The execution stages          |         |      |
| The commission/deissue stage  | 145     | 52   |
| Branch prediction             |         |      |
| Cache memory                  | 163     | 59   |
| Simulated memory              | 206     | 72   |
| Cache transfers               | 217     | 76   |
| Virtual address translation   | 235     | 84   |
| The write buffer              | 246     | 88   |
| Loading and storing           | 265     | 95   |
| The fetch stage               | 284     | 106  |
| Interrupts                    | 306     | 113  |
| Administrative operations     | 327     | 121  |
| More register-to-register ops | 343     | 126  |
| System operations             | 352     | 130  |
| Input and output              | 371     | 137  |
| Index                         | 389     | 143  |

## © 1999 Donald E. Knuth

This file may be freely copied and distributed, provided that no changes whatsoever are made. All users are asked to help keep the MMIXware files consistent and "uncorrupted," identical everywhere in the world. Changes are permissible only if the modified file is given a new name, different from the names of existing files in the MMIXware package, and only if the modified file is clearly identified as not being part of that package. (The CWEB system has a "change file" facility by which users can easily make minor alterations without modifying the master source files in any way. Everybody is supposed to use change files instead of changing the files.) The author has tried his best to produce correct and useful programs, in order to help promote computer science research, but no warranty of any kind should be assumed.