## The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE 431 01 Final Exam December 13, 2005

| Name:         |
|---------------|
| Posting Code: |
| -             |

1. (10 points) Here is a series of address references given as word addresses: 2, 3, 11, 16, 21, 13, 64, 48, 19, 11, 3, 22, 4, 27, 6, and 11. Assuming a direct mapped cache with two-word blocks and a total size of 8 words that is initially empty, (a) label each reference in the list as a hit or a miss and (b) show all entries made to the cache, including the final contents of the cache.

2. (10 points) Suppose we want to use a laptop to send 100 files of 25 MB each to another computer over a 5 Mbit/sec wireless connection. The laptop battery currently holds 60,000 joules of energy. The wireless networking card alone consumes 5 watts while transmitting, while the rest of the laptop always consumes 20 watts. Before each file transfer we need 15 seconds to choose which file to send. How many complete files can we transfer before the laptop's battery runs down to zero?

| 3. | (10 points) A secret agency wants to simultaneously monitor cellular phone conversations by  |
|----|----------------------------------------------------------------------------------------------|
|    | sampling 2 bytes of data 4000 times each second for each conversation and bundling these     |
|    | samples into 1KB messages for transmission. The overhead latency is 150 µs per 1 KB message. |
|    | Calculate the bandwidth required of the network to support monitoring 100 conversations.     |

4. (10 points) To capture the fact that the time to access data for both hits and misses affects performance, designers often use average memory access time (AMAT) as a way to examine alternative cache designs. Average memory access time is the average time to access memory considering both hits and misses and the frequency of different accesses; it is equal to the following:

 $AMAT = Time for a hit + Miss rate \times Miss penalty$ 

(a) Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per reference, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls. (b) Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This change causes the cache access time to increase to 1.2 clock cycles. Using the AMAT as a metric, determine whether this is a good trade-off.

5. (15 points) Consider the following idea: Let's modify the instruction set architecture and remove the ability to specify an offset for memory access instructions. Specifically, all load-store instructions with nonzero offsets would become pseudoinstructions and would be implemented using two instructions. For example:

(a) (5 points) What changes would you want to make to the single-cycle datapath and control if this simplified architecture were to be used?



| Instruction | RegDst | ALUSrc | Memto | Reg   | Mem  | Mem   | Branch | ALUOp1 | ALUOp0 |  |
|-------------|--------|--------|-------|-------|------|-------|--------|--------|--------|--|
|             |        |        | Reg   | Write | Read | Write |        |        |        |  |
| R-format    | 1      | 0      | 0     | 1     | 0    | 0     | 0      | 1      | 0      |  |
| lw          | 0      | 1      | 1     | 1     | 1    | 0     | 0      | 0      | 0      |  |
| SW          | d      | 1      | d     | 0     | 0    | 1     | 0      | 0      | 0      |  |
| beq         | d      | 0      | d     | 0     | 0    | 0     | 1      | 0      | 1      |  |
|             |        |        |       |       |      |       |        |        |        |  |

## d-don't care

(b) (10 points) If the delay for the instruction memory is 300 ps, the data memory is 200 ps, the register file delay (read or write) is 100 ps, the control (not ALU control) is 300 ps, and the ALU delay is 400 ps, what are the clock cycle times required for the original datapath and for the modified datapath? What is the highest percentage of load-store instructions with offsets that could be tolerated without total performance being degraded?

6. (10 points) Consider a pipeline for a register-memory architecture. The architecture has two instruction formats: a register-register format and a register-memory format. There is a single memory addressing mode (offset + base register). There is a set of ALU operations as follows:

```
ALUop← Rdest,Rsrc1,Rsrc2,offset
Rdest ← Rsrc1 ALUop Rsrc2

or Rdest ← Rsrc1 ALUop MEM[Rsrc2 + offset]

or Rdest ← MEM[Rsrc2 + offset]

or MEM[Rsrc2 + offset] ← Rsrc1

where the ALUop is one of the following:
Add, Subtract, And, Or(with or without offset)
Load(Rsrc1 omitted)
Store(Rdest omitted)

Rdest, Rsrc1 and Rsrc2 are registers.
```

Branches use a full compare of two registers and are PC-relative. Assume that this machine is pipelined so that a new instruction is started every clock cycle. The pipeline structure is

```
MEM
       ALU1
                     ALU2
RF
IF
              ALU1
                            ALU2
       RF
                     MEM
                                   ALU2
       IF
              RF
                     ALU1
                            MEM
                                          WB
                                          ALU2
              IF
                     RF
                            ALU1
                                   MEM
                                                 WB
                                                 ALU2
                                                        WB
                     IF
                            RF
                                   ALU1
                                          MEM
                                                        ALU2
                            ΙF
                                   RF
                                          ALU1
                                                 MEM
```

The first ALU stage is used for effective address calculation for memory references and branches. The second ALU stage is used for operations and branch comparisons. RF is both a decode and register-fetch stage. Assume that when a register read and a register write of the same register occur in the same clock cycle, the write data is forwarded. For the following code fragment:

```
add $1, $1, 0($15)
add $1, $1, 4($1)
add $2,$2, 0($15)
add $2, $2, 4($2)
store $1, 8($2)
```

identify the data dependencies and draw a figure (using the multiple clock style) showing them as lines.

7. (5 points) Consider executing the following code on a pipelined datapath like the one shown, in which there is no forwarding, and a branch completes in the MEM stage.



```
sort:
         addi$sp, $sp, -20
                                                     lw
                                                            $t4, 4($t2)
                                                     slt
                                                            $t0, $t4, $t3
         sw $ra, 16($sp)
         SW
             $s3, 12($sp)
                                                     beq
                                                            $t0, $zero, exit2
             $s2, 8($sp)
                                                     add
                                                            $a0, $s2, $zero
         SW
         sw
             $s1, 4($sp)
                                                     add
                                                            $a1, $s1, $zero
             $s0, 0($sp)
         SW
                                                     jal
                                                            swap
         add $s2, $a0, $zero
                                                     addi
                                                            $s1, $s1, -1
         add $s3, $a1, $zero
                                                            for2tst
         add $s0, $zero, $zero
                                              exit2: addi
                                                            $s0, $s0, 1
for1tst: slt $t0, $s0, $s3
                                                            for1tst
                                                     i
         beq $t0, $zero, exit1
                                              exit1: lw
                                                            $s0, 0($sp)
         addi$s1, $s0, -1
                                                     lw
                                                            $s1, 4($sp)
for2tst: slt $t0, $s1, $zero
                                                            $s2, 8($sp)
                                                     lw
                                                            $s3, 12($sp)
         bne $t0, $zero, exit2
                                                     lw
         add $t1, $s1, $s1
                                                            $ra, 16($sp)
                                                     lw
         add $t1, $t1, $t1
                                                     addi
                                                            $sp, $sp, 20
         add $t2, $s2, $t1
                                                     jr
                                                            $ra
            $t3, 0($t2)
```

If the addi sp, p-20 instruction at the sort label begins executing in cycle 1, identify the instructions in each stage of the pipeline in cycle 7

| Cycle | IF | ID | EX | MEM | WB |
|-------|----|----|----|-----|----|
| 1     |    |    |    |     |    |
| 2     |    |    |    |     |    |
| 3     |    |    |    |     |    |
| 4     |    |    |    |     |    |
| 5     |    |    |    |     |    |
| 6     |    |    |    |     |    |
| 7     |    |    |    |     |    |

| 8. | (5 points) A binary word with even parity and no errors will have an even number of 1s in it.     |
|----|---------------------------------------------------------------------------------------------------|
|    | Compute the parity bit for each of the following 8-bit words so that the resulting 9-bit word has |
|    | even parity.                                                                                      |

- a. 01100111
- b. 01010101

| 9.  | . (1 point) The alternative model to shared memory for communicating uses                    |                                                                  |  |  |  |
|-----|----------------------------------------------------------------------------------------------|------------------------------------------------------------------|--|--|--|
|     |                                                                                              | _ between processors.                                            |  |  |  |
| 10. | (1 point) The coordination needed between processors operating in parallel on shared data is |                                                                  |  |  |  |
|     | called                                                                                       | ·                                                                |  |  |  |
| 11. | (1 point)                                                                                    | is the hardest problem.                                          |  |  |  |
| 12. | (1 point) A                                                                                  | is a substance that does not conduct electricity well but is the |  |  |  |
|     | foundation of integrated                                                                     | circuits.                                                        |  |  |  |
| 13. | (1 point) A                                                                                  | is an on/off switch controlled by electricity.                   |  |  |  |
| 14. | (10 points) Represent 0.000375 as a single precision floating point number.                  |                                                                  |  |  |  |

15. (10 points) Show the single MIPS instruction or minimal sequence of instructions for this C statement:

$$c = x[10] + x[11];$$

Assume that c corresponds to register \$ $\pm$ 0 and that array x (an array of 64-bit integers) has a base address of 4,000,000<sub>10</sub> which is stored in register \$ $\pm$ 1.