## The University of Alabama in Huntsville Electrical & Computer Engineering Department CPE 431 01 Test 2 November 15, 2007

Name: _______________

1. A _______________ is an occurrence in which a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.

2. (3 points) The three Cs of cache misses are _______________, _______________, and _______________.

3. (1 point) A _______________ occurs when an accessed page is not present in main memory.

4. (a) Identify all of the data dependencies in the following code. (b) How is each data dependency handled or not handled by forwarding? **Draw a multiple clock cycle style diagram to support your answer.**

       add $5, $5, $4
       lw  $5, 28($5)
       add $3, $4, $9
       sw  $3, 100($9)
       add $7, $3, $4

5. (5 points) Given the pipelined processor of Chapter 6 with forwarding, determine which instruction is being executed in each stage of the pipeline in cycle 12 of the following instruction sequence if it begins executing in cycle 1:

```
 1. addi $23, $29, 12
 2. sw   $2, 0($29)
 3. sw   $15, 4($29)
 4. sw   $16, 8($29)
 5. muli $8, $5, 4
 6. add  $7, $4, $2
 7. lw   $15, 0($2)
 8. lw   $16, 4($2)
 9. sw   $17, 0($2)
10. sw   $18, 4($2)
11. lw   $2, 0($29)
12. lw   $15, 4($29)
13. lw   $16, 8($29)
14. addi $29, $29, 12
```

IF _____ ID _____ EX _____ MEM _____ WB _____
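One way to check an answer to Question 5 is to track when each instruction occupies each stage. The sketch below is only an illustration under stated assumptions: a five-stage pipeline with full forwarding, where the only stall is a one-cycle load-use stall between back-to-back instructions (a store's data register is treated as a source too). The opcodes of instructions 4 and 8 are not legible in the listing and are assumed here to be `sw` and `lw`; the names `PROGRAM` and `stage_contents` are made up for this example.

```python
# Each entry: (text, destination register or None, source registers, is_load).
# Opcodes of instructions 4 and 8 are ASSUMED (sw and lw); neither choice
# creates a load-use dependence, so the result does not depend on them.
PROGRAM = [
    ("addi $23, $29, 12", 23, [29], False),
    ("sw $2, 0($29)", None, [2, 29], False),
    ("sw $15, 4($29)", None, [15, 29], False),
    ("sw $16, 8($29)", None, [16, 29], False),   # opcode assumed
    ("muli $8, $5, 4", 8, [5], False),
    ("add $7, $4, $2", 7, [4, 2], False),
    ("lw $15, 0($2)", 15, [2], True),
    ("lw $16, 4($2)", 16, [2], True),            # opcode assumed
    ("sw $17, 0($2)", None, [17, 2], False),
    ("sw $18, 4($2)", None, [18, 2], False),
    ("lw $2, 0($29)", 2, [29], True),
    ("lw $15, 4($29)", 15, [29], True),
    ("lw $16, 8($29)", 16, [29], True),
    ("addi $29, $29, 12", 29, [29], False),
]


def stage_contents(program, cycle):
    """Return {stage: 1-based instruction number or None} for one cycle."""
    ex = []  # cycle in which each instruction occupies EX
    for i, (_, _, srcs, _) in enumerate(program):
        if i == 0:
            ex.append(3)  # IF in cycle 1, ID in cycle 2, EX in cycle 3
            continue
        _, prev_dest, _, prev_is_load = program[i - 1]
        # Load-use hazard: the consumer waits one extra cycle in ID.
        stall = 1 if prev_is_load and prev_dest in srcs else 0
        ex.append(ex[i - 1] + 1 + stall)

    id_start = [2] + ex[:-1]        # enters ID when its predecessor enters EX
    if_start = [1] + id_start[:-1]  # enters IF when its predecessor enters ID

    stages = {"IF": None, "ID": None, "EX": None, "MEM": None, "WB": None}
    for n in range(len(program)):
        num = n + 1
        if if_start[n] <= cycle < id_start[n]:
            stages["IF"] = num
        elif id_start[n] <= cycle < ex[n]:
            stages["ID"] = num
        elif cycle == ex[n]:
            stages["EX"] = num
        elif cycle == ex[n] + 1:
            stages["MEM"] = num
        elif cycle == ex[n] + 2:
            stages["WB"] = num
    return stages


if __name__ == "__main__":
    for stage, num in stage_contents(PROGRAM, 12).items():
        print(stage, num, PROGRAM[num - 1][0] if num else "bubble")
```

Under these assumptions no instruction in the sequence uses a value loaded by the instruction immediately before it, so the model reports no stalls.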

6. (15 points) You have been given 50 128K x 16-bit SRAMs to build an instruction cache for a processor with a 32-bit address. You do not have a byte offset. You do need 4 bits of storage per block for valid, dirty and other status bits. What is the largest size (i.e., the largest size of the data storage area in bytes) direct-mapped instruction cache that you can build with four-word blocks? Show the breakdown of the address into its cache access components and describe how the various SRAM chips are used.
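The chip budget in Question 6 can be explored with a short search over the number of index bits. The sketch below is one possible accounting, not the only valid layout. It assumes 32-bit words, a power-of-two number of blocks, one whole block per SRAM row (so the data array is eight chips wide), and tag plus status bits stored together in a separate array of 16-bit-wide chips. The function names are hypothetical.

```python
import math

CHIPS_AVAILABLE = 50
CHIP_DEPTH = 128 * 1024   # entries per SRAM chip
CHIP_WIDTH = 16           # bits per entry
ADDRESS_BITS = 32         # word address, since there is no byte offset
WORDS_PER_BLOCK = 4
WORD_BITS = 32            # assumption: 32-bit words
STATUS_BITS = 4


def chips_needed(index_bits):
    """Return (data_chips, tag_chips, tag_bits) for 2**index_bits blocks."""
    blocks = 2 ** index_bits
    offset_bits = int(math.log2(WORDS_PER_BLOCK))
    tag_bits = ADDRESS_BITS - index_bits - offset_bits
    banks = math.ceil(blocks / CHIP_DEPTH)  # chips stacked for depth
    data_chips = banks * math.ceil(WORDS_PER_BLOCK * WORD_BITS / CHIP_WIDTH)
    tag_chips = banks * math.ceil((tag_bits + STATUS_BITS) / CHIP_WIDTH)
    return data_chips, tag_chips, tag_bits


def largest_cache():
    """Largest index width whose chip count fits the budget."""
    best = None
    for index_bits in range(1, ADDRESS_BITS - 1):
        data_chips, tag_chips, tag_bits = chips_needed(index_bits)
        if data_chips + tag_chips <= CHIPS_AVAILABLE:
            best = (index_bits, tag_bits, data_chips, tag_chips)
    return best


if __name__ == "__main__":
    index_bits, tag_bits, data_chips, tag_chips = largest_cache()
    data_bytes = 2 ** index_bits * WORDS_PER_BLOCK * WORD_BITS // 8
    print(f"tag={tag_bits} index={index_bits} block offset=2")
    print(f"data chips={data_chips} tag/status chips={tag_chips}")
    print(f"data storage={data_bytes} bytes")
```

With this accounting, doubling the number of blocks beyond the reported size would double the data chips past the 50 available, which is why the search stops where it does.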

7. (15 points) Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 9, 56, 11, 4, 43, 5, 6. Assuming a two-way set associative cache with two-word blocks and a total size of 16 words that is initially empty and uses LRU replacement, (a) label each reference in the list as a hit or a miss and (b) show the final contents of the cache.
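The reference stream in Question 7 can be traced mechanically. The following is a minimal sketch of a set-associative cache with LRU replacement; it assumes the usual mapping (block address = word address divided by block size, set = block address modulo number of sets), and the function name `simulate` is just for illustration.

```python
def simulate(refs, total_words=16, block_words=2, ways=2):
    """Trace word-address references through a set-associative LRU cache.

    Returns (outcome, sets): outcome is a list of (address, "hit"/"miss"),
    and sets holds the block addresses in each set, ordered LRU to MRU.
    """
    num_sets = total_words // block_words // ways
    sets = [[] for _ in range(num_sets)]
    outcome = []
    for addr in refs:
        block = addr // block_words
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)          # will be re-appended as MRU
            outcome.append((addr, "hit"))
        else:
            if len(lines) == ways:
                lines.pop(0)             # evict the least recently used block
            outcome.append((addr, "miss"))
        lines.append(block)
    return outcome, sets


if __name__ == "__main__":
    refs = [1, 4, 8, 5, 20, 17, 19, 9, 56, 11, 4, 43, 5, 6]
    outcome, sets = simulate(refs)
    for addr, result in outcome:
        print(addr, result)
    for i, lines in enumerate(sets):
        words = [f"{b * 2}-{b * 2 + 1}" for b in lines]
        print(f"set {i}: words {words}")
```

Each block address `b` printed at the end covers word addresses `2b` and `2b + 1`.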

8. (10 points) Consider the case of a four-deep pipeline where the branch is resolved at the end of the second stage for unconditional branches and at the end of the third cycle for conditional branches. The program run on this pipeline has the following branch frequencies (as percentages of all instructions):

- Conditional branches: 20% (60% of these are taken)
- Jumps and calls: 5%

Assuming that the CPI of the program, neglecting branch hazards, is 1.0, how much slower does the program actually run when branch hazards are considered?
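The arithmetic for Question 8 depends on how many stall cycles each kind of branch costs, which the question leaves to the reader. The sketch below keeps those counts as parameters and evaluates two possible readings, both of which are assumptions: (A) every conditional branch stalls two cycles and every jump one cycle, and (B) untaken conditional branches cost only one cycle. The function name is hypothetical.

```python
def cpi_with_branch_hazards(base_cpi, cond_freq, jump_freq, taken_frac,
                            jump_stall, taken_stall, not_taken_stall):
    """Effective CPI = base CPI + average branch stall cycles per instruction."""
    cond_stall = (taken_frac * taken_stall
                  + (1 - taken_frac) * not_taken_stall)
    return base_cpi + jump_freq * jump_stall + cond_freq * cond_stall


if __name__ == "__main__":
    common = dict(base_cpi=1.0, cond_freq=0.20, jump_freq=0.05, taken_frac=0.60)
    # Reading A (assumed): the pipeline always waits for the branch outcome.
    cpi_a = cpi_with_branch_hazards(**common, jump_stall=1,
                                    taken_stall=2, not_taken_stall=2)
    # Reading B (assumed): an untaken conditional branch costs one cycle.
    cpi_b = cpi_with_branch_hazards(**common, jump_stall=1,
                                    taken_stall=2, not_taken_stall=1)
    print(f"reading A: CPI = {cpi_a:.2f}, slowdown = {cpi_a / 1.0:.2f}x")
    print(f"reading B: CPI = {cpi_b:.2f}, slowdown = {cpi_b / 1.0:.2f}x")
```

Because the base CPI is 1.0, the slowdown factor equals the effective CPI in either reading.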

9. (20 points) Unroll the following code so that three iterations of the loop are done at once. Assume the loop index is a multiple of 3 (i.e., \$10 is a multiple of twelve):

```
Loop: lw   $2,0($10)
      sub  $4,$2,$3
      sw   $4,0($10)
      addi $10,$10,4
      bne  $10,$30,Loop
```

Schedule this code for fast execution on the standard MIPS pipeline with forwarding (assume that it supports the addi instruction). Assume initially \$10 is 0 and \$30 is 480 and that branches are resolved in the ID stage. How does the unrolled, scheduled code compare against the original code in terms of total execution time?
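For Question 9, a rough cycle count can be obtained by charging one cycle per instruction plus stalls. The sketch below is a simplified model under several assumptions: only back-to-back dependences are checked, a load followed by its consumer costs one stall, a branch resolved in ID costs one stall after an ALU producer (two after a load), and each taken branch costs one extra cycle. The unrolled schedule shown in `UNROLLED` is one possible answer, and its temporaries `$5` to `$8` are assumed to be free registers.

```python
# Each entry: (opcode, destination register or None, source registers).
ORIGINAL = [
    ("lw", 2, [10]),           # lw   $2, 0($10)
    ("sub", 4, [2, 3]),        # sub  $4, $2, $3
    ("sw", None, [4, 10]),     # sw   $4, 0($10)
    ("addi", 10, [10]),        # addi $10, $10, 4
    ("bne", None, [10, 30]),   # bne  $10, $30, Loop
]

# One possible unrolled and scheduled body (registers $5-$8 are assumptions).
UNROLLED = [
    ("lw", 2, [10]),           # lw   $2, 0($10)
    ("lw", 5, [10]),           # lw   $5, 4($10)
    ("lw", 6, [10]),           # lw   $6, 8($10)
    ("sub", 4, [2, 3]),        # sub  $4, $2, $3
    ("sub", 7, [5, 3]),        # sub  $7, $5, $3
    ("sub", 8, [6, 3]),        # sub  $8, $6, $3
    ("addi", 10, [10]),        # addi $10, $10, 12
    ("sw", None, [4, 10]),     # sw   $4, -12($10)
    ("sw", None, [7, 10]),     # sw   $7, -8($10)
    ("sw", None, [8, 10]),     # sw   $8, -4($10)
    ("bne", None, [10, 30]),   # bne  $10, $30, Loop
]


def cycles_per_pass(body, taken_branch_penalty=1):
    """Approximate cycles for one pass through the loop body."""
    cycles = 0
    for i, (op, _, srcs) in enumerate(body):
        cycles += 1  # one issue slot per instruction
        if i == 0:
            continue
        prev_op, prev_dest, _ = body[i - 1]
        if prev_dest is None or prev_dest not in srcs:
            continue
        if op == "bne":
            # Branch compares in ID, so it needs its operands earlier.
            cycles += 2 if prev_op == "lw" else 1
        elif prev_op == "lw":
            cycles += 1  # load-use stall
    return cycles + taken_branch_penalty


if __name__ == "__main__":
    elements = 480 // 4                      # $10 runs from 0 to 480 in steps of 4
    original_total = cycles_per_pass(ORIGINAL) * elements
    unrolled_total = cycles_per_pass(UNROLLED) * (elements // 3)
    print(original_total, unrolled_total, original_total / unrolled_total)
```

The totals ignore pipeline fill and the untaken final branch, so they are estimates rather than exact cycle counts.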

10. (15 points) Consider three processors with different cache configurations:

- Cache 1: Direct-mapped with one-word blocks
- Cache 2: Direct-mapped with four-word blocks
- Cache 3: Two-way set associative with four-word blocks

The following miss rate measurements have been made:

- Cache 1: Instruction miss rate is 4%; data miss rate is 6%.
- Cache 2: Instruction miss rate is 2%; data miss rate is 4%.
- Cache 3: Instruction miss rate is 2%; data miss rate is 3%.

For these processors, one-half of the instructions contain a data reference. Assume that the cache miss penalty is 6 + Block size in words. The cycle times for the processors are 420 ps for the first and second processors and 310 ps for the third processor. Determine which processor is the fastest and which is the slowest.
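The comparison in Question 10 can be tabulated as stall cycles per instruction and time per instruction. The sketch below assumes every instruction makes one instruction fetch, that the miss penalty is measured in cycles, and that all three processors share the same CPI before memory stalls. That base CPI is not given in the question, so the value 1.0 used here is purely an assumption. With these numbers the ordering comes out the same for any shared non-negative base CPI, because Cache 3 has both the fewest stall cycles and the shortest cycle time, and Cache 1 has more stall cycles than Cache 2 at the same cycle time.

```python
CONFIGS = {
    "Cache 1": {"block_words": 1, "i_miss": 0.04, "d_miss": 0.06, "cycle_ps": 420},
    "Cache 2": {"block_words": 4, "i_miss": 0.02, "d_miss": 0.04, "cycle_ps": 420},
    "Cache 3": {"block_words": 4, "i_miss": 0.02, "d_miss": 0.03, "cycle_ps": 310},
}
DATA_REFS_PER_INSTR = 0.5  # one-half of the instructions reference data


def stall_cycles_per_instruction(cfg):
    """Average memory stall cycles per instruction."""
    penalty = 6 + cfg["block_words"]  # miss penalty, assumed to be in cycles
    return (cfg["i_miss"] + DATA_REFS_PER_INSTR * cfg["d_miss"]) * penalty


def time_per_instruction_ps(cfg, base_cpi):
    """Average time per instruction in picoseconds."""
    return (base_cpi + stall_cycles_per_instruction(cfg)) * cfg["cycle_ps"]


def rank(base_cpi=1.0):
    """Processor names ordered fastest to slowest (base_cpi is an assumption)."""
    times = {name: time_per_instruction_ps(cfg, base_cpi)
             for name, cfg in CONFIGS.items()}
    return sorted(times, key=times.get)


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name,
              f"stalls/instr = {stall_cycles_per_instruction(cfg):.2f}",
              f"time/instr = {time_per_instruction_ps(cfg, 1.0):.1f} ps")
    print("fastest to slowest:", rank())
```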