## **Exercises: Cache Coherence + Interconnect**

The five exercises are aimed at verifying the understanding of cache-coherence protocols and interconnect networks. The background material was taught in the lecture notes: L6-MemIntro.pdf and L7-Interconnect.pdf

| Request Type                                  | Time to carry out protocol action | Traffic     |
|-----------------------------------------------|-----------------------------------|-------------|
| Read hit                                      | 1 cycle                           | N/A         |
| Write hit                                     | 1 cycle                           | N/A         |
| Read request serviced by next level           | 40 cycles                         | 6 bytes + B |
| Read-exclusive request serviced by next level | 40 cycles                         | 6 bytes + B |
| Bus Upgrade/update request                    | 10 cycles                         | 10 bytes    |

**Task 1. (3pts)** Assume that a shared-memory multiprocessor with a number of processor/private cache units connected by a shared single(atomic)-transaction bus. Our baseline cache-coherence protocol is an MSI protocol, but we want to investigate what performance gains can be achieved by using an exclusive state to make it an MESI protocol. The time to carry out various protocol actions is listed in the above table: for example, while a read and write hit takes only 1 cycle, a read request takes 40 cycles as it has to bring the block from the next level of the cache hierarchy. A bus upgrade request takes less time as it does not involve a memory-block transfer but rather it only invalidates other shared copies. We assume that a write-hit in the cache requires a Bus Upgrade (rather than read-exclusive) transaction because the memory-block copy is already valid in the cache (and the memory-block does not need to be transferred).

We write Ri/B and Wi/B to denote a read and write operation, respectively, by processor/cache unit i to block B. Assuming block X is **not** in any of the private caches of processors 1 to 4, we want to determine the time it takes to execute the following sequence of accesses:

R1/X,W1/X,W1/X, R2/X, W2/X, W2/X, R3/X, W3/X, W3/X, R4/X, W4/X, W4/X

**Task 1.a)** Assume that a transaction from state E to state M brings no access cost. How many cycles does it take to execute the access sequence under MSI vs. MESI?

**Task 1.b)** Compare the traffic generated by the MSI and MESI protocols counted in bytes transferred using the table above, and assuming that the size of a memory block B is 32 bytes.

| Time | Processor 1 | Processor 2 | Processor 3 |
|------|-------------|-------------|-------------|
| 1    | Read A      |             |             |
| 2    |             | Read B      |             |
| 3    |             |             | Read C      |
| 4    | Write A     |             |             |
| 5    |             |             | Read D      |
| 6    |             | Read B      |             |
| 7    | Write B     |             |             |
| 8    |             |             | Read C      |
| 9    |             | Read B      |             |

**Task 2. (1pt)** Consider a shared-memory multiprocessor that consists of three processor/cache units and where cache coherence is maintained by an MSI protocol. The private caches are direct-mapped. The table above shows the access sequence taken by the three processors to four variables (A,B,C, and D), where A, B, and C belong to the same block and D belongs to a different block. The two blocks map to the same entry in the caches, and the cache is full initially and A, B, C, and D have not been accessed yet in the program.

**Task 2.a)** Classify the misses with respect to cold, replacement, true sharing, and false sharing misses.

**Task 2.b)** Which of the misses could be ignored and still guarantee that the execution is correct?

| Request Type | Time to carry out protocol action | Traffic     |
|--------------|-----------------------------------|-------------|
| Read hit     | 1 cycle                           | N/A         |
| Write hit    | 1 cycle                           | N/A         |
| BusRd        | 20 cycles                         | 6 bytes     |
| RemRd        | 20 cycles                         | 6 bytes     |
| RdAck        | 40 cycles                         | 6 bytes     |
| Flush        | 100 cycles                        | 6 bytes + B |
| InvRq        | 20 cycles                         | 6 bytes     |
| InvAck       | 20 cycles                         | 6 bytes     |
| UpgrAck      | 20 cycles                         | 6 bytes     |

- **Task 3 (3pts)** We consider a scalable implementation of a shared-memory multiprocessor using a set of nodes that each contain a processor, a private cache, and a portion of the memory. Cache coherence is maintained using a directory cache protocol, where the directory uses a presence-flag vector associated with each memory block to keep track of which nodes have copies of that block. The time it takes to process a *directory request* at the home and a remote node is in both cases 50 cycles. (Also the time to lookup or install into the cache at remote/host/local node is also 50 cycles). The latency and traffic of all consistency-induced requests and responses are detailed in the table above, where the block size B is 32 bytes. *Answer the following questions for each of the two MSI coherence protocols, i.e., the classical one that works in four hops and the optimized one (Standard DASH) that works in only three hops:*
- **Task 3.a)** Determine the number of cycles needed to handle a read-cache miss when the home node is the same as the requesting (local) node and the memory copy is clean. Also determine the amount of traffic (in bytes) caused by the coherence transaction (across nodes).
- **Task 3.b)** Determine the number of cycles needed to handle a read-cache miss when the home node is the same as the requesting (local) node and the memory copy is dirty. Also determine the amount of traffic (in bytes) caused by the coherence transaction.
- **Task 3.c)** Determine the number of cycles needed to handle a read-cache miss when the home node is different from the requesting (local) node and the memory copy is clean. Also determine the amount of traffic (in bytes) caused by the coherence transaction.
- **Task 3.d)** Determine the number of cycles needed to handle a read-cache miss when the home node is **not** the same as the requesting (local) node, the memory copy is dirty, and the remote node is the same as the home node. Also determine the amount of traffic (in bytes) caused by the coherence transaction.
- **Task 3.e)** Determine the number of cycles needed to handle a read-cache miss when the home node is different from the requesting (local) node, the memory copy is dirty, and the remote node is different from the home node (and of course different from the requesting/local node). Also determine the amount of traffic (in bytes) caused by the coherence transaction.

**Task 4 (1.5pts)** We Consider a 16-by-16 torus (tori) interconnection network, in which *each link has a bandwidth of 100 Mbits/s*. Determine the following interconnection network properties:

Task 4.a) network diameter,

**Task 4.b)** bisection bandwidth,

(bisection bandwidth = bisection width x link bandwidth)

**Task 4.c)** the bandwidth per node.

(total bandwidth = total number of links \* bandwidth bandwidth per node = total bandwidth / number of nodes)

**Task 5 (1.5pts)** A design team is contemplating whether to use a n-by-n torus (tori) or an n-dimensional hypercube to build an interconnect network that connects 4, 16, 64, and 256 nodes. In this exercise you will compare the network diameter and the bisection width for the two topologies to understand the trade-offs between them:

**Task 5.a)** At what scale does the hypercube provide (strictly) higher bisection width than the torus?

**Task 5.b)** Determine the network diameter and the switch degree for the scale at which the hypercube provides a higher bisection width than the torus.

**Task 5.c)** What can you say about the relative merits of the two topologies?