# Validating Coherent Mesh Network

In this notebook we look at the validation process of a cache hierarchy model for the Coherent Mesh Network (CMN) in gem5.
This cache hierarchy is used by the server-class processors developed by ARM (more specifically, in the Neoverse N1 IP).
I modeled this cache hierarchy based on the information I could find online and my understanding of the ARM AMBA CHI architecture.
Here are some high-level points about this cache hierarchy.

* As the name suggests, the interconnect is a mesh that connects the cores to each other.
* The interconnect also hosts slices of the system level cache (abbreviated SLC, similar to a last level cache shared system wide), with one slice located at each point in the grid.
* Each slice of the system level cache is:
    * a victim cache (no allocate on read misses, allocate only on writebacks).
    * a home base for part of the physical memory address space: It acts as the directory for the addresses whose home is that slice.
* Connected to each slice is a *Core Tile* that encapsulates two Ares cores each with a 64KiB L1 instruction cache, 64KiB L1 data cache, and a 512KiB L2 Cache.

In this evaluation we are going to use an **8 core** system with **4 channels** of memory.
Below is a *logical* diagram of the system.

![logical diagram of a system built with CMN](figures/cmn.png)


## Measuring core to core latency

In this section we will run experiments in gem5 to measure the latency of moving one cache block from one core (`provider`) to another core (`receiver`).
Since, to the best of my knowledge, the real hardware for CMN allows for forwarding clean data from the L2 cache (instead of the L1D cache) we are going to measure two latencies:

* Latency of moving a cache block from `provider` core to `receiver` core when the provider core has previously accessed that block for read shared (accessing only for read).
* Latency of moving a cache block from `provider` core to `receiver` core when the provider core has previously accessed that block for read unique (accessing for write).

Before looking at the results of the measurements, I want you to keep this important aspect of CMN in mind:
*Slices of the system level cache act as the directory for a certain subset of the physical address space.*

You might have guessed by now that the latency of moving a cache block at address `addr` from the `provider` core to the `receiver` core is a function of the following:
* Physical distance of `provider` core to the SLC responsible for `addr`.
* Physical distance of `receiver` core to the SLC responsible for `addr`.
* Physical distance between `provider` core and `receiver` core.
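
The three distances above can be combined into a rough hop count. The sketch below is only an illustration: it assumes a 2x2 mesh with made-up node coordinates, routing where the hop count equals the Manhattan distance, and a three-leg transfer (request from the receiver to the home slice, snoop from the home slice to the provider, data forwarded from the provider to the receiver). The names `hops` and `transfer_hops` are hypothetical and do not come from the gem5 model.

```python
from typing import Tuple

Coord = Tuple[int, int]  # (row, column) of a mesh node; hypothetical layout


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def transfer_hops(provider: Coord, receiver: Coord, home: Coord) -> int:
    """Total hops for an assumed three-leg transfer:
    receiver -> home (request), home -> provider (snoop),
    provider -> receiver (data forward)."""
    return hops(receiver, home) + hops(home, provider) + hops(provider, receiver)


# Assumed 2x2 mesh: coordinates are illustrative, not taken from the model.
top_left, top_right, bottom_right = (0, 0), (0, 1), (1, 1)

# Same provider/receiver pair, two different home slices.
near_home = transfer_hops(provider=top_left, receiver=top_right, home=top_left)
far_home = transfer_hops(provider=top_left, receiver=top_right, home=bottom_right)
print(near_home, far_home)
```

Under these assumptions the same pair of cores needs more hops when the home slice is farther away, which is why the home of `addr` matters as much as the positions of the two cores.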

To keep these measurements limited to the interconnect and the protocol, we are going to use traffic generators instead of actual processing cores.
Moreover, in all the experiments we have assigned home bases to addresses by interleaving the addresses at a granularity of **512KiB**.
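
As a minimal sketch of this interleaving, the function below maps an address to a home slice index. It assumes plain modulo interleaving and 4 slices (inferred from 8 cores with two cores per tile and one slice per tile); the mapping used by the model may differ, and `home_slice` is a hypothetical name.

```python
INTERLEAVE_BYTES = 512 * 1024  # interleaving granularity stated in the text
NUM_SLICES = 4                 # assumption: 8 cores, two per tile, one slice per tile


def home_slice(addr: int, num_slices: int = NUM_SLICES,
               granularity: int = INTERLEAVE_BYTES) -> int:
    """Index of the SLC slice that is home for addr, assuming modulo interleaving."""
    return (addr // granularity) % num_slices


print(home_slice(0), home_slice(512 * 1024))
```

With this mapping every address in the first 512KiB has slice 0 as its home, the next 512KiB has slice 1, and the pattern wraps around after the last slice.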

### Measuring core to core latency: read shared

This is the setup of the experiment:

1- The `provider` core will make a read request to address `addr` while all the other cores are idle.

2- We will wait long enough to make sure the data has arrived in the L1D cache of the `provider` core.

3- The `receiver` core will make a read request to address `addr` while all the other cores are idle.

4- We will measure the time between `receiver` core making the request and the data arriving at its L1D cache.

**NOTE**: The cores will make requests for 8 byte reads.
This detail is not significant to the setup of this specific experiment but has been noted for the sake of consistency.
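
The setup above has to be repeated for every pair of cores to fill a heatmap. The sketch below shows one way to organize that sweep, assuming the provider and receiver are always different cores. The `measure` callable is hypothetical and stands in for one run of steps 1 to 4; it is not a gem5 API, and `fake_measure` returns a placeholder value rather than a real latency.

```python
from itertools import permutations
from typing import Callable, Dict, Tuple

NUM_CORES = 8  # core count stated in the text


def sweep(measure: Callable[[int, int, int], float], addr: int,
          num_cores: int = NUM_CORES) -> Dict[Tuple[int, int], float]:
    """Call measure(provider, receiver, addr) for every ordered pair of distinct cores."""
    return {
        (provider, receiver): measure(provider, receiver, addr)
        for provider, receiver in permutations(range(num_cores), 2)
    }


# Placeholder standing in for one simulation run; it returns no real latency.
def fake_measure(provider: int, receiver: int, addr: int) -> float:
    return 0.0


results = sweep(fake_measure, addr=0)
print(len(results))  # number of ordered (provider, receiver) pairs
```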

Below is a heatmap of the measured core to core latencies for address `addr = 0` (SLC0: upper left corner in the diagram).

![read shared latency for address 0](figures/8B_read_addr0.png)

Below is a heatmap of the measured core to core latencies for address `addr = 512K` (SLC1: upper right corner in the diagram).

![read shared latency for address 512K]()

**NOTEWORTHY STUFF**

1- The last step in the coherence protocol is forwarding the data from the L2 cache of `provider` to the L1D cache of `receiver`.

2- Notice how the heatmap is symmetric about the main diagonal (latency(x->y) = latency(y->x)).

3- Notice how the measured latencies change when the shared address changes.
This is the behavior expected of a system with home bases, since a different address can have a different home slice.
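
The symmetry noted in point 2 can also be checked numerically instead of by eye. The helper below is a small sketch that reports the largest difference between latency(x->y) and latency(y->x) in a square matrix. The name `max_asymmetry` is hypothetical, and the example matrix holds placeholder values, not measured latencies.

```python
from typing import Sequence


def max_asymmetry(latency: Sequence[Sequence[float]]) -> float:
    """Largest |latency[x][y] - latency[y][x]| over all pairs x < y."""
    n = len(latency)
    return max(
        (abs(latency[x][y] - latency[y][x]) for x in range(n) for y in range(x + 1, n)),
        default=0.0,
    )


# Placeholder values for illustration only; these are not measured latencies.
example = [
    [0.0, 10.0, 12.0],
    [10.0, 0.0, 14.0],
    [12.0, 15.0, 0.0],
]
print(max_asymmetry(example))
```

A result of zero means the matrix is exactly symmetric; a small nonzero value shows which level of asymmetry would need explaining.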

### Measuring core to core latency: read unique (8 Byte writes)

This is the setup of the experiment:

1- The `provider` core will make a write request to address `addr` while all the other cores are idle.

2- We will wait long enough to make sure the data has arrived in the L1D cache of the `provider` core.

3- The `receiver` core will make a write request to address `addr` while all the other cores are idle.

4- We will measure the time between `receiver` core making the request and the data arriving at its L1D cache.

**NOTE**: The cores will make requests for 8 byte writes.

Below is a heatmap of the measured core to core latencies for address `addr = 0` (SLC0: upper left corner in the diagram).

![read unique (8B) latency for address 0](figures/8B_write_addr0.png)

Below is a heatmap of the measured core to core latencies for address `addr = 512K` (SLC1: upper right corner in the diagram).

![read unique (8B) latency for address 512K]()

**NOTEWORTHY STUFF**

1- The last step in the coherence protocol is forwarding the data from the L1D cache of `provider` to the L1D cache of `receiver`.

2- Notice how the heatmap is symmetric about the main diagonal (latency(x->y) = latency(y->x)).

3- Notice how the measured latencies change when the shared address changes.
This is the behavior expected of a system with home bases, since a different address can have a different home slice.