# CISC-221 Computer Architecture Lab The Memory Hierarchy Lab 7: Memory Performance March, 2018

Name: Alejandra Kudo

Student Number: 10136014

Instructor: Dave Dove ( [dove@cs.queensu.ca](mailto:dove@cs.queensu.ca) ).

1 Introduction

In this lab you will collect data relating to the performance of various layers of the Raspberry Pi memory hierarchy and investigate how performance can be altered significantly by the memory access patterns of a user program.

You will analyze the data that you collect and draw conclusions relating to the information collected in this lab.

# 2 Logistics

You may work individually or in pairs for this assignment. All submissions for this lab are electronic. Clarifications and corrections will be posted to the course website.

If you are working as a pair, only one submission is required for two students but make sure that both your names are included in the submission. Either student may make the submission.

# 3 Instructions

There are two parts to this lab. In the first part, you will run performance measurement programs, adjusting input and measurement parameters. Your results should reveal relationships between array sizes, access patterns and performance as well as comparative access times of multiple layers of the memory hierarchy.

In the second part, you will draw conclusions about the results that you obtained.

## 4 The tools you will be using.

**chase**: This is a simple pointer chasing program that creates an array of linked list entries that alternately access the lowest half, then the upper half of the array. Each array element consists of 8 (4 byte) pointers that will fill one 32 byte cache line of the raspberry pi. By defining the number of array elements when executing the program, you will be adjusting the array size. Although not necessary to understand, further information can be found at the end of this document about this program.

**latency**: This program uses the same pointer chasing program but instead of measuring total program execution time (User time), it measures the number of cycles taken for each memory access. You will set the number of array elements in the same way as with the chase program but the number of iterations is always 1000 (by default). This program creates a file *samples.dat* which reports the number of cycles required for each memory access. You will then use the **memhist** script to turn the analyze the results and print out a histogram of memory access times.

**memhist**: a bash script (copy from /cas/course/ydrive/cisc221) to analyze and print out a histogram of the samples.dat file generated by the latency program.

# 5 Helpful Details

References:

OnQ: Archived Lecture Notes: Storage Management, Raspberry Pi Memory

Course text: Chapters 6 and 9.

**6 What you need to do:**

## Part 1

1. Log into the raspberry pi assigned to your group and create a new directory for this lab.
2. Complete Table 1 below by running the ***chase*** program for each row of the table and recording the resulting run time reported in the file: ***chase.txt***. Run times should range from around 2 to 26 seconds. For each row of the table run the command: chase –i –s ***elements*** –n ***iterations*** where *elements* is the value listed in the Array Elements column and *iterations* is the value listed in the Iterations column.

Adjusting the number of array elements (each array element will fill one cache line) will increase the size of the array. Decreasing the number of iterations will decrease the total number of instructions executed. By increasing the size of the array and decreasing the number of iterations at the same time, the total number of instructions executed will remain constant and the User Time results should vary according to just the data access time.  
After completing the table, group the User Time readings into 3 tiers where Tier 1 covers the shortest times and Tier 3 covers the longest times. Each tier should differ significantly from other tiers in User Time.

1. Now you will vary array size and see the effect on the data access time. Accessing data from the cache will be faster than accessing data from main memory. Accessing data from main memory will take even longer if there isn’t an entry in the data Micro TLB or the Main TLB as an additional memory access is required to get the entry from the page table in main memory.  
     
   To measure we will use a program called ***latency***. We will run the program over the same list of array element sizes as before (we will be using the same list of Array Elements and resulting Array Sizes). The program tests the latency of data accesses (number of cycles) over 1000 samples. The measurements are output to a file called ***samples.dat***. After the latency program finishes, run the script ***memhist*** to print a histogram of access cycle times.  
     
   Complete Table 2 below by running the latency program with the following options: ***latency –s <elements>*** where <*elements*> is the number in the Array elements column of the table. After the latency program completes print a histogram of the results by executing the command: ***memhist***.  
     
     
   The output of the ***memhist*** program might look something like this:

62 563 ##################################################

61 311 ##################################

9 61 #######

70 22 ###

63 22 ###

123 3 #

69 2 #

64 2 #

17 2 #

138 2 #

Look at the first line of the histogram. The first number in the line refers to the number of cycles. The second number refers to the number of samples that had a latency of that number of cycles. The hash marks are an old-school text-based representation of a bar graph representing the relative number of samples (compared to the total number of samples) that had that access time.  
From the histogram above we see that most samples (563+311) had a latency of 61 or 62 cycles while 61 samples had an access time of 9 cycles, and 44 samples (22+22) had an access time of 63 or 70 cycles. The remaining single # marks result from a variety of other things that are going on and can be ignored as not statistically relevant.

Again group the readings in the histogram into the three tiers you created after completing Table 1 ranging from fastest to lowest. There will be some reading that are transient and don’t fit well into any of the three categories. They result from other activities that are happening during the execution of the program such as timer interrupts and various other minor events. Ignore these readings. You should be able to confidently categorize at least 800 of the 1000 total samples.

Notes:  
The measurement process adds 6 cycles to each memory latency measurement (3 cycles to start the cycle counter and 3 more to stop it). This means that the actual memory latency is 6 cycles less than the number reported in the first column of the histogram table. For example, a reported latency of 62 cycles actually has a latency of 62 – 6 = 56 cycles.

The clock frequency of the pi is 700MHZ. That means that 1 memory cycle takes 1/f = 1/700 Mhz = 1.43 nanoseconds

## Part 2

1. Answer the following questions. The PowerPoint slides in the Raspberry Pi Memory Hierarchy pdf have the numerical information required to answer these questions. You will find it useful to review the course text and archived lecture notes to make sure you understand how the Translation Lookaside Buffer works.

TLB (Translation Lookaside Buffer)

* 1. What is the maximum size of array that will fit into the cache?

The maximum size of array that will fit into the cache is 16KB.

* 1. Try to explain the relationship between the array sizes of Tier 1, Tier 2, Tier 3 and which level of the memory hierarchy supplies the data referenced.   
       
     Tier 3 is slower than the other tiers. The lower you go – the slower each tier will be so the bigger the array size will get.
  2. The data will always come from either the L1 cache or from main memory so why is there three tiers instead of two? (hint: it has to do with the TLB)  
       
     Because there are two ten-entry Micro TLBs – one for instructions and one for data (they are both fully associative). Because the three tiers – you have to go through the LI cache and L2 cache – the lower you go the slower it is.
  3. Explain how array size affects memory latency in terms of the capacity and actions of the Translation Lookaside Buffer (MicroTLB and MainTLB).

The array size affects memory latency in terms of capacity because the MicroTLBs are backed by the Main TLB (second level TLB). Also, MicroTLB is 40KB and Main TLB is 288KB – meaning MicroTLB is faster and easier to access because it’s smaller (results come faster).

## Table 1. Array Size vs Execution Time

## run: chase –i –s <Array Elements> -n <Iterations>; cat chase.txt

|  |  |  |  |
| --- | --- | --- | --- |
| **Array Size** | **Array Elements** | **Iterations** | **User Time (seconds)** |
| 4MB | 131072 | 1024 | 25.7 |
| 2MB | 65536 | 2048 | 24.0 |
| 1MB | 32768 | 4096 | 24.0 |
| 512KB | 16384 | 8192 | 23.4 |
| 256KB | 8192 | 16384 | 22.8 |
| 128KB | 4096 | 32768 | 16.1 |
| 64KB | 2048 | 65535 | 11.15 |
| 32KB | 1024 | 131072 | 9.8 |
| 16KB | 512 | 262144 | 2.54 |
| 8KB | 256 | 524288 | 2.37 |

Now group your time readings into 3 tiers or groups with the least time variation within each tier and enter the tier information from Table 1 below.  
Use Tier 1 for the shortest times and Tier 3 for the longest times.

Tier 1 Time range: 2.37 - 2.54 seconds Array size range: 8KB - 16KB

Tier 2 Time range: 9.8 - 16.1 seconds Array size range: 32KB - 128KB

Tier3 Time range: 22.8 - 25.7 seconds Array size range: 256KB – 4MB

## Table 2. Array Size vs Memory Access Time (latency)

## run: latency –s <Array Elements> ; memhist

|  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ***Array Size*** | ***Array Elements*** | **Latency** | | | | | |
| **Tier 1 (fastest)** | | **Tier 2** | | **Tier 3 (slowest)** | |
| **cycles** | **#samples** | **cycles** | **#samples** | **cycles** | **#samples** |
| 4MB | 131072 |  |  |  |  | 120 | 988 |
| 2MB | 65536 |  |  |  |  | 120 | 936 |
| 1MB | 32768 |  |  |  |  | 120 | 884 |
| 512KB | 16384 |  |  |  |  | 120 | 952 |
| 256KB | 8192 |  |  |  |  | 120 | 897 |
| 128KB | 4096 |  |  | 60 | 341 | 120 | 555 |
| 64KB | 2048 | 9 | 45 | 60 | 928 |  |  |
| 32KB | 1024 | 9 | 102 | 60 | 880 |  |  |
| 16KB | 512 | 9 | 998 |  |  |  |  |
| 8KB | 256 | 9 | 999 |  |  |  |  |

Tier 1 latency = (cycles -6) \* clock cycle time = 4.28 ns

Tier 2 latency = (cycles -6) \* clock cycle time = 77.1 ns

Tier 3 latency = (cycles -6) \* clock cycle time = 144.2 ns

## Additional information on how the chase program works.

The fundamental data structure in the pointer chasing program (called chase) is the linked list item:

typedef struct \_CacheLine  
{

struct \_CacheLine\* ptrCacheLine[8] ;

} CacheLine;

//the ptrCacheLine array consists of 8 4-byte (32-bit) pointers and will therefore fill one cache line of the pi.

The larger array of the chase program (8KB – 4MB in size) is initialized by the following code:

steps = array\_elements/2;

bottom = 0;

top = steps;

for (i=0; i< steps ; i++){

array[bottom].ptrCacheLine[0] = &array[top];

array[top].ptrCacheline[0] = &array[bottom+1];

top++;

bottom++;

}

array[(array\_elements-1)].ptrCacheLine[0] = NULL; //used to terminate linked list

The actual point chasing loop is quite simple:

void(test\_loop\_prototype(CacheLine\* aray)

{

CacheLine\* p;

for(p = array; p != NULL; p->ptrCacheline[0]);

}

//the program calls test\_loop\_prototype many times in order to increase the overall run time which improves repeatability and accuracy of the results.

## Pointer chasing access pattern over array[ ]

NULL

start here

bottom half of array

top half of array