

Many of the slides in this lecture are either from or adapted from slides provided by the authors of the textbook "Computer Systems: A Programmer's Perspective," 2<sup>nd</sup> Edition and are provided from the website of Carnegie-Mellon University, course 15-213, taught by Randy Bryant and David O'Hallaron in Fall 2010. These slides are indicated "Supplied by CMU" in the notes section of the slides.



If arrays x and y have the same alignment, i.e., both start in the same cache set, then each access to an element of y replaces the cache line containing the corresponding element of x, and vice versa. The result is that the loop is executed very slowly — each access to either array results in a conflict miss.



However, if the two arrays start in different cache sets, then the loop executes quickly—there is a cache miss on just every fourth access to each array.













The cache still holds two rows of the matrix, but each row may go into one of two different cache lines. In the slide, the first row goes into the first lines of the cache sets, the second row goes into the second lines of the cache sets.



There is still a cache miss on each access.



With a 2-way set-associative cache, our dot-product example runs quickly even if the two arrays have the same alignment.



The L3 cache is known as the last-level cache (LLC) in the Intel documentation.

One concern is whether what's contained in, say, the L1 cache is also contained in the L2 cache. if so, caching is said to be **inclusive**. If what's contained in the L1 cache is definitely not contained in the L2 cache, caching is said to be **exclusive**. An advantage of exclusive caches is that the total cache capacity is the sum of the sizes of each of the levels, whereas for inclusive caches, the total capacity is just that of the largest. An advantage of inclusive caches is that what's been brought into the cache hierarchy by one core is available to the other cores.

AMD processors tend to have exclusive caches; Intel processors tend to have inclusive caches.

#### **What About Writes?**

- Multiple copies of data exist:
  - L1, L2, main memory, disk
- · What to do on a write-hit?
  - write-through (write immediately to memory)
  - write-back (defer write to memory until replacement of line)
    - » need a dirty bit (line different from memory or not)
- · What to do on a write-miss?
  - write-allocate (load into cache, update line in cache)
    - » good if more writes to the location follow
  - no-write-allocate (writes immediately to memory)
- Typical
  - write-through + no-write-allocate
  - write-back + write-allocate

**CS33 Intro to Computer Systems** 

XVII-13

#### Supplied by CMU.

Most current processors use the write-back/write-allocate approach. This causes some (surmountable) difficulties for multi-core processors that have a separate cache for each core.

#### **Accessing Memory**

- Program references memory (load)
  - if not in cache (cache miss), data is requested from RAM
    - » fetched in units of 64 bytes
      - aligned to 64-byte boundaries (low-order 6 bits of address are zeroes)
    - » if memory accessed sequentially, data is pre-fetched
    - » data stored in cache (in 64-byte cache lines)
      - stays there until space must be re-used (least recently used is kicked out first)
  - if in cache (cache hit) no access to RAM needed
- · Program modifies memory (store)
  - data modified in cache
  - eventually written to RAM in 64-byte units

**CS33 Intro to Computer Systems** 

XVII-14

Copyright © 2022 Thomas W. Doeppner. All rights reserved.

This slide describes accessing memory on Intel Core I5 and I7 processors.

If the processor determines that a program is accessing memory sequentially (because the past few accesses have been sequential), then it begins the load of the next block from memory before it is requested. If this determination was correct, then the memory will be in the cache (or well on its way) before it's needed.

#### **Cache Performance Metrics**

- · Miss rate
  - fraction of memory references not found in cache (misses / accesses)
    - = 1 hit rate
  - typical numbers (in percentages):
    - » 3-10% for L1
    - » can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time
  - time to deliver a line in the cache to the processor
    - » includes time to determine whether the line is in the cache
  - typical numbers:
    - » 1-2 clock cycles for L1
    - » 5-20 clock cycles for L2
- Miss penalty
  - additional time required because of a miss
    - » typically 50-200 cycles for main memory (trend: increasing!)

**CS33 Intro to Computer Systems** 

XVII-15

#### Hits vs. Misses

- · Huge difference between hit and miss times
  - could be 100x, if just L1 and main memory
- 99% hit rate is twice as good as 97%!
  - consider:
     cache hit time of 1 cycle
     miss penalty of 100 cycles
  - average access time:

```
97% hits: .97 * 1 cycle + 0.03 * 100 cycles ≈ 4 cycles
99% hits: .99 * 1 cycle + 0.01 * 100 cycles ≈ 2 cycles
```

This is why "miss rate" is used instead of "hit rate"

**CS33 Intro to Computer Systems** 

XVII-16



# **Locality Example**

```
sum = 0;
for (i = 0; i < n; i++)</pre>
   sum += a[i];
return sum;
```

#### Data references

- reference array elements in succession (stride-1 reference pattern)

**Spatial locality** 

- reference variable sum each iteration Temporal locality

Instruction references

- reference instructions in sequence.

**Spatial locality** 

- cycle through loop repeatedly

**Temporal locality** 

**CS33 Intro to Computer Systems** 

XVII-18

## Quiz 2

Does this function have good locality with respect to array a? The array a is MxN.

- a) yes
- b) no

```
int sum_array_cols(int N, int a[][N]) {
   int i, j, sum = 0;

for (j = 0; j < N; j++)
   for (i = 0; i < M; i++)
      sum += a[i][j];
   return sum;
}</pre>
```

**CS33 Intro to Computer Systems** 

XVII-19

### **Writing Cache-Friendly Code**

- Make the common case fast
  - focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - repeated references to variables are good (temporal locality)
  - stride-1 reference patterns are good (spatial locality)

**CS33 Intro to Computer Systems** 

XVII-20

Supplied by CMU.

"Stride n" reference patterns are sequences of memory accesses in which every nth element is accessed in memory order. Thus stride 1 means that every element is accessed, starting at the beginning of a memory area, continuing to its end.

```
Matrix Multiplication Example
                                                         Variable sum
                                                         held in register
                                 for (i=0; i<n; i++)</pre>
· Description:
                                   for (j=0; j<n; j++) {</pre>
    - multiply N x N
                                     sum = 0.0; \leftarrow
      matrices
                                     for (k=0; k<n; k++)
        » each element is a
                                        sum += a[i][k] * b[k][j];
          double
                                     c[i][j] = sum;

 O(N³) total operations

    - N reads per source
      element
                                 /* ikj */

    N values summed per

                                 for (i=0; i<n; i++) {</pre>
      destination
                                   for (k=0; k<n; k++) {
        » but may be able to
          hold in register
                                      r = a[i][k];
                                      for (j=0; j<n; j++)
                                        c[i][j] += r * b[k][j];
                                XVII-21
CS33 Intro to Computer Systems
```

Based on slides supplied by CMU.



XVII-22

Adapted form a slide by CMU.

**CS33 Intro to Computer Systems** 

## **Layout of C Arrays in Memory (review)**

- · C arrays allocated in row-major order
  - each row in contiguous memory locations
- · Stepping through columns in one row:

```
- for (i = 0; i < N; i++)
sum += a[0][i];
```

- accesses successive elements
- if block size (B) > 8 bytes, exploit spatial locality
  - » compulsory miss rate = 8 bytes / Block
- · Stepping through rows in one column:

```
- for (i = 0; i < n; i++)
sum += a[i][0];
```

- accesses widely separated elements
- no spatial locality!
  - » compulsory miss rate = 1 (i.e. 100%)

**CS33 Intro to Computer Systems** 

XVII-23



Assume we are multiplying arrays of doubles, thus each element is eight bytes long, and thus a cache line holds eight matrix elements. The slide shows a straightforward implementation of multiplying A and B to produce C.



If we reverse the order of the two outer loops, there's no change in results or performance.



Moving the loop on k to be the outer loop does not affect the result, but it improves performance.



Switching the two outer loops affects neither results nor performance.



Moving the loop on i to be the inner loop makes performance considerably worse.



The poor performance is not improved by reversing the outer loops.

```
Summary of Matrix Multiplication
         for (i=0; i<n; i++)
           for (j=0; j<n; j++) {</pre>
                                                 ijk (& jik):
           sum = 0.0;
                                                  • 2 loads, 0 stores
           for (k=0; k<n; k++)
             sum += a[i][k] * b[k][j];
                                                   • misses/iter = 1.125
           c[i][j] = sum;
         for (k=0; k<n; k++)
         for (i=0; i<n; i++) {</pre>
                                                 kij (& ikj):
          r = a[i][k];
                                                  • 2 loads, 1 store
          for (j=0; j<n; j++)
                                                   • misses/iter = 0.25
           c[i][j] += r * b[k][j];
        for (j=0; j<n; j++)
         for (k=0; k<n; k++) {</pre>
                                                 jki (& kji):
           r = b[k][j];
                                                   • 2 loads, 1 store
           for (i=0; i<n; i++)</pre>
                                                   • misses/iter = 2.0
             c[i][j] += a[i][k] * r;
CS33 Intro to Computer Systems
                                    XVII-30
```



#### In Real Life ...

Multiply two 1024x1024 matrices of doubles on sunlab machines

```
– ijk
   » 4.185 seconds
– kij
   » 0.798 seconds
– jki
   » 11.488 seconds
```

CS33 Intro to Computer Systems

XVII-32 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# **Concluding Observations**

- Programmer can optimize for cache performance
  - organize data structures appropriately
- All systems favor "cache-friendly code"
  - getting absolute optimum performance is very platform specific
    - » cache sizes, line sizes, associativities, etc.
  - can get most of the advantage with generic code
    - » keep working set reasonably small (temporal locality)
    - » use small strides (spatial locality)

**CS33 Intro to Computer Systems** 

XVII-33

# **CS 33**

#### **Architecture and the OS**

CS33 Intro to Computer Systems

XVII-34 Copyright © 2022 Thomas W. Doeppner. All rights reserved.



#### **Processes**

- Containers for programs
  - virtual memory
    - » address space
  - scheduling
    - » one or more threads of control
  - file references
    - » open files
  - and lots more!

CS33 Intro to Computer Systems

XVII-36 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# Idiot Proof ... Can I clobber Mary's program? int main() { int i; int A[1]; for (i=0; ; i++) A[rand()] = i; } Mary's Program Copyright © 2022 Thomas W. Doeppner. All rights reserved.

## **Fair Share**

```
prevent Bob's
                       program from
void runforever(){
                       running?
  while(1)
                              Bob's
int main() {
                             Program
 runforever();
```

Can I

CS33 Intro to Computer Systems

XVII-38 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# **Architectural Support for the OS**

- · Not all instructions are created equal ...
  - non-privileged instructions
    - » can affect only current program
  - privileged instructions
    - » may affect entire system
- Processor mode
  - user mode
    - » can execute only non-privileged instructions
  - privileged mode
    - » can execute all instructions

**CS33 Intro to Computer Systems** 

XVII-39

Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# **Which Instructions Should Be Privileged?**

- I/O instructions
- Those that affect how memory is mapped
- Halt instruction
- · Some others ...

**CS33 Intro to Computer Systems** 

XVII-40 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# Who Is Privileged?

- No one
  - user code always runs in user mode
- The operating-system kernel runs in privileged mode
  - nothing else does
  - not even super user on Unix or administrator on Windows

**CS33 Intro to Computer Systems** 

XVII-41 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# **Entering Privileged Mode**

- How is OS invoked?
  - very carefully ...
  - strictly in response to interrupts and exceptions
  - (booting is a special case)

CS33 Intro to Computer Systems

XVII-42 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

# **Interrupts and Exceptions**

- · Things don't always go smoothly ...
  - I/O devices demand attention
  - timers expire
  - programs demand OS services
  - programs demand storage be made accessible
  - programs have problems
- Interrupts
  - demand for attention by external sources
- Exceptions
  - executing program requires attention

**CS33 Intro to Computer Systems** 

XVII-43 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

## **Exceptions**

- Traps
  - "intentional" exceptions
    - » execution of special instruction to invoke OS
  - after servicing, execution resumes with next instruction
- Faults
  - a problem condition that is normally corrected
  - after servicing, instruction is re-tried
- Aborts
  - something went dreadfully wrong ...
  - not possible to re-try instruction, nor to go on to next instruction

**CS33 Intro to Computer Systems** 

XVII-44 Copyright © 2022 Thomas W. Doeppner. All rights reserved.

These definitions follow those given in "Intel® 64 and IA-32 Architectures Software Developer's Manual" and are generally accepted even outside of Intel.

#### **Actions for Interrupts and Exceptions**

- · When interrupt or exception occurs
  - processor saves state of current thread/process on stack
  - processor switches to privileged mode (if not already there)
  - invokes handler for interrupt/exception
  - if thread/process is to be resumed (typical action after interrupt)
    - » thread/process state is restored from stack
  - if thread/process is to re-execute current instruction
    - » thread/process state is restored, after backing up instruction pointer
  - if thread/process is to terminate
    - » it's terminated

**CS33 Intro to Computer Systems** 

XVII-45 Copyright © 2022 Thomas W. Doeppner. All rights reserved.



#### **Entering and Exiting**

- Entering/exiting interrupt/exception handler more involved than entering/exiting a procedure
  - must deal with processor mode
    - » switch to privileged mode on entry
    - » switch back to previous mode on exit
  - interrupted process/thread's state is saved on separate kernel stack
  - stack in kernel must be different from stack in user program
    - » why?

**CS33 Intro to Computer Systems** 

XVII-47

Copyright © 2022 Thomas W. Doeppner. All rights reserved.

The reason why there must be a separate stack in privileged mode is that the OS must be guaranteed that when it is executing, it has a valid stack, that the stack pointer must be pointing to a region of memory that can be used as a stack by the OS. Since while the program was running in user mode any value could have been put into the stack-pointer register, when the OS is invoked, it switches to a pre-allocated stack set up just for it.



When a trap or interrupt occurs, the current processor state (registers, including RIP, condition codes, etc.) are saved on the kernel stack. When the system returns back to the interrupted program, this state is restored.

#### Quiz 3

If an interrupt occurs, which general-purpose registers must be pushed onto the kernel stack?

- a) all
- b) none
- c) callee-save registers
- d) caller-save registers

CS33 Intro to Computer Systems

XVII-49 Copyright © 2022 Thomas W. Doeppner. All rights reserved.