# Transcript


Version v1.0

Daniel Kocher

Daniel.Kocher@stud.sbg.ac.at

Salzburg, January 20, 2015

## Myth Busting

Myth: I prefer programming language X because it makes it more convenient to implement my application.

Answer: Wrong!

The fundamental problem of programming is to establish functional <u>correctness</u> and adequate <u>performance</u>. Different languages provide different tools for automating the process of <u>establishing</u> correctness and performance. The language should be chosen based on that insight.



Figure 1: Process of establishing correctness.

#### Ultimate goal:

A compiler that checks correctness in the sense that no exception can be thrown during any execution of the program. This is infeasible in general (a mathematical proof is possible).

Hardware Exception: Interrupts, a mechanism to stop memory accesses in hardware.

Myth: I like garbage-collected languages like Java because they free me from memory management.

**Answer:** Wrong!

A garbage collector (GC) provides safe deallocation of unneeded memory but the programmer still needs to say what is unneeded, otherwise the system will run out of memory (memory leak).

General runtime complexity of a garbage collector: proportional to the size of the heap.



Figure 2: Unreachable, reachable and needed set of a program.

#### Multicore

Amdahl's Law: P is the fraction of the program that can be parallelized, N is the number of cores, and S(N) is the resulting speedup.

- S(N) = N if P = 100%. This is ideal multicore scalability.

- In general: $S(N) = \frac{1}{(1-P) + \frac{P}{N}}$
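A minimal sketch of Amdahl's law in C: the serial fraction 1 - P is not sped up, while the parallel fraction P is divided across N cores. The function names `amdahl_speedup` and `print_speedups` are illustrative and not part of the lecture.

```c
#include <stdio.h>

/* Speedup predicted by Amdahl's law.
 * p: fraction of the program that can be parallelized (0.0 <= p <= 1.0)
 * n: number of cores (n >= 1) */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Prints the predicted speedup for 1, 2, 4, ..., 64 cores. */
void print_speedups(double p) {
    for (int n = 1; n <= 64; n *= 2) {
        printf("P = %.2f, N = %2d: S(N) = %.2f\n", p, n, amdahl_speedup(p, n));
    }
}
```

For P = 0.9 the speedup stays below 10 no matter how many cores are added, because the serial 10% dominates.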

## Sequential vs. Parallelized Code

The bottleneck of parallelized code is the memory bus. It limits execution even if a problem can be perfectly parallelized without any side effects. Thus a parallelization factor of 100% is infeasible. Any shared resource at any level of an architecture creates a limitation.



Figure 3: Amdahl's law plot.



Figure 4: Bottleneck of parallelized code.

# Architecture

# Von Neumann Architecture

Introduced in 1945. This is a stored program computer: data = program. The Fetch-Decode-Execute cycle (see Figure 7) modifies the state of the machine.



Figure 5: Architecture hierarchy (top-down).



Figure 6: Von Neumann Architecture.



Figure 7: Fetch-Decode-Execute cycle.

#### **DLX** Machine

- Control unit: instruction register (ir) and program counter (pc).
- Arithmetic unit: 32 registers of 32 bits each; reg[0], reg[1], ..., reg[31]. reg[0] always contains the value 0 and reg[31] is the link register. Both are reserved by convention and must not be used for any other purpose.
- Memory: n 32-bit words, byte-addressed (see Figure 8), word-aligned; $mem[0], \ldots, mem[n-1]$.



Figure 8: Visualization of a byte-addressed memory of 32-bit words.

# **Syntax Formats**

General syntax of an instruction: op a, b, c.

#### F1

The 5-bit fields a and b are wide enough to address all 32 registers. The immediate c is stored in two's complement because this makes arithmetic easier to implement than one's complement.

| Field | op            | a                | b                | c                                                         |
|-------|---------------|------------------|------------------|-----------------------------------------------------------|
| Width | 6 bits        | 5 bits           | 5 bits           | 16 bits                                                   |
| Range | $2^6$ opcodes | $0 \le a \le 31$ | $0 \le b \le 31$ | $-2^{15} \le c \le 2^{15} - 1$, sign-extended to 32 bits |

#### F2

E.g. R1 = R2 + R3

| Field | op            | a                | b                | unused  | c                |
|-------|---------------|------------------|------------------|---------|------------------|
| Width | 6 bits        | 5 bits           | 5 bits           | 11 bits | 5 bits           |
| Range | $2^6$ opcodes | $0 \le a \le 31$ | $0 \le b \le 31$ |         | $0 \le c \le 31$ |

#### F3

Absolute addressing in memory.

| Field | op            | c                        |
|-------|---------------|--------------------------|
| Width | 6 bits        | 26 bits                  |
| Range | $2^6$ opcodes | $0 \le c \le 2^{26} - 1$ |
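The field widths of the formats can be illustrated by packing and unpacking a 32-bit instruction word. The following is a sketch under two assumptions that the notes do not state: the fields are laid out from the most significant bit in the order op, a, b, c, and the helper names (`encode_f1`, `decode_c_f1`, ...) are made up for illustration. Concrete opcode numbers are not given in the notes either, so any value passed as `op` is hypothetical.

```c
#include <stdint.h>

/* F1: op (6 bits) | a (5 bits) | b (5 bits) | c (16 bits, two's complement).
 * Assumption: op occupies the most significant bits of the word. */
uint32_t encode_f1(uint32_t op, uint32_t a, uint32_t b, int32_t c) {
    return ((op & 0x3Fu) << 26) | ((a & 0x1Fu) << 21) | ((b & 0x1Fu) << 16)
         | ((uint32_t)c & 0xFFFFu);
}

uint32_t decode_op(uint32_t ir) { return ir >> 26; }
uint32_t decode_a(uint32_t ir)  { return (ir >> 21) & 0x1Fu; }
uint32_t decode_b(uint32_t ir)  { return (ir >> 16) & 0x1Fu; }

/* F1: extract the 16-bit immediate and sign-extend it to 32 bits. */
int32_t decode_c_f1(uint32_t ir) {
    int32_t c = (int32_t)(ir & 0xFFFFu);
    if (c >= 0x8000) {
        c -= 0x10000;   /* bit 15 set: the value is negative */
    }
    return c;
}

/* F2: c is a register number in the lowest 5 bits. */
uint32_t decode_c_f2(uint32_t ir) { return ir & 0x1Fu; }

/* F3: c is an unsigned 26-bit absolute address. */
uint32_t decode_c_f3(uint32_t ir) { return ir & 0x3FFFFFFu; }
```

Sign extension is what lets a 16-bit field carry negative offsets such as the -4 used for global variables later in these notes.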

<u>Von Neumann Bottleneck:</u> Memory read/write operations limit the performance.

# Register Instructions

# Arithmetic Instructions

#### F1

| Instruction  | Semantics         | Additional information                  |
|--------------|-------------------|-----------------------------------------|
| ADDI a, b, c | reg[a]:=reg[b]+c; | Add immediate                           |
|              | pc:=pc+4;         | c is data (a constant)                  |
| SUBI a, b, c | reg[a]:=reg[b]-c; | Subtract immediate                      |
|              | pc:=pc+4;         | c is data (a constant)                  |
| MULI a, b, c | reg[a]:=reg[b]*c; | Multiply immediate                      |
|              | pc:=pc+4;         | c is data (a constant)                  |
| DIVI a, b, c | reg[a]:=reg[b]/c; | Divide immediate                        |
|              | pc:=pc+4;         | c is data (a constant)                  |
| MODI a, b, c | reg[a]:=reg[b]%c; | Modulo immediate                        |
|              | pc:=pc+4;         | c is data (a constant)                  |
| CMPI a, b, c | reg[a]:=reg[b]-c; | Compare immediate                       |
|              | pc:=pc+4;         | c is data (a constant)                  |
|              |                   | reg[a] == 0 if reg[b] == c              |
|              |                   | reg[a] >= 0 if reg[b] >= c              |
|              |                   | reg[a] > 0 if reg[b] > c                |
|              |                   | reg[a] <= 0 if reg[b] <= c              |
|              |                   | reg[a] < 0 if reg[b] < c                |
|              |                   | reg[a] != 0 if reg[b] != c              |

#### F2

| Instruction | Semantics              | Additional information                         |
|-------------|------------------------|------------------------------------------------|
| ADD a, b, c | reg[a]:=reg[b]+reg[c]; | Add                                            |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
| SUB a, b, c | reg[a]:=reg[b]-reg[c]; | Subtract                                       |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
| MUL a, b, c | reg[a]:=reg[b]*reg[c]; | Multiply                                       |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
| DIV a, b, c | reg[a]:=reg[b]/reg[c]; | Divide                                         |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
| MOD a, b, c | reg[a]:=reg[b]%reg[c]; | Modulo                                         |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
| CMP a, b, c | reg[a]:=reg[b]-reg[c]; | Compare                                        |
|             | pc:=pc+4;              | Register addressing reg[c]                     |
|             |                        | reg[a] == 0 if reg[b] == reg[c]                |
|             |                        | reg[a] >= 0 if reg[b] >= reg[c]                |
|             |                        | reg[a] > 0 if reg[b] > reg[c]                  |
|             |                        | reg[a] <= 0 if reg[b] <= reg[c]                |
|             |                        | reg[a] < 0 if reg[b] < reg[c]                  |
|             |                        | reg[a] != 0 if reg[b] != reg[c]                |

#### Register Allocation Problem

Registers have to be used in order to execute an instruction. There are 29 registers (in theory) to use. In practice, at least in this course, fewer registers can be used because some are reserved for special purposes (stack pointer, heap pointer, globals pointer, frame pointer, ...). It must be guaranteed that these reserved registers are not used for anything else and, furthermore, that a register which is already allocated is not reused by another instruction.

#### Examples:

| C Code    | Instructions                                | Additional information                                                                         |
|-----------|---------------------------------------------|------------------------------------------------------------------------------------------------|
| 1 + 2;    | ADDI 1, 0, 1<br>ADDI 2, 0, 2<br>ADD 1, 1, 2 | First, naive solution                                                                          |
| 1 + 2;    | ADDI 1, 0, 1<br>ADDI 1, 1, 2                | Second, better solution                                                                        |
| 1 + 2;    | ADDI 1, 0, 3                                | Third, best solution: constant folding                                                         |
| if(1 < 2) | ADDI 1, 0, 1<br>ADDI 2, 0, 2<br>CMP 1, 1, 2 | R1 := R1 - R2                                                                                  |
|           | BGE 1, 0, `<loc>`                           | `<loc>` represents the location in the assembly code at which the CPU continues                |

### **Memory Instructions**

#### F1

| Instruction | Semantics                  | Additional information         |
|-------------|----------------------------|--------------------------------|
| LDW a, b, c | reg[a]:=mem[(reg[b]+c)/4]; | Load word (from memory)        |
|             | pc:=pc+4;                  | Register-relative addressing   |
| STW a, b, c | mem[(reg[b]+c)/4]:=reg[a]; | Store word (into memory)       |
|             | pc:=pc+4;                  | Register-relative addressing   |
| POP a, b, c | reg[a]:=mem[reg[b]/4];     | Pop (from stack)               |
|             | reg[b]:=reg[b]+c;          | c: size of the popped chunk    |
|             | pc:=pc+4;                  | reg[b]: contains stack pointer |
| PSH a, b, c | reg[b]:=reg[b]-c;          | Push (onto stack)              |
|             | mem[reg[b]/4]:=reg[a];     | c: size of the pushed chunk    |
|             | pc:=pc+4;                  | reg[b]: contains stack pointer |

The stack grows from high to low addresses (top-down). In the POP and PSH instructions, reg[b]:=reg[b]+c and reg[b]:=reg[b]-c represent the actual removal of an element from the stack and the allocation of space for a new element on the stack, respectively. In the case of the C\* language, c will always be 4 because only a single type (int) exists in this particular language. In general, c is the number of bytes to remove from (POP) or add to (PSH) the stack.
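The semantics of the four memory instructions can be sketched as a tiny emulator. This is an illustration only: the memory size `MEM_WORDS` and the function names are assumptions, and the update of pc is left out because it is the same `pc := pc + 4` for every instruction.

```c
#include <stdint.h>

#define MEM_WORDS 1024      /* hypothetical memory size in 32-bit words */

int32_t reg[32];            /* register file */
int32_t mem[MEM_WORDS];     /* indexed by word; addresses in registers are in bytes */

/* LDW a, b, c: reg[a] := mem[(reg[b] + c) / 4] */
void ldw(int a, int b, int c) { reg[a] = mem[(reg[b] + c) / 4]; }

/* STW a, b, c: mem[(reg[b] + c) / 4] := reg[a] */
void stw(int a, int b, int c) { mem[(reg[b] + c) / 4] = reg[a]; }

/* PSH a, b, c: move the stack pointer reg[b] down by c bytes, then store reg[a] */
void psh(int a, int b, int c) {
    reg[b] = reg[b] - c;
    mem[reg[b] / 4] = reg[a];
}

/* POP a, b, c: load reg[a] from the top of the stack, then move reg[b] up by c bytes */
void pop(int a, int b, int c) {
    reg[a] = mem[reg[b] / 4];
    reg[b] = reg[b] + c;
}
```

Starting with the stack pointer R30 at the highest address, two PSH instructions followed by two POP instructions return the values in reverse order and restore R30, which is the LIFO behavior the stack relies on.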

Without register-relative addressing, which is a concrete form of indirect addressing, you could only talk about as many memory cells as your program uses. The only way to get the same expressivity as with register-relative addressing is to write self-modifying code.



Figure 9: Memory layout when executing a program

#### Examples:

| C Code     | Instructions  | Additional information                   |
|------------|---------------|------------------------------------------|
| int x;     | LDW 1, 28, -4 | R1 := x                                  |
| x = x + 1; | ADDI 2, 0, 1  | R2 := 1                                  |
|            | ADD 1, 1, 2   | R1 := R1 + R2                            |
|            | STW 1, 28, -4 | mem[x] = R1                              |
| *x + 1;    | LDW 1, 28, -4 | R1 := x                                  |
|            | LDW 1, 1, 0   | R1 := mem[R1 + 0]                        |
|            | ADDI 2, 0, 1  | R2 := 1                                  |
|            | ADD 1, 1, 2   | R1 := R1 + R2                            |
| *(x + 1);  | LDW 1, 28, -4 | R1 := x                                  |
|            | ADDI 2, 0, 4  | R2 := 0 + 4, (4 because it's an address) |
|            | ADD 1, 1, 2   | R1 := R1 + R2                            |
|            | LDW 1, 1, 0   | R1 := mem[R1 + 0]                        |
|            |               | *(x + i) is equivalent to $x[i]$         |

When dealing with \*(x + 1), the type that x points to has to be known, as well as its size in memory. This size determines the offset that is added to the address stored in x. In the case of the C\* language this value is always 4 because there is only one type and no support for composite types (struct). Nevertheless, in general the size of the type has to be taken into account.

Formula to compute the address accessed by \*(x + i) = x[i]: x + sizeof(\*x) \* i.
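The scaling by the element size is what C pointer arithmetic does implicitly. The sketch below computes the byte address by hand and compares it with the address produced by the compiler. The function names are illustrative, and the comparison through `uintptr_t` assumes a platform with a flat address space, which is the usual case.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte address of x[i], computed explicitly as x + sizeof(*x) * i. */
uintptr_t element_address(const int *x, size_t i) {
    return (uintptr_t)x + sizeof(*x) * i;
}

/* Returns 1 if the manual computation agrees with the compiler's
 * pointer arithmetic (x + i) and with array indexing (&x[i]). */
int addresses_agree(const int *x, size_t i) {
    return element_address(x, i) == (uintptr_t)(x + i)
        && (uintptr_t)&x[i] == (uintptr_t)(x + i);
}
```

In C\* the factor is always 4, which is why the DLX code above adds the constant 4 for \*(x + 1).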

#### **Control Instructions**

#### F1

| Instruction | Semantics                 | Additional information                    |
|-------------|---------------------------|-------------------------------------------|
| BEQ a, c    | if(reg[a]==0) pc:=pc+c*4; | Branch on equal to zero                   |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BGE a, c    | if(reg[a]>=0) pc:=pc+c*4; | Branch on greater than or equal to zero   |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BGT a, c    | if(reg[a]>0) pc:=pc+c*4;  | Branch on greater than zero               |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BLE a, c    | if(reg[a]<=0) pc:=pc+c*4; | Branch on less than or equal to zero      |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BLT a, c    | if(reg[a]<0) pc:=pc+c*4;  | Branch on less than zero                  |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BNE a, c    | if(reg[a]!=0) pc:=pc+c*4; | Branch on not equal to zero               |
|             | else pc:=pc+4;            | Conditional branch, pc-relative           |
| BR c        | pc:=pc+c*4;               | Branch (unconditional)                    |
| BSR c       | reg[31]:=pc+4;            | Branch to subroutine (unconditional)      |
|             | pc:=pc+c*4;               | The return address is saved in the link   |
|             |                           | register (R31) to be able to return to    |
|             |                           | the correct instruction                   |

#### F2

| Instruction | Semantics              | Additional information |
|-------------|------------------------|------------------------|
| RET c       | pc:=reg[c];            | c = R31                |

#### F3

| Instruction | Semantics      | Additional information |
|-------------|----------------|------------------------|
| JSR c       | reg[31]:=pc+4; | Jump to subroutine; the return address is saved in R31 |
|             | pc:=c;         | absolute addressing    |

All branches (conditional as well as unconditional) are pc-relative. The offset c is multiplied by 4 because memory is byte-addressed and every instruction occupies one 32-bit word, i.e. 4 bytes. Branches are useful, especially because they are pc-relative: the generated code is so-called relocatable code. Relocatable code can be moved anywhere in memory without changing its behavior because every address is computed relative to the program counter. The drawback of branches is the restricted address range: there may be more memory than can be reached with the 16-bit offset of a branch. The solution is absolute addressing as used by the F3 format, which allows you to address a range of $2^{26}$ addresses.
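The pc-relative computation and the reason it yields relocatable code can be sketched in a few lines of C. The helper names `branch_target` and `branch_offset` are hypothetical; `branch_offset` corresponds to what a compiler computes when it fills in the c field of a branch.

```c
#include <stdint.h>

/* Target address of a pc-relative branch: pc + c * 4,
 * where c counts instructions (words) and may be negative. */
uint32_t branch_target(uint32_t pc, int32_t c) {
    return pc + (uint32_t)(c * 4);
}

/* Offset c to store in a branch located at byte address `from`
 * so that it reaches the instruction at byte address `to`. */
int32_t branch_offset(uint32_t from, uint32_t to) {
    return ((int32_t)to - (int32_t)from) / 4;
}
```

If both the branch and its target are moved by the same number of bytes, `branch_offset` returns the same value, so the encoded instruction does not have to change when the code is relocated.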

#### Example:

| C Code          | Instructions         | Additional information                             |
|-----------------|----------------------|----------------------------------------------------|
| while(x<1) {    | LDW 1, 28, -4        | R1 := x                                            |
| `<body>`        | ADDI 2, 0, 1         | R2 := 1                                            |
| }               | CMP 1, 1, 2          | R1 := R1 - R2                                      |
|                 | BGE 1, 0, 0          | c is set to zero because at this point in time     |
|                 |                      | we don't know where to jump to (in general).       |
|                 |                      | A fixup will later update this offset.             |
|                 | `<body>`             | The body of the loop                               |
|                 | BR 0, 0, `<top>`     | `<top>` is the address offset to branch back to    |
|                 |                      | the first generated instruction (LDW 1, 28, -4).   |

C\* is a Turing-complete language:

- integer arithmetic
- dereferencing operator
- assignment
- while loops

Functions are not needed for Turing-completeness, but they reduce the size of the code by preventing code duplication.

In general, every if-else-construct can be replaced by while loops, but not the other way round. However, a while loop can be replaced by a combination of functions and if-else-constructs for Turing-completeness.
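Both replacements can be illustrated in C. The sketch below shows the same sum once with a while loop and once with a recursive function plus if-else, and an if-else expressed with while loops only. All function names are made up for this example.

```c
/* Sum of 1..n using a while loop. */
int sum_loop(int n) {
    int sum = 0;
    while (n > 0) {
        sum = sum + n;
        n = n - 1;
    }
    return sum;
}

/* The same computation using only a function call and if-else. */
int sum_recursive(int n) {
    if (n > 0) {
        return n + sum_recursive(n - 1);
    } else {
        return 0;
    }
}

/* if (a > b) result = a; else result = b; written with while loops only.
 * Each loop body runs at most once because it clears its own condition. */
int max_with_while(int a, int b) {
    int result = 0;
    int taken = 0;
    int cond = a > b;
    while (cond) {            /* then-branch */
        result = a;
        taken = 1;
        cond = 0;
    }
    while (taken == 0) {      /* else-branch */
        result = b;
        taken = 1;
    }
    return result;
}
```

The recursive version needs the stack to keep track of pending calls, which is the price for dropping the loop construct.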

#### **Functions**

Why do we use a stack and no other data structure? Because a stack guarantees proper nesting of function calls, offers constant-time access and has no space overhead. These facts are a result of the LIFO (Last In First Out) property of a stack.

#### Example:

| C Code | Instructions              | Additional information                                 |
|--------|---------------------------|--------------------------------------------------------|
| f(x);  | LDW 1, 28, -4             | R1 := x                                                |
|        | PSH 1, 30, 4              | Push argument onto stack                               |
|        | BSR 0, 0, `<faddr>`       | Branch to subroutine at address offset `<faddr>`       |
|        | `<body of f>`             |                                                        |
|        | ADD 1, 0, 27              | R27 is the return register (by convention)             |
`<faddr>` is often resolved by a part of the compiler toolchain called the linker. There may be two kinds of references: symbolic references (e.g. f, the name of the function) and direct references (e.g. addr(f), the absolute address of the function in memory).

#### Examples:

| C Code                    | Instructions  | Additional information                                |
|---------------------------|---------------|-------------------------------------------------------|
| int f(int x) {            | PSH 31, 30, 4 | Save link register (push R31 onto stack)              |
| `<body of f>`             | PSH 29, 30, 4 | Save frame pointer (push R29 onto stack)              |
| }                         | ADD 29, 0, 30 | Set new frame pointer to current stack pointer        |
|                           |               | See Figure 10                                         |
| return x;                 | LDW 1, 29, 12 | Load argument x from memory. The offset is            |
|                           |               | 12 here because x is pushed onto the stack            |
|                           |               | before the link register and the frame pointer.       |
|                           | ADD 27, 0, 1  | Save x into return register                           |
|                           | POP 29, 30, 4 | Restore caller's frame pointer                        |
|                           | POP 31, 30, 8 | Pop link register from stack as well as x.            |
|                           |               | This could also be done by two POP instructions       |
|                           |               | each popping 4 bytes. In general the size to pop      |
|                           |               | is $4 + \#args * sizeof(arg_i)$.                      |
|                           | RET 0, 0, 31  | Return to where the function was invoked              |



Figure 10: Stack snapshot of a function with one argument x.

#### Caller vs. Callee:

The caller of a function f invokes the function, e.g. f(x);. The callee is the function itself, e.g. int f(int x) { ... }. The callee can in turn be a caller if it invokes another function in its body, e.g. int f(int x) { ... g(x); ... }.

#### Declaration vs. Definition:

A declaration is inserted into the symbol table of a compiler, e.g. int x;. The declaration tells the compiler that a variable or function exists.

A definition is the actual value assignment, e.g. x = 2;. A definition can occur multiple times in the source code.

#### Example:

| C Code        | Instructions              | Additional information                     |
|---------------|---------------------------|--------------------------------------------|
| y = x + f(x); | LDW 1, 28, -4             | Load x                                     |
|               | PSH 1, 30, 4              | Save the caller's context (R1)             |
|               | LDW 1, 28, -4             | Load x                                     |
|               | PSH 1, 30, 4              | Push actual argument(s) for f onto stack   |
|               | BSR 0, 0, `<faddr>`       | Branch to f                                |
|               |                           | Execute function f                         |
|               | POP 1, 30, 4              | Restore the caller's context (R1)          |
|               | ADD 2, 0, 27              | Store return value into R2                 |
|               | ADD 1, 1, 2               | Actual addition of x and f(x)              |
|               | STW 1, 28, -8             | Store result.                              |
|               |                           | -8 because y is the second global variable |

## Heap

Dynamically allocated memory is located on the heap. In C there is the malloc/free combination to accomplish this, in Java there is the new keyword to allocate memory dynamically but no explicit mechanism to deallocate it. Java has a garbage collector which safely deallocates unneeded memory. However, Java provides implicit deallocation mechanisms. The first is to set a reference explicitly to null:

```
x = new A(); // dynamic allocation
...
x = null;
// if no other reference to the object exists here,
// the object will be deallocated by the garbage collector
```

The second approach uses scopes (blocks). When a scope is left, every scope-local variable is unneeded if and only if there are no references from outside the scope. If there are no such references, the garbage collector will deallocate the objects these variables refer to:

```
{
    x = new A();
    ...
    ...
}
```

The simplest (but least intelligent) implementation of a dynamic heap allocator is a so-called *bump pointer allocator*. A bump pointer is a pointer which only moves upwards, never downwards. In this case there is no free to deallocate memory because the heap only grows. However, the implementation of malloc is pretty simple and straightforward:

```
void* malloc(int s) {
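    LDW 1, 29, 12   // R1 := s (argument, addressed relative to frame pointer R29)
    ADD 27, 0, 26   // R27 := R26 (return value = current heap pointer)
    ADD 26, 26, 1   // R26 := R26 + s (R26 = heap pointer, bumped by s)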
}
```

So every time malloc is invoked, these three DLX instructions are executed.
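For comparison, the same idea can be sketched in plain C. This is not the lecture's implementation: the fixed-size `heap` array, the name `bump_malloc`, the rounding to a multiple of 4 bytes and the out-of-memory check are all assumptions added to make the sketch safe to run. The variable `bump` plays the role of the heap pointer R26.

```c
#include <stddef.h>

#define HEAP_SIZE 4096                  /* hypothetical heap size in bytes */

static unsigned char heap[HEAP_SIZE];   /* memory handed out by the allocator */
static size_t bump = 0;                 /* offset of the next free byte (role of R26) */

/* Returns a pointer to s fresh bytes, or NULL if the heap is exhausted.
 * There is no matching free: the bump pointer never moves back. */
void *bump_malloc(size_t s) {
    s = (s + 3) & ~(size_t)3;           /* round up to a multiple of 4 bytes (added) */
    if (s > HEAP_SIZE - bump) {
        return NULL;                    /* out of memory (added check) */
    }
    void *p = &heap[bump];              /* result = current bump pointer */
    bump = bump + s;                    /* bump the pointer by the requested size */
    return p;
}
```

Consecutive allocations are laid out back to back, so allocation is a constant-time operation, at the cost of never being able to reuse memory.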



#### Byte-addressing and word-alignment:



How does access to the stack work? Why is R1, R28, -4 used? Latency.

#### Implementation of void\* malloc(int s):

LDW 1, 29, 12; ADD 27, 0, 26; ADD 26, 26, 1

s: R1 = mem[R29 + 12]

Since C\* gives no access to registers, the behavior above has to be implemented in assembly.

malloc is part of (g)libc and is nowadays de facto an operating system feature.

#### Implementation of int getchar():

RDC 0, 0, 27

#### Implementation of void putchar(int c):

LDW 1, 29, 12; WRC 0, 0, 1

getchar reads the next character from a stream.

R27 = getchar();

Streams: stdin, stdout, stderr



Figure 11: Memory layout revisited

#### F2

| Instruction | Semantics          | Additional information |
|-------------|--------------------|------------------------|
| RDC c       | reg[c]:=getchar(); |                        |
|             | pc:=pc+4;          |                        |
| WRC c       | putchar(reg[c]);   |                        |
|             | pc:=pc+4;          |                        |

Why is it so hard to implement RDC and WRC on application level? Because operating systems stuff is self-referential!



Figure 12: Application and processor

How does I/O work on a processor?



Figure 13: I/O on a processor

#### Computability & Complexity of algorithms

What is the minimal machine that is still universal?

- RISC: Reduced Instruction Set Computer. Has separate instructions to load/store and to compute. E.g. ARM, MIPS, SPARC, ...
- CISC: Complex Instruction Set Computer. Has more complex instructions which load, compute and store in a single instruction. E.g. x86.

RISC vs. CISC $\leftrightarrow$ Compiler vs. processor.

OISC/URISC: One Instruction Set Computer / Ultimate Reduced Instruction Set Computer

```
SUBLEQ a, b, c: subtract and branch if less than or equal to zero
  mem[b] := mem[b] - mem[a];
  if mem[b] <= 0 goto c;
  else pc := pc + 4;
```
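To see that this single instruction is enough to compute, here is a small SUBLEQ interpreter in C together with a three-instruction program that adds two numbers. The sketch makes some assumptions that differ from the notation above: memory is indexed by word, each instruction occupies three consecutive cells (so pc advances by 3 instead of 4), and a negative jump target halts the machine. The names `subleq_run` and `subleq_add` are made up.

```c
/* Executes SUBLEQ instructions from mem, starting at pc, until pc is negative.
 * Assumption: each instruction is stored as three cells a, b, c. */
void subleq_run(int *mem, int pc) {
    while (pc >= 0) {
        int a = mem[pc];
        int b = mem[pc + 1];
        int c = mem[pc + 2];
        mem[b] = mem[b] - mem[a];
        if (mem[b] <= 0) {
            pc = c;            /* branch */
        } else {
            pc = pc + 3;       /* fall through to the next instruction */
        }
    }
}

/* Computes x + y with SUBLEQ only.
 * Data cells: mem[9] = X, mem[10] = Y, mem[11] = Z (scratch, initially 0). */
int subleq_add(int x, int y) {
    int mem[12] = {
        9, 11, 3,      /* Z := Z - X  (Z = -x), continue at 3 either way */
        11, 10, 6,     /* Y := Y - Z  (Y = y + x), continue at 6 either way */
        11, 11, -1,    /* Z := Z - Z  (= 0), so the branch to -1 halts */
        0, 0, 0
    };
    mem[9] = x;
    mem[10] = y;
    subleq_run(mem, 0);
    return mem[10];
}
```

Addition is built from two subtractions through a scratch cell, and the unconditional halt works because Z - Z is always zero, which makes the branch always taken.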



Figure 14: Asymptotic computational complexity [1]



Figure 15: Asymptotic computational complexity [2] (for operating systems)



Figure 16: Big O notation (upper bound)