

#### COMPUTER ORGANIZATION AND DE

The Hardware/Software Interface



# **Chapter 3**

#### **Arithmetic for Computers**

### **Arithmetic for Computers**

- Operations on integers
  - Addition and subtraction
  - Multiplication and division
  - Dealing with overflow
- Floating-point real numbers
  - Representation and operations

### **Integer Addition**

Example: 7 + 6



- Overflow if result out of range
  - Adding +ve and –ve operands, no overflow
  - Adding two +ve operands
    - Overflow if result sign is 1
  - Adding two –ve operands
    - Overflow if result sign is 0



### Integer Subtraction

- Add negation of second operand
- Example: 7 6 = 7 + (-6)

```
+7: 0000 0000 ... 0000 0111
```

<u>-6: 1111 1111 ... 1111 1010</u>

+1: 0000 0000 ... 0000 0001

- Overflow if result out of range
  - Subtracting two +ve or two –ve operands, no overflow
  - Subtracting +ve from –ve operand
    - Overflow if result sign is 0
  - Subtracting –ve from +ve operand
    - Overflow if result sign is 1

# **Dealing with Overflow**

- Some languages (e.g., C) ignore overflow
  - Use MIPS addu, addui, subu instructions
- Other languages (e.g., Ada, Fortran) require raising an exception
  - Use MIPS add, addi, sub instructions
  - On overflow, invoke exception handler
    - Save PC in exception program counter (EPC) register
    - Jump to predefined handler address
    - mfc0 (move from coprocessor reg) instruction can retrieve EPC value, to return after corrective action

#### **Arithmetic for Multimedia**

- Graphics and media processing operates on vectors of 8-bit and 16-bit data
  - Use 64-bit adder, with partitioned carry chain
    - Operate on  $8\times8$ -bit,  $4\times16$ -bit, or  $2\times32$ -bit vectors
  - SIMD (single-instruction, multiple-data)
- Saturating operations
  - On overflow, result is largest representable value
    - c.f. 2s-complement modulo arithmetic
  - E.g., clipping in audio, saturation in video



#### 快速进位链

#### 1. 并行加法器



$$C_{i} = \overline{A}_{i} B_{i} C_{i-1} + A_{i} \overline{B}_{i} C_{i-1} + A_{i} B_{i} \overline{C}_{i-1} + A_{i} B_{i} \overline{C}_{i-1}$$

$$= A_{i} B_{i} + (A_{i} + B_{i}) C_{i-1}$$

$$= A_i B_i + (A_i + B_i) C_{i-1}$$

$$d_i = A_i B_i$$
 本地进位

$$d_i = A_i B_i$$
 本地进位  $t_i = A_i + B_i$  传送条件



#### 2. 串行进位链

进位链

传送进位的电路

串行进位链

进位串行传送

以 4 位全加器为例,每一位的进位表达式为

$$C_0 = d_0 + t_0 C_{-1} = \overline{d_0 \cdot t_0 C_{-1}}$$

$$C_1 = d_1 + t_1 C_0$$

$$C_2 = d_2 + t_2 C_1$$

设与非门的级延迟时间为t。

$$C_3 = d_3 + t_3 C_2$$



4位 全加器产生进位的全部时间为 8t,

n 位全加器产生进位的全部时间为  $2nt_v$ 

#### 3. 并行进位链(先行进位,跳跃进位)

n 位加法器的进位同时产生 以 4 位加法器为例

#### (1) 单重分组先行进位链

n 位全加器分若干小组,小组中的进位同时产生,小组与小组之间采用串行进位 以 n = 16 为例



当 
$$d_i t_i$$
 形成后 经 2.5  $t_y$  产生  $C_3 \sim C_0$    
 5  $t_y$  产生  $C_7 \sim C_4$    
 7.5  $t_y$  产生  $C_{11} \sim C_8$    
  $10 t_y$  产生  $C_{15} \sim C_{12}$ 

#### (2) 双重分组先行进位链

n 位全加器分若干大组,大组中又包含若干小组。每个大组中小组的最高位进位同时产生。 大组与大组之间采用串行进位。

以 n=32 为例



MORGAN KAUFMANN

2

3

4

5

6

7

#### (6) n = 16 双重分组跳跃进位链



当  $d_i t_i$ 和 $C_{-1}$ 形成后 经  $2.5 t_y$  产生  $C_2$ 、 $C_1$ 、 $C_0$ 、 $D_5 \sim D_8$ 、 $T_5 \sim T_8$  经  $5 t_y$  产生  $C_{15}$ 、 $C_{11}$ 、 $C_7$ 、 $C_3$  经  $7.5 t_y$  产生  $C_{14} \sim C_{12}$ 、 $C_{10} \sim C_8$ 、 $C_6 \sim C_4$ 

串行进位链  $经 32t_y$  产生 全部进位 单重分组跳跃进位链  $经 10t_y$  产生 全部进位



#### (7) n=32 双重分组跳跃进位链



当 
$$d_i t_i$$
 形成后 经 2.5  $t_y$  产生  $C_2$ 、 $C_1$ 、 $C_0$ 、 $D_1 \sim D_8$ 、 $T_1 \sim T_8$ 
5  $t_y$  产生  $C_{15}$ 、 $C_{11}$ 、 $C_7$ 、 $C_3$ 
7.5  $t_y$  产生  $C_{18} \sim C_{16}$ 、 $C_{14} \sim C_{12}$ 、 $C_{10} \sim C_8$ 、 $C_6 \sim C_4$   $C_{31}$ 、 $C_{27}$ 、 $C_{23}$ 、 $C_{19}$ 

 $10t_y$  产生  $C_{30} \sim C_{28}$ 、  $C_{26} \sim C_{24}$ 、  $C_{22} \sim C_{20}$ 



## Multiplication

Start with long-multiplication approach



Length of product is the sum of operand lengths



### **Multiplication Hardware**





## **Optimized Multiplier**

Perform steps in parallel: add/shift



- One cycle per partial-product addition
  - That's ok, if frequency of multiplications is low

#### 例题 • 乘法算法

为了节省空间,使用的是 4 位长的数,计算 210×310,或 00102×00112的积。

| 迭代次数 | 步骤              | 乘数   | 被乘数       | 乘积        |
|------|-----------------|------|-----------|-----------|
| 0    | 初始值             | 001① | 0000 0010 | 0000 0000 |
| 1    | la: l⇒乘积=乘积+被乘数 | 0011 | 0000 0010 | 0000 0010 |
|      | 2: 左移被乘数        | 0011 | 0000 0100 | 0000 0010 |
|      | 3: 右移乘数         | 000① | 0000 0100 | 0000 0010 |
| 2    | la: l⇒乘积=乘积+被乘数 | 0001 | 0000 0100 | 0000 0110 |
|      | 2: 左移被乘数        | 0001 | 0000 1000 | 0000 0110 |
|      | 3: 右移乘数         | 0000 | 0000 1000 | 0000 0110 |
| 3    | 1: 0⇒无操作        | 0000 | 0000 1000 | 0000 0110 |
|      | 2: 左移被乘数        | 0000 | 0001 0000 | 0000 0110 |
|      | 3: 右移乘数         | 0000 | 0001 0000 | 0000 0110 |
| 4    | 1: 0⇒无操作        | 0000 | 0001 0000 | 0000 0110 |
|      | 2: 左移被乘数        | 0000 | 0010 0000 | 0000 0110 |
|      | 3: 右移乘数         | 0000 | 0010 0000 | 0000 0110 |

图 3-6 使用图 3-4 中算法的乘法例子。圆圈圈起来的是下一步需要检测的位



### **Faster Multiplier**

- Uses multiple adders
  - Cost/performance tradeoff



- Can be pipelined
  - Several multiplication performed in parallel



### **MIPS Multiplication**

- Two 32-bit registers for product
  - HI: most-significant 32 bits
  - LO: least-significant 32-bits
- Instructions
  - mult rs, rt / multu rs, rt
    - 64-bit product in HI/LO
  - mfhi rd / mflo rd
    - Move from HI/LO to rd
    - Can test HI value to see if product overflows 32 bits
  - mul rd, rs, rt
    - Least-significant 32 bits of product —> rd

#### **Division**



*n*-bit operands yield *n*-bit quotient and remainder

- Check for 0 divisor
- Long division approach
  - If divisor ≤ dividend bits
    - 1 bit in quotient, subtract
  - Otherwise
    - 0 bit in quotient, bring down next dividend bit
- Restoring division
  - Do the subtract, and if remainder goes < 0, add divisor back</li>
- Signed division
  - Divide using absolute values
  - Adjust sign of quotient and remainder as required

#### **Division Hardware**



in left half

Quotient

32 bits

test

Shift left

### **Optimized Divider**



- One cycle per partial-remainder subtraction
- Looks a lot like a multiplier!
  - Same hardware can be used for both

#### 例题。除法算法

为了节省篇幅,我们使用 4 位的数据。计算  $7_{10}$  除以  $2_{10}$ ,即 0000 0111<sub>2</sub>除以 0010<sub>2</sub>。

| 迭代次数 | 步骤                       | 商    | 除数        | 余数               |
|------|--------------------------|------|-----------|------------------|
| 0    | 初始值                      | 0000 | 0010 0000 | 0000 0111        |
| 1    | 1: 余数=余数-除数              | 0000 | 0010 0000 | <b>①110 0111</b> |
|      | 2b: 余数<0 ⇒+除数,商左移,上最低位上0 | 0000 | 0010 0000 | 0000 0111        |
|      | 3: 除数右移                  | 0000 | 0001 0000 | 0000 0111        |
| 2    | 1: 余数=余数-除数              | 0000 | 0001 0000 | <b>①111 0111</b> |
|      | 2b: 余数<0 ⇒+除数,商左移,上最低位上0 | 0000 | 0001 0000 | 0000 0111        |
|      | 3: 除数右移                  | 0000 | 0000 1000 | 0000 0111        |
| 3    | 1: 余数=余数-除数              | 0000 | 0000 1000 | <b>①111 1111</b> |
|      | 2b: 余数<0⇒+除数,商左移,上最低位上0  | 0000 | 0000 1000 | 0000 0111        |
|      | 3: 除数右移                  | 0000 | 0000 0100 | 0000 0111        |
| 4    | 1: 余数=余数-除数              | 0000 | 0000 0100 | ⊕000 0011        |
|      | 2a: 余数≥0⇒商左移,上最低位上1      | 0001 | 0000 0100 | 0000 0011        |
|      | 3: 除数右移                  | 0001 | 0000 0010 | 0000 0011        |
| 5    | 1: 余数=余数-除数              | 0001 | 0000 0010 | <b>@000 0001</b> |
|      | 2a: 余数≥0⇒商左移,上最低位上1      | 0011 | 0000 0010 | 0000 0001        |
|      | 3: 除数右移                  | 0011 | 0000 0001 | 0000 0001        |

图 3-10 除法的例子,采用图 3-9 中的算法。图中圈起来的位用于决定下一步的操作

#### **Faster Division**

- Can't use parallel hardware as in multiplier
  - Subtraction is conditional on sign of remainder
- Faster dividers (e.g. SRT devision)
   generate multiple quotient bits per step
  - Still require multiple steps

#### **MIPS Division**

- Use HI/LO registers for result
  - HI: 32-bit remainder
  - LO: 32-bit quotient
- Instructions
  - div rs, rt / divu rs, rt
  - No overflow or divide-by-0 checking
    - Software must perform checks if required
  - Use mfhi, mflo to access result

#### (2) 不恢复余数法(加减交替法)

• 恢复余数法运算规则

余数 
$$R_i > 0$$
 上商 "1", $2R_i - y^*$  余数  $R_i < 0$  上商 "0", $R_i + y^*$  恢复余数  $2(R_i + y^*) - y^* = 2R_i + y^*$ 

• 不恢复余数法运算规则

上商"1" 
$$2R_i - y^*$$
 加减交替上商"0"  $2R_i + y^*$ 

# Floating Point

- Representation for non-integral numbers
  - Including very small and very large numbers
- Like scientific notation

In binary

$$\pm 1.xxxxxxx_2 \times 2^{yyyy}$$

Types float and double in C

### Floating Point Standard

- Defined by IEEE Std 754-1985
- Developed in response to divergence of representations
  - Portability issues for scientific code
- Now almost universally adopted
- Two representations
  - Single precision (32-bit)
  - Double precision (64-bit)

# **IEEE Floating-Point Format**

single: 8 bits single: 23 bits double: 11 bits double: 52 bits

S Exponent Fraction

$$x = (-1)^{S} \times (1 + Fraction) \times 2^{(Exponent-Bias)}$$

- S: sign bit  $(0 \Rightarrow \text{non-negative}, 1 \Rightarrow \text{negative})$
- Normalize significand: 1.0 ≤ |significand| < 2.0</p>
  - Always has a leading pre-binary-point 1 bit, so no need to represent it explicitly (hidden bit)
  - Significand is Fraction with the "1." restored
- Exponent: excess representation: actual exponent + Bias
  - Ensures exponent is unsigned
  - Single: Bias = 127; Double: Bias = 1203

### Single-Precision Range

- Exponents 00000000 and 11111111 reserved
- Smallest value
  - Exponent: 00000001⇒ actual exponent = 1 - 127 = -126
  - Fraction:  $000...00 \Rightarrow \text{significand} = 1.0$
  - $\pm 1.0 \times 2^{-126} \approx \pm 1.2 \times 10^{-38}$
- Largest value
  - exponent: 11111110⇒ actual exponent = 254 127 = +127
  - Fraction: 111...11 ⇒ significand ≈ 2.0
  - $\pm 2.0 \times 2^{+127} \approx \pm 3.4 \times 10^{+38}$

### **Double-Precision Range**

- Exponents 0000...00 and 1111...11 reserved
- Smallest value
  - Exponent: 0000000001⇒ actual exponent = 1 - 1023 = -1022
  - Fraction:  $000...00 \Rightarrow \text{significand} = 1.0$
  - $\pm 1.0 \times 2^{-1022} \approx \pm 2.2 \times 10^{-308}$
- Largest value
  - Exponent: 11111111110⇒ actual exponent = 2046 1023 = +1023
  - Fraction: 111...11 ⇒ significand ≈ 2.0
  - $\pm 2.0 \times 2^{+1023} \approx \pm 1.8 \times 10^{+308}$

### Floating-Point Precision

- Relative precision
  - all fraction bits are significant
  - Single: approx 2<sup>-23</sup>
    - Equivalent to 23 × log<sub>10</sub>2 ≈ 23 × 0.3 ≈ 6 decimal digits of precision
  - Double: approx 2<sup>-52</sup>
    - Equivalent to 52 × log<sub>10</sub>2 ≈ 52 × 0.3 ≈ 16 decimal digits of precision

### Floating-Point Example

- Represent –0.75
  - $-0.75 = (-1)^1 \times 1.1_2 \times 2^{-1}$
  - S = 1
  - Fraction =  $1000...00_2$
  - Exponent = -1 + Bias
    - Single:  $-1 + 127 = 126 = 011111110_2$
    - Double:  $-1 + 1023 = 1022 = 0111111111110_2$
- Single: 1011111101000...00
- Double: 10111111111101000...00

### Floating-Point Example

 What number is represented by the singleprecision float

11000000101000...00

- S = 1
- Fraction =  $01000...00_2$
- Fxponent = 10000001<sub>2</sub> = 129

$$x = (-1)^{1} \times (1 + 01_{2}) \times 2^{(129 - 127)}$$

$$= (-1) \times 1.25 \times 2^{2}$$

$$= -5.0$$

### Floating-Point Addition

- Consider a 4-digit decimal example
  - $-9.999 \times 10^{1} + 1.610 \times 10^{-1}$
- 1. Align decimal points
  - Shift number with smaller exponent
  - $\bullet$  9.999  $\times$  10<sup>1</sup> + 0.016  $\times$  10<sup>1</sup>
- 2. Add significands
  - $\bullet$  9.999  $\times$  10<sup>1</sup> + 0.016  $\times$  10<sup>1</sup> = 10.015  $\times$  10<sup>1</sup>
- 3. Normalize result & check for over/underflow
  - $\bullet$  1.0015  $\times$  10<sup>2</sup>
- 4. Round and renormalize if necessary
  - $1.002 \times 10^{2}$

### Floating-Point Addition

- Now consider a 4-digit binary example
  - $-1.000_2 \times 2^{-1} + -1.110_2 \times 2^{-2} (0.5 + -0.4375)$
- 1. Align binary points
  - Shift number with smaller exponent

$$-1.000_2 \times 2^{-1} + -0.111_2 \times 2^{-1}$$

2. Add significands

$$-1.000_2 \times 2^{-1} + -0.111_2 \times 2^{-1} = 0.001_2 \times 2^{-1}$$

- 3. Normalize result & check for over/underflow
  - $1.000_2 \times 2^{-4}$ , with no over/underflow
- 4. Round and renormalize if necessary
  - $-1.000_2 \times 2^{-4}$  (no change) = 0.0625

#### **FP Adder Hardware**

- Much more complex than integer adder
- Doing it in one clock cycle would take too long
  - Much longer than integer operations
  - Slower clock would penalize all instructions
- FP adder usually takes several cycles
  - Can be pipelined

#### **FP Adder Hardware**





#### **FP Arithmetic Hardware**

- FP multiplier is of similar complexity to FP adder
  - But uses a multiplier for significands instead of an adder
- FP arithmetic hardware usually does
  - Addition, subtraction, multiplication, division, reciprocal, square-root
  - FP ↔ integer conversion
- Operations usually takes several cycles
  - Can be pipelined



### **FP Instructions in MIPS**

- FP hardware is coprocessor 1
  - Adjunct processor that extends the ISA
- Separate FP registers
  - 32 single-precision: \$f0, \$f1, ... \$f31
  - Paired for double-precision: \$f0/\$f1, \$f2/\$f3, ...
    - $lue{}$  Release 2 of MIPs ISA supports 32 imes 64-bit FP reg's
- FP instructions operate only on FP registers
  - Programs generally don't do integer ops on FP data, or vice versa
  - More registers with minimal code-size impact
- FP load and store instructions
  - lwc1, ldc1, swc1, sdc1
    - e.g., ldc1 \$f8, 32(\$sp)



### **FP Instructions in MIPS**

- Single-precision arithmetic
  - add.s, sub.s, mul.s, div.s
    - e.g., add.s \$f0, \$f1, \$f6
- Double-precision arithmetic
  - add.d, sub.d, mul.d, div.d
    - e.g., mul.d \$f4, \$f4, \$f6
- Single- and double-precision comparison
  - c.xx.s, c.xx.d (xx is eq, lt, le, ...)
  - Sets or clears FP condition-code bit
    - e.g. c.lt.s \$f3, \$f4
- Branch on FP condition code true or false
  - bc1t, bc1f
    - e.g., bc1t TargetLabel

# FP Example: ° F to ° C

C code:

```
float f2c (float fahr) {
  return ((5.0/9.0)*(fahr - 32.0));
}
```

- fahr in \$f12, result in \$f0, literals in global memory space
- Compiled MIPS code:

```
f2c: lwc1  $f16, const5($gp)
  lwc2  $f18, const9($gp)
  div.s  $f16, $f16, $f18
  lwc1  $f18, const32($gp)
  sub.s  $f18, $f12, $f18
  mul.s  $f0, $f16, $f18
  jr  $ra
```

## FP Example: Array Multiplication

- $X = X + Y \times Z$ 
  - All 32 × 32 matrices, 64-bit double-precision elements
- C code:

Addresses of x, y, z in \$a0, \$a1, \$a2, and
 j, j, k in \$s0, \$s1, \$s2

## FP Example: Array Multiplication

#### MIPS code:

```
li $t1, 32
                   # $t1 = 32 (row size/loop end)
   1i $s0, 0
                   # i = 0; initialize 1st for loop
L1: li $s1, 0
                   # j = 0; restart 2nd for loop
L2: 1i $s2, 0 # k = 0; restart 3rd for loop
   addu t2, t2, t2, t2, t2 = i * size(row) + j
   sll $t2, $t2, 3 # $t2 = byte offset of [i][j]
   addu t2, a0, t2 \# t2 = byte address of <math>x[i][j]
   1.d f4, O(t2) # f4 = 8 bytes of x[i][j]
L3: s11 $t0, $s2, 5 # $t0 = k * 32 (size of row of z)
   addu t0, t0, s1 # t0 = k * size(row) + j
   sll $t0, $t0, 3 # $t0 = byte offset of [k][j]
   addu t0, a2, t0 # t0 = byte address of <math>z[k][j]
   1.d f16, 0(t0) # f16 = 8 bytes of z[k][j]
```

•••



## FP Example: Array Multiplication

...

```
\$11 \$t0, \$s0, 5    # \$t0 = i*32 (size of row of y)
addu t0, t0, s2 # t0 = i*size(row) + k
sll $t0, $t0, 3 # $t0 = byte offset of [i][k]
addu t0, a1, t0 # t0 = byte address of y[i][k]
1.d f18, 0(t0) # f18 = 8 bytes of y[i][k]
mul.d f16, f18, f16 # f16 = y[i][k] * z[k][j]
add.d f4, f4, f4 # f4=x[i][j] + y[i][k]*z[k][j]
addiu $s2, $s2, 1 # $k k + 1
bne \$s2, \$t1, L3 # if (k != 32) go to L3
s.d f4, O(t2) # x[i][j] = f4
addiu $$1, $$1, 1 # $j = j + 1
bne $s1, $t1, L2 # if (j != 32) go to L2
addiu $s0, $s0, 1
                    # $i = i + 1
bne $s0, $t1, L1 # if (i != 32) go to L1
```

### **Accurate Arithmetic**

- IEEE Std 754 specifies additional rounding control
  - Extra bits of precision (guard, round, sticky)
  - Choice of rounding modes
  - Allows programmer to fine-tune numerical behavior of a computation
- Not all FP units implement all options
  - Most programming languages and FP libraries just use defaults
- Trade-off between hardware complexity, performance, and market requirements

#### **Subword Parallellism**

- Graphics and audio applications can take advantage of performing simultaneous operations on short vectors
  - Example: 128-bit adder:
    - Sixteen 8-bit adds
    - Eight 16-bit adds
    - Four 32-bit adds
- Also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data (SIMD)

### x86 FP Architecture

- Originally based on 8087 FP coprocessor
  - 8 × 80-bit extended-precision registers
  - Used as a push-down stack
  - Registers indexed from TOS: ST(0), ST(1), ...
- FP values are 32-bit or 64 in memory
  - Converted on load/store of memory operand
  - Integer operands can also be converted on load/store
- Very difficult to generate and optimize code
  - Result: poor FP performance



## **x86 FP Instructions**

| Data transfer                                  | Arithmetic                                                                               | Compare             | Transcendental                            |
|------------------------------------------------|------------------------------------------------------------------------------------------|---------------------|-------------------------------------------|
| FILD mem/ST(i) FISTP mem/ST(i) FLDPI FLD1 FLDZ | FIADDP mem/ST(i) FISUBRP mem/ST(i) FIMULP mem/ST(i) FIDIVRP mem/ST(i) FSQRT FABS FRNDINT | FICOMP FSTSW AX/mem | FPATAN F2XMI FCOS FPTAN FPREM FPSIN FYL2X |

#### Optional variations

- I: integer operand
- P: pop operand from stack
- R: reverse operand order
- But not all combinations allowed



### **Streaming SIMD Extension 2 (SSE2)**

- $\blacksquare$  Adds 4 imes 128-bit registers
  - Extended to 8 registers in AMD64/EM64T
- Can be used for multiple FP operands
  - 2 × 64-bit double precision
  - 4 × 32-bit double precision
  - Instructions operate on them simultaneously
    - Single-Instruction Multiple-Data

#### Unoptimized code:

```
1. void dgemm (int n, double* A, double* B, double* C)
2. {
3. for (int i = 0; i < n; ++i)
4. for (int j = 0; j < n; ++j)
5. {
6. double cij = C[i+j*n]; /* cij = C[i][j] */
7. for(int k = 0; k < n; k++)
8. cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
9. C[i+j*n] = cij; /* C[i][j] = cij */
10. }
11. }</pre>
```



#### x86 assembly code:

```
1. vmovsd (%r10), %xmm0 # Load 1 element of C into %xmm0
2. mov %rsi,%rcx # register %rcx = %rsi
3. xor %eax, %eax
                 # register %eax = 0
4. vmovsd (%rcx), %xmm1 # Load 1 element of B into %xmm1
5. add r9, rcx # register rcx = rcx + register
6. vmulsd (%r8,%rax,8),%xmm1,%xmm1 # Multiply %xmm1,
  element of A
7. add \$0x1, \$rax # register \$rax = \$rax + 1
8. cmp %eax, %edi # compare %eax to %edi
9. vaddsd %xmm1, %xmm0, %xmm0 # Add %xmm1, %xmm0
10. jg 30 \langle dgemm + 0x30 \rangle # jump if eax > edi
11. add \$0x1,\$r11d # register \$r11 = \$r11 + 1
12. vmovsd %xmm0, (%r10) # Store %xmm0 into C element
```

#### Optimized C code:

```
1. #include <x86intrin.h>
2. void dgemm (int n, double* A, double* B, double* C)
3. {
4. for (int i = 0; i < n; i+=4)
5. for (int j = 0; j < n; j++) {
     m256d c0 = mm256 load pd(C+i+j*n); /* c0 = C[i][j]
  * /
7.
     for ( int k = 0; k < n; k++ )
8.
     c0 = mm256 \text{ add pd(}c0, /* c0 += A[i][k]*B[k][j] */
9.
                mm256 mul pd(mm256 load pd(A+i+k*n),
10.
                mm256 broadcast sd(B+k+j*n));
     _{mm256\_store\_pd(C+i+j*n, c0); /* C[i][j] = c0 */
11.
12.
13. }
```

#### Optimized x86 assembly code:

```
1. vmovapd (%r11),%ymm0
                       # Load 4 elements of C into %ymm0
2. mov %rbx, %rcx
                 # register %rcx = %rbx
                    # register %eax = 0
3. xor %eax, %eax
4. vbroadcastsd (%rax, %r8,1), %ymm1 # Make 4 copies of B element
5. add $0x8,%rax
                     # register %rax = %rax + 8
6. vmulpd (%rcx), %ymm1, %ymm1 # Parallel mul %ymm1, 4 A elements
7. add %r9,%rcx
                         # register %rcx = %rcx + %r9
8. cmp %r10,%rax
                        # compare %r10 to %rax
9. vaddpd %ymm1,%ymm0,%ymm0 # Parallel add %ymm1, %ymm0
10. jne 50 <dqemm+0x50> # jump if not %r10 != %rax
                          # register % esi = % esi + 1
11. add $0x1, %esi
12. vmovapd %ymm0, (%r11) # Store %ymm0 into 4 C elements
```



## **Right Shift and Division**

- Left shift by i places multiplies an integer by 2<sup>i</sup>
- Right shift divides by 2<sup>i</sup>?
  - Only for unsigned integers
- For signed integers
  - Arithmetic right shift: replicate the sign bit
  - e.g., -5 / 4
    - $\blacksquare$  11111011<sub>2</sub> >> 2 = 111111110<sub>2</sub> =  $\blacksquare$ 2
    - Rounds toward -∞
  - c.f.  $11111011_2 >>> 2 = 001111110_2 = +62$



## **Associativity**

- Parallel programs may interleave operations in unexpected orders
  - Assumptions of associativity may fail

|   |           | (x+y)+z  | x+(y+z)   |
|---|-----------|----------|-----------|
| X | -1.50E+38 |          | -1.50E+38 |
| у | 1.50E+38  | 0.00E+00 |           |
| Z | 1.0       | 1.0      | 1.50E+38  |
|   |           | 1.00E+00 | 0.00E+00  |

 Need to validate parallel programs under varying degrees of parallelism

## Who Cares About FP Accuracy?

- Important for scientific code
  - But for everyday consumer use?
    - "My bank balance is out by 0.0002¢!" ⊗
- The Intel Pentium FDIV bug
  - The market expects accuracy
  - See Colwell, The Pentium Chronicles

# **Concluding Remarks**

- Bits have no inherent meaning
  - Interpretation depends on the instructions applied
- Computer representations of numbers
  - Finite range and precision
  - Need to account for this in programs



# **Concluding Remarks**

- ISAs support arithmetic
  - Signed and unsigned integers
  - Floating-point approximation to reals
- Bounded range and precision
  - Operations can overflow and underflow
- MIPS ISA
  - Core instructions: 54 most frequently used
    - 100% of SPECINT, 97% of SPECFP
  - Other instructions: less frequent

## **Concluding Remarks**

- 3.23 [10] < 3.5 > 写出十进制数 63.25 的二进制表达。设采用 IEEE 754 单精度格式。
- 3.24 [10] < 3.5 > 写出十进制数 63.25 的二进制表达。设采用 IEEE 754 双精度格式。
- 3.25 [10] < 3.5 > 写出十进制数 63.25 的二进制表达。设采用 IBM 单精度格式存储(基数为 16 而不是 2, 有 7 位指数位)。