# What every quant should know about Numerical Computing

## Agenda

* IEEE Floating Point representation
* Rounding
* Overflow, Bignums
* SIMD/AVX2


## 1.0 Floating Point numbers

|Significand	|Exponent	|Scientific notation	|Fixed-point value|
|----|----|----|----|
|1.5	|4	|1.5 ⋅ 10<sup>4</sup>	|15000|
|-2.001	|2	|-2.001 ⋅ 10<sup>2</sup>	|-200.1|
|5	|-3	|5 ⋅ 10<sup>-3</sup>	|0.005|
|6.667	|-11	|6.667 ⋅ 10<sup>-11</sup>	|0.00000000006667|

### Binary Floating Point

![Double Precision IEEE](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/IEEE_754_Double_Floating_Point_Format.svg/618px-IEEE_754_Double_Floating_Point_Format.svg.png)

### The Sign bit

In [1]:
bits(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

In [2]:
bits(-1.0)

"1011111111110000000000000000000000000000000000000000000000000000"

### The exponent (powers of 2)

In [3]:
Int(0b01111111111)

1023

### Significand

The significand is stored as 52 bits and is interpreted as the binary number <b> 1.b<sub>1</sub>b<sub>2</sub>&#x2026;b<sub>52</sub> </b>. The leading 1 is implicit and is not stored.

In [4]:
bits(1.0)

"0011111111110000000000000000000000000000000000000000000000000000"

In [5]:
1 * 2^(1023-1023)

1

In [6]:
bits(1.5)

"0011111111111000000000000000000000000000000000000000000000000000"

In [7]:
(1 + 1/2) * 2 ^ (1023-1023)

1.5

In [8]:
?frexp

search: frexp



```
frexp(val)
```

Return `(x,exp)` such that `x` has a magnitude in the interval $[1/2, 1)$ or 0, and `val` is equal to $x \times 2^{exp}$.


In [9]:
frexp(1.5)

(0.75, 1)

In [10]:
bits(1.75)

"0011111111111100000000000000000000000000000000000000000000000000"

In [11]:
(1 + 1/2 + 1/2^2) * 2 ^ (1023-1023)

1.75

In [12]:
bits(15.0)

"0100000000101110000000000000000000000000000000000000000000000000"

In [13]:
Int(0b10000000010)

1026

In [14]:
(1 + 1/2 + 1/4 + 1/8)

1.875

In [15]:
1.875 * 2^(1026-1023)

15.0

In [16]:
frexp(15.0)

(0.9375, 4)

In [17]:
ldexp(0.9375, 4)

15.0
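The field-by-field decoding done by hand in the cells above can be sketched as one small function. This is an illustration only, not part of the original notebook: `decode` is a hypothetical helper name, it assumes Julia 1.x (for named tuples), and it handles normal numbers only (no zeros, subnormals, Inf or NaN).

```julia
# Split a Float64 into its IEEE 754 fields and rebuild the value from them.
# `decode` is a hypothetical helper, valid for normal numbers only.
function decode(x::Float64)
    u = reinterpret(UInt64, x)
    sgn      = Int(u >> 63)                   # 1 sign bit
    biased   = Int((u >> 52) & 0x7ff)         # 11 exponent bits, bias 1023
    fraction = u & 0x000fffffffffffff         # 52 stored significand bits
    significand = 1 + fraction / 2.0^52       # add the implicit leading 1
    value = (sgn == 0 ? 1.0 : -1.0) * significand * 2.0^(biased - 1023)
    return (sign = sgn, exponent = biased - 1023,
            significand = significand, value = value)
end

decode(15.0)   # exponent 3, significand 1.875, value 15.0
```

Unlike `frexp`, which normalizes the significand to $[1/2, 1)$, this follows the storage convention $[1, 2)$, so the exponent it reports for 15.0 is 3 rather than 4.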

Not every decimal number is exactly representable as a binary floating point number, and many distinct decimal literals round to the same float.

In [18]:
bits(0.1)

"0011111110111001100110011001100110011001100110011001100110011010"

In [19]:
Rational(0.1)

3602879701896397//36028797018963968

In [20]:
float(big(Rational(0.1)))

1.000000000000000055511151231257827021181583404541015625000000000000000000000000e-01

In [21]:
bits(0.10000000000000001)

"0011111110111001100110011001100110011001100110011001100110011010"

In [22]:
bits(0.10000000000000001) == bits(0.1)

true

In [23]:
eps(0.1)

1.3877787807814457e-17

In [24]:
reinterpret(Float64, 0b0011111110111001100110011001100110011001100110011001100110011011) - 
reinterpret(Float64, 0b0011111110111001100110011001100110011001100110011001100110011010)

1.3877787807814457e-17

In [25]:
nextfloat(0.1)

0.10000000000000002
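The gap between adjacent floats grows with magnitude, which is why exact equality tests on computed values are fragile. The following is a small sketch, not from the original notebook, using only `eps`, `nextfloat` and `isapprox` from Base; the sample magnitudes are arbitrary.

```julia
# Gap between adjacent Float64 values at three magnitudes.
for x in (1.0, 0.1, 1.0e16)
    println(x, "  eps = ", eps(x), "  nextfloat - x = ", nextfloat(x) - x)
end

# Near 1e16 the gap is 2.0, so adding 1.0 is lost entirely.
1.0e16 + 1.0 == 1.0e16        # true

# Compare computed values with a tolerance rather than with ==.
0.1 + 0.2 == 0.3              # false
isapprox(0.1 + 0.2, 0.3)      # true (default relative tolerance is sqrt(eps))
```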

In [26]:
bits(nextfloat(0.1))

"0011111110111001100110011001100110011001100110011001100110011011"

## 2.0 Special Forms

### Signed Zeros

In [27]:
bits(0.0)

"0000000000000000000000000000000000000000000000000000000000000000"

In [28]:
bits(-0.0)

"1000000000000000000000000000000000000000000000000000000000000000"

In [29]:
0.0 == -0.0

true

In [30]:
0.0 === -0.0

false
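The sign of a zero is usually invisible, but it survives into later operations. A short sketch of where it shows up (not from the original notebook, Base functions only):

```julia
1 / 0.0              # Inf
1 / -0.0             # -Inf
signbit(-0.0)        # true
isequal(0.0, -0.0)   # false: isequal, which Dict and Set use for keys, tells them apart
```

As a consequence, `0.0` and `-0.0` are distinct keys in a `Dict`, even though `0.0 == -0.0` is `true`.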

### Infinity

In [31]:
bits(Inf)

"0111111111110000000000000000000000000000000000000000000000000000"

In [32]:
bits(-Inf)

"1111111111110000000000000000000000000000000000000000000000000000"

### Not A Number (NaN)

In [33]:
bits(NaN)

"0111111111111000000000000000000000000000000000000000000000000000"

In [34]:
reinterpret(Float64, 0b0111111111110000000000000000000000000000000000000000000000000001)

NaN

In [35]:
NaN == NaN

false

In [36]:
reinterpret(Float64, 0b0111111111110000000000000000000000000000000000000000000000000001) == 
reinterpret(Float64, 0b0111111111110000000000000000000000000000000000000000000000000001)

false

In [37]:
0/0

NaN

In [38]:
0/0 == 0/0

false

In [39]:
1.5/0

Inf
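Because `NaN` compares unequal to everything, including itself, it has to be detected with `isnan` rather than `==`. The following sketch uses made-up values (the vector `pnl` is purely illustrative) to show how a single NaN propagates through a reduction and how it can be filtered out.

```julia
pnl = [1.0, NaN, 2.5, -0.5]       # made-up data with one invalid entry

sum(pnl)                          # NaN: a single NaN poisons the total
any(isnan, pnl)                   # true
sum(filter(!isnan, pnl))          # 3.0

isequal(NaN, NaN)                 # true, unlike NaN == NaN
```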

### Subnormal

In [40]:
typemin(Float64)

-Inf

In [41]:
reinterpret(Float64, 0b0000000000010000000000000000000000000000000000000000000000000000)

2.2250738585072014e-308

In [42]:
2.0^(1-1023)

2.2250738585072014e-308

In [43]:
reinterpret(Float64, 0b0000000000001000000000000000000000000000000000000000000000000000)

1.1125369292536007e-308

In [44]:
reinterpret(Float64, 0b0000000000000000000000000000000000000000000000000000000000000001)

5.0e-324

Subnormals provide gradual underflow. With subnormals enabled, the sum or difference of two normal floats never underflows to zero unless the exact result is zero, so `x != y` guarantees `x - y != 0`. This prevents spurious division by zero errors.

In [45]:
3.001e-308 - 3e-308 

1.0e-311

In [46]:
issubnormal(1.0)

false

In [47]:
issubnormal(3.001e-308 - 3e-308 )

true
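The division-by-zero guarantee can be checked directly. This is a sketch under the assumption that gradual underflow is active, which the first line requests explicitly; the values are the same ones used in the cells above.

```julia
set_zero_subnormals(false)        # request strict IEEE gradual underflow

x, y = 3.001e-308, 3.0e-308
d = x - y                         # subnormal, but not zero
ratio = x / d                     # finite, roughly 3001

(x != y) && (d != 0)              # true: distinct inputs give a nonzero difference
```

If the difference were flushed to zero instead, `x / d` would be `Inf` even though `x != y`.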

In [48]:
set_zero_subnormals(true)

true

In [49]:
3.001e-308 - 3e-308 

0.0

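Arithmetic on subnormals can be noticeably slower on some processors. The benchmark below alternates between strict IEEE arithmetic (odd trials) and flush-to-zero (even trials).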

In [50]:
using BenchmarkTools

function timestep(b::Vector{T}, a::Vector{T}, Δt::T) where T
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow(a::Vector{T}, nstep::Integer) where T
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end


a = zeros(Float32,1000);
heatflow(a, 1000) #Force compile

In [51]:
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end

  0.002992 seconds (1 allocation: 4.063 KiB)
  0.001782 seconds (1 allocation: 4.063 KiB)
  0.004246 seconds (1 allocation: 4.063 KiB)
  0.001888 seconds (1 allocation: 4.063 KiB)
  0.003259 seconds (1 allocation: 4.063 KiB)
  0.001740 seconds (1 allocation: 4.063 KiB)


## 3.0 Rounding

In [52]:
0.1 + 0.1 + 0.1

0.30000000000000004

Floating point operations are not associative.

In [64]:
(0.1 + 0.2) + 0.3

0.6000000000000001

In [65]:
0.1 + (0.2 + 0.3)

0.6

In [53]:
sum([1.0, 10e100, 1.0, -10e100])

0.0

In [54]:
1.0 + 10e100 + 1.0 +  -10e100

0.0

In [55]:
sum_kbn([1.0, 10e100, 1.0, -10e100])

2.0

In [56]:
sum_kbn([0.2,0.2,0.2])

0.6000000000000001
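The idea behind compensated summation is to carry the rounding error of each addition in a second variable and add it back at the end. Below is a minimal sketch of the Neumaier variant. The name `neumaier_sum` is hypothetical, and this is not claimed to be the exact code behind `sum_kbn`. On Julia 1.0 and later `sum_kbn` is no longer in Base; to my knowledge it lives in the KahanSummation.jl package.

```julia
# Compensated (Neumaier) summation: `c` accumulates the low-order bits
# that are lost each time `s + x` is rounded.
function neumaier_sum(xs)
    s = 0.0      # running sum
    c = 0.0      # running compensation
    for x in xs
        t = s + x
        if abs(s) >= abs(x)
            c += (s - t) + x    # low-order bits of x were lost
        else
            c += (x - t) + s    # low-order bits of s were lost
        end
        s = t
    end
    return s + c
end

xs = [1.0, 10e100, 1.0, -10e100]
foldl(+, xs)         # 0.0: plain left-to-right summation loses both 1.0 terms
neumaier_sum(xs)     # 2.0
```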

### Cancellation

Errors can blow up when subtracting two numbers that are close together. Consider the following function, which can be shown to satisfy `f(x) < 0.5 ∀ x ≠ 0`:

In [57]:
f(x) = (1 - cos(x))/x^2

f (generic function with 1 method)

In [58]:
f(1.2e-8)

0.7709882115452477

In [59]:
cos(1.2e-8)

0.9999999999999999

In [60]:
1-0.9999999999999999

1.1102230246251565e-16

In [61]:
1.1102230246251565e-16 / 1.44e-16

0.7709882115452477
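One common remedy is to rewrite the expression so that the subtraction disappears. Using the identity $1 - \cos x = 2\sin^2(x/2)$, the same function can be written without subtracting nearly equal values. The name `g` below is introduced here for illustration only.

```julia
f(x) = (1 - cos(x)) / x^2                 # original form: cancellation for small x
g(x) = 0.5 * (sin(x / 2) / (x / 2))^2     # algebraically the same function, no cancellation

f(1.2e-8)   # 0.7709882115452477, violates the bound
g(1.2e-8)   # very close to 0.5
```

Away from zero the two forms agree to rounding error, so `g` can replace `f` everywhere except at `x == 0`, where both are undefined.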

## 4.0 Fused Multiply Add

_TODO_ better examples

In [62]:
fma(3, 4, 5)

17

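`muladd` can sometimes be faster: unlike `fma`, it does not promise a single rounding, so the compiler may use whichever of the fused or separate multiply and add is cheaper.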

In [63]:
muladd(3, 4, 5)

17
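With integer arguments the fused and unfused results are identical, so the effect of the single rounding is invisible. A floating point sketch makes it visible (the values 0.1 and 10.0 are chosen here for illustration): the stored value of 0.1 is slightly above one tenth, so its exact product with 10 is $1 + 2^{-54}$, which rounds to 1.0 when the multiplication is rounded on its own.

```julia
a, b = 0.1, 10.0

a * b - 1.0          # 0.0: the product is rounded to 1.0 before the subtraction
fma(a, b, -1.0)      # 5.551115123125783e-17, i.e. 2.0^-54: rounded only once
```

The same pattern, `fma(a, b, -(a * b))`, recovers the rounding error of a product exactly, barring overflow and underflow.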

## 5.0 Overflow

In [66]:
typemax(Int64)

9223372036854775807

In [67]:
9223372036854775807 + 1

-9223372036854775808
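Native integer arithmetic wraps silently. When that is not acceptable, the usual options are a wider type, an arbitrary precision `BigInt`, or checked arithmetic that throws. A brief sketch, not from the original notebook, using `widen`, `big` and `Base.checked_add`:

```julia
x = typemax(Int64)

x + 1                      # wraps to typemin(Int64)
widen(x) + 1               # 9223372036854775808 as an Int128
big(x) + 1                 # 9223372036854775808 as a BigInt

# Checked arithmetic raises an OverflowError instead of wrapping.
overflowed = try
    Base.checked_add(x, 1)
    false
catch err
    err isa OverflowError
end
```

`BigInt` never overflows but allocates and is much slower than native integers, so it is best kept out of hot loops.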

In [69]:
bits(9223372036854775807)

"0111111111111111111111111111111111111111111111111111111111111111"

In [70]:
bits(9223372036854775807 + 1)

"1000000000000000000000000000000000000000000000000000000000000000"

In [71]:
typemax(Float64)

Inf

In [83]:
a=reinterpret(Float64, 0b0111111111101111111111111111111111111111111111111111111111111111)

1.7976931348623157e308

In [86]:
a+1

1.7976931348623157e308
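Unlike integers, floats do not wrap around: results too large to represent saturate to `Inf`, while additions that are small relative to the spacing at that magnitude are simply absorbed. A short sketch using `prevfloat(Inf)` to obtain the same largest finite value as above:

```julia
a = prevfloat(Inf)     # largest finite Float64, 1.7976931348623157e308

a + 1 == a             # true: 1 is far below the spacing eps(a) == 2.0^971
a + eps(a)             # Inf
a * 2                  # Inf
-a * 2                 # -Inf
isfinite(a * 2)        # false
```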