# Floating point numbers
## Or why `x == y` is bad...

### Recommended reading:
#### What Every Computer Scientist Should Know About Floating-Point Arithmetic
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

In [3]:
import numpy as np

In [4]:
x = 1
y = 1/3

assert x == 3*y

In [5]:
import sys

print(sys.float_info)

sys.float_info(max=1.7976931348623157e+308, max_exp=1024, max_10_exp=308, min=2.2250738585072014e-308, min_exp=-1021, min_10_exp=-307, dig=15, mant_dig=53, epsilon=2.220446049250313e-16, radix=2, rounds=1)
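The `epsilon` field in the output above is the gap between 1.0 and the next representable double. A minimal sketch of what that means in practice (the variable names are illustrative, not part of the original notebook):

```python
import sys

eps = sys.float_info.epsilon  # 2**-52 for IEEE 754 double precision

# Adding a full epsilon to 1.0 reaches the next representable double.
one_plus_eps = 1.0 + eps

# Adding half an epsilon lands exactly between two doubles and rounds back to 1.0.
one_plus_half_eps = 1.0 + eps / 2

print(one_plus_eps > 1.0)        # True
print(one_plus_half_eps == 1.0)  # True
```

Any increment smaller than this is lost when added to a number of magnitude 1, which is the effect the summation experiments further down rely on.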


In [9]:
x = np.float64(1)
y = np.float64(1/3)

assert x == 3*y

In [10]:
x = np.float32(1)
y = np.float32(1/3)

assert x == 3*y

AssertionError: 
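Whether the `float32` cell raises depends on the precision in which `3*y` is evaluated. The run shown here appears to have promoted the product of a Python `int` and a `np.float32` scalar to double precision; NumPy releases that keep the product in `float32` may round it to exactly 1, in which case the assertion passes. A small sketch that makes the precision explicit (the variable names are hypothetical):

```python
import numpy as np

y = np.float32(1/3)

# The stored single-precision value is slightly larger than 1/3.
print(float(y))

# Product evaluated in double precision: the excess in y survives.
triple_in_double = np.float64(y) * np.float64(3)

# Product evaluated in single precision: the result is rounded to float32.
triple_in_single = np.float32(3) * y

print(triple_in_double == 1.0, triple_in_single == 1.0)  # False True
```

Either way the underlying point stands: `np.float32(1/3)` is not one third, and whether that shows up in `==` depends on rounding details that are easy to overlook.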

In [11]:
assert 0.3 == 0.1 + 0.2

AssertionError: 
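The failure above comes from the fact that none of 0.1, 0.2 or 0.3 has a finite binary expansion, so each literal is stored as the nearest double. A short sketch that prints the exact stored values, using the standard library's `fractions` module:

```python
from fractions import Fraction

# Fraction(float) converts the stored binary value exactly, with no rounding.
for value in (0.1, 0.2, 0.3, 0.1 + 0.2):
    print(repr(value), "is stored as", Fraction(value))

# The two sides differ by one unit in the last place of 0.3.
gap = (0.1 + 0.2) - 0.3
print(gap == 2**-54)  # True
```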

In [22]:
L = 8

vals = np.array([5*10**-i for i in range(L)], dtype=np.float32)

print("Trial 1: Adding from large to small")
sum1 = vals[0]
for i in range(1, L):
    print(i, "%e + %e = %e"%(sum1, vals[i], sum1+vals[i]))
    sum1 += vals[i]
print("sum = %e\n"%sum1)

Trial 1: Adding from large to small
1 5.000000e+00 + 5.000000e-01 = 5.500000e+00
2 5.500000e+00 + 5.000000e-02 = 5.550000e+00
3 5.550000e+00 + 5.000000e-03 = 5.555000e+00
4 5.555000e+00 + 5.000000e-04 = 5.555501e+00
5 5.555501e+00 + 5.000000e-05 = 5.555551e+00
6 5.555551e+00 + 5.000000e-06 = 5.555555e+00
7 5.555555e+00 + 5.000000e-07 = 5.555556e+00
sum = 5.555556e+00



In [24]:
print("Trial 2: Adding from small to large")
sum2 = vals[L-1]
for i in range(L-2, -1, -1):
    print(i, "%e+%e = %e"%(sum2, vals[i], sum2+vals[i]))
    sum2 += vals[i]
print("sum = %e\n"%sum2)

Trial 2: Adding from small to large
6 5.000000e-07+5.000000e-06 = 5.500000e-06
5 5.500000e-06+5.000000e-05 = 5.550000e-05
4 5.550000e-05+5.000000e-04 = 5.555000e-04
3 5.555000e-04+5.000000e-03 = 5.555500e-03
2 5.555500e-03+5.000000e-02 = 5.555550e-02
1 5.555550e-02+5.000000e-01 = 5.555555e-01
0 5.555555e-01+5.000000e+00 = 5.555555e+00
sum = 5.555555e+00



In [15]:
import random

for trial in range(5):
    random.shuffle(vals)
    print("Shuffle trial %2d: sum = %e "%(trial, vals.sum()))

Shuffle trial  0: sum = 5.555555e+00 
Shuffle trial  1: sum = 5.555555e+00 
Shuffle trial  2: sum = 5.555556e+00 
Shuffle trial  3: sum = 5.555556e+00 
Shuffle trial  4: sum = 5.555555e+00 
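The trials above show that a plain `float32` sum depends on the order of the terms. Two common remedies are sketched below under stated assumptions: `math.fsum` from the standard library, which works in double precision, and a hand-written Kahan (compensated) summation that stays in `float32`. The helper `kahan_sum` is illustrative and not part of NumPy.

```python
import math
import numpy as np

vals = np.array([5 * 10.0**-i for i in range(8)], dtype=np.float32)
as_doubles = [float(v) for v in vals]  # the same float32 values, widened exactly

def kahan_sum(values):
    """Compensated (Kahan) summation carried out in float32."""
    total = np.float32(0.0)
    compensation = np.float32(0.0)  # running estimate of the lost low-order bits
    for v in values:
        y = v - compensation
        t = total + y
        compensation = (t - total) - y
        total = t
    return total

# math.fsum tracks exact partial sums, so its result does not depend on order.
print(math.fsum(as_doubles), math.fsum(as_doubles[::-1]))

# Kahan summation stays in float32 but recovers most of the lost accuracy.
print(kahan_sum(vals), kahan_sum(vals[::-1]))
```

When neither is available, adding the terms from smallest to largest magnitude, as in Trial 2, is usually the safer order.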


## Practical considerations using floating-point arithmetic
### Adapted from https://en.wikipedia.org/wiki/Numerical_differentiation

The example below shows how rounding error and formula (truncation) error together make $h$ difficult to choose.

An important practical consideration when the function is evaluated in floating-point arithmetic is how small to make $h$. If $h$ is too small, the subtraction $f(x+h) - f(x)$ suffers a large rounding error. In fact, all the finite-difference formulae are ill-conditioned and, because of cancellation, produce a value of zero once $h$ is small enough. If $h$ is too large, the slope of the secant line is computed more accurately, but the secant becomes a poorer estimate of the slope of the tangent.

![alt text](./AbsoluteErrorNumericalDifferentiationExample.png)

For the numerical derivative formula evaluated at $x$ and $x + h$, a choice for $h$ that is small without producing a large rounding error is $x\sqrt{\varepsilon}$ (though not when $x = 0$), where the machine epsilon $\varepsilon$ is typically of the order of $2.2 \times 10^{-16}$ for double precision.
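The trade-off can be reproduced in a few lines. The sketch below assumes a forward-difference formula and uses $f = \sin$ at $x = 1$ as a test function; both choices, and the helper name `forward_difference`, are illustrative rather than taken from the source.

```python
import math
import sys

def forward_difference(f, x, h):
    """Slope of the secant through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

x = 1.0
exact = math.cos(x)  # derivative of sin at x
eps = sys.float_info.epsilon

errors = {}
for h in (1e-1, 1e-4, x * math.sqrt(eps), 1e-12, 1e-17):
    approx = forward_difference(math.sin, x, h)
    errors[h] = abs(approx - exact)
    print(f"h = {h:.3e}   abs error = {errors[h]:.3e}")
```

For the largest $h$ the formula error dominates. For $h = 10^{-17}$ the sum `x + h` rounds back to `x`, so the estimate collapses to zero, which is the cancellation described above.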

### So... what gives?

## Floating Point Numbers
### Taken from http://www.doc.ic.ac.uk/~eedwards/compsys/float/

**Real Numbers**: `pi=3.14159265`... `e = 2.71828`...

**Scientific Notation**: has a single digit to the left of the decimal point.

**A number in Scientific Notation with no leading 0's is called a Normalized Number**: $1.0 \times 10^{-8}$

**Not in normalized form**: $0.1 \times 10^{-7}$ or $10.0 \times 10^{-9}$

**Can also represent binary numbers in scientific notation**: $1.0 \times 2^{-3}$

Computer arithmetic that supports such numbers is called Floating Point.

The form is $s \times 1.xxxx\ldots \times 2^{yy\ldots}$.

Using normalized scientific notation:
- Simplifies the exchange of data that includes floating-point numbers
- Simplifies the arithmetic algorithms to know that the numbers will always be in this form
- Increases the accuracy of the numbers that can be stored in a word, since each unnecessary leading 0 is replaced by another significant digit to the right of the decimal point

## Representation of Floating-Point numbers

$$(-1)^S \times M \times 2^E$$

|Bit No|Size   |Field Name  |
|------|-------|------------|
|31    |1 bit  |Sign (S)    |
|23-30 |8 bits |Exponent (E)|
|0-22  |23 bits|Mantissa (M)|

A single-precision floating-point number occupies 32 bits, so there is a compromise between the size of the mantissa and the size of the exponent.

These chosen sizes provide a range of approximately:

$$\pm 10^{-38} \ldots 10^{38}$$

To reduce rounding errors, developers often use 64-bit double-precision arithmetic. **However, there is no such thing as a free lunch**: this doubles the memory requirements and increases the cost of the computation, and the extra precision does not always remove the problem.

|Bit No|Size   |Field Name  |
|------|-------|------------|
|63    |1 bit  |Sign (S)    |
|52-62 |11 bits|Exponent (E)|
|0-51  |52 bits|Mantissa (M)|

providing a range of approximately

$$\pm 10^{-308} \ldots 10^{308}$$
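The field sizes and ranges in the two tables can be checked against what NumPy reports for its own types. A brief sketch using `np.finfo`; the exponent width is derived here as total bits minus the sign bit minus the mantissa bits:

```python
import numpy as np

for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    exponent_bits = info.bits - 1 - info.nmant  # total - sign - mantissa
    print(
        f"{dtype.__name__}: {info.bits} bits total, "
        f"{exponent_bits} exponent bits, {info.nmant} mantissa bits, "
        f"smallest normal = {float(info.tiny):.3e}, largest = {float(info.max):.3e}"
    )
```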

These formats are called ...

## IEEE 754 Floating-Point Standard

Since the mantissa is always 1.xxxxxxxxx in the normalised form, there is no need to store the leading 1. So, effectively:

- Single Precision: mantissa ===> 1 implicit bit + 23 stored bits
- Double Precision: mantissa ===> 1 implicit bit + 52 stored bits

Since zero (0.0) has no leading 1, to distinguish it from others, it is given the reserved bit pattern all 0s for the exponent so that hardware won't attach a leading 1 to it. Thus:

- Zero (0.0) = 0000...0000
- Other numbers = $(-1)^S \times (1 + \text{Mantissa}) \times 2^E$

If we number the mantissa bits from left to right $m_1, m_2, m_3, \ldots$

$\text{mantissa} = m_1 \times 2^{-1} + m_2 \times 2^{-2} + m_3 \times 2^{-3} + \ldots$

Negative exponents could pose a problem in comparisons.

For example (with two's complement):

|                    |Sign|Exponent|Mantissa                 |
|--------------------|----|--------|-------------------------|
|$1.0 \times 2^{-1}$ |0   |11111111|0000000 00000000 00000000|
|$1.0 \times 2^{+1}$ |0   |00000001|0000000 00000000 00000000|

With this representation, the first exponent shows a "larger" binary number, making direct comparison more difficult.

To avoid this, **Biased Notation** is used for exponents.

If the real exponent of a number is X then it is represented as (X + bias)

IEEE single-precision uses a bias of 127. Therefore, an exponent of

|||
|---|------------------------------------------------|
|-1 |is represented as -1 + 127 = 126 = $01111110_2$ |
| 0 |is represented as  0 + 127 = 127 = $01111111_2$ |
|+1 |is represented as +1 + 127 = 128 = $10000000_2$ |
|+5 |is represented as +5 + 127 = 132 = $10000100_2$ |

So the actual exponent is found by subtracting the bias from the stored exponent. Therefore, given S, E, and M fields, an IEEE floating-point number has the value:

$(-1)^S \times (1.0 + 0.M) \times 2^{E-\text{bias}}$

(Remember: it is $(1.0 + 0.M)$ because, with normalized form, only the fractional part of the mantissa needs to be stored.)
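The S, E and M fields of a real single-precision number can be pulled apart with the standard library's `struct` module. In the sketch below the helper names `decode_float32` and `rebuild` are hypothetical, and `rebuild` handles only normalized numbers (it ignores zero, subnormals, infinities and NaN).

```python
import struct

def decode_float32(value):
    """Return the (S, E, M) bit fields of value stored as IEEE 754 single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 23-30, still biased
    mantissa = bits & 0x7FFFFF       # bits 0-22, fractional part only
    return sign, exponent, mantissa

def rebuild(sign, exponent, mantissa, bias=127):
    """Evaluate (-1)^S * (1.0 + 0.M) * 2^(E - bias) for a normalized number."""
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - bias)

s, e, m = decode_float32(-0.4375)
print(s, e, f"{m:023b}")  # 1 125 11000000000000000000000
print(rebuild(s, e, m))   # -0.4375
```

The stored exponent 125 is the real exponent $-2$ plus the bias of 127, consistent with the biased-notation table.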

## Floating Point Addition

Add the following two decimal numbers in scientific notation:
$$8.70 \times 10^{-1}$$ with $$9.95 \times 10^1$$

1. Rewrite the smaller number such that its exponent matches with the exponent of the larger number.
$$8.70 \times 10^{-1} = 0.087 \times 10^1$$

2. Add the mantissas
$$9.95 + 0.087 = 10.037$$ and write the sum $$10.037 \times 10^1$$

3. Put the result in Normalized Form
$$10.037 \times 10^1 = 1.0037 \times 10^2$$ (shift the mantissa and adjust the exponent). After normalization, check for overflow or underflow of the exponent.

4. Round the result
If the mantissa does not fit in the space reserved for it, it has to be rounded off.

For Example: If only 4 digits are allowed for mantissa

$1.0037 \times 10^2$ ===> $1.004 \times 10^2$

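(Note: a hidden leading bit exists only in binary floating point, where the leading digit of a normalised number is always 1. A decimal mantissa has to store its leading digit.)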
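The decimal addition steps can be replayed with Python's `decimal` module, which lets the number of significant digits be fixed. This is a sketch assuming a hypothetical 4-digit decimal format with round-half-up; the context name `four_digits` is illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# A hypothetical 4-digit decimal format, matching the worked example.
four_digits = Context(prec=4, rounding=ROUND_HALF_UP)

a = Decimal("8.70E-1")
b = Decimal("9.95E+1")

# Context.add forms the exact sum, then rounds it to 4 significant digits.
total = four_digits.add(a, b)
print(total)  # 100.4
```

The printed value 100.4 is $1.004 \times 10^2$, the rounded result from step 4.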

## Example addition in binary

Perform $0.5 + (-0.4375)$

$$0.5 = 0.1 \times 2^0 = 1.000 \times 2^{-1} \text{(normalised)}$$ 

$$-0.4375 = -0.0111 \times 2^0 = -1.110 \times 2^{-2} \text{(normalised)}$$

Rewrite the smaller number such that its exponent matches with the exponent of the larger number.
$$-1.110 \times 2^{-2} = -0.1110 \times 2^{-1}$$

Add the mantissas:
$$1.000 \times 2^{-1} + (-0.1110 \times 2^{-1}) = 0.001 \times 2^{-1}$$

Normalise the sum, checking for overflow/underflow:
$$0.001 \times 2^{-1} = 1.000 \times 2^{-4}$$

$-126 \le -4 \le 127$ ===> No overflow or underflow

Round the sum:
- The sum fits in 4 bits so rounding is not required

Check:
- $1.000 \times 2^{-4} = 0.0625$ which is equal to $0.5 - 0.4375$

Correct!
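The hand calculation can be confirmed in Python, since all three values are exactly representable. `float.hex` prints a double in normalised binary form, with a hexadecimal mantissa and a power-of-two exponent:

```python
a = 0.5
b = -0.4375

# float.hex shows the leading 1, the fraction in hex, and the binary exponent.
print(a.hex())        # 0x1.0000000000000p-1
print(b.hex())        # -0x1.c000000000000p-2
print((a + b).hex())  # 0x1.0000000000000p-4
```

The hex digit `c` is binary `1100`, so `-0x1.cp-2` is $-1.110 \times 2^{-2}$, the normalised form used above.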

## Floating Point Multiplication

Multiply the following two numbers in scientific notation by hand:

$$(1.110 \times 10^{10}) \times (9.200 \times 10^{-5})$$

Add the exponents to find
- New Exponent $= 10 + (-5) = 5$

If we add biased exponents, bias will be added twice. Therefore we need to subtract it once to compensate:
$$(10 + 127) + (-5 + 127) = 259$$

$259 - 127 = 132$ which is $(5 + 127) =$ biased new exponent

Multiply the mantissas
$$1.110 \times 9.200 = 10.212000$$

Can only keep three digits to the right of the decimal point, so the result is

$$10.212 \times 10^5$$

Normalize the result
$$1.0212 \times 10^6$$

Round it
$$1.021 \times 10^6$$
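The same multiplication can be checked with the `decimal` module, again assuming a hypothetical 4-digit format with round-half-up. The biased-exponent bookkeeping is included as plain integer arithmetic; `BIAS` and the other names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

BIAS = 127

# Adding two biased exponents counts the bias twice, so subtract it once.
biased_exponent = (10 + BIAS) + (-5 + BIAS) - BIAS
print(biased_exponent, biased_exponent - BIAS)  # 132 5

# Multiply exactly, then round to 4 significant digits.
four_digits = Context(prec=4, rounding=ROUND_HALF_UP)
a = Decimal("1.110E+10")
b = Decimal("9.200E-5")
product = four_digits.multiply(a, b)
print(product)  # 1.021E+6
```

The exponent ends up as 6 rather than 5 because normalising $10.212$ to $1.0212$ increases it by one.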

## Example multiplication in binary:

$$(1.000 \times 2^{-1}) \times (-1.110 \times 2^{-2})$$

1. Add the biased exponents
$(-1 + 127) + (-2 + 127) - 127 = 124$ ===> $(-3 + 127)$

2. Multiply the mantissas

$\hspace{46pt}1.000$<br>
$\hspace{40pt}\times 1.110$<br>
----------------------------<br>
$\hspace{62pt}0000$<br>
$\hspace{57pt}1000$<br>
$\hspace{53pt}1000$<br>
$\hspace{42pt}+1000$<br>
----------------------------<br>
$\hspace{47pt}1110000$ ===> $1.110000$

 - The product is $1.110000 \times 2^{-3}$
 - Need to keep it to 4 bits $1.110 \times 2^{-3}$

3. Normalize (already normalized)
 - At this step check for overflow/underflow by making sure that

$$-126 \le \text{Exponent} \le 127$$

$$1 \le \text{Biased Exponent} \le 254$$

4. Round the result (no change)
5. Adjust the sign. Since the original signs are different, the result will be negative

$$-1.110 \times 2^{-3}$$
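This result can also be verified directly, because both operands and the product are exactly representable. The sketch uses `math.frexp`, which returns a mantissa in $[0.5, 1)$, and rescales it to the $1.xxx$ convention used in this section:

```python
import math

a = 0.5       # 1.000 x 2^-1
b = -0.4375   # -1.110 x 2^-2
product = a * b

# math.frexp returns (m, e) with 0.5 <= |m| < 1; rescale to the 1.xxx form.
m, e = math.frexp(product)
mantissa, exponent = m * 2, e - 1
print(mantissa, exponent)  # -1.75 -3
print(exponent + 127)      # 124, the biased exponent from step 1
```

A mantissa of 1.75 is binary $1.110$, so the product is $-1.110 \times 2^{-3}$.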

## Use numpy.isclose rather than equality for floating point comparisons

In [26]:
#print(np.isclose.__doc__)

In [27]:
# Repeating the experiment above

x = np.float32(1)
y = np.float32(1/3)

assert x == 3*y

AssertionError: 

In [28]:
assert np.isclose(x, 3*y)
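`np.isclose` is not a drop-in replacement for equality, because its default tolerances matter. It tests `|a - b| <= atol + rtol * |b|`, and the default absolute tolerance can make very small numbers look equal. A short sketch contrasting it with the standard library's `math.isclose`:

```python
import math
import numpy as np

# np.isclose defaults: rtol=1e-05, atol=1e-08.
print(np.isclose(1e-10, 2e-10))            # True: the absolute tolerance dominates
print(np.isclose(1e-10, 2e-10, atol=0.0))  # False: purely relative comparison

# math.isclose defaults to rel_tol=1e-09 and abs_tol=0.0, so it is stricter near zero.
print(math.isclose(1e-10, 2e-10))          # False
print(math.isclose(0.1 + 0.2, 0.3))        # True
```

When the quantities being compared are much smaller than 1, it is usually worth setting `atol` explicitly rather than relying on the default.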

## Real world example of floating point bugs

### Real world example: Patriot missile failure due to magnification of roundoff error
#### https://en.wikipedia.org/wiki/Round-off_error#Real_world_example:_Patriot_missile_failure_due_to_magnification_of_roundoff_error

On 25 February 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers. A report of the General Accounting Office, GAO/IMTEC-92-26, entitled *Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia*, reported on the cause of the failure. The cause was an inaccurate calculation of the time since boot due to computer arithmetic errors. Specifically, the time in tenths of a second, as measured by the system's internal clock, was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up for around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds: multiplying the chopping error of about $0.000000095$ by the number of tenths of a second in $100$ hours gives $0.000000095\times 100\times 60\times 60\times 10=0.34$. A Scud travels at about 1676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
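The arithmetic in this account can be reproduced exactly with rational numbers. The sketch below is an illustration, not a model of the actual Patriot software. It truncates 1/10 to 23 binary digits after the radix point, an assumption made only because that bit count reproduces the quoted chopping error of about 0.000000095. All names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23  # assumption: chosen because it reproduces the quoted 0.000000095

tenth = Fraction(1, 10)

# Chop (truncate) 1/10 to FRACTION_BITS binary digits after the radix point.
chopped = Fraction(2**FRACTION_BITS // 10, 2**FRACTION_BITS)
chop_error = tenth - chopped

ticks = 100 * 60 * 60 * 10          # tenths of a second in 100 hours
drift_seconds = float(chop_error * ticks)
miss_metres = drift_seconds * 1676  # Scud speed quoted above

print(f"chopping error = {float(chop_error):.3e}")
print(f"clock drift    = {drift_seconds:.3f} s")
print(f"distance       = {miss_metres:.0f} m")
```

The per-tick error is tiny, but it is multiplied by 3.6 million ticks, which is how an error in the eighth decimal place becomes a third of a second.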