# Finite Precision Arithmetic

The following discussion is based on:

### Lessons in Scientific Computing: <br>Numerical Mathematics, Computer Technology, and Scientific Discovery
##### By Norbert Schorghofer


Ch. 3 - Roundoff & Number Representation

This book is available in electronic form from UW Libraries. Please get access to and read Ch. 3 (and whatever other sections you may find to be of interest).

For more information on the details of IEEE754 Standard for Floating-Point Arithmetic, see the classic "What Every Computer Scientist Should Know About Floating-Point Arithmetic," by David Goldberg, ACM Computing Surveys. A version of this paper is available in the readings folder of the class Canvas files.

## A bit of context

What is this class all about? 
<br>__Practical__ numerical methods for obtaining good approximate solutions to engineering problems (for which analytic solutions may not be readily available)

What are __numerical methods__ about?
<br>Using digital computers to obtain those good approximate solutions...

Computers are now both:

- Extremely fast/capable
- Extremely dumb/literal

The computer is not smarter than you, so you have to tell it __very explicitly__ what you want it to do. (_Exception?_)

The specification of what to do is an __algorithm__:
<br>Unambiguous finite rule that, after a finite number of steps, provides a solution to a class of problems.

(We will look at both algorithms and __their implementations as codes/programs__.)

Algorithms have inputs (typically in $ℤ^n$ or $ℝ^n$) that are converted into outputs by a sequence of __elementary operations__ including `+,-,*,/`.

__Well-conditioned problem__:
<br>small change in input $\implies$ small change in output

__Numerically stable algorithm__:
<br>actual computational error comparable to __unavoidable error__

Why would there be unavoidable errors?
- Exact representation of a real number can require infinitely many digits.
- Infinite data storage requires infinite computing resources.
- Memory chip prices have come down, but an infinite amount is neither physically nor economically possible.

So we typically make a compromise and use finite precision numeric representations. 

### Section 3.1 -  Number Representation

What we think of as real numbers in an algorithm, we approximate for computing purposes as __floating-point__ numbers of the form:
<br>$(-1)^s \; (d_0.d_1 d_2 \ldots d_{p-1}) \; \beta^e$
<br>where $s\in \{0,1\}$ indicates the sign
<br>$p$ is the number of digits or __precision__
<br>each digit $d_i \in \{0, 1, \ldots, \beta-1\}$
<br>$\beta$ is the base and $e$ is the exponent.

While we, as humans, tend toward base $\beta = 10$, computers almost universally use a __binary representation with $\beta = 2$__.

The floating-point representation corresponds to a unique real number:
$\bar{x} = (-1)^s \big( d_0 + \frac{d_1} {\beta} + \frac{d_2}{\beta^2} + \ldots + \frac{d_{p-1}}{\beta^{p-1}} \big) \beta^e$

Some real numbers have an exact (but not unique) floating-point representation; e.g. $x=0.5$ has the following representations:
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^{-1} = (-1)^0 \; (1+\frac{0}{2}) \; 2^{-1} = \frac{1}{2}$
<br>and
<br>$\bar{x} = (-1)^0 \; (0.1) \; 2^0 = (-1)^0 \; (0+\frac{1}{2}) \; 2^0 = \frac{1}{2}$

For uniqueness, choose the __normalized__ (first) version with $d_0 \neq 0$.

A floating point number system is defined by base $\beta$, precision (number of digits) $p$, and range of exponents $[e_{min}, e_{max}]$.

Let's take a look at an example: $\beta=2, p=2, e \in [-2,3]$.

Positive normalized numbers:
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^{-2} = \frac{1}{4}$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^{-2} = \frac{3}{8}$
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^{-1} = \frac{1}{2}$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^{-1} = \frac{3}{4}$
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^0 = 1$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^0 = \frac{3}{2}$
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^1 = 2$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^1 = 3$
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^2 = 4$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^2 = 6$
<br>$\bar{x} = (-1)^0 \; (1.0) \; 2^3 = 8$
<br>$\bar{x} = (-1)^0 \; (1.1) \; 2^3 = 12$

$\bar{x} \in {\bf{X}} = ± \{1/4, 3/8, 1/2, 3/4, 1, 3/2, 2, 3, 4, 6, 8, 12 \}$

- Not all reals can be represented

- No representation for zero (although one can be designated)

- Uneven spacing: larger gaps between larger numbers
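The toy system above is small enough to enumerate by machine. The following is a minimal sketch using exact rationals; the function name `positive_normalized` is chosen here for illustration and does not come from the book.

```python
from fractions import Fraction

def positive_normalized(beta, p, e_min, e_max):
    """Return the sorted positive normalized values of a toy floating-point system."""
    values = set()
    for e in range(e_min, e_max + 1):
        # Normalized significands have d_0 != 0: the integers
        # beta**(p-1) .. beta**p - 1, scaled by beta**(1 - p).
        for m in range(beta ** (p - 1), beta ** p):
            values.add(Fraction(m, beta ** (p - 1)) * Fraction(beta) ** e)
    return sorted(values)

X = positive_normalized(beta=2, p=2, e_min=-2, e_max=3)
print([str(v) for v in X])

# Gaps between neighbors never shrink as the values grow.
gaps = [b - a for a, b in zip(X, X[1:])]
print([str(g) for g in gaps])
```

Printing `gaps` makes the uneven spacing explicit: the gap doubles each time the exponent increases by one.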

Typical usage: round real number $x$ to nearest element of $\bf{X}$.
<br>This incurs an __absolute error__: $\lvert x-\bar{x} \rvert = E(x)$
<br>Can also consider __relative error__: $$\frac{E(x)}{\lvert x \rvert} = R(x)$$

Generally, floating point systems aim to even out relative error - larger gaps occur between larger elements of $\bf{X}$.
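To make the two error measures concrete, here is a small sketch that rounds a few rationals to the toy system and reports $E(x)$ and $R(x)$ exactly. The helper names `round_to_toy` and `rounding_errors` are hypothetical, and ties are broken toward the smaller element, which is an arbitrary choice made for this illustration.

```python
from fractions import Fraction

# Positive elements of the toy system (beta=2, p=2, e in [-2, 3]), in increasing order.
X = [Fraction(s) for s in
     ("1/4", "3/8", "1/2", "3/4", "1", "3/2", "2", "3", "4", "6", "8", "12")]

def round_to_toy(x):
    """Nearest element of X; min() keeps the first (smaller) element on a tie."""
    return min(X, key=lambda v: abs(v - x))

def rounding_errors(x):
    """Return (x_bar, absolute error E, relative error R) for a positive rational x."""
    x_bar = round_to_toy(x)
    E = abs(x - x_bar)
    return x_bar, E, E / abs(x)

for x in (Fraction(13, 10), Fraction(26, 5), Fraction(52, 5)):
    x_bar, E, R = rounding_errors(x)
    print(f"x = {x}: x_bar = {x_bar}, E = {E}, R = {R}")
```

For these three inputs the absolute error grows with the size of $x$ while the relative error stays the same, which is the behavior the uneven spacing is designed to produce.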

Floating point systems have limitations in terms of arithmetic operations:

$ 12 * 2 = 24 \notin {\bf X} \implies$ overflow ($±$ `Inf`)
<br>$\frac{1}{4} / 2 = \frac{1}{8} \notin {\bf X} \implies$ underflow (result flushed to ±`0`)
<br>IEEE754 __de-normalizes__ for gradual underflow: $(-1)^0 \; (0.1) \; 2^{-2} = \frac{1}{8}$
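The same behavior can be observed in Python's built-in `float` (IEEE 754 double precision on essentially all current platforms, which is assumed here). This sketch uses only `sys.float_info` from the standard library.

```python
import math
import sys

largest = sys.float_info.max          # largest finite double
smallest_normal = sys.float_info.min  # smallest positive normalized double

overflowed = largest * 2.0            # exceeds the exponent range: becomes inf
subnormal = smallest_normal / 2.0     # below the normalized range: kept as a de-normalized value
tiniest = 5e-324                      # smallest positive de-normalized double
flushed = tiniest / 2.0               # nothing smaller is representable: rounds to 0.0

print(overflowed, subnormal, flushed)
print(math.isinf(overflowed), 0.0 < subnormal < smallest_normal, flushed == 0.0)
```

Gradual underflow means the result of `smallest_normal / 2.0` is still nonzero, at the cost of one bit of precision.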

### Section 3.2 - IEEE Standardization

Most common floating point systems:

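- __Single precision__: 32 bits total: 1 sign bit, 8 exponent bits, and 23 stored significand bits (precision $p = 24$ binary digits, counting the implicit leading bit)

- __Double precision__: 64 bits total: 1 sign bit, 11 exponent bits, and 52 stored significand bits (precision $p = 53$ binary digits, counting the implicit leading bit)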

Note that bit counts are multiples of 8 because hardware organizes the bits into __bytes__ (1 byte = 8 bits, single = 4 bytes, double = 8 bytes)

__Single precision: 4 bytes, about 6-9 significant decimal digits
<br>Double precision: 8 bytes, about 15-17 significant decimal digits__

See Table 3.1 in "Lessons in Scientific Computing" for details of representation range.
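To see the bit layout directly, the sketch below splits a number into its sign, biased exponent, and stored fraction fields using the standard `struct` module. It assumes the IEEE 754 layouts of 1 + 8 + 23 bits (single) and 1 + 11 + 52 bits (double); the function names are chosen here for illustration.

```python
import struct

def double_fields(x):
    """Return (sign, biased exponent, fraction) of x stored as a 64-bit double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11 exponent bits, bias 1023
    fraction = bits & ((1 << 52) - 1)        # 52 stored significand bits
    return sign, exponent, fraction

def single_fields(x):
    """Return (sign, biased exponent, fraction) of x stored as a 32-bit single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF           # 8 exponent bits, bias 127
    fraction = bits & ((1 << 23) - 1)        # 23 stored significand bits
    return sign, exponent, fraction

print(double_fields(1.0))    # sign 0, exponent equal to the bias, fraction 0
print(double_fields(-0.5))
print(single_fields(1.5))
```

The leading digit $d_0 = 1$ of a normalized binary number is not stored, which is why the fraction field of `1.0` is zero.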

Consider details of rounding error for real number $x$ that rounds to $\bar{x}$ so they agree to $p$ digits: 

<br>$\bar{x} = x \; (1 + \epsilon)$

$R(x) =  \lvert \epsilon \rvert \leq (1/2) \beta^{1-p} = u$ __unit roundoff__ <br>($u$ is the upper bound on relative rounding error)

Spacing between adjacent normalized floats near $x$ is approximately $2 u \lvert x \rvert$
<br>For $x=1$, the gap is $2u$, so the next f.p. number available is $1 + 2u$.

An alternative characterization of precision involves machine epsilon, $\epsilon_{M}$, which is usually defined as the smallest number you can add to 1 without producing the result 1: $$\overline{1+x} = 1 \; \forall x \; s.t. \; |x|<\epsilon_M$$

Either $u$ or $\epsilon_M$ (which some authors define to differ by a factor of 2) provides a measure of the resolution of the number system.
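The resolution can be measured experimentally. This sketch halves a candidate value until adding half of it to 1 no longer changes the result. In double precision the loop ends at the gap between 1 and the next float, which is the quantity Python reports as `sys.float_info.epsilon`; in the notation above that gap is $2u$.

```python
import sys

def machine_epsilon():
    """Gap between 1.0 and the next larger double, found by repeated halving."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
u = eps / 2.0   # unit roundoff under the convention u = (1/2) * beta**(1 - p)

print(eps, eps == sys.float_info.epsilon)
print(1.0 + eps > 1.0, 1.0 + u == 1.0)
```

Which of the two values is called "machine epsilon" depends on the author, so it is worth checking the convention before comparing numbers from different sources.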

### Section 3.3 - Roundoff Sensitivity

Consider how roundoff error in inputs propagates when performing floating point arithmetic operations.

Let the real inputs be $x_1$ and $x_2$ with rounded versions $$\bar{x}_1 = x_1 (1+\epsilon_1)$$ $$\bar{x}_2 = x_2 (1+\epsilon_2)$$

$\epsilon$ indicates signed relative error bounded by $u < 10^{-6} \ll 1$ (for single and double precision)

- Multiplication: 
$$\bar{x}_1 * \bar{x}_2 = x_1 (1+\epsilon_1) * x_2 (1+\epsilon_2)$$
$$\bar{x}_1 * \bar{x}_2 = x_1 * x_2 \; (1 + \epsilon_1 +\epsilon_2 + \epsilon_1 * \epsilon_2)$$

Product of relative errors is VERY small, so ignore...
$$\bar{x}_1 * \bar{x}_2 = x_1 * x_2 \; (1 + \epsilon_1 +\epsilon_2 + \ldots)$$
<br> so __when multiplying, relative errors add__.

- Division: 
$$\bar{x}_1 / \bar{x}_2 = x_1 (1+\epsilon_1) / (x_2 (1+\epsilon_2))$$
$$\bar{x}_1 / \bar{x}_2 = x_1 / x_2 (1 + \epsilon_1) * \frac{1}{1+\epsilon_2}$$ 
$$\approx x_1 / x_2 * (1 + \epsilon_1) * (1 - \epsilon_2 + \epsilon_2^2 + \ldots)$$
$$\approx x_1 / x_2 * (1 + \epsilon_1 - \epsilon_2 + \ldots)$$

Again ignore product of relative errors, so output error is bounded by sum of input relative errors.

__Multiply or divide: relative error bounded by sum of input relative errors.__

- Addition: $$\bar{x}_1 + \bar{x}_2 = x_1 (1+\epsilon_1) + x_2 (1+\epsilon_2)$$
$$= x_1 + x_2 + (x_1 * \epsilon_1 + x_2 * \epsilon_2) = (x_1 + x_2) * (1 + \epsilon_+)$$
$$\implies \lvert \epsilon_+ \rvert = \frac{ \lvert x_1 * \epsilon_1 + x_2 * \epsilon_2 \rvert}{\lvert x_1 + x_2 \rvert} \lessapprox u$$


__Addition: Relative error near unit roundoff__
<br>__if $x_1$ and $x_2$ have the same sign!__
<br>__When $x_1 \approx -x_2$ the denominator becomes arbitrarily small and the error is not well bounded.__

__The main concern is Subtraction leading to__ ___Catastrophic cancellation___:<br>__Subtracting nearly equal numbers wipes out the significant digits and leaves noise...__
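Both conclusions can be checked numerically. The sketch below uses exact rationals to measure the rounding errors of double-precision inputs (converting a `float` to a `Fraction` is exact), then shows a subtraction of nearly equal numbers. The inputs $1/3$, $1/7$, and $10^{-15}$ are arbitrary choices for illustration.

```python
from fractions import Fraction

def rel_err(approx, exact):
    """Signed relative error, computed exactly with rationals."""
    return (approx - exact) / exact

# Exact inputs and their double-precision roundings.
x1, x2 = Fraction(1, 3), Fraction(1, 7)
r1, r2 = Fraction(float(x1)), Fraction(float(x2))
e1, e2 = rel_err(r1, x1), rel_err(r2, x2)

# Product of the rounded inputs: relative errors add, up to the tiny e1*e2 term.
e_prod = rel_err(r1 * r2, x1 * x2)
print(float(e1), float(e2), float(e_prod))

# Cancellation: 1 + delta is rounded to the grid of doubles near 1, and
# subtracting 1 leaves that rounding error as a large fraction of the result.
delta = 1e-15
computed = (1.0 + delta) - 1.0
e_cancel = rel_err(Fraction(computed), Fraction(delta))
print(computed, float(e_cancel))
```

The product's relative error stays near $10^{-16}$, while the subtraction returns a value that is off by more than 10 percent.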


### Examples of how to deal with limitations of fixed precision arithmetic

- Compute $x^2 - y^2$ in toy number system with $x=4$ and $y=2$
<br>$\bar{x} \in {\bf{X}} = ± \{1/4, 3/8, 1/2, 3/4, 1, 3/2, 2, 3, 4, 6, 8, 12 \}$
<br>Blindly computing $x*x = 4*4$ causes overflow
<br>Instead, consider using your math skills:
<br>Rewrite expression as $(x-y) * (x+y)$ so the computation becomes $(4-2) * (4+2) = 2*6 = 12$ which works!
    <br>Note: No guarantee of exactness; e.g. $x=4$ and $y=3$
$$(4-3) * (4+3) \approx 1*8 = 8$$ (not exact, but better than overflow)
    
    More general approach to overflow: 
    <br>Normalize with units that make your numerical values close to 1
    
- Evaluate the roots of $a x^2 + b x + c = 0 = a(x-x_+)(x-x_-)$ using quadratic formula: 
<br>$$x_\pm = \frac{-b \pm \sqrt{b^2- 4 a c}}{2 a}$$
What about this could be problematic?
<br><br><br><br>
<br>
<br>
When $4 a c \ll b^2$ the square root is nearly equal to $\lvert b \rvert$, so if $b>0$ the root $x_+$ is subject to catastrophic cancellation. However, the root $x_-$ should be OK, and we know that $a* x_+ * x_- = c$.
<br>$\implies$ Effective plan for $b>0$ : 

    1) Compute $$q = b + \sqrt{b^2 -4 a c}$$

    2) Compute $$x_1 = -\frac{q}{2a}$$ $$x_2 = \frac{c}{a x_1} = -2 \frac{c}{q}$$
    
    Schorghofer notes a possible improvement that handles either sign of $b$:

    Compute $$q = (-1/2) (b + \mathrm{sgn}(b) \sqrt{b^2 -4 a c})$$ 
then roots are $$x_1 = \frac{q}{a}$$ $$x_2 = \frac{c}{a x_1} = \frac{c}{q}$$

    Note that Python has no built-in "sign" function. You can define your own, use `math.copysign` from the standard library or `numpy.sign`, or import `sign` from Python's symbolic package `sympy`.
    
- Sometimes you can rearrange to avoid cancellation.
<br> Consider a sum where cancellation would be of concern:
$$ S = 1 -\frac{1}{2}+\frac{1}{3}- \frac{1}{4}+\ldots$$
<br>Each pair of terms can be collected over common denominator to give:
$$ S = \frac{1}{1 \cdot 2}+\frac{1}{3 \cdot 4} +\ldots$$
<br>Subtraction is eliminated and cancellation trouble is avoided.
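The quadratic-formula plan from the second example above can be sketched as follows. This is an illustration under stated assumptions, not a general-purpose solver: it assumes real roots, $a \neq 0$, and $q \neq 0$, and it uses `math.copysign` in place of a sign function (so $b = 0$ is treated as positive). The coefficients are chosen so that $4ac \ll b^2$.

```python
import math

def quadratic_roots_naive(a, b, c):
    """Textbook formula; returns (x_plus, x_minus)."""
    s = math.sqrt(b * b - 4.0 * a * c)
    return (-b + s) / (2.0 * a), (-b - s) / (2.0 * a)

def quadratic_roots_stable(a, b, c):
    """Roots via q = -(b + sgn(b) * sqrt(b^2 - 4ac)) / 2; returns (q/a, c/q)."""
    s = math.sqrt(b * b - 4.0 * a * c)
    q = -0.5 * (b + math.copysign(s, b))   # b and sgn(b)*s share a sign: no cancellation
    return q / a, c / q

a, b, c = 1.0, 1e8, 1.0   # exact roots are close to -1e8 and -1e-8

naive_small = quadratic_roots_naive(a, b, c)[0]
stable_small = quadratic_roots_stable(a, b, c)[1]

print("naive :", naive_small)
print("stable:", stable_small)
```

The naive small root loses most of its significant digits to cancellation, while the value obtained from $c/q$ agrees with $-10^{-8}$ to nearly full precision.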

### ASIDE: Interval arithmetic

- An interesting alternative approach to dealing with roundoff errors

- Instead of trying to represent a particular real number (which we cannot do with finite precision), keep track of the lower and upper bounds of an interval that is guaranteed to contain the exact number

- Output of an operation needs to guarantee inclusion of the exact answer

- Useful property for root-finding/isolation: finite convergence

- An example due to Rump [E. Loh, G. W. Walster, “Rump’s example revisited”, Reliable Computing, vol. 8 (2002), n. 2, pp. 245–248.]

$$f(x,y) = (\frac{1335}{4} - x^2) y^6 +x^2 (11 x^2 y^2 - 121 y^4 -2) + \frac{11}{2} y^8 + \frac{x}{2 y}$$

This is a function with rational coefficients, so with rational (or integer) arguments, it can be evaluated exactly (e.g. with Mathematica): 
$$f(77617, 33096) = -\frac{54767}{66192} \approx −0.827396\ldots$$
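If Mathematica is not at hand, the exact value can also be reproduced with Python's standard `fractions` module. This is a minimal sketch; the name `rump_exact` is chosen here for illustration.

```python
from fractions import Fraction

def rump_exact(x, y):
    """Evaluate Rump's function in exact rational arithmetic."""
    x, y = Fraction(x), Fraction(y)
    return ((Fraction(1335, 4) - x**2) * y**6
            + x**2 * (11 * x**2 * y**2 - 121 * y**4 - 2)
            + Fraction(11, 2) * y**8
            + x / (2 * y))

value = rump_exact(77617, 33096)
print(value, float(value))
```

Because every intermediate result is stored as an exact ratio of integers, no rounding occurs until the final conversion to `float` for display.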

Now let's try floating point evaluation:

In [4]:
def f(x,y):
    return ((333.75 - x**2)* y**6 + x**2 * (11* x**2 * y**2 - 121 * y**4 - 2) + 5.5 * y**8 + x/(2*y))

In [51]:
 f(77617.0, 33096.0)

1.1726039400531787

We can't even evaluate the function, so would there be any chance of locating its roots?

The answer turns out to be YES (but we will stick to the simpler case with 1 variable)

Basic plan: 

Define __interval extensions__ of arithmetic functions that reliably contain the correct result

Evaluate the interval extension of the function over some input interval to obtain an output interval

If the output interval includes 0, then the input interval can contain a root

Subdivide and evaluate on subinterval until the output interval excludes 0 $\implies$ NO ROOTS in the subinterval

Eventually (in finite steps) obtain a set of narrow intervals that are candidate root locations

For more details see the following paper available on the Canvas page.

"Interval Arithmetic: Python Implementation and Applications," Proceedings of the 7th Python in Science Conference (SciPy 2008)

by Stefano Taschini, Altis Investment Management AG
<br>(note the affiliation...)

In [25]:
import numpy as np

def make_interval(x0,x1):
    x_min, x_max = min(x0,x1), max(x0,x1)
    return np.array([x_min, x_max])

def i_add(x,y):
    """
    perform interval addition
    
    arguments:
        x, y: intervals represented as numpy arrays of length 2
    """
    
    return np.array([x[0] + y[0], x[1] + y[1]])

def i_mult(x,y):
    """
    perform interval multiplication
    
    arguments:
        x, y: intervals represented as numpy arrays of length 2
    """
    products = np.array([x[0]*y[0],x[0]*y[1],x[1]*y[0],x[1]*y[1]])
    out_min = np.min(products)
    out_max = np.max(products)
    
    return make_interval(out_min, out_max)

In [23]:
x = make_interval(0,1)
y = make_interval(2,3)
z = make_interval(-2,1)
i_add(x,y)

array([2, 4])

In [27]:
i_mult(y,z)

array([-6,  3])

In [28]:
1-x

array([1, 0])
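Note that NumPy's elementwise `1-x` returns the endpoints in reversed order, `[1, 0]`. The `i_mult` function above tolerates this because it takes the minimum and maximum over all endpoint products, but a dedicated interval subtraction keeps the bounds ordered. A minimal sketch follows, using plain tuples so that it runs on its own (the indexing also works on length-2 arrays); `i_sub` is a name introduced here, not one from the paper.

```python
def i_sub(x, y):
    """Interval subtraction: [x_min - y_max, x_max - y_min]."""
    return (x[0] - y[1], x[1] - y[0])

one = (1, 1)     # the number 1 as a degenerate interval
x = (0, 1)
print(i_sub(one, x))
```

With this definition, $1 - [0, 1]$ is returned as the properly ordered interval $[0, 1]$.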

In [29]:
def i_f(x):
    return i_mult(x, 1-x)

In [33]:
x0 = make_interval(-3,2)
i_f(x0)

array([-12,   8])

This can contain roots of $f$, so subdivide.

In [41]:
def i_left(x):
    mid = (x[0]+x[1])/2
    return make_interval(x[0],mid)

def i_right(x):
    mid = (x[0]+x[1])/2
    return make_interval(mid, x[1])

In [42]:
x00 = i_left(x0)
x01 = i_right(x0)
i_f(x00), i_f(x01)

(array([-12.  ,  -0.75]), array([-2.,  3.]))

Note that the left interval contains only negative values, so no roots can exist there. <br>Continue subdividing right interval `x01`.

In [44]:
x010 = i_left(x01)
x011 = i_right(x01)
x010, i_f(x010), x011, i_f(x011)

(array([-0.5 ,  0.75]),
 array([-0.75 ,  1.125]),
 array([0.75, 2.  ]),
 array([-2. ,  0.5]))

Both intervals can have roots, but let's choose to focus on (and continue subdividing) the right interval `x011`

In [46]:
x0110 = i_left(x011)
x0111 = i_right(x011)
x0110, i_f(x0110), x0111, i_f(x0111)

(array([0.75 , 1.375]),
 array([-0.515625,  0.34375 ]),
 array([1.375, 2.   ]),
 array([-2.      , -0.515625]))

Here the right subinterval is entirely negative, so continue subdividing the left subinterval.

In [47]:
x01100 = i_left(x0110)
x01101 = i_right(x0110)
x01100, i_f(x01100), x01101, i_f(x01101)

(array([0.75  , 1.0625]),
 array([-0.06640625,  0.265625  ]),
 array([1.0625, 1.375 ]),
 array([-0.515625  , -0.06640625]))

This time the right subinterval is root-free, so continue subdividing the left subinterval.

In [48]:
x011000 = i_left(x01100)
x011001 = i_right(x01100)
x011000, i_f(x011000), x011001, i_f(x011001)

(array([0.75   , 0.90625]),
 array([0.0703125, 0.2265625]),
 array([0.90625, 1.0625 ]),
 array([-0.06640625,  0.09960938]))

Discard the left subinterval and continue on the right...

But it is clear already that this is an example of why you need to have coding skills. Doing this by hand is going to produce mistakes that you want to avoid.

When it is coded up and working reliably, what happens?

It is guaranteed that in a finite number of steps the interval stabilizes; i.e. the output interval that can contain a root is the same as the input interval so you know you can stop. (Even though you do not really know a root exists unless you have other information...)
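The manual subdivision above can be automated with a short recursive function. The sketch below is a simplification rather than the method from the paper: it stops when an interval is narrower than a tolerance `tol` (a hypothetical parameter) instead of detecting that the interval has stabilized, it uses plain tuples instead of NumPy arrays, and it ignores rounding direction.

```python
def i_mult(x, y):
    """Interval multiplication: min and max over all endpoint products."""
    products = (x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1])
    return (min(products), max(products))

def i_sub(x, y):
    """Interval subtraction with ordered bounds."""
    return (x[0] - y[1], x[1] - y[0])

def i_f(x):
    """Interval extension of f(x) = x * (1 - x)."""
    return i_mult(x, i_sub((1.0, 1.0), x))

def isolate_roots(f, x, tol=1e-6):
    """Return the subintervals of x, of width <= tol, that may contain a root."""
    lo, hi = f(x)
    if lo > 0.0 or hi < 0.0:
        return []                      # output interval excludes 0: no root here
    if x[1] - x[0] <= tol:
        return [x]                     # narrow candidate interval
    mid = 0.5 * (x[0] + x[1])
    return isolate_roots(f, (x[0], mid), tol) + isolate_roots(f, (mid, x[1]), tol)

candidates = isolate_roots(i_f, (-3.0, 2.0))
for c in candidates:
    print(c)
```

Starting from the same interval $[-3, 2]$ used above, the surviving candidates cluster around the roots $x = 0$ and $x = 1$.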

Here is an example from the paper to illustrate finite interval convergence.

![title](intervalRoots.png)

What have we neglected to make sure that our interval arithmetic is "working reliably"?
<br><br><br><br><br><br><br><br>

Need to control roundoff direction to ensure containment:
<br>Round down for lower bound
<br>Round up for upper bound
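Standard Python does not expose the hardware rounding mode, so one simple (and slightly pessimistic) substitute is to widen each computed bound by one floating-point step using `math.nextafter`, which requires Python 3.9 or later. The sketch below applies this to interval addition; `i_add_outward` is a hypothetical name, and true directed rounding would give tighter bounds.

```python
import math
from fractions import Fraction

def i_add_outward(x, y):
    """Interval addition with each bound pushed one float outward."""
    lo = math.nextafter(x[0] + y[0], -math.inf)   # move lower bound down
    hi = math.nextafter(x[1] + y[1], math.inf)    # move upper bound up
    return (lo, hi)

x = (0.1, 0.1)
y = (0.2, 0.2)
z = i_add_outward(x, y)

# Exact sum of the two stored doubles, for comparison.
exact = Fraction(0.1) + Fraction(0.2)
print(z, Fraction(z[0]) <= exact <= Fraction(z[1]))
```

Since a correctly rounded sum is within half a step of the exact result, moving each bound one full step outward is enough to guarantee that the exact sum lies inside the output interval.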