# Numbers and Precision

A brief exploration of why it is so hard to get computers to do what we want, or, why computational physics **is** a separate field of study (to go along with experimental and theoretical physics).

In [1]:
import numpy as np

## Basic Arithmetic

An advantage of a scripting language and of a nice interface like the Jupyter notebook is that we can easily test things without needing to write lines and lines of boilerplate code.  Here we will test some simple arithmetic.

We begin by adding integers and testing for the expected result.

In [2]:
1 + 2 == 3

True

Notice that the result is `True`, so it is in fact true that 1+2 is equal to 3!

We now test something very similar: just divide the expression by 10, what could change?

(*Note:* **Never do this in real code!**)

In [3]:
0.1 + 0.2 == 0.3 # WRONG CODE

False

Now it claims this result is `False`!  How can the computer be so wrong!  It must be broken!  The sky is falling! *Etc.*

What we are encountering here is the heart of the difficulty and annoyance of using computers for numerical work.  How can it possibly get this wrong?
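For contrast, the usual alternative to `==` is to compare within a tolerance. The sketch below uses `math.isclose` from the standard library and `np.isclose` from NumPy with their default tolerances. Those defaults are an assumption made here for illustration; an appropriate tolerance depends on the problem at hand.

```python
import math
import numpy as np

# Exact comparison fails because of round-off in the sum.
exact = (0.1 + 0.2 == 0.3)

# Tolerance-based comparisons treat values that agree to within a
# small relative error as equal.
close_math = math.isclose(0.1 + 0.2, 0.3)
close_numpy = bool(np.isclose(0.1 + 0.2, 0.3))

print(exact, close_math, close_numpy)  # False True True
```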

### Finite Precision

Let's see what it gives for the value of the sum.

In [4]:
0.1 + 0.2

0.30000000000000004

Notice it does **not** give 0.3; there is a small "error" in the calculation.

If you are aware of this issue then you might know that computers cannot represent every number exactly: they have *finite precision*.  Maybe this means that 0.3 just cannot be represented exactly.  If this is the case we can "work around" the problem by dividing by "0.3".  This should give us "1", exactly.

In [5]:
(0.1 + 0.2) / 0.3

1.0000000000000002

We see that it does not!  There is even more going on.

As one example of how this can be a big problem in numerical work, suppose this calculation was the cosine of an angle.  In other words, suppose $\cos\theta = (0.1+0.2)/0.3$.  We know this means that $\theta=0$ (or any integer multiple of $2\pi$), but what happens when we try to calculate this numerically?

In [6]:
np.arccos((0.1 + 0.2) / 0.3)


nan

NumPy issues a warning and returns the dreaded `nan`.  NaN means "not a number"; it is pretty bad when you do a calculation with numbers and the result is not even a number!  Even though this is simple to calculate analytically, it fails when computed numerically, because round-off has pushed the argument slightly above 1, outside the domain of the arccosine.  Though this is a contrived example, this sort of issue shows up far more frequently than we would like!  (Yes, I have encountered precisely this type of error more than once.)

Another straightforward example of an error that can occur is using a square root.

In [7]:
np.sqrt(0.3 - (0.1 + 0.2))


nan
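One common way to guard against failures like these two is to clamp the argument to the valid domain before calling the function. The following is only a sketch, not a general cure: it assumes we know on mathematical or physical grounds that the argument belongs in the domain, so any excursion outside it is pure round-off.

```python
import numpy as np

c = (0.1 + 0.2) / 0.3          # slightly larger than 1 due to round-off
theta = np.arccos(np.clip(c, -1.0, 1.0))

d = 0.3 - (0.1 + 0.2)          # slightly negative due to round-off
root = np.sqrt(max(d, 0.0))

print(theta, root)  # 0.0 0.0
```

Clamping silently hides genuinely invalid input as well, so it is worth doing only where the valid range is known in advance.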

## Numerical Precision

At the heart of the issue is the fact that only a finite number of digits (finite amount of information) can be stored.  In fact we have encountered this issue before.  Consider
$$ \frac{1}{3} = 0.33333333333333333333\ldots .$$
There are infinitely many digits in this expression.  We cannot write out an infinite number of digits, so let us truncate it to three decimal places:
$$ \frac{1}{3} \approx 0.333 . $$
If we calculate the sum of three such terms, we find
$$ \frac{1}{3} + \frac{1}{3} + \frac{1}{3} \approx 0.333 + 0.333 + 0.333 = 0.999 . $$
Notice this is **not equal to one**!

If we add in more digits that does not help.  No matter how many digits we include we still do not get one!

This is an example that shows up since we are using base 10 to represent the fraction $1/3$.  It cannot be exactly represented in base 10 by a finite number of digits.  Many other numbers can be, for example, $\frac{1}{10} = 0.1$ is exactly represented, at least in base 10.
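The base 10 experiment above can be repeated with Python's `decimal` module, which does arithmetic with a chosen number of significant decimal digits. This is a minimal sketch; the helper name `sum_of_thirds` is made up for this illustration. Note that `decimal` rounds rather than truncates, which gives the same digits for $1/3$.

```python
from decimal import Decimal, localcontext

def sum_of_thirds(digits):
    """Add 1/3 three times, keeping only `digits` significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        third = Decimal(1) / Decimal(3)
        return third + third + third

for digits in (3, 10, 30):
    print(digits, sum_of_thirds(digits))
```

However many digits are kept, the sum is a string of nines and never exactly one.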

## Floating Point Representation

We have seen that not every real number can be exactly represented in base 10 with a finite number of digits.  The same is true in every other base.  Computer hardware (almost always) uses base 2 to represent numbers.  Again, not every real number can be exactly represented in base 2 (or in any other integer base, for that matter).  Furthermore, computers have finite storage space so can only store a finite number of digits for any number.  They must truncate numbers, much like our $1/3$ example above.
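To see which decimal fractions survive the trip into base 2, we can compare the value we typed with the value that is actually stored. The sketch below uses `fractions.Fraction` from the standard library, which converts a float to the exact ratio of integers it holds; the dictionary name `exact` is just a label chosen for this example.

```python
from fractions import Fraction

exact = {}
for text in ("0.5", "0.25", "0.1", "0.3"):
    stored = Fraction(float(text))   # what the computer actually holds
    wanted = Fraction(text)          # the decimal value we typed
    exact[text] = (stored == wanted)
    print(text, exact[text], stored)
```

Fractions whose denominator is a power of two, such as $1/2$ and $1/4$, are stored exactly. The others are replaced by the nearest ratio with a power-of-two denominator.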

*This fact is the single most important and confusing thing about numerical work.*  Countless hours are wasted tracking down problems caused by this issue!

We will not go into the exact details of how numbers are stored, how calculations and round off are handled, *etc.*, as there are many technical details.  We will, however, try to get a basic idea.

### IEEE 754 Standard

How to represent numbers, and more importantly, how to handle calculations with truncated numbers and round off, has been standardized in the *IEEE 754 standard*.  A somewhat readable discussion is available from [Wikipedia](https://en.wikipedia.org/wiki/IEEE_754).  The standard reference everyone is sent to when they ask this question in a public forum is [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html).  A greatly simplified discussion can be found at [What Every Programmer Should Know About Floating-Point Arithmetic](http://floating-point-gui.de/).  This is just a small sample of the material available online.  It is a common and confusing question that comes up when we first start using computers to solve numerical problems.

We can see how a number is actually stored using the following code.  It comes from (with some tweaks to make it work for us) a [Stack Overflow post](https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate).  See that post for a more complete discussion.  You do not need to understand this code; it is just here for us to use.

In [8]:
import struct
import itertools
def float_to_bin_parts(number, bits=64):
    """Return the [sign, exponent, mantissa] bit strings of a float."""
    if bits == 64:      # double precision. Default for Python and NumPy floats.
        int_pack      = 'Q'
        float_pack    = 'd'
        exponent_bits = 11
        mantissa_bits = 52
        exponent_bias = 1023
    elif bits == 32:    # single precision
        int_pack      = 'I'
        float_pack    = 'f'
        exponent_bits = 8
        mantissa_bits = 23
        exponent_bias = 127
    else:
        raise ValueError('bits argument must be 32 or 64')
    # Reinterpret the bytes of the float as an unsigned integer, then write
    # that integer in binary, padded on the left with zeros to the full width.
    bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
    # Slice off the sign (1 bit), exponent, and mantissa fields in order.
    return [''.join(itertools.islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]

Let us look at the number 0.1 as it is represented (in 64 bits).

In [9]:
fb = float_to_bin_parts(0.1)
fb

['0', '01111111011', '1001100110011001100110011001100110011001100110011010']
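As a cross-check that does not rely on the helper function, Python floats have a built-in `hex()` method that prints the same information in hexadecimal: a leading `1.`, then 13 hexadecimal digits (52 bits) of mantissa, then the power of two after `p`. The variable names below are chosen only for this sketch.

```python
x = 0.1
print(x.hex())  # 0x1.999999999999ap-4

# Rebuild the 52 mantissa bits from the 13 hexadecimal digits after "0x1."
hex_digits = x.hex().split(".")[1].split("p")[0]
mantissa_bits = bin(int(hex_digits, 16))[2:].rjust(52, "0")
print(mantissa_bits)
```

The bit string printed here should match the third entry returned by `float_to_bin_parts(0.1)`.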

### Interpreting the Data

You will see that the return value has been split into three components. These components are:

* Sign
* Exponent
* Mantissa (also called Significand or Fraction)


The short version of how to convert these values from base 2, when a number is stored as a 64-bit float (double precision, the default for Python and NumPy floats), is
* Sign (1 bit): 0 for positive, 1 for negative
* Exponent (11 bits): Subtract `2**[(# of bits) - 1] - 1` to get the true exponent
* Mantissa (52 bits): Divide by `2**(# of bits)` and add `1` to get the true mantissa

For our case the sign is `0` since this is a positive number.

We can calculate the exponent as

In [10]:
e = int(fb[1], 2) - (2**10 - 1)
e

-4

We can calculate the mantissa as

In [11]:
m = int(fb[2], 2) / 2**52 + 1
format(m, '.17f')

'1.60000000000000009'

Notice that this is not an exact decimal number.

What all of this shows is that we can represent our decimal number, 0.1, as
$$ 0.1 \approx +1.6 \times 2^{-4}. $$

We can convert to decimal by calculating $m \times 2^e$:

In [12]:
format(m * 2**(e), '.17f')

'0.10000000000000001'

Again, this is not exact in binary even though it is (of course) exact in decimal.

We could do the same for 0.2 and 0.3 to see how the errors enter into our calculation.
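A quick way to do this without decoding the bits by hand is `decimal.Decimal`, which, when given a float, shows the exact value that is stored. This is a minimal sketch using only the standard library; the labels and variable names are chosen for illustration.

```python
from decimal import Decimal

# Decimal(x) shows the exact value stored for the float x, digit by digit.
for label, value in [("0.1", 0.1), ("0.2", 0.2), ("0.3", 0.3), ("0.1 + 0.2", 0.1 + 0.2)]:
    print(f"{label:>9} is stored as {Decimal(value)}")

# 0.1 and 0.2 are both stored slightly too large, 0.3 slightly too small,
# so the rounded sum lands on a representable number just above 0.3.
gap = Decimal(0.1 + 0.2) - Decimal(0.3)
print(f"gap = {gap}")
```

The errors in 0.1 and 0.2 have the same sign, while the error in 0.3 has the opposite sign, which is why the comparison `0.1 + 0.2 == 0.3` fails.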

### Rule of Thumb

For the representation of a number in the way given above (which corresponds to double precision) we get about 16 digits of accuracy.

This does not mean that the smallest number we can have is $10^{-16}$.  It also does not mean that a number like $10^{-19}$ is (necessarily) less accurate than a number like $0.1$.  The important facts to keep in mind are:

1. the number of digits corresponds to the number of bits in the mantissa, and
2. the complete range of values we can have is determined by the number of bits in the exponent.

There are 53 bits in the mantissa. (Notice that we said 52 above, however, we always choose to write our number such that the first binary digit is '1'. Thus, we do not need to include this digit and we get one extra bit of precision for "free".)  This means the number of digits in base 10 is
$$ \log_{10}(2^{53}) = 15.95\ldots \approx 16. $$

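There are 11 bits in the exponent, but about half of the stored values are used for negative exponents (the bias we subtracted above effectively plays the role of a sign bit), thus we really only have 10 bits for the magnitude.  This means that in base 10 the maximum exponent is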
$$ 2^{10} / \log_2(10) \approx 308. $$
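These two estimates are easy to check numerically and to compare with what Python itself reports in `sys.float_info`. This is a small sketch with variable names chosen for illustration. Note that `sys.float_info.dig` is the number of decimal digits guaranteed to survive a round trip through a float, which is 15 and therefore slightly more conservative than the "about 16" rule of thumb.

```python
import math
import sys

digits_estimate = math.log10(2**53)          # about 15.95
max_exp_estimate = 2**10 / math.log2(10)     # about 308.25

print(f"digits   ~ {digits_estimate:.2f}, float_info.dig        = {sys.float_info.dig}")
print(f"exponent ~ {max_exp_estimate:.2f}, float_info.max_10_exp = {sys.float_info.max_10_exp}")
```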

### Actual floating point information

While this is nice, we can learn this and much more from `numpy` itself.  In particular, we can get all the floating point information we want from `np.finfo()`.  Some of this information is given below.

In [13]:
fi = np.finfo(float)

In [None]:
fi?

In [14]:
print(f"""Bits={fi.bits}
Max exp={fi.maxexp}, Max val={fi.max}
Min exp={fi.minexp}, Min val={fi.min}
Smallest number = {fi.tiny}
Smallest difference = {fi.eps}
""")

Bits=64
Max exp=1024, Max val=1.7976931348623157e+308
Min exp=-1022, Min val=-1.7976931348623157e+308
Smallest number = 2.2250738585072014e-308
Smallest difference = 2.220446049250313e-16



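Notice that the smallest number and the smallest difference are very different.  The smallest number (`fi.tiny`) is the smallest positive normalized value that can be stored.  The smallest difference (`fi.eps`), often called machine epsilon, is the gap between 1 and the next larger representable number.  It is the smallest number we can add to one and then subtract one from such that it is preserved, in other words,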
$$ (1+\epsilon)-1 = \epsilon $$
is true.

In [15]:
x = 1 + fi.eps
y = x - 1
print(f"y={y}, eps={fi.eps}")

y=2.220446049250313e-16, eps=2.220446049250313e-16


If we choose a smaller number this simple relation is no longer true!

In [16]:
x = 1 + 1e-18
y = x - 1
format(y, ".17f")

'0.00000000000000000'
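The gap between neighboring floats is not fixed at `eps`; it scales with the size of the number. The sketch below uses `np.spacing`, which returns the distance from a value to the next representable one, to show that the *relative* gap stays near `eps` whether the number is tiny or huge. The sample values are arbitrary choices for illustration.

```python
import numpy as np

eps = np.finfo(float).eps

# np.spacing(x) is the gap between x and the next representable number.
for x in (1e-19, 0.1, 1.0, 1e16):
    gap = np.spacing(x)
    print(f"x={x:g}, gap={gap:.3e}, gap/x={gap / x:.3e}")

# Numbers that are small compared with the local gap are lost when added.
lost_at_one = (1.0 + eps / 4 == 1.0)
lost_at_1e16 = (1e16 + 0.5 == 1e16)
print(lost_at_one, lost_at_1e16)  # True True
```

This is why $10^{-19}$ is not less accurate than $0.1$: both carry about 16 significant digits, and trouble only appears when numbers of very different sizes are combined.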

### Accurate step sizes

As one example of how these issues show up and are worked around, we consider the case of step sizes.  Many algorithms we want to implement require us to use small step sizes.  Since simple floating point numbers in base 10 are typically not exactly represented in base 2, it is not accurate to set a step size in base 10 and use it on a computer (which works in base 2).

To understand what this means, consider calculating a numerical derivative.  An algorithm we will see in the future, called center differencing, is given by
$$ \frac{\mathrm{d}f(x)}{\mathrm{d}x} \approx \frac{f(x+h) - f(x-h)}{2h}, $$
for some small value of $h$.

Let us calculate the derivative of $f(x)=\cos(x)$ for $x_0=\pi/4$.  Analytically we know that $f'(x) = -\sin(x)$.

To accurately represent the step size we use the idiom
$$ h = (x_0 + h_{10}) - x_0 $$
where $h_{10}$ is the step size we specify in base 10.  As we have seen, due to finite precision we expect that $h\ne h_{10}$.  As a specific example consider the following:

In [17]:
h10 = 1e-5 # NOTE!!!
x0 = np.pi / 4
h = x0 + h10
h = h - x0
print(f"h10={h10}, h={h}")

h10=1e-05, h=9.99999999995449e-06
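What the idiom buys us is that $x_0 + h$ lands exactly on a representable number, so the step actually taken is the step we divide by. A minimal sketch that checks this property follows; the helper name `representable_step` is hypothetical and not part of NumPy.

```python
import numpy as np

def representable_step(x0, h10):
    """Return a step h close to h10 such that x0 + h is exactly representable."""
    return (x0 + h10) - x0

x0 = np.pi / 4
h10 = 1e-5
h = representable_step(x0, h10)

# With the adjusted step, adding and then removing h recovers h exactly;
# with the nominal step it does not.
adjusted_ok = ((x0 + h) - x0 == h)
nominal_ok = ((x0 + h10) - x0 == h10)
print(adjusted_ok, nominal_ok)  # True False
```

The same guarantee is not automatic for $x_0 - h$, so this should be read as a cheap improvement rather than a complete fix.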


We now apply these two step sizes to the center differencing algorithm.  We see that it does make a difference!  The accurately represented step size gives a more accurate numerical derivative.

In [18]:
def center_difference(f, x, h):
    return (f(x+h) - f(x-h)) / (2 * h)

fp_h10 = center_difference(np.cos, x0, h10)
fp_h = center_difference(np.cos, x0, h)
fp_true = -np.sin(x0)
print(f"h10 : fractional error = {np.abs(1 - fp_h10 / fp_true)}")
print(f"h   : fractional error = {np.abs(1 - fp_h / fp_true)}")

h10 : fractional error = 1.974753693900766e-11
h   : fractional error = 1.5196288671859293e-11


Notice that using $h$ instead of $h_{10}$ gives a more accurate result!  Of course it is a very small difference here, but it "cost us nothing" to use an accurate step size and this type of error can add up quickly.