In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Floating Point Numbers

This is a tough subject, but to understand how a computer works with numbers, we first have to understand which numbers a computer works with.  To do this, we briefly need to acquaint ourselves with the binary representation of a number.  All this means is that we expand numbers in powers of $2$.  

So first we remember that when we write a decimal number, we are always writing something with respect to powers of $10$.  Thus

\begin{align}
7 = & 7\cdot10^{0}\\
17 = & 1\cdot 10^{1} + 7\cdot 10^{0}\\
107 = & 1\cdot 10^{2} + 0\cdot 10^{1} + 7 \cdot 10^{0}\\
107.3 = & 1\cdot 10^{2} + 0\cdot 10^{1} + 7 \cdot 10^{0} + 3 \cdot 10^{-1}
\end{align}

and more generally, for $x\in \mathbb{R}$ we have 

$$
x = \pm \sum_{j=-\infty}^{M} d_{j}10^{j}, ~ d_{j}=0,\cdots,9.
$$

We then see why we write things like $1/3 = .\bar{3}$ since,

$$
.\bar{3} = \sum_{j=1}^{\infty}\frac{3}{10^{j}} = 3 \left(\frac{1}{1-1/10} - 1 \right) = \frac{3}{9} = \frac{1}{3}.
$$

Now instead of powers of $10$, we do everything in powers of 2.  

\begin{align}
10 = 8 + 2 = & 1\cdot2^{3} + 0\cdot2^{2} + 1\cdot2^{1} + 0\cdot2^{0}\\
107 = 64 + 32 + 8 + 2 + 1 = & 1\cdot2^{6} + 1\cdot2^{5} + 0\cdot 2^{4} + 1\cdot2^{3} + 0\cdot 2^{2} + 1 \cdot 2^{1} + 1\cdot 2^{0} 
\end{align}

We abbreviate binary expansions in much the same way we abbreviate decimal expansions i.e. 

\begin{align}
10 = & 1010\\
107 = & 1101011 
\end{align}
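
As a quick sanity check on these abbreviations, Python's built-in `format` and `int` convert between integers and binary digit strings.  This is only a check with standard built-ins, not the conversion algorithm itself, and the variable names are chosen here for illustration.

```python
# Check the binary abbreviations above with Python built-ins.
for n in (10, 107):
    bits = format(n, 'b')          # binary digits as a string, no '0b' prefix
    back = int(bits, 2)            # parse the string as a base-2 integer
    print(n, '->', bits, '->', back)

# Rebuild 107 explicitly as a sum of powers of 2.
digits = [1, 1, 0, 1, 0, 1, 1]     # coefficients of 2^6, ..., 2^0
value = sum(b * 2**k for k, b in zip(range(6, -1, -1), digits))
print(value)
```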

For values $0\leq x < 1$, things look a little strange compared to what we are accustomed to seeing.  For example, 

\begin{align}
\frac{1}{2} & = 1\cdot 2^{-1}\\
\frac{3}{4} & = 1\cdot 2^{-1} + 1\cdot 2^{-2}
\end{align}

and you will see people write things like 

\begin{align}
\frac{1}{2} & = .1\\
\frac{3}{4} & = .11
\end{align}
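
To make the notation concrete, here is a minimal sketch that evaluates a binary fraction written as a string of digits after the point.  The helper name `frac_bits_to_value` is hypothetical and only meant to illustrate the sum of negative powers of $2$.

```python
def frac_bits_to_value(bits):
    """Evaluate the binary fraction .b1 b2 b3 ... given as a string of '0'/'1'."""
    # The j-th digit after the point carries weight 2^(-j).
    return sum(int(b) * 2.0**(-j) for j, b in enumerate(bits, start=1))

print(frac_bits_to_value('1'))     # .1   in base 2
print(frac_bits_to_value('11'))    # .11  in base 2
print(frac_bits_to_value('101'))   # .101 in base 2
```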

Things get weird when we look at, say, the decimal number $.1$.  We have 

$$
.1 = \frac{1}{10} = \frac{b_{1}}{2} + \frac{b_{2}}{4} + \frac{b_{3}}{8} + \cdots , ~ b_{j}=0,1.
$$

So we see that if we multiply by $2$, then 

$$
.2 = b_{1} + \frac{b_{2}}{2} + \frac{b_{3}}{4} + \frac{b_{4}}{8} + \cdots,
$$

but since $.2<1$, then $b_{1}=0$.  Repeating this process, we get 

$$
.4 = b_{2} + \frac{b_{3}}{2} + \frac{b_{4}}{4} + \frac{b_{5}}{8} + \cdots.
$$

Again $.4<1$, so $b_{2}=0$, and $.8<1$, so $b_{3}=0$.  But then we get 

$$
1.6 = b_{4} + \frac{b_{5}}{2} + \frac{b_{6}}{4} + \frac{b_{7}}{8} + \frac{b_{8}}{16} + \cdots
$$

Now, $1.6>1$, so $b_{4}=1$, and we get 

$$
.6 = \frac{b_{5}}{2} + \frac{b_{6}}{4} + \frac{b_{7}}{8} + \frac{b_{8}}{16} + \cdots
$$

Multiply by $2$, and we get 

$$
1.2 = b_{5} + \frac{b_{6}}{2} + \frac{b_{7}}{4} + \frac{b_{8}}{8} + \cdots
$$

and thus $b_{5}=1$, and after repeating our process, we have 

$$
.4 = b_{6} + \frac{b_{7}}{2} + \frac{b_{8}}{4} + \frac{b_{9}}{8} + \cdots
$$

So to summarize, we have shown $b_{1}=b_{2}=b_{3}=0$ and $b_{4}=b_{5}=1$.  Moreover, the two expansions we found for $.4$ must agree, so 

$$
b_{2} + \frac{b_{3}}{2} + \frac{b_{4}}{4} + \frac{b_{5}}{8} + \cdots = b_{6} + \frac{b_{7}}{2} + \frac{b_{8}}{4} + \frac{b_{9}}{8} + \cdots,
$$

which means the digits repeat with period four, $b_{j+4}=b_{j}$ for $j\geq 2$, and thus we have shown that 

$$
.1 = .0001100110011001100110011\cdots
$$
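
The doubling argument above can be sketched in exact rational arithmetic, so that no rounding interferes with the digits.  This is a minimal sketch using the standard library's `fractions.Fraction`; the function name `binary_digits_exact` is an assumption made for illustration.

```python
from fractions import Fraction

def binary_digits_exact(x, n):
    """First n binary digits of a rational 0 <= x < 1, using exact arithmetic."""
    x = Fraction(x)
    digits = []
    for _ in range(n):
        x *= 2
        b = 1 if x >= 1 else 0   # the integer part is the next digit
        digits.append(b)
        x -= b                   # keep only the fractional part
    return digits

digits = binary_digits_exact(Fraction(1, 10), 25)
print('.' + ''.join(str(b) for b in digits))

# The block b_2 ... b_5 reappears as b_6 ... b_9 (lists are 0-indexed).
print(digits[1:5] == digits[5:9])
```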

So what was a number with a simple decimal expansion in base-10 becomes a far more complicated creature in base-2.  This of course begs for code.  So, let's think about writing code which turns a decimal number into its corresponding binary representation.  

So first, let's think about a positive integer $d$.  We know it has some binary expansion, which looks like

$$
d = b_{m}2^{m} + b_{m-1}2^{m-1} + \cdots + b_{1}2^{1} + b_{0}2^{0}, ~ b_{j}=\left\{\begin{array}{rl} 1 & j=m\\ 0,1 & 0\leq j < m
\end{array}\right.
$$

_Problem_: In terms of the variables $b_{j}$, what is `d%2`? 

_Problem_: If I know $b_{0}$, how would I find $b_{1}$?  

_Problem_: How would I print an array backwards in Python?  If I have an array `avals`, what does `avals[::-1]` do?

_Problem_: What is an algorithm for generating $b_{j}$?

In [17]:
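def bin_exp(d):
    # Build the binary digits of a positive integer d, lowest-order bit first.
    bstr = ''
    while d > 0: 
        b0 = int(d % 2)      # b0 is the current lowest-order bit
        d = (d - b0) // 2    # floor division keeps integer inputs exact
        bstr += str(b0)
    return bstr[::-1]        # reverse so the highest-order bit comes first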

In [18]:
print(bin_exp(10))

1010


So what are we going to do about decimal parts of numbers?  In other words, suppose we have $0<d<1$ where

$$
d = b_{-1}\frac{1}{2} + b_{-2}\frac{1}{2^{2}} + \cdots + b_{-j}\frac{1}{2^{j}} + \cdots, ~ b_{-j}=0,1.
$$

What is an algorithm for determining the coefficients $b_{-j}$?  

In [19]:
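def bin_exp_dec(d):
    # Binary digits of 0 < d < 1 by repeated doubling: the integer part
    # that appears after each doubling is the next digit b_{-j}.
    bstr = ''
    cnt = 0
    while cnt <= 53:      # generates 54 digits in total
        d *= 2.
        if d >= 1.:
            b1 = 1.
            bstr += '1'
        else:
            b1 = 0.
            bstr += '0'
        d -= b1           # discard the integer part, keep the fraction
        cnt += 1

    return bstr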

In [21]:
print(bin_exp_dec(.1))

000110011001100110011001100110011001100110011001100110
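
The digit string stops because the machine keeps only finitely many bits, so the value it stores for `.1` is a fraction with a power-of-2 denominator rather than $1/10$ itself.  A minimal sketch of this, using the standard library's `fractions.Fraction` to read off the exact stored value (variable names are chosen here for illustration):

```python
from fractions import Fraction

stored = Fraction(0.1)             # exact value of the double nearest to 1/10
print(stored)
print(stored.denominator == 2**55)
print(stored - Fraction(1, 10))    # the rounding error, as an exact fraction
print(0.1 + 0.2 == 0.3)            # rounding errors show up in arithmetic
```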


In [22]:
print(2.**(-53))

1.1102230246251565e-16


## The Floating Point Representation of Machine Numbers

Here is where things get markedly more complex but, as is so often the case, more interesting.  What this all comes down to is what are called _memory registers_.  For this, we need a picture.

![Memory](https://upload.wikimedia.org/wikipedia/commons/d/d8/ABasicComputer.gif)

![Real_Life](https://upload.wikimedia.org/wikipedia/commons/5/52/EBIntel_Corei5.JPG)

In the image above, we see the CPU for a laptop to the right as the bronze square with a pipe on top of it.   

So what we are talking about when we talk about registers are physical locations on the CPU.  In effect, they are the CPU's personal scratch pad.  The registers themselves are made of 64 _bits_, i.e. each register contains a sequence of 64 1's and 0's.  When a register is used to hold a number, we represent the machine, or floating point, number, say ${\bf x}_{f}$, via the form 

$$
{\bf x}_{f} = \left(s ~c_{10} c_{9} \cdots c_{0} ~f_{1} f_{2} \cdots f_{52} \right)
$$

The bit $s$ is the sign: 0 means positive, 1 means negative.  The next 11 bits, represented by the values $c_{j}$, make up the _characteristic_.  The remaining 52 bits, represented by the values $f_{j}$, make up the _mantissa_.  The actual number represented by all these bits is found via the formula    

$$
x_{f} = (-1)^{s}2^{\tilde{c}}(1 + \tilde{f}),
$$

where

$$
\tilde{c} = \sum_{j=0}^{10}c_{j}2^{j} - 1023, ~ \tilde{f} = \sum_{j=1}^{52}\frac{f_{j}}{2^{j}}
$$
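
As a worked instance of these formulas, take $x_{f} = -6.5$ (a value chosen here purely for illustration).  Since $6.5 = 2^{2}\cdot 1.625$, we need

\begin{align}
s = & 1\\
\tilde{c} = & 2 ~\Rightarrow~ \sum_{j=0}^{10}c_{j}2^{j} = 1025 = 2^{10} + 2^{0}\\
\tilde{f} = & .625 = \frac{1}{2} + \frac{1}{8}
\end{align}

so that $c_{10}=c_{0}=1$ and $f_{1}=f_{3}=1$, with every other bit equal to zero, i.e.

$$
{\bf x}_{f} = \left(1 ~ 10000000001 ~ 1010\cdots0\right).
$$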

Just to be clear, we are making a distinction between ${\bf x}_{f}$, which is a collection of 64 bits in a register inside a CPU, and $x_{f}$ which is an actual number as we commonly understand them.  Ignoring the sign, and using the formula 

$$
\sum_{j=0}^{n} a^{j} = \frac{a^{n+1}-1}{a-1},
$$

we can then determine what the following machine numbers are 

\begin{align}
\mbox{Inf} = & (0 ~ 1\cdots1 ~0\cdots 0)\\
0 = & (0 ~ 0\cdots0 ~0\cdots 0)\\
{\bf x}^{max}_{f} = & (0 ~ 1\cdots1 0 ~ 1\cdots 1)\\
{\bf x}^{min}_{f} = & (0 ~ 0\cdots0 1 ~ 0\cdots 0)
\end{align}

_Problem_: Using the formulas above, determine the numerical values of $x^{max}_{f}$ and $x^{min}_{f}$.  

_Problem_: Determine the range of characteristic values.  

_Problem_: Given an array `avals` how would I slice out the first 11 entries?  What does the command `avals[:11]` do?  What does the command `avals[1:12]` return?

_Problem_: Determine the range of mantissa values.

_Problem_: Determine the machine representation of $.1$ using the example from above.  

So now let's think about how to take a given number and determine its floating point representation, and vice versa, how to take a floating point bit representation and turn it into a number.  

In [14]:
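def bit_to_num(bvec):
    # Layout: bvec[0] is the sign bit, bvec[1:12] holds c_10, ..., c_0,
    # and bvec[12:] holds f_1, ..., f_52.
    s = bvec[0]
    cvec = bvec[1:12]
    fvec = bvec[12:]
    cpows = np.arange(11)
    fpows = -np.arange(1, 53)
    # cvec starts with c_10, so the powers of 2 are applied in reverse order.
    ctil = np.sum(cvec*(2.**cpows[::-1])) - 1023
    ftil = np.sum(fvec*(2.**fpows))
    num = ((-1.)**s) * (2.**ctil) * (1.+ftil)
    return num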

In [None]:
cvec = np.array([0,1,1,1,1,1,1,1,0,1,1])
fvec = np.array([1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1])
frup = np.array([1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,1,0])

bvec = np.zeros(64)
brup = np.zeros(64)
bvec[1:] = np.concatenate((cvec,fvec))
brup[1:] = np.concatenate((cvec,frup))

print("%1.19f" % bit_to_num(bvec))
print("%1.19f" % bit_to_num(brup))

In [29]:
def num_to_bit(d):
    # add code to transform a number into its binary representation
    raise NotImplementedError

In [None]:
num_to_bit(.1)
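
One possible shortcut, offered only as a sketch, is to ask Python for the bits it actually stores rather than computing them digit by digit.  This assumes Python floats are IEEE 754 doubles, and the name `num_to_bit_sketch` is hypothetical; it relies on the standard library's `struct` module, so it is not the bit-by-bit algorithm the exercise has in mind.

```python
import struct
import numpy as np

def num_to_bit_sketch(d):
    """Return the 64 stored bits of a Python float as an array of 0s and 1s."""
    # Pack the float as a big-endian double, then reread the same 8 bytes
    # as an unsigned 64-bit integer.
    (as_int,) = struct.unpack('>Q', struct.pack('>d', d))
    bstr = format(as_int, '064b')          # 64 characters, sign bit first
    return np.array([int(b) for b in bstr])

bits = num_to_bit_sketch(.1)
print(bits[0])                             # sign bit s
print(''.join(map(str, bits[1:12])))       # characteristic c_10 ... c_0
print(''.join(map(str, bits[12:])))        # mantissa f_1 ... f_52
```

The returned array uses the same ordering as the formula above: sign first, then $c_{10},\dots,c_{0}$, then $f_{1},\dots,f_{52}$.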

What this means then is that, in the format described here, the computer does not see any number between $0$ and $x^{min}_{f} = 2^{-1022}$.  Such numbers do not exist on the machine, and thus any value in between that appears in a computation must be rounded one way or the other.  
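
The extreme values can be compared against what Python reports through `sys.float_info`.  This is a quick check that assumes the platform uses IEEE 754 doubles; that standard also defines subnormal numbers below the smallest normalized value, which the format described here leaves aside.

```python
import sys

smallest = sys.float_info.min      # smallest positive normalized double
largest = sys.float_info.max       # largest finite double

print(smallest, smallest == 2.0**(-1022))
print(largest, largest == 2.0**1023 * (2.0 - 2.0**(-52)))
```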

When the numbers are small, this rounding may not seem like much, but floating point is a system based on _relative_ magnitudes.  To understand what this means, first suppose we have a positive floating point number $x_{f}$ whose mantissa satisfies $0 < \tilde{f} < 1-2^{-52}$, so that both of its neighbors share its characteristic.  We now look at the two closest floating point numbers to $x_{f}$, say $x_{f}^{+}$ and $x_{f}^{-}$.  This gives us

\begin{align}
x^{+}_{f} = & 2^{\tilde{c}}(1 + \tilde{f} + 2^{-52} )\\
x_{f} = & 2^{\tilde{c}}(1 + \tilde{f} )\\
x^{-}_{f} = & 2^{\tilde{c}}(1 + \tilde{f} - 2^{-52} )
\end{align}

Thus we see that 

$$
\left|x^{+}_{f} - x_{f}\right| = \left|x_{f} - x^{-}_{f}\right| = 2^{\tilde{c}-52}
$$

which shows that as $\tilde{c}$ gets bigger, the _absolute spacing_ between floating point numbers _increases_.  Keep in mind, this means that as the characteristic increases, there are more and more numbers not represented by the computer.  
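
The growth of the absolute spacing can be observed directly.  A minimal sketch, assuming NumPy's `np.nextafter` and `np.spacing`, which return the neighboring double and the gap to it; the sample values are arbitrary.

```python
import numpy as np

gaps = {}
for x in (1.0, 1024.0, 2.0**52):
    gap = np.nextafter(x, np.inf) - x      # distance to the next larger double
    gaps[x] = gap
    print(x, gap, gap == np.spacing(x))
```

For these three inputs the characteristic is $0$, $10$, and $52$, so the gaps should come out as $2^{-52}$, $2^{-42}$, and $1$.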

However, if we look at the relative spacing, we get 

$$
\frac{\left|x^{+}_{f} - x_{f}\right|}{2^{\tilde{c}}} = \frac{\left|x_{f} - x^{-}_{f}\right|}{2^{\tilde{c}}} = 2^{-52}.
$$

Thus, in the floating point system, absolute differences change based on the magnitude of the number, which is set by the characteristic $\tilde{c}$.  However, the relative difference stays exactly the same, and we define this fixed relative difference to be what is called _machine precision_.   
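
As a closing check, the relative spacing can be computed for a few numbers of very different size and compared with the value NumPy reports as machine epsilon.  This is a sketch assuming IEEE 754 doubles; `math.frexp` is used to recover the characteristic, and the sample values are arbitrary.

```python
import math
import numpy as np

eps = np.finfo(float).eps
print(eps == 2.0**(-52))

results = []
for x in (0.1, 1.0, 3.0, 1.0e10):
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= m < 1
    ctil = e - 1                           # so x = (2*m) * 2**ctil with 1 <= 2*m < 2
    rel_gap = (np.nextafter(x, np.inf) - x) / 2.0**ctil
    results.append(rel_gap == eps)
    print(x, ctil, rel_gap == eps)
```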