# Numbers
<div class="alert alert-block alert-info">
If we have $n$ bits available, then we may encode $2^n$ different objects. 
For instance, $3$ bits enable encoding of $8=2^3$ different objects by 
\begin{align*}
000 && 001 && 010 && 100 \\
011 && 101 && 110 && 111
\end{align*}
In $64$-bit systems, we may encode $2^{64}\approx 1.8\cdot 10^{19}$ different numbers. The observation
$2^{64} = \#\{ 0 ,\ldots, 2^{64}-1\} = \#\{ -2^{63},\ldots,2^{63}-1\}$
leads to
    
<b>Integers:</b> 
\begin{equation}
Int64=\{-2^{63},\ldots,2^{63}-1\},\qquad 2^{63}\approx 9.2\cdot 10^{18}
\end{equation}
</div>
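
The counting argument can be checked directly. The following sketch lists all bit patterns of length $n=3$; the names `n` and `patterns` are chosen here only for illustration.

```julia
# Enumerate all bit patterns of length n (here n = 3).
n = 3
patterns = [string(k, base = 2, pad = n) for k in 0:2^n-1]
println(patterns)
println(length(patterns) == 2^n)   # one pattern per encoded object
```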

In [1]:
println(bitstring(0))
println(bitstring(1))
println(bitstring(2^63-1))

0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000001
0111111111111111111111111111111111111111111111111111111111111111


In [2]:
println(bitstring(-2^63))
println(bitstring(-2^63+1))
println(bitstring(-1))

1000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000001
1111111111111111111111111111111111111111111111111111111111111111


In [3]:
a = 2^63-1

9223372036854775807

In [5]:
typeof(a)

Int64

In [6]:
b = -2^63

-9223372036854775808

In [7]:
typeof(b)

Int64

In [8]:
a+1 == b ,                 # overflow
b-1 == a                   # overflow

(true, true)
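
Since `Int64` arithmetic wraps around silently, it can help to make the overflow visible. The following is a minimal sketch, not the only way to do this: it uses `Base.Checked.checked_add`, which throws an `OverflowError` instead of wrapping, and `BigInt` arithmetic via `big`, which does not overflow. The variable name `overflowed` is an illustrative choice.

```julia
# Wrap-around versus checked and arbitrary-precision arithmetic.
a = typemax(Int64)                   # 2^63 - 1
println(a + 1 == typemin(Int64))     # silent wrap-around

overflowed = try
    Base.Checked.checked_add(a, 1)   # throws OverflowError
    false
catch err
    err isa OverflowError
end
println(overflowed)

println(big(a) + 1 == big(2)^63)     # exact in BigInt
```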

---

<div class="alert alert-block alert-info">
<b>Rationals:</b> 
\begin{equation}
Rational\{Int64\} = \left\{\frac{p}{q} : p,q\in Int64, q\neq 0\right\}\cup\{\pm\infty\},
\end{equation}
where $1//0=\infty$.
</div>

In [9]:
x = 2//3

2//3

In [10]:
typeof(x)

Rational{Int64}

In [11]:
2//4 == 1//2

true

In [12]:
1//0 == 4//0             # infinity     

true
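
As a small illustration of why rationals can be useful, the sketch below contrasts exact rational arithmetic with `Float64` arithmetic. It assumes that all numerators and denominators stay within `Int64`; the variable name `q` is illustrative.

```julia
# Rational arithmetic is exact as long as numerator and denominator fit into Int64.
println(1//10 + 2//10 == 3//10)           # exact comparison of rationals
println(0.1 + 0.2 == 0.3)                 # false: the Float64 operands are rounded
q = 6//8
println((numerator(q), denominator(q)))   # stored in lowest terms: (3, 4)
```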

## Floating point numbers

The **floating point numbers** are not equally spaced: their spacing grows with their magnitude, which allows them to cover a much larger range than the integers while also representing fractions. We follow the IEEE 754 standard.

For a binary number such as $110$ or $1,1001$, we put

\begin{align*}
(110)_2& : = 1\cdot 2^2 + 1\cdot 2^1 + 0\cdot 2^0 = 6\\ 
(1,1001)_2 &:= 1+1\cdot 2^{-1}+0\cdot 2^{-2}+0\cdot 2^{-3}+1\cdot 2^{-4}=1+\frac{9}{16}=\frac{25}{16}.
\end{align*}
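
The two expansions can be reproduced with a few lines of Julia. The helper `binvalue` below is hypothetical (it is not part of Julia) and expects the binary point to be written as `.` instead of the comma used in the formulas; it returns an exact rational.

```julia
# Hypothetical helper: exact value of a binary string such as "110" or "1.1001".
function binvalue(s::AbstractString)
    parts = split(s, '.')
    intpart = parse(Int, parts[1]; base = 2)      # digits before the point
    length(parts) == 1 && return intpart // 1
    frac = parts[2]                               # digits after the point
    return intpart + parse(Int, frac; base = 2) // 2^length(frac)
end

println(binvalue("110"))      # 6//1
println(binvalue("1.1001"))   # 25//16
```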

Consider 

\begin{equation}\label{eq:general floating point number}
\pm \big(1,a_1 a_2\ldots a_{52}\big)_2 \cdot 2^\alpha = (-1)^s 2^\alpha\left(1+\sum_{k=1}^{52} a_k 2^{-k}\right),\qquad s,a_k\in\{0,1\}.
\end{equation}

In $64$-bit encoding, the **sign** $\pm$ requires $1$ bit for $s$ and the **mantissa** $a_1,\ldots,a_{52}$ requires $52$ bits, so that $11$ bits are left for encoding the **exponent** $\alpha\in\mathbb{Z}$. We observe

\begin{equation*}
2^{11} = 2048 = \# \{0,\ldots,2047 \} = \# \{\underbrace{-1023}_{\text{special}},-1022,\ldots,1023,\underbrace{1024}_{\text{special}}\},
\end{equation*}

so that $\alpha\in \{-1022,\ldots,1023\}$. Note that $-1023$ and $1024$ are used for encoding special "numbers".



| $\alpha$ | $11$-bits     |
|---------:|:-------------|
|$\text{special}$   | $00000000000$ |
|$-1022$   | $00000000001$ |
|$-1021$   | $00000000010$ |
|$\vdots$  |   $\vdots$    |
|$-1$      | $01111111110$ |
|$0$       | $01111111111$ |
|$1$       | $10000000000$ |
|$\vdots$  | $\vdots$     |
|$1023$    | $11111111110$ |
|$\text{special}$   | $11111111111$ |
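
The table amounts to storing $\alpha+1023$ in the $11$ exponent bits. A minimal sketch that recovers $\alpha$ from the bit string is given below; the helper name `stored_exponent` is an assumption made for this example, and the formula only applies to numbers whose exponent bits are not one of the two special patterns.

```julia
# Recover α from the 11 exponent bits (characters 2 to 12 of the bit string).
# `stored_exponent` is a hypothetical helper name.
stored_exponent(x::Float64) = parse(Int, bitstring(x)[2:12]; base = 2) - 1023

for x in (floatmin(Float64), 0.5, 1.0, 2.0, floatmax(Float64))
    println(stored_exponent(x))   # -1022, -1, 0, 1, 1023
end
```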


<div class="alert alert-block alert-info">
<b>Definition:</b> 
The floating point numbers (for the $64$-bit IEEE standard) are
\begin{equation*}
\mathbb{F}_{64}:=\left\{\pm \big(1,a_1 a_2\ldots a_{52}\big)_2 \cdot 2^\alpha \; : \;  a_k\in\{0,1\},\; \alpha\in \{-1022,\ldots,1023\} \right\},
\end{equation*}
and we define
\begin{align*}
R_{64} := \max(\mathbb{F}_{64}),
\qquad\qquad
r_{64}:=\min(\{x\in \mathbb{F}_{64} : x>0\}).
\end{align*}
</div>


If $x\in \mathbb{F}_{64}$, then we have $|x|\in  [r_{64},R_{64}]$.


<div class="alert alert-block alert-warning">
<b>Lemma:</b> 
We have 
\begin{align*}
R_{64} = 2^{1024}\left(1-2^{-53}\right)>10^{308}, \qquad\qquad
r_{64}=2^{-1022} < 3\cdot 10^{-308},
\end{align*}
and $\mathbb{F}_{64}$ is a union of equally spaced numbers with gaps $2^{\alpha-52}$, i.e., 
\begin{equation}\label{eq:floating all}
\mathbb{F}_{64} = \bigcup_{\alpha=-1022}^{1023} \{2^\alpha(1+k2^{-52}) : k=0,1,\ldots,2^{52}-1\}.
\end{equation}
</div>

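**Proof:**
We directly compute
\begin{align*}
R_{64}&=(1,1\ldots 1)_2 \cdot 2^{1023}\\
& = \sum_{k=0}^{52} 2^{-k} \cdot 2^{1023} \\
& = \frac{1-2^{-53}}{1/2}\, 2^{1023} = 2^{1024}\left(1-2^{-53}\right).
\end{align*}
The smallest positive element is attained for $\alpha=-1022$ and $a_1=\ldots=a_{52}=0$, so that $r_{64}=2^{-1022}$. For fixed $\alpha$, the factor $1+\sum_{k=1}^{52}a_k2^{-k}$ runs through the values $1+k2^{-52}$, $k=0,\ldots,2^{52}-1$, which yields the union and the gaps $2^{\alpha-52}$.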
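
The statements of the lemma can also be checked numerically. The sketch below relies on the Base functions `floatmax`, `floatmin`, `eps` and `ldexp`, where `ldexp(1.0, n)` returns $2^n$; the exponent `α = 10` is an arbitrary choice for the gap test.

```julia
# R_64 and r_64 as provided by Julia.
println(BigInt(floatmax(Float64)) == big(2)^1024 - big(2)^971)  # 2^1024 (1 - 2^-53)
println(floatmin(Float64) == ldexp(1.0, -1022))                 # 2^-1022

# Gap between neighbouring numbers with exponent α.
α = 10
println(eps(ldexp(1.0, α)) == ldexp(1.0, α - 52))
```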


In [13]:
typeof(2)

Int64

In [14]:
typeof(2.0)

Float64

In [15]:
typeof(2.0^63)

Float64

In [16]:
Sign = 1
exponent = (2:12)
mantisse = (13:64)
println(bitstring(2.0^-1022)[exponent])
println(bitstring(2.0^-1021)[exponent])
println(bitstring(2.0^-1)[exponent])
println(bitstring(2.0^0)[exponent])
println(bitstring(2.0^1)[exponent])
println(bitstring(2.0^1023)[exponent])

00000000001
00000000010
01111111110
01111111111
10000000000
11111111110


In [18]:
x = 2.0^1023*(2-2^(-52))  # biggest floating point number 2^1024(1-2^-53)
println(bitstring(x)[Sign])
println(bitstring(x)[exponent])
println(bitstring(x)[mantisse])

0
11111111110
1111111111111111111111111111111111111111111111111111


In [19]:
y = x+2.0^969
println(bitstring(y)[Sign])
println(bitstring(y)[exponent])
println(bitstring(y)[mantisse])

0
11111111110
1111111111111111111111111111111111111111111111111111


In [20]:
x == y

true

In [21]:
z = x+2.0^970

Inf

## Machine precision

<div class="alert alert-block alert-info">
<b>Definition:</b> 
The number $\mathrm{eps}_{64}:=2^{-52}\approx 2.2\cdot 10^{-16}$ is called the machine precision.
</div>

In [22]:
eps(Float64)

2.220446049250313e-16

In [23]:
eps() == eps(Float64)

true

<div class="alert alert-block alert-success">
<b>Example:</b>[Smallest machine number bigger than $5$]
We have $5=2^2+2^0$ and 
\begin{equation}
2^2\left(2^0+2^{-2}+2^{-52}\right)= 5+4\mathrm{eps}_{64}
\end{equation}
is the smallest machine number bigger than $5$.
</div>

In [25]:
a = 5.; b = 5+2*eps(); c = 5+3*eps(); d = 5+4*eps(); 
a == b,
a == c,
a == d

(true, false, false)
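
The same neighbour can be obtained without trial and error. A short sketch using the Base functions `nextfloat` and `eps(x)`, where `eps(x)` returns the gap between `x` and the next larger `Float64`:

```julia
x = 5.0
println(nextfloat(x) == 5 + 4 * eps())   # smallest machine number bigger than 5
println(nextfloat(x) - x == eps(x))      # eps(x) is the local gap
println(eps(x) == 4 * eps())             # gap 2^(2-52) for numbers in [4, 8)
```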

---

<div class="alert alert-block alert-success">
<b>Example:</b>[Smallest positive integer that is not in $\mathbb{F}_{64}$]
We have $2^{53}\in\mathbb{F}_{64}$ and $2^{53}+1\not\in\mathbb{F}_{64}$.
</div>

In [26]:
maxintfloat()==2^53      

true

In [27]:
x = 2.0^53; y = 2.0^53+1; x == y

true
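
To see the threshold $2^{53}$ from the integer side, the following sketch converts `Int64` values to `Float64` and compares them with the original integers. It relies on Julia comparing an integer and a float by value; the range of $1000$ integers below $2^{53}$ is an arbitrary sample, not a proof.

```julia
n = 2^53
println(Float64(n) == n)           # 2^53 is representable
println(Float64(n + 1) == n + 1)   # false: 2^53 + 1 is rounded
println(Float64(n + 2) == n + 2)   # the next even integer is representable
println(all(Float64(k) == k for k in n-1000:n))   # sample of integers up to 2^53
```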

In [28]:
floor(1.13*100)   # 1.13 is not in F_64; the computed product lies slightly below 113

112.0
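
The result above can be traced back to the stored value of `1.13`. The sketch below converts it exactly to a rational with `Rational{BigInt}` and compares it with $113/100$; using `round` instead of `floor` is shown as one possible remedy, under the assumption that the nearest integer is what is wanted.

```julia
x = 1.13
println(Rational{BigInt}(x) < 113//100)   # the stored value lies below 113/100
println(x * 100 < 113)                    # the computed product is below 113 as well
println(round(x * 100))                   # rounding to the nearest integer gives 113.0
```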

In [30]:
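function myeps()
    value = 1.0                # start with a Float64 to keep the type stable
    while 1 < 1 + value        # halve until 1 + value is rounded to 1
        value = value / 2
    end
    return value * 2           # last value with 1 < 1 + value
end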
myeps() == eps()

true

Any number in $[-R_{64},-r_{64}] \cup [r_{64},R_{64}]$ can be written as 
\begin{equation*}
\pm \left(\sum_{k=0}^\infty a_k 2^{-k} \right)2^\alpha,
\end{equation*}
with $a_k\in\{0,1\}$ and $\alpha\in \{-1022,\ldots,1023\}$, where we fix $a_0:=1$.

<div class="alert alert-block alert-info">
<b>Definition:</b> 
Rounding down $\mathrm{fl}_{\downarrow} : [-R_{64},-r_{64}] \cup [r_{64},R_{64}] \rightarrow \mathbb{F}_{64}$ is defined by the truncation
\begin{equation}\label{eq:rd}
\pm \left( \sum_{k=0}^{\infty} a_k 2^{-k} \right)2^\alpha\mapsto  \pm \left(\sum_{k=0}^{52} a_k 2^{-k} \right)2^\alpha.
\end{equation}
</div>
Define $\mathrm{fl}_{\uparrow}$ and $\mathrm{fl}$ ...
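
Julia exposes directed rounding through the rounding modes `RoundDown`, `RoundUp` and the default round to nearest. The sketch below uses them as stand-ins for $\mathrm{fl}_{\downarrow}$, $\mathrm{fl}_{\uparrow}$ and $\mathrm{fl}$ on the positive number $\pi$, for which truncation and rounding down coincide; the names `lo`, `hi`, `nearest` and `relerr` are illustrative.

```julia
lo = Float64(pi, RoundDown)   # largest Float64 not exceeding π
hi = Float64(pi, RoundUp)     # smallest Float64 not below π
nearest = Float64(pi)         # default: round to nearest
println((lo, hi, nearest))
println(nextfloat(lo) == hi)               # lo and hi are neighbours
println(nearest == lo || nearest == hi)    # fl picks one of the two
relerr = abs(big(nearest) - big(pi)) / big(pi)
println(relerr <= eps())                   # relative rounding error
```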

<div class="alert alert-block alert-warning">
<b>Lemma:</b> 
For $x\in\mathbb{R}$ with $|x|\in [r_{64},R_{64}]$, the relative rounding error satisfies
\begin{equation}\label{eq:rding}
\left|\frac{\mathrm{fl}(x)-x}{x}\right|\leq \mathrm{eps}_{64}.
\end{equation}
</div>

**Proof:**
Rounding yields
\begin{equation*}
\left|\frac{\mathrm{fl}(x)-x}{x}\right| \leq \left|\frac{\left(\sum_{k=53}^{\infty} a_k 2^{-k}\right) 2^\alpha}{\left(\sum_{k=0}^{\infty} a_k 2^{-k} \right)2^\alpha}\right| = \left|\frac{\sum_{k=53}^{\infty} a_k 2^{-k} }{\sum_{k=0}^{\infty} a_k 2^{-k}}\right|.
\end{equation*}
Since $a_0=1$, we observe $\sum_{k=0}^{\infty} a_k 2^{-k}\geq 1$, so that $\sum_{k=53}^{\infty} a_k 2^{-k}\leq \sum_{k=53}^{\infty} 2^{-k} = 2^{-52}$ concludes the proof.

The computer addition $\oplus$ and multiplication $\odot$ are implemented such that, for all $x,y\in \mathbb{F}_{64}$ with $|x + y|\in [r_{64},R_{64}]$ and $|x\cdot y|\in [r_{64},R_{64}]$, respectively,
\begin{equation}\label{eq:exact star}
x\oplus y := \mathrm{fl}(x + y),\qquad x\odot y := \mathrm{fl}(x \cdot y).
\end{equation}
The relative errors of computer addition and multiplication are bounded by the machine precision:
<div class="alert alert-block alert-info">
<b>Theorem:</b> 
For all $x,y\in \mathbb{F}_{64}$, 
\begin{align}
|x+y|\in [r_{64},R_{64}]\quad & \Rightarrow\quad \left| \frac{x\oplus y-(x + y)}{x +y}\right|  \leq \mathrm{eps}_{64},\\
    |x\cdot y|\in [r_{64},R_{64}]\quad &\Rightarrow\quad \left| \frac{x\odot y-(x \cdot y)}{x \cdot y}\right|  \leq \mathrm{eps}_{64}.
\end{align}
</div>
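
The two bounds can be observed for a concrete pair. In the sketch below, the exact sum and product are formed in `BigFloat` arithmetic, whose default precision of $256$ bits is enough to represent both without rounding; the choice $x=0.1$, $y=0.2$ and all variable names are illustrative.

```julia
x, y = 0.1, 0.2
exact_sum  = big(x) + big(y)    # exact: fits into the 256-bit BigFloat mantissa
exact_prod = big(x) * big(y)    # exact: needs at most 106 bits
err_sum  = abs(big(x + y) - exact_sum)  / abs(exact_sum)
err_prod = abs(big(x * y) - exact_prod) / abs(exact_prod)
println(err_sum <= eps(), " ", err_prod <= eps())
```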

## Subnormal floating point numbers
\begin{equation*}
\mathbb{F}_{64,\mathrm{sub}}=\{\pm k 2^{-1074} : k=1,\ldots,2^{52}-1\},\qquad\qquad \mathrm{Float}64 = \mathbb{F}_{64}\cup\mathbb{F}_{64,\mathrm{sub}} \cup \{\pm 0,\pm \infty,\mathrm{NaN}\}
\end{equation*}


In [31]:
nextfloat(0.0) == 2^-1074

true

In [32]:
issubnormal(2^-1022),issubnormal(2^-1023),issubnormal(2^-1074)

(false, true, true)

In [33]:
2.0^-1075

0.0
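
The structure of $\mathbb{F}_{64,\mathrm{sub}}$ can be explored with `nextfloat`, `prevfloat` and `issubnormal`. A minimal sketch, in which `tiny` is an illustrative name and `ldexp(1.0, n)` returns $2^n$:

```julia
tiny = nextfloat(0.0)                                  # smallest positive Float64
println(tiny == ldexp(1.0, -1074))
println(issubnormal(prevfloat(floatmin(Float64))))     # largest subnormal number
println(prevfloat(floatmin(Float64)) == (2^52 - 1) * tiny)
println(nextfloat(3 * tiny) - 3 * tiny == tiny)        # constant gap 2^-1074
println(tiny / 2)                                      # below the smallest subnormal: 0.0
```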

| $\mathbb{F}_{64,\mathrm{sub}}$  | $52$-bit mantissa  |
|--------------------------------:|:--------------------:|
| $2^{-1074}$                     | $0\cdots001$        |
| $2\cdot 2^{-1074}$              | $0\cdots010$        |
| $3\cdot 2^{-1074}$              | $0\cdots011$        |
| $\vdots$                          | $\vdots$              |
| $(2^{52}-1)\cdot 2^{-1074}$     | $1\cdots111$        |

In [34]:
println(bitstring(2^-1074))
println(bitstring(2*2^-1074))
println(bitstring(3*2^-1074))
println(bitstring((2^52-1)*2^-1074))

0000000000000000000000000000000000000000000000000000000000000001
0000000000000000000000000000000000000000000000000000000000000010
0000000000000000000000000000000000000000000000000000000000000011
0000000000001111111111111111111111111111111111111111111111111111


In [35]:
issubnormal(0.0)  

false

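Computer rounding $\mathrm{fl}$ is extended so that it takes values in $\mathbb{F}_{64}\cup \mathbb{F}_{64,\mathrm{sub}}$. In the subnormal range, the relative rounding error is no longer bounded by $\mathrm{eps}_{64}$.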
<div class="alert alert-block alert-success">
<b>Example:</b>
The number $(1+\frac{1}{2})2^{-1074}$ is rounded to $2^{-1073}$ and we have
\begin{equation*}
\left|\frac{2^{-1073} - (1+\frac{1}{2})2^{-1074}}{(1+\frac{1}{2})2^{-1074}}  \right| = \frac{1}{3}.
\end{equation*}
</div>

In [36]:
(1+1/2)*2^(-1074) == 2^(-1073)

true

---

<div class="alert alert-block alert-info">
<b>Definition:</b> 
The set of machine numbers is $\mathbb{F}_{64}\cup \mathbb{F}_{64,\mathrm{sub}}$.
</div>

In Julia, we have 
\begin{equation*}
\mathrm{Float}64 = \mathbb{F}_{64}\cup \mathbb{F}_{64,\mathrm{sub}} \cup \{\pm 0, \pm \mathrm{Inf},\mathrm{NaN}\},
\end{equation*}
where $\pm\mathrm{Inf}=\pm \infty$ and $\mathrm{NaN}$ (Not a Number) is the result of undefined operations such as $0/0$.

In [37]:
println(bitstring(0.0))
println(bitstring(-0.0))

0000000000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000000000


In [38]:
0.0 == -0.0

true

In [40]:
1/0

Inf

In [41]:
0/0

NaN

In [None]:
typeof(NaN)

In [42]:
println(bitstring(Inf))
println(bitstring(1/0))
println(bitstring(-2/0))
println(bitstring(NaN))

0111111111110000000000000000000000000000000000000000000000000000
0111111111110000000000000000000000000000000000000000000000000000
1111111111110000000000000000000000000000000000000000000000000000
0111111111111000000000000000000000000000000000000000000000000000
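
The special values behave differently from ordinary numbers in comparisons and arithmetic. The sketch below collects a few such cases using only Base functions (`isnan`, `isinf`, `signbit`, `isequal`):

```julia
println(isnan(0/0), " ", NaN == NaN)             # NaN is not equal to itself
println(isinf(1/0), " ", Inf - Inf)              # Inf - Inf is undefined, hence NaN
println(signbit(-0.0), " ", 1 / -0.0)            # the sign of zero is kept: -Inf
println(isequal(0.0, -0.0), " ", 0.0 === -0.0)   # == ignores the sign of zero, these do not
```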


## Distribution of floating point numbers

In [27]:
using Plots
f(α) = 2. ^α
using WebIO
using Interact
@manipulate for n = -1021:1:1023
    Plots.vline(f.(-1022:n),label=L"2^α,\quad α=-1022,\ldots,"*"$n",title="there are "*L"2^{52}-1"*" Float64 numbers between two lines")
end
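
The count stated in the plot title can be verified without plotting. The sketch below uses the fact that positive `Float64` values are ordered like their bit patterns read as unsigned integers, so that differences of bit patterns count the numbers in between; the helper name `bits` and the three sample exponents are assumptions made for this example.

```julia
# Positive Float64 values are ordered like their bit patterns,
# so differences of the bit patterns count the numbers in between.
bits(x::Float64) = reinterpret(UInt64, x)   # hypothetical helper name

for α in (-1022, 0, 1022)
    lo, hi = ldexp(1.0, α), ldexp(1.0, α + 1)    # 2^α and 2^(α+1)
    println(bits(hi) - bits(lo) - 1 == 2^52 - 1) # numbers strictly between them
end
```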