## IEEE 754: Definition



For 32-bit (**single** precision):

$$
\underbrace{0}_\text{sign}
\overbrace{00000000}^\text{8-bit exponent}
\underbrace{00000000000000000000000}_\text{23-bit mantissa}
$$

For 64-bit (**double** precision):

$$
\underbrace{0}_\text{sign}
\overbrace{00000000000}^\text{11-bit exponent}
\underbrace{0000000000000000000000000000000000000000000000000000}_\text{52-bit mantissa}
$$
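The two layouts can be inspected directly in Julia. The following is a minimal sketch: `split_fields` is a hypothetical helper introduced here for illustration (it is not a library function), and it relies only on Base's `bitstring`.

```julia
# Hypothetical helper: split the bit string of a float into its three fields.
function split_fields(x::Union{Float32, Float64})
    bits = bitstring(x)                  # 32 or 64 characters of '0'/'1'
    nexp = x isa Float64 ? 11 : 8        # width of the exponent field
    s = bits[1:1]                        # sign bit
    e = bits[2:nexp + 1]                 # exponent bits
    m = bits[nexp + 2:end]               # mantissa bits
    return (s, e, m)
end

println(split_fields(1.0f0))   # single precision: 1 + 8 + 23 bits
println(split_fields(1.0))     # double precision: 1 + 11 + 52 bits
```

Returning the fields as strings keeps the sketch close to the diagrams above; a bit-mask version would work equally well.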

A normalized floating-point number $x$ is represented as:

$$
x = (-1)^s \cdot (1 + m) \cdot 2^{e}
$$

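where $s$ is the sign bit, $m = \sum^{52}_{i=1} m_i 2^{-i}$ is the mantissa (the sum runs to $23$ in single precision), and $e = \sum^{10}_{i = 0} c_i 2^i - \overline{c}$ is the exponent (the sum runs to $7$ in single precision). The bias is $\overline{c} = 1023$ in double precision and $\overline{c} = 127$ in single precision.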
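As an illustration of the formula, here is a minimal sketch that applies $x = (-1)^s \cdot (1 + m) \cdot 2^e$ to the raw bits of a `Float64`. The name `decode_normalized` is hypothetical, and the sketch deliberately rejects the special exponent fields (all zeros or all ones).

```julia
# Hypothetical helper: rebuild a normalized Float64 from its three bit fields.
function decode_normalized(x::Float64)
    bits = reinterpret(UInt64, x)
    s = Int(bits >> 63)                    # sign bit
    c = Int((bits >> 52) & 0x7ff)          # 11-bit exponent field (still biased)
    f = bits & 0x000fffffffffffff          # 52-bit mantissa field as an integer
    0 < c < 2047 || throw(ArgumentError("only normalized numbers are handled"))
    m = f / 2.0^52                         # m = sum of m_i * 2^(-i), in [0, 1)
    e = c - 1023                           # subtract the bias
    sgn = s == 0 ? 1.0 : -1.0
    return ldexp(sgn * (1 + m), e)         # (-1)^s * (1 + m) * 2^e, computed exactly
end

println(decode_normalized(3.0))
println(decode_normalized(-0.1) == -0.1)
```

`ldexp` is used instead of `2.0^e` so that the scaling by a power of two is exact for every valid exponent.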

## Special Cases for IEEE 754 Representation

### Zero

The formula above can never produce exactly $0$, because $1 + m \geq 1$, so IEEE 754 treats zero as a special case. If the sign bit, the exponent bits, and the mantissa bits are all $0$, we get:

$$
\underbrace{0}_\text{sign}
\overbrace{00000000}^\text{8-bit exponent}
\underbrace{00000000000000000000000}_\text{23-bit mantissa}
$$

This is called **positive zero (written as $+0$)**. Meanwhile,

$$
\underbrace{1}_\text{sign}
\overbrace{00000000}^\text{8-bit exponent}
\underbrace{00000000000000000000000}_\text{23-bit mantissa}
$$

is called **negative zero (written as $-0$)**.

**Therefore, regardless of the sign bit, if the mantissa bits and the exponent bits are all zero, then the value is $0$.**
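The two zeros can be observed in Julia. This short sketch uses only Base functions (`bitstring`, `isequal`, `signbit`) and is meant as an illustration rather than a complete treatment of signed zeros.

```julia
pos_zero = 0.0
neg_zero = -0.0

println(bitstring(pos_zero))           # sign bit 0, every other bit 0
println(bitstring(neg_zero))           # sign bit 1, every other bit 0
println(pos_zero == neg_zero)          # numeric comparison treats them as equal
println(isequal(pos_zero, neg_zero))   # isequal distinguishes the two zeros
println(signbit(neg_zero))             # the sign bit is still observable
```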

### Denormalized Numbers

If the exponent bits are all zero and $m \neq 0$, the number is considered **denormalized** (also called **subnormal**).

In a normalized number, we specify $m$ on the assumption that the value can be represented as:

$$
x = (-1)^s \times 1.m \times 2^e = (-1)^s \times 1.d_0d_1d_2... \times 2^e
$$

where $d_0, d_1, d_2, \dots$ are the binary digits of $m$ after the binary point. That is, we assume that the mantissa is in the range $[1, 2)$.

However, if the exponent bits are all zero, we assume that the mantissa is in the range $[0, 1)$ instead and fix the exponent at $1 - \overline{c}$:

$$
x = (-1)^s \times 0.m \times 2^{1-\overline{c}} = (-1)^s \times 0.d_0d_1d_2... \times 2^{1-\overline{c}}
$$

For example, this 32-bit number is denormalized:

$$
\underbrace{0}_\text{sign}
\overbrace{00000000}^\text{8-bit exponent}
\underbrace{10001111100110111100010}_\text{23-bit mantissa}
$$

So its value is

$$
x = (-1)^0 \times (0 + 2^{-1} + 2^{-5} + 2^{-6} + \dots + 2^{-22}) \times 2^{-126}
$$
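The example can be checked numerically. The sketch below is illustrative (the variable names are made up for this purpose): it reinterprets the bit pattern as a `Float32` and compares it with the value rebuilt by hand from the mantissa bits, where a set bit $m_i$ contributes $2^{-i}$.

```julia
# The example bit pattern: sign, exponent field, mantissa field.
bits = "0" * "00000000" * "10001111100110111100010"
x = reinterpret(Float32, parse(UInt32, bits; base = 2))

# Rebuild the value by hand: each set mantissa bit m_i contributes 2^(-i),
# and the exponent is fixed at 1 - 127 = -126 for denormalized numbers.
mant_bits = bits[10:end]
frac = sum(ldexp(1.0, -i) for i in 1:23 if mant_bits[i] == '1')
by_hand = ldexp(frac, -126)

println(x)
println(issubnormal(x))
println(Float64(x) == by_hand)
```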

According to this rule, **the smallest positive non-zero 32-bit float is**

$$
\begin{align*}
\underbrace{0}_\text{sign}
\overbrace{00000000}^\text{8-bit exponent}
\underbrace{00000000000000000000001}_\text{23-bit mantissa}
&= (-1)^0 \times (0 + 2^{-23}) \times 2^{-126}\\
&= 2^{-149}
\end{align*}
$$

And **the smallest positive non-zero 64-bit float is**

$$
\begin{align*}
\underbrace{0}_\text{sign}
\overbrace{00000000000}^\text{11-bit exponent}
\underbrace{0000000000000000000000000000000000000000000000000001}_\text{52-bit mantissa}
&= (-1)^0 \times (0 + 2^{-52}) \times 2^{-1022}\\
&= 2^{-1074}
\end{align*}
$$
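Both minima can be confirmed with Base functions. This is a small sketch, assuming `nextfloat(0)` is an acceptable way to obtain the smallest positive value; `ldexp(1.0, k)` is used to build $2^k$ exactly.

```julia
# Smallest positive denormalized values: the floats immediately after zero.
min32 = nextfloat(0.0f0)
min64 = nextfloat(0.0)

println(bitstring(min32))                # only the last mantissa bit is set
println(min32 == ldexp(1.0f0, -149))     # 2^-149
println(min64 == ldexp(1.0, -1074))      # 2^-1074

# For comparison, floatmin gives the smallest *normalized* positive value.
println(floatmin(Float32) == ldexp(1.0f0, -126))
println(floatmin(Float64) == ldexp(1.0, -1022))
```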

### Infinity

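If the exponent bits are all $1$s (so that $e = 1024$ for double precision, or $e = 128$ for single precision, after subtracting the bias) and $m = 0$, then the value is $\infty$ (infinity).

Therefore, the largest exponent field available to finite numbers in double precision is $11111111110_2 = 2046$, which gives a maximum exponent of $e = 2046 - 1023 = 1023$.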

With a sign bit of $0$, the bit pattern

$$
\underbrace{0}_\text{sign}
\overbrace{11111111}^\text{8-bit exponent}
\underbrace{00000000000000000000000}_\text{23-bit mantissa}
$$

is **positive infinity**, written as $+\infty$.

And with a sign bit of $1$, the bit pattern

$$
\underbrace{1}_\text{sign}
\overbrace{11111111}^\text{8-bit exponent}
\underbrace{00000000000000000000000}_\text{23-bit mantissa}
$$

is **negative infinity**, written as $-\infty$.
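The infinity patterns can be built from their bit fields and compared with Julia's own constants. This is only a sketch: the patterns are assembled with string repetition (`"1"^8` is eight ones) so that the field widths stay visible.

```julia
# Single precision: sign bit, 8 exponent bits of all ones, 23 zero mantissa bits.
pos_inf = reinterpret(Float32, parse(UInt32, "0" * "1"^8 * "0"^23; base = 2))
neg_inf = reinterpret(Float32, parse(UInt32, "1" * "1"^8 * "0"^23; base = 2))

println(pos_inf)
println(neg_inf)

# Double precision uses 11 exponent bits and 52 mantissa bits.
println(bitstring(Inf) == "0" * "1"^11 * "0"^52)

# Overflowing the largest finite value also produces infinity.
println(floatmax(Float32) * 2.0f0)
```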

### NaN (Not a Number)

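Infinity has no meaningful mantissa, so the remaining patterns with an all-ones exponent field are given a different meaning. That is, if the exponent field is at its **maximum** (all $1$s) and $m$ is **non-zero**, then the value is $\text{NaN}$.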

For example,

$$
\underbrace{0}_\text{sign}
\overbrace{11111111}^\text{8-bit exponent}
\underbrace{10101010101010101010101}_\text{23-bit mantissa}
$$

is $\text{NaN}$.
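A quick check of the NaN rule, as a sketch using the single-precision layout (8 exponent bits, 23 mantissa bits) and the mantissa from the example:

```julia
# All-ones exponent field with a non-zero mantissa.
bits = "0" * "1"^8 * "10101010101010101010101"
x = reinterpret(Float32, parse(UInt32, bits; base = 2))

println(isnan(x))
println(x == x)              # NaN compares unequal to everything, itself included
println(isnan(Inf - Inf))    # arithmetic with no meaningful result also yields NaN
```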

# Exercise

Import the `Printf` standard library, which provides the `@printf` macro.

In [2]:
using Printf

Define `mantissa()` to extract the mantissa from a 32-bit or 64-bit string.

In [3]:
@enum IEEE754Precision single double

In [4]:
function mantissa(
        bitstr::String,
        denormalized::Bool = false,
        precision::IEEE754Precision = double,
        debug::Bool = false,
    )::Float64
    
    bitlen::Int64 = 52
    if precision == single
        bitlen = 23
    end
    
    bitstr::String = bitstr[end - bitlen + 1 : end]
    ret::Float64 = 1.0
    if denormalized
        ret = 0.0
    end
    if debug
        @printf("   i |  monomial  |  mantissa  \n")
        @printf("-----|------------|------------\n")
    end
    for i in 1:bitlen
        b::Int = parse(Int, bitstr[i])
        tmp::Float64 = b * 2.0^(-i)
        ret += tmp
        if debug
            @printf(" %3d | %.8f | %.8f \n", i, tmp, ret)
        end
    end
    
    return ret
end

mantissa (generic function with 4 methods)

Define `exponent()` to extract the exponent from a 32-bit or 64-bit string.

In [5]:
function exponent(
        bitstr::String,
        precision::IEEE754Precision = double,
        debug::Bool = false,
    )::Tuple{Int64, Bool}
    
    bitlen::Int64 = 11
    if precision == single
        bitlen = 8
    end
    
    bitstr::String = bitstr[2 : bitlen + 1]
    ret::Int64 = 0
    if debug
        @printf("   i | monomial | exponent \n")
        @printf("-----|----------|----------\n")
    end
    for i in 0:bitlen-1
        b::Int = parse(Int, bitstr[i + 1])
        tmp::Int64 = b * (1 << (bitlen - 1 - i))
        ret += tmp
        if debug
            @printf(" %3d | %8d | %8d \n", i, tmp, ret)
        end
    end
    
    biased_term::Int64 = precision == double ? 1023 : 127
    ret -= biased_term
    if debug
        @printf("==> subtract with %d: %d\n", biased_term, ret)
    end
    
    denormalized::Bool = false
    if ret == -biased_term
        ret += 1
        denormalized = true
        if debug
            @printf("====> denormalized to %d\n", ret)
        end
    end
    
    return (ret, denormalized)
end

exponent (generic function with 3 methods)

In [6]:
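function calcIEEE754Double(bitstr::String, debug::Bool = false)::Float64
    # A 64-character string is decoded as double precision, anything else as single.
    precision::IEEE754Precision = length(bitstr) == 64 ? double : single
    
    expo, denormalized = exponent(bitstr, precision, debug)
    mant = mantissa(bitstr, denormalized, precision, debug)
    sign::Int64 = bitstr[1] == '0' ? 1 : -1
    
    # An all-ones exponent field (1024 or 128 after removing the bias) marks Inf or NaN.
    maxexpo::Bool = (expo == 1024 && precision == double) || (expo == 128 && precision == single)
    
    # mant == 1.0 means that every mantissa bit is zero: infinity.
    if maxexpo && mant == 1.0
        if bitstr[1] == '0'
            return Inf
        else
            return -Inf
        end
    end
    
    # Any non-zero mantissa bit with the maximum exponent: NaN.
    if maxexpo && mant != 1.0
        return NaN
    end
    
    # Normalized and denormalized numbers share the same final formula.
    return sign * mant * 2.0^expo
end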

calcIEEE754Double (generic function with 2 methods)
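As an independent cross-check for `calcIEEE754Double`, the bits can also be handed to Julia directly. The sketch below assumes a hypothetical helper named `decode_bits`; it does not reproduce the step-by-step decoding above and only relies on Base's `parse` and `reinterpret`.

```julia
# Hypothetical cross-check: let Julia reinterpret the bit string as a float.
function decode_bits(bitstr::String)::Float64
    if length(bitstr) == 64
        return reinterpret(Float64, parse(UInt64, bitstr; base = 2))
    elseif length(bitstr) == 32
        # Widening Float32 to Float64 is exact, so no information is lost.
        return Float64(reinterpret(Float32, parse(UInt32, bitstr; base = 2)))
    else
        throw(ArgumentError("expected a 32- or 64-character bit string"))
    end
end

println(decode_bits(bitstring(3.0)))
println(decode_bits(bitstring(0.1)) == 0.1)
println(decode_bits("0" * "1"^11 * "0"^52))
```

Comparing the output of this helper with the manual decoder is a cheap way to catch indexing mistakes in the field boundaries.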

Verify the result.

In [7]:
input = bitstring(3.0)
println("value: $(3.0)")
println("bit string: $input")

m = mantissa(input)
e, denom = exponent(input)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: 3.0
bit string: 0100000000001000000000000000000000000000000000000000000000000000


mantissa: 1.500000000000000000000000000000000000000000000
exponent: 1


x = 1.50 * 2^1 = 3.000000000000000000000000000000000000000000000


In [8]:
input = bitstring(1.1)
println("value: $(1.1)")
println("bit string: $input")

m = mantissa(input)
e, denom = exponent(input)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: 1.1
bit string: 0011111111110001100110011001100110011001100110011001100110011010
mantissa: 1.100000000000000088817841970012523233890533447
exponent: 0
x = 1.10 * 2^0 = 1.100000000000000088817841970012523233890533447


In [9]:
input = bitstring(0.1)
println("value: $(0.1)")
println("bit string: $input")

m = mantissa(input)
e, denom = exponent(input)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: 0.1
bit string: 0011111110111001100110011001100110011001100110011001100110011010
mantissa: 1.600000000000000088817841970012523233890533447
exponent: -4
x = 1.60 * 2^-4 = 0.100000000000000005551115123125782702118158340


In [10]:
input = "01111111110101010101010101010101"
println("value: $(NaN)")
println("bit string: $input")

m = mantissa(input, false, single)
e, denom = exponent(input, single)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: NaN
bit string: 01111111110101010101010101010101


mantissa: 1.666666626930236816406250000000000000000000000
exponent: 128
x = 1.67 * 2^128 = NaN


In [11]:
input = "0111111111110000000000000000000000000000000000000000000000000000"
println("value: $(Inf)")
println("bit string: $input")

m = mantissa(input)
e, denom = exponent(input)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: Inf
bit string: 0111111111110000000000000000000000000000000000000000000000000000
mantissa: 1.000000000000000000000000000000000000000000000
exponent: 1024
x = 1.00 * 2^1024 = Inf


In [12]:
input = "1111111111110000000000000000000000000000000000000000000000000000"
println("value: $(-Inf)")
println("bit string: $input")

m = mantissa(input)
e, denom = exponent(input)

@printf("mantissa: %.45f\n", m)
@printf("exponent: %d\n", e)

@printf("x = %.2f * 2^%d = %.45f\n", m, e, calcIEEE754Double(input))

value: -Inf
bit string: 1111111111110000000000000000000000000000000000000000000000000000
mantissa: 1.000000000000000000000000000000000000000000000
exponent: 1024
x = 1.00 * 2^1024 = -Inf


# Additional Homework

## Show that the largest possible number in 64-bit IEEE floating point is $2^{1023} (2-2^{-52})$

Before doing the boring math, let's take the **coding** route to find the bit representation, which gives us a clue on how to prove the claim correctly.

In [14]:
x = 2.0^1023 * (2.0 - 2.0^(-52))
bitx = bitstring(x)
@printf("decimal: %f\n", x)
@printf("bits: %s\n", bitx)

m = mantissa(bitx)
e, denom = exponent(bitx)

@printf("mantissa: %s\n", m)
@printf("exponent: %s\n", e)

@printf("x = %.2f * 2^%.2f = %.2f\n", m, e, calcIEEE754Double(bitx))

decimal: 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
bits: 0111111111101111111111111111111111111111111111111111111111111111
mantissa: 1.9999999999999998
exponent: 1023


x = 2.00 * 2^1023.00 = 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.00


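The last line prints the mantissa as $2.00$ only because `%.2f` rounds $1.9999999999999998$ to two decimal places, so it does not match our expectation, but the bit representation can still tell us something.

The largest possible finite number is written as a 64-digit binary number: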

$$
\underbrace{0}_\text{sign}
\overbrace{11111111110}^\text{11-bit exponent}
\underbrace{1111111111111111111111111111111111111111111111111111}_\text{52-bit mantissa}
$$

The first digit is the sign bit for the whole number, which is zero to indicate that the number is positive.

The next 11 digits are the exponent field: ten ones followed by a zero, so $11111111110_2 = 2046$ and $e = 2046 - 1023 = 1023$. Notice that the 12th digit (the rightmost digit of the exponent field) is zero because that gives the maximum exponent available to finite numbers. *Note: an exponent field of all 1's is reserved for infinity and NaN, and infinity is not considered a real number by mathematicians*.

Let's consider the mantissa and the exponent's values.

Consider
$$
\begin{align*}
e &= 0 \cdot 2^0 + 1 \cdot 2^1 + 1 \cdot 2^2 + ... +  1 \cdot 2^{10} - 1023 \\
&= 2046 - 1023 \\
&= 1023
\end{align*}
$$

Consider

$$
\begin{align*}
m &= 2^{-52} + \cancel{2^{-51} + ... + 2^{-1}} &\_\_\_(1)\\
2m &= \cancel{2^{-51} + ...  + 2^{-1}} + 2^{0} & \_\_\_ (2)\\
(2)-(1);\ \ \ \ \ 2m - m &= 2^{0} - 2^{-52}\\
m &= 1 - 2^{-52}
\end{align*}
$$

Consider the IEEE form of floating-point representation

$$
\begin{align*}
(-1)^s \cdot (1 + m) \cdot 2^{e}
&= \cancel{(-1)^0} \cdot (1 + 1 - 2^{-52}) \cdot 2^{1023} \\
&= 2^{1023} \cdot (2 - 2^{-52})
\end{align*}
$$

$$
\therefore \text{The maximum value of 64-bit IEEE float is } 2^{1023}(2 - 2^{-52})
$$
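The result can be confirmed numerically. This is a short sketch that compares the closed form with Base's `floatmax` and `prevfloat`, and also checks the geometric sum used for $m$; the variable names are chosen here for illustration.

```julia
largest = 2.0^1023 * (2.0 - 2.0^(-52))

println(largest == floatmax(Float64))     # Julia's largest finite Float64
println(largest == prevfloat(Inf))        # the float immediately below infinity
println(bitstring(largest) == "0" * "1"^10 * "0" * "1"^52)

# The mantissa with all 52 bits set: 2^-1 + 2^-2 + ... + 2^-52 = 1 - 2^-52.
m_all_ones = sum(ldexp(1.0, -i) for i in 1:52)
println(m_all_ones == 1 - 2.0^(-52))
```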