In [None]:
// run this cell to prevent Jupyter from displaying the null output cell
com.twosigma.beakerx.kernel.Kernel.showNullExecutionResult = false;

<a id='notebook_id'></a>
# Representing real numbers on a computer

On most consumer desktop and handheld computers, real numbers are approximated using 32- or 64-digit binary values (32 or 64 bits).
Using such values to illustrate the characteristics of arithmetic is cumbersome because of the number of digits and the use of
binary (as opposed to base 10 that most humans are accustomed to).

Instead of using binary values, we begin by investigating some properties of a representation of real values where up to 4 digits 
(between 0 and 9) are used.

## Fixed-point representation

Suppose that real values are approximated using a type `fixed_pt` where every `fixed_pt` value has the following form:

$$\pm\boxed{d_1}\boxed{d_2}.\boxed{d_3}\boxed{d_4}$$

where each $d_i$ is a single digit between 0 and 9.
In other words, a `fixed_pt` value can be positive or negative and *always* has two digits before the decimal point and two digits after the decimal point.
For example, the value closest to $\pi$ in `fixed_pt` is:

$$+03.14$$

Because there is always a fixed number of digits after the decimal point, such a representation is called a *fixed-point* representation.
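One way to picture a `fixed_pt` value is as a whole number of hundredths. The following is a minimal sketch under that assumption; the class `FixedPtSketch` and its methods are hypothetical names introduced only for illustration, and range checking is deliberately left out because the exercises below explore it.

```java
import java.util.Locale;

class FixedPtSketch {
    // Round a real value to the nearest hundredth and return the number of
    // hundredths (an illustrative encoding, not part of the notebook's definition).
    static long toHundredths(double value) {
        return Math.round(value * 100.0);
    }

    // Format a count of hundredths in the sign, two digits, point, two digits layout.
    static String format(long hundredths) {
        char sign = hundredths < 0 ? '-' : '+';
        long magnitude = Math.abs(hundredths);
        return String.format(Locale.ROOT, "%c%02d.%02d",
                             sign, magnitude / 100, magnitude % 100);
    }

    public static void main(String[] args) {
        // The value closest to pi with two digits after the decimal point.
        System.out.println(format(toHundredths(Math.PI)));  // prints +03.14
    }
}
```

Storing the digits as an integer keeps the position of the decimal point implicit, which is the defining feature of a fixed-point format.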

**Exercise 1** What is the smallest positive `fixed_pt` value? Call this value $a$.

**Exercise 2** What is the largest positive `fixed_pt` value? Call this value $b$.

**Exercise 3** Is there a `fixed_pt` value that can represent $a+b$ exactly?

**Exercise 4** Is there a `fixed_pt` value that can represent $b-a$ exactly?

**Exercise 5** Consider the two `fixed_pt` values $x = +01.01$ and $y = +10.10$. Is there a `fixed_pt` value that exactly represents the sum $x + y$?

**Exercise 6** Consider all `fixed_pt` values $c$ and the `fixed_pt` value closest to but not equal to $c$. What is the distance between these two values?

**Exercise 7** What is the range of `fixed_pt` values?

**Exercise 8** How many different `fixed_pt` values correspond to the numeric value $1$?

Representing real values using a fixed-point representation is uncommon but has some applications (for example, in finance, where
values are always rounded to the nearest cent).
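As an illustration of the financial use case, Java's `java.math.BigDecimal` can hold an amount with exactly two digits after the decimal point. This is only a sketch: the class `CentsSketch`, the price, the tax rate, and the choice of `RoundingMode.HALF_UP` are assumptions made for the example, and real applications follow whatever rounding rule their domain requires.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class CentsSketch {
    // Round an amount to the nearest cent (two digits after the decimal point).
    // HALF_UP is an assumed rounding rule chosen for this example.
    static BigDecimal toCents(BigDecimal amount) {
        return amount.setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        BigDecimal price = new BigDecimal("19.99");    // hypothetical price
        BigDecimal taxRate = new BigDecimal("0.13");   // hypothetical tax rate
        BigDecimal tax = toCents(price.multiply(taxRate));  // 2.5987 rounds to 2.60
        System.out.println(tax);             // prints 2.60
        System.out.println(price.add(tax));  // prints 22.59
    }
}
```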

## Floating-point representation

Suppose that real values are approximated using a type `float_pt` where every `float_pt` value is made up of two values:

$$s = \pm\boxed{d_1}\boxed{d_2}\boxed{d_3}$$

$$e = \boxed{d_4} - 5$$

which corresponds to the numeric value $$v = s \times 10^e.$$

Furthermore, it is also the case that $d_1 \neq 0$; this prevents leading zeroes in $s$. If there are no leading zeroes allowed, then
it is impossible to represent the value $0$. To solve this problem, we say that the `float_pt` value:

$$s = \pm\boxed{1}\boxed{0}\boxed{0}$$

$$e = \boxed{0} - 5$$

represents the value $0$.

In summary, a `float_pt` value has three digits and a decimal point whose position is determined by the value of the exponent $e$. For example, the value
closest to $\pi$ in `float_pt` is:

$$s = +\boxed{3}\boxed{1}\boxed{4}$$

$$e = \boxed{3} - 5 = -2$$

where $$v = s \times 10^e = 314 \times 10^{-2} = 3.14.$$

Because the decimal point can move *or float* between digits, such a representation is called a *floating-point* representation.
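The rule $v = s \times 10^e$ with $e = d_4 - 5$ can be sketched in Java as follows. The class `FloatPtSketch` and the method `decode` are hypothetical names; `BigDecimal` is used so that the decimal result is exact, and the special pattern for zero follows the convention described above.

```java
import java.math.BigDecimal;

class FloatPtSketch {
    static final int BIAS = 5;

    // Decode a float_pt value from its signed three-digit significand s
    // and its stored exponent digit d4 (illustrative helper).
    static BigDecimal decode(int s, int d4) {
        int magnitude = Math.abs(s);
        if (magnitude < 100 || magnitude > 999) {
            throw new IllegalArgumentException("significand needs three digits with d1 != 0");
        }
        if (d4 < 0 || d4 > 9) {
            throw new IllegalArgumentException("d4 must be a single digit");
        }
        // By the convention in the text, +/-100 with d4 = 0 stands for zero.
        if (magnitude == 100 && d4 == 0) {
            return BigDecimal.ZERO;
        }
        return BigDecimal.valueOf(s).scaleByPowerOfTen(d4 - BIAS);
    }

    public static void main(String[] args) {
        // s = 314 with e = 3 - 5 = -2 gives 314 x 10^-2.
        System.out.println(decode(314, 3).toPlainString());  // prints 3.14
        System.out.println(decode(100, 0).toPlainString());  // prints 0
    }
}
```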

The value $s$ is called the *significand* and the value $e$ is called the *exponent*. Notice that the exponent is defined as a value minus some
constant value (in this case $5$); the constant value is called the *bias* of the exponent. The reason a bias is used is so that negative
exponents can be obtained without having to explicitly store the sign of the exponent.
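The effect of the bias can be seen by converting between the stored digit $d_4$ and the exponent $e$ in both directions. This is a small sketch with hypothetical names (`BiasSketch`, `exponentOf`, `storedDigitFor`); it assumes the bias of $5$ used in this notebook.

```java
class BiasSketch {
    static final int BIAS = 5;

    // Exponent represented by a stored digit: e = d4 - bias.
    static int exponentOf(int storedDigit) {
        if (storedDigit < 0 || storedDigit > 9) {
            throw new IllegalArgumentException("stored digit must be between 0 and 9");
        }
        return storedDigit - BIAS;
    }

    // Stored digit needed for a given exponent: d4 = e + bias.
    static int storedDigitFor(int exponent) {
        int digit = exponent + BIAS;
        if (digit < 0 || digit > 9) {
            throw new IllegalArgumentException("exponent cannot be stored in one digit");
        }
        return digit;
    }

    public static void main(String[] args) {
        System.out.println(exponentOf(3));       // prints -2
        System.out.println(storedDigitFor(-2));  // prints 3
    }
}
```

Only an unsigned digit is stored, yet negative exponents are available because the subtraction of the bias happens when the value is interpreted.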

The number of digits in the significand is often given the symbol $p$. A `float_pt` value has $p=3$.

When leading zeroes are not allowed in the significand, the format of the floating-point number is said to be *normalized*. If leading zeroes are
allowed in the significand, then the format of the floating-point number is said to be *denormalized* or *subnormal*. The representation
used above results in a normalized representation.
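For comparison, Java's `double` type (discussed in the IEEE 754 section of this notebook) supports both normalized and subnormal values. The sketch below uses the standard constants `Double.MIN_NORMAL` and `Double.MIN_VALUE`; the class `SubnormalSketch` and the helper `isSubnormal` are hypothetical names introduced for illustration.

```java
class SubnormalSketch {
    // A nonzero double is subnormal when its magnitude is below the
    // smallest positive normalized value (illustrative helper).
    static boolean isSubnormal(double x) {
        return x != 0.0 && Math.abs(x) < Double.MIN_NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(Double.MIN_NORMAL);  // smallest positive normalized double
        System.out.println(Double.MIN_VALUE);   // smallest positive subnormal double
        System.out.println(isSubnormal(Double.MIN_VALUE));  // prints true
        System.out.println(isSubnormal(1.0));               // prints false
    }
}
```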

**Exercise 9** What is the smallest positive `float_pt` value? Call this value $a$.

**Exercise 10** What is the largest positive `float_pt` value? Call this value $b$.

**Exercise 11** Is there a `float_pt` value that can represent $a+b$ exactly?

**Exercise 12** Is there a `float_pt` value that can represent $b-a$ exactly?

**Exercise 13** Consider the two `float_pt` values $x = 1.01$ and $y = 10.1$. Is there a `float_pt` value that exactly represents the sum $x + y$?

**Exercise 14** Consider all `float_pt` values $c$ and the `float_pt` value closest to but not equal to $c$. Is the distance between these two values the
same for all values of $c$?

**Exercise 15** What is the range of `float_pt` values?

**Exercise 16** How many different `float_pt` values correspond to the numeric value $1$?

**Exercise 17** Suppose that we allow $d_1$, $d_2$, and $d_3$ to have any value (including $0$). 
How many different `float_pt` values correspond to the numeric value $1$?

## IEEE 754

Real values are most often approximated as floating-point values in general-purpose computing. 
The [IEEE Standard for Floating-Point Arithmetic (IEEE 754)](https://en.wikipedia.org/wiki/IEEE_754)
is the standard for floating-point arithmetic used in most modern CPUs intended for general-purpose computing.
The primary architect of the original 1985 standard was the Canadian mathematician and computer scientist
[William Kahan](https://en.wikipedia.org/wiki/William_Kahan), who won the 1989 Turing Award for contributions to numerical analysis.

Java's `float` and `double` primitive types correspond to the single-precision floating-point format
[binary32](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) 
and the double-precision floating-point format
[binary64](https://en.wikipedia.org/wiki/Double-precision_floating-point_format), respectively.
The details of these formats are beyond the scope of this notebook; interested readers are encouraged to follow
the links to learn more about these formats.
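Without going into the layout of the formats, the sizes of the two types and the raw bit patterns of a value can be inspected with standard library calls. The sketch below relies on `Float.SIZE`, `Double.SIZE`, `Float.floatToIntBits`, and `Double.doubleToLongBits`; the class `BitPatternSketch` and its two helper methods are hypothetical names.

```java
class BitPatternSketch {
    // Hexadecimal bit pattern of a float (32 bits, 8 hex digits).
    static String floatBits(float x) {
        return String.format("%08x", Float.floatToIntBits(x));
    }

    // Hexadecimal bit pattern of a double (64 bits, 16 hex digits).
    static String doubleBits(double x) {
        return String.format("%016x", Double.doubleToLongBits(x));
    }

    public static void main(String[] args) {
        System.out.println(Float.SIZE);       // prints 32
        System.out.println(Double.SIZE);      // prints 64
        System.out.println(floatBits(1.0f));  // prints 3f800000
        System.out.println(doubleBits(1.0));  // prints 3ff0000000000000
    }
}
```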

What is important to know is that every finite floating-point value in Java has the numeric value

$$\pm s \times 2^e$$

where the significand $s$ and the exponent $e$ are both integer values. The $2$ appears because computers
use base-2 or binary representation for values.
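To make the form $\pm s \times 2^e$ concrete, the following sketch recovers integers $s$ and $e$ from a finite `double` by reading its bit pattern. The class `DecomposeSketch` and the method `decompose` are hypothetical names, and the constants `52`, `1075`, and `1074` are properties of the binary64 format that the notebook itself does not cover; treat this as an illustration rather than a reference implementation.

```java
class DecomposeSketch {
    // Returns {s, e} such that x == s * 2^e exactly, with s odd unless x is zero.
    static long[] decompose(double x) {
        if (Double.isNaN(x) || Double.isInfinite(x)) {
            throw new IllegalArgumentException("x must be finite");
        }
        long bits = Double.doubleToLongBits(x);
        boolean negative = bits < 0;                    // the sign is the top bit
        int expField = (int) ((bits >>> 52) & 0x7FF);   // 11-bit exponent field
        long fraction = bits & ((1L << 52) - 1);        // low 52 bits
        long s;
        int e;
        if (expField == 0) {          // zero or subnormal: no implicit leading 1
            s = fraction;
            e = -1074;
        } else {                      // normalized: restore the implicit leading 1
            s = fraction | (1L << 52);
            e = expField - 1075;
        }
        if (s == 0) {
            return new long[] {0, 0};
        }
        while ((s & 1) == 0) {        // move factors of two from s into the exponent
            s >>= 1;
            e++;
        }
        return new long[] {negative ? -s : s, e};
    }

    public static void main(String[] args) {
        long[] r = decompose(0.75);
        System.out.println(r[0] + " x 2^" + r[1]);  // prints 3 x 2^-2
        r = decompose(6.0);
        System.out.println(r[0] + " x 2^" + r[1]);  // prints 3 x 2^1
    }
}
```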

The base-2 representation means that many base-10 numbers cannot be represented exactly. For example, the Java statement:

```java
double x = 0.1;
```

produces a value of `x` that is not exactly equal to $0.1$ because $0.1$ cannot be expressed as $s \times 2^e$. The proof is by contradiction:

Assume that there are some integer values $s$ and $e$ such that

$$s \times 2^e = 0.1$$

is true. Because $s$ is an integer and $0.1$ is not, the exponent $e$ must be negative (otherwise $s \times 2^e$ would be an integer), and we can write:

$$s \times \frac{1}{2^{-e}} = 0.1$$

Multiplying both sides of the equality by $10 \times 2^{-e}$ yields:

$$10s = 2^{-e}$$

which is a contradiction because the prime factorization of the left-hand side must include a $5$ but the 
prime factorization of the right-hand side includes only $2$s. Therefore, $0.1$ has no exact representation of the
form $s \times 2^e$, and it must lie between two floating-point values.

The actual value of `x` is approximately $0.1 + 5.55\times 10^{-18}$, so the magnitude of the representation error
is very small, but the fact remains that `x` is not exactly equal to $0.1$.
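The exact value stored in `x` can be inspected with the `BigDecimal(double)` constructor, which converts the binary value to decimal without rounding. The sketch below uses hypothetical names (`ExactValueSketch`, `exactValue`, `errorFrom`); note that this constructor throws a `NumberFormatException` for NaN and infinite arguments.

```java
import java.math.BigDecimal;

class ExactValueSketch {
    // Exact decimal expansion of the binary value stored in a finite double.
    static BigDecimal exactValue(double x) {
        return new BigDecimal(x);
    }

    // Difference between the stored value and the decimal number that was intended.
    static BigDecimal errorFrom(double x, String intendedDecimal) {
        return exactValue(x).subtract(new BigDecimal(intendedDecimal));
    }

    public static void main(String[] args) {
        double x = 0.1;
        // The printed expansion begins 0.1000000000000000055511...
        System.out.println(exactValue(x).toPlainString());
        // The representation error is a small positive number near 5.55e-18.
        System.out.println(errorFrom(x, "0.1"));
    }
}
```

The string constructor `new BigDecimal("0.1")` is used for the intended value because it represents the decimal number $0.1$ exactly, unlike the `double` literal.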

**Exercise 18** Show that 0.3 cannot have an exact representation in IEEE 754.

**Exercise 19** Show that 0.5 can have an exact representation in IEEE 754.