# Numerical Instability

Three major kinds of numerical error are covered here: round-off error, overflow and underflow, and cancellation error.

## Round-off Error or Rounding Error

Rounding error arises in calculations with **floating-point numbers**. Under IEEE 754, the number of bits in the **mantissa** determines the maximum possible precision (24 significant bits for a 32-bit float, 53 for a 64-bit float). Most decimal fractions cannot be written as a finite sum of negative powers of 2, so they are rounded to the nearest representable value.

For example, $0.1$ is simply $10^{-1}$ in decimal representation, but in binary it is $0.0001100110011001100110011\ldots_2$ (after the first $0$ following the binary point, the block $0011$ repeats forever).

In 32-bit precision, the value actually stored for $0.1$ is $0.100000001490116119384765625$. The error of about $1.5 \times 10^{-9}$ looks negligible, but it propagates and can grow through subsequent calculations.
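
The binary expansion and the stored value can be checked with a short sketch. Python is used here purely for illustration. `to_f32` and `binary_fraction_digits` are hypothetical helper names, and `to_f32` assumes that `ctypes.c_float` rounds to an IEEE 754 32-bit float, as it does on common platforms.

```python
import ctypes
from decimal import Decimal
from fractions import Fraction


def to_f32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (assumed IEEE 754)."""
    return ctypes.c_float(value).value


def binary_fraction_digits(fraction, count):
    """Return the first `count` binary digits after the point, for 0 <= fraction < 1."""
    digits = []
    for _ in range(count):
        fraction *= 2
        bit = int(fraction)  # the integer part is the next binary digit
        digits.append(str(bit))
        fraction -= bit
    return "".join(digits)


# Exact rational arithmetic, so the digits themselves carry no rounding error.
tenth_bits = binary_fraction_digits(Fraction(1, 10), 24)

# Decimal(float) expands the stored binary value exactly, digit for digit.
stored_tenth = Decimal(to_f32(0.1))

print("0.1 in binary: 0." + tenth_bits + "...")
print("float32(0.1) =", stored_tenth)
```

Because the expansion never terminates, any finite mantissa has to cut it off somewhere, and the stored value is the nearest neighbour of $0.1$ rather than $0.1$ itself.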

To see how this error propagates, consider the value of $1.1^{6}$.

Calculating by hand with the binomial theorem gives

$$
\begin{align*}
1.1^6 &= (10^0 + 10^{-1})^6 \\
&= (1 + 10^{-1})^6 \\
&= (10^{-1})^6 + 6(10^{-1})^5 + 15(10^{-1})^4 + 20(10^{-1})^3 + 15(10^{-1})^2 + 6(10^{-1}) + (10^{-1})^0 \\
&= 0.000001 + 0.00006 + 0.0015 + 0.020 + 0.15 + 0.6 + 1 \\
&= 1.771561
\end{align*}
$$

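Julia, working with 32-bit floats, gives a slightly different answer:

```julia
using Printf

x::Float32 = 1.1        # stored as the nearest Float32 to 1.1
y::Float32 = 1.771561   # stored as the nearest Float32 to the exact result
@printf("The number of x is %.23f\n", x)
@printf("The number of x^6 is %.23f\n", x^6)
@printf("The number of y is %.23f\n", y)
```

```text
The number of x is 1.10000002384185791015625
The number of x^6 is 1.77156126499176025390625
The number of y is 1.77156102657318115234375
```

Although `x^6` and `y` are both meant to represent $1.771561$, they differ in the seventh decimal place: `x^6` inherits the rounding error already present in `x`, while `y` is rounded only once.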
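
How the error builds up can be sketched by comparing exact rational arithmetic with a 32-bit computation. This is an illustration in Python under stated assumptions: `to_f32` is a hypothetical helper based on `ctypes.c_float`, and the power is formed by repeated multiplication, which is not necessarily how Julia evaluates `x^6`, so the last digits may differ from the Julia output.

```python
import ctypes
from fractions import Fraction


def to_f32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (assumed IEEE 754)."""
    return ctypes.c_float(value).value


# Exact value of 1.1^6 as a rational number: 1771561/1000000.
exact = Fraction(11, 10) ** 6

# 32-bit computation: round the input once, then round after every multiplication.
x = to_f32(1.1)
power = 1.0
for _ in range(6):
    power = to_f32(power * x)

# Fraction(float) is exact, so this is the true error of the 32-bit result.
error = Fraction(power) - exact

print("exact       :", float(exact))
print("32-bit      :", power)
print("error       :", float(error))
```

The error comes from two sources: the initial rounding of $1.1$, which is amplified by the exponent, and the rounding after each multiplication.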


## Overflow and Underflow

### Integers

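In Julia, converting a value to an integer type that cannot hold it raises an error instead of silently overflowing. (Arithmetic on fixed-width integers, such as `typemax(Int32) + Int32(1)`, still wraps around; it is the conversion that is checked.) On a 64-bit machine `2^31` is an `Int64`, so the following line fails: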

```julia
a::Int32 = 2^31
```

The output is
```text
InexactError: trunc(Int32, 2147483648)

Stacktrace:
 [1] throw_inexacterror(f::Symbol, #unused#::Type{Int32}, val::Int64)
   @ Core ./boot.jl:634
 [2] checked_trunc_sint
   @ ./boot.jl:656 [inlined]
 [3] toInt32
   @ ./boot.jl:693 [inlined]
 [4] Int32
   @ ./boot.jl:783 [inlined]
 [5] convert(#unused#::Type{Int32}, x::Int64)
   @ Base ./number.jl:7
 [6] top-level scope
   @ In[8]:1
```

In C++, by contrast, the following program compiles and runs:

```c++
#include <iostream>
int main() {
    int a = (1 << 31) + 5;
    std::cout << a << '\n';
    return 0;
}
```

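On a typical platform with a 32-bit `int`, the output is `-2147483643`. Here `1 << n` computes $2^n$, and $2^{31}$ is one more than the largest `int` value, $2^{31} - 1$, so the result wraps around into the negative range.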
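
The two behaviours, silent wraparound and a checked conversion, can be contrasted in a small sketch. Python integers never overflow, so 32-bit behaviour is emulated here; `wrap_int32` and `checked_int32` are hypothetical helpers written for this illustration, not functions from any library.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Reduce an integer modulo 2**32 into the signed 32-bit range (two's complement)."""
    return (value - INT32_MIN) % 2**32 + INT32_MIN


def checked_int32(value):
    """Return the value unchanged if it fits in 32 bits, otherwise raise an error."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise OverflowError(f"{value} does not fit in a signed 32-bit integer")
    return value


# Wraparound, as in the C++ program above.
print(wrap_int32((1 << 31) + 5))

# Checked conversion, in the spirit of Julia's InexactError.
try:
    checked_int32(2**31)
except OverflowError as err:
    print("rejected:", err)
```

Wraparound keeps the program running with a wrong value, while the checked conversion stops at the point where the value first fails to fit.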

### Floating-Point Numbers

If the value is too large for the type (above roughly $3.4 \times 10^{38}$ for `Float32`), it becomes $+\infty$. The same happens in C++.

```julia
tooHigh::Float32 = 1e40
println(tooHigh)
```

```text
Inf
```


If the value is too large in the negative direction, it becomes $-\infty$.

```julia
tooLow::Float32 = -1e40
println(tooLow)
```

```text
-Inf
```


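If the magnitude is too small to be represented (far below the smallest positive `Float32`, roughly $1.4 \times 10^{-45}$), the value underflows to either $+0$ or $-0$, depending on the sign bit.

```julia
tooSmall::Float32 = 1e-50
println(tooSmall)
```

```text
0.0
```

```julia
tooSmall::Float32 = -1e-50
println(tooSmall)
```

```text
-0.0
```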
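
The same four cases can be reproduced outside Julia. The sketch below uses Python and assumes that `ctypes.c_float` performs an IEEE 754 conversion to 32 bits, which holds on common platforms; `to_f32` is a hypothetical helper name.

```python
import ctypes
import math


def to_f32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (assumed IEEE 754)."""
    return ctypes.c_float(value).value


too_high = to_f32(1e40)    # beyond the largest finite 32-bit float
too_low = to_f32(-1e40)    # same magnitude, negative sign
pos_tiny = to_f32(1e-50)   # below the smallest positive 32-bit float
neg_tiny = to_f32(-1e-50)

print(too_high, too_low)
print(pos_tiny, neg_tiny)

# +0.0 and -0.0 compare equal, so the sign has to be read with copysign.
print(pos_tiny == neg_tiny, math.copysign(1.0, neg_tiny))
```

Overflow and underflow both lose the magnitude entirely, but the sign survives in each case.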


## Cancellation Error

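Cancellation error occurs when two nearly equal floating-point numbers are subtracted. The leading significant digits cancel, so the result is built from the low-order digits, which are the ones most affected by the rounding errors of the operands.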

For example, the expected result of $12.4 - 12.1$ is $0.3$.

```julia
using Printf

x1::Float32 = 12.4
x2::Float32 = 12.1
@printf("x1 = %.24f\n", x1)
@printf("x2 = %.24f\n", x2)
println("bitstr(x1): $(bitstring(x1))")
println("bitstr(x2): $(bitstring(x2))")
```

```text
x1 = 12.399999618530273437500000
x2 = 12.100000381469726562500000
bitstr(x1): 01000001010001100110011001100110
bitstr(x2): 01000001010000011001100110011010
```


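```julia
result::Float32 = x1 - x2
@printf("x1 - x2 = %.24f\n", result)
println("bitstr(x1 - x2): $(bitstring(result))")
```

```text
x1 - x2 = 0.299999237060546875000000
bitstr(x1 - x2): 00111110100110011001100110000000
```

Each input has a relative error of about $3 \times 10^{-8}$, but the difference has a relative error of about $2.5 \times 10^{-6}$, roughly 80 times larger. The leading bits of `x1` and `x2` cancel, and the trailing zero bits in the mantissa of the result carry no information about the true value.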
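
The loss of accuracy in the subtraction above can be measured exactly. The following is a sketch in Python, assuming `ctypes.c_float` rounds to an IEEE 754 32-bit float; `to_f32` and `relative_error` are hypothetical helpers, and `Fraction` is used so that the error measurement itself is free of rounding.

```python
import ctypes
from fractions import Fraction


def to_f32(value):
    """Round a 64-bit Python float to the nearest 32-bit float (assumed IEEE 754)."""
    return ctypes.c_float(value).value


def relative_error(approx, exact):
    """Exact relative error of a float measured against a rational reference."""
    return abs(Fraction(approx) - exact) / abs(exact)


x1 = to_f32(12.4)
x2 = to_f32(12.1)
diff = to_f32(x1 - x2)  # the subtraction itself is exact for these operands

err_x1 = relative_error(x1, Fraction(124, 10))
err_x2 = relative_error(x2, Fraction(121, 10))
err_diff = relative_error(diff, Fraction(3, 10))

print(f"relative error of x1     : {float(err_x1):.3e}")
print(f"relative error of x2     : {float(err_x2):.3e}")
print(f"relative error of x1 - x2: {float(err_diff):.3e}")
print(f"amplification            : {float(err_diff / max(err_x1, err_x2)):.1f}x")
```

The subtraction introduces no new rounding error here. It only exposes the errors already present in `x1` and `x2`, which become large relative to the small result.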
