# PHY480 Day 2

## In-class activity: Experiments with round-off errors

Floating-point representation of real numbers on a computer stores about 7 significant decimal digits in single precision (4 bytes, or 32 bits) and about 15 to 16 digits in double precision (8 bytes, or 64 bits). Python by default represents real numbers and performs arithmetic in double precision. However, there may be situations where you need to use a single-precision representation. We would like to explore both. Luckily, the NumPy package lets us control the precision through two scalar types: `float32` and `float64`.
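As a quick illustration of the two precisions, the sketch below stores the same quotient, 1/3, in both types and compares each with the other. The variable names are chosen here for illustration only.

```python
import numpy as np

# The same quotient computed in single and double precision
x32 = np.float32(1) / np.float32(3)
x64 = np.float64(1) / np.float64(3)

# Storage size in bytes: 4 for float32, 8 for float64
print(x32.nbytes, x64.nbytes)

# Print many digits; only the leading ones are meaningful in each case
print("{:.20f}".format(float(x32)))
print("{:.20f}".format(float(x64)))

# Discrepancy between the single- and double-precision values
err32 = abs(float(x32) - float(x64))
print("{:.3e}".format(err32))
```

The single-precision value agrees with the double-precision one only in roughly the first 7 significant digits.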

## Experiment 1

To understand the effect of the round-off error we perform the following experiment. Let us calculate the following sum:

$$
f(n)=\sum_{k=0}^n k^2
$$

as a function of $n$. We can do this in two ways: a) summing from $k=0$ to $k=n$ (forward), or b) summing from $k=n$ to $k=0$ (backward). Mathematically, the difference between the two is 0, since the result does not depend on the order of summation. In finite-precision arithmetic, however, the two orders can give different results.
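To judge either floating-point sum, it helps to have an exact reference value. The sketch below is an addition, not part of the original activity: it uses Python's arbitrary-precision integers and the standard closed form $n(n+1)(2n+1)/6$. The function names `exact_k2_sum` and `closed_form_k2_sum` are hypothetical.

```python
# Exact reference values for f(n) using Python integers (no round-off)

def exact_k2_sum(n):
    """Sum k^2 for k = 0..n with exact integer arithmetic."""
    return sum(k * k for k in range(n + 1))

def closed_form_k2_sum(n):
    """Closed form n(n+1)(2n+1)/6; the product is always divisible by 6."""
    return n * (n + 1) * (2 * n + 1) // 6

n = 400
print(exact_k2_sum(n))        # 21413400
print(closed_form_k2_sum(n))  # 21413400
```

Comparing the forward and backward floating-point sums against this value shows how far each one has drifted, not just how much they differ from each other.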


In [1]:
import numpy as np

# Function that computes the sum of squares of integers up to n
# Input:
# n -- the number of terms
# reverse_order -- if True, sum in the reverse order
# double_prec -- if True, use double precision float64, otherwise single precision float32
# Output:
# the sum
def compute_k2_sum( n, reverse_order=False, double_prec=False ):

    # control the order of summation
    if reverse_order:
        R = range(n,-1,-1)
    else:
        R = range(0,n+1)

    # control single/double precision
    if double_prec:
        fp_prec = np.float64
    else:
        fp_prec = np.float32

    s = fp_prec( 0 ) # it is important to initialize the sum in the precision we need
    for k in R:
        x = fp_prec( k ) # convert integer k into x of the precision we need
        s += x*x

    return s
        

In [17]:
N = 400 # experiment with increasing N until you see the effect of the round-off errors

# compute the sums: experiment with double_prec=False and double_prec=True
s1 = compute_k2_sum( N, reverse_order=False, double_prec=False )
s2 = compute_k2_sum( N, reverse_order=True, double_prec=False )

# print out the sums individually and their difference
print( "{:.20e}".format( s1 ) )
print( "{:.20e}".format( s2 ) )
print( "{:.20e}".format( s1 - s2 ) )


2.14133840000000000000e+07
2.14132800000000000000e+07
1.04000000000000000000e+02
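Instead of increasing `N` by hand, one could scan for the first `N` at which the two orders disagree in single precision. The following is a minimal sketch of that idea; the helper names `k2_sum_float32` and `first_order_dependent_n` are hypothetical, and the summation loop is a compact restatement of the one above. Since `float32` has a 24-bit significand, it represents integers exactly only up to $2^{24}=16777216$, so no discrepancy can appear while the total stays below that value.

```python
import numpy as np

def k2_sum_float32(n, reverse_order=False):
    """Sum k^2 for k = 0..n in single precision, forward or backward."""
    ks = range(n, -1, -1) if reverse_order else range(0, n + 1)
    s = np.float32(0)
    for k in ks:
        x = np.float32(k)
        s += x * x
    return s

def first_order_dependent_n(n_max):
    """Return the smallest n <= n_max where the two orders differ, or None."""
    for n in range(n_max + 1):
        if k2_sum_float32(n) != k2_sum_float32(n, reverse_order=True):
            return n
    return None

n_first = first_order_dependent_n(500)
print("First n with order-dependent float32 sum:", n_first)
print("Largest exactly representable integer range for float32: up to", 2**24)
```

The scan recomputes both sums for every `n`, which is wasteful but keeps the sketch short; for the range used here it still runs quickly.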


## Experiment 2

Because of the limited precision of floating-point numbers, there is a notion of _machine epsilon_ $\varepsilon_m$, i.e. the smallest number representable on a computer for which the following holds:

$$
1+\varepsilon_m > 1.
$$

We can employ a simple iterative algorithm for estimating $\varepsilon_m$: start with $\varepsilon_m=1$ and keep dividing by 2 until we reach $1+\varepsilon_m=1$.

In [18]:
fp_prec = np.float32 # experiment with np.float32 (single) and np.float64 (double)

one = fp_prec( 1 )
epsilon = one

max_iter = 100 # it is a good idea to set the maximum number of iterations to avoid accidental infinite loops
i = 0
while one + epsilon > one and i < max_iter:
    epsilon /= fp_prec( 2 )
    i += 1

epsilon *= fp_prec( 2 ) # the loop exits once the inequality fails,
                        # so undo the last halving to recover the last value that satisfied it

print( "Iterations:", i )
print( "Machine epsilon:", epsilon )


Iterations: 24
Machine epsilon: 1.1920929e-07
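The result can be checked directly. The sketch below assumes the IEEE 754 significand widths (24 bits for single precision, 53 for double), so that the expected power-of-two value is $2^{-23}$ or $2^{-52}$, and tests the defining inequality on both sides of it. The `checks` dictionary is only there to collect the results.

```python
import numpy as np

# (type, exponent of the expected epsilon), assuming IEEE 754 formats
cases = ((np.float32, -23), (np.float64, -52))

checks = {}
for fp_prec, exponent in cases:
    one = fp_prec(1)
    eps = fp_prec(2.0 ** exponent)
    half_eps = eps / fp_prec(2)

    above = one + eps > one        # eps still changes 1
    below = one + half_eps == one  # eps/2 is rounded away
    checks[fp_prec.__name__] = (above, below)
    print(fp_prec.__name__, eps, above, below)
```

For `float32` the value $2^{-23}\approx 1.19\times 10^{-7}$ matches the estimate printed above.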


We can compare our estimate with the built-in values of machine epsilon that NumPy uses internally. Look up how to print the NumPy machine epsilon values, then print them in the cell below.

In [21]:
# YOUR CODE

print(np.finfo(np.float32), np.finfo(np.float64))


Machine parameters for float32
---------------------------------------------------------------
precision =   6   resolution = 1.0000000e-06
machep =    -23   eps =        1.1920929e-07
negep =     -24   epsneg =     5.9604645e-08
minexp =   -126   tiny =       1.1754944e-38
maxexp =    128   max =        3.4028235e+38
nexp =        8   min =        -max
smallest_normal = 1.1754944e-38   smallest_subnormal = 1.4012985e-45
---------------------------------------------------------------
 Machine parameters for float64
---------------------------------------------------------------
precision =  15   resolution = 1.0000000000000001e-15
machep =    -52   eps =        2.2204460492503131e-16
negep =     -53   epsneg =     1.1102230246251565e-16
minexp =  -1022   tiny =       2.2250738585072014e-308
maxexp =   1024   max =        1.7976931348623157e+308
nexp =       11   min =        -max
smallest_normal = 2.2250738585072014e-308   smallest_subnormal = 4.9406564584124654e-324
---------------------------------------------------------------
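If only the epsilon values are needed, the individual attributes of `np.finfo` can be read instead of printing the whole table. The sketch below uses the `eps` and `nmant` attributes and the `np.spacing` function; NumPy defines `eps` as the gap between 1 and the next larger representable number, which is what `np.spacing` returns for an argument of 1. The `results` dictionary is a hypothetical name used only for collecting values.

```python
import numpy as np

results = {}
for fp_prec in (np.float32, np.float64):
    info = np.finfo(fp_prec)
    gap = np.spacing(fp_prec(1))  # distance from 1 to the next larger number

    results[fp_prec.__name__] = {
        "eps": float(info.eps),
        "nmant": info.nmant,  # number of stored significand bits
        "gap_matches": bool(gap == info.eps),
        "power_of_two": bool(info.eps == 2.0 ** -info.nmant),
    }
    print(fp_prec.__name__, info.eps, info.nmant)
```

Both precisions give `eps` equal to 2 raised to minus the number of stored significand bits, in agreement with the `machep` entries in the tables above.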

© Copyright 2025, Michigan State University Board of Trustees
