tags | ||
---|---|---|
|
The bit sequence
If the most significant bit is 1, the overall value will be negative because that first bit contributes the largest absolute value to the sum
The value
It uses the notation
It can represent numbers of similar magnitudes with the same precision. For example, numbers with 4 digits after the decimal point have precision up to
It can only exactly represent numbers of the form
The binary point has a fixed position, so a lot of digits are needed to represent very small or very large numbers
It represents numbers of the form
The number is represented as
The bit representation is divided into three fields to encode these values:
- The single sign bit
$s$ directly encodes the sign$S$ - The
$k$ -bit exponent field$exp=e_{k-1}\ldots e_1e_0$ encodes the exponent$E$ - The
$n$ -bit fraction field$frac=f_{n-1}\ldots f_1f_0$ encodes the significand (mantissa)$M$
These are used to represent
The smallest normalized value
- If
$frac$ is all zeros, it represents$\pm\infty$ depending on sign$S$ - If
$frac$ is nonzero, it represents$NaN$
The IEEE format was designed so that floating-point numbers could be sorted using an integer sorting routine. If we interpret the bit representations of the values as unsigned integers, they occur in ascending order, as do the values they represent as floating-point numbers
This is why we use
- From
int
tofloat
, the number cannot overflow, but it may be rounded - From
int
orfloat
todouble
, the exact numeric value can be preserved becausedouble
has both greater range (i.e., the range of representable values) and greater precision (i.e., the number of significant bits) - From
double
tofloat
, the value can overflow to$\pm\infty$ , since the range is smaller. Otherwise, it may be rounded because the precision is smaller - From
float
ordouble
toint
, the value will be rounded toward zero. Furthermore, the value may overflow - From a smaller unsigned integer to a larger integer, the value is zero-extended
- From a smaller signed integer to a larger number, the value is sign-extended
- From a larger integer to a smaller integer, the value is truncated
- Exponent bias - Wikipedia
- The Floating-Point Guide - What Every Programmer Should Know About Floating-Point Arithmetic
- Computer Systems A Programmer's Perspective, Global Edition (3rd ed). Randal E. Bryant, David R. O'Hallaron
- IEEE floating-point standard - Wikipedia, the free encyclopedia
- Single-precision floating-point format - Wikipedia
- Floating-point arithmetic - Wikipedia
- IEEE 754 - Wikipedia