IEEE 754:2019 Compliance #1387

tannergooding opened this issue Jan 7, 2020 · 2 comments

@tannergooding tannergooding commented Jan 7, 2020

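IEEE 754:2019 was published last year, and it details the "required" and "recommended" operations for any conforming implementation. The required operations, with the double and float APIs that currently cover them (operations with no API listed are not currently covered), are: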


sourceFormat roundToIntegralTiesToEven(source) double Math.Round(double, MidpointRounding.ToEven) float MathF.Round(float, MidpointRounding.ToEven)
sourceFormat roundToIntegralTiesToAway(source) double Math.Round(double, MidpointRounding.AwayFromZero) float MathF.Round(float, MidpointRounding.AwayFromZero)
sourceFormat roundToIntegralTowardZero(source) double Math.Round(double, MidpointRounding.ToZero) float MathF.Round(float, MidpointRounding.ToZero)
sourceFormat roundToIntegralTowardPositive(source) double Math.Round(double, MidpointRounding.ToPositiveInfinity) float MathF.Round(float, MidpointRounding.ToPositiveInfinity)
sourceFormat roundToIntegralTowardNegative(source) double Math.Round(double, MidpointRounding.ToNegativeInfinity) float MathF.Round(float, MidpointRounding.ToNegativeInfinity)
sourceFormat nextUp(source) double Math.BitIncrement(double) float MathF.BitIncrement(float)
sourceFormat nextDown(source) double Math.BitDecrement(double) float MathF.BitDecrement(float)
sourceFormat remainder(source, source) double Math.IEEERemainder(double, double) float MathF.IEEERemainder(float, float)
sourceFormat scaleB(source, logBFormat) double Math.ScaleB(double, int) float MathF.ScaleB(float, int)
logBFormat logB(source) int Math.ILogB(double) int MathF.ILogB(float)
formatOf-addition(source1, source2) double = double + double float = float + float
formatOf-subtraction(source1, source2) double = double - double float = float - float
formatOf-multiplication(source1, source2) double = double * double float = float * float
formatOf-division(source1, source2) double = double / double float = float / float
formatOf-squareRoot(source1) double Math.Sqrt(double) float MathF.Sqrt(float)
formatOf-fusedMultiplyAdd(source1, source2, source3) double Math.FusedMultiplyAdd(double, double, double) float MathF.FusedMultiplyAdd(float, float, float)
formatOf-convertFromInt(int) double = (double)int float = (float)int
intFormatOf-convertToIntegerTowardZero(source) int = (int)double int = (int)float
formatOf-convertFormat(source) double = (double)float float = (float)double
formatOf-convertFromDecimalCharacter(decimalCharacterSequence) double double.Parse(string) float float.Parse(string)
decimalCharacterSequence convertToDecimalCharacter(source, conversionSpecification) string double.ToString() string float.ToString()
hexCharacterSequence convertToHexCharacter(source, conversionSpecification)
sourceFormat copy(source) double = double float = float
sourceFormat negate(source) double = -double float = -float
sourceFormat abs(source) double Math.Abs(double) float MathF.Abs(float)
sourceFormat copySign(source, source) double Math.CopySign(double, double) float MathF.CopySign(float, float)
boolean compareQuietEqual(source1, source2) bool = double == double bool = float == float
boolean compareQuietNotEqual(source1, source2) bool = double != double bool = float != float
boolean compareQuietGreater(source1, source2) bool = double > double bool = float > float
boolean compareQuietGreaterEqual(source1, source2) bool = double >= double bool = float >= float
boolean compareQuietLess(source1, source2) bool = double < double bool = float < float
boolean compareQuietLessEqual(source1, source2) bool = double <= double bool = float <= float
boolean compareQuietUnordered(source1, source2)
boolean compareQuietNotGreater(source1, source2)
boolean compareQuietLessUnordered(source1, source2)
boolean compareQuietNotLess(source1, source2)
boolean compareQuietGreaterUnordered(source1, source2)
boolean is754version1985(void)
boolean is754version2008(void)
boolean is754version2019(void)
enum class(source)
boolean isSignMinus(source) bool double.IsNegative(double) bool float.IsNegative(float)
boolean isNormal(source) bool double.IsNormal(double) bool float.IsNormal(float)
boolean isFinite(source) bool double.IsFinite(double) bool float.IsFinite(float)
boolean isZero(source)
boolean isSubnormal(source) bool double.IsSubnormal(double) bool float.IsSubnormal(float)
boolean isInfinite(source) bool double.IsInfinity(double) bool float.IsInfinity(float)
boolean isNaN(source) bool double.IsNaN(double) bool float.IsNaN(float)
boolean isSignaling(source)
enum radix(source)
boolean totalOrder(source, source)
boolean totalOrderMag(source, source)
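
As a quick illustration of how the mapping above reads in practice, the following sketch wraps a few of the listed .NET APIs under their IEEE names. The class `Ieee754Names` and its wrapper methods are hypothetical and exist only for this example; the `Math` members they call are the ones listed above.

```csharp
using System;

// Hypothetical wrappers: IEEE 754 operation names on the left,
// the existing .NET APIs from the list above on the right.
public static class Ieee754Names
{
    public static double RoundToIntegralTiesToEven(double x) => Math.Round(x, MidpointRounding.ToEven);
    public static double RoundToIntegralTiesToAway(double x) => Math.Round(x, MidpointRounding.AwayFromZero);
    public static double NextUp(double x) => Math.BitIncrement(x);
    public static double NextDown(double x) => Math.BitDecrement(x);
    public static double ScaleB(double x, int n) => Math.ScaleB(x, n);
    public static int LogB(double x) => Math.ILogB(x);
    public static double Remainder(double x, double y) => Math.IEEERemainder(x, y);
    public static double FusedMultiplyAdd(double x, double y, double z) => Math.FusedMultiplyAdd(x, y, z);
}
```

For example, `Ieee754Names.RoundToIntegralTiesToEven(2.5)` returns 2 while `Ieee754Names.RoundToIntegralTiesToAway(2.5)` returns 3, and `Ieee754Names.NextUp(1.0)` returns the smallest double greater than 1.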

The following IEEE APIs are also "required" but we do not support the IEEE floating-point exceptions and so they are equivalent to other APIs we expose:

  • sourceFormat roundToIntegralExact(source)
  • intFormatOf-convertToIntegerExactTiesToEven(source)
  • intFormatOf-convertToIntegerExactTowardZero(source)
  • intFormatOf-convertToIntegerExactTowardPositive(source)
  • intFormatOf-convertToIntegerExactTowardNegative(source)
  • intFormatOf-convertToIntegerExactTiesToAway(source)

The following IEEE APIs are also "required" but we do not support throwing for NaN inputs, so they are equivalent to other APIs we expose:

  • boolean compareSignalingEqual(source1, source2)
  • boolean compareSignalingGreater(source1, source2)
  • boolean compareSignalingGreaterEqual(source1, source2)
  • boolean compareSignalingLess(source1, source2)
  • boolean compareSignalingLessEqual(source1, source2)
  • boolean compareSignalingNotEqual(source1, source2)
  • boolean compareSignalingNotGreater(source1, source2)
  • boolean compareSignalingLessUnordered(source1, source2)
  • boolean compareSignalingNotLess(source1, source2)
  • boolean compareSignalingGreaterUnordered(source1, source2)

The "recommended" operations, with the double and float APIs that currently cover them, are:


exp double Math.Exp(double) float MathF.Exp(float)
log double Math.Log(double) float MathF.Log(float)
log2 double Math.Log2(double) float MathF.Log2(float)
log10 double Math.Log10(double) float MathF.Log10(float)
hypot(x, y)
compound(x, n)
rootn(x, n)
pown(x, n)
pow(x, y) double Math.Pow(double, double) float MathF.Pow(float, float)
powr(x, y)
sin double Math.Sin(double) float MathF.Sin(float)
cos double Math.Cos(double) float MathF.Cos(float)
tan double Math.Tan(double) float MathF.Tan(float)
asin double Math.Asin(double) float MathF.Asin(float)
acos double Math.Acos(double) float MathF.Acos(float)
atan double Math.Atan(double) float MathF.Atan(float)
atan2(y, x) double Math.Atan2(double, double) float MathF.Atan2(float, float)
atan2Pi(y, x)
sinh double Math.Sinh(double) float MathF.Sinh(float)
cosh double Math.Cosh(double) float MathF.Cosh(float)
tanh double Math.Tanh(double) float MathF.Tanh(float)
asinh double Math.Asinh(double) float MathF.Asinh(float)
acosh double Math.Acosh(double) float MathF.Acosh(float)
atanh double Math.Atanh(double) float MathF.Atanh(float)
sourceFormat sum(source vector, integralFormat)
sourceFormat dot(source vector, source vector, integralFormat)
sourceFormat sumSquare(source vector, integralFormat)
sourceFormat sumAbs(source vector, integralFormat)
(sourceFormat, integralFormat) scaledProd(source vector, integralFormat)
(sourceFormat, integralFormat) scaledProdSum(source vector, source vector, integralFormat)
(sourceFormat, integralFormat) scaledProdDiff(source vector, source vector, integralFormat)
(sourceFormat, sourceFormat) augmentedAddition(source, source)
(sourceFormat, sourceFormat) augmentedSubtraction(source, source)
(sourceFormat, sourceFormat) augmentedMultiplication(source, source)
sourceFormat minimum(source, source) double Math.Min(double, double) float MathF.Min(float, float)
sourceFormat minimumNumber(source, source)
sourceFormat maximum(source, source) double Math.Max(double, double) float MathF.Max(float, float)
sourceFormat maximumNumber(source, source)
sourceFormat minimumMagnitude(source, source) double Math.MinMagnitude(double, double) float MathF.MinMagnitude(float, float)
sourceFormat minimumMagnitudeNumber(source, source)
sourceFormat maximumMagnitude(source, source) double Math.MaxMagnitude(double, double) float MathF.MaxMagnitude(float, float)
sourceFormat maximumMagnitudeNumber(source, source)
sourceFormat getPayload(source)
sourceFormat setPayload(source)
sourceFormat setPayloadSignaling(source)

The following IEEE APIs are also "recommended" but cover modifying the floating-point environment, which we don't currently support:

  • binaryRoundingDirection getBinaryRoundingDirection(void)
  • void setBinaryRoundingDirection(binaryRoundingDirection)
  • modeGroup saveModes(void)
  • void restoreModes(modeGroup)
  • void defaultModes(void)

@tannergooding tannergooding commented Jan 7, 2020

From the required operations, we are notably missing:

  • Conversion from float-format to int-format using a specified rounding direction
  • Conversion from hex-string to float-format (and vice-versa)
  • Unordered comparisons (due to NaN, x >= y is not the opposite of x < y)
  • An explicit IsZero API (although users can use x == 0, this is meant to be a separate explicit API)
  • An API which classifies the floating-point type (this is meant to be separate from the other Is* APIs)
  • An API to determine if we are spec compliant
  • An API to determine if two inputs have "total order" (although IComparable provides similar functionality, it doesn't handle edge cases like +/-0 as the spec defines)
  • An API to explicitly get the radix (this is always 2 for float/double)
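
To make the gaps above more concrete, here is a rough sketch of what several of them could look like for `double`. Every name here (`FloatClass`, `Ieee754Sketch`, and its members) is hypothetical and not an existing or proposed .NET API. The classification also assumes the common binary64 encoding in which the most significant mantissa bit marks a quiet NaN.

```csharp
using System;

// All names below are hypothetical; none of these are existing .NET APIs.
public enum FloatClass
{
    SignalingNaN, QuietNaN,
    NegativeInfinity, NegativeNormal, NegativeSubnormal, NegativeZero,
    PositiveZero, PositiveSubnormal, PositiveNormal, PositiveInfinity
}

public static class Ieee754Sketch
{
    // radix: always 2 for float/double.
    public const int Radix = 2;

    // isZero: true for both +0 and -0.
    public static bool IsZero(double x) => x == 0.0;

    // Unordered comparisons: each returns true when either input is NaN.
    public static bool CompareQuietUnordered(double x, double y) => double.IsNaN(x) || double.IsNaN(y);
    public static bool CompareQuietNotGreater(double x, double y) => !(x > y);
    public static bool CompareQuietLessUnordered(double x, double y) => !(x >= y);
    public static bool CompareQuietNotLess(double x, double y) => !(x < y);
    public static bool CompareQuietGreaterUnordered(double x, double y) => !(x <= y);

    // class: a single classification, separate from the individual Is* checks.
    public static FloatClass Classify(double x)
    {
        if (double.IsNaN(x))
        {
            // Assumption: bit 51 (top mantissa bit) set means quiet NaN.
            long bits = BitConverter.DoubleToInt64Bits(x);
            return (bits & 0x0008_0000_0000_0000L) != 0 ? FloatClass.QuietNaN : FloatClass.SignalingNaN;
        }

        bool negative = double.IsNegative(x);
        if (double.IsInfinity(x)) return negative ? FloatClass.NegativeInfinity : FloatClass.PositiveInfinity;
        if (x == 0.0) return negative ? FloatClass.NegativeZero : FloatClass.PositiveZero;
        if (double.IsSubnormal(x)) return negative ? FloatClass.NegativeSubnormal : FloatClass.PositiveSubnormal;
        return negative ? FloatClass.NegativeNormal : FloatClass.PositiveNormal;
    }

    // totalOrder: true when x precedes or equals y in the total ordering,
    // which places -0 before +0 and orders NaNs by sign and payload.
    public static bool TotalOrder(double x, double y) => OrderKey(x) <= OrderKey(y);

    private static long OrderKey(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        // For negative values, flip the 63 non-sign bits so larger magnitudes sort lower.
        return bits ^ (long)((ulong)(bits >> 63) >> 1);
    }
}
```

Note how `CompareQuietNotGreater(double.NaN, 1.0)` is true while `double.NaN <= 1.0` is false, which is the distinction the unordered comparisons exist to express. Likewise `TotalOrder(-0.0, 0.0)` is true but `TotalOrder(0.0, -0.0)` is false, even though `-0.0 == 0.0`.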

For the recommended operations, which can provide more accurate results than composing the computation by hand, we are notably missing the following:

  • The p1 (+1) and m1 (-1) APIs (for example: log2(1 + x))
  • The trigonometric pi operations (for example: sin(pi * x))
  • An API to compute the hypotenuse of a triangle
  • An API to compute the reciprocal square root
  • An API to compute an arbitrary root
  • An API to compound values
  • An API which specially handles integral and positive powers
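
As an illustration of why dedicated APIs help, here is a rough sketch of two of the missing operations written against the existing `Math` surface. The names `Hypot` and `Pown` mirror the IEEE operation names but are hypothetical helpers. Neither is guaranteed to be correctly rounded; they only avoid the most obvious pitfall of the hand-written forms, such as intermediate overflow in `Math.Sqrt(x * x + y * y)`.

```csharp
using System;

// Hypothetical helpers; not existing .NET APIs and not correctly rounded.
public static class RecommendedOpsSketch
{
    // hypot(x, y): scales by the larger magnitude so squaring cannot overflow.
    public static double Hypot(double x, double y)
    {
        if (double.IsInfinity(x) || double.IsInfinity(y)) return double.PositiveInfinity;
        if (double.IsNaN(x) || double.IsNaN(y)) return double.NaN;

        double big = Math.Max(Math.Abs(x), Math.Abs(y));
        double small = Math.Min(Math.Abs(x), Math.Abs(y));
        if (big == 0.0) return 0.0;

        double ratio = small / big; // in [0, 1], so ratio * ratio cannot overflow
        return big * Math.Sqrt(1.0 + ratio * ratio);
    }

    // pown(x, n): integral power via exponentiation by squaring.
    public static double Pown(double x, int n)
    {
        long exponent = Math.Abs((long)n); // long avoids overflow for int.MinValue
        double result = 1.0;
        double factor = x;

        while (exponent > 0)
        {
            if ((exponent & 1) != 0) result *= factor;
            factor *= factor;
            exponent >>= 1;
        }

        return n < 0 ? 1.0 / result : result;
    }
}
```

With inputs of `1e200`, the hand-written `Math.Sqrt(x * x + y * y)` overflows to infinity, while `Hypot(1e200, 1e200)` stays finite. `Pown(2.0, 10)` returns 1024 and `Pown(2.0, -2)` returns 0.25.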

There are also several new recommended APIs in IEEE 754:2019:

  • Reduction operations which take "vectors" (arrays)
  • Augmented arithmetic operations, which return a tuple (the result and the error from rounding the result)
  • Min/Max number APIs which were "required" in IEEE 754:2008, but which didn't clearly define NaN propagation
  • APIs to get/set the payload of a NaN
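
A rough sketch of what three of these could look like for `double` follows. All names are hypothetical. `TwoSum` uses the classic error-free transformation under the default rounding mode, so it only approximates augmentedAddition, for which the standard specifies its own rounding. The payload helpers assume the usual binary64 quiet-NaN layout and, unlike the IEEE operations, use `ulong` for the payload and throw on non-NaN input.

```csharp
using System;

// Hypothetical helpers; none of these are existing .NET APIs.
public static class NewRecommendedOpsSketch
{
    private const ulong QuietNaNBits = 0x7FF8_0000_0000_0000UL;
    private const ulong PayloadMask = 0x0007_FFFF_FFFF_FFFFUL; // low 51 mantissa bits

    // Approximation of augmentedAddition: the rounded sum plus the rounding error.
    public static (double Sum, double Error) TwoSum(double a, double b)
    {
        double sum = a + b;
        double bVirtual = sum - a;
        double error = (a - (sum - bVirtual)) + (b - bVirtual);
        return (sum, error);
    }

    // minimumNumber: prefers a number over NaN, and treats -0 as less than +0.
    public static double MinimumNumber(double x, double y)
    {
        if (double.IsNaN(x)) return y;
        if (double.IsNaN(y)) return x;
        if (x == y) return double.IsNegative(x) ? x : y;
        return x < y ? x : y;
    }

    // setPayload: builds a quiet NaN carrying the given payload.
    public static double SetPayload(ulong payload)
    {
        if (payload > PayloadMask) throw new ArgumentOutOfRangeException(nameof(payload));
        return BitConverter.Int64BitsToDouble((long)(QuietNaNBits | payload));
    }

    // getPayload: extracts the payload bits from a NaN.
    public static ulong GetPayload(double nan)
    {
        if (!double.IsNaN(nan)) throw new ArgumentException("Value is not a NaN.", nameof(nan));
        return (ulong)BitConverter.DoubleToInt64Bits(nan) & PayloadMask;
    }
}
```

For example, `TwoSum(1.0, 1e-20)` returns a sum of 1.0 together with the 1e-20 that plain addition discards, and `MinimumNumber(double.NaN, 1.0)` returns 1.0 where `Math.Min` returns NaN.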

@tannergooding tannergooding commented Jan 7, 2020

CC. @dotnet/fxdc, since this came up in API review today
