#  Floating-Point Numbers in C#

###  *1. Mathematics & Computer Science*
- **Fractional Numbers in Math**
    - This is a general math term, meaning any number with a fractional part, e.g. 4.56, ½, ⅔, etc. It doesn’t care about how the number is stored.
- **Decimal Numbers in Math**
    - Number written in base 10
        - Any number expressed in the decimal (base-10) system. Example: 7, 42, 1234
    - Number with a fractional part (decimal fraction)
        - Numbers that use a decimal point to show fractions. Example: 4.56, 0.125, -3.14. Often called decimal fractions.
    - Term in positional notation
        - Any number expressed using digits 0–9 in positional notation. Example: 345 = 3 × 100 + 4 × 10 + 5 × 1
- **Floating-Point Numbers in Computer Science**
    - This is a computer science term, meaning a number stored in a floating-point format (`float` and `double` in C#). In this format numbers are stored in binary (base 2), which is the main reason that `double` and `float` cannot represent some decimal numbers, such as `0.1` and `0.2`, exactly. Example: `double x = 4.56;` → this is a floating-point number in C#.
- **Decimal Numbers in Computer Science**
    - `decimal` is a separate C# data type from the binary floating-point types `float` and `double`. It also holds fractional numbers, but it stores them with a base-10 scaling factor in 128 bits, giving 28-29 significant digits, which suits financial calculations. Example: `decimal y = 4.56m;` → this is a decimal number.
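
The difference between the binary types and `decimal` can be seen with a short comparison. This is a minimal sketch, assuming a C# 9 or later project that allows top-level statements; the variable names are only illustrative.

```csharp
using System;

// double is stored in base 2, so 0.1, 0.2 and 0.3 are already approximations.
double doubleSum = 0.1 + 0.2;
bool doubleEqual = doubleSum == 0.3;

// decimal is stored in base 10, so these literals are held exactly.
decimal decimalSum = 0.1m + 0.2m;
bool decimalEqual = decimalSum == 0.3m;

Console.WriteLine($"double:  0.1 + 0.2 == 0.3    -> {doubleEqual}");   // False
Console.WriteLine($"decimal: 0.1m + 0.2m == 0.3m -> {decimalEqual}");  // True
```

The `double` comparison fails because the rounding errors of the three literals do not cancel out, while the `decimal` comparison succeeds because all three values are exact in base 10.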

### *2. Floating-Point Number Representation in Computer Science*

A floating-point number is usually stored in binary scientific notation, similar to how we write decimal scientific notation:

$$ (-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent}} $$

However, IEEE 754, the industry standard for floating-point arithmetic, stores the exponent with a bias added, so the value of a normal number is recovered with the formula below. Most modern languages follow it for their `float` and `double` data types:

$$ (-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent - bias}} $$

**NOTE:** C#'s `float` and `double` follow the IEEE 754 single-precision (binary32) and double-precision (binary64) formats.

**Sign bit** (1 bit): Determines if the number is positive (0) or negative (1).

**Exponent** (8 bits in single precision, 11 bits in double precision): Encodes the magnitude of the number.

**Mantissa** (or fraction/significand, 23 bits in single precision, 52 in double): Encodes the precision digits of the number. Only the part after the binary point is stored; the leading `1.` of a normalised number is implied.

**Bias**: It is a fixed value added to the actual exponent when storing a floating-point number, so that the exponent can be represented as an unsigned number in the binary format.
It allows both positive and negative exponents to be stored using only non-negative bits.

**Bias values:**

32-bit float: bias = 127

64-bit double: bias = 1023

**Applying the Bias:**

    Example (32-bit float in C#)

    Exponent field: 8 bits → can store 0..255 (0 and 255 are reserved for zero/subnormals and Infinity/NaN)

    Bias = 127

    Real exponent = Stored exponent − 127

    Stored exponent 129 → real exponent = 129 − 127 = 2

    Stored exponent 123 → real exponent = 123 − 127 = -4

When reading the stored number, the real exponent is calculated as:
$$
\text{Real exponent} = \text{Stored exponent} - \text{Bias}
$$
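
The bias arithmetic can be sketched in a few lines. The formula `2^(k-1) - 1` for a `k`-bit exponent field comes from IEEE 754 rather than from the text above, and the helper names `Bias` and `RealExponent` are hypothetical. The sketch assumes C# 9 or later (top-level statements).

```csharp
using System;

// Bias for an exponent field that is exponentBits wide: 2^(exponentBits - 1) - 1.
int Bias(int exponentBits) => (1 << (exponentBits - 1)) - 1;

// Real exponent = stored exponent - bias.
int RealExponent(int storedExponent, int exponentBits) =>
    storedExponent - Bias(exponentBits);

Console.WriteLine(Bias(8));               // 127  (float)
Console.WriteLine(Bias(11));              // 1023 (double)
Console.WriteLine(RealExponent(129, 8));  // 2
Console.WriteLine(RealExponent(123, 8));  // -4
```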

A floating-point number is laid out in memory as follows:

| **Sign** | **Exponent** | **Mantissa** |
|---|---|---|
| 1 bit | 8 bits for float / 11 bits for double | 23 bits for float / 52 bits for double |
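
To see the three fields of a real `float`, the raw bits can be read and masked. This is a rough sketch, assuming .NET Core 2.0 or later (for `BitConverter.SingleToInt32Bits`) and C# 9 or later; the helper name `Split` is illustrative only.

```csharp
using System;

// Splits a float into its sign, stored (biased) exponent and mantissa fields.
(int Sign, int Exponent, int Mantissa) Split(float value)
{
    int bits = BitConverter.SingleToInt32Bits(value);
    return (
        (bits >> 31) & 0x1,    // 1 sign bit
        (bits >> 23) & 0xFF,   // 8 exponent bits, bias still included
        bits & 0x7FFFFF        // 23 mantissa bits, without the implied leading 1
    );
}

var one = Split(1.0f);
Console.WriteLine($"1.0f  -> sign={one.Sign} exponent={one.Exponent} mantissa={one.Mantissa}");

var negative = Split(-2.5f);
Console.WriteLine($"-2.5f -> sign={negative.Sign} exponent={negative.Exponent} mantissa={negative.Mantissa}");
```

For `1.0f` the stored exponent is 127, the bias itself, because the real exponent is 0.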

##### *Differences between floating-point implementations*
Languages that implement IEEE 754 differ mostly in the following areas:
- Precision and rounding: Rounding behavior may vary slightly depending on the compiler or runtime.
    - Some languages offer only one width; JavaScript's `Number` type, for example, is always a 64-bit double.
- Subnormal numbers (denormals): IEEE 754 allows tiny values below the smallest normal number.
    - Their handling may differ in speed depending on the hardware.
- Special values:
    - `Infinity`, `-Infinity`, and `NaN` are part of IEEE 754 and are supported by most modern languages.
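
The special values and a subnormal number can be produced directly in C#. The sketch below assumes C# 9 or later and .NET Core 2.0 or later; it divides by a variable rather than a literal so that the operations happen at run time.

```csharp
using System;

float zero = 0f;
float posInf = 1f / zero;    // +Infinity: float division by zero does not throw
float negInf = -1f / zero;   // -Infinity
float nan = zero / zero;     // NaN
float sameNan = nan;

Console.WriteLine(float.IsPositiveInfinity(posInf));  // True
Console.WriteLine(float.IsNegativeInfinity(negInf));  // True
Console.WriteLine(float.IsNaN(nan));                  // True
Console.WriteLine(nan == sameNan);                    // False: NaN never compares equal

// float.Epsilon is the smallest positive subnormal: exponent field 0, mantissa 1.
int epsilonBits = BitConverter.SingleToInt32Bits(float.Epsilon);
Console.WriteLine(epsilonBits);                       // 1
```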

### *3. Calculation of Floating-Point Representation*

Example 1:

    Convert 5.75 into 32-bit float.

- Step 1: Convert integer and fraction into binary

    - Integer part: `5₁₀ = 101₂`

    - Fraction part: `0.75 × 2 = 1.5 → 1`, then `0.5 × 2 = 1.0 → 1` → Fraction = `.11₂`

    - So:

        $ 5.75_{10} = 101.11_2 $

- Step 2: Normalise (scientific binary form)

    - Move the binary point so that a single `1` sits in front of it:

        $ 101.11_{2} = 1.0111_{2} \times 2^{2} $
    - So:

        - Mantissa = `01110...`

        - Real Exponent = `2`

- Step 3: Encode each part

    - **Sign:** positive → `0`

    - **Exponent:** Stored exponent = Real exponent + Bias

        $ 2 + 127 = 129 = 10000001_{2} $

    - **Mantissa:** store only the fractional part after `1.` → `0111000…` (23 bits)

- Step 4: Put it together

    - Sign:     `0`

    - Exponent: `10000001`

    - Mantissa: `01110000000000000000000`

- Final binary (32 bits):
    - `0 10000001 01110000000000000000000`
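
The result of Example 1 can be checked against what the runtime actually stores. This is a small sketch, assuming C# 9 or later and .NET Core 2.0 or later; `ToBitString` is a hypothetical helper, not a library method.

```csharp
using System;

// Formats a float as "sign exponent mantissa" using its raw 32 bits.
string ToBitString(float value)
{
    int bits = BitConverter.SingleToInt32Bits(value);
    // Convert.ToString drops leading zeros for positive values, so pad to 32 digits.
    string all = Convert.ToString(bits, 2).PadLeft(32, '0');
    return $"{all.Substring(0, 1)} {all.Substring(1, 8)} {all.Substring(9, 23)}";
}

// Expected to match the 32-bit pattern derived by hand in Example 1.
Console.WriteLine(ToBitString(5.75f));
```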


Example 2:

    Convert −0.15625 into 32-bit float.

- Step 1: Convert integer and fraction into binary

    - Integer part: `0₁₀ = 0₂`

    - Fraction part: 
        - `0.15625 × 2 = 0.3125 → 0`
        - `0.3125 × 2 = 0.625 → 0`
        - `0.625 × 2 = 1.25 → 1`
        - `0.25 × 2 = 0.5 → 0`
        - `0.5 × 2 = 1.0 → 1`
    - So:

        $ 0.15625_{10} = 0.00101_{2} $

- Step 2: Normalise (scientific binary form)

    - Move the binary point so that a single `1` sits in front of it:

        $ 0.00101_{2} = 1.01_{2} \times 2^{-3} $
    - So:

        - Mantissa = fractional part after leading 1 → `010`

        - Real Exponent = `-3`

- Step 3: Encode each part

    - **Sign:** negative → `1`

    - **Exponent:** Stored exponent = Real exponent + Bias

        $ -3 + 127 = 124 = 01111100_{2} $

    - **Mantissa:** store only the fractional part after `1.` → `0100…` (23 bits)

- Step 4: Put it together

    - Sign:     `1`

    - Exponent: `01111100`

    - Mantissa: `01000000000000000000000`

- Final binary (32 bits):
    - `1 01111100 01000000000000000000000`

- Bonus: Verification

    $ (-1)^{\text{sign}} \times 1.\text{mantissa} \times 2^{\text{exponent - bias}} $

    $ \text{value} = (-1)^{1} \times 1.01_{2} \times 2^{124 - 127} = (-1)^{1} \times 1.01_{2} \times 2^{-3} $

    $ 1.\text{mantissa}: \; 1.01_{2} = 1 + 0.25 = 1.25 $

    $ 1.\text{mantissa} \times 2^{\text{exponent - bias}}: \; 1.01_{2} \times 2^{-3} = 1.25 \div 8 = 0.15625 $

    $ \text{sign applied}: \; (-1)^{1} \times 0.15625 = -0.15625 $
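
The verification above can also be carried out in code by rebuilding the value from its three fields. This is a sketch for normal numbers only (stored exponent 1 to 254), assuming C# 9 or later and .NET Core 2.0 or later; the helper name `Decode` is hypothetical.

```csharp
using System;

// Rebuilds a normal float value: (-1)^sign × 1.mantissa × 2^(exponent - bias).
double Decode(int sign, int storedExponent, int mantissa)
{
    const int bias = 127;
    // Implied leading 1 plus the 23-bit fraction scaled into [0, 1).
    double significand = 1.0 + mantissa / (double)(1 << 23);
    double magnitude = significand * Math.Pow(2, storedExponent - bias);
    return sign == 1 ? -magnitude : magnitude;
}

// Fields from Example 2: sign 1, exponent 01111100, mantissa 0100...0 (23 bits).
double value = Decode(1, 0b01111100, 0b0100000_00000000_00000000);
Console.WriteLine(value);  // -0.15625

// Cross-check: let the runtime interpret the same 32 bits (0xBE200000).
float fromBits = BitConverter.Int32BitsToSingle(unchecked((int)0xBE200000));
Console.WriteLine(fromBits == (float)value);  // True
```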

### *4. Why Some Decimal Numbers Cannot Be Represented Exactly in Floating-Point Format*

Example:

    Convert 0.1 into 32-bit float.

- Step 1: Convert integer and fraction into binary

    - Integer part: `0`

    - Fraction part: 

        - `0.1 × 2 = 0.2 → digit = 0`

        - `0.2 × 2 = 0.4 → digit = 0`

        - `0.4 × 2 = 0.8 → digit = 0`

        - `0.8 × 2 = 1.6 → digit = 1, remainder = 0.6`

        - `0.6 × 2 = 1.2 → digit = 1, remainder = 0.2`

        - `0.2 × 2 = 0.4 → digit = 0`

        - `0.4 × 2 = 0.8 → digit = 0`

        - `0.8 × 2 = 1.6 → digit = 1, remainder = 0.6`

        - `0.6 × 2 = 1.2 → digit = 1, remainder = 0.2`

    - The digits `0011` now repeat forever → Fraction = `.00011001100110011...₂`

    - So:

        $ 0.1_{10} = 0.00011001100110011..._{2} $

- Step 2: Normalise (scientific binary form)

    - Move the binary point so that a single `1` sits in front of it:

        $ 0.00011001100110011..._{2} = 1.1001100..._{2} \times 2^{-4} $
    - So:

        - Mantissa = fractional part after leading 1 → `10011001100110011001101` (the infinite expansion rounded to the nearest value that fits in the 23 bits of a float)

        - Real Exponent = `-4`

- Step 3: Encode each part

    - **Sign:** positive → `0`

    - **Exponent:** Stored exponent = Real exponent + Bias

        $ -4 + 127 = 123 = 01111011_{2} $

    - **Mantissa:** store only the fractional part after `1.` → `10011001100110011001101` (23 bits, with the last bit rounded up)

- Step 4: Put it together

    - Sign:     `0`

    - Exponent: `01111011`

    - Mantissa: `10011001100110011001101`

- Final binary (32 bits / 64 bits):
    - `0 01111011 10011001100110011001101` (single-precision)
        - Which equals about 0.10000000149011612 in decimal.
    - `0 01111111011 1001100110011001100110011001100110011001100110011010` (double-precision; the exponent field is 11 bits wide and stores −4 + 1023 = 1019)
        - Which equals about 0.10000000000000000555 in decimal.
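
The stored fields of `0.1` can be printed for both widths to compare with the worked example. This is a sketch, assuming C# 9 or later and .NET Core 2.0 or later; the variable names are illustrative.

```csharp
using System;

int floatBits = BitConverter.SingleToInt32Bits(0.1f);
long doubleBits = BitConverter.DoubleToInt64Bits(0.1);

// float: 1 sign bit, 8 exponent bits, 23 mantissa bits.
string f = Convert.ToString(floatBits, 2).PadLeft(32, '0');
Console.WriteLine($"{f.Substring(0, 1)} {f.Substring(1, 8)} {f.Substring(9)}");

// double: 1 sign bit, 11 exponent bits, 52 mantissa bits.
string d = Convert.ToString(doubleBits, 2).PadLeft(64, '0');
Console.WriteLine($"{d.Substring(0, 1)} {d.Substring(1, 11)} {d.Substring(12)}");

// Stored exponents: -4 + 127 for float, -4 + 1023 for double.
int floatExponent = (floatBits >> 23) & 0xFF;
long doubleExponent = (doubleBits >> 52) & 0x7FF;
Console.WriteLine($"{floatExponent} {doubleExponent}");  // 123 1019
```

Both mantissas end in `...01` or `...10` rather than continuing the `1001` pattern, because the hardware rounds the infinite expansion to the nearest representable value instead of truncating it.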

So:
- 0.1 in float ≈ 0.10000000149 (error ≈ 1.5e-9)

- 0.1 in double ≈ 0.10000000000000000555 (error ≈ 5.6e-18)

- Neither can represent 0.1 exactly, because its binary expansion repeats forever and has to be rounded to fit the mantissa.
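
The repeated-doubling method used in the examples can be written as a short routine, which makes the difference between terminating and repeating fractions visible. This sketch uses `decimal` so that the base-10 inputs are held exactly; the helper name `FractionToBinary` and the `maxDigits` cut-off are assumptions made for illustration, and C# 9 or later is assumed.

```csharp
using System;
using System.Text;

// Converts a fraction in [0, 1) to binary digits by repeated doubling.
// Stops when the remainder reaches 0 or after maxDigits digits.
string FractionToBinary(decimal fraction, int maxDigits)
{
    var digits = new StringBuilder();
    while (fraction != 0m && digits.Length < maxDigits)
    {
        fraction *= 2;
        if (fraction >= 1m)
        {
            digits.Append('1');
            fraction -= 1m;   // keep only the remainder
        }
        else
        {
            digits.Append('0');
        }
    }
    return digits.ToString();
}

Console.WriteLine(FractionToBinary(0.75m, 24));     // 11
Console.WriteLine(FractionToBinary(0.15625m, 24));  // 00101
Console.WriteLine(FractionToBinary(0.1m, 24));      // 000110011001100110011001
```

The first two inputs reach a remainder of 0 and stop, while `0.1` only stops because of the digit limit, which is why it has to be rounded when stored.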
