Skip to content

Latest commit

 

History

History
430 lines (348 loc) · 14.7 KB

impl.md

File metadata and controls

430 lines (348 loc) · 14.7 KB

(impl)=

Implementation Notes

Multiplication

(karatsuba)=

Discuss Karatsuba algorithm.

Division

Wikipedia discusses division algorithms, and the code works just fine. However, I found their definition of the recurrence relation that governs "slow" (one digit per cycle) division algorithms confusing. These are my notes for future-me.

(derive-res)=

Deriving Restoring Division

Consider the following long division, meant to portray any "generic" divide operation:

(sample-div)=

1362 / 14 = 97\bmod{4} \Leftrightarrow N / D = q\bmod{R}

If you do this using long division1, you can derive the following intermediate steps:

(derive1)=

\begin{array}{rll}
R := 1362 && (initial) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 1000, R = 1362) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 100, R = 1362) \\
R \mathrel{-}= 1260 && (1260 = (9 * 14) * 10, R = 102) \\
R \mathrel{-}= 98 && (98 = (7 * 14) * 1, R = 4) \\
\end{array}

You can annotate the above to make it generic to any 4 digit base-10 multiplication.

(derive2)=

R_0 := N \\
R_1 := R_0 - (q_3 * D * 10^3) \\
R_2 := R_1 - (q_2 * D * 10^2) \\
R_3 := R_2 - (q_1 * D * 10^1) \\
R_4 := R_3 - (q_0 * D * 10^0) \\

Mapping the second set of assignments to the first:

  • {math}N = 1362
  • {math}D = 14
  • {math}q_j is the {math}jth digit of the quotient, where {math}j ranges from {math}0 up to {math}n-1. {math}q_j must be between 0 and 9. In the above example {math}q_3 = 0, {math}q_2 = 0, {math}q_1 = 9, and {math}q_0 = 7.
  • {math}R_n is the remainder after each step, starting at {math}R_0 = N before division begins. {math}R_4 is the remainder we want, after the 4th and final digit of the dividend. {math}R_n must always be positive.

It turns out2 there's a generic recurrence relation for finding the {math}jth remainder of a division where the dividend has {math}n digits:

(rr1)=

R_0 := N \\
R_{j+1} := R_j - q_{n - (j + 1)} * 10^{n - (j + 1)} * D \\

This formula doesn't actually match Wikipedia's, even after accounting for arbitrary base ({math}10 \Rightarrow B). Somewhere along the line, someone realized that you can modify the above equation to make it friendlier to implement on hardware, by making the following substitution:

S_j = (10^j) * R_j

Note that {math}S_0 = R_0. If you substitute3 {math}10^{-j} * S_j for {math}R_j into the above equation, and then multiply both sides by {math}10^{-j - 1} you get a new recurrence relation:

(rr2)=

S_0 := N \\
S_{j+1} := 10 * S_j - q_{n - (j + 1)} * 10^n * D \\

The {math}10^n multipler on {math}D doesn't match Wikipedia's definition, but it does match the accompanying psuedocode. As alluded to earlier, these recurrence relations work for arbitrary bases. In base-2, since multiplying by powers of two is equivalent to left-shifting, the relation looks like:

(rr3)=

S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\

{math}q_j can only be either {math}0 or {math}1, which makes multiplication trivial (either subtract {math}0 or {math}D << n). Since {math}S_j must be positive, we've turned finding the quotient and remainder into the following Python code:

(resdiv-py)=

>>> def restoring_div(N, D, n):
...    S = N
...    D = D << n
...    q = 0
...
...    for j in range(n):
...        S = (S << 1) - D
...        if S < 0:
...            S += D
...        else:
...            q |= 1
...        q <<=1
...
...    return (q >> 1, S >> n)
>>>
>>> restoring_div(1362, 14, 12)
(97, 4)

This is the core of restoring division. It's called restoring because if S < 0 during any step, we have to restore the D << n that we subtracted. We can get rid of this step using non-restoring divison.

(derive-nr)=

Deriving Non-restoring Division

In restoring division, {math}q_0 through {math}q_{n-1} map precisely to quotient digits in base-24, where each {math}q_j is either {math}0 or {math}1:

q = 2^{n - 1} * q_{n - 1} + 2^{n - 2} * q_{n - 2} + ... + 2^1 * q_1 + 2^0 * q_0

If we instead map each {math}q_j to either {math}-1 or {math}1, we can still represent odd numbers:

{-1}{-1} = 2^1 * -1 + 2^0 * -1 = -3 \\
{-1}1 = 2^1 * -1 + 2^0 * -1 = -1 \\
1{-1} = 2^1 * -1 + 2^0 * -1 = 1 \\
11 = 2^1 * -1 + 2^0 * -1 = 3 \\

The quotient {math}q and remainder {math}r results one gets from dividing {math}N / D aren't exactly unique. Without additional constraints, there are infinite integers {math}q and {math}r that satisfy:

(diveq)=

N = D * q + r

(constraint)= Our additional constrait is that we want {math}0 \leq r \lt D; this uniquely constrains {math}q. We can temporarily relax this constraint however. Given {math}N and {math}D, when {math}q is even, we can get a new mathematically valid result with an odd {math}q if we set {math}q := q + 1 and {math}r := r - D. Note that {math}r will be negative, which means that for arbitrary {math}q, {math}-D \leq r \lt D:

N = D * (q + 1) + (r - D)

Our equation we used for restoring division, reprinted below, still works5 when {math}q_j is mapped to either {math}-1 or {math}1:

(rr3-copy)=

S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\

For each iteration {math}j of the above equation:

  • If {math}S_j goes/remains positive, we're trying to get successive {math}S_{j + 1} closer to 0, much the same with restoring division.
  • On the other hand, we will tolerate when {math}S_j goes/remains negative. When calculating {math}S_{j + 1} for the next iteration, we attempt to get closer to 0 by setting {math}q_{n - (j + 1)} := -1 the next iteration and adding {math}D.

(nrdiv-restore)= If the final {math}S_n is negative, we do a final restoring step by setting {math}q := q - 1 and {math}S := S + D. This will bring the shifted remainder {math}S positive, and make our quotient {math}q even, satisfying our constraints. A working implementation looks like such:

(nrdiv-py)=

>>> def nonrestoring_div(N, D, n):
...     S = N
...     D = D << n
...     q = 0
...     
...     for j in range(n):
...         if S < 0:
...             S = (S << 1) + D
...             q -= 2**(n - j - 1)
...         else:
...             S = (S << 1) - D
...             q += 2**(n - j - 1)
...     
...     if S < 0:
...         S += D
...         q -= 1
...     
...     return (q, S >> n)
>>>
>>> nonrestoring_div(1362, 14, 12)
(97, 4)

This technique is called non-restoring division, thanks to the lack of restoring step to bring {math}S_j positive except possibly at the final step.

Benchmarking

This section is still technically correct. However, I originally wrote it
before I made a restoring divider version of {class}`~smolarith.div.MulticycleDiv`.
For general use-cases, a restoring divider (the default as of v0.1.1) wins for
size over a non-restoring divider. Both a restoring and non-restoring
implementation are kept for future testing.

It seems that  the restoring step each iteration doesn't have a perf penalty
when done in parallel in hardware. In fact, the restoring version finishes one
cycle sooner due to the lack of a final restoring step! Additionally, the size
impact of restoring once per iteration seems negligible compared to the
calculations required per iteration of a non-restoring divider (coupled with
the possible final restore step).

At this time, I have not done any experiments on making a signed restoring
divider. There may or may not be the same complications as with a signed
non-restoring divider to satisfy RISC-V semantics.

Why would you ever go through all this trouble? Well, here's a benchmark...

:language: powershell
:prompts: PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>,>>>,...
:modifiers: auto
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> pdm run
No command is given, default to the Python REPL.
Python 3.11.8 (main, Feb 13 2024, 07:18:52)  [GCC 13.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from amaranth.back.verilog import convert
>>> from smolarith.div import MulticycleDiv, LongDivider
>>>
>>> nrdiv_v = convert(MulticycleDiv(32))
>>> with open("nrdiv.v", "w") as fp:  # doctest: +SKIP
...     fp.write(nrdiv_v)
...
53043
>>>
>>> longdiv_v = convert(LongDivider(32))
>>> with open("longdiv.v", "w") as fp:    # doctest: +SKIP
...     fp.write(longdiv_v)
...
76614
>>> exit()  # doctest: +SKIP
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' longdiv.v

-- Parsing `longdiv.v' using frontend ` -vlog2k' --

1. Executing Verilog-2005 frontend: longdiv.v
Parsing Verilog input from `longdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.mag'.
Successfully finished Verilog frontend.

-- Running command `tee -q synth_ice40; stat' --

3. Printing statistics.

=== top ===

   Number of wires:               2576
   Number of wire bits:          12802
   Number of public wire bits:   12802
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:               4146
     $scopeinfo                      1
     SB_CARRY                      547
     SB_DFFESR                       9
     SB_DFFSR                       97
     SB_LUT4                      3492

PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' nrdiv.v

-- Parsing `nrdiv.v' using frontend ` -vlog2k' --

1. Executing Verilog-2005 frontend: nrdiv.v
Parsing Verilog input from `nrdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.from_u'.
Storing AST representation for module `$abstract\top.nrdiv'.
Storing AST representation for module `$abstract\top.to_u'.
Successfully finished Verilog frontend.

-- Running command `tee -q synth_ice40; stat' --

3. Printing statistics.

=== top ===

   Number of wires:                894
   Number of wire bits:           2865
   Number of public wires:         894
   Number of public wire bits:    2865
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:               1162
     $scopeinfo                      3
     SB_CARRY                      251
     SB_DFFESR                     177
     SB_DFFSR                       34
     SB_LUT4                       697

PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -V
Yosys 0.38+92 (git sha1 84116c9a3, sccache x86_64-w64-mingw32-g++ 13.2.0 -Os)
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>

While I haven't spent much effort optimizing either {class}~smolarith.div.LongDivider or {class}~smolarith.div.MulticycleDiv, it's clear that the latter non-restoring implementation wins on LUT usage alone, with a modest increase in storage elements (FFs). The extra storage elements should also help with timing closure. So without further testing for now, I'd recommend using {class}~smolarith.div.MulticycleDiv for your division needs until I can investigate.

(signedness)=

Truncating Division And Signedness

You can implement various types of division by choosing to constrain {math}r in the equation above relating {math}N, {math}D, {math}q, and {math}r.

I initially wrote the long divider implementation with RISC-V in mind. Page 44 of the RISC-V Unprivileged ISA, V20191213, states:

For REM, the sign of the result equals the sign of the dividend.

Wikipedia calls this truncating division, and in my experience it is the "natural" version of division to implement when doing a signed long divider because regardless of the sign of {math}N, you converge toward 0 remainder by subtracting or adding shifted copies of D.

While the above sections are focused on division with positive numbers, with a bit of effort to deal with the restoring step, you can also make non-restoring division work with signed numbers and just like truncating division:

if ((S < 0) and (N >= 0) and (D > 0)):
    q -= 1
    S += D
elif (S > 0) and (N < 0) and (D > 0):
    q += 1
    S -= D
elif (S > 0) and (N < 0) and (D < 0):
    q -= 1
    S += D
elif (S < 0) and (N >= 0) and (D < 0):
    q += 1
    S -= D

However, RISC-V has specific semantics for edge case divisons, list on page 45 of the 2019 manual:

  • Division by 0 for arbitrary {math}x should return all bits set for the quotient, and {math}x for the remainder, regardless of signed or unsigned division.
  • Division of the most negative value by {math}-1 should return the dividend unchanged.

The former of these is easy enough to implement for free in a signed divider, but the latter is a provision meant for common signed divider implementations:

Signed division is often implemented using an unsigned division circuit and specifying the same overflow result simplifies the hardware.

I started with a signed divider under the assumption that I could spread out the cost of signed/unsigned conversion. But the poor resource usage of my long divider, coupled with signed division indeed commonly being implemented using an unsigned division circuit, are making me rethink my strategy for a non-restoring divider.

Footnotes

I believe non-restoring division in fact works for base 10 as well, where 

{math}-9 \leq q \lt 10, q \neq 0. However, I didn't work out the details on paper about how close to {math}0 each successive {math}S_{j+1} should be.

Footnotes

  1. I did this using actual pen and paper :). Sadly, trying to format a traditional long division in MathJax didn't work as well as I'd like.

  2. I didn't actually prove this, I just assume the math works out for arbitrary-widths.

  3. Just trust me, I did it by hand.

  4. This applies for arbitrary base {math}B, but let's stick with base-2.

  5. I didn't actually prove this, but I can visualize why the math would work out if I did a base {math}-1/{math}1 division by hand analogous to the one in the restoring section.