(impl)=
(karatsuba)=
Discuss Karatsuba algorithm.
Wikipedia discusses division algorithms, and the code works just fine. However, I found their definition of the recurrence relation that governs "slow" (one digit per cycle) division algorithms confusing. These are my notes for future-me.
(derive-res)=
Consider the following long division, meant to portray any "generic" divide operation:
(sample-div)=
1362 / 14 = 97\bmod{4} \Leftrightarrow N / D = q\bmod{R}
If you do this using long division1, you can derive the following intermediate steps:
(derive1)=
\begin{array}{rll}
R := 1362 && (initial) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 1000, R = 1362) \\
R \mathrel{-}= 0 && (0 = (0 * 14) * 100, R = 1362) \\
R \mathrel{-}= 1260 && (1260 = (9 * 14) * 10, R = 102) \\
R \mathrel{-}= 98 && (98 = (7 * 14) * 1, R = 4) \\
\end{array}
You can annotate the above to make it generic to any 4 digit base-10 multiplication.
(derive2)=
R_0 := N \\
R_1 := R_0 - (q_3 * D * 10^3) \\
R_2 := R_1 - (q_2 * D * 10^2) \\
R_3 := R_2 - (q_1 * D * 10^1) \\
R_4 := R_3 - (q_0 * D * 10^0) \\
Mapping the second set of assignments to the first:
- {math}
N = 1362
- {math}
D = 14
- {math}
q_j
is the {math}j
th digit of the quotient, where {math}j
ranges from {math}0
up to {math}n-1
. {math}q_j
must be between0
and9
. In the above example {math}q_3 = 0
, {math}q_2 = 0
, {math}q_1 = 9
, and {math}q_0 = 7
. - {math}
R_n
is the remainder after each step, starting at {math}R_0 = N
before division begins. {math}R_4
is the remainder we want, after the 4th and final digit of the dividend. {math}R_n
must always be positive.
It turns out2 there's a generic recurrence relation
for finding the {math}j
th remainder of a division where the dividend has {math}n
digits:
(rr1)=
R_0 := N \\
R_{j+1} := R_j - q_{n - (j + 1)} * 10^{n - (j + 1)} * D \\
This formula doesn't actually match Wikipedia's, even after accounting for
arbitrary base ({math}10 \Rightarrow B
). Somewhere along the line, someone
realized that you can modify the above equation to make it friendlier
to implement on hardware, by making the following substitution:
S_j = (10^j) * R_j
Note that {math}S_0 = R_0
. If you substitute3 {math}10^{-j} * S_j
for
{math}R_j
into the above equation, and then multiply both sides by
{math}10^{-j - 1}
you get a new recurrence relation:
(rr2)=
S_0 := N \\
S_{j+1} := 10 * S_j - q_{n - (j + 1)} * 10^n * D \\
The {math}10^n
multipler on {math}D
doesn't match Wikipedia's definition,
but it does match the accompanying psuedocode. As alluded to earlier, these
recurrence relations work for arbitrary bases. In base-2, since multiplying
by powers of two is equivalent to left-shifting, the relation looks like:
(rr3)=
S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\
{math}q_j
can only be either {math}0
or {math}1
, which makes multiplication trivial
(either subtract {math}0
or {math}D << n
). Since {math}S_j
must be positive, we've
turned finding the quotient and remainder into the following Python code:
(resdiv-py)=
>>> def restoring_div(N, D, n):
... S = N
... D = D << n
... q = 0
...
... for j in range(n):
... S = (S << 1) - D
... if S < 0:
... S += D
... else:
... q |= 1
... q <<=1
...
... return (q >> 1, S >> n)
>>>
>>> restoring_div(1362, 14, 12)
(97, 4)
This is the core of restoring division. It's called restoring because if
S < 0
during any step, we have to restore the D << n
that we subtracted.
We can get rid of this step using non-restoring divison.
(derive-nr)=
In restoring division, {math}q_0
through {math}q_{n-1}
map precisely to
quotient digits in base-24, where each {math}q_j
is either {math}0
or
{math}1
:
q = 2^{n - 1} * q_{n - 1} + 2^{n - 2} * q_{n - 2} + ... + 2^1 * q_1 + 2^0 * q_0
If we instead map each {math}q_j
to either {math}-1
or {math}1
, we can
still represent odd numbers:
{-1}{-1} = 2^1 * -1 + 2^0 * -1 = -3 \\
{-1}1 = 2^1 * -1 + 2^0 * -1 = -1 \\
1{-1} = 2^1 * -1 + 2^0 * -1 = 1 \\
11 = 2^1 * -1 + 2^0 * -1 = 3 \\
The quotient {math}q
and remainder {math}r
results one gets from dividing
{math}N / D
aren't exactly unique. Without additional constraints, there
are infinite integers {math}q
and {math}r
that satisfy:
(diveq)=
N = D * q + r
(constraint)=
Our additional constrait is that we want {math}0 \leq r \lt D
; this uniquely constrains {math}q
. We can temporarily
relax this constraint however. Given {math}N
and {math}D
, when {math}q
is even, we can get a new mathematically valid result with an odd {math}q
if
we set {math}q := q + 1
and {math}r := r - D
. Note that {math}r
will be
negative, which means that for arbitrary {math}q
, {math}-D \leq r \lt D
:
N = D * (q + 1) + (r - D)
Our equation we used for restoring division, reprinted below, still
works5 when {math}q_j
is mapped to either {math}-1
or {math}1
:
(rr3-copy)=
S_0 := N \\
S_{j+1} := (S_j << 1) - q_{n - (j + 1)} * (D << n) \\
For each iteration {math}j
of the above equation:
- If {math}
S_j
goes/remains positive, we're trying to get successive {math}S_{j + 1}
closer to 0, much the same with restoring division. - On the other hand, we will tolerate when {math}
S_j
goes/remains negative. When calculating {math}S_{j + 1}
for the next iteration, we attempt to get closer to 0 by setting {math}q_{n - (j + 1)} := -1
the next iteration and adding {math}D
.
(nrdiv-restore)=
If the final {math}S_n
is negative, we do a final restoring step by setting
{math}q := q - 1
and {math}S := S + D
. This will bring the shifted
remainder {math}S
positive, and make our quotient {math}q
even, satisfying
our constraints. A working implementation looks like such:
(nrdiv-py)=
>>> def nonrestoring_div(N, D, n):
... S = N
... D = D << n
... q = 0
...
... for j in range(n):
... if S < 0:
... S = (S << 1) + D
... q -= 2**(n - j - 1)
... else:
... S = (S << 1) - D
... q += 2**(n - j - 1)
...
... if S < 0:
... S += D
... q -= 1
...
... return (q, S >> n)
>>>
>>> nonrestoring_div(1362, 14, 12)
(97, 4)
This technique is called non-restoring division, thanks to the lack of
restoring step to bring {math}S_j
positive except possibly at the final
step.
This section is still technically correct. However, I originally wrote it
before I made a restoring divider version of {class}`~smolarith.div.MulticycleDiv`.
For general use-cases, a restoring divider (the default as of v0.1.1) wins for
size over a non-restoring divider. Both a restoring and non-restoring
implementation are kept for future testing.
It seems that the restoring step each iteration doesn't have a perf penalty
when done in parallel in hardware. In fact, the restoring version finishes one
cycle sooner due to the lack of a final restoring step! Additionally, the size
impact of restoring once per iteration seems negligible compared to the
calculations required per iteration of a non-restoring divider (coupled with
the possible final restore step).
At this time, I have not done any experiments on making a signed restoring
divider. There may or may not be the same complications as with a signed
non-restoring divider to satisfy RISC-V semantics.
Why would you ever go through all this trouble? Well, here's a benchmark...
:language: powershell
:prompts: PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>,>>>,...
:modifiers: auto
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> pdm run
No command is given, default to the Python REPL.
Python 3.11.8 (main, Feb 13 2024, 07:18:52) [GCC 13.2.0 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from amaranth.back.verilog import convert
>>> from smolarith.div import MulticycleDiv, LongDivider
>>>
>>> nrdiv_v = convert(MulticycleDiv(32))
>>> with open("nrdiv.v", "w") as fp: # doctest: +SKIP
... fp.write(nrdiv_v)
...
53043
>>>
>>> longdiv_v = convert(LongDivider(32))
>>> with open("longdiv.v", "w") as fp: # doctest: +SKIP
... fp.write(longdiv_v)
...
76614
>>> exit() # doctest: +SKIP
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' longdiv.v
-- Parsing `longdiv.v' using frontend ` -vlog2k' --
1. Executing Verilog-2005 frontend: longdiv.v
Parsing Verilog input from `longdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.mag'.
Successfully finished Verilog frontend.
-- Running command `tee -q synth_ice40; stat' --
3. Printing statistics.
=== top ===
Number of wires: 2576
Number of wire bits: 12802
Number of public wire bits: 12802
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 4146
$scopeinfo 1
SB_CARRY 547
SB_DFFESR 9
SB_DFFSR 97
SB_LUT4 3492
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -QTp 'tee -q synth_ice40; stat' nrdiv.v
-- Parsing `nrdiv.v' using frontend ` -vlog2k' --
1. Executing Verilog-2005 frontend: nrdiv.v
Parsing Verilog input from `nrdiv.v' to AST representation.
Storing AST representation for module `$abstract\top'.
Storing AST representation for module `$abstract\top.from_u'.
Storing AST representation for module `$abstract\top.nrdiv'.
Storing AST representation for module `$abstract\top.to_u'.
Successfully finished Verilog frontend.
-- Running command `tee -q synth_ice40; stat' --
3. Printing statistics.
=== top ===
Number of wires: 894
Number of wire bits: 2865
Number of public wires: 894
Number of public wire bits: 2865
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 1162
$scopeinfo 3
SB_CARRY 251
SB_DFFESR 177
SB_DFFSR 34
SB_LUT4 697
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith> yosys -V
Yosys 0.38+92 (git sha1 84116c9a3, sccache x86_64-w64-mingw32-g++ 13.2.0 -Os)
PS C:\\msys64\\home\\William\\Projects\\FPGA\\amaranth\\smolarith>
While I haven't spent much effort optimizing either {class}~smolarith.div.LongDivider
or {class}~smolarith.div.MulticycleDiv
, it's clear that the latter non-restoring
implementation wins on LUT usage alone, with a modest increase in storage
elements (FFs). The extra storage elements should also help with timing
closure. So without further testing for now, I'd recommend using
{class}~smolarith.div.MulticycleDiv
for your division needs until I can
investigate.
(signedness)=
You can implement various types
of division by choosing to constrain {math}r
in the equation above
relating {math}N
, {math}D
, {math}q
, and {math}r
.
I initially wrote the long divider implementation with RISC-V in mind. Page 44 of the RISC-V Unprivileged ISA, V20191213, states:
For REM, the sign of the result equals the sign of the dividend.
Wikipedia calls this
truncating division, and in my experience it is the "natural" version of
division to implement when doing a signed long divider because regardless of
the sign of {math}N
, you converge toward 0 remainder by subtracting or adding
shifted copies of D
.
While the above sections are focused on division with positive numbers, with a bit of effort to deal with the restoring step, you can also make non-restoring division work with signed numbers and just like truncating division:
if ((S < 0) and (N >= 0) and (D > 0)):
q -= 1
S += D
elif (S > 0) and (N < 0) and (D > 0):
q += 1
S -= D
elif (S > 0) and (N < 0) and (D < 0):
q -= 1
S += D
elif (S < 0) and (N >= 0) and (D < 0):
q += 1
S -= D
However, RISC-V has specific semantics for edge case divisons, list on page 45 of the 2019 manual:
- Division by 0 for arbitrary {math}
x
should return all bits set for the quotient, and {math}x
for the remainder, regardless of signed or unsigned division. - Division of the most negative value by {math}
-1
should return the dividend unchanged.
The former of these is easy enough to implement for free in a signed divider, but the latter is a provision meant for common signed divider implementations:
Signed division is often implemented using an unsigned division circuit and specifying the same overflow result simplifies the hardware.
I started with a signed divider under the assumption that I could spread out the cost of signed/unsigned conversion. But the poor resource usage of my long divider, coupled with signed division indeed commonly being implemented using an unsigned division circuit, are making me rethink my strategy for a non-restoring divider.
I believe non-restoring division in fact works for base 10 as well, where
{math}-9 \leq q \lt 10, q \neq 0
. However, I didn't work out the details on paper
about how close to {math}0
each successive {math}S_{j+1}
should be.
Footnotes
-
I did this using actual pen and paper :). Sadly, trying to format a traditional long division in MathJax didn't work as well as I'd like. ↩
-
I didn't actually prove this, I just assume the math works out for arbitrary-widths. ↩
-
Just trust me, I did it by hand. ↩
-
This applies for arbitrary base {math}
B
, but let's stick with base-2. ↩ -
I didn't actually prove this, but I can visualize why the math would work out if I did a base {math}
-1
/{math}1
division by hand analogous to the one in the restoring section. ↩