-----

## Batcher Sorting Networks

Burton Rosenberg

_Creation Date:_ June 2023

_Last update:_ 20 June 2023

&copy; Copyright 2023 Burton Rosenberg. All rights reserved.


----


### Table of contents.

1. <a href="#introduction">Sorting Networks</a>
1. <a href="#oddeven">Batcher Odd-Even Sorting Network</a>
1. <a href="#python">Python code implementation</a>
1. <a href="#bitonic">Batcher Bitonic Sorting Network</a>
1. <a href="#bitonic-python">Batcher Bitonic Sort in Python</a>


### <a name=introduction>Sorting Networks</a>

The question is how to sort in parallel, especially if the parallelism can be exploited for a faster sort.

It is possible to realize parallel computation with a circuit consisting of computing units and wires connecting the units, or on a device such as a GPU that has a common memory accessible to multiple computing threads. In a sorting network the only computation needed is a compare-and-swap, which in the circuit model would be realized in hardware, and in the GPU model in software.

For instance, on a GPU, if multiple threads are launched with access to their thread identifiers, and two functions, f and g, map the identifiers to indices into a value array a in memory, the code would be,

<pre>
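__global__ void swap(int * a) {
    // global thread identifier across all blocks
    int tid = threadIdx.x + blockIdx.x * blockDim.x ;
    // f and g (defined elsewhere) map the thread to its two wires
    int i = f(tid) ;
    int j = g(tid) ;
    // compare and swap: the larger value ends up at index i
    if (a[i]&lt;a[j]) {
        int t = a[i] ;
        a[i] = a[j] ;
        a[j] = t ;
    }
    return ;
}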
</pre>

We will visualize the algorithm in the circuit model, where wires carry values to and from computation units. The units are arranged in layers, and data flows from left to right at a constant one unit of time per layer. To implement such a circuit on a GPU, queue a kernel launch for each layer, with one thread assigned to each computation unit in that layer. It is the GPU program's job to work out from the thread ID which wires the computation acts on.
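The layer-by-layer execution can be imitated on the CPU. The following is a minimal Python sketch, not the author's GPU code: `launch_layer` is a hypothetical stand-in for one kernel launch, and the index maps passed in the usage lines are illustrative choices. Running the threads one after another matches the parallel result only under the assumption that no two comparators in a layer share a wire.

```python
def launch_layer(a, num_threads, f, g):
    """Simulate one kernel launch: thread tid compares wires f(tid) and g(tid)."""
    for tid in range(num_threads):
        i, j = f(tid), g(tid)
        # same rule as the kernel above: the larger value ends up at index i
        if a[i] < a[j]:
            a[i], a[j] = a[j], a[i]
    return a


# Illustrative layer: thread tid handles the adjacent pair (2*tid, 2*tid + 1).
values = [1, 2, 3, 4]
launch_layer(values, 2, lambda tid: 2 * tid, lambda tid: 2 * tid + 1)
print(values)  # [2, 1, 4, 3]
```

A whole network would then be a sequence of such calls, one per layer, each with its own pair of index maps.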


These circuits for sorting are called _sorting networks_. 

Here is a sorting network for four elements, based on the Batcher Odd-Even algorithm. The wires are the horizontal lines and slanted lines, and the computation units are the vertical lines. Verify that this circuit works.

<pre>
a ---+----------+------- s
     |          |
b ---+--+    +--+--+---  t
          \ /      |
           /       |
          / \      |
c ---+---+   +--+--+---  u
     |          |
d ---+----------+------- v
</pre>
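One way to carry out the verification is by exhaustive testing. The sketch below encodes one reading of the diagram: two comparators, a crossing of the two middle wires, two more comparators, then a final comparator on the middle wires. It also assumes each comparator sends the larger value to the upper wire, as in the kernel shown earlier. The function names are hypothetical.

```python
from itertools import permutations


def compare_swap(a, i, j):
    """Comparator on wires i (upper) and j (lower): larger value goes up."""
    if a[i] < a[j]:
        a[i], a[j] = a[j], a[i]


def four_wire_network(values):
    a = list(values)
    compare_swap(a, 0, 1)        # layer 1: a with b
    compare_swap(a, 2, 3)        # layer 1: c with d
    a[1], a[2] = a[2], a[1]      # the slanted wires cross
    compare_swap(a, 0, 1)        # layer 2: top pair
    compare_swap(a, 2, 3)        # layer 2: bottom pair
    compare_swap(a, 1, 2)        # layer 3: middle pair
    return a


# Every ordering of four distinct values should come out largest first.
print(all(four_wire_network(p) == [3, 2, 1, 0] for p in permutations(range(4))))  # True
```

Under these assumptions the outputs s, t, u, v carry the values from largest to smallest.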


-----

### <a name=bitonic>Batcher Bitonic Sort</a>

----

In the same 1968 report, Ken Batcher gave a second sorting network made up of swap units, based on the following principle,

_Given a bitonic sequence, split the sequence into halves such that each half is bitonic and all numbers in the first half are at least as large as any number in the second half._

Then a recursion is possible until the bitonic sequences produced as output are of size 1. To complete the sort, one builds recursively upwards from inputs of size 2, then 4, then 8, etc. The subcircuits are arranged so that they alternate between largest first and smallest first. This way, when two neighboring outputs are paired, the combined sequence is bitonic.

The definition of bitonic is nuanced. It says that for some rotation the sequence moves in one direction, ascending or descending, and changes direction at most once. A sorted sequence is bitonic, and so is the concatenation of a sequence sorted upwards with a sequence sorted downwards.

However, a sequence can ascend, descend and then ascend again and still be bitonic, if it is possible to rotate the sequence such that the two ascending sections can be merged.

__Example:__ The sequence 

$$
0, 1, 2, 1, 0, -1, -2, -1
$$

is bitonic, as it can be rotated to the form 

$$
2, 1, 0, -1, -2, -1, 0, 1
$$

or

$$
-2, -1, 0, 1, 2, 1, 0, -1
$$
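The rotation-based definition can be tested directly by brute force. This is a small sketch that follows the wording of the definition rather than an efficient test; the names `direction_changes` and `is_bitonic` are hypothetical, and repeated equal neighbors are assumed not to count as a change of direction.

```python
def direction_changes(seq):
    """Count how often the sequence switches between rising and falling."""
    signs = [(b > a) - (b < a) for a, b in zip(seq, seq[1:])]
    signs = [s for s in signs if s != 0]      # ignore flat steps
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def is_bitonic(seq):
    """True if some rotation of seq changes direction at most once."""
    seq = list(seq)
    rotations = (seq[r:] + seq[:r] for r in range(max(1, len(seq))))
    return any(direction_changes(rot) <= 1 for rot in rotations)


print(is_bitonic([0, 1, 2, 1, 0, -1, -2, -1]))  # True
print(is_bitonic([1, 3, 2, 4]))                 # False
```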

Without going into a full proof, here is an illustration of the concept. Given the above sequence, consider sliding the second half under the first half,

<pre>
     ---sequence--            --folded --
 2     x                    2      x
 1   x   x                  1    x   x
 0 x       x                0 xo
-1           x   x         -1    o   o
-2             x           -2      o
</pre>

the swap gates will receive the o-x pairs at the same index, and send up the larger, and down the smaller,

<pre>

    --TOP--        --BOTTOM
 2     x        2
 1   x   x      1
 0 x            0 o
-1             -1   o   o
-2             -2     o

</pre>

note that both halves are bitonic, and everything in the top is at least as large as anything in the bottom.

The folding, swapping, and separating are done with the circuit below (in the case of 8 inputs), with an example input and output.

<pre>
-2 ----+---------  2
       |
-1 ------+-------  1
       | |
 0 --------+-----  0
       | | |
 1 ----------+---  1
       | | | |
       | | | |       
 2 ----+--------- -2
         | | |
 1 ------+------- -1
           | |
 0 --------+-----  0
             |
-1 ----------+--- -1
</pre> 
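The single layer in the circuit can be written as a short loop. In this sketch the name `split_layer` is hypothetical, and the input length is assumed to be even; wire $k$ is compared with wire $k + n/2$ and the larger value is sent to the upper wire.

```python
def split_layer(seq):
    """One layer of swaps: wire k is paired with wire k + n/2."""
    out = list(seq)
    half = len(out) // 2
    for k in range(half):
        # send the larger value up, the smaller value down
        if out[k] < out[k + half]:
            out[k], out[k + half] = out[k + half], out[k]
    return out


result = split_layer([-2, -1, 0, 1, 2, 1, 0, -1])
print(result)                              # [2, 1, 0, 1, -2, -1, 0, -1]
print(min(result[:4]) >= max(result[4:]))  # True
```

The printed list reproduces the output column of the circuit, and the second line checks that nothing in the top half is smaller than anything in the bottom half.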


#### Recursive structure

The single layer described above is denoted $S_n$, where $n$ is the number of wires, both input and output. The circuit takes a bitonic sequence and splits it into two half-length sequences, each bitonic, with any number in the upper sequence at least as large as any number in the lower sequence. A sorting unit $B_n$ on $n$ inputs, which takes a bitonic sequence as input and outputs the values sorted, is recursively defined as,

<pre>
               -----  B_n -----
              +-----+
              |     |    +-----+
              |     |    |     |   
              |     | => |B_n/2| =>|
              |     |    |     |   |
              |     |    +-----+   |
   bitonic => | S_n |              | => sorted
              |     |    +-----+   |
              |     |    |     |   |
              |     | => |B_n/2| =>|
              |     |    |     |   
              |     |    +-----+           
              +-----+
</pre>

The basis cases are $B_1$, a straight wire, and $S_2$, a single swap unit.

We also define $B'_n$ which is $B_n$ with the order of the sort reversed.
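The recursion for $B_n$ and $B'_n$ translates almost line for line into a recursive function. The following is a sketch under the assumption that the input is bitonic and its length is a power of two; the name `bitonic_merge` and the flag `largest_first` (True for $B_n$, False for $B'_n$) are hypothetical.

```python
def bitonic_merge(seq, largest_first=True):
    """Sort a bitonic sequence: B_n, or B'_n when largest_first is False."""
    n = len(seq)
    if n <= 1:
        return list(seq)                 # B_1 is a straight wire
    half = n // 2
    top, bottom = [], []
    # the S_n layer: pair wire k with wire k + n/2
    for x, y in zip(seq[:half], seq[half:]):
        hi, lo = max(x, y), min(x, y)
        if largest_first:
            top.append(hi)
            bottom.append(lo)
        else:
            top.append(lo)
            bottom.append(hi)
    # two half-size units finish the job
    return bitonic_merge(top, largest_first) + bitonic_merge(bottom, largest_first)


print(bitonic_merge([0, 1, 2, 1, 0, -1, -2, -1]))  # [2, 1, 1, 0, 0, -1, -1, -2]
```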

We construct a merge structure that creates one bitonic sequence of length $2n$ from two bitonic sequences of length $n$, using one instance of $B_{n}$ and one instance of $B'_{n}$ stacked so that their sorting directions oppose,

<pre>
              +-----+
              |     |
   bitonic => | B_n | =>|
              |     |   |
              +-----+   |
                        | => bitonic
              +-----+   |
              |     |   |
   bitonic => |B'_n | =>|
              |     |    
              +-----+
</pre>

So an entire sort is depicted here,

<pre>
       +----+   
   ----|    |   +----+
       |B_2 |---|    |
   ----|    |   |    |    
       +----+   |    |   +----+
                |B_4 |---|    |
       +----+   |    |   |    |        
   ----|    |   |    |   |    |            
       |B_2'|---|    |   |    |            
   ----|    |   +----+   |    |              
       +----+            |    |
                         |B_8 | => sorted
       +----+            |    |
   ----|    |   +----+   |    |        
       |B_2 |---|    |   |    |        
   ----|    |   |    |   |    |                 
       +----+   |    |   |    |           
                |B_4'|---|    |       
       +----+   |    |   +----+   
   ----|    |   |    |   
       |B_2'|---|    |
   ----|    |   +----+      
       +----+  
</pre>
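The whole diagram can be mirrored by a second recursion wrapped around the merge. This is an in-place sketch, assuming the input length is a power of two; the names `bitonic_sort` and `_merge` are hypothetical. The upper half of every block is sorted largest first and the lower half smallest first, so that each merge receives a bitonic sequence.

```python
def _merge(a, lo, n, largest_first):
    """B_n (or B'_n) acting in place on a[lo:lo+n], which must be bitonic."""
    if n <= 1:
        return
    half = n // 2
    for k in range(lo, lo + half):
        if largest_first:
            out_of_order = a[k] < a[k + half]
        else:
            out_of_order = a[k] > a[k + half]
        if out_of_order:
            a[k], a[k + half] = a[k + half], a[k]
    _merge(a, lo, half, largest_first)
    _merge(a, lo + half, half, largest_first)


def bitonic_sort(a, lo=0, n=None, largest_first=True):
    """Sort a[lo:lo+n] in place; n is assumed to be a power of two."""
    if n is None:
        n = len(a)
    if n <= 1:
        return a
    half = n // 2
    bitonic_sort(a, lo, half, True)            # upper sub-block, as B
    bitonic_sort(a, lo + half, half, False)    # lower sub-block, as B'
    _merge(a, lo, n, largest_first)            # the pair is now bitonic
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [8, 7, 6, 5, 4, 3, 2, 1]
```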




### <a name="bitonic-python">Batcher Bitonic Sort in Python</a>

The challenge is to navigate the double recursion and to know where the swaps should be, based only on the thread index and a few global level variables.

The two parameters $i$ and $j$ are interpreted together with the thread index $t$ as follows. The lower $i$ bits of $t$ are the offset inside a $B^l_k$. The superscript $l$ is absent if bit $j$ of $t$ is 0, and is the prime (for the inverted sort order) if bit $j$ of $t$ is 1.

The parameter $j$ controls the outer recursive structure. For each $j$, the parameter $i$ begins at $j$ and counts down, tracing the inner recursive structure.



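<pre>
def bitonic_wiring(tid, j, i):
    """Return (upper wire, lower wire, direction bit) for thread tid at level (j, i)."""
    assert j >= i
    d = 2**i                          # distance between the two wires of a swap
    mask = d - 1                      # selects the lower i bits of tid
    tid_top = (tid >> i) << (i + 1)   # upper bits of tid, leaving a gap at bit i
    tid_bot = tid & mask              # offset inside the half-block
    tid_dir = (tid >> j) % 2          # bit j of tid: 0 normal order, 1 inverted
    return (tid_top + tid_bot, tid_top + tid_bot + d, tid_dir)

def bitonic_wiring_test(bits):

    def bitonic_wiring_test_aux(j, i):
        u_prev = 0
        for tid in range(2**bits):
            (u, v, color) = bitonic_wiring(tid, j, i)
            sign = '-' if color else '+'
            # print a separator wherever the upper wire index jumps
            if u - u_prev > 1:
                print('---')
            u_prev = u
            print(f'{u}\t{v}\t{sign}')

    for c in range(bits):
        print(f'\ni==j=={c}')
        bitonic_wiring_test_aux(c, c)


bitonic_wiring_test(4)
</pre>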


<pre>
i==j==0
0	1	+
---
2	3	-
---
4	5	+
---
6	7	-
---
8	9	+
---
10	11	-
---
12	13	+
---
14	15	-
---
16	17	+
---
18	19	-
---
20	21	+
---
22	23	-
---
24	25	+
---
26	27	-
---
28	29	+
---
30	31	-

i==j==1
0	2	+
1	3	+
---
4	6	-
5	7	-
---
8	10	+
9	11	+
---
12	14	-
13	15	-
---
16	18	+
17	19	+
---
20	22	-
21	23	-
---
24	26	+
25	27	+
---
28	30	-
29	31	-

i==j==2
0	4	+
1	5	+
2	6	+
3	7	+
---
8	12	-
9	13	-
10	14	-
11	15	-
---
16	20	+
17	21	+
18	22	+
19	23	+
---
24	28	-
25	29	-
26	30	-
27	31	-

i==j==3
0	8	+
1	9	+
2	10	+
3	11	+
4	12	+
5	13	+
6	14	+
7	15	+
---
16	24	-
17	25	-
18	26	-
19	27	-
20	28	-
21	29	-
22	30	-
23	31	-
</pre>
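The wiring function can also be driven through every level to check that the layers together sort. The sketch below repeats `bitonic_wiring` so that it runs on its own, and rests on several assumptions not stated in the text: $i$ counts down from $j$ to 0, each layer uses one thread per swap (half as many threads as wires), and a direction bit of 0 sends the larger value to the upper wire. The name `bitonic_sort_layers` is hypothetical.

```python
from itertools import product


def bitonic_wiring(tid, j, i):
    d = 2**i
    top = (tid >> i) << (i + 1)
    bot = tid & (d - 1)
    return (top + bot, top + bot + d, (tid >> j) % 2)


def bitonic_sort_layers(a):
    """Run the layers in order; len(a) is assumed to be a power of two."""
    n = len(a)
    bits = n.bit_length() - 1
    assert n == 2**bits
    for j in range(bits):                 # outer recursion: block size 2**(j+1)
        for i in range(j, -1, -1):        # inner recursion: swap distance 2**i
            for tid in range(n // 2):     # one simulated thread per swap
                u, v, inverted = bitonic_wiring(tid, j, i)
                out_of_order = a[u] > a[v] if inverted else a[u] < a[v]
                if out_of_order:
                    a[u], a[v] = a[v], a[u]
    return a


print(bitonic_sort_layers([1, 3, 2, 4]))  # [4, 3, 2, 1]

# Check every 0/1 input on 8 wires; the result should be largest first.
ok = all(bitonic_sort_layers(list(p)) == sorted(p, reverse=True)
         for p in product([0, 1], repeat=8))
print(ok)
```

On a GPU, each pass of the innermost loop would correspond to one kernel launch with $j$ and $i$ supplied as the level variables.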


### END