-----

## Batcher Sorting Networks: Merge Sort

Burton Rosenberg

_Creation Date:_ June 2023

_Last update:_ 20 June 2023

&copy; Copyright 2023 Burton Rosenberg. All rights reserved.


----


### Table of contents.

1. <a href="#introduction">Sorting Networks</a>
1. <a href="#oddeven">Batcher Odd-Even Sorting Network</a>
1. <a href="#python">Python code implementation</a>


### <a name=introduction>Sorting Networks</a>

The question is how to sort in parallel, especially if the parallelism can be exploited for a faster sort.

It is possible to realize parallel computation with a circuit consisting of computing units and wires connecting the units, or on a device such as a GPU that has a common memory accessible to multiple computing threads. In a sorting network the only computation needed is a compare-and-swap, which in the circuit model would be realized in hardware, and in the GPU model would be realized in software.

For instance, on a GPU, if multiple threads are launched with access to their thread identifiers, and two functions, f and g, map the identifiers to indices into a value array a in memory, the code would be,

<pre>
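// f and g are assumed to be __device__ functions defined elsewhere
__global__ void swap(int * a) {
    // one thread per compare-and-swap unit
    int tid = threadIdx.x + blockIdx.x * blockDim.x ;
    int i = f(tid) ;   // first wire handled by this thread
    int j = g(tid) ;   // second wire handled by this thread
    if (a[i]&lt;a[j]) {
        int t = a[i] ;
        a[i] = a[j] ;
        a[j] = t ;
    }
    return ;
}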
</pre>

We will visualize the algorithm in the circuit model, where wires carry values to and from computation units, the units are arranged in layers, and the data flows from left to right at a constant one unit of time per layer. To implement such a circuit on a GPU, queue a kernel launch for each layer, with one thread assigned to each computation unit in that layer. It is the GPU program's job to work out from the thread ID which wires the computation acts on.
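The layer-by-layer evaluation can be sketched in Python as a plain simulation, not GPU code. The names `apply_layer` and `run_network` are hypothetical, and the direction of the comparison (smaller value left on the first wire) is an assumption made for this sketch.

```python
def apply_layer(values, comparators):
    """Apply one layer of compare-and-swap units.

    Each comparator (i, j) plays the role of one GPU thread: it reads wires
    i and j and leaves the smaller value on wire i (assumed direction).
    The comparators of a layer are expected to touch disjoint wires, so in
    hardware or on a GPU they could all act at the same time.
    """
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out


def run_network(values, layers):
    """Apply the layers one after another, like one kernel launch per layer."""
    for layer in layers:
        values = apply_layer(values, layer)
    return values


# A two-wire network with one layer holding a single swap unit.
print(run_network([7, 3], [[(0, 1)]]))  # [3, 7]
```

Because the comparators of a layer act on disjoint wires, the order in which the loop visits them does not change the result.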


These circuits for sorting are called _sorting networks_. 

Here is a sorting network for four elements, based on the Batcher Odd-Even algorithm. The wires are the horizontal lines and slanted lines, and the computation units are the vertical lines. Verify that this circuit works.

<pre>
a ---+----------+------- s
     |          |
b ---+--+    +--+--+---  t
          \ /      |
           /       |
          / \      |
c ---+---+   +--+--+---  u
     |          |
d ---+----------+------- v
</pre>
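One way to carry out the verification is by brute force. The sketch below is a reading of the diagram, not a definitive model: it assumes each vertical line leaves the smaller value on its upper wire, and that the slanted lines simply exchange the two middle wires without comparing them. The function names are hypothetical.

```python
from itertools import permutations


def compare_swap(v, i, j):
    """Leave the smaller of v[i], v[j] on wire i (assumed direction)."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


def four_wire_network(values):
    v = list(values)
    compare_swap(v, 0, 1)     # layer 1: unit between wires a and b
    compare_swap(v, 2, 3)     # layer 1: unit between wires c and d
    v[1], v[2] = v[2], v[1]   # the slanted wires cross, no comparison
    compare_swap(v, 0, 1)     # layer 2: unit between the upper two wires
    compare_swap(v, 2, 3)     # layer 2: unit between the lower two wires
    compare_swap(v, 1, 2)     # layer 3: unit between the middle two wires
    return v


# Try every ordering of four distinct values.
ok = all(four_wire_network(p) == sorted(p) for p in permutations(range(4)))
print(ok)  # True
```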


### <a name=oddeven>Batcher Odd-Even network</a>


In a 1968 report, Ken Batcher presented two sorting networks that have $O((\log n)^2)$ layers. Since each layer is computed in unit time, either as a circuit or on a GPU, the time to sort is also $O((\log n)^2)$. In the circuit model $O(n\,(\log n)^2)$ swap units are needed. In the GPU model, $O(n)$ threads are needed in each kernel launch.

##### The 0-1 Principle 

Batcher did not know this, but it is a fact that if a sorting network works when the values to sort are restricted to $\{\,0,1\,\}$, then it works in general.
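The principle can be illustrated, though of course not proved, on a small case. The sketch below tests a four-wire network on all 16 inputs from $\{\,0,1\,\}$ and on all 24 orderings of distinct values, and the two tests agree. The comparator list `good` is a standard five-unit network for four wires, `bad` is the same list with the last unit removed, and all names are hypothetical.

```python
from itertools import permutations, product


def run(values, comparators):
    """Apply the comparators in order, smaller value to the first wire."""
    v = list(values)
    for i, j in comparators:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def sorts_all(inputs, comparators):
    """True if the network sorts every one of the given inputs."""
    return all(run(x, comparators) == sorted(x) for x in inputs)


good = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]  # sorts four wires
bad = good[:-1]                                   # last unit removed

for net in (good, bad):
    zero_one = sorts_all(product([0, 1], repeat=4), net)  # 16 inputs
    general = sorts_all(permutations(range(4)), net)      # 24 inputs
    print(zero_one, general)
# prints "True True" for good and "False False" for bad
```

The payoff grows with $n$: there are $2^n$ zero-one inputs but $n!$ orderings.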

A list or vector consisting of $i$ ones followed by $n-i$ zeros, where $n$ is a power of two, will be denoted as $\langle\,i\,\rangle_n$. In recursive algorithms, the base case $\langle\,i\,\rangle_1$ is trivially sorted.

##### The Induction Hypothesis and Goal

We assume we have the sorted lists $\langle\,i\,\rangle_n$ and $\langle\,j\,\rangle_n$ and want their merge, $\langle\,i+j\,\rangle_{2n}$.

##### Even-Odd Split

The list $\langle\,i\,\rangle_n$ is split into its even and odd locations. If $i$ is even, then the two split lists are both $\langle\,i/2\,\rangle_{n/2}$. Example:

$$
\langle\,1,1,1,1,1,1,0,0\,\rangle_{8} \longrightarrow \langle\,1,1,1,0\,\rangle_{4} , \langle\,1,1,1,0\,\rangle_{4} 
$$

If $i$ is odd the lists are $\langle\,(i+1)/2\,\rangle_{n/2}$ and $\langle\,(i-1)/2\,\rangle_{n/2}$. Example:

$$
\langle\,1,1,1,0,0,0,0,0\,\rangle_{8} \longrightarrow \langle\,1,1,0,0\,\rangle_{4} , \langle\,1,0,0,0\,\rangle_{4} 
$$

The same considerations apply to the list $\langle\,j\,\rangle_n$.

In the case of $i$ odd, the list with the greater number of 1's is called the top list, and the other is called the bottom list. For $i$ even either one is called the top and the other is called the bottom.
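The split can be checked with a few lines of Python. This is a minimal sketch. The helper names `ones_then_zeros` and `split` are hypothetical, and the sketch always takes the 1st, 3rd, 5th, ... entries as the top list, which is the list with the greater number of ones when $i$ is odd.

```python
def ones_then_zeros(i, n):
    """The list <i>_n: i ones followed by n - i zeros."""
    return [1] * i + [0] * (n - i)


def split(lst):
    """Split a list into its two interleaved halves (top, bottom)."""
    top = lst[0::2]      # 1st, 3rd, 5th, ... entries
    bottom = lst[1::2]   # 2nd, 4th, 6th, ... entries
    return top, bottom


n = 8
for i in range(n + 1):
    top, bottom = split(ones_then_zeros(i, n))
    # top holds ceil(i/2) ones, bottom holds floor(i/2) ones
    assert top == ones_then_zeros((i + 1) // 2, n // 2)
    assert bottom == ones_then_zeros(i // 2, n // 2)

print(split(ones_then_zeros(3, 8)))  # ([1, 1, 0, 0], [1, 0, 0, 0])
```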

##### Recursion 

Merge the top lists together; merge the bottom lists together. This is a recursive structure since these are lists of length $n/2$.

There are three possible outcomes for the two merged lists.

- Both are $\langle\,(i+j)/2\,\rangle_{n}$.
- One is $\langle\,(i+j+1)/2\,\rangle_{n}$ and the other $\langle\,(i+j-1)/2\,\rangle_{n}$.
- One is $\langle\,(i+j)/2+1\,\rangle_{n}$ and the other $\langle\,(i+j)/2-1\,\rangle_{n}$.
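The three outcomes correspond to the merged top list holding 0, 1 or 2 more ones than the merged bottom list. A minimal sketch that counts the ones for every $i$ and $j$, with the hypothetical helper `merged_counts`:

```python
def merged_counts(i, j):
    """Number of ones in the merged top list and the merged bottom list."""
    top = (i + 1) // 2 + (j + 1) // 2   # ceil(i/2) + ceil(j/2)
    bottom = i // 2 + j // 2            # floor(i/2) + floor(j/2)
    return top, bottom


n = 8
gaps = set()
for i in range(n + 1):
    for j in range(n + 1):
        top, bottom = merged_counts(i, j)
        assert top + bottom == i + j    # no ones are lost or created
        gaps.add(top - bottom)

print(sorted(gaps))  # [0, 1, 2]
```

The gap is 0 when $i$ and $j$ are both even, 1 when exactly one of them is odd, and 2 when both are odd.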

##### Combine

If the highest element of the top list is zero, then both lists are completely zero, and the combine step succeeds.

Consider the case that the highest element of the top list is one. The highest element of the top list is set aside and the lowest element of the bottom list is set aside. The two lists, reduced to the remaining $n-1$ elements, are paired element-wise. Following the three outcomes of the recursion step, the pairings are,

- The list $\langle\,(i+j)/2-1\,\rangle_{n-1}$ is paired with $\langle\,(i+j)/2\,\rangle_{n-1}$
- Both lists in the pair are $\langle\,(i+j-1)/2\,\rangle_{n-1}$.
- The list $\langle\,(i+j)/2\,\rangle_{n-1}$ is paired with $\langle\,(i+j)/2-1\,\rangle_{n-1}$.

So all but one pairing will be 0-0 or 1-1. The combined list is sorted by swapping the (possible) 0-1 pairing.
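The combine step can be sketched on zero-one lists as follows. This is a minimal sketch that keeps the convention of ones first. The helper names are hypothetical, and the use of `max` and `min` stands in for one swap unit per pair.

```python
def ones_then_zeros(i, n):
    """The list <i>_n: i ones followed by n - i zeros."""
    return [1] * i + [0] * (n - i)


def combine(top, bottom):
    """Interleave two merged lists (ones first) into one sorted list."""
    n = len(top)
    out = [top[0]]                        # highest element of the top list
    for k in range(n - 1):
        hi = max(top[k + 1], bottom[k])   # one swap unit per pair
        lo = min(top[k + 1], bottom[k])
        out += [hi, lo]
    out.append(bottom[-1])                # lowest element of the bottom list
    return out


n = 4
for t in range(n + 1):
    for gap in (0, 1, 2):                 # the three outcomes of the recursion
        b = t - gap
        if b < 0:
            continue
        result = combine(ones_then_zeros(t, n), ones_then_zeros(b, n))
        assert result == ones_then_zeros(t + b, 2 * n)

print(combine([1, 1, 0, 0], [1, 0, 0, 0]))  # [1, 1, 1, 0, 0, 0, 0, 0]
```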

##### Overall algorithm and analysis

The overall network combines two logarithmically scaling constructions. The input is divided into groups of 2, 4, 8, and so on, consecutive wires, and the groups are merged.

For groups of 2, the two single wires are trivially sorted, so the merge is simply the swap device between these wires.

Then the pairs of sorted 2's are merged into 4's, and so on, for $\log n$ levels of merging.

Each merge is the Batcher even-odd merge described above, which breaks the task of merging two lists of length $k$ into two merges of lists of length $k/2$. This is solved recursively. Each recursion level contributes a single bank of swap devices. Hence the complete result of the $\log n$ depth recursion of subproblems is $\log n$ layers of swap devices.

Hence the overall depth is $O((\log n)^2)$.
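The layer count can be made exact by turning the argument into two recurrences. The sketch below assumes $n$ is a power of two and uses the hypothetical names `merge_depth` and `sort_depth`. It checks that the depth is $\log n\,(\log n+1)/2$, which is $O((\log n)^2)$.

```python
def merge_depth(k):
    """Layers needed to merge two sorted lists of length k."""
    if k == 1:
        return 1                       # a single swap device
    # the two half-size merges run side by side, then one combine layer
    return merge_depth(k // 2) + 1


def sort_depth(n):
    """Layers needed to sort n wires."""
    if n == 1:
        return 0                       # a single wire is already sorted
    # sort both halves side by side, then merge them
    return sort_depth(n // 2) + merge_depth(n // 2)


for n in (2, 4, 8, 16, 1024):
    levels = n.bit_length() - 1        # log2(n) for a power of two
    assert sort_depth(n) == levels * (levels + 1) // 2
    print(n, sort_depth(n))
```

For four wires this gives three layers, which agrees with the four-element network drawn earlier.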

### <a name=python>Python code</a>

The following Python code carries out the odd-even merge of two sorted lists; however, this code does not attempt to simulate a network.



<pre>
# batcher's even odd merge

class BatcherOddEvenMerge:

    def __init__(self):
        pass

    def sort_aux(self,a,b):
        # merge two sorted lists of equal, power-of-two length

        if len(a)==1:
            return [min(a[0],b[0]),max(a[0],b[0])]

        # using batcher's numbering for odd and even 
        # (contrary to based at zero arrays)
        a_odd = a[0::2]
        a_even = a[1::2]
        b_odd = b[0::2]
        b_even = b[1::2]

        # recursion: merge the odd lists together and the even lists together
        odd = self.sort_aux(a_odd,b_odd)
        even = self.sort_aux(a_even,b_even)

        # combine: set aside the ends, interleave the rest
        c = [0]*(len(a)+len(b))
        c[0] = odd[0]
        c[-1] = even[-1]
        c[1:-1:2] = odd[1::]
        c[2:-1:2] = even[0:len(even)-1]
        # one bank of swaps over the interleaved pairs
        for i in range(1,len(c)-1,2):
            if c[i]>c[i+1]:
                c[i],c[i+1] = c[i+1],c[i]
        return c

    def sort(self,a,b):

        def power_of_two(n):
            while n>1:
                if n%2==1:
                    return False
                n //= 2
            return True

        assert len(a)==len(b), 'lists must be of equal size'
        assert power_of_two(len(a)), 'list length must be a power of 2'
        return self.sort_aux(a,b)
</pre>

<pre>
import random

k = 16
a = [i for i in range(k)]
#random.shuffle(a)
print(a)
b = [i for i in range(k)]
#random.shuffle(b)
</pre>

Output:

<pre>
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
</pre>


<pre>
bod = BatcherOddEvenMerge()
bod.sort(a,b)
</pre>

Output:

<pre>
[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15]
</pre>

<pre>
a = [0,0,0,0,1,1,1,1]
b = [0,1,1,1,1,1,1,1]
bod.sort(a,b)
</pre>

Output:

<pre>
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
</pre>
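A full sort of unsorted input can be obtained by applying the merge at sizes 1, 2, 4, and so on, as described in the analysis. The sketch below is self-contained and restates the merge as a plain function. The names `oe_merge` and `oe_sort` are hypothetical, the list length is assumed to be a power of two, and like the code above it does not attempt to simulate a network.

```python
import random


def oe_merge(a, b):
    """Odd-even merge of two sorted lists of equal power-of-two length."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    odd = oe_merge(a[0::2], b[0::2])     # 1st, 3rd, ... entries of each list
    even = oe_merge(a[1::2], b[1::2])    # 2nd, 4th, ... entries of each list
    c = [odd[0]]                         # smallest element overall
    for x, y in zip(odd[1:], even[:-1]):
        c += [min(x, y), max(x, y)]      # one bank of swaps
    c.append(even[-1])                   # largest element overall
    return c


def oe_sort(values):
    """Sort a list of power-of-two length by repeated odd-even merging."""
    if len(values) == 1:
        return list(values)
    half = len(values) // 2
    return oe_merge(oe_sort(values[:half]), oe_sort(values[half:]))


data = list(range(16))
random.shuffle(data)
assert oe_sort(data) == sorted(data)
print(oe_sort(data))
```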

### END