Matrix multiplication (128*128 elements per matrix, 10 iterations)

| Prog. Language     |      OS      |  Architecture |     Algorithm    | Runtime/iteration (ms) |
| -------------------|:-------------|:-------------:|:----------------:| -------------:|
|  C                 | Ubuntu 15.10 | x86_64, VirtualBox | 3-loop           |         11.10 |
|  C                 | Ubuntu 15.10 | x86_64, VirtualBox | 6-loop           |          6.60 |
|  Python (objects)  | Ubuntu 15.10 | x86_64, VirtualBox | 3-loop           |        248.68 |
|  Python (numpy)    | Ubuntu 15.10 | x86_64, VirtualBox | N/A              |          1.86 |
|  Python (C API)    | Ubuntu 15.10 | x86_64, VirtualBox | 6-loop           |          1.86 |
|  C                 | Redhat 6.5   | x86_64, xcoaspen40 | 3-loop           |          8.00 |
|  C                 | Redhat 6.5   | x86_64, xcoaspen40 | 6-loop           |          5.00 |
|  Python (objects)  | Redhat 6.5   | x86_64, xcoaspen40 | 3-loop           |        454.16 |
|  Python (numpy)    | Redhat 6.5   | x86_64, xcoaspen40 | N/A              |          1.99 |
|  Python (C API)    | Redhat 6.5   | x86_64, xcoaspen40 | 6-loop           |          2.01 |
|  C                 | Ubuntu 15.10 | armhf, Zybo   | 3-loop           |        225.20 |
|  C                 | Ubuntu 15.10 | armhf, Zybo   | 6-loop           |        131.00 |
|  Python (objects)  | Ubuntu 15.10 | armhf, Zybo   | 3-loop           |       4369.21 |
|  Python (numpy)    | Ubuntu 15.10 | armhf, Zybo   | N/A              |         53.98 |
|  Python (C API)    | Ubuntu 15.10 | armhf, Zybo   | 6-loop           |         25.02 |
|  Python (SDSoC PL) | Ubuntu 15.10 | armhf, Zybo   | 6-loop           |          6.34 |
|  C (SDSoC)         | PetaLinux(2015.2.1) | armhf, Zybo   | 3-loop    |        190.41 |
|  C (SDSoC)         | PetaLinux(2015.2.1) | armhf, Zybo   | 6-loop    |         90.77 |
|  C (SDSoC PL)      | PetaLinux(2015.2.1) | armhf, Zybo   | 6-loop    |          9.17 |
|  C (SDSoC)         | Baremetal    | armhf, Zybo   | 3-loop           |        180.23 |
|  C (SDSoC)         | Baremetal    | armhf, Zybo   | 6-loop           |         88.04 |
|  C (SDSoC PL)      | Baremetal    | armhf, Zybo   | 6-loop           |          8.92 |

* The original C code in the SDSoC project uses the 3-loop algorithm, which is slower than the 6-loop version on every platform measured.
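
The table distinguishes a "3-loop" and a "6-loop" algorithm without defining them. The sketch below assumes "6-loop" means the usual blocked (tiled) variant, in which three outer loops walk over tiles and three inner loops multiply within a tile. The function names and the block size `tile=4` are illustrative and are not taken from the SDSoC project.

```python
def mmult_3loop(a, b, n):
    """Naive row-by-column product of two n x n matrices stored as nested lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def mmult_6loop(a, b, n, tile=4):
    """Blocked product: three loops over tiles, three loops inside each tile."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # min() keeps the last tile in bounds when n is not a multiple of tile
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i * j + 1 for j in range(n)] for i in range(n)]
same = mmult_3loop(a, b, n) == mmult_6loop(a, b, n)
print(same)
```

Both functions perform the same multiply-accumulate operations, only in a different order. In compiled C the blocked order can help because a tile is reused while it is still in cache; in interpreted Python the extra loop overhead would likely dominate.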





### Python (PyObjects) and Numpy
#### 1. PyObjects
http://www.programiz.com/python-programming/examples/multiply-matrix <br>
#### 2. Numpy library
http://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html <br>
http://stackoverflow.com/questions/2866380/how-can-i-time-a-code-segment-for-testing-performance-with-pythons-timeit  <br>
http://stackoverflow.com/questions/10442365/why-is-matrix-multiplication-faster-with-numpy-than-with-ctypes-in-python <br>

```python
# define the parameters
dim = 128
iteration = 10

import random
import timeit
import numpy as np

def py_mmult1(a,b,result):
    # 3-loop multiplication; the innermost loop over k is the generator inside sum()
    for i in range(len(a)):
        for j in range(len(b[0])):
            result[i][j] = sum(a[i][k]*b[k][j] for k in range(len(b)))

def py_mmult2(a,b,result):
    # slice assignment updates the caller's list in place;
    # a plain "result = ..." would only rebind the local name
    result[:] = [[sum(aa*bb for aa,bb in zip(A_row,B_col)) for B_col in zip(*b)] for A_row in a]

def numpy_mmult2(a,b,result):
    # for np.matrix operands, * is matrix multiplication and += updates result in place
    result += a*b

t1_sec = 0
t2_sec = 0
t3_sec = 0
for ii in range(iteration):
    # generate matrices
    a = [[random.randint(0,dim-1) for i in range(dim)] for j in range(dim)]
    b = [[random.randint(0,dim-1) for i in range(dim)] for j in range(dim)]
    result1 = [[0 for i in range(dim)] for j in range(dim)]
    result2 = [[0 for i in range(dim)] for j in range(dim)]

    t1 = timeit.Timer(lambda: py_mmult1(a,b,result1))
    t1_sec += t1.timeit(number=1)

    # NOTE: this call times py_mmult1 again, so both "Python object"
    # timings measure the same function
    t2 = timeit.Timer(lambda: py_mmult1(a,b,result2))
    t2_sec += t2.timeit(number=1)

    x = np.matrix(np.random.randint(127,size=(dim,dim)))
    y = np.matrix(np.random.randint(127,size=(dim,dim)))
    result3 = np.matrix(np.random.randint(127,size=(dim,dim)))

    # copy the list data into the numpy matrices and zero the result
    for i in range(dim):
        for j in range(dim):
            x[i,j] = a[i][j]
            y[i,j] = b[i][j]
            result3[i,j] = 0
    t3 = timeit.Timer(lambda: numpy_mmult2(x,y,result3))
    t3_sec += t3.timeit(number=1)

    # all three results must agree
    for i in range(dim):
        for j in range(dim):
            assert result1[i][j]==result2[i][j], "results not equal"
            assert result1[i][j]==result3[i,j], "results not equal to numpy"

print("Python object1 on ARM: " + str(t1_sec*1000/iteration) + " ms")
print("Python object2 on ARM: " + str(t2_sec*1000/iteration) + " ms")
print("Numpy on ARM: " + str(t3_sec*1000/iteration) + " ms")
print("Test passed")
```

```
Python object1 on ARM: 4369.213620899791 ms
Python object2 on ARM: 4370.525037399784 ms
Numpy on ARM: 53.983153800072614 ms
Test passed
```


```python
iteration = 10
dim = 128

import numpy as np

a = np.matrix(np.random.randint(16,size=(dim,dim)))
b = np.matrix(np.random.randint(16,size=(dim,dim)))

def numpy_mmult(a,b):
    return a*b

# Timeit - create callable function...
import timeit
t = timeit.Timer(lambda: numpy_mmult(a,b))
t_sec = t.timeit(number=iteration)/iteration

print("Numpy on ARM: " + str(t_sec*1000) + " ms")
```


```
Numpy on ARM: 52.540790399962134 ms
```
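
The "Runtime/iteration (ms)" figures are the total elapsed time divided by the iteration count, as in the two cells above. A minimal stdlib-only sketch of that calculation follows. The helper name `ms_per_iteration` is hypothetical, and the workload is a small pure-Python stand-in rather than the 128x128 benchmark.

```python
import timeit


def ms_per_iteration(func, iterations=10):
    """Return the mean wall-clock time of func() in milliseconds."""
    total_sec = timeit.timeit(func, number=iterations)
    return total_sec * 1000 / iterations


def small_workload():
    # stand-in workload: 16x16 product of nested lists filled with ones
    n = 16
    a = [[1] * n for _ in range(n)]
    return [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


elapsed_ms = ms_per_iteration(small_workload)
print("stand-in workload: " + str(elapsed_ms) + " ms")
```

Passing `number=iterations` in a single call, as the second cell does, excludes the matrix setup from the measurement; the first cell gets the same effect by timing one call per iteration and summing.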


### Python + PL (Binding SDSoC Project)

#### 1. Prep Development Environment

```bash
source /proj/gsd/sdsoc/SDSoC/2015.4/settings64.sh
```
```bash
export PATH=/proj/gsd/sdsoc/SDSoC/2015.4/SDK/2015.4/gnu/aarch32/lin/gcc-arm-linux-gnueabi/bin/:${PATH}
```
The next step is optional if you already have a project folder.
```bash
unzip lab3.zip
```
Create a working folder.
```bash
cd lab3/lab3/SDRelease/_sds/swstubs/
mkdir xpp ; cp *.c *.h *.cpp xpp ; cd xpp
```

#### 2. Modify mmult.cpp to make mmult callable from shared library

Find the following statement. The array size 1024 corresponds to 32-by-32 matrix multiplication.
```c
void _p0_mmult_0(data_t A[1024], data_t B[1024], data_t C[1024]);
```
Change to:
```c
extern "C" void _p0_mmult_0(data_t A[1024], data_t B[1024], data_t C[1024]);
```

#### 3. Compile generated cf files

```bash
arm-linux-gnueabihf-gcc -fPIC -I /proj/gsd/sdsoc/SDSoC/2015.4/arm-xilinx-linux-gnueabi/include -c devreg.c
```
```bash
arm-linux-gnueabihf-gcc -fPIC -I /proj/gsd/sdsoc/SDSoC/2015.4/arm-xilinx-linux-gnueabi/include -c portinfo.c
```
```bash
arm-linux-gnueabihf-gcc -fPIC -I /proj/gsd/sdsoc/SDSoC/2015.4/arm-xilinx-linux-gnueabi/include -c cf_stub.c
```
#### 4. Compile the accelerator function's C file

```bash
arm-linux-gnueabihf-g++ -fPIC -I /proj/gsd/sdsoc/SDSoC/2015.4/arm-xilinx-linux-gnueabi/include -I../../../../src -Wall -O3 -D __SDSCC__ -c mmult.cpp
```

#### 5. Build the final .so file against hf-gcc compiled sds_lib.a

```bash
arm-linux-gnueabihf-g++ -rdynamic devreg.o portinfo.o cf_stub.o mmult.o -L/group/xrlabs2/grahams/SDSoC/cf.2015.4/build_linux -Wl,--start-group -lpthread -lsds_lib -Wl,--end-group -o libmmult.so -fPIC -shared
```

#### 6. Copy .so and .bit to Zybo

```bash
rsync -avz -e ssh ../../../lab3.elf.bit libmmult.so xpp@IP_ADDR:/home/xpp 
```
Alternatively, copy those files directly onto the Zybo.

#### 7. Download the bitstream
The following notebook scripts can be run directly on the Zybo once the steps above have been completed.

```python
from pyxi.pl import Overlay
# Change the name of the bitstream accordingly
Overlay().download_bitstream("/home/xpp/jupyter_notebooks/mm_python_linux.elf.bit")
print("Bitstream has been downloaded for matrix multiplication")
```

```
Bitstream has been downloaded for matrix multiplication
```


#### 8. Install and use Python package
Install the "cffi" package if it is not already installed.
```bash
pip install cffi
```
#### 9. Run matrix multiplication script

```python
dim = 128
iteration = 10

import cffi
import timeit
import numpy as np

ffi = cffi.FFI()

# Pull out mmult, alloc and deallocate functions
# 16384 is for 128*128 matrix multiplication
ffi.cdef("void _p0_mmult_0(int A[16384], int B[16384], int C[16384]);")
ffi.cdef("int* sds_alloc(int size);")
ffi.cdef("void sds_free(int* memptr);")
ffi.cdef("void mmult_sw(int* A, int* B, int* C);")

# Pull in the shared object file from SDSoC (driver + datamovement)
lib = ffi.dlopen('/home/xpp/jupyter_notebooks/libmmult.so')

# allocate contiguous physical memory for matrices (SDSoC Required)
# Need a space of (bytes per data_t)*dim*dim according to mmult.cpp
size = dim*dim
a = lib.sds_alloc(4*size)
b = lib.sds_alloc(4*size)
c = lib.sds_alloc(4*size)

t1_sec = 0
t2_sec = 0
success = 1
print("Call the C-API with PL...")

for ii in range(iteration):
    x = np.matrix(np.random.randint(dim-1,size=(dim,dim)))
    y = np.matrix(np.random.randint(dim-1,size=(dim,dim)))
    z = x*y
    t1 = timeit.Timer(lambda: x*y)
    t1_sec += t1.timeit(number=1)

    # Initialize A and B
    for i in range(dim):
        for j in range(dim):
            a[i*dim+j] = ffi.cast("int", x[i,j])
            b[i*dim+j] = ffi.cast("int", y[i,j])

    # Call the HW version
    t2 = timeit.Timer(lambda: lib._p0_mmult_0(a,b,c))
    t2_sec += t2.timeit(number=1)

    # Checking the results
    for i in range(dim):
        for j in range(dim):
            if not c[i*dim+j]==z[i,j]:
                print("Error: wrong results.")
                success = 0
                break
if success==1:
    print("Numpy on ARM: " + str(t1_sec*1000/iteration) + " ms")
    print("Python-binding SDSoC HW on ARM: " + str(t2_sec*1000/iteration) + " ms")
    print("Test passed")
```

```
Call the C-API with PL...
Numpy on ARM: 52.62119130000116 ms
Python-binding SDSoC HW on ARM: 6.449775300002614 ms
Test passed
```
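
The buffer size passed to `sds_alloc` and the `i*dim+j` indexing both follow from storing each matrix as a flat, row-major array of `int`. The stdlib-only sketch below works through that arithmetic with `ctypes`. It assumes `data_t` is a 4-byte `int`, as the `4*size` in the cell implies, and it uses an ordinary ctypes array as a stand-in for SDSoC's physically contiguous memory; `flat_index` is a hypothetical helper name.

```python
import ctypes

dim = 128
bytes_per_elem = ctypes.sizeof(ctypes.c_int)   # 4 on the common ABIs
buffer_bytes = bytes_per_elem * dim * dim      # the size handed to the allocator

# ordinary (not physically contiguous) stand-in for an sds_alloc'ed buffer
flat = (ctypes.c_int * (dim * dim))()


def flat_index(i, j, dim):
    """Row-major offset of element (i, j) in a dim x dim matrix."""
    return i * dim + j


flat[flat_index(2, 5, dim)] = 42
print(buffer_bytes, flat_index(2, 5, dim), flat[261])
```

With a 4-byte `int`, each 128x128 matrix needs 4*128*128 = 65536 bytes, and element (2, 5) sits at offset 2*128+5 = 261.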


### Python + C (Binding SDSoC Project)
This method uses the software version of the matrix multiplication in the SDSoC project. Because pointers are passed from Python to C, a Python capsule is used. For more information:

https://docs.python.org/3.1/c-api/capsule.html <br>
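
Capsules are normally created in C with `PyCapsule_New`. Purely to illustrate the idea that a capsule is an opaque Python object wrapping a raw pointer, the sketch below calls the same C API from Python through `ctypes.pythonapi`. It assumes CPython. The capsule name `b"mmmod.Array"` and the 4-element buffer are made up for the example and say nothing about how "mmmodule.c" is actually written.

```python
import ctypes

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

# the capsule stores the name pointer without copying it, so keep the bytes alive
name = b"mmmod.Array"                  # hypothetical capsule name
buf = (ctypes.c_int * 4)(1, 2, 3, 4)   # the C data the capsule will point at

# wrap the buffer address in a capsule (no destructor), then unwrap it again
capsule = api.PyCapsule_New(ctypes.addressof(buf), name, None)
ptr = api.PyCapsule_GetPointer(capsule, name)
recovered = ctypes.cast(ptr, ctypes.POINTER(ctypes.c_int))
print(type(capsule).__name__, recovered[2])
```

From the Python side the capsule is just a handle: it can be stored and passed back to C, but its contents are reachable only through the C API and only with the matching name.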

#### 1. Build the Python-C API
Change into the folder containing "mmmodule.c" and "setup.py" (by default, "/home/xpp/jupyter_notebooks") and run:
```bash
python3.4 setup.py build
```

This will generate a "build" folder containing the ".so" files. Then run:
```bash
python3.4 setup.py install
```

This will expose "mmmod" to the Python environment.

#### 2. Use "mmmod"
The "mmmod" module implements a basic data structure: Array. Users can write() and read() a single element of the array at a time. The function clear() sets all elements to 0, while set() sets all elements to a specified value.

```python
import mmmod
import timeit
import numpy as np

# For mmmod, the dimension is integrated into the C code
dim = 128
iteration = 10

a = mmmod.Array()
b = mmmod.Array()
c = mmmod.Array()

print("Call the C-API version...")
success = 1
t1_sec = 0
t2_sec = 0

for ii in range(iteration):
    x = np.matrix(np.random.randint(dim-1,size=(dim,dim)))
    y = np.matrix(np.random.randint(dim-1,size=(dim,dim)))
    z = x*y

    for i in range(dim):
        for j in range(dim):
            mmmod.write(a,i,j,x[i,j])
            mmmod.write(b,i,j,y[i,j])
    mmmod.clear(c)

    t2 = timeit.Timer(lambda: mmmod.mmult_sw(a,b,c))
    t2_sec += t2.timeit(number=1)

    # Checking the results
    for i in range(dim):
        for j in range(dim):
            if not mmmod.read(c,i,j)==z[i,j]:
                print("Error: wrong results.")
                success = 0
                break

if success==1:
    print("Python-binding C on ARM: " + str(t2_sec*1000/iteration) + " ms")
    print("Test passed")
```

```
Call the C-API version...
Python-binding C on ARM: 24.979956800052605 ms
Test passed
```
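
To make the Array interface described above concrete, here is a pure-Python stand-in that mimics the calls used in the cell (`Array()`, `write`, `read`, `clear`, `mmult_sw`) plus a counterpart to `set`. It is a hypothetical model for illustration only: the real "mmmod" is a C extension whose internals are not shown here, and the argument list of `set` in particular is an assumption. A small `DIM` keeps the example quick.

```python
DIM = 4  # the real module fixes the dimension in C; 4 is illustrative


class Array:
    """Flat row-major DIM x DIM integer matrix."""
    def __init__(self):
        self.data = [0] * (DIM * DIM)


def write(arr, i, j, value):
    """Store one element at row i, column j."""
    arr.data[i * DIM + j] = value


def read(arr, i, j):
    """Fetch one element at row i, column j."""
    return arr.data[i * DIM + j]


def set_all(arr, value):
    # plays the role of mmmod.set(); renamed to avoid shadowing the built-in set
    arr.data = [value] * (DIM * DIM)


def clear(arr):
    """Reset every element to 0."""
    set_all(arr, 0)


def mmult_sw(a, b, c):
    """3-loop product c = a * b on the flat arrays."""
    for i in range(DIM):
        for j in range(DIM):
            write(c, i, j, sum(read(a, i, k) * read(b, k, j) for k in range(DIM)))


a, b, c = Array(), Array(), Array()
set_all(a, 2)
set_all(b, 3)
clear(c)
mmult_sw(a, b, c)
print(read(c, 0, 0))
```

Every element of the product is DIM * 2 * 3, which gives a quick sanity check on the indexing.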
