# Accelerate Signal Processing: FFT Convolution

In this lesson, we demonstrate how to use the **hardware-optimized FFT** functions in *Accelerate* to **speed up convolution** for image processing.

* The `vectorize` decorator gives a higher-level way to annotate numeric functions and run them on the GPU with little extra code.
* We can also control CUDA operations, such as memory transfers, more explicitly, as shown in this notebook.

## Table of Contents
* [Accelerate Signal Processing: FFT Convolution](#Accelerate-Signal-Processing:-FFT-Convolution)
	* [Overview](#Overview)
	* [Set-up](#Set-up)
	* [Prepare data](#Prepare-data)
* [Comparing implementations of a Convolution](#Comparing-implementations-of-a-Convolution)
	* [Numpy naive implementation](#Numpy-naive-implementation)
	* [Scipy Implementation](#Scipy-Implementation)
	* [Accelerate Implementation with MKL](#Accelerate-Implementation-with-MKL)
	* [Accelerate Implementation with GPU](#Accelerate-Implementation-with-GPU)
	* [Timing all Implementations](#Timing-all-Implementations)
	* [Using VML to intrinsics](#Using-VML-to-intrinsics)
	* [Script for convolution filter](#Script-for-convolution-filter)


## Set-up

In [None]:
import sys

import numpy as np
from scipy.signal import fftconvolve
# NOTE: scipy.misc.imresize was removed in SciPy 1.3; this cell needs an older SciPy
from scipy.misc import imresize
import skimage.data
from skimage.color import rgb2gray
from matplotlib import pyplot as plt

from numba import cuda, vectorize
from timeit import default_timer as timer

%matplotlib inline

## Prepare data

In [None]:
# Build 5x5 laplacian filter
laplacian_pts = '''
-4 -1 0 -1 -4
-1  2 3  2 -1
 0  3 4  3  0
-1  2 3  2 -1
-4 -1 0 -1 -4
'''.split()

laplacian = np.array(laplacian_pts, dtype=np.float32).reshape(5, 5)

# Build Image
image = rgb2gray(skimage.data.astronaut())
image = imresize(image, 2.0).astype(np.float32)

print("Image size: %s" % (image.shape,))

response = np.zeros_like(image)
response[:5, :5] = laplacian

plt.figure(figsize=(8,8))
plt.imshow(image, cmap=plt.cm.gray)

In [None]:
plt.figure(figsize=(8,8))
plt.imshow(response[:5, :5], cmap=plt.cm.gray)

In [None]:
response

# Comparing implementations of a Convolution

We demonstrate four implementations of the same convolution, then time each one and compare the results.

## Numpy naive implementation

In [None]:
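def naive_fftconvolve(image):
    # Convolution theorem: multiply the two spectra, then transform back.
    # `response` holds the kernel zero-padded to the image shape, so this
    # is a circular convolution (the edges wrap around).
    freq_image = np.fft.rfft2(image)
    freq_response = np.fft.rfft2(response)
    # irfft2 already returns a real-valued array
    return np.fft.irfft2(freq_image * freq_response)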

cvimage_naive = naive_fftconvolve(image)

plt.figure(figsize=(8,8))
plt.imshow(cvimage_naive, cmap=plt.cm.gray);
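
The naive version relies on the convolution theorem: multiplying two spectra corresponds to a *circular* convolution in the spatial domain. The following is a minimal sketch that checks this on a small random array. The helper names `circular_convolve_direct` and `circular_convolve_fft` are illustrative and not part of the notebook.

```python
import numpy as np

def circular_convolve_direct(x, h):
    """Direct 2-D circular convolution; slow, used only as a reference."""
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            # Each kernel tap adds a shifted, wrapped-around copy of x
            out += h[i, j] * np.roll(x, shift=(i, j), axis=(0, 1))
    return out

def circular_convolve_fft(x, h):
    """Same result via the FFT, with the kernel placed at the top-left."""
    padded = np.zeros(x.shape, dtype=np.float64)
    padded[:h.shape[0], :h.shape[1]] = h
    spectrum = np.fft.rfft2(x) * np.fft.rfft2(padded)
    return np.fft.irfft2(spectrum, s=x.shape)

rng = np.random.default_rng(0)
x = rng.random((16, 16))
h = rng.random((5, 5))

direct = circular_convolve_direct(x, h)
via_fft = circular_convolve_fft(x, h)
print(np.allclose(direct, via_fft))
```

Passing `s=x.shape` to `irfft2` makes the output shape explicit, which also matters when the last axis has odd length.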

## Scipy Implementation

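Using `scipy.signal.fftconvolve`, which zero-pads its inputs internally and, with `mode='same'`, returns an output the same size as the image: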

In [None]:
cvimage_cpu = fftconvolve(image, laplacian, mode='same')

plt.figure(figsize=(8,8))
plt.imshow(cvimage_cpu, cmap=plt.cm.gray);
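
To see how a `mode='same'` result relates to the circular result of the naive cell, here is a NumPy-only sketch of linear convolution by zero-padding. It illustrates the general technique and is not SciPy's actual implementation; `fft_convolve_same` is a hypothetical helper name.

```python
import numpy as np

def fft_convolve_same(x, h):
    """Linear convolution via FFT, cropped to x.shape."""
    full_shape = (x.shape[0] + h.shape[0] - 1, x.shape[1] + h.shape[1] - 1)
    # Zero-padding both inputs to the full output size prevents wrap-around
    spectrum = np.fft.rfft2(x, s=full_shape) * np.fft.rfft2(h, s=full_shape)
    full = np.fft.irfft2(spectrum, s=full_shape)
    # Keep the central region so the output lines up with the input
    r0 = (h.shape[0] - 1) // 2
    c0 = (h.shape[1] - 1) // 2
    return full[r0:r0 + x.shape[0], c0:c0 + x.shape[1]]

rng = np.random.default_rng(0)
x = rng.random((16, 16))
h = rng.random((5, 5))
same = fft_convolve_same(x, h)

# Circular version, as in the naive cell: kernel at the top-left, no padding
padded = np.zeros_like(x)
padded[:5, :5] = h
circular = np.fft.irfft2(np.fft.rfft2(x) * np.fft.rfft2(padded), s=x.shape)

# Away from the borders the two differ only by a shift of 2 pixels per axis
shifted = np.roll(circular, shift=(-2, -2), axis=(0, 1))
print(np.allclose(same[2:-2, 2:-2], shifted[2:-2, 2:-2]))
```

For a 5x5 kernel the circular output is offset by two pixels along each axis and differs from the linear result only near the borders, where wrap-around occurs.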

## Accelerate Implementation with MKL

In [None]:
import accelerate.mkl.fftpack as mklfft

def mkl_fftconvolve(image):
    freq_image = mklfft.rfft2(image)
    freq_response = mklfft.rfft2(response)
    return mklfft.irfft2(freq_image * freq_response).real

cvimage_mkl = mkl_fftconvolve(image)

plt.figure(figsize=(8,8))
plt.imshow(cvimage_mkl, cmap=plt.cm.gray);

## Accelerate Implementation with GPU


In [None]:
import accelerate.cuda.fft as cufft

@vectorize(['complex64(complex64, complex64)'], target='cuda')
def gpu_mult(a, b):
    # a GPU ufunc to compute the elementwise product 
    return a * b


def gpu_fftconvolve(image):
    image_complex = image.astype(np.complex64)
    response_complex = response.astype(np.complex64)

    # explicit CPU->GPU memory transfer
    d_image_complex = cuda.to_device(image_complex)
    d_response_complex = cuda.to_device(response_complex)

    # GPU forward FFT
    cufft.fft_inplace(d_image_complex)
    cufft.fft_inplace(d_response_complex)

    # GPU ufunc
    gpu_mult(d_image_complex, d_response_complex, out=d_image_complex)

    # GPU inverse FFT
    cufft.ifft_inplace(d_image_complex)

    # explicit GPU->CPU memory transfer
    cvimage_gpu = d_image_complex.copy_to_host().real
    return cvimage_gpu

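Note: the `gpu_mult` GPU ufunc is necessary to keep the data in GPU memory between the forward and inverse FFTs; multiplying on the host would require an extra round trip between device and host.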

In [None]:
cvimage_gpu = gpu_fftconvolve(image)

plt.figure(figsize=(8,8))
plt.imshow(cvimage_gpu, cmap=plt.cm.gray);
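
The GPU cell uses a complex-to-complex FFT on `complex64` data and keeps only the real part at the end. As a rough CPU-only analogue (NumPy stands in for the CUDA FFT here, and the `demo_*` names are illustrative), the sketch below checks that this path agrees with the real-input path, up to single-precision rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
demo_image = rng.random((64, 64)).astype(np.float32)
demo_response = np.zeros_like(demo_image)
demo_response[:5, :5] = rng.random((5, 5)).astype(np.float32)

# Complex-to-complex path, mirroring the data flow of the GPU cell
image_c = demo_image.astype(np.complex64)
response_c = demo_response.astype(np.complex64)
complex_path = np.fft.ifft2(np.fft.fft2(image_c) * np.fft.fft2(response_c))

# Real-input path, as in the NumPy cell
real_spectrum = np.fft.rfft2(demo_image) * np.fft.rfft2(demo_response)
real_path = np.fft.irfft2(real_spectrum, s=demo_image.shape)

max_imag = np.abs(complex_path.imag).max()
max_diff = np.abs(complex_path.real - real_path).max()
print("largest imaginary part:", max_imag)
print("largest difference between paths:", max_diff)
```

The imaginary part is only rounding noise, which is why taking `.real` after the inverse transform is safe for real-valued inputs.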

## Timing all Implementations

In [None]:
print('scipy')
%timeit fftconvolve(image, laplacian, mode='same')
print('\nnaive')
%timeit naive_fftconvolve(image)
print('\nmkl')
%timeit mkl_fftconvolve(image)
print('\ngpu')
%timeit gpu_fftconvolve(image)
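
`%timeit` is an IPython magic, so it is not available in a plain Python script. A minimal sketch of an equivalent measurement with the standard `timeit` module follows. The helper `time_call` and the `demo_*` names are hypothetical, and a small random array stands in for the astronaut image.

```python
import timeit
import numpy as np

def time_call(func, *args, repeat=5, number=3):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number

rng = np.random.default_rng(0)
demo_image = rng.random((256, 256)).astype(np.float32)
demo_response = np.zeros_like(demo_image)
demo_response[:5, :5] = 1.0  # placeholder kernel, not the notebook's filter

def demo_convolve(image):
    spectrum = np.fft.rfft2(image) * np.fft.rfft2(demo_response)
    return np.fft.irfft2(spectrum, s=image.shape)

best = time_call(demo_convolve, demo_image)
print("naive: %.3f ms per call" % (best * 1e3))
```

When reading the GPU timing, keep in mind that `gpu_fftconvolve` performs its host-to-device and device-to-host copies inside the timed function, so the measurement includes transfer time as well as compute time.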

## Using VML to intrinsics

Intel's Vector Math Library (VML)

The VML math functions leverage SIMD instructions for higher throughput. Use VML ufuncs in place of NumPy ufuncs for a simple performance gain when a lower-precision result is acceptable.

See https://docs.continuum.io/accelerate/mkl_ufuncs for a full list of supported functions.