A Python wrapper for the OpenCL FFT library clFFT.



The open source library clFFT implements FFTs that run on GPUs via OpenCL. Some highlights are:

  • batched 1D, 2D, and 3D transforms
  • supports many transform sizes (any combination of powers of 2, 3, 5, 7, 11, and 13)
  • flexible memory layout
  • single and double precision
  • complex and real-to-complex transforms
  • supports injecting custom code for data pre- and post-processing
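
The size rule in the list above can be checked with a few lines of Python. The sketch below uses a hypothetical helper, is_supported_length (not part of gpyfft or clFFT), which divides out every factor of 2, 3, 5, 7, 11 and 13 and reports whether anything is left over. It mirrors only the rule as stated here; consult the clFFT documentation for the authoritative list of supported lengths.

```python
# Hypothetical helper, not part of gpyfft or clFFT: checks the size rule
# "any combination of powers of 2, 3, 5, 7, 11 and 13".
CLFFT_RADICES = (2, 3, 5, 7, 11, 13)


def is_supported_length(n: int) -> bool:
    """Return True if n is a product of powers of 2, 3, 5, 7, 11 and 13."""
    if n < 1:
        return False
    for p in CLFFT_RADICES:
        # Divide out every factor of p.
        while n % p == 0:
            n //= p
    # Any remainder other than 1 contains a prime factor outside the list.
    return n == 1


print([n for n in (1024, 1000, 2310, 17, 34, 4096) if is_supported_length(n)])
# [1024, 1000, 2310, 4096]
```

For example, 2310 = 2 · 3 · 5 · 7 · 11 passes, while 34 = 2 · 17 fails because of the factor 17.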


This Python wrapper is designed to integrate tightly with PyOpenCL. It consists of a low-level Cython-based wrapper with an interface similar to that of the underlying C library. On top of that, it offers a high-level interface designed to work on data contained in instances of pyopencl.array.Array, a NumPy work-alike array class for GPU computations. The high-level interface takes some inspiration from pyFFTW. For details of the high-level interface see


  • 2017/11/05: For 2D and 3D transforms with default (empty) settings for the transform axes, a smarter ordering of the transform axes is now chosen depending on the memory layout: for a C-contiguous input array, the last axis is transformed first. I have seen large performance improvements, 3x to 4x compared to the previous approach (always transforming the first axis first). Please report benchmark results ('python -m gpyfft.benchmark') if this holds true for your GPU.
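
The ordering rule in this news item can be sketched as "transform the axis with the smallest memory stride (the fastest-varying axis) first", which reduces to "last axis first" for a C-contiguous array. This is a minimal sketch under that assumption; choose_axes_order is a hypothetical name, and gpyfft's actual heuristic may take other factors into account.

```python
# Hypothetical sketch of the axis-ordering heuristic; gpyfft's real code may differ.
def choose_axes_order(strides, axes):
    """Order transform axes so that the axis with the smallest stride
    (the fastest-varying one in memory) is transformed first."""
    return tuple(sorted(axes, key=lambda ax: abs(strides[ax])))


shape = (4, 1024, 1024)
itemsize = 8  # complex64: two 32-bit floats

# C-contiguous (row-major) strides in bytes: the last axis varies fastest.
c_strides = (shape[1] * shape[2] * itemsize, shape[2] * itemsize, itemsize)
print(choose_axes_order(c_strides, axes=(1, 2)))  # (2, 1)

# Fortran-contiguous (column-major) strides: the first axis varies fastest.
f_strides = (itemsize, shape[0] * itemsize, shape[0] * shape[1] * itemsize)
print(choose_axes_order(f_strides, axes=(1, 2)))  # (1, 2)
```

Sorting by absolute stride also handles arrays that are neither C- nor Fortran-contiguous, such as strided views, without special cases.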


The low-level interface is (more or less) complete; the high-level interface is not yet settled and is likely to change in the future. Features to come (not yet implemented in the high-level interface):

Work done

  • low-level wrapper (mostly) completed
  • high-level wrapper
  • complex-to-complex transform, in- and out-of-place
  • real-to-complex transform (out-of-place)
  • complex-to-real transform (out-of-place)
  • single precision
  • double precision
  • interleaved data
  • support for injecting custom OpenCL code (pre- and post-callbacks)
  • accept pyopencl arrays with non-zero offsets (Syam Gadde)
  • heuristics for choosing the order of the transform axes for optimal performance if none is given (Release 0.7.1)
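
The real-to-complex and complex-to-real transforms in the list above exploit a property of the DFT: for real input of length n, output k and output n - k are complex conjugates ("Hermitian symmetry"), so only n // 2 + 1 complex values are independent. clFFT's Hermitian layouts store just this half. The sketch below demonstrates the property with a naive O(n²) DFT in pure Python. It is for illustration only and says nothing about how clFFT computes the transform.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT (no scaling)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]  # real input, n = 6
X = dft(x)
n = len(x)

# Hermitian symmetry: X[n - k] equals the complex conjugate of X[k].
for k in range(1, n):
    assert abs(X[n - k] - X[k].conjugate()) < 1e-9

# Hence only the first n // 2 + 1 outputs need to be stored.
half = X[: n // 2 + 1]
print(len(half))  # 4
```

This is also why a real-to-complex output array is shorter than its input along one transformed axis.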

Basic usage

Here we describe a simple example of performing a batch of 2D complex-to-complex FFTs on the GPU using the high-level interface of gpyfft. The full source code of this example is contained in, which is the essence of. Note: for testing, it is recommended to start from the command line so that you can choose an OpenCL context interactively (otherwise, e.g. when using IPython, you are not asked and might end up with a CPU device, which is prone to fail).


import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from gpyfft.fft import FFT

Initialize the GPU:

context = cl.create_some_context()
queue = cl.CommandQueue(context)

Initialize memory (on host and GPU). In this example we want to perform four 2D FFTs in parallel on 1024x1024 single-precision complex data.

data_host = np.zeros((4, 1024, 1024), dtype=np.complex64)
#data_host[:] = some_useful_data
data_gpu = cla.to_device(queue, data_host)
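
For a sense of scale, the arithmetic below computes how much memory the batch allocated above occupies on the host and again on the GPU. It is plain Python with no GPU required. With NumPy, data_host.nbytes reports the same number. An out-of-place complex-to-complex transform would need a second buffer of the same size.

```python
# Four 1024 x 1024 complex64 arrays, as in the example.
batch, ny, nx = 4, 1024, 1024
bytes_per_element = 8  # complex64 = two 32-bit floats (real, imaginary)

n_bytes = batch * ny * nx * bytes_per_element
print(n_bytes)                 # 33554432
print(n_bytes / 2**20, "MiB")  # 32.0 MiB
```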

Create an FFT transform plan for a batched in-place 2D transform along the last two axes.

transform = FFT(context, queue, data_gpu, axes=(2, 1))
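
Transforming along axes (2, 1) computes a 2D DFT over the last two axes, independently for each of the arrays along axis 0. Because the 2D DFT is separable, it equals 1D DFTs along one axis followed by 1D DFTs along the other, and the order changes only the memory access pattern (which the axis-ordering heuristic exploits), not the result. The pure-Python sketch below checks this on a small 2 x 3 x 4 batch using naive O(n²) DFTs. The function names are hypothetical, and this is not how clFFT computes the transform.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT of a sequence (no scaling)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft2_last_axis_first(a):
    """2D DFT of a list of rows: 1D DFTs along the last axis, then the first."""
    rows = [dft(r) for r in a]
    cols = [dft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to rows


def fft2_first_axis_first(a):
    """The same 2D DFT, transforming along the first axis before the last."""
    cols = [dft(list(c)) for c in zip(*a)]
    return [dft(list(r)) for r in zip(*cols)]


def batched_fft2(batch, fft2):
    """Apply a 2D transform independently to every item along axis 0."""
    return [fft2(item) for item in batch]


batch = [
    [[1, 2, 0, -1], [0.5, 3, 1, 0], [2, -2, 1, 1]],  # item 0: 3 x 4
    [[0, 1, 0, 1], [1, 0, 1, 0], [4, 4, -4, -4]],    # item 1: 3 x 4
]
a_res = batched_fft2(batch, fft2_last_axis_first)
b_res = batched_fft2(batch, fft2_first_axis_first)

# Both axis orders agree up to floating-point rounding.
max_diff = max(abs(p - q)
               for ia, ib in zip(a_res, b_res)
               for ra, rb in zip(ia, ib)
               for p, q in zip(ra, rb))
print(max_diff < 1e-9)  # True
```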

If you want an out-of-place transform, provide the output array as an additional argument after the input data.

Start the work and wait until it is finished (note that enqueue() returns a tuple of events):

event, = transform.enqueue()
event.wait()  # block until the FFT has completed on the device

Read back the data from the GPU to the host:

result_host = data_gpu.get()


A simple benchmark is included as a submodule. You can run it from the command line with python -m gpyfft.benchmark, or from Python:

import gpyfft.benchmark

Note: you might want to set the PYOPENCL_CTX environment variable to select your OpenCL platform and device.
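
When comparing benchmark timings across GPUs, a common normalization (the convention used by FFTW's benchFFT) converts a time into nominal GFLOP/s using 5 · N · log2(N) operations per complex FFT of N total points. Whether gpyfft.benchmark reports numbers computed this way is not stated here. The helper name fft_gflops and the 2 ms timing below are hypothetical.

```python
import math


def fft_gflops(n_points: int, batch: int, seconds: float) -> float:
    """Nominal GFLOP/s for `batch` complex FFTs of n_points each,
    using the conventional 5 * N * log2(N) operation count."""
    flops = batch * 5 * n_points * math.log2(n_points)
    return flops / seconds / 1e9


# Four 1024 x 1024 complex 2D FFTs (N = 2**20 points each),
# assuming a hypothetical run time of 2 ms.
print(round(fft_gflops(1024 * 1024, 4, 2e-3), 1))  # 209.7
```

The figure is "nominal" because the actual number of floating-point operations depends on the algorithm clFFT selects for each size.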