GPU Accelerated t-SNE for CUDA with Python bindings
Clone or download
Latest commit efa2098 Nov 11, 2018


WARNING: This code is still in active development. While the core code is tested and working, some features need aditional testing.

This repo is an optimized CUDA version of Barnes-Hut t-SNE by L. Van der Maaten with associated python modules. We find that our implementation of t-SNE can be up to 1200x faster than Sklearn, or up to 50x faster than Multicore-TSNE when used with the right GPU. The paper describing our approach, as well as the results below, is available at

To begin, check out our wiki for install instructions and usage:


Simulated Data

Time taken compared to other state of the art algorithms on synthetic datasets with 50 dimensions and four clusters for varying numbers of points. Note the log scale on both the points and time axis, and that the scale of the x-axis is in thousands of points (thus, the values on the x-axis range from 1K to 10M points. Dashed lines represent projected times. Projected scaling assumes an O(nlog(n)) implementation.


The performance of t-SNE-CUDA compared to other state-of-the-art implementations on the MNIST dataset. t-SNE-CUDA runs on the raw pixels of the MNIST dataset (60000 images x 768 dimensions) in under 7 seconds.


The performance of t-SNE-CUDA compared to other state-of-the-art implementations on the CIFAR-10 dataset. t-SNE-CUDA runs on the raw pixels of the CIFAR-10 training set (50000 images x 1024 dimensions x 3 channels) in under 12 seconds.

Comparison of Embedding Quality

The quality of the embeddings produced by t-SNE-CUDA do not differ significantly from the state of the art implementations. See below for a comparison of MNIST cluster outputs.

Left: MULTICORE-4 (501s), Middle: BH-TSNE (1156s), Right: t-SNE-CUDA (Ours, 6.98s).


To install our library, follow the instructions in the installation section of the wiki.

Note: There appear to be some compilation instability issues when using parallel compilation. Running Make twice seems to fix it, as does running make without parallel compile. We believe that this is an NVIDIA issue with our externed device variables, so we're looking into how we can fix that.


Like many of the libraries available, the python wrappers subscribe to the same API as sklearn.manifold.TSNE.

You can run it as follows:

from tsnecuda import TSNE
X_embedded = TSNE(n_components=2, perplexity=15, learning_rate=10).fit_transform(X)

It's worth noting that if n_components is >= 3, then the program uses the naive O(n^2) method by default. If the number of components is 2, then you can use the heavily optimized Barnes-Hut implementation.

For more information on running the library, or using it as a C++ library, see the Python usage or C++ Usage sections of the wiki.

Future work

  • Allow for double precision
  • Expand FMM methods
  • Add multi-threaded CPU version for those without a GPU

Known Bugs

  • Odd bug with some datasets that causes a hang/gpu memory error.


Please cite this repository if it was useful for your research:

  title={t-SNE-CUDA: GPU-Accelerated t-SNE and its Applications to Modern Data},
  author={Chan, David M and Rao, Roshan and Huang, Forrest and Canny, John F},
  journal={arXiv preprint arXiv:1807.11824},

This library is built on top of the following technology, without this tech, none of this would be possible!

L. Van der Maaten's paper



CUDA Utilities/Pairwise Distance






Our code is built using components from FAISS, the Lonestar GPU library, GTest, CXXopts, and OrangeOwl's CUDA utilities. Each portion of the code is governed by their respective licenses - however our code is governed by the BSD-3 license found in LICENSE.txt