This repository holds the code for https://www.jmlr.org/papers/v23/21-0055.html: Attraction-Repulsion Spectrum in Neighbor Embeddings.
If you use the work herein, we’d appreciate the following citation:
@article{boehm2022attraction,
  author  = {Jan Niklas Böhm and Philipp Berens and Dmitry Kobak},
  title   = {Attraction-Repulsion Spectrum in Neighbor Embeddings},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {95},
  pages   = {1--32},
  url     = {http://jmlr.org/papers/v23/21-0055.html}
}
After all instructions in this section have been completed, the code can be installed via
git clone https://github.com/berenslab/ne-spectrum
cd ne-spectrum
pip install --user -r requirements.txt
python setup.py build
mv bh*.so jnb_msc/transformer/
pip install --user -e .
The python setup.py build step above will probably fail to compile the
Cython extensions. For that to work, you need to install/compile
openTSNE manually (clone the repo and install it similarly to the
above). This project has a build-time dependency on a build-time
artifact (the file quad_tree.pxd) that is not installed along with
openTSNE by default. After installing openTSNE this way, you have to
adapt the two lines in setup.py that point to the locally installed
openTSNE folder, so that the missing file can be found during the
build process.
Furthermore, you need a patched version of forceatlas2 from https://github.com/jnboehm/forceatlas2, which adds degree repulsion to fa2. Install it as follows:
git clone https://github.com/jnboehm/forceatlas2
cd forceatlas2
rm fa2/fa2util.c
python setup.py build
pip install --user -e .
There is also a requirements.txt file to install the dependencies.
The code has been run in a conda environment with python 3.8.
The preprocessing script for the treutlein dataset resides in static/.
To create a figure, you can simply redo one of the files in media/.
For example, after installing redo, you can write

redo -j6 media/ar-spectrum.pdf

This will make sure that the data is present and up-to-date, and then
generate the figure. The instructions are written in the file
media/ar-spectrum.pdf.do. This calls out to redo again (the call
redo.redo_ifchange(datafiles + [plotter.labelname, plotter.rc]) around
l. 268 in media/ar-spectrum.pdf.do), which recurses until all
dependencies have been satisfied and afterwards creates the figure.
The file itself is written in Python, although a do file is language
agnostic: its interpreter is set by the shebang (#!) in the first line
of the file.
To see which parameters have been set, one can inspect which filenames
are generated by the script (look at what is supplied to
jnb_msc.redo.redo_ifchange(...)). This shows which parameters deviate
from the defaults set in the class definition.
The classes in the project are all derived from a single base class, which expects every subclass to implement four methods:
get_datadeps()
load()
transform()
save()
The first method allows redo to query the object for the files it
needs, so that the dependencies can be tracked properly. The remaining
methods should be more or less self-explanatory. It is of course also
possible to use the algorithms manually. For that, the .data field
needs to be populated with suitable data, and possibly the .init field
as well, depending on the algorithm at hand.
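To illustrate the contract, here is a hypothetical minimal stage. Only the four method names mirror the repository; the class name, file names, and the toy transform are made up and not the actual jnb_msc API:

```python
import numpy as np


class MeanCenterStage:
    """Toy stage that subtracts the column means from its input.

    Illustrative only: just the get_datadeps/load/transform/save
    contract follows the repository's base class.
    """

    def __init__(self, indir="."):
        self.indir = indir
        self.data = None  # set by load(), or populated manually
        self.init = None  # some algorithms also need an initial layout

    def get_datadeps(self):
        # The files this stage depends on; redo uses this list to
        # track dependencies.
        return [f"{self.indir}/data.npy"]

    def load(self):
        self.data = np.load(self.get_datadeps()[0])

    def transform(self):
        # The actual computation: here, mean-centering the columns.
        self.data = self.data - self.data.mean(axis=0)
        return self.data

    def save(self):
        np.save(f"{self.indir}/out.npy", self.data)
```

To use such an algorithm manually, populate .data (and .init if needed) and call transform() directly, skipping load() and save().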
There are four major types:
GenStage
NDStage
NNStage
SimStage
GenStage is the root class for the classes that generate a dataset.
This can mean simulating data or simply taking an existing dataset and
putting it in the correct place (again, for redo and this project
structure). NDStage takes in an NxD matrix and reduces its
dimensionality; one example would be PCA. NNStage can take the same
input as NDStage (but usually takes the output of e.g. PCA) and turns
it into an NxN affinity/adjacency matrix. This can then, in turn, be
fed into the last one, SimStage. These classes take in both an NxN
matrix and an NxD (D=2) array that serves as the initial layout.

There are further minor classes, for example simple classes that
rescale the input to have a predefined std or maximum scale (code in
jnb_msc/transformer/scale.py).
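Rescaling to a fixed standard deviation boils down to one multiplication. This is a sketch of the idea only, not the class-based code in jnb_msc/transformer/scale.py:

```python
import numpy as np


def rescale_to_std(X, f=1e-4):
    # Multiply X so that its overall standard deviation becomes f.
    # Illustrative; the repository implements this as a stage class
    # in jnb_msc/transformer/scale.py, which may differ in detail.
    return X * (f / X.std())
```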
If anything is unclear, please let me know.
This repository uses redo to essentially “cache” the computations that are carried out by the experiments. It works similarly to `make` in that it tries to detect which files have been changed and which parts need to be rebuilt. I chose this approach so that I wouldn’t have to either recompute everything every time or manually change the code to either load a (possibly stale) file or recompute it and save it.
For more information, the (rough) notes on the original design are here.
Unfortunately, the redo implementation I am using is written in
Python 2 and hence needs to be installed separately. It is not
strictly necessary to install this library, but all of the code that
generates the figures uses it to check the presence (and staleness) of
the files. Furthermore, the load() and save() functions are written
with redo in mind.
For example, to get an image of t-SNE on MNIST, one could write in the root of the repository:
redo 'data/mnist/pca/affinity/stdscale;f:1e-4/tsne/data.png'
This will “generate” the MNIST dataset, then reduce it with PCA to 50
dimensions (the default here). Afterwards it will calculate the
pairwise affinities from that. Then the std will be set to the given
value, and finally t-SNE will be run with the scaled dense NxD matrix
and the NxN matrix for its affinities. After the optimization, the
embedding (named data.npy) will be used to create a scatter plot,
which will in turn be saved as data.png. This file can then be viewed.
The prefix data/ is not mandatory; it can be omitted or structured in
any way. The “effect” of the other folder names is shown in
jnb_msc/util.py, where the names are resolved to classes. Further
arguments are given as colon-separated key:value pairs, separated from
the name by a semicolon; for example, stdscale;f:1e-4 causes stdscale
to be called with f=1e-4.
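A hypothetical sketch of that resolution step (the actual logic lives in jnb_msc/util.py and may differ in detail):

```python
def parse_component(component):
    """Split a path component like "stdscale;f:1e-4" into a stage
    name and a dict of string-valued keyword arguments.

    Illustrative only; converting the values to numbers and mapping
    the name to a class is left out.
    """
    name, *pairs = component.split(";")
    kwargs = {}
    for pair in pairs:
        key, value = pair.split(":", 1)
        kwargs[key] = value
    return name, kwargs
```

For example, parse_component("stdscale;f:1e-4") returns ("stdscale", {"f": "1e-4"}), and a component without a semicolon yields an empty argument dict.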
The folder prepped/ is used to dump all the files produced by the
algorithms. There are two reasons for this. Firstly, it prevents
clutter in the main directories. Secondly, this way the files can
actually be tracked via redo, since it does not support multiple
output files from one run. For more information on that, see also the
redo documentation (the heading “Virtual targets, side effects, and
multiple outputs”).