Frequently Asked Questions
If your question is not answered here, just send us an email to email@example.com. We'd be happy to help!
Q: You say that FALCONN uses LSH. What is LSH?
A: LSH stands for Locality-Sensitive Hashing, which is a way to partition a metric space into nicely shaped cells. We have written a short introduction to LSH, which also contains pointers to further literature.
Q: I read that LSH requires a lot of memory and is thus impractical. Is that true?
A: Yes and no. As defined originally, "vanilla" LSH can indeed sometimes require a lot of memory, and for large datasets this can become a problem. However, researchers have come up with multiprobe LSH, which is a simple and efficient way to reduce the space used by LSH-based data structures. FALCONN supports multiprobe LSH and works very well in the regime where the entire LSH data structure is about as large or even smaller than the dataset itself. For instance, if you have a data set of size 4GB and your machine has 8GB of RAM, FALCONN should not run into memory issues.
Q: FALCONN provides LSH for cosine similarity. How can it be used for the general Euclidean distance?
A: LSH for cosine similarity is good enough if your dataset is "nice". What it means formally is that there is a decent correlation between an angle between two vectors and the corresponding Euclidean distance. In this case, you can partition the dataset using LSH for cosine similarity, and then use the genuine Euclidean distance to find the k-nearest neighbors among the candidates. In the beginning, it is good to center your dataset so that the mean is at the origin. It often makes the dataset more isotropic and easier to partition.
Q: I have heard that pairwise distances in high-dimensional datasets tend to have the same value, which renders LSH useless. Is that true?
A: The above statement is provably true for random data sets. Most applications involve datasets that are not random but instead contain some structure. As a result, there are many real datasets where the nearest-neighbor distances are different from arbitrary pairwise distances.
Second, if your data is essentially random and has very similar pairwise distances, your best bet would likely be a simple linear scan. It is conjectured that such instances are hard nearest neighbor search problems in general (not only for LSH-based methods), i.e., that they require linear time to answer queries.
Q: Will FALCONN work well on my data?
A: That depends on the dataset. One property that enables fast nearest neighbor search is that the typical distance between a point and its nearest neighbor is significantly smaller than the typical distance between two arbitrary points in the dataset.
If you have trouble getting FALCONN to work well on your data, feel free to send us an email.
Q: How should I set the parameters?
A: See the corresponding section in the LSH Primer to understand the different parameters. We also provide a function
get_default_parameters that uses reasonable default values for sufficiently "nice" datasets (see How to use FALCONN for details).
Q: How long does it take to set up a FALCONN data structure?
A: Usually, building the data structure is very quick and often significantly faster than more data-adaptive methods. As a rough guideline, setting up an LSH table for several million points in several hundred dimensions takes around 5 to 10 minutes on a single CPU core (we plan to publish more detailed comparison benchmarks).