[SPARK-5992][ML] Locality Sensitive Hashing #15148

Closed · wants to merge 46 commits · 9 participants
@Yunni
Contributor

Yunni commented Sep 19, 2016

What changes were proposed in this pull request?

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the design doc.

Detailed changes are as follows (a usage sketch follows this list):
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
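
A minimal usage sketch of the API listed above (a sketch under assumed names: the exact setter names, default columns, and package location may differ from the patch):

import org.apache.spark.ml.feature.RandomProjection
import org.apache.spark.ml.linalg.Vectors

// df, dfA, dfB are assumed DataFrames with a Vector column "features".
val model = new RandomProjection()
  .setInputCol("features")
  .setOutputCol("hashes")
  .setOutputDim(2)
  .setBucketLength(2.0)
  .fit(df)

val key = Vectors.dense(1.0, 2.0)
val neighbors = model.approxNearestNeighbors(df, key, 3)   // 3 nearest neighbors of key
val pairs = model.approxSimilarityJoin(dfA, dfB, 2.5)      // pairs within distance 2.5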

Things that will be implemented in a follow-up PR:

  • Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
  • PySpark Integration for the scala classes and methods.

How was this patch tested?

Unit tests are implemented for all the classes and algorithms added here. A scalability test on Uber's dataset was performed internally.

Tested the methods on WEX dataset from AWS, with the steps and results here.

References

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).

+ * @return The distance between hash vectors x and y in double
+ */
+ protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+ (x.asBreeze - y.asBreeze).toArray.map(math.abs).min

@viirya

viirya Sep 19, 2016

Contributor

This seems to include redundant operations.
For a DenseVector, we can directly use its values: Array[Double].
For a SparseVector, we can use Breeze's subtraction op and then get the data from the result.

@Yunni

Yunni Sep 19, 2016

Contributor

I am wondering what's the API to calculate the difference between two Spark Vectors?

@viirya

viirya Sep 20, 2016

Contributor

For a pair of DenseVectors, you can directly use the values member and do something like:

x.values.zip(y.values).map(x => math.abs(x._1 - x._2)).min

For a pair of SparseVectors, you may not need to convert (x.asBreeze - y.asBreeze) back to an Array, because the resulting vector should be sparse too. We can map directly on the Breeze vector, i.e., (x.asBreeze - y.asBreeze).map(math.abs).min.

@Yunni

Yunni Sep 20, 2016

Contributor

Thanks! Since it's generated by hashing, I am assuming it's a pair of dense vectors.

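Pulling the thread above together, a minimal sketch of a dense-aware hashDistance (the dispatch on vector type here is illustrative, not the patch's exact code):

import org.apache.spark.ml.linalg.{DenseVector, Vector}

def hashDistance(x: Vector, y: Vector): Double = (x, y) match {
  // Dense pair (the common case for hash signatures): work on the raw arrays,
  // no Breeze round trip.
  case (dx: DenseVector, dy: DenseVector) =>
    dx.values.zip(dy.values).map { case (a, b) => math.abs(a - b) }.min
  // Fallback for any other Vector type: element-wise via apply().
  case _ =>
    (0 until x.size).map(i => math.abs(x(i) - y(i))).min
}
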
@viirya

Contributor

viirya commented Sep 19, 2016

@Yunni Please use a proper title as "[SPARK-5992][ML] ...".

+ */
+ def approxNearestNeighbors(dataset: Dataset[_], key: KeyType, k: Int = 1,
+ distCol: String = "distance"): Dataset[_] = {
+ if (k < 1) {

@viirya

viirya Sep 19, 2016

Contributor

Usually we use assert for this. And a more informative error message might be "The number of nearest neighbors cannot be less than 1."

@Yunni

Yunni Sep 19, 2016

Contributor

Done.

+ val nearestHashValue = nearestHashDataset.collect()(0)(0).asInstanceOf[Double]
+
+ // Filter the dataset where the hash value equals to u
+ val modelSubset = modelDataset.filter(hashDistUDF(col($(outputCol))) === nearestHashValue)

@viirya

viirya Sep 19, 2016

Contributor

You apply hashDistUDF twice to the dataset. Besides, you might get fewer than k nearest neighbors in the current approach. We can do this like:

val hashDistCol = "_hash_dist"
modelDataset.withColumn(hashDistCol, hashDistUDF(col($(outputCol))))
  .sort(hashDistCol)
  .drop(hashDistCol)
  .limit(k)
  .withColumn(distCol, keyDistUDF(col($(inputCol))))

@Yunni

Yunni Sep 19, 2016

Contributor

Actually this does not work because the number of elements with the same "hashDistCol" value can be much larger than k. In that case, we would be randomly selecting k elements with the same "hashDistCol" value.

To resolve the issue you mentioned, I am changing nearestHashValue to hashThreshold, which is the maximum "hashDistCol" for the top k elements.

@viirya

viirya Sep 20, 2016

Contributor

Yeah, I think we can replace the limit above with a filter to choose the elements that fall in this range.

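A minimal sketch of the hashThreshold approach Yunni describes above (column and helper names are illustrative, not the exact ones in the patch):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max}

// `hashed` is assumed to already carry a numeric hash-distance column.
def probeCandidates(hashed: DataFrame, hashDistCol: String, k: Int): DataFrame = {
  // hashThreshold = largest hash distance among the k smallest ones, so the
  // filter keeps at least k candidates even when many rows tie on distance.
  val hashThreshold = hashed
    .select(hashDistCol)
    .sort(hashDistCol)
    .limit(k)
    .agg(max(hashDistCol))
    .head()
    .getDouble(0)
  hashed.filter(col(hashDistCol) <= hashThreshold)
}
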
+ val explodeCols = Seq("lsh#entry", "lsh#hashValue")
+ val explodedA = processDataset(datasetA, explodeCols)
+
+ // If this is a self join, we need to recreate the inputCol of datasetB to avoid ambiguity.

@viirya

viirya Sep 19, 2016

Contributor

Do we need this? I think we already do a dedup operation in the Analyzer for self-joins.

@viirya

viirya Sep 19, 2016

Contributor

Got it. You want to access inputCol from both the left and right sides.

@viirya

viirya Sep 19, 2016

Contributor

Once #14719 is merged, I think we can skip this redundant operation.

@Yunni

Yunni Sep 19, 2016

Contributor

Added a TODO.

+ )
+
+ // Filter the joined datasets where the distance are smaller than the threshold.
+ joinedDatasetWithDist.distinct().filter(col(distCol) < threshold)

@viirya

viirya Sep 19, 2016

Contributor

I think doing distinct after filter should be better, as you will filter out most of the records.

@Yunni

Yunni Sep 19, 2016

Contributor

Very good point. Done.

+
+class RandomProjectionModel(
+ override val uid: String,
+ val randUnitVectors: Array[breeze.linalg.Vector[Double]])

@viirya

viirya Sep 19, 2016

Contributor

Can we use Spark vectors? We have a BLAS library (BLAS.dot) for Spark vectors, and you don't need to convert to Breeze and back to a Spark vector below.

@Yunni

Yunni Sep 19, 2016

Contributor

Done.

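For reference, a sketch of the projection computed on ml vectors as suggested. Note that org.apache.spark.ml.linalg.BLAS is package-private, so this only applies inside spark.ml code; randUnitVectors, key, and bucketLength are assumed from the surrounding class:

// One hash value per random unit vector, with no Breeze conversion.
val hashValues: Array[Double] = randUnitVectors.map { randUnitVector =>
  math.floor(BLAS.dot(key, randUnitVector) / bucketLength)
}
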
+ }
+
+ override protected[this] def keyDistance(x: Vector, y: Vector): Double = {
+ euclideanDistance(x.asBreeze, y.asBreeze)

@viirya

viirya Sep 19, 2016

Contributor

Vectors.sqdist works on Spark vectors. We can use it and take its square root.

@Yunni

Yunni Sep 19, 2016

Contributor

Done.

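The suggested keyDistance then reduces to a one-liner (sketch using org.apache.spark.ml.linalg.Vectors):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Euclidean distance from the squared-distance primitive.
def keyDistance(x: Vector, y: Vector): Double = math.sqrt(Vectors.sqdist(x, y))
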
+
+ private[this] var inputDim = -1
+
+ private[this] lazy val randUnitVectors: Array[breeze.linalg.Vector[Double]] = {

@viirya

viirya Sep 19, 2016

Contributor

As mentioned above, we can use Spark vectors to avoid the Breeze conversion.

@Yunni

Yunni Sep 19, 2016

Contributor

Done.

+
+ // Compute precision and recall
+ val correctCount = expected.join(actual, model.getInputCol).count().toDouble
+ (correctCount / expected.count(), correctCount / actual.count())

@viirya

viirya Sep 19, 2016

Contributor

I think the precision and recall values should be swapped. correctCount / expected.count() should be recall. correctCount / actual.count() should be precision.

@Yunni

Yunni Sep 19, 2016

Contributor

Done.

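Spelled out, the swapped metrics viirya describes look like this (variable names follow the quoted diff; expected and actual are assumed DataFrames of true and returned neighbors):

val correctCount = expected.join(actual, model.getInputCol).count().toDouble
val precision = correctCount / actual.count()    // fraction of returned neighbors that are correct
val recall    = correctCount / expected.count()  // fraction of true neighbors that were returned
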
@viirya

Contributor

viirya commented Sep 19, 2016

This looks pretty solid. cc @dbtsai @jkbradley

@Yunni changed the title from "Spark 5992 yunn lsh" to "[SPARK-5992][ML] Locality Sensitive Hashing" on Sep 19, 2016

@sethah

Contributor

sethah commented Sep 19, 2016

@Yunni Could you provide the specific reference paper this patch is based on? Also, it might be nice to put the reference in the code somewhere, e.g. the scaladoc for LSH/Random Projections. Thanks!

@Yunni

Contributor

Yunni commented Sep 19, 2016

Thanks very much for reviewing, @viirya. I made some changes based on your comments. PTAL.

@Yunni

Contributor

Yunni commented Sep 19, 2016

Hi @sethah, I have updated the reference in the PR and scaladoc for LSH.

@viirya

Contributor

viirya commented Sep 20, 2016

@Yunni Thanks for working on this.

@sethah

Contributor

sethah commented Sep 20, 2016

A few high-level comments/questions:

  • Should this go into the feature package as a feature estimator/transformer? That is where other dimensionality reduction techniques have gone, and I'm not sure we should create a new package for this.
  • Could you please point me to a specific section of a specific paper that documents the approaches used here? AFAICT, this patch implements something different from most of the approximate-nearest-neighbors-via-LSH algorithms found in papers. For instance, the method in section 2 here as well as the method on Wikipedia here are different from the implementation in this PR. Also, the spark package spark-neighbors employs those approaches. I'm not an expert in LSH, so I was just hoping for some clarification.
  • The implementation of the RandomProjections class actually follows the implementation of the "2-stable" (or more generically, "p-stable") LSH algorithm, and not the "Random Projection" algorithm in the paper that is referenced. At the very least, we should clarify this. Potentially, we should think of a better name.

@karlhigley Would you mind taking a look at the patch, or providing your input on the comments?

@Yunni

Contributor

Yunni commented Sep 20, 2016

Hi @sethah,
Thanks for the comments.

  • I agree. I have moved the lsh package to be under feature.
  • In "Similarity search in high dimensions via hashing", there is an algorithm in the box "Approximate Nearest Neighbor Query". It's almost the same as the algorithm on Wikipedia. I think you find it looks different because:
    1. it is using the Dataset API instead of RDD.
    2. it finds exactly k elements regardless of the bucket sizes (unless #elems in the origin dataset < k).
  • I am clarifying this in the scaladoc of RandomProjection. I will implement the LSH for cosine distance (which is RandomProjection in the paper) as SignRandomProjection. Please advise if you come up with a better name.
@sethah

Contributor

sethah commented Sep 23, 2016

Thanks for your clarifications. I still don't see where the algorithm used in this patch comes from. Here is my summary of how the approach here differs from the approach found on Wikipedia and in several papers; please let me know if you don't agree with this:

Approach found on Wikipedia and here and here

  1. Select d Gaussian unit vectors g1, ..., gd
  2. Construct a hash function h(x) = (floor((g1 dot x) / w), ..., floor((gd dot x) / w)) (h(x) is a d-dimensional vector)
  3. Construct L hash functions (i.e. L d-dimensional vectors)
  4. For every point x_i in the input, compute L hash functions h1(x_i), ..., hL(x_i)
  5. For the query point q, compute the L hash functions h1(q), ..., hL(q)
  6. For each point in the input:
     For each hash function l = 1 to L:
     if (h_l(x_i) == h_l(q)) select as candidate
  7. Compute true distances for each candidate
  8. Return the candidate if its true distance is less than some threshold R

Approach in this PR

  1. Select d Gaussian unit vectors g1, ..., gd
  2. Construct a hash function h(x) = (floor((g1 dot x) / w), ..., floor((gd dot x) / w)) (h(x) is a d-dimensional vector)
  3. For every point in the input, compute a single hash function h(x_i)
  4. Compute a single hash function for the query point h(q)
  5. For every point in the input, compute the minimum element of the absolute difference between the hashes, i.e. min(|h(x_i) - h(q)|)
  6. Take the subset of the k smallest "hash distances" from 5.
  7. Compute the true distances for the subset and return the subset sorted on this column

Looking forward to hearing your thoughts, thanks!

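For concreteness, a sketch of the shared hash function h(x) = (floor((g1 dot x) / w), ..., floor((gd dot x) / w)) that both summaries above start from (plain Scala; names are illustrative):

import scala.util.Random

// d Gaussian random vectors of the input dimensionality `dim`.
def randomGaussianVectors(d: Int, dim: Int, rng: Random): Array[Array[Double]] =
  Array.fill(d)(Array.fill(dim)(rng.nextGaussian()))

// h(x): one floor((g dot x) / w) component per random vector g.
def pStableHash(randVectors: Array[Array[Double]], w: Double)(x: Array[Double]): Array[Int] =
  randVectors.map { g =>
    val dot = g.zip(x).map { case (gi, xi) => gi * xi }.sum
    math.floor(dot / w).toInt
  }
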
@Yunni

Contributor

Yunni commented Sep 23, 2016

Hi @sethah,

  • I think h(x) = floor((g dot x) / w) is one hash function, as it is in the wiki. Please note that g and x are vectors.
  • In bullet point 6 of "Approach found on Wikipedia and here and here", we have the problem that the number of candidates can be smaller than k. So we use the trick of hashDistance to avoid this problem.

To clarify, the definition is hashDistance(x, q) = min_i |h_i(x) - h_i(q)|. From the definition, we can see that "for l = 1 to L, exists(h_l(x_i) == h_l(q))" is equivalent to hashDistance(x, q) = 0. Does this explanation sound clearer to you?

@karlhigley

On the whole, this looks pretty solid to me. I commented on some of the places @sethah mentioned as potential discrepancies.

With regard to the issue of single vs multiple hash functions, this PR considers a set of random projections as a single compound hash function that produces an LSH signature instead of multiple hash functions that produce elements of the LSH signature. It was a little confusing until I read the RandomProjection class, but it looks right to me and I think it will work for future distance measures.

On the algorithmic difference concerning lookup in multiple buckets to return exactly k neighbors, this PR looks like it uses a version of multi-probe LSH. Probing multiple buckets allows the use of shorter LSH signatures to achieve similar recall, which reduces memory requirements. It might be nice to provide the option to probe a single bucket and accept less than k neighbors to get faster lookups, though.

Finally, I am a little bit concerned about the performance of recomputing the hash tables (modelDataset) for every lookup. Maybe I'm missing something about how that works in this PR, though?

+ */
+ protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+ // Since it's generated by hashing, it will be a pair of dense vectors.
+ x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min

@karlhigley

karlhigley Sep 25, 2016

By default, this is computing the Manhattan distance between hash values, which probably works as a proxy for the distance between hash buckets when using LSH based on p-stable distributions and any other approach that produces vectors of integers/doubles as hash signatures (e.g. MinHash).

However, the default won't work for approaches that produce vectors of booleans as hash signatures (e.g. sign random projection for cosine distance). It could be overridden to compute Hamming distance in that case, though.

@Yunni

Yunni Sep 26, 2016

Contributor

Yes, I am planning to override it for BitSampling (LSH for Hamming distance).

@jkbradley

jkbradley Sep 26, 2016

Member

If it's algorithm-specific, I'd recommend making it abstract here so it's more future bug-proof.

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

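A hypothetical override of hashDistance for boolean-style (0/1) signatures, along the lines karlhigley suggests above (sketch only; the subclass and its signature encoding are assumptions):

import org.apache.spark.ml.linalg.Vector

// Hamming distance: count of positions where the two signatures disagree.
def hammingHashDistance(x: Vector, y: Vector): Double =
  x.toDense.values.zip(y.toDense.values).count { case (a, b) => a != b }.toDouble
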
+ val hashThreshold = thresholdDataset.collect()(0)(0).asInstanceOf[Double]
+
+ // Filter the dataset where the hash value is less than the threshold.
+ val modelSubset = modelDataset.filter(hashDistCol <= hashThreshold)

@karlhigley

karlhigley Sep 25, 2016

This looks like a variant of multi-probe LSH, which seems extensible to the other distance measures and hashing schemes that are likely candidates for future work. It might be nice to have an option to select between probing a single bucket and probing multiple buckets, though -- in some cases, the user may be happy to accept fewer than exactly k neighbors (i.e. a non-zero miss rate) in exchange for faster lookups.

@Yunni

Yunni Sep 26, 2016

Contributor

This is really good advice. Added an option for single/multiple probing.

+ assert(k > 0, "The number of nearest neighbors cannot be less than 1")
+ // Get Hash Value of the key v
+ val keyHash = hashFunction(key)
+ val modelDataset = transform(dataset)

@karlhigley

karlhigley Sep 25, 2016

Does this transform the original dataset for each key lookup? If so, it seems inefficient for repeated lookups. I can imagine a few possibilities to help with that:

  • Allowing lookups of multiple keys simultaneously (to amortize the cost of building the hash tables), and/or
  • Providing a way to pre-compute the hash tables (i.e. modelDataset) and execute multiple lookups against them, and/or
  • Storing the hash tables on the model and making it possible to cache them in memory

The ability to load and save models with their modelDatasets would also help, but that's probably out of scope for this initial PR. Structuring the model so that hash tables can be pre-computed/cached would set up that future work nicely, though.

@Yunni

Yunni Sep 26, 2016

Contributor

Based on my discussion with @jkbradley, storing the hash tables on the model is not an option. To make the interface cleaner, I went with option (2) in the new commit. Basically, transform() will skip the computation if outputCol is already there. So users can do the following to avoid inefficiency in repeated lookups:

val model = new MinHash()...fit(df)
val transformedDf = model.transform(df).cache()
model.approxNearestNeighbor(transformedDf, key1, k=3)
model.approxNearestNeighbor(transformedDf, key2, k=5)

Meanwhile, passing the raw DataFrame works as well, but with lower performance for multiple queries:

val model = new MinHash()...fit(df)
model.approxNearestNeighbor(df, key, k=3)

@MLnick

MLnick Sep 26, 2016

Contributor

Can you elaborate on this discussion? I don't see it referenced anywhere on the PR, JIRA or design doc.

@jkbradley

jkbradley Sep 26, 2016

Member

@MLnick I believe the discussion is in the resolved comments in the design doc.

The main issue is that, currently, no MLlib models store the entire dataset. Some do store transient references, but those do not need to be saved when models are saved. If a model method like approxNearestNeighbor assumed that the dataset were part of the model, then model.save would need to save the entire dataset.

I'm in favor of supporting pre-computation too. I'd recommend that approxNearestNeighbor implement the logic to check whether outputCol already exists; this should not belong in transform (to be consistent with the rest of MLlib).

@Yunni

Yunni Sep 28, 2016

Contributor

Moved the checking logic to approxNearestNeighbor and approxSimilarityJoin.

Yunni added some commits Sep 26, 2016

@Yunni

Contributor

Yunni commented Sep 26, 2016

Thanks @karlhigley. All of your comments are very helpful. I made some changes to make it work. :)

@MLnick

Made a first pass. Will dig deeper into algorithm technical details soon.

+ /** @group getParam */
+ final def getOutputDim: Int = $(outputDim)
+
+ setDefault(outputDim -> 1)

@MLnick

MLnick Sep 26, 2016

Contributor

Make this one line, i.e. setDefault(outputDim -> 1, outputCol -> "lsh_output")

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ *
+ * @group param
+ */
+ final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension",

@MLnick

MLnick Sep 26, 2016

Contributor

Strictly speaking this is actually the number of hash buckets/functions. Yes, it is a "dimensionality reduction" and the output vectors have this dimension, but perhaps we can add a bit more documentation here expanding on the role of the buckets?

@Yunni

Yunni Sep 28, 2016

Contributor

This is the dimension of LSH OR-amplification. Added in Scaladoc.

+ * @param schema The schema of the input dataset without outputCol
+ * @return A derived schema with outputCol added
+ */
+ final def transformLSHSchema(schema: StructType): StructType = {

@MLnick

MLnick Sep 26, 2016

Contributor

This method should be protected. Also, while not strictly required, it's common to call this type of shared method validateAndTransformSchema (it's not defined as an internal API, it's just what has ended up being used commonly in various models & transformers).

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ final def transformLSHSchema(schema: StructType): StructType = {
+ val outputFields = schema.fields :+
+ StructField($(outputCol), new VectorUDT, nullable = false)
+ StructType(outputFields)

@MLnick

MLnick Sep 26, 2016

Contributor

You can use SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) for this purpose.

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ * the distance between each record and the key.
+ */
+ def approxNearestNeighbors(dataset: Dataset[_], key: KeyType, k: Int = 1,
+ singleProbing: Boolean = true,

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ * @tparam T The class type of lsh
+ * @return A tuple of two doubles, representing the false positive and false negative rate
+ */
+ def checkLSHProperty[KeyType, T <: LSHModel[KeyType, T]]

@MLnick

MLnick Sep 26, 2016

Contributor

Minor point, but technically, this method is only calculating the property - not checking it. The check happens in the relevant test. Perhaps reword the doc?

@Yunni

Yunni Sep 28, 2016

Contributor

Renamed the method.

+ * @tparam T The class type of lsh
+ * @return A tuple of two doubles, representing precision and recall rate
+ */
+ def checkApproxNearestNeighbors[KeyType, T <: LSHModel[KeyType, T]]

@MLnick

MLnick Sep 26, 2016

Contributor

Again, not actually checking the Precision / Recall in the method.

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ * @tparam T The class type of lsh
+ * @return A tuple of two doubles, representing precision and recall rate
+ */
+ def checkApproxSimilarityJoin[KeyType, T <: LSHModel[KeyType, T]]

@MLnick

MLnick Sep 26, 2016

Contributor

same here

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ * @return A tuple of two doubles, representing precision and recall rate
+ */
+ def checkApproxSimilarityJoin[KeyType, T <: LSHModel[KeyType, T]]
+ (lsh: LSH[KeyType, T], datasetA: Dataset[_], datasetB: Dataset[_],

@MLnick

MLnick Sep 26, 2016

Contributor

arg indentation style

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

+ * @return A tuple of two doubles, representing precision and recall rate
+ */
+ def checkApproxNearestNeighbors[KeyType, T <: LSHModel[KeyType, T]]
+ (lsh: LSH[KeyType, T], dataset: Dataset[_], key: KeyType, k: Int,

@MLnick

MLnick Sep 26, 2016

Contributor

arg indentation style

@Yunni

Yunni Sep 28, 2016

Contributor

Done.

@MLnick

Contributor

MLnick commented Sep 26, 2016

At a high level I like the idea here and the work that's gone into a unified interface. A few comments:

Data types

I'm not that keen on mixing up the input data types between Vector, Array[Double] and (later) Array[Boolean]. I think we should stick with Vector throughout.

For MinHash, what is the thinking behind Array[Double] rather than Vector?

I can see for binary (i.e. Hamming distance) that Array[Boolean] is attractive as a kind of type-safety thing, but I still think a Vector interface is more natural.

In both cases the input could be sparse, right? So forcing arrays as input can have some space implications. Vector also neatly allows supporting both the dense and sparse cases.

NN search

It seems to me that, while technically this is a "transformer" to a low-dimensional representation (so transform outputs the lower-dimensional vectors), the main use case is either approximate NN (aka top-k) or the similarity join. (Correct me if I'm wrong, but generally the low-dimensional vectors are not used as inputs to some model, as is the case for PCA / SVD etc., but rather for the approximate similarity search.)

For approxNearestNeighbors, a common use case is in recommendations, to efficiently support top-k recommendations across an entire dataset. This can't so easily be achieved with the self approxSimilarityJoin, because usually we want up to k recommended items (or most similar items), and how do we select the similarity threshold to achieve this? It's data-dependent.

So I do think we need an efficient way to do approxNearestNeighbors over a DataFrame of inputs rather than only one key at a time. I'd like to see this applied to predicting top-k with ALSModel, as that will enable efficient prediction (and make cross-validation on ranking metrics feasible). The current approach, when applied to, say, computing the top-k most similar items for each of 1 million items, would not, I think, be scalable. Perhaps either the ANN approach can be extended to multiple inputs, or the similarity join can be extended to also handle k neighbors per item rather than the similarity threshold.

I'd be interested to hear your other use cases - is it mainly the similarity join, or really doing ANN on only one item?

@MLnick

Contributor

MLnick commented Sep 26, 2016

ok to test

@SparkQA

SparkQA commented Oct 28, 2016

Test build #67683 has finished for PR 15148 at commit 6cda936.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

SparkQA commented Oct 28, 2016

Test build #67688 has finished for PR 15148 at commit 97e1238.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
+ *
+ * Model produced by [[MinHash]], where multiple hash functions are stored. Each hash function is
+ * a perfect hash function:
+ * `g_i(x) = (x * k_i mod prime) mod numEntries`

@jkbradley

jkbradley Oct 28, 2016

Member

should be x+1, not x, right?

@Yunni

Yunni Oct 28, 2016

Contributor

x should be indices from Z_prime^*, so 0 should not be included.

I will add more docs here.

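A sketch of one hash function of the quoted form, applied to the elements of the input set (whether the element needs a +1 shift to stay inside Z_prime^* is the open question in the thread above; names are illustrative):

// g(x) = (x * k mod prime) mod numEntries, minimized over the elements of the set.
def minHashValue(elems: Seq[Long], k: Long, prime: Long, numEntries: Int): Double =
  elems.map(x => (x * k % prime) % numEntries).min.toDouble
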
+/**
+ * :: Experimental ::
+ *
+ * Model produced by [[RandomProjection]]

@jkbradley

jkbradley Oct 28, 2016

Member

It would be good to document normalization:

  • The input vectors are not normalized, so the number of buckets will be (max L2 norm of input vectors) / bucketLength.
  • The randUnitVectors are normalized to be unit vectors.
@SparkQA

SparkQA commented Oct 28, 2016

Test build #67721 has finished for PR 15148 at commit 3570845.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@jkbradley

Member

jkbradley commented Oct 28, 2016

This LGTM now. Any other comments from other reviewers? I'll merge this, but we can follow up as needed.

Thanks very much @Yunni for the PR and everyone else for helping to review!

Merging with master

@asfgit closed this in ac26e9c on Oct 28, 2016

@Yunni

Contributor

Yunni commented Oct 28, 2016

Awesome! Thanks Joseph and thanks everyone else for reviewing this! 👍

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

[SPARK-5992][ML] Locality Sensitive Hashing
## What changes were proposed in this pull request?

Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).

Detailed changes are as follows:
(1) Implement abstract LSH, LSHModel classes as Estimator-Model
(2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
(3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
(4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin

Things that will be implemented in a follow-up PR:
 - Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
 - PySpark Integration for the scala classes and methods.

## How was this patch tested?
Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally.

Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit).

## References
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).

Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>

Closes #15148 from Yunni/SPARK-5992-yunn-lsh.
@sethah

Contributor

sethah commented Nov 5, 2016

I apologize for coming late to this, but I am taking a look at some of the documentation now. For the RandomProjection class there are two links: one to the Wikipedia entry on stable distributions and one to a survey paper. The Wikipedia link points to the "stable distributions" section despite the page also having a section on random projections, which is the supposed algorithm. The paper has a "Random Projection" section as well - neither of the Random Projection methods in the links matches the code here. I expressed this concern before. The approach in the RandomProjection class does not match either the "Random Projection" method OR the "p-stable distribution" methods that I find in the literature.

I summarized this in a comment way up towards the top. If this method is some well-accepted hybrid of the two, fine, but I think the references would leave users quite confused. I think it's nice to have certainty about the practical effectiveness of this method since it has already been deployed in industry, so my main concern is really just documentation. Right now, we're linking to sources which describe distinctly different algorithms than what we have implemented. Thoughts?

For convenience, some references:

@karlhigley

karlhigley commented Nov 6, 2016

@sethah: I think you're right that there's a discrepancy here, and I'm embarrassed that I didn't see it when I first reviewed the PR. On a reread of the source and your comment above, it looks like the LSH models in this PR use a single hash function to compute a single hash table, which doesn't match my understanding of OR-amplification. For OR-amplification, multiple hash functions would be applied to compute multiple hash tables, and points placed in the same bucket in any hash table would be considered candidate neighbors.

From the comments, it looks like the discrepancy might be due to some confusion between the number of hash functions applied and the dimensionality of the hash functions. This is a subtle point that I was confused about too, and it took me quite a while to work it out because different authors use the term "hash function" to refer to different things at different levels of abstraction. In one sense (at a lower level), a random projection is made up of many component hash functions, but in another sense (at a higher level) a random projection represents a single hash function for the purposes of OR-amplification.

Given that the PR has already been merged, I concur that the best way forward is to adjust the comments and documentation. That probably involves changing the references to OR-amplification to simply refer to the dimensionality of the hash function.

On the other issue you mentioned regarding mismatches between what's implemented and the linked documents, I think some of that confusion also stems from inconsistent terminology in the source material. LSH based on p-stable distributions (for Euclidean distance) does involve random projections, although the authors don't directly say so in the paper. There's a somewhat similar LSH method for cosine distance that's sometimes referred to as "sign random projection" (though the authors of the paper don't use that term either). Sign random projection is what the "Random Projection" section of the Wikipedia page is referring to; what's implemented here looks like LSH based on p-stable distributions. Maybe one way to clarify would be to name the models after the distance measures they're intended to approximate, and provide explanations of the methods they use in the comments?

@sethah

Contributor

sethah commented Nov 7, 2016

@karlhigley Thanks for your detailed response. From the amplification section on Wikipedia, it is pretty clear to me that this implementation is not doing OR/AND amplification. outputDim is just the number of concatenated random hash functions (k in the wiki article).

For now we can clarify some of this a bit better in the documentation, and perhaps in the future we can extend this implementation to use optional AND/OR amplification. I can work on a PR for it this week, unless there are any objections. @jkbradley @Yunni @MLnick ?

Contributor

Yunni commented Nov 7, 2016

@sethah I think you are right. OR-amplification is only applied inside NN search and similarity join, through hashDistance and explode; transform itself does not apply any amplification.

Sorry for missing this. I will clarify this in the user guide, and I am happy for you to send a PR to fix the documentation. @jkbradley @MLnick

+ @Since("2.1.0")
+ override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+ // Since it's generated by hashing, it will be a pair of dense vectors.
+ x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min

@sethah

sethah Nov 7, 2016

Contributor

Does this make sense for MinHash? For the RandomProjection class I understand that the absolute difference between their hash values is a measure of their similarity, but for MinHash I don't think it is. It is true that dissimilar items have a lower likelihood of hash collisions, but it should not be true that they have a low likelihood of hashing to buckets near each other. We use this hashDistance to ensure that we get enough near-neighbor candidates, but I don't see how this hashDistance corresponds to similarity in the case where there are no zero-distance elements.

@Yunni

Yunni Nov 7, 2016

Contributor

Makes sense. hashDistance for MinHash should just be binary. I will make another PR to fix this.

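
For reference, a binary hashDistance of the kind suggested here might look like the following (a sketch of the idea only, not the code from the follow-up PR): distance 0.0 if any hash component collides, 1.0 otherwise.

import org.apache.spark.ml.linalg.Vector

// Sketch only: 0.0 if x and y collide in at least one hash component (an OR-style match),
// 1.0 otherwise, so sorting by this distance surfaces colliding candidates first.
def binaryMinHashDistance(x: Vector, y: Vector): Double = {
  val collides = x.toDense.values.zip(y.toDense.values).exists { case (a, b) => a == b }
  if (collides) 0.0 else 1.0
}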

Contributor

sethah commented Nov 7, 2016

Ok, I'm looking more closely at this algorithm versus the literature. I agree that there is a lot of inconsistent terminology which is probably leading to some of the confusion here.

Most or all of the LSH algorithms in the literature describe a process which applies a composition of AND and OR amplification. @karlhigley This is what the package spark-neighbors does as well, correct? AND amplification is applied by generating hash functions g(x) = (h1(x), h2(x), ..., hd(x)) which are concatenations of several of the vanilla locality sensitive hash functions. These algorithms only compare g(x) == g(y) for near-neighbor candidacy. They then apply OR amplification by using L of these hash functions and accepting a point as a candidate if any of the g_i, for i = 1 to L, falls into the same bucket as the query point.

In this patch we only apply OR amplification by generating a single g(x) = (h1(x), h2(x), ..., hd(x)) and we consider candidates if any of the h_i for i = 1 to d match. For a (p1, p2) sensitive hashing family, this OR amplification transforms it into a (1 - (1 - p1)^d, 1 - (1 - p2)^d) family, where p1 is a "good" collision and p2 is a "bad" collision. Consider a (0.8, 0.2) hash family where we apply OR amplification with a dimension d = 10. We will transform this into a (0.99999989, 0.893) sensitive family. Basically, we amplify both the good and the bad collisions. If instead we implement the composition of AND then OR amplification as in the literature, we transform a (0.8, 0.2) sensitive family into a (0.8785, 0.0064) one (for d = 4, L = 4). In this way, we amplify the "good" collision probability and dampen the "bad" one. If this is correct, then I think the current implementation will end up selecting most of the points as candidates and may impact the runtime performance. This reference sums it up nicely IMO.

I will look into testing this out more concretely.
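
For anyone checking the arithmetic, the amplified probabilities above follow from two one-line formulas; here is a plain Scala scratch computation (paste into a Scala REPL, not Spark code):

// Collision probability of a (p1, p2)-sensitive family after amplification.
def orOnly(p: Double, d: Int): Double = 1.0 - math.pow(1.0 - p, d)                          // OR over d functions
def andThenOr(p: Double, d: Int, l: Int): Double = 1.0 - math.pow(1.0 - math.pow(p, d), l)  // AND over d, then OR over L
// Reproducing the numbers above for a hypothetical (0.8, 0.2)-sensitive family:
println(orOnly(0.8, 10))       // ~0.99999990  ("good" collisions amplified)
println(orOnly(0.2, 10))       // ~0.893       ("bad" collisions amplified too)
println(andThenOr(0.8, 4, 4))  // ~0.8785
println(andThenOr(0.2, 4, 4))  // ~0.0064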

Contributor

Yunni commented Nov 7, 2016

@sethah Yes, that's why outputDim is introduced for users to trade off between false negative rate and running time.
During my tests, LSH without amplification can be (0.5, 0.5)-sensitive or even worse depending on the input distribution. Even in that case, outputDim = 4 or outputDim = 5 already gives very good accuracy, and the number of rows being scanned should be about outputDim * averageBucketSize.

Member

jkbradley commented Nov 7, 2016

It sounds like discussions are converging, but I want to confirm a few things + make a few additions.

Amplification

Is this agreed?

  • Approx neighbors and similarity are doing OR-amplification when comparing hash values, as described in the Wikipedia article. This is computing an amplified hash function implicitly.
  • transform() is not doing amplification. It outputs the value of a collection of hash functions, rather than aggregating them to do amplification.
    • This is my main question: Is amplification ever done explicitly, and when would you ever need that?

Adding combined AND and OR amplification in the future sounds good to me. My main question right now is whether we need to adjust the API before the 2.1 release. I don't see a need to, but please comment if you see an issue with the current API.

  • One possibility: We could rename outputDim to something specific to OR-amplification.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

Random Projection

I agree this should be renamed to something like "PStableHashing." My apologies for not doing enough background research to disambiguate.

MinHash

I think this is implemented correctly, according to the reference given in the linked Wikipedia article.

  • This reference to perfect hash functions may be misleading. I'd prefer to remove it.

hashDistance

Rethinking this, I am unsure about what function we should use. Currently, hashDistance is only used by approxNearestNeighbors. Since approxNearestNeighbors sorts by hashDistance, using a soft measure might be better than what we currently have:

  • MinHash
    • Currently: Uses OR-amplification for single probing, and something odd for multiple probing
    • Best option for approxNearestNeighbors: this Wikipedia section, which is equivalent to OR-amplification when using single probing. I.e., replace this line of code with: x.toDense.values.zip(y.toDense.values).count(pair => pair._1 == pair._2).toDouble / x.size
  • RandomProjection
    • Currently: Uses OR-amplification for single probing, and something reasonable for multiple probing

@Yunni What is the best resource you have for single vs multiple probing? I'm wondering now if they are uncommon terms and should be renamed.
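
As a concrete illustration of the MinHash suggestion above (a sketch with illustrative names, not the final code): the fraction of hash components on which x and y agree estimates Jaccard similarity, and if approxNearestNeighbors sorts by ascending hashDistance, the corresponding distance would be one minus that fraction.

import org.apache.spark.ml.linalg.Vector

// Fraction of matching MinHash components: an estimate of Jaccard similarity.
def minHashSimilarityEstimate(x: Vector, y: Vector): Double =
  x.toDense.values.zip(y.toDense.values).count { case (a, b) => a == b }.toDouble / x.size

// A distance version of the same measure, suitable for ascending sorts.
def minHashSoftDistance(x: Vector, y: Vector): Double =
  1.0 - minHashSimilarityEstimate(x, y)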

Contributor

sethah commented Nov 7, 2016

So I'll try to summarize the AND/OR amplification and how I think it fits into the current API right now. LSH relies on a single hashing function h(x) which is (R, cR, p1, p2)-sensitive, which just means it meets certain properties needed for LSH. In the case of the 2-stable method, h(x) = floor((x dot r) / w), which maps Vector[Double] => Int. p1 and p2 correspond to "good" and "bad" collision probabilities respectively. To decrease the probability of a bad collision we can use AND-amplification by creating a new, compound hash function g(x) = [h1(x), h2(x), ..., hd(x)] where the h_i(x) correspond to different random vectors r. Now we only consider collisions for two vectors x and y if g(x) == g(y) (i.e. standard vector equality). This makes the probability of both types of collisions decrease to (p1^d, p2^d). For a hypothetical (0.8, 0.2)-sensitive distribution this goes to (0.4, 0.0016) for d = 4. This makes the false-positive rate very low, but it also means we miss a lot of good candidates. To mitigate this we can further apply OR-amplification by generating not one compound hash function g(x) but L compound functions

g1(x) = [h11(x), ..., h1d(x)]
g2(x) = [h21(x), ..., h2d(x)]
gL(x)  = [hL1(x), ..., hLd(x)]

Then we convert the original probabilities to (1 - (1 - p1^d)^L, 1 - (1 - p2^d)^L), and in our example (0.8, 0.2) => (0.8785, 0.006) for L = 4, d = 4.

The current implementation is equivalent to the L = 1 case always, and outputDim corresponds to d. The concern I have with the RandomProjection API right now is that if we extend to offer arbitrary L then our models do not store just a d-dimensional array of random vectors but more like an L x d matrix of random vectors. And we would have hashFunctions instead of hashFunction (though this is still private). One question I have is - why do we expose randUnitVectors at all? I feel it leaves us more room for changes in the future if we do not expose it, especially considering the points I just made. There may be some reason to expose it that I haven't thought of though. What do we think about changing it to private?

I like the idea of changing outputDim to something related to OR-amplification a lot. I think minhash is done properly right now but the hashDistance measure doesn't make sense as already discussed. Right now, I'd like to focus on making sure we don't corner ourselves with the API since internal algo details and documentation can always be changed later.
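
Sketched in plain Scala (hypothetical parameter names, not this PR's API), the L x d construction for the 2-stable case looks roughly like this:

import scala.util.Random

object CompoundHashSketch {
  def main(args: Array[String]): Unit = {
    val inputDim = 5
    val d = 4      // AND parameter: length of each compound signature g_i(x)
    val l = 3      // OR parameter: number of compound hash functions / hash tables
    val w = 2.0    // bucket width for the 2-stable hash h(x) = floor((x dot r) / w)
    val rng = new Random(0)
    // An L x d collection of random Gaussian projection vectors.
    val randVectors: Array[Array[Array[Double]]] =
      Array.fill(l, d)(Array.fill(inputDim)(rng.nextGaussian()))

    def h(x: Array[Double], r: Array[Double]): Int =
      math.floor(x.zip(r).map { case (xi, ri) => xi * ri }.sum / w).toInt

    // g_1(x), ..., g_L(x): each is a length-d signature.
    def compoundHashes(x: Array[Double]): Array[Array[Int]] =
      randVectors.map(table => table.map(r => h(x, r)))

    val x = Array(1.0, 0.5, -0.3, 2.0, 0.0)
    val y = Array(1.1, 0.4, -0.2, 2.1, 0.1)
    // Candidate pair iff g_i(x) == g_i(y) for some i (OR over the L tables).
    val candidate = compoundHashes(x).zip(compoundHashes(y))
      .exists { case (gx, gy) => gx.sameElements(gy) }
    println(s"candidate pair: $candidate")
  }
}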

Member

jkbradley commented Nov 7, 2016

The current implementation is equivalent to the L = 1 case always, and outputDim corresponds to d.

That is true if you're talking about comparing hash values. But for approx similarity and nearest neighbors, this is doing d = 1 and L = outputDim (i.e., OR amplification). (Did you swap accidentally?) Definitely need to clarify in the docs.

I'm not too worried about making randUnitVectors private. We can always deprecate it and have it throw an exception when it is not applicable.

I'm more worried about the schema for transform(). Do you think we should go ahead and output a Matrix so we can support AND and OR in the future?

Contributor

sethah commented Nov 8, 2016

I was using L to refer to the number of compound hash functions, but you're right that in my explanation L was the "OR" parameter and d was the "AND" parameter.

Thinking more about it, this is a tough question. What is the intended use of the output column generated by transform? As an alternative set of features with decreased dimensionality?

When/if we use the AND/OR amplification, we could go a couple of different routes. Let's say for d = 3 and L = 3 we could first apply our hashing scheme to the input to obtain:

features               g1                 g2                 g3
[12.5609584702036...   [112.0,1.0,12.0]   [1.0,120.0,16.0]   [102.0,1.0,14.0]
...                    ...                ...                ...

Then we generate g1(q), g2(q), g3(q) where q is the query point, and we would select all points where g1(q) == g1(x_i) OR g2(q) == g2(x_i) OR .... In spark-neighbors, the output dataframe instead has L * N rows, where N is the number of rows in the input dataframe; you can then join on the hashed column plus a "table identifier" (the index l in range [1, L]). Still, this makes a temporary dataframe within the near-neighbors or approx-join algos, and I'm not sure the output schema of transform needs to have all L hashed values. We could store randUnitVectors: Array[Array[Vector]] and for transform output the hashed value for only the first sequence of random vectors, but that seems a bit strange to me. Thoughts?
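
To make that join strategy concrete, here is a schematic Spark sketch (hypothetical column names, with signatures shown as arrays of ints rather than Vectors) of exploding each row into L (table, signature) pairs and joining candidates on equality of both:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("lsh-join-sketch").getOrCreate()
    import spark.implicits._
    // (id, hashes): hashes(i) is the length-d signature g_i(x) for table i; here L = 2, d = 2.
    val left = Seq((1L, Seq(Seq(0, 3), Seq(1, 1))), (2L, Seq(Seq(2, 2), Seq(0, 5)))).toDF("idA", "hashesA")
    val right = Seq((7L, Seq(Seq(0, 3), Seq(4, 4))), (8L, Seq(Seq(9, 9), Seq(0, 5)))).toDF("idB", "hashesB")
    // Explode each row into L (table index, signature) pairs, as in the spark-neighbors approach.
    val leftExp = left.select($"idA", posexplode($"hashesA").as(Seq("tbl", "sigA")))
    val rightExp = right.select($"idB", posexplode($"hashesB").as(Seq("tbl", "sigB")))
    // A pair is a candidate if its signatures match in the same table (OR over the L tables).
    val candidates = leftExp.join(rightExp, Seq("tbl")).where($"sigA" === $"sigB")
      .select("idA", "idB").distinct()
    candidates.show()  // expect (1, 7) via table 0 and (2, 8) via table 1
    spark.stop()
  }
}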

karlhigley Nov 8, 2016

@sethah: Your description of the combination of AND and OR amplification from the literature matches my understanding, and the combination of the two is what I was aiming for in spark-neighbors. I also concur with your assessment of the potential performance impacts of OR-amplification without first applying AND-amplification, in terms of both precision/recall and runtime.


karlhigley Nov 8, 2016

@jkbradley: "Multi-probe" seems like a standard term, and I think this is the original paper that coined it.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

I confess that I'm a little confused about what you mean by the above. There are several relevant dimensionalities: the dimensionality of the input points (x), the dimensionality of the computed hashes (i.e. the results of applying g(x)), and the number of hash tables computed (i.e. how many g(x) functions are applied), which is the dimensionality of OR-amplification (in a sense).

After wrestling with inconsistent terminology for a while, what I settled on for spark-neighbors was to refer to g(x) as a hash function, the outputs of g(x) as hashes, the sub-elements of g(x) -- h1(x) etc. -- as whatever made sense for the particular method (e.g. permutations for Minhash), and the output of each of the L g(x) functions as a hash table. While that terminology isn't necessarily standard, it helped me identify the common concepts across LSH methods clearly enough to build some abstractions around them.

Using those terms, the dimensionality of the g(x) hash functions and the hashes they produce is equivalent to the number of h(x) sub-elements they contain. I thought of applying OR-amplification as producing multiple hash tables by using multiple g(x) functions, with a collision in any one hash table producing a pair of candidate neighbors.

Does that make any more (or less) sense?

@jkbradley: "Multi-probe" seems like a standard term, and I think this is the original paper that coined it.

Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

I confess that I'm a little confused what you mean by the above. There are several relevant dimensionalities: the dimensionality of the input points (x), the dimensionality of the computed hashes (i.e. the results of applying g(x)), and the number of hash tables computed (i.e. how many g(x) functions are applied), which is the dimensionality of AND-amplification (in a sense).

After wrestling with inconsistent terminology for a while, what I settled on for spark-neighbors was to refer to g(x) as a hash function, the outputs of g(x) as hashes, the sub-elements of g(x) -- h1(x) etc. -- as whatever made sense for the particular method (e.g. permutations for Minhash), and the output of each of the L g(x) functions as a hash table. While that terminology isn't necessarily standard, it helped me identify the common concepts across LSH methods clearly enough to build some abstractions around them.

Using those terms, the dimensionality of the g(x) hash functions and the hashes they produce is equivalent to the number of h(x) sub-elements they contain. I thought of applying OR-amplification as producing multiple hash tables by using multiple g(x) functions, with a collision in any one hash table producing a pair of candidate neighbors.

Does that make any more (or less) sense?

Contributor

Yunni commented Nov 8, 2016

@jkbradley I agree with most of your comments above. And I would like to suggest the following:

  • I would recommend a more intuitive name like HyperplaneProjection instead of PStableHashing if we adopt the LSH function @sethah suggested.
  • x.toDense.values.zip(y.toDense.values).count(pair => pair._1 == pair._2).toDouble / x.size is AND-amplification. I think we should use OR-amplification here. I have already made a pull request to fix the issue in #15800.
  • I think for MinHash, multi-probing NN Search is either single probing or full scan.
  • Here is my reference for Multi-probing: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf

@sethah @karlhigley Now I see your LSH function for Euclidean distance is the AND-amplification of what I have implemented.

  • Do you have any reference for compound AND/OR-amplification? I see that this does not always work without assumptions on the distance threshold and sensitivity; for example, (0.6, 0.4) => (0.426, 0.098) for L = 4, d = 4, and (0.8, 0.2) => (0.678, 0.000) for L = 10, d = 10.
  • For the schema of transform(), I think we should either add a generic type for the output column in the LSH class or change the output type to Array[Vector]. I would recommend the latter because (1) it's very easy to select from the array to get what @sethah suggested, and (2) the output column type still needs to be Spark SQL compatible, which is not so generic.
Member

jkbradley commented Nov 9, 2016

@sethah

What is the intended use of the output column generated by transform? As an alternative set of features with decreased dimensionality?

I agree it's mainly for dimensionality reduction, though these LSH functions are not ideal for that. (E.g., most people doing dimensionality reduction would probably want to use random projections without bucketing.)

@karlhigley

I agree with your description of different dimensionalities and agree we may just have to pick some terminology out of many choices. I'm fairly ambivalent about what terminology we choose, though it would be great for it to match whatever references we cite. (And maybe we do need another reference cited for describing OR vs AND amplification and "dimensions.")

@Yunni

  • Have you seen "HyperplaneProjection" used in literature?
  • I'll respond about the hashDistance in https://github.com/apache/spark/pull/15800
  • Let's not implement both types of amplification just yet. Let's either:
    • Fix the API so we can add them in the future, or
    • Make LSH private for now so that we can fix its API for 2.2.
Contributor

MLnick commented Nov 9, 2016

I tend to agree that the terminology used here is a little confusing, and doesn't seem to match up with the "general" terminology (I use that term loosely however).

Terminology

In my dealings with LSH, I too have tended to come across the version that @sethah mentions (and @karlhigley's package, and others such as https://github.com/marufaytekin/lsh-spark, implement). That is, each input vector is hashed into L "tables" of hash signatures of "length" or "dimension" d. Each hash signature is created by concatenating the result of applying d "hash functions".

I agree what's effectively implemented here is L = outputDim and d=1. What I find a bit troubling is that it is done "implicitly", as part of the hashDistance function. Without knowing that is what is happening, it is not clear to a new user - coming from other common LSH implementations - that outputDim is not the "number of hash functions" or "length of the hash signatures" but actually the "number of hash tables".

Transform semantics

In terms of transform - I disagree somewhat that the main use case is "dimensionality reduction". Perhaps there are common examples of using the hash signatures as a lower-dim representation as a feature in some model (e.g. in a similar way to say a PCA transform), but I haven't seen that. In my view, the real use case is the approximate nearest neighbour search.

I'll give a concrete example for the transform output. Let's say I want to export recommendation model factor vectors (from ALS), or Word2Vec vectors, etc, to a real-time scoring system. I have many items, so I'd like to use LSH to make my scoring feasible. I do this by effectively doing a real-time version of OR-amplification. I store the hash tables (L tables of d hash signatures) with my vectors. When doing "similar items" for a given item, I retrieve the hash sigs of the query item, and use these to filter down the candidate item set for my scoring. This is in fact something I'm working on in a demo project currently. So if we will support the OR/AND combo, then it will be very important to output the full L x d set of hash sigs in transform.

Proposal:

My recommendation is:

  1. future-proof the API by returning Array[Vector] in transform (as mentioned above by others);
  2. we need to update the docs / user guide to make it really clear what the implementation is doing;
  3. I think we need to make it clear that the implied d value here is 1 - we can mention that AND amplification will be implemented later and perhaps even link to a JIRA.
  4. rename outputDim to something like numHashTables.
  5. when we add AND-amp, we can add the parameter hashSignatureLength or numHashFunctions.
  6. make as much private as possible to avoid being stuck with any implementation detail in future releases (e.g. I also don't see why randUnitVectors or randCoefficients needs to be public).

One issue I have is that currently we would output a 1 x L set of hash values. But it actually should be L x 1 i.e. a set of signatures of length 1. I guess we can leave it as is, but document what the output actually is.

I believe we should support OR/AND in future. If so, then to me many things need to change - hashFunction, hashDistance etc will need to be refactored. Most of the implementation is private/protected so I think it will be ok. Let's just ensure we're not left with an API that we can't change in future. Setting the same L with d = 1 must then yield the same result as the current impl to avoid a behavior change (I guess this will be ok since the current default for L is 1, and we can make the default for d, when added, also 1).

Finally, my understanding was that results from some performance testing would be posted. I don't believe we've seen them yet.
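
To illustrate the serving-side flow described above (illustrative names, plain Scala outside Spark): items are hashed offline into L signatures each, and at query time only items that share at least one signature with the query item, in the same table, survive into exact scoring.

object ServingFilterSketch {
  type Signature = Seq[Int]

  // Keep an item as a candidate iff it collides with the query in at least one hash table.
  def candidateItems(
      queryHashes: IndexedSeq[Signature],
      itemHashes: Map[Long, IndexedSeq[Signature]]): Set[Long] =
    itemHashes.collect {
      case (id, sigs) if sigs.indices.exists(i => sigs(i) == queryHashes(i)) => id
    }.toSet

  def main(args: Array[String]): Unit = {
    val query = IndexedSeq(Seq(0, 3), Seq(1, 1))         // L = 2 signatures of length d = 2
    val items = Map(
      7L -> IndexedSeq(Seq(0, 3), Seq(4, 4)),            // collides with the query in table 0
      8L -> IndexedSeq(Seq(9, 9), Seq(2, 2)))            // no collision: filtered out before scoring
    println(candidateItems(query, items))                // Set(7)
  }
}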

Contributor

MLnick commented Nov 9, 2016

Oh and for naming - I'm ok with the current ones actually. However we could think about changing to ScalarRandomProjectionLSH (a term mentioned in @karlhigley's package), as later we will have SignRandomProjectionLSH for cosine distance; and MinHashLSH, etc - just to make it clear what the class is doing. (perhaps later we have some other random projection algorithm that conflicts etc).

We could name according to the estimated metric such as EuclideanLSH or so on, but if we want to support say Euclidean and Manhattan distance at some point that becomes problematic. So perhaps best not to?

Member

jkbradley commented Nov 9, 2016

@MLnick I agree with most of your comments. A few responses:

In terms of transform - I disagree somewhat that the main use case is "dimensionality reduction". Perhaps there are common examples of using the hash signatures as a lower-dim representation as a feature in some model (e.g. in a similar way to say a PCA transform), but I haven't seen that.

This is very common in academic research and literature, but it may not be in industry. I'm fine with not considering it for now.

I also don't see why randUnitVectors or randCoefficients needs to be public

You mentioned people using LSH outside of Spark for serving. In order to do that, we will need to expose randUnitVectors and randCoefficients so that users can compute hash values for query points. That said, I'm fine with making those private for now and preventing this use case for 1 release while we stabilize the API.

One issue I have is that currently we would output a 1 x L set of hash values. But it actually should be L x 1 i.e. a set of signatures of length 1. I guess we can leave it as is, but document what the output actually is.

What about outputting a Matrix instead of an Array of Vectors? That will make it easy to change in the future, without us having weird Vectors of length 1.

Finally, my understanding was results from some performance testing would be posted. I don't believe we've seen this yet.

You can see some results linked from the JIRA.
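
For the serving use case mentioned above, "computing hash values for query points outside Spark" could look roughly like this (a sketch assuming the p-stable form h(x) = floor((x dot r) / w) discussed earlier; names like bucketWidth are illustrative, not the model's API):

import org.apache.spark.ml.linalg.{Vector, Vectors}

object ExternalHashingSketch {
  // Re-apply the model's hash functions to a query point, given its random unit vectors.
  def hashQuery(x: Vector, randUnitVectors: Seq[Vector], bucketWidth: Double): Seq[Double] =
    randUnitVectors.map { r =>
      val dot = r.toArray.zip(x.toArray).map { case (ri, xi) => ri * xi }.sum
      math.floor(dot / bucketWidth)
    }

  def main(args: Array[String]): Unit = {
    val randUnitVectors = Seq(Vectors.dense(0.6, 0.8), Vectors.dense(-0.8, 0.6))
    println(hashQuery(Vectors.dense(1.0, 2.0), randUnitVectors, bucketWidth = 2.0))  // List(1.0, 0.0)
  }
}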

Contributor

MLnick commented Nov 9, 2016

This is very common in academic research and literature, but it may not be in industry. I'm fine with not considering it for now.

Ok, makes sense - for the transform case, if users are looking to use the hash sigs directly as a lower-dim representation, they can always set L = 1 and choose d (assuming we do AND + OR later) to get just one "vector" output.

For the public vals - sorry if I wasn't clear. I meant we should probably not expose them until the API is fully baked. But yes, I see that they are useful to expose once we're happy with the API. I just don't love the idea of changing things later (and throwing errors and whatnot) if we can avoid it - I think we're seeing similar issues with e.g. NaiveBayes now.

What about outputting a Matrix instead of an Array of Vectors? That will make it easy to change in the future, without us having weird Vectors of length 1.

Matrix can work - I don't think Array[Vector] is an issue either. I seem to recall a comment above that Matrix was a bit less easy to work with (exploding indices and so on). I don't see a big difference between an Lx1 matrix and an L-length Array of 1-d vectors in practical terms. So, I'm ok with either approach.

I'll check the JIRA - sorry I missed the links.

Contributor

sethah commented Nov 9, 2016

If we were to use a matrix for the output, then when we do approxSimilarityJoin we would want to explode the output column by matrix rows, assuming the matrix structure was:

| ---g1(x)---- |
| ---g2(x)---- |
|     ...      |
| ---gL(x)---- |

This is probably possible, but might be a bit awkward? Array[Vector] might make it a bit easier.

Member

jkbradley commented Nov 9, 2016

Good points: Array of Vectors sounds good to me.

There has been a lot of discussion. I'm going to try to summarize things in a follow-up JIRA, which I'll link here shortly. LSH turned out to be a much messier area than I expected; thanks a lot to everyone for all of the post-hoc reviews and discussions!

Contributor

Yunni commented Nov 10, 2016

Thanks for the discussion, everyone! I will take a look at the JIRA.

uzadude added a commit to uzadude/spark that referenced this pull request Jan 27, 2017

[SPARK-5992][ML] Locality Sensitive Hashing
Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>

Closes #15148 from Yunni/SPARK-5992-yunn-lsh.

ThySinner pushed a commit to ThySinner/spark that referenced this pull request Feb 9, 2017

[SPARK-5992][ML] Locality Sensitive Hashing