[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) #15795

Yunni · 2016-11-07T07:43:20Z

What changes were proposed in this pull request?

The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

How was this patch tested?

Doc has been generated through Jekyll, and checked through manual inspection.

srowen · 2016-11-07T10:00:14Z

examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java

+public class JavaRandomProjectionExample {
+    public static void main(String[] args) {
+        SparkSession spark = SparkSession
+                .builder()


Fix the indentation in many of the Java examples -- 4 space for continuation.

Used 2 space as in other java examples.

srowen · 2016-11-07T10:07:45Z

docs/ml-features.md

 </div>
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a class of dimension reduction hash families, which can be used as both feature transformation and machine-learned ranking. Difference distance metric has its own LSH family class in `spark.ml`, which can transform feature columns to hash values as new columns. Besides feature transforming, `spark.ml` also implemented approximate nearest neighbor algorithm and approximate similarity join algorithm using LSH.


Despite the opening sentence of the wikipedia article, I wouldn't class LSH as a dimensionality reduction technique? It's a set of hashing techniques where the hash preserves some properties. Maybe it's just my taste. But the rest of the text talks about the output as hash values.

What does "machine-learned ranking" refer to here? as this isn't a ranking technique per se.

I think this is missing a broad summary statement that indicates why LSH is even of interest: it provides a hash function where hashed values are in some sense close when the input values are close according to some metric. And then the variations below plug in different definitions of "close" and "input".

Rephrased. PTAL

srowen · 2016-11-07T10:10:57Z

docs/ml-features.md

+
+The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported.
+
+The bucket length can be used to trade off the performance of random projection. A larger bucket lowers the false negative rate but usually increases running time and false positive rate.


Isn't the point here that near vectors end up in nearby buckets? I feel like this is skipping the intuition of why you would care about this technique or when you'd use it. Not like this needs a whole paragraph, just a few sentences of pointers?

Fixed. PTAL

srowen · 2016-11-07T10:12:06Z

docs/ml-features.md

+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.


Same sort of comment: isn't the intuition that MinHash approximates the Jaccard similarity without actually computing it completely?

This may again reduce to my different understanding of the technique, but MinHash isn't a hashing technique per se. it relies on a given family of hash functions to approximate set similarity. I'm finding it a little hard to view it as an 'LSH family'

Not sure how MinHash approximates the Jaccard similarity? It is true that Pr(min(h(A)) = min(h(B))) is equal to Jaccard similarity when h is picked from a universal hash family. But I think we are not computing Pr(min(h(A)) = min(h(B))) in MinHash, we are only use this property to construct an LSH function.

You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jacquard distance.

srowen · 2016-11-07T10:12:26Z

docs/ml-features.md

+</div>
+</div>
+
+## MinHash for Jaccard Distance


Similarity, not distance, right? it's higher when they overlap more.

Jaccard Distance is just 1 - JaccardSimilarity(https://en.wikipedia.org/wiki/Jaccard_index)

There are 2 reasons we use Jaccard Distance instead of similarity:
(1) It's cleaner to map each distance metric to their LSH family by the definition of LSH.
(2) In approxNearestNeighbor and approxSimilarityJoin, the returned dataset has a distCol showing the distance values.

srowen · 2016-11-07T10:23:26Z

docs/ml-features.md

+In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but they are far away in distance, and we define false negative as the pair of features when their distance are close but they are not in the same hash bucket.
+
+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).


Just an aside, but this random projection + cosine distance technique is the main thing I think of when I think of "LSH". Is that not implemented?

No, that is tracked in https://issues.apache.org/jira/browse/SPARK-18082.

srowen · 2016-11-07T10:30:22Z

Also this seems to overlap with #15787

jiayue-zhang · 2016-11-07T15:16:17Z

examples/src/main/java/org/apache/spark/examples/ml/JavaRandomProjectionExample.java

+    public static void main(String[] args) {
+        SparkSession spark = SparkSession
+                .builder()
+                .appName("JavaMinHashExample")


JavaRandomProjectionExample

jiayue-zhang · 2016-11-07T15:16:57Z

examples/src/main/scala/org/apache/spark/examples/ml/RandomProjectionExample.scala

+    // Creates a SparkSession
+    val spark = SparkSession
+      .builder
+      .appName("MinHashExample")


RandomProjectionExample

jiayue-zhang · 2016-11-07T15:30:36Z

Examples are usually class/algorithm based, not interface/use case based. Maybe we can summarize the 5 classes into 2? Do you mind to modify the examples in #15787 after it is merged instead?

…e some part as requested (3) Add docs for distance column

Yunni · 2016-11-07T23:32:08Z

@bravo-zhang @srowen I am OK to use the example in #15787.
But I still think approxNearestNeighbor and approxSimilarityJoin are different algorithms and it would be easier for user to understand in separate examples.

…081-lsh-guide

MLnick · 2016-11-09T10:18:46Z

docs/ml-features.md

 </div>
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. 


Locality Sensitive Hashing(LSH) will show up twice here - I think you can just keep the link one

You could mention approximate nearest neighbour search as a common use case too?

MLnick · 2016-11-09T10:22:33Z

docs/ml-features.md

+## Random Projection for Euclidean Distance
+**Note:** Please note that this is different than the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
+
+[Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows:


Perhaps we can say something like "also referred to as p-stable distributions" or similar?

My understanding is 2-stable distribution is how we choose the random vector, not the name of metric space or the hash function name.

See: https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions

MLnick · 2016-11-09T10:43:39Z

docs/ml-features.md

+\]`
+where `v` is a normalized random unit vector and `r` is user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means higher probability for features to be in the same bucket.
+
+The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported.


This is a little unclear - should we say "RandomProjection accepts arbitrary vectors as input features, and supports both sparse and dense vectors".

MLnick · 2016-11-09T10:47:41Z

docs/ml-features.md

+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.


Perhaps we can clarify here - "The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, ..."

Do we require (check) in MinHash that the input vectors are binary? Or do we just treat any non-zero value as 1 effectively? Maybe mention it whichever it is.

MLnick · 2016-11-09T11:14:53Z

docs/ml-features.md

+
+Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero indices.


"... non-zero entry" perhaps

MLnick · 2016-11-09T11:28:14Z

docs/ml-features.md

+Approximate nearest neighbor and approximate similarity join use OR-amplification.
+
+## Approximate Similarity Join
+Approximate similarity join takes two datasets, and approximately returns row pairs which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.


Row pairs of what? Does it return all columns or just the vector columns?

I think we need to be specific about "distance between two input vectors is smaller".

Add some description in L1501

MLnick · 2016-11-09T11:29:21Z

docs/ml-features.md

+## Approximate Similarity Join
+Approximate similarity join takes two datasets, and approximately returns row pairs which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.
+
+Approximate similarity join allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.


I don't think it's totally clear what this means. Let's be more specific about the steps involved:

transform the input dataset(s) to create the hash signature in LSH.outputCol.

if an untransformed dataset is used as input, it will be transformed automatically

Because (1) is expensive, the transformed dataset can be cached if it will be re-used many times.

How about now?

MLnick · 2016-11-09T11:29:38Z

docs/ml-features.md

+
+Approximate similarity join allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.
+
+A distance column will be added in the output dataset of approximate similarity join to show the distance between each output row pairs.


Again, row pairs of what? Specify what is being compared here to be clear.

Added some description.

MLnick · 2016-11-09T11:30:49Z

docs/ml-features.md

+</div>
+
+## Approximate Nearest Neighbor Search
+Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return are defined by user.


Can simplify to "... returns a specified number of rows ..." and drop the last sentence.

Are we supporting arbitrary keys? I don't think so, so perhaps just call it "vector"?

MLnick · 2016-11-09T11:31:16Z

docs/ml-features.md

+## Approximate Nearest Neighbor Search
+Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return are defined by user.
+
+Approximate nearest neighbor search allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.


Same comment as above for similarity join applies here.

thunterdb · 2016-11-09T00:12:19Z

docs/ml-features.md

+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1


thunterdb · 2016-11-09T00:18:51Z

docs/ml-features.md

+
+The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties:


You should either link to the definition of a metric space, or explain what M and d are.

Actually, I would at least mention that d is a distance function. It is important in the context of LSH.

thunterdb · 2016-11-09T00:33:07Z

docs/ml-features.md

 </div>
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets. 


This is duplicating the first few words.

thunterdb · 2016-11-09T00:44:54Z

docs/ml-features.md

+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.


You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jacquard distance.

thunterdb · 2016-11-09T00:46:21Z

docs/ml-features.md

+</div>
+
+When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
+* AND-amplifications: Two input vectors are defined to be in the same bucket only if ALL of the hash values match. This will decrease the false positive rate but increase the false negative rate.


you need to add some extra spaces and new lines to make the list work. Try it out in a web-based markdown renderer if necessary

thunterdb · 2016-11-09T00:46:35Z

docs/ml-features.md

+</div>
+</div>
+
+When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.


rate -> rate

Sorry, I did not get it?

thunterdb · 2016-11-09T00:49:17Z

docs/ml-features.md

+</div>
+</div>
+
+When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.


I have not looked too much into the implementation of LSH, but this is a property of the queries, right? This should be moved into its own section along with some examples.

This section is removed since it will be fully implemented in https://issues.apache.org/jira/browse/SPARK-18450

jkbradley · 2016-11-10T00:45:06Z

Linking [https://issues.apache.org/jira/browse/SPARK-18392] since it will alter the public API for LSH

sethah · 2016-11-28T19:32:25Z

Is this still targeted for 2.1?

Yunni · 2016-11-28T20:36:18Z

@sethah I think so. I have made changes for the docs but I haven't made changes to the examples. Please take a look when you get a chance.

…8081-lsh-guide

MLnick

Made a pass - I think we can consolidate at least one of the examples.

Also think we need a little more detail in places.

MLnick · 2016-12-01T07:07:25Z

docs/ml-features.md

+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)


This Scaladoc link should be for BucketedRandomProjection now

MLnick · 2016-12-01T07:07:42Z

docs/ml-features.md

+
+<div data-lang="java" markdown="1">
+
+Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)


MLnick · 2016-12-01T07:11:30Z

docs/ml-features.md

+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)


Should be updated to MinHashLSH?

MLnick · 2016-12-01T07:17:06Z

docs/ml-features.md

+## Feature Transformation
+Feature Transformation is the base functionality to add hash results as a new column. Users can specify input column name and output column name by setting `inputCol` and `outputCol`. LSH in `spark.ml` also supports multiple LSH hash tables. Users can specify the number of hash tables by setting `numHashTables`.
+
+The output type of feature type is `Array[Vector]` where the dimension of the array equals `numHashTables`, and the dimensions of the vectors are currently set to 1.


The type of outputCol is ...

MLnick · 2016-12-01T07:23:46Z

examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala

+    val model = mh.fit(dfA)
+    model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+    // Cache the transformed columns


This mentions caching but doesn't cache.

MLnick · 2016-12-01T07:32:34Z

docs/ml-features.md

+</div>
+</div>
+
+## Feature Transformation


It would also be good to mention that the transformed dataset can be cached, since transform can be expensive. We can either mention it here, or perhaps mention it (twice) in the join and ANN sections below.

This doc is in L1509 and L1516

MLnick · 2016-12-01T07:33:05Z

docs/ml-features.md

+
+## Approximate Similarity Join
+Approximate similarity join takes two datasets, and approximately returns pairs of rows in the origin datasets which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.
+


Will self join produce duplicates? If so we should note that.

MLnick · 2016-12-01T07:34:03Z

docs/ml-features.md

+## Approximate Nearest Neighbor Search
+Approximate nearest neighbor search takes a dataset and a vector, and approximately returns a specified number of rows in the dataset that are closest to the vector.
+
+Approximate nearest neighbor accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.


backticks around "outputCol"

MLnick · 2016-12-01T07:34:08Z

docs/ml-features.md

+## Approximate Similarity Join
+Approximate similarity join takes two datasets, and approximately returns pairs of rows in the origin datasets which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.
+
+Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.


backticks around "outputCol"

MLnick · 2016-12-01T07:34:36Z

docs/ml-features.md

+
+Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
+
+In the joined dataset, the origin datasets can be queried in `datasetA` and `datasetB`. A distance column will be added in the output dataset of approximate similarity join to show the distance between each output pairs of rows in the origin datasets .


nit - space at end before .

jkbradley · 2016-12-01T21:42:40Z

Could you please add tags "[ML][DOCS]" to the PR title?

jkbradley · 2016-12-01T21:46:10Z

+1 for consolidating the examples. The boilerplate of creating a dataset and setting algorithm parameters takes up most of the example. I would create 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. The only reason not to would be if those demos required different datasets or algorithm settings, but I suspect they could be done in a unified manner.

MLnick · 2016-12-02T08:13:36Z

ok to test

SparkQA · 2016-12-02T08:42:03Z

Test build #69555 has finished for PR 15795 at commit 7e60b76.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…into SPARK-18081-lsh-guide

Yunni · 2016-12-02T23:18:51Z

@MLnick @jkbradley I have changed the examples to be 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. PTAL.

SparkQA · 2016-12-02T23:35:57Z

Test build #69592 has finished for PR 15795 at commit 19653d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-12-03T00:54:26Z

I can take a look

jkbradley · 2016-12-03T06:03:04Z

I found myself wanting to make a number of tiny comments, so I thought it'd be easier to send a PR. Could you please take a look at this one? Yunni#1
Thanks!

Minor updates to Yunni spark 18081 lsh guide

SparkQA · 2016-12-03T07:41:47Z

Test build #69612 has finished for PR 15795 at commit 7922117.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
* Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.

SparkQA · 2016-12-03T07:47:17Z

Test build #69613 has finished for PR 15795 at commit 7c09f9a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-12-04T00:57:42Z

LGTM
merging with master and branch-2.1
Thanks all!

…(LSH) ## What changes were proposed in this pull request? The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples. ## How was this patch tested? Doc has been generated through Jekyll, and checked through manual inspection. Author: Yunni <Euler57721@gmail.com> Author: Yun Ni <yunn@uber.com> Author: Joseph K. Bradley <joseph@databricks.com> Author: Yun Ni <Euler57721@gmail.com> Closes #15795 from Yunni/SPARK-18081-lsh-guide. (cherry picked from commit 3477718) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

…(LSH) ## What changes were proposed in this pull request? The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples. ## How was this patch tested? Doc has been generated through Jekyll, and checked through manual inspection. Author: Yunni <Euler57721@gmail.com> Author: Yun Ni <yunn@uber.com> Author: Joseph K. Bradley <joseph@databricks.com> Author: Yun Ni <Euler57721@gmail.com> Closes apache#15795 from Yunni/SPARK-18081-lsh-guide.

[SPARK-18081] Add user guide to LSH

8c7971b

srowen requested changes Nov 7, 2016

View reviewed changes

jiayue-zhang reviewed Nov 7, 2016

View reviewed changes

Code Review Comments: (1) Fix indention for Java Examples (2) Rephras…

84877ee

…e some part as requested (3) Add docs for distance column

Yun Ni added 2 commits November 8, 2016 11:06

Merge branch 'master' of https://github.com/Yunni/spark into SPARK-18…

19404c4

…081-lsh-guide

Improve documenatation about multi-probe nearest neighbor

6654f8b

MLnick suggested changes Nov 9, 2016

View reviewed changes

thunterdb reviewed Nov 9, 2016

View reviewed changes

jkbradley mentioned this pull request Nov 18, 2016

[Spark-18408][ML] API Improvements for LSH #15874

Closed

Changes in docs for the API Improvements.

40a0caa

Yunni added 2 commits November 28, 2016 20:41

Merge branch 'master' of https://github.com/apache/spark into SPARK-1…

a78d920

…8081-lsh-guide

Changes of the examples after API improvements

7e60b76

MLnick suggested changes Dec 1, 2016

View reviewed changes

Yunni changed the title ~~[SPARK-18081] Add user guide for Locality Sensitive Hashing(LSH)~~ [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) Dec 2, 2016

Yunni added 2 commits December 2, 2016 18:02

Changes of the examples after API improvements

b9f716d

Merge branch 'SPARK-18081-lsh-guide' of https://github.com/Yunni/spark …

19653d1

…into SPARK-18081-lsh-guide

minor updates to lsh docs

a048194

Yunni added 2 commits December 3, 2016 15:14

Merge pull request #1 from jkbradley/Yunni-SPARK-18081-lsh-guide

7922117

Minor updates to Yunni spark 18081 lsh guide

Capitalize first letter of each word in title

7c09f9a

asfgit closed this in 3477718 Dec 4, 2016

jiayue-zhang mentioned this pull request Dec 4, 2016

[SPARK-18286][ML] Add Scala/Java examples for MinHash and RandomProjection #15787

Closed


		The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported.

		The bucket length can be used to trade off the performance of random projection. A larger bucket lowers the false negative rate but usually increases running time and false positive rate.


		Input sets for MinHash is represented in vectors which dimension equals the total number of elements in the space. Each dimension of the vectors represents the status of an elements: zero value means the elements is not in the set; non-zero value means the set contains the corresponding elements. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.

		Note: Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero indices.


		Approximate similarity join allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.

		A distance column will be added in the output dataset of approximate similarity join to show the distance between each output row pairs.


		The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows:

		In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties:


		<div data-lang="java" markdown="1">

		Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)


		## Approximate Similarity Join
		Approximate similarity join takes two datasets, and approximately returns pairs of rows in the origin datasets which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.


		Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

		In the joined dataset, the origin datasets can be queried in `datasetA` and `datasetB`. A distance column will be added in the output dataset of approximate similarity join to show the distance between each output pairs of rows in the origin datasets .

[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) #15795

[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) #15795

Uh oh!

Conversation

Yunni commented Nov 7, 2016

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yunni Nov 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

srowen commented Nov 7, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jiayue-zhang commented Nov 7, 2016

Uh oh!

Yunni commented Nov 7, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Yunni Nov 7, 2016 •

edited

Loading