[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) #15795
Conversation
public class JavaRandomProjectionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
Fix the indentation in many of the Java examples -- 4 spaces for continuation.
Used 2 spaces as in other Java examples.
docs/ml-features.md
Outdated
</div>

# Locality Sensitive Hashing
[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a class of dimension reduction hash families, which can be used as both feature transformation and machine-learned ranking. Each distance metric has its own LSH family class in `spark.ml`, which can transform feature columns to hash values as new columns. Besides feature transformation, `spark.ml` also implements an approximate nearest neighbor algorithm and an approximate similarity join algorithm using LSH.
Despite the opening sentence of the wikipedia article, I wouldn't class LSH as a dimensionality reduction technique? It's a set of hashing techniques where the hash preserves some properties. Maybe it's just my taste. But the rest of the text talks about the output as hash values.
What does "machine-learned ranking" refer to here? as this isn't a ranking technique per se.
I think this is missing a broad summary statement that indicates why LSH is even of interest: it provides a hash function where hashed values are in some sense close when the input values are close according to some metric. And then the variations below plug in different definitions of "close" and "input".
Rephrased. PTAL
docs/ml-features.md
Outdated
The input features in Euclidean space are represented as vectors. Both sparse and dense vectors are supported.

The bucket length can be used to trade off the performance of random projection. A larger bucket length lowers the false negative rate but usually increases the running time and false positive rate.
Isn't the point here that near vectors end up in nearby buckets? I feel like this is skipping the intuition of why you would care about this technique or when you'd use it. Not like this needs a whole paragraph, just a few sentences of pointers?
Fixed. PTAL
docs/ml-features.md
Outdated
h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`
Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
Same sort of comment: isn't the intuition that MinHash approximates the Jaccard similarity without actually computing it completely?
This may again reduce to my different understanding of the technique, but MinHash isn't a hashing technique per se; it relies on a given family of hash functions to approximate set similarity. I'm finding it a little hard to view it as an 'LSH family'.
Not sure how MinHash approximates the Jaccard similarity? It is true that `Pr(min(h(A)) = min(h(B)))` is equal to the Jaccard similarity when `h` is picked from a universal hash family. But I think we are not computing `Pr(min(h(A)) = min(h(B)))` in MinHash; we only use this property to construct an LSH function.
You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jaccard distance.
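For reference, the property under discussion, stated in the notation of the quoted doc (a reader's aside, not part of the patch): in the idealized setting where `g` is a random permutation of the element space,

`\[
Pr\left(\min_{a \in \mathbf{A}}(g(a)) = \min_{b \in \mathbf{B}}(g(b))\right) = \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
\]`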
docs/ml-features.md
Outdated
</div>
</div>

## MinHash for Jaccard Distance
Similarity, not distance, right? It's higher when they overlap more.
Jaccard Distance is just `1 - JaccardSimilarity` (https://en.wikipedia.org/wiki/Jaccard_index). There are 2 reasons we use Jaccard Distance instead of similarity:
(1) It's cleaner to map each distance metric to its LSH family by the definition of LSH.
(2) In `approxNearestNeighbor` and `approxSimilarityJoin`, the returned dataset has a `distCol` showing the distance values.
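Spelled out in the doc's formula style (a reader's aside), the distance that ends up in `distCol` for MinHash is

`\[
d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
\]`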
docs/ml-features.md
Outdated
In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but they are far away in distance, and a false negative if the two features are close in distance but they are not hashed into the same bucket.

## Random Projection for Euclidean Distance
**Note:** This is different from the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
Just an aside, but this random projection + cosine distance technique is the main thing I think of when I think of "LSH". Is that not implemented?
No, that is tracked in https://issues.apache.org/jira/browse/SPARK-18082.
Also this seems to overlap with #15787
public static void main(String[] args) {
  SparkSession spark = SparkSession
    .builder()
    .appName("JavaMinHashExample")
JavaRandomProjectionExample
Done
// Creates a SparkSession
val spark = SparkSession
  .builder
  .appName("MinHashExample")
RandomProjectionExample
Done
Examples are usually class/algorithm based, not interface/use case based. Maybe we can summarize the 5 classes into 2? Do you mind modifying the examples in #15787 after it is merged instead?
…e some part as requested (3) Add docs for distance column
@bravo-zhang @srowen I am OK to use the example in #15787.
docs/ml-features.md
Outdated
</div>

# Locality Sensitive Hashing
[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets.
`Locality Sensitive Hashing(LSH)` will show up twice here - I think you can just keep the linked one.
You could mention approximate nearest neighbour search as a common use case too?
Done.
docs/ml-features.md
Outdated
## Random Projection for Euclidean Distance
**Note:** This is different from the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).

[Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows:
Perhaps we can say something like "also referred to as p-stable distributions" or similar?
My understanding is that the 2-stable distribution is how we choose the random vector, not the name of the metric space or the hash function.
See: https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions
docs/ml-features.md
Outdated
\]`
where `v` is a normalized random unit vector and `r` is the user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means a higher probability for features to be in the same bucket.

The input features in Euclidean space are represented as vectors. Both sparse and dense vectors are supported.
This is a little unclear - should we say "RandomProjection accepts arbitrary vectors as input features, and supports both sparse and dense vectors".
Done.
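As a concrete illustration of the transformation being documented here, a minimal sketch using the renamed `BucketedRandomProjectionLSH` class mentioned later in this review; it assumes a live `SparkSession` named `spark`, and the data and parameter values are made up:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

// Toy dataset of dense 2-d feature vectors (illustrative values).
val dfA = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0))
)).toDF("id", "features")

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)   // the `r` in the hash function above
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

// Fitting draws the random unit vectors; transform appends each row's
// hash bucket ids as a new column.
val model = brp.fit(dfA)
model.transform(dfA).show(false)
```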
docs/ml-features.md
Outdated
h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`

Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
Perhaps we can clarify here - "The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, ..."
Do we require (check) in `MinHash` that the input vectors are binary? Or do we just treat any non-zero value as 1, effectively? Maybe mention it, whichever it is.
Done.
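To make the set representation concrete, a minimal sketch (assuming a `SparkSession` named `spark`; the `MinHashLSH` class name follows the rename discussed later in this review):

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors

// A set over a 10-element space containing elements 2, 3 and 5:
// indices mark membership; any non-zero value is read as "present".
val df = spark.createDataFrame(Seq(
  (0, Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0))))
)).toDF("id", "features")

val mh = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

// Each hash table contributes one min-hash value per set.
mh.fit(df).transform(df).show(false)
```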
docs/ml-features.md
Outdated
Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.

**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero indices.
"... non-zero entry" perhaps
Done.
docs/ml-features.md
Outdated
Approximate nearest neighbor and approximate similarity join use OR-amplification.

## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns row pairs whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.
Row pairs of what? Does it return all columns or just the vector columns?
I think we need to be specific about "distance between two input vectors is smaller".
Added some description in L1501.
docs/ml-features.md
Outdated
## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns row pairs whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.

Approximate similarity join allows users to cache the transformed columns when necessary: if the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.
I don't think it's totally clear what this means. Let's be more specific about the steps involved:
1. transform the input dataset(s) to create the hash signature in `LSH.outputCol`;
2. if an untransformed dataset is used as input, it will be transformed automatically.

Because (1) is expensive, the transformed dataset can be cached if it will be re-used many times.
How about now?
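A sketch of what the join returns, assuming a fitted `MinHashLSHModel` named `model` and datasets `dfA`, `dfB` that carry a `features` column (all names illustrative):

```scala
// Approximately all pairs with Jaccard distance below 0.6. Each output
// row holds the original rows as struct columns datasetA and datasetB,
// plus a distCol column with the actual distance.
val joined = model.approxSimilarityJoin(dfA, dfB, 0.6)
joined.select("datasetA.id", "datasetB.id", "distCol").show()
```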
docs/ml-features.md
Outdated
Approximate similarity join allows users to cache the transformed columns when necessary: if the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.

A distance column will be added in the output dataset of approximate similarity join to show the distance between each output pair of rows.
Again, row pairs of what? Specify what is being compared here to be clear.
Added some description.
docs/ml-features.md
Outdated
</div>

## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return is defined by the user.
Can simplify to "... returns a specified number of rows ..." and drop the last sentence.
Are we supporting arbitrary keys? I don't think so, so perhaps just call it "vector"?
Done.
docs/ml-features.md
Outdated
## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return is defined by the user.

Approximate nearest neighbor search allows users to cache the transformed columns when necessary: if the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.
Same comment as above for similarity join applies here.
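For completeness, a sketch of the search itself, assuming a fitted `BucketedRandomProjectionLSHModel` named `model` and a dataset `dfA` with a `features` column (names illustrative); pre-transforming and caching follows the same reasoning as for the join:

```scala
import org.apache.spark.ml.linalg.Vectors

// Hash once and cache if many queries will hit the same dataset.
val hashedA = model.transform(dfA).cache()

// Approximately the 2 rows of dfA closest to the key in Euclidean
// distance; the result includes a distCol with the true distances.
val key = Vectors.dense(1.0, 0.0)
model.approxNearestNeighbors(hashedA, key, 2).show()
```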
docs/ml-features.md
Outdated
`\[
\forall p, q \in M,\\
d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
p2
Done.
docs/ml-features.md
Outdated
The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of an LSH family is as follows:

In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties:
You should either link to the definition of a metric space, or explain what M and d are.
Actually, I would at least mention that `d` is a distance function. It is important in the context of LSH.
Done.
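Putting the corrections from these threads together (the `p2` fix from above, with `d` understood as the distance function of the metric space), the definition reads:

`\[
\forall p, q \in M,\\
d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
\]`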
docs/ml-features.md
Outdated
</div>

# Locality Sensitive Hashing
[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets.
This is duplicating the first few words.
Done.
docs/ml-features.md
Outdated
h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`

Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jaccard distance.
docs/ml-features.md
Outdated
</div>

When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.

* AND-amplification: Two input vectors are defined to be in the same bucket only if ALL of the hash values match. This will decrease the false positive rate but increase the false negative rate.
you need to add some extra spaces and new lines to make the list work. Try it out in a web-based markdown renderer if necessary
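As a reader's aside, the trade-off the AND/OR list describes can be stated compactly: if a single hash function collides for a given pair with probability `p`, then with `k` independent hash functions,

`\[
Pr_{AND} = p^k, \quad Pr_{OR} = 1 - (1 - p)^k
\]`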
docs/ml-features.md
Outdated
</div>
</div>

When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
rate -> rate
Sorry, I did not get it?
docs/ml-features.md
Outdated
</div>
</div>

When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
I have not looked too much into the implementation of LSH, but this is a property of the queries, right? This should be moved into its own section along with some examples.
This section is removed since it will be fully implemented in https://issues.apache.org/jira/browse/SPARK-18450
Linking [https://issues.apache.org/jira/browse/SPARK-18392] since it will alter the public API for LSH
Is this still targeted for 2.1?
@sethah I think so. I have made changes for the docs but I haven't made changes to the examples. Please take a look when you get a chance.
Made a pass - I think we can consolidate at least one of the examples.
Also think we need a little more detail in places.
docs/ml-features.md
Outdated
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
This Scaladoc link should be for `BucketedRandomProjection` now
Done.
docs/ml-features.md
Outdated
<div data-lang="java" markdown="1">

Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
Same here
Done.
docs/ml-features.md
Outdated
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
Should be updated to `MinHashLSH`?
Done.
docs/ml-features.md
Outdated
## Feature Transformation
Feature transformation is the base functionality to add hash results as a new column. Users can specify the input and output column names by setting `inputCol` and `outputCol`. LSH in `spark.ml` also supports multiple LSH hash tables. Users can specify the number of hash tables by setting `numHashTables`.

The output type of feature type is `Array[Vector]` where the dimension of the array equals `numHashTables`, and the dimensions of the vectors are currently set to 1.
The type of `outputCol` is ...
Done.
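Continuing the `BucketedRandomProjectionLSH` sketch from earlier in this review (names illustrative), the output shape being described looks like this:

```scala
// With setNumHashTables(3), each row's "hashes" column holds an array
// of three one-dimensional vectors, one hash value per hash table.
val hashed = model.transform(dfA)
hashed.printSchema()   // schema shows hashes as an array of vectors
hashed.select("id", "hashes").show(false)
```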
val model = mh.fit(dfA)
model.approxSimilarityJoin(dfA, dfB, 0.6).show()

// Cache the transformed columns
This mentions caching but doesn't cache.
Fixed.
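The fix presumably makes the caching in the quoted example explicit; a sketch under that assumption, reusing `model`, `dfA` and `dfB` from the snippet above:

```scala
// Cache the transformed columns so repeated queries reuse the hashes
// instead of recomputing them.
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()
model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
```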
docs/ml-features.md
Outdated
</div>
</div>

## Feature Transformation
It would also be good to mention that the transformed dataset can be cached, since `transform` can be expensive. We can either mention it here, or perhaps mention it (twice) in the join and ANN sections below.
This doc is in L1509 and L1516
docs/ml-features.md
Outdated
## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns pairs of rows in the original datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.
Will self join produce duplicates? If so we should note that.
Done.
docs/ml-features.md
Outdated
## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a vector, and approximately returns a specified number of rows in the dataset that are closest to the vector.

Approximate nearest neighbor accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
backticks around "outputCol"
Done.
docs/ml-features.md
Outdated
## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns pairs of rows in the original datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.

Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
backticks around "outputCol"
Done.
docs/ml-features.md
Outdated
Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

In the joined dataset, the original datasets can be queried in `datasetA` and `datasetB`. A distance column will be added in the output dataset of approximate similarity join to show the distance between each output pair of rows in the original datasets .
nit - space at end before .
Done.
Could you please add tags "[ML][DOCS]" to the PR title?
+1 for consolidating the examples. The boilerplate of creating a dataset and setting algorithm parameters takes up most of the example. I would create 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. The only reason not to would be if those demos required different datasets or algorithm settings, but I suspect they could be done in a unified manner.
ok to test |
Test build #69555 has finished for PR 15795 at commit
…into SPARK-18081-lsh-guide
@MLnick @jkbradley I have changed the examples to be 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. PTAL.
Test build #69592 has finished for PR 15795 at commit
I can take a look
I found myself wanting to make a number of tiny comments, so I thought it'd be easier to send a PR. Could you please take a look at this one? Yunni#1
Minor updates to Yunni spark 18081 lsh guide
Test build #69612 has finished for PR 15795 at commit
Test build #69613 has finished for PR 15795 at commit
LGTM
…(LSH)

## What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

## How was this patch tested?
Doc has been generated through Jekyll, and checked through manual inspection.

Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Yun Ni <Euler57721@gmail.com>

Closes #15795 from Yunni/SPARK-18081-lsh-guide.

(cherry picked from commit 3477718)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.
How was this patch tested?
Doc has been generated through Jekyll, and checked through manual inspection.