Conversation

Yunni
Contributor

@Yunni Yunni commented Nov 7, 2016

What changes were proposed in this pull request?

The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

How was this patch tested?

Doc has been generated through Jekyll, and checked through manual inspection.

public class JavaRandomProjectionExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
Member

Fix the indentation in many of the Java examples -- 4 space for continuation.

Contributor Author

Used 2 space as in other java examples.


# Locality Sensitive Hashing
[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is a class of dimension reduction hash families, which can be used as both feature transformation and machine-learned ranking. Each distance metric has its own LSH family class in `spark.ml`, which can transform feature columns to hash values as new columns. Besides feature transformation, `spark.ml` also implements approximate nearest neighbor and approximate similarity join algorithms using LSH.
Member

Despite the opening sentence of the wikipedia article, I wouldn't class LSH as a dimensionality reduction technique? It's a set of hashing techniques where the hash preserves some properties. Maybe it's just my taste. But the rest of the text talks about the output as hash values.

What does "machine-learned ranking" refer to here? as this isn't a ranking technique per se.

I think this is missing a broad summary statement that indicates why LSH is even of interest: it provides a hash function where hashed values are in some sense close when the input values are close according to some metric. And then the variations below plug in different definitions of "close" and "input".

Contributor Author

Rephrased. PTAL


The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported.

The bucket length can be used to trade off the performance of random projection. A larger bucket lowers the false negative rate but usually increases running time and false positive rate.
Member

Isn't the point here that near vectors end up in nearby buckets? I feel like this is skipping the intuition of why you would care about this technique or when you'd use it. Not like this needs a whole paragraph, just a few sentences of pointers?

Contributor Author

Fixed. PTAL

h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`

Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
Member

Same sort of comment: isn't the intuition that MinHash approximates the Jaccard similarity without actually computing it completely?

This may again reduce to my different understanding of the technique, but MinHash isn't a hashing technique per se. it relies on a given family of hash functions to approximate set similarity. I'm finding it a little hard to view it as an 'LSH family'

Contributor Author

Not sure how MinHash approximates the Jaccard similarity? It is true that Pr(min(h(A)) = min(h(B))) is equal to the Jaccard similarity when h is picked from a universal hash family. But I think we are not computing Pr(min(h(A)) = min(h(B))) in MinHash; we only use this property to construct an LSH function.

Contributor

You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jaccard distance.
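
For reference, the property under discussion can be stated as follows; this is the standard MinHash result rather than text from the patch. When `g` is drawn uniformly at random from a suitable hash family,

`\[
Pr\Big(\min_{a \in \mathbf{A}} g(a) = \min_{b \in \mathbf{B}} g(b)\Big) = \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|} = 1 - d_{Jaccard}(\mathbf{A}, \mathbf{B})
\]`

which is why repeating the hash and checking for collisions approximates the Jaccard distance.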


## MinHash for Jaccard Distance
Member

Similarity, not distance, right? it's higher when they overlap more.

Contributor Author

Jaccard distance is just 1 - Jaccard similarity (https://en.wikipedia.org/wiki/Jaccard_index)

There are 2 reasons we use Jaccard Distance instead of similarity:
(1) It's cleaner to map each distance metric to their LSH family by the definition of LSH.
(2) In approxNearestNeighbor and approxSimilarityJoin, the returned dataset has a distCol showing the distance values.

In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but they are far apart in distance, and we define a false negative as a pair of features that are close in distance but are not hashed into the same bucket.

## Random Projection for Euclidean Distance
**Note:** This is different from the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).
Member

Just an aside, but this random projection + cosine distance technique is the main thing I think of when I think of "LSH". Is that not implemented?

Contributor Author

@Yunni Yunni Nov 7, 2016

@srowen
Member

srowen commented Nov 7, 2016

Also this seems to overlap with #15787

public static void main(String[] args) {
  SparkSession spark = SparkSession
    .builder()
    .appName("JavaMinHashExample")
Contributor

JavaRandomProjectionExample

Contributor Author

Done

// Creates a SparkSession
val spark = SparkSession
  .builder
  .appName("MinHashExample")
Contributor

RandomProjectionExample

Contributor Author

Done

@jiayue-zhang
Contributor

Examples are usually class/algorithm based, not interface/use case based. Maybe we can summarize the 5 classes into 2? Do you mind modifying the examples in #15787 after it is merged instead?

…e some part as requested (3) Add docs for distance column
@Yunni
Contributor Author

Yunni commented Nov 7, 2016

@bravo-zhang @srowen I am OK to use the example in #15787.
But I still think approxNearestNeighbor and approxSimilarityJoin are different algorithms, and it would be easier for users to understand them in separate examples.


# Locality Sensitive Hashing
[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets.
Contributor

Locality Sensitive Hashing(LSH) will show up twice here - I think you can just keep the link one

Contributor

You could mention approximate nearest neighbour search as a common use case too?

Contributor Author

Done.

## Random Projection for Euclidean Distance
**Note:** This is different from the [Random Projection for cosine distance](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection).

[Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows:
Contributor

Perhaps we can say something like "also referred to as p-stable distributions" or similar?

Contributor Author

My understanding is that the 2-stable distribution is how we choose the random vector, not the name of the metric space or the hash function.

See: https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions

\]`
where `v` is a normalized random unit vector and `r` is the user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means a higher probability for features to be in the same bucket.

The input features in Euclidean space are represented in vectors. Both sparse and dense vectors are supported.
Contributor

This is a little unclear - should we say "RandomProjection accepts arbitrary vectors as input features, and supports both sparse and dense vectors".

Contributor Author

Done.
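
To make the bucket-length trade-off concrete, here is a minimal, hedged Scala sketch of the feature transformation described above. It assumes the renamed `BucketedRandomProjectionLSH` estimator discussed later in this review (the class in this patch is still called `RandomProjection`), and the dataset is purely illustrative:

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("BucketedRandomProjectionSketch").getOrCreate()
import spark.implicits._

// Illustrative dense feature vectors; sparse vectors are supported as well.
val dfA = Seq(
  (0, Vectors.dense(1.0, 1.0)),
  (1, Vectors.dense(1.0, -1.0)),
  (2, Vectors.dense(-1.0, -1.0))
).toDF("id", "features")

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)     // larger r: fewer false negatives, more false positives
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

// fit() draws the random unit vectors v; transform() appends the hash column.
val model = brp.fit(dfA)
model.transform(dfA).show(truncate = false)
```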

h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`

Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
Contributor

Perhaps we can clarify here - "The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, typically sparse vectors are recommended for efficiency. For example, ..."

Contributor

Do we require (check) in MinHash that the input vectors are binary? Or do we just treat any non-zero value as 1 effectively? Maybe mention it whichever it is.

Contributor Author

Done.


Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.

**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero indices.
Contributor

"... non-zero entry" perhaps

Contributor Author

Done.
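
As an aside on the vector notation in the quoted example, a hedged sketch of how such an input set could actually be built in Scala (the `Array[(2, 1.0), ...]` form in the doc corresponds, in practice, to the `Seq`-of-pairs overload of `Vectors.sparse`):

```scala
import org.apache.spark.ml.linalg.Vectors

// The set {2, 3, 5} in a space of 10 possible elements, encoded as a
// sparse binary vector: a non-zero value marks the element as present.
val setA = Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))

// Equivalent form with explicit index and value arrays.
val setB = Vectors.sparse(10, Array(2, 3, 5), Array(1.0, 1.0, 1.0))
```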

Approximate nearest neighbor and approximate similarity join use OR-amplification.

## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns row pairs which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.
Contributor

Row pairs of what? Does it return all columns or just the vector columns?

I think we need to be specific about "distance between two input vectors is smaller".

Contributor Author

Add some description in L1501

## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns row pairs which distance is smaller than a user-defined threshold. Approximate Similarity Join supports both joining two different datasets and self joining.

Approximate similarity join allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.
Contributor

I don't think it's totally clear what this means. Let's be more specific about the steps involved:

  1. transform the input dataset(s) to create the hash signature in LSH.outputCol.
  2. if an untransformed dataset is used as input, it will be transformed automatically

Because (1) is expensive, the transformed dataset can be cached if it will be re-used many times.

Contributor Author

How about now?


Approximate similarity join allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.

A distance column will be added in the output dataset of approximate similarity join to show the distance between each output row pairs.
Contributor

Again, row pairs of what? Specify what is being compared here to be clear.

Contributor Author

Added some description.


## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return are defined by user.
Contributor

Can simplify to "... returns a specified number of rows ..." and drop the last sentence.

Are we supporting arbitrary keys? I don't think so, so perhaps just call it "vector"?

Contributor Author

Done.

## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a key, and approximately returns a number of rows in the dataset that are closest to the key. The number of rows to return are defined by user.

Approximate nearest neighbor search allows users to cache the transformed columns when necessary: If the `outputCol` is missing, the method will transform the data; if the `outputCol` exists, it will use the `outputCol` directly.
Contributor

Same comment as above for similarity join applies here.

`\[
\forall p, q \in M,\\
d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p1
Contributor

p2

Contributor Author

Done.


The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows:

In a metric space `(M, d)`, an LSH family is a family of functions `h` that satisfy the following properties:
Contributor

You should either link to the definition of a metric space, or explain what M and d are.

Contributor

Actually, I would at least mention that d is a distance function. It is important in the context of LSH.

Contributor Author

Done.


# Locality Sensitive Hashing
[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) Locality Sensitive Hashing(LSH) is an important class of hashing techniques, which is commonly used in clustering and outlier detection with large datasets.
Contributor

This is duplicating the first few words.

Contributor Author

Done.

h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
\]`

Input sets for MinHash are represented as vectors whose dimension equals the total number of elements in the space. Each dimension of the vector represents the status of an element: a zero value means the element is not in the set; a non-zero value means the set contains the corresponding element. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space. This set contains elem 2, elem 3 and elem 5.
Contributor

You should mention this property, because it is not very intuitive and forms the basis of using MinHash to approximate Jaccard distance.


When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
* AND-amplifications: Two input vectors are defined to be in the same bucket only if ALL of the hash values match. This will decrease the false positive rate but increase the false negative rate.
Contributor

you need to add some extra spaces and new lines to make the list work. Try it out in a web-based markdown renderer if necessary


When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
Contributor

rate -> rate

Contributor Author

Sorry, I did not get it?


When multiple hash functions are picked, it's very useful for users to apply [amplification](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification) to trade off between false positive and false negative rate.
Contributor

I have not looked too much into the implementation of LSH, but this is a property of the queries, right? This should be moved into its own section along with some examples.

Contributor Author

This section is removed since it will be fully implemented in https://issues.apache.org/jira/browse/SPARK-18450

@jkbradley
Member

Linking [https://issues.apache.org/jira/browse/SPARK-18392] since it will alter the public API for LSH

@sethah
Contributor

sethah commented Nov 28, 2016

Is this still targeted for 2.1?

@Yunni
Contributor Author

Yunni commented Nov 28, 2016

@sethah I think so. I have made changes for the docs but I haven't made changes to the examples. Please take a look when you get a chance.

Contributor

@MLnick MLnick left a comment

Made a pass - I think we can consolidate at least one of the examples.

Also think we need a little more detail in places.

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
Contributor

This Scaladoc link should be for BucketedRandomProjection now

Contributor Author

Done.


<div data-lang="java" markdown="1">

Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
Contributor

Same here

Contributor Author

Done.

<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
Contributor

Should be updated to MinHashLSH?

Contributor Author

Done.

## Feature Transformation
Feature Transformation is the base functionality to add hash results as a new column. Users can specify input column name and output column name by setting `inputCol` and `outputCol`. LSH in `spark.ml` also supports multiple LSH hash tables. Users can specify the number of hash tables by setting `numHashTables`.

The output type of feature type is `Array[Vector]` where the dimension of the array equals `numHashTables`, and the dimensions of the vectors are currently set to 1.
Contributor

The type of outputCol is ...

Contributor Author

Done.
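
A small hedged sketch of what that output looks like in practice, here using `MinHashLSH` with illustrative data; with `numHashTables = 3` the output column holds an array of three one-element vectors:

```scala
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
// Assumes an active SparkSession named `spark`.
import spark.implicits._

val df = Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0))))
).toDF("id", "features")

val mh = new MinHashLSH()
  .setNumHashTables(3)      // one entry per hash table in the output array
  .setInputCol("features")
  .setOutputCol("hashes")

// The "hashes" column has type Array[Vector]; each vector currently has one element.
mh.fit(df).transform(df).show(truncate = false)
```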

val model = mh.fit(dfA)
model.approxSimilarityJoin(dfA, dfB, 0.6).show()

// Cache the transformed columns
Contributor

This mentions caching but doesn't cache.

Contributor Author

Fixed.
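
For readers following along, a hedged sketch of what that fix might look like; `mh`, `dfA`, and `dfB` are assumed to be defined as in the surrounding example:

```scala
val model = mh.fit(dfA)

// Transform once and cache, since hashing can be expensive when reused.
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()

// Because the inputs already carry the hash column (outputCol), the join
// reuses it instead of re-hashing the feature vectors.
model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
```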


## Feature Transformation
Contributor

It would also be good to mention that the transformed dataset can be cached, since transform can be expensive. We can either mention it here, or perhaps mention it (twice) in the join and ANN sections below.

Contributor Author

This doc is in L1509 and L1516


## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns pairs of rows in the original datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.

Contributor

Will self join produce duplicates? If so we should note that.

Contributor Author

Done.

## Approximate Nearest Neighbor Search
Approximate nearest neighbor search takes a dataset and a vector, and approximately returns a specified number of rows in the dataset that are closest to the vector.

Approximate nearest neighbor accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
Contributor

backticks around "outputCol"

Contributor Author

Done.
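
A minimal hedged sketch of the call described above; `model` is assumed to be a fitted LSH model (for example the `BucketedRandomProjectionLSH` model sketched earlier) and `dfA` a dataset with a `features` column:

```scala
import org.apache.spark.ml.linalg.Vectors

// Illustrative query vector in the same space as the dataset's features.
val key = Vectors.dense(1.0, 0.0)

// Approximately returns the 2 rows of dfA closest to `key`; if dfA has not been
// transformed, it is transformed automatically and a distance column is added.
model.approxNearestNeighbors(dfA, key, 2).show()
```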

## Approximate Similarity Join
Approximate similarity join takes two datasets, and approximately returns pairs of rows in the original datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining.

Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.
Contributor

backticks around "outputCol"

Contributor Author

Done.


Approximate similarity join accepts both transformed and untransformed datasets as input. If an untransformed dataset is used, it will be transformed automatically. In this case, the hash signature will be created as outputCol.

In the joined dataset, the origin datasets can be queried in `datasetA` and `datasetB`. A distance column will be added in the output dataset of approximate similarity join to show the distance between each output pairs of rows in the origin datasets .
Contributor

nit - space at end before .

Contributor Author

Done.
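
To illustrate the previous point, a hedged sketch of reading the joined output; the nested `datasetA`/`datasetB` columns and the default `distCol` name are as described above, while the `id` column comes from the illustrative data in the earlier sketches:

```scala
import org.apache.spark.sql.functions.col

val joined = model.approxSimilarityJoin(dfA, dfB, 0.6)

// Each output row pairs one row from dfA with one row from dfB,
// nested under `datasetA` and `datasetB`, plus the distance between them.
joined.select(
  col("datasetA.id").alias("idA"),
  col("datasetB.id").alias("idB"),
  col("distCol")
).show()
```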

@jkbradley
Member

Could you please add tags "[ML][DOCS]" to the PR title?

@jkbradley
Member

+1 for consolidating the examples. The boilerplate of creating a dataset and setting algorithm parameters takes up most of the example. I would create 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. The only reason not to would be if those demos required different datasets or algorithm settings, but I suspect they could be done in a unified manner.

@MLnick
Contributor

MLnick commented Dec 2, 2016

ok to test

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69555 has finished for PR 15795 at commit 7e60b76.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Yunni Yunni changed the title [SPARK-18081] Add user guide for Locality Sensitive Hashing(LSH) [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) Dec 2, 2016
@Yunni
Contributor Author

Yunni commented Dec 2, 2016

@MLnick @jkbradley I have changed the examples to be 1 example per algorithm which does transform, approxNearestNeighbor, and approxSimilarityJoin. PTAL.

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69592 has finished for PR 15795 at commit 19653d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

I can take a look

@jkbradley
Member

jkbradley commented Dec 3, 2016

I found myself wanting to make a number of tiny comments, so I thought it'd be easier to send a PR. Could you please take a look at this one? Yunni#1
Thanks!

@SparkQA

SparkQA commented Dec 3, 2016

Test build #69612 has finished for PR 15795 at commit 7922117.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • * Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
  • [Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.

@SparkQA

SparkQA commented Dec 3, 2016

Test build #69613 has finished for PR 15795 at commit 7c09f9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

LGTM
merging with master and branch-2.1
Thanks all!

@asfgit asfgit closed this in 3477718 Dec 4, 2016
asfgit pushed a commit that referenced this pull request Dec 4, 2016
…(LSH)

## What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

## How was this patch tested?
Doc has been generated through Jekyll, and checked through manual inspection.

Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Yun Ni <Euler57721@gmail.com>

Closes #15795 from Yunni/SPARK-18081-lsh-guide.

(cherry picked from commit 3477718)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017