Add documentation for min_hash filter #39671

mayya-sharipova
Contributor

Closes #20757

@mayya-sharipova mayya-sharipova added the >docs, :Search Relevance/Analysis, v6.7.0, v7.2.0, and v8.0.0 labels Mar 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@cbuescher cbuescher self-assigned this Mar 5, 2019
Member

@cbuescher cbuescher left a comment

Great addition to the docs, I left some very minor comments.
One general thought I was having: I understand why it makes sense to start with a sort of "overview" and theory, but since our docs also work as a kind of reference guide, maybe we should aim for a very brief summary (like the existing one, maybe extended slightly) followed by the table of parameters, then add the more detailed "theory" and usage sections afterwards.
Also I was wondering if it would make sense to add a small example of how to actually use such a min-hashed field in a query, e.g. for near-duplicate detection, or if this would go beyond the scope of our documentation.

internally each shingle is hashed into to 128-bit hash, you should choose
`k` small enough so that all possible
different k-words shingles can be hashed to 128-bit hash with
minimal collision. 5-word shingles typically work well.
Member

Just for my own education, do we have any blogs or knowledge articles around this? Or is this advice taken from the Wikipedia article or other sources?

Contributor Author

@cbuescher I took the advice on 5-word shingles from the MinHash filter source code in Lucene.

Member

That's interesting, would you mind linking to that source?

Contributor Author

@cbuescher Thanks for the suggestion. I opted not to include the link to this source, as I am afraid the link will become invalid as the source code changes.

Contributor

In the original PR that adds min_hash, it looks like we were not sure about the 5 word suggestion, and instead encouraged 2 word shingles: #20206 (comment). It would be nice if there was a reference or set of experiments to help confirm a good default value... I didn't manage to find one in a quick search, but will keep a lookout. The right choice seems like it would depend on the use case as well (for example similarity search vs. duplicate detection).

Contributor Author

@jtibshirani Thanks a lot for the review. I think the best option for now is to remove the line "5-word shingles typically work well." completely, as there are conflicting suggestions about which shingle size works best. Once we have better sources (external or from our own experiments), we can add shingle size suggestions to the file. Is this fine with you?

Contributor

This sounds like a good plan to me!
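
To illustrate why the right shingle size may depend on the use case, here is a minimal sketch that computes the exact Jaccard similarity of word shingles for two sentences differing in a single word. It does not use the `min_hash` filter itself; the helper names `shingles` and `jaccard` and the example sentences are made up for illustration.

```python
def shingles(text, k):
    """Return the set of k-word shingles (word n-grams) found in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two hypothetical documents that differ in exactly one word.
doc_a = "the quick brown fox jumps over the lazy dog near the river"
doc_b = "the quick brown fox leaps over the lazy dog near the river"

for k in (2, 5):
    similarity = jaccard(shingles(doc_a, k), shingles(doc_b, k))
    print(f"{k}-word shingles: Jaccard = {similarity:.3f}")
```

In this toy case a single changed word touches more of the 5-word shingles than of the 2-word shingles, so the longer shingles report a lower similarity. That behaviour may suit strict duplicate detection better than loose similarity search, but it is not a substitute for experiments on real data.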

@mayya-sharipova
Contributor Author

@cbuescher Thanks a lot for the review. I have addressed your comments in the 2nd commit.

Also I was wondering if it would make sense to add a small example of how to actually use such a min-hashed field in a query, e.g. for near-duplicate detection, or if this would go beyond the scope of our documentation.

I very much wanted to add this example, but I opted not to, because I am not sure how best to construct the query. A general idea is to partition the resulting hashed tokens into bands; tokens within a single band should be joined by AND, and the bands should be joined with each other by OR. I have asked the author of the MinHash filter how he thinks this query should be built. When he replies, we can update the documentation with this query information as well.
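
The banding idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a confirmed or recommended query: the function name `banded_minhash_query`, the field name, and the band size are hypothetical, the min-hashed tokens are assumed to be already available as a list in a stable order, and `term` queries inside nested `bool` clauses are just one plausible way to express AND within a band and OR across bands.

```python
def banded_minhash_query(field, tokens, band_size):
    """Build a bool query body: AND inside each band, OR across bands.

    field     -- name of the min-hashed field (hypothetical, supplied by the caller)
    tokens    -- min-hashed tokens of the query document, in a stable order (assumed)
    band_size -- number of tokens per band (assumed tuning knob)
    """
    if band_size < 1:
        raise ValueError("band_size must be at least 1")

    # Partition the token list into consecutive bands of band_size tokens.
    bands = [tokens[i:i + band_size] for i in range(0, len(tokens), band_size)]

    return {
        "query": {
            "bool": {
                # OR across bands: at least one band has to match completely.
                "should": [
                    # AND within a band: every token of the band has to match.
                    {"bool": {"must": [{"term": {field: token}} for token in band]}}
                    for band in bands
                ],
                "minimum_should_match": 1,
            }
        }
    }


# Usage with placeholder tokens and a made-up field name.
body = banded_minhash_query("text.minhash", ["t1", "t2", "t3", "t4", "t5"], band_size=2)
```

Larger bands make each AND clause harder to satisfy, while more bands give a document more chances to match. How these should be balanced for a given `min_hash` configuration is exactly the open question in this thread.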

@mayya-sharipova mayya-sharipova force-pushed the documentation_minhash_token_filter branch from 3dc2671 to a273050 Compare March 5, 2019 19:31
Member

@cbuescher cbuescher left a comment

@mayya-sharipova Thanks for addressing my previous comments. I left one more suggestion, but nothing that requires another review. Feel free to address it or not.

When he replies, we can update the documentation with this query information as well

Fine by me, this PR already is a great addition. Maybe an extended example would also be better suited for a blog post or something like it. I'd be really interested in real-life usages of this.

@mayya-sharipova mayya-sharipova merged commit 5b852fa into elastic:master Mar 7, 2019
@mayya-sharipova mayya-sharipova deleted the documentation_minhash_token_filter branch March 7, 2019 13:47
mayya-sharipova added a commit that referenced this pull request Mar 7, 2019
mayya-sharipova added a commit that referenced this pull request Mar 7, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Mar 7, 2019
* 6.7:
  Fix CCR HLRC docs
  Introduce forget follower API (elastic#39718)
  6.6.2 release notes.
  Update release notes for 6.7.0
  Add documentation for min_hash filter (elastic#39671)
  Unmute testIndividualActionsTimeout
  Unmute testFollowIndexAndCloseNode
  Use unwrapped cause to determine if node is closing (elastic#39723)
  Don’t ack if unable to remove failing replica (elastic#39584)
  Wipe Snapshots Before Indices in RestTests (elastic#39662) (elastic#39765)
  Bug fix for AnnotatedTextHighlighter (elastic#39525)
  Fix Snapshot BwC with Version 5.6.x (elastic#39737)
  Fix occasional SearchServiceTests failure (elastic#39697)
  Correct date in daterange-aggregation.asciidoc (elastic#39727)
  Add a note to purge the ingest-geoip plugin (elastic#39553)
Contributor

@jtibshirani jtibshirani left a comment

I just read over the documentation to learn more about this token filter and had a couple thoughts. I found these additions very helpful!

will provide a higher guarantee that different tokens are
indexed to different buckets.
** to improve the recall,
you should increase `hash_token` parameter. For example,
Contributor

Should this be hash_count?

mayya-sharipova added a commit that referenced this pull request Mar 8, 2019
mayya-sharipova added a commit that referenced this pull request Mar 8, 2019
mayya-sharipova added a commit that referenced this pull request Mar 8, 2019
@rdvdijk

rdvdijk commented Aug 6, 2024

The documentation does not show or explain how to query for min_hash values.

For others that are trying to figure out how to query for min_hash values, and end up here (as I did), an explanation on how to do this can be found in the comments of the issue: #20757 (comment)

I think it would be a good idea to add such an example to the documentation.

Labels
>docs General docs changes :Search Relevance/Analysis How text is split into tokens v6.7.0 v7.2.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

Minhash token filter needs better documentation
6 participants