Significant terms misses some terms #5998
Thanks for looking at this, Britta.
My bad, I did not try that. A little more explanation: in my particular case I was trying to get the significant terms for Wikipedia articles (only partially indexed) that contain a particular term in the title, in this case "shoe". I have only one shard. Not sure if 3k is unreasonable, and I also only tested this one case. In addition, I do not know the true significance of the terms that are returned. The documents (only 254 contained "shoe" in the title) did not contain many significant terms anyway, so most of the 100 terms did not make sense and also scored low. I am still wondering if it makes sense to use the `shard_size` parameter for this. On the other hand, if there is no routing of docs to shards involved, I can maybe assume that the documents of classes and also the terms therein are distributed evenly across shards. So in that case it might be easier to not add documents to the pq that have subsetDF <= `shard_min_doc_count`. Anyway, this is just an idea. We could proceed with formal measurements, but this might be tricky and time consuming. If 3k is reasonable for the pq size we can also close this issue for now.
One thing to watch with Wikipedia: there tend to be a lot of very short docs that have no real wordy content on a theme but instead serve as disambiguation or synonym-type entries. It can be important to tune these out from these sorts of analyses.
I like the `shard_min_doc_count` idea.
…`shard_size`: Significant terms internally maintains a priority queue per shard with a size potentially lower than the number of terms. This queue uses the score as the criterion to determine whether a bucket is kept or not. If many terms with low subsetDF score very high but `min_doc_count` is set high, this might result in no terms being returned, because the pq is filled with low-frequency terms which are all sorted out in the end. This can be avoided by increasing the `shard_size` parameter to a higher value. However, it is not immediately clear to which value this parameter must be set, because we cannot know how many low-frequency terms are scored higher than the high-frequency terms that we are actually interested in. On the other hand, if there is no routing of docs to shards involved, we can maybe assume that the documents of classes and also the terms therein are distributed evenly across shards. In that case it might be easier to not add documents to the pq that have subsetDF <= `shard_min_doc_count`, which can be set to something like `min_doc_count`/number of shards, because we would assume that even when summing up the subsetDF across shards, `min_doc_count` will not be reached. closes elastic#5998
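The parameters discussed above can be combined in a single request. A hedged sketch of such a request body (the index layout, field names, and threshold values are illustrative, not taken from the issue; the term "shoe", the size of 100, and the 3k queue size mirror the numbers mentioned in the discussion):

```json
{
  "query": { "match": { "title": "shoe" } },
  "aggs": {
    "sig_terms": {
      "significant_terms": {
        "field": "text",
        "size": 100,
        "min_doc_count": 10,
        "shard_size": 3000,
        "shard_min_doc_count": 5
      }
    }
  }
}
```

Per the commit message, `shard_min_doc_count` would be set to roughly `min_doc_count` divided by the number of shards, on the assumption that a term's subsetDF is spread evenly across shards.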
This was discussed in issues elastic#6041 and elastic#5998.
When defining a "min_doc_count" and a "size", significant terms might fail to return the right significant terms. The scenario is as follows:

1. many terms with subsetDF < min_doc_count score very high, and
2. the terms with subsetDF >= min_doc_count that we are actually interested in score lower than the terms in 1.
Under these circumstances no significant term is returned at all (a test is in the branches below). This makes it hard to use significant terms for processing natural text, which contains many low-frequency terms.
I think the reason for this behavior is that internally a priority queue is maintained that has a maximum of 2*size entries. This queue uses the score as the criterion to determine whether a bucket is kept or not. Because terms in 1. score higher than terms in 2., only buckets with subsetDF < min_doc_count will be kept, and therefore no terms are returned at all in the end.
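The crowding-out effect can be simulated in a few lines of Python. This is a toy sketch: the scoring function, corpus sizes, and term counts are invented for illustration and are not the real significance heuristic or Elasticsearch internals.

```python
import heapq

SUBSET_SIZE, SUPERSET_SIZE = 254, 100_000  # docs matching "shoe" vs. whole index

def score(subset_df, superset_df):
    # Toy significance score: subset share relative to superset share.
    # Rare terms concentrated in the subset get a very high score.
    return (subset_df / SUBSET_SIZE) / (superset_df / SUPERSET_SIZE)

# 1. many rare terms: 2 occurrences in the subset, 2 overall -> huge score
rare_terms = [(f"rare{i}", 2, 2) for i in range(200)]
# 2. a genuinely interesting frequent term: 50 subset docs, 500 overall
frequent_terms = [("sneaker", 50, 500)]

MIN_DOC_COUNT = 10
PQ_SIZE = 100  # shard-level queue, smaller than the number of candidate terms

# Shard-level priority queue keeps only the PQ_SIZE highest-scoring buckets.
pq = []
for term, sdf, Sdf in rare_terms + frequent_terms:
    heapq.heappush(pq, (score(sdf, Sdf), term, sdf))
    if len(pq) > PQ_SIZE:
        heapq.heappop(pq)  # evict the lowest-scoring bucket

# Final reduce phase applies min_doc_count: every surviving bucket is rare,
# so the frequent term was already evicted and nothing is returned.
survivors = [t for s, t, sdf in pq if sdf >= MIN_DOC_COUNT]
print(survivors)  # -> []
```

The rare terms score far higher than "sneaker", fill the queue, and push it out; the final `min_doc_count` filter then discards all of them, leaving an empty result.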
It is unclear to me how to fix this. I prepared two potential ways to fix it (not nearly ready for a PR, more a proof of concept):
A. assign score -MAX_FLOAT if subsetDf < minDocCount
This would cause terms with a too-low subsetDF to be added with a very low score. These terms would still get a chance to be returned if their subsetDf increases during merging. However, they might be thrown out of the priority queue by higher-scoring terms before that and never be merged at all.
Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-1
B. Do not add terms to priority queue if subsetDf < minDocCount
Easier to code, but unfortunately this means that terms that might meet the minDocCount criterion after merging would never be added in the first place.
Branch: https://github.com/brwe/elasticsearch/tree/missing-sig-terms-2
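The difference between the two options can be sketched on a toy two-shard merge. This is a hedged illustration (all names and numbers are invented, and it ignores queue capacity, i.e. it assumes option A's demoted buckets survive until the merge, which L22's caveat says is not guaranteed):

```python
MIN_DOC_COUNT = 10
NEG_MAX = float("-inf")  # stand-in for -MAX_FLOAT

# Per-shard (term, subset_df, score) candidates; "split" has subsetDf 6 on
# each shard, so it only reaches min_doc_count after the cross-shard merge.
shard1 = [("split", 6, 5.0), ("common", 20, 3.0)]
shard2 = [("split", 6, 5.0), ("common", 15, 3.0)]

def shard_buckets_A(cands):
    # Option A: keep everything, but demote under-threshold buckets to the
    # lowest possible score so they are the first evicted from the queue.
    return [(t, df, NEG_MAX if df < MIN_DOC_COUNT else s) for t, df, s in cands]

def shard_buckets_B(cands):
    # Option B: never add under-threshold buckets to the queue at all.
    return [(t, df, s) for t, df, s in cands if df >= MIN_DOC_COUNT]

def merge(*shard_lists):
    # Sum subsetDf per term across shards, then apply min_doc_count.
    totals = {}
    for buckets in shard_lists:
        for t, df, _ in buckets:
            totals[t] = totals.get(t, 0) + df
    return sorted(t for t, df in totals.items() if df >= MIN_DOC_COUNT)

print(merge(shard_buckets_A(shard1), shard_buckets_A(shard2)))
# -> ['common', 'split']  (A can still surface "split" after merging)
print(merge(shard_buckets_B(shard1), shard_buckets_B(shard2)))
# -> ['common']           (B loses "split" for good)
```

This makes the trade-off concrete: A preserves a chance for terms whose subsetDf only crosses the threshold after merging, at the cost of queue slots; B is simpler but discards them irrevocably.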
Neither way guarantees that the "correct" significant terms are returned.
I would prefer B.