Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant Terms: Add google normalized distance and chi square #6858

Closed
wants to merge 14 commits into from

Conversation

Projects
None yet
5 participants
@brwe
Copy link
Contributor

commented Jul 14, 2014

For google normalized distance, see
"The Google Similarity Distance", Cilibrasi and Vitanyi, 2007
http://arxiv.org/pdf/cs/0412098v3.pdf

For chi square, see
"Information Retrieval", Manning et al., Eq. 13.19

@brwe brwe added review labels Jul 14, 2014

@s1monw s1monw removed the v1.3.0 label Jul 15, 2014

@jpountz

View changes

docs/reference/search/aggregations/bucket/significantterms-aggregation.asciidoc Outdated


===== Chi square
added[1.3.1]

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

As a new feature, this should be either 1.3.0 or 1.4.0

This comment has been minimized.

Copy link
@brwe

brwe Jul 24, 2014

Author Contributor

ok

@jpountz

View changes

...main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/ChiSquare.java Outdated
* @param supersetFreq The frequency of the term in the superset from which the sample was taken
* @param supersetSize The size of the superset from which the sample was taken (typically number of docs)
* @return a "significance" score
*/

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

I don't think it is useful to repeat the description of the parameters since they should be inherited from the parent class. So maybe we should just remove the javadocs of this method and move the reference to the source of the algorithm to the class-level javadocs?

This comment has been minimized.

Copy link
@brwe

brwe Jul 24, 2014

Author Contributor

yes

@jpountz

View changes

...main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/ChiSquare.java Outdated

@Override
protected void checkName(String givenName) {
NAMES_FIELD.match(givenName, ParseField.EMPTY_FLAGS);

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

Hmm, ParseField.match nevers throws with empty flags, what is your goal here? Also, is it ok to ignore the return value of match?

This comment has been minimized.

Copy link
@brwe

brwe Jul 24, 2014

Author Contributor

ParseField.match should throw an exception if the flag contains strict and a deprecated name was used. It does currently not make sense at all as the flag is always empty. The reason for this is that the parse flags are currently not passed on to the different aggregation parsers.

I added this in #6561 and remember I had a discussion about it (but could not find the comment anymore). The idea was to have it in place and once the flags are properly passed, we can use it properly. It is also done this way in Aggregator

However, I now think adding the check with empty flags was a bad idea as it causes confusion already.
Should I remove this check everywhere?

This comment has been minimized.

Copy link
@brwe

brwe Jul 25, 2014

Author Contributor

I removed the match checks with the empty flags in "remove EMPTY_FLAGS check as the match..."

@jpountz

View changes

...main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/ChiSquare.java Outdated
protected static final ParseField NAMES_FIELD = new ParseField("chi_square");

private ChiSquare() {
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

this constructor looks unused

This comment has been minimized.

Copy link
@brwe

brwe Jul 24, 2014

Author Contributor

it is, will remove

@jpountz

View changes

...lasticsearch/search/aggregations/bucket/significant/heuristics/NXYSignificanceHeuristic.java Outdated
* the background without the subset. We might want to filter out the terms that are appear much less often
* in the subset than in the background without the subset.
*/
protected boolean includeNegatives = false;

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

Can the two above properties be made final?

This comment has been minimized.

Copy link
@brwe

brwe Jul 24, 2014

Author Contributor

yes

@jpountz

View changes

src/main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/GND.java Outdated
}
if (supersetFreq > supersetSize) {
throw new ElasticsearchIllegalArgumentException("supersetFreq > supersetSize, in GND.score(..)");
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

These checks seem to be duplicated in most instances, can we try to share the code in a parent class somehow?

Additionally, in that particular class, it seems like 0 should be avoided as well since we are taking a log?

This comment has been minimized.

Copy link
@brwe

brwe Jul 25, 2014

Author Contributor

yes. I added the distinction between if background is superset or not to GND also. Commit is "make gnd also distinguish..."

@jpountz

View changes

src/main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/GND.java Outdated
public static class GNDBuilder implements SignificanceHeuristicBuilder {

public GNDBuilder() {
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

Maybe we should just use the default constructor instead of defining one explicitely? (Chi2 and Mutual information have the same pattern.)

This comment has been minimized.

Copy link
@brwe

brwe Jul 25, 2014

Author Contributor

yes, I removed them in commit "make gnd also distinguish..."

int result = (includeNegatives ? 1 : 0);
result = 31 * result + (backgroundIsSuperset ? 1 : 0);
return result;
}

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

The class of the object is not taken into account so it would be very easy to generate collisions with instances of MutualInformation and ChiSquare. Do we use these objects as keys of hash tables or rely on the equals method somewhere?

This comment has been minimized.

Copy link
@brwe

brwe Jul 25, 2014

Author Contributor

I use the equals(..) in tests in SignificanceHeuristicTests . I thought (and @s1monw requested in this comment) if you override the equals you will also have to override hash just to be on the save side.
So, now I added a hash function for each of the heuristics to avoid collisions in commit "add hashCode() to each heuristic that actually has instances". Please check if this is the way to write a hash function.

@jpountz

View changes

...lasticsearch/search/aggregations/bucket/significant/heuristics/NXYSignificanceHeuristic.java Outdated
double N00, N01, N10, N11, N0_, N1_, N_0, N_1, N;
}

protected void computeNxys(long subsetFreq, long subsetSize, long supersetFreq, long supersetSize) {

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

I think there is a thread safety issue here since this method updates an instance member while it could be called from separate threads at the same time? I think it's ok to allocate one each time as this would be an easy job for the JVM's escape analysis?

This comment has been minimized.

Copy link
@brwe

brwe Jul 25, 2014

Author Contributor

ok, added commit "new frequency each time score is computed"


"gnd": {
}
--------------------------------------------------

This comment has been minimized.

Copy link
@jpountz

jpountz Jul 15, 2014

Contributor

I think we should also briefly try to explain how they differ (on a very high level) and when each of these heuristics should be used.

@jpountz jpountz removed the review label Jul 15, 2014

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2014

I left a couple of comments. Something that looks important to me is to explain how these heuristics compare, otherwise it would be hard to find out which one best fits given data/use-cases?

@hkorte

This comment has been minimized.

Copy link
Contributor

commented Jul 24, 2014

Regarding the heuristic comparison for the docs: How about an experiment using a large corpus like the Wikipedia. You could plot the distribution over document frequencies of the top-X significant terms to a large number of randomly generated queries for each of the different heuristics. My gut feeling tells me that you see differences in the distributions. Then you could advise to use heuristic A if you are interested in rather rare terms with a very strong correlation or to use heuristic B if you are interested in rather common words with a possibly slightly lower correlation.

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 25, 2014

Thanks for the review! I added commits to address the comments.
Next I will try to come up with a way to show the difference between the heuristics. I'll start with what @hkorte proposed.

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 28, 2014

I made some plots. Let me know what you think. I also added example results but I do not think that looking at examples tells a lot about quality of the result.

Experiment

Index wikipedia, 243361 pages, 2 shards

For 200 random words, compute top 100 significant terms. This should then be the terms that co-occur with the query term most often. Here is a small illustration:

cooc-illu 001

"query term" here refers to the set of documents that contain one of the random terms. "other term" is the set of documents that contain another term which is potentially significant to the query term. "num co-occurences" is the set of documents that contain both terms.

For each of the 200 query terms I collected the top 100 most significant terms and then plotted
plotted the mean (over the 200 random words) of

  1. co-occurrence: How big is the set of co-occurring documents?

  2. other term: How often does the significant term appear in the whole set?

  3. co-occurrences/ # num query term: How many (in percent) of the documents containing query term also contain the significant term?

  4. co-occurrences/ # other term: How many (in percent) of the documents containing the significant term also contain the query term?

For example, a high value for 1. and 3. and low value for 2. and 4. would indicate a preference of the heuristic to select terms that often appear together with the query term even if it they also appear in the background frequently.

In addition, show the resulting significant terms for one term that occurs often in documents and one that occurs rarely just to get an idea.

Plots

Just to get an idea of how frequent the query terms are, here is a histogram for the number of documents containing each of the 200 randomly chosen query terms (number of terms falling in bucket/ number of documents containing the terms):

image

  1. co-occurrence: How big is the set of co-occurring documents? (mean over 200 query terms, plot for the top 100 significant terms)

image
2. # other term: How often does the significant term appear in the whole set? (mean over 200 query terms, plot for the top 100 significant terms)

image
3. # co-occurrences/ # num query term: How many (in percent) of the documents containing query term also contain the significant term?(mean over 200 query terms, plot for the top 100 significant terms)

image
4. # co-occurrences/ # other term: How many (in percent) of the documents containing the significant term also contain the query term?(mean over 200 query terms, plot for the top 100 significant terms)

image

Result examples

Example results for term "shoe" (appearing in few - 1512 - documents)

jlh chi_square mutual_information gnd
shoe shoe shoe shoe
shoes shoes shoes shoes
footwear footwear one footwear
leather leather have boot
factory factory new boots
boot boot old leather
boots boots many heel
street heel be shoemaking
heel shoemaking time soles
clothing soles home soled
old clothing first tenspeed
home shop their sneakers
shop wear two laces
shoemaking soled when adidas
soles street into toe
just store they cordwainers
wear old around medalen
store foot three sqornshellous
back home well decristofaro
around just by buildering

Example results for term "germany" (appearing in many - 21662 - documents)

jlh chi_square mutual_information gnd
germany germany germany germany
german german german german
france france france france
italy austria europe italy
austria italy italy austria
europe poland world europe
poland nazi austria poland
nazi europe poland countries
netherlands netherlands european netherlands
berlin berlin netherlands nazi
european switzerland countries berlin
countries european nazi european
world countries international republic
switzerland russia kingdom russia
russia belgium war switzerland
belgium world during spain
kingdom hungary berlin belgium
republic sweden after sweden
sweden spain russia hungary
hungary republic republic von

Conclusion

Mutual information seems to select frequent terms more often than rare terms. This might mean that you end up with lots of stopwords as for example in the "shoe" example.
Google normalized distance seems to favor less frequent terms. This might be good for avoiding stopwords but you might also end up with, for example, misspellings as for example in the "shoe" example.
Chi square and jlh perform comparably and are somewhat in between.

It will be hard to say which one is "better" because the usefulness of the significant terms seems to be tied to the application they are used in (see for example Yang and Petersen where they evaluate goodness by measuring performance of feature selection methods by evaluating the performance of a subsequent text classification using the features).

If no one objects I will add this little analysis to the documentation.

@brwe

This comment has been minimized.

Copy link
Contributor Author

commented Jul 31, 2014

I addressed all comments in separate commits. Let me know if they make sense!

@brwe brwe added the review label Jul 31, 2014

@jpountz

View changes

...main/java/org/elasticsearch/search/aggregations/bucket/significant/heuristics/ChiSquare.java Outdated

// here we check if the term appears more often in subset than in background without subset.
if (!includeNegatives && frequencies.N11 / frequencies.N_1 < frequencies.N10 / frequencies.N_0) {
return -1.0 * Double.MAX_VALUE;

This comment has been minimized.

Copy link
@jpountz

jpountz Aug 1, 2014

Contributor

Should it be Double.NEGATIVE_INFINITY instead?

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Aug 1, 2014

Left one minor comment, other than that LGTM

@jpountz jpountz removed the review label Aug 1, 2014

@brwe brwe changed the title Add google normalized distance and chi square to significant terms significant terms: Add google normalized distance and chi square Aug 4, 2014

@brwe brwe closed this in a3cefd9 Aug 4, 2014

@clintongormley clintongormley changed the title significant terms: Add google normalized distance and chi square Significant Terms: Add google normalized distance and chi square Sep 8, 2014

Mpdreamz added a commit to elastic/elasticsearch-net that referenced this pull request Dec 11, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.