Significant Terms: Add google normalized distance and chi square #6858
Regarding the heuristic comparison for the docs: how about an experiment using a large corpus like Wikipedia? You could plot the distribution over document frequencies of the top-X significant terms for a large number of randomly generated queries, once per heuristic. My gut feeling is that you would see differences in the distributions. Then you could advise using heuristic A if you are interested in rather rare terms with a very strong correlation, or heuristic B if you are interested in rather common words with a possibly slightly weaker correlation.
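In case it helps, here is a rough sketch of what I mean. The index name `wikipedia`, the field `text`, the vocabulary file, and the heuristic parameters are all placeholders; the heuristic names follow the request syntax proposed in this PR:

```python
# Rough sketch: top-100 significant terms for 200 random query terms,
# once per heuristic. All names below are placeholders, not part of the PR.
import json
import random
import urllib.request

SEARCH_URL = "http://localhost:9200/wikipedia/_search"
HEURISTICS = {
    "mutual_information": {"include_negatives": True},
    "chi_square": {"include_negatives": True},
    "gnd": {},
}

def significant_terms(query_term, heuristic, params, size=100):
    body = {
        "size": 0,
        "query": {"match": {"text": query_term}},
        "aggs": {
            "sig": {
                "significant_terms": {"field": "text", "size": size, heuristic: params}
            }
        },
    }
    req = urllib.request.Request(
        SEARCH_URL, json.dumps(body).encode(), {"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        buckets = json.load(resp)["aggregations"]["sig"]["buckets"]
    # doc_count = docs containing both terms, bg_count = background frequency
    return [(b["key"], b["doc_count"], b["bg_count"]) for b in buckets]

vocabulary = open("wikipedia-vocabulary.txt").read().split()  # placeholder file
query_terms = random.sample(vocabulary, 200)
results = {
    name: {t: significant_terms(t, name, params) for t in query_terms}
    for name, params in HEURISTICS.items()
}
```

From `results` you can read off the background frequency of every returned term and plot its distribution per heuristic.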
I made some plots, let me know what you think. I also added example results, but I do not think that looking at examples says much about the quality of the results.
Index: Wikipedia, 243361 pages, 2 shards
For each of 200 random words, compute the top 100 significant terms. These should then be the terms that co-occur with the query term most often. Here is a small illustration:
"query term" here refers to the set of documents that contain one of the random terms. "other term" is the set of documents that contain another term which is potentially significant to the query term. "num co-occurences" is the set of documents that contain both terms.
For each of the 200 query terms I collected the top 100 most significant terms and then plotted several quantities (numbered 1.–4. in the plots). For example, a high value for 1. and 3. together with a low value for 2. and 4. would indicate a preference of the heuristic for terms that often appear together with the query term even if they also appear frequently in the background.
In addition, I show the resulting significant terms for one query term that occurs in many documents and one that occurs in few, just to give an impression of the results.
To give an idea of how frequent the query terms are, here is a histogram of the number of documents containing each of the 200 randomly chosen query terms (y-axis: number of terms falling in a bucket, x-axis: number of documents containing the term):
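The per-term counts can be fetched with a plain `_count` request; a small sketch (index, field, and terms file are again placeholders):

```python
# Sketch: histogram of how many documents contain each of the 200 query terms.
import json
import urllib.request

import matplotlib.pyplot as plt

COUNT_URL = "http://localhost:9200/wikipedia/_count"  # placeholder index

def doc_count(term):
    # Number of documents containing `term` in the (assumed) text field.
    body = {"query": {"match": {"text": term}}}
    req = urllib.request.Request(
        COUNT_URL, json.dumps(body).encode(), {"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

query_terms = open("query-terms.txt").read().split()  # the 200 random terms
counts = [doc_count(t) for t in query_terms]

plt.hist(counts, bins=50)
plt.xlabel("number of documents containing the term")
plt.ylabel("number of query terms in bucket")
plt.show()
```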
Example results for the term "shoe" (appearing in few documents: 1512)
Example results for the term "germany" (appearing in many documents: 21662)
Mutual information seems to select frequent terms more often than rare ones. This might mean that you end up with lots of stopwords, as in the "shoe" example.
It will be hard to say which heuristic is "better" because the usefulness of the significant terms seems tied to the application they are used in (see for example Yang and Pedersen, who evaluate the goodness of feature selection methods by measuring the performance of a subsequent text classification that uses the selected features).
If no one objects I will add this little analysis to the documentation.