Nondeterminism in documents indexed for Gov2? #7
Comments
I redid the Gov2 indexing on hops at UWaterloo and got the same number of documents: 24900602 docs indexed.
1000 documents don't appear out of nowhere.
As Charlie suggested in email (and the sensical thing to do), we should dump out the docids and diff. @LuchenTan so you can start learning the Lucene APIs, etc., can you write a simple program that does this? Fork the repo, add in your code, and send a pull request. The ultimate product should be a command I can copy and paste to generate the output (documented in the README). Thanks!
OK, I'll do it.
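For illustration only, here is a minimal sketch of the "diff" half of that suggestion. It is not the code from the pull request: it assumes each index has already been dumped to a plain-text file with one docid per line (a hypothetical format), and the names `load_docids` and `diff_docids` are made up for this example. It does not touch the Lucene APIs at all.

```python
def load_docids(path):
    """Read a docid dump (assumed format: one docid per line) into a set.

    Blank lines and surrounding whitespace are ignored.
    """
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def diff_docids(path_a, path_b):
    """Compare two docid dumps.

    Returns a tuple (only_in_a, only_in_b) of sorted lists, so the
    docids missing from either index can be inspected directly.
    """
    docids_a = load_docids(path_a)
    docids_b = load_docids(path_b)
    return sorted(docids_a - docids_b), sorted(docids_b - docids_a)
```

Given two dump files (the file names here are placeholders), `only_umd, only_uw = diff_docids("docids.umd.txt", "docids.uw.txt")` would list the docids present in only one of the two indexes. Holding both sets in memory is a simplification; for a collection of roughly 25 million documents, sorting the dumps and comparing them as streams may be the more practical choice.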
Update on this: as it turned out, there was a minor corruption in the copy of Gov2 at UMD. After fixing this and rerunning the indexer, I get 24900602 docs, which matches @LuchenTan's number.
Yes, exactly the same number as mine. On Thu, Oct 15, 2015 at 2:02 PM, Jimmy Lin notifications@github.com wrote:
Luchen Tan, David R. Cheriton School of Computer Science
Resolving issue... basically, no issue.
Merged in Luchen's code to dump out docids in an index.
Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.
Weird. Some non-determinism in the multi-threading?
Not that important if we can replicate effective results on standard test collections, but worth noting.