Nondeterminism in documents indexed for Gov2? #7

Closed
lintool opened this issue Oct 6, 2015 · 8 comments


@lintool
Member

lintool commented Oct 6, 2015

Indexing Gov2 on streeling at UMD, I get 24899563 docs.
Luchen reports 24900602 docs indexing on hops.

Weird. Is there some nondeterminism in the multi-threading?

Not that important if we can replicate effective results on standard test collections, but worth noting.

@LuchenTan
Contributor

I redid the Gov2 indexing on hops at UWaterloo and got the same number of documents: 24900602 docs indexed.

@claclark

claclark commented Oct 7, 2015

1000 documents don't appear from nowhere

@lintool
Member Author

lintool commented Oct 7, 2015

As Charlie suggested in email (and the sensible thing to do), we should dump out the docids from each index and diff them.

@LuchenTan so you can start learning the Lucene APIs, etc., can you write a simple program that does this?

Fork the repo, add in your code, and send a pull request. The ultimate product should be a command I can copy and paste to generate the output (documented in the README).

Thanks!
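
The dump-and-diff check suggested above can be sketched as follows. This is only an illustration, not the code that was eventually merged: it assumes the docids of each index have already been dumped to a plain text file with one docid per line, and the function names and command-line arguments are hypothetical. It loads both dumps into in-memory sets, which is simple but needs a few gigabytes of RAM for a collection the size of Gov2.

```python
import sys
from typing import Iterable, List, Set, Tuple


def load_docids(lines: Iterable[str]) -> Set[str]:
    """Collect the non-empty, whitespace-stripped docids from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def diff_docids(a: Iterable[str], b: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (docids only in a, docids only in b), each sorted."""
    set_a, set_b = load_docids(a), load_docids(b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


def diff_docid_files(path_a: str, path_b: str) -> Tuple[List[str], List[str]]:
    """Diff two docid dump files (assumed format: one docid per line)."""
    with open(path_a, encoding="utf-8") as fa, open(path_b, encoding="utf-8") as fb:
        return diff_docids(fa, fb)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python diff_docids.py dump_a.txt dump_b.txt
    only_a, only_b = diff_docid_files(sys.argv[1], sys.argv[2])
    print(f"only in {sys.argv[1]}: {len(only_a)}")
    print(f"only in {sys.argv[2]}: {len(only_b)}")
    for docid in only_a:
        print(f"< {docid}")
    for docid in only_b:
        print(f"> {docid}")
```

With the two counts reported here (24899563 and 24900602), a diff like this would be expected to list at least 1039 docids on one side; the listed docids then show which part of the collection is missing from the smaller index.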

@LuchenTan
Contributor

OK, I'll do it.

@lintool
Member Author

lintool commented Oct 15, 2015

Update on this: as it turned out, there was a minor corruption in the copy of Gov2 at UMD. After fixing this and rerunning the indexer, I get 24900602 docs, which matches @LuchenTan's number.

@LuchenTan
Contributor

Yes, exactly the same number as mine.


Luchen Tan

David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada N2L 3G1

@lintool
Member Author

lintool commented Oct 23, 2015

Resolving issue... basically, no issue.

@lintool lintool closed this as completed Oct 23, 2015
@lintool
Member Author

lintool commented Oct 23, 2015

Merged in Luchen's code to dump out docids in an index:
commit 491dd9e

@castorini castorini locked as resolved and limited conversation to collaborators Nov 27, 2018