Indexing all of ClueWeb09 #18

Closed
lintool opened this issue Oct 16, 2015 · 1 comment

@lintool
Member

lintool commented Oct 16, 2015

Quite impressively, I was able to index all of ClueWeb09 (English):

nohup sh target/appassembler/bin/IndexClueWeb09b \
  -input /scratch1/collections/ClueWeb09.English/data/ \
  -index lucene-index.cw09.cnt -threads 32 -optimize >& log.cw09.cnt.txt &

Took ~18 hours:

2015-10-16 07:51:04,775 INFO  [main] index.IndexClueWeb09b (IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04
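For a rough sense of throughput, the document count and elapsed time can be pulled out of that final log line. The following is only a sketch: the regular expression is inferred from the single line shown above, and `parse_throughput` is a hypothetical helper, not part of Anserini.

```python
import re

# The final log line from the run above, copied verbatim.
LOG_LINE = (
    "2015-10-16 07:51:04,775 INFO  [main] index.IndexClueWeb09b "
    "(IndexClueWeb09b.java:298) - Total 503781465 documents indexed in 18:01:04"
)

def parse_throughput(line):
    """Return (documents, elapsed seconds, documents per second).

    Assumes the 'Total N documents indexed in H:MM:SS' shape seen in the
    one log line quoted in this issue; other formats raise ValueError.
    """
    match = re.search(r"Total (\d+) documents indexed in (\d+):(\d{2}):(\d{2})", line)
    if match is None:
        raise ValueError("unrecognized log line")
    docs, hours, minutes, secs = (int(g) for g in match.groups())
    elapsed = hours * 3600 + minutes * 60 + secs
    return docs, elapsed, docs / elapsed

docs, elapsed, rate = parse_throughput(LOG_LINE)
print(f"{docs} documents in {elapsed} s: about {rate:,.0f} documents/s")
```

That works out to roughly 7,767 documents per second across the 32 threads.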

Index size (note: no positions):

$ du -h lucene-index.cw09.cnt/
254G    lucene-index.cw09.cnt/

@iorixxx
Contributor

iorixxx commented Dec 1, 2015

I was able to index the whole CW09 dataset with slight modifications to the IndexClueWeb09b class. With 15 threads and a 50 GB heap, it took about 20 hours and produced an index of 650 GB.

iorixxx added a commit to iorixxx/Anserini that referenced this issue Dec 2, 2015
@lintool lintool closed this as completed Mar 1, 2017
@castorini castorini locked as resolved and limited conversation to collaborators Nov 27, 2018