Extract common code paths in indexing pipeline #2275

Merged
lintool merged 12 commits into master from refactoring on Dec 13, 2023

lintool
Member

@lintool lintool commented Nov 26, 2023

Major refactoring of indexing pipeline (IndexCollection, IndexHnswDenseVectors, and IndexInvertedDenseVectors), extracting common code paths into AbstractIndexer.
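
For readers unfamiliar with this kind of refactoring, the general idea of pulling shared logic into an abstract base class can be sketched as below. This is only an illustration of the pattern: `IndexerSketch`, `TextIndexerSketch`, `run`, and `indexOne` are hypothetical names and are not taken from the actual `AbstractIndexer` in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the pattern only; these names are NOT Anserini's API.
abstract class IndexerSketch {
    // Shared code path: iterate over the input, delegate the per-document
    // step to the subclass, and count how many documents were indexed.
    final int run(List<String> docs) {
        int indexed = 0;
        for (String doc : docs) {
            if (indexOne(doc)) {
                indexed++;
            }
        }
        return indexed;
    }

    // Subclass-specific step; returns true if the document was indexed.
    protected abstract boolean indexOne(String doc);
}

// One concrete (hypothetical) indexer that only supplies the per-document step.
final class TextIndexerSketch extends IndexerSketch {
    final List<String> stored = new ArrayList<>();

    @Override
    protected boolean indexOne(String doc) {
        if (doc.isEmpty()) {
            return false; // skip empty documents
        }
        stored.add(doc.toLowerCase());
        return true;
    }
}
```

With this shape, each concrete indexer carries only what differs, and the loop and bookkeeping live in one place.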

@tteofili @MXueguang still running tests, but ready for review in parallel.

@lintool lintool marked this pull request as draft November 26, 2023 02:31

codecov bot commented Nov 26, 2023

Codecov Report

Attention: 74 lines in your changes are missing coverage. Please review.

Comparison is base (2748548) 63.41% compared to head (00fe5a7) 63.82%.
Report is 8 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...c/main/java/io/anserini/index/AbstractIndexer.java | 69.48% | 37 Missing and 10 partials ⚠️ |
| ...c/main/java/io/anserini/index/IndexCollection.java | 76.00% | 12 Missing and 6 partials ⚠️ |
| .../java/io/anserini/index/IndexHnswDenseVectors.java | 83.72% | 6 Missing and 1 partial ⚠️ |
| ...a/io/anserini/index/IndexInvertedDenseVectors.java | 93.93% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2275      +/-   ##
============================================
+ Coverage     63.41%   63.82%   +0.41%     
- Complexity     1287     1296       +9     
============================================
  Files           196      197       +1     
  Lines         11463    11251     -212     
  Branches       1457     1421      -36     
============================================
- Hits           7269     7181      -88     
+ Misses         3672     3593      -79     
+ Partials        522      477      -45     

@lintool lintool changed the title from "More refactoring" to "Extract common code paths in indexing pipeline" Nov 30, 2023
Collaborator

@tteofili tteofili left a comment

thx @lintool LGTM

@lintool lintool mentioned this pull request Dec 5, 2023
public int memoryBuffer = 16384;

@Option(name = "-threads", metaVar = "[num]", usage = "Number of indexing threads.")
public int threads = 4;
Contributor

Thinking out loud: what about defaulting to Runtime.getRuntime().availableProcessors() so that indexing concurrency would automatically try to use all available resources, without requiring users to pass the right number of threads? (Possibly for another PR)
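
A minimal sketch of what such a default could look like, assuming a sentinel value marks "the user did not pass -threads". The class `ThreadDefaultSketch`, the `UNSET` constant, and `resolveThreads` are hypothetical and not part of this PR; only `Runtime.getRuntime().availableProcessors()` is a real JDK call.

```java
// Hypothetical sketch; not the Args class touched by this PR.
final class ThreadDefaultSketch {
    // Assumed sentinel meaning "no -threads value was supplied".
    static final int UNSET = -1;

    // An explicit positive value wins; otherwise fall back to the number
    // of processors the JVM reports as available.
    static int resolveThreads(int requested) {
        if (requested > 0) {
            return requested;
        }
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        System.out.println("explicit: " + resolveThreads(8));
        System.out.println("default:  " + resolveThreads(UNSET));
    }
}
```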

Member Author

Yes, potentially - but:
(1) defaulting to consuming all available resources can be rude ;)
(2) it depends on the corpus as well... e.g., if the corpus only has 10 files, then 16 threads isn't going to help since we only do per-file parallelism.
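
Point (2) can be illustrated with a small sketch: when each file is one task, the number of threads that can ever be busy is bounded by the number of files. `PerFileParallelismSketch`, `effectiveThreads`, and `processAll` are hypothetical names for illustration, and the per-file "work" is just a counter increment standing in for indexing.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-file parallelism; not the PR's implementation.
final class PerFileParallelismSketch {
    // With one task per file, at most fileCount threads can do useful work.
    static int effectiveThreads(int requested, int fileCount) {
        return Math.max(1, Math.min(requested, fileCount));
    }

    // Submits one task per file and returns how many files were processed.
    static int processAll(List<String> files, int requestedThreads) {
        int threads = effectiveThreads(requestedThreads, files.size());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger processed = new AtomicInteger();
        for (String file : files) {
            // Stand-in for indexing one file.
            pool.execute(() -> processed.incrementAndGet());
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
        }
        return processed.get();
    }
}
```

For a corpus of 10 files and a request for 16 threads, `effectiveThreads(16, 10)` yields 10, so the extra 6 threads would have nothing to do.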

Member Author

lintool commented Dec 11, 2023

Tests are proceeding, but not finished yet. Stand by.

@lintool lintool marked this pull request as draft December 11, 2023 12:19
@lintool lintool marked this pull request as ready for review December 13, 2023 09:23
@lintool lintool merged commit b6a7534 into master Dec 13, 2023
3 checks passed
@lintool lintool deleted the refactoring branch January 23, 2024 01:19
3 participants