Speed up ivec reads by buffering #584

Merged
marianotepper merged 1 commit into datastax:main from ashkrisk:ivec-bufread
Jan 20, 2026

@ashkrisk ashkrisk commented Dec 1, 2025

MultiFileDataSource uses SiftLoader.readFvecs to read the base and query vectors, and SiftLoader.readIvecs to read the provided ground truth. The readIvecs function is currently quite inefficient because it does not buffer its reads, which disproportionately increases the time taken to load the Dataset.

This is not so important for Bench, where the time taken to load the dataset is insignificant compared to the time taken to build the index. However, this becomes quite important when running short-lived programs with pre-created graphs, especially during rapid prototyping.

This PR addresses this by adding a BufferedInputStream, similar to the current implementation of readFvecs.
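
As a rough illustration of the approach (not the actual diff), here is a minimal sketch of a buffered ivecs reader. It assumes the usual ivecs layout, where each record is a little-endian 32-bit dimension followed by that many little-endian 32-bit integers. `IvecsReader` and its `readIvecs` method are hypothetical stand-ins and may differ from the real SiftLoader code.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual SiftLoader.readIvecs.
final class IvecsReader {
    static List<int[]> readIvecs(String path) throws IOException {
        List<int[]> vectors = new ArrayList<>();
        // Wrapping the FileInputStream in a BufferedInputStream means each
        // readInt() is served from an in-memory buffer instead of issuing
        // its own small read against the file.
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            while (true) {
                int dimension;
                try {
                    // DataInputStream reads big-endian, so swap the bytes
                    // (assumption: the file is little-endian).
                    dimension = Integer.reverseBytes(in.readInt());
                } catch (EOFException e) {
                    break; // no more records
                }
                int[] vector = new int[dimension];
                for (int i = 0; i < dimension; i++) {
                    vector[i] = Integer.reverseBytes(in.readInt());
                }
                vectors.add(vector);
            }
        }
        return vectors;
    }
}
```

Calling `IvecsReader.readIvecs("gt.ivecs")` would return one `int[]` per query. Dropping the `BufferedInputStream` wrapper gives the unbuffered behaviour this PR moves away from.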

Some numbers from my machine, based on a dataset with ~2M base vectors and ~50K query vectors, illustrate the difference:

| File | Size | Contents | Time |
|---|---|---|---|
| base.fvecs | 964M | ~2M 128D fvecs | 2.6s |
| query.fvecs | 25M | ~50K 128D fvecs | 0.12s |
| gt.ivecs | 58M | ~50K 300D ivecs | 10.3s (unbuffered), 0.34s (buffered) |

Without buffering, reading the ground truth takes ~4x as long as reading the base vectors, even though the ground truth file is far smaller. With buffering, the ground truth is no longer the bottleneck.

@marianotepper marianotepper left a comment

LGTM

@MarkWolters MarkWolters left a comment

LGTM

@marianotepper marianotepper merged commit 42259e9 into datastax:main Jan 20, 2026