Faster bulk numeric reads from BufferedIndexInput #12453

original-brownbear · 2023-07-20T18:29:09Z

Reading ints, floats and longs one by one from a heap byte buffer, including doing our own bounds checks, is not very efficient. Instead, we can use the ability to view the byte buffer as a typed buffer and read in bulk, taking turns with one-off reads and refills where needed.
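To make the idea concrete, here is a minimal sketch of the two read styles on a plain `java.nio.ByteBuffer`. It is not the code from this PR: the class `BulkFloatReadSketch`, its helper `sample`, and the little-endian byte order are assumptions chosen for illustration, and refilling is left out entirely.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical illustration only; names are not taken from Lucene. */
final class BulkFloatReadSketch {

  /** One-by-one: every getFloat() call performs its own bounds check. */
  static void readOneByOne(ByteBuffer buffer, float[] dst, int offset, int len) {
    for (int i = 0; i < len; i++) {
      dst[offset + i] = buffer.getFloat();
    }
  }

  /** Bulk: view the remaining bytes as floats and copy them in one call. */
  static void readBulk(ByteBuffer buffer, float[] dst, int offset, int len) {
    buffer.asFloatBuffer().get(dst, offset, len);
    // The float view has its own position, so the byte buffer is advanced by hand.
    buffer.position(buffer.position() + len * Float.BYTES);
  }

  /** Builds a little-endian heap buffer holding count floats: 0.0, 0.5, 1.0, ... */
  static ByteBuffer sample(int count) {
    ByteBuffer b = ByteBuffer.allocate(count * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (int i = 0; i < count; i++) {
      b.putFloat(i * 0.5f);
    }
    b.flip();
    return b;
  }

  public static void main(String[] args) {
    float[] a = new float[4];
    float[] b = new float[4];
    readOneByOne(sample(4), a, 0, 4);
    readBulk(sample(4), b, 0, 4);
    System.out.println(Arrays.equals(a, b));
  }
}
```

Both methods leave the byte buffer at the same position, so a bulk read can be followed by ordinary one-off reads.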

stefanvodita

I left some nit-picky comments and some questions, but overall this looks like a good and useful change.

stefanvodita · 2023-07-22T12:39:58Z

lucene/core/src/java/org/apache/lucene/store/BufferedIndexInput.java

@@ -159,6 +159,63 @@ public final long readLong() throws IOException {
    }
  }

+  @Override
+  public void readFloats(float[] floats, int offset, int len) throws IOException {


Nit: Could we change the argument names to be consistent (dst, len)? I see the parent class, DataInput, has the same inconsistency.

stefanvodita · 2023-07-22T12:41:07Z

lucene/core/src/java/org/apache/lucene/store/BufferedIndexInput.java

@@ -159,6 +159,63 @@ public final long readLong() throws IOException {
    }
  }

+  @Override
+  public void readFloats(float[] floats, int offset, int len) throws IOException {
+    int remaining = len;


Nit: I found remaining a little confusing, just because the buffer also has a remaining() method. What if we called it remainingDst to better differentiate it from the buffer's remaining()?

stefanvodita · 2023-07-22T12:43:55Z

lucene/core/src/java/org/apache/lucene/store/BufferedIndexInput.java

+  }
+
+  @Override
+  public void readLongs(long[] dst, int offset, int length) throws IOException {


I'm wondering why we're following a different pattern with these methods than with readBytes. If we rewrote readBytes using this model, would it just work?

  @Override
  public final void readBytes(byte[] dst, int offset, int len) throws IOException {
    int remaining = len;
    while (remaining > 0) {
      int cnt = Math.min(buffer.remaining(), remaining);
      buffer.get(dst, offset + len - remaining, cnt);
      remaining -= cnt;
      if (remaining > 0) {
        if (buffer.hasRemaining()) {
          dst[offset + len - remaining] = readByte();
          --remaining;
        } else {
          refill();
        }
      }
    }
  }

I think it would come out to about that code if we inlined readBytes with useBuffer=true, except that

    if (buffer.hasRemaining()) {
      dst[offset + len - remaining] = readByte();
      --remaining;
    }

is never entered: hasRemaining is always false at that point, because with single bytes we can never have a partial value buffered, like 1 of the 4 bytes of an int.

-> So I think the only difference is that here we have to deal with alignment at the buffer boundary.
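The boundary case discussed above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the PR's code: the class `StraddlingReadSketch`, its `refill`, `readByte` and `readInt` helpers, the in-memory `source` array and the little-endian byte order are all hypothetical stand-ins. The loop copies whole floats in bulk and falls back to a one-off read when 1 to 3 bytes of a value are left in the buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical stand-in for a buffered input; not Lucene's BufferedIndexInput. */
final class StraddlingReadSketch {
  private final byte[] source; // plays the role of the underlying file
  private int sourcePos = 0;
  private final ByteBuffer buffer; // small heap buffer, assumed little-endian

  StraddlingReadSketch(byte[] source, int bufferSize) {
    this.source = source;
    this.buffer = ByteBuffer.allocate(bufferSize).order(ByteOrder.LITTLE_ENDIAN);
    this.buffer.limit(0); // start with nothing buffered
  }

  private void refill() {
    int n = Math.min(buffer.capacity(), source.length - sourcePos);
    if (n <= 0) {
      throw new IllegalStateException("read past end of source");
    }
    buffer.clear();
    buffer.put(source, sourcePos, n);
    buffer.flip();
    sourcePos += n;
  }

  private byte readByte() {
    if (!buffer.hasRemaining()) {
      refill();
    }
    return buffer.get();
  }

  /** One-off read: fast path if 4 bytes are buffered, else byte-wise across a refill. */
  private int readInt() {
    if (buffer.remaining() >= Integer.BYTES) {
      return buffer.getInt();
    }
    return (readByte() & 0xFF)
        | (readByte() & 0xFF) << 8
        | (readByte() & 0xFF) << 16
        | (readByte() & 0xFF) << 24;
  }

  void readFloats(float[] dst, int offset, int len) {
    int remainingDst = len;
    while (remainingDst > 0) {
      // Only whole floats can be copied in bulk; 1 to 3 trailing bytes may be left over.
      int cnt = Math.min(buffer.remaining() / Float.BYTES, remainingDst);
      buffer.asFloatBuffer().get(dst, offset + len - remainingDst, cnt);
      buffer.position(buffer.position() + cnt * Float.BYTES);
      remainingDst -= cnt;
      if (remainingDst > 0) {
        if (buffer.hasRemaining()) {
          // The next value straddles the buffer boundary: read it one-off.
          dst[offset + len - remainingDst] = Float.intBitsToFloat(readInt());
          --remainingDst;
        } else {
          refill();
        }
      }
    }
  }

  /** Encodes values little-endian, then reads them back through a buffer of the given size. */
  static float[] roundTrip(float[] values, int bufferSize, int offset) {
    ByteBuffer bytes =
        ByteBuffer.allocate(values.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) {
      bytes.putFloat(v);
    }
    float[] dst = new float[offset + values.length];
    new StraddlingReadSketch(bytes.array(), bufferSize).readFloats(dst, offset, values.length);
    return dst;
  }

  public static void main(String[] args) {
    // A 6-byte buffer holds one whole float plus half of the next one.
    float[] out = roundTrip(new float[] {1.5f, -2.25f, 3.0f}, 6, 1);
    System.out.println(java.util.Arrays.toString(out));
  }
}
```

With a byte-sized element, as in readBytes, the leftover is always zero, which is why the one-off branch only matters for the multi-byte variants.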

stefanvodita · 2023-07-22T13:01:55Z

lucene/core/src/java/org/apache/lucene/store/BufferedIndexInput.java

+      }
+    }
+  }
+


Should TestBufferedIndexInput have some basic tests for these new methods?

++ good call, I already had one bug that I only found when running benchmarks. Added a test for each of the three methods that should cover all the boundary/alignment/offset cases we can expect.

jpountz

This looks good to me. Can you add an entry to CHANGES.txt?

jpountz · 2023-07-24T08:59:31Z

lucene/core/src/test/org/apache/lucene/store/TestBufferedIndexInput.java

+              .put(byten(offset + 1))
+              .put(byten(offset + 2))
+              .put(byten(offset + 3));
+          assertEquals(bb.getFloat(0), floatBuffer[idx], 0f);


nit: would it make more sense to compare the int bits rather than the float itself, to make sure +/-0 or the different representations of NaN are considered different?

Right that's technically more exact I think :) Adjusted that and added changes.txt entry
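A short sketch of why the bit comparison is stricter. The class `FloatBitsCompareSketch` and its two helpers are hypothetical; `sameValue` is an assumption about how a delta-based float assertion with a delta of 0 behaves, not a copy of any test framework's code.

```java
/** Hypothetical illustration of value comparison versus bit comparison for floats. */
final class FloatBitsCompareSketch {

  /** Assumed model of a delta-based assertion called with delta = 0. */
  static boolean sameValue(float expected, float actual) {
    return Float.compare(expected, actual) == 0 || Math.abs(expected - actual) <= 0f;
  }

  /** Strict check: the two floats must have identical raw bit patterns. */
  static boolean sameBits(float expected, float actual) {
    return Float.floatToRawIntBits(expected) == Float.floatToRawIntBits(actual);
  }

  public static void main(String[] args) {
    // +0 and -0 have the same numeric value but differ in the sign bit.
    System.out.println(sameValue(0.0f, -0.0f));
    System.out.println(sameBits(0.0f, -0.0f));

    // Two NaNs with different payloads; Float.compare treats all NaNs as equal.
    float otherNaN = Float.intBitsToFloat(0x7fc00001);
    System.out.println(sameValue(Float.NaN, otherNaN));
    System.out.println(sameBits(Float.NaN, otherNaN));
  }
}
```

Comparing raw bits means a reader that flipped a sign bit on zero or altered a NaN payload would fail the test instead of passing silently.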

jpountz · 2023-07-24T12:42:06Z

lucene/CHANGES.txt

@@ -84,6 +84,8 @@ Optimizations

 * GITHUB#12372: Reduce allocation during HNSW construction (Jonathan Ellis)

+* GITHUB#12453: Faster bulk numeric reads from BufferedIndexInput (Armin Braun)


This is the section for the upcoming Lucene major (10.0). Your change looks safe to backport, so let's move it to the 9.8 section?

I pushed a change

Thanks Adrien!

original-brownbear added 3 commits July 20, 2023 20:27

Merge remote-tracking branch 'apache/main' into faster-bulk-nums-read…

851e5ef

…-bii

fix bug

29e0277

stefanvodita reviewed Jul 22, 2023

original-brownbear added 2 commits July 24, 2023 09:05

renamings

42ddb7d

tests

d09a0f8

original-brownbear requested a review from stefanvodita July 24, 2023 08:23

jpountz reviewed Jul 24, 2023

cmp int bits + changes.txt

5c8e77f

original-brownbear requested a review from jpountz July 24, 2023 09:40

jpountz approved these changes Jul 24, 2023

jpountz reviewed Jul 24, 2023

Move CHANGES to 9.8

2d6531b

jpountz merged commit 20e97fb into apache:main Jul 24, 2023
4 checks passed

original-brownbear deleted the faster-bulk-nums-read-bii branch July 25, 2023 08:23

zhaih added this to the 9.8.0 milestone Sep 20, 2023
