
Feature/speed up binary vector decoding #96716

Conversation

benwtrent
Member

Encoding floats in little endian format provides much faster decoding.

This commit makes all indices created on 8.9.0 or later store binary vectors in little-endian order.

closes: #96710
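As a rough illustration of why the little-endian layout decodes faster (a minimal sketch, not the exact code from this PR): a little-endian `byte[]` can be decoded in one bulk copy through a `FloatBuffer` view, instead of one `getFloat` call per dimension. The class and method names below are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class LittleEndianDecodeSketch {
    // Encode floats as little-endian bytes, as new 8.9.0+ indices do.
    static byte[] encodeLE(float[] vector) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[vector.length * Float.BYTES])
            .order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Bulk decode: a single FloatBuffer view copies all floats at once.
    static float[] decodeLE(byte[] bytes, int dims) {
        float[] vector = new float[dims];
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {1.5f, -2.25f, 3.75f};
        float[] decoded = decodeLE(encodeLE(original), original.length);
        System.out.println(Arrays.equals(original, decoded)); // prints "true"
    }
}
```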

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.9.0 labels Jun 8, 2023
@elasticsearchmachine
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Jun 8, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Member Author

Baseline is current main (with Panama Vector Module)
Contender is this change (again, with Panama Vector Module)

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                        Metric |                 Task |        Baseline |       Contender |       Diff |   Unit |   Diff % |
|--------------------------------------------------------------:|---------------------:|----------------:|----------------:|-----------:|-------:|---------:|
|                    Cumulative indexing time of primary shards |                      |     1.5163      |     1.53303     |    0.01673 |    min |   +1.10% |
|             Min cumulative indexing time across primary shard |                      |     0.758       |     0.765483    |    0.00748 |    min |   +0.99% |
|          Median cumulative indexing time across primary shard |                      |     0.75815     |     0.766517    |    0.00837 |    min |   +1.10% |
|             Max cumulative indexing time across primary shard |                      |     0.7583      |     0.76755     |    0.00925 |    min |   +1.22% |
|           Cumulative indexing throttle time of primary shards |                      |     0           |     0           |    0       |    min |    0.00% |
|    Min cumulative indexing throttle time across primary shard |                      |     0           |     0           |    0       |    min |    0.00% |
| Median cumulative indexing throttle time across primary shard |                      |     0           |     0           |    0       |    min |    0.00% |
|    Max cumulative indexing throttle time across primary shard |                      |     0           |     0           |    0       |    min |    0.00% |
|                       Cumulative merge time of primary shards |                      |     0.86665     |     0.871267    |    0.00462 |    min |   +0.53% |
|                      Cumulative merge count of primary shards |                      |     4           |     4           |    0       |        |    0.00% |
|                Min cumulative merge time across primary shard |                      |     0.423467    |     0.42825     |    0.00478 |    min |   +1.13% |
|             Median cumulative merge time across primary shard |                      |     0.433325    |     0.435633    |    0.00231 |    min |   +0.53% |
|                Max cumulative merge time across primary shard |                      |     0.443183    |     0.443017    |   -0.00017 |    min |   -0.04% |
|              Cumulative merge throttle time of primary shards |                      |     0.512683    |     0.527467    |    0.01478 |    min |   +2.88% |
|       Min cumulative merge throttle time across primary shard |                      |     0.256167    |     0.2612      |    0.00503 |    min |   +1.96% |
|    Median cumulative merge throttle time across primary shard |                      |     0.256342    |     0.263733    |    0.00739 |    min |   +2.88% |
|       Max cumulative merge throttle time across primary shard |                      |     0.256517    |     0.266267    |    0.00975 |    min |   +3.80% |
|                     Cumulative refresh time of primary shards |                      |     0.112983    |     0.1234      |    0.01042 |    min |   +9.22% |
|                    Cumulative refresh count of primary shards |                      |    46           |    46           |    0       |        |    0.00% |
|              Min cumulative refresh time across primary shard |                      |     0.0534667   |     0.0589833   |    0.00552 |    min |  +10.32% |
|           Median cumulative refresh time across primary shard |                      |     0.0564917   |     0.0617      |    0.00521 |    min |   +9.22% |
|              Max cumulative refresh time across primary shard |                      |     0.0595167   |     0.0644167   |    0.0049  |    min |   +8.23% |
|                       Cumulative flush time of primary shards |                      |     0.09235     |     0.0870667   |   -0.00528 |    min |   -5.72% |
|                      Cumulative flush count of primary shards |                      |     2           |     2           |    0       |        |    0.00% |
|                Min cumulative flush time across primary shard |                      |     0.0459      |     0.0435167   |   -0.00238 |    min |   -5.19% |
|             Median cumulative flush time across primary shard |                      |     0.046175    |     0.0435333   |   -0.00264 |    min |   -5.72% |
|                Max cumulative flush time across primary shard |                      |     0.04645     |     0.04355     |   -0.0029  |    min |   -6.24% |
|                                       Total Young Gen GC time |                      |     1.262       |     1.028       |   -0.234   |      s |  -18.54% |
|                                      Total Young Gen GC count |                      |    43           |    42           |   -1       |        |   -2.33% |
|                                         Total Old Gen GC time |                      |     0           |     0           |    0       |      s |    0.00% |
|                                        Total Old Gen GC count |                      |     0           |     0           |    0       |        |    0.00% |
|                                                    Store size |                      |     2.0004      |     2.02266     |    0.02226 |     GB |   +1.11% |
|                                                 Translog size |                      |     1.02445e-07 |     1.02445e-07 |    0       |     GB |    0.00% |
|                                        Heap used for segments |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                      Heap used for doc values |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                           Heap used for terms |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                           Heap used for norms |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                          Heap used for points |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                   Heap used for stored fields |                      |     0           |     0           |    0       |     MB |    0.00% |
|                                                 Segment count |                      |     2           |     2           |    0       |        |    0.00% |
|                                   Total Ingest Pipeline count |                      |     0           |     0           |    0       |        |    0.00% |
|                                    Total Ingest Pipeline time |                      |     0           |     0           |    0       |     ms |    0.00% |
|                                  Total Ingest Pipeline failed |                      |     0           |     0           |    0       |        |    0.00% |
|                                                Min Throughput |         index-append | 23251           | 23174.7         |  -76.2634  | docs/s |   -0.33% |
|                                               Mean Throughput |         index-append | 23737.5         | 23651.8         |  -85.6234  | docs/s |   -0.36% |
|                                             Median Throughput |         index-append | 23726.4         | 23608.6         | -117.861   | docs/s |   -0.50% |
|                                                Max Throughput |         index-append | 24177.5         | 24110.8         |  -66.6388  | docs/s |   -0.28% |
|                                       50th percentile latency |         index-append |   169.887       |   170.214       |    0.32644 |     ms |   +0.19% |
|                                       90th percentile latency |         index-append |   202.253       |   196.648       |   -5.60429 |     ms |   -2.77% |
|                                       99th percentile latency |         index-append |   280.865       |   233.765       |  -47.0994  |     ms |  -16.77% |
|                                      100th percentile latency |         index-append |   421.361       |   492.882       |   71.5206  |     ms |  +16.97% |
|                                  50th percentile service time |         index-append |   169.887       |   170.214       |    0.32644 |     ms |   +0.19% |
|                                  90th percentile service time |         index-append |   202.253       |   196.648       |   -5.60429 |     ms |   -2.77% |
|                                  99th percentile service time |         index-append |   280.865       |   233.765       |  -47.0994  |     ms |  -16.77% |
|                                 100th percentile service time |         index-append |   421.361       |   492.882       |   71.5206  |     ms |  +16.97% |
|                                                    error rate |         index-append |     0           |     0           |    0       |      % |    0.00% |
|                                                Min Throughput |  refresh-after-index |     0.483521    |     0.48829     |    0.00477 |  ops/s |   +0.99% |
|                                               Mean Throughput |  refresh-after-index |     0.483521    |     0.48829     |    0.00477 |  ops/s |   +0.99% |
|                                             Median Throughput |  refresh-after-index |     0.483521    |     0.48829     |    0.00477 |  ops/s |   +0.99% |
|                                                Max Throughput |  refresh-after-index |     0.483521    |     0.48829     |    0.00477 |  ops/s |   +0.99% |
|                                      100th percentile latency |  refresh-after-index |  2065           |  2044.74        |  -20.2556  |     ms |   -0.98% |
|                                 100th percentile service time |  refresh-after-index |  2065           |  2044.74        |  -20.2556  |     ms |   -0.98% |
|                                                    error rate |  refresh-after-index |     0           |     0           |    0       |      % |    0.00% |
|                                                Min Throughput | refresh-after-update |    95.1046      |   255.641       |  160.537   |  ops/s | +168.80% |
|                                               Mean Throughput | refresh-after-update |    95.1046      |   255.641       |  160.537   |  ops/s | +168.80% |
|                                             Median Throughput | refresh-after-update |    95.1046      |   255.641       |  160.537   |  ops/s | +168.80% |
|                                                Max Throughput | refresh-after-update |    95.1046      |   255.641       |  160.537   |  ops/s | +168.80% |
|                                      100th percentile latency | refresh-after-update |     8.25192     |     2.64692     |   -5.605   |     ms |  -67.92% |
|                                 100th percentile service time | refresh-after-update |     8.25192     |     2.64692     |   -5.605   |     ms |  -67.92% |
|                                                    error rate | refresh-after-update |     0           |     0           |    0       |      % |    0.00% |
|                                                Min Throughput |          force-merge |     0.0622565   |     0.0611523   |   -0.0011  |  ops/s |   -1.77% |
|                                               Mean Throughput |          force-merge |     0.0622565   |     0.0611523   |   -0.0011  |  ops/s |   -1.77% |
|                                             Median Throughput |          force-merge |     0.0622565   |     0.0611523   |   -0.0011  |  ops/s |   -1.77% |
|                                                Max Throughput |          force-merge |     0.0622565   |     0.0611523   |   -0.0011  |  ops/s |   -1.77% |
|                                      100th percentile latency |          force-merge | 16060.1         | 16350.3         |  290.176   |     ms |   +1.81% |
|                                 100th percentile service time |          force-merge | 16060.1         | 16350.3         |  290.176   |     ms |   +1.81% |
|                                                    error rate |          force-merge |     0           |     0           |    0       |      % |    0.00% |
|                                                Min Throughput |   script-score-query |    11.0361      |    12.582       |    1.54594 |  ops/s |  +14.01% |
|                                               Mean Throughput |   script-score-query |    12.5429      |    14.8063      |    2.26341 |  ops/s |  +18.05% |
|                                             Median Throughput |   script-score-query |    12.6932      |    15.085       |    2.39188 |  ops/s |  +18.84% |
|                                                Max Throughput |   script-score-query |    12.9486      |    15.5165      |    2.56792 |  ops/s |  +19.83% |
|                                       50th percentile latency |   script-score-query |    74.52        |    62.3899      |  -12.1301  |     ms |  -16.28% |
|                                       90th percentile latency |   script-score-query |    76.4475      |    65.0979      |  -11.3497  |     ms |  -14.85% |
|                                       99th percentile latency |   script-score-query |    89.8932      |    66.5804      |  -23.3128  |     ms |  -25.93% |
|                                     99.9th percentile latency |   script-score-query |   132.404       |    67.2787      |  -65.1252  |     ms |  -49.19% |
|                                      100th percentile latency |   script-score-query |   140.169       |    83.8935      |  -56.2751  |     ms |  -40.15% |
|                                  50th percentile service time |   script-score-query |    74.52        |    62.3899      |  -12.1301  |     ms |  -16.28% |
|                                  90th percentile service time |   script-score-query |    76.4475      |    65.0979      |  -11.3497  |     ms |  -14.85% |
|                                  99th percentile service time |   script-score-query |    89.8932      |    66.5804      |  -23.3128  |     ms |  -25.93% |
|                                99.9th percentile service time |   script-score-query |   132.404       |    67.2787      |  -65.1252  |     ms |  -49.19% |
|                                 100th percentile service time |   script-score-query |   140.169       |    83.8935      |  -56.2751  |     ms |  -40.15% |
|                                                    error rate |   script-score-query |     0           |     0           |    0       |      % |    0.00% |

@benwtrent
Member Author

@elasticmachine update branch

@benwtrent benwtrent requested a review from jdconrad June 12, 2023 11:53
Contributor

@jdconrad jdconrad left a comment


Cool change! I had a few questions, but otherwise LGTM.

@@ -64,6 +65,8 @@
 * A {@link FieldMapper} for indexing a dense vector of floats.
 */
public class DenseVectorFieldMapper extends FieldMapper {
    public static final Version MAGNITUDE_STORED_INDEX_VERSION = Version.V_7_5_0;
    public static final Version LITTLE_ENDIAN_FLOAT_STORED_INDEX_VERSION = Version.V_8_9_0;
Contributor


Should this be the last version prior to your change's version using the new TransportVersion constants?

Member Author


@jdconrad TransportVersion is a transport/wire serialization thing; this is an index version thing. From my understanding, index versioning is handled separately. I will see what I can find.

Member Author


I asked @thecoop, and just using Version here is OK. We may need to update it to IndexVersion depending on which commits make it in first :)

@@ -890,18 +907,18 @@ private Field parseKnnVector(DocumentParserContext context) throws IOException {
private Field parseBinaryDocValuesVector(DocumentParserContext context) throws IOException {
// encode array of floats as array of integers and store into buf
// this code is here and not int the VectorEncoderDecoder so not to create extra arrays
Contributor


Not related to your change, but would you mind fixing the "not int" -> "not in" typo in that comment?

Comment on lines +78 to +86
    FloatBuffer fb = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer();
    fb.get(vector);
} else {
    ByteBuffer byteBuffer = ByteBuffer.wrap(vectorBR.bytes, vectorBR.offset, vectorBR.length);
    for (int dim = 0; dim < vector.length; dim++) {
        vector[dim] = byteBuffer.getFloat((dim * Float.BYTES) + vectorBR.offset);
    }
Contributor


Could .asFloatBuffer() not be used for both little and big endian?

Member Author


.asFloatBuffer() is marginally slower for BE. These implementations are the fastest I could get them.
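For context, a minimal sketch of the two big-endian decode options under discussion (per-element absolute `getFloat` vs an `asFloatBuffer` view). Both produce identical results; the difference is only performance. The class and method names are illustrative, not from the PR.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BigEndianDecodeSketch {
    // Per-element absolute reads (the path this PR keeps for pre-8.9.0 indices).
    static float[] decodePerElement(byte[] bytes, int dims) {
        float[] vector = new float[dims];
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        for (int dim = 0; dim < dims; dim++) {
            vector[dim] = buf.getFloat(dim * Float.BYTES);
        }
        return vector;
    }

    // Alternative: bulk copy through a FloatBuffer view
    // (ByteBuffer's default byte order is big-endian).
    static float[] decodeViaFloatBuffer(byte[] bytes, int dims) {
        float[] vector = new float[dims];
        ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
        return vector;
    }

    public static void main(String[] args) {
        byte[] bytes = ByteBuffer.allocate(2 * Float.BYTES).putFloat(1.0f).putFloat(-2.0f).array();
        System.out.println(Arrays.equals(decodePerElement(bytes, 2), decodeViaFloatBuffer(bytes, 2))); // prints "true"
    }
}
```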

@benwtrent benwtrent added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jun 13, 2023
Contributor

@ChrisHegarty ChrisHegarty left a comment


LGTM. 👍

    return indexVersion.onOrAfter(LITTLE_ENDIAN_FLOAT_STORED_INDEX_VERSION)
        ? ByteBuffer.wrap(new byte[numBytes]).order(ByteOrder.LITTLE_ENDIAN)
        : ByteBuffer.wrap(new byte[numBytes]);
}
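A standalone sketch of this version-gated buffer selection, with the index-version check replaced by a plain boolean so it can run outside Elasticsearch (the class and `makeBuffer` name are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VersionGatedEncodeSketch {
    // Stand-in for indexVersion.onOrAfter(LITTLE_ENDIAN_FLOAT_STORED_INDEX_VERSION):
    // true for indices created on 8.9.0+, false for older indices.
    static ByteBuffer makeBuffer(int numBytes, boolean littleEndianIndexVersion) {
        return littleEndianIndexVersion
            ? ByteBuffer.wrap(new byte[numBytes]).order(ByteOrder.LITTLE_ENDIAN)
            : ByteBuffer.wrap(new byte[numBytes]); // ByteBuffer defaults to big-endian
    }

    public static void main(String[] args) {
        System.out.println(makeBuffer(8, true).order());  // prints "LITTLE_ENDIAN"
        System.out.println(makeBuffer(8, false).order()); // prints "BIG_ENDIAN"
    }
}
```

Keying the byte order on the index creation version keeps old segments readable while letting new segments use the faster layout.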
Contributor


👍

@elasticsearchmachine elasticsearchmachine merged commit 5d93a42 into elastic:main Jun 13, 2023
12 checks passed
@benwtrent benwtrent deleted the feature/speed-up-binary-vector-decoding branch June 13, 2023 13:45
HiDAl pushed a commit to HiDAl/elasticsearch that referenced this pull request Jun 14, 2023
Successfully merging this pull request may close these issues.

Improve KNN Bruteforce float decoding