Add indexing with concurrent searches #267

Merged

Contributor

mayya-sharipova commented Apr 11, 2022

A very common use case is to run indexing at the same time as searches.
This patch addresses this use case.
After the initial indexing, we add another operation that
adds more documents (duplicate documents)
while concurrently running kNN searches on the index.

We first index 50% of the data. While indexing the rest of the data, we
also run concurrent searches.
"operation": {
"operation-type": "bulk",
"bulk-size": {{bulk_size | default(5000)}},
"ingest-percentage": 10
Contributor Author

Ideally, I wanted to say "index half of the data" in `ingest-percentage` here, but I did not know how to do that.
In our nightly benchmarks we use `ingest-percentage: 20`, so for now I've put half of that: 10.

Contributor

I did this same thing in #195, and it turns out this will ingest the same 10% as the index-append task. FYI, in case you need to write net-new docs for this challenge.
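
The overlap described here can be pictured with a toy model. It assumes, as this comment reports, that each bulk task starts reading the corpus from the first document, so `ingest-percentage` selects a prefix of the data rather than the next unread slice. The helper `docs_for_task` and the 100-document corpus are hypothetical and only illustrate the behaviour; this is not Rally code.

```python
def docs_for_task(corpus, ingest_percentage):
    """Return the documents a bulk task would send in this toy model.

    Assumption: every task reads from the start of the corpus, so the
    percentage selects a prefix of the data.
    """
    count = len(corpus) * ingest_percentage // 100
    return corpus[:count]


# Hypothetical corpus of 100 document ids.
corpus = list(range(100))

initial = docs_for_task(corpus, 10)     # stands in for the index-append task
concurrent = docs_for_task(corpus, 10)  # second bulk task with the same setting

# Both tasks select the same prefix, so the second one re-indexes duplicates.
overlap = set(initial) & set(concurrent)
print(len(overlap), "of", len(concurrent), "documents are duplicates")
```

Under this assumption, repeating the same percentage never reaches documents beyond the first slice, which is why the second task would not write net-new docs.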

Contributor Author

@DJRickyB Thanks for your comment. I did not know that. We want to index net-new documents in this part. What's the way to tell it to index the next 10%?

Contributor

There is not one, unfortunately. One (maybe overly simplistic) way to reach some amount of data density before you start "counting" query performance would be to take out the initial index-append task and then give every task in the parallel block the same warmup-time-period. This would give the queries a minimum data density before the measured iterations, but it would drop the in-a-vacuum indexing measurement, and you'd have to get that from running a separate challenge. I understand this approach is a bit sloppy; maybe we can chat out-of-band about how to accomplish what you need, or else enhance Rally to support what you're looking for here.
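
The suggestion above could be sketched roughly as follows. The snippet builds the challenge schedule as a Python dictionary and prints it as JSON. `parallel`, `tasks`, `clients` and `warmup-time-period` are Rally schedule properties; the 300-second warmup, the client counts and the operation names are assumptions chosen for illustration, and the properties that control how long each task is measured are left out. It has not been run against Rally.

```python
import json

# Assumption: a warmup long enough to reach the desired data density.
WARMUP_SECONDS = 300

# Assumed operation names, borrowed from the task names in this track's report.
SEARCH_OPERATIONS = ["knn-search-10-50", "knn-search-10-100", "knn-search-100-1000"]


def build_parallel_block(warmup_seconds):
    """Build a parallel element whose tasks all share one warmup period."""
    indexing = {
        "name": "index-append-concurrent-with-searches",
        "operation": {"operation-type": "bulk", "bulk-size": 5000},
        "warmup-time-period": warmup_seconds,
        "clients": 1,  # illustrative client count
    }
    searches = [
        {
            "name": f"{op}-concurrent-with-indexing",
            "operation": op,
            "warmup-time-period": warmup_seconds,
            "clients": 1,  # illustrative client count
        }
        for op in SEARCH_OPERATIONS
    ]
    return {"parallel": {"tasks": [indexing] + searches}}


# No standalone index-append task precedes the parallel block in this sketch.
schedule = [build_parallel_block(WARMUP_SECONDS)]
print(json.dumps(schedule, indent=2))
```

Because every task carries the same warmup, the indexing done during that window is what gives the queries their minimum data density before measurement starts.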

@mayya-sharipova
Contributor Author

Results from my laptop:
  
|                                                         Metric |                                         Task |            Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------------:|-----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                              |      2.94898     |    min |
|             Min cumulative indexing time across primary shards |                                              |      1.47418     |    min |
|          Median cumulative indexing time across primary shards |                                              |      1.47449     |    min |
|             Max cumulative indexing time across primary shards |                                              |      1.4748      |    min |
|            Cumulative indexing throttle time of primary shards |                                              |      0.0527333   |    min |
|    Min cumulative indexing throttle time across primary shards |                                              |      0           |    min |
| Median cumulative indexing throttle time across primary shards |                                              |      0.0263667   |    min |
|    Max cumulative indexing throttle time across primary shards |                                              |      0.0527333   |    min |
|                        Cumulative merge time of primary shards |                                              |     13.4356      |    min |
|                       Cumulative merge count of primary shards |                                              |      2           |        |
|                Min cumulative merge time across primary shards |                                              |      6.67385     |    min |
|             Median cumulative merge time across primary shards |                                              |      6.71778     |    min |
|                Max cumulative merge time across primary shards |                                              |      6.76172     |    min |
|               Cumulative merge throttle time of primary shards |                                              |      0           |    min |
|       Min cumulative merge throttle time across primary shards |                                              |      0           |    min |
|    Median cumulative merge throttle time across primary shards |                                              |      0           |    min |
|       Max cumulative merge throttle time across primary shards |                                              |      0           |    min |
|                      Cumulative refresh time of primary shards |                                              |     11.2407      |    min |
|                     Cumulative refresh count of primary shards |                                              |     29           |        |
|              Min cumulative refresh time across primary shards |                                              |      5.29037     |    min |
|           Median cumulative refresh time across primary shards |                                              |      5.62037     |    min |
|              Max cumulative refresh time across primary shards |                                              |      5.95037     |    min |
|                        Cumulative flush time of primary shards |                                              |     12.6752      |    min |
|                       Cumulative flush count of primary shards |                                              |      6           |        |
|                Min cumulative flush time across primary shards |                                              |      6.18308     |    min |
|             Median cumulative flush time across primary shards |                                              |      6.33758     |    min |
|                Max cumulative flush time across primary shards |                                              |      6.49207     |    min |
|                                        Total Young Gen GC time |                                              |      0.77        |      s |
|                                       Total Young Gen GC count |                                              |     56           |        |
|                                          Total Old Gen GC time |                                              |      0           |      s |
|                                         Total Old Gen GC count |                                              |      0           |        |
|                                                     Store size |                                              |      0.995951    |     GB |
|                                                  Translog size |                                              |      1.02445e-07 |     GB |
|                                         Heap used for segments |                                              |      0           |     MB |
|                                       Heap used for doc values |                                              |      0           |     MB |
|                                            Heap used for terms |                                              |      0           |     MB |
|                                            Heap used for norms |                                              |      0           |     MB |
|                                           Heap used for points |                                              |      0           |     MB |
|                                    Heap used for stored fields |                                              |      0           |     MB |
|                                                  Segment count |                                              |      2           |        |
|                                    Total Ingest Pipeline count |                                              |      0           |        |
|                                     Total Ingest Pipeline time |                                              |      0           |      s |
|                                   Total Ingest Pipeline failed |                                              |      0           |        |
|                                                 Min Throughput |                                 index-append |  14125.4         | docs/s |
|                                                Mean Throughput |                                 index-append |  14611.7         | docs/s |
|                                              Median Throughput |                                 index-append |  14675.5         | docs/s |
|                                                 Max Throughput |                                 index-append |  14858.1         | docs/s |
|                                        50th percentile latency |                                 index-append |    273.153       |     ms |
|                                        90th percentile latency |                                 index-append |    383.032       |     ms |
|                                       100th percentile latency |                                 index-append |    625.663       |     ms |
|                                   50th percentile service time |                                 index-append |    273.153       |     ms |
|                                   90th percentile service time |                                 index-append |    383.032       |     ms |
|                                  100th percentile service time |                                 index-append |    625.663       |     ms |
|                                                     error rate |                                 index-append |      0           |      % |
|                                                 Min Throughput |        index-append-concurrent-with-searches |  13538.7         | docs/s |
|                                                Mean Throughput |        index-append-concurrent-with-searches |  13647.9         | docs/s |
|                                              Median Throughput |        index-append-concurrent-with-searches |  13636           | docs/s |
|                                                 Max Throughput |        index-append-concurrent-with-searches |  13873.4         | docs/s |
|                                        50th percentile latency |        index-append-concurrent-with-searches |    279.978       |     ms |
|                                        90th percentile latency |        index-append-concurrent-with-searches |    604.145       |     ms |
|                                       100th percentile latency |        index-append-concurrent-with-searches |    922.579       |     ms |
|                                   50th percentile service time |        index-append-concurrent-with-searches |    279.978       |     ms |
|                                   90th percentile service time |        index-append-concurrent-with-searches |    604.145       |     ms |
|                                  100th percentile service time |        index-append-concurrent-with-searches |    922.579       |     ms |
|                                                     error rate |        index-append-concurrent-with-searches |      0           |      % |
|                                                 Min Throughput |    knn-search-10-50-concurrent-with-indexing |      0.17        |  ops/s |
|                                                Mean Throughput |    knn-search-10-50-concurrent-with-indexing |      0.24        |  ops/s |
|                                              Median Throughput |    knn-search-10-50-concurrent-with-indexing |      0.24        |  ops/s |
|                                                 Max Throughput |    knn-search-10-50-concurrent-with-indexing |      0.31        |  ops/s |
|                                        50th percentile latency |    knn-search-10-50-concurrent-with-indexing |     12.8949      |     ms |
|                                        90th percentile latency |    knn-search-10-50-concurrent-with-indexing |     29.3222      |     ms |
|                                        99th percentile latency |    knn-search-10-50-concurrent-with-indexing |     45.171       |     ms |
|                                       100th percentile latency |    knn-search-10-50-concurrent-with-indexing |     45.7837      |     ms |
|                                   50th percentile service time |    knn-search-10-50-concurrent-with-indexing |     12.8949      |     ms |
|                                   90th percentile service time |    knn-search-10-50-concurrent-with-indexing |     29.3222      |     ms |
|                                   99th percentile service time |    knn-search-10-50-concurrent-with-indexing |     45.171       |     ms |
|                                  100th percentile service time |    knn-search-10-50-concurrent-with-indexing |     45.7837      |     ms |
|                                                     error rate |    knn-search-10-50-concurrent-with-indexing |      0           |      % |
|                                                 Min Throughput |   knn-search-10-100-concurrent-with-indexing |      0.17        |  ops/s |
|                                                Mean Throughput |   knn-search-10-100-concurrent-with-indexing |      0.24        |  ops/s |
|                                              Median Throughput |   knn-search-10-100-concurrent-with-indexing |      0.24        |  ops/s |
|                                                 Max Throughput |   knn-search-10-100-concurrent-with-indexing |      0.31        |  ops/s |
|                                        50th percentile latency |   knn-search-10-100-concurrent-with-indexing |     12.9775      |     ms |
|                                        90th percentile latency |   knn-search-10-100-concurrent-with-indexing |     25.315       |     ms |
|                                        99th percentile latency |   knn-search-10-100-concurrent-with-indexing |     38.458       |     ms |
|                                       100th percentile latency |   knn-search-10-100-concurrent-with-indexing |     45.6029      |     ms |
|                                   50th percentile service time |   knn-search-10-100-concurrent-with-indexing |     12.9775      |     ms |
|                                   90th percentile service time |   knn-search-10-100-concurrent-with-indexing |     25.315       |     ms |
|                                   99th percentile service time |   knn-search-10-100-concurrent-with-indexing |     38.458       |     ms |
|                                  100th percentile service time |   knn-search-10-100-concurrent-with-indexing |     45.6029      |     ms |
|                                                     error rate |   knn-search-10-100-concurrent-with-indexing |      0           |      % |
|                                                 Min Throughput | knn-search-100-1000-concurrent-with-indexing |      0.14        |  ops/s |
|                                                Mean Throughput | knn-search-100-1000-concurrent-with-indexing |      0.23        |  ops/s |
|                                              Median Throughput | knn-search-100-1000-concurrent-with-indexing |      0.23        |  ops/s |
|                                                 Max Throughput | knn-search-100-1000-concurrent-with-indexing |      0.33        |  ops/s |
|                                        50th percentile latency | knn-search-100-1000-concurrent-with-indexing |     23.4112      |     ms |
|                                        90th percentile latency | knn-search-100-1000-concurrent-with-indexing |     31.9343      |     ms |
|                                        99th percentile latency | knn-search-100-1000-concurrent-with-indexing |     53.8614      |     ms |
|                                       100th percentile latency | knn-search-100-1000-concurrent-with-indexing |     57.3495      |     ms |
|                                   50th percentile service time | knn-search-100-1000-concurrent-with-indexing |     23.4112      |     ms |
|                                   90th percentile service time | knn-search-100-1000-concurrent-with-indexing |     31.9343      |     ms |
|                                   99th percentile service time | knn-search-100-1000-concurrent-with-indexing |     53.8614      |     ms |
|                                  100th percentile service time | knn-search-100-1000-concurrent-with-indexing |     57.3495      |     ms |
|                                                     error rate | knn-search-100-1000-concurrent-with-indexing |      0           |      % |
|                                                 Min Throughput |          knn-search-10-50-before-force-merge |    285.75        |  ops/s |
|                                                Mean Throughput |          knn-search-10-50-before-force-merge |    285.75        |  ops/s |
|                                              Median Throughput |          knn-search-10-50-before-force-merge |    285.75        |  ops/s |
|                                                 Max Throughput |          knn-search-10-50-before-force-merge |    285.75        |  ops/s |
|                                        50th percentile latency |          knn-search-10-50-before-force-merge |      2.62398     |     ms |
|                                        90th percentile latency |          knn-search-10-50-before-force-merge |      3.49838     |     ms |
|                                        99th percentile latency |          knn-search-10-50-before-force-merge |      4.83488     |     ms |
|                                       100th percentile latency |          knn-search-10-50-before-force-merge |      4.84396     |     ms |
|                                   50th percentile service time |          knn-search-10-50-before-force-merge |      2.62398     |     ms |
|                                   90th percentile service time |          knn-search-10-50-before-force-merge |      3.49838     |     ms |
|                                   99th percentile service time |          knn-search-10-50-before-force-merge |      4.83488     |     ms |
|                                  100th percentile service time |          knn-search-10-50-before-force-merge |      4.84396     |     ms |
|                                                     error rate |          knn-search-10-50-before-force-merge |      0           |      % |
|                                                 Min Throughput |         knn-search-10-100-before-force-merge |    247.8         |  ops/s |
|                                                Mean Throughput |         knn-search-10-100-before-force-merge |    247.8         |  ops/s |
|                                              Median Throughput |         knn-search-10-100-before-force-merge |    247.8         |  ops/s |
|                                                 Max Throughput |         knn-search-10-100-before-force-merge |    247.8         |  ops/s |
|                                        50th percentile latency |         knn-search-10-100-before-force-merge |      3.29792     |     ms |
|                                        90th percentile latency |         knn-search-10-100-before-force-merge |      4.2637      |     ms |
|                                        99th percentile latency |         knn-search-10-100-before-force-merge |      6.67976     |     ms |
|                                       100th percentile latency |         knn-search-10-100-before-force-merge |      6.83308     |     ms |
|                                   50th percentile service time |         knn-search-10-100-before-force-merge |      3.29792     |     ms |
|                                   90th percentile service time |         knn-search-10-100-before-force-merge |      4.2637      |     ms |
|                                   99th percentile service time |         knn-search-10-100-before-force-merge |      6.67976     |     ms |
|                                  100th percentile service time |         knn-search-10-100-before-force-merge |      6.83308     |     ms |
|                                                     error rate |         knn-search-10-100-before-force-merge |      0           |      % |
|                                                 Min Throughput |       knn-search-100-1000-before-force-merge |     43.59        |  ops/s |
|                                                Mean Throughput |       knn-search-100-1000-before-force-merge |     43.68        |  ops/s |
|                                              Median Throughput |       knn-search-100-1000-before-force-merge |     43.68        |  ops/s |
|                                                 Max Throughput |       knn-search-100-1000-before-force-merge |     43.78        |  ops/s |
|                                        50th percentile latency |       knn-search-100-1000-before-force-merge |     22.0119      |     ms |
|                                        90th percentile latency |       knn-search-100-1000-before-force-merge |     23.5894      |     ms |
|                                        99th percentile latency |       knn-search-100-1000-before-force-merge |     24.9529      |     ms |
|                                       100th percentile latency |       knn-search-100-1000-before-force-merge |     25.3105      |     ms |
|                                   50th percentile service time |       knn-search-100-1000-before-force-merge |     22.0119      |     ms |
|                                   90th percentile service time |       knn-search-100-1000-before-force-merge |     23.5894      |     ms |
|                                   99th percentile service time |       knn-search-100-1000-before-force-merge |     24.9529      |     ms |
|                                  100th percentile service time |       knn-search-100-1000-before-force-merge |     25.3105      |     ms |
|                                                     error rate |       knn-search-100-1000-before-force-merge |      0           |      % |
|                                                 Min Throughput |        script-score-query-before-force-merge |      5.5         |  ops/s |
|                                                Mean Throughput |        script-score-query-before-force-merge |      5.6         |  ops/s |
|                                              Median Throughput |        script-score-query-before-force-merge |      5.61        |  ops/s |
|                                                 Max Throughput |        script-score-query-before-force-merge |      5.67        |  ops/s |
|                                        50th percentile latency |        script-score-query-before-force-merge |    169.724       |     ms |
|                                        90th percentile latency |        script-score-query-before-force-merge |    170.824       |     ms |
|                                        99th percentile latency |        script-score-query-before-force-merge |    175.088       |     ms |
|                                       100th percentile latency |        script-score-query-before-force-merge |    183.441       |     ms |
|                                   50th percentile service time |        script-score-query-before-force-merge |    169.724       |     ms |
|                                   90th percentile service time |        script-score-query-before-force-merge |    170.824       |     ms |
|                                   99th percentile service time |        script-score-query-before-force-merge |    175.088       |     ms |
|                                  100th percentile service time |        script-score-query-before-force-merge |    183.441       |     ms |
|                                                     error rate |        script-score-query-before-force-merge |      0           |      % |
|                                                 Min Throughput |                                  force-merge |      0           |  ops/s |
|                                                Mean Throughput |                                  force-merge |      0           |  ops/s |
|                                              Median Throughput |                                  force-merge |      0           |  ops/s |
|                                                 Max Throughput |                                  force-merge |      0           |  ops/s |
|                                       100th percentile latency |                                  force-merge | 806615           |     ms |
|                                  100th percentile service time |                                  force-merge | 806615           |     ms |
|                                                     error rate |                                  force-merge |      0           |      % |
|                                                 Min Throughput |                             knn-search-10-50 |    213.35        |  ops/s |
|                                                Mean Throughput |                             knn-search-10-50 |    213.35        |  ops/s |
|                                              Median Throughput |                             knn-search-10-50 |    213.35        |  ops/s |
|                                                 Max Throughput |                             knn-search-10-50 |    213.35        |  ops/s |
|                                        50th percentile latency |                             knn-search-10-50 |      2.09358     |     ms |
|                                        90th percentile latency |                             knn-search-10-50 |      5.94023     |     ms |
|                                        99th percentile latency |                             knn-search-10-50 |      8.52969     |     ms |
|                                       100th percentile latency |                             knn-search-10-50 |      8.84533     |     ms |
|                                   50th percentile service time |                             knn-search-10-50 |      2.09358     |     ms |
|                                   90th percentile service time |                             knn-search-10-50 |      5.94023     |     ms |
|                                   99th percentile service time |                             knn-search-10-50 |      8.52969     |     ms |
|                                  100th percentile service time |                             knn-search-10-50 |      8.84533     |     ms |
|                                                     error rate |                             knn-search-10-50 |      0           |      % |
|                                                 Min Throughput |                            knn-search-10-100 |    402.07        |  ops/s |
|                                                Mean Throughput |                            knn-search-10-100 |    402.07        |  ops/s |
|                                              Median Throughput |                            knn-search-10-100 |    402.07        |  ops/s |
|                                                 Max Throughput |                            knn-search-10-100 |    402.07        |  ops/s |
|                                        50th percentile latency |                            knn-search-10-100 |      1.73748     |     ms |
|                                        90th percentile latency |                            knn-search-10-100 |      2.76045     |     ms |
|                                        99th percentile latency |                            knn-search-10-100 |      4.35762     |     ms |
|                                       100th percentile latency |                            knn-search-10-100 |      4.81538     |     ms |
|                                   50th percentile service time |                            knn-search-10-100 |      1.73748     |     ms |
|                                   90th percentile service time |                            knn-search-10-100 |      2.76045     |     ms |
|                                   99th percentile service time |                            knn-search-10-100 |      4.35762     |     ms |
|                                  100th percentile service time |                            knn-search-10-100 |      4.81538     |     ms |
|                                                     error rate |                            knn-search-10-100 |      0           |      % |
|                                                 Min Throughput |                          knn-search-100-1000 |    129.23        |  ops/s |
|                                                Mean Throughput |                          knn-search-100-1000 |    129.23        |  ops/s |
|                                              Median Throughput |                          knn-search-100-1000 |    129.23        |  ops/s |
|                                                 Max Throughput |                          knn-search-100-1000 |    129.23        |  ops/s |
|                                        50th percentile latency |                          knn-search-100-1000 |      5.04306     |     ms |
|                                        90th percentile latency |                          knn-search-100-1000 |      5.96235     |     ms |
|                                        99th percentile latency |                          knn-search-100-1000 |      9.07404     |     ms |
|                                       100th percentile latency |                          knn-search-100-1000 |      9.10671     |     ms |
|                                   50th percentile service time |                          knn-search-100-1000 |      5.04306     |     ms |
|                                   90th percentile service time |                          knn-search-100-1000 |      5.96235     |     ms |
|                                   99th percentile service time |                          knn-search-100-1000 |      9.07404     |     ms |
|                                  100th percentile service time |                          knn-search-100-1000 |      9.10671     |     ms |
|                                                     error rate |                          knn-search-100-1000 |      0           |      % |
|                                                 Min Throughput |                           script-score-query |      5.81        |  ops/s |
|                                                Mean Throughput |                           script-score-query |      5.83        |  ops/s |
|                                              Median Throughput |                           script-score-query |      5.83        |  ops/s |
|                                                 Max Throughput |                           script-score-query |      5.85        |  ops/s |
|                                        50th percentile latency |                           script-score-query |    168.784       |     ms |
|                                        90th percentile latency |                           script-score-query |    170.536       |     ms |
|                                        99th percentile latency |                           script-score-query |    174.288       |     ms |
|                                       100th percentile latency |                           script-score-query |    174.321       |     ms |
|                                   50th percentile service time |                           script-score-query |    168.784       |     ms |
|                                   90th percentile service time |                           script-score-query |    170.536       |     ms |
|                                   99th percentile service time |                           script-score-query |    174.288       |     ms |
|                                  100th percentile service time |                           script-score-query |    174.321       |     ms |
|                                                     error rate |                           script-score-query |      0           |      % |


----------------------------------
[INFO] SUCCESS (took 3110 seconds)
----------------------------------

@mayya-sharipova
Contributor Author

mayya-sharipova commented Apr 11, 2022

Some notable observations:

  • Indexing with concurrent searches is 1.5-2 times slower
|  90th percentile service time |                                 index-append |    383.032       |     ms |
| 100th percentile service time |                                 index-append |    625.663       |     ms |
|  90th percentile service time |        index-append-concurrent-with-searches |    604.145       |     ms |
| 100th percentile service time |        index-append-concurrent-with-searches |    922.579       |     ms |
  • Searches with concurrent indexing are 5-10 times slower
| 90th percentile service time |    knn-search-10-50-concurrent-with-indexing |     29.3222      |     ms |
|100th percentile service time |    knn-search-10-50-concurrent-with-indexing |     45.7837      |     ms |

| 90th percentile service time |   knn-search-10-100-concurrent-with-indexing |     25.315       |     ms |
|100th percentile service time |   knn-search-10-100-concurrent-with-indexing |     45.6029      |     ms |

| 90th percentile service time | knn-search-100-1000-concurrent-with-indexing |     31.9343      |     ms |
|100th percentile service time | knn-search-100-1000-concurrent-with-indexing |     57.3495      |     ms |

---

| 90th percentile service time |          knn-search-10-50-before-force-merge |      3.49838     |     ms |
|100th percentile service time |          knn-search-10-50-before-force-merge |      4.84396     |     ms |

| 90th percentile service time |         knn-search-10-100-before-force-merge |      4.2637      |     ms |
|100th percentile service time |         knn-search-10-100-before-force-merge |      6.83308     |     ms |

| 90th percentile service time |       knn-search-100-1000-before-force-merge |     23.5894      |     ms |
|100th percentile service time |       knn-search-100-1000-before-force-merge |     25.3105      |     ms |

| 90th percentile service time |                             knn-search-10-50 |      5.94023     |     ms |
|100th percentile service time |                             knn-search-10-50 |      8.84533     |     ms |

| 90th percentile service time |                            knn-search-10-100 |      2.76045     |     ms |
|100th percentile service time |                            knn-search-10-100 |      4.81538     |     ms |

| 90th percentile service time |                          knn-search-100-1000 |      5.96235     |     ms |
|100th percentile service time |                          knn-search-100-1000 |      9.10671     |     ms |
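
The ratios behind these observations can be recomputed from the rows quoted above. The sketch below is not part of the PR; it assumes the searches run after force-merge are the baseline for the search comparison (the comment does not say which baseline was used), and the helper name `slowdown` is made up for illustration.

```python
# Service times in ms, copied from the rows quoted above: (p90, p100).
INDEXING = {
    "index-append": (383.032, 625.663),
    "index-append-concurrent-with-searches": (604.145, 922.579),
}
SEARCH_CONCURRENT = {
    "knn-search-10-50": (29.3222, 45.7837),
    "knn-search-10-100": (25.315, 45.6029),
    "knn-search-100-1000": (31.9343, 57.3495),
}
# Assumption: the baseline is the search run after force-merge.
SEARCH_BASELINE = {
    "knn-search-10-50": (5.94023, 8.84533),
    "knn-search-10-100": (2.76045, 4.81538),
    "knn-search-100-1000": (5.96235, 9.10671),
}


def slowdown(concurrent, baseline):
    """Return the (p90, p100) ratios of concurrent over baseline service time."""
    return tuple(c / b for c, b in zip(concurrent, baseline))


indexing_ratio = slowdown(
    INDEXING["index-append-concurrent-with-searches"], INDEXING["index-append"]
)
search_ratios = {
    name: slowdown(SEARCH_CONCURRENT[name], SEARCH_BASELINE[name])
    for name in SEARCH_CONCURRENT
}

print("indexing p90/p100 slowdown:", [round(r, 2) for r in indexing_ratio])
for name, ratios in search_ratios.items():
    print(name, "p90/p100 slowdown:", [round(r, 2) for r in ratios])
```

With these inputs the indexing ratios come out at roughly 1.5x, and the search ratios fall between about 5x and 9.5x.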

@mayya-sharipova
Contributor Author

I experimented with creating another challenge that uses warmup-time-period, but found it very difficult to pick a good value.

  • With "warmup-time-period": 100, too few documents (100K) were indexed during the test period
  • With "warmup-time-period": 200, I got the message: [WARNING] No throughput metrics available for [index-append]. Likely cause: The benchmark ended already during warmup.

It looks like warmup-time-period is very fragile and also machine-dependent, so I don't think it is a workable option for us.
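
One way to see why a fixed warmup-time-period is machine-dependent is a back-of-the-envelope model: samples taken during the warmup window are excluded from the report, so only documents ingested afterwards are measured. The sketch below is illustrative only; the function name, corpus size and throughput figures are hypothetical and are not measurements from this PR.

```python
def measured_docs(total_docs: int, docs_per_second: float, warmup_seconds: float) -> int:
    """Documents left to ingest once the warmup window has elapsed.

    Simplified model: ingestion runs at a constant rate and everything
    ingested during warmup is excluded from the reported metrics.
    """
    ingested_during_warmup = docs_per_second * warmup_seconds
    return max(0, int(total_docs - ingested_during_warmup))


# Hypothetical numbers: a 2M-document ingest on a slower and a faster machine.
TOTAL_DOCS = 2_000_000
for rate in (8_000, 16_000):
    for warmup in (100, 200):
        left = measured_docs(TOTAL_DOCS, rate, warmup)
        note = "task ends during warmup" if left == 0 else f"{left} docs measured"
        print(f"rate={rate} docs/s, warmup={warmup}s -> {note}")
```

In this model the same warmup value that leaves a usable measurement window on the slower machine swallows the whole task on the faster one.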

concurrent-index-and-search challenge
{
  "name": "concurrent-index-and-search",
  "description": "incremental indexing of vectors with concurrent searches",
  "default": false,
  "schedule": [
    {
      "operation": {
        "operation-type": "delete-index"
      }
    },
    {
      "operation": {
        "operation-type": "create-index"
      }
    },
    {
      "operation": {
        "operation-type": "cluster-health",
        "request-params": {
          "wait_for_status": "green"
        },
        "retry-until-success": true
      }
    },
    {
      "parallel": {
        "warmup-time-period": 100,
        "completed-by": "index-append",
        "tasks": [
          {
            "name": "index-append",
            "operation": {
              "operation-type": "bulk",
              "bulk-size": 5000,
              "ingest-percentage": 20
            },
            "clients": 1
          },
          {
            "name": "knn-search-10-50-concurrent-with-indexing",
            "operation": "knn-search-10-50",
            "clients": 1
          },
          {
            "name": "knn-search-10-100-concurrent-with-indexing",
            "operation": "knn-search-10-100",
            "clients": 1
          },
          {
            "name": "knn-search-100-1000-concurrent-with-indexing",
            "operation": "knn-search-100-1000",
            "clients": 1
          }
        ]
      }  
    },
    {
      "name": "wait-until-merges-finish-after-index",
      "operation": {
        "operation-type": "index-stats",
        "index": "_all",
        "condition": {
          "path": "_all.total.merges.current",
          "expected-value": 0
        },
        "retry-until-success": true,
        "include-in-reporting": true
      }
    }      
  ]
}

I also experimented with using the same number of warmup-iterations for the whole parallel group, but that doesn't work either, because search operations finish much faster than indexing operations. Using a different number of warmup-iterations for each task in the parallel group could help, but it would also be very difficult to find the right numbers.

I think sticking with the original plan of ingesting the same documents in the parallel section is our best strategy. Later, when the performance team modifies Rally to allow using a different set of documents, we can modify our challenge.

Contributor

@DJRickyB DJRickyB left a comment

I've left comments that are mostly concerns about variability in the test as designed. One way to control more would be to separate each of the three queries to their own parallel block, completed-by the query task and unbounded on the indexing tasks' ingest-percentage. It is verbose to spell it out this way but I think the results will potentially be more meaningful unless you are also targeting resource contention on the search side, and we do not fear variability in the results.

I'm also happy to accept the test as it is now and then revise it based on what we see bubble up in the nightly benchmark executions; I'm throwing these thoughts out in case they help with testing what you intend to test.
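
The layout suggested above could look roughly like the following. This is a sketch rather than the change that was merged: it generates one parallel block per query, each completed-by its query task, with an indexing task that has no ingest-percentage. The search operation names and the query task names are taken from the challenge posted earlier in this conversation; the helper name, the indexing task names and the `iterations` value are assumptions.

```python
import json

QUERIES = ["knn-search-10-50", "knn-search-10-100", "knn-search-100-1000"]


def parallel_block_for(query: str, iterations: int = 1000) -> dict:
    """Build one parallel block whose end is decided by the query task."""
    query_task = f"{query}-concurrent-with-indexing"
    return {
        "parallel": {
            "completed-by": query_task,
            "tasks": [
                {
                    # No "ingest-percentage" here: indexing is not capped and
                    # the block ends when the query task completes.
                    "name": f"index-append-concurrent-with-{query}",
                    "operation": {"operation-type": "bulk", "bulk-size": 5000},
                    "clients": 1,
                },
                {
                    "name": query_task,
                    "operation": query,
                    "iterations": iterations,  # assumed value, tune per query
                    "clients": 1,
                },
            ],
        }
    }


schedule = [parallel_block_for(q) for q in QUERIES]
print(json.dumps(schedule, indent=2))
```

Each query then competes only with indexing, not with the other two queries, at the cost of a longer schedule.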

"parallel": {
"tasks": [
{
"name": "index-append-concurrent-with-searches",
Contributor

Suggested change
"name": "index-append-concurrent-with-searches",
"name": "index-update-concurrent-with-searches",

"clients": {{bulk_indexing_clients | default(1)}}
},
{
"name": "knn-search-10-50-concurrent-with-indexing",
Contributor

Note that this will be concurrent not just with indexing but also with your other searches, possibly resulting in queuing on the target's shards and giving us non-deterministic results. Is that ok?

"warmup-time-period": {{ bulk_warmup | default(40) | int }},
"clients": {{bulk_indexing_clients | default(1)}}
},
{
"parallel": {
Contributor

Without a completed-by here, the block will execute until all dependent tasks have completed. Note, however, that if the indexing task completes before the querying tasks (I'm not sure how likely this is), you will have querying tasks that execute and affect your results but are not actually executing in parallel with indexing.

Add an operation that updates the current documents while
running knn searches on them at the same time.
@mayya-sharipova
Contributor Author

@DJRickyB Sorry for the late reply; I was away for some of this time and busy with other tasks, but I would like to continue the work on this.

I've added a second commit that addresses some of your feedback:

  • index-append-concurrent-with-searches was renamed to index-update-concurrent-with-searches
  • the parallel block now uses only a single type of search request (this gives us enough data)
  • added "completed-by": "index-update-concurrent-with-searches" to the parallel section

> concerns about variability in the test as designed. One way to control more would be to separate each of the three queries to their own parallel block, completed-by the query task and unbounded on the indexing tasks' ingest-percentage. It is verbose to spell it out this way but I think the results will potentially be more meaningful unless you are also targeting resource contention on the search side, and we do not fear variability in the results.

  • Using a single type of query is enough for us for now.
  • We expect indexing operations to take much more time than searches, and for the concurrent test to be meaningful we would like searches to see newly indexed/updated data.
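
Put together, the revised parallel section described in the list above might look like the sketch below, with a small check that completed-by names a task that actually exists in the block. The values shown (bulk size, client counts, which query is kept) are assumptions based on snippets and results elsewhere in this conversation, not a copy of the commit, and `completed_by_is_valid` is a hypothetical helper.

```python
parallel_block = {
    "parallel": {
        # The block ends as soon as the indexing task finishes.
        "completed-by": "index-update-concurrent-with-searches",
        "tasks": [
            {
                "name": "index-update-concurrent-with-searches",
                "operation": {"operation-type": "bulk", "bulk-size": 5000},
                "clients": 1,
            },
            {
                "name": "knn-search-100-1000-concurrent-with-indexing",
                "operation": "knn-search-100-1000",
                "clients": 1,
            },
        ],
    }
}


def completed_by_is_valid(block: dict) -> bool:
    """True if the block has no completed-by, or it names one of the block's tasks."""
    parallel = block["parallel"]
    target = parallel.get("completed-by")
    names = {task["name"] for task in parallel["tasks"]}
    return target is None or target in names


print(completed_by_is_valid(parallel_block))
```

A check like this would flag a rename such as index-append to index-update that was applied to the task but not to completed-by.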

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 10, 2022

Here are the Rally benchmarking results for the latest commit:

concurrent-index-and-search challenge
|                                                         Metric |                                         Task |           Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------------:|----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                              |     2.75132     |    min |
|             Min cumulative indexing time across primary shards |                                              |     1.36882     |    min |
|          Median cumulative indexing time across primary shards |                                              |     1.37566     |    min |
|             Max cumulative indexing time across primary shards |                                              |     1.3825      |    min |
|            Cumulative indexing throttle time of primary shards |                                              |     0           |    min |
|    Min cumulative indexing throttle time across primary shards |                                              |     0           |    min |
| Median cumulative indexing throttle time across primary shards |                                              |     0           |    min |
|    Max cumulative indexing throttle time across primary shards |                                              |     0           |    min |
|                        Cumulative merge time of primary shards |                                              |    18.5745      |    min |
|                       Cumulative merge count of primary shards |                                              |     2           |        |
|                Min cumulative merge time across primary shards |                                              |     9.27853     |    min |
|             Median cumulative merge time across primary shards |                                              |     9.28723     |    min |
|                Max cumulative merge time across primary shards |                                              |     9.29592     |    min |
|               Cumulative merge throttle time of primary shards |                                              |     0           |    min |
|       Min cumulative merge throttle time across primary shards |                                              |     0           |    min |
|    Median cumulative merge throttle time across primary shards |                                              |     0           |    min |
|       Max cumulative merge throttle time across primary shards |                                              |     0           |    min |
|                      Cumulative refresh time of primary shards |                                              |    13.4259      |    min |
|                     Cumulative refresh count of primary shards |                                              |    37           |        |
|              Min cumulative refresh time across primary shards |                                              |     6.5072      |    min |
|           Median cumulative refresh time across primary shards |                                              |     6.71294     |    min |
|              Max cumulative refresh time across primary shards |                                              |     6.91868     |    min |
|                        Cumulative flush time of primary shards |                                              |    14.3902      |    min |
|                       Cumulative flush count of primary shards |                                              |    10           |        |
|                Min cumulative flush time across primary shards |                                              |     6.84495     |    min |
|             Median cumulative flush time across primary shards |                                              |     7.19508     |    min |
|                Max cumulative flush time across primary shards |                                              |     7.54522     |    min |
|                                        Total Young Gen GC time |                                              |     1.005       |      s |
|                                       Total Young Gen GC count |                                              |    67           |        |
|                                          Total Old Gen GC time |                                              |     0           |      s |
|                                         Total Old Gen GC count |                                              |     0           |        |
|                                                     Store size |                                              |     1.24487     |     GB |
|                                                  Translog size |                                              |     1.02445e-07 |     GB |
|                                         Heap used for segments |                                              |     0           |     MB |
|                                       Heap used for doc values |                                              |     0           |     MB |
|                                            Heap used for terms |                                              |     0           |     MB |
|                                            Heap used for norms |                                              |     0           |     MB |
|                                           Heap used for points |                                              |     0           |     MB |
|                                    Heap used for stored fields |                                              |     0           |     MB |
|                                                  Segment count |                                              |     2           |        |
|                                    Total Ingest Pipeline count |                                              |     0           |        |
|                                     Total Ingest Pipeline time |                                              |     0           |      s |
|                                   Total Ingest Pipeline failed |                                              |     0           |        |
|                                                 Min Throughput |                                 index-append | 15188.1         | docs/s |
|                                                Mean Throughput |                                 index-append | 15978.4         | docs/s |
|                                              Median Throughput |                                 index-append | 16098.2         | docs/s |
|                                                 Max Throughput |                                 index-append | 16333.7         | docs/s |
|                                        50th percentile latency |                                 index-append |   271.098       |     ms |
|                                        90th percentile latency |                                 index-append |   291.438       |     ms |
|                                        99th percentile latency |                                 index-append |   451.431       |     ms |
|                                       100th percentile latency |                                 index-append |   525.137       |     ms |
|                                   50th percentile service time |                                 index-append |   271.098       |     ms |
|                                   90th percentile service time |                                 index-append |   291.438       |     ms |
|                                   99th percentile service time |                                 index-append |   451.431       |     ms |
|                                  100th percentile service time |                                 index-append |   525.137       |     ms |
|                                                     error rate |                                 index-append |     0           |      % |
|                                                 Min Throughput |        index-update-concurrent-with-searches | 16110.3         | docs/s |
|                                                Mean Throughput |        index-update-concurrent-with-searches | 16535.5         | docs/s |
|                                              Median Throughput |        index-update-concurrent-with-searches | 16556.2         | docs/s |
|                                                 Max Throughput |        index-update-concurrent-with-searches | 16868.9         | docs/s |
|                                        50th percentile latency |        index-update-concurrent-with-searches |   269.637       |     ms |
|                                        90th percentile latency |        index-update-concurrent-with-searches |   326.219       |     ms |
|                                       100th percentile latency |        index-update-concurrent-with-searches |   452.553       |     ms |
|                                   50th percentile service time |        index-update-concurrent-with-searches |   269.637       |     ms |
|                                   90th percentile service time |        index-update-concurrent-with-searches |   326.219       |     ms |
|                                  100th percentile service time |        index-update-concurrent-with-searches |   452.553       |     ms |
|                                                     error rate |        index-update-concurrent-with-searches |     0           |      % |
|                                                 Min Throughput | knn-search-100-1000-concurrent-with-indexing |    29.67        |  ops/s |
|                                                Mean Throughput | knn-search-100-1000-concurrent-with-indexing |    31.7         |  ops/s |
|                                              Median Throughput | knn-search-100-1000-concurrent-with-indexing |    31.91        |  ops/s |
|                                                 Max Throughput | knn-search-100-1000-concurrent-with-indexing |    32.98        |  ops/s |
|                                        50th percentile latency | knn-search-100-1000-concurrent-with-indexing |    27.9522      |     ms |
|                                        90th percentile latency | knn-search-100-1000-concurrent-with-indexing |    29.946       |     ms |
|                                        99th percentile latency | knn-search-100-1000-concurrent-with-indexing |    39.0829      |     ms |
|                                       100th percentile latency | knn-search-100-1000-concurrent-with-indexing |    49.5132      |     ms |
|                                   50th percentile service time | knn-search-100-1000-concurrent-with-indexing |    27.9522      |     ms |
|                                   90th percentile service time | knn-search-100-1000-concurrent-with-indexing |    29.946       |     ms |
|                                   99th percentile service time | knn-search-100-1000-concurrent-with-indexing |    39.0829      |     ms |
|                                  100th percentile service time | knn-search-100-1000-concurrent-with-indexing |    49.5132      |     ms |
|                                                     error rate | knn-search-100-1000-concurrent-with-indexing |     0           |      % |
|                                                 Min Throughput |          knn-search-10-50-before-force-merge |   198.13        |  ops/s |
|                                                Mean Throughput |          knn-search-10-50-before-force-merge |   198.13        |  ops/s |
|                                              Median Throughput |          knn-search-10-50-before-force-merge |   198.13        |  ops/s |
|                                                 Max Throughput |          knn-search-10-50-before-force-merge |   198.13        |  ops/s |
|                                        50th percentile latency |          knn-search-10-50-before-force-merge |     4.21267     |     ms |
|                                        90th percentile latency |          knn-search-10-50-before-force-merge |     5.21381     |     ms |
|                                        99th percentile latency |          knn-search-10-50-before-force-merge |     6.8245      |     ms |
|                                       100th percentile latency |          knn-search-10-50-before-force-merge |     7.42229     |     ms |
|                                   50th percentile service time |          knn-search-10-50-before-force-merge |     4.21267     |     ms |
|                                   90th percentile service time |          knn-search-10-50-before-force-merge |     5.21381     |     ms |
|                                   99th percentile service time |          knn-search-10-50-before-force-merge |     6.8245      |     ms |
|                                  100th percentile service time |          knn-search-10-50-before-force-merge |     7.42229     |     ms |
|                                                     error rate |          knn-search-10-50-before-force-merge |     0           |      % |
|                                                 Min Throughput |         knn-search-10-100-before-force-merge |   177.64        |  ops/s |
|                                                Mean Throughput |         knn-search-10-100-before-force-merge |   177.64        |  ops/s |
|                                              Median Throughput |         knn-search-10-100-before-force-merge |   177.64        |  ops/s |
|                                                 Max Throughput |         knn-search-10-100-before-force-merge |   177.64        |  ops/s |
|                                        50th percentile latency |         knn-search-10-100-before-force-merge |     4.29077     |     ms |
|                                        90th percentile latency |         knn-search-10-100-before-force-merge |     5.67705     |     ms |
|                                        99th percentile latency |         knn-search-10-100-before-force-merge |     7.0342      |     ms |
|                                       100th percentile latency |         knn-search-10-100-before-force-merge |     7.43758     |     ms |
|                                   50th percentile service time |         knn-search-10-100-before-force-merge |     4.29077     |     ms |
|                                   90th percentile service time |         knn-search-10-100-before-force-merge |     5.67705     |     ms |
|                                   99th percentile service time |         knn-search-10-100-before-force-merge |     7.0342      |     ms |
|                                  100th percentile service time |         knn-search-10-100-before-force-merge |     7.43758     |     ms |
|                                                     error rate |         knn-search-10-100-before-force-merge |     0           |      % |
|                                                 Min Throughput |       knn-search-100-1000-before-force-merge |    26.17        |  ops/s |
|                                                Mean Throughput |       knn-search-100-1000-before-force-merge |    27.19        |  ops/s |
|                                              Median Throughput |       knn-search-100-1000-before-force-merge |    27.32        |  ops/s |
|                                                 Max Throughput |       knn-search-100-1000-before-force-merge |    27.94        |  ops/s |
|                                        50th percentile latency |       knn-search-100-1000-before-force-merge |    33.8932      |     ms |
|                                        90th percentile latency |       knn-search-100-1000-before-force-merge |    35.1241      |     ms |
|                                        99th percentile latency |       knn-search-100-1000-before-force-merge |    36.9389      |     ms |
|                                       100th percentile latency |       knn-search-100-1000-before-force-merge |    38.2534      |     ms |
|                                   50th percentile service time |       knn-search-100-1000-before-force-merge |    33.8932      |     ms |
|                                   90th percentile service time |       knn-search-100-1000-before-force-merge |    35.1241      |     ms |
|                                   99th percentile service time |       knn-search-100-1000-before-force-merge |    36.9389      |     ms |
|                                  100th percentile service time |       knn-search-100-1000-before-force-merge |    38.2534      |     ms |
|                                                     error rate |       knn-search-100-1000-before-force-merge |     0           |      % |
|                                                 Min Throughput |        script-score-query-before-force-merge |     4.56        |  ops/s |
|                                                Mean Throughput |        script-score-query-before-force-merge |     4.63        |  ops/s |
|                                              Median Throughput |        script-score-query-before-force-merge |     4.63        |  ops/s |
|                                                 Max Throughput |        script-score-query-before-force-merge |     4.67        |  ops/s |
|                                        50th percentile latency |        script-score-query-before-force-merge |   208.236       |     ms |
|                                        90th percentile latency |        script-score-query-before-force-merge |   209.456       |     ms |
|                                        99th percentile latency |        script-score-query-before-force-merge |   210.977       |     ms |
|                                       100th percentile latency |        script-score-query-before-force-merge |   211.096       |     ms |
|                                   50th percentile service time |        script-score-query-before-force-merge |   208.236       |     ms |
|                                   90th percentile service time |        script-score-query-before-force-merge |   209.456       |     ms |
|                                   99th percentile service time |        script-score-query-before-force-merge |   210.977       |     ms |
|                                  100th percentile service time |        script-score-query-before-force-merge |   211.096       |     ms |
|                                                     error rate |        script-score-query-before-force-merge |     0           |      % |
|                                                 Min Throughput |                                  force-merge |     0           |  ops/s |
|                                                Mean Throughput |                                  force-merge |     0           |  ops/s |
|                                              Median Throughput |                                  force-merge |     0           |  ops/s |
|                                                 Max Throughput |                                  force-merge |     0           |  ops/s |
|                                       100th percentile latency |                                  force-merge |     1.11501e+06 |     ms |
|                                  100th percentile service time |                                  force-merge |     1.11501e+06 |     ms |
|                                                     error rate |                                  force-merge |     0           |      % |
|                                                 Min Throughput |                             knn-search-10-50 |   176.93        |  ops/s |
|                                                Mean Throughput |                             knn-search-10-50 |   176.93        |  ops/s |
|                                              Median Throughput |                             knn-search-10-50 |   176.93        |  ops/s |
|                                                 Max Throughput |                             knn-search-10-50 |   176.93        |  ops/s |
|                                        50th percentile latency |                             knn-search-10-50 |     1.37304     |     ms |
|                                        90th percentile latency |                             knn-search-10-50 |     2.30238     |     ms |
|                                        99th percentile latency |                             knn-search-10-50 |     3.68        |     ms |
|                                       100th percentile latency |                             knn-search-10-50 |     3.69208     |     ms |
|                                   50th percentile service time |                             knn-search-10-50 |     1.37304     |     ms |
|                                   90th percentile service time |                             knn-search-10-50 |     2.30238     |     ms |
|                                   99th percentile service time |                             knn-search-10-50 |     3.68        |     ms |
|                                  100th percentile service time |                             knn-search-10-50 |     3.69208     |     ms |
|                                                     error rate |                             knn-search-10-50 |     0           |      % |
|                                                 Min Throughput |                            knn-search-10-100 |   173.96        |  ops/s |
|                                                Mean Throughput |                            knn-search-10-100 |   173.96        |  ops/s |
|                                              Median Throughput |                            knn-search-10-100 |   173.96        |  ops/s |
|                                                 Max Throughput |                            knn-search-10-100 |   173.96        |  ops/s |
|                                        50th percentile latency |                            knn-search-10-100 |     1.86        |     ms |
|                                        90th percentile latency |                            knn-search-10-100 |     4.97013     |     ms |
|                                        99th percentile latency |                            knn-search-10-100 |     8.36762     |     ms |
|                                       100th percentile latency |                            knn-search-10-100 |     8.56546     |     ms |
|                                   50th percentile service time |                            knn-search-10-100 |     1.86        |     ms |
|                                   90th percentile service time |                            knn-search-10-100 |     4.97013     |     ms |
|                                   99th percentile service time |                            knn-search-10-100 |     8.36762     |     ms |
|                                  100th percentile service time |                            knn-search-10-100 |     8.56546     |     ms |
|                                                     error rate |                            knn-search-10-100 |     0           |      % |
|                                                 Min Throughput |                          knn-search-100-1000 |    53.83        |  ops/s |
|                                                Mean Throughput |                          knn-search-100-1000 |    53.83        |  ops/s |
|                                              Median Throughput |                          knn-search-100-1000 |    53.83        |  ops/s |
|                                                 Max Throughput |                          knn-search-100-1000 |    53.83        |  ops/s |
|                                        50th percentile latency |                          knn-search-100-1000 |     5.9989      |     ms |
|                                        90th percentile latency |                          knn-search-100-1000 |     7.60146     |     ms |
|                                        99th percentile latency |                          knn-search-100-1000 |    10.6784      |     ms |
|                                       100th percentile latency |                          knn-search-100-1000 |    10.9691      |     ms |
|                                   50th percentile service time |                          knn-search-100-1000 |     5.9989      |     ms |
|                                   90th percentile service time |                          knn-search-100-1000 |     7.60146     |     ms |
|                                   99th percentile service time |                          knn-search-100-1000 |    10.6784      |     ms |
|                                  100th percentile service time |                          knn-search-100-1000 |    10.9691      |     ms |
|                                                     error rate |                          knn-search-100-1000 |     0           |      % |
|                                                 Min Throughput |                           script-score-query |     4.6         |  ops/s |
|                                                Mean Throughput |                           script-score-query |     4.66        |  ops/s |
|                                              Median Throughput |                           script-score-query |     4.67        |  ops/s |
|                                                 Max Throughput |                           script-score-query |     4.7         |  ops/s |
|                                        50th percentile latency |                           script-score-query |   207.542       |     ms |
|                                        90th percentile latency |                           script-score-query |   208.814       |     ms |
|                                        99th percentile latency |                           script-score-query |   211.765       |     ms |
|                                       100th percentile latency |                           script-score-query |   215.326       |     ms |
|                                   50th percentile service time |                           script-score-query |   207.542       |     ms |
|                                   90th percentile service time |                           script-score-query |   208.814       |     ms |
|                                   99th percentile service time |                           script-score-query |   211.765       |     ms |
|                                  100th percentile service time |                           script-score-query |   215.326       |     ms |
|                                                     error rate |                           script-score-query |     0           |      % |

Some interesting observations:

  • indexing doesn't seem to be affected by concurrent searches
  • searches are fastest on a single segment: 5.9-10.9 ms
  • searches are several times slower on multiple segments: 33-38 ms
  • concurrent indexing has some effect on searches, but not a significant one (most searches performed the same, the slowest 1% were up to 25% slower): 27.9-49.5 ms
| 50th percentile service time | knn-search-100-1000-concurrent-with-indexing |    27.9522      |     ms |
| 90th percentile service time | knn-search-100-1000-concurrent-with-indexing |    29.946       |     ms |
| 99th percentile service time | knn-search-100-1000-concurrent-with-indexing |    39.0829      |     ms |
|100th percentile service time | knn-search-100-1000-concurrent-with-indexing |    49.5132      |     ms |

| 50th percentile service time |       knn-search-100-1000-before-force-merge |    33.8932      |     ms |
| 90th percentile service time |       knn-search-100-1000-before-force-merge |    35.1241      |     ms |
| 99th percentile service time |       knn-search-100-1000-before-force-merge |    36.9389      |     ms |
|100th percentile service time |       knn-search-100-1000-before-force-merge |    38.2534      |     ms |

| 50th percentile service time |                          knn-search-100-1000 |     5.9989      |     ms |
| 90th percentile service time |                          knn-search-100-1000 |     7.60146     |     ms |
| 99th percentile service time |                          knn-search-100-1000 |    10.6784      |     ms |
|100th percentile service time |                          knn-search-100-1000 |    10.9691      |     ms |
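
To put numbers on these observations, here is a minimal sketch that computes relative changes from the service times in the three tables above. The function name `percent_change` and the dictionary layout are illustrative assumptions and are not part of the track.

```python
# Service times in ms, copied from the three tables above, keyed by percentile.
SERVICE_TIME_MS = {
    "concurrent-with-indexing": {50: 27.9522, 90: 29.946, 99: 39.0829, 100: 49.5132},
    "before-force-merge": {50: 33.8932, 90: 35.1241, 99: 36.9389, 100: 38.2534},
    "single-segment": {50: 5.9989, 90: 7.60146, 99: 10.6784, 100: 10.9691},
}


def percent_change(baseline: float, value: float) -> float:
    """Return how much `value` differs from `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def compare_to_baseline(baseline_name: str, other_name: str) -> dict:
    """Percent change per percentile of `other_name` relative to `baseline_name`."""
    baseline = SERVICE_TIME_MS[baseline_name]
    other = SERVICE_TIME_MS[other_name]
    return {pct: percent_change(baseline[pct], other[pct]) for pct in baseline}


if __name__ == "__main__":
    for name in ("concurrent-with-indexing", "single-segment"):
        for pct, change in compare_to_baseline("before-force-merge", name).items():
            print(f"{name} p{pct}: {change:+.1f}% vs before-force-merge")
```

Positive values mean slower than the multi-segment baseline, negative values mean faster.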

Contributor

@DJRickyB DJRickyB left a comment

I find the new approach simple and really interesting. I left a few suggestions and am approving in case you think we are good without them; otherwise I'm happy to take another look.

@mayya-sharipova
Contributor Author

@DJRickyB Thanks for your feedback.

@jtibshirani Julie, do you want to have a final look at the updated track to see that we are good with it?

Contributor

@jtibshirani jtibshirani left a comment

Thanks @mayya-sharipova! It looks good to me too, I just wanted to check I understand what's happening:

  • We first index 2 million docs (20% of the full dataset of 10 million).
  • Then we reindex the first 500,000 docs while running knn-search-100-1000 in parallel.

Some questions:

  • In the parallel section, how many iterations of the search are run?
  • I guess this affects the next tasks knn-search-10-50-before-force-merge compared to before this change, because now there will be more segments. This seems fine, we just might see the benchmark results change a bit.

@jtibshirani
Contributor

One last thought: the latency for knn-search-100-1000-concurrent-with-indexing and knn-search-100-1000-before-force-merge is really similar. I wonder if it'd be more interesting if we added a wait-for-merges-to-finish before running the before-force-merge searches. That way we would see what searches look like in a "steady state" where you index all documents, have a pause, then start searching (but don't want the expense of a force merge?)

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 16, 2022

@jtibshirani Thanks for your feedback. Answering your questions:

I just wanted to check I understand what's happening:
We first index 2 million docs (20% of the full dataset of 10 million).
Then we reindex the first 500,000 docs while running knn-search-100-1000 in parallel.

Yes, indeed that's the goal. But in reality, when this parallel section finishes, only around 100,000 docs are available for search, as it takes some time for all the refreshes to catch up. That's why, after the parallel section, we have another refresh, refresh-after-update, to ensure that all before-force-merge searches see 2.5 million docs.
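
The visibility lag described here can be illustrated with a toy model. This is only a sketch: `ToyIndex` and `run_schedule` are hypothetical names, not Elasticsearch or Rally APIs, and the model ignores the periodic background refreshes that made roughly 100,000 docs visible in the real run. The bulk size of 5,000 is the track default, and 100 bulk requests is the approximate count reported for the parallel section.

```python
from typing import Tuple


class ToyIndex:
    """Toy model: indexed docs become searchable only after a refresh."""

    def __init__(self) -> None:
        self.buffered = 0    # indexed, but not yet visible to searches
        self.searchable = 0  # visible to searches

    def bulk(self, doc_count: int) -> None:
        """Accept a bulk request; the docs stay invisible until refresh()."""
        self.buffered += doc_count

    def refresh(self) -> None:
        """Make all buffered docs visible to searches."""
        self.searchable += self.buffered
        self.buffered = 0


def run_schedule(initial_docs: int, bulk_size: int, bulk_requests: int) -> Tuple[int, int]:
    """Return the searchable doc count before and after the final explicit refresh."""
    index = ToyIndex()
    index.bulk(initial_docs)
    index.refresh()
    for _ in range(bulk_requests):
        index.bulk(bulk_size)  # concurrent searches would run here
    visible_before_refresh = index.searchable
    index.refresh()  # plays the role of the explicit refresh task
    return visible_before_refresh, index.searchable


if __name__ == "__main__":
    print(run_schedule(initial_docs=2_000_000, bulk_size=5_000, bulk_requests=100))
```

In this model the searches only see all 2.5 million docs after the explicit refresh, which is why the track needs that extra task before the before-force-merge searches.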

In the parallel section, how many iterations of the search are run?

This section runs for around 30 seconds, with around 1,000 search and 100 index operations.

I guess this affects the next tasks knn-search-10-50-before-force-merge compared to before this change, because now there will be more segments. This seems fine, we just might see the benchmark results change a bit.

Indeed, the before-force-merge searches will now see more segments and docs. We should explain this when the benchmark results change.

I'm not sure if this matters, but curious why we don't use the default "refresh": false?

I thought using "refresh": "wait_for" would let searches see indexed documents immediately, but that turned out not to be the case: I did not see any difference in the number of segments and documents available for search between refresh:wait_for and the default refresh:false. So I've decided to follow your suggestion and keep the default behaviour.
Addressed in cd8ea41.

the latency for knn-search-100-1000-concurrent-with-indexing and knn-search-100-1000-before-force-merge is really similar. I wonder if it'd be more interesting if we added a wait-for-merges-to-finish before running the before-force-merge searches. That way we would see what searches look like in a "steady state" where you index all documents, have a pause, then start searching (but don't want the expense of a force merge?)

We do achieve "steady state" before we start the before-force-merge searches. We have a refresh-after-index operation that makes sure searches see all 2.5 million docs.

I also experimented with adding wait-until-merges-finish, but it did not make any difference, because there are no merges occurring until the final force_merge, as new updates only add new segments.

Here are the index stats after all indexing has finished, before we start the searches.

index stats
{
    "primaries":
    {
        "docs":
        {
            "count": 2500000,
            "deleted": 0
        },
        "shard_stats":
        {
            "total_count": 2
        },
        "store":
        {
            "size_in_bytes": 5303589582,
            "total_data_set_size_in_bytes": 5303589582,
            "reserved_in_bytes": 0
        },
        "indexing":
        {
            "index_total": 2500000,
            "index_time_in_millis": 173547,
            "index_current": 0,
            "index_failed": 0,
            "delete_total": 0,
            "delete_time_in_millis": 0,
            "delete_current": 0,
            "noop_update_total": 0,
            "is_throttled": false,
            "throttle_time_in_millis": 204404
        },
        "get":
        {
            "total": 0,
            "time_in_millis": 0,
            "exists_total": 0,
            "exists_time_in_millis": 0,
            "missing_total": 0,
            "missing_time_in_millis": 0,
            "current": 0
        },
        "search":
        {
            "open_contexts": 0,
            "query_total": 2634,
            "query_time_in_millis": 50926,
            "query_current": 0,
            "fetch_total": 2634,
            "fetch_time_in_millis": 4729,
            "fetch_current": 0,
            "scroll_total": 0,
            "scroll_time_in_millis": 0,
            "scroll_current": 0,
            "suggest_total": 0,
            "suggest_time_in_millis": 0,
            "suggest_current": 0
        },
        "merges":
        {
            "current": 0,
            "current_docs": 0,
            "current_size_in_bytes": 0,
            "total": 0,
            "total_time_in_millis": 0,
            "total_docs": 0,
            "total_size_in_bytes": 0,
            "total_stopped_time_in_millis": 0,
            "total_throttled_time_in_millis": 0,
            "total_auto_throttle_in_bytes": 41943040
        },
        "refresh":
        {
            "total": 28,
            "total_time_in_millis": 901478,
            "external_total": 20,
            "external_total_time_in_millis": 823759,
            "listeners": 0
        },
        "flush":
        {
            "total": 8,
            "periodic": 8,
            "total_time_in_millis": 959008
        },
        "warmer":
        {
            "current": 0,
            "total": 17,
            "total_time_in_millis": 11
        },
        "query_cache":
        {
            "memory_size_in_bytes": 0,
            "total_count": 0,
            "hit_count": 0,
            "miss_count": 0,
            "cache_size": 0,
            "cache_count": 0,
            "evictions": 0
        },
        "fielddata":
        {
            "memory_size_in_bytes": 0,
            "evictions": 0
        },
        "completion":
        {
            "size_in_bytes": 0
        },
        "segments":
        {
            "count": 17,
            "memory_in_bytes": 0,
            "terms_memory_in_bytes": 0,
            "stored_fields_memory_in_bytes": 0,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 0,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 0,
            "index_writer_memory_in_bytes": 0,
            "version_map_memory_in_bytes": 0,
            "fixed_bit_set_memory_in_bytes": 0,
            "max_unsafe_auto_id_timestamp": -1,
            "file_sizes":
            {}
        },
        "translog":
        {
            "operations": 0,
            "size_in_bytes": 110,
            "uncommitted_operations": 0,
            "uncommitted_size_in_bytes": 110,
            "earliest_last_modified_age": 91415
        },
        "request_cache":
        {
            "memory_size_in_bytes": 0,
            "evictions": 0,
            "hit_count": 0,
            "miss_count": 0
        },
        "recovery":
        {
            "current_as_source": 0,
            "current_as_target": 0,
            "throttle_time_in_millis": 0
        },
        "bulk":
        {
            "total_operations": 1000,
            "total_time_in_millis": 176705,
            "total_size_in_bytes": 5223807521,
            "avg_time_in_millis": 166,
            "avg_size_in_bytes": 5223841
        }
    }
}
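
A minimal sketch of how the stats above support the point about merges: it reads a trimmed copy of the `primaries` section, with values copied from the output above, and reports whether any merge has run or is running. The helper name `merge_summary` is hypothetical; in practice the dictionary would come from the index stats response rather than an inline string.

```python
import json

# Trimmed copy of the stats above; only the fields used below are kept.
STATS_JSON = """
{
  "primaries": {
    "docs": {"count": 2500000, "deleted": 0},
    "merges": {"current": 0, "total": 0},
    "segments": {"count": 17}
  }
}
"""


def merge_summary(stats: dict) -> dict:
    """Summarize doc count, segment count and merge activity on primaries."""
    primaries = stats["primaries"]
    merges = primaries["merges"]
    return {
        "docs": primaries["docs"]["count"],
        "segments": primaries["segments"]["count"],
        "merges_idle": merges["current"] == 0,    # nothing merging right now
        "merges_ever_ran": merges["total"] > 0,   # any merge completed so far
    }


summary = merge_summary(json.loads(STATS_JSON))
print(summary)
```

With no merge running and none completed, waiting for merges has nothing to wait for, which matches the unchanged 17 segments.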

Waiting for merges also doesn't change performance:

Without waiting for merges to stabilize:
| 50th percentile service time | knn-search-100-1000-before-force-merge | 31.2539 | ms |
| 90th percentile service time | knn-search-100-1000-before-force-merge | 32.4008 | ms |
| 99th percentile service time | knn-search-100-1000-before-force-merge | 34.4963 | ms |
| 100th percentile service time | knn-search-100-1000-before-force-merge | 35.7168 | ms |

With waiting for merges to stabilize:
| 50th percentile service time | knn-search-100-1000-before-force-merge | 31.1528 | ms |
| 90th percentile service time | knn-search-100-1000-before-force-merge | 32.4756 | ms |
| 99th percentile service time | knn-search-100-1000-before-force-merge | 35.4727 | ms |
| 100th percentile service time | knn-search-100-1000-before-force-merge | 36.6767 | ms |

@jtibshirani
Contributor

That's interesting about wait_for, I also thought it would ensure the docs were visible before returning.

It looks like all merges were throttled, which maybe explains why wait-until-merges-finish doesn't do anything?

        "merges":
        {
            ...
            "total_auto_throttle_in_bytes": 41943040
        },

Anyways, seems like a great step forward to me! We can always discuss these questions or make tweaks later.

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 17, 2022

@jtibshirani Thanks for the feedback. I will confirm with the distributed team experts what's happening with wait_for.

It looks like all merges were throttled, which maybe explains why wait-until-merges-finish doesn't do anything?

    "merges":
    {
        ...
        "total_auto_throttle_in_bytes": 41943040
    },

This is a default value that always gets displayed, even for an empty, newly created index; it defaults to 20 MB per shard.
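
The number in the stats is consistent with that default. A quick check, assuming "20 MB" means 20 MiB and that the reported value is summed over the index's 2 shards (`shard_stats.total_count` in the stats above):

```python
MIB = 1024 * 1024

PER_SHARD_AUTO_THROTTLE = 20 * MIB  # assumed default per shard, in bytes
SHARDS = 2                          # shard_stats.total_count from the index stats
REPORTED = 41943040                 # merges.total_auto_throttle_in_bytes

total = PER_SHARD_AUTO_THROTTLE * SHARDS
print(total, total == REPORTED)  # prints: 41943040 True
```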

@mayya-sharipova
Contributor Author

mayya-sharipova commented May 17, 2022

@DJRickyB Thanks for your feedback on this PR, we are ok to merge it now whenever the timing is good for your team.

Contributor

@DJRickyB DJRickyB left a comment

LGTM again :)

@mayya-sharipova mayya-sharipova merged commit f7597fe into elastic:master May 17, 2022
@mayya-sharipova mayya-sharipova deleted the concurrent-indexing-searches branch May 17, 2022 13:46