Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support CCS minimize round trips in async search #96012

Merged
merged 15 commits into from Jun 1, 2023

Conversation

quux00
Copy link
Contributor

@quux00 quux00 commented May 11, 2023

This commit makes the smallest set of changes to allow async-search based cross-cluster search
to work with the CCS minimize_round_trips feature without changing the internals/architecture of
the search action.

When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the
primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator
to orchestrate and return the entire result set to the CCS coordinator as a single response.

This is the same functionality provided by synchronous CCS search using minimize_round_trips.
However, since this is an async search, it means that the async search coordinator has no visibility
into search progress on the remote clusters while they are running the search, thus losing one of
the key features of async search. However, this is a good first approach for improving overall search
latency for cross cluster searches that query a large number of shards on remote clusters, since
Kibana does not currently expose incremental progress of an async search to users.

@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v8.9.0 labels May 11, 2023
@quux00 quux00 added :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team and removed needs:triage Requires assignment of a team area label v8.9.0 labels May 11, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me so far, I am curious on why it's marked WIP ;) maybe we should add some indication in the get async search API response that some remote clusters are minimizing roundtrips?

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch 7 times, most recently from 1477f42 to 82930fd Compare May 19, 2023 17:34
@quux00 quux00 added cloud-deploy Publish cloud docker image for Cloud-First-Testing and removed WIP labels May 19, 2023
@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from 193e2af to eeb6bfc Compare May 20, 2023 00:01
@quux00
Copy link
Contributor Author

quux00 commented May 20, 2023

With these changes, I have removed the notifyMinimizeRoundtrips/onMinimizeRoundtrips call to the SearchProgressListener in favor of modifying the SearchResponse.Cluster object to hold the information needed.

The _clusters object in the REST response for async_search with minimizeRoundTrips starts out at "successful: 0" and can move to "successful: 1" if there is a local cluster search that finishes while one or more of the remote clusters are still doing the search.

Here's an example of the progression in the _clusters output for a CCS search using minimize-roundtrips against three 3-node clusters:

POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
{
   // ... search body for a slow search not shown
}

// after _async_seach submit
{
  "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
  "is_partial": true,
  "is_running": true,
  "start_time_in_millis": 1684516160099,
  "expiration_time_in_millis": 1684948160099,
  "response": {
    "took": 806,
    "timed_out": false,
    "terminated_early": false,
    "num_reduce_phases": 0,
    "_shards": {
      "total": 3,   // only shows local shards until remote searches finish (current behavior)
      "successful": 0,  // no shards completed
      "skipped": 0,
      "failed": 0
    },
    "_clusters": {
      "total": 3,  
      "successful": 0,  // no clusters completed (new behavior)
      "skipped": 0
    },
    "hits": {
      "total": {
        "value": 0,
        "relation": "gte"
      },
      "max_score": null,
      "hits": []
    }
  }
}

// while running (local cluster has finished; two remote cluster searches still running
GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI

{
  "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
  "is_partial": true,
  "is_running": true,
  "start_time_in_millis": 1684516160099,
  "expiration_time_in_millis": 1684948160099,
  "response": {
    "took": 5973,
    "timed_out": false,
    "terminated_early": false,
    "_shards": {
      "total": 3,
      "successful": 3,  // local shards completed
      "skipped": 0,
      "failed": 0
    },
    "_clusters": {
      "total": 3,
      "successful": 1, // local cluster has completed (new behavior)
      "skipped": 0
    },
    "hits": {
      "total": {
        "value": 142,
        "relation": "eq"
      },
      "max_score": null,
      "hits": []
    }
  }
}


// when all clusters completed
GET /_async_search/FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI

{
  "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
  "is_partial": false,
  "is_running": false,
  "start_time_in_millis": 1684516160099,
  "expiration_time_in_millis": 1684948160099,
  "response": {
    "took": 10350,
    "timed_out": false,
    "num_reduce_phases": 4,
    "_shards": {
      "total": 9,
      "successful": 9,
      "skipped": 0,
      "failed": 0
    },
    "_clusters": {
      "total": 3,
      "successful": 3,
      "skipped": 0
    },
    "hits": {
      "total": {
        "value": 432,
        "relation": "eq"
      },
      "max_score": 17.110811,
      "hits": [
        ...

Note that this is NOT the behavior with ccs_minimize_roundtrips=false. There there _clusters output is always the final output where total == successful + skipped:

POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=false
{
   // ... search body for a slow search not shown
}
...
  "id": "FkF3RE1KZnFSU3M2VUdsWEZjVnlmZFEaQVRnZDdaNjBUZ1N0RU1sUjl0QmphUTo2OTI=",
  "is_partial": true,
  "is_running": true,
  "start_time_in_millis": 1684516160099,
  "expiration_time_in_millis": 1684948160099,
  "response": {
    ...
    "_clusters": {
      "total": 3,
      "successful": 3, // shows final state even if no shards/clusters have completed (current behavior)
      "skipped": 0
    },

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from eeb6bfc to aaa9b4e Compare May 22, 2023 14:28
@quux00
Copy link
Contributor Author

quux00 commented May 22, 2023

I tested a case where the remote clusters are down:

// case where skip_unavailable: true is NOT set for a remote cluster and it is down:
POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
...

// response:
{
  "id": "FjJJZG9yWUhOU3VLaTJzNXZMalZaTkEaaVhraVNDT2VTenFoUDlod21hYmlKZzo5MTE=",
  "is_partial": true,
  "is_running": false,
  "start_time_in_millis": 1684767109136,
  "expiration_time_in_millis": 1685199109136,
  "response": {
    "took": 10927,
    "timed_out": false,
    "terminated_early": false,
    "_shards": {
      "total": 3,
      "successful": 3,
      "skipped": 0,
      "failed": 0
    },
    "_clusters": {
      "total": 3,
      "successful": 1,
      "skipped": 0
    },
    "hits": {
      "total": {
        "value": 167,
        "relation": "eq"
      },
      "max_score": null,
      "hits": []
    }
  },
  "error": {
    "type": "status_exception",
    "reason": "error while executing search",
    "caused_by": {
      "type": "connect_transport_exception",
      "reason": "[][127.0.0.1:9301] connect_exception",
      "caused_by": {
        "type": "uncategorized_execution_exception",
        "reason": "Failed execution",
        "caused_by": {
          "type": "execution_exception",
          "reason": "execution_exception: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301",
          "caused_by": {
            "type": "i_o_exception",
            "reason": "Connection refused: /127.0.0.1:9301",
            "caused_by": {
              "type": "i_o_exception",
              "reason": "Connection refused"
            }
          }
        }
      }
    }
  }
}

----

// where skip_unavailable:true
POST blogs,remote1:*,remote2:*/_async_search?ccs_minimize_roundtrips=true
...

// response:
{
  "id": "FnE0SjZrb2lOVFR5ejFCN3dtX0NHbVEbaVhraVNDT2VTenFoUDlod21hYmlKZzoxMDIz",
  "is_partial": false,
  "is_running": false,
  "start_time_in_millis": 1684767263845,
  "expiration_time_in_millis": 1685199263845,
  "response": {
    "took": 11943,
    "timed_out": false,
    "num_reduce_phases": 3,
    "_shards": {
      "total": 6,
      "successful": 6,
      "skipped": 0,
      "failed": 0
    },
    "_clusters": {
      "total": 3,
      "successful": 2,
      "skipped": 1   // was properly updated at the end (started off as 0)
    },
    "hits": {
      "total": {
        "value": 370,
        "relation": "eq"
      },
      "max_score": 16.595804,
      "hits": [
        {
...

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from aaa9b4e to 2793f5d Compare May 22, 2023 15:15
Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a couple of comments, great progress!!

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch 2 times, most recently from 271c63c to fba88c4 Compare May 23, 2023 17:43
Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHIP IT

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch 3 times, most recently from 1cab8d9 to a6460bd Compare June 1, 2023 12:28
@quux00
Copy link
Contributor Author

quux00 commented Jun 1, 2023

@abdonpijpelink - Would you have time to review the end user docs changes here?

You can look at the last commit to see the new changes: a7b7237

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from a6460bd to a7b7237 Compare June 1, 2023 12:35
@quux00
Copy link
Contributor Author

quux00 commented Jun 1, 2023

@abdonpijpelink

I just talked with Luca and we are going to change the default setting in a follow-on PR, so the end user docs changes will go into that new PR so you can wait to review it then. I'll be deleting those commits from this PR in a moment. Sorry for the change.

@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from a7b7237 to 5c20cf5 Compare June 1, 2023 12:55
Copy link
Contributor

@abdonpijpelink abdonpijpelink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing! For your follow-up PR: note that the <1> numbers ("callouts") have to restart from 1 for each code snippet, otherwise the docs won't build.

Comment on lines 430 to 455
"successful": 3, <4>
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 1, <5>
"skipped": 0
},
"hits": {
"total": {
"value": 167, <6>
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
}
--------------------------------------------------

<4> All the local cluster shards have completed.
<5> The local cluster search has completed, so the "successful" clusters entry
is set to 1. The `_clusters` response section will not be updated for the remote clusters
until all remote searches have finished (either successfully or been skipped).
<6> Number of hits from the local cluster search. Final hits are not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"successful": 3, <4>
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 1, <5>
"skipped": 0
},
"hits": {
"total": {
"value": 167, <6>
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
}
--------------------------------------------------
<4> All the local cluster shards have completed.
<5> The local cluster search has completed, so the "successful" clusters entry
is set to 1. The `_clusters` response section will not be updated for the remote clusters
until all remote searches have finished (either successfully or been skipped).
<6> Number of hits from the local cluster search. Final hits are not
"successful": 3, <1>
"skipped": 0,
"failed": 0
},
"_clusters": {
"total": 3,
"successful": 1, <2>
"skipped": 0
},
"hits": {
"total": {
"value": 167, <3>
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
}
--------------------------------------------------
<1> All the local cluster shards have completed.
<2> The local cluster search has completed, so the "successful" clusters entry
is set to 1. The `_clusters` response section will not be updated for the remote clusters
until all remote searches have finished (either successfully or been skipped).
<3> Number of hits from the local cluster search. Final hits are not

quux00 added 15 commits June 1, 2023 09:33
This commit makes the smallest set of changes to allow async-search based cross-cluster search
to work with the CCS minimize_round_trips feature without changing the internals/architecture of
the search action.

When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the
primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator
to orchestrate and return the entire result set to the CCS coordinator as a single response.

This is the same functionality provided by synchronous CCS search using minimize_round_trips.
However, since this is an async search, it means that the async search coordinator has no visibility
into search progress on the remote clusters while they are running the search, thus losing one of
the key features of async search. However, this is a good first approach for improving overall search
latency for cross cluster searches that query a large number of shards on remote clusters, since
Kibana does not currently expose incremental progress of an async search to users.
…ul count in Clusters - for async minimizeRoundtrips usage
MutableSearchResponse now has logic in buildResponse about whether to
update the Clusters object to indicate that the local cluster has finished.
…ProgressListener.

Modified SearchResponse.Clusters to optionally track remoteClusters and ccsMinimizeRoundtrips.

As the original Clusters object was intended to be immutable, I've added a "frozen" flag
that is set when the original constructor is called.

CCS AsyncSearch minimizeRoundTrips uses this new Clusters variant.

onListShards, with the revised impl and usage of SearchResponse.Clusters, now handles
notifying the SearchProgressListener in the AsyncSearchTask about the number of remote
clusters, whether local shards are being searched and ccsMinimizeRoundtrips status.
Made the Clusters and localClusterStatusUpdated fields to be updated in a synchronized block.
in MutableSearchResponse.SearchClustersInfo to make them safe for multi threaded access.
@quux00 quux00 force-pushed the ccs/async-search-minimizeRoundtrips branch from 5c20cf5 to ded776b Compare June 1, 2023 13:46
@quux00 quux00 merged commit 8b1cd47 into elastic:main Jun 1, 2023
13 checks passed
HiDAl pushed a commit to HiDAl/elasticsearch that referenced this pull request Jun 14, 2023
* Support CCS minimize round trips in async search

This commit makes the smallest set of changes to allow async-search based cross-cluster search
to work with the CCS minimize_round_trips feature without changing the internals/architecture of
the search action.

When ccsMinimizeRoundtrips is set to true on SubmitAsyncSearchRequest, the AsyncSearchTask on the
primary CCS coordinator sends a synchronous SearchRequest to all to clusters for a remote coordinator
to orchestrate and return the entire result set to the CCS coordinator as a single response.

This is the same functionality provided by synchronous CCS search using minimize_roundtrips.
Since this is an async search, it means that the async search coordinator has no visibility
into search progress on the remote clusters while they are running the search, thus losing one of
the key features of async search. However, this is a good first approach for improving overall search
latency for cross cluster searches that query a large number of shards on remote clusters, since
Kibana does not currently expose incremental progress of an async search to users.

Relates elastic#73971
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.9.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants