
Streamline AsyncShardFetch#getNumberOfInFlightFetches #93632

Merged: 8 commits merged into elastic:main on Feb 13, 2023

Conversation

@luyuncheng (Contributor) commented Feb 9, 2023

When we restarted an ES 7.10 cluster, we found in the elected master's hot_threads output that much of the CPU time was spent in org.elasticsearch.action.admin.cluster.health.TransportClusterHealthAction.getResponse -> org.elasticsearch.cluster.routing.allocation.AllocationService.getNumberOfInFlightFetches:

[screenshot: hot_threads output]

We traced the function getNumberOfInFlightFetches:

[screenshot: profiling trace of getNumberOfInFlightFetches]

It shows that all of the time is spent in GatewayAllocator.getNumberOfInFlightFetches:

public int getNumberOfInFlightFetches() {
    int count = 0;
    for (AsyncShardFetch<NodeGatewayStartedShards> fetch : asyncFetchStarted.values()) {
        count += fetch.getNumberOfInFlightFetches();
    }
    for (AsyncShardFetch<NodeStoreFilesMetadata> fetch : asyncFetchStore.values()) {
        count += fetch.getNumberOfInFlightFetches();
    }
    return count;
}

In some cases this scans all shards * nodes every time, since each per-shard fetch iterates over its node entries:

public synchronized int getNumberOfInFlightFetches() {
    int count = 0;
    for (NodeEntry<T> nodeEntry : cache.values()) {
        if (nodeEntry.isFetching()) {
            count++;
        }
    }
    return count;
}

But in most cases every entry's fetching status is false after it has been scanned once, so there is no need to iterate over the nodes every time. In this PR we add a fetching cache to skip that work:

[screenshot: the proposed change]

ISSUE: #93631

@elasticsearchmachine added the needs:triage, v8.8.0, and external-contributor labels Feb 9, 2023
@DaveCTurner added the >bug and :Distributed/Allocation labels and removed the needs:triage label Feb 10, 2023
@elasticsearchmachine added the Team:Distributed label Feb 10, 2023
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner self-assigned this Feb 10, 2023
@DaveCTurner (Contributor)

@elasticmachine generate changelog

@DaveCTurner (Contributor)

Thanks @luyuncheng for spotting this and suggesting a fix. I think there's a better approach, however: we can make it so that getNumberOfInFlightFetches (and hasAnyNodeFetching) don't need to iterate at all. Instead, we can track the number of in-flight fetches directly in another field, adjusting the field when calling any of the methods that adjust NodeEntry#fetching and when adding or removing entries from AsyncShardFetch#cache. Would you try doing that?

@luyuncheng (Contributor, Author)

> Thanks @luyuncheng for spotting this and suggesting a fix. I think there's a better approach, however: we can make it so that getNumberOfInFlightFetches (and hasAnyNodeFetching) don't need to iterate at all. Instead, we can track the number of in-flight fetches directly in another field, adjusting the field when calling any of the methods that adjust NodeEntry#fetching and when adding or removing entries from AsyncShardFetch#cache. Would you try doing that?

LGTM, let me try.

@luyuncheng (Contributor, Author)

> we can track the number of in-flight fetches directly in another field

@DaveCTurner: in commit 6080f60 I added a fetchingCount field that tracks the number of in-flight fetches, which removes the iteration from getNumberOfInFlightFetches and hasAnyNodeFetching.
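For reference, here is a minimal self-contained sketch of the counter-tracking idea. It is a simplification, not the real AsyncShardFetch internals: the cache here maps node IDs to a plain fetching flag instead of NodeEntry objects, and markAsFetching/doneFetching are illustrative names.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class FetchCounterSketch {
    // nodeId -> whether a fetch is currently in flight for that node
    private final Map<String, Boolean> cache = new HashMap<>();
    private final AtomicInteger fetchingCount = new AtomicInteger();

    synchronized void markAsFetching(String nodeId) {
        // count the transition only if the entry was not already fetching
        if (Boolean.TRUE.equals(cache.put(nodeId, true)) == false) {
            fetchingCount.incrementAndGet();
        }
    }

    synchronized void doneFetching(String nodeId) {
        // count the transition only if the entry was actually fetching
        if (Boolean.TRUE.equals(cache.put(nodeId, false))) {
            fetchingCount.decrementAndGet();
        }
    }

    int getNumberOfInFlightFetches() {
        return fetchingCount.get(); // O(1), no iteration over the cache
    }

    boolean hasAnyNodeFetching() {
        return fetchingCount.get() > 0; // also O(1)
    }
}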

@DaveCTurner (Contributor) left a comment

Nice. I left a few small comments.

I'd also like to introduce a method to verify that the count is consistent when not running in prod:

private boolean assertFetchingCountConsistent() {
    assert Thread.holdsLock(this);
    assert fetchingCount.get() == cache.values().stream().filter(NodeEntry::isFetching).count();
    return true;
}

and then we can say assert assertFetchingCountConsistent(); after we've changed the cache contents. That way we should pick up any mistakes in this area quickly.
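For illustration, a hypothetical mutator showing where that assertion would sit; markAsFetching and the surrounding fields are stand-ins for the real AsyncShardFetch internals, not its actual API:

private synchronized void markAsFetching(String nodeId) {
    cache.get(nodeId).markAsFetching();     // flips NodeEntry#fetching to true
    fetchingCount.incrementAndGet();        // keep the counter in step
    assert assertFetchingCountConsistent(); // no-op when assertions are disabled (prod)
}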

On the changed lines, the old tail of the iterating implementation (} / return count;) is replaced by:

public int getNumberOfInFlightFetches() {
    return fetchingCount.get();
}
@DaveCTurner:

Much nicer 😄

@@ -57,6 +58,7 @@
     private final Set<String> nodesToIgnore = new HashSet<>();
     private final AtomicLong round = new AtomicLong();
     private boolean closed;
+    private final AtomicInteger fetchingCount = new AtomicInteger();
@DaveCTurner:

I think this could just be a volatile int because it's only ever updated within synchronized methods.
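A sketch of that alternative, assuming every writer holds the monitor (helper names are hypothetical):

private volatile int fetchingCount; // read without the lock, written only under it

private synchronized void incrementFetchingCount() {
    fetchingCount += 1; // safe: the read-modify-write happens while synchronized
}

public int getNumberOfInFlightFetches() {
    return fetchingCount; // a volatile read sees the latest synchronized write
}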

Diff excerpt (the tail of a method, then the annotation discussed below):

        }
    }
    return false;

// visible for testing
@DaveCTurner:

I'm ok with keeping this private, I don't think it needs testing.

@luyuncheng:

++ modified

@DaveCTurner changed the title from "Reduce getNumberOfInFlightFetches iterator in ClusterHealthAction of AllocationService" to "Streamline AsyncShardFetch#getNumberOfInFlightFetches" Feb 10, 2023
@DaveCTurner (Contributor)

I have pushed a commit (5beddb8) which adds the required changelog YAML file. You'll need to pull this branch and merge or rebase your changes on top of my commit.

@DaveCTurner (Contributor)

@elasticmachine ok to test

@DaveCTurner (Contributor) left a comment

LGTM, thanks @luyuncheng

@DaveCTurner (Contributor)

(it's Friday evening here, I will not merge this before Monday)

@DaveCTurner merged commit 8cd7170 into elastic:main Feb 13, 2023
salvatore-campagna pushed a commit to salvatore-campagna/elasticsearch that referenced this pull request Feb 16, 2023
Avoids an O(#nodes) iteration by tracking the number of fetches directly.
saarikabhasi pushed a commit to saarikabhasi/elasticsearch that referenced this pull request Apr 10, 2023
Avoids an O(#nodes) iteration by tracking the number of fetches directly.
DaveCTurner pushed a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 5, 2023
Avoids an O(#nodes) iteration by tracking the number of fetches directly.

Backport of elastic#93632 to 7.17
DaveCTurner added a commit that referenced this pull request Jun 5, 2023
Avoids an O(#nodes) iteration by tracking the number of fetches directly.

Backport of #93632 to 7.17

Co-authored-by: luyuncheng <luyuncheng@bytedance.com>
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 5, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 5, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 5, 2023
HiDAl pushed a commit to HiDAl/elasticsearch that referenced this pull request Jun 14, 2023