
TransportMlMemoryAction captures and retains ClusterState for an extended period #123243

@DaveCTurner

This lambda ...

(l, trainedModelCacheInfoResponse) -> handleResponses(
state,

... captures the entire ClusterState from the point at which the action started running, and retains it until both the TransportNodesStatsAction and then the TrainedModelCacheInfoAction have completed. Since both of those actions fan out to multiple nodes, they could take a long time (tens of seconds) to complete in an overloaded or otherwise faulty cluster. That's too long to retain a ClusterState: there's a good chance it'll be replaced by newer ClusterState instances in this time, but we can't GC the captured instance while those actions are running. Moreover, it appears that we only need a few select parts of the ClusterState in handleResponses().

We should instead extract and retain just those parts of the ClusterState that are needed to compute the final response.
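
To illustrate the idea, here is a minimal, self-contained sketch. It does not use the real Elasticsearch classes: `FakeClusterState`, `NeededParts`, `buildHandler` and the string-based `handleResponses` are all hypothetical stand-ins, and which parts of the state are actually needed is an assumption. The point is only that the lambda captures a small extracted object rather than the whole state.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins only; none of these are real Elasticsearch types.
final class RetainOnlyNeededParts {

    // Stand-in for ClusterState: a large object of which only a small part is needed later.
    record FakeClusterState(long version, List<String> nodeIds, byte[] largeUnrelatedData) {}

    // The small immutable subset needed to compute the final response (assumed here to be node IDs).
    record NeededParts(List<String> nodeIds) {
        static NeededParts from(FakeClusterState state) {
            return new NeededParts(List.copyOf(state.nodeIds()));
        }
    }

    // Extracts the needed parts up front. The returned lambda captures only `parts`,
    // so it holds no reference to the FakeClusterState while remote calls are in flight.
    static Function<String, String> buildHandler(FakeClusterState state) {
        NeededParts parts = NeededParts.from(state);
        return remoteResponse -> handleResponses(parts, remoteResponse);
    }

    // Stand-in for handleResponses(): takes the extracted parts instead of the full state.
    static String handleResponses(NeededParts parts, String remoteResponse) {
        return remoteResponse + " from " + parts.nodeIds().size() + " nodes";
    }

    public static void main(String[] args) {
        FakeClusterState state =
            new FakeClusterState(42L, List.of("node-1", "node-2"), new byte[1024]);
        Function<String, String> handler = buildHandler(state);
        state = null; // the handler does not keep the state reachable
        System.out.println(handler.apply("cache info"));
    }
}
```

In the real action the extraction would have to happen before the first fan-out request is sent, so that neither listener in the chain closes over the `state` variable.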

Labels: :ml, >bug, Team:ML
