
Default settings lead to infinite loop when a server is removed #362

Open
adumovic opened this issue Nov 16, 2021 · 4 comments

@adumovic

I have observed this issue in driver versions v4.5.2 and v5.5.0; the code listed and investigated below is from v4.5.2.

To reproduce this behavior, I deployed Aerospike 5.5 Community Edition via the Helm charts.

I deployed a cluster of 6 replicas into GKE and then ran a set of clients that continually create and get records (using the Operate method for creates). If I then remove any of the 6 replicas by restarting its pod, CPU on most of the service pods using the Go client spikes from roughly 5% up to 200%, and it does not recover even after waiting more than 15 minutes.
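
In case it helps with reproduction, this is essentially what each test client does (a simplified sketch; the seed host, namespace, set, and bin names are placeholders rather than my actual config):

package main

import (
	"fmt"
	"log"
	"strconv"

	as "github.com/aerospike/aerospike-client-go" // v4 import path; v5 lives under .../v5
)

func main() {
	// Placeholder seed host; in my setup this is the in-cluster Aerospike service.
	client, err := as.NewClient("aerospike", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for i := 0; ; i++ {
		key, kerr := as.NewKey("test", "demo", "k"+strconv.Itoa(i%1000))
		if kerr != nil {
			log.Fatal(kerr)
		}

		// Create and read back the record in a single Operate call.
		// A nil policy means the client defaults apply, i.e. TotalTimeout == 0.
		if _, oerr := client.Operate(nil, key,
			as.PutOp(as.NewBin("val", i)),
			as.GetOp(),
		); oerr != nil {
			fmt.Println("operate error:", oerr)
		}
	}
}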

I ran the profiler and discovered that every pod with runaway CPU was spending most of its time in the command.executeAt method, and I tracked it to this block (command.go line 2114):

		cmd.conn, err = ifc.getConnection(policy)
		if err != nil {
			isClientTimeout = true

			if err == types.ErrConnectionPoolEmpty {
				// if the connection pool is empty, we still haven't tried
				// the transaction to increase the iteration count.
				iterations--
			}
			logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
			continue
		}

The node does get replaced by Kubernetes after a few seconds, but the specific node the client command holds (the Node from the Cluster struct) never gets updated.
This block of code will execute forever, always failing at cmd.conn, err = ifc.getConnection(policy), because the node was bound to the command when it was created. The loop can sit here indefinitely, using the bad node and decrementing iterations every time the pool is empty, so the retry limit is never reached.

Setting a value for TotalTimeout (which defaults to 0) resolves the infinite loop, but there are still cases where, after one of the nodes becomes unhealthy, the tend routine never forgets the old node and its connection pool is exhausted (even though a new StatefulSet pod has been created and joined the cluster to replace it). To work around this, I added a wrapper layer around the entire client library that creates a new client and closes the old one whenever a network error occurs (a rough sketch is at the end of this comment). If there is some way to avoid this I would much prefer that!
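
Concretely, the timeout workaround on my side looks roughly like this (the field names are from the v4/v5 policy structs; the values are illustrative, not a recommendation):

package example

import (
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

func newWritePolicy() *as.WritePolicy {
	// Example values only; the important part is that TotalTimeout is
	// non-zero so the retry loop has an upper bound.
	p := as.NewWritePolicy(0, 0)
	p.SocketTimeout = 500 * time.Millisecond
	p.TotalTimeout = 2 * time.Second // defaults to 0, i.e. no overall limit
	p.MaxRetries = 3
	return p
}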

I see people have reported similar behavior with the defaults in past issues. If this is truly the expected behavior with the default settings, perhaps add a note to the best practices/performance section of the docs, as it could help people avoid surprises!

I am also curious why it keeps retrying on the same exact connection; wouldn't it be a lot more resilient if it randomly chose a new connection each time? As far as I can tell, the node is selected much earlier, as soon as the command is created.
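
For completeness, here is a very rough sketch of the wrapper layer I mentioned above (isNetworkError and the reconnect details are simplified stand-ins for my actual code, not anything provided by the client library):

package store

import (
	"sync"

	as "github.com/aerospike/aerospike-client-go"
)

type clientWrapper struct {
	mu     sync.RWMutex
	client *as.Client
	host   string
	port   int
}

// isNetworkError is a placeholder for my real classification; in practice it
// matches connection/timeout error types coming back from the client.
func isNetworkError(err error) bool {
	return err != nil
}

func (w *clientWrapper) operate(policy *as.WritePolicy, key *as.Key, ops ...*as.Operation) (*as.Record, error) {
	w.mu.RLock()
	c := w.client
	w.mu.RUnlock()

	rec, err := c.Operate(policy, key, ops...)
	if err != nil && isNetworkError(err) {
		// Throw the whole client away and build a fresh one so we stop
		// reusing connections that point at the dead node.
		w.mu.Lock()
		if w.client == c { // don't race another goroutine's replacement
			if newClient, cerr := as.NewClient(w.host, w.port); cerr == nil {
				w.client = newClient
				c.Close()
			}
		}
		w.mu.Unlock()
	}
	return rec, err
}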

@khaf
Collaborator

khaf commented Nov 21, 2021

Sorry for my late reply; I started investigating this issue right away but forgot to leave an acknowledgement here.
I don't see how this can be the case, since both node and connection are set on every iteration, and policy.MaxRetries and policy.TotalTimeout independently control the number of times the command will retry.
I concur that there may be an underlying issue in the code somewhere, but I have not been able to reproduce it.
You mention that the command keeps retrying on the same connection, but that can't be the case. If the connection times out (controlled by policy.SocketTimeout), the connection will be closed, the command will be retried and a new connection will be assigned. If no connection attempts succeed, the policy.TotalTimeout will kick in and end the retry cycle.
The only scenario I can imagine causing this is the faulty node lingering around due to an issue in the partition update. But even in that case, policy.TotalTimeout should kick in.
I'm interested in your investigation results; they may help us identify the underlying issue.

@khaf khaf added the bug label Nov 21, 2021
@adumovic
Author

Yes, TotalTimeout avoids the infinite loop (though it does default to 0), but it does not fix the underlying issue of a broken node being continually used to service requests.

As for getting a new node assigned on each iteration of the loop, I looked deeper into the code to see what could be making it pick the same node. I found that even though the code to get a new node does run, a few factors keep it from making progress.

First, on the initial failure (and on subsequent ones), the code ends up executing the continue in the block I posted above:

		cmd.conn, err = ifc.getConnection(policy)
		if err != nil {
			isClientTimeout = true

			if err == types.ErrConnectionPoolEmpty {
				// if the connection pool is empty, we still haven't tried
				// the transaction to increase the iteration count.
				iterations--
			}
			logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
			continue
		}

Note that the client timeout flag is set to true: isClientTimeout = true.

After the first failure, notFirstIteration will be set, and ifc.prepareRetry will execute.

		if notFirstIteration {
			aerr, ok := err.(types.AerospikeError)
			if !ifc.prepareRetry(ifc, isClientTimeout || (ok && aerr.ResultCode() != types.SERVER_NOT_AVAILABLE)) {
			. . .

Looking at the implementation of the command type I am running (operate), I can see it ends up in cmd.partition.PrepareRetryWrite, called from operateCommand.prepareRetry:

func (ptn *Partition) PrepareRetryWrite(isClientTimeout bool) {
	if !isClientTimeout {
		ptn.sequence++
	}
}

However, since isClientTimeout == true, the partition sequence number will NOT be incremented.

Now we get to the code where the node is set (command.go line 2098):

		cmd.node, err = ifc.getNode(ifc)

Which for the operate command ends up here:

func (cmd *operateCommand) getNode(ifc command) (*Node, error) {
	if cmd.hasWrite {
		return cmd.partition.GetNodeWrite(cmd.cluster)
	}

	// this may be affected by Rackaware
	return cmd.partition.GetNodeRead(cmd.cluster)
}

Both GetNodeWrite and GetNodeRead have similar execution paths for the default ReplicaPolicy of SEQUENCE (the default value returned by policy.go::NewPolicy(), and what I have been using):

func (ptn *Partition) getSequenceNode(cluster *Cluster) (*Node, error) {
	replicas := ptn.partitions.Replicas

	for range replicas {
		index := ptn.sequence % len(replicas)
		node := replicas[index][ptn.PartitionId]

		if node != nil && node.IsActive() {
			return node, nil
		}
		ptn.sequence++
	}
	nodeArray := cluster.GetNodes()
	return nil, newInvalidNodeError(len(nodeArray), ptn)
}

For some reason, node.IsActive() always returns true here and in other places in command.go. As you can see, when that happens, and since the sequence number was not changed because the failure counted as a "client timeout", it returns the exact same node associated with the previous failed execution.
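
To make that interaction concrete, here is a toy reproduction of the selection logic above (stripped down, not the client's real types), showing that when the sequence never advances, the same replica comes back on every retry:

package main

import "fmt"

type node struct {
	name   string
	active bool
}

type partition struct {
	sequence int
	replicas []*node
}

// Mirrors getSequenceNode: pick replicas[sequence % len], skipping inactive ones.
func (p *partition) getSequenceNode() *node {
	for range p.replicas {
		n := p.replicas[p.sequence%len(p.replicas)]
		if n != nil && n.active {
			return n
		}
		p.sequence++
	}
	return nil
}

// Mirrors PrepareRetryWrite: the sequence only advances on non-client timeouts.
func (p *partition) prepareRetryWrite(isClientTimeout bool) {
	if !isClientTimeout {
		p.sequence++
	}
}

func main() {
	p := &partition{replicas: []*node{
		{name: "dead-but-still-IsActive", active: true}, // IsActive() keeps returning true
		{name: "healthy-replica", active: true},
	}}

	for i := 0; i < 3; i++ {
		fmt.Println("retry", i, "->", p.getSequenceNode().name)
		// Connection pool empty -> isClientTimeout == true -> no advance.
		p.prepareRetryWrite(true)
	}
	// Prints "dead-but-still-IsActive" three times: the retry loop stays pinned
	// to the node that IsActive() still reports as healthy.
}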

It then eventually gets back to this block, where everything starts all over.

		cmd.conn, err = ifc.getConnection(policy)
		if err != nil {
			isClientTimeout = true

			if err == types.ErrConnectionPoolEmpty {
				// if the connection pool is empty, we still haven't tried
				// the transaction to increase the iteration count.
				iterations--
			}
			logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
			continue
		}

While setting the timeout avoids an infinite loop, it does not allow the driver to recover. The bad nodes stick around, stuck in a similar error path, until I create a new client and swap my wrapper over to it.

If there is a bug here, it seems more likely to be related to the code that tends the nodes. I will try to get you some more details, or perhaps figure out how to get into this state with the debugger attached.

@khaf
Collaborator

khaf commented Feb 7, 2022

Sorry for the very late reply; I didn't get any notifications from the repo for a while. Thank you for your analysis of the issue. From what I see, we have an issue with the identification of client-side vs server-side errors. It is not as trivial as it seems; I'll have to give some thought to how best to handle it. I've pushed this issue to the top of my list and will release a fix before the week is out.

@adumovic
Author

Any updates?
