
Separate persisted responses without knowing their revision to prevent duplicating state during linearization #18214

Merged
merged 1 commit into etcd-io:main from robustness-separate-persisted on Jun 25, 2024

Conversation

@serathius (Member)

/cc @siyuanfoundation @MadhavJivrajani @fuweid @fykaa @henrybear327

@k8s-ci-robot

@serathius: GitHub didn't allow me to request PR reviews from the following users: fykaa.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @siyuanfoundation @MadhavJivrajani @fuweid @fykaa @henrybear327

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius (Member Author)

I have an idea for a big rewrite of history patching, but it would require more tests. Since this PR would also benefit from them, I'll wait until we have those tests.

// MaybeEtcdResponse extends EtcdResponse to include partial information about responses to a request.
// Possible response state information:
// * Normal response. Client observed response. Only EtcdResponse is set.
// * Persisted. Client didn't observe response, but we know it was persisted by etcd. Only Persisted is set.
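
For context, a minimal sketch of the struct this comment describes; the fields below are inferred from how the apply switch later in this thread uses them, so the actual definition in the robustness tests may differ:

// Sketch only: fields inferred from their usage in this PR, not copied
// from the actual robustness-test model.
type MaybeEtcdResponse struct {
	EtcdResponse             // set for a normal, client-observed response
	Error             string // set when the client observed an error
	Persisted         bool   // request is known to be persisted by etcd (e.g. found in the WAL)
	PersistedRevision int64  // non-zero when the revision is also known (e.g. from a watch event)
}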
Contributor

Q: So, meaning the PUT was finalized in consensus + storage, but the response wasn't received from the client's point of view? Seems confusing, given that the function above, describeEtcdResponse, describes this as "unknown". Should we make the naming more consistent?

@serathius (Member Author)

A great question; I have been thinking about changing the naming here.

So, meaning the PUT was finalized in consensus + storage, but the response wasn't received from the client's point of view?

Yes, from the client's perspective it got an error, so we don't know the response, but we know that the request was persisted, thanks to looking it up in the WAL.

Seems confusing, given that the function above, describeEtcdResponse, describes this as "unknown". Should we make the naming more consistent?

The naming in the robustness tests is constantly evolving based on feedback. The name "unknown" was just meant to mean that we don't know the exact response, but we knew its revision due to a matching watch response. The name "persisted" was introduced into the code only recently, with the PR that added reading the WAL.

I think it would make sense to rename "unknown" to "persisted" in describeEtcdResponse. Does that sound good to everyone? @MadhavJivrajani @siyuanfoundation @fuweid

Contributor

Thanks for the explanation. "persisted" makes sense to me. Defer to other contributors :)

Contributor

"persisted" makes sense to me

Member

+1 to "persisted"

@fykaa (Contributor) commented Jun 25, 2024

Renaming to 'persisted' makes sense in the context of the WAL.

I am trying to understand: could there be any edge case where 'persisted' might not fully capture the state, or scenarios where additional qualifiers might be necessary?

@serathius (Member Author)

I am trying to understand: could there be any edge case where 'persisted' might not fully capture the state, or scenarios where additional qualifiers might be necessary?

Good question. I think the second case would be when a request generates an event and we observe it on a watch; this is where "PersistedRevision" comes from. So the name "persisted" might not capture all cases. Maybe "observed" would capture the whole meaning: observed not by the client, but through other side effects, like the WAL or a watch. I don't know whether "observed" is better than "persisted", though.

Let's not block this PR on picking the name. If someone is interested, feel free to file a separate PR to propose and discuss a new name.

@siyuanfoundation (Contributor)

I am confused by the PR. Can you give some context on why the code before this change would result in duplicated states?

@serathius (Member Author) commented Jun 24, 2024

I am confused by the PR. Can you give some context on why the code before this change would result in duplicated states?

This code is meant to remove the duplicated states caused by the apply function. When linearizing the operation history, apply is one of the functions used to decide: is this state transition possible (the bool), and what is the new state (the nonDeterministicState)? Based on the current state and the provided request, it tries to answer whether the provided response is possible. Depending on how much we know about the response, we might be able to detect early that the linearization branch we are exploring is not possible, and that is exactly what we want: to reject those branches as early as possible.

This is coded in the following function; the switch cases go from the most lenient to the strictest option for validating the response:

// apply reports whether the response is possible from the current set of
// states, and returns the new set of states after applying the request.
func (states nonDeterministicState) apply(request EtcdRequest, response MaybeEtcdResponse) (bool, nonDeterministicState) {
	var newStates nonDeterministicState
	switch {
	case response.Error != "":
		newStates = states.applyFailedRequest(request)
	case response.Persisted && response.PersistedRevision == 0:
		newStates = states.applyPersistedRequest(request)
	case response.Persisted && response.PersistedRevision != 0:
		newStates = states.applyPersistedRequestWithRevision(request, response.PersistedRevision)
	default:
		newStates = states.applyRequestWithResponse(request, response.EtcdResponse)
	}
	return len(newStates) > 0, newStates
}

If you check the applyFailedRequest code, it duplicates the states because it doesn't know whether etcd handled the request or not:

// applyFailedRequest returns both the original states and the states with the
// request applied, covering both cases: the request was persisted, and the
// request was lost.
func (states nonDeterministicState) applyFailedRequest(request EtcdRequest) nonDeterministicState {
	newStates := make(nonDeterministicState, 0, len(states)*2)
	for _, s := range states {
		newStates = append(newStates, s)
		newState, _ := s.Step(request)
		if !reflect.DeepEqual(newState, s) {
			newStates = append(newStates, newState)
		}
	}
	return newStates
}

If we know that the request was persisted in the etcd WAL, then we can just assume that it executed, and use different apply code:

// applyPersistedRequest applies request to all possible states.
func (states nonDeterministicState) applyPersistedRequest(request EtcdRequest) nonDeterministicState {
	newStates := make(nonDeterministicState, 0, len(states))
	for _, s := range states {
		newState, _ := s.Step(request)
		newStates = append(newStates, newState)
	}
	return newStates
}
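
To illustrate why this matters, here is a hypothetical toy model (not code from this PR): under applyFailedRequest every failed request can double the number of candidate states, while applyPersistedRequest keeps the count flat.

package main

import "fmt"

// Toy model for illustration only: a "state" is just an int counter and
// every request increments it. applyFailed keeps both the old and the new
// state, because the request may or may not have been applied;
// applyPersisted keeps only the new state.
func applyFailed(states []int) []int {
	out := make([]int, 0, len(states)*2)
	for _, s := range states {
		out = append(out, s, s+1) // request lost, request applied
	}
	return out
}

func applyPersisted(states []int) []int {
	out := make([]int, 0, len(states))
	for _, s := range states {
		out = append(out, s+1) // request definitely applied
	}
	return out
}

func main() {
	failed, persisted := []int{0}, []int{0}
	for i := 0; i < 3; i++ {
		failed = applyFailed(failed)
		persisted = applyPersisted(persisted)
	}
	fmt.Println(len(failed), len(persisted)) // prints: 8 1
}

Three failed requests already leave 8 candidate states to explore during linearization; knowing from the WAL that they were persisted collapses that to 1.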

@serathius force-pushed the robustness-separate-persisted branch 2 times, most recently from 44921d3 to 972d2be on June 24, 2024 at 19:37
Separate persisted responses without knowing their revision to prevent duplicating state during linearization

Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
@serathius (Member Author)

/retest

@serathius (Member Author) commented Jun 25, 2024

Any reviewers willing to review?

Just making sure we merge this soon, as the linearization timeouts have become frequent enough that they happen on presubmit: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/18227/pull-etcd-robustness-amd64/1805320012408295424

@siyuanfoundation (Contributor)

Thanks for explaining the details! LGTM

@serathius merged commit 917ded9 into etcd-io:main on Jun 25, 2024
49 checks passed