Add KSR recovery from transient Etcd errors #571

Merged
tiewei merged 5 commits into contiv:master from jmedved:master
Feb 5, 2018

@jmedved jmedved commented Feb 5, 2018

The main purpose of this pull request is to implement recovery from transient Etcd errors that the agent status monitor, which also monitors Etcd, does not detect. In particular, the agent status monitor misses fast Etcd restarts (lasting a couple of seconds), which may result in data loss.

To detect data loss in Etcd, we use Etcd's capability to record a revision for every item it stores. A KSR status record was introduced in Etcd; for now it holds just KSR statistics (a collection of stats for the reflectors in the KSR). The status record is updated periodically. On every update, the record's Etcd revision is compared with its last recorded value; if the new revision is not strictly greater than the last recorded one, a resync is triggered. This functionality is implemented in plugin_impl_ksr.go (new functions: monitorEtcdStatus(), processEtcdMonitorEvent(), checkEtcdTransientError(), ksrHasSynced() and getKsrStats()).
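The revision check described above can be sketched roughly as follows. This is a minimal illustration, not the code in plugin_impl_ksr.go: the revChecker type and its needsResync method are hypothetical names, and the revisions are supplied by hand rather than read from Etcd.

```go
package main

import "fmt"

// revChecker remembers the last Etcd revision seen for the KSR status record.
// Hypothetical type for illustration; it is not taken from plugin_impl_ksr.go.
type revChecker struct {
	lastRev int64
	seen    bool
}

// needsResync records rev and reports whether it failed to advance past the
// previously recorded revision, which is treated as a sign of data loss.
func (c *revChecker) needsResync(rev int64) bool {
	regressed := c.seen && rev <= c.lastRev
	c.lastRev = rev
	c.seen = true
	return regressed
}

func main() {
	c := &revChecker{}
	// The jump from 12 back to 3 simulates an Etcd restart that lost data.
	for _, rev := range []int64{10, 11, 12, 3, 4} {
		if c.needsResync(rev) {
			fmt.Printf("rev %d: revision did not advance, trigger resync\n", rev)
			continue
		}
		fmt.Printf("rev %d: ok\n", rev)
	}
}
```

In this sketch the checker adopts the new revision as its baseline after a regression, so a single restart leads to a single resync rather than one on every subsequent update.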

To support the above changes, the code base also underwent some cleanup:

  • The Writer and Lister interfaces and mocks were consolidated into a single interface/mock. This cleaned up reflector initialization and testing.

  • Error injection in the mock was made finer-grained (one setting for the ListValues() function, one for all other data operations) so that all error paths in KSR data resync can be exercised in unit tests.

  • Reflector object type handling was cleaned up (constants introduced for each object type and then used consistently throughout the code).
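As a rough sketch of the consolidated mock with two separate error-injection settings, consider the following. It is a simplified stand-in, not the mock from this pull request: the mockBroker type, its field names, and the string-valued Put/Delete/ListValues signatures are assumptions made for illustration (the real interface works with protobuf messages).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errInjected is a sentinel error used for injection in this sketch.
var errInjected = errors.New("injected error")

// mockBroker is a hypothetical single mock that covers both writing and
// listing, with independent error injection for each group of operations.
type mockBroker struct {
	data    map[string]string
	listErr error // returned only by ListValues
	opErr   error // returned by all other data operations (Put, Delete)
}

func newMockBroker() *mockBroker {
	return &mockBroker{data: map[string]string{}}
}

// Put stores val under key unless a write error has been injected.
func (m *mockBroker) Put(key, val string) error {
	if m.opErr != nil {
		return m.opErr
	}
	m.data[key] = val
	return nil
}

// Delete removes key unless a write error has been injected.
func (m *mockBroker) Delete(key string) error {
	if m.opErr != nil {
		return m.opErr
	}
	delete(m.data, key)
	return nil
}

// ListValues returns all entries under prefix unless a list error is injected.
func (m *mockBroker) ListValues(prefix string) (map[string]string, error) {
	if m.listErr != nil {
		return nil, m.listErr
	}
	out := map[string]string{}
	for k, v := range m.data {
		if strings.HasPrefix(k, prefix) {
			out[k] = v
		}
	}
	return out, nil
}

func main() {
	m := newMockBroker()
	_ = m.Put("pod/a", "spec-a")

	// Fail only the listing step of a resync; writes keep working.
	m.listErr = errInjected
	_, err := m.ListValues("pod/")
	fmt.Println("list error:", err)
	fmt.Println("put error:", m.Put("pod/b", "spec-b"))

	// Fail only the write step; listing works again.
	m.listErr, m.opErr = nil, errInjected
	vals, err := m.ListValues("pod/")
	fmt.Println("listed:", len(vals), "list error:", err)
	fmt.Println("delete error:", m.Delete("pod/a"))
}
```

Keeping the two settings separate lets a unit test drive a resync into the "list failed" path and the "list succeeded but write failed" path independently.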

coveralls commented Feb 5, 2018

Coverage decreased (-0.8%) to 76.065% when pulling c65ee1b on jmedved:master into f104b8b on contiv:master.


// mockKeyProtoVaBroker is a mock implementation of KSR's interface to the
// key-value data store.
type mockKeyProtoVaBroker struct {

Typo: this should be mockKeyProtoValBroker (the "l" is missing).

@tiewei tiewei left a comment

💯

@tiewei tiewei merged commit 886230f into contiv:master Feb 5, 2018
@jmedved jmedved changed the title Add recovery from transient Etcd errors Add KSR recovery from transient Etcd errors Feb 6, 2018

4 participants