Possible counter undercount when `max_clock_skew_usec` set below guideline #2182

Closed
aphyr opened this issue Aug 28, 2019 · 2 comments

Comments

@aphyr
commented Aug 28, 2019

With 1.3.1.2-b1, I've seen occasional failures in the counter test. For example, this run, with clock skew as the only nemesis, saw the observed value of a counter drop below the expected value after just 1300 increments. The following structure shows each successful read of the counter, in [lower-bound, read-value, upper-bound] format; the lower and upper bounds are determined by the number of known-successful and possibly-successful increment transactions, respectively.

                              [67 67 79]
                              [232 233 241]
                              [336 338 350]
                              [445 446 459]
                              [457 457 472]
                              [477 478 494]
                              [501 501 520]
                              [503 503 521]
                              [561 563 584]
                              [691 692 714]
                              [813 814 837]
                              [878 878 904]
                              [1094 1095 1129]
                              [1202 1203 1244]
                              [1222 1223 1264]
                              [1225 1225 1265]
                              [1387 1386 1434]

The final read observed 1386, but 1387 successful increment transactions had completed before that read began. The issue persists throughout the remainder of the test; reads are often (though not always) 1 or 2 smaller than the lower bound.
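The bounds check described above can be sketched in a few lines. This is only an illustration in Python, not the Jepsen checker itself: the triples are copied from the list above, and the function name `out_of_bounds` is made up for this example.

```python
# Each triple is (lower_bound, read_value, upper_bound), copied from the
# reads listed in this report. This is an illustrative re-check, not the
# actual Jepsen counter checker.
READS = [
    (67, 67, 79),
    (232, 233, 241),
    (336, 338, 350),
    (445, 446, 459),
    (457, 457, 472),
    (477, 478, 494),
    (501, 501, 520),
    (503, 503, 521),
    (561, 563, 584),
    (691, 692, 714),
    (813, 814, 837),
    (878, 878, 904),
    (1094, 1095, 1129),
    (1202, 1203, 1244),
    (1222, 1223, 1264),
    (1225, 1225, 1265),
    (1387, 1386, 1434),
]


def out_of_bounds(reads):
    """Return the triples whose read value lies outside [lower, upper]."""
    return [(lo, val, hi) for lo, val, hi in reads if not lo <= val <= hi]


violations = out_of_bounds(READS)
for lo, val, hi in violations:
    print(f"read {val} is outside the allowed range [{lo}, {hi}]")
```

On the data above, only the last read is flagged: it falls one below its lower bound.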

The underlying transactions are:

  (setup-cluster! [this test c conn-wrapper]
    (c/execute! c (j/create-table-ddl table-name [[:id :int "PRIMARY KEY"]
                                                  [:count :int]]))
    (c/insert! c table-name {:id 0 :count 0}))

  (invoke-op! [this test op c conn-wrapper]
    (case (:f op)
      ; update! can't handle column references
      :add (do (c/execute! op c [(str "UPDATE " table-name " SET count = count + ? WHERE id = 0") (:value op)])
               (assoc op :type :ok))

      :read (let [value (c/select-single-value op c table-name :count "id = 0")]
              (assoc op :type :ok :value value))))

This isn't a serializability violation, since stale reads are serializable. But... the counter test exists to test linearizability, so we should expect this to pass, right? Is that because single-key operations are still supposed to be linearizable? If so, we might have either a stale read or a lost update. Given that the error doesn't grow over time, I'm leaning a little more towards stale reads, though it could be that we had just 1 or 2 lost updates, and those effects colored the rest of the test.

I should also note that, as with every test I've been running for the last week or so, we've set the clock skew tolerance to 1 microsecond, which has so far resulted in no linearizability failures, no transaction failures... it's weird. I'm not entirely sure whether clock skew should affect this test or not, but the fact that it doesn't affect so many of our tests is worrying.

@kmuthukk kmuthukk added this to To do in Jepsen Testing via automation Aug 28, 2019

@mbautin

Collaborator

commented Aug 28, 2019

Just for context, this test was run with --max_clock_skew_usec set to 1 microsecond, while the actual clock skew introduced by the test was much higher than that, so stale reads are expected in that setup.

@kmuthukk kmuthukk changed the title Possible counter undercount Possible counter undercount when `max_clock_skew_usec` set below guideline Sep 4, 2019

@kmuthukk

Collaborator

commented Sep 4, 2019

As @mbautin wrote, this issue only happens when --max_clock_skew_usec is set much below the actual clock skew introduced between nodes during the test.

In this case, --max_clock_skew_usec was overridden to 1 (1 microsecond), whereas the default is 50000 (50 milliseconds). For correctness, we recommend setting --max_clock_skew_usec to "the expected maximum clock skew between any two nodes in your deployment."

With the maximum clock skew induced by the nemesis set to 40 ms, and the default setting for max_clock_skew_usec, this problem does not reproduce.
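To make the unit arithmetic concrete, here is a small Python sketch comparing an induced skew against the configured bound. The helper `skew_is_covered` is hypothetical and not part of YugabyteDB or Jepsen. The 40 ms figure is the one quoted above for the non-reproducing run; pairing it with the 1 microsecond override is an assumption for illustration, since the thread only says the original skew was "much higher" than 1 microsecond.

```python
USEC_PER_MSEC = 1000

# Default value of --max_clock_skew_usec as stated in this thread.
DEFAULT_MAX_CLOCK_SKEW_USEC = 50_000


def skew_is_covered(induced_skew_ms, max_clock_skew_usec):
    """True if the induced skew (in ms) fits within the configured bound (in usec)."""
    return induced_skew_ms * USEC_PER_MSEC <= max_clock_skew_usec


# 40 ms of induced skew against the default bound of 50000 usec.
covered_by_default = skew_is_covered(40, DEFAULT_MAX_CLOCK_SKEW_USEC)

# The same induced skew against the 1 usec override used in the failing runs
# (illustrative pairing; the exact skew in those runs is not given here).
covered_by_override = skew_is_covered(40, 1)

print(covered_by_default, covered_by_override)
```

With these numbers, 40 ms is 40000 microseconds, which fits under the 50000 microsecond default but not under the 1 microsecond override.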

@kmuthukk kmuthukk closed this Sep 4, 2019

Jepsen Testing automation moved this from To do to Done Sep 4, 2019

5 participants