---
title: "SRE and BDD: The Ultimate Power Pair"
date: "2019-04-12 23:50:00"
slug: "sre-and-bdd-the-ultimate-power-pair"
image: "/images/sre-and-bdd-the-ultimate-power-pair/header.jpg"
keywords:
- enterprise
- strategy
- digital transformation
- sre
- reliability
- devops
- cloud
- engineering
- culture
- communities
- centers of excellence
- agile
- bdd
- product development
- cucumber
---
The responsibilities of a Reliability Engineer are well understood: maintain a high degree of
service availability so that customers can have a consistently enjoyable and predictable experience.
How these goals are accomplished --- establishing SLOs with customers, enforcing them through
monitoring SLIs and exercising the platform against failure through Game Days --- is also well
understood. Much of the existing literature on SRE covers these concepts in great depth, and for
good reason: failing to establish a contract with the customer on availability expectations for the
service that they are paying for is a great way for its engineers to spend their entire careers
fire-fighting.

However, there are times when the definition of availability is not as clear-cut. If a
web service responds successfully within its availability SLO guidelines (say, 99.95%), but the content
that's actually served by that service is incorrect 30% of the time, then your engineers will likely
still spend a large portion of their time fire-fighting despite their Reliability dashboards
looking good.
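
To put rough numbers on that gap, here is a minimal sketch using the figures from the example above. It assumes that correctness is independent of availability, which is a simplification, and `effective_success_rate` is a name I made up for illustration.

```ruby
# Figures from the example above.
availability = 0.9995 # fraction of requests answered within the SLO
correctness  = 0.70   # fraction of answered requests with correct content

# From the customer's point of view, a request is only "good" when it is
# both answered and correct (assumes the two are independent).
def effective_success_rate(availability, correctness)
  availability * correctness
end

rate = effective_success_rate(availability, correctness)
puts format("Effective success rate: %.1f%%", rate * 100)
```

The dashboard reports 99.95%, but customers experience something closer to 70%.
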
There are various ways of capturing these details through black-box monitoring techniques, such as
the Prometheus [`blackbox_exporter`](https://github.com/prometheus/blackbox_exporter) or synthetic
testing services from Sauce Labs or [New Relic](https://docs.newrelic.com/docs/synthetics). (My
personal favorite is [Cachet](https://cachethq.com) with
[Cachet Monitor](https://github.com/castawaylabs/cachet-monitor) running alongside it.) The Google
Customer Reliability Engineering team [mentions a great
example](https://cloud.google.com/blog/products/gcp/available-or-not-that-is-the-question-cre-life-lessons)
of a prober they added to an example Shakespeare search service to measure malformed queries.
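
The core idea behind probes like these is small enough to sketch in Ruby. This is not how any of the tools above are implemented; `ProbeResult` and `healthy?` are hypothetical names, and the responses are hard-coded so that the example runs without a network.

```ruby
# A probe result is healthy only when the status is 200 AND the body
# contains the text we expect. A fast, well-formed wrong answer fails.
ProbeResult = Struct.new(:status, :body)

def healthy?(result, expected_text:)
  result.status == 200 && result.body.include?(expected_text)
end

# Hard-coded sample responses (hypothetical).
up_and_correct = ProbeResult.new(200, "<h1>Log in to your to-do list</h1>")
up_but_wrong   = ProbeResult.new(200, "<h1>Something went wrong</h1>")
down           = ProbeResult.new(503, "")

[up_and_correct, up_but_wrong, down].each do |result|
  puts "#{result.status}: healthy=#{healthy?(result, expected_text: 'Log in')}"
end
```

In a real probe, the status and body would come from an HTTP client such as `Net::HTTP.get_response`. Note that its `code` is a string, so it would need `to_i` before the comparison.
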
However, one simpler and more transparent method that I don't often see discussed is leveraging
acceptance tests and behavior-driven development. That's what I'll discuss in this post.
# BDD and SRE: An Unexpected Power Pair
_Behavior-Driven Development_, or BDD, helps provide a continuous interface through which product teams and
engineering can collaborate and iterate on feature development. On healthy product teams, feature
development through BDD looks something like this:
- Product teams begin the conversation for a new feature with an acceptance test: a file written in
English that describes what the feature is and how it should behave.
- Engineering writes a failing implementation for that acceptance test by way of
_step definitions_, then writes code that, ultimately, makes those step definitions pass.
- Once the acceptance test for that feature passes, the code for that feature enters the release
process through to production via continuous integration.
# An Example of BDD in Action
Here's a simple example of this in action. Your company maintains a sharp-looking to-do list
product. Customer feedback collected from surveys has demonstrated a clear need for integrating your
login workflow with third-party OAuth providers, namely Google and Facebook. In preparation for your
bi-weekly story grooming session, a product owner might author an acceptance test with Cucumber that
looks like this:
{{< highlight "cucumber" >}}
# features/login/third_party_auth.feature
Feature: Logging in with Third-Party Providers
  While many of our customers are happy with our login flow,
  surveys are showing a clear need for authenticating via third-parties like Google
  and Facebook.

  Scenario: Logging in with Google
    Given an instance of our to-do app
    And a valid Google Account
    When I navigate to the login page at "/login"
    Then I see a button that lets me log in with Google
    And I enter the Google authentication flow once it is clicked
    And I can successfully log into our to-do app with our account
{{< / highlight >}}
Ideally, these acceptance tests would live in a separate repository since they are closer to
integration tests than service-level tests. This also makes continuous acceptance testing easier to
accomplish since the pipeline running the tests only needs to operate against a single
repository instead of potentially many repositories. However, using a single repository for all
acceptance tests can complicate pull requests for service repositories since running an entire suite
of acceptance tests for a single PR is expensive and probably unnecessary. This can be engineered
around, but it requires a bit of work.

After Product and Engineering agree on the scope of this feature and its timing in the backlog, an
engineer might author a failing series of step definitions for this feature, some of which might look
something like this:
{{< highlight "ruby" >}}
# features/step_definitions/third_party_auth.rb
require 'todo-app'
require 'vault'

Given("an instance of our to-do app") do
  @todo_app = TodoApp::Client.new
end

Given("a valid Google Account") do
  # The password is fetched from Vault so that it never lives in the repository.
  @google_account = {
    username: 'test@gmail.com',
    password: Vault::Client.get_value_for(key: 'test@gmail.com',
                                          path: '/todo/testing/accounts',
                                          token: ENV['VAULT_TOKEN'])
  }
end

When("I navigate to the login page at {string}") do |url|
  @todo_app.visit url
end

Then("I see a button that lets me log in with Google") do
  # Assumes Capybara's `page` and RSpec matchers are loaded.
  expect(page).to have_xpath("//button[@id='google_login']")
end
{{< / highlight >}}
Once the engineer playing this story is able to make this series of step definitions pass,
Engineering and Product can play the acceptance test end-to-end to confirm that the feature implemented
is in the ballpark of what they were looking for. (Yay for automating QA!) Once this is agreed upon,
the feature gets released into Production through their CI/CD pipelines.
# An Example of BDD for Site Reliability in Action
We can employ the same tactics outlined above to define availability constraints. However, in this
instance, the Reliability team would be submitting these acceptance tests instead of Product.

Let's say that data collected from user session tracking shows that, of the 100,000 users who
use our to-do app in any given month, 85% of those who wait more than five seconds for the login
page leave our app, presumably for a competitor like Todoist. Because our company is backed by
venture capital, growth is our company's primary metric. Obtaining growth at any cost helps with
future funding rounds that will help the company explore more expensive market plays and fund a
potential IPO in the future. Thus, capturing as many of that fleeting 85% as possible is pretty
critical.

To that end, the Reliability team can write an acceptance test that looks like this:
{{< highlight "cucumber" >}}
@reliability
Feature: Timely logins
  Prevent users from bouncing early by ensuring that we can hit the login page in a timely manner.

  Scenario: Login page within five seconds
    Given an instance of the to-do app
    When I navigate to the login page
    Then the login page loads in five seconds or less at least ten times in succession.
{{< /highlight >}}
Notice the `@reliability` tag at the top of this acceptance test. This tag is important, as it allows
us to run only the acceptance tests that focus on reliability. Since these tests are
intended to be quick, we can run them on a schedule several times per hour. If the failure rate for
these tests (a metric that your observability stack would capture) climbs too high, then
Reliability can decide to roll back or fail forward. Additionally, developers can run these tests
during their local testing to gain greater confidence that they are releasing a reliable product and a
better sense of what "reliability" actually means.
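
Two pieces of this can be sketched in plain Ruby: the timing check behind the `Then` step of the feature above, and the failure-rate check that a scheduler might apply across runs (for example, runs of `cucumber --tags @reliability`). The helper names and the 10% threshold are my own assumptions, not part of Cucumber.

```ruby
# Returns true only if the block finishes within `limit_seconds` on each of
# `attempts` consecutive runs. Stops at the first slow run.
def loads_within?(limit_seconds:, attempts:)
  attempts.times.all? do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    elapsed <= limit_seconds
  end
end

# Fraction of scheduled suite runs that failed; `results` holds one
# boolean per run (true = passed).
def failure_rate(results)
  return 0.0 if results.empty?

  results.count(false).fdiv(results.size)
end

# Hypothetical threshold: flag the service once more than 10% of runs fail.
def needs_attention?(results, max_failure_rate: 0.10)
  failure_rate(results) > max_failure_rate
end

# Inside a step definition, the timing helper might be wired up like this
# (hypothetical; `@todo_app.visit` follows the earlier step definitions):
#
#   Then("the login page loads in five seconds or less at least ten times in succession.") do
#     expect(loads_within?(limit_seconds: 5, attempts: 10) { @todo_app.visit "/login" }).to be true
#   end

recent_runs = [true, true, false, true, true, true, true, true, true, true]
puts failure_rate(recent_runs)     # 1 failure out of 10 runs
puts needs_attention?(recent_runs) # a rate of 0.1 is not above the threshold
```

Using a monotonic clock keeps the measurement immune to wall-clock adjustments on the machine running the tests.
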
# Reliability Tests Don't Replace Observability!
Feature testing tools like Cucumber are often used well beyond their initial scope, largely due to
how flexible they are. That said, _I am not arguing for removing observability tools_! Quite the
contrary, in fact: I think that reliability tests complement more granular and data-driven
monitoring techniques quite nicely.

Going back to our `/login` example, setting a service-level objective around liveness
(whether `/login` returns `HTTP 200/OK` or not) still helps a lot in giving customers a
general expectation of how available this service will be during a given period. Driving that
objective with feature tests would be complicated and slow, and slow metrics make it much harder for
teams to hit their SLO targets. A fuller story of the service's health comes from combining the two:
near-real-time monitoring against the `/login` service, a dashboard showing its uptime and remaining
error budget, and a widget showing the rate at which its reliability tests are passing.
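
As a rough illustration of the "remaining error budget" figure on such a dashboard, here is a minimal sketch. The request counts are made up, and `remaining_error_budget` is a hypothetical helper rather than something a monitoring tool provides under that name.

```ruby
# Fraction of the error budget still unspent for the current window.
# `slo` is the target success ratio, e.g. 0.9995 for 99.95%.
def remaining_error_budget(slo:, total_requests:, failed_requests:)
  allowed_failures = total_requests * (1.0 - slo)
  return 0.0 if allowed_failures <= 0

  remaining = (allowed_failures - failed_requests) / allowed_failures
  remaining.clamp(0.0, 1.0)
end

# Made-up numbers: 100,000 requests to /login this window, 20 of them failed.
budget = remaining_error_budget(slo: 0.9995,
                                total_requests: 100_000,
                                failed_requests: 20)
puts format("Remaining error budget: %.0f%%", budget * 100)
```

With a 99.95% SLO, 100,000 requests allow about 50 failures, so 20 failures leave roughly 60% of the budget.
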
# Wrapping Up
Setting SLOs and chasing SLIs are tenets most Reliability Engineers understand well. However, these
metrics alone may not paint a complete picture of what it means for a service to be "up."
Additionally, these metrics are pretty opaque: developers, Product, or anyone else outside of the
Reliability team who wants to know how things work so well all of the time might have a dashboard
or two as their only recourse.

Reliability tests use behavior-driven development and acceptance testing principles to bridge this
gap. Authoring reliability tests gives non-Reliability engineers a better understanding of
availability expectations, and it shifts some of the onus of making sure that the code is reliable
onto the developer. Additionally, because they are written in plain English, everyone can understand
them, which means that everyone can talk about and iterate on them.

Give it a try!