Create strategy for the website status page #3838

Closed
1 task done
Tracked by #137 ...
patphongs opened this issue Jun 11, 2020 · 6 comments

@patphongs
Member

patphongs commented Jun 11, 2020

Summary

What we're after:
As FEC product managers and developers, we need a strategy for when and what to post on a website status page so that we can inform public website users when there is a website incident.

Need rules for:

  • When to post an incident, and the criteria for deciding quickly whether a message needs to be posted
  • What those incident messages should say; do we need a variety of example messages based on the outage or incident?
  • Which team members are responsible for posting these incident messages (production support?) when an incident occurs

Completion criteria

  • Provide criteria for elevating an incident that is posted to the website status page
@JonellaCulmer
Contributor

JonellaCulmer commented Jun 16, 2020

Preliminary resources:
Hosted system status page:
https://status.io/
https://www.atlassian.com/software/statuspage

Usability: Visibility of system status:
https://www.nngroup.com/articles/visibility-system-status/

@JonellaCulmer
Contributor

JonellaCulmer commented Jun 17, 2020

Examples using statuspage.io:
https://cloudgov.statuspage.io/
https://www.githubstatus.com/#past-incidents
https://status.mural.co/
https://status.slack.com/

Cloud.gov has documentation on their process that we can review:
https://cloud.gov/docs/ops/service-disruption-guide/

  • What might our documentation say? Who is it for?

Who is the status page for?
At this point, our understanding is that the status page is for a non-technical audience that needs quick, informative updates when something unexpected happens.

Which projects/products need to be included on this status page versus which products are covered with a banner update? Do we need to change our banner processes?

Projects/Products:

  • E-filing outages?
  • FEC.gov?
  • EFO maintenance
  • docquery.fec.gov
  • EQS
  • eregs

Functionality:

  • Downloads
  • Data refresh (the nightly process wasn't completed)
  • Elasticsearch (Example: Legal data - Elasticsearch)

How soon is a message posted?

  • At what point after an incident is detected do we post a message? Cloud.gov documents the window between when a staff member notes the problem and when a message is posted.
  • Once something is down for XX time, what happens next?
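One way to make the timing question concrete is a small check that flags an incident once it has gone unposted for longer than an agreed window. This is only a sketch: the `Incident` class, the `needs_post` function, and the 30-minute window are hypothetical names and values for illustration, since the actual window ("XX time") has not been decided.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    component: str                        # e.g. "Downloads" or "Data refresh"
    detected_at: datetime                 # when a staff member first noted the problem
    posted_at: Optional[datetime] = None  # when a status message went up, if any


def needs_post(incident: Incident, now: datetime, post_window: timedelta) -> bool:
    """Return True when the incident has gone unposted for at least post_window."""
    if incident.posted_at is not None:
        return False
    return now - incident.detected_at >= post_window


# Hypothetical values for illustration only; the real window is still undecided.
window = timedelta(minutes=30)
detected = datetime(2020, 6, 17, 9, 0)
incident = Incident(component="Downloads", detected_at=detected)

print(needs_post(incident, detected + timedelta(minutes=10), window))
print(needs_post(incident, detected + timedelta(minutes=45), window))
```

Keeping the window as a required argument, rather than a default, means the code cannot quietly encode a policy the team has not agreed on.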

Status page functionality that we want:

  • Integrates with Pingdom
  • Provides a way to post written incident status messages and shows current and past incidents
  • Multiple accounts or access for staff (production support, scheduled maintenance, individuals who post banners)
  • Incorporates products/projects and functionality, listed above
  • Easy to create and manage a new incident
  • Separate from FEC.gov
  • Easy to share updates/notifications. (Subscriptions?)
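To picture how the products and functionality listed above might roll up into one page-level status, here is a rough sketch. The `Status` levels and the `overall_status` function are assumptions made for illustration; they are not taken from statuspage.io or status.io, and the statuses assigned to each component are invented.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    # Assumed severity levels, ordered so that a higher value is worse.
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


def overall_status(components: Dict[str, Status]) -> Status:
    """The page-level status is the worst status among all components."""
    return max(components.values(), default=Status.OPERATIONAL)


# Component names come from the lists above; the statuses are made up.
components = {
    "FEC.gov": Status.OPERATIONAL,
    "docquery.fec.gov": Status.OPERATIONAL,
    "Downloads": Status.DEGRADED,
    "Data refresh": Status.OPERATIONAL,
    "Elasticsearch": Status.OPERATIONAL,
}

print(overall_status(components).name)
```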

Next steps:

  • Research status page best practices for functionality - Laura
  • Compare statuspage.io and status.io as potential solutions - Jonella
  • Review cloud.gov documentation and see where we can adopt some processes - Jonella and Laura
  • Have a follow-up discussion about who shares responsibility for posting

@JonellaCulmer
Contributor

The Slack conversation about the status page starts here: https://fecgov.slack.com/archives/C3X3K6EVA/p1592416026408900

@JonellaCulmer
Contributor

@lbeaufort
Member

lbeaufort commented Jun 23, 2020

Status page best practices

Questions:

Why have a status page?

  • Transparency around your incident history and reliability builds trust with new and existing customers

  • Makes it easier for teams to communicate with their customers during incidents

When to have a status page?

When to (partially) automate

Recommendations

  • Define which team(s) and roles own the Statuspage. This is crucial for initial implementation and longevity. Better to sort this out Day 1, before the first live incident.

  • Document access/account management in fec-accounts

  • Follow cloud.gov practices, generally. Ask at office hours if they would be willing to share their templates.

References:
https://www.donnfelker.com/you-need-a-status-page/
https://hackernoon.com/build-a-great-status-page-in-15-minutes-with-no-budget-98257f67aef1
https://www.atlassian.com/incident-management/handbook#what-is-an-incident
https://cloud.gov/docs/ops/service-disruption-guide/

@JonellaCulmer
Contributor

@PaulClark2 @AmyKort @patphongs
I would like to schedule a discussion with you all to go over this proposed language. Some of it depends on which status page service provider we choose, which we should also discuss.

Here's a comparison between top services:
https://docs.google.com/spreadsheets/d/1ykAuWjL65uZm-sLt7IcOEWmz6h6EAP4lUYCd-5QXW4Q/edit#gid=0
Within each service there are different tiers depending on the number of team members and other features we would need.

Draft language for the status page processes based on cloud.gov documentation:
https://docs.google.com/document/d/1vE3zgh2Mh5h8h07ob5EXi9capgB-e5Ic0nR_R9Imb8Y/edit#heading=h.hlgjk513nsep

Draft example messages for status page posting:
https://docs.google.com/document/d/1DapPOsnSN7Q9P3E6OM2BHqYlHKj5ELHjt49BRpFFG5U/edit#heading=h.cvhs00beblfp

cc: @lbeaufort @dorothyyeager
