RFC: Creating a public facing status page #108

Closed
alexanderbercow opened this issue Nov 14, 2018 · 7 comments

@alexanderbercow

Proposal:

We'd like to create a public facing status page for our gallery partners.

We believe we could use Atlassian's StatusPage for this so hopefully the amount of dev work needed is low.

The most important things to monitor externally are likely the following to start:

  • Artsy.net
  • cms.artsy.net
  • writer.artsy.net

Who will actually update this page during an outage or major disruption is still up for discussion. It may make the most sense for one of the two on-call engineers in #incidents to update the status page.

Reasoning

Having a public facing status page would greatly improve Gallery Relations' ability to handle outages or service disruptions by allowing partners to 'self-serve'.

Currently, an outage or major service disruption (e.g. Conversations being down) requires Partner Support to send at least 3 messages to each user who writes in:

  1. A confirmation that their issue is caused by the outage
  2. A holding message letting partners know we're working on it
  3. A follow-up notifying them that it's been fixed

Generally, there is only one person on-shift at a time. If 25 galleries write in, that's ~75 messages for one person to send on top of the normal queue. Having a way for partners to check if something is resolved themselves could minimize this for Gallery Relations.
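
As a rough sketch of the arithmetic above: the three messages per partner come from the list in this proposal, while the function name and the 25-gallery figure are only illustrative.

```python
# Messages Partner Support sends to each partner who writes in during an
# outage: a confirmation, a holding message, and a resolution follow-up.
MESSAGES_PER_PARTNER = 3


def outage_message_load(partners_writing_in: int) -> int:
    """Return the total number of manual replies needed for one outage."""
    if partners_writing_in < 0:
        raise ValueError("partners_writing_in must be non-negative")
    return partners_writing_in * MESSAGES_PER_PARTNER


print(outage_message_load(25))  # 75
```

The load grows linearly with the number of partners who write in, which is the cost a self-serve status page would reduce.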

An added benefit could be having a history of issues as well.

Exceptions:

There may be other external services we do not need to monitor. The Genomer applications come to mind but there may be more.

Additional Context:

Our current status.artsy.net page uses Pingdom uptime checks, which may not give us the flexibility we need externally.

Alexander is happy to do as much of the work as possible to minimize the time needed from engineers.

You can see our discussion in Slack here

@orta
Contributor

orta commented Nov 15, 2018

We're still kinda struggling with how to define an incident which requires docs like this internally, but I think having someone outside the org with fresh eyes might help, so I'm for this.

FWIW, we've used https://www.statuspage.io for cocoapods for a few years and it's worked out well.

@dblandin
Member

FWIW, we've used https://www.statuspage.io for cocoapods for a few years and it's worked out well

Atlassian bought StatusPage a couple of years ago. Since then, Atlassian has also released Jira Ops (an internal tool to track incident status), which features integrations with StatusPage for external communications. @sweir27 has been looking into Jira Ops during our recent Hackathon.

Generally, there is only one person on-shift at a time.

Is there always a customer support associate on call for partner-facing incidents? If so, it might make sense for that associate to handle updates to the status page in collaboration with the on-call engineers during an incident.

If there isn't a user-facing customer support representative available, I think a reasonable fallback is for the on-call engineers to handle updates to the status page.

dblandin added the RFC label Nov 16, 2018
@sweir27
Contributor

sweir27 commented Nov 16, 2018

Agree with @dblandin about the on-call people handling updates to the status page! I still mean to investigate how it hooks into Jira Ops, but hopefully this would be seamless. 😄

@dleve123
Contributor

I'm pro a status page.

In my experience at previous companies, the business side of the house also finds value in being able to attest to platform stability with a well-designed status page -- so I think there's a lot of business value, in addition to customer-relations value, in the status page.

You can also provide custom HTML/CSS to statuspage.io pages, so we can make sure the page follows our brand.

@alexanderbercow
Author

alexanderbercow commented Nov 16, 2018

Is there always a customer support associate on call for partner-facing incidents?

Not quite. Our hours (in EST) are:

Monday: 10am - 6pm
Tuesday: 4am - 6pm
Wednesday: 4am - 6pm
Thursday: 4am - 6pm
Friday: 10am - 6pm

I'm always here 10am-6pm though (barring illness, vacation, etc). During those hours, I'm happy to help for sure! Otherwise, I tend to agree that it may need to fall back to one of the on-call engineers.

[[Edit]]:
I should note that on Tuesday/Wednesday/Thursday the 4am to 10am EST hours are covered by our European/Hong Kong Gallery Liaisons. I would personally prefer that they not post or manage status page updates, if possible.
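
To make the fallback concrete, here is a minimal sketch of the rule as I read it from this thread. The function name and the returned labels are hypothetical, and it assumes the caller passes a time already expressed in US Eastern local time.

```python
from datetime import datetime, time

# Hours during which Partner Support is available to update the page,
# per the schedule above. Times are assumed to be US Eastern local time.
SUPPORT_START = time(10, 0)
SUPPORT_END = time(18, 0)


def status_page_owner(now_eastern: datetime) -> str:
    """Return who should update the status page at the given Eastern time.

    Hypothetical rule: Partner Support owns updates on weekdays from 10am
    to 6pm; the on-call engineers own them at all other times, including
    the 4am-10am liaison shifts on Tuesday through Thursday.
    """
    is_weekday = now_eastern.weekday() < 5  # Monday is 0, Friday is 4
    in_hours = SUPPORT_START <= now_eastern.time() < SUPPORT_END
    if is_weekday and in_hours:
        return "partner-support"
    return "on-call-engineer"


print(status_page_owner(datetime(2018, 11, 14, 11, 30)))  # Wednesday, 11:30am
print(status_page_owner(datetime(2018, 11, 14, 5, 0)))    # Wednesday, 5am
```

The early liaison shifts deliberately resolve to the on-call engineers, matching the preference above that liaisons not manage the page.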

the business side of the house also finds value in being able to attest to platform stability with a well-designed status page

This is huge too. We often get questions and concerns about our stability, and being able to point to something concrete would help a lot. A nice added benefit.

@dblandin
Member

I gave an update on this RFC during our engineering open standup meeting this morning. @sweir27 and I plan to sync up and discuss an action plan. Will comment again once I have another update!

@dblandin
Member

As this is now happening 🎉, I think we can close this RFC!

Resolution

We decided to do it!

Level of Support

2: Positive feedback.

Next Steps

We're now tracking the progress of this (and other incident-related process updates) in a Jira epic: https://artsyproduct.atlassian.net/browse/PLATFORM-1048 🔒

Exceptions

We will not be tracking internal services on the new status page.
