RFC: Creating a public facing status page #108
Comments
We're still kinda struggling with how to define an incident which requires docs like this internally, but I think having someone outside the org with fresh eyes might help, so I'm for this. FWIW, we've used https://www.statuspage.io for CocoaPods for a few years and it's worked out well.
Atlassian bought StatusPage a couple years ago. Since then, Atlassian has also released Jira Ops (an internal tool to track incident status) which features integrations with StatusPage for external communications. @sweir27 has been looking into Jira Ops during our recent Hackathon.
Is there always a customer support associate on call for partner-facing incidents? If so, it might make sense for that associate to handle updates to the status page in collaboration with on-call engineers during an incident. If there isn't a user-facing customer support representative available, I think a reasonable fallback is for the engineers on call to handle updates to the status page.
Agree with @dblandin about the on-call people updating the status page! I still mean to investigate how it hooks into Jira Ops, but hopefully this would be seamless. 😄
I'm pro a status page. In my experience at previous companies, the business side of the house also finds value in being able to attest to platform stability with a well-designed status page -- so I think there's a lot of business value, in addition to customer-relations value, in the status page. You can also provide custom HTML/CSS to statuspage.io pages, so we can make sure the page follows our brand.
Not quite. Our hours (in EST) are: Monday: 10am-6pm I'm always here 10am-6pm though (barring illness, vacation, etc). During those hours, I'm happy to help for sure! Otherwise, I tend to agree that it may need to fall back onto one of the on-call engineers.
This is huge too. We do often get questions/concerns about our stability. Being able to point to something concrete would be huge for that too. A nice added benefit.
I gave an update on this RFC during our engineering open standup meeting this morning. @sweir27 and I plan to sync up and discuss an action plan. Will comment again once I have another update!
As this is now happening 🎉, I think we can close this RFC!
Resolution
We decided to do it!
Level of Support
2: Positive feedback.
Next Steps
We're now tracking the progress of this (and other incident-related process updates) in a Jira epic: https://artsyproduct.atlassian.net/browse/PLATFORM-1048 🔒
Exceptions
We will not be tracking internal services on the new status page.
Proposal:
We'd like to create a public facing status page for our gallery partners.
We believe we could use Atlassian's StatusPage for this so hopefully the amount of dev work needed is low.
The most important things to monitor externally are likely the following to start:
Who actually will update this page in the event of an outage/major disruption is up for discussion still. It may make most sense for one of the two engineers on-call in #incidents to update the status page.
Reasoning
Having a public facing status page would greatly improve Gallery Relations' ability to handle outages or service disruptions by allowing partners to 'self-serve'.
Currently, if there is an outage or major service disruption (e.g. Conversations is down), it requires Partner Support to send at least 3 messages per user who writes in:
Generally, there is only one person on-shift at a time. If 25 galleries write in, that's ~75 messages for one person to send on top of the normal queue. Having a way for partners to check if something is resolved themselves could minimize this for Gallery Relations.
An added benefit could be having a history of issues as well.
Exceptions:
There may be other external services we do not need to monitor. The Genomer applications come to mind but there may be more.
Additional Context:
Our current status.artsy.net page uses Pingdom uptime checks, which may not give us the flexibility we need externally.
Alexander is happy to do as much of it as possible to minimize the amount of time needed from engineers.
You can see our discussion in Slack here