
Feature: Troubleshooting

Vratislav Podzimek edited this page Apr 19, 2017 · 19 revisions

Overview

The idea here is to give high-level notifications about problems occurring on the system. Trello card

Scope

Overview and workflow of high-level issues in Cockpit. Types of issues:

Stories

Sarah Manning is a part-time sysadmin at an IT startup, spending the other half of her time as a backend developer on the company's upcoming product. Making sure the company's servers are secure is very important to her. They run two Atomic servers in their infrastructure: one is a development server, and the other is a production server.

Phillip J Fry is a junior sysadmin at a medium-sized company. They have 3 file servers called Juniper, Redwood and Pine, with five disks on each machine set up as RAID6. Phillip logs in to check that everything is all right. It seems it is not. There is a storage error on Juniper.

Robert Paulson is a developer at a small IT company with 20 employees. For one reason or another, he got tossed the sysadmin hat at the company. They have one build server that he gets to take care of.

Workflows

Sarah logs in to the dashboard. She sees that both servers have containers that the container security scan has flagged as unsafe. She decides to focus on the production server first and deals with that. Once she is done with the production server, she moves on to the development server.

Phillip goes to the multi-server dashboard and sees a message saying that the server Juniper has a storage error. He clicks the notification and learns that one disk has failed and that the RAID is currently degraded. He identifies the disk, takes a new one from the shelf where they keep the spare disks, pulls out the old disk, and pops in the new one. Everything is now well.

Robert logs in to the server for the first time in a while. He gets SELinux errors and a container error. He looks at the container error. It's an old image they don't use any more, so he deletes it. He then moves on to the SELinux error. It seems it was due to a misconfiguration in one of their apps, so he fixes that.

Wireframes

Troubleshooting Flow Troubleshooting

Prior art

Feedback

A surprisingly difficult question for Phillip to work out is which physical disk to pull. He has to translate some piece of information (e.g., a device file, a multipath device node, a serial number, or a WWID) into, sometimes but not always, a particular bay in an enclosure. If he has had unusual foresight, he has a piece of paper showing the layout of the enclosures, with labels giving, for example, the WWIDs of the disks in each enclosure. If not, he may be able to locate the offending disk by blinking a status LED. Luckily, he has only three arrays to deal with, and they are in the same rack, so he will probably be looking in the right direction when the LED blinks. -mulkieran
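
To make the translation step in the feedback above concrete, here is a minimal Python sketch. It assumes a Linux host where udev populates `/dev/disk/by-id` with symlinks (such as `wwn-...` names) pointing at device nodes. The function names `ids_by_device` and `find_bay`, and the `bay_layout` dictionary standing in for Phillip's piece of paper, are hypothetical and not part of Cockpit.

```python
import os


def ids_by_device(by_id_dir="/dev/disk/by-id"):
    """Map each resolved device node to the persistent names that link to it."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, name)
        if not os.path.islink(link):
            continue  # only symlinks carry a name-to-device relation
        target = os.path.realpath(link)
        mapping.setdefault(target, []).append(name)
    return mapping


def find_bay(device, id_map, bay_layout):
    """Return the bay recorded for any persistent name of `device`, else None.

    `bay_layout` is a hand-maintained dict (hypothetical) from a persistent
    name, e.g. a WWID-based one, to a human-readable bay label.
    """
    for name in id_map.get(os.path.realpath(device), []):
        if name in bay_layout:
            return bay_layout[name]
    return None
```

Calling `find_bay("/dev/sdb", ids_by_device(), bay_layout)` would return the recorded bay label if one of the disk's persistent names appears in the layout, and `None` otherwise, which is the "sometimes but not always" case described above.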

See also
