
Support combining SSDs into a raid #5085

Merged 2 commits from cw/ssd-raid into main on Aug 6, 2021
Conversation

@csweichel (Contributor)

Useful for GKE deployments
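
For illustration, combining a node's local SSDs into a single software RAID looks roughly like the sketch below. This is a minimal sketch, not the exact init script added in this PR; the /dev/sdc../dev/sdj device range mirrors the raid-local-disks trace quoted later in this conversation, while the /dev/md0 target and the ext4 filesystem are assumptions.

#!/bin/bash
set -euo pipefail

# Collect whatever local SSD block devices are attached to the node.
DISKS=()
for suffix in c d e f g h i j; do
  dev="/dev/sd${suffix}"
  [ -b "$dev" ] || break
  DISKS+=("$dev")
done

# Refuse to continue if the node has no local SSDs at all.
if [ "${#DISKS[@]}" -eq 0 ]; then
  echo "no local disks detected!" >&2
  exit 1
fi

# Stripe all detected disks into one RAID0 array and put a filesystem on it.
mdadm --create /dev/md0 --level=0 --raid-devices="${#DISKS[@]}" "${DISKS[@]}"
mkfs.ext4 /dev/md0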

@mrsimonemms (Contributor) left a comment


/lgtm

@roboquat (Contributor) commented Aug 6, 2021

LGTM label has been added.

Git tree hash: cdd2fb617237261659421cd5d2f86682773fd08f

@csweichel (Contributor, Author)

/approve no-issue

@roboquat (Contributor) commented Aug 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csweichel, MrSimonEmms

Associated issue requirement bypassed by: csweichel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [MrSimonEmms,csweichel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit 3b93768 into main Aug 6, 2021
@roboquat roboquat deleted the cw/ssd-raid branch August 6, 2021 07:33
@jankeromnes (Contributor)

@csweichel @aledbf @mrsimonemms Hi!

FYI, this change seems to be related to a cluster incident in which a ws-daemon pod keeps crashlooping (this has happened several times since this PR was deployed).

The symptoms are:

  • A ws-daemon pod keeps crashlooping, triggering an alert

  • kubectl describe shows this event:

$ kubectl describe pod ws-daemon-zc64j
[...]
Events:
  Type     Reason   Age                   From     Message
  ----     ------   ----                  ----     -------
  Warning  BackOff  112s (x390 over 86m)  kubelet  Back-off restarting failed container
  • When getting the logs for all containers (see the kubectl logs command after the trace), it seems to be the first init container, raid-local-disks, that fails with:
...
+ DISK_DEV_SUFFIX=(c d e f g h i j)
+ MAX_NUM_DISKS=8
+ NUM_DISKS=0
+ declare -a DISKS
++ seq 0 7
+ for i in `seq 0 $((MAX_NUM_DISKS-1))`
+ CURR_DISK=/dev/sdc
+ '[' '!' -b /dev/sdc ']'
+ break
+ '[' 0 -eq 0 ']'
+ echo 'no local disks detected!'
+ exit 1
no local disks detected!
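
For reference, the failing init container's logs can be pulled directly with kubectl; the pod and container names below are taken from the output above:

$ kubectl logs ws-daemon-zc64j -c raid-local-disks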

@jankeromnes (Contributor)

Update: We now believe the code is behaving as expected (after #5096 as well), i.e. ws-daemon fails to init if setupSSDRaid is enabled but there are no disks for it.

The incident looks more like a configuration problem: there is a headless pool in EU, which is unexpected, and it seems to both enable setupSSDRaid and have no disks for it, leading its single ws-daemon to crashloop.
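
One way to confirm this from a node in that pool is to check whether any block devices exist in the range the init script scans. This is a sketch; the /dev/sdc../dev/sdj range comes from the trace above, and how you get a shell on the node (e.g. SSH) is up to you:

# List the node's disks; local SSDs would appear as /dev/sdc../dev/sdj here.
lsblk -d -o NAME,SIZE,TYPE
ls -l /dev/sd[c-j] 2>/dev/null || echo "no local SSDs attached to this node"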
