
[workspace]: add force-stop check on stopping workspaces #5184

Merged
merged 1 commit into main from sje/force-stop-stopping-ws
Aug 13, 2021

Conversation

@mrsimonemms (Contributor) commented Aug 13, 2021

Further work on #5055

Since #4910 stopped counting "stopping" workspaces for billing purposes,
any workspace caught in a "stopping" phase would never be force-stopped.
This adds an optional "excludeStopping" boolean (defaulting to true) to the
DB implementation, and the meta-instance-controller simply includes that
phase in its search.

It was discovered that ~200 workspaces were caught in this phase (90%
prebuilds), so it is necessary to force-stop workspaces stuck in it.
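Roughly, in SQL terms, the effect of the flag looks like this (illustrative only; the real query is built with TypeORM in workspace-db-impl.ts, and the phase list here is a simplification):

```sql
-- excludeStopping = true (the default): "stopping" instances are filtered out,
-- as existing callers (e.g. billing-style queries after #4910) expect.
SELECT id FROM d_b_workspace_instance
WHERE phasePersisted NOT IN ('stopped', 'stopping');

-- excludeStopping = false (what the meta-instance-controller now uses), so
-- instances stuck in "stopping" are also returned and can be force-stopped.
SELECT id FROM d_b_workspace_instance
WHERE phasePersisted NOT IN ('stopped');
```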

@mrsimonemms (Contributor, Author)

@jankeromnes has manually deleted all instances stuck in stopping, except the following (all owned by Gitpodders). When this is merged into prod, check that these instances are successfully stopped:

  • bf6fe0d1-4213-4843-b003-c89a9e263965
  • b5e3519a-fd7d-4cf1-9691-6efa3cab0520
  • 99f3149d-fe76-4085-9807-0da13c725e73
  • 7bf0d14a-258e-47ac-9720-7a7d832dbf94
  • 3123c642-df61-4a60-b81a-6ecba467a129

@JanKoehnlein (Contributor)

Is there any way to test this in the preview env?

@jankeromnes (Contributor) commented Aug 13, 2021

Another super cool fix! Many thanks @mrsimonemms 🙏 🚀

> It was discovered that ~200 workspaces were caught in this phase (90% prebuilds)

... all created since yesterday around midnight UTC (2021-08-12T00:05:14.984Z), so this could be a side-effect of recent incidents (but it's still 100% a good idea to not leave instances stuck in stopping forever 💯)

> @jankeromnes has manually deleted all instances stuck in stopping

To clarify, I've manually forced them back to stopped phase -- I haven't actually deleted them 😅

The query:

mysql> update d_b_workspace_instance set status = JSON_SET(status, '$.phase', 'stopped'), phasePersisted = 'stopped' where phase = 'stopping' and stoppedTime = '' and STR_TO_DATE(stoppingTime, '%Y-%m-%dT%H:%i:%s.%fZ') < (NOW() - INTERVAL 2 HOUR) and id not in ('bf6fe0d1-4213-4843-b003-c89a9e263965','b5e3519a-fd7d-4cf1-9691-6efa3cab0520','99f3149d-fe76-4085-9807-0da13c725e73','7bf0d14a-258e-47ac-9720-7a7d832dbf94','3123c642-df61-4a60-b81a-6ecba467a129');
Query OK, 208 rows affected (0.06 sec)
Rows matched: 208  Changed: 208  Warnings: 0
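(If re-running something like this elsewhere, a SELECT with the same predicate is a cheap dry run to preview which rows the UPDATE would touch; the same id exclusion can be appended if needed:)

```sql
-- Dry run: same predicate as the UPDATE above, modifies nothing.
SELECT id, phasePersisted, stoppingTime
FROM d_b_workspace_instance
WHERE phase = 'stopping'
  AND stoppedTime = ''
  AND STR_TO_DATE(stoppingTime, '%Y-%m-%dT%H:%i:%s.%fZ') < (NOW() - INTERVAL 2 HOUR);
```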

> Is there any way to test this in the preview env?

I guess you can:

  1. Start and stop a workspace
  2. Open the PR in Gitpod, then connect to the DB (kubectl port-forward statefulset/mysql 3306 & mysql -h 127.0.0.1 -ptest gitpod)
  3. Update the d_b_workspace_instance entry to be stuck in stopping since > 2 hours ago, e.g. like so:
mysql> update d_b_workspace_instance set status = JSON_SET(status, '$.phase', 'stopping'), phasePersisted = 'stopping', stoppedTime = '', creationTime = '2021-08-13T05:00:00.000Z', stoppingTime = '2021-08-13T05:00:00.000Z';

(Warning: there is no WHERE clause, so this will update all workspace instances in this deployment. Should be okay, but FYI; a variant scoped to a single instance is sketched after this list.)

  4. Wait for the PR to do its clean-up job (the instance should eventually go back to stopped)
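A variant of the step-3 update scoped to a single instance ('<instance-id>' is a placeholder for the id of the workspace instance stopped in step 1):

```sql
-- Same update as step 3, but limited to one instance instead of the whole table.
UPDATE d_b_workspace_instance
SET status = JSON_SET(status, '$.phase', 'stopping'),
    phasePersisted = 'stopping',
    stoppedTime = '',
    creationTime = '2021-08-13T05:00:00.000Z',
    stoppingTime = '2021-08-13T05:00:00.000Z'
WHERE id = '<instance-id>';  -- placeholder, not a real instance id
```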

@jankeromnes (Contributor) left a comment


Code looks 99% good to me!

Added a few thoughts in-line.

components/gitpod-db/src/typeorm/workspace-db-impl.ts (outdated review thread, resolved)
@JanKoehnlein (Contributor)

Test passed.

Commit message:

Since #4910 stopped counting "stopping" workspaces for billing purposes,
any workspace caught in a "stopping" phase would never be force-stopped.
This adds a conditional "includeStopping" boolean (defaulting to `false`)
to the DB implementation, and the meta-instance-controller simply includes
that phase in its search.

It was discovered that ~200 workspaces were caught in this phase (90%
prebuilds), so it is necessary to force-stop workspaces stuck in it.
@JanKoehnlein (Contributor)

/lgtm

@roboquat (Contributor)

LGTM label has been added.

Git tree hash: f4fc46160a894b20bf3fd4dce3b3c841568e376d

@roboquat added the lgtm label Aug 13, 2021
@roboquat (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JanKoehnlein

Associated issue: #5055

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat merged commit f35e762 into main Aug 13, 2021
@roboquat deleted the sje/force-stop-stopping-ws branch August 13, 2021 11:12
Successfully merging this pull request may close these issues.

Please force-stop workspace instances that are "stuck" in a bad state