
schedule: do not report status for first and last in suite jobs #1472

Merged: 1 commit merged into ceph:master on Jun 30, 2020

Conversation

@kshtsk (Contributor) commented May 12, 2020

Addresses the issue where a teuthology run gets stuck with
first_in_suite or last_in_suite jobs in the queued state.

Attention: This change requires both of the following steps, neither of which is sufficient on its own:

  1. Restart the teuthology workers on the server; otherwise the old
    worker code will try to remove the already-reported job from paddles
    and exit with an unexpected exception.
  2. Update users' teuthology runner environments to recent code, because
    new workers will not clean up first_in_suite (FIS) and last_in_suite (LIS)
    jobs; jobs scheduled from old code will remain in paddles, and the run
    will get stuck.

Requires: a34fb6a

Fixes: http://tracker.ceph.com/issues/43291

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
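
For reference, a minimal sketch of the guard this change describes on the scheduling side (the helper name is hypothetical; report.try_push_job_info is the reporter entry point teuthology uses elsewhere, with its signature assumed here):

```python
from teuthology import report


def push_job_status_if_needed(job_config):
    """Hypothetical helper: skip paddles reporting for marker jobs.

    first_in_suite/last_in_suite are bookkeeping jobs; reporting their
    status to paddles is what left runs stuck in the queued state.
    """
    if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
        return
    # Assumed signature: try_push_job_info(job_config, extra_info)
    report.try_push_job_info(job_config, dict(status='queued'))
```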

@kshtsk added the DNM label May 12, 2020
@kshtsk requested a review from tchaikov May 12, 2020 16:55
@kshtsk (Contributor, Author) commented May 12, 2020

@tchaikov I guess you're familiar with --first-in-suite and --last-in-suite jobs; I wonder why we would ever want to push their statuses to paddles?

@kshtsk (Contributor, Author) commented May 12, 2020

@susebot run deploy

@vasukulkarni (Contributor) commented:

I don't remember, but I think the email needs to know when to send, and for that it needs the status of those jobs; we cannot remove them.


@kshtsk (Contributor, Author) commented May 12, 2020

> I don't remember, but I think the email needs to know when to send, and for that it needs the status of those jobs; we cannot remove them.

@vasukulkarni This patch does NOT remove these jobs from beanstalkd or the workers; we simply stop posting their status to paddles. The first thing the worker does when it picks these jobs up from beanstalkd is delete them from paddles and start a background reporter process that polls the runs.
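
For context, a simplified sketch of the pre-existing worker behavior described above (report.try_delete_jobs appears in this PR's diff; the teuthology-results invocation is an assumption, not copied from the worker source):

```python
import subprocess

from teuthology import report


def handle_marker_job(job_config):
    """Sketch of what the worker did with FIS/LIS jobs, per the
    description above; not verbatim teuthology code."""
    if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
        # The worker has already reserved the job from beanstalkd; here
        # it only removes the job's record from paddles...
        report.try_delete_jobs(job_config['name'], job_config['job_id'])
        # ...then starts a background reporter that polls the run and,
        # for last_in_suite, eventually sends the results email.
        subprocess.Popen(['teuthology-results', '--name', job_config['name']])
```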

@vasukulkarni (Contributor) commented:

The changes look OK to me, but rolling this out manually will require quite a bit of coordination in the labs.

@kshtsk (Contributor, Author) commented May 12, 2020

> The changes look OK to me, but rolling this out manually will require quite a bit of coordination in the labs.

Yeah, we will need @djgalloway's help anyway; I will spend some time testing in an isolated environment before I remove the DNM label.

@susebot commented May 12, 2020

Commit 3e05a70 is OK.
Check the test results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/165/

@kshtsk requested review from djgalloway and zmc May 13, 2020 19:34
@kshtsk (Contributor, Author) commented May 15, 2020

retest this please

@tchaikov previously approved these changes May 16, 2020
@@ -191,8 +191,6 @@ def prep_job(job_config, log_file_path, archive_dir):
 def run_job(job_config, teuth_bin_path, archive_dir, verbose):
     safe_archive = safepath.munge(job_config['name'])
     if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
-        if teuth_config.results_server:
-            report.try_delete_jobs(job_config['name'], job_config['job_id'])
@kshtsk (Contributor, Author) commented May 24, 2020

It seems we still need this to support older users: they may not be able to update their teuthology sandboxes to the latest code, so such jobs may continue to be scheduled by old clients.
The only option I see is to add a try/except block to keep the worker from dying, and maybe a couple of retries with a pause to handle the situation we hit downstream, where paddles has accepted the job-add request in a thread that is still awaiting its turn, while the worker has already picked up the LIS and FIS jobs and starts processing them, trying to delete a job that has not been created yet.

@kshtsk (Contributor, Author) commented:

Corresponding PR #1524
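
A minimal sketch of the guarded deletion suggested in the comment above: a try/except so the worker survives, plus a few retries with a pause for the race where paddles has not created the job yet (the helper name and retry parameters are hypothetical; the actual follow-up is #1524):

```python
import logging
import time

from teuthology import report
from teuthology.config import config as teuth_config

log = logging.getLogger(__name__)


def delete_marker_jobs_safely(name, job_id, tries=5, pause=10):
    """Delete a FIS/LIS job from paddles without killing the worker."""
    if not teuth_config.results_server:
        return
    for attempt in range(1, tries + 1):
        try:
            report.try_delete_jobs(name, job_id)
            return
        except Exception:
            # paddles may still be processing the job-add request in
            # another thread, so the job may not exist yet; log, wait,
            # and try again instead of letting the worker die.
            log.exception('Delete attempt %d/%d for job %s/%s failed',
                          attempt, tries, name, job_id)
            time.sleep(pause)
```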

@kshtsk (Contributor, Author) commented Jun 25, 2020

@susebot run deploy

@susebot commented Jun 25, 2020

Commit b164cdc is NOT OK.
Check the test results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/215/

@kshtsk (Contributor, Author) commented Jun 25, 2020

Both failed jobs are due to an unrelated issue:

2020-06-25T11:36:26.071 INFO:teuthology.orchestra.run.target-pr-1472-047:> sudo yum -y install ceph-mgr-diskprediction-local
2020-06-25T11:36:26.838 INFO:teuthology.orchestra.run.target-pr-1472-047.stdout:Last metadata expiration check: 0:03:27 ago on Thu 25 Jun 2020 11:32:59 AM UTC.
2020-06-25T11:36:26.864 INFO:teuthology.orchestra.run.target-pr-1472-045.stdout:  Installing       : ceph-grafana-dashboards-2:16.0.0-2890.ge07ebed.el8.n   4/5
2020-06-25T11:36:27.320 INFO:teuthology.orchestra.run.target-pr-1472-045.stdout:  Installing       : ceph-mgr-dashboard-2:16.0.0-2890.ge07ebed.el8.noarch   5/5
2020-06-25T11:36:27.360 INFO:teuthology.orchestra.run.target-pr-1472-047.stdout:(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
2020-06-25T11:36:27.361 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr:Error:
2020-06-25T11:36:27.362 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr: Problem: conflicting requests
2020-06-25T11:36:27.362 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr:  - nothing provides python3-scikit-learn needed by ceph-mgr-diskprediction-local-2:16.0.0-2890.ge07ebed.el8.noarch
2020-06-25T11:36:27.407 DEBUG:teuthology.orchestra.run:got remote process result: 1

@kshtsk (Contributor, Author) commented Jun 25, 2020

Another run passed:
https://ceph-ci.suse.de/job/teuthology-deploy-ovh/267/

@kshtsk removed the DNM label Jun 25, 2020
@kshtsk requested a review from tchaikov June 25, 2020 21:24
@tchaikov merged commit 645528e into ceph:master Jun 30, 2020
@kshtsk deleted the wip-dont-report-status branch July 6, 2020 16:57