
schedule: do not report status for first and last in suite jobs #1472

Merged: 1 commit merged into ceph:master on Jun 30, 2020

Conversation

@kshtsk (Contributor) commented May 12, 2020

Addresses the issue where a teuthology run gets stuck with
first_in_suite or last_in_suite jobs in the queued state.

Attention: This change requires both of the following steps, neither of which is sufficient on its own:

  1. Restart the teuthology workers on the server; otherwise the old
    worker code will try to remove the already-reported job from paddles
    and exit with an unexpected exception.
  2. Update users' teuthology runner environments to recent code, because
    new workers will not clean up first_in_suite (FIS) and last_in_suite (LIS)
    jobs; jobs scheduled from old code will remain in paddles, and the run
    will get stuck.

Requires: a34fb6a

Fixes: http://tracker.ceph.com/issues/43291

Signed-off-by: Kyr Shatskyy <kyrylo.shatskyy@suse.com>
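
For reference, a minimal sketch of the guard this change describes on the scheduling side (the helper name is hypothetical; report.try_push_job_info is the reporter entry point teuthology uses elsewhere, with its signature assumed here):

```python
from teuthology import report


def push_job_status_if_needed(job_config):
    """Hypothetical helper: skip paddles reporting for marker jobs.

    first_in_suite/last_in_suite are bookkeeping jobs; reporting their
    status to paddles is what left runs stuck in the queued state.
    """
    if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
        return
    # Assumed signature: try_push_job_info(job_config, extra_info)
    report.try_push_job_info(job_config, dict(status='queued'))
```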

@kshtsk added the DNM label May 12, 2020
@kshtsk requested a review from tchaikov May 12, 2020 16:55
@kshtsk (Contributor, Author) commented May 12, 2020

@tchaikov I guess you're familiar with --first-in-suite and --last-in-suite jobs; I wonder why we would ever want to push their statuses to paddles?

@kshtsk (Contributor, Author) commented May 12, 2020

@susebot run deploy

@vasukulkarni (Contributor) commented:

I don't remember, but I think the email needs to know when to send, and for that it needs the status of those jobs; we cannot remove them.


@kshtsk (Contributor, Author) commented May 12, 2020

> I don't remember, but I think the email needs to know when to send, and for that it needs the status of those jobs; we cannot remove them.

@vasukulkarni This patch does NOT remove these jobs from beanstalkd or the workers; we simply stop posting their status to paddles. The first thing the worker does when it picks these jobs up from beanstalkd is delete them from paddles and start a background reporter process that polls the runs.
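
For context, a simplified sketch of the pre-existing worker behavior described above (report.try_delete_jobs appears in this PR's diff; the teuthology-results invocation is an assumption, not copied from the worker source):

```python
import subprocess

from teuthology import report


def handle_marker_job(job_config):
    """Sketch of what the worker did with FIS/LIS jobs, per the
    description above; not verbatim teuthology code."""
    if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
        # The worker has already reserved the job from beanstalkd; here
        # it only removes the job's record from paddles...
        report.try_delete_jobs(job_config['name'], job_config['job_id'])
        # ...then starts a background reporter that polls the run and,
        # for last_in_suite, eventually sends the results email.
        subprocess.Popen(['teuthology-results', '--name', job_config['name']])
```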

@vasukulkarni (Contributor) commented:

The changes look OK to me, but rolling this out manually will require quite a bit of coordination in the labs.

@kshtsk (Contributor, Author) commented May 12, 2020

> The changes look OK to me, but rolling this out manually will require quite a bit of coordination in the labs.

Yeah, we will need @djgalloway's help anyway; I will spend some time testing in an isolated environment before I remove the DNM label.

@susebot commented May 12, 2020

Commit 3e05a70 is OK.
Check the test results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/165/

@kshtsk requested review from djgalloway and zmc May 13, 2020 19:34
@kshtsk (Contributor, Author) commented May 15, 2020

retest this please

@tchaikov previously approved these changes May 16, 2020
@@ -191,8 +191,6 @@ def prep_job(job_config, log_file_path, archive_dir):
 def run_job(job_config, teuth_bin_path, archive_dir, verbose):
     safe_archive = safepath.munge(job_config['name'])
     if job_config.get('first_in_suite') or job_config.get('last_in_suite'):
-        if teuth_config.results_server:
-            report.try_delete_jobs(job_config['name'], job_config['job_id'])
@kshtsk (Contributor, Author) commented May 24, 2020

It seems we still need this to support older users: they may not be able to update their teuthology sandboxes to the latest code, so such jobs may continue to be scheduled by old clients.
The only option I see is to add a try/except block to keep the worker from dying, and maybe a couple of retries with a pause to handle the situation we hit downstream, where paddles has accepted the job-add request in a thread that is still awaiting its turn, while the worker has already picked up the LIS and FIS jobs and starts processing them, trying to delete a job that has not been created yet.

@kshtsk (Contributor, Author) commented:

Corresponding PR #1524
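
A minimal sketch of the guarded deletion suggested in the comment above: a try/except so the worker survives, plus a few retries with a pause for the race where paddles has not created the job yet (the helper name and retry parameters are hypothetical; the actual follow-up is #1524):

```python
import logging
import time

from teuthology import report
from teuthology.config import config as teuth_config

log = logging.getLogger(__name__)


def delete_marker_jobs_safely(name, job_id, tries=5, pause=10):
    """Delete a FIS/LIS job from paddles without killing the worker."""
    if not teuth_config.results_server:
        return
    for attempt in range(1, tries + 1):
        try:
            report.try_delete_jobs(name, job_id)
            return
        except Exception:
            # paddles may still be processing the job-add request in
            # another thread, so the job may not exist yet; log, wait,
            # and try again instead of letting the worker die.
            log.exception('Delete attempt %d/%d for job %s/%s failed',
                          attempt, tries, name, job_id)
            time.sleep(pause)
```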

@kshtsk (Contributor, Author) commented Jun 25, 2020

@susebot run deploy

@susebot commented Jun 25, 2020

Commit b164cdc is NOT OK.
Check the test results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/215/

@kshtsk (Contributor, Author) commented Jun 25, 2020

Both failed jobs are due to an unrelated issue:

2020-06-25T11:36:26.071 INFO:teuthology.orchestra.run.target-pr-1472-047:> sudo yum -y install ceph-mgr-diskprediction-local
2020-06-25T11:36:26.838 INFO:teuthology.orchestra.run.target-pr-1472-047.stdout:Last metadata expiration check: 0:03:27 ago on Thu 25 Jun 2020 11:32:59 AM UTC.
2020-06-25T11:36:26.864 INFO:teuthology.orchestra.run.target-pr-1472-045.stdout:  Installing       : ceph-grafana-dashboards-2:16.0.0-2890.ge07ebed.el8.n   4/5
2020-06-25T11:36:27.320 INFO:teuthology.orchestra.run.target-pr-1472-045.stdout:  Installing       : ceph-mgr-dashboard-2:16.0.0-2890.ge07ebed.el8.noarch   5/5
2020-06-25T11:36:27.360 INFO:teuthology.orchestra.run.target-pr-1472-047.stdout:(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
2020-06-25T11:36:27.361 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr:Error:
2020-06-25T11:36:27.362 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr: Problem: conflicting requests
2020-06-25T11:36:27.362 INFO:teuthology.orchestra.run.target-pr-1472-047.stderr:  - nothing provides python3-scikit-learn needed by ceph-mgr-diskprediction-local-2:16.0.0-2890.ge07ebed.el8.noarch
2020-06-25T11:36:27.407 DEBUG:teuthology.orchestra.run:got remote process result: 1

@kshtsk (Contributor, Author) commented Jun 25, 2020

Another run passed:
https://ceph-ci.suse.de/job/teuthology-deploy-ovh/267/

@kshtsk removed the DNM label Jun 25, 2020
@kshtsk requested a review from tchaikov June 25, 2020 21:24
@tchaikov merged commit 645528e into ceph:master Jun 30, 2020
@kshtsk deleted the wip-dont-report-status branch July 6, 2020 16:57