
Temporary PR to integrate PDL's mass of bug fixes and enhancements #148

Open · wants to merge 132 commits into master

Conversation

@devanshk (Contributor) commented Apr 1, 2018

Part 1. Bug fixes and enhancements

The following two bugs, combined, prevent pending jobs from being executed:

  • When the number of jobs is larger than the number of vms in the free pool, jobManager dies.

  • When jobManager restarts, the free pool is not emptied while the total pool is, causing inconsistency.
    xyzisinus@4dcbbb4
    xyzisinus@e2afe8a

  • Add the ability to specify an image name for ec2 using the "Name" tag on the AMI (previously only a single image, specified as DEFAULT_AMI, was allowed):
    xyzisinus@97c22e3
    xyzisinus@e66551a

  • When the job id reaches the maximum and wraps around, jobs with larger ids starve.
    xyzisinus@9565275

  • Improve the worker's run() function to report errors on the copy-in/exec/copy-out path more precisely.
    xyzisinus@caac9b4
    xyzisinus@c47d889

  • In the original code, Tango allocates all vm instances allowed by POOL_SIZE at once. This shouldn't be a problem, because once a vm is made ready a pending job should start using it. However, due to well-known Python thread scheduling limitations, pending jobs do not run until all vms are allocated. As we observed, vm allocations are almost sequential even though each allocation runs in a separate thread, again due to Python's threading. The result is a long delay before the first job starts running. To work around this, POOL_ALLOC_INCREMENT is added to allocate vms incrementally and let jobs start running sooner.
    xyzisinus@93e60ad

  • With POOL_SIZE_LOW_WATER_MARK, add the ability to shrink the pool when there are extra vms in the free pool. When the low water mark is set to zero, no vms are kept in the free pool: a fresh vm is allocated for every job and destroyed afterward. This maintains the desired number of standby ec2 machines in the pool while terminating extra vms to save money.
    xyzisinus@d896b36
    xyzisinus@7805577

  • Improve autodriver with accurate error reporting and optional timestamp insertion into the job output.
    Tango/autodriver/autodriver.c

  • When Tango restarts, vms in the free pool are preserved (previously they were all destroyed).
    xyzisinus@e2afe8a

  • Add a run_jobs script to submit existing student handins in large numbers:
    Tango/tools/run_jobs.py

  • Improve general logging by adding the pid to logs and messages at critical execution points.
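
The pool-sizing behavior described above (incremental allocation plus a low water mark) can be sketched roughly as follows. This is a minimal illustration, not Tango's actual code: the function names, the list-based pools, and the string vm handles are all assumptions; only the three configuration names come from this PR.

```python
# Illustrative sketch of incremental pool allocation with a low water mark.
# POOL_SIZE, POOL_ALLOC_INCREMENT, and POOL_SIZE_LOW_WATER_MARK mirror the
# configuration variables in this PR; the pools are simplified to lists.

POOL_SIZE = 10
POOL_ALLOC_INCREMENT = 2
POOL_SIZE_LOW_WATER_MARK = 3

total_pool = []   # every vm Tango knows about
free_pool = []    # vms ready to accept a job

def alloc_vm():
    vm = "vm-%d" % len(total_pool)   # stand-in for a real vm launch
    total_pool.append(vm)
    free_pool.append(vm)

def grow_pool():
    """Allocate at most POOL_ALLOC_INCREMENT vms per call, so pending
    jobs can start on already-ready vms instead of waiting for the
    whole pool to be built."""
    headroom = POOL_SIZE - len(total_pool)
    for _ in range(min(headroom, POOL_ALLOC_INCREMENT)):
        alloc_vm()

def shrink_pool():
    """Terminate free vms above the low water mark; with a mark of 0,
    every job gets a fresh vm that is destroyed afterward."""
    while len(free_pool) > POOL_SIZE_LOW_WATER_MARK:
        vm = free_pool.pop()
        total_pool.remove(vm)        # stand-in for terminating the vm
```

With a low water mark of zero, shrink_pool() drains the free pool entirely, which is the "fresh vm per job" mode described above.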

Part 2. New configuration variables (all optional)

  • Passed to autodriver to enhance the readability of the output file. Currently only integrated in the ec2 vmms.
    AUTODRIVER_LOGGING_TIME_ZONE
    AUTODRIVER_TIMESTAMP_INTERVAL

  • Control of the preallocator pool as explained in Part 1.
    POOL_SIZE_LOW_WATER_MARK
    POOL_ALLOC_INCREMENT

  • Instead of destroying it, set the vm aside for further investigation after autodriver returns OS ERROR. Currently only integrated in ec2 vmms.
    KEEP_VM_AFTER_FAILURE
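
If all of the new optional variables were set, a config fragment might look like the sketch below. The values shown are placeholders chosen for illustration; only the variable names come from this PR.

```python
# Optional settings introduced by this PR (placeholder values).

# Autodriver output readability (ec2 vmms only).
AUTODRIVER_LOGGING_TIME_ZONE = "America/New_York"
AUTODRIVER_TIMESTAMP_INTERVAL = 30    # seconds between inserted timestamps

# Preallocator pool control (see Part 1).
POOL_SIZE_LOW_WATER_MARK = 2          # keep this many standby vms
POOL_ALLOC_INCREMENT = 2              # vms allocated per growth step

# After autodriver returns OS ERROR, set the vm aside for debugging
# instead of destroying it (ec2 vmms only).
KEEP_VM_AFTER_FAILURE = True
```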

xyzisinus and others added 30 commits August 11, 2017 13:47
jobManager dies after N jobs have finished, in the exception handling
of __manage(), if Config.REUSE_VM is set to true.  This commit
simply checks whether "job" is None, to avoid the crash.  That allows
the job manager to continue and finish all jobs, but it may not be the
real fix; the root problem needs further investigation.
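
The guard described in this commit might look roughly like the sketch below. The function and attribute names are hypothetical; the real handler lives in jobManager's __manage() loop.

```python
import logging

logger = logging.getLogger("jobManager")

def handle_job_error(job, exc):
    """Exception path of the manage loop (illustrative). The dequeued
    'job' can be None; touching job.id then raises AttributeError and
    kills jobManager. Checking for None first lets the manager keep
    draining the queue instead of crashing."""
    if job is None:
        logger.info("No job to clean up after error: %s", exc)
        return
    logger.error("Job %s failed: %s", job.id, exc)
```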
queue.  Due to a misunderstanding of redis, the free queue is not
actually emptied, resulting in vms staying in the free queue but not
in the total pool.
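
The bug pattern is easy to reproduce without redis itself. In the sketch below the shared store is simulated by a dict; all names are illustrative. Rebinding a local handle does not empty the shared queue — the stored collection must be cleared (or its key deleted) on the server side.

```python
# Simulates the restart bug: the total pool is emptied, but the
# free queue is not, leaving vms in free that no longer exist in total.

def broken_restart(store):
    free = store["free"]
    free = []                 # rebinds the local name only; the
                              # store's "free" list is untouched
    store["total"].clear()    # total pool really is emptied

def fixed_restart(store):
    store["free"].clear()     # clear the shared queue itself
    store["total"].clear()
```

With redis the same distinction applies: assigning a new Python list does nothing to the server, while deleting the key (or popping every element) actually empties the queue.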
…mage name for each ami.  The lab author specifies the desired image
using this name.
by pool size at once and 2) consider vms in free pool first.
not in the free pools, instead of destroying all vms.
report them at the end of the run.
xyzisinus and others added 22 commits August 30, 2018 13:45
report file writing failure, always write current trouble jobs into
file
Or all Tango jobs will be stuck in such cases.
…ily.

But the timed cleanup shouldn't run in each of them.  Now it only runs
in jobManager.
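
The fix described here, restricting the periodic cleanup to a single process, can be sketched as follows. The role string and function name are assumptions; the point is that only the jobManager process starts the timer, so daemons sharing the pool do not each run their own cleanup.

```python
import threading

def start_timed_cleanup(role, interval, cleanup_fn):
    """Start the periodic cleanup timer only in the jobManager
    process; other processes sharing the pool skip it, so the
    cleanup does not run once per process."""
    if role != "jobManager":
        return None
    timer = threading.Timer(interval, cleanup_fn)
    timer.daemon = True
    timer.start()
    return timer
```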
@victorhuangwq (Contributor) commented:
@fanpu, this PR seems to contain a lot of the important fixes that Tango really needs.

I think the ideal way to add this PR to Tango is to break it down into smaller, easier-to-review PRs.

I'm currently thinking we open one PR per commit (that is, each of the commits listed in the description above), so that we can logically get people to approve and merge them.

@victorhuangwq victorhuangwq linked an issue Jul 8, 2020 that may be closed by this pull request
@damianhxy damianhxy removed the request for review from fanpu December 1, 2023 19:39
Successfully merging this pull request may close these issues.

VM jobs run indefinitely