
Temporary PR to integrate PDL's mass of bug fixes and enhancements #148

Open · wants to merge 132 commits into master

Conversation

@devanshk (Contributor) commented Apr 1, 2018

Part 1. Bug fixes and enhancements

The following two bugs, combined, prevent pending jobs from being executed:

  • When the number of jobs is larger than the number of vms in the free pool, jobManager dies.

  • When jobManager restarts, the free pool is not emptied while the total pool is, causing inconsistency.
    xyzisinus@4dcbbb4
    xyzisinus@e2afe8a

  • Add the ability to specify an image name for ec2 using the "Name" tag on the AMI (previously only a single image, specified as DEFAULT_AMI, was allowed):
    xyzisinus@97c22e3
    xyzisinus@e66551a

  • When the job id reaches the maximum and wraps around, jobs with larger ids starve.
    xyzisinus@9565275

  • Improve the worker's run() function to report errors on the copy-in/exec/copy-out path more precisely.
    xyzisinus@caac9b4
    xyzisinus@c47d889

  • In the original code, Tango allocates all vm instances allowed by POOL_SIZE at once. This shouldn't be a problem, because once a vm is made ready a pending job should start using it. However, due to well-known Python thread scheduling limitations, pending jobs do not run until all vms are allocated. As we observed, vm allocations are almost sequential even though each allocation runs in a separate thread, again due to Python's threading. The result is a long delay before the first job starts running. To work around this, POOL_ALLOC_INCREMENT is added to allocate vms incrementally and let jobs start running sooner.
    xyzisinus@93e60ad

  • With POOL_SIZE_LOW_WATER_MARK, add the ability to shrink the pool when there are extra vms in the free pool. When the low water mark is set to zero, no vms are kept in the free pool: a fresh vm is allocated for every job and destroyed afterward. This maintains the desired number of standby ec2 machines in the pool while terminating extra vms to save money.
    xyzisinus@d896b36
    xyzisinus@7805577

  • Improve autodriver with accurate error reporting and optional timestamp insertion into the job output.
    Tango/autodriver/autodriver.c

  • When Tango restarts, vms in the free pool are preserved (previously they were all destroyed).
    xyzisinus@e2afe8a

  • Add a run_jobs script to submit existing student handins in large numbers:
    Tango/tools/run_jobs.py

  • Improve general logging by adding the pid to logs and messages at critical execution points.
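
The pool-sizing behavior described above (incremental allocation plus a low water mark) can be sketched roughly as follows. This is a minimal illustration, not Tango's actual code: the function names, the list-based pools, and the string vm handles are all assumptions; only the three configuration names come from this PR.

```python
# Illustrative sketch of incremental pool allocation with a low water mark.
# POOL_SIZE, POOL_ALLOC_INCREMENT, and POOL_SIZE_LOW_WATER_MARK mirror the
# configuration variables in this PR; the pools are simplified to lists.

POOL_SIZE = 10
POOL_ALLOC_INCREMENT = 2
POOL_SIZE_LOW_WATER_MARK = 3

total_pool = []   # every vm Tango knows about
free_pool = []    # vms ready to accept a job

def alloc_vm():
    vm = "vm-%d" % len(total_pool)   # stand-in for a real vm launch
    total_pool.append(vm)
    free_pool.append(vm)

def grow_pool():
    """Allocate at most POOL_ALLOC_INCREMENT vms per call, so pending
    jobs can start on already-ready vms instead of waiting for the
    whole pool to be built."""
    headroom = POOL_SIZE - len(total_pool)
    for _ in range(min(headroom, POOL_ALLOC_INCREMENT)):
        alloc_vm()

def shrink_pool():
    """Terminate free vms above the low water mark; with a mark of 0,
    every job gets a fresh vm that is destroyed afterward."""
    while len(free_pool) > POOL_SIZE_LOW_WATER_MARK:
        vm = free_pool.pop()
        total_pool.remove(vm)        # stand-in for terminating the vm
```

With a low water mark of zero, shrink_pool() drains the free pool entirely, which is the "fresh vm per job" mode described above.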

Part 2. New configuration variables (all optional)

  • Passed to autodriver to enhance the readability of the output file. Currently only integrated in the ec2 vmms.
    AUTODRIVER_LOGGING_TIME_ZONE
    AUTODRIVER_TIMESTAMP_INTERVAL

  • Control of the preallocator pool as explained in Part 1.
    POOL_SIZE_LOW_WATER_MARK
    POOL_ALLOC_INCREMENT

  • Instead of destroying it, set the vm aside for further investigation after autodriver returns OS ERROR. Currently only integrated in ec2 vmms.
    KEEP_VM_AFTER_FAILURE
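
If all of the new optional variables were set, a config fragment might look like the sketch below. The values shown are placeholders chosen for illustration; only the variable names come from this PR.

```python
# Optional settings introduced by this PR (placeholder values).

# Autodriver output readability (ec2 vmms only).
AUTODRIVER_LOGGING_TIME_ZONE = "America/New_York"
AUTODRIVER_TIMESTAMP_INTERVAL = 30    # seconds between inserted timestamps

# Preallocator pool control (see Part 1).
POOL_SIZE_LOW_WATER_MARK = 2          # keep this many standby vms
POOL_ALLOC_INCREMENT = 2              # vms allocated per growth step

# After autodriver returns OS ERROR, set the vm aside for debugging
# instead of destroying it (ec2 vmms only).
KEEP_VM_AFTER_FAILURE = True
```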

xyzisinus and others added 30 commits August 11, 2017 13:47
jobManager dies after N jobs have finished, in the exception handling
of __manage(), if Config.REUSE_VM is set to true.  This commit
simply checks whether "job" is None, to avoid the crash.  That allows
the job manager to continue and finish all jobs, but it may not be the
real fix; the root problem needs further investigation.
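
The guard described in this commit might look roughly like the sketch below. The function and attribute names are hypothetical; the real handler lives in jobManager's __manage() loop.

```python
import logging

logger = logging.getLogger("jobManager")

def handle_job_error(job, exc):
    """Exception path of the manage loop (illustrative). The dequeued
    'job' can be None; touching job.id then raises AttributeError and
    kills jobManager. Checking for None first lets the manager keep
    draining the queue instead of crashing."""
    if job is None:
        logger.info("No job to clean up after error: %s", exc)
        return
    logger.error("Job %s failed: %s", job.id, exc)
```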
queue.  Due to a misunderstanding of redis, the free queue is not
actually emptied, resulting in vms staying in the free queue but not
in the total pool.
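
The bug pattern is easy to reproduce without redis itself. In the sketch below the shared store is simulated by a dict; all names are illustrative. Rebinding a local handle does not empty the shared queue — the stored collection must be cleared (or its key deleted) on the server side.

```python
# Simulates the restart bug: the total pool is emptied, but the
# free queue is not, leaving vms in free that no longer exist in total.

def broken_restart(store):
    free = store["free"]
    free = []                 # rebinds the local name only; the
                              # store's "free" list is untouched
    store["total"].clear()    # total pool really is emptied

def fixed_restart(store):
    store["free"].clear()     # clear the shared queue itself
    store["total"].clear()
```

With redis the same distinction applies: assigning a new Python list does nothing to the server, while deleting the key (or popping every element) actually empties the queue.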
…mage name for each ami.  The lab author specifies the desired image
using this name.
by pool size at once and 2) consider vms in free pool first.
not in the free pools, instead of destroying all vms.
report them at the end of the run.
xyzisinus and others added 22 commits August 30, 2018 13:45
report file writing failure, always write current trouble jobs into
file
Or all Tango jobs will be stuck in such cases.
…ily.

But the timed cleanup shouldn't run in each of them.  Now it only runs
in jobManager.
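
The fix described here, restricting the periodic cleanup to a single process, can be sketched as follows. The role string and function name are assumptions; the point is that only the jobManager process starts the timer, so daemons sharing the pool do not each run their own cleanup.

```python
import threading

def start_timed_cleanup(role, interval, cleanup_fn):
    """Start the periodic cleanup timer only in the jobManager
    process; other processes sharing the pool skip it, so the
    cleanup does not run once per process."""
    if role != "jobManager":
        return None
    timer = threading.Timer(interval, cleanup_fn)
    timer.daemon = True
    timer.start()
    return timer
```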
@victorhuangwq (Contributor) commented:
@fanpu, this PR seems to contain a lot of the important fixes that Tango really needs.

I think the ideal way to add this PR to Tango is to break it down into smaller, easier-to-review PRs.

I'm currently thinking we open one PR per commit (that is, each of the commits listed in the description above), so that we can logically get people to approve and merge them.

@victorhuangwq victorhuangwq linked an issue Jul 8, 2020 that may be closed by this pull request
@damianhxy damianhxy removed the request for review from fanpu December 1, 2023 19:39
Successfully merging this pull request may close these issues.

VM jobs run indefinitely