
Job submission workflow #195


Closed
arabnejad opened this issue Mar 23, 2021 · 3 comments
Labels: backend dev, bug prevention (tasks that help reduce the emergence of critical bugs), design decision, enhancement, long term

@arabnejad
Collaborator

The current multi-threaded implementation of job submission is not thread-safe when a large number of threads is used. The main issue is that some global variables are modified by different threads without being protected by a mutex lock, so their final values are unpredictable and may not be set as intended.

The possible solutions for this issue are:

  • wrapping the code blocks that modify those variables in a mutex lock
  • working on a different approach, which may require some changes to the workflow for submitting a job to the remote machine

The main disadvantage of the first approach is that, with more code inside mutex-locked regions, the total job submission time will increase, so I believe the second option is the best one to investigate.
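To illustrate the first option, here is a minimal sketch of protecting a shared global with a mutex. The `env` dictionary, the `submitted_jobs` key, and the function names are hypothetical stand-ins, not the project's actual code. The read-modify-write on the shared value is only safe because it happens inside the lock, which is also why this approach serializes part of the work.

```python
import threading

# Hypothetical stand-in for a global settings object shared by all threads.
env = {"submitted_jobs": 0}
env_lock = threading.Lock()


def register_jobs(n_updates):
    """Increment the shared counter n_updates times."""
    for _ in range(n_updates):
        # The read-modify-write below is not atomic on its own,
        # so it is wrapped in the lock.
        with env_lock:
            env["submitted_jobs"] += 1


def run(n_threads=8, n_updates=10_000):
    """Start n_threads workers and return the final counter value."""
    env["submitted_jobs"] = 0
    threads = [
        threading.Thread(target=register_jobs, args=(n_updates,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return env["submitted_jobs"]
```

With the lock in place, `run(n_threads, n_updates)` always returns `n_threads * n_updates`. Every thread has to wait its turn for the locked region, which is the overhead described above.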

@arabnejad arabnejad added enhancement long term bug prevention tasks that help reduce the emergence of critical bugs. design decision backend dev labels Mar 23, 2021
@arabnejad arabnejad added this to the VECMA M33 Release milestone Mar 23, 2021
@arabnejad arabnejad self-assigned this Mar 23, 2021
@arabnejad
Collaborator Author

In the current version, when we call the job() function to submit a job, all steps (preparation, transmission, and submission) are executed within that single job() function. Even with the current multithreading approach, we still need mutex locks to protect the global env variables across threads, which increases the total execution time of the job() function.

Therefore, I suggest splitting those steps as follows:

  1. job_preparation() : creates all the job folders and scripts in the temporary folder (/tmp), under <tmp_folder>/{results/,scripts/} . In this function, the config_files folder has already been transferred to the remote machine by calling the with_config function.

    • this part can be executed in parallel. With Multithreading, each thread needs its own local copy of the env variables. With Multiprocessing, no change is needed compared to serial mode.
  2. job_transmission() : transfers all generated files/folders from <tmp_folder>/{results/,scripts/} to <target_work_dir>/{results/,scripts/} with a single rsync command.

  3. job_submission() : submits all job scripts to the target remote machine.

    • this function could also be executed in parallel. However, I think opening multiple simultaneous SSH tunnels would not be safe because of system limits on the number of open/alive SSH connections. Therefore, I think this part should be done in serial mode.
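The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the pull request: the function signatures, the `base_env` dictionary, the script contents, the `remote` string, and the `submit` callable are all hypothetical. Preparation runs in a thread pool with a private copy of `env` per job, transmission builds one `rsync` call, and submission is a plain serial loop.

```python
import copy
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def job_preparation(base_env, job_id, tmp_folder):
    """Create one job script locally, using a private copy of env."""
    local_env = copy.deepcopy(base_env)  # no shared state, so no lock needed
    local_env["job_id"] = job_id
    result_dir = Path(tmp_folder) / "results" / f"job_{job_id}"
    result_dir.mkdir(parents=True, exist_ok=True)
    script = Path(tmp_folder) / "scripts" / f"job_{job_id}.sh"
    script.write_text(f"#!/bin/bash\necho running job {local_env['job_id']}\n")
    return script


def prepare_all(base_env, n_jobs, tmp_folder, max_workers=4):
    """Step 1: prepare all jobs in parallel; results keep job order."""
    (Path(tmp_folder) / "scripts").mkdir(parents=True, exist_ok=True)
    (Path(tmp_folder) / "results").mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(
            pool.map(lambda i: job_preparation(base_env, i, tmp_folder),
                     range(n_jobs))
        )


def build_rsync_command(tmp_folder, remote, target_work_dir):
    """Build one rsync call that copies results/ and scripts/ together."""
    # The trailing slash on the source copies the folder's contents.
    return ["rsync", "-a", f"{tmp_folder}/", f"{remote}:{target_work_dir}/"]


def job_transmission(tmp_folder, remote, target_work_dir):
    """Step 2: transfer everything with a single rsync command."""
    subprocess.run(build_rsync_command(tmp_folder, remote, target_work_dir),
                   check=True)


def job_submission(scripts, submit):
    """Step 3: submit scripts one at a time (serial mode)."""
    return [submit(script) for script in scripts]
```

Because each call to `job_preparation` works on its own deep copy, `base_env` is never mutated by the worker threads. The `submit` callable is left to the caller here, since how a script reaches the remote scheduler depends on the target machine.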

@arabnejad
Collaborator Author

arabnejad commented Mar 23, 2021

The new job workflow system has been implemented with both Multithreading and Multiprocessing approaches.

Both are tested for thread/process safety.

Also, I did some benchmarking of the new job submission workflow; here are the results compared to the current version:

[Figure: compare (benchmark results image)]

Here, PJ refers to the PilotJob functionality, i.e., with PJ=True only a single job will be submitted, and with PJ=False each job will be submitted individually to the remote machine scheduler.

@arabnejad
Collaborator Author

This issue was closed by Pull Request #196.
