SSH error connection when submitting a large number of ensemble runs #190

Closed
arabnejad opened this issue Oct 22, 2020 · 3 comments


@arabnejad
Collaborator

FabSim3 provides multi-threading functionality to reduce the total job submission time for a large number of ensemble/replica runs.
However, when the number of ensemble runs is very high (>30k), the submission process may fail because of the large number of SSH connections it opens:

    ...
    raise SSHException("SSH session not active")
paramiko.ssh_exception.SSHException: SSH session not active

There are several ways we could handle this issue, but each of them will increase the total submission time.

What do you think, @djgroen? What are your suggestions for tackling this issue?

@djgroen
Owner

djgroen commented Oct 22, 2020

Okay, I think this is one part that is responsible for all the SSH connections. There are three run commands right after this line:

if env.label in ['PJ_PYheader', 'PJ_header']:

Any run() command in job increases the number of SSH connections in proportion to the job count. These commands need to be merged so that file staging is done for all jobs (or at least multiple jobs) in one go, rather than separately for each job.
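To illustrate the merging idea, here is a minimal sketch, not FabSim3's actual code. It assumes the per-job staging steps are plain shell commands that can be chained with `&&`; the helper name `batch_commands`, the `batch_size` parameter, and the `results/run_N` paths are hypothetical. Each returned string could then be sent through a single remote call instead of one call per job.

```python
from typing import Iterable, List


def batch_commands(commands: Iterable[str], batch_size: int) -> List[str]:
    """Join per-job shell commands into compound commands.

    Each returned string chains up to `batch_size` commands with '&&',
    so it needs one remote call (one SSH connection) instead of many.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commands = list(commands)
    return [
        " && ".join(commands[i:i + batch_size])
        for i in range(0, len(commands), batch_size)
    ]


# Hypothetical per-job staging commands; the paths are placeholders.
per_job = [f"mkdir -p results/run_{i}" for i in range(1, 7)]

# Six per-job remote calls collapse into two compound calls.
batched = batch_commands(per_job, batch_size=3)
```

Batching in fixed-size groups rather than one giant command keeps each command line well below shell length limits; the right group size would depend on the remote system.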

Later on there is also this line:

run(template("chmod u+x %s" %

That line changes the permissions of one file per job. That one may actually be straightforward to refactor, because I think you can just chmod all subdirs in that results dir with a single command at the right time (i.e. after everything is uploaded)?
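As a rough sketch of the single-command approach, the snippet below builds one `find ... -exec chmod u+x {} +` command covering every matching file under a results directory. The function name `build_chmod_command`, the example directory, and the `*.sh` pattern are assumptions for illustration only; the actual file names FabSim3 uses are not shown in this thread.

```python
import shlex


def build_chmod_command(results_dir: str, script_pattern: str) -> str:
    """Build one shell command that marks all matching files executable.

    Uses POSIX `find -exec ... {} +`, which passes many files to a single
    chmod invocation, so only one remote call is needed for all jobs.
    """
    return "find {} -type f -name {} -exec chmod u+x {{}} +".format(
        shlex.quote(results_dir),     # quote to survive spaces in paths
        shlex.quote(script_pattern),  # quote so the remote shell does not expand the glob
    )


# Hypothetical results directory and script pattern.
command = build_chmod_command("/home/user/results", "*.sh")
```

The resulting string would be run once, after all uploads have finished, in place of one `chmod` call per job.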

Lastly, there are several run commands after this line

# We don't want to go next during replicas and pjm until the final

but those are merged in the case of QCG-PJ ensembles, so I think that part can remain unchanged.

Does this help, @arabnejad?

@arabnejad
Collaborator Author

I reduced the number of SSH connections during job submission (c8cfadc, d162081).

Federica (@FGugole), could you please confirm that the new implementation fixes your problem?
Also, could you please mention the campaign size you used and the total submission time?

@FGugole

FGugole commented Nov 17, 2020

Yes, the new implementation fixed my problem and I was able to submit a campaign with 60k samples. The job submission took a long time (at least 10 days), but it was successful!
