--job fails with 3.7.1 & gc3pie 2.5.0 #2632
Comments
Instinct tells me this is most likely a problem with GC3Pie, but I'll let @riccardomurri be the judge on this... Is GC3Pie somehow resetting the environment? @jpecar Can you provide some more information on how EasyBuild was installed? Are you loading a module file for EasyBuild, or was it installed some other way? |
It's a module file. |
GC3Pie does not reset the environment; all processes are run in the environment inherited from the parent process. What has changed in rel 2.5.0 compared to rel 2.4.2 is that sub-processes are now executed directly, without calling a shell (i.e., they are spawned straight from Python). Environment variables can be overridden, but that has to be specified when you create the GC3Pie Application. I guess it would help in your case if gc3pie/gc3pie#609 was solved? |
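For illustration, explicit per-job variables in GC3Pie look roughly like this; the application below is a made-up sketch (arguments, paths and variable values are all invented), not the one EasyBuild generates:

```python
from gc3libs import Application

# hypothetical job description; 'environment' is the keyword GC3Pie
# uses for variables to set explicitly in the spawned process (cf.
# self.environment in the patch quoted later in this thread)
app = Application(
    arguments=['eb', 'zlib-1.2.11.eb', '--robot'],
    inputs=[],
    outputs=[],
    output_dir='eb-job-output',
    environment={
        'PATH': '/usr/local/bin:/usr/bin:/bin',
        'MODULEPATH': '/apps/modulefiles',
    },
)
```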
@riccardomurri: It seems like we should also be explicitly passing PATH via env_vars? That would probably fix the issue @jpecar is seeing? |
> It seems like we should also be explicitly passing PATH via env_vars? That would probably fix the issue @jpecar <https://github.com/jpecar> is seeing?

Yes, I think that would do it. |
@jpecar Can you try applying this patch in your EasyBuild installation and see if that fixes the problem you're seeing?

```diff
diff --git a/easybuild/tools/parallelbuild.py b/easybuild/tools/parallelbuild.py
index 1a5018a3e..7511fce57 100644
--- a/easybuild/tools/parallelbuild.py
+++ b/easybuild/tools/parallelbuild.py
@@ -161,7 +161,7 @@ def create_job(job_backend, build_command, easyconfig, output_dir='easybuild-bui
         if name.startswith("EASYBUILD"):
             easybuild_vars[name] = os.environ[name]
 
-    for env_var in ["PYTHONPATH", "MODULEPATH"]:
+    for env_var in ["PATH", "PYTHONPATH", "MODULEPATH"]:
         if env_var in os.environ:
             easybuild_vars[env_var] = os.environ[env_var]
```
|
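As a quick way to verify what a job patched like this actually receives, one can submit a trivial probe as the job payload; a minimal sketch (the variable list is just an example):

```python
#! /usr/bin/env python
# print the variables EasyBuild's create_job() is supposed to forward
import os

for key in ('PATH', 'PYTHONPATH', 'MODULEPATH'):
    print('%s=%s' % (key, os.environ.get(key, '<unset>')))
```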
With this patch the eb command is now found, but it fails with "ERROR: Lmod modules tool can not be used, 'lmod' command is not available". So I guess more variables need to be passed down from the environment, some of the LMOD_* ones? I don't know in detail how EasyBuild detects Lmod availability. |
@jpecar OK, then try also passing down the LMOD_* environment variables. This may point to a bigger problem there: if the shell in the job being submitted isn't properly set up, you may run into more problems... How is Lmod installed & initialized in shell sessions? |
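To see which Lmod-related variables a working interactive session actually sets (and hence which ones might need forwarding), a quick check along these lines helps; MODULESHOME is included only on the assumption that the site sets it:

```python
import os

# list everything Lmod-related plus the module search path
for name, value in sorted(os.environ.items()):
    if name.startswith('LMOD_') or name in ('MODULEPATH', 'MODULESHOME'):
        print('%s=%s' % (name, value))
```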
@jpecar I hit the same issue and the easiest workaround for now is to pass a bunch of Lmod-related env vars down to the job:

```diff
--- parallelbuild.py.orig	2018-10-24 10:52:55.959314000 -0500
+++ parallelbuild.py	2018-10-24 10:33:27.066026750 -0500
@@ -155,17 +155,16 @@
     returns the job
     """
-    # capture PYTHONPATH, MODULEPATH and all variables starting with EASYBUILD
-    easybuild_vars = {}
-    for name in os.environ:
-        if name.startswith("EASYBUILD"):
-            easybuild_vars[name] = os.environ[name]
-
-    for env_var in ["PYTHONPATH", "MODULEPATH"]:
-        if env_var in os.environ:
-            easybuild_vars[env_var] = os.environ[env_var]
+    # capture PYTHONPATH, MODULEPATH and all variables starting with EASYBUILD or LMOD
+    env_vars_to_pass = {}
 
-    _log.info("Dictionary of environment variables passed to job: %s" % easybuild_vars)
+    regex = re.compile("^(PATH|PYTHONPATH|MODULEPATH|USER)$|^(LMOD|EASYBUILD).*")
+
+    for env_var in os.environ:
+        if re.match(regex, env_var) is not None:
+            env_vars_to_pass[env_var] = os.environ[env_var]
+
+    _log.info("Dictionary of environment variables passed to job: %s" % env_vars_to_pass)
 
     # obtain unique name based on name/easyconfig version tuple
     ec_tuple = (easyconfig['ec']['name'], det_full_ec_version(easyconfig['ec']))
@@ -194,7 +193,7 @@
     if build_option('job_cores'):
         extra['cores'] = build_option('job_cores')
 
-    job = job_backend.make_job(command, name, easybuild_vars, **extra)
+    job = job_backend.make_job(command, name, env_vars_to_pass, **extra)
 
     job.module = easyconfig['ec'].full_mod_name
 
     return job
```
|
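Standalone, the filtering logic of that workaround amounts to the following sketch (not the actual framework code):

```python
import os
import re

# forward a few variables verbatim, plus anything Lmod- or EasyBuild-related
ENV_VAR_REGEX = re.compile(r"^(PATH|PYTHONPATH|MODULEPATH|USER)$|^(LMOD|EASYBUILD).*")

def env_vars_to_pass():
    """Return the subset of os.environ to forward to a submitted job."""
    return {name: value for name, value in os.environ.items()
            if ENV_VAR_REGEX.match(name)}

if __name__ == '__main__':
    for name, value in sorted(env_vars_to_pass().items()):
        print('%s=%s' % (name, value))
```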
@riccardomurri The workaround above is clearly not enough to ensure that the environment is correctly set in the job. Think for example of a customized environment where Lmod or EasyBuild pull information from a site-specific environment variable. Obviously we cannot predict this a priori. Is there any way to request that GC3Pie pass the entire parent environment down to the job? |
@riccardomurri +1 on @vanzod's question. Also, can you clarify why the change to executing sub-processes directly, without calling a shell, was made in 2.5.0? |
@vanzod: Thanks, now I see --job jobs running and producing the expected output. In the log files I actually see builds succeeding. However, in CI I still see some communication error, apparently in gc3pie-slurm. Seems like something else changed in gc3pie as well. Any pointers? |
@jpecar What does "in CI" mean exactly? Again, maybe @riccardomurri can help here... |
Hello all, before we jump straight to conclusions and suggested patches, I would like to understand the setup better. First of all: a different backend is used in GC3Pie depending on the configuration, and how environment variables reach the job differs from backend to backend. What is the backend that @jpecar and @vanzod are using here? (I could not tell from the comments above.) Anyway, whatever the backend, the environment variables to be passed to the job have to be specified explicitly. So, before we go on: what backend are you guys using for these tests? |
This is a chunk of my gc3pie.conf:
And the same for other archs. CI is run through GitLab: each commit spawns a Docker container that can talk to and submit jobs to our Slurm cluster. Within that container, EasyBuild is run with the --job option, using GC3Pie to manage the Slurm jobs. |
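For readers unfamiliar with GC3Pie configuration: a SLURM resource section in gc3pie.conf generally has the shape below. Every name and limit here is invented for illustration; this is not the poster's actual file:

```
# hypothetical example gc3pie.conf fragment
[auth/slurm_ssh]
type = ssh
username = builder

[resource/mycluster]
enabled = yes
type = slurm
auth = slurm_ssh
transport = ssh
frontend = login.cluster.example.org
max_walltime = 24 hours
max_cores_per_job = 16
max_memory_per_core = 4 GiB
max_cores = 512
architecture = x86_64
```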
@riccardomurri Thanks for clarifying. The claim is that this worked with GC3Pie 2.4.x, so I'm assuming the same setup is being used here by @jpecar & @vanzod w.r.t. GC3Pie configuration & backends. If it no longer works with GC3Pie 2.5.0, then something was changed that affects the (default) behavior w.r.t. passing down environment variables. It's clear that GC3Pie isn't actively resetting the environment, but it may be a secondary effect, e.g. of no longer starting a login shell? |
@jpecar Would you be able to apply the following patch to your GC3Pie code and retry?

```diff
diff --git a/gc3libs/__init__.py b/gc3libs/__init__.py
index b7ed95e0..3979ab46 100755
--- a/gc3libs/__init__.py
+++ b/gc3libs/__init__.py
@@ -1572,7 +1572,9 @@ class Application(Task):
                       + ['{0}={1}'.format(name, value)
                          for name, value in self.environment.iteritems()]
                       + sbatch
-                      + ['--export', ','.join(self.environment.keys())])
+                      + ['--export', ','.join(self.environment.keys() + ['ALL'])])
+        else:
+            sbatch += ['--export', 'ALL']
         return (sbatch, cmdline)
```

If you cannot patch GC3Pie, it should suffice to change your config to read:

```
sbatch = sbatch --export=PWD,ALL
```
|
Yes, I'm not trying to say that it doesn't depend on GC3Pie at all. But I'm not convinced that passing down ever more environment variables from EasyBuild is the right fix. |
With the patch by @vanzod reverted and sbatch --export=PWD,ALL in gc3pie.cfg, I'm back to "/bin/sh: eb: command not found". @boegel: is there anything blocking me from trying eb 3.7.1 with gc3pie 2.4.2? I'm thinking of isolating the problem to one of the involved components and then bisecting to identify the issue... seems like a good opportunity to learn how to do that :) |
@jpecar It's possible that SLURM's --export option behaves differently across versions. Also, if you create a file with just these contents:

```sh
#! /bin/sh
printenv
```

do you see any difference in the output if you submit it with a plain sbatch, or with sbatch --export=PWD,ALL? |
@riccardomurri In my setup the backend is Slurm. As @boegel pointed out, nothing changed in our setup or configuration after moving to 2.5.0 from 2.4.2. My reference to the issue being on the GC3Pie side stems from exactly that. |
I think I have a clue now about what happens. On our prod clusters, all running SLURM 15.08 on Ubuntu 16.04, I see that (1) all environment variables are propagated by default, (2) unless --export is given an explicit list that does not include ALL, and (3) when --export is repeated, only the last occurrence counts:

```
$ sbatch --version
slurm 15.08.7
$ export FOO=bar
$ cat foo.sh
#! /bin/sh
echo ${FOO:-no FOO!}
$ sbatch foo.sh  # (1)
Submitted batch job 1276138
$ cat slurm-1276138.out
bar
$ sbatch --export=PWD,ALL foo.sh  # (2)
Submitted batch job 1276139
$ cat slurm-1276139.out
bar
$ sbatch --export=PWD foo.sh  # (2)
Submitted batch job 1276140
$ cat slurm-1276140.out
no FOO!
$ sbatch --export=FOO,ALL --export=PWD foo.sh  # (3)
Submitted batch job 1276141
$ cat slurm-1276141.out
no FOO!
```

So, one thing that changed in the transition from GC3Pie 2.4.2 to 2.5.0 is exactly the addition of an explicit --export option to the sbatch command line, which on this SLURM version keeps the rest of the environment from being propagated. Can someone please try applying the patch from #2632 (comment) and tell if it fixes the issue? (It should, if this explanation is correct.) |
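Put differently, the submission behavior the patch corrects can be sketched like this (simplified from the diff above, not the actual GC3Pie source):

```python
def sbatch_export_args(environment):
    """Sketch of how the sbatch command line gains its --export option.

    In 2.5.0 only the job's own variables were listed, which (per the
    experiment above) stops SLURM from propagating everything else;
    appending ALL restores the pre-2.5.0 behavior.
    """
    if environment:
        # 2.5.0 behavior: ','.join(environment.keys()) -- drops the rest of the env
        return ['--export', ','.join(list(environment.keys()) + ['ALL'])]
    else:
        return ['--export', 'ALL']
```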
@riccardomurri But then, what's the point of specifying any environment variables to pass at all, since you'll always pass the current environment anyway via --export=ALL? |
@riccardomurri I tried the patched gc3pie and it seems to work: I see jobs submitted and software being built. However, I still see the "Could not retrieve status information" error I mentioned in #2632 (comment). I guess this is another issue, unrelated to this one? |
It seems like we should change the current behavior in EasyBuild so that it doesn't specify any environment variables to pass down into submitted jobs (which assumes that EasyBuild will be available by default in submitted jobs). Does that make sense, @riccardomurri, @vanzod? cc @akesandgren |
My problem might be unrelated (PySlurm) but I'll investigate a bit more. |
I think the proper way to fix this in the EasyBuild framework is to simply stop passing down specific environment variables in submitted jobs. |
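Concretely, that would shrink create_job()'s environment handling to something like the following sketch (a possible direction, not a merged change; it assumes --export=ALL-style propagation on the GC3Pie side):

```python
import os

def env_vars_for_job():
    """Forward only EasyBuild's own configuration variables -- or nothing.

    With the scheduler exporting the full environment anyway, explicit
    forwarding could be limited to EASYBUILD_* (or dropped entirely).
    """
    return {name: value for name, value in os.environ.items()
            if name.startswith('EASYBUILD')}
```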
hi all, sorry for being silent on this for a while -- it's a very busy time until Dec. for me... Anyway, I think GC3Pie's submit method needs to be corrected to include ALL in the --export option it passes to sbatch. |
@riccardomurri Any updates? |
@vanzod Can you check whether this is still a problem? |
Fixes easybuilders/easybuild-framework#2632 (comment) (and possibly other issues that nobody cared to report).
I have just released GC3Pie 2.5.1 with the SLURM environment fix. Let me know if it fixes this issue. |
@boegel The initial problem is resolved with 2.5.1 (the job is submitted, the package starts building and gets successfully built), but it appears there's another problem lurking. See my post from Oct 25 above. |
@jpecar Following up on your comment from Oct. 25: can you set the GC3Pie logging level to DEBUG? That should show the actual interactions with SLURM. |
@riccardomurri Maybe a stupid question, but... how? I didn't find anything relevant in the gc3pie docs; grepping the source gave me two ideas (adding "debug=1" to gc3pie.conf and creating $HOME/.gc3/gc3libs.log.conf with content "level=debug"), neither of which made any change to the output I get. I agree that looking at the commands fired at slurm and reading their output is the right way to understand what I'm seeing, just let me know how I can achieve that. |
@jpecar GC3Pie uses the logger passed in by EB, so just raising EB's log level (e.g. by running eb with --debug) should be enough. |
Is there an option to build GC3Pie with the EasyBuild Python instead of the system one? I have an old system Python 2.6 with an old setuptools that cannot be upgraded. |
I am not sure I understand the question: GC3Pie is a pure Python library, there is nothing being compiled, and it will run in EB's Python interpreter when used by EasyBuild... |
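A quick way to check which interpreter and which GC3Pie installation are actually in use:

```python
import sys
import gc3libs

print(sys.executable)    # the Python interpreter currently running
print(gc3libs.__file__)  # where gc3libs was imported from
```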
I get this error when trying to compile with EasyBuild:
|
Looks like a more general error to me: the traceback points to a problem importing a Python module, not to anything specific to GC3Pie. |
You are right, will open a new issue. |
We had our CI pipeline running fine up to and including EasyBuild 3.6.2 & GC3Pie 2.4.2. Now I've upgraded to EasyBuild 3.7.1 and GC3Pie 2.5.0, and jobs are created by GC3Pie but fail immediately with "eb: command not found", causing GC3Pie to fail with "gc3libs.exceptions.LRMSError: Could not retrieve status information for task Application@3aa5250".
It appears that something changed in how the environment is propagated from the eb process that generates a job down to the job itself. Neither the EasyBuild nor the GC3Pie changelog mentions any change like this.
For now I've hardcoded our CI to use the previous versions of EasyBuild and GC3Pie, but I would eventually like to see this resolved.