Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container code #5250

Closed
wants to merge 35 commits into from
Closed

Container code #5250

wants to merge 35 commits into from

Conversation

unkcpz
Copy link
Member

@unkcpz unkcpz commented Dec 6, 2021

There are the following major changes in this PR:

  • For the job script generated, if there are $ contained in the string, it will not be escaped in single quotes.
  • A ContainerCode class is created to represent the code running in the container
  • The command-line interface for setting this ContainerCode instant is added

Since the container set is now included in the setting of the container code, which I think for the production using this should be separated, the implementation of the verdi code setup --on-container interface is a bit of a mess, all the options are in the code setup subcommand.

@unkcpz unkcpz marked this pull request as draft December 6, 2021 23:24
@unkcpz unkcpz force-pushed the container-code branch 2 times, most recently from 43c02ae to ba21194 Compare December 8, 2021 10:53
Copy link
Contributor

@yakutovicha yakutovicha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comments below.

aiida/cmdline/commands/cmd_code.py Outdated Show resolved Hide resolved
aiida/cmdline/params/options/commands/code.py Outdated Show resolved Hide resolved
@@ -68,11 +84,89 @@ def validate_label_uniqueness(ctx, _, value):
return value


ON_CONTAINER = OverridableOption(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if "on_container" should be a separate option. Maybe instead of having the "ON_COMPUTER" option we call it "CODE_LOCATION" and provide 3 choices: on a computer, on a container, local. Something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that these three cases are related; the downside of your suggestion is that we would need to break backwards compatibility.
At this stage I would rather vote for a separate option that is turned off by default.

__all__ = ('ContainerCode',)


class ContainerCode(Code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I am not convinced anymore whether creating a new ContainerCode class is better than adding a few more argument to the standard Code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know yet how exactly we want to represent all the container metadata, so I think going with a subclass is the right way for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I am not convinced anymore whether creating a new ContainerCode class is better than adding a few more argument to the standard Code

Would you mind sharing why? That route has been tried multiple times (local and remote for Code, mesh and explicit for KpointsData) and it always creates a lot of confusion, both in the code and for the user.

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @unkcpz - most of this looks good to me, and is in line with what we discussed.

just a couple of comments, mostly on naming

@@ -47,6 +47,7 @@
'Comment',
'Computer',
'ComputerEntityLoader',
'ContainerCode',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative name: Container and ContainerizedCode

I don't have strong preferences for any of these options though, also ContainerCode is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I'd go for ContainerCode: it makes it clear it's a Code (actually it's a subclass) and it's a bit shorter than ContainerizedCode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally definitely keep the Code suffix. Also I think ContainerizedCode is more "correct" and therefore for me clearer. I usually don't mind names that are a bit longer if that makes them clearer. But I could also live with ContainerCode if others prefer this.

def __init__(
self,
computer=None,
cmd_tmpl=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think abbreviations are necessary here.
command_template is perfectly fine and easier to read


self.set_attribute('container_exec_path', container_exec_path)

def get_container_exec_path(self):
Copy link
Member

@ltalirz ltalirz Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose container_command, in analogy to https://docs.docker.com/engine/reference/run/

__all__ = ('ContainerCode',)


class ContainerCode(Code):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know yet how exactly we want to represent all the container metadata, so I think going with a subclass is the right way for now.

aiida/cmdline/commands/cmd_code.py Outdated Show resolved Hide resolved
CONTAINER_CMDLINE_TMPL = OverridableOption(
'--container-cmdline-tmpl',
prompt='container cmdline params tmpl',
default='sarus run --mount=src=$PWD,dst=/workir,type=bind --workdir /workdir {image}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should default to singularity instead, since it is way more popular.

I would perhaps call this container-engine-command (I don't think "template" needs to be part of the name; we also don't call the work directory "template")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is a typo! the first time you say /workir instead of /workdir

ON_COMPUTER = OverridableOption(
'--on-computer/--store-in-db',
is_eager=False,
default=True,
cls=InteractiveOption,
required_fn=is_not_on_container,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also originally thinking that the three options are mutually exclusive, but you're right - we could have containerised code that runs both 'on computer' or 'in-db'. @unkcpz did you test the two cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this in-db? it is the other way to say for the local code, no? So either we on_computer or 'not_on_computer' which is local (in-db?) or we on_container?

@@ -40,6 +40,10 @@ def escape_for_bash(str_to_escape):

str_to_escape = str(str_to_escape)

# dont espace when string contain $ symbol, usually this is for ENV variable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this fix is way too far-reaching; it will affect any string that contains a dollar sign, not just the one we are looking at.

Can you provide a link to the issue that @giovannipizzi mentioned where they discuss the pros/cons of using single/double quotes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is too much...
Also note the spelling dont -> don't, espace -> escape

The issue is #4836

Shall we add an optional parameter to escape_for_bash, use_double_quotes=False? If True, one should do (untested)

    escaped_quotes = str_to_escape.replace('"', '''"'"'"''')
     return f'"{escaped_quotes}"'

(obvious, no? :-P)

Then, the question however becomes: "who decides which quotes to use?"

Option 1: the new string container-engine-command is escaped with double quotes, the rest with single quotes (backward compatibility). However, this does not address #4836 and it will be complex for users to remember which is quoted how.
Option 2: we also ask one more thing to users: use double quotes instead of single quotes for escaping (default false). Is this too much?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for option 2. But this prompt setup question should be in the computer setup stage just after the scheduler setting there, I assume.

aiida/cmdline/params/options/commands/code.py Outdated Show resolved Hide resolved
return ContainerCode

@classmethod
def _get_query_builder_label_identifier(cls, identifier, classes, operator='==', project='*'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you changing something in this method?

In a quick browse I didn't see anything specific to the container code (but I might be wrong).
You might be able to subclass CodeEntityLoader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described above, if we simply make the entry point core.code.containerized then load_code can be used and none of this extra code is needed

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @unkcpz - most of this looks good to me, and is in line with what we discussed.

just a couple of comments, mostly on naming

Copy link
Member

@giovannipizzi giovannipizzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional comments to be taken into account.

One important thing: before merging it would be important to test this (both ContainerCode in mode ON_COMPUTER and in mode STORE_AND_UPLOAD) to check it actually works. In can give you a path of a pw code I already prepared in sarus on eiger, if you want to try. You could prepare a simple container with python in it and check that you can upload a python script and run it in mode STORE_AND_UPLOAD.

Actually, it would be good to add integration tests where we actually run one of these codes among the tests? Using docker without MPI as an example? I feel (but I'm not sure) that there should be already examples of this for the add code, and docker should be already available in the GitHub actions.

CONTAINER_CMDLINE_TMPL = OverridableOption(
'--container-cmdline-tmpl',
prompt='container cmdline params tmpl',
default='sarus run --mount=src=$PWD,dst=/workir,type=bind --workdir /workdir {image}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is a typo! the first time you say /workir instead of /workdir

ON_COMPUTER = OverridableOption(
'--on-computer/--store-in-db',
is_eager=False,
default=True,
cls=InteractiveOption,
required_fn=is_not_on_container,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also originally thinking that the three options are mutually exclusive, but you're right - we could have containerised code that runs both 'on computer' or 'in-db'. @unkcpz did you test the two cases?

@@ -40,6 +40,10 @@ def escape_for_bash(str_to_escape):

str_to_escape = str(str_to_escape)

# dont espace when string contain $ symbol, usually this is for ENV variable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is too much...
Also note the spelling dont -> don't, espace -> escape

The issue is #4836

Shall we add an optional parameter to escape_for_bash, use_double_quotes=False? If True, one should do (untested)

    escaped_quotes = str_to_escape.replace('"', '''"'"'"''')
     return f'"{escaped_quotes}"'

(obvious, no? :-P)

Then, the question however becomes: "who decides which quotes to use?"

Option 1: the new string container-engine-command is escaped with double quotes, the rest with single quotes (backward compatibility). However, this does not address #4836 and it will be complex for users to remember which is quoted how.
Option 2: we also ask one more thing to users: use double quotes instead of single quotes for escaping (default false). Is this too much?

@@ -86,7 +87,7 @@ def upload_calculation(
computer = node.computer

codes_info = calc_info.codes_info
input_codes = [load_node(_.code_uuid, sub_classes=(Code,)) for _ in codes_info]
input_codes = [load_node(_.code_uuid, sub_classes=(Code, ContainerCode)) for _ in codes_info]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed since ContainerCode is a subclass? (maybe it is, I didn't check the underlying code). Same question in at least another point below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether it is a sub-class in Python doesn't matter. What matters is the structuring of the entry points, because that is what the query builder uses. So as long as Code = core.code and ContainerCode = core.code.container, then querying for all Code's will also match all ContainerCode's, unless subclassing is explicitly turned off. Note that the load_node call here should just be replaced with load_code which is the standard for querying for codes.

@@ -710,6 +711,12 @@ def presubmit(self, folder: Folder) -> CalcInfo:
this_argv = [this_code.get_execname()
] + (code_info.cmdline_params if code_info.cmdline_params is not None else [])

# set this_argv only for container code
if isinstance(this_code, ContainerCode):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is completely overriding the behaviour above? I.e., this e.g. is ignoring the behaviour of this_withmpi and not putting the mpi_args, that I think is incorrect?

"""Get image name"""
return self.get_attribute('image', '')

def container_cmd_params(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's check Leo's suggestion for the name, above (I liked it). In any case, let's use the same name for the function and the attribute, to avoid confusion in users when they call it via the API or when they query for it via the QueryBuilder. Finally, for consistency, let's add get_ in front. So, e.g., container_engine_command for the attribute and get_container_engine_command for the method

aiida/orm/utils/loaders.py Outdated Show resolved Hide resolved
setup.json Outdated Show resolved Hide resolved
@@ -20,6 +20,7 @@ def test_escape_for_bash():
["string with a ' single quote", """'string with a '"'"' single quote'"""],
[1, "'1'"],
[2.0, "'2.0'"],
['$PWD', '$PWD'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be adapted according to my comment above (and testing both single-quoted escaping, i.e. now, and double-quoted escaping - essentially copying the tests an swapping all single quote with double quotes and vice versa

@giovannipizzi
Copy link
Member

There are already quite a few reviews, but I added @sphuber if he wants to have an eye (not critical if he doesn't have time) since there are quite a few things to be careful about (entry point names, subclassing, ...)

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @unkcpz . I have given this a quick read, but I haven't really gone into depth of the ContainerCode implementation for now, since I haven't been present during the meetings. So I trust the implementation is ok.

The main problem that I see is that we are making the already complex verdi code setup even more complex. The fact that this endpoint is used for three different classes feels a bit problematic. We have already had plenty of problems with the command merely supporting local and remote codes, but I feel this will increase the complexity signifcantly. Now I am not sure if we can or should address this now, but I think that in the future an approach more in the sense of the transport commands, where each plugin can provide their own CLI subcommand, would be the way to go. That solution is not without its own problems for sure, but I think that will be better in the end.

aiida/engine/processes/calcjobs/calcjob.py Outdated Show resolved Hide resolved
@@ -47,6 +47,7 @@
'Comment',
'Computer',
'ComputerEntityLoader',
'ContainerCode',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally definitely keep the Code suffix. Also I think ContainerizedCode is more "correct" and therefore for me clearer. I usually don't mind names that are a bit longer if that makes them clearer. But I could also live with ContainerCode if others prefer this.

__all__ = ('ContainerCode',)


class ContainerCode(Code):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I am not convinced anymore whether creating a new ContainerCode class is better than adding a few more argument to the standard Code

Would you mind sharing why? That route has been tried multiple times (local and remote for Code, mesh and explicit for KpointsData) and it always creates a lot of confusion, both in the code and for the user.

aiida/orm/nodes/data/container_code.py Outdated Show resolved Hide resolved
aiida/orm/nodes/data/container_code.py Outdated Show resolved Hide resolved
@@ -222,3 +236,4 @@ def is_local(self):
class CodeType(enum.Enum):
STORE_AND_UPLOAD = 'store in the db and upload'
ON_COMPUTER = 'on computer'
ON_CONTAINER = 'on container'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ON_CONTAINER = 'on container'
ON_CONTAINER = 'on container'

Would go with CONTAINERIZED or CONTAINER here.

aiida/orm/utils/loaders.py Outdated Show resolved Hide resolved
return ContainerCode

@classmethod
def _get_query_builder_label_identifier(cls, identifier, classes, operator='==', project='*'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As described above, if we simply make the entry point core.code.containerized then load_code can be used and none of this extra code is needed

@ltalirz
Copy link
Member

ltalirz commented Jan 10, 2022

@unkcpz What is the status of this?

I guess the 2.0 release candidate is due by Jan 15th, would be good to get this in?

@unkcpz
Copy link
Member Author

unkcpz commented Jan 10, 2022

Thanks for headup @ltalirz, I am cleaning up the commits and working on this today.

@unkcpz unkcpz force-pushed the container-code branch 4 times, most recently from 7df6adb to 15de2d6 Compare January 10, 2022 18:05
@unkcpz
Copy link
Member Author

unkcpz commented Jan 10, 2022

Got the error in CI https://github.com/aiidateam/aiida-core/runs/4765450356?check_suite_focus=true:

verdi -p test_${AIIDA_TEST_BACKEND} run ${SYSTEM_TESTS}/test_daemon.py
Running the `ArithmeticAddCalculation`
Warning: could not parse scheduler output: the `_scheduler-stderr.txt` file is missing
Warning: could not parse scheduler output: the `_scheduler-stdout.txt` file is missing
Warning: output parser returned exit code<310>: The output file could not be read.

It only appears in this branch, not sure which file change causing this. I can reproduce it in my local machine.

EDIT: By rollback escaping change this error is fixed although the _aiidasubmit.sh generated by two branches are all the same, very strange.

@unkcpz
Copy link
Member Author

unkcpz commented Jan 10, 2022

Another issue that block me to finish this PR is that to using docker as an example to configure the add code and run in containerized code, the redirection of input and output are confused. The symbol < and > to redirect the input and output cannot be used in the containized command. That is to say, we can not run

docker run -it -v $PWD:/workdir:ro ubuntu /bin/bash < aiida.in > aiida.out

But

docker run -it -v $PWD:/workdir:ro ubuntu sh -c '/bin/bash < aiida.in > aiida.out'

This also happened if we want (and we should) to keep on using the > and < in job script.

I am thinking maybe it is always safe to put the command without containized code after sh -c (not tested yet)? for example for command:

'mpirun' '-np' '1' '/home/aiida/.conda/envs/quantum-espresso-6.7/bin/pw.x' '-in' 'aiida.in'  > 'aiida.out'

When switch to containerized code we have:

<containerized specific command> 'sh' '-c' 'mpirun -np 1 /home/aiida/.conda/envs/quantum-espresso-6.7/bin/pw.x -in aiida.in  > aiida.out'

@ltalirz
Copy link
Member

ltalirz commented Jan 10, 2022

I think you are correct that you will need extra quotes around the command that is to be run inside the container.

You don't necessarily need to remove the quotes around the individual components - one could also escape them as in

<containerized specific command> '\'mpirun\' \'-np 1\' \'/home/aiida/.conda/envs/quantum-espresso-6.7/bin/pw.x\' \'-in\' \'aiida.in\'  > \'aiida.out\''

if that's easier (isn't there a way in AiiDA to do this automatically?)

@ltalirz
Copy link
Member

ltalirz commented Jan 10, 2022

Hm... upon closer look: what about the -a option in docker run?
https://docs.docker.com/engine/reference/run/#foreground
https://stackoverflow.com/a/50161815/1069467

@unkcpz
Copy link
Member Author

unkcpz commented Jan 11, 2022

Hm... upon closer look: what about the -a option in docker run?

Thanks @ltalirz. Yes, I saw that, there are ways to pass variables instead of using redirections <. We can also do:

docker run -it -v $PWD:/workdir:rw -w /workdir ubuntu /bin/bash  aiida.in > aiida.out

However, if I understand it correctly, it is defined in the calcjob plugins how to pass the input, either use -in for PwCalcJob or < for Add CalcJob. In principle, I think we should not override this.

@ltalirz
Copy link
Member

ltalirz commented Jan 11, 2022

Yes, sorry, forget what I said (I was a bit tired already ;-) ).

I think it is the right approach to have the directory mounted and, from thereon, make sure that the stdout/stderr redirection to files happens inside the container.

@ltalirz
Copy link
Member

ltalirz commented Mar 9, 2022

@unkcpz are you still waiting for something in particular with this PR?

@unkcpz
Copy link
Member Author

unkcpz commented Mar 9, 2022

Yes, I am waiting for #5350 get merged. More commits are now based on this PR and in my forked repo.

@ltalirz
Copy link
Member

ltalirz commented Apr 4, 2022

Hey @unkcpz , now that #5350 has been merged, would you be able to update this PR?

I think container support is a key feature for 2.0

@unkcpz
Copy link
Member Author

unkcpz commented Apr 4, 2022

Hi @ltalirz, thanks for the head up. I have other stuff in my hands and already have containerized code PR to finish in my schedule but after Thursday this week. Will take some time to focus on this then.

@sphuber
Copy link
Contributor

sphuber commented May 19, 2022

@unkcpz think this PR can be closed, correct? Majority of functionality has been merged in other PRs and the final addition of the code plugins is now taken care of in #5507

@unkcpz
Copy link
Member Author

unkcpz commented May 19, 2022

@sphuber yes. I close it. Thanks for head up.

@unkcpz unkcpz closed this May 19, 2022
@unkcpz unkcpz deleted the container-code branch May 19, 2022 13:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants