Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine: Fix bug introduced when refactoring upload_calculation #6348

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Apr 16, 2024

In 6898ff4 the implementation of the
processing of the local_copy_list in the upload_calculation method
was changed. Originally, the files specified by the local_copy_list
were first copied into the SandboxFolder before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.

For example, if the sandbox contained the directory folder and the
local_copy_list contained the items (_, 'file', './folder/file')
this would work just fine in the original implementation as the file
would be written to the folder on the remote folder. The current
implementation, however, would write the file content to folder/file
in a local temporary directory, and then iterate over the directories
and copy them over. Essentially it would be doing:

Transport.put('/tmpdir/folder', 'folder')

but since folder would already exist on the remote working directory
the local folder would be nested and so the final path on the remote
would be /workingdir/folder/folder/file.

The correct approach is to copy each item of the local_copy_list from
the local temporary directory individually using the Transport.put
interface and not iterate over the top-level entries of the temporary
directory at the end.

@sphuber sphuber added the priority/critical-blocking must be resolved before next release label Apr 16, 2024
@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from b851b73 to 13d400b Compare April 16, 2024 18:07
@sphuber
Copy link
Contributor Author

sphuber commented Apr 17, 2024

Note that this is a critical bug that is currently on main as it breaks for example things as simple as running a PwCalculation from aiida-quantumespresso so this has to be fixed before release.

@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from 13d400b to 20fc810 Compare April 19, 2024 09:32
@mbercx
Copy link
Member

mbercx commented Apr 24, 2024

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

$ tree
.
├── CRASH
├── _aiidasubmit.sh
├── _scheduler-stderr.txt
├── _scheduler-stdout.txt
├── aiida.in
├── aiida.out
├── out
└── pseudo
    └── pseudo
        └── Si.pbesol-n-rrkjus_psl.1.0.0.UPF

Can confirm that the changes in this PR fix that issue.

@sphuber
Copy link
Contributor Author

sphuber commented Apr 24, 2024

Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the pseudo folder would be copied inside an already existing one, causing QE to crash:

So he is running of the main branch, correct? Because that bug should not be present in an actual release. Can we merge this PR then? Or do you have doubts about the changes?

@mbercx
Copy link
Member

mbercx commented Apr 24, 2024

So he is running of the main branch, correct?

Yes, it seems so, and reverting to v2.5.1 fixed the issue.

Can we merge this PR then? Or do you have doubts about the changes?

I still want to have a proper look at the code, so I can also make sure I understand the changes in 6898ff4. Should have time for this tomorrow or Friday.

@mbercx
Copy link
Member

mbercx commented May 1, 2024

Looking into this some more, won't the following line cause similar woes?

https://github.com/sphuber/aiida-core/blob/20fc81065731a05eb5ee1da95976bda504d7f67d/src/aiida/engine/daemon/execmanager.py#L293

I was playing around with the following code:

import os
import pathlib
import shutil
from tempfile import TemporaryDirectory
from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

folder_data = orm.FolderData(tree=pseudo_path)

shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:

        dirpath = pathlib.Path(tmpdir)

        data_node = folder_data

        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)

        data_node.base.repository.copy_tree(filepath_target, 'pseudo')

        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

The code above will give the following directory tree:

.
├── data
│   └── pseudo
│       ├── Ba.upf
│       └── Si.upf
├── sandbox.ipynb
└── workdir
    └── pseudo
        ├── Ba.upf
        ├── Si.upf
        └── pseudo
            ├── Ba.upf
            └── Si.upf

But switching the order of the copy_local and transport.copy will not result in the nested psuedo folder.

@sphuber
Copy link
Contributor Author

sphuber commented May 1, 2024

But switching the order of the copy_local and transport.copy will not result in the nested psuedo folder.

Sure, but that is because you are calling the following

transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

And that is saying copy the contents of the source pseudo inside the pseudo target directory. It is not saying update the contents of the target pseudo dir with the contents of the source pseudo. What you are trying to do is:

transport.copy(os.path.join(pseudo_path, 'pseudo/*'), 'pseudo')

So you are globbing the contents of pseudo and copying that into the target pseudo. If you change this line, then your script works as expected. This is also exactly the change I added in this PR for the local copy.

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

@mbercx
Copy link
Member

mbercx commented May 1, 2024

So I don't think there is a regression in the behavior of transport.copy as it is used for the remote_copy_list is there?

I don't think there is a regression, I was just wondering if we should make a similar change for remote_copy_list, since now the copying behavior for the two is different when the folder is already present due to a previous copying operation.

I rewrote the example to rely on the functions in the execmanager to see if I can break things (as I like to do):

from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())

Critical of course are the local_copy_list_item and remote_copy_list_item definitions:

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

Here, all is well, since I use the glob * to copy the contents of /Users/mbercx/project/core/jupyter/data/pseudo, as you noted. Removing the glob results in the nested pseudo dirs, but perhaps this is expected.

If we remove the glob, and invert the _copy_local_files and _copy_remote_files operations (as we currently allow):

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:

    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)

    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])

the behavior is different, i.e. there is no nested pseudo folder. I guess now my question is if this difference in behavior is problematic.

@unkcpz
Copy link
Member

unkcpz commented May 2, 2024

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it not a issue that the behaviors are not parallel with each other?

@sphuber
Copy link
Contributor Author

sphuber commented May 2, 2024

Since _copy_remote_files and _copy_local_files are not methods for public use, maybe it not a issue that the behaviors are not parallel with each other?

Sure, but they are actually used by the user, albeit it indirectly. The local_copy_list and remote_copy_list instructions are actually passed directly to these functions. If they now require different instructions to achieve the same behavior, I can see how that would be confusing and inconsistent. For example if in local_copy_list you can/have to use relative_source/* and for remote_copy_list you don't use the glob. So I agree with @mbercx that we should look at it being consistent.

I need to try and find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on main only change the behavior of local_copy_list which is why I had to change it. But remote_copy_list should not have changed, so I think the highlighted problem of nesting should be present in v2.5.1 as well.

In 6898ff4 the implementation of the
processing of the `local_copy_list` in the `upload_calculation` method
was changed. Originally, the files specified by the `local_copy_list`
were first copied into the `SandboxFolder` before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.
@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2024

TLDR: The solution of this PR is wrong. Not even so much for the discrepancy in behavior of local and remote copy lists, but really because the implementation of _copy_local_files does not directly respect the copying instructions.

This is a tricky one. The first question is what the behavior of Transport.put and Transport.copy (intuitively) should be. If you copy dir_a to dir_a, what should happen if a) dir_a does already exist in the target or b) dir_a does not exist. If we follow the behavior of cp, then there should be a difference in the two cases. If the target dir_a already exists, the dir_a source will be copied inside the existing directory, i.e., it will be nested. This is the current behavior of Transport.put and Transport.copy so it stands to be argued that that is the expected behavior.

The problem is really due to a detail of the original implementation of upload_calculation. There are three sources of input files that are provided by the calcjob plugin that are copied to the remote working directory:

  1. The sandbox folder
  2. The local_copy_list
  3. The remote_copy_list

The implementation did not literally copy the contents of each sequentially to the working directory. Rather, it would copy the instructions of the local_copy_list to the sandbox folder, and then it would copy the sandbox folder wholesale to the working dir. Critically, it did not use the Transport interface for the first operation, but simply used normal Python file operations since the sandbox sits on the local file system. The remote_copy_list was actually done through the transport interface directly to the working directory.

In the new implementation, this was changed, where the 3 copying steps are directly copied to the remote working dir, and the local_copy_list contents are not merged into the sandbox folder first. But this now leads to problem if the sandbox folder creates a folder on the remote, that also exists in the local_copy_list at which point the latter will now be nested. This happens for PwCalculation for example, because the plugin manually creates the pseudo folder in the sandbox, and then also copies the UpfData files to ./pseudo/filename through the local_copy_list.

In principle, getting rid of the "hack" of merging local_copy_list with the sandbox is a good thing in my opinion. However, the exact manner I did it in is causing the breaking of original functionality I believe. The _copy_local_files has to copy the files from the instructions to a temporary file on the local disk because the Transport interface does not allow file handles, and AiiDA's repository interface can only provide streaming handles. The problem is that I first copy all items of local_copy_list to a temp dir on disk, and then copy all the contents to the remote working dir, however, this is not identical to copying the local_copy_list instructions over one by one. So the solution, I think, is to copy each item of local_copy_list as soon as it is written to temporary file locally, instead of copying them all locally and then copying the temp dir in one go.

@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from 20fc810 to c2a5903 Compare May 5, 2024 17:10
@sphuber sphuber changed the title LocalTransport: Accept existing directories in puttree Engine: Fix bug introduced when refactoring upload_calculation May 5, 2024
@sphuber
Copy link
Contributor Author

sphuber commented May 5, 2024

Thanks for the critical review @mbercx . My original solution was horribly flawed and was trying to cover up a bad refactoring of the upload_calculation in an earlier commit. I am now pretty confident that I have found the correct solution which doesn't introduce differences between local and remote copy operations and does not have to touch the behavior of the transports whatsoever. I have tested the change with aiida-quantumespresso and it now works as expected.

Scratch that... it is still not quite that simple 😭

In 6898ff4 the implementation of the
processing of the `local_copy_list` in the `upload_calculation` method
was changed. Originally, the files specified by the `local_copy_list`
were first copied into the `SandboxFolder` before copying its contents
to the working directory using the transport. The commit allowed the
order in which the local and sandbox files were copied, so the local
files were now no longer copied through the sandbox. Rather, they were
copied to a temporary directory on disk, which was then copied over
using the transport. The problem is that if the latter would copy over a
directory that was already created by the copying of the sandbox, an
exception would be raised.

For example, if the sandbox contained the directory `folder` and the
`local_copy_list` contained the items `(_, 'file', './folder/file')`
this would work just fine in the original implementation as the `file`
would be written to the `folder` on the remote folder. The current
implementation, however, would write the file content to `folder/file`
in a local temporary directory, and then iterate over the directories
and copy them over. Essentially it would be doing:

    Transport.put('/tmpdir/folder', 'folder')

but since `folder` would already exist on the remote working directory
the local folder would be _nested_ and so the final path on the remote
would be `/workingdir/folder/folder/file`.

The correct approach is to copy each item of the `local_copy_list` from
the local temporary directory _individually_ using the `Transport.put`
interface and not iterate over the top-level entries of the temporary
directory at the end.
@sphuber sphuber force-pushed the fix/local-copy-list-subdirectory branch from c2a5903 to db3c9f1 Compare May 5, 2024 19:10
@sphuber sphuber requested a review from mbercx May 6, 2024 18:01
@sphuber
Copy link
Contributor Author

sphuber commented May 6, 2024

@mbercx could you give this another review please? The behavior of the remote_copy_list and local_copy_list in your example is now the same once again and I believe the original behavior of the latest release is also restored.

Copy link
Member

@mbercx mbercx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sphuber! I think the critical question is indeed what the behavior of Transport.put and Transport.copy should be. I think it'd be hard to change their behavior from what you describe above, so I agree it makes sense to keep cp-like behavior.

I had a closer look at the code and did some more field testing on the behavior of the various FileCopyOperations. I left two comments so far, of which the double comment on line 364 re copying the contents of a FileType.DIRECTORY is the most critical.

# Now copy the contents of the temporary folder to the remote working directory using the transport
for filepath in dirpath.iterdir():
transport.put(str(filepath), filepath.name)
transport.makedirs(str(pathlib.Path(target).parent), ignore_existing=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that because of line 359 and this one, the copy behaviour of local_copy_list is not the same as cp, which would simply fail in case the parent folder that you are trying to copy into doesn't exist.

I wonder if this was also the previous behavior of local_copy_list. The QE plugin made the pseudo directory in the sandbox folder exactly because otherwise the copy command would fail, I assume.

Finally, remote_copy_list does fail when trying to copy files into a parent folder that doesn't exist.

@@ -360,15 +361,14 @@ def _copy_local_files(logger, node, transport, inputs, local_copy_list):
if data_node.base.repository.get_object(filename_source).file_type == FileType.DIRECTORY:
# If the source object is a directory, we copy its entire contents
data_node.base.repository.copy_tree(filepath_target, filename_source)
transport.put(f'{dirpath}/*', target or '.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment below, I was wondering if this means that the local_copy_list behavior is once again different from cp. Funny enough, it does seem similar when using -r and adding a forward slash after the source directory:

❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo ./pseudo; tree
.
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

3 directories, 2 files
❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo/ ./pseudo; tree
.
└── pseudo
    ├── Ba.upf
    └── Si.upf

2 directories, 2 files

Kind of similar to rsync, I guess.

Copy link
Member

@mbercx mbercx May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, actually, after having a closer look, the behaviour seems different than I expected. With the following code:

import shutil

from logging import LoggerAdapter

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.common.folders import SandboxFolder
from aiida.engine.daemon.execmanager import _copy_local_files, _copy_sandbox_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))

localhost = orm.load_computer('localhost')

remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
test_path = '/Users/mbercx/project/core/jupyter/test_qe'

shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=test_path)
folder_data.store()

local_copy_list = [
    (folder_data.uuid, 'pseudo', 'pseudo'),
]

with SandboxFolder() as sandbox_folder:

    sandbox_folder.get_subfolder('pseudo', create=True)

    with localhost.get_transport() as transport:

        transport.mkdir(remote_workdir)
        transport.chdir(remote_workdir)

        _copy_sandbox_files(logger, random_calc_job, transport, sandbox_folder)
        _copy_local_files(logger, random_calc_job, transport, None, local_copy_list)

and the contents of test_qe:

test_qe
└── pseudo
    ├── Ba.upf
    └── Si.upf

I indeed get a nested folder:

workdir/
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

Not creating the pseudo folder in the sandbox leads to the non-nested result. However, if I make the local_copy_list:

local_copy_list = [
    (folder_data.uuid, 'pseudo', '.'),
]

Then the workdir becomes:

workdir/
├── Ba.upf
└── Si.upf

Is that what we want? Now we are really doing the cp -r pseudo/ . (with forward slash) option, which means copy the contents of the directory to the target path.

@mbercx
Copy link
Member

mbercx commented May 8, 2024

@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review?

I think the discrepancy between local_copy_list and remote_copy_list in allowing the absence of parent folders is fine, although we could also allow this for remote_copy_list as well by making the parent folders as we do for remote_copy_list.

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange. I suppose this was necessary to maintain backwards compatibility with the previous approach of merging the sandbox folder and local_copy_list?

@sphuber
Copy link
Contributor Author

sphuber commented May 8, 2024

The fact that local_copy_list copies the contents of a directory instead of the directory itself seems somewhat strange

If and only if the target directory already exists right? Otherwise it just copies as is.

The problem here is indeed really the fact that the old implementation did not go directly through the Transport but had an intermediate merging step that used a different API and so we kind of now have to maintain that behavior in order to not break existing plugins, as demonstrated by PwCalculation breaking.

@mbercx
Copy link
Member

mbercx commented May 8, 2024

If and only if the target directory already exists right? Otherwise it just copies as is.

Not sure if that's true, see my example near the end of #6348 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/critical-blocking must be resolved before next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants