Engine: Fix bug introduced when refactoring upload_calculation
#6348
base: main
b851b73
to
13d400b
Compare
Note that this is a critical bug that is currently on |
13d400b
to
20fc810
Compare
Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the
Can confirm that the changes in this PR fix that issue. |
So he is running off the |
Yes, it seems so, and reverting to v2.5.1 fixed the issue.
I still want to have a proper look at the code, so I can also make sure I understand the changes in 6898ff4. Should have time for this tomorrow or Friday. |
Looking into this some more, won't the following line cause similar woes? I was playing around with the following code:

import os
import pathlib
import shutil
from tempfile import TemporaryDirectory

from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'
folder_data = orm.FolderData(tree=pseudo_path)
shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:
        dirpath = pathlib.Path(tmpdir)
        data_node = folder_data
        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)
        data_node.base.repository.copy_tree(filepath_target, 'pseudo')
        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

The code above will give the following directory tree:
But switching the order of the |
Sure, but that is because you are calling the following transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo') And that is saying copy the contents of the source
So you are globbing the contents of So I don't think there is a regression in the behavior of |
I don't think there is a regression, I was just wondering if we should make a similar change for I rewrote the example to rely on the functions in the

from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))
localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'
shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())

Critical of course are the

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

Here, all is well, since I use the glob If we remove the glob, and invert the

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])

the behavior is different, i.e. there is no nested |
Since |
Sure, but they are actually used by the user, albeit indirectly. The I need to try and find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on |
In 6898ff4 the implementation of the processing of the `local_copy_list` in the `upload_calculation` method was changed. Originally, the files specified by the `local_copy_list` were first copied into the `SandboxFolder` before its contents were copied to the working directory using the transport. The commit made it possible to change the order in which the local and sandbox files are copied, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that had already been created when copying the sandbox, an exception would be raised.
This reverts commit 424027f.
TLDR: The solution of this PR is wrong. Not even so much for the discrepancy in behavior of local and remote copy lists, but really because the implementation of This is a tricky one. The first question is what the behavior of The problem is really due to a detail of the original implementation of
The implementation did not literally copy the contents of each sequentially to the working directory. Rather, it would copy the instructions of the In the new implementation, this was changed, where the 3 copying steps are directly copied to the remote working dir, and the In principle, getting rid of the "hack" of merging |
20fc810
to
c2a5903
Compare
`LocalTransport`: Accept existing directories in `puttree`
upload_calculation
Scratch that... it is still not quite that simple 😭 |
In 6898ff4 the implementation of the processing of the `local_copy_list` in the `upload_calculation` method was changed. Originally, the files specified by the `local_copy_list` were first copied into the `SandboxFolder` before its contents were copied to the working directory using the transport. The commit made it possible to change the order in which the local and sandbox files are copied, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that had already been created when copying the sandbox, an exception would be raised.

For example, if the sandbox contained the directory `folder` and the `local_copy_list` contained the item `(_, 'file', './folder/file')`, this would work just fine in the original implementation, as the `file` would be written to the `folder` in the remote working directory. The current implementation, however, would write the file content to `folder/file` in a local temporary directory, and then iterate over the top-level directories and copy them over. Essentially it would be doing:

    Transport.put('/tmpdir/folder', 'folder')

but since `folder` would already exist in the remote working directory, the local folder would be _nested_ and so the final path on the remote would be `/workingdir/folder/folder/file`.

The correct approach is to copy each item of the `local_copy_list` from the local temporary directory _individually_ using the `Transport.put` interface, and not to iterate over the top-level entries of the temporary directory at the end.
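The nesting described in this commit message can be reproduced without AiiDA. The following is a minimal standard-library sketch: `put_like_cp` is a hypothetical stand-in that assumes `Transport.put` follows `cp -r`-like semantics for directories (the source is copied *into* the destination if that directory already exists). It is not the actual transport implementation.

```python
import shutil
import tempfile
from pathlib import Path


def put_like_cp(source: Path, destination: Path) -> None:
    """Hypothetical stand-in for ``Transport.put`` with ``cp -r``-like semantics."""
    if source.is_dir():
        if destination.is_dir():
            # Destination already exists: the source directory is nested inside it.
            destination = destination / source.name
        shutil.copytree(source, destination)
    else:
        shutil.copy(source, destination)


def demo():
    results = []
    for per_item in (False, True):
        with tempfile.TemporaryDirectory() as tmp:
            staging = Path(tmp) / 'staging'
            workdir = Path(tmp) / 'workdir'
            # The sandbox copy already created ``folder`` in the working directory.
            (workdir / 'folder').mkdir(parents=True)
            # The ``local_copy_list`` item targets ``./folder/file``.
            (staging / 'folder').mkdir(parents=True)
            (staging / 'folder' / 'file').write_text('content')

            if per_item:
                # Copy the item individually to its exact target path.
                put_like_cp(staging / 'folder' / 'file', workdir / 'folder' / 'file')
            else:
                # Iterate over the top-level entries of the staging directory.
                for entry in staging.iterdir():
                    put_like_cp(entry, workdir / entry.name)

            files = sorted(
                path.relative_to(workdir).as_posix()
                for path in workdir.rglob('*')
                if path.is_file()
            )
            results.append(files)
    return results[0], results[1]


nested, flat = demo()
print(nested)  # ['folder/folder/file']
print(flat)    # ['folder/file']
```

Under these assumed semantics, only the per-item copy reproduces the target path that the `local_copy_list` entry asked for.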
c2a5903
to
db3c9f1
Compare
@mbercx could you give this another review please? The behavior of the |
Thanks @sphuber! I think the critical question is indeed what the behavior of `Transport.put` and `Transport.copy` should be. I think it'd be hard to change their behavior from what you describe above, so I agree it makes sense to keep `cp`-like behavior.
I had a closer look at the code and did some more field testing on the behavior of the various `FileCopyOperation`s. I left two comments so far, of which the double comment on line 364 re copying the contents of a `FileType.DIRECTORY` is the most critical.
# Now copy the contents of the temporary folder to the remote working directory using the transport
for filepath in dirpath.iterdir():
    transport.put(str(filepath), filepath.name)
transport.makedirs(str(pathlib.Path(target).parent), ignore_existing=True)
Just a note that because of line 359 and this one, the copy behaviour of `local_copy_list` is not the same as `cp`, which would simply fail in case the parent folder that you are trying to copy into doesn't exist.
I wonder if this was also the previous behavior of `local_copy_list`. The QE plugin made the `pseudo` directory in the sandbox folder exactly because otherwise the copy command would fail, I assume.
Finally, `remote_copy_list` does fail when trying to copy files into a parent folder that doesn't exist.
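To make the difference concrete, here is a small standard-library sketch (not AiiDA code). A plain copy into a missing parent directory fails, much like `cp`, whereas creating the parent first, as the `transport.makedirs(..., ignore_existing=True)` call does, lets the copy succeed. The helper `copy_file` and its `create_parents` flag are hypothetical names used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_file(source: Path, target: Path, create_parents: bool) -> bool:
    """Copy ``source`` to ``target``; return ``False`` if the parent directory is missing."""
    if create_parents:
        # Analogous to calling ``makedirs`` with ``ignore_existing=True`` before the copy.
        target.parent.mkdir(parents=True, exist_ok=True)
    try:
        shutil.copyfile(source, target)
    except FileNotFoundError:
        # Same outcome as ``cp file missing_dir/file``: the copy is refused.
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'Si.upf'
    source.write_text('pseudo')
    workdir = Path(tmp) / 'workdir'
    workdir.mkdir()

    # First attempt: ``workdir/pseudo`` does not exist yet and is not created.
    strict = copy_file(source, workdir / 'pseudo' / 'Si.upf', create_parents=False)
    # Second attempt: the parent directory is created before copying.
    lenient = copy_file(source, workdir / 'pseudo' / 'Si.upf', create_parents=True)

print(strict, lenient)  # False True
```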
@@ -360,15 +361,14 @@ def _copy_local_files(logger, node, transport, inputs, local_copy_list):
    if data_node.base.repository.get_object(filename_source).file_type == FileType.DIRECTORY:
        # If the source object is a directory, we copy its entire contents
        data_node.base.repository.copy_tree(filepath_target, filename_source)
        transport.put(f'{dirpath}/*', target or '.')
Similar to my comment below, I was wondering if this means that the `local_copy_list` behavior is once again different from `cp`. Funny enough, it does seem similar when using `-r` and adding a forward slash after the source directory:

❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo ./pseudo; tree
.
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

3 directories, 2 files

❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo/ ./pseudo; tree
.
└── pseudo
    ├── Ba.upf
    └── Si.upf

2 directories, 2 files

Kind of similar to `rsync`, I guess.
Hmm, actually, after having a closer look, the behaviour seems different than I expected. With the following code:

import shutil
from logging import LoggerAdapter

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.common.folders import SandboxFolder
from aiida.engine.daemon.execmanager import _copy_local_files, _copy_sandbox_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))
localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
test_path = '/Users/mbercx/project/core/jupyter/test_qe'
shutil.rmtree(remote_workdir, ignore_errors=True)

folder_data = orm.FolderData(tree=test_path)
folder_data.store()

local_copy_list = [
    (folder_data.uuid, 'pseudo', 'pseudo'),
]

with SandboxFolder() as sandbox_folder:
    sandbox_folder.get_subfolder('pseudo', create=True)
    with localhost.get_transport() as transport:
        transport.mkdir(remote_workdir)
        transport.chdir(remote_workdir)
        _copy_sandbox_files(logger, random_calc_job, transport, sandbox_folder)
        _copy_local_files(logger, random_calc_job, transport, None, local_copy_list)
and the contents of `test_qe`:

test_qe
└── pseudo
    ├── Ba.upf
    └── Si.upf

I indeed get a nested folder:

workdir/
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf

Not creating the `pseudo` folder in the sandbox leads to the non-nested result. However, if I make the `local_copy_list`:

local_copy_list = [
    (folder_data.uuid, 'pseudo', '.'),
]

Then the `workdir` becomes:

workdir/
├── Ba.upf
└── Si.upf

Is that what we want? Now we are really doing the `cp -r pseudo/ .` (with forward slash) option, which means copy the contents of the directory to the target path.
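The three outcomes above (nested, non-nested, and flattened) can be mimicked with the standard library alone. This is only a sketch under stated assumptions: `put_like_cp` and `copy_directory_item` are hypothetical helpers that imitate staging a directory under `target` in a temporary folder and then putting the staged top-level entries into `target` with `cp -r`-like semantics. They are not the AiiDA functions themselves.

```python
import shutil
import tempfile
from pathlib import Path


def put_like_cp(source: Path, destination: Path) -> None:
    """Hypothetical ``cp -r``-like put: nest the source if the destination directory exists."""
    if source.is_dir():
        if destination.is_dir():
            destination = destination / source.name
        shutil.copytree(source, destination)
    else:
        shutil.copy(source, destination)


def copy_directory_item(source_dir: Path, workdir: Path, target: str) -> list:
    """Stage ``source_dir`` under ``target``, then put the staged entries into ``target``."""
    with tempfile.TemporaryDirectory() as staging:
        staged = Path(staging) / target
        shutil.copytree(source_dir, staged, dirs_exist_ok=True)
        # Imitates ``transport.put(f'{dirpath}/*', target or '.')``.
        for entry in Path(staging).glob('*'):
            put_like_cp(entry, workdir / target)
    return sorted(
        path.relative_to(workdir).as_posix()
        for path in workdir.rglob('*')
        if path.is_file()
    )


def run(target: str, precreate: bool) -> list:
    with tempfile.TemporaryDirectory() as tmp:
        source_dir = Path(tmp) / 'test_qe' / 'pseudo'
        source_dir.mkdir(parents=True)
        for name in ('Ba.upf', 'Si.upf'):
            (source_dir / name).write_text(name)
        workdir = Path(tmp) / 'workdir'
        workdir.mkdir()
        if precreate:
            # Mimics the sandbox having created ``pseudo`` beforehand.
            (workdir / 'pseudo').mkdir()
        return copy_directory_item(source_dir, workdir, target)


nested = run('pseudo', precreate=True)
plain = run('pseudo', precreate=False)
flattened = run('.', precreate=False)
print(nested)     # ['pseudo/pseudo/Ba.upf', 'pseudo/pseudo/Si.upf']
print(plain)      # ['pseudo/Ba.upf', 'pseudo/Si.upf']
print(flattened)  # ['Ba.upf', 'Si.upf']
```

In this model the result depends both on the `target` value and on whether the target directory already exists, which is the discrepancy the comment is pointing at.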
@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review? I think the discrepancy between The fact that |
If and only if the target directory already exists right? Otherwise it just copies as is. The problem here is indeed really the fact that the old implementation did not go directly through the |
Not sure if that's true, see my example near the end of #6348 (comment) |