-
Notifications
You must be signed in to change notification settings - Fork 141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gcsfs put copies files out of order #468
Comments
I am not seeing this behaviour :| lpath = ['0', '1', '2', '3', '4']
for lp in lpath:
open(lp, "w").write(lp)
rpath = [f"mdtemp/{i}" for i in lpath]
gcs.put(lpath, rpath);
gcs.cat(rpath)
{'mdtemp/0': b'0',
'mdtemp/1': b'1',
'mdtemp/2': b'2',
'mdtemp/3': b'3',
'mdtemp/4': b'4'} I do notice that you have |
I can confirm this issue and I don't use @martindurant You are not seeing the issue because your input lpaths are sorted. Try it with an unsorted list and it should break. I think the issue is here, line numbers in fsspec==2022.11.0 & gcsfs==2022.11.0: spec.py:893-920 def put(self, lpath, rpath, recursive=False, callback=_DEFAULT_CALLBACK, **kwargs):
"""Copy file(s) from local.
Copies a specific file or tree of files (if recursive=True). If rpath
ends with a "/", it will be assumed to be a directory, and target files
will go within.
Calls put_file for each source.
"""
from .implementations.local import LocalFileSystem, make_path_posix
rpath = (
self._strip_protocol(rpath)
if isinstance(rpath, str)
else [self._strip_protocol(p) for p in rpath]
)
if isinstance(lpath, str):
lpath = make_path_posix(lpath)
fs = LocalFileSystem()
lpaths = fs.expand_path(lpath, recursive=recursive)
rpaths = other_paths(
lpaths, rpath, exists=isinstance(rpath, str) and self.isdir(rpath)
)
callback.set_size(len(rpaths))
for lpath, rpath in callback.wrap(zip(lpaths, rpaths)):
callback.branch(lpath, rpath, kwargs)
self.put_file(lpath, rpath, **kwargs) In line 912 def expand_path(self, path, recursive=False, maxdepth=None):
"""Turn one or more globs or directories into a list of all matching paths
to files or directories."""
if isinstance(path, str):
out = self.expand_path([path], recursive, maxdepth)
else:
# reduce depth on each recursion level unless None or 0
maxdepth = maxdepth if not maxdepth else maxdepth - 1
out = set()
path = [self._strip_protocol(p) for p in path]
for p in path:
if has_magic(p):
bit = set(self.glob(p))
out |= bit
if recursive:
out |= set(
self.expand_path(
list(bit), recursive=recursive, maxdepth=maxdepth
)
)
continue
elif recursive:
rec = set(self.find(p, maxdepth=maxdepth, withdirs=True))
out |= rec
if p not in out and (recursive is False or self.exists(p)):
# should only check once, for the root
out.add(p)
if not out:
raise FileNotFoundError(path)
return list(sorted(out)) |
`put` currently sorts the list of input paths so when they are not in order before already, the don't match their output paths anymore and file names get mixed up. This fixes fsspec/gcsfs#468
i am trying to copy multiple files using gcsfs put API. The syntax provided to copy multiple files goes like this. lpath1 is supposed to put the data into rapth1 and respectively for other files.
gcs.put([lpath1, lpath2, lpath3, lapth4], [rpath1, rpath2, rpath3, rpath4])
Issue:
The copy happens out of order. lpath1 write the results to any of the four output paths, and so are the other input files. Is this behavior expected.
Expected Behavior:
lpath1 is supposed to put the data into rapth1 and respectively for other files
lpath1->rpath1
lpath2->rpath2
lpath3->rpath3
lpath4->rpath4
Example:
fs.put(
[f'{local_path}/f' for f in files] ,
[f'{gcs_path}/f' for f in files],
recursive=True
)
ls -lh on Local files give:
-rw-r--r--. 1 root root 169M Mar 25 15:37 01-080000
-rw-r--r--. 1 root root 179M Mar 25 15:37 01-090000
-rw-r--r--. 1 root root 173M Mar 25 15:37 01-100000
-rw-r--r--. 1 root root 202M Mar 25 15:37 01-110000
After the copy using above command, ls -lh on gcp buckets gives:
172.61 MiB 2022-03-25T15:38:15Z gs:///01-080000
TOTAL: 1 objects, 180989509 bytes (172.61 MiB)
201.31 MiB 2022-03-25T15:38:15Z gs:///01-090000
TOTAL: 1 objects, 211084829 bytes (201.31 MiB)
168.31 MiB 2022-03-25T15:38:16Z gs:///01-100000
TOTAL: 1 objects, 176486219 bytes (168.31 MiB)
178.05 MiB 2022-03-25T15:38:15Z gs:///01-110000
TOTAL: 1 objects, 186697415 bytes (178.05 MiB)
The text was updated successfully, but these errors were encountered: