Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gcsfs put copies files out of order #468

Open
mlahir1 opened this issue Apr 4, 2022 · 2 comments
Open

gcsfs put copies files out of order #468

mlahir1 opened this issue Apr 4, 2022 · 2 comments

Comments

@mlahir1
Copy link

mlahir1 commented Apr 4, 2022

i am trying to copy multiple files using gcsfs put API. The syntax provided to copy multiple files goes like this. lpath1 is supposed to put the data into rapth1 and respectively for other files.
gcs.put([lpath1, lpath2, lpath3, lapth4], [rpath1, rpath2, rpath3, rpath4])

Issue:
The copy happens out of order. lpath1 write the results to any of the four output paths, and so are the other input files. Is this behavior expected.

Expected Behavior:
lpath1 is supposed to put the data into rapth1 and respectively for other files
lpath1->rpath1
lpath2->rpath2
lpath3->rpath3
lpath4->rpath4

Example:
fs.put(
[f'{local_path}/f' for f in files] ,
[f'{gcs_path}/f' for f in files],
recursive=True
)

ls -lh on Local files give:

-rw-r--r--. 1 root root 169M Mar 25 15:37 01-080000
-rw-r--r--. 1 root root 179M Mar 25 15:37 01-090000
-rw-r--r--. 1 root root 173M Mar 25 15:37 01-100000
-rw-r--r--. 1 root root 202M Mar 25 15:37 01-110000

After the copy using above command, ls -lh on gcp buckets gives:
172.61 MiB 2022-03-25T15:38:15Z gs:///01-080000
TOTAL: 1 objects, 180989509 bytes (172.61 MiB)
201.31 MiB 2022-03-25T15:38:15Z gs:///01-090000
TOTAL: 1 objects, 211084829 bytes (201.31 MiB)
168.31 MiB 2022-03-25T15:38:16Z gs:///01-100000
TOTAL: 1 objects, 176486219 bytes (168.31 MiB)
178.05 MiB 2022-03-25T15:38:15Z gs:///01-110000
TOTAL: 1 objects, 186697415 bytes (178.05 MiB)

@martindurant
Copy link
Member

I am not seeing this behaviour :|

lpath = ['0', '1', '2', '3', '4']
for lp in lpath:
    open(lp, "w").write(lp)
rpath = [f"mdtemp/{i}" for i in lpath]
gcs.put(lpath, rpath);
gcs.cat(rpath)
{'mdtemp/0': b'0',
 'mdtemp/1': b'1',
 'mdtemp/2': b'2',
 'mdtemp/3': b'3',
 'mdtemp/4': b'4'}

I do notice that you have recursive=True, but that you are not really dealing with anything recursive. Does removing that help?

@quassy
Copy link

quassy commented Feb 7, 2023

I can confirm this issue and I don't use recursive=True.

@martindurant You are not seeing the issue because your input lpaths are sorted. Try it with an unsorted list and it should break.

I think the issue is here, line numbers in fsspec==2022.11.0 & gcsfs==2022.11.0:

spec.py:893-920

    def put(self, lpath, rpath, recursive=False, callback=_DEFAULT_CALLBACK, **kwargs):
        """Copy file(s) from local.

        Copies a specific file or tree of files (if recursive=True). If rpath
        ends with a "/", it will be assumed to be a directory, and target files
        will go within.

        Calls put_file for each source.
        """
        from .implementations.local import LocalFileSystem, make_path_posix

        rpath = (
            self._strip_protocol(rpath)
            if isinstance(rpath, str)
            else [self._strip_protocol(p) for p in rpath]
        )
        if isinstance(lpath, str):
            lpath = make_path_posix(lpath)
        fs = LocalFileSystem()
        lpaths = fs.expand_path(lpath, recursive=recursive)
        rpaths = other_paths(
            lpaths, rpath, exists=isinstance(rpath, str) and self.isdir(rpath)
        )

        callback.set_size(len(rpaths))
        for lpath, rpath in callback.wrap(zip(lpaths, rpaths)):
            callback.branch(lpath, rpath, kwargs)
            self.put_file(lpath, rpath, **kwargs)

In line 912 fs.expand_path(lpath, recursive=recursive) on lpaths is called, which sorts its output in line 987:

    def expand_path(self, path, recursive=False, maxdepth=None):
        """Turn one or more globs or directories into a list of all matching paths
        to files or directories."""
        if isinstance(path, str):
            out = self.expand_path([path], recursive, maxdepth)
        else:
            # reduce depth on each recursion level unless None or 0
            maxdepth = maxdepth if not maxdepth else maxdepth - 1
            out = set()
            path = [self._strip_protocol(p) for p in path]
            for p in path:
                if has_magic(p):
                    bit = set(self.glob(p))
                    out |= bit
                    if recursive:
                        out |= set(
                            self.expand_path(
                                list(bit), recursive=recursive, maxdepth=maxdepth
                            )
                        )
                    continue
                elif recursive:
                    rec = set(self.find(p, maxdepth=maxdepth, withdirs=True))
                    out |= rec
                if p not in out and (recursive is False or self.exists(p)):
                    # should only check once, for the root
                    out.add(p)
        if not out:
            raise FileNotFoundError(path)
        return list(sorted(out))

quassy added a commit to quassy/filesystem_spec that referenced this issue Feb 7, 2023
`put` currently sorts the list of input paths so when they are not in order before already, the don't match their output paths anymore and file names get mixed up. This fixes fsspec/gcsfs#468
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants