Skip to content

Conversation

@ianthomas23
Copy link
Collaborator

This adds a full set of AbstractGetTests and AbstractPutTests that are equivalent to the existing AbstractCopyTests, and ports the recent changes from synchronous copy/get/put code (#1250, #1254, #1255, #1259) to async. This completes the top 10 most common cp/get/put use cases documented by #713.

The abstract test suite all passes locally for me using local and memory filesystems (the latter skips 4 tests, see below) and also s3fs (using the branch that is fsspec/s3fs#713).

Noteworthy implementation details:

  • Filesystem scope for tests is per test class, so there is one setup-and-teardown combination for each of the AbstractCopyTests, AbstractGetTests and AbstractPutTests. This doesn't make any real difference for local and memory tests, but makes the s3fs tests much faster than individual test scope. The scope could be made once per test session, but that would involve a bit of a refactor so is not planned yet.
  • Setting up the source and target directories for these tests uses fixtures that automatically clean up after themselves at the end of each test, ensuring that the filesystem is empty ready for the next test.

Work to do in future PRs soon:

  • Four of the memory get tests are skipped as currently there is a bespoke code path that does not auto create a local directory when required. Needs a fix to MemoryFileSystem.
  • I haven't fully dealt with all combinations of trailing slashes on source and target directories yet.
  • gcsfs doesn't pass all the new tests yet.

I propose the following order of operations:

  1. Review and merge this when CI happy.
  2. I'll update Test using new abstract test harness s3fs#713 to check that s3fs is happy with the new changes, then review and merge of that PR.
  3. A new release of fsspec is advisable then, as one of my recent PRs broke async get in some situations and this is fixed in this PR.
  4. I'll fix the above 3 "future PRs" issues.

Longer term there are other improvements that are not urgent:

  1. Cover some of the less-common cp/get/put use cases.
  2. Test more of implementations that ship in fsspec.
  3. Document how to use the abstract test harness in downstream projects.
  4. Move tests from original test scripts to the new abstract test classes.

@ianthomas23
Copy link
Collaborator Author

@martindurant It would be good if you could check the async code looks OK as that is the area I am least familiar with.

fsspec/asyn.py Outdated
in ``fsspec.config.conf``, falling back to 1/8th of the system limit .
"""
from fsspec.implementations.local import LocalFileSystem, make_path_posix
from .implementations.local import (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we get circular imports if we move all these from the methods to the top of the module?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the imports to the top works fine locally, I've pushed a commit to check in CI.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically all good, just a couple of questions for you (one of which escaped as a standalone comment)

fs.touch(fs_join(subdir, "subfile2"))
fs.touch(fs_join(nesteddir, "nestedfile"))
return source
source = self._scenario_cp(fs, fs_join, fs_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little surprised here, because it looks like we need cp to be working correctly before we get to running the tests. Wasn't touch() more explicit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking further, I see you just moved the touch() calls. But isn't _scenario_cp misnamed, then? In general, for any of these that a downstream developer is expected to implement (if any), it would be worthwhile describing all the arguments of all the methods better. Perhaps general documentation in https://filesystem-spec.readthedocs.io/en/latest/developer.html and/or https://filesystem-spec.readthedocs.io/en/latest/copying.html would help.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downstream developers won't need to re-implement the "scenario" functions, as long as they have implemented mkdir and touch these will work fine.

I had planned to add developer documentation for these new tests, it is the last but one item in my original comment of this PR. But I don't want downstream developers to be using them yet, not until I at least have both s3fs and gcsfs working using them. We are very nearly but not quite there yet.

I can think of two possible improvements for this PR.

  1. AbstractTestHarness is a mixture of functions that might need to be overwritten in downstream projects (e.g. fs_join) and those that will not (e.g. fs_scenario_cp). It is not clear which is in which category. So I could move the latter group into a new parent class of AbstractFixtures and just keep the overridable ones in AbstractFixtures. Both classes need docstrings too, to explain this split and highlight what functions may need to be overridden.

  2. As you say, names like fs_scenario_cp are not good! When we have many of the existing tests moved to the new structure I envisage a handful of "setup scenarios", one of which will be used in each test. The idea is to avoid each test having its own setup of directories and/or files as that gets unwieldy and instead have a minimal set of setup scenarios. At this stage there is just the one scenario which is the "setup scenario that is used for all cp, get and put tests" that I have shortened to "scenario_cp". But I can see that having "cp" in the name is confusing as it implies that copying occurs when the function is called. I am very happy to change the name, I just need a better one! It could be scenario_for_copying. It could just have an ID rather than a name although I am using IDs like 1a in the copying docs so numbers and letters could be confusing. scenario_alpha? I don't like that either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about bulk_operations_scenario, perhaps suffixed with 0 if there might eventually be more of them?

@ianthomas23
Copy link
Collaborator Author

I think s3fs test failure is unrelated as the other recent PR has the same error. CI passed OK 5 days ago. Differences in environment between passing and failing:

package passed failing
c-ares 1.19.0 1.19.1
coverage 7.2.5 7.2.6
grpcio 1.54.2 1.55.0
libgrpc 1.54.2 1.55.0
libprotobuf 3.21.12 4.23.1
moto 4.1.9 4.1.10
protobuf 4.21.12 4.23.1
requests 2.29.0 2.31.0
typing-extensions 4.5.0 4.6.0
websocket-client 1.5.1 1.5.2

@martindurant
Copy link
Member

Likely moto...

@ianthomas23
Copy link
Collaborator Author

Likely moto...

Yes, s3fs tests pass locally with moto==4.1.9 and fail with moto==4.1.10. We are not receiving a "BucketKeyEnabled" key when querying head_object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants