
[Feature] Use a registry file for test data instead of hard-coding paths in conftest.py #439

Open
alessandrofelder opened this issue Jun 6, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@alessandrofelder
Member

alessandrofelder commented Jun 6, 2024

Is your feature request related to a problem? Please describe.
Currently, we use a pooch registry for test data. This works well, but we plan to add some zip files containing lots of (compressed) subfolders. These are opaque (you need to download them before you can see what's in them), but they will be needed in https://github.com/matham/brainglobe-utils/pull/2/files. We also hard-code the relative paths to the test data on GIN in conftest.py, which isn't great.

Describe the solution you'd like
(Credit to @matham who suggested most, if not all of this 🙏 )

  • we store all data uncompressed on GIN, which favours re-use and makes it transparent what data is available.
  • we use pooch to create a registry file of paths and hashes
  • we provide a utility function that parses the registry rows belonging to files in a given subfolder (unless pooch has a nicer way of doing this) and fetches all files in that subfolder.
  • we use the hash of the registry file (instead of conftest.py) as the GitHub Actions cache key.

This will allow:

  • test data to be re-used across BrainGlobe repos
  • data to be cached easily on CI and locally, with the cache kept up-to-date
  • only the files that are actually needed to be downloaded, exactly once
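As a sketch of the registry idea (the file contents and helper names below are made up for illustration; the real registry would be generated with `pooch.make_registry`): each registry line maps a relative path to a hash, and hashing the registry file itself gives a stable CI cache key that only changes when the data changes.

```python
import hashlib

# Hypothetical registry content; the real file would be generated with
# pooch.make_registry("test_data", "pooch_registry.txt").
REGISTRY_TEXT = """\
cellfinder/my_images/img_0.tif 0a1b2c3d
cellfinder/my_images/img_1.tif 4e5f6a7b
brainreg/atlas/annotation.tiff 8c9d0e1f
"""


def parse_registry(text: str) -> dict[str, str]:
    """Map each relative file path to its hash; one 'path hash' pair per line."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            path, file_hash = line.rsplit(maxsplit=1)
            entries[path] = file_hash
    return entries


def registry_cache_key(text: str) -> str:
    """Hash the registry file itself: changes only when the data changes."""
    return hashlib.sha256(text.encode()).hexdigest()


entries = parse_registry(REGISTRY_TEXT)
print(len(entries))  # number of registered files
```

Because the cache key is derived from the registry file rather than conftest.py, touching test code no longer invalidates the data cache.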

Describe alternatives you've considered


Additional context
We want to move all test data out of GitHub repositories. If this plan is successful, we can roll it out to other BrainGlobe repos.
Related to brainglobe/BrainGlobe#5

@alessandrofelder alessandrofelder added the enhancement New feature or request label Jun 6, 2024
@matham
Contributor

matham commented Jun 6, 2024

For reference, I used `python -c "import pooch; pooch.make_registry('cellfinder', 'registry.txt')"`, as described here, on my pytorch test images and it generated this file.

registry.txt

I looked in pooch and it seems you do have to download one file at a time. So for a folder we would need to parse the registry ourselves (or, more accurately, get it from the registry dict as below) and call fetch for each file.

This function should work (not tested):

```python
import pooch


def fetch_pooch_directory(registry: pooch.Pooch, directory_name: str,
                          processor=None, downloader=None, progressbar=False):
    """Fetch every registry file under `directory_name` and return its local path."""
    for name in registry.registry_files:
        if name.startswith(f"{directory_name}/"):
            registry.fetch(name, processor=processor,
                           downloader=downloader, progressbar=progressbar)
    return str(registry.abspath / directory_name)
```

And you'd use it in conftest.py, e.g.:

```python
import pooch
import pytest


@pytest.fixture
def test_data_registry():
    """
    Create a test data registry for BrainGlobe.

    Returns:
        pooch.Pooch: The test data registry object.
    """
    registry = pooch.create(
        path=pooch.os_cache("brainglobe"),  # fallback; overridden by the env var if set
        base_url="https://gin.g-node.org/BrainGlobe/...",
        env="BRAINGLOBE_TEST_DATA_DIR",
    )
    registry.load_registry(".../pooch_registry.txt")
    return registry


def test_something(test_data_registry):
    root = fetch_pooch_directory(test_data_registry, "cellfinder/my_images")
    assert root
```

The one thing is that each repo should have its own registry file, containing only the files used in that repo. So we don't e.g. need to list all the cellfinder data in the brainglobe-utils registry file, only the files used in brainglobe-utils, even though all the data may live in a single GIN repo.
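The per-repo filtering described above could be sketched like this (the file names and directory layout are made up for illustration): start from the full registry generated for the GIN repo and keep only the rows under the top-level directories a given repo actually uses.

```python
def filter_registry(full_registry: str, wanted_dirs: set[str]) -> str:
    """Keep only registry rows whose path starts with one of the wanted
    top-level directories, i.e. the data used by a single repo."""
    kept = []
    for line in full_registry.splitlines():
        if not line.strip():
            continue
        path = line.split(maxsplit=1)[0]
        if path.split("/", 1)[0] in wanted_dirs:
            kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical full registry for the shared GIN repo:
full = (
    "cellfinder/my_images/img_0.tif 0a1b\n"
    "cellfinder/models/model.pt 2c3d\n"
    "brainglobe-utils/misc/points.csv 4e5f\n"
)
# A brainglobe-utils registry would list only the brainglobe-utils data:
print(filter_registry(full, {"brainglobe-utils"}))
```

Each repo would check its filtered registry into version control, so its CI cache key depends only on the data it actually fetches.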

@alessandrofelder alessandrofelder transferred this issue from brainglobe/brainglobe-utils Jun 10, 2024
@adamltyson
Member

> we store all data uncompressed on GIN, which favours re-use and transparency of what data is available.

Can the individual files themselves not be compressed? I'm cautious of our workflows not grinding to a halt with slow network connections.

@matham
Contributor

matham commented Jun 11, 2024

I do compress the TIFFs (tifffile supports compression options), and we could zip individual non-image files. However, for large files it ultimately won't help that much, I think.

But on CI we should always cache the data, keyed on the registry file. Then the download only takes a while the first time after a change to the registry, for a new Python version, or when running locally for the first time (GIN seems to be very slow, actually). After that, GitHub will just restore it from the cache.
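For context, caching keyed on the registry file might look something like this in a GitHub Actions workflow (the cache path and registry filename are assumptions; `hashFiles` is the standard way to derive a cache key from file contents):

```yaml
- name: Cache test data
  uses: actions/cache@v4
  with:
    # Assumed local cache directory; would match BRAINGLOBE_TEST_DATA_DIR
    path: ~/.brainglobe-test-data
    # Key changes only when the registry file (i.e. the data) changes
    key: test-data-${{ hashFiles('tests/pooch_registry.txt') }}
```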

@matham
Contributor

matham commented Jun 19, 2024

Following up from #440.

Also, getting the test data from g-node is quite slow, surprisingly so given it's supposed to host datasets. I assumed it would get faster a while after upload, once it had a chance to cache or whatever, and especially on GitHub, but I think it's just as slow as locally. There's no clear indication how long the download takes because pytest eats all the logs, but on my computer it was well under 1 MB/s (or 1 Mbit/s?). If the tests themselves weren't so slow it wouldn't be an issue, thanks to the cache, but it's still a bit concerning!

I wasn't sure if you were asking for suggestions on hosting. The obvious ones, such as AWS et al., come to mind. Maybe your university already has a preferred storage solution that it sponsors, but that may not be free and may not outlive the lab.

Perhaps we could use some cloud provider on CI, with a secret key, to speed up CI, and fall back to g-node elsewhere when someone runs the tests locally. And the same for getting the trained model files.

GitHub can host files within these repo limits, but I don't think git is the best place to manage test data. A hack I did ☹️ was to host test artifacts in the first GitHub release of the project, but that's not ideal for a large project.

@adamltyson
Member

I'm very hesitant to move storage platforms, for all the reasons you mention: cost, longevity, etc. GIN is quite slow, but it works well for our other projects once the data is cached.

Is it possible to reduce the size of the test data in some way?
