
support creating partial mirror (Stratum 1) of CernVM-FS repo #3554

Open
boegel opened this issue Mar 26, 2024 · 5 comments

Comments

@boegel
Contributor

boegel commented Mar 26, 2024

It would be nice if CernVM-FS could provide support for creating a Stratum 1 mirror for only parts of a CernVM-FS repository.
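
For context, a Stratum 1 today always replicates the full repository; a minimal sketch of the current workflow (the hostname, repository name, and key path below are placeholders):

    # On the Stratum 1: register a replica of the whole repository
    # (URL, repository name, and key location are illustrative placeholders)
    cvmfs_server add-replica -o $(whoami) \
        http://stratum0.example.org/cvmfs/software.example.org \
        /etc/cvmfs/keys/example.org

    # Pull all published content; there is currently no option to restrict
    # the snapshot to a subtree of the repository
    cvmfs_server snapshot software.example.org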

This is already supported by the shrinkwrap utility (see https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#creating-an-image-for-root); it would be nice to also support this for Stratum 1 servers.
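
For reference, a rough sketch of how such a partial export is driven with shrinkwrap today (the repository name and paths below are illustrative, not the actual repository layout):

    # Create a specification that selects only the subtree of interest;
    # a trailing /* includes a directory tree recursively
    printf '/software/linux/x86_64/*\n' > software.example.org.spec

    # Export just that subtree into a local directory tree
    cvmfs_shrinkwrap --repo software.example.org \
        --src-config software.example.org.config \
        --spec-file software.example.org.spec \
        --dest-base /exports/cvmfs -j 16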

@rptaylor
Contributor

If a client tried to access a file that did not exist, it would get a 404 and then, I believe, fail over to a different server altogether, making the partial server no longer useful. Or is the idea to prevent clients from accessing parts of a repo at all?
It seems like this would need some link between users (what files do you want to access?) and administrators (what data chunks are we going to replicate?), but admins can't always predict or control what users will do.

Wouldn't a proxy server or a writeable (non-preloaded) alien cache be a better solution? With those you can safely empty the cache or clean up files at will without risk of causing problems.
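
For illustration, a minimal sketch of a writeable alien cache on a cluster filesystem (the /lustre path below is a placeholder), set in the client configuration:

    # /etc/cvmfs/default.local -- writeable alien cache on a shared filesystem
    CVMFS_ALIEN_CACHE=/lustre/shared/cvmfs-cache
    # the alien cache is managed externally, so the built-in quota and the
    # shared local cache are disabled
    CVMFS_QUOTA_LIMIT=-1
    CVMFS_SHARED_CACHE=no

The cache directory can then be emptied or cleaned up out-of-band when space is needed, without breaking clients.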

@DrDaveD
Contributor

DrDaveD commented Apr 11, 2024

I wonder if @boegel is expecting that partially replicating a repo would also prevent the non-replicated files from showing up in a directory listing to the user. I don't think that's a possibility, since the top-level catalog (and therefore all nested catalogs) has to match the publish-time hash.

@boegel
Contributor Author

boegel commented Apr 11, 2024

The intention here is not to prevent users from accessing particular parts of the repository, but to only have an in-network copy of the parts of the repository that are actually relevant for that particular site.
For example, if a site only has Intel and AMD CPUs, the aarch64/ subdirectory of the repository is totally irrelevant. If users do try to access that part, it's fine if they're hit with a delay (or even a 404).

@rptaylor
Contributor

But the delay/404 would result in the local s1 no longer being used at all.

Is the goal to save a little bit of storage space, or are people objecting to replicating data they don't need?
An alien cache is IMO a suitable alternative to a local s1 in terms of ensuring local access in case of an external network outage; the data just lives on a cluster filesystem instead of requiring an httpd server. With a writeable alien cache you can be certain that you only store data for files that are used locally. With a tiered cache, you can also still use a local disk cache in addition to the alien cache.
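
As a rough sketch of that tiered setup (the cache instance names and paths below are made up):

    # /etc/cvmfs/default.local -- node-local cache in front of a writeable alien cache
    CVMFS_CACHE_PRIMARY=tier

    # "tier" combines a fast local upper layer with the alien lower layer
    CVMFS_CACHE_tier_TYPE=tiered
    CVMFS_CACHE_tier_UPPER=local
    CVMFS_CACHE_tier_LOWER=cluster

    # node-local disk cache with a quota (in MB)
    CVMFS_CACHE_local_TYPE=posix
    CVMFS_CACHE_local_BASE=/var/cache/cvmfs
    CVMFS_CACHE_local_SHARED=yes
    CVMFS_CACHE_local_QUOTA_LIMIT=20000

    # alien cache on the cluster filesystem, managed externally, no built-in quota
    CVMFS_CACHE_cluster_TYPE=posix
    CVMFS_CACHE_cluster_ALIEN=/lustre/shared/cvmfs-cache
    CVMFS_CACHE_cluster_SHARED=no
    CVMFS_CACHE_cluster_QUOTA_LIMIT=-1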

@HereThereBeDragons
Contributor

A small note that disk storage shouldn't be such a big problem: the public repos at CERN combined hold around 650 TB, but on disk they only take around 70 TB. That's nearly a 10x reduction. I would argue that for big HPC sites it should not really be a problem to have a couple of TB for the EESSI repo, and a huge part of that total is the container image repo unpacked.cern.ch.

I would agree with @rptaylor and @DrDaveD that an alien cache would be the better choice if storage space is really a problem.

If EESSI really grows so much that it becomes a problem, maybe it makes sense to split EESSI based on architecture? You could have a main repo with symlinks to the different architecture repos (software-aarch64.eessi.io, software-x86_64.eessi.io, ...), and a site that only wants one specific architecture would then just replicate that repository.
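
A rough sketch of that layout (the per-architecture repository names are hypothetical and the paths are illustrative):

    # On the Stratum 0 of the top-level repo: publish symlinks that point
    # into the per-architecture repositories
    cvmfs_server transaction software.eessi.io
    ln -s /cvmfs/software-aarch64.eessi.io/versions /cvmfs/software.eessi.io/aarch64
    ln -s /cvmfs/software-x86_64.eessi.io/versions  /cvmfs/software.eessi.io/x86_64
    cvmfs_server publish software.eessi.io

    # A site that only needs x86_64 then replicates just that repository
    cvmfs_server add-replica -o $(whoami) \
        http://stratum0.example.org/cvmfs/software-x86_64.eessi.io \
        /etc/cvmfs/keys/eessi.io
    cvmfs_server snapshot software-x86_64.eessi.io

Clients would see the same directory tree through the symlinks, while each underlying repository could be replicated independently.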
