
support creating partial mirror (Stratum 1) of CernVM-FS repo #3554

Open
boegel opened this issue Mar 26, 2024 · 5 comments

Comments

@boegel
Contributor

boegel commented Mar 26, 2024

It would be nice if CernVM-FS could provide support for creating a Stratum 1 mirror for only parts of a CernVM-FS repository.
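
For context, a Stratum 1 today always replicates the full repository; a minimal sketch of the current workflow (the hostname, repository name, and key path below are placeholders):

    # On the Stratum 1: register a replica of the whole repository
    # (URL, repository name, and key location are illustrative placeholders)
    cvmfs_server add-replica -o $(whoami) \
        http://stratum0.example.org/cvmfs/software.example.org \
        /etc/cvmfs/keys/example.org

    # Pull all published content; there is currently no option to restrict
    # the snapshot to a subtree of the repository
    cvmfs_server snapshot software.example.org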

This is already supported by the shrinkwrap utility (see https://cvmfs.readthedocs.io/en/stable/cpt-shrinkwrap.html#creating-an-image-for-root); it would be nice to also support this for Stratum 1 servers.
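
For reference, a rough sketch of how such a partial export is driven with shrinkwrap today (the repository name and paths below are illustrative, not the actual repository layout):

    # Create a specification that selects only the subtree of interest;
    # a trailing /* includes a directory tree recursively
    printf '/software/linux/x86_64/*\n' > software.example.org.spec

    # Export just that subtree into a local directory tree
    cvmfs_shrinkwrap --repo software.example.org \
        --src-config software.example.org.config \
        --spec-file software.example.org.spec \
        --dest-base /exports/cvmfs -j 16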

@rptaylor
Contributor

If a client tried to access a file that did not exist, it would get a 404 and then, I believe, fail over to a different server altogether, making the partial server no longer useful. Or is the idea to prevent clients from accessing parts of a repo at all?
It seems like this would need some link between users (what files do you want to access?) and administrators (what data chunks are we going to replicate?), but admins can't always predict or control what users will do.

Wouldn't a proxy server or a writeable (non-preloaded) alien cache be a better solution? With those you can safely empty the cache or clean up files at will without risk of causing problems.
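
For illustration, a minimal sketch of a writeable alien cache on a cluster filesystem (the /lustre path below is a placeholder), set in the client configuration:

    # /etc/cvmfs/default.local -- writeable alien cache on a shared filesystem
    CVMFS_ALIEN_CACHE=/lustre/shared/cvmfs-cache
    # the alien cache is managed externally, so the built-in quota and the
    # shared local cache are disabled
    CVMFS_QUOTA_LIMIT=-1
    CVMFS_SHARED_CACHE=no

The cache directory can then be emptied or cleaned up out-of-band when space is needed, without breaking clients.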

@DrDaveD
Contributor

DrDaveD commented Apr 11, 2024

I wonder if @boegel is expecting that partially replicating a repo would also prevent the non-replicated files from showing up in a directory listing to the user. I don't think that's a possibility, since the top-level catalog (and therefore all nested catalogs) has to match the publish-time hash.

@boegel
Contributor Author

boegel commented Apr 11, 2024

The intention here is not to prevent users from accessing particular parts of the repository, but to only have an in-network copy of the parts of the repository that are actually relevant for that particular site.
For example, if a site only has Intel and AMD CPUs, the aarch64/ subdirectory of the repository is totally irrelevant. If users do try to access that part, it's fine if they're hit with a delay (or even a 404).

@rptaylor
Contributor

But the delay/404 would result in the local s1 no longer being used at all.

Is the goal to save a little bit of storage space, or are people objecting to replicating data they don't need?
An alien cache is IMO a suitable alternative to a local s1 in terms of ensuring local access in case of an external network outage; the data just lives on a cluster filesystem instead of requiring an httpd server. With a writeable alien cache you can be certain that you only store data for files that are used locally. With a tiered cache, you can also still use a local disk cache in addition to the alien cache.
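
As a rough sketch of that tiered setup (the cache instance names and paths below are made up):

    # /etc/cvmfs/default.local -- node-local cache in front of a writeable alien cache
    CVMFS_CACHE_PRIMARY=tier

    # "tier" combines a fast local upper layer with the alien lower layer
    CVMFS_CACHE_tier_TYPE=tiered
    CVMFS_CACHE_tier_UPPER=local
    CVMFS_CACHE_tier_LOWER=cluster

    # node-local disk cache with a quota (in MB)
    CVMFS_CACHE_local_TYPE=posix
    CVMFS_CACHE_local_BASE=/var/cache/cvmfs
    CVMFS_CACHE_local_SHARED=yes
    CVMFS_CACHE_local_QUOTA_LIMIT=20000

    # alien cache on the cluster filesystem, managed externally, no built-in quota
    CVMFS_CACHE_cluster_TYPE=posix
    CVMFS_CACHE_cluster_ALIEN=/lustre/shared/cvmfs-cache
    CVMFS_CACHE_cluster_SHARED=no
    CVMFS_CACHE_cluster_QUOTA_LIMIT=-1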

@HereThereBeDragons
Contributor

A small note that disk storage shouldn't be such a big problem: the public repos at CERN combined hold around 650 TB, but on disk they only take around 70 TB. That's nearly a 10x reduction. I would argue that for big HPC sites it should not really be a problem to have a couple of TB for the EESSI repo, and a huge part of that total is the container image repo unpacked.cern.ch.

I would agree with @rptaylor and @DrDaveD that an alien cache would be the better choice if storage space is really a problem.

If EESSI really grows so much that it becomes a problem, maybe it makes sense to split EESSI based on architecture? You could have a main repo with symlinks to the different architecture repos (software-aarch64.eessi.io, software-x86_64.eessi.io, ...), and a site that only wants one specific architecture would then just replicate that repository.
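
A rough sketch of that layout (the per-architecture repository names are hypothetical and the paths are illustrative):

    # On the Stratum 0 of the top-level repo: publish symlinks that point
    # into the per-architecture repositories
    cvmfs_server transaction software.eessi.io
    ln -s /cvmfs/software-aarch64.eessi.io/versions /cvmfs/software.eessi.io/aarch64
    ln -s /cvmfs/software-x86_64.eessi.io/versions  /cvmfs/software.eessi.io/x86_64
    cvmfs_server publish software.eessi.io

    # A site that only needs x86_64 then replicates just that repository
    cvmfs_server add-replica -o $(whoami) \
        http://stratum0.example.org/cvmfs/software-x86_64.eessi.io \
        /etc/cvmfs/keys/eessi.io
    cvmfs_server snapshot software-x86_64.eessi.io

Clients would see the same directory tree through the symlinks, while each underlying repository could be replicated independently.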
