
squashfuse Performance #665

Closed
hollowec opened this issue Aug 29, 2022 · 12 comments · Fixed by #673

@hollowec
Contributor

Version of Apptainer

$ apptainer --version
apptainer version 1.1.0~rc.2-1.el7

Expected behavior

When using SIF images with unprivileged Apptainer, execution time should be similar to unprivileged Singularity.

Actual behavior

Apptainer's move to squashfuse for unprivileged (user namespace) mounts of SIF images has significantly increased the execution time of some containers, compared to automatically unpacking SIF images to a temporary sandbox as unprivileged Singularity did. I believe this is primarily a concern for containers running multiple processes/threads, as it seems there is a single squashfuse process to handle all of the parallel I/O requests and decompression.

Steps to reproduce this behavior

apptainer run -i -c -e -B /tmp/atlasgen:/results -B /tmp docker://gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/atlas-gen-bmk:v2.1 -W --threads 1 --events 200
This is an ATLAS event generation benchmark container that will run a process per logical core on the host. Execution times on a system with 2x AMD EPYC 7351 CPUs (64 logical cores total):
Singularity with user namespaces (unpack to sandbox)
Execution time: ~24 min

Apptainer with setuid (squashfs privileged mount)
Execution time: ~25 min

Apptainer with user namespaces (squashfuse mount)
Execution time: ~2 hours 50 minutes

During execution, I see the squashfuse process using 100% of a single CPU core during most of the run.
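For example, one way to watch this during a run (a sketch; pidstat comes from the sysstat package, and this assumes a single squashfuse process, as observed here):

$ pidstat -p $(pgrep -x squashfuse) 5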

Ideally the default behavior would be to revert to automatically unpacking SIF images when used unprivileged.

What OS/distro are you running

Scientific Linux 7

How did you install Apptainer

RPM from EPEL testing repo.

@DrDaveD added this to the 1.1.0 milestone Aug 31, 2022
@DrDaveD
Contributor

DrDaveD commented Aug 31, 2022

Thanks so much for this report and the details on your benchmark!

Indeed I was able to reproduce the issue and run many of my own measurements using your benchmark. I have access to 16-core nodes that have both local disk and Lustre; they have dual 2.6 GHz Intel E5-2650v2 CPUs and 128 GB of RAM. Because I had 16 cores I ran my tests with 32 events instead of 200, and I always pre-converted the container to the format being tested. I don't have access to a setuid-root installation on the machine, so I couldn't measure using kernel squashfs (hopefully I can get a sysadmin to cooperate for a test later). I also included testing the image from cvmfs at

/cvmfs/unpacked.cern.ch/gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/atlas-sim-bmk:v2.1
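For anyone reproducing, the pre-conversions were along these lines (a sketch; the file names are illustrative):

$ apptainer build atlas.sif docker://gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/atlas-gen-bmk:v2.1
$ apptainer build --sandbox atlas.sandbox/ atlas.sif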

These are the timings I found in minutes and seconds:

sandbox on local disk:  6:23
sandbox on lustre:      6:45 (only one node, not parallel launches)
sandbox on cvmfs:       9:33 (warm cache)
ext3 image on lustre:  14:27
sif image on lustre:   41:11

Clearly the time for squashfuse in that last measurement is unacceptable. However, the very good news is that there is an existing squashfuse pull request that adds multithreading support to the squashfuse_ll command. I measured the following with squashfuse_ll (after removing -o uid=NN,gid=NN because that's not supported):

unpatched squashfuse_ll: 13:06
patched squashfuse_ll:    6:35

So that makes a huge difference and I plan to include the patched squashfuse_ll in apptainer packaging for now until the new feature is distributed.
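For reference, testing squashfuse_ll standalone against a SIF looks roughly like this (a sketch; it assumes a squashfuse build with offset support, which apptainer's image mounting relies on, and OFFSET is the squashfs partition offset reported by apptainer sif list):

$ apptainer sif list atlas.sif
$ mkdir /tmp/sqmnt
$ squashfuse_ll -o offset=OFFSET atlas.sif /tmp/sqmnt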

@hollowec
Contributor Author

Thanks @DrDaveD! Let me know when there is a new Apptainer 1.1.0rc EL7 RPM with the patched squashfuse_ll included, and I will be happy to test. However, I would be somewhat concerned about including a patched/development release of squashfuse in a production Apptainer release, as it may lead to stability or other issues. If you decide to proceed that way, could you also please include an Apptainer option to disable the use of squashfuse and revert to the old automatic temporary-sandbox behavior, for example as implemented in my PR #668?
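For illustration, assuming the --unsquash action flag carried over from Singularity (it converts the SIF to a temporary sandbox before running), the desired fallback would look like:

$ apptainer run --unsquash atlas.sif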

@DrDaveD
Contributor

DrDaveD commented Sep 1, 2022

Normally I would also be concerned with using an unreleased patch in production code, but this has such a huge impact on the user experience with default apptainer 1.1.0 that I'm willing to risk it and work on fixing any problems that are discovered.

@DrDaveD
Contributor

DrDaveD commented Sep 2, 2022

The fix is in #673; it would be great if you could compile it from source and run your benchmark on it. Follow the updated instructions in INSTALL.md for including the enhanced-performance squashfuse_ll in an rpm.
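For anyone else testing, a rough sketch of building the PR from source (the exact rpm steps are in INSTALL.md; the branch name here is arbitrary):

$ git clone https://github.com/apptainer/apptainer.git && cd apptainer
$ git fetch origin pull/673/head:pr-673 && git checkout pr-673
$ ./mconfig && make -C builddir && sudo make -C builddir install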

@DrDaveD
Contributor

DrDaveD commented Sep 2, 2022

Oh, I forgot: instead of compiling it yourself, you can download an rpm (for now, until it gets cleaned up) from this Fedora Koji scratch build.

@hollowec
Contributor Author

hollowec commented Sep 3, 2022

Thanks @DrDaveD. I've installed the RPM from Koji, verified it contains squashfuse_ll, and have started some tests.
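The check was along these lines (a sketch):

$ rpm -ql apptainer | grep squashfuse_ll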

@hollowec
Contributor Author

hollowec commented Sep 9, 2022

Just an update: I can confirm the patched/multithreaded squashfuse_ll performance is considerably better, and runtimes for the above container are on par with unpacked SIF. Thanks!

@DrDaveD
Contributor

DrDaveD commented Sep 16, 2022

I redid the measurements using the same benchmark on a single node (rather than spreading them across comparable nodes), this time including kernel squashfs with setuid; the rest were non-setuid, all with apptainer-1.1.0-rc.3. These are the results, each the average of two runs (no configuration's two runs differed by more than 1%):

kernel squashfs, sif on lustre: 6:33
multithreaded squashfuse_ll:    6:29
sandbox on local disk:          6:21
sandbox on lustre:              6:32 (only one node, not parallel launches)
sandbox on cvmfs:               6:50 (warm cache)
fuse2fs with ext3 image:       14:10 
standard squashfuse_ll:        12:48
standard squashfuse:           41:33

The first 4 are nearly identical, and cvmfs is not far behind.
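For reference, each timing was taken roughly like this (a sketch; the image name, bind paths, and event count are illustrative):

$ /usr/bin/time -f %E apptainer run -i -c -e -B /tmp/atlasgen:/results -B /tmp atlas.sif -W --threads 1 --events 32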

@DrDaveD
Contributor

DrDaveD commented Sep 20, 2022

@hollowec I also tried the cms-gen-sim-bmk with the same parameters and the differences are not as dramatic. I ran a subset of the tests one time each and got the following results:

kernel squashfs, sif on lustre: 18:00
multithreaded squashfuse_ll:    17:57
sandbox on local disk:          17:42
sandbox on lustre:              17:53
standard squashfuse_ll:         18:07
standard squashfuse:            27:37

So the most dramatic change was from standard squashfuse to standard squashfuse_ll. Even the multithreading patch didn't make that much difference.

My question for you is: are there other benchmarks that I should be trying? Or is atlas-gen-bmk the most stressful of the benchmarks on code storage?

@hollowec
Contributor Author

Hi @DrDaveD. lhcb-gen-sim-bmk:v2.1 (options --threads 1 and --events 5) was another container that appeared to be significantly affected by the squashfuse performance issue.
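The invocation follows the same pattern as the atlas one above (the registry path is assumed from that pattern):

$ apptainer run -i -c -e -B /tmp/lhcbgen:/results -B /tmp docker://gitlab-registry.cern.ch/hep-benchmarks/hep-workloads/lhcb-gen-sim-bmk:v2.1 -W --threads 1 --events 5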

@hollowec
Contributor Author

hollowec commented Sep 21, 2022

FYI I've run tests against the complete HEPscoreBeta benchmark set (https://gitlab.cern.ch/hep-benchmarks/hep-score - atlas-gen-bmk, cms-gen-sim-bmk, and lhcb-gen-sim-bmk are part of this set), and since the introduction of the patched squashfuse_ll binary in the 1.1.0-rc.3 release, runtimes are very similar to those with temporary unpacked sandboxes.

@DrDaveD
Contributor

DrDaveD commented Sep 21, 2022

Thanks for that additional info. I ran lhcb-gen-sim-bmk:v2.1 -W --threads 1 --events 2 and got a bigger spread of results than with cms but not as big as with atlas:

kernel squashfs, sif on lustre: 13:20
multithreaded squashfuse_ll:    13:24
sandbox on local disk:          13:09
standard squashfuse_ll:         14:50
standard squashfuse:            26:27

The multithreaded squashfuse_ll does have a clear advantage over standard squashfuse_ll with this one.
