Summarising the outcome of performance investigations partially documented in #3593, #3592, and #3585:
Memory usage
The expectation is that the combination of:
mount -i -o remount /mnt/repo
cvmfs_talk -i repo detach nested catalogs
cvmfs_talk -i repo drop metadata caches
should consistently reduce the RSS to a low level. In fact, the floor RSS increases over time. The RSS can be obtained with grep Rss /proc/PID/smaps_rollup.
Investigation reveals that this is a result of heap fragmentation: the glibc allocator's default behaviour is to use sbrk() in preference to mmap(), meaning that memory can only be returned to the OS if it's at the top of the heap. In contrast, mmap()ed allocations can be returned to the OS without that restriction.
This can be confirmed by setting the environment variable M_MMAP_THRESHOLD=0 before running cvmfs2, which forces glibc to use mmap() for all allocations, and by calling malloc_trim(0) periodically.
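For reference, the same behaviour can be requested programmatically through glibc's malloc tuning interface. The following is a minimal C sketch, not the actual cvmfs2 code; the 60-second trim interval is an arbitrary illustration:

/* Minimal sketch: force glibc to serve every allocation via mmap() and
   periodically ask it to return unused heap memory to the OS.
   mallopt(M_MMAP_THRESHOLD, 0) and malloc_trim() are glibc-specific. */
#include <malloc.h>
#include <unistd.h>

int main(void) {
  mallopt(M_MMAP_THRESHOLD, 0);   /* same effect as exporting M_MMAP_THRESHOLD=0 */
  for (;;) {
    /* ... normal allocation activity ... */
    sleep(60);                    /* arbitrary interval, for illustration only */
    malloc_trim(0);               /* hand unused heap memory back to the OS */
  }
}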
A good production-quality solution to this problem is to use jemalloc (https://github.com/jemalloc/jemalloc), which is robust against heap fragmentation.
Parallel catalog performance
In investigating the above, we identified some further modifications that significantly improve parallel catalog performance when a large glob is expanded in parallel (reference code below). In our test workload, the glob being expanded is six levels deep, with ~1M entries at the lowest level and a separate catalog for each directory:
./pglob.py "/repo/*/*/*/*/*/*"
Starting from version 2.11.2, this expansion took 166 seconds with all catalogs already in the local CVMFS cache. We applied the following changes:
1. Disable CVMFS's LRU caches (CVMFS_MEMCACHE_SIZE=0 plus lock elision to make cache operations a no-op). This avoids lock contention on the caches and does not appear to significantly reduce performance.
2. Disable CVMFS's custom allocator in sqlitemem.cc, which is not thread-safe.
3. Set the SQLite runtime option SQLITE_CONFIG_MEMSTATUS=0, which removes some serialisation at the expense of losing SQLite memory-use metrics (which we don't find useful anyway).
4. Remove -DSQLITE_ENABLE_MEMORY_MANAGEMENT from the SQLite Makefile.
5. Set the SQLite runtime option SQLITE_CONFIG_MMAP_SIZE=1073741824 to make SQLite use mmap for all file I/O.
These five changes together reduce the runtime for our parallel glob test to ~60 seconds, roughly a 2.8x improvement. (The fifth option makes only a very minor improvement.)
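For illustration, options 3 and 5 correspond to sqlite3_config() calls made before the library is initialised. A minimal C sketch, assuming a single 1 GiB mmap limit as above and omitting error handling; the configure_sqlite helper is hypothetical:

/* Minimal sketch of the SQLite runtime configuration described above.
   Both calls must be made before sqlite3_initialize() or the first connection. */
#include <sqlite3.h>

static void configure_sqlite(void) {
  /* Drop memory-usage statistics and the serialisation that protects them */
  sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0);

  /* Default and maximum mmap size: route database file I/O through mmap (1 GiB) */
  sqlite3_config(SQLITE_CONFIG_MMAP_SIZE,
                 (sqlite3_int64)1073741824,
                 (sqlite3_int64)1073741824);

  sqlite3_initialize();
}

The pglob.py reference script used for the parallel glob measurements follows.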
#!/usr/bin/env python3
# Expand a glob pattern one path level at a time, fanning each level out
# over a multiprocessing pool (used to exercise parallel catalog lookups).
import glob as g
import multiprocessing as mp
import os
import sys
import time


def glob(pattern, pool):
    ret = [""]
    pattern = pattern.split("/")
    if pattern[0] == "":        # absolute pattern: expand from the filesystem root
        pattern = pattern[1:]
        ret = ["/"]
    for p in pattern:
        # Append the next path component to every match from the previous level
        n = []
        for r in ret:
            n.append(os.path.join(r, p))
        if len(n) > 1:
            print(len(n))
        # Expand all candidate patterns for this level in parallel
        found = pool.map(g.glob, n)
        ret = []
        for f in found:
            ret.extend(f)
    return ret


if __name__ == "__main__":
    pool = mp.Pool()
    t1 = time.time()
    r = sorted(glob(sys.argv[1], pool))
    t2 = time.time() - t1
    print(f"DURATION {t2}")