Memory usage and parallel catalog load performance #3596

Open
mharvey-jt opened this issue May 13, 2024 · 0 comments

Summarising the outcome of performance investigations partially documented in #3593, #3592, and #3585:

  • Memory usage

The expectation is that the combination of:

mount -i -o remount /mnt/repo
cvmfs_talk -i repo detach nested catalogs
cvmfs_talk -i repo drop metadata caches

should consistently reduce the RSS to a low level. In practice, however, the floor RSS increases over time. The RSS can be obtained with grep Rss /proc/PID/smaps_rollup.
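
For reference, a minimal C++ sketch of reading the same figure programmatically (an illustrative helper, not part of CVMFS; it just reproduces the grep above):

// Print the Rss line from /proc/<pid>/smaps_rollup for a given PID.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <pid>\n";
    return 1;
  }
  std::ifstream rollup(std::string("/proc/") + argv[1] + "/smaps_rollup");
  std::string line;
  while (std::getline(rollup, line)) {
    if (line.rfind("Rss:", 0) == 0)  // e.g. "Rss:  123456 kB"
      std::cout << line << "\n";
  }
  return 0;
}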

Investigation reveals that this is a result of heap fragmentation: the glibc allocator's default behaviour is to use sbrk() in preference to mmap(), meaning that memory can only be returned to the OS if it's at the top of the heap. In contrast, mmap()ed allocations can be returned to the OS without that restriction.

This can be confirmed by setting M_MMAP_THRESHOLD to 0 (glibc reads this from the MALLOC_MMAP_THRESHOLD_ environment variable) before running cvmfs2, which forces glibc to use mmap() for all allocations, and then calling malloc_trim(0) periodically.
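
For illustration, a minimal sketch of the equivalent glibc calls (mallopt() and malloc_trim()); this stands in for the diagnostic described above and is not the actual cvmfs2 change:

#include <malloc.h>   // mallopt, malloc_trim (glibc extensions)
#include <cstdio>
#include <cstdlib>

int main() {
  // Equivalent to setting the mmap threshold before start-up: every
  // allocation is now served by mmap() rather than the sbrk() heap.
  mallopt(M_MMAP_THRESHOLD, 0);

  // ... allocation-heavy work, e.g. loading and detaching catalogs ...
  for (int i = 0; i < 100000; ++i) {
    void *p = malloc(4096);
    free(p);
  }

  // Periodically hand any free heap pages back to the OS.
  malloc_trim(0);

  printf("trimmed\n");
  return 0;
}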

A good production-quality solution to this problem is to use jemalloc (https://github.com/jemalloc/jemalloc), which is robust against heap fragmentation; it can be linked in at build time or injected via LD_PRELOAD.

  • Parallel catalog performance

While investigating the above, we identified some further modifications that significantly improve parallel catalog performance when a large glob is expanded in parallel (reference code below). In our test workload, the glob being expanded is 6 levels deep, with ~1M entries at the lowest level and a separate catalog for each directory:

./pglob.py "/repo/*/*/*/*/*/*"

Starting with 2.11.2, this expansion took 166 seconds with all catalogs already in the local CVMFS cache. We applied the following changes:

  • Disable CVMFS's LRU caches (CVMFS_MEMCACHE_SIZE=0, plus lock elision so that cache operations become no-ops). This avoids lock contention on the caches and does not appear to significantly reduce performance.
  • Disable CVMFS's custom SQLite allocator in sqlitemem.cc, which is not thread-safe.
  • Set the SQLite runtime option SQLITE_CONFIG_MEMSTATUS to 0, which removes some serialisation at the expense of losing SQLite memory-use metrics (which we do not find useful anyway).
  • Remove the -DSQLITE_ENABLE_MEMORY_MANAGEMENT compile flag from the SQLite Makefile.
  • Set the SQLite runtime option SQLITE_CONFIG_MMAP_SIZE to 1073741824 so that SQLite uses mmap() for all file I/O.

These five options together reduce the runtime of our parallel glob test to ~60 seconds, a 2.8x improvement. (The fifth option makes only a very minor contribution.)
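
For reference, a minimal sketch of the two SQLite runtime configuration calls from the list above (the third and fifth items), issued before SQLite is initialised; this is illustrative only, not the actual CVMFS patch:

#include <sqlite3.h>
#include <cstdio>

int main() {
  // Disable memory-usage statistics, removing the mutex that serialises
  // allocations made through SQLite's allocator.
  sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0);

  // Set the default and maximum mmap size to 1 GiB so that database
  // files are read via mmap() rather than read().
  sqlite3_config(SQLITE_CONFIG_MMAP_SIZE,
                 (sqlite3_int64)1073741824,    // default mmap_size per database
                 (sqlite3_int64)1073741824);   // upper bound for PRAGMA mmap_size

  sqlite3_initialize();

  sqlite3 *db = nullptr;
  if (sqlite3_open(":memory:", &db) == SQLITE_OK) {
    printf("sqlite configured\n");
    sqlite3_close(db);
  }
  return 0;
}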

#!/usr/bin/env python3
# Expand a glob pattern one path component at a time, fanning each level
# out across a multiprocessing pool so that catalog loads happen in parallel.

import glob as g
import multiprocessing as mp
import os
import sys
import time


def glob(pattern, pool):
    ret = [""]

    pattern = pattern.split("/")
    if pattern[0] == "":
        pattern = pattern[1:]
        ret = ["/"]

    # Expand one level of the pattern at a time; the candidate paths for
    # each level are globbed in parallel by the worker pool.
    for p in pattern:
        n = [os.path.join(r, p) for r in ret]
        if len(n) > 1:
            print(len(n))
        found = pool.map(g.glob, n)
        ret = []
        for f in found:
            ret.extend(f)

    return ret


if __name__ == "__main__":
    pool = mp.Pool()
    t1 = time.time()
    r = sorted(glob(sys.argv[1], pool))
    t2 = time.time() - t1
    print(f"DURATION {t2}")
