Memory usage and parallel catalog load performance #3596

Open
mharvey-jt opened this issue May 13, 2024 · 0 comments

Summarising the outcome of performance investigations partially documented in #3593, #3592, and #3585:

  • Memory usage

The expectation is that the combination of:

mount -i -o remount /mnt/repo
cvmfs_talk -i repo detach nested catalogs
cvmfs_talk -i repo drop metadata caches

should consistently reduce the RSS to a low level. In practice, however, the floor RSS increases over time. The RSS can be obtained with grep Rss /proc/PID/smaps_rollup.
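
For reference, a minimal C++ sketch of reading the same figure programmatically (an illustrative helper, not part of CVMFS; it just reproduces the grep above):

// Print the Rss line from /proc/<pid>/smaps_rollup for a given PID.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <pid>\n";
    return 1;
  }
  std::ifstream rollup(std::string("/proc/") + argv[1] + "/smaps_rollup");
  std::string line;
  while (std::getline(rollup, line)) {
    if (line.rfind("Rss:", 0) == 0)  // e.g. "Rss:  123456 kB"
      std::cout << line << "\n";
  }
  return 0;
}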

Investigation reveals that this is a result of heap fragmentation: the glibc allocator's default behaviour is to use sbrk() in preference to mmap(), meaning that memory can only be returned to the OS if it's at the top of the heap. In contrast, mmap()ed allocations can be returned to the OS without that restriction.

This can be confirmed by setting M_MMAP_THRESHOLD to 0 (glibc reads this from the MALLOC_MMAP_THRESHOLD_ environment variable) before running cvmfs2, which forces glibc to use mmap() for all allocations, and then calling malloc_trim(0) periodically.
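
For illustration, a minimal sketch of the equivalent glibc calls (mallopt() and malloc_trim()); this stands in for the diagnostic described above and is not the actual cvmfs2 change:

#include <malloc.h>   // mallopt, malloc_trim (glibc extensions)
#include <cstdio>
#include <cstdlib>

int main() {
  // Equivalent to setting the mmap threshold before start-up: every
  // allocation is now served by mmap() rather than the sbrk() heap.
  mallopt(M_MMAP_THRESHOLD, 0);

  // ... allocation-heavy work, e.g. loading and detaching catalogs ...
  for (int i = 0; i < 100000; ++i) {
    void *p = malloc(4096);
    free(p);
  }

  // Periodically hand any free heap pages back to the OS.
  malloc_trim(0);

  printf("trimmed\n");
  return 0;
}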

A good production-quality solution to this problem is to use jemalloc (https://github.com/jemalloc/jemalloc), which is robust against heap fragmentation; it can be linked in at build time or injected via LD_PRELOAD.

  • Parallel catalog performance

While investigating the above, we identified some further modifications that significantly improve parallel catalog performance when a large glob is expanded in parallel (reference code below). In our test workload, the glob being expanded is 6 levels deep, with ~1M entries at the lowest level and a separate catalog for each directory:

./pglob.py "/repo/*/*/*/*/*/*"

Starting with 2.11.2, this expansion took 166 seconds with all catalogs already in the local CVMFS cache. We applied the following changes:

  • Disable CVMFS's LRU caches (CVMFS_MEMCACHE_SIZE=0, plus lock elision so that cache operations become no-ops). This avoids lock contention on the caches and does not appear to significantly reduce performance.
  • Disable CVMFS's custom SQLite allocator in sqlitemem.cc, which is not thread-safe.
  • Set the SQLite runtime option SQLITE_CONFIG_MEMSTATUS to 0, which removes some serialisation at the expense of losing SQLite memory-use metrics (which we do not find useful anyway).
  • Remove the -DSQLITE_ENABLE_MEMORY_MANAGEMENT compile flag from the SQLite Makefile.
  • Set the SQLite runtime option SQLITE_CONFIG_MMAP_SIZE to 1073741824 so that SQLite uses mmap() for all file I/O.

These five options together reduce the runtime of our parallel glob test to ~60 seconds, a 2.8x improvement. (The fifth option makes only a very minor contribution.)
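
For reference, a minimal sketch of the two SQLite runtime configuration calls from the list above (the third and fifth items), issued before SQLite is initialised; this is illustrative only, not the actual CVMFS patch:

#include <sqlite3.h>
#include <cstdio>

int main() {
  // Disable memory-usage statistics, removing the mutex that serialises
  // allocations made through SQLite's allocator.
  sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0);

  // Set the default and maximum mmap size to 1 GiB so that database
  // files are read via mmap() rather than read().
  sqlite3_config(SQLITE_CONFIG_MMAP_SIZE,
                 (sqlite3_int64)1073741824,    // default mmap_size per database
                 (sqlite3_int64)1073741824);   // upper bound for PRAGMA mmap_size

  sqlite3_initialize();

  sqlite3 *db = nullptr;
  if (sqlite3_open(":memory:", &db) == SQLITE_OK) {
    printf("sqlite configured\n");
    sqlite3_close(db);
  }
  return 0;
}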

#!/usr/bin/env python3
# Expand a glob pattern one path component at a time, fanning each level
# out across a multiprocessing pool so that catalog loads happen in parallel.

import glob as g
import multiprocessing as mp
import os
import sys
import time


def glob(pattern, pool):
    ret = [""]

    pattern = pattern.split("/")
    if pattern[0] == "":
        pattern = pattern[1:]
        ret = ["/"]

    # Expand one level of the pattern at a time; the candidate paths for
    # each level are globbed in parallel by the worker pool.
    for p in pattern:
        n = [os.path.join(r, p) for r in ret]
        if len(n) > 1:
            print(len(n))
        found = pool.map(g.glob, n)
        ret = []
        for f in found:
            ret.extend(f)

    return ret


if __name__ == "__main__":
    pool = mp.Pool()
    t1 = time.time()
    r = sorted(glob(sys.argv[1], pool))
    t2 = time.time() - t1
    print(f"DURATION {t2}")
