Memory leak in CachingFileSystem (possibly caused by pickle.load ?) #825

Closed

fabito opened this issue Nov 17, 2021 · 3 comments · Fixed by #1353
fabito commented Nov 17, 2021

I've been using fsspec's caching support:

import fsspec

urlpath = 'gs://mybucket/image.jpg'
fsspec.open(f'filecache::{urlpath}', filecache={'cache_storage': '/tmp/files', 'expiry_time': 604800})

I noticed that after opening many files (10k+), my application's memory consumption goes through the roof; only a restart releases the memory.
After some profiling with tracemalloc, these are the two top consumers:

#1: /usr/local/lib/python3.8/site-packages/fsspec/implementations/cached.py:156: 1443275.5 KiB
cached_files = pickle.load(f)
#2: /usr/local/lib/python3.8/site-packages/fsspec/implementations/cached.py:137: 700547.5 KiB
loaded_cached_files = pickle.load(f)
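
For reference, here is a minimal sketch of how such a snapshot can be taken with tracemalloc (the workload loop and file names below are illustrative):

import tracemalloc

import fsspec

tracemalloc.start()  # begin tracing Python memory allocations

# Illustrative workload: open many files through the filecache layer.
for i in range(10_000):
    with fsspec.open(f"filecache::gs://mybucket/image{i}.jpg",
                     filecache={"cache_storage": "/tmp/files"}) as f:
        f.read()

# Group allocations by source line and print the top consumers.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:2]:
    print(stat)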

If filecache is not used, memory consumption stays normal.
I don't know much about the pickle module. Could it be causing the memory leak?

martindurant (Member) commented

I wonder, how big is the JSON file after saving 10k+ files into it? I suspect that the caching system simply doesn't scale very well to these sizes. We have thought about using other storage, such as sqlite3 DB files or the filesystem itself ("sidecar files") — see the sketch below. I assume it takes a long time just to list the cache directory.
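
As an illustration of that idea, a hypothetical sketch of keeping per-file cache metadata in a sqlite3 database rather than one pickled dict (the schema, paths, and values are made up, not fsspec's actual layout):

import sqlite3
import time

# Hypothetical schema: one row per cached file, so a lookup never needs
# to deserialise the whole cache into memory.
con = sqlite3.connect("/tmp/files/cache.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS cache"
    " (path TEXT PRIMARY KEY, fn TEXT, time REAL)"
)
con.execute(
    "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
    ("gs://mybucket/image.jpg", "/tmp/files/abc123", time.time()),
)
con.commit()

# Look up a single entry; memory use stays flat regardless of cache size.
row = con.execute(
    "SELECT fn, time FROM cache WHERE path = ?",
    ("gs://mybucket/image.jpg",),
).fetchone()
print(row)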

fabito (Author) commented Nov 18, 2021

The cache metadata file is 424 KB on disk, for a cache holding 1378 files (400 MB of data):

root@trollito-7b8b87d679-jm8c5:/usr/src/app# ls /tmp/files | wc -l
1379

root@trollito-7b8b87d679-jm8c5:/usr/src/app# du -h /tmp/files/
400M	/tmp/files/

root@trollito-7b8b87d679-jm8c5:/usr/src/app# du -h /tmp/files/cache
424K	/tmp/files/cache

root@trollito-7b8b87d679-jm8c5:/usr/src/app# pmap 1 | tail -n 1 | awk '/[0-9]K/{print $2}'
23044924K

BTW, the cache file is not encoded as JSON; it is encoded with the pickle module.
Should we try JSON instead?

martindurant (Member) commented

Sorry, this dropped off my radar. Yes, it probably makes sense not to rely on pickle for anything that might persist long term and be read by multiple Python versions. I don't know whether that alone fixes the apparent memory issue. Would you like to make the appropriate PR? It should still be able to read an existing pickle file, so that we don't break anybody's workflow. A rough sketch of that follows below.
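
A rough sketch of what that might look like, assuming the metadata dict is JSON-serialisable (function names are placeholders, not fsspec's actual API): write new caches as JSON, but keep a pickle fallback on read so existing caches still load.

import json
import pickle

def save_cache(fn, cached_files):
    # Write the new format: plain JSON, stable across Python versions.
    with open(fn, "w") as f:
        json.dump(cached_files, f)

def load_cache(fn):
    with open(fn, "rb") as f:
        first = f.read(1)
        f.seek(0)
        if first == b"{":
            # Heuristic: JSON objects start with '{'; binary pickles
            # start with b'\x80'.
            return json.loads(f.read().decode())
        # Legacy pickle format, kept so existing caches still load.
        return pickle.load(f)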
