-
Notifications
You must be signed in to change notification settings - Fork 409
Description
My use case:
I am writing a Zarr to a MemoryFileSystem
, and that Zarr contains about 100,000 files. Subsequently, I overwrite the original Zarr with a much smaller Zarr, and this takes about half an hour. I was really perplexed that such a seemingly small operation would require so much time.
Profiling the problem:
By profiling, I tracked it down to the rm(path, recursive=True)
operation triggered by xr.Dataset.to_zarr(mode='w')
.
We iterate over all the files to be deleted,
filesystem_spec/fsspec/implementations/memory.py
Lines 257 to 267 in 176efbe
for p in reversed(paths): | |
# If the expanded path doesn't exist, it is only because the expanded | |
# path was a directory that does not exist in self.pseudo_dirs. This | |
# is possible if you directly create files without making the | |
# directories first. | |
if not self.exists(p): | |
continue | |
if self.isfile(p): | |
self.rm_file(p) | |
else: | |
self.rmdir(p) |
but the problem is that info
and hence isfile
and exists
all have runtime
filesystem_spec/fsspec/spec.py
Lines 707 to 712 in 176efbe
def isfile(self, path): | |
"""Is this entry file-like?""" | |
try: | |
return self.info(path)["type"] == "file" | |
except: # noqa: E722 | |
return False |
filesystem_spec/fsspec/implementations/memory.py
Lines 149 to 169 in 176efbe
def info(self, path, **kwargs): | |
logger.debug("info: %s", path) | |
path = self._strip_protocol(path) | |
if path in self.pseudo_dirs or any( | |
p.startswith(path + "/") for p in list(self.store) + self.pseudo_dirs | |
): | |
return { | |
"name": path, | |
"size": 0, | |
"type": "directory", | |
} | |
elif path in self.store: | |
filelike = self.store[path] | |
return { | |
"name": path, | |
"size": filelike.size, | |
"type": "file", | |
"created": getattr(filelike, "created", None), | |
} | |
else: | |
raise FileNotFoundError(path) |
The problematic line is 153, where in order to rule out that the path is a directory, we iterate over each file to check whether the path a parent directory of any file.
Quantifying the effect
Let's collect some timing data.
Deleting a single file
from time import time
from fsspec.implementations.memory import MemoryFileSystem
fs = MemoryFileSystem("memory://store")
delete_all_files_time: dict[int, float] = {}
for num_files in [10 ** i for i in range(7)]:
print(f"num_files: {num_files}")
for i in range(num_files):
fs.touch(f"file{i}")
start = time()
fs.rm("file0")
end = time()
delete_all_files_time[num_files] = end - start
print(f"delete_single_file_time: {delete_all_files_time[num_files]}")
num_files: 1
delete_single_file_time: 3.361701965332031e-05
num_files: 10
delete_single_file_time: 2.47955322265625e-05
num_files: 100
delete_single_file_time: 5.435943603515625e-05
num_files: 1000
delete_single_file_time: 0.00043773651123046875
num_files: 10000
delete_single_file_time: 0.003747701644897461
num_files: 100000
delete_single_file_time: 0.03829622268676758
num_files: 1000000
delete_single_file_time: 0.42734217643737793
Log-log-plot so that a straight line corresponds to a power law:
Doing a linear regression (dropping 1 and 10 files), the log-log slope is 0.973 (very close to linear), and the log10 intercept is -6.26, so time in seconds is about
Deleting all files
from time import time
from fsspec.implementations.memory import MemoryFileSystem
fs = MemoryFileSystem("memory://store")
delete_all_files_time: dict[int, float] = {}
for num_files in [100 * 2 ** i for i in range(9)]:
print(f"num_files: {num_files}")
for i in range(num_files):
fs.touch(f"file{i}")
start = time()
fs.rm("/", recursive=True)
end = time()
delete_all_files_time[num_files] = end - start
print(f"delete_all_files_time: {delete_all_files_time[num_files]}")
num_files: 100
delete_all_files_time: 0.0032181739807128906
num_files: 200
delete_all_files_time: 0.010508298873901367
num_files: 400
delete_all_files_time: 0.034881591796875
num_files: 800
delete_all_files_time: 0.14718341827392578
num_files: 1600
delete_all_files_time: 0.4914524555206299
num_files: 3200
delete_all_files_time: 1.9803588390350342
num_files: 6400
delete_all_files_time: 7.475835800170898
num_files: 12800
delete_all_files_time: 29.763574361801147
num_files: 25600
delete_all_files_time: 119.04087710380554
Here we observe a painful 2 minutes to delete 25,600 files.
Regression gives the number of seconds to be about
Thoughts regarding a solution
Unfortunately I don't see any obvious simple solution that doesn't increase code complexity. After briefly tinkering with the test suite, it seems like there are some delicate edge cases for which some obvious simplifying tweaks fail.
EDIT: Found one in #1725!
By the way, is there some simple and safe way to nuke the contents of a MemoryFileSystem
?