NCSS keeping files open even though files are garbage collected #820
Comments
Can you give us more detail about the datasets that cause this problem? An MCVE (minimal complete verifiable example) is best, though I know that's tough to create for some issues with the TDS.
Greetings @cskarby, Judging by the files shown, it's worth looking at the NCSS cache options (http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html#ncss) as well as the WCS cache options (http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html#WCS). It's also possible that what you are seeing is that the TDS caches file handles (max 500 by default), but I am not sure that applies to this cache. If tweaking the cache options for NCSS and WCS does not resolve the issue, let us know.
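To make the file-handle caching point concrete, here is a minimal sketch of how a size-bounded handle cache behaves; the class and method names are hypothetical, not the actual TDS code. The point is that handles stay open deliberately, for reuse, until eviction closes them:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a size-bounded file-handle cache (not the TDS code).
public class FileHandleCache {
    private static final int MAX_FILES = 500; // the TDS default mentioned above

    // An access-ordered LinkedHashMap gives LRU eviction for free.
    private final Map<String, RandomAccessFile> cache =
        new LinkedHashMap<String, RandomAccessFile>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, RandomAccessFile> eldest) {
                if (size() > MAX_FILES) {
                    try {
                        eldest.getValue().close(); // release the handle before evicting
                    } catch (IOException ignored) {
                    }
                    return true;
                }
                return false;
            }
        };

    // Reuse an open handle if we have one; otherwise open and cache it.
    public synchronized RandomAccessFile acquire(String path) throws IOException {
        RandomAccessFile raf = cache.get(path);
        if (raf == null) {
            raf = new RandomAccessFile(path, "r");
            cache.put(path, raf);
        }
        return raf;
    }
}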
Okay, after a little more investigation, it seems clear that we are scouring files in the NCSS disk cache before the download is complete. We even have a comment in the code about this possibility. So, setting a larger scour interval may reduce the problem, but it isn't a real fix: we should probably be keeping track of which files in the cache are being read and try not to scour them (see the sketch below). @cskarby I'm curious: are those files still, even now, being shown as open?
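Here is a minimal sketch of that idea, with hypothetical names (this is not the TDS code): reference-count readers so the periodic scour skips any file that is still being served:

import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the fix suggested above: track which cache files are
// currently being read so the scour pass can skip them instead of deleting
// files mid-download.
public class TrackedDiskCache {
    private final Map<File, Integer> readers = new ConcurrentHashMap<>();

    public void beginRead(File f) {
        readers.merge(f, 1, Integer::sum);  // increment the reference count
    }

    public void endRead(File f) {
        // Decrement; remove the entry entirely once the count reaches zero.
        readers.compute(f, (k, n) -> (n == null || n <= 1) ? null : n - 1);
    }

    // Called by the periodic scour: delete only old files nobody is reading.
    public void scour(File cacheDir, long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        File[] files = cacheDir.listFiles();
        if (files == null) return;
        for (File f : files) {
            if (!readers.containsKey(f) && f.lastModified() < cutoff) {
                f.delete();  // safe: no open readers
            }
        }
    }
}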
I also see this problem. Basically, it seems as if the cache design is poor.

Judging from the controls provided, it appears the theory is to allow cache files to be created with no limit on utilization, and no check on available disk space on the underlying filesystem (e.g. using statfs). The check on unbounded growth is to periodically tot up utilization by enumerating the cache directory, then unlink files from that directory to bring utilization below a target. Correct me if I'm wrong here.

This is broken because totting up the directory doesn't tell you actual cache utilization. Deleted files are still using disk space as long as they are held open, but, perversely, because you've unlinked them from the directory, you can't count them.

In my application, I have the cache configured to use 16 GiB out of a 120 GiB filesystem. Within a couple of minutes of starting the TDS, the entire 120 GiB filesystem is filled. At that point TDS has dozens of file descriptors open on uncompressed .nc files in the cache, each of which is 2-4 GiB in size. The scour process dutifully fires up eventually and removes all references to those files from the cache directory, which of course does nothing to free up any space since the TDS threads are still holding the open descriptors; it does, however, make it more difficult to figure out why a 120 GiB filesystem is at 100% utilization while du finds only 16 GiB in the cache directory. And once one of those threads actually lets go of the descriptors so some space is truly freed, TDS promptly uses that space up again; after all, it only sees 16 GiB in the cache area thanks to the "cleanup" the scour process has done. Thus it keeps the filesystem perpetually up against the wall.

If you are trying to manage space in a cache, you need to track references and, at some point, block on allocation (a sketch follows below). Asynchronous cleanup is going to have race conditions, especially if it is as naïve as this strategy appears to be.
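As an illustration of the allocation-side check described above, here is a minimal sketch with hypothetical names (not TDS code): ask the filesystem itself, the statfs analogue in Java, how much space is really free, and block the writer until it is:

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: consult the filesystem rather than summing directory
// entries, and block allocation until real space is available.
public class BoundedCacheAllocator {
    private final Path cacheDir;
    private final long reserveBytes; // headroom to leave on the filesystem

    public BoundedCacheAllocator(Path cacheDir, long reserveBytes) {
        this.cacheDir = cacheDir;
        this.reserveBytes = reserveBytes;
    }

    // Blocks until the underlying filesystem truly has room for `needed`
    // bytes. Unlinked-but-open files still count against usable space here,
    // because the kernel reports actual utilization, so this check cannot be
    // fooled the way directory enumeration can.
    public synchronized void reserve(long needed) throws IOException, InterruptedException {
        FileStore store = Files.getFileStore(cacheDir);
        while (store.getUsableSpace() < needed + reserveBytes) {
            wait(5_000); // re-check once the scour or a reader frees space
        }
    }

    public synchronized void spaceFreed() {
        notifyAll(); // wake writers waiting in reserve()
    }
}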
@antibozo I think you're essentially correct. I still haven't had the time to REALLY dig in, but I suspect that one or more of our caches are interfering with each other; specifically, a file-handle cache may be holding files open after the disk-cache scour has unlinked them. So yeah, that's a bug that needs to be fixed. In the interim, you can try disabling NetcdfFileCache:

<NetcdfFileCache>
  <minFiles>0</minFiles>
  <maxFiles>0</maxFiles>
  <scour>0 min</scour>
</NetcdfFileCache>

Hopefully that does the trick. I'm not sure how much it'll impact performance, though.
My coworker, @dopplershift, tried disabling NetcdfFileCache as suggested above, but that alone was not enough. It turns out there's a second cache that needs to be disabled:

<RandomAccessFile>
  <minFiles>0</minFiles>
  <maxFiles>0</maxFiles>
  <scour>0 min</scour>
</RandomAccessFile>

That did the trick for us. I do not know if both caches need to be disabled, or whether disabling RandomAccessFile alone is sufficient.
We have an issue with TDS v4.6.6 where the TDS does not release files that are garbage collected. This is especially painful when the files are large aggregations/subsets generated via NCSS.