This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

NCSS keeping files open even though files are garbage collected #820

Open
cskarby opened this issue Apr 27, 2017 · 6 comments

@cskarby (Contributor) commented Apr 27, 2017

We are seeing an issue with TDS v4.6.6 where the server does not release files that have already been garbage collected. This is especially painful when the files are large aggregations/subsets generated via NCSS.

# lsof -a +L1 /metno/thredds-production
COMMAND   PID    USER   FD   TYPE DEVICE    SIZE/OFF NLINK NODE NAME
java    23019 tomcat  178u   REG  252,6 39318020096     0  265 /usr/local/tomcat/content/thredds/cache/ncss/62380965/meps25files_meps_allmembers_full_2_5km_latest.nc (deleted)
java    23019 tomcat  198u   REG  252,6    10702848     0  269 /usr/local/tomcat/content/thredds/cache/wcs/WCS3845005436098669417.nc (deleted)
java    23019 tomcat  237u   REG  252,6  5559324940     0  263 /usr/local/tomcat/content/thredds/cache/ncss/117193742/meps25files_meps_allmembers_full_2_5km_latest.nc (deleted)
java    23019 tomcat  579u   REG  252,6  2384409756     0  267 /usr/local/tomcat/content/thredds/cache/ncss/875641617/meps25files_meps_allmembers_full_2_5km_latest.nc (deleted)
java    23019 tomcat  631u   REG  252,6     7413760     0  268 /usr/local/tomcat/content/thredds/cache/ncss/1242462205/atmo_ec_atmo_0_1deg_20170427T000000Z_3h.nc (deleted)
java    23019 tomcat  641u   REG  252,6        8192     0  270 /usr/local/tomcat/content/thredds/cache/ncss/861972011/atmo_ec_atmo_0_1deg_20170427T000000Z_3h.nc (deleted)
@cwardgar (Contributor)

Can you give us more detail about the datasets that cause this problem? An MCVE is best, though I know that's tough to create for some issues with the TDS.

@lesserwhirls (Collaborator)

Greetings @cskarby,

Judging by the files shown by lsof, I think you can address this issue by using the properties in the <NetcdfSubsetService> element in threddsConfig.xml:

http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html#ncss

as well as the <WCS> element:

http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html#WCS
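
For reference, a rough sketch of what those two elements might look like in threddsConfig.xml (the scour/maxAge values here are only illustrative; see the pages linked above for the exact options and their defaults):

<NetcdfSubsetService>
  <scour>10 min</scour>
  <maxAge>60 min</maxAge>
</NetcdfSubsetService>

<WCS>
  <scour>15 min</scour>
  <maxAge>60 min</maxAge>
</WCS>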

It's also possible that what you are seeing is the TDS caching file handles (500 max by default), although I am not sure that applies to the NCSS/WCS disk caches. If tweaking the cache options for WCS and NCSS does not work, you can try tweaking the options in the <RandomAccessFile> element in threddsConfig.xml:

http://www.unidata.ucar.edu/software/thredds/current/tds/reference/ThreddsConfigXMLFile.html#RafCache

@cwardgar (Contributor) commented Apr 27, 2017

Okay, after a little more investigation, it seems clear that we are scouring files in the NCSS disk cache before the download is complete. We even have a comment in the code about this possibility.

So, setting a greater /threddsConfig/NetcdfSubsetService/maxAge value could help, especially with those massive files (36.6 GB!?). However, some smaller files were also scoured while still open, and I wonder what happened there. A failure to release resources somewhere?
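
For example, a larger value might look something like this (180 min is a made-up number; the point is to pick something comfortably longer than your slowest expected download):

<NetcdfSubsetService>
  <maxAge>180 min</maxAge>
</NetcdfSubsetService>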

We should probably keep track of which files in the cache are currently being read and avoid scouring them.

@cskarby I'm curious: are those files still, even now, shown as open in lsof? I imagine they no longer appear in the file system (the inodes have been deleted), correct?

@antibozo commented Jul 20, 2017

I also see this problem. Basically, it seems as if the cache design is poor.

Judging from the controls provided, it appears the theory is to allow cache files to be created with no limit on utilization and no check on available disk space on the underlying filesystem (e.g. via statfs). The check on unbounded growth is to periodically tot up utilization by enumerating the cache directory, then unlink files from that directory to bring utilization below a target. Correct me if I'm wrong here.

This is broken because totting up the directory doesn't tell you actual cache utilization. Deleted files are still using cache space, but, perversely, because you've unlinked them from the directory, you can't count them.

In my application, I have the cache configured to use 16 GiB out of a 120 GiB filesystem. Within a couple of minutes of starting the TDS, the entire 120 GiB filesystem is filled. At that point TDS has dozens of file descriptors open on uncompressed .nc files in the cache, each of which is 2-4 GiB in size. The scour process dutifully fires up eventually and removes all references to those files from the cache directory, which of course does nothing to free up any space since the TDS threads are still holding the open descriptors; it does, however, make it more difficult to figure out why a 120 GiB filesystem is at 100% utilization while du finds only 16 GiB in the cache directory.

And, once one of those threads actually lets go of the descriptors so some space is truly freed, TDS promptly uses that space up again; after all, it only sees 16 GiB in the cache area thanks to the "cleanup" the scour process has done. Thus it keeps the filesystem perpetually up against the wall.

If you are trying to manage space in a cache, you need to track references and, at some point, block on allocation. Asynchronous cleanup is going to have race conditions, especially if it is as naïve as this strategy appears to be.

@cwardgar (Contributor) commented Sep 7, 2017

@antibozo I think you're essentially correct. I still haven't had the time to REALLY dig in, but I suspect that one or more of our caches are interfering with each other. Specifically, DiskCache2 monitors the cache directory and tries to delete files when some threshold is reached. Meanwhile, FileCache maintains a list of open files and, critically, does not know that DiskCache2 has unlinked them from the directory. So those files remain open (consuming disk space) far longer than they're intended to.

So yeah, that's a bug that needs to be fixed. In the interim, you can try disabling NetcdfFileCache. In your threddsConfig.xml, do something like:

<NetcdfFileCache>
  <minFiles>0</minFiles>
  <maxFiles>0</maxFiles>
  <scour>0 min</scour>
</NetcdfFileCache>

Hopefully that does the trick. I'm not sure how much it'll impact performance, though.

@cwardgar (Contributor)

My coworker, @dopplershift, tried disabling NetcdfFileCache on a TDS instance in AWS that was experiencing the same problem. It didn't help.

It turns out there's a second cache that needs to be disabled: RandomAccessFile. Add the following to your threddsConfig.xml:

<RandomAccessFile>
  <minFiles>0</minFiles>
  <maxFiles>0</maxFiles>
  <scour>0 min</scour>
</RandomAccessFile>

That did the trick for us. I do not know whether both NetcdfFileCache and RandomAccessFile need to be disabled, or only the latter; in our test, we disabled both.
