
Out of IDs after reading many files #63

Open
jonathanasdf opened this issue Apr 27, 2016 · 5 comments

Comments

jonathanasdf commented Apr 27, 2016

I'm using many data files in HDF5 format to train a neural network. After running for many epochs over a few hours, it crashes with this error:

HDF5-DIAG: Error detected in HDF5 (1.8.16) thread 140335388788608:
  #000: H5F.c line 608 in H5Fopen(): unable to atomize file handle
    major: Object atom
    minor: Unable to register new atom
  #001: H5I.c line 921 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group

It seems to be a known(?) bug that exists in both 1.8.14 and 1.8.16:
https://stackoverflow.com/questions/35522633/hdf5-consumes-all-resource-ids-for-dataspaces-and-exits-c-api

I can reproduce it with the script below if I just let it run for a while (to be precise, it fails after around 2^24 = 16777216 iterations):

require 'hdf5'
require 'xlua'

local N = 20000000
local n = '/tmp/test.h5'

-- Write a small dataset once
local f = hdf5.open(n, 'w')
f:write('/data', torch.rand(1))
f:close()

-- Repeatedly open, read, and close the same file
for i=1,N do
  local f = hdf5.open(n, 'r')
  f:read('/data'):all()
  f:close()
  xlua.progress(i, N)
end
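For what it's worth, the linked StackOverflow thread suggests the 1.8-era allocator hands out monotonically increasing per-type IDs and does not return them to the pool on close, so any bounded counter eventually runs dry. Here is a toy Python model of that failure mode — this is an illustrative sketch only, not HDF5's actual implementation; `MAX_IDS` and the `IdRegistry` class are invented for the example:

```python
# Toy model of a per-type ID registry that never reuses IDs.
# NOT HDF5's real allocator: MAX_IDS and this class are made up
# to illustrate why an open/close loop can still exhaust IDs.
MAX_IDS = 2**24

class IdRegistry:
    def __init__(self):
        self.next_id = 0

    def register(self):
        # Mimics H5I_register failing once the counter is exhausted.
        if self.next_id >= MAX_IDS:
            raise RuntimeError("no IDs available in type")
        obj_id = self.next_id
        self.next_id += 1
        return obj_id

    def release(self, obj_id):
        # Closing the object does NOT return its ID to the pool,
        # so every open/read consumes the counter permanently.
        pass

registry = IdRegistry()
registry.register()           # fine early on...
registry.next_id = MAX_IDS    # fast-forward past 2**24 registrations
try:
    registry.register()
except RuntimeError as e:
    print(e)                  # "no IDs available in type"
```

Under this model, the fix is either to recycle released IDs (what a library patch would do) or to stop registering new ones in a hot loop (the user-side workaround).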

Any ideas? Should I just not use hdf5?

d11 (Contributor) commented May 24, 2016

Hmm, that's a pain. It's probably possible to use a similar workaround to the one described in the thread you linked. Did you find any way around it in the end?
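One workaround in that spirit is simply to stop churning through file handles: open the file once outside the loop and reuse the handle. A sketch against the same torch-hdf5 API used in the repro above (untested; whether per-read IDs still accumulate with a long-lived handle would need checking):

```lua
require 'hdf5'

local N = 20000000

-- Open once, reuse the handle, close once at the end,
-- instead of opening and closing inside the loop.
local f = hdf5.open('/tmp/test.h5', 'r')
for i = 1, N do
  local data = f:read('/data'):all()
  -- ... use data ...
end
f:close()
```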

jonathanasdf (Author) commented

No, I moved over to https://github.com/jonathantompson/torchzlib instead and it has worked for me with no problems.

gulvarol commented Nov 2, 2016

Has anyone addressed this problem? I am using HDF5 1.8.17 built with --enable-threadsafe, loading many HDF5 files from multiple threads during network training. It never seems to release memory, and it crashes after a few hours with 'Too many open files', although I have checked many times that every opened file is closed.


HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 140023236646656:
  #000: H5F.c line 604 in H5Fopen(): unable to open file
    major: File accessibilty
    minor: Unable to open file
  #001: H5Fint.c line 992 in H5F_open(): unable to open file: time = Wed Nov  2 11:31:31 2016
, name = '00018.h5', tent_flags = 0
    major: File accessibilty
    minor: Unable to open file
  #002: H5FD.c line 993 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: H5FDsec2.c line 339 in H5FD_sec2_open(): unable to open file: name = '00018.h5', errno = 24, error message = 'Too many open files', flags = 0, o_flags = 0
    major: File accessibilty
    minor: Unable to open file

The same code doesn't crash with HDF5 1.8.12, but it leaks memory in the same way.
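With errno 24 it can help to confirm whether descriptors are really accumulating before blaming the library. A quick check on Linux (assuming a bash-like shell; `$$` is the shell's own PID — substitute the PID of the training process):

```shell
# Show the per-process open-file limit
ulimit -n

# Count file descriptors currently held by a process
# ($$ = this shell's PID; replace with the training job's PID)
ls /proc/$$/fd | wc -l
```

If the second number climbs steadily across epochs, descriptors are leaking; if it stays flat while the crash still happens, the exhaustion is happening inside the library's ID bookkeeping rather than at the OS level.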

anibali (Contributor) commented Dec 11, 2016

I'm having what appears to be a similar issue, but with an earlier version of HDF5:

HDF5-DIAG: Error detected in HDF5 (1.8.11) thread 139866049939328:
  #000: ../../../src/H5D.c line 445 in H5Dget_space(): unable to register data space
    major: Object atom
    minor: Unable to register new atom
  #001: ../../../src/H5I.c line 951 in H5I_register(): no IDs available in type
    major: Object atom
    minor: Out of IDs for group

This happens after many iterations of open/read/close on an HDF5 file. Usually the program just hangs forever; I feel like I was "lucky" to even see the error message.

anibali (Contributor) commented Dec 12, 2016

I created a fork of torch-hdf5 which works with HDF5 1.10 (https://github.com/anibali/torch-hdf5/tree/hdf5-1.10), installed HDF5 1.10 with 1.8 API compatibility, and reran the sample program provided by the OP. The program now finishes successfully, whereas before it did not. So either a) the issue is properly fixed in newer versions of HDF5, or b) the new 64-bit IDs in 1.10 merely increase the number of available IDs, and they will still eventually run out given enough time.


4 participants