Description
To report a non-security related issue, please provide:
- the version of the software with which you are encountering an issue
- environmental information (e.g. operating system, compiler info, Java version, Python version, etc.)
- a description of the issue with the steps needed to reproduce it
If you have a general question about the software, please view our Suggested Support Process.
Please consider me to be a novice when it comes to using NetCDF4 and all things related.
Versions (installed via Spack v0.23.1):
- compiler: gcc@11.2.0
- python@3.11.9
- netcdf-c@4.9.2
- py-netcdf4@1.7.1
- py-h5py@3.12.1
- py-mpi4py@4.0.1
- hdf5@1.14.5~cxx~fortran+hl~ipo~java~map+mpi+shared+subfiling~szip+threadsafe+tools
- openmpi@5.0.5
Both on:
- Ubuntu 22.04 - 6.6.87.2-microsoft-standard-WSL2
- Levante - 4.18.0-553.42.1.el8_10.x86_64
- Additionally verified by a member of DKRZ on a different environment (i.e. different software versions; details available on request)
Any file created via the netCDF4 Python API with parallel I/O enabled grows by exactly 50% in size (e.g. 10 GB → 15 GB, 20 GB → 30 GB, 30 GB → 45 GB, ...).
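For scale, the expected payload size follows directly from the dimension used in the reproducer below (10 * 134217728 doubles at 8 bytes each); plain arithmetic, no other assumptions:

```python
# Payload of the repro variable: 10 * 134217728 float64 values, 8 bytes each.
n = 10 * 134217728
payload = n * 8                   # bytes of raw data
print(payload / 2**30)            # 10.0 (GiB) -- consistent with the ~11G healthy file
print(payload * 1.5 / 2**30)      # 15.0 (GiB) -- consistent with the ~16G broken file
```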
The code provided here (`test.py`) can be used to reproduce the issue. Simply enabling MPI via the `netCDF4.Dataset(path, "w", format="NETCDF4", parallel=True)` option results in a file 50% larger than intended; setting the flag to `False` produces the expected file size. Neither `mpiexec`, `mpirun`, nor `-n N` needs to be supplied for the effect to show: running `python test.py` with the flag set to `True` is enough to reproduce the issue. The layout difference can be seen by inspecting the raw binary data with a tool such as binocle.
```python
from mpi4py import MPI
import netCDF4
import numpy as np


def create(path, form, dtype="f8", parallel=False):
    root = netCDF4.Dataset(path, "w", format="NETCDF4", parallel=parallel)  # type: ignore
    root.createGroup("/")
    used = 0
    for variable, element in form.items():
        shape = element[0]
        chunks = element[1]
        dimensions = []
        for size in shape:
            root.createDimension(f"{used}", size)
            dimensions.append(f"{used}")
            used += 1
        if len(chunks) != 0:
            x = root.createVariable(variable, dtype, dimensions, chunksizes=chunks)
        else:
            x = root.createVariable(variable, dtype, dimensions)
        if not parallel:
            print(len(np.random.random_sample(shape)))
            x[:] = np.random.random_sample(shape)
        else:
            rank = MPI.COMM_WORLD.rank  # type: ignore
            rsize = MPI.COMM_WORLD.size  # type: ignore
            total_size = shape[0]
            size = int(total_size / rsize)
            rstart = rank * size
            rend = rstart + size
            print(f"shape: {shape}, chunks: {chunks}, dimensions: {dimensions}, "
                  f"total chunksize: {total_size}, size per rank: {size}, rank: {rank}, "
                  f"rsize: {rsize}, rstart: {rstart}, rend: {rend}")
            print(len(np.random.random_sample(size)))
            x[rstart:rend] = np.random.random_sample(size)
            MPI.COMM_WORLD.Barrier()  # type: ignore
        print(f"var: {x}, ncattrs after fill: {x.ncattrs()}, as dict: {x.__dict__}")


def main():
    create(form={"X": [[10 * 134217728], []]}, path="test.nc", parallel=True)


if __name__ == "__main__":
    main()
```
This is an image obtained from the broken, 50% larger file. It is zoomed out very far, though at the very beginning one can see the header.

This is what the file should look like: much less empty space before the data.
Additional output obtained by aforementioned member of the DKRZ:
```
~/Git/Testprogramme/NetCDF/IO on master ● λ ncdump -h test_false.nc
netcdf test_false {
dimensions:
	\0 = 1342177280 ;
variables:
	double X(\0) ;
}
~/Git/Testprogramme/NetCDF/IO on master ● λ ncdump -h test_true.nc
netcdf test_true {
dimensions:
	\0 = 1342177280 ;
variables:
	double X(\0) ;
}
~/Git/Testprogramme/NetCDF/IO on master ● λ ls -lh test_*
-rw-r--r-- 1 user user 11G Sep 22 14:59 test_false.nc
-rw-r--r-- 1 user user 16G Sep 22 14:59 test_true.nc
~/Git/Testprogramme/NetCDF/IO on master ● λ du -shc test_*
11G	test_false.nc
11G	test_true.nc
21G	total
```
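Note the `ls` vs `du` discrepancy above: `ls -lh` reports the apparent size (16G for the parallel file) while `du` reports the blocks actually allocated on disk (11G for both), which suggests the extra 50% is allocated-but-unwritten space (holes). A minimal sketch of how to read both numbers from Python using only the stdlib; the sparse file created here is a stand-in for illustration, not the actual netCDF file, and on a filesystem without sparse-file support the two numbers may match:

```python
import os
import tempfile

# Create a sparse stand-in file: seek 1 MiB past the start and write one byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.seek(1024 * 1024 - 1)
    f.write(b"\0")

st = os.stat(path)
apparent = st.st_size          # what `ls -l` reports
on_disk = st.st_blocks * 512   # what `du` reports (st_blocks is in 512-byte units)
print(f"apparent: {apparent} bytes, on disk: {on_disk} bytes")
os.remove(path)
```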