[GEOS-8099] GSIP 158 - NetCDF output support for variable attributes and extra variables #2281

Merged
1 commit merged on May 6, 2017

Conversation


@bencaradocdavies (Contributor Author)

Note again that the build failure is expected until geotools/geotools#1564 is merged.

@dromagnoli (Contributor) left a comment


+1 on merging this.
Looks like a great contribution to preserve input information (and customize it) when writing back the data.

Do you have any additional detail on the test failure you get when specifying the simple NetCDF format instead of NetCDF-4? Will the new code work even with plain application/x-netcdf, so that it is only a testing issue?

@bencaradocdavies (Contributor Author)

I restarted the Travis CI build after geotools/geotools#1564 was merged and geotools-master ran on ares Jenkins to update the snapshots. The Travis CI build failed with a segfault in gs-netcdf-out tests:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.3:test (default-test) on project gs-netcdf-out: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.12.3:test failed: The forked VM terminated without saying properly goodbye. VM crash or System.exit called ? -> [Help 1]

Crash details:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f977e18635d, pid=33935, tid=140291330213632
#
# JRE version: Java(TM) SE Runtime Environment (8.0_31-b13) (build 1.8.0_31-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libnetcdf.so.6+0x4b35d]  nc4_pg_varm+0x3cd
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/travis/build/geoserver/geoserver/src/extension/netcdf-out/hs_err_pid33935.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)

@bencaradocdavies (Contributor Author) commented Apr 26, 2017

I am pointing the finger at the ancient libnetcdf6 on Ubuntu 12.04.5 LTS (precise pangolin).

@bencaradocdavies (Contributor Author)

The same JVM crash in gs-netcdf-out occurred in two out of two Jenkins CI builds.

@bencaradocdavies (Contributor Author)

@dromagnoli I have seen the format=application/x-netcdf exception elsewhere, not just in this test; it is not caused by the changes in this PR. The failure seems to occur when requesting the full multidimensional NetCDF output via WCS without any slicing. These requests work fine for NetCDF-4. I will make a proper report. If we can fix it, I can switch the test to format=application/x-netcdf and work around the unrelated JVM segfault. :-)
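For illustration, the requests in question look roughly like the following (the coverageId is a placeholder, and the application/x-netcdf4 format name for NetCDF-4 output is an assumption, not taken from this PR):

# full multidimensional output, no slicing, NetCDF-3: the case that fails
service=WCS&version=2.0.1&request=GetCoverage&coverageId=<coverageId>&format=application/x-netcdf

# same request with NetCDF-4 output: reported above to work fine
service=WCS&version=2.0.1&request=GetCoverage&coverageId=<coverageId>&format=application/x-netcdf4

Adding a slicing parameter such as subset=time("...") to the first request trims the output and, per the description above, avoids the failure.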

@bencaradocdavies (Contributor Author)

Segfault with NetCDF 4.1 fixed in cc6af2e.

I was able to reproduce the segfault on Debian sid by unpacking the jessie NetCDF-C 4.1 library libnetcdfc7_4.1.3-7.2_amd64.deb and its dependency libhdf5-8_1.8.13+docs-15+deb8u1_amd64.deb, renaming the libraries to libnetcdf.so, libhdf5_serial.so.8, and libhdf5_serial_hl.so.8, and placing them in a directory set as the value of an Eclipse run configuration environment variable LD_LIBRARY_PATH.

At line 2833 in Nc4Iosp:

ret = nc4.nc_put_vars_double(grpid, varid, origin, shape, stride, vald);

origin, shape, and stride are null. This results in a segfault:

Stack: [0x00007f475cc00000,0x00007f475cd01000],  sp=0x00007f475ccf8850,  free space=994k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libnetcdf.so+0x44990]  NCDEFAULT_put_varm+0x130

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.sun.jna.Native.invokeInt(JI[Ljava/lang/Object;)I+0
j  com.sun.jna.Function.invoke([Ljava/lang/Object;Ljava/lang/Class;Z)Ljava/lang/Object;+225
j  com.sun.jna.Function.invoke(Ljava/lang/Class;[Ljava/lang/Object;Ljava/util/Map;)Ljava/lang/Object;+262
j  com.sun.jna.Library$Handler.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object;+316
j  com.sun.proxy.$Proxy17.nc_put_vars_double(II[Lucar/nc2/jni/netcdf/SizeT;[Lucar/nc2/jni/netcdf/SizeT;[Lucar/nc2/jni/netcdf/SizeT;[D)I+46
j  ucar.nc2.jni.netcdf.Nc4Iosp.writeData(Lucar/nc2/Variable;IIILucar/ma2/Section;Lucar/ma2/Array;)V+375
j  ucar.nc2.jni.netcdf.Nc4Iosp.writeData(Lucar/nc2/Variable;Lucar/ma2/Section;Lucar/ma2/Array;)V+95
j  ucar.nc2.NetcdfFileWriter.write(Lucar/nc2/Variable;[ILucar/ma2/Array;)V+35
j  ucar.nc2.NetcdfFileWriter.write(Lucar/nc2/Variable;Lucar/ma2/Array;)V+30
j  org.geoserver.wcs.responses.NetCDFOutputManager.writeDataValues()V+377
j  org.geoserver.wcs.responses.NetCDFOutputManager.write()V+51

Inspecting variables indicates that data for forecast_reference_time was being written by NetcdfFileWriter.write(Variable, Array). Wait! What? This is written as a vector, and only scalar data should be written here. The logic error is that the iteration at this point should be over scalarExtraVariables, not extraVariables.

This does no harm with NetCDF-C 4.4, as it just writes the first element of the output vector with the value from the sample granule, which is overwritten when looping over all granules. NetCDF-C 4.1 cannot handle the null origin/shape/stride when the shape of the array and output variable do not match, and it segfaults.

In short, NetcdfFileWriter.write(Variable, Array) must not be used when the shape of the Variable and Array do not match because, if the platform provides NetCDF-C 4.1 native libraries, they will segfault and crash the JVM. Do Not Do That Then.
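For illustration, a minimal sketch of the defensive guard this rule suggests (the helper class and method below are mine, not the actual fix in cc6af2e, which instead corrects the iteration to use scalarExtraVariables): check that the Variable and Array shapes agree before delegating to NetcdfFileWriter.write(Variable, Array), so a mismatch fails fast in Java instead of reaching the native library.

import java.io.IOException;
import java.util.Arrays;

import ucar.ma2.Array;
import ucar.ma2.InvalidRangeException;
import ucar.nc2.NetcdfFileWriter;
import ucar.nc2.Variable;

public final class SafeNetcdfWrites {

    // Write only when the shapes agree; with mismatched shapes,
    // NetCDF-C 4.1 native libraries can segfault and crash the JVM.
    public static void writeMatchingShape(NetcdfFileWriter writer, Variable var, Array data)
            throws IOException, InvalidRangeException {
        if (!Arrays.equals(var.getShape(), data.getShape())) {
            throw new IllegalArgumentException("Shape mismatch for " + var.getFullName()
                    + ": variable " + Arrays.toString(var.getShape())
                    + " vs data " + Arrays.toString(data.getShape()));
        }
        writer.write(var, data);
    }
}

With NetCDF-C 4.4 the mismatched write happens to be harmless, but failing fast in Java keeps the behaviour independent of the native library version.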

@bencaradocdavies (Contributor Author)

GSIP 158 has been accepted. I will make the final changes to this PR: renaming the Extra Variables section and squashing the commits into one.

@bencaradocdavies (Contributor Author) commented May 6, 2017

I renamed the Extra Variables section, moved it to the end, and updated the user manual with a new screenshot. All changes squashed into one commit.

@bencaradocdavies (Contributor Author) commented May 6, 2017

@dromagnoli I went back and investigated the failure I encountered with NetCDF-3 output and discovered that it now works, because the failure was caused by the same logic error that caused the segfault with NetCDF-C 4.1. I have expanded test coverage to include NetCDF-3 in #2311.

@dromagnoli (Contributor)

@bencaradocdavies: Good news! Thanks for the fix.

@bencaradocdavies (Contributor Author) commented May 11, 2017

@dromagnoli can I backport this to 2.11.x? I have created pull requests geotools/geotools#1583 and #2343 for the backport.

@dromagnoli (Contributor)

Hi @bencaradocdavies,
You have my +1 on the backport pull request.

@aaime (Member) commented May 12, 2017

Scratch, scratch... backporting a GSIP-worthy new feature 6 days before the stable release, after only a week on master?
I guess it's well isolated and the NetCDF output format is not used by many? Perhaps the GSIP was not required, and it's putting this in the wrong light?

@bencaradocdavies (Contributor Author)

@aaime the change is well-contained and I have taken care to ensure that it is backwards compatible both in terms of behaviour and, most importantly, in settings files (preserving the XStream class structure despite refactoring). The GSIP was mainly to solicit feedback on the functional enhancements and UI changes, not that I got much.
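To illustrate the point about the XStream class structure (the settings class and field names below are illustrative, not the actual GeoServer classes): XStream derives XML element names from class and field names, so configuration already written to disk only keeps deserializing if those names, or registered aliases for them, stay stable across refactoring.

import com.thoughtworks.xstream.XStream;

public class XStreamCompatibilityDemo {

    // Illustrative settings class; refactoring may move or rename it,
    // but the persisted XML element and field names must stay readable.
    static class NetCDFOutSettings {
        boolean shuffle;
        int compressionLevel;
    }

    public static void main(String[] args) {
        XStream xstream = new XStream();
        xstream.allowTypes(new Class[] {NetCDFOutSettings.class});
        // Keep the historical element name even if the Java class is renamed or moved.
        xstream.alias("netCDFOutSettings", NetCDFOutSettings.class);
        NetCDFOutSettings restored = (NetCDFOutSettings) xstream.fromXML(
                "<netCDFOutSettings>"
                        + "<shuffle>true</shuffle>"
                        + "<compressionLevel>4</compressionLevel>"
                        + "</netCDFOutSettings>");
        System.out.println(restored.shuffle + " " + restored.compressionLevel);
    }
}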

@aaime what is your preference? Merge backport for 2.11.1 or wait for 2.11.2?

Out of caution, I do not intend to backport to 2.10.x.

@dromagnoli (Contributor)

My understanding from the previous discussions on the mailing list was that this specific GSIP was more a place to collect notes and feedback than a classic GSIP for API changes.
That being said, it only touches the NetCDF output extension. I was thinking that updating an extension represents a different case. What do you think?

@aaime (Member) commented May 12, 2017

The "one month cooldown for new features" has no exceptions that I can remember, but I don't want to be picky, if you are happy with the backport I will stop complaining.

@bencaradocdavies (Contributor Author)

Thanks @dromagnoli and @aaime. I have merged the backports.

@aaime yes, you are right, strictly there is no exception for extensions, although it does seem to me to be overly restrictive for a modest non-API-changing improvement that is confined to a single extension to be held on master for a whole month. Nonetheless, PSC members in particular have a responsibility to either fix policy or set a good example by following it. Should we have a maintainer-discretion exemption for improvements confined to a single extension? Perhaps something for me to raise on the mailing lists? Thank you for the reminder and your forbearance.

@aaime (Member) commented May 12, 2017

I'd raise a discussion; I believe it's important to have clear rules applied the same way to everybody. That said, there will always be gray areas: changes that sit somewhere between bug fix and improvement, or bug fixes so large that it's good to cool them down anyway (I've just made one to mosaic for mixed CRSs that I'm waiting to backport because it touched so many files).

@bencaradocdavies (Contributor Author)

Agreed. I also have a bug fix that I have applied to master but whose performance impact I want to test; bug fixing must also be weighed against performance regression, and the stable branch must be treated with caution. I have even seen cases of mass pull requests submitted for master, stable, and maintenance that were all merged before Travis CI had finished running (or in one case had even failed), bringing down the build on all branches. That was a community module, but should there be different rules for community modules? What about extensions? I can feel a long-winded public service announcement coming on. ;-)
