ImageRecordReader crashes JVM with loaded Keras model in 1.0.0-beta7 #8976

Closed
basedrhys opened this issue May 25, 2020 · 15 comments
Labels
Bug (Bugs and problems), DataVec / ETL (ETL & DataVec issues)

Comments

@basedrhys
Contributor

basedrhys commented May 25, 2020

Issue Description

I encountered a strange problem in 1.0.0-beta7 while trying to run a Keras model loaded from a .h5 file (e.g., VGG16.h5 from here) - this model previously ran fine in 1.0.0-beta6.

Calling computationGraph.feedForward(features, false) would crash the JVM (error log) when using this code snippet:

// Create VGG16 from a Keras .h5 file
ComputationGraph tmpModel = KerasModelImport.importKerasModelAndWeights("VGG16.h5");
tmpModel.init();

ImageRecordReader reader = new ImageRecordReader(224, 224, 3);
reader.initialize(new FileSplit(new File("img_125_5.jpg"))); // Test with a single image
DataSetIterator it = new RecordReaderDataSetIterator(reader, 1);

// Keras model has wrong channel order, so flip it at the reader level
reader.setNchw_channels_first(false);

INDArray features = it.next().getFeatures();
// INDArray features = Nd4j.rand(1, 224, 224, 3); // Runs fine when initializing from random array of same size

System.out.println(Arrays.toString(features.shape())); // prints [1, 224, 224, 3]

tmpModel.feedForward(features, false);

The crash would happen specifically within the ComputationGraph class at line 1976. I found this by stepping through the code in IntelliJ.

Strangely though, the code snippet above runs fine if you use a random INDArray of the same shape (so the issue isn't caused by the shape of the features). Looking into the values of the features given by the DataSetIterator, there aren't any NaNs or weird values (all are between 0 and 1).

Also interesting to note is that the .h5 model can be saved in beta6 to a zip using model.save(new File("VGG.zip")), then loaded in beta7, and the above snippet works fine (swapping the KerasModelImport... call for ComputationGraph.load(new File("beta6KerasVGG.zip"), true)).

Another note, the above snippet works fine if using a different model (e.g., ResNet50.h5) - so it's not all Keras models that this problem occurs with.

Conclusion

On one hand, it seems like the problem is caused by updates to the KerasModelImport process - a .h5 file which loaded and ran fine in 1.0.0-beta6 now no longer works in 1.0.0-beta7. Additionally, saving a .zip file of the beta6 version and loading a new ComputationGraph in beta7 circumvents the above problem.

However, it also seems like the ImageRecordReader or DataSetIterator could be the culprit - when those are taken out of the equation (by using a random INDArray) no errors occur.

Attached files

img_125_5

Version Information

Please indicate relevant versions, including, if relevant:

  • Deeplearning4j version - 1.0.0-beta7
  • Platform information (OS, etc) - Ubuntu 18.04
  • CUDA version, if used
  • NVIDIA driver version, if in use
@basedrhys
Contributor Author

Any update on this? Would you like more information for reproducing it?

@treo
Member

treo commented Jun 3, 2020

If you can provide a small demo project that we can clone and run directly, in order to reproduce the crash, it would be very helpful.

@basedrhys
Contributor Author

basedrhys commented Jun 6, 2020

So it turns out the problem is somewhat platform dependent (it didn't occur on Mac).

A fix for this is to duplicate the array: it.next().getFeatures() --> it.next().getFeatures().dup()

Would you have any idea why duplicating the input array would stop the crash?

@basedrhys
Contributor Author

I've made a simple Gradle project to demonstrate this and help you reproduce it.

Instructions

  1. Download and unzip the project file from Google Drive: Link
  2. Open/Import the project in IntelliJ (or your IDE of choice). Let your IDE download the relevant dependencies
  3. Run the main() method in Main.java. The project initializes using beta6 so the main() method should complete successfully.
  4. In build.gradle, change the nd4j and dl4j versions from 1.0.0-beta6 to 1.0.0-beta7. Let your IDE import these changes.
  5. Run main() again. This should now cause the program to crash (JVM crash on Ubuntu 18.04 (log file attached) and nondescript Gradle error on Windows 10).

In Main.java, I've also written in some different scenarios that I've tried to help debug the issue; most notable is Scenario 3 which is the duplicating fix mentioned above.

Hopefully this can be reproduced on your machine, let me know if there's any other info you'd like :)

Attached Files

hs_err_pid17974.log

raver119 added the Bug (Bugs and problems) label Jun 11, 2020
@phong-phuong

phong-phuong commented Jun 12, 2020

Mine's crashing on 1.0.0-beta7 with certain images, regardless of whether CPU or GPU is used.
I'm using the Oxford pets dataset, but other images also cause it to crash.
I get a KERNELBASE.dll error reading a memory address.

Try reproducing by grabbing the dataset:
https://www.robots.ox.ac.uk/~vgg/data/pets/

Put the first four pets into separate sub folders and move their corresponding images. Use the AnimalClassifier image classification example, replace the download path with a local path pointing to the pets folder.

change:
int maxPathsPerLabel = 18;
to
int maxPathsPerLabel = Integer.MAX_VALUE;

in order to disable the maximum number of paths per label.

Partial crash dump below.

A fatal error has been detected by the Java Runtime Environment:

EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ff94c81a799, pid=14192, tid=30636

JRE version: Java(TM) SE Runtime Environment (14.0+36) (build 14+36-1461)

Java VM: Java HotSpot(TM) 64-Bit Server VM (14+36-1461, mixed mode, sharing, tiered, compressed oops, g1 gc, windows-amd64)

Problematic frame:

C [KERNELBASE.dll+0x3a799]

Current thread (0x000002104c301800): JavaThread "main" [_thread_in_native, id=30636, stack(0x000000b1c6500000,0x000000b1c6600000)]

Stack: [0x000000b1c6500000,0x000000b1c6600000], sp=0x000000b1c65fbe50, free space=1007k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [KERNELBASE.dll+0x3a799]
C [VCRUNTIME140.dll+0x3351]
C [ntdll.dll+0xa0616]
C [opencv_core430.dll+0x1a2728]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j org.bytedeco.opencv.global.opencv_imgproc.cvtColor(Lorg/bytedeco/opencv/opencv_core/Mat;Lorg/bytedeco/opencv/opencv_core/Mat;I)V+0
J 3348 c1 org.datavec.image.loader.NativeImageLoader.transformImage(Lorg/bytedeco/opencv/opencv_core/Mat;Lorg/nd4j/linalg/api/ndarray/INDArray;)Lorg/nd4j/linalg/api/ndarray/INDArray; (459 bytes) @ 0x00000210557a2924 [0x00000210557a14a0+0x0000000000001484]
J 3358 c1 org.datavec.image.loader.NativeImageLoader.asMatrixView(Ljava/io/InputStream;Lorg/nd4j/linalg/api/ndarray/INDArray;)V (94 bytes) @ 0x00000210557a7a44 [0x00000210557a7700+0x0000000000000344]
J 3460 c1 org.datavec.image.loader.NativeImageLoader.asMatrixView(Ljava/io/File;Lorg/nd4j/linalg/api/ndarray/INDArray;)V (107 bytes) @ 0x000002105581276c [0x0000021055812540+0x000000000000022c]
j org.datavec.image.recordreader.BaseImageRecordReader.next(I)Ljava/util/List;+416
j org.deeplearning4j.datasets.datavec.RecordReaderMultiDataSetIterator.next(I)Lorg/nd4j/linalg/dataset/api/MultiDataSet;+126
j org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next(I)Lorg/nd4j/linalg/dataset/DataSet;+64
j org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next()Lorg/nd4j/linalg/dataset/DataSet;+5
j org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator.next()Ljava/lang/Object;+1
j image_classification.ImageDataSetViewer.viewDataset(Lorg/nd4j/linalg/dataset/api/iterator/DataSetIterator;Ljava/util/List;)V+48
j image_classification.ImageClassification.run()V+660
j image_classification.ImageClassification.main([Ljava/lang/String;)V+7
v ~StubRoutines::call_stub

@raver119
Contributor

This crash means OpenCV has thrown a C++ exception that wasn't caught properly. The C interface can't process those.

@phong-phuong

Thanks. I looked into this further. I can confirm that, in my particular case, images with 8-bit color depth cause the following call to fail in opencv_core (the other images had 24-bit depth), ultimately causing the KERNELBASE.dll crash.

org.bytedeco.opencv.global.opencv_imgproc.cvtColor(Lorg/bytedeco/opencv/opencv_core/Mat;Lorg/bytedeco/opencv/opencv_core/Mat;I)V+0

Sample 8-bit image causing the crash:

Abyssinian_34

I converted the 8-bit image to 24-bit and the crash on loading went away. This is a different issue from the topic, but the takeaway is that we should consider adding color depth checking to the ImageRecordReader class and warning the user about incompatible images, along with the offending filename. That would hold until either OpenCV fixes this issue or ImageRecordReader is updated to support different color depths by calling the appropriate OpenCV function, if there is one.
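
To make the color depth check suggested above concrete, here is a minimal sketch that uses only the JDK's ImageIO rather than OpenCV or DataVec. ColorDepthCheck, bitsPerPixel and warningFor are hypothetical names invented for this illustration, and the expected depth of 24 bits is an assumption taken from the images described in this comment.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical pre-flight check; not part of DataVec or ImageRecordReader.
public class ColorDepthCheck {

    /** Returns the number of bits per pixel reported by the image's color model. */
    static int bitsPerPixel(BufferedImage image) {
        return image.getColorModel().getPixelSize();
    }

    /** Builds a warning naming the offending file, or returns null if the depth matches. */
    static String warningFor(String fileName, BufferedImage image, int expectedBits) {
        int actual = bitsPerPixel(image);
        if (actual == expectedBits) {
            return null;
        }
        return fileName + ": expected " + expectedBits + " bits per pixel but found " + actual;
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            // ImageIO.read returns null when no registered reader can decode the file.
            BufferedImage image = ImageIO.read(new File(path));
            if (image == null) {
                System.err.println(path + ": no ImageIO reader could decode this file");
                continue;
            }
            String warning = warningFor(path, image, 24); // 24-bit depth is an assumption
            if (warning != null) {
                System.err.println(warning);
            }
        }
    }
}
```

Running it over a dataset folder before building the record reader would list files whose depth differs, so they can be converted or excluded. ImageIO decodes images independently of OpenCV, so the depth it reports may not match what the native loader sees.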

@saudet
Contributor

saudet commented Jun 13, 2020

@raver119 I'm pretty sure C++ exceptions are getting caught and rethrown as Java exceptions. Something else is going on here...

@phong-phuong

phong-phuong commented Jun 14, 2020

@saudet After more investigation: I mistakenly thought that 8-bit images were the issue, but the problem appears to be related to some GIF images that cause OpenCV to segfault, especially the animated ones. I still can't work out what's wrong with the cat GIF, although OpenCV crashes when trying to read the first pixel, probably by reading the wrong memory address.

Not a biggie, but if anyone wants to have a crack at working out why the above GIF crashes, here's an isolated test:

import java.io.File;
import java.io.IOException;

import org.datavec.image.loader.NativeImageLoader;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class OpenCVTest {
    public static void main(String[] args) throws IOException {
        NativeImageLoader imageLoader = new NativeImageLoader();
        String filenameAndPath = "cat.gif";
        int channels = 4;
        int imageHeight = 202;
        int imageWidth = 250;
        INDArray view = Nd4j.create(new int[] {channels, imageHeight, imageWidth});
        File file = new File(filenameAndPath);
        imageLoader.asMatrixView(file, view);
        System.out.println(view);
    }
}

@raver119
Contributor

raver119 commented Jun 14, 2020 via email

@phong-phuong

@raver119 It's above, a few posts up: the cat one. It's labelled as a JPG, but it's really a GIF.
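
Because the file's extension and its real format disagree here, one way to catch such files before they reach the image loader is to compare the leading signature bytes with the extension. Below is a minimal sketch using only the JDK. ImageFormatSniffer and its methods are hypothetical names, not DataVec API, and only the GIF, JPEG and PNG signatures are covered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration; not part of DataVec or OpenCV.
public class ImageFormatSniffer {

    /** True if data begins with the given unsigned byte values. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses the format from the file signature: "gif", "jpeg", "png" or "unknown". */
    static String detect(byte[] header) {
        if (startsWith(header, 'G', 'I', 'F', '8', '7', 'a')
                || startsWith(header, 'G', 'I', 'F', '8', '9', 'a')) {
            return "gif";
        }
        if (startsWith(header, 0xFF, 0xD8, 0xFF)) {
            return "jpeg";
        }
        if (startsWith(header, 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A)) {
            return "png";
        }
        return "unknown";
    }

    /** True if the file name's extension is a usual one for the detected format. */
    static boolean extensionMatches(String fileName, String format) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        switch (format) {
            case "gif":  return lower.endsWith(".gif");
            case "jpeg": return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
            case "png":  return lower.endsWith(".png");
            default:     return false;
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] header = new byte[8];
            int n;
            try (InputStream in = Files.newInputStream(Paths.get(arg))) {
                n = in.read(header); // may return fewer than 8 bytes, or -1 for an empty file
            }
            String format = detect(Arrays.copyOf(header, Math.max(n, 0)));
            if (!extensionMatches(arg, format)) {
                System.err.println(arg + ": content looks like " + format
                        + ", which does not match the extension");
            }
        }
    }
}
```

A file named like a JPEG that starts with a GIF signature would be reported, which gives the user the offending filename before any native code touches the file.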

@raver119
Contributor

Thanks

@saudet
Contributor

saudet commented Jun 15, 2020

Ah, I see, a GIF file, that's not actually supported by OpenCV, but it looks like imread() returns an empty array instead of null, so we should check for that...

saudet added the DataVec / ETL (ETL & DataVec issues) label Jun 15, 2020
@saudet
Contributor

saudet commented Jun 15, 2020

It does check for empty arrays, so it's using Leptonica to load this. It's probably related to issue #8785.

@agibsonccc
Contributor

Closing in favor of Sam's linked issue, which has more details: #8785
