fit_generator Segmentation fault #8225

Closed
benoistlaurent opened this issue Oct 23, 2017 · 10 comments

@benoistlaurent

Hi,

I use model.fit_generator to handle a large dataset.

I want to read the data batch by batch from a source file, which I did successfully using a CSV file.

When I switch to the pandas.read_hdf function, keras fit_generator ends up with a segmentation fault:

$ python 3_wine_net_fit_generator_hdf.py                                                                                                           [16:40:46]
Using TensorFlow backend.
2017-10-23 16:43:11.345428: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345450: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345455: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345476: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Epoch 1/10
518/519 [============================>.] - ETA: 0s - loss: 0.2543 - acc: 0.9116[1]    79974 segmentation fault  python 3_wine_net_fit_generator_hdf.py

I have already noticed that if I do not use validation_data, I don't get the segmentation fault, but I don't understand why.

Here is a link to the small example I'm running: wine-example
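For context, the failing pattern looks roughly like this (a minimal sketch only, not the actual wine-example script; the file path, HDF5 key, and the 'target' column are placeholders, and model, n_train and n_val are assumed to be defined elsewhere):

import pandas as pd

def hdf_batch_generator(path, key, n_rows, batch_size):
    # Read one slice of the HDF5 store per batch; start/stop require the
    # store to have been written in table format.
    while True:
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            chunk = pd.read_hdf(path, key, start=start, stop=stop)
            x = chunk.drop('target', axis=1).values
            y = chunk['target'].values
            yield x, y

train_gen = hdf_batch_generator('wine.h5', 'train', n_train, batch_size=32)
val_gen = hdf_batch_generator('wine.h5', 'val', n_val, batch_size=32)

model.fit_generator(train_gen,
                    steps_per_epoch=n_train // 32,
                    epochs=10,
                    validation_data=val_gen,
                    validation_steps=n_val // 32)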

Any help would be very much appreciated.

Cheers,
Ben

@dgorissen (Contributor) commented Nov 12, 2017

Did you solve this?

I have the same problem (also using fit_generator), but it happens mid-epoch, consistently within the first one to five epochs.
It turns out versions before 2.0.9 are fine; only 2.0.9 shows this behaviour. Running on TensorFlow 1.0.1.

Training model
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Epoch 1/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.1092 - categorical_accuracy: 0.9744Epoch 00001: val_categorical_accuracy improved from -inf to 0.99166, saving model to /home/xxx/code7884/7884 [==============================] - 410s 52ms/step - loss: 0.1092 - categorical_accuracy: 0.9744 - val_loss: 0.0319 - val_categorical_accuracy: 0.9917
Epoch 2/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.0438 - categorical_accuracy: 0.9893Epoch 00002: val_categorical_accuracy improved from 0.99166 to 0.99559, saving model to /home/xxx/c7884/7884 [==============================] - 410s 52ms/step - loss: 0.0438 - categorical_accuracy: 0.9893 - val_loss: 0.0151 - val_categorical_accuracy: 0.9956
Epoch 3/20
5925/7884 [=====================>........] - ETA: 1:38 - loss: 0.0342 - categorical_accuracy: 0.9917Segmentation fault (core dumped)

@benoistlaurent (Author)

I still got the problem on this version of the script (using tensorflow==1.3.0 and Keras==2.0.8).

The solution I ended up with was to stop using model.fit_generator and replace it with model.train_on_batch (see this example).
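A rough sketch of that workaround (batch_reader and n_batches are hypothetical placeholders for whatever reads one chunk of the file at a time; this is not the linked example verbatim):

n_epochs = 10
for epoch in range(n_epochs):
    for _ in range(n_batches):
        x_batch, y_batch = next(batch_reader)   # read one chunk from disk
        model.train_on_batch(x_batch, y_batch)  # single gradient update on that chunk
    # validation is done explicitly, outside fit_generator
    val_metrics = model.evaluate(x_val, y_val, verbose=0)
    print('epoch %d, validation metrics: %s' % (epoch + 1, val_metrics))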

@rdelassus

I just had the exact same problem.
I use fit_generator and the generator is reading data from a file.

The first epoch ends with a segmentation fault:
Epoch 1/35
2694/2695 [============================>.] - ETA: 0s - loss: 0.2397 - acc: 0.9052 - jaccard_coef: 0.4410 - jaccard_coef_int: 0.5470Segmentation fault (core dumped)

It doesn't happen when I remove the validation_data.

With keras 1.2.2 and tensorflow 1.4.0

@oxydron commented Apr 4, 2018

Same problem
keras: 2.0.8
tensorflow: 1.3.0
Training data: 1048
Test data: 259
image type: 120x120x1

@mdgoldberg

Same problem. @fchollet, is there any way this could be fixed? I'm happy to help provide debug info and potentially contribute as needed.

Using fit_generator, with generator arguments for both training and validation data.
Both train and validation generators read from the same HDF5 file (using pandas).

I've tried it with my full dataset (237K rows) and a sample subset of the full dataset (1000 rows), both with ~1K columns, and in both cases the segmentation fault happens right after the first epoch finishes. Like others, if I remove the validation data it doesn't occur. I'm using a train/test split of 85/15 and a batch size of 64 for both the full and sample datasets (so I'm only reading 64 rows from the HDF5 file at any given time, in the generator). Output from top confirms that I'm not running out of memory.

Versions:
Keras 2.1.6
tensorflow 1.8.0

Unlike @dgorissen, I'm experiencing this issue on 2.0.8 as well as 2.1.6.

@mdgoldberg

I believe I just figured out what was causing my issue. I'm not sure whether this applies to others, but in my generator I was using pd.read_hdf to read subsets of an HDF5 file into memory, and the problem is that read_hdf is not thread-safe, even for reading (the documentation is currently not clear about this).

I solved this problem by passing workers=0 to fit_generator, so that the generator is executed on the main thread.
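In other words, something along these lines (the generator names and step counts are placeholders; in Keras 2.x, workers=0 executes the generator on the main thread instead of in worker threads):

model.fit_generator(train_generator,
                    steps_per_epoch=train_steps,
                    epochs=10,
                    validation_data=val_generator,
                    validation_steps=val_steps,
                    workers=0,                  # run the generator on the main thread
                    use_multiprocessing=False)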

@amarion35 commented Jun 21, 2018

I had the same problem with the following code:

import numpy as np
from keras.models import Sequential
from keras.layers import (Activation, Conv3D, Dense, Dropout,
                          GlobalAveragePooling3D, MaxPooling3D)
from keras.optimizers import Adam

def init_model():
    # Small 3D CNN operating on variable-sized single-channel volumes.
    model = Sequential()
    model.add(Conv3D(4, kernel_size=(3, 3, 3), input_shape=(None, None, None, 1), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
    model.add(Dropout(0.25))

    model.add(GlobalAveragePooling3D())
    model.add(Dense(32, activation='sigmoid'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='softmax'))
    model.compile(loss='mse', optimizer=Adam())

    print(model.summary())

    return model

def input_generator(metas):
    # Yields one (video, DMOS score) pair at a time, reshaped to a batch of size 1.
    # load_from_meta() and get_datas() are defined elsewhere.
    while True:
        meta_sample = metas.sample(frac=1)
        yield np.expand_dims(np.expand_dims(load_from_meta(meta_sample), axis=4), axis=0), [meta_sample.DMOS]

video_metadatas = get_datas().iloc[0]
model = init_model()
hist = model.fit_generator(generator=input_generator(video_metadatas), epochs=1, steps_per_epoch=1, use_multiprocessing=False, workers=0)

load_from_meta() loads videos using an ffmpeg wrapper.

I fixed the issue with workers=0.

Edit:

Actually, it does not work every time:

$ for k in 1 2 3 4 5 6 7 8 9 10; do python3 3D_CNN.py; done
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)

@gschramm commented Aug 7, 2018

I experience the same problem (a segmentation fault during the first epochs when using fit_generator).

The segmentation fault occurs when I run fit_generator on CPUs with a batch size of 40. It does not occur when I run the same example (see below) on a GPU (GTX 1080 Ti) or when running on CPU with a batch size of 10. I was able to reproduce the segmentation faults on two Linux machines.

4/10 [===========>..................] - ETA: 10:59 - loss: 0.4438Segmentation fault (core dumped)

Here is a small standalone script that produces the segmentation fault
(when using batch_size = 40 and running on CPUs):
https://gist.github.com/gschramm/e6db1f7333b50bca10c38243efec0925

Any idea what is going wrong?

I am running:

  • OS: Ubuntu 16.04
  • Keras 2.2.0
  • Tensorflow 1.9.0 (I also tested 1.8.0 -> same problem)

@luncliff

Hi, I just ran into this symptom in my Docker environment with Keras 2.2.4 and TensorFlow 1.12 (GPU).

For me, the issue disappeared when I changed the TensorFlow image to 1.13-gpu-py3.
I'm not sure whether it is completely solved, but I'm writing down my environment for future visitors...

Environment

  • Host:
    • OS: Ubuntu 16.04.6 LTS
    • Kernel Linux 4.15.0-51-generic
  • Docker:
    • Docker version: 18.09.6
    • API version: 1.39
    • NVIDIA Docker Runtime (nvidia-docker2)
  • GPU Device
    • CUDA: 9.2
    • GeForce RTX 2080 Ti * 2
Mon Jun 17 17:23:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   39C    P8    21W / 250W |    403MiB / 10989MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 37%   34C    P8     2W / 250W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1114      G   /usr/lib/xorg/Xorg                           224MiB |
|    0      2511      G   compiz                                        79MiB |
|    0     10650      G   ...-token=9E760AB97E59CC5C02D0AFC5D37FE54E    98MiB |
+-----------------------------------------------------------------------------+

Installation

  • Keras:
    • Installed via pip
  • Tensorflow:
    • Docker Image: 1.12-gpu-py3 >> 1.13-gpu-py3 (solved)
FROM    tensorflow/tensorflow:1.13.1-gpu-py3 as ship
LABEL   maintainer="luncliff@gmail.com"

RUN     pip install -qqq --upgrade pip && pip install -qqq keras
RUN     pip install -qqq pillow
# ...

@mikechen66

For larger datasets with Keras multithreading, users need to adopt a thread-safe generator to deal with this issue. There is a brief introduction by Anand Chitipothu, as well as an explanation of composed functions by Mathieu Larose. The thread-safe method has been adopted in the Faster R-CNN library by RGB and Kaiming He.

threadsafe_code: http://anandology.com/blog/using-iterators-and-generators/
composition of functions: https://mathieularose.com/function-composition-in-python/
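A minimal sketch of the thread-safe wrapper described in those posts, adapted to Python 3 (the class and decorator names are illustrative, not part of Keras):

import threading

class ThreadSafeIterator:
    # Wraps an iterator so that calls to next() are serialized by a lock.
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

def threadsafe_generator(f):
    # Decorator: make a generator function return a thread-safe iterator.
    def g(*args, **kwargs):
        return ThreadSafeIterator(f(*args, **kwargs))
    return g

Decorating the generator passed to fit_generator with @threadsafe_generator then serializes concurrent next() calls coming from the worker threads.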
