fit_generator Segmentation fault #8225

Closed
benoistlaurent opened this issue Oct 23, 2017 · 10 comments

@benoistlaurent

Hi,

I use model.fit_generator to handle a large dataset.

I want to read the data batch by batch from a source file, which I did successfully using a CSV file.

When I switch to the pandas.read_hdf function, keras fit_generator ends up with a segmentation fault:

$ python 3_wine_net_fit_generator_hdf.py                                                                                                           [16:40:46]
Using TensorFlow backend.
2017-10-23 16:43:11.345428: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345450: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345455: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345476: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Epoch 1/10
518/519 [============================>.] - ETA: 0s - loss: 0.2543 - acc: 0.9116[1]    79974 segmentation fault  python 3_wine_net_fit_generator_hdf.py

I have already noticed that if I do not use validation_data, I don't get the segmentation fault, but I don't understand why.

Here is a link to the small example I'm running: wine-example
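For context, the failing pattern looks roughly like this (a minimal sketch only, not the actual wine-example script; the file path, HDF5 key, and the 'target' column are placeholders, and model, n_train and n_val are assumed to be defined elsewhere):

import pandas as pd

def hdf_batch_generator(path, key, n_rows, batch_size):
    # Read one slice of the HDF5 store per batch; start/stop require the
    # store to have been written in table format.
    while True:
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            chunk = pd.read_hdf(path, key, start=start, stop=stop)
            x = chunk.drop('target', axis=1).values
            y = chunk['target'].values
            yield x, y

train_gen = hdf_batch_generator('wine.h5', 'train', n_train, batch_size=32)
val_gen = hdf_batch_generator('wine.h5', 'val', n_val, batch_size=32)

model.fit_generator(train_gen,
                    steps_per_epoch=n_train // 32,
                    epochs=10,
                    validation_data=val_gen,
                    validation_steps=n_val // 32)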

Any help would be very much appreciated.

Cheers,
Ben

@dgorissen (Contributor) commented Nov 12, 2017

Did you solve this?

I have the same problem (also using fit_generator), but it happens mid-epoch, consistently within the first one to five epochs.
It turns out versions before 2.0.9 are fine; only 2.0.9 shows this behaviour. Running on TensorFlow 1.0.1.

Training model
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Epoch 1/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.1092 - categorical_accuracy: 0.9744Epoch 00001: val_categorical_accuracy improved from -inf to 0.99166, saving model to /home/xxx/code7884/7884 [==============================] - 410s 52ms/step - loss: 0.1092 - categorical_accuracy: 0.9744 - val_loss: 0.0319 - val_categorical_accuracy: 0.9917
Epoch 2/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.0438 - categorical_accuracy: 0.9893Epoch 00002: val_categorical_accuracy improved from 0.99166 to 0.99559, saving model to /home/xxx/c7884/7884 [==============================] - 410s 52ms/step - loss: 0.0438 - categorical_accuracy: 0.9893 - val_loss: 0.0151 - val_categorical_accuracy: 0.9956
Epoch 3/20
5925/7884 [=====================>........] - ETA: 1:38 - loss: 0.0342 - categorical_accuracy: 0.9917Segmentation fault (core dumped)

@benoistlaurent (Author)

I still got the problem on this version of the script (using tensorflow==1.3.0 and Keras==2.0.8).

The solution I ended up with was to stop using model.fit_generator and replace it with model.train_on_batch (see this example).
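A rough sketch of that workaround (batch_reader and n_batches are hypothetical placeholders for whatever reads one chunk of the file at a time; this is not the linked example verbatim):

n_epochs = 10
for epoch in range(n_epochs):
    for _ in range(n_batches):
        x_batch, y_batch = next(batch_reader)   # read one chunk from disk
        model.train_on_batch(x_batch, y_batch)  # single gradient update on that chunk
    # validation is done explicitly, outside fit_generator
    val_metrics = model.evaluate(x_val, y_val, verbose=0)
    print('epoch %d, validation metrics: %s' % (epoch + 1, val_metrics))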

@rdelassus

I just had the exact same problem.
I use fit_generator and the generator is reading data from a file.

The first epoch ends with a segmentation fault:
Epoch 1/35
2694/2695 [============================>.] - ETA: 0s - loss: 0.2397 - acc: 0.9052 - jaccard_coef: 0.4410 - jaccard_coef_int: 0.5470Segmentation fault (core dumped)

It doesn't happen when I remove the validation_data.

With keras 1.2.2 and tensorflow 1.4.0

@oxydron commented Apr 4, 2018

Same problem
keras: 2.0.8
tensorflow: 1.3.0
Training data: 1048
Test data: 259
image type: 120x120x1

@mdgoldberg

Same problem. @fchollet, is there any way this could be fixed? I'm happy to help provide debug info and potentially contribute as needed.

Using fit_generator, with generator arguments for both training and validation data.
Both train and validation generators read from the same HDF5 file (using pandas).

I've tried it with my full dataset (237K rows) and a sample subset of the full dataset (1000 rows), both with ~1K columns, and in both cases the segmentation fault happens right after the first epoch finishes. Like others, if I remove the validation data it doesn't occur. I'm using a train/test split of 85/15 and a batch size of 64 for both the full and sample datasets (so I'm only reading 64 rows from the HDF5 file at any given time, in the generator). Output from top confirms that I'm not running out of memory.

Versions:
Keras 2.1.6
tensorflow 1.8.0

Unlike @dgorissen, I'm experiencing this issue on 2.0.8 as well as 2.1.6.

@mdgoldberg

I believe I just figured out what was causing my issue. I'm not sure whether this applies to others, but in my generator I was using pd.read_hdf to read subsets of an HDF5 file into memory, and the problem is that read_hdf is not thread-safe, even for reading (the documentation is currently not clear about this).

I solved this problem by passing workers=0 to fit_generator, so that the generator is executed on the main thread.
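In other words, something along these lines (the generator names and step counts are placeholders; in Keras 2.x, workers=0 executes the generator on the main thread instead of in worker threads):

model.fit_generator(train_generator,
                    steps_per_epoch=train_steps,
                    epochs=10,
                    validation_data=val_generator,
                    validation_steps=val_steps,
                    workers=0,                  # run the generator on the main thread
                    use_multiprocessing=False)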

@amarion35 commented Jun 21, 2018

I had the same problem with the following code:

import numpy as np
from keras.models import Sequential
from keras.layers import (Activation, Conv3D, Dense, Dropout,
                          GlobalAveragePooling3D, MaxPooling3D)
from keras.optimizers import Adam

def init_model():
    # Small 3D CNN operating on variable-sized single-channel volumes.
    model = Sequential()
    model.add(Conv3D(4, kernel_size=(3, 3, 3), input_shape=(None, None, None, 1), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
    model.add(Dropout(0.25))

    model.add(GlobalAveragePooling3D())
    model.add(Dense(32, activation='sigmoid'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='softmax'))
    model.compile(loss='mse', optimizer=Adam())

    print(model.summary())

    return model

def input_generator(metas):
    # Yields one (video, DMOS score) pair at a time, reshaped to a batch of size 1.
    # load_from_meta() and get_datas() are defined elsewhere.
    while True:
        meta_sample = metas.sample(frac=1)
        yield np.expand_dims(np.expand_dims(load_from_meta(meta_sample), axis=4), axis=0), [meta_sample.DMOS]

video_metadatas = get_datas().iloc[0]
model = init_model()
hist = model.fit_generator(generator=input_generator(video_metadatas), epochs=1, steps_per_epoch=1, use_multiprocessing=False, workers=0)

load_from_meta() loads videos using an ffmpeg wrapper.

I fixed the issue with workers=0.

Edit:

Actually, it does not work every time:

$ for k in 1 2 3 4 5 6 7 8 9 10; do python3 3D_CNN.py; done
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)

@gschramm commented Aug 7, 2018

I experience the same problem (a segmentation fault during the first epochs when using fit_generator).

The segmentation fault occurs when I run fit_generator on CPUs with a batch size of 40. It does not occur when I run the same example (see below) on a GPU (GTX 1080 Ti) or when running on CPU with a batch size of 10. I was able to reproduce the segmentation faults on two Linux machines.

4/10 [===========>..................] - ETA: 10:59 - loss: 0.4438Segmentation fault (core dumped)

Here is a small standalone script that produces the segmentation fault
(when using batch_size = 40 and running on CPUs):
https://gist.github.com/gschramm/e6db1f7333b50bca10c38243efec0925

Any idea what is going wrong?

I am running:

  • OS: Ubuntu 16.04
  • Keras 2.2.0
  • Tensorflow 1.9.0 (I also tested 1.8.0 -> same problem)

@luncliff

Hi, I just ran into this symptom in my Docker environment with Keras 2.2.4 and TensorFlow 1.12 (GPU).

For me, the issue disappeared when I changed the TensorFlow image to 1.13-gpu-py3.
I'm not sure whether it is completely solved, but I'm writing down my environment for future visitors...

Environment

  • Host:
    • OS: Ubuntu 16.04.6 LTS
    • Kernel Linux 4.15.0-51-generic
  • Docker:
    • Docker version: 18.09.6
    • API version: 1.39
    • NVIDIA Docker Runtime (nvidia-docker2)
  • GPU Device
    • CUDA: 9.2
    • GeForce RTX 2080 Ti * 2
Mon Jun 17 17:23:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   39C    P8    21W / 250W |    403MiB / 10989MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 37%   34C    P8     2W / 250W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1114      G   /usr/lib/xorg/Xorg                           224MiB |
|    0      2511      G   compiz                                        79MiB |
|    0     10650      G   ...-token=9E760AB97E59CC5C02D0AFC5D37FE54E    98MiB |
+-----------------------------------------------------------------------------+

Installation

  • Keras:
    • Installed via pip
  • Tensorflow:
    • Docker Image: 1.12-gpu-py3 >> 1.13-gpu-py3 (solved)
FROM    tensorflow/tensorflow:1.13.1-gpu-py3 as ship
LABEL   maintainer="luncliff@gmail.com"

RUN     pip install -qqq --upgrade pip && pip install -qqq keras
RUN     pip install -qqq pillow
# ...

@mikechen66

For larger datasets with Keras multithreading, users need to adopt a thread-safe generator to deal with this issue. There is a brief introduction by Anand Chitipothu, as well as an explanation of composed functions by Mathieu Larose. The thread-safe method has been adopted in the Faster R-CNN library by RGB and Kaiming He.

threadsafe_code: http://anandology.com/blog/using-iterators-and-generators/
composition of functions: https://mathieularose.com/function-composition-in-python/
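A minimal sketch of the thread-safe wrapper described in those posts, adapted to Python 3 (the class and decorator names are illustrative, not part of Keras):

import threading

class ThreadSafeIterator:
    # Wraps an iterator so that calls to next() are serialized by a lock.
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

def threadsafe_generator(f):
    # Decorator: make a generator function return a thread-safe iterator.
    def g(*args, **kwargs):
        return ThreadSafeIterator(f(*args, **kwargs))
    return g

Decorating the generator passed to fit_generator with @threadsafe_generator then serializes concurrent next() calls coming from the worker threads.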
