Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Engine shutdown #7958

Closed
Harold-Zhang opened this issue Sep 20, 2017 · 13 comments
Closed

Engine shutdown #7958

Harold-Zhang opened this issue Sep 20, 2017 · 13 comments

Comments

@Harold-Zhang
Copy link

Harold-Zhang commented Sep 20, 2017

Environment info

Operating System:
Ubuntu 14.04
Compiler:
gcc 4.8.4
Package used (Python/R/Scala/Julia):
Python 2.7
MXNet version:
The latest version
GPU:
Tesla K40m

Error Message:

[19:43:24] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
2017-09-19 19:43:45,834 - Epoch[0] Batch [20]	Speed: 4.56 samples/sec	accuracy=0.964286
2017-09-19 19:44:03,379 - Epoch[0] Batch [40]	Speed: 4.56 samples/sec	accuracy=1.000000
2017-09-19 19:44:20,942 - Epoch[0] Batch [60]	Speed: 4.56 samples/sec	accuracy=1.000000
2017-09-19 19:44:38,747 - Epoch[0] Batch [80]	Speed: 4.49 samples/sec	accuracy=1.000000
2017-09-19 19:44:56,319 - Epoch[0] Batch [100]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:45:13,862 - Epoch[0] Batch [120]	Speed: 4.56 samples/sec	accuracy=1.000000
2017-09-19 19:45:31,494 - Epoch[0] Batch [140]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:45:49,110 - Epoch[0] Batch [160]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:46:06,677 - Epoch[0] Batch [180]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:46:24,257 - Epoch[0] Batch [200]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:46:41,886 - Epoch[0] Batch [220]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:46:59,501 - Epoch[0] Batch [240]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:47:17,085 - Epoch[0] Batch [260]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:47:34,667 - Epoch[0] Batch [280]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:47:52,273 - Epoch[0] Batch [300]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:48:09,861 - Epoch[0] Batch [320]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:48:27,503 - Epoch[0] Batch [340]	Speed: 4.53 samples/sec	accuracy=1.000000
2017-09-19 19:48:45,085 - Epoch[0] Batch [360]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:49:02,700 - Epoch[0] Batch [380]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:49:20,358 - Epoch[0] Batch [400]	Speed: 4.53 samples/sec	accuracy=1.000000
2017-09-19 19:49:37,943 - Epoch[0] Batch [420]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:49:55,530 - Epoch[0] Batch [440]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:50:13,105 - Epoch[0] Batch [460]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:50:30,683 - Epoch[0] Batch [480]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:50:48,265 - Epoch[0] Batch [500]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:51:05,903 - Epoch[0] Batch [520]	Speed: 4.54 samples/sec	accuracy=1.000000
2017-09-19 19:51:23,492 - Epoch[0] Batch [540]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:51:41,176 - Epoch[0] Batch [560]	Speed: 4.52 samples/sec	accuracy=1.000000
2017-09-19 19:51:58,766 - Epoch[0] Batch [580]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:52:16,347 - Epoch[0] Batch [600]	Speed: 4.55 samples/sec	accuracy=1.000000
2017-09-19 19:52:33,933 - Epoch[0] Batch [620]	Speed: 4.55 samples/sec	accuracy=1.000000
[19:52:38] /home/harold/mxnet/dmlc-core/include/dmlc/logging.h:308: [19:52:38] src/io/image_io.cc:165: Check failed: !dst.empty() 

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9450196c8c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io12ImdecodeImplEibPvmPNS_7NDArrayE+0x67a) [0x7f9451a3eefa]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f9450284963]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x8b) [0x7f94519daf4b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6Engine8PushSyncESt8functionIFvNS_10RunContextEEENS_7ContextERKSt6vectorIPNS_6engine3VarESaIS9_EESD_NS_10FnPropertyEiPKc+0x124) [0x7f9450285814]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io8ImdecodeERKN4nnvm9NodeAttrsERKSt6vectorINS_7NDArrayESaIS6_EEPS8_+0xc90) [0x7f9451a40a10]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z20ImperativeInvokeImplRKN5mxnet7ContextEON4nnvm9NodeAttrsEPSt6vectorINS_7NDArrayESaIS7_EESA_PS6_IbSaIbEESD_+0x3cf) [0x7f94519a91ff]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z22MXImperativeInvokeImplPviPS_PiPS0_iPPKcS5_+0x25b) [0x7f94519bb43b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvokeEx+0x2f) [0x7f94519a982f]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9468629adc]

Traceback (most recent call last):
  File "train.py", line 142, in <module>
    image_shape='3,224,224', epoch=0, num_epoch=args.num_epoch, kv=kv)
  File "train.py", line 106, in train_model
    epoch_end_callback=checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/module/base_module.py", line 491, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1151, in next
    data = self.imdecode(s)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1183, in imdecode
    return imdecode(s)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 136, in imdecode
    return _internal._cvimdecode(buf, *args, **kwargs)
  File "<string>", line 16, in _cvimdecode
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/base.py", line 143, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:52:38] src/io/image_io.cc:165: Check failed: !dst.empty() 

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f9450196c8c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io12ImdecodeImplEibPvmPNS_7NDArrayE+0x67a) [0x7f9451a3eefa]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZNSt17_Function_handlerIFvN5mxnet10RunContextENS0_6engine18CallbackOnCompleteEEZNS0_6Engine8PushSyncESt8functionIFvS1_EENS0_7ContextERKSt6vectorIPNS2_3VarESaISC_EESG_NS0_10FnPropertyEiPKcEUlS1_S3_E_E9_M_invokeERKSt9_Any_dataS1_S3_+0x23) [0x7f9450284963]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6engine11NaiveEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKc+0x8b) [0x7f94519daf4b]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet6Engine8PushSyncESt8functionIFvNS_10RunContextEEENS_7ContextERKSt6vectorIPNS_6engine3VarESaIS9_EESD_NS_10FnPropertyEiPKc+0x124) [0x7f9450285814]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet2io8ImdecodeERKN4nnvm9NodeAttrsERKSt6vectorINS_7NDArrayESaIS6_EEPS8_+0xc90) [0x7f9451a40a10]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z20ImperativeInvokeImplRKN5mxnet7ContextEON4nnvm9NodeAttrsEPSt6vectorINS_7NDArrayESaIS7_EESA_PS6_IbSaIbEESD_+0x3cf) [0x7f94519a91ff]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(_Z22MXImperativeInvokeImplPviPS_PiPS0_iPPKcS5_+0x25b) [0x7f94519bb43b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/libmxnet.so(MXImperativeInvokeEx+0x2f) [0x7f94519a982f]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f9468629adc]

[19:52:38] src/engine/naive_engine.cc:53: Engine shutdown

please provide the commands you have run that lead to the error.

I used the pretrained model from https://github.com/cypw/DPNs

commands:
python train.py --epoch 0 --model ./models/dpn92-extra --batch-size 4 --num-classes 2 --data-train ./lst_train.lst --image-train ./data/ --data-val ./lst_val.lst --image-val ./data/ --num-examples 2000 --lr 0.001 --gpus 0 --num-epoch 20 --save-result ./output

I have tried --batch-size 16/32, and I got the same result.

What have you tried to solve it?

At first, I got a result: An fatal error occurred in asynchronous engine operation.
According to a guide, I set environment MXNET_CUDNN_AUTOTUNE_DEFAULT=0 and MXNET_ENGINE_TYPE=NaiveEngine, then I got the above result.

@Harold-Zhang
Copy link
Author

@piiswrong @szha

@ZiyueHuang
Copy link
Member

It seems not relevant to the engine. src/io/image_io.cc:165: Check failed: !dst.empty()

@zhreshold
Copy link
Member

Possibly failed in imdecode to the dst
Image not exists or image broken?

@piiswrong
Copy link
Contributor

@zhreshold Could you change imdecode to imread and improve the error message to output which image is broken?

@Harold-Zhang
Copy link
Author

Harold-Zhang commented Sep 25, 2017

@zhreshold I used a smaller test, and then I got the following result>>>

$ sh run_train.sh 
2017-09-24 20:20:16,154 - Namespace(batch_size=32, data_train='./test_train.lst', data_val='./test_val.lst', epoch=0, gpus='0', image_shape='3,224,224', image_train='./test/', image_val='./test/', kv_store='device', lr=0.001, model='./models/dpn92-extra', mom=0.9, num_classes=2, num_epoch=20, num_examples=1000, save_result='./output', wd=0.0001)
2017-09-24 20:20:16,156 - Using 1 threads for decoding...
2017-09-24 20:20:16,156 - Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.
2017-09-24 20:20:16,156 - ImageIter: loading image list ./test_train.lst...
2017-09-24 20:20:16,157 - Using 1 threads for decoding...
2017-09-24 20:20:16,157 - Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.
2017-09-24 20:20:16,157 - ImageIter: loading image list ./test_val.lst...
[20:20:16] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[20:20:16] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
**Traceback (most recent call last):
  File "train.py", line 142, in <module>
    image_shape='3,224,224', epoch=0, num_epoch=args.num_epoch, kv=kv)
  File "train.py", line 106, in train_model
    epoch_end_callback=checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/module/base_module.py", line 482, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1165, in next
    raise StopIteration
StopIteration**

I used the pretrained model(https://goo.gl/1sbov7), and I never modified any code. What should I do to debug? Thanks for your experience ~

@zhreshold
Copy link
Member

@Harold-Zhang Can you provide the train.py ?

@Harold-Zhang
Copy link
Author

Harold-Zhang commented Sep 26, 2017

@zhreshold
Here is the train.py.

import argparse
import find_mxnet
import mxnet as mx
import time
import os, sys
import logging
import importlib
sys.path.insert(0, "./settings")
sys.path.insert(0, "../")

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

formatter = logging.Formatter('%(asctime)s - %(message)s')
console = logging.StreamHandler()
console.setFormatter(formatter)
logger.addHandler(console)

def get_fine_tune_model(sym, arg_params, num_classes, layer_name, batchsize):
    
    all_layers = sym.get_internals()
    net = all_layers[layer_name+'_output']
    net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc')
    net = mx.symbol.SoftmaxOutput(data=net, name='softmax')

    new_args = dict({k:arg_params[k] for k in arg_params if 'fc' not in k})
    return (net, new_args)

def multi_factor_scheduler(begin_epoch, epoch_size, step=[7,14], factor=0.1):
    step_ = [epoch_size * (x-begin_epoch) for x in step if x-begin_epoch > 0]
    return mx.lr_scheduler.MultiFactorScheduler(step=step_, factor=factor) if len(step_) else None

def train_model(model, gpus, batch_size, image_shape, epoch=0, num_epoch=20, kv='device'):
    train = mx.image.ImageIter(
        batch_size          = args.batch_size,
        data_shape          = (3,224,224),        
        label_width         = 1,
        path_imglist        = args.data_train,
        path_root           = args.image_train,
        part_index          = kv.rank,
        num_parts           = kv.num_workers,
        shuffle             = True,        
        data_name           = 'data',
        label_name          = 'softmax_label',
        aug_list            = mx.image.CreateAugmenter((3,224,224),resize=224,rand_crop=True,rand_mirror=True,mean=True,std=True))

    val = mx.image.ImageIter(
        batch_size          = args.batch_size,
        data_shape          = (3,224,224),
        label_width         = 1,
        path_imglist        = args.data_val,
        path_root           = args.image_val,
        part_index          = kv.rank,
        num_parts           = kv.num_workers,       
        data_name           = 'data',
        label_name          = 'softmax_label',
        aug_list            = mx.image.CreateAugmenter((3,224,224),resize=224,mean=True,std=True))

    kv = mx.kvstore.create(args.kv_store)

    prefix = model
    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)

    (new_sym, new_args) = get_fine_tune_model(
        sym, arg_params, args.num_classes, 'flatten', args.batch_size)

    epoch_size = max(int(args.num_examples / args.batch_size / kv.num_workers), 1)
    lr_scheduler=multi_factor_scheduler(args.epoch, epoch_size)

    optimizer_params = {
            'learning_rate': args.lr,
            'momentum' : args.mom,
            'wd' : args.wd,
            'lr_scheduler': lr_scheduler}
    initializer = mx.init.Xavier(
            rnd_type='gaussian', factor_type="in", magnitude=2)

    if gpus == '':
        devs = mx.cpu()
    else:
        devs = [mx.gpu(int(i)) for i in gpus.split(',')]
        
    model = mx.mod.Module(
        context       = devs,
        symbol        = new_sym
    )

    checkpoint = mx.callback.do_checkpoint(args.save_result)

    eval_metric = ['accuracy']

    model.fit(train,
              begin_epoch=epoch,
              num_epoch=num_epoch,
              eval_data=val,
              eval_metric=eval_metric,
              kvstore=kv,
              optimizer='sgd',
              optimizer_params=optimizer_params,
              arg_params=new_args,
              aux_params=aux_params,
              initializer=initializer,
              allow_missing=True,
              batch_end_callback=mx.callback.Speedometer(args.batch_size, 20),
              epoch_end_callback=checkpoint)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='score a model on a dataset')
    parser.add_argument('--model',         type=str, required=True,)
    parser.add_argument('--gpus',          type=str, default='0')
    parser.add_argument('--batch-size',    type=int, default=200)
    parser.add_argument('--epoch',         type=int, default=0)
    parser.add_argument('--image-shape',   type=str, default='3,224,224')
    parser.add_argument('--data-train',    type=str)
    parser.add_argument('--image-train',   type=str)
    parser.add_argument('--data-val',      type=str)
    parser.add_argument('--image-val',     type=str)
    parser.add_argument('--num-classes',   type=int, default=6)
    parser.add_argument('--lr',            type=float, default=0.001)
    parser.add_argument('--num-epoch',     type=int, default=20)
    parser.add_argument('--kv-store',      type=str, default='device', help='the kvstore type')
    parser.add_argument('--save-result',   type=str, help='the save path')
    parser.add_argument('--num-examples',  type=int)
    parser.add_argument('--mom',           type=float, default=0.9, help='momentum for sgd')
    parser.add_argument('--wd',            type=float, default=0.0001, help='weight decay for sgd')
    args = parser.parse_args()

    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    kv = mx.kvstore.create(args.kv_store)

    if not os.path.exists(args.save_result):
        os.mkdir(args.save_result)
    hdlr = logging.FileHandler(args.save_result+ '/train.log')
    hdlr.setFormatter(formatter)
    logger.addHandler(hdlr)
    logging.info(args)

    train_model(model=args.model, gpus=args.gpus, batch_size=args.batch_size,
          image_shape='3,224,224', epoch=0, num_epoch=args.num_epoch, kv=kv)

I downloaded it from here.

@Harold-Zhang
Copy link
Author

Harold-Zhang commented Sep 30, 2017

@zhreshold I am sorry about that I found I just @ a wrong person. I updated my mxnet code after you modified it, and then I got an error. The following code is where the error happened.

#if` (CV_MAJOR_VERSION > 2 || (CV_MAJOR_VERSION == 2 && CV_MINOR_VERSION >=4))
      cv::imdecode(buf, flag, &dst);
    + CHECK(!dst.empty()) << "Decoding failed. Invalid image file.";

Here is the error report:
/home/deeplearning/github/incubator-mxnet/dmlc-core/include/dmlc/./logging.h:308: [11:13:06] src/io/image_io.cc:160: Check failed: !dst.empty() Decoding failed. Invalid image file.
What should I do to correct this error?

@szha
Copy link
Member

szha commented Dec 30, 2017

@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage.

For general "how-to" questions, our user forum (and Chinese version) is a good place to get help.

@Paul0M
Copy link

Paul0M commented Apr 26, 2018

Hi, I've got the same error.
using PIL, this can be avoid as:

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

But using mxnet.image.imread(img_path), python cannot catch the error to avoid this specific image(core dumped pops)

I wonder how to skip this image using mxnet.image.imread
core dumped pops even using PIL to skip a specific image

def somefun(net, img_path):
    img = Image.open(img_path)
    try:
        img.load()
    except Exception as e:
        print(f'[Error]:{e}\t{img_path}')
        return None 
    return(net(mx.image.imread(img_path)))  
terminate called recursively
terminate called after throwing an instance of 'dmlc::Error'
  what():  driver shutting down
Aborted (core dumped)

@Ishitori
Copy link
Contributor

@Paul0M can you provide a minimal reproducible example with how to run it?

@lanking520
Copy link
Member

@Paul0M I know it's kind of late, but I still want to reach out and see if you are still facing this issue?

@ankkhedia
Copy link
Contributor

@sandeep-krishnamurthy @nswamy @anirudh2290 COuld you please close this issue due to lack of activity.

@Paul0M Please feel free to re-open in case closed in error

@szha szha closed this as completed Nov 1, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests