Engine shutdown #7958
Comments
This does not seem to be related to the engine. |
Possibly failed in imdecode to the dst |
@zhreshold Could you change imdecode to imread and improve the error message to output which image is broken? |
@zhreshold I ran a smaller test and got the following result:
$ sh run_train.sh
2017-09-24 20:20:16,154 - Namespace(batch_size=32, data_train='./test_train.lst', data_val='./test_val.lst', epoch=0, gpus='0', image_shape='3,224,224', image_train='./test/', image_val='./test/', kv_store='device', lr=0.001, model='./models/dpn92-extra', mom=0.9, num_classes=2, num_epoch=20, num_examples=1000, save_result='./output', wd=0.0001)
2017-09-24 20:20:16,156 - Using 1 threads for decoding...
2017-09-24 20:20:16,156 - Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.
2017-09-24 20:20:16,156 - ImageIter: loading image list ./test_train.lst...
2017-09-24 20:20:16,157 - Using 1 threads for decoding...
2017-09-24 20:20:16,157 - Set enviroment variable MXNET_CPU_WORKER_NTHREADS to a larger number to use more threads.
2017-09-24 20:20:16,157 - ImageIter: loading image list ./test_val.lst...
[20:20:16] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[20:20:16] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
Traceback (most recent call last):
  File "train.py", line 142, in <module>
    image_shape='3,224,224', epoch=0, num_epoch=args.num_epoch, kv=kv)
  File "train.py", line 106, in train_model
    epoch_end_callback=checkpoint)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/module/base_module.py", line 482, in fit
    next_data_batch = next(data_iter)
  File "/usr/local/lib/python2.7/dist-packages/mxnet-0.11.1-py2.7.egg/mxnet/image/image.py", line 1165, in next
    raise StopIteration
StopIteration
I used the pretrained model (https://goo.gl/1sbov7) and did not modify any code. What should I do to debug this? Thanks for your help. |
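One thing worth ruling out when `fit` hits `StopIteration` on the very first batch is an image list whose entries do not resolve to readable files under the image root. The sketch below is a standard-library check, assuming the tab-separated `index<TAB>label(s)<TAB>relative_path` layout that `im2rec` writes. `check_image_list` is a hypothetical helper, not part of MXNet, and this is a guess at a possible cause rather than a diagnosis confirmed in this thread.

```python
import os


def check_image_list(lst_path, image_root):
    """Return (total_entries, missing_paths) for an im2rec-style .lst file.

    Assumes each non-empty line is tab-separated and that the last field is
    an image path relative to image_root.
    """
    total = 0
    missing = []
    with open(lst_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total += 1
            rel_path = line.split("\t")[-1]
            full_path = os.path.join(image_root, rel_path)
            # Zero-byte files are reported too, since they cannot be decoded.
            if not os.path.isfile(full_path) or os.path.getsize(full_path) == 0:
                missing.append(full_path)
    return total, missing


# Example with the paths from the log above (adjust to your setup):
# total, missing = check_image_list("./test_train.lst", "./test/")
# print("%d entries, %d missing or empty" % (total, len(missing)))
```

If the list is fine, comparing `total` against `--batch-size` is a cheap second check, since a list that is shorter than one batch is another thing to rule out.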
@Harold-Zhang Can you provide the train.py ? |
@zhreshold
I downloaded it from here. |
@zhreshold Sorry, I just realized that I tagged the wrong person earlier. I updated my MXNet code after your change, and then I got an error. The following code is where the error happened.
Here is the error report: |
@apache/mxnet-committers: This issue has been inactive for the past 90 days. It has no label and needs triage. For general "how-to" questions, our user forum (and Chinese version) is a good place to get help. |
Hi, I've got the same error. With PIL I set:
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
But when using mxnet.image.imread(img_path), Python cannot catch the error in order to skip that specific image (the process core dumps). How can I skip such an image when using mxnet.image.imread? This is what I tried:
import mxnet as mx
from PIL import Image

def somefun(net, img_path):
    try:
        # Open and fully load with PIL first, to catch unreadable files
        img = Image.open(img_path)
        img.load()
    except Exception as e:
        print(f'[Error]:{e}\t{img_path}')
        return None
    return net(mx.image.imread(img_path))
|
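Because a crash inside the native decoder (the "core dumped" case above) kills the interpreter, a `try`/`except` in the same process never gets a chance to run. One workaround, sketched here under that assumption, is to probe each file in a short-lived child process and skip it if the child does not exit cleanly. `decodes_cleanly`, `keep_decodable`, and `DEFAULT_PROBE` are hypothetical names; the default probe calls `mx.image.imread` as in the comment above. Starting one interpreter per image is slow, so this suits a one-off dataset cleanup rather than a training loop.

```python
import subprocess
import sys

# Child-process snippet: decode the file named in argv[1] and force the
# result to be materialised so that any asynchronous failure surfaces here.
DEFAULT_PROBE = (
    "import sys, mxnet as mx; "
    "mx.image.imread(sys.argv[1]).asnumpy()"
)


def decodes_cleanly(img_path, probe=DEFAULT_PROBE, timeout=60):
    """Return True if the probe decodes img_path and exits with status 0.

    A Python exception, a segfault or abort in native code, or a timeout in
    the child all count as failure, without taking down the calling process.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe, img_path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def keep_decodable(paths, probe=DEFAULT_PROBE):
    """Filter paths down to those the probe can decode."""
    return [p for p in paths if decodes_cleanly(p, probe)]
```

The filtered list can then be written back out as a cleaned image list, so the training script itself never sees the files that fail.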
@Paul0M can you provide a minimal reproducible example with how to run it? |
@Paul0M I know it's kind of late, but I still want to reach out and see if you are still facing this issue? |
@sandeep-krishnamurthy @nswamy @anirudh2290 Could you please close this issue due to lack of activity? @Paul0M Please feel free to re-open it if it was closed in error. |
Environment info
Operating System:
Ubuntu 14.04
Compiler:
gcc 4.8.4
Package used (Python/R/Scala/Julia):
Python 2.7
MXNet version:
The latest version
GPU:
Tesla K40m
Error Message:
please provide the commands you have run that lead to the error.
I used the pretrained model from https://github.com/cypw/DPNs
commands:
python train.py --epoch 0 --model ./models/dpn92-extra --batch-size 4 --num-classes 2 --data-train ./lst_train.lst --image-train ./data/ --data-val ./lst_val.lst --image-val ./data/ --num-examples 2000 --lr 0.001 --gpus 0 --num-epoch 20 --save-result ./output
I have tried --batch-size 16/32, and I got the same result.
What have you tried to solve it?
At first, I got this error: An fatal error occurred in asynchronous engine operation.
Following a guide, I set the environment variables MXNET_CUDNN_AUTOTUNE_DEFAULT=0 and MXNET_ENGINE_TYPE=NaiveEngine, and then I got the result above.
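The two settings mentioned above can also be applied from Python rather than the shell. This is a minimal sketch; placing the assignments before `import mxnet` is a cautious assumption about when the library reads them, and the import is left commented out so the snippet runs without MXNet installed.

```python
import os

# Run operations synchronously so an error surfaces at the call that caused it.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
# Disable cuDNN autotuning, as in the report above.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

# import mxnet as mx  # import only after the variables are set
```

The shell equivalent is to prefix the `python train.py ...` command with the same two `NAME=value` assignments.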