Error when training on Synth 90k #31

Closed
thisismohitgupta opened this issue Oct 23, 2017 · 18 comments

@thisismohitgupta

2017-10-22 23:07:17.471187: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Invalid JPEG data, size 1024

The error comes from this line:

image = tf.image.decode_png(img, channels=1)

@emedvedev
Owner

Thanks for the report! Could you please provide the image in the dataset that errors out? Does this error appear from the beginning, or at some point in the middle of the training process?

@thisismohitgupta
Author

It happens in the middle, roughly around 1300-1500 steps with a batch size of 512. I tried hard, but I could not identify the problematic image. Please help.

@emedvedev
Owner

It's pretty much impossible to help unless I know what the image is, unfortunately. You can try inserting some debugging line that would output the list of images in the batch, and then try to narrow it down to a particular one, or just add a catch that would ignore a failed batch and continue training. Might be that your dataset is corrupted.
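For example, a standalone pass over the dataset with Pillow can usually find undecodable files before training ever starts. This is only a sketch: it assumes a Synth90k-style annotation file with "relative/path.jpg label" per line under a dataset root, so DATA_ROOT and ANNOTATION below are placeholders for your setup.

# Standalone sanity check (sketch): list images that cannot be decoded.
import os
from PIL import Image

DATA_ROOT = '/mnt/ramdisk/max/90kDICT32px'
ANNOTATION = os.path.join(DATA_ROOT, 'annotation_train.txt')

broken = []
with open(ANNOTATION) as f:
    for line in f:
        path = os.path.join(DATA_ROOT, line.split()[0])
        try:
            Image.open(path).load()  # force a full decode, not just the header
        except Exception as err:
            broken.append(path)
            print('Cannot decode {}: {}'.format(path, err))

print('{} broken images found'.format(len(broken)))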

@tumusudheer

Hi @emedvedev,

I've faced similar errors while training on my own data. I debugged which images were causing them, and when I used ImageMagick's convert command to turn those images into grayscale, the commands worked fine.

I think the issue is with this line here, which converts the image bytes to grayscale:

image = tf.image.decode_png(img, channels=1)

How about changing it to:

rgb_image = tf.image.decode_png(img,  channels=3)
image = tf.image.rgb_to_grayscale(rgb_image)
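To double-check the change on a specific suspect file, something like the following should work. It is only a sketch, assuming TF 1.x (which this project uses); the file path is a placeholder.

# Sketch (TF 1.x): feed the raw bytes of a suspect image through the proposed ops.
import tensorflow as tf

path = '/path/to/suspect_image.jpg'  # placeholder
with open(path, 'rb') as f:
    img_bytes = f.read()

img = tf.placeholder(tf.string)
rgb_image = tf.image.decode_png(img, channels=3)
image = tf.image.rgb_to_grayscale(rgb_image)

with tf.Session() as sess:
    decoded = sess.run(image, feed_dict={img: img_bytes})
    print('decoded shape:', decoded.shape)  # (height, width, 1) on success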

@emedvedev
Owner

@tumusudheer thanks for investigating the issue! Can you confirm that the proposed change works with a "broken" image?

@thisismohitgupta you can try applying the proposed patch and re-training your model. Please tell me if it helps!

@thisismohitgupta
Author

@emedvedev it didn't work for me.
@tumusudheer how did you debug 9M images? Any pointers?

@thisismohitgupta
Author

Adding the following lines here solves the problem:

try:
    image = Image.open(IO(img)).convert('RGB')
except Exception:
    continue
if self.max_width and (image.size[0] <= self.max_width):
...

@emedvedev
Owner

Well, then we're just silently skipping broken images, which isn't very good. Is there any way to make the image reading/conversion more bulletproof so that we wouldn't skip anything?

@tumusudheer

Hi @emedvedev,

I trained with my proposed change yesterday, and I just verified the results. They are good.

This change worked for me:

rgb_image = tf.image.decode_png(img, channels=3)
image = tf.image.rgb_to_grayscale(rgb_image)

While skipping images, we can also create a log file that lists all the broken ones. After preparing the training data, people can check what is wrong with the images in the log file and try to fix or verify them.
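A rough sketch of that idea, assuming the skip happens inside the Python generator where the raw bytes and the file path are in scope; the names below are illustrative, not the actual variables in data_gen.py:

# Sketch: skip undecodable images but record them in a log file.
import logging
from io import BytesIO
from PIL import Image

skip_log = logging.getLogger('aocr.skipped_images')
skip_log.addHandler(logging.FileHandler('skipped_images.log'))
skip_log.setLevel(logging.INFO)

def load_or_skip(img, path):
    """Return a grayscale PIL image, or None (and log the path) if decoding fails."""
    try:
        return Image.open(BytesIO(img)).convert('L')
    except Exception as err:
        skip_log.info('skipping %s: %s', path, err)
        return None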

@emedvedev
Owner

Sweet! Would you mind opening a PR with the change, then? If you want, you can also implement image skipping there; that would be great.

@tumusudheer

Hi @emedvedev,
Sure, I'll send a PR with my changes.

@lmolhw5252

Hi, I ran into a problem when trying to train.
Caused by op 'IteratorGetNext', defined at:
File "/home/user/anaconda3/bin/aocr", line 11, in
sys.exit(main())
File "/home/user/PycharmProjects/attention-ocr-master/aocr/main.py", line 308, in main
num_epoch=parameters.num_epoch
File "/home/user/PycharmProjects/attention-ocr-master/aocr/model/model.py", line 347, in train
for batch in s_gen.gen(self.batch_size):
File "/home/user/PycharmProjects/attention-ocr-master/aocr/util/data_gen.py", line 54, in gen
images, labels, comments = iterator.get_next()
File "/home/user/anaconda3/lib/python3.5/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 304, in get_next
name=name))
File "/home/user/anaconda3/lib/python3.5/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 379, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/user/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/user/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/user/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

NotFoundError (see above for traceback): ./home/user/Dataset/CAPTCHAs/training.tfrecords
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?], [?], [?]], output_types=[DT_STRING, DT_STRING, DT_STRING], _device="/job:localhost/replica:0/task:0/cpu:0"]]

Have you met this problem before? I don't know how to figure it out.

@emedvedev
Owner

NotFoundError (see above for traceback): ./home/user/Dataset/CAPTCHAs/training.tfrecords

Your dataset path is incorrect. I'd say it's that dot in the beginning. :)
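A quick way to see what that leading dot does (just a sketch; the paths are the ones from your traceback):

# The leading dot makes the path relative to the current working directory.
import os

bad_path = './home/user/Dataset/CAPTCHAs/training.tfrecords'
good_path = '/home/user/Dataset/CAPTCHAs/training.tfrecords'

print(os.path.abspath(bad_path))   # resolves somewhere under the current directory
print(os.path.isfile(good_path))   # should print True if the file is really there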

@MBleeker

MBleeker commented Feb 13, 2018

Hi Guys,

I've encountered the same problem, but for some reason neither solution is working for me. So I set the batch size to 1 and checked which image caused the problem. It is:

mnt/ramdisk/max/90kDICT32px/2194/2/334_EFFLORESCENT_24742.jpg

But there could be more...

Cheers,
Maurits

@emedvedev
Owner

@MBleeker Hi there! Just checking: have you set the max-prediction parameter while training and testing? From an earlier issue:

The max-prediction parameter is set to 8 by default, so it'll error out on labels longer than 8 characters. Just set it to whatever makes sense for you in the CLI when you run the training subcommand.

If it's set correctly, then could you provide the full log of your run?

@MBleeker

MBleeker commented Feb 16, 2018

Hi @emedvedev,

I found the problem already. I had never used setup.py before... I did not know about the .egg files. The updates I made were therefore not used while running the code. It is working now. There are several corrupted images.

About the bias terms we discussed in #70: I added them, and the results do not seem to be significantly better, but not worse either (the only problem is that you cannot use previously trained models anymore, because the variables are not stored in the checkpoint).

Did you try this code with a different set of hyperparameters than the defaults? Any different results?

Cheers,
Maurits

@emedvedev
Owner

About the bias terms we discussed in #70: I added them, and the results do not seem to be significantly better, but not worse either (the only problem is that you cannot use previously trained models anymore, because the variables are not stored in the checkpoint).

If there's no visible benefit, I'd rather maintain backward compatibility, but if you find that bias terms do have a significant benefit with some datasets, please submit a PR; I'd really appreciate it!

Did you try this code with a different set of hyperparameters than the defaults? Any different results?

I tried to tweak it a little, but it mostly just depends on the dataset. I find the defaults sensible, but maybe someone else will have something to add here, too. :)

@emedvedev
Owner

I'll close the issue since the original problem has been fixed, so if anyone else has issues with Synth90k, just open a new one. :)
