Non-deterministic results on GPU #34

Closed
tumusudheer opened this issue Oct 29, 2017 · 15 comments

Comments

@tumusudheer

Hi @emedvedev ,

I ran the test on the same image multiple times using the command from the README:
aocr test ./datasets/testing.tfrecords

Every time I run the command, I get the same predicted word as output, but the inference probabilities change (and so does the loss).

Run1:
Step 1 (1.096s). Accuracy: 100.00%, loss: 0.000364, perplexity: 1.00036, probability: 93.33% 100%

Run2:
Step 1 (0.988s). Accuracy: 100.00%, loss: 0.000260, perplexity: 1.00026, probability: 92.58% 100%

I've observed the same behavior with a frozen checkpoint as well (the probabilities change for the same image). Is there a reason why this is happening? It shouldn't, so please let me know how to fix it.

@emedvedev
Owner

Hi,

It's perfectly normal, especially if you run on a GPU. Some of the optimizations TensorFlow performs are non-deterministic, so you'll get slightly different results every time. There's usually no need to make it perfectly deterministic, and doing so can come with a speed penalty, but if you're interested, take a look at this article.
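
If you just want to reduce the run-to-run variance, here's a minimal sketch of the usual TF 1.x knobs (seeding plus a single-threaded session config). Note that this does not remove GPU-level non-determinism from ops like cuDNN convolutions, and the reduced parallelism costs speed:

import random

import numpy as np
import tensorflow as tf

# Seed every RNG that feeds the graph.
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)

# Limiting op parallelism removes one common source of run-to-run drift
# (floating-point reduction order), at a noticeable speed cost.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)
sess = tf.Session(config=config)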

@tumusudheer
Author

Hi @emedvedev ,

I trained the net on a GPU, froze the model, and am running the frozen graph on a CPU. The results are not slightly different, but way off.
Run1: LUBE with probability 0.839753
Run2: LUBE with probability 0.690503
and
Run3: LUBE with probability 0.796141

I'll check out the article you referred to. Thank you.

@ckirmse
Contributor

ckirmse commented Oct 29, 2017

Interesting article. @tumusudheer, maybe we can investigate the uses of the non-deterministic functions and swap them out. I've run into the same issue, but with longer phrases I often get actually different predicted text.

@tumusudheer
Author

Hi @ckirmse,

Sure, sounds good to me. I'll keep you posted if I find a fix, and please let me know if you find one.

Thanks.

@emedvedev changed the title from "Probabilities are changing every time you run test on same image" to "Non-deterministic results on GPU" on Oct 30, 2017
@reidjohnson

I believe the source of the non-deterministic behavior is how the CNN is initialized. Namely, this line:

cnn_model = CNN(self.img_data, True)

should actually be:

cnn_model = CNN(self.img_data, not self.forward_only)

Otherwise, dropout (which randomly removes connections from the output of the CNN) is applied even during testing.
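
To illustrate the effect, here is a toy sketch (not the project's actual CNN; the layer and shapes are made up). With the training flag hardcoded to True, dropout keeps zeroing random activations at inference time, so the same image gives slightly different outputs on every run; with it set correctly, the outputs are identical:

import numpy as np
import tensorflow as tf

def cnn_features(images, is_training):
    # Toy CNN head: dropout should only be active while training.
    conv = tf.layers.conv2d(images, filters=8, kernel_size=3, activation=tf.nn.relu)
    # With training=True at test time, each run drops a different random
    # subset of activations, which is what makes the probabilities jitter.
    return tf.layers.dropout(conv, rate=0.5, training=is_training)

images = tf.placeholder(tf.float32, [1, 32, 32, 1])
features = cnn_features(images, is_training=False)  # forward-only / test mode

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.ones((1, 32, 32, 1), dtype=np.float32)
    out1 = sess.run(features, {images: batch})
    out2 = sess.run(features, {images: batch})
    print(np.allclose(out1, out2))  # True with is_training=False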

@emedvedev
Owner

@reidjohnson oh, good catch! Committed the fix.

@tumusudheer @ckirmse could you verify that this behavior is fixed (or at least significantly reduced) in the latest master?

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

Oh wow, yeah, that's quite bad! Good catch. I'll test it out tonight.

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

This would have affected the exported graphs, too, I think.

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

OK, confirmed that this fixed the non-determinism of prediction for me. That's really good!

As for the exported graphs: it should be building a test/prediction graph (no dropout), not using what's in the checkpoint, right, @emedvedev?

@emedvedev
Owner

@ckirmse I think so.

@mattfeury
Contributor

@ckirmse @emedvedev I'm not sure that's the case, and that may actually be our issue here. As far as I can tell, our export code pulls from the checkpoint_state, which is only saved during training, meaning it's likely saving the model as prepared for training. This is probably why aocr predict works very well, but exporting and serving gives wildly different values (#25).

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

@mattfeury yeah, I agree; that's what I was trying to say, but I now realize my statement was vague. I meant: exporting should build a test/prediction graph (no dropout), but right now it uses what's in the checkpoint, which does have dropout, so that needs to be changed.

I'm hopeful that fixing that will fix #25.
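
Roughly what I have in mind for the export path, as a sketch only (build_inference_graph() is a hypothetical helper, not the actual aocr code): rebuild the graph in forward-only/test mode so dropout is disabled, restore the weights from the training checkpoint, and export that graph for serving:

import tensorflow as tf

# Hypothetical helper: builds the forward-only (no-dropout) graph and
# returns the input/output tensors.
from mymodel import build_inference_graph

with tf.Graph().as_default():
    image_input, prediction, probability = build_inference_graph()
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Variable values come from the training checkpoint; the graph
        # structure is the freshly built test graph, so no dropout is exported.
        saver.restore(sess, tf.train.latest_checkpoint('./logs/train'))

        builder = tf.saved_model.builder.SavedModelBuilder('./exported_model')
        signature = tf.saved_model.signature_def_utils.predict_signature_def(
            inputs={'input': image_input},
            outputs={'prediction': prediction, 'probability': probability})
        builder.add_meta_graph_and_variables(
            sess, [tf.saved_model.tag_constants.SERVING],
            signature_def_map={'serving_default': signature})
        builder.save()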

@mattfeury
Contributor

OK, I'm going to try to get up to speed with that code and see what I can do.

@tumusudheer
Author

Hi All,

The fix (cnn_model = CNN(self.img_data, not self.forward_only)) is working great for me. Thank you for the fix.

Here is the code I've used to dump the binary graph definition (without weights):

import tensorflow as tf
from tensorflow.python.platform import gfile

# Adjust this import to wherever the Model class lives in your setup.
from aocr.model.model import Model

with tf.Graph().as_default() as graph:
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        # Build the model graph with the same parameters as 'aocr test'.
        model = Model(
            phase=parameters.phase,
            visualize=parameters.visualize,
            output_dir=parameters.output_dir,
            batch_size=parameters.batch_size,
            initial_learning_rate=parameters.initial_learning_rate,
            steps_per_checkpoint=parameters.steps_per_checkpoint,
            model_dir=parameters.model_dir,
            target_embedding_size=parameters.target_embedding_size,
            attn_num_hidden=parameters.attn_num_hidden,
            attn_num_layers=parameters.attn_num_layers,
            clip_gradients=parameters.clip_gradients,
            max_gradient_norm=parameters.max_gradient_norm,
            session=sess,
            load_model=parameters.load_model,
            gpu_id=parameters.gpu_id,
            use_gru=parameters.use_gru,
            use_distance=parameters.use_distance,
            max_image_width=parameters.max_width,
            max_image_height=parameters.max_height,
            max_prediction_length=parameters.max_prediction,
        )

        sess.run(tf.global_variables_initializer())

        # Serialize the graph structure only (no weights) as a binary protobuf.
        graph_def = graph.as_graph_def()
        with gfile.GFile('test_binary_graph.pb', 'wb') as f:
            f.write(graph_def.SerializeToString())

Here, parameters holds the same values you would pass to aocr test.

After this, I used the freeze_graph utility from TensorFlow as follows:

python freeze_graph.py --input_graph=test_binary_graph.pb --input_checkpoint=./logs/train/model.ckpt-514062 --input_binary=true --output_node_names=prediction,probability --output_graph=test_frozen_graph.pb

Both steps can also be combined into a single one (see the sketch below). The final test_frozen_graph.pb is working well for me.
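
A sketch of what the combined step could look like (same Model construction as above; the checkpoint path is just my example), using tf.graph_util.convert_variables_to_constants instead of the separate freeze_graph.py call:

import tensorflow as tf

with tf.Graph().as_default() as graph:
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        # Build the Model exactly as in the snippet above.
        model = Model(
            phase=parameters.phase,
            # ... remaining keyword arguments identical to the snippet above ...
            session=sess,
        )
        saver = tf.train.Saver()
        # Restore the trained weights in the same session.
        saver.restore(sess, './logs/train/model.ckpt-514062')
        # Bake the restored variable values into constants in a single pass,
        # which replaces the separate freeze_graph.py step.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, graph.as_graph_def(), ['prediction', 'probability'])
        with tf.gfile.GFile('test_frozen_graph.pb', 'wb') as f:
            f.write(frozen.SerializeToString())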

Hope it helps.

@tumusudheer
Author

Hi @emedvedev ,

Closing this as the fix is working great.

Thanks.
