Non-deterministic results on GPU #34

Closed
tumusudheer opened this issue Oct 29, 2017 · 15 comments

Comments

@tumusudheer

Hi @emedvedev ,

I ran the test on the same image multiple times using the command from the README:
aocr test ./datasets/testing.tfrecords

Every time I run the command, I get the same predicted word as output, but the inference probabilities change (and so does the loss).

Run1:
Step 1 (1.096s). Accuracy: 100.00%, loss: 0.000364, perplexity: 1.00036, probability: 93.33% 100%

Run2:
Step 1 (0.988s). Accuracy: 100.00%, loss: 0.000260, perplexity: 1.00026, probability: 92.58% 100%

I've observed the same behavior with a frozen checkpoint as well (the probabilities change for the same image). Is there a reason why this is happening? It shouldn't, so please let me know how to fix it.

@emedvedev
Owner

Hi,

It's perfectly normal, especially if you run on a GPU. Some of the optimizations TensorFlow performs are non-deterministic, so you'll get slightly different results every time. There's usually no need to make it perfectly deterministic, and doing so can come with a speed penalty, but if you're interested, take a look at this article.
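
If you just want to reduce the run-to-run variance, here's a minimal sketch of the usual TF 1.x knobs (seeding plus a single-threaded session config). Note that this does not remove GPU-level non-determinism from ops like cuDNN convolutions, and the reduced parallelism costs speed:

import random

import numpy as np
import tensorflow as tf

# Seed every RNG that feeds the graph.
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)

# Limiting op parallelism removes one common source of run-to-run drift
# (floating-point reduction order), at a noticeable speed cost.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)
sess = tf.Session(config=config)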

@tumusudheer
Author

Hi @emedvedev ,

I trained the net on a GPU, froze the model, and am running the frozen graph on a CPU. The results are not slightly different, but way off.
Run1: LUBE with probability 0.839753
Run2: LUBE with probability 0.690503
and
Run3: LUBE with probability 0.796141

I'll check out the article you referred to. Thank you.

@ckirmse
Contributor

ckirmse commented Oct 29, 2017

Interesting article. @tumusudheer, maybe we can investigate the uses of the non-deterministic functions and swap them out. I've run into the same issue, but with longer phrases I often get actually different predicted text.

@tumusudheer
Author

Hi @ckirmse,

Sure, sounds good to me. I'll keep you posted if I find a fix, and please let me know if you find one.

Thanks.

@emedvedev changed the title from "Probabilities are changing every time you run test on same image" to "Non-deterministic results on GPU" on Oct 30, 2017
@reidjohnson

I believe the source of the non-deterministic behavior is how the CNN is initialized. Namely, this line:

cnn_model = CNN(self.img_data, True)

should actually be:

cnn_model = CNN(self.img_data, not self.forward_only)

Otherwise, dropout (which randomly removes connections from the output of the CNN) is applied even during testing.
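
To illustrate the effect, here is a toy sketch (not the project's actual CNN; the layer and shapes are made up). With the training flag hardcoded to True, dropout keeps zeroing random activations at inference time, so the same image gives slightly different outputs on every run; with it set correctly, the outputs are identical:

import numpy as np
import tensorflow as tf

def cnn_features(images, is_training):
    # Toy CNN head: dropout should only be active while training.
    conv = tf.layers.conv2d(images, filters=8, kernel_size=3, activation=tf.nn.relu)
    # With training=True at test time, each run drops a different random
    # subset of activations, which is what makes the probabilities jitter.
    return tf.layers.dropout(conv, rate=0.5, training=is_training)

images = tf.placeholder(tf.float32, [1, 32, 32, 1])
features = cnn_features(images, is_training=False)  # forward-only / test mode

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.ones((1, 32, 32, 1), dtype=np.float32)
    out1 = sess.run(features, {images: batch})
    out2 = sess.run(features, {images: batch})
    print(np.allclose(out1, out2))  # True with is_training=False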

@emedvedev
Owner

@reidjohnson oh, good catch! Committed the fix.

@tumusudheer @ckirmse could you verify that this behavior is fixed (or at least significantly reduced) in the latest master?

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

Oh wow, yeah, that's quite bad! Good catch. I'll test it out tonight.

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

This would have affected the exported graphs, too, I think.

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

OK, confirmed that this fixed the non-determinism of prediction for me. That's really good!

As for the exported graphs: it should be building a test/prediction graph (no dropout), not using what's in the checkpoint, right, @emedvedev?

@emedvedev
Owner

@ckirmse I think so.

@mattfeury
Contributor

@ckirmse @emedvedev I'm not sure that's the case, and that may actually be our issue here. As far as I can tell, our export code pulls from the checkpoint_state, which is only saved during training, meaning it's likely saving the model as prepared for training. This is probably why aocr predict works very well, but exporting and serving gives wildly different values (#25).

@ckirmse
Contributor

ckirmse commented Oct 30, 2017

@mattfeury yeah, I agree; that's what I was trying to say, but I now realize my statement was vague. I meant: exporting should build a test/prediction graph (no dropout), but right now it uses what's in the checkpoint, which does have dropout, so that needs to be changed.

I'm hopeful that fixing that will fix #25.
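
Roughly what I have in mind for the export path, as a sketch only (build_inference_graph() is a hypothetical helper, not the actual aocr code): rebuild the graph in forward-only/test mode so dropout is disabled, restore the weights from the training checkpoint, and export that graph for serving:

import tensorflow as tf

# Hypothetical helper: builds the forward-only (no-dropout) graph and
# returns the input/output tensors.
from mymodel import build_inference_graph

with tf.Graph().as_default():
    image_input, prediction, probability = build_inference_graph()
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # Variable values come from the training checkpoint; the graph
        # structure is the freshly built test graph, so no dropout is exported.
        saver.restore(sess, tf.train.latest_checkpoint('./logs/train'))

        builder = tf.saved_model.builder.SavedModelBuilder('./exported_model')
        signature = tf.saved_model.signature_def_utils.predict_signature_def(
            inputs={'input': image_input},
            outputs={'prediction': prediction, 'probability': probability})
        builder.add_meta_graph_and_variables(
            sess, [tf.saved_model.tag_constants.SERVING],
            signature_def_map={'serving_default': signature})
        builder.save()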

@mattfeury
Contributor

OK, I'm going to try to get up to speed with that code and see what I can do.

@tumusudheer
Author

Hi All,

The fix (cnn_model = CNN(self.img_data, not self.forward_only)) is working great for me. Thank you for the fix.

Here is the code I've used to dump the binary graph definition (without weights):

import tensorflow as tf
from tensorflow.python.platform import gfile

# Adjust this import to wherever the Model class lives in your setup.
from aocr.model.model import Model

with tf.Graph().as_default() as graph:
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        # Build the model graph with the same parameters as 'aocr test'.
        model = Model(
            phase=parameters.phase,
            visualize=parameters.visualize,
            output_dir=parameters.output_dir,
            batch_size=parameters.batch_size,
            initial_learning_rate=parameters.initial_learning_rate,
            steps_per_checkpoint=parameters.steps_per_checkpoint,
            model_dir=parameters.model_dir,
            target_embedding_size=parameters.target_embedding_size,
            attn_num_hidden=parameters.attn_num_hidden,
            attn_num_layers=parameters.attn_num_layers,
            clip_gradients=parameters.clip_gradients,
            max_gradient_norm=parameters.max_gradient_norm,
            session=sess,
            load_model=parameters.load_model,
            gpu_id=parameters.gpu_id,
            use_gru=parameters.use_gru,
            use_distance=parameters.use_distance,
            max_image_width=parameters.max_width,
            max_image_height=parameters.max_height,
            max_prediction_length=parameters.max_prediction,
        )

        sess.run(tf.global_variables_initializer())

        # Serialize the graph structure only (no weights) as a binary protobuf.
        graph_def = graph.as_graph_def()
        with gfile.GFile('test_binary_graph.pb', 'wb') as f:
            f.write(graph_def.SerializeToString())

Here, parameters holds the same values you would pass to aocr test.

After this, I used the freeze_graph utility from TensorFlow as follows:

python freeze_graph.py --input_graph=test_binary_graph.pb --input_checkpoint=./logs/train/model.ckpt-514062 --input_binary=true --output_node_names=prediction,probability --output_graph=test_frozen_graph.pb

Both steps can also be combined into a single one (see the sketch below). The final test_frozen_graph.pb is working well for me.
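
A sketch of what the combined step could look like (same Model construction as above; the checkpoint path is just my example), using tf.graph_util.convert_variables_to_constants instead of the separate freeze_graph.py call:

import tensorflow as tf

with tf.Graph().as_default() as graph:
    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
        # Build the Model exactly as in the snippet above.
        model = Model(
            phase=parameters.phase,
            # ... remaining keyword arguments identical to the snippet above ...
            session=sess,
        )
        saver = tf.train.Saver()
        # Restore the trained weights in the same session.
        saver.restore(sess, './logs/train/model.ckpt-514062')
        # Bake the restored variable values into constants in a single pass,
        # which replaces the separate freeze_graph.py step.
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, graph.as_graph_def(), ['prediction', 'probability'])
        with tf.gfile.GFile('test_frozen_graph.pb', 'wb') as f:
            f.write(frozen.SerializeToString())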

Hope it helps.

@tumusudheer
Author

Hi @emedvedev ,

Closing this as the fix is working great.

Thanks.
