
Any way to get "Confidence" metric? #19

Closed · mattfeury opened this issue Oct 9, 2017 · 11 comments

@mattfeury
Contributor

Hello,

I'm interested in getting a "confidence" value for a given prediction. Does anyone have ideas on the best way to tackle this? I assume there is some output in the graph (potentially per character) that I could tap into to calculate it. I hope to play with this later this week, but wanted to see if anyone had ideas first.

@mattfeury
Contributor Author

mattfeury commented Oct 16, 2017

So I dug in a little bit here and wanted to report back. I'm still seeing some funkiness, but here's where I'm at:

Looking at this code:

for l in xrange(len(self.attention_decoder_model.output)):
    guess = tf.argmax(self.attention_decoder_model.output[l], axis=1)
    num_feed.append(guess)

This seems to be where we get the predictions back for each AttentionDecoder step (the list looks to be equal in size to MAX_PREDICTION, which makes sense, and each entry in that list looks to be of size TARGET_VOCAB_SIZE, which also makes sense). To me these seem like the prediction values for each character at each decoder step, so I extended this code to add a node holding the entire list so I could fetch it at predict time:

num_feed = []
allProbabilities = []

for l in xrange(len(self.attention_decoder_model.output)):
    # Raw (pre-softmax) decoder output for step l: shape [batch, TARGET_VOCAB_SIZE].
    outputs = self.attention_decoder_model.output[l]
    guess = tf.argmax(outputs, axis=1)
    num_feed.append(guess)
    allProbabilities.append(outputs)

# Stack every decoder step into one named tensor so it can be fetched at predict time.
all_probs_output = tf.convert_to_tensor(allProbabilities, name="allProbabilities")

Then, at predict time, I'm able to get the output of this tensor, softmax each list to turn it into probabilities, take the max probability for each AttentionDecoder step, and either take the mean or the product of all the values, e.g.:

allProbs = graph.get_tensor_by_name('prefix/allProbabilities:0')
# i.e. np.mean(softmax(allProbs).max(axis=2)), where softmax is applied
# over the vocab axis (axis=2) of the fetched values.

with tf.Session(graph=graph) as sess:
    (y_out, probs_output) = sess.run([y, allProbs], feed_dict={
        x: [img]
    })

    return {
        "predictions": [{
            "ocr": str(y_out),
            "confidence": str(np.mean(softmax(probs_output).max(axis=2)))
        }]
    }

However, it seems like everything I'm getting is around 50-60%, which is wildly unhelpful. Any ideas on where I'm going wrong here? I tried np.prod, which seemed more accurate to me, but then I was getting mostly values below 1%.
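
For intuition on why the product collapses, here's a minimal NumPy sketch with made-up per-character probabilities (not numbers from an actual run): the raw product shrinks with sequence length, while the arithmetic mean or a length-normalized geometric mean stays on a per-character scale.

import numpy as np

# Hypothetical per-character max probabilities for a 12-step decoder,
# where the first ~6 steps are confident and the trailing steps are not.
char_probs = np.array([0.95, 0.92, 0.97, 0.90, 0.93, 0.96,
                       0.55, 0.50, 0.52, 0.48, 0.51, 0.49])

print(np.mean(char_probs))                             # ~0.72  arithmetic mean
print(np.prod(char_probs))                             # ~0.012 product, shrinks with length
print(np.prod(char_probs) ** (1.0 / len(char_probs)))  # ~0.69  geometric mean (length-normalized)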

@reidjohnson

reidjohnson commented Oct 17, 2017

The logic seems right (when using the product). I have a different implementation that I believe/hope is working that might serve as a useful (though perhaps inefficient) reference.

I basically populate a probability tensor in parallel with the prediction tensor:

num_feed = []
prb_feed = []

for l in range(len(self.attention_decoder_model.output)):
    guess = tf.argmax(self.attention_decoder_model.output[l], axis=1)
    proba = tf.reduce_max(
        tf.nn.softmax(self.attention_decoder_model.output[l]), axis=1)
    num_feed.append(guess)
    prb_feed.append(proba)

# Join the predictions into a single output string.
trans_output = tf.transpose(num_feed)
trans_output = tf.map_fn(
    lambda m: tf.foldr(
        lambda a, x: tf.cond(
            tf.equal(x, DataGen.EOS_ID),
            lambda: '',
            lambda: table.lookup(x) + a
        ),
        m,
        initializer=''
    ),
    trans_output,
    dtype=tf.string
)

# Calculate the total probability of the output string.
trans_outprb = tf.transpose(prb_feed)
trans_outprb = tf.gather(trans_outprb, tf.range(tf.size(trans_output)))
trans_outprb = tf.map_fn(
    lambda m: tf.foldr(
        lambda a, x: tf.multiply(tf.cast(x, tf.float64), a),
        m,
        initializer=tf.cast(1, tf.float64)
    ),
    trans_outprb,
    dtype=tf.float64
)

self.prediction = tf.cond(
    tf.equal(tf.shape(trans_output)[0], 1),
    lambda: trans_output[0],
    lambda: trans_output,
)
self.probability = tf.cond(
    tf.equal(tf.shape(trans_outprb)[0], 1),
    lambda: trans_outprb[0],
    lambda: trans_outprb,
)

self.prediction = tf.identity(self.prediction, name='prediction')
self.probability = tf.identity(self.probability, name='probability')

I then add it to the output feed at each step:

if not forward_only:
    output_feed += [
        self.summaries_by_bucket[0],
        self.updates[0],
        self.prediction,
    ]
else:
    output_feed += [
        self.prediction,
        self.probability,
    ]
    if self.visualize:
        output_feed += self.attention_decoder_model.attention_weights_history

outputs = self.sess.run(output_feed, input_feed)

res = {
    'loss': outputs[0],
}

if not forward_only:
    res['summaries'] = outputs[1]
    res['prediction'] = outputs[3]
else:
    res['prediction'] = outputs[1]
    res['probability'] = outputs[2]
    if self.visualize:
        res['attentions'] = outputs[3:]
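
For completeness, here's a minimal sketch of fetching those two named tensors from an exported graph at predict time. It assumes a frozen graph loaded under the same 'prefix/' naming, and the x / img inputs, from the snippet earlier in the thread; it is not part of the code above.

import tensorflow as tf

# Assumes `graph` was loaded from a frozen .pb with name='prefix',
# and that `x` / `img` are the input placeholder and image as before.
pred = graph.get_tensor_by_name('prefix/prediction:0')
prob = graph.get_tensor_by_name('prefix/probability:0')

with tf.Session(graph=graph) as sess:
    text, confidence = sess.run([pred, prob], feed_dict={x: [img]})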

Apologies for the long code snippets.

@mattfeury
Contributor Author

Thanks for the feedback! I just implemented yours and it's working fine, but some of the numbers still seem funny to me. For instance, on values that match 100%, I see probabilities ranging from 18% to 99% (e.g. 45%, 27%, 61%, 34%, 99%), and on values that are fairly wrong, I see anything from 17% to 99.8% (e.g. 54%, 99%, 95%, 32%, 70%).

It's possible it's just my dataset, but I'm surprised to see such disparity. Have you been able to run this with your dataset and feel confident in the results?

Theoretically everything makes sense, so I don't think it's an issue with the implementation. I'm just surprised to see such a wide range of results. I guess I don't have confidence in the confidence score.

@mattfeury
Contributor Author

Just piggybacking off my comment: I have my max prediction length set to 12, which gives me 12 probability lists. However, the overwhelming majority of my training set is ~6 characters, which is potentially why I'm seeing such crazy probability values. Perhaps it's those last 7-12 tensors that are just "not confident" dragging my prediction down? If I drop max_prediction_length to 8, I get more bearable values.
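
One quick way to test that hypothesis outside the graph is to stop accumulating the product at the first EOS. This is a rough sketch with assumed inputs, not code from the repo: it takes the per-step max probabilities and predicted character IDs as NumPy arrays, plus the dataset's EOS id.

import numpy as np

def sequence_confidence(step_probs, step_ids, eos_id):
    """Product of per-step max probabilities, truncated at the first EOS.

    step_probs: per-step max probability, shape [max_prediction_length]
    step_ids:   predicted character ids,  shape [max_prediction_length]
    """
    confidence = 1.0
    for prob, char_id in zip(step_probs, step_ids):
        if char_id == eos_id:
            break  # ignore the "not confident" steps after the string ends
        confidence *= prob
    return confidence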

@reidjohnson

Interesting. I can confirm that the probabilities make sense on my dataset and task (which has a max prediction length of 20). It might also be that the model simply hasn't run for long enough, so the probability range is still quite wide for any given label, but would narrow with additional training.

@ckirmse
Contributor

ckirmse commented Oct 18, 2017

It would be great to get the confidence per character exposed in the outputs via a PR, if you guys don't mind sharing the great work! It's cool that it seems we have a few of us actively using and improving this repo... let's help each other out :).

@emedvedev
Owner

It's cool that it seems we have a few of us actively using and improving this repo... let's help each other out :).

Thank you guys for that, your help on this project is really appreciated. ❤️ I don't have too much time for it, since it's just a side project for me, but I'll help out where I can — and of course happily review contributions and publish the new versions to pip.

@mattfeury
Contributor Author

I can submit a PR for overall confidence today. I'd love to get confidence per character at some point, because honestly I'd love to return a list of possible solutions instead of just the most probable one, e.g. here are 20 potential OCRs with their confidence scores. But that would be a later thing unless someone wants to jump on it.
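
If anyone wants to experiment with that, here's a rough sketch of what exposing per-character alternatives could look like. It's an assumption about how it might be wired into the decoder loop, not something that exists in the repo, and the tensor names are made up.

top_k = 3  # hypothetical number of alternatives per character

topk_probs = []
topk_ids = []

for l in range(len(self.attention_decoder_model.output)):
    # Per-step distribution over the vocabulary: shape [batch, TARGET_VOCAB_SIZE].
    step_probs = tf.nn.softmax(self.attention_decoder_model.output[l])
    # The k most likely characters and their probabilities for this step.
    values, indices = tf.nn.top_k(step_probs, k=top_k)
    topk_probs.append(values)
    topk_ids.append(indices)

# Name the stacked tensors so they can be fetched at predict time,
# like 'allProbabilities' / 'probability' above.
topk_probs = tf.identity(tf.stack(topk_probs), name='topKProbabilities')
topk_ids = tf.identity(tf.stack(topk_ids), name='topKIndices')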

@mattfeury
Contributor Author

PR submitted!

@ckirmse
Contributor

ckirmse commented Oct 18, 2017

Totally agreed that the list of possible solutions would be ideal; I'm running into a lot of '0' vs 'O' and '1' vs 'l' choices that don't always come out correctly. I'm looking forward to trying out the overall confidence too, though.

@emedvedev
Owner

Merged the overall probability, so I'll close this issue, and the multiple guesses are going to be tracked in #28.
