# Uncertainty Methods Playground

In [2]:
sentence1 = 'I recommend that we engage in an arms race with Red.'
split1 = sentence1.split(' ')
sentence2 = 'I recommend that we don\'t engage in an arms race with Red.'
split2 = sentence2.split(' ')
sentence3 = 'Red is who I recommend that we engage in an arms race with.'
split3 = sentence3.split(' ')

## METEOR

https://arize.com/glossary/meteor-score/

In [51]:
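import nltk
from nltk.translate.meteor_score import single_meteor_score as meteor_score


In [52]:
nltk.download('wordnet')  # METEOR's synonym matching needs the WordNet corpus

# single_meteor_score(reference, hypothesis) expects pre-tokenized input
score1 = meteor_score(split1, split2)
score2 = meteor_score(split1, split3)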

print(f'"{sentence1}" and "{sentence2}" have a score of {score1}')
print(f'"{sentence1}" and "{sentence3}" have a score of {score2}')


"I recommend that we engage in an arms race with Red." and "I recommend that we don't engage in an arms race with Red." have a score of 0.9880128061946244
"I recommend that we engage in an arms race with Red." and "Red is who I recommend that we engage in an arms race with." have a score of 0.8030202821869489


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aryanshrivastava/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### notes on METEOR

- The semantically opposite pair (sentence 1 vs. its negation) gets a very high similarity score (about 0.99)
- The semantically similar but reordered pair gets a lower similarity score (about 0.80)
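
One way to see where this behavior comes from is to count exact unigram matches on the same whitespace-split tokens. The sketch below is only an illustration, not METEOR itself: the helper `unigram_overlap` is a hypothetical name, and METEOR additionally uses stem and synonym matching plus a word-order penalty, all of which are omitted here.

```python
from collections import Counter

def unigram_overlap(reference, hypothesis):
    """Return (precision, recall) of exact unigram matches (hypothetical helper)."""
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Multiset intersection: each token matches at most as often as it appears in both
    matches = sum((ref_counts & hyp_counts).values())
    return matches / len(hypothesis), matches / len(reference)

reference = 'I recommend that we engage in an arms race with Red.'.split(' ')
negated = "I recommend that we don't engage in an arms race with Red.".split(' ')
reordered = 'Red is who I recommend that we engage in an arms race with.'.split(' ')

for name, tokens in [('negated', negated), ('reordered', reordered)]:
    precision, recall = unigram_overlap(reference, tokens)
    print(f'{name}: precision={precision:.3f}, recall={recall:.3f}')
```

The negated sentence differs from the reference by a single token, so its overlap is nearly perfect. The reordered sentence loses matches partly because splitting on spaces leaves punctuation attached ("Red." vs. "Red", "with" vs. "with.").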

## BLEURT

https://github.com/google-research/bleurt

In [53]:
from bleurt import score as bleurt_score

In [54]:
# Path to a local BLEURT checkpoint (machine-specific)
checkpoint = '/Users/aryanshrivastava/bleurt/bleurt/test_checkpoint'
scorer = bleurt_score.BleurtScorer(checkpoint)
# scorer.score returns one score per (reference, candidate) pair
score1 = scorer.score(references=[sentence1], candidates=[sentence2])[0]
score2 = scorer.score(references=[sentence1], candidates=[sentence3])[0]

INFO:tensorflow:Reading checkpoint /Users/aryanshrivastava/bleurt/bleurt/test_checkpoint.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint dbleurt_tiny
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:dbleurt_tiny
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.

In [55]:
print(f'"{sentence1}" and "{sentence2}" have a score of {score1}')
print(f'"{sentence1}" and "{sentence3}" have a score of {score2}')

"I recommend that we engage in an arms race with Red." and "I recommend that we don't engage in an arms race with Red." have a score of 0.5444111227989197
"I recommend that we engage in an arms race with Red." and "Red is who I recommend that we engage in an arms race with." have a score of 0.7867673635482788


### notes on BLEURT

- The semantically opposite pair gets a lower score (about 0.54)
- The semantically similar but reordered pair gets a higher score (about 0.79)

## BERTScore

In [3]:
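from bert_score import score as bert_score_fn

In [26]:
ref = [sentence1]
cand1 = [sentence2]
cand2 = [sentence3]

# score returns a (precision, recall, F1) tuple of tensors; index 2 selects F1
F1_1 = bert_score_fn(cand1, ref, lang='en', verbose=True)[2]
F1_2 = bert_score_fn(cand2, ref, lang='en', verbose=True)[2]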


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.34 seconds, 2.94 sentences/sec


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.29 seconds, 3.49 sentences/sec


In [27]:
print(f'"{sentence1}" and "{sentence2}" have a score of {F1_1[0]}')
print(f'"{sentence1}" and "{sentence3}" have a score of {F1_2[0]}')

"I recommend that we engage in an arms race with Red." and "I recommend that we don't engage in an arms race with Red." have a score of 0.9750765562057495
"I recommend that we engage in an arms race with Red." and "Red is who I recommend that we engage in an arms race with." have a score of 0.9442029595375061
tensor(0.9272)
tensor(0.9389)
tensor(0.9330)
tensor(0.9448)
tensor(0.9448)
tensor(0.9448)


### notes on BERTScore

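- The semantically opposite pair gets a very high score (about 0.975)
- The semantically similar pair with a different structure still scores high, but lower (about 0.944)
- The first run takes fairly long, possibly because the roberta-large model has to be downloaded first (?)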
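
To compare the three metrics so far, the scores printed above can be put side by side. This is a small sketch that only reuses the numbers already recorded in this notebook; the dictionary layout and the names `negated` and `reordered` are labels chosen here for sentence 2 and sentence 3.

```python
# Scores copied from the cell outputs above (reference is always sentence 1)
scores = {
    'METEOR': {'negated': 0.9880128061946244, 'reordered': 0.8030202821869489},
    'BLEURT': {'negated': 0.5444111227989197, 'reordered': 0.7867673635482788},
    'BERTScore': {'negated': 0.9750765562057495, 'reordered': 0.9442029595375061},
}

def ranks_paraphrase_higher(metric_scores):
    """True if the reordered paraphrase scores above the negated sentence."""
    return metric_scores['reordered'] > metric_scores['negated']

for name, metric_scores in scores.items():
    gap = metric_scores['reordered'] - metric_scores['negated']
    verdict = 'yes' if ranks_paraphrase_higher(metric_scores) else 'no'
    print(f'{name:<10} gap={gap:+.3f}  paraphrase ranked higher: {verdict}')

# Metrics whose ordering matches the intended meaning of the sentences
consistent = [name for name, s in scores.items() if ranks_paraphrase_higher(s)]
```

On this single example only BLEURT orders the two candidates the way their meaning suggests. One sentence triple is far too little to generalize from.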

## Question-Answering

https://github.com/potsawee/mqag0

In [59]:
from selfcheckgpt.modeling_mqag import MQAG

In [60]:
mqag_model = MQAG()

MQAG (race) initialized to cpu


In [None]:
score1 = mqag_model.score(candidate=sentence2, reference=sentence1, num_questions=1, verbose=True)
score2 = mqag_model.score(candidate=sentence3, reference=sentence1, num_questions=1, verbose=True)

# Note: MQAG is not well suited to short examples like this one; it is included here only for completeness.
# WARNING: this cell takes extremely long to run on CPU. Do not run it without proper compute.

In [61]:
# what the cell output would be for score1 code running

example_output = '''Initializing global attention on multiple choice...
Input ids are automatically padded to be a multiple of `config.attention_window`: 512
Initialized Answering
Q1: Why can't we engage in an arms race with Red?
(1) [P(.|cand)=23.90%]	[P(.|ref)=25.37%]	Because red has an army that is much bigger than white.
(2) [P(.|cand)=47.58%]	[P(.|ref)=45.06%]	Because the Army of Red is much bigger than White's.
(3) [P(.|cand)=23.90%]	[P(.|ref)=25.37%]	Because red has an army that is much bigger than white.
(4) [P(.|cand)=4.62%]	[P(.|ref)=4.20%]	Because white is already in an arms race with red.'''

print(example_output)

Initializing global attention on multiple choice...
Input ids are automatically padded to be a multiple of `config.attention_window`: 512
Initialized Answering
Q1: Why can't we engage in an arms race with Red?
(1) [P(.|cand)=23.90%]	[P(.|ref)=25.37%]	Because red has an army that is much bigger than white.
(2) [P(.|cand)=47.58%]	[P(.|ref)=45.06%]	Because the Army of Red is much bigger than White's.
(3) [P(.|cand)=23.90%]	[P(.|ref)=25.37%]	Because red has an army that is much bigger than white.
(4) [P(.|cand)=4.62%]	[P(.|ref)=4.20%]	Because white is already in an arms race with red.


In [62]:
example_scores = {'kl_div': 0.0017558218082514964, 'counting': 0.0, 'hellinger': 0.0006205150575312447, 'total_variation': 0.02520403265953064}

### notes on MQAG (Multiple-choice Question Answering and Generation)

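- Very slow to run (the model was initialized on CPU here)
- Not good for short texts; better suited to longer texts
- As far as I can tell, the scores are distances between the answer distributions under the candidate and the reference, so lower scores correspond to higher similarity
    - By that reading, the semantically opposite sentences are scored as nearly identical (all scores are close to 0)
    - This is probably just a result of a poor question being generated from such short sentences
    - Do not use this method for short responses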
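
As a rough sanity check on the "lower means more similar" reading, the answer probabilities printed for Q1 can be compared directly. The sketch below is not the library's implementation: it uses the rounded percentages from the example output, a textbook KL divergence, and assumes the direction KL(ref || cand), so the value will not exactly match `example_scores`.

```python
import math

# Answer probabilities for Q1, copied (rounded) from the example output above
cand = [0.2390, 0.4758, 0.2390, 0.0462]  # P(. | candidate)
ref = [0.2537, 0.4506, 0.2537, 0.0420]   # P(. | reference)

def kl_divergence(p, q):
    """Textbook KL(p || q) for two discrete distributions with nonzero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def argmax(probs):
    """Index of the most probable answer option."""
    return max(range(len(probs)), key=probs.__getitem__)

kl = kl_divergence(ref, cand)  # direction is an assumption
same_top_answer = argmax(ref) == argmax(cand)

print(f'KL(ref || cand) = {kl:.5f}')
print(f'same most likely answer: {same_top_answer}')
```

Both distributions pick the same answer option and differ by only a few percentage points, so any distance between them is close to zero, even though the two sentences mean the opposite.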