Sources:   

https://github.com/Tiiiger/bert_score     
https://arxiv.org/pdf/1904.09675.pdf      

Usage Examples:    
https://github.com/Tiiiger/bert_score/blob/master/example/Demo.ipynb     
https://colab.research.google.com/drive/1kpL8Y_AnUUiCxFjhxSrxCsc6-sDMNb_Q       

## BERTScore

In [5]:
import logging
import transformers

# visualization
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import rcParams

rcParams["xtick.major.size"] = 0
rcParams["xtick.minor.size"] = 0
rcParams["ytick.major.size"] = 0
rcParams["ytick.minor.size"] = 0

rcParams["axes.labelsize"] = "large"
rcParams["axes.axisbelow"] = True
rcParams["axes.grid"] = True

# check bert_score installation
import bert_score
from bert_score import BERTScorer
bert_score.__version__

'0.3.8'

### Read dataset

In [7]:
PATH_GENERATED="/home/ruslan_yermakov/nlg-ra/T5_experiments/NLPcircle_model/outputs/mh_test_generations_explicit_path_model.txt"
PATH_ORIGINAL="/home/ruslan_yermakov/nlg-ra/T5_experiments/NLPcircle_model/input_data/test.target"

In [8]:
# read candidates
with open(PATH_GENERATED) as f:
    cands = [line.strip() for line in f]

In [9]:
# read references
with open(PATH_ORIGINAL) as f:
    refs = [line.strip() for line in f]

In [10]:
assert len(cands) == len(refs)

### Apply BERTScore

In practice, most of the time spent in a call to the `score` function goes into building the model. When `score` has to be called repeatedly, it is better to cache the model in a scorer object, which is why `bert_score` also provides an object-oriented API.

The `BERTScorer` class provides two methods, `score` and `plot_example`.

The inputs to `score` are a list of candidate sentences and a list of reference sentences.

Some contextual embedding models, such as RoBERTa, often produce BERTScores in a very narrow range (roughly between 0.92 and 1 in the demo notebook linked above). This artifact does not affect the ranking ability of BERTScore, but it hurts readability. The library therefore offers "baseline rescaling" to adjust the output scores. More details on this feature can be found in this post:

https://github.com/Tiiiger/bert_score/blob/master/journal/rescale_baseline.md   
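
As a rough illustration of the idea, the sketch below applies a linear rescaling that maps a baseline score to 0 and leaves a perfect score at 1. The function name and the baseline value are hypothetical and chosen only for this example; the library ships its own precomputed baselines, which are not reproduced here.

```python
def rescale_with_baseline(score: float, baseline: float) -> float:
    """Linearly rescale a raw score so that `baseline` maps to 0 and 1 stays 1."""
    return (score - baseline) / (1.0 - baseline)


# Hypothetical baseline, for illustration only (not a value shipped with bert_score).
baseline = 0.83

for raw in (0.83, 0.92, 1.0):
    rescaled = rescale_with_baseline(raw, baseline)
    print(f"raw={raw:.2f} -> rescaled={rescaled:.3f}")
```

Scores below the baseline become negative after rescaling, so rescaled values are not confined to the range from 0 to 1.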


In [None]:
scorer = BERTScorer(lang="en", rescale_with_baseline=True)

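We are now ready to call the `score` function. Besides candidates and references, scoring requires a model, which was already chosen when the scorer was built. Because the scorer was created with `lang="en"` and no explicit model type, it uses the library's default English model (in `bert_score` 0.3.x this is `roberta-large`, not `bert-base-uncased`).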

In [None]:
precision, recall, F1_score = scorer.score(cands, refs)


The outputs of the `score` function are tensors of precision, recall, and F1, in that order. Each tensor has the same number of items as the candidate and reference lists. Each item is a scalar: the score for the corresponding candidate and reference pair.
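
To make the three outputs more concrete, here is a minimal sketch of the greedy matching described in the BERTScore paper, using toy 2-dimensional vectors in place of real contextual embeddings. The helper names and the toy vectors are assumptions made for this example; importance (idf) weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy token matching."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy "token embeddings": the candidate covers two of the three reference tokens exactly.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f = greedy_match_scores(cand, ref)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case precision is perfect because every candidate token has an exact match, while recall is lower because the third reference token is only partially matched.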

Open question: very long samples are truncated. Is this actually a problem? Presumably the original (reference) text is truncated in the same way.
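
One way to gauge how much this matters is to count how many samples are long enough to be truncated. The sketch below uses a whitespace word count as a crude proxy and an assumed limit of 512 tokens, which is typical for BERT-style models. Subword tokenizers produce more tokens than words, so this undercounts; the helper name and the toy texts are hypothetical.

```python
def find_long_samples(texts, max_tokens):
    """Return indices of texts whose whitespace-token count exceeds max_tokens."""
    return [i for i, text in enumerate(texts) if len(text.split()) > max_tokens]


# Assumed limit; the real limit depends on the model and counts subword tokens.
MAX_TOKENS = 512

# Toy data standing in for the candidate and reference lists.
toy_cands = ["a short generated sentence", "word " * 600]
toy_refs = ["a short reference sentence", "another short reference"]

long_cands = find_long_samples(toy_cands, MAX_TOKENS)
long_refs = find_long_samples(toy_refs, MAX_TOKENS)
print(f"{len(long_cands)} of {len(toy_cands)} candidates exceed the limit")
print(f"{len(long_refs)} of {len(toy_refs)} references exceed the limit")
```

If only one side of a pair is truncated, the comparison covers unequal portions of the text, which is the case worth checking first.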

In [20]:
print(f"System level F1 score: {F1_score.mean():.3f}")

System level F1 score: 0.337
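
Beyond the system-level mean, it can help to look at the lowest-scoring pairs. The sketch below works on plain Python lists with made-up scores and sentences; in this notebook the scores could be obtained with `F1_score.tolist()`. The helper name `lowest_scoring` is hypothetical.

```python
def lowest_scoring(scores, cands, refs, k=3):
    """Return the k (score, candidate, reference) triples with the lowest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return [(scores[i], cands[i], refs[i]) for i in order[:k]]


# Made-up values for illustration only.
toy_scores = [0.41, 0.12, 0.77, 0.05]
toy_cands = ["cand a", "cand b", "cand c", "cand d"]
toy_refs = ["ref a", "ref b", "ref c", "ref d"]

worst = lowest_scoring(toy_scores, toy_cands, toy_refs, k=2)
for score, cand, ref in worst:
    print(f"{score:.2f} | {cand} | {ref}")
```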


We can now see that the scores are much more spread out, which makes it easy to compare different examples.



In [None]:
# F1_score is the tensor returned by scorer.score above
plt.hist(F1_score, bins=20)
plt.xlabel("score")
plt.ylabel("counts")
plt.show()

In [11]:
# for visualisation
# scorer.plot_example(cands[0], refs[0])
