Distinct-1 and Distinct-2 for diversity, BERTScore for semantic coherence, and Self-BLEU to evaluate diversity negatively.
Additionally, I'll provide a simple example for a coherence metric such as the Lexical Chain Score.

* Diversity Metrics (Distinct-1 & Distinct-2):

These metrics are widely used to assess the variety in the generated text by calculating the ratio of unique unigrams (Distinct-1) and bigrams (Distinct-2) to the total number of words or bigrams. Higher values suggest greater diversity, indicating that the model can generate varied outputs rather than repeating the same phrases.
* BERTScore:

BERTScore has become popular for its ability to use contextual embeddings (from models like BERT) to measure semantic similarity between the generated text and a reference. It's especially useful for assessing the coherence of the text, as it considers the contextual usage of words rather than just their presence.

* Self-BLEU:

Self-BLEU is often used to evaluate diversity negatively; it measures how similar different texts generated from the same model are to each other. Lower Self-BLEU scores are desirable as they indicate less repetition between different generated samples.

* Coherence Metrics (Entity-based Coherence Score, Lexical Chain Score):

Specialized coherence metrics analyze how logically connected and consistent the entities and their relationships are throughout the generated text. These metrics assess whether the text makes sense contextually and logically, which is crucial for tasks like story generation or lengthy descriptions in visual language models.

In [1]:
!pip install bert-score nltk sacrebleu

Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.1/61.1 kB[0m [31m562.5 kB/s[0m eta [36m0:00:00[0m
Collecting sacrebleu
  Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.7/106.7 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
Collecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.0.0->bert-score)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.0.0->bert-score)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.0.0->bert-score)
 

In [12]:
import nltk
from bert_score import score
import sacrebleu
from nltk.corpus import wordnet as wn
from nltk import pos_tag, word_tokenize
from nltk import ngrams, pos_tag, word_tokenize
from nltk.corpus import wordnet as wn
from bert_score import score

In [6]:
# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
def calculate_diversity_metrics(texts):
    # Flatten the list of texts into a single list of tokens
    all_tokens = [token for text in texts for token in text.split()]
    unigrams = list(ngrams(all_tokens, 1))
    bigrams = list(ngrams(all_tokens, 2))

    # Calculate Distinct-1 and Distinct-2
    distinct_1 = len(set(unigrams)) / len(unigrams) if unigrams else 0
    distinct_2 = len(set(bigrams)) / len(bigrams) if bigrams else 0
    return distinct_1, distinct_2

def calculate_self_bleu(texts):
    # Self-BLEU is calculated by treating each sentence as a hypothesis and the rest as a reference
    scores = []
    for i in range(len(texts)):
        hypothesis = texts[i]
        references = texts[:i] + texts[i+1:]
        bleu = sacrebleu.corpus_bleu([hypothesis], [[ref] for ref in references])
        scores.append(bleu.score)
    return sum(scores) / len(scores) if scores else 0

def calculate_bert_score(hypotheses, references):
    # Compute BERTScore
    P, R, F1 = score(hypotheses, references, lang="en", rescale_with_baseline=True)
    return F1.mean().item()

def lexical_chain_score(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    nouns = [word for word, pos in tagged if pos.startswith('N')]

    # Build chains based on WordNet synsets
    chains = []
    for noun in nouns:
        synsets = wn.synsets(noun, pos=wn.NOUN)
        if not synsets:  # If no synset is found, continue to the next noun
            continue
        added = False
        for chain in chains:
            # Check if any synset of the current noun matches any synset in the existing chains
            if any(syn in chain_synsets for syn in synsets for chain_synsets, _ in chain):
                chain.append((synsets, noun))
                added = True
                break
        if not added:
            chains.append([(synsets, noun)])

    # Calculate the score based on chain length
    score = sum(len(chain) for chain in chains) / len(nouns) if nouns else 0
    return score


In [16]:
# Example usage
texts = [
    "the cat sits on the mat",
    "the cat plays with a ball",
    "a quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick"
    "the quick brown fox jumps over the lazy dog and the quick brown fox was very quick"
]

distinct_1, distinct_2 = calculate_diversity_metrics(texts)
self_bleu = calculate_self_bleu(texts)
bert_score = calculate_bert_score(texts, texts)
lexical_chain = lexical_chain_score(' '.join(texts))  # Pass concatenated text to function

print(f"Distinct-1: {distinct_1:.2f}")
print(f"Distinct-2: {distinct_2:.2f}")
print(f"Self-BLEU: {self_bleu:.2f}")
print(f"BERTScore: {bert_score:.2f}")
print(f"Lexical Chain Score: {lexical_chain:.2f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Distinct-1: 0.49
Distinct-2: 0.69
Self-BLEU: 39.26
BERTScore: 1.00
Lexical Chain Score: 1.00
