<h1>BLEU (Bilingual Evaluation Understudy)</h1>

1. Brevity Penalty (BP) <br>
If the candidate length c is at least the closest reference length r → BP = 1 (no penalty). <br>
If the candidate is shorter → BP = exp(1 − r/c) < 1, which penalizes short captions.

2. N-gram Precision <br>
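Computes 1-gram to 4-gram overlap between the candidate and the references. <br>
Uses clipped counts to avoid over-rewarding repeated words. <br>
BLEU-n is BP times the geometric mean of the 1-gram to n-gram precisions.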
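
As a minimal sketch of the two ingredients above, the following standalone snippet computes the brevity penalty and a clipped n-gram precision on a toy example. The helper names `brevity_penalty` and `clipped_precision` are illustrative only and are not part of the notebook's `compute_bleu`.

```python
import math
from collections import Counter


def brevity_penalty(c, r):
    """BP = 1 if c >= r, else exp(1 - r / c).

    c is the candidate length, r the closest reference length.
    """
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def clipped_precision(candidate, references, n=1):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    cand_counts = Counter(ngrams(candidate))
    if not cand_counts:
        return 0.0

    # Maximum count of each n-gram over all references
    max_ref = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref)).items():
            max_ref[gram] = max(max_ref[gram], cnt)

    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
    return clipped / sum(cand_counts.values())


# A degenerate candidate that repeats one word seven times
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split()]

p1 = clipped_precision(candidate, references)  # "the" is clipped to 2 of 7
bp = brevity_penalty(len(candidate), 6)        # candidate is longer: no penalty
print(p1, bp)
```

Without clipping, the unigram precision here would be 7/7; clipping limits the credit for "the" to the two occurrences found in the reference.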

In [107]:
from collections import Counter
import math

In [104]:
def get_ngrams(tokens, n):
    return zip(*[tokens[i:] for i in range(n)])

def compute_bleu(candidate, references):
    c = len(candidate)
    if c == 0:
        return [0.0] * 4

    ref_lens = [len(ref) for ref in references]
    r = min(ref_lens, key=lambda ref_len: (abs(ref_len - c), ref_len))

    # Compute brevity penalty: BP = 1 if c >= r, else exp(1 - r / c)
    if c >= r:
        bp = 1.0
    else:
        bp = math.exp(1 - r / c)

    p_n = []
    for n in range(1, 5):
        candidate_ngrams = list(get_ngrams(candidate, n))
        if not candidate_ngrams:
            p_n.append(0.0)
            continue

        candidate_counts = Counter(candidate_ngrams)
        max_ref_counts = {}

        for ref in references:
            ref_ngrams = get_ngrams(ref, n)
            ref_counts = Counter(ref_ngrams)
            for ngram in candidate_counts:
                cnt = ref_counts.get(ngram, 0)
                if ngram not in max_ref_counts or cnt > max_ref_counts[ngram]:
                    max_ref_counts[ngram] = cnt

        clipped = sum(min(count, max_ref_counts.get(ngram, 0)) for ngram, count in candidate_counts.items())
        total = len(candidate_ngrams)
        p_n.append(clipped / total if total != 0 else 0.0)

    bleu_scores = []
    for i in range(4):
        n = i + 1
        relevant_p = p_n[:n]
        if any(p == 0 for p in relevant_p):
            bleu = 0.0
        else:
            product = 1.0
            for p in relevant_p:
                product *= p
            gm = product ** (1.0 / n)
            bleu = bp * gm
        bleu_scores.append(bleu)

    return bleu_scores

# BLEU usage
# Example candidate (system output), tokenized
candidate = "the cat is on the mat".split()
# Example references (ground-truth captions from the dataset), tokenized
references = [
    "the cat is sitting on the mat".split(),
    "there is a cat on the mat".split(),
    "a cat is found on the mat while sitting".split(),
    "cat sits on mat".split()
]

# bleu_scores = compute_bleu(candidate, references)
# print(f"BLEU-1: {bleu_scores[0]:.4f}")
# print(f"BLEU-2: {bleu_scores[1]:.4f}")
# print(f"BLEU-3: {bleu_scores[2]:.4f}")
# print(f"BLEU-4: {bleu_scores[3]:.4f}")

<h1>ROUGE-N (Recall-Oriented Understudy for Gisting Evaluation, N-gram Recall)</h1>

1. Scores the candidate against each reference sentence separately.
2. Averages precision, recall, and F1-score across all references; recall is the value of interest for ROUGE-N.
3. Supports different n-gram orders (ROUGE-1, ROUGE-2, ROUGE-3, etc.).
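
To make the recall computation concrete, here is a minimal pure-Python sketch of ROUGE-N recall, averaged over references as described above. It assumes a simple lowercase alphanumeric tokenizer and applies no stemming, so its numbers may differ from those of the `rouge_score` package; the function names are illustrative.

```python
import re
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of the reference's n-grams that also appear in the candidate."""
    def ngram_counts(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((cand & ref).values())
    return overlap / total


def mean_rouge_n_recall(candidate, references, n=1):
    """Average ROUGE-N recall over all references."""
    scores = [rouge_n_recall(candidate, ref, n) for ref in references]
    return sum(scores) / len(scores) if scores else 0.0


r1 = rouge_n_recall("The cat sat on the mat.", "The cat is sitting on the mat.", n=1)
r2 = rouge_n_recall("The cat sat on the mat.", "The cat is sitting on the mat.", n=2)
print(f"ROUGE-1 recall: {r1:.4f}, ROUGE-2 recall: {r2:.4f}")
```

The denominator is the number of n-grams in the reference, which is what makes the measure recall-oriented; dividing by the candidate's n-gram count instead would give precision.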

In [8]:
# !pip install rouge-score

In [108]:
from rouge_score import rouge_scorer

In [13]:
def calculate_rouge_n(candidate, references, n=1):
    scorer = rouge_scorer.RougeScorer([f'rouge{n}'], use_stemmer=True)
    # RougeScorer.score takes (target, prediction), so the reference comes first
    scores = [scorer.score(ref, candidate)[f'rouge{n}'] for ref in references]

    # Compute average scores over all references
    avg_precision = sum(score.precision for score in scores) / len(scores)
    avg_recall = sum(score.recall for score in scores) / len(scores)
    avg_f1 = sum(score.fmeasure for score in scores) / len(scores)

    return {
        "precision": avg_precision,
        "recall": avg_recall,
        "f1-score": avg_f1
    }

# ROUGE-N usage
# Example candidate (system output)
candidate_sentence = "The cat sat on the mat."
# Example references (ground-truth captions from the dataset)
reference_sentences = [
    "The cat is sitting on the mat.",
    "A cat was resting on a mat.",
    "The feline was on the mat."
]

# rouge_1_score = calculate_rouge_n(candidate_sentence, reference_sentences, n = 1)
# rouge_2_score = calculate_rouge_n(candidate_sentence, reference_sentences, n = 2)

# The recall values are the ones of interest for ROUGE-N.
# print("ROUGE-1:", rouge_1_score)
# print("ROUGE-2:", rouge_2_score)

<h1>METEOR (Metric for Evaluation of Translation with Explicit ORdering)</h1>

1. Splits each sentence into tokens, because meteor_score in recent NLTK versions expects pre-tokenized input.
2. Computes the METEOR score against each reference separately and then averages the scores.
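
How the sentences are tokenized affects which words can match. The sketch below compares plain `str.split()` with a hypothetical `simple_tokenize` helper (not part of NLTK) and counts exact unigram matches only; stem and synonym matching, which METEOR also performs, are left out.

```python
import re


def simple_tokenize(sentence):
    """Lowercase the sentence and keep only alphanumeric tokens (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


candidate = "The cat sat on the mat."
reference = "The feline is on the mat."

split_tokens = candidate.split()            # punctuation stays attached: 'mat.'
clean_tokens = simple_tokenize(candidate)   # punctuation removed: 'mat'

ref_vocab = set(simple_tokenize(reference))

# Exact unigram matches against the reference vocabulary
matches_split = [t for t in split_tokens if t.lower() in ref_vocab]
matches_clean = [t for t in clean_tokens if t in ref_vocab]

print(split_tokens, len(matches_split))
print(clean_tokens, len(matches_clean))
```

With whitespace splitting, the token `mat.` fails to match `mat`, so one exact match is lost; whether that matters in practice depends on how the reference side is tokenized.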

In [18]:
import nltk
nltk.download('wordnet')

from nltk.translate.meteor_score import meteor_score
import numpy as np

In [49]:
def calculate_meteor(candidate, references):
    # Tokenize the sentences
    candidate_tokens = candidate.split()
    reference_tokens = [ref.split() for ref in references]

    # Compute METEOR score for each reference and average them
    scores = [meteor_score([ref], candidate_tokens) for ref in reference_tokens]
    avg_meteor = np.mean(scores)
    return avg_meteor

# METEOR usage
# Example candidate (system output)
candidate_sentence = "The cat sat on the mat."
# Example references (ground-truth captions from the dataset)
reference_sentences = [
    "The cat is sitting on the mat.",
    "A cat sleeps and rests on a mat.",
    "The feline is on the mat."
]

# meteor = calculate_meteor(candidate_sentence, reference_sentences)
# print("METEOR Score:", meteor)

<h1>CIDEr (Consensus-based Image Description Evaluation)</h1>

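1. Computes n-grams (orders 1 to 4) for the candidate and the references. <br>
2. Calculates term frequency (TF) for the candidate and each reference, and document frequency (DF) across the references. <br>
3. Applies TF-IDF weighting to reduce the impact of common phrases. <br>
4. Uses cosine similarity between the TF-IDF vectors of the candidate and each reference, averaged over references. <br>
5. Weights each n-gram order n by a Gaussian factor exp(−(n − 1)² / (2σ²)), so higher orders contribute slightly less. This is a simplification: in CIDEr-D the Gaussian penalizes the length difference between candidate and reference instead. <br>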
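
Steps 4 and 5 can be sketched in isolation as follows. The TF-IDF values in the example are made-up numbers chosen for illustration, and `cosine` and `order_weight` are hypothetical helper names; storing vectors as dicts keyed by n-gram avoids relying on any particular ordering of the vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def order_weight(n, sigma=6.0):
    """Gaussian weight for n-gram order n, as used in this notebook's variant."""
    return math.exp(-((n - 1) ** 2) / (2 * sigma ** 2))


# Made-up TF-IDF weights for two unigram vectors sharing one term
cand = {"cat": 0.4, "sat": 0.4}
ref = {"cat": 0.4, "resting": 0.4}

sim = cosine(cand, ref)
weights = [order_weight(n) for n in range(1, 5)]

print(f"cosine similarity: {sim:.3f}")
print("order weights:", [round(w, 4) for w in weights])
```

With sigma = 6 the weights fall off slowly, so all four n-gram orders still contribute substantially to the final score.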

In [79]:
import numpy as np
from collections import Counter
from itertools import chain

In [67]:
def compute_ngrams(sentence, n):
    """
    Generate n-grams from a sentence.
    """
    words = sentence.split()
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

def term_frequency(ngrams):
    """
    Compute term frequency for n-grams.
    """
    return Counter(ngrams)

def compute_cider(candidate, references, n=4, sigma=6.0):
    """
    Calculate a simplified CIDEr score for a candidate sentence against multiple references.
    Args:
        candidate (str): The generated sentence.
        references (list of str): List of reference sentences.
        n (int): Maximum n-gram order (default is 4).
        sigma (float): Standard deviation of the Gaussian weighting (default is 6.0).
    Returns:
        float: CIDEr score.
    """
    # Compute term frequencies (TF) for the candidate (system output)
    candidate_tf = {i: term_frequency(compute_ngrams(candidate, i)) for i in range(1, n+1)}
    # Compute term frequencies for the references (dataset captions)
    reference_tfs = [{i: term_frequency(compute_ngrams(ref, i)) for i in range(1, n+1)} for ref in references]

    # Compute document frequency(DF) across all references
    df = {i: Counter() for i in range(1, n+1)}
    for ref_tf in reference_tfs:
        for i in range(1, n+1):
            for ngram in ref_tf[i]:
                df[i][ngram] += 1

    # Compute CIDEr score
    cider_score = 0.0
    for i in range(1, n+1):
        # Compute TF-IDF for candidate and references using shared vocabulary
        all_ngrams = set(candidate_tf[i].keys()).union(*[ref_tf[i].keys() for ref_tf in reference_tfs])
        candidate_tfidf = {ngram: candidate_tf[i].get(ngram, 0) * np.log(max(1, len(references) / (df[i][ngram] + 1)))
                           for ngram in all_ngrams}
        reference_tfidfs = []

        for ref_tf in reference_tfs:
            ref_tfidf = {ngram: ref_tf[i].get(ngram, 0) * np.log(max(1, len(references) / (df[i][ngram] + 1)))
                         for ngram in all_ngrams}
            reference_tfidfs.append(ref_tfidf)

        # Build vectors aligned on the shared vocabulary (same iteration order over all_ngrams)
        reference_vectors = [list(ref_tfidf.values()) for ref_tfidf in reference_tfidfs]
        candidate_vector = list(candidate_tfidf.values())

        if not candidate_vector or not reference_vectors:
            continue  # Skip if no valid n-grams present

        # Compute cosine similarity between the candidate and each reference
        cand_norm = np.linalg.norm(candidate_vector)
        reference_scores = []
        for ref_vec in reference_vectors:
            ref_norm = np.linalg.norm(ref_vec)
            if ref_norm > 0 and cand_norm > 0:
                similarity = np.dot(candidate_vector, ref_vec) / (cand_norm * ref_norm)
                reference_scores.append(similarity)

        # Average over references and apply the Gaussian weight for n-gram order i
        if reference_scores:
            avg_similarity = np.mean(reference_scores)
            cider_score += avg_similarity * np.exp(-(i - 1) ** 2 / (2 * sigma ** 2))

    return cider_score

In [78]:
# CIDEr usage
candidate_sentence = "The cat sat on the mat."
reference_sentences = [
    "The cat is sitting on the mat.",
    "A cat was resting on a mat.",
    "The feline was to be on the mat."
]

# cider_score = compute_cider(candidate_sentence, reference_sentences)
# print("CIDEr Score:", cider_score)

<h1>SPICE (Semantic Propositional Image Caption Evaluation)</h1>

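1. Approximates SPICE with WordNet's Wu-Palmer similarity (from the NLTK corpus) to measure semantic relatedness between words.
2. This approach works at the word level (synonyms and related terms); it does not build the scene graphs that the original SPICE compares.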
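
The aggregation behind this approximation can be sketched without WordNet: each candidate word takes its best similarity against the reference words, and those best scores are averaged. The similarity table below holds hypothetical values standing in for Wu-Palmer scores, and `best_match_score` and `toy_similarity` are illustrative names.

```python
def best_match_score(candidate_words, reference_words, similarity):
    """Mean over candidate words of the best similarity to any reference word."""
    if not candidate_words or not reference_words:
        return 0.0
    best = [max(similarity(c, r) for r in reference_words) for c in candidate_words]
    return sum(best) / len(best)


# Hypothetical similarity values standing in for Wu-Palmer scores
TOY_SIM = {("cat", "feline"): 0.9, ("sat", "was"): 0.3}


def toy_similarity(a, b):
    """Return 1.0 for identical words, a table value if listed, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


score = best_match_score(["cat", "sat", "mat"], ["feline", "was", "mat"], toy_similarity)
print(f"approximate score: {score:.4f}")
```

Because every candidate word is matched to its single best counterpart, the score is precision-like: reference words that the candidate never covers do not lower it.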

In [96]:
import nltk
from nltk.corpus import wordnet as wn
from itertools import product
import numpy as np

In [101]:
nltk.download('punkt_tab')

In [98]:
def wordnet_similarity(word1, word2):
    """
    Compute similarity between two words using WordNet.
    """
    synsets1 = wn.synsets(word1)
    synsets2 = wn.synsets(word2)
    max_sim = 0
    for syn1, syn2 in product(synsets1, synsets2):
        sim = syn1.wup_similarity(syn2)
        if sim is not None:
            max_sim = max(max_sim, sim)
    return max_sim  # Stays 0 when no pair of synsets is comparable

def compute_spice_wordnet(candidate, references):
    """
    Approximate SPICE score using WordNet-based semantic similarity.
    """
    candidate_words = nltk.word_tokenize(candidate.lower())
    reference_words = [nltk.word_tokenize(ref.lower()) for ref in references]

    scores = []
    for ref_words in reference_words:
        similarities = []
        for cand_word in candidate_words:
            word_similarities = [wordnet_similarity(cand_word, ref_word) for ref_word in ref_words]
            max_word_sim = max(word_similarities) if word_similarities else 0
            similarities.append(max_word_sim)

        scores.append(np.mean(similarities) if similarities else 0)

    return np.mean(scores) if scores else 0

In [100]:
# SPICE usage
candidate_sentence = "The cat sat on the mat."
reference_sentences = [
    "The cat is sitting on the mat.",
    "A cat was resting on a mat.",
    "The feline was on the mat."
]

# spice_score_wordnet = compute_spice_wordnet(candidate_sentence, reference_sentences)
# print("Approx. SPICE Score (WordNet):", spice_score_wordnet)