Step 2: Implement Evaluation Measures (10 points)

Implement 1 binary and 2 multi-graded evaluation measures out of the 7 measures mentioned above. 

(Note 2: Some of the aforementioned measures require the total number of relevant and highly relevant documents in the entire collection; pay extra attention to how to find this.)


In [48]:
P = ['N', 'N', 'N', 'N', 'N']
E = ['R', 'R', 'R', 'R', 'R']

In [49]:
P = ['R', 'N', 'R', 'N', 'R']
E = ['N', 'R', 'N', 'R', 'N']

In [50]:
def average_precision(rankings):
    """Calculates Average Precision (AP) = Average of precisions at relevant documents

    Args:
        rankings (list): ranked result of query.

    Returns:
        float: The average precision of rankings.
    """
    relevant = 0
    numerator = 0
    for rank, rel in enumerate(rankings, start=1):
        if rel == 'R' or rel == 'HR':
            relevant += 1
            # precision at this rank: relevant documents seen so far / rank
            numerator += relevant/rank
    # note: the sum is divided by the length of the ranking,
    # not by the number of relevant documents that were found
    return numerator/len(rankings)

average_precision(E)

0.2
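
The caveat in Note 2 can be made concrete with a small sketch. The helper below is hypothetical and not part of the assignment code. It assumes the five-element list is the whole collection, so the number of relevant documents is counted from the list itself; `total_relevant` is an assumed parameter for supplying a different count. For E, dividing by the number of relevant documents gives 0.5 instead of the 0.2 shown above.

```python
def average_precision_by_relevant(rankings, total_relevant=None):
    """Hypothetical variant: AP normalized by the number of relevant documents.

    total_relevant is an assumed parameter; when omitted, the ranking itself
    is treated as the whole collection and its relevant documents are counted.
    """
    relevant_labels = ('R', 'HR')
    if total_relevant is None:
        total_relevant = sum(1 for rel in rankings if rel in relevant_labels)
    if total_relevant == 0:
        return 0.0  # no relevant documents, so there is nothing to average
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(rankings, start=1):
        if rel in relevant_labels:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / total_relevant


E = ['N', 'R', 'N', 'R', 'N']
ap_by_relevant = average_precision_by_relevant(E)  # divides by 2
ap_by_length = average_precision_by_relevant(E, total_relevant=len(E))  # divides by 5
print(ap_by_relevant, ap_by_length)
```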

In [51]:
import math

"""
The relevance scores used are 0, 1 and 5 for N, R and HR respectively (the same as in the example in the slides).

The nDCG@k measure requires the total number of relevant and highly relevant documents in the entire collection. Since
no real corpus exists for this dummy example, another approach is required:

nDCGk is fed a ranking of length 5 (one of the permutations created in Step 1). This list of five is treated
as the corpus. The first k elements of the list are seen as the result of a query q, and the DCG@k of these
elements is calculated by DCGk. To find the perfect ranking (for normalization), the corpus list is sorted in
descending order and the DCG@k of the top k elements of the sorted list is calculated. The result is then used
for normalization in nDCGk.
"""

def DCGk(rankings):
    """Calculates Discounted Cumulative Gain (DCG)

    Args:
        rankings (list): relevance scores (0, 1 or 5) of the ranked result of query.

    Returns:
        float: The discounted cumulative gain.
    """
    discounted_gain = 0
    for rank, rel in enumerate(rankings, start=1):
        # exponential gain, discounted by the log of the rank
        gain = (2**rel)-1
        discount = 1/math.log2(rank+1)
        discounted_gain += gain*discount
    return discounted_gain

def nDCGk(rankings, k=3):
    """Calculates Normalized Discounted Cumulative Gain at rank k (nDCG@k)

    Args:
        rankings (list): ranked result of query. (treated as corpus)
        k (int): rank k. Value of 3 is used if no argument is given.

    Returns:
        float: The normalized discounted gain at rank k.
    """
    # translate relevance to corresponding score for calculations
    rankings = [5 if x == 'HR' else 1 if x == 'R' else 0 for x in rankings]
    # calculate discounted gain for top k
    DCG = DCGk(rankings[:k])
    # sort all relevant documents (descending) in the corpus by their 
    # relative relevance to find best possible DCG result (perfect ranking)
    perfect_DCG = DCGk(sorted(rankings, reverse=True)[:k])
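    # normalize the DCG by this result; a corpus without any relevant
    # document has a perfect DCG of 0, so return 0.0 to avoid dividing by zero
    if perfect_DCG == 0:
        return 0.0
    return DCG/perfect_DCG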

#nDCGk(['HR', 'HR', 'N','N','R'])
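
To make the normalization described above easier to follow, here is a minimal, self-contained trace of nDCG@3 for the commented-out example list. `SCORES` and `dcg_terms` are hypothetical names introduced only for this illustration. The score mapping and the "list of five as corpus" assumption are taken from the text above.

```python
import math

# Relevance scores from the text above: N -> 0, R -> 1, HR -> 5.
SCORES = {'N': 0, 'R': 1, 'HR': 5}


def dcg_terms(labels):
    """Return the per-rank DCG contributions (2**score - 1) / log2(rank + 1)."""
    return [(2 ** SCORES[label] - 1) / math.log2(rank + 1)
            for rank, label in enumerate(labels, start=1)]


corpus = ['HR', 'HR', 'N', 'N', 'R']
k = 3
# Ideal ordering: the whole list sorted by score, highest first.
ideal = sorted(corpus, key=SCORES.get, reverse=True)

actual_terms = dcg_terms(corpus[:k])  # contributions of HR, HR, N
ideal_terms = dcg_terms(ideal[:k])    # contributions of HR, HR, R
ndcg = sum(actual_terms) / sum(ideal_terms)

print(actual_terms)
print(ideal_terms)
print(round(ndcg, 4))
```

The only difference between the two lists of terms is the third position: the actual ranking contributes 0 there, while the ideal ranking contributes 1/log2(4) = 0.5, so the ratio comes out at about 0.99.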

In [53]:
def ERR(rankings, k=5):
    """Computes the Expected Reciprocal Rank (ERR) metric in linear time. Based on paper by Chapelle et al.

    Args:
        rankings (list): ranked result of query.
        k (int): rank k. Value of 5 is used if no argument is given.

    Returns:
        float: the Expected Reciprocal Rank (ERR).
    """
    # translate relevance to corresponding score for calculations
    rankings = [5 if x == 'HR' else 1 if x == 'R' else 0 for x in rankings]
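    p = 1.0    # probability that the user reaches the current rank
    err = 0.0
    # highest score present in the ranking, used to scale R below 1
    max_score = max(rankings)
    for rank, rel in enumerate(rankings[:k], start=1):
        # probability that the document at this rank satisfies the user
        R = ((2**rel)-1) / (2**max_score)
        err += p*(R/rank)
        p *= 1-R
    return err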

#ERR(['HR', 'HR', 'N','N','R'])
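
The cascade behind ERR is easier to see when the intermediate probabilities are printed per rank. The sketch below is illustrative only. `err_trace`, `SCORES` and `MAX_SCORE` are hypothetical names, and `MAX_SCORE` is fixed at 5 (the HR score) as an assumption; for this particular list that coincides with the maximum of the ranking used in ERR above.

```python
SCORES = {'N': 0, 'R': 1, 'HR': 5}
MAX_SCORE = 5  # assumption: highest score on the scale (HR)


def err_trace(labels):
    """Return one (rank, label, reach, satisfy, contribution) row per rank."""
    rows = []
    reach = 1.0  # probability that the user gets as far as this rank
    for rank, label in enumerate(labels, start=1):
        # probability that the document at this rank satisfies the user
        satisfy = (2 ** SCORES[label] - 1) / 2 ** MAX_SCORE
        contribution = reach * satisfy / rank
        rows.append((rank, label, reach, satisfy, contribution))
        reach *= 1 - satisfy  # the user continues only if not satisfied
    return rows


rows = err_trace(['HR', 'HR', 'N', 'N', 'R'])
for rank, label, reach, satisfy, contribution in rows:
    print(rank, label, reach, satisfy, contribution)
err_total = sum(row[-1] for row in rows)
print(err_total)
```

Because the first document is highly relevant, almost all of the total comes from rank 1 (31/32), and the chance of the user reaching any later rank drops to 1/32 or less.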

In [57]:
def delta_measure(P, E, measure):
    """Computes the delta of P and E for average precision, nDCG@k or ERR.

    Args:
        P (list): ranked result of production algorithm.
        E (list): ranked result of experimental algorithm.
        measure (string): measure to be calculated (average precision, nDCGk, ERR)

    Returns:
        float: delta of E and P. If < 0, E does not outperform P.
    """
    if measure == 'average precision':
        P_measure = average_precision(P)
        E_measure = average_precision(E)
    elif measure == 'nDCGk':
        P_measure = nDCGk(P)
        E_measure = nDCGk(E)
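    elif measure == 'ERR':
        P_measure = ERR(P)
        E_measure = ERR(E)
    else:
        # fail clearly instead of hitting an unbound variable below
        raise ValueError("unknown measure: " + measure)

    return E_measure - P_measure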

delta_measure(P,E,'nDCGk')
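
The if/elif chain can also be written as a lookup table from measure name to function. The sketch below only illustrates that alternative: `precision` is a stand-in binary measure, and `MEASURES` and `delta` are hypothetical names that are not part of the code above.

```python
def precision(rankings):
    """Stand-in binary measure: fraction of the ranking that is R or HR."""
    if not rankings:
        return 0.0
    return sum(1 for rel in rankings if rel in ('R', 'HR')) / len(rankings)


# Hypothetical registry; in the notebook it would map names such as
# 'average precision', 'nDCGk' and 'ERR' to the functions defined earlier.
MEASURES = {'precision': precision}


def delta(P, E, measure):
    """Return measure(E) - measure(P); unknown names raise a ValueError."""
    try:
        func = MEASURES[measure]
    except KeyError:
        raise ValueError("unknown measure: " + measure) from None
    return func(E) - func(P)


P = ['R', 'N', 'R', 'N', 'R']
E = ['N', 'R', 'N', 'R', 'N']
d = delta(P, E, 'precision')
print(d)  # negative: E contains fewer relevant documents than P
```

With a registry like this, adding a measure means adding one dictionary entry rather than another branch.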