# Theoretical Part
## 1) Hypothesis Testing – The problem of multiple comparisons [5 points]
Experimentation in AI often happens like this:
A) Modify/build an algorithm.
B) Compare the algorithm to a baseline by running a hypothesis test.
C) If the result is not significant, go back to step A.
D) If the result is significant, start writing a paper.

With a Type I error of α for each test, how many hypothesis tests, m, does it take to get to:

a) P(mth experiment gives significant result | m experiments lacking power to reject H0)?

b) P(at least one significant result | m experiments lacking power to reject H0)?

## 2) Bias and unfairness in Interleaving experiments [10 points]

Balanced interleaving has been shown to be biased in a number of corner cases. The lecture gave an example in which two ranked lists of length 3 were interleaved and shown to a population of randomly clicking users; algorithm A won ⅔ of the time, even though in theory each algorithm should win 50% of the time. Can you come up with a situation, consisting of two ranked lists of length 3 and a distribution of clicks over them, in which Team-draft interleaving is unfair to the better algorithm?


# Practical Part
## Step 1: Simulate Rankings of Relevance for E and P (5 points)

In [1]:
from itertools import product
 
def generate_ranking_pairs(grades, rank_len):
    """ Generates all possible ranking pairs, excluding pairs of 
        equal rankings

    Args:
        grades (list): Possible grades in ranking.
        rank_len (int): Length of each ranking.

    Returns:
        generator: All possible ranking pairs.
    """
    
    # generate all possible rankings
    rankings = product(grades, repeat=rank_len)
    # generate all possible pairs of rankings
    pairs = product(rankings, repeat=2)    
    # exclude pairs of equal rankings
    pairs = filter(lambda pair: pair[0] != pair[1], pairs)
    
    return pairs

grades = ['HR', 'R', 'N']
rank_len = 5

ranking_pairs = list(generate_ranking_pairs(grades, rank_len))
print('Number of possible ranking pairs: ', len(ranking_pairs))

('Number of possible ranking pairs: ', 58806)
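
The reported count can be checked by hand. Three grades and a ranking length of 5 give 3^5 rankings; every ordered pair of rankings is generated, and the pairs that match a ranking with itself are removed:

```latex
\[
3^5 = 243, \qquad 243^2 - 243 = 243 \cdot 242 = 58806
\]
```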


## Step 2: Implement Evaluation Measures (10 points)

In [2]:
def average_precision(rankings):
    """Calculates Average Precision (AP) = Average of precisions at relevant documents

    Args:
        rankings (list): ranked result of query.

    Returns:
        float: The average precision of rankings (0.0 if nothing is relevant).
    """
    relevant = 0
    numerator = 0.0
    for rank, rel in enumerate(rankings, start=1):
        if rel in ('R', 'HR'):
            relevant += 1
            # precision at this rank; float() avoids integer division in Python 2
            numerator += float(relevant) / rank
    # average over the relevant documents found in the ranking
    return numerator / relevant if relevant else 0.0

The relevance scores used are 0, 1 and 5 for N, R and HR respectively (the same as in the example in the slides).

The nDCG@k measure requires the total number of relevant and highly relevant documents in the entire collection. Since
no real corpus exists for this dummy example, another approach is required:

The function nDCGk is fed a ranking of length 5 (one of the rankings generated in Step 1). This list of five is treated
as the corpus. The first k elements of the list are seen as the result of a query q, and the function DCGk calculates
their DCG@k. To find the perfect ranking (for normalization), the corpus list is sorted in descending order
and the DCG@k of its top k elements is calculated. The function nDCGk then uses this value for normalization.
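
To make the procedure concrete, here is a worked nDCG@3 for a hypothetical ranking (N, HR, R, N, N), chosen only for illustration. It assumes the scores given above (N = 0, R = 1, HR = 5), a gain of 2^rel - 1 and a discount of 1/log2(rank + 1); sorting this five-item corpus in descending order gives an ideal top 3 of (HR, R, N).

```latex
\begin{align*}
\mathrm{DCG@3}  &= \frac{2^0-1}{\log_2 2} + \frac{2^5-1}{\log_2 3} + \frac{2^1-1}{\log_2 4}
                 \approx 0 + 19.56 + 0.50 = 20.06 \\
\mathrm{IDCG@3} &= \frac{2^5-1}{\log_2 2} + \frac{2^1-1}{\log_2 3} + \frac{2^0-1}{\log_2 4}
                 \approx 31 + 0.63 + 0 = 31.63 \\
\mathrm{nDCG@3} &= \frac{\mathrm{DCG@3}}{\mathrm{IDCG@3}} \approx \frac{20.06}{31.63} \approx 0.63
\end{align*}
```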

In [3]:
import math

def DCGk(rankings):
    """Calculates Discounted Cumulative Gain (DCG)

    Args:
        rankings (list): relevance scores of the ranked result of a query.

    Returns:
        float: The discounted cumulative gain.
    """
    discounted_gain = 0.0
    for rank, rel in enumerate(rankings, start=1):
        gain = (2**rel)-1
        discount = 1/math.log(rank+1,2)
        discounted_gain += gain*discount
    return discounted_gain

def nDCGk(rankings, k=3):
    """Calculates Normalized Discounted Cumulative Gain at rank k (nDCG@k)

    Args:
        rankings (list): ranked result of query. (treated as corpus)
        k (int): rank k. Value of 3 is used if no argument is given.

    Returns:
        float: The normalized discounted gain at rank k.
    """
    # translate relevance to corresponding score for calculations
    # (compare strings with ==, not with the identity operator "is")
    rankings = [5 if x == 'HR' else 1 if x == 'R' else 0 for x in rankings]
    # calculate discounted gain for top k
    DCG = DCGk(rankings[:k])
    # sort all relevant documents (descending) in the corpus by their 
    # relative relevance to find best possible DCG result (perfect ranking)
    perfect_DCG = DCGk(sorted(rankings, reverse=True)[:k])
    # a corpus without any relevant document has a perfect DCG of 0
    if perfect_DCG == 0:
        return 0.0
    # normalize discounted_gain by this result
    return DCG/perfect_DCG

#nDCGk(ranking_pairs[0][0])

In [4]:
def ERR(rankings, k=5):
    """Computes the Expected Reciprocal Rank (ERR) metric in linear time. Based on paper by Chapelle et al.

    Args:
        rankings (list): ranked result of query.
        k (int): rank k. Value of 5 is used if no argument is given.

    Returns:
        float: the Expected Reciprocal Rank (ERR).
    """
    # translate relevance to corresponding score for calculations
    # (compare strings with ==, not with the identity operator "is")
    rankings = [5 if x == 'HR' else 1 if x == 'R' else 0 for x in rankings]
    # highest grade on the scale (HR), used to normalize the stop probability
    max_grade = 5
    p = 1.0
    err = 0.0
    for rank, rel in enumerate(rankings[:k], start=1):
        # float() avoids integer division in Python 2
        R = float((2**rel)-1) / (2**max_grade)
        err += p*(R/rank)
        p *= 1-R
    return err

print(round(ERR(ranking_pairs[100][1]), 4))

0.5078

## Step 3: Calculate the delta measure (0 points)

In [5]:
def delta_measure(P, E, measure):
    """Computes the delta of P and E for average precision, nDCG@k or ERR.

    Args:
        P (list): ranked result of production algorithm.
        E (list): ranked result of experimental algorithm.
        measure (string): measure to be calculated (average precision, nDCGk, ERR)

    Returns:
        float: delta of E and P. If <= 0, E does not outperform P.
    """
    if measure == 'average precision':
        P_measure = average_precision(P)
        E_measure = average_precision(E)
    elif measure == 'nDCGk':
        P_measure = nDCGk(P)
        E_measure = nDCGk(E)
    elif measure == 'ERR':
        P_measure = ERR(P)
        E_measure = ERR(E)
    else:
        # fail loudly instead of hitting an undefined variable below
        raise ValueError('Unknown measure: %s' % measure)
    
    return E_measure - P_measure

## Step 4: Implement Interleaving (15 points)

In [8]:
from random import randint

def interleave(ranking_A, ranking_B):
    """
    Interleaves rankings from two different ranking algorithms into 
    one ranking based on Team Draft Interleaving. Both rankings are
    assumed to have equal length.
    
    Args:
        ranking_A (list): Ranking of algorithm A.
        ranking_B (list): Ranking of algorithm B.
        
    Returns:
        (list): Interleaved ranking.
    """
    ranking_I = []
    i = 0
    while len(ranking_I) < len(ranking_A):
        # A wins the coin toss and picks first in this round
        if randint(0,1) == 0:
            ranking_I.append(ranking_A[i])
            ranking_I.append(ranking_B[i])
        # B wins the coin toss and picks first in this round
        else:
            ranking_I.append(ranking_B[i])
            ranking_I.append(ranking_A[i])
        i += 1
    # an odd length leaves one pick too many; trim to the original length
    return ranking_I[:len(ranking_A)]

In [9]:
ranking_A = ['N','N','R','HR','N']
ranking_B = ['R','HR','N','N','R']
ranking_I = interleave(ranking_A, ranking_B)
print(ranking_A, ranking_B)
print(ranking_I)

(['N', 'N', 'R', 'HR', 'N'], ['R', 'HR', 'N', 'N', 'R'])
['N', 'R', 'N', 'HR', 'R']
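
Team Draft Interleaving attributes clicks to the algorithm that contributed the clicked result, so the later steps need to know which team filled each slot. The sketch below is a hypothetical variant of the interleaving above that also returns the team assignments, together with a small helper that counts clicks per team. The names `team_draft_interleave` and `credit` are illustrative and not part of the assignment, and the sketch assumes, as the labels-only rankings do, that the two lists contain distinct documents.

```python
import random

def team_draft_interleave(ranking_A, ranking_B, rng=random):
    """Interleave two equal-length rankings and remember who filled each slot."""
    interleaved, teams = [], []
    for doc_A, doc_B in zip(ranking_A, ranking_B):
        # A coin toss decides which algorithm picks first in this round.
        if rng.random() < 0.5:
            picks = [(doc_A, 'A'), (doc_B, 'B')]
        else:
            picks = [(doc_B, 'B'), (doc_A, 'A')]
        for doc, team in picks:
            # Stop adding once the interleaved list has the original length.
            if len(interleaved) < len(ranking_A):
                interleaved.append(doc)
                teams.append(team)
    return interleaved, teams

def credit(teams, clicked_ranks):
    """Count clicks per team; clicked_ranks are 0-based positions in the interleaved list."""
    wins = {'A': 0, 'B': 0}
    for rank in clicked_ranks:
        wins[teams[rank]] += 1
    return wins

if __name__ == '__main__':
    ranking_I, teams = team_draft_interleave(['N', 'N', 'R', 'HR', 'N'],
                                             ['R', 'HR', 'N', 'N', 'R'])
    print(ranking_I)
    print(teams)
```

Given the team list, the algorithm with more credited clicks in an impression wins that impression; equal counts are a tie.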


## Step 5: Implement User Clicks Simulation (15 points)

## Step 6: Simulate Interleaving Experiment (10 points)

## Step 7: Results and Analysis (30 points)