# Information Retrieval Week 6 Tutorial

##### version 1.1

###### INFS7410 team

---

##### About today's tutorial

In this week's tutorial, you will be learning about and implementing a pointwise Learning to Rank (LTR) method. LTR refers to the application of machine learning to the ranking problem.

##### Tutorial Etiquette
Please refrain from loud noises, irrelevant conversations and use of mobile phones during tutorial activities. Be respectful of everyone's opinions and ideas during the tutorial activities. You will be asked to leave if you disturb. Remember the tutor is there to help you understand and learn, not to provide debugging of your code or solutions to assignments. 


## Exercise 1: 
The Learning to Rank dataset you are going to use in this execise is the MQ2007 dataset, which is a subset of LETOR: a package of benchmark datasets for research on LEarning TO Rank.[You can find details of this collection here](https://arxiv.org/pdf/1306.2597.pdf).
Description of the MQ2007 dataset: Each row refers to a query-document pair. The first column is the relevance label of this pair, the second column is the query id, the following columns are the features that represent the document (often in the context of the query), and the end of the row is a comment about the pair, which in the case of this dataset includes the id of the document. The larger the relevance label, the more relevant the document is to the query. In the MQ20007 dataset, a query-document pair is represented by a 46-dimensional feature vector. The features used in this dataset include TF, TF-IDF, BM25 and LM scores. 

Logistic regression is a probabilistic classification task which assigns labels to classes according to a hypothesis function. In order to learn a classifier, the hypothesis function is used to update *weights* assigned to the features of the classes to be classified.

There are many different LTR techniques. In this practical we use logistic regression.  Here, the classifier becomes  the ranker used to order retrieval results. Specifically, in this exercise, you will learn how to implement a logistic regression to train a linear ranker.

Here are several example rows from the MQ2007 dataset:

---
2 qid:10032 1:0.056537 2:0.000000 3:0.666667 4:1.000000 5:0.067138 ... 45:0.000000 46:0.076923 #docid=GX029-35-5894638 inc=0.0119881192468859 prob=0.139842

0 qid:10032 1:0.279152 2:0.000000 3:0.000000 4:0.000000 5:0.279152 ... 45:0.250000 46:1.000000 #docid=GX030-77-6315042 inc=1 prob=0.341364

0 qid:10032 1:0.130742 2:0.000000 3:0.333333 4:0.000000 5:0.134276 ... 45:0.750000 46:1.000000 #docid=GX140-98-13566007 inc=1 prob=0.0701303

1 qid:10032 1:0.593640 2:1.000000 3:0.000000 4:0.000000 5:0.600707 ... 45:0.500000 46:0.000000 #docid=GX256-43-0740276 inc=0.0136292023050293 prob= 0.400738

---

Note: the relavence labels of this dataset are graded from 0 (non-relevant) to 2 (highly relevent); however we binarise them for convinience, i.e. transform the graded relevance to binary relevance (0 stays 0, while 1 and 2 become 1).

Now we can treat the ranking task as a binary classification problem (thus becoming a LTR problem): given an input, we want to predict the output. In our case, the input is a feature vector (which for MQ2007 contains 46 dimensions), while the output is the (binary) relevance label (0 or 1).

In [1]:
import numpy as np
import math

# ------------------------------------------------

class Dataset():
    """Representation of the LTR datasets."""
    def __init__(self, file_path, num_features):
        self.num_features = num_features
        self.label_vector = []
        self.feature_matrix = []
        self.qid_all_features = {}
        self.qid_all_docids = {}
        self.qid_relevance_set = {}
        self.qid_docid_relevance = {}   
        
        with open(file_path, "r") as f:
            data = f.readlines()
            
        docids = []
        query_features = []
        relevance_set = []
        docid_rel = {}
        current_qid = None
        for line in data:
            tokens = line.split()
            qid = tokens[1].split(":")[1]
            
            if qid != current_qid and current_qid is not None:
                self.qid_all_docids[current_qid] = docids.copy()
                self.qid_all_features[current_qid] = query_features.copy()
                self.qid_relevance_set[current_qid] = relevance_set.copy()
                self.qid_docid_relevance[current_qid] = docid_rel.copy()
                docids = []
                query_features = []
                relevance_set = []
                docid_rel = {}
                
            current_qid = qid
            
            docid = tokens[self.num_features+4]
            relevance = int(tokens[0])
            if relevance > 0:
                relevance = 1
            self.label_vector.append(relevance)
            
            features = []
            for i in range(self.num_features):
                feature = float(tokens[i+2].split(":")[1])
                features.append(feature)
            
            self.feature_matrix.append(features)
            
            docids.append(docid)
            query_features.append(features)
            relevance_set.append(relevance)
            docid_rel[docid] = relevance
    
    def relevance_set(self, qid):
        return self.qid_relevance_set[qid].copy()

# ------------------------------------------------    
    
def ndcg_at_k(dataset, query_result_list, k):
    """Evaluation measure we will use."""
    ndcg = 0.0
    num_query = 0.0
    
    for qid in query_result_list.keys():
        result_list_size = len(query_result_list[qid])
        if result_list_size < k:
            k = result_list_size
            
        dcg = 0.0
        for i in range(k):
            docid = query_result_list[qid][i]
            relevance = dataset.qid_docid_relevance[qid][docid]
            dcg += math.pow(2, relevance) / (math.log(i+2)/math.log(2))
            
        rel_set = dataset.relevance_set(qid)
        rel_set = sorted(rel_set, reverse=True)
        
        idcg = 0.0
        for i in range(k):
            idcg += math.pow(2, rel_set[i]) / (math.log(i+2)/math.log(2))
        ndcg += dcg/idcg
        num_query += 1
    ndcg /= num_query
    return ndcg

# ------------------------------------------------
num_features = 46 # Dimentionality of the data.
num_epoch = 10 # How many epochs to train for.
train_dataset = Dataset("train.txt", num_features)
test_dataset = Dataset("test.txt", num_features)

### Task 1

First, you need to implement `compute_hypothesis` method in `LinearRanker` class below:

The training set $X$ is the set of all training examples (represented as a matrix):
$$
X=x\in
\begin{bmatrix}
x^0\\
x^1\\
...\\
x^n
\end{bmatrix}
$$
Where $x^i$ is the ith training example in the training set which represented by a feature vector. Your task is the implement the hypothesis function which attempts to predict the relevance label for a given training example. The function should return a number between 0 and 1 where 0 is non-relevance and 1 is relevant.
$$
h_\theta(x) = \frac{1}{1+e^{-\theta^{T}x}}
$$
Where $\theta$ is the rankers' feature weights and $x$ is the training example (a single line in the training data). 

### Task 2

Second, implement `compute_gradient` and `update_weights` methods in LinearRanker class below. We have provided a skeleton for the function. 

In `compute_gradient` function, you need to write code to compute update gradient for each weight in the `LinearRanker`, the equation is following:

$$
\nabla\theta_j=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^i)-y^i)\cdot x_{j}^i
$$
where $m$ is the number of rows in the training set, $x^{i}$ the $i^{\text{th}}$ row in the training set, $y^{i}$ is the relevance label of the $i^{\text{th}}$ row, and $x_{j}^{i}$ is a single feature in the $i^{\text{th}}$ row.

After you implemente the `compute_gradient` method, you can update ranker's weights in `update_weights` method by:

$$
\theta_j=\theta_j-\alpha\cdot\nabla\theta_j
$$

Where $\alpha$ is the learning rate which has been fixed to a default value 0.1 (feel free to change it!). 

If implemented correctly, you will see the nDCG@10 increasing gradually each epoch.


In [86]:
class LinearRanker():
    
    def __init__(self, num_features):
        self.num_features = num_features
        self.weight_vector = np.random.normal(loc=0.0, scale=1.0, size=num_features)
        
    def compute_hypothesis(self, X):
        """Compute sigmoid hypothesis function."""
        # Your implementation here!
        
        score = 1/(1 + np.exp(-(np.mat(self.weight_vector) * np.mat(X).T)))
        return score[0][0]
    
    def compute_gradient(self, feature_matrix, label_vector, weight_index):
        """Compute gradient for given feature weight index."""
        # Your implementation here!
        
        gradient = 0
        for i in range(0, len(feature_matrix)):
            gradient = gradient + (((self.compute_hypothesis(feature_matrix[i]) - label_vector[i])) * feature_matrix[i][weight_index])
        gradient = gradient/self.num_features
        return gradient

    def update_weights(self, learning_rate, feature_matrix, label_vector):
        """Update ranker's feature weights."""
        # Your implementation here!
        
        for i in range(0, self.num_features):
            temp= self.weight_vector[i]
            self.weight_vector[i] = temp - learning_rate * self.compute_gradient(feature_matrix, label_vector, i)
    
    def compute_scores(self, feature_list):
        scores = []
        for features in feature_list:
            score = 0
            for i in range(self.num_features):
                score += features[i] * self.weight_vector[i]
            scores.append(score)
        return scores

    def get_query_result_list(self, dataset):
        query_result_set = {}
        for qid in dataset.qid_all_docids.keys():
            doc_list = dataset.qid_all_docids[qid].copy()
            feature_list = dataset.qid_all_features[qid].copy()
            scores = self.compute_scores(feature_list)
            
            mapping = []
            for i in range(len(doc_list)):
                mapping.append((doc_list[i], scores[i]))
            mapping = sorted(mapping, reverse=True, key=lambda x: x[1])
            
            result_list = [x[0] for x in mapping]
            query_result_set[qid] = result_list
        return query_result_set

In [89]:
# Run this cell once you have implemented the methods above!
ranker = LinearRanker(num_features)
for i in range(num_epoch):
    ranker.update_weights(0.5, train_dataset.feature_matrix, train_dataset.label_vector)
    ndcg_at_10 = ndcg_at_k(test_dataset, ranker.get_query_result_list(test_dataset), 10)
    print(f"ndcg@10: {ndcg_at_10} after {i+1}/{num_epoch} epochs")

ndcg@10: 0.7500254818562296 after 1/10 epochs
ndcg@10: 0.7179380251787845 after 2/10 epochs
ndcg@10: 0.7391110394118868 after 3/10 epochs
ndcg@10: 0.7560727982341146 after 4/10 epochs
ndcg@10: 0.7557547599756669 after 5/10 epochs
ndcg@10: 0.7419141032261893 after 6/10 epochs
ndcg@10: 0.7377112214276289 after 7/10 epochs
ndcg@10: 0.7392545556464268 after 8/10 epochs
ndcg@10: 0.7403830935829396 after 9/10 epochs
ndcg@10: 0.7352054942361922 after 10/10 epochs
