<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Learning-to-Rank" data-toc-modified-id="Learning-to-Rank-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Learning to Rank</a></span><ul class="toc-item"><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data Preprocessing</a></span></li><li><span><a href="#Model-Training" data-toc-modified-id="Model-Training-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Model Training</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></li><li><span><a href="#Reference" data-toc-modified-id="Reference-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

In [1]:
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import os
import numpy as np
import pandas as pd
import xgboost as xgb

# prevent scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,xgboost

Ethen 2020-03-28 22:06:34 

CPython 3.6.4
IPython 7.9.0

numpy 1.16.5
pandas 0.25.0
sklearn 0.21.2
xgboost 0.81


# Learning to Rank

In [2]:
# !wget https://s3-us-west-2.amazonaws.com/xgboost-examples/MQ2008.rar

In [3]:
# !unrar x MQ2008.rar

In [4]:
# mv -f MQ2008/Fold1/*.txt .



In [5]:
!ls

MQ2008              MQ2008.rar.1        model.xgb           train.txt
MQ2008.rar          learn_to_rank.ipynb test.txt            vali.txt


## Data Preprocessing

We'll print out the first few lines of the raw data to understand which preprocessing steps are required before we can use it.

In [41]:
input_path = 'train.txt'

with open(input_path) as f:
    for _ in range(2):
        line = f.readline()
        print(line)

0 qid:10002 1:0.007477 2:0.000000 3:1.000000 4:0.000000 5:0.007470 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.471076 12:0.000000 13:1.000000 14:0.000000 15:0.477541 16:0.005120 17:0.000000 18:0.571429 19:0.000000 20:0.004806 21:0.768561 22:0.727734 23:0.716277 24:0.582061 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.780495 30:0.962382 31:0.999274 32:0.961524 33:0.000000 34:0.000000 35:0.000000 36:0.000000 37:0.797056 38:0.697327 39:0.721953 40:0.582568 41:0.000000 42:0.000000 43:0.000000 44:0.000000 45:0.000000 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622

0 qid:10002 1:0.603738 2:0.000000 3:1.000000 4:0.000000 5:0.603175 6:0.000000 7:0.000000 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.122130 14:0.000000 15:0.000000 16:0.998377 17:0.375000 18:1.000000 19:0.000000 20:0.998128 21:0.000000 22:0.000000 23:0.154578 24:0.555676 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.071711 30:0.000000 31:0.000000 32:0.000000 33:0.00

> From the [documentation](https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/#!letor-4-0): each row is a query-document pair. The first column is relevance label of this pair, the second column is query id, the following columns are features, and the end of the row is comment about the pair, including id of the document. The larger the relevance label, the more relevant the query-document pair.

The next code chunk parses out the query ids, input features, relevance labels and group information. The group information stores the number of instances in each group. For example, if our group list looks like `[5, 8, 7]`, the first 5 items belong to the first group, the next 8 items belong to the second group, and so on. Rows that belong to the same query are therefore expected to be contiguous. XGBoost needs this grouping information to determine which items are to be compared against one another.

In [6]:
def parse_raw_data(input_path):
    data = []
    labels = []
    groups = []
    query_ids = []

    n_group = 0
    current_group = ''
    with open(input_path) as f:
        for line in f:
            # filter out comment about the pair
            if '#' in line:
                line = line[:line.index('#')]

            splits = line.strip().split(' ')

            feature = np.array([float(feature_str.split(':')[1]) for feature_str in splits[2:]])
            data.append(feature)

            label = int(splits[0])
            labels.append(label)

            query_id = splits[1]
            query_ids.append(query_id)
            
            # keep accumulating the number of items in a group until a new
            # query id is encountered
            if current_group == '':
                current_group = query_id
                n_group += 1
            elif current_group == query_id:
                n_group += 1
            else:
                groups.append(n_group)
                current_group = query_id
                n_group = 1

        # make sure to append the last group
        groups.append(n_group)

    return np.array(data), np.array(labels), np.array(query_ids), np.array(groups)
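
As a toy illustration of the group encoding, the sketch below derives group sizes from a made-up list of query ids (the ids and counts are hypothetical, chosen to reproduce the `[5, 8, 7]` example) and turns them into row offsets. It uses `itertools.groupby` rather than the counter-based loop above, and it assumes rows of the same query are adjacent.

```python
from itertools import groupby

# hypothetical query ids, one per row, with rows of the same query adjacent
query_ids = ['qid:1'] * 5 + ['qid:2'] * 8 + ['qid:3'] * 7

# count the length of each run of identical consecutive query ids
group_sizes = [len(list(rows)) for _, rows in groupby(query_ids)]
print(group_sizes)  # [5, 8, 7]

# the sizes must add up to the number of rows
assert sum(group_sizes) == len(query_ids)

# cumulative offsets give the [start, end) row range of each group
offsets = [0]
for size in group_sizes:
    offsets.append(offsets[-1] + size)
print(offsets)  # [0, 5, 13, 20]
```

The offsets make it explicit that group `i` covers rows `offsets[i]` up to, but not including, `offsets[i + 1]`.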

We can print out the data for a quick inspection. Since the group array stores the number of records in each group, summing its entries should give us the total number of records. This is a quick sanity check that we parsed the data correctly.

In [7]:
input_path = 'train.txt'
X_train, y_train, query_train, group_train = parse_raw_data(input_path)

print('sample group:')
print(group_train[:20])
print('total items: ', np.sum(group_train))

print('data dimension:')
print(X_train.shape)
X_train

sample group:
[  8   8   8   8   8  16   8 118  16   8   8   8   7   8  16   8  16   8
  32   8]
total items:  9630
data dimension:
(9630, 46)


array([[0.007477, 0.      , 1.      , ..., 0.      , 0.      , 0.007042],
       [0.603738, 0.      , 1.      , ..., 0.003708, 0.333333, 1.      ],
       [0.214953, 0.      , 0.      , ..., 1.      , 1.      , 0.021127],
       ...,
       [1.      , 0.      , 0.      , ..., 0.060915, 0.454545, 0.      ],
       [0.259641, 0.6     , 0.      , ..., 0.051975, 0.090909, 0.      ],
       [0.791031, 0.      , 0.      , ..., 0.001754, 0.181818, 0.      ]])

In [8]:
# distribution of the relevance label
np.bincount(y_train)

array([7820, 1223,  587])

In [9]:
input_path = 'vali.txt'
X_val, y_val, query_val, group_val = parse_raw_data(input_path)

print('sample group:')
print(group_val[:20])

print('data dimension:')
print(X_val.shape)
X_val

sample group:
[15  8  7 31  7  7  8 15  8  8 16  8 30 16 16 16 31  8  8 31]
data dimension:
(2707, 46)


array([[1.      , 0.      , 0.      , ..., 0.197431, 0.5     , 0.      ],
       [0.003315, 0.      , 1.      , ..., 0.212859, 0.25    , 0.214286],
       [0.093923, 0.      , 0.      , ..., 0.309468, 1.      , 0.357143],
       ...,
       [0.018219, 0.5     , 0.      , ..., 1.      , 1.      , 0.25    ],
       [0.006073, 0.      , 0.      , ..., 0.111722, 0.      , 0.125   ],
       [1.      , 0.5     , 0.      , ..., 0.006428, 0.      , 0.      ]])

## Model Training

Training a learning-to-rank model is very similar to training a regression or classification model, except that the objective is different and we also need to pass the group information to the `XGBRanker` class.

In [26]:
params = {
    'objective': 'rank:ndcg',
    'learning_rate': 0.01,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'max_depth': 6,
    'n_estimators': 100
}

fit_params = {
    'verbose': 10,
    'early_stopping_rounds': 5,
    'eval_metric': 'ndcg',
    'eval_set': [(X_train, y_train), (X_val, y_val)],
    'eval_group': [group_train, group_val]
}

model = xgb.sklearn.XGBRanker(**params)
model.fit(X_train, y_train, group_train, **fit_params)

[0]	eval_0-ndcg:0.837363	eval_1-ndcg:0.785951
Multiple eval metrics have been passed: 'eval_1-ndcg' will be used for early stopping.

Will train until eval_1-ndcg hasn't improved in 5 rounds.
[10]	eval_0-ndcg:0.861817	eval_1-ndcg:0.805565
Stopping. Best iteration:
[13]	eval_0-ndcg:0.861784	eval_1-ndcg:0.808128



XGBRanker(base_score=0.5, booster='gbtree', colsample_bylevel=1,
          colsample_bytree=1, gamma=1.0, learning_rate=0.01, max_delta_step=0,
          max_depth=6, min_child_weight=0.1, missing=None, n_estimators=100,
          n_jobs=-1, nthread=None, objective='rank:ndcg', random_state=0,
          reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
          subsample=1)
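
The training log above reports a best iteration and stops once the validation metric has not improved for 5 rounds. The sketch below is a toy version of that stopping rule, not XGBoost's actual implementation: the function name `best_iteration` and the metric values are made up for illustration, and it assumes a metric where higher is better.

```python
def best_iteration(metric_history, patience):
    """Return (best_round, best_score) for a metric where higher is better."""
    best_round, best_score = 0, metric_history[0]
    for current_round, score in enumerate(metric_history):
        if score > best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            # no improvement for `patience` rounds in a row: stop training
            break
    return best_round, best_score


# made-up validation NDCG values, one per boosting round
history = [0.786, 0.790, 0.801, 0.808, 0.807, 0.806, 0.808, 0.805, 0.804, 0.803]
round_idx, score = best_iteration(history, patience=5)
print(round_idx, score)  # 3 0.808
```

Rounds are counted from 0 here, so under the assumption of one tree per round, the number of trees to keep would be `round_idx + 1`.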

A quick reminder that when using `predict` with the `XGBRanker` version shown here, the default behavior is to use all the trees for the prediction. If we leverage the early stopping functionality to determine the best number of trees, then we should either pass `ntree_limit=None` or pass the model's `best_ntree_limit` attribute to generate the prediction.

In [11]:
# default behaviour is to use all trees
pred_val = model.predict(X_val)
pred_val

array([0.53620577, 0.33488166, 0.46941793, ..., 0.33488166, 0.3867974 ,
       0.5271703 ], dtype=float32)

In [12]:
# passing None uses only the trees up to the best iteration determined by early stopping
pred_val = model.predict(X_val, ntree_limit=None)
pred_val

array([0.5332428 , 0.3766683 , 0.46111014, ..., 0.3766683 , 0.4103794 ,
       0.53116554], dtype=float32)

In [13]:
# or use the best_ntree_limit attribute, which is available when using
# the early stopping functionality
pred_val = model.predict(X_val, ntree_limit=model.best_ntree_limit)
pred_val

array([0.5332428 , 0.3766683 , 0.46111014, ..., 0.3766683 , 0.4103794 ,
       0.53116554], dtype=float32)

Upon determining the best tree limit, we can also refit the model on the combined training and validation sets using that number of trees.

In [None]:
X = np.vstack([X_train, X_val])
y = np.hstack([y_train, y_val])
group = np.hstack([group_train, group_val])

In [None]:
params = model.get_params()

# by directly accessing best_ntree_limit attribute, we assume
# the model always uses early_stopping_rounds in the fit parameter
if params['n_estimators'] != model.best_ntree_limit:

    # We over-ride the current n_estimators with the best_ntree_limit and refit the model again
    params['n_estimators'] = model.best_ntree_limit
    best_model = xgb.sklearn.XGBRanker(**params)

    del fit_params['early_stopping_rounds']
    best_model.fit(X, y, group, **fit_params)

## Model Evaluation

The model evaluation part is where things become a little different. Since we are ranking the items within each group, we need to evaluate how well the model ranks the items within each group.

We'll use NDCG as the evaluation metric in the following example.
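
For reference, the variant of the metric that the implementation in this section appears to follow (exponential gain with a base-2 logarithmic position discount) can be written as below. The symbols are notation introduced here for illustration: $rel_i$ is the relevance label of the item the model places at position $i$, and $rel^{*}_i$ is the label at position $i$ when items are sorted by their true labels. When no cutoff is given, $k$ is the group size.

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{IDCG@}k = \sum_{i=1}^{k} \frac{2^{rel^{*}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$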

In [14]:
def ndcg_at_k(y_true, y_score, k=None):
    actual = dcg_at_k(y_true, y_score, k)
    best = dcg_at_k(y_true, y_true, k) 
    ndcg = actual / best
    return ndcg


def dcg_at_k(y_true, y_score, k=None):
    order = np.argsort(y_score)[::-1]
    if k is not None:
        order = order[:k]

    y_true = np.take(y_true, order)
    gains = 2 ** y_true - 1
    discounts = np.log2(np.arange(2, gains.size + 2))
    dcg = np.sum(gains / discounts)
    return dcg

We can extract the predictions and relevance labels for one of the groups and compute the NDCG metric.

In [15]:
n_group = group_val[0]
y_score = pred_val[:n_group]
y_true = y_val[:n_group]
print('num of examples in group:\n', n_group)
print('prediction score:\n', y_score)
print('relevance label:\n', y_true)

num of examples in group:
 15
prediction score:
 [0.5332428  0.3766683  0.46111014 0.6059945  0.60195273 0.37404552
 0.40666327 0.37734008 0.60195273 0.39321342 0.37554443 0.38511944
 0.37404552 0.37647572 0.41525683]
relevance label:
 [0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]


In [16]:
ndcg_at_k(y_true, y_score)

0.8503449055347546
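
To see where this number comes from, the following sketch redoes the computation for the same group with only the standard library, using the scores and labels printed above. The helper name `dcg` is introduced here for illustration. The two relevant items end up at positions 1 and 5 of the predicted ranking, whereas the ideal ranking would place them at positions 1 and 2.

```python
import math

scores = [0.5332428, 0.3766683, 0.46111014, 0.6059945, 0.60195273,
          0.37404552, 0.40666327, 0.37734008, 0.60195273, 0.39321342,
          0.37554443, 0.38511944, 0.37404552, 0.37647572, 0.41525683]
labels = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def dcg(ranked_labels):
    # position 0 gets a discount of log2(2), position 1 of log2(3), ...
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_labels))


# labels in the order induced by the predicted scores, highest score first
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
ranked_labels = [labels[i] for i in order]

actual = dcg(ranked_labels)               # 1 / log2(2) + 1 / log2(6)
best = dcg(sorted(labels, reverse=True))  # 1 / log2(2) + 1 / log2(3)
ndcg = actual / best
print(round(ndcg, 6))  # 0.850345
```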

We can loop over each group to compute the evaluation metric.

In [18]:
ndcg_array = []

start = 0
for n_group in group_val:
    end = start + n_group
    y_score = pred_val[start:end]
    y_true = y_val[start:end]
    ndcg = ndcg_at_k(y_true, y_score)
    ndcg_array.append(ndcg)
    start = end

ndcg_array = np.array(ndcg_array)
ndcg_array[:10]


array([0.85034491, 0.38685281, 0.63990933, 0.63752463,        nan,
       0.35620719, 0.98289208, 0.85261156, 0.82131371, 1.        ])

Notice that when computing the evaluation metric for each group, we get `nan` for some of the groups. This happens when all the relevance labels in a group are 0: both the actual and the ideal DCG are then 0, and dividing 0 by 0 yields `nan`. We can filter out these `nan` elements afterwards, or exclude such groups from the data at the beginning.

In [19]:
start = np.sum(group_val[:4])
n_group = group_val[4]
end = start + n_group
y_score = pred_val[start:end]
y_true = y_val[start:end]
print(y_score)
print(y_true)

ndcg = ndcg_at_k(y_true, y_score)
ndcg

[0.5731795  0.37404552 0.3808377  0.45763242 0.37404552 0.6059945
 0.3766683 ]
[0 0 0 0 0 0 0]



nan
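
One way to handle such groups explicitly is to check the ideal DCG before dividing and to drop the undefined values before averaging. The sketch below is a minimal standard-library version under that assumption. The names `dcg` and `ndcg_or_nan` and the two toy groups are made up for illustration.

```python
import math


def dcg(ranked_labels):
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_labels))


def ndcg_or_nan(labels, scores):
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    actual = dcg([labels[i] for i in order])
    best = dcg(sorted(labels, reverse=True))
    # a group without any relevant item has an ideal DCG of 0,
    # so its NDCG is undefined and reported as nan
    return actual / best if best > 0 else float('nan')


# two made-up groups given as (labels, scores); the second has only zero labels
groups = [
    ([1, 0, 0], [0.9, 0.2, 0.1]),
    ([0, 0, 0], [0.5, 0.4, 0.3]),
]
values = [ndcg_or_nan(labels, scores) for labels, scores in groups]

# average only over the groups where the metric is defined
valid = [v for v in values if not math.isnan(v)]
mean_ndcg = sum(valid) / len(valid)
print(values, mean_ndcg)
```

Whether undefined groups should be dropped, or counted as 0 or 1, is a convention that changes the reported average, so it is worth stating whichever choice is made.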

The for loop we've implemented above can become very slow for large datasets. There are different approaches for speeding up this crude loop. The following code chunk uses pandas to perform the computation in a vectorized fashion.

In [20]:
group_col = 'qid'
label_col = 'label'
score_col = 'score'
df_val = pd.DataFrame({
    group_col: query_val,
    label_col: y_val,
    score_col: pred_val
})
df_val.head()

| | qid | label | score |
|---|---|---|---|
| 0 | qid:15928 | 0 | 0.533 |
| 1 | qid:15928 | 0 | 0.377 |
| 2 | qid:15928 | 1 | 0.461 |
| 3 | qid:15928 | 1 | 0.606 |
| 4 | qid:15928 | 0 | 0.602 |


In [21]:
def compute_ndcg(df, group_col, label_col, score_col, k=None):

    df_label_sorted = df.sort_values([group_col, label_col], ascending=False)
    df_score_sorted = df.sort_values([group_col, score_col], ascending=False)
    
    df_score_sorted['actual_rank'] = df_score_sorted.groupby(group_col).cumcount()
    df_score_sorted['best_rank'] = df_label_sorted.groupby(group_col).cumcount()

    item_gain = 2 ** df_score_sorted[label_col] - 1
    df_score_sorted['actual_dcg'] = item_gain / np.log2(df_score_sorted['actual_rank'] + 2)
    df_score_sorted['best_dcg'] = item_gain / np.log2(df_score_sorted['best_rank'] + 2)

    if k is not None and k > 0:
        df_actual_dcg_at_k = df_score_sorted[df_score_sorted['actual_rank'] < k]
        df_best_dcg_at_k = df_score_sorted[df_score_sorted['best_rank'] < k]
    else:
        df_actual_dcg_at_k = df_score_sorted
        df_best_dcg_at_k = df_score_sorted

    actual_dcg = df_actual_dcg_at_k.groupby(group_col)['actual_dcg'].sum()
    best_dcg = df_best_dcg_at_k.groupby(group_col)['best_dcg'].sum()
    ndcg_array = np.array(actual_dcg / best_dcg)
    ndcg_array = ndcg_array[~np.isnan(ndcg_array)]
    return ndcg_array

In [23]:
ndcg_array = compute_ndcg(df_val, group_col, label_col, score_col)
ndcg_array[:10]

array([0.85034491, 0.38685281, 0.63990933, 0.70218067, 0.33333333,
       0.98289208, 0.85261156, 0.82131371, 1.        , 0.81243718])
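
A few of these values differ from the loop-based results, for example 0.702 here versus 0.638 for the fourth group. One plausible explanation, not verified here, is that items with identical predicted scores are ordered differently by `np.argsort` and by `sort_values`. The toy sketch below, with made-up labels and scores, shows that two equally valid orderings of tied items can yield different NDCG values.

```python
import math


def dcg(ranked_labels):
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(ranked_labels))


# made-up group: the last two items receive exactly the same predicted score
labels = [0, 1, 0]
scores = [0.9, 0.4, 0.4]

# both index orders sort the scores in descending order;
# they only differ in how the tie between items 1 and 2 is broken
order_a = [0, 1, 2]  # relevant item placed before the other tied item
order_b = [0, 2, 1]  # relevant item placed after the other tied item

ideal = dcg(sorted(labels, reverse=True))
ndcg_a = dcg([labels[i] for i in order_a]) / ideal
ndcg_b = dcg([labels[i] for i in order_b]) / ideal
print(round(ndcg_a, 4), round(ndcg_b, 4))  # 0.6309 0.5
```

If reproducible numbers matter, fixing the tie-breaking rule (for instance by using a stable sort in both implementations) is one option to consider.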

# Reference

- [Github: XGBoost Example - Learning to Rank](https://github.com/dmlc/xgboost/tree/master/demo/rank)
- [Blog: Learning to Rank Explained (with Code)](https://mlexplained.com/2019/05/27/learning-to-rank-explained-with-code/)