## Learning to Rank with Linear Regression & XGBoost

Data set: MSLR-WEB10K, which is open-sourced by Microsoft and can be downloaded from [here](https://www.microsoft.com/en-us/research/project/mslr/).

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import ndcg_score

### Read Train Data

In [2]:
train = pd.read_csv("/kaggle/input/learningtorank/data/train.txt", header=None, sep=" ")

In [3]:
df_samp = train.sample(4)
df_samp.head(2)

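| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 284760 | 2 | qid:5794 | 1:1 | 2:1 | 3:1 | 4:0 | 5:1 | 6:1 | 7:1 | 8:1 | ... | 128:8 | 129:2 | 130:64363 | 131:58024 | 132:1 | 133:1 | 134:0 | 135:0 | 136:0 | NaN |
| 11312 | 0 | qid:1486 | 1:7 | 2:0 | 3:4 | 4:0 | 5:7 | 6:1 | 7:0 | 8:0.571429 | ... | 128:0 | 129:0 | 130:33033 | 131:39942 | 132:3 | 133:5 | 134:0 | 135:0 | 136:0 | NaN |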


Each row represents a query-document pair.

Column '0' holds the relevance label, an integer from 0 to 4, where 0 denotes "this document is irrelevant for this query" and 4 denotes "this document is perfectly relevant for this query".

Column '1' identifies the query and has the form 'qid:int'.

Columns '2' through '137' hold the 136 features for the query-document pair, each in the form 'feature_id:feature_value'. Full details about each feature can be found on the MSLR project page.

Finally, column '138' is NaN for every row. (It is an artifact of the way we split the data set into columns.)
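
The row layout described above is the SVMlight/LibSVM text format with query ids, which scikit-learn can parse directly. As a possible alternative to splitting columns by hand, here is a minimal sketch using `sklearn.datasets.load_svmlight_file` on two toy lines. The lines and the three-feature width are made up for illustration; the real files have 136 features.

```python
import io

from sklearn.datasets import load_svmlight_file

# Two hypothetical rows in the same "label qid:<id> <feature>:<value> ..." layout.
sample = b"2 qid:5794 1:1 2:0.5 3:0\n0 qid:1486 1:7 2:0 3:4\n"

# query_id=True makes the loader return the qid of every row as a third array.
# zero_based=False because the feature ids in this format start at 1.
X_s, y_s, qid_s = load_svmlight_file(
    io.BytesIO(sample), n_features=3, zero_based=False, query_id=True
)

dense = X_s.toarray()  # the loader returns a SciPy sparse matrix
print(dense)
print(y_s, qid_s)
```

The loader returns numeric arrays, so no string stripping or `astype(float)` conversion is needed afterwards.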

#### Processing the data

We want to do three things to the data frame in order to get it into an easier form to work with: 
1. replace 'qid:int' with 'int' in column '1'
2. replace 'feature_id:feature_value' with 'feature_value' for all entries in columns '2' to '137'
3. delete column '138'


In [4]:
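# Step 1: strip the 'qid:' prefix from the query id column.
df_samp[1] = df_samp[1].apply(lambda x: x[4:])
# Step 2: keep only the value part of each 'feature_id:feature_value' entry.
df_samp[df_samp.columns[2:-1]] = df_samp[df_samp.columns[2:-1]].applymap(lambda x: x.split(':')[1])
# Step 3: drop the all-NaN column (not assigned back; this only displays the result).
df_samp.drop(138, axis=1)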

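| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 284760 | 2 | 5794 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1.0 | ... | 64 | 8 | 2 | 64363 | 58024 | 1 | 1 | 0 | 0 | 0 |
| 11312 | 0 | 1486 | 7 | 0 | 4 | 0 | 7 | 1 | 0 | 0.571429 | ... | 59 | 0 | 0 | 33033 | 39942 | 3 | 5 | 0 | 0 | 0 |
| 140945 | 2 | 17236 | 4 | 0 | 3 | 0 | 4 | 1 | 0 | 0.75 | ... | 45 | 3 | 0 | 154 | 6909 | 4 | 4 | 0 | 0 | 0 |
| 593921 | 0 | 14032 | 4 | 0 | 1 | 0 | 4 | 1 | 0 | 0.25 | ... | 39 | 3 | 1 | 1204 | 10800 | 1 | 11 | 0 | 0 | 0 |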


We wrap these transformations up in one function which we can then apply to the full training set. We also import and transform the testing set, which we will use to evaluate our model. 

In [5]:
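def df_transform(df):
    """Return a cleaned copy: bare query ids, bare feature values, no NaN column."""
    df = df.copy()  # avoid mutating the caller's frame
    df[1] = df[1].apply(lambda x: x[4:])  # 'qid:int' -> 'int'
    df[df.columns[2:-1]] = df[df.columns[2:-1]].applymap(lambda x: x.split(':')[1])  # 'id:value' -> 'value'
    df = df.drop(138, axis=1)  # all-NaN artifact column
    return df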

In [6]:
train_df = df_transform(train)

In [7]:
train_df.sample(3)

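| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16242 | 0 | 2116 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | ... | 29 | 0 | 26 | 3681 | 63583 | 1 | 2 | 0 | 1 | 1.8 |
| 109047 | 0 | 13681 | 3 | 0 | 0 | 0 | 3 | 1 | 0.0 | 0 | ... | 37 | 5000 | 6 | 11391 | 57052 | 1 | 13 | 0 | 3 | 7.8 |
| 104616 | 0 | 13096 | 3 | 2 | 0 | 0 | 3 | 1 | 0.666667 | 0 | ... | 66 | 18 | 0 | 11966 | 41147 | 29 | 8 | 0 | 7 | 65.9 |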


In [8]:
X = train_df[train_df.columns[2:]]
y = train_df[0]

### Read Test data

In [9]:
test = pd.read_csv("/kaggle/input/learningtorank/data/test.txt", header=None, sep=" ")

In [10]:
test_df = df_transform(test)

In [11]:
X_test = test_df[test_df.columns[2:]]
y_test = test_df[0]

### Linear Regression

In [12]:
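from sklearn.linear_model import LinearRegression

# Pointwise baseline: regress the relevance label on the features.
reg = LinearRegression().fit(X.astype(float), y)
# score() returns the coefficient of determination (R^2) on the training data.
reg.score(X.astype(float), y)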

0.1334931311234563

In [13]:
test_pred = reg.predict(X_test.astype(float))
y_test_df = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})
y_test_df.head(2)

| | relevance_score | predicted_ranking |
|---|---|---|
| 0 | 2 | 0.76923 |
| 1 | 1 | 0.293494 |


### Evaluation

In [14]:
true_relevance = y_test.sort_values(ascending=False)
relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)

In [15]:
print(
        "nDCG score: ",
        ndcg_score(
            [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

print(
        "nDCG score @ 5: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=5
        ),
    )

print(
        "nDCG score @ 10: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=10
        ),
    )

print(
        "nDCG score @ 100: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=100
        ),
    )

nDCG score:  0.9218422284313269
nDCG score @ 5:  0.3452380952380952
nDCG score @ 10:  0.34523809523809523
nDCG score @ 100:  0.3452380952380954
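
The cell above scores the whole test set as one ranked list and passes the relevance labels, reordered by prediction, as `y_score`. NDCG is more commonly computed per query on the model's own scores and then averaged over queries. A minimal sketch of that variant follows; the helper name `mean_ndcg_per_query` is hypothetical, and skipping queries with a single document is an assumption of this sketch, not something the notebook specifies.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import ndcg_score


def mean_ndcg_per_query(qids, y_true, y_score, k=None):
    """Average NDCG@k over queries (hypothetical helper, not part of the notebook)."""
    frame = pd.DataFrame(
        {
            "qid": np.asarray(qids),
            "y_true": np.asarray(y_true, dtype=float),
            "y_score": np.asarray(y_score, dtype=float),
        }
    )
    scores = []
    for _, grp in frame.groupby("qid", sort=False):
        if len(grp) < 2:
            # Assumption: a single-document query carries no ranking information.
            continue
        scores.append(
            ndcg_score([grp["y_true"].to_numpy()], [grp["y_score"].to_numpy()], k=k)
        )
    return float(np.mean(scores)) if scores else float("nan")


# Toy check with made-up data: two queries, three documents each.
toy_qids = [1, 1, 1, 2, 2, 2]
toy_labels = [2, 1, 0, 0, 1, 2]
perfect = mean_ndcg_per_query(toy_qids, toy_labels, toy_labels)
reversed_order = mean_ndcg_per_query(toy_qids, toy_labels, [-v for v in toy_labels])
print(perfect, reversed_order)
```

With the notebook's variables, this could be called as `mean_ndcg_per_query(test_df[1], y_test, test_pred, k=10)`.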


## XGBRanker

In [16]:
import xgboost as xgb

In [17]:
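# XGBRanker expects per-query row counts in the order the queries appear in the data.
# sort=False keeps that order; the default would sort the string query ids lexicographically.
g = train_df.groupby(by=1, sort=False)
size = g.size()
group_train = size.to_list()

g = test_df.groupby(by=1, sort=False)
size = g.size()
group_valid = size.to_list()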
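
The `group` argument of `XGBRanker.fit` takes the number of rows per query, in the order the queries occur in the data, with each query's rows stored contiguously. Because the query ids are strings after `df_transform`, a sorted `groupby` orders them lexicographically, which need not match the row order. A minimal sketch on toy data (the qids are made up) showing the difference:

```python
import pandas as pd

# Hypothetical query ids as strings, rows of each query stored contiguously.
qid = pd.Series(["10", "10", "9", "9", "9", "100"])

# Default groupby sorts the keys: "10", "100", "9".
sorted_sizes = qid.groupby(qid).size().to_list()

# sort=False keeps first-appearance order: "10", "9", "100".
row_order_sizes = qid.groupby(qid, sort=False).size().to_list()

print(sorted_sizes)     # [2, 1, 3]
print(row_order_sizes)  # [2, 3, 1]
```

Both lists sum to the number of rows, so a misordered list raises no error; it silently assigns rows to the wrong queries.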

### Rank:NDCG

In [18]:
ranker = xgb.XGBRanker(
        n_estimators=10000,
        learning_rate=0.1,
        objective='rank:ndcg',
        reg_lambda=0.05,
        verbose = True,  # not an XGBRanker constructor parameter; causes the "might not be used" warning in the output
        tree_method = 'gpu_hist'
    )

ranker.fit(
    X.astype(float),
    y.astype(int),
    group=group_train,
    eval_group=[group_valid],
    eval_set=[(X_test.astype(float), y_test.astype(int))],
    early_stopping_rounds=100,
    verbose = True
)

test_pred = ranker.predict(X_test.astype(float))
y_test_df = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})

true_relevance = y_test.sort_values(ascending=False)
relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)

print(
        "nDCG score: ",
        ndcg_score(
            [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

print(
        "nDCG score @ 5: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=5
        ),
    )

print(
        "nDCG score @ 10: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=10
        ),
    )

print(
        "nDCG score @ 100: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=100
        ),
    )



Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation_0-map:0.62655
[1]	validation_0-map:0.63346
[2]	validation_0-map:0.63370
[3]	validation_0-map:0.63512
[4]	validation_0-map:0.63708
[5]	validation_0-map:0.63681
[6]	validation_0-map:0.63642
[7]	validation_0-map:0.63663
[8]	validation_0-map:0.63656
[9]	validation_0-map:0.63692
[10]	validation_0-map:0.63686
[11]	validation_0-map:0.63814
[12]	validation_0-map:0.63830
[13]	validation_0-map:0.63909
[14]	validation_0-map:0.63984
[15]	validation_0-map:0.64023
[16]	validation_0-map:0.64138
[17]	validation_0-map:0.64209
[18]	validation_0-map:0.64266
[19]	validation_0-map:0.64310
[20]	validation_0-map:0.64351
[21]	validation_0-map:0.64422
[22]	validation_0-map:0.64447
[23]	v

### Rank:Map

In [19]:
ranker = xgb.XGBRanker(
        n_estimators=10000,
        learning_rate=0.1,
        objective='rank:map',
        reg_lambda=0.05,
        verbose = True,  # not an XGBRanker constructor parameter; causes the "might not be used" warning in the output
        tree_method = 'gpu_hist'
    )

ranker.fit(
    X.astype(float),
    y.astype(int),
    group=group_train,
    eval_group=[group_valid],
    eval_set=[(X_test.astype(float), y_test.astype(int))],
    early_stopping_rounds=100,
    verbose = True
)

test_pred = ranker.predict(X_test.astype(float))
y_test_df = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})

true_relevance = y_test.sort_values(ascending=False)
relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)

print(
        "nDCG score: ",
        ndcg_score(
            [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

print(
        "nDCG score @ 5: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=5
        ),
    )

print(
        "nDCG score @ 10: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=10
        ),
    )

print(
        "nDCG score @ 100: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=100
        ),
    )



Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation_0-map:0.63724
[1]	validation_0-map:0.64015
[2]	validation_0-map:0.64075
[3]	validation_0-map:0.64348
[4]	validation_0-map:0.64503
[5]	validation_0-map:0.64535
[6]	validation_0-map:0.64552
[7]	validation_0-map:0.64495
[8]	validation_0-map:0.64511
[9]	validation_0-map:0.64547
[10]	validation_0-map:0.64574
[11]	validation_0-map:0.64643
[12]	validation_0-map:0.64783
[13]	validation_0-map:0.64871
[14]	validation_0-map:0.64945
[15]	validation_0-map:0.65028
[16]	validation_0-map:0.65097
[17]	validation_0-map:0.65086
[18]	validation_0-map:0.65199
[19]	validation_0-map:0.65241
[20]	validation_0-map:0.65300
[21]	validation_0-map:0.65310
[22]	validation_0-map:0.65361
[23]	v

### Rank:Pairwise

In [20]:
ranker = xgb.XGBRanker(
        n_estimators=10000,
        learning_rate=0.1,
        objective='rank:pairwise',
        reg_lambda=0.05,
        verbose = True,  # not an XGBRanker constructor parameter; causes the "might not be used" warning in the output
        tree_method = 'gpu_hist'
    )

ranker.fit(
    X.astype(float),
    y.astype(int),
    group=group_train,
    eval_group=[group_valid],
    eval_set=[(X_test.astype(float), y_test.astype(int))],
    early_stopping_rounds=100,
    verbose = True
)

test_pred = ranker.predict(X_test.astype(float))
y_test_df = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})

true_relevance = y_test.sort_values(ascending=False)
relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)

print(
        "nDCG score: ",
        ndcg_score(
            [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

print(
        "nDCG score @ 5: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=5
        ),
    )

print(
        "nDCG score @ 10: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=10
        ),
    )

print(
        "nDCG score @ 100: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=100
        ),
    )



Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation_0-map:0.63942
[1]	validation_0-map:0.64872
[2]	validation_0-map:0.65097
[3]	validation_0-map:0.65212
[4]	validation_0-map:0.65345
[5]	validation_0-map:0.65397
[6]	validation_0-map:0.65505
[7]	validation_0-map:0.65529
[8]	validation_0-map:0.65629
[9]	validation_0-map:0.65637
[10]	validation_0-map:0.65682
[11]	validation_0-map:0.65685
[12]	validation_0-map:0.65696
[13]	validation_0-map:0.65752
[14]	validation_0-map:0.65809
[15]	validation_0-map:0.65848
[16]	validation_0-map:0.65911
[17]	validation_0-map:0.65930
[18]	validation_0-map:0.65958
[19]	validation_0-map:0.66022
[20]	validation_0-map:0.66090
[21]	validation_0-map:0.66118
[22]	validation_0-map:0.66165
[23]	v

### NDCG - Evaluation on Rank:NDCG

In [21]:
ranker = xgb.XGBRanker(
        n_estimators=10000,
        learning_rate=0.1,
        objective='rank:ndcg',
        reg_lambda=0.05,
        verbose = True,  # not an XGBRanker constructor parameter; causes the "might not be used" warning in the output
        tree_method = 'gpu_hist'
    )

ranker.fit(
    X.astype(float),
    y.astype(int),
    group=group_train,
    eval_group=[group_valid],
    eval_set=[(X_test.astype(float), y_test.astype(int))],
    early_stopping_rounds=100,
    eval_metric="ndcg",
    verbose = True
)

test_pred = ranker.predict(X_test.astype(float))
y_test_df = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})

true_relevance = y_test.sort_values(ascending=False)
relevance_score = y_test_df.sort_values("predicted_ranking", ascending=False)

print(
        "nDCG score: ",
        ndcg_score(
            [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

print(
        "nDCG score @ 5: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=5
        ),
    )

print(
        "nDCG score @ 10: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=10
        ),
    )

print(
        "nDCG score @ 100: ",
        ndcg_score(
            y_true = [true_relevance.to_numpy()], y_score = [relevance_score["relevance_score"].to_numpy()], k=100
        ),
    )



Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	validation_0-ndcg:0.68117
[1]	validation_0-ndcg:0.69145
[2]	validation_0-ndcg:0.69374
[3]	validation_0-ndcg:0.69529
[4]	validation_0-ndcg:0.69738
[5]	validation_0-ndcg:0.69758
[6]	validation_0-ndcg:0.69769
[7]	validation_0-ndcg:0.69799
[8]	validation_0-ndcg:0.69819
[9]	validation_0-ndcg:0.69830
[10]	validation_0-ndcg:0.69923
[11]	validation_0-ndcg:0.70016
[12]	validation_0-ndcg:0.70044
[13]	validation_0-ndcg:0.70105
[14]	validation_0-ndcg:0.70151
[15]	validation_0-ndcg:0.70168
[16]	validation_0-ndcg:0.70243
[17]	validation_0-ndcg:0.70284
[18]	validation_0-ndcg:0.70310
[19]	validation_0-ndcg:0.70321
[20]	validation_0-ndcg:0.70360
[21]	validation_0-ndcg:0.70428
[22]	validatio