Welcome to Learning to Rank

Microsoft Learning to Rank Dataset.


- LTR Intro: http://times.cs.uiuc.edu/course/598f14/l2r.pdf (overview/introduction to Learning to Rank, 2011)
- TFR: https://arxiv.org/abs/1812.00073 (a specific implementation/framework for Learning to Rank models, 2019)

# The Model

### 1) Imports

In [None]:
# Install dependencies here
!pip install tfds-nightly # tensorflow_datasets (nightly build), used to download the MSLR dataset

Collecting tfds-nightly
  Downloading tfds_nightly-4.5.2.dev202204010045-py3-none-any.whl (4.3 MB)
Collecting toml
  Downloading toml-0.10.2-py2.py3-none-any.whl (16 kB)
Collecting etils[epath-no-tf]
  Downloading etils-0.5.0-py3-none-any.whl (86 kB)
Installing collected packages: etils, toml, tfds-nightly
Successfully installed etils-0.5.0 tfds-nightly-4.5.2.dev202204010045 toml-0.10.2


In [None]:
!pip install tensorflow_ranking # library for Learning-to-Rank (LTR) 
!pip install lightgbm # we use LightGBM machine learning algorithm

Collecting tensorflow_ranking
  Downloading tensorflow_ranking-0.5.0-py2.py3-none-any.whl (141 kB)

In [None]:
# Import dependencies here
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_ranking as tfr

import pandas as pd
import numpy as np
import lightgbm as lgb

import matplotlib.pyplot as plt
%matplotlib inline

### 2) Download Dataset

In [None]:
# Download the dataset located at https://storage.googleapis.com/personalization-takehome/MSLR-WEB10K.zip
# You can read about the features included in the dataset here: https://www.microsoft.com/en-us/research/project/mslr/
ds = tfds.load("mslr_web/10k_fold1", split="train") # We choose FOLD1

Downloading and preparing dataset 1.15 GiB (download: 1.15 GiB, generated: 381.58 MiB, total: 1.52 GiB) to ~/tensorflow_datasets/mslr_web/10k_fold1/1.0.0...

Generating splits: train (6000 examples), vali (2000 examples), test (2000 examples)

Dataset mslr_web downloaded and prepared to ~/tensorflow_datasets/mslr_web/10k_fold1/1.0.0. Subsequent calls will reuse this data.


### 3) Preprocess and evaluate the dataset

In [None]:
# Preprocess and evaluate the dataset
ds_df = tfds.as_dataframe(ds) # dataframe version of MSLR dataset
# We use FOLD1 of MSLR dataset in this work.

In [None]:
ds_df.head() # illustrate the data

Unnamed: 0,bm25_anchor,bm25_body,bm25_title,bm25_url,bm25_whole_document,boolean_model_anchor,boolean_model_body,boolean_model_title,boolean_model_url,boolean_model_whole_document,...,variance_of_tf_idf_anchor,variance_of_tf_idf_body,variance_of_tf_idf_title,variance_of_tf_idf_url,variance_of_tf_idf_whole_document,vector_space_model_anchor,vector_space_model_body,vector_space_model_title,vector_space_model_url,vector_space_model_whole_document
0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[29.42768, 9.288441, 18.526854, 0.0, 18.957465...","[0.0, 0.0, 0.0, 0.0, 0.0, 11.846837, 0.0, 0.0,...","[0.0, 0.0, 0.0, 0.0, 0.0, 7.998056, 0.0, 0.0, ...","[28.927591, 9.288662, 18.490218, 0.0, 18.95132...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[27.840805, 19.929322, 36.308397, 0.0, 27.5398...","[0.0, 0.0, 0.0, 0.0, 0.0, 44.544215, 0.0, 0.0,...","[0.0, 0.0, 0.0, 0.0, 0.0, 44.035199, 0.0, 0.0,...","[27.841789, 19.918075, 36.308397, 0.0, 27.5418...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.933903, 0.325014, 0.875523, 0.0, 0.945709, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.761123, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.745211, 0.0, 0.0, ...","[0.933988, 0.324964, 0.875633, 0.0, 0.945726, ..."
1,"[0.0, 9.997904, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[36.243178, 24.041161, 35.895516, 40.150412, 4...","[9.428295, 7.181888, 27.228748, 24.879653, 24....","[7.812148, 0.0, 23.955459, 0.0, 0.0, 0.0, 0.0,...","[36.497524, 24.591495, 36.488682, 40.325685, 4...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"[0.0, 9.003951, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[339.055495, 31.415201, 3305.919844, 1285.5990...","[7.687248, 7.687248, 14.848784, 14.848784, 14....","[9.020905, 0.0, 17.631846, 0.0, 0.0, 0.0, 0.0,...","[417.968267, 64.96175, 3836.19764, 1452.90327,...","[0.0, 0.408501, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.8083, 0.679002, 0.827916, 0.84103, 0.839974...","[0.417303, 0.417303, 0.789509, 0.789509, 0.789...","[0.412937, 0.0, 0.795755, 0.0, 0.0, 0.0, 0.0, ...","[0.783515, 0.606861, 0.829137, 0.840663, 0.839..."
2,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 31.099175, 30.484913, 31.105763, 0.0, 22...","[0.0, 28.65454, 28.65454, 16.023471, 0.0, 13.4...","[0.0, 22.842047, 22.842047, 19.6778, 0.0, 0.0,...","[0.0, 33.156394, 32.988977, 32.236219, 0.0, 22...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, ...","[0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, ...","[0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.228711, 2.72403, 2.764601, 0.0, 292.25...","[0.0, 0.064654, 0.064654, 0.016164, 0.0, 19.50...","[0.0, 0.043984, 0.043984, 0.043984, 0.0, 0.0, ...","[0.0, 3.770071, 0.667559, 4.924175, 0.0, 383.1...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 1.0, 0.989758, 1.0, 0.0, 0.764901, 0.999...","[0.0, 1.0, 1.0, 1.0, 0.0, 0.697006, 1.0, 0.0, ...","[0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.691259, 0.0, ...","[0.0, 1.0, 0.997039, 1.0, 0.0, 0.755015, 1.0, ..."
3,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[12.288321, 13.075652, 12.605294, 15.993784, 1...","[7.69312, 9.684237, 11.621084, 4.758204, 10.59...","[8.196643, 8.196643, 0.0, 8.695295, 8.196643, ...","[14.136543, 13.76887, 13.591888, 16.030504, 16...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[23.545009, 288.426363, 147.156307, 8058.27939...","[18.508313, 18.508313, 74.03325, 18.508313, 18...","[22.635226, 22.635226, 0.0, 22.635226, 22.6352...","[94.160557, 476.687818, 288.366705, 8951.13792...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.264399, 0.264399, 0.264399, 0.264399, 0.264...","[0.436652, 0.436652, 0.436652, 0.436652, 0.436...","[0.47293, 0.47293, 0.0, 0.47293, 0.47293, 0.47...","[0.264374, 0.264374, 0.264374, 0.264374, 0.264..."
4,"[27.88085, 0.0, 28.833058, 14.801772, 0.0, 11....","[42.103867, 38.373132, 43.194869, 36.933612, 4...","[28.110139, 19.193686, 33.732166, 22.108442, 3...","[0.0, 0.0, 27.622547, 20.344219, 26.038466, 9....","[42.451466, 38.582556, 43.651186, 38.617481, 4...","[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, ...","[1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, ...","[0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, ...",...,"[13.69384, 0.0, 1.892948, 14.273079, 0.0, 14.2...","[10240.249987, 407.423062, 798.105042, 13405.8...","[6.715774, 11.835504, 26.863096, 13.222567, 30...","[0.0, 0.0, 1.751718, 50.328568, 1.751718, 14.2...","[10990.6939, 475.568154, 1290.993942, 14438.21...","[0.93639, 0.0, 1.0, 0.555718, 0.0, 0.555718, 0...","[0.952431, 0.893485, 0.992766, 0.703059, 0.971...","[1.0, 0.811025, 1.0, 0.8548, 0.968685, 0.61367...","[0.0, 0.0, 1.0, 0.754014, 1.0, 0.568278, 0.568...","[0.956585, 0.885813, 0.995693, 0.731488, 0.975..."
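
Each row of the dataframe above appears to hold one query, with every feature column storing a list that has one value per candidate document. A minimal sketch of flattening such a frame into one row per document is shown below. The values are made up, and the `label` column name and the `query` index name are assumptions for illustration, not taken from the output above.

```python
import pandas as pd

# Toy per-query frame: each cell is a list with one entry per document.
per_query = pd.DataFrame(
    {
        "label": [[2, 0, 1], [0, 3]],
        "bm25_body": [[29.4, 9.3, 18.5], [36.2, 24.0]],
        "bm25_title": [[0.0, 0.0, 11.8], [9.4, 7.2]],
    }
)
per_query.index.name = "query"  # hypothetical name for the query index

# Explode all list columns together so that row i of every list stays aligned.
per_document = per_query.explode(list(per_query.columns)).reset_index()
per_document = per_document.astype(
    {"label": int, "bm25_body": float, "bm25_title": float}
)

# Number of documents per query, in the order the queries appear.
group_sizes = per_document.groupby("query", sort=False).size().tolist()

print(per_document)
print(group_sizes)
```

Exploding several columns at once requires pandas 1.3 or newer and lists of equal length within each row.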


In [None]:
# Preprocessing

#   To use LightGBM, we need to create the variables group_train and group_vali, which contain the
#   number of examples for each query ID. This allows LGBMRanker to group examples by query during training.

# We define a function (get_data) that loads the three splits and creates group_train and group_vali.

def get_data(data_path):
    # we take the train, test and validation data from local colab file
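    dfs = {
        # header=None: the files have no header row, so the first line is data
        "train": pd.read_csv(f"{data_path}/train.txt", delimiter=" ", header=None),
        "vali": pd.read_csv(f"{data_path}/vali.txt", delimiter=" ", header=None),
        "test": pd.read_csv(f"{data_path}/test.txt", delimiter=" ", header=None),
    }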


    # delete columns where all data is missing

    for df in dfs.values():
        df.columns = np.arange(len(df.columns))
        df.drop(
            columns=df.columns[df.isna().all()].tolist(), inplace=True
        )


    split = {}

    split["X_train"] = dfs["train"].iloc[:, 1:]
    split["X_val"] = dfs["vali"].iloc[:, 1:]
    split["X_test"] = dfs["test"].iloc[:, 1:]

    y_train = dfs["train"].iloc[:, 0]
    y_val = dfs["vali"].iloc[:, 0]
    y_test = dfs["test"].iloc[:, 0]


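    # sort=False keeps the groups in the order the queries appear in the file,
    # which is the order LGBMRanker expects for the group sizes
    g = split["X_train"].groupby(by=1, sort=False)
    size = g.size()
    group_train = size.to_list()

    g = split["X_val"].groupby(by=1, sort=False)
    size = g.size()
    group_vali = size.to_list()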
    
    # g = split["X_test"].groupby(by=1)
    # size = g.size()
    # group_test = size.to_list()

    # The MSLR-WEB dataset has 136 features.

    # According to a LASSO regression analysis in [2], variance features, as well as Inverse
    # Document Frequency (IDF) based features, appear to be less useful.
    
    # Therefore, we train the model on the remaining 96 features instead.

    # Feature selection: feature numbers (1-136) to drop
    columns_to_remove = [41, 42, 43, 44, 45, 66, 67, 68, 69, 70,
                         91, 92, 93, 94, 95, 16, 17, 18, 19, 20,
                         71, 72, 73, 74, 75, 76, 77, 78, 79, 80,
                         81, 82, 83, 84, 85, 86, 87, 88, 89, 90]

    for name, df in split.items():

        # Get rid of irrelevant information at the beginning of each feature value
        df = df.applymap(lambda x: x.split(":", 1)[-1])
        
        # convert data into float format to conform to LGBMRanker input standard
        df = df.astype(float)
       
        # get rid of the query ID column since it is not a feature
        df = df.drop(columns=1)
        
        # rename column indices for convenience
        
        df.columns = [i for i in range(1, 137)]
        # drop less useful features
        
        df = df.drop(columns=columns_to_remove)

        split[name] = df

    return (
        split["X_train"],
        split["X_test"],
        split["X_val"],
        y_train,
        y_test,
        y_val,
        group_vali,
        group_train,
      #  group_test,
    )

In [None]:
X_train, X_test, X_val, y_train, y_test, y_val, group_vali, group_train = get_data("/root/tensorflow_datasets/downloads/extracted/ZIP.api.onedr.com_v1.0_share_s_AtsMf_root_conteUrzlWWPKsOb_kzO3CJTPhBB9FOoELAWN1cE0YgcaUkk/Fold1")
# get_data returns the train, test and validation sets plus the per-query group sizes for LightGBM
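
To make the preprocessing above concrete, here is a minimal sketch of parsing a few lines in the `label qid:<id> <feature>:<value> ...` layout and computing the per-query group sizes in file order. The sample lines, the two-feature width and the helper name `parse_ltr_text` are assumptions for illustration; the real files have 136 features per line.

```python
import io

import pandas as pd

# Hypothetical sample: relevance label, query ID, then feature_number:value pairs.
SAMPLE = """\
2 qid:10 1:0.5 2:1.0
0 qid:10 1:0.1 2:0.0
1 qid:2 1:0.3 2:0.7
0 qid:2 1:0.0 2:0.2
3 qid:2 1:0.9 2:0.4
"""


def parse_ltr_text(text):
    """Return (features, labels, group_sizes) for space-separated LTR text."""
    df = pd.read_csv(io.StringIO(text), sep=" ", header=None)
    labels = df.iloc[:, 0].astype(int)
    # Strip the "qid:" prefix from the query ID column.
    qids = df.iloc[:, 1].str.split(":", n=1).str[-1].astype(int)
    # Strip the "<feature_number>:" prefix from every feature column.
    features = df.iloc[:, 2:].apply(
        lambda col: col.str.split(":", n=1).str[-1].astype(float)
    )
    features.columns = range(1, features.shape[1] + 1)
    # sort=False preserves file order; sorting would reorder the groups here.
    group_sizes = qids.groupby(qids, sort=False).size().tolist()
    return features, labels, group_sizes


features, labels, group_sizes = parse_ltr_text(SAMPLE)
print(group_sizes)
```

In this sample, query 10 comes before query 2 in the file, so a sorted group-by would list the group sizes in the wrong order for the rows.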

### 4) Build ranking model

In [None]:
# Build ranking model
!pip install optuna

Collecting optuna
  Downloading optuna-2.10.0-py3-none-any.whl (308 kB)
Collecting cmaes>=0.8.2
  Downloading cmaes-0.8.2-py3-none-any.whl (15 kB)
Collecting cliff
  Downloading cliff-3.10.1-py3-none-any.whl (81 kB)
Collecting alembic
  Downloading alembic-1.7.7-py3-none-any.whl (210 kB)
Collecting colorlog
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting Mako
  Downloading Mako-1.2.0-py3-none-any.whl (78 kB)
Collecting pbr!=2.1.0,>=2.0.0
  Downloading pbr-5.8.1-py2.py3-none-any.whl (113 kB)
Collecting autopage>=0.4.0
  Downloading autopage-0.5.0-py3-none-any.whl (29 kB)
Collecting stevedore>=2.0.1
  Downloading stevedore-3.5.0-py3-none-any.whl (49 kB)

In [None]:
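import optuna.integration.lightgbm as lgb
import optuna
# OUR MODEL
# optuna.integration.lightgbm is Optuna's wrapper around LightGBM; it replaces the earlier `lgb` import.
# The hyperparameters below are set by hand; no Optuna search is run in this cell.
    
gbm = lgb.LGBMRanker(
        n_estimators = 255,
        num_leaves = 100, 
        learning_rate = 0.1, 
        reg_lambda = 2.5
) # LGBMRanker reports NDCG on the validation set by default.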

gbm.fit(
        X_train,
        y_train,
        group=group_train,
        eval_group=[group_vali],
        eval_set=[(X_val, y_val)],
        early_stopping_rounds=150,
)


gbm.booster_.save_model("MSLR-WEB10K_Fold1", num_iteration=gbm.best_iteration_) 
# We save the model up to the iteration with the best validation ndcg@1.

[1]	valid_0's ndcg@1: 0.358176
Training until validation scores don't improve for 150 rounds.
[2]	valid_0's ndcg@1: 0.409838
[3]	valid_0's ndcg@1: 0.415776
[4]	valid_0's ndcg@1: 0.421043
[5]	valid_0's ndcg@1: 0.427852
[6]	valid_0's ndcg@1: 0.43471
[7]	valid_0's ndcg@1: 0.435048
[8]	valid_0's ndcg@1: 0.433886
[9]	valid_0's ndcg@1: 0.436262
[10]	valid_0's ndcg@1: 0.436905
[11]	valid_0's ndcg@1: 0.438114
[12]	valid_0's ndcg@1: 0.439795
[13]	valid_0's ndcg@1: 0.440248
[14]	valid_0's ndcg@1: 0.44389
[15]	valid_0's ndcg@1: 0.447386
[16]	valid_0's ndcg@1: 0.44539
[17]	valid_0's ndcg@1: 0.451362
[18]	valid_0's ndcg@1: 0.453324
[19]	valid_0's ndcg@1: 0.453743
[20]	valid_0's ndcg@1: 0.453324
[21]	valid_0's ndcg@1: 0.455838
[22]	valid_0's ndcg@1: 0.458267
[23]	valid_0's ndcg@1: 0.458567
[24]	valid_0's ndcg@1: 0.46291
[25]	valid_0's ndcg@1: 0.464962
[26]	valid_0's ndcg@1: 0.465724
[27]	valid_0's ndcg@1: 0.463433
[28]	valid_0's ndcg@1: 0.468514
[29]	valid_0's ndcg@1: 0.466019
[30]	valid_0's ndcg@1:

<lightgbm.basic.Booster at 0x7f170be46250>

### 5) Evaluate model performance

In [None]:
# Evaluate model performance

# We choose nDCG (Normalized Discounted Cumulative Gain) as the metric.

from sklearn.metrics import ndcg_score
gbm = lgb.Booster(model_file="MSLR-WEB10K_Fold1")

# Ideal order: the test labels sorted from most to least relevant.
true_relevance = y_test.sort_values(ascending=False)

# Actual order: the test labels sorted according to our model's predictions.

test_pred = gbm.predict(X_test)
y_test2 = pd.DataFrame({"relevance_score": y_test, "predicted_ranking": test_pred})

relevance_score = y_test2.sort_values("predicted_ranking", ascending=False)

# Use the computed variables to calculate one nDCG score.
# Note: all test queries are pooled into a single list here, so this is not a per-query average.
print(
    "nDCG score: ",
    ndcg_score(
        [true_relevance.to_numpy()], [relevance_score["relevance_score"].to_numpy()]
        ),
    )

nDCG score:  0.9326257430447896
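
The score above is computed with all test queries pooled into one list. Ranking metrics are more commonly averaged per query, and a minimal sketch of that variant on toy data follows. The helper name `mean_ndcg_per_query`, the cutoff `k=10`, and the rule of skipping queries with a single document or no relevant document are assumptions for illustration, not part of the notebook. Note also that scikit-learn's `ndcg_score` uses a linear gain.

```python
import numpy as np
from sklearn.metrics import ndcg_score


def mean_ndcg_per_query(labels, scores, group_sizes, k=10):
    """Average nDCG@k over queries; group_sizes gives documents per query in row order."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    values = []
    start = 0
    for size in group_sizes:
        stop = start + size
        query_labels = labels[start:stop]
        # Assumption: skip queries where nDCG is undefined or uninformative.
        if size > 1 and query_labels.max() > 0:
            values.append(ndcg_score([query_labels], [scores[start:stop]], k=k))
        start = stop
    return float(np.mean(values)) if values else float("nan")


# Toy data: three queries with 3, 4 and 1 documents.
toy_labels = [3, 0, 1, 2, 2, 0, 1, 2]
toy_groups = [3, 4, 1]
perfect = mean_ndcg_per_query(toy_labels, toy_labels, toy_groups)
reversed_order = mean_ndcg_per_query(toy_labels, [-x for x in toy_labels], toy_groups)
print(perfect, reversed_order)
```

With the notebook's variables, the same helper could in principle be called with `y_test`, `test_pred` and the test group sizes, which `get_data` currently leaves commented out.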


In [None]:
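# For comparison, we also try three regression models (ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor).
# These are just examples. Their .score() method reports R^2 on the relevance labels, not a ranking metric.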

X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()
X_val_np = X_val.to_numpy()

y_train_np = y_train.to_numpy()
y_test_np = y_test.to_numpy()
y_val_np = y_val.to_numpy()

In [None]:
from sklearn.ensemble import ExtraTreesRegressor

etr = ExtraTreesRegressor(n_estimators=200, min_samples_split=5, random_state=1, n_jobs=-1)
etr.fit(X_train_np, y_train_np)

ExtraTreesRegressor(min_samples_split=5, n_estimators=200, n_jobs=-1,
                    random_state=1)

In [None]:
etr.score(X_test_np, y_test_np)

0.17951725165637167

In [None]:
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=200, min_samples_split=5, random_state=1, n_jobs=-1)
rfr.fit(X_train_np, y_train_np)

In [None]:
rfr.score(X_test_np, y_test_np)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

gbr2 = GradientBoostingRegressor(n_estimators=200, random_state=1, verbose=1)
gbr2.fit(X_train_np, y_train_np)

In [None]:
gbr2.score(X_test_np, y_test_np)

# Discussion

### 1) Please answer the following questions about your choices:
- Discuss your model and why you chose the model you chose (e.g. architecture, design, loss functions, etc.)
- Why did you choose your metric to evaluate the model?
- How well would you say your model performed?
- If you had more time what else would you want to try?

# **ANSWER 1**

1.) We chose LightGBM [1] because it trains quickly, scales to large datasets, uses little memory (RAM), achieves high accuracy, and supports parallel and GPU learning. According to [1], LightGBM also speeds up the training of a conventional Gradient Boosting Decision Tree by up to over 20 times. We use its `LGBMRanker`, whose default objective is LambdaRank.

2.) We chose NDCG (Normalized Discounted Cumulative Gain) [3] as the metric. Our goal is to rank relevant items higher than irrelevant items for any given query, and NDCG rewards placing highly relevant documents near the top of the list while supporting graded relevance labels such as those in MSLR.

3.) We trained LightGBM on the MSLR data and got results very quickly. We also trained RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor for comparison, but these three methods took a long time.

4.) If we had more time, we would have tuned the hyperparameters of LightGBM and trained on the other parts of the MSLR dataset (Fold2, Fold3, Fold4 and Fold5). We would also have liked to use k-fold cross-validation across them.
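
For reference, the metric from answer 2 can be written out. The form below uses the exponential gain that is common in the learning-to-rank literature, which is an assumption about the variant intended here; scikit-learn's `ndcg_score` uses the linear gain $rel_i$ instead. Here $rel_i$ is the relevance label of the document at rank $i$, and $\mathrm{IDCG@}k$ is the $\mathrm{DCG@}k$ of the ideal ordering for the same query.

```latex
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
```

Because of the normalization, nDCG lies between 0 and 1, with 1 meaning that the documents are returned in an ideal order.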

### 2) Answer the following questions about how you would use additional features:

- If you had an additional feature for each row of the dataset that was unique identifier for the user performing the query e.g. `user_id`, how could you use it to improve the performance of the model?
- If you had the additional features of: `query_text` or the actual textual query itself, as well as document text features like `title_text`, `body_text`, `anchor_text`, `url` for the document, how would you include them in your model (or any model) to improve its performance?




# **ANSWER 2**

1.) If I had an additional feature like `user_id` for each row, the ranking could be personalized for each user. I think a more useful ranking model could be obtained this way.

2.) These features would also likely increase our model's accuracy, because the ranking model could then also be based on the text, the URL and so on. We would also have to perform feature selection because we would have a lot of features. We could use the Optuna library to measure their effects.
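
One possible way to turn raw text into a model input is sketched below: a TF-IDF cosine similarity between `query_text` and `title_text`, appended as one extra numeric feature per row. This is only an illustration under assumptions. The helper name `query_title_similarity` and the toy strings are hypothetical, and the same idea would have to be repeated for `body_text`, `anchor_text` and `url`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def query_title_similarity(queries, titles):
    """Row-wise cosine similarity between TF-IDF vectors of queries and titles."""
    vectorizer = TfidfVectorizer()  # rows are L2-normalised by default
    vectorizer.fit(list(queries) + list(titles))
    query_vectors = vectorizer.transform(queries)
    title_vectors = vectorizer.transform(titles)
    # For unit-length rows, the element-wise product summed per row is the cosine.
    return np.asarray(query_vectors.multiply(title_vectors).sum(axis=1)).ravel()


# Toy example: the same query paired with a related and an unrelated title.
toy_queries = ["learning to rank", "learning to rank"]
toy_titles = ["learning to rank with trees", "cooking pasta at home"]
similarities = query_title_similarity(toy_queries, toy_titles)
print(similarities)
```

The resulting column could be concatenated with the existing numeric features before training the ranker.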

# REFERENCES

[1] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. "LightGBM: a highly efficient gradient boosting decision tree." In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 3149–3157.

[2] Xinzhi Han and Sen Lei. 2018. "Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets." doi: https://doi.org/10.48550/arXiv.1803.05127

[3] Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Tie-Yan Liu, and Wei Chen. 2013. "A Theoretical Analysis of NDCG Type Ranking Measures." doi: https://doi.org/10.48550/arXiv.1304.6480