<a href="https://colab.research.google.com/github/faisaladisoe/ir-tp/blob/master/TP4/source%20code/IR_TP4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning-to-Rank (LETOR) using LightGBM and XGBoost

# Install Libraries

In [1]:
!pip install gensim
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [42]:
import os
import random
import numpy as np
import pandas as pd
import xgboost as xgb
import lightgbm as lgb

from gensim.models import FastText
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from scipy.spatial.distance import cosine

# Data Preparation

## Downloading the Dataset

Source: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/

Download: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz

In [3]:
!wget -c https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz -P data
!tar -xvf data/nfcorpus.tar.gz

--2022-12-04 06:00:25--  https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz
Resolving www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)... 147.142.207.78
Connecting to www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)|147.142.207.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31039523 (30M) [application/x-gzip]
Saving to: ‘data/nfcorpus.tar.gz’


2022-12-04 06:00:29 (14.8 MB/s) - ‘data/nfcorpus.tar.gz’ saved [31039523/31039523]

nfcorpus/
nfcorpus/train.docs
nfcorpus/test.docs
nfcorpus/dev.docs
nfcorpus/dev.3-2-1.qrel
nfcorpus/test.3-2-1.qrel
nfcorpus/train.3-2-1.qrel
nfcorpus/raw/
nfcorpus/raw/doc_dump.txt
nfcorpus/raw/dev.docs.ids
nfcorpus/raw/dev.queries.ids
nfcorpus/raw/test.docs.ids
nfcorpus/raw/test.queries.ids
nfcorpus/raw/train.docs.ids
nfcorpus/raw/train.queries.ids
nfcorpus/raw/stopwords.large
nfcorpus/raw/nfdump.txt
nfcorpus/raw/all_videos.ids
nfcorpus/raw/nontopics.ids
nfcorpus/test.2-1-0.qrel
nfcorpus/dev.2-1-0.qrel
nfc

**Files used for training:**


1.   nfcorpus/train.docs
2.   nfcorpus/train.3-2-1.qrel
3.   nfcorpus/train.vid-desc.queries

**Files used for tuning (validation):**

1.   nfcorpus/dev.docs
2.   nfcorpus/dev.3-2-1.qrel
3.   nfcorpus/dev.vid-desc.queries

**Files used for testing:**

1.   nfcorpus/test.docs
2.   nfcorpus/test.3-2-1.qrel
3.   nfcorpus/test.vid-desc.queries
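
The three lists above follow one naming pattern, so the file set for any split can be derived from its name. A minimal sketch, assuming the archive is extracted to `nfcorpus/` as in the download cell; `split_files` is a hypothetical helper and not part of the notebook:

```python
def split_files(split, root='nfcorpus'):
  """Return the docs, qrel, and query file paths for one split (hypothetical helper)."""
  if split not in ('train', 'dev', 'test'):
    raise ValueError(f'unknown split: {split}')
  return {
    'docs': f'{root}/{split}.docs',
    'qrels': f'{root}/{split}.3-2-1.qrel',
    'queries': f'{root}/{split}.vid-desc.queries',
  }

print(split_files('dev'))
```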

## Preprocessing

In [4]:
!head -10 nfcorpus/train.docs

MED-10	statin breast cancer survival nationwide cohort study finland abstract recent studies suggested statins established drug group prevention cardiovascular mortality delay prevent breast cancer recurrence effect disease-specific mortality remains unclear evaluated risk breast cancer death statin users population-based cohort breast cancer patients study cohort included newly diagnosed breast cancer patients finland num num num cases identified finnish cancer registry information statin diagnosis obtained national prescription database cox proportional hazards regression method estimate mortality statin users statin time-dependent variable total num participants statins median follow-up num years diagnosis range num num years num participants died num num due breast cancer adjustment age tumor characteristics treatment selection post-diagnostic pre-diagnostic statin lowered risk breast cancer death hr num num ci num num hr num num ci num num risk decrease post-diagnostic statin affe

In [5]:
!head -10 nfcorpus/train.vid-desc.queries

PLAIN-2427	diet and exercise synergize to improve endothelial function , the ability of our arteries to relax normally .
PLAIN-2428	the parable of the tiny parachute explains the study that found no relationship between dietary fiber intake and diverticulosis .
PLAIN-2431	pbde fire retardant chemicals in the food supply may contribute to attention and cognitive deficits in children .
PLAIN-2432	peppermint essential oil should be considered the first-line treatment for ibs .
PLAIN-2433	the reversal of blindness due to hypertension and diabetes with dr. kempner ’ s rice and fruit diet demonstrates the power of diet to exceed the benefits of the best modern medicine and surgery has to offer .
PLAIN-2434	squatting and leaning can help straighten the anorectal angle , but a healthy enough diet should make bowel movements effortless regardless of positioning .
PLAIN-2435	most people have between 3 bowel movements a day and 3 a week , but normal doesn ’ t necessarily mean optimal .
PLAIN-2436

In [6]:
!head -10 nfcorpus/train.3-2-1.qrel

PLAIN-3	0	MED-2436	3
PLAIN-3	0	MED-2437	3
PLAIN-3	0	MED-2438	3
PLAIN-3	0	MED-2439	3
PLAIN-3	0	MED-2440	3
PLAIN-3	0	MED-2427	2
PLAIN-3	0	MED-2428	2
PLAIN-3	0	MED-2429	2
PLAIN-3	0	MED-2430	2
PLAIN-3	0	MED-2431	2


### Mapping

#### Variable Initialization

In [7]:
docs = {}
queries = {}
qrels = {}
dataset = {}
query_num_of_docs_each_type = {}

#### Document dataset

In [8]:
def map_docs(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.docs', 'r') as file:
    for line in file:
      doc_id, content = line.split('\t')
      temporary_map[doc_id] = [item for item in content.split() if item.isalnum()]
  docs[type_of_docs] = temporary_map
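
The `isalnum()` filter in `map_docs` keeps only purely alphanumeric tokens, so hyphenated terms such as `disease-specific` or `follow-up` (both present in the sample document shown earlier) are dropped rather than split. A small illustration on a made-up token string:

```python
sample = 'breast cancer disease-specific mortality follow-up num'

# Tokens that survive the filter used in map_docs
kept = [token for token in sample.split() if token.isalnum()]
# Tokens that the filter discards
dropped = [token for token in sample.split() if not token.isalnum()]

print(kept)
print(dropped)
```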

#### Query dataset

In [9]:
def map_queries(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.vid-desc.queries', 'r') as file:
    for line in file:
      query_id, content = line.split('\t')
      temporary_map[query_id] = [item for item in content.split() if item.isalnum()]
  queries[type_of_docs] = temporary_map

#### Query Relevance dataset

In [10]:
def map_qrels(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.3-2-1.qrel', 'r') as file:
    for line in file:
      query_id, _, doc_id, qrel = line.split('\t')
      if (query_id in queries[type_of_docs]) and (doc_id in docs[type_of_docs]):
        temporary_map.setdefault(query_id, []).append((doc_id, int(qrel)))
  qrels[type_of_docs] = temporary_map

#### Count number of docs in each query

In [11]:
def map_qid_num_of_docs(type_of_docs):
  query_num_of_docs = []
  combination_qid_did_qrel = []
  for query_id in qrels[type_of_docs]:
    content = qrels[type_of_docs][query_id]
    # Group size counts one extra row: the sampled document appended after the loop
    query_num_of_docs.append(len(content) + 1)
    for doc_id, qrel in content:
      combination_qid_did_qrel.append((queries[type_of_docs][query_id], docs[type_of_docs][doc_id], qrel))
    # Append one randomly chosen document with label 0 so every query has a non-relevant example
    # (the pick is not checked against the qrels, so it may occasionally be a relevant document)
    combination_qid_did_qrel.append((queries[type_of_docs][query_id], random.choice(list(docs[type_of_docs].values())), 0))
  dataset[type_of_docs] = combination_qid_did_qrel
  query_num_of_docs_each_type[type_of_docs] = query_num_of_docs
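
The rankers used later expect the rows as consecutive per-query blocks, with a `group` list giving the size of each block, so the sizes must sum to the number of rows. A minimal sketch of that invariant on toy data; `build_groups` and the toy qrels are illustrative assumptions that mirror the `+ 1` extra sampled document in `map_qid_num_of_docs`:

```python
def build_groups(qrels_by_query, extra_per_query=1):
  """Return (group_sizes, labels) with one consecutive block of rows per query."""
  group_sizes, labels = [], []
  for query_id, judged in qrels_by_query.items():
    group_sizes.append(len(judged) + extra_per_query)
    labels.extend(rel for _, rel in judged)
    labels.extend([0] * extra_per_query)  # sampled documents get label 0
  return group_sizes, labels

toy_qrels = {
  'PLAIN-3': [('MED-2436', 3), ('MED-2427', 2)],
  'PLAIN-4': [('MED-10', 1)],
}
groups, labels = build_groups(toy_qrels)
assert sum(groups) == len(labels)  # group sizes must cover every row exactly once
print(groups, labels)
```

The same check applies to any split: a per-query list such as `query_num_of_docs_each_type['dev']` sums to the number of dev rows.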

#### Execution

In [12]:
whole_set = ['train', 'dev', 'test']
for item in whole_set:
  map_docs(item)
  map_queries(item)
  map_qrels(item)
  map_qid_num_of_docs(item)

# Building Word Vectors

## Initialize Variable

In [13]:
os.makedirs('./models', exist_ok = True)
os.makedirs('./models/lsia', exist_ok = True)
os.makedirs('./models/fasttext', exist_ok = True)

In [14]:
dictionaries = {}
LSIA_models = {}
FT_models = {}

## Term-Document Matrix (LSI/LSA)

In [15]:
def td_matrix_lsia(type_of_docs):
  NUM_LATENT_TOPIC = 250
  lsia_dictionary = Dictionary()
  lsia_bow_corpus = [lsia_dictionary.doc2bow(doc, allow_update = True) for doc in docs[type_of_docs].values()]
  lsia_model = LsiModel(lsia_bow_corpus, num_topics = NUM_LATENT_TOPIC)
  dictionaries[type_of_docs] = lsia_dictionary
  lsia_model.save(f'./models/lsia/{type_of_docs}-{NUM_LATENT_TOPIC}.model')

## Term-Context Matrix (FastText)

In [16]:
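def tc_matrix_fasttext(type_of_docs):
  VEC_SIZE = 125  # only used in the saved file name; see the note on train() below
  fasttext_corpus = [doc for doc in docs[type_of_docs].values()]
  fasttext_model = FastText(sg = 1)  # sg=1 selects skip-gram; vector size stays at the constructor default
  fasttext_model.build_vocab(fasttext_corpus)
  # Note: `model` and `vector_size` are not training options of train(); dimensionality is fixed
  # when the model is constructed, consistent with the 100-column FastText features printed later.
  fasttext_model.train(fasttext_corpus, model = 'skipgram', vector_size=VEC_SIZE, total_examples = fasttext_model.corpus_count, epochs = 25)
  fasttext_model.save(f'./models/fasttext/{type_of_docs}-{VEC_SIZE}.model')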

## Execution

In [17]:
whole_set = ['train', 'dev', 'test']
for item in whole_set:
  td_matrix_lsia(item)
  LSIA_models[item] = LsiModel.load(f'./models/lsia/{item}-250.model')
  tc_matrix_fasttext(item)
  FT_models[item] = FastText.load(f'./models/fasttext/{item}-125.model')



## Term-Document Vector Representation

In [18]:
def td_vector_rep(array, type_of_docs):
  dictionary = dictionaries[type_of_docs]
  lsia_model = LSIA_models[type_of_docs]
  representation = [topic_value for (_, topic_value) in lsia_model[dictionary.doc2bow(array)]]
  return representation if len(representation) == 250 else [0.] * 250
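
`td_vector_rep` falls back to an all-zero vector whenever the LSI output has fewer than 250 entries. This can happen because gensim returns the projection as sparse `(topic_id, value)` pairs, for example an empty list when none of the tokens are in the dictionary. A hedged alternative is to scatter whatever pairs are returned into a dense vector; `sparse_to_dense` below is a hypothetical helper and the pairs are made up:

```python
def sparse_to_dense(pairs, size):
  """Place (index, value) pairs into a zero-filled list of length `size`."""
  dense = [0.0] * size
  for index, value in pairs:
    dense[index] = value
  return dense

print(sparse_to_dense([(0, 5.9), (2, -2.6)], size=4))
```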

In [19]:
print(td_vector_rep(docs['train']["MED-329"], 'train'))
print(td_vector_rep(queries['train']["PLAIN-2435"], 'train'))

[5.921538401607114, 3.1465016463689164, -2.6088592386504272, 0.676589508285761, -1.776701542572881, -2.420750848036928, -1.4295160417868886, -0.4883067734947725, -0.3430109190629869, -1.3695277974780877, 0.7181170013456368, -0.5637004642763289, -1.6761414408487836, -0.24145301297474464, 1.7369916747306118, -0.6902350463817528, 0.776844060134116, 1.0486426336920072, 0.9781161431716745, -1.0451598539528528, 0.1037759438771572, 1.8148887288638629, -1.570852064788117, -1.5845461691360871, 0.5343039258567022, 0.21710539166408538, 1.7199676520445641, -0.05649545301721492, -0.1594462828550908, 0.5019793173113957, -0.11615211741949501, 0.9048618358789069, 1.2680409622968374, 1.6020538970908178, 1.2823754793859035, 0.39781147174841275, 0.7399698904024125, -1.936413096866833, 1.3107080871190497, -0.05766444218632007, 1.476558890981695, -0.21183060834089434, -0.04266984019489774, 0.34742454818034757, -0.463328571172394, 1.3670690410818653, 0.8777224765929399, 0.812867693249457, 0.8301760520281187

## Term-Context Vector Representation

In [20]:
fasttext_model = FastText.load('./models/fasttext/train-125.model')
word_vector = fasttext_model.wv
word_vector.most_similar('developmental')

[('developments', 0.8717353343963623),
 ('neurodevelopmental', 0.8558087944984436),
 ('development', 0.8479216694831848),
 ('neurodevelopment', 0.7892796993255615),
 ('mental', 0.6347976326942444),
 ('parental', 0.5977450609207153),
 ('develops', 0.5954824686050415),
 ('fetal', 0.5826789140701294),
 ('prenatal', 0.5766539573669434),
 ('postnatal', 0.5570307970046997)]

In [21]:
def tc_vector_rep(word, type_of_docs):
  fasttext_model = FT_models[type_of_docs]
  word_vector = fasttext_model.wv
  return word_vector[word]

In [22]:
print(tc_vector_rep('statin vitro developmental', 'train'))
print(tc_vector_rep('vitro', 'train'))

[-0.1182723   0.7445859  -0.5131531   0.05441907 -0.16392565 -0.18878083
 -0.03743232  0.05043976  0.45734417 -0.08432929  0.28292453 -0.30305445
  0.29970664 -0.39034355 -0.23612088  0.23779048 -0.2623025   0.23497991
 -0.3095541  -0.55062973  0.14316349 -0.3457676  -0.01858135 -0.13721447
  0.37058088 -0.09426513 -0.3595557  -0.09751277  0.37665203  0.5840994
 -0.2922705   0.2954333  -0.20438589 -0.42096508  0.4439871  -0.1126425
  0.11100439  0.14537287  0.07442331 -0.30362004  0.15235384 -0.17078754
  0.05322624  0.07649558 -0.27624887  0.06070527 -0.42381328 -0.30532867
  0.07786699 -0.2848723   0.04564207 -0.2393235   0.08138093 -0.19258784
 -0.00211608 -0.17436697  0.11246253 -0.098701   -0.12790644 -0.10877469
 -0.2073003   0.2013119  -0.08885467  0.26361963  0.21289109  0.006822
 -0.34599847  0.12472442  0.06393512 -0.1252947  -0.41279057  0.64223653
  0.00712478 -0.30643508 -0.06793853 -0.5140494  -0.5596733  -0.30177295
  0.10198282 -0.2161595   0.46718076  0.23323485 -0.109
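
The first call above passes a whole multi-word string to the model as if it were a single token; FastText can still produce a vector for it from character n-grams. A common alternative, shown here only as a sketch and not what the notebook does, is to average the per-token vectors. The `toy_lookup` dictionary stands in for `fasttext_model.wv`, and its values are made up:

```python
def mean_vector(tokens, lookup):
  """Average the vectors of the tokens found in `lookup`, element by element."""
  vectors = [lookup[token] for token in tokens if token in lookup]
  if not vectors:
    return []
  size = len(vectors[0])
  return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(size)]

toy_lookup = {'statin': [1.0, 0.0], 'vitro': [0.0, 1.0]}
print(mean_vector('statin vitro developmental'.split(), toy_lookup))
```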

# Compute Similarity between Query and Docs

## Variable Initialization

In [23]:
lsia_pair_of_X_Y = {}
fasttext_pair_of_X_Y = {}

## Vector Representation for pair of query and docs

In [24]:
def pair_query_docs(query, doc, vsm_type, type_of_docs):
  if vsm_type == 'lsia':
    vector_of_query = td_vector_rep(query, type_of_docs)
    vector_of_doc = td_vector_rep(doc, type_of_docs)
  elif vsm_type == 'fasttext':
    vector_of_query = tc_vector_rep(' '.join(query), type_of_docs)
    vector_of_doc = tc_vector_rep(' '.join(doc), type_of_docs)
  q = set(query)
  d = set(doc)
  cosine_dist = cosine(vector_of_query, vector_of_doc)
  jaccard_sim = len(q & d) / len(q | d)
  # Note: `+` concatenates Python lists (LSI/LSA) but adds NumPy arrays element-wise (FastText)
  return vector_of_query + vector_of_doc + [jaccard_sim] + [cosine_dist]
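
The return statement in `pair_query_docs` relies on `+`, whose meaning depends on the operand types. That is consistent with the 502-column LSI/LSA features and the 100-column FastText features printed in the following sections. Below is a sketch of a variant that always concatenates, by converting both vectors to lists first. `concat_features` is a hypothetical name, and the cosine distance is computed by hand only to keep the example dependency-free:

```python
import math

def concat_features(query_tokens, doc_tokens, query_vec, doc_vec):
  """Build [query_vec, doc_vec, jaccard, cosine_distance] as one flat list."""
  q, d = set(query_tokens), set(doc_tokens)
  jaccard = len(q & d) / len(q | d) if (q | d) else 0.0
  qv, dv = list(query_vec), list(doc_vec)
  dot = sum(a * b for a, b in zip(qv, dv))
  norm = math.sqrt(sum(a * a for a in qv)) * math.sqrt(sum(b * b for b in dv))
  # Treat a zero vector as maximally distant instead of dividing by zero.
  cosine_dist = 1.0 - dot / norm if norm else 1.0
  return qv + dv + [jaccard, cosine_dist]

features = concat_features(['cancer', 'risk'], ['cancer', 'diet'], [1.0, 0.0], [1.0, 0.0])
print(len(features), features)
```

With vectors of length n on each side, this yields 2n + 2 features regardless of whether the inputs are lists or arrays.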

## Generalize Mapping

In [25]:
def generalize_mapping_query_doc_qrel(vsm_type):
  whole_set = ['train', 'dev', 'test']
  for item in whole_set:
    X = [] # Query and Docs
    Y = [] # Qrel
    for query, doc, qrel in dataset[item]:
      X.append(pair_query_docs(query, doc, vsm_type, item))
      Y.append(qrel)
    X = np.array(X)
    Y = np.array(Y)
    if vsm_type == 'lsia':
      lsia_pair_of_X_Y[item] = (X, Y)
    elif vsm_type == 'fasttext':
      fasttext_pair_of_X_Y[item] = (X, Y)

## Mapping for LSI/LSA Model

In [26]:
generalize_mapping_query_doc_qrel('lsia')

  dist = 1.0 - uv / np.sqrt(uu * vv)


In [27]:
print(lsia_pair_of_X_Y['train'][0].shape)
print(lsia_pair_of_X_Y['train'][1].shape)
print(lsia_pair_of_X_Y['dev'][0].shape)
print(lsia_pair_of_X_Y['dev'][1].shape)
print(lsia_pair_of_X_Y['test'][0].shape)
print(lsia_pair_of_X_Y['test'][1].shape)

(28277, 502)
(28277,)
(3170, 502)
(3170,)
(3210, 502)
(3210,)


## Mapping for FastText Model

In [28]:
generalize_mapping_query_doc_qrel('fasttext')

In [29]:
print(fasttext_pair_of_X_Y['train'][0].shape)
print(fasttext_pair_of_X_Y['train'][1].shape)
print(fasttext_pair_of_X_Y['dev'][0].shape)
print(fasttext_pair_of_X_Y['dev'][1].shape)
print(fasttext_pair_of_X_Y['test'][0].shape)
print(fasttext_pair_of_X_Y['test'][1].shape)

(28277, 100)
(28277,)
(3170, 100)
(3170,)
(3210, 100)
(3210,)


# Build the Ranker Model (LightGBM)

## LSI/LSA

In [244]:
ranker_model = lgb.LGBMRanker(
    learning_rate=.002,
    objective='lambdarank',
    num_leaves=40,
    importance_type='gain'
)

### Train

In [245]:
lsia_ranker_model = ranker_model.fit(
    lsia_pair_of_X_Y['train'][0],
    lsia_pair_of_X_Y['train'][1],
    eval_metric='auc',
    eval_set=[(lsia_pair_of_X_Y['dev'][0], lsia_pair_of_X_Y['dev'][1])],
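    # Note: this passes one group covering all dev rows; per-query sizes would be [query_num_of_docs_each_type['dev']]
    eval_group=[[lsia_pair_of_X_Y['dev'][0].shape[0]]],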
    group=query_num_of_docs_each_type['train'],
    eval_at=[5, 7, 10, 20],
    verbose=10
)

[10]	valid_0's ndcg@5: 0.709291	valid_0's ndcg@7: 0.656084	valid_0's ndcg@10: 0.648599	valid_0's ndcg@20: 0.751899	valid_0's auc: 0.672475
[20]	valid_0's ndcg@5: 0.683916	valid_0's ndcg@7: 0.659901	valid_0's ndcg@10: 0.727685	valid_0's ndcg@20: 0.695336	valid_0's auc: 0.669821
[30]	valid_0's ndcg@5: 0.683916	valid_0's ndcg@7: 0.659901	valid_0's ndcg@10: 0.727685	valid_0's ndcg@20: 0.695336	valid_0's auc: 0.671874
[40]	valid_0's ndcg@5: 0.693733	valid_0's ndcg@7: 0.751782	valid_0's ndcg@10: 0.801253	valid_0's ndcg@20: 0.769115	valid_0's auc: 0.672166
[50]	valid_0's ndcg@5: 0.841558	valid_0's ndcg@7: 0.871589	valid_0's ndcg@10: 0.857507	valid_0's ndcg@20: 0.837552	valid_0's auc: 0.673627
[60]	valid_0's ndcg@5: 0.841558	valid_0's ndcg@7: 0.871589	valid_0's ndcg@10: 0.857507	valid_0's ndcg@20: 0.837552	valid_0's auc: 0.679124
[70]	valid_0's ndcg@5: 0.841558	valid_0's ndcg@7: 0.871589	valid_0's ndcg@10: 0.819647	valid_0's ndcg@20: 0.832849	valid_0's auc: 0.677148
[80]	valid_0's ndcg@5: 0.87
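
The `ndcg@k` columns in the log report normalized discounted cumulative gain at cutoff k. A minimal sketch of the usual formulation, with exponential gain `2^rel - 1` and a `log2` position discount. Treating it as equivalent to the library's internal computation is an assumption, and the labels and scores below are made up:

```python
import math

def dcg_at_k(relevances, k):
  """Discounted cumulative gain over the first k relevance labels."""
  return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(labels, scores, k):
  """NDCG@k for one query: order labels by score, normalize by the ideal order."""
  ranked = [label for _, label in sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
  ideal = dcg_at_k(sorted(labels, reverse=True), k)
  return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 0], [0.9, 0.5, 0.1], k=3))  # scores agree with the labels
print(ndcg_at_k([3, 2, 0], [0.1, 0.5, 0.9], k=3))  # scores reverse the labels
```

Because NDCG is defined per query, it is normally computed within each query group and then averaged.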

### Evaluation

In [246]:
lsia_eval = lsia_ranker_model.predict(lsia_pair_of_X_Y['dev'][0])

In [247]:
result_eval = pd.DataFrame(data={'predicted_ranking': tuple(lsia_eval)})
result_eval.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
2114,0.072052
2091,0.072052
2128,0.072052
2997,0.072052
2126,0.072052
...,...
3144,-0.073947
805,-0.073947
794,-0.073947
2805,-0.073947


### Test

In [248]:
lsia_test = lsia_ranker_model.predict(lsia_pair_of_X_Y['test'][0])

In [249]:
result_test = pd.DataFrame(data={'predicted_ranking': tuple(lsia_test)})
result_test.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
580,0.072052
1498,0.072052
3138,0.072052
1957,0.072052
3136,0.072052
...,...
2016,-0.073974
2184,-0.073974
1873,-0.073974
799,-0.074095


## FastText

In [485]:
ft_ranker_model = lgb.LGBMRanker(
    n_estimators=150,
    learning_rate=.5,
    objective='lambdarank',
    num_leaves=125, #60
    importance_type='gain'
)

### Train

In [486]:
fasttext_ranker_model = ft_ranker_model.fit(
    fasttext_pair_of_X_Y['train'][0],
    fasttext_pair_of_X_Y['train'][1],
    eval_metric='auc',
    eval_set=[(fasttext_pair_of_X_Y['dev'][0], fasttext_pair_of_X_Y['dev'][1])],
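    # Note: this passes one group covering all dev rows; per-query sizes would be [query_num_of_docs_each_type['dev']]
    eval_group=[[fasttext_pair_of_X_Y['dev'][0].shape[0]]],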
    group=query_num_of_docs_each_type['train'],
    eval_at=[5, 7, 10, 20],
    verbose=50
)

[50]	valid_0's ndcg@5: 0.428571	valid_0's ndcg@7: 0.484522	valid_0's ndcg@10: 0.58726	valid_0's ndcg@20: 0.598795	valid_0's auc: 0.534478
[100]	valid_0's ndcg@5: 0.73122	valid_0's ndcg@7: 0.729807	valid_0's ndcg@10: 0.707628	valid_0's ndcg@20: 0.635982	valid_0's auc: 0.573362
[150]	valid_0's ndcg@5: 0.903097	valid_0's ndcg@7: 0.921464	valid_0's ndcg@10: 0.861087	valid_0's ndcg@20: 0.747585	valid_0's auc: 0.586828


### Evaluation

In [487]:
fasttext_eval = fasttext_ranker_model.predict(fasttext_pair_of_X_Y['dev'][0])

In [488]:
ft_result_eval = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_eval)})
ft_result_eval.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
1269,4.845788
2900,4.827874
1341,4.750932
448,4.742551
2141,4.698156
...,...
3050,-1.932526
408,-1.963946
316,-1.998856
2456,-2.681535


### Test

In [489]:
fasttext_test = fasttext_ranker_model.predict(fasttext_pair_of_X_Y['test'][0])

In [490]:
ft_result_test = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_test)})
ft_result_test.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
1215,4.514802
2479,4.372701
1908,4.298482
2973,4.211498
2482,4.188819
...,...
1802,-0.417085
2436,-0.508816
2407,-0.513897
2419,-0.550725


# Build the Ranker Model (XGBoost)

## LSI/LSA

In [393]:
lsia_xgb_ranker_model = xgb.XGBRanker(
    n_estimators=200,
    max_depth=100,
    max_leaves=40,
    learning_rate=.01,
    verbosity=1,
    objective='rank:pairwise',
    booster='gbtree',
    tree_method='hist',
    eval_metric=['ndcg@5', 'ndcg@7', 'ndcg@10', 'ndcg@20']
)

### Train

In [394]:
lsia_xgb_train_model = lsia_xgb_ranker_model.fit(
    lsia_pair_of_X_Y['train'][0],
    lsia_pair_of_X_Y['train'][1],
    eval_set=[(lsia_pair_of_X_Y['dev'][0], lsia_pair_of_X_Y['dev'][1])],
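    # Note: this passes one group covering all dev rows; per-query sizes would be [query_num_of_docs_each_type['dev']]
    eval_group=[[lsia_pair_of_X_Y['dev'][0].shape[0]]],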
    group=query_num_of_docs_each_type['train'],
    verbose=50
)

[0]	eval_0-ndcg@5:0.600449	eval_0-ndcg@7:0.597643	eval_0-ndcg@10:0.563946	eval_0-ndcg@20:0.656623
[50]	eval_0-ndcg@5:0.428571	eval_0-ndcg@7:0.536879	eval_0-ndcg@10:0.629182	eval_0-ndcg@20:0.659043
[100]	eval_0-ndcg@5:0.625824	eval_0-ndcg@7:0.640794	eval_0-ndcg@10:0.674527	eval_0-ndcg@20:0.699544
[150]	eval_0-ndcg@5:0.877722	eval_0-ndcg@7:0.792591	eval_0-ndcg@10:0.759714	eval_0-ndcg@20:0.775821
[199]	eval_0-ndcg@5:0.916532	eval_0-ndcg@7:0.879995	eval_0-ndcg@10:0.809706	eval_0-ndcg@20:0.777479


### Evaluation

In [405]:
lsia_xgb_eval = lsia_xgb_train_model.predict(lsia_pair_of_X_Y['dev'][0])

In [406]:
result_xgb_eval = pd.DataFrame(data={'predicted_ranking': tuple(lsia_xgb_eval)})
result_xgb_eval.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
2128,1.633784
1074,1.628641
2117,1.623148
3093,1.621692
1187,1.614894
...,...
1365,-0.164220
124,-0.165590
1188,-0.169109
2157,-0.174519


### Test

In [407]:
lsia_xgb_test = lsia_xgb_train_model.predict(lsia_pair_of_X_Y['test'][0])

In [408]:
result_xgb_test = pd.DataFrame(data={'predicted_ranking': tuple(lsia_xgb_test)})
result_xgb_test.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
1088,1.636737
1084,1.631397
2517,1.627605
1087,1.617097
1050,1.616501
...,...
1806,-0.278757
751,-0.281220
427,-0.285314
1886,-0.287520


## FastText

In [466]:
ft_xgb_ranker_model = xgb.XGBRanker(
    n_estimators=150,
    max_depth=125,
    max_leaves=100,
    learning_rate=.05,
    verbosity=1,
    objective='rank:ndcg',
    booster='gbtree',
    tree_method='hist',
    eval_metric=['ndcg@5', 'ndcg@7', 'ndcg@10', 'ndcg@20']
)

### Train

In [458]:
ft_xgb_train_model = ft_xgb_ranker_model.fit(
    fasttext_pair_of_X_Y['train'][0],
    fasttext_pair_of_X_Y['train'][1],
    eval_set=[(fasttext_pair_of_X_Y['dev'][0], fasttext_pair_of_X_Y['dev'][1])],
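    # Note: this passes one group covering all dev rows; per-query sizes would be [query_num_of_docs_each_type['dev']]
    eval_group=[[fasttext_pair_of_X_Y['dev'][0].shape[0]]],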
    group=query_num_of_docs_each_type['train'],
    verbose=50
)

[0]	eval_0-ndcg@5:0.925026	eval_0-ndcg@7:0.830929	eval_0-ndcg@10:0.788596	eval_0-ndcg@20:0.700717
[50]	eval_0-ndcg@5:0.771178	eval_0-ndcg@7:0.758598	eval_0-ndcg@10:0.692822	eval_0-ndcg@20:0.640588
[100]	eval_0-ndcg@5:0.647752	eval_0-ndcg@7:0.630591	eval_0-ndcg@10:0.664542	eval_0-ndcg@20:0.653739
[149]	eval_0-ndcg@5:0.925026	eval_0-ndcg@7:0.883286	eval_0-ndcg@10:0.735714	eval_0-ndcg@20:0.731674


### Evaluation

In [459]:
fasttext_xgb_eval = ft_xgb_train_model.predict(fasttext_pair_of_X_Y['dev'][0])

In [460]:
result_xgb_eval = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_xgb_eval)})
result_xgb_eval.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
471,1.762225
468,1.747983
1269,1.741430
450,1.741422
3089,1.732921
...,...
2579,-0.885305
1200,-0.888180
1207,-0.900328
1202,-0.913183


### Test

In [461]:
fasttext_xgb_test = ft_xgb_train_model.predict(fasttext_pair_of_X_Y['test'][0])

In [462]:
result_xgb_test = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_xgb_test)})
result_xgb_test.sort_values('predicted_ranking', ascending=False)

index,predicted_ranking
634,1.894096
56,1.802026
1159,1.774808
2300,1.722766
2998,1.717169
...,...
1121,-0.680057
1324,-0.691025
1323,-0.695994
2287,-0.739756


# Predict Ranking for Unseen Pairs

In [505]:
query = "how much cancer risk can be avoided through lifestyle change ?"

docs =[("D1", "dietary restriction reduces insulin-like growth factor levels modulates apoptosis cell proliferation tumor progression num defici pubmed ncbi abstract diet contributes one-third cancer deaths western world factors diet influence cancer elucidated reduction caloric intake dramatically slows cancer progression rodents major contribution dietary effects cancer insulin-like growth factor igf-i lowered dietary restriction dr humans rats igf-i modulates cell proliferation apoptosis tumorigenesis mechanisms protective effects dr depend reduction multifaceted growth factor test hypothesis igf-i restored dr ascertain lowering igf-i central slowing bladder cancer progression dr heterozygous num deficient mice received bladder carcinogen p-cresidine induce preneoplasia confirmation bladder urothelial preneoplasia mice divided groups ad libitum num dr num dr igf-i igf-i/dr serum igf-i lowered num dr completely restored igf-i/dr-treated mice recombinant igf-i administered osmotic minipumps tumor progression decreased dr restoration igf-i serum levels dr-treated mice increased stage cancers igf-i modulated tumor progression independent body weight rates apoptosis preneoplastic lesions num times higher dr-treated mice compared igf/dr ad libitum-treated mice administration igf-i dr-treated mice stimulated cell proliferation num fold hyperplastic foci conclusion dr lowered igf-i levels favoring apoptosis cell proliferation ultimately slowing tumor progression mechanistic study demonstrating igf-i supplementation abrogates protective effect dr neoplastic progression"), 
       ("D2", "study hard as your blood boils"), 
       ("D3", "processed meats risk childhood leukemia california usa pubmed ncbi abstract relation intake food items thought precursors inhibitors n-nitroso compounds noc risk leukemia investigated case-control study children birth age num years los angeles county california united states cases ascertained population-based tumor registry num num controls drawn friends random-digit dialing interviews obtained num cases num controls food items principal interest breakfast meats bacon sausage ham luncheon meats salami pastrami lunch meat corned beef bologna hot dogs oranges orange juice grapefruit grapefruit juice asked intake apples apple juice regular charcoal broiled meats milk coffee coke cola drinks usual consumption frequencies determined parents child risks adjusted risk factors persistent significant associations children's intake hot dogs odds ratio num num percent confidence interval ci num num num hot dogs month trend num fathers intake hot dogs num ci num num highest intake category trend num evidence fruit intake provided protection results compatible experimental animal literature hypothesis human noc intake leukemia risk potential biases data study hypothesis focused comprehensive epidemiologic studies warranted"), 
       ("D4", "long-term effects calorie protein restriction serum igf num igfbp num concentration humans summary reduced function mutations insulin/igf-i signaling pathway increase maximal lifespan health span species calorie restriction cr decreases serum igf num concentration num protects cancer slows aging rodents long-term effects cr adequate nutrition circulating igf num levels humans unknown report data long-term cr studies num num years showing severe cr malnutrition change igf num igf num igfbp num ratio levels humans contrast total free igf num concentrations significantly lower moderately protein-restricted individuals reducing protein intake average num kg num body weight day num kg num body weight day num weeks volunteers practicing cr resulted reduction serum igf num num ng ml num num ng ml num findings demonstrate unlike rodents long-term severe cr reduce serum igf num concentration igf num igfbp num ratio humans addition data provide evidence protein intake key determinant circulating igf num levels humans suggest reduced protein intake important component anticancer anti-aging dietary interventions"), 
       ("D5", "cancer preventable disease requires major lifestyle abstract year num million americans num million people worldwide expected diagnosed cancer disease commonly believed preventable num num cancer cases attributed genetic defects remaining num num roots environment lifestyle lifestyle factors include cigarette smoking diet fried foods red meat alcohol sun exposure environmental pollutants infections stress obesity physical inactivity evidence cancer-related deaths num num due tobacco num num linked diet num num due infections remaining percentage due factors radiation stress physical activity environmental pollutants cancer prevention requires smoking cessation increased ingestion fruits vegetables moderate alcohol caloric restriction exercise avoidance direct exposure sunlight minimal meat consumption grains vaccinations regular check-ups review present evidence inflammation link agents/factors cancer agents prevent addition provide evidence cancer preventable disease requires major lifestyle")]

# for comparison only, a hint: D3 & D5 are relevant, D1 & D4 are partially relevant, D2 is not relevant

# convert to NumPy array format
X_unseen_lsia = []
X_unseen_ft = []
type_doc = 'test'
for doc_id, doc in docs:
  X_unseen_lsia.append(pair_query_docs(query.split(), doc.split(), 'lsia', type_doc))
  X_unseen_ft.append(pair_query_docs(query.split(), doc.split(), 'fasttext', type_doc))

X_unseen_lsia = np.array(X_unseen_lsia)
X_unseen_ft = np.array(X_unseen_ft)

## LGBM

In [506]:
# compute scores
scores_lsia = lsia_ranker_model.predict(X_unseen_lsia)
scores_ft = fasttext_ranker_model.predict(X_unseen_ft)
print(scores_lsia, scores_ft)

[-0.0226168  -0.05702923 -0.01493181 -0.03620556 -0.00832089] [1.5232241  1.2530128  1.48885115 2.17910539 3.14123633]


### LSI/LSA Model

In [507]:
# Ranking on the SERP

# for comparison only, a hint: D3 & D5 are relevant, D1 & D4 are partially relevant, D2 is not relevant
# does LambdaMART manage to reflect this?

did_scores = [x for x in zip([did for (did, _) in docs], scores_lsia)]
sorted_did_scores = sorted(did_scores, key = lambda tup: tup[1], reverse = True)

print("query        :", query)
print("SERP/Ranking :")
for (did, score) in sorted_did_scores:
  print(did, score)

query        : how much cancer risk can be avoided through lifestyle change ?
SERP/Ranking :
D5 -0.008320891214572976
D3 -0.014931807027754197
D1 -0.02261679813057083
D4 -0.03620555818514738
D2 -0.05702923217172881
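
The SERP cells all pair document IDs with scores and sort them in descending order. That step can be factored into a small helper; `rank_documents` is a hypothetical name, and the scores below are the rounded LSI/LSA LightGBM scores printed earlier:

```python
def rank_documents(doc_ids, scores):
  """Return (doc_id, score) pairs sorted from highest to lowest score."""
  return sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)

doc_ids = ['D1', 'D2', 'D3', 'D4', 'D5']
scores = [-0.0226168, -0.05702923, -0.01493181, -0.03620556, -0.00832089]
for doc_id, score in rank_documents(doc_ids, scores):
  print(doc_id, score)
```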


### FastText

In [508]:
# Ranking on the SERP

# for comparison only, a hint: D3 & D5 are relevant, D1 & D4 are partially relevant, D2 is not relevant
# does LambdaMART manage to reflect this?

did_scores = [x for x in zip([did for (did, _) in docs], scores_ft)]
sorted_did_scores = sorted(did_scores, key = lambda tup: tup[1], reverse = True)

print("query        :", query)
print("SERP/Ranking :")
for (did, score) in sorted_did_scores:
  print(did, score)

query        : how much cancer risk can be avoided through lifestyle change ?
SERP/Ranking :
D5 3.1412363252461044
D4 2.179105393601496
D1 1.5232241029450662
D3 1.488851154743188
D2 1.2530128029008056


## XGBoost

In [509]:
# compute scores
scores_lsia_xgb = lsia_xgb_train_model.predict(X_unseen_lsia)
scores_ft_xgb = ft_xgb_train_model.predict(X_unseen_ft)
print(scores_lsia_xgb, scores_ft_xgb)

[0.5225571  0.07551515 0.66568536 0.47114003 0.8186902 ] [-0.19637197 -0.0947364  -0.04299712 -0.06259704  0.9255079 ]


### LSI/LSA Model

In [510]:
# Ranking on the SERP

# for comparison only, a hint: D3 & D5 are relevant, D1 & D4 are partially relevant, D2 is not relevant
# does LambdaMART manage to reflect this?

did_scores = [x for x in zip([did for (did, _) in docs], scores_lsia_xgb)]
sorted_did_scores = sorted(did_scores, key = lambda tup: tup[1], reverse = True)

print("query        :", query)
print("SERP/Ranking :")
for (did, score) in sorted_did_scores:
  print(did, score)

query        : how much cancer risk can be avoided through lifestyle change ?
SERP/Ranking :
D5 0.8186902
D3 0.66568536
D1 0.5225571
D4 0.47114003
D2 0.07551515


### FastText

In [511]:
# Ranking on the SERP

# for comparison only, a hint: D3 & D5 are relevant, D1 & D4 are partially relevant, D2 is not relevant
# does LambdaMART manage to reflect this?

did_scores = [x for x in zip([did for (did, _) in docs], scores_ft_xgb)]
sorted_did_scores = sorted(did_scores, key = lambda tup: tup[1], reverse = True)

print("query        :", query)
print("SERP/Ranking :")
for (did, score) in sorted_did_scores:
  print(did, score)

query        : how much cancer risk can be avoided through lifestyle change ?
SERP/Ranking :
D5 0.9255079
D3 -0.04299712
D4 -0.06259704
D2 -0.0947364
D1 -0.19637197
