<a href="https://colab.research.google.com/github/faisaladisoe/ir-tp/blob/master/TP4/source%20code/IR_TP4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning-To-Rank (LETOR) menggunakan monoBERT

# Install Libraries

In [1]:
!pip install gensim
!pip install lightgbm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [91]:
import os
import random
import numpy as np
import pandas as pd
import lightgbm as lgb

from gensim.models import FastText
from gensim.models import LsiModel
from gensim.corpora import Dictionary
from scipy.spatial.distance import cosine

# Data Preparation

## Scraping

source: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/ \\
download: https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz

In [3]:
!wget -c https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz -P data
!tar -xvf data/nfcorpus.tar.gz

--2022-12-03 23:05:21--  https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/nfcorpus.tar.gz
Resolving www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)... 147.142.207.78
Connecting to www.cl.uni-heidelberg.de (www.cl.uni-heidelberg.de)|147.142.207.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31039523 (30M) [application/x-gzip]
Saving to: ‘data/nfcorpus.tar.gz’


2022-12-03 23:05:24 (19.1 MB/s) - ‘data/nfcorpus.tar.gz’ saved [31039523/31039523]

nfcorpus/
nfcorpus/train.docs
nfcorpus/test.docs
nfcorpus/dev.docs
nfcorpus/dev.3-2-1.qrel
nfcorpus/test.3-2-1.qrel
nfcorpus/train.3-2-1.qrel
nfcorpus/raw/
nfcorpus/raw/doc_dump.txt
nfcorpus/raw/dev.docs.ids
nfcorpus/raw/dev.queries.ids
nfcorpus/raw/test.docs.ids
nfcorpus/raw/test.queries.ids
nfcorpus/raw/train.docs.ids
nfcorpus/raw/train.queries.ids
nfcorpus/raw/stopwords.large
nfcorpus/raw/nfdump.txt
nfcorpus/raw/all_videos.ids
nfcorpus/raw/nontopics.ids
nfcorpus/test.2-1-0.qrel
nfcorpus/dev.2-1-0.qrel
nfc

**Combination for training purposes:**


1.   nfcorpus/train.docs
2.   nfcorpus/train.3-2-1.qrel
3.   nfcorpus/train.vid-desc.queries

**Combination for optimizing purposes:**

1.   nfcorpus/dev.docs
2.   nfcorpus/dev.3-2-1.qrel
3.   nfcorpus/dev.vid-desc.queries

**Combination for testing purposes:**

1.   nfcorpus/test.docs
2.   nfcorpus/test.3-2-1.qrel
3.   nfcorpus/test.vid-desc.queries

## Preprocessing

In [4]:
!head -10 nfcorpus/train.docs

MED-10	statin breast cancer survival nationwide cohort study finland abstract recent studies suggested statins established drug group prevention cardiovascular mortality delay prevent breast cancer recurrence effect disease-specific mortality remains unclear evaluated risk breast cancer death statin users population-based cohort breast cancer patients study cohort included newly diagnosed breast cancer patients finland num num num cases identified finnish cancer registry information statin diagnosis obtained national prescription database cox proportional hazards regression method estimate mortality statin users statin time-dependent variable total num participants statins median follow-up num years diagnosis range num num years num participants died num num due breast cancer adjustment age tumor characteristics treatment selection post-diagnostic pre-diagnostic statin lowered risk breast cancer death hr num num ci num num hr num num ci num num risk decrease post-diagnostic statin affe

In [5]:
!head -10 nfcorpus/train.vid-desc.queries

PLAIN-2427	diet and exercise synergize to improve endothelial function , the ability of our arteries to relax normally .
PLAIN-2428	the parable of the tiny parachute explains the study that found no relationship between dietary fiber intake and diverticulosis .
PLAIN-2431	pbde fire retardant chemicals in the food supply may contribute to attention and cognitive deficits in children .
PLAIN-2432	peppermint essential oil should be considered the first-line treatment for ibs .
PLAIN-2433	the reversal of blindness due to hypertension and diabetes with dr. kempner ’ s rice and fruit diet demonstrates the power of diet to exceed the benefits of the best modern medicine and surgery has to offer .
PLAIN-2434	squatting and leaning can help straighten the anorectal angle , but a healthy enough diet should make bowel movements effortless regardless of positioning .
PLAIN-2435	most people have between 3 bowel movements a day and 3 a week , but normal doesn ’ t necessarily mean optimal .
PLAIN-2436

In [6]:
!head -10 nfcorpus/train.3-2-1.qrel

PLAIN-3	0	MED-2436	3
PLAIN-3	0	MED-2437	3
PLAIN-3	0	MED-2438	3
PLAIN-3	0	MED-2439	3
PLAIN-3	0	MED-2440	3
PLAIN-3	0	MED-2427	2
PLAIN-3	0	MED-2428	2
PLAIN-3	0	MED-2429	2
PLAIN-3	0	MED-2430	2
PLAIN-3	0	MED-2431	2


### Mapping

#### Variable Initialization

In [7]:
docs = {}
queries = {}
qrels = {}
dataset = {}
query_num_of_docs_each_type = {}

#### Document dataset

In [8]:
def map_docs(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.docs', 'r') as file:
    for line in file:
      doc_id, content = line.split('\t')
      temporary_map[doc_id] = [item for item in content.split() if item.isalnum()]
  docs[type_of_docs] = temporary_map

#### Query dataset

In [9]:
def map_queries(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.vid-desc.queries', 'r') as file:
    for line in file:
      query_id, content = line.split('\t')
      temporary_map[query_id] = [item for item in content.split() if item.isalnum()]
  queries[type_of_docs] = temporary_map

#### Query Relevance dataset

In [10]:
def map_qrels(type_of_docs):
  temporary_map = {}
  with open(f'nfcorpus/{type_of_docs}.3-2-1.qrel', 'r') as file:
    for line in file:
      query_id, _, doc_id, qrel = line.split('\t')
      if (query_id in queries[type_of_docs]) and (doc_id in docs[type_of_docs]):
        try:
          temporary_map[query_id].append((doc_id, int(qrel)))
        except:
          temporary_map[query_id] = []
          temporary_map[query_id].append((doc_id, int(qrel)))
  qrels[type_of_docs] = temporary_map

#### Count number of docs in each query

In [11]:
def map_qid_num_of_docs(type_of_docs):
  query_num_of_docs = []
  combination_qid_did_qrel = []
  for query_id in qrels[type_of_docs]:
    content = qrels[type_of_docs][query_id]
    # Normalization step if length of docs is zero
    query_num_of_docs.append(len(content) + 1)
    for doc_id, qrel in content:
      combination_qid_did_qrel.append((queries[type_of_docs][query_id], docs[type_of_docs][doc_id], qrel))
    # Handle if length of docs is zero
    combination_qid_did_qrel.append((queries[type_of_docs][query_id], random.choice(list(docs[type_of_docs].values())), 0))
  dataset[type_of_docs] = combination_qid_did_qrel
  query_num_of_docs_each_type[type_of_docs] = query_num_of_docs

#### Execution

In [12]:
whole_set = ['train', 'dev', 'test']
for item in whole_set:
  map_docs(item)
  map_queries(item)
  map_qrels(item)
  map_qid_num_of_docs(item)

# Building Word Vectors

In [13]:
os.makedirs('./models', exist_ok = True)
os.makedirs('./models/lsia', exist_ok = True)
os.makedirs('./models/fasttext', exist_ok = True)

In [14]:
dictionaries = {}

## Term-Document Matrix (LSI/LSA)

In [15]:
def td_matrix_lsia(type_of_docs):
  NUM_LATENT_TOPIC = 250
  lsia_dictionary = Dictionary()
  lsia_bow_corpus = [lsia_dictionary.doc2bow(doc, allow_update = True) for doc in docs[type_of_docs].values()]
  lsia_model = LsiModel(lsia_bow_corpus, num_topics = NUM_LATENT_TOPIC)
  dictionaries[type_of_docs] = lsia_dictionary
  lsia_model.save(f'./models/lsia/{type_of_docs}-{NUM_LATENT_TOPIC}.model')

## Term-Context Matrix (FastText)

In [16]:
def tc_matrix_fasttext(type_of_docs):
  VEC_SIZE = 125
  fasttext_corpus = [doc for doc in docs[type_of_docs].values()]
  fasttext_model = FastText(sg = 1)
  fasttext_model.build_vocab(fasttext_corpus)
  fasttext_model.train(fasttext_corpus, model = 'skipgram', vector_size=VEC_SIZE, total_examples = fasttext_model.corpus_count, epochs = 25)
  fasttext_model.save(f'./models/fasttext/{type_of_docs}-{VEC_SIZE}.model')

## Execution

In [17]:
whole_set = ['train', 'dev', 'test']
for item in whole_set:
  td_matrix_lsia(item)
  tc_matrix_fasttext(item)



## Term-Document Vector Representation

In [38]:
dictionary = dictionaries['dev']
lsia_model = LsiModel.load(f'./models/lsia/dev-250.model')
def td_vector_rep(array, type_of_docs):
  representation = [topic_value for (_, topic_value) in lsia_model[dictionary.doc2bow(array)]]
  return representation if len(representation) == 250 else [0.] * 250

In [19]:
print(td_vector_rep(docs['train']["MED-329"], 'train'))
print(td_vector_rep(queries['train']["PLAIN-2435"], 'train'))

[5.921538388996683, 3.1464908098755036, -2.6088675457815707, 0.6764912534166082, -1.7759191679068784, -2.4213520563446207, 1.4294315435180878, 0.48859795462465494, -0.3438670143267466, -1.3699780669453558, -0.7180300448332841, 0.5651768037691284, -1.6727147003268574, 0.2400746277038075, -1.736617342602498, -0.6898083685719456, 0.7722386007817862, 1.048263633859768, -0.984591626974623, 1.0428507094095572, 0.10439204329683373, 1.8149685457397733, 1.571600790640962, -1.5853648965712728, -0.532278984674473, 0.21447480576530714, 1.7156483987378546, -0.06060988256833068, 0.1588719736794296, -0.5081084203259312, 0.11370777925262167, 0.9149711790307414, -1.2577765024310275, -1.603777444309239, 1.278705222583665, -0.4050207749942126, -0.7548043047490174, 1.9489056138977152, -1.31612074540327, 0.06065687967486596, 1.457894467620031, -0.18824591540206712, -0.0472052730267614, -0.3201687378591965, -0.46631885739366435, -1.3736091296145718, -0.8619649476917427, 0.8057455125961769, -0.83804360079320

## Term-Context Vector Representation

In [20]:
fasttext_model = FastText.load('./models/fasttext/train-125.model')
word_vector = fasttext_model.wv
word_vector.most_similar('developmental')

[('developments', 0.8620983362197876),
 ('neurodevelopmental', 0.8536943793296814),
 ('development', 0.8508096933364868),
 ('neurodevelopment', 0.789792537689209),
 ('mental', 0.6455051898956299),
 ('fetal', 0.6157057285308838),
 ('develops', 0.5882867574691772),
 ('psychomotor', 0.577566385269165),
 ('parental', 0.5691647529602051),
 ('prenatal', 0.5677740573883057)]

In [115]:
fasttext_model = FastText.load(f'./models/fasttext/test-125.model')
def tc_vector_rep(word, type_of_docs):
  word_vector = fasttext_model.wv
  return word_vector[word]

In [22]:
print(tc_vector_rep('statin vitro developmental', 'train'))
print(tc_vector_rep('vitro', 'train'))

[ 0.00086178  0.7112148  -0.56431973  0.42318308 -0.15794535 -0.1867541
  0.40705127 -0.08717214  0.13548957 -0.15464574  0.05829882 -0.5756704
  0.38673627 -0.21238707 -0.09336478 -0.23297353 -0.06648219  0.12664953
 -0.18531604 -0.4978059   0.34533757 -0.11824949 -0.1923272   0.05202987
  0.346623   -0.51382685 -0.0627332  -0.01424299  0.34125006  0.53760934
 -0.09107467  0.3487767  -0.3268377  -0.4639472   0.31419155 -0.1751086
  0.06159164  0.4839102  -0.02583551 -0.3152744   0.10026489 -0.00672439
 -0.10624322 -0.11640561 -0.09343895 -0.03869553 -0.60006726 -0.2681943
  0.46182868 -0.2987717   0.15281802 -0.09797607  0.00144322 -0.16126706
 -0.2927824  -0.19079271  0.0624644   0.05589211  0.00779116 -0.20415872
 -0.40874687  0.3592989  -0.3033358   0.16900006  0.39695105 -0.21791461
 -0.40421912  0.1503278   0.24309143 -0.08476384 -0.35813773  0.5578511
 -0.35237592 -0.19394414 -0.09166814 -0.21579579 -0.5203028  -0.27614918
  0.32403713  0.03852601  0.18531284 -0.10021025 -0.1616

# Compute Similarity between Query and Docs

In [23]:
lsia_pair_of_X_Y = {}
fasttext_pair_of_X_Y = {}

## Vector Representation for pair of query and docs

In [24]:
def pair_query_docs(query, doc, vsm_type, type_of_docs):
  if vsm_type == 'lsia':
    vector_of_query = td_vector_rep(query, type_of_docs)
    vector_of_doc = td_vector_rep(doc, type_of_docs)
  elif vsm_type == 'fasttext':
    vector_of_query = tc_vector_rep(' '.join(query), type_of_docs)
    vector_of_doc = tc_vector_rep(' '.join(doc), type_of_docs)
  q = set(query)
  d = set(doc)
  cosine_dist = cosine(vector_of_query, vector_of_doc)
  jaccard_sim = len(q & d) / len(q | d)
  return vector_of_query + vector_of_doc + [jaccard_sim] + [cosine_dist]

## Generalize Mapping

In [123]:
def generalize_mapping_query_doc_qrel(vsm_type):
  whole_set = ['train']
  for item in whole_set:
    X = [] # Query and Docs
    Y = [] # Qrel
    for query, doc, qrel in dataset[item]:
      X.append(pair_query_docs(query, doc, vsm_type, item))
      Y.append(qrel)
    X = np.array(X)
    Y = np.array(Y)
    if vsm_type == 'lsia':
      lsia_pair_of_X_Y[item] = (X, Y)
    elif vsm_type == 'fasttext':
      fasttext_pair_of_X_Y[item] = (X, Y)

## Mapping for LSI/LSA Model

In [39]:
generalize_mapping_query_doc_qrel('lsia')

  dist = 1.0 - uv / np.sqrt(uu * vv)


In [40]:
print(lsia_pair_of_X_Y['train'][0].shape)
print(lsia_pair_of_X_Y['train'][1].shape)
print(lsia_pair_of_X_Y['dev'][0].shape)
print(lsia_pair_of_X_Y['dev'][1].shape)
print(lsia_pair_of_X_Y['test'][0].shape)
print(lsia_pair_of_X_Y['test'][1].shape)

(28277, 502)
(28277,)
(3170, 502)
(3170,)
(3210, 502)
(3210,)


## Mapping for FastText Model

In [116]:
generalize_mapping_query_doc_qrel('fasttext')

In [117]:
print(fasttext_pair_of_X_Y['train'][0].shape)
print(fasttext_pair_of_X_Y['train'][1].shape)
print(fasttext_pair_of_X_Y['dev'][0].shape)
print(fasttext_pair_of_X_Y['dev'][1].shape)
print(fasttext_pair_of_X_Y['test'][0].shape)
print(fasttext_pair_of_X_Y['test'][1].shape)

(28277, 100)
(28277,)
(3170, 100)
(3170,)
(3210, 100)
(3210,)


# Build the Ranker Model

In [252]:
ranker_model = lgb.LGBMRanker(
    learning_rate=.005115,
    objective='lambdarank',
    num_leaves=40,
    importance_type='gain'
)

## LSI/LSA

### Train

In [253]:
lsia_ranker_model = ranker_model.fit(
    lsia_pair_of_X_Y['train'][0],
    lsia_pair_of_X_Y['train'][1],
    eval_metric='auc',
    eval_set=[(lsia_pair_of_X_Y['dev'][0], lsia_pair_of_X_Y['dev'][1])],
    eval_group=[[lsia_pair_of_X_Y['dev'][0].shape[0]]],
    group=query_num_of_docs_each_type['train'],
    eval_at=[5, 7, 10, 20],
    verbose=10
)

[10]	valid_0's auc: 0.687572	valid_0's ndcg@5: 0.428571	valid_0's ndcg@7: 0.480929	valid_0's ndcg@10: 0.491239	valid_0's ndcg@20: 0.582961
[20]	valid_0's auc: 0.683585	valid_0's ndcg@5: 0.587013	valid_0's ndcg@7: 0.556983	valid_0's ndcg@10: 0.607419	valid_0's ndcg@20: 0.637517
[30]	valid_0's auc: 0.674721	valid_0's ndcg@5: 0.608942	valid_0's ndcg@7: 0.627112	valid_0's ndcg@10: 0.623897	valid_0's ndcg@20: 0.65584
[40]	valid_0's auc: 0.673959	valid_0's ndcg@5: 0.608942	valid_0's ndcg@7: 0.630705	valid_0's ndcg@10: 0.664633	valid_0's ndcg@20: 0.701784
[50]	valid_0's auc: 0.666135	valid_0's ndcg@5: 0.608942	valid_0's ndcg@7: 0.630705	valid_0's ndcg@10: 0.664633	valid_0's ndcg@20: 0.720564
[60]	valid_0's auc: 0.663982	valid_0's ndcg@5: 0.697352	valid_0's ndcg@7: 0.702358	valid_0's ndcg@10: 0.684145	valid_0's ndcg@20: 0.754548
[70]	valid_0's auc: 0.663436	valid_0's ndcg@5: 0.780819	valid_0's ndcg@7: 0.766412	valid_0's ndcg@10: 0.776613	valid_0's ndcg@20: 0.741729
[80]	valid_0's auc: 0.659881

### Evaluation

In [291]:
lsia_eval = lsia_ranker_model.predict(lsia_pair_of_X_Y['dev'][0])

In [292]:
result_eval = pd.DataFrame(data={'predicted_ranking': tuple(lsia_eval)})
result_eval.sort_values('predicted_ranking', ascending=False)

Unnamed: 0,predicted_ranking
1543,0.167020
2178,0.161107
574,0.160931
3095,0.160310
2303,0.160248
...,...
2802,-0.162358
922,-0.162472
1612,-0.163288
800,-0.163642


### Test

In [288]:
lsia_test = lsia_ranker_model.predict(lsia_pair_of_X_Y['test'][0])

In [290]:
result_test = pd.DataFrame(data={'predicted_ranking': tuple(lsia_test)})
result_test.sort_values('predicted_ranking', ascending=False)

Unnamed: 0,predicted_ranking
1491,0.167042
1354,0.166680
1957,0.162015
2517,0.162015
1786,0.161238
...,...
1701,-0.163548
1702,-0.163955
39,-0.164888
721,-0.165067


## FastText

In [281]:
ft_ranker_model = lgb.LGBMRanker(
    learning_rate=.00125, # learning_rate=.00115 actually do better than .00125; but prone to overfit
    objective='lambdarank',
    num_leaves=60,
    importance_type='gain'
)

### Train

In [282]:
fasttext_ranker_model = ft_ranker_model.fit(
    fasttext_pair_of_X_Y['train'][0],
    fasttext_pair_of_X_Y['train'][1],
    eval_metric='auc',
    eval_set=[(fasttext_pair_of_X_Y['dev'][0], fasttext_pair_of_X_Y['dev'][1])],
    eval_group=[[fasttext_pair_of_X_Y['dev'][0].shape[0]]],
    group=query_num_of_docs_each_type['train'],
    eval_at=[5, 7, 10, 20],
    verbose=10
)

[10]	valid_0's auc: 0.534841	valid_0's ndcg@5: 0.66084	valid_0's ndcg@7: 0.616816	valid_0's ndcg@10: 0.585896	valid_0's ndcg@20: 0.614476
[20]	valid_0's auc: 0.53243	valid_0's ndcg@5: 0.71928	valid_0's ndcg@7: 0.72013	valid_0's ndcg@10: 0.73805	valid_0's ndcg@20: 0.682675
[30]	valid_0's auc: 0.549703	valid_0's ndcg@5: 0.819629	valid_0's ndcg@7: 0.797866	valid_0's ndcg@10: 0.798478	valid_0's ndcg@20: 0.717209
[40]	valid_0's auc: 0.551865	valid_0's ndcg@5: 0.841558	valid_0's ndcg@7: 0.735306	valid_0's ndcg@10: 0.751707	valid_0's ndcg@20: 0.719955
[50]	valid_0's auc: 0.550322	valid_0's ndcg@5: 0.925026	valid_0's ndcg@7: 0.802953	valid_0's ndcg@10: 0.728337	valid_0's ndcg@20: 0.743136
[60]	valid_0's auc: 0.55223	valid_0's ndcg@5: 0.804071	valid_0's ndcg@7: 0.841207	valid_0's ndcg@10: 0.758966	valid_0's ndcg@20: 0.735302
[70]	valid_0's auc: 0.554278	valid_0's ndcg@5: 0.887539	valid_0's ndcg@7: 0.800547	valid_0's ndcg@10: 0.762764	valid_0's ndcg@20: 0.788661
[80]	valid_0's auc: 0.559376	vali

### Evaluation

In [293]:
fasttext_eval = fasttext_ranker_model.predict(fasttext_pair_of_X_Y['dev'][0])

In [295]:
result_eval = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_eval)})
result_eval.sort_values('predicted_ranking', ascending=False)

Unnamed: 0,predicted_ranking
1,0.037756
2,0.037239
2141,0.035701
1269,0.035618
2151,0.035115
...,...
2459,-0.032869
2458,-0.032968
2472,-0.033432
2456,-0.033878


### Test

In [296]:
fasttext_eval = fasttext_ranker_model.predict(fasttext_pair_of_X_Y['test'][0])

In [298]:
result_eval = pd.DataFrame(data={'predicted_ranking': tuple(fasttext_eval)})
result_eval.sort_values('predicted_ranking', ascending=False)

Unnamed: 0,predicted_ranking
707,0.041124
703,0.039993
2971,0.039729
2609,0.039402
1167,0.038500
...,...
1664,-0.024250
2412,-0.024355
1662,-0.024823
2032,-0.025479
