<a href="https://colab.research.google.com/github/bhadreshpsavani/NLP-based-Article-Analysis/blob/main/Query_Based_Article_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query Based Article Ranking:

**We have a set of articles and want to rank them according to a given query.**

In this task we use the [MS MARCO dataset](https://github.com/microsoft/MSMARCO-Document-Ranking).

MS MARCO (Microsoft Machine Reading Comprehension) is also used for training and evaluating passage/document ranking models.

The [leaderboard](https://microsoft.github.io/msmarco/) of the dataset shows that the **`LCE loss + HDCT (ensemble)`** model gave the best results as of `2021/01/20`. Details about the LCE loss and the experiments on different model architectures can be found in this [research paper](https://arxiv.org/pdf/2101.08751.pdf).

The original model is based on **bert-base**; the authors call it bert-base-mdoc-hdct. It is available on the [Hugging Face Model Hub](https://huggingface.co/Luyu/bert-base-mdoc-hdct).

### Evaluation Results:
```
MRR @10: 0.434 on Dev. MRR @10: 0.382 on Eval.
```
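
For reference, MRR@10 is conventionally defined as below. The symbols are introduced here for illustration and do not come from the model card: $Q$ is the set of evaluation queries and $\mathrm{rank}_q$ is the position of the first relevant document in the ranked list for query $q$.

$$
\mathrm{MRR@10} = \frac{1}{|Q|} \sum_{q \in Q}
\begin{cases}
\dfrac{1}{\mathrm{rank}_q} & \text{if } \mathrm{rank}_q \le 10 \\
0 & \text{otherwise}
\end{cases}
$$

A query whose first relevant document appears below position 10 therefore contributes nothing to the score.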

If we want to train the model on the entire MS MARCO dataset, training and evaluation scripts are available in the official [GitHub](https://github.com/luyug/Reranker) repository. The authors mention that, because of the large corpus, training takes a long time. I will not train the model; instead I will take a few examples and run inference with the pretrained model.


In [1]:
!pip install -q git+https://github.com/luyug/Reranker.git
!pip install -q datasets



## Load Dataset:
ms_marco is available in two versions, v1.1 and v2.1.

The table below shows the number of examples in each split for both versions.

| name | train | validation | test |
| ---- | ---- | ----- | ---- |
| v1.1 | 82326 | 10047 | 9650 |
| v2.1 | 808731 | 101093 | 101092 |


In [2]:
# load dataset
from datasets import load_dataset
import numpy as np
dataset = load_dataset("ms_marco",  'v1.1')

Downloading and preparing dataset ms_marco/v1.1 (download: 160.88 MiB, generated: 414.48 MiB, post-processed: Unknown size, total: 575.36 MiB) to /root/.cache/huggingface/datasets/ms_marco/v1.1/1.1.0/8378931e642240518368077ec1cc5b794130258f94ed47a957aba95e8910912a...


Dataset ms_marco downloaded and prepared to /root/.cache/huggingface/datasets/ms_marco/v1.1/1.1.0/8378931e642240518368077ec1cc5b794130258f94ed47a957aba95e8910912a. Subsequent calls will reuse this data.


In [3]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})

In [None]:
test_data = dataset['test']

In [None]:
print(test_data['answers'][0])
print(test_data['passages'][0])
print(test_data['query'][0])
print(test_data['query_id'][0])
print(test_data['query_type'][0])
print(test_data['wellFormedAnswers'][0])

In [None]:
test_data['passages'][0]

In [None]:
test_data['passages'][1]

### Observations:
* Each record has a query and a set of candidate passages with their URLs; the `is_selected` array marks which passage should be selected.
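
As a minimal sketch of how `is_selected` lines up with the passages, the helper below pulls the selected passage text out of one record. The record is a hand-written toy with the same nesting as the dataset rows, and `selected_passages`, the `url` key, and all the strings in it are illustrative assumptions rather than real MS MARCO data.

```python
def selected_passages(record):
    """Return the passage texts whose is_selected flag is 1."""
    passages = record["passages"]
    return [
        text
        for text, flag in zip(passages["passage_text"], passages["is_selected"])
        if flag == 1
    ]


# Toy record (hypothetical content) mirroring the structure of one dataset row.
toy_record = {
    "query": "what is a reranker",
    "passages": {
        "is_selected": [0, 1, 0],
        "passage_text": [
            "Bananas are rich in potassium.",
            "A reranker rescores candidate passages for a query.",
            "The weather is cold today.",
        ],
        "url": ["http://a.example", "http://b.example", "http://c.example"],
    },
}

print(selected_passages(toy_record))
```

On the real data the same function could be called as `selected_passages(test_data[0])`.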

## Testing Model:

In [11]:
import torch
if torch.cuda.is_available():
  device='cuda'
else:
  device='cpu'

In [8]:
from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint


In [12]:
rk.to(device)

RerankerForInference(
  (hf_model): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768

In [18]:
def get_scores(index):
  """Score every candidate passage of one test example against its query."""
  example = test_data[index]  # fetch a single row instead of whole columns
  query = example['query']
  passages = example['passages']['passage_text']
  target = example['passages']['is_selected']
  scores = []
  for passage in passages:
    inputs = rk.tokenize(query, passage, return_tensors='pt').to(device)
    with torch.no_grad():  # inference only, no gradients needed
      scores.append(rk(inputs).logits.item())
  # passage indices ordered from highest to lowest score
  score_argsort = np.argsort(scores)[::-1]
  return target, scores, score_argsort

In [19]:
for i in range(10):
  print("Example", i)
  target, scores, score_argsort = get_scores(i)
  print("Targeted Passage:",target)
  print("Predicted Score:",scores)
  print("Sorted Index based on Decreasing Score", score_argsort)
  print()

Example 0
Targeted Passage: [0, 0, 1, 0, 0, 0, 0]
Predicted Score: [-8.834929466247559, -1.7923840284347534, 0.21405844390392303, -8.503401756286621, -2.9099113941192627, -3.4219865798950195, 1.7465814352035522]
Sorted Index based on Decreasing Score [6 2 1 4 5 3 0]

Example 1
Targeted Passage: [0, 1, 0, 0, 0, 0, 0, 0, 0]
Predicted Score: [4.3025383949279785, 6.56196928024292, 3.0258350372314453, 6.598851203918457, 5.434708595275879, 1.148148536682129, -0.011368111707270145, -1.697503685951233, 0.5051721930503845]
Sorted Index based on Decreasing Score [3 1 4 0 2 5 8 6 7]

Example 2
Targeted Passage: [0, 0, 0, 0, 0, 1, 0, 0, 0]
Predicted Score: [2.6888246536254883, 4.24318790435791, 1.2499557733535767, -9.72462272644043, 0.33728525042533875, 5.192683219909668, 3.660550594329834, 4.845767974853516, 5.857205390930176]
Sorted Index based on Decreasing Score [8 5 7 1 6 0 2 4 3]

Example 3
Targeted Passage: [0, 0, 0, 0, 0, 1, 0, 0, 0]
Predicted Score: [-7.311724662780762, -0.907153367996215
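
To read these outputs as a ranking, each passage can be paired with its score and sorted. The sketch below does this in plain Python for the scores printed for Example 0; `rerank` is a hypothetical helper and the passage texts are placeholders, not the real passages.

```python
def rerank(passages, scores):
    """Return (index, passage, score) triples ordered from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, passages[i], scores[i]) for i in order]


# Scores copied from Example 0 above; the passage texts are placeholders.
scores = [
    -8.834929466247559, -1.7923840284347534, 0.21405844390392303,
    -8.503401756286621, -2.9099113941192627, -3.4219865798950195,
    1.7465814352035522,
]
passages = [f"passage {i}" for i in range(len(scores))]

ranked = rerank(passages, scores)
for rank, (idx, text, score) in enumerate(ranked, start=1):
    print(rank, idx, text, round(score, 3))
```

The resulting index order matches the `Sorted Index based on Decreasing Score` line for Example 0, where the selected passage (index 2) lands in second place.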

## Test on a Few Samples:

For ranking problems, scikit-learn's `label_ranking_average_precision_score` is equivalent to mean reciprocal rank (MRR) when each sample has exactly one relevant label, according to the [doc](https://scikit-learn.org/stable/modules/model_evaluation.html).

Let's use it to evaluate a few test samples.
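
For comparison, MRR can also be computed directly. The sketch below is a plain-Python version with an optional cutoff `k` (as in MRR@10); `reciprocal_rank`, `mean_reciprocal_rank`, and the toy examples are illustrative assumptions, not part of the original notebook. Unlike `label_ranking_average_precision_score`, which assigns a score of 1 to a sample with no relevant label, this version counts such a sample as 0, so the two numbers can differ if any example has no selected passage.

```python
def reciprocal_rank(target, scores, k=None):
    """1 / rank of the first relevant item, or 0.0 if none appears in the top k."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if k is not None:
        order = order[:k]
    for rank, idx in enumerate(order, start=1):
        if target[idx] == 1:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(examples, k=None):
    """Average reciprocal rank over (target, scores) pairs."""
    return sum(reciprocal_rank(t, s, k) for t, s in examples) / len(examples)


# Toy (target, scores) pairs, made up for illustration.
examples = [
    ([0, 0, 1], [0.1, 0.5, 0.9]),  # relevant passage ranked 1st
    ([1, 0, 0], [0.2, 0.7, 0.4]),  # relevant passage ranked 3rd
    ([0, 0, 0], [0.3, 0.2, 0.1]),  # no relevant passage
]

print(mean_reciprocal_rank(examples))
print(mean_reciprocal_rank(examples, k=2))
```

With the notebook's helper, the pairs could be built as `[get_scores(i)[:2] for i in range(1000)]`.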

In [20]:
from sklearn.metrics import label_ranking_average_precision_score
from tqdm.notebook import tqdm_notebook

In [23]:
ranking_precisions=[]
for i in tqdm_notebook(range(1000)):
  target, scores, score_argsort = get_scores(i)
  ranking_precision = label_ranking_average_precision_score([target], [scores])
  ranking_precisions.append(ranking_precision)


In [24]:
print("Average Ranking Precision is", (sum(ranking_precisions)/len(ranking_precisions))*100)

Average Ranking Precision is 58.78531746031742


## Observations:
* **The model achieves a label ranking average precision of 58.7853% on the first 1000 test examples of ms_marco v1.1 for the article ranking task.**