<a href="https://colab.research.google.com/github/bhadreshpsavani/NLP-based-Article-Analysis/blob/main/Query_Based_Article_Ranking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Query Based Article Ranking:

**We have a set of articles and want to rank them according to a given query.**

This task uses the [MS MARCO dataset](https://github.com/microsoft/MSMARCO-Document-Ranking).

MS MARCO (Microsoft Machine Reading Comprehension) is also used for training and evaluating passage and document ranking models.

The [Leaderboard](https://microsoft.github.io/msmarco/) for the dataset shows that the **`LCE loss + HDCT (ensemble)`** model gives the best results as of `2021/01/20`. Details about LCE loss and the experiments on different model architectures can be found in this [research paper](https://arxiv.org/pdf/2101.08751.pdf).

The underlying model is **bert-base**; the authors call the checkpoint bert-base-mdoc-hdct. It is available on the [Huggingface Model Hub](https://huggingface.co/Luyu/bert-base-mdoc-hdct).

### Evaluation Results:
```
MRR @10: 0.434 on Dev. MRR @10: 0.382 on Eval.
```
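
MRR@10 is the mean reciprocal rank with a cutoff of 10: for each query, take 1/rank of the first relevant document within the top 10 (or 0 if there is none), then average over all queries. Below is a minimal sketch, assuming each query's ranking is given as a list of 0/1 relevance labels in ranked order. The function name `mrr_at_k` and the toy labels are illustrative and not part of the MS MARCO tooling.

```python
def mrr_at_k(ranked_labels, k=10):
    """Mean reciprocal rank with cutoff k.

    ranked_labels: one list per query, holding 0/1 relevance labels
    ordered from the top-ranked document downwards (hypothetical format).
    """
    total = 0.0
    for labels in ranked_labels:
        for rank, label in enumerate(labels[:k], start=1):
            if label:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_labels) if ranked_labels else 0.0


# Toy rankings: first relevant hit at rank 1, at rank 2, and not found.
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
print(mrr_at_k(toy))  # (1 + 0.5 + 0) / 3
```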

To train the model on the entire MS MARCO dataset, training and evaluation scripts are available in the official [GitHub](https://github.com/luyug/Reranker) repository. The authors mention that, because of the large corpus, training takes a long time. I will not train it; I will take a few examples and run inference with the pretrained model.


In [3]:
!pip install -q git+https://github.com/luyug/Reranker.git
!pip install -q datasets

  Building wheel for reranker (setup.py) ... done


## Load Dataset:
The `ms_marco` dataset is available in two versions, v1.1 and v2.1.

The table below shows the number of examples in each split for both versions.

| name | train | validation | test |
| ---- | ---- | ---- | ---- |
| v1.1 | 82326 | 10047 | 9650 |
| v2.1 | 808731 | 101093 | 101092 |


In [62]:
# load dataset
from datasets import load_dataset
import numpy as np
dataset = load_dataset("ms_marco",  'v1.1')

Reusing dataset ms_marco (/root/.cache/huggingface/datasets/ms_marco/v1.1/1.1.0/8378931e642240518368077ec1cc5b794130258f94ed47a957aba95e8910912a)


In [8]:
dataset

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})

In [13]:
test_data = dataset['test']

In [14]:
print(test_data['answers'][0])
print(test_data['passages'][0])
print(test_data['query'][0])
print(test_data['query_id'][0])
print(test_data['query_type'][0])
print(test_data['wellFormedAnswers'][0])

['Yes']
{'is_selected': [0, 0, 1, 0, 0, 0, 0], 'passage_text': ['We have been feeding our back yard squirrels for the fall and winter and we noticed that a few of them have missing fur. One has a patch missing down his back and under both arms. Also another has some missing on his whole chest. They are all eating and seem to have a good appetite.', 'Critters cannot stand the smell of human hair, so sprinkling a barrier of hair clippings around your garden, or lightly working it into the soil when you plant bulbs, apparently does have some merit. The whole thing kind of makes me laugh. It never occurred to me that we are the ones that stink.', "Spread some human hair around your vegetable and flower gardens. This will scare the squirrels away because humans are predators of squirrels. It is better if the hair hasn't been washed so the squirrels will easily pick up the human scent.", '1 You can sprinkle blood meal around your garden as well. 2  Don’t trap and relocate squirrels. 3  This 

In [15]:
test_data['passages'][0]

{'is_selected': [0, 0, 1, 0, 0, 0, 0],
 'passage_text': ['We have been feeding our back yard squirrels for the fall and winter and we noticed that a few of them have missing fur. One has a patch missing down his back and under both arms. Also another has some missing on his whole chest. They are all eating and seem to have a good appetite.',
  'Critters cannot stand the smell of human hair, so sprinkling a barrier of hair clippings around your garden, or lightly working it into the soil when you plant bulbs, apparently does have some merit. The whole thing kind of makes me laugh. It never occurred to me that we are the ones that stink.',
  "Spread some human hair around your vegetable and flower gardens. This will scare the squirrels away because humans are predators of squirrels. It is better if the hair hasn't been washed so the squirrels will easily pick up the human scent.",
  '1 You can sprinkle blood meal around your garden as well. 2  Don’t trap and relocate squirrels. 3  This i

In [17]:
test_data['passages'][1]

{'is_selected': [0, 1, 0, 0, 0, 0, 0, 0, 0],
 'passage_text': ['The biggest advantage of using fossil fuels is that they can be easily stored and transported from one place to another. Large reserves of coal are therefore taken from the coal mines to the industries which are acres away from the mines. The petroleum is also taken to too far off power stations to produce energy. Fossil fuels are the highest producers of calorific value in terms of energy. This is also one of the reasons why they are still preferred over the renewable sources of energy or the alternative source',
  'Benefits of fossil fuels. Fossil fuels are basically the remains of animals and plants and these are good energy resources. The three main fossil fuels are natural gas, oil, and coal. Fossil fuels are low in cost and are very important resources for our economy. Fossil fuels are used to generate electricity used as fuels for transportation.',
  'Fossil fuels are energy resources that come from the remains of p

### Observations:
* Each record has a query and a different set of passages and URLs, out of which one passage should be selected, as indicated by the `is_selected` array.
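
As a sketch of this structure, the selected passage can be recovered by pairing `is_selected` with `passage_text`. The record below is hand-written to mirror the fields shown above (texts shortened to placeholders), and `selected_passages` is a hypothetical helper name.

```python
def selected_passages(passages):
    """Return (index, text) pairs for passages flagged in `is_selected`."""
    return [
        (i, text)
        for i, (flag, text) in enumerate(
            zip(passages["is_selected"], passages["passage_text"])
        )
        if flag == 1
    ]


# Hand-written stand-in for test_data['passages'][index]
record = {
    "is_selected": [0, 0, 1, 0],
    "passage_text": ["passage A", "passage B", "passage C", "passage D"],
}
print(selected_passages(record))
```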

## Testing Model:

In [21]:
from reranker import RerankerForInference
rk = RerankerForInference.from_pretrained("Luyu/bert-base-mdoc-bm25")  # load checkpoint


In [64]:
def get_scores(index):
  """Score every passage of test example `index` against its query."""
  query = test_data['query'][index]
  passages = test_data['passages'][index]['passage_text']
  target = test_data['passages'][index]['is_selected']  # 1 marks the selected passage
  scores = []
  for passage in passages:
    # score one (query, passage) pair with the reranker
    inputs = rk.tokenize(query, passage, return_tensors='pt')
    score = rk(inputs).logits.item()
    scores.append(score)
  # passage indices ordered from highest to lowest score
  score_argsort = np.argsort(scores)[::-1]
  print("Targeted Passage:",target)
  print("Predicted Score:",scores)
  print("Sorted Index based on Decreasing Score", score_argsort)
  return target, scores, score_argsort
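
`np.argsort(scores)[::-1]` lists passage indices from highest to lowest score. To see where the selected passage landed, that ordering can be inverted into a per-passage rank. The following is a minimal sketch with made-up scores; `rank_of_selected` is a hypothetical helper and not part of the Reranker library.

```python
import numpy as np


def rank_of_selected(target, scores):
    """1-based rank of the first passage flagged in `target`, or None."""
    order = np.argsort(scores)[::-1]              # indices, best score first
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # ranks[i] = rank of passage i
    selected = np.flatnonzero(target)
    return int(ranks[selected[0]]) if selected.size else None


target = [0, 0, 1, 0]
scores = [-3.0, 0.5, 1.2, 4.0]  # made-up logits
print(rank_of_selected(target, scores))
```

A rank of 1 means the model placed the selected passage on top; the reciprocal of this rank is the per-query term that MRR averages.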

In [68]:
for i in range(10):
  print("Example", i)
  get_scores(i)
  print()

Example 0
Targeted Passage: [0, 0, 1, 0, 0, 0, 0]
Predicted Score: [-8.834932327270508, -1.7923879623413086, 0.21405810117721558, -8.503399848937988, -2.9099104404449463, -3.4219794273376465, 1.7465850114822388]
Sorted Index based on Decreasing Score [6 2 1 4 5 3 0]

Example 1
Targeted Passage: [0, 1, 0, 0, 0, 0, 0, 0, 0]
Predicted Score: [4.3025383949279785, 6.56196928024292, 3.0258352756500244, 6.598850250244141, 5.4347100257873535, 1.1481492519378662, -0.01136590912938118, -1.6975057125091553, 0.5051694512367249]
Sorted Index based on Decreasing Score [3 1 4 0 2 5 8 6 7]

Example 2
Targeted Passage: [0, 0, 0, 0, 0, 1, 0, 0, 0]
Predicted Score: [2.688828468322754, 4.243185520172119, 1.2499558925628662, -9.724624633789062, 0.3372823894023895, 5.192684173583984, 3.6605515480041504, 4.845767974853516, 5.857206344604492]
Sorted Index based on Decreasing Score [8 5 7 1 6 0 2 4 3]

Example 3
Targeted Passage: [0, 0, 0, 0, 0, 1, 0, 0, 0]
Predicted Score: [-7.31172513961792, -0.9071551561355