<a href="https://colab.research.google.com/github/ayush-96/msc-data-science/blob/master/information_retrieval/IR_H_M_2025_Exercise2_TEMPLATE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval Exercise 2 Notebook


This is the template notebook for Exercise 2. The specification for the exercise and the corresponding Exercise 2 Quiz submission instance are available on the Moodle page of the course.

This exercise builds upon Exercise 1, and assumes that you are now familiar with the concepts introduced in both Lab 1 and Exercise 1, including:
 - [PyTerrier operators](https://pyterrier.readthedocs.io/en/latest/operators.html)
 - [PyTerrier apply transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html)
 - [PyTerrier pt.Experiment()](https://pyterrier.readthedocs.io/en/latest/experiments.html)


## PyTerrier Setup

First, let's install PyTerrier as usual. We require a specific version of LightGBM. Do not change this version. If you are running locally on Apple Silicon, this won't work, and you should move back to Google Colab.

In [None]:
%pip install -q python-terrier lightgbm==2.2.3 pyterrier-caching

Let's start PyTerrier

In [None]:
import pyterrier as pt

# we require a specific version of LightGBM for this exercise
import lightgbm
assert lightgbm.__version__ == '2.2.3'

We're going to speed things up for you by caching the PL2 results and the standard feature set.

DO NOT be tempted to cache your own feature implementations.

In [None]:
from pyterrier_caching import RetrieverCache, SparseScorerCache

CACHE=True

## Index, Topics & Qrels for Exercise 2

You will need your login & password credentials from Exercise 1. We will again be using the "50pct" and "trec-wt-2004" datasets from Exercise 1.


In [None]:
UNAME="TODO"
PWORD="TODO"

# we will again be using the "50pct" and "trec-wt-2004" datasets
Fiftypct = pt.get_dataset("50pct",  user=UNAME, password=PWORD)
dotgov_topicsqrels = pt.get_dataset("trec-wt-2004")

On the other hand, you will be using a slightly updated index for Exercise 2. It is a bit bigger than the Exercise 1 index, hence it takes about 2-3 minutes to download to Colab.


In [None]:

indexref = Fiftypct.get_index(variant="ex3")
index = pt.IndexFactory.of(indexref, memory=True)


Let's check out the new index. Compared to the index we used for Exercise 1, you can see that this index has `Field Names: [TITLE, ELSE]`, which means that we can provide statistics about how many times each term occurs in the title of each document (the "TITLE" field), vs the rest of the document (the "ELSE" field). Refer to Lecture 7 for more information about fields.

Let's also display the keys in the meta index - this is the metadata that we have stored for each document. You can see that we are storing the "url" and the "body" (content) of the document. These will particularly come in handy for Q2 and Q3 of Exercise 2, respectively.


In [None]:
print(index.getCollectionStatistics())
print("In the meta index: " + str(index.getMetaIndex().getKeys()))

Finally, these are all of the topics and qrels (including the training and validation datasets) that you will need to conduct Exercise 2.

In [None]:
tr_topics = Fiftypct.get_topics("training")
va_topics = Fiftypct.get_topics("validation")

tr_qrels = Fiftypct.get_qrels("training")
va_qrels = Fiftypct.get_qrels("validation")

test_topics = dotgov_topicsqrels.get_topics("hp")
test_qrels = dotgov_topicsqrels.get_qrels("hp")

## Baseline Setup

Here we introduce the `pt.terrier.Retriever` used for our baseline. Note that:
 - We are using PL2 as our weighting model to generate the candidate set of documents to re-rank.
 - We expose more document metadata, namely "url" and "body" for each document retrieved, which you will need to deploy your two new features.
 - By setting `verbose=True`, we display a progress bar while retrieval executes.
 - We cache PL2 to make it faster for reuse in later experiments.

In [None]:
firstpass = pt.terrier.Retriever(index, wmodel="PL2", metadata=["docno", "url", "body"], verbose=True)
if CACHE: # wrap in a cache transformer
    firstpass = RetrieverCache('pl2-cache', firstpass)

Let's see the resulting output. You can see that there are now "url" and "body" attributes for each retrieved document. (A progress bar is also displayed, enabled by `verbose=True`.)

In [None]:
firstpass.search("chemical reactions")

# Standard List of Features

Let's introduce the list of features we need to deploy a baseline learning-to-rank approach.

We again cache the results of FeaturesRetriever to make it faster.

In [None]:
pagerankfile = indexref + "/data-pagerank.oos"

# DO *NOT* CHANGE THIS LIST. Use PyTerrier operators to add features...
features = [
    "SAMPLE", #ie PL2 - this exposes the scores used to obtain the candidate set as a feature
    "WMODEL:SingleFieldModel(BM25,0)", #BM25 title
    "QI:StaticFeature(OIS,%s)" % pagerankfile,
]

stdfeatures = pt.terrier.FeaturesRetriever(index, features, verbose=True)
if CACHE: # wrap in a cache transformer
    stdfeatures = SparseScorerCache('features-cache', stdfeatures, value="features", pickle=True, verbose=True)

stage12 = firstpass >> stdfeatures

This is our feature set. We will be using `FeaturesRetriever` to compute these extra features on the fly. Let's see the output. You can see that there is now a "features" column.

In [None]:
stage12.search("chemical reactions").head(2)

Let's look in more detail at the features. There are 3 numbers for each document. The first is the PL2 score (1.27555456e+01 == 12.7555), the second is the BM25 score computed on the title field, and the third is the PageRank (a link analysis feature, discussed in more detail in Lecture 9).

In [None]:
stage12.search("chemical reactions").head(1).iloc[0]["features"]

# Q1

You now have everything you need to attempt Q1. You will need to refer to the specification, and to PyTerrier's [learning to rank documentation](https://pyterrier.readthedocs.io/en/latest/ltr.html).

You should use a LightGBM LambdaMART implementation (*not* XGBoost), instantiated using the configuration suggested in the PyTerrier documentation.

Hints:
 - You will need to use the provided separate “training” and “validation” topic sets and qrels to train the learning-to-rank.
 - There is no need to vary the configuration of LightGBM from that in the documentation.
 - Training and evaluating a LTR pipeline takes around 5 minutes.

In [None]:
#YOUR SOLUTION

# Q2 - URL Length Features

In this block, please provide your code for Q2 concerning your two URL Length features, namely URL Length by counting slashes (URL-slashes) and URL Length through using the type of the URL (URL-type). The two different URL length features that you will need to implement are detailed in the specification. Do carefully read and follow the Exercise 2 specification before starting the implementation of the features.

Some hints:

 - For computing each of your URL features, you will need to use an appropriate [pt.apply function](https://pyterrier.readthedocs.io/en/latest/apply.html). The dataframe of results obtained from the `firstpass` transformer has all of the information you need. You can see how fast your apply function is by setting `verbose=True`.

 - You can use the `**` PyTerrier operator for combining feature sets.

 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.

 - You may wish to refer to Python's [`urlparse()`](https://docs.python.org/3/library/urllib.parse.html) function.

 - Use Python assertions to test that your feature implementation(s) give the expected results. **Remember that you need to report, alongside your code, all the tests you have conducted to ascertain the code's correctness.**
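
As an illustration of the last two hints, the following is a minimal sketch of how `urlparse()` splits a URL and how assertions can pin down the behaviour you expect. The helper `url_parts` and the example URLs are hypothetical and are not part of the exercise; the sketch deliberately does not compute either URL feature, whose definitions are given in the specification.

```python
from typing import Tuple
from urllib.parse import urlparse


def url_parts(url: str) -> Tuple[str, str, str]:
    """Return (scheme, netloc, path) for a URL string (hypothetical helper)."""
    parsed = urlparse(url)
    return parsed.scheme, parsed.netloc, parsed.path


# With a scheme, the host and the path are separated as expected.
assert url_parts("http://www.example.gov/a/b/index.html") == (
    "http", "www.example.gov", "/a/b/index.html")

# Without a scheme, urlparse() treats the whole string as the path.
assert url_parts("www.example.gov/a/b/") == ("", "", "www.example.gov/a/b/")

# A query string is kept separate from the path.
assert urlparse("http://www.example.gov/search?q=a/b").path == "/search"
```

It is worth checking which form the "url" metadata in this index takes (with or without a scheme) before relying on `netloc` or `path`.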


## Q2 (a) URL-Slashes Feature

In this block you should define your URL-Slashes feature, and **test it**. **Show clearly all the tests** that you have conducted to test that your feature works as expected.

In [None]:
#YOUR SOLUTION

#### (i) URL-Slashes as a PL2 re-ranker

Now you should evaluate your URL-slashes score by re-ranking PL2, without applying learning-to-rank.

Hint:
 - Your reranker should order documents in descending order, i.e. longest URLs first.

You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

#### (ii) URL-Slashes within an LTR model

Now you should evaluate your URL-slashes score as a feature within a new learned model.

Hint:
 - Carefully consider how to integrate your feature into an LTR model, based on your understanding of how a regression tree works.

You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

## Q2 (b) URL Type Feature

In this block you should define your URL Type feature and **test it**. **Show clearly all the tests** you have conducted to test that your feature works as expected.

In [None]:
#YOUR SOLUTION

#### (i) URL Type as a PL2 re-ranker

Now you should evaluate your URL type score by re-ranking PL2, without applying learning-to-rank.

Hint:
 - Your reranker should order documents in descending order, i.e. longest URLs first.

You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

#### (ii) URL Type within an LTR model

Now you should evaluate your URL type score as a feature within a new learned model.

Hint:
 - Carefully consider how to integrate your feature into an LTR model, based on your understanding of how a regression tree works.

You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

# Q3 Proximity Search Feature

Now you will implement a new query-dependent feature, using the MinDist() function, as discussed in the specification. Do carefully **read the Exercise 2 specification** before starting the implementation.

Hints:
 - Again, remember to use assertions to **test** your feature implementations.
 - Refer to the PyTerrier learning to rank documentation concerning `feature_importances_` for obtaining feature importances.
 - For tokenisation of queries and documents, you can simply use Python's [`str.split()`](https://docs.python.org/3.3/library/stdtypes.html#str.split), without any arguments. Do not use any external libraries.
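
To make the tokenisation hint concrete, here is a small sketch of how `str.split()` behaves when called without arguments, together with a hypothetical helper, `token_positions`, that records where each token occurs. Neither is the required MinDist() or avgmindist() implementation, and the requirements in the specification take precedence over anything shown here.

```python
from typing import Dict, List


def token_positions(text: str) -> Dict[str, List[int]]:
    """Map each whitespace-delimited token to its positions (hypothetical helper)."""
    positions: Dict[str, List[int]] = {}
    for i, token in enumerate(text.split()):
        positions.setdefault(token, []).append(i)
    return positions


# split() without arguments collapses runs of whitespace and drops empty strings.
assert "  fermilab   directory\n".split() == ["fermilab", "directory"]

# It does not lowercase tokens or strip punctuation.
assert "Data data, data".split() == ["Data", "data,", "data"]

# Positions are 0-based token offsets.
assert token_positions("a b a c") == {"a": [0, 2], "b": [1], "c": [3]}
```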

As mentioned in the specification, you should implement a function called `avgmindist()`, which takes the text of the query and the text of the document, and returns a score for the document, i.e. it must conform to the following Python specification:
```python
def avgmindist(query : str, document : str) -> float
```

**NB**: There are specific requirements for your implementations of MinDist() and avgmindist() that are detailed in the specification.

In [None]:
#YOUR AVGMINDIST IMPLEMENTATION

def avgmindist(query : str, document : str) -> float:
  #update your implementation here.
  return 0.0

You should test your implementation yourself (you must list, alongside your code, *all* the test cases you deployed to test that your feature works as expected). In addition, to allow us to verify your implementation, we have created 9 test cases. Please run `run_test_cases()` and use its responses to answer the relevant quiz questions.

Hint:
 - Our test cases took around 1-3ms each. If testing your implementation takes orders of magnitude longer, this will impact how long it takes you to train and evaluate your implementation within an LTR pipeline.
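
If you want to measure the speed of a scoring function outside `run_test_cases()`, one possible approach is sketched below using `time.perf_counter()`. The names `time_call_ms` and `placeholder_score` are hypothetical; the placeholder simply returns 0.0 and stands in for your own function.

```python
import time


def time_call_ms(fn, *args, repeats: int = 5) -> float:
    """Return the fastest wall-clock time of fn(*args) in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def placeholder_score(query: str, document: str) -> float:
    """Stand-in for a real scoring function; always returns 0.0."""
    return 0.0


elapsed = time_call_ms(placeholder_score, "a b", "a x b " * 1000)
print("fastest of 5 runs: %.3f ms" % elapsed)
```

Taking the fastest of several runs reduces the effect of one-off delays such as other processes running on the same machine.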


In [None]:
#DO NOT ALTER THIS CELL
TEST_CASES = [
  ('fermilab directory', 45, 567257), #1
  ('webcam', 45, 567257), #2
  ('DOM surface', 384034, 388292), #3
  ('DOM surface', 45, 384034), #4
  ('DOM surface document', 388292, 384034), #5
  ('DOM software AMANDA', 639302, 384034), #6
  ('fermilab directory', 388292, 384034), #7
  ('trigger data', 596532, 639302), #8
  ('underlying hardware', 384034, 333649) #9
]

def run_test_cases():
  import datetime
  docno=0
  body=3
  for i, (query, docid1, docid2) in enumerate(TEST_CASES):
    start = datetime.datetime.now()
    meta1 = index.getMetaIndex().getAllItems(docid1)
    meta2 = index.getMetaIndex().getAllItems(docid2)
    s1 = avgmindist(query, meta1[body])
    s2 = avgmindist(query, meta2[body])
    if s1 > s2:
      result = meta1[docno]
      cmpD = "%s > %s" % (meta1[docno],meta2[docno])
    elif s2 > s1:
      result = meta2[docno]
      cmpD = "%s > %s" % (meta2[docno],meta1[docno])
    else:
      result = "EQUAL"
      cmpD = "%s == %s" % (meta1[docno],meta2[docno])
    end = datetime.datetime.now()
    print("TEST CASE %d result %s time %d ms" % (i+1, result, (end-start).total_seconds()*1000.))

run_test_cases()

You should now integrate your avgmindist() function into a new LTR model, and compare its MAP & P@5 performance to the LTR baseline. You can now answer the corresponding quiz questions.

In [None]:
#YOUR SOLUTION

# Q4 A 5-feature Learning-to-Rank Model

You will now experiment with the LightGBM LambdaMART technique, including both of your added features (URL Type and AvgMinDist) along with the 3 initial features, one of which is the PL2 score used to obtain the initial candidate set (5 features in total).

You need to learn a *new* model when using your final selection of 5 features.

Evaluate the performance of your resulting LTR system in comparison to the LTR baseline and answer the quiz questions. For ease of comparison and readability, you should also display your results for the performance of the 4-feature LTR models.

In [None]:
#YOUR SOLUTION

# That's all Folks

**Submission Instructions:** Complete this notebook. All your answers to Exercise 2 must be submitted on the Exercise 2 Quiz instance on Moodle, together with your completed notebook (showing **both your solutions and the results of their executions**). Note, however, that only answers submitted through the Quiz are marked. Marks can be lost if the notebook does not **show evidence** for the answers reported in the quiz.

While students are asked to submit their solutions through a Quiz, marking will be done with a “human-in-the-loop” and partial marks are awarded depending on the quality of the submitted work.

Your answers to the Quiz questions along with your .ipynb notebook file (showing code and outputs) must be submitted by the stated Exercise 2 deadline.