<a href="https://colab.research.google.com/github/ayush-96/msc-data-science/blob/master/information_retrieval/IR(H_M)_2025_Exercise_1_TEMPLATE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Information Retrieval Exercise 1 Notebook

 - Due Date: Wed 26th February, 4.30pm
 - This is an *Individual Exercise*
 - Anticipated Hours: ~10 hours (assuming you have already completed Lab 1)
 - Submit through Moodle Quiz

# Introduction

In this exercise, building on the previous Lab 1 exercise and the tutorial on PyTerrier, you will further familiarise yourself with PyTerrier by deploying various retrieval approaches and evaluating their impact on retrieval performance. You will also learn how to conduct an experiment in IR and how to analyse its results.


We have provided a medium-sized Web dataset on which you will conduct your experiments (a user agreement is required to access and use this dataset). It is a sample of a TREC Web test collection, of approx. 800k documents, with corresponding topics (i.e. queries) and relevance assessments (i.e. qrels). If you have signed the required user agreement (see Moodle), you will have been emailed a personal login/password to access an index for this collection via PyTerrier's `get_dataset()` function, using the dataset name “50pct” and your personal credentials. DO NOT SHARE YOUR CREDENTIALS.

This is an *individual exercise*. Your work will be submitted through the Exercise 1 Quiz Instance available on Moodle. The Quiz asks you various questions, which you should answer based on the experiments you have conducted. To support you in conducting your work, you are strongly encouraged to make the best use of the two lab sessions dedicated to this exercise.

To help you structure your experiments, the rest of this notebook describes the experiments you need to conduct in relation to four tasks. Once you conduct a task, you should answer the corresponding questions on the Exercise 1 Quiz instance. **Ensure that you click the “Next Page” button to incrementally save your answers on the Quiz instance.**

## Assumed Knowledge

This Exercise assumes knowledge of Pandas and PyTerrier from Lab 1, which you should have completed by now (also see the PyTerrier Tutorial). The relevant parts of the PyTerrier documentation are:
 - [Using Terrier indices in PyTerrier](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html)
 - [Terrier Retrieval using PyTerrier](https://pyterrier.readthedocs.io/en/latest/terrier-retrieval.html), e.g. pt.terrier.Retriever
 - [Operators on PyTerrier transformers](https://pyterrier.readthedocs.io/en/latest/operators.html)




# Setup

NB: Windows users may need to use `%pip install --user python-terrier gensim` - you can ignore warnings about cython, PATH etc. If in doubt, resort to Colab.

In [None]:
%pip install -q python-terrier gensim

In [None]:
import pyterrier as pt

import pandas as pd
pd.set_option('display.max_colwidth', 150)
pd.set_option('display.max_rows', 200)

# Datasets for Ex1

For Exercise 1, we'll be using the Datasets API to obtain the files we need for this exercise. PyTerrier actually provides many datasets. You can list all of them using `pt.list_datasets()`.

In [None]:
pt.list_datasets().head()

There are several sets of files we need for Exercise 1:
 - We need an index for 50% of the TREC GOV corpus. We provide this through the "50pct" dataset, but you will need the username and password that you have been assigned once you accepted the user license agreement.
 - We need the topics (queries) and qrels (relevance assessments) for evaluating the performance of our search engine. These come from the "trec-wt-2004" dataset.

Update your username and password. DO NOT SHARE your login details with other students - all they need to do is to agree to the user agreement on Moodle.



In [None]:
USERNAME = "TODO"
PASSWORD = "TODO"

dotgov_50pct = pt.get_dataset("50pct", user=USERNAME, password=PASSWORD)
dotgov_topicsqrels = pt.get_dataset("trec-wt-2004")

The size of the "50pct" index is 800MB - this will take a minute or so for Colab to download before we load it for the first time. You can read on while it's downloading.

In [None]:
indexref = dotgov_50pct.get_index('ex2')
index = pt.IndexFactory.of(indexref)


# Q1 [2 marks]

Using this setup, you now have sufficient knowledge from Lab 1 to complete this task, namely to get the indexing statistics of the "50pct" collection.

Print the index collection statistics and answer the corresponding Quiz questions by entering the obtained indexing statistics: number of documents, number of terms, number of tokens, number of postings.


In [None]:
#YOUR SOLUTION

# Retrieval & Evaluation

In our experiments, we are using three sets of topics: homepage finding ("hp"), named page finding ("np") and topic distillation ("td"). They correspond to different user information needs on the Web:

- Homepage finding: The user's aim is to find the homepage of a given entity (person, organisation, etc.) - e.g. ‘University of Glasgow’, and the system should return the URL of that site’s homepage at (or near) rank one.

- Named page finding: The user aims to find a particular webpage/document - e.g. ‘UK tax return form’, and the system should return the URL of that page at (or near) rank one.

- Topic distillation: The user aims to find as many relevant webpages as possible about a general topic - e.g. ‘electoral college’, and the system should return and rank highly as many relevant webpages about the topic as possible. Each topic might have many relevant documents (similar to an adhoc search task).


For instance, to load the topics for "hp", you can do the following:

In [None]:
topics = dotgov_topicsqrels.get_topics(variant="hp")
topics.head(5)


Let's create a simple TF_IDF retriever - we will use this for demonstrating IR evaluation using PyTerrier.

In [None]:
retr = pt.terrier.Retriever(index, wmodel="TF_IDF")

Let's see how we can actually evaluate our TF_IDF retrieval system. Firstly, we'll need the qrels.

In [None]:
qrels = dotgov_topicsqrels.get_qrels(variant='hp')

We can use `pt.Evaluate(results, qrels, metrics)` to evaluate the results. The `metrics` argument (with default value `["map", "ndcg"]`) lets you configure the evaluation measures. For example, we can obtain the Mean Average Precision for a set of results `res` using relevance assessments `qrels` as follows:

In [None]:
res = retr.transform(topics)
# named eval_results rather than eval, to avoid shadowing Python's built-in eval()
eval_results = pt.Evaluate(res, qrels, metrics=["map"])
eval_results
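
To make the measure concrete, the following is a minimal sketch of how average precision (AP) and MAP could be computed by hand on toy data. It is an illustration only: the helper names (`average_precision`, `mean_average_precision`) and the toy query and document identifiers are assumptions made for this example, and it is not a description of how PyTerrier computes the measure internally.

```python
def average_precision(ranking, relevant):
    """AP for one query: sum precision@k at each rank k holding a relevant
    document, then divide by the total number of relevant documents."""
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(run, qrels):
    """MAP: the mean of the per-query AP values over the queries in qrels."""
    aps = [average_precision(run.get(qid, []), rel) for qid, rel in qrels.items()]
    return sum(aps) / len(aps)


# Hypothetical toy data: ranked docnos per query, and the relevant docnos per query.
toy_run = {"q1": ["d3", "d1", "d7", "d2"], "q2": ["d5", "d9", "d4"]}
toy_qrels = {"q1": {"d1", "d2"}, "q2": {"d5"}}

toy_map = mean_average_precision(toy_run, toy_qrels)
print(round(toy_map, 4))
```

Here q1 has relevant documents at ranks 2 and 4 (AP = (1/2 + 2/4) / 2 = 0.5) and q2 has its only relevant document at rank 1 (AP = 1.0), so the toy MAP is 0.75.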

However, creating the `res` dataframe for each system in turn, and then evaluating it, is laborious and imperative in nature. We strongly recommend using [`pt.Experiment()`](https://pyterrier.readthedocs.io/en/latest/experiments.html) to evaluate one or more retrieval systems at once, in a declarative manner (see PyTerrier Tutorial).

Take the time to read the [documentation for `pt.Experiment()`](https://pyterrier.readthedocs.io/en/latest/experiments.html) to understand its available functionality. Tasks Q2-Q4 will all require that you adapt the arguments to `pt.Experiment()` and use its output in different ways (e.g. for significance testing).

In [None]:
pt.Experiment(
    [retr],
    dotgov_topicsqrels.get_topics(variant='hp'),
    dotgov_topicsqrels.get_qrels(variant='hp'),
    eval_metrics=['map']
)

# Q2

Now you will experiment with three weighting models (TF_IDF, BM25 and PL2) and analyse their results on 3 different topic sets, representing different Web retrieval tasks: homepage finding (variant “hp”), named page finding (“np”), and topic distillation (“td”). These three topic sets and the corresponding qrels can also be accessed through PyTerrier's `get_dataset()` function, e.g.:

```python
topicsHP = pt.get_dataset("trec-wt-2004").get_topics("hp")
qrelsHP = pt.get_dataset("trec-wt-2004").get_qrels("hp")
```

In particular, we would like to compare the performance of the more advanced BM25 and PL2 term weighting models to that of TF_IDF, which is our baseline here. In other words, do the BM25 and PL2 models significantly improve on the performance of the TF_IDF baseline on the three topic sets?


# Q2(a) [12 marks]

Provide the required MAP performances of each of the weighting models over the 3 topic sets. In particular, for each topic set (hp, np, td), compare the MAP performances of BM25 and PL2 to that of TF_IDF used as a *baseline* and answer whether there are any statistically significant differences (p-value < 0.05) between the 3 models, when prompted by the Quiz. Next, provide the average MAP performance of each weighting model across the three topic sets (i.e. a global average performance over the three topic sets), when prompted by the Quiz instance. **Report your MAP performances rounded to 4 decimal places.**


*Hint*: We encourage you to write your own functions that perform reusable operations across different topic sets.

In [None]:
#YOUR SOLUTION

# Q2(b) [10 marks]

Next, for each topic set (hp, np, td), draw a single recall-precision graph comparing the performances of the 3 used weighting models (TF_IDF, BM25, PL2). Upload the resulting graphs into the Moodle instance when prompted (**check the graphs are readable/complete**). Then, answer the corresponding questions on the Quiz instance.


Hints:
 - You will need to use the `"iprec_at_recall"` measure, which gives precision at a given standard recall value.
 - Matplotlib has a [`savefig()`](https://chartio.com/resources/tutorials/how-to-save-a-plot-to-a-file-using-matplotlib/#the-savefig-method) function for saving a PNG of a figure.
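
As a rough illustration of what an interpolated precision measure reports, the sketch below computes interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0) for a single toy ranking. It assumes the usual textbook definition, whereby the interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r. The function name and the toy data are hypothetical, and the code is not PyTerrier's own implementation.

```python
def interpolated_precision(ranking, relevant, levels=None):
    """Return interpolated precision at each recall level for one query."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

    # Record (recall, precision) each time a relevant document is found.
    points = []
    hits = 0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolated precision at level r: best precision at any recall >= r.
    # The small tolerance guards against floating-point rounding of the levels.
    return [
        max((prec for rec, prec in points if rec >= r - 1e-9), default=0.0)
        for r in levels
    ]


# Hypothetical toy query: relevant documents found at ranks 1 and 4.
toy_curve = interpolated_precision(["d1", "d3", "d7", "d2"], {"d1", "d2"})
print(toy_curve)
```

Plotting such a list against the recall levels, one line per weighting model, gives the shape of a recall-precision graph; in the exercise itself the values should come from `pt.Experiment()` rather than from a hand-written function.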

In [None]:
#YOUR SOLUTION

# Q2(c) [1 mark]

Finally, you should now report on the Quiz the most effective weighting model (in terms of average Mean Average Precision), which you will use for the rest of Exercise 1. To find this model, simply identify the weighting model with the highest average performance over the 3 topic sets.


In [None]:
#YOUR SOLUTION

# Q3 Query Expansion

Query expansion is one of the most well-known and effective techniques for improving the effectiveness of a search engine. We'll be using Terrier's Bo1 query expansion model.

You will now conduct the Query Expansion experiments using the weighting model that produced the highest average Mean Average Precision (MAP) across the 3 topic sets in Q2.

See the [relevant documentation](https://pyterrier.readthedocs.io/en/latest/rewrite.html#bo1queryexpansion) about creating a QE transformer pipeline in PyTerrier using the Bo1 model.





# Q3(a) [6 marks]

For each of the topic sets (i.e. homepage finding (hp), named page finding (np), and topic distillation (td)), run an experiment evaluating the application of query expansion on the best weighting model identified in Q2(c) used here as the *baseline*. Query expansion has a few parameters, e.g. query expansion model, number of documents to analyse, number of expansion terms – in conducting your experiments, you should simply use the default query expansion settings of Terrier: Bo1, 3 documents, 10 expansion terms. Report the obtained MAP performances in the Quiz instance then answer the corresponding questions. **Report your MAP performances rounded to 4 decimal places.**

Recall that the required experiments for evaluating the application of query expansion should be conducted with the best weighting model identified in the previous question Q2(c).

In [None]:
#YOUR SOLUTION

# Q3(b) [9 marks]

Now, you will delve into the performance of the best retrieval model identified in Question Q2(c) using the topic distillation (“td”) topic set. Draw a query-delta bar chart (see example in Lecture 5) comparing the Average Precision (AP) performance of your system with and without query expansion - each bar represents the difference in average precision for a given query between the baseline and after applying QE for that query (i.e. “delta_AP = QE_AP - noQE_AP”). As the topic set has 75 queries, your figure should only show the queries whose absolute change exceeds 0.02 (i.e. |delta_AP| > 0.02), in order to focus on queries that have the biggest positive or negative changes. Your x-axis should be labelled with the qid and the text of the original query, and your queries should be ordered as shown in Lecture 5 (Slide 62). Using the produced bar chart (**check the graph is readable/complete**), and the corresponding data, you should now be able to answer the corresponding questions in the Quiz.

*Hints*:
 - You will need to use the `perquery=True` option for `pt.Experiment()`. You will also need to analyse the expanded queries.
 - You may need a [self-join](https://www.w3schools.com/sql/sql_join_self.asp) on a dataframe.
 - You can iterate through a dataframe using [`dataframe.iterrows()`](https://cmdlinetips.com/2018/12/how-to-loop-through-pandas-rows-or-how-to-iterate-over-pandas-rows/)
 - You can examine the expanded queries by adjusting your pipeline, and executing the pipeline on the relevant topics.
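
The delta computation and the threshold filter can be sketched on toy data as follows. Everything in this example is hypothetical: the qids, query texts and AP values are made up, plain dictionaries stand in for the per-query dataframe, and sorting by decreasing delta is an assumption that you should check against the lecture slide.

```python
# Hypothetical per-query AP values without and with query expansion.
baseline_ap = {"101": 0.30, "102": 0.55, "103": 0.10, "104": 0.42}
qe_ap = {"101": 0.45, "102": 0.54, "103": 0.02, "104": 0.42}
query_text = {
    "101": "toy query a",
    "102": "toy query b",
    "103": "toy query c",
    "104": "toy query d",
}

THRESHOLD = 0.02  # keep only queries with |delta_AP| > 0.02

deltas = []
for qid, base in baseline_ap.items():
    delta = qe_ap[qid] - base  # delta_AP = QE_AP - noQE_AP
    if abs(delta) > THRESHOLD:
        # Label each bar with the qid and the original query text.
        deltas.append((f"{qid}: {query_text[qid]}", delta))

# Assumed ordering: largest improvement first, largest degradation last.
deltas.sort(key=lambda pair: pair[1], reverse=True)

for label, delta in deltas:
    print(f"{label:20s} {delta:+.4f}")
```

The resulting labels and deltas are the two sequences a bar chart needs; queries 102 and 104 are dropped because their change is within the threshold.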

In [None]:
#YOUR SOLUTION

# Q4 Word Embeddings [10 marks]

Next, implement a new query expansion method as a PyTerrier transformer that uses a Word2Vec model for identifying semantically related terms. In particular, your implementation should take each query term in an incoming query, identify the most semantically related words - using a provided Word2Vec model and the Gensim Python library - to add to the query. In conducting this experiment, you need to choose:

-  How many similar terms to identify for each existing query term so as to ensure a *fair comparison* with the experiments conducted in Q3;

-  The relative importance of these new terms compared to the existing query terms;

-  How/if to integrate the Word2Vec cosine distance into your weighting formula.

-  An adequate pipeline to use your custom Word2Vec QE transformer.  

Compare the performance of your model to the **PL2 baseline** on the *topic distillation (“td”)* topic set. How many queries are improved or are degraded? Check how your new query expansion mechanism compares to Terrier’s standard Bo1 query expansion mechanism.


## Background on word2vec

Q4 asks for a word2vec-based query expansion model. Word2vec is a shallow neural network model that learns word embeddings, such that semantically similar words end up with similar embedding vectors.

If you haven't taken Text-as-Data, you can do some background reading about word embeddings at:
 - https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
 - https://en.wikipedia.org/wiki/Word2vec
 - https://en.wikipedia.org/wiki/Word_embedding

In general, while word2vec is still a very widely used model, note that it has been surpassed by more complex models such as BERT. But word2vec is still useful to consider in the context of query expansion.


# Setup of Gensim

In this part of the exercise, we will use [Gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html), a Python toolkit for working with a word2vec model.

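We are providing a pre-trained word embedding model (the `glove-wiki-gigaword-300` GloVe vectors, which Gensim loads through the same `KeyedVectors` interface it uses for word2vec vectors) that Gensim will download and open - the file is quite large, so this might take a few minutes to download and a couple of minutes to load. You can read on while it opens.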

In [None]:
import gensim.downloader as api
%time model = api.load("glove-wiki-gigaword-300")

# Example Usage of Gensim

`model` is of type [gensim.models.keyedvectors.KeyedVectors](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors).

You can think of this as a dictionary mapping string words to the vector embeddings for each word.  For example, we can get the vector for the word `'government'` as follows:

In [None]:
emb = model.get_vector("government")
print(emb.shape)
print(emb)

As you can see, each word is represented by a 300-dimension vector.

We can also ask `model` for the most similar words to `'government'` using [`model.most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar). It returns the 10 most similar words, based on the cosine similarity of their embeddings to that of `'government'`.

See also: [Example in Gensim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#what-can-i-do-with-word-vectors).

In [None]:
model.most_similar("government")

As you can see, some words are clearly related to the original word `'government'`, including some lexical variations (`'governments'`), as well as semantically similar (`"authorities"`) words. You can also see some words that perhaps seem unrelated - probably they are highly weighted because they appeared in similar contexts to `"government"` (e.g. `"saying"`).
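
To illustrate what "most similar by cosine similarity" means, here is a small self-contained sketch using hand-made 3-dimensional vectors. The vectors and the helper names (`cosine_similarity`, `most_similar_toy`) are invented for this example and are unrelated to the real 300-dimension embeddings or to Gensim's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as 0 here
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" (real ones have 300 dimensions).
toy_vectors = {
    "government": [0.9, 0.1, 0.0],
    "governments": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


toy_neighbours = most_similar_toy("government", toy_vectors)
print(toy_neighbours)
```

In this toy space `"governments"` points in nearly the same direction as `"government"` and is ranked first, while `"banana"` is almost orthogonal and receives a similarity close to zero.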

# Now develop your Word2Vec-based Query Expansion

The next task is to use `model` to develop your custom transformer for a word2vec-based query expansion, and use it with PL2.

*Hints about the custom transformer*:
 - Inspired by Pandas, PyTerrier has the notion of [apply functions](https://pyterrier.readthedocs.io/en/latest/apply.html) for applying transformations.
 - What to do with out-of-vocabulary (OOV) words?
 - How many similar terms to identify for each existing query term?
 - How to ensure fair comparison with the experiments conducted in Q3.
 - What is the relative importance of these new terms compared to the existing query terms? e.g. you might want to give more emphasis to the original query terms (See Lecture 6).
 - How/if to integrate the Word2Vec cosine distance into your weighting formula?
 - How to deal with special characters not recognised by the default Terrier query parser, causing a QueryParserException (e.g `/`)?
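
Two of the hints above, handling special characters and weighting terms, can be sketched in plain Python. This is one possible approach rather than a required one: the helper names are hypothetical, the example terms and weights are arbitrary placeholders, and the `term^weight` form assumes the term-weighting syntax of Terrier's query language, which you should confirm in the Terrier documentation. Choosing which terms to add, and with what weights, is deliberately left out.

```python
import re


def sanitise_query(query):
    """Lower-case the query and replace every character that is not a letter,
    digit or whitespace with a space, then split into tokens."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    return cleaned.split()


def format_weighted_query(weighted_terms):
    """Join (term, weight) pairs into a 'term^weight' query string
    (assumed Terrier query-language syntax)."""
    return " ".join(f"{term}^{weight:.2f}" for term, weight in weighted_terms)


# Hypothetical usage: original terms keep weight 1.0, and one placeholder
# expansion term is added with a lower, arbitrarily chosen weight.
tokens = sanitise_query("HIV/AIDS & health-care")
weighted = [(token, 1.0) for token in tokens] + [("medicine", 0.5)]
toy_query = format_weighted_query(weighted)
print(toy_query)
```

Replacing special characters with spaces, rather than deleting them, keeps `hiv/aids` as two tokens instead of merging them into one.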

*Hints about integration*:
 - Think *very* carefully about the required pipeline to use your custom word2vec-based query expansion transformer. It should *not* be used in the same way as Bo1. If your pipeline is very slow, this might be the problem. See also the Quiz questions in the PyTerrier Tutorial on 31st January.

Recall from the start of Q4 that you need to compare the performance of your QE model to the PL2 baseline on the topic distillation (“td”) topic set.

You now have sufficient information to make a start on Q4. In the Quiz instance, insert your source code for your PyTerrier transformer (note that a **2-bands penalty** will be applied if you do not upload the code you used to answer the Quiz questions of Q4). Next, answer the remaining questions in the Quiz. **Ensure that your notebook shows evidence of all work you have done to answer all of the Q4 Quiz questions or marks will be lost.**

In [None]:
#YOUR SOLUTION

# That's all Folks

**Submission Instructions:** Complete this notebook, and answer the related questions in the Exercise 1 Quiz Instance on Moodle. As part of the Quiz, you will be asked to upload your .ipynb notebook (**showing both your solutions and the results of their execution**) and answer questions as per the exercise specification (use File... Download .ipynb).

**IMPORTANT:** Your notebook should indicate **clearly** how your code blocks correspond to each question. Please note that a **2-bands penalty** will be applied, if you do not upload your completed notebook or if you do not show all the results (including plots) obtained from the execution of your solutions. Your completed notebook **MUST** show both your solutions and the results of their executions. The submitted notebook will be used to *spot check* your answers in the Quiz. **Marks can be lost** if the notebook does not show evidence for the reported answers submitted in your Quiz. This exercise is worth 50 marks and 10% of the final course grade.


**NB:** Remember that you can (and should) naturally complete the answers to the quiz over several iterations. However, *please ensure that you save your intermediary work on the Quiz instance by clicking the “Next Page” button every time you make any change in a given page of the quiz and you want it to be saved*.

Your responses to the Quiz along with your ipynb notebook solution must be submitted by **the deadline stated on the Exercise 1 Specification.**