# INFS7410 Week 5 Practical 

##### version 1.0

###### The INFS7410 Teaching Team

##### Tutorial Etiquette
Please refrain from loud noises, irrelevant conversations and use of mobile phones during tutorial activities. Be respectful of everyone's opinions and ideas during the tutorial activities. You will be asked to leave if you disrupt the tutorial. Remember that the tutor is there to help you understand and learn, not to debug your code or provide solutions to assignments.

#### About today's tutorial
In this week's tutorial, you will be learning about and implementing methods for query expansion and reduction using relevance feedback.

----
## Exercise 1: Pseudo-relevance Feedback Query Expansion

When discussing the Binary Independence Model and BM25, we saw that these models can take into account information about relevant (and non-relevant) documents. Similarly, we could adapt other models to make use of statistics from relevant documents. The problem, however, is that in most cases we start with a query but no known relevant documents.

The intuition behind pseudo-relevance feedback (PRF) is to assume that the top n documents retrieved by the search engine in response to the query are relevant. Statistics from these documents can then be used as if they came from relevant documents to perform retrieval again.

We can take this a step further: instead of simply re-weighting the terms of the original query, we can use the pseudo-relevance feedback documents to augment the query with additional terms, in the hope of improving retrieval effectiveness over the original query. This is called pseudo-relevance feedback query expansion, and it works as follows:

1. rank documents using the original query.
2. consider the top n documents only.
3. rank terms in those documents by a weighting scheme: for example we will use tf-idf to rank terms.
4. add the top m terms from the top n documents to the original query.

Note that this mechanism can combine any retrieval model for ranking documents with any retrieval model or weighting scheme for ranking terms.
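The four steps above can be sketched on a tiny in-memory corpus before touching the real index. This is only an illustration under stated assumptions: the corpus, the overlap-based `toy_rank` function and the names `toy_prf_expand`, `DOCS` and `DOC_FREQ` are all made up for this example, and the term weight is a simple tf-idf variant rather than the only valid choice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: docid -> list of (already tokenised) terms.
DOCS = {
    "d1": "information retrieval ranking ranking query".split(),
    "d2": "retrieval ranking index".split(),
    "d3": "index compression".split(),
    "d4": "cats dogs pets".split(),
}
NUM_DOCS = len(DOCS)
# Document frequency: in how many documents each term occurs.
DOC_FREQ = Counter(term for tokens in DOCS.values() for term in set(tokens))


def toy_rank(query_terms):
    """Step 1: rank documents by the number of query-term occurrences (a stand-in for BM25)."""
    scores = {docid: sum(tokens.count(t) for t in query_terms)
              for docid, tokens in DOCS.items()}
    matching = [docid for docid in scores if scores[docid] > 0]
    return sorted(matching, key=lambda docid: (-scores[docid], docid))


def toy_prf_expand(query, n, m):
    query_terms = query.split()
    feedback_docs = toy_rank(query_terms)[:n]           # step 2: top n documents
    weights = defaultdict(float)
    for docid in feedback_docs:                         # step 3: weight candidate terms
        for term, tf in Counter(DOCS[docid]).items():
            if term not in query_terms:
                weights[term] += tf * math.log(NUM_DOCS / (1 + DOC_FREQ[term]))
    # Sort by descending weight, breaking ties alphabetically.
    top_terms = sorted(weights, key=lambda t: (-weights[t], t))[:m]
    return " ".join(query_terms + top_terms)            # step 4: add the top m terms


print(toy_prf_expand("information retrieval", n=2, m=2))
# prints: information retrieval ranking query
```

In this toy run, `ranking` wins because it occurs in both feedback documents, while `index` is penalised for also appearing in a document that was not retrieved.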


First, let's load some packages and tool functions you will need in this practical. You have seen most of them before.

In [1]:
from pyserini.search import SimpleSearcher
from pyserini.analysis import Analyzer, get_lucene_analyzer
import pytrec_eval
from pyserini.index import IndexReader
import math
from collections import Counter



searcher = SimpleSearcher('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')
index_reader = IndexReader('indexes/lucene-index-msmarco-passage-vectors-noProcessing/')
lucene_analyzer = get_lucene_analyzer(stemming=False, stopwords=False)
analyzer = Analyzer(lucene_analyzer)
searcher.set_analyzer(lucene_analyzer)

queries = []
with open("queries.tsv", "r") as f:
    for line in f.readlines():
        parts = line.split("\t")
        # parts[0] ~> topic id
        # parts[1] ~> query
        queries.append((parts[0], parts[1].strip()))
        
def search(run_file: str, k: int=1000):
    with open(run_file, "w") as f:
        for topic_id, query in queries: # topic id is important here.
            hits = searcher.search(query, k=k)
            for i, hit in enumerate(hits):
                # Write the results to our file.
                f.write(f"{topic_id} Q0 {hit.docid} {i} {hit.score} infs7410_w5\n")
                
def print_results(run_file, qrel_file='qrel.txt', measures=["map", "ndcg_cut_10", "recall_1000"]):
    with open(run_file, "r") as f:
        run = pytrec_eval.parse_run(f)
    # Open the qrels file.
    with open(qrel_file, "r") as f:
        msmarco_qrels = pytrec_eval.parse_qrel(f)

    evaluator = pytrec_eval.RelevanceEvaluator(query_relevance=msmarco_qrels, measures=measures)
    results = evaluator.evaluate(run)
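    for measure in sorted(measures):
        # Aggregate the per-query scores into a single value for this measure.
        per_query = [query_measures[measure] for query_measures in results.values()]
        score = pytrec_eval.compute_aggregated_measure(measure, per_query)
        print('{:25s}{:8s}{:.4f}'.format(measure, 'all', score))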
        


Let's first get a baseline run that does not apply any query modification. For this, we use the Pyserini `SimpleSearcher` API, which uses BM25 as its default retrieval model:

In [3]:
search("prac5_k1000.run", k=1000)
print_results('prac5_k1000.run')

map                      all     0.3566
ndcg_cut_10              all     0.4854
recall_1000              all     0.6881


Keep these evaluation scores in mind: we will use them later as the baseline for comparing query expansion and query reduction.

## Your task for this exercise

Implement the below function that does pseudo-relevance feedback query expansion.

As a reminder, follow this algorithm:

1. rank documents using the original query (we have already done this for you with `hits = searcher.search(query, k=50)`, so n can be at most 50).
2. consider the top n documents only.
3. rank terms in those documents by a weighting scheme; here we will use tf-idf.
4. add the top m terms from the top n documents to the original query.

`prf_query_expansion` takes the original `query` string, `n` and `m` as inputs, and returns the expanded query string.

Hints: 

- You may want to use `index_reader.get_document_vector(docid)` to get the term frequencies in a given document.
- You may want to use `index_reader.get_term_counts(term, analyzer=None)[0]` to get the document frequency of a term in the collection.
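One way to turn these two statistics into a term weight, written out as a formula. This is a sketch of one tf-idf variant, not the only valid one: the symbols $D_n$ (the set of top n feedback documents) and $w(t)$ are introduced here for illustration, and adding 1 to the document frequency is a smoothing choice.

$$
w(t) = \sum_{d \in D_n} \mathrm{tf}(t, d) \cdot \log \frac{N}{1 + \mathrm{df}(t)}
$$

Here $\mathrm{tf}(t, d)$ is the frequency of term $t$ in document $d$, $\mathrm{df}(t)$ is the number of documents in the collection that contain $t$, and $N$ is the total number of documents in the collection. Terms are then ranked by $w(t)$, and terms already in the query are skipped.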

In [64]:
N = index_reader.stats()["documents"]

def prf_query_expansion(query: str, n: int, m: int):
    hits = searcher.search(query, k=50)

    # Terms already in the query are not candidates for expansion.
    q_terms = set(analyzer.analyze(query))

    # Accumulated tf-idf weight of each candidate term over the feedback documents.
    rank = {}

    # Guard against queries that return fewer than n hits.
    for i in range(min(n, len(hits))):
        # term -> term frequency in this feedback document
        tf = index_reader.get_document_vector(hits[i].docid)

        for term in tf:
            if term in q_terms:
                continue
            # get_term_counts returns (document frequency, collection frequency).
            df = index_reader.get_term_counts(term, analyzer=None)[0]
            tfidf = tf[term] * math.log(N / (1 + df))
            rank[term] = rank.get(term, 0) + tfidf

    # Keep the m highest-weighted candidate terms.
    top_terms = sorted(rank, key=rank.get, reverse=True)[:m]

    # Append the expansion terms to the original query string.
    return ' '.join([query] + top_terms)

Run the following cell to test your `prf_query_expansion`. Do the added terms make sense to you? Feel free to change the parameters and the original query to experiment.

In [65]:
original_query = "what is information retrieval?"
expanded_query = prf_query_expansion(original_query, 5, 10)
print(f"Original query: {original_query}")
print(f"Expanded query: {expanded_query}")

Original query: what is information retrieval?
Expanded query: what is information retrieval? metadata encoding searching query indexing overload ir improve text process


If you are happy with your `prf_query_expansion`, run the new search function below to evaluate it:

In [99]:
def prf_expansion_search(run_file: str, k: int=1000, n: int=5, m: int=10):
    with open(run_file, "w") as f:
        for topic_id, query in queries: # topic id is important here.
            expanded_query = prf_query_expansion(query, n, m)
            hits = searcher.search(expanded_query, k=k)
            for i, hit in enumerate(hits):
                # Write the results to our file.
                f.write(f"{topic_id} Q0 {hit.docid} {i} {hit.score} infs7410_w5\n")
                
prf_expansion_search("prac5_prf_exp_k1000.run", k=1000, n=10, m=1)
print_results('prac5_prf_exp_k1000.run')




# Earlier runs, recorded as (n, m): MAP, nDCG@10, recall@1000 (scores x100)
# (30, 3):  33 43 73
#
# (20, 20): 29 40 71
# (20, 10): 32 42 72
# (20, 5):  31 42 74
# (20, 3):  34 44 73
# (20, 2):  35 45 75
#
# (10, 10): 30 43 70
# (10, 5):  31 42 71
# (10, 3):  30 41 71
# (10, 1):  33 47 74
#
# (5, 10):  29 45 68
# (5, 5):   31 43 70
# (3, 3):   31 44 66

map                      all     0.3340
ndcg_cut_10              all     0.4711
recall_1000              all     0.7351


**QUESTION:** _Compared with the scores obtained without PRF expansion, which metrics increased and which decreased? Why is this the case?_

----
## Exercise 2: IDF-r Query Reduction

So far we have considered methods that improve the representation of the query by adding terms. Next, we consider removing terms from the query to make it more focused. Removing terms also brings other advantages: with fewer terms there are fewer postings lists to iterate through, so query processing is faster. This matters in particular for long queries.

In this exercise we consider a simple query reduction approach in which terms are ranked by IDF score and the top n terms (i.e., those with the highest IDF, the most discriminative ones) are kept.
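To make the idea concrete, here is a minimal sketch of IDF-based reduction using made-up collection statistics. The numbers in `DOC_FREQ`, the collection size `NUM_DOCS` and the function name `toy_idfr` are assumptions for illustration only; against the real index, the document frequencies would come from the index reader instead.

```python
import math

# Hypothetical collection statistics, for illustration only.
NUM_DOCS = 1000
DOC_FREQ = {"what": 900, "is": 950, "information": 120, "retrieval": 15}


def toy_idfr(query_terms, n):
    """Return the n query terms with the highest IDF, most discriminative first."""
    # Drop duplicate terms while keeping their first-seen order.
    unique_terms = list(dict.fromkeys(query_terms))
    # Smoothed IDF; terms missing from DOC_FREQ are treated as occurring in no document.
    idf = {t: math.log(NUM_DOCS / (1 + DOC_FREQ.get(t, 0))) for t in unique_terms}
    return sorted(unique_terms, key=lambda t: idf[t], reverse=True)[:n]


print(toy_idfr("what is information retrieval".split(), 2))
# prints: ['retrieval', 'information']
```

Frequent terms such as `what` and `is` have an IDF close to zero in this toy collection, so they are the first to be dropped.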


## Your task for this exercise

As in the previous exercise, implement `idfr_query_reduction` in the following cell:

In [81]:
N = index_reader.stats()["documents"]

def idfr_query_reduction(query: str, n: int):
    # Analyze the query, then drop duplicate terms while keeping their first-seen order.
    terms = list(dict.fromkeys(analyzer.analyze(query)))

    # IDF of each query term.
    idf = {}
    for term in terms:
        # get_term_counts returns (document frequency, collection frequency).
        df = index_reader.get_term_counts(term, analyzer=None)[0]
        idf[term] = math.log(N / (1 + df))

    # Keep the n terms with the highest IDF (the most discriminative ones).
    top_idf = sorted(idf, key=idf.get, reverse=True)[:n]

    return ' '.join(top_idf)

Run the following cell to test your `idfr_query_reduction`. Because IDF-r reduction is usually applied to long queries, we make up a relatively long query to test it. Again, you can experiment with the query and the parameter `n`.

In [84]:
original_query = "what is information retrieval? what it is used for? and how can it help us to access knowledge?"
pruned_query = idfr_query_reduction(original_query, 10)
print(f"Original query: {original_query}")
print(f"Pruned query: {pruned_query}")

Original query: what is information retrieval? what it is used for? and how can it help us to access knowledge?
Pruned query: retrieval knowledge access us help information what how used can


You may find that setting `n` to a large number keeps all the query terms and so does not change the content of the original query at all. Instead of setting `n` to a fixed number, you can also try treating it as a ratio that determines the percentage of the query terms to keep; for example, `n=0.8` means keeping 80% of the query terms.
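One possible way to support both conventions is a small helper that converts `n` into a term count. This is a sketch under assumptions: the name `resolve_keep_count` is hypothetical, and the rule that only a float strictly between 0 and 1 counts as a ratio is a design choice made here, not something the practical prescribes.

```python
def resolve_keep_count(n, num_terms):
    """Translate n into the number of query terms to keep (hypothetical helper).

    A float strictly between 0 and 1 is read as a ratio of the number of
    query terms; any other value is read as an absolute count.
    """
    if isinstance(n, float) and 0 < n < 1:
        count = round(n * num_terms)
    else:
        count = int(n)
    # Never keep more terms than the query has; keep at least one if the query is non-empty.
    return max(min(count, num_terms), min(1, num_terms))


print(resolve_keep_count(0.8, 10))  # prints: 8
print(resolve_keep_count(20, 15))   # prints: 15
```

Inside a reduction function, the result would replace `n` when slicing the IDF-sorted term list, with `num_terms` being the number of unique query terms.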

Finally, run the following cell to evaluate how your `idfr_query_reduction` performs:

In [101]:
def idfr_reduction_search(run_file: str, k: int=1000, n: int=10):
    with open(run_file, "w") as f:
        for topic_id, query in queries: # topic id is important here.
            pruned_query = idfr_query_reduction(query, n)
            hits = searcher.search(pruned_query, k=k)
            for i, hit in enumerate(hits):
                # Write the results to our file.
                f.write(f"{topic_id} Q0 {hit.docid} {i} {hit.score} infs7410_w5\n")
                
idfr_reduction_search("prac5_idfr_k1000.run", k=1000, n=8)
print_results('prac5_idfr_k1000.run')


# Earlier runs, recorded as n: MAP, nDCG@10, recall@1000
# n=3: 0.3558 0.4666 0.6801
# n=5: 0.3544 0.4805 0.6879
# n=8: 0.3565 0.4854 0.6881

map                      all     0.3565
ndcg_cut_10              all     0.4854
recall_1000              all     0.6881


**QUESTION:** _Compared with the scores obtained without PRF expansion and with PRF expansion, which metrics increased and which decreased? Why is this the case?_

-----
## Challenge exercise

Create and evaluate a search function that combines PRF query expansion and IDF-r query reduction.

You can try either order: PRF expansion followed by IDF-r reduction, or IDF-r reduction followed by PRF expansion.
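As a starting point, the two orderings can be expressed with one small composition helper. This is only a sketch: `combine_query_ops`, `fake_expand` and `fake_reduce` are hypothetical names, and the two stand-in functions exist solely so the example runs on its own. In the notebook you would pass wrappers around your own expansion and reduction functions instead, with `n` and `m` fixed.

```python
from typing import Callable


def combine_query_ops(query: str,
                      expand_fn: Callable[[str], str],
                      reduce_fn: Callable[[str], str],
                      expand_first: bool = True) -> str:
    """Apply expansion and reduction to the query in the chosen order."""
    if expand_first:
        return reduce_fn(expand_fn(query))
    return expand_fn(reduce_fn(query))


# Stand-ins for illustration only; they do not use the index.
def fake_expand(q: str) -> str:
    return q + " ranking"                 # pretend PRF added one term


def fake_reduce(q: str) -> str:
    return " ".join(q.split()[-2:])       # pretend IDF-r kept the last two terms


print(combine_query_ops("what is retrieval", fake_expand, fake_reduce, expand_first=True))
# prints: retrieval ranking
print(combine_query_ops("what is retrieval", fake_expand, fake_reduce, expand_first=False))
# prints: is retrieval ranking
```

Note that the order matters: reducing after expanding can discard expansion terms again, while expanding after reducing changes which feedback documents are retrieved in the first place.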