# Scaffolding project

_DSAIT4050: Information retrieval lecture, TU Delft_

Welcome to the **DSAIT4050: Information retrieval** lecture!

This project acts as a gentle introduction to information retrieval for you. You do not need any prior knowledge about IR for this task. Only some Python programming skills are required.

## Getting started

Under the hood, this notebook uses a library called **PyTerrier**. Please check out the first part of our _Introduction to PyTerrier_ series to learn how to install PyTerrier. However, you do not need to interact with PyTerrier directly for now; rather, we're providing you with simple utility functions you can use. Feel free to have a look how these are implemented, but it's not required.

**Task 1**: Install PyTerrier (see the `01-setup.ipynb` notebook).

Now you should be able to import the utility functions. A dataset will be downloaded and indexed automatically (this will take a minute).


In [1]:
%env PYTHONUTF8 1
%env JAVA_HOME C:\\Program Files\\Java\\jdk-14.0.2

import pandas as pd

env: PYTHONUTF8=1
env: JAVA_HOME=C:\\Program Files\\Java\\jdk-14.0.2


In [70]:
from util import search, evaluate, evaluate_all, INDEX
from pyterrier import IndexFactory, TerrierTokeniser

Now that we have loaded the data, you can run search queries. For example:


In [10]:
foo = search("dog training")
foo

Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,185762,3803738_5,Your dog possibly getting ill and dying. Get ...,0,10.80056,dog training
1,1,288253,1345383_1,"First of all, dogs can't be ""gay"". Secondly, ...",1,10.733361,dog training
2,1,117371,1658873_4,I think dogs are stupid. My girlfriends dog d...,2,10.729649,dog training
3,1,155139,118800_11,your dog is a dog and dogs like chasing cats- ...,3,10.729649,dog training
4,1,276810,1722290_5,I have had many dogs who lived to be old dogs....,4,10.715026,dog training
5,1,186953,2521790_1,Spending time with the dog does not mean the d...,5,10.682956,dog training
6,1,155084,377792_2,"Your dog is leaving his/her ""calling card"" for...",6,10.648571,dog training
7,1,231815,673660_9,Get the dogs a dog house first of all. Then pu...,7,10.648571,dog training
8,1,47231,169464_0,Because she dogs get f***** by more than one d...,8,10.642309,dog training
9,1,83860,1614604_19,all dogs do that or your dog has fleas,9,10.642309,dog training


What you get here is a list of ten documents from the corpus that are ordered by how relevant they are to our query (according to the search engine).

## Query rewriting

The goal of this task is to come up with a way of **rewriting queries** such that the search engine can "understand" them better.

In order to do this, let's first take a look at some example queries from our dataset. We represent these queries using a `pandas.DataFrame`, where the first column corresponds to the **query ID** and the second column corresponds to the **query**:


In [56]:
example_queries = pd.DataFrame(
    [
        [
            "443848",
            "does anybody know where i could get a free guide on how to train a siberian husky",
        ],
        [
            "1783010",
            "what is blaphsemy",
        ],
        [
            "2838988",
            "how can i get a cork out of not into a wine bottle without a corkscrew",
        ],
    ],
    columns=["qid", "query"],
)

Since these queries are taken from the dataset, we can **evaluate the performance** of our search engine on these queries. This means that we know which documents the system should retrieve for each query.

You can use the following evaluation function to do this. This function takes your queries and returns a score (mean average precision -- you will learn about this later). For now, all you need to know is that, the higher this score, the better the system works.

Let's evaluate the queries we have:


In [15]:
print("score:", evaluate(example_queries))

score: 0.07906002902973568


Now it's up to you to figure out if and how it's possible to make the search engine perform better on these queries. How would you query a search engine if you wanted to know about these topics? Experiment a bit.

**Task 2**: Try to manually come up with ways to rewrite or reformulate the queries so the performance improves.

**Important**: Make sure that the query IDs match! Otherwise, evaluation will not work.


In [20]:
query_scores = []

In [37]:
query_eval = {
    "query":
        [
            [
                "443848",
                "how can i train a Siberian Husky dog",
            ],
            [
                "1783010",
                "blasphemy definition insult religion",
            ],
            [
                "2838988",
                "how can i remove cork from a wine bottle no corkscrew, knife",
            ],
        ],
}

example_queries_rewritten = pd.DataFrame(query_eval["query"], columns=["qid", "query"])
score = evaluate(example_queries_rewritten)
query_eval['score'] = score
query_scores.append(query_eval)
# current best: "how can i remove cork from a wine bottle no corkscrew, knife",

print("score after rewriting:", score)

score after rewriting: 0.15556549533771236


In [34]:
query_scores

[{'query': [['443848', 'how can i train a Siberian Husky dog'],
   ['1783010', 'blasphemy definition insult religion'],
   ['2838988',
    'how can i take out cork from a wine bottle no corkscrew, knife']],
  'score': np.float64(0.153324827792697)},
 {'query': [['443848', 'how can i train a Siberian Husky dog'],
   ['1783010', 'blasphemy definition insult religion'],
   ['2838988',
    'how can i remove take out cork from a wine bottle no corkscrew, knife']],
  'score': np.float64(0.153324827792697)},
 {'query': [['443848', 'how can i train a Siberian Husky dog'],
   ['1783010', 'blasphemy definition insult religion'],
   ['2838988',
    'how can i remove cork from a wine bottle no corkscrew, knife']],
  'score': np.float64(0.15556549533771236)},
 {'query': [['443848', 'how can i train a Siberian Husky dog'],
   ['1783010', 'blasphemy definition insult religion'],
   ['2838988', 'how can i remove plug from a wine bottle no opener']],
  'score': np.float64(0.04969761084602401)},
 {'quer

# An automatic approach

In this last part, we'll try to come up with an automatic approach to perform query re-writing. Use your findings from task 2 for this.

**Task 3**: Implement a function that automatically re-writes any input query.

You can use any approach or library you want for this task. However, keep in mind that simple ideas often work well!


In [3]:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import string

In [6]:
from nltk.corpus import brown
import math
from collections import defaultdict


def calculate_brown_idf():
    """
    Calculates IDF scores for words in the Brown corpus.
    
    Returns:
    - DataFrame with TF-IDF scores for each category
    - Dictionary mapping words to their corpus-wide IDF scores
    """
    # Get all categories
    file_ids = brown.fileids()

    # Calculate document frequencies (DF) across all categories
    doc_frequencies = defaultdict(int)
    document_word_counts = {}

    # Count words in each category and document frequencies
    for file_id in file_ids:
        # Get words for this category
        words = brown.words(fileids=file_id)
        # Convert to lowercase and count
        word_counts = Counter(word.lower() for word in words)
        document_word_counts[file_id] = word_counts

        # Update document frequencies
        for word in word_counts:
            doc_frequencies[word] += 1

    # Calculate IDF scores
    num_docs = len(file_ids)
    idf_scores = {word: math.log(num_docs / freq) for word, freq in doc_frequencies.items()}

    return idf_scores

In [87]:

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')


def get_word_scores(sentence):
    """
    Find the most important word in a sentence using NLTK for analysis.
    
    Args:
        sentence (str): Input sentence to analyze
        
    Returns:
        tuple: (word, score) - The most important word and its importance score
    """
    # Clean and tokenize the sentence
    tokens = word_tokenize(sentence.lower())

    # Remove punctuation and stop words
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens
              if word not in stop_words
              and word not in string.punctuation]

    if not tokens:
        return None, 0

    # Get POS tags
    pos_tags = pos_tag(tokens)

    # Score words based on multiple factors
    word_scores = {}

    for word, pos in pos_tags:
        score = 0
        word_scores[word] = {}

        # 1. POS tag importance
        pos_scores = {
            'NN': 3.0,  # Nouns
            'NNP': 4.0,  # Proper nouns
            'NNPS': 4.0,
            'NNS': 3.0,
            'VB': 3.5,  # Verbs
            'VBD': 3.5,
            'VBG': 3.5,
            'VBN': 3.5,
            'VBP': 3.5,
            'VBZ': 3.5,
            'JJ': 2.0,  # Adjectives
            'JJR': 2.0,
            'JJS': 2.0,
            'RB': 1.5,  # Adverbs
        }
        score += pos_scores.get(pos, 1.0)
        word_scores[word]['pos'] = {'pos': pos, 'score': score}

        # 2. Word length (normalized)
        score += len(word) / 10
        word_scores[word]['len'] = {'len(word)': len(word), 'score': len(word) / 10}

        # 3. Word complexity (number of syllables approximation)
        vowels = 'aeiou'
        syllable_count = sum(1 for letter in word if letter in vowels)
        score += syllable_count / 5
        word_scores[word]['syl'] = {'syllable_count': syllable_count, 'score': syllable_count / 5}

        # 4. Check if word has WordNet entries (indicates it's a meaningful word)
        if wordnet.synsets(word):
            score += 1

            # Add score based on number of meanings (indicates word importance)
            synonyms = wordnet.synsets(word)
            score += len(synonyms) / 10
            word_scores[word]['syn'] = {'synonyms': synonyms, 'score': len(synonyms) / 10}
        else:
            word_scores[word]['syn'] = {'synonyms': [], 'score': 0}

        word_scores[word]['score'] = score

    return word_scores


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\skakr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\skakr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\skakr\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\skakr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
idf = calculate_brown_idf()
idf

{'the': 0.0,
 'fulton': 5.115995809754082,
 'county': 2.0874737133771,
 'grand': 2.5510464522925456,
 'jury': 3.123565645063876,
 'said': 0.4557063245449111,
 'friday': 2.6882475738060303,
 'an': 0.004008021397538868,
 'investigation': 2.6036901857779675,
 'of': 0.0,
 "atlanta's": 5.521460917862246,
 'recent': 1.4271163556401458,
 'primary': 2.1037342342488805,
 'election': 2.8134107167600364,
 'produced': 2.0249533563957662,
 '``': 0.07904320734045288,
 'no': 0.07042246429654579,
 'evidence': 1.4188175528254505,
 "''": 0.07688104433595759,
 'that': 0.0,
 'any': 0.14387037041970183,
 'irregularities': 4.422848629194137,
 'took': 0.7896580809407889,
 'place': 0.53785429615391,
 '.': 0.0,
 'further': 1.2173958246580767,
 'in': 0.0,
 'term-end': 6.214608098422191,
 'presentments': 6.214608098422191,
 'city': 1.1647520911726548,
 'executive': 2.5257286443082556,
 'committee': 2.0714733720306593,
 ',': 0.0,
 'which': 0.05340077672711525,
 'had': 0.16960278438617996,
 'over-all': 3.036554268

In [71]:
TerrierTokeniser("the big brown fox jumps over")

ValueError: 'the big brown fox jumps over' is not a valid TerrierTokeniser

In [107]:
import pyterrier as pt
# %%
tokenizer = pt.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()

def strip_markup(text):
    return " ".join(tokenizer.getTokens(text))


  tokenizer = pt.autoclass("org.terrier.indexing.tokenisation.Tokeniser").getTokeniser()


In [108]:

from collections import Counter
from autocorrect import Speller

index = IndexFactory.of(INDEX)
N = index.getCollectionStatistics().numberOfDocuments
num_docs: int = 3
num_terms: int = 5
alpha: float = 0.7


def rewrite_query(query: str) -> str:
    """
    Rewrites a query using pseudo relevance feedback based on TF-IDF scores
    of top retrieved documents.
    
    Parameters:
    query (str): Original query
    search (callable): Function that returns a DataFrame with search results
    idf (Dict[str, float]): Dictionary of IDF scores for terms
    num_docs (int): Number of top documents to use for feedback
    num_terms (int): Number of terms to add to the query
    alpha (float): Weight of original query terms (0-1)
    
    Returns:
    str: Expanded query
    """

    # Get initial search results
    spell = Speller()
    query = spell(query)
    
    normalized_query = query.lower() #.split()

    # Remove punctuation and stop words
    stop_words = set(stopwords.words('english'))
    
    normalized_query = ' ' .join([word for word in normalized_query
              if word not in stop_words
              and word not in string.punctuation])
    
    normalized_query = strip_markup(normalized_query)
    

    results = search(normalized_query)
    if len(results) == 0:
        results = search(query)
        if len(results) == 0:
            return strip_markup(query)

    # Get top documents
    top_docs = results.head(num_docs)

    # Combine all document text and calculate term frequencies
    doc_text = " ".join(top_docs['text'].astype(str))  # Assuming 'text' column exists
    term_freqs = Counter(doc_text.lower().split())

    # Calculate TF-IDF scores for terms in top documents
    tfidf_scores = {}
    total_terms = sum(term_freqs.values())

    for term, freq in term_freqs.items():
        if term in stop_words:
            tfidf_scores[term] = 1e-10
        else:
            try:
                df = index.getLexicon()[term].getDocumentFrequency()
                idf = math.log(N / df)
                tf = freq / total_terms
                tfidf_scores[term] = tf * idf
            except KeyError:
                tfidf_scores[term] = 1e-10

    # Sort terms by TF-IDF score, excluding original query terms
    expansion_candidates = {
        term: score for term, score in tfidf_scores.items()
        if term not in normalized_query
    }

    # Select top terms for expansion
    expansion_terms = [x[0] for x in sorted(
        expansion_candidates.items(),
        key=lambda x: x[1],
        reverse=True
    )[:num_terms]]

    # Combine into final query
    expanded_query = normalized_query + " " + " ".join(expansion_terms)

    return strip_markup(expanded_query)

original_query = "library consider heart university people website for your consideration:" 
expanded_query = rewrite_query(original_query)
print(f"Original query: {original_query}")
print(f"Expanded query: {expanded_query}")


JavaException: JVM exception occurred: Failed to process qid 1 'library consider heart university people website for your consideration:' -- Encountered "" at line 1, column 72.
 org.terrier.querying.parser.QueryParserException

In [88]:
def rewrite_query_syn(query: str) -> str:
    word_scores = get_word_scores(query)
    top_word, top_word_data = sorted(word_scores.items(), reverse=True, key=lambda x: x[1]['score'])[0]
    print(top_word)
    syns = top_word_data['syn']['synonyms']
    pos = top_word_data['pos']['pos']

    if len(syns) > 2:
        syns = syns[:2]
    syns_to_add = [syn.name().split('.')[0] for syn in syns]
    syns_to_add = [syn for syn in syns_to_add if syn != top_word]
    print(syns_to_add)

    if pos.startswith('VB'):
        query_list = query.split(' ')
        idx = query_list.index(top_word)
        for syn in syns_to_add:
            query_list.insert(idx, syn)
        print(query_list)
        query = ' '.join(query_list)

    if pos.startswith('NN'):
        query = query + ' ' + ' '.join(syns_to_add)

    return query

# query = "how can i train a siberian husky"
# rewrite_query(query)

This time, we'll evalute on _all_ queries in the dataset. This will give us a more general result:


In [68]:
print("score:", evaluate_all())

[INFO] Please confirm you agree to the authors' data usage agreement found at <https://ciir.cs.umass.edu/downloads/Antique/readme.txt>
[INFO] [starting] https://ciir.cs.umass.edu/downloads/Antique/antique-test-queries.txt
[INFO] [finished] https://ciir.cs.umass.edu/downloads/Antique/antique-test-queries.txt: [00:00] [11.4kB] [146kB/s]
                                                                                                

score: 0.06179994498738492


Are you able to improve the overall performance using your rewriting approach?


In [104]:
print("score after rewriting", evaluate_all(rewrite_query))

score after rewriting 0.03286984046687657
