# PyTerrier CIKM 2021 Tutorial Notebook - Part 3 - Neural Re-Ranking and Neural Index Augmentation

This is one of a series of Colab notebooks created for the [CIKM 2021](https://www.cikm2021.org/) Tutorial entitled '**IR From Bag-of-words to BERT and Beyond through Practical Experiments**'. It demonstrates the use of [PyTerrier](https://github.com/terrier-org/pyterrier) on the [CORD19 test collection](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).

In particular, in this notebook you will:

 - Re-rank documents using neural models like KNRM, Vanilla BERT, EPIC, and monoT5.
 - Use DeepCT and doc2query to augment documents for lexical retrieval functions like BM25.
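
The re-ranking idea in the first bullet can be sketched without any neural model: a cheap first stage scores every document, and a more expensive second stage re-scores only the top `k` candidates. The block below is a toy illustration under stated assumptions. The function names, the term-overlap scorer, and the length-normalised "second stage" are all hypothetical stand-ins, not the models used in this tutorial.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]


def overlap_score(query: str, text: str) -> float:
    """Cheap first-stage score: number of distinct query terms found in the text."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def normalised_overlap(query: str, text: str) -> float:
    """Stand-in for an expensive re-ranker: overlap divided by document length."""
    tokens = text.lower().split()
    return overlap_score(query, text) / len(tokens) if tokens else 0.0


def retrieve_then_rerank(query: str, docs: Dict[str, str],
                         first_stage: Scorer, second_stage: Scorer,
                         k: int) -> List[Tuple[str, float]]:
    # Stage 1: score every document cheaply and keep the top k candidates.
    candidates = sorted(docs, key=lambda d: first_stage(query, docs[d]), reverse=True)[:k]
    # Stage 2: apply the costlier scorer to those k candidates only.
    rescored = [(d, second_stage(query, docs[d])) for d in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical three-document collection.
toy_docs = {
    "d1": "bert re-ranking improves retrieval",
    "d2": "bert bert bert and many other unrelated words about cooking pasta at home",
    "d3": "cooking pasta",
}
ranked = retrieve_then_rerank("bert retrieval", toy_docs,
                              overlap_score, normalised_overlap, k=2)
print(ranked)
```

In PyTerrier the same shape is usually expressed by composing transformers, for example a BM25 retriever cut off with the `%` operator and piped with `>>` into a re-ranker.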

## Setup

In the following, we will set up the libraries required to execute the notebook.

### PyTerrier installation

The following cell installs the latest release of the [PyTerrier](https://github.com/terrier-org/pyterrier) package.

In [1]:
!pip install -q --upgrade python-terrier


### PyTerrier plugins installation

We install the [OpenNIR](https://opennir.net/), [monoT5](https://github.com/terrierteam/pyterrier_t5), [DeepCT](https://github.com/terrierteam/pyterrier_deepct) and [doc2query](https://github.com/terrierteam/pyterrier_doc2query) PyTerrier plugins. You can safely ignore the package versioning errors.

In [2]:
!pip install -q --upgrade git+https://github.com/Georgetown-IR-Lab/OpenNIR
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_deepct.git
!pip install -q --upgrade git+https://github.com/terrierteam/pyterrier_doc2query.git


## Preliminary steps

**[PyTerrier](https://github.com/terrier-org/pyterrier) initialization**

Let's get [PyTerrier](https://github.com/terrier-org/pyterrier) started. This will download the latest version of the [Terrier](http://terrier.org/) IR platform. We also import the [OpenNIR](https://opennir.net/) PyTerrier bindings.

In [19]:
# First, install and import PyTerrier
# !pip install python-terrier
import pyterrier as pt
# Initialize PyTerrier
if not pt.started():
    pt.init()

# Set up paths
index_dir = "./fiqa_index"
fiqa_data_path = "./output"  # Adjust to your actual path

# Function to generate document dictionaries from FIQA
def fiqa_document_generator():
    # You'll need to adjust this based on your FIQA data format
    # Here's an example assuming you have a JSON file with documents
    import json
    with open("/content/corpus.jsonl", 'r', encoding='utf-8') as f:
        for line in f:
            try:
                # Parse each line as a JSON object
                doc = json.loads(line.strip())

                # Yield a document dictionary suitable for indexing
                # Adjust the field names based on your JSONL structure
                yield {
                    # Document ID (BEIR-style corpora use '_id'; fall back to 'id')
                    'docno': str(doc.get('_id', doc.get('id', ''))),
                    'text': doc.get('text', ''),  # Main text content
                    'title': doc.get('title', ''),  # Optional title field
                    # Add other fields as needed
                }
            except json.JSONDecodeError as e:
                print(f"Error parsing line: {e}")
                continue

    # with open(f"/content/corpus.jsonl", "r") as f:
    #     corpus = json.load(f)

    # for doc in corpus:
    #     # Create a dictionary with the document fields
    #     yield {
    #         'docno': doc['id'],
    #         'text': doc['text'],
    #         # Add other fields if available
    #         'title': doc.get('title', ''),
    #     }

# Create the indexer
# You can specify which fields to index and metadata to store
indexer = pt.IterDictIndexer(
    index_dir,
    # Configure which fields to index (the content to search)
    fields=['text', 'title',],
    # Configure metadata (fields to retrieve but not search)
    meta={'docno': 20, 'text': 4096}
)

# Run the indexing process
indexref = indexer.index(fiqa_document_generator())

# Print information about the created index
index = pt.IndexFactory.of(indexref)
print(f"Index created with {index.getCollectionStatistics().getNumberOfDocuments()} documents")



21:58:16.700 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 39 empty documents
Index created with 57638 documents
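
The warning about empty documents can be investigated before indexing. The following is a minimal sketch, assuming a JSONL file with the same `id`, `text` and `title` keys the cell above reads; the helper name `iter_jsonl_docs` and the three sample lines are hypothetical and only serve to make the block runnable on its own.

```python
import io
import json
from typing import Dict, Iterator, TextIO


def iter_jsonl_docs(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one {'docno', 'text', 'title'} dict per valid JSON line."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting
        yield {
            "docno": str(doc.get("id", "")),
            "text": doc.get("text", ""),
            "title": doc.get("title", ""),
        }


# Hypothetical sample: one normal record, one empty record, one broken line.
sample = io.StringIO(
    '{"id": "1", "title": "", "text": "What is an index fund?"}\n'
    '{"id": "2", "title": "", "text": ""}\n'
    'not valid json\n'
)
docs = list(iter_jsonl_docs(sample))

# A record counts as empty when neither indexed field has any content.
empty_docnos = [d["docno"] for d in docs
                if not (d["text"].strip() or d["title"].strip())]
print(f"{len(docs)} documents parsed, empty docnos: {empty_docnos}")
```

Replacing the in-memory `sample` with an opened corpus file would list the document IDs that have no indexable text.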


## Retrieve top 100 results using BM25
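
As background for this step, here is a rough sketch of how a BM25 score is assembled from term frequency, document length and inverse document frequency. It uses a common textbook form with `k1 = 1.2` and `b = 0.75` and whitespace tokenisation; these choices are assumptions for illustration, and Terrier's own BM25 implementation may differ in constants, idf variant and tokenisation. The function name and the toy collection are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[str, str],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for docno, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[docno] = score
    return scores


# Hypothetical toy collection.
toy_docs = {
    "d1": "stock dividends are taxed",
    "d2": "how to cook rice",
    "d3": "dividends dividends and more dividends on stock",
}
scores = bm25_scores("stock dividends", toy_docs)
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print(top2)
```

Keeping only the highest-scoring documents, as `top2` does here, corresponds to asking the retriever for a fixed number of results per query.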

In [25]:
import pyterrier as pt
import pandas as pd
import os

# Initialize PyTerrier if not already started
if not pt.started():
    pt.init()

# Paths to your data files
train_path = "train.tsv"
test_path = "test.tsv"
dev_path = "dev.tsv"
index_path = "./fiqa_index"  # Path to your previously created index

# Load the qrels files with the proper column names
def load_qrels(file_path):
    qrels = pd.read_csv(file_path, sep='\t',
                        names=['qid', 'docno', 'relevance'],  # Named 'relevance' instead of 'score'
                        header=0)  # Skip header row

    # Convert to the format PyTerrier expects
    qrels['qid'] = qrels['qid'].astype(str)
    qrels['docno'] = qrels['docno'].astype(str)

    return qrels

train_qrels = load_qrels(train_path)
test_qrels = load_qrels(test_path)
dev_qrels = load_qrels(dev_path)

print(f"Loaded {len(train_qrels)} training relevance judgments")
print(f"Loaded {len(test_qrels)} test relevance judgments")
print(f"Loaded {len(dev_qrels)} development relevance judgments")

# Extract unique query IDs from qrels
train_query_ids = train_qrels['qid'].unique()
test_query_ids = test_qrels['qid'].unique()
dev_query_ids = dev_qrels['qid'].unique()

print(f"Found {len(train_query_ids)} unique training query IDs")
print(f"Found {len(test_query_ids)} unique test query IDs")
print(f"Found {len(dev_query_ids)} unique development query IDs")

# Retrieval needs the query text, not just the query IDs.
# The qrels files only contain IDs, so the query texts must come from a separate file.
# If no such file is available, placeholder queries are created below.

# Load the index
index = pt.IndexFactory.of(index_path)
print(f"Loaded index with {index.getCollectionStatistics().getNumberOfDocuments()} documents")

# Create dummy queries if you don't have actual query texts
def create_dummy_queries(query_ids):
    queries = pd.DataFrame({
        'qid': query_ids,
        'query': [f"query_{qid}" for qid in query_ids]  # Placeholder query text
    })
    return queries

train_queries = create_dummy_queries(train_query_ids)
test_queries = create_dummy_queries(test_query_ids)
dev_queries = create_dummy_queries(dev_query_ids)

# Create a BM25 retriever and set to retrieve top 100 results
retriever = pt.BatchRetrieve(index, wmodel="BM25", num_results=100)

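# Run retrieval on each query set.
# Pass the query DataFrame itself: an f-string such as f"results/{train_queries}"
# turns the frame into one long string instead of a table of 'qid'/'query' rows.
train_results = retriever.transform(train_queries)
test_results = retriever.transform(test_queries)
dev_results = retriever.transform(dev_queries)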

# Save results in TREC format
def save_results(results, output_file):
    with open(output_file, 'w') as f:
        for _, row in results.iterrows():
            f.write(f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} PyTerrier-BM25\n")
    print(f"Saved {len(results)} results to {output_file}")

save_results(train_results, "train_results.txt")
save_results(test_results, "test_results.txt")
save_results(dev_results, "dev_results.txt")

# Example of evaluating retrieval against the qrels
test_eval = pt.pipelines.Evaluate(test_results, test_qrels, metrics=["map", "ndcg_cut_10", "P_100", "recall_100"])
print("\nTest evaluation results:")
print(test_eval)



Loaded 14166 training relevance judgments
Loaded 1706 test relevance judgments
Loaded 1238 development relevance judgments
Found 5500 unique training query IDs
Found 648 unique test query IDs
Found 500 unique development query IDs
Loaded index with 57638 documents
Saved 0 results to train_results.txt
Saved 0 results to test_results.txt
Saved 0 results to dev_results.txt


ValueError: No results for evaluation
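
To make the requested metrics concrete, the sketch below computes precision at `k`, recall at `k` and mean average precision by hand for a tiny run. It is an illustration under stated assumptions, not the evaluation code PyTerrier uses: the function names, the dictionary-based run and qrels, and the example judgments are all hypothetical, and scores are averaged over queries that appear in both the run and the qrels. It also shows why an empty run cannot be evaluated, since there is nothing to average.

```python
from typing import Dict, List, Set


def precision_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k positions that hold a relevant document."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def recall_at_k(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def evaluate_run(run: Dict[str, List[str]], qrels: Dict[str, Set[str]],
                 k: int) -> Dict[str, float]:
    """Average the three metrics over queries present in both run and qrels."""
    qids = [qid for qid in run if qid in qrels]
    if not qids:
        raise ValueError("nothing to evaluate: the run has no judged queries")
    n = len(qids)
    return {
        f"P_{k}": sum(precision_at_k(run[q], qrels[q], k) for q in qids) / n,
        f"recall_{k}": sum(recall_at_k(run[q], qrels[q], k) for q in qids) / n,
        "map": sum(average_precision(run[q], qrels[q]) for q in qids) / n,
    }


# Hypothetical run (ranked docnos per query) and relevance judgments.
toy_run = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]}
toy_qrels = {"q1": {"d1", "d5"}, "q2": {"d2"}}
metrics = evaluate_run(toy_run, toy_qrels, k=2)
print(metrics)
```

Checking that the results frame is non-empty before calling the evaluation function gives an earlier and more specific failure than the error shown above.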