### <code>Dirichlet Model with Smoothing Parameter</code>

#### **Smoothed Probability**

The smoothed probability of a term $w$ given a document $d$ using Dirichlet smoothing is given by:

$$p_{\text{Dir}}(w|d) = \frac{c(w, d) + \mu \cdot p(w|C)}{|d| + \mu} \tag{1}$$

where:
- $c(w, d)$ is the count of term $w$ in document $d$.
- $|d|$ is the total number of terms in document $d$.
- $\mu$ is the Dirichlet smoothing parameter.
- $p(w|C)$ is the probability of term $w$ in the entire collection $C$.


#### **Collection Probability**

The probability of term $w$ in the entire collection $C$ is:

$$p(w|C) = \frac{f(w, C)}{|C|} \tag{2}$$

where:
- $f(w, C)$ is the count (frequency) of term $w$ in the entire collection $C$.
- $|C|$ is the total number of terms in the entire collection.
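
To make Eq. (1) and Eq. (2) concrete, here is a minimal sketch on a two-document toy collection. The documents, the tiny value of `toy_mu`, and the helper names are illustrative assumptions, not part of the WSJ/AP setup used later.

```python
from collections import Counter

# Hypothetical toy collection (not taken from WSJ or AP)
toy_docs = {
    "d1": "stock market falls".split(),
    "d2": "stock prices rise as market rallies".split(),
}
toy_mu = 2.0  # deliberately tiny so the smoothing effect is visible on short documents

toy_counts = Counter(t for tokens in toy_docs.values() for t in tokens)
toy_collection_length = sum(toy_counts.values())  # |C|

def p_collection(w):
    # Eq. (2): collection frequency divided by collection length
    return toy_counts[w] / toy_collection_length

def p_dirichlet(w, doc_id):
    # Eq. (1): document count plus mu-weighted collection probability
    tokens = toy_docs[doc_id]
    return (tokens.count(w) + toy_mu * p_collection(w)) / (len(tokens) + toy_mu)

p_seen = p_dirichlet("stock", "d1")    # term occurs in d1
p_unseen = p_dirichlet("rise", "d1")   # term occurs only in d2
print(p_seen, p_unseen)
```

A term that is absent from `d1` still receives a non-zero probability through $p(w|C)$, and the smoothed probabilities over the collection vocabulary still sum to 1 for each document.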


   

In [1]:
import os
import pandas as pd
from collections import defaultdict, Counter
import numpy as np
import pyterrier as pt

In [14]:
# Set JAVA_HOME environment variable
java_home = r"C:\Program Files\Java\jdk-22"   # adjust your java JDK folder 
os.environ["JAVA_HOME"] = java_home

# Verify that JAVA_HOME is set correctly
print("JAVA_HOME set to:", os.environ.get("JAVA_HOME"))

if not pt.started():
  pt.init()



JAVA_HOME set to: C:\Program Files\Java\jdk-22


PyTerrier 0.10.0 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



### Loading the Topics & qrels

In [26]:
# Define the relative paths based on the notebook's location
#topics_path = os.path.join("..", "Data", "AP_Doc", "ap", "topics", "all_topics_fixed.txt")
#qrels_path = os.path.join("..", "Data", "AP_Doc", "ap", "qrels", "AP_only.txt")

topics_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "topics", "all_topics_fixed.txt")
qrels_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "qrels", "WSJ_only.txt")

# Load topics and qrels from text files
topics = pt.io.read_topics(topics_path)
qrels = pt.io.read_qrels(qrels_path)



In [16]:
qrels

Unnamed: 0,qid,docno,label
0,51,WSJ861203-0077,0
1,51,WSJ861204-0160,0
2,51,WSJ861204-0167,0
3,51,WSJ861209-0043,0
4,51,WSJ861209-0128,0
...,...,...,...
104283,200,WSJ920316-0108,0
104284,200,WSJ920317-0087,0
104285,200,WSJ920319-0108,0
104286,200,WSJ920323-0193,0


In [25]:
topics

Unnamed: 0,qid,query
0,51,airbus subsidies
1,52,south african sanctions
2,53,leveraged buyouts
3,54,satellite launch contracts
4,55,insider trading
...,...,...
145,196,school choice voucher system and its effects u...
146,197,reform of the jurisprudence system to stop jur...
147,198,gene therapy and its benefits to humankind
148,199,legality of medically assisted suicides


#### AP TREC Files Preprocessing

In [2]:
# Function to parse the TREC file
def parse_trec_file(trec_file_path):
    doc_texts = {}
    current_doc_id = None
    current_text = []
    
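    # errors='ignore' below drops undecodable bytes instead of raising, so the
    # first encoding always succeeds and the fallbacks are never reached.
    encodings = ['utf-8', 'latin-1', 'ISO-8859-1']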
    for encoding in encodings:
        try:
            with open(trec_file_path, 'r', encoding=encoding, errors='ignore') as file:
                for line in file:
                    if line.startswith('<DOCNO>'):
                        current_doc_id = line.strip().replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
                    elif line.startswith('</TEXT>'):
                        if current_doc_id:
                            doc_texts[current_doc_id] = ' '.join(current_text)
                            current_doc_id = None
                            current_text = []
                    elif current_doc_id:
                        if not (line.startswith('<DOC>') or line.startswith('</DOC>') or line.startswith('<FILEID>') or
                                line.startswith('<FIRST>') or line.startswith('<SECOND>') or line.startswith('<HEAD>') or
                                line.startswith('<DATELINE>') or line.startswith('<TEXT>')):
                            current_text.append(line.strip())
            break
        except UnicodeDecodeError:
            continue  

    return doc_texts

# Path to your concatenated TREC file
#trec_file_path = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated", "concatenated_documents.txt")
trec_file_path = os.path.join("..", "Data", "WSJ_DOC", "wsj", "concatenated_WSJ", "concatenated_WSJ.txt")

# Parse the document texts
doc_texts = parse_trec_file(trec_file_path)



#### WSJ TREC Files Preprocessing


In [3]:


def parse_trec_file(trec_file_path):
    doc_texts = {}
    current_doc_id = None
    current_text = []
    
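    # errors='ignore' below drops undecodable bytes instead of raising, so the
    # first encoding always succeeds and the fallbacks are never reached.
    encodings = ['utf-8', 'latin-1', 'ISO-8859-1']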
    for encoding in encodings:
        try:
            with open(trec_file_path, 'r', encoding=encoding, errors='ignore') as file:
                for line in file:
                    if line.startswith('<DOCNO>'):
                        current_doc_id = line.strip().replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
                    elif line.startswith('</TEXT>'):
                        if current_doc_id:
                            doc_texts[current_doc_id] = ' '.join(current_text)
                            current_doc_id = None
                            current_text = []
                    elif current_doc_id:
                        if not (line.startswith('<DOC>') or line.startswith('</DOC>') or line.startswith('<FILEID>') or
                                line.startswith('<FIRST>') or line.startswith('<SECOND>') or line.startswith('<HEAD>') or
                                line.startswith('<DATELINE>') or line.startswith('<TEXT>') or 
                                line.startswith('<HL>') or line.startswith('</HL>') or 
                                line.startswith('<DD>') or line.startswith('</DD>') or 
                                line.startswith('<SO>') or line.startswith('</SO>') or 
                                line.startswith('<IN>') or line.startswith('</IN>')):
                            current_text.append(line.strip())
            break
        except UnicodeDecodeError:
            continue  

    return doc_texts

# Path to your concatenated TREC file
# trec_file_path = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated", "concatenated_documents.txt")
trec_file_path = os.path.join("..", "Data", "WSJ_DOC", "wsj", "concatenated_WSJ", "concatenated_WSJ.txt")

# Parse the document texts
doc_texts = parse_trec_file(trec_file_path)


In [5]:
#dict(list(doc_texts.items())[0:5])


#### Dirichlet Language Model

In [34]:
# Parameters
mu = 1500 

# Preprocess documents
doc_lengths = {}
term_doc_freq = defaultdict(Counter)
total_term_count = Counter()
collection_length = 0

# Tokenizing and gathering statistics
for docno, text in doc_texts.items():
    tokens = text.lower().split()
    doc_length = len(tokens)
    doc_lengths[docno] = doc_length
    term_doc_freq[docno].update(tokens)
    total_term_count.update(tokens)
    collection_length += doc_length

# Calculate P(w|C) for the collection
P_w_C = {word: count / collection_length for word, count in total_term_count.items()} # ..... Eq(2)

# Function to compute Dirichlet smoothed P(w|D)
def dirichlet_smoothed_P_w_D(word, docno):
    doc_length = doc_lengths[docno]
    word_count_in_doc = term_doc_freq[docno][word]
    P_w_C_word = P_w_C.get(word, 0)
    return (word_count_in_doc + mu * P_w_C_word) / (doc_length + mu) # ..... Eq(1)

# Scoring function
def score_document(query, docno):
    query_tokens = query.lower().split()
    score = 0.0
    for token in query_tokens:
        P_w_D = dirichlet_smoothed_P_w_D(token, docno)
        if P_w_D > 0:
            score += np.log(P_w_D)
    return score

# Calculate scores for each query-document pair and rank them
results = []

for index, row in topics.iterrows():
    qid = row['qid']
    query = row['query']
    scores = []
    for docno in doc_texts.keys():
        score = score_document(query, docno)
        scores.append((docno, score))
    
    # Sort scores in descending order and assign ranks
    ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
    for rank, (doc_id, score) in enumerate(ranked_scores, start=1):
        results.append({
            'qid': qid,
            'docno': doc_id,
            'rank': rank,
            'score': score,
            'query': query
        })

# Convert results to DataFrame
retrieved_results = pd.DataFrame(results, columns=['qid', 'docno', 'rank', 'score', 'query'])
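
As a quick sanity check of the query-likelihood scoring above, the following sketch applies the same logic to three made-up four-word documents. All `toy_*` names and the document texts are assumptions for illustration; only the scoring rule mirrors the cell above.

```python
import math
from collections import Counter

# Hypothetical mini-collection; every document has the same length
toy_docs = {
    "d1": "airbus subsidies dispute widens".split(),
    "d2": "insider trading probe widens".split(),
    "d3": "airbus wins satellite contract".split(),
}
toy_mu = 1500

toy_doc_counts = {d: Counter(tokens) for d, tokens in toy_docs.items()}
toy_coll_counts = Counter(t for tokens in toy_docs.values() for t in tokens)
toy_coll_len = sum(toy_coll_counts.values())

def toy_log_likelihood(query, doc_id):
    score = 0.0
    for w in query.lower().split():
        p_wc = toy_coll_counts[w] / toy_coll_len
        p = (toy_doc_counts[doc_id][w] + toy_mu * p_wc) / (len(toy_docs[doc_id]) + toy_mu)
        if p > 0:  # terms absent from the whole collection are skipped, as above
            score += math.log(p)
    return score

toy_ranking = sorted(
    toy_docs,
    key=lambda d: toy_log_likelihood("airbus subsidies", d),
    reverse=True,
)
print(toy_ranking)
```

The document matching both query terms ranks first, the one matching a single term second, and the one matching neither last, since it is scored from collection statistics alone.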

### <code>Mean Average Precision (MAP)</code>

Mean Average Precision (MAP) is a metric used to evaluate the performance of an information retrieval system. It calculates the average precision for a set of queries and then computes the mean of these average precision values. The formula for MAP is given by:

$$\text{MAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}(i)$$

Where:
- $N$ is the total number of queries.
- $\text{AP}(i)$ is the average precision for query $i$.

#### Average Precision (AP)

The average precision for a single query is the average of the precision values obtained at each point a relevant document is retrieved. The formula for average precision is:

$$\text{AP}(i) = \frac{1}{R_i} \sum_{k=1}^{|D_i|} P(k) \cdot \text{rel}(k)$$

Where:
- $R_i$ is the total number of relevant documents for query $i$.
- $|D_i|$ is the total number of retrieved documents for query $i$.
- $P(k)$ is the precision at rank $k$.
- $\text{rel}(k)$ is an indicator function equaling 1 if the document at rank $k$ is relevant, and 0 otherwise.

#### Steps to calculate MAP:
1. For each query, walk down the ranked list and compute the precision at each rank where a relevant document is retrieved.
2. Sum these precision values and divide by $R_i$ to get the Average Precision (AP) for that query.
3. Compute the mean of the AP values over all queries to get MAP.
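
The AP and MAP formulas can be traced on a small hand-made ranking. This is a minimal sketch: the function names, the dictionary-based inputs, and the document ids are assumptions chosen for illustration.

```python
def average_precision(ranked_docnos, relevant):
    """AP for one query: sum of P(k) at relevant ranks, divided by R = len(relevant)."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / k  # P(k) at a rank where rel(k) = 1
    return precision_sum / len(relevant)

def mean_average_precision(run, relevant_by_qid):
    """MAP: mean of AP over the queries in `run` (dict: qid -> ranked docnos)."""
    aps = [
        average_precision(docnos, relevant_by_qid.get(qid, set()))
        for qid, docnos in run.items()
    ]
    return sum(aps) / len(aps) if aps else 0.0

# Hypothetical run and relevance judgments
toy_run = {"q1": ["doc1", "doc5", "doc3", "doc7"], "q2": ["doc9", "doc2"]}
toy_relevant = {"q1": {"doc1", "doc2", "doc3"}, "q2": {"doc2"}}

ap_q1 = average_precision(toy_run["q1"], toy_relevant["q1"])
toy_map = mean_average_precision(toy_run, toy_relevant)
print(ap_q1, toy_map)
```

For `q1`, relevant documents appear at ranks 1 and 3, so AP is $(1 + 2/3)/3 = 5/9$. The relevant `doc2` is never retrieved, yet it still counts in the denominator $R_i$.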


In [35]:
# Compute Mean Average Precision (MAP)
def calculate_map(retrieved_results, qrels):
    avg_precision = []
    for qid in retrieved_results['qid'].unique():
        # Every docno listed in qrels for this qid counts as relevant here;
        # the 'label' column is not consulted. A set keeps membership tests fast.
        relevant_docs = set(qrels[qrels['qid'] == qid]['docno'])
        retrieved_docs = retrieved_results[retrieved_results['qid'] == qid]['docno'].tolist()
        
        # Precision at each rank where a relevant document appears
        precision_at_k = []
        num_relevant = 0
        for i, doc in enumerate(retrieved_docs):
            if doc in relevant_docs:
                num_relevant += 1
                precision_at_k.append(num_relevant / (i + 1))
        
        # AP is averaged over the relevant documents actually retrieved, which
        # equals dividing by R_i only when all of them appear in the ranking.
        if len(precision_at_k) > 0:
            avg_precision.append(np.mean(precision_at_k))
        else:
            avg_precision.append(0.0)
    
    return np.mean(avg_precision)

map_score = calculate_map(retrieved_results, qrels)
print(f"Mean Average Precision (MAP): {map_score}")

Mean Average Precision (MAP): 0.15203106531011265
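
The qrels preview earlier shows a `label` column containing 0 values, and in TREC qrels a label of 0 conventionally marks a judged non-relevant document. `calculate_map` above does not look at that column. A minimal sketch of filtering by label before evaluation follows; the helper name `relevant_by_query`, the `min_label` parameter, and the toy docnos are assumptions, and the scores reported in this notebook were not produced with this filter.

```python
import pandas as pd

def relevant_by_query(qrels, min_label=1):
    """Map each qid to the set of docnos judged relevant (label >= min_label)."""
    judged_relevant = qrels[qrels["label"] >= min_label]
    return judged_relevant.groupby("qid")["docno"].apply(set).to_dict()

# Hypothetical qrels with the same columns as the real ones
toy_qrels = pd.DataFrame({
    "qid": ["51", "51", "51", "52"],
    "docno": ["WSJ-A", "WSJ-B", "WSJ-C", "WSJ-A"],
    "label": [0, 1, 0, 1],
})

toy_rels = relevant_by_query(toy_qrels)
print(toy_rels)
```

Building the per-query sets once also avoids re-filtering the qrels DataFrame inside every loop iteration.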


### <code>Precision at 10 (P@10)</code>

Precision at 10, or P@10, is a metric used to evaluate the performance of an information retrieval system. It measures the proportion of relevant documents within the top 10 retrieved documents for each query. The formula for P@10 is given by:

$$ \text{P@10} = \frac{1}{N} \sum_{i=1}^{N} \frac{| \{ \text{relevant documents in top 10 results for query } i \} |}{10} $$

Where:
- $N$ is the total number of queries.
- $\{ \text{relevant documents in top 10 results for query } i \}$ is the set of relevant documents within the top 10 retrieved documents for query $i$.
- The precision is calculated as the number of relevant documents in the top 10 divided by 10.

#### Steps to calculate P@10:
1. For each query, identify the top 10 retrieved documents.
2. Count the number of these top 10 documents that are relevant according to the ground truth.
3. Calculate the ratio of relevant documents to the total number of retrieved documents (10).
4. Average this ratio over all queries to get the final P@10 score.


In [35]:

def calculate_p_at_k(retrieved_results, qrels, k=10):
    precision_scores = []
    for qid in retrieved_results['qid'].unique():
        # Every docno listed in qrels for this qid counts as relevant; 'label' is not consulted
        relevant_docs = set(qrels[qrels['qid'] == qid]['docno'].tolist())
        retrieved_docs = retrieved_results[retrieved_results['qid'] == qid]['docno'].tolist()[:k]
        
        num_relevant = len(set(retrieved_docs).intersection(relevant_docs))
        precision_at_k = num_relevant / k if k > 0 else 0.0
        
        precision_scores.append(precision_at_k)
    
    return np.mean(precision_scores)

# Example usage:
p_at_10_score = calculate_p_at_k(retrieved_results, qrels, k=10)
print(f"Precision at 10 (P@10): {p_at_10_score}")


Precision at 10 (P@10): 0.6893333333333332


## Example

In [3]:
import pandas as pd

In [4]:
qrels = pd.DataFrame({
    'qid': [1, 1, 1, 2, 2, 3],
    'docno': ['doc1', 'doc2', 'doc3', 'doc2', 'doc4', 'doc1']
})


In [5]:
retrieved_results = pd.DataFrame({
    'qid': [1, 1, 1, 2, 2, 3],
    'docno': ['doc1', 'doc3', 'doc5', 'doc2', 'doc6', 'doc1'],
    'rank': [1, 2, 3, 1, 2, 1],
    'score': [0.9, 0.8, 0.7, 0.95, 0.85, 0.75],
    'query': ['query1', 'query1', 'query1', 'query2', 'query2', 'query3']
})


In [10]:
import numpy as np

def calculate_p_at_k(retrieved_results, qrels, k=10):
    precision_scores = []
    
    for qid in retrieved_results['qid'].unique():
        # Extract relevant documents for the query
        relevant_docs = set(qrels[qrels['qid'] == qid]['docno'].tolist())
        # Extract top k retrieved documents for the query
        retrieved_docs = retrieved_results[retrieved_results['qid'] == qid]['docno'].tolist()[:k]
        
        # Calculate number of relevant documents in the top k
        num_relevant = len(set(retrieved_docs).intersection(relevant_docs))
        # Calculate precision at k
        precision_at_k = num_relevant / k if k > 0 else 0.0
        
        # Debug information
        print(f"QID: {qid}")
        print(f"Relevant Docs: {relevant_docs}")
        print(f"Retrieved Docs: {retrieved_docs}")
        print(f"Number of Relevant Docs in Top {k}: {num_relevant}")
        print(f"Precision@{k}: {precision_at_k}")
        
        precision_scores.append(precision_at_k)
    
    # Return the mean precision at k across all queries
    return np.mean(precision_scores)

# Sample qrels DataFrame (replace with actual qrels data)
qrels = pd.DataFrame({
    'qid': [1, 1, 1, 2, 2, 3],
    'docno': ['doc1', 'doc2', 'doc3', 'doc2', 'doc4', 'doc1']
})

# Sample retrieved results DataFrame (replace with actual retrieved results data)
retrieved_results = pd.DataFrame({
    'qid': [1, 1, 1, 2, 2, 3],
    'docno': ['doc1', 'doc3', 'doc5', 'doc2', 'doc6', 'doc1'],
    'rank': [1, 2, 3, 1, 2, 1],
    'score': [0.9, 0.8, 0.7, 0.95, 0.85, 0.75],
    'query': ['query1', 'query1', 'query1', 'query2', 'query2', 'query3']
})

# Example usage
k = 5
p_at_k_score = calculate_p_at_k(retrieved_results, qrels, k=k)
print(f"Precision at {k} (P@{k}): {p_at_k_score}")


QID: 1
Relevant Docs: {'doc1', 'doc2', 'doc3'}
Retrieved Docs: ['doc1', 'doc3', 'doc5']
Number of Relevant Docs in Top 5: 2
Precision@5: 0.4
QID: 2
Relevant Docs: {'doc4', 'doc2'}
Retrieved Docs: ['doc2', 'doc6']
Number of Relevant Docs in Top 5: 1
Precision@5: 0.2
QID: 3
Relevant Docs: {'doc1'}
Retrieved Docs: ['doc1']
Number of Relevant Docs in Top 5: 1
Precision@5: 0.2
Precision at 5 (P@5): 0.26666666666666666
