## <code>NTLM IR Experiment</code>

### Log-Likelihood for Document Ranking and Neural Translation Model for Query Likelihood

1. **Log-Likelihood for Document Ranking**:
   - Estimates the probability of generating the query $( q )$ given a document $( d )$.
   
     
      $$log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{p(q_i|d)}{\alpha_d * p(q_i|C)} \right) + n \log \alpha_d \text{........................Eq(1) } $$
      
   **Where:**
   - $ p(q_i|d) $ is the probability of term \( q_i \) given document \( d \).
   - $ alpha_d  $ is a normalizing factor.
   - $ p(q_i|C)$ is the probability of term \( q_i \) in the collection \( C \).
2. **Cosine Similarity**
    - $$ {cosine\_similarity}(u, w) = \frac{u \cdot w}{\|u\| \|w\|} $$
 
     $$ p_t(w|u) = \frac{ \text{cos}(u', w)}{\sum_{u_0 \in V} \text{cos}(u, w)}   \text{............................................................................Eq(2) insert it into Eq(3) }$$  


3. **Translation Model for Query Likelihood**:
   - Models the process of query generation as a translation from document terms to query terms.
   
     
     $$p_t(w|d) = \sum_{u \in d} p_t(w|u) p(u|d)  \text{............................................................Eq(3) }$$ 
     
   **Where:**
   - $ p_t(w|u) $ is the probability of translating document term $( u )$ into query term $( w )$.
   - $ p(u|d) $ is the probability of term $( u )$ occurring in document $( d )$.

4. **The probability of term $u$ occurring in document $d$**
   - $$p(u|d) = \frac{{\text{tf}(u, d)}}{{\sum_{v \in d} \text{tf}(v, d)}} \text{............................................................Eq(4) }$$

   **Where:**
   - $ p(u|d) $ represents the probability of term $( u )$ occurring in document $( d )$.
   - $ \text{tf}(u, d) $ denotes the frequency of term $( u )$ in document $( d )$.
   - $\sum_{v \in d} \text{tf}(v, d) $ calculates the total number of terms in document $( d )$.
   
   
   




5. **Connecting the Two Concepts**:
   
   - Specifically, $ p(q_i|d) $ can be estimated using $ p_t(w|d) $:
   
     $ p_t(w|d) \approx p_t(q_i|d)$
     
     $ p(u|d) = \approx p(q_i|C)  $
     
   - Substitute these equations in $Eq (1)$

6. **Final Log-Likelihood for Document Ranking Equation Eq(1)**:
   - Using the Neural translation model, the log-likelihood equation becomes:
     
     $$log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{\sum_{u \in d} p_t(q_i|u) p(u|d)}{\alpha_d * p(q_i|C)} \right) + n \log \alpha_d $$
     
   

In [8]:
import gensim
import scipy
from gensim.models import Word2Vec
import numpy as np
import pyterrier as pt
import os
import pandas as pd

In [9]:
# Set JAVA_HOME environment variable
java_home = r"C:\Program Files\Java\jdk-22"   # adjust your java JDK folder or use relative path in below
os.environ["JAVA_HOME"] = java_home

# Verify that JAVA_HOME is set correctly
print("JAVA_HOME set to:", os.environ.get("JAVA_HOME"))

if not pt.started():
  pt.init()



JAVA_HOME set to: C:\Program Files\Java\jdk-22


PyTerrier 0.10.0 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



In [10]:
# Define the relative paths based on the notebook's location
DISK45_PATH = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated")
INDEX_DIR = os.path.join("..", "Data", "AP_Doc", "ap", "index1")

# Check if the index exists
if os.path.exists(os.path.join(INDEX_DIR, "data.properties")):
    indexref = pt.IndexRef.of(os.path.join(INDEX_DIR, "data.properties"))
else:    
    # Find files in the directory
    files = pt.io.find_files(DISK45_PATH)
    
    # Remove unwanted files
    bad = ['/CR/', '/AUX/', 'READCHG', 'READMEFB', 'READFRCG', 'READMEFR', 'READMEFT', 'READMELA']
    for b in bad:
        files = list(filter(lambda f: b not in f, files))
    
    # Check if files list is empty and raise an error if it is
    if not files:
        raise ValueError(f"No files found in the directory {DISK45_PATH}")
    
    # Index the remaining files
    indexer = pt.TRECCollectionIndexer(INDEX_DIR, verbose=True)
    indexref = indexer.index(files)

# Create an index object
index = pt.IndexFactory.of(indexref)

# collection statistics
print(index.getCollectionStatistics().toString())


Number of documents: 242918
Number of terms: 301375
Number of postings: 44556521
Number of fields: 0
Number of tokens: 69541941
Field names: []
Positions:   false



## Loading the AP Topics & qrels

In [11]:
# Define the relative paths based on the notebook's location
topics_path = os.path.join("..", "Data", "AP_Doc", "ap", "topics", "all_topics.txt")
qrels_path = os.path.join("..", "Data", "AP_Doc", "ap", "qrels", "AP_only.txt")
index_path = os.path.join("..", "Data", "AP_Doc", "ap", "index1")

#topics_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "topics", "all_topics.txt")
#qrels_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "qrels", "WSJ_only.txt")
#index_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "index_WSJ")

# Load topics and qrels from text files
topics = pt.io.read_topics(topics_path)
qrels = pt.io.read_qrels(qrels_path)



In [6]:
qrels

Unnamed: 0,qid,docno,label
0,51,AP880301-0271,1
1,51,AP880302-0275,1
2,51,AP880311-0301,1
3,51,AP880316-0292,1
4,51,AP880318-0287,1
...,...,...,...
63203,200,AP891129-0238,0
63204,200,AP891130-0262,0
63205,200,AP891204-0017,0
63206,200,AP891214-0236,0


In [7]:
topics

Unnamed: 0,qid,query
0,Economics,tipster topic description topic airbus subsidies
1,Economics,tipster topic description topic south african ...
2,Economics,tipster topic description topic leveraged buyouts
3,Economics,tipster topic description topic satellite laun...
4,Economics,tipster topic description topic insider trading
...,...,...
145,196,topic school choice voucher system and its eff...
146,197,topic reform of the jurisprudence system to st...
147,198,topic gene therapy and its benefits to humankind
148,199,topic legality of medically assisted suicides


In [12]:
# Function to parse the TREC file
def parse_trec_file(trec_file_path):
    doc_texts = {}
    current_doc_id = None
    current_text = []
    
    encodings = ['utf-8', 'latin-1', 'ISO-8859-1']
    for encoding in encodings:
        try:
            with open(trec_file_path, 'r', encoding=encoding, errors='ignore') as file:
                for line in file:
                    if line.startswith('<DOCNO>'):
                        current_doc_id = line.strip().replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
                    elif line.startswith('</TEXT>'):
                        if current_doc_id:
                            doc_texts[current_doc_id] = ' '.join(current_text)
                            current_doc_id = None
                            current_text = []
                    elif current_doc_id:
                        if not (line.startswith('<DOC>') or line.startswith('</DOC>') or line.startswith('<FILEID>') or
                                line.startswith('<FIRST>') or line.startswith('<SECOND>') or line.startswith('<HEAD>') or
                                line.startswith('<DATELINE>') or line.startswith('<TEXT>')):
                            current_text.append(line.strip())
            break
        except UnicodeDecodeError:
            continue  

    return doc_texts

# Path to your concatenated TREC file
trec_file_path = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated", "concatenated_documents.txt")
#trec_file_path = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated", "sample_document.txt")

# Parse the document texts
doc_texts = parse_trec_file(trec_file_path)

# Function to get document text from the TREC collection
#def get_document_text_from_trec(doc_id, doc_texts):
 #   return doc_texts.get(doc_id, "")

In [35]:
# Load pre-trained word embeddings
#word_embeddings_path = os.path.join("../", "NTLM_Experiment", "GoogleNews-vectors-negative300-SLIM.bin")   # download the pretrained model
word_embeddings_path = r"C:\Users\dolla\NTLM\GoogleNews-vectors-negative300-SLIM.bin"   
#word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(word_embeddings_path, binary=True)
#word_embeddings_path = r"D:\Thesis_New\thesis_eyasu\Data\Word_Embedding\word_vectors.bin"
word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(word_embeddings_path, binary=True)



### Functions to calculate the scoring probability

In [27]:
# term vectors for every term in the document
def compute_term_vectors(document, word_embeddings):
    term_vectors = {}
    for term in document:
        if term in word_embeddings:
            term_vectors[term] = word_embeddings[term]
    return term_vectors

In [28]:

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)


import numpy as np



In [38]:
"""
# Compute translation probability using cosine similarity
def compute_translation_probability(target_word, candidate_term, word_embeddings):
    if target_word in word_embeddings and candidate_term in word_embeddings:
        target_vector = word_embeddings[target_word]
        candidate_vector = word_embeddings[candidate_term]
        # Normalise cosine similarity to [0, 1] 
        # TODO: properly normalise
        return (cosine_similarity(target_vector, candidate_vector)+1)/2 
    else:
        return 0.0 
"""


def compute_translation_probability(
    target_word: str, 
    candidate_word: str, 
    word_embeddings
) -> float:
    """
    Calculates:
    p_t(w|u) = \frac{ \text{cos}(u', w)}{\sum_{u_0 \in V} \text{cos}(u, w)}
    """
    if (target_word not in word_embeddings.key_to_index) or (candidate_word not in word_embeddings.key_to_index):
        return 0.0
    target_vec = word_embeddings[target_word]
    candidate_vec = word_embeddings[candidate_word]

    # Extract all vectors into a NumPy array
    embedding_arr = np.vstack([word_embeddings[word] for word in word_embeddings.key_to_index])

    # Calculate cosine similarity between target word and candidate word
    cosine_sim = cosine_similarity(target_vec, candidate_vec)
    
    # Calculate sum of cosine similarities between target word and all words in the embedding
    # using a vectorized computation
    dp = np.dot(embedding_arr, target_vec)              # dimension (n_words,)
    e_norm = np.linalg.norm(embedding_arr, axis=1)      # dimension (n_words,)
    t_norm = np.linalg.norm(target_vec)                 # dimension 1
    sum_cosine_similarities = (dp / (e_norm * t_norm)).sum()

    # Return normalized cosine_sim
    if sum_cosine_similarities > 0.0:
        return cosine_sim / sum_cosine_similarities
    else:
        return 0.0


In [30]:

# Compute log-likelihood ratio
def compute_log_likelihood_ratio(likelihood_in_doc, translation_prob, alpha, n):
    return np.log(translation_prob / (alpha * likelihood_in_doc)) + n * np.log(alpha)   


In [31]:
# likelihood in document P(u/d) # .....Eq(4)
def compute_likelihood_in_document(candidate_word, document_terms):
    if candidate_word in document_terms:
        frequency = document_terms.count(candidate_word)
        return frequency / len(document_terms)
    else:
        return 0

In [32]:
# Score documents based on translation probabilities
def score_document(query, document, word_embeddings, alpha=1.0):
    score = 0.0
    n = len(query)
    
    for query_term in query:
        likelihood_in_doc = 0.0
        translation_prob_sum = 0.0
        if query_term not in word_embeddings:
            continue
        if query_term not in document:
            continue
        for doc_term in document:
            if doc_term not in word_embeddings:
                continue
            #likelihood_in_doc += word_embeddings[query_term].dot(word_embeddings[doc_term])
            likelihood_in_doc = compute_likelihood_in_document(doc_term, document) # ........Eq(4)
            translation_prob = compute_translation_probability(query_term, doc_term, word_embeddings)
            translation_prob_sum += (translation_prob * likelihood_in_doc)  # ......Eq(3)
        
        if translation_prob_sum > 0:
            log_likelihood_ratio = compute_log_likelihood_ratio(likelihood_in_doc, translation_prob_sum, alpha, n) # .....Eq(1)
            score += log_likelihood_ratio

    return score

In [42]:
from tqdm import tqdm

dirichlet = pt.BatchRetrieve(index_path, wmodel="DirichletLM", controls={'dirichletlm.mu': 1000}, verbose=True, num_results=10)

def retrieve_and_rank_documents(doc_texts, topics, word_embeddings, alpha=0.5):
    results = []
         
    result = dirichlet.transform(topics)
    
    for idx, row in tqdm(topics.iterrows(), total=len(topics), desc="Processing Topics", position=0, leave=True):
        topic_id = row['qid']
        query = row['query'].split()
        scores = []
        
        retrieved_docs = result.loc[result["qid"]==topic_id]['docno'].values
        
        # Iterate over all document IDs in the topics
        for doc_id in doc_texts.keys():
            if doc_id not in retrieved_docs:
                continue
            doc_text = doc_texts[doc_id]
            doc_terms = doc_text.split()
            score = score_document(query, doc_terms, word_embeddings, alpha)
            scores.append((doc_id, score))
        
        ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked_scores):
            results.append({
                'qid': topic_id,
                'docno': doc_id,
                'rank': rank + 1,
                'score': score,
                'query': ' '.join(query)
            })
    return pd.DataFrame(results)



In [43]:
retrieved_results = retrieve_and_rank_documents(doc_texts, topics[:5], word_embeddings)

BR(DirichletLM): 100%|██████████| 5/5 [00:00<00:00, 63.74q/s]
Processing Topics:   0%|          | 0/5 [00:00<?, ?it/s]


AttributeError: 'KeyedVectors' object has no attribute 'values'

In [121]:
retrieved_results

Unnamed: 0,qid,docno,rank,score,query
0,Economics,AP880712-0012,1,0.000000,tipster topic description topic airbus subsidies
1,Economics,AP881026-0247,2,0.000000,tipster topic description topic airbus subsidies
2,Economics,AP881101-0202,3,0.000000,tipster topic description topic airbus subsidies
3,Economics,AP881109-0034,4,0.000000,tipster topic description topic airbus subsidies
4,Economics,AP881119-0093,5,0.000000,tipster topic description topic airbus subsidies
...,...,...,...,...,...
165,Economics,AP890901-0154,30,-7.548565,tipster topic description topic insider trading
166,Economics,AP890508-0092,31,-8.634704,tipster topic description topic insider trading
167,Economics,AP890409-0023,32,-9.463451,tipster topic description topic insider trading
168,Economics,AP880731-0085,33,-11.027188,tipster topic description topic insider trading


In [69]:
BM25 = pt.BatchRetrieve(index_path, wmodel="BM25")


In [70]:
pt.Experiment([retrieved_results,BM25], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "BM25"], filter_by_qrels=True,
              filter_by_topics=True)

Unnamed: 0,name,map,P_10
0,NTLM,0.0,0.0
1,BM25,0.19957,0.298


# BM25

In [43]:
from tqdm import tqdm

#dirichlet = pt.BatchRetrieve(index_path, wmodel="DirichletLM", controls={'dirichletlm.mu': 1000}, verbose=True, num_results=70)
#BM25 = pt.BatchRetrieve(index_path, wmodel="BM25", verbose=True, num_results=150)
PL2 = pt.BatchRetrieve(index_path, wmodel="PL2", verbose=True, num_results=100)

def retrieve_and_rank_documents(doc_texts, topics, word_embeddings, alpha=0.5):
    results = []
         
    result = PL2.transform(topics)
    
    for idx, row in tqdm(topics.iterrows(), total=len(topics), desc="Processing Topics", position=0, leave=True):
        topic_id = row['qid']
        query = row['query'].split()
        scores = []
        
        retrieved_docs = result.loc[result["qid"]==topic_id]['docno'].values
        
        # Iterate over all document IDs in the topics
        for doc_id in doc_texts.keys():
            if doc_id not in retrieved_docs:
                continue
            doc_text = doc_texts[doc_id]
            doc_terms = doc_text.split()
            score = score_document(query, doc_terms, word_embeddings, alpha)
            scores.append((doc_id, score))
        
        ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked_scores):
            results.append({
                'qid': topic_id,
                'docno': doc_id,
                'rank': rank + 1,
                'score': score,
                'query': ' '.join(query)
            })
    return pd.DataFrame(results)



In [45]:
retrieved_results = retrieve_and_rank_documents(doc_texts, topics, word_embeddings)

BR(PL2): 100%|██████████| 150/150 [00:01<00:00, 93.04q/s] 
Processing Topics:  25%|██▍       | 37/150 [53:44<2:44:08, 87.15s/it] 


KeyboardInterrupt: 

In [122]:
# using Dirichlet 

pt.Experiment([retrieved_results,dirichlet], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "dirichlet"], filter_by_qrels=True,
              filter_by_topics=True)

BR(DirichletLM): 100%|██████████| 50/50 [00:01<00:00, 43.25q/s]


Unnamed: 0,name,map,P_10
0,NTLM,0.0,0.0
1,dirichlet,0.0484,0.298


In [33]:
# using BM25 reranker
pt.Experiment([retrieved_results,BM25], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "BM25"], filter_by_qrels=True,
              filter_by_topics=True)

Unnamed: 0,name,map,P_10
0,NTLM,0.099994,0.25
1,BM25,0.19957,0.298


In [44]:
# using PL2 reranker
pt.Experiment([retrieved_results,PL2], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "PL2"], filter_by_qrels=True,
              filter_by_topics=True)

BR(PL2): 100%|██████████| 50/50 [00:00<00:00, 77.11q/s]


Unnamed: 0,name,map,P_10
0,NTLM,0.082609,0.242
1,PL2,0.144259,0.332


In [42]:
# on 70 number of results using Dirichlet
pt.Experiment([retrieved_results,dirichlet], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "dirichlet"], filter_by_qrels=True,
              filter_by_topics=True)

BR(DirichletLM): 100%|██████████| 50/50 [00:00<00:00, 57.76q/s]


Unnamed: 0,name,map,P_10
0,NTLM,0.082609,0.242
1,dirichlet,0.112818,0.298


In [163]:
def calculate_average_precision(relevance_labels):
    precision_sum = 0.0
    num_relevant_docs = 0
    num_relevant_retrieved = 0
    
    for i, label in enumerate(relevance_labels):
        if label == 1:
            num_relevant_retrieved += 1
            precision = num_relevant_retrieved / (i + 1)
            precision_sum += precision
            num_relevant_docs += 1
    
    if num_relevant_docs == 0:
        return 0.0
    else:
        return precision_sum / num_relevant_docs

def calculate_map(retrieved_results, qrels):
    grouped_results = retrieved_results.groupby('qid')
    total_average_precision = 0.0
    num_queries = 0
    
    for query_id, group in grouped_results:
        retrieved_docnos = group['docno'].values
        
        relevance_labels = [1 if (docno in qrels[(qrels['qid'] == query_id) & (qrels['docno'] == docno)]['docno'].values) else 0 for docno in retrieved_docnos]
        
        print(f"Query ID: {query_id}")
        print(f"Retrieved Docnos: {retrieved_docnos}")
        print(f"Relevance Labels: {relevance_labels}")
        
        average_precision = calculate_average_precision(relevance_labels)
        print(f"Average Precision: {average_precision}")
        
        total_average_precision += average_precision
        num_queries += 1
    
    map_score = total_average_precision / num_queries
    return map_score

# Calculate MAP
map_score = calculate_map(retrieved_results, qrels)
print("MAP Score:", map_score)

Query ID: 071
Retrieved Docnos: ['AP880318-0123' 'AP880319-0014' 'AP880315-0057' 'AP880324-0184'
 'AP880319-0124' 'AP880324-0252' 'AP891103-0071' 'AP880320-0011'
 'AP880324-0085' 'AP880317-0052']
Relevance Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Average Precision: 0.0
Query ID: 072
Retrieved Docnos: ['AP880413-0292' 'AP890213-0188' 'AP891030-0254' 'AP891201-0138'
 'AP881207-0021' 'AP890209-0046' 'AP881216-0224' 'AP880831-0264'
 'AP901031-0054' 'AP880711-0180']
Relevance Labels: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Average Precision: 0.0
Query ID: 100
Retrieved Docnos: ['AP890508-0220' 'AP890320-0177' 'AP880707-0079' 'AP901224-0003'
 'AP900425-0028' 'AP891209-0009' 'AP900605-0203' 'AP900424-0144'
 'AP900122-0277' 'AP891030-0181']
Relevance Labels: [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
Average Precision: 0.8333333333333333
Query ID: 101
Retrieved Docnos: ['AP901005-0009' 'AP880520-0190' 'AP890130-0077' 'AP880513-0046'
 'AP881030-0049' 'AP890426-0036' 'AP880714-0012' 'AP900916-0009'
 'AP881024-0011' 'A