## <code>NTLM IR Experiment</code>

### Log-Likelihood for Document Ranking and Neural Translation Model for Query Likelihood

1. **Log-Likelihood for Document Ranking**:
   - Estimates the log-probability of generating the query $q$ given a document $d$.

     $$\log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{p(q_i|d)}{\alpha_d \, p(q_i|C)} \right) + n \log \alpha_d \qquad \text{Eq. (1)}$$

   **Where:**
   - $p(q_i|d)$ is the probability of term $q_i$ given document $d$.
   - $\alpha_d$ is a document-dependent normalizing factor.
   - $p(q_i|C)$ is the probability of term $q_i$ in the collection $C$.
   - $c(q_i;d)$ is the count of $q_i$ in $d$, and $n$ is the number of query terms.
2. **Cosine Similarity**
   - $${cosine\_similarity}(u, w) = \frac{u \cdot w}{\|u\| \, \|w\|}$$

     $$p_t(w|u) = \frac{\cos(u, w)}{\sum_{u' \in V} \cos(u', w)} \qquad \text{Eq. (2), inserted into Eq. (3)}$$

   **Where:**
   - $V$ is the vocabulary; $u$ and $w$ denote both the terms and their embedding vectors.


3. **Translation Model for Query Likelihood**:
   - Models query generation as a translation from document terms to query terms.

     $$p_t(w|d) = \sum_{u \in d} p_t(w|u) \, p(u|d) \qquad \text{Eq. (3)}$$

   **Where:**
   - $p_t(w|u)$ is the probability of translating document term $u$ into query term $w$.
   - $p(u|d)$ is the probability of term $u$ occurring in document $d$.

4. **The probability of term $u$ occurring in document $d$**
   - $$p(u|d) = \frac{\text{tf}(u, d)}{\sum_{v \in d} \text{tf}(v, d)} \qquad \text{Eq. (4)}$$

   **Where:**
   - $p(u|d)$ is the probability of term $u$ occurring in document $d$.
   - $\text{tf}(u, d)$ is the frequency of term $u$ in document $d$.
   - $\sum_{v \in d} \text{tf}(v, d)$ is the total number of terms in document $d$.
   
   
   
5. **Probability of Query Term $q_i$ in the Collection $C$: $p(q_i|C)$**

   The probability $p(q_i|C)$ is the likelihood of term $q_i$ appearing anywhere in the collection $C$.
   It is the frequency of $q_i$ in $C$ divided by the total number of terms in the collection.

   - $$p(q_i|C) = \frac{\text{cf}(q_i, C)}{\sum_{v \in C} \text{cf}(v, C)} \qquad \text{Eq. (5)}$$

   **Where:**
   - $\text{cf}(q_i, C)$: collection frequency of term $q_i$ in the collection $C$.
   - $\sum_{v \in C} \text{cf}(v, C)$: total number of terms in the entire collection $C$.




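6. **Connecting the Two Concepts**:

   - Specifically, $p(q_i|d)$ in Eq. (1) can be estimated with the translation probability $p_t(w|d)$ from Eq. (3), taking $w = q_i$:

     $p(q_i|d) \approx p_t(q_i|d)$

   - Substitute Eq. (3) for $p(q_i|d)$, with $p(u|d)$ from Eq. (4) and $p(q_i|C)$ from Eq. (5), into Eq. (1).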

7. **Final Log-Likelihood for Document Ranking (Eq. (1) with the translation model)**:
   - Using the neural translation model, the log-likelihood becomes:

     $$\log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{\sum_{u \in d} p_t(q_i|u) \, p(u|d)}{\alpha_d \, p(q_i|C)} \right) + n \log \alpha_d$$
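
To make the equations above concrete, here is a minimal Python sketch on a toy example. It reads Eq. (2) as normalising the cosine over all vocabulary terms, and it is not the notebook's implementation: the three words, their 2-d vectors, the collection probabilities, the value of `alpha_d`, and the function names `p_t` and `ntlm_log_likelihood` are all made-up assumptions for illustration.

```python
import numpy as np

# Toy 2-d embeddings; words and vectors are invented for illustration only.
emb = {
    "plane":   np.array([1.0, 0.0]),
    "airbus":  np.array([0.8, 0.6]),
    "subsidy": np.array([0.0, 1.0]),
}
vocab = list(emb)

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def p_t(w, u):
    """Eq. (2): cos(u, w) normalised over every vocabulary term u'."""
    return cos(emb[u], emb[w]) / sum(cos(emb[v], emb[w]) for v in vocab)

def ntlm_log_likelihood(query, doc, p_C, alpha_d):
    """Final equation: sum over matched query terms plus n * log(alpha_d)."""
    n = len(query)
    tf = {u: doc.count(u) for u in set(doc)}      # term frequencies in d
    score = n * np.log(alpha_d)
    for q in query:
        if q not in tf:                           # the sum only covers c(q_i; d) > 0
            continue
        # Eq. (3), with p(u|d) = tf(u, d) / |d| from Eq. (4)
        p_q_d = sum(p_t(q, u) * tf[u] / len(doc) for u in tf)
        score += np.log(p_q_d / (alpha_d * p_C[q]))
    return score

doc = ["plane", "plane", "airbus", "subsidy"]
p_C = {"plane": 0.05, "airbus": 0.01, "subsidy": 0.02}   # assumed Eq. (5) values
score = ntlm_log_likelihood(["airbus", "subsidy"], doc, p_C, alpha_d=0.7)
print(score)
```

In this toy case both query terms occur in the document, so the $\alpha_d$ factors inside the sum cancel against $n \log \alpha_d$; $\alpha_d$ only affects the score when some query terms are missing from the document.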
     
   

In [1]:
import gensim
#import scipy
from gensim.models import Word2Vec
import numpy as np
import pyterrier as pt
import os
from collections import defaultdict
import pandas as pd



### Set up the Java environment for PyTerrier

In [33]:
# Set JAVA_HOME environment variable
java_home = r"C:\Program Files\Java\jdk-11"   # adjust to your Java JDK folder
os.environ["JAVA_HOME"] = java_home

# Verify that JAVA_HOME is set correctly
print("JAVA_HOME set to:", os.environ.get("JAVA_HOME"))

if not pt.started():
  pt.init()



JAVA_HOME set to: C:\Program Files\Java\jdk-11


In [34]:
# Resolve the collection and index paths relative to the notebook's working directory
DISK45_PATH = os.path.abspath("concatenated_WSJ")
INDEX_DIR = os.path.abspath("index_WSJ")


# Check if the index exists
if os.path.exists(os.path.join(INDEX_DIR, "data.properties")):
    indexref = pt.IndexRef.of(os.path.join(INDEX_DIR, "data.properties"))
else:    
    # Find files in the directory
    files = pt.io.find_files(DISK45_PATH)
    
    # Remove unwanted files
    bad = ['/CR/', '/AUX/', 'READCHG', 'READMEFB', 'READFRCG', 'READMEFR', 'READMEFT', 'READMELA']
    for b in bad:
        files = list(filter(lambda f: b not in f, files))
    
    # Check if files list is empty and raise an error if it is
    if not files:
        raise ValueError(f"No files found in the directory {DISK45_PATH}")
    
    # Index the remaining files
    indexer = pt.TRECCollectionIndexer(INDEX_DIR, verbose=True)
    indexref = indexer.index(files)

# Create an index object
index = pt.IndexFactory.of(indexref)

# collection statistics
print(index.getCollectionStatistics().toString())


Number of documents: 173252
Number of terms: 176210
Number of postings: 29484536
Number of fields: 0
Number of tokens: 49044405
Field names: []
Positions:   false



### Loading the topics and WSJ qrels

In [35]:
# Define the relative paths based on the notebook's location

topics_path = os.path.join("topics","all_topics_fixed.txt")
qrels_path = os.path.join("qrels","WSJ_only.txt")
#index_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "index_WSJ")

# Load topics and qrels from text files
topics = pt.io.read_topics(topics_path)
qrels = pt.io.read_qrels(qrels_path)



22:02:09.932 [main] WARN org.terrier.applications.batchquerying.TRECQuery -- trec.encoding is not set; resorting to platform default (windows-1252). Retrieval may be platform dependent. Recommend trec.encoding=UTF-8


In [36]:
qrels

Unnamed: 0,qid,docno,label
0,51,WSJ861203-0077,0
1,51,WSJ861204-0160,0
2,51,WSJ861204-0167,0
3,51,WSJ861209-0043,0
4,51,WSJ861209-0128,0
...,...,...,...
104283,200,WSJ920316-0108,0
104284,200,WSJ920317-0087,0
104285,200,WSJ920319-0108,0
104286,200,WSJ920323-0193,0


In [37]:
topics

Unnamed: 0,qid,query
0,51,airbus subsidies
1,52,south african sanctions
2,53,leveraged buyouts
3,54,satellite launch contracts
4,55,insider trading
...,...,...
145,196,school choice voucher system and its effects u...
146,197,reform of the jurisprudence system to stop jur...
147,198,gene therapy and its benefits to humankind
148,199,legality of medically assisted suicides


### Preprocess the dataset

In [38]:

# Lines starting with these tags carry markup or metadata rather than body text
SKIPPED_TAGS = ('<DOC>', '</DOC>', '<FILEID>', '<FIRST>', '<SECOND>', '<HEAD>',
                '<DATELINE>', '<TEXT>', '<HL>', '</HL>', '<DD>', '</DD>',
                '<SO>', '</SO>', '<IN>', '</IN>')

def parse_trec_file(trec_file_path):
    """Return {docno: text} for every document in a concatenated TREC file."""
    doc_texts = {}
    current_doc_id = None
    current_text = []

    # errors='ignore' drops undecodable bytes, so a single UTF-8 pass never raises
    # UnicodeDecodeError and no fallback encoding is needed.
    with open(trec_file_path, 'r', encoding='utf-8', errors='ignore') as file:
        for line in file:
            if line.startswith('<DOCNO>'):
                current_doc_id = line.strip().replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
            elif line.startswith('</TEXT>'):
                # End of the document body: store the collected text and reset
                if current_doc_id:
                    doc_texts[current_doc_id] = ' '.join(current_text)
                    current_doc_id = None
                    current_text = []
            elif current_doc_id and not line.startswith(SKIPPED_TAGS):
                current_text.append(line.strip())

    return doc_texts

# Path to your concatenated TREC file
trec_file_path = os.path.join("concatenated_WSJ","concatenated_WSJ.txt")


# Parse the document texts
doc_texts = parse_trec_file(trec_file_path)
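
A quick way to sanity-check the parsing rules is to run them on a small in-memory sample. The sketch below is a hypothetical line-based variant (`parse_trec_lines` is not part of the notebook), and the document ID and text are invented. It assumes the same rules as the parser above: collect lines after `<DOCNO>` until `</TEXT>`, skipping any line that starts with a known tag.

```python
def parse_trec_lines(lines):
    """Hypothetical in-memory variant of the parser, for testing on a list of lines."""
    skip = ('<DOC>', '</DOC>', '<TEXT>', '<HL>', '</HL>', '<DD>', '</DD>',
            '<SO>', '</SO>', '<IN>', '</IN>', '<DATELINE>')
    docs, doc_id, buf = {}, None, []
    for line in lines:
        if line.startswith('<DOCNO>'):
            doc_id = line.replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
        elif line.startswith('</TEXT>'):
            if doc_id:
                docs[doc_id] = ' '.join(buf)   # close the current document
            doc_id, buf = None, []
        elif doc_id and not line.startswith(skip):
            buf.append(line.strip())
    return docs

# Invented sample in the general shape of a TREC document
sample = """<DOC>
<DOCNO> WSJ000000-0001 </DOCNO>
<HL> Made-up headline </HL>
<TEXT>
First line of text.
Second line.
</TEXT>
</DOC>""".splitlines()

parsed = parse_trec_lines(sample)
print(parsed)
```

Note that the headline is dropped: a line that begins with a skipped tag is discarded whole, even when the tag and its content share one line.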


In [None]:
dict(list(doc_texts.items())[0:3])

#### Initial ranking using the Dirichlet model

In [39]:
dirichlet = pt.BatchRetrieve(index, wmodel="DirichletLM", controls={'dirichletlm.mu': 1500}, verbose=True) 
retrieved_results = dirichlet.transform(topics)
retrieved_results

BR(DirichletLM): 100%|██████████| 150/150 [00:04<00:00, 36.05q/s]


Unnamed: 0,qid,docid,docno,rank,score,query
0,51,27745,WSJ871218-0126,0,13.950991,airbus subsidies
1,51,111455,WSJ900720-0157,1,13.266061,airbus subsidies
2,51,1016,WSJ870316-0068,2,12.942354,airbus subsidies
3,51,77369,WSJ880315-0169,3,12.656361,airbus subsidies
4,51,164736,WSJ920116-0130,4,12.501765,airbus subsidies
...,...,...,...,...,...,...
146857,200,104584,WSJ900518-0081,995,4.564010,impact of foreign textile imports on u s texti...
146858,200,35189,WSJ870717-0014,996,4.563954,impact of foreign textile imports on u s texti...
146859,200,51240,WSJ881013-0006,997,4.561225,impact of foreign textile imports on u s texti...
146860,200,40472,WSJ871111-0099,998,4.559232,impact of foreign textile imports on u s texti...
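
For reference, the textbook form of Dirichlet smoothing can be sketched in a few lines. This is an illustration of the standard formula with `mu = 1500`, not a description of how Terrier's `DirichletLM` is implemented internally (it may use a rank-equivalent variant); the counts and the function names are made up.

```python
def dirichlet_prob(tf, doc_len, p_wC, mu=1500):
    """Dirichlet-smoothed p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)."""
    return (tf + mu * p_wC) / (doc_len + mu)

def dirichlet_alpha(doc_len, mu=1500):
    """Document-dependent coefficient alpha_d = mu / (|d| + mu)."""
    return mu / (doc_len + mu)

# Made-up counts: the term occurs 3 times in a 300-token document, p(w|C) = 0.001
p_seen = dirichlet_prob(tf=3, doc_len=300, p_wC=0.001)
p_unseen = dirichlet_prob(tf=0, doc_len=300, p_wC=0.001)
alpha = dirichlet_alpha(doc_len=300)
print(p_seen, p_unseen, alpha)
```

For a term that does not occur in the document, the smoothed probability reduces to `alpha * p(w|C)`, which is the role $\alpha_d$ plays in Eq. (1).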


#### Loading the Word2vec embeddings

In [40]:
# Load pre-trained word embeddings

word_embeddings_path = r"GoogleNews-vectors-negative300-SLIM.bin"   
word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(word_embeddings_path, binary=True)

### Implementation of reranking with NTLM using Word2vec model

In [41]:
import numpy as np
from collections import defaultdict
from tqdm import tqdm
import pyterrier as pt



# Function to calculate term frequencies for documents and collection
def calculate_term_frequencies(doc_texts):
    term_frequencies = {}
    collection_frequencies = defaultdict(int)
    total_terms_in_collection = 0
    
    for doc_id, text in doc_texts.items():
        term_freq = defaultdict(int)
        document_terms = text.split()  # Split text into terms
        total_terms_in_doc = len(document_terms)
        
        for term in document_terms:
            term_freq[term] += 1
            collection_frequencies[term] += 1
            total_terms_in_collection += 1
        
        term_frequencies[doc_id] = (term_freq, total_terms_in_doc)
    
    return term_frequencies, collection_frequencies, total_terms_in_collection

# Function to calculate the probability of term u given document d
def p_u_given_d(term, doc_term_freq, total_terms_in_doc):
    return doc_term_freq[term] / total_terms_in_doc if total_terms_in_doc > 0 else 0

# Function to calculate the probability of term u in the collection C
def p_u_given_C(term, collection_frequencies, total_terms_in_collection):
    return collection_frequencies[term] / total_terms_in_collection if total_terms_in_collection > 0 else 0

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Approximate the translation probability p_t(target_word | candidate_term).
# Note: this rescales the cosine from [-1, 1] to [0, 1]; it does not apply the
# vocabulary-wide normalisation in the denominator of Eq(2).
def compute_translation_probability(target_word, candidate_term, word_embeddings):
    if target_word in word_embeddings and candidate_term in word_embeddings:
        target_vector = word_embeddings[target_word]
        candidate_vector = word_embeddings[candidate_term]
        # Rescale cosine similarity to [0, 1]
        return (cosine_similarity(target_vector, candidate_vector)+1)/2 
    else:
        return 0.0 

# Compute the log-likelihood ratio for one query term.
# Note: n * log(alpha) is added on every call, i.e. once per scored query term.
def compute_log_likelihood_ratio(translation_prob, p_qi_C, alpha, n):
    return (np.log(translation_prob / (p_qi_C * alpha)) + n * np.log(alpha))

# Score documents based on translation probabilities
def score_document(query, document, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha=0.7):
    score = 0.0
    n = len(query)

    for query_term in query:
        if query_term not in word_embeddings:
            continue  # Skip query terms not in word embeddings

        # Both probabilities depend only on the query term, so compute them once per term
        p_qi_d = p_u_given_d(query_term, doc_term_freq, total_terms_in_doc)  # Eq(4), evaluated for the query term
        p_qi_C = p_u_given_C(query_term, collection_frequencies, total_terms_in_collection)  # Eq(5)

        translation_prob_sum = 0.0
        for doc_term in document:
            if doc_term not in word_embeddings:
                continue  # Skip document terms not in word embeddings

            translation_prob = compute_translation_probability(query_term, doc_term, word_embeddings)
            # Eq(3)-style sum; note the weight here is p(query_term|d), not p(doc_term|d)
            translation_prob_sum += (translation_prob * p_qi_d)

        # The sum is zero, for example, when the query term does not occur in the document
        if translation_prob_sum > 0:
            log_likelihood_ratio = compute_log_likelihood_ratio(translation_prob_sum, p_qi_C, alpha, n)  # Eq(1)
            score += log_likelihood_ratio

    return score

dirichlet = pt.BatchRetrieve(index, wmodel="DirichletLM", controls={'dirichletlm.mu': 1500}, verbose=True)

# Retrieve and rank documents
def retrieve_and_rank_documents(doc_texts, topics, word_embeddings, alpha=0.7):
    results = []
    
    # Calculate term frequencies for documents and collection
    term_frequencies, collection_frequencies, total_terms_in_collection = calculate_term_frequencies(doc_texts)
    
    # Retrieve initial set of documents using Dirichlet language model
    result = dirichlet.transform(topics)
    
    for idx, row in tqdm(topics.iterrows(), total=len(topics), desc="Processing Topics", position=0, leave=True):
        topic_id = row['qid']
        query = row['query'].split()
        scores = []
        
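        # Documents retrieved for this topic, stored as a set for fast membership tests
        retrieved_docs = set(result.loc[result["qid"] == topic_id]['docno'].values)
        
        # Iterate over the parsed documents, keeping only those retrieved for this topic
        for doc_id in doc_texts.keys():
            if doc_id not in retrieved_docs:
                continue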
            doc_text = doc_texts[doc_id]
            doc_terms = doc_text.split()
            doc_term_freq, total_terms_in_doc = term_frequencies[doc_id]
            score = score_document(query, doc_terms, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha)
            scores.append((doc_id, score))
        
        ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked_scores):
            results.append({
                'qid': topic_id,
                'docno': doc_id,
                'rank': rank + 1,
                'score': score,
                'query': ' '.join(query)
            })
    return pd.DataFrame(results, columns=['qid', 'docno', 'rank', 'score', 'query'])

retrieved_results = retrieve_and_rank_documents(doc_texts, topics, word_embeddings)

BR(DirichletLM): 100%|██████████| 150/150 [00:06<00:00, 22.25q/s]
Processing Topics: 100%|██████████| 150/150 [1:30:35<00:00, 36.24s/it]
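
The progress log above reports roughly 36 s per topic, much of which plausibly comes from computing one cosine per (query term, document token) pair in Python. As a hedged sketch, the same weighted sum of rescaled cosines can be computed with two matrix products. The function name `translation_sums` and the random test vectors are assumptions; in the notebook, the rows would have to be stacked from `word_embeddings` lookups, and the weights could be $p(u|d)$ over the unique document terms.

```python
import numpy as np

def translation_sums(query_vecs, doc_vecs, doc_weights):
    """For each query term q, return sum_u sim(q, u) * weight(u).

    sim is cosine similarity rescaled to [0, 1].
    query_vecs: (Q, D) array, doc_vecs: (U, D) array, doc_weights: (U,) array.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    u = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = (q @ u.T + 1.0) / 2.0      # (Q, U) matrix of rescaled cosines
    return sims @ doc_weights         # (Q,) vector of weighted sums

# Random stand-ins for embedding rows: 2 query terms, 4 document terms, 5 dimensions
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5))
U = rng.normal(size=(4, 5))
w = np.array([0.4, 0.3, 0.2, 0.1])    # assumed term weights that sum to 1

out = translation_sums(Q, U, w)
print(out)
```

Normalising the rows once turns every cosine into a plain dot product, so the per-pair Python loop disappears.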


#### Save the reranked result "retrieved_results"

In [42]:
path = os.path.join("result_files","reranked_retrieved_results.csv")

retrieved_results.to_csv(path, index=False)

### Evaluating the reranked results

In [2]:
import pandas as pd
import pytrec_eval
import os

# Define reranked files, qrel files, and model names
reranked_files = [
    
    {"path": os.path.join("result_files","WSJ_Dirichlet_ranked_results.csv"), "qrel": os.path.join("result_files","WSJ_qrels.csv"), "model": "Dirichlet model WSJ"},  # dirichlet model

    {"path": os.path.join("result_files","reranked_retrieved_results.csv"), "qrel": os.path.join("result_files","WSJ_qrels.csv"), "model": "NTLM WSJ"}, # reranking using NTLM


]

def evaluate_reranked_results(reranked_df, qrel_df):
    qrels = {str(qid): {str(docno): int(label) for docno, label in zip(qrel_df.loc[qrel_df['qid'] == qid, 'docno'], qrel_df.loc[qrel_df['qid'] == qid, 'label'])} for qid in qrel_df['qid'].unique()}
    results = {str(qid): {str(docno): float(score) for docno, score in zip(reranked_df.loc[reranked_df['qid'] == qid, 'docno'], reranked_df.loc[reranked_df['qid'] == qid, 'score'])} for qid in reranked_df['qid'].unique()}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map', 'P_10', "ndcg"})
    metrics = evaluator.evaluate(results)

    map_score = sum([m['map'] for m in metrics.values()]) / len(metrics)
    p10_score = sum([m['P_10'] for m in metrics.values()]) / len(metrics)
    
    return map_score, p10_score

# DataFrame to store evaluation results
results_df = pd.DataFrame(columns=["Model", "File", "MAP", "P@10"])

# Loop through reranked files and evaluate them
for file_info in reranked_files:
    try:
        reranked_df = pd.read_csv(file_info["path"])
        qrel_df = pd.read_csv(file_info["qrel"])

        map_score, p10_score = evaluate_reranked_results(reranked_df, qrel_df)

        temp_df = pd.DataFrame({
            "Model": [file_info["model"]],
            "File": [os.path.basename(file_info["path"])],
            "MAP": [map_score],
            "P@10": [p10_score]
        })

        results_df = pd.concat([results_df, temp_df], ignore_index=True)
    
    except Exception as e:
        print(f"Error processing file {file_info['path']}: {str(e)}")

# Save the results to a CSV file
#output_file = "evaluation_results.csv"
#results_df.to_csv(output_file, index=False)

# Print the results
print("\nEvaluation Results:")
print("=" * 80)
print(results_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print("=" * 80)
#print(f"\nResults saved to {output_file}")




Evaluation Results:
              Model                             File      MAP     P@10
Dirichlet model WSJ WSJ_Dirichlet_ranked_results.csv 0.273027 0.448667
           NTLM WSJ   reranked_retrieved_results.csv 0.087345 0.253333
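
To make the two reported metrics concrete, the sketch below computes P@10 and average precision by hand for a single query. MAP is the mean of average precision over queries. The relevance labels are made up and the helper names are hypothetical; the notebook itself relies on `pytrec_eval` for these numbers.

```python
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k results that are relevant (label > 0)."""
    return sum(1 for r in ranked_labels[:k] if r > 0) / k

def average_precision(ranked_labels, num_relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_labels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Made-up ranking: relevant documents at ranks 1, 3 and 6;
# a fourth relevant document exists in the qrels but was never retrieved.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p10 = precision_at_k(labels, k=10)
ap = average_precision(labels, num_relevant=4)
print(p10, ap)
```

Average precision divides by the total number of relevant documents, so relevant documents that are never retrieved lower the score even though they do not appear in the ranking.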
