## <code>NTLM IR Experiment</code>

### Log-Likelihood for Document Ranking and Neural Translation Model for Query Likelihood

1. **Log-Likelihood for Document Ranking**:
   - Estimates the log-probability of generating the query $q$ given a document $d$.

      $$\log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{p(q_i|d)}{\alpha_d \, p(q_i|C)} \right) + n \log \alpha_d \qquad \text{Eq. (1)}$$

   **Where:**
   - $p(q_i|d)$ is the probability of term $q_i$ given document $d$.
   - $\alpha_d$ is a normalizing factor.
   - $p(q_i|C)$ is the probability of term $q_i$ in the collection $C$.
   - $c(q_i;d)$ is the count of $q_i$ in $d$, and $n$ is the number of query terms.
2. **Cosine Similarity**
    - $$ {cosine\_similarity}(u, w) = \frac{u \cdot w}{|u| \, |w|} $$

     The translation probability normalizes the cosine similarity over the vocabulary $V$ and is inserted into Eq. (3):

     $$p_t(w|u) = \frac{\cos(u, w)}{\sum_{u' \in V} \cos(u', w)} \qquad \text{Eq. (2)}$$


3. **Translation Model for Query Likelihood**:
   - Models the process of query generation as a translation from document terms to query terms.

     $$p_t(w|d) = \sum_{u \in d} p_t(w|u) \, p(u|d) \qquad \text{Eq. (3)}$$

   **Where:**
   - $p_t(w|u)$ is the probability of translating document term $u$ into query term $w$.
   - $p(u|d)$ is the probability of term $u$ occurring in document $d$.

4. **The probability of term $u$ occurring in document $d$**
   - $$p(u|d) = \frac{\text{tf}(u, d)}{\sum_{v \in d} \text{tf}(v, d)} \qquad \text{Eq. (4)}$$

   **Where:**
   - $p(u|d)$ represents the probability of term $u$ occurring in document $d$.
   - $\text{tf}(u, d)$ denotes the frequency of term $u$ in document $d$.
   - $\sum_{v \in d} \text{tf}(v, d)$ is the total number of term occurrences in document $d$, i.e. its length.
   
   
   
5. **Probability of Query Term $q_i$ in the Collection $C$: $p(q_i|C)$**

   The probability $p(q_i|C)$ represents the likelihood of term $q_i$ appearing in the entire collection of documents $C$.
   It is calculated as the frequency of $q_i$ in the collection $C$ divided by the total number of term occurrences in the collection.

   - $$p(q_i|C) = \frac{\text{cf}(q_i, C)}{\sum_{v \in C} \text{cf}(v, C)} \qquad \text{Eq. (5)}$$

   **Where:**
   - $\text{cf}(q_i, C)$: Collection frequency of term $q_i$ in the collection $C$.
   - $\sum_{v \in C} \text{cf}(v, C)$: Total number of term occurrences in the entire collection $C$.




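6. **Connecting the Two Concepts**:

   - The document model $p(q_i|d)$ in Eq. (1) is estimated with the translation model of Eq. (3), taking $w = q_i$:

     $p(q_i|d) \approx p_t(q_i|d) = \sum_{u \in d} p_t(q_i|u) \, p(u|d)$

   - $p(u|d)$ is taken from Eq. (4) and $p(q_i|C)$ from Eq. (5).

   - Substitute these quantities into Eq. (1).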

7. **Final Log-Likelihood for Document Ranking, Eq. (1) with the Translation Model**:
   - Using the neural translation model, the log-likelihood equation becomes:

     $$\log p(q|d) = \sum_{i:c(q_i;d)>0} \log \left( \frac{\sum_{u \in d} p_t(q_i|u) \, p(u|d)}{\alpha_d \, p(q_i|C)} \right) + n \log \alpha_d$$
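To make the substitution concrete, the final equation can be evaluated by hand on a toy case. The sketch below is only an illustration: the document, the translation table `p_t`, the collection probability, and the value of `alpha_d` are all made-up assumptions, and the condition of the sum is taken to mean "the translated probability is positive".

```python
import math

# Hypothetical toy data (not from the AP collection)
doc_tokens = ["plane", "jet", "plane", "airline"]
# p_t[w][u]: assumed probability of translating document term u into query term w
p_t = {"aircraft": {"plane": 0.5, "jet": 0.4, "airline": 0.1}}
p_collection = {"aircraft": 0.001}  # assumed p(q_i|C), Eq. (5)
alpha_d = 0.5                       # assumed normalizing factor


def p_u_given_d(u, tokens):
    """Eq. (4): relative frequency of term u in the document."""
    return tokens.count(u) / len(tokens)


def p_t_w_given_d(w, tokens):
    """Eq. (3): sum over the distinct document terms u of p_t(w|u) * p(u|d)."""
    return sum(p_t[w].get(u, 0.0) * p_u_given_d(u, tokens) for u in set(tokens))


def log_likelihood(query, tokens):
    """Eq. (1) with p(q_i|d) replaced by the translation estimate p_t(q_i|d)."""
    total = 0.0
    for q_i in query:
        p_qd = p_t_w_given_d(q_i, tokens)
        if p_qd > 0:
            total += math.log(p_qd / (alpha_d * p_collection[q_i]))
    # n * log(alpha_d) is added once per query, outside the sum
    return total + len(query) * math.log(alpha_d)


score = log_likelihood(["aircraft"], doc_tokens)
print(p_t_w_given_d("aircraft", doc_tokens), score)
```

Here $p(\text{plane}|d) = 0.5$ and $p(\text{jet}|d) = p(\text{airline}|d) = 0.25$, so $p_t(\text{aircraft}|d) = 0.5 \cdot 0.5 + 0.4 \cdot 0.25 + 0.1 \cdot 0.25 = 0.375$.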
     
   

In [1]:
import gensim
import scipy
from gensim.models import Word2Vec
import numpy as np
import pyterrier as pt
import os
import pandas as pd

In [2]:
# Set JAVA_HOME environment variable
java_home = r"C:\Program Files\Java\jdk-22"   # adjust to the location of your JDK installation
os.environ["JAVA_HOME"] = java_home

# Verify that JAVA_HOME is set correctly
print("JAVA_HOME set to:", os.environ.get("JAVA_HOME"))

if not pt.started():
  pt.init()



JAVA_HOME set to: C:\Program Files\Java\jdk-22


PyTerrier 0.10.0 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8



In [3]:
# Define the relative paths based on the notebook's location
DISK45_PATH = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated")
INDEX_DIR = os.path.join("..", "Data", "AP_Doc", "ap", "index1")

# Check if the index exists
if os.path.exists(os.path.join(INDEX_DIR, "data.properties")):
    indexref = pt.IndexRef.of(os.path.join(INDEX_DIR, "data.properties"))
else:    
    # Find files in the directory
    files = pt.io.find_files(DISK45_PATH)
    
    # Remove unwanted files
    bad = ['/CR/', '/AUX/', 'READCHG', 'READMEFB', 'READFRCG', 'READMEFR', 'READMEFT', 'READMELA']
    for b in bad:
        files = list(filter(lambda f: b not in f, files))
    
    # Check if files list is empty and raise an error if it is
    if not files:
        raise ValueError(f"No files found in the directory {DISK45_PATH}")
    
    # Index the remaining files
    indexer = pt.TRECCollectionIndexer(INDEX_DIR, verbose=True)
    indexref = indexer.index(files)

# Create an index object
index = pt.IndexFactory.of(indexref)

# collection statistics
print(index.getCollectionStatistics().toString())


Number of documents: 242918
Number of terms: 301375
Number of postings: 44556521
Number of fields: 0
Number of tokens: 69541941
Field names: []
Positions:   false
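
The collection statistics can be turned into a couple of derived figures, such as the average document length, which is a common reference point when choosing a Dirichlet smoothing parameter. The snippet below is a small sketch that only reuses the numbers printed by the cell; the variable names are illustrative.

```python
# Figures copied from the collection statistics printed by the indexing cell
num_docs = 242_918
num_terms = 301_375
num_postings = 44_556_521
num_tokens = 69_541_941

avg_doc_len = num_tokens / num_docs               # tokens per document
avg_postings_per_term = num_postings / num_terms  # documents per distinct term

print(f"average document length: {avg_doc_len:.1f} tokens")
print(f"average postings per term: {avg_postings_per_term:.1f}")
```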



## Loading the AP Topics & qrels

In [4]:
# Define the relative paths based on the notebook's location
topics_path = os.path.join("..", "Data", "AP_Doc", "ap", "topics", "all_topics_fixed.txt")
qrels_path = os.path.join("..", "Data", "AP_Doc", "ap", "qrels", "AP_only.txt")
index_path = os.path.join("..", "Data", "AP_Doc", "ap", "index1")

#topics_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "topics", "all_topics.txt")
#qrels_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "qrels", "WSJ_only.txt")
#index_path = os.path.join("..", "Data", "WSJ_Doc", "wsj", "index_WSJ")

# Load topics and qrels from text files
topics = pt.io.read_topics(topics_path)
qrels = pt.io.read_qrels(qrels_path)


In [5]:
qrels

| | qid | docno | label |
|---|---|---|---|
| 0 | 51 | AP880301-0271 | 1 |
| 1 | 51 | AP880302-0275 | 1 |
| 2 | 51 | AP880311-0301 | 1 |
| 3 | 51 | AP880316-0292 | 1 |
| 4 | 51 | AP880318-0287 | 1 |
| ... | ... | ... | ... |
| 63203 | 200 | AP891129-0238 | 0 |
| 63204 | 200 | AP891130-0262 | 0 |
| 63205 | 200 | AP891204-0017 | 0 |
| 63206 | 200 | AP891214-0236 | 0 |


In [6]:
topics

| | qid | query |
|---|---|---|
| 0 | 51 | airbus subsidies |
| 1 | 52 | south african sanctions |
| 2 | 53 | leveraged buyouts |
| 3 | 54 | satellite launch contracts |
| 4 | 55 | insider trading |
| ... | ... | ... |
| 145 | 196 | school choice voucher system and its effects u... |
| 146 | 197 | reform of the jurisprudence system to stop jur... |
| 147 | 198 | gene therapy and its benefits to humankind |
| 148 | 199 | legality of medically assisted suicides |


In [7]:
# Parse a concatenated TREC file into a {docno: text} dictionary
def parse_trec_file(trec_file_path):
    # Lines starting with one of these prefixes are markup and are not added to the text.
    # Note: lines such as '<BYLINE>...</BYLINE>' start with the opening tag, so the
    # '</BYLINE>' prefix does not match them and bylines stay in the parsed text.
    skipped_prefixes = ('<DOC>', '</DOC>', '<FILEID>', '<FIRST>', '<SECOND>', '<HEAD>',
                        '</BYLINE>', '<DATELINE>', '<TEXT>')

    encodings = ['utf-8', 'latin-1', 'ISO-8859-1']
    for encoding in encodings:
        # Start from a clean state for every attempted encoding
        doc_texts = {}
        current_doc_id = None
        current_text = []
        try:
            # errors='ignore' drops undecodable bytes, so the first encoding normally succeeds
            with open(trec_file_path, 'r', encoding=encoding, errors='ignore') as file:
                for line in file:
                    if line.startswith('<DOCNO>'):
                        current_doc_id = line.strip().replace('<DOCNO>', '').replace('</DOCNO>', '').strip()
                    elif line.startswith('</TEXT>'):
                        # End of the document body: store the collected text
                        if current_doc_id:
                            doc_texts[current_doc_id] = ' '.join(current_text)
                            current_doc_id = None
                            current_text = []
                    elif current_doc_id and not line.startswith(skipped_prefixes):
                        current_text.append(line.strip())
            break
        except UnicodeDecodeError:
            continue

    return doc_texts

# Path to your concatenated TREC file
trec_file_path = os.path.join("..", "Data", "AP_Doc", "ap", "concatenated", "concatenated_documents.txt")
#trec_file_path = os.path.join("..", "Data", "WSJ_DOC", "wsj", "concatenated_WSJ", "concatenated_WSJ.txt")

# Parse the document texts
doc_texts = parse_trec_file(trec_file_path)
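
The parsing logic can be checked on a small in-memory sample before running it on the full file. The sketch below is a hypothetical variant, not the function used in this notebook: `parse_trec_lines`, `TAG_RE`, `SKIPPED_FIELDS`, and the sample document are illustrative names and data. Unlike the cell above, it also skips `<BYLINE>` lines and strips any remaining bare tags.

```python
import re

# Hypothetical helper: parse an iterable of lines in a TREC-style layout into {docno: text}
TAG_RE = re.compile(r"</?[A-Z0-9]+>")
SKIPPED_FIELDS = ("<FILEID>", "<FIRST>", "<SECOND>", "<HEAD>", "<BYLINE>", "<DATELINE>")


def parse_trec_lines(lines):
    docs, doc_id, buffer = {}, None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith("<DOCNO>"):
            doc_id = TAG_RE.sub("", line).strip()
        elif line.startswith("</TEXT>"):
            # End of the document body: store what was collected and reset
            if doc_id:
                docs[doc_id] = " ".join(buffer)
            doc_id, buffer = None, []
        elif doc_id and line and not line.startswith(SKIPPED_FIELDS):
            text = TAG_RE.sub("", line).strip()  # drops bare tags such as <TEXT>
            if text:
                buffer.append(text)
    return docs


# Made-up sample document for illustration only
sample = """<DOC>
<DOCNO> AP880212-0044 </DOCNO>
<FILEID>example-file-id</FILEID>
<HEAD>Example headline</HEAD>
<BYLINE>By A. REPORTER</BYLINE>
<TEXT>
First sentence of the story.
Second sentence.
</TEXT>
</DOC>
"""
parsed = parse_trec_lines(sample.splitlines())
print(parsed)
```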



In [4]:
# dict views are not subscriptable, so convert to a list before slicing
doc_texts1 = list(doc_texts.values())
doc_texts1[:5]


In [6]:
dict(list(doc_texts.items())[43:50])

{'AP880212-0044': "<BYLINE>By HENRY GOTTLIEB</BYLINE> <BYLINE>Associated Press Writer</BYLINE> Three Eastern European countries have begun talks with the Boeing and McDonnell Douglas aircraft companies to supply long-range passenger planes to replace aging Soviet equipment in their national airlines, Deputy Secretary of State John C. Whitehead said Friday. Whitehead, just back from a two-week trip to four Soviet-bloc countries, told reporters ``my impression is they are very much interested in buying U.S. planes. It's a significant thing, psychologically, I believe _ Eastern European countries flying U.S. planes.'' He declined to identify the countries that expressed interest, but another official, speaking on condition of anonymity, said they were Poland, Hungary and Romania. If deals are arranged it would be the first U.S. aircraft sales to Hungary and Poland. In the early 1970s, Romania purchased several Boeing 707 passenger jets under a contract arranged at a time when the United S

In [11]:
unique_tokens = set()

# Function to add tokens to the set
def add_tokens(value):
    if isinstance(value, str):
        # Split the string and add tokens to the set
        unique_tokens.update(value.split())
    elif isinstance(value, list):
        # If the value is a list, iterate through each element
        for item in value:
            add_tokens(item)

# Iterate through each key-value pair 
for value in doc_texts.values():
    add_tokens(value)

print("Total number of unique tokens in the dictionary:", len(unique_tokens))
#print("Unique tokens:", unique_tokens)

Total number of unique tokens in the dictionary: 1112170
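
This count is taken over whitespace-separated, case-sensitive tokens, so `Boeing`, `Boeing,` and `boeing` are counted as three different entries. That is likely one reason it is much larger than the 301,375 terms reported by the Terrier index, which applies its own tokenization and term pipeline. A minimal sketch of a simple normalizer follows; `normalize_tokens` and the sample strings are assumptions for illustration, not the tokenization Terrier uses.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")


def normalize_tokens(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return TOKEN_RE.findall(text.lower())


# Made-up sample strings
sample_texts = ["Boeing, Boeing and McDonnell Douglas.", "``boeing'' planes"]

raw_vocab = {tok for text in sample_texts for tok in text.split()}
norm_vocab = {tok for text in sample_texts for tok in normalize_tokens(text)}
print(len(raw_vocab), len(norm_vocab))
```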


In [8]:
# Load pre-trained word embeddings
  
#word_embeddings_path = os.path.join("../", "NTLM_Experiment", "GoogleNews-vectors-negative300-SLIM.bin")   # download the pretrained model
word_embeddings_path = r"C:\Users\dolla\NTLM\GoogleNews-vectors-negative300-SLIM.bin" 

word_embeddings = gensim.models.KeyedVectors.load_word2vec_format(word_embeddings_path, binary=True)
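
Because the scoring code later skips every term that is missing from the embedding vocabulary, it can be useful to measure how many query or document tokens are actually covered. The helper below is a hypothetical sketch; `embedding_coverage` and the toy vocabulary are illustrative, and a plain dict stands in for the loaded vectors.

```python
def embedding_coverage(tokens, vocabulary):
    """Return the fraction of tokens found in `vocabulary` (any object supporting `in`)."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if tok in vocabulary) / len(tokens)


# A plain dict used as a stand-in for the loaded embeddings
toy_vocabulary = {"airbus": None, "subsidies": None}
coverage = embedding_coverage("airbus subsidies unknownterm".split(), toy_vocabulary)
print(coverage)
```

With the real model, `word_embeddings` could be passed as `vocabulary`, since the notebook already uses `term in word_embeddings` for membership tests.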



In [9]:
import numpy as np
from collections import defaultdict
from tqdm import tqdm
import pyterrier as pt

# Function to calculate term frequencies for documents and collection
def calculate_term_frequencies(doc_texts):
    term_frequencies = {}
    collection_frequencies = defaultdict(int)
    total_terms_in_collection = 0
    
    for doc_id, text in doc_texts.items():
        term_freq = defaultdict(int)
        document_terms = text.split()  # Split text into terms
        total_terms_in_doc = len(document_terms)
        
        for term in document_terms:
            term_freq[term] += 1
            collection_frequencies[term] += 1
            total_terms_in_collection += 1
        
        term_frequencies[doc_id] = (term_freq, total_terms_in_doc)
    
    return term_frequencies, collection_frequencies, total_terms_in_collection

# Function to calculate the probability of term u given document d
def p_u_given_d(term, doc_term_freq, total_terms_in_doc):
    return doc_term_freq[term] / total_terms_in_doc if total_terms_in_doc > 0 else 0

# Function to calculate the probability of term u in the collection C
def p_u_given_C(term, collection_frequencies, total_terms_in_collection):
    return collection_frequencies[term] / total_terms_in_collection if total_terms_in_collection > 0 else 0

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Compute a translation score from cosine similarity.
# Note: the cosine is rescaled to [0, 1]; it is not normalized over the vocabulary as in Eq(2).
def compute_translation_probability(target_word, candidate_term, word_embeddings):
    if target_word in word_embeddings and candidate_term in word_embeddings:
        target_vector = word_embeddings[target_word]
        candidate_vector = word_embeddings[candidate_term]
        # Map cosine similarity from [-1, 1] to [0, 1]
        return (cosine_similarity(target_vector, candidate_vector) + 1) / 2
    else:
        return 0.0

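# Compute the log-likelihood contribution of one query term.
# Note: n * log(alpha) is added on every call, i.e. once per matching query term,
# whereas Eq(1) adds it once per query.
def compute_log_likelihood_ratio(translation_prob, p_qi_C, alpha, n):
    return (np.log(translation_prob / (p_qi_C * alpha)) + n * np.log(alpha))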

# Score a document for a query based on the translation scores
def score_document(query, document, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha=0.07):
    score = 0.0
    n = len(query)

    for query_term in query:
        if query_term not in word_embeddings:
            continue  # Skip query terms not in word embeddings

        # Both probabilities depend only on the query term, so compute them once per term.
        # Note: the weight used here is p(q_i|d), not p(u|d) of the document term as in Eq(3).
        p_qi_d = p_u_given_d(query_term, doc_term_freq, total_terms_in_doc) # ........Eq(4)
        p_qi_C = p_u_given_C(query_term, collection_frequencies, total_terms_in_collection) # ........Eq(5)
        if p_qi_d == 0:
            continue  # The sum below would be 0, so this term contributes nothing

        translation_prob_sum = 0.0

        # Iterates over every token of the document, so repeated terms are counted repeatedly
        for doc_term in document:
            if doc_term not in word_embeddings:
                continue  # Skip document terms not in word embeddings

            translation_prob = compute_translation_probability(query_term, doc_term, word_embeddings)
            translation_prob_sum += (translation_prob * p_qi_d)  # ......Eq(3)
        
        if translation_prob_sum > 0:
            log_likelihood_ratio = compute_log_likelihood_ratio(translation_prob_sum, p_qi_C, alpha, n) # .....Eq(1)
            score += log_likelihood_ratio

    return score

dirichlet = pt.BatchRetrieve(index_path, wmodel="DirichletLM", controls={'dirichletlm.mu': 1000}, verbose=True) # , num_results=10

# Retrieve and rank documents
def retrieve_and_rank_documents(doc_texts, topics, word_embeddings, alpha=0.07):
    results = []
    
    # Calculate term frequencies for documents and collection
    term_frequencies, collection_frequencies, total_terms_in_collection = calculate_term_frequencies(doc_texts)
    
    # Retrieve initial set of documents using Dirichlet language model
    result = dirichlet.transform(topics)
    
    for idx, row in tqdm(topics.iterrows(), total=len(topics), desc="Processing Topics", position=0, leave=True):
        topic_id = row['qid']
        query = row['query'].split()
        scores = []
        
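        # Documents returned by the Dirichlet run for this topic (a set gives fast membership tests)
        retrieved_docs = set(result.loc[result["qid"] == topic_id]['docno'].values)
        
        # Re-score only the documents retrieved for this topic
        for doc_id in doc_texts.keys():
            if doc_id not in retrieved_docs:
                continue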
            doc_text = doc_texts[doc_id]
            doc_terms = doc_text.split()
            doc_term_freq, total_terms_in_doc = term_frequencies[doc_id]
            score = score_document(query, doc_terms, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha)
            scores.append((doc_id, score))
        
        ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked_scores):
            results.append({
                'qid': topic_id,
                'docno': doc_id,
                'rank': rank + 1,
                'score': score,
                'query': ' '.join(query)
            })
    return pd.DataFrame(results, columns=['qid', 'docno', 'rank', 'score', 'query'])

retrieved_results = retrieve_and_rank_documents(doc_texts, topics, word_embeddings)


BR(DirichletLM): 100%|██████████| 150/150 [00:11<00:00, 13.52q/s]
Processing Topics: 100%|██████████| 150/150 [1:28:44<00:00, 35.50s/it]
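
The cell above weights every cosine score by $p(q_i|d)$ and rescales the cosine to $[0, 1]$, which differs from Eq. (2) and Eq. (3) as written. The sketch below shows one possible reading of those two equations on toy data. It is an assumption-laden illustration, not the code that produced the results in this notebook: the function names and 2-d vectors are hypothetical, negative cosines are clipped to zero, $p(u|d)$ is computed over tokens that have a vector, and the vocabulary $V$ is taken to be the embedding vocabulary.

```python
from collections import Counter

import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def translation_table(query_term, vectors):
    """Eq. (2)-style scores: cos(u, w) normalized over all u in the vocabulary (negatives clipped to 0)."""
    sims = {u: max(cosine(vec, vectors[query_term]), 0.0) for u, vec in vectors.items()}
    total = sum(sims.values())
    return {u: (s / total if total > 0 else 0.0) for u, s in sims.items()}


def p_t_w_given_d(query_term, doc_tokens, vectors):
    """Eq. (3): sum over the distinct document terms u of p_t(w|u) * p(u|d)."""
    known = [t for t in doc_tokens if t in vectors]  # tokens without a vector are ignored
    if query_term not in vectors or not known:
        return 0.0
    counts = Counter(known)
    table = translation_table(query_term, vectors)
    return sum(table[u] * c / len(known) for u, c in counts.items())


# Hypothetical 2-d vectors standing in for word2vec embeddings
toy_vectors = {
    "plane": np.array([1.0, 0.0]),
    "jet": np.array([1.0, 1.0]),
    "bank": np.array([0.0, 1.0]),
}
value = p_t_w_given_d("plane", ["jet", "jet", "bank"], toy_vectors)
print(value)
```

Normalizing over a full embedding vocabulary for every query term is expensive, so a practical version would probably vectorize the cosine computation or restrict $V$ to the most similar terms.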


In [10]:
pt.Experiment([retrieved_results,dirichlet], topics, qrels, eval_metrics=["map", "P_10"], names=["NTLM", "Dirichlet"], filter_by_qrels=True,
              filter_by_topics=True)

BR(DirichletLM): 100%|██████████| 148/148 [00:04<00:00, 33.03q/s]


| | name | map | P_10 |
|---|---|---|---|
| 0 | NTLM | 0.086965 | 0.135135 |
| 1 | Dirichlet | 0.184347 | 0.3 |


In [11]:
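# Compute Mean Average Precision (MAP)
# Note: every judged document is treated as relevant (the 'label' column is ignored) and the
# average is taken over the relevant documents retrieved, so this differs from trec_eval's MAP.
def calculate_map(retrieved_results, qrels):
    avg_precision = []
    for qid in retrieved_results['qid'].unique():
        relevant_docs = set(qrels[qrels['qid'] == qid]['docno'])
        retrieved_docs = retrieved_results[retrieved_results['qid'] == qid]['docno'].tolist()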
        
        precision_at_k = []
        num_relevant = 0
        for i, doc in enumerate(retrieved_docs):
            if doc in relevant_docs:
                num_relevant += 1
                precision_at_k.append(num_relevant / (i + 1))
        
        if len(precision_at_k) > 0:
            avg_precision.append(np.mean(precision_at_k))
        else:
            avg_precision.append(0.0)
    
    return np.mean(avg_precision)

map_score = calculate_map(retrieved_results, qrels)
print(f"Mean Average Precision (MAP): {map_score}")

Mean Average Precision (MAP): 0.25319642029633127


### Calculate MAP

In [26]:
def calculate_average_precision(relevance_labels):
    precision_sum = 0.0
    num_relevant_docs = 0
    num_relevant_retrieved = 0
    
    for i, label in enumerate(relevance_labels):
        if label == 1:
            num_relevant_retrieved += 1
            precision = num_relevant_retrieved / (i + 1)
            precision_sum += precision
            num_relevant_docs += 1
    
    if num_relevant_docs == 0:
        return 0.0
    else:
        return precision_sum / num_relevant_docs

def calculate_map(retrieved_results, qrels):
    grouped_results = retrieved_results.groupby('qid')
    total_average_precision = 0.0
    num_queries = 0
    
    for query_id, group in grouped_results:
        retrieved_docnos = group['docno'].values
        
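        # Judged documents for this query; the 'label' column is not consulted,
        # so documents judged non-relevant (label 0) also count as hits
        judged_docnos = set(qrels.loc[qrels['qid'] == query_id, 'docno'])
        relevance_labels = [1 if docno in judged_docnos else 0 for docno in retrieved_docnos]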
        
        #print(f"Query ID: {query_id}")
        #print(f"Retrieved Docnos: {retrieved_docnos}")
        #print(f"Relevance Labels: {relevance_labels}")
        
        average_precision = calculate_average_precision(relevance_labels)
        #print(f"Average Precision: {average_precision}")
        
        total_average_precision += average_precision
        num_queries += 1
    
    map_score = total_average_precision / num_queries
    return map_score

# Calculate MAP
map_score = calculate_map(retrieved_results, qrels)
print("MAP Score:", map_score)

MAP Score: 0.25560750142638755
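
The qrels shown earlier contain rows with label 0, and the MAP helpers above count any judged document as a hit, which makes their values hard to compare with the `map` column reported by `pt.Experiment`. The following is a minimal sketch of a label-aware average precision that divides by the number of relevant documents. The function names and toy data are hypothetical, and the plain-dict inputs are an assumption made to keep the example self-contained.

```python
def average_precision(ranked_docnos, relevant_docnos):
    """AP for one query: precision at each relevant hit, divided by the number of relevant documents."""
    if not relevant_docnos:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_docnos)


def mean_average_precision(run, judgments):
    """run: {qid: ranked docnos}; judgments: {qid: {docno: label}}; label > 0 means relevant."""
    ap_values = []
    for qid, ranked in run.items():
        relevant = {d for d, label in judgments.get(qid, {}).items() if label > 0}
        ap_values.append(average_precision(ranked, relevant))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# Made-up run and judgments: d2 is judged non-relevant, d5 is relevant but not retrieved
toy_run = {"51": ["d1", "d2", "d3", "d4"]}
toy_judgments = {"51": {"d1": 1, "d2": 0, "d3": 1, "d5": 1}}
toy_map = mean_average_precision(toy_run, toy_judgments)
print(toy_map)
```

In the toy case the relevant hits are at ranks 1 and 3, so AP is $(1/1 + 2/3) / 3$, with the unretrieved relevant document `d5` still counted in the denominator.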


### Calculate P@10

In [13]:
def calculate_precision_at_k(relevance_labels, k):
   
   
    num_relevant_retrieved = sum(relevance_labels[:k])
    return num_relevant_retrieved / k

def calculate_p_at_10(retrieved_results, qrels):
  
  
    grouped_results = retrieved_results.groupby('qid')
    total_precision_at_10 = 0.0
    num_queries = 0
    
    for query_id, group in grouped_results:
        retrieved_docnos = group['docno'].values
        
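        # Judged documents for this query; the 'label' column is not consulted,
        # so documents judged non-relevant (label 0) also count as hits
        judged_docnos = set(qrels.loc[qrels['qid'] == query_id, 'docno'])
        relevance_labels = [1 if docno in judged_docnos else 0 for docno in retrieved_docnos]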
        
        precision_at_10 = calculate_precision_at_k(relevance_labels, 10)
        
        total_precision_at_10 += precision_at_10
        num_queries += 1
    
    average_precision_at_10 = total_precision_at_10 / num_queries
    return average_precision_at_10

# Example usage:
p_at_10_score = calculate_p_at_10(retrieved_results, qrels)
print("P@10 Score:", p_at_10_score)


P@10 Score: 0.3413333333333334


In [12]:

def calculate_p_at_k(retrieved_results, qrels, k=10):
    precision_scores = []
    for qid in retrieved_results['qid'].unique():
        relevant_docs = set(qrels[qrels['qid'] == qid]['docno'].tolist())
        retrieved_docs = retrieved_results[retrieved_results['qid'] == qid]['docno'].tolist()[:k]
        
        num_relevant = len(set(retrieved_docs).intersection(relevant_docs))
        precision_at_k = num_relevant / k if k > 0 else 0.0
        
        precision_scores.append(precision_at_k)
    
    return np.mean(precision_scores)

# Example usage:
p_at_10_score = calculate_p_at_k(retrieved_results, qrels, k=10)
print(f"Precision at 10 (P@10): {p_at_10_score}")


Precision at 10 (P@10): 0.3413333333333334
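
Both P@10 helpers above match retrieved documents against every judged document, regardless of its label. A small sketch of a label-aware precision at $k$ is given below. `precision_at_k` and the toy inputs are hypothetical, and treating unjudged documents as non-relevant is an assumption.

```python
def precision_at_k(ranked_docnos, labels, k=10):
    """Fraction of the top-k documents whose judgment label is > 0."""
    if k <= 0:
        return 0.0
    top_k = ranked_docnos[:k]
    # Unjudged documents default to label 0, i.e. non-relevant
    return sum(1 for docno in top_k if labels.get(docno, 0) > 0) / k


# Made-up ranking and labels: d2 is judged non-relevant, d4 is unjudged
toy_ranking = ["d1", "d2", "d3", "d4"]
toy_labels = {"d1": 1, "d2": 0, "d3": 1}
p_at_4 = precision_at_k(toy_ranking, toy_labels, k=4)
print(p_at_4)
```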


In [None]:
import numpy as np
from collections import defaultdict
import pandas as pd
from tqdm import tqdm
from transformers import BertTokenizer, BertModel
import torch
import os

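# Load BERT tokenizer and model
# (not used by the scoring functions below, which read precomputed vectors)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Load precomputed embeddings from a NumPy file holding a pickled {term: vector} dict
# (e.g. one written with np.save); a word2vec-format .bin file will not load this way
def load_embeddings(file_path):
    with open(file_path, 'rb') as f:
        embeddings = np.load(f, allow_pickle=True).item()
    return embeddings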

# Function to calculate term frequencies for documents and collection
def calculate_term_frequencies(doc_texts):
    term_frequencies = {}
    collection_frequencies = defaultdict(int)
    total_terms_in_collection = 0
    
    for doc_id, text in doc_texts.items():
        term_freq = defaultdict(int)
        document_terms = text.split()  # Split text into terms
        total_terms_in_doc = len(document_terms)
        
        for term in document_terms:
            term_freq[term] += 1
            collection_frequencies[term] += 1
            total_terms_in_collection += 1
        
        term_frequencies[doc_id] = (term_freq, total_terms_in_doc)
    
    return term_frequencies, collection_frequencies, total_terms_in_collection

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

# Compute translation probability using cosine similarity with BERT embeddings
def compute_translation_probability(target_word, candidate_term, word_embeddings):
    if target_word in word_embeddings and candidate_term in word_embeddings:
        target_vector = word_embeddings[target_word]
        candidate_vector = word_embeddings[candidate_term]
        
        # Calculate cosine similarity
        similarity = cosine_similarity(target_vector, candidate_vector)
        
        # Normalize cosine similarity to [0, 1]
        return (similarity + 1) / 2
    else:
        return 0.0

# Function to calculate the probability of term u given document d
def p_u_given_d(term, doc_term_freq, total_terms_in_doc):
    return doc_term_freq[term] / total_terms_in_doc if total_terms_in_doc > 0 else 0

# Function to calculate the probability of term u in the collection C
def p_u_given_C(term, collection_frequencies, total_terms_in_collection):
    return collection_frequencies[term] / total_terms_in_collection if total_terms_in_collection > 0 else 0

# Compute log-likelihood ratio
def compute_log_likelihood_ratio(translation_prob, p_qi_C, alpha, n):
    return (np.log(translation_prob / (p_qi_C * alpha)) + n * np.log(alpha))

# Score a document for a query based on translation scores from BERT embeddings
def score_document(query, document, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha=0.1):
    score = 0.0
    n = len(query)

    for query_term in query:
        if query_term not in word_embeddings:
            continue  # Skip query terms not in word embeddings

        # Both probabilities depend only on the query term, so compute them once per term.
        # Note: the weight used here is p(q_i|d), not p(u|d) of the document term as in Eq(3).
        p_qi_d = p_u_given_d(query_term, doc_term_freq, total_terms_in_doc) # Eq(4)
        p_qi_C = p_u_given_C(query_term, collection_frequencies, total_terms_in_collection) # Eq(5)
        if p_qi_d == 0:
            continue  # The sum below would be 0, so this term contributes nothing

        translation_prob_sum = 0.0

        for doc_term in document:
            if doc_term not in word_embeddings:
                continue  # Skip document terms not in word embeddings

            translation_prob = compute_translation_probability(query_term, doc_term, word_embeddings)
            translation_prob_sum += (translation_prob * p_qi_d)  # Eq(3)
        
        if translation_prob_sum > 0:
            log_likelihood_ratio = compute_log_likelihood_ratio(translation_prob_sum, p_qi_C, alpha, n) # Eq(1)
            score += log_likelihood_ratio

    return score

# Example usage with Terrier
import pyterrier as pt
if not pt.started():
    pt.init()
# Pass the index_path variable defined earlier, not the literal string "index_path"
dirichlet = pt.BatchRetrieve(index_path, wmodel="DirichletLM", controls={'dirichletlm.mu': 1000}, verbose=True)

# Retrieve and rank documents using BERT embeddings
def retrieve_and_rank_documents(doc_texts, topics, word_embeddings, alpha=0.1):
    results = []
    
    # Calculate term frequencies for documents and collection
    term_frequencies, collection_frequencies, total_terms_in_collection = calculate_term_frequencies(doc_texts)
    
    # Retrieve initial set of documents using Dirichlet language model
    result = dirichlet.transform(topics)
    
    for idx, row in tqdm(topics.iterrows(), total=len(topics), desc="Processing Topics", position=0, leave=True):
        topic_id = row['qid']
        query = row['query'].split()
        scores = []
        
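        # Documents returned by the Dirichlet run for this topic (a set gives fast membership tests)
        retrieved_docs = set(result.loc[result["qid"] == topic_id]['docno'].values)
        
        # Re-score only the documents retrieved for this topic
        for doc_id in doc_texts.keys():
            if doc_id not in retrieved_docs:
                continue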
            doc_text = doc_texts[doc_id]
            doc_terms = doc_text.split()
            doc_term_freq, total_terms_in_doc = term_frequencies[doc_id]
            score = score_document(query, doc_terms, word_embeddings, doc_term_freq, total_terms_in_doc, collection_frequencies, total_terms_in_collection, alpha)
            scores.append((doc_id, score))
        
        ranked_scores = sorted(scores, key=lambda x: x[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ranked_scores):
            results.append({
                'qid': topic_id,
                'docno': doc_id,
                'rank': rank + 1,
                'score': score,
                'query': ' '.join(query)
            })
    return pd.DataFrame(results, columns=['qid', 'docno', 'rank', 'score', 'query'])

# Example usage
if __name__ == "__main__":
    # Path to the BERT embeddings .bin file
    embeddings_file_path = 'path_to_your_embeddings_file.bin'
    
    # Load word embeddings
    word_embeddings = load_embeddings(embeddings_file_path)
    
    # Assuming 'doc_texts' is a dictionary of document IDs and their corresponding text
    # Assuming 'topics' is a DataFrame with columns 'qid' and 'query'
    retrieved_results = retrieve_and_rank_documents(doc_texts, topics, word_embeddings)

    # Save or process retrieved_results as needed
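
`load_embeddings` above calls `np.load(...).item()`, which works when the file holds a dictionary saved through NumPy's pickle support. The sketch below shows a round trip in that format under that assumption. The toy dictionary and file name are hypothetical, and real BERT-derived vectors would have to be produced and saved the same way beforehand.

```python
import os
import tempfile

import numpy as np

# Hypothetical {term: vector} dictionary standing in for precomputed embeddings
toy_embeddings = {"airbus": np.array([0.1, 0.2]), "subsidies": np.array([0.3, 0.4])}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    # np.save stores the dict as a pickled 0-d object array
    np.save(path, toy_embeddings, allow_pickle=True)
    with open(path, "rb") as f:
        # .item() unwraps the 0-d array back into the original dict
        restored = np.load(f, allow_pickle=True).item()

print(sorted(restored))
```

Files loaded with `allow_pickle=True` can execute arbitrary code when unpickled, so this format is only appropriate for files from a trusted source.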
