This notebook creates our own dataset for ranking. The dataset is saved as an `.xlsx` file at `FILE_PATH`.

The idea of this notebook is to create "expert" rankings from the crawled data. Since the crawled data contains meta information such as "keywords" or "headings", this meta information can be used to create queries and corresponding rankings of our documents.

It works like this:
1. We sample a query from the given meta information, e.g. the keywords "tübingen neuroscience".
2. We prefilter every document of our index to get a list of true positives. For each query we want to rank `NUMBER_RANKED_DOCUMENTS` documents, e.g. 100. If more than 100 matching documents are available, we cut off the list. If fewer than 100 are available, we draw negative samples without replacement to fill the list up to 100 documents.
3. We run BM25 to score these documents.
4. Because BM25 does not produce the best results and we also cannot fully trust the keywords, we sample Gumbel noise for each ranking.
5. We calculate a softmax of (BM25 + Gumbel noise) over the documents to get the final ranking scores. These rankings are saved as an Excel file.
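
Steps 4 and 5 can be illustrated with a small, dependency-free sketch. The helper names (`sample_gumbel_noise`, `softmax`, `perturbed_scores`) and the example BM25 scores are made up for illustration; the notebook itself does this step with `torch`.

```python
import math
import random


def sample_gumbel_noise(n, rng, eps=1e-20):
    """Draw n samples from a standard Gumbel(0, 1) distribution."""
    # Inverse transform sampling: g = -log(-log(u)) with u ~ Uniform(0, 1).
    return [-math.log(-math.log(rng.random() + eps) + eps) for _ in range(n)]


def softmax(values):
    """Numerically stable softmax: the results sum to 1."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def perturbed_scores(bm25_scores, rng):
    """Add Gumbel noise to each BM25 score and normalize with a softmax."""
    noise = sample_gumbel_noise(len(bm25_scores), rng)
    return softmax([s + g for s, g in zip(bm25_scores, noise)])


rng = random.Random(0)  # fixed seed so the sketch is reproducible
example_bm25 = [4.2, 3.1, 0.0, 0.0]  # hypothetical BM25 scores for four documents
example_scores = perturbed_scores(example_bm25, rng)
# Document positions ordered by their noisy score, best first.
example_ranking = sorted(
    range(len(example_scores)), key=lambda i: example_scores[i], reverse=True
)
```

Because the noise is redrawn for every query, documents with similar BM25 scores can swap places from one draw to the next, while documents with much higher scores tend to stay near the top.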

(At the end of the notebook, some sanity checks are performed.)


In [1]:
# Imports
import os
import random 
import numpy as np
import pandas as pd
import torch

from tqdm.notebook import tqdm

from db.DocumentRepository import DocumentRepository
from ranker.QueryResult import QueryResult
from ranker.ranker import RankerFlat
from utils.directoryutil import get_path

In [2]:
# CONSTANTS
MAX_DATAPOINTS = 100000  # determines how big the dataset should get
MODE = "keywords"  # Alternatively supported: "headings"
NUMBER_RANKED_DOCUMENTS = 100   # How many documents should be ranked for each query?
FILE_PATH = get_path("exp/data/008_kategorie_bm25_v2.xlsx") # The path of where to save it
MAX_QUERY_WORDS = 3 # maximum words per query

In [3]:
os.system("""
    docker compose down;
    docker compose up -d --build db;
    sleep 3;
    """)

 Container project_mse-db-1  Stopping
 Container project_mse-db-1  Stopped
 Container project_mse-db-1  Removing
 Container project_mse-db-1  Removed
 Network project_mse_default  Removing
 Network project_mse_default  Removed


#0 building with "desktop-linux" instance using docker driver

#1 [db internal] load build definition from Dockerfile
#1 transferring dockerfile: 122B done
#1 DONE 0.0s

#2 [db internal] load metadata for docker.io/library/postgres:latest
#2 DONE 0.0s

#3 [db internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [db 1/2] FROM docker.io/library/postgres:latest
#4 DONE 0.0s

#5 [db internal] load build context
#5 transferring context: 69B done
#5 DONE 0.0s

#6 [db 2/2] COPY dump.sql /docker-entrypoint-initdb.d/
#6 CACHED

#7 [db] exporting to image
#7 exporting layers done
#7 writing image sha256:4309df92e6a4a6651b884dd8eea6e932d63c0bf202ba691fa3819b60a26bbc2b done
#7 naming to docker.io/library/project_mse-db done
#7 DONE 0.0s


 Network project_mse_default  Creating
 Network project_mse_default  Created
 Container project_mse-db-1  Creating
 Container project_mse-db-1  Created
 Container project_mse-db-1  Starting
 Container project_mse-db-1  Started


0

In [4]:
ranker = RankerFlat()
documentRepository = ranker.documentRepository
tokenizer = ranker.tokenizer

SC: Connected to the db. Now you can go and build the best search engine around!


In [5]:
all_docs = documentRepository.loadAllDocuments()
all_encoded_docs = documentRepository.getEncodedTextOfAllDocuments()

Token indices sequence length is longer than the specified maximum sequence length for this model (1306 > 512). Running this sequence through the model will result in indexing errors


In [None]:
def sample_gumbel(shape, eps=1e-20):
    """Sample standard Gumbel noise of the given shape."""
    U = torch.rand(shape)
    return -torch.log(-torch.log(U + eps) + eps)

def create_own_dataset(all_docs, max_datapoints, mode, n) -> list[QueryResult]:
    """
    Creates a dataset of queries and noisy BM25 rankings from a list of documents.
    
    Parameters
    all_docs: list of documents
    max_datapoints: Maximum number of datapoints to create
    mode: The attribute that you want to take for sampling the queries.
    n: number of ranked documents per query.
    """
    assert mode in ["keywords", "headings"]

    def get_attribute(doc):
        # keywords/headings can be missing, so fall back to an empty list
        attribute = doc.keywords if mode == "keywords" else doc.headings
        return [] if attribute is None else attribute

    all_attributes = []
    results = []
    
    for doc in all_docs:
        all_attributes.extend(get_attribute(doc))
    all_attributes = list(set(all_attributes))
    
    for _ in tqdm(range(max_datapoints)):
        number_of_words_in_query = random.randint(1, MAX_QUERY_WORDS) # each query can contain [1,MAX_QUERY_WORDS] words
        query_list = random.sample(all_attributes, number_of_words_in_query)
        query = " ".join(query_list)
        query_terms = set(query_list)
        # a document is a true positive if it shares at least one attribute with the query
        documents_for_query = [doc for doc in all_docs if query_terms & set(get_attribute(doc))]

        # Ensure exactly n documents: cut off surplus matches, or fill up with
        # negatives sampled without replacement (documents are identified by their URL)
        documents_for_query = documents_for_query[:n]
        if len(documents_for_query) < n:
            selected_urls = {doc.url for doc in documents_for_query}
            negatives = [doc for doc in all_docs if doc.url not in selected_urls]
            documents_for_query.extend(random.sample(negatives, n - len(documents_for_query)))
        
        # Score the selected documents with BM25
        enc_query = tokenizer.encode(query)
        enc_docs = {doc.url: all_encoded_docs[doc.url] for doc in documents_for_query}
        bm25 = ranker.rank_BM25(enc_query, enc_docs)
        
        # Keep the scores in the same order as documents_for_query
        # (assumes rank_BM25 returns a dict keyed by URL, like enc_docs)
        bm25_tensor = torch.tensor([bm25[doc.url] for doc in documents_for_query], dtype=torch.float32)
        # We perturb the BM25 scores with Gumbel noise because we don't want to rely on them fully.
        # The softmax makes the scores of one query sum up to 1
        noise = sample_gumbel(bm25_tensor.shape)
        stochastic_rank = torch.softmax(bm25_tensor + noise, dim=-1)
        
        query_result = QueryResult(query, documents_for_query, stochastic_rank.tolist())
        
        results.append(query_result)
    return results

results = create_own_dataset(all_docs, MAX_DATAPOINTS, MODE, NUMBER_RANKED_DOCUMENTS)

  0%|          | 0/100000 [00:00<?, ?it/s]
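
The fill-up step in `create_own_dataset` can be looked at in isolation. The following is a minimal sketch on plain URL strings rather than document objects; `select_candidates` and the example URLs are hypothetical and only illustrate the idea of keeping the matches first and drawing the missing negatives in a single pass without replacement.

```python
import random


def select_candidates(matching, pool, n, rng):
    """Return exactly n distinct items: matches first, then random negatives."""
    selected = list(matching[:n])  # cut off surplus matches
    chosen = set(selected)
    # Negatives are drawn only from items that are not selected yet,
    # so one draw without replacement is enough.
    negatives = [item for item in pool if item not in chosen]
    missing = n - len(selected)
    if missing > len(negatives):
        raise ValueError("pool is too small to provide n distinct items")
    selected.extend(rng.sample(negatives, missing))
    return selected


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool = [f"https://example.org/page/{i}" for i in range(20)]  # hypothetical URLs
matching = pool[:3]  # pretend these three pages matched the query
candidates = select_candidates(matching, pool, n=10, rng=rng)
```

Raising an error when the pool is too small avoids looping forever if fewer than `n` distinct documents exist.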

In [None]:
# Create a DataFrame for the results
data = [
    {'query': query_result.query, 'urls': [doc.url for doc in query_result.documents], 'scores': query_result.scores}
    for query_result in results
]

df = pd.DataFrame(data)

print(f"Size of dataset: {len(df)}")
df.head()

In [None]:
# Saving the excel
df.to_excel(FILE_PATH, index=False)

print("Excel file saved!")

From here on, we only run sanity checks to verify that the data can be loaded back again.
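
One thing to keep in mind for the reload: a spreadsheet cell cannot hold a Python list, so the `urls` and `scores` columns most likely come back as text. The sketch below assumes that a list is stored as its string representation; `parse_list_cell` and the example values are hypothetical.

```python
import ast

urls = ["https://example.org/a", "https://example.org/b"]  # hypothetical URLs
scores = [0.75, 0.25]

# Simulate what a cell holds after the list was written as text.
cell_urls, cell_scores = str(urls), str(scores)


def parse_list_cell(cell):
    """Turn the text form of a list back into a list; pass real lists through."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


restored_urls = parse_list_cell(cell_urls)
restored_scores = parse_list_cell(cell_scores)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for data read from a file.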

In [None]:
# Let's read the excel file to verify that the dataset can be reconstructed

df_reconstruct = pd.read_excel(FILE_PATH)

print(f"Size of reconstructed dataset: {len(df_reconstruct)}, Is test passed: {len(df) == len(df_reconstruct)}")
df_reconstruct.head()

In [None]:
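import ast

def reconstructDataset(df):
    def parse_list(cell):
        # list columns come back from Excel as strings such as "['a', 'b']"
        return ast.literal_eval(cell) if isinstance(cell, str) else cell

    result = []
    for query, urls, scores in df.values.tolist():
        documents = [documentRepository.getDocumentOfUrl(url) for url in parse_list(urls)]
        result.append(QueryResult(query, documents, parse_list(scores)))
    return result

# Reconstruct from the DataFrame that was read back from the Excel file
to_query_results = reconstructDataset(df_reconstruct)
print(f"Size of reconstructed dataset: {len(to_query_results)}. Is test passed: {len(df) == len(to_query_results)}")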

We want to check that the entries are the same. We cannot assert full equality of the objects, so we check a few fields of the first entry instead.

In [None]:
(results[0].query == to_query_results[0].query
and results[0].documents[0].url == to_query_results[0].documents[0].url
and results[0].scores[0] == to_query_results[0].scores[0])