# Text Retrieval and Mining - Data Science Minor - Week 4

* Text Retrieval and Mining, BSc BAN, 2023-2024
* Author: [Julien Rossi](mailto:j.rossi@uva.nl)



# Document Retrieval in Action



In [None]:
!pip install accelerate
!pip install beir


In [None]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")


`corpus` is a dictionary of dictionaries:
* Each key of `corpus` is a Document ID
* The value associated with a key is a dictionary
* In this nested dictionary, the keys are:
  * `text`: the full text of the document
  * `title`: the title of the document



`queries` is a dictionary:
* Each key of `queries` is a Query ID
* The value associated with the key is the full text of the query



`qrels` is a dictionary of dictionaries:
* Each key of `qrels` is a Query ID
* The value is a dictionary
* In this nested dictionary:
  * keys are Document IDs
  * values are relevance judgements: `1` for a relevant document, `0` for a non-relevant document
* To keep the data compact, only the relevant documents are listed; any document of the corpus that is not listed for a query is treated as non-relevant for that query
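
Because `qrels` lists only the relevant documents, a lookup for any other document has to fall back to `0`. A minimal sketch on toy data; the IDs and the helper `relevance_of` are made up for illustration and are not part of BEIR:

```python
from typing import Dict

# Toy qrels in the same shape as the BEIR ones: {query_id: {doc_id: relevance}}
toy_qrels: Dict[str, Dict[str, int]] = {"q1": {"d7": 1}, "q2": {"d2": 1, "d9": 1}}


def relevance_of(qrels: Dict[str, Dict[str, int]], query_id: str, doc_id: str) -> int:
    """Return the judged relevance, treating unlisted documents as non-relevant (0)."""
    return qrels.get(query_id, {}).get(doc_id, 0)


print(relevance_of(toy_qrels, "q1", "d7"))  # listed for q1, so relevant
print(relevance_of(toy_qrels, "q1", "d2"))  # not listed for q1, so non-relevant
```

Using `.get(..., 0)` on both levels avoids a `KeyError` for unjudged queries or documents.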

In [None]:
type(corpus)

dict

In [None]:
list(corpus.keys())[:5]

['4983', '5836', '7912', '18670', '19238']

In [None]:
corpus["4983"]

{'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 versus 1.1 microm2/ms). Relative anisotropy was higher the closer birth was to term with 

In [None]:
list(queries.keys())[:5]

['1', '3', '5', '13', '36']

In [None]:
queries["1"]

'0-dimensional biomaterials show inductive properties.'

In [None]:
qrels["1"]

{'31715818': 1}

Putting it all together:
* Set `query_id` to a Query ID and see which documents of the corpus are relevant for it

In [None]:
query_id = "42"

print(f"Query: {queries[query_id]}")
print("Relevant Docs:")
for doc_id, relevance in qrels[query_id].items():
    if relevance == 1:
        print(f"  Doc #{doc_id:>10}: {corpus[doc_id]['text']}")

Query: A high microerythrocyte count raises vulnerability to severe anemia in homozygous alpha (+)- thalassemia trait subjects.
Relevant Docs:
  Doc #  18174210: BACKGROUND The heritable haemoglobinopathy alpha(+)-thalassaemia is caused by the reduced synthesis of alpha-globin chains that form part of normal adult haemoglobin (Hb). Individuals homozygous for alpha(+)-thalassaemia have microcytosis and an increased erythrocyte count. Alpha(+)-thalassaemia homozygosity confers considerable protection against severe malaria, including severe malarial anaemia (SMA) (Hb concentration < 50 g/l), but does not influence parasite count. We tested the hypothesis that the erythrocyte indices associated with alpha(+)-thalassaemia homozygosity provide a haematological benefit during acute malaria. METHODS AND FINDINGS Data from children living on the north coast of Papua New Guinea who had participated in a case-control study of the protection afforded by alpha(+)-thalassaemia against severe malari

Every model below returns its search results as a dictionary of dictionaries:
* Each key of `results` is a Query ID
* The value is a dictionary
* In this nested dictionary:
  * Keys are Document IDs
  * Values are floats: the relevance score of this document for the query, according to the model
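
To make the link between `results` and `qrels` concrete, here is a small sketch that computes Recall@K and Precision@K by hand on toy data. The helper names and the toy IDs are assumptions for illustration; the notebook itself relies on BEIR's `EvaluateRetrieval`, which may differ in details such as tie-breaking and rounding.

```python
from typing import Dict, List, Tuple


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Document IDs of the k highest-scoring documents, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def recall_precision_at_k(
    results: Dict[str, Dict[str, float]],
    qrels: Dict[str, Dict[str, int]],
    k: int,
) -> Tuple[float, float]:
    """Average Recall@k and Precision@k over the queries that have relevant documents."""
    recalls, precisions = [], []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue
        retrieved = top_k(results.get(query_id, {}), k)
        hits = sum(1 for doc_id in retrieved if doc_id in relevant)
        recalls.append(hits / len(relevant))   # share of relevant documents that were found
        precisions.append(hits / k)            # share of the k returned documents that are relevant
    n = len(recalls) or 1
    return sum(recalls) / n, sum(precisions) / n


toy_results = {
    "q1": {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    "q2": {"d1": 0.2, "d2": 0.8, "d3": 0.5},
}
toy_qrels = {"q1": {"d2": 1}, "q2": {"d2": 1, "d3": 1}}

for k in (1, 2):
    print(k, recall_precision_at_k(toy_results, toy_qrels, k))
```

With few relevant documents per query, Precision@K necessarily shrinks as K grows, while Recall@K can only increase.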

# Generic Model

In [None]:
from abc import ABC
from abc import abstractmethod
from collections import defaultdict
from dataclasses import dataclass
from dataclasses import field
from typing import Dict
from typing import List
from typing import Tuple

from beir.retrieval.evaluation import EvaluateRetrieval
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity


K_VALUES = [1, 5, 10, 50, 100, 200, 500, 800, 1000]


@dataclass
class Metrics:
    ndcg: Dict[str, float]
    map_: Dict[str, float]
    recall: Dict[str, float]
    precision: Dict[str, float]


@dataclass
class Search(ABC):
    corpus: Dict[str, Dict[str, str]] = field(repr=False)

    @abstractmethod
    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        """Retrieves relevant documents for queries.

        Params:
            * queries (dict): queries to run, with key=Query ID and value=query text

        Returns:
            * results (dict): for each query, a dictionary with key=Document ID and value=relevance score
        """
        ...

    def evaluate(
        self,
        queries: Dict[str, str],
        qrels: Dict[str, Dict[str, int]],
        k_values: List[int]
    ) -> Metrics:
        """Evaluates retrieval.

        Search relevant documents for queries and evaluate NDCG@K, MAP@K, Recall@K and Precision@K, based on qrels.

        Params:
            * queries (dict): set of queries
            * qrels (dict): qrels corresponding to the queries / corpus
            * k_values (list): list of cut-offs K at which each metric is computed

        Returns:
            * metrics (Metrics)
        """
        results = self.search(queries=queries)
        evaluator = EvaluateRetrieval()
        ndcg, _map, recall, precision = evaluator.evaluate(
            qrels=qrels,
            results=results,
            k_values=k_values
        )

        return Metrics(ndcg=ndcg, map_=_map, recall=recall, precision=precision)

# Bag of Words Model

In [None]:
@dataclass
class BoWSearch(Search):
    """Document Retrieval with Bag of Words."""
    vectorizer: CountVectorizer = field(init=False, repr=False)
    corpus_vect: csr_matrix = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Create the vectorizer, learn the vocabulary
        self.vectorizer = CountVectorizer(lowercase=True)
        self.corpus_vect = self.vectorizer.fit_transform([x["text"] for x in self.corpus.values()])
        print(f"Vocabulary size: {len(self.vectorizer.get_feature_names_out())}")

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        """Retrieves relevant documents for queries.

        Params:
            * queries (dict): queries to run, with key=Query ID and value=query text

        Returns:
            * results (dict): for each query, a dictionary with key=Document ID and value=relevance score
        """
        # Vectorize the queries
        queries_vect = self.vectorizer.transform(queries.values())

        # Get similarity matrix
        sim_matrix = cosine_similarity(queries_vect, self.corpus_vect)

        # Build result dictionary
        results = defaultdict(dict)
        for i, query_id in enumerate(queries):
            for j, doc_id in enumerate(self.corpus):
                # in similarity matrix: Sim_i,j = cos(query_i, document_j)
                results[query_id][doc_id] = sim_matrix[i, j]

        return results
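
The same idea can be sketched without scikit-learn: count tokens, then compare the count vectors with the cosine. This is only an illustration under simplifying assumptions: the helpers `bow` and `cosine` are hypothetical, and unlike `CountVectorizer.transform`, the sketch does not discard query tokens that are missing from the corpus vocabulary (this changes the query norm, not the ranking of documents for a given query).

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercase the text and count word tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


toy_docs = {
    "d1": "White matter in the brain",
    "d2": "Malaria and anaemia in children",
}
toy_query = bow("brain white matter development")

toy_scores = {doc_id: cosine(toy_query, bow(text)) for doc_id, text in toy_docs.items()}
print(toy_scores)
```

A document that shares no token with the query gets a score of exactly 0, which is why pure Bag of Words cannot match synonyms or spelling variants.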


In [None]:
bows = BoWSearch(corpus=corpus)
bow_metrics = bows.evaluate(
    queries=queries,
    qrels=qrels,
    k_values=K_VALUES
)

Vocabulary size: 35497


In [None]:
print(bow_metrics)

Metrics(ndcg={'NDCG@1': 0.27667, 'NDCG@5': 0.32897, 'NDCG@10': 0.35102, 'NDCG@50': 0.37626, 'NDCG@100': 0.38643, 'NDCG@200': 0.39734, 'NDCG@500': 0.40444, 'NDCG@800': 0.40835, 'NDCG@1000': 0.41117}, map_={'MAP@1': 0.26944, 'MAP@5': 0.30989, 'MAP@10': 0.31861, 'MAP@50': 0.32425, 'MAP@100': 0.32524, 'MAP@200': 0.32584, 'MAP@500': 0.32602, 'MAP@800': 0.32608, 'MAP@1000': 0.32611}, recall={'Recall@1': 0.26944, 'Recall@5': 0.37583, 'Recall@10': 0.44333, 'Recall@50': 0.55361, 'Recall@100': 0.61333, 'Recall@200': 0.68944, 'Recall@500': 0.74833, 'Recall@800': 0.783, 'Recall@1000': 0.81033}, precision={'P@1': 0.27667, 'P@5': 0.08067, 'P@10': 0.048, 'P@50': 0.0122, 'P@100': 0.00687, 'P@200': 0.00385, 'P@500': 0.00167, 'P@800': 0.0011, 'P@1000': 0.00091})


## Store Results

In [None]:
import pandas as pd

In [None]:
import re

def results_to_pandas(metrics: Dict[str, float]) -> pd.DataFrame:
    pattern = re.compile(r"(?P<metric>\w+)@(?P<cut>\d+)")
    data = []
    for k, v in metrics.items():
        m = pattern.match(k)
        metric = m.group("metric")
        cut = int(m.group("cut"))
        data.append({"metric": metric, "cut": cut, "value": v})

    x = pd.DataFrame(data)
    return x


In [None]:
bow_df = results_to_pandas(bow_metrics.recall)
bow_df["model"] = "BOW"
bow_df

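| | metric | cut | value | model |
|---|---|---|---|---|
| 0 | Recall | 1 | 0.26944 | BOW |
| 1 | Recall | 5 | 0.37583 | BOW |
| 2 | Recall | 10 | 0.44333 | BOW |
| 3 | Recall | 50 | 0.55361 | BOW |
| 4 | Recall | 100 | 0.61333 | BOW |
| 5 | Recall | 200 | 0.68944 | BOW |
| 6 | Recall | 500 | 0.74833 | BOW |
| 7 | Recall | 800 | 0.783 | BOW |
| 8 | Recall | 1000 | 0.81033 | BOW |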


In [None]:
results_df = bow_df

## Visualize

In [None]:
import plotly.express as px

In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")

# Optimize BOW

In [None]:
from nltk.stem import PorterStemmer

from typing import Callable


def passthru_tokenizer() -> Callable[[str], List[str]]:
    tokenizer = CountVectorizer().build_tokenizer()
    def _passthru(text: str) -> List[str]:
        return tokenizer(text)
    return _passthru


def stemmer_tokenizer() -> Callable[[str], List[str]]:
    _tokenizer = CountVectorizer().build_tokenizer()
    _ps = PorterStemmer()
    def _stemmer(text: str) -> List[str]:
        tokens = _tokenizer(text)
        return [_ps.stem(token) for token in tokens]
    return _stemmer


@dataclass
class ConfiguredBoWSearch(Search):
    """Document Retrieval with a configurable Bag of Words (n-grams, document-frequency cut-offs, optional stemming)."""
    ngram_range: Tuple[int, int]
    max_df: float
    min_df: int
    stemmed: bool

    vectorizer: CountVectorizer = field(init=False, repr=False)
    corpus_vect: csr_matrix = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Create the vectorizer, learn the vocabulary

        self.vectorizer = CountVectorizer(
            lowercase=True,
            ngram_range=self.ngram_range,
            max_df=self.max_df,
            min_df=self.min_df,
            tokenizer=stemmer_tokenizer() if self.stemmed else passthru_tokenizer()
        )
        self.corpus_vect = self.vectorizer.fit_transform([x["text"] for x in self.corpus.values()])
        print(f"Vocabulary size: {len(self.vectorizer.get_feature_names_out())}")

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        """Retrieves relevant documents for queries.

        Params:
            * queries (dict): queries to run, with key=Query ID and value=query text

        Returns:
            * results (dict): for each query, a dictionary with key=Document ID and value=relevance score
        """
        # Vectorize the queries
        queries_vect = self.vectorizer.transform(queries.values())

        # Get similarity matrix
        sim_matrix = cosine_similarity(queries_vect, self.corpus_vect)

        # Build result dictionary
        results = defaultdict(dict)
        for i, query_id in enumerate(queries):
            for j, doc_id in enumerate(self.corpus):
                # in similarity matrix: Sim_i,j = cos(query_i, document_j)
                results[query_id][doc_id] = sim_matrix[i, j]

        return results


In [None]:
from itertools import product

dfs = [bow_df]
for min_df, max_df, stemmed, ngram_range in product([2], [0.9], [True, False], [(1,1), (1,2), (1,3)]):
    model = ConfiguredBoWSearch(corpus=corpus, min_df=min_df, max_df=max_df, ngram_range=ngram_range, stemmed=stemmed)
    model_name = str(model)
    metrics = model.evaluate(queries=queries, qrels=qrels, k_values=K_VALUES)
    df = results_to_pandas(metrics.recall)
    df["model"] = model_name
    dfs.append(df)

results_df = pd.concat(dfs, ignore_index=True)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'

Vocabulary size: 13003
Vocabulary size: 105680
Vocabulary size: 175848
Vocabulary size: 18508
Vocabulary size: 112060
Vocabulary size: 176127


In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")
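
The vocabulary sizes printed above grow quickly as `ngram_range` widens, because every run of consecutive tokens becomes a feature of its own. A minimal sketch of n-gram extraction; `word_ngrams` is a hypothetical helper, and scikit-learn additionally applies the `min_df` / `max_df` filters, which this sketch ignores:

```python
from typing import List, Tuple


def word_ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """All n-grams with min_n <= n <= max_n, each joined with single spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "cerebral white matter development".split()
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    grams = word_ngrams(tokens, ngram_range)
    print(ngram_range, len(set(grams)), grams)
```

Even on four tokens the number of distinct features more than doubles from `(1, 1)` to `(1, 3)`; most of those longer n-grams are rare, which is what `min_df` is meant to prune.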

# TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

@dataclass
class TFIDFSearch(Search):
    """Document Retrieval with TF-IDF."""
    vectorizer: TfidfVectorizer = field(init=False)
    corpus_vect: csr_matrix = field(init=False)

    def __post_init__(self) -> None:
        # Create the vectorizer, learn the vocabulary
        self.vectorizer = TfidfVectorizer(lowercase=True)
        self.corpus_vect = self.vectorizer.fit_transform([x["text"] for x in self.corpus.values()])
        print(f"Vocabulary size: {len(self.vectorizer.get_feature_names_out())}")

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        """Retrieves relevant documents for queries.

        Params:
            * queries (dict): queries to run, with key=Query ID and value=query text

        Returns:
            * results (dict): for each query, a dictionary with key=Document ID and value=relevance score
        """
        # Vectorize the queries
        queries_vect = self.vectorizer.transform(queries.values())

        # Get similarity matrix
        sim_matrix = cosine_similarity(queries_vect, self.corpus_vect)

        # Build result dictionary
        results = defaultdict(dict)
        for i, query_id in enumerate(queries):
            for j, doc_id in enumerate(self.corpus):
                # in similarity matrix: Sim_i,j = cos(query_i, document_j)
                results[query_id][doc_id] = sim_matrix[i, j]

        return results


In [None]:
tfidf = TFIDFSearch(corpus=corpus)
tfidf_metrics = tfidf.evaluate(
    queries=queries,
    qrels=qrels,
    k_values=K_VALUES
)

Vocabulary size: 35497
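
The vocabulary is the same as for plain Bag of Words; only the weights change. As a rough illustration, the sketch below reproduces the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalisation) on a toy, already tokenized corpus. The corpus and the helper names are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List

toy_docs: List[List[str]] = [
    ["malaria", "anaemia", "children"],
    ["malaria", "parasite", "count"],
    ["white", "matter", "brain"],
]

n_docs = len(toy_docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in toy_docs for term in set(doc))


def smoothed_idf(term: str) -> float:
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed form used by default in scikit-learn."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(doc: List[str]) -> Dict[str, float]:
    """TF-IDF weights of one document, L2-normalised."""
    weights = {term: tf * smoothed_idf(term) for term, tf in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(smoothed_idf("malaria"), smoothed_idf("brain"))
print(tfidf(toy_docs[0]))
```

Terms that appear in many documents ("malaria" here) receive a lower weight than terms that are specific to one document, so they influence the cosine similarity less.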


In [None]:
df = results_to_pandas(tfidf_metrics.recall)
df["model"] = "TFIDF"
dfs.append(df)
results_df = pd.concat(dfs, ignore_index=True)

In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")

# BM25


$$\textrm{Standard: } \textrm{IDF}(q_i) = \log \left( \frac{N}{n(q_i)} \right), \quad \textrm{Okapi: } \textrm{IDF}(q_i) = \log \left( \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1 \right)$$
$$\textrm{BM25}(D, Q) = \sum_{i=1}^{n}\textrm{IDF}(q_i) \times \frac{\textrm{TF}(q_i, D) \times (k_1 + 1)}{\textrm{TF}(q_i, D) + k_1 \times \left( 1 - b + b \times \frac{|D|}{\textrm{avgdl}} \right) }$$

* $Q$ is a query made of terms $q_i$
* $D$ is a document
* $\textrm{IDF}(q_i)$ is the inverse document frequency of the term $q_i$ in the corpus
* $\textrm{TF}(q_i, D)$ is the term frequency of $q_i$ in the document $D$
* $N$ is the number of documents in the corpus
* $n(q_i)$ is the number of documents in the corpus that contain the term $q_i$
* $|D|$ is the number of terms in document $D$
* $\textrm{avgdl}$ is the average $|D|$ for documents in the corpus
* $k_1$ is a hyperparameter (typically $k_1 \in [1.2, 2.0]$)
* $b$ is a hyperparameter (typically $b = 0.75$)
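
The formula can be evaluated by hand on a tiny corpus. The sketch below is a plain-Python illustration that uses the standard IDF, $\log(N / n(q_i))$; the function name `bm25_scores` and the toy documents are assumptions for the example, and the defaults $k_1 = 1.6$, $b = 0.75$ are just one common choice.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.6, b: float = 0.75) -> List[float]:
    """BM25 score of every document for a tokenized query, with IDF = log(N / n(q_i))."""
    n_docs = len(docs)
    avgdl = sum(len(doc) for doc in docs) / n_docs
    # n(q_i): number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # absent terms contribute nothing (this also avoids n(q_i) = 0)
            idf = math.log(n_docs / df[term])
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    ["malaria", "anaemia", "children"],
    ["malaria", "malaria", "parasite", "count", "in", "blood"],
    ["white", "matter", "brain"],
]
toy_scores = bm25_scores(["malaria", "anaemia"], toy_docs)
print(toy_scores)
```

The first document wins because it matches both query terms and is shorter than average; repeating "malaria" in the second document helps only a little, since the term-frequency factor saturates as $\textrm{TF}$ grows.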


In [None]:
""" Implementation of Okapi BM25 with sklearn's TfidfVectorizer
Distributed as CC-0 (https://creativecommons.org/publicdomain/zero/1.0/)
"""

import numpy as np
from scipy.sparse import csc_matrix

@dataclass
class BM25Search(Search):
    b: float = 0.75
    k1: float = 1.6

    vectorizer: TfidfVectorizer = field(init=False, repr=False)
    X: csc_matrix = field(init=False, repr=False)
    len_X: np.ndarray = field(init=False, repr=False)  # length (token count) of each document
    avdl: float = field(init=False)

    def __post_init__(self):
        corpus_texts = [x["text"] for x in self.corpus.values()]
        self.vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
        self.vectorizer.fit(corpus_texts)
        X: csr_matrix = super(TfidfVectorizer, self.vectorizer).transform(corpus_texts) # get counts
        self.X = X.tocsc()
        self.len_X = self.X.sum(1).A1
        self.avdl = self.X.sum(1).mean()

    def score(self, query: str) -> np.ndarray:
        """Calculates BM25 between query q and corpus."""
        b, k1, avdl = self.b, self.k1, self.avdl

        # apply CountVectorizer (class TfidfVectorizer inherits CountVectorizer)
        q, = super(TfidfVectorizer, self.vectorizer).transform([query])

        X = self.X[:, q.indices]
        denom = X + (k1 * (1 - b + b * self.len_X / self.avdl))[:, None]

        # See https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
        # with smooth_idf=False, sklearn computes idf(t) = log [ n / df(t) ] + 1, so subtract 1 to recover idf(t) = log [ n / df(t) ]
        idf = self.vectorizer._tfidf.idf_[None, q.indices] - 1.
        numer = X.multiply(np.broadcast_to(idf, X.shape)) * (k1 + 1)
        return (numer / denom).sum(1).A1   # shape (DOCS,)

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        # Document IDs, in the same order as the rows of the score vector
        doc_ids = [k for k in self.corpus]

        results = defaultdict(dict)
        for query_id, query in queries.items():
            # For each query, get the BM25 score for all documents in corpus
            scores = self.score(query)
            for doc_id, score in zip(doc_ids, scores):
                results[query_id][doc_id] = score

        return results


In [None]:
bm25 = BM25Search(corpus=corpus)
bm25_metrics = bm25.evaluate(
    queries=queries,
    qrels=qrels,
    k_values=K_VALUES
)

In [None]:
df = results_to_pandas(bm25_metrics.recall)
df["model"] = "BM25"
dfs.append(df)
results_df = pd.concat(dfs, ignore_index=True)

In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")

# Neural Network - LAPRADOR

In [None]:
import sentence_transformers
from sentence_transformers import SentenceTransformer

In [None]:
import numpy as np
import torch

In [None]:
class InferenceDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        encodings
    ):
        self.encodings = encodings

    def __len__(self) -> int:
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        return item


@dataclass
class LapradorSearch(Search):
    laprador: SentenceTransformer = field(init=False, repr=False)
    corpus_vect: np.ndarray = field(init=False, repr=False)

    def __post_init__(self):
        self.laprador =  SentenceTransformer("canwenxu/laprador")
        texts = [x["text"] for x in self.corpus.values()]
        self.corpus_vect = self.laprador.encode(texts, batch_size=32, show_progress_bar=True)

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        # Get queries vectors
        query_texts = list(queries.values())
        queries_vect = self.laprador.encode(query_texts, batch_size=32, show_progress_bar=True)

        # Get similarity matrix
        sim_matrix = cosine_similarity(queries_vect, self.corpus_vect)

        # Build result dictionary
        results = defaultdict(dict)
        for i, query_id in enumerate(queries):
            for j, doc_id in enumerate(self.corpus):
                # in similarity matrix: Sim_i,j = cos(query_i, document_j)
                results[query_id][doc_id] = float(sim_matrix[i, j])

        return results
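
Dense retrieval compares embedding vectors instead of token counts, but the scoring step is the same cosine. A small sketch showing that the cosine equals the dot product of L2-normalised vectors, which is why embeddings are often normalised once and then compared with a plain matrix product. The three-dimensional vectors are made-up stand-ins for real sentence embeddings.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical embeddings of one query and two documents
query_vec = [1.0, 2.0, 0.0]
doc_vecs: Dict[str, List[float]] = {"d1": [2.0, 4.0, 0.0], "d2": [0.0, 0.0, 3.0]}

q_unit = l2_normalize(query_vec)
toy_scores = {doc_id: dot(q_unit, l2_normalize(vec)) for doc_id, vec in doc_vecs.items()}
print(toy_scores)
```

Unlike Bag of Words, two texts with no token in common can still obtain a high score here, as long as the encoder maps them to nearby vectors.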



In [None]:
lapr = LapradorSearch(corpus=corpus)




In [None]:
laprador_metrics = lapr.evaluate(
    queries=queries,
    qrels=qrels,
    k_values=K_VALUES
)


In [None]:
df = results_to_pandas(laprador_metrics.recall)
df["model"] = "LAPRADOR"
dfs.append(df)
results_df = pd.concat(dfs, ignore_index=True)

In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")

# CONCLUSION

## Recall@10

In [None]:
# Focus on Recall@10
CUT = 10
name = f"Recall@{CUT}"

focus = results_df[results_df["cut"] == CUT]
fig = px.bar(focus, y="model", x="value", color="model", labels={'value': name})
fig.update_layout(yaxis={'categoryorder':'total ascending'})

## Recall@200

In [None]:
# Focus on Recall@200
CUT = 200
name = f"Recall@{CUT}"

focus = results_df[results_df["cut"] == CUT]
fig = px.bar(focus, y="model", x="value", color="model", labels={'value': name})
fig.update_layout(yaxis={'categoryorder':'total ascending'})

## Precision@1

In [None]:
precs = [bow_metrics.precision, tfidf_metrics.precision, bm25_metrics.precision, laprador_metrics.precision]
models = ["BOW", "TFIDF", "BM25", "LAPRADOR"]

prec_dfs = [results_to_pandas(res) for res in precs]
for df, model in zip(prec_dfs, models):
    df["model"] = model

precs_df = pd.concat(prec_dfs, ignore_index=True)

In [None]:
# Focus on Precision@1
CUT = 1
name = f"Precision@{CUT}"

focus = precs_df[precs_df["cut"] == CUT]
fig = px.bar(focus, y="model", x="value", color="model", labels={'value': name})
fig.update_layout(yaxis={'categoryorder':'total ascending'})

## NDCG@10

In [None]:
ndcgs = [bow_metrics.ndcg, tfidf_metrics.ndcg, bm25_metrics.ndcg, laprador_metrics.ndcg]
models = ["BOW", "TFIDF", "BM25", "LAPRADOR"]

ndcg_dfs = [results_to_pandas(res) for res in ndcgs]
for df, model in zip(ndcg_dfs, models):
    df["model"] = model

ndcgs_df = pd.concat(ndcg_dfs, ignore_index=True)

In [None]:
# Focus on NDCG@10
CUT = 10
name = f"NDCG@{CUT}"

focus = ndcgs_df[ndcgs_df["cut"] == CUT]
fig = px.bar(focus, y="model", x="value", color="model", labels={'value': name})
fig.update_layout(yaxis={'categoryorder':'total ascending'})
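
For reference, NDCG@K rewards relevant documents more the higher they are ranked. The sketch below uses the common formulation with a gain of `rel / log2(rank + 1)`; the helper names and toy data are assumptions, and the notebook's actual numbers come from BEIR's evaluator.

```python
import math
from typing import Dict, List


def dcg(gains: List[int]) -> float:
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(scores: Dict[str, float], judged: Dict[str, int], k: int) -> float:
    """NDCG@k for one query: DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    gains = [judged.get(doc_id, 0) for doc_id in ranked]
    ideal_gains = sorted(judged.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


toy_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.1}
toy_judged = {"d2": 1}

print(ndcg_at_k(toy_scores, toy_judged, k=1))  # relevant document is not ranked first
print(ndcg_at_k(toy_scores, toy_judged, k=3))  # relevant document found at rank 2
```

With binary judgements, NDCG@1 is 1 exactly when the top-ranked document is relevant, which is consistent with `NDCG@1` and `P@1` being equal in the Bag of Words metrics printed earlier.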

# Neural Network - Custom Made

In [None]:
# HUG_MODEL = "allenai/longformer-base-4096"
HUG_MODEL = "distilbert-base-cased"

In [None]:
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments


In [None]:
_, train_queries, train_qrels = GenericDataLoader(data_folder=data_path).load(split="train")


In [None]:
import random
import tqdm

class SciFactDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        hug_model: str,
        corpus: Dict[str, Dict[str, str]],
        queries: Dict[str, str],
        qrels: Dict[str, Dict[str, int]]
    ):
        # Setup tokenizer
        tokenizer = AutoTokenizer.from_pretrained(hug_model)

        # Create training data: 1 sample = "query + document" + label
        # For each query: 1 positive sample and up to 10 randomly drawn negative samples
        doc_ids = list(corpus.keys())
        samples = []
        for query_id, qrel in tqdm.tqdm(qrels.items()):
            relevant_doc_id = list(qrel.keys())[0]  # only the first relevant document is used as the positive
            sample = {"query": queries[query_id], "doc": corpus[relevant_doc_id]["text"], "label": 1}
            samples.append(sample)
            # Draw 10 documents (with replacement) and drop every document judged relevant
            # for this query, so that no relevant document ends up labelled as a negative
            other_docs_ids = [
                doc_id for doc_id in random.choices(doc_ids, k=10)
                if qrel.get(doc_id, 0) == 0
            ]

            for other_doc_id in other_docs_ids:
                sample = {"query": queries[query_id], "doc": corpus[other_doc_id]["text"], "label": 0}
                samples.append(sample)

        texts_a = [s["query"] for s in samples]
        texts_b = [s["doc"] for s in samples]
        labels = [s["label"] for s in samples]

        self.encodings = tokenizer(
            text=texts_a,
            text_pair=texts_b,
            truncation="only_second",
            padding="max_length",
            return_tensors="pt"
        )
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)



In [None]:
train_dataset = SciFactDataset(hug_model=HUG_MODEL, corpus=corpus, queries=train_queries, qrels=train_qrels)

100%|██████████| 809/809 [00:00<00:00, 70723.91it/s]


In [None]:
test_dataset = SciFactDataset(hug_model=HUG_MODEL, corpus=corpus, queries=queries, qrels=qrels)

100%|██████████| 300/300 [00:00<00:00, 47800.15it/s]


In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


In [None]:
!rm -rf ./results
!rm -rf ./logs

In [None]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    learning_rate=1e-5,
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=50,
    evaluation_strategy="epoch",
    fp16=True,
    do_train=True,
    do_eval=True
)

model = AutoModelForSequenceClassification.from_pretrained(HUG_MODEL)


trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,            # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.2294,0.173699,0.945758,0.604857,0.895425,0.456667
2,0.1204,0.105871,0.970303,0.840909,0.81962,0.863333
3,0.0622,0.114115,0.97303,0.853377,0.843648,0.863333


TrainOutput(global_step=1671, training_loss=0.17571186095493985, metrics={'train_runtime': 537.9872, 'train_samples_per_second': 49.624, 'train_steps_per_second': 3.106, 'total_flos': 3536482141919232.0, 'train_loss': 0.17571186095493985, 'epoch': 3.0})
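
The accuracy in the table above looks high from the first epoch, yet recall is below 0.5. The reason is class imbalance: roughly one positive for ten negatives. The sketch below recomputes the epoch 1 row from confusion counts. These counts are not logged by the notebook; they are inferred here under the assumption of 300 positives and 3000 negatives in the evaluation set (300 queries, 1 positive and 10 negatives each), and they happen to reproduce the reported figures.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class, from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Inferred counts (assumption): 300 positives, 3000 negatives
epoch1 = binary_metrics(tp=137, fp=16, fn=163, tn=2984)
print({name: round(value, 6) for name, value in epoch1.items()})

# A model that always answers "not relevant" would already be right on every negative
always_negative_accuracy = 3000 / (300 + 3000)
print(round(always_negative_accuracy, 4))
```

Under that assumption, accuracy alone says little here; precision, recall and F1 on the positive class are the informative columns.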

In [None]:
inference_args = TrainingArguments(
    output_dir='./results',          # output directory
    per_device_eval_batch_size=32,   # batch size for evaluation
    fp16=True,
    do_predict=True
)

inferencer = Trainer(
    model=model,                         # the fine-tuned 🤗 Transformers model, used here for prediction only
    args=inference_args,                 # inference arguments, defined above
)

In [None]:
from scipy.special import softmax
from transformers import PreTrainedModel
from transformers import PreTrainedTokenizerFast
from transformers.trainer_utils import PredictionOutput

class InferenceDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        queries: Dict[str, str],                  # {"query_id": "text", "query_id2": "text2", ...}
        queries_docs: Dict[str, Dict[str, str]],  # {"query_id": {"doc_id": "text", "doc_id2": "text2", ...}, ...}
        tokenizer: PreTrainedTokenizerFast,
    ):
        dataset = [{"query": queries[query_id], "doc": query_docs[doc_id]} for query_id, query_docs in queries_docs.items() for doc_id in query_docs]
        self.len_data = len(dataset)
        texts_a = [s["query"] for s in dataset]
        texts_b = [s["doc"] for s in dataset]

        print("Encoding dataset")
        self.encodings = tokenizer(
            text=texts_a,
            text_pair=texts_b,
            truncation="only_second",
            padding="max_length",
            return_tensors="pt"
        )

    def __len__(self) -> int:
        return self.len_data

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        return item


@dataclass
class RelevanceModelSearch(Search):
    model: PreTrainedModel = field(repr=False)
    tokenizer: PreTrainedTokenizerFast = field(repr=False)
    bm25_top_k: int

    inferencer: Trainer = field(init=False, repr=False)
    bm25: BM25Search = field(init=False, repr=False)

    def __post_init__(self) -> None:
        inference_args = TrainingArguments(
            output_dir='./results',          # output directory
            per_device_eval_batch_size=32,   # batch size for evaluation
            fp16=True,
            do_predict=True
        )

        self.inferencer = Trainer(
            model=self.model,                    # the fine-tuned 🤗 Transformers model, used here for prediction only
            args=inference_args,                 # inference arguments, defined above
        )

        self.bm25 = BM25Search(self.corpus)

    def search(self, queries: Dict[str, str]) -> Dict[str, Dict[str, float]]:
        print("First stage: BM25")
        first_stage = self.bm25.search(queries)
        print("BM25 done...")

        filtered_first_stage = {}
        # Retain only the TOP-K scored document per query
        for query_id, docs_scores in first_stage.items():
            filtered_docs = dict(sorted(docs_scores.items(), key=lambda item: item[1], reverse=True)[:self.bm25_top_k])
            filtered_first_stage[query_id] = filtered_docs

        queries_docs = {query_id: {doc_id: self.corpus[doc_id]["text"] for doc_id in query_docs} for query_id, query_docs in filtered_first_stage.items()}
        ds = InferenceDataset(queries=queries, queries_docs=queries_docs, tokenizer=self.tokenizer)
        predictionoutput: PredictionOutput = self.inferencer.predict(ds)
        logits = predictionoutput.predictions    # shape (#QUERY/DOCS, 2)
        probs = softmax(logits, axis=1)
        scores = probs[:, 1]             # score = probability of the positive (relevant) class

        qid_docid = [(query_id, doc_id) for query_id, query_docs in queries_docs.items() for doc_id in query_docs]
        results = defaultdict(dict)
        for (query_id, doc_id), score in zip(qid_docid, scores):
            results[query_id][doc_id] = float(score)

        return results
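
The two-stage logic above can be summarised in a few lines: keep the top-K candidates of a cheap first stage, then score only those with the expensive classifier. In this sketch the first-stage scores and the classifier logits are hypothetical stand-ins (a lookup table replaces the fine-tuned model), and `rerank` is an illustrative helper, not part of the notebook.

```python
import math
from typing import Callable, Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rerank(
    first_stage: Dict[str, float],
    top_k: int,
    classify: Callable[[str], List[float]],
) -> Dict[str, float]:
    """Keep the top_k first-stage documents and score them with P(relevant) from the classifier."""
    candidates = sorted(first_stage, key=first_stage.get, reverse=True)[:top_k]
    return {doc_id: softmax(classify(doc_id))[1] for doc_id in candidates}


# Hypothetical first-stage (BM25-like) scores and classifier logits [not relevant, relevant]
first_stage = {"d1": 12.0, "d2": 9.5, "d3": 0.3}
fake_logits = {"d1": [1.0, -1.0], "d2": [-2.0, 2.0], "d3": [-3.0, 3.0]}

reranked = rerank(first_stage, top_k=2, classify=fake_logits.__getitem__)
print(reranked)
```

Note that `d3` never reaches the classifier even though it would have scored highest: a document dropped by the first stage cannot be recovered, so with `bm25_top_k=100` the Recall@100 of the two-stage model cannot exceed that of BM25.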


In [None]:
relevance = RelevanceModelSearch(
    corpus=corpus,
    model=model,
    tokenizer=AutoTokenizer.from_pretrained(HUG_MODEL),
    bm25_top_k=100
)

In [None]:
nn_metrics = relevance.evaluate(
    queries=queries,
    qrels=qrels,
    k_values=[k for k in K_VALUES if k <= 100]
)

First stage: BM25
BM25 done...
Encoding dataset


In [None]:
print(nn_metrics)

Metrics(ndcg={'NDCG@1': 0.41, 'NDCG@5': 0.53716, 'NDCG@10': 0.5762, 'NDCG@50': 0.60088, 'NDCG@100': 0.6029}, map_={'MAP@1': 0.39039, 'MAP@5': 0.49456, 'MAP@10': 0.5117, 'MAP@50': 0.51795, 'MAP@100': 0.51817}, recall={'Recall@1': 0.39039, 'Recall@5': 0.64794, 'Recall@10': 0.76406, 'Recall@50': 0.86689, 'Recall@100': 0.87856}, precision={'P@1': 0.41, 'P@5': 0.142, 'P@10': 0.08467, 'P@50': 0.0196, 'P@100': 0.00993})


In [None]:
del dfs[-1]

In [None]:
df = results_to_pandas(nn_metrics.recall)
df["model"] = "2-Stage BM25 + NN"
dfs.append(df)
results_df = pd.concat(dfs, ignore_index=True)

In [None]:
px.line(results_df, x="cut", y="value", color="model", symbol="model")