# How to use the Information Retrieval framework

Learn how to use the retrieval framework and exploit the different information retrieval pipelines.

## Setup

### Load dependencies

In [1]:
from semantic_search import SemanticSearch
from lexical_search import TfIdfSearch
from utils.paths import *
from utils import utils


### Search parameters

Define the pretrained sentence transformer (bi-encoder) used for semantic search.

In [2]:
# Other tested pretrained models are 'paraphrase-distilroberta-base-v1'
# and 'msmarco-distilbert-base-v3'
pretrained_model = "paraphrase-distilroberta-base-v2"

Define the pretrained cross-encoder used for re-ranking.

In [3]:
# Alternative pretrained cross-encoder tested:
# 'cross-encoder/ms-marco-MiniLM-L-6-v2'
pretrained_crossencoder = "cross-encoder/stsb-distilroberta-base"

Define the encoding strategy. The encoding strategy must be a string containing the names of the features to include in the encoder input, separated by underscores ('_'). For example, to encode the title and the overview, `encoding_strategy` must be either `title_overview` or `overview_title`. Currently supported features are 'title', 'authors' and 'overview'. For further information, have a look at the `utils.utils.prepare_input_encoder` function.

In [4]:
encoding_strategy = "title_overview"
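As an illustration of the format described above, the sketch below splits a strategy string into feature names and joins the matching fields of one record. `build_encoder_input`, the `record` dictionary and the `". "` separator are hypothetical; the framework's own logic lives in `utils.utils.prepare_input_encoder`, which is not shown here and may behave differently.

```python
# Hypothetical illustration only; not the framework's prepare_input_encoder.
SUPPORTED_FEATURES = ("title", "authors", "overview")


def build_encoder_input(record, encoding_strategy, sep=". "):
    """Join the requested fields of `record` in the order given by the strategy."""
    features = encoding_strategy.split("_")
    unknown = [name for name in features if name not in SUPPORTED_FEATURES]
    if unknown:
        raise ValueError(f"Unsupported features: {unknown}")
    return sep.join(str(record[name]) for name in features)


record = {
    "title": "An Unquiet Mind",
    "authors": "Kay Redfield Jamison",
    "overview": "The personal memoir of a manic depressive.",
}
print(build_encoder_input(record, "title_overview"))
```

Because the order of the names is preserved, `title_overview` and `overview_title` select the same fields but produce different input strings in this sketch.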

Define the number of trees to use in the ANNOY index.

In [5]:
n_trees = 576

Define the summarization strategy. Using the top 5 or top 4 sentences is recommended. Use an empty string (`''`) to skip summarization.

In [6]:
summarization = "top5sent"  # alternatives: "top4sent" or "" (no summarization)

Load the corpus from disk. The loaded corpus must be consistent with the summarization strategy you wish to use (e.g., for the 'top5sent' strategy, the dataset to load is 'books_processed_top5sent.csv').

In [7]:
corpus = utils.load_corpus(PATH_BOOKS_TOP5S)

Define the file path used to store the ANNOY index on disk and to load it back.

In [8]:
path_annoy_cache = (
    f"{DIR_ANNOY}{pretrained_model}/t{n_trees}_{summarization}_{encoding_strategy}.ann"
)

Define the file path used to store and load the embeddings computed for bi-encoder evaluation.

In [9]:
# If the directory does not exist, it will be automatically created.
path_embs_cache = (
    f"{DIR_EMBEDDINGS}{pretrained_model}/{summarization}_{encoding_strategy}.pkl"
)
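The cache paths in this section follow a load-or-compute pattern: reuse the file if it exists, otherwise compute the result and save it, creating missing directories on the way. The framework's own caching code is not shown in this notebook, so the sketch below is only an assumption of how such a cache could work; `load_or_compute` and `fake_embeddings` are hypothetical names.

```python
import os
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the pickled object at `cache_path`, computing and saving it if absent."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    value = compute()
    # Create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(value, fh)
    return value


def fake_embeddings():
    # Stand-in for an expensive encoding step.
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "some-model", "top5sent_title_overview.pkl")
    first = load_or_compute(demo_path, fake_embeddings)   # computed and saved
    second = load_or_compute(demo_path, fake_embeddings)  # read back from disk
    print(first == second)
```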

Define the file path used to store and load the vectors computed for TF-IDF evaluation.

In [10]:
# If the directory does not exist, it will be automatically created.
vectors_cache_path = f"{DIR_EMBEDDINGS}tfidf/{summarization}_{encoding_strategy}.pkl"

Define a helper function to print search results.

In [11]:
def print_results(results, search_title="Search results."):
    print(search_title)
    if isinstance(results, str):
        print(results)
    else:
        for result in results:
            print(result)

### Define your queries and $k$

In [12]:
# List of queries written in natural language.
queries = [
    "Depressive young people",
    "Best cooking recipes",
    "Adventurous people getting in trouble",
    "Popular and award-winning book",
    "History through war, land conquering and the evolution of the society",
    "Outstanding scientific discoveries",
    "Human theories of evolution",
    "Wonderful animals in the wild",
    "Beautiful places to visit",
    "Travel around the world",
    "Looking for happiness and wellness",
    "Ancient tribes in Africa",
    "How technology and internet are changing our lives",
    "Loss of faith in new generations",
    "The importance of God",
]

Define the default number of most relevant documents to retrieve, $k$.

In [13]:
k = 5

## Semantic search

Instantiate a SemanticSearch object with the desired parameters.

In [15]:
semantic_search = SemanticSearch(
    corpus,
    path_embs_cache=path_embs_cache,
    encoding_strategy=encoding_strategy,
    path_annoy_cache=path_annoy_cache,
)


### Standard semantic search

Use Bi-Encoder retrieval. Time complexity is $\mathcal{O}(n)$, where $n$ is the number of documents in the corpus.
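The linear cost comes from scoring the query against every document. A minimal pure-Python sketch of that exhaustive scan, assuming cosine similarity over precomputed embeddings; `top_k` and the toy two-dimensional vectors are hypothetical, and the framework's actual scoring code is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def top_k(query_emb, doc_embs, k):
    """Score every document (one pass over the corpus) and keep the k best."""
    scored = [(cosine(query_emb, emb), idx) for idx, emb in enumerate(doc_embs)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy embeddings standing in for the encoded corpus.
doc_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for idx, score in top_k([1.0, 0.1], doc_embs, k=2):
    print(idx, round(score, 4))
```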

In [16]:
results = semantic_search.search(*queries, k=k)

print_results(results, "Standard semantic search")

2025-03-14 16:13:24,612 - [7736] - search - [biencoder] - INFO - Loading Bi-Encoder paraphrase-distilroberta-base-v2
2025-03-14 16:13:25,905 - [7736] - search - [embeddings] - INFO - Retrieving embeddings from disk for Bi-Encoder paraphrase-distilroberta-base-v2  with encoding strategy 'title_overview'
2025-03-14 16:13:26,008 - [7736] - search - [_get_embeddings] - INFO - Fetched pre-computed embeddings from embeddings/paraphrase-distilroberta-base-v2/top5sent_title_overview.pkl


Standard semantic search

Query: Depressive young people
Top 5 most similar books in corpus:
Title: Reviving Ophelia: Saving the Selves of Adolescent Girls -- (Score: 0.5015) (Goodreads Id: 159760)
Authors: Mary Pipher
Overview: 1 New York Times Bestseller The groundbreaking work that poses one of the most provocative questions of a generation: what is happening to the selves of adolescent girls? As a therapist, Mary Pipher was becoming frustrated with the growing problems among adolescent girls. Why had these lovely and promising human beings fallen prey to depression, eating disorders, suicide attempts, and crushingly low self-esteem? The answer hit a nerve with Pipher, with parents, and with the girls themselves. They were losing their resiliency and optimism in a girl-poisoning culture that propagated values at odds with those necessary to survive.


Title: An Unquiet Mind: A Memoir of Moods and Madness -- (Score: 0.4618) (Goodreads Id: 361459)
Authors: Kay Redfield Jamison
Overvie

### Sublinear dense retrieval using ANN

Time complexity is $\mathcal{O}(\log n)$.
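To give an intuition for why a tree-based index avoids scanning the whole corpus, the sketch below builds a small forest of random-hyperplane trees and searches only the leaves a query falls into. It is a simplified stand-in, not ANNOY's actual algorithm or API; every name in it is hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def build_tree(ids, vectors, rng, leaf_size=4):
    """Recursively split items with random hyperplanes through the origin."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    normal = [rng.gauss(0.0, 1.0) for _ in range(len(vectors[ids[0]]))]
    left = [i for i in ids if dot(vectors[i], normal) < 0]
    right = [i for i in ids if dot(vectors[i], normal) >= 0]
    if not left or not right:  # degenerate split: keep everything in one leaf
        return {"leaf": ids}
    return {
        "normal": normal,
        "left": build_tree(left, vectors, rng, leaf_size),
        "right": build_tree(right, vectors, rng, leaf_size),
    }


def candidates(tree, query):
    """Follow a single root-to-leaf path; the work depends on the tree depth."""
    while "leaf" not in tree:
        tree = tree["left"] if dot(query, tree["normal"]) < 0 else tree["right"]
    return tree["leaf"]


def ann_search(forest, vectors, query, k):
    """Union the leaves reached in every tree, then rank that small pool exactly."""
    pool = {i for tree in forest for i in candidates(tree, query)}
    ranked = sorted(pool, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]


rng = random.Random(0)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
forest = [build_tree(list(range(len(vectors))), vectors, rng) for _ in range(10)]
print(ann_search(forest, vectors, vectors[42], k=3))
```

When the splits are reasonably balanced, each lookup visits a number of nodes proportional to the tree depth rather than to the corpus size. Using more trees enlarges the candidate pool, which tends to improve recall at the cost of a larger index; this is the trade-off controlled by `n_trees` earlier in this notebook.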

In [17]:
# Semantic search using ANNOY.
results = semantic_search.search(*queries, k=k, use_annoy=True)

print_results(results, "Semantic search using ANNOY.")


2025-03-14 14:28:24,833 - [10256] - search - [annoy] - INFO - Loading Annoy Index.


2025-03-14 14:28:24,837 - [10256] - search - [get_annoy_index] - INFO - Loading ANNOY index from disk


Semantic search using ANNOY.

Query: Depressive young people
Top 5 most similar books in corpus:
Title: Reviving Ophelia: Saving the Selves of Adolescent Girls -- (Score: 0.5015) (Goodreads Id: 159760)
Authors: Mary Pipher
Overview: 1 New York Times Bestseller The groundbreaking work that poses one of the most provocative questions of a generation: what is happening to the selves of adolescent girls? As a therapist, Mary Pipher was becoming frustrated with the growing problems among adolescent girls. Why had these lovely and promising human beings fallen prey to depression, eating disorders, suicide attempts, and crushingly low self-esteem? The answer hit a nerve with Pipher, with parents, and with the girls themselves. They were losing their resiliency and optimism in a girl-poisoning culture that propagated values at odds with those necessary to survive.


Title: An Unquiet Mind: A Memoir of Moods and Madness -- (Score: 0.4618) (Goodreads Id: 361459)
Authors: Kay Redfield Jamison
Ove

### Bi-encoder retrieval, Cross-encoder re-ranking
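This pipeline retrieves candidates with the bi-encoder and re-orders them with the cross-encoder. The general retrieve-then-rerank pattern can be sketched with two stand-in scoring functions. Neither is a real model: `overlap` and `jaccard` only play the roles of the cheap first stage and the more expensive re-ranker, and `pool_size` is a hypothetical parameter, since the number of candidates the framework passes to the re-ranker is not shown here.

```python
def tokens(text):
    return set(text.lower().split())


def overlap(query, doc):
    """Stand-in first-stage score: number of shared words."""
    return len(tokens(query) & tokens(doc))


def jaccard(query, doc):
    """Stand-in re-ranking score: shared words divided by all distinct words."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query, docs, k, pool_size,
                         first_stage=overlap, reranker=jaccard):
    """Keep the `pool_size` best first-stage hits, then re-order them."""
    pool = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:pool_size]
    return sorted(pool, key=lambda d: reranker(query, d), reverse=True)[:k]


docs = [
    "sad teenagers and depression",
    "depression in young adults",
    "young people coping with depression and grief in a long story",
    "travel guide",
]
query = "young people depression"
for doc in retrieve_then_rerank(query, docs, k=2, pool_size=3):
    print(doc)
```

The re-ranker only sees the candidate pool, so a document missed by the first stage cannot be recovered later; the second stage can only change the order of what was retrieved.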

In [None]:
# Semantic search using Cross-encoder re-ranking.
results = semantic_search.search(*queries, k=k, reranking=True)

print_results(results, "Semantic search using Cross-encoder re-ranking.")

2025-03-14 14:31:06,313 - [10256] - search - [crossencoder] - INFO - Loading Cross-Encoder cross-encoder/stsb-distilroberta-base



Semantic search using Cross-encoder re-ranking.

Query: Depressive young people
Top 5 most similar books in corpus:
Title: Girl, Interrupted -- (Score: 0.5414) (Goodreads Id: 68783)
Authors: Susanna Kaysen
Overview: Searing captures an exquisite range of self-awareness between madness and insight. First published in 1994. In the late 1960s, the author spent nearly two years on the ward for teenage girls at McLean Hospital, a renowned psychiatric facility. Her memoir encompasses horror and razor-edged perceptions, while providing vivid portraits of her fellow patients and their keepers. Searing captures an exquisite range of self-awareness between madness and insight.


Title: The Defining Decade: Why Your Twenties Matter--And How to Make the Most of Them Now -- (Score: 0.5000) (Goodreads Id: 13523061)
Authors: Meg Jay
Overview: Our thirty-is-the-new-twenty culture tells us the twentysomething years don't matter. Others call them an emerging adulthood. Dr. Meg Jay, a clinical psycholog

### Sublinear Bi-encoder retrieval, Cross-encoder re-ranking

In [19]:
# Semantic search using ANNOY and Cross-encoder re-ranking.
results = semantic_search.search(*queries, k=k, use_annoy=True, reranking=True)

print_results(results, "Semantic search using ANNOY and Cross-encoder re-ranking.")

Semantic search using ANNOY and Cross-encoder re-ranking.

Query: Depressive young people
Top 5 most similar books in corpus:
Title: Girl, Interrupted -- (Score: 0.5414) (Goodreads Id: 68783)
Authors: Susanna Kaysen
Overview: Searing captures an exquisite range of self-awareness between madness and insight. First published in 1994. In the late 1960s, the author spent nearly two years on the ward for teenage girls at McLean Hospital, a renowned psychiatric facility. Her memoir encompasses horror and razor-edged perceptions, while providing vivid portraits of her fellow patients and their keepers. Searing captures an exquisite range of self-awareness between madness and insight.


Title: Pedagogy of the Oppressed -- (Score: 0.5249) (Goodreads Id: 72657)
Authors: Paulo Freire, Myra Bergman Ramos, Donaldo Macedo, Richard Shaull
Overview: First published in Portuguese in 1968, Pedagogy of the Oppressed was translated and published in English in 1970. The methodology of the late Paulo Freire

## Lexical search

Instantiate a TfIdfSearch object with the desired parameters.


In [20]:
tfidf_search = TfIdfSearch(
    corpus,
    vectors_cache_path=vectors_cache_path,
    encoding_strategy=encoding_strategy,
    path_embs_cache=path_embs_cache
)

2025-03-14 14:35:57,611 - [10256] - search - [_get_vectors] - INFO - Loading precomputed vectors from disk


### Standard lexical search

Use TF-IDF retrieval. Time complexity is $\mathcal{O}(n)$, where $n$ is the number of documents in the corpus.
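For readers unfamiliar with TF-IDF, here is a minimal pure-Python sketch of lexical retrieval: documents and query become sparse term-weight vectors and are compared by cosine similarity. Whitespace tokenization, the smoothed IDF formula and the name `tfidf_search` are assumptions made for illustration; the `TfIdfSearch` class may tokenize and weight terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def tfidf_search(query, docs, k):
    """Score the query against every document and return the k best (index, score)."""
    idf = fit_idf(docs)
    doc_vecs = [tfidf_vector(doc, idf) for doc in docs]
    query_vec = tfidf_vector(query, idf)
    scored = sorted(((cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)),
                    reverse=True)
    return [(i, score) for score, i in scored[:k]]


toy_docs = [
    "memoir of a manic depressive",
    "best cooking recipes",
    "travel around the world",
]
print(tfidf_search("depressive memoir", toy_docs, k=2))
```

Only exact word matches contribute to the score, which is why lexical search can miss documents that describe the same topic with different vocabulary.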

In [21]:
# Standard lexical search
results = tfidf_search.search(*queries, k=k)

print_results(results, 'Results for standard lexical search')

Results for standard lexical search

Query: Depressive young people
Top 5 most similar books in corpus:
Title: An Unquiet Mind: A Memoir of Moods and Madness -- (Score: 0.2689) (Goodreads Id: 361459)
Authors: Kay Redfield Jamison
Overview: The personal memoir of a manic depressive and an authority on the subject describes the onset of the illness during her teenage years and her determined journey through the realm of available treatments.


Title: Personal History -- (Score: 0.1898) (Goodreads Id: 95420)
Authors: Katharine Graham
Overview: In lieu of an unrevealing Famous-People-I-Have-Known autobiography, the owner of the Washington Post has chosen to be remarkably candid about the insecurities prompted by remote parents and a difficult marriage to the charismatic, manic-depressive Phil Graham, who ran the newspaper her father acquired. Katharine's account of her years as subservient daughter and wife is so painful that by the time she finally asserts herself at the Post following Ph

### TF-IDF retrieval, Bi-encoder re-ranking

Hybrid search with TF-IDF retrieval and Bi-encoder re-ranking.

In [26]:
results = tfidf_search.search(*queries, k=k, reranking_strategy='biencoder')

print_results(results, 'Results for hybrid search with TF-IDF retrieval and Bi-encoder re-ranking')

Results for hybrid search with TF-IDF retrieval and Bi-encoder re-ranking

Query: Depressive young people
Top 5 most similar books in corpus:
Title: An Unquiet Mind: A Memoir of Moods and Madness -- (Score: 0.4618) (Goodreads Id: 361459)
Authors: Kay Redfield Jamison
Overview: The personal memoir of a manic depressive and an authority on the subject describes the onset of the illness during her teenage years and her determined journey through the realm of available treatments.


Title: Personal History -- (Score: 0.2824) (Goodreads Id: 95420)
Authors: Katharine Graham
Overview: In lieu of an unrevealing Famous-People-I-Have-Known autobiography, the owner of the Washington Post has chosen to be remarkably candid about the insecurities prompted by remote parents and a difficult marriage to the charismatic, manic-depressive Phil Graham, who ran the newspaper her father acquired. Katharine's account of her years as subservient daughter and wife is so painful that by the time she finally as

### TF-IDF retrieval, Cross-encoder re-ranking

Hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.

In [25]:
results = tfidf_search.search(*queries, k=k, reranking_strategy='crossencoder')

print_results(results, 'Results for hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.')

Results for hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.

Query: Depressive young people
Top 5 most similar books in corpus:
Title: The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change -- (Score: 0.5194) (Goodreads Id: 36072)
Authors: Stephen R. Covey
Overview: When Stephen Covey first released The Seven Habits of Highly Effective People, the book became an instant rage because people suddenly got up and took notice that their lives were headed off in the wrong direction; and more than that, they realized that there were so many simple things they could do in order to navigate their life correctly. But not everyone understands Stephen Covey’s model fully well, or maybe there are some people who haven’t read it yet. They do not realize that this book contains life-changing information. There are hidden implications in this book, yes, and a lot of people have just failed to see through them. We are trying to show you how Covey’s book, or rathe