# How to use the Information Retrieval framework

Learn how to use the retrieval framework and exploit the different information retrieval pipelines.

## Setup

### Load dependencies

In [23]:
from semantic_search import SemanticSearch
from lexical_search import TfIdfSearch
from utils.paths import DIR_ANNOY, DIR_EMBEDDINGS, PATH_BOOKS_TOP5S
from utils import utils

### Search parameters

Define the pretrained sentence transformer used for semantic search.

In [2]:
# Other tested pretrained models are 'paraphrase-distilroberta-base-v1'
# and 'msmarco-distilbert-base-v3'
pretrained_model = "paraphrase-distilroberta-base-v2"

Define the pretrained cross-encoder used for re-ranking.

In [3]:
# Another tested pretrained cross-encoder is
# 'cross-encoder/ms-marco-MiniLM-L-6-v2'
pretrained_crossencoder = "cross-encoder/stsb-distilroberta-base"

Define the encoding strategy. The encoding strategy must be a string containing the names of the features to include in the encoder input, separated by underscores ('_'). For example, to encode the title and the overview, `encoding_strategy` must be either `title_overview` or `overview_title`. Currently supported features are 'title', 'authors' and 'overview'. For further information, have a look at the `utils.utils.prepare_input_encoder` function.

In [4]:
encoding_strategy = "title_overview"
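
As a rough illustration of how such a strategy string could be turned into encoder input, the sketch below splits the string on underscores and concatenates the matching fields of a book record. The function name `build_encoder_input`, the record layout, the toy overview text, and the `". "` separator are all assumptions made for this example; the actual behaviour is defined by `utils.utils.prepare_input_encoder`.

```python
SUPPORTED_FEATURES = ("title", "authors", "overview")


def build_encoder_input(book, encoding_strategy, separator=". "):
    """Concatenate the fields named in `encoding_strategy`, in the given order.

    Hypothetical helper for illustration; not part of the framework.
    """
    features = encoding_strategy.split("_")
    unknown = [name for name in features if name not in SUPPORTED_FEATURES]
    if unknown:
        raise ValueError(f"Unsupported features: {unknown}")
    return separator.join(str(book[name]) for name in features)


# Toy record; the overview text is a placeholder, not corpus data.
book = {
    "title": "The Way to Cook",
    "authors": "Julia Child",
    "overview": "Toy overview text",
}

print(build_encoder_input(book, "title_overview"))
print(build_encoder_input(book, "overview_title"))
```

The order of the feature names determines the order of the fields in the encoder input, which is why `title_overview` and `overview_title` are both valid but not interchangeable.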

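Define the number of trees to use in the ANNOY index. More trees generally give more accurate results at the cost of a larger index and a longer build time.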

In [5]:
n_trees = 576

Define the summarization strategy. Using the top 5 or top 4 sentences is recommended. Use an empty string (`''`) to skip summarization.

In [6]:
summarization = "top5sent"  # Alternatives: "top4sent", or "" for no summarization

Load the corpus from disk. The loaded corpus must be consistent with the summarization strategy you chose (e.g., for the 'top5sent' strategy, the dataset to load is 'books_processed_top5sent.csv').

In [7]:
corpus = utils.load_corpus(PATH_BOOKS_TOP5S)

Define the file path where the ANNOY index is stored on, and loaded from, disk.

In [8]:
path_annoy_cache = (
    f"{DIR_ANNOY}{pretrained_model}/t{n_trees}_{summarization}_{encoding_strategy}.ann"
)

Define the file path where the embeddings computed for bi-encoder evaluation are stored on, and loaded from, disk.

In [9]:
# If the directory does not exist, it will be automatically created.
path_embs_cache = (
    f"{DIR_EMBEDDINGS}{pretrained_model}/{summarization}_{encoding_strategy}.pkl"
)

Define the file path where the vectors computed for TF-IDF evaluation are stored on, and loaded from, disk.

In [10]:
# If the directory does not exist, it will be automatically created.
vectors_cache_path = f"{DIR_EMBEDDINGS}tfidf/{summarization}_{encoding_strategy}.pkl"
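
To make the resulting locations concrete, the sketch below expands the three path templates and creates a missing parent directory. The `embeddings/` prefix is consistent with the path logged by the searches further down in this notebook; the `annoy/` prefix is only a placeholder, since the real values come from `utils.paths`. The temporary directory is there to keep the example free of side effects, and the `mkdir` call is one plausible way to get the "created automatically" behaviour, not necessarily what the framework does.

```python
import tempfile
from pathlib import Path

# Placeholder values; in the notebook these come from utils.paths.
DIR_ANNOY = "annoy/"            # assumption
DIR_EMBEDDINGS = "embeddings/"  # consistent with the logged embeddings path

pretrained_model = "paraphrase-distilroberta-base-v2"
n_trees = 576
summarization = "top5sent"
encoding_strategy = "title_overview"

path_annoy_cache = (
    f"{DIR_ANNOY}{pretrained_model}/t{n_trees}_{summarization}_{encoding_strategy}.ann"
)
path_embs_cache = (
    f"{DIR_EMBEDDINGS}{pretrained_model}/{summarization}_{encoding_strategy}.pkl"
)
vectors_cache_path = f"{DIR_EMBEDDINGS}tfidf/{summarization}_{encoding_strategy}.pkl"

for path in (path_annoy_cache, path_embs_cache, vectors_cache_path):
    print(path)

# Create the parent directory of a cache file if it is missing.
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / path_embs_cache
    target.parent.mkdir(parents=True, exist_ok=True)
    created = target.parent.is_dir()

print(created)
```

Because the model name, summarization strategy, and encoding strategy are all part of the path, changing any of them points the framework at a different cache file rather than silently reusing stale vectors.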

Define a helper function to print search results.

In [11]:
def print_results(results, search_title="Search results."):
    print(search_title)
    if isinstance(results, str):
        print(results)
    else:
        for result in results:
            print(result)

### Define your queries and $k$

In [12]:
# List of queries written in natural language.
queries = [
    # "Depressive young people",
    "Best cooking recipes",
    # "Adventurous people getting in trouble",
    # "Popular and award-winning book",
    # "History through war, land conquering and the evolution of the society",
    # "Outstanding scientific discoveries",
    # "Human theories of evolution",
    # "Wonderful animals in the wild",
    # "Beautiful places to visit",
    "Travel around the world",
    # "Looking for happiness and wellness",
    # "Ancient tribes in Africa",
    # "How technology and internet are changing our lives",
    # "The importance of God",
]

Define the default number of most relevant documents to retrieve, $k$.

In [13]:
k = 3

## Semantic search

Instantiate a SemanticSearch object with the desired parameters.

In [14]:
semantic_search = SemanticSearch(
    corpus,
    path_embs_cache=path_embs_cache,
    encoding_strategy=encoding_strategy,
    path_annoy_cache=path_annoy_cache,
)


### Standard semantic search

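Use bi-encoder retrieval with an exhaustive scan: the query embedding is compared against every document embedding, so time complexity is $\mathcal{O}(n)$ in the corpus size $n$.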

In [15]:
results = semantic_search.search(*queries, k=k)

print_results(results, "Standard semantic search")

2025-03-16 13:11:58,468 - [17780] - search - [biencoder] - INFO - Loading Bi-Encoder paraphrase-distilroberta-base-v2
2025-03-16 13:12:03,513 - [17780] - search - [embeddings] - INFO - Retrieving embeddings from disk for Bi-Encoder paraphrase-distilroberta-base-v2  with encoding strategy 'title_overview'
2025-03-16 13:12:03,613 - [17780] - search - [_get_embeddings] - INFO - Fetched pre-computed embeddings from embeddings/paraphrase-distilroberta-base-v2/top5sent_title_overview.pkl


Standard semantic search

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: The Taste of Home Cookbook -- (Score: 0.5897) (Goodreads Id: 3885)
Authors: Janet Briggs, Beth Wittlinger
Overview: More than 1,200 recipes including more than 135 light recipes are compiled in this sturdy five-ring binder.


Title: The Way to Cook -- (Score: 0.5722) (Goodreads Id: 132688)
Authors: Julia Child
Overview: In this magnificent new cookbook, illustrated with full color throughout, Julia Child give us her magnum opus the distillation of a lifetime of cooking. In this spirit, Julia has conceived her most creative and instructive cookbook, blending classic techniques with free-style American cooking and with added emphasis on lightness, freshness, and simpler preparations. Breaking with conventional organization, she structures the chapters from Soups to Cakes & Cookies around master recipes, giving all the reassuring details that she is so good at and grouping the recipes accordin
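
The linear cost of the standard search comes from comparing the query embedding against every document embedding. Below is a minimal sketch of that exhaustive scan, using toy three-dimensional vectors in place of real sentence embeddings. The vectors and the `top_k_cosine` helper are illustrative and are not the framework's implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_cosine(query_vec, doc_vecs, k):
    """Score every document (n comparisons) and keep the k best."""
    scored = ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return heapq.nlargest(k, scored)


# Toy embeddings; real sentence embeddings have hundreds of dimensions.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.1, 0.0]

for score, idx in top_k_cosine(query_vec, doc_vecs, k=2):
    print(f"doc {idx}: {score:.4f}")
```

Every query touches all `n` document vectors, which is exact but scales linearly with the corpus.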

### Sublinear dense retrieval using ANN

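Use approximate nearest-neighbour (ANN) retrieval with the ANNOY index. Query time complexity is roughly $\mathcal{O}(\log n)$, at the cost of possibly missing some true nearest neighbours.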

In [16]:
# Semantic search using ANNOY.
results = semantic_search.search(*queries, k=k, use_annoy=True)

print_results(results, "Semantic search using ANNOY.")


2025-03-16 13:12:03,767 - [17780] - search - [annoy] - INFO - Loading Annoy Index.
2025-03-16 13:12:03,769 - [17780] - search - [get_annoy_index] - INFO - Loading ANNOY index from disk


Semantic search using ANNOY.

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: The Taste of Home Cookbook -- (Score: 0.5897) (Goodreads Id: 3885)
Authors: Janet Briggs, Beth Wittlinger
Overview: More than 1,200 recipes including more than 135 light recipes are compiled in this sturdy five-ring binder.


Title: The Way to Cook -- (Score: 0.5722) (Goodreads Id: 132688)
Authors: Julia Child
Overview: In this magnificent new cookbook, illustrated with full color throughout, Julia Child give us her magnum opus the distillation of a lifetime of cooking. In this spirit, Julia has conceived her most creative and instructive cookbook, blending classic techniques with free-style American cooking and with added emphasis on lightness, freshness, and simpler preparations. Breaking with conventional organization, she structures the chapters from Soups to Cakes & Cookies around master recipes, giving all the reassuring details that she is so good at and grouping the recipes acco
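
To give an intuition for where the logarithmic query cost comes from, here is a toy single-tree index loosely in the spirit of ANNOY: the space is split recursively by hyperplanes placed between two randomly chosen points, and a query only descends one root-to-leaf path before ranking the few items in that leaf. This is a simplified sketch, not the ANNOY library or its API; real ANNOY builds a forest of `n_trees` such trees and searches them together. All function names and the random data are assumptions made for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def build_tree(ids, vectors, rng, leaf_size=8):
    """Recursively split `ids` with hyperplanes placed between two random points."""
    if len(ids) <= leaf_size:
        return ids  # leaf node: a plain list of item ids
    a, b = rng.sample(ids, 2)
    normal = [x - y for x, y in zip(vectors[a], vectors[b])]
    midpoint = [(x + y) / 2 for x, y in zip(vectors[a], vectors[b])]
    offset = dot(normal, midpoint)
    left = [i for i in ids if dot(normal, vectors[i]) <= offset]
    right = [i for i in ids if dot(normal, vectors[i]) > offset]
    if not left or not right:  # degenerate split, e.g. duplicate points
        return ids
    return (
        normal,
        offset,
        build_tree(left, vectors, rng, leaf_size),
        build_tree(right, vectors, rng, leaf_size),
    )


def query(tree, vectors, query_vec, k):
    """Follow one root-to-leaf path, then rank only the items in that leaf."""
    node = tree
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are lists
        normal, offset, left, right = node
        node = right if dot(normal, query_vec) > offset else left
    ranked = sorted(node, key=lambda i: cosine(query_vec, vectors[i]), reverse=True)
    return ranked[:k]


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
tree = build_tree(list(range(len(vectors))), vectors, rng)

# Querying with an indexed vector returns that item first.
print(query(tree, vectors, vectors[42], k=3))
```

A single tree can miss neighbours that fall on the other side of a split, which is why using many trees, as set by `n_trees`, improves recall.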

### Bi-encoder retrieval, Cross-encoder re-ranking

In [17]:
# Semantic search using Cross-encoder re-ranking.
results = semantic_search.search(*queries, k=k, reranking=True)

print_results(results, "Semantic search using Cross-encoder re-ranking.")

2025-03-16 13:12:04,145 - [17780] - search - [crossencoder] - INFO - Loading Cross-Encoder cross-encoder/stsb-distilroberta-base


Semantic search using Cross-encoder re-ranking.

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: How to Cook Everything: Simple Recipes for Great Food -- (Score: 0.7228) (Goodreads Id: 603204)
Authors: Mark Bittman
Overview: Great Food Made Simple Here's the breakthrough one-stop cooking reference for today's generation of cooks! Nationally known cooking authority Mark Bittman shows you how to prepare great food for all occasions using simple techniques, fresh ingredients, and basic kitchen equipment. Just as important, How to Cook Everything takes a relaxed, straightforward approach to cooking, so you can enjoy yourself in the kitchen and still achieve outstanding results.


Title: The Joy of Cooking -- (Score: 0.6943) (Goodreads Id: 327847)
Authors: Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker
Overview: Since its original publication, Joy of Cooking has been the most authoritative cookbook in America, the one upon which millions of cooks have confiden
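
Re-ranking follows a two-stage pattern: a cheap scorer selects a candidate pool, then a more expensive scorer reorders only that pool. The sketch below shows the control flow with two toy scoring functions standing in for the bi-encoder and the cross-encoder. The names `retrieve_then_rerank` and `pool_size`, both scorers, and the toy documents are assumptions made for illustration; they say nothing about how `SemanticSearch` sizes its candidate pool or scores pairs.

```python
def retrieve_then_rerank(query, docs, first_stage, second_stage, k, pool_size):
    """Keep the `pool_size` best docs by `first_stage`, then the top `k` by `second_stage`."""
    pool = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:pool_size]
    return sorted(pool, key=lambda d: second_stage(query, d), reverse=True)[:k]


def word_overlap(query, doc):
    """Toy first stage: number of distinct words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def ordered_overlap(query, doc):
    """Toy second stage: number of query bigrams that also occur in the document."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)


docs = [
    "recipes best for cooking",
    "the best cooking recipes",
    "travel guide to the best beaches",
    "gardening basics",
]

results = retrieve_then_rerank(
    "best cooking recipes",
    docs,
    first_stage=word_overlap,
    second_stage=ordered_overlap,
    k=2,
    pool_size=3,
)
print(results)
```

The first two documents tie in the first stage, and only the second stage, which looks at query and document jointly, moves the better match to the top. This mirrors why the re-ranked results above differ from those of the bi-encoder alone.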

### Sublinear Bi-encoder retrieval, Cross-encoder re-ranking

In [18]:
# Semantic search using ANNOY and Cross-encoder re-ranking.
results = semantic_search.search(*queries, k=k, use_annoy=True, reranking=True)

print_results(results, "Semantic search using ANNOY and Cross-encoder re-ranking.")

Semantic search using ANNOY and Cross-encoder re-ranking.

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: How to Cook Everything: Simple Recipes for Great Food -- (Score: 0.7228) (Goodreads Id: 603204)
Authors: Mark Bittman
Overview: Great Food Made Simple Here's the breakthrough one-stop cooking reference for today's generation of cooks! Nationally known cooking authority Mark Bittman shows you how to prepare great food for all occasions using simple techniques, fresh ingredients, and basic kitchen equipment. Just as important, How to Cook Everything takes a relaxed, straightforward approach to cooking, so you can enjoy yourself in the kitchen and still achieve outstanding results.


Title: The Joy of Cooking -- (Score: 0.6943) (Goodreads Id: 327847)
Authors: Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker
Overview: Since its original publication, Joy of Cooking has been the most authoritative cookbook in America, the one upon which millions of cooks hav

## Lexical search

Instantiate a TfIdfSearch object with the desired parameters.


In [19]:
tfidf_search = TfIdfSearch(
    corpus,
    vectors_cache_path=vectors_cache_path,
    encoding_strategy=encoding_strategy,
    path_embs_cache=path_embs_cache
)

2025-03-16 13:12:18,083 - [17780] - search - [_get_vectors] - INFO - Loading precomputed vectors from disk


### Standard lexical search

Use TF-IDF retrieval: documents are ranked by the similarity between their TF-IDF vectors and the query's. Time complexity is $\mathcal{O}(n)$.

In [20]:
# Standard lexical search
results = tfidf_search.search(*queries, k=k)

print_results(results, 'Results for standard lexical search')

Results for standard lexical search

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: Better Homes and Gardens New Cook Book   -- (Score: 0.4852) (Goodreads Id: 411053)
Authors: Better Homes and Gardens
Overview: Features: Over 900 new recipes 1,200 in all-reflect current eating habits and lifestyles; 500 new photographs over 700 in all-including 60 percent more of finished food than the last edition; Dozens of new recipes offer ethnic flavours, fresh ingredients, or vegetarian appeal; Many recipes feature make-ahead directions or quick-to-the-table meals; New chapter provides recipes for crockery cookers; Efficient, easy-to-read format, with recipes categorised into 21 chapters, each thoroughly indexed for easy reference; Expanded chapter on cooking basics includes advice on food safety, menu planning, table setting, and make-ahead cooking, plus a thorough glossary on ingredients and techniques; Appliance-friendly recipes help cooks save time and creatively use n
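
For readers unfamiliar with lexical retrieval, the sketch below ranks a toy corpus by cosine similarity between TF-IDF vectors. The whitespace tokenizer, the `log(N / df) + 1` weighting, the helper names, and the three example documents are assumptions made for this example; `TfIdfSearch` may tokenize and weight terms differently.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one sparse {term: weight} vector per document, plus the idf table."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs, k):
    """Rank all documents against the query and return the k best (score, index) pairs."""
    tokenized = [doc.lower().split() for doc in docs]
    vectors, idf = tfidf_vectors(tokenized)
    # Query terms that never occur in the corpus carry no weight and are dropped.
    query_vec = {
        term: tf * idf[term]
        for term, tf in Counter(query.lower().split()).items()
        if term in idf
    }
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(vectors)),
        reverse=True,
    )
    return scored[:k]


docs = [
    "quick recipes for weeknight cooking",
    "a history of world travel",
    "cooking with fresh herbs",
]

for score, idx in search("best cooking recipes", docs, k=2):
    print(f"{score:.4f}  {docs[idx]}")
```

Only exact term matches contribute to the score, so a document that describes cooking without using the query's words would be missed; the hybrid pipelines below address this by re-ranking TF-IDF candidates with a neural model.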

### TF-IDF retrieval, Bi-encoder re-ranking

Hybrid search with TF-IDF retrieval and Bi-encoder re-ranking.

In [21]:
results = tfidf_search.search(*queries, k=k, reranking_strategy='biencoder')

print_results(results, 'Results for hybrid search with TF-IDF retrieval and Bi-encoder re-ranking')

2025-03-16 13:12:18,833 - [17780] - search - [biencoder] - INFO - Loading Bi-Encoder paraphrase-distilroberta-base-v2


2025-03-16 13:12:20,953 - [17780] - search - [embeddings] - INFO - Retrieving embeddings from disk for Bi-Encoder paraphrase-distilroberta-base-v2  with encoding strategy 'title_overview'
2025-03-16 13:12:21,028 - [17780] - search - [_get_embeddings] - INFO - Fetched pre-computed embeddings from embeddings/paraphrase-distilroberta-base-v2/top5sent_title_overview.pkl


Results for hybrid search with TF-IDF retrieval and Bi-encoder re-ranking

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: Better Homes and Gardens New Cook Book   -- (Score: 0.5897) (Goodreads Id: 411053)
Authors: Better Homes and Gardens
Overview: Features: Over 900 new recipes 1,200 in all-reflect current eating habits and lifestyles; 500 new photographs over 700 in all-including 60 percent more of finished food than the last edition; Dozens of new recipes offer ethnic flavours, fresh ingredients, or vegetarian appeal; Many recipes feature make-ahead directions or quick-to-the-table meals; New chapter provides recipes for crockery cookers; Efficient, easy-to-read format, with recipes categorised into 21 chapters, each thoroughly indexed for easy reference; Expanded chapter on cooking basics includes advice on food safety, menu planning, table setting, and make-ahead cooking, plus a thorough glossary on ingredients and techniques; Appliance-friendly recipes hel

### TF-IDF retrieval, Cross-encoder re-ranking

Hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.

In [22]:
results = tfidf_search.search(*queries, k=k, reranking_strategy='crossencoder')

print_results(results, 'Results for hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.')

2025-03-16 13:12:21,097 - [17780] - search - [crossencoder] - INFO - Loading Cross-Encoder cross-encoder/stsb-distilroberta-base


Results for hybrid search with TF-IDF retrieval and Cross-encoder re-ranking.

Query: Best cooking recipes
Top 3 most similar books in corpus:
Title: How to Cook Everything: Simple Recipes for Great Food -- (Score: 0.7228) (Goodreads Id: 603204)
Authors: Mark Bittman
Overview: Great Food Made Simple Here's the breakthrough one-stop cooking reference for today's generation of cooks! Nationally known cooking authority Mark Bittman shows you how to prepare great food for all occasions using simple techniques, fresh ingredients, and basic kitchen equipment. Just as important, How to Cook Everything takes a relaxed, straightforward approach to cooking, so you can enjoy yourself in the kitchen and still achieve outstanding results.


Title: The Joy of Cooking -- (Score: 0.6943) (Goodreads Id: 327847)
Authors: Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker
Overview: Since its original publication, Joy of Cooking has been the most authoritative cookbook in America, the one upon which m