# RAG Search Retrieval Experiment

Vector Store : [Qdrant](https://qdrant.tech/)

In this experiment, we compare different retrieval options for a RAG application.

We use the [Qdrant](https://qdrant.tech/) vector store as our database. Qdrant is chosen because it is optimized for storing both dense and sparse vectors. Don't worry if you don't know what these terms mean; we will introduce them shortly.

> Running this notebook on a GPU is recommended, especially for large PDF files.

<a target="_blank" href="https://colab.research.google.com/github/fuzzylabs/innovation-rag-search-retrieval">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Getting Started

For a first pass, click `Run -> Run All Cells` from the drop-down.

Then head down to the `Comparison` cell at the bottom of the notebook and select different approaches to compare the documents each one retrieves.

## Parameters

The following parameters are used in the experiment.

```
QUERY = "Who manufactured STM32F429ZIT6U microcontroller?"
```

We choose this particular question so that we can check whether specific content is present in the retrieved context.

Specifically, we are looking for the following content in the retrieved documents.

```
In this chapter, many programs were developed using the NUCLEO board, provided with the STM32F429ZIT6U microcontroller. This microcontroller is manufactured by STMicroelectronics [13] using a Cortex-M4 processor designed by Arm Ltd. [14].
```


In [1]:
######## Parameters #########

# Data
FILE_PATH = "data/a_beginners_guide_to_designing_embedded_system_applications_on_arm_cortex-m_microcontrollers.pdf"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 50
COLLECTION_NAME = 'qdrant_experiment_search'
BATCH_SIZE = 32

# Dense
DENSE_MODEL_NAME = "all-MiniLM-L6-v2"
DENSE_MODEL_DIM = 384
DENSE_METRIC = 'Cosine'  # options: ['Cosine', 'Euclid', 'Dot', 'Manhattan']

# Sparse
SPARSE_MODEL_NAME = 'spladepp' # options: ['spladepp', 'bm42']
SPARSE_METRIC = 'IP'

# Full text
FULL_TEXT_MODEL = 'bm25'

# Re-ranker
RERANKER_MODEL = 'bgm3' # options: [bgm3, crossencoder]

# Query
TOP_K = 5
QUERY = "Who manufactured STM32F429ZIT6U microcontroller?"

In [2]:
# Load all imports required by the experiment
import torch
import pandas as pd
from IPython.display import display
from data_pipeline import prepare_data
from langchain_text_splitters import RecursiveCharacterTextSplitter
from result_collector import rrf
from sentence_transformers import SentenceTransformer
from fastembed.sparse.bm25 import Bm25
from fastembed.sparse.splade_pp import SpladePP
from qdrant import CustomQdrantClient
from qdrant_client import models
# We use pymilvus re-ranker functions for re-ranker
from pymilvus.model.reranker import BGERerankFunction, CrossEncoderRerankFunction

pd.set_option('max_colwidth', 800)
DEVICE = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f"Using device={DEVICE}")



Using device=cuda:0


## Data Preprocessing

> ⚠️ Make sure the pdf data is present in [data](./data) folder.

In this step, we preprocess the PDF data defined by the `FILE_PATH` parameter. The preprocessed data will be added to the vector store.

- We use [PyPDFLoader](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/#using-pypdf) from langchain to read the PDF.
- We use [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/) from langchain to split the text into small chunks. The parameters `CHUNK_SIZE` and `CHUNK_OVERLAP` configure the chunk size and the overlap between consecutive chunks.

> ⚠️ This step takes about 3 mins for a 600-page PDF document.
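
To make the roles of `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the `RecursiveCharacterTextSplitter` algorithm (which also tries to split on separators such as paragraphs and sentences). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Simplified illustration only; the real splitter also respects separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least part of a neighbouring chunk.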

In [3]:
%%time
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    length_function=len,
    is_separator_regex=False,
)
docs = prepare_data(FILE_PATH, text_splitter)

CPU times: user 2min 33s, sys: 1.29 s, total: 2min 35s
Wall time: 2min 35s


In [None]:
docs

## Data Ingestion

In this step, we store the chunked texts in the Qdrant vector store.

Here's how it works

- We load a dense model defined by `DENSE_MODEL_NAME` parameter on a GPU
- We load a sparse model defined by `SPARSE_MODEL_NAME` parameter on a GPU
- We load BM25 model used for full text search approach


In [8]:
dense_metric = None
if DENSE_METRIC.lower() == 'cosine':
    dense_metric = models.Distance.COSINE
elif DENSE_METRIC.lower() == 'euclid':
    dense_metric = models.Distance.EUCLID
elif DENSE_METRIC.lower() == 'dot':
    dense_metric = models.Distance.DOT
elif DENSE_METRIC.lower() == 'manhattan':  # compare against lowercase, since .lower() is applied
    dense_metric = models.Distance.MANHATTAN
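
For intuition about the four `DENSE_METRIC` options, the following sketch computes each of them on two small vectors in plain Python. The helper names are hypothetical and this is not how Qdrant computes them internally. Cosine and dot product are similarities (higher means closer), while Euclidean and Manhattan are distances (lower means closer).

```python
import math


def dot(a, b):
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalising each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]  # same direction as u, twice the length

print("cosine   :", cosine_similarity(u, v))
print("dot      :", dot(u, v))
print("euclid   :", euclidean_distance(u, v))
print("manhattan:", manhattan_distance(u, v))
```

The two vectors point in the same direction, so cosine similarity treats them as identical even though the Euclidean and Manhattan distances between them are non-zero.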

In [9]:
# dense embedding model
dense_model = SentenceTransformer(DENSE_MODEL_NAME, device=DEVICE)

In [10]:
# sparse embedding model
sparse_model = None
if SPARSE_MODEL_NAME == 'spladepp':
    sparse_model = SpladePP(model_name="prithivida/Splade_PP_en_v1")


In [11]:
# BM25 model for full text search
bm25_model = Bm25(model_name="Qdrant/bm25")


Next, we use custom code that simplifies connecting to the Qdrant vector store. It does the heavy lifting for us by providing functions to store data in and retrieve data from the vector store.

> The custom code is located in [qdrant.py](./qdrant.py).

In [12]:
# Create a qdrant vector store and persist it to a local path
# To keep setup simple, we use a standalone database
# This is not recommended for production
qdrant_client = CustomQdrantClient()

In [13]:
# Create a collection in Qdrant vector store
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    dense_dim=DENSE_MODEL_DIM,
    dense_distance_metric=dense_metric
)

> ⚠️ This step is resource intensive. It takes about 2 mins on an Nvidia GPU.

> It might take around 10-15 mins if running on CPU.

> It calculates the embeddings (both dense and sparse) for the entire dataset.

In [14]:
%%time

# Store the chunks and metadata
# Also calculate the dense, sparse and full text embedding for the chunks while inserting
# These embeddings are also stored along with chunk and metadata
qdrant_client.store(
    docs=docs,
    collection_name=COLLECTION_NAME,
    dense_model=dense_model,
    sparse_model=sparse_model,
    bm25_model=bm25_model,
    batch_size=BATCH_SIZE
)

CPU times: user 1h 59min 29s, sys: 1min 41s, total: 2h 1min 10s
Wall time: 10min 3s


# Dense

This is the most popular approach for retrieving documents that are *semantically similar* to the query. It relies on dense vectors, which capture the semantic meaning of the text using an embedding model.

![dense](./assets/dense_retrieval_vector_store.png)

A *dense embedding model* creates a dense vector for the search query. This vector is matched against the entries in the database to find the closest neighbors to the search query using a *distance metric*.
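
The matching step can be sketched as a brute-force nearest-neighbor search. This is an illustration under simplifying assumptions: the toy 2-dimensional vectors and the function `dense_search_sketch` are hypothetical, and a vector store such as Qdrant typically uses an approximate index rather than scanning every entry.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_search_sketch(query_vec, doc_vecs, top_k):
    """Score every stored vector against the query and keep the top_k best."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return scored[:top_k]


# Toy "collection": document id -> dense vector (real ones have DENSE_MODEL_DIM entries)
toy_index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}
top = dense_search_sketch([1.0, 0.1], toy_index, top_k=2)
print(top)
```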

In [15]:
%%time

# Get the dense embedding vector for the query
# encode() returns a NumPy array by default
query_embed = dense_model.encode(QUERY)

# Get the TOP_K closest documents to the query
result_docs = qdrant_client.dense_search(
    collection_name=COLLECTION_NAME,
    query_dense_embedding=query_embed,
    top_k=TOP_K,
)

CPU times: user 0 ns, sys: 202 ms, total: 202 ms
Wall time: 269 ms


In [16]:
dense_docs = [(res.payload['text'], res.score) for res in result_docs]

In [17]:
dense_docs

[('Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure\xa01.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/10\nADC1/13\nADC3/9\nADC3/15\nADC3/8\nADC3/5NRSTPC_8\nPC_9\nPC_10\nPC_11\nPC_12\nPD_2\nPG_2\nPG_3\nGNDPD_7\nPD_6\nPD_5\nPD_4\nPD_3\nPE_2\nPE_4\nPE_3\nPF_7\nPG_1UART2_RX\nUART2_ XT\nUART_X7TUART2_RTS\

### Result

The goal here is to verify whether the following context is present in any of the retrieved documents.

```
In this chapter, many programs were developed using the NUCLEO board, provided with the STM32F429ZIT6U microcontroller. This microcontroller is manufactured by STMicroelectronics [13] using a Cortex-M4 processor designed by Arm Ltd. [14].
```

If it is present, our retrieval approach was able to find the correct document in the vector store.

> **It is not present in any of the retrieved documents.**

In [18]:
df = pd.DataFrame(dense_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.544362
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.475778
2,"Chapter 4 | Finite-State Machines and the Real-Time Clock\n171 References\n[1] “4x4 Keypad Module Pinout, Configuration, Features, Circuit & Datasheet” . Accessed July 9, \n2021. \n[2] “Breadboard Power Supply Module” . Accessed July 9, 2021. https://components101.com/modules/5v-mb102-breadboard-power-supply-module \n[3] “UM1974 User manual - STM32 Nucleo-144 boards (MB1137)” . Accessed July 9, 2021. https://www.st.com/resource/en/user_manual/dm00244518-stm32-nucleo144-boards-mb1137-stmicroelectronics.pdf \n[4] “GitHub - armBookCodeExamples/Directory” . Accessed July 9, 2021. https://github.com/armBookCodeExamples/Directory/ \n[5] “Time - API references and tutorials | Mbed OS 6 Documentation ” . Accessed July 9, 2021. \nhttps://os.mbed.com/docs/mbed-os/v6.12/apis/time.html \n[6] “<cst...",0.473938
3,"M_14, STM_PIN_DATA_EXT(STM_MODE_AF_PP, GPIO_NOPULL, GPIO_AF9_TIM14, 1, 0)}, \n {NC, NC, 0}\n}; \nCode 8.3 Notes on the PinNames.h file of the NUCLEO-F429ZI board.",0.471764
4,"[3] “Arm Keil | Cloud-based Development T ools for IoT, ML and Embedded” . Accessed July 9, 2021. https://www.keil.arm.com/ \n[4] “Log In | Mbed” . Accessed July 9, 2021. https://os.mbed.com/account/login/ \n[5] “Keil Studio” . Accessed July 9, 2021. https://studio.keil.arm.com/ \n[6] “Documentation - Arm Developer” . Accessed July 9, 2021. https://developer.arm.com/documentation \n[7] “Arm Mbed Studio” . Accessed July 9, 2021. https://os.mbed.com/studio/ \n[8] “STM32CudeIDE - Integrated Development Environment” . Accessed July 9, 2021. https://www.st.com/en/development-tools/stm32cubeide.html \n[9] “STM32F429ZI - High-performance advanced line, Arm Cortex-M4” . Accessed July 9, 2021. https://www.st.com/en/microcontrollers-microprocessors/stm32f429zi.html",0.446006


# Sparse

Unlike dense vectors, which capture the semantic meaning of the text, sparse vectors work at the keyword level: redundant terms are pruned and relevant terms are expanded. This allows a document to be represented by a sparse vector composed of weighted keywords.

The removal of redundant terms here resembles the “stopword removal” process seen in traditional search engines, where commonly occurring but non-informative words like “the” and “a” are removed during index building.

![sparse](./assets/sparse_retrieval_vector_store.png)

Similar to the dense approach above, a *sparse embedding model* creates a sparse vector for the search query. This vector is matched against the entries in the database to find the closest neighbors to the search query using a *distance metric*.

We use SPLADE embedding models as sparse models.

> A significant advantage of SPLADE's embedding technique is its inherent capacity for term expansion. It identifies and includes relevant terms not present in the original text. For instance, in the example provided, tokens such as "exploration" and "created" emerge in the sparse vector despite their absence from the initial sentence. ([Link](https://zilliz.com/learn/bge-m3-and-splade-two-machine-learning-models-for-generating-sparse-embeddings))
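
To illustrate how two sparse vectors are compared, the sketch below stores each vector as a mapping from token index to weight and scores documents with the inner product (the `IP` metric set in `SPARSE_METRIC`). The token indices, weights, and the helper `sparse_dot` are made up for illustration; real SPLADE vectors index into the model's vocabulary.

```python
def sparse_dot(query, doc):
    """Inner product of two sparse vectors stored as {token_index: weight}.

    Only indices present in both vectors contribute to the score.
    """
    if len(query) > len(doc):
        query, doc = doc, query  # iterate over the shorter vector
    return sum(weight * doc[index] for index, weight in query.items() if index in doc)


# Hypothetical sparse vectors: most vocabulary entries are absent (implicitly zero)
query_vec = {101: 1.2, 2057: 0.8, 7: 0.3}
doc_vecs = {
    "doc_a": {101: 0.9, 2057: 1.1, 42: 0.5},  # shares two indices with the query
    "doc_b": {7: 0.4, 42: 1.0},               # shares one index with the query
}

scores = {name: sparse_dot(query_vec, vec) for name, vec in doc_vecs.items()}
print(scores)
```

Term expansion matters here because a document can only score on indices it shares with the query; expanded terms give the two vectors more chances to overlap.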

In [19]:
%%time

# Get the sparse embedding vector for the query
query_embed = sparse_model.query_embed(QUERY)
# query_embed() yields embeddings lazily; take the single embedding for our one query
sparse_query_embed = next(iter(query_embed))

# Get the TOP_K closest documents to the query
result_docs = qdrant_client.sparse_search(
    collection_name=COLLECTION_NAME,
    query_sparse_embedding=sparse_query_embed,
    top_k=TOP_K,
)

CPU times: user 1.63 s, sys: 12.7 ms, total: 1.64 s
Wall time: 243 ms


In [20]:
sparse_docs = [(res.payload['text'], res.score) for res in result_docs]

In [21]:
sparse_docs

[('Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able\xa01.11.\nT able\xa01.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code\xa01.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor  Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]',
  20.15582847595215),
 ('38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure\xa01.23 shows how different elements of the STM32F429ZIT6U microcontroller  are mapped \nto the Zio and Arduino-compatible headers of 

### Result

The goal here is to verify whether the following context is present in any of the retrieved documents.

```txt
In this chapter, many programs were developed using the NUCLEO board, provided with the STM32F429ZIT6U microcontroller. This microcontroller is manufactured by STMicroelectronics [13] using a Cortex-M4 processor designed by Arm Ltd. [14].
```

If it is present, our retrieval approach was able to find the correct document in the vector store.

> **It is present in the first row (index 0) of Retrieved text.**

In [22]:
df = pd.DataFrame(sparse_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",20.155828
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",17.964592
2,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",16.524555
3,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",14.147621
4,"Index\n567 STM 32F429 ZIT 6U microcontroller 5, 33, 37, 38 \n stop bit 46, 64, 65, 80, 256 , 257 , 273 , 468 \n ST Zio connectors xiv, 6, 9, 37, 38, 44, 346 \n superloop 13, 519 \n synchronous communication 255\nT TCP server 456 , 457 , 459 , 460 –463 , 491 \n temperature sensor 4, 5, 87, 106 , 123 , 182 , 220 , 233 , 334 , 486 , 550 \n time management xvi, 86, 95, 99, 289 , 342 , 348 , 353 \n timers xvi, xxvii, 37, 306 , 342 , 343 , 345 , 348 –350 , 353 , 354 , \n 361 , 441 , 512 –517 , 533 , 535 , 537 , 538 , 539 \n tm structure 153 \n TO- 220 package 89\nU UART xvi, xvii, 5, 44, 61, 63, 79, 82, 84, 86, 181 , 191 , 222 , 255 , 290 , 291 , \n 306 , 350 , 420 , 423 , 447 , 455 , 461 , 465 , 468 , 544 \n USB xiv, xxvii, 4, 9, 37, 44, 45, 63, 79, 82, 88, 132 , 449 , 506 , 507 ,...",12.547963


# Full Text Search

We use the BM25 model to perform full text search retrieval. For a given input, it creates a list of tokens that identifies the important terms in that input.

> Best Matching 25 (BM25) is a text-matching algorithm that can be viewed as an enhancement of the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. TF-IDF is an information retrieval algorithm that measures the importance of a keyword within a document relative to a collection of documents. [Source](https://zilliz.com/learn/comparing-splade-sparse-vectors-with-bm25)
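
As a rough illustration of the scoring idea, here is a textbook BM25 sketch in plain Python. It is not necessarily how the `Qdrant/bm25` model or the vector store computes scores; the parameter values `k1=1.2` and `b=0.75` are common defaults assumed here, and the pre-tokenized toy corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates and is normalised by document length
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["microcontrol", "manufactur", "stmicroelectron"],
    ["microcontrol", "includ", "processor"],
    ["keypad", "modul", "pinout"],
]
query = ["manufactur", "microcontrol"]
scores = bm25_scores(query, corpus)
print(scores)
```

The first document matches both query terms, the second matches one, and the third matches none, so the scores decrease in that order.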



In [24]:
%%time

# Get the tokens for the query as sparse array
query_embed = bm25_model.query_embed(QUERY)
# query_embed() yields embeddings lazily; take the single embedding for our one query
full_text_query_embed = next(iter(query_embed))

# Get the TOP_K closest documents to the query
result_docs = qdrant_client.full_text_search(
    collection_name=COLLECTION_NAME,
    query_full_text_embedding=full_text_query_embed,
    top_k=TOP_K,
)

CPU times: user 49.3 ms, sys: 19 μs, total: 49.3 ms
Wall time: 48.6 ms


In [25]:
full_text_docs = [(res.payload['text'], res.score) for res in result_docs]

In [26]:
full_text_docs

[('Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able\xa01.11.\nT able\xa01.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code\xa01.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor  Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]',
  4.512887001037598),
 ('This section will begin the implementation of the smart home system, following the diagram shown in Figure\xa01.3. This first implementation will detect fire using an over temperature detector and a gas de

### Result

The goal here is to verify whether the following context is present in any of the retrieved documents.

```
In this chapter, many programs were developed using the NUCLEO board, provided with the STM32F429ZIT6U microcontroller. This microcontroller is manufactured by STMicroelectronics [13] using a Cortex-M4 processor designed by Arm Ltd. [14].
```

If it is present, our retrieval approach was able to find the correct document in the vector store.

> **It is present in the first row of Retrieved text.**

In [27]:
df = pd.DataFrame(full_text_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",4.512887
1,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",3.317336
2,"WaRNINg: Keil Studio Cloud translates the C/C++ language code into assembly code while considering the available resources of the target board. For this reason, the reader should be very careful to use only pin names that are shown in Figure 1.23.\nFrom the above discussion, it is possible to derive the hierarchy that is represented in Figure 1.24. It should be noted that a given Arm processor can be used by different microcontroller manufacturers, and a given microcontroller can be used in different development boards or embedded systems. \nFigure 1.24 Hierarchy of different elements introduced in this chapter.\nNOTE: A microcontroller may have one or many processors, while a processor may have one or many cores. A microprocessor consists of the processor, named in this context as th...",3.200337
3,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",2.866316
4,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",2.801293


### Behind the scenes

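Let us look at what is happening underneath.

BM25 converts the input `QUERY` into a list of stemmed tokens. With lowercasing and stopword removal, the query reduces to 3 tokens. Calling the tokenizer and stemmer directly, as in the code cell further down, keeps the original casing and the word `Who`, so that cell reports 4 tokens.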

```
Who manufactured STM32F429ZIT6U microcontroller?
```

```
['manufactur', 'stm32f429zit6u', 'microcontrol']
```

In [28]:
# Get the list of tokens from query
tokens = bm25_model.tokenizer.tokenize(QUERY)
query_tokens = bm25_model._stem(tokens)
print(f"Query: {QUERY}")
print(f"Query tokens: {query_tokens}")
print(f"Length of query tokens: {len(query_tokens)}")

Query: Who manufactured STM32F429ZIT6U microcontroller?
Query tokens: ['Who', 'manufactur', 'STM32F429ZIT6U', 'microcontrol']
Length of query tokens: 4


In [29]:
# Get tokens for TOP_K retrieved text
# Find common tokens between retrieved text and the query
for input_sentence, _score in full_text_docs:
    tokens = bm25_model.tokenizer.tokenize(input_sentence)
    input_tokens = bm25_model._stem(tokens)
    common = set(query_tokens).intersection(input_tokens)
    print(input_sentence)
    print(input_tokens)
    print(f"Common tokens: {common}")
    print(f"Number of matching tokens: {len(common)}")
    print("="*80)

Chapter 1 | Introduction to Embedded Systems 
33Proposed Exercise
1. How can the code be changed in such a way that the system is blocked after three incorrect codes 
are entered?
Answer to the Exercise
1. It can be achieved by means of the change in T able 1.11.
T able 1.11 Proposed modification in the code in order to achieve the new behavior.
Line in Code 1.5 New code to be used
40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )
1.3 Under the Hood
1.3.1 Brief Introduction to the Cortex-M Processor  Family and the NUCLEO Board
In this chapter, many programs were developed using the NUCLEO board, provided with the 
STM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]
['Chapter', '1', 'Introduct', 'Embed', 'System', '33Propos', 'Exercis', '1.', 'How', 'code', 'chang', 'way', 'system', 'block', 'three', 'incorrect', 'code', 'enter', 'Answer', 'Exercis', '1.', 'It', 'achiev', 'mean', 'chang', 'T', 'abl', '1.11.', 'T', 'ab

# Dense + Sparse

Here we look at aggregating the results from the dense and sparse approaches.

There are multiple ways to combine the results from different approaches. We will experiment with two of them.

- **Reciprocal Rank Fusion (RRF)** : RRF combines results from different retrieval systems using only the rank (position) of each document in each result list. Every list contributes a score that decreases as the document's rank drops, and the contributions are summed to produce the final ranking.

- **Re-ranker** : We use re-ranker models, which are cross-encoder models, to score and rank the retrieved documents by how relevant they are to the input query.

### Reciprocal Rank Fusion

In [30]:
combined_docs = [dense_docs, sparse_docs]

In [31]:
dense_sparse_rrf_docs = rrf(combined_docs)[:TOP_K]
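
The `rrf` helper comes from the project's `result_collector` module, whose code is not shown here. A minimal sketch of the idea follows, assuming a constant `k = 60` and ranks that start at 0. Both values are assumptions, chosen because they are consistent with the fused scores displayed in this notebook (for example, a document that is first in only one list scores 1/60 ≈ 0.016667). The name `rrf_sketch` is hypothetical.

```python
from collections import defaultdict


def rrf_sketch(result_lists, k=60):
    """Fuse several ranked lists of (text, score) pairs with Reciprocal Rank Fusion.

    The original retrieval scores are ignored; only each document's position counts.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, (text, _score) in enumerate(results):  # rank starts at 0 (assumption)
            fused[text] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy result lists in the same (text, score) shape as dense_docs and sparse_docs
toy_dense = [("a", 0.54), ("b", 0.47)]
toy_sparse = [("b", 20.1), ("c", 17.9)]

fused = rrf_sketch([toy_dense, toy_sparse])
print(fused)
```

Because only ranks are used, the very different score scales of dense (around 0.5) and sparse (around 20) retrieval do not need to be normalised before fusing.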

In [32]:
df = pd.DataFrame(dense_sparse_rrf_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.032787
1,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.03254
2,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.016667
3,"Chapter 4 | Finite-State Machines and the Real-Time Clock\n171 References\n[1] “4x4 Keypad Module Pinout, Configuration, Features, Circuit & Datasheet” . Accessed July 9, \n2021. \n[2] “Breadboard Power Supply Module” . Accessed July 9, 2021. https://components101.com/modules/5v-mb102-breadboard-power-supply-module \n[3] “UM1974 User manual - STM32 Nucleo-144 boards (MB1137)” . Accessed July 9, 2021. https://www.st.com/resource/en/user_manual/dm00244518-stm32-nucleo144-boards-mb1137-stmicroelectronics.pdf \n[4] “GitHub - armBookCodeExamples/Directory” . Accessed July 9, 2021. https://github.com/armBookCodeExamples/Directory/ \n[5] “Time - API references and tutorials | Mbed OS 6 Documentation ” . Accessed July 9, 2021. \nhttps://os.mbed.com/docs/mbed-os/v6.12/apis/time.html \n[6] “<cst...",0.016129
4,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.016129


### Re-Ranker

In [33]:
ranker_model = None
# Define the rerank function
if RERANKER_MODEL == "bgm3":
    ranker_model = BGERerankFunction(
        model_name="BAAI/bge-reranker-v2-m3",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )
elif RERANKER_MODEL == "crossencoder":
    ranker_model = CrossEncoderRerankFunction(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )

In [34]:
combined_docs = [dense_doc[0] for dense_doc in dense_docs]
combined_docs.extend([sparse_doc[0] for sparse_doc in sparse_docs])
# Remove duplicates, if any, while keeping the original retrieval order
combined_docs = list(dict.fromkeys(combined_docs))

In [35]:
%%time
result_docs = ranker_model(query=QUERY, documents=combined_docs, top_k=TOP_K)

CPU times: user 211 ms, sys: 76.1 ms, total: 287 ms
Wall time: 532 ms


In [36]:
dense_sparse_reranker_docs = [(res.text, res.score) for res in result_docs]

In [37]:
df = pd.DataFrame(dense_sparse_reranker_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.98698
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.719778
2,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.545891
3,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.536616
4,"[3] “Arm Keil | Cloud-based Development T ools for IoT, ML and Embedded” . Accessed July 9, 2021. https://www.keil.arm.com/ \n[4] “Log In | Mbed” . Accessed July 9, 2021. https://os.mbed.com/account/login/ \n[5] “Keil Studio” . Accessed July 9, 2021. https://studio.keil.arm.com/ \n[6] “Documentation - Arm Developer” . Accessed July 9, 2021. https://developer.arm.com/documentation \n[7] “Arm Mbed Studio” . Accessed July 9, 2021. https://os.mbed.com/studio/ \n[8] “STM32CudeIDE - Integrated Development Environment” . Accessed July 9, 2021. https://www.st.com/en/development-tools/stm32cubeide.html \n[9] “STM32F429ZI - High-performance advanced line, Arm Cortex-M4” . Accessed July 9, 2021. https://www.st.com/en/microcontrollers-microprocessors/stm32f429zi.html",0.373647


# Dense + Full Text Search

Here we aggregate the results from the dense and full-text approaches.

There are multiple ways to collate results from different approaches; we experiment with two of them.

- **Reciprocal Rank Fusion (RRF)** : RRF combines results from different retrieval systems using only the rank (position) of each document in each result list. Each document's reciprocal-rank contributions are summed across all sources, and the documents are then sorted by that combined score.

- **Re-ranker** : A re-ranker is a cross-encoder model that scores each retrieved document by how relevant it is to the input query; the documents are then ordered by that score.
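
As a rough illustration of how RRF combines ranked lists, here is a minimal sketch. It is not the notebook's own `rrf` helper, which is defined elsewhere. The name `rrf_sketch`, the constant `k = 60`, and the use of zero-based ranks are assumptions, chosen because they appear consistent with the scores shown in the outputs (for example, 1/60 ≈ 0.016667).

```python
from collections import defaultdict


def rrf_sketch(result_lists, k=60):
    """Fuse ranked lists of (text, score) tuples with Reciprocal Rank Fusion.

    Assumption: ranks are zero-based and k defaults to 60. The original
    retriever scores are ignored; only positions matter.
    """
    fused = defaultdict(float)
    for results in result_lists:
        for rank, (text, _score) in enumerate(results):
            fused[text] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical, not taken from the PDF)
dense = [("doc A", 0.91), ("doc B", 0.85), ("doc C", 0.80)]
full_text = [("doc B", 12.3), ("doc D", 9.8), ("doc A", 7.1)]

fused_docs = rrf_sketch([dense, full_text])
for text, score in fused_docs:
    print(f"{text}: {score:.6f}")
```

Because only positions are used, the dense similarity scores and the full-text scores never need to be on the same scale, which is the main appeal of RRF when mixing retrievers.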

### Reciprocal Rank Fusion

In [38]:
combined_docs = []
combined_docs.append(dense_docs)
combined_docs.append(full_text_docs)

In [39]:
dense_full_text_rrf_docs = rrf(combined_docs)[:TOP_K]

In [40]:
df = pd.DataFrame(dense_full_text_rrf_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.03254
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.032018
2,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.016667
3,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.016393
4,"Chapter 4 | Finite-State Machines and the Real-Time Clock\n171 References\n[1] “4x4 Keypad Module Pinout, Configuration, Features, Circuit & Datasheet” . Accessed July 9, \n2021. \n[2] “Breadboard Power Supply Module” . Accessed July 9, 2021. https://components101.com/modules/5v-mb102-breadboard-power-supply-module \n[3] “UM1974 User manual - STM32 Nucleo-144 boards (MB1137)” . Accessed July 9, 2021. https://www.st.com/resource/en/user_manual/dm00244518-stm32-nucleo144-boards-mb1137-stmicroelectronics.pdf \n[4] “GitHub - armBookCodeExamples/Directory” . Accessed July 9, 2021. https://github.com/armBookCodeExamples/Directory/ \n[5] “Time - API references and tutorials | Mbed OS 6 Documentation ” . Accessed July 9, 2021. \nhttps://os.mbed.com/docs/mbed-os/v6.12/apis/time.html \n[6] “<cst...",0.016129


### Re-Ranker

In [41]:
ranker_model = None
# Define the rerank function
if RERANKER_MODEL == "bgm3":
    ranker_model = BGERerankFunction(
        model_name="BAAI/bge-reranker-v2-m3",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )
elif RERANKER_MODEL == "crossencoder":
    ranker_model = CrossEncoderRerankFunction(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )

In [42]:
combined_docs = [dense_doc[0] for dense_doc in dense_docs]
combined_docs.extend([full_text_doc[0] for full_text_doc in full_text_docs])
# Remove duplicates, if any, while keeping the original retrieval order
combined_docs = list(dict.fromkeys(combined_docs))

In [43]:
%%time
result_docs = ranker_model(query=QUERY, documents=combined_docs, top_k=TOP_K)

CPU times: user 187 ms, sys: 7.74 ms, total: 195 ms
Wall time: 186 ms


In [44]:
dense_full_text_reranker_docs = [(res.text, res.score) for res in result_docs]

In [45]:
df = pd.DataFrame(dense_full_text_reranker_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.98698
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.719581
2,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.547192
3,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.534764
4,"[3] “Arm Keil | Cloud-based Development T ools for IoT, ML and Embedded” . Accessed July 9, 2021. https://www.keil.arm.com/ \n[4] “Log In | Mbed” . Accessed July 9, 2021. https://os.mbed.com/account/login/ \n[5] “Keil Studio” . Accessed July 9, 2021. https://studio.keil.arm.com/ \n[6] “Documentation - Arm Developer” . Accessed July 9, 2021. https://developer.arm.com/documentation \n[7] “Arm Mbed Studio” . Accessed July 9, 2021. https://os.mbed.com/studio/ \n[8] “STM32CudeIDE - Integrated Development Environment” . Accessed July 9, 2021. https://www.st.com/en/development-tools/stm32cubeide.html \n[9] “STM32F429ZI - High-performance advanced line, Arm Cortex-M4” . Accessed July 9, 2021. https://www.st.com/en/microcontrollers-microprocessors/stm32f429zi.html",0.37319


# Dense + Sparse + Full Text Search

Here we aggregate the results from all three retrieval approaches: dense, sparse, and full-text.

There are multiple ways to collate results from different approaches; we experiment with two of them.

- **Reciprocal Rank Fusion (RRF)** : RRF combines results from different retrieval systems using only the rank (position) of each document in each result list. Each document's reciprocal-rank contributions are summed across all sources, and the documents are then sorted by that combined score.

- **Re-ranker** : A re-ranker is a cross-encoder model that scores each retrieved document by how relevant it is to the input query; the documents are then ordered by that score.
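
The re-ranking step can be pictured as: pool the candidates from every retriever, drop duplicates, score each (query, document) pair, and keep the top results. The sketch below illustrates that flow under stated assumptions; it is not the notebook's `BGERerankFunction` or `CrossEncoderRerankFunction`. The names `rerank_sketch` and `token_overlap_score` are hypothetical, and the token-overlap scorer is only a stand-in for a real cross-encoder so that the example runs without downloading a model.

```python
def token_overlap_score(query: str, document: str) -> float:
    """Stand-in scorer: fraction of query tokens that occur in the document.

    A real re-ranker would run a cross-encoder on the (query, document) pair.
    """
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)


def rerank_sketch(query, result_lists, score_fn, top_k=5):
    """Pool (text, score) results from several retrievers and re-score them."""
    # Pool texts from all retrievers, dropping duplicates but keeping order
    pooled = list(
        dict.fromkeys(text for results in result_lists for text, _ in results)
    )
    scored = [(text, score_fn(query, text)) for text in pooled]
    # Highest relevance first
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical, not taken from the PDF)
dense = [
    ("the board uses a cortex-m4 processor", 0.7),
    ("who manufactured the microcontroller", 0.6),
]
sparse = [
    ("who manufactured the microcontroller", 3.2),
    ("pins d2 to d7", 1.1),
]

reranked = rerank_sketch(
    "who manufactured the microcontroller",
    [dense, sparse],
    token_overlap_score,
    top_k=2,
)
print(reranked)
```

Unlike RRF, the re-ranker ignores where each retriever placed a document and judges every candidate directly against the query, at the cost of one model call per candidate.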

### Reciprocal Rank Fusion

In [46]:
combined_docs = []
combined_docs.append(dense_docs)
combined_docs.append(sparse_docs)
combined_docs.append(full_text_docs)

In [47]:
dense_sparse_full_text_rrf_docs = rrf(combined_docs)[:TOP_K]

In [48]:
df = pd.DataFrame(dense_sparse_full_text_rrf_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.048413
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.048412
2,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.033333
3,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.032522
4,"Chapter 4 | Finite-State Machines and the Real-Time Clock\n171 References\n[1] “4x4 Keypad Module Pinout, Configuration, Features, Circuit & Datasheet” . Accessed July 9, \n2021. \n[2] “Breadboard Power Supply Module” . Accessed July 9, 2021. https://components101.com/modules/5v-mb102-breadboard-power-supply-module \n[3] “UM1974 User manual - STM32 Nucleo-144 boards (MB1137)” . Accessed July 9, 2021. https://www.st.com/resource/en/user_manual/dm00244518-stm32-nucleo144-boards-mb1137-stmicroelectronics.pdf \n[4] “GitHub - armBookCodeExamples/Directory” . Accessed July 9, 2021. https://github.com/armBookCodeExamples/Directory/ \n[5] “Time - API references and tutorials | Mbed OS 6 Documentation ” . Accessed July 9, 2021. \nhttps://os.mbed.com/docs/mbed-os/v6.12/apis/time.html \n[6] “<cst...",0.016129


### Re-Ranker

In [49]:
ranker_model = None
# Define the rerank function
if RERANKER_MODEL == "bgm3":
    ranker_model = BGERerankFunction(
        model_name="BAAI/bge-reranker-v2-m3",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )
elif RERANKER_MODEL == "crossencoder":
    ranker_model = CrossEncoderRerankFunction(
        model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
        device=DEVICE,
        batch_size=BATCH_SIZE,
    )

In [50]:
combined_docs = [dense_doc[0] for dense_doc in dense_docs]
combined_docs.extend([sparse_doc[0] for sparse_doc in sparse_docs])
combined_docs.extend([full_text_doc[0] for full_text_doc in full_text_docs])
# Remove duplicates, if any, while keeping the original retrieval order
combined_docs = list(dict.fromkeys(combined_docs))

In [51]:
%%time
result_docs = ranker_model(query=QUERY, documents=combined_docs, top_k=TOP_K)

CPU times: user 193 ms, sys: 15.6 ms, total: 209 ms
Wall time: 196 ms


In [52]:
dense_sparse_full_text_reranker_docs = [(res.text, res.score) for res in result_docs]

In [53]:
df = pd.DataFrame(dense_sparse_full_text_reranker_docs, columns=['Retrieved text', 'score'])
display(df)

Unnamed: 0,Retrieved text,score
0,"Chapter 1 | Introduction to Embedded Systems \n33Proposed Exercise\n1. How can the code be changed in such a way that the system is blocked after three incorrect codes \nare entered?\nAnswer to the Exercise\n1. It can be achieved by means of the change in T able 1.11.\nT able 1.11 Proposed modification in the code in order to achieve the new behavior.\nLine in Code 1.5 New code to be used\n40 if ( numberOfIncorrectCodes < 5 ) 40 if ( numberOfIncorrectCodes < 3 )\n1.3 Under the Hood\n1.3.1 Brief Introduction to the Cortex-M Processor Family and the NUCLEO Board\nIn this chapter, many programs were developed using the NUCLEO board, provided with the \nSTM32F429ZIT6U microcontroller . This microcontroller is manufactured by STMicroelectronics [13]",0.98698
1,"38\nA Beginner’s Guide to Designing Embedded System ApplicationsFigure 1.23 shows how different elements of the STM32F429ZIT6U microcontroller are mapped \nto the Zio and Arduino-compatible headers of the NUCLEO-F429ZI board. Some other elements \nare mapped to the CN11 and CN12 headers of the NUCLEO-F429ZI board, as will be discussed in upcoming chapters. Further information on these headers is available from [17].\nIn this chapter, buttons were connected to the NUCLEO board using pins D2 to D7. From Figure 1.23, \nit can be seen that those digital inputs can also be referred to as PF_15, PE_13, P_14, PE_11, PE_9, and PF_13, respectively. Throughout this book, many pins of the ST Zio connectors will be used, and they will be referred to in the code using the names shown in Figure 1.23.",0.719778
2,"This section will begin the implementation of the smart home system, following the diagram shown in Figure 1.3. This first implementation will detect fire using an over temperature detector and a gas detector and, if it detects fire, it will activate the alarm until a given code is entered. For this purpose, the NUCLEO board [1] provided with the STM32F429ZIT6U microcontroller [9] (referred to as the STM32 microcontroller ) is used.",0.545891
3,"Chapter 1 | Introduction to Embedded Systems \n37The STM32F429ZIT6U microcontroller includes a Cortex-M4 processor , as shown in Figure 1.22. It \ncan be appreciated that, beyond the processor, the microcontroller includes other peripherals such as \ncommunication cores (ethernet, USB, UART, etc.), memory, timers, and GPIO (General Purpose Input Output) ports. \nNUCLEO\n-F429ZI32F429ZIT6U\nARM7B776 VQ\nPHL 7B 7213e412000K620 Y12000\nK620 Y\n120 00K620 YDGKYD\nKMS-1 102NL1706C\nSTM32F103CBT6\ne393701GH218CHN\nST890C\nGK717\n11\n22\n33\n44\n55\n66\n77\n88\n9\n10\n11\n12\n13\n14\n15PH_0\nPD_0\nPD_1\nPG_0PH_1PF_2PA_7PF_10PF_5PF_3PC_3PC_030\n29\n28\n27\n2616\n2515\n2414\n2313\n2212\n2111\n2010\n199\n18\n17\n165V\nVIN3.3VIOREF\nGND\nGNDGND\nNCNC\nUART2_RX\nCAN1_TDCAN1_ DRADC1/7ADC1/3\nADC1/1...",0.536616
4,"[3] “Arm Keil | Cloud-based Development T ools for IoT, ML and Embedded” . Accessed July 9, 2021. https://www.keil.arm.com/ \n[4] “Log In | Mbed” . Accessed July 9, 2021. https://os.mbed.com/account/login/ \n[5] “Keil Studio” . Accessed July 9, 2021. https://studio.keil.arm.com/ \n[6] “Documentation - Arm Developer” . Accessed July 9, 2021. https://developer.arm.com/documentation \n[7] “Arm Mbed Studio” . Accessed July 9, 2021. https://os.mbed.com/studio/ \n[8] “STM32CudeIDE - Integrated Development Environment” . Accessed July 9, 2021. https://www.st.com/en/development-tools/stm32cubeide.html \n[9] “STM32F429ZI - High-performance advanced line, Arm Cortex-M4” . Accessed July 9, 2021. https://www.st.com/en/microcontrollers-microprocessors/stm32f429zi.html",0.373647


# Comparison

Here we will interactively compare the results from all approaches.

In [54]:
from ipywidgets import interact, Dropdown
import pandas as pd

In [55]:
def create_dataframe(approach: str):
    if approach == "Dense":
        return pd.DataFrame(dense_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Sparse":
        return pd.DataFrame(sparse_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Full Text":
        return pd.DataFrame(full_text_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Sparse + RRF":
        return pd.DataFrame(dense_sparse_rrf_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Sparse + Reranker":
        return pd.DataFrame(dense_sparse_reranker_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Full Text + RRF":
        return pd.DataFrame(dense_full_text_rrf_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Full Text + Reranker":
        return pd.DataFrame(dense_full_text_reranker_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Sparse + Full Text + RRF":
        return pd.DataFrame(dense_sparse_full_text_rrf_docs, columns=['Retrieved_Text', 'Score'])
    elif approach == "Dense + Sparse + Full Text + Reranker":
        return pd.DataFrame(dense_sparse_full_text_reranker_docs, columns=['Retrieved_Text', 'Score'])

The goal here is to verify whether the following context is present in any of the retrieved documents.

```
In this chapter, many programs were developed using the NUCLEO board, provided with the STM32F429ZIT6U microcontroller. This microcontroller is manufactured by STMicroelectronics [13] using a Cortex-M4 processor designed by Arm Ltd. [14].
```

If it is present, the retrieval approach was able to find the correct document in the vector store.
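
This check can also be done programmatically rather than by eye. The sketch below is a minimal example under assumptions: `contains_expected` and `normalize` are hypothetical helpers, and the match is made on a short key phrase after collapsing whitespace, because the retrieved chunks contain line breaks and stray spaces and may cut the passage off mid-sentence.

```python
import re


def normalize(text: str) -> str:
    """Lower-case the text and collapse whitespace runs to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_expected(docs, key_phrase):
    """Return the positions of (text, score) tuples whose text contains key_phrase."""
    needle = normalize(key_phrase)
    return [
        i for i, (text, _score) in enumerate(docs) if needle in normalize(text)
    ]


# Toy retrieved results (hypothetical; shaped like the notebook's (text, score) tuples)
sample_docs = [
    ("Figure 1.23 shows how different elements are mapped \nto the headers.", 0.72),
    ("This microcontroller is manufactured by \nSTMicroelectronics [13]", 0.99),
]

hits = contains_expected(sample_docs, "manufactured by STMicroelectronics")
print(hits)
```

An empty list means the key phrase was not found in any retrieved chunk; a non-empty list gives the row positions to inspect in the corresponding table.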

In [56]:
options = [
    "Dense",
    "Sparse",
    "Full Text",
    "Dense + Sparse + RRF",
    "Dense + Sparse + Reranker",
    "Dense + Full Text + RRF",
    "Dense + Full Text + Reranker",
    "Dense + Sparse + Full Text + RRF",
    "Dense + Sparse + Full Text + Reranker",
]
approach1 = Dropdown(options = options, description="Approach 1:", value="Dense")
approach2 = Dropdown(options = options, description="Approach 2:", value="Sparse")

@interact(approach1 = approach1, approach2 = approach2)
def show_dataframes(approach1, approach2):
    df1 = create_dataframe(approach1)
    df2 = create_dataframe(approach2)
    # Print each approach name in colour (red, then blue) above its results
    print('\x1b[1;31m'+ str(approach1) +'\x1b[0m')
    display(df1)
    print('\x1b[1;34m'+ str(approach2) +'\x1b[0m')
    display(df2)
