<a href="https://colab.research.google.com/github/erkanceran/opdeva/blob/main/Simple_Rag_with_chromaDB_Gemini_PartB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PART B: CODE WITH CHROMADB FOR VECTOR STORAGE & SIMILARITY SEARCH

In this notebook we will develop a Retrieval Augmented Generation (RAG) application.

The parts of this series are:

* PART A: AN INTRO TO GEMINI API FOR TEXT GENERATION & CHAT
* PART B: CODE WITH CHROMADB FOR VECTOR STORAGE & SIMILARITY SEARCH
* PART C: CODE WITH CHROMADB FOR PERSISTENT VECTOR DB
* PART D: A SIMPLE RAG BASED ON GEMINI & CHROMADB
* PART E: ADVANCED TECHNIQUES FOR RAG BASED ON GEMINI & CHROMADB

YouTube Playlist:
* Türkçe: Erişim Destekli Metin Üretimi : https://www.youtube.com/playlist?list=PLQflnv_s49v_nrk7iGYqw5iRAKrSZPnnV

* English: Retrieval Augmented Generation (RAG) https://www.youtube.com/playlist?list=PLQflnv_s49v-EFKdOVDKB743f1iskLLw2

## What is RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique that combines large language models (LLMs) with external knowledge sources to improve the accuracy and reliability of AI-generated text.

## How Does RAG Work? Unveiling the Power of External Knowledge

Before we start the core RAG process, we need to provide a foundation as follows:

* **Building the Knowledge Base:** The system starts by transforming documents and information within the external knowledge base (like Wikipedia or a company database) into a special format called **vector representations**. These condense the meaning of each document into a series of **numbers**, capturing the essence of the content.

* **Vector Database for Speedy Retrieval**: These vector representations are then stored in a specialized database called a vector database. This database is optimized for efficiently **searching and retrieving** information based on **semantic similarity**. Imagine it as a super-powered library catalog that **understands the meaning** of documents, **not just keywords**.

Now, let's explore how RAG leverages this foundation:

* **User Input**: The RAG process begins with a question or **prompt** from the user. This could be anything from "What caused the extinction of the dinosaurs?" to a more open-ended request like "Write a creative story."

* **Intelligent Retrieval**: RAG doesn't rely solely on the **LLM's internal knowledge**. It employs an information retrieval component that acts like a super-powered search engine. This component scans the vast external knowledge base – like a company's internal database for specific domains – to find information **directly relevant** to the user's input. Unlike a traditional **search engine** that relies on **keywords**, RAG leverages the power of vector representations to understand the **semantic meaning** of the user's prompt and identify the most relevant documents.

* **Enriched Context Creation**: The retrieved information isn't just shown alongside the prompt. RAG cleverly **merges the user input with the relevant snippets** from the knowledge base. This creates a ***richer context*** for the LLM to understand the **user's intent** and formulate a well-informed response.

* **LLM Powered Response Generation**: Finally, the **enriched context** is fed to the Large Language Model (LLM). The LLM, along with its ability to process language patterns, now has a strong **foundation of factual** information to draw upon. This empowers it to generate a response that is both comprehensive and accurate, addressing the specific needs of the user's prompt.
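
The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the three-sentence `knowledge_base` is invented for the example, and the final generation step is left out, so the sketch stops at the enriched prompt that would be sent to the LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, knowledge_base, n_results=1):
    """Intelligent retrieval: rank documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(knowledge_base,
                    key=lambda doc: cosine_similarity(q, embed(doc)),
                    reverse=True)
    return ranked[:n_results]

def build_prompt(question, snippets):
    """Enriched context creation: merge the user input with the retrieved snippets."""
    context = "\n".join(f"- {s}" for s in snippets)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# Invented mini knowledge base for the illustration
knowledge_base = [
    "UAV route planning aims to cover the maximum number of targets.",
    "ChromaDB stores vector embeddings for similarity search.",
    "The Max-Min Ant System is an ant colony optimization algorithm.",
]

question = "What does ChromaDB store?"          # user input
snippets = retrieve(question, knowledge_base)   # retrieval
prompt = build_prompt(question, snippets)       # enriched context
print(prompt)                                   # this text would be fed to the LLM
```

In the rest of this series, the toy `embed` function is replaced by a Sentence Transformers model, the list of sentences by a ChromaDB collection, and the final step by a call to Gemini.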

In this part, we will learn how to build a vector database for speedy retrieval from a knowledge base using ChromaDB.

* https://www.trychroma.com/
* https://github.com/chroma-core/chroma

# CONTENT

In this exciting tutorial series, we are developing a Retrieval Augmented Generation (RAG) application. If you missed the first part where we covered how to code the GEMINI API for text generation and chat, be sure to check that out.

In this second part, we will code with ChromaDB for vector storage and similarity search with **multiple documents**.

In this tutorial, we will learn:

* How Does RAG Work? – Understand the fundamentals of Retrieval Augmented Generation.
* Upload Single and Multiple PDF Documents to ChromaDB – Learn to handle various document types.
* Convert PDF Content to Text Format – Extract text from PDFs efficiently.
* Convert Text from Pages to Chunks – Organize text for better processing.
* Tokenize the Text – Prepare your text data for machine learning.
* Use the Sentence Transformers Library – Implement advanced text encoding techniques.
* Understand Vector Embedding – Grasp the concept of embedding vectors.
* Create a ChromaDB Instance for Single and Multiple Files/Documents – Set up your database for efficient storage.
* Query and Retrieve Chunks from ChromaDB – Learn to search and retrieve relevant information.
* Filtering Results – Refine your search results.
* Remove Less Related/Less Similar Chunks – Improve the accuracy of your retrieval process.

All the above steps will be implemented and coded in Python on Google Colab.

Follow along step-by-step to master these techniques and enhance your data processing capabilities.

# Building the Knowledge Base

##  Upload pdf documents
During this tutorial, I will use PDF files to build the knowledge base. However, you can easily extend this notebook to handle other file types.



In [None]:
#upload several pdf docs to colab from local machine at once
from google.colab import files
def upload_multiple_files():
  uploaded = files.upload()
  file_names = list()
  for fn in uploaded.keys():
    #print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))
    file_names.append(fn)
  return file_names


In [None]:
file_names = upload_multiple_files()

Saving 15 UAV Route Planning For Maximum Target Coverage.pdf to 15 UAV Route Planning For Maximum Target Coverage.pdf


In [None]:
file_names[0]

'15 UAV Route Planning For Maximum Target Coverage.pdf'

## Convert PDF content to text format
Let's see how we can do it on a single file:

In [None]:
%pip install pypdf --quiet
from pypdf import PdfReader

In [None]:
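def convert_PDF_Text(pdf_path):
  # Read the PDF and extract the text of each page
  reader = PdfReader(pdf_path)
  pdf_texts = [p.extract_text().strip() for p in reader.pages]
  # Filter out the empty strings (pages without extractable text)
  pdf_texts = [text for text in pdf_texts if text]
  # Note: one list item is one page here, so the count printed below
  # is the number of non-empty pages, not the number of chunks
  print("Document: ",pdf_path," chunk size: ", len(pdf_texts))
  return pdf_texts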

In [None]:
pdf_texts = convert_PDF_Text(file_names[0])
pdf_texts[0]


ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet


Document:  15 UAV Route Planning For Maximum Target Coverage.pdf  chunk size:  8


'Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014\nDOI : 10.5121/cseij.2014.410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET\nCOVERAGE\nMuratKarakaya1\n1Department of Computer Engineering ,Atilim University ,Ankara,Turkey\nABSTRACT\nUtilization of Unmanned Aerial Vehicles (UAVs) in military and civil operations is getting popular. One of\nthe challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the\ntargets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of\nUAVs to cover maximum number of target s considering their flight range.\nThe proposed solution for this practical optimization problem is designed by modifying the Max -Min Ant\nSystem (MMAS) algorithm.  To evaluate the success of the proposed method, an alternative approach,\nbased on the Neares t Neighbour (NN) heuristic, has been developed as well. The results showed the success\nof the proposed 

The to_markdown function converts plain text to Markdown format, adding blockquote styling and converting bullet points.

In [None]:
import textwrap
from IPython.display import display
from IPython.display import Markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
to_markdown(pdf_texts[0])

> Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014
> DOI : 10.5121/cseij.2014.410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET
> COVERAGE
> MuratKarakaya1
> 1Department of Computer Engineering ,Atilim University ,Ankara,Turkey
> ABSTRACT
> Utilization of Unmanned Aerial Vehicles (UAVs) in military and civil operations is getting popular. One of
> the challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the
> targets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of
> UAVs to cover maximum number of target s considering their flight range.
> The proposed solution for this practical optimization problem is designed by modifying the Max -Min Ant
> System (MMAS) algorithm.  To evaluate the success of the proposed method, an alternative approach,
> based on the Neares t Neighbour (NN) heuristic, has been developed as well. The results showed the success
> of the proposed MMAS method by increasing the number of covered targets compared to the sol ution based
> on the NN heuristic .
> KEYWORDS
> Unmanned Aerial Vehicles (UAVs), r outing, target coverage, Max -Min Ant Colony Optimization
> 1.INTRODUCTION
> This document describes, and is written to conform to, author guidelines for the journals of
> AIRCCseries.  It is prepared in Microsoft Word as a .doc document.  Although other mea ns of
> preparation are acceptable, final, camera -ready versions must conform to this layout.  Microsoft
> Word terminology is used where appropriate in this document.  Although formatting instructions
> may often appear daunting, the simplest approach is to use this template and insert headings and
> text into it as appropriate. The importance and the impact of using Unmanned Aerial Vehicles
> (UAVs) in military and civil operations are increasing [3,4,5]. One of the issues faced for efficient
> usage of UAVs is plann ing the flight routes to monitor all or the maximum number of the given
> targets. This problem is related with the Multiple Travelling Salesman Problem (mTSP) [1] and
> the Vehicle Routing Problem (VRP) [6]. In these well -defined problems, it is mostly assume d that
> travelling salesmen or vehicles should visit all the targets and the target function is defined as to
> find a minimum -distant route. Even, in the constraint versions of the mTSP and VRP, some other
> restrictions (visiting time windows, number of depot s, etc.) are included; it is still assumed that
> there exists enough number of travelling salesmen or vehicles to cover all the given locations.
> However, in reality the number and flight range of UAVs might be insufficient to cover all the
> targets. As a r esult, the maximization of the number of targets covered by the limited number of
> UAVs can be defined as a new problem. Thus, this article presents a solution for this practical
> optimization problem by modifying the Max -Min Ant System (MMAS) algorithm [2]
> accordingly.

Let's see how many pages we have in that file:

In [None]:
print(file_names[0]," has ", len(pdf_texts), " pages")

15 UAV Route Planning For Maximum Target Coverage.pdf  has  8  pages


## Convert Text from Pages to Chunks
The code below defines a **character_splitter** object using a class called **RecursiveCharacterTextSplitter**. This object is intended to ***split text into smaller chunks based on certain criteria***.

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. **This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.**

* How the text is split: by list of characters.

* How the chunk size is measured: by number of characters.

Let's break down the parameters:

* separators: This parameter defines the characters or strings that will be used as separators to split the text. In this case, the separators include "\n\n" (double newline), "\n" (single newline), ". " (period followed by a space), " " (space), and "" (empty string). These separators will be used to split the text into smaller chunks.

* chunk_size: This parameter specifies the maximum size of each chunk of text after splitting. In this case, each chunk will have a maximum size of 1500 characters.

* chunk_overlap: This parameter sets how many characters adjacent chunks share. A value of 0 means no overlap, which is the setting used here.

Overall, this code initializes a character_splitter object that can split text into smaller chunks based on the specified separators, with a maximum size of 1500 characters per chunk and no overlap between chunks.
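
To make the "try each separator in order" idea concrete, here is a minimal sketch of recursive character splitting in plain Python. It is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops the separator at chunk boundaries, whereas the real class handles these details differently.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive character splitting (illustration only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separator left (or the empty-string separator): hard cut by characters
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to keep merging pieces while the result still fits
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
chunks = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=30)
for c in chunks:
    print(len(c), repr(c))
```

In this small example the second paragraph fits into one chunk, while the first paragraph is too long and is therefore split again at the sentence boundary. Text without any separator falls through to the last rule and is cut every `chunk_size` characters.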






In [None]:
%pip install langchain --quiet
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [None]:
def convert_Page_ChunkinChar(pdf_texts, chunk_size = 1500, chunk_overlap=0 ):
  # Use the function arguments instead of hard-coded values
  character_splitter = RecursiveCharacterTextSplitter(
      separators=["\n\n", "\n", ". ", " ", ""],
      chunk_size=chunk_size,
      chunk_overlap=chunk_overlap
  )
  # Join the pages into one text so that chunks can span page boundaries
  character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
  print(f"\nTotal number of chunks (document split by max char = {chunk_size}): {len(character_split_texts)}")
  return character_split_texts


In [None]:
text_chunksinChar = convert_Page_ChunkinChar(pdf_texts)

Total number of chunks (document split by max char = 1500): 14


In [None]:
print("................. NOTICE ..................")
print(file_names[0]," has ", len(pdf_texts), " pages")
print(file_names[0]," has ", len(text_chunksinChar), " chunks")
print("chunk [0] has ", len(text_chunksinChar[0]), " chars")
print("chunk [-1] has ", len(text_chunksinChar[-1]), " chars")

................. NOTICE ..................
15 UAV Route Planning For Maximum Target Coverage.pdf  has  8  pages
15 UAV Route Planning For Maximum Target Coverage.pdf  has  14  chunks
chunk [0] has  1496  chars
chunk [-1] has  1184  chars


## Convert Text from Chunks to Tokens

The function below re-splits the character-based chunks with the `SentenceTransformersTokenTextSplitter` so that every chunk fits within a token limit. Let's break down the code and its purpose:

1. `token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=chunk_overlap, model_name=sentence_transformer_model, tokens_per_chunk=tokens_per_chunk)`: This line initializes a splitter object named `token_splitter`. It counts tokens with the tokenizer of the specified Sentence Transformers model. The parameters are:
   - `chunk_overlap=0` (default): There is no overlap between consecutive chunks.
   - `model_name="distiluse-base-multilingual-cased-v1"`: The name of the pre-trained model whose tokenizer is used.
   - `tokens_per_chunk=128` (default): The maximum number of tokens allowed per chunk.

2. `text_chunksinTokens = []`: This line initializes an empty list to store the token-limited chunks.

3. `for text in text_chunksinChar:`: This line starts a loop over the text chunks produced by the character splitter.

4. `text_chunksinTokens += token_splitter.split_text(text)`: Within the loop, this line splits each text chunk with the `token_splitter` object and adds the resulting pieces to the `text_chunksinTokens` list.

5. The final `print` statement reports the total number of chunks produced.

Why do we apply this step after character splitting? Embedding models accept only a limited number of tokens per input, and text beyond that limit is typically truncated. A chunk of 1500 characters can exceed the limit, so part of it would be ignored when it is embedded. Character splitting first breaks the text at natural boundaries (paragraphs, lines, sentences), and the token splitter then guarantees that every chunk has at most `tokens_per_chunk` tokens. Note that the result is still a list of text chunks, not a list of individual tokens.
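
The windowing idea behind token-based splitting can be sketched without any model. In the illustration below, whitespace-separated words stand in for tokens, which is an assumption made only to keep the example self-contained: the real splitter uses the model's subword tokenizer, so its token counts differ from word counts. The function name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text, tokens_per_chunk=128, chunk_overlap=0):
    """Split text into windows of at most tokens_per_chunk tokens.

    Assumption: a "token" is a whitespace-separated word here; a real
    splitter would use the tokenizer of the embedding model instead.
    """
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    tokens = text.split()
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once the window has reached the end of the text
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))   # "w0 w1 ... w9"
no_overlap = split_by_tokens(sample, tokens_per_chunk=4)
with_overlap = split_by_tokens(sample, tokens_per_chunk=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=1`, the last token of each window is repeated as the first token of the next one, which can help when a sentence is cut at a chunk boundary.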



In [None]:
%pip install sentence_transformers --quiet
%pip install tqdm --quiet

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
sentence_transformer_model="distiluse-base-multilingual-cased-v1"

In [None]:
def convert_Chunk_Token(text_chunksinChar,sentence_transformer_model, chunk_overlap=0,tokens_per_chunk=128 ):
  # Splitter that counts tokens with the tokenizer of the given model
  token_splitter = SentenceTransformersTokenTextSplitter(
      chunk_overlap=chunk_overlap,
      model_name=sentence_transformer_model,
      tokens_per_chunk=tokens_per_chunk)

  # Re-split every character-based chunk so that it fits the token limit
  text_chunksinTokens = []
  for text in text_chunksinChar:
      text_chunksinTokens += token_splitter.split_text(text)
  print(f"\nTotal number of chunks (document split by {tokens_per_chunk} tokens per chunk): {len(text_chunksinTokens)}")
  return text_chunksinTokens

In [None]:
from tqdm import tqdm, trange

In [None]:
text_chunksinTokens = convert_Chunk_Token(text_chunksinChar,sentence_transformer_model)

Total number of chunks (document split by 128 tokens per chunk): 41


In [None]:
to_markdown(text_chunksinChar[0])

> Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014
> DOI : 10.5121/cseij.2014.410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET
> COVERAGE
> MuratKarakaya1
> 1Department of Computer Engineering ,Atilim University ,Ankara,Turkey
> ABSTRACT
> Utilization of Unmanned Aerial Vehicles (UAVs) in military and civil operations is getting popular. One of
> the challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the
> targets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of
> UAVs to cover maximum number of target s considering their flight range.
> The proposed solution for this practical optimization problem is designed by modifying the Max -Min Ant
> System (MMAS) algorithm.  To evaluate the success of the proposed method, an alternative approach,
> based on the Neares t Neighbour (NN) heuristic, has been developed as well. The results showed the success
> of the proposed MMAS method by increasing the number of covered targets compared to the sol ution based
> on the NN heuristic .
> KEYWORDS
> Unmanned Aerial Vehicles (UAVs), r outing, target coverage, Max -Min Ant Colony Optimization
> 1.INTRODUCTION
> This document describes, and is written to conform to, author guidelines for the journals of
> AIRCCseries.  It is prepared in Microsoft Word as a .doc document.  Although other mea ns of
> preparation are acceptable, final, camera -ready versions must conform to this layout.  Microsoft

In [None]:
print("................. NOTICE ..................")
print(file_names[0]," has ", len(pdf_texts), " pages")
print(file_names[0]," has ", len(text_chunksinChar), " chunks split by 1500 chars")
print(file_names[0]," has ", len(text_chunksinTokens), " chunks split by 128 tokens")
print("text_chunksinChar [0] is:\n ")
display(to_markdown(text_chunksinChar[0]))
print("text_chunksinTokens for the first 3 chunks are:\n ")
display(to_markdown(text_chunksinTokens[0]))
display(to_markdown(text_chunksinTokens[1]))
display(to_markdown(text_chunksinTokens[2]))

................. NOTICE ..................
15 UAV Route Planning For Maximum Target Coverage.pdf  has  8  pages
15 UAV Route Planning For Maximum Target Coverage.pdf  has  14  chunks split by 1500 chars
15 UAV Route Planning For Maximum Target Coverage.pdf  has  41  chunks split by 128 tokens
text_chunksinChar [0] is:
 


> Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014
> DOI : 10.5121/cseij.2014.410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET
> COVERAGE
> MuratKarakaya1
> 1Department of Computer Engineering ,Atilim University ,Ankara,Turkey
> ABSTRACT
> Utilization of Unmanned Aerial Vehicles (UAVs) in military and civil operations is getting popular. One of
> the challenges in effectively tasking these expensive vehicles is planning the flight routes to monitor the
> targets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of
> UAVs to cover maximum number of target s considering their flight range.
> The proposed solution for this practical optimization problem is designed by modifying the Max -Min Ant
> System (MMAS) algorithm.  To evaluate the success of the proposed method, an alternative approach,
> based on the Neares t Neighbour (NN) heuristic, has been developed as well. The results showed the success
> of the proposed MMAS method by increasing the number of covered targets compared to the sol ution based
> on the NN heuristic .
> KEYWORDS
> Unmanned Aerial Vehicles (UAVs), r outing, target coverage, Max -Min Ant Colony Optimization
> 1.INTRODUCTION
> This document describes, and is written to conform to, author guidelines for the journals of
> AIRCCseries.  It is prepared in Microsoft Word as a .doc document.  Although other mea ns of
> preparation are acceptable, final, camera -ready versions must conform to this layout.  Microsoft

text_chunksinTokens for the first 3 chunks are:
 


> Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively tasking these expensive vehicles is planning

> the flight routes to monitor the targets. In this work, we aim to develop an algorithm which produces routing plans for a limited number of UAVs to cover maximum number of target s considering their flight range. The proposed solution for this practical optimization problem is designed by modifying the Max - Min Ant System ( MMAS ) algorithm. To evaluate the success of the proposed method, an alternative approach, based on the Neares t Neighbour ( NN ) heuristic, has been developed as well. The results showed the success of the proposed MMAS method by increasing the number of covered targets compared to the

> sol ution based on the NN heuristic. KEYWORDS Unmanned Aerial Vehicles ( UAVs ), r outing, target coverage, Max - Min Ant Colony Optimization 1. INTRODUCTION This document describes, and is written to conform to, author guidelines for the journals of AIRCCseries. It is prepared in Microsoft Word as a. doc document. Although other mea ns of preparation are acceptable, final, camera - ready versions must conform to this layout. Microsoft

## Sentence Transformers Library
**IMPORTANT:** The `SentenceTransformersTokenTextSplitter` class depends on the `sentence_transformers` library and uses the models supported by that library.

You can browse the pre-trained models of the `sentence_transformers` library at: https://www.sbert.net/docs/pretrained_models.html




## Multi-Lingual Models Supported by the Sentence Transformers Library
The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close in vector space. You do not need to specify the input language: https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models

### Semantic Similarity
These models find semantically similar sentences within one language or across languages:

**distiluse-base-multilingual-cased-v1**: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, **Turkish**.

**distiluse-base-multilingual-cased-v2:** Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. This version supports 50+ languages, but performs a bit weaker than the v1 model.

**paraphrase-multilingual-MiniLM-L12-v2:** Multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for 50+ languages.

**paraphrase-multilingual-mpnet-base-v2:** Multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for 50+ languages.



## Understand the Vector Embeddings

Below, we import the `chromadb` library and its utility for **embedding functions**, then set up an embedding function that uses the SentenceTransformer model specified above to create vector representations of text.

In [None]:
%pip install chromadb --quiet

In [None]:
import numpy as np
import chromadb
from chromadb.utils import embedding_functions
embedding_function= embedding_functions.SentenceTransformerEmbeddingFunction(model_name=sentence_transformer_model)

Let's first see how to convert a text chunk, split with a limit of 128 tokens, into a vector representation by using the ***embedding_function***:

In [None]:
chunk=0
print("text (128 token max):\n")
display(to_markdown(text_chunksinTokens[chunk]))
embedding_vector= embedding_function([text_chunksinTokens[chunk]])
print("Embedding Vector shape: ",np.shape(embedding_vector))
print("Embedding Vector dimension: ",len(embedding_vector[0]))
print("Embedding Vector content (for first 5 dimension): \n",embedding_vector[0][0:5])

text (128 token max):



> Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively tasking these expensive vehicles is planning

Embedding Vector shape:  (1, 512)
Embedding Vector dimension:  512
Embedding Vector content (for first 5 dimension): 
 [0.06848306208848953, -0.027248548343777657, 0.008307364769279957, 0.01328625064343214, -0.006611280608922243]
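
Once every chunk is a vector, "similar meaning" becomes "small distance between vectors". The sketch below compares two common distance measures, squared L2 and cosine distance, on three made-up 3-dimensional vectors (the real embeddings above have 512 dimensions). The vectors and function names are assumptions for illustration only. ChromaDB lets a collection be configured with a distance function such as squared L2, cosine, or inner product, so it is worth checking which one your collection uses before interpreting the distances it returns.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity: depends on the angle, not on the vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors: v_close points almost in the same direction as v_query
v_query = [1.0, 0.0, 1.0]
v_close = [0.9, 0.1, 1.1]
v_far = [-1.0, 1.0, 0.0]

for name, v in [("close", v_close), ("far", v_far)]:
    print(name,
          "squared L2:", round(squared_l2(v_query, v), 4),
          "cosine distance:", round(cosine_distance(v_query, v), 4))
```

Both measures rank `v_close` as nearer to the query than `v_far`, but the numeric scales differ, so a distance threshold chosen for one measure cannot be reused for the other.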


# CREATE A CHROMADB INSTANCE
Let's create a ChromaDB client for storing and retrieving vectors.

1. **Importing the Client**: The code below imports the `Client` class from the `chromadb` library, which will be used for interacting with a database that handles embeddings.

2. **Creating an Instance of the Client**: It creates an instance of the `Client` class named `chroma_client`. This instance will be used to interact with the database.

3. **Creating a Collection**: It creates a collection in the database, named "Papers" in our example. The `get_or_create_collection` method returns the existing collection with that name, or creates it if it does not exist yet. We pass two arguments:
   - `collection_name`: The name of the collection (`"Papers"`).
   - `embedding_function`: A function (defined above) that generates embeddings for the items to be stored in the collection.

The code below sets up the necessary client and collection in a database designed to handle and store embeddings, which are typically used for tasks involving similarity search, clustering, or other vector-based queries.
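
To build some intuition for what "get or create" and "a collection with an embedding function" mean, here is a toy in-memory model of the idea. It is not how ChromaDB is implemented: `ToyClient`, `ToyCollection`, and `toy_embedding_function` are hypothetical names invented for this sketch, and the "embedding" is just two simple numbers per text.

```python
class ToyCollection:
    """Toy stand-in for a collection: stores ids, documents and their vectors."""
    def __init__(self, name, embedding_function):
        self.name = name
        self.embedding_function = embedding_function
        self.ids, self.documents, self.embeddings = [], [], []

    def add(self, ids, documents):
        # The collection embeds the documents itself, using the function
        # it was given when it was created
        self.embeddings.extend(self.embedding_function(documents))
        self.ids.extend(ids)
        self.documents.extend(documents)

    def count(self):
        return len(self.ids)

class ToyClient:
    """Toy stand-in for a client: a registry of collections by name."""
    def __init__(self):
        self._collections = {}

    def get_or_create_collection(self, name, embedding_function=None):
        # Create the collection only if the name is not registered yet
        if name not in self._collections:
            self._collections[name] = ToyCollection(name, embedding_function)
        return self._collections[name]

def toy_embedding_function(texts):
    """Hypothetical embedding: [character count, word count] for each text."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]

client = ToyClient()
papers = client.get_or_create_collection("Papers", toy_embedding_function)
papers.add(ids=["0"], documents=["UAV route planning"])

# Asking again for the same name returns the existing collection with its data
same = client.get_or_create_collection("Papers", toy_embedding_function)
print(same is papers, same.count(), papers.embeddings)
```

The practical consequence for a notebook is that a cell calling "get or create" can be re-run safely: the second call hands back the collection that already exists instead of starting from scratch.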

## Create a ChromaDB client

In [None]:
from chromadb import Client
def create_chroma_client(collection_name, embedding_function):
  chroma_client = Client()
  chroma_collection = chroma_client.get_or_create_collection(collection_name, embedding_function=embedding_function)
  return chroma_client, chroma_collection

## Create a ChromaDB collection

In [None]:
collection_name= "Papers"
chroma_client, chroma_collection = create_chroma_client(collection_name, embedding_function)
print(chroma_collection.count())
print(chroma_client.list_collections())

0
[<chromadb.api.models.Collection.Collection object at 0x7a615e1092d0>]


NOTE: We provided the embedding function when creating the collection. Thus, the collection will apply the embedding function automatically when we add the text chunks split according to the token limit of 128!

## Prepare Metadata

Provide some metadata for the text chunks, such as the source file name and category.

In [None]:
def add_meta_data(text_chunksinTokens, title, category, initial_id):
  ids = [str(i+initial_id) for i in range(len(text_chunksinTokens))]
  metadata = {
      'document': title,
      'category': category
  }
  metadatas = [ metadata for i in range(len(text_chunksinTokens))]
  return ids, metadatas

In [None]:
category="Journal Paper"
ids,metadatas = add_meta_data(text_chunksinTokens,file_names[0],category, 0)
ids[:5], metadatas[:5]

(['0', '1', '2', '3', '4'],
 [{'document': '15 UAV Route Planning For Maximum Target Coverage.pdf',
   'category': 'Journal Paper'},
  {'document': '15 UAV Route Planning For Maximum Target Coverage.pdf',
   'category': 'Journal Paper'},
  {'document': '15 UAV Route Planning For Maximum Target Coverage.pdf',
   'category': 'Journal Paper'},
  {'document': '15 UAV Route Planning For Maximum Target Coverage.pdf',
   'category': 'Journal Paper'},
  {'document': '15 UAV Route Planning For Maximum Target Coverage.pdf',
   'category': 'Journal Paper'}])
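
One detail of `add_meta_data` is worth knowing: the list comprehension puts the same `metadata` dictionary object into every position of the list. That is harmless as long as the entries are only read, but it matters if a per-chunk field is added later in plain Python. A minimal sketch of the difference, using a hypothetical `chunk_index` field and a placeholder file name that are not part of this notebook:

```python
# Hypothetical illustration; `chunk_index` and "paper.pdf" are not used in this notebook.
base = {"document": "paper.pdf", "category": "Journal Paper"}

# Same pattern as add_meta_data: every list entry is the *same* dict object.
shared = [base for _ in range(3)]
shared[0]["chunk_index"] = 0
print(shared[2])        # the new key appears in every entry

# Copying the dict gives each chunk an independent metadata record.
base = {"document": "paper.pdf", "category": "Journal Paper"}
independent = [dict(base, chunk_index=i) for i in range(3)]
print(independent[2])   # only this entry carries chunk_index 2
```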

## Add the document and metadata to the ChromaDB collection

In [None]:
def add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection):
  print("Before inserting, the size of the collection: ", chroma_collection.count())
  chroma_collection.add(ids=ids, metadatas= metadatas, documents=text_chunksinTokens)
  print("After inserting, the size of the collection: ", chroma_collection.count())
  return chroma_collection


In [None]:
chroma_collection = add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection)

Before inserting, the size of the collection:  0
After inserting, the size of the collection:  41


## Access the ChromaDB collection

In [None]:
chroma_collection.get(['0'])

{'ids': ['0'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'}],
 'documents': ['Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 DOI : 10. 5121 / cseij. 2014. 410 3 27UAVROUTEPLANNING FORMAXIMUMTARGET COVERAGE MuratKarakaya1 1Department of Computer Engineering, Atilim University, Ankara, Turkey ABSTRACT Utilization of Unmanned Aerial Vehicles ( UAVs ) in military and civil operations is getting popular. One of the challenges in effectively tasking these expensive vehicles is planning'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

# LOAD MULTIPLE DOCUMENTS TO CHROMADB

So far, we have dealt with only one file. Now, let's handle multiple files using the functions we have created so far.

First, delete the above collection:

In [None]:
try:
  chroma_client.delete_collection(collection_name)
  print("Collection deleted successfully.")
except Exception as e:
  print("Error deleting collection:", e)

Collection deleted successfully.


## Upload, Tokenize & Add Multiple Documents to A ChromaDB Collection

In [None]:
def load_multiple_pdfs_to_ChromaDB(collection_name, sentence_transformer_model):
  category = "Journal Paper"
  embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=sentence_transformer_model)
  chroma_client, chroma_collection = create_chroma_client(collection_name, embedding_function)
  # Continue numbering after any chunks that are already in the collection
  current_id = chroma_collection.count()
  file_names = upload_multiple_files()
  for file_name in file_names:
    print(f"Document: {file_name} is being processed to be added to the {chroma_collection.name} {chroma_collection.count()}")
    print(f"current_id: {current_id} ")
    # PDF -> page texts -> character-based chunks -> token-based chunks
    pdf_texts = convert_PDF_Text(file_name)
    text_chunksinChar = convert_Page_ChunkinChar(pdf_texts)
    text_chunksinTokens = convert_Chunk_Token(text_chunksinChar, sentence_transformer_model)
    ids, metadatas = add_meta_data(text_chunksinTokens, file_name, category, current_id)
    current_id = current_id + len(text_chunksinTokens)
    chroma_collection = add_document_to_collection(ids, metadatas, text_chunksinTokens, chroma_collection)
    print(f"Document: {file_name} added to the collection: {chroma_collection.count()}")
  return chroma_client, chroma_collection
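
The ID bookkeeping in the loop above can be sketched in isolation. Each document's chunk IDs start where the previous document's IDs ended, so IDs do not collide across documents. The helper `assign_ids` is a hypothetical name used only for this illustration; the chunk counts 41 and 6 are taken from the outputs in this notebook.

```python
def assign_ids(chunk_counts, start=0):
    """Return one list of string IDs per document, numbered consecutively."""
    all_ids = []
    current_id = start
    for count in chunk_counts:
        all_ids.append([str(current_id + i) for i in range(count)])
        current_id += count   # the next document continues after this one
    return all_ids

ids_per_document = assign_ids([41, 6])
print(ids_per_document[0][0], ids_per_document[0][-1])  # first and last ID of document 1
print(ids_per_document[1][0], ids_per_document[1][-1])  # first and last ID of document 2
```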




In [None]:
chroma_client, chroma_collection= load_multiple_pdfs_to_ChromaDB(collection_name,sentence_transformer_model)

ERROR:pypdf._cmap:Advanced encoding /SymbolSetEncoding not implemented yet


Saving 15 UAV Route Planning For Maximum Target Coverage.pdf to 15 UAV Route Planning For Maximum Target Coverage.pdf
Saving 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf to 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf
Saving 22 ISEAIA Risk Sensetive Routing Abstract.pdf to 22 ISEAIA Risk Sensetive Routing Abstract.pdf
Saving 70 Biometric Verification.pdf to 70 Biometric Verification.pdf
Document: 15 UAV Route Planning For Maximum Target Coverage.pdf is being processed to be added to the Papers 0
current_id: 0 




Document:  15 UAV Route Planning For Maximum Target Coverage.pdf  chunk size:  8

Total number of chunks (document splited by max char = 1500):         14

Total number of chunks (document splited by 128 tokens per chunk):       41
Before inserting, the size of the collection:  0
After inserting, the size of the collection:  41
Document: 15 UAV Route Planning For Maximum Target Coverage.pdf added to the collection: 41
Document: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf is being processed to be added to the Papers 41
current_id: 41 
Document:  16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf  chunk size:  1

Total number of chunks (document splited by max char = 1500):         2

Total number of chunks (document splited by 128 tokens per chunk):       6
Before inserting, the size of the collection:  41
After inserting, the size of the collection:  47
Document: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf 

In [None]:
chroma_collection.get(['84'])

{'ids': ['84'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '70 Biometric Verification.pdf'}],
 'documents': ['minimum set of requirements Cancellation and revocation of the biometrics are enabled by the change of encryption key Access devices can be produced by commercial companies Supports commercial competition among access device producers Disadv antages Requires a decryption step in biometric verification process Biometric storage structure will be revealed Verification algorithm specifications should be published 5. 4. Alternative 3 Advantages Biometric storage structure is hidden Verification algorith m specifications is hidden Local biometric verification No need for a network connection No need for a central service Sensor specifications can be shared with'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [None]:
chroma_collection.get(['64'])

{'ids': ['64'],
 'embeddings': None,
 'metadatas': [{'category': 'Journal Paper',
   'document': '70 Biometric Verification.pdf'}],
 'documents': ['security risk in biometrics is their unsuitability for revocation and cancellation. In all of the identification instruments used in security domain, biometrics has the hardest problem in canc ellation and'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

## Query & Retrieve Chunks from ChromaDB

This function retrieves the chunks that are most similar to a given **query** from a ChromaDB collection, together with their metadata and distances. With `return_only_docs=True`, it returns only the chunk texts.

In [None]:
def retrieveDocs(chroma_collection, query, n_results=5, return_only_docs=False):
    results = chroma_collection.query(query_texts=[query],
                                      include= [ "documents","metadatas",'distances' ],
                                      n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
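
`query()` accepts a list of query texts, so every field in the returned dictionary holds one inner list per query; that is why the function indexes `results['documents'][0]`. The sketch below flattens such a result into one record per chunk. The `sample` dictionary is hand-written to mimic that shape (placeholder texts and file names), and `flatten_results` is a hypothetical helper, not part of ChromaDB.

```python
def flatten_results(results, query_index=0):
    """Turn the column-oriented result of one query into a list of row dicts."""
    return [
        {"id": i, "distance": d, "metadata": m, "document": doc}
        for i, d, m, doc in zip(
            results["ids"][query_index],
            results["distances"][query_index],
            results["metadatas"][query_index],
            results["documents"][query_index],
        )
    ]

# Hand-written stand-in for the dictionary returned by chroma_collection.query(...)
sample = {
    "ids": [["42", "34"]],
    "distances": [[1.3230979442596436, 1.4107954502105713]],
    "metadatas": [[{"document": "a.pdf"}, {"document": "b.pdf"}]],
    "documents": [["placeholder text A", "placeholder text B"]],
}
rows = flatten_results(sample)
for row in rows:
    print(row["id"], round(row["distance"], 3), row["metadata"]["document"])
```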


In [None]:
query = "What are the main difference in active and passive path scheduling?"

'''
In 16 A Local Optimization Technique for Assigning New Targets ABSTRACT:

Route planning can be static or dynamic. In static route planning, routes are
constructed according to given UAVs and targets and do not change during
the mission. However, in dynamic route planning, number of routes or UAVs
can alter which requires the update of existing routes to adopt these changes.

'''


In [None]:
results=retrieveDocs(chroma_collection, query, 10)

In [None]:
results

{'ids': [['42', '34', '18', '44', '21', '4', '15', '31', '43', '3']],
 'distances': [[1.3230979442596436,
   1.4107954502105713,
   1.4304471015930176,
   1.4979865550994873,
   1.5068368911743164,
   1.5119848251342773,
   1.5183390378952026,
   1.5194307565689087,
   1.5206507444381714,
   1.5410184860229492]],
 'metadatas': [[{'category': 'Journal Paper',
    'document': '16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf'},
   {'category': 'Journal Paper',
    'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'},
   {'category': 'Journal Paper',
    'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'},
   {'category': 'Journal Paper',
    'document': '16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf'},
   {'category': 'Journal Paper',
    'document': '15 UAV Route Planning For Maximum Target Coverage.pdf'},
   {'category': 'Journal Paper',
    'document': '15 UAV Route Planning For Maximum Target Coverage

In [None]:
retrieved_documents = results['documents'][0]
if len(retrieved_documents) == 0:
   print("No results found.")

## Show the retrieved chunks

In [None]:
def show_results(results, return_only_docs=False):
  if return_only_docs:
    # `results` is already a plain list of chunk texts
    retrieved_documents = results
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc))
  else:
    # `results` is the full dictionary returned by query()
    retrieved_documents = results['documents'][0]
    if len(retrieved_documents) == 0:
      print("No results found.")
      return
    retrieved_documents_metadata = results['metadatas'][0]
    retrieved_documents_distances = results['distances'][0]
    print("------- retrieved documents -------\n")

    for i, doc in enumerate(retrieved_documents):
      print(f"Document {i+1}:")
      print("\tDocument Text: ")
      display(to_markdown(doc))
      print(f"\tDocument Source: {retrieved_documents_metadata[i]['document']}")
      print(f"\tDocument Source Type: {retrieved_documents_metadata[i]['category']}")
      print(f"\tDocument Distance: {retrieved_documents_distances[i]}")


In [None]:
show_results(results)

------- retrieved documents -------

Document 1:
	Document Text: 


> , routes are constructed according to given UAVs and targets and do not change during the mission. However, in dynamic route planning, number of routes or UAVs can alter which requires the update of existing routes to adopt these changes. For example, some of the UAVs can be lost during the mission or new targets might pop up after the take - off. This article proposes an iterative local optimization for the distribution of new targets to the existing routes in dynamic route planning. In the proposed solution, it is supposed that all UAVs have the same flight ranges, their initial routes are planned, and

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3230979442596436
Document 2:
	Document Text: 


> Computer Science & Engineering : An International Journal ( CSEIJ ), Vol. 4, No. 1, February 2014 34The prelimin ary results show the effectiveness of the MMAS in route planning. We would like to extend the work by defining different performance metrics and executing the experiments with different location set ups. REFERENCES [ 1 ] Bektas, T. ( 2006 ). The multiple trav eling salesman problem : an overview of formulations and solution procedures. Omega, 34 ( 3 ), 209 - 219. [ 2 ] Dorigo, M.

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4107954502105713
Document 3:
	Document Text: 


> with less cost, that is covering more targets, leave more pheromone on the paths to provide positive feedback for the other ants. 4. 4. Calculating Heuristic Value The heuristic value ( ηij ) between two locations is defined as ijijd1 =, whereijdis the distance between the locations. 4. 5. Algorithm Using the steps defined above an implementation of the MMAS is given in Table 1. We input the target list ( H ), the distances between the targets ( dij ), the flight ra nge ( FR ), and the number of UAVs

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4304471015930176
Document 4:
	Document Text: 


> follows. In the first phase of the algorithm, a n UAV with the highest slack range is picked and its route is modified by inserting a new target at a time. Adding a new target to an existing route causes an increase in the route distance, which is called update cost. If the update cost is not greater than the slack range, the new target is insert ed to the route. After finishing attempts with all new targets, if any of them is left over, insertion process is execute d with the UAV having the next highest slack range as described above until either all UAV

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4979865550994873
Document 5:
	Document Text: 


> UAV _ used < UAV ) { next = find _ Next _ Target ( ) ; if ( base _ Reachable ( next ) ) { move ( next ) ; remaining _ Range - = dcurrent, next ; target _ Number + + ; } else { move ( base ) ; UAV _ used + + ; remaining _ Range = FR ; } } / / end _ while evoporate _ Pheromone ( ) ; update _ Pheromone ( ) ; update _ Best _ Solution ( ) ; } / / end _ for _ each _ ant } return ( Best _ Solution ) ;

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5068368911743164
Document 6:
	Document Text: 


> ] and the Vehicle Routing Problem ( VRP ) [ 6 ]. In these well - defined problems, it is mostly assume d that travelling salesmen or vehicles should visit all the targets and the target function is defined as to find a minimum - distant route. Even, in the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depot s, etc. ) are included ; it is still assumed that there exists enough number of travelling salesmen or vehicles to cover all the given locations. However, in reality the number and flight range of UAVs might be ins

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5119848251342773
Document 7:
	Document Text: 


> Thus, if a routing plan can lead to visit all the targets, its cos t will be zero. The initial solution is constructed using Nearest Neighbors heuristic. The minimum pheromone value is defined as max10 min * ) 1 ( iteration p - = ( 6 ) As a result of Eq. ( 6 ), any edge would have pheromone at least ten times evaporat ed value of the maximum pheromone value. Thus, we do not allow unvisited edges get very low pheromone values which otherwise would decrease their probability. 4. 3. Updating Ph

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5183390378952026
Document 8:
	Document Text: 


> , 4 UAVs are successfully routed by the MMAS to cover all the targets while the NN prepares a routing plan for the same number of UAVs missing 4 % of the targets. Table 5. The target coverage ratios for the heuristics when FR = CD * 2. UAV TCNN TCMMAS 1 11 % 12 % 3 20 % 29 % 5 30 % 35 % 7 34 % 38 % 9 36 % 40 % 11 38 % 41 % 13 40 % 41 % 14 41 % 41 % 1 11 % 12 % 3. CONCLUSIONS In this work, we define

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5194307565689087
Document 9:
	Document Text: 


> they have already visited some of the targets accor ding to these routes. Furthermore, for each UAV, the slack range which is the difference between the flight range and initial route distance is calculated. Whenever some new targets appear, t he proposed iterative insertion algorithm executes as

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5206507444381714
Document 10:
	Document Text: 


> Word terminology is used where appropriate in this document. Although formatting instructions may often appear daunting, the simplest approach is to use this template and insert headings and text into it as appropriate. The importance and the impact of using Unmanned Aerial Vehicles ( UAVs ) in military and civil operations are increasing [ 3, 4, 5 ]. One of the issues faced for efficient usage of UAVs is plann ing the flight routes to monitor all or the maximum number of the given targets. This problem is related with the Multiple Travelling Salesman Problem ( mTSP ) [ 1

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5410184860229492


NOTE: Sometimes **totally unrelated** chunks can be fetched:

In [None]:
query = "What is the controllable text generation?"
results=retrieveDocs(chroma_collection, query)
show_results(results)


------- retrieved documents -------

Document 1:
	Document Text: 


> Verification System.

	Document Source: 70 Biometric Verification.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4642446041107178
Document 2:
	Document Text: 


> or digital - certificate to any one he chooses from a theoretically infi nite alternative pool. But for the biometrics, options are limited especially for hand and retina vein maps. 3. Problems w ith The Current Design Turkish national e - ID card access devices should meet a set of security requirements, such as blocking the remot e access, keeping specific event / user logs and being temper resistant etc. In order to certify whether the devices meet these requirements, they are obliged to Common Criteria ( CC ) tests with a predefined protection profile [ 11 ]

	Document Source: 70 Biometric Verification.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.464350938796997
Document 3:
	Document Text: 


> Following the technological feasibility, functionality, pilot usage and security studies, actual production and dissemination of e - ID cards were commenced. Pu blished standards are one of several outcomes of these studies as well. So that, the definitions, methods and specifications in standards were assumed to be well studied and tested. Apparently, e - ID cards are instruments with high privacy and security requ irements. The content of the card is strongly private and sensible to exploitation. It goes without saying, security and privacy considerations should take precedence in e - ID card specifications. In line

	Document Source: 70 Biometric Verification.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4691274166107178
Document 4:
	Document Text: 


> should be known by the designer. Since, with the purpose of verifying the identity, the device is supposed to compare the scanned biometrics with the biometric data which is read from e - ID card, assuming proper access rights such as PIN and / or certificates are provided. Nevertheless, the data structure and the fea ture notation used in e - ID card for biometrics is not published. This means, e - ID card access device producers will not be able to develop a verification system running on the device. On the other hand, if the producer opts for verification on

	Document Source: 70 Biometric Verification.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.4903249740600586
Document 5:
	Document Text: 


> device producers, so that access devices can verify the scanned biometrics. But a central validity check is introduced to the process steps with a minimum network and communication overhead. For central verifica tion a hash of the biometric data on e - ID card is to be transmitted to a government controlled validation server and validity result is received. This mode of operation enables the cancellation and revocation of biometrics. 5. Comparison of t he Alternative s As we have mention ed in the previous section each biometric verification algorithm has its own advantages and disa

	Document Source: 70 Biometric Verification.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.5029150247573853
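
The result above follows from how nearest-neighbour search works: the query returns the `n_results` closest chunks, whether or not any of them is actually relevant. A toy sketch with hand-made 2-D vectors and squared Euclidean distance (both are assumptions chosen for illustration; the real embeddings have 512 dimensions and the metric depends on the collection settings):

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, stored, k):
    """Return the k (distance, label) pairs closest to query_vec."""
    scored = sorted((squared_l2(query_vec, vec), label) for label, vec in stored.items())
    return scored[:k]

stored = {"chunk A": (1.0, 0.0), "chunk B": (0.9, 0.1), "chunk C": (0.8, 0.3)}
far_query = (-1.0, 0.0)   # points away from every stored vector

nearest = top_k(far_query, stored, k=2)
print(nearest)   # two results come back even though both are far away
```

This is why the distance values are worth inspecting before the chunks are passed on to a language model.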


In [None]:
results=retrieveDocs(chroma_collection, query, return_only_docs= True)
show_results(results,  return_only_docs=True)

Document 1:
	Document Text: 


> Verification System.

Document 2:
	Document Text: 


> or digital - certificate to any one he chooses from a theoretically infi nite alternative pool. But for the biometrics, options are limited especially for hand and retina vein maps. 3. Problems w ith The Current Design Turkish national e - ID card access devices should meet a set of security requirements, such as blocking the remot e access, keeping specific event / user logs and being temper resistant etc. In order to certify whether the devices meet these requirements, they are obliged to Common Criteria ( CC ) tests with a predefined protection profile [ 11 ]

Document 3:
	Document Text: 


> Following the technological feasibility, functionality, pilot usage and security studies, actual production and dissemination of e - ID cards were commenced. Pu blished standards are one of several outcomes of these studies as well. So that, the definitions, methods and specifications in standards were assumed to be well studied and tested. Apparently, e - ID cards are instruments with high privacy and security requ irements. The content of the card is strongly private and sensible to exploitation. It goes without saying, security and privacy considerations should take precedence in e - ID card specifications. In line

Document 4:
	Document Text: 


> should be known by the designer. Since, with the purpose of verifying the identity, the device is supposed to compare the scanned biometrics with the biometric data which is read from e - ID card, assuming proper access rights such as PIN and / or certificates are provided. Nevertheless, the data structure and the fea ture notation used in e - ID card for biometrics is not published. This means, e - ID card access device producers will not be able to develop a verification system running on the device. On the other hand, if the producer opts for verification on

Document 5:
	Document Text: 


> device producers, so that access devices can verify the scanned biometrics. But a central validity check is introduced to the process steps with a minimum network and communication overhead. For central verifica tion a hash of the biometric data on e - ID card is to be transmitted to a government controlled validation server and validity result is received. This mode of operation enables the cancellation and revocation of biometrics. 5. Comparison of t he Alternative s As we have mention ed in the previous section each biometric verification algorithm has its own advantages and disa

# Filtering Results According to Metadata
Let's explore ChromaDB's filtering capabilities. We might want to filter results based on specific metadata fields associated with the documents (e.g., author, date, category).

In [None]:
# Define your query and desired document
query = "What is Target Coverage?"
document_filter = "15 UAV Route Planning For Maximum Target Coverage (1).pdf"

results = chroma_collection.query(
    query_texts=[query],
    include=["documents", "metadatas", "distances"],
    where={"document": document_filter},
    n_results=5)

show_results(results)



------- retrieved documents -------

Document 1:
	Document Text: 


> distance of the farthest target from the selected base. We test threeFRs with respect to the CDas Case 1 : FR = CD, Case 2 : FR = CD / 2, and Case 3 : FR = CD * 2. The main performance metric, Target Coverage ( TC ), is the ratio of the number of the targets visited by all the UAVs to the existing targets as formulated below : 100 * allvisited TTTC = ( 9 ) To obtain the results, we run each simulation 10 times and get the averages of these results to find the mean values. 5. 1. Results

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3038585186004639
Document 2:
	Document Text: 


> , 4 UAVs are successfully routed by the MMAS to cover all the targets while the NN prepares a routing plan for the same number of UAVs missing 4 % of the targets. Table 5. The target coverage ratios for the heuristics when FR = CD * 2. UAV TCNN TCMMAS 1 11 % 12 % 3 20 % 29 % 5 30 % 35 % 7 34 % 38 % 9 36 % 40 % 11 38 % 41 % 13 40 % 41 % 14 41 % 41 % 1 11 % 12 % 3. CONCLUSIONS In this work, we define

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.31107759475708
Document 3:
	Document Text: 


> on the maximum and minimum values of the pheromone values that can be compiled on an edge. We apply MMAS to find a route planning to cover most of the targets as explained belo w. 4. APPLYINGMMASTOTARGETCOVERAGE PROBLEM Below, we first explain the MMAS basics and then provide the algorithm to generate a solution to cover maximum number of targets. 2. 1. Selecting Next Target In MMAS, each artificial ant tries to create a route planning for all the UAVs by visiting targets considering the given problem constraints. Beginning

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3237237930297852
Document 4:
	Document Text: 


> generate more target coverage compared to the NN heuristic.

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3665157556533813
Document 5:
	Document Text: 


> ] and the Vehicle Routing Problem ( VRP ) [ 6 ]. In these well - defined problems, it is mostly assume d that travelling salesmen or vehicles should visit all the targets and the target function is defined as to find a minimum - distant route. Even, in the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depot s, etc. ) are included ; it is still assumed that there exists enough number of travelling salesmen or vehicles to cover all the given locations. However, in reality the number and flight range of UAVs might be ins

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3945647478103638


Let's modify the **retrieveDocs()** function so that it can handle filters.

In [None]:
def retrieveDocs(chroma_collection, query, n_results=5,
                 return_only_docs=False, filterType=None, filterValue=None):
    # Build a metadata filter only when both the field name and its value are given
    where = None
    if filterType is not None and filterValue is not None:
        where = {filterType: filterValue}

    results = chroma_collection.query(
        query_texts=[query],
        include=["documents", "metadatas", "distances"],
        where=where,
        n_results=n_results)

    if return_only_docs:
        return results['documents'][0]
    else:
        return results
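
The function above filters on a single field. ChromaDB's `where` argument also accepts logical operators such as `$and`, so several fields can be combined. The helper `build_where` below is a hypothetical name introduced for this sketch; it only builds the dictionary and does not call ChromaDB:

```python
def build_where(**filters):
    """Build a `where` dictionary from keyword filters, ignoring None values."""
    conditions = [{key: value} for key, value in filters.items() if value is not None]
    if not conditions:
        return None                     # no filtering at all
    if len(conditions) == 1:
        return conditions[0]            # a single equality condition
    return {"$and": conditions}         # every condition must hold

print(build_where())
print(build_where(category="Journal Paper"))
print(build_where(document="70 Biometric Verification.pdf", category="Journal Paper"))
```

The returned value could then be passed as `where=` to `chroma_collection.query(...)`.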

Let's test the revised function:

In [None]:
# Define your query and desired document
query = "What is Target Coverage?"
filterType = "document"
filterValue = "15 UAV Route Planning For Maximum Target Coverage (1).pdf"

results = retrieveDocs(chroma_collection,query,
                       filterType=filterType,
                       filterValue=filterValue,
                       n_results=5 )
show_results(results)

------- retrieved documents -------

Document 1:
	Document Text: 


> distance of the farthest target from the selected base. We test threeFRs with respect to the CDas Case 1 : FR = CD, Case 2 : FR = CD / 2, and Case 3 : FR = CD * 2. The main performance metric, Target Coverage ( TC ), is the ratio of the number of the targets visited by all the UAVs to the existing targets as formulated below : 100 * allvisited TTTC = ( 9 ) To obtain the results, we run each simulation 10 times and get the averages of these results to find the mean values. 5. 1. Results

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3038585186004639
Document 2:
	Document Text: 


> , 4 UAVs are successfully routed by the MMAS to cover all the targets while the NN prepares a routing plan for the same number of UAVs missing 4 % of the targets. Table 5. The target coverage ratios for the heuristics when FR = CD * 2. UAV TCNN TCMMAS 1 11 % 12 % 3 20 % 29 % 5 30 % 35 % 7 34 % 38 % 9 36 % 40 % 11 38 % 41 % 13 40 % 41 % 14 41 % 41 % 1 11 % 12 % 3. CONCLUSIONS In this work, we define

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.31107759475708
Document 3:
	Document Text: 


> on the maximum and minimum values of the pheromone values that can be compiled on an edge. We apply MMAS to find a route planning to cover most of the targets as explained belo w. 4. APPLYINGMMASTOTARGETCOVERAGE PROBLEM Below, we first explain the MMAS basics and then provide the algorithm to generate a solution to cover maximum number of targets. 2. 1. Selecting Next Target In MMAS, each artificial ant tries to create a route planning for all the UAVs by visiting targets considering the given problem constraints. Beginning

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3237237930297852
Document 4:
	Document Text: 


> generate more target coverage compared to the NN heuristic.

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3665157556533813
Document 5:
	Document Text: 


> ] and the Vehicle Routing Problem ( VRP ) [ 6 ]. In these well - defined problems, it is mostly assume d that travelling salesmen or vehicles should visit all the targets and the target function is defined as to find a minimum - distant route. Even, in the constraint versions of the mTSP and VRP, some other restrictions ( visiting time windows, number of depot s, etc. ) are included ; it is still assumed that there exists enough number of travelling salesmen or vehicles to cover all the given locations. However, in reality the number and flight range of UAVs might be ins

	Document Source: 15 UAV Route Planning For Maximum Target Coverage (1).pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3945647478103638


NOTE: **Filters** can leave **nothing** to fetch, or cause **unrelated** chunks to be retrieved, when the selected document does not cover the topic of the query:

In [None]:
# Define your query and desired document
query = "Explain Ant Colony Optimization in UAV route planning?"
filterType = "document"
filterValue = "70 Biometric Verification (1).pdf"

results = retrieveDocs(chroma_collection, query,
                       filterType=filterType,
                       filterValue=filterValue,
                       n_results=5)
show_results(results)

No results found.
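
One way to handle an empty filtered result is to detect it and retry without the filter. The following is only a minimal sketch under stated assumptions: `has_hits`, `query_with_fallback`, and `fake_query` are hypothetical names introduced here for illustration, and `fake_query` merely stands in for a real retrieval call such as `retrieveDocs`. The only thing taken from the notebook is the shape of the result dictionary (lists nested one level per query).

```python
def has_hits(results):
    """Return True if a Chroma-style query result contains at least one chunk."""
    ids = results.get("ids") or [[]]
    return any(len(row) > 0 for row in ids)


def query_with_fallback(run_query, **filters):
    """Run a filtered query; if it returns nothing, retry without filters.

    run_query is a hypothetical callable that accepts optional filter
    keyword arguments and returns a Chroma-style result dictionary.
    Returns a tuple: (results, filter_was_used).
    """
    results = run_query(**filters)
    if has_hits(results):
        return results, True
    return run_query(), False


# Stand-in for a real retrieval call (assumption, for illustration only)
def fake_query(filterValue=None):
    if filterValue == "70 Biometric Verification (1).pdf":
        # The filtered document has no chunk related to the query
        return {"ids": [[]], "distances": [[]], "metadatas": [[]], "documents": [[]]}
    return {"ids": [["chunk-1"]], "distances": [[1.31]],
            "metadatas": [[{}]], "documents": [["..."]]}


results, used_filter = query_with_fallback(
    fake_query, filterValue="70 Biometric Verification (1).pdf")
print(used_filter, results["ids"])  # False [['chunk-1']]
```

Whether falling back to an unfiltered search is appropriate depends on the application: if the user explicitly asked for a single document, reporting "No results found." may be the better answer.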


# Remove Less Related / Similar / Close Chunks

To eliminate unrelated chunks, we can use the **document distance** values that ChromaDB returns and filter the results with a **distance threshold**. The distance indicates how close a chunk is to the query: smaller values mean higher similarity, and the scale of the values depends on the distance metric used by the collection. Here's a function to implement such filtering:

In [None]:
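def remove_less_related_chunks(results, distance_threshold):
    """Keep only the chunks whose distance is at most distance_threshold.

    Only the results of the first query (index 0) are considered.
    """
    # Same nested layout as a ChromaDB query result: one inner list per query
    filtered_results = {
        "ids": [[]],
        "distances": [[]],
        "metadatas": [[]],
        "documents": [[]]
    }

    for i in range(len(results["distances"][0])):
        # A smaller distance means the chunk is closer to the query
        if results["distances"][0][i] <= distance_threshold:
            filtered_results["ids"][0].append(results["ids"][0][i])
            filtered_results["distances"][0].append(results["distances"][0][i])
            filtered_results["metadatas"][0].append(results["metadatas"][0][i])
            filtered_results["documents"][0].append(results["documents"][0][i])

    return filtered_results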



In [None]:
distance_threshold = 1.4
closer_results = remove_less_related_chunks(results, distance_threshold)
show_results(closer_results)

------- retreived documents -------

Document 1:
	Document Text: 


> , routes are constructed according to given UAVs and targets and do not change during the mission. However, in dynamic route planning, number of routes or UAVs can alter which requires the update of existing routes to adopt these changes. For example, some of the UAVs can be lost during the mission or new targets might pop up after the take - off. This article proposes an iterative local optimization for the distribution of new targets to the existing routes in dynamic route planning. In the proposed solution, it is supposed that all UAVs have the same flight ranges, their initial routes are planned, and

	Document Source: 16 A Local Optimization Technique for Assigning New Targets ABSTRACT.pdf
	Document Source Type: Journal Paper
	Document Distance: 1.3230979442596436
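
The threshold filter above looks only at the first query in the result (index 0). If several queries are sent at once, the same idea can be applied to each inner list. The following is a minimal sketch, not part of the original notebook: `filter_by_distance` is a hypothetical name, and the toy data is made up purely to show the behavior.

```python
def filter_by_distance(results, distance_threshold):
    """Keep chunks with distance <= distance_threshold, separately per query."""
    keys = ("ids", "distances", "metadatas", "documents")
    filtered = {key: [] for key in keys}
    for row, distances in enumerate(results["distances"]):
        # Positions of the chunks that are close enough for this query
        keep = [i for i, d in enumerate(distances) if d <= distance_threshold]
        for key in keys:
            filtered[key].append([results[key][row][i] for i in keep])
    return filtered


# Toy result for two queries (values invented for illustration)
toy_results = {
    "ids": [["a", "b", "c"], ["d", "e"]],
    "distances": [[0.9, 1.3, 1.6], [1.2, 1.5]],
    "metadatas": [[{"n": 1}, {"n": 2}, {"n": 3}], [{"n": 4}, {"n": 5}]],
    "documents": [["A", "B", "C"], ["D", "E"]],
}

closer = filter_by_distance(toy_results, 1.4)
print(closer["ids"])  # [['a', 'b'], ['d']]
```

A suitable threshold depends on the embedding model and the distance metric, so it is usually chosen by inspecting the distances returned for a few representative queries.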


# SUMMARY

In this exciting tutorial series, we are developing a Retrieval Augmented Generation (RAG) application. If you missed the first part where we covered how to code the GEMINI API for text generation and chat, be sure to check that out.

In this second part, we coded with ChromaDB for vector storage and similarity search with multiple documents.

In this tutorial, we learned:
* How Does RAG Work? – Understand the fundamentals of Retrieval Augmented Generation.
* Upload Single and Multiple PDF Documents to ChromaDB – Learn to handle various document types.
* Convert PDF Content to Text Format – Extract text from PDFs efficiently.
* Convert Text from Pages to Chunks – Organize text for better processing.
* Tokenize the Text – Prepare your text data for machine learning.
* Use the Sentence Transformers Library – Implement advanced text encoding techniques.
* Understand Vector Embedding – Grasp the concept of embedding vectors.
* Create a ChromaDB Instance for Single and Multiple Files/Documents – Set up your database for efficient storage.
* Query and Retrieve Chunks from ChromaDB – Learn to search and retrieve relevant information.
* Filtering Results – Refine your search results.
* Remove Less Related/Less Similar Chunks – Improve the accuracy of your retrieval process.

All the above steps were implemented and coded in Python on Google Colab.
