# Constructing an Index-Based Deep Lake Vector Store for Semantic Search with LlamaIndex and OpenAI

Copyright 2024, Denis Rothman

A Practical Guide to Building a Semantic Search Engine with Deep Lake, LlamaIndex, and OpenAI:

*   Installing the Environment
*   Creating and populating the Vector Store & dataset
*   Getting started with index-based semantic search




# Installing the environment

In [4]:
# Google Drive option for storing API keys
# Store your key in a file and read it (you can type it directly in the notebook, but it will be visible to anyone next to you)
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
from openai import OpenAI
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"  # placeholder: do not hard-code a real key in the notebook
os.environ["OPENAI_API_BASE"] = "https://api.ai-gaochao.cn/v1"

client = OpenAI()

In [54]:
import openai
print(openai.__version__)

1.61.1


*First, run the following cells and restart the Google Colab session if prompted. Then run the notebook again, cell by cell, to explore the code.*

In [1]:
!pip install llama-index-vector-stores-deeplake==0.1.6

Collecting llama-index-vector-stores-deeplake==0.1.6
  Downloading llama_index_vector_stores_deeplake-0.1.6-py3-none-any.whl.metadata (709 bytes)
Collecting deeplake>=3.9.12 (from llama-index-vector-stores-deeplake==0.1.6)
  Downloading deeplake-4.1.11-cp311-cp311-manylinux2014_x86_64.whl.metadata (19 kB)
Collecting llama-index-core<0.11.0,>=0.10.1 (from llama-index-vector-stores-deeplake==0.1.6)
  Downloading llama_index_core-0.10.68.post1-py3-none-any.whl.metadata (2.5 kB)
Collecting dataclasses-json (from llama-index-core<0.11.0,>=0.10.1->llama-index-vector-stores-deeplake==0.1.6)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama-index-core<0.11.0,>=0.10.1->llama-index-vector-stores-deeplake==0.1.6)
  Downloading dirtyjson-1.0.8-py3-none-any.whl.metadata (11 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.2.0 (from llama-index-core<0.11.0,>=0.10.1->llama-index-vector-stores-deeplake==0.1.6)
  Downloading tenacity-8.5.0-py

LlamaIndex supports Deep Lake vector stores through the DeepLakeVectorStore class.

In [2]:
!pip install deeplake==3.9.18

Collecting deeplake==3.9.18
  Downloading deeplake-3.9.18.tar.gz (608 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 608.9/608.9 kB 5.1 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pillow~=10.2.0 (from deeplake==3.9.18)
  Downloading pillow-10.2.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting boto3 (from deeplake==3.9.18)
  Downloading boto3-1.37.8-py3-none-any.whl.metadata (6.6 kB)
Collecting pathos (from deeplake==3.9.18)
  Downloading pathos-0.3.3-py3-none-any.whl.metadata (11 kB)
Collecting humbug>=0.3.1 (from deeplake==3.9.18)
  Downloading humbug-0.3.2-py3-none-any.whl.metadata (6.8 kB)
Collecting lz4 (from deeplake==3.9.18)
  Downloading lz4-4.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting aioboto3>=10.4.0 (from de

In [1]:
!pip install llama-index==0.10.64

Collecting llama-index==0.10.64
  Downloading llama_index-0.10.64-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.3.0,>=0.1.4 (from llama-index==0.10.64)
  Downloading llama_index_agent_openai-0.2.9-py3-none-any.whl.metadata (729 bytes)
Collecting llama-index-cli<0.2.0,>=0.1.2 (from llama-index==0.10.64)
  Downloading llama_index_cli-0.1.13-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-embeddings-openai<0.2.0,>=0.1.5 (from llama-index==0.10.64)
  Downloading llama_index_embeddings_openai-0.1.11-py3-none-any.whl.metadata (655 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.2.0 (from llama-index==0.10.64)
  Downloading llama_index_indices_managed_llama_cloud-0.6.8-py3-none-any.whl.metadata (3.6 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index==0.10.64)
  Downloading llama_index_legacy-0.9.48.post4-py3-none-any.whl.metadata (8.5 kB)
Collecting llama-index-llms-openai<0.2.0,>=0.1.27 (from llama-index==0.10.64)
  Downloadin

Next, let's import the required modules and set the required environment variables:

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /usr/local/lib/python3.11/dist-
[nltk_data]     packages/llama_index/core/_static/nltk_cache...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [3]:
!pip install sentence-transformers==3.0.1

Collecting sentence-transformers==3.0.1
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers==3.0.1)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers==3.0.1)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers==3.0.1)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers==3.0.1)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers==3.0.1)
  Downloading nvidia_cublas
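
The install logs above show that the first `pip install` resolved `deeplake` to 4.1.11 before the second cell pinned 3.9.18, so it can be worth confirming which versions are active after the session restart. The following is a minimal sketch that uses only the standard library; the `PINNED` dictionary and the helper names are illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cells of this notebook
PINNED = {
    "llama-index-vector-stores-deeplake": "0.1.6",
    "deeplake": "3.9.18",
    "llama-index": "0.10.64",
    "sentence-transformers": "3.0.1",
}

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

def check_pins(pins):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        report[package] = (expected, found, found == expected)
    return report

for package, (expected, found, ok) in check_pins(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{package}: expected {expected}, found {found} [{status}]")
```

A `MISMATCH` line suggests rerunning the corresponding install cell and restarting the session.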

In [None]:
# Retrieve the OpenAI API key from a file on Google Drive
with open("drive/MyDrive/files/api_key.txt", "r") as f:
    API_KEY = f.readline().strip()

# Set the OpenAI API key
import os
import openai
os.environ['OPENAI_API_KEY'] =API_KEY
openai.api_key = os.getenv("OPENAI_API_KEY")
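
The cell above reads the key from a file. As a hedged variation, the sketch below falls back to an environment variable when the file is absent, which may be convenient outside Google Colab. The helper name `load_secret` is hypothetical and not part of the original notebook.

```python
import os

def load_secret(path, env_var):
    """Return the first line of `path`, or the value of `env_var` if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.readline().strip()
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"No key found in {path} or in the {env_var} environment variable")
    return value

# Possible usage with the path from the cell above:
# os.environ["OPENAI_API_KEY"] = load_secret("drive/MyDrive/files/api_key.txt", "OPENAI_API_KEY")
```

The same helper could serve the Activeloop token, so that neither secret appears in the notebook itself.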

In [6]:
#Retrieving and setting the Activeloop API token
# f = open("drive/MyDrive/files/activeloop.txt", "r")
# API_token=f.readline().strip()
# f.close()
# ACTIVELOOP_TOKEN=API_token
import os
import openai
os.environ['ACTIVELOOP_TOKEN'] = '<YOUR_ACTIVELOOP_TOKEN>'  # placeholder: do not hard-code a real token in the notebook

In [7]:
# Workaround for Google Colab with Activeloop (April 2024), pending a new Activeloop version.
# The code below writes the string "nameserver 8.8.8.8" to /etc/resolv.conf, which tells the system
# to use 8.8.8.8, one of Google's Public DNS servers, for name resolution.
with open('/etc/resolv.conf', 'w') as file:
   file.write("nameserver 8.8.8.8")

# Pipeline 1: Collecting and preparing the documents

In [8]:
!mkdir data

In [9]:
import requests
from bs4 import BeautifulSoup
import re
import os

urls = [
    "https://github.com/VisDrone/VisDrone-Dataset",
    "https://paperswithcode.com/dataset/visdrone",
    "https://openaccess.thecvf.com/content_ECCVW_2018/papers/11133/Zhu_VisDrone-DET2018_The_Vision_Meets_Drone_Object_Detection_in_Image_Challenge_ECCVW_2018_paper.pdf",
    "https://github.com/VisDrone/VisDrone2018-MOT-toolkit",
    "https://en.wikipedia.org/wiki/Object_detection",
    "https://en.wikipedia.org/wiki/Computer_vision",
    "https://en.wikipedia.org/wiki/Convolutional_neural_network",
    "https://en.wikipedia.org/wiki/Unmanned_aerial_vehicle",
    "https://www.faa.gov/uas/",
    "https://www.tensorflow.org/",
    "https://pytorch.org/",
    "https://keras.io/",
    "https://arxiv.org/abs/1804.06985",
    "https://arxiv.org/abs/2202.11983",
    "https://motchallenge.net/",
    "http://www.cvlibs.net/datasets/kitti/",
    "https://www.dronedeploy.com/",
    "https://www.dji.com/",
    "https://arxiv.org/",
    "https://openaccess.thecvf.com/",
    "https://roboflow.com/",
    "https://www.kaggle.com/",
    "https://paperswithcode.com/",
    "https://github.com/"
]

In [10]:
import requests
import re
import os
from bs4 import BeautifulSoup

def clean_text(content):
    # Remove references and unwanted characters
    content = re.sub(r'\[\d+\]', '', content)   # Remove references
    content = re.sub(r'[^\w\s\.]', '', content)  # Remove punctuation (except periods)
    return content

def fetch_and_clean(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raise exception for bad responses (e.g., 404)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Prioritize the "mw-parser-output" class but fall back to the element with id "content" if not found
        content = soup.find('div', {'class': 'mw-parser-output'}) or soup.find('div', {'id': 'content'})
        if content is None:
            return None

        # Remove specific sections, including nested ones
        for section_title in ['References', 'Bibliography', 'External links', 'See also', 'Notes']:
            section = content.find('span', id=section_title)
            while section:
                for sib in section.parent.find_next_siblings():
                    sib.decompose()
                section.parent.decompose()
                section = content.find('span', id=section_title)

        # Extract and clean text
        text = content.get_text(separator=' ', strip=True)
        text = clean_text(text)
        return text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching content from {url}: {e}")
        return None  # Return None on error

# Directory to store the output files
output_dir = './data/'  # More descriptive name
os.makedirs(output_dir, exist_ok=True)

# Processing each URL (and skipping invalid ones)
for url in urls:
    article_name = url.split('/')[-1].replace('.html', '')  # Handle .html extension
    filename = os.path.join(output_dir, f"{article_name}.txt")

    clean_article_text = fetch_and_clean(url)
    if clean_article_text:  # Only write to file if content exists
        with open(filename, 'w', encoding='utf-8') as file:
            file.write(clean_article_text)

print(f"Content (where retrieval succeeded) written to files in the '{output_dir}' directory.")



Content (where retrieval succeeded) written to files in the './data/' directory.
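
Several URLs in the list end with a slash, so `url.split('/')[-1]` returns an empty string for them and their content is written to the same `.txt` file name. The sketch below shows one possible way to derive a distinct file name from the host and path instead. The helper `url_to_filename` is hypothetical and not part of the original notebook, and using it would change the set of files (and therefore the number of documents) produced by the cell above.

```python
import re
from urllib.parse import urlparse

def url_to_filename(url):
    """Build a file name from the host and path of a URL."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    raw = re.sub(r"\.html?$", "", raw)       # drop a trailing .htm/.html extension
    safe = re.sub(r"[^\w.\-]+", "_", raw)    # replace slashes and other separators
    return f"{safe or 'index'}.txt"

for url in ["https://www.faa.gov/uas/", "https://keras.io/", "https://arxiv.org/abs/1804.06985"]:
    print(repr(url.split("/")[-1]), "->", url_to_filename(url))
```

Including the host in the name makes collisions less likely than with the last path segment alone, at the cost of longer file names.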


In [11]:
# load documents
documents = SimpleDirectoryReader("./data/").load_data()

In [12]:
documents[0]

Document(id_='86399a0a-4f0f-4091-8db6-f6e1d758fd5e', embedding=None, metadata={'file_path': '/content/data/1804.06985.txt', 'file_name': '1804.06985.txt', 'file_type': 'text/plain', 'file_size': 3798, 'creation_date': '2025-03-07', 'last_modified_date': '2025-03-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='High Energy Physics  Theory arXiv1804.06985 hepth Submitted on 19 Apr 2018 Title A Near Horizon Extreme Binary Black Hole Geometry Authors Jacob Ciafre  Maria J. Rodriguez View a PDF of the paper titled A Near Horizon Extreme Binary Black Hole Geometry by Jacob Ciafre and Maria J. Rodriguez View PDF Abstract A new solution of fourdimensional vacuum General Relativity is presented. It describes the near horizon region of the extreme maximally s
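
The loaded text contains tokens such as `fourdimensional` and `arXiv1804.06985` because `clean_text` removes every character that is not a word character, whitespace, or a period, which includes hyphens and colons. The following sketch repeats the two substitutions on a made-up sample string to make the effect visible.

```python
import re

def clean_text(content):
    # Same two substitutions as in the scraping cell
    content = re.sub(r'\[\d+\]', '', content)    # remove bracketed reference numbers
    content = re.sub(r'[^\w\s\.]', '', content)  # remove punctuation except periods
    return content

# Made-up sample, not taken from the scraped pages
sample = "A four-dimensional solution [12] (arXiv:1804.06985)."
print(clean_text(sample))
```

If hyphenated terms matter for retrieval, one option is to add `\-` to the character class in the second pattern so that hyphens are kept.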

# Pipeline 2: Creating and populating a Deep Lake Vector Store

**Replace `hub://owlliming/drone` with your organization and dataset name**

In [14]:
from llama_index.core import StorageContext

vector_store_path = "hub://owlliming/drone"
dataset_path = "hub://owlliming/drone"

# overwrite=True replaces any existing dataset at this path; overwrite=False appends to it
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Create an index over the documents
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

Your Deep Lake dataset has been successfully created!




Uploading data to deeplake dataset.


100%|██████████| 87/87 [00:00<00:00, 108.02it/s]

Dataset(path='hub://owlliming/drone', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (87, 1)      str     None   
 metadata     json      (87, 1)      str     None   
 embedding  embedding  (87, 1536)  float32   None   
    id        text      (87, 1)      str     None   


 

In [15]:
import deeplake
ds = deeplake.load(dataset_path)  # Load the dataset

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/owlliming/drone

hub://owlliming/drone loaded successfully.



 

In [16]:
import json
import pandas as pd
import numpy as np

# Assuming 'ds' is your loaded Deep Lake dataset

# Create a dictionary to hold the data
data = {}

# Iterate through the tensors in the dataset
for tensor_name in ds.tensors:
    tensor_data = ds[tensor_name].numpy()

    # Check if the tensor is multi-dimensional
    if tensor_data.ndim > 1:
        # Flatten multi-dimensional tensors
        data[tensor_name] = [np.array(e).flatten().tolist() for e in tensor_data]
    else:
        # Convert 1D tensors directly to lists and decode text
        if tensor_name == "text":
            data[tensor_name] = [t.tobytes().decode('utf-8') if t else "" for t in tensor_data]
        else:
            data[tensor_name] = tensor_data.tolist()

# Create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

In [53]:
ds.tensors['id'].numpy()

array([['ca5c45a8-bc56-4012-9721-8e0b344e5a2a'],
       ['0b5099cf-3614-45b2-b01c-0df33ff5171a'],
       ['2c9ae3a1-1140-41b3-823f-3a7b5533bb91'],
       ['86705a4e-211d-483d-b853-a284e5bf4977'],
       ['94b97038-9417-43e4-ac93-f52ab4bbcb47'],
       ['300448bd-bb7f-40ea-9697-c4eb5076139b'],
       ['50e75429-ec3d-46ab-9877-d4a99af8598b'],
       ['944e9585-4484-4dfe-9e3d-6c7dd0b6f847'],
       ['9e57e17e-b31c-4714-a7d1-a0fc39734b69'],
       ['aaf72b23-175e-43e8-9623-27fc1f8f37fb'],
       ['d0df62c6-af96-43c6-a430-08bfbb5542b8'],
       ['7125de01-c6c2-499c-9beb-9a2c9319a0b1'],
       ['71963f72-1aaf-44e0-ba6b-630602d481d4'],
       ['59b138d0-43f0-49d8-a1d2-35c278d9cc5c'],
       ['ae0c0da8-2082-4349-99b2-c5d54a97f6d9'],
       ['2fcc61e2-d2ef-42a1-816d-ee7922490108'],
       ['5b84404b-0a57-4b98-9f86-f29f7102adb3'],
       ['ba8efb7d-5329-4049-a2f2-8c01fb60d5b1'],
       ['d779d0a7-6ebb-4cf4-aa08-2710d7a8595d'],
       ['fe4f1530-c778-41c7-9226-c04046eabfd8'],
       ['59966303-26

In [17]:
# Function to display a selected record
def display_record(record_number):
    record = df.iloc[record_number]
    display_data = {
        "ID": record.get("id", "N/A"),
        "Metadata": record.get("metadata", "N/A"),
        "Text": record.get("text", "N/A"),
        "Embedding": record.get("embedding", "N/A")
    }

    # Print the ID
    print("ID:")
    print(display_data["ID"])
    print()

    # Print the metadata in a structured format
    print("Metadata:")
    metadata = display_data["Metadata"]
    if isinstance(metadata, list):
        for item in metadata:
            for key, value in item.items():
                print(f"{key}: {value}")
            print()
    else:
        print(metadata)
    print()

    # Print the text
    print("Text:")
    print(display_data["Text"])
    print()

    # Print the embedding
    print("Embedding:")
    print(display_data["Embedding"])
    print()

# Function call to display a record
rec = 0  # Replace with the desired record number
display_record(rec)

ID:
['ca5c45a8-bc56-4012-9721-8e0b344e5a2a']

Metadata:
file_path: /content/data/1804.06985.txt
file_name: 1804.06985.txt
file_type: text/plain
file_size: 3798
creation_date: 2025-03-07
last_modified_date: 2025-03-07
_node_content: {"id_": "ca5c45a8-bc56-4012-9721-8e0b344e5a2a", "embedding": null, "metadata": {"file_path": "/content/data/1804.06985.txt", "file_name": "1804.06985.txt", "file_type": "text/plain", "file_size": 3798, "creation_date": "2025-03-07", "last_modified_date": "2025-03-07"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "86399a0a-4f0f-4091-8db6-f6e1d758fd5e", "node_type": "4", "metadata": {"file_path": "/content/data/1804.06985.txt", "file_name": "1804.06985.txt", "file_type": "text/plain", "file_size": 3798, "cre

# Original documents

In [18]:
# Ensure 'text' column is of type string
df['text'] = df['text'].astype(str)
# Create documents with IDs
documents = [Document(text=row['text'], doc_id=str(row['id'])) for _, row in df.iterrows()]

In [55]:
documents[0]

Document(id_="['ca5c45a8-bc56-4012-9721-8e0b344e5a2a']", embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="['High Energy Physics  Theory arXiv1804.06985 hepth Submitted on 19 Apr 2018 Title A Near Horizon Extreme Binary Black Hole Geometry Authors Jacob Ciafre  Maria J. Rodriguez View a PDF of the paper titled A Near Horizon Extreme Binary Black Hole Geometry by Jacob Ciafre and Maria J. Rodriguez View PDF Abstract A new solution of fourdimensional vacuum General Relativity is presented. It describes the near horizon region of the extreme maximally spinning binary black hole system with two identical extreme Kerr black holes held in equilibrium by a massless strut. This is the first example of a nonsupersymmetric asymptotically flat near horizon extreme binary black hole geometry of two uncharged black holes. The black holes are corotating and the solution is uniquely specified by the mass. The binary extreme system has
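
The output above shows the ID and the text wrapped in list notation (`"['ca5c45a8-...']"`). This happens because each DataFrame cell holds a one-element list, and `str()` of a list includes the brackets and quotes. The sketch below shows one possible way to unwrap such cells first. The helper `unwrap` is hypothetical and not part of the original notebook, and the sample row is abbreviated for illustration.

```python
def unwrap(value):
    """Return the single element of a one-item list or tuple as a string; otherwise str(value)."""
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return str(value[0])
    return str(value)

# Abbreviated sample row mimicking the DataFrame cells
row = {"id": ["ca5c45a8-bc56-4012-9721-8e0b344e5a2a"], "text": ["High Energy Physics  Theory"]}

print(str(row["id"]))     # list notation, as in the output above
print(unwrap(row["id"]))  # the bare ID
```

Under that assumption, the documents could be built with `Document(text=unwrap(row['text']), doc_id=unwrap(row['id']))`, applied before the `astype(str)` conversion so that the cells are still lists.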

# Pipeline 3: Index-based RAG

## User input and RAG parameters

In [19]:
user_input="How do drones identify vehicles?"

#similarity_top_k
k=3
#temperature
temp=0.1
#num_output
mt=1024

## Cosine similarity metric

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [57]:
import sklearn
print(sklearn.__version__)

1.6.1


In [21]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_cosine_similarity_with_embeddings(text1, text2):
    embeddings1 = model.encode(text1)
    embeddings2 = model.encode(text2)
    similarity = cosine_similarity([embeddings1], [embeddings2])
    return similarity[0][0]

  from tqdm.autonotebook import tqdm, trange
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
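
For reference, the cosine similarity that `calculate_cosine_similarity_with_embeddings` computes on the two embeddings is the dot product of the vectors divided by the product of their norms. The following is a minimal pure-Python sketch on toy vectors. It illustrates the formula only and does not replace the scikit-learn call, and the function name is illustrative.

```python
import math

def cosine_similarity_1d(a, b):
    """Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

print(cosine_similarity_1d([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity_1d([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity_1d([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Values close to 1 indicate that the two texts point in nearly the same direction in embedding space, which is how the scores in the following sections can be read.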



# Vector store index query engine

In [22]:
from llama_index.core import VectorStoreIndex
vector_store_index = VectorStoreIndex.from_documents(documents)

In [23]:
print(type(vector_store_index))

<class 'llama_index.core.indices.vector_store.base.VectorStoreIndex'>


In [24]:
vector_query_engine = vector_store_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [25]:
print(type(vector_query_engine))

<class 'llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine'>


## Query response and source

In [26]:
import pandas as pd
import textwrap

def index_query(input_query):
    response = vector_query_engine.query(input_query)

    # Optional: Print a formatted view of the response (remove if you don't need it in the output)
    print(textwrap.fill(str(response), 100))

    node_data = []
    for node_with_score in response.source_nodes:
        node = node_with_score.node
        node_info = {
            'Node ID': node.id_,
            'Score': node_with_score.score,
            'Text': node.text
        }
        node_data.append(node_info)

    df = pd.DataFrame(node_data)

    # Instead of printing, return the DataFrame and the response object
    return df, response


In [27]:
import time
#start the timer
start_time = time.time()
df, response = index_query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(df.to_markdown(index=False, numalign="left", stralign="left"))  # Display the DataFrame using markdown

Drones can identify vehicles using deep learning-based machine learning algorithms for automatic
tracking and detection. Additionally, reidentification methods allow for the automatic
identification of vehicles across different cameras with varying viewpoints and hardware
specifications.
Query execution time: 2.4004 seconds
| Node ID                              | Score    | Text

## Node information and relationships

In [28]:
nodeid=response.source_nodes[0].node_id
nodeid

'b8c9457e-ae14-427d-bbdb-dd406c695a36'

In [29]:
response.source_nodes[0].get_text()

"['Tice Brian P. Spring 1991. Unmanned Aerial Vehicles  The Force Multiplier of the 1990s . Airpower Journal . Archived from the original on 24 July 2009 . Retrieved 6 June 2013 . When used UAVs should generally perform missions characterized by the three Ds dull dirty and dangerous.  a b Alvarado Ed 3 May 2021. 237 Ways Drone Applications Revolutionize Business . Drone Industry Insights . Archived from the original on 11 May 2021 . Retrieved 11 May 2021 .  F.  RekabiBana Hu J. T. Krajník Arvin F.  Unified Robust Path Planning and Optimal Trajectory Generation for Efficient 3D Area Coverage of Quadrotor UAVs  IEEE Transactions on Intelligent Transportation Systems 2023.  a b Hu J. Niu H. Carrasco J. Lennox B. Arvin F.  Faulttolerant cooperative navigation of networked UAV swarms for forest fire monitoring  Aerospace Science and Technology 2022.  a b Remote sensing of the environment using unmanned aerial systems UAS . S.l. ELSEVIER  HEALTH SCIENCE. 2023. ISBN 9780323852838 . OCLC 13294

## Optimized chunking

In [30]:
# Assuming you have the 'response' object from query_engine.query()

for node_with_score in response.source_nodes:
    node = node_with_score.node  # Extract the Node object from NodeWithScore
    chunk_size = len(node.text)
    print(f"Node ID: {node.id_}, Chunk Size: {chunk_size} characters")

Node ID: b8c9457e-ae14-427d-bbdb-dd406c695a36, Chunk Size: 3226 characters
Node ID: 6acd05b8-df6c-4ef8-9caa-d435d739a1dc, Chunk Size: 3295 characters
Node ID: b5c3b2f1-9a13-495e-a568-154befe9d89d, Chunk Size: 4689 characters


## Performance metric

In [31]:
import numpy as np

def info_metrics(response):
    # Uses the global elapsed_time measured around the query
    # Collect the retrieval scores, skipping nodes whose score is None
    scores = [node.score for node in response.source_nodes if node.score is not None]
    if scores:  # Check if there are any valid scores
        # Softmax weights give more weight to the higher-scoring nodes
        weights = np.exp(scores) / np.sum(np.exp(scores))
        average_score = np.average(scores, weights=weights)
        perf = average_score / elapsed_time
    else:
        # Default values if all scores are None
        average_score = 0
        perf = 0

    print(f"Average score: {average_score:.4f}")
    print(f"Query execution time: {elapsed_time:.4f} seconds")
    print(f"Performance metric: {perf:.4f}")

In [32]:
info_metrics(response)

Average score: 0.8287
Query execution time: 2.4004 seconds
Performance metric: 0.3452
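
The performance metric above is the softmax-weighted average of the node scores divided by the query time. The sketch below reproduces that arithmetic with the standard library only. The three scores are hypothetical, since the individual node scores are not visible in the output above, and the function names are illustrative.

```python
import math

def softmax_weighted_average(scores):
    """Average of the scores, weighted by exp(score) / sum(exp(scores))."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(scores, exps))

def performance_metric(scores, elapsed_seconds):
    """Weighted average score per second; 0.0 when there is nothing to measure."""
    if not scores or elapsed_seconds <= 0:
        return 0.0
    return softmax_weighted_average(scores) / elapsed_seconds

# Hypothetical node scores, for illustration only
scores = [0.84, 0.83, 0.82]
print(f"Weighted average: {softmax_weighted_average(scores):.4f}")
print(f"Metric at 2.4004 s: {performance_metric(scores, 2.4004):.4f}")

# Check against the reported figures: 0.8287 / 2.4004
print(round(0.8287 / 2.4004, 4))
```

Because the weights grow with the score, the weighted average is never lower than the plain mean of the same scores.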


# Tree index query engine

In [33]:
from llama_index.core import TreeIndex
tree_index = TreeIndex.from_documents(documents)

In [34]:
print(type(tree_index))

<class 'llama_index.core.indices.tree.base.TreeIndex'>


In [35]:
tree_query_engine = tree_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [36]:
import time
import textwrap
# Start the timer
start_time = time.time()
response = tree_query_engine.query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

Query execution time: 3.4512 seconds
Drones identify vehicles using computer vision technology related to object detection. They can
detect vehicles by analyzing digital images and videos captured by their cameras. This process
involves recognizing instances of semantic objects belonging to a specific class, such as cars,
within the images or video footage.
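
The start/stop timing pattern is repeated for each query engine in this notebook. As a hedged alternative, a small helper can wrap any call and return both the result and the elapsed time. The name `timed` is hypothetical, and `time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in call for demonstration; in the notebook this could be
# response, elapsed_time = timed(tree_query_engine.query, user_input)
result, elapsed = timed(sum, [1, 2, 3])
print(f"Result: {result}, execution time: {elapsed:.6f} seconds")
```

A single measurement includes network latency to the LLM, so repeating the query a few times gives a steadier basis for comparing engines.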


## Performance metric

In [37]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.815
Query execution time: 3.4512 seconds
Performance metric: 0.2361


# List index query engine

In [38]:
from llama_index.core import ListIndex
list_index = ListIndex.from_documents(documents)

In [39]:
print(type(list_index))

<class 'llama_index.core.indices.list.base.SummaryIndex'>


In [40]:
list_query_engine = list_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [41]:
#start the timer
start_time = time.time()
response = list_query_engine.query(user_input)
# Stop the timer
end_time = time.time()
# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

Query execution time: 17.1937 seconds
Drones can identify vehicles using computer vision techniques, such as convolutional neural networks
(CNNs). These networks are trained on large datasets of vehicle images to learn the features that
distinguish vehicles from other objects. By processing the visual data captured by the drone's
camera through these trained CNN models, drones can accurately detect and classify vehicles in real-
time.


## Performance metric

In [42]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.764
Query execution time: 17.1937 seconds
Performance metric: 0.0444


# Keyword index query engine

In [43]:
from llama_index.core import KeywordTableIndex
keyword_index = KeywordTableIndex.from_documents(documents)

In [44]:
# Extract data for DataFrame
data = []
for keyword, doc_ids in keyword_index.index_struct.table.items():
    for doc_id in doc_ids:
        data.append({"Keyword": keyword, "Document ID": doc_id})

# Create the DataFrame
df = pd.DataFrame(data)
df

|      | Keyword                     | Document ID                          |
|------|-----------------------------|--------------------------------------|
| 0    | flat                        | 2b6f1071-2ce9-459e-a452-c6c0b92a1390 |
| 1    | nonsupersymmetric           | 2b6f1071-2ce9-459e-a452-c6c0b92a1390 |
| 2    | geometry                    | 1c8184cf-717e-4909-bdea-4a7552ed7ff2 |
| 3    | geometry                    | 2b6f1071-2ce9-459e-a452-c6c0b92a1390 |
| 4    | geometry                    | fea52cc5-ae62-4af5-a1d2-9f69be94d4d6 |
| ...  | ...                         | ...                                  |
| 4221 | direct                      | c8fb3904-717d-4b3c-abe3-e2d7868bb3fe |
| 4222 | science direct              | c8fb3904-717d-4b3c-abe3-e2d7868bb3fe |
| 4223 | stealth technology          | c8fb3904-717d-4b3c-abe3-e2d7868bb3fe |
| 4224 | authority control databases | c8fb3904-717d-4b3c-abe3-e2d7868bb3fe |
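
The table above maps each keyword to the IDs of the documents that contain it. The toy sketch below illustrates that keyword-to-document mapping and a simple lookup in plain Python. It is not how LlamaIndex extracts keywords or answers queries: the whitespace tokenizer, the stopword list, the sample documents, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict

STOPWORDS = frozenset({"the", "a", "of", "and", "in", "to", "is", "how", "do"})

def tokenize(text):
    """Lowercase, split on whitespace, strip basic punctuation, drop stopwords."""
    words = (w.strip(".,;:!?") for w in text.lower().split())
    return [w for w in words if w and w not in STOPWORDS]

def build_keyword_table(docs):
    """Map each keyword to the set of document IDs that contain it."""
    table = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            table[word].add(doc_id)
    return table

def lookup(table, query):
    """Return document IDs ordered by the number of query keywords they match."""
    hits = defaultdict(int)
    for word in tokenize(query):
        for doc_id in table.get(word, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))

# Made-up documents for illustration
docs = {
    "d1": "Drones detect vehicles",
    "d2": "Black hole geometry",
    "d3": "Vehicles and drones in images",
}
table = build_keyword_table(docs)
print(lookup(table, "How do drones identify vehicles?"))
```

Documents that share no keyword with the query, such as `d2` here, are never returned, which is the main contrast with embedding-based retrieval.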


In [45]:
keyword_query_engine = keyword_index.as_query_engine(similarity_top_k=k, temperature=temp, num_output=mt)

In [46]:
import time

# Start the timer
start_time = time.time()

# Execute the query (using .query() method)
response = keyword_query_engine.query(user_input)

# Stop the timer
end_time = time.time()

# Calculate and print the execution time
elapsed_time = end_time - start_time
print(f"Query execution time: {elapsed_time:.4f} seconds")

print(textwrap.fill(str(response), 100))

Query execution time: 2.4031 seconds
Drones can identify vehicles through various methods such as visual recognition, object detection,
and tracking using computer vision technology. These systems analyze digital images captured by the
drones to determine the presence of specific vehicles, their positions in the image, and even their
3D poses in the scene. Detection algorithms based on convolutional neural networks are commonly used
for this purpose, allowing drones to scan for vehicles in their field of view and identify them
accurately.


## Performance metric

In [47]:
similarity_score = calculate_cosine_similarity_with_embeddings(user_input, str(response))
print(f"Cosine Similarity Score: {similarity_score:.3f}")
print(f"Query execution time: {elapsed_time:.4f} seconds")
performance=similarity_score/elapsed_time
print(f"Performance metric: {performance:.4f}")

Cosine Similarity Score: 0.717
Query execution time: 2.4031 seconds
Performance metric: 0.2982
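
To compare the four query engines side by side, the figures reported in the outputs above can be collected and ranked by score per second. This is a minimal sketch: the numbers are copied from this single run, and the `results` structure and function name are illustrative. The vector store figure is a softmax-weighted retrieval score, whereas the other three are cosine similarities between the query and the response, so the ranking is indicative only.

```python
# (score, seconds) as reported in the outputs of this notebook run
results = {
    "vector": (0.8287, 2.4004),
    "tree": (0.815, 3.4512),
    "list": (0.764, 17.1937),
    "keyword": (0.717, 2.4031),
}

def rank_by_metric(results):
    """Return (name, score / seconds) pairs sorted from best to worst."""
    rows = [(name, score / seconds) for name, (score, seconds) in results.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)

for name, metric in rank_by_metric(results):
    score, seconds = results[name]
    print(f"{name:<8} score={score:.4f} time={seconds:.4f}s metric={metric:.4f}")
```

Timings from a single query depend on network and API latency, so the order may differ between runs.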


In [56]:
!pip freeze > requirements.txt


In [58]:
import yaml  # the PyYAML package is imported under the module name "yaml"

In [59]:
vector_store = DeepLakeVectorStore(dataset_path='hub://owlliming/test', overwrite=True)


Your Deep Lake dataset has been successfully created!


