#Embedding-Based Retrieval with Activeloop and OpenAI

Copyright 2024 Denis Rothman

This second component of the RAG pipeline transforms the prepared data by the first component of the pipeline into embeddings and stores the vectors obtained in the vector store.

# Installing the environment

*First run the following cells and restart Google Colab session if prompted. Then run the notebook again cell by cell to explore the code.*

In [70]:
import deeplake

Mount a drive or implement the method that best fits your project to retrieve API tokens.

In [71]:
#The OpenAI Key
import os
from dotenv import load_dotenv
import openai

# Load API Key
dotenv_path = 'D:/AdvancedR/knowbankedu/openai/.env'
load_dotenv(dotenv_path)
# OpenAI API Key
openai.api_key = os.getenv("OPENAI_API_KEY")
ACTIVELOOP_TOKEN = os.getenv('ACTIVELOOP_TOKEN')

# Embedding and Storage: populating the vector store

## Downloading and preparing the data

In [72]:
# Define the path to the local file
file_path = r"D:\RAG_Rothman\Chapter02\llm_with_metadata.txt"

# Read the content of the file
try:
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.readlines()
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    content = []

# Display the first 20 lines of the file for verification
if content:
    print("First 20 lines of the file:")
    for line in content[:20]:
        print(line.strip())
else:
    print("The file is empty or could not be read.")

source_text = "D:/RAG_Rothman/Chapter02/llm_with_metadata.txt"

First 20 lines of the file:
Source: https://en.wikipedia.org/wiki/Exploration_of_Mars

Not to be confused with Human mission to Mars or Colonization of Mars . Self-portrait of Perseverance rover and Ingenuity helicopter (to the left) located at Wright Brothers Field, the Ingenuity helicopter drop site (7 April 2021) The planet Mars has been explored remotely by spacecraft. Probes sent from Earth, beginning in the late 20th century, have yielded a large increase in knowledge about the Martian system, focused primarily on understanding its geology and habitability potential. [ 1 ] [ 2 ] Engineering interplanetary journeys is complicated and the exploration of Mars has experienced a high failure rate, especially the early attempts. Roughly sixty percent of all spacecraft destined for Mars failed before completing their missions, with some failing before their observations could begin. Some missions have been met with unexpected success, such as the twin Mars Exploration Rovers , Spirit an

# Verify if vector store exists in Deeplake or create it

Here we define an embedding function, then create the dataset in deeplake with needed tensors and populate with embeddings.

We set the chunk size at 1000 in this example. Can set the chunk size depending on token limits (see ChatGPT help)

In [74]:
import deeplake
from openai import OpenAI
from dotenv import load_dotenv

# Load API key from .env
load_dotenv()
client = OpenAI()

# Path to the dataset in Deep Lake
vector_store_path = "hub://zagamog/space_exploration_v1"

# Embedding function using OpenAI client
def embedding_function(texts, model="text-embedding-3-small"):
    if isinstance(texts, str):
        texts = [texts]
    texts = [t.replace("\n", " ") for t in texts]  # Replace newlines with spaces
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Create and populate dataset
def create_and_populate_dataset(path, source_text_path, chunk_size=1000):
    print(f"Creating dataset at {path}...")
    dataset = deeplake.empty(path, overwrite=True)
    dataset.create_tensor("embedding_tensor", htype="embedding")
    dataset.create_tensor("text", htype="text")
    dataset.create_tensor("metadata", htype="json")
    dataset.create_tensor("id", htype="text")
    dataset.commit("Initialized dataset.")

    # Read and process the source text with metadata
    print(f"Reading and processing source text from {source_text_path}...")
    with open(source_text_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split the content into articles based on "Source: <URL>"
    articles = content.split("\n\n")
    chunks, metadata = [], []

    for article in articles:
        if article.strip():  # Skip empty sections
            lines = article.splitlines()
            source_url = lines[0].replace("Source: ", "").strip()
            article_text = " ".join(lines[1:]).strip()

            # Chunk the article text
            article_chunks = [
                article_text[i:i + chunk_size] for i in range(0, len(article_text), chunk_size)
            ]
            chunks.extend(article_chunks)
            metadata.extend([{"source": source_url}] * len(article_chunks))

    print(f"Generating embeddings for {len(chunks)} chunks...")
    embeddings = embedding_function(chunks)
    ids = [f"chunk_{i}" for i in range(len(chunks))]

    # Populate the dataset
    try:
        dataset["embedding_tensor"].extend(embeddings)
        dataset["text"].extend(chunks)
        dataset["metadata"].extend(metadata)
        dataset["id"].extend(ids)
        dataset.commit("Populated dataset with embeddings, text, and metadata.")
        print("Dataset creation complete.")
    except Exception as e:
        print(f"Error occurred while populating the dataset: {e}")
        raise

# Corrected source text file path
source_text_path = "D:/RAG_Rothman/Chapter02/llm_with_metadata.txt"

# Run
try:
    dataset = create_and_populate_dataset(vector_store_path, source_text_path)
    print("Dataset created and populated successfully.")
except Exception as e:
    print(f"An error occurred: {e}")



Creating dataset at hub://zagamog/space_exploration_v1...
Your Deep Lake dataset has been successfully created!


[K[?25h

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/zagamog/space_exploration_v1
hub://zagamog/space_exploration_v1 loaded successfully.


[K[?25h

Reading and processing source text from D:/RAG_Rothman/Chapter02/llm_with_metadata.txt...
Generating embeddings for 1671 chunks...


[K[?25h

Dataset creation complete.
Dataset created and populated successfully.


Visualize

Online:
https://app.activeloop.ai/datasets/mydatasets/

In [75]:
# Print the summary of the Vector Store
print(vector_store.summary())

Dataset(path='hub://zagamog/space_exploration_v1', tensors=['embedding_tensor', 'id'])

      tensor         htype     shape    dtype  compression
     -------        -------   -------  -------  ------- 
 embedding_tensor  embedding   (0,)    float32   None   
        id           text      (0,)      str     None   
None


In [76]:
ds = deeplake.load(vector_store_path)

[K-5l

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/zagamog/space_exploration_v1



[K|

hub://zagamog/space_exploration_v1 loaded successfully.



[K[?25h

Dataset size

In [77]:
#Estimates the size in bytes of the dataset.
ds_size=ds.size_approx()

In [65]:
# Convert bytes to megabytes and limit to 5 decimal places
ds_size_mb = ds_size / 1048576
print(f"Dataset size in megabytes: {ds_size_mb:.5f} MB")

# Convert bytes to gigabytes and limit to 5 decimal places
ds_size_gb = ds_size / 1073741824
print(f"Dataset size in gigabytes: {ds_size_gb:.5f} GB")

Dataset size in megabytes: 70.57190 MB
Dataset size in gigabytes: 0.06892 GB
