The file named by source_text = "llm.txt" will be read by the function that
adds the data to our vector store. We then briefly inspect the document as a
sanity check.


We will now chunk the data. The chunk size is defined as a number of
characters. In this case, it is CHUNK_SIZE = 1000, but other strategies for
choosing chunk sizes are possible.
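
One alternative strategy is to let consecutive chunks overlap, so that a sentence cut at a chunk boundary still appears whole in one of the two chunks. The sketch below is only an illustration under that assumption; the function name chunk_with_overlap and the overlap parameter are hypothetical and are not part of this notebook.

```python
def chunk_with_overlap(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between the starts of two chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Toy example: 30 characters, chunks of 10 with 2 characters of overlap
sample = "abcdefghij" * 3
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=2)
print([len(c) for c in chunks])
```

With overlap=0 this reduces to the plain fixed-size chunking used in this notebook.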


Chunking is necessary to optimize data processing: selecting portions of
text, embedding, and loading the data. It also makes the embedded dataset
easier to query. The following code chunks a document to complete the
preparation process:


In [2]:
source_text = "llm.txt"
with open(source_text, 'r') as f:
    text = f.read()

CHUNK_SIZE = 1000
chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0,len(text), CHUNK_SIZE)]

In [3]:
print(chunked_text[0])

Exploration of space, planets, and moons "Space Exploration" redirects here. For the company, see SpaceX . For broader coverage of this topic, see Exploration . Buzz Aldrin taking a core sample of the Moon during the Apollo 11 mission Self-portrait of Curiosity rover on Mars 's surface Part of a series on Spaceflight History History of spaceflight Space Race Timeline of spaceflight Space probes Lunar missions Mars missions Applications Communications Earth observation Exploration Espionage Military Navigation Colonization Habitation Exploration Telescopes Tourism Spacecraft Robotic spacecraft Satellite Space probe Cargo spacecraft Crewed spacecraft Apollo Lunar Module Space capsules Space Shuttle Space stations Spaceplanes Vostok Space launch Spaceport Launch pad Expendable and reusable launch vehicles Escape velocity Non-rocket spacelaunch Spaceflight types Sub-orbital Orbital Interplanetary Interstellar Intergalactic List of space organizations Space agencies Space forces Companies S

### Verifying if the vector store exists and creating it if not

First, we define the path of our Activeloop vector store, whether or not the
dataset already exists:

In [11]:
vector_store_path = "./space_exploration_v1"

In [None]:
# %pip install deeplake==3.9.18


In [12]:
# Create a vector store with default tensors
from deeplake.core.vectorstore.deeplake_vectorstore import VectorStore

# data = VectorStore(
#       "./my_vector_store"
# )

# Default value, so the flag is defined even when loading succeeds
create_vector_store = False

try:
    # Attempt to load the vector store
    vector_store = VectorStore(path=vector_store_path)
    print("Vector store exists")
except FileNotFoundError:
    print("Vector store does not exist. You can create it.")
    # Code to create the vector store goes here
    create_vector_store = True




Deep Lake Dataset in ./space_exploration_v1 already exists, loading from the storage
Vector store exists
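
As a complement to the try/except check, the existence of a local store can also be tested on the file system before any Deep Lake call is made. This is a minimal sketch that assumes a local dataset is stored as a non-empty directory at the given path. The helper vector_store_exists is hypothetical, and it does not verify that the directory actually contains a valid Deep Lake dataset.

```python
from pathlib import Path


def vector_store_exists(path):
    """Return True if `path` is an existing, non-empty directory.

    Assumption: a local vector store lives in a directory at `path`.
    The contents of the directory are not validated.
    """
    p = Path(path)
    return p.is_dir() and any(p.iterdir())


create_vector_store = not vector_store_exists("./space_exploration_v1")
print("Create vector store:", create_vector_store)
```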


### Embedding Function

The embedding function will transform the chunks of data we created into
vectors to enable vector-based search. In this program, we will use "text-embedding-3-small" to embed the documents.


In [9]:
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Initialize OpenAI client
client = OpenAI()  # Reads the API key from the OPENAI_API_KEY environment variable

def embedding_function(texts, model="text-embedding-3-small"):
    """Generates embeddings for input texts using OpenAI API."""
    
    if isinstance(texts, str):
        texts = [texts]  # Convert single string to list
    
    texts = [t.replace("\n", " ") for t in texts]  # Replace newlines with spaces
    
    try:
        response = client.embeddings.create(
            model=model,
            input=texts,
            encoding_format="float"  # Ensure correct format
        )
        
        embeddings = [data.embedding for data in response.data]  # One embedding per input text
        return embeddings

    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return None

# Example Usage
texts = ["Hello, world!", "How are you?"]
embeddings = embedding_function(texts)
print(embeddings)  # Prints list of embeddings


[[-0.019143932, -0.025292054, -0.0017211713, 0.018834507, -0.033821393, -0.01968206, -0.021027382, 0.05160655, -0.0321801, -0.030431183, -0.0021508336, -0.028924422, -0.002487164, -0.031480536, 0.010291713, 0.018565442, -0.046144545, 0.041409012, 0.00043050305, 0.041166853, 0.05365144, 0.0018481361, 0.004564005, 0.009955383, 0.04781274, 0.0021642868, -0.0098477565, 0.038422395, 0.0009131373, -0.05209087, 0.051122233, -0.032529887, -0.014085521, -0.012605667, 0.013271601, 0.018565442, 0.0016320437, -0.0008479733, -0.012773832, -0.029677803, -0.0045101917, -0.015309764, 0.025668744, 0.00904729, -0.036834914, 0.020287456, -0.040709443, -0.0026771908, 0.03554341, 0.048485406, -0.033659957, -0.0024417595, 0.017273935, 0.0760376, 0.00095433777, -0.042700518, 0.00837463, 0.075983785, -0.047274616, 0.015107966, 0.014260413, 0.024753924, 0.010163908, -0.0010005832, 0.013729011, -0.0100428285, -0.020691052, -0.0014664009, -0.011704301, 0.049373318, 0.0011166172, 0.03785736, -0.01938609, 0.005825
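
To make "vector-based search" concrete, the sketch below ranks a few toy vectors against a query vector by cosine similarity. It is an illustration only: the three-dimensional vectors stand in for real 1536-dimensional embeddings, the helper names are hypothetical, and cosine similarity is assumed here as the metric rather than taken from the vector store's configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Toy vectors standing in for real embeddings
toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(toy_query, toy_docs)
print([i for i, _ in ranking])  # [0, 2, 1]
```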



### Adding data to the vector store

We set the add_to_vector_store flag to True, then read and chunk the source file:

In [13]:
add_to_vector_store=True
source_text = "llm.txt"

if add_to_vector_store:
    with open(source_text, 'r') as f:
        text = f.read()
    # Chunk after the file is closed, and reuse CHUNK_SIZE for the slice width
    CHUNK_SIZE = 1000
    chunked_text = [text[i:i+CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]


In [14]:
vector_store.add(text = chunked_text,
              embedding_function = embedding_function,
              embedding_data = chunked_text,
              metadata = [{"source": source_text}]*len(chunked_text))

Creating 1651 embeddings in 4 batches of size 500:: 100%|██████████| 4/4 [00:49<00:00, 12.43s/it]

Dataset(path='./space_exploration_v1', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
 embedding  embedding  (1651, 1536)  float32   None   
    id        text      (1651, 1)      str     None   
 metadata     json      (1651, 1)      str     None   
   text       text      (1651, 1)      str     None   
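
The metadata list passed to the vector store can also carry per-chunk information. The sketch below is a hypothetical variation, not the notebook's method: build_metadata and the chunk_index key are illustrative names. It also shows that multiplying a one-element list repeats references to the same dictionary, which matters only if the metadata is modified in place later.

```python
def build_metadata(chunks, source):
    """Build one independent metadata dict per chunk."""
    return [{"source": source, "chunk_index": i} for i, _ in enumerate(chunks)]


toy_chunks = ["first chunk", "second chunk", "third chunk"]
metadata = build_metadata(toy_chunks, "llm.txt")

# List multiplication repeats a reference to one and the same dict
shared = [{"source": "llm.txt"}] * len(toy_chunks)

print(shared[0] is shared[1])      # True: same object
print(metadata[0] is metadata[1])  # False: independent objects
print(metadata[2])                 # {'source': 'llm.txt', 'chunk_index': 2}
```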





In [15]:
# summary() prints the table itself and returns None, hence the trailing "None" below
print(vector_store.summary())

Dataset(path='./space_exploration_v1', tensors=['embedding', 'id', 'metadata', 'text'])

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
 embedding  embedding  (1651, 1536)  float32   None   
    id        text      (1651, 1)      str     None   
 metadata     json      (1651, 1)      str     None   
   text       text      (1651, 1)      str     None   
None
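
The shape and dtype in the summary are enough to estimate the raw size of the embedding tensor: 1651 vectors of 1536 float32 values at 4 bytes each. The sketch below only performs that arithmetic; dense_tensor_bytes is a hypothetical helper, and the result ignores compression, indexing, and any storage overhead.

```python
import math


def dense_tensor_bytes(shape, bytes_per_value):
    """Raw size in bytes of a dense tensor with the given shape."""
    return math.prod(shape) * bytes_per_value


# Shape taken from the summary above; float32 uses 4 bytes per value
embedding_bytes = dense_tensor_bytes((1651, 1536), 4)
print(embedding_bytes)  # 10143744
print(f"{embedding_bytes / 1048576:.2f} MB")
```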


In [17]:
import deeplake
ds = deeplake.load(vector_store_path)

./space_exploration_v1 loaded successfully.





In [18]:
# Estimate the size of the dataset in bytes
ds_size = ds.size_approx()
     

In [19]:
# Convert bytes to megabytes and limit to 5 decimal places
ds_size_mb = ds_size / 1048576
print(f"Dataset size in megabytes: {ds_size_mb:.5f} MB")

# Convert bytes to gigabytes and limit to 5 decimal places
ds_size_gb = ds_size / 1073741824
print(f"Dataset size in gigabytes: {ds_size_gb:.5f} GB")

Dataset size in megabytes: 55.31311 MB
Dataset size in gigabytes: 0.05402 GB
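
The two conversions above can be folded into one helper that picks the unit automatically. This is a minimal sketch with a hypothetical function name, format_size. It uses the same 1024-based factors as the cell above and labels them with binary unit names (KiB, MiB, GiB).

```python
def format_size(num_bytes):
    """Format a byte count using 1024-based units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value is below 1024,
        # or at the largest unit available
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


print(format_size(1048576))  # 1.00 MiB
print(format_size(1536))     # 1.50 KiB
```

In the notebook, it could be called as format_size(ds_size) on the value returned by ds.size_approx().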
