# Chunking a Document for Semantic Search in SurrealDB

    This notebook demonstrates the process of extracting chunks from a document and uploading the chunks to a table with embeddings calculated at upload.

### Chunking: 
    A simple function reads the file and splits it into chunks of a specified size. The default is 250 words, which translates to a reasonable number of tokens.
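
    As a rough illustration of the word-based splitting described above, the sketch below chunks an in-memory string instead of a file. The helper name `chunk_words` and the generated sample text are made up for this example. The number of chunks is the word count divided by the chunk size, rounded up.

```python
import math


def chunk_words(text, chunk_size=250):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Hypothetical sample: 600 placeholder words.
sample = " ".join(f"word{i}" for i in range(600))
demo_chunks = chunk_words(sample)

# 600 words at 250 words per chunk gives ceil(600 / 250) chunks.
expected_count = math.ceil(600 / 250)
print(len(demo_chunks), expected_count)
print([len(c.split()) for c in demo_chunks])
```

    Only the last chunk can be shorter than `chunk_size`; every other chunk holds exactly `chunk_size` words.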



### Connecting to SurrealDB:
    Establishes a connection to the SurrealDB instance. This involves:
    
        Starting SurrealDB, if it isn't already running, by executing the start command in a terminal.

        Specifying the connection parameters (e.g., ws://localhost:8000/rpc).

        Authenticating with the database using credentials (e.g., username and password).

        Selecting the namespace and database that will hold the chunks.
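
    The connection parameters listed above can be gathered in one place. The sketch below is only a convenience for building the WebSocket URL and the start command as strings; `build_connection_settings` is a hypothetical helper, the defaults are the example values used in this notebook, and whether the `/rpc` suffix is required may depend on the client version.

```python
import shlex


def build_connection_settings(host="localhost", port=8000,
                              user="root", password="root",
                              db_path="./db"):
    """Return (url, start_command) strings for a local SurrealDB instance."""
    address = f"{host}:{port}"
    # The /rpc suffix follows the example URL given in the text above.
    url = f"ws://{address}/rpc"
    # shlex.quote protects values that contain spaces or shell characters.
    start_command = " ".join([
        "surreal", "start", "--allow-net", "--log", "none",
        "--user", shlex.quote(user),
        "--pass", shlex.quote(password),
        "--bind", address,
        shlex.quote(f"rocksdb://{db_path}"),
    ])
    return url, start_command


url_demo, start_demo = build_connection_settings(db_path="/tmp/my db")
print(url_demo)
print(start_demo)
```

    Quoting the storage path matters when the notebook folder contains spaces, because the command is meant to be pasted into a terminal.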


### Generating and Storing Embeddings: 
    Enables semantic search capabilities within the table by:

        Using an embedding model (glove.6B.50d in this example) to generate a numerical representation of each chunk. 

        Storing these embeddings in a field of each chunk record in SurrealDB, allowing for similarity-based searches.

        Make sure to upload an embedding model to the database before inserting the chunks.

        This repo is a prerequisite for uploading the embedding model:

        https://github.com/apireno/surrealDB_embedding_model
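
    One common way to turn word vectors such as GloVe into a sentence vector is to average the vectors of the known words, and the sketch below assumes that approach. It is not the definition of `fn::sentence_to_vector`, which lives in the repo linked above. The three-dimensional `TOY_VECTORS` vocabulary, the sample chunks, and the `top_k` helper are all made up for illustration.

```python
import math

# Hypothetical 3-dimensional word vectors (real GloVe vectors here have 50).
TOY_VECTORS = {
    "agent": [0.9, 0.1, 0.0],
    "mission": [0.8, 0.2, 0.1],
    "signal": [0.1, 0.9, 0.2],
    "alien": [0.0, 0.8, 0.3],
}


def sentence_to_vector(sentence, vectors=TOY_VECTORS):
    """Average the vectors of known words; None when no word is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query."""
    q = sentence_to_vector(query)
    if q is None:
        return []
    scored = []
    for chunk in chunks:
        v = sentence_to_vector(chunk)
        if v is not None:  # chunks without an embedding cannot be ranked
            scored.append((cosine_similarity(q, v), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


toy_chunks = [
    "the agent accepted the mission",
    "an alien signal arrived",
    "nothing relevant here",
]
print(top_k("alien signal", toy_chunks, k=1))
```

    Returning `None` for text with no known words mirrors the idea of an optional embedding: such a record simply cannot take part in a similarity search.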



#### notes
    This notebook uses the following libraries:
        surrealdb to interact with SurrealDB
        pandas to display the query results

    A prerequisite is to install the embedding model, as in this Python script:
            https://github.com/apireno/surrealDB_embedding_model


In [1]:
import pkg_resources
import sys
import os
import time
import ipynb_path

from surrealdb import AsyncSurreal
import pandas as pd

#get this notebook's path for access to the other files needed
dir_path = os.path.dirname(os.path.realpath(ipynb_path.get(__name__)))
sys.path.append(dir_path) #add the current directory for adding py imports
from prompts import CONTINUE_PROMPT, GRAPH_EXTRACTION_PROMPT, LOOP_PROMPT




In [2]:

# this folder
nb_folder = dir_path
out_folder = nb_folder + "/chunking_{0}".format(time.strftime("%Y%m%d-%H%M%S"))

os.makedirs(out_folder, exist_ok=True)



#the file to read the text from
input_file = nb_folder + "/Operation Dulce v2 1 1.txt"

#SurrealQL to execute
surql_file = out_folder + "/inserts.suql"

#debugging file for logging 
debug = True
debug_file = out_folder + "/debug.txt"




In [3]:
def chunk_file(input_file, chunk_size=250):
    """
    Chunks an input file into optimized sizes for LLM processing.

    Args:
        input_file: Path to the input file.
        chunk_size: Maximum number of words per chunk (default 250).

    Returns:
        A list of strings, where each string is a chunk of the file.
    """

    with open(input_file, 'r') as f:
        text = f.read()

    # This is a heuristic for chunk size, and may need adjustment
    # depending on the specific LLM and its context window.
    # It aims to create chunks that are large enough to provide context,
    # but small enough to avoid exceeding the LLM's limits.
    words = text.split()
    # The default chunk_size of 250 words is approximately 1000-1500 characters
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append(' '.join(words[i:i + chunk_size]))

    return chunks

In [4]:
#let's chunk this file!

chunks = chunk_file(input_file)

print(f"How many chunks? {len(chunks)}")
print(f"""

Let's see the first chunk:

{chunks[0]}""")

How many chunks? 77


Let's see the first chunk:

# Operation: Dulce ## Chapter 1 The thrumming of monitors cast a stark contrast to the rigid silence enveloping the group. Agent Alex Mercer, unfailingly determined on paper, seemed dwarfed by the enormity of the sterile briefing room where Paranormal Military Squad's elite convened. With dulled eyes, he scanned the projectors outlining their impending odyssey into Operation: Dulce. “I assume, Agent Mercer, you’re not having second thoughts?” It was Taylor Cruz’s voice, laced with an edge that demanded attention. Alex flickered a strained smile, still thumbing his folder's corner. "Of course not, Agent Cruz. Just trying to soak in all the details." The compliance in his tone was unsettling, even to himself. Jordan Hayes, perched on the opposite side of the table, narrowed their eyes but offered a supportive nod. "Details are imperative. We’ll need your clear-headedness down there, Mercer." A comfortable silence, the kind that threaded be

In [5]:

#now that the chunks are ready, let's connect; first make sure the database is up and running

ip = "0.0.0.0:8000"
url = "ws://{0}".format(ip)

u = "root"
p = "root"
n = "graph_rag"
d = "graph_rag"
db_folder = nb_folder + "/db"

surrealdb_start = "surreal start --allow-net --log none --user {u} --pass {p} --bind {ip} \"rocksdb://{db_folder}\"".format(
    u=u,
    p=p,
    ip=ip,
    db_folder=db_folder)

#run this command if your surreal instance isn't running yet 
#copy and paste from below into a terminal
print(surrealdb_start)

#and ensure you installed the embedding model!
#the model will power the function fn::sentence_to_vector($text)
print("""
ensure you installed the embedding model!
      
      https://github.com/apireno/surrealDB_embedding_model

the model will power the function fn::sentence_to_vector($text)
"""   )  

surreal start --allow-net --log none --user root --pass root --bind 0.0.0.0:8000 "rocksdb:///Users/sandro/git_repos/graph_rag/db"

ensure you installed the embedding model!
      
      https://github.com/apireno/surrealDB_embedding_model

the model will power the function fn::sentence_to_vector($text)



In [9]:


#make sure the namespace and database exist
recreate_db_surql = """
DEFINE NAMESPACE IF NOT EXISTS {0};
DEFINE DATABASE IF NOT EXISTS {1};
"""
#the table we will insert into
TABLE_NAME = "CHUNKS"


#this DDL will create the chunk table
#and create an index for searching the chunks
#based on cosine similarity of the 50-dimensional GloVe embeddings
recreate_table_surql = f"""
REMOVE TABLE IF EXISTS {TABLE_NAME};
DEFINE TABLE {TABLE_NAME} SCHEMAFULL;
DEFINE FIELD chunk ON TABLE {TABLE_NAME} TYPE string;
DEFINE FIELD embedding ON TABLE {TABLE_NAME} TYPE option<array<float>> 
    DEFAULT fn::sentence_to_vector( chunk);

REMOVE INDEX IF EXISTS idx_{TABLE_NAME}_embedding ON TABLE {TABLE_NAME};
DEFINE INDEX idx_{TABLE_NAME}_embedding ON TABLE {TABLE_NAME} FIELDS embedding MTREE DIMENSION 50 DIST COSINE;
    
"""

insert_chunk_surql = f"""
    INSERT INTO {TABLE_NAME} {{
        chunk:$chunk
    }} RETURN NONE;
"""


#this is a sample query after inserts
sample_query = f"""
LET $v = fn::sentence_to_vector($q);
SELECT chunk FROM {TABLE_NAME} WHERE  embedding <|30,COSINE|> $v;
"""



#connect to the database
async with AsyncSurreal(url) as db:
    await db.signin( {"username":u, "password":p}) 
    #create the namespace and database
    outcome = await db.query_raw(recreate_db_surql.format(n,d))
    await db.use(n, d)

    #create the chunk table and its index
    outcome = await db.query_raw(recreate_table_surql)
    
    #insert the chunks one at a time; the embedding is filled in by the field default
    for chunk in chunks:
        params = {"chunk": chunk}
        outcome = await db.query_raw(insert_chunk_surql, params)

    #let's test the search
    params = {"q": "What is the name of this story?"}


    #run the sample query to find the chunks closest to the question
    sample_query_outcome = await db.query_raw(sample_query,params)

df = pd.json_normalize(sample_query_outcome["result"][1]["result"] )
print(df.describe())
df.head()

       

                                                    chunk
count                                                  30
unique                                                 30
top     odyssey not of the body, but of the intellect ...
freq                                                    1


                                               chunk
0  odyssey not of the body, but of the intellect ...
1  team now faced the duality of their roles, pro...
2  the team. Taylor observed them, a cold calcula...
3  is no longer just about being heard—it's about...
4  the screen. "Responding? Like it’s alive?" Tay...
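
The cell above reads `sample_query_outcome["result"][1]["result"]` because the query holds two statements: the `LET` sits at position 0 and the `SELECT` at position 1. A small helper can make that indexing explicit. This is only a sketch: `statement_rows` is a hypothetical name, and `mock_response` is a hand-written stand-in that mirrors just the keys the cell relies on, so a real response may carry additional fields.

```python
def statement_rows(response, index):
    """Return the rows produced by the statement at the given position."""
    statements = response["result"]
    if not 0 <= index < len(statements):
        raise IndexError(
            f"query returned {len(statements)} statements, no index {index}"
        )
    return statements[index]["result"]


# Hypothetical response shape: one entry per statement in the query.
mock_response = {
    "result": [
        {"result": None},                      # LET $v = ...
        {"result": [{"chunk": "placeholder chunk A"},
                    {"chunk": "placeholder chunk B"}]},  # SELECT chunk ...
    ]
}

rows = statement_rows(mock_response, 1)
print([row["chunk"] for row in rows])
```

Checking the index before reading keeps an off-by-one mistake from surfacing later as a confusing `KeyError` or an empty DataFrame.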