# Chunk the text files, generate embeddings, and ingest them


In [39]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
import ollama

In [21]:
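def load_text(file_path):
    """Read a text file and return its contents, or None if the file is missing."""
    try:
        # Assumes the file is UTF-8; without this the platform default encoding is used
        with open(file_path, 'r', encoding='utf-8') as file:
            text = file.read()
    except FileNotFoundError:
        print(f"File not found. Please make sure {file_path} exists.")
        return None
    print(text[:10000])  # preview the first 10,000 characters
    return text

text = load_text("DataLake/magma.txt")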

ABSTRACT
We present Magma, a write-optimized high data density key-value
storage engine used in the Couchbase NoSQL distributed docu-
ment database. Today’s write-heavy data-intensive applications
like ad-serving, internet-of-things, messaging, and online gaming,
generate massive amounts of data. As a result, the requirement
for storing and retrieving large volumes of data has grown rapidly.
Distributed databases that can scale out horizontally by adding
more nodes can be used to serve the requirements of these internet-
scale applications. To maintain a reasonable cost of ownership, we
need to improve storage efficiency in handling large data volumes
per node, such that we don’t have to rely on adding more nodes.
Our current generation storage engine, Couchstore is based on a
log-structured append-only copy-on-write B+Tree architecture. To
make substantial improvements to support higher data density and
write throughput, we needed a storage engine architecture that
lowers write amplica

In [22]:
# Split the text into fixed-size character windows that overlap their neighbours

def chunk_text(text, window_size=500, overlap=100):
    """
    Chunk text into overlapping segments of specified window size
    
    Args:
        text (str): Input text to chunk
        window_size (int): Size of each chunk in characters
        overlap (int): Number of overlapping characters between chunks
        
    Returns:
        list: List of text chunks
    """
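    # A non-positive stride would make the loop below run forever
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    chunks = []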
    start = 0
    
    while start < len(text):
        # Get chunk of window_size or remaining text if shorter
        end = min(start + window_size, len(text))
        chunk = text[start:end]
        
        # Add chunk if it's not empty
        if chunk.strip():
            chunks.append(chunk)
            
        # Move start position by window_size - overlap
        start = start + window_size - overlap
        
    return chunks

# Create chunks with a 1000-character window and 200-character overlap
data_chunks = chunk_text(text, window_size=1000, overlap=200)

print(f"Created {len(data_chunks)} chunks")
print(f"\nFirst chunk sample:\n{data_chunks[0][:200]}...")


Created 21 chunks

First chunk sample:
ABSTRACT
We present Magma, a write-optimized high data density key-value
storage engine used in the Couchbase NoSQL distributed docu-
ment database. Today’s write-heavy data-intensive applications
lik...
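
The window arithmetic behind the chunk count can be checked in isolation. The sketch below uses a hypothetical helper, `chunk_spans`, that returns only the character offsets of each window; it follows the same stride rule as `chunk_text` (stride = window size minus overlap) but does not skip whitespace-only windows.

```python
import math

def chunk_spans(text_len, window_size, overlap):
    """Return (start, end) offsets for a sliding window over text_len characters."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    stride = window_size - overlap
    return [(start, min(start + window_size, text_len))
            for start in range(0, text_len, stride)]

spans = chunk_spans(text_len=2500, window_size=1000, overlap=200)
print(spans)
# The number of windows equals ceil(text_len / stride)
print(len(spans) == math.ceil(2500 / 800))
```

Two things follow from this. First, the final window can lie entirely inside the previous one: here (2400, 2500) is a subset of (1600, 2500), so the last chunk may add no new text. Second, 21 chunks at a stride of 800 implies a text of roughly 16,000 to 16,800 characters, assuming no window was dropped as whitespace-only.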


In [None]:
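# Load the embedding model
def load_model(model_name):
    """
    Load a SentenceTransformer model from Hugging Face
    
    Args:
        model_name (str): Name of the model to load

    Returns:
        SentenceTransformer: Loaded model
    """
    # trust_remote_code=True runs custom model code downloaded from the model repository
    model = SentenceTransformer(model_name, trust_remote_code=True)
    return model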

model = load_model("nomic-ai/CodeRankEmbed")


<All keys matched successfully>


In [28]:
# Embed the chunks with the Hugging Face model loaded above

def embed_chunks(chunks, model):
    """
    Embed text chunks using the provided model
    
    Args:
        chunks (list): List of text chunks
        model: SentenceTransformer model
        
    Returns:
        np.ndarray: Array of embeddings
    """
    embeddings = model.encode(chunks)
    return embeddings

embeddings = embed_chunks(data_chunks, model)
print(embeddings)
# Pairwise similarity between all chunk embeddings (one row and one column per chunk)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)

[[ 0.19513035  1.0389959  -0.6033782  ... -0.22977789  0.12080007
   0.56734395]
 [ 0.5073284  -0.37393796 -0.7079576  ... -0.23452719  1.4435698
   0.2964999 ]
 [ 0.2933281   1.4403888  -0.6809826  ... -0.07078546  0.87308353
   0.9864548 ]
 ...
 [ 0.60976076  0.2126527  -0.59666383 ...  0.22752672  0.62414616
   0.21837918]
 [ 0.2967483   0.15918534 -0.6006259  ...  0.16863205  0.92862195
  -1.0542966 ]
 [ 0.9830756   1.2716196  -0.39954367 ... -0.49186674 -0.32120115
  -1.3642838 ]]
torch.Size([21, 21])
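
`model.similarity` returns one score per pair of chunks, which is why the result is 21 × 21. Assuming the model is left on the sentence-transformers default of cosine similarity, the same matrix can be sketched in NumPy as below. The helper `cosine_similarity_matrix` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype=np.float32)
sims = cosine_similarity_matrix(vectors, vectors)
print(sims.shape)
print(np.round(sims, 3))
```

The diagonal is 1 because every vector is compared with itself, and the score depends on direction only, not on vector length.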


In [29]:
# Create the vector store 
def create_vector_store(embeddings):
    """
    Create a vector store using FAISS
    
    Args:
        embeddings (np.ndarray): Array of embeddings  
    Returns:
        faiss.IndexFlatL2: Vector store
    """

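    # FAISS expects a C-contiguous float32 matrix of shape (n_vectors, dim)
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    vector_store = faiss.IndexFlatL2(embeddings.shape[1])
    vector_store.add(embeddings)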

    return vector_store

vector_store = create_vector_store(embeddings)
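
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean distances. The NumPy sketch below reproduces that computation on toy data to make the ranking rule concrete; `flat_l2_search` is a hypothetical stand-in, not part of FAISS.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Brute-force k-nearest-neighbour search by squared Euclidean distance."""
    # Pairwise squared distances, shape (n_queries, n_stored)
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, nearest first
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)
distances, indices = flat_l2_search(stored, queries, k=2)
print(indices)
print(distances)
```

L2 distance is sensitive to vector length, whereas cosine similarity is not. The embeddings printed earlier contain components above 1, so they appear not to be unit-normalized, and an L2 ranking may therefore differ from the cosine ranking that `model.similarity` would give.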


In [36]:
def query_vector_store(query, vector_store, model, data_chunks, k=5):
    """
    Retrieve the text chunks whose embeddings are closest to the query
    
    Args:
        query (str): Query to search for
        vector_store (faiss.IndexFlatL2): Vector store to search
        model: SentenceTransformer model
        data_chunks (list): Text chunks in the order they were added to the vector store
        k (int): Number of chunks to retrieve
        
    Returns:
        str: Retrieved chunks, nearest first, separated by newlines
    """
    query_embedding = model.encode([query])  # shape (1, dim)
    query_embedding = query_embedding.astype('float32')
    distances, indices = vector_store.search(query_embedding, k)
    # FAISS pads the result with -1 when fewer than k vectors are stored
    context = ""
    for i in indices[0]:
        if i != -1:
            context += data_chunks[i] + "\n"
    return context

query = "merging the ss tables , how?"
context = query_vector_store(query, vector_store, model, data_chunks)
print(context)

e the cost of reads as well as reduce space usage, we have
to periodically merge SSTable files and reclaim space. This process
is called compaction.
A level-based compaction strategy popularized by LevelDB [15,
24] is a common compaction strategy for achieving lower read am-
plification and space amplification. The LSM Tree is organized into
multiple levels of exponentially increasing sizes with the smallest
size at the top and the largest being the bottom level. Each level can
have several SSTable files. The in-memory component is periodi-
cally flushed into level-0 as an SSTable file. Level-0 is a special level
that accumulates new data. It can have multiple SSTables with over-
lapping key ranges. All other higher levels have non-overlapping
key ranges between the SSTables in the level. Each non-level-0 level
has a contiguous key range. When level-0 reaches a size threshold,
the SSTable files are picked and merged with sstables from the
level-1 and the overlapping key range from the level-1

# test out the embeddings 

In [37]:
# now add the context to the query for the LLM

def add_context_to_query(query, context):
    """
    Add context to the query for the LLM
    
    Args:
        query (str): Query to add context to
        context (str): Context to add to the query
        
    Returns:
        str: Query with context added
    """
    return f"Query: {query}\nContext: {context}"

query_with_context = add_context_to_query(query, context)
print(query_with_context)
# now use the LLM to answer the query 



Query: merging the ss tables , how?
Context: e the cost of reads as well as reduce space usage, we have
to periodically merge SSTable files and reclaim space. This process
is called compaction.
A level-based compaction strategy popularized by LevelDB [15,
24] is a common compaction strategy for achieving lower read am-
plification and space amplification. The LSM Tree is organized into
multiple levels of exponentially increasing sizes with the smallest
size at the top and the largest being the bottom level. Each level can
have several SSTable files. The in-memory component is periodi-
cally flushed into level-0 as an SSTable file. Level-0 is a special level
that accumulates new data. It can have multiple SSTables with over-
lapping key ranges. All other higher levels have non-overlapping
key ranges between the SSTables in the level. Each non-level-0 level
has a contiguous key range. When level-0 reaches a size threshold,
the SSTable files are picked and merged with sstables from the
level-1 a
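
The prompt above is a bare `Query:` / `Context:` pair with no instruction telling the model how to use the context. One possible alternative, sketched under assumptions, is a template that numbers each retrieved chunk and states the task. `build_rag_prompt` is a hypothetical helper, and it expects the chunks as a list rather than the single joined string that `query_vector_store` returns.

```python
def build_rag_prompt(query, chunks):
    """Assemble a prompt that labels each retrieved chunk and states the task."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}"
    )

prompt = build_rag_prompt(
    "merging the ss tables , how?",
    ["Compaction merges SSTable files.", "Level-0 accumulates new data."],
)
print(prompt)
```

Numbering the chunks makes chunk boundaries visible to the model, which matters here because the retrieved chunks start and end mid-sentence.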


In [40]:
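# Send the query to the LLM
def get_ollama_suggestions(query_with_context):
    """
    Stream the model's answer to stdout as it is generated

    Args:
        query_with_context (str): Prompt containing the query and retrieved context

    Returns:
        str: Status message once the stream has finished
    """
    response = ollama.chat(
        model='deepseek-r1:14b',
        messages=[{'role': 'user', 'content': query_with_context}],
        options={"temperature": 0.8},
        stream=True,
    )

    # Each streamed chunk carries the next fragment of the reply
    for chunk in response:
        print(chunk["message"]["content"], end='', flush=True)

    return "Streaming complete"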

get_ollama_suggestions(query_with_context)

<think>
Okay, so I'm trying to understand how merging SSTables works in an LSM Tree, specifically using a level-based compaction strategy like the one used in LevelDB. From what I gather, SSTables are these sorted string tables that store data on disk, and over time they accumulate as writes come into the system. The problem is that having too many SSTables can make read operations slow because you have to check each one until you find the key. Also, all those SSTables take up a lot of space.

So, compaction is the process where we merge these SSTables to reduce their number and reclaim unused space. I remember reading that in LevelDB, they organize these SSTables into levels, starting from level 0 which is kind of a special case. Each subsequent level has larger SSTables with non-overlapping key ranges. The idea is that higher levels have less frequent changes because data moves down the levels over time.

When level 0 gets too full (I think it's when the number of SSTables or their t

'Streaming complete'
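
The function above prints the reply but discards it, and the reply opens with the model's `<think>` reasoning. A minimal sketch of keeping the streamed text and separating the reasoning from the final answer is shown below. `collect_stream` and `split_reasoning` are hypothetical helpers, and `fake_stream` stands in for the iterator returned by `ollama.chat(..., stream=True)` so that the example runs without a local model.

```python
import re

def collect_stream(chunks):
    """Join the text fragments of a streamed chat response into one string."""
    return "".join(chunk["message"]["content"] for chunk in chunks)

def split_reasoning(text):
    """Separate a leading <think>...</think> block from the final answer."""
    match = re.match(r"\s*<think>(.*?)</think>\s*", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

# Stand-in for the chunks yielded by a streaming chat call
fake_stream = [
    {"message": {"content": "<think>\nLevel-0 fills up"}},
    {"message": {"content": " first.\n</think>\n\n"}},
    {"message": {"content": "Compaction merges SSTables level by level."}},
]
reasoning, answer = split_reasoning(collect_stream(fake_stream))
print(answer)
```

Returning the answer as a string, rather than a fixed status message, would let later cells evaluate or store it.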