### Breaking data down into useful chunks
To let LLMs interact intelligently with content outside the large corpus of text they were trained on, we need to organize that data into chunks that we can retrieve based on relevancy.  The first step in this process is to break the provided knowledge into manageable chunks.  Recall that LLMs have a limited context window for consuming data, so it's up to us to parse it into meaningful bits.
![Chunking](./images/chunks.jpg)
<p>There are many strategies we can use, depending on the type of data we are ingesting.  <br>Here are a few we will review:</p>

    1. Fixed-Size Chunking (by Tokens or Characters)
    2. Sentence-Level Chunking
    3. Paragraph-Level Chunking
    4. Sliding Window (with Overlap)
    5. Semantic Chunking (TextTiling, Topic Modeling)

| Use Case                                    | Strategy                                           |
|:--------------------------------------------|:---------------------------------------------------|
| Small documents / simple use case           | Fixed-size, non-overlapping                        |
| Long-form documents (PDFs, books)           | Recursive splitter or paragraph-based              |
| High semantic fidelity needed               | Sentence or semantic chunking                      |
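
Paragraph-level chunking is listed above but not demonstrated later in this notebook, so here is a minimal sketch of one way it could work. It assumes paragraphs are separated by blank lines; the function name `paragraph_chunking` and the `max_chars` parameter are hypothetical and chosen only for illustration.

```python
import re

def paragraph_chunking(text, max_chars=500):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    Hypothetical helper for illustration. A single paragraph longer than
    max_chars is kept whole rather than split.
    """
    # Split on one or more blank lines and drop empty pieces
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
for i, chunk in enumerate(paragraph_chunking(sample, max_chars=40)):
    print(f"--- Chunk {i + 1} ---\n{chunk}")
```

With `max_chars=40`, the first two paragraphs fit into one chunk and the third starts a new one. Measuring the limit in characters is a simplification; a token count would track the model's context window more closely.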



#### How is a piece of text converted into a vector?
A common approach is to use models that provide contextualized embeddings for entire sentences. These models are based on deep learning architectures such as Transformers, which capture contextual information and the relationships between words in a sentence more effectively than approaches that embed each word in isolation.
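
To make the idea of "text as a vector" concrete, here is a toy sketch. It is not how a real embedding model computes its output: the three-dimensional token vectors below are made-up numbers, whereas a real model produces them with a Transformer and uses hundreds of dimensions. The sketch only illustrates two steps that many sentence-embedding pipelines share: pooling per-token vectors into one sentence vector (here by averaging) and comparing sentences with cosine similarity.

```python
import math

# Made-up token vectors for illustration only (assumption: a real model
# would compute contextual vectors with a Transformer).
token_vectors = {
    "dogs": [0.9, 0.1, 0.0],
    "puppies": [0.8, 0.2, 0.0],
    "bark": [0.7, 0.0, 0.3],
    "play": [0.6, 0.1, 0.3],
    "taxes": [0.0, 0.9, 0.1],
    "rise": [0.1, 0.8, 0.1],
}

def embed(sentence):
    """Mean-pool the token vectors of a sentence into one vector."""
    vectors = [token_vectors[word] for word in sentence.lower().split()]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = embed("dogs bark")
v2 = embed("puppies play")
v3 = embed("taxes rise")

print("dogs bark vs puppies play:", cosine_similarity(v1, v2))
print("dogs bark vs taxes rise:  ", cosine_similarity(v1, v3))
```

The first pair scores much higher than the second because the toy vectors for the dog-related words point in similar directions. Retrieval relies on the same comparison, only with vectors produced by a trained model.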



#### Let's review a simple fixed-size approach

In [7]:
def read_file(file_path):
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

In [None]:
import spacy

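def fixed_size_chunking(text, chunk_size, overlap):
    # The step (chunk_size - overlap) must be positive,
    # otherwise the loop below never terminates
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # Tokenize with spaCy so chunk_size counts tokens rather than characters
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    tokens = [token.text for token in doc]

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk = tokens[start:end]
        chunks.append(" ".join(chunk))
        start += chunk_size - overlap  # move start forward with overlap

    return chunks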

# Note: if spacy.load cannot find the model, run this in a terminal window:
# python -m spacy download en_core_web_md
# then restart your kernel

In [None]:
chunk_size = 100
overlap_size = 20

# Example usage: open the file and read its text
file_path = "./data/register-for-classes.txt"
file_content = read_file(file_path)

if file_content is None:
    print("Unable to read data from file: ", file_path)
else:
    # Generate chunks
    chunks = fixed_size_chunking(file_content, chunk_size, overlap_size)

    # Display results
    for i, chunk in enumerate(chunks):
        print(f"\n--- Chunk {i + 1} ---\n{chunk}")
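
With 100-token chunks it is hard to see what the overlap is doing. The sketch below applies the same windowing logic to a ten-word list, using plain whitespace splitting instead of spaCy so it runs without any model. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Hypothetical helper: slide a window over an already-tokenized list."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    return [tokens[start:start + chunk_size]
            for start in range(0, len(tokens), step)]


tokens = "one two three four five six seven eight nine ten".split()
chunks = sliding_window_chunks(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk. Note that the final window can be shorter than `chunk_size` and may be entirely contained in the window before it.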

#### Now let's get a little more sophisticated and use sentence-level chunking

In [None]:
def sentence_chunking(text, sentences_per_chunk):
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]

    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)

    return chunks

In [None]:
sentences_per_chunk = 1

# Example usage: open the file and read its text
file_path = "./data/declaration-of-indep.txt"
file_content = read_file(file_path)

if file_content is None:
    print("Unable to read data from file: ", file_path)
else:
    # Generate chunks
    chunks = sentence_chunking(file_content, sentences_per_chunk)

    # Display results
    for i, chunk in enumerate(chunks):
        print(f"\n--- Chunk {i + 1} ---\n{chunk}")

#### Let's explore a semantic-based chunking approach

In [1]:
def semantic_embedding_chunk(text, threshold):
    """
    Splits text into semantic chunks using sentence embeddings.
    Uses spaCy for sentence segmentation and SentenceTransformer for generating embeddings.
    Relies on the module-level names `nlp`, `model`, and `util`, which are created in a
    later cell, so run that cell before calling this function.

    :param text: The full text to chunk.
    :param threshold: Cosine similarity threshold for adding a sentence to the current chunk.
    :return: A list of semantic chunks (each as a string).
    """
    # Sentence segmentation
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]

    chunks = []
    current_chunk_sentences = []
    current_chunk_embedding = None

    for sentence in sentences:
        # Generate embedding for the current sentence
        sentence_embedding = model.encode(sentence, convert_to_tensor=True)

        # If starting a new chunk, initialize it with the current sentence
        if current_chunk_embedding is None:
            current_chunk_sentences = [sentence]
            current_chunk_embedding = sentence_embedding
        else:
            # Compute cosine similarity between current sentence and the chunk embedding
            sim_score = util.cos_sim(sentence_embedding, current_chunk_embedding)
            if sim_score.item() >= threshold:
                # Add sentence to the current chunk and update the chunk's average embedding
                current_chunk_sentences.append(sentence)
                num_sents = len(current_chunk_sentences)
                current_chunk_embedding = ((current_chunk_embedding * (num_sents - 1)) + sentence_embedding) / num_sents
            else:
                # Finalize the current chunk and start a new one
                chunks.append(" ".join(current_chunk_sentences))
                current_chunk_sentences = [sentence]
                current_chunk_embedding = sentence_embedding

    # Append the final chunk if it exists
    if current_chunk_sentences:
        chunks.append(" ".join(current_chunk_sentences))

    return chunks
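
The chunk embedding in the function above is kept as a running average: each time a sentence joins the chunk, the previous mean is scaled back up by the number of sentences it represented, the new embedding is added, and the sum is divided by the new count. The sketch below checks that this incremental update gives the same result as averaging all vectors at once. It uses plain Python lists with made-up values instead of tensors, and the helper name `update_running_mean` is hypothetical.

```python
def update_running_mean(current_mean, new_vector, count):
    """Return the mean of `count` vectors, given the mean of the first count - 1."""
    return [(m * (count - 1) + x) / count
            for m, x in zip(current_mean, new_vector)]


# Made-up 2-dimensional "embeddings" for illustration
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

# Incremental update, one vector at a time (as in the chunking loop)
mean = vectors[0]
for count, vec in enumerate(vectors[1:], start=2):
    mean = update_running_mean(mean, vec, count)

# Direct average over all vectors for comparison
direct = [sum(dim) / len(vectors) for dim in zip(*vectors)]

print("incremental:", mean)
print("direct:     ", direct)
```

Because the chunk embedding is an average, a long chunk drifts slowly: one new sentence changes it less and less as the chunk grows, which affects how readily later sentences fall below the threshold.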

In [4]:
%pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Using cached sentence_transformers-4.0.2-py3-none-any.whl (340 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-4.0.2
Note: you may need to restart the kernel to use updated packages.


In [10]:
#pip install sentence-transformers
import spacy
from sentence_transformers import SentenceTransformer, util

file_path = "./data/home-care.txt"
home_care_content = read_file(file_path)

nlp = spacy.load("en_core_web_md")
model = SentenceTransformer("all-MiniLM-L6-v2")

semantic_chunks = semantic_embedding_chunk(home_care_content, threshold=0.40)
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*60}")

Chunk 1:
Dog:
Taking care of a dog for a week involves consistency, attention, and lots of affection. Dogs are creatures of habit, and maintaining a steady routine helps them feel secure and relaxed.
------------------------------------------------------------
Chunk 2:
Begin each morning between 6:30 and 8:00 AM with a cheerful greeting. Allow the dog to wake up gradually, offering gentle pets and an upbeat tone to start the day on a positive note.
------------------------------------------------------------
Chunk 3:
Once the dog is alert and ready, take them outside for their first potty break. Whether on a leash or in a fenced yard, be sure to stay with them until they’ve had a chance to relieve themselves. Offer praise after they go — simple words like “good potty” reinforce good behavior. After their bathroom break, it’s time for breakfast.
------------------------------------------------------------
Chunk 4:
Measure out the appropriate portion of their usual food — typically dry k

spaCy is a powerful NLP library that can also be used for many other parsing tasks, such as extracting noun phrases and named entities.

In [None]:
def find_nouns(text):
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    for noun in doc.noun_chunks:
        print(noun)

def find_entites(text):
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    for entity in doc.ents:
        print(entity)

In [None]:
find_nouns(file_content)

In [None]:
find_entites(file_content)

In [None]:
find_cat(file_content)