### Breaking data down into useful chunks
![Chunking](./images/chunks.jpg)
<p>To let LLMs work intelligently with content outside the large corpus of text they were trained on, we need to organize that data into chunks that can be retrieved based on relevance. The first step in this process is to break the provided knowledge into manageable chunks. Recall that LLMs have a limited context window for consuming data, so it's up to us to parse the data into meaningful pieces.
</p>
<p>There are many strategies we can use, depending on the type of data we are ingesting. Here are a few we will review:</p>

    1. Fixed-Size Chunking
    2. Sentence-Level Chunking
    3. Semantic Chunking

![Chunking](./images/top3.jpg)
Image taken from [SOURCE](https://www.nb-data.com/p/9-chunking-strategis-to-improve-rag)

| Use Case                              | Strategy                |
|:--------------------------------------|:------------------------|
| Small documents / simple use cases    | Fixed-size chunking     |
| Frequently asked questions            | Sentence-level chunking |
| RAG that needs high semantic fidelity | Semantic chunking       |







### Simple
Let's review a simple fixed-size approach.

In [2]:
def read_file(file_path):
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"Error: The file at {file_path} was not found.")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

In [3]:
import spacy

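def fixed_size_chunking(text, chunk_size, overlap):
    # Guard against a stride of zero or less, which would loop forever
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")

    # Tokenize with spaCy so chunk_size counts tokens, not characters
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    tokens = [token.text for token in doc]

    chunks = []
    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk = tokens[start:end]
        chunks.append(" ".join(chunk))
        start += chunk_size - overlap  # move start forward, keeping `overlap` tokens

    return chunks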

# Note: if the en_core_web_md model is not installed, you may need to run
# the following in a terminal window:
# python -m spacy download en_core_web_md
# Then restart your kernel.



In [4]:
chunk_size = 100
overlap_size = 20

# open file and read text from file
# Example usage
file_path = "./data/register-for-classes.txt"
file_content = read_file(file_path)

if file_content is None:
    print("Unable to read data from file: ", file_path)
else:
    # Generate chunks
    chunks = fixed_size_chunking(file_content, chunk_size, overlap_size)

    # Display results
    for i, chunk in enumerate(chunks):
        print(f"\n--- Chunk {i + 1} ---\n{chunk}")


--- Chunk 1 ---
Register for Classes 
 Preparing for Registration 

 To prepare for registration , visit your Student Center to : 

 Find your enrollment appointment date and time 
 Clear any holds on your account 
 Access degree planning and class scheduling tools to determine which courses you need 
 Registration information for the upcoming term is typically available during week 4 or 5 of the current term . Specific dates can be found in the Student Center , on the enrollment apppointments webpage , and in the Planning Calendar for the term . 

 Check Enrollment Appointment Date & Time

--- Chunk 2 ---
enrollment apppointments webpage , and in the Planning Calendar for the term . 

 Check Enrollment Appointment Date & Time 
 Clear All Registration Holds 
 Understanding Your Curriculum & Course Options 

 There are three primary tools you can consult each term to help determine which courses you need to complete . Students are encouraged to make a graduation plan during their first
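
To make the overlap visible on a small input, here is a minimal sketch of the same sliding-window idea. It assumes plain whitespace tokens instead of spaCy tokens, and the helper name `fixed_size_chunks` is hypothetical, not part of the notebook above.

```python
def fixed_size_chunks(tokens, chunk_size, overlap):
    """Slice a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap  # how far each window start moves forward
    return [tokens[start:start + chunk_size]
            for start in range(0, len(tokens), stride)]


# Assumption: whitespace splitting stands in for spaCy tokenization
tokens = "a b c d e f g h i j".split()
windows = fixed_size_chunks(tokens, chunk_size=4, overlap=1)
for window in windows:
    print(" ".join(window))
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last token of one window is repeated at the start of the next. The final window can be shorter than `chunk_size`, just as Chunk 2 above repeats the tail of Chunk 1.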

### Sentence
Now let's get a little more sophisticated and use sentence-level chunking.

In [5]:
def sentence_chunking(text, sentences_per_chunk):
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]

    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)

    return chunks

In [6]:
sentences_per_chunk = 1

# open file and read text from file
# Example usage
file_path = "./data/declaration-of-indep.txt"
file_content = read_file(file_path)

if file_content is None:
    print("Unable to read data from file: ", file_path)
else:
    # Generate chunks
    chunks = sentence_chunking(file_content, sentences_per_chunk)

    # Display results
    for i, chunk in enumerate(chunks):
        print(f"\n--- Chunk {i + 1} ---\n{chunk}")


--- Chunk 1 ---
In Congress, July 4, 1776

--- Chunk 2 ---
The unanimous Declaration of the thirteen united States of America, When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

--- Chunk 3 ---
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to al
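
With `sentences_per_chunk = 1`, every sentence becomes its own chunk. The sketch below shows how grouping changes when that value is larger. It is only an illustration: `naive_sentences` is a hypothetical regex splitter standing in for spaCy's `doc.sents`, and it will mis-split abbreviations such as "Dr." that spaCy handles better.

```python
import re


def naive_sentences(text):
    """Split after ., ! or ? followed by whitespace (rough stand-in for doc.sents)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def group_sentences(sentences, sentences_per_chunk):
    """Join consecutive sentences into chunks of sentences_per_chunk each."""
    if sentences_per_chunk < 1:
        raise ValueError("sentences_per_chunk must be at least 1")
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]


# Made-up sample text, used only for illustration
sample = ("Clear any holds. Check your appointment. Plan your courses. "
          "Register early! Any questions?")
sentence_groups = group_sentences(naive_sentences(sample), 2)
for i, group in enumerate(sentence_groups):
    print(f"--- Chunk {i + 1} ---\n{group}")
```

Larger groups give each chunk more surrounding context at the cost of less precise retrieval, so the right value depends on how self-contained the sentences in your data are.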

### Semantic
Let's explore a semantic-based chunking approach.

In [7]:
def semantic_embedding_chunk(text, threshold):
    """
    Splits text into semantic chunks using embeddings.
    Uses fixed_size_chunking (100-token windows, 10-token overlap) to produce the
    base segments and SentenceTransformer to generate their embeddings.
    Relies on the globals `model` and `util`, which must be defined before calling.

    :param text: The full text to chunk.
    :param threshold: Cosine similarity threshold for adding a segment to the current chunk.
    :return: A list of semantic chunks (each as a string).
    """
    # Base segmentation: fixed-size token windows.
    # To segment by sentence instead, swap in the two spaCy lines below.
    #doc = nlp(text)
    #sentences = [sent.text.strip() for sent in doc.sents if sent.text.strip()]
    sentences = fixed_size_chunking(text, 100, 10)

    chunks = []
    current_chunk_sentences = []
    current_chunk_embedding = None

    for sentence in sentences:
        # Generate embedding for the current sentence
        sentence_embedding = model.encode(sentence, convert_to_tensor=True)

        # If starting a new chunk, initialize it with the current sentence
        if current_chunk_embedding is None:
            current_chunk_sentences = [sentence]
            current_chunk_embedding = sentence_embedding
        else:
            # Compute cosine similarity between current sentence and the chunk embedding
            sim_score = util.cos_sim(sentence_embedding, current_chunk_embedding)
            if sim_score.item() >= threshold:
                # Add sentence to the current chunk and update the chunk's average embedding
                current_chunk_sentences.append(sentence)
                num_sents = len(current_chunk_sentences)
                current_chunk_embedding = ((current_chunk_embedding * (num_sents - 1)) + sentence_embedding) / num_sents
            else:
                # Finalize the current chunk and start a new one
                chunks.append(" ".join(current_chunk_sentences))
                current_chunk_sentences = [sentence]
                current_chunk_embedding = sentence_embedding

    # Append the final chunk if it exists
    if current_chunk_sentences:
        chunks.append(" ".join(current_chunk_sentences))

    return chunks

In [1]:
#%pip install sentence-transformers

In [8]:
#pip install sentence-transformers
import spacy
from sentence_transformers import SentenceTransformer, util

file_path = "./data/home-care.txt"
home_care_content = read_file(file_path)

nlp = spacy.load("en_core_web_md")
model = SentenceTransformer("all-MiniLM-L6-v2")

semantic_chunks = semantic_embedding_chunk(home_care_content, threshold=0.45)
for i, chunk in enumerate(semantic_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n{'-'*60}")



Chunk 1:
Dog : 
 Taking care of a dog for a week involves consistency , attention , and lots of affection . Dogs are creatures of habit , and maintaining a steady routine helps them feel secure and relaxed . Begin each morning between 6:30 and 8:00 AM with a cheerful greeting . Allow the dog to wake up gradually , offering gentle pets and an upbeat tone to start the day on a positive note . Once the dog is alert and ready , take them outside for their first potty break . Whether on a leash or in a potty break . Whether on a leash or in a fenced yard , be sure to stay with them until they ’ve had a chance to relieve themselves . Offer praise after they go — simple words like “ good potty ” reinforce good behavior . 

 After their bathroom break , it ’s time for breakfast . Measure out the appropriate portion of their usual food — typically dry kibble , but this might include wet food or nutritional toppers depending on the dog 's diet . Any supplements or medications that are part of th
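
The merging rule in `semantic_embedding_chunk` can be hard to see through the embedding calls, so here is a minimal sketch of just that rule on toy data. The 2-D vectors are invented stand-ins for real embeddings, and `merge_by_similarity` is a hypothetical helper that returns indices rather than text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_by_similarity(vectors, threshold):
    """Group consecutive vectors while each stays similar to the group's running mean."""
    groups = []
    current = []
    mean = None
    for idx, vec in enumerate(vectors):
        if mean is not None and cosine(vec, mean) >= threshold:
            # Similar enough: extend the group and update its running mean
            current.append(idx)
            n = len(current)
            mean = [(m * (n - 1) + v) / n for m, v in zip(mean, vec)]
        else:
            # Too different (or first item): close the group and start a new one
            if current:
                groups.append(current)
            current = [idx]
            mean = list(vec)
    if current:
        groups.append(current)
    return groups


# Assumption: toy vectors, two pointing "east" and two pointing "north"
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = merge_by_similarity(toy_vectors, threshold=0.45)
print(groups)
```

Raising the threshold produces more, smaller chunks, while lowering it merges more segments together. Only consecutive segments are compared, so two related passages separated by an unrelated one still end up in different chunks.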

spaCy is a powerful NLP library that can be used for many other parsing tasks, such as extracting noun phrases and named entities.

In [None]:
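def find_nouns(text):
    # Print each noun phrase (noun chunk) spaCy finds in the text
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    for noun in doc.noun_chunks:
        print(noun)

def find_entites(text):
    # Print each named entity (person, place, date, ...) spaCy finds in the text
    nlp = spacy.load("en_core_web_md")
    doc = nlp(text)
    for entity in doc.ents:
        print(entity)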

In [None]:
find_nouns(file_content)

In [None]:
find_entites(file_content)

#### Conclusion
These basic chunking strategies should cover most of your needs, but spending time up front on the right chunking strategy will noticeably improve the quality of your retrieval.
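
One simple way to compare strategies on your own data is to look at how large the resulting chunks are. The sketch below is a hypothetical helper, `chunk_stats`, that counts whitespace-separated words. That count is only a rough proxy and will not match the token count of any particular model.

```python
from statistics import mean


def chunk_stats(chunks):
    """Summarize chunk sizes, measured in whitespace-separated words."""
    lengths = [len(chunk.split()) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Made-up chunks, used only for illustration
example_chunks = ["one two three", "four five", "six seven eight nine"]
stats = chunk_stats(example_chunks)
print(stats)
```

A wide gap between `min` and `max` often means some chunks carry too little context to be useful while others may crowd the context window.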

#### Assignment
Go find some data that may be useful to you in your project and determine a chunking strategy that might work for that content.  Use what you think is the best method to put the dataset into chunks.  Turn in your source data and chunking code.