# Text Chunking (optional lab)

This optional lab walks you through several methods for chunking the text in your documents. Retrieval is a critical step in a RAG architecture. Semantic search requires that you convert your knowledge base text into embeddings and store them in a search engine that offers vector search. Before you can embed your documents, you need to split them into smaller pieces, commonly called "chunks"; this technique is known as "chunking". Chunking is necessary because a large text passage can lose specificity: it may conflate different topics or concepts, so it is no longer a top match for a query about any one of them. Even if one part of a large passage contains highly relevant information, the similarity of the passage as a whole to the user's query may be low, which can exclude the passage from the top semantic search results. Remember that we include only the top few results in the prompt to the LLM that generates the final answer.
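To make the dilution effect concrete, here is a minimal sketch that compares a query against a focused chunk and against the same text buried in a longer passage. It assumes a toy word-count "embedding" (the hypothetical `bow` helper) in place of a real embedding model, and the sample sentences are invented for illustration.

```python
import math
import re
from collections import Counter

def bow(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented sample text: one relevant sentence plus unrelated sentences.
vpn_part = "A VPN connection links your on-premises network to your VPC."
other_part = ("Textract extracts printed text from scanned documents. "
              "Titan models generate text embeddings for search. "
              "Lambda runs code without provisioning servers.")

query = bow("VPN connection to a VPC")
score_chunk = cosine(query, bow(vpn_part))
score_passage = cosine(query, bow(vpn_part + " " + other_part))

print(f"focused chunk: {score_chunk:.3f}")
print(f"whole passage: {score_passage:.3f}")
```

The relevant sentence is identical in both cases, but the unrelated text enlarges the passage vector and lowers its similarity to the query.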


There is a [great resource by Greg Kamradt](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb) where you can learn about various ways to chunk text.


In [None]:
!pip install langchain langchain_community pypdf langchain_experimental --quiet
!pip install -qU langchain-text-splitters
!pip install --upgrade --quiet  boto3
!pip install opensearch-py --quiet

#You can safely ignore the version requirement error for opensearchpy

In [None]:
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.chat_models import BedrockChat
from langchain.chains import ConversationalRetrievalChain

from langchain_community.embeddings import BedrockEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

import boto3
import os
import time
import json
import pandas as pd
from tqdm import tqdm
import sagemaker
from opensearchpy import OpenSearch, RequestsHttpConnection
from sagemaker import get_execution_role
import random 
import string
import s3fs
from urllib.parse import urlparse
from IPython.display import display, HTML
from alive_progress import alive_bar
from opensearch_py_ml.ml_commons import MLCommonClient
from requests_aws4auth import AWS4Auth
import requests 


In [None]:
# Create a Boto3 session
session = boto3.Session()

# Get the account id
account_id = boto3.client('sts').get_caller_identity().get('Account')

# Get the current region
region = session.region_name

cfn = boto3.client('cloudformation')

#a client to bedrock runtime.
bedrock_client = boto3.client('bedrock-runtime')

# Method to obtain output variables from Cloudformation stack. 
def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "genai-data-foundation-workshop"

outputs = get_cfn_outputs(cloudformation_stack_name)
aos_host = outputs['OpenSearchDomainEndpoint']
s3_bucket = outputs['s3BucketTraining']
bedrock_inf_iam_role = outputs['BedrockBatchInferenceRole']
bedrock_inf_iam_role_arn = outputs['BedrockBatchInferenceRoleArn']
sagemaker_notebook_url = outputs['SageMakerNotebookURL']

# We will just print all the variables so you can easily copy if needed.
outputs

# Recursive character chunking
The simplest way to chunk a document is by length, while keeping paragraphs and lines together so that the text does not lose its meaning. We will use the LangChain library, which provides a recursive character text splitter that splits data by length yet keeps lines and paragraphs together as much as possible.
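The idea behind recursive splitting can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` function tries the coarsest separator first, merges small parts up to the size limit, and falls back to finer separators only for parts that are still too long. Chunk overlap is left out to keep the sketch short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most chunk_size characters, preferring coarser separators."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part is still too large: retry it with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample_text = ("Para one line one.\nPara one line two.\n\n"
               "Para two is a single longer line of text.")
chunks = recursive_split(sample_text, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph fits within the limit and stays whole, while the second paragraph is too long and is split at a word boundary.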


In [None]:
# This method splits the text into chunks at paragraph and line boundaries, keeping each chunk
# as close to 1000 characters as possible. It also overlaps text between chunks when it has to
# split a line or paragraph in the middle.

def recursive_character_chunking(texts): 
    # texts is a list of strings; each string is split independently
    
    text_splitter = RecursiveCharacterTextSplitter( #create a text splitter
        #separators=["\n\n", "\n", ".", " "], #uncomment to split at (1) paragraph, (2) line, (3) sentence, or (4) word, in that order
        chunk_size=1000, #maximum chunk length in characters
        chunk_overlap=100, #number of characters that can overlap with previous chunk
        length_function=len,
        is_separator_regex=False,
    )
    
    docs = text_splitter.create_documents(texts) #returns a list of Document objects, one per chunk
    
    return docs #each Document holds its chunk text in page_content

Let's run this method on an excerpt of AWS documentation that covers several services, including Amazon Bedrock Titan models and Amazon Textract. You will notice that recursive chunking creates chunks with overlaps, which helps when a sentence would otherwise be cut in the middle, but it fails to keep documentation for different services in separate chunks. For example, chunk no. 9 is a mix of Titan and Lambda docs.

In [None]:
text = ""

# lets load text from our prepared aws-docs-excerpt from various services.
# this document contains only sections of docs across multiple services.

with open('aws-docs-excerpt.txt', 'r') as f:
    text = f.read()


docs = recursive_character_chunking([text])

# the method prints chunks
def print_chunks(data):
    # Let's print the chunks -- notice the overlap between chunk 6 and 7
    # However, mostly it is a clean separation at the end of the sentence.
    i = 1
    for doc in data:
        print(f"---------START OF CHUNK {i}------")
        print(f"{doc.page_content}")
        print(f"---------END OF CHUNK {i}------\n\n")
        i+=1
        
print_chunks(docs)

### Connect to Amazon OpenSearch Service
The following cell connects to OpenSearch. We will use `aos_client` throughout this lab. If you get a security token expiration error, run this cell again to refresh the authentication.

In [None]:
# For this lab we will use credentials that we have already created in AWS Secrets Manager. Secrets
# Manager allows you to store secrets securely and retrieve them through code in a safe manner.

secrets_client = boto3.client('secretsmanager')
aos_credentials = json.loads(secrets_client.get_secret_value(SecretId=outputs['DBSecret'])['SecretString'])

auth = (aos_credentials['username'], aos_credentials['password'])

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)

## Loading text chunks in opensearch to run semantic search over chunks

#### Let's first create a few helper methods to work with vector indices and embeddings.

`embed_phrase` - this method calls the Amazon Bedrock API and converts a text into an embedding.

`get_model_dimension` - this helper method takes a model id and returns its dimensions. 

`create_opensearch_vector_index` - this method creates a vector index in opensearch. It uses model id to determine the dimension of the vector field.


In [None]:
#this method creates an embedding from a given text.
#note: the "inputText" payload below is the Titan request format; other model families expect a different request body.
def embed_phrase( text, model_id, bedrock_client ):
    accept = "application/json"
    contentType = "application/json"

    # Prepare the request payload
    request_payload = json.dumps({"inputText": text})

    response = bedrock_client.invoke_model(body=request_payload, modelId=model_id, accept=accept, contentType=contentType)

    # Parse the JSON response body
    response_body = json.loads(response.get('body').read())

    # Extract the embedding from the response
    embedding = response_body.get("embedding")
    return embedding

def get_model_dimension(model_id):
    #returns the output embedding dimension (not the maximum input tokens) for a model
    if model_id=="amazon.titan-embed-text-v2:0":
        return 1024
    if model_id.startswith("cohere"):
        return 1024
    if model_id.startswith("amazon.titan-embed-text-v1"):
        return 1536
    if model_id.startswith("amazon.titan-embed-image-v1"):
        return 1024
    raise ValueError(f"Unknown embedding model: {model_id}")
    
    
def create_opensearch_vector_index(index_name, model_id):    
    knn_index = {
        "settings": {
            "index.knn": True,
            "index.knn.space_type": "cosinesimil",
            "analysis": {
              "analyzer": {
                "default": {
                  "type": "standard",
                  "stopwords": "_english_"
                }
              }
            }
        },
        "mappings": {
            "properties": {
                "chunk_vector": {
                    "type": "knn_vector",
                    #we will set dimension based on selected model
                    "dimension": get_model_dimension(model_id=model_id), 
                    "store": True
                },
                "chunk_content": {
                    "type": "text",
                    "store": True
                }
            }
        }
    }

    #delete the index if it already exists, then create it again
    if aos_client.indices.exists(index=index_name):
        print("Recreating index '" + index_name + "' on cluster.")
        aos_client.indices.delete(index=index_name)
    else:
        print("Index '" + index_name + "' not found. Creating index on cluster.")
    aos_client.indices.create(index=index_name,body=knn_index,ignore=400)


### Create a vector index in opensearch
The following code creates an index for our knowledge base. It uses the `create_opensearch_vector_index` Python method that we created earlier.

In [None]:
index_name_recursive_chunk = "aws_docs_recursive_chunk_index"
model_id = "amazon.titan-embed-text-v2:0"
create_opensearch_vector_index(index_name_recursive_chunk, model_id)


### Creating embeddings
The following cell calls the `embed_phrase` method that we created earlier. This method creates an embedding using a given model, in this case `amazon.titan-embed-text-v2:0`. We print the first 5 floating point numbers from the array of 1024.

In [None]:
#test calling the embed_phrase method defined above to get an embedding from the given model.
embedding = embed_phrase("Testing amazon bedrock models", model_id, bedrock_client=bedrock_client)
embedding[:5]

### Bulk load data into OpenSearch
The following cell defines a method that takes text chunks and loads them into OpenSearch using the bulk API. We print our progress as we load data using the `alive_bar` and `bar()` methods.
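The bulk API takes a newline-delimited body in which every document is preceded by an action line. As a minimal sketch of that format, the hypothetical `build_bulk_body` helper below assembles the body for a couple of made-up records; the index name `demo_index` and the tiny two-dimensional vectors are placeholders, and nothing is sent to OpenSearch.

```python
import json

def build_bulk_body(records, index_name):
    """Return an NDJSON string: one action line followed by one document line per record."""
    action = json.dumps({"index": {"_index": index_name}})
    lines = []
    for record in records:
        lines.append(action)              # tells OpenSearch to index the next line
        lines.append(json.dumps(record))  # the document itself
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)

# Placeholder records that mimic the fields used in this lab.
sample_records = [
    {"chunk_content": "first chunk", "chunk_vector": [0.1, 0.2]},
    {"chunk_content": "second chunk", "chunk_vector": [0.3, 0.4]},
]
bulk_body = build_bulk_body(sample_records, "demo_index")
print(bulk_body)
```

Sending the body in batches, as the method in the next cell does, keeps each request at a manageable size.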

In [None]:
#process all the chunks and get embeddings from Bedrock for each text chunk
def bulk_load_chunks_in_opensearch(docs, index_name, model_id, bedrock_client):
    chunks = []

    for doc in docs:
        chunks.append({
            #following field is the actual text of the chunk
            "chunk_content": doc.page_content, 
         
            #following json field will contain vector embedding
            "chunk_vector": embed_phrase(doc.page_content, model_id, bedrock_client)
        })

    #load data into opensearch - every chunk will be separate opensearch record/document.
    cnt = 0
    batch = 0
    action = json.dumps({ "index": { "_index": index_name } })
    body_ = ''

    print(f"loading {len(docs)} chunks")

    with alive_bar(len(docs), force_tty = True) as bar:
        for doc in chunks:

            payload=doc #each doc is json document
            body_ = body_ + action + "\n" + json.dumps(payload) + "\n"
            cnt = cnt+1

            if(cnt == 100):

                response = aos_client.bulk(
                                    index = index_name,
                                     body = body_)


                cnt = 0
                batch = batch +1
                body_ = ""
                print("Total Bulk batches completed: "+str(batch))

            bar()


        #process last batch
        if body_ != "":
            response = aos_client.bulk(
                                    index = index_name,
                                     body = body_)


            batch = batch +1
            body_ = ''
            print("Total Bulk batches completed: "+str(batch))


Now let's call the method above to bulk load our recursive character chunks into OpenSearch.


In [None]:
bulk_load_chunks_in_opensearch(docs, index_name_recursive_chunk, model_id, bedrock_client)

### Search data using vector search (semantic search)
The following method runs a semantic search over our index and returns the top N items.
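Conceptually, a vector search ranks the stored chunks by how close their vectors are to the query vector. The sketch below shows an exact, brute-force version of that ranking using cosine similarity over a few invented three-dimensional vectors. It is only an illustration: the `top_n` helper and the toy records are hypothetical, and OpenSearch typically uses an approximate nearest neighbor index rather than comparing against every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vector, records, n=3):
    """Return the chunk_content of the n records most similar to the query vector."""
    scored = sorted(
        records,
        key=lambda r: cosine_similarity(query_vector, r["chunk_vector"]),
        reverse=True,
    )
    return [r["chunk_content"] for r in scored[:n]]

# Invented records with tiny vectors; real embeddings have hundreds of dimensions.
toy_records = [
    {"chunk_content": "VPN chunk", "chunk_vector": [0.9, 0.1, 0.0]},
    {"chunk_content": "Textract chunk", "chunk_vector": [0.0, 1.0, 0.1]},
    {"chunk_content": "VPC chunk", "chunk_vector": [0.8, 0.3, 0.1]},
]
toy_top = top_n([1.0, 0.0, 0.0], toy_records, n=2)
print(toy_top)
```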

In [None]:
def retrieve_opensearch_with_semantic_search(phrase, index_name, model_id, bedrock_client, n=3 ):
    search_vector = embed_phrase(phrase, model_id=model_id, bedrock_client=bedrock_client)
    osquery={
        "_source": {
            "exclude": [ "chunk_vector" ]
        },
        
      "size": n,
      "query": {
        "knn": {
          "chunk_vector": {
            "vector":search_vector,
            "k":n
          }
        }
      }
    }

    res = aos_client.search(index=index_name, 
                           body=osquery,
                           stored_fields=["chunk_content"],
                           explain = True)
    top_result = res['hits']['hits']
    
    results = []
    
    for entry in top_result:
        result = {
            "chunk_content":entry['_source']['chunk_content'],
        }
        results.append(result)
    
    return results



### Run a few sample semantic searches
You may change the question below.

In [None]:
question_on_docs="What VPN connections do in a VPC?"
example_request = retrieve_opensearch_with_semantic_search(phrase=question_on_docs, index_name=index_name_recursive_chunk, model_id=model_id, bedrock_client=bedrock_client, n=2)
print(json.dumps(example_request, indent=4))

# Semantic chunking
Semantic chunking splits the text so that each chunk is semantically cohesive. The method uses an embedding model to calculate the similarity between sentences, and it places chunk boundaries where the semantic distance between sentences changes sharply. It uses a rolling window: it keeps adding sentences and measures the distance between the current window and the incoming sentence. In principle, this detects a change in topic, although not always accurately. A breakpoint threshold is a statistical measure used to decide when the distance is large enough to count as a topic change. This keeps chunks well suited to semantic matching.
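The breakpoint idea can be sketched without an embedding model. The example below is a simplified illustration, not the LangChain implementation: it assumes hand-made two-dimensional vectors in place of real sentence embeddings, compares each sentence only with the next one instead of using a rolling window, and uses a hypothetical `toy_semantic_split` helper with an arbitrary 80th percentile threshold.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)

def toy_semantic_split(sentences, vectors, pct=80):
    """Start a new chunk wherever the distance to the next sentence exceeds the percentile threshold."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Invented sentences and vectors: the first two point one way, the last two another.
toy_sentences = ["VPN connects networks.", "VPN uses tunnels.",
                 "Textract reads documents.", "Textract finds tables."]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
toy_chunks = toy_semantic_split(toy_sentences, toy_vectors)
print(toy_chunks)
```

Only the distance between the second and third sentence exceeds the threshold, so the text is split into one VPN chunk and one Textract chunk.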

If you would like more detail, read level 4 in [Greg Kamradt's tutorial](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb).

LangChain offers semantic chunking as well as the ability to call an embedding model. We will first choose an embedding model and a breakpoint threshold type for our semantic chunking. After selecting the model and threshold, please move to the next cell.

In [None]:
#lets initialize the code for drop down box input.
import ipywidgets as widgets
from ipywidgets import interactive

#defaults
model_id='amazon.titan-embed-text-v2:0'
threshold = 'percentile'

#list of embedding models in bedrock
model_list=['cohere.embed-english-v3','cohere.embed-multilingual-v3',
            'amazon.titan-embed-text-v1','amazon.titan-embed-text-v2:0',
           'amazon.titan-embed-image-v1']

#semantic chunking 
threshold_list=['percentile', 'standard_deviation', 'interquartile']
    
#the drop down boxes start from the defaults defined above
drop1 = widgets.Dropdown(options=model_list, value=model_id, description='Model:', disabled=False)
drop2 = widgets.Dropdown(options=threshold_list, value=threshold, description='Threshold:', disabled=False)


The following code runs semantic chunking on the AWS docs text. It also shows drop-down boxes that let you change the model and threshold type, so you can see the effects of various models and breakpoint thresholds.

You can select a specific model and threshold and run through this lab. You can come back to this cell and repeat this process with a different semantic chunking strategy. This will give you a good indication of how semantic chunking works.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
#from langchain_community.embeddings import BedrockEmbedding
from langchain_community.document_loaders import PDFMinerLoader


semantic_chunks = []

#method that is called when drop down boxes are shown or changed
def update_dropdown(selected_model, selected_threshold):
    #use the values passed in by the widgets
    selected_model = selected_model.lower()
    selected_threshold = selected_threshold.lower()
    info = f"Selected embedding model: {selected_model}. Selected threshold: {selected_threshold}!"
    display(info)
    #perform_semantic_chunking also stores the result in the global semantic_chunks list
    chunks = perform_semantic_chunking(text=text, model_id=selected_model, threshold=selected_threshold)
    print_chunks(chunks)



    
# method runs semantic chunking on text for a given model and threshold.    
def perform_semantic_chunking(text, model_id, threshold):
    print(f"Chunking using {model_id} and {threshold} threshold breaking point")
    
    #using lang chain's Bedrock embedding object
    embeddings = BedrockEmbeddings(region_name=region, model_id=model_id)

    #using lang chain's semantic chunker to chunk
    text_splitter = SemanticChunker(
        embeddings, breakpoint_threshold_type= threshold
    )

    docs = text_splitter.create_documents([text])
    semantic_chunks.clear()
    semantic_chunks.extend(docs)
    return docs

#lets run semantic chunking and display the drop down. 
w = interactive(update_dropdown, selected_model=drop1, selected_threshold=drop2) 
display(w)



#when you change a value, it takes around 15-20 seconds to refresh the chunks

In [None]:
semantic_chunks

You can select different combinations of embedding model and breakpoint threshold type to see which one splits the content best. You will find that results vary from one model to another and between breakpoint threshold techniques. A combination that works well for this text will not necessarily be the best choice for other content, and a clean split is not necessarily optimal for retrieval. We have to test the chunks with our queries to know whether they answer our questions well.

### Loading semantic chunks in opensearch

#### Let's first create an index with KNN field.


In [None]:
semantic_chunk_index_name = "aws_docs_semantic_chunk_index"
create_opensearch_vector_index(semantic_chunk_index_name, model_id)

In [None]:
#test calling the embed_phrase method defined above to get an embedding from the selected model.
embedding = embed_phrase("Testing amazon bedrock models", model_id, bedrock_client=bedrock_client)
embedding[:5]

### Bulk load data into OpenSearch
The following cell takes these semantic chunks and loads them into OpenSearch using the bulk load method.

In [None]:
bulk_load_chunks_in_opensearch(semantic_chunks, semantic_chunk_index_name, model_id, bedrock_client)

### Let's run vector searches on semantic chunks
The following cell runs a vector search on the semantically chunked data. You may change the question to see how the output differs from before.

You will notice that for the question `What VPN connections do in a VPC?` we get a much better result with semantic chunking.

In [None]:
question_on_docs="What VPN connections do in a VPC?"

example_request = retrieve_opensearch_with_semantic_search(phrase=question_on_docs, index_name=semantic_chunk_index_name, model_id=model_id, bedrock_client=bedrock_client, n=2)
print(json.dumps(example_request, indent=4))

## Comparison
Although semantic chunking naturally produces larger chunks, you can see that it keeps most of the VPC documentation in a single chunk. This makes semantic chunking a better fit for this text.

In [None]:
question_on_docs="What VPN connections do ?"

print("RECURSIVE CHUNKING SEARCH RESULTS..")
example_request1 = retrieve_opensearch_with_semantic_search(phrase=question_on_docs, index_name=index_name_recursive_chunk, model_id=model_id, bedrock_client=bedrock_client, n=2)
print(json.dumps(example_request1, indent=4))


print("SEMANTIC CHUNKING SEARCH RESULTS..")
example_request2 = retrieve_opensearch_with_semantic_search(phrase=question_on_docs, index_name=semantic_chunk_index_name, model_id=model_id, bedrock_client=bedrock_client, n=2)
print(json.dumps(example_request2, indent=4))

## Conclusion
In this lab we learned how to break text into chunks, which is a very useful skill in a RAG architecture. Note that these are just 2 of the many ways you can chunk data. Chunking has a great impact on the quality of your RAG application, so learning different ways to chunk is a handy skill to acquire.

The lab is now finished. You may go back to the lab instructions.