# Semantic Chunking for Document Processing
## Overview
This code implements a semantic chunking approach for processing and retrieving information from PDF documents. Unlike traditional methods that split text at fixed character or word counts, semantic chunking aims to create more meaningful, context-aware text segments.
## Motivation
Traditional text splitting methods often break documents at arbitrary points, potentially disrupting the flow of information and context. Semantic chunking addresses this issue by attempting to split text at more natural breakpoints, preserving semantic coherence within each chunk.
## Key Components
1. PDF processing and text extraction
2. Semantic chunking using LangChain's `SemanticChunker`
3. Vector store creation using FAISS and embeddings
4. Retriever setup for querying the processed documents
## Method Details
### Document Preprocessing
The PDF is read and converted to a single string using the `read_pdf_to_string` helper function.
### Semantic Chunking
1. Uses LangChain's `SemanticChunker` with an embedding model
2. Three breakpoint types are available, each applied to the embedding differences between consecutive sentences:
    - `percentile`: Splits at differences greater than the Xth percentile
    - `standard_deviation`: Splits at differences greater than X standard deviations
    - `interquartile`: Uses the interquartile distance to determine split points
3. In this implementation, the `percentile` method is used with a threshold of 90.
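
To make the breakpoint types above more concrete, here is a minimal pure-Python sketch of threshold-based splitting. It is not LangChain's code: the helper names (`cosine_distance`, `percentile`, `breakpoint_threshold`, `split_on_breakpoints`), the toy 2-D vectors, and the exact formulas for the `standard_deviation` and `interquartile` variants (mean plus X times the spread) are assumptions chosen for illustration. A real run would use embeddings from a model rather than hand-written vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Percentile (q in 0..100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def breakpoint_threshold(distances, kind, amount):
    """Turn a breakpoint type and amount X into a distance threshold (assumed formulas)."""
    mean = sum(distances) / len(distances)
    if kind == "percentile":
        return percentile(distances, amount)
    if kind == "standard_deviation":
        variance = sum((d - mean) ** 2 for d in distances) / len(distances)
        return mean + amount * math.sqrt(variance)
    if kind == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return mean + amount * iqr
    raise ValueError(f"unknown breakpoint type: {kind}")


def split_on_breakpoints(sentences, vectors, kind="percentile", amount=90):
    """Start a new chunk wherever consecutive sentences differ by more than the threshold."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = breakpoint_threshold(distances, kind, amount)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: the first two sentences point one way, the last two another.
sentences = [
    "CO2 traps heat.",
    "Methane traps heat too.",
    "Glaciers are melting.",
    "Sea levels are rising.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]

chunks = split_on_breakpoints(sentences, vectors, kind="percentile", amount=90)
print(chunks)
```

With this toy input, the only large jump is between the second and third sentences, so the text is split into two chunks there. A higher threshold amount produces fewer, larger chunks, and a lower one produces more, smaller chunks.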
### Vector Store Creation
1. The embedding model is used to create vector representations of the semantic chunks
2. A FAISS vector store is created from these embeddings for efficient similarity search
### Retriever Setup
The retriever is configured to fetch the 2 most relevant chunks for a given query (`k=2`).
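
As a rough illustration of what top-k retrieval over a vector store amounts to, the sketch below ranks stored chunks by distance to a query vector and keeps the closest two. It is a brute-force stand-in, not FAISS itself. The function names (`l2_distance`, `top_k`), the use of Euclidean (L2) distance, and the toy 3-D vectors are assumptions for illustration. In the actual pipeline, both chunk and query vectors come from the embedding model.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed_chunks, k=2):
    """Return the texts of the k chunks whose vectors are closest to the query.

    indexed_chunks is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed_chunks, key=lambda item: l2_distance(query_vector, item[1])
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical chunk vectors; real ones would have hundreds of dimensions.
index = [
    ("Greenhouse gases trap heat.", [0.9, 0.1, 0.0]),
    ("Fossil fuels release CO2.", [0.8, 0.2, 0.1]),
    ("Coral reefs host many species.", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, index, k=2)
print(results)
```

An exhaustive scan like this is fine for a handful of chunks. A dedicated index becomes worthwhile when the collection grows large enough that comparing the query against every vector is too slow.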
## Key Features
1. Context-Aware Splitting: Attempts to maintain semantic coherence within chunks
2. Flexible Configuration: Allows for different breakpoint types and thresholds
3. Tool Integration: Builds on LangChain and FAISS for both chunking and retrieval
## Benefits of this Approach
1. Improved Coherence: Chunks are more likely to contain complete thoughts or ideas
2. Better Retrieval Relevance: By preserving context, retrieval accuracy may be enhanced
3. Adaptability: The chunking method can be adjusted based on the nature of the documents and retrieval needs
4. Potential for Better Understanding: LLMs or downstream tasks may perform better with more coherent text segments
## Example Usage
The code includes a test query, "What is the main cause of climate change?", which demonstrates how the semantic chunking and retrieval system can be used to find relevant information in the processed document.
## Conclusion
Semantic chunking represents an advanced approach to document processing for retrieval systems. By attempting to maintain semantic coherence within text segments, it has the potential to improve the quality of retrieved information and enhance the performance of downstream NLP tasks. This technique is particularly valuable for processing long, complex documents where maintaining context is crucial, such as scientific papers, legal documents, or comprehensive reports.

In [1]:
import os
from dotenv import load_dotenv

from langchain_openai.chat_models.azure import AzureChatOpenAI

# Load Azure OpenAI settings from a local .env file
load_dotenv()
openai_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.environ.get("AZURE_OPENAI_API_KEY")
openai_deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
# Fall back to the version used in this notebook if the variable is unset
openai_api_version = os.getenv("AZURE_API_VERSION", "2024-10-01-preview")

# azure_endpoint expects the resource base URL; the client appends the
# deployment path and api-version itself
llm = AzureChatOpenAI(
    azure_deployment=openai_deployment,
    api_version=openai_api_version,
    azure_endpoint=openai_endpoint,
    api_key=openai_api_key,
    temperature=0,
    logprobs=True,
)

In [2]:
path = "./data/Understanding_Climate_Change.pdf"

In [5]:
from helper_functions import read_pdf_to_string
content = read_pdf_to_string(path)

In [4]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import AzureOpenAIEmbeddings

# Name of the embedding deployment, read from the environment loaded earlier
openai_embedding = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_ID")
embeddings = AzureOpenAIEmbeddings(
    deployment=openai_embedding,
    model="text-embedding-ada-002",
    chunk_size=16,  # maximum number of texts embedded per batch
)

# Split wherever the embedding difference exceeds the 90th percentile
text_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)


In [6]:
docs = text_splitter.create_documents([content])

In [7]:
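from langchain_community.vectorstores import FAISS

# Embed the semantic chunks and index them in a FAISS vector store
vectorstore = FAISS.from_documents(docs, embedding=embeddings)
# Return the 2 most similar chunks for each query
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})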

In [9]:
from helper_functions import retrieve_context_per_question, show_context

test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

Context 1:
The Intergovernmental Panel on Climate Change (IPCC) has 
documented these changes extensively. Ice core samples, tree rings, and ocean sediments 
provide a historical record that scientists use to understand past climate conditions and 
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous 
oxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential 
for life on Earth, as it keeps the planet warm enough to support life. However, human 
activities have intensified this natural process, leading to a warmer climate. Fossil Fuels 
Burning fossil fuels for energy releases large amounts of CO2. This includes co