In [3]:
# Add parent directory to sys.path so we can import wikipedia_downloader.py
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
sys.path.append(os.path.abspath(os.getcwd()))

In [7]:
# Import the Wikipedia downloader module
from wikipedia_downloader import download_wikipedia_to_markdown, get_page_title_from_url

# Example usage: Download a Wikipedia page
wikipedia_url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
output_file = "downloads/artificial_intelligence.md"

import os

if not os.path.exists(output_file):
    # Download the page
    print(f"Downloading Wikipedia page from {wikipedia_url} to {output_file}...")
    success = download_wikipedia_to_markdown(wikipedia_url, output_file)
    if success:
        print("Wikipedia page downloaded successfully!")
        
        # Read and display first 500 characters of the downloaded content
        with open(output_file, 'r', encoding='utf-8') as f:
            content = f.read()
            print("\nFirst 500 characters of the downloaded content:")
            print(content[:500] + "..." if len(content) > 500 else content)
    else:
        print("Failed to download Wikipedia page")
else:
    print(f"File '{output_file}' already exists. Skipping download.")
    success = True


File 'downloads/artificial_intelligence.md' already exists. Skipping download.


# Semantic Chunking with LlamaIndex

This notebook uses LlamaIndex's `SemanticSplitterNodeParser` to split a PDF document where the topic shifts rather than at a fixed chunk size, and compares the result against a `SentenceSplitter` baseline.

Semantic chunking adaptively picks breakpoints between sentences using embedding similarity, so that each chunk tends to contain semantically related content.
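To make the breakpoint idea concrete, here is a minimal pure-Python sketch. It is not LlamaIndex's implementation: the helper names (`cosine_distance`, `percentile`, `semantic_chunks`) and the 2-D embeddings are hypothetical, and it assumes one non-zero embedding per sentence. A new chunk starts wherever the distance between adjacent sentences exceeds the chosen percentile of all adjacent distances.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def semantic_chunks(sentences, embeddings, threshold_pct=90):
    """Group sentences, breaking where the adjacent distance exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            # Large jump in meaning: close the current chunk and start a new one.
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks

# Hypothetical toy data: two sentences about cats, then two about markets.
sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_chunks(sentences, embeddings, threshold_pct=90)
print(chunks)
```

With these toy vectors the only large distance is between the second and third sentences, so the sketch produces one chunk per topic.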

In [52]:
# Import required libraries
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

In [53]:
from dotenv import load_dotenv
import os
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

load_dotenv()

api_key = os.environ.get('OPENAI_API_KEY')
azure_endpoint = os.environ.get('OPENAI_API_BASE')
deployment_name = os.environ.get('LLAMAINDEX_DEPLOYMENT_NAME')
api_version = "2024-02-15-preview"
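The cell above reads its settings with `os.environ.get`, which silently yields `None` when a variable is missing. A small guard can fail fast instead. This is a sketch only: `require_env` is a hypothetical helper, and the values in the example are placeholders, not real credentials.

```python
import os

def require_env(names, environ=None):
    """Return {name: value} for the given variables, raising if any is unset or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Example with an explicit mapping so the sketch runs without real credentials.
settings = require_env(
    ["OPENAI_API_KEY", "OPENAI_API_BASE"],
    environ={
        "OPENAI_API_KEY": "placeholder-key",
        "OPENAI_API_BASE": "https://example.invalid",
    },
)
print(sorted(settings))
```

In the notebook, calling `require_env([...])` without the `environ` argument after `load_dotenv()` would check the real environment.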

In [54]:
from llama_index.core import Document

# file_name = "C:\\temp\\blackhatpython2ndedition.pdf"
file_name = "C:\\temp\\blackhatpython.pdf"

# Load with default behavior (one doc per page)
documents = SimpleDirectoryReader(input_files=[file_name]).load_data()

# Combine all pages into one document
combined_text = "\n\n".join([doc.text for doc in documents])
combined_document = Document(text=combined_text)

# Replace the document list with single combined document
documents = [combined_document]

print(f"Loaded {len(documents)} document(s)")
print(f"Document content preview: {documents[0].text[:1000]}...")

Loaded 1 document(s)
Document content preview: 9
FUN WITH EXFILTRATION
Gaining access to a target network is only 
a part of the battle. To make use of your 
access, you want to be able to exfiltrate 
documents, spreadsheets, or other bits of data 
from the target system. Depending on the defense 
mechanisms in place, this last part of your attack can 
prove to be tricky. There might be local or remote 
systems (or a combination of both) that work to vali -
date processes that open remote connections as well 
as determine whether those processes should be able 
to send information or initiate connections outside of 
the internal network.


140   Chapter 9
In this chapter, we’ll create tools that enable you to exfiltrate encrypted 
data. First, we’ll write a script to encrypt and decrypt files. We’ll then use 
that script to encrypt information and transfer it from the system by using 
three methods: email, file transfers, and posts to a web server. For each 
of these methods, we’ll wri

### Key points:

**No built-in size cap:** `SemanticSplitterNodeParser` does not currently expose a parameter that sets a hard maximum chunk size in tokens or characters.

**Known limitation:** LlamaIndex users have reported that chunks exceeding a model's input limit can cause errors during embedding or inference.

**Workarounds:**

- Custom subclassing: Subclass `SemanticSplitterNodeParser` and add a post-processing step that checks chunk sizes and further splits any chunk that exceeds your limit.

- Safety net pattern: Run a secondary, simpler splitter (e.g., `SentenceSplitter`) over the semantic chunks to break up any that are still oversized.
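The safety net pattern can be sketched without any library. The function below is a hypothetical stand-in for a secondary splitter: `cap_chunk_size` and its character-based `max_chars` limit are assumptions for illustration (real model limits are in tokens). It leaves small chunks alone and re-splits oversized ones at sentence boundaries, hard-cutting only when a single sentence is itself too long.

```python
import re

def cap_chunk_size(chunks, max_chars):
    """Return chunks where no item exceeds max_chars, re-splitting oversized ones."""
    capped = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            capped.append(chunk)
            continue
        current = ""
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            # Hard-cut any single sentence that is longer than the limit.
            while len(sentence) > max_chars:
                if current:
                    capped.append(current)
                    current = ""
                capped.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            candidate = f"{current} {sentence}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                capped.append(current)
                current = sentence
        if current:
            capped.append(current)
    return capped

chunks = ["Short chunk.", "First sentence here. Second sentence here. Third one."]
result = cap_chunk_size(chunks, max_chars=45)
print(result)
```

In the notebook, the same idea could be applied to node text, for example by passing the content of any oversized semantic node through a `SentenceSplitter`; the exact wiring depends on how you rebuild the nodes and is left as an assumption here.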

In [55]:
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-3-small",
    deployment_name="text-embedding-3-small",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# Configure semantic splitter
splitter = SemanticSplitterNodeParser(
    buffer_size=1,  # Number of sentences to group together when evaluating semantic similarity
    breakpoint_percentile_threshold=90,  # Split where the distance between adjacent sentence groups exceeds this percentile
    embed_model=embed_model
)

# Create baseline splitter for comparison (chunk_size is measured in tokens, not characters)
base_splitter = SentenceSplitter(chunk_size=512)

In [57]:
# Generate semantic chunks
print("Generating semantic chunks...")
semantic_nodes = splitter.get_nodes_from_documents(documents, show_progress=True)

print("Generating baseline chunks...")
baseline_nodes = base_splitter.get_nodes_from_documents(documents, show_progress=True)

print(f"Semantic chunking produced {len(semantic_nodes)} chunks")
print(f"Baseline chunking produced {len(baseline_nodes)} chunks")

Generating semantic chunks...


Generating embeddings: 100%|██████████| 309/309 [00:14<00:00, 21.30it/s]
Parsing nodes: 100%|██████████| 1/1 [00:14<00:00, 14.61s/it]


Generating baseline chunks...


Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 29.50it/s]

Semantic chunking produced 32 chunks
Baseline chunking produced 43 chunks

## Inspecting Semantic Chunks

Let's examine the first few chunks created by semantic chunking to see how they group semantically related content.

In [58]:
# Display first few semantic chunks
for i, node in enumerate(semantic_nodes[:3]):
    print(f"\n=== Semantic Chunk {i+1} ===")
    print(f"Length: {len(node.get_content())} characters")
    print(f"Content preview: {node.get_content()[:1000]}...")
    print("-" * 50)


=== Semantic Chunk 1 ===
Length: 2206 characters
Content preview: 9
FUN WITH EXFILTRATION
Gaining access to a target network is only 
a part of the battle. To make use of your 
access, you want to be able to exfiltrate 
documents, spreadsheets, or other bits of data 
from the target system. Depending on the defense 
mechanisms in place, this last part of your attack can 
prove to be tricky. There might be local or remote 
systems (or a combination of both) that work to vali -
date processes that open remote connections as well 
as determine whether those processes should be able 
to send information or initiate connections outside of 
the internal network.


140   Chapter 9
In this chapter, we’ll create tools that enable you to exfiltrate encrypted 
data. First, we’ll write a script to encrypt and decrypt files. We’ll then use 
that script to encrypt information and transfer it from the system by using 
three methods: email, file transfers, and posts to a web server. For each 
of thes

## Comparing with Baseline Chunking

Now let's compare the semantic chunks with baseline fixed-size chunks to see the difference in content organization.

In [59]:
# Display first few baseline chunks for comparison
for i, node in enumerate(baseline_nodes[:3]):
    print(f"\n=== Baseline Chunk {i+1} ===")
    print(f"Length: {len(node.get_content())} characters")
    print(f"Content preview: {node.get_content()[:300]}...")
    print("-" * 50)


=== Baseline Chunk 1 ===
Length: 2205 characters
Content preview: 9
FUN WITH EXFILTRATION
Gaining access to a target network is only 
a part of the battle. To make use of your 
access, you want to be able to exfiltrate 
documents, spreadsheets, or other bits of data 
from the target system. Depending on the defense 
mechanisms in place, this last part of your atta...
--------------------------------------------------

=== Baseline Chunk 2 ===
Length: 2085 characters
Content preview: Encrypting and Decrypting Files
We’ll use the pycryptodomex package for the encryption tasks. You can install 
it with this command:
$ pip install pycryptodomex
Now, open up cryptor.py and let’s import the libraries we’ll need to get 
started:
1 from Cryptodome.Cipher import AES, PKCS1_OAEP
2 from C...
--------------------------------------------------

=== Baseline Chunk 3 ===
Length: 2054 characters
Content preview: For example, the TLS communication between your 
browser and a web server involves a hybr

In [60]:
# Analyze chunk size distribution
semantic_sizes = [len(node.get_content()) for node in semantic_nodes]
baseline_sizes = [len(node.get_content()) for node in baseline_nodes]

print("Chunk Size Statistics:")
print(f"Semantic chunks - Min: {min(semantic_sizes)}, Max: {max(semantic_sizes)}, Avg: {sum(semantic_sizes)/len(semantic_sizes):.1f}")
print(f"Baseline chunks - Min: {min(baseline_sizes)}, Max: {max(baseline_sizes)}, Avg: {sum(baseline_sizes)/len(baseline_sizes):.1f}")

Chunk Size Statistics:
Semantic chunks - Min: 13, Max: 5313, Avg: 1757.2
Baseline chunks - Min: 943, Max: 2727, Avg: 1921.7
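Min, max, and average hide how the sizes are spread, which matters when a few very large chunks may exceed a model limit. The sketch below adds the median, the standard deviation, and a count of chunks over a limit. The helper names and the 4000-character limit are hypothetical, and the sample list is illustrative: only its smallest and largest values (13 and 5313) come from the output above.

```python
import statistics

def summarize_sizes(sizes):
    """Return count, min, median, max, and population standard deviation."""
    ordered = sorted(sizes)
    return {
        "count": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "stdev": statistics.pstdev(ordered),
    }

def count_over(sizes, limit):
    """Count how many chunk sizes exceed the given limit."""
    return sum(1 for size in sizes if size > limit)

# Illustrative sizes; in the notebook, pass semantic_sizes or baseline_sizes instead.
sample_sizes = [13, 800, 1500, 5313]
summary = summarize_sizes(sample_sizes)
oversized = count_over(sample_sizes, limit=4000)
print(summary, oversized)
```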


## Setting up Query Engines

We'll create query engines for both chunking methods to test their effectiveness in retrieving relevant information.

In [61]:
from llama_index.core import Settings
from llama_index.core.response.notebook_utils import display_source_node

# Configure both embedding model and LLM in global settings
Settings.embed_model = embed_model

llm = AzureOpenAI(
    model="gpt-4o-mini",
    deployment_name=deployment_name,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
    engine=deployment_name
)

Settings.llm = llm

# Create vector indexes and query engines
semantic_index = VectorStoreIndex(semantic_nodes)
semantic_query_engine = semantic_index.as_query_engine()

baseline_index = VectorStoreIndex(baseline_nodes)
baseline_query_engine = baseline_index.as_query_engine()

print("Query engines created successfully!")

Query engines created successfully!


## Testing Queries

Let's test both chunking approaches with sample queries to see how they perform.

In [25]:
# Test query - modify this based on your document content
# test_query = "What are the main topics discussed in this document?"
test_query = "How can I exfiltrate data with an email?"

print("=== Semantic Chunking Response ===")
semantic_response = semantic_query_engine.query(test_query)
print(semantic_response)

print("\n" + "="*50)
print("=== Baseline Chunking Response ===")
baseline_response = baseline_query_engine.query(test_query)
print(baseline_response)

=== Semantic Chunking Response ===
To exfiltrate data via email, you can use a function that connects to an SMTP server. First, you need to import the necessary libraries, such as `smtplib` for email functionality. Set up the SMTP server details, including the server address, port, account name, and password. 

You can create a function, such as `plain_email`, which takes the subject and contents of the email as inputs. Inside this function, construct the email message by including the subject, sender, and recipient information. Then, establish a connection to the SMTP server, log in with your credentials, and send the email containing the data you wish to exfiltrate. Finally, ensure to close the connection to the server after sending the email.

=== Baseline Chunking Response ===
To exfiltrate data via email, you can use a function that connects to an SMTP server. First, you need to specify the SMTP server details, including the server address, port, account name, and password. Then, 

In [48]:
# Display source nodes for semantic chunking response
print("=== Source Nodes (Semantic Chunking) ===")
for i, node in enumerate(semantic_response.source_nodes):
    print(f"\nNode {i+1} (Similarity: {node.score:.4f}):")
    display_source_node(node, source_length=500)

=== Source Nodes (Semantic Chunking) ===

Node 1 (Similarity: 0.5509):


**Node ID:** aea49140-fc08-497b-859b-2e3850391976<br>**Similarity:** 0.5508637623112517<br>**Text:** 150   Chapter 9
We pass the exfiltrate function the path to a document and the method 
of exfiltration we want to use  1. When the method involves a file transfer 
(transmit or plain_ftp), we need to provide an actual file, not an encoded 
string. In that case, we read in the file from its source, encrypt the contents, 
and write a new file into a temporary directory 2. We call the EXFIL diction-
ary to dispatch the corresponding method, passing in the new encrypted 
document path to exfiltra...<br>


Node 2 (Similarity: 0.5189):


**Node ID:** eed97e66-6e2c-490f-b2ac-38a6964935ae<br>**Similarity:** 0.5188609583743866<br>**Text:** Fun with Exfiltration    143
2 import win32com.client
3 smtp_server = 'smtp.example.com'
smtp_port = 587
smtp_acct = 'tim@example.com'
smtp_password = 'seKret'
tgt_accts = ['tim@elsewhere.com']
We import smptlib, which we need for the cross-platform email func -
tion 1. We’ll use the win32com package to write our Windows-specific 
function 2. To use the SMTP email client, we need to connect to a Simple 
Mail Transfer Protocol (SMTP) server (an example might be smtp.gmail.com 
if you have a Gm...<br>

In [49]:
# Display source nodes for baseline chunking response
print("=== Source Nodes (Baseline Chunking) ===")
for i, node in enumerate(baseline_response.source_nodes):
    print(f"\nNode {i+1} (Similarity: {node.score:.4f}):")
    display_source_node(node, source_length=500)

=== Source Nodes (Baseline Chunking) ===

Node 1 (Similarity: 0.5354):


**Node ID:** 9d9cddaf-3976-4dc5-b7e4-b0e9f297ad4a<br>**Similarity:** 0.5354182524902213<br>**Text:** Fun with Exfiltration    143
2 import win32com.client
3 smtp_server = 'smtp.example.com'
smtp_port = 587
smtp_acct = 'tim@example.com'
smtp_password = 'seKret'
tgt_accts = ['tim@elsewhere.com']
We import smptlib, which we need for the cross-platform email func -
tion 1. We’ll use the win32com package to write our Windows-specific 
function 2. To use the SMTP email client, we need to connect to a Simple 
Mail Transfer Protocol (SMTP) server (an example might be smtp.gmail.com 
if you have a Gm...<br>


Node 2 (Similarity: 0.5150):


**Node ID:** ee793005-ee46-402b-b767-295717e50468<br>**Similarity:** 0.514965630715785<br>**Text:** 150   Chapter 9
We pass the exfiltrate function the path to a document and the method 
of exfiltration we want to use  1. When the method involves a file transfer 
(transmit or plain_ftp), we need to provide an actual file, not an encoded 
string. In that case, we read in the file from its source, encrypt the contents, 
and write a new file into a temporary directory 2. We call the EXFIL diction-
ary to dispatch the corresponding method, passing in the new encrypted 
document path to exfiltra...<br>
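In the outputs above, both engines return passages that begin on the same two pages (143 and 150), only in a different order. A rough programmatic check is sketched below. `normalize` and `compare_retrievals` are hypothetical helpers, and comparing whitespace-normalized prefixes is a crude assumption that only works when the retrieved chunks start at the same place, as they do here.

```python
def normalize(text, prefix_chars=80):
    """Collapse whitespace and keep a prefix so passages with equal starts compare equal."""
    return " ".join(text.split())[:prefix_chars]

def compare_retrievals(texts_a, texts_b):
    """Report how many passages two result lists share and whether the order matches."""
    keys_a = [normalize(text) for text in texts_a]
    keys_b = [normalize(text) for text in texts_b]
    shared = set(keys_a) & set(keys_b)
    union = set(keys_a) | set(keys_b)
    return {
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 1.0,
        "same_order": keys_a == keys_b,
    }

# Opening words of the passages shown in the two outputs above.
semantic_texts = [
    "150   Chapter 9\nWe pass the exfiltrate function",
    "Fun with Exfiltration    143\n2 import win32com.client",
]
baseline_texts = [
    "Fun with Exfiltration    143\n2 import win32com.client",
    "150   Chapter 9\nWe pass the exfiltrate function",
]
result = compare_retrievals(semantic_texts, baseline_texts)
print(result)
```

In the notebook, the two lists could be built from the responses, for example with `[n.node.get_content() for n in semantic_response.source_nodes]`.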

## Advanced Configuration

You can fine-tune the semantic splitter parameters for better results:

- `buffer_size`: Number of sentences to group when evaluating similarity
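- `breakpoint_percentile_threshold`: Percentile of the distance between adjacent sentence groups above which a breakpoint is placed. Higher values yield fewer, larger chunks; lower values yield more, smaller chunks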
- `embed_model`: Different embedding models may produce different chunking results
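The effect of the threshold can be previewed without calling an embedding model. The sketch below counts how many chunks a given percentile would produce from a list of adjacent-sentence distances. The distances are hypothetical values chosen for illustration, and `chunk_count` is an assumed helper, not part of LlamaIndex.

```python
import math

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def chunk_count(distances, threshold_pct):
    """Chunks produced when every distance above the percentile cutoff is a breakpoint."""
    cutoff = percentile(distances, threshold_pct)
    return 1 + sum(1 for distance in distances if distance > cutoff)

# Hypothetical distances between ten pairs of adjacent sentence groups.
distances = [0.10, 0.20, 0.80, 0.15, 0.90, 0.05, 0.30, 0.25, 0.70, 0.12]
counts = {pct: chunk_count(distances, pct) for pct in (50, 75, 90, 95)}
for pct, count in counts.items():
    print(f"threshold={pct}: {count} chunks")
```

The chunk count never increases as the threshold rises, which is the trade-off to keep in mind when tuning: a higher threshold means fewer but larger chunks.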

In [None]:
# Example of fine-tuning semantic splitter parameters
fine_tuned_splitter = SemanticSplitterNodeParser(
    buffer_size=2,  # Group 2 sentences at a time
    breakpoint_percentile_threshold=90,  # Same threshold as before; lower it (e.g., 80) for more granular chunks
    embed_model=embed_model
)

# Generate chunks with fine-tuned parameters
fine_tuned_nodes = fine_tuned_splitter.get_nodes_from_documents(documents)

print(f"Fine-tuned semantic chunking produced {len(fine_tuned_nodes)} chunks")
print(f"Original semantic chunking produced {len(semantic_nodes)} chunks")
print(f"Baseline chunking produced {len(baseline_nodes)} chunks")

## Summary

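Semantic chunking offers several potential advantages over fixed-size chunking:

1. **Content Coherence**: Chunks tend to contain semantically related sentences
2. **Adaptive Size**: Chunk sizes vary with content structure (here, from 13 to 5313 characters)
3. **Better Retrieval**: Chunks aligned with topic boundaries can match specific queries more closely, although for the test query above both methods retrieved passages starting on the same two pages
4. **Context Preservation**: Related information is more likely to stay together

These benefits come at a cost: semantic chunking had to generate 309 embeddings (about 14 seconds here, versus under a second for the baseline) and enforces no maximum chunk size. Use semantic chunking when topic-aligned document segmentation matters more to your RAG application than speed or uniform chunk sizes.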