In [1]:
sample_text = """
# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is the simulation of human intelligence in machines. These machines are programmed to think and learn like humans. The field was founded in 1956 at Dartmouth College.

## Machine Learning

Machine learning is a subset of AI. It allows systems to learn from data without being explicitly programmed. There are three main types of machine learning.

Supervised learning uses labelled data to train models. The model learns to map inputs to outputs. Examples include classification and regression tasks.

Unsupervised learning finds hidden patterns in unlabelled data. Clustering and dimensionality reduction are common techniques. K-means and PCA are popular algorithms.

Reinforcement learning trains agents through rewards and penalties. The agent learns by interacting with an environment. AlphaGo used reinforcement learning to beat world champions.

## Deep Learning

Deep learning uses neural networks with many layers. It has revolutionised computer vision and natural language processing. GPU computing made deep learning practical.

Convolutional Neural Networks (CNNs) excel at image recognition. They use filters to detect features like edges and shapes. ResNet and VGG are famous CNN architectures.

Recurrent Neural Networks (RNNs) handle sequential data. They have memory that persists across time steps. LSTMs solved the vanishing gradient problem in RNNs.

## Natural Language Processing

NLP enables computers to understand human language. It powers applications like chatbots, translation, and sentiment analysis. BERT and GPT transformed the NLP landscape.

Tokenisation breaks text into smaller units called tokens. It is the first step in most NLP pipelines. Words, subwords, or characters can all be tokens.

Word embeddings represent words as dense vectors. Similar words have similar vectors. Word2Vec and GloVe are classic embedding models.

## Applications

AI is used in healthcare to detect diseases from medical images. It helps doctors make faster and more accurate diagnoses. IBM Watson was one of the first AI systems in healthcare.

Self-driving cars use AI to perceive and navigate the environment. Sensors, cameras, and radar feed data into AI models. Tesla and Waymo are leaders in autonomous vehicles.

Recommendation systems power Netflix, Spotify, and Amazon. They analyse user behaviour to suggest relevant content. Collaborative filtering is a common recommendation technique.
"""

In [8]:
from langchain_core.documents import Document

documents = [Document(page_content=sample_text)]

#### Exercise 1: Character Splitting

Split the text using "." as the separator.

In [50]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=100,
    chunk_overlap=20
)
char_chunks = splitter.split_documents(documents)

print(f"Number of Chunks\t\t: {len(char_chunks)}")
print(f"\nFirst Chunk\t\t: {char_chunks[0]}")
print(f"\n Chunk Sizes\t\t: {[len(c.page_content) for c in char_chunks]}")

Created a chunk of size 124, which is longer than the specified 100


Number of Chunks		: 39

First Chunk		: page_content='# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is the simulation of human intelligence in machines'

 Chunk Sizes		: [123, 60, 50, 55, 72, 46, 97, 52, 62, 61, 38, 66, 51, 59, 69, 69, 42, 63, 57, 43, 55, 48, 51, 82, 73, 42, 57, 92, 84, 47, 80, 56, 56, 65, 52, 50, 57, 55, 61]
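
The warning above appears because `CharacterTextSplitter` cuts only at the separator: the first piece (the heading plus the first sentence) contains no "." within its first 100 characters, so it is emitted whole. The warning most likely reports 124 rather than 123 because the piece is measured before its leading newline is stripped. The following is a minimal pure-Python sketch of the split-then-merge idea under those assumptions, not LangChain's implementation; `split_on_separator` is a hypothetical helper, and overlap is left out for brevity.

```python
def split_on_separator(text, separator=".", chunk_size=100):
    """Split on one separator, then greedily merge pieces up to chunk_size.

    Simplified illustration: no overlap, and a single piece that is already
    longer than chunk_size is kept whole because there is nowhere to cut it.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(current.strip())
            current = piece
        else:
            current = candidate
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "A" * 120 + ". Short one. Another short one."
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])
```

The 120-character piece exceeds the limit but has no internal separator, so it survives as one oversized chunk, while the two short sentences are merged into a single chunk.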


#### Exercise 2: Recursive Character Splitting

In [51]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)

rec_chunks = splitter.split_documents(documents)
    
print(f"Number of Chunks\t\t: {len(rec_chunks)}")
print(f"\nFirst 3 Chunks\t\t:")
for i, chunk in enumerate(rec_chunks[:3]):
    print(f"\n --- Chunk {i+1} ({len(chunk.page_content)} characters): --- \n")
    print(chunk.page_content)


Number of Chunks		: 15

First 3 Chunks		:

 --- Chunk 1 (41 characters): --- 

# Introduction to Artificial Intelligence

 --- Chunk 2 (195 characters): --- 

Artificial Intelligence (AI) is the simulation of human intelligence in machines. These machines are programmed to think and learn like humans. The field was founded in 1956 at Dartmouth College.

 --- Chunk 3 (178 characters): --- 

## Machine Learning

Machine learning is a subset of AI. It allows systems to learn from data without being explicitly programmed. There are three main types of machine learning.
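
No `separators` argument was passed, so the splitter falls back to its defaults, `["\n\n", "\n", " ", ""]`: it tries paragraph breaks first and moves to finer separators only for pieces that are still too long. That is consistent with the output: the 41-character heading cannot be merged with the 195-character paragraph (41 + 2 + 195 > 200), whereas "## Machine Learning" and its first paragraph fit together in 178 characters. A simplified sketch of the recursion, assuming no overlap; `recursive_split` is a hypothetical helper, not the library's code.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


parts = recursive_split("aaa bbb\n\nccc ddd eee fff", chunk_size=10)
print(parts)
```

The first paragraph fits as it is, while the second is too long and gets split again on spaces.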


#### Exercise 3: Token Splitting

In [52]:
from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are measured in tokens, not characters
splitter = TokenTextSplitter(chunk_size=200, chunk_overlap=20)
tok_chunks = splitter.split_text(sample_text)

print(f"Number of chunks: {len(tok_chunks)}")
print(f"\nFirst chunk:\n{tok_chunks[0]}")
print(f"\nToken chunk sizes (chars): {[len(c) for c in tok_chunks]}")

Number of chunks: 3

First chunk:

# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is the simulation of human intelligence in machines. These machines are programmed to think and learn like humans. The field was founded in 1956 at Dartmouth College.

## Machine Learning

Machine learning is a subset of AI. It allows systems to learn from data without being explicitly programmed. There are three main types of machine learning.

Supervised learning uses labelled data to train models. The model learns to map inputs to outputs. Examples include classification and regression tasks.

Unsupervised learning finds hidden patterns in unlabelled data. Clustering and dimensionality reduction are common techniques. K-means and PCA are popular algorithms.

Reinforcement learning trains agents through rewards and penalties. The agent learns by interacting with an environment. AlphaGo used reinforcement learning to beat world champions.

## Deep Learning

Deep learning uses neu
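
Only three chunks come back because `chunk_size=200` counts tokens here, so each chunk spans several hundred characters. Conceptually, the text is encoded to tokens and a fixed-size window slides over them, stepping by `chunk_size - chunk_overlap`. The sketch below illustrates that window under a loud assumption: whitespace-separated words stand in for tokens, whereas the real splitter relies on a subword tokenizer. `split_by_tokens` is a hypothetical helper.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


# Assumption: words act as tokens purely for illustration.
tokens = "one two three four five six seven eight nine ten".split()
windows = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(" ".join(w))
```

Each window starts with the last token of the previous one, which is the overlap that `chunk_overlap` controls.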

#### Exercise 4: Markdown Header Splitting

In [53]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [
    ("#", "Title"),
    ("##", "Section"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
mark_chunks = splitter.split_text(sample_text)

print(f"Number of chunks: {len(mark_chunks)}")
for chunk in mark_chunks:
    print(f"\n--- Metadata: {chunk.metadata} ---\n")
    print(chunk.page_content[:200])

Number of chunks: 5

--- Metadata: {'Title': 'Introduction to Artificial Intelligence'} ---

Artificial Intelligence (AI) is the simulation of human intelligence in machines. These machines are programmed to think and learn like humans. The field was founded in 1956 at Dartmouth College.

--- Metadata: {'Title': 'Introduction to Artificial Intelligence', 'Section': 'Machine Learning'} ---

Machine learning is a subset of AI. It allows systems to learn from data without being explicitly programmed. There are three main types of machine learning.  
Supervised learning uses labelled data t

--- Metadata: {'Title': 'Introduction to Artificial Intelligence', 'Section': 'Deep Learning'} ---

Deep learning uses neural networks with many layers. It has revolutionised computer vision and natural language processing. GPU computing made deep learning practical.  
Convolutional Neural Networks 

--- Metadata: {'Title': 'Introduction to Artificial Intelligence', 'Section': 'Natural Language Proc
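
The header splitter does not use a size limit at all: it cuts wherever a configured header appears and records the active headers as metadata on each chunk, which is why every `##` section also carries the `Title`. A rough pure-Python sketch of that bookkeeping follows; `split_by_headers` is a hypothetical helper that returns plain dictionaries instead of `Document` objects.

```python
def split_by_headers(markdown, headers):
    """Group lines under the most recent headers and record them as metadata.

    headers is a list of (prefix, metadata_key) pairs,
    e.g. [("#", "Title"), ("##", "Section")].
    """
    # Test longer prefixes first so "##" is not mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    sections, metadata, body = [], {}, []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"metadata": dict(metadata), "content": text})
        body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        match = next(((p, k) for p, k in ordered if stripped.startswith(p + " ")), None)
        if match is None:
            body.append(line)
            continue
        flush()  # a new header closes the section collected so far
        prefix, key = match
        # A new header replaces any header of the same or a deeper level.
        for p, k in headers:
            if len(p) >= len(prefix):
                metadata.pop(k, None)
        metadata[key] = stripped[len(prefix):].strip()
    flush()
    return sections


doc = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it."
sections = split_by_headers(doc, [("#", "Title"), ("##", "Section")])
for s in sections:
    print(s["metadata"], "->", s["content"])
```

Because sections can be arbitrarily long, header splitting is often followed by a size-based splitter when chunks must stay under a limit.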

#### Exercise 5: Code Splitting

In [54]:
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

code = """
def load_documents(path):
    loader = PyPDFLoader(path)
    return loader.load()

def split_documents(documents):
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_documents(documents)

def create_vectorstore(chunks, embeddings):
    return FAISS.from_documents(chunks, embeddings)

def create_chain(vectorstore, prompt, llm):
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
"""

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=300, 
    chunk_overlap=20
)

lang_chunks = splitter.split_text(code)

print(f"Number of Chunks : {len(lang_chunks)}")
for i, chunk in enumerate(lang_chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} characters) ---\n")
    print(chunk)

Number of Chunks : 3

--- Chunk 1 (241 characters) ---

def load_documents(path):
    loader = PyPDFLoader(path)
    return loader.load()

def split_documents(documents):
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    return splitter.split_documents(documents)

--- Chunk 2 (95 characters) ---

def create_vectorstore(chunks, embeddings):
    return FAISS.from_documents(chunks, embeddings)

--- Chunk 3 (252 characters) ---

def create_chain(vectorstore, prompt, llm):
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
    return (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
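
`from_language(Language.PYTHON)` swaps the generic separators for Python-aware ones that prefer class and function boundaries before falling back to blank lines and spaces. That matches the output: the first two functions fit into one 241-character chunk, and adding the third (95 characters) would exceed 300. The sketch below imitates only the top-level boundary step with a regular expression; `split_python_source` is a hypothetical helper, not the library's algorithm.

```python
import re


def split_python_source(source, chunk_size):
    """Cut before top-level def/class lines, then pack units up to chunk_size."""
    # The lookahead splits just before the keyword without consuming it.
    units = [
        u.strip("\n")
        for u in re.split(r"(?m)^(?=def |class )", source)
        if u.strip()
    ]
    chunks, current = [], ""
    for unit in units:
        candidate = unit if not current else current + "\n\n" + unit
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


source = '''
def a():
    return 1

def b():
    return 2

def c():
    return 3
'''
code_chunks = split_python_source(source, chunk_size=50)
for chunk in code_chunks:
    print(chunk, end="\n---\n")
```

Functions are never cut in half as long as each one fits within `chunk_size`; a function larger than the limit would need the finer fallback separators that this sketch omits.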


#### Exercise 6: Semantic Splitting

In [39]:
from dotenv import load_dotenv
import os
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

In [55]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small", api_key=openai_api_key)

splitter = SemanticChunker(embeddings=embeddings, 
                           breakpoint_threshold_type="percentile",
                           breakpoint_threshold_amount=70)

semantic_chunks = splitter.split_text(sample_text)

print(f"Number of chunks: {len(semantic_chunks)}")
for i, chunk in enumerate(semantic_chunks):
    print(f"\n--- Chunk {i+1} ({len(chunk)} characters) ---\n")
    print(chunk[:300])

Number of chunks: 14

--- Chunk 1 (187 characters) ---


# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is the simulation of human intelligence in machines. These machines are programmed to think and learn like humans.

--- Chunk 2 (108 characters) ---

The field was founded in 1956 at Dartmouth College. ## Machine Learning

Machine learning is a subset of AI.

--- Chunk 3 (73 characters) ---

It allows systems to learn from data without being explicitly programmed.

--- Chunk 4 (327 characters) ---

There are three main types of machine learning. Supervised learning uses labelled data to train models. The model learns to map inputs to outputs. Examples include classification and regression tasks. Unsupervised learning finds hidden patterns in unlabelled data. Clustering and dimensionality reduc

--- Chunk 5 (107 characters) ---

K-means and PCA are popular algorithms. Reinforcement learning trains agents through rewards and penalties.

--- Chunk 6 (184 charact
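
With `breakpoint_threshold_type="percentile"`, the chunker embeds sentences, measures the distance between neighbouring sentences, and starts a new chunk wherever that distance exceeds the chosen percentile of all distances (70 here). Because sentences are detected from punctuation rather than Markdown structure, a heading such as "## Machine Learning" can end up in the middle of a chunk, as in Chunk 2 above. The sketch below shows the thresholding step only, with hand-written 2-D vectors standing in for real embeddings; `semantic_groups` and `percentile` are hypothetical helpers, and the real chunker also combines each sentence with its neighbours before embedding.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))


def percentile(values, q):
    """Linear-interpolation percentile, q in the range 0..100."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_groups(sentences, vectors, threshold_percentile=70):
    """Start a new group wherever neighbouring vectors are unusually far apart."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, threshold_percentile)
    groups, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            groups.append(current)
            current = [sentence]
        else:
            current.append(sentence)
    groups.append(current)
    return groups


# Toy vectors (assumption): two "animal" sentences and two "finance" sentences.
sentences = ["Cats purr.", "Dogs bark.", "Stocks fell.", "Bonds rose."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
groups = semantic_groups(sentences, vectors)
print(groups)
```

Lowering the percentile produces more breakpoints and therefore more, smaller chunks; raising it does the opposite.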

#### Exercise 7: Compare All Methods Side by Side

In [56]:
# Collect the chunks produced in the exercises above
results = {
    "Character": char_chunks,
    "Recursive": rec_chunks,
    "Token": tok_chunks,
    "Markdown": mark_chunks,
    "Semantic": semantic_chunks,
}

print(f"{'Method':<15} {'Chunks':>8} {'Avg Size':>10} {'Min':>6} {'Max':>6}")
print("-" * 50)
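for method, chunks in results.items():
    # Sizes are character counts. Token and Semantic chunks are plain strings;
    # the other methods return Document objects.
    sizes = [len(c) if isinstance(c, str) else len(c.page_content) for c in chunks]
    print(f"{method:<15} {len(chunks):>8} {sum(sizes)//len(sizes):>10} {min(sizes):>6} {max(sizes):>6}")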


Method            Chunks   Avg Size    Min    Max
--------------------------------------------------
Character             39         61     38    123
Recursive             15        165     41    197
Token                  3        898    697   1038
Markdown               5        471    195    665
Semantic              14        175     73    340
