In [13]:
from langchain_text_splitters import TokenTextSplitter


The TokenTextSplitter in LangChain splits long text into smaller chunks based on token count rather than character count.
Because LLM limits are expressed in tokens, this makes it easier to keep every chunk of a large document within a model's token limit.
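The idea of splitting by token count with overlap can be sketched as a sliding window over a token list. This is a minimal illustration, not LangChain's implementation: the function name split_by_tokens is hypothetical, and whitespace-separated words stand in for the subword tokens a real tokenizer would produce.

```python
from typing import List, Sequence


def split_by_tokens(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks


# Assumption for illustration: words act as "tokens" here.
words = "one two three four five six seven eight nine ten".split()
word_chunks = split_by_tokens(words, chunk_size=4, chunk_overlap=1)

for chunk in word_chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, which is the role chunk_overlap plays: neighboring chunks share a little context at their boundary.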

Usage and Purpose
The TokenTextSplitter is primarily used to:

Ensure token limit compliance – LLMs like OpenAI's GPT models have token limits, so splitting text into manageable chunks prevents exceeding those limits.
Improve retrieval quality – Smaller, focused chunks can be retrieved more precisely in RAG (Retrieval-Augmented Generation) systems.
Keep context manageable – Breaking a large document into pieces lets the model process it one chunk at a time, and chunk_overlap carries some context across chunk boundaries. Note that the split is by token count, not by meaning, so a chunk can begin or end mid-sentence.

The * operator unpacks a list into separate positional arguments. In the cells below it unpacks a list of strings (the chunks, or the page_content values of Document objects) into print().
sep="\n" makes print() put each item on its own line.
This is a compact way to display several chunks or documents at once.
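As a small sketch of what the unpacking does, the snippet below compares print(*items, sep="\n") with joining the strings first. The variable names are illustrative only, and the output is written to an in-memory buffer so the two results can be compared.

```python
import io

lines = ["first chunk", "second chunk", "third chunk"]

# Unpacking passes each element to print() as its own positional argument;
# sep="\n" is placed between them and a final newline is added at the end.
buffer = io.StringIO()
print(*lines, sep="\n", file=buffer)
unpacked_output = buffer.getvalue()

# Joining the strings first produces the same text without the trailing newline.
joined = "\n".join(lines)

print(unpacked_output == joined + "\n")
```

One practical difference: str.join accepts only strings, while print(*items) converts each item with str(), so it also works for lists of numbers or other objects.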

In [8]:

# Sample text
text = "LangChain simplifies building applications with LLMs by providing abstractions for chains, memory, and agents."

# Initialize TokenTextSplitter
splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=3)

# Split text into token-based chunks
chunks = splitter.split_text(text)

# Using * operator to print chunks
print(*chunks, sep="\n")
print(len(chunks))
type(chunks)


LangChain simplifies building applications with LLMs
 with LLMs by providing abstractions for chains,
 for chains, memory, and agents.
3


list

In [14]:
print(type(chunks[1]))

<class 'str'>


In [12]:
# Document is defined in langchain_core; langchain.docstore.document is a legacy import path
from langchain_core.documents import Document

# Creating multiple Document objects
doc1 = Document(page_content="LangChain enables LLM-powered applications.", metadata={"source": "AI Guide"})
doc2 = Document(page_content="Tokenization is essential for handling large texts.", metadata={"source": "NLP Handbook"})
doc3 = Document(page_content="Vector databases store embeddings for efficient search.", metadata={"source": "DB Essentials"})

# Storing them in a list
documents = [doc1, doc2, doc3]

# Using * to unpack and print each document's content
print(*[doc.page_content for doc in documents], sep="\n")


LangChain enables LLM-powered applications.
Tokenization is essential for handling large texts.
Vector databases store embeddings for efficient search.


In [17]:
# Import the Document class (defined in langchain_core; langchain.docstore.document is a legacy path)
from langchain_core.documents import Document

# Define raw text content
raw_text = "The xAI Grok 3 is an advanced AI model designed to assist users in various tasks."

# Create a Document object with page_content and optional metadata
doc = Document(
    page_content=raw_text,
    metadata={
        "source": "xAI Product Catalog",
        "title": "Grok 3 Overview",
        "date": "March 04, 2025"
    }
)



In [18]:
# Print the document to verify
print(doc)

page_content='The xAI Grok 3 is an advanced AI model designed to assist users in various tasks.' metadata={'source': 'xAI Product Catalog', 'title': 'Grok 3 Overview', 'date': 'March 04, 2025'}
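When a Document is split into chunks, each chunk usually needs to keep the metadata of its source. LangChain's splitters provide a split_documents method for Document lists; the sketch below does not call it. Instead it uses a hypothetical SimpleDoc dataclass and a hypothetical split_doc helper, with word counts standing in for token counts, to show the idea without any dependencies.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_doc(doc: SimpleDoc, max_words: int) -> List[SimpleDoc]:
    """Split a document into word-based chunks, copying metadata onto each chunk."""
    words = doc.page_content.split()
    return [
        SimpleDoc(
            page_content=" ".join(words[i:i + max_words]),
            metadata=dict(doc.metadata),  # copy so chunks do not share one dict
        )
        for i in range(0, len(words), max_words)
    ]


source = SimpleDoc(
    page_content="Vector databases store embeddings for efficient search.",
    metadata={"source": "DB Essentials"},
)
pieces = split_doc(source, max_words=4)

for piece in pieces:
    print(piece.metadata["source"], "|", piece.page_content)
```

Copying the metadata dict per chunk means a later change to one chunk's metadata does not affect the others or the original document.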
