# Text splitters

In LangChain, text splitters are tools used to divide large documents into smaller, more manageable chunks of text. This is particularly useful when working with large documents that cannot be processed in a single pass due to model input size limitations.

In [53]:
long_text = """Large Language Models (LLMs) are a type of artificial intelligence model trained on vast amounts of textual data.\n
They use deep learning, specifically transformer architectures, to understand, generate, and manipulate human language in a meaningful way.\n
Some well-known LLMs include OpenAI's GPT series, Google's PaLM, Meta's LLaMA, and Anthropic's Claude.\n

### Core Concepts of LLMs:\n
1. **Tokenization**: Breaking text into smaller units called tokens.\n
2. **Transformer Architecture**: A neural network structure that uses self-attention to weigh the importance of each word.\n
3. **Pretraining and Finetuning**: Models are pretrained on large corpora and then optionally finetuned on specific tasks.\n

---\n

### Applications of LLMs:\n
- Text Generation (e.g., writing stories, articles)\n
- Summarization\n
- Translation\n
- Sentiment Analysis\n
- Code generation\n
- Question Answering\n

***\n

### Challenges with LLMs:\n
- **Bias**: LLMs can reflect and even amplify societal biases found in their training data.\n
- **Hallucination**: They might generate confident but incorrect or fabricated information.\n
- **Resource Intensity**: Training and running these models require significant computational power.\n

---\n

### Recent Trends:\n
- **Retrieval-Augmented Generation (RAG)**: Combines LLMs with information retrieval systems.\n
- **Multimodal Models**: Extend LLMs to understand not just text but images, audio, and video.\n
- **Agents and Tool Use**: LLMs that interact with tools or APIs to perform tasks (e.g., browsing, math solving).\n

***\n

In summary, LLMs have revolutionized natural language processing, enabling machines to interact with humans in powerful new ways.\n
As research continues, we expect LLMs to become more capable, efficient, and aligned with human values.\n
"""


## Simple character text splitter
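This splitter splits the text on a single separator (here `"\n\n"`) and then merges the resulting pieces into chunks of at most `chunk_size` characters, with up to `chunk_overlap` characters shared between neighboring chunks. It is simple and works well when the document uses one consistent separator. In the example below, `chunk_size=5000` is larger than the whole text, so the splitter returns a single chunk.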

In [34]:
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Simple character text splitter
text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=5000,
    chunk_overlap=200
)
texts = text_splitter.split_text(long_text)

for i, text in enumerate(texts):
    print(f"text {i} has {len(text)} characters")

text 0 has 1785 characters
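
The single chunk in the output above follows from how separator-based splitting works: split on the separator, then greedily merge pieces until the size limit would be exceeded. The following is a minimal pure-Python sketch of that idea, not LangChain's actual implementation. The function name `split_and_merge` and the sample string are hypothetical, and overlap handling is left out for brevity.

```python
def split_and_merge(text, separator="\n\n", chunk_size=5000):
    """Split on a separator, then greedily merge pieces up to chunk_size characters.

    Illustrative only: this assumes no overlap and keeps an oversized piece whole.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits (or this is the first piece of a new chunk): keep merging.
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta\n\nepsilon"
one_chunk = split_and_merge(sample, chunk_size=5000)
three_chunks = split_and_merge(sample, chunk_size=12)
print(len(one_chunk), len(three_chunks))
```

With a limit larger than the text, everything is merged back into one chunk, which is what happens with `chunk_size=5000` above. Lowering the limit makes the separator boundaries visible again.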


## Recursive character text splitter (preferred)
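This splitter is more sophisticated. It tries a list of separators in order, by default paragraph breaks (`"\n\n"`), line breaks (`"\n"`), spaces, and finally individual characters, and only falls back to a finer separator when a piece is still larger than `chunk_size`. This keeps paragraphs and sentences together where possible while still respecting the size limit.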

In [35]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
texts = recursive_splitter.split_text(long_text)
for i, text in enumerate(texts):
    print(f"text {i} has {len(text)} characters")

text 0 has 921 characters
text 1 has 939 characters
text 2 has 240 characters
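
To make the recursive strategy concrete, here is a small pure-Python sketch of the separator-fallback idea. It is an illustration under simplifying assumptions, not LangChain's code: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are dropped from the hard-cut fallback.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the coarsest separator first; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # This piece alone is too large: flush what we have, then go finer.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "one two three\n\nfour five six seven eight"
demo_chunks = recursive_split(demo, chunk_size=15)
print(demo_chunks)
```

The first paragraph fits and is kept whole. The second is too long, so it is split again on spaces and its words are regrouped into chunks that respect the limit.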


## Token-based splitter
This splitter divides text based on the number of tokens rather than characters. Model input limits are expressed in tokens, so a token count is a better measure of input size than a character count.

In [42]:
from transformers import GPT2Tokenizer
# Load the GPT-2 tokenizer (other models' tokenizers can be loaded the same way, e.g. with AutoTokenizer)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Tokenize the text
tokens = tokenizer.encode(long_text)

# Check how many tokens
print(f"Total tokens: {len(tokens)}")

Total tokens: 437


In [46]:
from langchain_text_splitters import TokenTextSplitter

token_splitter = TokenTextSplitter(
    chunk_size=200,
    chunk_overlap=5
)
texts = token_splitter.split_text(long_text)

# chunk_size counts tokens; the lengths printed here are in characters
for i, text in enumerate(texts):
    print(f"text {i} has {len(text)} characters")

text 0 has 811 characters
text 1 has 772 characters
text 2 has 236 characters
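
The three chunks above can be understood as a sliding window over the token sequence. The sketch below shows that windowing on a plain list of token IDs. It is a simplified illustration, not the library's implementation: `split_token_ids` is a hypothetical name, and a real splitter would also decode each window back to text.

```python
def split_token_ids(token_ids, chunk_size, chunk_overlap):
    """Return windows of at most chunk_size tokens; neighbors share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows, start = [], 0
    while start < len(token_ids):
        windows.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reached the end of the sequence
        start += step
    return windows


small = split_token_ids(list(range(10)), chunk_size=4, chunk_overlap=1)
print(small)

# Assumption: the splitter's tokenizer also yields 437 tokens for the text,
# matching the GPT-2 count computed earlier.
sizes = [len(w) for w in split_token_ids(list(range(437)), chunk_size=200, chunk_overlap=5)]
print(sizes)
```

Under that assumption, 437 tokens with `chunk_size=200` and `chunk_overlap=5` give windows starting at tokens 0, 195, and 390, which is consistent with the three chunks printed above.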
