## Chat Models - <a href='https://python.langchain.com/docs/modules/data_connection/document_loaders/'>Document Loaders</a> and Text Splitting


Note: This notebook saves the downloaded text file into a temporary folder (tmp/input.txt) to avoid overwriting local files. The tmp folder is ignored by Git via .gitignore.

In [40]:
import os
from urllib.parse import urlparse
import requests
from langchain_community.document_loaders import TextLoader
from src.fnUtils import render_markdown

# Configure a downloadable text resource (swap this URL as needed)
source_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

# Create a temp folder relative to this notebook directory
TMP_DIR = "tmp"
os.makedirs(TMP_DIR, exist_ok=True)

# Derive a sensible local filename from the URL path
parsed = urlparse(source_url)
file_name = os.path.basename(parsed.path) or "download.txt"
local_path = os.path.join(TMP_DIR, file_name)

# Fetch raw text content
resp = requests.get(source_url, timeout=20, headers={"User-Agent": "LangChain-DocLoader-Demo/1.0"})
resp.raise_for_status()
text = resp.text

# Write to a local file as UTF-8 so the loader below reads it with a known encoding
with open(local_path, "w", encoding="utf-8") as f:
    f.write(text)

# Load the saved file as a LangChain document (encoding matches the write above)
tloader = TextLoader(local_path, encoding="utf-8")
docs = tloader.load()
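
The filename derivation used above can be factored into a small helper, which makes the fallback behaviour easy to check in isolation. This is a minimal sketch: the function name `local_path_for` and its `default` parameter are illustrative and not part of the notebook.

```python
import os
from urllib.parse import urlparse


def local_path_for(url: str, tmp_dir: str = "tmp", default: str = "download.txt") -> str:
    """Map a URL to a path inside tmp_dir (hypothetical helper, not in the notebook)."""
    # basename() returns "" when the URL path is empty or ends with "/",
    # so fall back to a default filename in that case.
    file_name = os.path.basename(urlparse(url).path) or default
    return os.path.join(tmp_dir, file_name)


print(local_path_for(
    "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
))
print(local_path_for("https://example.com/"))
```

Because `urlparse` separates the query string from the path, a URL such as `.../input.txt?raw=1` would still map to `input.txt`.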

In [28]:
# Split the text into chunks for downstream tasks
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Small chunk size just to demonstrate splitting
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)

# Use the docs we already loaded from the TextLoader
split_docs = text_splitter.split_documents(docs)

# Add chunk indices to metadata for easier citation
final_docs = []
for idx, doc in enumerate(split_docs):
    doc.metadata["chunk_index"] = idx
    final_docs.append(doc)
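
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below slides a fixed-width character window over a string. It is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraph and line boundaries, so its real chunks are often shorter than `chunk_size`. The function name `sliding_chunks` is an assumption made for illustration.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-width windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(pieces)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.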

In [35]:
len(final_docs)

5318

In [36]:
# Inspect metadata on the loaded doc and the first chunk
print({"loader_metadata": docs[0].metadata})
print({"chunk_metadata": final_docs[0].metadata})

{'loader_metadata': {'source': 'tmp/input.txt'}}
{'chunk_metadata': {'source': 'tmp/input.txt', 'chunk_index': 0}}


In [37]:
from langchain_openai import ChatOpenAI

model_name = "gpt-4o-mini"  # e.g., "gpt-4o" for higher quality

chat = ChatOpenAI(model=model_name, temperature=0)

In [38]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Map step: summarize each chunk
summary_prompt = PromptTemplate.from_template(
    "You are a helpful assistant. Summarize the following passage concisely.\n\n{context}"
)
map_chain = summary_prompt | chat | StrOutputParser()
partial_summaries = map_chain.batch([
    {"context": d.page_content} for d in final_docs[:10]
])

# Reduce step: combine and refine partial summaries
reduce_prompt = PromptTemplate.from_template(
    "Combine and refine the following partial summaries into a single clear summary.\n\n{context}"
)
reduce_chain = reduce_prompt | chat | StrOutputParser()
result = reduce_chain.invoke({
    "context": "\n\n".join(partial_summaries)
})
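
The map and reduce steps above follow a simple control flow that can be exercised without an API key. The sketch below uses plain callables as stand-ins for the two chains; `map_reduce_summarize`, `first_sentence`, and `as_bullets` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    combine: Callable[[str], str],
    separator: str = "\n\n",
) -> str:
    """Summarize each chunk independently (map), then merge the results (reduce)."""
    partials = [summarize(chunk) for chunk in chunks]  # map step
    return combine(separator.join(partials))           # reduce step


# Stand-in functions so the flow can run offline (they do not call a model).
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


def as_bullets(joined: str) -> str:
    return "\n".join(f"- {part}" for part in joined.split("\n\n"))


demo = map_reduce_summarize(
    ["Citizens gather. They are hungry.", "Marcius is blamed. Some defend him."],
    summarize=first_sentence,
    combine=as_bullets,
)
print(demo)
```

With the chains from this notebook, `summarize` could be `lambda t: map_chain.invoke({"context": t})` and `combine` could be `lambda t: reduce_chain.invoke({"context": t})`. Note that the cell above only maps over the first 10 chunks, so the final summary covers just the opening of the text.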

In [41]:
# Display the final summary
render_markdown(result)

> The First Citizen addresses the crowd, urging them to consider their dire situation and their willingness to die rather than starve. The citizens express their determination and identify Caius Marcius as their chief enemy, eager to take swift action to control corn prices. The First Citizen voices frustration over the patricians' perception of the citizens as unworthy and costly, suggesting that sharing their excess resources would demonstrate true concern for the less fortunate. He highlights the injustice of their suffering benefiting the wealthy and advocates for action to address this inequality.
> 
> While some citizens express a desire for vengeance against Marcius, the Second Citizen raises concerns about his past services to the country. The First Citizen acknowledges the complexity of Marcius's character, suggesting that his actions were driven by personal pride rather than genuine patriotism. The Second Citizen counters that inherent traits should not be labeled as vices, particularly greed. As tensions rise, the First Citizen urges the group to move to the Capitol, but he hesitates, sensing unrest and questioning who is approaching.

### Alternative Use-Cases for Long Text (Tiny Shakespeare)
Summarization is useful, but for literary corpora you often want:

- Retrieval Q&A with citations (answer questions from the text).
- Character and entity extraction (who speaks, where, to whom).
- Concordance/search (find lines containing themes or motifs).
- Scene/section segmentation and study guides with references.

The next cells build a lightweight retrieval-QA example that uses BM25 instead of embeddings.

In [25]:
# Build a lightweight BM25 retriever from the chunks (no embeddings needed)
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(final_docs)
bm25_retriever.k = 4  # top-k chunks to retrieve
bm25_retriever

BM25Retriever(vectorizer=<rank_bm25.BM25Okapi object at 0x118467ce0>)
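
BM25 ranks documents by how often the query terms occur in them, weighted by how rare each term is across the corpus and normalized by document length. The sketch below implements a textbook form of that formula in plain Python to show the idea. It is not the `rank_bm25` implementation: the whitespace tokenizer, the smoothed IDF variant, and the `k1` and `b` values are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with a textbook BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" keeps the weight non-negative for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "the citizens demand corn at a fair price",
    "marcius is proud and scorns the citizens",
    "the duke leaves vienna in disguise",
]
scores = bm25_scores("corn price", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Documents that share no term with the query score zero, which is why purely lexical retrieval can miss passages that use different wording for the same idea.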

In [26]:
# Ask a question, retrieve top chunks, and answer with citations
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Note: BM25 ranks chunks by term overlap with the question, not by position,
# so "early in the text" does not restrict retrieval to the first chunks.
question = "Who are the main characters introduced early in the text?"
retrieved = bm25_retriever.invoke(question)
context = "\n\n".join(d.page_content for d in retrieved)
qa_prompt = PromptTemplate.from_template(
    "Answer the user's question using only the provided context.\n"
    "If unsure, say you don't know.\n\n"
    "Question: {question}\n\nContext:\n{context}"
)
qa_chain = qa_prompt | chat | StrOutputParser()
answer = qa_chain.invoke({"question": question, "context": context})
print("Answer:\n", answer)
print("\nCitations:")
for doc in retrieved:
    source = doc.metadata.get("source", "unknown")
    chunk_idx = doc.metadata.get("chunk_index", "N/A")
    print(f"  - {source} (chunk #{chunk_idx})")

Answer:
 The main characters introduced early in the text are Derby, Lord Clifford, Lord Stafford, and Duke Vincentio.

Citations:
  - tmp/input.txt (chunk #1462)
  - tmp/input.txt (chunk #2750)
  - tmp/input.txt (chunk #498)
  - tmp/input.txt (chunk #4496)
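
The citations above point to chunks far from the start of the file, because BM25 matches on words rather than position. For a question about who appears early, the character-extraction use case listed earlier may be a better fit. The sketch below lists distinct speaker labels in order of first appearance. It assumes speaker labels are short lines ending in a colon (as in the abridged excerpt used here); the regex and the helper name `speakers_in_order` are illustrative, not part of the notebook.

```python
import re
from typing import List

# Assumption: a speaker label is a short capitalized line ending in ":", e.g. "First Citizen:".
SPEAKER_LINE = re.compile(r"^([A-Z][A-Za-z' ]{0,40}):[ \t]*$", re.MULTILINE)


def speakers_in_order(text: str, limit: int = 5) -> List[str]:
    """Return up to `limit` distinct speaker labels in order of first appearance."""
    seen: List[str] = []
    for match in SPEAKER_LINE.finditer(text):
        if len(seen) >= limit:
            break
        name = match.group(1).strip()
        if name not in seen:
            seen.append(name)
    return seen


# Abridged excerpt in the same layout as the downloaded file.
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n\n"
    "All:\n"
    "Speak, speak.\n\n"
    "First Citizen:\n"
    "You are all resolved rather to die than to famish?\n\n"
    "Second Citizen:\n"
    "One word, good citizens.\n"
)
early_speakers = speakers_in_order(sample)
print(early_speakers)
```

Running the same helper on `docs[0].page_content` would scan the full file from the top; the labels it finds could also be passed to the model as context in place of BM25 results.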
