# Tutorial 3: Document Processing with LangChain

This tutorial walks through document processing with LangChain: loading and parsing documents, splitting text into chunks, building a simple question-answering system, and implementing semantic search.

In [None]:
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Load environment variables
load_dotenv()

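# Initialize the Groq chat model (the API key is read from GROQ_API_KEY)
llm = ChatGroq(
    model_name="llama-3.1-70b-versatile",
    temperature=0.7,
    model_kwargs={"top_p": 0.8, "seed": 1337},
)

# Initialize the embedding model served by Ollama; print the URL as a sanity check
print(os.getenv("OLLAMA_EMBEDDING_URL"))
embedding_model = OllamaEmbeddings(
    model="all-minilm",
    base_url=os.getenv("OLLAMA_EMBEDDING_URL"),
)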
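
The setup above depends on values loaded from the environment. A minimal sketch of a fail-fast check follows, assuming a hypothetical helper named `require_env` (it is not part of LangChain or python-dotenv). It raises a clear error when a variable is missing instead of passing `None` to a client.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise if it is unset or empty.

    Hypothetical helper for illustration; not a LangChain or python-dotenv API.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

It could be called as `require_env("OLLAMA_EMBEDDING_URL")` right after `load_dotenv()`, so a misconfigured `.env` file surfaces before any model is created.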


## 1. Loading and Parsing Documents

In [41]:
# Load a single document
loader = TextLoader("sample_documents/sample1.txt")
document = loader.load()

print(f"Content of sample1.txt:\n{document[0].page_content[:200]}...\n")

# Load multiple documents from a directory
dir_loader = DirectoryLoader("sample_documents/", glob="*.txt", loader_cls=TextLoader)
documents = dir_loader.load()

print(f"Number of documents loaded: {len(documents)}")
for i, doc in enumerate(documents):
    print(f"Document {i+1} preview: {doc.page_content[:50]}...")

Content of sample1.txt:
# Comprehensive Overview of Artificial Intelligence

## Table of Contents
1. [Introduction to Artificial Intelligence](#introduction-to-artificial-intelligence)
2. [History of AI](#history-of-ai)
3. [...

Number of documents loaded: 1
Document 1 preview: # Comprehensive Overview of Artificial Intelligenc...


In [42]:
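from langchain_community.document_loaders import PyPDFLoader

# Load the PDF (one Document per page). This overwrites the text documents
# loaded above, so the following sections operate on the PDF.
loader = PyPDFLoader("sample_documents/sample2.pdf")
documents = loader.load()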
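
Each loader above returns a list of document objects that pair the text (`page_content`) with a `metadata` dictionary. The following is a rough stand-in for that idea using only the standard library. `SimpleDocument` and `load_text_files` are hypothetical names chosen for illustration and are not LangChain classes or functions.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_files(directory, pattern="*.txt"):
    """Read every file in `directory` matching `pattern` into a SimpleDocument."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        # Record where the text came from so answers can cite their source later
        docs.append(SimpleDocument(page_content=text, metadata={"source": str(path)}))
    return docs
```

Keeping the source path in the metadata is what later lets a question-answering step report which file or page an answer came from.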


## 2. Text Splitting and Chunking

In [43]:
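# Create a text splitter: chunks of up to 1000 characters, with up to
# 200 characters of overlap between consecutive chunks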
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Split the documents
splits = text_splitter.split_documents(documents)

print(f"Number of splits: {len(splits)}")
print(f"First split preview:\n{splits[0].page_content[:200]}...")

Number of splits: 111
First split preview:
Quiet-STaR: Language Models Can Teach Themselves to
Think Before Speaking
Eric Zelikman
Stanford UniversityGeorges Harik
Notbad AI IncYijia Shao
Stanford UniversityVaruna Jayasiri
Notbad AI Inc
Nick H...
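
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-window chunking with overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks), and `split_with_overlap` is a hypothetical name used only for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Simplified
    illustration only: it ignores sentence and paragraph boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.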


## 3. Building a Simple Question-Answering System

In [44]:
# Create a vector store
vectorstore = FAISS.from_documents(splits, embedding_model)

# Create a retrieval-based QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

# Ask a question
query = "What is the main topic of these documents?"
result = qa_chain.invoke({"query": query})

print(f"Question: {query}")
print(f"Answer: {result['result']}\n")
print("Sources:")
for i, doc in enumerate(result['source_documents']):
    print(f"Document {i+1}: {doc.page_content[:100]}...")

Question: What is the main topic of these documents?
Answer: The main topic of these documents appears to be language models (LMs) and their improvement, specifically in the areas of reasoning, answering difficult questions, and understanding text.

Sources:
Document 1: Vaishnavh Nagarajan. Think before you speak: Training language models with pause
tokens. arXiv prepr...
Document 2: improve the LM’s ability to directly answer difficult questions. In particular,
after continued pret...
Document 3: these tends to<|startthought|> in some sense - to be the more difficult<|
endthought|> trickiest for...
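
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. The sketch below shows the general shape of that step. The prompt wording and the `build_stuff_prompt` name are assumptions made for illustration and are not the template `RetrievalQA` actually uses.

```python
def build_stuff_prompt(question, chunks):
    """Combine retrieved chunks and a question into one prompt string.

    Hypothetical prompt wording; the real chain uses its own template.
    """
    context = "\n\n".join(chunks)  # "stuff" every chunk into one context block
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is the main topic of these documents?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while `k` chunks of `chunk_size` characters fit in the model's context window.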


## 4. Implementing Semantic Search

In [45]:
# Perform a semantic search
query = "Discuss the importance of AI"
search_results = vectorstore.similarity_search(query, k=3)

print(f"Search query: {query}\n")
print("Top 3 relevant chunks:")
for i, doc in enumerate(search_results):
    print(f"Result {i+1}:\n{doc.page_content[:200]}...\n")

# Use the search results to answer a question
question = "What are some advantages of ai models?"
context = "\n".join([doc.page_content for doc in search_results])

prompt = f"Based on the following context, answer the question: {question}\n\nContext: {context}\n\nAnswer:"
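# llm.invoke returns a message object, so printing it directly shows its
# repr (content='...'); use answer.content to print only the generated text
answer = llm.invoke(prompt)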

print(f"Question: {question}")
print(f"Answer: {answer}")

Search query: Discuss the importance of AI

Top 3 relevant chunks:
Result 1:
expensive, difficult to scale, and provides no clear path to solving problems harder than
those that the annotators are capable of solving.
Another direction for teaching reasoning relies on a languag...

Result 2:
process-and outcome-based feedback. Neural Information Processing Systems (NeurIPS
2022) Workshop on MATH-AI , 2022.
Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D
Good...

Result 3:
Quiet-STaR: Language Models Can Teach Themselves to
Think Before Speaking
Eric Zelikman
Stanford UniversityGeorges Harik
Notbad AI IncYijia Shao
Stanford UniversityVaruna Jayasiri
Notbad AI Inc
Nick H...

Question: What are some advantages of ai models?
Answer: content='Some advantages of AI models, specifically language models, mentioned in the context are:\n\n1. Ability to learn from their own generated reasoning, through self-play methods such as process-and outcome-based feedback.\n2.
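
Semantic search ranks chunks by how close their embedding vectors are to the query's embedding. A minimal sketch of that ranking follows, assuming Euclidean distance as the metric and hand-written two-dimensional toy vectors. The metric actually used depends on how the index is configured, and `nearest_chunks` is a hypothetical name.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(query_vec, indexed, k=3):
    """Return the texts of the k entries in `indexed` closest to `query_vec`.

    `indexed` is a list of (text, vector) pairs.
    """
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy 2-D "embeddings"; real ones come from an embedding model
# and have far more dimensions.
index = [
    ("chunk about reasoning", [0.9, 0.1]),
    ("chunk about references", [0.1, 0.9]),
    ("chunk about training", [0.7, 0.3]),
]
print(nearest_chunks([1.0, 0.0], index, k=2))
# ['chunk about reasoning', 'chunk about training']
```

A real vector store avoids sorting the whole collection for every query, but the ranking idea is the same.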

## Conclusion

This tutorial covered the core steps of document processing with LangChain: loading and parsing documents, splitting text into chunks, building a retrieval-based question-answering system, and running semantic search over a vector store. These techniques form the foundation for more advanced document analysis and information retrieval systems.