## Ollama PDF RAG Notebook

In [None]:
pip list

Package                                  Version
---------------------------------------- -----------
aiofiles                                 24.1.0
aiohappyeyeballs                         2.4.3
aiohttp                                  3.11.8
aiosignal                                1.3.1
annotated-types                          0.7.0
antlr4-python3-runtime                   4.9.3
anyio                                    4.6.2.post1
appnope                                  0.1.4
asgiref                                  3.8.1
asttokens                                2.4.1
attrs                                    24.2.0
backoff                                  2.2.1
bcrypt                                   4.2.1
beautifulsoup4                           4.12.3
build                                    1.2.2.post1
cachetools                               5.5.0
certifi                                  2024.8.30
cffi                                     1.17.1
chardet                        

### Check that the required packages are installed

In [84]:
import pkg_resources

required_packages = [
    "ollama",
    "gradio",
    "pdfplumber",
    "langchain",
    "langchain-core",
    "langchain-ollama",
    "langchain_community",
    "langchain_text_splitters",
    "unstructured",
    "onnx==1.17.0",
    "protobuf==3.20.3",
    "chromadb==0.4.22",
    "Pillow",
    "numpy",
]

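# Collect the names of all installed distributions.
# Note: pkg_resources keys are lowercase and hyphenated, so names written with
# underscores or capitals (e.g. "langchain_community", "Pillow") can be reported
# as missing even when they are installed.
installed_packages = {pkg.key for pkg in pkg_resources.working_set}
# Strip any "==version" pin before comparing names
missing_packages = [pkg for pkg in required_packages if pkg.split("==")[0] not in installed_packages]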

if missing_packages:
    print(f"The following packages are missing: {missing_packages}")
else:
    print("All required packages are installed!")


The following packages are missing: ['langchain_community', 'langchain_text_splitters', 'Pillow']
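
The three names reported as missing differ from the others only in case or in using underscores, which suggests a name-normalization mismatch rather than genuinely missing installs. A minimal sketch of a normalized check follows, assuming the standard-library `importlib.metadata` is an acceptable substitute for `pkg_resources`; `normalize` and `find_missing` are hypothetical helper names, not part of the original notebook.

```python
import re
from importlib import metadata


def normalize(name):
    """Lowercase a package name and collapse runs of '-', '_', '.' into '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_missing(required, installed_names):
    """Return the requirement strings whose normalized name is not installed."""
    installed = {normalize(name) for name in installed_names}
    return [req for req in required if normalize(req.split("==")[0]) not in installed]


# Names of every distribution visible to the current interpreter
# (distributions without a readable name are skipped)
installed_names = [
    name for dist in metadata.distributions() if (name := dist.metadata["Name"])
]

required = ["langchain_community", "langchain_text_splitters", "Pillow", "chromadb==0.4.22"]
print(find_missing(required, installed_names))
```

With this comparison, `langchain_community` and `langchain-community` are treated as the same package, so only packages that are actually absent should be listed.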


In [None]:
pip show unstructured

Name: unstructured
Version: 0.16.8
Summary: A library that prepares raw documents for downstream ML tasks.
Home-page: https://github.com/Unstructured-IO/unstructured
Author: Unstructured Technologies
Author-email: devops@unstructuredai.io
License: Apache-2.0
Location: /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages
Requires: backoff, beautifulsoup4, chardet, dataclasses-json, emoji, filetype, html5lib, langdetect, lxml, nltk, numpy, psutil, python-iso639, python-magic, python-oxmsg, rapidfuzz, requests, tqdm, typing-extensions, unstructured-client, wrapt
Required-by: 
Note: you may need to restart the kernel to use updated packages.


Ollama is a platform for running large language models (LLMs) locally on your own device, designed for privacy and performance. It lets developers use LLMs without relying on external APIs, which makes it well suited to offline or secure environments.

### Import Libraries


In [4]:
# Set the protobuf implementation before importing libraries that may load protobuf
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# LangChain and Ollama imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Jupyter-specific imports
from IPython.display import display, Markdown

In [None]:
from langchain.document_loaders import UnstructuredPDFLoader

### Load PDF

This step loads the PDF with PyMuPDFLoader from LangChain, which extracts the text and returns one Document per page. The count the cell reports as "chunks" is therefore the number of pages; splitting the text into smaller chunks happens in the next step.

In [2]:
from langchain_community.document_loaders import PyMuPDFLoader

pdf_path = "/Users/yogavarshniramachandran/Documents/Sem 3/DL/lab2/Part3_knowledge base.pdf"
try:
    loader = PyMuPDFLoader(file_path=pdf_path)
    data = loader.load()
    print(f"PDF loaded successfully with PyMuPDFLoader: {len(data)} chunks.")
except Exception as e:
    print(f"Failed to load PDF with PyMuPDFLoader: {e}")

PDF loaded successfully with PyMuPDFLoader: 8 chunks.


### Split text into chunks

This step splits the extracted page text into chunks of at most 1000 characters, with up to 200 characters of overlap between consecutive chunks, using RecursiveCharacterTextSplitter. The overlap preserves continuity: text near a chunk boundary appears in both neighboring chunks, which are then ready for embedding and retrieval.

In [None]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 36 chunks
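
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking in plain Python. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical helper that only illustrates how overlapping windows are laid out.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):  # this window already reaches the end of the text
            break
        start += step
    return chunks


# 2500 characters of repeating digits as stand-in text
text = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(text)
print([len(chunk) for chunk in chunks])
# The last 200 characters of one chunk are the first 200 of the next
print(chunks[0][-200:] == chunks[1][:200])
```

Each window starts 800 characters after the previous one, so a 2500-character text yields windows starting at offsets 0, 800, and 1600.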


### Create vector database

This step lists the locally available models with !ollama list and downloads the nomic-embed-text embedding model with !ollama pull. The model generates the text embeddings locally, so no external API is needed once it has been downloaded.

In [None]:

!ollama list

NAME                       ID              SIZE      MODIFIED       
nomic-embed-text:latest    0a109f422b47    274 MB    7 seconds ago     
llama3.2:latest            a80c4f17acd5    2.0 GB    49 minutes ago    


In [92]:
!ollama pull nomic-embed-text


pulling manifest
pulling 970aa74c0a90... 100% ▕████████████████▏ 274 MB
pulling c71d239df917... 100% ▕████████████████▏  11 KB
pulling ce4a164fc046... 100% ▕████████████████▏   17 B
pulling 31df23ea7daa... 100% ▕████████████████▏  420 B
verifying sha256 digest
writing manifest
success


In [6]:
# 1. First clean up any existing ChromaDB installations
%pip uninstall -y chromadb
%pip uninstall -y protobuf

# 2. Install specific versions known to work together
%pip install -q protobuf==3.20.3
%pip install -q chromadb==0.4.22  # Using a stable older version
%pip install -q langchain-ollama

# 3. Set the environment variable
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# 4. Now reimport with the new versions
from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

Found existing installation: chromadb 0.4.22
Uninstalling chromadb-0.4.22:
  Successfully uninstalled chromadb-0.4.22
Note: you may need to restart the kernel to use updated packages.
Found existing installation: protobuf 5.29.0
Uninstalling protobuf-5.29.0:
  Successfully uninstalled protobuf-5.29.0
Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opentelemetry-proto 1.28.2 requires protobuf<6.0,>=5.0, but you have protobuf 3.20.3 which is incompatible.
grpcio-status 1.68.0 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 3.20.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


This step creates a vector database using Chroma to store and retrieve vector embeddings of text chunks from the PDF. This database is a key component of a Retrieval-Augmented Generation (RAG) system.

In [7]:
# Create vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
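
As an illustration of what a vector store does at query time, the sketch below ranks stored vectors by similarity to a query vector in plain Python. The three-dimensional vectors are toy values, not real nomic-embed-text embeddings, and cosine similarity is chosen here only for illustration; Chroma's configured distance metric may differ. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings"; real embedding models return much longer vectors
store = [
    ("headcount at year end", [0.9, 0.1, 0.0]),
    ("free cash flow", [0.1, 0.9, 0.0]),
    ("dividend announcement", [0.0, 0.2, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], store, k=2)
print([text for text, _ in results])
```

The query vector points mostly along the first axis, so the chunk whose vector does the same ranks first.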


### Set up LLM and Retrieval

This step initializes the Llama 3.2 model through ChatOllama for use in the RAG system, so text generation runs locally. The model receives the retrieved context and generates the answer to the user's query.

In [8]:
# Set up LLM and retrieval
local_model = "llama3.2"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

QUERY_PROMPT is the answer-generation prompt: it tells the LLM to answer the question using the retrieved context. The MultiQueryRetriever improves retrieval by having the LLM rewrite the question into several variations and merging the chunks retrieved for each one. Its query-rewriting prompt receives only a question, so the retriever keeps its default prompt rather than reusing QUERY_PROMPT.

In [48]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are an intelligent assistant tasked with answering user questions accurately and concisely using the most relevant sections of the uploaded PDF documents.

Use the provided context below to formulate a direct and precise response to the user question.

Context: {context}
Question: {question}
Answer:
"""
)

# Set up retriever with MultiQueryRetriever's default query-rewriting prompt.
# QUERY_PROMPT is not passed here: it also expects "context", which the
# retriever does not supply when it generates query variations.
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
)
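
The multi-query idea can be sketched without an LLM: run several rewrites of the question through a retriever and keep the de-duplicated union of the results. In the sketch below, `keyword_retriever` is a toy stand-in for the vector retriever, the query variants are hypothetical hand-written rewrites (in the notebook the LLM generates them), and `multi_query_retrieve` is an illustrative helper rather than LangChain's implementation.

```python
def keyword_retriever(query, documents):
    """Toy stand-in for a vector retriever: return documents sharing a word with the query."""
    query_words = set(query.lower().split())
    return [doc for doc in documents if query_words & set(doc.lower().split())]


def multi_query_retrieve(query_variants, documents):
    """Run every query variant and return the de-duplicated union, in first-seen order."""
    seen = set()
    merged = []
    for variant in query_variants:
        for doc in keyword_retriever(variant, documents):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


documents = [
    "headcount was 67,317 at year end",
    "the workforce decreased year over year",
    "free cash flow was positive",
]
# Hypothetical rewrites of one user question
variants = ["meta headcount change", "workforce size 2023", "headcount at year end"]
merged = multi_query_retrieve(variants, documents)
print(merged)
```

Each variant alone finds only part of the relevant material; the union covers both headcount-related documents, and the third variant adds no duplicates.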

### Create chain

The RAG chain combines retrieval and generation: the retriever fetches relevant context, the prompt template combines it with the question, and the LLM generates the answer. StrOutputParser returns the answer as a plain string, and the prompt instructs the model to answer only from the retrieved context.

In [49]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [50]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
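
The prompt-assembly step of the chain can be sketched in plain Python. This assumes the retrieved chunk texts are joined with blank lines before being placed in the template, which is a formatting choice made for illustration; `format_context` and `build_prompt` are hypothetical helpers, and the chunk strings are placeholders rather than text from the PDF.

```python
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def format_context(chunks):
    """Join retrieved chunk texts with blank lines (an assumed formatting choice)."""
    return "\n\n".join(chunks)


def build_prompt(chunks, question):
    """Fill the RAG template with the joined context and the user question."""
    return template.format(context=format_context(chunks), question=question)


prompt_text = build_prompt(
    ["Chunk A text.", "Chunk B text."],  # placeholder chunk texts
    "What is the report quarter?",
)
print(prompt_text)
```

The resulting string is what the LLM sees: the instruction, the context, and the question, in that order.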

### Chat with PDF

This function retrieves the most relevant context from the vector database for a given question and uses the LLM to generate a context-based answer. If no context is retrieved, it reports that instead of calling the LLM; otherwise it displays the question and answer as Markdown.

In [51]:
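# Answer a question from the PDF using retrieved context and the LLM
def get_relevant_context(question):
    """
    Retrieve matching chunks and join their text into one context string.
    Assumed implementation: the original helper is not shown in this notebook.
    """
    docs = retriever.invoke(question)
    return "\n\n".join(doc.page_content for doc in docs)


def chat_with_pdf(question):
    """
    Chat with the PDF using the vector database and LLM.
    """
    # Retrieve context
    context = get_relevant_context(question)

    # Stop early if nothing relevant was retrieved
    if not context.strip():
        display(Markdown(f"**Question:** {question}\n\n**Answer:** No relevant context found in the documents."))
        return

    # Format the prompt with context and question
    formatted_prompt = QUERY_PROMPT.format(context=context, question=question)

    # Generate the answer using the LLM (invoke returns a message; .content is its text)
    answer = llm.invoke(formatted_prompt).content

    # Display the question and answer
    display(Markdown(f"**Question:** {question}\n\n**Answer:** {answer}"))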


### Questions and Answers:

In [None]:
# Example 1
chat_with_pdf("How did Meta’s workforce change by the end of 2023")

**Question:** How did Meta’s workforce change by the end of 2023

**Answer:** By the end of 2023, Meta's workforce decreased by 22% to 67,317 employees as of December 31, 2023.

In [64]:
# Example 2
chat_with_pdf("What is the report quarter, and when did it end?")

**Question:** What is the report quarter, and when did it end?

**Answer:** The report quarter is the fourth quarter of 2023, and it ended on December 31, 2023.

In [None]:
# Example 3
chat_with_pdf("What were the key financial highlights this quarter(revenue, gross margin,operating expenses, operating margin, net income, and EPS)?")

**Question:** What were the key financial highlights this quarter(revenue, gross margin,operating expenses, operating margin, net income, and EPS)?

**Answer:** Here are the key financial highlights for this quarter:

1. Revenue: $34.5-37 billion (range), with free cash flow of $11.50 billion and $43.01 billion for the fourth quarter and full year 2023, respectively.
2. Operating Expenses: $94-99 billion (unchanged from prior outlook).
3. Operating Margin: 41%.
4. Net Income: $14.017 billion (201% increase) and $39.098 billion (69% increase) for the quarter and full year 2023, respectively.
5. Earnings Per Share (EPS): $5.33 (203% increase) and $14.87 (73% increase) for the quarter and full year 2023, respectively.

In [None]:
# Example 4
chat_with_pdf("How much did Meta spend on restructuring for the whole year and Q4")

**Question:** How much did Meta spend on restructuring for the whole year and Q4

**Answer:** For the full year 2023, Meta spent $3.45 billion on restructuring charges. For the fourth quarter 2023, Meta spent $1.15 billion on restructuring charges.

In [None]:
# Example 5
chat_with_pdf("What happened with Meta’s ad impressions and average price per ad in Q4and for the whole year?")

**Question:** What happened with Meta’s ad impressions and average price per ad in Q4and for the whole year?

**Answer:** For the fourth quarter of 2023, ad impressions delivered across Meta's Family of Apps increased by 21% year-over-year, while the average price per ad increased by 2%. For the full year 2023, ad impressions increased by 28% year-over-year, and the average price per ad decreased by 9%.

In [53]:
# Example 6
chat_with_pdf("What’s the revenue outlook for Q1 2024?")

**Question:** What’s the revenue outlook for Q1 2024?

**Answer:** The revenue outlook for Q1 2024 is expected to be in the range of $34.5-37 billion, assuming neutral foreign currency exchange rates based on current exchange rates.

In [54]:
# Example 7
chat_with_pdf("What were Meta’s total costs and expenses for Q4 and the full year 2023?")

**Question:** What were Meta’s total costs and expenses for Q4 and the full year 2023?

**Answer:** According to the document, Meta's total costs and expenses were:

* For the fourth quarter 2023: $23.73 billion
* For the full year 2023: $88.15 billion

In [55]:
# Example 8
chat_with_pdf("How much cash and marketable securities did Meta have on hand as of December 31, 2023?")

**Question:** How much cash and marketable securities did Meta have on hand as of December 31, 2023?

**Answer:** As of December 31, 2023, Meta had $41.862 billion in cash, $23.541 billion in marketable securities, and a total of $65.40 billion in cash, cash equivalents, and marketable securities.

In [56]:
# Example 9
chat_with_pdf("What were the main areas Meta invested in during 2023?")

**Question:** What were the main areas Meta invested in during 2023?

**Answer:** During 2023, Meta made significant investments in advancing AI and building the metaverse. According to Mark Zuckerberg, Meta's founder and CEO, the company has made a lot of progress on its vision for advancing AI and the metaverse. This indicates that one of the main areas Meta invested in during 2023 was research and development in artificial intelligence (AI) and the metaverse, with a focus on immersive experiences like augmented and virtual reality.

In [57]:
# Example 10
chat_with_pdf("How did the Family of Apps and Reality Labs perform in Q4 2023?")

**Question:** How did the Family of Apps and Reality Labs perform in Q4 2023?

**Answer:** In Q4 2023, Meta's Family of Apps generated $39,040 million in revenue and $21,030 million in income from operations, representing a year-over-year increase of 6% and 156%, respectively. Meanwhile, Reality Labs recorded $1,071 million in revenue and a loss of $4,646 million in income from operations, compared to $727 million in revenue and a loss of $4,279 million in the same quarter of 2022.

In [58]:
# Example 11
chat_with_pdf("How much free cash flow did Meta generate in Q4 and the full year 2023?")

**Question:** How much free cash flow did Meta generate in Q4 and the full year 2023?

**Answer:** Meta generated $11.50 billion in free cash flow in Q4 2023 and $43.01 billion in the full year 2023.

In [59]:
# Example 12
chat_with_pdf("Did Meta make any changes to its stock repurchase program or dividends for 2024?")

**Question:** Did Meta make any changes to its stock repurchase program or dividends for 2024?

**Answer:** Yes, according to the provided context, Meta has initiated a quarterly dividend program, with a cash dividend of $0.50 per share payable on March 26, 2024, to stockholders of record as of February 22, 2024. This marks a change from previous years and demonstrates the company's intention to pay a cash dividend on a quarterly basis, subject to market conditions and approval by its board of directors.

In [60]:
# Example 13
chat_with_pdf("What risks did Meta highlight for 2024?")

**Question:** What risks did Meta highlight for 2024?

**Answer:** Meta highlighted several risks for 2024, including:

1. Impact of macroeconomic conditions on business and financial results.
2. Legal and regulatory headwinds in the EU and US, particularly with regards to the Federal Trade Commission's attempt to substantially modify Meta's existing consent order.
3. Regulatory changes that could significantly impact their business and financial results.

Additionally, they mentioned that if they are unsuccessful in contesting the matter, it would have an adverse impact on their business.

In [61]:
# Example 14
chat_with_pdf("What drove Meta’s revenue growth in Q4 2023?")

**Question:** What drove Meta’s revenue growth in Q4 2023?

**Answer:** According to the press release, Meta's revenue grew by 25% in Q4 2023 compared to Q4 2022, reaching $40.111 billion.

In [62]:
# Example 15
chat_with_pdf("How did Reality Labs perform throughout 2023, and what’s Meta’s plan for 2024")

**Question:** How did Reality Labs perform throughout 2023, and what’s Meta’s plan for 2024

**Answer:** According to Meta's fourth quarter and full year 2023 financial highlights, Reality Labs is expected to incur operating losses that increase meaningfully year-over-year due to ongoing product development efforts in augmented reality/virtual reality and investments to further scale the ecosystem. However, no detailed performance metrics are provided for Reality Labs in the given release.

### Clean up (optional)

In [93]:
# Optional: Clean up when done 
#vector_db.delete_collection()
#print("Vector database deleted successfully")

### ROUGE Score

In [2]:
from rouge_score import rouge_scorer

# Actual and generated answers
actual_answer = """By the end of 2023, Meta had 67,317 employees. In 2022, Meta
had 87,314 employees, so Meta had a 22.9% decrease in 2023 compared to
2022."""

generated_answer = """By the end of 2023, Meta's headcount decreased to 67,317 as of December 31, 2023, a decrease of 22% year-over-year."""

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compute ROUGE scores
scores = scorer.score(actual_answer, generated_answer)

# Display results
print("ROUGE-1:", scores['rouge1'])
print("ROUGE-2:", scores['rouge2'])
print("ROUGE-L:", scores['rougeL'])


ROUGE-1: Score(precision=0.5416666666666666, recall=0.4482758620689655, fmeasure=0.49056603773584906)
ROUGE-2: Score(precision=0.2608695652173913, recall=0.21428571428571427, fmeasure=0.23529411764705882)
ROUGE-L: Score(precision=0.4166666666666667, recall=0.3448275862068966, fmeasure=0.37735849056603776)


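The ROUGE scores indicate moderate overlap between the actual and generated answers. ROUGE-1 (49.06% F1) shows that about half of the individual words match, ROUGE-2 (23.53% F1) shows that few word pairs match, which reflects different phrasing, and ROUGE-L (37.74% F1) shows that the longest shared word sequence covers just over a third of each answer. ROUGE measures lexical overlap only, so these scores reflect wording differences rather than a factual disagreement: both answers report 67,317 employees and a decrease of roughly 22%.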
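
ROUGE-L is based on the longest common subsequence (LCS) of the two token sequences. The sketch below computes it in plain Python for a pair of toy sentences that are not taken from the PDF; it splits on whitespace and applies no stemming, so it will not reproduce the `rouge_score` numbers above exactly. `lcs_length` and `rouge_l` are hypothetical helper names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, candidate):
    """Return (precision, recall, F1) based on the LCS of whitespace tokens."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    lcs = lcs_length(ref_tokens, cand_tokens)
    precision = lcs / len(cand_tokens) if cand_tokens else 0.0
    recall = lcs / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1


# Toy sentences: they differ in one word, so the LCS has 5 of 6 tokens
precision, recall, f1 = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print(precision, recall, f1)
```

Because the LCS must preserve word order, ROUGE-L drops when the same words appear in a different order, which ROUGE-1 does not penalize.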

In [96]:
# Actual and generated answers
actual_answer = """The reported quarter is the fourth quarter of 2023, The quarter
ended on December 31, 2023."""

generated_answer = """The report quarter is the fourth quarter of 2023, and it ended on December 31, 2023."""

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compute ROUGE scores
scores = scorer.score(actual_answer, generated_answer)

# Display results
print("ROUGE-1:", scores['rouge1'])
print("ROUGE-2:", scores['rouge2'])
print("ROUGE-L:", scores['rougeL'])


ROUGE-1: Score(precision=0.875, recall=0.875, fmeasure=0.875)
ROUGE-2: Score(precision=0.8, recall=0.8, fmeasure=0.8000000000000002)
ROUGE-L: Score(precision=0.875, recall=0.875, fmeasure=0.875)


Reference: https://github.com/tonykipkemboi/ollama_pdf_rag/tree/main

The ROUGE scores show high overlap between the actual and generated answers. ROUGE-1 and ROUGE-L (87.5% F1) show that most words match and appear in the same order, while ROUGE-2 (80% F1) shows that most word pairs match as well. The two answers are worded almost identically, which is what ROUGE rewards; the scores measure lexical overlap and do not by themselves verify factual accuracy.
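
To show where a unigram score comes from, the sketch below recomputes ROUGE-1 for this second pair in plain Python. It lowercases and splits on non-alphanumeric characters but applies no stemming, which is a deliberate simplification: "reported" and "report" then count as different words, so the result is 13/16 = 0.8125 instead of the 0.875 that `rouge_score` reports with `use_stemmer=True`. `tokenize` and `rouge1` are hypothetical helper names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(reference, candidate):
    """Return (precision, recall, F1) from clipped unigram overlap."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


reference = "The reported quarter is the fourth quarter of 2023, The quarter ended on December 31, 2023."
candidate = "The report quarter is the fourth quarter of 2023, and it ended on December 31, 2023."
precision, recall, f1 = rouge1(reference, candidate)
print(precision, recall, f1)
```

Both answers contain 16 tokens and share 13 of them, so precision, recall, and F1 are all 0.8125 here; stemming adds the fourteenth match.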