# Ollama PDF RAG Notebook

### Import Libraries

In [1]:
# Set environment variable for protobuf before any library that uses it is imported
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Suppress warnings (set up first so import-time warnings are silenced too)
import warnings
warnings.filterwarnings('ignore')

# Imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Jupyter-specific imports
from IPython.display import display, Markdown

### Load PDF

In [2]:
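# Load PDF (fail early if the file is missing, since later cells need `data`)
local_path = "scammer-agent.pdf"
if os.path.isfile(local_path):
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
    print(f"PDF loaded successfully: {local_path}")
else:
    raise FileNotFoundError(f"PDF not found: {local_path}")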

PDF loaded successfully: scammer-agent.pdf


### Split text into chunks

In [3]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 23 chunks
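
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified illustration only: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and sentences, which this sketch does not attempt. The function name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece starts with the last four characters of the previous one, which is the property that keeps a sentence straddling a boundary retrievable from at least one chunk.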


### Create vector database

In [4]:
# Create vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
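
As a rough picture of what the vector database does at query time, the sketch below ranks a few toy vectors by similarity to a query vector. Everything here is an assumption made for illustration: the three-dimensional vectors, the chunk labels, and the choice of cosine similarity. Real embeddings from `nomic-embed-text` have far more dimensions, and the distance metric Chroma uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical toy "embeddings"; not produced by any real model
toy_store = {
    "chunk about bank scams": [0.9, 0.1, 0.0],
    "chunk about voice agents": [0.1, 0.9, 0.1],
    "chunk about references": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]
nearest = top_k(query, toy_store, k=2)
print(nearest)
```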


### Set up LLM and Retrieval

In [5]:
# Set up LLM and retrieval
local_model = "llama3.2:3b"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

In [6]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(),
    llm,
    prompt=QUERY_PROMPT
)
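
The idea behind the multi-query setup above can be sketched without any LLM: run a search once per query variant, then keep the unique union of the hits. The sketch below assumes a hypothetical `fake_search` function and hand-written query variants and chunk IDs in place of the vector store and the LLM-generated rewrites; it illustrates the merging step only, not the library's internals.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str], search: Callable[[str], list[str]]) -> list[str]:
    """Run every query variant and return the unique union of hits in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:  # skip chunks already returned by an earlier variant
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for a vector store lookup: query text -> chunk IDs
FAKE_INDEX = {
    "What is the purpose of the scammer agent?": ["chunk-3", "chunk-7"],
    "What does the scammer agent try to achieve?": ["chunk-7", "chunk-12"],
    "Why was the scammer agent built?": ["chunk-1", "chunk-3"],
}


def fake_search(query: str) -> list[str]:
    return FAKE_INDEX.get(query, [])


variants = list(FAKE_INDEX)
merged = multi_query_retrieve(variants, fake_search)
print(merged)
```

Differently worded questions land near different chunks, so the union tends to cover more of the relevant text than a single search would, at the cost of one extra LLM call to generate the variants.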

### Create chain

In [7]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [8]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
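
In the chain above, the retriever's output is a list of documents that is handed to the prompt as-is, so `{context}` is filled with the string form of that list, metadata included. A possible refinement is to join only the text of each chunk before it reaches the prompt. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `FakeDocument` is a stand-in that models only the `page_content` attribute of a LangChain document.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document; only page_content is modeled."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument("First retrieved chunk."),
    FakeDocument("Second retrieved chunk."),
]
context = format_docs(docs)
print(context)
```

In the notebook, this could be wired in as `"context": retriever | format_docs`, relying on the pipe operator to wrap a plain function as a runnable. Whether answers improve is worth checking against the original chain.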

### Chat with PDF

In [9]:
def chat_with_pdf(question):
    """
    Run the RAG chain on a question and render the answer as Markdown.
    """
    # display() returns None, so there is nothing useful to return here
    display(Markdown(chain.invoke(question)))

In [10]:
# Example 1
chat_with_pdf("What is the main idea of this document?")

The main idea of this document appears to be the exploration of the capabilities of AI technology, specifically in relation to scams and cybersecurity attacks. The document discusses various experiments and case studies that demonstrate how voice-enabled AI agents can autonomously perform actions needed to conduct common scams, such as banking scams. The authors highlight the potential dual-use capabilities of AI technology, which can be used for both positive and malicious purposes.

In [11]:
# Example 2
chat_with_pdf("What is the purpose of the scammer agent?")

The purpose of the scammer agent appears to be to conduct phone-based scams, specifically to convince victims to take actions or reveal sensitive information, and ultimately steal funds from them. The agent uses voice-enabled AI capabilities to interact with victims in real-time, reacting to changes in the environment and retrying based on faulty information from the victim.

In [12]:
# Example 3
chat_with_pdf("Can you explain the case study highlighted in the document?")

The case study highlights a bank transfer scam. The scammer calls the victim claiming to be from Bank of America, stating that they need to verify their account information for security purposes. The victim is tricked into providing their login credentials and 2FA code.

Here's a step-by-step breakdown of the scenario:

1. The scammer calls the victim claiming to be from Bank of America.
2. The scammer asks the victim to provide their username and password.
3. The victim provides the login credentials.
4. The scammer requests the victim's 2FA code.
5. The victim provides the 2FA code.
6. The scammer navigates to the victim's bank account, fills out a transfer form with the victim's information, and transfers money.

The case study demonstrates how the voice-enabled AI agent can autonomously perform these actions, highlighting the capabilities of the agent in conducting common scams like this one.

### Clean up (optional)

In [13]:
# Optional: delete the Chroma collection when done
vector_db.delete_collection()
print("Vector database deleted successfully")

Vector database deleted successfully
