# Ollama PDF RAG Notebook

## Import Libraries


In [3]:
# Set the protobuf implementation before importing libraries that may load protobuf
import os
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# LangChain imports
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Jupyter-specific imports
from IPython.display import display, Markdown

## Load PDF

In [None]:
from langchain_community.document_loaders import PyPDFLoader

local_path = "scammer-agent.pdf"
if local_path:
    loader = PyPDFLoader(file_path=local_path)
    try:
        data = loader.load()
        print(f"PDF loaded successfully: {local_path}")
    except Exception as e:
        print(f"Error loading PDF: {e}")
        raise  # later cells need `data`, so stop here instead of continuing
else:
    print("No local path provided")
# Alternative loader: UnstructuredPDFLoader(file_path=local_path) can be used in place of PyPDFLoader.

PDF loaded successfully: scammer-agent.pdf


## Split text into chunks

In [9]:
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(data)
print(f"Text split into {len(chunks)} chunks")

Text split into 23 chunks
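
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(repr(piece))
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.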


## Create vector database

In [10]:
# Create vector database

vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="local-rag"
)
print("Vector database created successfully")

Vector database created successfully
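
As a rough illustration of what a vector store does at query time, the sketch below ranks stored vectors by similarity to a query vector. It uses cosine similarity and a brute-force scan purely for illustration; Chroma's actual distance metric and index may differ, and the three-dimensional vectors and chunk labels are made-up toy data, not output of `nomic-embed-text`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k keys whose vectors are most similar to query_vec."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings" (hypothetical values for illustration only)
toy_store = {
    "chunk about bank scams": [0.9, 0.1, 0.0],
    "chunk about API cost": [0.1, 0.9, 0.0],
    "chunk about references": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]
results = top_k(query, toy_store, k=2)
print(results)
```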


## Set up LLM and Retrieval

In [11]:
# Set up LLM and retrieval
local_model = "llama3.2"  # or whichever model you prefer
llm = ChatOllama(model=local_model)

In [12]:
# Query prompt template
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

# Set up retriever
retriever = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm,
    prompt=QUERY_PROMPT
)
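
The idea behind multi-query retrieval can be sketched in plain Python: ask a model for newline-separated rewrites of the question, search once per rewrite, and keep the unique union of the results. This is a simplified stand-in, not `MultiQueryRetriever`'s implementation; `fake_generate`, `fake_index`, and the chunk labels are hypothetical placeholders for the LLM and the vector store.

```python
def parse_variants(llm_output: str) -> list[str]:
    """Split newline-separated questions, dropping blank lines."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def multi_query_retrieve(question, generate_variants, search):
    """Search once per generated variant and merge results without duplicates."""
    seen = set()
    merged = []
    for query in parse_variants(generate_variants(question)):
        for doc in search(query):
            if doc not in seen:  # keep the first occurrence only
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for the LLM that rewrites the question
def fake_generate(question):
    return "What scams does the paper cover?\n\nWhich scams are studied?\n"


# Hypothetical stand-in for the vector store: query -> matching chunk ids
fake_index = {
    "What scams does the paper cover?": ["chunk-3", "chunk-7"],
    "Which scams are studied?": ["chunk-7", "chunk-12"],
}

docs = multi_query_retrieve(
    "What is this about?", fake_generate, lambda q: fake_index.get(q, [])
)
print(docs)
```

Because each rewrite lands in a slightly different region of the embedding space, the merged list can contain chunks that a single query would have missed.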

## Create chain

In [13]:
# RAG prompt template
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [14]:
# Create chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
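
The data flow through the chain can be approximated without LangChain: build a dict from the question, fill the prompt template, call the model, and return a string. The sketch below is an assumption-laden stand-in, not the LCEL implementation; `fake_retrieve` and `fake_generate` are hypothetical, and joining chunk texts with blank lines is a formatting choice of this sketch rather than what the chain above does with retrieved documents.

```python
TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def run_chain(question, retrieve, generate):
    """Mimic the steps: {"context", "question"} -> prompt -> model -> string."""
    inputs = {
        "context": "\n\n".join(retrieve(question)),  # sketch-only formatting choice
        "question": question,  # passed through unchanged
    }
    prompt_text = TEMPLATE.format(**inputs)
    return str(generate(prompt_text))


# Hypothetical stand-ins for the retriever and the LLM
def fake_retrieve(question):
    return ["LLM agents can perform scams.", "Success rates vary."]


captured = []  # records the prompt the stub model receives


def fake_generate(prompt_text):
    captured.append(prompt_text)
    return "stub answer"


answer = run_chain("What is studied?", fake_retrieve, fake_generate)
print(captured[0])
print(answer)
```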

## Chat with PDF

In [15]:
def chat_with_pdf(question):
    """
    Run the RAG chain on a question and render the answer as Markdown.
    """
    display(Markdown(chain.invoke(question)))

In [17]:
# Example 1
chat_with_pdf("What is the main idea of this document?")

The main idea of this document appears to be a study on scams that can be performed using language models (LLMs), with a focus on understanding the actions and tools required for each scam, as well as the success rates and difficulties of different types of scams. The document includes a series of agents designed to perform common scams, including bank account transfer scams, and analyzes their performance in terms of success rate, number of actions, call time, and API cost.

In [18]:
# Example 2
chat_with_pdf("What is the purpose of the scammer agent?")

The purpose of the scammer agent is not explicitly stated as the primary goal, but rather as a tool to perform common scams that have been identified by the government. The agents are designed to autonomously perform actions needed to conduct these scams, such as logging into bank accounts and completing two-factor authentication processes.

In [19]:
# Example 3
chat_with_pdf("Can you explain the case study highlighted in the document?")

The document appears to be a collection of references and papers related to various topics, including language models, AI, and security.

Unfortunately, I do not see a specific "case study" mentioned in the provided text. However, I can try to extract some relevant information from the references that may be related to a case study.

One paper that stands out is "Llm agents can autonomously hack websites" by Richard Fang et al. (2024a). This paper describes how Large Language Models (LLMs) can be used to autonomously hack websites, possibly due to their ability to generate persuasive and convincing text. However, I do not have enough information about this specific case study to provide a detailed explanation.

Another paper that may be related is "The scams among us: Who falls prey and why" by Yaniv Hanoch and Stacey Wood (2021). This paper explores the topic of social engineering attacks and how people can fall victim to these types of scams. However, I do not see any direct connection between this paper and a specific case study.

Without more context or information about the specific case study being highlighted in the document, it is difficult for me to provide a detailed explanation. If you have any additional details or clarification about the case study, I would be happy to try and assist further.

## Clean up (optional)

In [20]:
# Optional: Clean up when done 
vector_db.delete_collection()
print("Vector database deleted successfully")

Vector database deleted successfully
