<a href="https://colab.research.google.com/github/avikumart/LLM-GenAI-Transformers-Notebooks/blob/main/TMLC_LLM_projects/AI_agents/Agentic_RAG_phi_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Libraries

In [1]:
!pip install phidata openai duckduckgo-search lancedb tantivy pypdf -q

## Loading Libraries and API keys

In [2]:
import os
from google.colab import userdata
OpenAI_API = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OpenAI_API
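
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. The following is a minimal sketch of a more portable loader, assuming you are happy to fall back to an interactive prompt; the helper name `load_api_key` is hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, Colab secrets, or a prompt."""
    # 1. Reuse a key that is already set in the environment.
    key = os.environ.get(name)
    if key:
        return key
    # 2. Fall back to Colab's secret store when running inside Colab.
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
    except Exception:
        key = None
    # 3. As a last resort, prompt for the key without echoing it.
    if not key:
        key = getpass(f"Enter {name}: ")
    os.environ[name] = key
    return key
```

Calling `load_api_key()` once at the top of the notebook would then replace the two `userdata` lines.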

In [3]:
from phi.knowledge.pdf import PDFKnowledgeBase, PDFReader
from phi.vectordb.lancedb import LanceDb
from phi.vectordb.search import SearchType
from phi.agent import Agent
from phi.storage.agent.sqlite import SqlAgentStorage
from phi.model.openai import OpenAIChat
from phi.tools.duckduckgo import DuckDuckGo

## VectorDB setup

In [4]:
# Initialize a LanceDb object for vector database operations
vector_db = LanceDb(
    table_name="documents",  # Name of the table to store vector data
    uri="/tmp/lancedb",    # File path for the database storage
    search_type=SearchType.keyword,  # Use keyword (full-text) search instead of vector similarity
)
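
With `SearchType.keyword`, retrieval matches on the terms in the query rather than on embeddings. The toy sketch below illustrates the idea in plain Python; it is not how LanceDB ranks results (its full-text index is more sophisticated), and the function names and sample documents are invented for illustration.

```python
def keyword_score(query: str, text: str) -> int:
    """Count how many distinct query terms appear in the text (toy scoring)."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms)


def keyword_search(query: str, docs: list[str], limit: int = 2) -> list[str]:
    """Return up to `limit` documents with a non-zero score, best first."""
    scored = [(keyword_score(query, d), d) for d in docs]
    matches = [(s, d) for s, d in scored if s > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in matches[:limit]]


# Hypothetical sample documents, not taken from the GPT-4 paper.
docs = [
    "GPT-4 accepts image and text inputs",
    "LanceDB stores vectors on disk",
    "text inputs only",
]
print(keyword_search("image inputs", docs))
print(keyword_search("visual", docs))  # no shared term, so nothing matches
```

The second call shows the main limitation of pure keyword matching: a query that uses a synonym ("visual" instead of "image") shares no terms with the documents and so retrieves nothing.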

In [5]:
# Create a directory named 'Data' in the current working directory
# (-p avoids an error if the directory already exists when the cell is re-run)
! mkdir -p Data

# Download the GPT-4 paper PDF from the given URL
# Save the downloaded file into the 'Data' directory with the name 'gpt-4.pdf'
! wget "https://cdn.openai.com/papers/gpt-4.pdf" -O Data/gpt-4.pdf

--2025-02-10 06:20:47--  https://cdn.openai.com/papers/gpt-4.pdf
Resolving cdn.openai.com (cdn.openai.com)... 13.107.253.69, 2620:1ec:29:1::69
Connecting to cdn.openai.com (cdn.openai.com)|13.107.253.69|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5229908 (5.0M) [application/pdf]
Saving to: ‘Data/gpt-4.pdf’


2025-02-10 06:20:47 (55.0 MB/s) - ‘Data/gpt-4.pdf’ saved [5229908/5229908]
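
Outside a notebook shell, the same step can be done in pure Python. This is a rough sketch using only the standard library; the helper name `download_if_missing` is hypothetical, and the download path has not been verified against this particular server.

```python
import os
import urllib.request


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless the file already exists.

    Returns True when a download happened, False when it was skipped.
    """
    # Create the parent directory; exist_ok avoids an error on re-runs.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    if os.path.exists(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True
```

A call such as `download_if_missing("https://cdn.openai.com/papers/gpt-4.pdf", "Data/gpt-4.pdf")` would mirror the `mkdir` and `wget` commands above, while skipping the download on repeated runs.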



In [6]:
# Initialize a PDFKnowledgeBase object for managing knowledge extracted from PDF files
pdf_knowledge_base = PDFKnowledgeBase(
    path="/content/Data/",               # Path to the directory containing PDF files
    vector_db=vector_db,                 # Vector database instance for storing and querying extracted information
    reader=PDFReader(chunk=True),        # PDFReader instance with chunking enabled to process documents in smaller parts
)
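
`PDFReader(chunk=True)` splits each document into smaller pieces before they are stored. As a rough illustration of what chunking means, here is a fixed-size character chunker with overlap; this is an assumption-laden sketch, not phidata's actual chunking logic, and `chunk_text` with its parameters is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# -> ['abcd', 'defg', 'ghij', 'j']
```

Smaller chunks tend to make retrieval more precise but give the model less surrounding context per hit, so the chunk size is a trade-off rather than a fixed rule.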

In [7]:
pdf_knowledge_base.load() # load the data into vector db

## Agent Setup

In [8]:
agent = Agent(
    model=OpenAIChat(id="gpt-4o-mini"),
    knowledge=pdf_knowledge_base,
    tools=[DuckDuckGo()],
    show_tool_calls=True,
    markdown=True,
    storage=SqlAgentStorage(table_name="data", db_file="data.db"),
    add_history_to_messages=True,
)

In [9]:
agent.print_response(
  "Does GPT-4 accept visual inputs?",
  stream=True
)


In [10]:
agent.print_response(
  "What is the comparison of GPT-4 and GPT-3.5?",
  stream=True
)


For both questions, the tool calls displayed with the response (enabled by `show_tool_calls=True`) show which source of information was used to generate the answer. The second query required a web search, while the information needed for the first query was already present in the vector database.
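
The choice between the two sources can be pictured with a simplified sketch. In the notebook the model itself decides which tool to call; the fixed "knowledge base first, web second" rule below is an assumption made for illustration, and `answer_with_fallback` and the stub search functions are hypothetical.

```python
from typing import Callable


def answer_with_fallback(
    query: str,
    search_knowledge_base: Callable[[str], list[str]],
    search_web: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Return (source, results), preferring the knowledge base over the web."""
    hits = search_knowledge_base(query)
    if hits:
        return "knowledge_base", hits
    # Nothing relevant was found locally, so fall back to a web search.
    return "web", search_web(query)


# Stub search functions standing in for the vector database and DuckDuckGo.
def fake_kb(query: str) -> list[str]:
    return ["chunk about image inputs"] if "visual" in query.lower() else []


def fake_web(query: str) -> list[str]:
    return [f"web result for: {query}"]


print(answer_with_fallback("Does GPT-4 accept visual inputs?", fake_kb, fake_web))
print(answer_with_fallback("GPT-4 vs GPT-3.5 comparison", fake_kb, fake_web))
```

Returning the source label alongside the results plays the same role as `show_tool_calls=True`: it makes visible where each answer came from.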