# Agentic Search 

<p>This notebook sets up agentic search using LangChain. We will define an agent with the following tools:</p>

- Ability to query our Qdrant database
  
  - Allow it to also formulate its own queries and keep the best one
  
- Ability to open our PDF and read a particular page
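
The second tool in the list, reading a particular page of a PDF, could look roughly like the sketch below. It is only an illustration: the names `read_pdf_page` and `load_pages_with_pypdf` are hypothetical, the third-party `pypdf` package is an assumed choice of reader, and `page_number` is assumed to be zero-based (the convention used at ingestion time may differ).

```python
from typing import Callable, List


def load_pages_with_pypdf(pdf_path: str) -> List[str]:
    """Extract the text of every page (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # imported lazily so the rest runs without pypdf

    reader = PdfReader(pdf_path)
    # Fall back to an empty string for pages with no extractable text.
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_page(
    pdf_path: str,
    page_number: int,
    load_pages: Callable[[str], List[str]] = load_pages_with_pypdf,
) -> str:
    """Return the text of one page, or an error message the agent can read.

    Assumption: page_number is zero-based; adjust if the stored pages are one-based.
    """
    pages = load_pages(pdf_path)
    if not 0 <= page_number < len(pages):
        # Returning a message (rather than raising) lets the agent recover.
        return f"Page {page_number} is out of range; the document has {len(pages)} pages."
    return pages[page_number]
```

To expose this to the agent, a thin wrapper that takes only `pdf_path` and `page_number` could be decorated with `@tool` and added to the agent's tool list.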

In [36]:
# !pip install -qU langchain-qdrant

In [37]:
MAX_CHUNK_SIZE = 1024

In [38]:
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad.openai_tools import (
    format_to_openai_tool_messages,
)
from langchain.agents.output_parsers.openai_tools import OpenAIToolsAgentOutputParser
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small", chunk_size=MAX_CHUNK_SIZE)
client = QdrantClient(url="http://localhost:6333")

In [39]:
vector_store = QdrantVectorStore(
    client=client,
    collection_name="pdf_collection",
    embedding=embeddings_model,
)
retriever = vector_store.as_retriever()

## Building The Agent

In [40]:
@tool
def natural_language_query(input: str):
    """Query the vector store of research-paper chunks and return the 10 most relevant chunks across all papers.
    Each result has the following structure:
    {
        "text": "string",
        "pdf_path": "string",  # the path to the PDF file
        "page_number": 0,  # the page number where the text is located
        "idx": "string",  # the ID of the chunk, e.g. 1.txt, 2.txt; each PDF is split sequentially into chunks
        "score": 0.0  # the cosine similarity between the input and the text
    }
    """
    # embed_query is the single-string counterpart of embed_documents
    query_vector = embeddings_model.embed_query(input)
    result = client.search(collection_name="pdf_collection", query_vector=query_vector, limit=10)
    return [
        {
            "text": r.payload["text"],
            "pdf_path": r.payload["pdf_file"],
            "page_number": r.payload["page"],
            "idx": r.payload["idx"],
            "score": r.score,
        }
        for r in result
    ]
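
The introduction mentions letting the agent try its own queries and keep the best one. One possible heuristic, sketched below, runs several candidate queries and keeps the one whose top results have the highest mean score. The helper names `mean_top_score` and `best_query` are hypothetical, the `search` callable is assumed to return dicts with a `score` field like the tool above, and higher scores are assumed to mean better matches. Scores from different queries are not strictly comparable, so treat this as a rough ranking rather than a guarantee.

```python
from typing import Callable, Dict, List, Tuple

Result = Dict[str, object]


def mean_top_score(results: List[Result], top_k: int) -> float:
    """Average the top_k highest scores; an empty result list scores 0.0."""
    scores = sorted((float(r["score"]) for r in results), reverse=True)[:top_k]
    return sum(scores) / len(scores) if scores else 0.0


def best_query(
    queries: List[str],
    search: Callable[[str], List[Result]],
    top_k: int = 3,
) -> Tuple[str, List[Result]]:
    """Run every candidate query and return the best one with its results."""
    if not queries:
        raise ValueError("at least one query is required")
    scored = []
    for query in queries:
        results = search(query)
        scored.append((mean_top_score(results, top_k), query, results))
    # Compare on the score only; on ties, max() keeps the earliest query.
    _, query, results = max(scored, key=lambda item: item[0])
    return query, results
```

With the tool above, `search` could be `lambda q: natural_language_query.invoke({"input": q})`.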

In [41]:
research_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        """
        You are an agent that searches technical research papers. You will be given a topic to find information on. You will need to find the most relevant information and return it with sources. 
        Make sure you only return information that you have retrieved from the research papers. Do not make up any information. You will have access to tools to help you find the information.

        Return the output by stating the information you found and the source of the information in the format pdf_path - page number - the section (if mentioned) in the papers we queried. The page number and pdf_path should only be from our vector store, do not make up any page numbers or pdf paths.

        eg: 
        // brief summary of the information only based on the tools we used
        // pdf_path 
        // page number
        // section (if it can be found in the text of the result)
        """
    ),
    ("human", "Research {research_topic}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

In [42]:
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=2,
    temperature=0.1,
)
agent_tools = [natural_language_query]
llm = llm.bind_tools(agent_tools)

In [43]:
def agent_scratchpad_formatter(x):
    return format_to_openai_tool_messages(x["intermediate_steps"])

agent = (
    {
        "research_topic": lambda x: x["research_topic"],
        "agent_scratchpad": agent_scratchpad_formatter,
    }
    | research_prompt
    | llm
    | OpenAIToolsAgentOutputParser()  # Parses the model response into tool invocations or a final answer; the AgentExecutor then runs any tool invocations.
)

In [44]:
agent_executor = AgentExecutor(agent=agent, tools=agent_tools, verbose=True)
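
Conceptually, the executor repeats a decide, act, observe cycle until the model returns a final answer. The following is a simplified plain-Python sketch of that idea, not LangChain's actual implementation: `run_agent_loop`, `decide`, and the iteration cap are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple, Union

ToolCall = Tuple[str, dict]      # (tool name, tool arguments)
Step = Tuple[ToolCall, object]   # (tool call, tool result): one scratchpad entry


def run_agent_loop(
    decide: Callable[[str, List[Step]], Union[ToolCall, str]],
    tools: Dict[str, Callable[..., object]],
    topic: str,
    max_iterations: int = 15,
) -> str:
    """Call `decide` until it returns a final answer (a string)."""
    steps: List[Step] = []  # plays the role of the agent scratchpad
    for _ in range(max_iterations):
        decision = decide(topic, steps)
        if isinstance(decision, str):
            return decision  # final answer: stop looping
        name, args = decision
        observation = tools[name](**args)  # execute the requested tool
        steps.append((decision, observation))  # feed the result back next turn
    return "Agent stopped: iteration limit reached."
```

In the notebook, the role of `decide` is played by the prompt, model, and output parser pipeline, and `steps` corresponds to the `intermediate_steps` passed to `agent_scratchpad_formatter`.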

In [45]:
research_agent_result = await agent_executor.ainvoke({"research_topic": "What is a straggler in map reduce?"})
research_agent_result



> Entering new AgentExecutor chain...

Invoking: `natural_language_query` with `{'input': 'straggler in map reduce'}`


[{'text': '3.6 Backup Tasks One of the common causes that lengthens the total time taken for a MapReduce operation is a straggler, that is, a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can arise for a whole host of reasons. For example, a machine with a bad disk may experience frequent cor- rectable errors that slow its read performance from 30MB/s to 1MB/s. The cluster scheduling system may have scheduled other tasks on the machine, causing it to execute the MapReduce code more slowly due to competition for CPU, memory, local disk, or network bandwidth. A recent problem we experienced was a bug in machine initialization code that caused processor caches to be disabled: computations on affected machines slowed down by over a factor of one hundred.'

{'research_topic': 'What is a straggler in map reduce?',
 'output': 'A straggler in MapReduce is a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation. Stragglers can occur due to various reasons, such as hardware issues like a bad disk causing slow read performance, or competition for resources like CPU, memory, or network bandwidth due to other tasks scheduled on the machine. To mitigate the impact of stragglers, a mechanism is used where the master schedules backup executions of the remaining in-progress tasks when a MapReduce operation is nearing completion. This approach significantly reduces the time to complete large MapReduce operations.\n\n- **Source**: /Users/faizahmed/Documents/SJSU/topics_in_db/Project/data-ingestion/data/mapreduce.pdf\n- **Page Number**: 4\n- **Section**: 3.6 Backup Tasks'}