# Building an LLM Agent to Find Relevant Research Papers from Arxiv




In [None]:
!pip install arxiv==2.1.3 llama_index==0.12.3 llama-index-llms-mistralai==0.3.0 llama-index-embeddings-mistralai==0.3.0



In [None]:
from getpass import getpass
import requests
import sys
import arxiv
from llama_index.llms.mistralai import MistralAI
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
    load_index_from_storage,
    PromptTemplate,
    Settings,
)
from llama_index.core.tools import FunctionTool, QueryEngineTool
from llama_index.core.agent import ReActAgent


In [None]:
api_key = getpass("Type your API Key")

Type your API Key ········


In [None]:
llm = MistralAI(api_key=api_key, model='mistral-large-latest')

In [None]:
model_name = "mistral-embed"
embed_model = MistralAIEmbedding(model_name=model_name, api_key=api_key)


In [None]:
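def fetch_arxiv_papers(title: str, papers_count: int):
    """Return metadata for the `papers_count` most recently submitted arXiv papers matching `title`."""
    # Quote the title so arXiv treats it as a phrase; `all:` searches every field.
    search_query = f'all:"{title}"'
    search = arxiv.Search(
        query=search_query,
        max_results=papers_count,
        sort_by=arxiv.SortCriterion.SubmittedDate,
        sort_order=arxiv.SortOrder.Descending
    )

    papers = []
    client = arxiv.Client()

    # Keep the Search object and its results under separate names.
    results = client.results(search)

    for result in results: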
        paper_info = {
                'title': result.title,
                'authors': [author.name for author in result.authors],
                'summary': result.summary,
                'published': result.published,
                'journal_ref': result.journal_ref,
                'doi': result.doi,
                'primary_category': result.primary_category,
                'categories': result.categories,
                'pdf_url': result.pdf_url,
                'arxiv_url': result.entry_id
            }
        papers.append(paper_info)

    return papers

papers = fetch_arxiv_papers("Language Models", 10)

In [None]:
[[p['title']] for p in papers]

[['Generative Semantic Communication: Architectures, Technologies, and Applications'],
 ['Fast Prompt Alignment for Text-to-Image Generation'],
 ['Multimodal Latent Language Modeling with Next-Token Diffusion'],
 ['Synthetic Vision: Training Vision-Language Models to Understand Physics'],
 ['Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models'],
 ['Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning'],
 ['Competition and Diversity in Generative AI'],
 ['AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models'],
 ['Preference Discerning with LLM-Enhanced Generative Retrieval'],
 ['Empirical Measurements of AI Training Power Demand on a GPU-Accelerated Node']]
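
The `all:` prefix used in `fetch_arxiv_papers` matches the phrase in any field of a record. The sketch below shows one possible way to narrow the search. The helper name `build_arxiv_query` and its parameters are hypothetical and not part of the `arxiv` package; the field prefixes (`ti`, `abs`, `au`, `cat`) and the `AND` operator are taken from the arXiv API query syntax.

```python
def build_arxiv_query(phrase, field="all", category=None):
    """Build an arXiv API query string (hypothetical helper, not part of `arxiv`)."""
    allowed_fields = {"all", "ti", "abs", "au"}
    if field not in allowed_fields:
        raise ValueError(f"field must be one of {sorted(allowed_fields)}")

    # Double quotes group the words into a single phrase.
    query = f'{field}:"{phrase}"'

    # Optionally restrict the search to one subject category, e.g. cs.CL.
    if category:
        query += f" AND cat:{category}"
    return query


print(build_arxiv_query("Language Models"))
print(build_arxiv_query("Language Models", field="ti", category="cs.CL"))
```

The returned string could be passed as the `query` argument of `arxiv.Search` in place of `search_query`.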

In [None]:
def create_documents_from_papers(papers):
    documents = []
    for paper in papers:
        content = f"Title: {paper['title']}\n" \
                  f"Authors: {', '.join(paper['authors'])}\n" \
                  f"Summary: {paper['summary']}\n" \
                  f"Published: {paper['published']}\n" \
                  f"Journal Reference: {paper['journal_ref']}\n" \
                  f"DOI: {paper['doi']}\n" \
                  f"Primary Category: {paper['primary_category']}\n" \
                  f"Categories: {', '.join(paper['categories'])}\n" \
                  f"PDF URL: {paper['pdf_url']}\n" \
                  f"arXiv URL: {paper['arxiv_url']}\n"
        documents.append(Document(text=content))
    return documents



# Create documents for LlamaIndex
documents = create_documents_from_papers(papers)
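
Fields such as `journal_ref` and `doi` may be empty for a preprint (an assumption here, based on the `arxiv` package treating them as optional). In that case the f-string above writes the literal text `None` into the document. The following is a minimal sketch that skips empty fields instead. The function `paper_to_text`, the `FIELDS` list, and the sample dictionary are hypothetical.

```python
# (label, key) pairs in the order they should appear in the document text.
FIELDS = [
    ("Title", "title"),
    ("Authors", "authors"),
    ("Summary", "summary"),
    ("Published", "published"),
    ("Journal Reference", "journal_ref"),
    ("DOI", "doi"),
    ("Primary Category", "primary_category"),
    ("Categories", "categories"),
    ("PDF URL", "pdf_url"),
    ("arXiv URL", "arxiv_url"),
]


def paper_to_text(paper):
    """Render a paper dictionary as text, leaving out missing or empty fields."""
    lines = []
    for label, key in FIELDS:
        value = paper.get(key)
        if not value:
            continue  # skip None, empty strings and empty lists
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


# Made-up sample record, used only to illustrate the output shape.
sample = {
    "title": "Example Paper",
    "authors": ["A. Author", "B. Author"],
    "summary": "A short abstract.",
    "journal_ref": None,
    "doi": None,
    "categories": ["cs.CL", "cs.LG"],
}
print(paper_to_text(sample))
```

The resulting string could be passed to `Document(text=...)` exactly as `content` is in `create_documents_from_papers`.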

In [None]:
Settings.chunk_size = 1024
Settings.chunk_overlap = 50

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

In [None]:
index.storage_context.persist('index/')
storage_context = StorageContext.from_defaults(persist_dir='index/')
index = load_index_from_storage(storage_context, embed_model=embed_model)
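
The cell above persists the index and reloads it straight away, which mainly demonstrates the round trip. In a later session one would usually reload only when the directory already holds a saved index and rebuild otherwise. The sketch below shows that control flow with plain callables so it runs without any external service. The names `load_or_build`, `build_fn`, and `load_fn` are hypothetical; in this notebook they would wrap `VectorStoreIndex.from_documents` followed by `persist`, and `load_index_from_storage`, respectively.

```python
import os
import tempfile


def load_or_build(persist_dir, build_fn, load_fn):
    """Load from `persist_dir` if it has content, otherwise build and persist there."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    os.makedirs(persist_dir, exist_ok=True)
    return build_fn(persist_dir)


# Stand-ins for the real build and load steps, for demonstration only.
def demo_build(persist_dir):
    with open(os.path.join(persist_dir, "index.txt"), "w") as f:
        f.write("stand-in for a persisted index")
    return "built"


def demo_load(persist_dir):
    return "loaded"


demo_dir = os.path.join(tempfile.mkdtemp(), "index")
first = load_or_build(demo_dir, demo_build, demo_load)   # directory is empty: builds
second = load_or_build(demo_dir, demo_build, demo_load)  # directory has content: loads
print(first, second)
```

Skipping the rebuild also avoids re-embedding the documents, which here would mean another round of calls to the embedding API.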

In [None]:
query_engine = index.as_query_engine(llm=llm, similarity_top_k=5)

rag_tool = QueryEngineTool.from_defaults(
    query_engine,
    name="research_paper_query_engine_tool",
    description="A RAG engine with recent research papers.",
)

In [None]:
# PromptTemplate is already imported in the first import cell.
from IPython.display import Markdown, display
def display_prompt_dict(prompts_dict):
    for k, p in prompts_dict.items():
        text_md = f"**Prompt Key**: {k}" f"**Text:** "
        display(Markdown(text_md))
        print(p.get_template())
        display(Markdown(""))

prompts_dict = query_engine.get_prompts()
display_prompt_dict(prompts_dict)

**Prompt Key**: response_synthesizer:text_qa_template**Text:** 

Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 




**Prompt Key**: response_synthesizer:refine_template**Text:** 

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer: 




In [None]:
def download_pdf(pdf_url, output_file):
    """
    Downloads a PDF file from the given URL and saves it to the specified file.

    Args:
        pdf_url (str): The URL of the PDF file to download.
        output_file (str): The path and name of the file to save the PDF to.

    Returns:
        str: A message indicating success or the nature of an error.
    """
    try:
        # A timeout keeps the tool from hanging indefinitely on an unresponsive server.
        response = requests.get(pdf_url, timeout=30)
        response.raise_for_status()

        with open(output_file, "wb") as file:
            file.write(response.content)

        return f"PDF downloaded successfully and saved as '{output_file}'."

    except requests.exceptions.RequestException as e:
        return f"An error occurred: {e}"
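
`download_pdf` writes to whatever `output_file` the caller supplies; when the agent calls the tool, that path is produced by the LLM. The following is a minimal sketch for deriving a filesystem-safe name from a paper title before saving. The function `safe_pdf_name` and its `max_len` default are hypothetical.

```python
import re


def safe_pdf_name(title, max_len=150):
    """Turn a paper title into a conservative PDF file name."""
    # Replace every run of characters outside letters, digits, '.', '_' and '-'.
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    # Fall back to a generic name if nothing usable is left.
    stem = stem[:max_len] or "paper"
    return stem + ".pdf"


print(safe_pdf_name(
    "AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models"
))
```

Such a helper could be applied inside `download_pdf` to the name supplied by the caller, so that characters like `/` or `:` never reach `open`.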

In [None]:
download_pdf_tool = FunctionTool.from_defaults(
    download_pdf,
    name='download_pdf_file_tool',
    description='Python function that downloads a PDF file from a given URL and saves it to a given file path.'
)
fetch_arxiv_tool = FunctionTool.from_defaults(
    fetch_arxiv_papers,
    name='fetch_from_arxiv',
    description='Fetches metadata for the {papers_count} most recent papers on the topic {title} from arXiv.'
)


In [None]:
agent = ReActAgent.from_tools([download_pdf_tool, rag_tool, fetch_arxiv_tool], llm=llm, verbose=True)

In [None]:
q_template = (
    "I am interested in {topic}. \n"
    "Find papers in your knowledge database related to this topic; use the following template to query research_paper_query_engine_tool tool: 'Provide title, summary, authors and link to download for papers related to {topic}'. If there are not, could you fetch the recent one from arXiv? \n"
)

In [None]:
answer = agent.chat(q_template.format(topic="Audio-Language Models"))

> Running step 7ff11a0d-341a-4fd3-ba60-9769da4350a2. Step input: I am interested in Audio-Language Models. 
Find papers in your knowledge database related to this topic; use the following template to query research_paper_query_engine_tool tool: 'Provide title, summary, authors and link to download for papers related to Audio-Language Models'. If there are not, could you fetch the recent one from arXiv? 

Thought: The current language of the user is: English. I need to use a tool to help me answer the question.
Action: research_paper_query_engine_tool
Action Input: {'input': 'Provide title, summary, authors and link to download for papers related to Audio-Language Models'}
Observation: The title of the paper related to Audio-Language Models is "AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models." The authors are Mintong Kang, Chejian Xu, and Bo Li.

Here is a summary of the paper:
Recent advancements in large audio-language mod

In [None]:
Markdown(answer.response)

The title of the paper related to Audio-Language Models is "AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models." The authors are Mintong Kang, Chejian Xu, and Bo Li.

Here is a summary of the paper:
Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.

You can download the paper [here](http://arxiv.org/pdf/2412.08608v1).

In [None]:
answer = agent.chat("Download the papers, which you mentioned above")

> Running step e9f367d6-3bb8-4cd1-b71b-4ef96f189fa8. Step input: Download the papers, which you mentioned above
Thought: I need to use a tool to help me answer the question.
Action: download_pdf_file_tool
Action Input: {'pdf_url': 'http://arxiv.org/pdf/2412.08608v1', 'output_file': 'AdvWave_Stealthy_Adversarial_Jailbreak_Attack_against_Large_Audio-Language_Models.pdf'}
Observation: PDF downloaded successfully and saved as 'AdvWave_Stealthy_Adversarial_Jailbreak_Attack_against_Large_Audio-Language_Models.pdf'.
> Running step 3e3382c2-00f4-4f0d-a401-d23cb766a964. Step input: None
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: The paper "AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models" has been downloaded successfully and saved as 'AdvWave_Stealthy_Adversarial_Jailbreak_Attack_against_Large_Audio-Language_Models.pdf'.

In [None]:
Markdown(answer.response)

The paper "AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models" has been downloaded successfully and saved as 'AdvWave_Stealthy_Adversarial_Jailbreak_Attack_against_Large_Audio-Language_Models.pdf'.

In [None]:
answer = agent.chat(q_template.format(topic="Gaussian process"))

> Running step 2501b676-8090-4ed0-a6e2-a141e6ddcfec. Step input: I am interested in Gaussian process. 
Find papers in your knowledge database related to this topic; use the following template to query research_paper_query_engine_tool tool: 'Provide title, summary, authors and link to download for papers related to Gaussian process'. If there are not, could you fetch the recent one from arXiv? 

Thought: I need to use a tool to help me answer the question.
Action: research_paper_query_engine_tool
Action Input: {'input': 'Provide title, summary, authors and link to download for papers related to Gaussian process'}
Observation: I'm sorry, but there are no papers related to Gaussian process in the given context information.
> Running step adefe2fc-b885-4483-be4a-27f00faa1446. Step input: None
Thought: I need to use a tool to help me answer the question.
Action: fetch_from_arxiv
Action Input: {'title': 'Gaussian process', 'papers_count': 1}

In [None]:
Markdown(answer.response)

The title of the paper related to Gaussian process is "Improving Active Learning with a Bayesian Representation of Epistemic Uncertainty." The authors are Jake Thomas and Jeremie Houssineau.

Here is a summary of the paper:
A popular strategy for active learning is to specifically target a reduction in epistemic uncertainty, since aleatoric uncertainty is often considered as being intrinsic to the system of interest and therefore not reducible. Yet, distinguishing these two types of uncertainty remains challenging and there is no single strategy that consistently outperforms the others. We propose to use a particular combination of probability and possibility theories, with the aim of using the latter to specifically represent epistemic uncertainty, and we show how this combination leads to new active learning strategies that have desirable properties. In order to demonstrate the efficiency of these strategies in non-trivial settings, we introduce the notion of a possibilistic Gaussian process (GP) and consider GP-based multiclass and binary classification problems, for which the proposed methods display a strong performance for both simulated and real datasets.

You can download the paper [here](http://arxiv.org/pdf/2412.08225v1).