# 📚 Agentic RAG System with ArXiv + Web Fallback

This project implements an **intelligent research assistant** that retrieves and synthesizes information using:
1. **ArXiv papers** as the primary knowledge source (**RAG approach**)
2. **Web search (Tavily API)** as a fallback mechanism
3. **LangGraph** for orchestrating the decision-making workflow

## 🎯 Purpose

The system is designed to provide high-quality, research-backed answers to technical and scientific questions by:
- Prioritizing academic and research papers from ArXiv for scientific queries
- Falling back to web search for recent developments or non-academic topics
- Maintaining conversation context for coherent multi-turn interactions
- Ensuring proper attribution and citations in responses

## 🔑 Prerequisites

To use this system, you'll need:

1. **OpenAI API Key**
   - Required for:
     - Text embeddings (for semantic search)
     - Response generation (GPT-4 Turbo)
     - Routing decisions (GPT-3.5 Turbo)
   - Get it from: [OpenAI Platform](https://platform.openai.com)

2. **Tavily API Key**
   - Required for:
     - Web search fallback functionality
     - Real-time information retrieval
     - Academic domain filtering
   - Get it from: [Tavily](https://app.tavily.com)

3. **Python Environment**
   - Python 3.8 or higher
   - Required packages (will be installed automatically):
     - langchain-community
     - langchain_chroma
     - langchain_core
     - langchain_openai
     - langchain_text_splitters
     - langgraph
     - tavily-python
     - openai
     - python-dotenv


## 🤖 Agentic Workflow Architecture

The user workflow is translated into an agentic system through the following components:

1. **State Management**
   - **Conversation State**: Tracks user queries, system responses, and context
   - **Search State**: Maintains information about current search results and sources
   - **Decision State**: Stores routing decisions and their rationale

2. **Agent Components**
   - **Router Agent**: Makes intelligent decisions about information sources
     - Analyzes query type and context
     - Determines optimal search strategy
     - Handles fallback mechanisms
   
   - **Search Agent**: Executes information retrieval
     - Manages ArXiv API interactions
     - Handles Tavily web search
     - Processes and filters results
   
   - **Synthesis Agent**: Combines and formats information
     - Merges multiple sources
     - Ensures proper attribution
     - Generates coherent responses

3. **Feedback Loop**
   - System learns from user interactions
   - Improves routing decisions over time
   - Adapts to user preferences and query patterns

## 📊 Data Requirements and Sources

The system requires and manages several types of data:

1. **Input Data**
   - **User Queries**: Natural language questions and follow-ups
   - **Conversation History**: Previous interactions for context
   - **User Preferences**: Optional settings for search behavior

2. **Knowledge Sources**
   - **ArXiv Papers**:
     - Source: ArXiv API
     - Format: PDF documents
     - Update Frequency: Daily
     - Coverage: Scientific and technical papers
   
   - **Web Content**:
     - Source: Tavily API
     - Format: Web pages and documents
     - Update Frequency: Real-time
     - Coverage: News, blogs, documentation, etc.

3. **Processed Data**
   - **Embeddings**: Vector representations of text
     - Generated using OpenAI's embedding model
     - Stored in vector database
   
   - **Chunks**: Processed text segments
     - Size: Optimized for semantic search
     - Metadata: Source, date, relevance score
   
   - **Citations**: Reference information
     - Paper titles, authors, URLs
     - Web page sources and dates

4. **Output Data**
   - **Responses**: Generated answers with citations
   - **Search Results**: Ranked and filtered information
   - **Conversation Logs**: Interaction history


In [1]:
# %% [code]
# Install required packages
! pip install -qU langchain langgraph pypdf chromadb tavily-python openai python-dotenv pyboxen

In [66]:
! pip install langchain-community langchain_chroma langchain_core langchain_openai langchain_text_splitters langgraph tavily-python openai python-dotenv



In [3]:
# %% [code]
# Import required libraries
import os  # Provides functions to interact with the operating system.
from pyboxen import boxen  # Used to display stylized boxes in the terminal for better CLI UI.
from getpass import getpass  # Allows secure password input without echoing.
from typing import TypedDict, List, Dict, Optional, Literal, Union, Annotated, cast  # Used for type annotations and static type checking.
from langchain_core.documents import Document  # Represents and structures text data in LangChain.
from langchain_core.output_parsers import StrOutputParser  # Parses raw LLM output into usable string format.
from langchain_community.document_loaders import PyPDFLoader  # Loads and extracts text from PDF documents.
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter  # Splits text into chunks using markdown headers or character limits.
from langchain_chroma import Chroma  # Provides integration with Chroma vector store for embedding storage and retrieval.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # Interfaces with OpenAI for embeddings and chat models.
from langchain_core.prompts import ChatPromptTemplate  # Manages prompt templates for chat-based interactions.
from langgraph.graph import StateGraph, END  # Helps define state-based logic flows for chat systems.
from tavily import TavilyClient  # Interfaces with Tavily for real-time web search.
from langchain.memory import ConversationBufferMemory  # Maintains memory of past conversation for context retention.


# API Key Submission

Please follow the instructions below:

1. **Provide the Tavily API Key**
2. **Provide the Open API Key**
3. **Press Enter** to proceed


In [4]:
# Set API keys
os.environ["TAVILY_API_KEY"] = getpass("Enter Tavily API Key (get from https://app.tavily.com): ")
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API Key: ")

## 2. Define State and System Architecture

This is important because Agents need context to take decisions and showcase "Agency", the state helps us define the information that the agent will require and also capture information through out the whole process.

We'll define our system's state and flow using **LangGraph**. The state will track our:
- **Input question**
- **Retrieved ArXiv results**
- **Web search results**
- **Final answer**
- **Conversation history for context**


In [82]:
# %% [code]
# Define our system state - this is what passes between nodes in our graph
class AgentState(TypedDict):
    """State definition for our agentic RAG system"""
    question: str  # User's current question
    arxiv_results: Optional[List[Document]]  # Results from ArXiv papers (if any)
    web_results: Optional[List[Dict]]  # Results from web search (if any)
    answer: str  # Final synthesized answer
    conversation_history: str  # Previous Q&A for context
    memory: any

## 3. Router Node Implementation

The **Router Node** is responsible for deciding whether to use **ArXiv papers** or **web search**.
- **First**, it tries to use **ArXiv papers** (our local knowledge source).
- **Falls back** to **web search** if needed.

This demonstrates **strategic decision-making capabilities**.


In [83]:
router_prompt = ChatPromptTemplate.from_template("""
You are a highly specialized research assistant with access to two information sources:
1. A collection of ArXiv research papers
2. A web search tool

Your task is to determine which source would be better to answer the user's question.
FIRST try to use ArXiv papers for scientific and academic questions.
ONLY use web search if:
- The question requires very recent information not likely in research papers
- The question is about general knowledge, news, or non-academic topics
- The question asks for information beyond what academic papers would contain

Consider the conversation history for context.

Question: {question}
Conversation History: {conversation_history}

Respond with ONLY ONE of these two options:
"arxiv" - if the question should be answered using research papers
"web" - if the question requires web search

Your decision should be a single word only (either "arxiv" or "web"). Do not include any explanation, reasoning, or additional text in your response.
""")

def router_node(state: AgentState) -> dict:
    """
    Determines whether to use ArXiv papers or web search based on the question.

    Args:
        state: Current state containing the question and conversation history

    Returns:
        Dict indicating which path to take next
    """
    # Use a lighter model for routing decisions
    llm = ChatOpenAI(model="gpt-4o-mini")

    # Create a chain that outputs just the decision text
    chain = router_prompt | llm

    # Invoke the chain with our question and history
    # Get the content of the AIMessage object instead of directly calling strip()
    decision = chain.invoke({
        "question": state["question"],
        "conversation_history": state["conversation_history"]
    }).content.strip().lower()

    print(f"Router decision: {decision}")

    # Return the next node to be called based on the decision
    if "web" in decision:
        return {"next": "web_search"}
    else:
        return {"next": "arxiv_retrieval"}

# ArXiv Processor Documentation

## Overview
The `ArXivProcessor` class is designed to handle processing ArXiv PDFs for retrieval-augmented generation (RAG) systems. It implements document-aware chunking strategies specifically optimized for scientific papers.

## Key Features
- **Two-step chunking strategy**:
 1. Markdown header splitting to preserve document structure
 2. Recursive character splitting for handling longer sections effectively
- **Confidence-based retrieval** with threshold filtering
- **Metadata preservation** from original PDFs

## Class Structure

### Constructor: `__init__()`
Initializes the processor with specialized document chunking strategies:
- `MarkdownHeaderTextSplitter` to maintain document section structure
- `RecursiveCharacterTextSplitter` for detailed content subdivision

### Methods

#### `load_and_process(pdf_urls: List[str])`
Processes ArXiv PDFs with document-aware chunking:
- Loads PDFs from provided URLs
- Converts content to markdown-style text with headers
- Applies two-stage chunking process
- Creates a vector store with OpenAI embeddings

#### `retrieve(question: str, confidence_threshold: float = 0.75, k: int = 5)`
Retrieves relevant chunks with confidence scoring:
- Performs similarity search based on user query
- Filters results by confidence threshold
- Returns only high-relevance document chunks

## Implementation Example
The documented code includes a sample implementation that loads and processes two ArXiv papers:
- Quantum computing paper: https://arxiv.org/pdf/2305.10343.pdf
- LLM research paper: https://arxiv.org/pdf/2303.04137.pdf

## Dependencies
- `PyPDFLoader` for PDF handling
- `MarkdownHeaderTextSplitter` and `RecursiveCharacterTextSplitter` for content chunking
- `OpenAIEmbeddings` for vector embeddings
- `Chroma` for vector storage

In [84]:
class ArXivProcessor:
    """
    Handles processing ArXiv PDFs for retrieval-augmented generation.
    """
    def __init__(self):
        """
        Initialize the processor with document-aware chunking strategies.

        The chunking strategy uses a two-step approach:
        1. Markdown header splitting preserves document structure and headers
        2. Recursive character splitting handles longer sections effectively
        """
        # Header splitter preserves section structure in scientific papers
        self.header_splitter = MarkdownHeaderTextSplitter(
            headers_to_split_on=[
                ("#", "Section"),           # Main sections
                ("##", "Subsection"),       # Subsections
                ("###", "Subsubsection")    # Sub-subsections
            ]
        )

        # Recursive splitter handles nested hierarchies and technical content
        # - Chunk size of 1000 balances context vs specificity
        # - Overlap of 200 ensures continuity between chunks
        # - Separators prioritize natural breaks in scientific text
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", "(?<=\. )", " ", ""]
        )

        # Will be initialized when documents are loaded
        self.vector_store = None

    def load_and_process(self, pdf_urls: List[str]):
        """
        Process ArXiv PDFs with document-aware chunking

        Args:
            pdf_urls: List of URLs to ArXiv PDFs
        """
        all_chunks = []

        # Process each PDF
        for url in pdf_urls:
            print(boxen(f"Loading PDF from {url}", title=">>> PDF Loading", color="blue", padding=1))
            loader = PyPDFLoader(url)
            pages = loader.load()

            # Process each page
            for page in pages:
                # Convert PDF content to markdown-style text with headers
                page_text = f"# {page.metadata['source']}\n## Page {page.metadata['page']}\n{page.page_content}"

                # First split by headers to maintain document structure
                header_chunks = self.header_splitter.split_text(page_text)

                # Then split large sections into smaller chunks
                small_chunks = self.text_splitter.split_documents(header_chunks)

                # Add to our collection
                all_chunks.extend(small_chunks)

        print(boxen(f"Created {len(all_chunks)} chunks from {len(pdf_urls)} PDFs", title=">>> Processing Complete", color="green", padding=1))

        # Create vector store with OpenAI embeddings
        self.vector_store = Chroma.from_documents(
            documents=all_chunks,
            embedding=OpenAIEmbeddings(),
            persist_directory="./arxiv_db"
        )

    def retrieve(self, question: str, confidence_threshold: float = 0.75, k: int = 5):
        """
        Retrieve relevant chunks with confidence scoring

        Args:
            question: User question to find relevant information for
            confidence_threshold: Minimum relevance score (0-1) to include a result
            k: Maximum number of results to return

        Returns:
            List of relevant document chunks that meet the threshold
        """
        if not self.vector_store:
            raise ValueError("No ArXiv documents loaded. Run load_and_process first.")

        # Perform similarity search with relevance scores
        results = self.vector_store.similarity_search_with_relevance_scores(
            question, k=k
        )

        # Filter by confidence threshold
        filtered_results = [doc for doc, score in results if score >= confidence_threshold]

        print(boxen(f"Found {len(filtered_results)} relevant chunks above threshold {confidence_threshold}", title=">>> Context", color="yellow", padding=1))

        return filtered_results

# Load sample ArXiv PDFs
print(boxen("Initializing ArXiv processor with sample papers...", title=">>> Initialization", color="cyan", padding=1))
arxiv_processor = ArXivProcessor()
arxiv_processor.load_and_process([
    "https://arxiv.org/pdf/2504.10412"   # LLM research paper
])
print(boxen("ArXiv processor initialized!", title=">>> Status", color="green", padding=1))

[36m╭─[0m[36m >>> Initialization [0m[36m──────────────────────────────────[0m[36m─╮[0m                                                         
[36m│[0m                                                        [36m│[0m                                                         
[36m│[0m   Initializing ArXiv processor with sample papers...   [36m│[0m                                                         
[36m│[0m                                                        [36m│[0m                                                         
[36m╰────────────────────────────────────────────────────────╯[0m                                                         



[34m╭─[0m[34m >>> PDF Loading [0m[34m────────────────────────────────────[0m[34m─╮[0m                                                          
[34m│[0m                                                       [34m│[0m                                                          
[34m│[0m   Loading PDF from https://arxiv.org/pdf/2504.10412   [34m│[0m                                                          
[34m│[0m                                                       [34m│[0m                                                          
[34m╰───────────────────────────────────────────────────────╯[0m                                                          



[32m╭─[0m[32m >>> Processing Complete [0m[32m────────[0m[32m─╮[0m                                                                              
[32m│[0m                                   [32m│[0m                                                                              
[32m│[0m   Created 35 chunks from 1 PDFs   [32m│[0m                                                                              
[32m│[0m                                   [32m│[0m                                                                              
[32m╰───────────────────────────────────╯[0m                                                                              



[32m╭─[0m[32m >>> Status [0m[32m────────────────────[0m[32m─╮[0m                                                                               
[32m│[0m                                  [32m│[0m                                                                               
[32m│[0m   ArXiv processor initialized!   [32m│[0m                                                                               
[32m│[0m                                  [32m│[0m                                                                               
[32m╰──────────────────────────────────╯[0m                                                                               



# ArXiv Retrieval Node Documentation

## Overview
The `arxiv_retrieval_node` function serves as a retrieval component in an agent-based system, fetching relevant scientific information from ArXiv papers based on user queries.

## Function Signature
`arxiv_retrieval_node(state: AgentState) -> dict`

## Parameters
- `state`: An AgentState object containing the current conversation state, including:
 - `question`: The user's query to search for in ArXiv papers

## Functionality
The function:
1. Extracts the user's question from the input state
2. Calls the `arxiv_processor.retrieve()` method to find relevant document chunks
3. Uses a reduced confidence threshold (0.5) compared to the default (0.75) to improve recall
4. Returns the retrieved documents for further processing

## Return Value
Returns a dictionary with:
- `arxiv_results`: A list of document chunks from ArXiv papers relevant to the user's question

## Integration Notes
- This function is designed to be used as a node in an agent workflow
- The reduced confidence threshold ensures more potential matches are returned, prioritizing recall over precision
- The retrieved documents can be used by subsequent nodes for answering the user's question

In [85]:
# %% [code]
def arxiv_retrieval_node(state: AgentState) -> dict:
    """
    Retrieves relevant information from ArXiv papers based on the question.

    Args:
        state: Current state containing the question

    Returns:
        Updated state with arxiv_results
    """
    # Retrieve relevant documents from ArXiv
    relevant_docs = arxiv_processor.retrieve(
        question=state["question"],
        confidence_threshold=0.5  # Adjusted threshold for better recall
    )

    # Check if we found enough relevant content
    return {"arxiv_results": relevant_docs}

## 5. Web Search Node Implementation

The **Web Search Node** uses the **Tavily API** to search the web when **ArXiv papers** don't have the answer.

- **Optimizes** the search query
- **Filters and processes** results
- **Ensures** proper attribution


In [86]:
web_searcher = TavilyClient()
def web_search_node(state: AgentState) -> dict:
    """
    Searches the web for information using the Tavily API.
    """
    # Include academic domains to improve search quality
    academic_domains = ["arxiv.org", "scholar.google.com", "researchgate.net", "edu"]

    # Get search results with answer
    search_response = web_searcher.search(
        query=state["question"],
        max_results=5,
        include_domains=academic_domains,
        search_depth="advanced",  # Use advanced search for better results
        include_answer=True  # Request direct answer
    )

    # If we have a direct answer, use it
    if search_response.get("answer"):
        return {
            "web_results": search_response.get("results", []),
            "direct_answer": search_response["answer"]
        }

    return {"web_results": search_response.get("results", [])}

# Function Documentation: `synthesize_answer_node`

## Overview
The `synthesize_answer_node` function is a key component in a LangChain-based conversational agent. It is responsible for generating a comprehensive answer based on either scientific research papers (from ArXiv) or web search results (via Tavily). The generated response is contextual, well-structured, and strictly grounded in the retrieved data.

---

## Purpose
To synthesize a high-quality, structured, and citation-backed answer from the information retrieved during the conversational flow — either from ArXiv research papers or real-time web search results.

---

## Inputs

- **state (AgentState)**:  
  A dictionary representing the current state of the agent, which includes:
  - `question`: The user's query.
  - `arxiv_results`: A list of research paper excerpts (if available).
  - `web_results`: A list of web search results (used when no ArXiv data is present).
  - `conversation_history`: Context from previous exchanges to maintain continuity.

---

## Logic Flow

1. **Source Determination**:  
   The function first checks whether ArXiv results are available. If so, it uses them; otherwise, it falls back to web search results.

2. **Prompt Construction**:  
   A custom `prompt_template` is built depending on the data source. Each template includes:
   - The original question.
   - Retrieved content (formatted accordingly).
   - Prior conversation context.
   - Explicit instructions to ensure grounded, factual, and well-structured responses.

3. **Model Invocation**:  
   - Uses `ChatOpenAI` (specifically `gpt-4o-mini`) for advanced reasoning and response generation.
   - Combines the prompt and model into a LangChain chain using `ChatPromptTemplate` and `StrOutputParser`.

4. **Response Handling**:  
   - If web results were used, the function appends a list of source URLs at the end of the response.
   - If ArXiv sources were used, inline citations in the format `(Author et al., Page X)` are expected.

---

## Output

- Returns a dictionary with a single key:  
  - `answer`: A fully formatted, cited response derived from either research papers or search results.

---

## Key Characteristics

- **Grounded Output**: The model is instructed not to hallucinate or invent facts.
- **Citations Included**: Adds credibility and traceability via inline citations or URL references.
- **Context-Aware**: Maintains conversation context to provide coherent multi-turn interactions.
- **Readable Format**: Uses markdown elements such as headers, bullet points, and bold text for readability.

---

In [87]:
def synthesize_answer_node(state: AgentState) -> dict:
    """
    Synthesizes a comprehensive answer from retrieved information.

    Args:
        state: Current state containing question and retrieved information

    Returns:
        Updated state with answer
    """
    # If we have a direct answer from web search, use it
    if state.get("direct_answer"):
        answer_content = state["direct_answer"]
        source_type = "Web Search Results"
    else:
        # Determine which source to use for synthesis
        if state["arxiv_results"] and len(state["arxiv_results"]) > 0:
            # Using ArXiv research papers
            sources = "\n\n".join([
                f"--- Document: {d.metadata.get('source', 'Unknown')} (Page {d.metadata.get('page', 'Unknown')}) ---\n{d.page_content}"
                for d in state["arxiv_results"]
            ])
            displayed_sources = sources
            source_type = "ArXiv Papers"

            prompt_template = """
            You are a knowledgeable research assistant specializing in mathematical theory and scientific literature analysis.

            Your goal is to generate clean, formatted responses to user questions based solely on the provided ArXiv sources.

            ---

            Question:
            {question}

            Relevant Extracts from ArXiv Papers:
            {sources}

            Conversation History:
            {conversation_history}

            ---

            Instructions for Synthesizing the Answer:

            1. Read the extracts thoroughly and understand the concepts.
            2. Answer the question comprehensively using only the provided context.
            3. Organize the response into the following markdown sections (if applicable):
              - Summary
              - Key Concepts
              - Theoretical Results
              - Implications / Applications
            4. Cite from the paper in the format: (Author et al., Page X). If page number is unknown, write: (Author et al.).
            6. Avoid repetition, excessive formal tone, or generic commentary. Be clear and concise.**
            7. If the provided text lacks enough detail to answer, state it clearly and suggest what additional info is needed.

            ---

            Now, write a well-structured, markdown-formatted answer to the question and it should be in a readable format as well.
            """
        else:
            # Using web search results
            sources = "\n\n".join([
                f"--- Source {i+1}: {res['title']} ---\n{res['content']}"
                for i, res in enumerate(state["web_results"] or [])
            ])
            displayed_sources = sources
            source_type = "Web Search Results"

            prompt_template = """
            You are a knowledgeable research assistant providing accurate information based on web search results.

            Question: {question}

            Here are relevant web search results:
            {sources}

            Conversation History:
            {conversation_history}

            Instructions:
            1. Synthesize a comprehensive answer using ONLY the information provided above.
            2. Cite sources using [1], [2], etc. corresponding to the source numbers above.
            3. If the search results don't contain sufficient information, acknowledge the limitations.
            4. DO NOT make up information not present in the sources.
            5. Include only facts supported by the sources.

            Your answer:
            """

        # Print retrieval information
        print(f"\n=== Retrieved chunks from {source_type} ===")
        print(displayed_sources)
        print("="*80)

        # Create the prompt
        synthesis_prompt = ChatPromptTemplate.from_template(prompt_template)

        # Use a more capable model for synthesis
        llm = ChatOpenAI(model="gpt-4o-mini")
        chain = synthesis_prompt | llm | StrOutputParser()

        # Generate the answer
        response = chain.invoke({
            "question": state["question"],
            "sources": sources,
            "conversation_history": state["conversation_history"]
        })

        # Add source citations for web results
        if state.get("web_results") and not state.get("arxiv_results"):
            answer_content = response

            # Add URL references at the end
            url_citations = "\n\nSources:\n" + "\n".join([
                f"[{i+1}] {res['url']}"
                for i, res in enumerate(state["web_results"] or [])
            ])

            answer_content += url_citations
        else:
            answer_content = response

    # Using markdown and plain text for better readability
    formatted_output = f"""
## Context
**Question:** {state["question"]}
**Source:** {source_type}

## Response
{answer_content}
"""

    return {"answer": formatted_output}

## 7. Conversation Memory Node

This node **manages conversation history** to provide context for **multi-turn interactions**.

- **Stores** previous Q&A
- **Updates** the state with the current interaction
- **Maintains** a sliding window of relevant history


In [88]:
# %% [code]
# Initialize conversation memory

def update_memory_node(state: AgentState) -> dict:
    """
    Updates the conversation memory with the current Q&A pair.

    Args:
        state: Current state with question and answer

    Returns:
        Updated state with new conversation_history
    """
    # Save the current interaction to memory
    memory = state['memory']
    memory.save_context(
        {"question": state["question"]},
        {"answer": state["answer"]}
    )

    # Return the updated state
    return {"conversation_history": memory.load_memory_variables({}).get("history", "")}


# Workflow State Graph Setup

## Overview
This section sets up the **LangGraph state machine** for managing the conversational agent’s workflow. It defines how user queries are processed step-by-step using modular nodes.

---

## Purpose
To create a graph-based control flow that determines how the agent processes input, performs retrieval, synthesizes responses, updates memory, and eventually ends the workflow.

---

## Key Components

### 1. **Workflow Initialization**
- A new `StateGraph` is initialized with the `AgentState` type, defining the structure of the workflow.

### 2. **Node Definitions**
The graph is composed of several functional nodes, each responsible for a specific task:
- **router**: Determines whether to fetch data from the web or ArXiv.
- **arxiv_retrieval**: Retrieves relevant research papers from ArXiv.
- **web_search**: Retrieves web results via Tavily.
- **synthesize**: Synthesizes a final answer from the retrieved information.
- **update_memory**: Stores the interaction context for future turns.

### 3. **Entry Point**
- The `router` node is set as the initial entry point for the graph, meaning every workflow starts with routing logic.

### 4. **Conditional Routing**
- A conditional edge is established from `router` based on the `"next"` field in the state:
  - If `"next"` is `"web_search"`, it routes to the `web_search` node.
  - If `"next"` is `"arxiv_retrieval"`, it routes to the `arxiv_retrieval` node.

### 5. **Workflow Sequence**
The following fixed transitions define the remainder of the workflow:
- From either `web_search` or `arxiv_retrieval` → go to `synthesize`
- From `synthesize` → go to `update_memory`
- From `update_memory` → reach `END` (completion of the flow)

In [89]:
# %% [code]
# Create the workflow state graph
workflow = StateGraph(AgentState)

# Add all nodes to the graph
workflow.add_node("router", router_node)
workflow.add_node("arxiv_retrieval", arxiv_retrieval_node)
workflow.add_node("web_search", web_search_node)
workflow.add_node("synthesize", synthesize_answer_node)
workflow.add_node("update_memory", update_memory_node)

# Set entry point
workflow.set_entry_point("router")

# Define conditional edges from router
workflow.add_conditional_edges(
    "router",
    lambda state: state["next"],
    {
        "web_search": "web_search",
        "arxiv_retrieval": "arxiv_retrieval"
    }
)

# Define rest of the edges
workflow.add_edge("arxiv_retrieval", "synthesize")
workflow.add_edge("web_search", "synthesize")
workflow.add_edge("synthesize", "update_memory")
workflow.add_edge("update_memory", END)

# Compile the graph
app = workflow.compile()

## 9. Testing the System

Let's **test our system** with different types of questions:

- **Questions answerable** from ArXiv papers
- **Questions requiring** web search
- **Follow-up questions** to test memory


In [90]:
memory = ConversationBufferMemory(return_messages=False, output_key="answer", input_key="question")
initial_state = {
    "question": None,
    "arxiv_results": None,
    "web_results": None,
    "answer": "",
    "conversation_history": memory.load_memory_variables({}).get("history", ""),
    "memory": memory
}


In [94]:

def ask(app, question: str, state: AgentState):
    """
    Ask a question to the agentic RAG system.

    Args:
        question: User's question

    Returns:
        The system's answer
    """
    # Initialize the state

    # Invoke the workflow
    state['question'] = question
    result = app.invoke(state)
    # Print the response details with pyboxen
    print(boxen(f"Question: {result['question']}", title=">>> Question", color="blue", padding=1))

    if result["arxiv_results"]:
        arxiv_count = len(result["arxiv_results"])
        print(boxen(f"Found {arxiv_count} ArXiv results", title=">>> ArXiv Results", color="magenta", padding=1))
    elif result["web_results"]:
        web_count = len(result["web_results"])
        print(boxen(f"Found {web_count} Web results", title=">>> Web Results", color="magenta", padding=1))
    else:
        print(boxen("No results found", title=">>> Results", color="red", padding=1))

    print(boxen(result["answer"], title=">>> Answer", color="green", padding=1))
    return result



In [95]:
# Test with a question about quantum computing (should use ArXiv)
updated_state = ask(app, "Explain Software Refactoring", initial_state)

Router decision: arxiv


[33m╭─[0m[33m >>> Context [0m[33m──────────────────────────────────[0m[33m─╮[0m                                                                
[33m│[0m                                                 [33m│[0m                                                                
[33m│[0m   Found 5 relevant chunks above threshold 0.5   [33m│[0m                                                                
[33m│[0m                                                 [33m│[0m                                                                
[33m╰─────────────────────────────────────────────────╯[0m                                                                


=== Retrieved chunks from ArXiv Papers ===
--- Document: Unknown (Page Unknown) ---
a scalable, AI-driven path to cleaner codebases, vital for
software engineering’s future.  
Index Terms— Graph Neural Networks, Code
Refactoring, Software Maintainability, Abstract Syntax
Trees, Machine Learning, Cyclomatic Complexity,

[34m╭─[0m[34m >>> Question [0m[34m────────────────────────────[0m[34m─╮[0m                                                                     
[34m│[0m                                            [34m│[0m                                                                     
[34m│[0m   Question: Explain Software Refactoring   [34m│[0m                                                                     
[34m│[0m                                            [34m│[0m                                                                     
[34m╰────────────────────────────────────────────╯[0m                                                                     



[35m╭─[0m[35m >>> ArXiv Results [0m[35m──────[0m[35m─╮[0m                                                                                      
[35m│[0m                           [35m│[0m                                                                                      
[35m│[0m   Found 5 ArXiv results   [35m│[0m                                                                                      
[35m│[0m                           [35m│[0m                                                                                      
[35m╰───────────────────────────╯[0m                                                                                      



[32m╭─[0m[32m >>> Answer [0m[32m───────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m   ## Context                                                                                                    [32m│[0m
[32m│[0m   **Question:** Explain Software Refactoring                                                                    [32m│[0m
[32m│[0m   **Source:** ArXiv Papers                                                                                      [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m   ## Response                    

In [96]:
# Test with a question about recent developments (should use web)
updated_state = ask(app, "What is Crew Ai ? and can it help me refactor", updated_state)

Router decision: web

=== Retrieved chunks from ArXiv Papers ===
--- Document: Unknown (Page Unknown) ---
a scalable, AI-driven path to cleaner codebases, vital for
software engineering’s future.  
Index Terms— Graph Neural Networks, Code
Refactoring, Software Maintainability, Abstract Syntax
Trees, Machine Learning, Cyclomatic Complexity, Code
Coupling, Software Engineering.
I. INTRODUCTION
Software refactoring—the art of tweaking code to make it
cleaner, more readable, and easier to maintain without changing
its behaviors at the heart of modern software engineering.
Picture a sprawling codebase: functions tangled in loops,
variables sprawling across modules, and complexity creeping
up like vines. Developers spend 30% of their time wrestling
with such messes, according to a 2023 GitHub survey [1]. The
stakes are high—poor maintainability spikes bugs by 25% and
slows feature rollouts by 40% [2]. Traditional tools like
SonarQube or Check style flag issues (e.g., methods with 20+
lines),

[34m╭─[0m[34m >>> Question [0m[34m─────────────────────────────────────────────[0m[34m─╮[0m                                                    
[34m│[0m                                                             [34m│[0m                                                    
[34m│[0m   Question: What is Crew Ai ? and can it help me refactor   [34m│[0m                                                    
[34m│[0m                                                             [34m│[0m                                                    
[34m╰─────────────────────────────────────────────────────────────╯[0m                                                    



[35m╭─[0m[35m >>> ArXiv Results [0m[35m──────[0m[35m─╮[0m                                                                                      
[35m│[0m                           [35m│[0m                                                                                      
[35m│[0m   Found 5 ArXiv results   [35m│[0m                                                                                      
[35m│[0m                           [35m│[0m                                                                                      
[35m╰───────────────────────────╯[0m                                                                                      



[32m╭─[0m[32m >>> Answer [0m[32m───────────────────────────────────────────────────────────────────────────────────────────────────[0m[32m─╮[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m   ## Context                                                                                                    [32m│[0m
[32m│[0m   **Question:** What is Crew Ai ? and can it help me refactor                                                   [32m│[0m
[32m│[0m   **Source:** ArXiv Papers                                                                                      [32m│[0m
[32m│[0m                                                                                                                 [32m│[0m
[32m│[0m   ## Response                    

## 🎓 Conclusion

The Agentic RAG System with ArXiv + Web Fallback represents a powerful approach to information retrieval and synthesis, combining the best of both academic and real-time knowledge sources. By intelligently routing queries and maintaining conversation context, it provides:

- **Comprehensive Answers**: Leveraging both academic papers and current web information
- **Proper Attribution**: Ensuring all sources are properly cited
- **Contextual Understanding**: Maintaining conversation history for coherent interactions
- **Flexible Knowledge Access**: Adapting to different types of queries and information needs

This system is particularly valuable for:
- Researchers seeking both theoretical foundations and practical applications
- Developers looking for up-to-date technical information
- Students and professionals needing comprehensive, well-sourced answers
- Anyone requiring a balance between academic rigor and current information

The modular architecture and use of LangGraph make it easy to extend and adapt the system for specific use cases or additional knowledge sources.