# **AI Research Assistant: Web-Based Information Retrieval and Summarization Agent**

Alhassane Samassekou

ITAI2376

May 7, 2025

# **Introduction**
This project implements an AI Research Assistant that helps users:
- Research topics and gather information from the web
- Summarize findings from multiple sources
- Organize information into structured reports
- Generate PDF reports with proper *citations*

# **The agent integrates:**
- Web search capabilities through SerpAPI
- Document processing for summarization
- PDF generation for formatted reports
- A user-friendly Gradio interface

# **Project Goals**

Create an AI agent that performs web-based research on user queries
Implement information retrieval and summarization from multiple sources
Provide organized, cited research findings to users
Build a simple but effective user interface for interaction

# **Core Features**

Web search using the SerpAPI integration
Text analysis and key information extraction
Research summarization with source attribution
User-friendly interface through Gradio

# **Installing Required Dependencies**

In [1]:
!pip install openai serpapi faiss-cpu langchain transformers nltk ipywidgets gradio fpdf tiktoken -q


# **Import Libraries & Set Up Environment Variables**

In [2]:
import os
import requests
import gradio as gr
from datetime import datetime
import re
from collections import Counter
from fpdf import FPDF
import nltk
nltk.download('punkt')
import traceback
import json

# Set SerpAPI key - replace with your own key if needed
SERPAPI_KEY = "fe2ca6b2a2dc5344ef141c367c982d65f4aecc34ebfd96fe85b963c0fd1e71c5"
os.environ['SERPAPI_API_KEY'] = SERPAPI_KEY

# Session Memory - for future enhancement of conversation history
SESSION_MEMORY = {}

print("✅ Environment setup complete!")

✅ Environment setup complete!


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Explanation:**
This initial cell sets up our research assistant project by installing all the necessary Python packages and importing the required libraries. The code establishes our working environment by:

Installing packages for API access (openai, serpapi), vector search (faiss-cpu), LLM frameworks (langchain), text processing (transformers, nltk), UI components (ipywidgets, gradio), and PDF generation (fpdf)
Setting up the SerpAPI key that will be used for web searches
Creating a session memory dictionary for potential future enhancements that could track conversation history

# **Debug Utilities**

In [3]:
def debug_api_response(response_text, limit=1000):
    """Helper function to debug API responses"""
    if not response_text:
        return "Empty response"
    try:
        # Try to parse as JSON for better formatting
        data = json.loads(response_text)
        return json.dumps(data, indent=2)[:limit] + "..." if len(json.dumps(data)) > limit else json.dumps(data, indent=2)
    except:
        # If it's not JSON, just return text
        return response_text[:limit] + "..." if len(response_text) > limit else response_text

# **Explanation:**
This utility function helps debug API responses by formatting and truncating them to make them more readable. The function:

Attempts to parse the response as JSON for better formatted output
Truncates long responses to a specified limit (default: 1000 characters) to prevent overwhelming output
Falls back to plain text formatting if the response isn't valid JSON

This function is particularly useful for monitoring and troubleshooting the SerpAPI responses during development, making it easier to identify issues with the search results.

# **Google Search Function via SerpAPI**

In [4]:
def search_google(query):
    """Enhanced SerpAPI search with better debugging"""
    # Mock results as absolute last resort
    MOCK_RESULTS = [
        ("Ethical Issues in AI Healthcare", "https://example.com/1", "AI in healthcare raises concerns about privacy, bias, and accountability."),
        ("AI Ethics in Medicine", "https://example.com/2", "Key ethical issues include data security and informed consent.")
    ]

    # Set up SerpAPI parameters
    params = {
        "q": query,
        "api_key": SERPAPI_KEY,
        "num": 10,
        "engine": "google"
    }
    print(f"[Debug] SerpAPI Request to: https://serpapi.com/search with query: {query}")

    try:
        # Make the API request with longer timeout
        response = requests.get("https://serpapi.com/search", params=params, timeout=15)
        print(f"[Debug] SerpAPI status code: {response.status_code}")

        # Handle HTTP errors
        if response.status_code != 200:
            print(f"[Error] SerpAPI HTTP error: {response.status_code}")
            print(f"Response text: {response.text}")
            return None, f"SerpAPI error: HTTP {response.status_code}"

        # Parse the JSON response
        try:
            data = response.json()
        except json.JSONDecodeError:
            print(f"[Error] Failed to parse JSON response: {response.text[:200]}...")
            return None, "Failed to parse SerpAPI JSON response"

        # Debug the response structure
        print(f"[Debug] SerpAPI response keys: {list(data.keys())}")

        # Check for error messages in the API response
        if "error" in data:
            print(f"[Error] SerpAPI error message: {data['error']}")
            return None, f"SerpAPI error: {data['error']}"

        # Get organic search results
        results = data.get("organic_results", [])
        if not results:
            print("[Warning] No organic_results found in response")
            print(f"Response preview: {debug_api_response(response.text)}")
            return None, "No organic search results found"

        print(f"[Success] Found {len(results)} organic results")

        # Process and clean the results
        cleaned = []
        for r in results:
            title = r.get("title", "No title")
            link = r.get("link", "")
            snippet = r.get("snippet", "")

            # Try alternative fields if snippet is missing
            if not snippet:
                snippet = r.get("about_this_result", {}).get("description", "")

            if link and snippet:  # Only include results with both link and snippet
                cleaned.append((title, link, snippet))

        if cleaned:
            print(f"[Success] Processed {len(cleaned)} valid results with snippets")
            # Return both the results and a success message
            return cleaned, "Success"
        else:
            print("[Warning] No results with valid snippets found")
            return None, "No results with valid snippets found"

    except requests.exceptions.Timeout:
        print("[Error] SerpAPI request timed out")
        return None, "SerpAPI request timed out"
    except requests.exceptions.ConnectionError:
        print("[Error] SerpAPI connection error")
        return None, "SerpAPI connection error"
    except Exception as e:
        print(f"[Error] Unexpected error: {str(e)}")
        traceback.print_exc()
        return None, f"Unexpected error: {str(e)}"

# **Explanation:**
This function is the core of our research assistant's ability to gather information from the web. It uses the SerpAPI service to perform Google searches and retrieve results. The function is designed with:

Robust error handling for various failure points (HTTP errors, JSON parsing, timeouts, connection errors)
Extensive logging to help identify issues during development
Data cleaning to extract only relevant information (title, link, snippet) from search results
Fallback mechanisms to try alternative fields if the standard snippet is missing
Quality control to ensure only results with both links and content snippets are included

# **Text Analysis Functions**

In [5]:
# Cell 4: Text Processing Functions

def simple_tokenize(text):
    """Simple sentence tokenization without relying on NLTK"""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s for s in sentences if s.strip()]

def simple_extract_key_sentences(text, num_sentences=3):
    """Extract key sentences using word frequency without NLTK"""
    sentences = simple_tokenize(text)
    if not sentences or len(sentences) <= num_sentences:
        return text

    # Common English stopwords
    stop_words = set(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
                      "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
                      'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
                      'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
                      'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
                      'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was',
                      'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do',
                      'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
                      'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
                      'about', 'against', 'between', 'into', 'through', 'during', 'before',
                      'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
                      'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once'])

    # Calculate word frequencies (excluding stopwords)
    words = re.findall(r'\b\w+\b', text.lower())
    word_freq = Counter([word for word in words if word not in stop_words and len(word) > 1])

    # Score sentences based on word frequencies
    sentence_scores = {}
    for i, sentence in enumerate(sentences):
        words_in_sentence = re.findall(r'\b\w+\b', sentence.lower())
        sentence_scores[i] = sum(word_freq.get(word, 0) for word in words_in_sentence)

    # Get indices of top sentences
    top_indices = sorted(sentence_scores.items(), key=lambda x: x[1], reverse=True)[:num_sentences]
    top_indices = sorted([idx for idx, _ in top_indices])

    # Return top sentences in original order
    return ' '.join(sentences[idx] for idx in top_indices)

# **Explanation:**
These functions handle text processing for our research assistant, enabling it to break down text into sentences and identify the most important sentences based on word frequency. The implementation:

Creates a custom sentence tokenizer using regular expressions instead of relying on NLTK

Implements an extractive summarization algorithm based on word frequency

Uses a predefined list of English stopwords to exclude common, low-information words

Scores sentences based on the frequency of meaningful words they contain

Returns the top N sentences (default: 3) in their original order to maintain text coherence

# **Summarization Function**

In [6]:
# Cell 5: Search Results Summarization

def summarize_search_results(search_results, query):
    """Create a summary from search results"""
    if not search_results:
        return "No relevant information found."

    # Create sections based on query terms
    summary_parts = []

    # Add introduction
    summary_parts.append(f"Based on the search results for '{query}', here's a summary of the key information:")
    summary_parts.append("")

    # Process each search result
    for i, (title, url, snippet) in enumerate(search_results[:5]):  # Limit to top 5 results
        # Extract key information
        key_info = simple_extract_key_sentences(snippet)

        # Add citation
        result_summary = f"According to {title} [Source {i+1}], {key_info}"
        summary_parts.append(result_summary)

    # Add conclusion
    summary_parts.append("")
    summary_parts.append("This summary is based on the available search results and may not represent a comprehensive analysis of the topic.")

    return "\n".join(summary_parts)

# **Explanation:**
This function transforms raw search results into a coherent, readable summary with proper citations. The function:

Takes a list of search results (from the search_google function) and the original query

Creates an organized structure with an introduction, body, and conclusion

Processes each search result (limiting to the top 5) to extract key information

# **PDF Report Generation**


In [7]:
def generate_pdf(query, summary, sources):
    """
    Generate a PDF report of the research results

    Args:
        query (str): The original research question
        summary (str): The summary text with citations
        sources (list): List of source URLs

    Returns:
        str: Path to the generated PDF file
    """
    # Create PDF object
    pdf = FPDF()
    pdf.add_page()

    # Set title and metadata
    pdf.set_title(f"Research: {query}")
    pdf.set_author("AI Research Assistant")

    # Add header
    pdf.set_font("Arial", "B", 16)
    pdf.cell(0, 10, "AI Research Assistant Report", ln=True, align="C")
    pdf.ln(5)

    # Add query
    pdf.set_font("Arial", "B", 14)
    pdf.cell(0, 10, "Research Question:", ln=True)
    pdf.set_font("Arial", "", 12)
    pdf.multi_cell(0, 10, query)
    pdf.ln(5)

    # Add date
    pdf.set_font("Arial", "I", 10)
    current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    pdf.cell(0, 6, f"Report generated on: {current_date}", ln=True)
    pdf.ln(10)

    # Add summary
    pdf.set_font("Arial", "B", 14)
    pdf.cell(0, 10, "Research Summary:", ln=True)
    pdf.set_font("Arial", "", 12)

    # Process the summary by paragraph
    paragraphs = summary.split('\n')
    for para in paragraphs:
        if para.strip():  # Skip empty lines
            pdf.multi_cell(0, 10, para)
            pdf.ln(5)

    # Add sources
    if sources:
        pdf.add_page()
        pdf.set_font("Arial", "B", 14)
        pdf.cell(0, 10, "Sources:", ln=True)
        pdf.set_font("Arial", "", 12)

        # Format sources as a list
        for i, source in enumerate(sources):
            if isinstance(source, str) and source.startswith('['):
                pdf.multi_cell(0, 8, source)
            else:
                pdf.multi_cell(0, 8, f"[{i+1}] {source}")

    # Set PDF output filename
    filename = f"research_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.pdf"
    output_path = filename

    # Save the PDF
    pdf.output(output_path)

    return output_path

# **Explanation:**
This function creates a professional-looking PDF report from the research results, providing users with a downloadable document they can save or share. The function:

Uses the FPDF library to create a structured report document

Includes metadata like the research question, date, and time of generation

Formats the summary text with appropriate paragraph breaks

Creates a dedicated sources section with numbered references

Generates a unique filename based on the current date and time

Returns the file path to the generated PDF

# **Main Research Agent Function**

In [8]:
# Cell 7: Main Research Agent Function

def research_agent_with_pdf(query, session_id="default-session"):
    """Main research agent function with PDF generation"""
    if not query.strip():
        return "Please enter a research question.", [], None

    print(f"[Step 1] Processing query: {query}")

    print("[Step 2] Searching with SerpAPI...")
    search_results, status_message = search_google(query)

    # Handle search failures
    if search_results is None:
        return f"❌ Search failed: {status_message}\n\nPlease try again with a different query or check your SerpAPI key.", [], None

    print("[Step 3] Summarizing results...")

    # Create summary
    try:
        summary = summarize_search_results(search_results, query)
    except Exception as e:
        print(f"[Error] Summarization failed: {str(e)}")
        traceback.print_exc()
        return f" Summarization error: {str(e)}", [], None

    # Create numbered source links
    display_links = [f"[{i+1}] {result[1]}" for i, result in enumerate(search_results)]

    # Generate PDF
    try:
        print("[Step 4] Generating PDF report...")
        pdf_path = generate_pdf(query, summary, [f"{result[0]}: {result[1]}" for result in search_results])
        print(f"PDF generated: {pdf_path}")
    except Exception as e:
        print(f"[Error] PDF generation failed: {str(e)}")
        traceback.print_exc()
        pdf_path = None

    return summary, display_links, pdf_path

Explanation:
This is the main orchestration function for our research assistant, coordinating the entire process from query to final outputs. The function:

Validates the input query to ensure it's not empty
Follows a clear, step-by-step process with helpful logging:

Step 1: Process the input query
Step 2: Search for information using SerpAPI
Step 3: Summarize the search results
Step 4: Generate a PDF report


Handles errors gracefully at each step, providing informative error messages
Creates formatted source links for easy reference
Returns three components: the summary text, a list of source links, and the path to the generated PDF

This function ties together all the components we've built into a complete research assistant workflow, handling the entire process from the user's query to the delivery of organized research results.

# **API Testing Function**

In [9]:
# Cell 8: SerpAPI Testing Function

def test_serpapi():
    """Test function to validate SerpAPI functionality"""
    test_query = "climate change impacts"
    print(f"Testing SerpAPI with query: '{test_query}'")

    results, message = search_google(test_query)

    if results:
        print(f"✅ SerpAPI test successful: Found {len(results)} results")
        print("Sample result:")
        print(f"Title: {results[0][0]}")
        print(f"URL: {results[0][1]}")
        print(f"Snippet: {results[0][2][:100]}...")
        return "SerpAPI working correctly"
    else:
        print(f"❌ SerpAPI test failed: {message}")
        return f"SerpAPI test failed: {message}"

Explanation:
This utility function tests the SerpAPI connection to ensure that the research assistant can access web search functionality. The function:

Runs a test query ("climate change impacts") to verify that SerpAPI is working

Reports detailed results of the test, including sample output

Returns a simple status message that can be displayed to the user

This testing function is important because the entire research assistant depends on SerpAPI's functionality. Running this test early helps identify any API issues before the user tries to use the system.

# **Gradio User Interface**

In [10]:
# Gradio Interface with PDF Download
with gr.Blocks() as demo:
    gr.Markdown("## 🧠 AI Research Assistant")
    gr.Markdown("Enter a research question to get a summary with citations from web search results.")

    with gr.Row():
        test_button = gr.Button("Test SerpAPI Connection")
        api_status = gr.Textbox(label="API Status", value="Click to test SerpAPI")

    # Input area
    user_input = gr.Textbox(placeholder="Ask a research question...", label="Your Question")

    # Output area
    response_output = gr.Textbox(label="Summary with Citations", lines=10)
    links_output = gr.Textbox(label="Sources", lines=6)

    # PDF Download components
    pdf_status = gr.Textbox(label="PDF Status", visible=True)
    pdf_output = gr.File(label="Download Research Report", visible=True)

    # Add a download button
    download_button = gr.Button("Generate PDF Report")

    # Track the current results for PDF generation
    current_query = gr.State("")
    current_summary = gr.State("")
    current_sources = gr.State([])

    def handle_input(query):
        try:
            summary, links, pdf_path = research_agent_with_pdf(query)

            # Store current results for later PDF generation
            sources_list = links if isinstance(links, list) else links.split("\n")

            if pdf_path and os.path.exists(pdf_path):
                pdf_status_msg = "✅ PDF report generated successfully!"
                return summary, "\n".join(sources_list) if sources_list else "(No sources available)", query, summary, sources_list, pdf_status_msg, pdf_path
            else:
                pdf_status_msg = "⚠️ PDF generation failed. You can try the 'Generate PDF Report' button again."
                return summary, "\n".join(sources_list) if sources_list else "(No sources available)", query, summary, sources_list, pdf_status_msg, None

        except Exception as e:
            print(f"[Error] handle_input failed: {str(e)}")
            traceback.print_exc()
            return f"❌ Error: {str(e)}", "(Source loading failed)", "", "", [], "❌ PDF generation failed due to error in research process", None

    def generate_pdf_report(query, summary, sources):
        """Handler for the PDF download button"""
        if not query or not summary:
            return "Please perform a search first before generating a PDF.", None

        try:
            sources_list = sources if isinstance(sources, list) else sources.split("\n")
            pdf_path = generate_pdf(query, summary, sources_list)

            if os.path.exists(pdf_path):
                return "✅ PDF report regenerated successfully!", pdf_path
            else:
                return " Failed to generate PDF report.", None
        except Exception as e:
            print(f"[Error] PDF generation failed: {str(e)}")
            traceback.print_exc()
            return f" PDF generation error: {str(e)}", None

    # Connect test button
    test_button.click(fn=test_serpapi, outputs=api_status)

    # Connect main input
    user_input.submit(
        fn=handle_input,
        inputs=user_input,
        outputs=[response_output, links_output, current_query, current_summary, current_sources, pdf_status, pdf_output]
    )

    # Connect PDF download button
    download_button.click(
        fn=generate_pdf_report,
        inputs=[current_query, current_summary, current_sources],
        outputs=[pdf_status, pdf_output]
    )

Explanation:
This cell creates a user-friendly interface using Gradio, making our research assistant accessible through a web interface. The UI includes:

A clear title and instructions for the user
A test button to check the SerpAPI connection before use
An input field for the research question
Output areas for displaying the summary and source links
Components for handling PDF generation and download
State management to keep track of results for PDF regeneration
Event handlers for:

Testing the API connection
Processing user input and displaying results
Generating/regenerating PDF reports on demand

# **Testing and Launch**

In [11]:
# Test SerpAPI before launching
print("\n===== TESTING SERPAPI CONNECTION =====")
test_result = test_serpapi()
print(f"Test result: {test_result}")
print("======================================\n")

# Launch the interface
demo.launch(share=True)


===== TESTING SERPAPI CONNECTION =====
Testing SerpAPI with query: 'climate change impacts'
[Debug] SerpAPI Request to: https://serpapi.com/search with query: climate change impacts
[Debug] SerpAPI status code: 200
[Debug] SerpAPI response keys: ['search_metadata', 'search_parameters', 'search_information', 'knowledge_graph', 'inline_images', 'inline_videos', 'related_questions', 'organic_results', 'top_stories', 'top_stories_link', 'top_stories_serpapi_link', 'related_searches', 'pagination', 'serpapi_pagination']
[Success] Found 7 organic results
[Success] Processed 7 valid results with snippets
✅ SerpAPI test successful: Found 7 results
Sample result:
Title: Climate change impacts
URL: https://www.noaa.gov/education/resource-collections/climate/climate-change-impacts
Snippet: Climate change affects the environment in many different ways, including rising temperatures, sea le...
Test result: SerpAPI working correctly

Colab notebook detected. To show errors in colab notebook, set de



# **Explanation:**
The final cell runs a preliminary test of the SerpAPI connection and then launches the Gradio interface, making the research assistant accessible through a web browser. This cell:

Runs a test of the SerpAPI connection to verify functionality before launching
Displays the test results in the console
Launches the Gradio interface with share=True, which creates a public URL that can be accessed from any device
Makes the research assistant available for immediate use

By testing before launch, we can identify any API issues early and ensure the system is ready for use. The share=True parameter creates a temporary public URL, making it easy to share the research assistant with others or access it from different devices.

# **How to Use This Research Assistant**

Setup: Run all cells in order to install dependencies, set up the environment, and launch the interface.
Test Connection: Click the "Test SerpAPI Connection" button to verify that web search functionality is working.
Research: Enter your research question in the input field and press Enter.
Review: Read the summary and examine the sources provided.
Download: Use the "Generate PDF Report" button to download a formatted PDF of your research results.

This AI Research Assistant fulfills the requirements of Option 1 in the assignment by providing:

Information retrieval and summarization from multiple sources
Organization of findings into structured reports
Citation management and reference tracking

The implementation includes all required technical components:

Agent architecture with input processing, reasoning, and output generation
Tool integration (SerpAPI for web search, FPDF for document processing)
Feedback mechanisms through the UI
Safety measures with appropriate error handling and input validation

# **Conclusion**
This AI Research Assistant Agent demonstrates the implementation of a functional AI research tool that leverages web search capabilities to gather, analyze, and present information on user queries. The agent follows a ReAct (Reasoning and Acting) pattern by processing the user's input, retrieving relevant information, reasoning about the results, and then acting to produce a coherent summary.
The agent successfully integrates external tools (SerpAPI for web search and text analysis for summarization), implements error handling for failed searches or summarization attempts, and provides a simple but effective user interface.
Future improvements could include:

Enhanced source credibility evaluation
User feedback mechanisms to improve results
Memory of past searches for context awareness
More sophisticated text analysis and summarization techniques

# **References & Citations**

SerpAPI Documentation: https://serpapi.com/docs

Gradio Documentation: https://gradio.app/docs