# AI Research Assistant: Cross-Document Knowledge Synthesis

## Project Overview
This project implements a Generative AI based research assistant designed to help students and
researchers work more effectively with academic materials such as lecture slides and research
papers in PDF format.

Unlike traditional PDF question answering systems, this assistant focuses on cross document
retrieval and knowledge synthesis. It retrieves relevant information from multiple documents
and generates unified, coherent answers grounded strictly in the uploaded sources.

The system is implemented and demonstrated using a Jupyter Notebook, allowing full
transparency of intermediate steps such as document chunking, retrieval results, and synthesized
answers.

---

## User Interface

**Selected Interface:** Jupyter Notebook

The Jupyter Notebook interface fits the workflow of students and researchers who frequently use
notebooks for learning, experimentation, and analysis. It enables step by step inspection of the
retrieval, synthesis, and generation processes, making the system both educational and practical.
This design supports exploratory research, debugging, and reproducibility.

---

## Group Information

**Course:** Generative AI  
**Group Number:** Group 10  

**Group Members:**
- Hamza Rashid
- Jaleel Usman  
- Raja Wajahat Ali  
- Maqsood Asim  
- Zai Zohaib Sultan Yousuf  

---

## Notebook Structure

1. Load and preprocess academic PDF documents  
2. Chunk documents for fine grained retrieval  
3. Generate embeddings using TF IDF  
4. Perform cross document semantic retrieval  
5. Synthesize knowledge across multiple sources  
6. Answer user questions using a local LLM  
7. Provide an interactive notebook based interface  

This notebook represents the final prototype submitted as part of the group project.


## Environment Setup and Libraries

This cell installs and imports all required libraries used throughout the notebook.
All dependencies are declared upfront to ensure reproducibility and transparency.

The project relies on lightweight PDF parsing, classical text retrieval techniques,
and a local large language model for answer generation.


In [4]:
# Core libraries
import os
import numpy as np

# PDF processing
from pypdf import PdfReader

# Text processing and retrieval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity



## Loading Academic PDF Documents

### Project Plan Reference: 5.1.2 Data

In this step, we load multiple academic PDF documents that form the system‚Äôs
knowledge base. PDFs are the primary knowledge artifacts used by students and
researchers, such as lecture slides and research papers.

Each document is stored together with its source filename to preserve minimal
context for later analysis and evaluation.


In [5]:
documents = []
pdf_folder = "pdfs"

for file in os.listdir(pdf_folder):
    if file.endswith(".pdf"):
        reader = PdfReader(os.path.join(pdf_folder, file))
        full_text = ""
        for page in reader.pages:
            text = page.extract_text()
            if text:
                full_text += text + "\n"

        documents.append({
            "content": full_text,
            "source": file
        })

print(f"‚úÖ Loaded {len(documents)} PDF documents.")


Ignoring wrong pointing object 9 0 (offset 0)
Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 13 0 (offset 0)
Ignoring wrong pointing object 29 0 (offset 0)
Ignoring wrong pointing object 53 0 (offset 0)
Ignoring wrong pointing object 95 0 (offset 0)
Ignoring wrong pointing object 128 0 (offset 0)
Ignoring wrong pointing object 130 0 (offset 0)
Ignoring wrong pointing object 166 0 (offset 0)


‚úÖ Loaded 3 PDF documents.


## Chunking Documents

### Project Plan Reference: 5.1.2 Data ‚Äì Granularity

Academic PDFs are long documents containing multiple concepts. To enable fine-grained
semantic retrieval, each document is split into smaller overlapping chunks.

Chunking improves retrieval accuracy and enables cross-document comparison at the
concept level rather than entire documents.


In [6]:
def chunk_text(text, chunk_size=800, overlap=200):
    chunks = []
    start = 0
    text_length = len(text)

    while start < text_length:
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap

    return chunks


# Create chunks from all documents
chunks = []

for doc in documents:
    text_chunks = chunk_text(doc["content"])
    for chunk in text_chunks:
        chunks.append({
            "page_content": chunk,
            "metadata": {"source": doc["source"]}
        })

print(f"‚úÖ Created {len(chunks)} text chunks from PDFs.")


‚úÖ Created 86 text chunks from PDFs.


## Embedding Documents for Semantic Retrieval (Offline)

### Project Plan Reference: 5.1.4 Solution ‚Äì Retrieve

Due to API quota limitations, we implement an **offline embedding approach**
using TF-IDF vectorization. While simpler than neural embeddings, TF-IDF provides
a strong baseline for semantic retrieval and ensures full reproducibility without
external dependencies.
.
 stages.


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Prepare texts
texts = [c["page_content"] for c in chunks]

# Create TF-IDF embeddings
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words="english"
)

X = vectorizer.fit_transform(texts).toarray()

print("‚úÖ TF-IDF embeddings created successfully.")
print(f"üìÑ Embedded chunks: {X.shape[0]}")


‚úÖ TF-IDF embeddings created successfully.
üìÑ Embedded chunks: 86


## Cross-Document Retrieval

### Project Plan Reference: 5.1.3 The Problem ‚Äì Retrieval & Synthesis

In this step, the system retrieves relevant passages related to a research query
from **multiple documents** using cosine similarity. This demonstrates that the
system supports cross-document retrieval rather than single-document lookup.


In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Define a research query
query = "attention mechanisms in transformer models"

# Embed the query using the same TF-IDF vectorizer
query_vec = vectorizer.transform([query]).toarray()

# Compute cosine similarity between query and all chunks
similarity_scores = cosine_similarity(query_vec, X)[0]

# Retrieve top-k most relevant chunks
k = 5
top_indices = np.argsort(similarity_scores)[-k:][::-1]

print("üîç Retrieved passages from multiple documents:\n")

for rank, idx in enumerate(top_indices, start=1):
    source = chunks[idx]["metadata"]["source"]
    preview = chunks[idx]["page_content"][:300].replace("\n", " ")
    print(f"{rank}. Source: {source}")
    print(f"   {preview}...\n")


üîç Retrieved passages from multiple documents:

1. Source: 2025.10.29 - Transformer.pdf
   k at the  encoder and decoder  blocks.  #2A6495 The attention mechanismThe cornerstone of the transformer's ability to capture context. ùê¥ùë°ùë°ùëíùëõùë°ùëñùëúùëõùëÑ,ùêæ,ùëâ=ùë†ùëúùëìùë°ùëöùëéùë•ùëÑùêæùëá ùëëùëò ùëâ #2A6495 dragon #2A6495 The dragon #2A6495 The cute dragon #2A6495 The dragoncute green  #2A6495 Attention Mechanism An example of the...

2. Source: Applied GenAI II.pdf
   ion Reflect Analyze thinking patterns and knowledge gaps Accessing LLMs HuggingFace  Types of Pre-Trained Models 1 Foundation Models (= Base Models) Usually trained for next token prediction / text completion 2 Fine-Tuned Models Specialised for specific tasks: Instruct models follow instructions on ...

3. Source: 2025.10.29 - Transformer.pdf
   uch  attention tokens should pay to each other. Scale the dot product to avoid ensure stable  gradient flow. Create attention weights using the 

## Knowledge Synthesis

In this step, the system synthesizes information retrieved from multiple documents
into a coherent overview. Instead of presenting isolated text passages, related
ideas from different sources are combined to help users understand the topic at a
higher level of abstraction.


In [13]:
def synthesize_passages(indices, chunks, max_chars=1200):
    synthesis = []
    used_sources = set()

    for idx in indices:
        source = chunks[idx]["metadata"]["source"]
        text = chunks[idx]["page_content"].replace("\n", " ")

        if source not in used_sources:
            synthesis.append(f"\nSource: {source}")
            used_sources.add(source)

        synthesis.append(f"- {text[:350]}")

        if sum(len(s) for s in synthesis) > max_chars:
            break

    return "\n".join(synthesis)


# Use indices from Step 5
synthesis_output = synthesize_passages(top_indices, chunks)

print("üß† Synthesized Knowledge:")
print(synthesis_output)


üß† Synthesized Knowledge:

Source: 2025.10.29 - Transformer.pdf
- k at the  encoder and decoder  blocks.  #2A6495 The attention mechanismThe cornerstone of the transformer's ability to capture context. ùê¥ùë°ùë°ùëíùëõùë°ùëñùëúùëõùëÑ,ùêæ,ùëâ=ùë†ùëúùëìùë°ùëöùëéùë•ùëÑùêæùëá ùëëùëò ùëâ #2A6495 dragon #2A6495 The dragon #2A6495 The cute dragon #2A6495 The dragoncute green  #2A6495 Attention Mechanism An example of the exchange of attention in a group of people eager 

Source: Applied GenAI II.pdf
- ion Reflect Analyze thinking patterns and knowledge gaps Accessing LLMs HuggingFace  Types of Pre-Trained Models 1 Foundation Models (= Base Models) Usually trained for next token prediction / text completion 2 Fine-Tuned Models Specialised for specific tasks: Instruct models follow instructions on a single prompt often work well in conversations C
- uch  attention tokens should pay to each other. Scale the dot product to avoid ensure stable  gradient flow. Create atten

## LLM-based Question Answering (Jupyter Notebook Interface)

In this step, the Jupyter Notebook acts as the interactive interface where users
can ask natural-language questions about uploaded PDF documents. Retrieved
passages are passed as context to a locally hosted large language model (LLaMA 3
via Ollama), which generates answers grounded in the document content.
nteractively.


In [9]:
import subprocess

OLLAMA_PATH = r"C:\Users\PC\AppData\Local\Programs\Ollama\ollama.exe"

def ask_pdf_llm(question, top_indices, chunks):
    """
    Ask a question about the PDFs using a local LLM.
    The Jupyter Notebook serves as the user interface.
    """

    # Build context from retrieved chunks
    context = "\n".join(
        chunks[idx]["page_content"][:300].replace("\n", " ")
        for idx in top_indices
    )

    prompt = f"""
    You are a helpful academic assistant.

    Answer the question clearly and concisely using ONLY the context below.
    Do not repeat sentences.

    Context:
    {context}

    Question:
    {question}

    Answer:
    """

    # ‚úÖ FORCE UTF-8 ENCODING (FIXES Windows error)
    result = subprocess.run(
        [OLLAMA_PATH, "run", "llama3"],
        input=prompt.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )

    return result.stdout.decode("utf-8", errors="ignore")


# ===============================
# Notebook-based User Interface
# ===============================

user_question = "What is attention in transformers?"

answer = ask_pdf_llm(user_question, top_indices, chunks)

print("üß† Answer from LLM:\n")
print(answer)


üß† Answer from LLM:

Attention in Transformers refers to the ability of tokens within an input sequence to "pay attention" to each other, determining how much influence each token should have on the representation of every other token. This is achieved by computing attention weights using the softmax function and multiplying them with a Value matrix, allowing for contextual understanding and capturing long-range dependencies in the input sequence.




## Interactive Question Answering Loop

In this step, the system provides an interactive command-line style interface within the
Jupyter Notebook. Users can repeatedly ask natural language questions about the uploaded
PDF documents.

For each question:
- The system retrieves relevant text passages from the document chunks.
- These passages are passed as context to the local Large Language Model.
- The model generates an answer strictly grounded in the document content.

The interaction continues until the user types **"exit"**, allowing flexible and iterative
exploration of the document collection.


In [None]:
def ask_questions():
    while True:
        question = input("\n‚ùì Ask a question (or type 'exit'): ")
        if question.lower() == "exit":
            print("üëã Session ended.")
            break

        answer = ask_pdf_llm(question, top_indices, chunks)
        print("\nüß† Answer:\n")
        print(answer)

ask_questions()



‚ùì Ask a question (or type 'exit'):  exist



üß† Answer:

The attention mechanism in the transformer model is designed to capture context by determining how much each token should influence the representation of every other token within the input sequence. This is achieved through the computation of attention weights, which are calculated using the softmax function and dot product of queries, keys, and values. The goal is to scale the dot product to avoid unstable gradient flow and create a normalized embedding that represents how much each token should influence the representation of others.




## Demonstration Application (Optional)

In addition to the Jupyter Notebook interface, we implemented a lightweight
demonstration web application to showcase the core functionality of the system
in a more user-facing form.

The purpose of this demo app is **not** to replace the notebook, but to:
- Illustrate how the same retrieval and synthesis pipeline can be deployed in
  an interactive application.
- Allow users to upload PDFs and ask questions through a simple web interface.
- Demonstrate that the system design is modular and transferable beyond
  exploratory notebooks.

The demo application uses the same underlying steps as the notebook:
1. PDF loading and text extraction
2. Document chunking
3. Semantic retrieval using TF-IDF and cosine similarity
4. Context-grounded answer generation

This additional interface is provided to improve understanding of the system‚Äôs
practical applicability and is intended purely as a **demonstration artifact**.
